Bryan O’Sullivan
My background is a little bit all over the place:
Distributed systems
High performance computing
Linux kernel hacking
Compilers and language runtimes
Call it a combination of outlook and personality quirk.
I like to develop my software quickly.
But I also like that software to be fast.
I often work on software where speed really matters.
Interactive tools
Components that other people depend on
Back in 2005, I started working on a DVCS: Mercurial
Although written in Python, it had to be fast
The log command displays history in reverse.
Two common use cases:
I just want to look at the last few changes
I’m searching through history for something
Low latency:
I want (parts of) an answer quickly
“I just want to look at the last few changes”
High throughput:
I want as much data per unit time as possible
“I’m searching through history for something”
Problem?
The original version of hg log:
Walked history backwards
One rev at a time
How did this behave?
Displayed the first few revs quickly
Was incredibly slow at retrieving all revs
Filesystems optimize for one access pattern: reading forwards, sequentially
Seeking backwards a little at a time? About the slowest thing we can ask for
Force the poor user to tell us what they want!
hg log --limit 5
Problem:
Very common usage:
hg log | less
These characteristics forced users to plan ahead for how to get the best performance
For “I just need a few recent revs”: remember to pass --limit
For “I need to bulk search”: run plain hg log
Yuck!
Instead of reading one rev at a time, backwards:
Use a windowed approach!
Jump back n revs
Read a window of n revs forwards
Present these in reverse order
Repeat as necessary
Reduced the frequency of backwards seeks by a factor of n
Bulk throughput got better
Latency for retrieving the first few revs got worse
Not good enough!
Start with a tiny window
Repeatedly double the window size as more revs are requested
Instant response for small requests
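A minimal sketch of the windowed, doubling read (hypothetical helper; Mercurial's real revlog code differs):

def read_backwards(revs, window=8):
    # Yield revisions newest-first while reading the underlying
    # storage in forward runs.  'revs' stands in for any sequence
    # with cheap forward slicing; this is not Mercurial's API.
    end = len(revs)
    while end > 0:
        start = max(0, end - window)
        chunk = revs[start:end]       # one forward read per window
        for rev in reversed(chunk):   # present in reverse order
            yield rev
        end = start
        window *= 2                   # double the window each round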
What about bulk throughput?
The windows quickly grow large, so it stays good. Hooray!
Much (but not all) of the time, my hg log story would play out the same way in any programming language
I spend way too much of my time paying attention to performance issues!
True at my current job (Facebook), and at my previous job (a company I founded)
CPython has so-so performance analysis support
Its cProfile module measures every function call
Even worse: all that bookkeeping slows the program and skews the results
The statprof profiler captures a stack trace periodically
Very low overhead
Tells us about hot spots inside functions
Not accurate on short lived programs
Call-outs to C code delay sample collection
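For reference, a minimal statprof session looks roughly like this (run_expensive_work is a hypothetical stand-in for your workload):

import statprof

statprof.start()              # begin periodic stack sampling
try:
    run_expensive_work()      # hypothetical code being profiled
finally:
    statprof.stop()
statprof.display()            # print hot spots, grouped by line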
“Poor man’s profiling”
Attach gdb to a Python process
Interrupt and capture a stack periodically
Modern gdb can decode both C and Python stacks
Pain in the ass, high overhead, but greatest insight
If performance is important to you, build support into your app
Mercurial makes measuring really easy:
Run with --time to measure wall clock execution time
Use --profile to get a cProfile breakdown
Set HGPROF=stat in the environment before running with --profile to get a statprof profile instead
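Putting those together on the command line (flags exactly as described above):

hg --time log
hg --profile log
HGPROF=stat hg --profile log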
Before we pull out the big guns of “rewrite it in C!” …
How can we improve the performance of code under CPython?
In CPython, most name lookups are expensive
For example, even in a process that is mostly network- and IO-bound, this transformation makes for a 3% speedup:
 def sendstream(self, source):
+    write = self.fout.write
     for chunk in source.gen:
-        self.fout.write(chunk)
+        write(chunk)
     self.fout.flush()
What’s this even doing?
 def sendstream(self, source):
+    write = self.fout.write
     for chunk in source.gen:
-        self.fout.write(chunk)
+        write(chunk)
     self.fout.flush()
Look up the name self.fout.write just once
Hoist this expensive lookup out of the loop
Replace it with a local variable lookup
Even shortish chains of dots are expensive!
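A quick way to see the effect yourself is a timeit micro-benchmark (a hypothetical setup, not Mercurial code; exact numbers vary by machine):

import timeit

setup = """
class A(object):
    pass
a = A()
a.b = A()
a.b.c = len
"""

# Dotted lookup repeated inside the loop ...
dotted = timeit.timeit("for _ in range(1000): a.b.c('')",
                       setup=setup, number=1000)
# ... versus hoisting the lookup into a local first.
hoisted = timeit.timeit("f = a.b.c\nfor _ in range(1000): f('')",
                        setup=setup, number=1000)
print(dotted, hoisted)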
In fact, if you really need top performance out of CPython: avoid nested functions and deep class hierarchies
Why are nested functions expensive?
def foo():
    def bar():
        wat()
When resolving a name, a normal function: looks in its locals, then module globals, then builtins
A nested function: must also search the scope of every enclosing function
Why are classes with deep hierarchies expensive?
They build up long chains of dicts to search
If you use it, the super builtin is costly (!)
Here, we reduce the number of string concatenations in a heavily used method:
 def __init__(self, path, openertype, encode):
     self.encode = encode
     self.path = path + '/store'
+    self.pathsep = self.path + '/'

 def join(self, f):
-    return self.path + '/' + self.encode(f)
+    return self.pathsep + self.encode(f)
A lovely pattern for reading a file in manageable chunks:
for chunk in util.filechunkiter(open(name), limit=size):
    yield chunk
This code is 10% faster by avoiding the filechunkiter in the common case of small files:

if size <= 65536:
    yield open(name).read(size)
else:
    for chunk in util.filechunkiter(open(name), limit=size):
        yield chunk
In Python, almost every abstraction you use to make your code more readable will hurt its performance.
Function calls
Classes of any sort
Generators
etc, etc
Ultimately, we get the best performance out of Python by relying on C
Find ways to make use of this
Cheap: transform a string in one pass, then split
Costly: split string, then transform parts
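For instance (a hypothetical example: the work happens in C in the first version, and in a Python-level loop in the second):

text = "  Some Example Text  "

# Cheap: one pass over the whole string in C, then one split in C.
words = text.lower().split()

# Costly: split first, then call lower() once per part from Python.
words = [part.lower() for part in text.split()]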
Dropping down into C is hard
Manual reference counting is painful
The usual C risks: bugs turn into corruption or crashes
5x longer development cycles, code much harder to review
Alternatives (Cython, boost::python) are heavyweight and meh
The very first thing people find in Haskell when they care about performance:
Uh oh.
The built-in String datatype in Haskell is really just a singly linked list
Why?
Lists are famously easy to reason about
A list is either:
Empty
A single item, followed by the rest of the list
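In code: the first line below is the actual Prelude definition; the list declaration shows the shape of the built-in type, written out by hand:

-- String really is a synonym for a list of characters:
type String = [Char]

-- A list is either empty, or an item consed onto the rest:
data List a = Nil | Cons a (List a)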
Our problem is pointer chasing:
Every link in a list is allocated separately
Each value pointed to by a link is allocated separately
There’s a ton of costly overhead associated with lists
The poor performance of strings was recognised early on
I developed a PackedString type in 1993
A packed array of bytes, nice and compact
The String type lives on, as a great teaching tool
The PackedString type got a huge overhaul in the mid-2000s
The new ByteString type: still a packed array of bytes
Great for binary data, but …
… useless for modern text processing
The modern Haskell type for working with Unicode text is named Text
Uses some of Haskell’s key features:
Data is immutable
Functions never change behaviour
Also has a really nice API
In Haskell, we often manipulate text using right-to-left function pipelines:
length . toUpper . dropWhile isSpace
Naively, this has quite a cost
length . toUpper . dropWhile isSpace
Most stages of our pipeline create new Text strings:
dropWhile isSpace
toUpper
Yuck!
GHC is a powerful compiler
Its job is made easier by those key features:
Data is immutable
Functions never change behaviour
There’s a problem:
In Haskell, instead of loops, we write recursive functions
The building blocks of our pipeline are recursive functions
A compiler cannot safely inline recursive functions!
Very few options for improvement here
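For example, a textbook list function (not the library's code) is recursive, so GHC has to leave the call in place:

-- GHC cannot inline a function into its own body.
myLength :: [a] -> Int
myLength []     = 0
myLength (_:xs) = 1 + myLength xs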
Instead of processing text directly, let’s process streams
A stream is:
A generic piece of state
A state transformer (“step”) function
A state transformer:
Consumes a state
Returns a new state and a value
Importantly:
State transformers are not recursive
Candidates for inlining
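A minimal sketch of these types, simplified from the internals of the text library (the real version also carries a size hint):

{-# LANGUAGE ExistentialQuantification #-}

-- One step: finished, or skip this state, or yield a value
-- along with the next state.
data Step s a = Done
              | Skip s
              | Yield a s

-- A stream pairs a non-recursive step function with an initial
-- state; the state type is hidden (existential).
data Stream a = forall s. Stream (s -> Step s a) s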
Our original pipeline:
length . toUpper . dropWhile isSpace
Here’s how each function is implemented in the text library:

dropWhile isSpace:  Stream.toText . Stream.dropWhile isSpace . Stream.fromText
toUpper:            Stream.toText . Stream.toUpper . Stream.fromText
length:             Stream.length . Stream.fromText
Our original pipeline:
length . toUpper . dropWhile isSpace
Our pipeline with a tiny bit of inlining:
Stream.length .
Stream.fromText . Stream.toText {- <<-- -} .
Stream.toUpper .
Stream.fromText . Stream.toText {- <<-- -} .
Stream.dropWhile isSpace .
Stream.fromText
I’ve highlighted a few “do-nothing” transforms above. Why?
Making use of these language features:
Data is immutable
Functions never change behaviour
GHC exposes a rule-driven rewrite system to programmers:
{-# RULES "drop fromText/toText" forall t.
      Stream.fromText (Stream.toText t) = t
  #-}
“If you see the code on the left, replace it with the code on the right.”
Our rewrite rule causes GHC to transform our pipeline into something simpler:
Stream.length .
Stream.toUpper .
Stream.dropWhile isSpace .
Stream.fromText
What’s important about this?
These functions are all written as state transformers
None of them is recursive
They can all be inlined
This approach is called stream fusion.
Starting from an expensive-looking pipeline of functions:
GHC will fuse our code into a single loop
That loop will allocate no memory
Runs about as fast as hand-written C
Built into the text and vector libraries
The only “special” tricks required of users?
Compile with -O2
For an extra boost, use the LLVM back end (-fllvm)
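A typical invocation (module name hypothetical):

ghc -O2 -fllvm MyProgram.hs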
Easy, right?
Haskell’s powerful type system
Lets GHC (and us!) tell when code can be aggressively transformed
Segregates impure, unsafe code that deals with the outside world
Immutable data
Predictable functions
Need some speed?
Python:
Use fewer abstractions
Take more risks in dangerous languages
Haskell:
Choose smart abstractions
Make the compiler do the hard work