Fast Code Nation

Bryan O’Sullivan

Facebook

A little bit about me

My background is a little bit all over the place:

Why performance?

Call it a combination of outlook and personality quirk.

Performance—but when?

I often work on software where speed really matters.

Interactive tools

Components that other people depend on

Interactive tools

Back in 2005, I started working on a DVCS:

Although written in Python:

An example from the early days of Mercurial

The log command displays history in reverse.

Two common use cases:

Latency vs throughput

Low latency:

High throughput:

Problem?

Where we started

The original version of hg log:

How did this behave?

Why?

Filesystems optimize for one access pattern:

Seeking backwards a little at a time?

One “fix”

Force the poor user to tell us what they want!

hg log --limit 5

Problem:

Very common usage:

hg log | less

Perverting the usage pattern

These characteristics forced users to plan ahead for how to get the best performance

For “I just need a few recent revs”:

For “I need to bulk search”:

Yuck!

A step on the road

Instead of reading one rev at a time, backwards:

Use a windowed approach!

How did this fare?

Reduced the frequency of backwards seeks by a factor of n

Not good enough!

A refinement

Start with a tiny window

Repeatedly double the window size as more revs requested
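The doubling-window idea can be sketched in a few lines. This is an illustration, not Mercurial's actual code: `readrevs(start, n)` is a hypothetical helper that reads n revisions forward from `start`, and the initial window size of 8 is made up.

```python
def revs_reversed(readrevs, count):
    """Yield `count` revisions newest-first, fetching them in
    forward-reading windows that double in size."""
    window = 8
    end = count
    while end > 0:
        start = max(0, end - window)
        # one forward read per window, instead of one seek per rev
        for rev in reversed(readrevs(start, end - start)):
            yield rev
        end = start
        window *= 2  # bigger windows amortize seeks for bulk readers
```

A small request only ever touches the first tiny window, so it responds instantly; a bulk consumer like `hg log | less` keeps growing the window until history streams at near-sequential speed.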

Results?

Instant response for small requests

What about bulk throughput?

Hooray!

And my point is?

Much (but not all) of the time:

My hg log story would play out the same way in any programming language

How I spend my time

I spend way too much of my time paying attention to performance issues!

Current job (Facebook):

Previous job (company I founded):

Measure once

CPython has so-so performance analysis support

Its cProfile module measures every function call

Even worse:

Measure twice

The statprof profiler captures a stack trace periodically
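The core idea behind a sampling profiler fits in a few lines of pure Python. This is my own sketch of the technique, not statprof's API, and it assumes a POSIX system (`signal.setitimer` is unavailable on Windows):

```python
import collections
import signal

samples = collections.Counter()

def take_sample(signum, frame):
    # record which function was executing when the timer fired
    samples[frame.f_code.co_name] += 1

def busy():
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

# fire SIGPROF every 5ms of CPU time consumed
signal.signal(signal.SIGPROF, take_sample)
signal.setitimer(signal.ITIMER_PROF, 0.005, 0.005)
result = busy()
signal.setitimer(signal.ITIMER_PROF, 0)  # stop sampling
```

Functions that dominate CPU time dominate the sample counts, and the program pays only a tiny, fixed cost per sample rather than a cost on every function call.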

Measure … thrice?

“Poor man’s profiling”

Pain in the ass, high overhead, but greatest insight

If performance matters …

If performance is important to you, build support into your app

Mercurial makes measuring really easy:

What measurements tell us

Before we pull out the big guns of “rewrite it in C!” …

How can we improve the performance of code under CPython?

What’s in a name?

In CPython, most name lookups are expensive

For example, even in a process that is mostly network- and IO-bound, this transformation makes for a 3% speedup:

     def sendstream(self, source):
+        write = self.fout.write
         for chunk in source.gen:
-            self.fout.write(chunk)
+            write(chunk)
         self.fout.flush()

Wait, what?

What’s this even doing?

     def sendstream(self, source):
+        write = self.fout.write
         for chunk in source.gen:
-            self.fout.write(chunk)
+            write(chunk)
         self.fout.flush()

Even shortish chains of dots are expensive!
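The hoisting trick is easy to reproduce in a toy harness. The `Sender` class and the timing setup below are my own illustration of the same transformation; the timings will vary by machine, so the point is the shape of the code, not a specific number:

```python
import io
import timeit

class Sender:
    def __init__(self):
        self.fout = io.BytesIO()

    def slow(self, chunks):
        for chunk in chunks:
            self.fout.write(chunk)   # two attribute lookups per iteration

    def fast(self, chunks):
        write = self.fout.write      # resolved once, reused as a local
        for chunk in chunks:
            write(chunk)

chunks = [b"x" * 64] * 1_000
s = Sender()
slow_t = timeit.timeit(lambda: s.slow(chunks), number=10)
fast_t = timeit.timeit(lambda: s.fast(chunks), number=10)
```

Local variable access in CPython is a fast array index into the frame; each dot is a dictionary lookup, so hoisting `self.fout.write` removes two lookups from every trip around the loop.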

Scary features

In fact, if you really need top performance out of CPython:

Nested functions

Why are nested functions expensive?

def foo():
    def bar():
        wat()

When resolving a name, a normal function:

A nested function:

Classes with deep hierarchies

Why are classes with deep hierarchies expensive?
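Conceptually, an attribute missing from a subclass sends CPython walking the method resolution order until a base class supplies it (modern CPython caches these lookups, but the principle stands). A toy hierarchy, with a hoisted bound method as the workaround:

```python
class A:
    def greet(self):
        return "hi"

class B(A): pass
class C(B): pass
class D(C): pass

d = D()
# A lookup on a D instance conceptually walks D -> C -> B -> A
# before finding greet.

# In hot loops, hoisting the bound method skips the repeated lookup:
greet = d.greet
```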

Precompute stuff where possible

Here, we reduce the number of string concatenations in a heavily used method:

     def __init__(self, path, openertype, encode):
         self.encode = encode
         self.path = path + '/store'
+        self.pathsep = self.path + '/'

     def join(self, f):
-        return self.path + '/' + self.encode(f)
+        return self.pathsep + self.encode(f)

Generators are good!

A lovely pattern for reading a file in manageable chunks:

for chunk in util.filechunkiter(open(name), limit=size):
    yield chunk

Generators are bad!

This code is 10% faster by avoiding the filechunkiter in the common case of small files:

if size <= 65536:
    yield open(name).read(size)
else:
    for chunk in util.filechunkiter(open(name), limit=size):
        yield chunk

The bottom line

In Python, almost every abstraction you use to make your code more readable will hurt its performance.

So what to do?

Ultimately, we get the best performance out of Python by relying on C

Find ways to make use of this
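The cheapest way to "rely on C" is to push loops into builtins that are already implemented there. A classic example (my own illustration) is string assembly: `str.join` runs its loop inside the interpreter's C code instead of in bytecode:

```python
# Pure-Python loop: every iteration executes interpreted bytecode
def join_slow(parts):
    out = ""
    for p in parts:
        out += p
    return out

# The same work pushed into C: one call, the loop runs inside str.join
def join_fast(parts):
    return "".join(parts)

parts = ["chunk%d" % i for i in range(1000)]
```

The same principle applies to `map`, `sum`, `sorted`, dict and set operations, and the batteries-included modules written in C.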

Even scarier

Dropping down into C is hard

Alternatives (Cython, boost::python) are heavyweight and meh

Performance in Haskell

The very first thing people find in Haskell when they care about performance:

Uh oh.

A little history

The built-in String datatype in Haskell is really just a singly linked list

Why?

Lists are famously easy to reason about

A list is either:

Trouble in paradise

Our problem is pointer chasing:

There’s a ton of costly overhead associated with lists

What did we do about this?

The poor performance of strings was recognised early on

The String type stays on as a great teaching tool

Was that it?

The PackedString type got a huge overhaul in the mid-2000s

Fast forward some more

The modern Haskell type for working with Unicode text is named Text

Uses some of Haskell’s key features:

Also has a really nice API

Programming with pipelines

In Haskell, we often manipulate text using right-to-left function pipelines:

length . toUpper . dropWhile isSpace

Naively, this has quite a cost
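Spelled out against the text library with qualified imports, the pipeline looks like this (a sketch; the function names are the real `Data.Text` API):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Char (isSpace)
import qualified Data.Text as T

-- Read naively, each intermediate stage produces a whole new
-- Text value before the next stage runs.
countTrimmed :: T.Text -> Int
countTrimmed = T.length . T.toUpper . T.dropWhile isSpace

main :: IO ()
main = print (countTrimmed "   hello")  -- 5
```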

Wait, what cost?

length . toUpper . dropWhile isSpace

Most stages of our pipeline create new Text strings:

dropWhile isSpace
toUpper

Yuck!

What is to be done?

GHC is a powerful compiler

Its job is made easier by those key features:

But …

There’s a problem:

Very few options for improvement here

Let’s change the game

Instead of processing text directly, let’s process streams

A stream is:

What’s a state transformer?

A state transformer:

Importantly:
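The representation used in the stream-fusion literature looks like this (a sketch; the text library's internal module names differ slightly):

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- One step of a stream: the state transformer either yields an
-- element with a new state, skips to a new state, or finishes.
data Step s a = Yield a s
              | Skip s
              | Done

-- A stream pairs a non-recursive step function with an initial
-- state. The state type s is hidden (existential), so consumers
-- can only run the machine, never inspect its internals.
data Stream a = forall s. Stream (s -> Step s a) s
```

Because the step function is non-recursive, the compiler can inline it freely — which is exactly what makes the fusion transformation possible.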

Pipeline transformation

Our original pipeline:

length . toUpper . dropWhile isSpace

Here’s how each function is implemented in the text library:

Stream.toText . Stream.dropWhile isSpace . Stream.fromText
Stream.toText . Stream.toUpper . Stream.fromText
Stream.length . Stream.fromText

Glue them together

Our original pipeline:

length . toUpper . dropWhile isSpace

Our pipeline with a tiny bit of inlining:

Stream.length .
Stream.fromText . Stream.toText {- <<-- -} .
Stream.toUpper .
Stream.fromText . Stream.toText {- <<-- -} .
Stream.dropWhile isSpace .
Stream.fromText

I’ve highlighted a few “do-nothing” transforms above. Why?

A nice item in GHC’s toolbox

Making use of these language features:

GHC exposes a rule-driven rewrite system to programmers:

{-# RULES "drop fromText/toText" forall t.
      Stream.fromText (Stream.toText t) = t
  #-}

“If you see the code on the left, replace it with the code on the right.”

Ooh!

Our rewrite rule causes GHC to transform our pipeline into something simpler:

Stream.length .
Stream.toUpper .
Stream.dropWhile isSpace .
Stream.fromText

What’s important about this?

Stream fusion

This approach is called stream fusion.

Starting from an expensive looking pipeline of functions:

How to use stream fusion

Built into the text and vector libraries

The only “special” tricks required of users?

Easy, right?

Why does stream fusion work?

Haskell’s powerful type system

Immutable data

Predictable functions

Compare and contrast

Need some speed?

Python:

Haskell: