tl;dr: I make a petty point about premature optimization; don't go out and rewrite your switch statements as binary searches by hand; maybe do rewrite your jump tables as switch statements, though.
A couple of years ago I got into an argument in a job interview. The question was how I would implement dispatch for a protocol handler efficiently, and my answer was that I would write the most obvious code possible, probably with a switch statement, and see what my compiler produced before making any tricky implementation choices. (Of course, my answer should have been "write the obvious thing and then measure it", but I have always had a thing for inspecting compiler output.)
The interviewer indicated that this wasn't the answer they wanted to hear, and kept prodding me until I realized the question they thought they were asking was "how do you implement a jump table in C". At the time, I grumbled a bit that a jump table wasn't necessarily the best approach, especially if the distribution of dispatch cases is uneven, but I didn't fight too much. This stuck with me, though, for a few reasons:
The more I thought about it, the more I wondered how big these effects might be. I started to work on a simple experiment to evaluate the performance of table-based dispatch versus comparison-based dispatch, and reviewed the literature. (Spoiler: if you're looking for a rigorous experimental evaluation of these effects, it's not in this article. Read Roger Sayle's A Superoptimizer Analysis of Multiway Branch Code Generation for that.)
I got a push to finish this when I attended ILC2014, where Robert Strandh presented a paper on improving generic dispatch in CLOS which relied on the idea that table-based dispatch methods pay a penalty on modern hardware due to the additional, non-sequential memory accesses.
However, I still didn't do this, and the code sat for a long time.
But it came up again, I had some more literature references, and everyone loves a juicy interview story, so let's uncharitably phrase the interviewer's hypothesis as "jump tables are always faster" and show that this isn't so.
What do I mean by jump tables and so on? Imagine we have code like this:
switch (packet[9]) {
case PROTO_TCP:  return handle_tcp(packet);
case PROTO_UDP:  return handle_udp(packet);
case PROTO_ICMP: return handle_icmp(packet);
...
}
Let's abstract that a little. Our experiment will actually generate C code like this:
void dispatch(unsigned state)
{
    switch (state) {
    case 0: fn_0(); return;
    case 1: fn_1(); return;
    case 2: fn_2(); return;
    case 3: fn_3(); return;
    default: abort();
    }
}
The code the interviewer wanted me to transform that into by hand is approximately this:
static void (*vtable[4])(void) = { fn_0, fn_1, fn_2, fn_3 };

void dispatch(unsigned state)
{
    if (state >= 4) abort();
    (*vtable[state])();
}
That's a jump table. We'll write it in x86-64 assembly like this:
        .text
dispatch:
        cmp $4, %edi
        jae 1f
        jmp *vtable(, %edi, 8)
1:      call abort

        .data
        .align 16
vtable: .quad fn_0, fn_1, fn_2, fn_3
How else could the compiler compile that switch
statement? A really
simple (but not always ineffective) way is linear search:
dispatch:
        cmp $0, %edi
        je fn_0
        cmp $1, %edi
        je fn_1
        cmp $2, %edi
        je fn_2
        cmp $3, %edi
        je fn_3
        call abort
But if we have a lot of cases, or they're widely spread out, we'd probably want to at least use binary search:
dispatch:
        cmp $4, %edi
        jae 1f                  # out-of-range states still trap
        cmp $2, %edi
        jae .L1
.L0:    cmp $0, %edi
        je fn_0
        jmp fn_1
.L1:    je fn_2                 # flags are still set from cmp $2
        jmp fn_3
1:      call abort
Those are our basic options, although when we look at the literature, we'll see there are a range of other choices.
What is this good for in general? We find the pattern of "dispatch to
a handler", often implemented with switch
or pattern matching, all
over the place: in finite state machines (check out this FSA-based
packet filter), generic dispatch, protocol handlers, simple
interpreters, and virtual machines. (Both Linux and FreeBSD use
table-based dispatch instead of switch for handling IP packets,
although I'd argue this is more about flexibility than speed.)
For example, the canonical implementations of Tcl (generic/tclExecute.c:2417) and Lua (lvm.c:793), as well as the most portable forms of Python's and Ruby's bytecode interpreters, use switch-based dispatch.
Threaded code is more popular for interpreters these days; see Ertl and Gregg, The Structure and Performance of Efficient Interpreters, as well as this interesting comment about the efficiency of threaded code versus what the compiler generates for switch in CPython's Python/ceval.c:825.
Threaded code is basically where the end of each handler dispatches to the next one; this makes sense in a VM where you have the whole program, but not in a protocol handler where you probably don't have the next packet.
Threaded code should have better branch prediction behavior than a jump table with a single dispatch point (for a nice analysis of this, see Eli Bendersky's Computed goto for efficient dispatch tables), although indirect branch prediction should help even the field. (Update: See Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore; hardware has already caught up.)
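(To make "threaded code" concrete, here's a minimal sketch using GCC's computed-goto extension; this is a toy example I've invented for illustration, not code from any of the interpreters above. Note that each handler ends with its own indirect jump, rather than looping back to a single central switch.)

#include <stdio.h>

/* A toy bytecode interpreter using threaded dispatch (GNU C computed goto).
 * Each handler fetches the next opcode and jumps directly to its handler,
 * so there is one indirect branch per handler instead of one shared one. */
static int run(const unsigned char *code)
{
    static void *handlers[] = { &&op_halt, &&op_inc, &&op_dec };
    int acc = 0;

#define DISPATCH() goto *handlers[*code++]

    DISPATCH();
op_inc:
    acc++;
    DISPATCH();
op_dec:
    acc--;
    DISPATCH();
op_halt:
#undef DISPATCH
    return acc;
}

int main(void)
{
    const unsigned char program[] = { 1, 1, 1, 2, 0 };  /* inc, inc, inc, dec, halt */
    printf("%d\n", run(program));                       /* prints 2 */
    return 0;
}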
But we're getting ahead of ourselves. Can we find any support in the literature for our argument? (Ok, we could look at the GCC source, but I promise, the literature is a better place to start in this case.)
Looking around, we quickly trace back to Arthur Sale's The Implementation of Case Statements in Pascal (1981), which is interesting just for the details on how simplistic many compilers of the time were, often because they had to emit code immediately.
Sale points out that most Pascal compilers at the time would produce simple jump tables from case statements, and proposes binary search instead.
Robert Bernstein (Producing good code for the case statement (1985); paywalled, sorry) goes into some gritty details; he points out that binary search usually takes one more comparison than linear search per case, and asserts that binary search is faster than linear search when there are at least 4 case items. We'll revisit that further on.
Bernstein talks about some of the latitude we have in optimizing these search trees; for example, that paths which lead to traps can have the most instructions, since they're not likely to be repeatedly executed, and we can decide whether to put the jump-above / jump-equals first.
How do we decide some of these things? Bernstein says:
Faster executing code can be produced if the probabilities that the case selector takes on the case item values are known. These may be known as a result of trace information that is automatically supplied to the compiler, or perhaps as an extra-lingual mechanism pertaining to the case statement. In step 5, the linear search can be arranged in decreasing probabilities, and a Huffman search rather than a binary search can be used.
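For example (a sketch only, with made-up probabilities and the protocol numbers from our hypothetical packet handler), arranging a linear search in decreasing order of probability looks like this:

#include <stdlib.h>

enum { PROTO_ICMP = 1, PROTO_TCP = 6, PROTO_UDP = 17 };

void handle_tcp(void);
void handle_udp(void);
void handle_icmp(void);

/* Linear search ordered by assumed decreasing probability: the common case
 * comes first, and the trap path is pushed to the very end. */
void dispatch(unsigned proto)
{
    if (proto == PROTO_TCP)  { handle_tcp();  return; }   /* say, ~70% of packets */
    if (proto == PROTO_UDP)  { handle_udp();  return; }   /* ~25% */
    if (proto == PROTO_ICMP) { handle_icmp(); return; }   /* rare */
    abort();                                              /* rarest of all: the trap */
}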
How can you supply this information to the compiler? I found that Dan (djb) Bernstein's The Death of Optimizing Compilers made concrete everything I had been thinking for a while about the need for a dialogue between the programmer and the compiler.
Right now, we can supply a bit of information ahead of time; for example, in GCC, we can use __builtin_expect (which you might be familiar with from the Linux kernel's likely/unlikely macros). LLVM has branch weight metadata, although it looks like clang's interface to this is only the same limited __builtin_expect interface as GCC.
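For instance, here's a rough sketch of annotating our toy dispatcher with __builtin_expect, using likely/unlikely macros in the Linux kernel style; whether the compiler actually lays out the branches differently because of these hints is, as always, up to the compiler:

#include <stdlib.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

void fn_0(void);
void fn_other(unsigned state);  /* stand-in for the remaining cases */

void dispatch(unsigned state)
{
    /* Hint that out-of-range states are rare and state 0 is the hot case,
     * so the compiler can keep the hot path straight-line. */
    if (unlikely(state >= 4))
        abort();
    if (likely(state == 0)) {
        fn_0();
        return;
    }
    fn_other(state);
}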
We can do much better with profile-guided optimization (PGO) / feedback-driven optimization (FDO); see AutoFDO in GCC, -fprofile-arcs and -fbranch-probabilities in GCC, et cetera, but these (like the benefits of JIT compilers) require us to run the program with representative workloads. This might be better for some cases, but it is a bit unsatisfying if we want to communicate in the source code (to both reader and compiler) something we know ahead of time about the distribution of cases.
(Outside the scope of this article, but interesting: branch prediction tables are power hungry; we could want to optimize a specific switch for power instead of performance. This is another application of this kind of dialogue with the compiler.)
Through the 90s, there's a trickle of papers: Kannan and Proebsting issue an important practical correction to Bernstein; H.G. Dietz's Coding Multiway Branches Using Customized Hash Functions (1992) proposes a hashing approach to the problem; Erlingsson et al.'s Efficient Multiway Radix Search Trees (1996) presents a better approach for sparse sets.
Already it's becoming clear that switch generation isn't cut and dried, and then through the 2000s Roger Sayle publishes several papers, culminating in A Superoptimizer Analysis of Multiway Branch Code Generation, which almost saves us from having to run any experiment at all. It contains citations to lots of related work (much of it by Sayle himself), but more importantly, a great summary of techniques, and benchmarks that pretty much prove our point for us already.
Sayle demonstrates that GCC can produce a wide range of code for switch, suited to many different situations, and even suggests that the compiler should detect attempts to implement switch by hand (usually done for performance reasons that were only ever relevant on legacy systems) and undo them.1
However, as I am a big believer in rhetorical benchmarks, we might as well construct a small experiment that definitively proves our (admittedly ungracious and cherry-picked) point.
I wrote some code for this a few years ago, and then my ambitions for what should be tested grew beyond measure, leading to nothing getting done. So, in restoring this, I decided to explicitly allow myself to release some terrible code full of measurement errors: after all, this is a rhetorical benchmark (and what benchmark isn't?) — it doesn't need to be correct, it only needs to prove our point.2
The important thing for making our point is that we can find branch probabilities that support almost any implementation choice. For example, an IP stack protocol handler might encounter TCP packets 70% of the time, UDP packets 25% of the time, and other stuff 5% of the time. (Note: not an actual measurement.) In this case, if we can bias our dispatch code towards TCP and UDP packets, we're likely to get a win.
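(To illustrate what a workload skewed towards one hot case might look like as code, here's a sketch of a biased state generator. This is illustrative only, not the actual harness code, which lives in the repository linked below, and rand() would be a poor choice for a real measurement.)

#include <stdio.h>
#include <stdlib.h>

/* Return the hot state (0) with probability `bias`; otherwise pick one of
 * the n_entries states uniformly at random.  Illustration only. */
static unsigned next_state(unsigned n_entries, double bias)
{
    if ((double)rand() / RAND_MAX < bias)
        return 0;
    return (unsigned)(rand() % n_entries);
}

int main(void)
{
    unsigned hot = 0;
    for (int i = 0; i < 1000000; i++)
        if (next_state(256, 0.8) == 0)
            hot++;
    printf("%u of 1000000 dispatches hit the hot case\n", hot);
    return 0;
}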
In this case, we'll choose a distribution like this (but even more unfair), where 80% of the time we take the first case, and 20% of the time we take a random case. So, on a machine quite unsuitable for benchmarking, we run:
make run-experiment CALL_DISTRIBUTION=pareto N_DISPATCHES=5000000 N_RUNS=5 N_ENTRIES=256
and cherry-pick some spurious but convincing results:
Performance counter stats for './x86_64-binary 5000000' (5 runs):

     6,883,819,114 cycles          #  2.090 GHz                ( +- 0.43% )
       232,004,486 instructions    #  0.03  insns per cycle    ( +- 0.06% )
        56,828,213 branches        # 17.257 M/sec              ( +- 0.04% )
         1,262,892 branch-misses   #  2.22% of all branches    ( +- 0.05% )

       3.299025345 seconds time elapsed                        ( +- 0.43% )

Performance counter stats for './x86_64-vtable 5000000' (5 runs):

     7,709,225,443 cycles          #  2.087 GHz                ( +- 0.95% )
       217,283,422 instructions    #  0.03  insns per cycle    ( +- 0.03% )
        51,631,368 branches        # 13.976 M/sec              ( +- 0.03% )
           957,553 branch-misses   #  1.85% of all branches    ( +- 0.10% )

       3.706410106 seconds time elapsed                        ( +- 1.04% )
Which allows us to claim what came to us in l'esprit de l'escalier after that ill-fated interview: explicitly-constructed jump tables aren't always faster than what the compiler generates.
(The code is at https://github.com/tokenrove/dispatch-comparison, but I wouldn't recommend using it for anything.)
Ideally, we'd want to run a proper experiment on this, that tries all sorts of combinations, with proper cache-flushing3, variable amounts of work in the "handlers", et cetera. Maybe later. I'm explicitly calling this experiment cheating and moving on.
I didn't talk about the effects of these techniques on instruction cache usage, which is actually one of the most interesting factors. Unfortunately there are a lot of different patterns to talk about: when the handlers are large enough to boot the dispatch code out of I-cache, versus tight loops that are just dispatching all the time.
There are also a lot of ways we can tweak search:
Leveraging some of what we've seen so far, could we better exploit branch prediction by combining optimal decision tree dispatch with threaded code? Suppose the end of each VM instruction were rewritten to perform the first \(n\) levels of binary search, so the highest level branches would be better predicted (if instruction dispatch is close to a Markov process, anyway).4 See also Baer, On Conditional Branches in Optimal Search Trees.
But I think I'll avoid exploring this further until the next querulous job interview.5
Tangent: if we see switch as just an anemic form of pattern matching, there's a lot more of interest in the literature, but that's another rabbit hole.
For example, I'm totally ignoring the advice from Kalibera, Jones: Rigorous Benchmarking in Reasonable Time. If luck prevails, I'll write a bit about common easily-made benchmarking errors and the role of rhetorical benchmarks soon.
See Whaley and Castaldo, Achieving accurate and context-sensitive timing for code optimization.
NB: a binary search sorted by weights is precisely a Huffman search.
Many people helped me make this article better. Thanks! I'm not sure how to thank you all. Of course the mistakes remain my own.