40x Faster Binary Search

Ragnar Groot Koerkamp
- Did IMO & ICPC; currently head-of-jury for NWERC.
- Some time at Google.
- Quit and solved all of projecteuler.net (700+) during Covid.
- Just finished a PhD on high-throughput bioinformatics @ ETH Zurich.
- Lots of sequenced DNA that needs processing.
- Many static datasets, e.g. a 3GB human genome.
- Revisiting basic algorithms and optimizing them to the limit.
- Good in theory \(\neq\) fast in practice.
Problem Statement
- Input: a static sorted list of 32-bit integers.
- Queries: given \(q\), find the smallest value in the list \(\geq q\).
- Why? E.g. searching through a suffix array.
- Previous work on Algorithmica: en.algorithmica.org/hpc/data-structures/s-tree
- This work: curiouscoding.nl/posts/static-search-tree
Binary Search: complexity \(O(\lg n)\)
- Compare \(q=5\) with the middle element \(x\).
- Recurse on left half if \(q\leq x\), right half if \(q>x\).
- End when 1 element is left, after \(\lceil\log_2(n+1)\rceil\) steps.
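The steps above can be sketched as a classic branchy lower-bound search; this is a minimal illustration, not the talk's exact code, and the name `lower_bound` is mine:

```rust
/// Index of the first element >= q, or xs.len() if none (branchy baseline).
fn lower_bound(xs: &[u32], q: u32) -> usize {
    let (mut lo, mut hi) = (0, xs.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if xs[mid] < q {
            lo = mid + 1; // answer is strictly to the right of mid
        } else {
            hi = mid; // xs[mid] >= q, so mid is still a candidate
        }
    }
    lo
}

fn main() {
    let xs = [1, 3, 5, 7, 9];
    println!("{}", lower_bound(&xs, 6)); // first element >= 6 is 7, at index 3
}
```

Each iteration halves the range, but the `if` is an unpredictable branch and each probe can be a cache miss, which is exactly what the rest of the talk attacks.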
Binary Search: latency is more than \(O(\lg n)\)
Array Indexing: \(O(n^{0.35})\) latency!
Heap Layout: efficient caching + prefetching
- Binary search: top of tree is spread out; each cache line has 1 value.
- Eytzinger layout: top layers of tree are clustered in cache lines.
- Also allows prefetching!
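A hedged sketch of lower-bound search in the Eytzinger layout, following the Algorithmica formulation the talk builds on: the array is 1-indexed and node `i` has children `2i` and `2i+1`, so the top levels of the tree sit next to each other in memory.

```rust
/// Lower bound in an Eytzinger-layout array `t` (t[0] is unused padding).
/// Returns the Eytzinger index of the first element >= q, or 0 if none.
fn eytzinger_lower_bound(t: &[u32], q: u32) -> usize {
    let mut i = 1;
    while i < t.len() {
        // Branchless descent: left child 2i if t[i] >= q, right child 2i+1 otherwise.
        i = 2 * i + usize::from(t[i] < q);
    }
    // The answer is the node where we last went left: undo the trailing
    // right-turns (trailing 1-bits) plus that final left-turn.
    i >> (i.trailing_ones() + 1)
}

fn main() {
    // Eytzinger layout of [1..=7]: root 4, then 2 and 6, then the leaves.
    let t = [0, 4, 2, 6, 1, 3, 5, 7];
    println!("{}", t[eytzinger_lower_bound(&t, 3)]); // prints 3
}
```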
Heap Layout: close to array indexing!
Static Search Trees / B-trees
- Fully use each cache line.
Static Search Trees: Slower than Eytzinger?!
Up next: assembly-level optimizations :)

Optimizing find: linear scan baseline
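A sketch of what the baseline within-node `find` might look like, assuming a 16-key node (16 u32s fill a 64-byte cache line); the name `find_linear` is illustrative:

```rust
/// Count how many keys in the node are < q, i.e. the index of the first
/// key >= q. Plain loop with an early exit: simple, but branchy.
fn find_linear(node: &[u32; 16], q: u32) -> usize {
    for (i, &k) in node.iter().enumerate() {
        if k >= q {
            return i;
        }
    }
    16
}

fn main() {
    let node: [u32; 16] = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30];
    println!("{}", find_linear(&node, 5)); // 3 keys are < 5, so index 3
}
```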
Optimizing find: auto-vectorization
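Dropping the early exit makes the loop branchless: compare `q` against every key and sum the 0/1 results. With optimizations on, LLVM can turn a shape like this into SIMD compares; again an illustrative sketch, not the talk's exact code:

```rust
/// Branchless count of keys < q. No early exit and no data-dependent
/// branches, so the compiler is free to auto-vectorize the loop.
fn find_autovec(node: &[u32; 16], q: u32) -> usize {
    node.iter().map(|&k| usize::from(k < q)).sum()
}

fn main() {
    let node: [u32; 16] = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30];
    println!("{}", find_autovec(&node, 5)); // 3
}
```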
Optimizing find: popcount
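The same count can be expressed as a bitmask: pack one bit per "key < q" comparison and count the set bits with a single popcount instruction. A hedged sketch:

```rust
/// Build a 16-bit comparison mask (bit i set iff node[i] < q), then
/// count set bits with one popcnt instruction (count_ones).
fn find_popcount(node: &[u32; 16], q: u32) -> usize {
    let mut mask: u32 = 0;
    for (i, &k) in node.iter().enumerate() {
        mask |= u32::from(k < q) << i;
    }
    mask.count_ones() as usize
}

fn main() {
    let node: [u32; 16] = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30];
    println!("{}", find_popcount(&node, 5)); // 3
}
```

Because the node is sorted, the mask is always a run of low 1-bits, so popcount recovers the index of the first key >= q.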
Optimizing find: manual movemask_epi8
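A hedged sketch of the explicit-SIMD variant using SSE2 `movemask_epi8`: compare `q` against all 16 keys four lanes at a time, extract one bit per byte of the compare masks, and popcount. It assumes x86_64 and that keys fit in `i32` range so the signed compare is valid; lane widths and layout are illustrative.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Index of the first key >= q via _mm_cmpgt_epi32 + _mm_movemask_epi8.
#[cfg(target_arch = "x86_64")]
fn find_movemask(node: &[u32; 16], q: u32) -> usize {
    unsafe {
        let qv = _mm_set1_epi32(q as i32);
        let mut mask: u64 = 0;
        for j in 0..4 {
            // Load 4 keys; each lane becomes 0xFFFFFFFF where key < q.
            let keys = _mm_loadu_si128(node.as_ptr().add(4 * j) as *const __m128i);
            let lt = _mm_cmpgt_epi32(qv, keys);
            // movemask_epi8 packs the top bit of each byte: 4 bits per u32 lane.
            mask |= (_mm_movemask_epi8(lt) as u64) << (16 * j);
        }
        // Each key < q contributes 4 set bits.
        (mask.count_ones() / 4) as usize
    }
}

#[cfg(target_arch = "x86_64")]
fn main() {
    let node: [u32; 16] = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30];
    println!("{}", find_movemask(&node, 5)); // 3
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```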
The results: branchless is great!
Throughput, not latency

Batching: Many queries in parallel
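The idea: instead of finishing one query before starting the next, advance a batch of B independent queries one tree level at a time, so their memory loads can all be in flight at once. A hedged sketch, shown on the Eytzinger search for brevity (the talk applies it to the static tree); it assumes a complete tree, i.e. `t.len()` is a power of two:

```rust
/// Run B lower-bound queries in lockstep over an Eytzinger-layout array.
/// The B loads at each level are independent, so the memory system can
/// overlap their latencies.
fn batched_lower_bound<const B: usize>(t: &[u32], qs: &[u32; B]) -> [usize; B] {
    let mut idx = [1usize; B];
    // All queries descend the same number of levels in a complete tree.
    let levels = usize::BITS - t.len().leading_zeros() - 1;
    for _ in 0..levels {
        for j in 0..B {
            idx[j] = 2 * idx[j] + usize::from(t[idx[j]] < qs[j]);
        }
    }
    // Same finalization as the single-query version.
    idx.map(|i| i >> (i.trailing_ones() + 1))
}

fn main() {
    let t = [0, 4, 2, 6, 1, 3, 5, 7]; // Eytzinger layout of [1..=7]
    println!("{:?}", batched_lower_bound(&t, &[3, 4, 8])); // [5, 1, 0]
}
```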
Batching: up to 2.5x faster!
Prefetching
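A hedged sketch of prefetching in the Eytzinger search (the Algorithmica trick): four levels below node `i`, its 16 possible descendants `16*i .. 16*i+15` are contiguous, so one prefetch pulls in the whole cache line needed four steps from now. Details here (hint level, lookahead distance) are illustrative.

```rust
/// Hint the CPU to load the cache line at `p` (no-op off x86_64).
fn prefetch(p: *const u32) {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        _mm_prefetch::<_MM_HINT_T0>(p as *const i8);
    }
    #[cfg(not(target_arch = "x86_64"))]
    let _ = p;
}

/// Eytzinger lower bound that prefetches 4 levels ahead while descending.
fn lower_bound_prefetch(t: &[u32], q: u32) -> usize {
    let mut i = 1;
    while i < t.len() {
        if 16 * i < t.len() {
            // All 4-levels-deeper descendants of i share this cache line.
            prefetch(&t[16 * i]);
        }
        i = 2 * i + usize::from(t[i] < q);
    }
    i >> (i.trailing_ones() + 1)
}

fn main() {
    let t = [0, 4, 2, 6, 1, 3, 5, 7]; // too small for the prefetch to fire,
    println!("{}", lower_bound_prefetch(&t, 3)); // but the logic is identical
}
```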
Prefetching: 30% faster again!
Optimizing pointer arithmetic: more gains
- Keep all offsets and pointers in byte units, so the hot loop avoids index-to-byte conversions.
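The idea in miniature: if the cursor is already a byte offset into the backing buffer, each step avoids scaling an element index by 4. The helper below is hypothetical, purely to illustrate the addressing change:

```rust
/// Read a u32 directly at a byte offset (little-endian assumed), instead
/// of indexing by element and letting the compiler scale by 4 each time.
fn read_u32_at(bytes: &[u8], byte_off: usize) -> u32 {
    u32::from_le_bytes(bytes[byte_off..byte_off + 4].try_into().unwrap())
}

fn main() {
    let data: [u32; 4] = [10, 20, 30, 40];
    // View the u32 buffer as raw bytes.
    let bytes: Vec<u8> = data.iter().flat_map(|x| x.to_le_bytes()).collect();
    println!("{}", read_u32_at(&bytes, 8)); // element 2, i.e. 30
}
```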
Interleaving: more pressure on the RAM, -20%
- Interleave multiple batches, each at a different iteration of the search.
Tree layout: internal nodes store minima
- For query 5.5, we walk down the left subtree.
- Returning 6 reads a new cache line.
Tree layout: internal nodes store maxima
- For query 5.5, we walk down the middle subtree.
- Returning 6 reads the same cache line.
Tree layout: another 10% gained!
Conclusion
With 4GB input: 40x speedup!
- binary search: 1150ns
- static search tree: 27ns
The latency of the Eytzinger-layout search is about 3x the RAM latency.
The throughput of the static search tree is about a third of the RAM throughput.
RAM is slow, caches are fast.
Grinding through assembly works; be your own compiler!
Theoretical big-O complexities are useless here, since caches invalidate the RAM-model assumptions.
P99 CONF
