40x Faster Binary Search

Ragnar Groot Koerkamp

@curiouscoding.nl

P99 2025

curiouscoding.nl/slides/p99
curiouscoding.nl/posts/static-search-tree

p99-title-slide.png

Ragnar Groot Koerkamp

  • Did IMO & ICPC; currently head-of-jury for NWERC.
  • Some time at Google.
  • Quit and solved all of projecteuler.net (700+) during Covid.
  • Just finished PhD on high throughput bioinformatics @ ETH Zurich.
    • Lots of sequenced DNA that needs processing.
    • Many static datasets, e.g. a 3GB human genome.
    • Revisiting basic algorithms and optimizing them to the limit.
    • Good in theory \(\neq\) fast in practice.

Problem Statement

  • Input: a static sorted list of 32-bit integers.
  • Queries: given \(q\), find the smallest value in the list \(\geq q\).
trait SearchIndex {
    // Initialize the data structure.
    fn new(sorted_data: &[u32]) -> Self;
    // Return the smallest value >=q.
    fn query(&self, q: u32) -> u32;
}
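
A minimal baseline sketch of this trait using std's partition_point (the struct name BinarySearch is illustrative, not the talk's code):

struct BinarySearch { data: Vec<u32> }

impl SearchIndex for BinarySearch {
    fn new(sorted_data: &[u32]) -> Self {
        Self { data: sorted_data.to_vec() }
    }
    fn query(&self, q: u32) -> u32 {
        // Index of the first element >= q; assumes such an element exists.
        let i = self.data.partition_point(|&x| x < q);
        self.data[i]
    }
}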


Binary Search: latency is more than \(O(\lg n)\)

bs-1.svg

Array Indexing: \(O(n^{0.35})\) latency!

bs-2.svg

Heap Layout: efficient caching + prefetching

  • Binary search: the top of the tree is spread out across the array; each fetched cache line is used for only 1 value.
  • Eytzinger layout: the top layers of the tree are packed into a few cache lines, which stay cached.
    • Also allows prefetching! (sketch below)

bs-eytzinger.svg
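
A minimal sketch of an Eytzinger (BFS-order) lower bound with an explicit prefetch; the function name and the 4-levels-ahead prefetch distance are illustrative, not the talk's exact code:

use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

// `e` holds the values in 1-based BFS (Eytzinger) order; e[0] is unused.
fn eytzinger_lower_bound(e: &[u32], q: u32) -> u32 {
    let mut k = 1;
    while k < e.len() {
        // Node k's 16 descendants four levels down share a cache line (given 64-byte alignment).
        unsafe { _mm_prefetch::<_MM_HINT_T0>(e.as_ptr().wrapping_add(16 * k) as *const i8) };
        k = 2 * k + (e[k] < q) as usize;  // branchless descent
    }
    k >>= k.trailing_ones() + 1;          // undo the descents past the answer
    e[k]                                  // k == 0: no element >= q
}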

Heap Layout: close to array indexing!

bs-3.svg

Static Search Trees / B-trees

  • Fully use each cache line.

const N: usize = 16;     // Values per block.
#[repr(align(64))]       // Each block fills exactly one 512-bit cache line.
struct Block { data: [u32; N] }
struct Tree {
    tree: Vec<Block>,    // Blocks.
    offsets: Vec<usize>, // Index of first block in each layer.
}

full.svg
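
Later slides call self.node(...); a trivial sketch of that accessor (assumed here, not shown on the slide):

impl Tree {
    // Block i of the flat tree; the optimized version uses unchecked, byte-offset indexing.
    fn node(&self, i: usize) -> &Block {
        &self.tree[i]
    }
}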

Static Search Trees: Slower than Eytzinger?!

2-find-linear.svg

Up next: assembly-level optimizations :)

p99-section.png

Optimizing find: linear scan baseline

fn find_linear(block: &Block, q: u32) -> usize {
    for i in 0..N {
        if block.data[i] >= q {
            // Early break causes branch mispredictions
            // and prevents auto-vectorization!
            return i;
        }
    }
    return N;
}

Optimizing find: auto-vectorization

fn find_count(block: &Block, q: u32) -> usize {
    let mut count = 0;
    for i in 0..N {
        if block.data[i] < q {
            count += 1;
        }
    }
    count
}
 vmovdqu      (%rax,%rcx), %ymm1     ; load data[..8]
 vmovdqu      32(%rax,%rcx), %ymm2   ; load data[8..]
 vpbroadcastd %xmm0, %ymm0           ; 'splat' the query value
 vpmaxud      %ymm0, %ymm2, %ymm3    ; v
 vpcmpeqd     %ymm3, %ymm2, %ymm2    ; v
 vpmaxud      %ymm0, %ymm1, %ymm0    ; v
 vpcmpeqd     %ymm0, %ymm1, %ymm0    ; 4x compare query with values
 vpackssdw    %ymm2, %ymm0, %ymm0    ;
 vpcmpeqd     %ymm1, %ymm1, %ymm1    ; v
 vpxor        %ymm1, %ymm0, %ymm0    ; 2x negate result
 vextracti128 $1, %ymm0, %xmm1       ; v
 vpacksswb    %xmm1, %xmm0, %xmm0    ; v
 vpshufd      $216, %xmm0, %xmm0     ; v
 vpmovmskb    %xmm0, %ecx            ; 4x extract mask
 popcntl      %ecx, %ecx             

Optimizing find: popcount

#![feature(portable_simd)]
use std::simd::prelude::*;

fn find_popcount(block: &Block, q: i32) -> usize {
    // View the u32 keys as i32 (assumes values fit in i32), so one signed compare suffices.
    let data: Simd<i32, N> = Simd::from_slice(&block.data[0..N]).cast();
    let q = Simd::splat(q);
    let mask = data.simd_lt(q);                // data[i] < q
    mask.to_bitmask().count_ones() as usize    // count_ones
}
vpcmpgtd     (%rsi,%rdi), %ymm0, %ymm1
vpcmpgtd     32(%rsi,%rdi), %ymm0, %ymm0
vpackssdw    %ymm1, %ymm0, %ymm0     ; interleave 16bit low halves
vextracti128 $1, %ymm0, %xmm1        ; 1
vpacksswb    %xmm1, %xmm0, %xmm0     ; 2
vpshufd      $216, %xmm0, %xmm0      ; 3 unshuffle interleaving
vpmovmskb    %xmm0, %edi             ; 4 instructions to extract bitmask
popcntl      %edi, %edi

Optimizing find: manual movemask_epi8

use std::arch::x86_64::{_mm256_movemask_epi8, _mm256_packs_epi32};

fn find_manual(block: &Block, q: i32) -> usize { // i32 now!
    let low:  Simd<i32, 8> = Simd::from_slice(&block.data[0..N / 2]).cast();
    let high: Simd<i32, 8> = Simd::from_slice(&block.data[N / 2..N]).cast();
    let q = Simd::<i32, 8>::splat(q);
    let mask_low  = q.simd_gt(low ).to_int();             // compare with low  half
    let mask_high = q.simd_gt(high).to_int();             // compare with high half
    let merged = unsafe { _mm256_packs_epi32(mask_low.into(), mask_high.into()) }; // interleave
    let mask = unsafe { _mm256_movemask_epi8(merged) };   // movemask_epi8!
    mask.count_ones() as usize / 2                        // correct for double-counting
}
 vpcmpgtd     (%rsi,%rdi), %ymm0, %ymm1     ; 1 cycle, in parallel
 vpcmpgtd     32(%rsi,%rdi), %ymm0, %ymm0   ; 1 cycle, in parallel
 vpackssdw    %ymm0, %ymm1, %ymm0           ; 1 cycle
-vextracti128 $1, %ymm0, %xmm1     ;
-vpacksswb    %xmm1, %xmm0, %xmm0  ;
-vpshufd      $216, %xmm0, %xmm0   ;
-vpmovmskb    %xmm0, %edi          ; 4 instructions emulating movemask_epi16
+vpmovmskb    %ymm0, %edi                   ; 2 cycles        movemask_epi8
 popcntl      %edi, %edi                    ; 1 cycle
+                                  ; /2 is folded into pointer arithmetic

The results: branchless is great!

3-find.svg

Throughput, not latency

p99-section.png

Batching: Many queries in parallel

-fn query                (&self, q :   u32    ) ->  u32     {
+fn batch<const B: usize>(&self, qs: &[u32; B]) -> [u32; B] {
-    let mut k =  0    ;                                    // current index
+    let mut k = [0; B];                                    // current indices
     for [o, _o2] in self.offsets.array_windows() {         // walk down the tree
+        for i in 0..B {
             let jump_to = self.node(o + k[i]).find(qs[i]); // call `find`
             k[i] = k[i] * (N + 1) + jump_to;               // update index
+        }
     }
 
     let o = self.offsets.last().unwrap();
+    from_fn(|i| {
         let idx = self.node(o + k[i]).find(qs[i]);
         self.tree[o + k[i] + idx / N].data[idx % N]        // return value
+    })
 }
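
Usage sketch, assuming a built Tree and a batch size of 16 (the batch size actually benchmarked may differ):

let mut results = Vec::with_capacity(queries.len());
for chunk in queries.chunks_exact(16) {
    let qs: &[u32; 16] = chunk.try_into().unwrap();
    results.extend_from_slice(&tree.batch(qs));
}
// A tail of fewer than 16 queries would be handled separately.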

Batching: up to 2.5x faster!

4-batching.svg

Prefetching

 fn batch<const B: usize>(&self, qs: &[u32; B]) -> [u32; B] {
     let mut k = [0; B];                                    // current indices
     for [o, o2] in self.offsets.array_windows() {          // walk down the tree
         for i in 0..B {
             let jump_to = self.node(o + k[i]).find(qs[i]); // call `find`
             k[i] = k[i] * (N + 1) + jump_to;               // update index
+            prefetch_index(&self.tree, o2 + k[i]);         // prefetch next layer
         }
     }
 
     let o = self.offsets.last().unwrap();
     from_fn(|i| {
         let idx = self.node(o + k[i]).find(qs[i]);
         self.tree[o + k[i] + idx / N].data[idx % N]        // return value
     })
 }
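
prefetch_index is a small helper; a possible sketch on x86-64 (the T0 hint level is an assumption):

use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

// Ask the CPU to start pulling the cache line of tree[index] into cache.
// Prefetches never fault, so a slightly out-of-range index is harmless.
fn prefetch_index(tree: &[Block], index: usize) {
    let ptr = tree.as_ptr().wrapping_add(index) as *const i8;
    unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr) };
}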

Prefetching: 30% faster again!

5-prefetch.svg

Optimizing pointer arithmetic: more gains

  • Store offsets and indices in byte units, so the hot loop needs no index-to-byte conversions (sketch below).
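
A minimal sketch of the idea, assuming Block stays 64 bytes and both o and k[i] are kept in bytes (hypothetical helper, not the talk's exact code):

impl Tree {
    fn node_at_byte(&self, byte_offset: usize) -> &Block {
        debug_assert_eq!(byte_offset % 64, 0);
        // Address = base + byte offset: no index-to-byte shift left in the hot loop.
        unsafe { &*self.tree.as_ptr().byte_add(byte_offset) }
    }
}
// The per-layer update then becomes, all in bytes:
//   k[i] = k[i] * (N + 1) + jump_to * 64;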

6-improvements.svg

Interleaving: more pressure on the RAM, -20%

  • Interleave multiple batches, each at a different layer of the tree, to keep more memory requests in flight.

8-interleave.svg

Tree layout: internal nodes store minima

  • For query 5.5, we walk down the left subtree.
  • Returning 6 reads a new cache line.

full.svg

Tree layout: internal nodes store maxima

  • For query 5.5, we walk down the middle subtree.
  • Returning 6 reads the same cache line.

flipped.svg

Tree layout: another 10% gained!

9-left-max-tree.svg

Conclusion

  • With 4GB input: 40x speedup!
    • binary search: 1150ns
    • static search tree: 27ns
  • Latency of the Eytzinger-layout search is about 3x the latency of a single RAM access.
  • Throughput of the static search tree is within about 3x of the raw RAM throughput.
  • RAM is slow, caches are fast.
  • Grinding through assembly works; be your own compiler!
  • Theoretical big-O complexities are useless here, since caches invalidate the RAM-model assumptions.

P99 CONF

p99-last.png