40x Faster Binary Search

Ragnar Groot Koerkamp

@curiouscoding.nl

P99 2025

curiouscoding.nl/slides/p99
curiouscoding.nl/posts/static-search-tree

p99-title-slide.png

Ragnar Groot Koerkamp

  • Did IMO & ICPC; currently head-of-jury for NWERC.
  • Some time at Google.
  • Quit and solved all of projecteuler.net (700+) during Covid.
  • Just finished PhD on high throughput bioinformatics @ ETH Zurich.
    • Lots of sequenced DNA that needs processing.
    • Many static datasets, e.g. a 3GB human genome.
    • Revisiting basic algorithms and optimizing them to the limit.
    • Good in theory \(\neq\) fast in practice.

Problem Statement

  • Input: a static sorted list of 32-bit integers.
  • Queries: given \(q\), find the smallest value in the list \(\geq q\).
trait SearchIndex {
    // Initialize the data structure.
    fn new(sorted_data: &[u32]) -> Self;
    // Return the smallest value >=q.
    fn query(&self, q: u32) -> u32;
}
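
A minimal baseline sketch of this trait using std's partition_point (the struct name BinarySearch is illustrative, not the talk's code):

struct BinarySearch { data: Vec<u32> }

impl SearchIndex for BinarySearch {
    fn new(sorted_data: &[u32]) -> Self {
        Self { data: sorted_data.to_vec() }
    }
    fn query(&self, q: u32) -> u32 {
        // Index of the first element >= q; assumes such an element exists.
        let i = self.data.partition_point(|&x| x < q);
        self.data[i]
    }
}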


Binary Search: latency is more than \(O(\lg n)\)

bs-1.svg

Array Indexing: \(O(n^{0.35})\) latency!

bs-2.svg

Heap Layout: efficient caching + prefetching

  • Binary search: the top of the tree is spread out across the array; each fetched cache line is used for only 1 value.
  • Eytzinger layout: the top layers of the tree are packed into a few cache lines, which stay cached.
    • Also allows prefetching! (sketch below)

bs-eytzinger.svg
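
A minimal sketch of an Eytzinger (BFS-order) lower bound with an explicit prefetch; the function name and the 4-levels-ahead prefetch distance are illustrative, not the talk's exact code:

use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

// `e` holds the values in 1-based BFS (Eytzinger) order; e[0] is unused.
fn eytzinger_lower_bound(e: &[u32], q: u32) -> u32 {
    let mut k = 1;
    while k < e.len() {
        // Node k's 16 descendants four levels down share a cache line (given 64-byte alignment).
        unsafe { _mm_prefetch::<_MM_HINT_T0>(e.as_ptr().wrapping_add(16 * k) as *const i8) };
        k = 2 * k + (e[k] < q) as usize;  // branchless descent
    }
    k >>= k.trailing_ones() + 1;          // undo the descents past the answer
    e[k]                                  // k == 0: no element >= q
}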

Heap Layout: close to array indexing!

bs-3.svg

Static Search Trees / B-trees

  • Fully use each cache line.

const N: usize = 16;     // Values per block.
#[repr(align(64))]       // Each block fills exactly one 512-bit cache line.
struct Block { data: [u32; N] }
struct Tree {
    tree: Vec<Block>,    // Blocks.
    offsets: Vec<usize>, // Index of first block in each layer.
}

full.svg
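
Later slides call self.node(...); a trivial sketch of that accessor (assumed here, not shown on the slide):

impl Tree {
    // Block i of the flat tree; the optimized version uses unchecked, byte-offset indexing.
    fn node(&self, i: usize) -> &Block {
        &self.tree[i]
    }
}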

Static Search Trees: Slower than Eytzinger?!

2-find-linear.svg

Up next: assembly-level optimizations :)

p99-section.png

Optimizing find: linear scan baseline

fn find_linear(block: &Block, q: u32) -> usize {
    for i in 0..N {
        if block.data[i] >= q {
            // Early break causes branch mispredictions
            // and prevents auto-vectorization!
            return i;
        }
    }
    return N;
}

Optimizing find: auto-vectorization

fn find_count(block: &Block, q: u32) -> usize {
    let mut count = 0;
    for i in 0..N {
        if block.data[i] < q {
            count += 1;
        }
    }
    count
}
 vmovdqu      (%rax,%rcx), %ymm1     ; load data[..8]
 vmovdqu      32(%rax,%rcx), %ymm2   ; load data[8..]
 vpbroadcastd %xmm0, %ymm0           ; 'splat' the query value
 vpmaxud      %ymm0, %ymm2, %ymm3    ; v
 vpcmpeqd     %ymm3, %ymm2, %ymm2    ; v
 vpmaxud      %ymm0, %ymm1, %ymm0    ; v
 vpcmpeqd     %ymm0, %ymm1, %ymm0    ; 4x compare query with values
 vpackssdw    %ymm2, %ymm0, %ymm0    ;
 vpcmpeqd     %ymm1, %ymm1, %ymm1    ; v
 vpxor        %ymm1, %ymm0, %ymm0    ; 2x negate result
 vextracti128 $1, %ymm0, %xmm1       ; v
 vpacksswb    %xmm1, %xmm0, %xmm0    ; v
 vpshufd      $216, %xmm0, %xmm0     ; v
 vpmovmskb    %xmm0, %ecx            ; 4x extract mask
 popcntl      %ecx, %ecx             

Optimizing find: popcount

#![feature(portable_simd)]
use std::simd::prelude::*;

fn find_popcount(block: &Block, q: i32) -> usize {
    // View the u32 keys as i32 (assumes values fit in i32), so one signed compare suffices.
    let data: Simd<i32, N> = Simd::from_slice(&block.data[0..N]).cast();
    let q = Simd::splat(q);
    let mask = data.simd_lt(q);                // data[i] < q
    mask.to_bitmask().count_ones() as usize    // count_ones
}
vpcmpgtd     (%rsi,%rdi), %ymm0, %ymm1
vpcmpgtd     32(%rsi,%rdi), %ymm0, %ymm0
vpackssdw    %ymm1, %ymm0, %ymm0     ; interleave 16bit low halves
vextracti128 $1, %ymm0, %xmm1        ; 1
vpacksswb    %xmm1, %xmm0, %xmm0     ; 2
vpshufd      $216, %xmm0, %xmm0      ; 3 unshuffle interleaving
vpmovmskb    %xmm0, %edi             ; 4 instructions to extract bitmask
popcntl      %edi, %edi

Optimizing find: manual movemask_epi8

use std::arch::x86_64::{_mm256_movemask_epi8, _mm256_packs_epi32};

fn find_manual(block: &Block, q: i32) -> usize { // i32 now!
    let low:  Simd<i32, 8> = Simd::from_slice(&block.data[0..N / 2]).cast();
    let high: Simd<i32, 8> = Simd::from_slice(&block.data[N / 2..N]).cast();
    let q = Simd::<i32, 8>::splat(q);
    let mask_low  = q.simd_gt(low ).to_int();             // compare with low  half
    let mask_high = q.simd_gt(high).to_int();             // compare with high half
    let merged = unsafe { _mm256_packs_epi32(mask_low.into(), mask_high.into()) }; // interleave
    let mask = unsafe { _mm256_movemask_epi8(merged) };   // movemask_epi8!
    mask.count_ones() as usize / 2                        // correct for double-counting
}
 vpcmpgtd     (%rsi,%rdi), %ymm0, %ymm1     ; 1 cycle, in parallel
 vpcmpgtd     32(%rsi,%rdi), %ymm0, %ymm0   ; 1 cycle, in parallel
 vpackssdw    %ymm0, %ymm1, %ymm0           ; 1 cycle
-vextracti128 $1, %ymm0, %xmm1     ;
-vpacksswb    %xmm1, %xmm0, %xmm0  ;
-vpshufd      $216, %xmm0, %xmm0   ;
-vpmovmskb    %xmm0, %edi          ; 4 instructions emulating movemask_epi16
+vpmovmskb    %ymm0, %edi                   ; 2 cycles        movemask_epi8
 popcntl      %edi, %edi                    ; 1 cycle
+                                  ; /2 is folded into pointer arithmetic

The results: branchless is great!

3-find.svg

Throughput, not latency

p99-section.png

Batching: Many queries in parallel

-fn query                (&self, q :   u32    ) ->  u32     {
+fn batch<const B: usize>(&self, qs: &[u32; B]) -> [u32; B] {
-    let mut k =  0    ;                                    // current index
+    let mut k = [0; B];                                    // current indices
     for [o, _o2] in self.offsets.array_windows() {         // walk down the tree
+        for i in 0..B {
             let jump_to = self.node(o + k[i]).find(qs[i]); // call `find`
             k[i] = k[i] * (N + 1) + jump_to;               // update index
+        }
     }
 
     let o = self.offsets.last().unwrap();
+    from_fn(|i| {
         let idx = self.node(o + k[i]).find(qs[i]);
         self.tree[o + k[i] + idx / N].data[idx % N]        // return value
+    })
 }
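
Usage sketch, assuming a built Tree and a batch size of 16 (the batch size actually benchmarked may differ):

let mut results = Vec::with_capacity(queries.len());
for chunk in queries.chunks_exact(16) {
    let qs: &[u32; 16] = chunk.try_into().unwrap();
    results.extend_from_slice(&tree.batch(qs));
}
// A tail of fewer than 16 queries would be handled separately.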

Batching: up to 2.5x faster!

4-batching.svg

Prefetching

 fn batch<const B: usize>(&self, qs: &[u32; B]) -> [u32; B] {
     let mut k = [0; B];                                    // current indices
     for [o, o2] in self.offsets.array_windows() {          // walk down the tree
         for i in 0..B {
             let jump_to = self.node(o + k[i]).find(qs[i]); // call `find`
             k[i] = k[i] * (N + 1) + jump_to;               // update index
+            prefetch_index(&self.tree, o2 + k[i]);         // prefetch next layer
         }
     }
 
     let o = self.offsets.last().unwrap();
     from_fn(|i| {
         let idx = self.node(o + k[i]).find(qs[i]);
         self.tree[o + k[i] + idx / N].data[idx % N]        // return value
     })
 }
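
prefetch_index is a small helper; a possible sketch on x86-64 (the T0 hint level is an assumption):

use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

// Ask the CPU to start pulling the cache line of tree[index] into cache.
// Prefetches never fault, so a slightly out-of-range index is harmless.
fn prefetch_index(tree: &[Block], index: usize) {
    let ptr = tree.as_ptr().wrapping_add(index) as *const i8;
    unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr) };
}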

Prefetching: 30% faster again!

5-prefetch.svg

Optimizing pointer arithmetic: more gains

  • Store offsets and indices in byte units, so the hot loop needs no index-to-byte conversions (sketch below).
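
A minimal sketch of the idea, assuming Block stays 64 bytes and both o and k[i] are kept in bytes (hypothetical helper, not the talk's exact code):

impl Tree {
    fn node_at_byte(&self, byte_offset: usize) -> &Block {
        debug_assert_eq!(byte_offset % 64, 0);
        // Address = base + byte offset: no index-to-byte shift left in the hot loop.
        unsafe { &*self.tree.as_ptr().byte_add(byte_offset) }
    }
}
// The per-layer update then becomes, all in bytes:
//   k[i] = k[i] * (N + 1) + jump_to * 64;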

6-improvements.svg

Interleaving: more pressure on the RAM, -20%

  • Interleave multiple batches, each at a different layer of the tree, to keep more memory requests in flight.

8-interleave.svg

Tree layout: internal nodes store minima

  • For query 5.5, we walk down the left subtree.
  • Returning 6 reads a new cache line.

full.svg

Tree layout: internal nodes store maxima

  • For query 5.5, we walk down the middle subtree.
  • Returning 6 reads the same cache line.

flipped.svg

Tree layout: another 10% gained!

9-left-max-tree.svg

Conclusion

  • With 4GB input: 40x speedup!
    • binary search: 1150ns
    • static search tree: 27ns
  • Latency of the Eytzinger-layout search is about 3x the latency of a single RAM access.
  • Throughput of the static search tree is within about 3x of the raw RAM throughput.
  • RAM is slow, caches are fast.
  • Grinding through assembly works; be your own compiler!
  • Theoretical big-O complexities are useless here, since caches invalidate the RAM-model assumptions.

P99 CONF

p99-last.png