1 High-level ideas
- Prefix table: for each 20-bit prefix, store the corresponding range of the array.
- Interpolation: take one or more interpolation steps. The maximum resulting error could be precomputed and stored.
- Drawback: the number of remaining iterations can be unpredictable.
- Batching: process multiple (8-32) queries at the same time, hiding memory latency.
- Query bucketing: given far more than 1M queries, partition them into 1M buckets and answer them bucket by bucket.
- Eytzinger layout
- B-tree layout
- Prefetching (either the next Eytzinger iteration, or other queries in the batch); see the sketch below.
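To make the last two ideas concrete, here is a minimal Rust sketch (not taken from the repos below): a branchless lower-bound over an Eytzinger-layout array that prefetches the node four levels ahead. The 4-byte keys, the prefetch distance, and all names are my own illustrative choices.

```rust
/// Prefetch the cache line containing `p` (x86_64 only; a no-op elsewhere).
#[cfg(target_arch = "x86_64")]
fn prefetch(p: *const u32) {
    unsafe {
        use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        _mm_prefetch::<_MM_HINT_T0>(p as *const i8);
    }
}
#[cfg(not(target_arch = "x86_64"))]
fn prefetch(_p: *const u32) {}

/// Fill `tree` (1-indexed; tree[0] is unused padding) with `sorted` in
/// Eytzinger order: the root is tree[1], the children of node i are
/// tree[2i] and tree[2i+1].
fn build_eytzinger(sorted: &[u32], tree: &mut [u32], next: &mut usize, node: usize) {
    if node < tree.len() {
        build_eytzinger(sorted, tree, next, 2 * node);
        tree[node] = sorted[*next];
        *next += 1;
        build_eytzinger(sorted, tree, next, 2 * node + 1);
    }
}

/// Branchless lower bound: index of the first element >= q, or 0 if every
/// element is smaller than q.
fn eytzinger_lower_bound(tree: &[u32], q: u32) -> usize {
    let mut i = 1;
    while i < tree.len() {
        // Prefetch the 16 possible descendants four levels down; they share a
        // cache line when the array is 64-byte aligned. A tuned version pads
        // the array instead of bounds-checking here.
        if 16 * i < tree.len() {
            prefetch(unsafe { tree.as_ptr().add(16 * i) });
        }
        // Branchless descent: go right iff tree[i] < q.
        i = 2 * i + (tree[i] < q) as usize;
    }
    // The trailing 1-bits of i record the "went right" steps taken after the
    // last candidate answer; shift them (plus one more bit) off to recover it.
    i >> (i.trailing_ones() + 1)
}

fn main() {
    let sorted: Vec<u32> = (0..15).map(|x| 2 * x).collect(); // 0, 2, ..., 28
    let mut tree = vec![0u32; sorted.len() + 1];
    build_eytzinger(&sorted, &mut tree, &mut 0, 1);
    let i = eytzinger_lower_bound(&tree, 13);
    assert_eq!(tree[i], 14); // first element >= 13 is 14
}
```

Batching would run several such descents in lockstep, issuing the prefetches of all queries before doing any of the comparisons, so that their cache misses overlap.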
1.1 Resources
- Algorithmica: https://en.algorithmica.org/hpc/data-structures/
- Khuong and Morin (2017)
- https://www.cai.sk/ojs/index.php/cai/article/view/2019_3_555
1.2 Code
- github:RagnarGrootKoerkamp/suffix-array-searching
- Some initial binary search and B-tree variants.
- github:RagnarGrootKoerkamp/cpu-benchmarks
- Low-level CPU benchmarks to get upper bounds on potential performance.
2 To measure
- Maximum random-access cache-line throughput (1 thread and many threads); a single-threaded sketch follows below.
- Also variants for fetching 2/3/4 consecutive cache-lines.
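A minimal single-threaded sketch of the first measurement (not from the cpu-benchmarks repo; the 1 GiB array, iteration count, and xorshift index generator are my own arbitrary choices):

```rust
use std::time::Instant;

/// One 64-byte, 64-byte-aligned entry, so each access touches exactly one cache line.
#[repr(align(64))]
#[derive(Clone, Copy)]
struct CacheLine([u64; 8]);

fn main() {
    // 2^24 entries * 64 B = 1 GiB: much larger than any cache level.
    let n: usize = 1 << 24;
    let data = vec![CacheLine([1u64; 8]); n];

    let iters: u64 = 1 << 25;
    let mut x: u64 = 0x9E37_79B9_7F4A_7C15; // xorshift64 state (arbitrary seed)
    let mut sum: u64 = 0;

    let start = Instant::now();
    for _ in 0..iters {
        // xorshift64: a cheap index generator that does not depend on the
        // loaded values, so the out-of-order core can keep many misses in flight.
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        let i = (x as usize) & (n - 1);
        sum = sum.wrapping_add(data[i].0[0]);
    }
    let elapsed = start.elapsed();

    println!(
        "{iters} random cache-line reads in {elapsed:?}: {:.1} M lines/s (checksum {sum})",
        iters as f64 / elapsed.as_secs_f64() / 1e6,
    );
}
```

The 2/3/4-consecutive-line variants would add `data[i + 1]`, `data[i + 2]`, ... to the same loop; a latency (pointer-chasing) variant would instead derive the next index from the loaded value.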
TODO 3 Memory efficiency
Suppose our task is to find an integer \(q\) in a sorted list \(A\) of length \(n\). One option is binary search, but using a B-tree or the Eytzinger layout turns out to be faster when \(n\) is large. See the excellent paper by Khuong and Morin (2017) for background and detailed comparisons.
Here, I’d like to compare the memory efficiency of the B-tree and Eytzinger layouts. That is: which method puts the least pressure on the memory system, and can thus achieve higher potential throughput?
Let’s say we are searching an array of \(n=2^k\) elements of \(b\) bytes each.
3.1 B-tree
A cache-line has 64 bytes. Set \(B = 64/b\). Each node stores \(B-1\) values and one gap, and corresponds to \(\ell=\log_2 B\) levels of the tree. Only \(\ell\) of the values in each fetched cache-line are actually used: a binary search within the node makes \(\ell\) comparisons.
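Continuing this setup with my own back-of-the-envelope count (an assumption-level illustration, not a measured number): the tree has branching factor \(B\), so a query descends \(\log_B n = k/\ell\) node levels and fetches
\[ \left\lceil \frac{k}{\ell} \right\rceil \]
cache lines. For example, \(b=4\) gives \(B=16\) and \(\ell=4\); with \(k=32\) that is \(8\) cache-line fetches per query.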