QuadRank

Engineering a High-Throughput Rank

Ragnar Groot Koerkamp

DSB 2026

curiouscoding.nl/{posts,slides}/quadrank https://doi.org/10.48550/ARXIV.2602.04103

Binary Rank: Problem statement

  • Input: a many-GB text \(T = t_0\dots t_{n-1}\) of \(n\) bits.
  • Queries: given \(q\), find
\begin{equation*} \newcommand{\rank}{\mathsf{rank}} \rank(q) := \sum_{0\leq i< q} t_i. \end{equation*}
  • \(T = \underline{\texttt{1001001}}\texttt{110010100}\)
    • \(\rank(0) = 0\)
    • \(\rank(7) = 3\)
    • \(\rank(16) = 7\)
  • Why? Occurrences table in FM-index.
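A minimal reference implementation of the definition, checked against the example above (one byte per bit, purely for clarity):

/// Rank by definition: the number of 1-bits among t_0, ..., t_{q-1}.
fn rank_naive(text: &[u8], q: usize) -> usize {
    text[..q].iter().map(|&t| t as usize).sum()
}

// T = 1001001110010100, as in the example above.
let t: Vec<u8> = "1001001110010100".bytes().map(|b| b - b'0').collect();
assert_eq!(rank_naive(&t, 0), 0);
assert_eq!(rank_naive(&t, 7), 3);
assert_eq!(rank_naive(&t, 16), 7);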

History

rank-overview.svg

Naive solutions

  • Linear scan:
    • \(O(n/w)\) time using \(w\)-bit popcount; no space overhead (sketched below).

1-linear.svg

  • Precompute all 64-bit answers \(r_i := \rank(i)\):
    • \(O(1)\) time, \(64\times\) overhead.

2-precompute.svg
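A sketch of the linear scan, assuming the text is bit-packed least-significant-bit first into 64-bit words: one popcount per full word before position \(q\), plus a masked popcount of the final partial word.

/// Linear-scan rank over a bit-packed text: one popcount per 64-bit word.
/// O(n/w) time with w = 64, and no extra space.
fn rank_scan(words: &[u64], q: usize) -> u64 {
    let mut ones = 0;
    // Full words entirely before position q.
    for &w in &words[..q / 64] {
        ones += w.count_ones() as u64;
    }
    // The remaining q % 64 bits of the last, partial word.
    let rem = q % 64;
    if rem > 0 {
        let mask = (1u64 << rem) - 1;
        ones += (words[q / 64] & mask).count_ones() as u64;
    }
    ones
}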

Middle-ground: block-based offsets

  • Use \(B=512\)-bit blocks, and store all \(b_j := \rank(j\cdot B)\).

3-blocks.svg

  • Query \(q = j\cdot B + q'\): \(\rank(q) = b_j + \sum_{jB\leq i < jB+q'} t_i\).
  • \(O(B/w) = O(512/64) = O(1)\) time.
  • \(64/512 = 12.5\%\) space overhead.
  • 2 cache misses per query:
    • one in the \(n/8\)-bit offsets array, to fetch \(b_j\)
    • one in the \(n\)-bit input array, to fetch the bits of block \(j\)
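A sketch of the block-based query above; the struct and field names are illustrative. Look up the block offset \(b_j\), then popcount at most 512 remaining bits.

/// Rank with 512-bit blocks: a precomputed 64-bit offset per block plus a
/// popcount over the bits of one block.
struct BlockRank {
    words: Vec<u64>,   // the input text, bit-packed, LSB first
    offsets: Vec<u64>, // offsets[j] = rank(j * 512)
}

impl BlockRank {
    fn rank(&self, q: usize) -> u64 {
        const B: usize = 512;
        let j = q / B;                  // block index
        let mut ones = self.offsets[j]; // b_j
        // Popcount the remaining q' = q % 512 bits inside block j.
        let start = j * (B / 64);
        let full_words = (q % B) / 64;
        for &w in &self.words[start..start + full_words] {
            ones += w.count_ones() as u64;
        }
        let rem = q % 64;
        if rem > 0 {
            let mask = (1u64 << rem) - 1;
            ones += (self.words[start + full_words] & mask).count_ones() as u64;
        }
        ones
    }
}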

Reducing overhead: PastaWide [and others]

  • 2 levels:
    • L2 with 16-bit delta every block: 3.125% overhead
    • L1 with 64-bit offset every 128 blocks: 0.1% overhead

4-pasta.svg
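A sketch of the resulting two-level lookup (names are illustrative): the query adds the 64-bit L1 offset of the superblock, the 16-bit L2 delta of the block, and a popcount of the remaining bits.

/// Two-level metadata: l1[k] = rank(k * 128 * 512), and
/// l2[j] = rank(j * 512) - l1[j / 128], which always fits in 16 bits.
struct TwoLevelRank {
    words: Vec<u64>,
    l1: Vec<u64>, // one 64-bit offset per 128 blocks: ~0.1% overhead
    l2: Vec<u16>, // one 16-bit delta per 512-bit block: 3.125% overhead
}

impl TwoLevelRank {
    fn rank(&self, q: usize) -> u64 {
        let j = q / 512;
        let mut ones = self.l1[j / 128] + self.l2[j] as u64;
        // Popcount the q % 512 remaining bits of block j, as before.
        let start = j * 8;
        for i in 0..(q % 512) / 64 {
            ones += self.words[start + i].count_ones() as u64;
        }
        let rem = q % 64;
        if rem > 0 {
            let mask = (1u64 << rem) - 1;
            ones += (self.words[start + (q % 512) / 64] & mask).count_ones() as u64;
        }
        ones
    }
}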

Reducing cache misses: SPIDER

  • Inline the 16-bit L2 deltas into each cache line (layout sketched below).
    • The remaining L1 array (0.1% overhead) fits in cache.

5-spider.svg
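One way to picture the inlining; the field sizes below are an assumption for the sketch, not necessarily SPIDER's exact layout. The point is that a delta and the data bits it covers share one cache line, so a query touches a single line of the main array plus the small L1 array that stays cached.

/// Illustrative inlined layout: the 16-bit delta sits next to the data bits
/// it covers, inside the same 64-byte cache line.
#[repr(C, align(64))]
struct InlinedLine {
    delta: u16,      // rank at the start of this line, relative to the L1 offset
    bits: [u16; 31], // the remaining 62 bytes = 496 data bits
}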

Reducing the popcount: Pairing

  • The delta is relative to the middle of a block instead of its start.
    • Count at most 256 bits, in either the first or the second half of the block (sketched below).

6-pairing.svg
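A sketch of the in-block count when the stored value refers to the block's middle (bit 256): a query in the second half adds a popcount, a query in the first half subtracts one, so at most 256 bits are counted either way.

/// Rank inside one 512-bit block whose stored count is the rank at bit 256.
/// `q_prime` is the in-block position q % 512. A real implementation
/// popcounts whole words; this counts bit by bit for clarity.
fn rank_in_block(block: &[u64; 8], rank_at_middle: u64, q_prime: usize) -> u64 {
    // Number of 1-bits at block positions [lo, hi).
    fn count(block: &[u64; 8], lo: usize, hi: usize) -> u64 {
        (lo..hi).map(|i| (block[i / 64] >> (i % 64)) & 1).sum()
    }
    if q_prime >= 256 {
        // Second half: add the bits between the middle and q'.
        rank_at_middle + count(block, 256, q_prime)
    } else {
        // First half: subtract the bits between q' and the middle.
        rank_at_middle - count(block, q_prime, 256)
    }
}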

All together now: BiRank

  • Inline 16-bit deltas
  • Pairing
  • 32-bit reduced L1 offsets: 0.05% overhead
    • The low 11 bits of each offset are stored in the deltas
    • Input up to \(2^{43}\) bits
    • 16 MiB cache supports 32 GiB input

7-birank.svg
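One plausible reading of the reduced offsets, sketched below; the exact bit packing is an assumption. The 32-bit L1 entry holds the high bits of the rank, the low 11 bits live in the deltas, and together they address up to \(2^{32+11} = 2^{43}\) bits.

/// Reconstruct a full offset from a 32-bit reduced L1 entry plus the low
/// 11 bits stored alongside the per-block delta (illustrative bit packing).
fn full_offset(l1_reduced: u32, delta_low_bits: u16) -> u64 {
    ((l1_reduced as u64) << 11) | (delta_low_bits as u64 & 0x7ff)
}

// With 32-bit entries the L1 array is ~0.05% of the input, so a 16 MiB
// cache holds it for inputs up to ~32 GiB.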

What is fast?

Latency: 80 ns/q

// Chain each query on the previous answer: the data dependency prevents the
// CPU from overlapping memory accesses, so this measures per-query latency.
let mut seed = 0;
for q in queries {
    seed ^= ranker.rank(q ^ seed);
}

For loop: 16-30 ns/q

// Independent queries: the out-of-order CPU can keep several cache misses
// in flight at once.
for q in queries {
    ranker.rank(q);
}

Prefetching/batching: 8-23 ns/q

for i in 0..queries.len() {
    // Prefetch the data for the query 32 iterations ahead, so its cache miss
    // overlaps with the work on the current queries.
    if i + 32 < queries.len() {
        ranker.prefetch(queries[i + 32]);
    }
    ranker.rank(queries[i]);
}

BiRank results

plot-laptop-st-2-large.svg

Memory-bandwidth efficient = Fast

QuadRank

  • Support the \(\sigma=4\) (ACGT) alphabet, and \(\rank(q, c)\) for arbitrary \(c\).

\[\rank_4(q) := [\rank(q, \texttt A),\rank(q, \texttt C),\rank(q, \texttt G),\rank(q, \texttt T)].\]

  • Input is 2 bits per symbol.
  • Like BiRank, but replicate all metadata \(4\times\).
  • Store the input in a transposed layout: the low bits of all symbols separate from the high bits.
  • Efficient SIMD implementation of 4-parallel popcounting (per-word idea sketched below).
  • 16 MiB cache supports 14 GiB = 61 Gbp of input

8-quadrank.svg
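A scalar sketch of why the transposed layout enables 4-parallel counting, assuming the common 2-bit encoding A=00, C=01, G=10, T=11 (the encoding is an assumption here): with the low and high bits of 64 symbols in two separate words, each symbol count is one AND plus one popcount. The SIMD implementation applies the same idea to wider vectors.

/// Counts of A, C, G, T among 64 symbols stored transposed: `lo` holds the
/// low bit and `hi` the high bit of each symbol's 2-bit code.
fn count4(lo: u64, hi: u64) -> [u32; 4] {
    let t = (hi & lo).count_ones();  // 11 = T
    let g = (hi & !lo).count_ones(); // 10 = G
    let c = (!hi & lo).count_ones(); // 01 = C
    let a = 64 - t - g - c;          // 00 = A: whatever remains
    [a, c, g, t]
}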

QuadRank results

plot-laptop-st-4-large.svg

Toy Batching FM-index

  • count-only; no locate.
  • 8-character prefix-lookup.
  • Map 32 reads in parallel.
  • Process one character of every read before moving on to the next character.
  • Prefetch the memory needed for the next character.
  • Keep a list of active reads, and only iterate over those (loop sketched below).
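A sketch of the batched mapping loop; `FmOps`, its methods, and the field names are hypothetical, and the initial intervals coming from the 8-character prefix lookup are not shown. Each round advances every active read by one character and prefetches the memory its next rank query will need, so the cache misses of the whole batch overlap.

/// The FM-index operations the loop needs (hypothetical interface).
trait FmOps {
    fn c(&self, ch: u8) -> u64;              // start of character ch's bucket
    fn rank(&self, pos: u64, ch: u8) -> u64; // occurrences of ch in BWT[0..pos)
    fn prefetch(&self, pos: u64, ch: u8);    // prefetch the line rank() will read
}

struct Read<'a> {
    id: usize,     // index of this read in the batch of 32
    pat: &'a [u8], // the pattern, matched right to left
    pos: usize,    // characters still to process
    lo: u64,       // current BWT interval [lo, hi), seeded by the prefix lookup
    hi: u64,
}

fn batch_count(fm: &impl FmOps, mut active: Vec<Read<'_>>, batch_size: usize) -> Vec<u64> {
    let mut counts = vec![0u64; batch_size]; // count-only: no locate
    while !active.is_empty() {
        // One character of every active read before moving to the next character.
        for read in &mut active {
            let ch = read.pat[read.pos - 1];
            read.lo = fm.c(ch) + fm.rank(read.lo, ch);
            read.hi = fm.c(ch) + fm.rank(read.hi, ch);
            read.pos -= 1;
            if read.pos > 0 && read.lo < read.hi {
                // Prefetch the occurrence-table lines for the next character.
                let next = read.pat[read.pos - 1];
                fm.prefetch(read.lo, next);
                fm.prefetch(read.hi, next);
            }
        }
        // Drop finished or empty reads and record their counts.
        active.retain(|r| {
            if r.pos == 0 || r.lo >= r.hi {
                counts[r.id] = r.hi.saturating_sub(r.lo);
                false
            } else {
                true
            }
        });
    }
    counts
}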

FM-index results

plot-fm.svg

Conclusion: Thread, Batch, and Prefetch

PS: Sassy1 grep-like approximate search

sassy grep -p <pattern> -k 3 <human-genome>.fa

sassy-grep.gif

PS: Sassy2 for batch searching

sassy2-results.svg