QuadRank

Engineering a High-Throughput Rank

Ragnar Groot Koerkamp

DSB 2026

curiouscoding.nl/{posts,slides}/quadrank https://doi.org/10.48550/ARXIV.2602.04103

Binary Rank: Problem statement

  • Input: a many-GB text \(T = t_0\dots t_{n-1}\) of \(n\) bits.
  • Queries: given \(q\), find
\begin{equation*} \newcommand{\rank}{\mathsf{rank}} \rank(q) := \sum_{0\leq i< q} t_i. \end{equation*}
  • \(T = \underline{\texttt{1001001}}\texttt{110010100}\)
    • \(\rank(0) = 0\)
    • \(\rank(7) = 3\)
    • \(\rank(16) = 7\)
  • Why? Occurrences table in FM-index.
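A minimal reference implementation of the definition, checked against the example above (one byte per bit, purely for clarity):

/// Rank by definition: the number of 1-bits among t_0, ..., t_{q-1}.
fn rank_naive(text: &[u8], q: usize) -> usize {
    text[..q].iter().map(|&t| t as usize).sum()
}

// T = 1001001110010100, as in the example above.
let t: Vec<u8> = "1001001110010100".bytes().map(|b| b - b'0').collect();
assert_eq!(rank_naive(&t, 0), 0);
assert_eq!(rank_naive(&t, 7), 3);
assert_eq!(rank_naive(&t, 16), 7);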

History

rank-overview.svg

Naive solutions

  • Linear scan:
    • \(O(n/w)\) time using \(w\)-bit popcount; no space overhead (sketched below).

1-linear.svg

  • Precompute all 64-bit answers \(r_i := \rank(i)\):
    • \(O(1)\) time, \(64\times\) overhead.

2-precompute.svg
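A sketch of the linear scan, assuming the text is bit-packed least-significant-bit first into 64-bit words: one popcount per full word before position \(q\), plus a masked popcount of the final partial word.

/// Linear-scan rank over a bit-packed text: one popcount per 64-bit word.
/// O(n/w) time with w = 64, and no extra space.
fn rank_scan(words: &[u64], q: usize) -> u64 {
    let mut ones = 0;
    // Full words entirely before position q.
    for &w in &words[..q / 64] {
        ones += w.count_ones() as u64;
    }
    // The remaining q % 64 bits of the last, partial word.
    let rem = q % 64;
    if rem > 0 {
        let mask = (1u64 << rem) - 1;
        ones += (words[q / 64] & mask).count_ones() as u64;
    }
    ones
}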

Middle-ground: block-based offsets

  • Use \(B=512\)-bit blocks, and store all \(b_j := \rank(j\cdot B)\).

3-blocks.svg

  • Query \(q = j\cdot B + q'\): \(\rank(q) = b_j + \sum_{jB\leq i < jB+q'} t_i\).
  • \(O(B/w) = O(512/64) = O(1)\) time.
  • \(64/512 = 12.5\%\) space overhead.
  • 2 cache misses per query:
    • one in the \(n/8\)-bit offsets array, to fetch \(b_j\)
    • one in the \(n\)-bit input array, to fetch the bits of block \(j\)
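A sketch of the block-based query above; the struct and field names are illustrative. Look up the block offset \(b_j\), then popcount at most 512 remaining bits.

/// Rank with 512-bit blocks: a precomputed 64-bit offset per block plus a
/// popcount over the bits of one block.
struct BlockRank {
    words: Vec<u64>,   // the input text, bit-packed, LSB first
    offsets: Vec<u64>, // offsets[j] = rank(j * 512)
}

impl BlockRank {
    fn rank(&self, q: usize) -> u64 {
        const B: usize = 512;
        let j = q / B;                  // block index
        let mut ones = self.offsets[j]; // b_j
        // Popcount the remaining q' = q % 512 bits inside block j.
        let start = j * (B / 64);
        let full_words = (q % B) / 64;
        for &w in &self.words[start..start + full_words] {
            ones += w.count_ones() as u64;
        }
        let rem = q % 64;
        if rem > 0 {
            let mask = (1u64 << rem) - 1;
            ones += (self.words[start + full_words] & mask).count_ones() as u64;
        }
        ones
    }
}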

Reducing overhead: PastaWide [and others]

  • 2 levels:
    • L2 with 16-bit delta every block: 3.125% overhead
    • L1 with 64-bit offset every 128 blocks: 0.1% overhead

4-pasta.svg
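A sketch of the resulting two-level lookup (names are illustrative): the query adds the 64-bit L1 offset of the superblock, the 16-bit L2 delta of the block, and a popcount of the remaining bits.

/// Two-level metadata: l1[k] = rank(k * 128 * 512), and
/// l2[j] = rank(j * 512) - l1[j / 128], which always fits in 16 bits.
struct TwoLevelRank {
    words: Vec<u64>,
    l1: Vec<u64>, // one 64-bit offset per 128 blocks: ~0.1% overhead
    l2: Vec<u16>, // one 16-bit delta per 512-bit block: 3.125% overhead
}

impl TwoLevelRank {
    fn rank(&self, q: usize) -> u64 {
        let j = q / 512;
        let mut ones = self.l1[j / 128] + self.l2[j] as u64;
        // Popcount the q % 512 remaining bits of block j, as before.
        let start = j * 8;
        for i in 0..(q % 512) / 64 {
            ones += self.words[start + i].count_ones() as u64;
        }
        let rem = q % 64;
        if rem > 0 {
            let mask = (1u64 << rem) - 1;
            ones += (self.words[start + (q % 512) / 64] & mask).count_ones() as u64;
        }
        ones
    }
}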

Reducing cache misses: SPIDER

  • Inline the 16-bit L2 deltas into each cache line (layout sketched below).
    • The remaining L1 array (0.1% overhead) fits in cache.

5-spider.svg
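One way to picture the inlining; the field sizes below are an assumption for the sketch, not necessarily SPIDER's exact layout. The point is that a delta and the data bits it covers share one cache line, so a query touches a single line of the main array plus the small L1 array that stays cached.

/// Illustrative inlined layout: the 16-bit delta sits next to the data bits
/// it covers, inside the same 64-byte cache line.
#[repr(C, align(64))]
struct InlinedLine {
    delta: u16,      // rank at the start of this line, relative to the L1 offset
    bits: [u16; 31], // the remaining 62 bytes = 496 data bits
}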

Reducing the popcount: Pairing

  • The delta is relative to the middle of a block instead of its start.
    • Count at most 256 bits, in either the first or the second half of the block (sketched below).

6-pairing.svg
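A sketch of the in-block count when the stored value refers to the block's middle (bit 256): a query in the second half adds a popcount, a query in the first half subtracts one, so at most 256 bits are counted either way.

/// Rank inside one 512-bit block whose stored count is the rank at bit 256.
/// `q_prime` is the in-block position q % 512. A real implementation
/// popcounts whole words; this counts bit by bit for clarity.
fn rank_in_block(block: &[u64; 8], rank_at_middle: u64, q_prime: usize) -> u64 {
    // Number of 1-bits at block positions [lo, hi).
    fn count(block: &[u64; 8], lo: usize, hi: usize) -> u64 {
        (lo..hi).map(|i| (block[i / 64] >> (i % 64)) & 1).sum()
    }
    if q_prime >= 256 {
        // Second half: add the bits between the middle and q'.
        rank_at_middle + count(block, 256, q_prime)
    } else {
        // First half: subtract the bits between q' and the middle.
        rank_at_middle - count(block, q_prime, 256)
    }
}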

All together now: BiRank

  • Inline 16-bit deltas
  • Pairing
  • 32-bit reduced L1 offsets: 0.05% overhead
    • The low 11 bits of each offset are stored in the deltas
    • Input up to \(2^{43}\) bits
    • 16 MiB cache supports 32 GiB input

7-birank.svg
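One plausible reading of the reduced offsets, sketched below; the exact bit packing is an assumption. The 32-bit L1 entry holds the high bits of the rank, the low 11 bits live in the deltas, and together they address up to \(2^{32+11} = 2^{43}\) bits.

/// Reconstruct a full offset from a 32-bit reduced L1 entry plus the low
/// 11 bits stored alongside the per-block delta (illustrative bit packing).
fn full_offset(l1_reduced: u32, delta_low_bits: u16) -> u64 {
    ((l1_reduced as u64) << 11) | (delta_low_bits as u64 & 0x7ff)
}

// With 32-bit entries the L1 array is ~0.05% of the input, so a 16 MiB
// cache holds it for inputs up to ~32 GiB.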

What is fast?

Latency: 80 ns/q

// Chain each query on the previous answer: the data dependency prevents the
// CPU from overlapping memory accesses, so this measures per-query latency.
let mut seed = 0;
for q in queries {
    seed ^= ranker.rank(q ^ seed);
}

For loop: 16-30 ns/q

// Independent queries: the out-of-order CPU can keep several cache misses
// in flight at once.
for q in queries {
    ranker.rank(q);
}

Prefetching/batching: 8-23 ns/q

for i in 0..queries.len() {
    // Prefetch the data for the query 32 iterations ahead, so its cache miss
    // overlaps with the work on the current queries.
    if i + 32 < queries.len() {
        ranker.prefetch(queries[i + 32]);
    }
    ranker.rank(queries[i]);
}

BiRank results

plot-laptop-st-2-large.svg

Memory-bandwidth efficient = Fast

QuadRank

  • Support the \(\sigma=4\) (ACGT) alphabet, and \(\rank(q, c)\) for arbitrary \(c\).

\[\rank_4(q) := [\rank(q, \texttt A),\rank(q, \texttt C),\rank(q, \texttt G),\rank(q, \texttt T)].\]

  • Input is 2 bits per symbol.
  • Like BiRank, but replicate all metadata \(4\times\).
  • Store the input in a transposed layout: the low bits of all symbols separate from the high bits.
  • Efficient SIMD implementation of 4-parallel popcounting (per-word idea sketched below).
  • 16 MiB cache supports 14 GiB = 61 Gbp of input

8-quadrank.svg
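A scalar sketch of why the transposed layout enables 4-parallel counting, assuming the common 2-bit encoding A=00, C=01, G=10, T=11 (the encoding is an assumption here): with the low and high bits of 64 symbols in two separate words, each symbol count is one AND plus one popcount. The SIMD implementation applies the same idea to wider vectors.

/// Counts of A, C, G, T among 64 symbols stored transposed: `lo` holds the
/// low bit and `hi` the high bit of each symbol's 2-bit code.
fn count4(lo: u64, hi: u64) -> [u32; 4] {
    let t = (hi & lo).count_ones();  // 11 = T
    let g = (hi & !lo).count_ones(); // 10 = G
    let c = (!hi & lo).count_ones(); // 01 = C
    let a = 64 - t - g - c;          // 00 = A: whatever remains
    [a, c, g, t]
}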

QuadRank results

plot-laptop-st-4-large.svg

Toy Batching FM-index

  • count-only; no locate.
  • 8-character prefix-lookup.
  • Map 32 reads in parallel.
  • Process one character of every read before moving on to the next character.
  • Prefetch the memory needed for the next character.
  • Keep a list of active reads, and only iterate over those (loop sketched below).
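A sketch of the batched mapping loop; `FmOps`, its methods, and the field names are hypothetical, and the initial intervals coming from the 8-character prefix lookup are not shown. Each round advances every active read by one character and prefetches the memory its next rank query will need, so the cache misses of the whole batch overlap.

/// The FM-index operations the loop needs (hypothetical interface).
trait FmOps {
    fn c(&self, ch: u8) -> u64;              // start of character ch's bucket
    fn rank(&self, pos: u64, ch: u8) -> u64; // occurrences of ch in BWT[0..pos)
    fn prefetch(&self, pos: u64, ch: u8);    // prefetch the line rank() will read
}

struct Read<'a> {
    id: usize,     // index of this read in the batch of 32
    pat: &'a [u8], // the pattern, matched right to left
    pos: usize,    // characters still to process
    lo: u64,       // current BWT interval [lo, hi), seeded by the prefix lookup
    hi: u64,
}

fn batch_count(fm: &impl FmOps, mut active: Vec<Read<'_>>, batch_size: usize) -> Vec<u64> {
    let mut counts = vec![0u64; batch_size]; // count-only: no locate
    while !active.is_empty() {
        // One character of every active read before moving to the next character.
        for read in &mut active {
            let ch = read.pat[read.pos - 1];
            read.lo = fm.c(ch) + fm.rank(read.lo, ch);
            read.hi = fm.c(ch) + fm.rank(read.hi, ch);
            read.pos -= 1;
            if read.pos > 0 && read.lo < read.hi {
                // Prefetch the occurrence-table lines for the next character.
                let next = read.pat[read.pos - 1];
                fm.prefetch(read.lo, next);
                fm.prefetch(read.hi, next);
            }
        }
        // Drop finished or empty reads and record their counts.
        active.retain(|r| {
            if r.pos == 0 || r.lo >= r.hi {
                counts[r.id] = r.hi.saturating_sub(r.lo);
                false
            } else {
                true
            }
        });
    }
    counts
}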

FM-index results

plot-fm.svg

Conclusion: Thread, Batch, and Prefetch

PS: Sassy1 grep-like approximate search

sassy grep -p <pattern> -k 3 <human-genome>.fa

sassy-grep.gif

PS: Sassy2 for batch searching

sassy2-results.svg