Binary Rank: Problem statement

  • Input: a many-GB text \(T = t_0\dots t_{n-1}\) of \(n\) bits.
  • Queries: given \(q\), find

\begin{equation*} \newcommand{\rank}{\mathsf{rank}} \rank(q) := \sum_{0\leq i< q} t_i. \end{equation*}

  • \(T = \underline{\texttt{1001001}}\texttt{110010100}\)
    • \(\rank(0) = 0\)
    • \(\rank(7) = 3\)
    • \(\rank(16) = 7\)
  • Why? Occurrences table in FM-index.
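
The example values above can be checked directly from the definition with a throwaway scan over the prefix (a sketch, not the data structure discussed below):

```rust
/// rank(q) = number of '1' characters among the first q bits of t.
fn rank_naive(t: &str, q: usize) -> usize {
    t[..q].bytes().filter(|&b| b == b'1').count()
}
```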

History

Naive solutions

  • Linear scan:
    • \(O(n/w)\) time using \(w\) bit popcount, no space overhead.
  • Precompute all 64-bit answers \(r_i := \rank(i)\):
    • \(O(1)\) time, \(64\times\) overhead.
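
The linear scan can be sketched as follows, assuming the text is packed LSB-first into `u64` words, with one hardware popcount per word (names are mine):

```rust
/// rank(q) over a bit-packed text: bit i lives at bit position i % 64
/// of words[i / 64]. One popcount per full word, plus a masked tail.
fn rank_scan(words: &[u64], q: usize) -> usize {
    let full = q / 64;
    let mut r: usize = words[..full].iter().map(|w| w.count_ones() as usize).sum();
    let rem = q % 64;
    if rem > 0 {
        // Keep only the bits strictly below position q in the last word.
        r += (words[full] & ((1u64 << rem) - 1)).count_ones() as usize;
    }
    r
}
```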

Middle-ground: block-based offsets

  • Use \(B=512\) bit blocks, and store all \(b_j := \rank(j\cdot B)\).
  • Query \(q = j\cdot B + q'\): \(\rank(q) = b_j + \sum_{jB\leq i < jB+q'} t_i\).
  • \(O(B/w) = O(512/64) = O(1)\) time.
  • \(64/512 = 12.5\%\) space overhead.
  • 2 cache misses per query:
    • one in the \(n/8\)-bit offset array, for \(b_j\)
    • one in the \(n\)-bit input, for the bits of block \(j\)
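
A minimal sketch of this blocked layout (the struct and names are mine, not from any particular implementation):

```rust
const B: usize = 512; // bits per block = 8 u64 words

struct BlockRank {
    words: Vec<u64>,       // the n-bit input
    block_ranks: Vec<u64>, // b_j = rank(j * B)
}

impl BlockRank {
    fn new(words: Vec<u64>) -> Self {
        let mut block_ranks = Vec::new();
        let mut acc = 0u64;
        for (i, w) in words.iter().enumerate() {
            if i % (B / 64) == 0 {
                block_ranks.push(acc); // offset at each block start
            }
            acc += w.count_ones() as u64;
        }
        Self { words, block_ranks }
    }

    /// rank(q) = b_j + popcount over at most B = 512 bits inside block j.
    fn rank(&self, q: usize) -> u64 {
        let j = q / B;
        let mut r = self.block_ranks[j];
        for i in (j * (B / 64))..(q / 64) {
            r += self.words[i].count_ones() as u64;
        }
        let rem = q % 64;
        if rem > 0 {
            r += (self.words[q / 64] & ((1u64 << rem) - 1)).count_ones() as u64;
        }
        r
    }
}
```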

Reducing overhead: PastaWide [and others]

  • 2 levels:
    • L2 with 16-bit delta every block: 3.125% overhead
    • L1 with 64-bit offset every 128 blocks: 0.1% overhead
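
A sketch of this two-level layout under those parameters (field names are mine). A `u16` delta suffices because at most \(127\cdot 512 = 65024\) ones precede the last block of a superblock:

```rust
const BLOCK: usize = 512;          // bits per L2 block
const SUPER: usize = 128 * BLOCK;  // bits per L1 superblock = 65536

struct TwoLevel {
    words: Vec<u64>,
    l1: Vec<u64>, // absolute rank at each superblock start
    l2: Vec<u16>, // delta from superblock start to each block start
}

impl TwoLevel {
    fn new(words: Vec<u64>) -> Self {
        let (mut l1, mut l2) = (Vec::new(), Vec::new());
        let mut acc = 0u64;
        for (i, w) in words.iter().enumerate() {
            let bit = i * 64;
            if bit % SUPER == 0 {
                l1.push(acc);
            }
            if bit % BLOCK == 0 {
                // Delta since the enclosing superblock start; fits in u16.
                l2.push((acc - l1.last().unwrap()) as u16);
            }
            acc += w.count_ones() as u64;
        }
        Self { words, l1, l2 }
    }

    fn rank(&self, q: usize) -> u64 {
        let mut r = self.l1[q / SUPER] + self.l2[q / BLOCK] as u64;
        for i in (q / BLOCK * (BLOCK / 64))..(q / 64) {
            r += self.words[i].count_ones() as u64;
        }
        let rem = q % 64;
        if rem > 0 {
            r += (self.words[q / 64] & ((1u64 << rem) - 1)).count_ones() as u64;
        }
        r
    }
}
```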

Reducing cache misses: SPIDER

  • Inline the 16-bit L2 deltas into the cache lines of the bit array itself.
    • The remaining L1 array (0.1% overhead) fits in cache.

Reducing the popcount: Pairing

  • Store the delta to the middle of each block instead of to its start.
    • Then only 256 bits need counting, in either the first or second half of the block.
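
One way to sketch the halved popcount (a scalar stand-in, with the stored delta now pointing at the block's midpoint):

```rust
const B: usize = 512; // bits per block; the stored delta is rank at its midpoint

/// mid_rank = rank(block_start + B/2). Whichever half q falls in,
/// at most B/2 = 256 bits are counted: backward from the midpoint
/// for the first half, forward from it for the second.
fn rank_in_block(words: &[u64], block_start: usize, mid_rank: u64, q: usize) -> u64 {
    let mid = block_start + B / 2;
    if q < mid {
        mid_rank - popcount_range(words, q, mid)
    } else {
        mid_rank + popcount_range(words, mid, q)
    }
}

/// Number of set bits in positions [lo, hi). Bit-by-bit for clarity;
/// a real implementation would popcount word-wise.
fn popcount_range(words: &[u64], lo: usize, hi: usize) -> u64 {
    (lo..hi).map(|i| (words[i / 64] >> (i % 64)) & 1).sum()
}
```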

All together now: BiRank

  • Inline 16-bit deltas
  • Pairing
  • 32-bit reduced L1 offsets: 0.05% overhead
    • The low 11 bits of each offset are stored in the L2 deltas.
    • Supports inputs up to \(2^{43}\) bits.
    • 16 MiB cache supports 32 GiB input

What is fast?

Latency: 80 ns/q

// Each query depends on the previous answer, so the loop is fully
// serialized and measures per-query latency.
let mut seed = 0;
for q in queries {
    seed ^= ranker.rank(q ^ seed);
}

For loop: 16-30 ns/q

// Independent queries: the CPU can overlap cache misses across iterations.
for q in queries {
    ranker.rank(q);
}

Prefetching/batching: 8-23 ns/q

for i in 0..queries.len() {
    // Prefetch ahead; guard the tail to avoid indexing past the end.
    if i + 32 < queries.len() {
        ranker.prefetch(queries[i + 32]);
    }
    ranker.rank(queries[i]);
}

BiRank

Memory-bandwidth efficient = Fast

QuadRank

  • Support \(\sigma=4\) (ACGT) alphabet, and \(\rank(q, c)\) for arbitrary \(c\).

\[\rank_4(q) := [\rank(q, \texttt A),\rank(q, \texttt C),\rank(q, \texttt G),\rank(q, \texttt T)].\]

  • Input is 2 bits per symbol.
  • Like BiRank, but replicate all metadata \(4\times\).
  • Store input in transposed layout: low bits separate from high bits.
  • Efficient SIMD implementation of 4-parallel popcounting.
  • 16 MiB cache supports 14 GiB = 61 Gbp input
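
With the transposed layout, one `u64` of low bits plus one `u64` of high bits covers 64 symbols, and all four counts fall out of four popcounts. A sketch (the A=00, C=01, G=10, T=11 encoding is my assumption):

```rust
/// Count each of the four 2-bit symbols among 64 symbols whose low bits
/// are in `lo` and high bits in `hi` (transposed layout).
/// Assumed encoding: A=00, C=01, G=10, T=11.
fn count4(lo: u64, hi: u64) -> [u32; 4] {
    [
        (!hi & !lo).count_ones(), // A: high 0, low 0
        (!hi & lo).count_ones(),  // C: high 0, low 1
        (hi & !lo).count_ones(),  // G: high 1, low 0
        (hi & lo).count_ones(),   // T: high 1, low 1
    ]
}
```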

QuadRank

Toy Batching FM-index

  • count-only; no locate.
  • 8-character prefix-lookup.
  • Map 32 reads in parallel.
  • Do 1 character of each read before moving to the next.
  • Prefetch memory for the next character.
  • Keep a list of active reads, and only iterate over those.
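
The batching control flow above can be sketched generically, with the FM-index itself stubbed out (`step` stands in for extending one read by one character, returning `false` once that read is finished):

```rust
/// Round-robin over a batch of reads: one step of each active read per
/// round, dropping finished reads from the active list. In the real
/// index, each step would also issue a prefetch for its next access.
fn run_batched<S, F: FnMut(&mut S) -> bool>(states: &mut Vec<S>, mut step: F) {
    let mut active: Vec<usize> = (0..states.len()).collect();
    while !active.is_empty() {
        active.retain(|&i| step(&mut states[i]));
    }
}
```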

FM-index results

Conclusion: Thread, Batch, and Prefetch

PS: Sassy1 grep-like approximate search

sassy grep -p <pattern> -k 3 <human-genome>.fa

PS: Sassy2 for batch searching