Binary Rank: Problem statement
- Input: a many-GB text \(T = t_0\dots t_{n-1}\) of \(n\) bits.
- Queries: given \(q\), find
\newcommand{\rank}{\mathsf{rank}}
\begin{equation*}
\rank(q) := \sum_{0\leq i< q} t_i.
\end{equation*}
- \(T = \underline{\texttt{1001001}}\texttt{110010100}\)
- \(\rank(0) = 0\)
- \(\rank(7) = 3\)
- \(\rank(16) = 7\)
- Why? The occurrence (Occ) table in the FM-index.
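The definition above, implemented directly (bit-by-bit, for illustration only) and checked against the example text \(T\):

```rust
// rank(q): number of 1-bits among the first q bits of t.
fn rank(t: &str, q: usize) -> usize {
    t.bytes().take(q).filter(|&b| b == b'1').count()
}

fn main() {
    let t = "1001001110010100";
    assert_eq!(rank(t, 0), 0);
    assert_eq!(rank(t, 7), 3);
    assert_eq!(rank(t, 16), 7);
}
```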
Naive solutions
- Linear scan:
- \(O(n/w)\) time using \(w\)-bit popcount; no space overhead.
- Precompute all 64-bit answers \(r_i := \rank(i)\):
- \(O(1)\) time, \(64\times\) overhead.
Middle-ground: block-based offsets
- Use \(B=512\) bit blocks, and store all \(b_j := \rank(j\cdot B)\).
- Query \(q = j\cdot B + q'\):
\(\rank(q) = b_j + \sum_{jB\leq i < jB+q'} t_i\).
- \(O(B/w) = O(512/64) = O(1)\) time.
- \(64/512 = 12.5\%\) space overhead.
- 2 cache misses:
- in \(n/8\) bit array: offset \(b_j\)
- in \(n\) bit array: block \(j\) bits
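The block-based scheme can be sketched as follows, with \(B=512\)-bit blocks stored as `u64` words and one precomputed offset \(b_j\) per block. All names here are illustrative, not any library's API:

```rust
const B: usize = 512; // block size in bits
const W: usize = 64;  // word size in bits

// offsets[j] = rank(j * B), one 64-bit offset per block.
fn build_offsets(words: &[u64]) -> Vec<u64> {
    let mut offsets = Vec::new();
    let mut acc = 0u64;
    for (i, w) in words.iter().enumerate() {
        if i % (B / W) == 0 {
            offsets.push(acc); // b_j
        }
        acc += w.count_ones() as u64;
    }
    offsets
}

fn rank(words: &[u64], offsets: &[u64], q: usize) -> u64 {
    let j = q / B;
    let mut r = offsets[j];
    // popcount at most B/W = 8 words of block j, up to bit q
    for i in j * (B / W)..q / W {
        r += words[i].count_ones() as u64;
    }
    let rem = q % W;
    if rem > 0 {
        r += (words[q / W] & ((1u64 << rem) - 1)).count_ones() as u64;
    }
    r
}

fn main() {
    let words = vec![u64::MAX; 16]; // 1024 one-bits: rank(q) = q
    let offsets = build_offsets(&words);
    assert_eq!(rank(&words, &offsets, 0), 0);
    assert_eq!(rank(&words, &offsets, 600), 600);
    assert_eq!(rank(&words, &offsets, 1023), 1023);
}
```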
Reducing overhead: PastaWide [and others]
- 2 levels:
- L2 with 16-bit delta every block: 3.125% overhead
- L1 with 64-bit offset every 128 blocks: 0.1% overhead
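A hedged sketch of the two-level layout in this style: a 64-bit L1 offset per 128 blocks plus a 16-bit L2 delta per 512-bit block (within a superblock a delta is at most \(127\cdot 512 = 65024\), so it fits in 16 bits). Names and structure are assumptions for illustration:

```rust
const B: usize = 512;  // L2 block size in bits
const W: usize = 64;   // word size in bits
const SB: usize = 128; // blocks per L1 superblock

fn build(words: &[u64]) -> (Vec<u64>, Vec<u16>) {
    let (mut l1, mut l2) = (Vec::new(), Vec::new());
    let (mut acc, mut base) = (0u64, 0u64);
    for (i, w) in words.iter().enumerate() {
        if i % (SB * B / W) == 0 {
            base = acc;
            l1.push(base); // 64-bit offset every 128 blocks
        }
        if i % (B / W) == 0 {
            l2.push((acc - base) as u16); // 16-bit delta every block
        }
        acc += w.count_ones() as u64;
    }
    (l1, l2)
}

fn rank(words: &[u64], l1: &[u64], l2: &[u16], q: usize) -> u64 {
    let j = q / B; // block index
    let mut r = l1[j / SB] + l2[j] as u64;
    for i in j * (B / W)..q / W {
        r += words[i].count_ones() as u64;
    }
    let rem = q % W;
    if rem > 0 {
        r += (words[q / W] & ((1u64 << rem) - 1)).count_ones() as u64;
    }
    r
}

fn main() {
    let words = vec![u64::MAX; 2048]; // 131072 one-bits, two superblocks
    let (l1, l2) = build(&words);
    assert_eq!(l1.len(), 2);
    assert_eq!(rank(&words, &l1, &l2, 70_000), 70_000);
    assert_eq!(rank(&words, &l1, &l2, 131_071), 131_071);
}
```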
Reducing cache misses: SPIDER
- Inline the 16-bit L2 deltas into each cache line of the input.
- The remaining L1 array (0.1% overhead) fits in cache.
Reducing the popcount: Pairing
- Store the delta relative to the middle of a block instead of its start.
- A query then popcounts at most 256 bits: the first or the second half of the block.
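The pairing idea can be sketched like this: `mid[j]` holds the rank at the middle of block \(j\), and a query counts forward or backward from there. The bit-by-bit `count_range` is for clarity only; a real implementation popcounts whole words. Names are illustrative:

```rust
const B: usize = 512; // block size in bits
const W: usize = 64;  // word size in bits

// mid[j] = rank(j*B + B/2): the count up to the *middle* of block j.
fn build_mid(words: &[u64]) -> Vec<u64> {
    let mut mid = Vec::new();
    let mut acc = 0u64;
    for (i, w) in words.iter().enumerate() {
        if i % (B / W) == (B / W) / 2 {
            mid.push(acc); // we are at position j*B + B/2
        }
        acc += w.count_ones() as u64;
    }
    mid
}

// Number of 1-bits in [lo, hi), counted bit by bit for clarity.
fn count_range(words: &[u64], lo: usize, hi: usize) -> u64 {
    (lo..hi).map(|i| (words[i / W] >> (i % W)) & 1).sum()
}

fn rank_paired(words: &[u64], mid: &[u64], q: usize) -> u64 {
    let j = q / B;
    let m = j * B + B / 2; // middle of block j
    if q >= m {
        mid[j] + count_range(words, m, q) // forward over <= 256 bits
    } else {
        mid[j] - count_range(words, q, m) // backward over <= 256 bits
    }
}

fn main() {
    let words = vec![u64::MAX; 16]; // all ones: rank(q) = q
    let mid = build_mid(&words);
    assert_eq!(mid, vec![256, 768]);
    assert_eq!(rank_paired(&words, &mid, 100), 100);
    assert_eq!(rank_paired(&words, &mid, 700), 700);
}
```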
All together now: BiRank
- Inline 16-bit deltas
- Pairing
- 32-bit reduced L1 offsets: 0.05% overhead
- The low 11 bits of each offset are stored in the deltas.
- Supports inputs up to \(2^{32+11} = 2^{43}\) bits.
- 16 MiB cache supports 32 GiB input
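A worked check of the bit budget above (illustrative arithmetic, not BiRank code): 32-bit reduced offsets whose low 11 bits live in the deltas address \(32+11=43\) bits, and 32 bits per \(128\cdot 512\)-bit superblock is about 0.05% overhead, so a 16 MiB L1 array covers 32 GiB of input:

```rust
// Bits of L1 metadata per bit of input: 32 bits per 128*512-bit superblock.
fn l1_overhead() -> f64 {
    32.0 / (128.0 * 512.0)
}

// Input size (GiB) whose L1 array fits in a cache of the given size (MiB).
fn supported_input_gib(cache_mib: f64) -> f64 {
    cache_mib / 1024.0 / l1_overhead()
}

fn main() {
    assert_eq!(32 + 11, 43); // reduced offsets address up to 2^43 bits
    assert!((l1_overhead() - 0.000488).abs() < 1e-5); // ~0.05% overhead
    assert_eq!(supported_input_gib(16.0), 32.0); // 16 MiB cache -> 32 GiB input
}
```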
What is fast?
Latency: 80 ns/q
let mut seed = 0;
for q in queries {
    seed ^= ranker.rank(q ^ seed);
}
For loop: 16-30 ns/q
for q in queries {
    ranker.rank(q);
}
Prefetching/batching: 8-23 ns/q
for i in 0..queries.len() {
    if i + 32 < queries.len() {
        ranker.prefetch(queries[i + 32]);
    }
    ranker.rank(queries[i]);
}
Memory-bandwidth efficient = Fast
QuadRank
- Support the \(\sigma=4\) (ACGT) alphabet, and \(\rank(q, c)\) for arbitrary \(c\).
\[\rank_4(q) := [\rank(q, \texttt A),\rank(q, \texttt C),\rank(q, \texttt G),\rank(q, \texttt T)].\]
- Input is 2 bits per symbol.
- Like BiRank, but replicate all metadata \(4\times\).
- Store input in transposed layout: low bits separate from high bits.
- Efficient SIMD implementation of 4-parallel popcounting.
- 16 MiB cache supports 14 GiB = 61 Gbp input
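The transposed layout makes 4-parallel counting cheap: one word of low bits and one word of high bits encode 64 two-bit symbols, and each symbol's occurrence mask falls out of two XNORs. A scalar sketch (the encoding A=00, C=01, G=10, T=11 and all names are assumptions; the real version does this in SIMD registers):

```rust
// Count all four symbols in one 64-symbol chunk from its two bit planes.
fn count4(lo: u64, hi: u64) -> [u32; 4] {
    let mut counts = [0u32; 4];
    for c in 0..4usize {
        // broadcast symbol c's two bits across a full word
        let lo_mask = if c & 1 == 1 { u64::MAX } else { 0 };
        let hi_mask = if c & 2 == 2 { u64::MAX } else { 0 };
        // a position holds symbol c iff both planes match c's bits there
        let eq = !(lo ^ lo_mask) & !(hi ^ hi_mask);
        counts[c] = eq.count_ones();
    }
    counts
}

fn main() {
    // both planes zero: 64 copies of A
    assert_eq!(count4(0, 0), [64, 0, 0, 0]);
    // low plane all ones, high plane zero: 64 copies of C
    assert_eq!(count4(u64::MAX, 0), [0, 64, 0, 0]);
    // position 0 = G (hi=1, lo=0), position 1 = C (hi=0, lo=1), rest A
    assert_eq!(count4(0b10, 0b01), [62, 1, 1, 0]);
}
```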
Toy Batching FM-index
count-only; no locate.
- 8-character prefix-lookup.
- Map 32 reads in parallel.
- Do 1 character of each read before moving to the next.
- Prefetch memory for the next character.
- Keep a list of active reads, and only iterate over those.
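The batching loop above can be sketched as follows, with the FM-index step and the prefetch reduced to comments: 32 reads are processed round-robin, one character each per round, and finished reads are dropped from the active list. All names are illustrative:

```rust
// Round-robin over a batch of reads; returns how many characters of each
// read were processed (a stand-in for the real backward-search state).
fn process_round_robin(reads: &[Vec<u8>]) -> Vec<usize> {
    let mut pos = vec![0usize; reads.len()]; // next character of each read
    let mut active: Vec<usize> = (0..reads.len()).collect();
    while !active.is_empty() {
        // one character of each active read per round; a real mapper would
        // prefetch the rank-structure memory for the *next* character here,
        // then do one FM-index backward-extension step for this one.
        active.retain(|&r| {
            pos[r] += 1;
            pos[r] < reads[r].len() // drop finished reads from the active list
        });
    }
    pos
}

fn main() {
    // a batch of 32 reads of varying length
    let reads: Vec<Vec<u8>> = (0..32).map(|i| vec![b'A'; 4 + i % 5]).collect();
    let pos = process_round_robin(&reads);
    // every read was fully consumed
    assert!(pos.iter().zip(&reads).all(|(&p, r)| p == r.len()));
}
```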
Conclusion: Thread, Batch, and Prefetch
PS: Sassy1 grep-like approximate search
sassy grep -p <pattern> -k 3 <human-genome>.fa
PS: Sassy2 for batch searching