Binary Rank: Problem statement

  • Input: a many-GB text \(T = t_0\dots t_{n-1}\) of \(n\) bits.
  • Queries: given \(q\), find

\begin{equation*} \newcommand{\rank}{\mathsf{rank}} \rank(q) := \sum_{0\leq i< q} t_i. \end{equation*}

  • \(T = \underline{\texttt{1001001}}\texttt{110010100}\)
    • \(\rank(0) = 0\)
    • \(\rank(7) = 3\)
    • \(\rank(16) = 7\)
  • Why? Occurrences table in FM-index.
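
The example values above can be checked directly from the definition with a throwaway scan over the prefix (a sketch, not the data structure discussed below):

```rust
/// rank(q) = number of '1' characters among the first q bits of t.
fn rank_naive(t: &str, q: usize) -> usize {
    t[..q].bytes().filter(|&b| b == b'1').count()
}
```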

History

Naive solutions

  • Linear scan:
    • \(O(n/w)\) time using \(w\) bit popcount, no space overhead.
  • Precompute all 64-bit answers \(r_i := \rank(i)\):
    • \(O(1)\) time, \(64\times\) overhead.
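
The linear scan can be sketched as follows, assuming the text is packed LSB-first into `u64` words, with one hardware popcount per word (names are mine):

```rust
/// rank(q) over a bit-packed text: bit i lives at bit position i % 64
/// of words[i / 64]. One popcount per full word, plus a masked tail.
fn rank_scan(words: &[u64], q: usize) -> usize {
    let full = q / 64;
    let mut r: usize = words[..full].iter().map(|w| w.count_ones() as usize).sum();
    let rem = q % 64;
    if rem > 0 {
        // Keep only the bits strictly below position q in the last word.
        r += (words[full] & ((1u64 << rem) - 1)).count_ones() as usize;
    }
    r
}
```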

Middle-ground: block-based offsets

  • Use \(B=512\) bit blocks, and store all \(b_j := \rank(j\cdot B)\).
  • Query \(q = j\cdot B + q'\): \(\rank(q) = b_j + \sum_{jB\leq i < jB+q'} t_i\).
  • \(O(B/w) = O(512/64) = O(1)\) time.
  • \(64/512 = 12.5\%\) space overhead.
  • 2 cache misses per query:
    • one in the \(n/8\)-bit offset array, for \(b_j\)
    • one in the \(n\)-bit input, for the bits of block \(j\)
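
A minimal sketch of this blocked layout (the struct and names are mine, not from any particular implementation):

```rust
const B: usize = 512; // bits per block = 8 u64 words

struct BlockRank {
    words: Vec<u64>,       // the n-bit input
    block_ranks: Vec<u64>, // b_j = rank(j * B)
}

impl BlockRank {
    fn new(words: Vec<u64>) -> Self {
        let mut block_ranks = Vec::new();
        let mut acc = 0u64;
        for (i, w) in words.iter().enumerate() {
            if i % (B / 64) == 0 {
                block_ranks.push(acc); // offset at each block start
            }
            acc += w.count_ones() as u64;
        }
        Self { words, block_ranks }
    }

    /// rank(q) = b_j + popcount over at most B = 512 bits inside block j.
    fn rank(&self, q: usize) -> u64 {
        let j = q / B;
        let mut r = self.block_ranks[j];
        for i in (j * (B / 64))..(q / 64) {
            r += self.words[i].count_ones() as u64;
        }
        let rem = q % 64;
        if rem > 0 {
            r += (self.words[q / 64] & ((1u64 << rem) - 1)).count_ones() as u64;
        }
        r
    }
}
```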

Reducing overhead: PastaWide [and others]

  • 2 levels:
    • L2 with 16-bit delta every block: 3.125% overhead
    • L1 with 64-bit offset every 128 blocks: 0.1% overhead
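
A sketch of this two-level layout under those parameters (field names are mine). A `u16` delta suffices because at most \(127\cdot 512 = 65024\) ones precede the last block of a superblock:

```rust
const BLOCK: usize = 512;          // bits per L2 block
const SUPER: usize = 128 * BLOCK;  // bits per L1 superblock = 65536

struct TwoLevel {
    words: Vec<u64>,
    l1: Vec<u64>, // absolute rank at each superblock start
    l2: Vec<u16>, // delta from superblock start to each block start
}

impl TwoLevel {
    fn new(words: Vec<u64>) -> Self {
        let (mut l1, mut l2) = (Vec::new(), Vec::new());
        let mut acc = 0u64;
        for (i, w) in words.iter().enumerate() {
            let bit = i * 64;
            if bit % SUPER == 0 {
                l1.push(acc);
            }
            if bit % BLOCK == 0 {
                // Delta since the enclosing superblock start; fits in u16.
                l2.push((acc - l1.last().unwrap()) as u16);
            }
            acc += w.count_ones() as u64;
        }
        Self { words, l1, l2 }
    }

    fn rank(&self, q: usize) -> u64 {
        let mut r = self.l1[q / SUPER] + self.l2[q / BLOCK] as u64;
        for i in (q / BLOCK * (BLOCK / 64))..(q / 64) {
            r += self.words[i].count_ones() as u64;
        }
        let rem = q % 64;
        if rem > 0 {
            r += (self.words[q / 64] & ((1u64 << rem) - 1)).count_ones() as u64;
        }
        r
    }
}
```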

Reducing cache misses: SPIDER

  • Inline the 16-bit L2 deltas into the cache lines of the bit array itself.
    • The remaining L1 array (0.1% overhead) fits in cache.

Reducing the popcount: Pairing

  • Store the delta to the middle of each block instead of to its start.
    • Then only 256 bits need counting, in either the first or second half of the block.
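
One way to sketch the halved popcount (a scalar stand-in, with the stored delta now pointing at the block's midpoint):

```rust
const B: usize = 512; // bits per block; the stored delta is rank at its midpoint

/// mid_rank = rank(block_start + B/2). Whichever half q falls in,
/// at most B/2 = 256 bits are counted: backward from the midpoint
/// for the first half, forward from it for the second.
fn rank_in_block(words: &[u64], block_start: usize, mid_rank: u64, q: usize) -> u64 {
    let mid = block_start + B / 2;
    if q < mid {
        mid_rank - popcount_range(words, q, mid)
    } else {
        mid_rank + popcount_range(words, mid, q)
    }
}

/// Number of set bits in positions [lo, hi). Bit-by-bit for clarity;
/// a real implementation would popcount word-wise.
fn popcount_range(words: &[u64], lo: usize, hi: usize) -> u64 {
    (lo..hi).map(|i| (words[i / 64] >> (i % 64)) & 1).sum()
}
```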

All together now: BiRank

  • Inline 16-bit deltas
  • Pairing
  • 32-bit reduced L1 offsets: 0.05% overhead
    • The low 11 bits of each offset are stored in the L2 deltas.
    • Supports inputs up to \(2^{43}\) bits.
    • 16 MiB cache supports 32 GiB input

What is fast?

Latency: 80 ns/q

// Each query depends on the previous answer, so the loop is fully
// serialized and measures per-query latency.
let mut seed = 0;
for q in queries {
    seed ^= ranker.rank(q ^ seed);
}

For loop: 16-30 ns/q

// Independent queries: the CPU can overlap cache misses across iterations.
for q in queries {
    ranker.rank(q);
}

Prefetching/batching: 8-23 ns/q

for i in 0..queries.len() {
    // Prefetch ahead; guard the tail to avoid indexing past the end.
    if i + 32 < queries.len() {
        ranker.prefetch(queries[i + 32]);
    }
    ranker.rank(queries[i]);
}

BiRank

Memory-bandwidth efficient = Fast

QuadRank

  • Support \(\sigma=4\) (ACGT) alphabet, and \(\rank(q, c)\) for arbitrary \(c\).

\[\rank_4(q) := [\rank(q, \texttt A),\rank(q, \texttt C),\rank(q, \texttt G),\rank(q, \texttt T)].\]

  • Input is 2 bits per symbol.
  • Like BiRank, but replicate all metadata \(4\times\).
  • Store input in transposed layout: low bits separate from high bits.
  • Efficient SIMD implementation of 4-parallel popcounting.
  • 16 MiB cache supports 14 GiB = 61 Gbp input
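
With the transposed layout, one `u64` of low bits plus one `u64` of high bits covers 64 symbols, and all four counts fall out of four popcounts. A sketch (the A=00, C=01, G=10, T=11 encoding is my assumption):

```rust
/// Count each of the four 2-bit symbols among 64 symbols whose low bits
/// are in `lo` and high bits in `hi` (transposed layout).
/// Assumed encoding: A=00, C=01, G=10, T=11.
fn count4(lo: u64, hi: u64) -> [u32; 4] {
    [
        (!hi & !lo).count_ones(), // A: high 0, low 0
        (!hi & lo).count_ones(),  // C: high 0, low 1
        (hi & !lo).count_ones(),  // G: high 1, low 0
        (hi & lo).count_ones(),   // T: high 1, low 1
    ]
}
```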

QuadRank

Toy Batching FM-index

  • count-only; no locate.
  • 8-character prefix-lookup.
  • Map 32 reads in parallel.
  • Do 1 character of each read before moving to the next.
  • Prefetch memory for the next character.
  • Keep a list of active reads, and only iterate over those.
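
The batching control flow above can be sketched generically, with the FM-index itself stubbed out (`step` stands in for extending one read by one character, returning `false` once that read is finished):

```rust
/// Round-robin over a batch of reads: one step of each active read per
/// round, dropping finished reads from the active list. In the real
/// index, each step would also issue a prefetch for its next access.
fn run_batched<S, F: FnMut(&mut S) -> bool>(states: &mut Vec<S>, mut step: F) {
    let mut active: Vec<usize> = (0..states.len()).collect();
    while !active.is_empty() {
        active.retain(|&i| step(&mut states[i]));
    }
}
```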

FM-index results

Conclusion: Thread, Batch, and Prefetch

PS: Sassy1 grep-like approximate search

sassy grep -p <pattern> -k 3 <human-genome>.fa

PS: Sassy2 for batch searching