\[\newcommand{\rank}{\mathsf{rank}}\] \[\newcommand{\rankall}{\mathsf{rankall}}\]
Abstract
Motivation. Given a text, a rank query \(\rank(p, c)\) counts the number of occurrences of character \(c\) among the first \(p\) characters of the text. Space-efficient methods to answer rank queries form an important building block in many succinct data structures. For example, the FM-index (Ferragina and Manzini 2000) is a widely used data structure that uses rank queries to locate all occurrences of a pattern in a text.
In bioinformatics applications, the goal is usually to process a given input as fast as possible. Thus, data structures should have high throughput when used with many threads.
Contributions.
For the binary alphabet (\(\sigma=2\)), we develop BiRank. It merges the central ideas of two recent papers: (1) we interleave (inline) offsets into each cache line of the underlying bit vector (Laws et al. 2024), reducing cache misses, and (2) these offsets count to the middle of each block, so that at most half of it needs popcounting (Gottlieb and Reinert 2025).
With QuadRank, we extend these techniques to the size-4 (DNA) alphabet.
Both data structures are optimized for high throughput, answering many queries as fast as possible, by adding prefetch instructions to start loading the required cache lines ahead of time.
Results.
BiRank has a space overhead of \(3.125\%\) and QuadRank has a space overhead
of \(14\%\). They are around \(1.5\times\) faster than methods that do not use
inlining. Prefetching gives another \(2\times\) speedup, at which point the RAM
bandwidth becomes a hard limit on the total throughput.
When using QuadRank in a toy count-only FM-index, this results in up to a \(4\times\) speedup over Genedex, a state-of-the-art batching FM-index implementation.
1 Introduction
Given a fixed text \(T=t_0\dots t_{n-1}\) of length \(n\) over an alphabet \(\Sigma\) of size \(\sigma\), a rank query \(\rank_T(p, c)\) counts the number of occurrences of character \(c\in \Sigma\) in the first \(p\) (\(0\leq p\leq n\)) characters of the text1: \[ \rank_T(p, c) := \sum_{i\in \{0, \dots, p-1\}} [T_i = c]. \] In most literature, the binary alphabet of size \(\sigma=2\) is used, in which case the text is simply a string of \(n\) bits. In this case, we also write \(\rank_T(p) := \rank_T(p, 1)\) to count the number of \(1\) bits.
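For reference, this definition corresponds to the following naive scan (the data structures in this post answer the same query in constant time):

```rust
/// Reference implementation of rank by a naive linear scan:
/// the number of occurrences of `c` among the first `p` characters of `text`.
fn rank_naive(text: &[u8], p: usize, c: u8) -> usize {
    text[..p].iter().filter(|&&x| x == c).count()
}
```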
Of interest are space-efficient data structures that can answer these queries quickly. Indeed, there exist succinct data structures (Jacobson 1988) that use \(n + o(n)\) bits of space to answer queries on a binary text in \(O(1)\) time in the RAM model with word size \(w=\Theta(\lg n)\). In this model, a tight lower bound on the space usage is \(n + \Omega(n \log\log n / \log n)\) bits (Miltersen 2005; Golynski 2006) when the bitvector itself is explicitly stored.
A fast and widely used implementation is Rank9 (Vigna 2008), which has a fixed \(25\%\) space overhead. Many subsequent works have reduced the space overhead, usually at the cost of slightly slower queries. For example, poppy (Zhou, Andersen, and Kaminsky 2013) is quite small with only 3.125%2 overhead. In practice, nearly all fast implementations have some small constant overhead, making them compact (\(n+O(n)\) bits) but not succinct (\(n+o(n)\) bits).
Applications. A primary application of rank queries is the FM-index (Ferragina and Manzini 2000), a succinct data structure that can efficiently locate all occurrences of a pattern in a text and is used in tools such as BWA-MEM (Li 2013) and Bowtie (Langmead et al. 2009; Langmead and Salzberg 2012). Whereas most of the literature on rank structures assumes a binary alphabet (\(\sigma=2\)), here the DNA alphabet has size \(\sigma=4\). Indeed, BWA-MEM implements its own rank structure over a 2-bit alphabet, and this paper started as an attempt to speed this up.
Wavelet tree. For alphabets of arbitrary size, wavelet trees (Grossi, Gupta, and Vitter 2003) can be used instead, which need \(\log_2 \sigma\) queries to a binary rank structure. Recently, quad wavelet trees (Ceregini, Kurpicz, and Venturini 2024) have been introduced. These use rank over a \(\sigma=4\) quad vector as a building block, and thus need only \(\log_4 \sigma\) rank queries, leading to \(2\times\) to \(3\times\) speedups.
- cite wavelet matrix?
A brief history of rank structures. We briefly go over some previous papers in chronological order and list their main technical contributions. Both the poppy (Zhou, Andersen, and Kaminsky 2013) and pasta (Kurpicz 2022a) papers contain a nice overview as well. We note that many of these papers develop a rank structure in the context of the larger rank and select problem, where there are slightly different design trade-offs. Additionally, work on compressed bitmaps is omitted here.
As a baseline, the classic succinct solution (Jacobson 1988) stores the bitvector, and then two levels of blocks alongside this. The bitvector is split into blocks of length \(\log(n)/2\) bits, and \(\log n\) blocks together form a superblock. The first level L1 of the tree then contains a \(\log n\)-bit offset for each superblock, counting the number of set bits preceding it. The second level L2 stores for each block a \(\log\log n\)-bit delta counting the number of one bits preceding it inside its superblock.
González et al. (2005) are the first to introduce a practical method for rank, after observing that the classic method above has 66.85% overhead in practice for \(n=2^{30}\). They replace the \(\sqrt{n}\)-size lookup table for popcounts by a sum of precomputed per-byte lookups. (Meanwhile, CPUs natively support 64-bit popcount instructions.) Further, they suggest using a single-level tree storing an offset after every \(32\cdot k\) bits. A linear scan popcounting up to \(k\) words takes more compute, but has the benefit of cache locality and only requires 2 instead of 3 memory accesses.
Rank9 (Vigna 2008) has 25% overhead and interleaves the two levels L1 and L2 of the classic tree. It is designed specifically with 512-bit cache lines in mind: each block is 64 bits, and 8 blocks form a basic block. For each basic block, the interleaved tree stores a 64-bit integer with the offset of the basic block, and seven 9-bit deltas (the reason for the name) in an additional 64-bit word. This needs two cache misses per query, and is very fast in practice.
Navarro and Providel (2012) develop a data structure for rank and select, and use the extra information stored for the select queries to speed up the linear scan in the method of González et al. (2005).
Poppy (Zhou, Andersen, and Kaminsky 2013) is optimized for space and has only 3.125% overhead. It starts from the observation that performance is largely determined by the number of cache misses, and thus uses larger basic blocks of 512 bits. It then re-uses Rank9’s interleaved index with two modifications. First, each 64-bit superblock entry covers four 512-bit basic blocks, storing one 32-bit offset (L1) and three 10-bit counts (L2), one for each of the last three blocks. Second, to support 64-bit outputs, it stores an additional zeroth layer (L0) of the tree with a 64-bit offset after every \(2^{32}\) input bits.
Pasta-flat (Kurpicz 2022a, 2022b) has the same space overhead as Poppy, but improves rank query time by 8%. Specifically, it avoids Poppy’s need to take a prefix sum over L2 counts: it doubles each superblock entry to 128 bits, covering 8 blocks, and stores a 44-bit offset (L1) followed by seven 12-bit deltas (L2) from the start of the superblock to each block. A second structure, pasta-wide, uses 16-bit values for L2, which allows faster select queries using SIMD instructions. Each superblock covers 128 blocks and stores a 64-bit L1 value, this time not interleaved with the L2 values, and the L0 level is dropped.
SPIDER (Laws et al. 2024) (3.3% overhead) reduces the number of cache misses from 2 to (nearly) 1 by interleaving the counts with the bitvector itself, instead of interleaving L1 and L2: each cache line stores a 16-bit L2 delta and 496 input bits. L1 superblocks store a 64-bit offset for each 128 blocks, taking only 0.1% extra space and thus likely staying in cache.
Paired-blocks (Gottlieb and Reinert 2025) stores each block’s offset relative to the middle of the block rather than its start, so that a query only needs to popcount at most half the block, counting either forwards or backwards from the middle.
- 3* (Kurpicz, Rigi-Luperti, and Sanders 2025) 2025
other:
- btree (Pibiri and Kanda 2021) 2021
implementation:
- SDSL (Gog et al. 2014) 2014
- sdsl-v is simply rank9; we have it from sux
- sdsl-v5 has 6% overhead, but is slower than poppy and pasta
- rank9 is fastest; pasta-wide is 2nd
DNA:
BWA (Li 2013) 2013
QWT/RSWide/RSNarrow (Ceregini, Kurpicz, and Venturini 2024) 2024
genedex
Memory bandwidth.
Contributions.
- Select-only
- Assume large n, around 2^30 or more, many GB of data; AVX2
- Merge existing techniques of interleaving, count-from-middle
- Apply these techniques to quad alphabet as well
- Prefetching / streaming, with application in batching FM index
- Optimized methods for returning all 4 ranks, for approximate matching in FM index
- Rust library
- up to \(2^{40}\)
2 Background
Rank API.
- Construct from a packed bitslice: `&[u64]`.
- Nice-to-have for modifying algorithms: construct in-place from a `Vec<u64>` without using more memory than the final size of the data structure.
- Rank queries are right-exclusive: `rank(i)` counts the number of 1 bits before position \(i\), or equivalently, the number of 1 bits in the first \(i\) bits of the string. (Right-inclusive queries would make it impossible to query the empty prefix.)
- We allow counts up to \(2^{40}\), i.e., data structures up to 128 GiB, or up to 256 GiB if at most half the bits are set.
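As a sketch, this API could be captured by the following hypothetical trait; the names are illustrative, not the crate's actual interface:

```rust
/// Hypothetical trait capturing the rank API described above.
trait BitRank: Sized {
    /// Construct from a packed bitslice.
    fn new(bits: &[u64]) -> Self;
    /// Nice-to-have: construct in-place from an owned buffer,
    /// without using more memory than the final data structure.
    fn new_in_place(bits: Vec<u64>) -> Self;
    /// Right-exclusive rank: the number of 1 bits in the first `i` bits.
    /// Positions up to 2^40 are supported.
    fn rank(&self, i: usize) -> u64;
}
```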
Levels.
- bitvec: the raw bits, split into basic blocks
- block counts: the number of 1 bits before each basic block
- superblocks: block counts have redundancy, so storing say 4 or 8 of them relative to a common superblock count can save space
Cache lines.
Interleaving.
count-from-middle.
Transposed layout.
3 BiRank
Notation:
\(N\): number of bits per basic block (cache line)
\(W\): width of the rank stored in each block.
We store the low bits of the rank of the block inside the block itself.
- Optimally use the cache line
3.1 Implementation Techniques:
Mask array: instead of shifting a `u64` left or right to remove some bits, precompute e.g. a 16 KiB `[[u64; 4]; 512]` array with a mask to apply for each input position (for the last two blocks), or a 4 KiB `[u128; 256]` for `pos % 256` in each half.

Add or subtract the popcount: with `p = (data & mask).count_ones()`, return `if pos < 256 { offset - p } else { offset + p }`, which compiles into a `cmov`. Both techniques are illustrated in the sketch below.
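A minimal self-contained sketch of both techniques for one 512-bit block whose stored count refers to its middle. It uses a single 256-entry mask table plus complementing, a slight variation of the table shapes mentioned above; illustrative, not the library's exact code:

```rust
// MASKS[p] has bits [p..256) of a 256-bit half set, as four u64 words.
const fn build_masks() -> [[u64; 4]; 256] {
    let mut m = [[0u64; 4]; 256];
    let mut p = 0;
    while p < 256 {
        let mut w = 0;
        while w < 4 {
            let lo = w * 64; // first bit index of word w
            m[p][w] = if p <= lo {
                !0u64
            } else if p < lo + 64 {
                !0u64 << (p - lo)
            } else {
                0
            };
            w += 1;
        }
        p += 1;
    }
    m
}
static MASKS: [[u64; 4]; 256] = build_masks();

/// Rank within one 512-bit block whose stored `offset` counts the ones
/// before bit 256 (the middle of the block).
fn rank_in_block(words: &[u64; 8], offset: u64, pos: usize) -> u64 {
    debug_assert!(pos < 512);
    let first_half = pos < 256;
    let half: &[u64] = if first_half { &words[0..4] } else { &words[4..8] };
    let mask = &MASKS[pos % 256];
    let mut p = 0u64;
    for w in 0..4 {
        // First half: count bits [pos..256). Second half: count bits
        // [256..pos), i.e. the complement of the mask within that half.
        let m = if first_half { mask[w] } else { !mask[w] };
        p += (half[w] & m).count_ones() as u64;
    }
    // Compiles to a cmov rather than a branch.
    if first_half { offset - p } else { offset + p }
}
```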
3.2 Superblocks:
- When storing \(W\) low bits, each superblock can only cover \(2^W/N\) blocks. For simplicity, we use as stride \(S\) the largest power of 2 below \(2^W/N\).
- The superblocks simply store a `u64` global offset.
- Better: store a `u32` of the value divided by 256; 40-bit values are sufficient in practice (see the sketch below).
  - If needed, a different shift can be used.
  - This is a tiny bit slower at times, but halving the pressure on caches is nice.
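For illustration, one way the truncated superblock value and the \(W\) low bits stored in a block can be combined back into a full rank (a sketch assuming \(W=16\) and a shift of 8; not necessarily the exact code):

```rust
const W: u32 = 16; // width of the in-block count

/// `superblock` stores the true offset divided by 256; `in_block` stores the
/// low W bits of the rank at the position the block entry refers to.
fn reconstruct(superblock: u32, in_block: u16) -> u64 {
    // `base` is at most 255 below the rank at the superblock start.
    let base = (superblock as u64) << 8;
    // Because the stride S keeps the true rank within [base, base + 2^W),
    // its low W bits (`in_block`) determine it uniquely given `base`.
    base + ((in_block as u64).wrapping_sub(base) & ((1u64 << W) - 1))
}
```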
3.3 Basic Blocks
The code can be found in the repo: https://github.com/RagnarGrootKoerkamp/quadrank
- BB16: `u16` offset to middle (layout sketched below)
- BB16x2: `u16` offsets to 1/4 and 3/4
- BB32: `u32` offset to middle
- BB32B: `u32` offset to the real middle
- BB23_9: 23-bit offset to 1/3
- BB32x2: 32-bit offsets to 1/4 and 3/4
- BB64x2:
One more variant that was tried but not used in the end:
- Store e.g. a 23-bit offset to bit 128 (1/4), and then a 9-bit delta from there to bit 384 (3/4), so that a 128-bit popcount suffices.
- If we store the rank offset in the middle of the block (and adjust `pos` as needed), 8 bits are actually sufficient.
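For concreteness, a hypothetical in-memory layout of the BB16 variant (the field order and exact packing are assumptions, not necessarily the crate's definition):

```rust
/// One cache line of the BB16 variant: a 16-bit count plus 496 payload
/// bits, giving 16/512 = 3.125% overhead. The count stores the low bits
/// of the rank at the middle of the payload (count-from-middle).
#[repr(C, align(64))]
struct BB16 {
    offset: u16,       // low 16 bits of the rank at the payload's middle
    payload: [u8; 62], // 496 bits of the bitvector
}
```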
4 QuadRank
Main idea:
- do the same as for BiRank, but store 4 offsets instead of 1.
- Transposed layout (illustrated in the sketch below)
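To illustrate the transposed layout: 64 two-bit characters are stored as two `u64` bit planes, one holding each character's low bit and one its high bit, so that counting any single character in 64 bp is one popcount. A scalar sketch (the real code works on full cache lines with SIMD):

```rust
/// Count occurrences of 2-bit character `c` (A=0, C=1, G=2, T=3) among
/// 64 transposed characters: `lo` holds each character's low bit,
/// `hi` its high bit.
fn count_char_transposed(lo: u64, hi: u64, c: u8) -> u32 {
    // Select the positions where both bit planes match the bits of c.
    let lo_match = if c & 1 != 0 { lo } else { !lo };
    let hi_match = if c & 2 != 0 { hi } else { !hi };
    (lo_match & hi_match).count_ones()
}
```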
4.1 Implementation notes
- AVX2: the `_mm_sign_epi32` instruction is used to conditionally negate the popcounts.
- We use the transposed layout. This makes it easier to directly popcount 64 bp at once.
- We use `u64x4` SIMD registers to do a popcount for each of ACGT in its own lane.
The lane-wise popcounts are implemented using Mula’s algorithm (Muła 2008; Muła, Kurz, and Lemire 2017): use the `_mm256_shuffle_epi8` instruction twice to do a table lookup for the popcount of the low and the high 4-bit nibble of each byte. Then add those two counts, and accumulate the 8 bytes of each `u64` using the `_mm256_sad_epu8` instruction.

Future work: AVX512 has a dedicated popcount instruction that could be used instead.
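In code, this looks roughly as follows (a sketch of the published algorithm, not necessarily our exact implementation):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Mula's popcount: returns, in each 64-bit lane, the number of set bits
/// of the corresponding lane of `v`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn popcount_per_u64_lane(v: __m256i) -> __m256i {
    // Popcount of every 4-bit nibble, repeated in both 128-bit halves
    // (since _mm256_shuffle_epi8 looks up within each half independently).
    let table = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
    );
    let low_mask = _mm256_set1_epi8(0x0f);
    let lo = _mm256_and_si256(v, low_mask);
    let hi = _mm256_and_si256(_mm256_srli_epi16::<4>(v), low_mask);
    // Two table lookups: popcounts of the low and high nibble of each byte.
    let counts8 = _mm256_add_epi8(
        _mm256_shuffle_epi8(table, lo),
        _mm256_shuffle_epi8(table, hi),
    );
    // Horizontally sum the 8 byte-counts of each u64 lane into that lane.
    _mm256_sad_epu8(counts8, _mm256_setzero_si256())
}
```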
4.2 Blocks
5 Application: Parallel FM index
- We develop a toy implementation of an FM-index for testing the benefit of our method in a setting where batching can be used.
- Count-only; no locate
- 8-character prefix lookup
- Sentinel character is handled by explicitly storing its position
- The implementation is heavily inspired by genedex.
- Setup: exact-map simulated 150 bp Illumina reads with 1% error rate to 500 Mbp of viral data, so mapping likely fails after roughly 50 characters.
- Approximate mapping using a search scheme is out of scope for now and remains future work.
- We do not use dedicated code like `count_range`; instead, we simply process the start and end of each range individually. This still benefits from the reused cache line whenever possible, while avoiding possibly expensive branch misses from checking whether the two positions are indeed in the same cache line.
API: given a batch of \(B\) queries (i.e., `&[Vec<u8>; B]`):
- Read the FM-index range for the 8-character prefix from the lookup table.
- Store a list of up to \(B\) indices of the still-active queries (those for which the current prefix has >0 matches). For each character, we only iterate over the queries pointed to by this list.
- For each character, iterate over the active queries twice:
  - In the first iteration, remove queries that are done from the active list (via swap-pop, letting the last element take their place), and prefetch the cache lines required to answer rank queries at the start and end of the interval.
  - In the second iteration, do the actual rank queries and update the start and end of each active interval. (A code sketch of this loop follows below.)
We briefly tried alternatives such as inserting new queries into the empty slots of the batch when a previous query has 0 matches, but this did not improve performance.
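A condensed sketch of this two-pass loop; `RankIndex`, `Query`, and the initialization from the prefix lookup are simplified, hypothetical stand-ins for the actual implementation:

```rust
/// Hypothetical stand-in for the rank structure (not the real API).
trait RankIndex {
    /// Start loading the cache line(s) needed for a rank query at `pos`.
    fn prefetch(&self, pos: usize);
    /// Rank of character `c` among the first `pos` characters of the BWT.
    fn rank(&self, pos: usize, c: u8) -> usize;
}

/// One in-flight query: pattern, number of processed characters, and the
/// current BWT interval [start, end).
struct Query<'a> {
    pattern: &'a [u8],
    processed: usize,
    start: usize,
    end: usize,
}

/// Two-pass batched counting. `active` holds indices of still-active
/// queries, initialized from the 8-character prefix lookup (omitted here).
fn count_batch(
    idx: &impl RankIndex,
    c_table: &[usize; 4], // number of characters smaller than c
    active: &mut Vec<usize>,
    queries: &mut [Query],
) {
    while !active.is_empty() {
        // Pass 1: retire finished queries via swap-pop, and prefetch the
        // cache lines for both rank queries of each remaining query.
        let mut i = 0;
        while i < active.len() {
            let q = &queries[active[i]];
            if q.processed == q.pattern.len() || q.start == q.end {
                active.swap_remove(i); // done, or no matches left
            } else {
                idx.prefetch(q.start);
                idx.prefetch(q.end);
                i += 1;
            }
        }
        // Pass 2: by now the prefetched lines have (hopefully) arrived;
        // do the rank queries and LF-map each interval.
        for &qi in active.iter() {
            let q = &mut queries[qi];
            q.processed += 1;
            let c = q.pattern[q.pattern.len() - q.processed]; // backward search
            q.start = c_table[c as usize] + idx.rank(q.start, c);
            q.end = c_table[c as usize] + idx.rank(q.end, c);
        }
    }
}
```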
6 Results
Code can be found in the repo: https://github.com/RagnarGrootKoerkamp/quadrank
- sigma = 2/4
- 1/6/12 threads
- latency/loop/stream
- FM index
- AVX2, DDR4
- Table of exact versions used
- The specific implementation matters for performance; one can't just compare against a method in the abstract.
6.1 BiRank
Figure 1: Throughput in a loop on a small input that fits in L2 cache. The red dashed line indicates the minimum time to read a cache line from RAM.
Notes:
- Rank9 is fast
- Spider is slow
- Our methods are consistently faster than the empirical 7.5 ns needed to fetch a cache line.
- SmallRank@3.125% = poppy
We added support for prefetching to all methods evaluated here.

Figure 2: Space-time tradeoff for rank structures on binary input of total size 4GB. Red lines indicate: (left) the 80ns RAM latency divided by the number of threads, (top) the measured maximum RAM throughput of 1 thread, 7.5ns/cache line, and (rest) the measured maximum total RAM throughput, 2.5 ns/cache line. In the right column, the transparent markers again show the time for just looping, without our added support for prefetching.
Notes:
- Latency is nearly constant and independent of the method used, since the CPU time of computing the answer is small compared to the ~80ns wait.
- Our new methods are slightly faster but mostly comparable when used in a for loop on a single thread.
- With more threads, their benefit increases due to the reduced memory pressure.
- When streaming, our method (alongside our reimplementation of SPIDER) is the only one that can answer rank queries close to the limit imposed by the (per-thread/total) RAM bandwidth, and is around 2× faster than the others, which need to read 2 cache lines instead of 1.
- When streaming or using multiple threads, performance is mostly memory bound, and the extra compute needed for the more space-efficient methods is mostly hidden.
- Most methods gain at least a 1.5× speedup from prefetching; some up to 2×.
- Even then, we are 2× faster, or 3× faster when compared to other methods without prefetching.
- Hyperthreading (12 threads) helps to reduce latency nearly 2×, because it can interleave a second thread in the time the first is waiting. For looping, the speedup is around 1.5×, while for streaming, the gains are marginal.
6.2 QuadRank
TODO: Legend for small/large dots
Note: other methods are not optimized for returning all 4 counts, whereas in our case that is only slightly slower.

Figure 3: Space-time trade-off for size 4 alphabet on small L2-cache sized input. Small markers indicate time for a rank(i, c) query that counts only one symbol, while large markers always return all four ranks.
- Computing all 4 counts is relatively slow for the other methods.

Figure 4: Space-time trade-off for size 4 alphabet on 4GB input.
6.3 FM-index
TODO:
- no batching, no prefetching
- batching, no prefetching
- check: 12 threads

7 Conclusion Link to heading
8 Acknowledgements Link to heading
- Heng Li
- Discord folks
- Rob
- Simon
- Felix
- Giulio
9 Appendix
9.1 Further evals (epyc, DDR5, avx512)
10 Blog
If you’re in bioinformatics and do anything with data structures for indexing large amounts of data, or if you are generally into succinct data structures (those that use space close to the information-theoretic lower bound; see Wikipedia), you’ve probably heard of rank and select queries.
BWA (Li 2013) uses a custom routine (GitHub code) for rank over the 2-bit DNA alphabet. At the request of Heng Li, here we investigate whether we can make it faster.
11 Problem statement
Given a text \(T\) over a size-4 (2-bit) alphabet, build a data structure that can quickly count the number of A, C, G, and T characters up to position \(i\) in \(T\). Specifically, for each query \(i\) we want to return all 4 counts.
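As a baseline, the naive scan that any such data structure must beat (a sketch; characters are assumed to be stored as 2-bit codes 0–3):

```rust
/// All four counts among the first `i` characters of `text`, by one scan.
fn rankall_naive(text: &[u8], i: usize) -> [u64; 4] {
    let mut counts = [0u64; 4];
    for &c in &text[..i] {
        counts[c as usize] += 1; // c is a 2-bit code: A=0, C=1, G=2, T=3
    }
    counts
}
```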
11.1 Metrics
- most important: fast
- low latency?
- high throughput when called in a loop?
- high throughput with prefetching
- also, with multithreading
- secondary: small. But 2x overhead is fine.
- as long as it fits in RAM, all is good
12 Motivation: BWA / FM-index
13 Data structure layout
- query time is measured in the RAM model: cost 1 per memory access.
13.1 1-level indices
- \(n\): number of characters.
13.1.1 Flat: \(2n\) bits, \(O(n)\) queries
13.1.2 All-answers \(4n\lg n\) bits, \(O(1)\) queries
13.2 What’s in a cache line?
13.3 2-level indices: blocks
- Blocks of \(B\) characters.
- For each block, store the character counts for the text preceding the block.
13.3.1 External block counts: \(2n + n/B\cdot 4\lg n\) bits, \(O(B)\) queries
13.3.2 Internal block counts: \(2n + n/(B-4\lg n)\cdot 4\lg n\) bits, \(O(B)\) queries
13.4 3-level indices: superblocks
- Superblocks corresponding to \(S\) blocks or \(S\cdot B\) characters, so that the block offsets can use less precision.
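For concreteness, with (assumed) example parameters \(B=512\) characters and \(\lg n = 32\), the overhead of external block counts relative to the \(2n\) payload bits is

\[ \frac{(n/B)\cdot 4\lg n}{2n} = \frac{2\lg n}{B} = \frac{64}{512} = 12.5\%, \]

while with superblocks, 16-bit block deltas bring the block-level term down to \(4\cdot 16/(2B) = 6.25\%\), plus a negligible superblock term.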
13.5 Current implementation in BWA
- query code: https://github.com/lh3/bwa/blob/master/bwt.c
  - `bwt_occ`: count occurrences of 1 character up to a position.
  - `bwt_occ4`: count occurrences of all 4 characters.
  - `bwt_2occ4`: count occurrences of all 4 characters in an interval. This is the most important one.
- construction is here: https://github.com/lh3/bwa/blob/master/bwtindex.c#L150, `bwt_bwtupdate_core`
- flat array
- 2-level index: every `OCC_INTERVAL` stores 4 counts followed by BWT input text. No superblocks.
13.6 Other implementations
- SDSL?
- QWT (Ceregini, Kurpicz, and Venturini 2024) (docs.rs/qwt): 3-level index
- Superblocks store a large (64-bit?) number.
- Blocks have a 12-bit delta relative to the superblock.
- Popcount inside the block.
14 Evals
Current status:
Compared libraries:
- `Rank9` and 5 `RankSmall` variants from sux-rs (github:vigna/sux-rs) (Vigna 2008, 2024)
  - PR adding prefetch: https://github.com/vigna/sux-rs/pull/98
- `qwt` implementations of `RSNarrow` and `RSWide` (github:rossanoventurini/qwt) (Ceregini, Kurpicz, and Venturini 2024)
  - PR adding prefetch: https://github.com/rossanoventurini/qwt/pull/6
- `genedex` implementations of `{Flat,Condensed}TextWithRankSupport<u32, {Block64,Block512}>` (github:feldroop/genedex)
  - PR adding prefetch: https://github.com/feldroop/genedex/pull/4
- `spider` (Laws et al. 2024) (github:williams-cs/spider): our own Rust port; TODO: achieve the same performance, because the provided C benchmark is a bit faster.
  - untested, no evaluation scripts available (gh issue), undocumented, no library, never used?
- TODO: non-interleaved (NI) spider
Not compared:
- The B-tree based methods of Pibiri and Kanda (2021) (github:jermp/mutable_rank_select) are missing because they are written in C++ and benchmarking via FFI is very likely too slow.
  - (Working on a highly optimised Rust port of this library for both rank and select might be next.)
- The C++ method of Gottlieb and Reinert (2025) (github:seqan/pfBitvectors) was already implemented in genedex and is slightly faster there3, but in these plots the small version of genedex is somewhat slow, so this needs further investigation as well.
[old caption] Space overhead vs. time tradeoff for various rank implementations in Rust for a varying number of threads. The red lines show latency (first column) and throughput (rest) bounds given by the RAM.
15 TODOs
- store fewer bits per super block entry:
- low 8 bits can be omitted
- Bitm
15.1 FM
15.1.1 Perf
- optimize interval queries when s and t are close?
- optimize parallel mapping
- batch as-is seems best
- batch_all, to fill gaps as they open up, appears slower?
- batch_interleave, to mix memory and cpu work, needs more attention
15.1.2 Evals
- compare ext:
- AWRY => https://github.com/UM-Applied-Algorithms-Lab/AWRY/issues/44
- SDSL => broken so far
15.1.3 Features
- locate queries via sampled suffix array
- inexact matching
- in-text verification
- bidirectional fm
16 Link todo:
References
Like Rank9 (Vigna 2008) and most (but not all) other implementations, we follow Dijkstra’s advice (Dijkstra 1982) and start numbering at zero. ↩︎
TODO: put numbers in math mode? ↩︎
Personal communication with Simon Gene Gottlieb ↩︎