\[\newcommand{\rank}{\mathsf{rank}}\] \[\newcommand{\rankone}{\mathsf{rank}}\] \[\newcommand{\rankall}{\mathsf{rank_4}}\]
Abstract
Motivation. Given a text, a query \(\rank(q, c)\) counts the number of occurrences of character \(c\) among the first \(q\) characters of the text. Space-efficient methods to answer these rank queries form an important building block in many succinct data structures. For example, the FM-index (Ferragina and Manzini 2000) is a widely used data structure that uses rank queries to locate all occurrences of a pattern in a text.
In bioinformatics applications, the goal is usually to process a given input as fast as possible. Thus, data structures should have high throughput when used with many threads.
Contributions. We first survey existing results on rank data structures. For the binary alphabet (\(\sigma=2\)), we then develop BiRank, which has 3.28% space overhead. It merges the central ideas of two recent papers: (1) we interleave (inline) offsets into each cache line of the underlying bit vector (Laws et al. 2024), reducing cache misses, and (2) these offsets point to the middle of each block, so that only half of it needs popcounting (Gottlieb and Reinert 2025b). In QuadRank (14.4% space overhead), we extend these techniques to the \(\sigma=4\) (DNA) alphabet.
Both data structures typically require only a single cache miss per query, making them highly suitable for high throughput and memory bound settings. To enable batch-processing, we support prefetching the cache line required to answer a query.
Results. BiRank and QuadRank are around 1.5\(\times\) and 2\(\times\) faster than similar-overhead methods that do not use inlining. Prefetching gives an additional 2\(\times\) speedup, at which point the dual-channel DDR4 RAM bandwidth becomes a hard limit on the total throughput. With prefetching, both methods outperform all other methods apart from SPIDER (Laws et al. 2024) by 2\(\times\).
When using QuadRank in a toy count-only FM-index, this results in up to a 4\(\times\) speedup over genedex, a state-of-the-art batching FM-index implementation.
1 Introduction
Given a fixed text \(T=t_0\dots t_{n-1}\) of length \(n\) over an alphabet \(\Sigma\) of size \(\sigma\), a query \(\rank(q, c)\) counts the number of occurrences of symbol \(c\in \Sigma\) in the first \(q\) (\(0\leq q\leq n\)) characters of the text1: \[ \rank(q, c) := \sum_{i\in \{0, \dots, q-1\}} [t_i = c]. \] In most literature, the binary alphabet of size \(\sigma=2\) is used, in which case the text is simply a string of \(n\) bits. In this case, we also write \(\rank(q) := \rank(q, 1)\) to count the number of \(1\) bits.
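As a concrete reference point, a direct linear-time implementation of this definition could look as follows (an illustrative sketch; `rank_naive` is not part of any library discussed here):

```rust
/// Linear-time reference implementation of rank(q, c): the number of
/// occurrences of `c` among the first `q` characters of `text`.
/// For sigma = 2, `text` is a bit string and `c` is 0 or 1.
fn rank_naive(text: &[u8], q: usize, c: u8) -> usize {
    text[..q].iter().filter(|&&t| t == c).count()
}

fn main() {
    let text = b"ACGTACGA";
    // The first 7 characters t_0..t_6 contain two 'A's.
    assert_eq!(rank_naive(text, 7, b'A'), 2);
}
```

The data structures below answer the same query in constant time using only a small amount of extra space.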
Of interest are space-efficient data structures that can answer these queries quickly. Indeed, there exist succinct data structures (Jacobson 1988) that use \(n + o(n)\) bits of space to answer queries on a binary text in \(O(1)\) time in the RAM-model with word-size \(w=\Theta(\lg n)\). When the bitvector itself is stored explicitly, a tight lower bound on the space usage is \(n + \Omega(n \log\log n / \log n)\) bits (Miltersen 2005; Golynski 2006).
A fast and widely used implementation is Rank9 (Vigna 2008), which has a fixed \(25\%\) space overhead. Many subsequent works have reduced the space overhead to as little as 1.6%, as detailed in Section 2. In practice, most implementations have a fixed overhead, making them compact (\(n+O(n)\) bits) but not succinct.
FM-index. A primary application of Rank queries is in the FM-index (Ferragina and Manzini 2000), a succinct data structure that can efficiently locate all occurrences of a pattern in a text and is used in tools such as BWA-MEM (Li 2013), and Bowtie (Langmead et al. 2009; Langmead and Salzberg 2012). Whereas most of the literature on rank structures assumes a binary alphabet (\(\sigma=2\)), in this case the DNA alphabet has size \(\sigma=4\). Indeed, BWA-MEM implements its own rank structure over a 2-bit alphabet2, and this paper started as an attempt to speed this up.
Wavelet tree. For alphabets of arbitrary size, wavelet trees (Grossi, Gupta, and Vitter 2003) or the wavelet matrix (Claude, Navarro, and Ordóñez 2015) can be used instead, both of which need \(\log_2 \sigma\) queries to a binary rank structure. Recently, quad wavelet trees (Ceregini, Kurpicz, and Venturini 2024) have been introduced, following earlier theoretical (Ferragina et al. 2007) and practical (Bowe 2010) results on multi-ary wavelet trees. Quad wavelet trees use rank over a quad vector as a building block and thus need only \(\log_4 \sigma\) rank queries, leading to 2\(\times\) to 3\(\times\) speedups.
Multithreading and batching. In the past, increasing CPU frequencies led to faster code, but nowadays, improvements are mostly in increasing parallelism. Furthermore, total compute of a CPU increases faster than the available memory bandwidth, resulting in the need for communication-avoiding algorithms that minimize their memory bandwidth (Dongarra 2022). Indeed, in bioinformatics applications, one often has many independent queries (DNA sequences) that need to be processed (searched in an FM-index) as fast as possible. In particular, this allows using all cores/threads of the CPU as well as processing queries in batches inside each thread, to hide the memory latency.
Current benchmarks usually measure the throughput of answering rank queries in a for loop on a single thread, but this neither accounts for the possibility of batching, nor captures the effects of running many threads in parallel. As we will see, in a high-throughput setting, many existing methods become bottlenecked by the total memory bandwidth of the CPU. We specifically design our data structures to use the memory bandwidth as efficiently as possible.
Contributions. We develop two data structures, BiRank and QuadRank, that support high-throughput rank queries over texts over alphabets of size 2 and 4. Our Rust library is available at https://github.com/RagnarGrootKoerkamp/quadrank.
Both of them integrate a number of existing techniques (see the next section) and are not designed to support select queries, which allows for more optimizations. Specifically, BiRank has 3.28% overhead and integrates (1) inlining of L2 into the bitvector (Laws et al. 2024), which reduces cache misses, (2) paired-blocks with mask-lookup (Gottlieb and Reinert 2025b), halving the number of popcounts, and (3) an additional zeroth tree level (Zhou, Andersen, and Kaminsky 2013) that is modified to be only half its usual size.
QuadRank extends the ideas of BiRank, but has roughly 4\(\times\) larger space overhead (14.4%) since it stores metadata for each symbol. It combines the cache-locality of the implementation in BWA-MEM (Li 2013) with the low overhead of quad vectors (Ceregini, Kurpicz, and Venturini 2023) and a transposed bit layout for faster queries (Anderson and Wheeler 2021; Gottlieb and Reinert 2025b). QuadRank is optimized for returning ranks for all 4 symbols at once by using AVX2 instructions, which is useful for approximate pattern matching in an FM-index.
Both data structures need only a single cache line from RAM to answer queries as long as the input is less than roughly 16 GiB, and we provide an API to prefetch cache lines to enable efficient batch-processing of queries. As a side-effect, we also added prefetching to some other libraries.
Results. For both data structures, we implement a number of variants with different space-time trade-offs. When used in a for loop, BiRank is up to 1.5\(\times\) faster than the next-fastest Rust implementation of equal size, with the speedup being larger when using many threads. Prefetching memory improves the throughput of many libraries by around 1.5\(\times\), and improves BiRank by 2\(\times\). In this setting, all methods are bottlenecked by the memory throughput, and BiRank is 2\(\times\) faster than all others because it only needs to read 1 instead of 2 cache lines from RAM.
Similarly, QuadRank is at least 1.5\(\times\) faster than the next-fastest Rust library, QWT (Ceregini, Kurpicz, and Venturini 2024), and 2\(\times\) faster after adding prefetch instructions, again being bottlenecked by the RAM throughput.
Inspired by genedex (Droop 2025), we further develop a small toy-implementation of a count-only FM-index that uses batching and multithreading. At 12.5% overhead, our implementation is 1.5\(\times\) faster when using QuadRank compared to using QWT’s quad vector. And at 100% space overhead, we are 4\(\times\) faster than genedex.
2 Background
We briefly go over some previous papers containing rank structures for either \(\sigma=2\) or \(\sigma=4\) in chronological order and list their main technical contributions. Both the poppy (Zhou, Andersen, and Kaminsky 2013) and pasta (Kurpicz 2022a) papers contain a nice overview as well. Note that many of these papers develop a rank structure in the context of the larger rank and select problem, where there are different design trade-offs. Additionally, work on compressed bitmaps is omitted here.
Most data structures are schematically depicted in Figure 1.
Terminology.
For later reference, we summarize our terminology, mostly following Zhou, Andersen, and Kaminsky (2013) and Kurpicz (2022a).
The raw data is split into superblocks (the first level, L1) that are further
split into blocks (L2). The term block is used both for the raw bits themselves and for the
cache line containing them.
For each superblock, an L1 offset is stored, representing the number of 1-bits
before the superblock. For each block, an L2 delta is stored, typically representing the
number of 1-bits preceding it inside the superblock.
The overhead of a data structure is the increase in space consumption relative
to the size of the input data.
We use bp (base pair) as the unit for 2-bit encoded DNA characters, and
occasionally use the Rust syntax u64 for a 64-bit variable and u64x4 for a
256-bit SIMD register containing 4 64-bit words.
A symbol is an element of the alphabet \(\Sigma\), whereas
a character is an element of a string.
Classic succinct approach. As a baseline, Jacobson (1988) stores the bitvector, and two levels of blocks alongside it. Blocks consist of \(\lfloor\log(n)/2\rfloor\) bits, and \(\lfloor\log n\rfloor\) blocks form a superblock. The first level L1 of the tree then contains a \(\lceil \log n\rceil\)-bit offset for each superblock, counting the number of set bits preceding it. The second level L2 stores for each block a \(\lceil\log \log n\rceil\)-bit delta counting the number of one bits preceding it inside its superblock.
A practical approach.
González et al. (2005) observe that the classic method above
has 66.85% overhead in practice for \(n=2^{30}\).
They replace a lookup table of size \(\sqrt n\) for popcounts by a sum of
precomputed per-byte lookups. (Meanwhile, CPUs natively support 64-bit popcount
instructions.)
They use 256-bit superblocks with a 32-bit offset, containing 8
32-bit blocks, each with their own 8-bit delta.
Alternatively, they introduce a single-level tree
storing a 32-bit L1 offset after every, e.g., \(4\cdot 32\) bits, omitting L2.
This requires popcounting more words, but has the benefit of improved
cache-locality compared to a two-level tree.
Rank9: interleaving levels. Rank9 (Vigna 2008) has 25% overhead and interleaves the L1 and L2 levels of the classic tree. Each block is 64 bits, and 8 blocks form a 512-bit superblock, exactly matching a cache line. For each superblock, the interleaved tree stores a 64-bit integer with the offset of the superblock, and 7 9-bit deltas (for all but the first block) in an additional 64-bit word. This needs two cache misses per query (for the L1 array and bits), and is very fast in practice. Specifically it only needs to popcount a single 64-bit word, which is done using broadword programming (also known as SWAR, SIMD Within A Register).
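For illustration, the broadword popcount underlying Rank9-era code is roughly the classic SWAR routine below (a sketch of the general technique, not Vigna's exact code; modern CPUs replace it with a native popcnt instruction):

```rust
/// Classic SWAR ("broadword") popcount of a 64-bit word.
fn popcount_swar(mut x: u64) -> u64 {
    x = x - ((x >> 1) & 0x5555_5555_5555_5555);                            // 2-bit sums
    x = (x & 0x3333_3333_3333_3333) + ((x >> 2) & 0x3333_3333_3333_3333); // 4-bit sums
    x = (x + (x >> 4)) & 0x0f0f_0f0f_0f0f_0f0f;                            // per-byte sums
    x.wrapping_mul(0x0101_0101_0101_0101) >> 56                            // sum all bytes
}

fn main() {
    assert_eq!(popcount_swar(0b1011_0110), 5);
    assert_eq!(popcount_swar(u64::MAX), 64);
}
```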
Poppy: reducing space. Poppy (Zhou, Andersen, and Kaminsky 2013) is optimized for space and has only 3.125% overhead. First, it makes the observation that performance is largely determined by the number of cache misses. Thus, it uses larger blocks of 512 bits. It then re-uses Rank9's interleaved index with two modifications. Each superblock contains 4 blocks, and for each superblock it stores a 32-bit offset (L1) followed by 3 10-bit popcounts of the first 3 blocks. Queries then require a prefix sum over these counts. To handle 64-bit outputs, it stores an additional zeroth layer (L0) of the tree, with a full 64-bit offset after every \(2^{32}\) input bits.
BWA-MEM: DNA alphabet. BWA-MEM (Li 2013) implements a 100% overhead rank data structure on \(\sigma=4\) DNA. It interleaves L1 offsets with the data, and requires only a single cache-miss per query. In each cache line, it stores 4 64-bit offsets (one for each DNA character), followed by 256 bits encoding 128 bp.
SDSL. The succinct data structure library (SDSL) (Gog et al. 2014) implements Rank9
and introduces rank_support_v5, which has 6.25% overhead. It uses superblocks of
2048 bits. For each, it interleaves a 64-bit offset (L1) and 5 11-bit deltas (packed
into 64 bits) to all but the first of 6 blocks covering \(6\cdot 64\) bits.
rank_support_il interleaves 64-bit offsets with 512-bit blocks.
EPR-dictionaries: arbitrary \(\sigma\). EPR-dictionaries (Pockrandt, Ehrhardt, and Reinert 2017) work for arbitrary alphabet. For \(\sigma=4\), they use 64-bit (32 bp) blocks and have 42% overhead, and interleave an independent 2-level rank structure for each character. Compared to earlier work, space is saved by storing a packed representation of the text instead of \(\sigma\) 1-hot encoded bitvectors.
B-trees. Pibiri and Kanda (2021) diverge from the classic approach and introduce a rank and select structure based on highly tuned B-trees that have 3.6% overhead. Each rank query traverses roughly \(\log_{16} n\) levels of the tree, with the middle levels packing 16 32-bit values in a cache line. Due to efficient caching of the top levels of the tree, performance is similar to Poppy, although not as fast as Rank9.
AWFM: transposed layout and batching/prefetching. The AWFM-index and its Rust implementation AWRY (Anderson and Wheeler 2021) build an FM-index on an alphabet of size \(\sigma=6\): the 4 DNA characters plus a sentinel and an ambiguity symbol. It uses blocks of 256 3-bit characters, preceded by 5 64-bit offsets that are padded to 512 bits. Each block is encoded using a strided or transposed layout: instead of concatenating the 3 bits of each character, it stores 3 256-bit vectors containing bit 0, bit 1, and bit 2 of each character. This allows for more efficient popcounting. The FM-index processes queries in batches of size 4, and prefetches memory needed for the next rank operation as soon as possible.
Pasta: larger L2 values and faster queries. PastaFlat (Kurpicz 2022a, 2022b) has the same 3.125% space overhead as Poppy, but improves query time by 8% by avoiding Poppy’s need to take a prefix sum over L2 counts. Pasta doubles the metadata for each superblock to 128 bits, covering 8 512-bit blocks of 4096 bits in total. It stores a 44-bit offset (L1) followed by 7 12-bit deltas (L2) from the start of the superblock to each block. A second structure, PastaWide (3.198% overhead) uses 16-bit values for L2, which allows faster select queries using SIMD instructions. Here, each superblock covers 128 blocks and stores a 64-bit L1 value, this time not interleaved with the L2 values, and the L0 level is dropped.
Quad vectors: extending PastaFlat to \(\sigma=4\). Quad wavelet trees internally use quad vectors (Ceregini, Kurpicz, and Venturini 2023, 2024), which have a layout very similar to PastaFlat. Superblocks cover eight 512 bp blocks and store 128 bits of metadata for each of the 4 symbols. This takes 4\(\times\) more space per character, but since the text doubles in space as well, the overhead only doubles to 6.25%. Alternatively, 256 bp (512-bit) blocks can be used to reduce the number of cache misses, using 12.5% overhead.
SPIDER: interleaving bits for minimal cache-misses. SPIDER (Laws et al. 2024) has only 3.3% overhead and reduces the number of cache misses from 2 to (nearly) 1 by interleaving L2 with the bitvector itself (like BWA-MEM), instead of interleaving L1 and L2: each cache line stores a 16-bit L2 delta and 496 input bits. L1 superblocks store a 64-bit offset for every 128 blocks, taking only 0.1% extra space and thus likely fitting in cache.
Paired-blocks: halving the overhead. Paired-blocks (pfBV) (Gottlieb and Reinert 2025b) is an idea that halves the memory usage again, to 1.6%. Compared to PastaWide, instead of storing 16-bit (L2) deltas to the start of each 512-bit block, here we store 16-bit deltas to the middle of each pair of 512-bit blocks. Then, the second block can add a prefix-popcount to this as usual, while the first block can subtract a suffix-popcount instead. Similarly, the 64-bit L1 offset is to the middle of a twice-as-large superblock. This is similar to the alternate counters idea for the FM-index by Chacon et al. (2015), where, for alphabet size 4, each block stores half the offsets. A small complication with this design is that conditionally popcounting a prefix or suffix of bits is slightly slower. Instead, Gottlieb and Reinert (2025b) introduce a lookup table that stores a precomputed mask for each position. Lastly, for \(\sigma=4\), this paper uses the transposed layout of AWFM, but calls it scattered instead.
Figure 1: Schematic overview of rank data structures. The top and bottom half are for \(\sigma=2\) and \(\sigma=4\) respectively. Each line shows a data structure (and notable (re)implementations) with its overhead and the layout of a single superblock (not to scale). Each structure stores up to 3 vectors containing (interleaved) superblock offsets, block deltas, and raw bits. On the right (black) are the blocks containing (bitpacked) data. Each superblock contains a single L1 offset (teal) that is either absolute, or sometimes relative to a 64-bit L0 value (green). They usually count the number of 1-bits/characters before the start of the superblock, as indicated by the teal dot, or to the middle of the superblock for paired variants. L2 deltas (yellow) count from the start/middle of the superblock to the start of each block (yellow dots). Only for Poppy do they count individual blocks (yellow lines). For paired, paired fBV, BiRank, and QuadRank, L2 deltas are to the middle of each (pair of) block(s). AWFM, (paired) fBV, and QuadRank store the text transposed, alternating words of low and high bits.
2.1 Further implementations
We now list some specific Rust crates containing (re)implementations of rank structures.
QWT. qwt (github:rossanoventurini/qwt) implements RSQVector256 and RSQVector512
corresponding to the Quad Vectors in the paper (Ceregini, Kurpicz, and Venturini 2024) with
12.5% and 6.25% overhead. It further contains RSWide, which implements the
PastaFlat structure of Kurpicz (2022a) (omitting the L0 layer), and RSNarrow,
which exactly implements Rank9.
Sux. sux (github:vigna/sux-rs) (Vigna and Fontana 2024) contains an implementation of Rank9, as well as five
versions of RankSmall.
These are all variants on Rank9, but use Poppy’s 64-bit L0 to allow for 32-bit L1
values. They vary in the number of u32 used to store the L2 values and the
width of the L2 values. A special case is RankSmall3 (3.125% overhead), which packs 3 11-bit
values into a single 32-bit word by exploiting that the high bit of the first value is
implicitly 0.
Bitm. bitm (github:beling/bsuccinct-rs) is part of bsuccinct (Beling 2024). Its RankSimple (6.25%
overhead) stores a 32-bit L1 offset for every 512-bit block.
RankSelect101111 (read: 10-11-11) has 3.125% overhead and is the same as
RankSmall3 of sux.
Genedex. genedex (github:feldroop/genedex) (Droop 2025) implements variants of the data structures of
Gottlieb and Reinert (2025b). It is designed for \(\sigma>2\), but also supports
\(\sigma=2\).
Flat512 stores the text using 4 indicator bitvectors and uses 4 interleaved copies of SPIDER, one for each symbol.
Flat64 is the same but with 64-character blocks.
Condensed512 implements the flattened bit vectors (fBV) of
Gottlieb and Reinert (2025b), with blocks representing 512 transposed
characters, and using \(\sigma\) interleaved copies of PastaWide.
Further Rust implementations.
We did not include the following libraries in the evaluations because they are not (close to) Pareto optimal.
Bio (Köster 2015) has a RankSelect
structure that stores a 64-bit offset after every few 32-bit words, but is not
very optimized.
RsDict implements a compact
encoding (Navarro and Providel 2012), making it relatively slow.
Sucds implements Rank9, which is already covered.
Succinct provides both Rank9 and
JacobsonRank, the latter being both slower and larger.
Vers_vecs implements PastaWide, but with superblocks spanning \(2^{13}\) rather than \(2^{16}\) bits.
3 BiRank
Let \(T = t_0\dots t_{n-1}\) be a text of \(n\) binary characters \(\{0,1\}\). For a query \(q\) (\(0\leq q\leq n\)), \(\rank(q) = \sum_{i\in \{0,\dots,q-1\}} [t_i = 1]=\sum_{i\in [q]} t_i\) counts the number of 1-bits in the first \(q\) characters of the text.
BiRank is a data structure that answers \(\rank(q)\) queries in constant time
using 3.28% space overhead.
It can be constructed in parallel on multiple threads from a slice of already-packed data and
provides rank_unchecked and prefetch functions that assume the argument is
in \([0, n]\).
Single cache-miss queries. We aim to minimize the number of cache misses on large (many GB) inputs, to enable efficient usage in high-throughput settings where the memory bandwidth is the bottleneck. A single cache-miss is inevitable, and so we must avoid any further cache misses. This means that any additional data should fit in L3 cache, which is not the case for non-interleaved layouts. For example, Paired has 1.6% overhead, which would only support 1 GiB of input with a 16 MiB L3 cache.
Interleaved L2. Thus, like SPIDER (Laws et al. 2024), BiRank inlines a 16-bit L2 delta \(b_j\) into each block/cache line \(j\), so that each of \(\lceil n/B\rceil\) blocks covers \(B:=512-16=496\) bits. The \(\lceil n/S\rceil\) superblocks cover \(S:=128\cdot B\) bits each, and a second much smaller array stores a 32-bit L1 offset \(s_i\) for each superblock \(i\).
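A sketch of the resulting cache-line layout (illustrative Rust types; not the crate's actual definitions):

```rust
/// One 64-byte (512-bit) BiRank cache line: a 16-bit L2 delta followed by
/// B = 496 raw bits of the input. Illustrative sketch only.
#[repr(C, align(64))]
struct BiRankBlock {
    /// Delta from the (shifted) superblock offset to the middle of this block's bits.
    l2: u16,
    /// 496 raw input bits, packed.
    bits: [u8; 62],
}

struct BiRank {
    blocks: Vec<BiRankBlock>, // one per 496 input bits
    l1: Vec<u32>,             // one shifted offset per 128 blocks
}

// The delta plus the 496 data bits exactly fill one 64-byte cache line.
const _: () = assert!(std::mem::size_of::<BiRankBlock>() == 64);
```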
Shifted 32-bit L1 offset. Poppy (Zhou, Andersen, and Kaminsky 2013) uses an additional 64-bit zeroth level, so that 32-bit L1 values are sufficient. Even though the L0 layer is already very small, we remove it completely. Rather than directly encoding \(\rank(i\cdot S)\), the rank at the start of the \(i\)’th superblock, we store \[s_i := \lfloor \rank(i\cdot S)/2^{11}\rfloor\] in the 32-bit L1 value. The remainder \(\rank(i\cdot S)\bmod 2^{11}\) will be added to the 16-bit \(b_j\) delta for each block in the superblock. This configuration supports inputs up to \(2^{43}\) bits, or 1 TiB, since \(n<2^{43}\) implies \(\rank(i\cdot S)/2^{11} < 2^{32}\).
Size of a superblock. Each superblock must contain at most \(2^{16}\) bits, so that the 16-bit \(b_j\) can represent their deltas. Thus, we could fit \(\lfloor 2^{16} / B\rfloor = 132\) blocks inside each superblock, but we round this down for computational efficiency: \(S := 128\cdot B\).
We also implemented a variant of the pairing-of-superblocks technique of Gottlieb and Reinert (2025b), which doubles the superblock size \(S\) and halves the cache usage; see Section 9. We decided not to use it though: while the reduced cache usage could be beneficial, in practice the gains are inconsistent and small at best.
Overhead. The overhead of the block deltas is \(16 / B = 3.226\%\), whereas the superblocks have an overhead of \(32 / S = 0.05\%\). Thus, the total overhead is 3.28%, and the superblock array fits in a 16 MiB L3 cache for inputs up to 32 GiB.
L2-delta to the middle. To reduce the amount of work needed for popcounting, we apply a variant of the paired-blocks technique (Gottlieb and Reinert 2025b): the 16-bit L2 value is not the delta from the start of the superblock to the start of the current block (\(\rank(j\cdot B)\)), but instead to the middle of the current block (\(\rank(j\cdot B + 240)\); the middle of the 512-bit cache line falls at data bit 240 rather than \(496/2=248\) because of the 16-bit inlined delta): \[ b_j := \rank(j\cdot B + 240) - 2^{11}\cdot s_{\lfloor j/128\rfloor}. \] By construction, these values are indeed bounded by \(b_j \leq S + (2^{11}-1) < 2^{16}\). Simplified Rust code for this can be found in Section 8.
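The construction of these values could look roughly as follows, given any popcount-based \(\rank\) over the packed input (an illustrative sketch; the crate's actual construction code differs and runs in parallel):

```rust
const B: usize = 496;     // data bits per block (cache line)
const S: usize = 128 * B; // data bits per superblock

/// Sketch of computing the shifted L1 offsets s_i and the L2 deltas b_j,
/// where `rank(q)` returns the number of 1-bits among the first q input bits.
/// In the real structure each b_j is then written into the low 16 bits of
/// block j's cache line; names and structure here are illustrative only.
fn build_metadata(n: usize, rank: impl Fn(usize) -> u64) -> (Vec<u32>, Vec<u16>) {
    // s_i = floor(rank(i*S) / 2^11), stored in 32 bits.
    let s: Vec<u32> = (0..n.div_ceil(S)).map(|i| (rank(i * S) >> 11) as u32).collect();

    // b_j = rank up to the middle of block j, minus 2^11 * s_{floor(j/128)}.
    let b: Vec<u16> = (0..n.div_ceil(B))
        .map(|j| (rank((j * B + 240).min(n)) - ((s[j / 128] as u64) << 11)) as u16)
        .collect();

    (s, b)
}
```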
Queries. A query for position \(q\) first determines the superblock \(i_q = \lfloor q/S\rfloor\) and block \(j_q = \lfloor q/B\rfloor\). Then, we compute the rank up to the middle of the block as \(\rank(j_q\cdot B+240) = 2^{11}\cdot s_{i_q} + b_{j_q}\). We then make a case distinction on whether \(q\) lies left or right of the middle of its block to determine the final value
\begin{equation} \rank(q) = 2^{11}\cdot s_{i_q} + b_{j_q} + \begin{cases} \phantom{-}\sum_{k\in \{j_q B+240, \dots, q-1\}} t_k & \textrm{if }q \geq j_q \cdot B + 240,\\ -\sum_{k\in \{q, \dots, j_q B+240-1\}} t_k & \textrm{if }q < j_q \cdot B + 240. \end{cases} \end{equation}
In code, we popcount up to 256 bits: either a suffix of the first half, or a prefix of
the second half. The conditional negation is optimized to a branchless cmov instruction.
Masking.
Instead of a for loop over the 64-bit words in the block and bit-shifting, we
prefer a branchless technique that always covers the full 256 bits.
Uncounted bits are masked out via a 256-bit mask that is
precomputed for each \(0\leq (q\bmod B)< 512\), again following
Gottlieb and Reinert (2025b). These are simply stored as a 16 KiB array [u256; 512], which fits in a typical 32 KiB L1 cache.
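Putting the query, the case distinction, and the mask table together, a scalar sketch could look as follows. It assumes the 16-bit delta occupies the low bits of the first word of each cache line; the real code works on 256-bit halves with SIMD, and all names here are illustrative:

```rust
const B: usize = 496;   // data bits per 512-bit cache line
const S: usize = 128 * B;
const MID: usize = 240; // data-bit index of the block middle

/// Precomputed 256-bit masks, one per in-block position p = q % B:
/// for p < MID the mask selects data bits [p, MID) of the first half,
/// for p >= MID it selects data bits [MID, p) of the second half.
/// (Data bit k sits at cache-line bit 16 + k, so the middle is line bit 256.)
fn build_masks() -> Vec<[u64; 4]> {
    (0..512)
        .map(|p| {
            let (lo, hi) = if p < MID { (16 + p, 256) } else { (256, 16 + p) };
            let mut m = [0u64; 4];
            for bit in lo..hi.min(512) {
                m[(bit % 256) / 64] |= 1u64 << (bit % 64);
            }
            m
        })
        .collect()
}

/// Scalar sketch of a query. `line` is the cache line of block j_q = q / B,
/// with the 16-bit delta in the low bits of its first word; `s` is the L1 array.
fn rank(q: usize, s: &[u32], line: &[u64; 8], masks: &[[u64; 4]]) -> u64 {
    let p = q % B;
    let mid_rank = ((s[q / S] as u64) << 11) + (line[0] & 0xffff);

    // Popcount the relevant 256-bit half under the precomputed mask.
    let half = if p < MID { &line[0..4] } else { &line[4..8] };
    let mask = &masks[p];
    let cnt: u64 = (0..4).map(|w| (half[w] & mask[w]).count_ones() as u64).sum();

    // Left of the middle we subtract, right of it we add (a cmov in practice).
    if p < MID { mid_rank - cnt } else { mid_rank + cnt }
}
```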
Prefetching.
In order to facilitate efficient batching algorithms (see Section 11), we provide
a prefetch(q) function that starts loading the two cache lines containing
\(s_{\lfloor q/S\rfloor}\) and \(b_{\lfloor q/B\rfloor}\) that are needed for \(\rank(q)\).
For simplicity and reliability, we prefetch into all levels of the cache hierarchy.
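On x86-64, such a prefetch function could be a thin wrapper around `_mm_prefetch` (a sketch under the layout assumptions above; the crate's actual function may differ):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

const B: usize = 496;
const S: usize = 128 * B;

/// Sketch of prefetch(q): start loading the two cache lines that a later
/// rank(q) will read, i.e. the superblock offset and the inlined block.
/// `_MM_HINT_T0` requests the line in all cache levels. Assumes q < n for brevity.
#[cfg(target_arch = "x86_64")]
fn prefetch(q: usize, l1: &[u32], blocks: &[[u64; 8]]) {
    unsafe {
        _mm_prefetch::<{ _MM_HINT_T0 }>(l1.as_ptr().add(q / S) as *const i8);
        _mm_prefetch::<{ _MM_HINT_T0 }>(blocks.as_ptr().add(q / B) as *const i8);
    }
}
```

The benchmarks in Section 5 issue such a prefetch 32 iterations ahead of the matching rank query.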
3.1 Variants
We consider a few larger but faster variants of BiRank. The ones marked with a * are chosen for the evaluations.
- BiRank16* (3.28%) is the original as described above and inlines a 16-bit value in each cache line.
- BiRank32 (6.67%) is identical but stores a 32-bit value instead, doubling the overhead. This allows for a much (\(\approx 2^{16}\times\)) smaller L1 array.
- BiRank16x2* (6.72%) stores two 16-bit deltas, to 1/4th and 3/4th into the block. Then, only a quarter of the cache line (2 64-bit words) has to be popcounted.
- BiRank23_9 (6.67%) takes a middle ground: it stores a 23-bit L2 delta to 1/4th of the block, and a 9-bit L3 delta (\(\leq 256\)) from there to 3/4th.
- BiRank64 (14.3%) directly stores a 64-bit value instead, completely removing the need for a separate L1 level.
- BiRank32x2* (14.3%) doubles the overhead again and stores two 32-bit L2 values, shrinking the L1 array.
- BiRank64x2* (33.3%) again doubles the overhead, and completely removes the L1 level.
- (TODO) BiRankR9 (33.3%) is an inline version of Rank9: it inlines a 64-bit L1 offset, followed by a 64-bit word containing 6 9-bit deltas to the start of each remaining 64-bit word.
4 QuadRank
QuadRank is the extension of BiRank to the 2-bit DNA alphabet. It can be constructed in parallel on multiple threads from bitpacked data. Rank queries can be either for a specific symbol (\(\rankone(q, c)\)), or for all 4 symbols at once (\(\rankall(q)\)). We do not provide a dedicated function to count a range, as is commonly used for the FM-index, because the associated branch-misses would hurt performance, and cache lines are automatically reused anyway.
As with BiRank, QuadRank is optimized for having as few cache misses as possible. In particular, the data layout is nearly the same, but with the L1 and L2 data replicated for each symbol: each cache line contains 4 16-bit deltas \(b_{j,c}\) and \(B_4=(512-4\cdot 16)/2=224\) characters. Superblocks cover 256 blocks (\(S_4:=256\cdot B_4 < 2^{16}\)), and for each we store 4 shifted 32-bit offsets \(s_{i, c} := \lfloor \rankone(i\cdot S_4, c) / 2^{13}\rfloor\). We now divide by \(2^{13}\), since \(S_4 + (2^{13}-1) < 2^{16}\), which allows inputs up to \(2^{45}\) characters or 8 TiB. The overhead of the \(b_{j,c}\) deltas is \(4\cdot 16 / (2B_4) = 14.29\%\) and the overhead of the offsets is \((4\cdot 32) / (2S_4) = 0.11\%\), for 14.40% overhead in total. A 16 MiB L3 cache can support over 14 GiB of input.
To compute the rank of all 4 symbols at once, relatively more time is spent on popcounting than in the binary case. Thus, we detail our optimizations to compute all 4 ranks efficiently.
Transposed layout. Compared to the layout for binary input, the main difference is that we now store the input data in a transposed (or strided) layout (Anderson and Wheeler 2021; Gottlieb and Reinert 2025b), as opposed to a packed one. Ignoring the inlined \(b_{j,c}\) deltas for the moment, the 256 characters in a block are split into 4 groups of 64. Each group of 64 characters is encoded as two 64-bit values, one consisting of the negation of all low bits, and one of the negation of all high bits (see Code Snippet 1). The 4 16-bit deltas replace the 64 bits corresponding to the first 32 characters, as shown in the bottom row of Figure 1. The positions matching a symbol are then found by and-ing the two values together, after possibly negating one or both. This layout makes more efficient use of popcount instructions, since each counted word now contains up to 64 one-bits, compared to 32 with the packed layout.
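As a small illustration of this layout (a sketch; it assumes symbols are encoded as 0–3 with bit 0 in the low plane), counting one symbol within a group of 64 characters reduces to two possible negations, an AND, and a popcount:

```rust
/// Count occurrences of `symbol` (0..4) within one group of 64 transposed
/// characters. `not_lo` and `not_hi` hold the negated low/high bit planes,
/// as stored in a QuadRank block. Sketch of the scalar logic only.
fn count_symbol(not_lo: u64, not_hi: u64, symbol: u8) -> u32 {
    // Select each plane or its complement depending on the symbol's two bits.
    let lo = if symbol & 1 == 0 { not_lo } else { !not_lo };
    let hi = if symbol & 2 == 0 { not_hi } else { !not_hi };
    (lo & hi).count_ones()
}

fn main() {
    // Positions 0..3 hold A,A,T,G (= 0,0,3,2); the low/high bit planes are
    // lo = 0b0100 and hi = 0b1100, stored negated.
    assert_eq!(count_symbol(!0b0100u64, !0b1100u64, 2), 1); // one 'G'
    assert_eq!(count_symbol(!0b0100u64, !0b1100u64, 3), 1); // one 'T'
}
```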
rank1. Computing the rank for a single character is similar to the binary case (Equation 1): we retrieve the superblock offset \(s_{i_q,c}\), multiply it by \(2^{13}\), and then add the \(b_{j_q,c}\) for the current block and character. Lastly, we add or remove the count for up to 128 characters in either the first or second half of the cache line, processed in two chunks of 64 characters.
4-way popcount. To return the rank of all 4 symbols, we essentially do the
above method 4 times in parallel in u64x4 256-bit AVX2 SIMD registers. In particular,
we use a single SIMD lane for each symbol (Code Snippet 1).
To popcount the number of 1-bits in each lane,
we use Mula’s algorithm (Muła 2008; Muła, Kurz, and Lemire 2017). Essentially,
this splits each byte into two 4-bit nibbles and for each does a
_mm256_shuffle_epi8 instruction to do a 16-way lookup returning the
precomputed number of 1 bits in each nibble. It then adds these two values, resulting in per-byte
popcounts, and finally uses the _mm256_sad_epu8 instruction to take a
horizontal sum of the 8 bytes in each 64-bit lane.
We convert the counts to u32x4 and then
conditionally negate them using _mm_sign_epi32(counts, u32x4::splat(pos-96)),
which multiplies each lane by the sign of \((q\bmod B_4)-96\) (i.e., \(-1\), \(0\), or \(1\)).
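A sketch of this per-lane popcount in Rust, using the general Mula technique with AVX2 intrinsics (not the crate's exact code):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

/// Mula's nibble-lookup popcount: counts the 1-bits in each 64-bit lane of a
/// 256-bit register. Sketch of the technique only.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn popcount_per_lane(v: __m256i) -> __m256i {
    // Popcounts of the 16 possible nibble values, replicated in both 128-bit halves.
    let lut = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
    );
    let low_mask = _mm256_set1_epi8(0x0f);
    let lo = _mm256_and_si256(v, low_mask);                            // low nibble of each byte
    let hi = _mm256_and_si256(_mm256_srli_epi16::<4>(v), low_mask);    // high nibble of each byte
    let cnt = _mm256_add_epi8(
        _mm256_shuffle_epi8(lut, lo), // per-byte popcount of low nibbles
        _mm256_shuffle_epi8(lut, hi), // per-byte popcount of high nibbles
    );
    // Horizontal sum of the 8 byte-counts within each 64-bit lane.
    _mm256_sad_epu8(cnt, _mm256_setzero_si256())
}
```

The four resulting lane counts are then conditionally negated with the `_mm_sign_epi32` step described above.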
4.1 Variants
Again, we consider a number of slightly faster variants that use larger inline values. Since returning all 4 counts takes more compute, we specifically focus on methods that reduce the number of characters to be counted from 128 to 64. There is more variation here than in the binary case: we can use a packed (P) or transposed (T) layout, and we can avoid the pairing technique (Bidirectional vs Forward) to save the small CPU overhead of negating values. This is just a small selection of possibilities, and not all implementations are equally optimized. Those marked with a * have been chosen for the evaluation.
- QuadRank16* (TB, 14.40%) is as described above and inlines 4 16-bit values containing the rank to the middle of each block.
- QuadRank32 (TB, 33%) instead uses 4 32-bit values, making the L1 array much smaller.
- QuadRank24_8* (TB, 33%) leaves space for 3 groups of 64 characters and splits this into 3 sub-blocks, storing an L2 delta to the end of the first and third group. This way, only a 64-character popcount remains.
- QuadRank7_18_7 (PB, 33%) uses a normal packed layout. It stores an 18-bit L2 to the middle of 6 32-character blocks, and two 7-bit L3 deltas to 1/6th and 5/6th.
- QuadRank64* (TB, 100%) stores 4 64-bit values, as does BWA-MEM, removing the L1 array. This only leaves space for 128 characters, so each half is now only 64 of them.
- QuadRank32_8x4 (PF, 100%) uses packed layout. It stores a 32-bit L2 delta to the start of the block, and 4 8-bit L3 deltas to each 32-character sub-block.
- QuadRank32x2 (PF, 100%) stores 2 32-bit L2 deltas to the start and halfway point, and does a forward scan.
TODO: Figures?
5 Results
Both our implementation of BiRank and QuadRank and the evaluations can be found at https://github.com/RagnarGrootKoerkamp/quadrank. All experiments are run on an AVX2 Intel Core i7-10750H Skylake CPU with 6 cores and hyper-threading enabled. The frequency is pinned to 3.0 GHz. Cache sizes are 32 KiB L1 and 256 KiB L2 per core, and 12 MiB shared L3 cache. Main memory is 64 GiB as dual-channel 32 GiB 3200 MHz DDR4 sticks, with the memory controller running at 2933 MHz.
Benchmarks on a 96-core AMD Zen 4 EPYC with 12 DDR5 memory channels can be found in the appendix (Section 10.2). The appendix also contains additional plots analysing the CPU time for very small inputs (Section 10.1).
We only compare Rust implementations, since our aim is to provide a ready-to-use Rust library as well. Furthermore, cross-language function calls would likely prevent the compiler from optimizing all code equally, and re-implementing comparable benchmarks in C++ and getting all libraries to work was deemed too much work.
5.1 BiRank
We compare BiRank and its variants against the Rust crates mentioned in Section 2.1. In order to make the evaluations with prefetching fair, we have created PRs adding support for this to each of them.3
Excluded libraries. SPIDER (github:williams-cs/spider) (Laws et al. 2024) was not yet implemented in Rust, so we made a variant of BiRank that approximately uses SPIDER's linear scan for popcounting inside a block. Unfortunately, we were unable to compare the performance against the original C implementation. Paired-blocks (github:seqan/pfBitvectors) (Gottlieb and Reinert 2025b) has also only been implemented in C++. Nevertheless, genedex was reported to be faster (personal communication). Lastly, we exclude the dynamic B-tree (github:jermp/mutable_rank_select) of Pibiri and Kanda (2021), but consider a Rust reimplementation of this work a promising direction for future work on select specifically.
Setup.
For each run, we first build each data structure on a random 4 GiB input and
generate 10 million random queries.
Then we run three types of benchmarks. In the first, we measure the average latency of
each query, by making each query dependent on the result of the previous one.
In the second, we measure the inverse throughput (i.e., amortized time per
query, which we will just call throughput) when processing all queries in a for
loop: for i in 0..n { BiRank::rank(queries[i]) }.
Lastly, we add prefetching, where we prefetch the required cache lines 32
iterations ahead: for i in 0..n { BiRank::prefetch(queries[i+32]); BiRank::rank(queries[i]); }4
We then repeat these three benchmarks when running in parallel on 1, 6, and 12
threads, where each thread has its own set of 10 M queries.
Each measurement is repeated 3 times, and the median is reported.
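Schematically, the three measurement loops look as follows (illustrative only; `rank` and `prefetch` stand in for the methods of the structure under test):

```rust
/// Schematic versions of the three benchmark loops.
fn benchmarks(queries: &[usize], n: usize, rank: impl Fn(usize) -> u64, prefetch: impl Fn(usize)) {
    // 1) Latency: each query depends on the result of the previous one.
    let mut dep = 0u64;
    for &q in queries {
        dep = rank((q + dep as usize) % n);
    }

    // 2) Throughput: independent queries in a plain for loop.
    let mut sum = 0u64;
    for &q in queries {
        sum += rank(q);
    }

    // 3) Throughput with prefetching, 32 iterations ahead.
    for i in 0..queries.len() {
        if let Some(&next) = queries.get(i + 32) {
            prefetch(next);
        }
        sum += rank(queries[i]);
    }
    std::hint::black_box((dep, sum));
}
```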

Figure 2: Log-log space-time trade-offs for rank structures on binary input of total size 4 GB. The top/middle/bottom row show results for 1/6/12 threads on a CPU with 6 cores. The left/middle/right column show results for the latency, the throughput of a for loop, and the throughput of a for loop with prefetching. Red lines indicate: (left) the roughly 80 ns RAM latency divided by the number of threads, (top mid/right) the 7.5 ns/read maximum random-access RAM throughput of 1 thread, and (rest) the 2.5 ns/cache line total random-access RAM throughput. In the right column, the transparent markers repeat the for-loop throughput. The legend entries are sorted by increasing overhead.
Lower bounds. The results are in Figure 2. There are different lower-bounds on the throughput: the measured latency of the RAM is around 80 ns/read, which gives a lower-bound of 80/\(t\) ns/query for \(t\) threads (dotted red lines). When using only a single thread, we are further bound by its maximum random access throughput of around 7.5 ns per cache line (dotted red lines). In all other cases, we are limited by the 2.5 ns/cache line throughput of the RAM (solid red lines).
Analysis. The first observation is that the latency of all methods is similar, as this is always limited by the memory latency. Furthermore, processing queries in a loop is consistently faster: around 4\(\times\) in a loop, and up to 8\(\times\) with prefetching, when using a single thread. Using 12 threads halves the gap, but nevertheless, this indicates that processing multiple independent queries in each thread should be preferred.
Looking at the middle column, we see that BiRank is just slightly better than other methods when using a single thread. This grows to a 1.4\(\times\) speedup when using many threads, where it likely benefits from its lower memory bandwidth usage. We see that BiRank16 (the default) is smaller but slightly slower than the larger BiRank variants. BiRank16x2 has double the overhead and is slightly faster, while the variants with larger overhead (that shrink or remove the array of superblock offsets) only provide very minimal gains.
After adding prefetching (right column), we see that all methods improve compared to the shaded data points without prefetching, most by somewhere around 1.5\(\times\). Whereas non-interleaving methods are limited to around 16 ns/query, BiRank is able to reach the hard limit of 8 ns/query, the time each thread needs to fetch one cache line. Here, the larger variants benefit from requiring a bit less compute compared to the smaller variants. When using prefetching from multiple threads, the situation is the same: BiRank can fully exhaust the RAM random-access throughput, even in its smallest configuration, whereas other methods are 2\(\times\) slower. We also see that prefetching speeds up BiRank around 2\(\times\) compared to just a plain for loop. A special case here is SPIDER, which is also interleaved. With a single thread, the branch misses of popcounting in a for loop hurt its performance, but it becomes as fast as BiRank and memory bound when multithreading.
5.2 QuadRank
We now run the same set of experiments to compare QuadRank against QWT and genedex on size-4 alphabets. One additional feature is that we compare both \(\rankone(q, c)\) and \(\rankall(q)\). For the other libraries, \(\rankall\) is implemented naively by simply calling \(\rankone\) four times, whereas QuadRank is primarily optimized for this case.

Figure 3: Space-time trade-off of rank structures on a size-4 alphabet with 4 GiB of input. Compared to Figure 2, here we benchmark both \(\rankone(q, c)\) (small markers) and \(\rankall(q)\) (large markers).
Overall, the situation here is similar to the binary case. In high-throughput settings, QuadRank can again saturate the memory bandwidth and answer one query per cache line, being 2\(\times\) faster than all other methods below 100% overhead. When prefetching, QuadRank shows little to no overhead for computing \(\rankall\). Even without prefetching, QuadRank is around 2\(\times\) faster than other methods below 25% overhead.
6 Conclusion
We surveyed a large number of existing rank structures and, inspired by them, developed BiRank and QuadRank. Their main novelty is in bringing together many independent parts, and applying them to size-4 alphabets. We benchmarked them in a high-throughput setting, with many threads and batching of multiple queries inside each thread.
For binary input, the previous best data structure is SPIDER (Laws et al. 2024), but it did not come with a directly reusable implementation. BiRank is usually slightly faster than our reimplementation. When multithreading, both are around 1.5\(\times\) faster than other methods. Prefetching improves all methods, and doubles the throughput of BiRank and QuadRank, making them 2\(\times\) faster than other methods with prefetching. For DNA input, QuadRank is consistently at least 2\(\times\) faster than other implementations with overhead below 15%.
In a multithreaded setting, processing multiple independent queries in a for loop is 1.5\(\times\) to 2\(\times\) faster than sequential processing. With prefetching, this increases to a 3\(\times\) speedup over sequential processing, and this specifically benefits our memory-bandwidth-frugal methods.
These results also translate to significant speedups when used in an FM-index, and we hope that batching and multithreading become standard in both applications and benchmarks.
Future work. Future work remains in generalizing and optimizing the library for platforms other than AVX2. As the current code was optimized for Skylake (2015), it is likely that more modern platforms (Golden Cove, 2021, or Zen 5, 2024) admit different trade-offs. For AVX-512, there are dedicated popcount instructions that could be used, while ARM NEON only supports 128-bit instructions and will need further work.
More features could also be added, such as in-place parallel construction that grows an existing allocation. Additionally, a \(\mathsf{prefix\_rank}(q, c)\) operation could be added, that counts the number of occurrences of characters at most \(c\). Lastly, our toy FM-index could be developed into a full data structure with support for \(\mathsf{locate}\) queries as well.
7 Acknowledgements
I thank Heng Li for motivating me to start this project in an attempt to speed up BWA-MEM. I also thank those who were involved in discussions surrounding this project or gave feedback on the text: Piotr Beling, Felix Leander Droop, Simon Gene Gottlieb, Florian Kurpicz, Rob Patro, and Giulio Ermanno Pibiri.
8 Code snippets
Simplified code for computing the rank inside a QuadRank block is shown in Code Snippet 1.
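A scalar re-sketch of that per-block computation, under the layout assumptions from Section 4 (the actual snippet uses the AVX2 path; all names here are illustrative):

```rust
/// Scalar sketch of the per-block rank for one symbol `c` in a QuadRank block
/// (the real code does all four symbols at once with AVX2).
///
/// Assumptions of this sketch:
/// - `planes[g]` holds the negated (low, high) bit planes of the g'th group of
///   64 nominal character slots; in group 0, slots 0..32 hold the inlined
///   16-bit deltas and must never be counted.
/// - `mid_rank` is 2^13 * s_{i,c} + b_{j,c}: the rank of `c` up to the middle
///   of the block (nominal slot 128).
/// - `p` is the query position in nominal slots, i.e. 32 + (q mod B_4),
///   so 32 <= p <= 256.
fn rank_in_block(planes: [(u64, u64); 4], mid_rank: u64, p: usize, c: u8) -> u64 {
    // Slots within a group that match symbol c.
    let matches = |(not_lo, not_hi): (u64, u64)| -> u64 {
        let lo = if c & 1 == 0 { not_lo } else { !not_lo };
        let hi = if c & 2 == 0 { not_hi } else { !not_hi };
        lo & hi
    };
    // Mask selecting slots [from, to) of the group starting at slot `base`.
    let range_mask = |base: usize, from: usize, to: usize| -> u64 {
        let ones_below = |k: usize| if k >= 64 { !0u64 } else { (1u64 << k) - 1 };
        let lo = from.clamp(base, base + 64) - base;
        let hi = to.clamp(base, base + 64) - base;
        ones_below(hi) & !ones_below(lo)
    };
    if p >= 128 {
        // Add matches in slots [128, p), i.e. groups 2 and 3.
        let cnt: u32 = (2..4)
            .map(|g| (matches(planes[g]) & range_mask(64 * g, 128, p)).count_ones())
            .sum();
        mid_rank + cnt as u64
    } else {
        // Subtract matches in slots [p, 128), i.e. groups 0 and 1.
        let cnt: u32 = (0..2)
            .map(|g| (matches(planes[g]) & range_mask(64 * g, p, 128)).count_ones())
            .sum();
        mid_rank - cnt as u64
    }
}
```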
9 Pairing superblocks
Here we give a variant on the idea of paired superblocks of Gottlieb and Reinert (2025b). Just like we store block offsets \(b_j\) to the middle of a block and then branch on adding or removing to/from that, we can also let the superblock offsets \(s_i\) point to the middle of a double-sized superblock. Let \(S' = 2S=256\cdot B\). We now store \(\lfloor n/S'\rfloor\) 32-bit superblock offsets \(s'_i := \lfloor \rank(i\cdot S' + S'/2)/2^{11}\rfloor\). Given a block \(j\) in superblock \(i_j := \lfloor j/256\rfloor\), its delta is incremented by \(S'/2-240\) when it is in the lower half of the superblock:
\begin{equation*} b'_j := \rank(j\cdot B + 240) - 2^{11}\cdot s'_{i_j} + \begin{cases} 0 & \textrm{if } j \geq 256\cdot i_j + 128, \\ (S'/2 - 240) & \textrm{if }j < 256\cdot i_j + 128. \end{cases} \end{equation*}
For blocks in the upper half of a superblock this is \(< S'/2\) as before. In the lower half, the uncorrected value is the negation of the number of 1-bits between the block middle \(j\cdot B+240\) and the superblock middle \(i_j\cdot S' + S'/2\), which is between \(0\) and \(S'/2 - 240\). After the correction, we obtain a non-negative value \(<S'/2\) again. Queries are modified accordingly to remove this extra term again:
\begin{align*} \rank(q) =& 2^{11}\cdot s'_{i_q} + b'_{j_q} - \begin{cases} 0 & \textrm{if } q \geq S'\cdot i_q + S'/2, \\ (S'/2 - 240) & \textrm{if }q < S'\cdot i_q + S'/2 \end{cases}\\ & + \begin{cases} \phantom{-}\sum_{k\in \{j_q B+240, \dots, q-1\}} t_k & \textrm{if }q \geq j_q \cdot B + 240,\\ -\sum_{k\in \{q, \dots, j_q B+240-1\}} t_k & \textrm{if }q < j_q \cdot B + 240. \end{cases} \end{align*}
10 Additional results
10.1 Throughput for small n
In Figure 4 and Figure 5 we benchmark on small 128 KiB inputs that fit comfortably in the L2 cache. This way, experiments will be mostly CPU-bound, and we get an idea of the maximum performance of each method and their relative computational cost.
As expected, we see a space-time tradeoff, with methods that need more space typically being faster. Rank9 (25% overhead) is the fastest, while RankSmall0 is small but slow. The BiRank variants are all roughly equally fast, with the BiRank16 variant (3.28% overhead) being slightly slower, but still faster than other methods. The server (see the next section) is up to 2\(\times\) faster, and surprisingly, even the smallest BiRank variant is very fast there.
For QuadRank, we see that the laptop is faster with \(\rankone\) than \(\rankall\) queries, while, again surprisingly, the server is faster for \(\rankall\) queries.

Figure 4: Space-time trade-off plot of the inverse throughput of rank queries in a for loop on a small 128 KiB binary input that fits in L2 cache.

Figure 5: Space-time trade-off plot of the inverse throughput of rank queries in a for loop on a small 128 KiB input over alphabet size 4 that fits in L2 cache. Small markers indicate the time for a (rankone) query that counts only one symbol, while large markers always return all four ranks for (rankall).
10.2 AMD EPYC evals
We replicate the experiments on a large AMD EPYC 9684X server with 96 cores, 192 threads, 96 MiB of L3 cache per 8 cores, and 12-channel DDR5 RAM. It has a base clock frequency of 2.55 GHz, but during our experiments it ran consistently at 3.7 GHz with 1 thread, at 3.4 GHz when using all 192 threads, and at 3.0 GHz when using 192 threads with batch processing. We run experiments using 1, 48, and 96 threads. Results for 192 threads are nearly identical to using 96 threads.
Figure 6 and Figure 7 show the results, which follow a similar trend as the earlier results: BiRank and QuadRank are consistently faster, with the improvement of BiRank becoming more pronounced as we add more threads. With prefetching, BiRank saturates the memory bandwidth when using 96 threads, and is then 2\(\times\) faster than nearly all other methods. One notable difference with the laptop benchmarks is that here, SPIDER is as fast as BiRank when using prefetching, suggesting that on Zen 4, the associated branch misses do not cost nearly as much. A further difference is that the benefit of independent queries over sequential queries is up to 4\(\times\) here, compared to at most 2\(\times\) on the laptop. Likely, this is due to the server having a 60% higher RAM latency (130 ns vs 80 ns), as well as a larger reorder window that can process more loop iterations in parallel. At the same time, this reduces the impact of prefetching from 2\(\times\) to 1.5\(\times\).
For \(\sigma=4\), we see that larger data structures are often slower. We hypothesize that this is because the 1.1 GiB L3 cache can hold a larger fraction of the data when the overhead is small (<25% of 4 GiB TODO UPDATE).

Figure 6: Scaling with input size for size 2 alphabet.

Figure 7: Scaling with input size for size 4 alphabet.
TODO 11 Application: Batching FM-index
To showcase an application of our high-throughput data structure, we develop a
toy implementation5 of an FM-index for the \(\sigma=4\) DNA alphabet.
Inspired by Movi (Zakeri et al. 2024, 2025) and genedex (Droop 2025)
we process queries in batches and prefetch memory for upcoming rank queries.
For simplicity, our implementation only counts the number of matches and does
not support locating them. It only supports exact forward searching, and does not
implement bidirectional search or search schemes
(Renders, Depuydt, and Fostier 2022; Renders et al. 2025; Gottlieb and Reinert 2025a), nor in-text verification.
We use a prefix lookup table for the first 8
characters, and handle a single sentinel character ($) by storing its position
in the Burrows-Wheeler Transform (BWT) (Li 2013; Chacon et al. 2015; Burrows 1994).
The main function query_batch takes a batch of \(32\) queries and returns an
array of 32 BWT-intervals indicating where each query matches.
Similar to genedex, during processing we keep an array of the indices of active queries, whose intervals are not
yet empty. As long as there are active queries, we loop over those queries
twice. First, we detect queries that were completed and swap-pop them from the
active list, and then prefetch the memory needed for the rank queries. In
a second loop, we perform the rank queries and LF-mapping for each
active query. We do not optimize pairs of rank queries for small ranges, to
avoid branch-misses.
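A sketch of this batch loop, with illustrative names and a minimal stand-in trait for the index (the actual query_batch differs in details):

```rust
/// Minimal interface this sketch assumes from the index / rank structure.
trait FmIndex {
    fn bwt_len(&self) -> usize;
    fn c_of(&self, c: u8) -> usize;             // C array: count of strictly smaller symbols
    fn rank(&self, pos: usize, c: u8) -> usize; // occurrences of c in BWT[..pos]
    fn prefetch(&self, pos: usize);             // start loading the cache line for pos
}

#[derive(Clone, Copy)]
struct Interval {
    lo: usize,
    hi: usize,
}

/// Sketch of the batched exact-count loop: queries whose interval became empty
/// or that are fully matched are swap-popped; for the rest we prefetch, then do
/// one backward-search (LF) step.
fn query_batch<I: FmIndex>(index: &I, queries: &[&[u8]; 32]) -> [Interval; 32] {
    let n = index.bwt_len();
    let mut ivals = [Interval { lo: 0, hi: n }; 32];
    let mut depth = [0usize; 32]; // characters matched so far, per query
    let mut active: Vec<usize> = (0..32).collect();

    while !active.is_empty() {
        // Pass 1: retire finished queries, prefetch for the remaining ones.
        let mut i = 0;
        while i < active.len() {
            let q = active[i];
            if ivals[q].lo >= ivals[q].hi || depth[q] == queries[q].len() {
                active.swap_remove(i); // swap-pop; do not advance i
            } else {
                index.prefetch(ivals[q].lo);
                index.prefetch(ivals[q].hi);
                i += 1;
            }
        }
        // Pass 2: one LF-mapping step per still-active query.
        for &q in &active {
            let c = queries[q][queries[q].len() - 1 - depth[q]]; // backward search
            ivals[q].lo = index.c_of(c) + index.rank(ivals[q].lo, c);
            ivals[q].hi = index.c_of(c) + index.rank(ivals[q].hi, c);
            depth[q] += 1;
        }
    }
    ivals
}
```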
We briefly tried alternative ways to batch queries, such as replacing a query by a new one as soon as it runs out of matches, but this did not improve performance.
11.1 Results
- Setup: exactly map simulated 150 bp Illumina reads with a 1% error rate to 500 Mbp of viral data. So mapping likely fails after roughly 50 characters.
TODO:
- no batching, no prefetching
- batching, no prefetching
- 1/6/12 threads
- sequential/batch/batch+prefetch
11.1.1 Evals
- compare ext:
- AWRY => https://github.com/UM-Applied-Algorithms-Lab/AWRY/issues/44
- SDSL => broken so far

TODO 12
ANONYMIZE!!!
12.1 code
TODO:
- Optimize mask lookup:
- Shuffle-based lookup
- 8-byte version, then overwrite 1 non0/non1 byte
- 8x long 000111000 vec with byte-aligned load
- reverse the bits in low half of split blocks, so popcounting is always from the left, so we can reduce the masks by 2.
- Do a epi64-native conditional negation of lanes, rather than converting to
u32x4 for
sign.
12.2 paper
Florian’s feedback:
I understand what the dots in the Blocks mean, but it might not be clear at first glance. Some things I might change or at least try:
- Use different notation when storing deltas in the L1-blocks. Maybe even just a \(\Delta\) symbol?
- Maybe indicate what the L1/L2 blocks are storing. I know that this is done by color right now, but I think a small arrow would be nice.
- Maybe replace the dots for the Blocks with \(\Sigma\) to show that it is the prefix sum?
Omitted for now:
12.3 evals
- rerun benches for Flat64 and Flat512
- server benches at 4 GiB input
References
Like Rank9 (Vigna 2008) and most (but not all) other implementations, we follow Dijkstra's advice (Dijkstra 1982) and start numbering at zero. ↩︎
https://github.com/vigna/sux-rs/pull/98, https://github.com/rossanoventurini/qwt/pull/6, https://github.com/feldroop/genedex/pull/4, https://github.com/beling/bsuccinct-rs/pull/14. ↩︎
In practice, we must also prefetch the upcoming query values themselves. We keep the memory system so busy that it does not have time to do this by itself, leading to cache misses on the query values themselves if we do not prefetch them. ↩︎
https://github.com/RagnarGrootKoerkamp/quadrank/blob/master/fm-index/src/fm.rs ↩︎