\[\newcommand{\rank}{\mathsf{rank}}\] \[\newcommand{\rankall}{\mathsf{rankall}}\]

Abstract Link to heading

Motivation. Given a text, a rank query \(\rank(p, c)\) counts the number of occurrences of character \(c\) among the first \(p\) characters of the text. Space-efficient methods to answer rank queries form an important building block in many succinct data structures. For example, the FM-index (Ferragina and Manzini 2000) is a widely used data structure that uses rank queries to locate all occurrences of a pattern in a text.

In bioinformatics applications, the goal is usually to process a given input as fast as possible. Thus, data structures should have high throughput when used with many threads.

Contributions. For the binary alphabet (\(\sigma=2\)), we develop BiRank. It merges the central ideas of two recent papers: (1) we interleave (inline) offsets into each cache line of the underlying bit vector (Laws et al. 2024), reducing cache misses, and (2) these offsets point to the middle of each block, so that only half of it needs to be popcounted (Gottlieb and Reinert 2025). In QuadRank, we extend these techniques to the size-\(\sigma=4\) (DNA) alphabet.

Both data structures are optimized for high throughput, answering many queries as fast as possible, by adding prefetch instructions to start loading the required cache lines ahead of time.

Results. BiRank has a space overhead of 3.125% and QuadRank has a space overhead of 14%. They are around \(1.5\times\) faster than methods that do not use inlining. Prefetching gives another \(2\times\) speedup, at which point the RAM bandwidth becomes a hard limit on the total throughput.

When using QuadRank in a toy count-only FM-index, this results in up to a \(4\times\) speedup over Genedex, a state-of-the-art batching FM-index implementation.

1 Introduction Link to heading

Given a fixed text \(T=t_0\dots t_{n-1}\) of length \(n\) over an alphabet \(\Sigma\) of size \(\sigma\), a rank query \(\rank_T(p, c)\) counts the number of occurrences of character \(c\in \Sigma\) in the first \(p\) (\(0\leq p\leq n\)) characters of the text: \[ \rank_T(p, c) := \sum_{i\in \{0, \dots, p-1\}} [T_i = c]. \] In most literature, the binary alphabet of size \(\sigma=2\) is used, in which case the text is simply a string of \(n\) bits. In this case, we also write \(\rank_T(p) := \rank_T(p, 1)\) to count the number of \(1\) bits.
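
As a concrete reference, this definition translates directly into a naive scan over a packed bit vector, matching the &[u64] representation used later (a sketch with illustrative names, not library code):

```rust
// Naive reference implementation of the rank definition above.
// Bit i of the text is bit (i % 64) of word i / 64.
fn rank_naive(bits: &[u64], p: usize) -> u64 {
    let full_words = p / 64;
    // Popcount all words that lie fully before position p...
    let mut r: u64 = bits[..full_words].iter().map(|w| w.count_ones() as u64).sum();
    let rem = p % 64;
    if rem > 0 {
        // ...and mask away all but the low `rem` bits of the partial word.
        r += (bits[full_words] & ((1u64 << rem) - 1)).count_ones() as u64;
    }
    r
}
```

All practical structures below compute exactly this value, but replace the \(O(n)\) scan by a few precomputed counts and a constant number of popcounts.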

Of interest are space-efficient data structures that can answer these queries quickly. Indeed, there exist succinct data structures (Jacobson 1988) that use \(n + o(n)\) bits of space to answer queries on a binary text in \(O(1)\) time in the RAM-model with word-size \(w=\Theta(\lg n)\). When the bitvector itself is stored explicitly, a tight lower bound on the space usage is \(n + \Omega(n \log\log n / \log n)\) bits (Miltersen 2005; Golynski 2006).

A fast and widely used implementation is Rank9 (Vigna 2008), which has a fixed \(25\%\) space overhead. Many subsequent works have reduced the space overhead, usually at the cost of slightly slower queries. For example, poppy (Zhou, Andersen, and Kaminsky 2013) is quite small with only 3.125% overhead. In practice, nearly all fast implementations have some small constant overhead, making them compact (\(n+O(n)\) bits) but not succinct (\(n+o(n)\) bits). See the next section for a detailed overview of past work.

QuadRank. In this paper, we first develop fast data structures for rank over the binary alphabet (BiRank) by combining many existing techniques. We then extend these results to the \(\sigma=4\) (DNA) alphabet in QuadRank, which has direct applications to both the FM-index and wavelet trees.

FM-index. A primary application of Rank queries is in the FM-index (Ferragina and Manzini 2000), a succinct data structure that can efficiently locate all occurrences of a pattern in a text and is used in tools such as BWA-MEM (Li 2013), and Bowtie (Langmead et al. 2009; Langmead and Salzberg 2012). Whereas most of the literature on rank structures assumes a binary alphabet (\(\sigma=2\)), in this case the DNA alphabet has size \(\sigma=4\). Indeed, BWA-MEM implements its own rank structure over a 2-bit alphabet, and this paper started as an attempt to speed this up.

Wavelet tree. For alphabets of arbitrary size, wavelet trees (Grossi, Gupta, and Vitter 2003) or the wavelet matrix (Claude, Navarro, and Ordóñez 2015) can be used instead, which need \(\log_2 \sigma\) queries to a binary rank structure. Recently, quad wavelet trees (Ceregini, Kurpicz, and Venturini 2024) have been introduced, following earlier theoretical (Ferragina et al. 2007) and practical (Bowe 2010) results on multi-ary wavelet trees. Quad wavelet trees use rank over a \(\sigma=4\) quad vector as a building block, and thus need only \(\log_4 \sigma\) rank queries, leading to \(2\times\) to \(3\times\) speedups.

Multithreading and batching. In applications in bioinformatics, one often has many independent queries (DNA sequences) that need to be processed (searched in an FM-index). Thus, the relevant metric is how fast a CPU can answer all these queries. In particular, this allows using all cores/threads of the CPU as well as processing queries in batches inside each thread, to hide the memory latency.

Current benchmarks usually measure the throughput of answering rank queries in a for loop, but this does not take into account the possibility for batching, nor does it include the effects of running many threads in parallel. As we will see, many existing methods become bottlenecked by the total memory bandwidth of the CPU when used in a high-throughput setting, and we specifically design our data structures to make efficient use of the memory bandwidth.

Contributions. We develop two data structures, BiRank and QuadRank, for rank queries over texts over alphabets of size 2 and 4 of length up to 128 GiB and 256 GiB respectively. (Longer texts are possible by using slightly more space.) Our Rust library is available at https://github.com/RagnarGrootKoerkamp/quadrank.

Both of them integrate a number of existing techniques (see next section), and are not designed to support select queries, allowing for more optimizations. Specifically, BiRank integrates (1) inlining of L2 into the bitvector (Laws et al. 2024), which reduces cache misses, (2) paired-blocks with mask-lookup (Gottlieb and Reinert 2025), halving the number of popcounts, and (3) an additional zeroth tree level (Zhou, Andersen, and Kaminsky 2013) that is modified to be only half the size to allow ranks up to \(2^{40}\).

QuadRank extends the ideas of BiRank, but has roughly \(4\times\) larger space overhead since it stores offsets for each character. It combines the cache-locality of the implementation in BWA-MEM (Li 2013) with the low overhead of quad vectors (Ceregini, Kurpicz, and Venturini 2023) and a transposed bit layout for faster queries (Anderson and Wheeler 2021; Gottlieb and Reinert 2025). QuadRank is optimized for returning ranks for all 4 characters at once by using AVX2 instructions, which is useful for approximate pattern matching in an FM-index.

Both data structures usually only need a single cache line from RAM to answer queries, and we provide an API to prefetch this when processing queries in batches. We added similar prefetch instructions to other popular rank libraries as well.

Results. For both data structures, we implement many variants that have different space-time tradeoffs and use different ways of encoding the L1 and L2 values. When used in a for loop, BiRank is up to \(1.5\times\) faster than the next-fastest Rust implementation of equal size, with the difference being larger when using many threads. Prefetching memory improves the throughput of many libraries by around \(1.5\times\), and improves BiRank by \(2\times\). In this setting, all methods are bottlenecked by the memory throughput, and BiRank is \(2\times\) faster than all others because it only needs to read 1 instead of 2 cache lines from RAM.

Similarly, QuadRank is at least \(1.5\times\) faster than the next-fastest Rust library (QWT (Ceregini, Kurpicz, and Venturini 2024)), and \(2\times\) faster after adding prefetch instructions, again being bottlenecked by the RAM throughput.

Inspired by genedex (Droop 2025), we further develop a small toy implementation of a count-only FM-index that uses batching and multithreading. This leads to an implementation that is \(1.5\times\) faster when using QuadRank compared to QWT’s quad vector at 12.5% space overhead, and \(4\times\) faster than genedex at 100% space overhead.

2 Background Link to heading

We briefly go over some previous papers containing rank structures for either \(\sigma=2\) or \(\sigma=4\) in chronological order and list their main technical contributions. Both the poppy (Zhou, Andersen, and Kaminsky 2013) and pasta (Kurpicz 2022a) papers contain a nice overview as well. We note that many of these papers develop a rank structure in the context of the larger rank and select problem, where there are slightly different design trade-offs. Additionally, work on compressed bitmaps is omitted here.

As a baseline, the classic succinct solution (Jacobson 1988) stores the bitvector, and then two levels of blocks alongside this. The bitvector is split into blocks of length \(\log(n)/2\) bits, and \(\log n\) blocks together form a superblock. The first level L1 of the tree then contains a \(\log n\)-bit offset for each superblock, counting the number of set bits preceding it. The second level L2 stores for each block a \(\log\log n\)-bit delta counting the number of one bits preceding it inside its superblock.

González et al. (2005) are the first to introduce a practical method for rank, after observing that the classic method above has 66.85% overhead in practice for \(n=2^{30}\). They replace a \(\sqrt{n}\)-size lookup table for popcounts by a sum of precomputed per-byte lookups. (Meanwhile, CPUs natively support 64-bit popcount instructions.) They suggest using a 2-level tree with 32-bit L1 values covering a 256-bit superblock that is split into 8 32-bit blocks, for each of which an 8-bit L2 delta is stored. Further, they suggest using a single-level tree storing a 32-bit L1 offset after every, e.g., \(4\cdot 32\) bits. A linear scan of popcounting up to \(4\) 32-bit words takes more compute, but has the benefit of cache locality and only requires 2 instead of 3 memory accesses.

Rank9 (Vigna 2008) has 25% overhead and interleaves the L1 and L2 levels of the classic tree. It is designed specifically with 512-bit cache lines in mind: each block is 64 bits, and 8 blocks form a basic block. For each basic block, the interleaved tree stores a 64-bit integer with the offset of the basic block, and 7 9-bit deltas (the reason for the name) in an additional 64-bit word. This needs two cache misses per query, and is very fast in practice, specifically because it only needs to popcount a single 64-bit word, which is done using broadword programming (also known as SWAR, SIMD Within A Register).

Navarro and Providel (2012) develop a data structure for rank and select, and use the extra information stored for the select queries to speed up the linear scan in the method of González et al. (2005).

Poppy (Zhou, Andersen, and Kaminsky 2013) is optimized for space and has only 3.125% overhead. It makes the observation that performance is largely determined by the number of cache misses, and thus uses larger blocks of 512 bits. It then re-uses Rank9’s interleaved index with two modifications. First, each 64-bit index entry covers a superblock of 4 basic blocks, containing one 32-bit offset (L1) and three 10-bit counts (L2), one per additional 512-bit block. Second, to handle 64-bit outputs, it stores an additional zeroth layer (L0) of the tree with a 64-bit offset after every \(2^{32}\) input bits.

BWA-MEM (Li 2013) implements a 100% overhead rank data structure on \(\sigma=4\) DNA that is fully inline, requiring only a single cache-miss per query. In each cache line, it stores 4 64-bit offsets, followed by 256 bits encoding 128 characters.

The succinct data structure library (SDSL) (Gog et al. 2014) implements Rank9 and introduces rank_support_v5, which has 6.25% overhead. It uses superblocks of 2048 bits. For each, it stores a 64-bit offset (L1) and 5 11-bit deltas (packed into 64 bits) to all but the first of 6 \(6\cdot 64\)-bit blocks.

EPR-dictionaries (Pockrandt, Ehrhardt, and Reinert 2017) work for arbitrary alphabets. For \(\sigma=4\), they use 64-bit (32 bp) blocks and have 42% overhead, and effectively store a 2-level rank structure for each character. Compared to earlier work, the main novelty is to store the packed representation of the text, rather than \(\sigma\) 1-hot encoded bitvectors, each with their own rank structure.

Pibiri and Kanda (2021) diverge from the classic approach and introduce a rank and select structure based on highly tuned B-trees that takes 3.6% extra space. Here, each rank query traverses around \(\log_{16} n\) levels of the tree, with the middle levels packing 16 32-bit values in a cache line. Due to efficient caching of the top levels of the tree, performance is similar to poppy, although not as fast as rank9.

The AWFM-index and its Rust implementation AWRY (Anderson and Wheeler 2021) build an FM-index over a size \(\sigma=6\) alphabet: the 4 DNA characters plus a sentinel and an ambiguity symbol. It uses blocks of 256 3-bit characters, preceded by 5 64-bit offsets padded to 512 bits. Each block is encoded using a strided or transposed layout: instead of concatenating the 3 bits of each character, it stores 3 256-bit vectors containing bit 0, bit 1, and bit 2 of each character. This allows for more efficient popcounting. The FM-index processes queries in batches of size 4, and prefetches memory needed for the next rank operation as soon as possible.

FlatRank (Kurpicz 2022a, 2022b) has the same 3.125% space overhead as Poppy, but improves rank query time by 8%. Specifically, it avoids Poppy’s need to take a prefix sum over L2 counts: it doubles each superblock entry to 128 bits, covering 8 512-bit blocks of 4096 bits in total. It stores a 44-bit offset (L1) followed by 7 12-bit deltas (L2) from the start of the superblock to each block. A second structure, WideRank (3.198% overhead), uses 16-bit values for L2, which allows faster select queries using SIMD instructions. Each superblock covers 128 blocks and stores a 64-bit L1 value, this time not interleaved with the L2 values, and the L0 level is dropped.

Quad wavelet trees internally use quad vectors (Ceregini, Kurpicz, and Venturini 2023, 2024), which have a layout very similar to Pasta-flat. Super blocks cover 4096 characters and store a 44-bit offset (L1). This is followed by 7 12-bit deltas (L2) for 8 512-character blocks, so a single 128-bit value is stored for each character, resulting in 6.25% overhead. Alternatively, 256-character blocks can be used to reduce the number of cache misses, using 12.5% overhead.

SPIDER (Laws et al. 2024) has only 3.3% overhead and reduces the number of cache misses from 2 to (nearly) 1 by interleaving L1 with the bitvector itself (like BWA-MEM), instead of interleaving L1 and L2: each cache line stores a 16-bit L2 delta, and 496 input bits. L1 superblocks store a 64-bit offset for each 128 blocks, taking only 0.1% extra space and thus likely fitting in a cache.

Paired-blocks (Gottlieb and Reinert 2025) is an idea that halves the memory usage again, to 1.6%. Compared to WideRank, instead of storing 16-bit (L2) deltas to the start of each block, here we store 16-bit deltas to the middle of each pair of blocks. Similarly, the 64-bit L1 offset is to the middle of the superblock. Then, the second block can add a prefix-popcount to this as usual, while the first block can subtract a suffix-popcount instead. This is similar to the alternate counters idea for the FM-index by Chacon et al. (2015), where, for alphabet size 4, each block stores half the offsets. A small complication with this design is that conditionally shifting away a prefix or suffix of bits is slightly slower. Instead, Gottlieb and Reinert (2025) introduce a mask lookup table that stores the mask for each position. Lastly, for \(\sigma=4\), this paper uses the transposed layout of AWFM, but calls it scattered instead.
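
To make the add-or-subtract idea concrete, here is a sketch at a toy scale: one pair of 64-bit blocks instead of 512-bit ones, with mid_rank the stored count of 1-bits before the middle of the pair (rank_paired and mid_rank are illustrative names, not code from the paper). This version uses the conditional shifts whose slowness motivates the mask lookup table:

```rust
// Paired-blocks rank at toy scale: queries in the first block subtract a
// suffix popcount from the middle count; queries in the second block add
// a prefix popcount.
fn rank_paired(pair: [u64; 2], mid_rank: u64, pos: usize) -> u64 {
    assert!(pos <= 128);
    if pos < 64 {
        // 1-bits of pair[0] at positions >= pos (the shifts clear the low bits).
        let suffix = if pos == 0 { pair[0] } else { pair[0] >> pos << pos };
        mid_rank - suffix.count_ones() as u64
    } else {
        // 1-bits of pair[1] at positions < pos - 64.
        let rem = pos - 64;
        let prefix = if rem == 0 { 0 } else { pair[1] << (64 - rem) >> (64 - rem) };
        mid_rank + prefix.count_ones() as u64
    }
}
```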

Figure 1: Schematic overview of rank data structures.


2.1 Further implementations Link to heading

While we’re at it, we now list some Rust crates containing additional (re)implementations that do not necessarily correspond directly to a paper.

QWT (Venturini and Savino 2023) implements RSQVector256 and RSQVector512 corresponding to the Quad Vectors in the paper (Ceregini, Kurpicz, and Venturini 2024) with 12.5% and 6.25% overhead. It further contains RSWide, which implements the FlatRank structure of Kurpicz (2022a) (omitting the L0 layer), and RSNarrow, which exactly implements Rank9.

Sux (Vigna and Fontana 2024) contains an implementation of Rank9, as well as five versions of RankSmall. These are all variants on Rank9, but use Poppy’s L0 to allow for 32-bit L1 values. They vary in the number of u32 used to store the L2 values, and the width of the L2 values. A special case is RankSmall3 (3.125% overhead), which stores 3 11-bit values in a single 32-bit word by using 0-extension for the implicit 0-bit of the first value.

Genedex (Droop 2025) implements variants of the data structures of Gottlieb and Reinert (2025). It is designed for \(\sigma>2\), but also supports \(\sigma=2\). It uses blocks of 64 or 512 characters. Each block stores \(\log_2 \sigma\) 64- or 512-bit words containing the bits in transposed layout. For each block, it stores \(\sigma\) 64-bit (or 32-bit) L1 offsets (and L0 and L2 are unused). Condensed64 and Condensed512 do use L2 and generalize WideRank by storing \(\sigma\) 16-bit deltas in L2 and \(\sigma\) 64-bit offsets in L1 for each superblock. Note that for \(\sigma=4\), 512-character blocks span two cache lines.

The genedex crate further provides an FM-index implementation that uses batching of queries, resulting in the currently fastest Rust FM-index.

Bitm is part of bsuccinct (Beling 2024). Its RankSimple (6.25% overhead) stores a 32-bit L1 offset for every 512-bit block. RankSelect101111 (read: 10-11-11) has 3.125% overhead and is the same as RankSmall3 of sux.

Further implementations that we excluded from the evals since they are not (close to) Pareto-optimal:

Bio (Köster 2015) has a RankSelect structure that stores a 64-bit offset after every configurable number of 32-bit words, but an inefficiency in the implementation makes this use 128 bits in practice.

RsDict is based on Navarro and Providel (2012) and uses a compact encoding, leading to a large overhead.

Sucds provides an implementation of Rank9, which is already covered by Sux.

Succinct provides both Rank9 (already covered) and JacobsonRank (slower and larger).

Vers provides RsVec, which implements WideRank, but with superblocks of \(2^{13}\) rather than \(2^{16}\) bits.

2.2 Summary of terminology Link to heading

  • offset: absolute number of 1-bits before a block
  • delta: number of 1-bits from start of super block to current block
  • L0: optional top-level 64-bit offsets, for inputs longer than \(2^{32}\) bits
  • L1: super block offsets
  • L2: block deltas or counts
  • blocks: the bits themselves

Omitted for now:

TODO: fig:

  • Paired with sigma=4
  • EPR: basically like paired but without the pairing and scattered bits

Optimize mask lookup:

  • Shuffle-based lookup
  • 8-byte version, then overwrite 1 non0/non1 byte
  • 8x long 000111000 vec with byte-aligned load

3 BiRank Link to heading

Notation:

  • \(N\): number of bits per basic block (cache line)

  • \(W\): width of the rank stored in each block.

  • We store the low bits of the rank of the block inside the block itself.

    • Optimally use the cache line

3.1 API Link to heading

  • Construct from packed bitslice: &[u64].
    • Nice-to-have for modifying algorithms: construct in-place from Vec<u64> without using more memory than the final size of the data structure.
  • Rank queries are right-exclusive: rank(i) counts the number of 1 bits before position \(i\), or equivalently, the number of 1 bits in the first \(i\) bits of the string.
    • Right-inclusive queries make it impossible to query the empty prefix.
  • We allow counts up to \(2^{40}\), i.e. data structures up to 128 GiB, or up to 256 GiB if only half the bits are set.

3.2 Implementation Techniques Link to heading

  • Mask array: instead of shifting a u64 left or right to remove some bits, precompute e.g. a 16KiB [[u64;4]; 512] array with a mask to apply for each input position (for the last two blocks), or a 4KiB [u128; 256] for pos%256 in each half.

  • Add or subtract the popcount: if pos < 256 { offset-p } else { offset+p }, compiled into a cmov

  • (data&mask).count_ones()
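
Put together, these bullets give a query roughly like the following sketch, shown at a reduced scale (64-bit halves instead of 256-bit, so the mask table is a [u64; 128]; all names are illustrative, not the actual implementation):

```rust
// Precomputed mask per query position: for the first half it selects the
// suffix bits (to be subtracted), for the second half the prefix bits
// (to be added).
const MASKS: [u64; 128] = {
    let mut m = [0u64; 128];
    let mut i = 0;
    while i < 64 {
        m[i] = if i == 0 { !0 } else { !((1u64 << i) - 1) };
        m[64 + i] = if i == 0 { 0 } else { (1u64 << i) - 1 };
        i += 1;
    }
    m
};

// `mid_rank` is the stored offset to the middle of the pair of halves.
fn rank_masked(pair: [u64; 2], mid_rank: u64, pos: usize) -> u64 {
    assert!(pos < 128);
    // (data & mask).count_ones(), as in the last bullet above.
    let p = (pair[pos / 64] & MASKS[pos]).count_ones() as u64;
    // Branchless add-or-subtract; intended to compile to a cmov.
    if pos < 64 { mid_rank - p } else { mid_rank + p }
}
```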

3.3 Superblocks Link to heading

  • When storing \(W\) low bits, each super-block can only cover \(2^W/N\) blocks. For simplicity, we use as stride \(S\) the largest power of 2 below \(2^W/N\).
  • The super blocks simply store a u64 global offset.
  • Better: store the value divided by 256 as a u32; 40-bit values are sufficient in practice.
    • if needed, a different shift can be used
    • a tiny bit slower at times, but halving the pressure on caches is nice.

3.4 Basic Blocks Link to heading

Code can be found in repo: https://github.com/RagnarGrootKoerkamp/quadrank

  • BB16: u16 offset to middle
  • BB16x2: u16 offset to 1/4 and 3/4
  • BB32: u32 offset to middle
  • BB32B: u32 offset to real middle
  • BB23_9: 23 bit offset to 1/3
  • BB32x2: 32 bit offset to 1/4 and 3/4
  • BB64x2:

One more variant that we tried but did not use in the end:

  • Store eg a 23 bit offset to bit 128 (1/4), and then a 9 bit delta from there to bit 384 (3/4), so that a 128bit popcount suffices
    • If we store the rank offset in the middle of the block (and adjust pos as needed), 8 bits are actually sufficient.

4 QuadRank Link to heading

Main idea:

  • do the same as for BiRank, but store 4 offsets instead of 1.
  • Transposed layout

4.1 Implementation notes Link to heading

  • The AVX2 _mm256_sign_epi32 instruction is used to conditionally negate the popcounts
  • We use the transposed layout. This makes it easier to directly popcount 64 bp at once.
  • We use u64x4 simd registers to do a popcount for each of ACGT in its own lane.
```rust
// Lane i holds the indicator bits for symbol i (A, C, G, T): CL/CH flip
// the low/high bit-plane in the lanes whose symbol has a 0 in that bit,
// so the AND yields all-ones exactly at matching positions.
const CL: u64x4 = u64x4::from_array([!0, 0, !0, 0]);
const CH: u64x4 = u64x4::from_array([!0, !0, 0, 0]);
let indicators = (u64x4::splat(l) ^ CL) & (u64x4::splat(h) ^ CH) & u64x4::splat(mask);
```

The lane-wise popcounts are implemented using Muła’s algorithm (Muła 2008; Muła, Kurz, and Lemire 2017):

  • use the _mm256_shuffle_epi8 instruction twice to do a table lookup for the popcount of the low and high 4-bit nibble of each byte. Then add those two counts, and accumulate the 8 bytes of each u64 using the _mm256_sad_epu8 instruction.

  • Future work: In AVX512, there is a dedicated popcount instruction that could be used instead.
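
As a scalar reference for what the SIMD version computes per lane, the nibble-lookup popcount on a single u64 looks as follows (an illustrative sketch, not the actual AVX2 code):

```rust
// Muła-style popcount: look up the popcount of each 4-bit nibble in a
// 16-entry table and add the two per byte, then sum the 8 byte counts
// (the role that _mm256_sad_epu8 plays in the SIMD version).
fn popcount_nibble_lookup(x: u64) -> u32 {
    const TABLE: [u8; 16] = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4];
    x.to_le_bytes()
        .iter()
        .map(|&b| (TABLE[(b & 0xf) as usize] + TABLE[(b >> 4) as usize]) as u32)
        .sum()
}
```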

4.2 Blocks Link to heading

5 Application: Parallel FM index Link to heading

  • We develop a toy implementation of an FM-index for testing the benefit of our method in a setting where batching can be used.
  • Count-only; no locate
  • 8-character prefix lookup
  • Sentinel character is handled by explicitly storing its position
  • Implementation is heavily inspired by genedex (Droop 2025)
  • Setup: exact-map simulated 150 bp Illumina reads with 1% error rate to 500 Mbp of viral data. So mapping likely fails after roughly 50 characters.
  • Approximate mapping using a search scheme is out of scope for now and remains future work.
  • We do not use dedicated code like count_range; instead, we simply process the start and end of each range individually. We still benefit from the reused cache line whenever possible, while avoiding potentially expensive branch mispredictions from checking whether the two positions actually fall in the same cache line.

API: given a batch of B queries (i.e. &[Vec<u8>; B]):

  1. read FM-index range for the 8-character prefix from the lookup table
  2. store a list of up to B indices of the still active queries (those for which the current prefix has >0 matches). For each character, we only iterate over the queries pointed to by this list.
  3. For each character, iterate the active queries twice:
    • In the first iteration, remove queries that are done from the active list (via swap-remove, letting the last element take their place), and prefetch the cache lines required to answer rank queries at the start and end of the interval.
    • In the second iteration, do the actual queries and update the start and end of each active interval.

We briefly tried alternatives such as inserting new queries into the empty slots of the batch when a previous query has 0 matches, but this did not improve performance.
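
The two-pass loop above can be sketched as follows. The FmIndex trait and all names here are illustrative stand-ins for exposition, not the crate’s actual API:

```rust
// One backward-search step over a batch of queries, in two passes:
// first drop finished queries and prefetch, then do the rank queries.
struct Query {
    pattern: Vec<u8>,
    next: usize,  // number of pattern characters still to process
    start: usize, // current suffix-array interval [start, end)
    end: usize,
}

trait FmIndex {
    fn prefetch(&self, pos: usize);             // start loading the cache line for a rank at pos
    fn rank(&self, pos: usize, c: u8) -> usize; // occurrences of c before pos in the BWT
    fn c(&self, c: u8) -> usize;                // index of the first suffix starting with c
}

fn step_batch(fm: &impl FmIndex, queries: &mut [Query], active: &mut Vec<usize>) {
    // Pass 1: swap-remove finished queries; prefetch for the rest.
    let mut i = 0;
    while i < active.len() {
        let q = &queries[active[i]];
        if q.next == 0 || q.start == q.end {
            active.swap_remove(i);
        } else {
            fm.prefetch(q.start);
            fm.prefetch(q.end);
            i += 1;
        }
    }
    // Pass 2: the actual rank queries, updating each interval.
    for &qi in active.iter() {
        let q = &mut queries[qi];
        q.next -= 1;
        let c = q.pattern[q.next];
        q.start = fm.c(c) + fm.rank(q.start, c);
        q.end = fm.c(c) + fm.rank(q.end, c);
    }
}
```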

6 Results Link to heading

Code can be found in repo: https://github.com/RagnarGrootKoerkamp/quadrank

  • sigma = 2/4

  • 1/6/12 threads

  • latency/loop/stream

  • FM index

  • AVX2, DDR4

  • Table of exact versions used

  • Specific implementation matters for perf; can’t just compare against a method in itself.

6.1 BiRank Link to heading

Figure 2: Throughput in a loop on a small input that fits in L2 cache. The red dashed line indicates the minimum time to read a cache line from RAM.


Notes:

  • Rank9 is fast
  • Spider is slow
  • Our methods are consistently faster than the empirical 7.5ns needed to fetch a cache line.
  • SmallRank@3.125% = poppy

We added support for prefetching to all methods evaluated here.

Figure 3: Space-time tradeoff for rank structures on binary input of total size 4GB. Red lines indicate: (left) the 80ns RAM latency divided by the number of threads, (top) the measured maximum RAM throughput of 1 thread, 7.5ns/cache line, and (rest) the measured maximum total RAM throughput, 2.5 ns/cache line. In the right column, the transparent markers again show the time for just looping, without our added support for prefetching.


Notes:

  • Latency is nearly constant and independent of the method used, since the CPU time of computing the answer is small compared to the ~80ns wait.
  • Our new methods are slightly faster but mostly comparable when used in a for loop on a single thread.
  • With more threads, their benefit increases due to the reduced memory pressure.
  • When streaming, our method (alongside our reimplementation of SPIDER) is the only one that can answer rank queries close to the limit imposed by the (per-thread/total) RAM bandwidth, and is around 2x faster than others, which need to read 2 cache lines instead of 1.
  • When streaming or using multiple threads, things are mostly memory bound, and the extra compute needed for the more space efficient methods is mostly hidden.
  • Most methods gain at least a 1.5x speedup from prefetching; some up to 2x.
  • Even then, we are 2x faster, or 3x faster compared to other methods without prefetching.
  • Hyperthreading (12 threads) helps to reduce latency by nearly 2x because it can interleave a second thread in the time the first is waiting. For looping, the speedup is around 1.5x, while for streaming, the gains are marginal.

6.2 QuadRank Link to heading

TODO: Legend for small/large dots

Note: other methods are not optimized for returning all 4 counts, whereas in our case that is only slightly slower.

Figure 4: Space-time trade-off for size 4 alphabet on small L2-cache sized input. Small markers indicate time for a rank(i, c) query that counts only one symbol, while large markers always return all four ranks.


  • Computing all 4 counts is relatively slow for the other methods.

Figure 5: Space-time trade-off for size 4 alphabet on 4GB input.

6.3 FM-index Link to heading

TODO:

  • no batching, no prefetching
  • batching, no prefetching
  • check: 12 threads

7 Conclusion Link to heading

8 Acknowledgements Link to heading

  • Heng Li
  • Discord folks
    • Rob
    • Simon
    • Felix
    • Giulio

9 Appendix Link to heading

9.1 Further evals (epyc, DDR5, avx512) Link to heading

10 Blog Link to heading

If you’re in bioinformatics and do anything with data structures for indexing large amounts of data, or if you are generally into succinct data structures (that use space close to the information-theoretic lower bound), you’ve probably heard of rank and select.

BWA (Li 2013) uses a custom routine (github code) for rank over the 2-bit DNA alphabet. By request of Heng Li, here we investigate whether we can make it faster.

11 Problem statement Link to heading

Given a text \(T\) over a size-4 (2-bit) alphabet, build a data structure that can quickly count the number of A, C, G, and T characters up to position \(i\) in \(T\). Specifically, for each query \(i\) we want to return all 4 counts.

```rust
impl DnaRank {
    /// Take a DNA string over ACGT characters.
    pub fn new(seq: &[u8]) -> Self;

    /// Count the number of A, C, G, *and* T characters before the given position.
    pub fn count4(&self, pos: usize) -> [u32; 4];
}
```
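
For reference, the intended semantics of count4 can be pinned down with a naive scan (count4_naive is an illustrative sketch, not the actual implementation):

```rust
// Count each of A, C, G, T among the first `pos` characters.
fn count4_naive(seq: &[u8], pos: usize) -> [u32; 4] {
    let mut counts = [0u32; 4];
    for &b in &seq[..pos] {
        let code = match b {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => panic!("not an ACGT character"),
        };
        counts[code] += 1;
    }
    counts
}
```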

11.1 Metrics Link to heading

  1. most important: fast
    • low latency?
    • high throughput when called in a loop?
    • high throughput with prefetching
    • also, with multithreading
  2. secondary: small. But 2x overhead is fine.
    • as long as it fits in RAM, all is good

12 Motivation: BWA / FM-index Link to heading

13 Data structure layout Link to heading

  • query time is measured in the RAM model: cost 1 per memory access.

13.1 1-level indices Link to heading

  • \(n\): number of characters.

13.1.1 Flat: \(2n\) bits, \(O(n)\) queries Link to heading

13.1.2 All-answers \(4n\lg n\) bits, \(O(1)\) queries Link to heading

13.2 What’s in a cache line? Link to heading

13.3 2-level indices: blocks Link to heading

  • Blocks of \(B\) characters.
  • For each block, store the character counts for the text preceding the block.

13.3.1 External block counts: \(2n + n/B\cdot 4\lg n\) bits, \(O(B)\) queries Link to heading

13.3.2 Internal block counts: \(2n + n/(B-4\lg n)\cdot 4\lg n\) bits, \(O(B)\) queries Link to heading

13.4 3-level indices: superblocks Link to heading

  • Superblocks corresponding to \(S\) blocks or \(S\cdot B\) characters, so that the block offsets can use less precision.

13.5 Current implementation in BWA Link to heading

13.6 Other implementations Link to heading

14 Evals Link to heading

Current status:

Compared libraries:

Not compared:

  • The B-tree based methods of Pibiri and Kanda (2021) (github:jermp/mutable_rank_select) are missing because they’re written in C++ and benchmarking via FFI is very likely too slow.
    • (Working on a highly optimised Rust port of this library for both Rank and Select might be next.)
  • The C++ method of Gottlieb and Reinert (2025) (github:seqan/pfBitvectors) was implemented in genedex already and slightly faster there, but in these plots the small version of genedex is somewhat slow, so this needs further investigation as well.

[old caption] space overhead vs time tradeoff for various rank implementations in Rust for varying number of threads. The red lines show latency (first col) and throughput (rest) bounds given by the RAM.

15 TODOs Link to heading

  • store fewer bits per superblock entry:
    • the low 8 bits can be omitted
  • Bitm

15.1 FM Link to heading

15.1.1 Perf Link to heading

  • optimize interval queries when \(s\) and \(t\) are close?
  • optimize parallel mapping
    • batching as-is seems best
    • batch_all, which fills gaps as they open up, appears slower?
    • batch_interleave, which mixes memory and CPU work, needs more attention
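
The batching idea can be illustrated with a minimal sketch (my own construction, not genedex's API, with a plain lookup table standing in for a real rank structure): issue prefetches for every query in the batch before answering any of them, so the cache misses overlap instead of being paid one after another.

```rust
// Hypothetical batched-query sketch: prefetch the cache line of every query
// in the batch first, then answer them, overlapping the memory latencies.

#[cfg(target_arch = "x86_64")]
#[inline]
fn prefetch<T>(ptr: *const T) {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8) };
}

#[cfg(not(target_arch = "x86_64"))]
#[inline]
fn prefetch<T>(_ptr: *const T) {} // no-op on other architectures

/// Answer a batch of (position, character) rank queries against a
/// precomputed all-answers table.
fn rank_batch(table: &[[u32; 4]], queries: &[(usize, u8)]) -> Vec<u32> {
    // phase 1: start loading every needed cache line
    for &(p, _) in queries {
        prefetch(&table[p]);
    }
    // phase 2: by now the first lines have (hopefully) arrived
    queries.iter().map(|&(p, c)| table[p][c as usize]).collect()
}
```

With the FM-index's dependent queries, a batch must instead interleave independent patterns, since each step of one pattern depends on the previous rank answer.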

15.1.2 Evals Link to heading

15.1.3 Features Link to heading

  • locate queries via sampled suffix array
  • inexact matching
  • in-text verification
  • bidirectional fm

References Link to heading

Anderson, Tim, and Travis J. Wheeler. 2021. “An Optimized FM-Index Library for Nucleotide and Amino Acid Search.” Algorithms for Molecular Biology 16 (1). https://doi.org/10.1186/s13015-021-00204-6.
Beling, Piotr. 2024. “BSuccinct: Rust Libraries and Programs Focused on Succinct Data Structures.” SoftwareX 26 (May): 101681. https://doi.org/10.1016/j.softx.2024.101681.
Bowe, Alex. 2010. “Multiary Wavelet Trees in Practice.” School of Computer Science and Information Technology, RMIT University, Melbourne, Australia. https://raw.githubusercontent.com/alexbowe/wavelet-paper/thesis/thesis.pdf.
Ceregini, Matteo, Florian Kurpicz, and Rossano Venturini. 2023. “Faster Wavelet Trees with Quad Vectors.” arXiv. https://doi.org/10.48550/ARXIV.2302.09239.
———. 2024. “Faster Wavelet Tree Queries.” In 2024 Data Compression Conference (DCC). IEEE. https://doi.org/10.1109/dcc58796.2024.00030.
Chacon, Alejandro, Santiago Marco-Sola, Antonio Espinosa, Paolo Ribeca, and Juan Carlos Moure. 2015. “Boosting the FM-Index on the GPU: Effective Techniques to Mitigate Random Memory Access.” IEEE/ACM Transactions on Computational Biology and Bioinformatics 12 (5): 1048–59. https://doi.org/10.1109/tcbb.2014.2377716.
Claude, Francisco, Gonzalo Navarro, and Alberto Ordóñez. 2015. “The Wavelet Matrix: An Efficient Wavelet Tree for Large Alphabets.” Information Systems 47 (January): 15–32. https://doi.org/10.1016/j.is.2014.06.002.
Dijkstra, Edsger W. 1982. “Why Numbering Should Start at Zero.” https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831.html; EWD 831.
Droop, Felix Leander. 2025. “Genedex: A Small and Fast FM-Index for Rust.” https://github.com/feldroop/genedex.
Ferragina, P., and G. Manzini. 2000. “Opportunistic Data Structures with Applications.” In Proceedings 41st Annual Symposium on Foundations of Computer Science, 390–98. SFCS-00. IEEE Comput. Soc. https://doi.org/10.1109/sfcs.2000.892127.
Ferragina, Paolo, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. 2007. “Compressed Representations of Sequences and Full-Text Indexes.” ACM Transactions on Algorithms 3 (2): 20. https://doi.org/10.1145/1240233.1240243.
Gog, Simon, Timo Beller, Alistair Moffat, and Matthias Petri. 2014. “From Theory to Practice: Plug and Play with Succinct Data Structures.” In SEA 2014, 326–37. Springer International Publishing. https://doi.org/10.1007/978-3-319-07959-2_28.
Golynski, Alexander. 2006. “Optimal Lower Bounds for Rank and Select Indexes.” In Automata, Languages and Programming, 370–81. Springer Berlin Heidelberg. https://doi.org/10.1007/11786986_33.
González, Rodrigo, Szymon Grabowski, Veli Mäkinen, and Gonzalo Navarro. 2005. “Practical implementation of rank and select queries.” In Poster proceedings of WEA 2005.
Gottlieb, Simon Gene, and Knut Reinert. 2025. “Engineering Rank Queries on Bit Vectors and Strings.” Algorithms for Molecular Biology 20 (1). https://doi.org/10.1186/s13015-025-00291-9.
Grossi, Roberto, Ankur Gupta, and Jeffrey Scott Vitter. 2003. “High-Order Entropy-Compressed Text Indexes.” In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 841–50. SODA ’03. Baltimore, Maryland: Society for Industrial and Applied Mathematics.
Jacobson, Guy Joseph. 1988. “Succinct Static Data Structures.” Carnegie Mellon University.
Kurpicz, Florian. 2022a. “Engineering Compact Data Structures for Rank and Select Queries on Bit Vectors.” In SPIRE 2022, 257–72. Springer International Publishing. https://doi.org/10.1007/978-3-031-20643-6_19.
———. 2022b. “Engineering Compact Data Structures for Rank and Select Queries on Bit Vectors.” arXiv. https://doi.org/10.48550/ARXIV.2206.01149.
Kurpicz, Florian, Niccolò Rigi-Luperti, and Peter Sanders. 2025. “Theory Meets Practice for Bit Vectors Supporting Rank and Select.” arXiv. https://doi.org/10.48550/ARXIV.2509.17819.
Köster, Johannes. 2015. “Rust-Bio: A Fast and Safe Bioinformatics Library.” Bioinformatics 32 (3): 444–46. https://doi.org/10.1093/bioinformatics/btv573.
Langmead, Ben, Cole Trapnell, Mihai Pop, and Steven L Salzberg. 2009. “Ultrafast and Memory-Efficient Alignment of Short DNA Sequences to the Human Genome.” Genome Biology 10 (3). https://doi.org/10.1186/gb-2009-10-3-r25.
Langmead, Ben, and Steven L Salzberg. 2012. “Fast Gapped-Read Alignment with Bowtie 2.” Nature Methods 9 (4): 357–59. https://doi.org/10.1038/nmeth.1923.
Laws, Matthew D., Jocelyn Bliven, Kit Conklin, Elyes Laalai, Samuel McCauley, and Zach S. Sturdevant. 2024. “Spider: Improved Succinct Rank and Select Performance.” In SEA 2024, 301:21:1–21:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPICS.SEA.2024.21.
Li, Heng. 2013. “Aligning Sequence Reads, Clone Sequences and Assembly Contigs with BWA-MEM.” arXiv. https://doi.org/10.48550/ARXIV.1303.3997.
Miltersen, Peter Bro. 2005. “Lower Bounds on the Size of Selection and Rank Indexes.” In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 11–12. SODA ’05. Vancouver, British Columbia: Society for Industrial and Applied Mathematics.
Muła, Wojciech. 2008. “SSSE3: fast popcount.” http://0x80.pl/notesen/2008-05-24-sse-popcount.html.
Muła, Wojciech, Nathan Kurz, and Daniel Lemire. 2017. “Faster Population Counts Using AVX2 Instructions.” The Computer Journal 61 (1): 111–20. https://doi.org/10.1093/comjnl/bxx046.
Navarro, Gonzalo, and Eliana Providel. 2012. “Fast, Small, Simple Rank/Select on Bitmaps.” In SEA 2012, 295–306. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-30850-5_26.
Pibiri, Giulio Ermanno, and Shunsuke Kanda. 2021. “Rank/Select Queries over Mutable Bitmaps.” Information Systems 99 (July): 101756. https://doi.org/10.1016/j.is.2021.101756.
Pockrandt, Christopher, Marcel Ehrhardt, and Knut Reinert. 2017. “EPR-Dictionaries: A Practical and Fast Data Structure for Constant Time Searches in Unidirectional and Bidirectional FM Indices.” In Research in Computational Molecular Biology, 190–206. Springer International Publishing. https://doi.org/10.1007/978-3-319-56970-3_12.
Venturini, Rossano, and Angelo Savino. 2023. “Qwt.” https://github.com/rossanoventurini/qwt.
Vigna, Sebastiano. 2008. “Broadword Implementation of Rank/Select Queries.” In WEA 2008, edited by Catherine C. McGeoch, 154–68. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-68552-4_12.
———. 2024. “sux-rs.” https://github.com/vigna/sux-rs.
Vigna, Sebastiano, and Tommaso Fontana. 2024. “Sux.” https://github.com/vigna/sux-rs.
Zhou, Dong, David G. Andersen, and Michael Kaminsky. 2013. “Space-Efficient, High-Performance Rank and Select Structures on Uncompressed Bit Sequences.” In SEA 2013, 151–63. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-38527-8_15.