Table of Contents
If you’re in bioinformatics and do anything with data structures for indexing large amounts of data, or if you are generally into succinct data structures (that use space close to the information-theoretic lower bound, wikipedia), you’ve probably heard of rank and select
BWA Li (2013) uses a custom routine (github code) for rank over 2-bit DNA alphabets. By request of Heng Li, here we investigate if we can make it faster.
1 Problem statement Link to heading
Given the a text \(T\) over a size-4 (2-bit) alphabet, build a data structure that can quickly count the number of A, C, G, and T characters up to position \(i\) in \(S\). Specifically, for each query \(i\) we want to return all 4 counts.
| |
1.1 Metrics Link to heading
- most important: fast
- low latency?
- high throughput when called in a loop?
- high throughput with prefetching
- also, with multithreading
- secondary: small. But 2x overhead is fine.
- as long as it fits in RAM, all is good
2 Motivation: BWA / FM-index Link to heading
3 Data structure layout Link to heading
- query time is measured in the RAM model: cost 1 per memory access.
3.1 1-level indices Link to heading
- \(n\): number of characters.
3.1.1 Flat: \(2n\) bits, \(O(n)\) queries Link to heading
3.1.2 All-answers \(4n\lg n\) bits, \(O(1)\) queries Link to heading
3.2 What’s in a cache line? Link to heading
3.3 2-level indices: blocks Link to heading
- Blocks of \(B\) characters.
- For each block, store the character counts for the text preceding the block.
3.3.1 External block counts: \(2n + n/B\cdot 4\lg n\) bits, \(O(B)\) queries Link to heading
3.3.2 Internal block counts: \(2n + n/(B-4\lg n)\cdot 4\lg n\) bits, \(O(B)\) queries Link to heading
3.4 3-level indices: superblocks Link to heading
- Superblocks corresponding to \(S\) blocks or \(S\cdot B\) characters, so that the block offsets can use less precision.
3.5 Current implementation in BWA Link to heading
- query code: https://github.com/lh3/bwa/blob/master/bwt.c
bwt_occ: count occurrence of 1 character up to position.bwt_occ4: count occurrences of 4 charactersbwt_2occ4: count occurrences of 4 characters in an interval. This is the most important one.
- construction is here: https://github.com/lh3/bwa/blob/master/bwtindex.c#L150,
bwt_bwtupdate_core - flat array
- 2-level index: every
OCC_INTERVALstores 4 counts followed by BWT input text. No superblocks.
3.6 Other implementations Link to heading
- SDSL?
- QWT (Matteo Ceregini 2023) (docs.rs/qwt): 3-level index
- Superblocks stores large (64bit?) number.
- Blocks have 12bit delta relative to superblock.
- Popcount inside the block.
4 Link todo: Link to heading
References Link to heading
Li, Heng. 2013. “Aligning Sequence Reads, Clone Sequences and Assembly Contigs with Bwa-Mem.” arXiv. https://doi.org/10.48550/ARXIV.1303.3997.
Matteo Ceregini, Rossano Venturini Florian Kurpicz. 2023. “Faster Wavelet Trees with Quad Vectors.” arXiv. https://doi.org/10.48550/ARXIV.2302.09239.