\[\newcommand{\S}{\mathcal{S}}\]
Paper summary
Intro
SsHash (Pibiri 2022) is a datastructure for indexing kmers. Given a set of kmers \(\S\), it supports two operations:
- \(Lookup(g)\)
- return the unique id \(i\in [|\S|]\) of the kmer \(g\).
- \(Access(i)\)
- return the kmer corresponding to id \(i\).
It also supports streaming queries, looking up all kmers from a longer string consecutively, by expoiting the overlap between them.
Construction of \(\S\): extract paths from the De Bruijn graph.
Existing methods:
- FM index
- compact, but slow to query in practice
- Hashing based
- fast, but not compact.
SsHash exploits:
- Sparse
- Dictionary is keyed by super-k-mers, not kmers.
- Skew
- Most kmers occur very infrequent, while only few appear many times.
Prelims
- node-disjoint path cover
- Covers each node (kmer) of the DBg exactly once. Called \(\S’\). Made from unitigs or stitched unitigs (Rahman and Medvedev 2020), also known as simplitigs (Břinda, Baym, and Kucherov 2020).
- minimizer
- An $m$-mer substring of a $k$-mer with minimal hash.
- super-k-mer
- A maximal sequence of consecutive k-mers with the same minimizer.
- minimal perfect hashing (MPHF)
- A bijective map between a set \(\mathcal X\) of size \(n\) and \([n]\). Requires at least \(\log_2 e = 1.44\) bits per element, and can be constructed fast using PTHash (Pibiri and Trani 2021) and other methods in around \(3\) bits.
- Elias-Fano encoding
- Compresses a non-decreasing sequence of values \(0\leq a_1 \leq \dots \leq a_n \leq A\) in \(n\cdot (2+\lceil \log_2(A/n)\rceil)\) bits (Elias 1974; Fano 1971). (Only around \(2\) bits worse than the theoretical lower bound.)
Related work
- Bifrost
- stores minimizers in the position directly corresponding to their value. Dynamic, but not compact. (Holley and Melsted 2020)
- Pufferfish
- MPHF mapping kmers to their position in unitig. Also sparse version. (Almodaresi et al. 2018)
- Blight
- All kmers with same minimizer are grouped, and an MPHF is built for each partition. (Marchet, Kerbiriou, and Limasset 2021)
Sparse and skew hashing
Remarks
- How much is DBg construction and SPSS computation a bottleneck in SsHash construction time?
- TODO: Read up on Pufferfish and Blight (Almodaresi et al. 2018; Marchet, Kerbiriou, and Limasset 2021)
Ideas
- Streaming FM index queries: drop character from searched string in O(1)?
- Double-robust-minimizer: compute two minimizers and take the rightmost one in case of jump.
- distance distribution plot for ‘robust minimizers’
- Do some mathematical analysis.
- Understand PASHA (universal) and miniception
- minimizer: Is there bias in the position of the kmer? or always uniform for all schemes?
- Sample every kth position as minimizer, build PTHash index on it, and then query all m-mers of query independently.
- rank select mutable bitmasks
References
Almodaresi, Fatemeh, Hirak Sarkar, Avi Srivastava, and Rob Patro. 2018. “A Space and Time-Efficient Index for the Compacted Colored de Bruijn Graph.” Bioinformatics 34 (13): i169–77. https://doi.org/10.1093/bioinformatics/bty292.
Břinda, Karel, Michael Baym, and Gregory Kucherov. 2020. “Simplitigs as an Efficient and Scalable Representation of de Bruijn Graphs,” January. https://doi.org/10.1101/2020.01.12.903443.
Elias, Peter. 1974. “Efficient Storage and Retrieval by Content and Address of Static Files.” Journal of the Acm 21 (2): 246–60. https://doi.org/10.1145/321812.321820.
Fano, R.M. 1971. On the Number of Bits Required to Implement an Associative Memory. Memorandum 61, Computation Structures Group. MIT Project MAC Computer Structures Group. https://books.google.ch/books?id=07DeGwAACAAJ.
Holley, Guillaume, and Páll Melsted. 2020. “Bifrost: Highly Parallel Construction and Indexing of Colored and Compacted de Bruijn Graphs.” Genome Biology 21 (1). https://doi.org/10.1186/s13059-020-02135-8.
Marchet, Camille, Mael Kerbiriou, and Antoine Limasset. 2021. “Blight: Efficient Exact Associative Structure for K-Mers.” Edited by Alfonso Valencia. Bioinformatics 37 (18): 2858–65. https://doi.org/10.1093/bioinformatics/btab217.
Pibiri, Giulio Ermanno. 2022. “Sparse and Skew Hashing of K-Mers.” Bioinformatics 38 (Supplement\_1): i185–94. https://doi.org/10.1093/bioinformatics/btac245.
Pibiri, Giulio Ermanno, and Roberto Trani. 2021. “Pthash: Revisiting Fch Minimal Perfect Hashing.” Proceedings of the 44th International Acm Sigir Conference on Research and Development in Information Retrieval, July. https://doi.org/10.1145/3404835.3462849.
Rahman, Amatur, and Paul Medvedev. 2020. “Representation of \$$k$\$-Mer Sets Using Spectrum-Preserving String Sets.” In Research in Computational Molecular Biology, 152–68. Springer International Publishing. https://doi.org/10.1007/978-3-030-45257-5_10.