Notes on SsHash

\[\newcommand{\S}{\mathcal{S}}\]

Paper summary

Intro

SsHash (Pibiri 2022) is a datastructure for indexing kmers. Given a set of kmers \(\S\), it supports two operations:

\(Lookup(g)\)
return the unique id \(i\in [|\S|]\) of the kmer \(g\).
\(Access(i)\)
return the kmer corresponding to id \(i\).

It also supports streaming queries, looking up all kmers from a longer string consecutively, by expoiting the overlap between them.

Construction of \(\S\): extract paths from the De Bruijn graph.

Existing methods:

FM index
compact, but slow to query in practice
Hashing based
fast, but not compact.

SsHash exploits:

Sparse
Dictionary is keyed by super-k-mers, not kmers.
Skew
Most kmers occur very infrequent, while only few appear many times.

Prelims

node-disjoint path cover
Covers each node (kmer) of the DBg exactly once. Called \(\S’\). Made from unitigs or stitched unitigs (Rahman and Medvedev 2020), also known as simplitigs (Břinda, Baym, and Kucherov 2020).
minimizer
An $m$-mer substring of a $k$-mer with minimal hash.
super-k-mer
A maximal sequence of consecutive k-mers with the same minimizer.
minimal perfect hashing (MPHF)
A bijective map between a set \(\mathcal X\) of size \(n\) and \([n]\). Requires at least \(\log_2 e = 1.44\) bits per element, and can be constructed fast using PTHash (Pibiri and Trani 2021) and other methods in around \(3\) bits.
Elias-Fano encoding
Compresses a non-decreasing sequence of values \(0\leq a_1 \leq \dots \leq a_n \leq A\) in \(n\cdot (2+\lceil \log_2(A/n)\rceil)\) bits (Elias 1974; Fano 1971). (Only around \(2\) bits worse than the theoretical lower bound.)
Bifrost
stores minimizers in the position directly corresponding to their value. Dynamic, but not compact. (Holley and Melsted 2020)
Pufferfish
MPHF mapping kmers to their position in unitig. Also sparse version. (Almodaresi et al. 2018)
Blight
All kmers with same minimizer are grouped, and an MPHF is built for each partition. (Marchet, Kerbiriou, and Limasset 2021)

Sparse and skew hashing

Remarks

Ideas

  • Streaming FM index queries: drop character from searched string in O(1)?
  • Double-robust-minimizer: compute two minimizers and take the rightmost one in case of jump.
  • distance distribution plot for ‘robust minimizers’
    • Do some mathematical analysis.
    • Understand PASHA (universal) and miniception
  • minimizer: Is there bias in the position of the kmer? or always uniform for all schemes?
  • Sample every kth position as minimizer, build PTHash index on it, and then query all m-mers of query independently.
  • rank select mutable bitmasks

References

Almodaresi, Fatemeh, Hirak Sarkar, Avi Srivastava, and Rob Patro. 2018. “A Space and Time-Efficient Index for the Compacted Colored de Bruijn Graph.” Bioinformatics 34 (13): i169–77. https://doi.org/10.1093/bioinformatics/bty292.
Břinda, Karel, Michael Baym, and Gregory Kucherov. 2020. “Simplitigs as an Efficient and Scalable Representation of de Bruijn Graphs,” January. https://doi.org/10.1101/2020.01.12.903443.
Elias, Peter. 1974. “Efficient Storage and Retrieval by Content and Address of Static Files.” Journal of the Acm 21 (2): 246–60. https://doi.org/10.1145/321812.321820.
Fano, R.M. 1971. On the Number of Bits Required to Implement an Associative Memory. Memorandum 61, Computation Structures Group. MIT Project MAC Computer Structures Group. https://books.google.ch/books?id=07DeGwAACAAJ.
Holley, Guillaume, and Páll Melsted. 2020. “Bifrost: Highly Parallel Construction and Indexing of Colored and Compacted de Bruijn Graphs.” Genome Biology 21 (1). https://doi.org/10.1186/s13059-020-02135-8.
Marchet, Camille, Mael Kerbiriou, and Antoine Limasset. 2021. “Blight: Efficient Exact Associative Structure for K-Mers.” Edited by Alfonso Valencia. Bioinformatics 37 (18): 2858–65. https://doi.org/10.1093/bioinformatics/btab217.
Pibiri, Giulio Ermanno. 2022. “Sparse and Skew Hashing of K-Mers.” Bioinformatics 38 (Supplement\_1): i185–94. https://doi.org/10.1093/bioinformatics/btac245.
Pibiri, Giulio Ermanno, and Roberto Trani. 2021. “Pthash: Revisiting Fch Minimal Perfect Hashing.” Proceedings of the 44th International Acm Sigir Conference on Research and Development in Information Retrieval, July. https://doi.org/10.1145/3404835.3462849.
Rahman, Amatur, and Paul Medvedev. 2020. “Representation of \$$k$\$-Mer Sets Using Spectrum-Preserving String Sets.” In Research in Computational Molecular Biology, 152–68. Springer International Publishing. https://doi.org/10.1007/978-3-030-45257-5_10.