MinHash variants

\[ \newcommand{\sketch}{\mathsf{sketch}} \]

This is a small survey of MinHash-based sketching methods, to go along with my simd-sketch library. See also Cohen (2014) for an earlier small survey.

Motivation. Suppose we have a large number \(r\) of sequences of length around \(n\) that we want to compare quickly. One way to do this is by sketching each sequence into a much smaller representation, and then doing the \(\Theta(r^2)\) pairwise comparisons only on the sketches.

1 Jaccard similarity

Instead of comparing input sequences directly, we consider them as sets of k-mers (Heintze 1996). Then, the Jaccard similarity between two sets \(A\) and \(B\) is (Broder 1997; Broder et al. 1997) \[ \frac{|A\cap B|}{|A\cup B|}. \]

Thus, the goal is to estimate the Jaccard similarity via small sketches.
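
To make this concrete, here is the exact set-based computation that the sketches below approximate, assuming (purely for illustration) that k-mers are already encoded as u64 values:

```rust
use std::collections::HashSet;

/// Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two k-mer sets.
/// The sketches below approximate this without building the full sets.
fn jaccard(a: &HashSet<u64>, b: &HashSet<u64>) -> f64 {
    let shared = a.intersection(b).count();
    // |A ∪ B| = |A| + |B| - |A ∩ B|
    shared as f64 / (a.len() + b.len() - shared) as f64
}
```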

2 Hash schemes

2.1 MinHash

The first scheme is a simple MinHash (Broder 1997), which hashes all k-mers and selects the smallest hash value. The probability (over the random hash function) that sets \(A\) and \(B\) share the smallest k-mer hash equals the Jaccard similarity.

Algorithm. This is simple to compute by using a rolling minimum. On random data, we expect the minimum to only update \(O(\log n)\) times, so the branch-predictor can do a good job and this will be fast in practice.
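
A minimal sketch of this rolling minimum, again assuming k-mers come in as `u64` values; `hash32` is a placeholder mixer standing in for a proper k-mer hash such as ntHash:

```rust
/// Placeholder 32-bit mixer standing in for a real k-mer hash (e.g. ntHash).
/// The seed selects between "independent" hash functions.
fn hash32(kmer: u64, seed: u64) -> u32 {
    let mut h = kmer ^ seed;
    h = h.wrapping_mul(0x9E3779B97F4A7C15);
    h ^= h >> 32;
    h = h.wrapping_mul(0xD6E8FEB86659FD93);
    (h >> 32) as u32
}

/// MinHash: the single smallest hash over all k-mers, via a rolling minimum.
fn minhash(kmers: impl Iterator<Item = u64>) -> u32 {
    let mut min = u32::MAX;
    for kmer in kmers {
        // On random hashes, this minimum decreases only O(ln n) times.
        min = min.min(hash32(kmer, 0));
    }
    min
}
```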

2.2 $s$-mins sketch

To better approximate the similarity, we can use multiple hash functions, say \(s\) of them, and compute the fraction of hash functions for which the minimal hash values of \(A\) and \(B\) are identical (Cohen 1997; Flajolet and Nigel Martin 1985). (Note that other papers also use \(k\) or \(t\) for the size of the sketch, but I will stick to \(s\) to not confuse this with the k-mer length.)

Algorithm. We run the previous algorithm \(s\) times, for \(O(ns)\) total time.
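
A direct implementation of this \(O(ns)\) scheme, reusing the `hash32` placeholder from above with the seed acting as the hash function index:

```rust
/// s-mins sketch: one rolling minimum per seeded hash function; O(ns) total.
fn s_mins_sketch(kmers: &[u64], s: usize) -> Vec<u32> {
    let mut mins = vec![u32::MAX; s];
    for &kmer in kmers {
        for (seed, min) in mins.iter_mut().enumerate() {
            *min = (*min).min(hash32(kmer, seed as u64));
        }
    }
    mins
}

/// Estimated Jaccard similarity: the fraction of positions with equal minima.
fn s_mins_similarity(a: &[u32], b: &[u32]) -> f64 {
    a.iter().zip(b).filter(|(x, y)| x == y).count() as f64 / a.len() as f64
}
```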

Note

There are (at least) five (!!!) papers attributing this method to Broder (1997) and Broder et al. (1997), and then remarking that \(O(ns)\) is slow. As far as I can see, neither of these two papers ever considers using more than one hash function. Instead, both of them only talk about the fast bottom-\(s\) sketch discussed below.

2.3 Bottom-\(s\) sketch

A problem with using multiple hash functions is that this is slow to evaluate. Instead, bottom-hash can be used, where the smallest \(s\) values of a single hash function are taken (Heintze 1996; Broder 1997; Cohen 1997).

The sketch of \(A\cup B\) is then given by the smallest \(s\) values of the union of \(\sketch(A)\) and \(\sketch(B)\), and the Jaccard similarity is estimated as the fraction of those smallest \(s\) values that appear in both \(\sketch(A)\) and \(\sketch(B)\).

Note

The sketch of \(A\cup B\) is not simply the union of \(\sketch(A)\) and \(\sketch(B)\). When \(\sketch(A) = [0,2,4,6]\) and \(\sketch(B)=[1,3,5,6]\) with \(s=4\), we have \(\sketch(A\cup B) = [0,1,2,3]\), and the estimated intersection size is \(0\), even though the value \(6\) is shared.

Mash is a tool that uses this bottom-\(s\) sketch, and it introduces the mash distance, which estimates the mutation rate from the Jaccard similarity.
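
For illustration, the estimator above as a single merge pass over two sorted bottom-\(s\) sketches, assuming \(A\cup B\) contains at least \(s\) distinct hashes:

```rust
/// Estimate the Jaccard similarity from two sorted bottom-s sketches: walk
/// the s smallest distinct hashes of the union and count how many occur in
/// both sketches.
fn bottom_s_similarity(a: &[u32], b: &[u32], s: usize) -> f64 {
    let (mut i, mut j) = (0, 0);
    let (mut union, mut both) = (0, 0);
    while union < s && i < a.len() && j < b.len() {
        if a[i] < b[j] {
            i += 1;
        } else if a[i] > b[j] {
            j += 1;
        } else {
            // Element of the union present in both sketches.
            both += 1;
            i += 1;
            j += 1;
        }
        union += 1;
    }
    // Once one sketch is exhausted, the remaining bottom-s elements of the
    // union come from a single sketch and cannot be shared.
    both as f64 / s as f64
}
```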

Naive algorithm. This is where things get interesting. The usual/naive approach is to incrementally keep the \(s\) smallest elements seen so far in a priority queue (max-heap). Where the rolling minimum gets a total of \(\ln n\) updates (in expectation, on random input), the bottom-\(s\) sketch gets \(s \ln n\) updates in total (Broder 1997; Cohen 2014): the \(i\)’th element has probability \(\min(1, s/i)\) of being among the smallest \(s\) so far, which sums/integrates to roughly \(s \ln n\). Each update takes \(O(\log s)\) time, for an overall complexity of \(O(n + s\log s \ln n)\).
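
A sketch of this naive algorithm using Rust's `BinaryHeap` (a max-heap), assuming for simplicity that all k-mer hashes are distinct (duplicates would need deduplication):

```rust
use std::collections::BinaryHeap;

/// Naive bottom-s sketch: keep the s smallest hashes seen so far in a
/// max-heap. Returns the sketch in increasing order.
fn bottom_s(kmers: impl Iterator<Item = u64>, s: usize) -> Vec<u32> {
    let mut heap = BinaryHeap::with_capacity(s + 1);
    for kmer in kmers {
        let h = hash32(kmer, 0);
        if heap.len() < s {
            heap.push(h);
        } else if h < *heap.peek().unwrap() {
            heap.push(h);
            heap.pop(); // Evict the current maximum.
        }
    }
    heap.into_sorted_vec()
}
```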

Better complexity (new?). We can do better by dropping the priority queue! Currently we do a number of ’early’ insertions into the heap, of elements that are later evicted again. Instead, we can estimate the largest value in the final sketch as \(T=s/n \cdot 2^{32}\) (for 32-bit hashes), and only insert elements up to, say, \(2T\). If there are a lot of duplicate k-mers, this may not yield \(s\) distinct hashes, and we double the limit until it does. When there are no/few duplicate k-mers, this runs in \(O(n + s \log s)\).

Faster algorithm (new?). In practice, we can speed this up further: first collect all the values up to \(2T\), which can be done branchless and using SIMD, and then sort those in a separate pass. That way, the first loop over all \(n\) hashes is completely branchless. See Section 4 for details.
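
A scalar sketch of this filter-then-sort approach with threshold doubling; the branchless SIMD filter itself is deferred to Section 4, and the `2 * s` starting threshold corresponds to the \(2T\) above:

```rust
/// Bottom-s via pre-filtering: keep only hashes below a threshold, sort the
/// survivors, and double the threshold if fewer than s distinct hashes remain.
fn bottom_s_filtered(kmers: &[u64], s: usize) -> Vec<u32> {
    let n = kmers.len() as u64;
    // Expected largest sketch value is s/n * 2^32; start with 2x slack.
    let mut t = (2 * s as u64).saturating_mul(1u64 << 32) / n.max(1);
    loop {
        let t32 = t.min(u32::MAX as u64) as u32;
        // In the SIMD version, this filter is branchless (see Section 4).
        let mut small: Vec<u32> = kmers
            .iter()
            .map(|&kmer| hash32(kmer, 0))
            .filter(|&h| h <= t32)
            .collect();
        small.sort_unstable();
        small.dedup(); // Duplicate k-mers give duplicate hashes.
        if small.len() >= s || t32 == u32::MAX {
            small.truncate(s);
            return small;
        }
        t *= 2; // Not enough distinct hashes below t: double and retry.
    }
}
```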

2.4 FracMinHash

This problem of the unpredictable threshold can also be solved by simply fixing the threshold up front, and then taking however many hashes fall below it. This is what FracMinHash does (Irber et al. 2022).

This is also similar to the mod-sketch, which simply keeps all hash values that are \(0\pmod m\) for some modulus \(m\) (Broder 1997).
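
A minimal FracMinHash-style sketch under the same placeholder hash, parameterized by the retained fraction of the hash space:

```rust
/// FracMinHash: keep every distinct hash below a fixed threshold, so the
/// sketch size scales with the number of distinct k-mers. `frac` is the
/// retained fraction of the hash space, e.g. 1/1000.
fn frac_minhash(kmers: impl Iterator<Item = u64>, frac: f64) -> Vec<u32> {
    let t = (frac * 2f64.powi(32)) as u32;
    let mut sketch: Vec<u32> = kmers
        .map(|kmer| hash32(kmer, 0))
        .filter(|&h| h < t)
        .collect();
    sketch.sort_unstable();
    sketch.dedup();
    sketch
}
```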

2.5 Bucket-hash / $s$-partition sketch

Bottom-$s$-hash and FracMinHash have the drawback that computing the similarity between two sketches requires a merge-sort-like pass, which is hard to make efficient. Bucket-hash solves this by splitting the hashes into \(s\) buckets and keeping the smallest hash in each bucket (Flajolet and Nigel Martin 1985; Li, Owen, and Zhang 2012).

This way, comparing two sketches is again as simple as computing the fraction of shared elements in equal positions.

Note

This scheme was introduced under the name one permutation hashing (Li, Owen, and Zhang 2012), but I think this is a bad name. In particular, the abstract of that paper writes:

Minwise hashing (Broder 1997) is a standard technique in the context of search, for efficiently computing set similarities. […] A drawback of minwise hashing is that it requires a costly preprocessing step, for conducting (e.g.,) \(s=200\sim 500\) permutations on the data.

However, as far as I can see, Broder (1997) never considers more than one hash function in the first place, and really only talks about the bottom-\(s\) sketch. Indeed, the distinctive property of this one permutation scheme is not the fact that it uses only a single permutation, but rather that it uses partitions/buckets to extract multiple smallest values.

The earlier paper on $b$-bit MinHash (Li and König 2010) makes the same mistake.

It seems to me that the $s$-hashes version was never really used for sequence similarity estimation?

Algorithm. This can be implemented by tracking \(s\) bucket minima, and comparing each hash with the current minimum of its bucket. This requires \(n\) random memory accesses, and \(O(s \log(n/s))\) writes in expectation. On the other hand, L1 cache can hold 4096 such values, and adjacent iterations can be mostly executed in parallel, so it may not be too bad.
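
A minimal bucket sketch along these lines, taking the bucket from the high bits of the hash (one arbitrary choice of partition, for illustration):

```rust
/// Bucket sketch: hash each k-mer once and keep the minimum per bucket.
/// u32::MAX marks an empty bucket.
fn bucket_sketch(kmers: impl Iterator<Item = u64>, s: usize) -> Vec<u32> {
    let mut buckets = vec![u32::MAX; s];
    for kmer in kmers {
        let h = hash32(kmer, 0);
        let b = ((h as u64 * s as u64) >> 32) as usize; // Bucket from high bits.
        buckets[b] = buckets[b].min(h);
    }
    buckets
}

/// Comparison is a simple positional equality count.
fn bucket_similarity(a: &[u32], b: &[u32]) -> f64 {
    a.iter().zip(b).filter(|(x, y)| x == y).count() as f64 / a.len() as f64
}
```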

2.5.1 Densification strategies

A drawback of this method is that some buckets can be empty when \(n\) is not sufficiently larger than \(s\). A number of densification strategies have been proposed that ensure these buckets do not accidentally compare as equal, by filling them with k-mers from a different bucket.

(Side note: I’m not convinced this densification matters all that much in practice. Usually when sketching, \(n\gg s\), and only very few buckets should remain empty?)

Rotation. A first option is to replace the value of an empty bucket by the (rotationally) next non-empty bucket (Shrivastava and Li 2014a).

Random rotation direction. Instead of always picking the value of the next bucket, we can also choose between the previous and next bucket, via some fixed random variable \(q_i\in\{0,1\}\) that indicates the chosen direction (Shrivastava and Li 2014b).

Still, for very sparse data the schemes above have high variance when there are long runs of empty buckets.

Optimal densification. A better solution is for every empty bucket \(i\) to copy its value from an independent non-empty bucket \(j\). This can be done using a hash function \(h_i : \mathbb N \to \{0,\dots,s-1\}\) that is iterated until a non-empty bucket is found (Shrivastava 2017).
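
A hedged sketch of this pull-based densification, assuming at least one non-empty bucket and reusing the placeholder mixer as the iterated hash \(h_i\):

```rust
/// Optimal densification: each empty bucket i re-hashes (i, round) until it
/// hits a non-empty bucket, and copies that bucket's value.
fn densify(buckets: &mut [u32]) {
    let s = buckets.len();
    let original = buckets.to_vec(); // Copy only from originally filled buckets.
    for i in 0..s {
        if buckets[i] == u32::MAX {
            // Assumes at least one bucket is non-empty, so this terminates.
            for round in 0u64.. {
                let j = ((hash32(i as u64, round) as u64 * s as u64) >> 32) as usize;
                if original[j] != u32::MAX {
                    buckets[i] = original[j];
                    break;
                }
            }
        }
    }
}
```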

Fast densification. It turns out the “optimal” densification strategy can be improved by using a slightly different algorithm. Instead of empty buckets pulling values from filled buckets, filled buckets can push their values into empty buckets (Mai et al. 2020). The authors show that \(s\lg s\) rounds of pushing are sufficient.

Multiple rounds. All the methods so far suffer when, say, all k-mers hash to the same (or only very few) buckets. A solution is to use multiple hash functions (Dahlgaard, Knudsen, and Thorup 2017). As long as there are empty buckets, we do up to \(s\) rounds of re-hashing the input with a new hash \(h_i\). This is sufficiently fast in expectation, since it’s exceedingly rare to have empty buckets. If there are still empty buckets after \(s\) rounds, we fall back to \(s\) hashes \(h'_i\), the \(i\)th of which maps all values into bucket \(i\), so that each bucket is guaranteed to eventually be non-empty.

SuperMinHash does conceptually the same as the scheme above, but ensures that over the first \(s\) rounds, every element is mapped exactly once to each bucket (Ertl 2017). This is done by explicitly constructing a permutation to control the bucket each element is mapped to in each round. However, this has expected runtime \(O(n + s \log^2 s)\).

ProbMinHash improves over SuperMinHash by tracking for each bucket the best value seen so far, and ensuring that for each element, the hash-values for each bucket are generated in increasing order (Ertl 2020). That way, the insertion of an element can be stopped early. It also provides a number of different algorithms for both unweighted and weighted sketching.

Skipping empty buckets (new?). One other option could be to detect when both \(A\) and \(B\) have an empty bucket (i.e., when both store u32::MAX), and then simply skip such buckets. My feeling is that this should give unbiased results.
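
A possible implementation of this comparison; whether the resulting estimate is truly unbiased is, as said, only my guess:

```rust
/// Compare two bucket sketches, skipping positions where both buckets are
/// empty (both u32::MAX), so that empty buckets never count as matches.
fn similarity_skip_empty(a: &[u32], b: &[u32]) -> f64 {
    let (mut eq, mut valid) = (0usize, 0usize);
    for (&x, &y) in a.iter().zip(b) {
        if x == u32::MAX && y == u32::MAX {
            continue; // Both empty: no information.
        }
        valid += 1;
        eq += (x == y) as usize;
    }
    eq as f64 / valid.max(1) as f64
}
```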

2.6 Mod-bucket hash (new?)

A drawback of bucket-hash is that computing it requires \(s\) “independent” minima. It’s not quite as bad as computing \(s\) hash functions, but it’s somewhat involved to compute (using SIMD instructions) whether an element is small within its bucket.

To fix this, we change how the hashes are partitioned into buckets. Instead of splitting the hash range into \(s\) consecutive intervals, we split by the remainder modulo \(s\). That way, all selected elements will be small: the largest bucket minimum will be around \(T=O(s \log s)/n\cdot 2^{32}\), and so we can pre-filter for elements up to \(2T\), again doubling the threshold as long as needed.

Algorithm (new). To implement this efficiently, we can again collect small elements via a branchless SIMD implementation. Then we simply make a pass over those elements and update the minimum of each bucket.
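
A scalar sketch of this mod-bucket construction with threshold doubling; the branchless SIMD pre-filter is again left to Section 4:

```rust
/// Mod-bucket sketch: bucket by hash % s, so every bucket minimum is a small
/// value and a single pre-filter pass catches all candidates.
fn mod_bucket_sketch(kmers: &[u64], s: usize) -> Vec<u32> {
    let n = kmers.len().max(1) as f64;
    // Largest bucket minimum is around (s ln s)/n * 2^32; start with 2x slack.
    let mut t = (2.0 * s as f64 * (s.max(2) as f64).ln() / n * 2f64.powi(32)) as u64;
    loop {
        let t32 = t.min(u32::MAX as u64) as u32;
        let mut buckets = vec![u32::MAX; s];
        for &kmer in kmers {
            let h = hash32(kmer, 0);
            if h <= t32 {
                let b = h as usize % s;
                buckets[b] = buckets[b].min(h);
            }
        }
        if buckets.iter().all(|&min| min != u32::MAX) || t32 == u32::MAX {
            return buckets; // All buckets filled (or threshold maxed out).
        }
        t *= 2; // Some bucket stayed empty: double the threshold and retry.
    }
}
```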

A drawback is that unlucky inputs may leave some buckets empty. In that case, the threshold is doubled, in the worst case until it reaches \(2^{32}\), at which point the pre-filter step becomes useless. But this should hopefully be rare.

TODO 2.7 SetSketch

Ertl (2021) seems more complicated and I haven’t properly read it yet. It incorporates ideas of HyperLogLog into sketching.

2.8 Variants

Instead of hashing all k-mers, it’s also possible to only hash the minimizers, as done by mashmap (Jain, Dilthey, et al. 2018) and fastANI (Jain, Rodriguez-R, et al. 2018).

Another variant is to apply the bottom-\(s\) sketch to only the k-mers with a hash \(0\pmod m\) (Broder 1997).

Another extension is to weighted input sets, e.g. BagMinHash and ProbMinHash (Ertl 2018, 2020).

3 $b$-bit MinHash

So far, we have been storing full 32-bit hash values. We typically know the approximate size of the smallest elements, and thus the high bits do not give much information. Thus, we can store only the low \(b\) bits of each hash value, for example only the bottom 16 bits of each 32-bit value (Li and König 2010), or even the bottom 1 or 2 bits only. In that case the sketch size has to increase (by e.g. a factor \(3\)) to ensure the variance is low, but overall this then still saves \(10\times\) over a 32-bit representation.

Apart from only being smaller, this also allows significantly faster comparison of sketches, since less data has to be processed.

To compare arbitrary $b$-bit sketches efficiently, a list of 64 values can be transposed and stored as \(b\) 64-bit values instead. Then, after some xor and and/or instructions, a popcount can count the number of equal values.
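
A reference version of this comparison, with sketches stored as \(b\) bit planes per block of 64 values; the transpose shown here is the simple bit-by-bit one, not an optimized one:

```rust
/// Transpose 64 b-bit values into b 64-bit planes: plane p holds bit p of
/// each of the 64 values.
fn transpose<const B: usize>(values: &[u32; 64]) -> [u64; B] {
    let mut planes = [0u64; B];
    for (i, &v) in values.iter().enumerate() {
        for p in 0..B {
            planes[p] |= (((v >> p) & 1) as u64) << i;
        }
    }
    planes
}

/// Count positionally equal values between two transposed sketches: two
/// values are equal iff all b of their bit planes agree.
fn count_equal<const B: usize>(a: &[[u64; B]], b: &[[u64; B]]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(pa, pb)| {
            let mut eq = u64::MAX;
            for p in 0..B {
                eq &= !(pa[p] ^ pb[p]); // Keep positions where bit p agrees.
            }
            eq.count_ones()
        })
        .sum()
}
```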

4 SimdSketch

SimdSketch currently implements the bottom-\(s\) and mod-bucket sketches.

In the implementation, we use the packed-seq crate to efficiently iterate over 8 chunks of a sequence in parallel using SIMD. We reuse parts of the simd-minimizers crate for efficient ntHashing of the k-mers.

If we have a u32x8 SIMD element of 8 32-bit hashes, we can compare each lane to the threshold \(T\). We efficiently append the subset of elements that are smaller to a vector using a technique of Daniel Lemire.
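
A scalar stand-in for that append step: the SIMD version compares 8 lanes at once and compacts the surviving lanes through a shuffle table, but the branchless principle is the same:

```rust
/// Branchless filtered append: write unconditionally into a pre-allocated
/// buffer, and advance the length only when the element is kept.
fn append_below(buf: &mut [u32], len: &mut usize, h: u32, t: u32) {
    buf[*len] = h;
    *len += (h < t) as usize;
}
```

With `buf` pre-allocated to the input length, the loop over all hashes contains no data-dependent branch at all.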

5 Evaluation

tools:

  • simd-sketch bottom/bucket variants
  • bindash-2
  • bindash-rs
  • probminhash

metrics:

  • total sketch size
  • sketch throughput
  • comparison throughput

input:

  • ~1000 bacterial genomes of a few MB each?
  • try both small \(s=100\) and large \(s=10000\).

6 TODO

References

Broder, A.Z. 1997. “On the Resemblance and Containment of Documents.” Proceedings. Compression and Complexity of Sequences 1997 (Cat. No. 97TB100171). https://doi.org/10.1109/sequen.1997.666900.
Broder, Andrei Z., Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. “Syntactic Clustering of the Web.” Computer Networks and ISDN Systems 29 (8–13): 1157–66. https://doi.org/10.1016/s0169-7552(97)00031-7.
Cohen, Edith. 1997. “Size-Estimation Framework with Applications to Transitive Closure and Reachability.” Journal of Computer and System Sciences 55 (3): 441–53. https://doi.org/10.1006/jcss.1997.1534.
———. 2014. “Min-Hash Sketches.” In Encyclopedia of Algorithms, 1–7. Springer US. https://doi.org/10.1007/978-3-642-27848-8_573-1.
Dahlgaard, Soren, Mathias Bak Tejs Knudsen, and Mikkel Thorup. 2017. “Fast Similarity Sketching.” In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), 663–71. IEEE. https://doi.org/10.1109/focs.2017.67.
Ertl, Otmar. 2017. “SuperMinHash - a New Minwise Hashing Algorithm for Jaccard Similarity Estimation.” arXiv. https://doi.org/10.48550/ARXIV.1706.05698.
———. 2018. “BagMinHash - Minwise Hashing Algorithm for Weighted Sets.” In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1368–77. KDD ’18. ACM. https://doi.org/10.1145/3219819.3220089.
———. 2020. “ProbMinHash – a Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity.” IEEE Transactions on Knowledge and Data Engineering, 1. https://doi.org/10.1109/tkde.2020.3021176.
———. 2021. “SetSketch: Filling the Gap between MinHash and HyperLogLog.” Proceedings of the VLDB Endowment 14 (11): 2244–57. https://doi.org/10.14778/3476249.3476276.
Flajolet, Philippe, and G. Nigel Martin. 1985. “Probabilistic Counting Algorithms for Data Base Applications.” Journal of Computer and System Sciences 31 (2): 182–209. https://doi.org/10.1016/0022-0000(85)90041-8.
Heintze, Nevin. 1996. “Scalable Document Fingerprinting.” In 2nd Usenix Workshop on Electronic Commerce (Ec 96). Oakland, CA: USENIX Association. https://www.usenix.org/conference/2nd-usenix-workshop-electronic-commerce/scalable-document-fingerprinting.
Irber, Luiz, Phillip T Brooks, Taylor Reiter, N Tessa Pierce-Ward, Mahmudur Rahman Hera, David Koslicki, and C Titus Brown. 2022. “Lightweight Compositional Analysis of Metagenomes with FracMinHash and Minimum Metagenome Covers.” bioRxiv. https://doi.org/10.1101/2022.01.11.475838.
Jain, Chirag, Alexander Dilthey, Sergey Koren, Srinivas Aluru, and Adam M. Phillippy. 2018. “A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases.” Journal of Computational Biology 25 (7): 766–79. https://doi.org/10.1089/cmb.2018.0036.
Jain, Chirag, Luis M. Rodriguez-R, Adam M. Phillippy, Konstantinos T. Konstantinidis, and Srinivas Aluru. 2018. “High Throughput ANI Analysis of 90k Prokaryotic Genomes Reveals Clear Species Boundaries.” Nature Communications 9 (1). https://doi.org/10.1038/s41467-018-07641-9.
Li, Ping, Art Owen, and Cun-hui Zhang. 2012. “One Permutation Hashing.” In Advances in Neural Information Processing Systems, edited by F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger. Vol. 25. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2012/file/eaa32c96f620053cf442ad32258076b9-Paper.pdf.
Li, Ping, and Christian König. 2010. “B-Bit Minwise Hashing.” In Proceedings of the 19th International Conference on World Wide Web. WWW ’10. ACM. https://doi.org/10.1145/1772690.1772759.
Mai, Tung, Anup Rao, Matt Kapilevich, Ryan Rossi, Yasin Abbasi-Yadkori, and Ritwik Sinha. 2020. “On Densification for Minwise Hashing.” In Proceedings of the 35th Uncertainty in Artificial Intelligence Conference, edited by Ryan P. Adams and Vibhav Gogate, 115:831–40. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v115/mai20a.html.
Shrivastava, Anshumali. 2017. “Optimal Densification for Fast and Accurate Minwise Hashing.” In Proceedings of the 34th International Conference on Machine Learning, edited by Doina Precup and Yee Whye Teh, 70:3154–63. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v70/shrivastava17a.html.
Shrivastava, Anshumali, and Ping Li. 2014a. “Densifying One Permutation Hashing via Rotation for Fast near Neighbor Search.” In Proceedings of the 31st International Conference on Machine Learning, edited by Eric P. Xing and Tony Jebara, 32:557–65. Proceedings of Machine Learning Research 1. Beijing, China: PMLR. https://proceedings.mlr.press/v32/shrivastava14.html.
———. 2014b. “Improved Densification of One Permutation Hashing.” arXiv. https://doi.org/10.48550/ARXIV.1406.4784.