Practical selection and sampling schemes

This post introduces some new practical sampling schemes. It builds on the mod-minimizer (Groot Koerkamp and Pibiri 2024) and the forward-scheme density lower bound (Kille et al. 2024).

1 Sampling schemes

1.1 Definitions and background

A sampling scheme is simply a function \(f: \Sigma^{w+k-1} \to [w]\). For a window of length \(w+k-1\), containing \(w\) k-mers, it samples the starting position of one of the k-mers.

The density of a sampling scheme is the expected fraction of distinct positions sampled over an infinitely long random string.

A sampling scheme is forward when the absolute sampled position never decreases as we slide the window over a sequence.

The random minimizer has a density of \(2/(w+1)\).
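To make the density concrete, here is a small Monte Carlo sketch (the function names and interface are illustrative, not from any library) that estimates the density of a scheme empirically; for the random minimizer it should land near \(2/(w+1)\).

```python
import random

def estimate_density(scheme, w, k, sigma=4, n=100_000, seed=1):
    # sample a long random string and count the distinct positions sampled
    rng = random.Random(seed)
    s = tuple(rng.randrange(sigma) for _ in range(n))
    sampled = set()
    n_windows = n - (w + k - 1) + 1
    for i in range(n_windows):
        # scheme returns an offset in [0, w); add i to make it absolute
        sampled.add(i + scheme(s[i:i + w + k - 1]))
    return len(sampled) / n_windows

def random_minimizer(w, k):
    # position of the k-mer with the smallest hash (leftmost on ties)
    return lambda win: min(range(w), key=lambda i: hash(win[i:i + k]))
```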

A trivial lower bound on the density is \(1/w\).

Previous low-density sampling schemes include miniception and the double decycling-based minimizer.

Figure 1: Some previous sampling schemes, and the \(1/w\) lower bound. We use alphabet size \(\sigma=4\) and window size \(w=24\). (The \(t\) mentioned in the title is unused here.)

1.2 Mod-minimizer

The mod-minimizer is a scheme that asymptotically achieves optimal density \(1/w\), and generally outperforms other schemes when \(k>w\).

It works by choosing a small parameter \(t\equiv k\pmod w\) (with the restriction that \(t\geq r:=4\), so that there are few duplicate t-mers). Then, the smallest t-mer in the window is found, say at position \(0\leq x< w+k-t\). Using this, the k-mer at position \(p = (x\bmod w)\) is sampled.
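The scheme can be sketched in a few lines of Python (a minimal sketch; `hash_fn` and the string-window representation are assumptions, and \(t\) is assumed to have been chosen with \(t\equiv k\pmod w\), \(t\geq 4\)):

```python
def mod_minimizer(window, w, k, t, hash_fn):
    # the window of length w + k - 1 contains w k-mers and w + k - t t-mers
    assert len(window) == w + k - 1
    # x = position of the t-mer with the smallest hash, 0 <= x < w + k - t
    x = min(range(w + k - t), key=lambda i: hash_fn(window[i:i + t]))
    # sample the k-mer at position x mod w
    return x % w
```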

Figure 2: The mod-minimizer is much better for large \(k\), and asymptotically optimal.

1.3 Forward scheme lower bound

Together with Bryce Kille, we significantly improved the lower bound on the best possible density of forward schemes (Kille et al. 2024). In fact, we prove that when \(k\equiv 1\pmod w\) and \(\sigma\to\infty\), the mod-minimizer is optimal.

Figure 3: The lower bound is quite close to the mod-minimizer!

1.4 Open syncmer minimizer

Daniel Liu came up with the idea to build minimizer schemes on top of syncmers (Edgar 2021). Given a parameter \(t\) (we use \(t=4\) when \(\sigma=4\)), we can hash all t-mers inside a k-mer. For our purposes, the k-mer is an open syncmer (Edgar 2021) when the smallest t-mer inside it is in the middle (at offset \(\lfloor(k-t)/2\rfloor\))¹, following Shaw and Yu (2021), who show that using the middle position is best for conservation. In particular, open syncmers have the property that two of them are never close to each other. In each window, the open syncmer minimizer simply takes the smallest k-mer² that is an open syncmer (or the smallest k-mer if there is no open syncmer).
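The open syncmer test itself is simple; a hedged sketch (with an assumed `hash_fn` argument) could look like this:

```python
def is_open_syncmer(kmer, t, hash_fn):
    # open syncmer: the smallest contained t-mer sits in the middle,
    # at offset (k - t) // 2
    k = len(kmer)
    x = min(range(k - t + 1), key=lambda i: hash_fn(kmer[i:i + t]))
    return x == (k - t) // 2
```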

Figure 4: For small alphabet (\(\sigma=4\)), the open syncmer minimizer performs nearly as well as the decycling minimizer (not shown), and slightly worse than the double decycling minimizer. For large alphabets, the open syncmer minimizer performs very similarly to (single) decycling.

1.5 Open-closed syncmer minimizer

Then Daniel extended this to the open-closed syncmer minimizer: If there is an open syncmer inside the window, prefer the one with the smallest t-mer. Otherwise, take a closed syncmer, i.e., a k-mer whose smallest contained t-mer is at the start or end. Otherwise, just take the smallest k-mer.
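The three-level fallback can be sketched as follows (an illustrative sketch, not the authors' implementation; `hash_fn` is an assumed argument, and candidates are ranked by the hash of the relevant t-mer or k-mer):

```python
def oc_syncmer_minimizer(window, w, k, t, hash_fn):
    # position (within each k-mer) of its smallest contained t-mer
    def min_tmer(i):
        return min(range(k - t + 1), key=lambda j: hash_fn(window[i + j:i + j + t]))
    pos = [min_tmer(i) for i in range(w)]
    mid = (k - t) // 2
    # 1. open syncmers: smallest t-mer in the middle
    open_ = [i for i in range(w) if pos[i] == mid]
    if open_:
        # prefer the open syncmer whose middle t-mer hashes smallest
        return min(open_, key=lambda i: hash_fn(window[i + mid:i + mid + t]))
    # 2. closed syncmers: smallest t-mer at the start or end
    closed = [i for i in range(w) if pos[i] in (0, k - t)]
    # 3. otherwise, fall back to the smallest k-mer
    return min(closed or range(w), key=lambda i: hash_fn(window[i:i + k]))
```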

Figure 5: The open-closed syncmer minimizer improves on the open syncmer minimizer and (for large alphabets) performs very similarly to double decycling for \(k<w\). For \(k>w\), it outperforms double decycling.

1.6 New: Open-closed mod-minimizer

Looking at the figure above, one wonders whether the smoothness of the methods that perform well for \(k<w\) can be incorporated into the asymptotically optimal step-wise behaviour of the mod-minimizer. Indeed, it can! The OC-mod-minimizer works as follows:

  1. Choose the offset \(o:=\lfloor((k-t)\bmod w)/2\rfloor\).
  2. A k-mer is an ‘open mod-syncmer’ if its smallest contained t-mer is at a position \(x\) with \((x\bmod w)=o\). If there is an open mod-syncmer, take the one with the smallest t-mer hash.
  3. Otherwise, take the smallest k-mer that is a closed syncmer.
  4. Otherwise, return the smallest k-mer.
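The four steps above can be sketched as follows (a hedged sketch under my reading of step 2, where \(x\) is the t-mer's offset within the k-mer; `hash_fn` is an assumed argument):

```python
def oc_mod_minimizer(window, w, k, t, hash_fn):
    # step 1: the target offset
    o = ((k - t) % w) // 2
    # position (within each k-mer) of its smallest contained t-mer
    def min_tmer(i):
        return min(range(k - t + 1), key=lambda j: hash_fn(window[i + j:i + j + t]))
    pos = [min_tmer(i) for i in range(w)]
    # step 2: open mod-syncmers, ranked by the hash of their smallest t-mer
    open_mod = [i for i in range(w) if pos[i] % w == o]
    if open_mod:
        return min(open_mod, key=lambda i: hash_fn(window[i + pos[i]:i + pos[i] + t]))
    # step 3: closed syncmers (smallest t-mer at the start or end)
    closed = [i for i in range(w) if pos[i] in (0, k - t)]
    # step 4: otherwise, the smallest k-mer
    return min(closed or range(w), key=lambda i: hash_fn(window[i:i + k]))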

Figure 6: The open-closed mod-minimizer performs great both for small \(k\) and large \(k\).

1.7 The $t$-gap

One issue that remains in the plot above is what I will call the $t$-gap: especially for small \(k\), the graph shifts \(t-1\) steps to the right compared to the double decycling minimizer. The reason is that by only considering t-mers, we effectively reduce the total number of positions that can be sampled by \(t-1\).

Figure 7: If we increase the alphabet size to \(\sigma=256\), \(t=1\) is sufficient to get mostly unique t-mers. All our new plots shift left by \(t-1\). Now, the OC mod-minimizer is comparable to double decycling, and also touches the lower bound when \(k\equiv 1\pmod w\).

2 Selection schemes

2.1 Definition

While a sampling scheme selects a k-mer from a window, a selection scheme only selects a position, and is given by a function \(f: \Sigma^w \to [w]\) (Zheng, Kingsford, and Marçais 2021).

All the sampling schemes seen so far can be seen as selection schemes as well, but they are inefficient because they never sample the last \(k-1\) positions. Proper selection schemes do not have this restriction.

2.2 Bd-anchors

One selection scheme is bidirectional anchors (Loukides and Pissis 2021; Loukides, Pissis, and Sweering 2023). Given a window of \(w\) characters, this is simply the starting position of its smallest rotation. One drawback, though, is that as we shift the window through a sequence, the characters at the front can unpredictably influence whether the rotation starting at the last position is small or not. Thus, to improve the density, the rotations starting in the last \(r\) positions are excluded.
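A minimal sketch of a bd-anchor (quadratic for clarity; the papers give linear-time algorithms):

```python
def bd_anchor(window, r):
    # start of the lexicographically smallest rotation,
    # excluding rotations starting in the last r positions
    w = len(window)
    return min(range(w - r), key=lambda i: window[i:] + window[:i])
```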

Figure 8: Bd-anchors need a parameter \(r\) that grows roughly as \(\log_\sigma(w)\), but are never quite optimal.

2.3 New: Smallest unique substring anchors

To avoid this instability of bd-anchors, we can simply look for the smallest suffix instead of the smallest rotation. To improve stability, we require this suffix to be unique. That is, in the string abbab, the suffix ab is not unique, and hence the smallest unique suffix starts at the first a. Thus, we search for the smallest unique suffix; some prefix of that suffix is the smallest unique substring, and so we call these sus-anchors³.
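A brute-force sketch (again quadratic for clarity; uniqueness counts overlapping occurrences, and the full window always occurs exactly once, so a candidate always exists):

```python
def sus_anchor(window):
    # position of the lexicographically smallest unique suffix;
    # unique = occurring exactly once as a substring (counting overlaps)
    w = len(window)
    def occurrences(p):
        return sum(window.startswith(p, j) for j in range(w))
    candidates = [i for i in range(w) if occurrences(window[i:]) == 1]
    return min(candidates, key=lambda i: window[i:])
```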

Figure 9: Sus-anchors are parameter-free and usually better than bd-anchors.

2.4 New: Scrambled sort

One drawback of taking the lexicographically smallest substring is that suffixes of small substrings are also small. In particular, when a window has aaabb... as its SUS, then after shifting the window by one position, there is a relatively large probability that aabb... will be the new SUS. But for the purpose of a low density of sampled positions, we especially want to avoid sampling consecutive positions.

After some fiddling, it turns out that we can adjust the definition of ‘smallest’. Instead of taking the lexicographically smallest substring, we can first ‘invert’ the first character of the substring (as in, replace \(c\) by \(\sigma-1-c\)), and then compare substrings. This way, the smallest substring will look like zaaaa..., and after shifting one position, the smallest substring will jump to another occurrence of z (or y if there is no z), instead of starting at the next a.⁴
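Only the comparison key changes relative to plain sus-anchors; a sketch over integer alphabets (an illustrative brute-force version, not the post's implementation):

```python
def scrambled_sus_anchor(window, sigma):
    # window is a tuple of integers in [0, sigma)
    w = len(window)
    def occurrences(p):
        return sum(window[j:j + len(p)] == p for j in range(w - len(p) + 1))
    def key(i):
        s = window[i:]
        # 'invert' the first character: c -> sigma - 1 - c, so the
        # smallest substring looks like 'zaaa...' rather than 'aaaa...'
        return (sigma - 1 - s[0],) + s[1:]
    candidates = [i for i in range(w) if occurrences(window[i:]) == 1]
    return min(candidates, key=key)
```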

Figure 10: When doing a ‘scrambled lexmin’, sus-anchors are surprisingly close to optimal.

Figure 11: In the previous figure I was using the simplified bound of Theorem 1 of Kille et al. (2024). Using the more precise version instead, we see that also for small \(w\), this scrambled sort is close to optimal. I enlarged it so you can see how the blue and red overlap.

Figure 12: For alphabet \(\sigma=3\), scrambled sus-anchors are also very close to optimal.

Figure 13: For alphabet \(\sigma=2\), there is a bit of a gap towards optimality for \(6\leq w\leq 18\). Curiously, the gap appears much smaller both for small \(w\) and larger \(w\).

TODO 2.5 Scrambled sus-anchor density

It’s not hard to see that sus-anchors are forward. Thus, to compute the density, it suffices to compute the probability that two consecutive windows select different positions. That in turn is only possible if either the first window has its smallest unique substring at the start, or the second window has its smallest unique substring at the end.
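The forward property already gives an empirical handle on the density: it equals the probability that two consecutive windows select different absolute positions. A sketch (illustrative names; `scheme` maps a window of \(w\) characters to an offset):

```python
import random

def forward_density(scheme, w, sigma=256, n=50_000, seed=7):
    # for a forward scheme, density = P(two consecutive windows
    # select different absolute positions)
    rng = random.Random(seed)
    s = tuple(rng.randrange(sigma) for _ in range(n))
    windows = n - w
    changes = sum(
        i + scheme(s[i:i + w]) != i + 1 + scheme(s[i + 1:i + 1 + w])
        for i in range(windows)
    )
    return changes / windows
```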

3 Open questions

  • Can we use sus-anchors instead of t-mer minimizers in OC mod-minimizers to close the remaining $t$-gap?
  • What is the exact density of sus-anchors? Can we prove its near-optimality?

References

Edgar, Robert. 2021. “Syncmers Are More Sensitive than Minimizers for Selecting Conserved K‑Mers in Biological Sequences.” Peerj 9 (February): e10805. https://doi.org/10.7717/peerj.10805.
Groot Koerkamp, Ragnar, and Giulio Ermanno Pibiri. 2024. “The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers.” In 24th International Workshop on Algorithms in Bioinformatics (Wabi 2024), edited by Solon P. Pissis and Wing-Kin Sung, 312:11:1–11:23. Leibniz International Proceedings in Informatics (Lipics). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPIcs.WABI.2024.11.
Kille, Bryce, Ragnar Groot Koerkamp, Drake McAdams, Alan Liu, and Todd Treangen. 2024. “A near-Tight Lower Bound on the Density of Forward Sampling Schemes.” Biorxiv. https://doi.org/10.1101/2024.09.06.611668.
Loukides, Grigorios, Solon P. Pissis, and Michelle Sweering. 2023. “Bidirectional String Anchors for Improved Text Indexing and Top-$k$ Similarity Search.” Ieee Transactions on Knowledge and Data Engineering 35 (11): 11093–111. https://doi.org/10.1109/tkde.2022.3231780.
Loukides, Grigorios, and Solon P. Pissis. 2021. “Bidirectional String Anchors: A New String Sampling Mechanism.” Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPICS.ESA.2021.64.
Shaw, Jim, and Yun William Yu. 2021. “Theory of Local K-Mer Selection with Applications to Long-Read Alignment.” Edited by Can Alkan. Bioinformatics 38 (20): 4659–69. https://doi.org/10.1093/bioinformatics/btab790.
Zheng, Hongyu, Carl Kingsford, and Guillaume Marçais. 2021. “Lower Density Selection Schemes via Small Universal Hitting Sets with Short Remaining Path Length.” Journal of Computational Biology 28 (4): 395–409. https://doi.org/10.1089/cmb.2020.0432.

  1. Edgar (2021) first defines open syncmers as having the smallest t-mer at the start, but also introduces an offset parameter, which we set to \(\lfloor(k-t)/2\rfloor\). ↩︎

  2. Smallest with respect to the hash of the central t-mer. ↩︎

  3. I’m not quite sure yet whether this means smallest unique substring or smallest unique suffix. ↩︎

  4. This situation is reminiscent of the classic problem of computing the probability of seeing e.g. HH or HT or longer patterns in a series of coin flips. ↩︎