Practical selection and sampling schemes

This post introduces some new practical sampling schemes. It builds on:

1 Sampling schemes

1.1 Definitions and background

A sampling scheme is simply a function \(f: \Sigma^{w+k-1} \to [w]\). For a window of length \(w+k-1\), containing \(w\) k-mers, it samples the starting position of one of the k-mers.

The density of a sampling scheme is the expected fraction of distinct positions sampled over an infinitely long random string.

A sampling scheme is forward when the absolute sampled position never decreases as we slide the window over a sequence.

The random minimizer has a density of \(2/(w+1)\).

A trivial lower bound on the density is \(1/w\).

Previous sampling schemes are miniception and the double-decycling based minimizer.

Figure 1: Some previous sampling schemes, and the (1/w) lower bound. We use alphabet size (4), and window size (24). ((t) mentioned in the title is unused here.)

Figure 1: Some previous sampling schemes, and the (1/w) lower bound. We use alphabet size (4), and window size (24). ((t) mentioned in the title is unused here.)

1.2 Mod-minimizer

The mod-minimizer is a scheme that asymptotically achieves optimal density \(1/w\), and generally outperforms other schemes when \(k>w\).

It works by setting a small parameter \(t = (k\bmod w)\) (with the restriction that \(t\geq r:=4\), so that there are few duplicate t-mers). Then, the smallest t-mer in the window is found at position \(0\leq x< w+k-t\). Using this, the k-mer at position \(p = (x\bmod w)\) is sampled.

Figure 2: The mod-minimizer is much better for large (k), and asymptotically optimal.

Figure 2: The mod-minimizer is much better for large (k), and asymptotically optimal.

1.3 Forward scheme lower bound

Together with Bryce, we significantly improved the lower bound on the best possible density of forward schemes. In fact, we prove that when \(k\equiv 1\pmod w\) and \(\sigma\to\infty\), the mod-minimizer is optimal.

Figure 3: The lower bound is quite close to the mod-minimizer!

Figure 3: The lower bound is quite close to the mod-minimizer!

1.4 Open syncmer minimizer

Daniel Liu came up with the idea to build minimizer schemes on top of syncmers Edgar (2021). Given a parameter \(t\) (we use \(t=4\) when \(\sigma=4\)), we can hash all t-mers inside a k-mer. For our purposes, the k-mer is an open syncmer (Edgar 2021) when the smallest t-mer inside it is in the middle (at offset/position \(\lfloor(k-t)/2\rfloor\))1, following Shaw and Yu (2021) who show that this using the middle position is best for conservation. In particular, open syncmers have the property that they are never close to each other. In each window, the open minimizer simply takes the smallest k-mer2 that is an open syncmer (or the smallest kmer if there is no open syncmer).

Figure 4: For small alphabet (sigma=4), the open minimizer performs nearly as good as decycling minimizer (not shown), and slightly worse than double decycling minimizers. For large alphabet, the open syncmer minimizer performs very similar to (single) decycling.

Figure 4: For small alphabet (sigma=4), the open minimizer performs nearly as good as decycling minimizer (not shown), and slightly worse than double decycling minimizers. For large alphabet, the open syncmer minimizer performs very similar to (single) decycling.

1.5 Open-closed minimizer

Then Daniel extended this to the open-closed minimizer: If there is an open syncmer inside the window, prefer the one with the smallest t-mer. Otherwise, take a closed syncmer, i.e., a k-mer whose smallest contained t-mer is at the start or end. Otherwise, just take the smallest k-mer.

Figure 5: The open-closed minimizer improves the open minimizer, and (for large alphabets) performs very similar to double decycling for (k<w). For (k>w), it outperforms double decycling.

Figure 5: The open-closed minimizer improves the open minimizer, and (for large alphabets) performs very similar to double decycling for (k<w). For (k>w), it outperforms double decycling.

1.6 New: General mod-minimizer

Looking at the figure above, one wonders if the smoothness of the methods that perform well for \(k<w\) can be incorporated into the asymptotically optimal step-wise behaviour of the mod-minimizer. Indeed, this is possible!

The current mod-minimizer basically sets \(t=(k\bmod w)\) and then samples the smallest t-mer (by a random hash). Instead, we could sample the t-mer according to any other scheme, and in particular we can sample the t-mer via the open-closed minimizer scheme.

Figure 6: The open-closed mod-minimizer performs great both for small (k) and large (k).

Figure 6: The open-closed mod-minimizer performs great both for small (k) and large (k).

1.7 Variant: Open-closed minimizer using offsets

We can also make the following variant on the OC-minimizer that performs slightly better when \(k\) is just below a multiple of \(w\).

  1. Choose the offset \(o:=\lfloor((k-t)\bmod w)/2\rfloor\).
  2. A k-mer is a ‘open mod-syncmer’ if its smallest contained t-mer is at a position \(x\) with \((x\bmod w)=o\). If there is an open mod-syncmer, take the one with the smallest t-mer hash.
  3. Otherwise, take the smallest k-mer that is a closed syncmer.
  4. Otherwise, return the smallest k-mer.

We can improve slightly more by using the t-mer hash instead of taking the smallest k-mer by k-mer hash. For open mod-syncmers, we can prefer the one with minimal t-mer, and for closed syncmers we can take the one with maximal t-mer.

Figure 7: The open-closed mod-offset-minimizer that breaks ties using t-mers is often slightly better. Especially just below (k=2w).

Figure 7: The open-closed mod-offset-minimizer that breaks ties using t-mers is often slightly better. Especially just below (k=2w).

2 Selection schemes

Before looking at more sampling schemes, we will now first consider some selection schemes.

2.1 Definition

While a sampling scheme selects a k-mer from a window, a selection scheme only selects a position, and is given by a function \(f: \Sigma^w \to [w]\) (Zheng, Kingsford, and Marçais 2021).

All the sampling schemes seen so far can be seen as selection schemes as well, but they are inefficient because they never sample the last \(k-1\) positions. Proper sampling schemes do not have this restriction.

2.2 Bd-anchors

One sampling scheme is bidirectional anchors (Loukides and Pissis 2021; Loukides, Pissis, and Sweering 2023). Given a window of \(w\) characters, this is simply the starting position of its smallest rotation. One drawback though is that as we shift the window through a sequence, the characters at the front can unpredictably influence whether the rotation starting at the last position is small or not. Thus, to improve the density, the rotations starting in the last \(r\) positions are excluded.

Figure 8: Bd-anchors need a parameter (r) that grows roughly as (log_sigma(w)), but are never quite optimal.

Figure 8: Bd-anchors need a parameter (r) that grows roughly as (log_sigma(w)), but are never quite optimal.

2.3 New: Smallest unique substring anchors

To avoid this instability of bd-anchors, we can simply only look for the smallest suffix instead of the smallest rotation. To improve stability, we require this suffix to be unique. That is, in the string abbab, the suffix ab is not unique, and hence the smallest suffix starts at the first a. Thus, we search for the smallest unique suffix, and some prefix of that is the smallest unique substring. Thus, we call these sus-anchors3.

Figure 9: Sus-anchors are parameter-free and usually better than bd-anchors.

Figure 9: Sus-anchors are parameter-free and usually better than bd-anchors.

2.4 New: Anti lexicographic sorting

One drawback of taking the lexicographic smallest substring is that suffixes of small substrings are also small. In particular, when a window starts with aaabb... as a SUS, after shifting the window by one position, there is a relatively large probability that aabb... will remain the smallest SUS. But for purposes of having a low density of sampled positions, we especially want to avoid sampling consecutive positions.

After some fiddling, it turns out that we can adjust the definition of ‘smallest’. Instead of taking the lexicographically smallest substring, we can first ‘invert’ the first character of the substring (as in, replace \(c\) by \(\sigma-1-c\)), and then compare substrings. This way, the smallest substring will look like zaaaa..., and after shifting one position, the smallest substring will jump to another occurrence of z (or y if there is no z), instead of starting at the next a.4

Figure 10: When doing a ‘anti’ lexicographic minimum (‘scrambled’ in the legend), sus-anchors are surprisingly close to optimal.

Figure 10: When doing a ‘anti’ lexicographic minimum (‘scrambled’ in the legend), sus-anchors are surprisingly close to optimal.

Figure 11: In the previous figure I was using the simplified bound of Theorem 1 of (Kille et al. 2024). Using the more precise version instead, we see that also for small (w), this anti lexicographic sort is close to optimal. I enlarged it so you can see how the blue and red overlap.

Figure 11: In the previous figure I was using the simplified bound of Theorem 1 of (Kille et al. 2024). Using the more precise version instead, we see that also for small (w), this anti lexicographic sort is close to optimal. I enlarged it so you can see how the blue and red overlap.

Figure 12: For alphabet (sigma=3), anti lexicographic sus-anchors are also very close to optimal.

Figure 12: For alphabet (sigma=3), anti lexicographic sus-anchors are also very close to optimal.

Figure 13: For alphabet (sigma=2), there is a bit of a gap towards optimality for (6leq wleq 18). Curiously, the gap appears much smaller both for small (w) and larger (w).

Figure 13: For alphabet (sigma=2), there is a bit of a gap towards optimality for (6leq wleq 18). Curiously, the gap appears much smaller both for small (w) and larger (w).

One of the reasons that this scheme can perform so well for \(k=1\) is that it is not, in fact, a minimizer scheme, but ‘only’ a sampling scheme. Minimizer schemes are those sampling schemes that take the smallest k-mer according to some order. All sampling schemes seen so far are indeed minimizer schemes, while the sus-anchors are not: even though \(k=1\), they use the surrounding context of each character to determine it’s order.

3 More sampling schemes

3.1 Anti-lex sus-anchors

The anti-lex sus-anchors are not limited to \(k=1\), and also work well for slightly larger \(k\).

Figure 14: The anti-lex sus-anchors are near-optimal for (k) up to (3), unlike any other scheme so far. We also use them in combination with the mod-minimizer.

Figure 14: The anti-lex sus-anchors are near-optimal for (k) up to (3), unlike any other scheme so far. We also use them in combination with the mod-minimizer.

3.2 Threshold anchors

Let’s try to understand why the anti-lex sus-anchors are not as good for larger \(k\). For a window size \(w\), we expect to see each string of length \(c=\log_\sigma w\) once on average. Thus, we expect the anti-lexicographic smallest string to start with a z followed by \(c-1\) a’s. This means that only the first \(\approx c\) characters of each k-mer contribute to its ‘value’ in determining whether it’s the smallest one. Clearly, to achieve optimal density, we must use all \(k\) characters, and not just the first \(c\). In a way, the first few characters contain too much entropy, while we want to use all characters.

Thus, we’d like to come up with a scheme that extracts (around \(w\)) entropy from all \(k\) characters.

One way is to artificially reduce the alphabet to for example only a single bit, by splitting it into two halves. Still, this gives \(2^k\) equally likely values, and hence the first \(c_2=\log_2 w\) characters determine the value of the k-mer, which is still sub-linear in \(k\).

So, how can we extract less information from each character? As we know, the entropy of an event that happens with probability \(p\) is \(-p \lg p - (1-p) \lg (1-p)\), which is maximized for \(p=1/2\). Thus, mapping each character to \(0\) or \(1\) with probability not equal to \(1/2\) may improve things.

For the \(\sigma=4\) case, we can simply map ACG to 0 and T to 1, so that \(p=1/4\), and then look for the smallest anti lexicographic substring, that is, a string starting with a 1 followed by as many 0’s as possible.

Figure 15: The anti-lex threshold minimizers are near-optimal for (3leq k leq 6), again unlike any other scheme so far.

Figure 15: The anti-lex threshold minimizers are near-optimal for (3leq k leq 6), again unlike any other scheme so far.

Generally, to match the lower bound, we would like to find a sampling scheme that never selects two k-mers within distance \(k\) of each other (and otherwise has roughly uniform distance between \(k+1\) and \(w\)). Requiring that each k-mer equals 1000..000 satisfies this requirement.

Thus, we would like to make the probability of a 1000..000 k-mer as large as possible, since whenever such a k-mer occurs in the window, we can push the sampled k-mers distance \(k\) away from each other.

The probability that a k-mer has string 1000..000 is \(p\cdot (1-p)^{k-1}\). This probability is maximized by choosing \(p = 1/k\) (which we can do when the alphabet is large), and then equals \[ 1/k \cdot (1-1/k)^{k-1} = 1/(k-1)\cdot (1-1/k)^k \approx 1/(k-1) \cdot 1/e \approx \frac{1}{ke}. \] Thus, the expected number of 1000..000 k-mers in a window is \(w/(ke)\). As \(k\) grows above \(w/e \approx w/3\), this means that not all windows have such a k-mer anymore, and that we potentially loose some performance. And indeed, this method appears to only work up to \(k=6\leq 24/e\).

The main bottleneck is that for \(k\approx w/e\), the probability of having not a single 1 is around \((1-1/w)^w \approx 1/e \approx 0.37\). In these cases, we can fall back to sampling a random smallest k-mer, but this quickly destroys the performance. Thus, in practice I simply use \(p=1/4\) so far, which in practice leads to there always being a 1.

TODO: Find better tiebreaking rules, and investigate more choices of \(p\).

Another potential improvement to extract less entropy from each character, while still having a sufficiently large probability of a 10000 k-mer occurring, could be to require that the first two characters sum to at least some threshold \(T\), while all next chunks of two characters sum to \(<T\).

3.3 The $t$-gap disappears for large alphabets

One issue that remains in the plot above is what I will call the $t$-gap: especially for small \(k\), the graphs of all minimizer/syncmer based methods shift \(t-1\) steps to the right compared to the double decycling minimizer. The reason is that by only considering t-mers, we effectively reduce the total number of positions that can be sampled by \(t-1\).

If we increase the alphabet size to \(\sigma=256\), \(t=1\) is sufficient to get mostly unique t-mers. All our new plots shift left by \(t-1\). Now, the OC mod-mini is comparable to double decycling, and also touches the lower bound when \(k=(1\bmod w)\).

Figure 16: For large alphabets, the syncmer based methods can use (t=1) and still have unique t-mers, and their plots shift left to touch the lower bound.

Figure 16: For large alphabets, the syncmer based methods can use (t=1) and still have unique t-mers, and their plots shift left to touch the lower bound.

4 Computing the density of forward schemes

For forward schemes, the density can be computed in multiple ways:

  1. Compute the fraction of charged contexts of size \(w+k\) where the two length-\(w\) windows select a different position.
  2. Compute the fraction of sampled positions over a cyclic De Bruijn sequence of order \(w+k\).
  3. Compute the expected fraction of sampled positions in a random cyclic sequence of length \(w+k\).

Each of these also allows for an approximate variant:

  1. Compute the fraction of charged contexts over a sufficiently large sample of $(w+1)$-mers.
  2. Compute the fraction of sampled positions over a sufficiently long sequence.
  3. Compute the fraction over sufficiently many cyclic $(w+1)$-mers.

4.1 WIP: Anti lexicographic sus-anchor density

It’s not hard to see that sus-anchors are forward. To compute the density, we’ll use the third method above, for \(k=1\). Suppose the smallest (under anti lex sorting) unique substring of a sequence of lowercase characters is simply z, i.e., there is only a single z. In that case, this will be the sus-anchor for all rotations, and only a single position is sampled.

Otherwise, suppose that za is the sus-anchor. Then the rotation a...z does not contain za and will sample some other position, and thus, two positions are sampled. That is still in line with the \(\lceil2/(w+1)\rceil\) lower-bound we are approaching.

If the second smallest unique substring (say Y) overlaps the smallest unique substring in at most one character, then one of these will always be fully present and only two positions are sampled.

The bad case where three or more positions are sampled. Let’s consider when this can happen.

Suppose the SUS is zX for some sequence X of length at least two. Then zX will be smallest for all but the rotations of the form X2...zX1 where X = X1X2, with X2 non-empty.

5 Open questions

  • Can we use sus-anchors instead of t-mer minimizers in OC mod-minimizers to close the remaining $t$-gap?
  • What is the exact density of sus-anchors? Can we prove its near-optimality.

References

Edgar, Robert. 2021. “Syncmers Are More Sensitive than Minimizers for Selecting Conserved K‑Mers in Biological Sequences.” Peerj 9 (February): e10805. https://doi.org/10.7717/peerj.10805.
Groot Koerkamp, Ragnar, and Giulio Ermanno Pibiri. 2024. “The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers.” In 24th International Workshop on Algorithms in Bioinformatics (Wabi 2024), edited by Solon P. Pissis and Wing-Kin Sung, 312:11:1–11:23. Leibniz International Proceedings in Informatics (Lipics). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPIcs.WABI.2024.11.
Kille, Bryce, Ragnar Groot Koerkamp, Drake McAdams, Alan Liu, and Todd Treangen. 2024. “A near-Tight Lower Bound on the Density of Forward Sampling Schemes.” Biorxiv. https://doi.org/10.1101/2024.09.06.611668.
Loukides, Grigorios, Solon P. Pissis, and Michelle Sweering. 2023. “Bidirectional String Anchors for Improved Text Indexing and Top-$k$ Similarity Search.” Ieee Transactions on Knowledge and Data Engineering 35 (11): 11093–111. https://doi.org/10.1109/tkde.2022.3231780.
Loukides, Grigorios, and Solon P. Pissis. 2021. “Bidirectional String Anchors: A New String Sampling Mechanism.” Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPICS.ESA.2021.64.
Shaw, Jim, and Yun William Yu. 2021. “Theory of Local K-Mer Selection with Applications to Long-Read Alignment.” Edited by Can Alkan. Bioinformatics 38 (20): 4659–69. https://doi.org/10.1093/bioinformatics/btab790.
Zheng, Hongyu, Carl Kingsford, and Guillaume Marçais. 2021. “Lower Density Selection Schemes via Small Universal Hitting Sets with Short Remaining Path Length.” Journal of Computational Biology 28 (4): 395–409. https://doi.org/10.1089/cmb.2020.0432.

  1. Edgar (2021) first defines open syncmers as having the smallest t-mer at the start, but also introduces offset parameter, which we set to \((k-t)/2\). ↩︎

  2. Smallest with respect to the hash of the central t-mer. ↩︎

  3. I’m not quite sure yet whether to this means smallest unique substring or smallest unique suffix↩︎

  4. This situation reminds of the classic problem to compute the probability of seeing e.g. HH or HT or longer patterns in a series of coin flips. ↩︎