This post introduces some new practical sampling schemes. It builds on:
- The post and paper (Groot Koerkamp and Pibiri 2024) introducing the mod-minimizer.
- The post and paper (Kille et al. 2024) introducing a lower bound on the density of sampling schemes.
1 Sampling schemes
1.1 Definitions and background
A sampling scheme is simply a function \(f: \Sigma^{w+k-1} \to [w]\). For a window of length \(w+k-1\), containing \(w\) k-mers, it samples the starting position of one of the k-mers.
The density of a sampling scheme is the expected fraction of distinct positions sampled over an infinitely long random string.
A sampling scheme is forward when the absolute sampled position never decreases as we slide the window over a sequence.
The random minimizer has a density of \(2/(w+1)\).
A trivial lower bound on the density is \(1/w\).
Previous sampling schemes are miniception and the double-decycling based minimizer.
1.2 Mod-minimizer
The mod-minimizer is a scheme that asymptotically achieves optimal density \(1/w\), and generally outperforms other schemes when \(k>w\).
It works by setting a small parameter \(t = (k\bmod w)\) (with the restriction that \(t\geq r:=4\), so that there are few duplicate t-mers). Then, the smallest t-mer in the window is found at position \(0\leq x< w+k-t\). Using this, the k-mer at position \(p = (x\bmod w)\) is sampled.
1.3 Forward scheme lower bound
Together with Bryce, we significantly improved the lower bound on the best possible density of forward schemes. In fact, we prove that when \(k\equiv 1\pmod w\) and \(\sigma\to\infty\), the mod-minimizer is optimal.
1.4 Open syncmer minimizer
Daniel Liu came up with the idea to build minimizer schemes on top of syncmers Edgar (2021). Given a parameter \(t\) (we use \(t=4\) when \(\sigma=4\)), we can hash all t-mers inside a k-mer. For our purposes, the k-mer is an open syncmer (Edgar 2021) when the smallest t-mer inside it is in the middle (at offset/position \(\lfloor(k-t)/2\rfloor\))1, following Shaw and Yu (2021) who show that this using the middle position is best for conservation. In particular, open syncmers have the property that they are never close to each other. In each window, the open syncmer minimizer simply takes the smallest k-mer2 that is an open syncmer (or the smallest kmer if there is no open syncmer).
1.5 Open-closed syncmer minimizer
Then Daniel extended this to the open-closed syncmer minimizer: If there is an open syncmer inside the window, prefer the one with the smallest t-mer. Otherwise, take a closed syncmer, i.e., a k-mer whose smallest contained t-mer is at the start or end. Otherwise, just take the smallest k-mer.
1.6 New: Open-closed mod-minimizer
Looking at the figure above, one wonders if the smoothness of the methods that perform well for \(k<w\) can be incorporated into the asymptotically optimal step-wise behaviour of the mod-minimizer. Indeed, we can! The OC-mod-minimizer works as follows:
- Choose the offset \(o:=\lfloor((k-t)\bmod w)/2\rfloor\).
- A k-mer is a ‘open mod-syncmer’ if its smallest contained t-mer is at a position \(x\) with \((x\bmod w)=o\). If there is an open mod-syncmer, take the one with the smallest t-mer hash.
- Otherwise, take the smallest k-mer that is a closed syncmer.
- Otherwise, return the smallest k-mer.
1.7 The $t$-gap
One issue that remains in the plot above is what I will call the $t$-gap: especially for small \(k\), the graph shifts \(t-1\) steps to the right compared to the double decycling minimizer. The reason is that by only considering t-mers, we effectively reduce the total number of positions that can be sampled by \(t-1\).
2 Selection schemes
2.1 Definition
While a sampling scheme selects a k-mer from a window, a selection scheme only selects a position, and is given by a function \(f: \Sigma^w \to [w]\) (Zheng, Kingsford, and Marçais 2021).
All the sampling schemes seen so far can be seen as selection schemes as well, but they are inefficient because they never sample the last \(k-1\) positions. Proper sampling schemes do not have this restriction.
2.2 Bd-anchors
One sampling scheme is bidirectional anchors (Loukides and Pissis 2021; Loukides, Pissis, and Sweering 2023). Given a window of \(w\) characters, this is simply the starting position of its smallest rotation. One drawback though is that as we shift the window through a sequence, the characters at the front can unpredictably influence whether the rotation starting at the last position is small or not. Thus, to improve the density, the rotations starting in the last \(r\) positions are excluded.
2.3 New: Smallest unique substring anchors
To avoid this instability of bd-anchors, we can simply only look for the
smallest suffix instead of the smallest rotation. To improve stability, we
require this suffix to be unique. That is, in the string abbab
, the suffix
ab
is not unique, and hence the smallest suffix starts at the first a
.
Thus, we search for the smallest unique suffix, and some prefix of that is the
smallest unique substring. Thus, we call these sus-anchors3.
2.4 New: Scrambled sort
One drawback of taking the lexicographic smallest substring is that suffixes of
small substrings are also small. In particular, when a window starts with
aaabb...
as a SUS, after shifting the window by one position, there is a
relatively large probability that aabb...
will remain the smallest SUS. But
for purposes of having a low density of sampled positions, we especially want to avoid
sampling consecutive positions.
After some fiddling, it turns out that we can adjust the definition of
‘smallest’. Instead of taking the lexicographically smallest substring, we can first
‘invert’ the first character of the substring (as in, replace \(c\) by \(\sigma-1-c\)), and then compare
substrings. This way, the smallest substring will look like zaaaa...
, and
after shifting one position, the smallest substring will jump to another
occurrence of z
(or y
if there is no z
), instead of starting at the next
a
.4
TODO 2.5 Scrambled sus-anchor density
It’s not hard to see that sus-anchors are forward. Thus, to compute the density, it suffices to compute the probability that two consecutive windows select different positions. That in turn is only possible if either the first window has its smallest unique substring at the start, or the second window has its smallest unique substring at the end.
3 Open questions
- Can we use sus-anchors instead of t-mer minimizers in OC mod-minimizers to close the remaining $t$-gap?
- What is the exact density of sus-anchors? Can we prove its near-optimality.
References
Edgar (2021) first defines open syncmers as having the smallest t-mer at the start, but also introduces offset parameter, which we set to \((k-t)/2\). ↩︎
Smallest with respect to the hash of the central t-mer. ↩︎
I’m not quite sure yet whether to this means smallest unique substring or smallest unique suffix. ↩︎
This situation reminds of the classic problem to compute the probability of seeing e.g.
HH
orHT
or longer patterns in a series of coin flips. ↩︎