This post introduces some new practical sampling schemes. It builds on:
- The post and paper (Groot Koerkamp and Pibiri 2024) introducing the mod-minimizer.
- The post and paper (Kille et al. 2024) introducing a lower bound on the density of sampling schemes.
1 Sampling schemes
1.1 Definitions
A sampling scheme is simply a function \(f: \Sigma^{w+k-1} \to [w]\). For a window of length \(w+k-1\), containing \(w\) k-mers, it samples the starting position of one of the k-mers.
The density of a sampling scheme is the expected fraction of distinct positions sampled over an infinitely long random string.
A sampling scheme is forward when the absolute sampled position never decreases as we slide the window over a sequence.
A trivial lower bound on the density is \(1/w\).
The random minimizer chooses the k-mer \(X\) with the smallest (pseudo-random) hash \(h(X)\), and has a density of \(2/(w+1)\).
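As a minimal sketch (in Rust), with an FNV-style hash standing in for the pseudo-random hash \(h\); any hash works here:

```rust
// A stand-in pseudo-random hash on k-mers (FNV-1a style).
fn h(s: &[u8]) -> u64 {
    s.iter().fold(0xcbf29ce484222325u64, |acc, &c| {
        (acc ^ c as u64).wrapping_mul(0x100000001b3)
    })
}

/// Returns the position in [0, w) of the k-mer with the smallest hash
/// in a window of length w + k - 1 (ties broken towards the left).
fn random_minimizer(window: &[u8], k: usize) -> usize {
    let w = window.len() + 1 - k;
    (0..w).min_by_key(|&i| h(&window[i..i + k])).unwrap()
}
```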
Double decycling (Pellow et al. 2023) achieves a lower density than the random minimizer.
1.2 Miniception
Miniception (Zheng, Kingsford, and Marçais 2020) samples the smallest closed syncmer (defined in section 1.6 below) from each window.
1.3 Mod-minimizer
The mod-minimizer is a scheme that asymptotically achieves optimal density \(1/w\), and generally outperforms other schemes when \(k>w\).
It works by choosing a small parameter \(t\equiv k\pmod w\) (with the restriction that \(t\geq r:=4\), so that there are few duplicate t-mers). Then, the smallest t-mer in the window is found at some position \(0\leq x< w+k-t\). Using this, the k-mer at position \(p = (x\bmod w)\) is sampled.
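A minimal sketch, reusing the hash `h` from above and assuming \(k\geq r\):

```rust
/// A sketch of the mod-minimizer; returns the sampled k-mer's position.
fn mod_minimizer(window: &[u8], k: usize, r: usize) -> usize {
    let w = window.len() + 1 - k;
    // The smallest t >= r with t ≡ k (mod w).
    let t = r + (k - r) % w;
    // Position x of the smallest t-mer among the w + k - t t-mers.
    let x = (0..w + k - t)
        .min_by_key(|&i| h(&window[i..i + t]))
        .unwrap();
    // Sample the k-mer at x mod w.
    x % w
}
```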
1.4 Forward scheme lower bound
Together with Bryce, we significantly improved the lower bound on the best possible density of forward schemes (Kille et al. 2024). In fact, we prove that when \(k\equiv 1\pmod w\) and \(\sigma\to\infty\), the mod-minimizer is optimal.
In simplified form, the lower bound is \[ d(f)\geq \frac{\left\lceil\frac{w+k}{w}\right\rceil}{w+k}. \] For \(k<w\) this simplifies to \[ d(f)\geq \frac{2}{w+k}, \] and in this case optimal schemes must never sample k-mers at distance less than \(k\).
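For example, for \(w=4\) and \(k=13\) (so \(k\equiv 1\pmod w\)), the bound gives
\[
d(f)\geq \frac{\left\lceil 17/4\right\rceil}{17} = \frac{5}{17}\approx 0.294,
\]
noticeably above the trivial bound \(1/w=0.25\); by the result above, the mod-minimizer attains this as \(\sigma\to\infty\).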
1.5 Open syncmer minimizer
Daniel Liu came up with the idea of building minimizer schemes on top of syncmers (Edgar 2021). Given a parameter \(t\) (we use \(t=4\) when \(\sigma=4\)), we can hash all t-mers inside a k-mer. For our purposes, the k-mer is an open syncmer (Edgar 2021) when the smallest t-mer inside it is in the middle (at offset/position \(\lfloor(k-t)/2\rfloor\))[1], following Shaw and Yu (2021), who show that using the middle position is best for conservation. In particular, open syncmers have the property that they are never close to each other. In each window, the open minimizer simply takes the smallest k-mer[2] that is an open syncmer (or the smallest k-mer if there is no open syncmer).
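A sketch of the open-syncmer test, again reusing the hypothetical hash `h`:

```rust
/// A k-mer is an open syncmer when its smallest contained t-mer (by hash)
/// sits exactly in the middle, at offset (k - t) / 2.
fn is_open_syncmer(kmer: &[u8], t: usize) -> bool {
    let k = kmer.len();
    let x = (0..=k - t).min_by_key(|&j| h(&kmer[j..j + t])).unwrap();
    x == (k - t) / 2
}
```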
1.6 Open-closed minimizer
Then Daniel extended this to the open-closed minimizer: If there is an open syncmer inside the window, prefer the one with the smallest t-mer. Otherwise, take a closed syncmer, i.e., a k-mer whose smallest contained t-mer is at the start or end. Otherwise, just take the smallest k-mer.
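One way to phrase this as a single priority order is sketched below. The exact tie-breaking within each tier is a design choice; footnote [2] suggests comparing by the hash of the central t-mer, which for an open syncmer is exactly its smallest t-mer, so the tier-0 key does both at once:

```rust
/// A sketch of the open-closed minimizer as a single priority order;
/// the position with the smallest key is sampled.
fn oc_minimizer(window: &[u8], k: usize, t: usize) -> usize {
    let w = window.len() + 1 - k;
    (0..w)
        .min_by_key(|&i| {
            let kmer = &window[i..i + k];
            // Position of the smallest t-mer inside this k-mer.
            let x = (0..=k - t).min_by_key(|&j| h(&kmer[j..j + t])).unwrap();
            let tier = if x == (k - t) / 2 {
                0 // open syncmer: compare by (central) t-mer hash
            } else if x == 0 || x == k - t {
                1 // closed syncmer
            } else {
                2 // fallback: smallest k-mer
            };
            let key = if tier == 0 { h(&kmer[x..x + t]) } else { h(kmer) };
            (tier, key)
        })
        .unwrap()
}
```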
1.7 New: General mod-minimizer
Looking at the figure above, one wonders if the smoothness of the methods that perform well for \(k<w\) can be incorporated into the asymptotically optimal step-wise behaviour of the mod-minimizer. Indeed, this is possible!
The current mod-minimizer basically sets \(t\equiv k\pmod w\) and then samples the smallest t-mer (by a random hash). Instead, we could select the t-mer according to any other scheme, and in particular we can select the t-mer via the open-closed minimizer scheme.
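As a sketch, the construction is just the mod-minimizer with the t-mer selection abstracted out:

```rust
/// A sketch of the general mod-minimizer: any scheme selecting a t-mer
/// position x in the full window (among the w + k - t t-mer positions)
/// can be plugged in; the k-mer at x mod w is then sampled.
fn general_mod_minimizer(
    window: &[u8],
    k: usize,
    select_tmer: impl Fn(&[u8]) -> usize,
) -> usize {
    let w = window.len() + 1 - k;
    let x = select_tmer(window); // e.g. the OC-minimizer over t-mers
    x % w
}
```

For instance, passing `|win| oc_minimizer(win, t, s)` for some inner parameter \(s<t\) selects the t-mer by the open-closed scheme, giving the OC mod-minimizer.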
1.8 Variant: Open-closed minimizer using offsets
We can also make the following variant of the OC-minimizer, which performs slightly better when \(k\) is just below a multiple of \(w\).
- Choose the offset \(o:=\lfloor((k-t)\bmod w)/2\rfloor\).
- A k-mer is an ‘open mod-syncmer’ if its smallest contained t-mer is at a position \(x\) with \((x\bmod w)=o\). If there is an open mod-syncmer, take the one with the smallest t-mer hash.
- Otherwise, take the smallest k-mer that is a closed syncmer.
- Otherwise, return the smallest k-mer.
We can improve slightly more by using the t-mer hash instead of taking the smallest k-mer by k-mer hash: for open mod-syncmers, we prefer the one with minimal t-mer, and for closed syncmers we take the one with maximal t-mer.
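A sketch of the resulting key, incorporating the tweaks just described; the window then samples the position whose k-mer minimizes this key:

```rust
/// Key for the offset variant of the OC-minimizer; lower keys win.
fn oc_mod_key(kmer: &[u8], t: usize, w: usize) -> (u32, u64) {
    let k = kmer.len();
    let o = ((k - t) % w) / 2;
    let x = (0..=k - t).min_by_key(|&j| h(&kmer[j..j + t])).unwrap();
    let tmer_hash = h(&kmer[x..x + t]);
    if x % w == o {
        (0, tmer_hash) // open mod-syncmer: minimal t-mer hash first
    } else if x == 0 || x == k - t {
        (1, !tmer_hash) // closed syncmer: maximal t-mer hash first
    } else {
        (2, h(kmer)) // fallback: smallest k-mer by hash
    }
}
```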
2 Selection schemes
Before looking at more sampling schemes, we will now first consider some selection schemes.
2.1 Definition
While a sampling scheme selects a k-mer from a window, a selection scheme only selects a position, and is given by a function \(f: \Sigma^w \to [w]\) (Zheng, Kingsford, and Marçais 2021).
All the sampling schemes seen so far can be seen as selection schemes as well, but they are inefficient because they never sample the last \(k-1\) positions of the window. General selection schemes do not have this restriction.
2.2 Bd-anchors
One selection scheme is the bidirectional anchor (Loukides and Pissis 2021; Loukides, Pissis, and Sweering 2023). Given a window of \(w\) characters, this is simply the starting position of its smallest rotation. One drawback though is that as we shift the window through a sequence, the characters at the front can unpredictably influence whether the rotation starting at the last position is small or not. Thus, to improve the density, the rotations starting in the last \(r\) positions are excluded.
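A naive \(O(w^2)\) sketch (Booth's algorithm finds the smallest rotation in \(O(w)\)):

```rust
/// The bd-anchor: the start of the smallest rotation of the window,
/// excluding rotations starting in the last r positions.
fn bd_anchor(window: &[u8], r: usize) -> usize {
    let w = window.len();
    (0..w - r)
        .min_by_key(|&i| {
            // Materialize the rotation starting at i.
            let mut rot = window[i..].to_vec();
            rot.extend_from_slice(&window[..i]);
            rot
        })
        .unwrap()
}
```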
2.3 New: Smallest unique substring anchors
To avoid this instability of bd-anchors, we can simply look for the smallest suffix instead of the smallest rotation. To improve stability, we additionally require this suffix to be unique. For example, in the string `abbab`, the suffix `ab` also occurs earlier as a substring and hence is not unique, so the smallest unique suffix is `abbab`, starting at the first `a`. Thus, we search for the smallest unique suffix; some prefix of that is then a smallest unique substring, and hence we call these sus-anchors[3].
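A naive quadratic sketch of the plain (lexicographic) variant, before the anti-lex tweak of the next section:

```rust
/// The sus-anchor: the starting position of the smallest suffix that
/// occurs only once as a substring of the window.
fn sus_anchor(window: &[u8]) -> usize {
    let w = window.len();
    // A suffix starting at i is unique iff it does not also occur
    // starting at some earlier position j < i.
    let unique = |i: usize| (0..i).all(|j| window[j..j + (w - i)] != window[i..]);
    (0..w)
        .filter(|&i| unique(i))
        .min_by_key(|&i| &window[i..])
        .unwrap() // i = 0 (the whole window) is always unique
}
```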
2.4 New: Anti-lexicographic sorting
One drawback of taking the lexicographically smallest substring is that suffixes of small substrings are also small. In particular, when the SUS of a window starts with `aaabb...`, then after shifting the window by one position, there is a relatively large probability that `aabb...` will be the new SUS. But for the purpose of having a low density of sampled positions, we especially want to avoid sampling consecutive positions.
After some fiddling, it turns out that we can adjust the definition of ‘smallest’: instead of taking the lexicographically smallest substring, we can first ‘invert’ the first character of the substring (as in, replace \(c\) by \(\sigma-1-c\)), and then compare substrings. This way, the smallest substring will look like `zaaaa...`, and after shifting one position, the smallest substring will jump to another occurrence of `z` (or `y` if there is no `z`), instead of starting at the next `a`.[4]
One of the reasons that this scheme can perform so well for \(k=1\) is that it is not, in fact, a minimizer scheme, but ‘only’ a sampling scheme. Minimizer schemes are those sampling schemes that take the smallest k-mer according to some order. All sampling schemes seen so far are indeed minimizer schemes, while the sus-anchors are not: even though \(k=1\), they use the surrounding context of each character to determine its order.
3 More sampling schemes
3.1 Anti-lex sus-anchors
The anti-lex sus-anchors are not limited to \(k=1\), and also work well for slightly larger \(k\).
3.2 Threshold anchors
Let’s try to understand why the anti-lex sus-anchors are not as good for larger \(k\). For a window size \(w\), we expect to see each string of length \(c=\log_\sigma w\) once on average. Thus, we expect the anti-lexicographically smallest string to start with a `z` followed by \(c-1\) `a`’s. This means that only the first \(\approx c\) characters of each k-mer contribute to its ‘value’ in determining whether it’s the smallest one. Clearly, to achieve optimal density, we must use all \(k\) characters, and not just the first \(c\).
In a way, the first few characters contain too much entropy, while we want to use all characters.
Thus, we’d like to come up with a scheme that extracts (around \(w\)) entropy from all \(k\) characters.
One way is to artificially reduce the alphabet to, for example, a single bit, by splitting it into two halves. Still, this gives \(2^k\) equally likely values, and hence the first \(c_2=\log_2 w\) characters determine the value of the k-mer, which is still sub-linear in \(k\).
So, how can we extract less information from each character? As we know, the entropy of an event that happens with probability \(p\) is \(-p \lg p - (1-p) \lg (1-p)\), which is maximized for \(p=1/2\). Thus, mapping each character to \(0\) or \(1\) with probability not equal to \(1/2\) may improve things.
For the \(\sigma=4\) case, we can simply map `ACG` to `0` and `T` to `1`, so that \(p=1/4\), and then look for the anti-lexicographically smallest substring, that is, a string starting with a `1` followed by as many `0`’s as possible.
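A sketch gluing this mapping to the anti-lex suffix comparison; tie-breaking is left as "first wins" here (see the TODO below):

```rust
/// Threshold anchor for sigma = 4: map ACG -> 0 and T -> 1 (so p = 1/4),
/// then take the anti-lexicographically smallest suffix of the bit string,
/// i.e. prefer a 1 followed by as many 0s as possible.
fn threshold_anchor(window: &[u8]) -> usize {
    let bits: Vec<u8> = window.iter().map(|&c| (c == b'T') as u8).collect();
    (0..bits.len())
        .min_by_key(|&i| {
            let mut key = bits[i..].to_vec();
            key[0] = 1 - key[0]; // anti-lex: invert the first bit
            key
        })
        .unwrap()
}
```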
Generally, to match the lower bound, we would like to find a sampling scheme that never selects two k-mers within distance \(k\) of each other (and otherwise has roughly uniform distance between \(k+1\) and \(w\)). Requiring that each sampled k-mer equals `1000..000` satisfies this requirement. Thus, we would like to make the probability of a `1000..000` k-mer as large as possible, since whenever such a k-mer occurs in the window, we can push the sampled k-mers distance \(k\) away from each other.
The probability that a k-mer equals `1000..000` is \(p\cdot (1-p)^{k-1}\). This probability is maximized by choosing \(p = 1/k\) (which we can do when the alphabet is large), and then equals
\[
1/k \cdot (1-1/k)^{k-1} = 1/(k-1)\cdot (1-1/k)^k \approx 1/(k-1) \cdot 1/e \approx \frac{1}{ke}.
\]
Thus, the expected number of `1000..000` k-mers in a window is \(w/(ke)\). As \(k\) grows above \(w/e \approx w/3\), this means that not all windows have such a k-mer anymore, and that we potentially lose some performance. And indeed, this method appears to only work up to \(k=6\leq 24/e\).
The main bottleneck is that for \(k\approx w/e\), the probability of not having a single `1` is around \((1-1/w)^w \approx 1/e \approx 0.37\). In these cases, we can fall back to sampling a random smallest k-mer, but this quickly destroys the performance. Thus, for now I simply use \(p=1/4\), which in practice leads to there always being a `1`.
TODO: Find better tiebreaking rules, and investigate more choices of \(p\).
Another potential improvement, to extract less entropy from each character while still having a sufficiently large probability of a `10000` k-mer occurring, could be to require that the first chunk of two characters sums to at least some threshold \(T\), while all following chunks of two characters sum to \(<T\).
3.3 The \(t\)-gap disappears for large alphabets
One issue that remains in the plot above is what I will call the \(t\)-gap: especially for small \(k\), the graphs of all minimizer/syncmer-based methods shift \(t-1\) steps to the right compared to the double decycling minimizer. The reason is that by only considering t-mers, we effectively reduce the total number of positions that can be sampled by \(t-1\).
If we increase the alphabet size to \(\sigma=256\), \(t=1\) is sufficient to get mostly unique t-mers. All our new plots shift left by \(t-1\). Now, the OC mod-mini is comparable to double decycling, and also touches the lower bound when \(k\equiv 1\pmod w\).
4 Computing the density of forward schemes
For forward schemes, the density can be computed in multiple ways:
- Compute the fraction of charged contexts of size \(w+k\) where the two length-\(w\) windows select a different position.
- Compute the fraction of sampled positions over a cyclic De Bruijn sequence of order \(w+k\).
- Compute the expected fraction of sampled positions in a random cyclic sequence of length \(w+k\).
Each of these also allows for an approximate variant:
- Compute the fraction of charged contexts over a sufficiently large sample of \((w+k)\)-mers.
- Compute the fraction of sampled positions over a sufficiently long sequence.
- Compute the fraction over sufficiently many cyclic \((w+k)\)-mers.
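A sketch of the second approximate variant: slide a scheme over a long pseudo-random string and count the fraction of distinct sampled positions. For example, plugging in the random minimizer from above should give a value close to \(2/(w+1)\):

```rust
use std::collections::HashSet;

/// Approximate density of `scheme`, which maps a window of length
/// w + k - 1 to an offset in [0, w), over a random string of length n.
fn density(scheme: impl Fn(&[u8]) -> usize, w: usize, k: usize, n: usize) -> f64 {
    // A simple LCG stands in for a proper RNG.
    let mut state = 0x243f6a8885a308d3u64;
    let text: Vec<u8> = (0..n)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            b"ACGT"[(state >> 62) as usize]
        })
        .collect();
    let l = w + k - 1; // window length; assumes n >= l
    let mut sampled = HashSet::new();
    for i in 0..=n - l {
        sampled.insert(i + scheme(&text[i..i + l]));
    }
    sampled.len() as f64 / n as f64
}
```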
4.1 WIP: Anti-lexicographic sus-anchor density
It’s not hard to see that sus-anchors are forward.
To compute the density, we’ll use the third method above, for \(k=1\).
Suppose the smallest (under anti-lex sorting) unique substring of a sequence of lowercase characters is simply `z`, i.e., there is only a single `z`. In that case, this will be the sus-anchor for all rotations, and only a single position is sampled.
Otherwise, suppose that `za` is the sus-anchor. Then the rotation `a...z` does not contain `za` and will sample some other position, and thus two positions are sampled. That is still in line with the \(2/(w+1)\) lower bound we are approaching.
If the second smallest unique substring (say `Y`) overlaps the smallest unique substring in at most one character, then one of these will always be fully present and only two positions are sampled.
The bad case is when three or more positions are sampled. Let’s consider when this can happen.
Suppose the SUS is `zX` for some sequence `X` of length at least two. Then `zX` will be smallest for all but the rotations of the form `X2...zX1`, where `X = X1X2` with `X2` non-empty.
5 Open questions
- Can we use sus-anchors instead of t-mer minimizers in OC mod-minimizers to close the remaining \(t\)-gap?
- What is the exact density of sus-anchors? Can we prove its near-optimality?
6 Ideas
- threshold open syncmers?
- sus-anchor based syncmers?
- ‘centered’ sus-anchor/threshold as `0001000`.
- greedymini seems to prefer k-mers similar to `000111000`.
- Test `cgcg` order.
- Use `abbb` order for prefix \(\geq 2\), i.e. `aab????b`, where the suffix doesn’t have two consecutive `a`’s.
Footnotes
1. Edgar (2021) first defines open syncmers as having the smallest t-mer at the start, but also introduces an offset parameter, which we set to \((k-t)/2\).
2. Smallest with respect to the hash of the central t-mer.
3. I’m not quite sure yet whether this means smallest unique substring or smallest unique suffix.
4. This situation is reminiscent of the classic problem of computing the probability of seeing e.g. `HH` or `HT` or longer patterns in a series of coin flips.