This post introduces some new practical sampling schemes. It builds on:

- The post and paper (Groot Koerkamp and Pibiri 2024) introducing the mod-minimizer.
- The post and paper (Kille et al. 2024) introducing a lower bound on the density of sampling schemes.

## 1 Sampling schemes

### 1.1 Definitions and background

A **sampling scheme** is simply a function \(f: \Sigma^{w+k-1} \to [w]\). For a
**window**
of length \(w+k-1\), containing \(w\) k-mers, it samples the starting position
of one of the k-mers.

The **density** of a sampling scheme is the expected fraction of distinct positions
sampled over an infinitely long random string.

A sampling scheme is **forward** when the absolute sampled position never
decreases as we slide the window over a sequence.

The **random minimizer** has a density of \(2/(w+1)\).

A **trivial lower bound** on the density is \(1/w\).

Two previously proposed sampling schemes with low density are **miniception** and the **double decycling-based minimizer**.
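To make the density definition concrete, here is a minimal Python sketch (my own illustration, not code from the post or papers) of the random minimizer, together with an empirical density estimate that should come out close to \(2/(w+1)\):

```python
import random

def random_minimizer(window: str, k: int, hash_fn) -> int:
    """Return the start of the k-mer with the smallest hash in the window."""
    return min(range(len(window) - k + 1), key=lambda i: hash_fn(window[i:i + k]))

def density(scheme, w: int, k: int, n: int = 100_000) -> float:
    """Fraction of distinct positions sampled over a long random string."""
    s = "".join(random.choice("ACGT") for _ in range(n))
    # Slide a window of w + k - 1 characters and collect absolute sampled positions.
    sampled = {i + scheme(s[i:i + w + k - 1]) for i in range(n - (w + k - 1) + 1)}
    return len(sampled) / n

random.seed(1)
order = {}  # a random order on k-mers, drawn lazily
def h(kmer: str) -> float:
    return order.setdefault(kmer, random.random())

w, k = 8, 6
d = density(lambda win: random_minimizer(win, k, h), w, k)
# d should be close to 2 / (w + 1) = 2/9
```

The lazily filled `order` dict acts as a uniformly random order on k-mers, which is all a random minimizer needs.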

### 1.2 Mod-minimizer

The mod-minimizer is a scheme that asymptotically achieves the optimal density \(1/w\), and generally outperforms other schemes when \(k>w\).

It works by choosing a small parameter \(t\): the smallest value with \(t\equiv k\pmod w\) and \(t\geq r:=4\) (so that duplicate t-mers are rare). Then, the smallest t-mer in the window is found, say at position \(0\leq x< w+k-t\). Using this, the k-mer at position \(p = (x\bmod w)\) is sampled.
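This procedure can be sketched in a few lines of Python (my illustration, not the reference implementation); `hash_fn` defines the order on t-mers, and passing the identity gives plain lexicographic comparison:

```python
def mod_minimizer(window: str, w: int, k: int, t: int, hash_fn) -> int:
    """Sample a k-mer start from a window of w + k - 1 characters:
    find the position x of the smallest t-mer, return x mod w."""
    assert len(window) == w + k - 1
    x = min(range(w + k - t), key=lambda i: hash_fn(window[i:i + t]))
    return x % w

# Pick t: the smallest value with t >= r and t = k (mod w).
w, k, r = 8, 21, 4
t = r + ((k - r) % w)  # here t = 5, and indeed 21 = 5 (mod 8)
```

For example, a window whose unique smallest t-mer sits at position \(x=10\) samples the k-mer at position \(10\bmod 8=2\).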

### 1.3 Forward scheme lower bound

Together with Bryce Kille, we significantly improved the lower bound on the best possible density of forward schemes. In fact, we prove that when \(k\equiv 1\pmod w\) and \(\sigma\to\infty\), the mod-minimizer is optimal.

### 1.4 Open syncmer minimizer

Daniel Liu came up with the idea to build minimizer schemes on top of syncmers (Edgar 2021).
Given a parameter \(t\) (we use \(t=4\) when \(\sigma=4\)), we can hash all t-mers
inside a k-mer. For our purposes, the k-mer is an *open syncmer* (Edgar 2021) when the smallest t-mer inside it
is in the middle (at offset \(\lfloor(k-t)/2\rfloor\))^{1}, following Shaw and Yu (2021),
who show that using the middle position is best for *conservation*.
In particular, open syncmers have the property that they are never close to each other.
In each window, the *open syncmer minimizer* simply takes the smallest
k-mer^{2} that is an open syncmer (or the smallest k-mer if there is no open syncmer).
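To make this concrete, here is a sketch in Python (again my own illustration; SHA-256 is used only as a deterministic stand-in for a random order on strings):

```python
import hashlib

def h(s: str) -> str:
    """Deterministic pseudo-random order on strings."""
    return hashlib.sha256(s.encode()).hexdigest()

def is_open_syncmer(kmer: str, t: int) -> bool:
    """A k-mer is an open syncmer if its smallest t-mer sits exactly in the middle."""
    k = len(kmer)
    x = min(range(k - t + 1), key=lambda i: h(kmer[i:i + t]))
    return x == (k - t) // 2

def open_syncmer_minimizer(window: str, w: int, k: int, t: int) -> int:
    """Take the smallest open syncmer (ordered by the hash of its central t-mer);
    fall back to the smallest k-mer if the window contains no open syncmer."""
    mid = (k - t) // 2
    keys = []
    for i in range(w):
        kmer = window[i:i + k]
        if is_open_syncmer(kmer, t):
            keys.append((0, h(kmer[mid:mid + t]), i))  # preferred tier
        else:
            keys.append((1, h(kmer), i))               # fallback tier
    return min(keys)[2]
```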

### 1.5 Open-closed syncmer minimizer

Then Daniel extended this to the *open-closed syncmer minimizer*: if there is an
open syncmer inside the window, prefer the one with the smallest t-mer. Otherwise, take the smallest *closed* syncmer,
i.e., a k-mer whose smallest contained t-mer is at its start or end. Otherwise,
just take the smallest k-mer.
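Under the same assumptions as before (SHA-256 standing in for a random hash), this tiered preference can be sketched as:

```python
import hashlib

def h(s: str) -> str:
    """Deterministic pseudo-random order on strings."""
    return hashlib.sha256(s.encode()).hexdigest()

def smallest_tmer_pos(kmer: str, t: int) -> int:
    return min(range(len(kmer) - t + 1), key=lambda i: h(kmer[i:i + t]))

def oc_syncmer_minimizer(window: str, w: int, k: int, t: int) -> int:
    """Prefer open syncmers (by their smallest t-mer), then closed syncmers,
    then any k-mer (by k-mer hash)."""
    best = None
    for i in range(w):
        kmer = window[i:i + k]
        x = smallest_tmer_pos(kmer, t)
        if x == (k - t) // 2:              # open: smallest t-mer in the middle
            key = (0, h(kmer[x:x + t]), i)
        elif x == 0 or x == k - t:         # closed: smallest t-mer at start or end
            key = (1, h(kmer), i)
        else:
            key = (2, h(kmer), i)
        if best is None or key < best:
            best = key
    return best[2]
```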

### 1.6 New: Open-closed mod-minimizer

Looking at the figure above, one wonders if the smoothness of the methods that perform well for \(k<w\) can be incorporated into the asymptotically optimal step-wise behaviour of the mod-minimizer. Indeed, we can! The OC-mod-minimizer works as follows:

- Choose the *offset* \(o:=\lfloor((k-t)\bmod w)/2\rfloor\).
- A k-mer is an ‘open mod-syncmer’ if its smallest contained t-mer is at a position \(x\) with \((x\bmod w)=o\). If there is an open mod-syncmer, take the one with the smallest t-mer hash.
- Otherwise, take the smallest k-mer that is a closed syncmer.
- Otherwise, return the smallest k-mer.
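The bullets above translate into a short sketch (my illustration; SHA-256 stands in for a random hash, and ties break towards the leftmost position):

```python
import hashlib

def h(s: str) -> str:
    """Deterministic pseudo-random order on strings."""
    return hashlib.sha256(s.encode()).hexdigest()

def oc_mod_minimizer(window: str, w: int, k: int, t: int) -> int:
    """Prefer open mod-syncmers (smallest t-mer at x with x mod w == o),
    then closed syncmers, then the smallest k-mer."""
    o = ((k - t) % w) // 2
    best = None
    for i in range(w):
        kmer = window[i:i + k]
        x = min(range(k - t + 1), key=lambda j: h(kmer[j:j + t]))
        if x % w == o:                     # open mod-syncmer: order by t-mer hash
            key = (0, h(kmer[x:x + t]), i)
        elif x == 0 or x == k - t:         # closed syncmer: order by k-mer hash
            key = (1, h(kmer), i)
        else:
            key = (2, h(kmer), i)
        if best is None or key < best:
            best = key
    return best[2]
```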

### 1.7 The $t$-gap

One issue that remains in the plot above is what I will call the *$t$-gap*:
especially for small \(k\), the graph shifts \(t-1\) steps to the right compared to
the double decycling minimizer. The reason is that by only considering t-mers,
we effectively reduce the total number of positions that can be sampled by \(t-1\).

## 2 Selection schemes

### 2.1 Definition

While a *sampling scheme* selects a k-mer from a window, a *selection scheme*
only selects a *position*, and is given by a function \(f: \Sigma^w \to [w]\) (Zheng, Kingsford, and Marçais 2021).

All the sampling schemes seen so far can be seen as selection schemes as well, but they are inefficient because they never sample the last \(k-1\) positions. Proper selection schemes do not have this restriction.

### 2.2 Bd-anchors

One selection scheme is that of *bidirectional anchors* (bd-anchors) (Loukides and Pissis 2021; Loukides, Pissis, and Sweering 2023).
Given a window of \(w\) characters, this is simply the starting position of its
smallest rotation. One drawback though is that as we shift the window through a
sequence, the characters at the front can unpredictably influence whether the
rotation starting at the last position is small or not. Thus, to improve the
density, the rotations starting in the last \(r\) positions are excluded.
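A direct Python sketch of this scheme (my illustration, not from the papers; ties break towards the leftmost position):

```python
def bd_anchor(window: str, r: int) -> int:
    """Start of the lexicographically smallest rotation of the window,
    excluding rotations that start in the last r positions."""
    w = len(window)
    return min(range(w - r), key=lambda i: (window[i:] + window[:i], i))
```

For example, in `cabcab` the smallest rotation `abcabc` starts at positions 1 and 4, and the leftmost wins.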

### 2.3 New: Smallest unique substring anchors

To avoid this instability of bd-anchors, we can simply look for the
smallest suffix instead of the smallest rotation. To improve stability, we
require this suffix to be *unique*. That is, in the string `abbab`, the suffix
`ab` is not unique (it also occurs at the start), and hence the smallest unique suffix starts at the first `a`.
So we search for the smallest unique *suffix*, and some prefix of that is the
smallest unique *substring*; hence, we call these sus-anchors^{3}.
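A direct (quadratic-time) Python sketch of this definition, mirroring the `abbab` example; since the whole window always occurs exactly once, a unique suffix always exists:

```python
def sus_anchor(window: str) -> int:
    """Start of the lexicographically smallest suffix that occurs only once
    as a substring of the window."""
    w = len(window)
    def is_unique(i: int) -> bool:
        # Count occurrences of the suffix window[i:], allowing overlaps.
        return sum(window.startswith(window[i:], j) for j in range(w)) == 1
    return min((i for i in range(w) if is_unique(i)), key=lambda i: window[i:])
```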

### 2.4 New: Scrambled sort

One drawback of taking the lexicographically smallest substring is that suffixes of
small substrings are also small. In particular, when a window starts with
`aaabb...` as a SUS, after shifting the window by one position, there is a
relatively large probability that `aabb...` will remain the smallest SUS. But
for the purposes of having a low density of sampled positions, we especially want to avoid
sampling consecutive positions.

After some fiddling, it turns out that we can adjust the definition of
‘smallest’. Instead of taking the *lexicographically* smallest substring, we can first
‘invert’ the first character of the substring (as in, replace \(c\) by \(\sigma-1-c\)), and then compare
substrings as usual. This way, the smallest substring will look like `zaaaa...`, and
after shifting one position, the smallest substring will jump to another
occurrence of `z` (or `y` if there is no `z`), instead of starting at the next `a`.^{4}
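Concretely, the modified comparison might look like this over the DNA alphabet (my sketch; the post’s actual ordering may differ in details such as tie-breaking):

```python
ORD = {c: i for i, c in enumerate("ACGT")}
SIGMA = 4

def scrambled_key(s: str) -> tuple:
    """Compare substrings with the first character inverted (c -> sigma-1-c),
    so the 'smallest' substring starts with a large character followed by
    small ones, e.g. TAAA... for DNA."""
    return (SIGMA - 1 - ORD[s[0]],) + tuple(ORD[c] for c in s[1:])

# Under the scrambled order, TAAA beats both AAAA and GAAA:
best = min(["AAAA", "TAAA", "GAAA"], key=scrambled_key)
```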

### TODO 2.5 Scrambled sus-anchor density

It’s not hard to see that sus-anchors are forward. Thus, to compute the density, it suffices to compute the probability that two consecutive windows select different positions. That in turn is only possible if either the first window has its smallest unique substring at the start, or the second window has its smallest unique substring at the end.
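Pending the exact analysis, the density can at least be estimated empirically. Here is a Monte Carlo sketch for the scrambled sus-anchor (my code, small parameters, quadratic per-window work), to be compared against the trivial lower bound \(1/w\):

```python
import random

ORD = {c: i for i, c in enumerate("ACGT")}

def scrambled_key(s: str) -> tuple:
    # Invert the first character, then compare as usual.
    return (3 - ORD[s[0]],) + tuple(ORD[c] for c in s[1:])

def scrambled_sus_anchor(window: str) -> int:
    """Start of the smallest (in scrambled order) unique suffix of the window."""
    w = len(window)
    def is_unique(i: int) -> bool:
        return sum(window.startswith(window[i:], j) for j in range(w)) == 1
    return min((i for i in range(w) if is_unique(i)),
               key=lambda i: scrambled_key(window[i:]))

random.seed(0)
w, n = 16, 10_000
s = "".join(random.choice("ACGT") for _ in range(n))
sampled = {i + scrambled_sus_anchor(s[i:i + w]) for i in range(n - w + 1)}
d = len(sampled) / n  # compare against the lower bound 1/w = 0.0625
```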

## 3 Open questions

- Can we use sus-anchors instead of t-mer minimizers in OC mod-minimizers to close the remaining $t$-gap?
- What is the exact density of sus-anchors? Can we prove their near-optimality?

## References

*PeerJ* 9 (February): e10805. https://doi.org/10.7717/peerj.10805.

*24th International Workshop on Algorithms in Bioinformatics (WABI 2024)*, edited by Solon P. Pissis and Wing-Kin Sung, 312:11:1–11:23. Leibniz International Proceedings in Informatics (LIPIcs). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPIcs.WABI.2024.11.

*bioRxiv*. https://doi.org/10.1101/2024.09.06.611668.

*IEEE Transactions on Knowledge and Data Engineering* 35 (11): 11093–111. https://doi.org/10.1109/tkde.2022.3231780.

*Bioinformatics* 38 (20): 4659–69. https://doi.org/10.1093/bioinformatics/btab790.

*Journal of Computational Biology* 28 (4): 395–409. https://doi.org/10.1089/cmb.2020.0432.

1. Edgar (2021) first defines *open* syncmers as having the smallest t-mer at the start, but also introduces an *offset* parameter, which we set to \((k-t)/2\). ↩︎
2. Smallest with respect to the hash of the central t-mer. ↩︎
3. I’m not quite sure yet whether this means *smallest unique substring* or *smallest unique suffix*. ↩︎
4. This situation is reminiscent of the classic problem of computing the probability of seeing e.g. `HH` or `HT` (or longer patterns) in a series of coin flips. ↩︎