**The results of this post are now available in a pre-print: DOI, PDF:**

Kille, Bryce, Ragnar Groot Koerkamp, Drake McAdams, Alan Liu, and Todd Treangen. 2024. “A near-Tight Lower Bound on the Density of Forward Sampling Schemes.” *Biorxiv*. https://doi.org/10.1101/2024.09.06.611668.

In this post I will prove a new lower bound on the density of any minimizer or forward sampling scheme: \[ d(f) \geq \frac{\lceil\frac{w+k}{w}\rceil}{w+k} = \frac{\lceil\frac{\ell+1}{w}\rceil}{\ell+1}. \]

In particular, this implies that when \(k=1\), any forward sampling scheme has density at least \(2/(w+1)\), and thus that random minimizers are optimal in this case.

## Succinct background

For more background, see either my previous post on minimizers, or the mod-minimizer paper (Groot Koerkamp and Pibiri 2024).

### Definitions

For given parameters \(k\) and \(w\), a **local sampling scheme** is a function
\(f: \Sigma^{k+w-1}\to [w] = \{0, \dots, w-1\}\).
We set \(\ell = k+w-1\). An $ℓ$-mer is also called a *window* and consists of
\(w\) consecutive $k$-mers. Given a window \(W\), \(f\) is said to *sample* the
$k$-mer \(W[f(w)..f(w)+k)\) at position \(f(W)\) in \(W\).

A sampling scheme is **forward** when the position of the sampled $k$-mer never
jumps backwards.

The **density** \(d(f)\) of a sampling scheme \(f\) is the expected fraction of sampled
$k$-mers on an infinitely long random string.

**Random minimizers** hash each $k$-mer in a window and select the one with the
minimal hash. They have a density of \(2/(w+1)\).

**Mod-minimizers** set \(t=((k-r)\bmod w) + r\) (for a small \(r\)). They then compute
the position \(x\) of the smallest $t$-mer and sample the $k$-mer at position
\(x\bmod w\). For fixed \(w\), they have density \(1/w\) in the \(k\to\infty\) limit,
and ’locally optimal’ density \(\frac{2+(k-1)/w}{w+k}\) when \(r=1\) and \(k\equiv
1\pmod w\) and \(\sigma \gg k+w\) (the bottom-left minima in Figure 1).

### Lower bounds

Any sampling scheme must sample at least one out of every \(w\) $k$-mers, so \(1/w\) is a trivial lower bound on the density.

Marçais, DeBlasio, and Kingsford (2018) have shown an improved lower bound for
forward schemes of
\[
d(f) \geq \frac{1.5 + \max(0, \lfloor \frac{k-w}w\rfloor) + \frac 1{2w}}{w+k}.
\]
In Groot Koerkamp and Pibiri (2024), we extended it to *all* local sampling schemes, including
non-forward ones, and simplified and improved it slightly to
\[
d(f) \geq \frac{1.5}{w+k-0.5},
\]

Note that there is a large gap between the current lower bounds and the best known schemes. Clearly, this is unsatisfying and needs resolving.

## A new lower bound

**Theorem 1.** For any forward sampling scheme \(f\),
\[
d(f) \geq \frac{\lceil\frac{w+k}{w}\rceil}{w+k}.
\]

**Proof.**
Consider a circular string (that ‘wraps around’ at the end) of
length \(\ell+1 = w+k\). This circular string has \(w+k\) $k$-mers (one starting at
each position), and since at least one $k$-mer is sampled every \(w\) positions,
at least \(\lceil\frac{w+k}{w}\rceil\) $k$-mers in the circular string must be
sampled. Thus, the density on the circular string is at least
\(\lceil\frac{w+k}{w}\rceil/(w+k)\).

Each $(w+k)$-mer contains exactly two $ℓ$-mers, a left and a right one. When
the right $ℓ$-mer samples a different $k$-mer than the left $ℓ$-mer, we
say this $(w+k)$-mer is *charged*. Thus, out of the \(w+k\) $(w+k)$-mers
in the circular string, at least \(\lceil\frac{w+k}{w}\rceil\) of them are charged.

Now, consider the complete De Bruijn graph of order \(w+k\), i.e., the graph whose
vertices are $(w+k)$-mers, and whose edges connect $(w+k)$-mers that overlap in \(w+k-1\) positions.
This graph can be partitioned into *pure* cycles of length (a divisor of) \(w+k\): for
each $(w+k)$-mer vertex, simply consider the cycle induced by the rotations of
the $(w+k)$-mer
(Lemma 2 Pellow et al. 2023).

*Side note:* This idea of partitioning the De Bruijn graph goes back to Mykkeltveit (1972),
who showed that it is possible to *decycle* the graph by removing one vertex
from every cycle, creating a set of ‘density’ \(1/w\).
DOCKS (Orenstein et al. 2017), PASHA (Ekim, Berger, and Orenstein 2020), and decycling-based minimizers (Pellow et al. 2023) all take
this set as a starting point to create a universal hitting set.

By the argument above, along each non-degenerate (length \(w+k\)) cycle there are at least \(\lceil\frac{w+k}{w}\rceil\) charged $(w+k)$-mers, i.e., the density of charged $(w+k)$-mers on each non-degenerate cycle is at least \(\lceil\frac{w+k}{w}\rceil / (w+k)\). For degenerate cycles of length \((w+k)/p\) (for some divisor \(p>1\) of \(w+k\)), we can ‘unwrap’ the cycle \(p\) times into a cycle of length \(w+k\) to see that the overall density must still be at least \(\lceil\frac{w+k}{w}\rceil / (w+k)\).

Thus, we partitioned the vertices of the order \(w+k\) De Bruijn graph into parts (pure cycles) such that each part has density of charged $(w+k)$-mers at least \(\lceil\frac{w+k}{w}\rceil / (w+k)\), and thus this is also a lower bound on the overall density of charged $(w+k)$-mers in the full graph.

Since each $(w+k)$-mer is equally likely to appear at any position in an infinitely long string, we conclude that this is indeed a lower bound on the density of any forward sampling scheme. \(\blacksquare\)

**Non-forward schemes.**
Note that this proof does not directly work for any *local*, non-forward,
scheme, since a $(w+k)$-mer may be charged but ‘jump backwards’ to a $k$-mer
that was already sampled before. This may be fixable by considering a De Bruijn
graph of order \(2w+k-2\) instead (Lemma 4 Marçais et al. 2017; Marçais, DeBlasio, and Kingsford 2018).

## Discussion

As can be seen in Figure 2, this new lower bound is much stronger than the
previous one. It is the first bound to imply that density \(2/(w+1)\) is optimal
for \(k=1\), and *exactly* coincides with the density of mod-minimizers when
\(k\equiv 1\pmod w\), showing that mod-minimizers are not just optimal in the
\(k\to\infty\) limit, but also locally optimal. Indeed, when \(r=1\) and \(k\equiv
1\pmod w\), the density of mod-minimizers exactly matches the lower bound:
\[
\frac{2+\lfloor\frac{k-1}{w}\rfloor}{w+\lfloor\frac{k-1}{w}\rfloor w+1}
= \frac{2+\frac{k-1}{w}}{w+\frac{k-1}{w} w+1}
= \frac{\frac{k+2w-1}w}{k+w}
= \frac{\lceil\frac{k+w}w\rceil}{k+w}.
\]

It remains to show some ‘continuation’ of the bound, shown in purple in Figure 2 \[ \frac{\lceil\frac{k+w}w\rceil}{k+w} =\frac{\lfloor\frac{k+2w-1}w\rfloor}{k+w} \sim\frac{\frac{k+2w-1}w}{k+w} =\frac 1w + \frac 1{k+w} - \frac1{w(k+w)}. \] But we can already see that double decycling based minimizers (Pellow et al. 2023) break this continuation, so we should expect some tricky cases along the way. Nevertheless, this formula has a nice interpretation:

- We need density at least \(1/w\) as a baseline.
- Every \((k+w)/2\) steps, ‘coherence’/‘synchronization’ is lost, and with probability \(1/2\) a gap of size \(\geq w\) must be filled by a new sample.
- With probability \(1/(w(k+w))\), two consecutive but non-coherent windows sample two kmers at distance \(1/w\) anyway, and are ‘accidentally coherent’. (Thanks Daniel for this point.)

## Post scriptum

It really took a long time to discover this proof. In the sense that, it was always there, ready to be found, but nobody did. Schleimers’ original bound (Schleimer, Wilkerson, and Aiken 2003) is already 20 years old and only Marçais, DeBlasio, and Kingsford (2018) improved it. It really feels like this proof has been staring us in the face while we didn’t see it for quite some time. Especially given that it’s so simple, and all parts were hinted at:

- The density of a forward scheme can be evaluated by considering a De Bruijn sequence of order \(w+k\), so looking at an order \(w+k\) De Bruijn graph should be necessary and sufficient.
- Partitioning the De Bruijn graph has been done before, to show a density lower bound of \(1/w\).
- The density of random minimizers for \(k=1\) is \(2/(w+1)\), which, in hindsight,
*very*much reads like*at least \(2\) samples are needed in every cycle of length \(w+1\).*

**Story time.** Let me briefly write down how I came up with this for my own fun
:) It’s in a very very roundabout way:

- In the mod-minimizer paper we show a lower bound of \(1.5/(k+w-0.5)\). But in practice no scheme achieves anything close to this.
- For \(k=1\), we expect a new sample roughly every \(\ell/2\) positions. Thus instead of tiling $ℓ$-mers back-to-back, maybe we should tile them with \(\ell/2\) overlap.
- That didn’t really go anywhere, since the sampled positions don’t align with these staggered windows.
- What if we take an $ℓ$-mer \(B\), consider its sampled position \(i\), and then consider $ℓ$-mers \(A\) and \(C\) that end just before position \(i\) and start just after position \(i\).
- Well this is still annoying since now \(A\) and \(C\) have some ‘dangling ends’. It’s hard to say things about strings that have a fixed prefix and random suffix (or reversed).
- Hmm but we can make the prefix of \(A\) equal to the prefix of \(C\), and the suffix of \(C\) equal to the prefix of \(A\).
- Ah now we just get a cyclic $ℓ+1$-mer, from which at least \(2\) positions must be sampled!
- The only step left is to look at charged $ℓ+1$-mers in the $ℓ+1$-order DBg, rather than $ℓ$-mers that introduce new samples in the $ℓ$-order DBg.
- Anyway, here things clicked into place and the connection with previous DBg partitioning becomes clear.

## Acknowledgement

Thanks to both Daniel Liu and Giulio Ermanno Pibiri for many discussions and for suggesting improvements to this text.

## References

*Research in Computational Molecular Biology*, 37–53. Springer International Publishing. https://doi.org/10.1007/978-3-030-45257-5_3.

*24th International Workshop on Algorithms in Bioinformatics (Wabi 2024)*, edited by Solon P. Pissis and Wing-Kin Sung, 312:11:1–11:23. Leibniz International Proceedings in Informatics (Lipics). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPIcs.WABI.2024.11.

*Biorxiv*. https://doi.org/10.1101/2024.09.06.611668.

*Bioinformatics*34 (13): i13–22. https://doi.org/10.1093/bioinformatics/bty258.

*Bioinformatics*33 (14): i110–17. https://doi.org/10.1093/bioinformatics/btx235.

*Journal of Combinatorial Theory, Series B*13 (1): 40–45. https://doi.org/10.1016/0095-8956(72)90006-8.

*Plos Computational Biology*13 (10): e1005777. https://doi.org/10.1371/journal.pcbi.1005777.

*Genome Research*, August. https://doi.org/10.1101/gr.277644.123.

*Proceedings of the 2003 Acm Sigmod International Conference on Management of Data*. Sigmod/Pods03. ACM. https://doi.org/10.1145/872757.872770.