The results of this post are now published in Bioinformatics: DOI, PDF:
Kille, Bryce, Ragnar Groot Koerkamp, Drake McAdams, Alan Liu, and Todd J Treangen. 2024. “A near-tight Lower Bound on the Density of Forward Sampling Schemes.” Edited by Yann Ponty. Bioinformatics, December. https://doi.org/10.1093/bioinformatics/btae736.
For given parameters and , a local sampling scheme is a function
.
We set . An -mer is also called a window and consists of
consecutive -mers. Given a window , is said to sample the
-mer at position in .
A sampling scheme is forward when the position of the sampled -mer never
jumps backwards.
The density of a sampling scheme is the expected fraction of sampled
-mers on an infinitely long random string.
Random minimizers hash each -mer in a window and select the one with the
minimal hash. They have a density of .
Mod-minimizers set (for a small ). They then compute
the position of the smallest -mer and sample the -mer at position
. For fixed , they have density in the limit,
and ’locally optimal’ density when and and (the bottom-left minima in Figure 1).
Any sampling scheme must sample at least one out of every -mers, so
is a trivial lower bound on the density.
Marçais, DeBlasio, and Kingsford (2018) have shown an improved lower bound for
forward schemes of
In Groot Koerkamp and Pibiri (2024), we extended it to all local sampling schemes, including
non-forward ones, and simplified and improved it slightly to
Figure 1: Density of some sampling schemes for (w=24) and alphabet size (sigma=256), and the mentioned lower bounds.
Note that there is a large gap between the current lower bounds and the best
known schemes. Clearly, this is unsatisfying and needs resolving.
Proof.
Consider a circular string (that ‘wraps around’ at the end) of
length . This circular string has -mers (one starting at
each position), and since at least one -mer is sampled every positions,
at least -mers in the circular string must be
sampled. Thus, the density on the circular string is at least
.
Each -mer contains exactly two -mers, a left and a right one. When
the right -mer samples a different -mer than the left -mer, we
say this -mer is charged. Thus, out of the -mers
in the circular string, at least of them are charged.
Now, consider the complete De Bruijn graph of order , i.e., the graph whose
vertices are -mers, and whose edges connect -mers that overlap in positions.
This graph can be partitioned into pure cycles of length (a divisor of) : for
each -mer vertex, simply consider the cycle induced by the rotations of
the -mer
(Lemma 2 Pellow et al. 2023).
Side note: This idea of partitioning the De Bruijn graph goes back to Mykkeltveit (1972),
who showed that it is possible to decycle the graph by removing one vertex
from every cycle, creating a set of ‘density’ .
DOCKS (Orenstein et al. 2017), PASHA (Ekim, Berger, and Orenstein 2020), and decycling-based minimizers (Pellow et al. 2023) all take
this set as a starting point to create a universal hitting set.
By the argument above, along each non-degenerate (length ) cycle there are at least
charged -mers, i.e., the density of charged
-mers on each non-degenerate cycle is at least . For degenerate cycles of
length (for some divisor of ), we can ‘unwrap’ the cycle
times into a cycle of length to see that the overall density must still be
at least .
Thus, we partitioned the vertices of the order De Bruijn graph into parts
(pure cycles) such that each
part has density of charged -mers at least , and thus this is also a lower bound on the overall density of charged
-mers in the full graph.
Since each -mer is equally likely to appear at any position in an
infinitely long string, we conclude that this is indeed a lower bound on the
density of any forward sampling scheme.
Non-forward schemes.
Note that this proof does not directly work for any local, non-forward,
scheme, since a -mer may be charged but ‘jump backwards’ to a -mer
that was already sampled before. This may be fixable by considering a De Bruijn
graph of order instead (Lemma 4 Marçais et al. 2017; Marçais, DeBlasio, and Kingsford 2018).
Figure 2: The new lower bound (blue) and its continuation (purple).
As can be seen in Figure 2, this new lower bound is much stronger than the
previous one. It is the first bound to imply that density is optimal
for , and exactly coincides with the density of mod-minimizers when
, showing that mod-minimizers are not just optimal in the
limit, but also locally optimal. Indeed, when and , the density of mod-minimizers exactly matches the lower bound:
It remains to show some ‘continuation’ of the bound, shown in purple in
Figure 2
But we can already see that double decycling based minimizers
(Pellow et al. 2023) break this continuation, so we should expect some
tricky cases along the way.
Nevertheless, this formula has a nice interpretation:
We need density at least as a baseline.
Every steps, ‘coherence’/‘synchronization’ is lost, and with
probability a gap of size must be filled by a new sample.
With probability , two consecutive but non-coherent windows sample
two kmers at distance anyway, and are ‘accidentally coherent’. (Thanks
Daniel for this point.)
It really took a long time to discover this proof. In the sense that, it was
always there, ready to be found, but nobody did. Schleimers’ original bound
(Schleimer, Wilkerson, and Aiken 2003) is already 20 years old and only
Marçais, DeBlasio, and Kingsford (2018) improved it.
It really feels like this proof has been staring us in the face while we didn’t
see it for quite some time. Especially given that it’s so simple, and all parts
were hinted at:
The density of a forward scheme can be evaluated by considering a De Bruijn
sequence of order , so looking at an order De Bruijn graph should
be necessary and sufficient.
Partitioning the De Bruijn graph has been done before, to show a density lower
bound of .
The density of random minimizers for is , which, in hindsight,
very much reads like at least samples are needed in every cycle of
length .
Story time. Let me briefly write down how I came up with this for my own fun
:) It’s in a very very roundabout way:
In the mod-minimizer paper we show a lower bound of . But in
practice no scheme achieves anything close to this.
For , we expect a new sample roughly every positions. Thus
instead of tiling -mers back-to-back, maybe we should tile them with
overlap.
That didn’t really go anywhere, since the sampled positions don’t align with
these staggered windows.
What if we take an -mer , consider its sampled position , and
then consider -mers and that end just before position and
start just after position .
Well this is still annoying since now and have some ‘dangling ends’.
It’s hard to say things about strings that have a fixed prefix and random
suffix (or reversed).
Hmm but we can make the prefix of equal to the prefix of , and the
suffix of equal to the prefix of .
Ah now we just get a cyclic -mer, from which at least positions
must be sampled!
The only step left is to look at charged -mers in the -order
DBg, rather than -mers that introduce new samples in the -order DBg.
Anyway, here things clicked into place and the connection with previous DBg
partitioning becomes clear.
Ekim, Barış, Bonnie Berger, and Yaron Orenstein. 2020. “A Randomized Parallel Algorithm for Efficiently Finding near-Optimal Universal Hitting Sets.” In Research in Computational Molecular Biology, 37–53. Springer International Publishing. https://doi.org/10.1007/978-3-030-45257-5_3.
Groot Koerkamp, Ragnar, and Giulio Ermanno Pibiri. 2024. “The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers.” In 24Th International Workshop on Algorithms in Bioinformatics (Wabi 2024), edited by Solon P. Pissis and Wing-Kin Sung, 312:11:1–11:23. Leibniz International Proceedings in Informatics (Lipics). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPIcs.WABI.2024.11.
Kille, Bryce, Ragnar Groot Koerkamp, Drake McAdams, Alan Liu, and Todd J Treangen. 2024. “A near-tight Lower Bound on the Density of Forward Sampling Schemes.” Edited by Yann Ponty. Bioinformatics, December. https://doi.org/10.1093/bioinformatics/btae736.
Marçais, Guillaume, Dan DeBlasio, and Carl Kingsford. 2018. “Asymptotically Optimal Minimizers Schemes.” Bioinformatics 34 (13): i13–22. https://doi.org/10.1093/bioinformatics/bty258.
Marçais, Guillaume, David Pellow, Daniel Bork, Yaron Orenstein, Ron Shamir, and Carl Kingsford. 2017. “Improving the Performance of Minimizers and Winnowing Schemes.” Bioinformatics 33 (14): i110–17. https://doi.org/10.1093/bioinformatics/btx235.
Mykkeltveit, Johannes. 1972. “A Proof of Golomb’s Conjecture for the de Bruijn Graph.” Journal of Combinatorial Theory, Series B 13 (1): 40–45. https://doi.org/10.1016/0095-8956(72)90006-8.
Orenstein, Yaron, David Pellow, Guillaume Marçais, Ron Shamir, and Carl Kingsford. 2017. “Designing Small Universal K-Mer Hitting Sets for Improved Analysis of High-Throughput Sequencing.” Edited by Benjamin J. Raphael. Plos Computational Biology 13 (10): e1005777. https://doi.org/10.1371/journal.pcbi.1005777.
Pellow, David, Lianrong Pu, Bariş Ekim, Lior Kotlar, Bonnie Berger, Ron Shamir, and Yaron Orenstein. 2023. “Efficient Minimizer Orders for Large Values Ofkusing Minimum Decycling Sets.” Genome Research, August. https://doi.org/10.1101/gr.277644.123.
Schleimer, Saul, Daniel S. Wilkerson, and Alex Aiken. 2003. “Winnowing: Local Algorithms for Document Fingerprinting.” In Proceedings of the 2003 Acm Sigmod International Conference on Management of Data. Sigmod/Pods03. ACM. https://doi.org/10.1145/872757.872770.