These are some quick notes on progress towards optimal sampling schemes.
They build on my thesis chapter (Groot Koerkamp 2025), which is based on practical minimizers, as well as the sampling-density lower bound (Kille et al. 2024).
1 Exactly optimal sampling schemes
Currently, no family of exactly optimal schemes is known:
- Greedymini (Golan et al. 2025) finds some optimal schemes for \(k=w+1\), but only for specific parameters.
- Modmini (Groot Koerkamp and Pibiri 2024) approaches optimality in the \(k\to\infty\) limit.
- Bd-anchors (Loukides, Pissis, and Sweering 2023) are for \(k=1\) (or, sort-of equivalently, \(w\to\infty\)) and approach optimality in the \(\sigma\to\infty\) limit.
In the lower bound paper (Kille et al. 2024), we observe that when \(k\equiv 1\mod w\), an ILP always finds sampling schemes that exactly match the lower bound (regardless of \(\sigma\)).
Figure 1: ILP results are the black circles. The solid circles indicate cases where they match the red lower bound. In particular, this is the case whenever \(k\equiv 1\mod w\) (as well as for \(w=\sigma=2\)).
This motivates us to search for optimal \(k=1\) schemes as a first step to finding more general optimal schemes for \(k\equiv 1\mod w\). (One would hope that some mod-minimizer-like trick can be applied to the \(k=1\) schemes here.)
An alternative approach could be to prove bounds on the near-optimality of the SUS-anchor schemes of practical minimizers, but proving exact (combinatorial) optimality sounds easier than the analysis needed to prove bounds on near-optimality.
Figure 2: Density of \(k=1\), \(\sigma=4\) selection schemes.
2 Lower bound
The density of a sampling scheme is measured on a long random string. Equivalently, for forward schemes, we can consider the probability that a context of length \(\ell+1 = (w+k-1)+1 = w+k\) is charged. Equivalently again, we can consider the average density of charged contexts, or of sampled positions, on a cyclic (circular) string of length at least \(\ell\).
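As a sanity check, here is a small Python sketch that measures density both ways. It assumes a plain leftmost lexicographic minimizer with \(k=1\) as the example scheme (my choice for illustration, not the scheme discussed below); any forward scheme can be plugged in instead.

```python
import random

def minimizer_pos(window):
    """Leftmost position of the smallest character in the window (k=1 minimizer)."""
    return min(range(len(window)), key=lambda i: (window[i], i))

def density_on_random_string(w, sigma=4, n=200_000, seed=1):
    """Fraction of positions sampled on a long random string."""
    rng = random.Random(seed)
    s = [rng.randrange(sigma) for _ in range(n)]
    sampled = set()
    for i in range(n - w + 1):
        sampled.add(i + minimizer_pos(s[i:i + w]))
    return len(sampled) / n

def charged_context_probability(w, sigma=4, n=200_000, seed=1):
    """Probability that a context of length w+k = w+1 is charged,
    i.e. that its two windows sample different absolute positions."""
    rng = random.Random(seed)
    s = [rng.randrange(sigma) for _ in range(n)]
    contexts = n - w
    charged = 0
    for i in range(contexts):
        p1 = i + minimizer_pos(s[i:i + w])
        p2 = i + 1 + minimizer_pos(s[i + 1:i + 1 + w])
        charged += (p1 != p2)
    return charged / contexts

if __name__ == "__main__":
    w = 8
    # Up to edge effects, the two estimates should agree.
    print(density_on_random_string(w), charged_context_probability(w))
```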
Now assume \(k=1\). If we take a circle of \(\ell+1=w+1\) characters, we must sample at least 2 of them (each sample is a position/character), since each position is only contained in \(\ell=w\) of the \(w+1\) windows, so a single sampled position cannot cover all windows. This gives rise to the \(2/(w+1)\) lower bound. (The paper has a more precise analysis with additional terms for the cases where the \((w+1)\)-long string is repetitive, in which case additional samples are inevitable.) A scheme with density that matches the lower bound must have equality everywhere, which means that it must sample exactly 2 positions on every primitive (non-repetitive) circular string of length \(w+1\).
This is now a purely combinatorial problem ;)
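Concretely, here is a brute-force Python sketch of that check: enumerate all primitive binary cycles of length \(w+1\), and for a given \(k=1\) scheme count the distinct positions it samples on each; the scheme matches the lower bound exactly iff that count is 2 everywhere. The plain lexicographic minimizer below is only a placeholder scheme, not one claimed to be optimal.

```python
from itertools import product

def is_primitive(c):
    """True if the circular string c has no period shorter than its length."""
    n = len(c)
    return all(n % p != 0 or any(c[i] != c[i % p] for i in range(n))
               for p in range(1, n))

def minimizer_pos(window):
    """Placeholder k=1 scheme: leftmost smallest character."""
    return min(range(len(window)), key=lambda i: (window[i], i))

def samples_per_cycle(cycle, w, scheme):
    """Distinct positions sampled over all w+1 windows of the circular string."""
    n = len(cycle)  # n = w + 1
    doubled = cycle + cycle
    return len({(i + scheme(doubled[i:i + w])) % n for i in range(n)})

def count_bad_cycles(w, scheme, sigma=2):
    """Primitive cycles of length w+1 on which the scheme samples more than 2 positions."""
    bad = 0
    for cycle in product(range(sigma), repeat=w + 1):
        if is_primitive(cycle) and samples_per_cycle(list(cycle), w, scheme) > 2:
            bad += 1
    return bad

if __name__ == "__main__":
    for w in range(2, 9):
        print(w, count_bad_cycles(w, minimizer_pos))
```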
3 Optimal schemes
If we assume a \(\sigma=2\) alphabet, the anti-lexicographic SUS-anchors do great: in each window, they sample the starting position of the smallest suffix, where we implicitly append a large (rather than small) sentinel at the end of the string. The anti-lexicographic part means that we would like the first character to be as large as possible, and the remaining characters to be as small as possible.
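Here is a minimal Python sketch of my reading of this scheme for \(k=1\) and \(\sigma=2\) (the exact suffix order and sentinel handling are my interpretation of the description above), together with a count of the primitive \((w+1)\)-cycles on which it samples 3 or more positions:

```python
from itertools import product

SENTINEL = 2  # larger than any character of the sigma=2 alphabet

def suffix_key(window, i):
    # Anti-lexicographic: first character as large as possible (hence negated),
    # remaining characters as small as possible, with a large sentinel appended
    # so that ties between nested suffixes prefer the longer suffix.
    return (-window[i],) + tuple(window[i + 1:]) + (SENTINEL,)

def anti_lex_sus_anchor(window):
    """Offset of the smallest suffix of the window under the order above."""
    return min(range(len(window)), key=lambda i: suffix_key(window, i))

def is_primitive(c):
    n = len(c)
    return all(n % p != 0 or any(c[i] != c[i % p] for i in range(n))
               for p in range(1, n))

def bad_cycles(w):
    """Primitive binary cycles of length w+1 with more than 2 sampled positions."""
    n, bad = w + 1, 0
    for cycle in product(range(2), repeat=n):
        if not is_primitive(cycle):
            continue
        doubled = list(cycle) + list(cycle)
        sampled = {(i + anti_lex_sus_anchor(doubled[i:i + w])) % n for i in range(n)}
        if len(sampled) > 2:
            bad += 1
    return bad

if __name__ == "__main__":
    for w in range(2, 11):
        print(w, bad_cycles(w))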
Now this scheme is not exactly optimal yet. I have increasingly complicated manual variations of the scheme that are optimal up to \(w=12\) or \(w=13\) or so (for \(k=1\) and \(\sigma=2\)). It would be great to have simple and clean schemes for which we can prove exact optimality, rather than endlessly fiddling with very tricky schemes, or having to count/bound the exact number of cycles on which they are not yet optimal (i.e., sample 3 or more positions).