[WIP] Minimizers and more

1 Introduction


  • Lots of DNA data
  • Most algorithms deal with k-mers.
  • k-mers overlap, and hence considering all of them is redundant.
  • Thus: sample a subset of the k-mers.
  • Must be ‘locally consistent’ and deterministic to be useful.
  • Enter random minimizers.
  • Parameter \(w\): guarantee that at least one k-mer is sampled out of every window of \(w\) k-mers.
  • Density \(d\): (expected) overall fraction of sampled k-mers.
  • Obviously, \(d\geq 1/w\).
  • For the random minimizer, \(d\approx 2/(w+1)\).
  • Lower density => fewer k-mers, smaller indices, faster algorithms.
  • Question: How low a density can we achieve for given \(k\) and \(w\)?
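To make this concrete, below is a minimal sketch of the random minimizer (in Python; `random_minimizer_positions` and the salted `hash` are illustrative stand-ins, not from any particular paper), estimating its density on a uniform random string:

```python
import random

def random_minimizer_positions(seq, k, w, seed=42):
    """Positions sampled by the random minimizer: in every window of w
    consecutive k-mers, take the k-mer with the smallest hash
    (leftmost wins ties)."""
    salt = random.Random(seed).getrandbits(64)
    h = lambda kmer: hash((salt, kmer))  # stand-in for a random hash
    positions = set()
    for i in range(len(seq) - (w + k - 1) + 1):
        positions.add(min(range(i, i + w), key=lambda j: (h(seq[j:j + k]), j)))
    return positions

# Density = fraction of k-mers sampled; expect roughly 2/(w+1).
seq = "".join(random.choice("ACGT") for _ in range(100_000))
k, w = 11, 10
d = len(random_minimizer_positions(seq, k, w)) / (len(seq) - k + 1)
print(f"empirical density {d:.4f} vs 2/(w+1) = {2 / (w + 1):.4f}")
```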

1.1 Overview

Figure 1: An overview of the papers this post discusses, showing authors and categories of each paper.

1.2 Previous reviews

  • Zheng, Marçais, and Kingsford (2023)
  • Ndiaye et al. (2024)

2 Theory of sampling schemes

[At RECOMB 2022, discussing DeepMinimizer]

Why would we even care about a better minimizer? We have this simple and fast random minimizer that’s only at most \(2\times\) away from optimal. Why would anyone invest time in optimizing this by maybe \(25\%\)? There are much bigger gains possible elsewhere.

  • Broder (1997)

    • Take the \(s\) k-mers with the smallest hashes, then estimate the Jaccard similarity based on them.
  • Schleimer, Wilkerson, and Aiken (2003)

    • \(k\): noise threshold
    • \(\ell\): guarantee threshold
    • winnowing: Definition 1: Select minimum hash in each window.
    • Charged contexts to prove a \(2/(w+1)\) density, assuming no duplicate hashes (and \(k\)-mers)
    • local algorithm: Function on k-mer hashes, rather than on window itself: \(S(h_i, \dots, h_{i+w-1})\).
    • Local algorithms have density at least \((1.5+1/2w)/(w+1)\).
    • Conjecture that \(2/(w+1)\) is optimal.
    • Robust Winnowing: smarter tie-breaking: same as previous window in case of tie if possible, otherwise rightmost.
    • ‘threshold’ \(t=w+k-1\)
    • order via hash
  • Roberts et al. (2004)

    • interior minimizers: if two strings share a substring of length \(w+k-1\), they share a minimizer
    • Same heuristic argument for \(2/(w+1)\) density, assuming distinct k-mers.
    • \(w\leq k\) guarantees no gaps (uncovered characters) between minimizers
    • end minimizers: minimizers of a prefix/suffix of the string of length \(<\ell\).
    • lexicographic ordering is bad on consecutive zeros.
    • ‘Alternating’ order: even positions have reversed order.
    • Increase chance of ‘rare’ k-mers being minimizers.
    • Reverse-complement-stable minimizers: \(\mathrm{ord}(kmer) = \min(kmer, \mathrm{revcomp}(kmer))\).
    • Some heuristic argument that sensitivity goes as \(k+w/2\).
    • \(k<\log_\sigma(N)\) may have bad sensitivity.
  • Marçais et al. (2017)

    • Main goal is to disprove the \(2/(w+1)\) conjectured lower bound.
    • States that Schleimer, Wilkerson, and Aiken (2003) defines a local scheme as only having access to the sequence within a window, but actually, it only has access to the hashes.
    • UHS to obtain ordering with lower density than lex or random.
    • DOCKS goes below \(1.8/(w+1)\), so the conjecture doesn’t hold.
    • Random order has density slightly below \(2/(w+1)\).
    • Defines density factor \(d_f = d\cdot(w+1)\).1
    • UHS sparsity \(SP(U)\): the fraction of contexts containing exactly one k-mer from \(U\).
      • \(d = 2/(w+1) \cdot (1-SP(U))\)
    • The density of a minimizer scheme can be computed on a De Bruijn sequence of order \(k+w\) (a brute-force version of this computation is sketched in 2.4).
    • The density of a local scheme can be less than \(2/(w+1)\).
    • Does not refute the \((1.5+1/2w)/(w+1)\) lower bound.
  • Marçais, DeBlasio, and Kingsford (2018)

    • Properly introduces \(\mathrm{local} \supseteq \mathrm{forward} \supseteq \mathrm{minimizer}\) schemes.

    • Realizes that the \((1.5+1/2w)/(w+1)\) lower bound only holds for randomized local schemes.

    • Studies asymptotic behaviour in \(k\) and \(w\)

    • For \(k\to\infty\), a minimizer scheme with density approaching \(1/w\).

    • For \(w\to\infty\), a \(1/\sigma^k\) lower bound on minimizer schemes.

      • Forward schemes can achieve density \(O(1/\sqrt{w})\) instead, by using \(k' = \log_\sigma(\sqrt{w})\).
    • A lower bound on forward schemes of \(\frac{1.5 + 1/2w + \max(0, \lfloor(k-w)/w\rfloor)}{w+k}\).

      • Proof looks at two consecutive windows and the fact that half the time, the sampled k-mers leave a gap of \(w\) in between, requiring an additional sampled k-mer.
    • Local schemes can be strictly better than forward, found using ILP.

    • New lower bound on forward schemes.

    • For local schemes, a De Bruijn sequence of order \(2w+k-2\) can be used to compute density.

    • UHS-minimizer compatibility.

    • Naive extension for UHS: going from \(k\) to \(k+1\) by ignoring extra characters.

    • Construction of the asymptotic (\(k\to\infty\)) scheme is complex, but comes down to roughly: for each \(i\in [w]\), sum the characters at positions \(\equiv i\pmod w\). Take the k-mer at the position \(i\) for which the sum is maximal. (In the paper it’s slightly different, in that a context-free version is defined where a k-mer is ‘good’ if the sum of its \(0\pmod w\) characters is larger than the sums for the other equivalence classes, and then there is an argument that good k-mers are close to a UHS, and turning them into a real UHS only requires ‘few’ extra k-mers.)

    • \(d(k, w)\) is decreasing in \(w\).

  • Edgar (2021)

    • Introduces open syncmers and closed syncmers (sketched at the end of this list).
    • context-free: each k-mer is selected or not based only on its own sequence.
    • Conservation: probability that a sampled k-mer is preserved under mutations.
    • context-free sampled k-mers are better conserved.
  • Shaw and Yu (2021)

    • Formalizes conservation: the fraction of bases covered by sampled k-mers.
    • k-mer selection method: samples an arbitrary subset of k-mers.
    • \(q\)-local selection method: \(f\) looks at a \((k+q-1)\)-mer, and returns some subset of its k-mers.
    • word-based method: a ‘context-free’ method where for each k-mer it is decided independently whether it is sampled or not.
  • Belbasi et al. (2022)

    • The Jaccard similarity estimated from random minimizers is biased and inconsistent.
  • Levenshtein (1970)

    • Shows an upper bound on the maximum number of non-overlapping words: \[\frac 1k \left(\frac{k-1}{k}\right)^{k-1} \sigma^k\]
  • Blackburn (2015)

    • Divide the alphabet into two parts. Then patterns abbbb and, e.g., aab?b?b?b are non-overlapping (b: any non-a character).
    • For DNA, optimal solution (max number of pairwise non-overlapping words) for \(k=2\) is [AG][CT], while for \(k\in\{3,4,5,6\}\), an optimal solution is given by A[CTG]+.
    • Re-prove upper bound on number of non-overlapping words \(\sigma^k/(2k-1)\).
    • Re-prove upper bound of Levenshtein above.
    • Show that a scheme of size \[\frac{\sigma-1}{e\sigma} \frac{\sigma^k}{k}\] exists.
    • New scheme: instead of \(0\) and \({>}0\), an arbitrary partition: the prefix is in some set \(S\), while the suffix is \(S\)-free.
      • When \(k\) divides \(\sigma\), choose \(|I| = \sigma/k\) and \(|J| = \sigma-\sigma/k\), and consider strings IIIIIIJ. These are optimal.
      • The set \(S\) is needed to avoid rounding errors when \(\sigma\) is small.
      • Conjecture: a suffix of JJ or longer is never optimal.
  • Frith, Noé, and Kucherov (2020)

    • minimally overlapping words are anti-clustered, hence good for sensitivity.
    • cg-order: alternate small and large characters, as in Roberts et al. (2004)
    • abb-order: compare the first character normally, and the rest by t=g=c<a.
  • Stanovnik, Moškon, and Mraz (2024)

    • ILP to solve the problem for more \((k, \sigma)\) pairs.
  • Frith, Shaw, and Spouge (2022)

    • Test various word-sets for their sparsity and specificity.
  • Golan and Shur (2024)

    • The random minimizer has density just below \(2/(w+1)\) when \(k>w\) and \(w\) is sufficiently large.
    • \(O(w^2)\) method to compute the exact density of random minimizer.
    • The \(2/j\) and \(1/j\) fractions were observed before in (Marçais et al. 2017)
  • Kille et al. (2024)

    • Lower bound on density of \(\frac{1}{w+k}\left\lceil\frac{w+k}{w}\right\rceil\).
    • Tighter version by counting pure cycles of all lengths.
    • Instead of \(k\), one can also use the bound for \(k'\geq k\) with \(k'\equiv 1\pmod w\).
  • Zheng, Kingsford, and Marçais (2021a)

    • UHS-minimizer compatibility; remaining path length \(L \leq \ell\)
    • \(d \leq |U|/\sigma^k\).
    • Mentions decycling set of Mykkeltveit (1972)
    • Theorem 2: Forward sampling scheme with density \(O(\ln(w) / w)\) (where \(k\) is small/constant), and a corresponding UHS.
    • selection scheme: selects positions rather than k-mers, i.e., \(k=1\).
    • Assumes \(w\to\infty\), so anyway \(k=O(1)\) or \(k=1\) are kinda equivalent.
    • Theorem 1: a local scheme implies a \((2w-1)\)-UHS, a forward scheme implies a \((w+1)\)-UHS.
    • Theorem 3: Gives an upper and lower bound on the remaining path length of the Mykkeltveit set: it’s between \(c_1\cdot w^2\) and \(c_2\cdot w^3\).
    • Local schemes: \(w-1\) ‘looking back’ context for \(2w+k-2\) total context size.
      • The charged contexts are a UHS.
    • \(O(\ln(w)/w)\) forward scheme construction:
      • Definition 2 / Lemma 2: The set of words that either start with \(0^d\) or do not contain \(0^d\) at all is a UHS. Set \(d = \log_\sigma(w /\ln w)-1\). This has longest remaining path length \(w-d\).
      • Then a long proof that the relative size is \(O(\ln(w) / w)\).
      • (In hindsight: this is a variant of picking the smallest substring, as long as it is sufficiently small.)
    • Questions:
      • We can go from a scheme \(f\) to a UHS. Can we also go back?
      • Does a perfect selection scheme exist?
  • Zheng, Kingsford, and Marçais (2020)

    • For \(w\to\infty\), minimizer schemes can be optimal (have density \(O(1/w)\)) if and only if \(k \geq \log_\sigma(w) - O(1)\). In fact, the lexicographic minimizer is optimal.
    • When \(k\geq (3+\varepsilon)\log_\sigma(w)\), the random minimizer has expected density \(2/(w+1)+o(1/w)\), fixing the proof by (Schleimer, Wilkerson, and Aiken 2003).
    • When \(\varepsilon>0\) and \(k>(3+\varepsilon)\log_\sigma w\), the probability of duplicate k-mers in a window is \(o(1/w)\).
      • TODO: Hypothesis: the \(3\) could also be a \(2\), or actually even a \(1\)?
    • Turn charged contexts of a minimizer scheme into a \((w+k)\)-UHS.
    • Relative size of UHS is upper bound on density of compatible minimizer.
  • Chikhi et al. (2014)

    • Order k-mers by their frequency in the dataset.
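To illustrate the context-free schemes of Edgar (2021) and the selection-method taxonomy of Shaw and Yu (2021) referenced above, here is a minimal sketch (illustrative names, not from the papers; `h` is any hash function on s-mers). Note that whether a k-mer is sampled depends only on the k-mer itself, not on its neighbors:

```python
def smallest_smer_pos(kmer, s, h):
    """Leftmost position of the smallest s-mer inside a k-mer."""
    return min(range(len(kmer) - s + 1), key=lambda i: (h(kmer[i:i + s]), i))

def is_closed_syncmer(kmer, s, h):
    """Closed syncmer: the smallest s-mer is the first or the last one."""
    p = smallest_smer_pos(kmer, s, h)
    return p == 0 or p == len(kmer) - s

def is_open_syncmer(kmer, s, t, h):
    """Open syncmer: the smallest s-mer sits at a fixed offset t."""
    return smallest_smer_pos(kmer, s, h) == t
```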

2.1 Questions

Main question: What is the lowest possible density for given \((k, w)\)?

The first question:

  • What is a scheme?

This question is then approached from two sides:

  • Lower bounds on density for \((k,w,\sigma)\)?
  • Tight lower bounds for some parameters?
  • Tight lower bounds, asymptotic in parameters (e.g., \(\sigma\to\infty\))?
  • Can we make tight lower bounds for all practical parameters?
  • If not, can we understand why the best schemes found (using ILP) do not reach known bounds?

And:

  • What is the empirical density of existing schemes?
  • Can we model existing schemes and compute their density exactly?
  • Can we make near-optimal schemes (say, within \(1\%\) from optimal) for practical parameters?
  • Can we make exactly optimal schemes, for asymptotic parameters?
  • Can we make optimal schemes for practical parameters?
  • Can we make ‘pure’ optimal schemes, that do not require exponential memory?
  • If we cannot make pure optimal schemes, can we brute-force search for them instead?

2.2 Types of schemes

scope:

type:

  • sampling scheme: sample k-mer
  • selection scheme: sample position (\(k=1\))

2.3 Parameter regimes

  • small \(k\): \(k < \log_\sigma(w)\)
  • large \(k\): \(k\gg w\) or \(k\to \infty\).
  • ‘practical’: \(4\leq k \leq 2w\) with \(w\leq 20\) or so; depends on the application.
  • binary/DNA alphabet \(\sigma\in\{2,4\}\).
  • large/infinite alphabet, \(\sigma=256\) or \(\sigma\to\infty\).

2.4 Different perspectives

  • charged contexts of length \(w+1\).
  • pure cycles of length \(w+k\).
  • long random strings.
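The first two perspectives give a brute-force way to compute exact densities for small parameters: enumerate all contexts of \(w+1\) k-mers (\(w+k\) characters) and count the charged ones, i.e., those whose two overlapping windows sample different positions. A minimal sketch, assuming a minimizer scheme given by a total order on k-mers (illustrative code, not from any paper):

```python
from itertools import product
import random

def exact_density(order, k, w, sigma):
    """Exact density of the minimizer scheme induced by `order`: the
    fraction of (w+k)-character contexts that are charged, i.e. whose two
    overlapping windows of w k-mers sample different absolute positions."""
    charged = 0
    for ctx in product(range(sigma), repeat=w + k):
        kmers = [ctx[i:i + k] for i in range(w + 1)]
        p1 = min(range(w), key=lambda i: (order(kmers[i]), i))
        p2 = min(range(1, w + 1), key=lambda i: (order(kmers[i]), i))
        charged += (p1 != p2)
    return charged / sigma ** (w + k)

# Random order on binary 3-mers, w = 4: expect a density below 2/(w+1) = 0.4.
rank = {km: random.random() for km in product(range(2), repeat=3)}
print(exact_density(lambda km: rank[km], k=3, w=4, sigma=2))
```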

2.5 UHS vs minimizer scheme

  • A UHS corresponds to a minimizer scheme where the order only distinguishes membership: UHS members get order \(0\), everything else order \(1\).
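A minimal sketch of this correspondence (illustrative, not from a specific paper): a UHS-compatible order ranks every UHS member before every non-member, breaking ties by a hash. If the UHS hits every window, the resulting minimizer only ever samples UHS members.

```python
def uhs_compatible_order(uhs, h):
    """Total order in which all UHS members come first; within each class,
    ties are broken by the hash h. Python compares the (bool, hash) tuples
    lexicographically, so `kmer not in uhs` acts as the primary key."""
    return lambda kmer: (kmer not in uhs, h(kmer))
```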

2.6 (Asymptotic) bounds

2.7 Lower bounds

3 Minimizer schemes

3.1 Orders

3.2 UHS-based and search-based schemes

  • Orenstein et al. (2016, 2017)
    • Introduces UHS
    • DOCKS finds a UHS (a brute-force check of the UHS property is sketched after this list).
    • Finding an optimal UHS is hard when an arbitrary set of strings to be hit is given. (But here we have a DBg, which may be easier.)
    • The size of a UHS may be much smaller than the set of all possible minimizers.
    • DOCKS UHS density is close to optimal (?)
    • Step 1: Start with the Mykkeltveit decycling set.
    • Step 2: repeatedly find the vertex with maximal ‘hitting number’, the number of \(\ell\)-long paths going through it, and add it to the UHS (and remove it from the graph).
    • DOCKSany: compute number of paths of any length, instead of length \(\ell\).
    • DOCKSanyX: remove the top \(X\) vertices at a time.
    • Applies ‘naive extension’ to work for larger \(k\).
    • Runs for (many) hours to compute UHS for \(k=11\) already.
    • An ILP to improve UHSes found by DOCKS; improves by only a few percent at best.
    • DOCKS selects far fewer distinct k-mers compared to random minimizers, and has slightly lower density.
    • Does not use a compatible minimizer order.
  • DeBlasio et al. (2019)
    • Extends UHS generated by DOCKS
    • larger \(k\) up to \(200\), but \(L\leq 21\).
    • Merges UHS with random minimizer tiebreaking.
    • Mentions sparsity
    • Starts with UHS for small \(k\) and grows one-by-one to larger \(k\). Full process is called reMuval.
      • First, naive extension
      • Second, an ILP to reduce the size of the new UHS and increase the number of singletons: windows containing exactly one k-mer. (Since density directly correlates with sparsity.)
    • Naive extension can decrease density.
    • Remove k-mers from the UHS that always co-occur with another k-mer in every window.
    • The ILP decides whether each k-mer is retained in the UHS or not, such that every window preserves at least one element of the UHS.
    • Also does sequence-specific minimizers
  • Ekim, Berger, and Orenstein (2020)
    • Improves DOCKS using randomized parallel algorithm for set-cover.
    • Faster computation of hitting numbers.
    • Scales to \(k\leq 16\).
  • Hoang, Zheng, and Kingsford (2022)
    • Learns a total order, instead of a UHS.
    • Continuous objective, rather than discrete.
    • UHSes are ‘underspecified’ since the order within each component is not given. Determining the permutation directly is more powerful.
    • Around \(5\%\) better than PASHA.
  • Golan et al. (2024)
    • Unlike UHS-based methods that optimize UHS size, this directly optimizes minimizer density by minimizing the number of charged contexts:
      • Repeatedly pick as the next-smallest k-mer in the order the one that appears in the smallest fraction of charged contexts.
      • Then add some noise (slightly suboptimal choices), and local search with random restarts on top.
    • Builds schemes for alphabet size \(\sigma'=2\) and \(k'\leq 20\), which are extended to \(\sigma=4\) and to larger \(k\) when \(k>20\).
    • Achieves very low density. Open question how close to optimal.
    • Not ‘pure’: requires memory to store the order of k-mers.
  • Zheng, Kingsford, and Marçais (2021b)
    • A polar set intersects each \(w\)-mer at most once.
    • Two k-mers in a polar set are at least \((w+1)/2\) apart.
    • Lemma 4: Formula for probability that a window is charged, in terms of the number of unique k-mers.
    • Progressively add ‘layers’ to the polar set to fill gaps.
    • Heuristic: greedily try to pick k-mers that are exactly \(w\) apart, by choosing a random offset \(o\in [w]\), and adding all those k-mers as long as they aren’t too close to already chosen k-mers.
      • Up to 7 rounds in practice.
    • Filter out too-frequent k-mers.
    • Significantly improved density over other methods.
    • Requires explicitly storing an order.
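As referenced above under DOCKS, the defining property of a UHS can be checked by brute force, though only for tiny \(k\) and \(w\); the papers above instead exploit the structure of the De Bruijn graph. A minimal sketch (illustrative; k-mers are tuples over \(\{0,\dots,\sigma-1\}\)):

```python
from itertools import product

def is_uhs(U, k, w, sigma=4):
    """Check that U is a universal hitting set: every window of w
    consecutive k-mers (i.e., every (w+k-1)-mer) contains a k-mer from U."""
    return all(
        any(window[i:i + k] in U for i in range(w))
        for window in product(range(sigma), repeat=w + k - 1)
    )
```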

3.3 Pure schemes

  • Zheng, Kingsford, and Marçais (2020)
    • Considers all closed syncmers in a window. Picks the smallest one.
    • Parameter \(k_0\) (we call it \(s\)): the length of the hashed ‘inner’ slices.
    • For \(k > w + O(\log_\sigma(w))\), has density below \(1.67/w + o(1/w)\).
      • This requires a long proof.
    • First scheme with guaranteed density \(<2/(w+1)\) when \(k\approx w\) (instead of \(k\gg w\)).
    • Does not require expensive heuristics for precomputation; no internal storage.
    • Charged contexts of a \((w_0, k_0)\) minimizer form a UHS for the \((w, k=w_0+k_0)\) minimizer, as long as \(w\geq w_0\).
  • Pellow et al. (2023)
    • MDS: a set of k-mers that hits every cycle in the DBg.
    • Mykkeltveit embedding: map each k-mer to a complex number. Take those k-mers with argument (angle) between \(0\) and \(2\pi/k\) as a context-free hitting set.
    • Take a compatible minimizer.
    • Even better: prefer argument in \([0, 2\pi/k)\), and otherwise prefer argument \([\pi, \pi+2\pi/k)\).
    • Great density for \(k\) just below \(w\).
    • MDS orders outperform DOCKS and PASHA.
    • Scales to larger \(k\).
  • Groot Koerkamp and Pibiri (2024)
    • For \(k > w\), look at \(t\)-mers with \(t=k\bmod w\) instead. If the smallest \(t\)-mer is at position \(x\), sample the \(k\)-mer at position \(x\bmod w\) (see the sketch after this list).
    • Asymptotic optimal density as \(w\to\infty\).
    • Close to optimal for large alphabet when \(k\equiv 1\pmod w\).
  • Groot Koerkamp, Liu, and Pibiri (2024)
    • Extend miniception to open syncmers, and open followed by closed syncmers.
    • Extend modmini to wrap any other sampling scheme.
    • Simple and very efficient scheme, for any \(k\).
    • GreedyMini has lower density, but is more complex.
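A minimal sketch of the mod-minimizer as described above, under the simplification \(t = k \bmod w\) (the paper additionally enforces a small lower bound \(r\) on \(t\), using \(t = r + ((k - r) \bmod w)\), so that \(t\)-mers are not too short; `h` is a random hash on \(t\)-mers):

```python
def mod_minimizer_pos(window, k, w, h):
    """Sampled k-mer position in a window of w consecutive k-mers
    (w + k - 1 characters): find the leftmost smallest t-mer, at position
    x say, and sample the k-mer at position x mod w."""
    t = k % w  # simplification; assumed > 0 here
    x = min(range(w + k - t), key=lambda i: (h(window[i:i + t]), i))
    return x % w
```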

3.4 Other variants

  • Kille et al. (2023)
    • Sample the \(s\) smallest k-mers from every \(s\cdot w\) consecutive k-mers.
  • Irber et al. (2022)
    • Sample all k-mers with hash below \(f\cdot H\), where \(H\) is the maximum hash value (see the sketch after this list).
  • Chikhi et al. (2014)
    • Frequency-aware minimizers TODO
  • Alanko, Biagi, and Puglisi (2024)
    • frequency bounded minimizers, with frequency below \(t\)
    • Prefers rare k-mers as minimizers.
    • variable-length scheme.
    • Shortest unique finimizers.
    • Uses the SBWT to work around the ‘non-local’ property.
    • Useful for SSHash-like indices.
    • Defines DSPSS: Disjoint spectrum preserving string set.
    • For each kmer, find the shortest contained substring that occurs at most \(t\) times in the DBg of the input.
    • (TODO: I’m getting a bit lost on the technicalities with the SBWT.)
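For the FracMinHash sampling of Irber et al. (2022) mentioned above, a one-line sketch (illustrative; `h` is assumed to map k-mers uniformly into \([0, H)\)):

```python
def fracminhash(kmers, f, h, H=2**64):
    """Keep each k-mer whose hash lands in the lowest fraction f of the
    hash range; each k-mer is kept independently with probability f."""
    return {km for km in kmers if h(km) < f * H}
```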

3.5 Selection schemes

These have \(k=1\).

  • Loukides and Pissis (2021; Loukides, Pissis, and Sweering 2023)
    • In each window, sample the position that starts the lexicographically smallest rotation (see the sketch after this list).
    • Avoid sampling the last \(r\approx \log_\sigma(w)\) positions, as they cause ‘unstable’ anchors.
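A minimal sketch of the bd-anchor selection rule above (naive \(O(w^2)\) slice comparisons; Booth's algorithm achieves \(O(w)\)). For the reduced variant, restrict the candidates to `range(w - r)`:

```python
def smallest_rotation_pos(window):
    """Leftmost position in `window` starting its lexicographically
    smallest rotation; rotations are read from the doubled string."""
    w = len(window)
    doubled = window + window
    return min(range(w), key=lambda i: doubled[i:i + w])
```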

3.6 Canonical minimizers

  • Pan and Reinert (2024)
    • Choose the strandedness via higher CG-content.
  • Wittler (2023)
    • TODO
  • Marçais, Elder, and Kingsford (2024)
    • TODO

4 Open questions

  • How much are local schemes better than forward schemes?
  • How much are forward schemes better than minimizer schemes? Only for small \(k\)?
  • How close to optimal is the greedy minimizer?

References

Alanko, Jarno N., Elena Biagi, and Simon J. Puglisi. 2024. “Finimizers: Variable-Length Bounded-Frequency Minimizers for k-Mer Sets,” February. https://doi.org/10.1101/2024.02.19.580943.
Belbasi, Mahdi, Antonio Blanca, Robert S Harris, David Koslicki, and Paul Medvedev. 2022. “The Minimizer Jaccard Estimator Is Biased and Inconsistent.” Bioinformatics 38 (Supplement 1): i169–76. https://doi.org/10.1093/bioinformatics/btac244.
Blackburn, Simon R. 2015. “Non-Overlapping Codes.” IEEE Transactions on Information Theory 61 (9): 4890–94. https://doi.org/10.1109/tit.2015.2456634.
Broder, A.Z. 1997. “On the Resemblance and Containment of Documents.” Proceedings. Compression and Complexity of Sequences 1997 (Cat. No.97TB100171). https://doi.org/10.1109/sequen.1997.666900.
Chikhi, Rayan, Antoine Limasset, Shaun Jackman, Jared T. Simpson, and Paul Medvedev. 2014. “On the Representation of de Bruijn Graphs.” In Research in Computational Molecular Biology, 35–55. Springer International Publishing. https://doi.org/10.1007/978-3-319-05269-4_4.
Chikhi, Rayan, Antoine Limasset, and Paul Medvedev. 2016. “Compacting de Bruijn Graphs from Sequencing Data Quickly and in Low Memory.” Bioinformatics 32 (12): i201–8. https://doi.org/10.1093/bioinformatics/btw279.
DeBlasio, Dan, Fiyinfoluwa Gbosibo, Carl Kingsford, and Guillaume Marçais. 2019. “Practical Universal K-Mer Sets for Minimizer Schemes.” In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 167–76. https://doi.org/10.1145/3307339.3342144.
Edgar, Robert. 2021. “Syncmers Are More Sensitive than Minimizers for Selecting Conserved K‑Mers in Biological Sequences.” PeerJ 9 (February): e10805. https://doi.org/10.7717/peerj.10805.
Ekim, Barış, Bonnie Berger, and Yaron Orenstein. 2020. “A Randomized Parallel Algorithm for Efficiently Finding near-Optimal Universal Hitting Sets.” In Research in Computational Molecular Biology, 37–53. Springer International Publishing. https://doi.org/10.1007/978-3-030-45257-5_3.
Frith, Martin C, Laurent Noé, and Gregory Kucherov. 2020. “Minimally Overlapping Words for Sequence Similarity Search.” Edited by Yann Ponty. Bioinformatics 36 (22–23): 5344–50. https://doi.org/10.1093/bioinformatics/btaa1054.
Frith, Martin C., Jim Shaw, and John L. Spouge. 2022. “How to Optimally Sample a Sequence for Rapid Analysis,” August. https://doi.org/10.1101/2022.08.18.504476.
Golan, Shay, Ido Tziony, Matan Kraus, Yaron Orenstein, and Arseny Shur. 2024. “Generating Low-Density Minimizers,” November. https://doi.org/10.1101/2024.10.28.620726.
Golan, Shay, and Arseny M. Shur. 2024. “Expected Density of Random Minimizers.” arXiv. https://doi.org/10.48550/ARXIV.2410.16968.
Groot Koerkamp, Ragnar, Daniel Liu, and Giulio Ermanno Pibiri. 2024. “The Open-Closed Mod-Minimizer Algorithm.” bioRxiv. https://doi.org/10.1101/2024.11.02.621600.
Groot Koerkamp, Ragnar, and Giulio Ermanno Pibiri. 2024. “The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers.” In 24th International Workshop on Algorithms in Bioinformatics (Wabi 2024), edited by Solon P. Pissis and Wing-Kin Sung, 312:11:1–11:23. Leibniz International Proceedings in Informatics (Lipics). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPIcs.WABI.2024.11.
Hoang, Minh, Hongyu Zheng, and Carl Kingsford. 2022. “Differentiable Learning of Sequence-Specific Minimizer Schemes with DeepMinimizer.” Journal of Computational Biology 29 (12): 1288–1304. https://doi.org/10.1089/cmb.2022.0275.
Irber, Luiz, Phillip T Brooks, Taylor Reiter, N Tessa Pierce-Ward, Mahmudur Rahman Hera, David Koslicki, and C Titus Brown. 2022. “Lightweight Compositional Analysis of Metagenomes with FracMinHash and Minimum Metagenome Covers.” bioRxiv, 2022–01. https://doi.org/10.1101/2022.01.11.475838.
Kille, Bryce, Erik Garrison, Todd J Treangen, and Adam M Phillippy. 2023. “Minmers Are a Generalization of Minimizers That Enable Unbiased Local Jaccard Estimation.” Edited by Peter Robinson. Bioinformatics 39 (9). https://doi.org/10.1093/bioinformatics/btad512.
Kille, Bryce, Ragnar Groot Koerkamp, Drake McAdams, Alan Liu, and Todd Treangen. 2024. “A near-Tight Lower Bound on the Density of Forward Sampling Schemes.” bioRxiv. https://doi.org/10.1101/2024.09.06.611668.
Levenshtein, Vladimir I. 1970. “Maximum Number of Words in Codes without Overlaps.” Problems Inform. Transmission 6 (4): 88–90. http://mi.mathnet.ru/ppi1773.
Loukides, Grigorios, Solon P. Pissis, and Michelle Sweering. 2023. “Bidirectional String Anchors for Improved Text Indexing and Top-$k$ Similarity Search.” IEEE Transactions on Knowledge and Data Engineering 35 (11): 11093–111. https://doi.org/10.1109/tkde.2022.3231780.
Loukides, Grigorios, and Solon P. Pissis. 2021. “Bidirectional String Anchors: A New String Sampling Mechanism.” Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPICS.ESA.2021.64.
Marçais, Guillaume, Dan DeBlasio, and Carl Kingsford. 2018. “Asymptotically Optimal Minimizers Schemes.” Bioinformatics 34 (13): i13–22. https://doi.org/10.1093/bioinformatics/bty258.
Marçais, Guillaume, C S Elder, and Carl Kingsford. 2024. “K-Nonical Space: Sketching with Reverse Complements.” Edited by Macha Nikolski. Bioinformatics, October. https://doi.org/10.1093/bioinformatics/btae629.
Marçais, Guillaume, David Pellow, Daniel Bork, Yaron Orenstein, Ron Shamir, and Carl Kingsford. 2017. “Improving the Performance of Minimizers and Winnowing Schemes.” Bioinformatics 33 (14): i110–17. https://doi.org/10.1093/bioinformatics/btx235.
Mykkeltveit, Johannes. 1972. “A Proof of Golomb’s Conjecture for the de Bruijn Graph.” Journal of Combinatorial Theory, Series B 13 (1): 40–45. https://doi.org/10.1016/0095-8956(72)90006-8.
Ndiaye, Malick, Silvia Prieto-Baños, Lucy M. Fitzgerald, Ali Yazdizadeh Kharrazi, Sergey Oreshkov, Christophe Dessimoz, Fritz J. Sedlazeck, Natasha Glover, and Sina Majidian. 2024. “When Less Is More: Sketching with Minimizers in Genomics.” Genome Biology 25 (1). https://doi.org/10.1186/s13059-024-03414-4.
Orenstein, Yaron, David Pellow, Guillaume Marçais, Ron Shamir, and Carl Kingsford. 2016. “Compact Universal K-Mer Hitting Sets.” In Algorithms in Bioinformatics, 257–68. Springer International Publishing. https://doi.org/10.1007/978-3-319-43681-4_21.
———. 2017. “Designing Small Universal K-Mer Hitting Sets for Improved Analysis of High-Throughput Sequencing.” Edited by Benjamin J. Raphael. PLoS Computational Biology 13 (10): e1005777. https://doi.org/10.1371/journal.pcbi.1005777.
Pan, Chenxu, and Knut Reinert. 2024. “A Simple Refined DNA Minimizer Operator Enables Twofold Faster Computation.” Edited by Alfonso Valencia. Bioinformatics, January. https://doi.org/10.1093/bioinformatics/btae045.
Pellow, David, Lianrong Pu, Bariş Ekim, Lior Kotlar, Bonnie Berger, Ron Shamir, and Yaron Orenstein. 2023. “Efficient Minimizer Orders for Large Values of k Using Minimum Decycling Sets.” Genome Research, August. https://doi.org/10.1101/gr.277644.123.
Roberts, Michael, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, and James A. Yorke. 2004. “Reducing Storage Requirements for Biological Sequence Comparison.” Bioinformatics 20 (18): 3363–69. https://doi.org/10.1093/bioinformatics/bth408.
Schleimer, Saul, Daniel S. Wilkerson, and Alex Aiken. 2003. “Winnowing: Local Algorithms for Document Fingerprinting.” In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. SIGMOD/PODS ’03. ACM. https://doi.org/10.1145/872757.872770.
Shaw, Jim, and Yun William Yu. 2021. “Theory of Local K-Mer Selection with Applications to Long-Read Alignment.” Edited by Can Alkan. Bioinformatics 38 (20): 4659–69. https://doi.org/10.1093/bioinformatics/btab790.
Stanovnik, Lidija, Miha Moškon, and Miha Mraz. 2024. “In Search of Maximum Non-Overlapping Codes.” Designs, Codes and Cryptography 92 (5): 1299–1326. https://doi.org/10.1007/s10623-023-01344-z.
Wittler, Roland. 2023. “General Encoding of Canonical K-Mers.” Peer Community Journal 3 (September). https://doi.org/10.24072/pcjournal.323.
Zheng, Hongyu, Carl Kingsford, and Guillaume Marçais. 2020. “Improved Design and Analysis of Practical Minimizers.” Bioinformatics 36 (Supplement 1): i119–27. https://doi.org/10.1093/bioinformatics/btaa472.
———. 2021a. “Lower Density Selection Schemes via Small Universal Hitting Sets with Short Remaining Path Length.” Journal of Computational Biology 28 (4): 395–409. https://doi.org/10.1089/cmb.2020.0432.
———. 2021b. “Sequence-Specific Minimizers via Polar Sets.” Bioinformatics 37 (Supplement 1): i187–95. https://doi.org/10.1093/bioinformatics/btab313.
Zheng, Hongyu, Guillaume Marçais, and Carl Kingsford. 2023. “Creating and Using Minimizer Sketches in Computational Genomics.” Journal of Computational Biology 30 (12): 1251–76. https://doi.org/10.1089/cmb.2023.0094.

  1. I am not a fan of this: since the lower bound is \(1/w\), no scheme can actually achieve density factor \(1\). Calibrating the scale to the (somewhat arbitrary) random minimizer, instead of to the theoretical lower bound, does not really make sense to me.