1 Introduction
- Lots of DNA data
- Most algorithms deal with k-mers.
- k-mers overlap, and hence considering all of them is redundant.
- Thus: sample a subset of the k-mers.
- Must be 'locally consistent' and deterministic to be useful.
- Enter random minimizers.
- Parameter \(w\): guarantee that at least one k-mer is sampled out of every window of \(w\) k-mers.
- Density \(d\): (expected) overall fraction of sampled k-mers.
- Obviously, \(d\geq 1/w\).
- For the random minimizer, \(d=2/(w+1)\) (see the sketch below).
- Lower density => fewer k-mers, smaller indices, faster algorithms.
- Question: How small a density can we achieve for given \(k\) and \(w\)?
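To make the \(2/(w+1)\) figure concrete, here is a minimal Python sketch (mine, not from any paper) that measures the empirical density of a random minimizer on a uniform random string; it should print a value close to \(2/(w+1)\).

```python
import random

def random_minimizer_positions(s, k, w, seed=42):
    """Positions sampled by a random minimizer, i.e. a random order on k-mers."""
    rng = random.Random(seed)
    hashes = {}  # memoized random hash: equal k-mers get equal hashes
    def h(kmer):
        if kmer not in hashes:
            hashes[kmer] = rng.random()
        return hashes[kmer]
    positions = set()
    for i in range(len(s) - (w + k - 1) + 1):  # windows of w consecutive k-mers
        # min() over (hash, position) pairs breaks ties towards the leftmost position
        positions.add(min((h(s[j:j+k]), j) for j in range(i, i + w))[1])
    return positions

k, w, n = 11, 10, 100_000
s = ''.join(random.Random(1).choices('ACGT', k=n))
density = len(random_minimizer_positions(s, k, w)) / (n - k + 1)
print(f"empirical density {density:.4f}  vs  2/(w+1) = {2/(w+1):.4f}")
```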
1.1 Overview
1.2 Previous reviews
2 Theory of sampling schemes
[At RECOMB 2022, discussing DeepMinimizer]
Why would we even care about better minimizers? We have this simple and fast random minimizer that's only at most \(2\times\) away from optimal. Why would anyone invest time in optimizing this by maybe \(25\%\)? There are much bigger gains possible elsewhere.
Broder (1997)
- Take the \(s\) k-mers with the smallest hashes, then estimate Jaccard similarity based on them (see the sketch below).
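A minimal sketch of this bottom-\(s\) estimator, assuming only that both sequences share one hash function (sha1 here, for reproducibility); function names are mine.

```python
import hashlib

def h(kmer: str) -> int:
    # Any fixed hash shared between the two sequences works.
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], 'big')

def bottom_s(seq: str, k: int, s: int) -> set[int]:
    """The s smallest k-mer hashes of a sequence (its bottom-s sketch)."""
    kmers = {seq[i:i+k] for i in range(len(seq) - k + 1)}
    return set(sorted(h(x) for x in kmers)[:s])

def jaccard_estimate(a: str, b: str, k: int, s: int = 100) -> float:
    sa, sb = bottom_s(a, k, s), bottom_s(b, k, s)
    bottom_union = set(sorted(sa | sb)[:s])  # bottom-s sketch of the union
    return len(bottom_union & sa & sb) / len(bottom_union)
```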
Schleimer, Wilkerson, and Aiken (2003)
- \(k\): noise threshold
- \(\ell\): guarantee threshold
- winnowing: Definition 1: Select minimum hash in each window.
- Uses charged contexts to prove a \(2/(w+1)\) density, assuming no duplicate hashes (and \(k\)-mers).
- local algorithm: a function of the k-mer hashes, rather than of the window itself: \(S(h_i, \dots, h_{i+w-1})\).
- Local algorithms have density at least \((1.5+1/2w)/(w+1)\).
- Conjecture that \(2/(w+1)\) is optimal.
- Robust Winnowing: smarter tie-breaking: keep the same selection as the previous window in case of a tie if possible, otherwise pick the rightmost minimum (sketch below).
- 'threshold' \(t=w+k-1\)
- order via hash
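A sketch of the robust-winnowing tie-breaking rule as I read it (the function name and the list-of-hashes interface are mine):

```python
def robust_winnowing(hashes: list[int], w: int) -> list[int]:
    """Positions selected per window of w hashes; on ties, prefer the
    previously selected position, else the rightmost minimum."""
    selected, prev = [], None
    for i in range(len(hashes) - w + 1):
        m = min(hashes[i:i+w])
        ties = [j for j in range(i, i + w) if hashes[j] == m]
        pos = prev if prev in ties else ties[-1]
        if pos != prev:
            selected.append(pos)
            prev = pos
    return selected
```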
Roberts et al. (2004)
- interior minimizers: windows sharing a length-\(w+k-1\) substring share a minimizer.
- Same heuristic argument for \(2/(w+1)\) density, assuming distinct k-mers.
- \(w\leq k\) guarantees no gaps (uncovered characters) between minimizers
- end minimizers: minimizers of a prefix/suffix of the string of length \(<\ell\).
- lexicographic ordering is bad on consecutive zeros.
- ‘Alternating’ order: even positions have reversed order.
- Increase chance of ‘rare’ k-mers being minimizers.
- Reverse complement-stable minimizers: \(\mathrm{ord}(x) = \min(x, \mathrm{revcomp}(x))\) (sketch below).
- Some heuristic argument that sensitivity goes as \(k+w/2\).
- \(k<\log_\sigma(N)\) may have bad sensitivity.
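The reverse complement-stable order reduces to mapping each k-mer to a canonical representative before comparing; a minimal sketch (names mine):

```python
_COMP = str.maketrans('ACGT', 'TGCA')

def canonical(kmer: str) -> str:
    """Strand-stable representative: a k-mer and its reverse complement
    map to the same string, so any order on representatives is stable."""
    return min(kmer, kmer.translate(_COMP)[::-1])

assert canonical('ACCT') == canonical('AGGT')  # AGGT = revcomp(ACCT)
```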
Marçais et al. (2017)
- Main goal is to disprove the \(2/(w+1)\) conjectured lower bound.
- States that Schleimer, Wilkerson, and Aiken (2003) define a local scheme as only having access to the sequence within a window, while in fact their definition only grants access to the hashes.
- UHS to obtain ordering with lower density than lex or random.
- DOCKS goes below \(1.8/(w+1)\), so the conjecture doesn’t hold.
- Random order has density slightly below \(2/(w+1)\).
- Defines density factor \(d_f = d\cdot(w+1)\).¹
- UHS sparsity \(\mathrm{SP}(U)\): the fraction of contexts containing exactly one k-mer from \(U\).
- \(d = 2/(w+1) \cdot (1-SP(U))\)
- The density of a minimizer scheme can be computed on a De Bruijn sequence of order \(k+w\).
- The density of a local scheme can be less than \(2/(w+1)\).
- Does not refute the \((1.5+1/2w)/(w+1)\) lower bound.
Marçais, DeBlasio, and Kingsford (2018)
- Properly introduces the hierarchy \(\text{local} \supseteq \text{forward} \supseteq \text{minimizer}\).
- Realizes that the \((1.5+1/2w)/(w+1)\) lower bound only applies to randomized local schemes.
- Studies asymptotic behaviour in \(k\) and \(w\):
- For \(k\to\infty\), a minimizer scheme with density \(1/w\).
- For \(w\to\infty\), a \(1/\sigma^k\) lower bound on minimizer schemes.
- Forward schemes can achieve density \(O(1/\sqrt{w})\) instead, by using \(k' = \log_\sigma(\sqrt{w})\).
- A lower bound on forward schemes of \(\frac{1.5 + 1/2w + \max(0, \lfloor(k-w)/w\rfloor)}{w+k}\).
- Proof looks at two consecutive windows and the fact that half the time, the sampled k-mers leave a gap of \(w\) in between, requiring an additional sampled k-mer.
- Local schemes can be strictly better than forward schemes, found using ILP.
- New lower bound on forward schemes.
- For local schemes, a De Bruijn sequence of order \(2w+k-2\) can be used to compute density.
- UHS-minimizer compatibility.
- Naive extension for UHS: going from \(k\) to \(k+1\) by ignoring extra characters.
- The construction of the asymptotically optimal scheme for \(k\to\infty\) is complex, but roughly comes down to: for each \(i\in [w]\), sum the characters at positions congruent to \(i\pmod w\), and take the k-mer at the position \(i\) for which the sum is maximal; see the sketch after this list. (In the paper it is slightly different: a context-free version is defined where a k-mer is 'good' if the sum of its \(0\pmod w\) characters is larger than the sums for the other equivalence classes, and there is an argument that good k-mers are close to a UHS, and turning them into a real UHS only requires 'few' extra k-mers.)
- \(d(k, w)\) is decreasing in \(w\).
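A purely illustrative transcription of the rough column-sum rule above; this is my paraphrase, not the paper's actual context-free construction, and using `ord` as the character value is an arbitrary choice.

```python
def column_sum_position(window: str, w: int) -> int:
    """Pick the k-mer position i in [0, w) whose residue class mod w has
    the largest character sum over the whole window (length w + k - 1)."""
    sums = [sum(ord(window[j]) for j in range(i, len(window), w))
            for i in range(w)]
    return max(range(w), key=lambda i: sums[i])
```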
Edgar (2021)
- Introduces open syncmers and closed syncmers (sketch below).
- context-free: each k-mer is independently selected or not.
- Conservation: probability that a sampled k-mer is preserved under mutations.
- Context-free sampled k-mers are better conserved.
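Both syncmer variants are simple context-free predicates on a single k-mer; a sketch, using Python's builtin `hash` as a stand-in for a random order on s-mers (names mine):

```python
def smallest_smer(kmer: str, s: int, h=hash) -> int:
    """Position of the leftmost smallest s-mer inside a k-mer."""
    return min(range(len(kmer) - s + 1), key=lambda i: (h(kmer[i:i+s]), i))

def is_closed_syncmer(kmer: str, s: int) -> bool:
    # Closed: the smallest s-mer is the first or last s-mer of the k-mer.
    return smallest_smer(kmer, s) in (0, len(kmer) - s)

def is_open_syncmer(kmer: str, s: int, t: int = 0) -> bool:
    # Open: the smallest s-mer sits at one fixed offset t.
    return smallest_smer(kmer, s) == t
```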
Shaw and Yu (2021)
- Formalizes conservation: the fraction of bases covered by sampled kmers.
- k-mer selection method: samples an arbitrary subset of k-mers.
- \(q\)-local selection method: \(f\) looks at a \((k+q-1)\)-mer, and returns some subset of its k-mers.
- word-based method: a 'context-free' method where it is decided independently for each k-mer whether it is sampled.
Belbasi et al. (2022)
- The Jaccard similarity estimated from random minimizers is biased.
Levenshtein (1970)
- Shows an upper bound on the maximum number of pairwise non-overlapping words of \[\frac 1k \left(\frac{k-1}{k}\right)^{k-1} \sigma^k\]
Blackburn (2015)
- Divide the alphabet into two parts. Then patterns such as `abbbb` and e.g. `aab?b?b?b` are non-overlapping (`b`: any non-`a` character).
- For DNA, an optimal solution (maximum number of pairwise non-overlapping words) for \(k=2\) is `[AG][CT]`, while for \(k\in\{3,4,5,6\}\) an optimal solution is given by `A[CTG]+`.
- Re-proves the upper bound \(\sigma^k/(2k-1)\) on the number of non-overlapping words.
- Re-proves the upper bound of Levenshtein above.
- Shows an existing scheme with size \[\frac{\sigma-1}{e\sigma} \frac{\sigma^k}{k}\]
- New scheme: not \(0\) and \({>}0\), but an arbitrary partition, where the prefix is in some set \(S\) while the suffix is \(S\)-free.
- When \(k\) divides \(\sigma\), choose \(|I| = \sigma/k\) and \(|J| = \sigma-\sigma/k\), and consider strings `IIIIIIJ`. These are optimal.
- The set \(S\) is needed to avoid rounding errors when \(\sigma\) is small.
- Conjecture: a suffix of `JJ` or longer is never optimal.
Frith, Noé, and Kucherov (2020)
- Minimally overlapping words are anti-clustered, hence good for sensitivity.
- `cg`-order: alternate small and large characters, as in (Roberts et al. 2004).
- `abb`-order: compare the first character normally, and the rest by \(t=g=c<a\).
Stanovnik, Moškon, and Mraz (2024)
- ILP to solve the maximum non-overlapping words problem for more \((k, \sigma)\) pairs.
Frith, Shaw, and Spouge (2022)
- Test various word-sets for their sparsity and specificity.
Golan and Shur (2024)
- The random minimizer has density just below \(2/(w+1)\) when \(k>w\) and \(w\) is sufficiently large.
- \(O(w^2)\) method to compute the exact density of the random minimizer.
- The \(2/j\) and \(1/j\) fractions were already observed in (Marçais et al. 2017).
Kille et al. (2024)
- Lower bound on density of \(\frac1{w+k}\lceil\frac{w+k}w\rceil\).
- Tighter version by counting pure cycles of all lengths.
- Instead of \(k\), the bound can also be applied with any \(k'\geq k\) satisfying \(k'\equiv 1\pmod w\).
Zheng, Kingsford, and Marçais (2021a)
- UHS-minimizer compatibility; remaining path length \(L \leq \ell\)
- \(d \leq |U|/\sigma^k\).
- Mentions decycling set of Mykkeltveit (1972)
- Theorem 2: Forward sampling scheme with density \(O(\ln(w) / w)\) (where \(k\) is small/constant), and a corresponding UHS.
- selection scheme: selects positions rather than k-mers, i.e., \(k=1\).
- Assumes \(w\to\infty\), so \(k=O(1)\) and \(k=1\) are essentially equivalent anyway.
- Theorem 1: a local scheme implies a \((2w-1)\)-UHS; a forward scheme implies a \((w+1)\)-UHS.
- Theorem 3: Gives an upper and lower bound on the remaining path length of the Mykkeltveit set: it is between \(c_1\cdot w^2\) and \(c_2\cdot w^3\).
- Local schemes: \(w-1\) 'looking back' context for \(2w+k-2\) total context size.
- The charged contexts are a UHS.
- \(O(\ln(w)/w)\) forward scheme construction:
- Definition 2 / Lemma 2: The set of words that either start with \(0^d\) or do not contain \(0^d\) at all is a UHS. Set \(d = \log_\sigma(w /\ln w)-1\). This has longest remaining path length \(w-d\). (See the sketch after this list.)
- Then a long proof that the relative size is \(O(\ln(w) / w)\).
- (In hindsight: this is a variant of picking the smallest substring, as long as it is sufficiently small.)
- Questions:
- We can go from a scheme \(f\) to a UHS. Can we also go back?
- Does a perfect selection scheme exist?
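The Definition 2 membership test is nearly a one-liner; a sketch, with 'A' standing in for the \(0\) character (names mine):

```python
def in_uhs(kmer: str, d: int, zero: str = 'A') -> bool:
    """Definition 2: words starting with d 'zeros', plus all words
    containing no run of d zeros anywhere."""
    run = zero * d
    return kmer.startswith(run) or run not in kmer
```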
Zheng, Kingsford, and Marçais (2020)
- For \(w\to\infty\), minimizer schemes can be optimal (have density \(O(1/w)\)) if and only if \(k \geq \log_\sigma(w) - O(1)\). In fact, the lexicographic minimizer is optimal.
- When \(k\geq (3+\varepsilon)\log_\sigma(w)\), the random minimizer has expected density \(2/(w+1)+o(1/w)\), fixing the proof by (Schleimer, Wilkerson, and Aiken 2003).
- When \(\varepsilon>0\) and \(k>(3+\varepsilon)\log_\sigma w\), the probability of duplicate k-mers in a window is \(o(1/w)\).
- TODO: Hypothesis: could the \(3\) also be a \(2\), or even a \(1\)?
- Turn the charged contexts of a minimizer scheme into a \((w+k)\)-UHS.
- Relative size of UHS is upper bound on density of compatible minimizer.
- Order k-mers by their frequency in the dataset.
2.1 Questions
Main question: What is the lowest possible density for given \((k, w)\)?
The first questions:
- What is a scheme?
This question is then approached from two sides:
- Lower bounds on density for \((k,w,\sigma)\)?
- Tight lower bounds for some parameters?
- Tight lower bounds, asymptotic in parameters (e.g., \(\sigma\to\infty\))?
- Can we make tight lower bounds for all practical parameters?
- If not, can we understand why the best schemes found (using ILP) do not reach known bounds?
And:
- What is the empirical density of existing schemes?
- Can we model existing schemes and compute their density exactly?
- Can we make near-optimal schemes (say, within \(1\%\) from optimal) for practical parameters?
- Can we make exactly optimal schemes, for asymptotic parameters?
- Can we make optimal schemes for practical parameters?
- Can we make ‘pure’ optimal schemes, that do not require exponential memory?
- If we can not make pure optimal schemes, can we bruteforce search for them instead?
2.2 Types of schemes
scope:
- global (frac-sampling, mod-sampling (Chikhi et al. 2014; Chikhi, Limasset, and Medvedev 2016) (TODO, TODO), minhash, sampling every \(n\)-th k-mer)
- local
- forward
- minimizer
type:
- sampling scheme: sample k-mer
- selection scheme: sample position (\(k=1\))
2.3 Parameter regimes
- small \(k\): \(k < \log_\sigma(w)\)
- large \(k\): \(k\gg w\) or \(k\to \infty\).
- ‘practical’: \(4\leq k \leq 2w\) with \(w\leq 20\) or so; depends on the application.
- binary/DNA alphabet \(\sigma\in\{2,4\}\).
- large/infinite alphabet, \(\sigma=256\) or \(\sigma\to\infty\).
2.4 Different perspectives
- charged contexts of length \(w+1\).
- pure cycles of length \(w+k\).
- long random strings.
2.5 UHS vs minimizer scheme
- UHS is a minimizer scheme where everything has hash/order \(0\) or \(1\).
2.6 (Asymptotic) bounds
2.7 Lower bounds
3 Minimizer schemes
3.1 Orders
3.2 UHS-based and search-based schemes
- Orenstein et al. (2016, 2017)
- Introduces UHS
- DOCKS finds a UHS
- Finding an optimal UHS is hard when a set of strings to be hit is given. (But here we have a DBg, which may be easier.)
- The size of a UHS may be much smaller than the set of all possible minimizers.
- DOCKS UHS density is close to optimal (?)
- Step 1: Start with the Mykkeltveit embedding
- Step 2: repeatedly find a vertex with maximal 'hitting number' of \(\ell\)-long paths going through it, add it to the UHS, and remove it from the graph.
- DOCKSany: compute number of paths of any length, instead of length \(\ell\).
- DOCKSanyX: remove the top \(X\) vertices at a time.
- Applies ’naive extension’ to work for larger \(k\).
- Runs for (many) hours to compute UHS for \(k=11\) already.
- An ILP to improve UHSes found by DOCKS; improves by only a few percent at best.
- DOCKS selects far fewer distinct k-mers than random minimizers, and has slightly lower density.
- Does not use a compatible minimizer order.
- DeBlasio et al. (2019)
- Extends UHS generated by DOCKS
- larger \(k\) up to \(200\), but \(L\leq 21\).
- Merges UHS with random minimizer tiebreaking.
- Mentions sparsity
- Starts with a UHS for small \(k\) and grows it one-by-one to larger \(k\). The full process is called `reMuval`.
- First, naive extension.
- Second, an ILP to reduce the size of the new UHS and increase the number of singletons: windows containing exactly one kmer. (Since density directly correlates with sparsity.)
- Naive extension can decrease density
- Remove k-mers from the UHS that always co-occur with another k-mer in every window.
- The ILP decides whether each k-mer is retained in the UHS or not, such that every window preserves at least one element of the UHS.
- Also does sequence-specific minimizers
- Ekim, Berger, and Orenstein (2020)
- Improves DOCKS using randomized parallel algorithm for set-cover.
- Faster computation of hitting numbers.
- Scales to \(k\leq 16\).
- Hoang, Zheng, and Kingsford (2022)
- Learns a total order, instead of a UHS.
- Continuous objective, rather than discrete.
- UHSes are 'underspecified' since the order within each component is not given. Determining the permutation directly is more powerful.
- Around \(5\%\) better than PASHA.
- Golan et al. (2024)
- Unlike UHS-based methods that optimize UHS size, this directly optimizes minimizer density by minimizing the number of charged contexts:
- Repeatedly pick as the next-smallest k-mer in the order the one occurring in the smallest fraction of charged contexts.
- Then add some noise (slightly suboptimal choices), and local search with random restarts on top.
- Builds schemes for alphabet size \(\sigma'=2\) and \(k'\leq 20\), which are extended to larger alphabets and, when \(k>20\), to larger \(k\).
- Achieves very low density; open question how close to optimal it is.
- Not 'pure': requires memory to store the order of k-mers.
- Zheng, Kingsford, and Marçais (2021b)
- A polar set intersects each \(w\)-mer at most once.
- Two k-mers in a polar set are at least \((w+1)/2\) apart.
- Lemma 4: formula for the probability that a window is charged, in terms of the number of unique k-mers.
- Progressively add 'layers' to the polar set to fill gaps.
- Heuristic: greedily try to pick k-mers that are exactly \(w\) apart, by choosing a random offset \(o\in [w]\), and adding all those k-mers as long as they are not too close to already chosen k-mers.
- Up to 7 rounds in practice.
- Filter too frequent kmers.
- Significantly improved density over other methods.
- Requires explicitly storing an order.
3.3 Pure schemes
- Zheng, Kingsford, and Marçais (2020)
- Considers all closed syncmers in a window. Picks the smallest one.
- Parameter \(k_0\) (we call it \(s\)): the length of the hashed ‘inner’ slices.
- For \(k > w + O(\log_\sigma(w))\), has density below \(1.67/w + o(1/w)\).
- This requires a long proof.
- First scheme with guaranteed density \(<2/(w+1)\) when \(k\approx w\) (instead of \(k\gg w\)).
- Does not require expensive heuristics for precomputation; no internal storage.
- Charged contexts of a \((w_0, k_0)\) minimizer are the UHS of the \((w, k=w_0+k_0)\) minimizer, as long as \(w\geq w_0\).
- Pellow et al. (2023)
- MDS: a set of k-mers that hits every cycle in the DBg.
- Mykkeltveit embedding: map each k-mer to a complex number. Take those k-mers with argument (angle) between \(0\) and \(2\pi/k\) as context-free hitting set.
- Take a compatible minimizer.
- Even better: prefer argument in \([0, 2\pi/k)\), and otherwise prefer argument \([\pi, \pi+2\pi/k)\).
- Great density for \(k\) just below \(w\).
- MDS orders outperform DOCKS and PASHA.
- Scales to larger \(k\)
- Groot Koerkamp and Pibiri (2024)
- For \(k > w\), look at \(t\)-mers, \(t=k\bmod w\), instead. If the smallest \(t\)-mer is at position \(x\), sample the \(k\)-mer at position \(x\bmod w\) (sketch below).
- Asymptotic optimal density as \(w\to\infty\).
- Close to optimal for large alphabet when \(k\equiv 1\pmod w\).
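A sketch of the sampling rule as stated above (names mine; assumes \(k\bmod w\geq 1\) and uses Python's builtin `hash` as a stand-in for the random order on \(t\)-mers):

```python
def mod_sample(window: str, k: int, w: int, h=hash) -> int:
    """One window of w consecutive k-mers (length w + k - 1). Find the
    leftmost smallest t-mer, t = k mod w, at position x; sample the
    k-mer at position x mod w."""
    t = k % w
    assert t >= 1, "needs k mod w >= 1"
    x = min(range(len(window) - t + 1), key=lambda i: (h(window[i:i+t]), i))
    return x % w
```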
- Groot Koerkamp, Liu, and Pibiri (2024)
- Extend miniception to open syncmers, and open followed by closed syncmers.
- Extend modmini to wrap any other sampling scheme.
- Simple and very efficient scheme, for any \(k\).
- Greedymini has lower density, but is more complex.
3.4 Other variants
- Kille et al. (2023)
- Sample the smallest \(s\) k-mers out of every \(s\cdot w\) consecutive k-mers.
- Irber et al. (2022)
- Sample all k-mers with hash below \(f\cdot H\), for \(H\) the maximum hash value (sketch below).
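This is a context-free, per-k-mer predicate; a one-line sketch (names mine; a 64-bit hash range is assumed):

```python
MAX_HASH = 2**64

def keep(kmer: str, f: float, h=hash) -> bool:
    # Keep a k-mer iff its hash falls in the bottom fraction f of the range.
    return (h(kmer) % MAX_HASH) < f * MAX_HASH
```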
- (Chikhi et al. 2014)
- Frequency aware minimizers TODO
- Alanko, Biagi, and Puglisi (2024)
- frequency-bounded minimizers, with frequency below \(t\)
- Prefers rare k-mers as minimizers.
- variable-length scheme.
- Shortest unique finimizers.
- Uses the SBWT to work around the 'non-local' property.
- Useful for SSHash-like indices.
- Defines DSPSS: Disjoint spectrum preserving string set.
- For each kmer, find the shortest contained substring that occurs at most \(t\) times in the DBg of the input.
- (TODO: I’m getting a bit lost on the technicalities with the SBWT.)
Selection schemes
These have \(k=1\).
- Loukides and Pissis (2021; Loukides, Pissis, and Sweering 2023)
- In each window, sample the position that starts the lexicographically smallest rotation.
- Avoid sampling the last \(r\approx \log_\sigma(w)\) positions, as they cause 'unstable' anchors (see the sketch below).
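A naive \(O(w^2)\) sketch of this selection rule; the papers give much faster algorithms (names mine):

```python
def bd_anchor(window: str, r: int = 0) -> int:
    """Position of the lexicographically smallest rotation of the window,
    excluding the last r start positions (the 'reduced' variant)."""
    n = len(window)
    doubled = window + window  # rotation starting at i is doubled[i:i+n]
    return min(range(n - r), key=lambda i: doubled[i:i+n])
```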
Canonical minimizers
- Pan and Reinert (2024)
- Choose the strandedness via higher CG-content.
- Wittler (2023)
- TODO
- Marçais, Elder, and Kingsford (2024)
- TODO
4 Open questions
- How much are local schemes better than forward schemes?
- How much are forward schemes better than minimizer schemes? Only for small \(k\)?
- How close to optimal is the greedy minimizer?
References
¹ I am not a fan of this: since the lower bound is \(1/w\), no scheme can actually achieve density factor \(1\). Calibrating the scale to the (somewhat arbitrary) random minimizer, instead of to the theoretical lower bound, does not really make sense to me.