This post is simply a list of brief comments on many papers related to minimizers, and forms the basis of /posts/minimizers/.
1 Overview
Figure 1: An overview of the papers this post discusses, showing authors and categories of each paper.
2 Introduction
- Lots of DNA data
- Most algorithms deal with k-mers.
- k-mers overlap, and hence considering all of them is redundant.
- Thus: sample a subset of the kmers.
- Must be ’locally consistent’ and deterministic to be useful.
- Enter random minimizers.
- Parameter $w$: guarantee that at least one k-mer is sampled out of every window of $w$ k-mers.
- Density $d$: (expected) overall fraction of sampled k-mers.
- Obviously, $d \geq 1/w$.
- For random minimizers, $d \approx 2/(w+1)$.
- Lower density => fewer k-mers, smaller indices, faster algorithms.
- Question: How small can the density be for given $k$ and $w$? (See the sketch below for these definitions in code.)
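To make these definitions concrete, here is a minimal Python sketch (my own illustration, not from any of the papers below) that samples random minimizers from a random string and measures their density:

```python
import random

def minimizer_positions(s: str, k: int, w: int) -> set[int]:
    """Sample one k-mer start position from every window of w consecutive
    k-mers, using a (pseudo-)random order on k-mers."""
    positions = set()
    for i in range(len(s) - (w + k - 1) + 1):  # window start positions
        window = range(i, i + w)               # start positions of its w k-mers
        positions.add(min(window, key=lambda p: hash(s[p:p + k])))
    return positions

random.seed(42)
k, w, n = 11, 10, 500_000
s = "".join(random.choice("ACGT") for _ in range(n))
density = len(minimizer_positions(s, k, w)) / (n - k + 1)
print(f"measured density {density:.4f} vs 2/(w+1) = {2 / (w + 1):.4f}")
```

When $k$ is large enough that duplicate k-mers in a window are rare, the measured density comes out close to $2/(w+1)$, here $\approx 0.18$.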
Previous reviews
- Zheng, Marçais, and Kingsford (2023)
- Ndiaye et al. (2024)
3 Theory of sampling schemes
Broder (1997)
- Take the $s$ k-mers with the smallest hashes (MinHash), then estimate Jaccard similarity based on this.
Schleimer, Wilkerson, and Aiken (2003)
- $k$: noise threshold; $t$: guarantee threshold.
- Winnowing: Definition 1: select the minimum hash in each window.
- Charged contexts to prove a $2/(w+1)$ density, assuming no duplicate hashes (and k-mers).
- Local algorithm: a function of the k-mer hashes, rather than of the window itself: $f(h_1, \dots, h_w) \in \{1, \dots, w\}$.
- Local algorithms have density at least $\frac{1.5 + \frac{1}{2w}}{w+1}$.
- Conjecture that $2/(w+1)$ is optimal.
- Robust winnowing: smarter tie-breaking: keep the same position as the previous window in case of a tie if possible, otherwise take the rightmost.
- 'Threshold': the window size is $w = t - k + 1$.
- Order via hash.
Roberts et al. (2004)
- Interior minimizers: strings with a substring of length $w + k - 1$ in common share a minimizer.
- Same heuristic argument for $2/(w+1)$ density, assuming distinct k-mers.
- $w \leq k$ guarantees no gaps (uncovered characters) between minimizers.
- End minimizers: minimizers of a prefix/suffix of the string of length less than $w + k - 1$.
- Lexicographic ordering is bad: it favors runs of the smallest character.
- 'Alternating' order: even positions have reversed order (see the sketch below).
- Increase chance of 'rare' k-mers being minimizers.
- Reverse-complement-stable minimizers: use an order that is invariant under reverse complementation.
- Some heuristic argument for how sensitivity scales; large $w$ may have bad sensitivity.
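A tiny sketch of the two orders (my own code; which positions use the reversed alphabet is a convention):

```python
def lex_key(kmer: str, alphabet: str = "ACGT") -> tuple:
    rank = {c: i for i, c in enumerate(alphabet)}
    return tuple(rank[c] for c in kmer)

def alternating_key(kmer: str, alphabet: str = "ACGT") -> tuple:
    # Reverse the character order at even positions, so that a run of the
    # smallest character ('AAAA') is no longer the globally smallest k-mer.
    rank = {c: i for i, c in enumerate(alphabet)}
    return tuple(len(alphabet) - 1 - rank[c] if i % 2 == 0 else rank[c]
                 for i, c in enumerate(kmer))

kmers = ["AAAA", "ACGT", "TATA", "CGCG", "TTTT"]
print(min(kmers, key=lex_key))          # AAAA: lexicographic favors A-runs
print(min(kmers, key=alternating_key))  # TATA: a mixed k-mer wins instead
```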
Marçais et al. (2017)
- Main goal is to disprove the conjecture that the $2/(w+1)$ density of random minimizers is a lower bound.
- States that Schleimer, Wilkerson, and Aiken (2003) defines a local scheme as only having access to the sequence within a window, but actually, it only has access to the hashes.
- UHS to obtain an ordering with lower density than lexicographic or random.
- DOCKS goes below $2/(w+1)$, so the conjecture doesn't hold.
- Random order has density slightly below $2/(w+1)$.
- Defines the density factor $d \cdot (w+1)$.[^1]
- UHS sparsity: the fraction of contexts containing exactly one k-mer from the UHS.
- The density of a minimizer scheme can be computed on a De Bruijn sequence of order $w + k$.
- The density of a local scheme can be less than $2/(w+1)$.
- Does not refute the $\frac{1.5 + \frac{1}{2w}}{w+1}$ lower bound.
Marçais, DeBlasio, and Kingsford (2018)
- Properly introduces forward schemes.
- Realizes that the $\frac{1.5 + \frac{1}{2w}}{w+1}$ lower bound is only for randomized local schemes.
- Studies asymptotic behaviour in $w$ and $k$.
- For large $k$, a minimizer scheme with density below $2/(w+1)$.
- For fixed $k$, a lower bound on minimizer schemes: as $w$ grows, their density stays bounded away from the optimal $1/w$.
  - Forward schemes can achieve density $O(1/w)$ instead.
- A new lower bound on forward schemes of $\frac{1.5 + \frac{1}{2w}}{w+1}$.
  - The proof looks at two consecutive windows and the fact that half the time, the sampled k-mers leave a gap of more than $w$ in between, requiring an additional sampled k-mer.
- Local schemes can be strictly better than forward schemes, found using an ILP.
- For local schemes, a De Bruijn sequence of order $2w + k - 2$ can be used to compute the density.
- UHS-minimizer compatibility.
- Naive extension for UHS: going from $k$ to $k + 1$ by ignoring the extra character.
- The construction of the scheme that is asymptotic in $k$ is complex, but comes down to roughly: for each offset $j$, sum the characters in positions $\equiv j \pmod{w}$, and take the k-mer at the position for which the sum is maximal. (In the paper it's slightly different, in that a context-free version is defined where a k-mer is 'good' if the sum of its characters is larger than the sums for the other equivalence classes, and then there is an argument that good k-mers are close to a UHS, and turning them into a real UHS only requires 'few' extra k-mers.) The density is decreasing in $k$.
Edgar (2021)
- Introduces open syncmers and closed syncmers: a k-mer is a closed syncmer if its smallest contained $s$-mer is at the first or last position, and an open syncmer if it is at one fixed offset (see the sketch below).
- Context free: each k-mer is independently selected or not.
- Conservation: probability that a sampled k-mer is preserved under mutations.
- Context-free sampled k-mers are better conserved.
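A minimal sketch of the two definitions (my own code; ties in the inner minimum are ignored):

```python
def smallest_smer_position(kmer: str, s: int, h=hash) -> int:
    """Position of the minimal s-mer inside a k-mer (first one on ties)."""
    return min(range(len(kmer) - s + 1), key=lambda i: h(kmer[i:i + s]))

def is_closed_syncmer(kmer: str, s: int) -> bool:
    # Closed: the smallest s-mer is at the first or last position.
    p = smallest_smer_position(kmer, s)
    return p == 0 or p == len(kmer) - s

def is_open_syncmer(kmer: str, s: int, offset: int = 0) -> bool:
    # Open: the smallest s-mer is at one fixed offset.
    return smallest_smer_position(kmer, s) == offset

# 'Context free': membership depends only on the k-mer itself, so a mutation
# outside a sampled k-mer never un-samples it.
print(is_closed_syncmer("ACGGTAGCAAT", s=5), is_open_syncmer("ACGGTAGCAAT", s=5))
```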
Shaw and Yu (2021)
- Formalizes conservation: the fraction of bases covered by sampled k-mers.
- k-mer selection method: samples any kind of subset of k-mers.
- Local selection method: looks at a window (a $(w+k-1)$-mer), and returns some subset of its k-mers.
- Word-based method: a 'context free' method where for each k-mer it is decided independently whether it is sampled or not.
Belbasi et al. (2022)
- The Jaccard similarity based on random minimizers is biased.
Golan and Shur (2025)
- The random minimizer has density just below $2/(w+1)$ when $k$ and $w$ are sufficiently large.
- A method to compute the exact density of the random minimizer.
- These fractions were observed before in (Marçais et al. 2017).
Kille et al. (2024)
- Lower bound of $\frac{\lceil (w+k)/w \rceil}{w+k}$ on the density of forward schemes.
- Tighter version by counting pure cycles of all lengths.
- Instead of $k$, the bound can also be applied with any $k' \geq k$; it is strongest for the smallest $k' \geq k$ with $k' \equiv 1 \pmod{w}$ (see the sketch below).
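The bound is easy to evaluate. A small sketch (my own code; the lift to larger $k'$ is my reading of the trick above):

```python
import math

def forward_scheme_bound(w: int, k: int) -> float:
    """Kille et al. lower bound ceil((w+k)/w) / (w+k) on forward densities."""
    return math.ceil((w + k) / w) / (w + k)

def lifted_bound(w: int, k: int) -> float:
    # A (w,k)-scheme also induces a (w,k')-scheme for any k' >= k (ignore the
    # extra characters), so the bound for such k' applies too; values with
    # k' = 1 (mod w) are the local maxima of the bound.
    kp = k + (1 - k) % w  # smallest k' >= k with k' = 1 (mod w)
    return max(forward_scheme_bound(w, k), forward_scheme_bound(w, kp))

w = 10
for k in [1, 4, 11, 15]:
    print(k, forward_scheme_bound(w, k), lifted_bound(w, k))
```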
Zheng, Kingsford, and Marçais (2021a)
- UHS-minimizer compatibility; remaining path length at most $w - 1$.
- Mentions the decycling set of Mykkeltveit (1972).
- Theorem 2: a forward sampling scheme with density $\frac{1+\epsilon}{w}$ (where $\epsilon$ is small/constant), and a corresponding UHS.
- Selection scheme: selects positions rather than k-mers, i.e., $k = 1$.
- Assumes $k$ is small, so sampling k-mers or positions is kinda equivalent anyway.
- Theorem 1: a local scheme implies a $(2w+k-2)$-UHS, a forward scheme implies a $(w+k)$-UHS.
- Theorem 3: gives an upper and lower bound on the remaining path length of the Mykkeltveit set.
- Local schemes: $w - 1$ characters of 'looking back' context, for a total context size of $2w + k - 2$.
  - The charged contexts are a UHS.
- Forward scheme construction:
  - Definition 2 / Lemma 2: the set of words that either start with a run $0^s$ or do not contain $0^s$ at all is a UHS, for a suitable run length $s$. This has a short longest remaining path length.
  - Then a long proof that the relative size is small.
  - (In hindsight: this is a variant of picking the smallest substring, as long as it is sufficiently small.)
- Questions:
  - We can go from a scheme to a UHS. Can we also go back?
  - Does a perfect selection scheme exist?
Zheng, Kingsford, and Marçais (2020)
- Shows for which parameters minimizer schemes can be optimal (have density $1/w$); in that case, the lexicographic minimizer is optimal.
- When $k > (3+\epsilon) \log_\sigma(w+1)$, the random minimizer has expected density $2/(w+1) + o(1/w)$, fixing the proof by (Schleimer, Wilkerson, and Aiken 2003).
- In that regime, the probability of duplicate k-mers in a window is $o(1/w)$.
  - TODO: Hypothesis: the $o(1/w)$ could also be an $O(1/w^2)$, or actually even smaller?
- Turns the charged contexts of a minimizer scheme into a $(w+k)$-UHS. (skipped)
- The relative size of a UHS is an upper bound on the density of a compatible minimizer.
(Chikhi et al. 2014)
- Order k-mers by their frequency in the dataset.
3.1 Questions
Main question: What is the lowest possible density for given $w$, $k$, and $\sigma$?
The first questions:
- What is a scheme?
- Type:
  - sampling scheme: sample a k-mer
  - selection scheme: sample a position ($k = 1$)
This question is then approached from two sides:
- Lower bounds on density for given $w$, $k$, $\sigma$?
  - Tight lower bounds for some parameters?
  - Tight lower bounds, asymptotic in parameters (e.g., $k \to \infty$)?
  - Can we make tight lower bounds for all practical parameters?
  - If not, can we understand why the best schemes found (using ILP) do not reach known bounds?
And:
- What is the empirical density of existing schemes?
- Can we model existing schemes and compute their density exactly?
- Can we make near-optimal schemes (say, within $\epsilon$ of optimal) for practical parameters?
- Can we make exactly optimal schemes, for asymptotic parameters?
- Can we make optimal schemes for practical parameters?
- Can we make 'pure' optimal schemes, that do not require exponential memory?
- If we cannot make pure optimal schemes, can we brute-force search for them instead?
3.2 Types of schemes
scope:
- global (frac-sampling; mod-sampling, i.e. sampling every $w$-th k-mer)
- local
  - forward
    - minimizer
3.3 Parameter regimes
- small $k$: $k \leq \log_\sigma(w)$.
- large $k$: $k \approx w$ or $k \to \infty$.
- 'practical': $k \leq 32$ or so, with $w$ comparable; depends on the application.
- binary/DNA alphabet: $\sigma = 2$ or $\sigma = 4$.
- large/infinite alphabet: $\sigma \gg 2$ or $\sigma \to \infty$.
3.4 Different perspectives
- charged contexts of length $w + k$.
- pure cycles of length $w + k$.
- long random strings.
3.5 UHS vs minimizer scheme
- A UHS induces a minimizer scheme where every k-mer in the set has hash/order $0$, and every other k-mer has order $1$ (ties broken by a secondary hash; see the sketch below).
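Concretely (my own sketch), a UHS `uhs` induces a compatible minimizer order like this:

```python
def uhs_minimizer_key(kmer: str, uhs: set[str]):
    # k-mers in the UHS order strictly before all other k-mers
    # ("order 0 vs order 1"); a hash breaks ties within each class.
    return (kmer not in uhs, hash(kmer))

# The minimizer of a window is then:
# min(window_kmers, key=lambda km: uhs_minimizer_key(km, uhs))
```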
3.6 (Asymptotic) bounds
3.7 Lower bounds
4 Minimizer schemes
4.1 Orders
4.2 UHS-based and search-based schemes
- Orenstein et al. (2016, 2017)
- Introduces UHS
- DOCKS finds a UHS
- Finding optimal UHS is hard when a set of strings to be hit is given. (But here we have a DBg, which may be easier.)
- The size of a UHS may be much smaller than the set of all possible minimizers.
- DOCKS UHS density is close to optimal (?)
- Step 1: Start with the Mykkeltveit decycling set (computed via its embedding).
- Step 2: repeatedly find a vertex with maximal 'hitting number' of remaining length-$L$ paths going through it, and add it to the UHS (and remove it from the graph).
- DOCKSany: compute the number of paths of any length, instead of length $L$.
- DOCKSanyX: remove the top $X$ vertices at a time.
- Applies 'naive extension' to work for larger $k$.
- Runs for (many) hours to compute a UHS for $k = 13$ already.
- DOCKS selects far fewer distinct kmers compared to random minimizers, and has slightly lower density.
- Does not use a compatible minimizer order.
- DeBlasio et al. (2019)
- Extends UHS generated by DOCKS
- Larger $k$ than DOCKS can handle, but small $w$.
- Merges UHS with random minimizer tiebreaking.
- Mentions sparsity.
- Starts with a UHS for small $k$ and grows it one-by-one to larger $k$. The full process is called reMuval.
- First, naive extension.
- Second, an ILP to reduce the size of the new UHS and increase the number of singletons: windows containing exactly one kmer. (Since density directly correlates with sparsity.)
- Naive extension can decrease density
- Remove kmers from the UHS that always co-occur with another k-mer in every window.
- ILP is on whether each kmer is retained in the UHS or not, such that every window preserves at least one element of the UHS.
- Also does sequence-specific minimizers
- Ekim, Berger, and Orenstein (2020)
- Improves DOCKS using randomized parallel algorithm for set-cover.
- Faster computation of hitting numbers.
- Scales to $k = 16$.
- Hoang, Zheng, and Kingsford (2022)
- Learns a total order, instead of a UHS.
- Continuous objective, rather than discrete.
- UHSes are 'underspecified', since the order within each component is not given. Determining the permutation directly is more powerful.
- Around a few percent better than PASHA.
- Golan et al. (2024)
- Unlike UHS-based methods that optimize UHS size, this directly optimizes minimizer density by minimizing the number of charged contexts:
  - Repeatedly pick as the next k-mer in the order the one that occurs in the smallest fraction of charged contexts.
  - Then add some noise (slightly suboptimal choices), and local search with random restarts on top.
- Builds schemes for alphabet size $\sigma = 2$ and small $w$ and $k$, which are extended to $\sigma = 4$ and to larger $k$ as needed.
- Achieves very low density. Open question how close to optimal.
- Not 'pure': requires the memory to store the order of k-mers.
- Zheng, Kingsford, and Marçais (2021b)
- A polar set intersects each window at most once.
- Two k-mers in a polar set are at least $w$ apart.
- Lemma 4: formula for the probability that a window is charged, in terms of the number of unique k-mers.
- Progressively add 'layers' to the polar set to fill gaps.
- Heuristic: greedily try to pick k-mers that are exactly $w$ apart, by choosing a random offset $o$, and adding all those k-mers as long as they aren't too close to already chosen k-mers.
  - Up to 7 rounds in practice.
- Filter out too-frequent k-mers.
- Significantly improved density over other methods.
- Requires explicitly storing an order.
4.3 Pure schemes
- Zheng, Kingsford, and Marçais (2020)
- Considers all closed syncmers in a window; picks the smallest one (see the sketch below).
- Parameter $k_0$ (we call it $s$): the length of the hashed 'inner' slices.
- For suitable $k_0$, has density below $1.67/(w+1)$.
  - This requires a long proof.
- First scheme with guaranteed density below $2/(w+1)$ when $k \approx w$ (instead of asymptotically large $k$).
- Does not require expensive heuristics for precomputation; no internal storage.
- The charged contexts of a minimizer are the UHS of the minimizer, as long as there are no duplicate k-mers.
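A sketch of the selection step as I understand it (my own code, ignoring ties; when $k - k_0 \leq w$, every window contains at least one closed syncmer, so `candidates` is never empty):

```python
def miniception_position(window_kmers: list[str], k0: int, h=hash) -> int:
    """Among the k-mers of the window whose smallest k0-mer is at their
    first or last position (closed syncmers), pick the smallest by hash."""
    def is_closed_syncmer(kmer: str) -> bool:
        p = min(range(len(kmer) - k0 + 1), key=lambda i: h(kmer[i:i + k0]))
        return p == 0 or p == len(kmer) - k0
    candidates = [i for i, km in enumerate(window_kmers) if is_closed_syncmer(km)]
    return min(candidates, key=lambda i: h(window_kmers[i]))
```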
- Pellow et al. (2023)
- MDS: a minimum decycling set, a set of k-mers that hits every cycle in the DBg.
- Mykkeltveit embedding: map each k-mer to a complex number (see the sketch below). Take the k-mers with argument (angle) between $0$ and $2\pi/k$ as a context-free hitting set.
- Take a compatible minimizer.
- Even better: prefer k-mers with argument in that sector, and otherwise prefer those with argument in the opposite sector (shifted by $\pi$).
- Great density for $w$ just below $k$.
- MDS orders outperform DOCKS and PASHA.
- Scales to larger $k$.
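A sketch of the embedding and the sector test (my own code; ties, and k-mers whose embedding is exactly zero, need separate handling):

```python
import cmath
import math

def mykkeltveit_embedding(kmer: str, alphabet: str = "ACGT") -> complex:
    """Map a k-mer to the complex plane: sum of digit(c) * omega^position."""
    omega = cmath.exp(2j * math.pi / len(kmer))
    return sum(alphabet.index(c) * omega**j for j, c in enumerate(kmer))

def in_decycling_set(kmer: str) -> bool:
    # Rotating the k-mer rotates its embedding by 2*pi/k, so every rotation
    # class with a nonzero embedding has exactly one member whose argument
    # falls in a fixed half-open sector of width 2*pi/k.
    z = mykkeltveit_embedding(kmer)
    return abs(z) > 1e-9 and 0 < cmath.phase(z) <= 2 * math.pi / len(kmer)
```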
- Groot Koerkamp and Pibiri (2024)
- For $k > w$, look at $t$-mers instead, for a small $t$ with $t \equiv k \pmod{w}$. If the smallest $t$-mer is at position $p$, sample the k-mer at position $p \bmod w$ (see the sketch below).
- Asymptotically optimal density as $k \to \infty$.
- Close to optimal for a large alphabet when $k \equiv 1 \pmod{w}$.
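A sketch of one window of this 'mod-sampling' step (my own code and naming; the paper additionally constrains $t$, e.g. to be large enough that duplicate $t$-mers are unlikely):

```python
def mod_sampling_position(window: str, k: int, w: int, t: int, h=hash) -> int:
    """Position of the sampled k-mer in a window of w + k - 1 characters:
    find the smallest t-mer (t <= k) and reduce its position mod w."""
    assert len(window) == w + k - 1 and t <= k
    p = min(range(len(window) - t + 1), key=lambda i: h(window[i:i + t]))
    return p % w  # a valid k-mer start position in [0, w)

# Example with k = 21, w = 10, and t = k mod w = 1:
window = "ACGT" * 7 + "AC"  # w + k - 1 = 30 characters
print(mod_sampling_position(window, k=21, w=10, t=1))
```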
- Groot Koerkamp, Liu, and Pibiri (2025)
- Extends miniception to open syncmers, and to open followed by closed syncmers.
- Extends modmini to wrap any other sampling scheme.
- Simple and very efficient scheme, for any $k$.
- GreedyMini has lower density, but is more complex.
4.4 Other variants
- Kille et al. (2023)
- Sample the smallest $s$ k-mers from each window of $w$ consecutive k-mers.
- Irber et al. (2022)
- Sample all k-mers with hash below $H/s$, for maximum hash value $H$ and 'scaled' factor $s$ (see the sketch below).
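A sketch of the idea (my own code; sourmash's FracMinHash uses MurmurHash, sha1 here just keeps the example self-contained):

```python
import hashlib

def frac_minhash(seq: str, k: int, scaled: int) -> set[int]:
    """Keep every k-mer whose 64-bit hash falls below H / scaled,
    i.e. an expected fraction 1/scaled of all distinct k-mers."""
    H = 2**64
    keep = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        h = int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        if h < H // scaled:
            keep.add(h)
    return keep
```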
- (Chikhi et al. 2014)
- Frequency aware minimizers TODO
- Alanko, Biagi, and Puglisi (2024)
- Frequency-bounded minimizers, with frequency below $t$.
- Prefers rare k-mers as minimizers.
- A variable-length scheme.
- Shortest unique finimizers.
- Uses the SBWT to work around the 'non-local' property.
- Useful for SSHash-like indices.
- Defines DSPSS: disjoint spectrum preserving string set.
- For each k-mer, find the shortest contained substring that occurs at most $t$ times in the DBg of the input.
- (TODO: I'm getting a bit lost in the technicalities of the SBWT.)
Selection schemes
These have $k = 1$.
- Loukides and Pissis (2021; Loukides, Pissis, and Sweering 2023)
- In each window, sample the position that starts the lexicographically smallest rotation.
- Avoid sampling the last $r$ positions, as they cause 'unstable' anchors (see the sketch below).
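A naive sketch of the anchor choice (my own code; Booth's algorithm computes the smallest rotation in linear time instead of the quadratic scan here):

```python
def bd_anchor_position(window: str, r: int = 0) -> int:
    """Position starting the lexicographically smallest rotation of the
    window, restricted away from the last r ('unstable') positions."""
    rotation = lambda i: window[i:] + window[:i]
    return min(range(len(window) - r), key=lambda i: (rotation(i), i))

print(bd_anchor_position("CABAB"))  # 1: rotation "ABABC" is smallest
```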
Canonical minimizers
- Pan and Reinert (2024)
- Choose the strandedness via higher CG-content.
- Wittler (2023)
- TODO
- Marçais, Elder, and Kingsford (2024)
- TODO
4.5 Non-overlapping string sets
- Levenshtein (1970)
- Shows a bound on the maximum number of pairwise non-overlapping words of a given length.
- Blackburn (2015)
- Divide the alphabet into two parts. Then patterns like `abbbb` and e.g. `aab?b?b?b` are non-overlapping (`b`: any non-`a` character).
- For DNA, an optimal solution (max number of pairwise non-overlapping words) for $k = 2$ is `[AG][CT]`, while for larger $k$, an optimal solution is given by `A[CTG]+`.
- Re-proves the upper bound on the number of non-overlapping words.
- Re-proves the upper bound of Levenshtein above.
- Shows an existing scheme and its size.
- New scheme: not `a` and non-`a`, but an arbitrary partition of the alphabet, where the prefix is in some set $S$ and the suffix is $S$-free.
  - When the block length divides $k$, choose suitable blocks `I` and `J`, and consider strings `IIIIIIJ`. These are optimal.
  - The set `J` is needed to avoid rounding errors when the alphabet is small.
  - Conjecture: a suffix of `JJ` or longer is never optimal.
- Frith, Noé, and Kucherov (2020)
- minimally overlapping words are anti-clustered, hence good for sensitivity.
- `cg`-order: alternate small and large characters, as in (Roberts et al. 2004).
- `abb`-order: compare the first character normally, the rest by `t=g=c<a`.
- Stanovnik, Moškon, and Mraz (2024)
- An ILP to solve the problem for more $(\sigma, k)$ pairs.
- Frith, Shaw, and Spouge (2023)
- Test various word-sets for their sparsity and specificity.
- Champarnaud, Hansel, and Perrin (2004)
[^1]: I am not a fan of this: since the lower bound is $1/w$, i.e. density factor $(w+1)/w > 1$, no scheme can actually achieve density factor $1$. Calibrating the scale to the (somewhat arbitrary) random minimizer, instead of to the theoretical lower bound, does not really make sense to me.