\[ \newcommand{\d}{\mathrm{d}} \newcommand{\L}{\mathcal{L}} \]
This post introduces some background for minimizers and some experiments for a new minimizer variant. That new variant is now called the modminimizer and published at WABI24 (Groot Koerkamp and Pibiri 2024). This also includes a review of existing methods, including pseudocode for most of the methods covered below.
This posts discusses some variants of minimizers, inspired by the previous post on bidirectional anchors.
Minimizer schemes take an \(\ell = k+w1\) mer, and select a substring of length \(k\) in a deterministic way, in such a way that consecutive $ℓ$mers are likely to share their chosen $k$mer.
The rest of this post:
 first lists some applications,
 then introduces some existing minimizer schemes,
 then introduces some new ideas,
 and lastly evaluates all schemes on random data.
 Finally, there are a bunch of experiments logs for small \(k\).
Applications
Minimizers can be used for various purposes, such as:
 Compressing sequences, such as in minimizerspace De Bruijn graphs (Ekim, Berger, and Chikhi 2021).
 Similarity estimation,
 Locality preserving hashing of kmers for different types of clustering.
 (Pibiri 2022; Pibiri, Shibuya, and Limasset 2023).
 Here a large \(k \geq \log_\sigma N\) is used so that minimizers are unique keys.
 The objective is to have as few minimizers as possible in any sequence, to reduce the number of cachemisses due to nonlocality.
 Similarity estimation (Li 2016).
 Here the goal is to cluster similar reads. Ideally reads with a few small mutations have the same minimizer.
 (Pibiri 2022; Pibiri, Shibuya, and Limasset 2023).
Locality preserving hashing and similarity estimation both cluster kmers on similarity, but there are some differences:
 In LPH, we want consecutive kmers to share a minimizer. I.e. we want to partition the De Bruijn graph into long paths.
 In similarity estimation, we want similar kmers to share a minimizer, where similar explicitly includes small mutations. I.e. we want to partition the De Bruijn graph into ‘blobs’ covering many variants.
Background
Related but out of scope topics are:
 universal minimizers: not really a local sampling scheme;
 spaced minimizers/strobemers: used for similarity search methods, but not used for locality sensitive hashing.
Minimizers
Minimizers were introduced independently by Roberts et al. (2004) and Schleimer, Wilkerson, and Aiken (2003): Given a $(k+w1)$mer, consider the \(w\) contained $k$mers. The (rightmost) $k$mer with minimal hash (for some given hash function \(h\)) is the minimizer.
An example:


Density bounds
By definition, minimizers sample at least one in every \(w\) positions, so trivially have a density of at least \(1/w\). For locality preserving hashing applications, lower density is better since this means larger clusters and fewer cache misses.
 Random minimizers have a density of \(2/(w+1) + o(1/w)\) when \(k > (3+\epsilon) \log_\sigma (w+1)\) (Zheng, Kingsford, and Marçais 2020b; Marçais et al. 2017).
 Schleimer, Wilkerson, and Aiken (2003) prove that on random strings, any minimizer scheme has a density at least \[\frac{1.5+\frac{1}{2w}}{w+1}\geq \frac{1.5}{w+1},\] but this lower bound is false, or at least, makes overly strong assumptions.
 Marçais, DeBlasio, and Kingsford (2018) contradict that theorem by
constructing a universal minimizer with lower density, and remark that the
previous result only applies to ‘randomized local schemes’.
 Instead they prove that density is at least \[\frac{1.5 + \frac{1}{2w} + \max(0, \lfloor \frac{kw}{w}\rfloor)}{w+k}.\]
 In the modminimizer paper, we simplify and improve this to: \[\frac{1.5}{w+k0.5},\] and show that it holds not only for forward schemes but also for local schemes.
Robust minimizers
To reduce the density, Schleimer, Wilkerson, and Aiken (2003) suggest the following: when the minimizer of the preceding kmer is still a minimizer, reuse it, even when it is not rightmost.
Continuing the example:


When the same kmer occurs twice in an $ℓ$mer, only one of them will be selected in a way dependent on the context. For most applications, this nondeterminism is not a problem.
Still there is a drawback: When two distinct kmers have the same hash, only one of them is selected. Although unlikely, this is not good for downstream applications. To prevent this, minimizers \(x\) could be ordered by \((h(x), x)\) instead of just \(h(x)\).
PASHA
PASHA (Ekim, Berger, and Orenstein 2020) is another minimizer selection algorithm based on a universal hitting set. It works as follows:
 Start with a complete De Bruijn graph of order \(k\), i.e., containing all \(4^k\) kmers.
 Remove from this a minimal set of $k$mers \(U_1\) that make the graph acyclic.
 Then remove additional $k$mers to remove all paths of length \(\geq \ell\).
 This is done using the DOCKS heuristic (Orenstein et al. 2017), which greedily removes the vertex containing the most (length \(\ell\)) paths.
PASHAs main contribution is a considerable speedup over DOCKS. It still remains slow and has to process the full \(4^k\) graph, limiting it to \(k\leq 16\), but has the lower density.
Miniception
Miniception (Zheng, Kingsford, and Marçais 2020b) is another minimizer selection algorithm. It works using an additional parameter \(k_0\leq k\) around \(3\cdot \log_\sigma(k)\). It additionally requires \(k_0 \geq kw\), although I do not think this is explicitly mentioned in the paper.
For a window \(T\) of length \(\ell = k+w1\) characters, Miniception selects a minimizer as follows:
 Find all kmers whose minimal contained $k0$mer is at its start or end.
 In case there are multiple (or none), break ties using random order on kmers.
In the limit, it achieves density down to \(1.67/w\) for \(w\sim k\).
Sadly the preprint (Zheng, Kingsford, and Marçais 2020a) has a typo in Figure 6, making the results hard to interpret.
Closed syncmers
Given \(k\) and \(s\leq k\), a kmer is a closed syncmer (Edgar 2021) when its minimal contained $s$mer is at its start or end. This guarantees that in each window of \(w=ks\) kmers at least one kmer is chosen, so \(s\) should be set to \(kw\). So this only works for \(k\geq w\).
Note that closed syncmers are not directly a sampling scheme, since each kmer is independently determined to be a closed syncmer or not. This can be fixed by using an order on kmers to break ties, like miniception does.
Closed syncmers are very similar to miniception. In fact, miniception is more general since it’s parameter \(k0\) is chosen freely, rather than (implictly) restricting to \(s=kw\).
Quote:
Density is not the appropriate optimization metric
Several recent papers have focused on minimizing the density of minimizers for given k and w; see (Zheng, Kingsford & Marçais, 2020) and references therein. This would be an appropriate optimization strategy if submers were used to find identical longer substrings in different sequences, but this is rarely the primary goal of an application and other methods are better suited to this task (e.g., Burrows–Wheeler indexes).
Bdanchors
Bidirectional anchors (bdanchors) are a variant on minimizers that take the minimal lexicographic rotation instead of the minimal kmer substring (Loukides, Pissis, and Sweering 2023; Ayad, Loukides, and Pissis 2023). I wrote above them before in this post.
Reduced bdanchors restrict this rotation to not start in the last \(r=4\log_\sigma(\ell)\) positions.
Density: Reduced bdanchors have a density of \(2/(\ell+1r)\) for large alphabet, and somewhat larger for small \(\sigma\).
Bdanchors have a slightly different purpose than minimizers, in that they are keyed by their position in the text, rather than by the corresponding string itself. Thus, a suffix array is built on suffixes and reverseprefixes starting/ending there.
For random strings, reduced bdanchors are a dense subset of the \(k=r+1\) minimizers.
Given the bdanchors, two suffix arrays are built. One of suffixes starting at anchors, and one on reverse prefixes ending at anchors.
Note: bdanchors are not a socalled forward scheme. That is, it is possible for the window to shift right, but the selected position to jump backwards. Example here.
Optimization: When querying an $ℓ$mer, in practice only the longer of the prefix and suffix is actually looked up in the corresponding suffix array. Thus, we don’t need to two suffix arrays over all bdanchors:
 The forward SA over suffixes only needs to contains bdanchors occurring in the left half of some $ℓ$mer.
 The reverse SA over suffixes only needs to contains bdanchors occurring in the right half of some $ℓ$mer.
This makes things slightly sparser.
New: Modminimizers
Bidirectional anchors have a benefit over minimizers since they always use \(r=O(\log_\sigma (\ell))\) instead of possibly much larger \(k\). This means their average density \(2/(\ell+1r)\) can be lower than \(2/(w+1) = 2/(\ellk+2)\). Similarly, Miniception uses a separate \(k_0\) of order \(3 \log_\sigma(k)\) to achieve
Why do we use large \(k\), when small \(k=\Omega(\log \ell)\) is sufficient and preferable for lower density? The reason is that for locality preserving hashing we would like (nearly) unique keys of length \(\log_\sigma(N)\).
It seems that two conceptually distinct parameters are merged:
 The length \(k_0=r+1\) of the minimizer, which we would like to be small.
 The length \(k\) of the key we want to extract, which we would like to be larger.
Inspired by previous methods, here is a new sampling scheme, modsampling.
 First, choose a small parameter \(t = k\bmod w\), but large enough to prevent duplicate $k$mers.
 Find the position \(x\) of the smallest $t$mer in the $ℓ$mer window.
 Sample the kmer at position \(p=x \bmod w\).
We define two specific cases:
 The lrminimizer uses \(t = k  w\) for \(k>w\).
 The modminimizer uses \(t = (kr)\% w + r\) for \(k>r\), where \(r=4\) ensures that \(t\) is not too small.
Here is an example for \(k=7\), \(w=4\), \(t=7\%4=3\). Stars indicate the candidate $t$mer minimizers, and the dashes indicate the corresponding sampled $k$mers.


NOTE: As it turns out, lrminimizers are very similar to closed syncmers. In particular compare the figure above with figure 1b in (Edgar 2021). The main difference is that lrminimizers are context aware and break ties by the value of the chosen $t$mer, whereas closed syncmers are not ‘filtered down’ to have only one sample per window.
Here is an example with a $3$way split.


Modminimizers have low density when \(k\) is large compared to \(w\). When \(w\) is fixed and \(k\to\infty\), they approach the asymptotically optimal density of \(1/w\).
Experiments
Here are some quick results.
 Code is at https://github.com/RagnarGrootKoerkamp/minimizers.
 PASHA is excluded – even though it’s very good, it’s too much effort to download $k$mers to quickly benchmark it.
 For methods taking a parameter \(k_0\) or \(r\), I did a bruteforce search from \(0\) to \(10\) (as long as they are valid), or around \(kw\) in case that is larger than \(10\).
Note:
 bdanchors (not shown) depend only on \(\ell = w+k1\), and hence density decreases in \(k\).
 Miniception is always better than vanilla minimizers.
 Modminimizers don’t do anything for \(k\leq w\), but are best for \(k\geq w\).
 Can we optimize them more? By using more ideas from miniception?
 Can we optimize miniception by introducing a third layer of minimizers??
 Or what if we sort filtered kmers by their contained k0mer before comparing their own hash?
 For larger alphabet \(\sigma = 256\) (not shown), results are mostly the same but bdanchors have slightly lower density.
Conclusion
For \(k \geq w\), modminimizers achieve density that asymptotically approaches the lower bound of \(1/w\). So the large\(k\) case is ‘solved’. Both the scheme introduced in (Marçais, DeBlasio, and Kingsford 2018) and the new modminimizers achieve this \(1/w\) density in the limit, but modminimizers converge much faster.
 Modminimizers are also an instance of a minimizer scheme w.r.t. a specific order, namely: the hash of a kmer is the minimal hash over the tmers occurring in a position \(0 \mod w\).
 In the large\(k\) limit, the minimizer schem  forward scheme  local scheme hierarchy collapses: minimizers already achieve the lower bound that holds for local schemes.
Small k experiments
From here onward, this is a ’lablog’, primarily intended for preserving some of my notes/thoughts, not for easy reading.
This leaves the case of small \(k\), where the best schemes have density close to \(2/(w+1)\), but the lower bound is only around \(1/w\).
 For \(w=1\), it is clear that density \(2/(w+1)=1\) is the best we can do.
 TODO For \(k=1\), minimizer schemes are boring, but forward/local schemes TODO
 For alphabet size \(\sigma=1\), everything is trivial.
Thus, we start our search at parameters \(k=w=\sigma=2\). For each set of parameters, we bruteforce search three schemes:
 the best minimizer scheme,
 the best forward scheme,
 the best local scheme.
The question is:
 Are forward schemes better than minimizer schemes?
Answer: YES. But so far, only in the following way: where minimizer schemes always select the leftmost occurrence in case of ties, optimal forward schemes switch between leftmost and rightmost occurrences.
It’s open whether there are more interesting differences.
 Are local schemes better than forward schemes?
 Marçais, DeBlasio, and Kingsford (2018) mentions that using ILP they found an example for \(w=4\), \(k=2\) where a nonforward scheme is better than a forward scheme, but they do not give the example nor explain details on how it’s found. For \(\sigma=2\) I can not reproduce this, so probably \(\sigma=4\) was used.
Search methods
 Minimizer scheme bruteforce
 Iterate over all \(\sigma^k\) orders, evaluate density on a De Bruijn word of
order \(\sigma^(k+w)\).
 ILP
 We set up an Integer Linear Program.
 For each of \(\sigma^\ell\) lmers, we create \(w\) binary variables indicating which kmer in \([w]\) is chosen.
 We construct a DeBruijn word of order \(k+w=\ell+1\), and create a variable for each contained $k$mer.
 For each $l$mer in the text, we add an inequality that if a position in the lmer is selected, the corresponding position in the text must also be selected.
 For forward schemes, we add additional inequalities ensuring forwardness.
Note: for \(w=2\), every local scheme is also a forward scheme.
Directed minimizer
It appears all optimal local schemes found above have slightly lower density than corresponding minimizer schemes. But in fact the local schemes are very similar to minimizer schemes. They are all instances of ‘directional minimizers’, a small generalization of minimizers that explicitly handles ties:
Directed Minimizer. Given is an order \(O\) on $k$mers, and for each $k$mer a boolean indicating whether the leftmost or rightmost instance should be selected. Then the directional minimizer of an $l$mer is the $k$mer that is minimal according to \(O\), and in case of ties, the leftmost or rightmost is selected as required.
\(k=1\), \(w=2\)
Proven lower bound on local: \(\d(\L)\geq 1/3 + 1/(12s^2)\), much better than previous bound of \(1.5/(k+w0.5) = 1.5/2.5 = 0.6\), and correct for \(s=1\) and \(s\to\infty\).
Random mini for \(s\to\infty\): \(2/(w+1) = 2/3\), which is optimal.
Best possible density. Forward and local schemes are the same for \(w=2\).
alg \ s  \(2\)  \(3\)  \(4\)  \(5\) 

mini  \(12/16=0.75\)  \(57/81=0.7037\)  \(176/256=0.6875\)  \(425/625=0.68\) 
directed mini  same  same  same  same 
forward=local  same  same  same  same 
bound  same  less  less  less 
(I suspect I made some inefficiency in the bound proof and it should be identical everywhere.)
\(k=1\), \(w=4\)
alg \ s  \(2\) 

mini  
directed mini  
forward  \(28/64=0.4375\) 
local  same 
bound 
\(k=1\), \(w=5\)
alg \ s  \(2\) 

mini  
directed mini  
forward  \(46/128=0.359375\) 
local  \(364/1024=0.35546875\) 
bound 
\(k=2\), \(w=2\)
Best lower bound so far: \(1.5/(k+w0.5) = 1.5/3.5 = 0.4285\).
Hypothesis: best is \(3/5=0.6\).
Random mini for \(s\to\infty\): \(2/(w+1) = 2/3\), which is not optimal!
Again, forward and local are the same.
alg \ s  \(2\)  \(3\)  \(4\) 

mini  \(22/32=0.6875\)  \(156/243=0.6419\)   
directed mini  \(20/32=0.625\)     
forward=local  same  \(153/243=0.6296\)  \(636/1024=0.6210\) 
\(k=2\), \(w=4\)
 Local scheme beats forward here!
 But differences are only in tiebreaking between equal kmers.
alg \ s  \(2\)  \(3\) 

mini  \(50/128=0.3906\)  \(795/2187=0.3635\) 
directed mini  \(48/128=0.375\)   
forward  same   
local  \(190/512=0.3710\)   
Notes
 Hypothesis: For \(k\) large enough so that all kmers are distinct, minimizers, forward, and local schemes are equally good.
 Local can be strict better than forward.
 Forward can be strict better than directed mini (\(k=1\), \(w=4\)).
 Directed mini can be strict better than mini.
Reading list
 minimizerreview
 maskedminimizers
 smallwindowdecycling
 Marcais 2021
 In the O(sqrt(w)/w) and O(ln(w)/w) density methods, what fails to get O(1/w)? ie just choose d=log_sigma(w)?
 Rotates the Mykkeltveit embedding by 1 step. Rotates the other direction.
 Z_delta is so cool!