Minimizers and variations

This posts discusses some variants of minimizers, inspired by the previous post on bidirectional anchors.

The final goal is to introduce two new variations of minimizers and evaluate their density. It will turn out that ‘reduced minimizers’ have density very close to some lower bound for \(k \geq 1.25 w\).

Minimizer schemes take an \(\ell = k+w-1\) mer, and select a substring of length \(k\) in a deterministic way, in such a way that consecutive $ℓ$-mers are likely to share their chosen $k$-mer.

The rest of this post:

  1. first lists some applications,
  2. then introduces some existing minimizer schemes,
  3. then introduces some new ideas,
  4. and lastly evaluates all schemes on random data.

Applications

Minimizers can be used for various purposes, such as:

  • Compressing sequences, such as in minimizer-space De Bruijn graphs (Ekim, Berger, and Chikhi 2021).
  • Similarity estimation,
  • Locality preserving hashing of kmers for different types of clustering.
    1. (Pibiri 2022; Pibiri, Shibuya, and Limasset 2023).
      • Here a large \(k \geq \log_\sigma N\) is used so that minimizers are unique keys.
      • The objective is to have as few minimizers as possible in any sequence, to reduce the number of cache-misses due to non-locality.
    2. Similarity estimation (Li 2016).
      • Here the goal is to cluster similar reads. Ideally reads with a few small mutations have the same minimizer.

Locality preserving hashing and similarity estimation both cluster kmers on similarity, but there are some differences:

  • In LPH, we want consecutive kmers to share a minimizer. I.e. we want to partition the De Bruijn graph into long paths.
  • In similarity estimation, we want similar kmers to share a minimizer, where similar explicitly includes small mutations. I.e. we want to partition the De Bruijn graph into ‘blobs’ covering many variants.

Background

Related but out of scope topics are:

  • universal minimizers: not really a local sampling scheme;
  • spaced minimizers/syncmers/strobemers: used for similarity search methods, but not used for locality sensitive hashing.

Minimizers

Minimizers: Minimizers were introduced independently by Roberts et al. (2004) and Schleimer, Wilkerson, and Aiken (2003): Given a $(k+w-1)$-mer, consider the \(w\) contained $k$-mers. The (rightmost) $k$-mer with minimal hash (for some given hash function \(h\)) is the minimizer.

An example:

l=11, k=5, w=l-k+1=7
lmer:
 ***********
candidate minimizer kmers:
 *****
 |*****
 ||*****
 |||*****
 ||||*****
 |||||*****
 ||||||*****
 |||||||
 8136192   // hashes in [0,9]
     |     // rightmost minimal hash
     ***** // minimizer

Density: By definition, minimizers sample at least one in every \(w\) positions, so trivially have a density of at least \(1/w\). For locality preserving hashing applications, lower density is better since this means larger clusters and fewer cache misses.

  • Schleimer, Wilkerson, and Aiken (2003) prove that on random strings, any minimizer scheme has a density at least \[\frac{1.5+\frac{1}{2w}}{w+1}\geq \frac{1.5}{w+1}.\]

    • Marçais, DeBlasio, and Kingsford (2018) contradict that theorem by constructing a universal minimizer with lower density, and remark that the previous result only applies to ‘randomized local schemes’.
      • Note: I am quite unsure what is the current status of this \(1.5/(w+1)\) lower bound.
    • Instead they prove that density is at least \[\frac{1.5 + \frac{1}{2w} + \max(0, \lfloor \frac{k-w}{w}\rfloor)}{w+k}.\]
  • Random minimizers have a density of \(2/(w+1) + o(1/w)\) when \(k > (3+\epsilon) \log_\sigma (w+1)\) (Zheng, Kingsford, and Marçais 2020b; Marçais et al. 2017).

    This is only \(33\%\) above the lower bound!

    • Is \((3+\epsilon)\) tight?

Robust minimizers

To reduce the density, Schleimer, Wilkerson, and Aiken (2003) suggest the following: when the minimizer of the preceding k-mer is still a minimizer, reuse it, even when it is not rightmost.

Continuing the example:

l=11, k=5, w=7
 ************  // n=12 text
 *****  *****  // first & last minimizer
 81361921      // n-k+1 hashes of 5-mers
 -1--1--       // minimal hashes in first lmer
     *****     // minimizer is rightmost
  1--1--1      // minimal hashes in second lmer
     *****     // reuse minimizer, instead of starting at rightmost 1.

When the same kmer occurs twice in an $ℓ$-mer, only one of them will be selected in a way dependent on the context. For most applications, this non-determinism is not a problem.

Still there is a drawback: When two distinct kmers have the same hash, only one of them is selected. Although unlikely, this is not good for downstream applications. To prevent this, minimizers \(x\) could be ordered by \((h(x), x)\) instead of just \(h(x)\).

PASHA

PASHA (Ekim, Berger, and Orenstein 2020) is another minimizer selection algorithm based on a universal hitting set. It works as follows:

  1. Start with a complete De Bruijn graph of order \(k\), i.e., containing all \(4^k\) kmers.
  2. Remove from this a minimal set of $k$-mers \(U_1\) that make the graph acyclic.
  3. Then remove additional $k$-mers to remove all paths of length \(\geq \ell\).
    • This is done using the DOCKS heuristic (Orenstein et al. 2017), which greedily removes the vertex containing the most (length \(\ell\)) paths.

PASHAs main contribution is a considerable speedup over DOCKS. It still remains slow and has to process the full \(4^k\) graph, limiting it to \(k\leq 16\), but has the lower density.

Miniception

Miniception (Zheng, Kingsford, and Marçais 2020b) is a specific minimizer selection algorithm. It works using an additional parameter \(k_0\) around \(3\cdot \log_\sigma(k)\). It additionally requires \(k_0 \geq k-w\), although I do not think this is explicitly mentioned in the paper.

For a window \(T\) of length \(\ell = k+w-1\) characters, Miniception selects a minimizer as follows:

  1. Set \(w_0 = k-k_0\) and find all \((k_0, w_0)\) minimizers under some hash \(h_0\).

  2. Out of the \(w\) $k$-mers in \(T\), keep only those:

    • whose prefix $k_0$-mer is a \((k_0, w_0)\) minimizer of \(T\), or
    • whose suffix $k_0$-mer is a \((k_0, w_0)\) minimizer of \(T\).

    This is equivalent to saying that the minimal $k_0$-mer in the $k$-mer is its prefix or suffix.

  3. From the filtered $k$-mers, select the one with minimal hash \(h\).

In the limit, it achieves density down to \(1.67/w\) for \(w\sim k\).

Sadly the preprint (Zheng, Kingsford, and Marçais 2020a) has a typo in Figure 6, making the results hard to interpret.

Bd-anchors

Bidirectional anchors (bd-anchors) are a variant on minimizers that take the minimal lexicographic rotation instead of the minimal k-mer substring (Loukides, Pissis, and Sweering 2023; Ayad, Loukides, and Pissis 2023). I wrote above them before in this post.

Reduced bd-anchors restrict this rotation to not start in the last \(r=4\log_\sigma(\ell)\) positions.

Density: Reduced bd-anchors have a density of \(2/(\ell+1-r)\) for large alphabet, and somewhat larger for small \(\sigma\).

Bd-anchors have a slightly different purpose than minimizers, in that they are keyed by their position in the text, rather than by the corresponding string itself. Thus, a suffix array is built on suffixes and reverse-prefixes starting/ending there.

For random strings, reduced bd-anchors are a dense subset of the \(k=r+1\) minimizers.

Given the bd-anchors, two suffix arrays are built. One of suffixes starting at anchors, and one on reverse prefixes ending at anchors.

Note: bd-anchors are not a so-called forward scheme. That is, it is possible for the window to shift right, but the selected position to jump backwards. Example here.

Optimization: When querying an $ℓ$-mer, in practice only the longer of the prefix and suffix is actually looked up in the corresponding suffix array. Thus, we don’t need to two suffix arrays over all bd-anchors:

  • The forward SA over suffixes only needs to contains bd-anchors occurring in the left half of some $ℓ$-mer.
  • The reverse SA over suffixes only needs to contains bd-anchors occurring in the right half of some $ℓ$-mer.

This makes things slightly sparser.

New ideas

Biminimizers

Here is an idea I had and that was also tried by Giulio for SSHash (Pibiri 2022). Surely there is some literature on this but I’m at a loss to find it. Please let me know.

In short: use robust minimizers, but always use at least two candidate positions. For this, we can use two hash functions and take the minimizer for \(h_1\) and \(h_2\). Or we can consider the bottom two minimizers with lowest score for \(h\). (The latter performs slightly better in evals so is what I’ll go with.)

This also generalizes to t-minimizers, where we robustly choose the rightmost of \(t\) candidates generated either by \(t\) hash functions or the bottom-\(t\) of a single hash function.

A new example:

8336192     // hashes
    | |     // bottom two minimal hashes
      ***** // biminimizer

Like robust minimizers, this has one big drawback: Minimizers are not deterministic. Downstream applications will likely have to make two queries to locate the minimizer. But this may be worth the tradeoff compared to the space savings.

Reduced minimizers

(Naming suggestions welcome.)

Bidirectional anchors have a benefit over minimizers since they always use \(r=O(\log_\sigma (\ell))\) instead of possibly much larger \(k\). This means their average density \(2/(\ell+1-r)\) can be lower than \(2/(w+1) = 2/(\ell-k+2)\). Similarly, Miniception uses a separate \(k_0\) of order \(3 \log_\sigma(k)\) to achieve

Why do we use large \(k\), when small \(k=\Omega(\log \ell)\) is sufficient and preferable for lower density? The reason is that for locality preserving hashing we would like (nearly) unique keys of length \(\log_\sigma(N)\).

It seems that two conceptually distinct parameters are merged:

  • The length \(k_0=r+1\) of the minimizer, which we would like to be small.
  • The length \(k\) of the key we want to extract, which we would like to be larger.

Inspired by previous methods, here is a new sampling scheme.

  1. First, find a minimizer of length \(k_0=1+3 \log_\sigma w\), say at position \(0\leq i < w = \ell - k_0 + 1\).

  2. Extract a key of length \(k\leq (\ell+k_0)/2\):

    • If \(i \leq (w-1)/2\), extend right, i.e. extract \(Q_{i..i+k}\). This is in bounds because: \[i+k \leq (w-1)/2 + (\ell+k_0)/2 = (\ell-k_0)/2 + (\ell +k_0)/2 = \ell.\]
    • If \(i \geq (w-1)/2\), extend left, i.e. extract \(Q_{i+k_0-k..i+k_0}\). This is in bounds because: \[i+k_0-k \geq (w-1)/2 - (\ell+k_0)/2 = (\ell-k_0)/2 - (\ell +k_0)/2 = 0.\]

    TODO: Split into three cases for larger \(k\).

Here is an example for \(\ell \geq 2k-3-1\). Stars indicate the candidate $k_0$-minimizers, and the dashes indicate the extension to a $k$-mer key.

l=10, k=7, r=3
lmer:
 **********
minimizers (*), and extracted keys (*=)
 ***====
  ***====
   ***====
    ***====
 ====***
  ====***
   ====***
    ====***

And here is a 3-way split example that additionally includes extension around the middle.

l=10, k=8, r=3
lmer:
 ***********
minimizers (*), and keys (*=)
 ***=====
  ***=====
   ***=====
 ===***==
  ===***==
   ===***==
  =====***
   =====***

It seems that this scheme performs well when \(k\) is around \(\ell/2\), say \(\ell/3 < k < 2\ell/3\).

TODO: There are cases where we can be flexible in the exact point where we switch from extending left to extending right. Should we switch around the middle? Or better make one of the runs as long as possible?

Experiments

Here are some quick results.

  • Code is at https://github.com/RagnarGrootKoerkamp/minimizers.
  • PASHA is excluded – even though it’s very good, it’s too much effort to download $k$mers to quickly benchmark it.
  • For methods taking a parameter \(k_0\) or \(r\), I did a brute-force search from \(0\) to \(10\) (as long as they are valid), or around \(k-w\) in case that is larger than \(10\).
Figure 1: Density for various minimizer types, for alphabet size (4) and string length (n=10^5). All of (k), (w), and density are in log scale. Black lines indicate ((1.5+2/w)/(w+1)).

Figure 1: Density for various minimizer types, for alphabet size (4) and string length (n=10^5). All of (k), (w), and density are in log scale. Black lines indicate ((1.5+2/w)/(w+1)).

Figure 2: Optimal choice of (r) or (k_0). (k) and (w) are log scale.

Figure 2: Optimal choice of (r) or (k_0). (k) and (w) are log scale.

Note:

  • bd-anchors depend only on \(\ell = w+k-1\), and hence density decreases in \(k\).
  • Biminimizers are best for \(k\leq w\) (but have drawbacks).
  • Miniception is always better than vanilla minimizers.
  • Reduced minimizers are terrible for \(k\leq w\), but best for \(k>1.25 w\).
    • They get very close to \((1.5+2/w)/(w+1)\).
    • Why are they bad?
    • Can we optimize them more? By using more ideas from miniception?
  • Can we optimize miniception by introducing a third layer of minimizers??
    • Or what if we sort filtered kmers by their contained k0-mer before comparing their own hash?
  • For larger alphabet \(\sigma = 256\) (not shown), results are mostly the same but bd-anchors have slightly lower density.

Conclusion

For \(k \geq 1.25 w\), reduced minimizers achieve density very close to the lower bound of \((1.5+1/2w)/(w+1)\) that holds for random minimizers.

TODO

  • Make an explicit experiment for \(k=21\), \(w=10\), \(\ell=31\).

  • Compare to PASHA?

  • Analyze reduced minimizers formally.

    • Why are they so bad for small \(k\)? What’s going on there?
    • Why do they only work for \(k_0 = k-w\)? I would expect OK results for all small \(k_0\).
  • See if reduced minimizers can be improved for small \(k < 1.25w\).

  • Implement 3-way split for reduced minimizers.

  • Plots on distribution of distance between adjacent selected positions.

  • Why is

Large k

For \(k/w\to\infty\), the best we can do is density \(1/w\), and in the limit this is equal to the lower bound, so this case is ‘solved’. Both the scheme introduced in (Marçais, DeBlasio, and Kingsford 2018) and the new mod-minimizers achieve this \(1/w\) density in the limit, but mod-minimizers converge much faster.

  • TODO Mod-minimizers are also an instance of a minimizer scheme w.r.t. a specific order, namely: the hash of a kmer is the minimal hash over the tmers occuring in a position \(0 \mod w\).
  • In the large-\(k\) limit, the minimizer-forward scheme-local scheme hierarchy collapses: minimizers already achieve the lower bound that holds for local schemes.
  • Schleimer’s bound is only shown for forward schemes, but why does is not extend to local schemes?
  • Similarly, Marcais’ fixed bound is only for forward schemes, but could also be extended to local schemes?

Small k

This leaves the case of small \(k\), where the best schemes have density close to \(2/(w+1)\), but the lower bound is only around \(1/w\).

  • For \(w=1\), it is clear that density \(2/(w+1)=1\) is the best we can do.
  • TODO For \(k=1\), minimizer schemes are boring, but forward/local schemes TODO
  • For alphabet size \(\sigma=1\), everything is trivial.

Thus, we start our search at parameters \(k=w=\sigma=2\). For each set of parameters, we bruteforce search three schemes:

  • the best minimizer scheme,
  • the best forward scheme,
  • the best local scheme.

The question is:

  • Are forward schemes better than minimizer schemes?
    • Answer: YES. But so far, only in the following way: where minimizer schemes always select the leftmost occurrence in case of ties, optimal forward schemes switch between leftmost and rightmost occurrences.

      It’s open whether there are more interesting differences.

  • Are local schemes better than forward schemes?
    • Marçais, DeBlasio, and Kingsford (2018) mentions that using ILP they found an example for \(w=4\), \(k=2\) where a non-forward scheme is better than a forward scheme, but they do not give the example nor explain details on how it’s found. For \(\sigma=2\) I can not reproduce this, so probably \(\sigma=4\) was used.

Search methods

Minimizer scheme bruteforce
Iterate over all \(\sigma^k\) orders, evaluate density on a De Bruijn word of

order \(\sigma^(k+w)\).

ILP
We set up an Integer Linear Program.
  • For each of \(\sigma^\ell\) l-mers, we create \(w\) binary variables indicating which kmer in \([w]\) is chosen.
  • We construct a DeBruijn word of order \(k+w=\ell+1\), and create a variable for each contained $k$-mer.
  • For each $l$-mer in the text, we add an inequality that if a position in the l-mer is selected, the corresponding position in the text must also be selected.
  • For forward schemes, we add additional inequalities ensuring forwardness.

Note: for \(w=2\), every local scheme is also a forward scheme.

\(k=w=\sigma=2\)

Minimizer:

Num permutations: 4!
  best cnt : 22 / 32
  density  : 0.6875
  best perm: [0, 3, 1, 2]

Directed minimizer:

Num permutations: 4!
  best cnt : 20 / 32
  density  : 0.625
  best perm: [(2, Leftmost), (0, Leftmost), (3, Leftmost), (1, Rightmost)]

Forward/Local:

#vars: 40
#constraints: 70
#selected: 20
  best cnt : 20 / 32
  density  : 0.625
  best map: [0, 1, 0, 0, 1, 1, 0, 1]

\(w=3\)

Minimizer:

Num permutations: 4!
  best cnt : 32 / 64
  density  : 0.5
  best perm: [1, 0, 2, 3]

Directed minimizer:

Num permutations: 4!
  best cnt : 30 / 64
  density  : 0.46875
  best perm: [(1, Rightmost), (3, Leftmost), (0, Leftmost), (2, Leftmost)]

Forward:

#vars: 80
#constraints: 263
#selected: 30
  best cnt : 30 / 64
  density  : 0.46875
  best map: [0, 2, 1, 1, 0, 2, 0, 0, 1, 2, 1, 1, 2, 2, 1, 2]

Local:

#vars: 80
#constraints: 202
#selected: 30
  best cnt : 30 / 64
  density  : 0.46875
  best map: [0, 2, 1, 1, 0, 0, 0, 0, 1, 2, 1, 1, 2, 2, 0, 1]

\(w=4\)

Minimizer:

Num permutations: 4!
  best cnt : 50 / 128
  density  : 0.390625
  best perm: [2, 0, 1, 3]

Directed minimizer:

Num permutations: 4!
  best cnt : 48 / 128
  density  : 0.375
  best perm: [(1, Leftmost), (0, Leftmost), (3, Leftmost), (2, Rightmost)]

Forward: (3.5s)

#vars: 160
#constraints: 780
#selected: 48
  best cnt : 48 / 128
  density  : 0.375
  best map: [0, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 2, 2, 1, 1, 1, 1, 2, 3, 2, 2, 0, 3, 1, 2]

Local: (4.5s)

#vars: 160
#constraints: 532
#selected: 48
  best cnt : 48 / 128
  density  : 0.375
  best map: [3, 2, 3, 1, 2, 2, 3, 3, 1, 1, 3, 1, 2, 2, 3, 2, 0, 0, 0, 0, 0, 2, 3, 0, 1, 1, 3, 1, 2, 2, 3, 1]

\(w=2\), \(\sigma=3\)

Minimizer:

Num permutations: 9!
  best cnt : 155 / 243
  density  : 0.63786006
  best perm: [4, 0, 7, 5, 8, 6, 3, 1, 2]

Directed minimizer:

Num permutations: 9! * 2^9 = 185794560
  best cnt : 152 / 243
  density  : 0.6255144
  best perm: [(1, Rightmost), (4, Leftmost), (5, Leftmost), (0, Leftmost), (7, Leftmost), (2, Leftmost), (6, Leftmost), (8, Leftmost), (3, Rightmost)]

Forward/Local:

#vars: 270
#constraints: 511
#selected: 152
  best cnt : 152 / 243
  density  : 0.6255144
  best map: [1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1]

\(w=2\), \(\sigma=4\)

Minimizer:

TODO

Directed minimizer:

TODO

Forward/Local:

#vars: 1088
#constraints: 2110
#selected: 635
  best cnt : 635 / 1024
  density  : 0.6201172
  best map: [1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1]

This is a directed minimizer scheme!

\(w=\sigma=3\)

Minimizer:

Num permutations: 9!
  best cnt : 338 / 729
  density  : 0.46364883
  best perm: [5, 0, 2, 3, 7, 6, 8, 1, 4]

Directed minimizer:

Num permutations: 9! * 2^9 = 185794560
  best cnt : 331 / 729
  density  : 0.45404664
  best perm: [(3, Rightmost), (7, Leftmost), (2, Rightmost), (0, Leftmost), (5, Leftmost), (1, Leftmost), (6, Leftmost), (8, Leftmost), (4, Rightmost)]

Forward: (scip: 1000s, highs: 1600s, gurobi: 164s*5)

#vars: 810
#constraints: 2988
#selected: 331
  best cnt : 331 / 729
  density  : 0.45404664
  best map: [2, 1, 2, 0, 0, 2, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, 1, 1, 2, 0, 2, 1, 1, 1, 0, 0, 2, 0, 0, 2, 0, 0, 0, 2, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 2, 0, 2, 2, 0, 2, 1, 1, 2]
  best map: [2, 1, 2, 2, 0, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 0, 0, 2, 2, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 1, 1, 1, 2, 0, 2, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 0, 2, 2, 2, 2, 1, 1, 2]

Local: (scip: 4400s, gurobi: 1000s)

#vars: 810
#constraints: 2262
#selected: 331
  best cnt : 331 / 729
  density  : 0.45404664
  best map: [2, 1, 2, 0, 0, 2, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 2, 1, 1, 1, 1, 1, 2, 2, 0, 2, 1, 1, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 2, 0, 2, 2, 0, 2, 1, 1, 2]
  best map: [2, 1, 2, 2, 0, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 0, 0, 2, 2, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 0, 2, 1, 1, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 0, 2, 2, 2, 2, 1, 1, 2]

\(w4\), \(\sigma=3\)

Minimizer:

Num permutations: 9!
  best cnt : 793 / 2187
  density  : 0.36259717
  best perm: [4, 0, 2, 8, 7, 5, 6, 1, 3]

Directed minimizer:

Num permutations: 9! * 2^9 = 185794560
  best cnt : 786 / 2187
  density  : 0.35939643
  best perm: [(3, Rightmost), (6, Leftmost), (1, Leftmost), (2, Leftmost), (5, Leftmost), (0, Leftmost), (8, Leftmost), (7, Leftmost), (4, Rightmost)]

Forward:

TODO

Local:

TODO

Directed minimizer

It appears all optimal local schemes found above have slightly lower density than corresponding minimizer schemes. But in fact the local schemes are very similar to minimizer schemes. They are all instances of ‘directional minimizers’, a small generalization of minimizers that explicitly handles ties:

Directed Minimizer. Given is an order \(O\) on $k$-mers, and for each $k$-mer a boolean indicating whether the leftmost or rightmost instance should be selected. Then the directional minimizer of an $l$-mer is the $k$-mer that is minimal according to \(O\), and in case of ties, the leftmost or rightmost is selected as required.

References

Ayad, Lorraine A. K., Grigorios Loukides, and Solon P. Pissis. 2023. “Text Indexing for Long Patterns: Anchors Are All You Need.” Proceedings of the Vldb Endowment 16 (9): 2117–31. https://doi.org/10.14778/3598581.3598586.
Ekim, Barış, Bonnie Berger, and Rayan Chikhi. 2021. “Minimizer-Space de Bruijn Graphs: Whole-Genome Assembly of Long Reads in Minutes on a Personal Computer.” Cell Systems 12 (10): 958–68.e6. https://doi.org/10.1016/j.cels.2021.08.009.
Ekim, Barış, Bonnie Berger, and Yaron Orenstein. 2020. “A Randomized Parallel Algorithm for Efficiently Finding near-Optimal Universal Hitting Sets.” In Research in Computational Molecular Biology, 37–53. Springer International Publishing. https://doi.org/10.1007/978-3-030-45257-5_3.
Li, Heng. 2016. “Minimap and Miniasm: Fast Mapping and de Novo Assembly for Noisy Long Sequences.” Bioinformatics 32 (14): 2103–10. https://doi.org/10.1093/bioinformatics/btw152.
Loukides, Grigorios, Solon P. Pissis, and Michelle Sweering. 2023. “Bidirectional String Anchors for Improved Text Indexing and Top-$k$ Similarity Search.” Ieee Transactions on Knowledge and Data Engineering 35 (11): 11093–111. https://doi.org/10.1109/tkde.2022.3231780.
Marçais, Guillaume, Dan DeBlasio, and Carl Kingsford. 2018. “Asymptotically Optimal Minimizers Schemes.” Bioinformatics 34 (13): i13–22. https://doi.org/10.1093/bioinformatics/bty258.
Marçais, Guillaume, David Pellow, Daniel Bork, Yaron Orenstein, Ron Shamir, and Carl Kingsford. 2017. “Improving the Performance of Minimizers and Winnowing Schemes.” Bioinformatics 33 (14): i110–17. https://doi.org/10.1093/bioinformatics/btx235.
Orenstein, Yaron, David Pellow, Guillaume Marçais, Ron Shamir, and Carl Kingsford. 2017. “Designing Small Universal K-Mer Hitting Sets for Improved Analysis of High-Throughput Sequencing.” Edited by Benjamin J. Raphael. Plos Computational Biology 13 (10): e1005777. https://doi.org/10.1371/journal.pcbi.1005777.
Pibiri, Giulio Ermanno. 2022. “Sparse and Skew Hashing of K-Mers.” Bioinformatics 38 (Supplement_1): i185–94. https://doi.org/10.1093/bioinformatics/btac245.
Pibiri, Giulio Ermanno, Yoshihiro Shibuya, and Antoine Limasset. 2023. “Locality-Preserving Minimal Perfect Hashing of K-Mers.” Bioinformatics 39 (Supplement_1): i534–43. https://doi.org/10.1093/bioinformatics/btad219.
Roberts, Michael, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, and James A. Yorke. 2004. “Reducing Storage Requirements for Biological Sequence Comparison.” Bioinformatics 20 (18): 3363–69. https://doi.org/10.1093/bioinformatics/bth408.
Schleimer, Saul, Daniel S. Wilkerson, and Alex Aiken. 2003. “Winnowing: Local Algorithms for Document Fingerprinting.” In Proceedings of the 2003 Acm Sigmod International Conference on Management of Data. Sigmod/Pods03. ACM. https://doi.org/10.1145/872757.872770.
Zheng, Hongyu, Carl Kingsford, and Guillaume Marçais. 2020a. “Improved Design and Analysis of Practical Minimizers,” February. https://doi.org/10.1101/2020.02.07.939025.
———. 2020b. “Improved Design and Analysis of Practical Minimizers.” Bioinformatics 36 (Supplement_1): i119–27. https://doi.org/10.1093/bioinformatics/btaa472.