The anti-lexicographic SUS-anchor: a near-optimal k=1 sampling scheme

Table of Contents

1 Abstract
2 Introduction
3 Preliminaries
- 3.1 Minimizers and sampling schemes
- 3.2 Bidirectional string anchors
- 3.3 Maximal suffixes
4 Character-based orders
5 Smallest unique substring anchors
- 5.1 Properties
- 5.2 Computing SUS-anchors
6 Results
7 Discussion

\[ \newcommand{\sus}{\mathsf{SUS}} \newcommand{\lcp}{\mathsf{lcp}} \newcommand{\last}{\mathsf{last}} \]

1 Abstract Link to heading

Motivation. In recent years, there has been a renewed interest in the search for low density minimizer schemes. These schemes take a window of $w$ consecutive $k$-mers, and sample one of them: the smallest under some specific order. Schemes such as the mod-minimizer provide a low density (fraction of sampled $k$-mers) when $k \gg w$, while schemes such as the greedy minimizer work well for explicit small parameters roughly in the regime $k \leq 2w$, for $k$ and $w$ up to $15$ or so.

When $k < \log_\sigma w$ is very small, minimizer schemes cannot do well, and more general sampling schemes are needed that can be richer than just comparing $k$-mers. Bidirectional-string anchors (bd-anchors) form one such scheme.

Methods. Inspired by bd-anchors, we introduce the smallest unique substring or SUS-anchor: Given a window, this considers all suffixes that do not occur as a substring elsewhere in the window. It then samples the start position of the smallest suffix according to the new anti-lexicographic order that minimizes the first character and maximizes the remaining characters. We give a linear-time and $O(w)$ space streaming algorithm to compute all SUS-anchors of a string.

Results. For alphabet size $\sigma=4$ and $k=1$, the anti-lexicographic SUS-anchors empirically has density $<1\%$ away from the density lower bound, significantly improving over bd-anchors that are often $>15\%$ above it. For alphabet size $\sigma=2$, the density is at most $10\%$ above the lower bound, which again improves over the $>50\%$ overhead of bd-anchors.

2 Introduction Link to heading

Minimizers (Schleimer, Wilkerson, and Aiken 2003; Roberts et al. 2004) are a technique to subsample the $k$-mers of a string such that consecutive samples are at most $w$ positions apart. Minimizer sampling has many applications in bioinformatics. In particular, the minimizers of a string are a sketch or compressed (lossy) representation, allowing for faster processing. Specific applications include seeding for read-mapping as done by minimap (Li 2016), the minimizer-space De Bruijn graph (Ekim, Berger, and Chikhi 2021), text-indexing using the U-index (Ayad et al. 2025), sketching for Jaccard similarity in mashmap (Jain et al. 2018), host depletion in Deacon (Constantinides, Lees, and Crook 2025), and more (Zheng, Marçais, and Kingsford 2023).

Recent work. There is a long line of active work on minimizers. Of particular interest is the density, which is the expected fraction of sampled positions. Recently, the mod-minimizer (Groot Koerkamp and Pibiri 2024; Groot Koerkamp, Liu, and Pibiri 2025) introduced schemes with near-optimal density for $k\to\infty$. The GreedyMini (Golan et al. 2025) is the state of the art for smaller $w$ and $k$ roughly up to $15$, achieving particularly close to optimal density for $k=w+1$, $k=w$, and $k$ just below $w$. Shur et al. recently introduced spacers (Shur, Tziony, and Orenstein 2026a) that prefer sampling $k$-mers starting with 10 (assuming a binary alphabet) for which the next occurrence of 10 is as far ahead as possible. Spacers improve the density of the ABB+ scheme we introduce here, and are especially good when $w$ is large. For particularly small parameters $(\sigma,k)=(2,\leq 6)$ and $(\sigma,k)=(4, 2)$, exactly optimal minimizer schemes are given by the search algorithm OptMini (Shur, Tziony, and Orenstein 2026b) and analysed theoretically by Shur (Shur 2025).

This raises the question: can we find near-optimal sampling schemes for small $k$, and in particular for $k=1$? An ILP search suggests that the answer is yes: this is always possible when $k\equiv 1\mod w$ (Kille et al. 2024), and thus we are motivated to search for a “clean”, constant-space scheme.

Some further recent work includes SimdMinimizers (Groot Koerkamp and Martayan 2025), a SIMD-based algorithm that computes random minimizers at around 500 Mbp/s. A precise analysis of the density of the random minimizer is given in (Golan and Shur 2025). Vigemers (Ingels et al. 2026) reduce the imbalance when using minimizers for partitioning, and multiminimizers (Ingels et al. 2025) reduce the density at the cost of more compute.

Sampling and selection schemes. Minimizer schemes hash each $k$-mer and sample the $k$-mer with the smallest hash in a window of $w$ $k$-mers (or $\ell=w+k-1$ characters). Minimizers cannot achieve a good density of $O(1/w)$ for constant $k$ as $w\to\infty$ (Marçais, DeBlasio, and Kingsford 2018), but this restriction does not apply to sampling schemes, which have more freedom because they are not required to first hash all $k$-mers. In particular, selection schemes with $k=1$ simply sample a single position based on a window of length $w=\ell$. Bidirectional string anchors (bd-anchors) are one such selection scheme (Loukides, Pissis, and Sweering 2023). These sample the start position of the smallest rotation of a window and achieve a density of $O(1/w)$. For large $w$ and alphabet size $\sigma=4$, bd-anchors have a density around $20\%$ above the lower bound. Our goal is to reduce this to $0\%$ overhead.

Contributions. In this paper, we introduce SUS-anchors, short for smallest unique substring anchors. As the name implies, given a window, these sample the start position of the smallest substring that does not occur a second time. This is equivalent (after reversing the order of the alphabet) to finding the maximal suffix of each window. Specifically:

We introduce the concept of character-based orders that generalizes e.g. the lexicographic, alternating (Roberts et al. 2004), and ABB (Frith, Noé, and Kucherov 2020) orders.
We then introduce two new orders: the ABB+ and anti-lexicographic order.
We give an $O(n)$ time and $O(\ell)$ space sliding-window algorithm to compute the SUS-anchors of a string of length $n$ that is based on the monotone-queue approach (Groot Koerkamp and Martayan 2025).
We experimentally show that SUS-anchors with the anti-lexicographic order have density within $1\%$ of optimal for $\sigma=4$ and any $w$.

Parts of this paper have been previously introduced in the author’s thesis (Groot Koerkamp 2025).

3 Preliminaries Link to heading

3.1 Minimizers and sampling schemes Link to heading

For a more in-depth introduction to minimizers, we refer the reader to the survey by Zheng et al. (Zheng, Marçais, and Kingsford 2023), the mod-minimizer paper (Groot Koerkamp and Pibiri 2024), and the author’s thesis (Groot Koerkamp 2025). Here we briefly introduce the required notation and concepts.

Notation. For an integer $w$, we write $[w] = \{0, 1, \dots, w-1\}$, and for a string $W=w_0\dots w_{|W|-1}$ we use $W[i\dots j)$ to indicate the substring $w_i\dots w_{j-1}$. We assume an alphabet $\Sigma$ of size $\sigma$.

Sampling schemes. Given parameters $w\geq 1$ and $k\geq 1$, a window is a string over $\Sigma$ of length $\ell=w+k-1$ that contains exactly $w$ $k$-mers. A local sampling scheme or just sampling scheme is a function $f: \Sigma^\ell \to [w]$ that indicates that from window $W$, the $k$-mer $W[f(W)\dots f(W)+k)$ is sampled.

Given a text $T$, we are interested in the set of all sampled positions when sliding window $W$ over the text. Consecutive sampled positions differ by at most $w$, and a sampling scheme is forward when the sampled position in $T$ never decreases when sliding the window forward.

Often considered are minimizer schemes (Schleimer, Wilkerson, and Aiken 2003; Roberts et al. 2004), which are forward sampling schemes that sample the start position of the smallest $k$-mer (according to some order given by a hash function) starting in the window.

In this paper, we are particularly interested in selection schemes, which are sampling schemes with $k=1$ (Zheng, Kingsford, and Marçais 2021).

Density. The particular density of a sampling scheme is the fraction of positions that are sampled. The density is the expected value of the particular density on a random string with length going to infinity.

Density lower bounds. A trivial lower bound on the density of sampling schemes is $1/w$, since at least one in every $w$ positions is sampled. The best current lower bound for forward sampling schemes (Kille et al. 2024) simplifies (with a small loss of accuracy) to $\frac 1{w+k} \left\lceil \frac{w+k}w\right\rceil$. For $k=1$, this gives a lower bound for forward selection schemes of $\frac1{w+1}\left\lceil\frac{w+1}w\right\rceil = \frac 2{w+1}$. The main insight is that the density equals the expected number of samples in a random cyclic string (cycle) of length $w+1$, and in any such cycle, at least two different positions must be sampled.

For minimizer schemes, a lower bound is given by $1/\sigma^k$ (Marçais, DeBlasio, and Kingsford 2018): if $w\to\infty$, exactly all occurrences of the smallest $k$-mer will be sampled, and these occur every $\sigma^k$ positions in expectation. Note, however, that this bound does not hold for general sampling schemes, and it is exactly this property that will allow us to design near-optimal sampling schemes for small $k$.

3.2 Bidirectional string anchors Link to heading

Bidirectional string anchors (bd-anchors) are a $k=1$ selection scheme introduced by Loukides, Pissis, and Sweering for the purpose of text indexing (Loukides, Pissis, and Sweering 2023). Given a window, they sample the (leftmost) start position of the lexicographically smallest rotation. A drawback of bd-anchors is that they are not forward: the window ZABAAC has AAC... as smallest rotation, while the shifted window ABAACA has AAB... as smallest rotation. After further shifting to BAACAY, the smallest rotation is AAC... again: the AB prefix caused the selected position to jump around. To ensure the density¹ is in $O(1/w)$, reduced bd-anchors (Loukides, Pissis, and Sweering 2023) avoid sampling the last $r=C\lceil \log_\sigma \ell\rceil$ positions, where $C=4$ in theory but in practice $C=3$ or less suffices. Given the choice of $r>\log_\sigma \ell$, in most windows of a random string the bd-anchor is the same as the smallest lexicographic $(r+1)$-mer. Note though that reduced bd-anchors are still not forward.

In practice, bd-anchors can be computed in $O(n\ell)$ time by using Booth’s linear-time algorithm for the lexicographically minimal rotation (Booth 1980). An $O(n)$ time algorithm is possible using a data structure of Kociumaka (Kociumaka 2016), but this is mostly of theoretical interest only since it requires $O(n)$ words of space for e.g. a suffix array. Theorem 3 of (Loukides, Pissis, and Sweering 2023) gives an $O(n)$ time and $O(\ell)$ space algorithm by using Kociumaka’s method on overlapping chunks of size e.g. $2\ell$.

3.3 Maximal suffixes Link to heading

Separate from the literature on minimizers, there is a line of work on the non-empty minimal suffix and maximal suffix of a string (Duval 1983). While these two problems appear similar, as one might simply reverse the order of the alphabet, a crucial point is that a string $A$ is always smaller than $B$ when $A$ is a prefix of $B$. Thus, the minimal suffix prefers shorter suffixes, while the maximal suffix prefers longer suffixes. Because of this, minimal suffixes are less stable as we slide a window over a text, and we do not further consider them.

Babenko et al. (Babenko, Kolesnichenko, and Starikovskaya 2013) introduce an algorithm that preprocesses a string of length $n$ into a linear-space data structure and can then find the maximal suffix of query substrings of length $\ell$ in $O(\log \ell)$ time, which was improved to $O(1)$ query time in (Babenko et al. 2016).

4 Character-based orders Link to heading

To capture some of the existing “lexicographic-like” orders, we define character-based orders that compare strings one character at a time². This is a natural requirement in our setting, because it allows comparing suffixes of a string without worrying that future characters (after shifting the window) will change the order, as long as one is not a prefix of the other.

Definition 1 Character-based order.

An order $O$ on strings over $\Sigma$ is character-based if there exist orders $O_i$ on $\Sigma$ for $i\in \{0, 1, 2, \dots\}$ such that for all strings $A=a_0\dots a_{|A|-1}$ and $B=b_0\dots b_{|B|-1}$ with longest common prefix $\ell = \mathsf{lcp}(A,B)$ we have

\begin{equation*} A <_O B \quad\quad \text{iff} \quad\quad (\text{$A$ is a strict prefix of $B$}) \text{ or } a_\ell <_{O_\ell} b_{\ell}. \end{equation*}

The orders can be either total orders, or linear preorders when equalities are allowed.

This scheme encapsulates the lexicographic order on strings, where each $O_i$ is simply the lexicographic order on $\Sigma$. The main drawback of the lexicographic order is that it clusters small strings: since AAAAX is small, the next $k$-mer AAAXY is also small, possibly causing consecutive positions to be sampled as minimizers. Most of the following schemes instead look for a transition from a small character (A) to a large (Z) or non-small (BCD..Z) character.

The alternating order (Roberts et al. 2004) uses the lexicographic order for even $i$, and the reverse lexicographic order for odd $i$, so that AZAZAZ... is the smallest string. The ABB order (Blackburn 2015; Frith, Noé, and Kucherov 2020) uses the lexicographic order for $O_0$, and for $i>0$ it uses the order \[1=_{O_i}2=_{O_i}\dots=_{O_i}\sigma-1 <_{O_i} 0,\] so that any string like ABBBB... or AXYZDEF... is minimal. This scheme has the nice property that occurrences of small strings starting in A and not containing further A’s are disjoint (Blackburn 2015). Vigemers (Ingels et al. 2026) are also a character-based order, where $O_i$ is the order after xor’ing by a character $\gamma_i$.

A drawback of the ABB order is that it throws away some information: for example, over the normal alphabet, AB and AC are considered equal. Thus, we also consider a version with tiebreaking, ABB+:

Definition 2 ABB+ order.

The ABB+ order first compares two strings via the ABB order, and, in case of a tie, compares them via the plain lexicographic order.

Note that this is only useful when comparing strings ($k$-mers) of equal length, and that the ABB+ order is not itself a character-based order.

The following scheme is more practical.

Definition 3 Anti-lexicographic order.

The anti-lexicographic order uses the lexicographic order for $O_0$, and reverse lexicographic order for $O_i$ for $i>0$.

In this order, the smallest alphabetic string is AZZZZ....

5 Smallest unique substring anchors Link to heading

Intuition. Consider again the bd-anchor, which samples the start position of the smallest rotation of a window $W$. As we saw in 3.2, a drawback is that rotations “wrap around”, and that the character $W[0]$ at the start of a string influences whether the rotation starting at the last character $W[\ell-1]$ is small or not. This is easily fixed by considering only the smallest suffix instead. However, the smallest suffix is the empty suffix, and so we could sample the first character of the smallest non-empty suffix. Unfortunately, this results in a bad density: a random window ends in the smallest symbol $0$ with probability $1/\sigma$, in which case the suffix consisting of just this character is the smallest one, and the density will be around $1/\sigma$ regardless of $w$. To avoid this, we impose the following restriction (which, in a way, generalizes the “non-empty” condition): a suffix is only allowed to be sampled if it does not occur elsewhere in the window as a substring. Thus, we look for the smallest unique suffix $W[i\dots)$. Let $S=W[i\dots j)$ be the shortest prefix of $W[i\dots)$ that is unique in $W$. Then $S$ is the smallest unique substring (SUS³) of $W$ (Groot Koerkamp 2025). As an example, the string CABBAB has AB as its smallest suffix, and ABBAB as its smallest unique suffix. The smallest unique substring is ABB, and the index of the correspondingly sampled SUS-anchor is 1.

Definition 4 SUS-anchor.

Given a window $W$ of length $w=\ell$, the smallest unique substring is the smallest substring $\sus(W)=W[i\dots j)$ that does not occur elsewhere in $W$. Then $W[i\dots w)$ is the smallest unique suffix, and the SUS-anchor is $i$.

MS-anchor. It turns out that the concept of smallest unique suffix is exactly equivalent to that of the maximal suffix (Duval 1983; Babenko et al. 2016) after reversing the order of the alphabet: in both cases, we look for the extremal suffix where longer suffixes should be preferred over shorter ones.

Definition 5 MS-anchor.

Given a window $W$ of length $w=\ell$, the maximal suffix anchor or MS-anchor samples the position $i\in[w]$ where the maximal suffix $W[i\dots \ell)$ of $W$ starts.

Variants. Alongside the lexicographic variant, the SUS-anchor allows variants based on character-based orders. Specifically, we consider SUS-anchors with the anti-lexicographic order.

These variants work just like the lexicographic version: simply consider the set of suffixes that do not occur as a substring elsewhere, and then take the smallest of these using the chosen character-based order.

5.1 Properties Link to heading

We now state some observations and then prove some properties of SUS-anchors.

Observation 1.

Given a window $W$, the smallest unique suffix is smaller than all longer suffixes, because all longer suffixes must be unique and thus larger than the smallest unique suffix.

Observation 2.

Unless removing the last character from the smallest unique substring results in a non-unique substring.

Theorem 1 SUS-anchors are forward.

SUS-anchors are forward for any character-based order.

Consider two consecutive windows $W$ and $W’$ with $\sus(W) = W[i\dots j)$. If $i=0$, the scheme is trivially forward. Otherwise, let $0\leq i’ < i-1$ be the start index of a suffix $W’[i’\dots)$. By the previous observation, $W[i\dots) <_O W[i’+1\dots)$, and since $W[i\dots)$ is not a prefix of $W[i’+1\dots)$ (for otherwise it would not be unique), appending $W’[\ell-1]$ to these two suffixes will not change their relative order. Since $W[i\dots)$ is unique in $W$, $W’[i-1\dots)$ is also unique in $W’$, and so $W’[i-1\dots)$ is a smaller unique suffix than $W’[i’\dots)$ for all $i’<i-1$. Thus, $\sus(W’)$ cannot start at an index less than $i-1$, and thus the sus-anchor is forward.

Theorem 2 Charged context.

Given two consecutive windows $W$ and $W’$ that together form a context, the sampled SUS-anchor changes if and only if either $\sus(W)$ is a prefix of $W$ or $\sus(W’)$ is a suffix of $W’$. In this case, the context is charged.

Let $\sus(W) = W[i\dots j)$ and $\sus(W’) = W’[i’\dots j’)$.

If $i=0$ is sampled, $W’$ cannot sample the same position. If $W’[i’\dots j’)=W[i’\dots \ell)$ is a suffix, then $W’[i’\dots j’-1)$ is not unique in $W’$, and thus also not in $W$. Thus, $W$ must sample a different position.

For the other direction, assume that $i’ > i-1$, so that different text positions are sampled. If $i=0$ or $i’=\ell-1$ we are done, so assume $i>0$ and $i’<\ell-1$.

If $W[i\dots) < W[i’+1\dots)$, then for every possible character $W’[\ell-1]$ that can be appended, we have $W’[i-1\dots) < W’[i’\dots)$, which is in contradiction with the fact that $W’[i’\dots)$ is smaller than all longer suffixes. Thus, $W[i’+1\dots)$ must be a prefix of $W[i\dots)$, so that the suffix is not unique. This means that $W’[i’\dots \ell-1)$ is not unique in $W’$, and thus, the SUS $W’[i’\dots j’)$ can only end in $j’=\ell$, as required.

5.2 Computing SUS-anchors Link to heading

Via the equivalence to maximal suffixes, we can use the theoretical algorithm of (Babenko et al. 2016) that requires $O(n)$ space, $O(n)$ preprocessing, and supports $O(1)$ queries for substring suffix-maximum. To reduce the space usage, we can use the same trick as for bd-anchors, and run it on chunks of size $2\ell$ that overlap by $\ell-1$, so that the total space usage is $O(\ell)$ and the total time remains $O(n)$.

Theorem 3.

The SUS-anchors of a string of length $n$ can be computed in $O(n)$ time and $O(\ell)$ space.

The method below works for the maximal suffix, as well as for any character-based order that is a total order in every position.

A streaming version. We now directly adapt the algorithm of (Babenko et al. 2016) to our streaming setting. Suppose that we have so far processed $T[0\dots t)$. The algorithm maintains a doubly linked list $A=(a_0, a_1, a_2, \dots)$ of active text positions $a_i$ such that $T[a_i\dots t) > T[j\dots t)$ for all $a_i < j < t$. This list is decreasing in the sense that $T[a_0\dots t) > T[a_1\dots t) > T[a_2\dots t) > \dots$, and it plays a similar role to the monotone queue in the classic minimizer algorithm (Groot Koerkamp and Martayan 2025). Unlike that original case, here we also store a list of events for every text position that can modify the middle of the list.

As we shift the window and increment $t$ to $t’=t+1$, we remove $a_0$ when $a_0 < t’-\ell$. After appending $T[t]$, we get a new suffix $T[t\dots t’)$, and we have to update $A$.

Repeatedly pop $A$. We remove the last element $\last(A)=A_{|A|-1}$ from $A$ as long as $T[t\dots t’) > T[\last(A)\dots t’)$.
Add event. Then, if $A$ is not empty and $T[t\dots) > T[\last(A)\dots)$ (taking into account upcoming characters), compute $\lambda=\lcp(T[t\dots), T[\last(A)\dots))$ and add the event “check if we should drop the element preceding $t$ from $A$” to the event queue for position $t+\lambda+1$.
Push $t$. Then, push $t$ to $A$ and increment $t’$.
Handle event. As soon as $t’’$ reaches position $t+\lambda+1$, we check if $t$ is still in $A$, and if so, if it has now become larger than its preceding element. Then repeatedly remove the preceding element as long as $T[t\dots t’’)$ is larger, and end when either there is no preceding element, or when $T[a_i\dots t’’) > T[t\dots t’’)$. In that case, again check if the order will flip in the future, and if so, add a new event to the event queue.

Practical considerations. To make this practical, all $\lcp$ computations can be bounded to $\ell$, and the event queues can be stored in a ring buffer with just $\ell+1$ slots for the current and next $\ell$ positions. Originally, the $\lcp$ computations are done in constant time using the suffix array. Instead, we simply do a word-by-word scan on the bit-representation of the text, which takes expected constant time on random strings since LCPs of, say, over $64$ bits are unlikely in random strings.

A more direct version. A final modification drops the event queues and allows us to replace the doubly linked list $A$ by a simple double-ended queue. For each $a\in A$, we now additionally store the text position $x_a$ where $a$ becomes “hot”, i.e., we store that $a$ becomes the first element of $A$ (and thus the start of the maximal suffix of the window) when the character $T[x_a]$ enters the sliding window. When comparing $t$ to $\last(A)$, there are three cases.

$T[t\dots) < T[\last(A)\dots)$, in which case we push $t$ and note that it becomes hot when $\last(A)$ falls out of the window, i.e., $x_t = \last(A)+\ell$.
$T[t\dots) > T[\last(A)\dots)$: $t$ “takes over” $\last(A)$ when character $t+\lambda$ enters the window, where $\lambda = \lcp(T[t\dots), T[\last(A)\dots))$.
- If $\last(A)$ is already hot by that time ($x_{\last(A)} < t+\lambda$), simply push $t$.
- If not, $\last(A)$ will never be hot, and we pop it and (repeatedly) compare $t$ to the new $\last(A)$.

Once this process finishes, the first element of $A$ indicates the maximum suffix of the current window. We then increment $t$ to the next text position and check whether we reached the position $x_{a_1}$ where the next element of $A$ becomes hot (either because $a_0$ falls out of the window or $a_1$ introduces a new maximal suffix). In that case, we pop $a_0$.

6 Results Link to heading

Figure 1: Comparison of the density factor (the density multiplied by (w+1)) of selection schemes for (sigmain{2,4,32}), (k=1), and varying (w). The lower bound is shown in red. Bd-anchors (lime/green) are shown for various (r), and SUS-anchors are shown with both the lexicographic (purple) and anti-lexicographic order (blue). (The other orders are worse than anti-lexicographic.) On the right of each plot, we show scaling for large (w). Each data point is the particular density on an independent random string of length (10^7). For large (sigma) and (w), there remains some variance in the estimated density.

We compare bd-anchors with various $r$ and sus-anchors with various underlying orderings in Figure 1. We see that as $w$ grows, so does the optimal value of $r$. Still, even with the best choice of $r$, bd-anchors have density over $50\%$ ($\sigma=2$), $15\%$ ($\sigma=4$), or $2.5\%$ ($\sigma=32$) above the lower bound for all but the smallest $w$.

Lexicographic SUS-anchors perform consistently slightly better than the best bd-anchor for all $\sigma$ and not too small $w$. Anti-lexicographic SUS-anchors are much better, and get surprisingly close to the lower bound. They are away from the lower bound by less than $10\%$ for $\sigma=2$ and less than $1\%$ for $\sigma=4$ and $\sigma=32$.

7 Discussion Link to heading

We introduced anti-lexicographic SUS-anchors, gave a linear-time algorithm to compute them, and showed that they empirically get to within $1\%$ of the density lower bound for $\sigma=4$.

Future work. On a practical level, future work is needed to improve the current monotone-queue-based algorithm to a rescan-like approach (Groot Koerkamp and Martayan 2025) or even a branchless variant that could be implemented in parallel using SIMD.

It would also be interesting to use SUS-anchors in combination with the mod-minimizer (Groot Koerkamp and Pibiri 2024), especially in order to remove the $r$ parameter, but this has not worked out so far.

Further work is needed to prove that, for example, anti-lexicographic SUS anchors have a density of $2/(w+1) + o(1/w)$. Given how close to optimal these schemes already are, the fact that exactly optimal schemes do exist for small $w$ (Kille et al. 2024), as well as manually constructed schemes that are exactly optimal for $\sigma=2$ and $w\leq 12$, the question remains whether “clean”, constant space schemes can be developed for generic $w$. The spacers of (Shur, Tziony, and Orenstein 2026a) are similar to our own ongoing work, and provide a first step in this direction.

Lastly, the SUS-anchor is a forward scheme. It is known that non-forward schemes can be (at least) slightly better at times and break the forward lower bound, but this is not well understood.

References Link to heading

Ayad, Lorraine A. K., Gabriele Fici, Ragnar Groot Koerkamp, Grigorios Loukides, Rob Patro, Giulio Ermanno Pibiri, and Solon P. Pissis. 2025. “U-Index: A Universal Indexing Framework for Matching Long Patterns.” In SEA 2025, 338:4:1–4:18. Lipics. https://doi.org/10.4230/LIPIcs.SEA.2025.4.

Babenko, Maxim, Paweł Gawrychowski, Tomasz Kociumaka, Ignat Kolesnichenko, and Tatiana Starikovskaya. 2016. “Computing Minimal and Maximal Suffixes of a Substring.” Theoretical Computer Science 638: 112–21. https://doi.org/10.1016/j.tcs.2015.08.023.

Babenko, Maxim, Ignat Kolesnichenko, and Tatiana Starikovskaya. 2013. “On Minimal and Maximal Suffixes of a Substring.” In Combinatorial Pattern Matching, edited by Johannes Fischer and Peter Sanders, 28–37. Berlin, Heidelberg: Springer Berlin Heidelberg.

Blackburn, Simon R. 2015. “Non-Overlapping Codes.” Ieee Transactions on Information Theory 61 (9): 4890–94. https://doi.org/10.1109/tit.2015.2456634.

Booth, Kellogg S. 1980. “Lexicographically Least Circular Substrings.” Information Processing Letters 10 (4): 240–42. https://doi.org/https://doi.org/10.1016/0020-0190(80)90149-0.

Constantinides, Bede, John Lees, and Derrick W Crook. 2025. “Deacon: Fast Sequence Filtering and Contaminant Depletion,” June. https://doi.org/10.1101/2025.06.09.658732.

Duval, Jean-Pierre. 1983. “Factorizing Words over an Ordered Alphabet.” Journal of Algorithms 4 (4): 363–81. https://doi.org/10.1016/0196-6774(83)90017-2.

Ekim, Barış, Bonnie Berger, and Rayan Chikhi. 2021. “Minimizer-Space De Bruijn Graphs: Whole-Genome Assembly of Long Reads in Minutes on a Personal Computer.” Cell Systems 12 (10): 958–68.e6. https://doi.org/10.1016/j.cels.2021.08.009.

Frith, Martin C, Laurent Noé, and Gregory Kucherov. 2020. “Minimally Overlapping Words for Sequence Similarity Search.” Edited by Yann Ponty. Bioinformatics 36 (22–23): 5344–50. https://doi.org/10.1093/bioinformatics/btaa1054.

Golan, Shay, Ido Tziony, Matan Kraus, Yaron Orenstein, and Arseny Shur. 2025. “GreedyMini: generating low-density DNA minimizers.” Bioinformatics 41 (July): 275–84. https://doi.org/10.1093/bioinformatics/btaf251.

Golan, Shay, and Arseny M. Shur. 2025. “Expected Density of Random Minimizers.” In Sofsem 2025: Theory and Practice of Computer Science, 347–60. Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-82670-2_25.

Groot Koerkamp, Ragnar. 2025. “Optimal Throughput Bioinformatics.” ETH Zurich. https://doi.org/10.3929/ETHZ-C-000783091.

Groot Koerkamp, Ragnar, Daniel Liu, and Giulio Ermanno Pibiri. 2025. “The Open-Closed Mod-Minimizer Algorithm.” Algorithms for Molecular Biology 20 (4). https://doi.org/10.1186/s13015-025-00270-0.

Groot Koerkamp, Ragnar, and Igor Martayan. 2025. “SimdMinimizers: Computing Random Minimizers, fast.” In SEA 2025, 338:20:1–20:19. Lipics. https://doi.org/10.4230/LIPIcs.SEA.2025.20.

Groot Koerkamp, Ragnar, and Giulio Ermanno Pibiri. 2024. “The Mod-Minimizer: A Simple and Efficient Sampling Algorithm for Long $k$-Mers.” In Wabi 2024, 312:11:1–11:23. Lipics. https://doi.org/10.4230/LIPIcs.WABI.2024.11.

Ingels, Florian, Antoine Limasset, Camille Marchet, and Mikaël Salson. 2026. “Vigemers: On the Number of $k$-Mers Sharing the Same Xor-Based Minimizer.” Arxiv. https://doi.org/10.48550/ARXIV.2602.03337.

Ingels, Florian, Lucas Robidou, Igor Martayan, Camille Marchet, and Antoine Limasset. 2025. “Minimizer Density Revisited: Models and Multiminimizers,” November. https://doi.org/10.1101/2025.11.21.689688.

Jain, Chirag, Alexander Dilthey, Sergey Koren, Srinivas Aluru, and Adam M. Phillippy. 2018. “A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases.” Journal of Computational Biology 25 (7): 766–79. https://doi.org/10.1089/cmb.2018.0036.

Kille, Bryce, Ragnar Groot Koerkamp, Drake McAdams, Alan Liu, and Todd J Treangen. 2024. “A near-tight Lower Bound on the Density of Forward Sampling Schemes.” Edited by Yann Ponty. Bioinformatics, December. https://doi.org/10.1093/bioinformatics/btae736.

Kociumaka, Tomasz. 2016. “Minimal Suffix and Rotation of a Substring in Optimal Time.” In 27Th Annual Symposium on Combinatorial Pattern Matching (Cpm 2016), edited by Roberto Grossi and Moshe Lewenstein, 54:28:1–28:12. Leibniz International Proceedings in Informatics (Lipics). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPIcs.CPM.2016.28.

Li, Heng. 2016. “Minimap and Miniasm: Fast Mapping and de Novo Assembly for Noisy Long Sequences.” Bioinformatics 32 (14): 2103–10. https://doi.org/10.1093/bioinformatics/btw152.

Loukides, Grigorios, Solon P. Pissis, and Michelle Sweering. 2023. “Bidirectional String Anchors for Improved Text Indexing and Top-$k$ Similarity Search.” Ieee Transactions on Knowledge and Data Engineering 35 (11): 11093–111. https://doi.org/10.1109/tkde.2022.3231780.

Marçais, Guillaume, Dan DeBlasio, and Carl Kingsford. 2018. “Asymptotically Optimal Minimizers Schemes.” Bioinformatics 34 (13): i13–22. https://doi.org/10.1093/bioinformatics/bty258.

Roberts, Michael, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, and James A. Yorke. 2004. “Reducing Storage Requirements for Biological Sequence Comparison.” Bioinformatics 20 (18): 3363–69. https://doi.org/10.1093/bioinformatics/bth408.

Schleimer, Saul, Daniel S. Wilkerson, and Alex Aiken. 2003. “Winnowing: Local Algorithms for Document Fingerprinting.” In Proceedings of the 2003 Acm Sigmod International Conference on Management of Data. Sigmod/Pods03. ACM. https://doi.org/10.1145/872757.872770.

Shur, Arseny. 2025. “On Minimizers of Minimum Density.” Arxiv. https://doi.org/10.48550/ARXIV.2506.05277.

Shur, Arseny, Ido Tziony, and Yaron Orenstein. 2026a. “10-Minimizers: A Promising Class of Constant-Space Minimizers.” Biorxiv. https://doi.org/10.64898/2026.03.16.712052.

———. 2026b. “Generating Minimum-Density Minimizers,” January. https://doi.org/10.64898/2026.01.25.701585.

Zheng, Hongyu, Carl Kingsford, and Guillaume Marçais. 2021. “Lower Density Selection Schemes via Small Universal Hitting Sets with Short Remaining Path Length.” Journal of Computational Biology 28 (4): 395–409. https://doi.org/10.1089/cmb.2020.0432.

Zheng, Hongyu, Guillaume Marçais, and Carl Kingsford. 2023. “Creating and Using Minimizer Sketches in Computational Genomics.” Journal of Computational Biology 30 (12): 1251–76. https://doi.org/10.1089/cmb.2023.0094.

Note that even though typically $\ell=w+k-1=w$, we will $w$ for the density and $\ell$ for the length of the window. ↩︎
A slight generalization of this concept that we do not otherwise need in this paper uses orders $O_i$ on strings of length $i$ and then compares prefixes $a_0\dots a_\ell \lt_{O_\ell} b_0\dots b_\ell$ instead. ↩︎
Not to be confused with the shortest unique substring, which is also commonly abbreviated as SUS. ↩︎