In this post, we will look at the minimizer schemes generated by the greedy
minimizer (Golan et al. 2024).
I briefly wrote about the GreedyMini before in this post, but for the discussion
here, we can keep things simple: the minimizer scheme builds an explicit order
on k-mers, and simply returns the smallest k-mer according to this order.
To reduce the exponential blowup, it only builds an order on k-mers over the binary
\(\sigma=2\) alphabet, and extends this to \(\sigma=4\) by first comparing the
order of the k-mers consisting of the high bits, and breaking ties
lexicographically via the remaining low bits.1
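As a concrete sketch, assuming a precomputed table `rank2` that stores the priority of each binary k-mer (both the name and the encoding here are hypothetical, not GreedyMini's actual API), the comparison key for a \(\sigma=4\) k-mer could look like this:

```rust
/// Comparison key for a DNA k-mer (characters in 0..4) given a binary order:
/// compare the binary k-mer of high bits by its precomputed priority
/// `rank2[hi]` first, then break ties lexicographically on the low bits.
/// A sketch, not GreedyMini's actual code.
fn dna_key(kmer: &[u8], rank2: &[u32]) -> (u32, usize) {
    let (mut hi, mut lo) = (0usize, 0usize);
    for &c in kmer {
        // Split each 2-bit character into its high bit and low bit.
        hi = (hi << 1) | (c as usize >> 1);
        lo = (lo << 1) | (c as usize & 1);
    }
    (rank2[hi], lo)
}
```

The minimizer of a window is then simply the k-mer minimizing this key.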
Figure 1: Density of sampling/minimizer schemes for \(w=5\) and alphabet size \(\sigma=4\). Plots are generated using this script.
In the plot above (\(w=5\), \(\sigma=4\)), the mod-mini (blue) uses \(r=3\), and double decycling
(Pellow et al. 2023) (green) also uses the extended mod-minimizer.
Brown and teal show schemes introduced in this post (and my thesis), again also in combination
with the extended mod-mini.
The lower bound of (Kille et al. 2024) is shown in red.
The GreedyMini is shown in grey, and as can be seen, it is super close to
the lower bound for \(6\leq k\leq 11\), and indeed much better than the other schemes.
For \(k\leq 5\), GreedyMini is less close to optimal, and indeed, it remains unclear
how close to optimal one could get.
At the next ‘‘step’’, for \(k\geq 12\), the GreedyMini is not as close to optimal
as before, and other schemes actually are better.
To get a better feeling for just how close to optimal the GreedyMini can
be, let’s compare it to the truly optimal values found using ILP for
sufficiently small parameters where the ILP finishes (Kille et al. 2024):
Figure 2: Densities for alphabet size \(\sigma=2\) and \(w=3\). Black circles indicate optimal values found by ILP, and filled circles show where these optimal values match the red lower bound.
We see that the GreedyMini tracks the optimal values quite closely! Very
impressive :)
Moving up to \(w=4\), we only have a few points where the ILP finished, and we see
that at \(k=3\), the GreedyMini is somewhat worse than the optimal solution.
In all cases so far, the other schemes are not quite close to optimal. Instead, let’s look at
alphabet size \(\sigma=256\), where the other schemes have an easier job.
The lower bound and the density of the GreedyMini stay nearly the same, while
double decycling improves and mod-mini shifts left by using \(r=1\). The
sus-anchor and ABB+ schemes are replaced by a ‘threshold’ scheme, which looks for
a character \(<100\) followed by as many characters as possible \(\geq 100\), with
tie-breaking on ABB (anti-lexicographic) order (a sketch follows below).
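Here is a minimal sketch of that order as I read the description above (the exact implementation may well differ; lower keys are sampled first):

```rust
/// Priority of a k-mer over sigma=256 under the 'threshold' scheme:
/// prefer a first character < 100, then as many characters >= 100 as
/// possible, then break ties in ABB style (first character small, the
/// remaining characters large). A sketch, not the exact implementation.
fn threshold_key(kmer: &[u8]) -> (bool, usize, Vec<u8>) {
    // Prefer k-mers whose first character is below the threshold.
    let starts_high = kmer[0] >= 100;
    // Count the run of characters >= 100 after position 0; negate it so
    // that longer runs give smaller keys.
    let run = kmer[1..].iter().take_while(|&&c| c >= 100).count();
    // Anti-lexicographic tie-break: keep the first character as-is
    // (small is good) and invert the rest (large is good).
    let mut tie = vec![kmer[0]];
    tie.extend(kmer[1..].iter().map(|&c| 255 - c));
    (starts_high, usize::MAX - run, tie)
}
```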
Double decycling is as good as greedy for small \(k\leq 6\).
Threshold is consistently better for \(k\geq 12\) now.
Threshold also improves in the \(6\leq k \leq 10\) step, but greedy is clearly
much better still.
Let’s also look at \(w=10\).
The GreedyMini improves for \(4\leq k\leq 10\), and is near-optimal for \(k\) just below \(w=10\).
It’s unclear how close to optimal it is for \(4\leq k\leq 7\).
At \(k\geq 12\), GreedyMini is worse than the threshold scheme.
Some general takeaways:

- When \(\sigma>2\) and \(k\) is small (\(k\leq 5\) or so), the GreedyMini may be somewhat away from
  optimal, because reducing to the binary alphabet throws away too much information.
  TODO: For \(\sigma=4\) and small \(k\), we should directly search in the \(4^k\) space rather
  than the \(2^k\) one.
- When \(k\geq 12\) or so, the search space for GreedyMini becomes too large, and
  it loses to the threshold scheme, and thus is somewhat away from optimal.
From now on, we will use \(\sigma=256\) in plots to compare to the ‘‘ideal’’ case
of the non-greedy schemes where the alphabet is large.
Time to look into the generated schemes. For this, we’ll stick to \(\sigma=2\) to
avoid the exponential blowup in number of k-mers as much as possible, and for
slightly cleaner results (since \(\sigma=4\) does additional tiebreaking by the
remaining bits).
We’ll start small, and use this script to print the most frequently sampled
k-mers. While this order may not be exactly the same as the priority given to
them by the GreedyMini, it should be close enough and it will tell us a lot
about how the schemes (appear to) work.
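To make the setup concrete, here is a rough sketch of how such statistics can be gathered (this is not the actual script; `order` is an arbitrary k-mer priority, and ties go to the leftmost position):

```rust
use rand::Rng; // assumes the rand crate (0.8-style API)

/// Measure the density of the minimizer scheme induced by the priority
/// function `order` on a random binary string of length `n`.
fn density(order: impl Fn(&[u8]) -> u64, w: usize, k: usize, n: usize) -> f64 {
    let mut rng = rand::thread_rng();
    let text: Vec<u8> = (0..n).map(|_| rng.gen_range(0..2)).collect();
    let mut sampled = vec![false; n];
    // Slide over all windows of w consecutive k-mers (w + k - 1 characters).
    for i in 0..=n - (w + k - 1) {
        // Take the position of the smallest k-mer in the window;
        // min_by_key breaks ties towards the leftmost position.
        let p = (i..i + w).min_by_key(|&j| order(&text[j..j + k])).unwrap();
        sampled[p] = true;
    }
    // Density: the fraction of k-mer positions that is sampled.
    sampled.iter().filter(|&&b| b).count() as f64 / (n - k + 1) as f64
}
```

The actual script is invoked as follows: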
```sh
cargo run -r --example greedy -- -w 3 -k 3
```
Output: (note that when run, the command line output is colourized)
We see that this scheme has a density of 0.453 when run on a string of length
10M, and the lower bound on density is 0.429. The average distance between
consecutive samples is 2.207, while the upper bound on this is 2.333. (But note
that we observed in Figure 2 that the best \(\sigma=2\) scheme found by ILP has density
pretty much equal to GreedyMini.)
The most sampled k-mer is 011, with a count of 1249 thousand samples. (The last
digits are noisy and omitted.) Then, for each k-mer we show the distribution of
the distance from the previous sampled k-mer and to the next sampled k-mer. We see that after 011, we always have
a jump of at least 2, equally split between a jump of 2 and 3 steps.
After 111, on the other hand, we often only have a jump of size 1, which is less
than the average of 2.256, but also somewhat inevitable.
Lastly, we show the distribution of the position of the sampled k-mer.
We observe:

- Prefer sampling k-mers starting with 01.
- Avoid 01 anywhere else in the k-mer.
- Else, prefer ending in a 0, so that when a 1 is appended, we jump to a 01.
- Else, sample 111.

The high priority k-mers have a uniform position distribution, while the low
priority k-mers have a more skewed distribution. A toy order following these
rules is sketched below.
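This toy version is my reading of the output statistics, not GreedyMini's actual order; in particular, the relative priority of 111 versus the 01-containing k-mers is ambiguous from the rules above.

```rust
/// Toy priority over binary k-mers of length 3 mimicking the observed
/// rules; lower values are sampled first.
fn observed_priority(kmer: [u8; 3]) -> u32 {
    let starts_01 = kmer[0] == 0 && kmer[1] == 1;
    let has_01_later = kmer[1] == 0 && kmer[2] == 1;
    match (starts_01, has_01_later, kmer[2]) {
        (true, _, _) => 0,      // 010, 011: prefer starting with 01
        (false, false, 0) => 1, // 000, 100, 110: else, prefer ending in 0
        (false, false, _) => 2, // 111: else, sample 111
        (false, true, _) => 3,  // 001, 101: avoid 01 anywhere else
    }
}
```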
Next, let’s increase \(k\) to \(4\). Looking at Figure 2, the GreedyMini is very close
to the lower bound for \(\sigma=2\) here (below we instead show the simplified
large-\(\sigma\) bound).
Starting with 01 is good, but prefer 0001 over 0111.
We can explain the 1110 as follows: we don’t have a 01 anywhere, but we do
focus on 01, so as soon as a 1 is appended to the trailing 0, we jump
there with an optimal gap of 3. So this is like the mod-mini, where we ‘mod’
the position 3 of the last 0 by \(w=3\) to sample position 0.
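For reference, the mod-minimizer samples position \(x \bmod w\), where \(x\) is the position of the smallest \(t\)-mer in the window; a minimal sketch (with plain lexicographic \(t\)-mer comparison):

```rust
/// Position sampled by the mod-minimizer in a window of w k-mers,
/// i.e. w + k - 1 characters, using lexicographic t-mer comparison.
fn mod_minimizer_pos(window: &[u8], w: usize, k: usize, t: usize) -> usize {
    // Position of the smallest t-mer among all w + k - t t-mer positions.
    let x = (0..w + k - t).min_by_key(|&x| &window[x..x + t]).unwrap();
    // Sample the k-mer at position x mod w.
    x % w
}
```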
All k-mers starting with 01 have dist_from at least 2, because we never
sample a k-mer with 01 in the middle. Stronger: We nearly always have a 1 in the
second position!
We see a very strong mod-effect in the dist_to column: whenever the k-mer
ends in 01 (*), the successor is exactly \(w\) steps away. Thus, the position of
the 01 is 3, and position \(0=(3\bmod w)\) is sampled.
There are 6 such k-mers, exactly those ending in 01 but not ending in
0101. Thus, we anchor to the start of a 01 run.
The remaining frequent k-mers (+) start with 0111 or 0101 and have distance
at least 2 from the previously sampled k-mer.
Lastly (?), we choose k-mers that start with 01 and do not contain a second
occurrence of 01. These are far from the previous sample, but relatively
close to the next sample.
This time, we see a mod-effect in dist_from (*), where the 6 k-mers starting
with 010 are preferred (all 8 of them, apart from the two starting with
01010, which have two occurrences of 010).
At the low end, we see a mod-effect in dist_to (+) as well, because these
k-mers are only rarely sampled, and only at the very start of the window, so
there is a lot of implicit information to be used.
All (but 111000) of the most-sampled k-mers contain 01 either at the start
or at position 3. Again a strong mod-effect.
We see that we rarely have k-mers that are always \(w\) positions apart, unlike
what we had for \(w=3\) and \(k\geq 4\). Probably we need \(k\geq w+2\) to make that happen.
It’s all quite messy, but ending and starting in 01 seems to be preferred.
Nearly all sampled k-mers have a 0 in the middle, and most of those have
10 at position 1.
We see very few jumps of distance 1 and only slightly more at distance 2. In
fact, jumping a uniformly random distance in 2 to 5 already implies an average
distance of \((2+3+4+5)/4=3.5\), close to the real average distance of 3.6. Thus,
fully preventing size-1 jumps and avoiding size-2 jumps is sufficient here.
Most sampled k-mers start with 01, but not 0101, which already prevents
jumps of distance 1.
Also ending in 01 seems preferred, which creates a jump size of 3.
Let’s see if we can construct manual \(\sigma=2\) schemes matching the density of
GreedyMini for \(w=5\) for \(3\leq k\leq 11\). Currently the situation is as follows:
We see that the GreedyMini graph is mostly flat at \(8\leq k\leq 11\), so let’s
try to reproduce \(k=8\) first.
It turns out that GreedyMini generates perfectly optimal schemes for \((w,k)\) in
\((3, 4)\), \((4, 5)\), and \((5, 6)\). After that, they are just slightly away from optimal.
When \(2^k\) is too large, only the first \(k'<k\) characters can be used, and when
\(w\) is very large, it is possible to just fall back to the order for a \(w'<w\),
but applied to the larger windows. But here we will only consider sufficiently
small cases. ↩︎