In this post, we will look at the minimizer schemes generated by the greedy minimizer (Golan et al. 2024).

I briefly wrote about the GreedyMini before in this post, but for the discussion here, we can keep things simple: the minimizer scheme builds an explicit order on k-mers, and simply returns the smallest k-mer according to this order. To reduce the exponential blowup, it only builds an order on k-mers over the binary \(\sigma=2\) alphabet, and extends this to \(\sigma=4\) by first comparing the order of the k-mers consisting of the high bits, and breaking ties lexicographically via the remaining low bits.1

I integrated GreedyMini into my minimizers playground repository by simply copying the relevant data files.

This is mostly just incomplete notes and observations. Don’t expect a nice conclusion.

GreedyMini Results Link to heading

A first look Link to heading

Figure 1: Density of sampling/minimizer schemes for (w=4) and alphabet size (sigma=4). Plots are generated using this script.

Figure 1: Density of sampling/minimizer schemes for (w=4) and alphabet size (sigma=4). Plots are generated using this script.

In the plot above (\(w=5\), \(\sigma=4\)), the mod-mini (blue) uses \(r=3\), double decycling (Pellow et al. 2023) (green) also uses the extended mod-minimizer. Brown and teal show schemes introduced in this post (and my thesis), again also in combination with the extended mod-mini. The lower bound of (Kille et al. 2024) is shown in red.

The GreedyMini is shown in grey, and can be seen, it is super close to the lower bound for \(6\leq k\leq 11\), and indeed much better than the other schemes.

  • For \(k\leq 5\), GreedyMini is less close to optimal, and indeed, it remains unclear how close to optimal one could get.
  • At the next ‘‘step’’, for \(k\geq 12\), the GreedyMini is not as close to optimal as before, and other schemes actually are better.

Comparison with optimal ILP values Link to heading

To get a better feeling for just how close to optimal the GreedyMini can be, let’s compare it to the truly optimal values found using ILP for sufficiently small parameters where the ILP finishes (Kille et al. 2024):

Figure 2: Densities for alphabet size (sigma=2) and (w=3). Black circles indicate optimal values found by ILP, and filled circles show where these optimal values match the red lower bound.

Figure 2: Densities for alphabet size (sigma=2) and (w=3). Black circles indicate optimal values found by ILP, and filled circles show where these optimal values match the red lower bound.

  • We see that the GreedyMini tracks the optimal values quite closely! Very impressive :)

Moving up to \(w=4\), we only have a few points where the ILP finished, and we see that at \(k=3\), the GreedyMini is somewhat worse than the optimal solution.

Large alphabets Link to heading

In all cases so far, the other schemes are not quite close to optimal. Instead, let’s look at alphabet size \(\sigma=256\), where the other schemes have an easier job.

The lower bound and the density of the GreedyMini stay nearly the same, while double decycling improves and mod-mini shifts left by using \(r=1\). The sus-anchor and ABB+ schemes are replaced by a ’threshold’ scheme, that looks for a character \(<100\) followed by as many characters as possible \(\geq 100\), with tie-breaking on ABB (anti-lexicographic) order.

  • Double decycling is as good as greedy for small \(k\leq 6\).
  • Treshold is consistently better for \(k\geq 12\) now.
  • Treshold also improves in the \(6\leq k \leq 10\) step, but greedy is clearly much better still.

Let’s also look at \(w=10\).

  • The GreedyMini improves for \(4\leq k\leq 10\), and is near-optimal just below \(w=10\).
  • It’s unclear how close to optimal it is for \(4\leq k\leq 7\).
  • At \(k\geq 12\), GreedyMini is worse than the threshold scheme.

Some general takeaways:

  • When \(\sigma>2\) and \(k\) is small (\(k\leq 5\) or so), the GreedyMini may be somewhat away from optimal because reducing to binary alphabet throws away too much information.
    • TODO: For \(\sigma=4\) and small \(k\), we should directly search in the \(4^k\) space rather than \(2^k\).
  • When \(k\geq 12\) or so, the search space for GreedyMini becomes too large, and it loses to the threshold scheme, and thus is somewhat away from optimal.

From now on, we will use \(\sigma=256\) in plots to compare to the ‘‘ideal’’ case of the non-greedy schemes where the alphabet is large.

Analysing GreedyMini at \(w=3\) Link to heading

Time to look into the generated schemes. For this, we’ll stick to \(\sigma=2\) to avoid the exponential blowup in number of k-mers as much as possible, and for slightly cleaner results (since \(\sigma=4\) does additional tiebreaking by the remaining bits).

\(w=3\), \(k=3\) Link to heading

We’ll start small, and use this script to print the most frequently sampled k-mers. While this order may not be exactly the same as the priority given to them by the GreedyMini, it should be close enough and it will tell us a lot about how the schemes (appear to) work.

1
cargo run -r --example greedy -- -w 3 -k 3

Output: (note that running command line output is colourized)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
w: 3 k: 3
density  0.453   lb: 0.429
avg dist 2.207   ub: 2.333

               dist_from (%)       dist_to (%)       pos (%)
kmer  count  |    1   2   3   |    1   2   3   |    0   1   2
011    1249  |    0  50  50   |   25  25  50   |   33  33  33
010    1249  |    0  50  50   |    0  75  25   |   30  40  30
000     938  |   33  50  17   |   33  33  33   |   40  40  20
110     780  |   20  40  40   |    0  60  40   |   25  50  25
111     311  |  100   0   0   |   50   0  50   |  100   0   0

We see that this scheme has a density of 0.453 when run on a string of length 10M, and the lower bound on density is 0.429. The average distance between consecutive samples is 2.207, while the upper bound on this is 2.333. (But not that we observed in Figure 2 that the best \(\sigma=2\) scheme found by ILP has density pretty much equal to GreedyMini.) The most sampled kmer is 011, with a count of 1249 thousand samples. (The last digits are noisy and omitted.) Then, for each k-mer we show the distribution of the distance from the previous sampled k-mer and to the next sampled k-mer. We see that after 011, we always have a jump of at least 2, equally split between a jump of 2 and 3 steps. After 111, on the other hand, we often only have a jump of size 1, which is less than the average of 2.256, but also somewhat inevitable. Lastly, we show the distribution of the position of the sampled k-mer.

We observe:

  • Prefer sampling k-mers starting with 01.
  • Avoid 01 anywhere else in the k-mer.
  • Else, prefer ending in a 0, so that when a 1 is appended, we jump to a 01.
  • Else, sample 111.
  • The high priority k-mers have uniform distribution, while the low priority k-mers are more skewly distributed.

\(w=7\), \(k=3\) Link to heading

If we keep \(k=3\) but increase \(w\) to, it turns out the optimal scheme remains mostly the same:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
w: 7 k: 3
density  0.224   lb: 0.200
avg dist 4.473   ub: 5.000

                       dist_from (%)                       dist_to (%)                       pos (%)
kmer  count  |    1   2   3   4   5   6   7   |    1   2   3   4   5   6   7   |    0   1   2   3   4   5   6
011    1249  |    0   6  18  19  20  19  16   |    1   0  20  22  22  20  15   |   17  17  17  15  13  11   9
010     781  |    0  19  22  20  16  14   9   |    0  26  16  16  15  15  11   |   17  21  19  16  12  10   6
000     136  |   14  21  14  14  14  14   7   |   14  14  14  14  15  14  14   |   15  15  16  15  15  15   8
110      48  |    0   0   0   0  20  40  40   |    0  60  40   0   0   0   0   |    0   0   0   0  25  50  25
111      19  |  100   0   0   0   0   0   0   |   50   0   0   0   0   0  50   |  100   0   0   0   0   0   0
  • The counts are more skewly distributed, but the order is the same.

\(w=3\), \(k=4\) Link to heading

Instead, let’s increase \(k\) to \(4\). Looking at Figure 2, the greedy mini is very close to the lower bound for \(\sigma=2\) here (below we instead show the simplified large-\(\sigma\) bound).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
w: 3 k: 4
density  0.437   lb: 0.429
avg dist 2.286   ub: 2.333

                dist_from (%)       dist_to (%)       pos (%)
kmer   count  |    1   2   3   |    1   2   3   |    0   1   2
0001     625  |   12  37  50   |    0  88  12   |   13  50  37
1110     625  |   25  25  50   |   25   0  75   |   33  33  33
0100     625  |    0  50  50   |    0  38  62   |   36  36  27
0110     624  |    0  50  50   |   25   0  75   |   36  36  27
0101     624  |    0  50  50   |    0  88  12   |   36  36  27
0111     469  |    0  33  67   |   50  33  17   |   20  40  40
0000     312  |   25  25  50   |   50  25  25   |   33  33  33
1100     311  |  100   0   0   |    0   0 100   |  100   0   0
1111     155  |  100   0   0   |   50   0  50   |  100   0   0
  • Starting with 01 is good, but prefer 0001 over 0111.
  • We can explain the 1110 as follows: we don’t have a 01 anywhere, but we do focus on 01, so as soon as a 1 is appended to the trailing 0, we jump there with an optimal gap of 3. So this is like the mod-mini, where we ‘mod’ the position 3 of the last 0 by \(w=3\) to sample position 0.
  • All k-mers starting with 01 have dist_from at least 2, because we never sample a k-mer with 01 in the middle. Stronger: We nearly always have a 1 in the second position!

\(w=3\), \(k=5\) Link to heading

Now there is a big drop in the density in Figure 2, and indeed our scheme also goes from 0.437 to 0.410

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
w: 3 k: 5
density  0.410   lb: 0.400
avg dist 2.439   ub: 2.500

                 dist_from (%)       dist_to (%)       pos (%)
kmer    count  |    1   2   3   |    1   2   3   |    0   1   2
10001     313  |   38  38  25   |    0   0 100 * |   33  33  33
01111     313  |    0  25  75 + |   25  38  38   |   33  33  33
00001     312  |   12  38  50   |    0   0 100 * |   33  33  33
01101     312  |    0  25  75   |    0   0 100 * |   36  36  27
11101     312  |   37  25  38   |    0   0 100 * |   36  36  27
01001     312  |    0  25  75   |    0   0 100 * |   36  36  27
01011     312  |    0  25  75 + |    0  75  25   |   30  40  30
01010     312  |    0  25  75 + |    0  88  12   |   36  36  27
11001     311  |   50  25  25   |    0   0 100 * |   33  33  33
01000     273  |    0  14  86 ? |   43  43  14   |   17  33  50
01110     273  |    0  14  86 ? |   57  29  14   |   17  33  50
01100     273  |    0  14  86 ? |   43  29  29   |   17  33  50
11100     156  |   25  25  50   |   25  25  50   |   33  33  33
00000     155  |   25  25  50   |   50  25  25   |   33  33  33
11111     155  |  100   0   0   |   50   0  50   |  100   0   0
  • We see a very strong mod-effect in the dist_to column: whenever the k-mer ends in 01 (*), the successor is exactly \(w\) steps away. Thus, the position of the 01 is 3, and position \(0=(3\bmod w)\) is sampled.
    • There are 6 such k-mers, exactly those ending in 01 but not ending in 0101. Thus, we anchor to the start of a 01 run.
  • The remaining frequent k-mers (+) start with 0111 or 0101 and have distance at least 2 from the previously sampled k-mer.
  • Lastly (?), we choose k-mers that start with 01 and do not contain a second occurrence of 01. These are far from the previous sample, but relatively close to the next sample.

\(w=3\), \(k=6\) Link to heading

At \(k=6\), the density converges closer to the flat line ending at \(k=7\).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
w: 3 k: 6
density  0.404   lb: 0.400
avg dist 2.474   ub: 2.500

                  dist_from (%)       dist_to (%)       pos (%)
kmer     count  |    1   2   3   |    1   2   3   |    0   1   2
111000     156  |   25  37  38   |   12  25  62   |   36  36  27
011111     156  |    0  38  62   |   25  25  50   |   36  36  27
011000     156  |    0  37  63   |   12  25  62   |   36  36  27
010011     156  |    0   0 100 * |   25   0  75   |   33  33  33
010111     156  |    0   0 100 * |    0  63  37   |   27  36  36
010001     156  |    0   0 100 * |   38  50  12   |   33  33  33
000010     156  |    0  38  62   |    0  25  75   |   33  33  33
011011     156  |    0  37  63   |   25   0  75   |   36  36  27
011110     156  |    0  38  62   |    0  75  25   |   27  36  36
101010     156  |    0 100   0   |    0  25  75   |   44  44  11
111011     156  |    0  38  62   |   25   0  75   |   36  36  27
010000     156  |    0   0 100 * |    0  75  25   |   11  44  44
111010     156  |    0  37  63   |    0  25  75   |   33  33  33
000011     156  |    0  38  62   |   37   0  63   |   33  33  33
011010     156  |    0  38  62   |    0  25  75   |   36  36  27
110011     155  |    0  25  75   |   25   0  75   |   33  33  33
010010     155  |    0   0 100 * |    0  25  75   |   33  33  33
001010     155  |    0  87  13   |    0  25  75   |   33  44  22
010110     155  |    0   0 100 * |    0  75  25   |   33  33  33
110010     155  |    0  25  75   |    0  25  75   |   27  36  36
100010     137  |   43  57   0   |    0  14  86   |   60  40   0
011100     117  |    0  33  67   |   33  67   0   |    0  50  50
000111     117  |   33  33  33   |    0  50  50   |   40  40  20
000000     116  |   33  33  34   |   33  34  33   |   33  33  33
111110      97  |   20  40  40   |    0  60  40   |   25  50  25
000110      97  |   20  40  40   |    0  60  40   |   25  50  25
100111      39  |  100   0   0   |    0   0 100 + |  100   0   0
100110      39  |  100   0   0   |    0   0 100 + |  100   0   0
110000      38  |  100   0   0   |    0   0 100 + |  100   0   0
110110      38  |  100   0   0   |    0   0 100 + |  100   0   0
110111      38  |  100   0   0   |    0   0 100 + |  100   0   0
  • This time, we see a mod-effect in dist_from (*), where 6 k-mers starting with 010 are preferred (all 8 of them, apart from the two starting in 01010 which have two occurrences of 010).
  • At the low end, we see a mod-effect in dist_to (+) as well, because these k-mers are only rarely sampled, and only at the very start of the window, so there is a lot of implicit information to be used.
  • All (but 111000) of the most-sampled k-mers contain 01 either at the start or at position 3. Again a strong mod-effect.

\(w=3\), \(k=7\) Link to heading

We are now in a \(k\equiv 1\pmod w\) situation again, and the greedy scheme is probably optimal. (The ILP also finds a perfectly optimal scheme here.)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
w: 3 k: 7
density  0.401   lb: 0.400
avg dist 2.491   ub: 2.500

                   dist_from (%)       dist_to (%)       pos (%)
kmer      count  |    1   2   3   |    1   2   3   |    0   1   2
0001000      78  |    0  25  75   |   13   0  87   |   33  33  33
0001001      78  |    0  25  75   |    0   0 100 * |   36  36  27
1000000      78  |    0   0 100   |   25  38  37   |   33  33  33
0001111      78  |   25  63  12   |   13  37  50   |   36  36  27
1111101      78  |    0  38  62   |    0   0 100 * |   33  33  33
1101101      78  |    0   0 100 + |    0   0 100 * |   33  33  33
1111000      78  |    0   0 100   |   13  25  62   |   33  33  33
0111000      78  |    0  50  50   |   13  25  62   |   33  33  33
0111101      78  |    0  87  13   |    0   0 100 * |   44  44  11
1100011      78  |    0  75  25   |   13  50  37   |   25  50  25
1001100      78  |    0   0 100 + |   88   0  12   |   25  25  50
0101010      78  |    0  75  25   |    0  87  13   |   30  40  30
1101100      78  |    0   0 100 + |   87   0  13   |   25  25  50
0011101      78  |   50  13  37   |    0   0 100 * |   50  25  25
1000101      78  |    0   0 100   |   12  12  75   |   33  33  33
1001101      78  |    0   0 100 + |    0   0 100 * |   33  33  33
0001101      78  |   25  62  12   |    0   0 100 * |   36  36  27
1101000      78  |    0   0 100 + |   13   0  87   |   33  33  33
0000101      78  |   50  37  13   |   13  13  75   |   57  29  14
1101001      78  |    0   0 100 + |    0   0 100 * |   33  33  33
1001110      78  |    0   0 100 + |   50  50   0   |   27  36  36
1101010      78  |    0   0 100 + |    0  88  12   |   20  40  40
0111001      78  |    0  50  50   |    0   0 100 * |   40  40  20
1101110      78  |    0   0 100 + |   50  50   0   |   14  29  57
1101111      78  |    0   0 100 + |   25  25  50   |   33  33  33
1000001      78  |    0   0 100   |   25  25  50   |   33  33  33
1001000      78  |    0   0 100 + |   13   0  87   |   33  33  33
1000011      78  |    0   0 100   |   13  50  37   |   33  33  33
1111001      78  |    0   0 100   |    0   0 100 * |   33  33  33
1000010      78  |    0   0 100   |   50  50   0   |   27  36  36
1001111      77  |    0   0 100 + |   13  38  50   |   33  33  33
0101001      77  |    0  75  25   |    0   0 100 * |   40  40  20
0101000      77  |    0  75  25   |   13   0  87   |   33  33  33
1101011      77  |    0   0 100 + |   25  25  50   |   33  33  33
1001011      77  |    0   0 100 + |   13  38  50   |   33  33  33
0101101      77  |    0  75  25   |    0   0 100 * |   44  44  11
0011001      77  |   50  13  37   |    0   0 100 * |   33  33  33
1001010      77  |    0   0 100 + |    0  87  13   |   11  44  44
1001001      77  |    0   0 100 + |    0   0 100 * |   33  33  33
1011101      77  |   50  25  25   |    0   0 100 * |   33  33  33
1011001      77  |   50   0  50   |    0   0 100 * |   33  33  33
1011000      68  |   43   0  57   |   14  14  72   |   43  29  29
0011000      68  |   43  14  43   |   14  14  71   |   43  29  29
1111111      67  |   43  14  43   |   43  14  43   |   43  29  29
1011111      58  |   50  16  33   |   17  33  50   |   33  33  33
1010111      58  |   50   0  50   |   17  33  50   |   33  33  33
0000011      58  |   50  34  17   |   17  33  50   |   40  40  20
1000111      48  |   20   0  80   |   40  20  40   |   25  25  50
0101011      48  |    0  60  40   |   20  40  40   |   40  40  20
0101111      39  |   25  25  50   |   25  25  50   |   33  33  33
1000110      39  |    0   0 100 * |   50  50   0   |    0  33  67
0000001      39  |   25  25  50   |   25  25  50   |   34  33  33
0111111      29  |   33  67   0   |    0  34  66   |   50  50   0
0010111      29  |   34  66   0   |    0  33  67   |   50  50   0
1010001      19  |  100   0   0   |    0   0 100   |  100   0   0
0010001      19  |  100   0   0   |    0   0 100   |  100   0   0
0110001      19  |  100   0   0   |    0   0 100   |  100   0   0
1110001      19  |  100   0   0   |    0   0 100   |  100   0   0
0011111      19  |  100   0   0   |    0   0 100   |  100   0   0
0000111      19  |  100   0   0   |    0   0 100   |  100   0   0
0001011      19  |  100   0   0   |    0   0 100   |  100   0   0
0000000      19  |  100   0   0   |   50   0  50   |  100   0   0
  • Many k-mers ending in 01 but not 0101 have a mod-effect in dist_to using the 01 as an anchor.
  • Likewise, k-mers starting with ??01 but not 0101 (+) have a mod-effect in dist_from.
  • Most of the remaining k-mers seem to have a 10 at position 3 or 0.

Looking at fixed \(k=5\) Link to heading

Now let’s swap roles and fix \(k=5\).

\(k=5\), \(w=4\) Link to heading

Here, \(k\equiv 1\pmod w\) and so the scheme is pretty much optimal.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
w: 4 k: 5
density  0.336   lb: 0.333
avg dist 2.977   ub: 3.000

                   dist_from (%)           dist_to (%)           pos (%)
kmer    count  |    1   2   3   4   |    1   2   3   4   |    0   1   2   3
10001     313  |   13  13  37  37   |    6   0  38  56   |   23  26  26  26
11110     313  |   13  13  31  44   |    6  62  13  19   |   14  18  36  32
11011     312  |    0  25  25  50   |    0   0  69  31   |   23  27  27  23
00001     312  |   12  25  19  44   |    6   0  37  56   |   23  26  26  26
01001     312  |    0  31  50  19   |    6   0  37  56   |   27  27  27  20
01011     312  |    0  31  50  19   |    0   0  69  31   |   20  32  32  16
11001     311  |   13  25  25  38   |    6   0  37  56   |   25  25  25  25
11010     273  |    0  14  29  57   |    0  57  21  21   |   18  24  35  24
01010     234  |    0  33  67   0   |    0  50  25  25   |   27  36  36   0
11100     137  |   14   0  29  57   |   43  29  14  14   |   17  17  33  33
11000     137  |   14   0  29  57   |   43  28  14  14   |   17  17  33  33
10000     136  |   14   0  28  57   |   43  29  14  14   |   17  17  33  33
11111     136  |   14   0  29  57   |   43  29  14  14   |   16  17  33  34
10011      39  |  100   0   0   0   |    0   0   0 100   |  100   0   0   0
00011      38  |  100   0   0   0   |    0   0   0 100   |  100   0   0   0
00000      38  |  100   0   0   0   |   50   0   0  50   |  100   0   0   0
  • We see that we rarely have k-mers that are always \(w\) positions apart, unlike we had for \(w=3\) and \(k\geq 4\). Probably we need \(k\geq w+2\) to make that happen.
  • It’s all quite messy, but ending and starting in 01 seems to be preferred.
  • Nearly all sampled k-mers have a 0 in the middle, and most of those have 10 at position 1.
  • (I’m clearly not fully understanding this yet.)

\(k=5\), \(w=5\) Link to heading

Here we are still quite close to the lower bound.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
w: 5 k: 5
density  0.275   lb: 0.273
avg dist 3.631   ub: 3.667

                     dist_from (%)               dist_to (%)               pos (%)
kmer    count  |    1   2   3   4   5   |    1   2   3   4   5   |    0   1   2   3   4
11110     313  |    3  12   9  41  34   |    6   0  13  50  31   |   15  22  22  22  19
01000     312  |    0   0  47  25  28   |    0  12  25  38  25   |   18  25  25  18  14
01101     312  |    0   0  47  25  28   |    0   9  63   9  19   |   19  19  25  19  17
01001     312  |    0   0  47  25  28   |    0   6  62  13  19   |   20  20  23  20  18
01110     312  |    3   0  47  25  25   |    6   0  12  50  31   |   20  20  20  20  19
01100     311  |    0   0  47  25  28   |    0   0  28  47  25   |   19  24  24  18  15
00001     263  |    0  15  33  33  18   |    0   7  55  15  22   |   14  17  34  23  11
10101     214  |    0  36  36  18   9   |    0  14  45  14  27   |   20  24  32  16   8
00101     155  |    0  13  37  37  13   |    0  12  62  19   6   |    7  14  43  29   7
00111      87  |    0  22  22  33  22   |   11  33  22  22  11   |   12  25  25  25  13
11111      67  |   14  14   0  43  29   |   29  29  14  14  14   |   16  17  17  34  17
00000      48  |   20  20  20  40   0   |   20  21  20  20  20   |   25  25  25  25   0
11100      39  |  100   0   0   0   0   |    0   0   0   0 100   |  100   0   0   0   0
  • We see very few jumps of distance 1 and only slightly more at distance 2. In fact, jumping uniform distance in 2 to 5 already implies average distance 3.5, close to the real average distance of 3.6. Thus, fully preventing size-1 jumps and avoiding size-2 jumps is sufficient here.
  • Most sampled k-mers start with 01, but not 0101, which already prevents jumps of distance 1.
  • Also ending in 01 seems preferred, which creates a jump size of 3.

\(k=5\), \(w=6\) Link to heading

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
w: 6 k: 5
density  0.236   lb: 0.231
avg dist 4.240   ub: 4.333

                       dist_from (%)                   dist_to (%)                   pos (%)
kmer    count  |    1   2   3   4   5   6   |    1   2   3   4   5   6   |    0   1   2   3   4   5
10001 *   313  |    0   3  31  25  19  22   |    0   0   0  47  28  25   |   16  17  19  19  16  14
10111     312  |    0   0  31  27  19  23   |    0   6  12  41  22  19   |   16  17  18  18  16  15
00001 *   312  |    0   3   8  31  31  27   |    0   0   0  47  28  25   |   13  15  19  19  19  16
10011     312  |    0   3  31  25  19  22   |    0   0  14  42  25  19   |   17  17  17  17  16  16
11010  +  283  |    0  14  17  28  28  14   |    5   0  45   7  28  16   |   14  17  19  23  16  11
10010  +  282  |    0   3  31  28  21  17   |    0   0  48   9  28  15   |   13  18  20  24  16  10
10110  +  176  |    0   0  22  36  28  14   |    0  14  50  11  11  14   |   10  15  20  29  19   7
11110  +  171  |    0   9  14  37  26  14   |    0  14  51   8  12  14   |   10  15  15  30  20  10
10100      63  |    7   0  15  23  31  23   |    0  46  23  15   8   8   |    8   8  17  17  33  17
11000      43  |    0  22  22  33  22   0   |    0  22  22  22  23  11   |   12  25  25  25  12   0
10101      34  |   29  14  28  28   0   0   |    0  14   0  42   0  43   |   34  17  33  17   0   0
00000      29  |   16  16  16  17  34   0   |   16  17  17  17  17  16   |   20  20  20  20  20   0
11111      24  |   20  20  20  41   0   0   |   20   0  21  20  20  20   |   25  25  25  25   0   0
  • 1???? start with a 1
  • ?0001 jump right at least 4 (*)
  • 1??10 (+) start in 1, end in 10. The 10 avoids jumps of size 1. (??10? is rare.)

\(k=5\), \(w=7\) Link to heading

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
w: 7 k: 5
density  0.207   lb: 0.200
avg dist 4.842   ub: 5.000

                         dist_from (%)                       dist_to (%)                       pos (%)
kmer    count  |    1   2   3   4   5   6   7   |    1   2   3   4   5   6   7   |    0   1   2   3   4   5   6
10001     313  |    0   1  19  25  23  17  16   |    0   0   4  29  24  24  19   |   13  15  17  19  14  12  10
10111     312  |    0   0  20  25  23  17  16   |    0   4   4  29  26  20  17   |   14  15  16  17  14  12  11
00001     312  |    1   2   4  19  26  27  23   |    0   0   4  29  24  24  19   |   14  14  14  14  14  14  14
10011     312  |    0   1  19  25  23  17  16   |    0   0   5  29  27  23  17   |   15  15  16  16  14  13  12
10110     265  |    0   0  22  18  27  18  15   |    0   0  43  10  22  16   8   |   10  13  17  17  19  14  10
10010     228  |    0   1  21  17  31  17  13   |    0   4  42   7  26  13   8   |    8  12  19  16  22  14   8
11110     141  |    0   7  17  23  29  16   9   |    0   0  36  14  31   7  12   |    9  12  22  17  23  12   6
01010      97  |    0  20  25  15  20  15   5   |    2  10  25   5  30  10  17   |   14  19  23  14  19   9   2
00000      34  |    7   0   7  22  29  21  14   |   14  14  22  22  14   7   7   |    8   8  15  23  23  15   8
10100      29  |    8   0  16  17  25  34   0   |    0  25  25  16  17   8   8   |   10  10  20  20  20  20   0
11111      12  |   20  20  20  40   0   0   0   |   20   0   0  20  20  20  20   |   25  25  25  25   0   0   0
11010       4  |    0   0   0   0 100   0   0   |    0   0 100   0   0   0   0   |    0   0   0   0 100   0   0
  • Similar to the \(k=5\), \(w=6\) scheme:
    • 11010 moves to the bottom,
    • 01010 instead of 10101.

\(k=5\), \(w=8\) Link to heading

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
w: 8 k: 5
density  0.185   lb: 0.176
avg dist 5.410   ub: 5.667

                           dist_from (%)                           dist_to (%)                           pos (%)
kmer    count  |    1   2   3   4   5   6   7   8   |    1   2   3   4   5   6   7   8   |    0   1   2   3   4   5   6   7
11110     313  |    0   1   2  11  23  24  22  18   |    0   0   0  27  19  22  18  14   |   13  13  13  13  13  12  12  12
01000 *   312  |    0   2  11  24  18  18  14  13   |    0   1   1  27  18  21  18  14   |   13  14  14  14  13  12  11  10
01110 *   308  |    0   0  11  24  19  19  14  12   |    0   0   0  26  19  22  18  14   |   12  13  14  15  13  12  10   9
01100 *   302  |    0   0  12  24  20  19  14  11   |    0   0   1  24  19  22  19  14   |   11  13  15  17  14  12  10   7
01001  +  238  |    0   2  14  20  16  22  14  11   |    0   0  38   5  20  18   9  10   |    9  11  15  17  13  16  11   8
01101  +  183  |    0   0  15  21  19  23  13   9   |    0   0  37   5  23  16  11   8   |    7  11  16  18  14  17  11   6
01010  +   80  |    0   6  21  18  24  12  12   6   |    0  18   0  32   6  27   6  11   |    8  10  19  16  21  10  10   5
00001      71  |    0   3   7  21  15  29  15   9   |    0   0  43   7  29  10   7   3   |    3   7  10  20  13  27  13   7
01111      30  |    0   0  12  20  24  24  12   8   |    8   8  16  20  20  16   8   4   |    4   8  16  21  21  17   8   4
00000       5  |   19  20  20  40   0   0   0   0   |   19   0   0   0  21  19  21  20   |   25  26  25  25   0   0   0   0
11111       2  |  100   0   0   0   0   0   0   0   |   50   0   0   0   0   0   0  50   |  100   0   0   0   0   0   0   0
  • Similar to the \(k=5\), \(w=5\) scheme.
  • Prefer starting with 01, and no further 01 (*).
  • Then start with 01, and the next 01 as far as possible away (+).
  • 11110 is the lowest ranked k-mer, and usually has distance at least 4 both to the previous and next sample.

\(k=5\), \(w=12\) Link to heading

Here the density is relatively somewhat farther away from optimal, which may be due to the lower bound not being tight.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
w: 12 k: 5
density  0.130   lb: 0.120
avg dist 7.678   ub: 8.333

                                   dist_from (%)                                           dist_to (%)                                           pos (%)
kmer    count  |    1   2   3   4   5   6   7   8   9  10  11  12   |    1   2   3   4   5   6   7   8   9  10  11  12   |    0   1   2   3   4   5   6   7   8   9  10  11
10100     312  |    0   1   0   8  11  11  12  13  12  12  11  10   |    1   0   3   3  11  12  12  13  13  13  11  10   |    9   9   9   9   9   9   8   8   8   8   7   7
11000     309  |    0   0   5   8  10  12  12  12  12  11  10   9   |    0   3   3   3  10  11  12  12  13  12  11  10   |    8   9   9   9  10   9   9   9   8   7   7   6
11001     278  |    0   0   6   9  10  12  12  12  12  11  10   8   |    0   2   1  14  10  11  11  12  12  11   9   8   |    7   8   9  10  10  10   9   9   9   8   6   5
10111     191  |    0   2   0   8   8  10  12  12  13  14  12   9   |    0   0  13  15  10  12  11  10  10   8   6   5   |    4   6   7   8   9  10  10  10  10  11   9   7
00001      76  |    0  11  15  10  13  13  10   7   8   6   4   3   |    0   3   3   7   8   9  12  11  13  13  13  10   |   10  13  13  12  11  10   8   6   6   4   3   2
00111      70  |    0   9   9   7   8  10  10  10  11  11   8   6   |    0   0   9  12  10  12  11  10  10   9   9   6   |    5   7   8   8   9   9  10  10  11  11   8   6
01101      41  |    0   0  12   6  12   9  15   8  13   8  11   5   |    0  12   3  16   8  10  12   9  10   9   6   4   |    4   6   9   9  11  11  11   8  11   7   8   5
01000      10  |   15   0  20  25   0  11  13   0   6   8   0   1   |    0   1   4  11   4   6  13   6  15  17  10  13   |   14  12  18  16   6  10   8   4   5   4   1   1
10101       8  |    0   5   2   7  12   5  11  11  15   9  15   9   |    0  17   6  21   5  18   4  14   4   7   2   3   |    2   3   4   6  10   6  12   9  14  11  14   8
11110       1  |    0   0   0   0   0   0   0  12  13  26  31  18   |    0  26  37  12  26   0   0   0   0   0   0   0   |    0   0   0   0   0   0   0  15  15  28  28  14
00100       0  |    0  36  64   0   0   0   0   0   0   0   0   0   |    0   0  16   0   0   0   0   0   0  32  37  16   |   19  42  39   0   0   0   0   0   0   0   0   0
00101       0  |    0   0   0   0   0   0   0   0  29   0  34  37   |    0  53   9  38   0   0   0   0   0   0   0   0   |    0   0   0   0   0   0   0   0  23  11  43  23
11111       0  |   35  65   0   0   0   0   0   0   0   0   0   0   |   35   0   0   0   0   0   0   0   0   0  31  34   |   52  48   0   0   0   0   0   0   0   0   0   0
00000       0  |   32  68   0   0   0   0   0   0   0   0   0   0   |   32   0   0   1   0   0   0   0   0   0  34  35   |   49  51   0   0   0   0   0   0   0   0   0   0

Investigating \(w=5\) Link to heading

Let’s see if we can construct manual \(\sigma=2\) schemes matching the density of GreedyMini for \(w=5\) for \(3\leq k\leq 11\). Currently the situation is as follows:

We see that the GreedyMini graph is mostly flat at \(8\leq k\leq 11\), so let’s try to reproduce \(k=8\) first.

\(w=5\), \(k=8\) Link to heading

I have added some columns that indicate the average distance before (from) and after (to) each k-mer, and sorted them by the average of those two.

Click to show.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
w: 5 k: 8
density  0.253   lb: 0.250
avg dist 3.952   ub: 4.000

                        dist_from (%)               dist_to (%)       avg-from   avg-to     avg            pos (%)
kmer       count  |    1   2   3   4   5   |    1   2   3   4   5   |                             |    0   1   2   3   4
00110001     117  |    0   0   0   0 100   |    0   0  12  12  75   |  5.000  |  4.628  |  4.814  |   17  19  21  21  21
00101001     117  |    0   0   0   0 100   |    0   0  12  13  75   |  5.000  |  4.626  |  4.813  |   16  19  22  22  22
00100001     116  |    0   0   0   0 100   |    0   0  13  12  75   |  5.000  |  4.625  |  4.813  |   16  19  22  22  22
00111001     117  |    0   0   0   0 100   |    0   0  13  13  75   |  5.000  |  4.622  |  4.811  |   16  19  22  22  22
00111101     116  |    0   0   0   0 100   |    0   0  25  13  62   |  5.000  |  4.373  |  4.686  |   14  17  23  23  23
00110101     117  |    0   0   0   0 100   |    0   0  35  12  53   |  5.000  |  4.184  |  4.592  |   20  20  20  20  20
00101011     117  |    0   0   0   0 100   |    0   0  13  72  16   |  5.000  |  4.030  |  4.515  |   20  20  20  20  20
00111011     117  |    0   0   0   0 100   |    0   0  13  72  15   |  5.000  |  4.030  |  4.515  |   20  20  20  20  20
00011001     116  |    0   0   0  88  12   |    0   0  12  13  75   |  4.125  |  4.626  |  4.375  |   17  20  23  23  17
00010001     117  |    0   0   0  87  13   |    0   0  13  12  75   |  4.127  |  4.623  |  4.375  |   21  21  21  21  17
10011001     117  |    0   0   0  91   9   |    0   0  12  12  75   |  4.095  |  4.626  |  4.360  |   18  19  22  22  18
10010001     117  |    0   0   0  91   9   |    0   0  12  13  75   |  4.093  |  4.625  |  4.359  |   20  20  20  20  19
00110111     117  |    0   0   0   0 100   |    0   0  62  19  19   |  5.000  |  3.565  |  4.282  |   19  20  20  20  20
00000001     117  |    3   6  25  31  34   |    0   0  13  12  75   |  3.877  |  4.623  |  4.250  |   20  20  20  20  20
00111111     117  |    0   0   0   0 100   |    0  25  31  22  22   |  5.000  |  3.402  |  4.201  |   20  20  20  20  20
00101101     117  |    0   0   0   0 100   |    0  44  12  13  31   |  5.000  |  3.317  |  4.158  |   20  20  20  20  20
10110001     113  |    0  23  22  32  23   |    0   0  10  13  77   |  3.550  |  4.679  |  4.115  |   22  26  22  19  11
10111001     113  |    0  23  23  32  23   |    0   0  10  13  77   |  3.549  |  4.675  |  4.112  |   23  26  23  19   9
00101111     117  |    0   0   0   0 100   |    0  38  25  19  19   |  5.000  |  3.181  |  4.091  |   19  20  20  20  20
00110110     117  |    0   0   0   0 100   |    0  22  53  13  12   |  5.000  |  3.155  |  4.078  |   20  20  20  20  20
00111010     117  |    0   0   0   0 100   |    0  44  12  31  12   |  5.000  |  3.123  |  4.062  |   20  20  20  20  20
10100001     114  |    0  29  23  26  23   |    0   0  10  13  77   |  3.423  |  4.678  |  4.050  |   21  25  21  18  14
10101001     114  |    0  29  23  26  23   |    0   0  10  13  78   |  3.420  |  4.679  |  4.050  |   21  25  21  18  14
10111101     110  |    0  20  23  33  23   |    0   0  20  13  67   |  3.597  |  4.468  |  4.033  |   20  24  24  20  12
10000001     113  |    0  29  22  32  16   |    0   0  10  13  78   |  3.355  |  4.679  |  4.017  |   21  25  21  18  14
00001001     117  |    0   0  75  13  12   |    0   0  12  13  75   |  3.373  |  4.627  |  4.000  |   20  21  21  19  19
10001001     117  |    0   0  78  13   9   |    0   0  12  12  75   |  3.315  |  4.626  |  3.971  |   21  24  24  18  15
11001001     117  |    0   0  78  12   9   |    0   0  12  12  75   |  3.312  |  4.627  |  3.969  |   20  20  20  20  19
11100001     113  |    0  35  23  26  16   |    0   0  10  13  77   |  3.229  |  4.678  |  3.954  |   21  25  21  18  14
11000001     113  |    0  36  26  19  19   |    0   0  10  13  77   |  3.227  |  4.678  |  3.953  |   21  25  21  18  14
00101010     117  |    0   0   0   0 100   |    0  56  12  19  12   |  5.000  |  2.876  |  3.938  |   20  20  20  20  20
00101110     117  |    0   0   0   0 100   |    0  56  13  19  12   |  5.000  |  2.876  |  3.938  |   20  20  20  20  20
11011001     113  |    0  35  23  29  13   |    0   0  10  13  77   |  3.195  |  4.679  |  3.937  |   21  25  21  18  16
11010001     113  |    0  36  23  29  13   |    0   0  10  13  78   |  3.191  |  4.678  |  3.935  |   21  25  21  18  16
10110101     109  |    0  20  23  33  23   |    0   0  30  13  57   |  3.601  |  4.264  |  3.933  |   20  24  25  20  10
11110001     113  |    0  35  29  19  16   |    0   0  10  13  77   |  3.160  |  4.677  |  3.919  |   22  25  22  18  13
01001001     116  |    0   0  87   6   6   |    0   0  12  13  75   |  3.187  |  4.628  |  3.907  |   23  26  26  13  11
11111001     113  |    0  42  23  19  16   |    0   0  10  13  77   |  3.095  |  4.676  |  3.885  |   21  25  21  18  16
11101001     113  |    0  36  39  13  13   |    0   0  10  13  77   |  3.028  |  4.677  |  3.852  |   20  23  20  20  17
10111011     113  |    0  23  23  32  23   |    0   0  10  74  16   |  3.549  |  4.061  |  3.805  |   23  26  23  19   9
00111110     117  |    0   0   0   0 100   |   31  25  13  19  13   |  5.000  |  2.560  |  3.780  |   20  20  20  20  20
10101011     113  |    0  29  23  26  23   |    0   0  10  74  16   |  3.423  |  4.066  |  3.745  |   24  24  21  17  14
00111100     116  |    0   0   0   0 100   |   37  22  13  12  16   |  5.000  |  2.469  |  3.734  |   19  20  20  20  20
00111000     117  |    0   0   0   0 100   |   37  22  12  16  12   |  5.000  |  2.440  |  3.720  |   20  20  20  20  20
11111101     109  |    7  37  23  20  13   |    0   0  20  13  67   |  2.960  |  4.468  |  3.714  |   19  23  23  19  15
00110000     116  |    0   0   0   0 100   |   37  22  16  13  12   |  5.000  |  2.408  |  3.704  |   18  21  21  21  21
00110100     117  |    0   0   0   0 100   |   38  22  16  12  13   |  5.000  |  2.406  |  3.703  |   20  20  20  20  20
00101100     116  |    0   0   0   0 100   |   38  25  12  13  13   |  5.000  |  2.373  |  3.686  |   20  20  20  20  20
00101000     117  |    0   0   0   0 100   |   38  25  12  12  13   |  5.000  |  2.372  |  3.686  |   20  20  20  20  20
00100000     117  |    0   0   0   0 100   |   38  25  12  13  12   |  5.000  |  2.371  |  3.685  |   16  19  22  22  22
10110111     102  |    0  14  25  36  25   |    0   0  61  21  18   |  3.710  |  3.571  |  3.641  |   15  20  29  24  12
01110001     102  |   53   0  14  14  18   |    0   0   7   7  86   |  2.431  |  4.784  |  3.607  |   32  18  21  16  13
01010001     102  |   50   0  21  14  14   |    0   0   7   7  86   |  2.427  |  4.786  |  3.607  |   32  18  21  16  13
01111001     102  |   54   0  14  14  18   |    0   0   7   7  86   |  2.429  |  4.784  |  3.606  |   31  18  21  15  15
01100001     101  |   54   0  14  14  18   |    0   0   7   7  86   |  2.427  |  4.785  |  3.606  |   32  18  21  16  13
01011001     102  |   43   0  36  14   7   |    0   0   7   7  86   |  2.425  |  4.786  |  3.606  |   32  18  21  16  13
01000001     102  |   50   0  21  14  14   |    0   0   7   7  86   |  2.423  |  4.787  |  3.605  |   32  18  21  16  13
01101001     102  |   54   0  14  14  18   |    0   0   7   7  86   |  2.423  |  4.786  |  3.604  |   32  18  21  16  13
01111101      95  |   50   0  15  15  19   |    0   0  15   8  77   |  2.533  |  4.619  |  3.576  |   29  18  23  18  12
10111111      98  |    0  15  22  37  26   |    0  26  30  26  19   |  3.738  |  3.370  |  3.554  |   16  21  24  26  13
11101011     113  |    0  35  39  13  13   |    0   0  10  74  16   |  3.032  |  4.066  |  3.549  |   23  23  19  19  16
10111010      84  |    0  13  17  43  26   |    0  35  13  44   9   |  3.827  |  3.268  |  3.548  |    8  24  24  32  12
10110110      91  |    0  12  24  40  24   |    0  16  56  16  12   |  3.760  |  3.239  |  3.500  |   13  20  30  27  10
10111100      51  |    0   7  14  36  43   |   21  29  21  14  14   |  4.142  |  2.717  |  3.429  |    8  15  23  31  23
10111000      51  |    0   7  14  36  43   |   22  29  21  21   7   |  4.146  |  2.637  |  3.392  |    8  15  23  31  23
10110000      51  |    0   7  14  36  43   |   21  29  29  14   7   |  4.142  |  2.578  |  3.360  |    8  15  23  31  23
10110100      51  |    0   7  15  36  43   |   21  29  29  14   7   |  4.143  |  2.568  |  3.356  |    8  15  23  31  23
10111110      51  |    0   7  14  36  43   |   21  36  21  14   7   |  4.141  |  2.501  |  3.321  |    8  15  23  31  23
10101000      36  |    0  20  20  20  40   |   20  30  20  20  10   |  3.802  |  2.708  |  3.255  |   11  22  22  22  22
10100000      36  |    0  20  20  20  40   |   20  30  20  20  10   |  3.806  |  2.694  |  3.250  |   11  22  22  22  22
10101010      65  |    0  33  22  22  22   |    0  44  11  33  11   |  3.332  |  3.114  |  3.223  |   15  30  20  20  15
11111110      36  |    0  20  20  30  30   |   20  30  20  20  10   |  3.708  |  2.698  |  3.203  |   11  22  22  22  22
11111000      18  |    0  60  40   0   0   |    0   0  20  60  20   |  2.402  |  4.002  |  3.202  |   25  50  25   0   0
10000000      32  |    0  23  22  33  22   |   22  22  22  22  11   |  3.547  |  2.778  |  3.162  |   13  25  25  25  12
11111111      11  |   34  66   0   0   0   |   34   0   0  33  33   |  1.663  |  3.323  |  2.493  |   50  50   0   0   0
00000000       7  |  100   0   0   0   0   |   49   0   0   0  51   |  1.000  |  3.033  |  2.016  |  100   0   0   0   0
  • We clearly see that k-mers start and end with 001, as an anchor for modding.
  • Both starting and ending in 001 is ideal. Otherwise max the distance between them.
  • By extension, avoid starting/ending with 001001 if possible. Never start with 001001.
  • Fall back to starting and/or ending in 101. Never end in 101101.

What about \(k = w+1\)? Link to heading

It turns out that GreedyMini generates perfectly optimal schemes for \((w,k)\) in \((3, 4)\), \((4, 5)\), and \((5, 6)\). Then after, they are just slightly away from optimal.

Here is the \((5, 6)\) scheme:

w: 5 k: 6 density 0.273 lb: 0.273 avg dist 3.658 ub: 3.667

                  dist_from (%)               dist_to (%)       avg-from   avg-to     avg            pos (%)

kmer count | 1 2 3 4 5 | 1 2 3 4 5 | | 0 1 2 3 4 100011 156 | 0 16 6 50 28 | 0 0 6 56 38 | 3.904 | 4.314 | 4.109 | 17 23 23 23 13 100001 156 | 3 16 6 50 25 | 9 0 0 25 66 | 3.783 | 4.374 | 4.078 | 18 23 23 23 14 000001 156 | 6 13 22 19 41 | 9 0 0 25 66 | 3.747 | 4.379 | 4.063 | 17 21 21 21 21 100111 156 | 0 9 12 50 28 | 0 10 34 25 31 | 3.972 | 3.778 | 3.875 | 19 21 21 21 19 100110 156 | 0 9 13 50 28 | 0 0 41 47 12 | 3.970 | 3.720 | 3.845 | 12 17 28 28 16 100010 156 | 0 16 6 50 28 | 0 0 37 50 13 | 3.905 | 3.752 | 3.828 | 16 18 24 24 19 010111 156 | 0 3 47 22 28 | 0 9 34 25 31 | 3.754 | 3.782 | 3.768 | 19 20 20 20 20 010110 155 | 0 3 47 22 28 | 0 0 41 47 12 | 3.753 | 3.719 | 3.736 | 17 21 28 17 17 010010 155 | 3 3 47 22 25 | 0 0 37 50 13 | 3.625 | 3.751 | 3.688 | 18 20 23 20 20 110111 156 | 0 6 50 25 19 | 0 9 34 25 31 | 3.563 | 3.787 | 3.675 | 19 24 28 17 12 110110 156 | 0 6 50 25 19 | 0 0 41 47 12 | 3.564 | 3.718 | 3.641 | 21 21 21 18 18 110010 155 | 3 3 50 25 19 | 0 0 37 50 12 | 3.531 | 3.751 | 3.641 | 16 19 25 22 19 101010 141 | 3 17 10 48 21 | 0 21 31 34 14 | 3.654 | 3.412 | 3.533 | 11 23 27 27 11 111110 156 | 3 3 44 28 22 | 13 0 37 37 12 | 3.622 | 3.372 | 3.497 | 17 20 27 20 17 111101 63 | 23 0 46 15 15 | 0 15 23 23 39 | 2.997 | 3.855 | 3.426 | 23 23 31 15 8 011101 73 | 0 60 13 27 0 | 0 13 13 40 33 | 2.666 | 3.935 | 3.300 | 23 46 15 15 0 111000 39 | 12 0 37 25 25 | 0 50 25 13 13 | 3.503 | 2.879 | 3.191 | 14 14 28 29 14 111001 14 | 0 0 0 33 67 | 34 66 0 0 0 | 4.672 | 1.664 | 3.168 | 0 0 0 50 50 110100 29 | 0 0 50 17 33 | 17 17 66 0 0 | 3.836 | 2.497 | 3.167 | 0 0 50 25 25 110101 14 | 0 0 0 33 67 | 34 66 0 0 0 | 4.667 | 1.658 | 3.163 | 0 0 0 50 50 101000 92 | 5 5 10 63 16 | 0 69 16 11 5 | 3.790 | 2.524 | 3.157 | 6 12 19 50 13 111100 68 | 7 0 64 14 14 | 7 7 71 7 7 | 3.285 | 3.008 | 3.147 | 9 9 54 18 9 101001 24 | 0 0 0 60 40 | 20 80 0 0 0 | 4.397 | 1.800 | 3.098 | 0 0 0 67 33 100000 58 | 8 16 0 50 25 | 25 34 17 17 8 | 3.675 | 2.497 | 3.086 | 9 18 18 37 18 111111 33 | 14 0 43 14 29 | 29 15 29 14 14 | 3.443 | 2.705 | 3.074 | 17 17 33 17 17 000011 19 | 100 0 0 0 0 | 0 0 0 0 100 | 1.000 | 5.000 | 3.000 | 100 0 0 0 0 110000 9 | 0 0 0 0 100 | 100 0 0 0 0 | 5.000 | 1.000 | 3.000 | 0 0 0 0 100 000010 9 | 100 0 0 0 0 | 0 0 0 0 100 | 1.000 | 5.000 | 3.000 | 100 0 0 0 0 000000 9 | 100 0 0 0 0 | 50 0 0 0 50 | 1.000 | 3.000 | 2.000 | 100 0 0 0 0


  1. When \(2^k\) is too large, only the first \(k’<k\) characters can be used, and when \(w\) is very large, it is possible to just fall back to the order for a \(w’<w\), but applied to the larger windows. But here we will only consider sufficiently small cases. ↩︎