Near-optimal sampling schemes

Ragnar Groot Koerkamp

2025-02-27 Thu 00:00

Created: 2025-04-23 Wed 15:27

1. Warming up: A cute prolblem

  • Given a string, choose one character.
    • CABAACBD
  • Given a rotation, choose one character.
    • ACBDCABA
  • Can we always choose the same character?
  • Yes: e.g. the smallest rotation (bd-anchor):
    • CABAACBD
    • ACBDCABA

1.1. This talk: what if one character is hidden?

  • Given a string (length \(w\)), choose one character.
    • CABAACBD​​X
  • Given a rotation (of the hidden \(w+1\) string), choose one character.
    • ACBDXCABA
  • Can we always choose the same character?
  • Maybe?
    • CABAACBD
    • ACBDXCAB 🤔

1.2. The answer is no!

C​ABAACBDX rotations:

  • C​ABAACBD........
  • .ABAACBDX.......
  • ..BAACBDXC......
  • ...AACBDXCA.....
  • ....ACBDXCAB.... <— the A is not here
  • .....CBDXCABA...
  • ......BDXCABAA..
  • .......DXCABAAC.
  • ........XCABAACB

In the \(w+1\) rotations, we need at least 2 samples.

2. Minimizer schemes

\[\newcommand{\order}{\mathcal{O}}\]

  • Minimizer scheme: Given a window of \(w\) k-mers, pick the (leftmost) smallest one
    • according to some order \(\order_k\)
  • \(k=1\), \(w=5\):
    • CABCA
  • \(k=2\), \(w=5\):
    • CABCAC.....
    • .ABCACC....
    • ..BCACCX...
    • ...CACCXY..
    • ....ACCXYZ.
    • .....CCXYZX
  • Density: 3/10
    • CABCACCXYZX

2.1. The situation, 1 year ago

1-before.svg

2.2. The mod-minimizer

2-mod.svg

2.3. A near-tight lower bound

3-lb.svg

2.4. The current picture

4-full.svg

2.5. Greedymini

greedymini.png

2.6. Minimizer density lower bound

  • Density of minimizer scheme is \(\geq 1/\sigma^k\):

    sample exactly every AAA k-mer, and nothing else.

  • \(k=1\): density at least \(1/\sigma = 1/4\).

3. Sampling schemes: more general

  • Any function \(f: \Sigma^{w+k-1} \to \{0, \dots, w-1\}\)
  • We fix \(k=1\) from now: \(f: \Sigma^w\to \{0, \dots, w-1\}\)

3.1. Bidirectional anchors

  • Pick the start of the smallest rotation
    • EADCAE......
    • .ADCAEB.....
    • ..DCAEBE....
    • ...CAEBEC...
    • ....AEBECD..
    • .....EBECDC.
    • ......BECDCD

3.2. Limitations of bd-anchors

  • Lexicographic is bad:
    • AAAABCD...
    • .AAABCDE..
    • ..AABCDEF.
    • ...ABCDEFG
  • Comparing rotations is unstable:
    • AABACD..
    • .ABACDA.
    • ..BACDAE
  • Avoid last \(r\) positions.

3.3. Bd-anchor density for \(k=1\)

10-bd-anchor.svg

4. Smallest-unique-substring anchors

  • Idea: instead of smallest rotation: smallest suffix.
  • What about CABA: is ABA or A smaller?
    • We choose ABA smaller for stability.
  • AB is the smallest unique substring.
  • Stable:
    • AABACD..
    • .ABACDA.
    • ..BACDAE

4.1. Sus-anchor density

11-sus.svg

4.2. ABB order

  • AAAA is BAD:
    • small strings overlap
    • small strings cluster
  • We want the opposite!
  • ABB order:

    A followed by many non-A is smallest: ABBBBBBBBB

    • no overlap
    • no clustering

4.3. Sus-anchor, ABB order

12-abb.svg

4.4. Anti-lex

  • Anti-lexicographic order:

    A small, followed by largest possible suffix: AZZZZZ is minimal

    • no overlap
    • no clustering

4.5. Sus-anchor, anti-lex order

13-asus.svg

5. Understanding the lower bound

  • To reach lower bound: exactly 2 samples in every \(w+1\) cycle.

lower-bound.svg

5.1. Failure mode

0010101 cycle:

  • 001010......
  • .010101.....
  • ..101010....
  • ...010100...
  • ....101001..
  • .....010010.
  • ......100101
  • The 01010 sus is not overlap free
    • Just like how AAA is not overlap free

Goal: find two non-overlapping substrings.

5.2. Can we design a perfectly optimal scheme?

  • Goal:

    For every \(w+1\) window, find two non-overlapping small strings.

  • Instead of 011...11, search 00...0011...11
    • Also non-overlapping, and more signal.
    • Still not optimal.
  • Tried many things. No general solution found yet.

6. Thanks to my co-authors!

  • Giulio Ermanno Pibiri
  • Bryce Kille
  • Daniel Liu
  • Igor Martayan

Slides: curiouscoding.nl/slides/minimizers.html

Blog curiouscoding.nl/slides/minimizers