Near-optimal sampling schemes

1 Warming up: A cute prolblem

  • Given a string, choose one character.
    • CABAACBD
  • Given a rotation, choose one character.
    • ACBDCABA
  • Can we always choose the same character?
  • Yes: e.g. the smallest rotation (bd-anchor):
    • CABAACBD
    • ACBDCABA

1.1 This talk: what if one character is hidden?

  • Given a string (length \(w\)), choose one character.
    • CABAACBD​​X
  • Given a rotation (of the hidden \(w+1\) string), choose one character.
    • ACBDXCABA
  • Can we always choose the same character?
  • Maybe?
    • CABAACBD
    • ACBDXCAB 🤔

1.2 The answer is no!

C​ABAACBDX rotations:

  • C​ABAACBD........
  • .ABAACBDX.......
  • ..BAACBDXC......
  • ...AACBDXCA.....
  • ....ACBDXCAB.... <— the A is not here
  • .....CBDXCABA...
  • ......BDXCABAA..
  • .......DXCABAAC.
  • ........XCABAACB

In the \(w+1\) rotations, we need at least 2 samples.

2 Minimizer schemes

\[\newcommand{\order}{\mathcal{O}}\]

  • Minimizer scheme: Given a window of \(w\) k-mers, pick the (leftmost) smallest one
    • according to some order \(\order_k\)
  • \(k=1\), \(w=5\):
    • CABCA
  • \(k=2\), \(w=5\):
    • CABCAC.....
    • .ABCACC....
    • ..BCACCX...
    • ...CACCXY..
    • ....ACCXYZ.
    • .....CCXYZX
  • Density: 3/10
    • CABCACCXYZX

2.1 The situation, 1 year ago

2.2 The mod-minimizer

2.3 A near-tight lower bound

2.4 The current picture

2.5 Greedymini

2.6 Minimizer density lower bound

  • Density of minimizer scheme is \(\geq 1/\sigma^k\):

    sample exactly every AAA k-mer, and nothing else.

  • \(k=1\): density at least \(1/\sigma = 1/4\).

3 Sampling schemes: more general

  • Any function \(f: \Sigma^{w+k-1} \to \{0, \dots, w-1\}\)
  • We fix \(k=1\) from now: \(f: \Sigma^w\to \{0, \dots, w-1\}\)

3.1 Bidirectional anchors

  • Pick the start of the smallest rotation
    • EADCAE......
    • .ADCAEB.....
    • ..DCAEBE....
    • ...CAEBEC...
    • ....AEBECD..
    • .....EBECDC.
    • ......BECDCD

3.2 Limitations of bd-anchors

  • Lexicographic is bad:

    • AAAABCD...
    • .AAABCDE..
    • ..AABCDEF.
    • ...ABCDEFG
  • Comparing rotations is unstable:

    • AABACD..
    • .ABACDA.
    • ..BACDAE
  • Avoid last \(r\) positions.

3.3 Bd-anchor density for \(k=1\)

4 Smallest-unique-substring anchors

  • Idea: instead of smallest rotation: smallest suffix.
  • What about CABA: is ABA or A smaller?
    • We choose ABA smaller for stability.
  • AB is the smallest unique substring.
  • Stable:
    • AABACD..
    • .ABACDA.
    • ..BACDAE

4.1 Sus-anchor density

4.2 ABB order

  • AAAA is BAD:

    • small strings overlap
    • small strings cluster
  • We want the opposite!

  • ABB order:

    A followed by many non-A is smallest: ABBBBBBBBB

    • no overlap
    • no clustering

4.3 Sus-anchor, ABB order

4.4 Anti-lex

  • Anti-lexicographic order:

    A small, followed by largest possible suffix: AZZZZZ is minimal

    • no overlap
    • no clustering

4.5 Sus-anchor, anti-lex order

5 Understanding the lower bound

  • To reach lower bound: exactly 2 samples in every \(w+1\) cycle.

5.1 Failure mode

0010101 cycle:

  • 001010......
  • .010101.....
  • ..101010....
  • ...010100...
  • ....101001..
  • .....010010.
  • ......100101
  • The 01010 sus is not overlap free
    • Just like how AAA is not overlap free

Goal: find two non-overlapping substrings.

5.2 Can we design a perfectly optimal scheme?

  • Goal:

    For every \(w+1\) window, find two non-overlapping small strings.

  • Instead of 011...11, search 00...0011...11
    • Also non-overlapping, and more signal.
    • Still not optimal.
  • Tried many things. No general solution found yet.

6 Thanks to my co-authors!

  • Giulio Ermanno Pibiri
  • Bryce Kille
  • Daniel Liu
  • Igor Martayan

Slides: curiouscoding.nl/slides/minimizers.html

Blog curiouscoding.nl/slides/minimizers