1 Warming up: A cute prolblem Link to heading
- Given a string, choose one character.
CABAACBD
- Given a rotation, choose one character.
ACBDCABA
- Can we always choose the same character?
- Yes: e.g. the smallest rotation (bd-anchor):
CAB
A
ACBD
ACBDCAB
A
1.1 This talk: what if one character is hidden? Link to heading
- Given a string (length \(w\)), choose one character.
CABAACBD
X
- Given a rotation (of the hidden \(w+1\) string), choose one character.
ACBDXCAB
A
- Can we always choose the same character?
- Maybe?
CAB
A
ACBD
ACBDXCAB
🤔
1.2 The answer is no! Link to heading
CABAACBDX
rotations:
CAB
A
ACBD........
.AB
A
ACBDX.......
..B
A
ACBDXC......
...
A
ACBDXCA.....
....ACBDXCAB....
<— theA
is not here.....CBDXCAB
A
...
......BDXCAB
A
A..
.......DXCAB
A
AC.
........XCAB
A
ACB
In the \(w+1\) rotations, we need at least 2 samples.
2 Minimizer schemes Link to heading
- Minimizer scheme: Given a window of \(w\) k-mers, pick the (leftmost) smallest one
- according to some order \(\order_k\)
- \(k=1\), \(w=5\):
C
A
BCA
- \(k=2\), \(w=5\):
C
AB
CAC.....
.
AB
CACC....
..BC
AC
CX...
...C
AC
CXY..
....
AC
CXYZ.
.....
CC
XYZX
- Density: 3/10
C
AB
C
ACC
XYZX
2.1 The situation, 1 year ago Link to heading
2.2 The mod-minimizer Link to heading
2.3 A near-tight lower bound Link to heading
2.4 The current picture Link to heading
2.5 Greedymini Link to heading

2.6 Minimizer density lower bound Link to heading
Density of minimizer scheme is \(\geq 1/\sigma^k\):
sample exactly every
AAA
k-mer, and nothing else.\(k=1\): density at least \(1/\sigma = 1/4\).
3 Sampling schemes: more general Link to heading
- Any function \(f: \Sigma^{w+k-1} \to \{0, \dots, w-1\}\)
- We fix \(k=1\) from now: \(f: \Sigma^w\to \{0, \dots, w-1\}\)
3.1 Bidirectional anchors Link to heading
- Pick the start of the smallest rotation
E
A
DCAE......
.
A
DCAEB.....
..DC
A
EBE....
...C
A
EBEC...
....
A
EBECD..
.....E
B
ECDC.
......
B
ECDCD
3.2 Limitations of bd-anchors Link to heading
Lexicographic is bad:
A
AAABCD...
.
A
AABCDE..
..
A
ABCDEF.
...
A
BCDEFG
Comparing rotations is unstable:
-
A
ABACD..
.ABACD
A
.
..B
A
CDAE
-
Avoid last \(r\) positions.
3.3 Bd-anchor density for \(k=1\) Link to heading
4 Smallest-unique-substring anchors Link to heading
- Idea: instead of smallest rotation: smallest suffix.
- What about
CABA
: isABA
orA
smaller?- We choose
ABA
smaller for stability.
- We choose
AB
is the smallest unique substring.- Stable:
-
AA
BACD..
.
AB
ACDA.
..B
AC
DAE
-
4.1 Sus-anchor density Link to heading
4.2 ABB order Link to heading
AAAA
is BAD:- small strings overlap
- small strings cluster
We want the opposite!
ABB order:
A
followed by many non-A
is smallest:ABBBBBBBBB
- no overlap
- no clustering
4.3 Sus-anchor, ABB order Link to heading
4.4 Anti-lex Link to heading
Anti-lexicographic order:
A
small, followed by largest possible suffix:AZZZZZ
is minimal- no overlap
- no clustering
4.5 Sus-anchor, anti-lex order Link to heading
5 Understanding the lower bound Link to heading
- To reach lower bound: exactly 2 samples in every \(w+1\) cycle.
5.1 Failure mode Link to heading
0010101
cycle:
-
00
1010......
.
01010
1.....
..1
0101
0....
...0101
00
...
....101
00
1..
.....01
00
10.
......1
00
101
- The
01010
sus is not overlap free- Just like how
AAA
is not overlap free
- Just like how
Goal: find two non-overlapping substrings.
5.2 Can we design a perfectly optimal scheme? Link to heading
Goal:
For every \(w+1\) window, find two non-overlapping small strings.
- Instead of
011...11
, search00...0011...11
- Also non-overlapping, and more signal.
- Still not optimal.
- Tried many things. No general solution found yet.
6 Thanks to my co-authors! Link to heading
- Giulio Ermanno Pibiri
- Bryce Kille
- Daniel Liu
- Igor Martayan