What is bioinformatics? Link to heading
For today: DNA Link to heading

Zephyris, CC BY-SA 3.0 Wikipedia
Also DNA Link to heading

Human chromosomes
DNA, according to me Link to heading
A C G T
00 01 10 11
Covid Link to heading

A virus: SARS-CoV-2, aka COVID-19
covid vs sars-cov-2
Why bioinformatics? Link to heading
1800-1980: Punch cards - 80 B - Protein size noexport-reveal Link to heading

Punchcard
Pete Birkinshaw, CC BY 2.0, Wikipedia

β-hemoglobin (1963)
Emw, CC BY-SA 3.0, Wikipedia
1970-2000: Floppy disks - MB - Bacteria Link to heading

Floppy

E. coli bacteria (1997)
2000-2020: USB stick - GB - Human Genome Link to heading

USB stick

Human genome (2001)
2010-2020: Hard Drive - TB - RefSeq Link to heading

Hard drive


RefSeq
2025: Data Center - PB - SRA Link to heading

Data Center
PhonlamaiPhoto | istockphoto.com

Growth of SRA
Goal:
Fast code
Link to heading
Goal:
High throughput code
Link to heading
Goal:
Optimal throughput code
Link to heading
What is high troughput code? Link to heading
- Complexity
Few operations:
\(\quad O(n^2)\quad\longleftrightarrow\quad O(n)\)
- Efficiency
Fast operations:
memory read, 100 ns \(\quad\longleftrightarrow\quad\) 0.1 ns, addition
- Implementation
Parallel operations:
SIMD, instruction-level parallelism
Papers Link to heading
Part 1: Pairwise Alignment
Part 2: Minimizers
Part 3: Optimal Throughput
Additional
- [1] A*PA: Exact Global Alignment Using A* with Chaining Seed Heuristic and Match Pruning.
RGK and Pesho Ivanov, Bioinformatics 2024. - [2] A*PA2: Up to 19x Faster Exact Global Alignment.
RGK. WABI 2024.
- [3] The Mod-Minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers.
RGK and Giulio Ermanno Pibiri. WABI 2024. - [4] Forward Sampling Scheme Density Lower Bound.
Bryce Kille, RGK, et al. Bioinformatics 2024. - [5] The Open-Closed Mod-Minimizer Algorithm.
RGK, Daniel Liu, and Giulio Ermanno Pibiri. AMB 2025.
- [6] SimdMinimizers: Computing Random Minimizers, Fast.
RGK and Igor Martayan. SEA 2025. - [7] PtrHash: Minimal Perfect Hashing at RAM Throughput.
RGK. SEA 2025.
- [8] U-index: A Universal Indexing Framework for Matching Long Patterns.
Lorraine Ayad, Gabriele Fici, RGK, Rob Patro, Giulio Ermanno Pibiri, and Solon Pissis. SEA 2025.
Throughput is not in presentation, but in part 3 of the thesis
Problem 1: Pairwise Alignment Link to heading
Covid – \(\alpha\), December 2020 Link to heading

Sars-CoV-2, alpha variant, December 2020
Covid – \(\omicron\), December 2021 Link to heading

Sars-CoV-2, omicron variant, December 2021
Pairwise alignment Link to heading
- Find the mutations between two sequences
Dynamic programming Link to heading
Dynamic programming Link to heading
Dynamic programming Link to heading
Dynamic programming Link to heading
Dynamic programming Link to heading
Needleman-Wunsch – Quadratic \(O(n^2)\) Link to heading

Dijkstra – \(O(ns)\) Link to heading

Fewer states is better
Diagonal transition – \(O(n + s^2)\) Link to heading

Complexity is in expectation
A*PA\({}^{1}\) – near-linear, \(3\times\) faster on similar seqs Link to heading

[1] Exact Global Alignment Using A* with Chaining Seed Heuristic and Match Pruning.
RGK and Pesho Ivanov, Bioinformatics 2024.
Worst-case vs avg-case complexity
A*PA\({}^{1}\) – not quite linear Link to heading

[1] Exact Global Alignment Using A* with Chaining Seed Heuristic and Match Pruning.
RGK and Pesho Ivanov, Bioinformatics 2024.
A*PA: Great complexity – terrible efficiency Link to heading
Band Doubling – \(O(ns)\) Link to heading

A*PA2\({}^{2}\) – good efficiency: up to \(19\times\) faster Link to heading

[2] A*PA2: Up to 19x Faster Exact Global Alignment.
RGK. WABI 2024.
mention throughput
Problem 2: Minimizers – lossy compression Link to heading
Lossy compression Link to heading

original
Lossy compression Link to heading

50%
Lossy compression Link to heading

25%
Lossy compression Link to heading

12%
Lossy compression Link to heading

6%
Sampling k-mers Link to heading

\(k=10\), \(w=1000\)
w=1000 looses too much information
Minimizer definition Link to heading
\(k\)-mer size: \(k=3\)
Window guarantee: \(w=4\)
Length \(w+k-1=6\) window of \(w\) \(k\)-mers
Minimizer scheme:
\[
f: \Sigma^{w+k-1} \mapsto \{0, 1, 2, \dots, w-1\}.
\]
Used for compression and hashing.
Find f that minimizes density
leftmost vs alphabetical
usage in
Minimizer example Link to heading
Minimizer example Link to heading
Minimizer example Link to heading
Minimizer example Link to heading
Minimizer example Link to heading
Minimizer example Link to heading
Minimizer density Link to heading
Density: expected fraction of sampled \(k\)-mers.
Here: \(3/9=0.33\)
Sampling k-mers Link to heading

\(k=10\), \(w=500\)
Sampling k-mers Link to heading

\(k=10\), \(w=250\)
Sampling k-mers Link to heading

\(k=10\), \(w=100\)
Sampling k-mers Link to heading

\(k=10\), \(w=50\)
Sampling k-mers Link to heading

\(k=10\), \(w=25\)
Sampling k-mers Link to heading

\(k=10\), \(w=100\)
Sampling k-mers Link to heading

\(k=11\), \(w=100\)
Sampling k-mers Link to heading

\(k=12\), \(w=100\)
Sampling k-mers Link to heading

\(k=20\), \(w=100\)
Sampling k-mers Link to heading

\(k=30\), \(w=100\)
Goal: Minimize the number of sampled \(k\)-mers Link to heading
Density plots – before (\(w=24\)) Link to heading
The Mod-Minimizer\({}^{3}\) Link to heading
[3] The Mod-Minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers. RGK and Giulio Ermanno Pibiri. WABI 2024.
A Near-Tight Lower Bound\({}^{4}\) Link to heading
[4] Forward sampling scheme density lower bound. Bryce Kille, RGK, et al. Bioinformatics 2024.
Near-optimal schemes for small \(k\) Link to heading
(Unpublished)
Extended mod-minimizer\({}^{5}\) Link to heading
[5] The Open-closed mod-minimizer Algorithm. RGK, Daniel Liu, and Giulio Ermanno Pibiri. AMB 2025.
Proving The Lower Bound Link to heading
Suppose \(k=1\), and consider a cycle of \(w+1\) characters.
Eg for \(w=4\):ABCDEABCDEABCDE
Suppose we only sample 1 character:
ABCDEABCDEABCDE
The distance between samples is \(5 > w\)!
Thus, we need at least two samples.
Conclusion: the density is at least \(2/(w+1)\).
Conclusion: high troughput code matters! Link to heading
- Complexity
Few operations:
A*PA: \(\quad O(n^2)\quad \mapsto\quad \ ``O(n)\text{’’}\)
Provably near-optimal minimizer schemes.
- Efficiency
Fast operations:
A*PA2 is up to 500x more efficient than A*PA; up to 19x faster than other methods.
- Implementation
Parallel operations:
A*PA2 uses SIMD and instruction-level parallelism.
Outlook Link to heading
- Optimize all the code.
- Minimizers are not yet a fully solved problem.
- Proving optimality is hard.
Thanks! Link to heading
Propositions Link to heading
- Complexity theory’s days are numbered.
- \(\log \log n \leq 6\)
- Succinct data structures are overrated.
- There is beauty in mathematical perfection.
- Too many PhDs are wasted shaving of small factors of complexities that will never be practical.
- If a paper starts with “faster methods are needed”, it must talk about the implementation.
- Fast code must exploit assumptions on the input.
- Fast code puts requirements on the input format.
- Optimizing ugly code is a waste of time.
- Assembly is not scary.
Extra: Pairwise alignment Link to heading
A*PA: comparison Link to heading

A*PA heuristics Link to heading

A*PA: seed heuristic Link to heading

A*PA: gap-chaining seed heuristic Link to heading

A*PA: contours Link to heading

A*PA2: pre-pruning Link to heading

A*PA2: results (real data) Link to heading
A*PA2: results (synthetic) Link to heading
Alignment modes Link to heading
Cost models Link to heading
Semi-global variants Link to heading
Text searching Link to heading
Skip-cost Link to heading
Skip-cost Link to heading
Block-computation results Link to heading
Semi-global A*PA Link to heading

Seed-chain-extend Link to heading
A*PA: table Link to heading

Extra: Minimizers Link to heading
Minimizer schemes Link to heading
- Lexicographic: choose lexicographic smallest \(k\)-mer
- Random mini: choose \(k\)-mer with smallest hash
- ABB: choose A, followed by most non-A
- ABB+: break ties via random hash
- Sus-anchor: choose the position of the smallest unique suffix
- “smallest”: where the first character is inverted.
Large alphabet Link to heading
Lower bound Link to heading

Mod-minimizer Link to heading

Mod-minimizer Link to heading
Mod-minimizer Link to heading
Mod-minimizer Link to heading
Mod-minimizer Link to heading
Mod-minimizer Link to heading
Selection schemes Link to heading

Extra: PtrHash Link to heading
PtrHash: overview Link to heading

PtrHash: Cacheline Elias Fano Link to heading

PtrHash: Bucket Functions Link to heading

PtrHash: Construction Link to heading

PtrHash: Results Link to heading

PtrHash: Optimal Throughput Link to heading
TODO :noexport_reveal: Link to heading
minis:
- every w’th kmer
- why is the lower bound hard to reach
extra:
- affine cost layers
Last minute:
- update cursor size in sway