Optimal Throughput Bioinformatics

What is bio​informatics? Link to heading

For today: DNA Link to heading

Zephyris, CC BY-SA 3.0 Wikipedia

Also DNA Link to heading

Human chromosomes

DNA, according to me Link to heading

A C G T

00 01 10 11

Covid Link to heading

A virus: SARS-CoV-2, aka COVID-19

covid vs sars-cov-2

Why bio​informatics? Link to heading

1800-1980: Punch cards - 80 B - Protein size noexport-reveal Link to heading

Punchcard
Pete Birkinshaw, CC BY 2.0, Wikipedia

β-hemoglobin (1963)
Emw, CC BY-SA 3.0, Wikipedia

1970-2000: Floppy disks - MB - Bacteria Link to heading

Floppy

E. coli bacteria (1997)

2000-2020: USB stick - GB - Human Genome Link to heading

USB stick

Human genome (2001)

2010-2020: Hard Drive - TB - RefSeq Link to heading

Hard drive

RefSeq

2025: Data Center - PB - SRA Link to heading

Data Center
PhonlamaiPhoto | istockphoto.com

Growth of SRA

Goal:
Fast code Link to heading

Goal:
High throughput code Link to heading

Goal:
Optimal throughput code Link to heading

What is high troughput code? Link to heading

  1. Complexity
    • Few operations:

      \(\quad O(n^2)\quad\longleftrightarrow\quad O(n)\)

  2. Efficiency
    • Fast operations:

      memory read, 100 ns \(\quad\longleftrightarrow\quad\) 0.1 ns, addition

  3. Implementation
    • Parallel operations:

      SIMD, instruction-level parallelism

Papers Link to heading

Part 1: Pairwise Alignment

Part 2: Minimizers

Part 3: Optimal Throughput

Additional

  • [1] A*PA: Exact Global Alignment Using A* with Chaining Seed Heuristic and Match Pruning.
    RGK and Pesho Ivanov, Bioinformatics 2024.
  • [2] A*PA2: Up to 19x Faster Exact Global Alignment.
    RGK. WABI 2024.

  • [3] The Mod-Minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers.
    RGK and Giulio Ermanno Pibiri. WABI 2024.
  • [4] Forward Sampling Scheme Density Lower Bound.
    Bryce Kille, RGK, et al. Bioinformatics 2024.
  • [5] The Open-Closed Mod-Minimizer Algorithm.
    RGK, Daniel Liu, and Giulio Ermanno Pibiri. AMB 2025.

  • [6] SimdMinimizers: Computing Random Minimizers, Fast.
    RGK and Igor Martayan. SEA 2025.
  • [7] PtrHash: Minimal Perfect Hashing at RAM Throughput.
    RGK. SEA 2025.

  • [8] U-index: A Universal Indexing Framework for Matching Long Patterns.
    Lorraine Ayad, Gabriele Fici, RGK, Rob Patro, Giulio Ermanno Pibiri, and Solon Pissis. SEA 2025.

Throughput is not in presentation, but in part 3 of the thesis

Problem 1: Pairwise Alignment Link to heading

Covid – \(\alpha\), December 2020 Link to heading

Sars-CoV-2, alpha variant, December 2020

Covid – \(\omicron\), December 2021 Link to heading

Sars-CoV-2, omicron variant, December 2021

Pairwise alignment Link to heading

  • Find the mutations between two sequences

Dynamic programming Link to heading

Dynamic programming Link to heading

Dynamic programming Link to heading

Dynamic programming Link to heading

Dynamic programming Link to heading

Needleman-Wunsch – Quadratic \(O(n^2)\) Link to heading

Dijkstra – \(O(ns)\) Link to heading

Fewer states is better

Diagonal transition – \(O(n + s^2)\) Link to heading

Complexity is in expectation

A*PA\({}^{1}\) – near-linear, \(3\times\) faster on similar seqs Link to heading

[1] Exact Global Alignment Using A* with Chaining Seed Heuristic and Match Pruning.
RGK and Pesho Ivanov, Bioinformatics 2024.

Worst-case vs avg-case complexity

A*PA\({}^{1}\) – not quite linear Link to heading

[1] Exact Global Alignment Using A* with Chaining Seed Heuristic and Match Pruning.
RGK and Pesho Ivanov, Bioinformatics 2024.

A*PA: Great complexity – terrible efficiency Link to heading

Band Doubling – \(O(ns)\) Link to heading

A*PA2\({}^{2}\) – good efficiency: up to \(19\times\) faster Link to heading

[2] A*PA2: Up to 19x Faster Exact Global Alignment.
RGK. WABI 2024.

mention throughput

Problem 2: Minimizers – lossy compression Link to heading

Lossy compression Link to heading

original

Lossy compression Link to heading

50%

Lossy compression Link to heading

25%

Lossy compression Link to heading

12%

Lossy compression Link to heading

6%

Sampling k-mers Link to heading

\(k=10\), \(w=1000\)

w=1000 looses too much information

Minimizer definition Link to heading

\(k\)-mer size: \(k=3\)
Window guarantee: \(w=4\)
Length \(w+k-1=6\) window of \(w\) \(k\)-mers

Minimizer scheme: \[ f: \Sigma^{w+k-1} \mapsto \{0, 1, 2, \dots, w-1\}. \]
Used for compression and hashing.

Find f that minimizes density

leftmost vs alphabetical

usage in

Minimizer example Link to heading

Minimizer example Link to heading

Minimizer example Link to heading

Minimizer example Link to heading

Minimizer example Link to heading

Minimizer example Link to heading

Minimizer density Link to heading

Density: expected fraction of sampled \(k\)-mers.
Here: \(3/9=0.33\)

Sampling k-mers Link to heading

\(k=10\), \(w=500\)

Sampling k-mers Link to heading

\(k=10\), \(w=250\)

Sampling k-mers Link to heading

\(k=10\), \(w=100\)

Sampling k-mers Link to heading

\(k=10\), \(w=50\)

Sampling k-mers Link to heading

\(k=10\), \(w=25\)

Sampling k-mers Link to heading

\(k=10\), \(w=100\)

Sampling k-mers Link to heading

\(k=11\), \(w=100\)

Sampling k-mers Link to heading

\(k=12\), \(w=100\)

Sampling k-mers Link to heading

\(k=20\), \(w=100\)

Sampling k-mers Link to heading

\(k=30\), \(w=100\)

Goal: Minimize the number of sampled \(k\)-mers Link to heading

Density plots – before (\(w=24\)) Link to heading

The Mod-Minimizer\({}^{3}\) Link to heading

[3] The Mod-Minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers. RGK and Giulio Ermanno Pibiri. WABI 2024.

A Near-Tight Lower Bound\({}^{4}\) Link to heading

[4] Forward sampling scheme density lower bound. Bryce Kille, RGK, et al. Bioinformatics 2024.

Near-optimal schemes for small \(k\) Link to heading

(Unpublished)

Extended mod-minimizer\({}^{5}\) Link to heading

[5] The Open-closed mod-minimizer Algorithm. RGK, Daniel Liu, and Giulio Ermanno Pibiri. AMB 2025.

Proving The Lower Bound Link to heading

  • Suppose \(k=1\), and consider a cycle of \(w+1\) characters.
    Eg for \(w=4\):

    ABCDEABCDEABCDE

  • Suppose we only sample 1 character:

    AB​C​DEAB​C​DEAB​C​DE

    The distance between samples is \(5 > w\)!

  • Thus, we need at least two samples.

  • Conclusion: the density is at least \(2/(w+1)\).

Conclusion: high troughput code matters! Link to heading

  1. Complexity
    • Few operations:

      A*PA: \(\quad O(n^2)\quad \mapsto\quad \ ``O(n)\text{’’}\)

      Provably near-optimal minimizer schemes.

  2. Efficiency
    • Fast operations:

      A*PA2 is up to 500x more efficient than A*PA; up to 19x faster than other methods.

  3. Implementation
    • Parallel operations:

      A*PA2 uses SIMD and instruction-level parallelism.

Outlook Link to heading

  • Optimize all the code.
  • Minimizers are not yet a fully solved problem.
  • Proving optimality is hard.

Thanks! Link to heading

Propositions Link to heading

  1. Complexity theory’s days are numbered.
  2. \(\log \log n \leq 6\)
  3. Succinct data structures are overrated.
  4. There is beauty in mathematical perfection.
  5. Too many PhDs are wasted shaving of small factors of complexities that will never be practical.
  6. If a paper starts with “faster methods are needed”, it must talk about the implementation.
  7. Fast code must exploit assumptions on the input.
  8. Fast code puts requirements on the input format.
  9. Optimizing ugly code is a waste of time.
  10. Assembly is not scary.

Extra: Pairwise alignment Link to heading

A*PA: comparison Link to heading

A*PA heuristics Link to heading

A*PA: seed heuristic Link to heading

A*PA: gap-chaining seed heuristic Link to heading

A*PA: contours Link to heading

A*PA2: pre-pruning Link to heading

A*PA2: results (real data) Link to heading

A*PA2: results (synthetic) Link to heading

Alignment modes Link to heading

Cost models Link to heading

Semi-global variants Link to heading

Text searching Link to heading

Skip-cost Link to heading

Skip-cost Link to heading

Block-computation results Link to heading

Semi-global A*PA Link to heading

Seed-chain-extend Link to heading

A*PA: table Link to heading

Extra: Minimizers Link to heading

Minimizer schemes Link to heading

  • Lexicographic: choose lexicographic smallest \(k\)-mer
  • Random mini: choose \(k\)-mer with smallest hash
  • ABB: choose A, followed by most non-A
  • ABB+: break ties via random hash
  • Sus-anchor: choose the position of the smallest unique suffix
    • “smallest”: where the first character is inverted.

Large alphabet Link to heading

Lower bound Link to heading

Mod-minimizer Link to heading

Mod-minimizer Link to heading

Mod-minimizer Link to heading

Mod-minimizer Link to heading

Mod-minimizer Link to heading

Mod-minimizer Link to heading

Selection schemes Link to heading

Extra: PtrHash Link to heading

PtrHash: overview Link to heading

PtrHash: Cacheline Elias Fano Link to heading

PtrHash: Bucket Functions Link to heading

PtrHash: Construction Link to heading

PtrHash: Results Link to heading

PtrHash: Optimal Throughput Link to heading

TODO :noexport_reveal: Link to heading

  • minis:

    • every w’th kmer
    • why is the lower bound hard to reach
  • extra:

    • affine cost layers
  • Last minute:

    • update cursor size in sway