Optimal throughput bioinformatics

1 High performance computing bioinformatics

1.1 What is bioinformatics?

1.2 A*PA

1.3 A*PA2

2 High Optimal performance bioinformatics

Proposition

Optimal does not mean anything.

2.1 Optimal?

  • Which metric?
    • Space?
    • Time?
    • Theory? Practice?
  • How optimal?
    • Within some constant factor? (\(O(n)\))
    • Sublinear space overhead? (\(o(n)\))
    • Within \(2×\)?

2.2 Big-\(O\)

Proposition

Big-\(O\) was nice, but does not align with modern hardware.

  • If you have the memory, use it!
  • CPUs are not turing machines.
  • SDSL is succinct in theory: \(o(n)\) overhead on theoretical minimum space.
    • … but practice is hard,
    • … and also up to \(1000\times\) slower to query!
    • Using \(10\%\) extra space is fine, and can avoid all the slowdown.
  • Not all \(O(1)\) are equal:
    • one 100ns cache miss per character? (FM-index, cough)
    • one 0.003ns comparison per character?
    • \(O(\log n)\) or even \(O(\sqrt n)\) fast steps is better than \(O(1)\) slow steps!

2.3 Minimizers

2.4 Minimizers: a lower bound

2.5 Mod-minimizers: a near-optimal scheme

2.6 Outlook:: future schemes

3 Optimal performance throughput bioinformatics

Proposition

Most bioinformatics applications need throughput, not latency.

3.1 Why throughput?

  • Performance can be anything!
  • Throughput, not latency!

3.2 Fast minimizers

3.4 PtrHash