Optimal throughput bioinformatics

August 2024

1 High performance computing bioinformatics

Proposition

"Optimal" does not mean anything on its own.

Which metric?
- Space?
- Time?
- Theory? Practice?
How optimal?
- Within some constant factor? (\(O(n)\))
- Sublinear space overhead? (\(o(n)\))
- Within \(2\times\)?

Proposition

Big-\(O\) was nice, but does not align with modern hardware.

SDSL is succinct in theory: \(o(n)\) bits of overhead on top of the theoretical minimum space.
- … but practice is hard,
- … and also up to \(1000\times\) slower to query!
- Using \(10\%\) extra space is fine, and can avoid all the slowdown.
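
To make the space trade-off concrete, here is a minimal sketch of a non-succinct rank index that stores one counter per block of bits. The layout (512-bit blocks, 64-bit counters) and the names `rank_overhead` and `BlockedRank` are assumptions for illustration, not SDSL's actual design. With these assumed sizes the overhead is a constant fraction of the input, in the same ballpark as the \(10\%\) above, and a query needs one counter lookup plus a scan within a single block.

```python
def rank_overhead(block_bits: int = 512, counter_bits: int = 64) -> float:
    """Extra space as a fraction of the bitvector size (assumed layout)."""
    return counter_bits / block_bits


class BlockedRank:
    """Hypothetical rank index: one prefix count per block of bits."""

    def __init__(self, bits: list[int], block_bits: int = 512):
        self.bits = bits
        self.block_bits = block_bits
        # counts[k] = number of 1 bits in the first k blocks.
        self.counts = [0]
        for start in range(0, len(bits), block_bits):
            block_ones = sum(bits[start:start + block_bits])
            self.counts.append(self.counts[-1] + block_ones)

    def rank1(self, i: int) -> int:
        """Number of 1 bits in bits[0:i]."""
        block = i // self.block_bits
        start = block * self.block_bits
        # One counter lookup, then a scan inside a single block.
        return self.counts[block] + sum(self.bits[start:i])


if __name__ == "__main__":
    print(f"overhead: {rank_overhead():.1%}")
    index = BlockedRank([1, 0, 1, 1] * 300)
    print(index.rank1(513))
```

A Python list of integers is of course not how one would store bits in practice; the sketch only shows the structure and the constant-fraction overhead, not the speed.
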
Not all \(O(1)\) are equal:
- one 100ns cache miss per character? (FM-index, cough)
- one 0.003ns comparison per character?
- \(O(\log n)\) or even \(O(\sqrt n)\) fast steps can be better than \(O(1)\) slow steps!
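
The comparison above can be turned into back-of-the-envelope arithmetic. The sketch below uses the two per-step costs quoted in the list (100ns and 0.003ns) and assumes all hidden constants are 1 and that the fast steps really run at the quoted speed; the function names are hypothetical.

```python
import math

CACHE_MISS_NS = 100.0   # per-step cost of the "slow" O(1) approach (from the list above)
COMPARISON_NS = 0.003   # per-step cost of a "fast" step (from the list above)


def break_even_steps() -> float:
    """How many fast steps cost as much as one cache miss."""
    return CACHE_MISS_NS / COMPARISON_NS


def compare(n: int) -> dict[str, float]:
    """Estimated cost in ns per character under this toy model."""
    return {
        "O(1) slow": 1 * CACHE_MISS_NS,
        "O(log n) fast": math.log2(n) * COMPARISON_NS,
        "O(sqrt n) fast": math.sqrt(n) * COMPARISON_NS,
    }


if __name__ == "__main__":
    print(f"break-even: {break_even_steps():.0f} fast steps per cache miss")
    for n in (10**6, 10**9, 10**10):
        print(n, compare(n))
```

Under these assumptions one cache miss pays for roughly 33,000 fast steps, so \(O(\log n)\) fast steps win for any realistic \(n\), and \(O(\sqrt n)\) fast steps win until \(n\) is around \(10^9\).
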

Proposition

Most bioinformatics applications need throughput, not latency.