Past, ongoing, and future research plans
Here I list projects that I am working/have worked on. Outdated almost always. Mostly as a note-to-self to keep things a bit organised.
In progress (October 2025) Link to heading
Algorithm engineering (with Peter Sanders’ group @ KIT):
- QuickHeap (blog): a fast SIMD-based heap
- k-perfect hashing: quickly map keys to bins of size of \(k\in \{8,16\}\), mostly extending PtrHash.
- Static hash sets (
StaticHashSet<u32>
) with single-cache-miss lookups - Fast 2-bit rank for use in
bwa-mem
, with Heng Li.
Tool/library development (with Igor Martayan & Bede Constantinides):
- Updated packed-seq (gh), seq-hash (gh), simd-minimizers (gh)
- Faster handling of short-read data
- Support for N characters using 2+1bit encoding
- simd-mini: skip minimizers of windows containing N and other non-ACGT chars.
- Paraseq (gh) PRs:
- multithreaded unzipping for paired-end input
- new
ParallelReader
API - overall cleanup and deduplication
- Deacon (gh):
- speedups from dependencies above
- more reusing of allocations
- faster hashset WIP
- faster IO and on-disk format
Furthermore:
- Simd-Sketch (gh, blog):
- algorithm engineering for fast bottom & bucket sketching
- TODO: algorithm engineering for fast bottom-sketch distance comparison
- SIMD for fast bucket-sketch distances
- experiments (sketch time, comparison time, correlation)
- 1-bit hashing with many hashes is best
- WIP: applications/integration with attotree (by Karel Břinda) and sketchlib (by John Lees)
- TODO: fix bias due to 32-bit hash collisions.
- TODO?: locality sensitive clustering; phylogeny reconstruction
- NtHash collisions (blog), with John Lees and Victor Bouza.
Further collaborations (external involvement / supervision):
- Theory: \(O(n\lg n)\) chaining for anchored edit distance, with Nicola Rizzo.
- Software: Barbell: fast & accurate demultiplexing, with Rick Beeloo.
Dormant Link to heading
- Sassy (Beeloo and Groot Koerkamp 2025): under review.
- Minimizers:
- publish the anti-lex scheme?
- continue investigation of greedymini to develop ‘understandable’ equivalent schemes.
- Find exact optimal schemes for all \(k\equiv 1\pmod w\)
- lower bound for local schemes
Code:
- sshash-rs (gh)
- suffix array searching (gh, post) (Bahne et al. 2019)
Done Link to heading
See publications.
Long term plans Link to heading
If anything here interests you, feel free to reach out for collaborations.
- engineer STPD (Becker et al. 2025): write a highly optimized implementation.
- minimizer space?
- fuzzy version by only using minimizer start positions?
- Optimal CPU/memory performance given 3D latency and cooling constraints.
- Pairwise alignment review paper based on thesis chapter 2 (this post).
- Review on approximate string matching, this post (will never happen).
- Minimizers review paper based on thesis part 2 (this post).
Abandoned Link to heading
- Affine A*PA2: too much of an overhaul.
- Spaced $k$-mer similarity: see this post.
- Expected linear time A*PA: see a draft in this post.
- Linear memory WFA: Instead of storing furthest reaching points for all wavefronts, it is sufficient to only store critical points where paths split/merge. This should lower memory usage of WFA to close to linear, without needing BiWFA. See this post. This has similar vibes to TALCO (Walia et al. 2024).
- Local doubling: see this post.
Reading list Link to heading
(I’m too lazy to add proper DOI entries for everything here.)
Rank and select over mutable bitmaps (Pibiri and Kanda 2021)
Bounded context BWT
counting-quotient-filter
r-index
fmalign2
local-kmer-selection
count-min-sketch (with conservative updates)
- count-min-sketch.pdf
- efficient-kmer-counting.pdf
tinted dbg
function-assigned masked superstrings
Turning unit cost into affine cost alignment?
- Maybe by doubling and refining costs in each iteration? similar to cost-scaling flow algorithms?
(NO_ITEM_DATA:spaln3)
(NO_ITEM_DATA:linear-space-four-russians)
(NO_ITEM_DATA:multi-context-seeds)
(NO_ITEM_DATA:from-superstring-to-indexing)
(NO_ITEM_DATA:col-bwt)
Eskemap for A*PA?
Lyndon trees: https://arxiv.org/abs/2406.16475, Giuseppe Romana
ForAlign
(NO_ITEM_DATA:bwt-compression)
texrex: https://academic.oup.com/nargab/article/7/2/lqaf039/8115380?login=false
simd-utils crate (for myers bitpacking, and simd-minimizers)
Medlib: https://www.biorxiv.org/content/biorxiv/early/2025/05/07/2025.05.01.651420.full.pdf
K2Rmini
alignment history: https://pubmed.ncbi.nlm.nih.gov/10890397/
BGSA pairwise aligner
Prokrustean graph
New Durbin & Myers paper
slp-recompression
GapsMis pairwise aligner https://www.sciencedirect.com/science/article/pii/S0304397515002200#se0030
KeBaB: https://link.springer.com/chapter/10.1007/978-3-032-05228-5_2
TODO: dedup up projects/talks/publications/cv pages