Past, ongoing, and future research plans

Here I list projects that I am working/have worked on. Outdated almost always. Mostly as a note-to-self to keep things a bit organised.

In progress (October 2025) Link to heading

Algorithm engineering (with Peter Sanders’ group @ KIT):

  • QuickHeap (blog): a fast SIMD-based heap
  • k-perfect hashing: quickly map keys to bins of size of \(k\in \{8,16\}\), mostly extending PtrHash.
  • Static hash sets (StaticHashSet<u32>) with single-cache-miss lookups
  • Fast 2-bit rank for use in bwa-mem, with Heng Li.

Tool/library development (with Igor Martayan & Bede Constantinides):

  • Updated packed-seq (gh), seq-hash (gh), simd-minimizers (gh)
    • Faster handling of short-read data
    • Support for N characters using 2+1bit encoding
    • simd-mini: skip minimizers of windows containing N and other non-ACGT chars.
  • Paraseq (gh) PRs:
    • multithreaded unzipping for paired-end input
    • new ParallelReader API
    • overall cleanup and deduplication
  • Deacon (gh):
    • speedups from dependencies above
    • more reusing of allocations
    • faster hashset WIP
    • faster IO and on-disk format

Furthermore:

  • Simd-Sketch (gh, blog):
    • algorithm engineering for fast bottom & bucket sketching
    • TODO: algorithm engineering for fast bottom-sketch distance comparison
    • SIMD for fast bucket-sketch distances
    • experiments (sketch time, comparison time, correlation)
      • 1-bit hashing with many hashes is best
    • WIP: applications/integration with attotree (by Karel Břinda) and sketchlib (by John Lees)
    • TODO: fix bias due to 32-bit hash collisions.
    • TODO?: locality sensitive clustering; phylogeny reconstruction
  • NtHash collisions (blog), with John Lees and Victor Bouza.

Further collaborations (external involvement / supervision):

  • Theory: \(O(n\lg n)\) chaining for anchored edit distance, with Nicola Rizzo.
  • Software: Barbell: fast & accurate demultiplexing, with Rick Beeloo.

Dormant Link to heading

  • Sassy (Beeloo and Groot Koerkamp 2025): under review.
  • Minimizers:
    • publish the anti-lex scheme?
    • continue investigation of greedymini to develop ‘understandable’ equivalent schemes.
    • Find exact optimal schemes for all \(k\equiv 1\pmod w\)
    • lower bound for local schemes

Code:

Done Link to heading

See publications.

Long term plans Link to heading

If anything here interests you, feel free to reach out for collaborations.

  • engineer STPD (Becker et al. 2025): write a highly optimized implementation.
    • minimizer space?
    • fuzzy version by only using minimizer start positions?
  • Optimal CPU/memory performance given 3D latency and cooling constraints.
  • Pairwise alignment review paper based on thesis chapter 2 (this post).
    • Review on approximate string matching, this post (will never happen).
  • Minimizers review paper based on thesis part 2 (this post).

Abandoned Link to heading

  • Affine A*PA2: too much of an overhaul.
  • Spaced $k$-mer similarity: see this post.
  • Expected linear time A*PA: see a draft in this post.
  • Linear memory WFA: Instead of storing furthest reaching points for all wavefronts, it is sufficient to only store critical points where paths split/merge. This should lower memory usage of WFA to close to linear, without needing BiWFA. See this post. This has similar vibes to TALCO (Walia et al. 2024).
  • Local doubling: see this post.

Reading list Link to heading

(I’m too lazy to add proper DOI entries for everything here.)

References Link to heading

Alanko, Jarno N., Simon J. Puglisi, and Jaakko Vuohtoniemi. 2023. “Small Searchable -Spectra via Subset Rank Queries on the Spectral Burrows-Wheeler Transform.” In Siam Conference on Applied and Computational Discrete Algorithms (Acda23), 225–36. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611977714.20.
Bahne, Johannes, Nico Bertram, Marvin Böcker, Jonas Bode, Johannes Fischer, Hermann Foot, Florian Grieskamp, et al. 2019. “Sacabench: Benchmarking Suffix Array Construction.” In String Processing and Information Retrieval, 407–16. Springer International Publishing. https://doi.org/10.1007/978-3-030-32686-9_29.
Becker, Ruben, Davide Cenzato, Travis Gagie, Ragnar Groot Koerkamp, Sung-Hwan Kim, Giovanni Manzini, and Nicola Prezza. 2025. “Compressing Suffix Trees by Path Decompositions.” arXiv; arXiv. https://doi.org/10.48550/ARXIV.2506.14734.
Beeloo, Rick, and Ragnar Groot Koerkamp. 2025. “Sassy: Searching Short DNA Strings in the 2020s,” July. https://doi.org/10.1101/2025.07.22.666207.
Ertl, Otmar. 2021. “Setsketch: Filling the Gap between Minhash and Hyperloglog.” Proceedings of the Vldb Endowment 14 (11): 2244–57. https://doi.org/10.14778/3476249.3476276.
Pibiri, Giulio Ermanno, and Shunsuke Kanda. 2021. “Rank/Select Queries over Mutable Bitmaps.” Information Systems 99 (July): 101756. https://doi.org/10.1016/j.is.2021.101756.
Walia, Sumit, Cheng Ye, Arkid Bera, Dhruvi Lodhavia, and Yatish Turakhia. 2024. “TALCO: Tiling Genome Sequence Alignment Using Convergence of Traceback Pointers.” In 2024 Ieee International Symposium on High-Performance Computer Architecture (Hpca). IEEE. https://doi.org/10.1109/hpca57654.2024.00044.
NO_ITEM_DATA:spaln3
NO_ITEM_DATA:linear-space-four-russians
NO_ITEM_DATA:multi-context-seeds
NO_ITEM_DATA:from-superstring-to-indexing
NO_ITEM_DATA:col-bwt
NO_ITEM_DATA:bwt-compression