Ongoing and future research

Here I list projects that I’m currently working on, and ideas for future work.

This page is usually outdated.

In progress

A* pairwise aligner [GitHub]
Exact global pairwise alignment of random strings in expected linear time.

Contains proof of correctness, implementation, evals and comparison with WFA and edlib on random data.

Proof of expected linear time alignment
I have a proof of concept to show that a simplified version of the algorithm currently implemented by A* pairwise aligner runs in expected linear time on random input with sufficiently low edit distance (\(|\Sigma|^{1/e} \ll n\)), but need to spend some time on details and writing it down. WIP post: linear time pairwise alignment.
Finish review post
Write text on diagonal transition and fill in other remaining TODO sections.
Post comparing heuristics
A comparison of the different heuristics that have been used so far.

On hold

Spaced $k$mer similarity
See this post. Not as interesting as the A*PA work.

Pending ideas/blogposts

Some ideas I have, that each deserve at least a blogpost once I find the time.

Low memory WFA backtracking
Instead of storing furthest reaching points for all wavefronts, it is sufficient to only store critical points where paths split/merge. This should lower memory usage of WFA to close to linear, without needing BiWFA.
NW/Diagonal transition with exponential search and heuristics
Think about merging NW/Diagonal transition with exponential search and heuristics. When doing exponential search, we can use a heuristic to limit the band/number of diagonals to process. If we are given the actual distance, this may lead to a linear time algorithm to prove that it is indeed the correct distance.

TODO: Figure out how to use pruning heuristics in this case – so far this only works well for non-pruning/static heuristics.

Homopolymer affine layers
We could make affine layers with a special (lower) cost that only allows homopolymer inserts/deletes. From first experiments that such a cost is not necessarily well-defined, so further evaluations are needed.
A*PA post
A post on the A*PA paper.
  • Additionally, there can be a post on how it extends/is similar to many other existing ideas/algorithms/papers.
Semi-global pairwise alignment review
similar to the global pairwise alignment review, but for semi-global / $k$-approximate string matches.
Finding (in)exact matches
This turns out to be hard to do fast. For exact matches the fastest is a hashmap mapping qgrams to locations. Instead of an absl::flat_hash_map, it may be better to handroll something and ignore/bail on collisions.

For inexact matches, everything is more complicated, but again the fastest seems to be a hashmap and looking up all mutations of a string.

A simple optimization here is to build a map over \(A\), which contains only \(n/k\) seeds, and do lookups for all \(n\) $k$mers in \(B\).

Smaller tasks

Extract A*PA benchmarks
simple coding task Extract the evals directory to a separate reusable repo with benchmarks and tests.
Find ONT reads corresponding to a reference
annoying It would be nice to test A*PA on a set of real ONT reads. We need to find a dataset where the reads are from exactly the same genome as the reference, so that no biological mutations (in particular long indels) are present in the data.
Test/bench datasets
It would be nice to have more testcases than just random ones. In particular, the following would be nice:
  • random strings with multiple long indels
  • Random with non-uniform mutations (only inserts, only deletions, …)
  • Random with different length distribution on indel length, so they are longer than the default geometric distribution.

Future plans

Exact global pairwise alignment review
Finish that post, and convert the post into a paper, once it’s done.
Extend A*PA to build on diagonal-transition
needs work Currently the algorithm is based on the vanilla \(O(nm)\) DP. Instead we can base it on the diagonal transition methods to reduce the number of states visited and the memory needed to store \(g\).

This should provide a speedup especially in regions where the linear search falls back to quadratic behaviour.

More A*PA extensions
  • Ends-free/semi-global alignment: easy I know how this would work and just needs doing.
  • Affine costs: tricky should be possible, but harder. Will be very tricky to get right (bug-free).
  • Replace gap-cost transition by letter-count-cost transition: hard very unclear how this would work, and whether the transformation can be preserved.
Review paper on semi-global pairwise alignment
low priority lots of work/time Similar to the table I made for global exact pairwise alignment, but for semi-global/mapping. There are a lot of papers in this area. Navarro (2001) also does this with a focus on $k$-approximate string matching, but it quite old by now.

Open questions

  • Can WFA/diagonal transition benefit from bit-parallel techniques? (Likely answer: No.)

Reading list

Projects to pick up

A*PA2

  • semi-global
  • affine

Minimizers

  • analyze practical small-k scheme: OC-syncmers
  • merge mod-mini and OC-syncmer
  • Lower bound for local schemes
  • Lower bound for \(1<k<w\)
  • Find exact optimal schemes for all \(k\equiv 1\pmod w\)

High throughput

k-mer indices

  • write some proof-of-concept code

References

Alanko, Jarno N., Simon J. Puglisi, and Jaakko Vuohtoniemi. 2023. “Small Searchable Κ-Spectra via Subset Rank Queries on the Spectral Burrows-Wheeler Transform.” In Siam Conference on Applied and Computational Discrete Algorithms (Acda23), 225–36. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611977714.20.
Navarro, Gonzalo. 2001. “A Guided Tour to Approximate String Matching.” Acm Computing Surveys 33 (1): 31–88. https://doi.org/10.1145/375360.375365.
Pibiri, Giulio Ermanno, and Shunsuke Kanda. 2021. “Rank/Select Queries over Mutable Bitmaps.” Information Systems 99 (July): 101756. https://doi.org/10.1016/j.is.2021.101756.
NO_ITEM_DATA:spaln3
NO_ITEM_DATA:setsketch
NO_ITEM_DATA:linear-space-four-russians
NO_ITEM_DATA:sacabench