Ongoing and future research
Here I list projects that I’m currently working on, and ideas for future work.
This page is usually outdated.
In progress
- A* pairwise aligner [GitHub]
- Exact global pairwise alignment of random strings in
expected linear time.
Contains proof of correctness, implementation, evals and comparison with WFA and edlib on random data.
- Proof of expected linear time alignment
- I have a proof of concept to show that a simplified version of the algorithm currently implemented by A* pairwise aligner runs in expected linear time on random input with sufficiently low edit distance (\(|\Sigma|^{1/e} \ll n\)), but need to spend some time on details and writing it down. WIP post: linear time pairwise alignment.
- Finish review post
- Write text on diagonal transition and fill in other remaining TODO sections.
- Post comparing heuristics
- A comparison of the different heuristics that have been used so far.
On hold
- Spaced $k$mer similarity
- See this post. Not as interesting as the A*PA work.
Pending ideas/blogposts
Some ideas I have, that each deserve at least a blogpost once I find the time.
- Low memory WFA backtracking
- Instead of storing furthest reaching points for all wavefronts, it is sufficient to only store critical points where paths split/merge. This should lower memory usage of WFA to close to linear, without needing BiWFA.
- NW/Diagonal transition with exponential search and heuristics
- Think about merging NW/Diagonal transition with exponential search and
heuristics.
When doing exponential search, we can use a heuristic to limit the band/number
of diagonals to process. If we are given the actual distance, this may lead to
a linear time algorithm to prove that it is indeed the correct distance.
TODO: Figure out how to use pruning heuristics in this case – so far this only works well for non-pruning/static heuristics.
- Homopolymer affine layers
- We could make affine layers with a special (lower) cost that only allows homopolymer inserts/deletes. From first experiments that such a cost is not necessarily well-defined, so further evaluations are needed.
- A*PA post
- A post on the A*PA paper.
- Additionally, there can be a post on how it extends/is similar to many other existing ideas/algorithms/papers.
- Semi-global pairwise alignment review
- similar to the global pairwise alignment review, but for semi-global / $k$-approximate string matches.
- Finding (in)exact matches
- This turns out to be hard to do fast. For exact matches the fastest is a
hashmap mapping qgrams to locations. Instead of an
absl::flat_hash_map
, it may be better to handroll something and ignore/bail on collisions.For inexact matches, everything is more complicated, but again the fastest seems to be a hashmap and looking up all mutations of a string.
A simple optimization here is to build a map over \(A\), which contains only \(n/k\) seeds, and do lookups for all \(n\) $k$mers in \(B\).
Smaller tasks
- Extract A*PA benchmarks
simple coding task
Extract theevals
directory to a separate reusable repo with benchmarks and tests.- Find ONT reads corresponding to a reference
annoying
It would be nice to test A*PA on a set of real ONT reads. We need to find a dataset where the reads are from exactly the same genome as the reference, so that no biological mutations (in particular long indels) are present in the data.- Test/bench datasets
- It would be nice to have more testcases than just random ones. In particular,
the following would be nice:
- random strings with multiple long indels
- Random with non-uniform mutations (only inserts, only deletions, …)
- Random with different length distribution on indel length, so they are longer than the default geometric distribution.
Future plans
- Exact global pairwise alignment review
- Finish that post, and convert the post into a paper, once it’s done.
- Extend A*PA to build on diagonal-transition
needs work
Currently the algorithm is based on the vanilla \(O(nm)\) DP. Instead we can base it on the diagonal transition methods to reduce the number of states visited and the memory needed to store \(g\).This should provide a speedup especially in regions where the linear search falls back to quadratic behaviour.
- More A*PA extensions
- Ends-free/semi-global alignment:
easy
I know how this would work and just needs doing. - Affine costs:
tricky
should be possible, but harder. Will be very tricky to get right (bug-free). - Replace gap-cost transition by letter-count-cost transition:
hard
very unclear how this would work, and whether the transformation can be preserved.
- Ends-free/semi-global alignment:
- Review paper on semi-global pairwise alignment
low priority
lots of work/time
Similar to the table I made for global exact pairwise alignment, but for semi-global/mapping. There are a lot of papers in this area. Navarro (2001) also does this with a focus on $k$-approximate string matching, but it quite old by now.
Open questions
- Can WFA/diagonal transition benefit from bit-parallel techniques? (Likely answer: No.)
Reading list
Succinct datastructures:
- Rank and select over mutable bitmaps (Pibiri and Kanda 2021)
Bounded context BWT
counting-quotient-filter
r-index
syncmers
fmalign2
local-kmer-selection
count-min-sketch (with conservative updates)
- count-min-sketch.pdf
- efficient-kmer-counting.pdf
tinted dbg
function-assigned masked superstrings
Turning unit cost into affine cost alignment?
- Maybe by doubling and refining costs in each iteration? similar to cost-scaling flow algorithms?
(NO_ITEM_DATA:spaln3)
(NO_ITEM_DATA:setsketch)
(NO_ITEM_DATA:linear-space-four-russians)
Projects to pick up
A*PA2
- semi-global
- affine
Minimizers
- analyze practical small-k scheme: OC-syncmers
- merge mod-mini and OC-syncmer
- Lower bound for local schemes
- Lower bound for \(1<k<w\)
- Find exact optimal schemes for all \(k\equiv 1\pmod w\)
High throughput
- SSHash-rs
- Suffix array binary searching (PLA-index)
- Optimize https://github.com/acubeLab/Deep-Shallow-Suffix-Sorting; cf (NO_ITEM_DATA:sacabench)
- fast minimizers SEA paper
- ptrhash SEA paper
- PACE 25
k-mer indices
- write some proof-of-concept code