Ongoing and future research

Table of Contents

In progress
On hold
Pending ideas/blogposts
Smaller tasks
Future plans
Open questions
Reading list
Projects to pick up

Here I list projects that I’m currently working on, and ideas for future work.

This page is usually outdated.

In progress Link to heading

A* pairwise aligner [GitHub]: Exact global pairwise alignment of random strings in expected linear time.
Contains proof of correctness, implementation, evals and comparison with WFA and edlib on random data.
Proof of expected linear time alignment: I have a proof of concept to show that a simplified version of the algorithm currently implemented by A* pairwise aligner runs in expected linear time on random input with sufficiently low edit distance ($|\Sigma|^{1/e} \ll n$), but need to spend some time on details and writing it down. WIP post: linear time pairwise alignment.
Finish review post: Write text on diagonal transition and fill in other remaining TODO sections.
Post comparing heuristics: A comparison of the different heuristics that have been used so far.

On hold Link to heading

Spaced $k$mer similarity: See this post. Not as interesting as the A*PA work.

Pending ideas/blogposts Link to heading

Some ideas I have, that each deserve at least a blogpost once I find the time.

Low memory WFA backtracking

Instead of storing furthest reaching points for all wavefronts, it is sufficient to only store critical points where paths split/merge. This should lower memory usage of WFA to close to linear, without needing BiWFA.

NW/Diagonal transition with exponential search and heuristics

Think about merging NW/Diagonal transition with exponential search and heuristics. When doing exponential search, we can use a heuristic to limit the band/number of diagonals to process. If we are given the actual distance, this may lead to a linear time algorithm to prove that it is indeed the correct distance.

TODO: Figure out how to use pruning heuristics in this case – so far this only works well for non-pruning/static heuristics.

Homopolymer affine layers

We could make affine layers with a special (lower) cost that only allows homopolymer inserts/deletes. From first experiments that such a cost is not necessarily well-defined, so further evaluations are needed.

A*PA post

A post on the A*PA paper.

Additionally, there can be a post on how it extends/is similar to many other existing ideas/algorithms/papers.

Semi-global pairwise alignment review

similar to the global pairwise alignment review, but for semi-global / $k$-approximate string matches.

Finding (in)exact matches

This turns out to be hard to do fast. For exact matches the fastest is a hashmap mapping qgrams to locations. Instead of an absl::flat_hash_map, it may be better to handroll something and ignore/bail on collisions.

For inexact matches, everything is more complicated, but again the fastest seems to be a hashmap and looking up all mutations of a string.

A simple optimization here is to build a map over $A$, which contains only $n/k$ seeds, and do lookups for all $n$ $k$mers in $B$.

Smaller tasks Link to heading

Extract A*PA benchmarks

simple coding task Extract the evals directory to a separate reusable repo with benchmarks and tests.

Find ONT reads corresponding to a reference

annoying It would be nice to test A*PA on a set of real ONT reads. We need to find a dataset where the reads are from exactly the same genome as the reference, so that no biological mutations (in particular long indels) are present in the data.

Test/bench datasets

It would be nice to have more testcases than just random ones. In particular, the following would be nice:

random strings with multiple long indels
Random with non-uniform mutations (only inserts, only deletions, …)
Random with different length distribution on indel length, so they are longer than the default geometric distribution.

Future plans Link to heading

Exact global pairwise alignment review

Finish that post, and convert the post into a paper, once it’s done.

Extend A*PA to build on diagonal-transition

needs work Currently the algorithm is based on the vanilla $O(nm)$ DP. Instead we can base it on the diagonal transition methods to reduce the number of states visited and the memory needed to store $g$.

This should provide a speedup especially in regions where the linear search falls back to quadratic behaviour.

More A*PA extensions

Ends-free/semi-global alignment: easy I know how this would work and just needs doing.
Affine costs: tricky should be possible, but harder. Will be very tricky to get right (bug-free).
Replace gap-cost transition by letter-count-cost transition: hard very unclear how this would work, and whether the transformation can be preserved.

Review paper on semi-global pairwise alignment

low priority lots of work/time Similar to the table I made for global exact pairwise alignment, but for semi-global/mapping. There are a lot of papers in this area. Navarro (2001) also does this with a focus on $k$-approximate string matching, but it quite old by now.

Open questions Link to heading

Can WFA/diagonal transition benefit from bit-parallel techniques? (Likely answer: No.)

Reading list Link to heading

Succinct datastructures:
- Rank and select over mutable bitmaps (Pibiri and Kanda 2021)
Bounded context BWT
- https://twitter.com/jnalanko/status/1747585917861208266
- https://github.com/jhhung/kISS
- k-context BWT: (Alanko, Puglisi, and Vuohtoniemi 2023)
counting-quotient-filter
r-index
syncmers
fmalign2
local-kmer-selection
count-min-sketch (with conservative updates)
- count-min-sketch.pdf
- efficient-kmer-counting.pdf
tinted dbg
function-assigned masked superstrings
Turning unit cost into affine cost alignment?
- Maybe by doubling and refining costs in each iteration? similar to cost-scaling flow algorithms?
(NO_ITEM_DATA:spaln3)
(Ertl 2021)
(NO_ITEM_DATA:linear-space-four-russians)
(NO_ITEM_DATA:multi-context-seeds)
(NO_ITEM_DATA:from-superstring-to-indexing)
(NO_ITEM_DATA:col-bwt)
https://sedgewick.io/ideas/#computer-science-for-all
Eskemap for A*PA?
Lyndon trees: https://arxiv.org/abs/2406.16475, Giuseppe Romana
ForAlign
U-index variant: https://github.com/mranisz/samsami
https://www.sciencedirect.com/science/article/pii/S0304397522007642?casa_token=s43fN-Spjf0AAAAA:9i3HPEfmAs0X7eYvglMmam7eBH52tMAvqAeTEnEBlpVn8YJb1WoGR-wiHv5yfAOA_t4sgI7HlZ8
(NO_ITEM_DATA:bwt-compression)
texrex: https://academic.oup.com/nargab/article/7/2/lqaf039/8115380?login=false
ETH talk by Jeff Dean

Projects to pick up Link to heading

A*PA2 Link to heading

semi-global
affine

Minimizers Link to heading

analyze practical small-k scheme: OC-syncmers
merge mod-mini and OC-syncmer
Lower bound for local schemes
Lower bound for $1<k<w$
Find exact optimal schemes for all $k\equiv 1\pmod w$

High throughput Link to heading

SSHash-rs
Suffix array binary searching (PLA-index)
Optimize https://github.com/acubeLab/Deep-Shallow-Suffix-Sorting; cf (NO_ITEM_DATA:sacabench)
fast minimizers SEA paper
ptrhash SEA paper
PACE 25

k-mer indices Link to heading

write some proof-of-concept code

References Link to heading

Alanko, Jarno N., Simon J. Puglisi, and Jaakko Vuohtoniemi. 2023. “Small Searchable -Spectra via Subset Rank Queries on the Spectral Burrows-Wheeler Transform.” In Siam Conference on Applied and Computational Discrete Algorithms (Acda23), 225–36. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611977714.20.

Ertl, Otmar. 2021. “Setsketch: Filling the Gap between Minhash and Hyperloglog.” Proceedings of the Vldb Endowment 14 (11): 2244–57. https://doi.org/10.14778/3476249.3476276.

Navarro, Gonzalo. 2001. “A Guided Tour to Approximate String Matching.” Acm Computing Surveys 33 (1): 31–88. https://doi.org/10.1145/375360.375365.

Pibiri, Giulio Ermanno, and Shunsuke Kanda. 2021. “Rank/Select Queries over Mutable Bitmaps.” Information Systems 99 (July): 101756. https://doi.org/10.1016/j.is.2021.101756.

NO_ITEM_DATA:spaln3

NO_ITEM_DATA:linear-space-four-russians

NO_ITEM_DATA:multi-context-seeds

NO_ITEM_DATA:from-superstring-to-indexing

NO_ITEM_DATA:col-bwt

NO_ITEM_DATA:bwt-compression

NO_ITEM_DATA:sacabench