1 Early idea: Bottom-up match-merging (aka BUMMer?) Link to heading
TODO: Move to separate post.
One thing that becomes clear with mapping is that we don’t quite know where exactly to start the semi-global alignments. This can be fixed by adding some buffer/padding, but this remains slightly ugly and iffy.
Instead, I’m going to attempt to explain a new approach here. Some details are still a bit unclear to me on how exactly they’d work, but I have good hope it can all be worked out.
1.1 Some previous ideas Link to heading
Instead, we can use the following approach, which is a natural evolution/convergence of a few previous ideas:
pre-pruning (or local-pruning; I haven’t been very consistent with the name)
The idea here is that a k-mer match gives us information that this seed can be traversed for free. The lack of a match implies cost at least 1. When a match is followed by noise, and thus can not be extended into an alignment of two seeds with cost \(<2\), we can discard it, because the promise that there would be a good alignment (ie, relative cost \(<1/k\)) is not held.
- see A*PA2 paper (Groot Koerkamp 2024) (PDF) or blogpost
path-pruning (blogpost): if we already know some alignment, which is not necessarily optimal, we can use that to either find a better one or prove optimality: we can find all places at the start of a match where the heuristic is smaller than the actual remaining distance, and remove those matches. Again, these matches ‘‘promise’’ that the remainder of the alignment can be done in cost \(<1/k\), but we should avoid to over-promise.
After path-pruning some matches, we run the alignment as usual, until the end of the original path is reached. Either the guessed path is then optimal, or the optimal path will have been found.
local-doubling (blogpost): a drawback of path-pruning is that first we must find a path somehow, and then we must run the alignment again with the improved heuristic. Local-doubling attempts to fix this by increasing the band of the alignment locally as needed.
It gives nice figures, but I never quite got it to work reliably.
1.2 Divide & conquer Link to heading
Another common technique for pairwise alignment is Hirschberg’s divide & conquer approach (Hirschberg 1975). This find the distance to the middle column from the left and right. There, a splitting point on the optimal alignment is found, and we recurse into the two half-sized sub problems.
1.3 Bottom-up match merging (BUMMer) Link to heading
Initially, we have a set of many matches, including some spurious ones. As we already saw with pre-pruning and path-pruning, if a match covering 1 seed does not into an alignment of cost \(<2\) covering \(2\) seeds, we might as well discard it. Then, if it does not extend into an alignment of cost \(<4\) covering 4 seeds, we can again discard it.
A slightly more principled approach is as follows:
Consider a binary tree on the seeds.
Initially the leafs correspond to a k-mer (seed) of the text, and the matches for that seed.
Then, we go up one level and see if we can merge adjacent matches. If so, we get a new match spanning two seeds, with margin \(2\) (because the two matches have cost \(0\), which is \(2\) below the number of seeds covered).
Otherwise, it may be possible to extend a match of the left seed to also cover the right seed for cost \(1\), creating a match covering the two seeds with margin \(1\). Similarly, a right-match might be extended into the left seed.
Because an alignment of \(2^{k+1}\) seeds with cost \(<2^{k+1}\) must have cost \(<2^k\) in either the left or right half, this procedure finds all such \(2^{k+1}\)-matches by only starting with single k-mer matches.
Eventually we extend our matches into a full alignment of the pattern and we’re done.
One core idea here is this: if you have a long run of matches, these build up a bunch of margin \(a\), that can then be spend by aligning through a region with up to \(a\) noise. In the end, the complexity will be something like \(\sum_a a^2\).
In fact, maybe this ends up exactly similar to A*PA, but faster because it doesn’t actually do the relatively slow A* bit. But I’m not sure yet; we’ll see.
Tricky bits. What I haven’t figured out yet:
- We need to efficiently merge matches for consecutive seeds. Maybe a simple lower bound like the seed heuristic (that ignores the \(j\) coordinate) is good enough, but it would be interesting to see if we can design some algo/datastructure for efficiently merging matches.
- Reconstructing traces from output costs: suppose we take a semi global alignment and run it once top-to-bottom and once bottom-to-top. Can we infer from this information the start and end points of all locally-optimal alignment traces?