This post builds on top of our recent preprint (Groot Koerkamp and Ivanov 2024) and gives an overview of some of my new ideas to significantly speed up exact global pairwise alignment. It's recommended that you understand the seed heuristic and match pruning before reading this post.
Two subsections contain ideas that I haven't fully thought through yet but sound promising (more promising than the main text, in fact, because they do not depend on a given path as input).
## Motivation

Figure 1: Figure 4b from our preprint: at $e = 5\%$ error rate, the seed heuristic (SH) only outperforms BiWFA from $n = 30\,\mathrm{kbp}$ onward.
In the preprint, we present an algorithm with near-linear runtime on random sequences. However, despite its near-linear runtime, it only outperforms BiWFA and Edlib for sequences of length $n \geq 30\,\mathrm{kbp}$ (Figure 1).
## Summary
The method I present here needs a candidate alignment $\pi$ of cost $c$ as input. Instead of match-pruning the heuristic on the fly, we path-prune the matches on $\pi$ ahead of time.
## Why is A* slow?
Even though our implementation uses efficient data structures, A* has to do a lot of work for each processed state:

- compute $h(u)$ for each expanded and explored state $u$;
- push and pop each state to/from the priority queue;
- store the value of $g(u)$ for each state in a hashmap.

Individually, these may not sound too bad, but taken together, the algorithm has to do a lot of work per state.
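To make this concrete, here is a minimal generic A* loop (a sketch, not A*PA's actual implementation; `h` and `edges` are hypothetical placeholders), with the three kinds of per-state work annotated:

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

type Pos = (usize, usize);

/// Minimal A*: `h` is the heuristic, `edges` returns (neighbour, edge cost).
fn astar(
    start: Pos,
    target: Pos,
    h: impl Fn(Pos) -> usize,
    edges: impl Fn(Pos) -> Vec<(Pos, usize)>,
) -> Option<usize> {
    let mut queue = BinaryHeap::new();
    let mut g: HashMap<Pos, usize> = HashMap::new();
    queue.push(Reverse((h(start), start, 0)));
    // Work 2: every state is pushed to and popped from the priority queue.
    while let Some(Reverse((_f, u, gu))) = queue.pop() {
        if g.contains_key(&u) {
            continue; // already expanded via a cheaper path
        }
        // Work 3: store g(u) for every state in a hashmap.
        g.insert(u, gu);
        if u == target {
            return Some(gu);
        }
        for (v, w) in edges(u) {
            if !g.contains_key(&v) {
                // Work 1: evaluate h for every explored state.
                queue.push(Reverse((gu + w + h(v), v, gu + w)));
            }
        }
    }
    None
}
```

With $h = 0$ this degrades to plain Dijkstra; the point is that every state costs a heap push/pop plus hashmap accesses, versus a couple of arithmetic instructions per state for a DP.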
**SIMD.** Another reason A* is not fast is that none of it parallelizes well in terms of SIMD and instruction pipelining (verification needed; I suppose it would be possible to expand a few states in parallel, but that does not sound fun at all). WFA and Edlib, on the other hand, are DP-based algorithms that benefit a lot from SIMD. In WFA, each furthest-reaching point in a wavefront can be computed independently of the others (although for linear and single affine costs, the bottleneck is actually the extend operation; thanks to Santiago for this insight).
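As an illustration (a sketch of the general recurrence, not WFA's actual code), one wavefront step for unit-cost edit distance looks like this; the loop body has no dependencies between diagonals, so it vectorizes well:

```rust
/// One wavefront step of a WFA-style recurrence for unit-cost edit distance.
/// `prev[k]` is the furthest-reaching offset (row i) on diagonal k = i - j,
/// with indices shifted so they are non-negative.
fn next_wavefront(prev: &[i32], next: &mut [i32]) {
    for k in 1..prev.len() - 1 {
        // Max over the three predecessors: substitution (same diagonal),
        // deletion (from diagonal k-1), and insertion (from diagonal k+1).
        next[k] = (prev[k] + 1).max(prev[k - 1] + 1).max(prev[k + 1]);
    }
    // The subsequent `extend` step (walking down each diagonal while
    // characters match) is data-dependent and vectorizes much less well.
}
```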
**Memory locality.** This also has to do with memory locality: A* jumps all over the place in unpredictable ways, whereas WFA processes the input in a completely predictable way. Processors are good at processing memory linearly and have to resort to slower (L2/L3) caches for random access operations. Again this does not work in our favour (again, verification needed).
## Computational volumes

Figure 2: Figure from Spouge (1989). The dotted line shows a computational volume where $g(u) + h(u) \leq 13$: $g$ is given by the numbers shown, and $h$ is $2$ per indel needed to reach the end.
We have to do less, and more predictable, work. A first step towards this is computational volumes (Spouge 1989), see Figure 2. Let's assume we already know the cost $c$ of some alignment, so that the optimal cost is at most $c$. Then it suffices to compute the distance $g(u)$ only for states $u$ in the computational volume $\{u : g(u) + h(u) \leq c\}$: states outside it can never be part of an alignment of cost at most $c$.
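A minimal sketch of this for unit-cost edit distance, assuming some admissible heuristic `h` and a known bound `c` (illustrative only, not an optimized implementation):

```rust
/// Fill the edit-distance DP column by column, but only keep states inside
/// the computational volume {u : g(u) + h(u) <= c}. With h = 0 this computes
/// a plain band of width c around the diagonal.
fn volume_dp(a: &[u8], b: &[u8], h: impl Fn(usize, usize) -> u32, c: u32) -> Option<u32> {
    const INF: u32 = u32::MAX / 2;
    let mut col = vec![INF; b.len() + 1];
    // First column: g((0, j)) = j, if inside the volume.
    for j in 0..=b.len() {
        if (j as u32) + h(0, j) <= c {
            col[j] = j as u32;
        }
    }
    for i in 1..=a.len() {
        let mut new = vec![INF; b.len() + 1];
        for j in 0..=b.len() {
            let mut g = col[j] + 1; // deletion
            if j > 0 {
                g = g.min(new[j - 1] + 1); // insertion
                g = g.min(col[j - 1] + (a[i - 1] != b[j - 1]) as u32); // (mis)match
            }
            // Skip states outside the volume.
            new[j] = if g + h(i, j) <= c { g } else { INF };
        }
        col = new;
    }
    (col[b.len()] <= c).then(|| col[b.len()])
}
```

Note that this still evaluates `h` in every candidate state; the next paragraph is about avoiding exactly that.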
**Precomputing front boundaries.** We can save more time by not evaluating $h$ in every state. Testing $g(u) + h(u) \leq c$ separately for each state is relatively expensive, so instead we can precompute where each front starts and ends before filling it.
**Order of computations.** Once we determine which states need to be computed, we can compute them in any order we like:

1. row/column-wise,
2. anti-diagonal-wise,
3. in order of $g$ (as in Dijkstra and WFA),
4. in order of $f = g + h$ (as in A*), where $h$ can be a different heuristic than the one used before.

For now my feeling is that option 3 is the fastest, but option 4 (in particular WFA with gap-cost) may need some investigation as well; the bucket queue sketched below makes both orders cheap in terms of queue operations.
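If we do process states in order of $g$ or $f$, the priorities are small integers bounded by $c$, so a bucket queue gives $O(1)$ pushes and pops instead of a binary heap's $O(\log n)$. A sketch (I believe A*PA already does something similar):

```rust
/// A bucket queue for small integer priorities in 0..=max_f.
struct BucketQueue<T> {
    buckets: Vec<Vec<T>>,
    cur: usize, // all buckets below `cur` are empty
}

impl<T> BucketQueue<T> {
    fn new(max_f: usize) -> Self {
        Self { buckets: (0..=max_f).map(|_| Vec::new()).collect(), cur: 0 }
    }
    fn push(&mut self, f: usize, x: T) {
        self.buckets[f].push(x);
        self.cur = self.cur.min(f);
    }
    fn pop(&mut self) -> Option<(usize, T)> {
        while self.cur < self.buckets.len() {
            if let Some(x) = self.buckets[self.cur].pop() {
                return Some((self.cur, x));
            }
            self.cur += 1;
        }
        None
    }
}
```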
## Dealing with pruning
So, this is all nice, but actually our near-linear runtime heavily depends on pruning. Without pruning we inevitably get a 'blow-up' (Dijkstra-like behaviour) around the start of the search, where the band keeps growing with the number of errors encountered.
A match is pruned once the state at its start is expanded. After pruning, the heuristic typically increases for most states *preceding* the match. But when processing states column by column, by the time a match is pruned, all the states that the pruning could have excluded have already been computed anyway. The solution is to prune matches right from the start: path-pruning.
Assume we already have a candidate alignment $\pi$ of cost $c$. From $\pi$ we directly know $g(u)$ for every state $u$ on it. Going over the path in reverse order, we then prune each match whose start $u$ lies on $\pi$ and satisfies $g(u) + h(u) < c$. After this process, the value of $g(u) + h(u)$ should be (close to) $c$ for all states $u$ on the path (proof needed), while $h$ should remain admissible (proof needed).
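In code, the pruning step could look roughly like this (the `Heuristic` trait and `Match` type are hypothetical stand-ins for whatever the actual heuristic implementation provides):

```rust
type Pos = (usize, usize);

struct Match {
    start: Pos,
    // ... end, seed, etc.
}

trait Heuristic {
    fn eval(&self, u: Pos) -> u32;
    /// Remove a match; this increases h for (some) states preceding it.
    fn prune(&mut self, m: &Match);
}

/// Path-prune: walk the candidate alignment (states with their g-values)
/// in reverse, pruning matches that start on it while f = g + h stays < c.
fn path_prune(
    path: &[(Pos, u32)],
    c: u32,
    h: &mut impl Heuristic,
    match_starting_at: impl Fn(Pos) -> Option<Match>,
) {
    for &(u, g_u) in path.iter().rev() {
        if let Some(m) = match_starting_at(u) {
            if g_u + h.eval(u) < c {
                h.prune(&m);
            }
        }
    }
}
```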
Now, we have a fixed (as in, not changing anymore because of pruning) heuristic, and we can apply the computational volumes technique from the previous section again.
If $c = c^*$, i.e. the candidate alignment is optimal, the resulting computational volume is as small as possible. When $c > c^*$, the search should still find an optimal alignment, at the cost of a somewhat larger volume (proof needed).
The bandwidth condition of Harris (1974) seems relevant here (Amit Patel remarked on his site that this looked useful in 1997 but he has never seen it actually being used; a nice example of how maths may only become useful much later). Our path-pruned heuristic may satisfy exactly such a condition.
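For reference, my reading of the bandwidth condition is that a bounded overestimate gives a bounded suboptimality: if the heuristic satisfies

$$ h(u) \leq h^*(u) + \varepsilon \quad \text{for all states } u, $$

where $h^*(u)$ is the true remaining distance from $u$ to the end, then A* returns a path of cost at most $c^* + \varepsilon$.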
### Thoughts on more aggressive pruning
This subsection is speculative.
**Full pruning.** Maybe it's even possible to path-prune all matches on the guessed path. That makes the heuristic inadmissible, but my feeling is that as long as we make sure to expand the start of all pruned matches at some point, this still works (proof needed).
In combination with the front-doubling approach below, this could have the additional benefit that no initial path/cost estimate is needed.
I’m not quite sure whether this actually makes sense though. After pruning all matches on the path there is nothing to guide the heuristic anymore. The search will still be pushed towards the tip, but the tip will not be pulled across long indels.
## Algorithm summary
- **Input:** some alignment $\pi$ of cost $c$.
- **Output:** an optimal alignment $\pi^*$ of cost $c^* \leq c$.
- **Algorithm:**
  1. Construct the (chaining) seed heuristic $h$.
  2. Compute $g(u)$ for all states $u$ on $\pi$.
  3. In reverse order, remove from $h$ all matches (with start $u$) on the path with $g(u) + h(u) < c$. Note: this pruning can be done directly during the construction of $h$, since contours/layers in the heuristic are also constructed backwards.
  4. Run your favourite alignment algorithm (Edlib/WFA), but after each front (i.e. column or wavefront), shrink the ends of the front as long as $g(u) + h(u) > c$ for the states $u$ at those ends.
  5. When the algorithm finishes, it will have found a shortest path.
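Step 4 in code, roughly (illustrative; `g` holds the computed values of the current column over the row range `lo..hi`):

```rust
/// Shrink a front: drop states at both ends whose f = g + h exceeds c.
fn shrink_front(g: &[u32], h: impl Fn(usize) -> u32, c: u32, lo: &mut usize, hi: &mut usize) {
    while *lo < *hi && g[*lo] + h(*lo) > c {
        *lo += 1;
    }
    while *hi > *lo && g[*hi - 1] + h(*hi - 1) > c {
        *hi -= 1;
    }
}
```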
When the input alignment $\pi$ is close to optimal, the computational volume is a narrow band around it, and the algorithm should be fast.
## Challenges
- When $c$ overestimates the actual distance $c^*$, extra work is done, since the computational volume increases in width.
- A good candidate alignment $\pi$ needs to be found. This could be done by finding the longest chain of matches and filling in the gaps using a DP approach, or by running a banded alignment algorithm.
- Computing $h$ requires building a hashmap of k-mers (or a suffix array). While that is relatively fast, it can in fact become the bottleneck when the rest of the algorithm is made more efficient. We'll have to see how this ends up after doing experiments.
- It could happen that there are two good candidate alignments that are far from each other. In this case we should split each front (column/wavefront) into two smaller intervals of states that cover the good candidate states, and skip the states in the middle with $g(u) + h(u) > c$.
## Results
For now, I only did one small experiment on this, where I compared A*PA to a non-optimized (read: very slow) implementation of WFA with a path-pruned heuristic, and even this slow WFA version already looked promising.
## What about band-doubling?
In Ukkonen (1985) and Edlib (Šošić and Šikić 2017), the band-doubling approach is used to find the optimal cost $c^*$ without knowing it in advance: run the aligner with band $B = 1, 2, 4, \dots$, doubling $B$ until $B \geq c^*$.

- Iterations with a too small band $B < c^*$ do not add a significant overhead, because of the exponential growth of the band: $1 + 2 + 4 + \dots + B \leq 2B$.
- The final iteration (the first with $B \geq c^*$) has $B < 2c^*$, which again has only a constant overhead over a band of exactly $c^*$.
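Concretely, an iteration with band $B$ costs $O(nB)$, so the total over all iterations telescopes:

$$ \sum_{i=0}^{\lceil \log_2 c^* \rceil} O\!\big(n \cdot 2^i\big) \;=\; O\!\big(n \cdot 2 \cdot 2^{\lceil \log_2 c^* \rceil}\big) \;=\; O(n c^*). $$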
Sadly, the same idea does not work as well when using a heuristic: when the threshold overshoots $c^*$, the computational volume $\{u : g(u) + h(u) \leq 2c^*\}$ can be far more than a constant factor larger than $\{u : g(u) + h(u) \leq c^*\}$, so the final iteration can dominate the runtime.
### Maybe doubling can work after all?
This subsection is speculative.
NOTE: I have now written a dedicated post about this here.
**Front-doubling.** I'm thinking that maybe band-doubling can still work in a different way: instead of doubling a global parameter, we double the size of each front (column/wavefront) whenever it needs to grow. But each front depends on previous fronts, so they need to grow as well to be able to compute the new front. So now, instead of a global threshold, each front keeps its own local threshold.
Let's assume that the size of a front roughly corresponds to the difference between the smallest and largest value of $f$ in it (or maybe the difference between the smallest and largest $g$ or $h$? Needs investigation).

- Let $f_i$ be the minimum value of $f$ in front $i$, and $F_i$ the maximum; the size of the front is $F_i - f_i$ by construction.
- To grow front $i$ up to a new maximum $F$, extend this and previous fronts up to $F$. Thus, set $F_j \gets \max(F_j, F)$ for all $j \leq i$.
- For each previous front $j$ that grows, make sure that its size (difference between $F_j$ and $f_j$) at least doubles. If not, further increase $F_j$, and additionally increase $F_{j'}$ for previous fronts $j' < j$.
Now, this should keep the total work proportional to the size of the final fronts, since every front that grows at least doubles (experiments needed).
To implement this, we keep all fronts in memory and simply grow them whenever needed.
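A speculative sketch of this bookkeeping; the `Front` fields and the exact growth rule here are my own guesses, matching the list above:

```rust
/// A front covers f-values in fmin..=fmax; its computed g-values are elided.
struct Front {
    fmin: u32,
    fmax: u32,
    // g: Vec<u32>, ...
}

/// Grow front `i` (and, transitively, the fronts before it) so that it
/// covers f-values up to `new_fmax`, at least doubling the size of every
/// front that grows so that recomputation cost amortizes.
fn grow_fronts(fronts: &mut [Front], i: usize, new_fmax: u32) {
    let mut need = new_fmax;
    for f in fronts[..=i].iter_mut().rev() {
        if f.fmax >= need {
            break; // this front and all earlier ones are already large enough
        }
        // Grow to what is needed, but at least double the current size.
        need = need.max(f.fmin + 2 * (f.fmax - f.fmin));
        f.fmax = need;
    }
}
```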
**And pruning?** I think this can also work with a path-pruned heuristic, but we need to be careful here.
I’m also hopeful that a fully path-pruned heuristic (i.e. with all matches on the path removed) can work here. The most important requirement is that we need to make sure that eventually all states at the start of a pruned match are indeed expanded. Otherwise it wouldn’t have been allowed to prune the match at all.
Maybe a middle ground between online pruning and path-pruning is possible: once a path to a match has been found, we prune the match from that point onward, and all future band-doublings take the pruned match into account. A drawback is that the pruning only happens after the current doubling of the band, so we compute too many states. But maybe, since we're only doubling on each iteration, everything is fine. Again, experiments needed.
## TODOs
- Write down the proofs that are omitted here.
- Argue more formally where A* is slow.
- A more efficient implementation of WFA with a heuristic is needed. Either I need to improve my own Rust implementation, or I need to patch it into WFA directly.
- When that's available, proper experiments need to be done with different approximate alignments $\pi$.
- The time spent in various parts of the algorithm needs to be analysed.
- We can efficiently prove the correctness of candidate alignments, but do people care?
- Write a paper. (Current ETA: Q1'23. Help with coding it is welcome.)
## Extensions
- It may be possible to use this with BiWFA, when the heuristic is used on both sides.
- Instead of doubling the threshold $t$, we could double the band when it is too small. That way, we will never do more than twice (or maybe a constant number of times) the optimal amount of work. But it's not clear yet to me in what ways doubling the band differs from increasing $t$. This requires some more thought.
## References
- Groot Koerkamp, Ragnar, and Pesho Ivanov. 2024. "Exact Global Alignment Using A* with Chaining Seed Heuristic and Match Pruning." Bioinformatics 40 (3): btae032.
- Harris, Larry R. 1974. "The Heuristic Search under Conditions of Error." Artificial Intelligence 5 (3): 217-234.
- Spouge, John L. 1989. "Speeding up Dynamic Programming Algorithms for Finding Optimal Lattice Paths." SIAM Journal on Applied Mathematics 49 (5): 1552-1566.
- Šošić, Martin, and Mile Šikić. 2017. "Edlib: A C/C++ Library for Fast, Exact Sequence Alignment Using Edit Distance." Bioinformatics 33 (9): 1394-1395.
- Ukkonen, Esko. 1985. "Algorithms for Approximate String Matching." Information and Control 64: 100-118.