Diamond optimisation for diagonal transition

Diamond transition or how technicalities can break concepts

We assume the reader has some basic knowledge about pairwise alignment and in particular the WFA algorithm.

In this post we dive into a potential 2x speedup of WFA — one that turns out not to work.

Let’s take a look at one of the most important and efficient algorithms for pairwise alignment: WFA (Marco-Sola et al. 2020). In Table 1, which copies the style of Figure 1 in Eizenga and Paten (2022), rows are wavefronts and columns are diagonals. Light-blue states are stored in memory, green shows the cell currently being computed, and dark-blue shows the cells the green cell depends on.

If a path is not required, we can use the linear-memory optimisation, as shown in the right figure.1

WFA algorithm, without and with linear-memory optimisation.

WFA is especially well suited to parallel computation (which is extremely fast using SIMD), as shown in Table 2. The parallel wavefront computation can also use the linear-memory optimisation, as shown on the right.

Parallel WFA algorithm, without and with linear-memory optimisation.
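To make that independence concrete, here is a small sketch in C of the “compute” half of one wavefront step. The names (`wf_compute_step`, `old_fr`, `new_fr`) are made up for this example; the point is only that each diagonal reads the previous wavefront and nothing else, so the loop has no carried dependencies and vectorises well:

```c
/* One "compute" step over diagonals lo..hi (array indices, with sentinel
 * entries at lo-1 and hi+1). Each iteration reads only the PREVIOUS
 * wavefront, so iterations are independent and SIMD-friendly; the
 * per-diagonal "extend" along matching characters runs afterwards. */
static void wf_compute_step(const int *old_fr, int *new_fr, int lo, int hi) {
    for (int k = lo; k <= hi; k++) {
        int ins = old_fr[k - 1] + 1;   /* from the diagonal on the left  */
        int sub = old_fr[k] + 1;       /* from the same diagonal         */
        int del = old_fr[k + 1];       /* from the diagonal on the right */
        int best = ins > sub ? ins : sub;
        new_fr[k] = best > del ? best : del;
    }
}
```

Because no iteration depends on another, a compiler can turn this loop into straight SIMD max instructions.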

The following idea for improving this algorithm may cross your mind: from the very beginning, we already know on which diagonal (column) the answer will appear:

$$answer\_diagonal = length\_of\_first\_seq - length\_of\_second\_seq.$$

This “answer” diagonal is highlighted in red in the figures. So why do we compute the whole triangle? We never use the bottom-left and bottom-right corners, yet we waste time on them. Let’s try to compute only the cells needed to explore new cells on the answer diagonal. Since every cell depends on at most three cells in the previous wavefront, we do not need to store the whole triangle, only part of it. Figure 1 shows this.

The covered area is shaped like a diamond, so let’s call this the Diamond optimisation. (The name was first coined by Santiago and his team.)

Fewer cells to store means a memory optimisation; fewer cells to compute, a runtime optimisation! It should be at least two times faster! It looks like it can be parallelised just like WFA (Table 1).

We also do not need to store all states if we do not need to trace back the path, so we can apply the linear-memory optimisation here as well.

Parallel diamond optimisation, without and with linear-memory optimisation.

Looks like an amazing alternative to WFA!

But let’s take a closer look

Let’s take a look at Figure 1 again.

Within one layer, we first need to cover one side of the diamond, then the other. Stunts like this make the implementation harder and more error-prone.

The second detail is that cells in one layer depend on earlier cells in the SAME layer. What does this mean? It means we can say “Auf Wiedersehen” to parallel computation. In contrast, in regular WFA the current layer depends ONLY on previous layers, so all cells in a layer can be computed at the same time (in parallel).

And finally, for WFA the formula is very simple: take the maximum over the cells in the previous layer on the same diagonal and on the diagonals to its left and right, then extend. (I am simplifying a bit here: you usually also add a constant to some of these values, but the formula is still straightforward.)
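As an illustration of that recurrence, here is a hedged sketch in C of the simplest, unit-cost (non-affine) wavefront computation. Every name here (`wf_edit_distance`, `old_fr`, `new_fr`) is invented for the example, and real WFA additionally tracks gap-affine penalties with several wavefront components per score:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: unit-cost edit distance via wavefronts. fr[k] stores the
 * furthest-reaching column i on diagonal k = i - j; -1 marks a diagonal
 * not reached at the current score. */
static int wf_edit_distance(const char *a, const char *b) {
    int n = (int)strlen(a), m = (int)strlen(b);
    int shift = m + 1, width = n + m + 3;      /* diagonals -m..n + sentinels */
    int *old_fr = malloc(width * sizeof *old_fr);
    int *new_fr = malloc(width * sizeof *new_fr);
    for (int x = 0; x < width; x++) old_fr[x] = new_fr[x] = -1;

    /* score 0: start at (0,0) on diagonal 0 and extend along matches */
    int i = 0;
    while (i < n && i < m && a[i] == b[i]) i++;
    old_fr[0 + shift] = i;
    int score = -1;
    if (n == m && i == n) score = 0;

    for (int s = 1; score < 0; s++) {
        int lo = -(s < m ? s : m), hi = (s < n ? s : n);
        for (int k = lo; k <= hi; k++) {
            int best = -1, o;
            o = old_fr[k - 1 + shift];                 /* insertion: i+1, same j */
            if (o >= 0 && o + 1 <= n && o + 1 > best) best = o + 1;
            o = old_fr[k + shift];                     /* mismatch: i+1, j+1 */
            if (o >= 0 && o + 1 <= n && o + 1 - k <= m && o + 1 > best) best = o + 1;
            o = old_fr[k + 1 + shift];                 /* deletion: same i, j+1 */
            if (o >= 0 && o - k <= m && o > best) best = o;
            if (best >= 0)                             /* extend along matches */
                while (best < n && best - k < m && a[best] == b[best - k]) best++;
            new_fr[k + shift] = best;
            if (k == n - m && best == n) { score = s; break; }
        }
        int *tmp = old_fr; old_fr = new_fr; new_fr = tmp;
    }
    free(old_fr); free(new_fr);
    return score;
}
```

Note how the compute step (the three-way maximum) touches only `old_fr`, which is exactly the property the diamond variant loses.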

In the Diamond optimisation, on the other hand, you need to work with three layers at the same time and handle special edge cases for cells near the answer diagonal. So both figures in Table 1 are completely wrong and give hope only at first sight.

Furthermore, in the Diamond optimisation we can only extend one diagonal at a time (or possibly two, one on each side), while in WFA we can extend all diagonals in the same layer simultaneously, since they do not depend on each other. This becomes a bottleneck: the for-loop over each side of the diamond now has iterations that depend on each other, so the next cell can only be computed after the previous one. Also, each iteration contains an extend step, resulting in a hard-to-predict memory access pattern.
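To see why this kills vectorisation, consider a toy sketch of a side sweep. The name `diamond_side_sweep` and the recurrence are invented for illustration (the real diamond update also involves three layers and edge cases); the sketch only shows the loop-carried dependency itself:

```c
/* Toy sketch of the structural problem: fr[k] needs fr[k - 1] from the
 * SAME sweep, so iterations must run strictly in order and the compiler
 * cannot vectorise the loop. */
static void diamond_side_sweep(int *fr, int lo, int hi) {
    for (int k = lo + 1; k <= hi; k++) {
        int prev = fr[k - 1];          /* written by the previous iteration */
        if (prev + 1 > fr[k]) fr[k] = prev + 1;
    }
}
```

Contrast this with the WFA compute step, where every iteration reads only the previous wavefront and none of its neighbours in the current one.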

Overall, this is simply too complex to optimise well.

Conclusion

So this potential double speed-up pales in comparison to the hurdles and obstacles this “improvement” brings. More importantly, you can get approximately the same effect by using BiWFA (Marco-Sola et al. 2022) (in short, running WFA from both sides), which keeps all the benefits of regular WFA while reducing space and time usage.

References

Eizenga, Jordan M., and Benedict Paten. 2022. “Improving the Time and Space Complexity of the WFA Algorithm and Generalizing Its Scoring,” January. https://doi.org/10.1101/2022.01.12.476087.
Marco-Sola, Santiago, Jordan M Eizenga, Andrea Guarracino, Benedict Paten, Erik Garrison, and Miquel Moreto. 2022. “Optimal Gap-Affine Alignment in O(S) Space,” April. https://doi.org/10.1101/2022.04.14.488380.
Marco-Sola, Santiago, Juan Carlos Moure, Miquel Moreto, and Antonio Espinosa. 2020. “Fast gap-affine pairwise alignment using the wavefront algorithm.” Bioinformatics 37 (4): 456–63. https://doi.org/10.1093/bioinformatics/btaa777.

1. The animated SVG figures and the C code used to create them are available in the repository.