Variations on the WFA recursion

cross references: BiWFA GitHub issue

In this post I will explore some variations of the recursion used by WFA/BiWFA for the affine version of the diagonal transition algorithm. In particular, we will go over a gap-close variant, and look into some more symmetric formulations.

Gap open

WFA (Marco-Sola et al. 2020) introduces the affine cost variant of the classic diagonal transition method. Let us call it a gap-open variant, because the gap-open cost \(o\) is payed when opening the gap, that is, when jumping from the \(M\) layer to the \(I\) or \(D\) layer.

Marco-Sola et al. (2022) introduces the classic divide-and-conquer approach of Hirschberg (1975) to WFA, resulting in BiWFA. Here, the WFA method is run from both ends towards the middle of the sequences until the fronts overlap. It turns out this has the drawback that the sides need to run until the overlap is at least \(o\): \(s_l + s_r \geq s + o\), where \(s_l\) and \(s_r\) are the distances ‘searched’ from the start and end respectively, \(s\) is the total cost, and \(o\) is the gap open penalty.

Note that in practice this overlap does not create any big problems, apart from a possibly slightly more tricky implementation. Nevertheless I find it theoretically displeasing.

Figure 1: Dependencies for Gap Open recursion. The rows correspond to the affine insert ((I)), linear match/substitution ((M, X)), and affine delete ((D)) layers respectively. Arrow labels are costs, i.e. the number of fronts to look back. The horizontal axis indicates consecutive cells in the direction of each layer. For the insert and main layer, this corresponds to a (+1) increase of the furthest reaching point index. Lastly, note that this uses separate costs for insertions and deletions.

Figure 1: Dependencies for Gap Open recursion. The rows correspond to the affine insert ((I)), linear match/substitution ((M, X)), and affine delete ((D)) layers respectively. Arrow labels are costs, i.e. the number of fronts to look back. The horizontal axis indicates consecutive cells in the direction of each layer. For the insert and main layer, this corresponds to a (+1) increase of the furthest reaching point index. Lastly, note that this uses separate costs for insertions and deletions.

The reason that this overlap of \(o\) is needed is because the DP is not symmetric, meaning that when running it from the start and end to some state (on the optimal path), the sum of the costs does not equal the cost of the optimal path through the state. This is caused by paying the gap-open cost \(o\) for both sides of the gap.

The original WFA recursion is this (copied from the BiWFA paper, Marco-Sola et al. (2022)):

\begin{align} I_{s,k} &= \max\big\{M_{s-o-e,k-1}+1, I_{s-e,k-1}+1\big\}\\ D_{s,k} &= \max\big\{M_{s-o-e,k+1}, D_{s-e,k+1}\big\}\\ X_{s,k} &= \max\big\{M_{s-x,k}+1, I_{s,k}, D_{s,k}\big\}\\ M_{s,k} &= X_{s,k} + LCP\big(A[X_{s,k}-k \dots], B[X_{x,k}\dots]\big) \end{align}

where \(LCP\) is the length of the longest common prefix of the two strings. This recursion is shown in Figure Figure 1.

Gap close

It is possible to change the gap-open affine cost model to a gap-close affine cost model so that they are mirrored versions of each other. This way, the meet-in-the-middle can be stopped as soon as the two fronts reach the same state.

The following new recursion only adds the cost \(o\) when the affine \(I\) and \(D\) layers are merged into the main layer \(M\) (or \(X\)), instead of adding it when branching off a matching state \(M\). (Equivalently: it pays \(o\) when leaving the affine layer, instead of when entering it.) Additions are in green and deletions in red:

\begin{align} I_{s,k} &= \max\big\{M_{s\mathbf{\color{red}-o}-e,k-1}+1, I_{s-e,k-1}+1\big\}\\ D_{s,k} &= \max\big\{M_{s\mathbf{\color{red}-o}-e,k+1}, D_{s-e,k+1}\big\}\\ X_{s,k} &= \max\big\{M_{s-x,k}+1, I_{s\mathbf{\color{lime}-o},k}, D_{s\mathbf{\color{lime}-o},k}\big\}\\ M_{s,k} &= X_{s,k} + LCP\big(A[X_{s,k}-k \dots], B[X_{x,k}\dots]\big). \end{align}

In fact, to make the gap-close cost completely equivalent to gap-open cost running in reverse, we need to preserve symmetry for all these items:

WhatGap openGap close
Open cost (\(o\))startend
Extend cost (\(e\))all but endall but start
Furthest reaching (\(+ 1\))all but endall but start
Diagonal shift (\(k\pm 1\))startend

The flexibility here comes from the fact that an indel of length \(l\) corresponds to \(l+1\) edges in the DP-graph: \(1\) edge into the affine layer, \(l-1\) edges inside the affine layer, and \(1\) edge to leave the affine layer. This means that the per-character changes (extend cost \(e\) and furthest reaching index \(+1\)) should be done for only one of entering/leaving the affine layer.

Figure 2: Dependencies for Gap Close recursion. Indeed, this is exactly the mirror image of Figure Figure 1.

Figure 2: Dependencies for Gap Close recursion. Indeed, this is exactly the mirror image of Figure Figure 1.

Thus, the fully mirrored recursion is show in Figure Figure 1 and becomes

\begin{align} I_{s,k} &= \max\big\{M_{s\mathbf{\color{red}-o-e},k\mathbf{\color{red}-1}}\,\mathbf{\color{red}+1}, I_{s-e,k-1}+1\big\}\\ D_{s,k} &= \max\big\{M_{s\mathbf{\color{red}-o-e},k\mathbf{\color{red}+1}}, D_{s-e,k+1}\big\}\\ X_{s,k} &= \max\big\{M_{s-x,k}+1, I_{s\mathbf{\color{lime}-o-e},k\mathbf{\color{lime}-1}}\,\mathbf{\color{lime}+1}, D_{s\mathbf{\color{lime}-o-e},k\mathbf{\color{lime}+1}}\big\}\\ M_{s,k} &= X_{s,k} + LCP\big(A[X_{s,k}-k \dots], B[X_{x,k}\dots]\big). \end{align}

At this point it makes more sense to reorder these equations. Sadly this becomes slightly more ugly now, having the extend phase in the middle instead of at the end.

\begin{align} X_{s,k} &= \max\big\{M_{s-x,k}+1, I_{s\mathbf{\color{lime}-o-e},k\mathbf{\color{lime}-1}}\,\mathbf{\color{lime}+1}, D_{s\mathbf{\color{lime}-o-e},k\mathbf{\color{lime}+1}}\big\}\\ M_{s,k} &= X_{s,k} + LCP\big(A[X_{s,k}-k \dots], B[X_{x,k}\dots]\big)\\ I_{s,k} &= \max\big\{M_{s\mathbf{\color{red}-o-e},k\mathbf{\color{red}-1}}\,\mathbf{\color{red}+1}, I_{s-e,k-1}+1\big\}\\ D_{s,k} &= \max\big\{M_{s\mathbf{\color{red}-o-e},k\mathbf{\color{red}+1}}, D_{s-e,k+1}\big\}\\ \end{align}

Using this formula for the reverse part of BiWFA, the forward and reverse costs to each (affine) state sum exactly to the total cost of the corresponding path through that state, simplifying the meeting conditions.

Symmetric alternatives

Figure 1: Symmetric version 1, using half costs and half furthest reaching increments.

Figure 1: Symmetric version 1, using half costs and half furthest reaching increments.

To avoid having a separate implementation for the forward and reverse DP, it may be possible to have a single, symmetric, recursion.

A first attempt at this is by incurring a cost of \((o+e)/2\) to open a gap and a cost of \((o+e)/2\) when closing a gap, and by extending the FR point by a half in each case: Figure Figure 1,

\begin{align} I_{s,k} &= \max\big\{M_{s\mathbf{\color{lime}-o/2-e/2},k-1}\,\mathbf{\color{lime}+\tfrac12}, I_{s-e,k-1}+1\big\}\\ D_{s,k} &= \max\big\{M_{s\mathbf{\color{lime}-o/2-e/2},k+1}, D_{s-e,k+1}\big\}\\ X_{s,k} &= \max\big\{M_{s-x,k}+1, I_{s\mathbf{\color{lime}-o/2-e/2},k}\,\mathbf{\color{lime}+\tfrac12}, D_{s\mathbf{\color{lime}-o/2-e/2},k}\big\}\\ M_{s,k} &= X_{s,k} + LCP\big(A[X_{s,k}-k \dots], B[X_{x,k}\dots]\big). \end{align}

However, this does not fix the inconsistency that we shift diagonals (\(k-1\)) when entering the affine layer, but not when leaving it.

Figure 1: Symmetric version 2, inlining both the gap start and gap end steps.

Figure 1: Symmetric version 2, inlining both the gap start and gap end steps.

An alternative solution (Figure Figure 1), that makes the affine path have length \(l\), is be to make \(X_{s,k}\) depend on \(I_{x-e,k-1}\) by inlining one extend step of \(I\) into \(X\). This removes the issue with having \(l\) increments for \(l+1\) edges.

\begin{align} I_{s,k} &= \max\big\{M_{s\mathbf{\color{lime}-o/2-e},k-1}\,\mathbf{\color{lime}+1}, I_{s-e,k-1}+1\big\}\\ D_{s,k} &= \max\big\{M_{s\mathbf{\color{lime}-o/2-e},k+1}, D_{s-e,k+1}\big\}\\ X_{s,k} &= \max\big\{M_{s-x,k}+1, I_{s\mathbf{\color{lime}-o/2-e},k\mathbf{\color{lime}-1}}\,\mathbf{\color{lime}+1}, D_{s\mathbf{\color{lime}-o/2-e},k\mathbf{\color{lime}+1}}, \\ &\phantom{=\max\big\{}\;\mathbf{\color{lime}M_{s-o-e, k-1}+1}, \mathbf{\color{lime}M_{s-o-e, k+1}}\big\}\\ M_{s,k} &= X_{s,k} + LCP\big(A[X_{s,k}-k \dots], B[X_{x,k}\dots]\big). \end{align}

This does not allow for length \(1\) affine indels, so those are explicitly handled separately in the linear \(X\) layer itself, as in the linear diagonal-transition algorithm.

Figure 1: Symmetric version 3, where transitions between layers do not process characters.

Figure 1: Symmetric version 3, where transitions between layers do not process characters.

Another option is to use to following recursion, that transitions between the main/linear layer and affine layers without processing any characters:

\begin{align} I_{s,k} &= \max\big\{M_{s\mathbf{\color{lime}-o/2},k\mathbf{\color{red}-1}}, I_{s-e,k-1}+1\big\}\\ D_{s,k} &= \max\big\{M_{s\mathbf{\color{lime}-o/2},k\mathbf{\color{red}+1}}, D_{s-e,k+1}\big\}\\ X_{s,k} &= \max\big\{M_{s-x,k}+1, I_{s\mathbf{\color{lime}-o/2},k\mathbf{\color{red}-1}}, D_{s\mathbf{\color{lime}-o/2},k\mathbf{\color{red}+1}}\big\} \\ M_{s,k} &= X_{s,k} + LCP\big(A[X_{s,k}-k \dots], B[X_{x,k}\dots]\big). \end{align}

This only leaves the \(o/2\) issue, which seems inevitable in any symmetric representation:

WhatGap openGap closeSymmetric 1Symmetric 2Symmetric 3
Open cost (\(o\))startend\(o/2\)\(o/2\)\(o/2\)
Extend cost (\(e\))all but endall but start\(e/2\)\(e\)\(0\)
Furthest reaching (\(+ 1\))all but endall but start\(+1/2\)\(+1\)\(0\) at start/end
Diagonal shift (\(k\pm 1\))all but endall but startall but end?!everywhere\(0\) at start/end

Note that I consider both of these variants theoretically interesting, but not practically relevant for now. Maintaining a separate forward and backward implementation seems simpler than the overhead of having fractional costs or doubling all costs.

Another symmetry

All the formulas so far have an asymmetry between the two sequences: when extending an insertion, we increase the furthest reaching point (\(f\)) by \(1\), while we do not do this for deletions. The reason is that furthest reaching points are stored by their $i$-index. Instead, we can store the sum of indices \(i+j\). This changes the value of \(f\) to simply the number of characters of both \(A\) and \(B\) processed up to this point.

The original gap-open formulation becomes:

\begin{align} I_{s,k} &= \max\big\{M_{s-o-e,k-1}+1, I_{s-e,k-1}+1\big\}\\ D_{s,k} &= \max\big\{M_{s-o-e,k+1}\,\mathbf{\color{lime}+1}, D_{s-e,k+1}\,\mathbf{\color{lime}+1}\big\}\\ X_{s,k} &= \max\big\{M_{s-x,k}+\mathbf{\color{lime}2}, I_{s,k}, D_{s,k}\big\}\\ M_{s,k} &= X_{s,k} + \mathbf{\color{lime}2\times} LCP\big(A[\mathbf{\color{lime}(}X_{s,k}\mathbf{\color{lime}+k)/2} \dots], B[\mathbf{\color{lime}(}X_{x,k}\mathbf{\color{lime}-k)/2}\dots]\big). \end{align}

Note that the length of the longest common prefix is doubled, since for each match we process two characters, one of \(A\) and one of \(B\). The end condition changes from \(f \geq |A|\) to \(f \geq |A| + |B|\).

While this formula contains more symbols, it seems more consistent to me, making it easier to understand, and less bug-prone to implement.

The only remaining difference (anti-symmetry) between \(I\) and \(D\) is whether we shift a diagonal up or down (\(k\pm1\)), which will always be needed.

Conclusions

  • To prevent having overlap \(o\) in the forward and backward DP runs, a gap-close variant of the recursion may be used.
  • Instead, a single symmetric recursion could also be used.
    • This adds cost \(o/2\) when entering/exiting an affine layer, which is problematic for odd \(o\). Doubling of costs is possible but ugly.
  • The recursion can be conceptually simplified by storing furthest reaching points as sum of their coordinates, instead of only the first coordinate.

References

Hirschberg, D. S. 1975. “A Linear Space Algorithm for Computing Maximal Common Subsequences.” Communications of the Acm 18 (6): 341–43. https://doi.org/10.1145/360825.360861.
Marco-Sola, Santiago, Jordan M Eizenga, Andrea Guarracino, Benedict Paten, Erik Garrison, and Miquel Moreto. 2022. “Optimal Gap-Affine Alignment in O(S) Space,” April. https://doi.org/10.1101/2022.04.14.488380.
Marco-Sola, Santiago, Juan Carlos Moure, Miquel Moreto, and Antonio Espinosa. 2020. “Fast gap-affine pairwise alignment using the wavefront algorithm.” Bioinformatics 37 (4): 456–63. https://doi.org/10.1093/bioinformatics/btaa777.