cross references: BiWFA GitHub issue
In this post I will explore some variations of the recursion used by WFA/BiWFA for the affine version of the diagonal transition algorithm. In particular, we will go over a gap-close variant, and look into some more symmetric formulations.
Gap open
WFA (Marco-Sola et al. 2020) introduces the affine cost variant of the classic diagonal transition method. Let us call it a gap-open variant, because the gap-open cost \(o\) is payed when opening the gap, that is, when jumping from the \(M\) layer to the \(I\) or \(D\) layer.
Marco-Sola et al. (2023) introduces the classic divide-and-conquer approach of Hirschberg (1975) to WFA, resulting in BiWFA. Here, the WFA method is run from both ends towards the middle of the sequences until the fronts overlap. It turns out this has the drawback that the sides need to run until the overlap is at least \(o\): \(s_l + s_r \geq s + o\), where \(s_l\) and \(s_r\) are the distances ‘searched’ from the start and end respectively, \(s\) is the total cost, and \(o\) is the gap open penalty.
Note that in practice this overlap does not create any big problems, apart from a possibly slightly more tricky implementation. Nevertheless I find it theoretically displeasing.
The reason that this overlap of \(o\) is needed is because the DP is not symmetric, meaning that when running it from the start and end to some state (on the optimal path), the sum of the costs does not equal the cost of the optimal path through the state. This is caused by paying the gap-open cost \(o\) for both sides of the gap.
The original WFA recursion is this (copied from the BiWFA paper, Marco-Sola et al. (2023)):
\begin{align} I_{s,k} &= \max\big\{M_{s-o-e,k-1}+1, I_{s-e,k-1}+1\big\}\\ D_{s,k} &= \max\big\{M_{s-o-e,k+1}, D_{s-e,k+1}\big\}\\ X_{s,k} &= \max\big\{M_{s-x,k}+1, I_{s,k}, D_{s,k}\big\}\\ M_{s,k} &= X_{s,k} + LCP\big(A[X_{s,k}-k \dots], B[X_{x,k}\dots]\big) \end{align}
where \(LCP\) is the length of the longest common prefix of the two strings. This recursion is shown in Figure Figure 1.
Gap close
It is possible to change the gap-open affine cost model to a gap-close affine cost model so that they are mirrored versions of each other. This way, the meet-in-the-middle can be stopped as soon as the two fronts reach the same state.
The following new recursion only adds the cost \(o\) when the affine \(I\) and \(D\) layers are merged into the main layer \(M\) (or \(X\)), instead of adding it when branching off a matching state \(M\). (Equivalently: it pays \(o\) when leaving the affine layer, instead of when entering it.) Additions are in green and deletions in red:
\begin{align} I_{s,k} &= \max\big\{M_{s\mathbf{\color{red}-o}-e,k-1}+1, I_{s-e,k-1}+1\big\}\\ D_{s,k} &= \max\big\{M_{s\mathbf{\color{red}-o}-e,k+1}, D_{s-e,k+1}\big\}\\ X_{s,k} &= \max\big\{M_{s-x,k}+1, I_{s\mathbf{\color{lime}-o},k}, D_{s\mathbf{\color{lime}-o},k}\big\}\\ M_{s,k} &= X_{s,k} + LCP\big(A[X_{s,k}-k \dots], B[X_{x,k}\dots]\big). \end{align}
In fact, to make the gap-close cost completely equivalent to gap-open cost running in reverse, we need to preserve symmetry for all these items:
What | Gap open | Gap close |
---|---|---|
Open cost (\(o\)) | start | end |
Extend cost (\(e\)) | all but end | all but start |
Furthest reaching (\(+ 1\)) | all but end | all but start |
Diagonal shift (\(k\pm 1\)) | start | end |
The flexibility here comes from the fact that an indel of length \(l\) corresponds to \(l+1\) edges in the DP-graph: \(1\) edge into the affine layer, \(l-1\) edges inside the affine layer, and \(1\) edge to leave the affine layer. This means that the per-character changes (extend cost \(e\) and furthest reaching index \(+1\)) should be done for only one of entering/leaving the affine layer.
Thus, the fully mirrored recursion is show in Figure Figure 2 and becomes
\begin{align} I_{s,k} &= \max\big\{M_{s\mathbf{\color{red}-o-e},k\mathbf{\color{red}-1}}\,\mathbf{\color{red}+1}, I_{s-e,k-1}+1\big\}\\ D_{s,k} &= \max\big\{M_{s\mathbf{\color{red}-o-e},k\mathbf{\color{red}+1}}, D_{s-e,k+1}\big\}\\ X_{s,k} &= \max\big\{M_{s-x,k}+1, I_{s\mathbf{\color{lime}-o-e},k\mathbf{\color{lime}-1}}\,\mathbf{\color{lime}+1}, D_{s\mathbf{\color{lime}-o-e},k\mathbf{\color{lime}+1}}\big\}\\ M_{s,k} &= X_{s,k} + LCP\big(A[X_{s,k}-k \dots], B[X_{x,k}\dots]\big). \end{align}
At this point it makes more sense to reorder these equations. Sadly this becomes slightly more ugly now, having the extend phase in the middle instead of at the end.
\begin{align} X_{s,k} &= \max\big\{M_{s-x,k}+1, I_{s\mathbf{\color{lime}-o-e},k\mathbf{\color{lime}-1}}\,\mathbf{\color{lime}+1}, D_{s\mathbf{\color{lime}-o-e},k\mathbf{\color{lime}+1}}\big\}\\ M_{s,k} &= X_{s,k} + LCP\big(A[X_{s,k}-k \dots], B[X_{x,k}\dots]\big)\\ I_{s,k} &= \max\big\{M_{s\mathbf{\color{red}-o-e},k\mathbf{\color{red}-1}}\,\mathbf{\color{red}+1}, I_{s-e,k-1}+1\big\}\\ D_{s,k} &= \max\big\{M_{s\mathbf{\color{red}-o-e},k\mathbf{\color{red}+1}}, D_{s-e,k+1}\big\}\\ \end{align}
Using this formula for the reverse part of BiWFA, the forward and reverse costs to each (affine) state sum exactly to the total cost of the corresponding path through that state, simplifying the meeting conditions.
Symmetric alternatives
To avoid having a separate implementation for the forward and reverse DP, it may be possible to have a single, symmetric, recursion.
A first attempt at this is by incurring a cost of \((o+e)/2\) to open a gap and a cost of \((o+e)/2\) when closing a gap, and by extending the FR point by a half in each case: Figure Figure 3,
\begin{align} I_{s,k} &= \max\big\{M_{s\mathbf{\color{lime}-o/2-e/2},k-1}\,\mathbf{\color{lime}+\tfrac12}, I_{s-e,k-1}+1\big\}\\ D_{s,k} &= \max\big\{M_{s\mathbf{\color{lime}-o/2-e/2},k+1}, D_{s-e,k+1}\big\}\\ X_{s,k} &= \max\big\{M_{s-x,k}+1, I_{s\mathbf{\color{lime}-o/2-e/2},k}\,\mathbf{\color{lime}+\tfrac12}, D_{s\mathbf{\color{lime}-o/2-e/2},k}\big\}\\ M_{s,k} &= X_{s,k} + LCP\big(A[X_{s,k}-k \dots], B[X_{x,k}\dots]\big). \end{align}
However, this does not fix the inconsistency that we shift diagonals (\(k-1\)) when entering the affine layer, but not when leaving it.
An alternative solution (Figure Figure 4), that makes the affine path have length \(l\), is be to make \(X_{s,k}\) depend on \(I_{x-e,k-1}\) by inlining one extend step of \(I\) into \(X\). This removes the issue with having \(l\) increments for \(l+1\) edges.
\begin{align} I_{s,k} &= \max\big\{M_{s\mathbf{\color{lime}-o/2-e},k-1}\,\mathbf{\color{lime}+1}, I_{s-e,k-1}+1\big\}\\ D_{s,k} &= \max\big\{M_{s\mathbf{\color{lime}-o/2-e},k+1}, D_{s-e,k+1}\big\}\\ X_{s,k} &= \max\big\{M_{s-x,k}+1, I_{s\mathbf{\color{lime}-o/2-e},k\mathbf{\color{lime}-1}}\,\mathbf{\color{lime}+1}, D_{s\mathbf{\color{lime}-o/2-e},k\mathbf{\color{lime}+1}}, \\ &\phantom{=\max\big\{}\;\mathbf{\color{lime}M_{s-o-e, k-1}+1}, \mathbf{\color{lime}M_{s-o-e, k+1}}\big\}\\ M_{s,k} &= X_{s,k} + LCP\big(A[X_{s,k}-k \dots], B[X_{x,k}\dots]\big). \end{align}
This does not allow for length \(1\) affine indels, so those are explicitly handled separately in the linear \(X\) layer itself, as in the linear diagonal-transition algorithm.
Another option is to use to following recursion, that transitions between the main/linear layer and affine layers without processing any characters:
\begin{align} I_{s,k} &= \max\big\{M_{s\mathbf{\color{lime}-o/2},k\mathbf{\color{red}-1}}, I_{s-e,k-1}+1\big\}\\ D_{s,k} &= \max\big\{M_{s\mathbf{\color{lime}-o/2},k\mathbf{\color{red}+1}}, D_{s-e,k+1}\big\}\\ X_{s,k} &= \max\big\{M_{s-x,k}+1, I_{s\mathbf{\color{lime}-o/2},k\mathbf{\color{red}-1}}, D_{s\mathbf{\color{lime}-o/2},k\mathbf{\color{red}+1}}\big\} \\ M_{s,k} &= X_{s,k} + LCP\big(A[X_{s,k}-k \dots], B[X_{x,k}\dots]\big). \end{align}
This only leaves the \(o/2\) issue, which seems inevitable in any symmetric representation:
What | Gap open | Gap close | Symmetric 1 | Symmetric 2 | Symmetric 3 |
---|---|---|---|---|---|
Open cost (\(o\)) | start | end | \(o/2\) | \(o/2\) | \(o/2\) |
Extend cost (\(e\)) | all but end | all but start | \(e/2\) | \(e\) | \(0\) |
Furthest reaching (\(+ 1\)) | all but end | all but start | \(+1/2\) | \(+1\) | \(0\) at start/end |
Diagonal shift (\(k\pm 1\)) | all but end | all but start | all but end?! | everywhere | \(0\) at start/end |
Note that I consider both of these variants theoretically interesting, but not practically relevant for now. Maintaining a separate forward and backward implementation seems simpler than the overhead of having fractional costs or doubling all costs.
Another symmetry
All the formulas so far have an asymmetry between the two sequences: when extending an insertion, we increase the furthest reaching point (\(f\)) by \(1\), while we do not do this for deletions. The reason is that furthest reaching points are stored by their $i$-index. Instead, we can store the sum of indices \(i+j\). This changes the value of \(f\) to simply the number of characters of both \(A\) and \(B\) processed up to this point.
The original gap-open formulation becomes:
\begin{align} I_{s,k} &= \max\big\{M_{s-o-e,k-1}+1, I_{s-e,k-1}+1\big\}\\ D_{s,k} &= \max\big\{M_{s-o-e,k+1}\,\mathbf{\color{lime}+1}, D_{s-e,k+1}\,\mathbf{\color{lime}+1}\big\}\\ X_{s,k} &= \max\big\{M_{s-x,k}+\mathbf{\color{lime}2}, I_{s,k}, D_{s,k}\big\}\\ M_{s,k} &= X_{s,k} + \mathbf{\color{lime}2\times} LCP\big(A[\mathbf{\color{lime}(}X_{s,k}\mathbf{\color{lime}+k)/2} \dots], B[\mathbf{\color{lime}(}X_{x,k}\mathbf{\color{lime}-k)/2}\dots]\big). \end{align}
Note that the length of the longest common prefix is doubled, since for each match we process two characters, one of \(A\) and one of \(B\). The end condition changes from \(f \geq |A|\) to \(f \geq |A| + |B|\).
While this formula contains more symbols, it seems more consistent to me, making it easier to understand, and less bug-prone to implement.
The only remaining difference (anti-symmetry) between \(I\) and \(D\) is whether we shift a diagonal up or down (\(k\pm1\)), which will always be needed.
Conclusions
- To prevent having overlap \(o\) in the forward and backward DP runs, a gap-close variant of the recursion may be used.
- Instead, a single symmetric recursion could also be used.
- This adds cost \(o/2\) when entering/exiting an affine layer, which is problematic for odd \(o\). Doubling of costs is possible but ugly.
- The recursion can be conceptually simplified by storing furthest reaching points as sum of their coordinates, instead of only the first coordinate.