1. Pairwise Alignment

Attribution

This chapter is based on two papers:

  • A*PA, Bioinformatics 24, which has joint first-authorship with Pesho Ivanov.

    Groot Koerkamp, Ragnar, and Pesho Ivanov. 2024. “Exact Global Alignment Using A* with Chaining Seed Heuristic and Match Pruning.” Edited by Tobias Marschall. Bioinformatics 40 (3). https://doi.org/10.1093/bioinformatics/btae032.

  • A*PA2, WABI24.

    Groot Koerkamp, Ragnar. 2024. “A*PA2: Up to 19× Faster Exact Global Alignment.” In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024), edited by Solon P. Pissis and Wing-Kin Sung, 312:17:1–17:25. Leibniz International Proceedings in Informatics (LIPIcs). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPIcs.WABI.2024.17.

A*PA was developed in close collaboration with Pesho Ivanov. Specifically, the seed heuristic is a direct application of his earlier work on A*ix. Pesho Ivanov’s contributions predominantly include the extension of the heuristic to the more general seed heuristic with chaining, gaps, and inexact matches. My own contributions predominantly include match pruning, the implementation using contours, and the proofs and experiments.

Further, the survey is based on ongoing notes on pairwise alignment that are also co-authored with Pesho Ivanov.

A*PA2 is my own work.

TODO: The extension to semi-global alignment is my own work, and currently unpublished.

Summary

In this chapter, we explore methods for global pairwise alignment, which find the minimal number of edits between two genomic sequences.

There is a vast amount of literature on algorithms for pairwise alignment and its variants, and we start with a corresponding survey.

Then, we introduce A* pairwise aligner (A*PA), an aligner that, as the name says, uses the A* shortest path algorithm. Its main novelty is a gap-chaining seed heuristic that is significantly stronger than the heuristics used in previous methods. Adding match pruning to this enables near-linear runtime on inputs with sufficiently low error rate.

A drawback of A*PA is that it is based on plain A*, which, like Dijkstra’s algorithm, is a relatively cache-inefficient graph algorithm. Some of the fastest aligners use DP (dynamic programming) with bitpacking to very efficiently compute the edit distance between sequences. In A*PA2, we build a highly optimized aligner that merges A* with bitpacking, yielding significant speedups over both A*PA and Edlib.

$$ \newcommand{\g}{g^} \newcommand{\h}{h^} \newcommand{\f}{f^*} \newcommand{\cgap}{c_{\mathrm{gap}}} \newcommand{\xor}{\ \mathrm{xor}\ } \renewcommand{\and}{\ \mathrm{and}\ } \renewcommand{\st}[2]{\langle #1, #2\rangle} \newcommand{\matches}{\mathcal M} \newcommand{\ed}{\operatorname{ed}} \renewcommand{\d}{\operatorname{d}} \newcommand{\lcp}{\operatorname{LCP}} $$

1 Introduction

  • introduced by (Needleman and Wunsch 1970) for biology
  • Premise: if we can clearly see the optimal alignment in a dot-plot, why do we need quadratic time anyway? (Wilbur and Lipman 1983) observe the same

As Fickett stated 40 years ago (Fickett 1984, 1), and as is still true today,

at present one must choose between an algorithm which gives the best alignment but is expensive, and an algorithm which is fast but may not give the best alignment.

In this chapter we narrow this gap and show that A*PA2 is nearly as fast as approximate methods.

1.1 Overview

  • briefly highlight the three chapters

2 A history of pairwise alignment

Overview. This survey focuses on exact global pairwise alignment. In particular, we focus on research from roughly 1970 to 1990 on improving the complexity of alignment algorithms. This is followed by more recent papers on improving the practical performance of these methods through more efficient use of the hardware. Rather than presenting all work chronologically, we treat them topic by topic.

Scope. There is also a vast literature on text searching, where all occurrences of a short pattern in a long text must be found. This field was very active from around 1990 to 2010 and also comprises a large number of papers. We consider these mostly out of scope and refer the reader to Navarro’s survey (Navarro 2001).

More recently, read mapping has become a crucial part of bioinformatics, and indeed, there is also a plethora of different tools for aligning and mapping reads. Here, we refer the reader to the survey by Alser et al. (Alser et al. 2021).

Lastly, we note that most modern read mappers and alignment tools are approximate, in that they are not guaranteed to return an alignment with provably minimal cost. A*PA and A*PA2 are both exact methods, and thus we will focus on those.

3 Problem statement

The main problem of this chapter is as follows.

Problem 2 Global pairwise alignment.

Given two sequences \(A\) and \(B\) of lengths \(n\) and \(m\), compute the edit distance \(\ed(A,B)\) between them.

Before looking into solutions to this problem, we first cover some theory to precisely define it.

Input sequences. As input, we take two sequences \(A=a_0a_1\dots a_{n-1}\) and \(B=b_0b_1\dots b_{m-1}\) of lengths \(n\) and \(m\) over an alphabet \(\Sigma\) that is typically of size \(\sigma=4\). We usually assume that \(n\geq m\). We refer to a substring \(a_i\dots a_{i'-1}\) as \(A_{i\dots i'}\), to a prefix \(a_0\dots a_{i-1}\) as \(A_{<i}\), and to a suffix \(a_i\dots a_{n-1}\) as \(A_{\geq i}\).

Edit distance. The edit distance \(s:=\ed(A,B)\) is the minimum number of insertions, deletions, and substitutions needed to convert \(A\) into \(B\). In practice, we also consider the divergence \(d:=\ed(A,B)/n\), which is the average number of errors per character. This is different from the error rate \(e\), which we consider to be the (relative) number of errors applied to a pair of sequences. The error rate is typically higher than the divergence, since random errors can cancel each other.

Figure 1: An example of an edit graph (left) corresponding to the alignment of strings ABCA and ACBBA, adapted from (Sellers 1974). Solid edges indicate insertion/deletion/substitution edges of cost \(1\), while dashed edges indicate matches of cost \(0\). All edges are directed from the top-left to the bottom-right. The shortest path of cost \(2\) is shown in blue. The right shows the corresponding dynamic programming (DP) matrix containing the distance \(\g(u)\) to each state.


Edit graph. The alignment graph or edit graph (Figure 1) is a way to formalize edit distance (Vintsyuk 1968; Ukkonen 1985). It contains states \(\st ij\) (\(0\leq i\leq n\), \(0\leq j\leq m\)) as vertices. It further contains edges, such that an edge of cost \(0\) corresponds to a pair of matching characters, and an edge of cost \(1\) corresponds to an insertion, deletion, or substitution. The vertical insertion and horizontal deletion edges have the form \(\st ij \to \st i{j+1}\) and \(\st ij \to \st {i+1}j\) of cost \(1\). Diagonal edges are \(\st ij\to \st{i+1}{j+1}\) and have cost \(0\) when \(a_i = b_j\) and substitution cost \(1\) otherwise. A shortest path from \(v_s:=\st 00\) to \(v_t := \st nm\) in the edit graph corresponds to an alignment of \(A\) and \(B\). The distance \(d(u,v)\) from \(u\) to \(v\) is the length of the shortest (minimal cost) path from \(u\) to \(v\), and we use edit distance, distance, length, and cost interchangeably. Further we write \(\g(u) := d(v_s, u)\) for the distance from the start to \(u\), \(\h(u) := d(u, v_t)\) for the distance from \(u\) to the end, and \(\f(u) := \g(u) + \h(u)\) for the minimal cost of a path through \(u\).

In figures, we draw sequence \(A\) at the top and sequence \(B\) on the left. Index \(i\) will always be used for \(A\) and indicates a column, while index \(j\) is used for \(B\) and indicates a row.

Shortest path algorithms. Using this graph, the problem of pairwise alignment reduces to finding a shortest path in a graph. There are many shortest path algorithms for graphs, and indeed, many of them are used for pairwise alignment. Since the graph is acyclic, the simplest method is to greedily process the states in any topologically sorted order, such as row-wise, column-wise, or anti-diagonal by anti-diagonal. We then start by setting \(\g(\st 00)=0\), and find the distance to any other state as the minimum, over its incoming edges, of the distance to the predecessor plus the cost of the edge. As we will see soon, this is often implemented using dynamic programming (DP).

Dijkstra’s shortest path algorithm can also be applied here (Dijkstra 1959), which visits states in order of increasing distance. This does require that all edges have non-negative weights. An extension of Dijkstra’s algorithm is A* (Hart, Nilsson, and Raphael 1968), which visits states in order of increasing “anticipated total distance”.

4 Variations on pairwise alignment

There are a few variants of pairwise alignments and edit distance. While the focus of this chapter is (unit cost) edit distance, it is helpful to first have an overview of the different variants since most papers each assume a slightly different context.

4.1 Alignment types

TODO: Also put Pesho’s 2nd figure?

Figure 2: Overview of different alignment types. (CC0 by Pesho Ivanov; source) TODO: re-scale image


In global pairwise alignment, the two sequences must be fully matched against each other. In practice though, there are a number of different settings, see Figure 2.

  • Global: Align both sequences fully, end-to-end.
  • Semi-global: Align a full sequence to a substring of a reference.
  • Global-extension: Align one sequence to a prefix of the other.
  • Overlap: Align two partially overlapping reads against each other.
  • Ends-free: Allow one of the two sequences at each end to have a (bounded) number of unmatched characters; this generalizes the methods above (Spouge 1991).
  • Extension: Align a prefix of the two sequences. Similar to local, but anchored at the start.
  • Local: Align a substring of \(A\) to a substring of \(B\). Like ends-free, but now we may skip the start and end of both sequences.

Of these, semi-global is very commonly used when mapping reads onto a larger reference. A slight difference is that we consider semi-global alignment to be a one-off alignment between two sequences, whereas for mapping, we usually align many small reads onto a single long reference.

4.2 Cost Models

Figure 3: Overview of different cost models. (CC0; source)


There are different models to specify the cost of each edit operation (Figure 3) (Spouge 1991). In particular, in a biological setting the probability of various types of mutations may not be equal, and thus, the associated costs should be different. We list some of them here, from simple to more complicated.

  • Hamming distance: The Hamming distance (Hamming 1950) between two sequences is the number of substitutions required to transform one into the other, where insertions or deletions are not allowed. This is simple to compute in linear time.

  • LCS: The longest common subsequence maximizes the number of matches, or equivalently, minimizes the number of indels (insertions or deletions) while not allowing substitutions. Insertions and deletions both have a cost of \(1\).

  • Unit cost edit distance / Levenshtein distance: The classic edit distance counts the minimum number of indels and/or substitutions needed, where each has a cost of \(1\).

  • Edit distance: In general, the edit distance allows for arbitrary indel and substitution costs. Matches/mismatches between characters \(a_i\) and \(b_j\) have cost \(\delta(a_i, b_j)\). Inserting/deleting a character has cost \(\delta(\varepsilon, b_j)>0\) and \(\delta(a_i, \varepsilon)>0\) respectively. Usually the cost of a match is \(0\) or negative (\(\delta(a,a) \leq 0\)) and the cost of a mismatch is positive (\(\delta(a,b)>0\) for \(a\neq b\)).

    In this chapter, when we use edit distance, we usually mean the unit-cost version.

  • Affine cost: It turns out that insertions and deletions in DNA sequences are somewhat rare, but that once there is an indel, it is relatively common for it to be longer than a single character. This is modelled using affine costs (Smith, Waterman, and Fitch 1981; Gotoh 1982), where there is a cost \(o\) to open a gap, and a cost \(e\) to extend a gap, so that the cost of a gap of length \(k\) is \(w_k = o+k\cdot e\).

    It is also possible to have different parameters \((o_{\mathrm{ins}}, e_{\mathrm{ins}})\) and \((o_{\mathrm{del}}, e_{\mathrm{del}})\) for insertions and deletions.

  • Dual affine: It turns out that affine costs are not sufficient to capture all biological processes: the affine gap cost penalizes very long indels, of length \(100\) to \(1000\), too heavily. To fix this, a second gap cost can be introduced with separate parameters \((o_2, e_2)\), with for example an open cost of \(o_2=1000\) and an extend cost of \(e_2=0.5\). The cost of a gap of length \(k\) is now given by \(w_k = \min(o_1 + k\cdot e_1, o_2 + k\cdot e_2)\).

    More generally, a piecewise linear gap cost can be considered as well (GOTOH 1990).

  • Concave: Even more generally, we can give gaps of length \(k\) a cost \(w_k\), where \(w_k\) is a concave function of \(k\), so that longer gaps become relatively less expensive. Affine costs are an example of a concave gap cost.

  • Arbitrary: Even more generally, we can combine the concave gap costs with arbitrary substitution costs \(\delta(a,b)\) for (mis)matches.

In practice, most methods use a match cost \(\delta(a,a) = 0\), fixed mismatch cost \(\delta(a,b) = X>0\) for \(a\neq b\), and fixed indel cost \(\delta(a,\varepsilon) = \delta(\varepsilon,b) = I\).

4.3 Minimizing Cost versus Maximizing Score

So far, the cost models we considered are just that: cost models. They focus on minimizing the cost of the edits between two sequences, and usually assume that all costs are \(\geq 0\), so that in particular matching two characters has a cost of \(0\).

In some settings, scores are considered instead, which are simply the negative of the cost. In this setting, matching characters usually give a positive score, so that matching is explicitly rewarded. This is for example the case when finding the longest common subsequence, where each pair of matching characters gives a score of \(1\), and everything else has a score of \(0\).

Both approaches have their benefits. When using non-negative costs, all edges in the alignment graph have non-negative weights. This significantly simplifies the shortest path problem, since this is, for example, a requirement for Dijkstra’s algorithm. Scores, on the other hand, work better for overlap, extension, and local alignment: in these cases, the empty alignment is usually a solution, and thus, we must give some bonus to the matching of characters to compensate for the inevitable mismatches that will also occur. Unfortunately, this more general setting usually means that algorithms have to explore a larger part of the alignment graph. The ratio between the match bonus (score \(>0\)) and mismatch penalty (score \(<0\)) determines how many additional matched characters are needed to compensate for each additional mismatch.

Cost-vs-score duality. For the problem of longest common subsequence there is a duality between scores and costs. When \(p\) is the length of the LCS, and \(s\) is the cost of aligning the two sequences via the LCS cost model where indels cost \(1\) and mismatches are not allowed, we have

\begin{align} 2\cdot p + s = n+m. \end{align}

Thus, maximizing the number of matched characters is equivalent to minimizing the number of insertions and deletions.

A similar duality holds for global alignment: there is a direct correspondence between maximizing score and minimizing cost (Smith, Waterman, and Fitch 1981; Eizenga and Paten 2022): given a scoring model with fixed affine costs \(\delta(a, a) = M\), \(\delta(a,b) = X\), and \(w_k = O + E \cdot k\), there is a cost-model (with \(\delta(a,a)=0\)) that yields the same optimal alignment.

5 The classic quadratic DP algorithms

We are now ready to look into the first algorithms. We start with DP algorithms, which process the graph one column at a time. Note that we present all algorithms as similarly as possible: they go from the top-left to the bottom-right, and always minimize the cost. We write \(D(i,j)=\g(\st ij)\) for the cost to state \(\st ij\). Smith et al. (Smith, Waterman, and Fitch 1981) provide a nice overview of the similarities and differences between the early approaches.

Figure 4: The cubic algorithm as shown by Needleman and Wunsch (Needleman and Wunsch 1970). Note that as shown, it works from the bottom right to the top left, and maximizes the LCS score instead of minimizing cost. Consider the outlined 1-cell. It has a score of 1 because the characters in its row and column match. The final score of the cell is this 1, plus the maximum of the remaining outlined cells in the row below and column right of it.


Needleman-Wunsch’s cubic algorithm. The problem of pairwise alignment of biological sequences was first formalized by Needleman and Wunsch (Needleman and Wunsch 1970). They provide a cubic recurrence that assumes a (mis)match between \(a_{i-1}\) and \(b_{j-1}\) of cost \(\delta(a_{i-1},b_{j-1})\) and an arbitrary gap cost \(w_k\). The recursion uses that before matching \(a_{i-1}\) and \(b_{j-1}\), either \(a_{i-2}\) and \(b_{j-2}\) are matched to each other, or one of them is matched to some other character:

\begin{align*} D(0,0) &= D(i,0) = D(0,j) := 0\\ D(i,j) &:= \delta(a_{i{-}1}, b_{j{-}1})&& \text{cost of match}\\ &\phantom{:=} + \min\big( \min_{0\leq i' < i} D(i', j{-}1) + w_{i{-}i'{-}1},&&\text{cost of matching $a_{i'-1}$ against $b_{j-2}$ next}\\ &\phantom{:=+\min\big(} \min_{0\leq j'<j} D(i{-}1, j')+w_{j{-}j'{-}1}\big).&&\text{cost of matching $a_{i-2}$ against $b_{j'-1}$ next} \end{align*}

The value of \(D(n,m)\) is the final cost of the alignment.

The total runtime is \(O(nm \cdot (n+m)) = O(n^2m)\) since each of the \(n\cdot m\) cells requires \(O(n+m)\) work.

A quadratic DP. The cubic DP was improved into a quadratic DP by Sellers (Sellers 1974) and Wagner and Fischer (Wagner and Fischer 1974). The improvement comes from dropping the arbitrary gap cost \(w_k\), so that instead of trying all \(O(n+m)\) indels in each position, only one insertion and one deletion are tried:

\begin{align*} D(0,0) &:= 0\\ D(i, 0) &:= D(i{-}1,0)+ \delta(a_{i{-}1}, \varepsilon) \\ D(0, j) &:= D(0,j{-}1)+ \delta(\varepsilon, b_{j{-}1}) \\ D(i, j) &:= \min\big(D(i{-}1,j{-}1) + \delta(a_{i{-}1}, b_{j{-}1}), &&\text{(mis)match}\\ &\phantom{:=\min\big(}\, D(i{-}1,j) + \delta(a_{i{-}1}, \varepsilon), && \text{deletion}\\ &\phantom{:=\min\big(}\, D(i,j{-}1) + \delta(\varepsilon, b_{j{-}1})\big). && \text{insertion} \end{align*}

This algorithm takes \(O(nm)\) time since it now does constant work per DP cell.
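
To make the recurrence concrete, here is a minimal sketch of this quadratic DP for unit costs (\(\delta=0\) for a match and \(1\) for a mismatch or indel), written in Rust; the function name is illustrative and not taken from any existing tool.

```rust
/// Unit-cost edit distance via the quadratic DP, filling the full (n+1) x (m+1) matrix.
fn edit_distance_dp(a: &[u8], b: &[u8]) -> usize {
    let (n, m) = (a.len(), b.len());
    let mut d = vec![vec![0usize; m + 1]; n + 1];
    // Boundary: aligning a prefix against the empty string costs its length.
    for i in 0..=n { d[i][0] = i; }
    for j in 0..=m { d[0][j] = j; }
    for i in 1..=n {
        for j in 1..=m {
            let sub = d[i - 1][j - 1] + if a[i - 1] == b[j - 1] { 0 } else { 1 };
            // Minimum over (mis)match, deletion, and insertion edges.
            d[i][j] = sub.min(d[i - 1][j] + 1).min(d[i][j - 1] + 1);
        }
    }
    d[n][m]
}
```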

This quadratic DP is now called the Needleman-Wunsch (NW) algorithm. Gotoh (Gotoh 1982) refers to it as Needleman-Wunsch-Sellers’ algorithm, to highlight the speedup that Sellers contributed (Sellers 1974). Apparently Gotoh was not aware of the identical formulation of Wagner and Fischer (Wagner and Fischer 1974).

Vintsyuk published a quadratic algorithm even before Needleman and Wunsch (Vintsyuk 1968), but in the context of speech recognition. Instead of a cost of matching characters, there is some cost \(\delta(i,j)\) associated to matching two states, and it does not allow deletions:

\begin{align*} D(i, j) &:= \min\big(D(i{-}1,j{-}1), D(i{-}1, j)\big) + \delta(i,j). \end{align*}

Sankoff also gives a quadratic recursion (Sankoff 1972), similar to the one by Sellers (Sellers 1974), but specifically for LCS. This leads to the recursion

\begin{align*} S(i, j) &:= \max\big(S(i{-}1,j{-}1) + \delta(a_{i{-}1}, b_{j{-}1}),\, S(i{-}1, j),\, S(i, j{-}1)\big). \end{align*}

Local alignment. Smith and Waterman (Smith and Waterman 1981) introduce a DP for local alignment. The structure of their algorithm is similar to the cubic DP of Needleman and Wunsch and allows for arbitrary gap costs \(w_k\). While introduced as a maximization of score, here we present it as minimizing cost (with \(\delta(a,a)<0\)) for consistency. The new addition is a \(\min(0, \dots)\) term, that can reset the alignment whenever the cost goes above \(0\). The best local alignment ends in the smallest value of \(D(i,j)\) in the table.

\begin{align*} D(0, 0) &= D(i, 0) = D(0, j) := 0 \\ D(i,j) &= \min\big(0, &&\text{start a new local alignment}\\ &\phantom{=\min\big(} D(i-1, j-1) + \delta(a_{i{-}1}, b_{j{-}1}), &&\text{(mis)match}\\ &\phantom{=\min\big(} \min_{0\leq i' < i} D(i', j) + w_{i{-}i'}, &&\text{deletion}\\ &\phantom{=\min\big(} \min_{0\leq j'<j} D(i, j')+w_{j{-}j'}\big).&&\text{insertion} \end{align*}

This algorithm uses arbitrary gap costs \(w_k\), as first mentioned by Needleman and Wunsch (Needleman and Wunsch 1970) and formally introduced by Waterman (Waterman, Smith, and Beyer 1976). Because of this, it runs in \(O(n^2m)\).

The quadratic algorithm for local alignment is now usually referred to as the Smith-Waterman-Gotoh (SWG) algorithm, since the ideas introduced by Gotoh (Gotoh 1982) can be used to reduce the runtime from cubic to quadratic by assuming affine costs, just as Sellers (Sellers 1974) sped up the Needleman-Wunsch algorithm (Needleman and Wunsch 1970) for global alignment by assuming linear gap costs. Note though that Gotoh only mentions this speedup in passing, and that Smith and Waterman (Smith and Waterman 1981) could have directly based their idea on the quadratic algorithm of Sellers (Sellers 1974) instead.

Affine costs. To my knowledge, the first mention of affine costs of the form \(o+k\cdot e\) is by Smith, Waterman, and Fitch (Smith, Waterman, and Fitch 1981). Gotoh (Gotoh 1982) generalized the quadratic recursion to these affine costs, to circumvent the cubic runtime needed for the arbitrary gap costs \(w_k\) of Waterman (Waterman, Smith, and Beyer 1976). This is done by introducing two additional matrices \(P(i,j)\) and \(Q(i,j)\) that contain the minimal cost to get to \((i,j)\) where the last step is required to be a horizontal (deletion) or vertical (insertion) edge, respectively:

\begin{align*} D(0, 0) &:= 0 \\ D(i, 0) &= P(i, 0) := o + i\cdot e, \qquad Q(i, 0) := \infty \\ D(0, j) &= Q(0, j) := o + j\cdot e, \qquad P(0, j) := \infty \\ P(i, j) &:= \min\big(D(i-1, j) + o+e, &&\text{new gap}\\ &\phantom{:= \min\big(}\ P(i-1, j) + e\big)&&\text{extend gap}\\ Q(i, j) &:= \min\big(D(i, j-1) + o+e, &&\text{new gap}\\ &\phantom{:= \min\big(}\ Q(i, j-1) + e\big)&&\text{extend gap}\\ D(i, j) &:= \min\big(D(i-1, j-1) + \delta(a_{i-1}, b_{j-1}),\, P(i, j),\, Q(i, j)\big). \end{align*}

This algorithm runs in \(O(nm)\) time.
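
The recurrence translates directly into code. Below is a minimal sketch with a simple cost model (mismatch cost \(x\), gap open \(o\), gap extend \(e\); these parameter names are illustrative), using the standard global-alignment boundary values described above.

```rust
/// Gotoh's affine-gap DP. `p` tracks alignments ending in a horizontal gap,
/// `q` alignments ending in a vertical gap; a gap of length k costs o + k*e.
fn affine_alignment_cost(a: &[u8], b: &[u8], x: u32, o: u32, e: u32) -> u32 {
    let (n, m) = (a.len(), b.len());
    const INF: u32 = u32::MAX / 2; // "infinity" that is still safe to add to
    let mut d = vec![vec![INF; m + 1]; n + 1];
    let mut p = vec![vec![INF; m + 1]; n + 1];
    let mut q = vec![vec![INF; m + 1]; n + 1];
    d[0][0] = 0;
    for i in 1..=n { p[i][0] = o + i as u32 * e; d[i][0] = p[i][0]; }
    for j in 1..=m { q[0][j] = o + j as u32 * e; d[0][j] = q[0][j]; }
    for i in 1..=n {
        for j in 1..=m {
            p[i][j] = (d[i - 1][j] + o + e).min(p[i - 1][j] + e); // open or extend
            q[i][j] = (d[i][j - 1] + o + e).min(q[i][j - 1] + e); // open or extend
            let sub = d[i - 1][j - 1] + if a[i - 1] == b[j - 1] { 0 } else { x };
            d[i][j] = sub.min(p[i][j]).min(q[i][j]);
        }
    }
    d[n][m]
}
```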

Gotoh also mentions that this method can be modified to solve the local alignment of Smith and Waterman (Smith and Waterman 1981) in quadratic time. Later, Gotoh further extended the method to support double affine costs and more general piecewise linear gap costs (GOTOH 1990).

Traceback. To compute the final alignment, we can follow the trace of the DP matrix: starting at the end \(\st nm\), we can repeatedly determine which of the preceding DP-states was optimal as predecessor and store these states. This takes linear time, but requires quadratic memory since all states could be on the optimal path. Gotoh notes (Gotoh 1982) that if only the final score is required, only the last two columns of the DP matrix \(D\) (and \(P\) and \(Q\)) are needed at any time, so that linear memory suffices.

6 Linear Memory using Divide and Conquer

Figure 5: Divide-and-conquer as shown by Myers and Miller (Myers and Miller 1988). Unlike the main text, in this figure, the recursion is on the middle row, rather than the middle column.


Hirschberg (Hirschberg 1975) introduces a divide-and-conquer algorithm to compute the LCS of two sequences in linear space. Instead of computing the full alignment from \(\st 00\) to \(\st nm\), we first fix a column halfway, \(i^\star = \lfloor n/2\rfloor\). This splits the problem into two halves: we compute the forward DP matrix \(D(i, j)\) for all \(i\leq i^\star\), and introduce a backward DP \(D’(i, j)\) that is computed for all \(i\geq i^\star\). Here, \(D’(i,j)\) is the minimal cost for aligning suffixes of length \(n-i\) and \(m-j\) of \(A\) and \(B\). It is shown that there must exist a \(j^\star\) such that \(D(i^\star, j^\star) + D’(i^\star, j^\star) = D(n, m)\), and we can find this \(j^\star\) as the \(j\) that minimizes \(D(i^\star, j) + D’(i^\star, j)\).

At this point, we know that the point \((i^\star, j^\star)\) is part of an optimal alignment. The two resulting subproblems of aligning \(A[0, i^\star]\) to \(B[0, j^\star]\) and \(A[i^\star, n]\) to \(B[j^\star, m]\) can now be solved recursively using the same technique, where again we find the midpoint of the alignment. This recursive process is shown in Figure 5. The recursion stops as soon as the alignment problem becomes trivial, or, in practice, small enough to solve with the usual quadratic-memory approach.
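
As an illustration, the following sketch finds one such midpoint for unit-cost edit distance. The helper computes a single DP column in \(O(m)\) memory; running it forward on the prefix of \(A\) and backward on the reversed suffix gives \(D(i^\star,\cdot)\) and \(D'(i^\star,\cdot)\). The function names are illustrative, and the recursion on the two halves is omitted.

```rust
/// For every j, the edit distance between all of `a` and the prefix b[..j],
/// using only O(m) memory (one DP column, updated in place).
fn nw_col(a: &[u8], b: &[u8]) -> Vec<usize> {
    let m = b.len();
    let mut col: Vec<usize> = (0..=m).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut diag = col[0]; // value of the previous row, one column to the left
        col[0] = i + 1;
        for j in 1..=m {
            let sub = diag + if ca == b[j - 1] { 0 } else { 1 };
            diag = col[j];
            col[j] = sub.min(col[j - 1] + 1).min(col[j] + 1);
        }
    }
    col
}

/// One divide-and-conquer step: a state (i*, j*) on an optimal alignment.
fn find_midpoint(a: &[u8], b: &[u8]) -> (usize, usize) {
    let (n, m) = (a.len(), b.len());
    let i_star = n / 2;
    let fwd = nw_col(&a[..i_star], b); // D(i*, j)
    let a_rev: Vec<u8> = a[i_star..].iter().rev().copied().collect();
    let b_rev: Vec<u8> = b.iter().rev().copied().collect();
    let bwd = nw_col(&a_rev, &b_rev); // bwd[k] = D'(i*, m - k)
    // j* minimizes D(i*, j) + D'(i*, j); recurse on (A[..i*], B[..j*]) and (A[i*..], B[j*..]).
    let j_star = (0..=m).min_by_key(|&j| fwd[j] + bwd[m - j]).unwrap();
    (i_star, j_star)
}
```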

Space complexity. The benefit of this method is that it only uses linear memory: each forward or reverse DP is only needed to compute the scores in the final column, and thus can be done in linear memory. After the midpoint \(\st {i^\star}{j^\star}\) is found, the results of the left and right subproblem can be discarded before recursing further. Additionally, the space for the solution itself is linear.

Time complexity. We analyse the time complexity following (Myers and Miller 1988). The first step takes \(2\cdot O((n/2)m) = O(nm)\) time. We are then left with two subproblems of size \(i^\star \cdot j^\star\) and \((n-i^\star)(m-j^\star)\). Since \(i^\star = n/2\), their total size is \(n/2 \cdot j^\star + n/2 \cdot (m-j^\star) = nm/2\). Thus, the total time in the first layer of the recursion is \(nm/2\). Extending this, we see that the total number of states halves with each level of the recursion. Thus, the total time is bounded by

\begin{equation*} mn + \frac 12 \cdot mn + \frac 14 \cdot mn + \frac 18\cdot mn + \dots \leq 2\cdot mn = O(mn). \end{equation*}

Indeed, in practice this algorithm takes around twice as long to find an alignment as the non-recursive algorithm takes to find just the score.

Applications. Hirschberg introduced this algorithm for computing the longest common subsequence (Hirschberg 1975). It was then applied multiple times to reduce the space complexity of other variants as well: Myers first applied it to the \(O(ns)\) LCS algorithm (Myers 1986), and also improved the \(O(nm)\) algorithm by Gotoh (Gotoh 1982) to linear memory (Myers and Miller 1988). Similarly, BiWFA (Marco-Sola et al. 2023) improves the space complexity of WFA from \(O(n+s^2)\) to \(O(s)\) working memory, where \(s\) is the cost of the alignment.

7 Dijkstra’s algorithm and A*

Dijkstra’s algorithm. Both Ukkonen (Ukkonen 1985) and Myers (Myers 1986) remarked that pairwise alignment can be solved using Dijkstra’s algorithm (Dijkstra 1959), which visits states by increasing distance. Ukkonen gave a bound of \(O(nm \log (nm))\), whereas Myers’ analysis uses the fact that there are only \(O(ns)\) states at distance \(\leq s\) (see Section 8), and thus concludes that the algorithm runs in \(O(ns)\) (Table 1a).

However, Myers (Myers 1986, 2) observes that

the resulting algorithm involves a relatively complex discrete priority queue and this queue may contain as many as \(O(ns)\) entries even in the case where just the length of the shortest edit script is being computed.

And indeed, I am not aware of any tool that practically implemented Dijkstra’s algorithm to compute the edit distance.

A* and the gap cost heuristic. Hadlock realized (Hadlock 1988) that Dijkstra’s algorithm can be improved upon by using A* (Hart, Nilsson, and Raphael 1968; Hart, Nilsson, and Raphael 1972; Pearl 1984), a more informed algorithm that uses a heuristic function \(h\) that gives a lower bound on the remaining edit distance between two suffixes. He proposes two heuristics: one based on character frequencies, and the other the widely used gap cost heuristic (Ukkonen 1985; Hadlock 1988; Spouge 1989, 1991; Myers and Miller 1995). This uses the difference in length between two sequences as a lower bound on their edit distance (Table 1d): \[ \cgap(\st ij, \st{i'}{j'}) = |(i-i') - (j-j')|. \] We specifically highlight the papers by Wu et al. (Wu et al. 1990) and Papamichail and Papamichail (Papamichail and Papamichail 2009), where the authors’ method exactly matches the A* algorithm with the gap-heuristic, in combination with diagonal transition (Section 9).

Much more recently, A*ix (Ivanov et al. 2020; Ivanov, Bichsel, and Vechev 2022) introduced the much stronger seed heuristic for the problem of sequence-to-graph alignment. This heuristic splits the sequence \(A\) into disjoint k-mers, and uses that at least one edit is needed for each remaining k-mer that is not present in sequence \(B\).

In A*PA (Groot Koerkamp and Ivanov 2024) (Section FIXME) we will improve this into the gap-chaining seed heuristic and add pruning, which results in near-linear alignment when the divergence is sufficiently low (Table 1g).

Notation. To prepare for the theory on A*PA, we now introduce some formal terminology and notation for Dijkstra’s algorithm and A*. Dijkstra’s algorithm finds a shortest path from \(v_s=\st 00\) to \(v_t=\st nm\) by expanding (generating all successors of) vertices in order of increasing distance \(\g(u)\) from the start. The next vertex to be expanded is chosen from a set of open vertices. The A* algorithm, instead, directs the search towards a target by expanding vertices in order of increasing \({f(u) := g(u) + h(u)}\), where \(h(u)\) is a heuristic function that estimates the distance \(\h(u)\) to the end and \(g(u)\geq \g(u)\) is the length of the shortest path from \(v_s\) to \(u\) found so far. We say that \(u\) is fixed when the distance to \(u\) has been found, i.e., \(g(u) = \g(u)\). A heuristic is admissible if it is a lower bound on the remaining distance (\(h(u) \leq \h(u)\)), which guarantees that A* has found a shortest path as soon as it expands \(v_t\). A heuristic \(h_1\) dominates (is more accurate than) another heuristic \(h_2\) when \(h_1(u) \ge h_2(u)\) for all vertices \(u\). A dominant heuristic will usually (but not always (Holte 2010)) expand fewer vertices. Note that Dijkstra’s algorithm is equivalent to A* using a heuristic that is always \(0\), and that both algorithms require non-negative edge costs.
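
To make this notation concrete, here is a minimal sketch of A* on the edit graph with the gap heuristic \(h(\st ij) = \cgap(\st ij, \st nm)\); with \(h\equiv 0\) it degenerates to Dijkstra’s algorithm. This is an illustrative quadratic-memory toy, not A*PA itself, and the function name is made up.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// A* over the edit graph, using the (admissible, consistent) gap heuristic.
fn astar_edit_distance(a: &[u8], b: &[u8]) -> usize {
    let (n, m) = (a.len(), b.len());
    let h = |i: usize, j: usize| ((n - i) as isize - (m - j) as isize).unsigned_abs();
    let mut g = vec![vec![usize::MAX; m + 1]; n + 1]; // best distance found so far
    let mut open = BinaryHeap::new();                 // min-heap on f = g + h
    g[0][0] = 0;
    open.push(Reverse((h(0, 0), 0usize, 0usize)));
    while let Some(Reverse((f, i, j))) = open.pop() {
        let gij = g[i][j];
        if f > gij + h(i, j) { continue; } // outdated queue entry
        if (i, j) == (n, m) { return gij; }
        // Outgoing edges: (mis)match, deletion (horizontal), insertion (vertical).
        let mut succs = Vec::new();
        if i < n && j < m { succs.push((i + 1, j + 1, if a[i] == b[j] { 0 } else { 1 })); }
        if i < n { succs.push((i + 1, j, 1)); }
        if j < m { succs.push((i, j + 1, 1)); }
        for (i2, j2, c) in succs {
            if gij + c < g[i2][j2] {
                g[i2][j2] = gij + c;
                open.push(Reverse((g[i2][j2] + h(i2, j2), i2, j2)));
            }
        }
    }
    unreachable!("the target state <n, m> is always reachable")
}
```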

We end our discussion of graph algorithms with a quote: as Spouge states (Spouge 1991, 3),

algorithms exploiting the lattice structure of an alignment graph are usually faster.

and further (Spouge 1989, 4):

This suggests a radical approach to A* search complexities: dispense with the lists [of open states] if there is a natural order for vertex expansion.

In A*PA2 (Groot Koerkamp 2024) (Section FIXME), we follow this advice and replace the plain A* search in A*PA with a much more efficient approach based on computational volumes that merges DP and A*.

8 Computational volumes and band doubling

All methods we have seen so far use time \(\Theta(nm)\) or worse, even when the two input sequences are very similar, or even equal. To our knowledge, Wilbur and Lipman (Wilbur and Lipman 1983; Wilbur and Lipman 1984) are the first to speed this up, by only considering states near diagonals with many k-mer matches. However, this method is not exact, i.e., it could return a suboptimal alignment. Nevertheless, they raise the question whether the alignments found by their method are closer to biological truth than the true minimal cost alignments found by exact algorithms.

Reordering the matrix computation. The main reason the methods so far are quadratic is that they compute the entire \(n\times m\) matrix. But, especially when the two sequences are similar, the optimal alignment is likely to be close to the main diagonal. Thus, Fickett (Fickett 1984) proposes to compute the entries of the DP matrix in a new order: instead of column by column, we can first compute all entries at distance up to \(t\), and if this does not yet result in a path to the end (\(\st nm\)), we can incrementally extend the computed area to a larger area with distance up to \(t'>t\), and so on, until we try a \(t\geq s\). In fact, when \(t\) is increased by \(1\) at a time this is similar to Dijkstra’s algorithm.

Vertices at distance \(\leq t\) can never be more than \(t\) diagonals away from the main diagonal, and hence, computing them can be done in \(O(nt)\) time. This can be much faster than \(O(nm)\) when \(s\) and \(t\) are both small, and works especially well when \(t\) is not too much larger than \(s\). For example, \(t\) can be set as a known upper bound for the data being aligned, or as the length of some known suboptimal alignment.

Table 1: Alignment of two sequences of length \(3000\) bp with 20% divergence using different algorithms. Coloured pixels correspond to visited states in the edit graph or dynamic programming matrix, and the blue to red gradient indicates the order of computation. Panels, from left to right: (a) Dijkstra, (b) WFA (+ DT), (c) Ukkonen (+ band doubling), (d) Edlib (+ gap heuristic and bitpacking), (e) A*PA2-simple (+ blocks), (f) A*PA2-full (+ GCSH), (g) A*PA (A*). TODO: Review figs and caption

Gap heuristic. In parallel, Ukkonen introduced a very similar idea (Ukkonen 1985), statically bounding the computation to only those states that can be contained in a path of length at most \(t\) from the start to the end of the graph (Table 1c). In particular, it uses the gap heuristic: the minimal cost of an alignment containing \(\st ij\) is \[ f(\st ij) := \cgap(\st 00, \st ij) + \cgap(\st ij, \st nm) = |i-j| + |(n-i) - (m-j)|, \] and Ukkonen’s algorithm only considers those states for which \(f(\st ij) \leq t\). Thus, instead of requiring that the actual distance to a state is at most \(t\) (\(\g(\st ij) \leq t\)), it requires that the best possible cost of a path containing \(\st ij\) is sufficiently low.

Band doubling. Ukkonen also introduces band doubling (Ukkonen 1985): if it turns out that \(t=t_0<s\), then it can be doubled to \(t_1 = 2t_0\), until a \(t_i\geq s\) is found. As we already saw, testing \(t\) takes \(O(nt)\) time. Now suppose we test \(t_0=1\), \(t_1=2\), \(\dots\), \(t_{i-1}=2^{i-1}<s\), up to \(t_i=2^i \geq s\). Then the total cost of this is \[ t_0n + t_1n + \dots + t_i n = 1\cdot n + 2\cdot n + \dots + 2^i \cdot n < 2^{i+1}\cdot n = 4\cdot 2^{i-1}\cdot n < 4sn. \] Thus, band doubling finds an optimal alignment in \(O(ns)\) time. Note that computed values are not reused between iterations, so that each state is computed twice on average.
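
The sketch below illustrates the doubling scheme for unit-cost edit distance: a helper fills only the states with \(f(\st ij)\leq t\) and reports the distance if it is at most \(t\); the outer loop doubles \(t\) until this succeeds. For brevity the helper scans whole columns and merely skips out-of-volume cells, whereas a real implementation only touches the \(O(t)\) in-volume cells per column; the function names are illustrative.

```rust
/// Band doubling: try t = 1, 2, 4, ... until the alignment fits within the volume f(<i,j>) <= t.
fn edit_distance_band_doubling(a: &[u8], b: &[u8]) -> usize {
    let (n, m) = (a.len(), b.len());
    let test = |t: usize| -> Option<usize> {
        const INF: usize = usize::MAX / 2;
        let in_volume = |i: usize, j: usize| {
            (i as isize - j as isize).unsigned_abs()
                + ((n - i) as isize - (m - j) as isize).unsigned_abs()
                <= t
        };
        let mut prev = vec![INF; m + 1]; // column i-1
        for j in 0..=m { if in_volume(0, j) { prev[j] = j; } }
        for i in 1..=n {
            let mut cur = vec![INF; m + 1]; // column i
            if in_volume(i, 0) { cur[0] = i; }
            for j in 1..=m {
                if !in_volume(i, j) { continue; }
                let sub = prev[j - 1] + if a[i - 1] == b[j - 1] { 0 } else { 1 };
                cur[j] = sub.min(prev[j] + 1).min(cur[j - 1] + 1);
            }
            prev = cur;
        }
        // Only a result <= t is guaranteed to be optimal.
        (prev[m] <= t).then_some(prev[m])
    };
    let mut t = 1;
    loop {
        if let Some(s) = test(t) { return s; }
        t *= 2;
    }
}
```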

Two tools implementing this band doubling are Edlib and KSW2.

Computational volumes. Spouge unifies the methods of Fickett and Ukkonen in computational volumes (Spouge 1989, 1991), which are subgraphs of the full edit graph that are guaranteed to contain all shortest paths. Thus, to find an alignment, it is sufficient to only consider the states in such a computational volume. Given a bound \(t\geq s\), some examples of computational volumes are:

  1. The set of all states: the entire \((n+1)\times (m+1)\) graph (Needleman and Wunsch 1970).
  2. \(\{u: \g(u)\leq t\}\), the states at distance \(\leq t\), introduced by Fickett (Fickett 1984) and similar to Dijkstra’s algorithm (Table 1ab) (Dijkstra 1959).
  3. \(\{u: \cgap(v_s, u) + \cgap(u, v_t) \leq t\}\) the static set of states possibly on a path of cost \(\leq t\) (Table 1c) (Ukkonen 1985).
  4. \(\{u: \g(u) + \cgap(u, v_t) \leq t\}\), as used by Edlib (Table 1de) (Šošić and Šikić 2017; Spouge 1991; Papamichail and Papamichail 2009).
  5. \(\{u: \g(u) + h(u) \leq t\}\), for any admissible heuristic \(h\), which we will use in A*PA2 and is similar to A* (Table 1fg).

TODO: Check figure references.

9 Diagonal transition

Figure 6: Farthest reaching points for LCS by Myers (Myers 1986).


Around 1985, the diagonal transition algorithm was independently discovered by Ukkonen (Ukkonen 1983, 1985) (for edit distance) and Myers (Myers 1986) (for LCS). It hinges on the observation that along diagonals of the edit graph (or DP matrix), the value of \(\g(\st ij) = D(i,j)\) never decreases (Lemma 3 Ukkonen 1985), as can be seen in Figure 1.

We already observed before that when the edit distance is \(s\), only the \(s\) diagonals above and below the main diagonal are needed, and on these diagonals, we only are interested in the values up to \(s\). Thus, on each diagonal, there are at most \(s\) transitions from a distance \(g\leq s\) to distance \(g+1\). We call the farthest state along a diagonal with a given distance a farthest reaching state. Specifically, given a diagonal \(-s\leq k\leq s\), we consider the farthest \(u=\st ij\) on this diagonal (i.e., with \(i-j=k\)) at distance \(g\) (\(\g(u) \leq g\)). Then we write \(F_{gk}:=i+j\) to indicate the antidiagonal of this farthest reaching state. (Note that more commonly (Ukkonen 1985; Marco-Sola et al. 2020), just the column \(i\) is used to indicate how far along diagonal \(k=i-j\) the farthest state lies, but using \(i+j\) leads to more symmetric formulas.) In order to write the recursive formula on the \(F_{gk}\), we need a helper function: \(\lcp(i, j)\) returns the length of the longest common prefix between \(A_{\geq i}\) and \(B_{\geq j}\), which indicates how far we can walk along the diagonal for free starting at \(u=\st ij\). We call this extending from \(u\). The recursion then starts with \(F_{00} = 2\cdot\lcp(0,0)\) as the farthest state along the main diagonal with cost \(0\). To compute wavefront \(F_{g, \bullet}\) in terms of \(F_{g-1, \bullet}\), we first find the farthest state at distance \(g\) on diagonal \(k\) that is adjacent to a state at distance \(g-1\): \[ X_{gk} := \max(F_{g-1,k-1}+1, F_{g-1,k}+2, F_{g-1,k+1}+1). \] From this state, with coordinates \(i(X_{gk}) = (X_{gk}+k)/2\) and \(j(X_{gk})=(X_{gk}-k)/2\), we can possibly walk further along the diagonal for free to obtain the farthest reaching point: \[ F_{gk} = X_{gk} + 2\cdot\lcp(i(X_{gk}), j(X_{gk})). \] The edit distance between two sequences is then the smallest \(g\) such that \(F_{g, n-m} \geq n+m\).
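
A minimal sketch of this recursion is shown below. For simplicity it follows the more common column-based formulation (storing the farthest column \(i\) per diagonal instead of the antidiagonal \(i+j\)), and it recomputes matches character by character, so it illustrates the \(O(s^2)\) wavefronts but not the \(O(n+s^2)\) expected-time refinements; the function names are illustrative.

```rust
/// Diagonal transition for unit-cost edit distance: per cost g, store for each
/// diagonal k = i - j the farthest-reaching column i, after greedy match extension.
fn edit_distance_dt(a: &[u8], b: &[u8]) -> usize {
    let (n, m) = (a.len(), b.len());
    let extend = |mut i: usize, mut j: usize| -> usize {
        while i < n && j < m && a[i] == b[j] { i += 1; j += 1; }
        i
    };
    let target_k = n as isize - m as isize;
    let i0 = extend(0, 0);
    if target_k == 0 && i0 == n { return 0; }
    // Wavefront for cost g-1: fr[k + (g-1)] = Some(i) for diagonals -(g-1)..=g-1.
    let mut fr: Vec<Option<usize>> = vec![Some(i0)];
    for g in 1isize.. {
        let mut next: Vec<Option<usize>> = vec![None; (2 * g + 1) as usize];
        for k in -g..=g {
            let prev = |kk: isize| -> Option<usize> {
                let idx = kk + g - 1;
                if idx < 0 || idx >= fr.len() as isize { None } else { fr[idx as usize] }
            };
            // One extra edit: substitution (from k), deletion (from k-1), insertion (from k+1).
            let mut best: Option<usize> = None;
            for (dk, di) in [(0isize, 1usize), (-1, 1), (1, 0)] {
                if let Some(pi) = prev(k + dk) {
                    let i = pi + di;
                    let j = i as isize - k;
                    if i <= n && (0..=m as isize).contains(&j) {
                        best = Some(best.map_or(i, |cur| cur.max(i)));
                    }
                }
            }
            let fr_i = best.map(|i| extend(i, (i as isize - k) as usize));
            if k == target_k && fr_i == Some(n) { return g as usize; }
            next[(k + g) as usize] = fr_i;
        }
        fr = next;
    }
    unreachable!()
}
```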

Time complexity. The total number of farthest reaching states is \(O(s^2)\), since there are \(2s+1\) diagonals within distance \(s\), and each has at most \(s+1\) farthest reaching states. The total time spent on \(\lcp\) is at most \(O(ns)\), since on each of the \(2s+1\) diagonals, the \(\lcp\) calls cover at most \(n\) characters in total. Thus, the worst case of this method is \(O(ns)\). Nevertheless, Ukkonen observes (Ukkonen 1985) that in practice the total time needed for \(\lcp\) can be small, and Myers proves (Myers 1986) that the LCS-version of the algorithm does run in expected \(O(n+s^2)\) when we assume that the input is a random pair of sequences with distance \(s\).

Myers also notes that the \(\lcp\) can be computed in \(O(1)\) by first building (in \(O(n+m)\) time) a suffix tree on the input strings and then using an auxiliary data structure to answer lowest-common-ancestor queries, leading to a worst-case \(O(n+s^2)\) algorithm. However, this does not perform well in practice.

We remark here that when the divergence \(d=s/n\) is fixed at, say, \(1\%\), \(s^2\) still grows quadratically in \(n\), and thus, in practice this method still becomes slow when the inputs become too long.

Space complexity. A naive implementation of the method requires \(O(s^2)\) memory to store all values of \(F_{gk}\) (on top of the \(O(n+m)\) input sequences). If only the distance is needed, only the last front has to be stored and \(O(s)\) additional memory suffices. To reduce the \(O(s^2)\) memory, Hirschberg’s divide-and-conquer technique can also be applied here (Myers 1986): we can run two instances of the search in parallel, from the start and end of the alignment graph, until they meet. Then, this meeting point must be on the optimal alignment, and we can recurse into the two sub-problems. These now have distance \(s/2\), so that overall, the cost is \[ 2\cdot (s/2)^2 + 4\cdot (s/4)^2 + \dots = s^2/2+s^2/4+\dots < s^2. \]

Applications. Wu et al. (Wu et al. 1990) and Papamichail and Papamichail (Papamichail and Papamichail 2009) apply diagonal transition to align sequences of different lengths, by incorporating the gap-heuristic. Diagonal transition has also been extended to linear and affine costs in the wavefront alignment algorithm (WFA) (Marco-Sola et al. 2020) in a way similar to (Gotoh 1982), by introducing multiple layers to the graph. Similar to Myers (Myers 1986), BiWFA (Marco-Sola et al. 2023) applies Hirschberg’s divide-and-conquer approach (Hirschberg 1975) to obtain \(O(s)\) memory usage (on top of the \(O(n+m)\) input).

10 Subquadratic methods and lower bounds

Lower bounds. Backurs and Indyk (Backurs and Indyk 2018) have shown that unit cost edit distance cannot be solved in time \(O(n^{2-\delta})\) for any \(\delta > 0\), on the condition that the Strong Exponential Time Hypothesis (SETH) is true. Soon after, it was shown that SETH also implies that LCS cannot be solved in time \(O(n^{2-\delta})\) for any \(\delta > 0\) (Abboud, Backurs, and Williams 2015).

Figure 7: In the four Russians method, the \(n\times m\) grid is divided into blocks of size \(r\times r\). For each block, differences between DP table cells along the top row (\(R\)) and left column (\(S\)) are the input, together with the corresponding substrings of \(A\) and \(B\). The output is the differences along the bottom row (\(R'\)) and right column (\(S'\)). For each possible input of a block, the corresponding output is precomputed, so that the DP table can be filled by using lookups only. Red shaded states are not visited. (CC0; source)


Four Russians method. The so-called four Russians method was introduced by (Arlazarov et al. 1970). It is a general method to speed up DP algorithms from \(n^2\) to \(n^2 / \log n\), provided that entries are integers and all operations are “local”.

This idea was applied to pairwise alignment by Masek and Paterson (Masek and Paterson 1980), resulting in the first subquadratic worst-case algorithm for edit distance. It works by partitioning the \(n\times m\) matrix into blocks of size \(r\times r\), for some \(r=\log_k n\), as shown in Figure 7. Consider the differences \(R_i\) and \(S_i\) between adjacent DP cells along the top row (\(R_i\)) and left column (\(S_i\)) of the block. The core observation is that the differences \(R'_i\) and \(S'_i\) along the bottom row and right column of the block only depend on \(R_i\), \(S_i\), and the substrings \(a_i\cdots a_{i+r}\) and \(b_j\cdots b_{j+r}\). This means that for some value of \(k\) depending on the alphabet size \(\sigma\), \(r=\log_k n\) is small enough so that we can precompute the values of \(R'\) and \(S'\) for all possibilities of \((R, S, a_i\cdots a_{i+r}, b_j\cdots b_{j+r})\) in \(O(n^2/r^2)\) time. In practice, \(r\) needs to be quite small.

Using these precomputed values, the DP can be sped up by doing a single \(O(1)\) lookup for each of the \(O(n^2/r^2)\) blocks, for a total runtime of \(O(n^2/\log^2 n)\). The runtime was originally reported as \(O(n^2/\log n)\), but subsequent papers realized that the \(r\) differences along each block boundary fit in a single machine word, so that lookups are indeed \(O(1)\) instead of \(O(r)\). While this is the only known subquadratic worst-case algorithm, it does not break the \(O(n^{2-\delta})\) lower bound, since \(\log^2 n\) grows subpolynomially.

Masek’s method requires a constant size alphabet. A first extension to general alphabets increased the time to \(O(n^2 (\log \log n)^2 / \log^2(n))\) (Bille and Farach-Colton 2008), and this was later improved to \(O(n^2 \log \log n / \log^2(n))\) (Grabowski 2016). An algorithm with similar complexity also works for LCS.

Applications. Wu et al. provide an implementation of this method for approximate string matching (Wu, Manber, and Myers 1996). They suggest a block size of \(1\times r\), for \(r=5\) or \(r=6\), and provide efficient ways of transitioning from one block to the next.

Nowadays, the bit-parallel technique of Myers (Myers 1999) has replaced four Russians, since it can compute up to 64 cells in a single step, while not having to wait for (comparatively) slow lookups of the precomputed data.

11 Parallelism

So far we have mostly focused on the theoretical time complexity of methods. However, since the introduction of \(O(n+s^2)\) diagonal transition around 1985, no further significant breakthroughs in theoretical complexity have been found. Thus, since then, the focus has shifted away from reducing the number of computed states and towards computing states faster through more efficient implementations and more modern hardware. Most of the techniques in this area were first developed for either semi-global or local alignment, but they apply just as much to global alignment.

As Spouge notes (Spouge 1989) in the context of computational volumes:

The order of computation (row major, column major or antidiagonal) is just a minor detail in most algorithms.

But this decision is exactly at the core of the most efficient implementations.

SWAR. The first technique in this direction is microparallelism (Alpern, Carter, and Gatlin 1995), nowadays also called SWAR (SIMD within a register), where each (\(64\) bit) computer word is divided into multiple (e.g. \(16\) bit) parts, and word-size operations modify all (\(4\)) parts in parallel. This can then be applied with inter-sequence parallelism to search a given query in multiple reference sequences in parallel (Alpern, Carter, and Gatlin 1995; Baeza-Yates and Gonnet 1992; Wu and Manber 1992; Hyyrö, Fredriksson, and Navarro 2005; Rognes 2011).

Anti-diagonals. Hughey (Hughey 1996) notes that values along anti-diagonals of the DP matrix are not dependent on each other, and thus can be computed in parallel. Wozniak (Wozniak 1997) applied SIMD (single instruction, multiple data) instructions for this purpose, which are special CPU instructions that operate on multiple (currently up to \(8\), for AVX-512) computer words at a time.

TODO: Fig; possibly from (Rognes and Seeberg 2000).

Vertical packing. Rognes and Seeberg (Rognes and Seeberg 2000, 702) also use microparallelism, but with vertical instead of anti-diagonal vectors:

The advantage of this approach is the much-simplified and faster loading of the vector of substitution scores from memory. The disadvantage is that data dependencies within the vector must be handled.

Indeed, when using vertical vectors a sequence profile (see below) can be used to quickly determine the (mis)match score of each of the characters in the vector. However, the DP cells now depend on each other, and it may be necessary to (slowly) iterate through the values in the vector to handle insertions corresponding to vertical edges in the edit graph.

Striped SIMD. To work around the dependencies between adjacent states in each vector, Farrar (Farrar 2006) introduces an alternative striped SIMD scheme where lanes are interleaved with each other. Thus, the query is split into, say, \(8\) segments that are aligned in parallel (each in one lane of the SIMD vector). In this case, there are still dependencies between adjacent segments, and these are resolved using a separate while loop, for as long as needed. This is used by, for example, BSAlign (Shao and Ruan 2024).

Bitpacking. An observation that we have not used so far is that for (unit cost) edit distance specifically, it can be shown that the distances of adjacent states differ by at most 1. Myers (Myers 1999) uses this fact to encode \(w=64\) adjacent differences into two $w$-bit words: one word in which bit \(j\) indicates that the \(j\)’th difference is \(+1\), and one word in which bit \(j\) indicates that the \(j\)’th difference is \(-1\). If we additionally know the difference along the top edge, Myers’ method can efficiently compute the output of the \(1\times w\) rectangle in only \(20\) operations.

TODO: figure

We call each consecutive non-overlapping chunk of \(64\) rows a lane, so that there are \(\lceil m/64\rceil\) lanes, where the last lane may be padded. As presented originally, for semi-global alignment (or text searching), this method only uses \(17\) instructions, but some additional instructions are needed to carry the horizontal difference to the next lane when \(m>w\).

Currently, Edlib (Šošić and Šikić 2017) is the most popular tool that implements bitpacking, alongside band doubling and divide-and-conquer, so that it has a complexity of \(O(ns/w)\).

The supplement of BitPAl (Loving, Hernandez, and Benson 2014; Benson, Hernandez, and Loving 2013) introduces an alternative scheme for edit distance based on a different encoding of the \(\{-1,0,+1\}\) values, that also ends up using \(20\) instructions.

Profile. The DP recurrence relation depends on the sequences \(A\) and \(B\) via \(\delta(a_i,b_j)\), which indicates \(1\) when \(a_i\neq b_j\). When using a vectorized method, we would like to query this information efficiently for multiple pairs \((i, j)\). When using vertical vectors, this can be done efficiently using a profile (Rognes and Seeberg 2000). For Myers’ bitpacking, this looks as follows. For each character \(c\), the bitvector \(Eq[c]\) stores for each character \(b_j\) of \(B\) whether it equals \(c\) as a single bit. Then, when the lane for rows \(j=0\) to \(j=63\) is processed in column \(i\), we can simply read the indicator word corresponding to this lane from the bitvector for \(c=a_i\) (\(Eq[a_i]\)) and directly use it in the bitwise algorithm.
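
Putting the bitpacking and the profile together, the following sketch computes the unit-cost edit distance when \(m\leq 64\), so that a single lane (one machine word) covers all of \(B\); the carry logic between multiple lanes is omitted, and this is one common single-word formulation of Myers’ column update rather than his original multi-block code.

```rust
/// Bit-parallel edit distance for m = |b| <= 64 (a single 64-bit lane).
/// pv/mv hold the +1/-1 vertical differences of the current column.
fn edit_distance_bitpacked(a: &[u8], b: &[u8]) -> usize {
    let m = b.len();
    assert!(1 <= m && m <= 64);
    // Profile: eq[c] has bit j set iff b[j] == c.
    let mut eq = [0u64; 256];
    for (j, &c) in b.iter().enumerate() { eq[c as usize] |= 1u64 << j; }
    let last = 1u64 << (m - 1);        // bit of the last row, tracking D(i, m)
    let mut pv = u64::MAX >> (64 - m); // initially all vertical differences are +1
    let mut mv = 0u64;
    let mut dist = m;                  // D(0, m) = m
    for &c in a {
        let e = eq[c as usize];
        // Positions where the diagonal difference is 0 (a match or a carried -1).
        let xh = (((e & pv).wrapping_add(pv)) ^ pv) | e | mv;
        let ph = mv | !(xh | pv); // horizontal difference +1
        let mh = pv & xh;         // horizontal difference -1
        if ph & last != 0 { dist += 1; } else if mh & last != 0 { dist -= 1; }
        // Shift in the boundary: in global alignment the top row always has
        // horizontal difference +1, hence the `| 1`.
        let ph = (ph << 1) | 1;
        let mh = mh << 1;
        pv = mh | !(xh | ph);
        mv = ph & xh;
    }
    dist
}
```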

For SIMD and SWAR methods that use packed integer values (rather than single bits), the same can be done, where the profile simply stores the values \(\delta(c, b_j)\) for each character \(c\) and position \(j\).

Difference recurrence relations. For more general cost models, such as affine costs, direct bitpacking does not work, since differences between adjacent states can be larger than \(1\). Still, it is beneficial to consider differences between adjacent states rather than absolute distances: these are typically smaller, so that they require fewer bits to store and more of them can be processed in parallel (Suzuki and Kasahara 2018). Suzuki and Kasahara introduced this technique for affine-cost local alignment, and it has subsequently been used by KSW2 and BSAlign (Shao and Ruan 2024).

TODO: Many packing tools (also based on automata) for text search / semi-global alignment

12 LCS and Contours

Figure 8: Contours as shown by Hirschberg (Hirschberg 1977)


So far, all pairwise alignment methods we looked at are based on the alignment graph. The longest common subsequence problem also admits different solutions. See e.g. (Bergroth, Hakonen, and Raita, n.d.) for a survey.

Sparse dynamic programming. Instead of finding a minimal-cost path through a graph, we can search for the longest chain of matches (Hirschberg 1975, 1977; Hunt and Szymanski 1977). Suppose there are \(r\) matches in total, where each match is a pair \((i,j)\) such that \(a_i=b_j\). We can then process these matches from left to right (by increasing \(i\) and \(j\)), and for each of them determine the longest chain of matches ending in them. By extension, we determine for each state \(\st ij\) the length \(S(\st ij)\) of the LCS of \(A_{<i}\) and \(B_{<j}\). Methods like this, which only consider a subset of the vertices of the graph or DP-matrix, use sparse dynamic programming, and are reviewed and extended in (Eppstein et al. 1992a, 1992b).

Note that \(S\) can never decrease as we move right or down the matrix, and this allows us to efficiently store the values of a column via a list of thresholds: the rows where the LCS value jumps from \(\ell\) to \(\ell+1\). Then, the value of a cell can be found using binary search, so that the overall algorithm runs in \(O((r + n) \lg n)\). While this is slow in general, when there are only few (\(r\approx n\)) matches, as may be the case when comparing lines of code, this algorithm is much faster than previous quadratic methods.
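
The following sketch computes the LCS length in this spirit: matches are generated per character of \(A\) (right to left within a row, so that two matches in the same row never chain), and a list of thresholds is maintained with binary search. It is an illustrative simplification, not the exact algorithm of Hunt and Szymanski.

```rust
/// Sparse LCS: thresholds[l] = smallest column j such that some chain of l+1
/// matches ends at column j. Runs in O((r + n) log n) over the r matches.
fn lcs_sparse(a: &[u8], b: &[u8]) -> usize {
    use std::collections::HashMap;
    // For each character, the columns of B where it occurs.
    let mut occ: HashMap<u8, Vec<usize>> = HashMap::new();
    for (j, &c) in b.iter().enumerate() { occ.entry(c).or_default().push(j); }
    let mut thresholds: Vec<usize> = Vec::new();
    for &c in a {
        if let Some(js) = occ.get(&c) {
            // Right-to-left, so each match only extends chains from earlier rows.
            for &j in js.iter().rev() {
                let l = thresholds.partition_point(|&t| t < j);
                if l == thresholds.len() { thresholds.push(j); } else { thresholds[l] = j; }
            }
        }
    }
    thresholds.len()
}
```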

Contours. The regions of equal \(S\) create a set of contours (Figure 8), where contour \(\ell\) is the boundary between the regions with \(S(u)\geq \ell\) and \(S(u) < \ell\). Each contour is determined by a set of dominant matches \(a_i=b_j\) for which \(S(i+1,j+1)\) is larger than both \(S(i, j+1)\) and \(S(i+1,j)\).

LCSk. We also mention here the LCSk variant, where the task is to maximize the number of length-\(k\) matches between the two input strings. This was first introduced around 1982 by Wilbur and Lipman (Wilbur and Lipman 1983; Wilbur and Lipman 1984), and rediscovered in 2014 (Benson, Levy, and Shalom 2014; Pavetić, Žužić, and Šikić 2014; Pavetić et al. 2017; Deorowicz and Grabowski 2013). When choosing \(k\) sufficiently larger than \(\log_\sigma n\), this has the benefit that the number of $k$-mer matches between the two strings is typically much smaller than \(n^2\), so that the \(O((r+n)\lg n)\) runtime becomes feasible. The drawback, however, is that this does not provide an exact solution to the original LCS problem.

Chaining k-mers. A solution to the LCSk problem consists of a sequence of matching $k$-mers. Together, these form a chain, which is formally defined as a sequence of vertices \(u_1\), \(\dots\), \(u_n\) in a partially ordered set (whose transitive closure is a directed acyclic graph), such that \(u_1\leq u_2\leq \dots \leq u_n\). Then, LCSk is equivalent to finding the longest chain in the poset of k-mer matches. In this formulation, a score (the length) is maximized. Myers and Miller (Myers and Miller 1995) instead consider a version where the cost of a chain is minimized, by using the gap cost over the gap between consecutive k-mer matches in the chain.

13 Some tools

We now briefly list a few recent and competitive global alignment tools. For more, we refer to the survey of Alser et al. (Alser et al. 2021).

KSW2 (Li 2016) implements banded alignment using the difference recurrence (Suzuki and Kasahara 2018) with SIMD, and supports (double) affine costs.

Edlib (Šošić and Šikić 2017) implements band doubling (Ukkonen 1985) using the \(\g(u) + \cgap(u, v_t)\leq t\) computational volume, similar to A* with the gap-heuristic. It uses Myers’ bitpacking (Myers 1999). For traceback, it uses Hirschberg’s divide-and-conquer approach (Hirschberg 1975): once the distance is found, the alignment is started over from both sides towards the middle column, where a state on the shortest path is determined. This is recursively applied to the left and right halves until the sequences are short enough that quadratic memory can be used.

WFA (Marco-Sola et al. 2020) builds on the diagonal transition method (Ukkonen 1985; Myers 1986), and extends it to affine costs using a method similar to (Gotoh 1982). BiWFA (Marco-Sola et al. 2023) is a later version that applies divide-and-conquer to reduce to linear memory usage.

14 Summary

We summarize most of the papers discussed in this section in chronological order in Table 2. Not mentioned in the table are the review papers by Smith et al. (Smith, Waterman, and Fitch 1981), Kruskal (Kruskal 1983), Spouge (Spouge 1991), and Navarro (Navarro 2001), and also Bergroth et al.’s survey on LCS algorithms (Bergroth, Hakonen, and Raita, n.d.). A more recent review on read-aligners was done by Alser et al. (Alser et al. 2021).

Table 2: Chronological overview of papers related to exact global pairwise alignment. Parameters are sequence lengths \(n\) and \(m\) with \(n\geq m\). The (edit) distance is \(s\). The number of matching characters or k-mers is \(r\). The length of the LCS is \(p\). \(w=64\) is the word size, and lastly we assume a fixed-size alphabet \(\Sigma\). Time is worst-case unless noted otherwise, and space usage is to obtain the full alignment. Methods in bold are newly introduced or combined.
| Paper | Cost model | Time | Space | Method | Remarks |
|---|---|---|---|---|---|
| Vintsyuk (1968) | no deletions | \(O(nm)\) | \(O(nm)\) | DP | different formulation in a different domain, but conceptually similar |
| Needleman and Wunsch (1970) | edit distance, arbitrary gap costs | \(O(n^2m)\) | \(O(nm)\) | DP | solves pairwise alignment in polynomial time |
| Sankoff (1972) | LCS | \(\boldsymbol{O(nm)}\) | \(O(nm)\) | DP | the first quadratic algorithm |
| Sellers (1974) and Wagner and Fischer (1974) | edit distance | \(O(nm)\) | \(O(nm)\) | DP | the quadratic algorithm now known as Needleman-Wunsch |
| Hirschberg (1975) | LCS | \(O(nm)\) | \(\boldsymbol{O(n)}\) | divide-and-conquer | introduces linear-memory backtracking |
| Hunt and Szymanski (1977) | LCS | \(\boldsymbol{O((r+n)\lg n)}\) | \(O(r+n)\) | thresholds | distance only |
| Hirschberg (1977) | LCS | \(\boldsymbol{O(p(m-p)\lg n)}\) | \(\boldsymbol{O(n+(m-p)^2)}\) | contours + band | for similar sequences |
| Masek and Paterson (1980) | edit distance | \(\boldsymbol{O(n\cdot \max(1, m/\lg n))}\) | \(O(n^2/\lg^2 n)\) | four-russians | best known complexity; requires a finite alphabet |
| Smith, Waterman, and Fitch (1981) | affine | - | - | - | first to suggest affine costs, as future work |
| Gotoh (1982) | affine | \(O(nm)\) | \(O(nm)\) | DP | extends Sellers (1974) to affine costs |
| Altschul and Erickson (1986) | affine | \(O(nm)\) | \(O(nm)\) | DP | fixes a bug in the traceback of (Gotoh 1982) |
| Nakatsu, Kambayashi, and Yajima (1982) | LCS | \(\boldsymbol{O(n(m-p))}\) | \(O(n(m-p))\) | DP on thresholds | improves Hirschberg (1977), subsumed by Myers (1986) |
| Wilbur and Lipman (1983, 1984) | LCSk | - | - | chaining k-mer matches | approximate |
| Fickett (1984) | edit distance | \(O(nt)\) | \(O(nt)\)* | bound \(\g(u)\leq t\) | assuming \(t\geq s\) |
| Dijkstra (Dijkstra 1959; Ukkonen 1985; Myers 1986) | edit distance | \(O(ns)\) | \(O(ns)\) | Dijkstra’s algorithm | implemented using an \(O(1)\) bucket queue |
| Ukkonen (1985) | edit distance | \(\boldsymbol{O(ms)}\) | \(O(ns)\) | band doubling | first \(O(ns)\) algorithm for edit distance |
| Ukkonen (1985) | edit distance | \(O(n+s^2)\) expected | \(\boldsymbol{O(n+s^2)}\) | diagonal transition | introduces the diagonal transition method; requires fixed indel cost |
| Myers (1986) | LCS | \(O(n+s^2)\) expected | \(O(s)\) working memory | diagonal transition, divide-and-conquer | introduces the diagonal transition method for LCS, \(O(n+s^2)\) expected time |
| Myers (1986) | LCS | \(\boldsymbol{O(n+s^2)}\) | \(O(n)\) | suffix tree | better worst-case complexity, but slower in practice |
| Myers and Miller (1988) | affine | \(O(nm)\) | \(O(m + \lg n)\) | divide-and-conquer | applies Hirschberg (1975) to Gotoh (1982) to get linear space |
| Spouge (1989) | edit distance | - | - | A*, computational volumes | review paper |
| Gotoh (1990) | double affine / piecewise linear | \(O(Lmn)\) | \(O(nm+Lm)\) | DP, \(L\) layers in the graph | |
| Wu, Manber, Myers, and Miller (1990) | unit cost | \(O(n+(s-\vert n-m\vert)s)\) expected | \(O(n)\) | diagonal transition, gap-heuristic, divide-and-conquer | |
| Eppstein, Galil, Giancarlo, and Italiano (1992a) | LCSk | \(O(n + d \log\log \min(d, nm/d))\) | - | sparse dynamic programming, contours | \(d\) is the number of dominant matches |
| Myers and Miller (1995) | LCSk, edit distance | \(O(r \log^2 r)\) | \(O(r \log r)\) | chaining k-mer matches with gap-cost | \(r\) is the number of matches |
| Myers (1999) | unit costs | \(O(nm/w)\) | \(O(m\sigma / w)\) | DP, bitpacking | |
| Papamichail and Papamichail (2009) | unit costs | \(O(n+(s-\vert n-m\vert)s)\) | \(O(s)\) | A*, gap heuristic, diagonal transition | |
| Deorowicz and Grabowski (2013) | LCS\(k\) | \(O(n + r \log l)\) | \(O(n + \min(r, nl))\) | thresholds | modifies Hunt and Szymanski (1977) for LCS\(k\) |
| BitPAl Loving, Hernandez, and Benson (2014) | edit distance | \(O(znm/w)\) | \(O(znm/w)\) | bitpacking, difference recurrence | \(z\) depends on the edit costs |
| Grabowski (2016) | unit cost / LCS | \(O(nm \log \log n / \log^2 n)\) | \(o(n)\) overhead | four-russians | general alphabet |
| Edlib Šošić and Šikić (2017) | unit costs | \(O(ns/w)\) | \(O(n)\) | bitpacking, band doubling, divide-and-conquer | extends Myers’ bitpacking to global alignment |
| libgaba Suzuki and Kasahara (2018) | affine | - | - | SIMD, affine difference recurrence | adaptive band; not exact |
| KSW2 Li (2018) | affine, double affine | \(O(nm/w)\) | \(O(nm/w)\) | implements (Suzuki and Kasahara 2018) | \(w\) SIMD lanes |
| WFA Marco-Sola, Moure, Moreto, and Espinosa (2020) | affine | \(O(n+s^2)\) expected | \(O(n+s^2)\) | diagonal transition | extends diagonal transition to gap-affine costs (Gotoh 1982) |
| WFALM Eizenga and Paten (2022) | affine | \(O(n+s^2)\) | \(O(n+s^{3/2})\) | diagonal transition, square-root decomposition | reduces the memory usage of WFA by only storing \(1/\sqrt s\) of the fronts |
| BiWFA Marco-Sola, Eizenga, Guarracino, Paten, Garrison, and Moreto (2023) | affine | \(O(n+s^2)\) expected | \(O(s)\) working memory | diagonal transition, divide-and-conquer | applies Hirschberg (1975) to WFA to get linear space |
| Block Aligner Liu and Steinegger (2023) | affine; scoring matrix | - | - | SIMD, blocks, adaptive band | |
| TALCO Walia, Ye, Bera, Lodhavia, and Turakhia (2024) | affine | - | - | adaptive band; traceback convergence | resolves the trace during alignment, saving memory |
| BSAlign Shao and Ruan (2024) | affine | - | - | striped SIMD, difference recurrence, (adaptive) banding | first to implement these together |
| A*PA Groot Koerkamp and Ivanov (2024) | unit costs | \(O(n)\) best case | \(O(n)\) | A*, gap-chaining seed heuristic, pruning, diagonal transition | only for random strings with random errors, with \(n\ll\vert \Sigma\vert ^{1/e}\) |
| A*PA2 Groot Koerkamp (2024) | unit costs | \(O(n)\) best case | - | DP, A*, blocks, (incremental) band doubling, SIMD, bitpacking | |

15 TODO

  • where to make the point that graph/Dijkstra is slow, and DP is 1000x faster?
  • consistent capitalization of headers
  • consider dropping appendix/human data results; we only have to make the high-level point here
  • Redo copied figures
  • figure captions
  • Add figures to methods
  • Add the big table and make it complete
  • incorporate ideas from more pairwise alignment blogposts?
  • 0 and 1 and so do not go in math
  • $k$-mers
  • Talco technique
  • BSAlign technique?
  • ensure all mentioned papers are also in the table?
  • Refer myers-miller-95 in context of A*PA seed heuristic. it’s nearly the same.
  • Mention that content focuses on background for A*PA[2]

15.1 A*PA2

  • TODO: Properly contextualize A*PA2 wrt A*PA:
    • Drop the graph stuff
    • Do DP/NW/SIMD/bitpacking instead

Summary/overview/contribs

  • blocks
  • simd
  • encoding
  • incremental doubling
  • traceback
  • A*
  • sparse heuristic invocation

Notation

Methods

  • mostly copy paste; include some appendix stuff

Evaluation

  • mostly copy paste; include some appendix stuff?

Discussion

15.2 Semi-global alignment

#+print_bibliography: