A History of Pairwise Alignment

Table of Contents

1 A Brief History
- 1.1 A*PA
- 1.2 A*PA2
- 1.3 Overview
2 Problem Statement
3 Alignment types
4 Cost Models
- 4.1 Minimizing Cost versus Maximizing Score
5 The Classic DP Algorithms
6 Linear Memory using Divide and Conquer
7 Dijkstra’s Algorithm and A*
8 Computational Volumes and Band Doubling
9 Diagonal Transition
10 Parallelism
11 LCS and Contours
12 Some Tools
13 Subquadratic Methods and Lower Bounds
14 Summary

This is Chapter 2 of my thesis, to introduce the first part on Pairwise Alignment.

Summary

In this chapter, we survey methods for global pairwise alignment, which is the problem of finding the minimal number of mutations between two genomic sequences.

There is a vast amount of literature on algorithms for pairwise alignment and its variants. We start with the classic DP algorithm by Needleman and Wunsch and go over a number of improvements, both algorithmic and in the implementation. We focus on those methods that A*PA and A*PA2 build on.

Attribution

This chapter is based on the introductions of the A*PA and A*PA2 papers (Groot Koerkamp and Ivanov 2024; Groot Koerkamp 2024), and also on unpublished notes on pairwise alignment. The A*PA paper has joint first-authorship with Pesho Ivanov, and also the unpublished notes are coauthored with Pesho Ivanov.

\[ \newcommand{\g}{g^*} \newcommand{\h}{h^*} \newcommand{\f}{f^*} \newcommand{\cgap}{c_{\mathrm{gap}}} \newcommand{\xor}{\ \mathrm{xor}\ } \renewcommand{\and}{\ \mathrm{and}\ } \renewcommand{\st}[2]{\langle #1, #2\rangle} \newcommand{\matches}{\mathcal M} \newcommand{\ed}{\operatorname{ed}} \renewcommand{\d}{\operatorname{d}} \newcommand{\lcp}{\operatorname{LCP}} \]

1 A Brief History Link to heading

The problem of pairwise alignment was originally introduced in genomics in 1970 by Needleman and Wunsch (Needleman and Wunsch 1970). In their paper, they use their method to align proteins of around 100 amino acids. This way, they find the mutations between the two sequences, which can be insertions, deletions, or substitutions of single amino acids. Specifically, the alignment corresponds to the minimal number of such mutations that is needed to transform one sequence into the other.

Since this first paper, the sequences being aligned have grown significantly in length. For example, the mutations between different strains of the $30\ 000$ bases long SARS-CoV-2 (COVID) virus can be found this way. Even more, full-genome alignment of $3\ 000\ 000\ 000$ bases long human genomes is entering the picture, although here approximate rather than exact methods tend to be used.

Algorithmic improvements. Along with the increasing length and amount of genomic sequences, alignment algorithms have significantly improved since their introduction over 50 years ago. This started with a significant number of algorithmic improvements from 1970 to 1990. These first brought the runtime down from cubic ($O(n^2m)$ for sequences of length $n$ and $m$) to near-linear when the sequences are similar, with the band-doubling method of Ukkonen (Ukkonen 1985) with complexity $O(ns)$. This is still used by the popular tool Edlib (Šošić and Šikić 2017). At the same time, that complexity was further improved to $O(n+s^2)$ in expectation by Ukkonen and Myers (Ukkonen 1985; Myers 1986) with the diagonal transition method, and also this is used in the popular BiWFA (Marco-Sola et al. 2020, 2023) aligner. BiWFA further incorporates the divide-and-conquer technique or Hirschberg (Hirschberg 1975) to reduce its memory usage.

All this is to say: there was a large number of early algorithmic contributions, and at the core, the best methods conceptually haven’t changed much since then (Medvedev 2023).

Implementation improvements. Nevertheless, the amount of genomic data has significantly increased since then, and algorithms had to speed up at the same time to keep up. As the algorithmic speedups seemed exhausted, starting in roughly 1990, the focus shifted more towards more efficient implementations of these methods. By 1999, this resulted in the $O(ns/w)$ bitpacking algorithm of Myers (Myers 1999), that provides up to $w=64\times$ speedup by packing 64 adjacent states of the internal DP matrix into two 64 bit computer words. With more recent advances in computer hardware, this has also extended to SIMD instructions that can do up to 512-bit instructions, providing another up to $8\times$ speedup (Suzuki and Kasahara 2018).

Approximate alignment. At the same time, there also has been a rise in popularity of approximate alignment algorithms and tools. Unlike all methods considered so far, these are not guaranteed to find the minimal number of mutations between the two sequences. Rather, they sacrifice this optimality guarantee to achieve a speedup over exact methods. Two common techniques are x-drop (Zhang et al. 2000) and static banding, that both significantly reduce the region of the DP matrix (Figure 1) that needs to be computed to find an alignment.

Before proceeding with the survey, we briefly highlight the contributions A*PA and A*PA2 make.

1.1 A*PA Link to heading

Despite all the algorithmic contributions so far, in a retrospective on pairwise alignment (Medvedev 2023), Medvedev observed that

a major open problem is to implement an algorithm with linear-like empirical scaling on inputs where the edit distance is linear in $n$.

Indeed, the best algorithms so far scale as $O(s^2)$ when the edit distance is $s$, and this is still linear in $n$ when the edit distance is, say, $s=0.01 \cdot n$. A*PA attempts to break this boundary.

As we will see soon, pairwise alignment corresponds to finding the shortest path in a graph. The classic algorithm for this is Dijkstra’s algorithm (Dijkstra 1959). A faster version of it that can use domain-specific information is the A* algorithm (Hart, Nilsson, and Raphael 1968). It achieves this by using a heuristic that estimates the cost of the (remaining) alignment. In A*PA, we take the seed heuristic of A*ix (Ivanov et al. 2020; Ivanov, Bichsel, and Vechev 2022) as a starting point and improve it to the gap-chaining seed heuristic, similar to (Myers and Miller 1995), and extend it with inexact matches. For inputs with uniform error rate up to $10\%$, this can estimate the remaining distance very accurately. By additionally adding pruning, A*PA2 ends up being near-linear in practice when errors are uniformly distributed, improving on the quadratic behaviour of previous exact methods.

1.2 A*PA2 Link to heading

A drawback of A*PA is that it is based on plain A*, which, like Dijkstra’s algorithm, is a relatively cache-inefficient graph algorithm.

On the other hand, some of the fastest aligners use DP (dynamic programming) with bitpacking to very efficiently compute the edit distance between sequences, even though they do not have the near-linear scaling of A*PA. In A*PA2, we build a highly optimized aligner. It merges the ideas of band-doubling and bit-packing (as already used by Edlib (Šošić and Šikić 2017)) with SIMD and the heuristics developed for A*PA. This results in significant speedups over both A*PA and Edlib. In particular, A*PA2 is up to $1000\times$ faster per visited state.

As Fickett stated 40 years ago (Fickett 1984, 1) and still true today,

at present one must choose between an algorithm which gives the best alignment but is expensive, and an algorithm which is fast but may not give the best alignment.

A*PA2 narrows this gap, and is nearly as fast as approximate methods.

1.3 Overview Link to heading

The remainder of this chapter reviews the history of exact global pairwise alignment in more detail. In particular, we focus on those methods that A*PA and A*PA2 build on, including algorithmic improvements and implementation techniques. Rather than presenting all work strictly chronologically, we treat them topic by topic. At times, we include formal notation for the concepts we introduce, which will be useful in later chapters.

We start our survey with a formal problem statement for pairwise alignment (2). Then, we list a number of variants of global alignment (3, 4). While these are not our focus, they can help to contextualize other existing methods. Then we move on to the classic DP algorithms in 5 and the algorithmic improvements in later sections. These are also covered in the surveys by Kruskal (Kruskal 1983) and Navarro (Navarro 2001).

In 13 we present some theoretical results on the complexity of the pairwise alignment problem and the best worst-case methods (although not practical). We also introduce some methods for the strongly related longest common subsequence (LCS) problem (11). Then, in 12, we briefly explain the methods used in some common tools that are the main baseline for the comparison of A*PA and A*PA2. We end with a table summarizing the papers discussed in this chapter, 14.

Scope. There is also a vast literature on text searching, where all (approximate) occurrences of a short pattern in a long text must be found. This field has been very active since around 1990, and again includes a large number of papers. We consider these mostly out of scope and refer the reader to Navarro’s survey (Navarro 2001).

More recently, read mapping has become a crucial part of bioinformatics, and indeed, there is also a plethora of different tools for aligning and mapping reads. This is a generalization of text searching where patterns tend to be significantly longer (100 to 10000 of bases, rather than tens of characters). Due to the amounts of data involved, most solutions to this problem are approximate, with the notable exception of A*ix (Ivanov et al. 2020; Ivanov, Bichsel, and Vechev 2022), which is the precursor for the work on A*PA presented in subsequent chapters, and POASTA (van Dijk et al. 2024). we refer the reader to the survey by Alser at al. (Alser et al. 2021) for a thorough overview of many tools and methods used for read alignment.

Lastly, we again note that most moderns read mappers and alignment tools are approximate, in that they are not guaranteed to return an alignment with provably minimal cost. A*PA and A*PA2 are both exact methods, and thus we will focus on these. We again refer the reader to (Alser et al. 2021).

2 Problem Statement Link to heading

The main problem of this chapter is as follows.

Given two sequences $A$ and $B$ of lengths $n$ and $m$, compute the edit distance $\ed(A,B)$ between them.

Before looking into solutions to this problem, we first cover some theory to precisely define it.

Input sequences. As input, we take two sequences $A=a_0a_1\dots a_{n-1}$ and $B=b_0b_1\dots b_{m-1}$ of lengths $n$ and $m$ over an alphabet $\Sigma$ that is typically of size $\sigma=4$. We usually assume that $n\geq m$. We refer substrings $a_i\dots a_{i’-1}$ as $A_{i\dots i’}$ to a prefix $a_0\dots a_{i-1}$ as $A_{<i}$ and to a suffix $a_i\dots a_{n-1}$ as $A_{\geq i}$.

Edit distance. The edit distance $s:=\ed(A,B)$ is the minimum number of insertions, deletions, and substitutions needed to convert $A$ into $B$. In practice, we also consider the divergence $d:=\ed(A,B)/n$, which is the average number of errors per character. This is different from the error rate $e$, which we consider to be the (relative) number of errors applied to a pair of sequence. The error rate is typically higher than the divergence, since random errors can cancel each other.

Figure 1: An example of an edit graph (left) corresponding to the alignment of strings ABCA and ACBBA, adapted from (Sellers 1974). Bold edges indicate matches of cost 0, while all other edges have cost 1. All edges are directed from the top-left to the bottom-right. The shortest path of cost 2 is highlighted. The middle shows the corresponding dynamic programming (DP) matrix containing the distance (g(u)) to each state. The right shows the final alignment corresponding to the shortest path through the graph, where a C is inserted between the first A and B, and the initial C is substituted for a B. — Figure 1: An example of an edit graph (left) corresponding to the alignment of strings `ABCA` and `ACBBA`, adapted from (Sellers 1974). Bold edges indicate matches of cost 0, while all other edges have cost 1. All edges are directed from the top-left to the bottom-right. The shortest path of cost 2 is highlighted. The middle shows the corresponding dynamic programming (DP) matrix containing the distance (g(u)) to each state. The right shows the final alignment corresponding to the shortest path through the graph, where a `C` is inserted between the first `A` and `B`, and the initial `C` is substituted for a `B`.

Dynamic programming. Pairwise alignment has classically been approached as a dynamic programming (DP) problem. For input strings of lengths $n$ and $m$, this method creates a $(n+1)\times (m+1)$ table that is filled cell by cell using a recursive formula, as we. There are many algorithms based on DP, as we will see in 5.

Edit graph. The alignment graph or edit graph (Figure 1) is a way to formalize edit distance (Vintsyuk 1968; Ukkonen 1985). It contains states $\st ij$ ($0\leq i\leq n$, $0\leq j\leq m$) as vertices. It further contains edges, such that an edge of cost 0 corresponds to a pair of matching characters, and an edge of cost 1 corresponds to an insertion, deletion, or substitution. The vertical insertion and horizontal deletion edges have the form $\st ij \to \st i{j+1}$ and $\st ij \to \st {i+1}j$ of cost 1. Diagonal edges are $\st ij\to \st{i+1}{j+1}$ and have cost 0 when $A_i = B_j$ and substitution cost 1 otherwise. A shortest path from $v_s:=\st 00$ to $v_t := \st nm$ in the edit graph corresponds to an alignment of $A$ and $B$. The distance $d(u,v)$ from $u$ to $v$ is the length of the shortest (minimal cost) path from $u$ to $v$, and we use edit distance, distance, length, and cost interchangeably. Further we write $\g(u) := d(v_s, u)$ for the distance from the start to $u$, $\h(u) := d(u, v_t)$ for the distance from $u$ to the end, and $\f(u) := \g(u) + \h(u)$ for the minimal cost of a path through $u$.

In figures, we draw sequence $A$ at the top and sequence $B$ on the left. Index $i$ will always be used for $A$ and indicates a column, while index $j$ is used for $B$ and indicates a row.

Shortest path algorithms. Using this graph, the problem of pairwise alignment reduces to finding a shortest path in a graph. There are many shortest path algorithms for graphs, and indeed, many of them are used for pairwise alignment. Since the graph is acyclic, the simplest method is to greedily process the states in any topologically sorted order such as row-wise, column-wise, or anti-diagonal by anti-diagonal. We then start by setting $d(\st 00)=0$, and find the distance to any other state as the minimum distance to an incoming neighbour plus the cost of the final edge. As we will see soon, this is often implemented using dynamic programming (DP).

Dijkstra’s shortest path algorithm, which visits states in order of increasing distance, can also be applied here (Dijkstra 1959). This does require that all edges have non-negative weights. An extension of Dijkstra’s algorithm is A* (Hart, Nilsson, and Raphael 1968), which visits states in order of increasing ‘‘anticipated total distance’’.

3 Alignment types Link to heading

Figure 2: Different types of pairwise alignment. Bold vertices and edges indicate where each type of alignment may start and end. Local-extension alignment can end anywhere, whereas local alignment can also start anywhere. Like clipping, these modes require a score for matching characters, while in other cases, a cost model (with cost-0 matches) suffices. This figure is based on an earlier version that was made in collaboration with Pesho Ivanov. — Figure 2: Different types of pairwise alignment. Bold vertices and edges indicate where each type of alignment may start and end. Local-extension alignment can end anywhere, whereas local alignment can also start anywhere. Like clipping, these modes require a *score* for matching characters, while in other cases, a *cost model* (with cost-0 matches) suffices. This figure is based on an earlier version that was made in collaboration with Pesho Ivanov.

There are a few variants of pairwise alignments and edit distance. While the focus of this chapter is (unit cost) edit distance, it is helpful to first have an overview of the different variants since most papers each assume a slightly different context.

In global pairwise alignment, the two sequences must be fully matched against each other. In practice though, there are a number of different settings, see Figure 2.

Global: Align both sequences fully, end-to-end.
Ends-free: Ends-free alignment allows one of the sequences on each end to have a (small) number of unmatched characters. This way, the alignment is still mostly from the top-left to the bottom-right, and global-alignment algorithms still work.
Semi-global: Align a full sequence to a substring of a reference, e.g., when mapping a read onto a larger genome.
Extension: Align one sequence to a prefix of the other.
Overlap: Align two partially overlapping reads against each other.
Local-extension: Align a prefix of the two sequences. Unlike all previous methods, this alignment can end anywhere in the edit graph. This requires a cost model that trades off matching aligned characters with not matching characters at all, and thus gives a positive score to the matching bases.
Local: Local alignment takes this a step further and aligns a substring of $A$ to a substring of $B$. This is somewhat like ends-free, but now we may skip the start/end of both sequences at the same time. The increased freedom in the location of the alignment prevents the faster global-alignment algorithms from working well.
Clipping: When a sequence (read) is aligned onto a longer sequence (reference), it can happen that the read has, say, around 100 bp of noise at the ends that can not be aligned. When doing a local alignment, these characters will simply not be matched. It can be desirable to add an additional fixed-cost clipping penalty that is applied whenever the start or end of the read indeed has unaligned bases, so that effectively, an affine cost is paid for this.

In this chapter, we only consider global alignment and corresponding algorithms. These methods also work for ends-free alignment when the number of characters that may be skipped is relatively small. Later, in Chapter 5, we look deeper into semi-global alignment and its variants.

4 Cost Models Link to heading

Figure 3: Overview of different cost models. The highlighted edge indicates the cost of matching characters, while all remaining edges indicate the cost of mismatch or indel edges. The longest common subsequence (LCS) is a score rather than a cost, and counts the maximal number of matching characters. This figure is based on an earlier version that was made in collaboration with Pesho Ivanov. — Figure 3: Overview of different cost models. The highlighted edge indicates the cost of matching characters, while all remaining edges indicate the cost of mismatch or indel edges. The longest common subsequence (LCS) is a *score* rather than a *cost*, and counts the maximal number of matching characters. This figure is based on an earlier version that was made in collaboration with Pesho Ivanov.

There are different models to specify the cost of each edit operation (Figure 3) (Spouge 1991). In particular, in a biological setting the probability of various types of mutations may not be equal, and thus, the associated costs should be different. We list some of them here, from simple to more complicated.

Hamming distance: The hamming distance (Hamming 1950) between two sequences is the number of substitutions required to transform one into the other, where insertions or deletions are not allowed. This is simple to compute in linear time.
LCS: The longest common subsequence maximizes the number of matches, or equivalently, minimizes the number of indels (insertions or deletions) while not allowing substitutions.
Unit cost edit distance / Levenshtein distance: The classic edit distance counts the minimum number of indels and/or substitutions needed, where each has a cost of 1.
Edit distance: In general, the edit distance allows for arbitrary indel and substitution costs. Matches/mismatches between characters $a_i$ and $b_j$ have cost $\delta(a_i, b_j)$. Inserting/deleting a character has cost $\delta(\varepsilon, b_j)>0$ and $\delta(a_i, \varepsilon)>0$ respectively. Usually the cost of a match is 0 or negative ($\delta(a,a) \leq 0$) and the cost of a mismatch is positive ($\delta(a,b)>0$ for $a\neq b$).
In this chapter, when we use edit distance, we usually mean the unit-cost version.
Affine cost: Insertions and deletions in DNA sequences are somewhat rare, but that once there is an indel, it is relatively common for it to be longer than a single character. This is modelled using affine costs (Smith, Waterman, and Fitch 1981; Gotoh 1982), where there is a cost $o$ to open a gap, and a cost $e$ to extend a gap, so that the cost of a gap of length $k$ is $w_k = o+k\cdot e$.
It is also possible to have different parameters $(o_{\mathrm{ins}}, e_{\mathrm{ins}})$ and $(o_{\mathrm{del}}, e_{\mathrm{del}})$ for insertions and deletions.
Dual affine: Affine costs are not sufficient to capture all biological processes: the gap cost can give a too large penalty to very long indels of length 100 to 1000. To fix this, a second gap cost can be introduced with separate parameters $(o_2, e_2)$, with for example an offset of $o=1000$ and an extend cost of $e=0.5$. The cost of a gap of length $k$ is now given by $w_k = \min(o_1 + k\cdot e_1, o_2 + k\cdot e_2)$.
More general, a piecewise linear cost can be considered as well (GOTOH 1990).
Concave: Even more general, we can give gaps of length $k$ a cost $w_k$, where $w_k$ is a concave function of $k$, where longer gaps become relatively less expensive. Double-affine costs are an example of a concave gap cost.
Arbitrary: Even more general, we can merge the concave gap costs with arbitrary substitution costs $\delta(a,b)$ for (mis)matches.

In practice, most methods for global alignment use a match cost $\delta(a,a) = 0$, fixed mismatch cost $\delta(a,b) = X>0$ for $a\neq b$, and fixed indel cost $\delta(a,\varepsilon) = \delta(\varepsilon,b) = I$.

In the chapter on semi-global alignment and mapping (blog), we discuss additional cost models used when the alignment mode is not global.

4.1 Minimizing Cost versus Maximizing Score Link to heading

So far, most of the cost models we considered are just that: cost models. They focus on minimizing the cost of the edits between two sequences, and usually assume that all costs are $\geq 0$, so that in particular matching two characters has a cost of 0.

In some settings, scores are considered instead, which are simply the negative of the cost. In this setting, matching characters usually give a positive score, so that this is explicitly rewarded. This is for example the case when finding the longest common subsequence, where each pair of matching characters gives a score of 1, and everything else has a score of 0.

Both approaches have their benefits. When using non-negative costs, all edges in the alignment graph have non-negative weights. This significantly simplifies the shortest path problem, since this is, for example, a requirement for Dijkstra’s algorithm. Scores, on the other hand, work better for overlap, extension, and local alignment: in these cases, the empty alignment is usually a solution, and thus, we must give some bonus to the matching of characters to compensate for the inevitable mismatches that will also occur. Unfortunately, this more general setting usually means that algorithms have to explore a larger part of the alignment graph. The ratio between the match bonus (score $>0$) and mismatch penalty (score $<0$) influences the trade-off between how many additional characters must be matched for each additional mismatch.

Cost-vs-score duality. For the problem of longest common subsequence there is a duality between scores and costs. When $p$ is the length of the LCS, and $s$ is the cost of aligning the two sequences via the LCS cost model where indels cost 1$ and mismatches are not allowed, we have

\begin{align} 2\cdot p + s = n+m. \end{align}

Thus, maximizing the number of matched characters is equivalent to minimizing the number of insertions and deletions.

A similar duality holds for global alignment: there is a direct correspondence between maximizing score and minimizing cost (Smith, Waterman, and Fitch 1981; Eizenga and Paten 2022): given a scoring model with fixed affine costs $\delta(a, a) = M$, $\delta(a,b) = X$, and $w_k = O + E \cdot k$, there is a cost-model (with $\delta(a,a)=0$) that yields the same optimal alignment.

5 The Classic DP Algorithms Link to heading

We are now ready to look into the first algorithms. We start with DP algorithms, that process the graph one column at a time. Note that we present all algorithms as similar as possible: they go from the top-left to the bottom-right, and always minimize the cost. We write $D(i,j)=\g(\st ij)$ for the cost to state $\st ij$. Smith et al. (Smith, Waterman, and Fitch 1981) provide a nice overview of the similarities and differences between the early approaches.

Note that for the sake of exposition, we start with the paper of Needleman and Wunsch (Needleman and Wunsch 1970), even though Vintsyuk (Vintsyuk 1968) already discovered a very similar method a few years before, although in a different domain.

Figure 4: The cubic algorithm as introduced by Needleman and Wunsch (Needleman and Wunsch 1970). Consider the highlighted cell. In the cubic algorithm, we first compute the cost between the two preceding characters, which are both B and thus create a match. Then, we consider all earlier cells in the the preceding row and column, and consider all gaps of arbitrary length (k) and cost (w_k=k). The quadratic algorithm does not support arbitrary gap costs, but relies on (w_k=k). This allows it to only consider three neighbouring states. — Figure 4: The cubic algorithm as introduced by Needleman and Wunsch (Needleman and Wunsch 1970). Consider the highlighted cell. In the cubic algorithm, we first compute the cost between the two preceding characters, which are both `B` and thus create a match. Then, we consider all earlier cells in the the preceding row and column, and consider all gaps of arbitrary length (k) and cost (w_k=k). The quadratic algorithm does not support arbitrary gap costs, but relies on (w_k=k). This allows it to only consider three neighbouring states.

Needleman-Wunsch’ cubic algorithm. The problem of pairwise alignment of biological sequences was first formalized by Needleman and Wunsch (Needleman and Wunsch 1970). They provide a cubic recurrence that assumes a (mis)match between $a_{i-1}$ and $b_{j-1}$ of cost $\delta(a_{i-1},b_{j-1})$ and an arbitrary gap cost $w_k$. The recursion uses that before matching $a_{i-1}$ and $b_{j-1}$, either $a_{i-2}$ and $b_{j-2}$ are matched to each other, or one of them is matched to some other character:

\begin{align*} D(0,0) &= D(i,0) = D(0,j) := 0\\ D(i,j) &:= \delta(a_{i{-}1}, b_{j{-}1})&& \text{cost of match}\\ &\phantom{:=} + \min\big( \min_{0\leq i’ < i} D(i’, j{-}1) + w_{i{-}i’{-}1},&&\text{cost of matching $a_{i’-1}$ against $b_{j-2}$ next}\\ &\phantom{:=+\min\big(} \min_{0\leq j’<j} D(i{-}1, j’)+w_{j{-}j’{-}1}\big).&&\text{cost of matching $a_{i-2}$ against $b_{j’-1}$ next} \end{align*}

The value of $D(n,m)$ is the final cost of the alignment.

The total runtime is $O(nm \cdot (n+m)) = O(n^2m)$ since each of the $n\cdot m$ cells requires $O(n+m)$ work.

A quadratic DP. The cubic DP was improved into a quadratic DP by Sellers (Sellers 1974) and Wagner and Fisher (Wagner and Fischer 1974). The improvement comes from dropping the arbitrary gap cost $w_k$, so that instead of trying all $O(n+m)$ indels in each position, only one insertion and one deletion is tried:

\begin{align*} D(0,0) &:= 0\\ D(i, 0) &:= D(i-1,0)+ \delta(a_i, \varepsilon) \\ D(0, j) &:= D(0,j-0)+ \delta(\varepsilon, b_j) \\ D(i, j) &:= \min\big(D(i{-}1,j{-}1) + \delta(a_i, b_j), &&\text{(mis)match}\\ &\phantom{:=\min\big(}\, D(i{-}1,j) + \delta(a_i, \varepsilon), && \text{deletion}\\ &\phantom{:=\min\big(}\, D(i,j{-}1) + \delta(\varepsilon, b_j)\big). && \text{insertion}. \end{align*}

This algorithm takes $O(nm)$ time since it now does constant work per DP cell.

This quadratic DP is now called the Needleman-Wunsch (NW) algorithm (Figure 5a, Figure 6a). Gotoh (Gotoh 1982) refers to it as Needleman-Wunsch-Sellers’ algorithm, to highlight the speedup that Sellers contributed (Sellers 1974). Apparently Gotoh was not aware of the identical formulation of Wagner and Fischer (Wagner and Fischer 1974).

Vintsyuk published a quadratic algorithm already before Needleman and Wunsch (Vintsyuk 1968), but in the context of speech recognition. Instead of a cost of matching characters, there is some cost $\delta(i,j)$ associated to matching two states, and it does not allow deletions:

\begin{align*} D(i, j) &:= \min\big(D(i{-}1,j{-}1), D(i{-}1, j)\big) + \delta(i,j). \end{align*}

Sankoff also gives a quadratic recursion (Sankoff 1972), similar to the one by Sellers (Sellers 1974), but specifically for LCS. This leads to the recursion

\begin{align*} S(i, j) &:= \max\big(S(i{-}1,j{-}1) + \delta(a_i, b_j),\, S(i{-}1, j), S(i, j{-}1)\big), \end{align*}

where we use $S$ to indicate that the score is maximized.

Local alignment. Smith and Waterman (Smith and Waterman 1981) introduce a DP for local alignment. The structure of their algorithm is similar to the cubic DP of Needleman and Wunsch and allows for arbitrary gap costs $w_k$. While introduced as a maximization of score, here we present it as minimizing cost (with $\delta(a,a)<0$) for consistency. The new addition is a $\min(0, \dots)$ term, that can reset the alignment whenever the cost goes above 0. The best local alignment ends in the smallest value of $D(i,j)$ in the table.

\begin{align*} D(0, 0) &= D(i, 0) = D(0, j) := 0 \\ D(i,j) &= \min\big(0, &&\text{start a new local alignment}\\ &\phantom{=\min\big(} D(i-1, j-1) + \delta(a_{i{-}1}, b_{j{-}1}), &&\text{(mis)match}\\ &\phantom{=\min\big(} \min_{0\leq i’ < i} D(i’, j) - w_{i{-}i’}, &&\text{deletion}\\ &\phantom{=\min\big(} \min_{0\leq j’<j} D(i, j’)-w_{j{-}j’}\big).&&\text{insertion} \end{align*}

This algorithm uses arbitrary gap costs $w_k$, as first mentioned by Needleman and Wunsch (Needleman and Wunsch 1970) and formally introduced by Waterman (Waterman, Smith, and Beyer 1976). Because of this, it runs in $O(n^2m)$.

The quadratic algorithm for local alignment is now usually referred to as the Smith-Waterman-Gotoh (SWG) algorithm, since the ideas introduced by Gotoh (Gotoh 1982) can be used to reduce the runtime from cubic by assuming affine costs, just like to how Sellers (Sellers 1974) sped up the Needleman-Wunsch algorithm (Needleman and Wunsch 1970) for global alignment costs by assuming linear gap costs. Note though that Gotoh only mentions this speedup in passing, and that Smith and Waterman (Smith and Waterman 1981) could have directly based their idea on the quadratic algorithm of Sellers (Sellers 1974) instead.

Affine costs. To my knowledge, the first mention of affine costs of the form $o+k\cdot e$ is by Smith, Waterman, and Fitch (Smith, Waterman, and Fitch 1981). Gotoh (Gotoh 1982) generalized the quadratic recursion to these affine costs, to circumvent the cubic runtime needed for the arbitrary gap costs $w_k$ of Waterman (Waterman, Smith, and Beyer 1976). This is done by introducing two additional matrices $P(i,j)$ and $Q(i,j)$ that contain the minimal cost to get to $(i,j)$ where the last step is required to be an insertion or deletion respectively:

\begin{align*} D(i, 0) &= P(i, 0) = I(i, 0) := 0 \\ D(0, j) &= P(0, j) = I(0, j) := 0 \\ P(i, j) &:= \min\big(D(i-1, j) + o+e, &&\text{new gap}\\ &\phantom{:= \min\big(}\ P(i-1, j) + e\big)&&\text{extend gap}\\ Q(i, j) &:= \min\big(D(i, j-1) + o+e, &&\text{new gap}\\ &\phantom{:= \min\big(}\ Q(i, j-1) + e\big)&&\text{extend gap}\\ D(i, j) &:= \min\big(D(i-1, j-1) + \delta(a_{i-1}, b_{j-1}),&&\text{(mis)match}\\ &\phantom{:= \min\big(}\ P(i, j),&&\text{close gap}\\ &\phantom{:= \min\big(}\ Q(i, j)\big)&&\text{close gap} \end{align*}

This algorithm run in $O(nm)$ time.

Gotoh also mentions that this method can be modified to solve the local alignment of Smith and Waterman (Smith and Waterman 1981) in quadratic time. Later, Gotoh further extended the method to support double affine costs and more general piecewise linear gap costs (GOTOH 1990).

Traceback. To compute the final alignment, we can follow the trace of the DP matrix: starting at the end $\st nm$, we can repeatedly determine which of the preceding DP-states was optimal as predecessor and store these states. This takes linear time, but requires quadratic memory since all states could be on the optimal path, and thus we need to keep the entire matrix in memory. Gotoh notes (Gotoh 1982) that if only the final score is required, only the last two columns of the DP matrix $D$ (and $P$ and $Q$) are needed at any time, so that linear memory suffices.

6 Linear Memory using Divide and Conquer Link to heading

Hirschberg (Hirschberg 1975) introduces a divide-and-conquer algorithm to compute the LCS of two sequences in linear space. Instead of computing the full alignment from $\st 00$ to $\st nm$, we first fix a column halfway, $i^\star = \lfloor n/2\rfloor$. This splits the problem into two halves: we compute the forward DP matrix $D(i, j)$ for all $i\leq i^\star$, and introduce a backward DP $D’(i, j)$ that is computed for all $i\geq i^\star$. Here, $D’(i,j)$ is the minimal cost for aligning suffixes of length $n-i$ and $m-j$ of $A$ and $B$. It is shown that there must exist a $j^\star$ such that $D(i^\star, j^\star) + D’(i^\star, j^\star) = D(n, m)$, and we can find this $j^\star$ as the $j$ that minimizes $D(i^\star, j) + D’(i^\star, j)$.

At this point, we know that the point $(i^\star, j^\star)$ is part of an optimal alignment. The two resulting subproblems of aligning $A[0, i^\star]$ to $B[0, j^\star]$ and $A[i^\star, n]$ to $B[j^\star, m]$ can now be solved recursively using the same technique, where again we find the midpoint of the alignment. This recursive process is shown in Figure 5e and Figure 6e. The recursion stops as soon as the alignment problem becomes trivial, or, in practice, small enough to solve with the usual quadratic-memory approach.

Space complexity. The benefit of this method is that it only uses linear memory: each forward or reverse DP is only needed to compute the scores in the final column, and thus can be done in linear memory. After the midpoint $\st {i^\star}{j^\star}$ is found, the results of the left and right subproblem can be discarded before recursing. Additionally, the space for the solution itself is linear.

Time complexity. We analyse the time complexity following (Myers and Miller 1988). The first step takes $2\cdot O((n/2)m) = O(nm)$ time. We are then left with two subproblems of size $i^\star \cdot j^\star$ and $(n-i^\star)(m-j^\star)$. Since $i^\star = n/2$, their total size is $n/2 \cdot j^\star + n/2 \cdot (m-j^\star) = nm/2$. Thus, the total time in the first layer of the recursion is $nm/2$. Extending this, we see that the total number of states halves with each level of the recursion. Thus, the total time is bounded by

\begin{equation*} mn + \frac 12 \cdot mn + \frac 14 \cdot mn + \frac 18\cdot mn + \dots \leq 2\cdot mn = O(mn). \end{equation*}

Indeed, in practice this algorithm indeed takes around twice as long to find an alignment as the non-recursive algorithm takes to find just the score.

Applications. Hirschberg introduced this algorithm for computing the longest common subsequence (Hirschberg 1975). It was then applied multiple times to reduce the space complexity of other variants as well: Myers first applied it to the $O(ns)$ LCS algorithm (Myers 1986), and also improved the $O(nm)$ algorithm by Gotoh (Gotoh 1982) to linear memory (Myers and Miller 1988). Similarly, BiWFA (Marco-Sola et al. 2023) improves the space complexity of WFA from $O(n+s^2)$ to $O(s)$ working memory, where $s$ is the cost of the alignment.

7 Dijkstra’s Algorithm and A* Link to heading

Dijkstra’s algorithm. Both Ukkonen (Ukkonen 1985) and Myers (Myers 1986) remarked that pairwise alignment can be solved using Dijkstra’s algorithm (Dijkstra 1959) (Figure 5b, Figure 6b), which visits states by increasing distance. Ukkonen gave a bound of $O(nm \log (nm))$, whereas Myers’ analysis uses the fact that there are only $O(ns)$ at distance $\leq s$ (see 8) and that a discrete priority queue that avoids the $\log$ is sufficient, and thus concludes that the algorithms runs in $O(ns)$.

However, Myers (Myers 1986, 2) observes that

the resulting algorithm involves a relatively complex discrete priority queue and this queue may contain as many as $O(ns)$ entries even in the case where just the length of the […] shortest edit script is being computed.

And indeed, I am not aware of any tool that practically implemented Dijkstra’s algorithm to compute the edit distance.

A* and the gap cost heuristic. Hadlock realized (Hadlock 1988) that Dijkstra’s algorithm can be improved upon by using A* (Hart, Nilsson, and Raphael 1968; Hart, Nilsson, and Raphael 1972; Pearl 1984), a more informed algorithm that uses a heuristic function $h$ that gives a lower bound on the remaining edit distance between two suffixes. He proposes two heuristics, one based on character frequencies, and also the widely used gap cost heuristic (Ukkonen 1985; Hadlock 1988; Spouge 1989, 1991; Myers and Miller 1995). This uses the difference in length between two sequences as a lower bound on their edit distance (Figure 5f, Figure 6f): \[ \cgap(\st ij, \st{i’}{j’}) = |(i-i’) - (j-j’)|. \] We specifically highlight the papers by Wu et al. (Wu et al. 1990) and Papamichail and Papamichail (Papamichail and Papamichail 2009), where the authors’ method exactly matches the A* algorithm with the gap-heuristic, in combination with diagonal transition (Section 9, Figure 5g, Figure 6g).

Seed heuristic. Much more recently, A*ix (Ivanov et al. 2020; Ivanov, Bichsel, and Vechev 2022) introduced the much stronger seed heuristic for the problem of sequence-to-graph alignment. This heuristic splits the sequence $A$ into disjoint k-mers, and uses that at least one edit is needed for each remaining k-mer that is not present in sequence $B$.

In A*PA (Groot Koerkamp and Ivanov 2024) we will improve this into the gap-chaining seed heuristic and add pruning, which results in near-linear alignment when the divergence is sufficient low.

Notation. To prepare for the theory on A*PA, we now introduce some formal terminology and notation for Dijkstra’s algorithm and A*. Dijkstra’s algorithm finds a shortest path from $v_s=\st 00$ to $v_t=\st nm$ by expanding (generating all successors) vertices in order of increasing distance $\g(u)$ from the start. This next vertex to be expanded is chosen from a set of open vertices. The A* algorithm, instead, directs the search towards a target by expanding vertices in order of increasing ${f(u) := g(u) + h(u)}$, where $h(u)$ is a heuristic function that estimates the distance $\h(u)$ to the end and $g(u)\geq \g(u)$ is the shortest length of a path from $v_s$ to $u$ found so far. We say that $u$ is fixed when the distance to $u$ has been found, i.e., $g(u) = \g(u)$. A heuristic is admissible if it is a lower bound on the remaining distance ($h(u) \leq \h(u)$), which guarantees that A* has found a shortest path as soon as it expands $v_t$. A heuristic $h_1$ dominates (is more accurate than) another heuristic $h_2$ when $h_1(u) \ge h_2(u)$ for all vertices $u$. A dominant heuristic will usually (but not always (Holte 2010)) expand less vertices. Note that Dijkstra’s algorithm is equivalent to A* using a heuristic that is always 0, and that both algorithms require non-negative edge costs.

We end our discussion of graph algorithms with the following observation, as Spouge states (Spouge 1991, 3),

algorithms exploiting the lattice structure of an alignment graph are usually faster,

and further (Spouge 1989, 4):

This suggests a radical approach to A* search complexities: dispense with the lists [of open states] if there is a natural order for vertex expansion.

In A*PA2 (Groot Koerkamp 2024), we follow this advice and replace the plain A* search in A*PA with a much more efficient approach based on computational volumes that merges DP and A*.

Figure 5: Schematic overview of global pairwise alignment methods. The shaded areas indicate states computed by each method, and darker shades indicate states that are computed multiple times. In practice, diagonal transition only computes a very sparse set of states, as indicated by lines rather than an area.

Figure 6: A detailed example of each method shown in Figure 5. Shaded areas indicate computed values, and darker shades states are computed more than once. The yellow path indicates the optimal alignment. For diagonal transition (DT), the wavefronts are indicates by black lines, and this grey lines indicate a best path to each state. The top right (j) shows contours for the longest common subsequence.

8 Computational Volumes and Band Doubling Link to heading

So far, Dijkstra’s algorithm is the only method we’ve seen that is faster than $\Theta(nm)$. But as remarked by Spouge, unfortunately it tends to be slow. To our knowledge, Wilbur and Lipman (Wilbur and Lipman 1983; Wilbur and Lipman 1984) are the first to give a sub-quadratic algorithm, by only considering states near diagonals with many k-mer matches, not unlike the approach taken by modern seed-and-extend algorithms. However, this method is not exact, i.e., it could return a suboptimal alignment. Nevertheless, they raise the question whether the alignments found by their method are closer to biological truth than the true minimal cost alignments found by exact algorithms.

Reordering the matrix computation. The main reason the methods so far are quadratic is that they compute the entire $n\times m$ matrix. But, especially when the two sequences are similar, the optimal alignment is likely to be close to the main diagonal. Thus, Fickett (Fickett 1984) proposes to compute the entries of the DP matrix in a new order: Instead of column by column, we can first compute all entries at distance up to some threshold $t$, and if this does not yet result in a path to the end ($\st nm$), we can expand this computed area to a larger area with distance up to $t’>t$, and so on, until we try a $t\geq s$. In fact, when $t$ is increased by 1 at a time this is equivalent to Dijkstra’s algorithm (Figure 5b, Figure 6b).

Vertices at distance $\leq t$ can never be more than $t$ diagonals away from the main diagonal, and hence, computing them can be done in $O(nt)$ time. This can be much faster than $O(nm)$ when $s$ and $t$ are both small, and works especially well when $t$ is not too much larger than $s$. For example, $t$ can be set as a known upper bound for the data being aligned, or as the length of some known suboptimal alignment.

Gap heuristic. In parallel, Ukkonen introduced a very similar idea (Ukkonen 1985), statically bounding the computation to only those states that can be contained in a path of length at most $t$ from the start to the end of the graph. In particular, it uses the gap heuristic, so that the minimal cost of an alignment containing $\st ij$ is \[ f(\st ij) := \cgap(\st 00, \st ij) + \cgap(\st ij, \st nm) = |i-j| + |(n-i) - (m-j)|. \] Ukkonen’s algorithm then only considers those states for which $f(\st ij) \leq t$ (Figure 5h, Figure 6h). Thus, instead that the actual distance to a state is at most $t$ ($\g(\st ij) \leq t$), it requires that the best possible cost of a path containing $\st ij$ is sufficiently low.

Band doubling. Ukkonen also introduces band doubling (Ukkonen 1985). The method of Fickett computes all states with distance up to some threshold $t$. The idea of band doubling is that if it turns out that $t=t_0<s$, then it can be doubled to $t_1 = 2t_0$, $t_2=4t_0$, and so on, until a $t_k=2^k\geq s$ is found. As we already saw, testing $t$ takes $O(nt)$ time. Now suppose we test $t_0=1$, $t_1=2$, $\dots$, $t_{k-1}=2^{k-1}<s$, up to $t_k=2^k \geq s$. Then the total cost of this is \[ t_0n + t_1n + \dots + t_k n = 1\cdot n + 2\cdot n + \dots + 2^k \cdot n < 2^{k+1}\cdot n = 4\cdot 2^{k-1}\cdot n < 4sn. \] Thus, band doubling finds an optimal alignment in $O(ns)$ time. Note that computed values are not reused between iterations, so that each state is computed twice on average.

Two tools implementing this band doubling are Edlib and KSW2.

Computational volumes. Spouge unifies the methods of Fickett and Ukkonen in computational volumes (Spouge 1989, 1991), which are subgraphs of the full edit graph that are guaranteed to contain all shortest paths. Thus, to find an alignment, it is sufficient to only consider the states in such a computational volume. Given a bound $t\geq s$, some examples of computational volumes are:

$\{u\}$, the entire $(n+1)\times (m+1)$ graph (Needleman and Wunsch 1970) (Figure 5a, Figure 6a).
$\{u: \g(u)\leq t\}$, the states at distance $\leq t$, introduced by Fickett (Fickett 1984) and similar to Dijkstra’s algorithm (Dijkstra 1959) (Figure 5b, Figure 6b).
$\{u: \cgap(v_s, u) + \cgap(u, v_t) \leq t\}$ the static set of states possibly on a path of cost $\leq t$ (Ukkonen 1985) (Figure 5h, Figure 6h).
$\{u: \g(u) + \cgap(u, v_t) \leq t\}$, as used by Edlib (Šošić and Šikić 2017; Spouge 1991; Papamichail and Papamichail 2009) (Figure 5i, Figure 6i).
$\{u: \g(u) + h(u) \leq t\}$, for any admissible heuristic $h$, which we will use in A*PA2 and is similar to A*.

9 Diagonal Transition Link to heading

Around 1985, the diagonal transition (Figure 5c, Figure 6c) algorithm was independently discovered by Ukkonen (Ukkonen 1983, 1985) (for edit distance) and Myers (Myers 1986) (for LCS). It hinges on the observation that along diagonals of the edit graph (or DP matrix), the value of $\g(\st ij) = D(i,j)$ never decreases (Lemma Ukkonen 1985, 3), as can be seen in Figure 1.

We already observed before that when the edit distance is $s$, only the $s$ diagonals above and below the main diagonal are needed, and on these diagonals, we only are interested in the values up to $s$. Thus, on each diagonal, there are at most $s$ transition from a distance $g\leq s$ to distance $g+1$. We call the farthest state along a diagonal with a given distance a farthest reaching state. Specifically, given a diagonal $-s\leq k\leq s$, we consider the farthest $u=\st ij$ on this diagonal (i.e., with $i-j=k$) at distance $g$ ($\g(u) \leq g$). Then we write $F_{gk}:=i+j$ to indicate the antidiagonal of this farthest reaching state. (Note that more commonly (Ukkonen 1985; Marco-Sola et al. 2020), just the column $i$ is used to indicate how far along diagonal $k=i-j$ the farthest reaching state can be found. Using $i+j$ leads to more symmetric formulas.) In order to write the recursive formula on the $F_{gk}$, we need a helper function: $\lcp(i, j)$ returns the length of the longest common prefix between $A_{\geq i}$ and $B_{\geq j}$, which indicates how far we can walk along the diagonal for free starting at $u=\st ij$. We call this extending from $u$. The recursion then starts with $F_{00} = \lcp(0,0)$ as the farthest state along the main diagonal with cost 0. A wavefront is the set of farthest reaching states for a given distance, as shown by black lines in Figure 6c. To compute wavefront $F_{g, \bullet}$ in terms of $F_{g-1, \bullet}$, we first find the farthest state at distance $g$ on diagonal $k$ that is adjacent to a state at distance $g-1$: \[ X_{gk} := \max(F_{g-1,k-1}+1, F_{g-1,k}+2, F_{g-1,k+1}+1). \] From this state, with coordinates $i(X_{gk}) = (X_{gk}+k)/2$ and $j(X_{gk})=(X_{gk}-k)/2$, we can possibly walk further along the diagonal for free to obtain the farthest reaching point: \[ F_{gk} = X_{gk} + \lcp(i(X_{gk}), j(X_{gk})). \] The edit distance between two sequences is then the smallest $g$ such that $F_{g, n-m} \geq n+m$.

Time complexity. The total number of farthest reaching states is $O(s^2)$, since there are $2s+1$ diagonal within distance $s$, and each has at most $s+1$ farthest reaching states. The total time spent on $\lcp$ is at most $O(ns)$, since on each of the $2s+1$ diagonals, the $\lcp$ calls cover at most $n$ characters in total. Thus, the worst case of this method is $O(ns)$. Nevertheless, Ukkonen observes (Ukkonen 1985) that in practice the total time needed for $\lcp$ can be small, and Myers proves (Myers 1986) that the LCS-version of the algorithm does run in expected $O(n+s^2)$ when we assume that the input is a random pair of sequences with distance $s$.

Myers also notes that the $\lcp$ can be computed in $O(1)$ by first building (in $O(n+m)$ time) a suffix tree on the input strings and then using an auxiliary data structure to answer lowest-common-ancestor queries, leading to a worst-case $O(n+s^2)$ algorithm. However, this does not perform well in practice.

We remark here that when the divergence $d=s/n$ is fixed at, say, $1\%$, $s^2$ still grows quadratically in $n$, and thus, in practice still method still becomes slow when the inputs become too long.

Space complexity. A naive implementation of the method requires $O(s^2)$ memory to store all values of $F_{gk}$ (on top of the $O(n+m)$ input sequences). If only the distance is needed, only the last front has to be stored and $O(s)$ additional memory suffices. To reduce the $O(s^2)$ memory, Hirschberg’s divide-and-conquer technique can also be applied here (Myers 1986): we can run two instances of the search in parallel, from the start and end of the alignment graph, until they meet. Then, this meeting point must be on the optimal alignment, and we can recurse into the two sub-problems. These now have distance $s/2$, so that overall, the cost is \[ 2\cdot (s/2)^2 + 4\cdot (s/4)^2 + 8\cdot(s/8)^2 \dots = s^2/2+s^2/4+s^2/8+\dots < s^2. \]

Applications. Wu et al. (Wu et al. 1990) and Papamichail and Papamichail (Papamichail and Papamichail 2009) apply diagonal transition to align sequences of different lengths, by incorporating the gap-heuristic (Figure 5g, Figure 6g). Diagonal transition has also been extended to linear and affine costs in the wavefront alignment algorithm (WFA) (Marco-Sola et al. 2020) in a way similar to (Gotoh 1982), by introducing multiple layers to the graph. Similar to Myers (Myers 1986), BiWFA (Marco-Sola et al. 2023) applies Hirscherg’s divide-and-conquer approach (Hirschberg 1975) to obtain $O(s)$ memory usage (on top of the $O(n+m)$ input).

10 Parallelism Link to heading

So far we have mostly focused on the theoretical time complexity of methods. However, since the introduction of $O(n+s^2)$ diagonal transition around 1985, no further significant breakthroughs in theoretical complexity have been found. Thus, since then, the focus has shifted away from reducing the number of computed states and towards computing states faster through more efficient implementations and more modern hardware. Most of the developments in this area were first developed for either semi-global or local alignment, but they just as much apply to global alignment.

As Spouge notes (Spouge 1989) in the context of computational volumes:

The order of computation (row major, column major or antidiagonal) is just a minor detail in most algorithms.

But this decision exactly at the core of most efficient implementations.

SWAR. The first technique in this direction is microparallelism (Alpern, Carter, and Gatlin 1995), nowadays also called SWAR (SIMD within a register), where each (64-bit) computer word is divided into multiple (e.g. 16-bit) parts, and word-size instructions modify all (4) parts in parallel. This can then applied with inter-sequence parallelism to search a given query in multiple reference sequences in parallel (Alpern, Carter, and Gatlin 1995; Baeza-Yates and Gonnet 1992; Wu and Manber 1992; Hyyrö, Fredriksson, and Navarro 2005; Rognes 2011).

Anti-diagonals. Hughey (Hughey 1996) notes that values along anti-diagonals of the DP matrix are not dependent on each other, and thus can be computed in parallel. Wozniak (Wozniak 1997) applied SIMD (single instruction, multiple data) instructions for this purpose, which are special CPU instructions that operate on multiple (currently up to 512 bits, for AVX-512) computer words at a time.

Vertical packing. Rognes and Seeberg (Rognes and Seeberg 2000, 702) also use microparallelism, but use vertical instead of anti-diagonal vectors:

The advantage of this approach is the much-simplified and faster loading of the vector of substitution scores from memory. The disadvantage is that data dependencies within the vector must be handled.

Indeed, when using vertical vectors a sequence profile (see below) can be used to quickly determine the (mis)match score of each of the character in the vector. However, the DP cells now depend on each other, and it may be necessarily to (slowly) iterate through the values in the vector to handle insertions corresponding to vertical edges in the edit graph.

Striped SIMD. To work around the dependencies between adjacent states in each vector, Farrar (Farrar 2006) introduces an alternative striped SIMD scheme where lanes are interleaved with each other. Thus, the query is split into, say, 8 segments that are aligned in parallel (each in one lane of the SIMD vector). In this case, there are still dependencies between adjacent segments, and these are resolved using a separate while loop, for as long as needed. This is used by for example BSAlign (Shao and Ruan 2024).

Bitpacking. An observation that we have not used so far is that for unit cost edit distance specifically, it can be shown that the difference between distances to adjacent states is always in $\{-1, 0, +1\}$. Myers (Myers 1999) uses this fact to encode $w=64$ adjacent differences into two $w$-bit words: one word in which bit $j$ indicates that the $j$’th difference is $+1$, and one word in which bit $j$ indicates that the $j$’th difference is $-1$. If we additionally know the difference along the top edge, Myers’ method can efficiently compute the output differences of a $1\times w$ rectangle in only 20 instructions.

We call each consecutive non-overlapping chunk of 64 rows a lane, so that there are $\lceil m/64\rceil$ lanes, where the last lane may be padded. As presented originally, for text searching, this method only uses 17 instructions, but some additional instructions are needed to carry the horizontal difference to the next lane when $m>w$.

Currently, Edlib (Šošić and Šikić 2017) is the most popular tool that implements bitpacking, alongside band doubling and divide-and-conquer, so that it has a complexity of $O(ns/w)$.

The supplement of BitPAl (Loving, Hernandez, and Benson 2014; Benson, Hernandez, and Loving 2013) introduces an alternative scheme for edit distance based on a different encoding of the $\{-1,0,+1\}$ values, that also ends up using 20 instructions. We show implementations of both Myers’ and BitPAl’s method in Code Snippet 1 and Code Snippet 2.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
pub fn compute_block_simd_myers(
    // 0 or 1. Indicates -1 difference on top.
    hp0: &mut Simd<u64, 4>,
    // 0 or 1. Indicates -1 difference on top.
    hm0: &mut Simd<u64, 4>,
    // 64-bit indicator of +1 differences on left.
    vp: &mut Simd<u64, 4>,
    // 64-bit indicator of -1 differences on left.
    vm: &mut Simd<u64, 4>,
    // 64-bit indicator of chars equal to top char.
    eq: Simd<u64, 4>,
) {
    let vx = eq | *vm;
    let eq = eq | *hm0;
    // The addition carries information between rows.
    let hx = (((eq & *vp) + *vp) ^ *vp) | eq;
    let hp = *vm | !(hx | *vp);
    let hm = *vp & hx;
    // Extract the high bit as bottom difference.
    let right_shift = Simd::<u64, 4>::splat(63);
    let hpw = hp >> right_shift;
    let hmw = hm >> right_shift;
    // Insert the top horizontal difference.
    let left_shift = Simd::<u64, 4>::splat(1);
    let hp = (hp << left_shift) | *hp0;
    let hm = (hm << left_shift) | *hm0;
    // Update the input-output parameters.
    *hp0 = hpw;
    *hm0 = hmw;
    *vp = hm | !(vx | hp);
    *vm = hp & vx;
}

Code Snippet 1: Rust code for SIMD version of Myers' bitpacking algorithm, taking 20 instructions.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
pub fn compute_block_simd_bitpal(
    // 0 or 1. Indicates 0 difference on top.
    hz0: &mut Simd<u64, 4>,
    // 0 or 1. Indicates -1 difference on top.
    hp0: &mut Simd<u64, 4>,
    // 64-bit indicator of -1 differences on left.
    vm: &mut Simd<u64, 4>,
    // 64-bit indicator of -1 and 0 differences on left.
    vmz: &mut Simd<u64, 4>,
    // 64-bit indicator of chars equal to top char.
    eq: Simd<u64, 4>,
) {
    let eq = eq | *vm;
    let ris = !eq;
    let notmi = ris | *vmz;
    let carry = *hp0 | *hz0;
    // The addition carries info between rows.
    let masksum = (notmi + *vmz + carry) & ris;
    let hz = masksum ^ notmi ^ *vm;
    let hp = *vm | (masksum & *vmz);
    // Extract the high bit as bottom difference.
    let right_shift = Simd::<u64, 4>::splat(63);
    let hzw = hz >> right_shift;
    let hpw = hp >> right_shift;
    // Insert the top horizontal difference.
    let left_shift = Simd::<u64, 4>::splat(1);
    let hz = (hz << left_shift) | *hz0;
    let hp = (hp << left_shift) | *hp0;
    // Update the input-output parameters.
    *hz0 = hzw;
    *hp0 = hpw;
    *vm = eq & hp;
    *vmz = hp | (eq & hz);
}

Code Snippet 2: Rust code for SIMD version of Bitpal's bitpacking algorithm, taking 20 instructions.

Profile. The DP recurrence relation depends on the sequences $A$ and $B$ via $\delta(a_i,b_j)$, which is 1 when $a_i\neq b_j$. When using a vertorized method, we would like to query this information efficiently for multiple pairs $(i, j)$ at once. When using vertical vectors, this can be done efficiently using a profile (Rognes and Seeberg 2000). For Myers’ bitpacking, this looks as follows. For each character $c$ in the alphabet, the bitvector $Eq[c]$ stores for each character $b_j$ of $B$ whether it equals $c$ as a single bit. Then, when the lane for rows $j=0$ to $j=64$ is processed in column $i$, we can simply read the indicator word corresponding to these lanes from the bitvector for $c=a_i$ ($Eq[a_i]$) and directly use it in the bitwise algorithm.

For SIMD and SWAR methods that use packed integer values (rather than single bits), the same can be done, where we can simply write the values of all $\delta(a_i, b_j)$.

Difference recurrence relations. For more general cost models, such as affine costs, direct bitpacking does not work, since differences between adjacent states can be larger than 1. Still, it is beneficial to consider differences between adjacent states rather than absolute distances: these are typically smaller, so that they require fewer bits to store and more of them can be processed in parallel (Suzuki and Kasahara 2018). Suzuki and Kasahara introduce this technique for affine-cost alignment, and this has subsequently been used by KSW2 and BSAlign (Shao and Ruan 2024).

Blocks. A separate improvement is made by Block aligner (Liu and Steinegger 2023), an approximate aligner that can also handle position-specific scoring matrices. Its main novelty is to divide the computation into large blocks. This results in highly predictable code, and benefits the execution speed, even though some more (redundant) states may be computed.

11 LCS and Contours Link to heading

So far, all pairwise alignment methods we looked at are based on the alignment graph. The longest common subsequence problem also admits different solutions. See e.g. (Bergroth, Hakonen, and Raita, n.d.) for a survey.

Sparse dynamic programming. Instead of finding a minimal-cost path through a graph, we can search for the longest chain of matches (Hirschberg 1975, 1977; Hunt and Szymanski 1977). Suppose there are $r$ matches in total, where each match is a pair $(i,j)$ such that $a_i=b_j$. We can then process these matches from left to right (by increasing $i$ and $j$), and for each of them determine the longest chain of matches ending in them (Figure 6j). By extension, we determine for each state $\st ij$ the length $S(\st ij)$ of the LCS of $A_{<i}$ and $B_{<j}$. Such methods that only consider a subset of vertices of the graph or DP-matrix are using sparse dynamic programming, and are reviewed and extended in (Eppstein et al. 1992a, 1992b).

Note that $S$ can never decrease as we move right or down the matrix, and this allows to efficiently store the values of a column via a list of thresholds of rows where the LCS jumps from $g$ to $g+1$. Then, the value of a cell can be found using binary search, so that the overall algorithm runs in $O((r + n) \lg n)$. While this is slow in general, when there are only few ($r\approx n$) matches, as may be the case when comparing lines of code, this algorithm is much faster than previous quadratic methods.

Contours. The regions of equal $S$ create a set of contours (Figure 6j), where contour $\ell$ is the boundary between the regions with $S(u)\geq \ell$ and $S(u) < \ell$. Each contour is determined by a set of dominant matches $a_i=b_j$ for which $S(i+1,j+1)$ is larger than both $S(i, j+1)$ and $S(i+1,j)$.

LCSk. We also mention here the LCSk variant, where the task is to maximize the number of length-$k$ matches between the two input strings. This was first introduced around 1982 by Wilbur and Lipman (Wilbur and Lipman 1983; Wilbur and Lipman 1984), and rediscovered in 2014 (Benson, Levy, and Shalom 2014; Pavetić, Žužić, and Šikić 2014; Pavetić et al. 2017; Deorowicz and Grabowski 2013). When choosing $k$ sufficiently larger than $\log_\sigma n$, this has the benefit that the number of $k$-mer matches between the two strings is typically much smaller than $n^2$, so that the $O((r+n)\lg n)$ runtime becomes feasible. The drawback, however, is that this not provide an exact solution to the original LCS problem.

Chaining k-mers. A solution to the LCSk problem consist of a sequence of matching $k$-mers. Together, these form a chain, which is formally defined as a sequence of vertices $u_1$, $\dots$, $u_n$ in a partially ordered set (whose transitive close is a directed acyclic graph), such that $u_1\leq u_2\leq \dots \leq u_n$. Then, LCSk is equivalent to finding the longest chain in the poset of k-mer matches. In this formulation, a score (the length) is maximized. Myers and Miller (Myers and Miller 1995) instead consider a version where the cost of a chain is minimized, by using the gap cost over the gap between consecutive k-mer matches in the chain.

12 Some Tools Link to heading

There are many aligners that implement $O(nm)$ (semi)-global alignment using numerous of the aforementioned implementation techniques, such as SeqAn (Döring et al. 2008), Parasail (Daily 2016), SWIPE (Rognes 2011), Opal (Šošić 2015), libssa (Frielingsdorf 2015), SWPS3 (Szalkowski et al. 2008), and SSW library (Zhao et al. 2013).

Dedicated global alignment implementations implementing band-doubling are much rarer, and we list some recent ones here. For more, we refer to the survey of Alser et al. (Alser et al. 2021).

KSW2 (Li 2016) implements banded alignment using the difference recurrence (Suzuki and Kasahara 2018) with SIMD, and supports (double) affine costs.

Edlib (Šošić and Šikić 2017) implements band doubling (Ukkonen 1985) using the $\g(u) + \cgap(u, v_t)\leq t$ computational volume, similar to A* with the gap-heuristic. It uses Myers’ bitpacking (Myers 1999). For traceback, it uses Hirschberg’s divide-and-conquer approach (Hirschberg 1975): once the distance is found, the alignment is started over from both sides towards the middle column, where a state on the shortest path is determined. This is recursively applied to the left and right halves until the sequences are short enough that quadratic memory can be used.

WFA (Marco-Sola et al. 2020) builds on the $O(n+s^2)$ diagonal transition method (Ukkonen 1985; Myers 1986), and extends it to affine costs using a method similar to (Gotoh 1982).

BiWFA (Marco-Sola et al. 2023) is a later version that applies divide-and-conquer (Hirschberg 1975) to reduce to linear memory usage.

13 Subquadratic Methods and Lower Bounds Link to heading

We end this chapter with a discussion of some more theoretical methods that have a worst case that is slightly better than quadratic.

Lower bounds. Backurs and Indyk (Backurs and Indyk 2018) have shown that unit cost edit distance can not be solved in time $O(n^{2-\delta})$ for any $\delta > 0$, on the condition that the Strong Exponential Time Hypothesis (SETH) is true. Soon after, it was also shown that SETH implies that LCS also can not be solved in time $O(n^{2-\delta})$ for any $\delta > 0$ (Abboud, Backurs, and Williams 2015).

Figure 7: In the four Russians method, the (ntimes m) grid is divided into blocks of size (rtimes r). For each block, differences between DP table cells along the top row (R) and left column (S) are the input, together with the corresponding substrings of (A) and (B). The output are the differences along the bottom row (R’) and right column (S’). For each possible input of a block, the corresponding output is precomputed, so that the DP table can be filled by using lookups only. Red shaded states are not visited. — Figure 7: In the four Russians method, the (ntimes m) grid is divided into blocks of size (rtimes r). For each block, differences between DP table cells along the top row (R) and left column (S) are the *input*, together with the corresponding substrings of (A) and (B). The *output* are the differences along the bottom row (R’) and right column (S’). For each possible input of a block, the corresponding *output* is precomputed, so that the DP table can be filled by using lookups only. Red shaded states are not visited.

Four Russians method. The so called four Russians method was introduced by (Arlazarov et al. 1970). It is a general method to speed up DP algorithms from $n^2$ to $n^2 / \log n$, provided that entries are integers and all dependencies are ’local’.

This idea was applied to pairwise alignment by Masek (Masek and Paterson 1980), resulting in the first subquadratic worst-case algorithm for edit distance. It works by partitioning the $n\times m$ matrix in blocks of size $r\times r$, for some $r=\log_k n$, as shown in figure Figure 7. Consider the differences $R_i$ and $S_i$ between adjacent DP cells along the top row ($R_i$) and left column ($S_i$) of the block. The core observation is that the differences $R’_i$ and $S’_i$ along the bottom row and right column of the block only depend on $R_i$, $S_i$, and the substrings $a_i\cdots a_{i+r}$ and $b_j\cdots b_{j+r}$. This means that for some value of $k$ depending on the alphabet size $\sigma$, $r=\log_k n$ is small enough so that we can precompute the values of $R’$ and $S’$ for all possibilities of $(R, S, a_i\cdots a_{i+r}, b_j\cdots b_{j+r})$ in $O(n^2/r^2)$ time. In practice, $r$ needs to be quite small.

Using these precomputed values, the DP can be sped up by doing a single $O(1)$ lookup for each of the $O(n^2/r^2)$ blocks, for a total runtime of $O(n^2/\log^2 n)$. The runtime was originally reported as $O(n^2/\log n)$, but subsequent papers realized that the $r$ differences along each block boundary fit in a single machine word, so that lookups are indeed $O(1)$ instead of $O( r)$. While this is the only known subquadratic worst-case algorithm, it does not break the $O(n^{2-\delta})$ lower bound, since $\log^2 n$ grows subpolynomial.

Masek’s method requires a constant size alphabet. A first extension to general alphabets increased the time to $O(n^2 (\log \log n)^2 / \log^2(n))$ (Bille and Farach-Colton 2008), and this was later improved to $O(n^2 \log \log n / \log^2(n))$ (Grabowski 2016). An algorithm with similar complexity also works for LCS.

Applications. Wu et al. provide an implementation of this method for approximate string matching (Wu, Manber, and Myers 1996). They suggest a block size of $1\times r$, for $r=5$ or $r=6$, and provide efficient ways of transitioning from one block to the next.

Nowadays, the bit-parallel technique of Myers (Myers 1999) has replaced four Russians, since it can compute up to 64 cells in a single step, while not having to wait for (comparatively) slow lookups of the precomputed data.

14 Summary Link to heading

We summarize most of the papers discussed in this section in chronological order in Table 1. Not mentioned in the table are the review papers by Smith et al. (Smith, Waterman, and Fitch 1981), Kruskal (Kruskal 1983), Spouge (Spouge 1991), and Navarro (Navarro 2001), and also Bergroth et al.’s survey on LCS algorithms (Bergroth, Hakonen, and Raita, n.d.). A more recent review on read-aligners was done by Alser et al. (Alser et al. 2021).

Table 1: Chronological overview of papers related to exact global pairwise alignment. Parameters are sequence lengths $n$ and $m$ with $n\geq m$. The (edit) distance is $s$. The number of matching characters or k-mers is $r$. The length of the LCS is $p$. $w=64$ is the word size, and lastly we assume a fixed-size alphabet $\Sigma$. Time is worst-case unless noted otherwise, and space usage is to obtain the full alignment. Methods in bold are newly introduced or combined.

Paper	Cost model	Time	Space	Method	Remarks
(Vintsyuk 1968)	no deletions	$O(nm)$	$O(nm)$	DP	different formulation in a different domain, but conceptually similar
(Needleman and Wunsch 1970)	arbitrary	$O(n^2m)$	$O(nm)$	DP	solves pairwise alignment in polynomial time
(Sankoff 1972)	LCS	$\boldsymbol{O(nm)}$	$O(nm)$	DP	the first quadratic algorithm
(Sellers 1974) and (Wagner and Fischer 1974)	edit distance	$O(nm)$	$O(nm)$	DP	the quadratic algorithm now known as Needleman-Wunch
(Hirschberg 1975)	LCS	$O(nm)$	$\boldsymbol{O(n)}$	divide-and-conque	introduces linear memory backtracking
(Hunt and Szymanski 1977)	LCS	$\boldsymbol{O((r+n)\lg n)}$	$O(r+n)$	thresholds	distance only
(Hirschberg 1977)	LCS	$\boldsymbol{O(p(m-p)\lg n)}$	$\boldsymbol{O(n+(m-p)^2)}$	contours + band	for similar sequences
(Masek and Paterson 1980)	edit distance	$\boldsymbol{O(n\cdot \max(1, m/\lg n))}$	$O(n^2/\lg^2 n)$	four-russians	best known complexity; requires finite alphaet
(Smith, Waterman, and Fitch 1981)	affine	-	-	-	First to suggest affine, in future work.
(Gotoh 1982)	affine	$O(nm)$	$O(nm)$	DP	extends (Sellers 1974) to affine
(Altschul and Erickson 1986)	affine	$O(nm)$	$O(nm)$	DP	Fixes bug in traceback of (Gotoh 1982)
(Nakatsu, Kambayashi, and Yajima 1982)	LCS	$\boldsymbol{O(n(m-p))}$	$O(n(m-p))$	DP on thresholds	improves (Hirschberg 1977), subsumed by (Myers 1986)
(Wilbur and Lipman 1983; Wilbur and Lipman 1984)	LCSk	-	-	chaining k-mer matches	Approximate
(Fickett 1984)	Edit distance	$O(nt)$	$O(nt)$	*Bound $\g(u)\leq t$	Assuming $t\geq s$.
Dijkstra (Dijkstra 1959)	Edit distance	$O(ns)$	$O(ns)$	Dijkstra’s algorithm	Implement using $O(1)$ bucket queue
(Ukkonen 1985)	edit distance	$\boldsymbol{O(ms)}$	$O(ns)$	band doubling	first $O(ns)$ algorithm for edit distance
(Ukkonen 1985)	edit distance	$O(n+s^2)$ expected	$\boldsymbol{O(n+s^2)}$	diagonal transition	introduces diagonal transition method, requires fixed indel cost
(Myers 1986)	LCS	$O(n+s^s)$ expected	$O(s)$ working memory	diagonal transition, divide-and-conquer	introduces diagonal transition method for LCS, $O(n+s^2)$ expected time
(Myers 1986)	LCS	$\boldsymbol{O(n +s^2)}$	$O(n)$	suffix tree	better worst case complexity, but slower in practice
(Myers and Miller 1988)	affine	$O(nm)$	$O(m + \lg n)$	divide-and-conquer	applies (Hirschberg 1975) to (Gotoh 1982) to get linear space
(Spouge 1989)	edit distance	-	-	A, computational volumes*	Review paper
(GOTOH 1990)	double/more affine	$O(Lmn)$	$O(nm+Lm)$	DP, $L$ layers in the graph
(Wu et al. 1990)	unit cost	$O(n+(s-\vert n-m\vert)s)$ exp.	$O(n)$	Diagonal transition, gap-heuristic, divide-and-conquer
(Eppstein et al. 1992a)	LCSk	$O(n + d \log\log \min(d, nm/d))$		sparse-dynamic-programming, contours	$d$ is number of dominant matches
(Myers and Miller 1995)	LCSk, edit distance	$O(r \log^2 r)$	$O(r \log r)$	chaining k-mer matches with gap cost	$r$ is number of matches
(Myers 1999)	unit costs	$O(nm/w)$	$O(m\sigma / w)$	DP, bitpacking
(Papamichail and Papamichail 2009)	unit costs	$O(n+(s-\vert n-m\vert)s)$	$O(s)$	A*, gap heuristic, diagonal transition
(Deorowicz and Grabowski 2013)	LCS$k$	$O(n + r \log l)$	$O(n + \min(r, nl))$	thresholds	modifies (Hunt and Szymanski 1977) for LCS$k$
BitPAl (Loving, Hernandez, and Benson 2014)	Edit distance	$O(znm/w)$	$O(znm/w)$	Bitpacking, difference recurrence	$z$ depends on edit costs
(Grabowski 2016)	unit cost/LCS	$O(nm \log \log n / \log^2 n)$	$o(n)$ overhead	Four-russians	General alphabet
Edlib (Šošić and Šikić 2017)	unit costs	$O(ns/w)$	$O(n)$	Bitpacking, band-doubling, divide-and-conquer	extends Myers’ bit-packing to global alignment
libgaba (Suzuki and Kasahara 2018)	Affine	-	-	SIMD, affine difference recurrence relation	Adaptive band; not exact
KSW2 (Li 2018)	Affine, Double affine	$O(nm/w)$	$O(nm/w)$	Implements (Suzuki and Kasahara 2018)	$w$ SIMD lanes
WFA (Marco-Sola et al. 2020)	affine	$O(n+s^2)$ expected	$O(n+s^2)$	diagonal-transition	extends diagonal transition to gap affine (Gotoh 1982)
WFALM (Eizenga and Paten 2022)	affine	$O(n+s^2)$	$O(n+s^{3/2})$	diagonal-transition, square-root-decomposition	reduces memory usage of WFA by only storing $1/\sqrt n$ of fronts
BiWFA (Marco-Sola et al. 2023)	affine	$O(n+s^2)$ expected	$O(s)$ working memory	diagonal-transition, divide-and-conquer	applies (Hirschberg 1975) to WFA to get linear space
Block Aligner (Liu and Steinegger 2023)	Affine; scoring matrix			SIMD, blocks, adaptive band
TALCO (Walia et al. 2024)	Affine			Adaptive band; traceback convergence	Resolves trace during alignment, saving memory
BSAlign (Shao and Ruan 2024)	Affine			striped SIMD, difference recurrence, (adaptive) banded	First to implement these together
A*PA (Groot Koerkamp and Ivanov 2024)	unit costs	$O(n)$ best case	$O(n)$	A, gap-chaining seed heuristic, pruning*, diagonal-transition	only for random strings with random errors, with $n\ll\vert \Sigma\vert ^{1/e}$
A*PA2 (Groot Koerkamp 2024)	unit costs	$O(n)$ best case		DP, A*, blocks, (incremental) band-doubling, SIMD, bitpacking

References Link to heading

Abboud, Amir, Arturs Backurs, and Virginia Vassilevska Williams. 2015. “Quadratic-Time Hardness of Lcs and Other Sequence Similarity Measures.” arXiv. https://doi.org/10.48550/ARXIV.1501.07053.

Alpern, Bowen, Larry Carter, and Kang Su Gatlin. 1995. “Microparallelism and High-Performance Protein Matching.” Proceedings of the 1995 Acm/Ieee Conference on Supercomputing (Cdrom) - Supercomputing ’95. https://doi.org/10.1145/224170.224222.

Alser, Mohammed, Jeremy Rotman, Dhrithi Deshpande, Kodi Taraszka, Huwenbo Shi, Pelin Icer Baykal, Harry Taegyun Yang, et al. 2021. “Technology Dictates Algorithms: Recent Developments in Read Alignment.” Genome Biology 22 (1). https://doi.org/10.1186/s13059-021-02443-7.

Altschul, Stephen F., and Bruce W. Erickson. 1986. “Optimal Sequence Alignment Using Affine Gap Costs.” Bulletin of Mathematical Biology 48 (5–6): 603–16. https://doi.org/10.1007/bf02462326.

Arlazarov, V. L., Y. A. Dinitz, M. A. Kronrod, and I. A. Faradzhev. 1970. “On Economical Construction of the Transitive Closure of an Oriented Graph.” Dokl. Akad. Nauk Sssr 194 (3): 487–88. http://www.ams.org/mathscinet-getitem?mr=0269546.

Backurs, Arturs, and Piotr Indyk. 2018. “Edit Distance Cannot Be Computed in Strongly Subquadratic Time (Unless Seth Is False).” Siam Journal on Computing 47 (3): 1087–97. https://doi.org/10.1137/15m1053128.

Baeza-Yates, Ricardo, and Gaston H. Gonnet. 1992. “A New Approach to Text Searching.” Communications of the Acm 35 (10): 74–82. https://doi.org/10.1145/135239.135243.

Benson, Gary, Yozen Hernandez, and Joshua Loving. 2013. “A Bit-Parallel, General Integer-Scoring Sequence Alignment Algorithm.” Lecture Notes in Computer Science, 50–61. https://doi.org/10.1007/978-3-642-38905-4_7.

Benson, Gary, Avivit Levy, and Riva Shalom. 2014. “Longest Common Subsequence in K-Length Substrings.” arXiv. https://doi.org/10.48550/ARXIV.1402.2097.

Bergroth, L., H. Hakonen, and T. Raita. n.d. “A Survey of Longest Common Subsequence Algorithms.” In Proceedings Seventh International Symposium on String Processing and Information Retrieval. Spire 2000, 39–48. Spire-00. IEEE Comput. Soc. https://doi.org/10.1109/spire.2000.878178.

Bille, Philip, and Martin Farach-Colton. 2008. “Fast and Compact Regular Expression Matching.” Theoretical Computer Science 409 (3): 486–96. https://doi.org/10.1016/j.tcs.2008.08.042.

Daily, Jeff. 2016. “Parasail: Simd c Library for Global, Semi-Global, and Local Pairwise Sequence Alignments.” Bmc Bioinformatics 17 (1). https://doi.org/10.1186/s12859-016-0930-z.

Deorowicz, Sebastian, and Szymon Grabowski. 2013. “Efficient Algorithms for the Longest Common Subsequence in $k$-Length Substrings.” arXiv. https://doi.org/10.48550/ARXIV.1311.4552.

Dijk, Lucas R. van, Abigail L. Manson, Ashlee M. Earl, Kiran V Garimella, and Thomas Abeel. 2024. “Fast and Exact Gap-Affine Partial Order Alignment with Poasta.” Biorxiv. https://doi.org/10.1101/2024.05.23.595521.

Dijkstra, E. W. 1959. “A Note on Two Problems in Connexion with Graphs.” Numerische Mathematik 1 (1): 269–71. https://doi.org/10.1007/bf01386390.

Döring, Andreas, David Weese, Tobias Rausch, and Knut Reinert. 2008. “Seqan an Efficient, Generic c++ Library for Sequence Analysis.” Bmc Bioinformatics 9 (1). https://doi.org/10.1186/1471-2105-9-11.

Eizenga, Jordan M., and Benedict Paten. 2022. “Improving the Time and Space Complexity of the Wfa Algorithm and Generalizing Its Scoring,” January. https://doi.org/10.1101/2022.01.12.476087.

Eppstein, David, Zvi Galil, Raffaele Giancarlo, and Giuseppe F. Italiano. 1992a. “Sparse Dynamic Programming I: Linear Cost Functions.” Journal of the Acm 39 (3): 519–45. https://doi.org/10.1145/146637.146650.

———. 1992b. “Sparse Dynamic Programming Ii: Convex and Concave Cost Functions.” Journal of the Acm 39 (3): 546–67. https://doi.org/10.1145/146637.146656.

Farrar, Michael. 2006. “Striped Smith–Waterman Speeds Database Searches Six Times over Other Simd Implementations.” Bioinformatics 23 (2): 156–61. https://doi.org/10.1093/bioinformatics/btl582.

Fickett, James W. 1984. “Fast Optimal Alignment.” Nucleic Acids Research 12 (1Part1): 175–79. https://doi.org/10.1093/nar/12.1part1.175.

Frielingsdorf, J.T. 2015. “Improving Optimal Sequence Alignments through a Simd-Accelerated Library.” https://bib.irb.hr/datoteka/758607.diplomski_Martin_Sosic.pdf.

GOTOH, O. 1990. “Optimal Sequence Alignment Allowing for Long Gaps.” Bulletin of Mathematical Biology 52 (3): 359–73. https://doi.org/10.1016/s0092-8240(05)80216-2.

Gotoh, Osamu. 1982. “An Improved Algorithm for Matching Biological Sequences.” Journal of Molecular Biology 162 (3): 705–8. https://doi.org/10.1016/0022-2836(82)90398-9.

Grabowski, Szymon. 2016. “New Tabulation and Sparse Dynamic Programming Based Techniques for Sequence Similarity Problems.” Discrete Applied Mathematics 212 (October): 96–103. https://doi.org/10.1016/j.dam.2015.10.040.

Groot Koerkamp, Ragnar. 2024. “A*PA2: Up to 19 Faster Exact Global Alignment.” In 24th International Workshop on Algorithms in Bioinformatics (Wabi 2024), edited by Solon P. Pissis and Wing-Kin Sung, 312:17:1–17:25. Leibniz International Proceedings in Informatics (Lipics). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPIcs.WABI.2024.17.

Groot Koerkamp, Ragnar, and Pesho Ivanov. 2024. “Exact Global Alignment Using A* with Chaining Seed Heuristic and Match Pruning.” Edited by Tobias Marschall. Bioinformatics 40 (3). https://doi.org/10.1093/bioinformatics/btae032.

Hadlock, Frank O. 1988. “Minimum Detour Methods for String or Sequence Comparison.” Congressus Numerantium 61: 263–74.

Hamming, R. W. 1950. “Error Detecting and Error Correcting Codes.” Bell System Technical Journal 29 (2): 147–60. https://doi.org/10.1002/j.1538-7305.1950.tb00463.x.

Hart, Peter E., Nils J. Nilsson, and Bertram Raphael. 1972. “Correction to ``a Formal Basis for the Heuristic Determination of Minimum Cost Paths’’.” Acm Sigart Bulletin, no. 37 (December): 28–29. https://doi.org/10.1145/1056777.1056779.

Hart, Peter, Nils Nilsson, and Bertram Raphael. 1968. “A Formal Basis for the Heuristic Determination of Minimum Cost Paths.” Ieee Transactions on Systems Science and Cybernetics 4 (2): 100–107. https://doi.org/10.1109/tssc.1968.300136.

Hirschberg, D. S. 1975. “A Linear Space Algorithm for Computing Maximal Common Subsequences.” Communications of the Acm 18 (6): 341–43. https://doi.org/10.1145/360825.360861.

Hirschberg, Daniel S. 1977. “Algorithms for the Longest Common Subsequence Problem.” Journal of the Acm 24 (4): 664–75. https://doi.org/10.1145/322033.322044.

Holte, Robert C. 2010. “Common Misconceptions Concerning Heuristic Search.” In Proceedings of the Third Annual Symposium on Combinatorial Search, SOCS 2010, Stone Mountain, Atlanta, Georgia, Usa, July 8-10, 2010, edited by Ariel Felner and Nathan R. Sturtevant. AAAI Press. http://aaai.org/ocs/index.php/SOCS/SOCS10/paper/view/2073.

Hughey, Richard. 1996. “Parallel Hardware for Sequence Comparison and Alignment.” Bioinformatics 12 (6): 473–79. https://doi.org/10.1093/bioinformatics/12.6.473.

Hunt, James W., and Thomas G. Szymanski. 1977. “A Fast Algorithm for Computing Longest Common Subsequences.” Communications of the Acm 20 (5): 350–53. https://doi.org/10.1145/359581.359603.

Hyyrö, Heikki, Kimmo Fredriksson, and Gonzalo Navarro. 2005. “Increased Bit-Parallelism for Approximate and Multiple String Matching.” Acm Journal of Experimental Algorithmics 10 (December). https://doi.org/10.1145/1064546.1180617.

Ivanov, Pesho, Benjamin Bichsel, Harun Mustafa, André Kahles, Gunnar Rätsch, and Martin Vechev. 2020. “Astarix: Fast and Optimal Sequence-to-Graph Alignment.” In Research in Computational Molecular Biology, 104–19. Springer International Publishing. https://doi.org/10.1007/978-3-030-45257-5_7.

Ivanov, Pesho, Benjamin Bichsel, and Martin Vechev. 2022. “Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds.” In Research in Computational Molecular Biology, 306–25. Springer International Publishing. https://doi.org/10.1007/978-3-031-04749-7_22.

Kruskal, Joseph B. 1983. “An Overview of Sequence Comparison: Time Warps, String Edits, and Macromolecules.” Siam Review 25 (2): 201–37. https://doi.org/10.1137/1025045.

Li, Heng. 2016. “Minimap and Miniasm: Fast Mapping and de Novo Assembly for Noisy Long Sequences.” Bioinformatics 32 (14): 2103–10. https://doi.org/10.1093/bioinformatics/btw152.

———. 2018. “Minimap2: Pairwise Alignment for Nucleotide Sequences.” Edited by Inanc Birol. Bioinformatics 34 (18): 3094–3100. https://doi.org/10.1093/bioinformatics/bty191.

Liu, Daniel, and Martin Steinegger. 2023. “Block Aligner: an adaptive SIMD-accelerated aligner for sequences and position-specific scoring matrices.” Bioinformatics, btad487. https://doi.org/10.1093/bioinformatics/btad487.

Loving, Joshua, Yozen Hernandez, and Gary Benson. 2014. “Bitpal: A Bit-Parallel, General Integer-Scoring Sequence Alignment Algorithm.” Bioinformatics 30 (22): 3166–73. https://doi.org/10.1093/bioinformatics/btu507.

Marco-Sola, Santiago, Jordan M Eizenga, Andrea Guarracino, Benedict Paten, Erik Garrison, and Miquel Moreto. 2023. “Optimal Gap-Affine Alignment in O(s) Space.” Edited by Pier Luigi Martelli. Bioinformatics 39 (2). https://doi.org/10.1093/bioinformatics/btad074.

Marco-Sola, Santiago, Juan Carlos Moure, Miquel Moreto, and Antonio Espinosa. 2020. “Fast gap-affine pairwise alignment using the wavefront algorithm.” Bioinformatics 37 (4): 456–63. https://doi.org/10.1093/bioinformatics/btaa777.

Masek, William J., and Michael S. Paterson. 1980. “A Faster Algorithm Computing String Edit Distances.” Journal of Computer and System Sciences 20 (1): 18–31. https://doi.org/10.1016/0022-0000(80)90002-1.

Medvedev, Paul. 2023. “Theoretical Analysis of Edit Distance Algorithms.” Communications of the Acm 66 (12): 64–71. https://doi.org/10.1145/3582490.

Myers, Gene. 1986. “An $O(ND)$ Difference Algorithm and Its Variations.” Algorithmica 1 (1–4): 251–66. https://doi.org/10.1007/bf01840446.

———. 1999. “A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming.” Journal of the Acm 46 (3): 395–415. https://doi.org/10.1145/316542.316550.

Myers, Gene, and Webb Miller. 1988. “Optimal Alignments in Linear Space.” Bioinformatics 4 (1): 11–17. https://doi.org/10.1093/bioinformatics/4.1.11.

———. 1995. “Chaining Multiple-Alignment Fragments in Sub-Quadratic Time.” In Proceedings of the Sixth Annual Acm-Siam Symposium on Discrete Algorithms, 38–47. Soda ’95. San Francisco, California, USA: Society for Industrial and Applied Mathematics.

Nakatsu, Narao, Yahiko Kambayashi, and Shuzo Yajima. 1982. “A Longest Common Subsequence Algorithm Suitable for Similar Text Strings.” Acta Informatica 18 (2). https://doi.org/10.1007/bf00264437.

Navarro, Gonzalo. 2001. “A Guided Tour to Approximate String Matching.” Acm Computing Surveys 33 (1): 31–88. https://doi.org/10.1145/375360.375365.

Needleman, Saul B., and Christian D. Wunsch. 1970. “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.” Journal of Molecular Biology 48 (3): 443–53. https://doi.org/10.1016/0022-2836(70)90057-4.

Papamichail, Dimitris, and Georgios Papamichail. 2009. “Improved Algorithms for Approximate String Matching (Extended Abstract).” Bmc Bioinformatics 10 (S1). https://doi.org/10.1186/1471-2105-10-s1-s10.

Pavetić, Filip, Ivan Katanić, Gustav Matula, Goran Žužić, and Mile Šikić. 2017. “Fast and Simple Algorithms for Computing Both $lcs_k$ and $LCS_k+$.” arXiv. https://doi.org/10.48550/ARXIV.1705.07279.

Pavetić, Filip, Goran Žužić, and Mile Šikić. 2014. “$LCSk$++: Practical Similarity Metric for Long Strings.” https://doi.org/10.48550/ARXIV.1407.2407.

Pearl, Judea. 1984. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley Longman Publishing Co., Inc.

Rognes, Torbjørn. 2011. “Faster Smith-Waterman Database Searches with Inter-Sequence Simd Parallelisation.” Bmc Bioinformatics 12 (1). https://doi.org/10.1186/1471-2105-12-221.

Rognes, Torbjørn, and Erling Seeberg. 2000. “Six-Fold Speed-up of Smith–Waterman Sequence Database Searches Using Parallel Processing on Common Microprocessors.” Bioinformatics 16 (8): 699–706. https://doi.org/10.1093/bioinformatics/16.8.699.

Sankoff, David. 1972. “Matching Sequences under Deletion/Insertion Constraints.” Proceedings of the National Academy of Sciences 69 (1): 4–6. https://doi.org/10.1073/pnas.69.1.4.

Sellers, Peter H. 1974. “On the Theory and Computation of Evolutionary Distances.” Siam Journal on Applied Mathematics 26 (4): 787–93. https://doi.org/10.1137/0126070.

Shao, Haojing, and Jue Ruan. 2024. “Bsalign: A Library for Nucleotide Sequence Alignment.” Genomics, Proteomics & Bioinformatics, March. https://doi.org/10.1093/gpbjnl/qzae025.

Smith, T. F., M. S. Waterman, and W. M. Fitch. 1981. “Comparative Biosequence Metrics.” Journal of Molecular Evolution 18 (1): 38–46. https://doi.org/10.1007/bf01733210.

Smith, T.F., and M.S. Waterman. 1981. “Identification of Common Molecular Subsequences.” Journal of Molecular Biology 147 (1): 195–97. https://doi.org/10.1016/0022-2836(81)90087-5.

Spouge, John L. 1989. “Speeding up Dynamic Programming Algorithms for Finding Optimal Lattice Paths.” Siam Journal on Applied Mathematics 49 (5): 1552–66. https://doi.org/10.1137/0149094.

———. 1991. “Fast Optimal Alignment.” Bioinformatics 7 (1): 1–7. https://doi.org/10.1093/bioinformatics/7.1.1.

Suzuki, Hajime, and Masahiro Kasahara. 2018. “Introducing Difference Recurrence Relations for Faster Semi-Global Alignment of Long Sequences.” BMC Bioinformatics 19 (S1). https://doi.org/10.1186/s12859-018-2014-8.

Szalkowski, Adam, Christian Ledergerber, Philipp Krähenbühl, and Christophe Dessimoz. 2008. “Swps3 – Fast Multi-Threaded Vectorized Smith-Waterman for Ibm Cell/B.E. And 86/Sse2.” Bmc Research Notes 1 (1): 107. https://doi.org/10.1186/1756-0500-1-107.

Ukkonen, Esko. 1983. “On Approximate String Matching.” Lecture Notes in Computer Science, 487–95. https://doi.org/10.1007/3-540-12689-9_129.

———. 1985. “Algorithms for Approximate String Matching.” Information and Control 64 (1–3): 100–118. https://doi.org/10.1016/s0019-9958(85)80046-2.

Vintsyuk, T. K. 1968. “Speech Discrimination by Dynamic Programming.” Cybernetics 4 (1): 52–57. https://doi.org/10.1007/bf01074755.

Wagner, Robert A., and Michael J. Fischer. 1974. “The String-to-String Correction Problem.” Journal of the Acm 21 (1): 168–73. https://doi.org/10.1145/321796.321811.

Walia, Sumit, Cheng Ye, Arkid Bera, Dhruvi Lodhavia, and Yatish Turakhia. 2024. “Talco: Tiling Genome Sequence Alignment Using Convergence of Traceback Pointers.” In 2024 Ieee International Symposium on High-Performance Computer Architecture (Hpca). IEEE. https://doi.org/10.1109/hpca57654.2024.00044.

Waterman, M.S, T.F Smith, and W.A Beyer. 1976. “Some Biological Sequence Metrics.” Advances in Mathematics 20 (3): 367–87. https://doi.org/10.1016/0001-8708(76)90202-4.

Wilbur, W J, and D J Lipman. 1983. “Rapid Similarity Searches of Nucleic Acid and Protein Data Banks.” Proceedings of the National Academy of Sciences 80 (3): 726–30. https://doi.org/10.1073/pnas.80.3.726.

Wilbur, W. John, and David J. Lipman. 1984. “The Context Dependent Comparison of Biological Sequences.” Siam Journal on Applied Mathematics 44 (3): 557–67. https://doi.org/10.1137/0144038.

Wozniak, A. 1997. “Using Video-Oriented Instructions to Speed up Sequence Comparison.” Bioinformatics 13 (2): 145–50. https://doi.org/10.1093/bioinformatics/13.2.145.

Wu, Sun, U. Manber, and Gene Myers. 1996. “A Subquadratic Algorithm for Approximate Limited Expression Matching.” Algorithmica 15 (1): 50–67. https://doi.org/10.1007/bf01942606.

Wu, Sun, Udi Manber, Gene Myers, and Webb Miller. 1990. “An O(Np) Sequence Comparison Algorithm.” Information Processing Letters 35 (6): 317–23. https://doi.org/10.1016/0020-0190(90)90035-v.

Wu, Sun, and Udi Manber. 1992. “Fast Text Searching.” Communications of the Acm 35 (10): 83–91. https://doi.org/10.1145/135239.135244.

Zhang, Zheng, Scott Schwartz, Lukas Wagner, and Webb Miller. 2000. “A Greedy Algorithm for Aligning DNA Sequences.” Journal of Computational Biology 7 (1–2): 203–14. https://doi.org/10.1089/10665270050081478.

Zhao, Mengyao, Wan-Ping Lee, Erik P. Garrison, and Gabor T. Marth. 2013. “Ssw Library: An Simd Smith-Waterman c/c++ Library for Use in Genomic Applications.” Edited by Leonardo Mariño-Ram\’ırez. Plos One 8 (12): e82138. https://doi.org/10.1371/journal.pone.0082138.

Šošić, M. 2015. “An Simd Dynamic Programming c/c++ Library.” https://bib.irb.hr/datoteka/758607.diplomski_Martin_Sosic.pdf.

Šošić, Martin, and Mile Šikić. 2017. “Edlib: A c/c++ Library for Fast, Exact Sequence Alignment Using Edit Distance.” Edited by John Hancock. Bioinformatics 33 (9): 1394–95. https://doi.org/10.1093/bioinformatics/btw753.

Paper	Cost model	Time	Space	Method	Remarks
(Vintsyuk 1968)	no deletions	\(O(nm)\)	\(O(nm)\)	DP	different formulation in a different domain, but conceptually similar
(Needleman and Wunsch 1970)	arbitrary	\(O(n^2m)\)	\(O(nm)\)	DP	solves pairwise alignment in polynomial time
(Sankoff 1972)	LCS	\(\boldsymbol{O(nm)}\)	\(O(nm)\)	DP	the first quadratic algorithm
(Sellers 1974) and (Wagner and Fischer 1974)	edit distance	\(O(nm)\)	\(O(nm)\)	DP	the quadratic algorithm now known as Needleman-Wunch
(Hirschberg 1975)	LCS	\(O(nm)\)	\(\boldsymbol{O(n)}\)	divide-and-conque	introduces linear memory backtracking
(Hunt and Szymanski 1977)	LCS	\(\boldsymbol{O((r+n)\lg n)}\)	\(O(r+n)\)	thresholds	distance only
(Hirschberg 1977)	LCS	\(\boldsymbol{O(p(m-p)\lg n)}\)	\(\boldsymbol{O(n+(m-p)^2)}\)	contours + band	for similar sequences
(Masek and Paterson 1980)	edit distance	\(\boldsymbol{O(n\cdot \max(1, m/\lg n))}\)	\(O(n^2/\lg^2 n)\)	four-russians	best known complexity; requires finite alphaet
(Smith, Waterman, and Fitch 1981)	affine	-	-	-	First to suggest affine, in future work.
(Gotoh 1982)	affine	\(O(nm)\)	\(O(nm)\)	DP	extends (Sellers 1974) to affine
(Altschul and Erickson 1986)	affine	\(O(nm)\)	\(O(nm)\)	DP	Fixes bug in traceback of (Gotoh 1982)
(Nakatsu, Kambayashi, and Yajima 1982)	LCS	\(\boldsymbol{O(n(m-p))}\)	\(O(n(m-p))\)	DP on thresholds	improves (Hirschberg 1977), subsumed by (Myers 1986)
(Wilbur and Lipman 1983; Wilbur and Lipman 1984)	LCSk	-	-	chaining k-mer matches	Approximate
(Fickett 1984)	Edit distance	\(O(nt)\)	\(O(nt)\)	*Bound \(\g(u)\leq t\)	Assuming \(t\geq s\).
Dijkstra (Dijkstra 1959)	Edit distance	\(O(ns)\)	\(O(ns)\)	Dijkstra’s algorithm	Implement using \(O(1)\) bucket queue
(Ukkonen 1985)	edit distance	\(\boldsymbol{O(ms)}\)	\(O(ns)\)	band doubling	first \(O(ns)\) algorithm for edit distance
(Ukkonen 1985)	edit distance	\(O(n+s^2)\) expected	\(\boldsymbol{O(n+s^2)}\)	diagonal transition	introduces diagonal transition method, requires fixed indel cost
(Myers 1986)	LCS	\(O(n+s^s)\) expected	\(O(s)\) working memory	diagonal transition, divide-and-conquer	introduces diagonal transition method for LCS, \(O(n+s^2)\) expected time
(Myers 1986)	LCS	\(\boldsymbol{O(n +s^2)}\)	\(O(n)\)	suffix tree	better worst case complexity, but slower in practice
(Myers and Miller 1988)	affine	\(O(nm)\)	\(O(m + \lg n)\)	divide-and-conquer	applies (Hirschberg 1975) to (Gotoh 1982) to get linear space
(Spouge 1989)	edit distance	-	-	A, computational volumes*	Review paper
(GOTOH 1990)	double/more affine	\(O(Lmn)\)	\(O(nm+Lm)\)	DP, \(L\) layers in the graph
(Wu et al. 1990)	unit cost	\(O(n+(s-\vert n-m\vert)s)\) exp.	\(O(n)\)	Diagonal transition, gap-heuristic, divide-and-conquer
(Eppstein et al. 1992a)	LCSk	\(O(n + d \log\log \min(d, nm/d))\)		sparse-dynamic-programming, contours	\(d\) is number of dominant matches
(Myers and Miller 1995)	LCSk, edit distance	\(O(r \log^2 r)\)	\(O(r \log r)\)	chaining k-mer matches with gap cost	\(r\) is number of matches
(Myers 1999)	unit costs	\(O(nm/w)\)	\(O(m\sigma / w)\)	DP, bitpacking
(Papamichail and Papamichail 2009)	unit costs	\(O(n+(s-\vert n-m\vert)s)\)	\(O(s)\)	A*, gap heuristic, diagonal transition
(Deorowicz and Grabowski 2013)	LCS\(k\)	\(O(n + r \log l)\)	\(O(n + \min(r, nl))\)	thresholds	modifies (Hunt and Szymanski 1977) for LCS\(k\)
BitPAl (Loving, Hernandez, and Benson 2014)	Edit distance	\(O(znm/w)\)	\(O(znm/w)\)	Bitpacking, difference recurrence	\(z\) depends on edit costs
(Grabowski 2016)	unit cost/LCS	\(O(nm \log \log n / \log^2 n)\)	\(o(n)\) overhead	Four-russians	General alphabet
Edlib (Šošić and Šikić 2017)	unit costs	\(O(ns/w)\)	\(O(n)\)	Bitpacking, band-doubling, divide-and-conquer	extends Myers’ bit-packing to global alignment
libgaba (Suzuki and Kasahara 2018)	Affine	-	-	SIMD, affine difference recurrence relation	Adaptive band; not exact
KSW2 (Li 2018)	Affine, Double affine	\(O(nm/w)\)	\(O(nm/w)\)	Implements (Suzuki and Kasahara 2018)	\(w\) SIMD lanes
WFA (Marco-Sola et al. 2020)	affine	\(O(n+s^2)\) expected	\(O(n+s^2)\)	diagonal-transition	extends diagonal transition to gap affine (Gotoh 1982)
WFALM (Eizenga and Paten 2022)	affine	\(O(n+s^2)\)	\(O(n+s^{3/2})\)	diagonal-transition, square-root-decomposition	reduces memory usage of WFA by only storing \(1/\sqrt n\) of fronts
BiWFA (Marco-Sola et al. 2023)	affine	\(O(n+s^2)\) expected	\(O(s)\) working memory	diagonal-transition, divide-and-conquer	applies (Hirschberg 1975) to WFA to get linear space
Block Aligner (Liu and Steinegger 2023)	Affine; scoring matrix			SIMD, blocks, adaptive band
TALCO (Walia et al. 2024)	Affine			Adaptive band; traceback convergence	Resolves trace during alignment, saving memory
BSAlign (Shao and Ruan 2024)	Affine			striped SIMD, difference recurrence, (adaptive) banded	First to implement these together
A*PA (Groot Koerkamp and Ivanov 2024)	unit costs	\(O(n)\) best case	\(O(n)\)	A, gap-chaining seed heuristic, pruning*, diagonal-transition	only for random strings with random errors, with \(n\ll\vert \Sigma\vert ^{1/e}\)
A*PA2 (Groot Koerkamp 2024)	unit costs	\(O(n)\) best case		DP, A*, blocks, (incremental) band-doubling, SIMD, bitpacking