[WIP] A*PA2: 10x faster exact global alignment

\begin{equation*} \newcommand{\g}{g^*} \newcommand{\h}{h^*} \newcommand{\cgap}{c_{\texttt{gap}}} \newcommand{\xor}{\ \mathrm{xor}\ } \newcommand{\and}{\ \mathrm{and}\ } \newcommand{\st}[2]{\langle #1, #2\rangle} \newcommand{\matches}{\mathcal M} \end{equation*}

Preface: This is my work-in-progress draft for my paper on version 2 of A*PA. Code is done and works, but needs some cleanup and polishing. The API also needs work.


Methods. We introduce a new global pairwise aligner, A*PA2, that builds on Edlib and incorporates ideas from Block Aligner and A*PA. A*PA2 1) uses Myers’ bitpacking with SIMD, 2) significantly reduces overhead by using large block sizes, 3) avoids recomputation of states where possible, 4) introduces two new techniques for traceback, and 5) improves and applies the heuristics developed in A*PA.

Results. A \(10\times\) overall speedup on the alignment of $500$kbp ONT reads.

Availability. https://github.com/RagnarGrootKoerkamp/astar-pairwise-aligner

Contact. https://twitter.com/curious_coding

1 Introduction

The problem of global pairwise alignment is to find the shortest sequence of edit operations (insertions, deletions, substitutions) to convert a string \(A\) into a second string \(B\) (Needleman and Wunsch 1970; Vintsyuk 1968), where the number of such operations is called the Levenshtein distance or edit distance (Levenshtein 1965; Wagner and Fischer 1974).

Per tradition, as Rognes and Seeberg (2000) write:

The rapidly increasing amounts of genetic-sequence information available represent a constant challenge to developers of hardware and software database searching and handling.

In this work we introduce a new exact algorithm for global pairwise alignment building on the recently introduced A* heuristics of A*PA (Groot Koerkamp and Ivanov 2022), but instead of running A* we use a much more efficient DP-based algorithm. Thus, as Fickett (1984) states,

at present one must choose between an algorithm which gives the best alignment but is expensive, and an algorithm which is fast but may not give the best alignment. In this paper we narrow the gap between these choices by showing how to get the optimal alignment much more quickly than before.

1.1 Contributions

In A*PA2, we combine multiple existing techniques and introduce a number of new ideas to obtain a \(10\times\) speedup over existing single-threaded exact aligners. As a starting point, we take the band doubling algorithm implemented by Edlib (Šošić and Šikić 2017) using bitpacking (Myers 1999). First, we speed up the implementation (points 1., 2., and 3.). Then, we reduce the amount of work done (points 4. and 5.). Lastly, we apply A*.

  1. Block-based computation. Edlib computes one column of the DP matrix at a time, and for each column decides which range (subset of rows) of cells to compute. We significantly reduce this overhead by processing blocks of \(256\) columns at a time, taking inspiration from Block aligner (Liu and Steinegger 2023). Correspondingly, we only store cells of the DP-matrix at block boundaries, reducing memory usage.
  2. SIMD. We speed up the implementation using SIMD to compute each block, allowing the processing of \(4\) computer words in parallel.
  3. Novel encoding. We introduce a novel bit-encoding of the input sequence to speed up SIMD operations for size-\(4\) alphabets.
  4. Incremental doubling. We further note that both the original band doubling method of Ukkonen (1985) and Edlib recompute states in the doubled region. Reusing the theory behind the A* algorithm, we prove and apply a novel theorem stating that some of this recomputation can be avoided.
  5. Traceback. For the traceback, we optimistically use the diagonal transition method within each block with a strong Z-drop [TODO] heuristic, falling back to a full recomputation of the block when needed.
  6. A* . We improve the A* seed heuristics of Groot Koerkamp and Ivanov (2022) in two ways. First, instead of updating contours each time a match is pruned, we now do this in batches once the band is doubled. Secondly, we introduce a new pre-pruning technique that discards most of the spurious (off-path) matches ahead of time.

1.2 Previous work

In the following, we give a brief recap of developments that this work builds on, in chronological order per approach. See also e.g. the reviews by Kruskal (1983) and Navarro (2001).

1.2.1 Needleman-Wunsch

This problem has classically been approached as a dynamic programming (DP) problem. For string lengths \(n\) and \(m\), Needleman and Wunsch (1970) introduced the first \(O(n^2m)\) algorithm. This was improved to what is now known as the \(O(nm)\) Needleman-Wunsch algorithm by Sellers (1974) and Wagner and Fischer (1974), building on the quadratic algorithm for longest common subsequence by Sankoff (1972).

1.2.2 Graph algorithms

It was already realized early on that this problem corresponds to finding the shortest path from \(v_s\) to \(v_t\) in the alignment graph (also called edit graph or dependency graph) (Vintsyuk 1968; Ukkonen 1985). Both Ukkonen (1985) and Myers (1986) remarked that this can be solved using Dijkstra’s algorithm (Dijkstra 1959), taking \(O(ns)\) time1, where \(s\) is the edit distance between the two strings. However, Myers (1986) observes that

the resulting algorithm involves a relatively complex discrete priority queue and this queue may contain as many as O(ND) entries even in the case where just the length of the […] shortest edit script is being computed.

Hadlock (1988) realized that Dijkstra’s algorithm can be improved upon by using A* (Hart, Nilsson, and Raphael 1968), a more informed algorithm that uses a heuristic function \(h(u)\) that gives a lower bound on the edit distance \(\h(u)\) between the suffixes following DP state \(u\). He uses two heuristics, the widely used gap cost heuristic \(h(u)=\cgap(u, v_t)\) (Ukkonen 1985; Hadlock 1988; Wu et al. 1990; Spouge 1989, 1991; Papamichail and Papamichail 2009) that simply uses the difference between the lengths of the suffixes as lower bound, and a new improved heuristic based on character frequencies in the two suffixes. A*PA (Groot Koerkamp and Ivanov 2022) applies the gap-chaining seed heuristic with pruning (Ivanov, Bichsel, and Vechev 2021) to obtain near-linear runtime when errors are uniform random. Nevertheless, as Spouge (1991) states:

Many algorithms for finding optimal paths in non-lattice graphs also exist (Dijkstra 1959; Hart, Nilsson, and Raphael 1968; Rubin 1974), but algorithms exploiting the lattice structure of an alignment graph are usually faster. In molecular biology, speed is important, …

and further (Spouge 1989):

This suggests a radical approach to A* search complexities: dispense with the lists [of open states] if there is a natural order for vertex expansion.

Indeed, a lot of work has gone into speeding up DP-based algorithms.

1.2.3 Computational volumes

Wilbur and Lipman (1983) is (to our knowledge) the first paper that speeds up the \(O(nm)\) DP algorithm, by only considering states near diagonals with many k-mer matches, but at the cost of giving up the exactness of the method. Fickett (1984) notes that for \(t\geq s\) only those DP-states with cost \(\g(u)\) at most \(t\) need to be computed:

However it is possible to fill the matrix in many different orders, the only restriction being that the calculation of any given \(d_{ij}\) depends on already having the values of the three element up and to the left of it.


But the only alignments of subsequences which are relevant are ones at least as good (distance at least as small) as the overall one. I.e. one really only needs those \(d_{ij}\) which are below a fixed bound.

This only requires \(O(nt)\) time, which is fast when \(t\) is an accurate bound on the distance \(s\), and can for example be set to a known upper bound for the data being aligned, or to the length of a known suboptimal alignment. When \(t=t_0\) turns out to be too small, a larger new bound \(t_1\) can be chosen, and only states with distance between \(t_0\) and \(t_1\) have to be computed. This is implemented by keeping for each row the index of the first and last state with value at most \(t_0\), and skipping over already computed states. In the limit, one could choose \(t_i = i\) and compute states by increasing distance, closely mirroring Dijkstra’s algorithm.

Ukkonen (1985) introduces a very similar idea, statically bounding the computation to only those states that can be on a path of length at most \(t\) through the graph. When the sequences have the same length (\(n=m\)), this only considers diagonals \(-t/2\) to \(t/2\), where diagonal \(0\) is the main diagonal of the DP-matrix.

On top of this, Ukkonen (1985) introduces band doubling: \(t_0=1\) can be doubled (\(t_i = 2^i\)) until \(t_k \geq s > t_{k-1}\). Since each test requires \(O(n \cdot t_i)\) time, the total time is

\begin{equation} n\cdot t_0 + \dots + n\cdot t_k = n\cdot (2^0 + \dots + 2^k) < n\cdot 2^{k+1} = 4\cdot n\cdot 2^{k-1} < 4\cdot n\cdot s = O(ns). \end{equation}

Note that this method does not (and indeed can not) reuse values from previous iterations, resulting in roughly a factor \(2\) overhead.

Spouge (1989) unifies the methods of Fickett (1984) and Ukkonen (1985), and generalizes them to accept any A* heuristic. In particular, a computational volume is a subgraph of the alignment graph that contains every shortest path. Given a bound \(t\geq s\), some examples of computational volumes are:

  1. The entire \((n+1)\times (m+1)\) graph.
  2. \(\{u: \g(u) + \h(u)=s\}\), the vertices on shortest paths.
  3. \(\{u: \g(u)\leq t\}\), the states at distance \(\leq t\) (Fickett 1984).
  4. \(\{u: \cgap(v_s, u) + \cgap(u, v_t) \leq t\}\) the static set of states possibly on a path of length \(\leq t\) (Ukkonen 1985).
  5. \(\{u: \g(u) + \cgap(u, v_t) \leq t\}\), as used by Edlib (Spouge 1991; Šošić and Šikić 2017).
  6. \(\{u: \g(u) + h(u) \leq t\}\), for any admissible heuristic \(h\).
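For intuition, the gap-cost based volumes (items 4 and 5) reduce to simple coordinate predicates. A minimal sketch of item 4 in Rust, with function names of our own choosing:

```rust
/// Gap cost between states u <= v: the difference in diagonal, i.e. a lower
/// bound on the number of indels needed to get from u to v.
fn cgap(u: (usize, usize), v: (usize, usize)) -> usize {
    (v.0 - u.0).abs_diff(v.1 - u.1)
}

/// Ukkonen's static volume: states that can possibly lie on a path of total
/// cost <= t from (0, 0) to (n, m).
fn in_static_volume(u: (usize, usize), n: usize, m: usize, t: usize) -> bool {
    cgap((0, 0), u) + cgap(u, (n, m)) <= t
}
```

Item 5 replaces the first term by the computed distance \(\g(u)\), and item 6 replaces the second term by any admissible \(h\).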

As Spouge (1989) notes:

The order of computation (row major, column major or antidiagonal) is just a minor detail in most algorithms.

But this is exactly what has been investigated extensively in the search for faster implementations.

1.2.4 Implementation and parallelism

Since roughly \(1995\), the focus shifted from reducing the number of computed states to computing states faster through advancements in implementation and hardware (SIMD, GPUs). These speedups are often applied to the Smith-Waterman-(Gotoh) (Smith and Waterman 1981; Gotoh 1982) algorithm for (affine-cost) local alignment, where algorithmic improvements beyond \(\Theta(nm)\) are unknown.

The first technique in this direction is microparallelism (Alpern, Carter, and Gatlin 1995), where each (64-bit) computer word is divided into multiple (e.g. 16-bit) parts, and word-size operations modify all (4) parts in parallel. Alpern, Carter, and Gatlin (1995) applied this with inter-sequence parallelism to align a given query to \(4\) reference sequences in parallel (see also Rognes (2011)). Hughey (1996) was the first to note that antidiagonals of the DP matrix can be computed in parallel, and Wozniak (1997) applied SIMD for this purpose.

Rognes and Seeberg (2000) split 64-bit words into eight 8-bit values, capping all computations at \(255\) but doubling the speed. Further, they use vertical instead of antidiagonal vectors.

The advantage of this approach is the much-simplified and faster loading of the vector of substitution scores from memory. The disadvantage is that data dependencies within the vector must be handled.

In particular, Rognes and Seeberg (2000) introduce the query profile: Instead of looking up the substitution score \(S[A[i]][B[j]]\) for the \(i\)’th and \(j\)’th character of \(A\) and \(B\) respectively, it is more efficient to precompute the profile \(P[c][j] := S[c][B[j]]\) for each character \(c\) in the alphabet. Then, adjacent scores are simply found as adjacent values \(P[A[i]][j \dots j’]\).
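As a minimal sketch of such a profile for a \(2\)-bit encoded sequence over a size-\(4\) alphabet (the toy score matrix and all names are our own choices):

```rust
/// Build the query profile P[c][j] = S[c][B[j]] for a 2-bit-encoded sequence
/// `b_enc`, where `s` is a substitution score matrix over a size-4 alphabet.
fn build_query_profile(s: &[[i32; 4]; 4], b_enc: &[u8]) -> Vec<Vec<i32>> {
    (0..4)
        .map(|c| b_enc.iter().map(|&bj| s[c][bj as usize]).collect())
        .collect()
}
```

Scores against character \(A[i]\) are then read as the contiguous slice `P[A[i]][j..j']`, avoiding a per-cell lookup of \(B[j]\).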

Similarly, Myers (1999) introduces a bitpacking algorithm specifically for edit distance that stores the differences between adjacent DP-states bit-encoded in two 64-bit words \(P\) and \(M\), with \(P_i\) and \(M_i\) indicating whether the \(i\)’th difference is \(+1\) resp. \(-1\). It then gives an efficient algorithm using bitwise operations on these words.
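To make the recurrence concrete, here is a scalar sketch of the global edit distance version for patterns of length at most \(64\), following the standard single-word formulation (with the always-\(+1\) top-row carry for global alignment) rather than Edlib’s multi-word code; function and variable names are ours:

```rust
/// Myers' bit-parallel edit distance between `a` (pattern, along rows,
/// 1 <= |a| <= 64) and `b` (text, along columns).
fn myers_edit_distance(a: &[u8], b: &[u8]) -> usize {
    let m = a.len();
    assert!((1..=64).contains(&m));
    // Profile: for each character, the bitmask of its positions in `a`.
    let mut peq = [0u64; 256];
    for (i, &c) in a.iter().enumerate() {
        peq[c as usize] |= 1u64 << i;
    }
    let (mut pv, mut mv) = (u64::MAX, 0u64); // vertical +1 / -1 difference bits
    let mut score = m; // distance at the bottom of the current column
    let last = 1u64 << (m - 1);
    for &c in b {
        let eq = peq[c as usize];
        let xv = eq | mv;
        let xh = ((eq & pv).wrapping_add(pv) ^ pv) | eq;
        let mut ph = mv | !(xh | pv); // horizontal +1 difference bits
        let mut mh = pv & xh; // horizontal -1 difference bits
        if ph & last != 0 { score += 1; }
        if mh & last != 0 { score -= 1; }
        // Shift in the top-row difference, which is +1 for global alignment.
        ph = (ph << 1) | 1;
        mh <<= 1;
        pv = mh | !(xv | ph);
        mv = ph & xv;
    }
    score
}
```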


  • (Farrar 2006) Farrar’s striped; uses query profile; conditional prefix scan is moved outside inner loop. \(2-8\times\) faster than Wozniak and Rognes.
  • Wu Manber 1992
  • Baeza-Yates Gonnet 1992
  • Hyyro and Navarro, 2005; Hyyro et al., 2005
  • Benson 2013
  • navarro 2004
  • bergeron hamel 2002

1.2.5 Tools

There are multiple semi-global aligners that implement \(O(nm)\) global alignment using many of the aforementioned implementation techniques, such as SeqAn (Döring et al. 2008), Parasail (Daily 2016), Opal (https://github.com/martinsos/opal), libssa (https://github.com/RonnySoak/libssa), SWIPE (Rognes 2011), SWPS3 (Szalkowski et al. 2008), SSW library (Zhao et al. 2013) (link), and KSW2 (Li 2018).

Dedicated global alignment implementations are much rarer. Edlib (Šošić and Šikić 2017) implements the band doubling of Ukkonen (1985) using the \(\g(u)+\cgap(u, v_t)\leq t\) computational volume of Spouge (1991) and the bitpacking of Myers (1999). WFA and BiWFA (Marco-Sola et al. 2020, 2022) implement the \(O(n+s^2)\) expected time diagonal transition algorithm (Ukkonen 1985; Myers 1986). Block aligner (Liu and Steinegger 2023) is an approximate aligner that can handle position-specific scoring matrices whose main novelty is to divide the computation into blocks. Lastly, A*PA (Groot Koerkamp and Ivanov 2022) directly implements A* on the alignment graph using the gap-chaining seed heuristic.

2 Methods

First, we explain in detail the algorithm and implementation used by Edlib and reduce the overhead in the implementation by using blocks and SIMD. Then, we improve the algorithm by avoiding recomputing states and speeding up the traceback algorithm. On top of that, we apply the A*PA heuristics for further speed gains on large/complex alignments, at the cost of larger precomputation time to build the heuristic.

2.1 Band-doubling and bitpacking in Edlib

As a baseline, we first outline the band-doubling method used by Edlib.

  1. Start with edit-distance threshold \(t=1\).

  2. Iterate over columns \(i\) from \(0\) to \(n\).

  3. For each column, determine the range of rows \(R=(r_{start}, r_{end})\) to compute by finding the top- and bottommost state that can possibly have cost at most \(t\), taking into account the gap-cost to the end. Both \(r_{start}\) and \(r_{end}\) are rounded out to the previous/next multiple of \(w\).

     a. If the range is empty, double \(t\) and go back to step 2.
     b. Otherwise, compute the range in blocks of \(w=64\) rows at a time using bitpacking and the standard profile of sequence \(B\).

    Only the last and current column are kept in memory.

  4. Traceback. Once the last column has been reached, recursively repeat the algorithm using Hirschberg’s meet-in-the-middle approach to find the alignment. Continue until the sequences are of length \(\leq 1000\). For these small sequences all vertical differences can be stored and a backtrace is done to find the alignment.
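The outer structure of steps 1–3 can be sketched as follows, in a scalar (non-bitpacked) form that tests Ukkonen’s static volume per cell instead of maintaining explicit row ranges; this is a simplified sketch with names of our own choosing, not Edlib’s actual code:

```rust
/// Edit distance by band doubling: retry with doubled threshold t until the
/// banded computation certifies its result (distance <= t).
fn band_doubling_edit_distance(a: &[u8], b: &[u8]) -> usize {
    let mut t = 1;
    loop {
        if let Some(d) = banded_nw(a, b, t) {
            return d;
        }
        t *= 2;
    }
}

/// Compute only cells that can lie on a path of cost <= t (Ukkonen's static
/// volume). Returns the distance only when it is certified to be <= t.
fn banded_nw(a: &[u8], b: &[u8], t: usize) -> Option<usize> {
    let (n, m) = (a.len(), b.len());
    const INF: usize = usize::MAX / 2;
    let inside = |i: usize, j: usize| i.abs_diff(j) + (n - i).abs_diff(m - j) <= t;
    let mut prev = vec![INF; m + 1];
    let mut cur = vec![INF; m + 1];
    for j in 0..=m {
        if inside(0, j) {
            prev[j] = j;
        }
    }
    for i in 1..=n {
        cur.fill(INF);
        for j in 0..=m {
            if !inside(i, j) {
                continue;
            }
            let mut d = prev[j] + 1; // deletion of a[i-1]
            if j > 0 {
                d = d.min(cur[j - 1] + 1); // insertion of b[j-1]
                let sub = usize::from(a[i - 1] != b[j - 1]);
                d = d.min(prev[j - 1] + sub); // (mis)match
            }
            cur[j] = d;
        }
        std::mem::swap(&mut prev, &mut cur);
    }
    (prev[m] <= t).then_some(prev[m])
}
```

Any path of cost \(\leq t\) lies entirely inside the volume, so a result \(\leq t\) is exact; otherwise \(t\) is doubled and everything is recomputed, illustrating the factor-\(2\) overhead discussed above.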

2.2 Blocks

Our first improvement is to process \(B=256\) columns at a time. Instead of computing the range of rows \(R\) for each column individually, we compute it once and then use this one range for a block of \(B\) consecutive columns. While this computes some extra states in most columns, the time saved by not having to compute \(R\) for each column is larger.

Within each block, we iterate over the rows in lanes of \(w\) rows at a time, and for each lane compute all \(B\) columns before moving on to the next lane.

See 2.8.1 for details on the computation of \(R\).

Where Edlib does not initially store intermediate values and uses meet-in-the-middle to find the alignment, we always store vertical differences at the end of each2 block. This simplifies the implementation, and has sufficiently small memory usage to be practical. See 2.6 for details on recovering the alignment.

2.3 SIMD

While it is tempting to use a SIMD vector as a single $W=256$-bit word, the four $w=64$-bit words (SIMD lanes) are dependent on each other and require manual work to shift bits between the lanes. Instead, we let each $256$-bit AVX2 SIMD vector represent four $64$-bit words (lanes) that are anti-diagonally staggered (TODO FIG). This is similar to the original anti-diagonal tiling introduced by Wozniak (1997), but using units of $64$-bit words instead of single characters. This idea was already introduced in 2014 by the author of Edlib3, but to our knowledge has never been implemented either in Edlib or elsewhere.

We achieve further speedup by improving instruction-level-parallelism. Modern CPUs can execute up to 4 instructions per cycle (IPC) and use execution pipelines that look ahead tens of instructions. The dependencies between the instructions for computing each SIMD vector do not allow such high parallelism. We improve this by processing two SIMD vectors in parallel, spanning a total of \(8\) anti-diagonally-aligned $64$-bit lanes covering \(2W = 512\) rows (TODO FIG).

When \(c=(r_{end}-r_{start})/64\) lanes of rows are to be computed, we process \(8\) lanes in parallel as long as \(c\geq 8\). If rows remain, we end with another $8$-lane (\(5\leq c<8\)) or $4$-lane (\(1\leq c\leq 4\)) iteration that optionally includes some padding rows at the bottom. In case the horizontal differences along the original bottom row are needed (as is the case for incremental doubling 2.5), we do not use padding and instead fall back to trying a $4$-lane SIMD (\(c\geq 4\)), a $2$-lane SIMD (\(c\geq 2\)), and lastly a scalar iteration (\(c\geq 1\)).

TODO: How about padding upwards?

2.4 SIMD-friendly sequence profile

Myers’ bitpacking algorithm precomputes a profile \(P_{eq}[c][j]\) containing \(\sigma \times m\) bits. For each character \(c\), it contains a bitvector of $w$-bit words indicating the positions where \(c\) occurs in \(B\). We improve memory locality by instead storing the profile as an array of blocks of \(\sigma\) words: \(P_{eq}[j/w][c]\) containing \(\lceil m/w\rceil \times \sigma\) $w$-bit words (FIG?).

A drawback of anti-diagonal tiling is that each column contains its own character \(a_i\) that needs to be looked up. While SIMD offers gather instructions to do multiple of these lookups in parallel, these instructions are not always efficient. Thus, we introduce the following alternative scheme.

Let \(b = \lceil \log_2(\sigma)\rceil\) be the number of bits needed to encode each character, with \(b=2\) for DNA. The new profile \(P’\) contains \(b\) bitvectors, each indicating the negation of one bit of each character, stored as an \(\lceil m/w\rceil \times b\) array \(P’[j/w][p]\) of $w$-bit words.

To check whether row \(j\) contains character \(c\) with bit representation \(\overline{c_{b-1}\dots c_{0}}\), we compute \[(c_0 \xor P’[j/w][0][j\bmod w]) \and \dots \and (c_{b-1} \xor P’[j/w][b-1][j\bmod w]).\] This naturally extends to an efficient computation for $w$-bit words and larger SIMD vectors.
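A scalar sketch of this profile for DNA (\(b=2\)); the SIMD version applies the same xor/and pattern per lane. The \(\texttt{ACGT}\to 0,1,2,3\) encoding and all names are our own choices:

```rust
/// 2-bit encoding of DNA characters (our choice of encoding).
fn enc(c: u8) -> u64 {
    match c {
        b'A' => 0,
        b'C' => 1,
        b'G' => 2,
        b'T' => 3,
        _ => panic!("non-ACGT character"),
    }
}

/// Profile: per 64-character block of `b`, two words, where word p stores the
/// NEGATION of bit p of each character.
fn build_bit_profile(b: &[u8]) -> Vec<[u64; 2]> {
    let mut prof = vec![[u64::MAX; 2]; (b.len() + 63) / 64];
    for (j, &c) in b.iter().enumerate() {
        for p in 0..2 {
            if (enc(c) >> p) & 1 == 1 {
                prof[j / 64][p] &= !(1u64 << (j % 64));
            }
        }
    }
    prof
}

/// Match bits of character `c` against one 64-character block: bit j is set
/// iff b[64*block + j] == c. Bits past the end of `b` are garbage.
fn eq_word(prof: &[[u64; 2]], block: usize, c: u8) -> u64 {
    let e = enc(c);
    // c_p xor (negated stored bit) is 1 exactly when the bits agree.
    let m0 = if e & 1 == 1 { !prof[block][0] } else { prof[block][0] };
    let m1 = if (e >> 1) == 1 { !prof[block][1] } else { prof[block][1] };
    m0 & m1
}
```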

TODO: Tried BitPAl’s bitpacking method which is one less than Myers 99’s, but without success so far.

2.5 Incremental doubling

TODO: Rephrase \(g(u)\leq t\) to \(f(u) \leq t\).

TODO: The range-end only matters for the last columns of the block. Intermediate columns that go further down can be disregarded.

The original band doubling algorithm doubles the threshold from \(t\) to \(t’=2t\) in each iteration and simply recomputes the distance to all states. On the other hand, BFS, Dijkstra, and A*4 visit states in increasing order of distance (\(g(u)\) for BFS and Dijkstra, \(f(u) = g(u) + h(u)\) for A*), and the distance to a state is known to be correct (fixed) as soon as it is expanded. This way a state is never expanded twice.

Indeed, the band-doubling algorithm can also avoid recomputations. After completing the iteration for \(t\), it is guaranteed that the distance is fixed for all states that are indeed at distance \(\leq t\). In fact a stronger result holds: in any column the distance is fixed for all states between the topmost and bottommost state with distance \(\leq t\). Note that due to the word-based computations, there will also be states whose computed distance is \(>t\). These are not guaranteed to be correct.

After a range \(R=(r_{start}, r_{end})\) of rows for a block of \(B\) columns has been computed, we determine the first row \(r’_{start} \geq r_{start}\) and last row \(r’_{end}\leq r_{end}\) that are a multiple of \(w\) and for which all computed distances in this block are at most \(t\)5, if such rows exist. (See 2.8.2 for details.) We then store these values \((r’_{start}, r’_{end})\) and the horizontal differences along row \(r’_{end}\). The next iteration for \(t’=2t\) then skips the rows in this interval, and uses the stored differences as input to compute rows \(r’_{end}\) to the new \(r_{end}\).

2.6 Traceback

The traceback stage takes as input the computed vertical differences at the end of each block of columns. We iteratively work backwards through the blocks of columns. In each step, we are given the distances \(D_i[j]\) to the states in column \(i\) (\(B|i\)) and the state \(u=\st{i+B}j\) in column \(i+B\) that is on the optimal path and has distance \(d_u\). The goal is to find an optimal path from column \(i\) to \(u\).

A naive approach is to simply recompute the entire block of columns for their entire range \(R\) while storing distances to all cells, but we introduce two faster methods.

2.6.1 Optimistic block computation

Instead of computing the full range \(R=(r_{start}, r_{end})\) for this block, a first insight is that only rows up to \(j\) are needed, since the optimal path to \(u=\st{i+B}j\) can never go below row \(j\).

Secondly, the path crosses \(B=256\) columns, and so we optimistically assume that it will be contained in rows \(j-256-64=j-320\) [CHECK] to \(j\). Thus, we first compute the states in this range of rows (rounded out to multiples of \(w\)). If the distance to \(u\) computed this way agrees with the known distance, the path must lie within these rows. Otherwise, we repeatedly try again with double the number of lanes, until success. The exponential search ensures low overhead and good average case performance.

2.6.2 Optimistic diagonal transition

A further improvement uses the diagonal transition algorithm backwards from \(u\). We simply run the unmodified algorithm on the reverse graph covering columns \(i\) to \(i+B\) and rows \(0\) to \(j\). When the distance \(d_j\) from \(u\) to a state \(\st ij\) in column \(i\) is found, we check whether \(D_i[j] + d_j = d_u\). If this is not the case, we continue until a suitable \(j\) is found. We then infer the optimal path by a traceback on the diagonal transition algorithm.

One further optimization to this is that again we can be optimistic and assume that the path will have distance at most \(d_0=TODO\), and ignore any states that are at distance \(>d_0\). If all states at distance \(\leq d_0\) have been explored without finding a match in column \(i\), \(d_0\) is doubled repeatedly until success.

Another optimization is the X-drop [TODO], meaning that all states that lag more than \(x\) columns behind the one with smallest \(j\) will be dropped.

2.7 A*

Edlib already uses a simple gap-cost heuristic that gives a lower bound on the number of insertions and deletions on a path from each state to the end. We simply replace this by the stronger heuristics introduced in A*PA. We use three variants:

  1. No heuristic. Only use the gap heuristic. No initialization needed.
  2. Seed heuristic (SH). This requires relatively simple precomputation, and little bookkeeping, but works well for low uniform error rate.
  3. Gap-chaining seed heuristic (GCSH). The strongest heuristic that requires more time to initialize and update, but is better able to penalize long indels.

The details of how these changes affect the ranges of rows being computed are in 2.8.

We make two modifications to the previous version of the A*PA algorithm.

2.7.1 Bulk-contours update

In A*PA, matches are pruned as soon as a shortest path to their start has been found. This helps to penalize states before (left of) the match. Each iteration of our new algorithm works left-to-right only, and thus pruning of matches does not affect the current iteration. Instead of pruning on the fly, we now collect all matches to be pruned at the end of each iteration, and prune them in one right-to-left sweep.

To ensure pruning is a valid optimization, we never allow the range of rows for each block to shrink after increasing \(t\).

2.7.2 Pre-pruning

Here we introduce an independent optimization that also applies to the original A*PA method.

Each of the heuristics \(h\) introduced in A*PA depends on the set of matches \(\matches\). Given that \(\matches\) contains all matches, \(h\) was shown to be an admissible [TODO] heuristic. Even after pruning some matches, \(h\) was shown to still be a lower bound on the length of a path not going through already visited states.

Now consider a situation where there are two seeds and there is an exact match \(m\) from \(u=v_s\) to \(v\) for seed \(s_0\), but going from \(v\) to the end of the next seed \(s_1\) takes cost at least \(2\) (TODO FIG). The existence of the match is a ‘promise’ that \(s_0\) can be crossed for free. In this case, this leads to a seed heuristic value of \(1\) in \(u\), namely \(0\) for \(s_0\) plus \(1\) for \(s_1\). But we already know that match \(m\) can never lead to a path of cost \(<2\) to the end of \(s_1\). Thus, we may as well ignore \(m\)! This increases the value of the seed heuristic in \(u\) to \(2\), which is indeed a lower bound on the actual distance.

More generally, consider a situation where there is a match \(m\) from \(u\) to \(v\) in seed \(s_i\), and the lowest cost path from \(v\) to the start of \(s_{i+p}\) has cost \(\geq p\). The seed heuristic penalizes the path from \(u\) (at the start of \(s_i\)) to the start of \(s_{i+p}\) by at most \(p-1\), since there are at most \(p-1\) seeds in \(\{s_{i+1}, \dots, s_{i+p-1}\}\) without a match. Since in fact we know that this path has cost at least \(p\), we can pre-prune the match \(m\) and increase the value of the heuristic while keeping it admissible.

2.8 Appendix: Range-of-rows computations

2.8.1 Computed range

Here we determine the range of rows that can possibly contain cells at distance \(\leq t\) in a block of \(B\) columns from \(i\) to \(i+B\). We assume that the heuristic \(h\) being used is consistent, i.e. that for any states \(u\preceq v\) we have \(h(u) \leq d(u,v) +h(v)\).

Let \(R=(r_{start}, r_{end})\) be the range of states in column \(i\) to which the distance has been computed. From this we can find the topmost and bottommost states \(r^t_{start}\) and \(r^t_{end}\) that are at distance \(\leq t\), see 2.8.2.

Start of range. Since row \(j=r^t_{start}\) is the first row in column \(i\) with distance \(\leq t\), this means that states in columns \(i\) to \(i+B\) at rows \(<j\) can not be at distance \(\leq t\). Thus, the first row that needs to be computed is row \(r^t_{start}\). [TODO: Add a \(+1\) to this?]

End of range. We will now determine the bottommost row \(j\) that can contain a state at distance \(\leq t\) in the block. Let \(u=\st{i}{r^t_{end}}\) be the bottommost state in column \(i\) with distance \(\leq t\). Let \(v = \st{i’}{j’}\) be a state in the current block (\(i\leq i’\leq i+B\)) that is below the diagonal of \(u\) (\(j’-i’ \geq r^t_{end}-i\)). Then, the distance to \(v\) is at least \(\g(v) \geq \g(u) + \cgap(u,v)\), and hence \[ f(v) = g(v) + h(v) \geq \g(v) + h(v) \geq \g(u) + \cgap(u,v) + h(v) =: f_l(v). \] The end of the range can now be computed by finding the bottommost state \(v\) in each column for which this lower bound \(f_l\) on \(f\) is \(\leq t\), using the following algorithm6.

Algorithm (bottommost row computation).

  1. Start with \(v = \st{i’}{j’} = u = \st{i}{r^t_{end}}\).
  2. While the below-neighbour \(v’ = \st{i’}{j’+1}\) of \(v\) has \(f_l(v’)\leq t\), increment \(j’\).
  3. Go to the next column by incrementing \(i’\) and \(j’\) by \(1\) and repeat step 2, until \(i’=i+B\).

A drawback of this approach is that \(h\) is evaluated at least once per column, which is slow in practice.

We improve this using the following lemma.

Lemma 1. When \(h\) is a consistent heuristic and \(v\preceq v’\), then \(f_l(v’) \geq f_l(v) - 2\cdot d(v, v’)\).

Proof. By consistency, \(h(v) \leq d(v, v’) + h(v’)\), so \(h(v’) \geq h(v)-d(v,v’)\). Furthermore, \(\cgap(u,v) \leq \cgap(u,v’) + \cgap(v,v’)\leq \cgap(u,v’) + d(v,v’)\), and hence \(\cgap(u,v’) \geq \cgap(u,v) - d(v,v’)\). Putting these together we obtain

\begin{align*} f_l(v’) &= \g(u) + \cgap(u,v’) + h(v’) \\ &\geq \g(u) + \cgap(u,v) - d(v,v’) + h(v) - d(v,v’) \\ &= f_l(v) - 2\cdot d(v,v’). \square % TODO \end{align*}

When we have \(f_l(v) > t+2x\), the lemma implies that \(f_l(v’)>t\) for any \(v’\) with \(d(v,v’)\leq x\). This inspires the following algorithm7, that first takes just over \(B\) steps down, and then makes jumps to the right.

Algorithm (sparse bottommost row computation).

  1. Start with \(v = \st{i’}{j’} = u+\st{0}{B+8} = \st{i}{r^t_{end} + B + 8}\).
  2. If \(f_l(v) \leq t\), increase \(j’\) (go down) by \(8\).
  3. If \(f_l(v) > t\), increase \(i’\) (go right) by \(\min(\lceil(f_l(v)-t)/2\rceil, i+B-i’)\).
  4. Repeat from step 2, until \(i’ = i+B\).
  5. While \(f_l(v) > t\), decrease \(j’\) (go up) by \(\lceil(f_l(v)-t)/2\rceil\), but do not go above the diagonal of \(u\).

The resulting \(v\) is the bottommost state in column \(i+B\) with \(f_l(v) \leq t\), and its row is the last row that will be computed.
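The five steps above can be sketched as follows, with the lower bound \(f_l\) passed in as a closure; all names are ours, and the real implementation evaluates \(f_l\) from \(\g(u)\), \(\cgap\), and \(h\):

```rust
/// Sparse scan for the bottommost row with f_l <= t in column i + b, starting
/// from u = (i, r_end): steps of 8 down, jumps of ceil((f_l - t) / 2) right
/// and (at the end) up, justified by Lemma 1.
fn sparse_bottommost_row(
    i: usize,
    r_end: usize,
    b: usize,
    t: usize,
    f_l: impl Fn(usize, usize) -> usize,
) -> usize {
    // Step 1: start just over b rows below u.
    let (mut ic, mut jc) = (i, r_end + b + 8);
    while ic < i + b {
        let f = f_l(ic, jc);
        if f <= t {
            jc += 8; // step 2: go down
        } else {
            ic += ((f - t + 1) / 2).min(i + b - ic); // step 3: jump right
        }
    }
    // Step 5: move back up, but not above the diagonal of u.
    loop {
        let f = f_l(ic, jc);
        if f <= t {
            break;
        }
        let step = ((f - t + 1) / 2).min(jc - (r_end + b));
        if step == 0 {
            break; // reached the diagonal of u
        }
        jc -= step;
    }
    jc
}
```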

TODO: Run this algorithm directly on 64-row lanes.

TODO: Can we be optimistic and bump more than \((f_l(v)-t)/2\)? the next value will also block ‘backwards’.

2.8.2 Fixed range

In a column, the fixed range is the range of rows between the topmost and bottommost states with \(f(v)\leq t\), in rows \(r’_{start}\) and \(r’_{end}\) respectively. Given a range \(R=(r_{start}, r_{end})\) of computed values, one way to find \(r’_{start}\) and \(r’_{end}\) is by simply iterating from the start/end of the range and dropping all states \(v\) with \(f(v)>t\). Like before, this has the drawback that the heuristic must be invoked many times.

Instead, we have the following lemma, somewhat analogous to Lemma 1:

Lemma 2. When \(h\) is a consistent heuristic we have \[|f(v) - f(v’)| \leq 2 d(v, v’).\]

Proof. The triangle inequality gives \(\g(v) \leq \g(v’) + d(v, v’)\), and consistency gives \(h(v) \leq h(v’) + d(v,v’)\) TODO WHAT IF \(v\) and \(v’\) ARE IN THE OPPOSITE ORIENTATION?? Expanding the definitions of \(f\), we have

\begin{align*} f(v) - f(v’) &= (g(v) + h(v)) - (g(v’) + h(v’))\\ &= (\g(v) + h(v)) - (\g(v’) + h(v’))\\ &= (\g(v) - \g(v’)) + (h(v) - h(v’))\\ &\leq d(v,v’) + d(v,v’) = 2\cdot d(v,v’). \square \end{align*}

Now we can use a similar approach as before. To find the first row \(j’\) with \(f(\st{i}{j’})\leq t\), start with \(v=\st{i’}{j’}=\st{i}{r_{start}}\), and increment \(j’\) by \(\lceil(f(v)-t)/2\rceil\) as long as \(f(v)>t\). The last row is found in the same way.
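This scan can be sketched as follows, with \(f\) as a closure (names ours); Lemma 2 guarantees that no row with \(f\leq t\) is jumped over:

```rust
/// First row j >= r_start in column i with f(i, j) <= t, jumping
/// ceil((f - t) / 2) rows at a time.
fn first_fixed_row(
    i: usize,
    r_start: usize,
    t: usize,
    f: impl Fn(usize, usize) -> usize,
) -> usize {
    let mut j = r_start;
    while f(i, j) > t {
        // Safe: rows within distance x of j all have f > t when f(i, j) > t + 2x.
        j += (f(i, j) - t + 1) / 2;
    }
    j
}
```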

TODO: Run this algorithm directly on 64-row lanes.

3 Results


  • Edlib
  • WFA
  • A*PA
  • A*PA2 no h
  • A*PA2 SH+pruning
  • A*PA2 GCSH+pruning


  • synthetic data
  • human ONT reads
  • human ONT reads with genetic variation


  • Find threshold where heuristics become worth the overhead
  • Show benefit of each of the optimizations
  • Show sensitivity to parameter tuning

4 Conclusion

4.1 Summary

4.2 Limitations

  • Initialization takes time
  • WFA is better when edit distance is very low.

4.3 Future work

  • Pre-pruning for seed&extend methods?
  • Semi-global alignment
  • Affine alignment
  • Local doubling


I am grateful to Daniel Liu for regular discussions and for suggesting additional papers that have been added to the introduction.


Alpern, Bowen, Larry Carter, and Kang Su Gatlin. 1995. “Microparallelism and High-Performance Protein Matching.” Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM) - Supercomputing ’95. https://doi.org/10.1145/224170.224222.
Daily, Jeff. 2016. “Parasail: SIMD C Library for Global, Semi-Global, and Local Pairwise Sequence Alignments.” BMC Bioinformatics 17 (1). https://doi.org/10.1186/s12859-016-0930-z.
Dijkstra, E. W. 1959. “A Note on Two Problems in Connexion with Graphs.” Numerische Mathematik 1 (1): 269–71. https://doi.org/10.1007/bf01386390.
Döring, Andreas, David Weese, Tobias Rausch, and Knut Reinert. 2008. “SeqAn – an Efficient, Generic C++ Library for Sequence Analysis.” BMC Bioinformatics 9 (1). https://doi.org/10.1186/1471-2105-9-11.
Farrar, Michael. 2006. “Striped Smith–Waterman Speeds Database Searches Six Times over Other SIMD Implementations.” Bioinformatics 23 (2): 156–61. https://doi.org/10.1093/bioinformatics/btl582.
Fickett, James W. 1984. “Fast Optimal Alignment.” Nucleic Acids Research 12 (1Part1): 175–79. https://doi.org/10.1093/nar/12.1part1.175.
Gotoh, Osamu. 1982. “An Improved Algorithm for Matching Biological Sequences.” Journal of Molecular Biology 162 (3): 705–8. https://doi.org/10.1016/0022-2836(82)90398-9.
Groot Koerkamp, Ragnar, and Pesho Ivanov. 2022. “Exact Global Alignment Using A* with Seed Heuristic and Match Pruning,” September. https://doi.org/10.1101/2022.09.19.508631.
Hadlock, Frank O. 1988. “Minimum Detour Methods for String or Sequence Comparison.” Congressus Numerantium 61: 263–74.
Hart, Peter, Nils Nilsson, and Bertram Raphael. 1968. “A Formal Basis for the Heuristic Determination of Minimum Cost Paths.” IEEE Transactions on Systems Science and Cybernetics 4 (2): 100–107. https://doi.org/10.1109/tssc.1968.300136.
Hughey, Richard. 1996. “Parallel Hardware for Sequence Comparison and Alignment.” Bioinformatics 12 (6): 473–79. https://doi.org/10.1093/bioinformatics/12.6.473.
Ivanov, Pesho, Benjamin Bichsel, and Martin Vechev. 2021. “Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds.” bioRxiv. https://doi.org/10.1101/2021.11.05.467453.
Kruskal, Joseph B. 1983. “An Overview of Sequence Comparison: Time Warps, String Edits, and Macromolecules.” SIAM Review 25 (2): 201–37. https://doi.org/10.1137/1025045.
Levenshtein, Vladimir I. 1965. “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals.” Soviet Physics. Doklady 10: 707–10. https://api.semanticscholar.org/CorpusID:60827152.
Li, Heng. 2018. “Minimap2: Pairwise Alignment for Nucleotide Sequences.” Edited by Inanc Birol. Bioinformatics 34 (18): 3094–3100. https://doi.org/10.1093/bioinformatics/bty191.
Liu, Daniel, and Martin Steinegger. 2023. “Block Aligner: An Adaptive SIMD-Accelerated Aligner for Sequences and Position-Specific Scoring Matrices.” Bioinformatics, btad487. https://doi.org/10.1093/bioinformatics/btad487.
Loving, Joshua, Yozen Hernandez, and Gary Benson. 2014. “BitPAl: A Bit-Parallel, General Integer-Scoring Sequence Alignment Algorithm.” Bioinformatics 30 (22): 3166–73. https://doi.org/10.1093/bioinformatics/btu507.
Marco-Sola, Santiago, Jordan M Eizenga, Andrea Guarracino, Benedict Paten, Erik Garrison, and Miquel Moreto. 2022. “Optimal Gap-Affine Alignment in O(S) Space,” April. https://doi.org/10.1101/2022.04.14.488380.
Marco-Sola, Santiago, Juan Carlos Moure, Miquel Moreto, and Antonio Espinosa. 2020. “Fast Gap-Affine Pairwise Alignment Using the Wavefront Algorithm.” Bioinformatics 37 (4): 456–63. https://doi.org/10.1093/bioinformatics/btaa777.
Myers, Gene. 1986. “An $O(ND)$ Difference Algorithm and Its Variations.” Algorithmica 1 (1–4): 251–66. https://doi.org/10.1007/bf01840446.
———. 1999. “A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming.” Journal of the ACM 46 (3): 395–415. https://doi.org/10.1145/316542.316550.
Navarro, Gonzalo. 2001. “A Guided Tour to Approximate String Matching.” ACM Computing Surveys 33 (1): 31–88. https://doi.org/10.1145/375360.375365.
Needleman, Saul B., and Christian D. Wunsch. 1970. “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.” Journal of Molecular Biology 48 (3): 443–53. https://doi.org/10.1016/0022-2836(70)90057-4.
Papamichail, Dimitris, and Georgios Papamichail. 2009. “Improved Algorithms for Approximate String Matching (Extended Abstract).” BMC Bioinformatics 10 (S1). https://doi.org/10.1186/1471-2105-10-s1-s10.
Rognes, Torbjørn. 2011. “Faster Smith-Waterman Database Searches with Inter-Sequence SIMD Parallelisation.” BMC Bioinformatics 12 (1). https://doi.org/10.1186/1471-2105-12-221.
Rognes, Torbjørn, and Erling Seeberg. 2000. “Six-Fold Speed-up of Smith–Waterman Sequence Database Searches Using Parallel Processing on Common Microprocessors.” Bioinformatics 16 (8): 699–706. https://doi.org/10.1093/bioinformatics/16.8.699.
Rubin, F. 1974. “The Lee Path Connection Algorithm.” IEEE Transactions on Computers C–23 (9): 907–14. https://doi.org/10.1109/t-c.1974.224054.
Sankoff, David. 1972. “Matching Sequences under Deletion/Insertion Constraints.” Proceedings of the National Academy of Sciences 69 (1): 4–6. https://doi.org/10.1073/pnas.69.1.4.
Sellers, Peter H. 1974. “On the Theory and Computation of Evolutionary Distances.” SIAM Journal on Applied Mathematics 26 (4): 787–93. https://doi.org/10.1137/0126070.
Smith, T.F., and M.S. Waterman. 1981. “Identification of Common Molecular Subsequences.” Journal of Molecular Biology 147 (1): 195–97. https://doi.org/10.1016/0022-2836(81)90087-5.
Spouge, John L. 1989. “Speeding up Dynamic Programming Algorithms for Finding Optimal Lattice Paths.” SIAM Journal on Applied Mathematics 49 (5): 1552–66. https://doi.org/10.1137/0149094.
———. 1991. “Fast Optimal Alignment.” Bioinformatics 7 (1): 1–7. https://doi.org/10.1093/bioinformatics/7.1.1.
Suzuki, Hajime, and Masahiro Kasahara. 2018. “Introducing Difference Recurrence Relations for Faster Semi-Global Alignment of Long Sequences.” BMC Bioinformatics 19 (S1). https://doi.org/10.1186/s12859-018-2014-8.
Szalkowski, Adam, Christian Ledergerber, Philipp Krähenbühl, and Christophe Dessimoz. 2008. “SWPS3 – Fast Multi-Threaded Vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2.” BMC Research Notes 1 (1): 107. https://doi.org/10.1186/1756-0500-1-107.
Ukkonen, Esko. 1985. “Algorithms for Approximate String Matching.” Information and Control 64 (1–3): 100–118. https://doi.org/10.1016/s0019-9958(85)80046-2.
Vintsyuk, T. K. 1968. “Speech Discrimination by Dynamic Programming.” Cybernetics 4 (1): 52–57. https://doi.org/10.1007/bf01074755.
Wagner, Robert A., and Michael J. Fischer. 1974. “The String-to-String Correction Problem.” Journal of the ACM 21 (1): 168–73. https://doi.org/10.1145/321796.321811.
Wilbur, W J, and D J Lipman. 1983. “Rapid Similarity Searches of Nucleic Acid and Protein Data Banks.” Proceedings of the National Academy of Sciences 80 (3): 726–30. https://doi.org/10.1073/pnas.80.3.726.
Wozniak, A. 1997. “Using Video-Oriented Instructions to Speed up Sequence Comparison.” Bioinformatics 13 (2): 145–50. https://doi.org/10.1093/bioinformatics/13.2.145.
Wu, Sun, Udi Manber, Gene Myers, and Webb Miller. 1990. “An O(NP) Sequence Comparison Algorithm.” Information Processing Letters 35 (6): 317–23. https://doi.org/10.1016/0020-0190(90)90035-v.
Zhao, Mengyao, Wan-Ping Lee, Erik P. Garrison, and Gabor T. Marth. 2013. “SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications.” Edited by Leonardo Mariño-Ramírez. PLoS One 8 (12): e82138. https://doi.org/10.1371/journal.pone.0082138.
Šošić, Martin, and Mile Šikić. 2017. “Edlib: A C/C++ Library for Fast, Exact Sequence Alignment Using Edit Distance.” Edited by John Hancock. Bioinformatics 33 (9): 1394–95. https://doi.org/10.1093/bioinformatics/btw753.

  1. Although Ukkonen didn’t realize this faster runtime and only gave a bound of \(O(nm \log (nm))\). ↩︎

  2. Even sparser memory usage is possible by only storing differences every \(S=s\cdot B\) columns, but in practice this does not matter much. ↩︎

  3. See https://github.com/Martinsos/edlib/issues/5 ↩︎

  4. A* with a consistent heuristic. ↩︎

  5. More precisely, such that in each column there is a state of distance \(\leq t\) above (below) with distance \(\leq t\). ↩︎

  6. Bound checks omitted. ↩︎

  7. Bound checks omitted. ↩︎