[WIP] 10x faster exact global alignment using A*PA-v2

\begin{equation*} \newcommand{\g}{g^*} \newcommand{\h}{h^*} \newcommand{\cgap}{c_{\texttt{gap}}} \end{equation*}



This paper improves Edlib and A*PA by

  • reducing overhead by using larger block sizes,
  • avoiding recomputing states where possible,
  • using SIMD with a novel bit-encoding,
  • using a novel optimistic diagonal-transition based traceback,
  • applying and improving the A* heuristic of A*PA,

leading to a \(10\times\) overall speedup on the alignment of $500$kbp ONT reads.

Introduction and previous work

The problem of global pairwise alignment is to find the shortest sequence of edit operations (insertions, deletions, substitutions) to convert a string \(A\) into a second string \(B\) (Needleman and Wunsch 1970; Vintsyuk 1968), where the number of such operations is called the Levenshtein distance or edit distance (Levenshtein 1965; Wagner and Fischer 1974).

Per tradition, as Rognes and Seeberg (2000) write:

The rapidly increasing amounts of genetic-sequence information available represent a constant challenge to developers of hardware and software database searching and handling.

In this work we introduce a new exact algorithm for global pairwise alignment building on the recently introduced A* heuristics of A*PA (Groot Koerkamp and Ivanov 2022), but instead of running A* we use a much more efficient DP-based algorithm. Thus, as Fickett (1984) states,

at present one must choose between an algorithm which gives the best alignment but is expensive, and an algorithms which is fast but may not give the best alignment. In this paper we narrow the gap between these choices by showing how to get the optimal alignment much more quickly than before.

In the following, we give a brief recap of developments that this work builds on, in chronological order per approach. See also e.g. the reviews by Kruskal (1983) and Navarro (2001).

Needleman-Wunsch. This problem has classically been approached as a dynamic programming (DP) problem. For string lengths \(n\) and \(m\), Needleman and Wunsch (1970) introduced the first \(O(n^2m)\) algorithm. This was improved to what is now known as the \(O(nm)\) Needleman-Wunsch algorithm by Sellers (1974) and Wagner and Fischer (1974), building on the quadratic algorithm for longest common subsequence by Sankoff (1972).

Graph algorithms. It was already realized early on that this problem corresponds to finding the shortest path from \(v_s\) to \(v_t\) in the alignment graph (also called edit graph or dependency graph) (Vintsyuk 1968; Ukkonen 1985). Both Ukkonen (1985) and Myers (1986) remarked that this can be solved using Dijkstra’s algorithm (Dijkstra 1959), taking \(O(ns)\) time1, where \(s\) is the edit distance between the two strings. However, Myers (1986) observes that

the resulting algorithm involves a relatively complex discrete priority queue and this queue may contain as many as O(ND) entries even in the case where just the length of the […] shortest edit script is being computed.

Hadlock (1988) realized that Dijkstra’s algorithm can be improved upon by using A* (Hart, Nilsson, and Raphael 1968), a more informed algorithm that uses a heuristic function \(h(u)\) that gives a lower bound on the edit distance \(\h(u)\) between the suffixes following DP state \(u\). He uses two heuristics, the widely used gap cost heuristic \(h(u)=\cgap(u, v_t)\) (Ukkonen 1985; Hadlock 1988; Wu et al. 1990; Spouge 1989, 1991; Papamichail and Papamichail 2009) that simply uses the difference between the lengths of the suffixes as lower bound, and a new improved heuristic based on character frequencies in the two suffixes. A*PA (Groot Koerkamp and Ivanov 2022) applies the gap-chaining seed heuristic with pruning (Ivanov, Bichsel, and Vechev 2021) to obtain near-linear runtime when errors are uniform random. Nevertheless, as Spouge (1991) states:

Many algorithms for finding optimal paths in non-lattice graphs also exist (Dijkstra 1959; Hart, Nilsson, and Raphael 1968; Rubin 1974), but algorithms exploiting the lattice structure of an alignment graph are usually faster. In molecular biology, speed is important, …

and further (Spouge 1989):

This suggests a radical approach to A* search complexities: dispense with the lists [of open states] if there is a natural order for vertex expansion.

Indeed, a lot of work has gone into speeding up DP-based algorithms.

Computational volumes. Wilbur and Lipman (1983) is (to our knowledge) the first paper that speeds up the \(O(nm)\) DP algorithm, by only considering states near diagonals with many k-mer matches, but at the cost of giving up the exactness of the method. Fickett (1984) notes that for \(t\geq s\) only those DP-states with cost \(\g(u)\) at most \(t\) need to be computed:

However it is possible to fill the matrix in many different orders, the only restriction being that the calculation of any given \(d_{ij}\) depends on already having the values of the three element up and to the left of it.


But the only alignments of subsequences which are relevant are ones at least as good (distance at least as small) as the overall one. I.e. one really only needs those \(d_{ij}\) which are below a fixed bound.

This only requires \(O(nt)\) time, which is fast when \(t\) is an accurate bound on the distance \(s\), which for example can be set as a known upper bound for the data being aligned, or as the length of a known suboptimal alignment. When \(t=t_0\) turns out too small a larger new bound \(t_1\) can be chosen, and only states with distance in between \(t_0\) and \(t_1\) have to be computed. This is implemented by keeping for each row the index of the first and last state with value at most \(t_0\), and skipping over already computed states. In the limit, one could choose \(t_i = i\) and compute states by increasing distance, closely mirroring Dijkstra’s algorithm.

Ukkonen (1985) introduces a very similar idea, statically bounding the computation to only those states that can be on a path of length at most \(t\) through the graph. When the sequences have the same length (\(n=m\)), this only considers diagonals \(-t/2\) to \(t/2\), where diagonal \(0\) is the main diagonal of the DP-matrix.

On top of this, Ukkonen (1985) introduces band doubling: \(t_0=1\) can be doubled (\(t_i = 2^i\)) until \(t_k \geq s > t_{k-1}\). Since each test requires \(O(n \cdot t_i)\) time, the total time is

\begin{equation} n\cdot t_0 + \dots + n\cdot t_k = n\cdot (2^0 + \dots + 2^k) < n\cdot 2^{k+1} = 4\cdot n\cdot 2^{k-1} < 4\cdot n\cdot s = O(ns). \end{equation}

Note that this method does not (and indeed can not) reuse values from previous iterations, resulting in roughly a factor \(2\) overhead.

Spouge (1989) unifies the methods of Fickett (1984) and Ukkonen (1985), and generalizes them to accept any A* heuristic. In particular, a computational volume is a subgraph of the alignment graph that contains every shortest path. Given a bound \(t\geq s\), some examples of computational volumes are:

  1. \(\{u\}\), the entire \((n+1)\times (m+1)\) graph.
  2. \(\{u: \g(u) + \h(u)=s\}\), the vertices on a shortest paths.
  3. \(\{u: \g(u)\leq t\}\), the states at distance \(\leq t\) (Fickett 1984).
  4. \(\{u: \cgap(v_s, u) + \cgap(u, v_t) \leq t\}\) the static set of states possibly on a path of length \(\leq t\) (Ukkonen 1985).
  5. \(\{u: \g(u) + \cgap(u, v_t) \leq t\}\), as used by Edlib (Spouge 1991; Šošić and Šikić 2017).
  6. \(\{u: \g(u) + h(u) \leq t\}\), for any admissible heuristic \(h\).

As Spouge (1989) notes:

The order of computation (row major, column major or antidiagonal) is just a minor detail in most algorithms.

But this is exactly what was investigated a lot in the search for faster implementations.

Implementation and parallelism. Since roughly \(1995\), the focus shifted from reducing the number of computed states to computing states faster through advancements in implementation and hardware (SIMD, GPUs). These speedups are often applied to the Smith-Waterman-(Gotoh) (Smith and Waterman 1981; Gotoh 1982) algorithm for (affine-cost) local alignment, where algorithmic improvements beyond \(\Theta(nm)\) are unknown.

The first technique in this direction is microparallelism (Alpern, Carter, and Gatlin 1995), where each (64 bit) computer word is divided into multiple (e.g. 16 bit) parts, and word-size operations modifying all (4) parts in parallel. Alpern, Carter, and Gatlin (1995) applied this with inter-sequence parallelism to align a given query to \(4\) reference sequences in parallel (see also Rognes (2011)). Hughey (1996) was the first to note that antidiagonals of the DP matrix can be computed in parallel, and Wozniak (1997) applied SIMD for this purpose.

Rognes and Seeberg (2000) split 64bit words into 8 8-bit values, capping all computations at \(255\) but doubling the speed. Further, it uses vertical instead of antidiagonal vectors.

The advantage of this approach is the much-simplified and faster loading of the vector of substitution scores from memory. The disadvantage is that data dependencies within the vector must be handled.

In particular, Rognes and Seeberg (2000) introduce the query profile: Instead of looking up the substitution score \(S[A[i]][B[j]]\) for the \(i\)’th and \(j\)’th character of \(A\) and \(B\) respectively, it is more efficient to precompute the profile \(P[c][j] := S[c][B[j]]\) for each character \(c\) in the alphabet. Then, adjacent scores are simply found as adjacent values \(P[A[i]][j \dots j’]\).

Similarly, Myers (1999) introduces a bitpacking algorithm specifically for edit distance that stores the differences between adjacent DP-states bit-encoded in two 64-words \(P\) and \(M\), with \(P_i\) and \(M_i\) indicating whether the \(i\)’th difference is \(+1\) resp. \(-1\). It then gives an efficient algorithm using bitwise operations on these words.


  • (Farrar 2006) Farrar’s striped; uses query profile; conditional prefix scan is moved outside inner loop. \(2-8\times\) faster than Wozniak and Rognes.
  • Wu Manber 1992
  • Baeza-Yates Gonnet 1992
  • Hyyro and Navarro, 2005; Hyyro et al., 2005
  • Benson 2013
  • navarro 2004
  • bergeron hamel 2002

Tools. There are multiple semi-global aligners that implement \(O(nm)\) global alignment using numerous of the aforementioned implementation techniques, such as SeqAn (Döring et al. 2008), Parasail (Daily 2016), Opal (https://github.com/martinsos/opal), libssa (https://github.com/RonnySoak/libssa), SWIPE (Rognes 2011), SWPS3 (Szalkowski et al. 2008), SSW library (Zhao et al. 2013) (link), and KSW2 (Li 2018).

Dedicated global alignment implementations are much rarer. Edlib (Šošić and Šikić 2017) implements the band doubling of Ukkonen (1985) using the \(\g(u)+\cgap(u, v_t)\leq t\) computational volume of Spouge (1991) and the bitpacking of Myers (1999). WFA and BiWFA (Marco-Sola et al. 2020, 2022) implement the \(O(n+s^2)\) expected time diagonal transition algorithm (Ukkonen 1985; Myers 1986). Block aligner (Liu and Steinegger 2023) is an approximate aligner that can handle position-specific scoring matrices whose main novelty is to divide the computation into blocks. Lastly, A*PA (Groot Koerkamp and Ivanov 2022) directly implements A* on the alignment graph using the gap-chaining seed heuristic.



In A*PA-v2, we combine many existing techniques and introduce a number of new techniques to obtain a \(10\times\) speedup over existing single-threaded aligners. As a starting point, we take the band doubling algorithm as efficiently implemented by Edlib (Šošić and Šikić 2017) using bitpacking (Myers 1999). From there, we make a number of improvements.

  1. Block-based computation. We first observe that Edlib computes one column of the DP matrix at a time, and for each column decides which range of cells to compute. We significantly reduce this overhead by processing blocks of \(256\) columns at a time, similar to Block aligner (Liu and Steinegger 2023). Correspondingly, we only store cells of the DP-matrix at block boundaries.
  2. Incremental doubling. We further note that both the original band doubling method of Ukkonen (1985) and Edlib recompute states in the doubled region. Reusing the theory behind the A* algorithm, we prove a novel theorem stating that some of this recomputation can be avoided.
  3. SIMD. We speed up the implementation using SIMD to compute each block, allowing the processing of \(4\) computer words in parallel, and introduce a novel bit-encoding of the input sequence to make this even faster.
  4. Traceback. For the traceback, we optimistically use the diagonal transition method within each block with a strong Z-drop [TODO] heuristic, falling back to a full recomputation of the block when needed.
  5. A* . We improve the A* seed heuristics of Groot Koerkamp and Ivanov (2022) in two ways. First, instead of updating contours each time a match is pruned, we now do this in batches once the band is doubled. Secondly, we introduce a new pre-pruning technique that discards most of the spurious (off-path) matches ahead of time.


First, we reduce the amount of meta overhead in Edlib. Then, we speed up the implementation further. At this point, we should simply have a more efficient reimplementation that roughly mimicks Edlib.

On top of that, we can apply the A*PA heuristics for further speed gains on large/complex alignments, at the cost of larger precomputation time to build the heuristic.


  • Blocks
    • Param: block size
  • Sparse heuristic
  • Sparse memory
  • Param: sparsity, same as block size

Incremental doubling

  • States with \(f \leq f_{max}\) have fixed values (assuming the heuristic is consistent).
  • We can store deltas around the edge of this region and incrementally start the computation from there.

SIMD and bitpacking

  • Diagonally strided 4-lane SIMD
  • 2 parallel executions for higher instructions-per-cycle, so 8-lanes (512 states) total.
  • New bitpacking method than
  • TODO: Tried BitPAl’s bitpacking method which is one less than Myers 99’s, but without success so far.


  • DT Trace
    • Param: x-drop


  • pre-pruning
    • Param: length of lookahead
  • bulk contours update



  • Edlib
  • WFA
  • A*PA
  • A*PA-v2 without heuristics
  • A*PA-v2 with heuristics


  • synthetic data
  • human ONT reads
  • human ONT reads with genetic variation


  • Find threshold where heuristics become worth the overhead
  • Show benefit of each of the optimizations
  • Show sensitivity to parameter tuning


I am grateful to Daniel Liu for regular discussions, and suggesting additional papers that have been added to the introduction.


Alpern, Bowen, Larry Carter, and Kang Su Gatlin. 1995. “Microparallelism and High-Performance Protein Matching.” Proceedings of the 1995 Acm/Ieee Conference on Supercomputing (Cdrom) - Supercomputing ’95. https://doi.org/10.1145/224170.224222.
Daily, Jeff. 2016. “Parasail: Simd c Library for Global, Semi-Global, and Local Pairwise Sequence Alignments.” Bmc Bioinformatics 17 (1). https://doi.org/10.1186/s12859-016-0930-z.
Dijkstra, E. W. 1959. “A Note on Two Problems in Connexion with Graphs.” Numerische Mathematik 1 (1): 269–71. https://doi.org/10.1007/bf01386390.
Döring, Andreas, David Weese, Tobias Rausch, and Knut Reinert. 2008. “Seqan an Efficient, Generic c++ Library for Sequence Analysis.” Bmc Bioinformatics 9 (1). https://doi.org/10.1186/1471-2105-9-11.
Farrar, Michael. 2006. “Striped Smith–Waterman Speeds Database Searches Six Times over Other Simd Implementations.” Bioinformatics 23 (2): 156–61. https://doi.org/10.1093/bioinformatics/btl582.
Fickett, James W. 1984. “Fast Optimal Alignment.” Nucleic Acids Research 12 (1Part1): 175–79. https://doi.org/10.1093/nar/12.1part1.175.
Gotoh, Osamu. 1982. “An Improved Algorithm for Matching Biological Sequences.” Journal of Molecular Biology 162 (3): 705–8. https://doi.org/10.1016/0022-2836(82)90398-9.
Groot Koerkamp, Ragnar, and Pesho Ivanov. 2022. “Exact Global Alignment Using A* with Seed Heuristic and Match Pruning,” September. https://doi.org/10.1101/2022.09.19.508631.
Hadlock, Frank O. 1988. “Minimum Detour Methods for String or Sequence Comparison.” Congressus Numerantium 61: 263–74.
Hart, Peter, Nils Nilsson, and Bertram Raphael. 1968. “A Formal Basis for the Heuristic Determination of Minimum Cost Paths.” Ieee Transactions on Systems Science and Cybernetics 4 (2): 100–107. https://doi.org/10.1109/tssc.1968.300136.
Hughey, Richard. 1996. “Parallel Hardware for Sequence Comparison and Alignment.” Bioinformatics 12 (6): 473–79. https://doi.org/10.1093/bioinformatics/12.6.473.
Ivanov, Pesho, Benjamin Bichsel, and Martin Vechev. 2021. “Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds.” Biorxiv. https://doi.org/10.1101/2021.11.05.467453.
Kruskal, Joseph B. 1983. “An Overview of Sequence Comparison: Time Warps, String Edits, and Macromolecules.” Siam Review 25 (2): 201–37. https://doi.org/10.1137/1025045.
Levenshtein, Vladimir I. 1965. “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals.” Soviet Physics. Doklady 10: 707–10. https://api.semanticscholar.org/CorpusID:60827152.
Li, Heng. 2018. “Minimap2: Pairwise Alignment for Nucleotide Sequences.” Edited by Inanc Birol. Bioinformatics 34 (18): 3094–3100. https://doi.org/10.1093/bioinformatics/bty191.
Liu, Daniel, and Martin Steinegger. 2023. “Block Aligner: an adaptive SIMD-accelerated aligner for sequences and position-specific scoring matrices.” Bioinformatics, btad487. https://doi.org/10.1093/bioinformatics/btad487.
Loving, Joshua, Yozen Hernandez, and Gary Benson. 2014. “Bitpal: A Bit-Parallel, General Integer-Scoring Sequence Alignment Algorithm.” Bioinformatics 30 (22): 3166–73. https://doi.org/10.1093/bioinformatics/btu507.
Marco-Sola, Santiago, Jordan M Eizenga, Andrea Guarracino, Benedict Paten, Erik Garrison, and Miquel Moreto. 2022. “Optimal Gap-Affine Alignment in O(S) Space,” April. https://doi.org/10.1101/2022.04.14.488380.
Marco-Sola, Santiago, Juan Carlos Moure, Miquel Moreto, and Antonio Espinosa. 2020. “Fast gap-affine pairwise alignment using the wavefront algorithm.” Bioinformatics 37 (4): 456–63. https://doi.org/10.1093/bioinformatics/btaa777.
Myers, Gene. 1986. “An $O(ND)$ Difference Algorithm and Its Variations.” Algorithmica 1 (1–4): 251–66. https://doi.org/10.1007/bf01840446.
———. 1999. “A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming.” Journal of the Acm 46 (3): 395–415. https://doi.org/10.1145/316542.316550.
Navarro, Gonzalo. 2001. “A Guided Tour to Approximate String Matching.” Acm Computing Surveys 33 (1): 31–88. https://doi.org/10.1145/375360.375365.
Needleman, Saul B., and Christian D. Wunsch. 1970. “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.” Journal of Molecular Biology 48 (3): 443–53. https://doi.org/10.1016/0022-2836(70)90057-4.
Papamichail, Dimitris, and Georgios Papamichail. 2009. “Improved Algorithms for Approximate String Matching (Extended Abstract).” Bmc Bioinformatics 10 (S1). https://doi.org/10.1186/1471-2105-10-s1-s10.
Rognes, Torbjørn. 2011. “Faster Smith-Waterman Database Searches with Inter-Sequence Simd Parallelisation.” Bmc Bioinformatics 12 (1). https://doi.org/10.1186/1471-2105-12-221.
Rognes, Torbjørn, and Erling Seeberg. 2000. “Six-Fold Speed-up of Smith–Waterman Sequence Database Searches Using Parallel Processing on Common Microprocessors.” Bioinformatics 16 (8): 699–706. https://doi.org/10.1093/bioinformatics/16.8.699.
Rubin, F. 1974. “The Lee Path Connection Algorithm.” Ieee Transactions on Computers C–23 (9): 907–14. https://doi.org/10.1109/t-c.1974.224054.
Sankoff, David. 1972. “Matching Sequences under Deletion/Insertion Constraints.” Proceedings of the National Academy of Sciences 69 (1): 4–6. https://doi.org/10.1073/pnas.69.1.4.
Sellers, Peter H. 1974. “On the Theory and Computation of Evolutionary Distances.” Siam Journal on Applied Mathematics 26 (4): 787–93. https://doi.org/10.1137/0126070.
Smith, T.F., and M.S. Waterman. 1981. “Identification of Common Molecular Subsequences.” Journal of Molecular Biology 147 (1): 195–97. https://doi.org/10.1016/0022-2836(81)90087-5.
Spouge, John L. 1989. “Speeding up Dynamic Programming Algorithms for Finding Optimal Lattice Paths.” Siam Journal on Applied Mathematics 49 (5): 1552–66. https://doi.org/10.1137/0149094.
———. 1991. “Fast Optimal Alignment.” Bioinformatics 7 (1): 1–7. https://doi.org/10.1093/bioinformatics/7.1.1.
Suzuki, Hajime, and Masahiro Kasahara. 2018. “Introducing Difference Recurrence Relations for Faster Semi-Global Alignment of Long Sequences.” BMC Bioinformatics 19 (S1). https://doi.org/10.1186/s12859-018-2014-8.
Szalkowski, Adam, Christian Ledergerber, Philipp Krähenbühl, and Christophe Dessimoz. 2008. “Swps3 – Fast Multi-Threaded Vectorized Smith-Waterman for Ibm Cell/B.E. And ×86/Sse2.” Bmc Research Notes 1 (1): 107. https://doi.org/10.1186/1756-0500-1-107.
Ukkonen, Esko. 1985. “Algorithms for Approximate String Matching.” Information and Control 64 (1–3): 100–118. https://doi.org/10.1016/s0019-9958(85)80046-2.
Vintsyuk, T. K. 1968. “Speech Discrimination by Dynamic Programming.” Cybernetics 4 (1): 52–57. https://doi.org/10.1007/bf01074755.
Wagner, Robert A., and Michael J. Fischer. 1974. “The String-to-String Correction Problem.” Journal of the Acm 21 (1): 168–73. https://doi.org/10.1145/321796.321811.
Wilbur, W J, and D J Lipman. 1983. “Rapid Similarity Searches of Nucleic Acid and Protein Data Banks.” Proceedings of the National Academy of Sciences 80 (3): 726–30. https://doi.org/10.1073/pnas.80.3.726.
Wozniak, A. 1997. “Using Video-Oriented Instructions to Speed up Sequence Comparison.” Bioinformatics 13 (2): 145–50. https://doi.org/10.1093/bioinformatics/13.2.145.
Wu, Sun, Udi Manber, Gene Myers, and Webb Miller. 1990. “An O(Np) Sequence Comparison Algorithm.” Information Processing Letters 35 (6): 317–23. https://doi.org/10.1016/0020-0190(90)90035-v.
Zhao, Mengyao, Wan-Ping Lee, Erik P. Garrison, and Gabor T. Marth. 2013. “Ssw Library: An Simd Smith-Waterman c/c++ Library for Use in Genomic Applications.” Edited by Leonardo Mariño-Ramírez. Plos One 8 (12): e82138. https://doi.org/10.1371/journal.pone.0082138.
Šošić, Martin, and Mile Šikić. 2017. “Edlib: A c/c++ Library for Fast, Exact Sequence Alignment Using Edit Distance.” Edited by John Hancock. Bioinformatics 33 (9): 1394–95. https://doi.org/10.1093/bioinformatics/btw753.

  1. Although Ukkonen didn’t realize this faster runtime and only gave a bound of \(O(nm \log (nm))\). ↩︎