\begin{equation*} \newcommand{\g}{g^*} \newcommand{\h}{h^*} \newcommand{\cgap}{c_{\texttt{gap}}} \end{equation*}

## Abstract

This paper improves Edlib and A*PA by

- reducing overhead by using larger block sizes,
- avoiding recomputing states where possible,
- using SIMD with a novel bit-encoding,
- using a novel *optimistic* diagonal-transition-based traceback,
- applying and improving the A* heuristic of A*PA,

leading to a \(10\times\) overall speedup on the alignment of \(500\) kbp ONT reads.

## Introduction and previous work

The problem of *global pairwise alignment* is to find the shortest sequence of
edit operations (insertions, deletions, substitutions) to convert a string \(A\)
into a second string \(B\) (Needleman and Wunsch 1970; Vintsyuk 1968), where the number of such
operations is called the *Levenshtein distance* or *edit distance*
(Levenshtein 1965; Wagner and Fischer 1974).

Per tradition, as Rognes and Seeberg (2000) write:

> The rapidly increasing amounts of genetic-sequence information available represent a constant challenge to developers of hardware and software database searching and handling.

In this work we introduce a new exact algorithm for global pairwise alignment building on the recently introduced A* heuristics of A*PA (Groot Koerkamp and Ivanov 2022), but instead of running A* we use a much more efficient DP-based algorithm. Thus, as Fickett (1984) states,

> at present one must choose between an algorithm which gives the best alignment but is expensive, and an algorithm which is fast but may not give the best alignment. In this paper we narrow the gap between these choices by showing how to get the optimal alignment much more quickly than before.

In the following, we give a brief recap of developments that this work builds on, in chronological order per approach. See also e.g. the reviews by Kruskal (1983) and Navarro (2001).

**Needleman-Wunsch.** This problem has classically been approached as a dynamic
programming (DP) problem. For string lengths \(n\) and \(m\), Needleman and Wunsch (1970)
introduced the first \(O(n^2m)\) algorithm. This was improved to what is now
known as the \(O(nm)\) *Needleman-Wunsch algorithm* by Sellers (1974) and
Wagner and Fischer (1974), building on the quadratic algorithm for *longest common
subsequence* by Sankoff (1972).
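For concreteness, the quadratic algorithm can be sketched in a few lines (a minimal illustration for unit costs, not any of the cited implementations; the function name is ours):

```python
def edit_distance(a: str, b: str) -> int:
    """Full O(nm) Needleman-Wunsch DP for unit-cost edit distance."""
    n, m = len(a), len(b)
    prev = list(range(m + 1))  # row 0: distance from "" to b[:j]
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j] + 1,                           # delete a[i-1]
                         cur[j - 1] + 1,                        # insert b[j-1]
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # (mis)match
        prev = cur
    return prev[m]
```

Each cell depends only on its left, upper, and upper-left neighbors; this freedom in the order of evaluation is exactly what the approaches below exploit.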

**Graph algorithms.** It was already realized early on that this problem
corresponds to finding the shortest path from \(v_s\) to \(v_t\) in the *alignment
graph* (also called *edit graph* or *dependency graph*)
(Vintsyuk 1968; Ukkonen 1985). Both Ukkonen (1985) and Myers (1986)
remarked that this can be solved using Dijkstra’s algorithm (Dijkstra 1959),
taking \(O(ns)\) time[^1], where \(s\) is the edit distance between
the two strings. However, Myers (1986) observes that

> the resulting algorithm involves a relatively complex discrete priority queue and this queue may contain as many as O(ND) entries even in the case where just the length of the […] shortest edit script is being computed.

Hadlock (1988) realized that Dijkstra’s algorithm can be improved
upon by using A* (Hart, Nilsson, and Raphael 1968), a more *informed* algorithm that uses a
*heuristic* function \(h(u)\) that gives a lower bound on the edit distance
\(\h(u)\) between the suffixes following DP state \(u\). He uses two heuristics, the widely
used *gap cost* heuristic \(h(u)=\cgap(u, v_t)\)
(Ukkonen 1985; Hadlock 1988; Wu et al. 1990; Spouge 1989, 1991; Papamichail and Papamichail 2009)
that simply uses the difference between the lengths of the suffixes as lower
bound, and a new improved heuristic based on character frequencies in the two
suffixes. A*PA (Groot Koerkamp and Ivanov 2022) applies the *gap-chaining seed heuristic* with *pruning*
(Ivanov, Bichsel, and Vechev 2021) to obtain near-linear runtime when errors are uniform random.
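To see why the gap cost heuristic is admissible: aligning the suffixes \(A[i..]\) and \(B[j..]\) requires at least \(|(n-i)-(m-j)|\) insertions or deletions to make up the length difference. A minimal sketch (the function name is ours):

```python
def gap_cost(i: int, j: int, n: int, m: int) -> int:
    """Gap-cost heuristic c_gap(u, v_t) for state u = (i, j): aligning the
    suffixes A[i..n) and B[j..m) needs at least their length difference in
    insertions or deletions, so this never overestimates h*(u)."""
    return abs((n - i) - (m - j))
```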
Nevertheless, as Spouge (1991) states:

> Many algorithms for finding optimal paths in non-lattice graphs also exist (Dijkstra 1959; Hart, Nilsson, and Raphael 1968; Rubin 1974), but algorithms exploiting the lattice structure of an alignment graph are usually faster. In molecular biology, speed is important, …

and further (Spouge 1989):

> This suggests a radical approach to A* search complexities: dispense with the lists [of open states] if there is a natural order for vertex expansion.

Indeed, a lot of work has gone into speeding up DP-based algorithms.

**Computational volumes.** Wilbur and Lipman (1983) is (to our knowledge) the
first paper that speeds up the \(O(nm)\) DP algorithm, by only considering states
near diagonals with many *k-mer matches*, but at the cost of giving up the exactness
of the method. Fickett (1984) notes that for \(t\geq s\) only those DP-states with cost \(\g(u)\) at
most \(t\) need to be computed:

> However it is possible to fill the matrix in many different orders, the only restriction being that the calculation of any given \(d_{ij}\) depends on already having the values of the three elements up and to the left of it.
>
> […]
>
> But the only alignments of subsequences which are relevant are ones at least as good (distance at least as small) as the overall one. I.e. one really only needs those \(d_{ij}\) which are below a fixed bound.

This only requires \(O(nt)\) time, which is fast when \(t\) is an accurate bound on the distance \(s\); \(t\) can for example be set to a known upper bound for the data being aligned, or to the cost of a known suboptimal alignment. When \(t=t_0\) turns out to be too small, a larger new bound \(t_1\) can be chosen, and only states with distance between \(t_0\) and \(t_1\) have to be computed. This is implemented by keeping for each row the index of the first and last state with value at most \(t_0\), and skipping over already computed states. In the limit, one could choose \(t_i = i\) and compute states by increasing distance, closely mirroring Dijkstra’s algorithm.
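Fickett's bounded computation can be sketched as follows (a simplified illustration for unit costs; the restart with a larger bound and the reuse of already computed states are left out):

```python
def bounded_edit_distance(a: str, b: str, t: int):
    """Fickett-style bounded DP sketch: per row, only the span between the
    first and last cell with value <= t is computed; all other cells are
    treated as infinite.  Returns the edit distance if it is <= t, else None."""
    n, m = len(a), len(b)
    INF = t + 1  # any value > t behaves the same: values only grow along paths
    prev = [min(j, INF) for j in range(m + 1)]
    lo, hi = 0, min(m, t)  # first/last cell of the previous row with value <= t
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if lo == 0 and i <= t:
            cur[0] = i
        # A cell with value <= t in this row lies in [lo, hi + 1]: one of its
        # predecessors (above, left, or diagonal) must itself be <= t.
        for j in range(max(lo, 1), min(hi + 1, m) + 1):
            cur[j] = min(prev[j] + 1,
                         cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
            if cur[j] > t:
                cur[j] = INF
        inside = [j for j in range(m + 1) if cur[j] <= t]
        if not inside:
            return None  # the distance certainly exceeds t
        lo, hi = inside[0], inside[-1]
        prev = cur
    return prev[m] if prev[m] <= t else None
```

Values derived from a truncated (infinite) cell always stay above \(t\), so every returned value \(\leq t\) is exact.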

Ukkonen (1985) introduces a very similar idea, statically bounding the computation to only those states that can be on a path of length at most \(t\) through the graph. When the sequences have the same length (\(n=m\)), this only considers diagonals \(-t/2\) to \(t/2\), where diagonal \(0\) is the main diagonal of the DP-matrix.

On top of this, Ukkonen (1985) introduces *band doubling*: \(t_0=1\) can be *doubled* (\(t_i
= 2^i\)) until \(t_k \geq s > t_{k-1}\). Since each test requires \(O(n \cdot t_i)\) time, the
total time is

\begin{equation} n\cdot t_0 + \dots + n\cdot t_k = n\cdot (2^0 + \dots + 2^k) < n\cdot 2^{k+1} = 4\cdot n\cdot 2^{k-1} < 4\cdot n\cdot s = O(ns). \end{equation}

Note that this method does not (and indeed can not) reuse values from previous iterations, resulting in roughly a factor \(2\) overhead.
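The doubling loop can be sketched as follows (a simplified illustration using the full band of diagonals \(-t\) to \(t\), slightly looser than Ukkonen's volume; not Edlib's implementation, and function names are ours):

```python
def banded(a: str, b: str, t: int):
    """O(n*t) DP restricted to diagonals j - i in [-t, t]; returns the
    edit distance if it is <= t, and None otherwise."""
    n, m = len(a), len(b)
    if abs(n - m) > t:
        return None  # the end state lies outside the band
    INF = t + 1
    prev = [min(j, INF) for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= t:
            cur[0] = i
        for j in range(max(1, i - t), min(m, i + t) + 1):
            cur[j] = min(prev[j] + 1,
                         cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return prev[m] if prev[m] <= t else None


def band_doubling_edit_distance(a: str, b: str) -> int:
    """Ukkonen's band doubling: try t = 1, 2, 4, ... until the banded DP
    certifies its answer, i.e. returns a distance <= t."""
    t = 1
    while True:
        d = banded(a, b, t)
        if d is not None:
            return d
        t *= 2
```

A path of cost \(d \leq t\) never leaves diagonals \(-d\) to \(d\), so a returned distance \(\leq t\) is exact; the total work matches the geometric sum above.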

Spouge (1989) unifies the methods of
Fickett (1984) and Ukkonen (1985), and generalizes them to accept any
A* heuristic. In particular, a *computational volume* is a subgraph of the
alignment graph that contains *every* shortest path. Given a bound \(t\geq s\), some examples of
computational volumes are:

- \(\{u\}\), the entire \((n+1)\times (m+1)\) graph.
- \(\{u: \g(u) + \h(u)=s\}\), the vertices on shortest paths.
- \(\{u: \g(u)\leq t\}\), the states at distance \(\leq t\) (Fickett 1984).
- \(\{u: \cgap(v_s, u) + \cgap(u, v_t) \leq t\}\), the static set of states possibly on a path of length \(\leq t\) (Ukkonen 1985).
- \(\{u: \g(u) + \cgap(u, v_t) \leq t\}\), as used by Edlib (Spouge 1991; Šošić and Šikić 2017).
- \(\{u: \g(u) + h(u) \leq t\}\), for any admissible heuristic \(h\).
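Of these, Ukkonen's static volume can be tested without knowing any \(\g\)-values. A minimal sketch (the function name is ours):

```python
def in_static_volume(i: int, j: int, n: int, m: int, t: int) -> bool:
    """Membership test for Ukkonen's static computational volume: state
    (i, j) can lie on a path of cost <= t only if the gap costs from the
    start corner and to the end corner sum to at most t."""
    return abs(i - j) + abs((n - i) - (m - j)) <= t
```

For \(n=m\) this reduces to \(|i-j| \leq t/2\), i.e. the diagonals \(-t/2\) to \(t/2\) mentioned above.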

As Spouge (1989) notes:

> The order of computation (row major, column major or antidiagonal) is just a minor detail in most algorithms.

Yet this order of computation is exactly what subsequent work on faster implementations focused on.

**Implementation and parallelism.** Since roughly \(1995\), the focus shifted from
reducing the number of computed states to computing states faster through
advancements in implementation and hardware (SIMD, GPUs). These speedups are
often applied to the Smith-Waterman-(Gotoh) (Smith and Waterman 1981; Gotoh 1982) algorithm for
(affine-cost) local alignment, where algorithmic improvements beyond
\(\Theta(nm)\) are unknown.

The first technique in this direction is *microparallelism* (Alpern, Carter, and Gatlin 1995),
where each (64-bit) computer word is divided into multiple (e.g. 16-bit) parts,
and word-size operations modify all (4) parts in parallel.
Alpern, Carter, and Gatlin (1995) applied this with *inter-sequence parallelism* to align a
given query to \(4\) reference sequences in parallel (see also
Rognes (2011)). Hughey (1996) was the first to note that
*antidiagonals* of the DP matrix can be computed in parallel, and
Wozniak (1997) applied SIMD for this purpose.

Rognes and Seeberg (2000) split 64-bit words into eight 8-bit values, capping all
computations at \(255\) but doubling the speed. Further, they use *vertical*
instead of antidiagonal vectors.

> The advantage of this approach is the much-simplified and faster loading of the vector of substitution scores from memory. The disadvantage is that data dependencies within the vector must be handled.

In particular, Rognes and Seeberg (2000) introduce the *query profile*: Instead of
looking up the substitution score \(S[A[i]][B[j]]\) for the \(i\)’th and \(j\)’th
character of \(A\) and \(B\) respectively, it is more efficient to precompute the
*profile* \(P[c][j] := S[c][B[j]]\) for each character \(c\) in the alphabet. Then,
adjacent scores are simply found as adjacent values \(P[A[i]][j \dots j']\).
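The construction of the profile can be sketched as follows (a minimal illustration; the function name is ours):

```python
def build_profile(S, B, alphabet):
    """Query profile of Rognes and Seeberg (2000): P[c][j] = S[c][B[j]],
    precomputed once per alignment so the inner loop reads the contiguous
    row P[A[i]] instead of gathering S[A[i]][B[j]] entry by entry."""
    return {c: [S[c][b] for b in B] for c in alphabet}
```

The gather cost is paid once per character of the alphabet rather than once per DP cell.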

Similarly, Myers (1999) introduces a *bitpacking* algorithm specifically
for edit distance that stores the differences between adjacent DP-states
bit-encoded in two 64-bit words \(P\) and \(M\), with \(P_i\) and \(M_i\) indicating whether
the \(i\)’th difference is \(+1\) or \(-1\) respectively. It then gives an efficient algorithm
using bitwise operations on these words.
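For concreteness, a scalar version of this recurrence can be sketched as follows (one column per character of \(B\), for patterns of at most one machine word; a minimal illustration in the style of Myers (1999), not A*PA-v2's implementation):

```python
def myers_edit_distance(a: str, b: str) -> int:
    """Bit-parallel edit distance in the style of Myers (1999), for
    |a| <= 64: the vertical +1/-1 differences of one DP column are kept
    in the bitmasks Pv/Mv and updated per character of b."""
    m = len(a)
    assert 0 < m <= 64
    # Peq[c]: bitmask of the positions where a matches character c.
    Peq = {}
    for i, c in enumerate(a):
        Peq[c] = Peq.get(c, 0) | (1 << i)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    Pv, Mv, score = mask, 0, m  # column 0 has all vertical deltas +1
    for c in b:
        Eq = Peq.get(c, 0)
        Xv = Eq | Mv
        Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq
        Ph = Mv | (~(Xh | Pv) & mask)  # horizontal +1 deltas
        Mh = Pv & Xh                   # horizontal -1 deltas
        if Ph & high:
            score += 1
        if Mh & high:
            score -= 1
        Ph = ((Ph << 1) | 1) & mask    # the top boundary row increases by 1
        Mh = (Mh << 1) & mask
        Pv = Mh | (~(Xv | Ph) & mask)
        Mv = Ph & Xv
    return score
```

All \(64\) cells of a column are advanced with a constant number of word operations, which is the basis of both Edlib and the present work.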

TODO

- (Farrar 2006) Farrar’s striped; uses query profile; conditional prefix scan is moved outside inner loop. \(2-8\times\) faster than Wozniak and Rognes.
- Wu and Manber 1992
- Baeza-Yates and Gonnet 1992
- Hyyrö and Navarro 2005; Hyyrö et al. 2005
- Benson 2013
- Navarro 2004
- Bergeron and Hamel 2002

**Tools.**
There are multiple semi-global aligners that implement \(O(nm)\) global
alignment using numerous of the aforementioned implementation
techniques, such as SeqAn (Döring et al. 2008), Parasail (Daily 2016), Opal
(https://github.com/martinsos/opal), libssa
(https://github.com/RonnySoak/libssa), SWIPE (Rognes 2011), SWPS3
(Szalkowski et al. 2008), SSW library (Zhao et al. 2013) (link), and KSW2 (Li 2018).

Dedicated global alignment implementations are much rarer.
Edlib (Šošić and Šikić 2017) implements the band doubling of Ukkonen (1985) using
the \(\g(u)+\cgap(u, v_t)\leq t\) computational volume of Spouge (1991) and
the bitpacking of Myers (1999).
WFA and BiWFA (Marco-Sola et al. 2020, 2022) implement the \(O(n+s^2)\) expected time *diagonal transition*
algorithm (Ukkonen 1985; Myers 1986).
Block aligner (Liu and Steinegger 2023) is an approximate aligner that can handle
position-specific scoring matrices whose main novelty is to divide the
computation into blocks.
Lastly, A*PA (Groot Koerkamp and Ivanov 2022) directly implements A* on the alignment graph using
the gap-chaining seed heuristic.

TODO:

Opal: Šošić M. An SIMD dynamic programming C/C++ library. Thesis, University of Zagreb; 2015. https://bib.irb.hr/datoteka/758607.diplomski_Martin_Sosic.pdf.

libssa: Frielingsdorf JT. Improving optimal sequence alignments through a SIMD-accelerated library. Thesis, University of Oslo; 2015. http://urn.nb.no/URN:NBN:no-49935. Accessed 10 Dec 2015.

(Suzuki and Kasahara 2018) libgaba: SIMD with difference recurrence relation for affine cost alignment

(Loving, Hernandez, and Benson 2014) BitPAl

### Contributions

In A*PA-v2, we combine many existing techniques and introduce a number of new techniques to obtain a \(10\times\) speedup over existing single-threaded aligners. As a starting point, we take the band doubling algorithm as efficiently implemented by Edlib (Šošić and Šikić 2017) using bitpacking (Myers 1999). From there, we make a number of improvements.

**Block-based computation.** We first observe that Edlib computes one column of the DP matrix at a time, and for each column decides which range of cells to compute. We significantly reduce this overhead by processing blocks of \(256\) columns at a time, similar to Block aligner (Liu and Steinegger 2023). Correspondingly, we only store cells of the DP-matrix at block boundaries.

**Incremental doubling.** We further note that both the original band doubling method of Ukkonen (1985) and Edlib recompute states in the doubled region. Reusing the theory behind the A* algorithm, we prove a novel theorem stating that some of this recomputation can be avoided.

**SIMD.** We speed up the implementation using SIMD to compute each block, allowing the processing of \(4\) computer words in parallel, and introduce a novel bit-encoding of the input sequence to make this even faster.

**Traceback.** For the traceback, we optimistically use the diagonal transition method within each block with a strong Z-drop [TODO] heuristic, falling back to a full recomputation of the block when needed.

**A\*.** We improve the A* seed heuristics of Groot Koerkamp and Ivanov (2022) in two ways. First, instead of updating contours each time a match is pruned, we now do this in batches once the band is doubled. Secondly, we introduce a new *pre-pruning* technique that discards most of the *spurious* (off-path) matches ahead of time.

## Methods

First, we reduce the bookkeeping overhead in Edlib. Then, we speed up the implementation further. At this point, we should simply have a more efficient reimplementation that roughly mimics Edlib.

On top of that, we can apply the A*PA heuristics for further speed gains on large/complex alignments, at the cost of larger precomputation time to build the heuristic.

### Blocks

- Blocks
- Param: block size

- Sparse heuristic
- Sparse memory
- Param: sparsity, same as block size

### Incremental doubling

- States with \(f \leq f_{max}\) have *fixed* values (assuming the heuristic is consistent).
- We can store deltas around the edge of this region and incrementally start the computation from there.

### SIMD and bitpacking

- Diagonally strided 4-lane SIMD
- 2 parallel executions for higher instructions-per-cycle, so 8-lanes (512 states) total.
- New bitpacking method
- TODO: Tried BitPAl’s bitpacking method which is one less than Myers 99’s, but without success so far.

### Traceback

- DT Trace
- Param: x-drop

### A*

- pre-pruning
- Param: length of lookahead

- bulk contours update

## Results

Compare

- Edlib
- WFA
- A*PA
- A*PA-v2 without heuristics
- A*PA-v2 with heuristics

on

- synthetic data
- human ONT reads
- human ONT reads with genetic variation

Important:

- Find threshold where heuristics become worth the overhead
- Show benefit of each of the optimizations
- Show sensitivity to parameter tuning

## Acknowledgements

I am grateful to Daniel Liu for regular discussions, and suggesting additional papers that have been added to the introduction.

## References

*Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM) - Supercomputing ’95*. https://doi.org/10.1145/224170.224222.

*BMC Bioinformatics* 17 (1). https://doi.org/10.1186/s12859-016-0930-z.

*Numerische Mathematik* 1 (1): 269–71. https://doi.org/10.1007/bf01386390.

*BMC Bioinformatics* 9 (1). https://doi.org/10.1186/1471-2105-9-11.

*Bioinformatics* 23 (2): 156–61. https://doi.org/10.1093/bioinformatics/btl582.

*Nucleic Acids Research* 12 (1Part1): 175–79. https://doi.org/10.1093/nar/12.1part1.175.

*Journal of Molecular Biology* 162 (3): 705–8. https://doi.org/10.1016/0022-2836(82)90398-9.

*Congressus Numerantium* 61: 263–74.

*IEEE Transactions on Systems Science and Cybernetics* 4 (2): 100–107. https://doi.org/10.1109/tssc.1968.300136.

*Bioinformatics* 12 (6): 473–79. https://doi.org/10.1093/bioinformatics/12.6.473.

*bioRxiv*. https://doi.org/10.1101/2021.11.05.467453.

*SIAM Review* 25 (2): 201–37. https://doi.org/10.1137/1025045.

*Soviet Physics Doklady* 10: 707–10. https://api.semanticscholar.org/CorpusID:60827152.

*Bioinformatics* 34 (18): 3094–3100. https://doi.org/10.1093/bioinformatics/bty191.

*Bioinformatics*, btad487. https://doi.org/10.1093/bioinformatics/btad487.

*Bioinformatics* 30 (22): 3166–73. https://doi.org/10.1093/bioinformatics/btu507.

*Bioinformatics* 37 (4): 456–63. https://doi.org/10.1093/bioinformatics/btaa777.

*Algorithmica* 1 (1–4): 251–66. https://doi.org/10.1007/bf01840446.

*Journal of the ACM* 46 (3): 395–415. https://doi.org/10.1145/316542.316550.

*ACM Computing Surveys* 33 (1): 31–88. https://doi.org/10.1145/375360.375365.

*Journal of Molecular Biology* 48 (3): 443–53. https://doi.org/10.1016/0022-2836(70)90057-4.

*BMC Bioinformatics* 10 (S1). https://doi.org/10.1186/1471-2105-10-s1-s10.

*BMC Bioinformatics* 12 (1). https://doi.org/10.1186/1471-2105-12-221.

*Bioinformatics* 16 (8): 699–706. https://doi.org/10.1093/bioinformatics/16.8.699.

*IEEE Transactions on Computers* C–23 (9): 907–14. https://doi.org/10.1109/t-c.1974.224054.

*Proceedings of the National Academy of Sciences* 69 (1): 4–6. https://doi.org/10.1073/pnas.69.1.4.

*SIAM Journal on Applied Mathematics* 26 (4): 787–93. https://doi.org/10.1137/0126070.

*Journal of Molecular Biology* 147 (1): 195–97. https://doi.org/10.1016/0022-2836(81)90087-5.

*SIAM Journal on Applied Mathematics* 49 (5): 1552–66. https://doi.org/10.1137/0149094.

*Bioinformatics* 7 (1): 1–7. https://doi.org/10.1093/bioinformatics/7.1.1.

*BMC Bioinformatics* 19 (S1). https://doi.org/10.1186/s12859-018-2014-8.

*BMC Research Notes* 1 (1): 107. https://doi.org/10.1186/1756-0500-1-107.

*Information and Control* 64 (1–3): 100–118. https://doi.org/10.1016/s0019-9958(85)80046-2.

*Cybernetics* 4 (1): 52–57. https://doi.org/10.1007/bf01074755.

*Journal of the ACM* 21 (1): 168–73. https://doi.org/10.1145/321796.321811.

*Proceedings of the National Academy of Sciences* 80 (3): 726–30. https://doi.org/10.1073/pnas.80.3.726.

*Bioinformatics* 13 (2): 145–50. https://doi.org/10.1093/bioinformatics/13.2.145.

*Information Processing Letters* 35 (6): 317–23. https://doi.org/10.1016/0020-0190(90)90035-v.

*PLOS ONE* 8 (12): e82138. https://doi.org/10.1371/journal.pone.0082138.

*Bioinformatics* 33 (9): 1394–95. https://doi.org/10.1093/bioinformatics/btw753.

[^1]: Ukkonen did not realize this faster runtime, and only proved a bound of \(O(nm \log (nm))\).