- Abstract
- 1 Introduction
- 1.1 Contributions
- 1.2 Previous work
- 1.2.1 Needleman-Wunsch
- 1.2.2 Graph algorithms
- 1.2.3 Computational volumes
- 1.2.4 Implementation and parallelism
- 1.2.5 Tools

- 2 Methods
- 2.1 Band-doubling and bitpacking in Edlib
- 2.2 Blocks
- 2.3 SIMD
- 2.4 SIMD-friendly sequence profile
- 2.5 Incremental doubling
- 2.6 Traceback
- 2.7 A*
- 2.7.1 Bulk-contours update
- 2.7.2 Pre-pruning

- 2.8 Appendix: Range-of-rows computations
- 2.8.1 Computed range
- 2.8.2 Fixed range

- 3 Results
- 4 Conclusion
- 4.1 Summary
- 4.2 Limitations
- 4.3 Future work

- Acknowledgements

\begin{equation*} \newcommand{\g}{g^*} \newcommand{\h}{h^*} \newcommand{\cgap}{c_{\texttt{gap}}} \newcommand{\xor}{\ \mathrm{xor}\ } \newcommand{\and}{\ \mathrm{and}\ } \newcommand{\st}[2]{\langle #1, #2\rangle} \newcommand{\matches}{\mathcal M} \end{equation*}

Preface: This is my work-in-progress draft for my paper on version 2 of A*PA. Code is done and works, but needs some cleanup and polishing. The API also needs work.

## Abstract

**Methods.** We introduce a new global pairwise aligner, A*PA2, that builds on Edlib and
incorporates ideas from Block Aligner and A*PA. A*PA2 1) uses
Myers’ bitpacking with SIMD, 2) significantly reduces overhead by using large block
sizes, 3) avoids recomputation of states where possible, 4) introduces two new
techniques for traceback, and 5) improves and applies the heuristics developed
in A*PA.

**Results.**
A \(10\times\) overall speedup on the alignment of $500$kbp ONT reads.

**Availability.** https://github.com/RagnarGrootKoerkamp/astar-pairwise-aligner

**Contact.** https://twitter.com/curious_coding

## 1 Introduction

The problem of *global pairwise alignment* is to find the shortest sequence of
edit operations (insertions, deletions, substitutions) to convert a string \(A\)
into a second string \(B\) (Needleman and Wunsch 1970; Vintsyuk 1968), where the number of such
operations is called the *Levenshtein distance* or *edit distance*
(Levenshtein 1965; Wagner and Fischer 1974).

Per tradition, as Rognes and Seeberg (2000) write:

The rapidly increasing amounts of genetic-sequence information available represent a constant challenge to developers of hardware and software database searching and handling.

In this work we introduce a new exact algorithm for global pairwise alignment building on the recently introduced A* heuristics of A*PA (Groot Koerkamp and Ivanov 2022), but instead of running A* we use a much more efficient DP-based algorithm. Thus, as Fickett (1984) states,

at present one must choose between an algorithm which gives the best alignment but is expensive, and an algorithm which is fast but may not give the best alignment. In this paper we narrow the gap between these choices by showing how to get the optimal alignment much more quickly than before.

### 1.1 Contributions

In A*PA2, we combine multiple existing techniques and introduce a number of new ideas to obtain a \(10\times\) speedup over existing single-threaded exact aligners. As a starting point, we take the band doubling algorithm implemented by Edlib (Šošić and Šikić 2017) using bitpacking (Myers 1999). First, we speed up the implementation (points 1., 2., and 3.). Then, we reduce the amount of work done (points 4. and 5.). Lastly, we apply A*.

**Block-based computation.**Edlib computes one column of the DP matrix at a time, and for each column decides which range (subset of rows) of cells to compute. We significantly reduce this overhead by processing blocks of \(256\) columns at a time, taking inspiration from Block aligner (Liu and Steinegger 2023). Correspondingly, we only store cells of the DP-matrix at block boundaries, reducing memory usage.**SIMD.**We speed up the implementation using SIMD to compute each block, allowing the processing of \(4\) computer words in parallel.**Novel encoding.**We introduce a novel bit-encoding of the input sequence to speed up SIMD operations for size-\(4\) alphabets.**Incremental doubling.**We further note that both the original band doubling method of Ukkonen (1985) and Edlib recompute states in the doubled region. Reusing the theory behind the A* algorithm, we prove and apply a novel theorem stating that some of this recomputation can be avoided.**Traceback.**For the traceback, we optimistically use the diagonal transition method within each block with a strong Z-drop [TODO] heuristic, falling back to a full recomputation of the block when needed.**A*****.**We improve the A* seed heuristics of Groot Koerkamp and Ivanov (2022) in two ways. First, instead of updating contours each time a match is pruned, we now do this in batches once the band is doubled. Secondly, we introduce a new*pre-pruning*technique that discards most of the*spurious*(off-path) matches ahead of time.

### 1.2 Previous work

In the following, we give a brief recap of developments that this work builds on, in chronological order per approach. See also e.g. the reviews by Kruskal (1983) and Navarro (2001).

#### 1.2.1 Needleman-Wunsch

This problem has classically been approached as a dynamic
programming (DP) problem. For string lengths \(n\) and \(m\), Needleman and Wunsch (1970)
introduced the first \(O(n^2m)\) algorithm. This was improved to what is now
known as the \(O(nm)\) *Needleman-Wunsch algorithm* by Sellers (1974) and
Wagner and Fischer (1974), building on the quadratic algorithm for *longest common
subsequence* by Sankoff (1972).

#### 1.2.2 Graph algorithms

It was already realized early on that this problem
corresponds to finding the shortest path from \(v_s\) to \(v_t\) in the *alignment
graph* (also called *edit graph* or *dependency graph*)
(Vintsyuk 1968; Ukkonen 1985). Both Ukkonen (1985) and Myers (1986)
remarked that this can be solved using Dijkstra’s algorithm (Dijkstra 1959),
taking \(O(ns)\) time^{1}, where \(s\) is the edit distance between
the two strings. However, Myers (1986) observes that

the resulting algorithm involves a relatively complex discrete priority queue and this queue may contain as many as O(ND) entries even in the case where just the length of the […] shortest edit script is being computed.

Hadlock (1988) realized that Dijkstra’s algorithm can be improved
upon by using A* (Hart, Nilsson, and Raphael 1968), a more *informed* algorithm that uses a
*heuristic* function \(h(u)\) that gives a lower bound on the edit distance
\(\h(u)\) between the suffixes following DP state \(u\). He uses two heuristics, the widely
used *gap cost* heuristic \(h(u)=\cgap(u, v_t)\)
(Ukkonen 1985; Hadlock 1988; Wu et al. 1990; Spouge 1989, 1991; Papamichail and Papamichail 2009)
that simply uses the difference between the lengths of the suffixes as lower
bound, and a new improved heuristic based on character frequencies in the two
suffixes. A*PA (Groot Koerkamp and Ivanov 2022) applies the *gap-chaining seed heuristic* with *pruning*
(Ivanov, Bichsel, and Vechev 2021) to obtain near-linear runtime when errors are uniform random.
Nevertheless, as Spouge (1991) states:

Many algorithms for finding optimal paths in non-lattice graphs also exist (Dijkstra 1959; Hart, Nilsson, and Raphael 1968; Rubin 1974), but algorithms exploiting the lattice structure of an alignment graph are usually faster. In molecular biology, speed is important, …

and further (Spouge 1989):

This suggests a radical approach to A* search complexities: dispense with the lists [of open states] if there is a natural order for vertex expansion.

Indeed, a lot of work has gone into speeding up DP-based algorithms.

#### 1.2.3 Computational volumes

Wilbur and Lipman (1983) is (to our knowledge) the
first paper that speeds up the \(O(nm)\) DP algorithm, by only considering states
near diagonals with many *k-mer matches*, but at the cost of giving up the exactness
of the method. Fickett (1984) notes that for \(t\geq s\) only those DP-states with cost \(\g(u)\) at
most \(t\) need to be computed:

However it is possible to fill the matrix in many different orders, the only restriction being that the calculation of any given \(d_{ij}\) depends on already having the values of the three element up and to the left of it.

[…]

But the only alignments of subsequences which are relevant are ones at least as good (distance at least as small) as the overall one. I.e. one really only needs those \(d_{ij}\) which are below a fixed bound.

This only requires \(O(nt)\) time, which is fast when \(t\) is an accurate bound on the distance \(s\), which for example can be set as a known upper bound for the data being aligned, or as the length of a known suboptimal alignment. When \(t=t_0\) turns out too small a larger new bound \(t_1\) can be chosen, and only states with distance in between \(t_0\) and \(t_1\) have to be computed. This is implemented by keeping for each row the index of the first and last state with value at most \(t_0\), and skipping over already computed states. In the limit, one could choose \(t_i = i\) and compute states by increasing distance, closely mirroring Dijkstra’s algorithm.

Ukkonen (1985) introduces a very similar idea, statically bounding the computation to only those states that can be on a path of length at most \(t\) through the graph. When the sequences have the same length (\(n=m\)), this only considers diagonals \(-t/2\) to \(t/2\), where diagonal \(0\) is the main diagonal of the DP-matrix.

On top of this, Ukkonen (1985) introduces *band doubling*: \(t_0=1\) can be *doubled* (\(t_i
= 2^i\)) until \(t_k \geq s > t_{k-1}\). Since each test requires \(O(n \cdot t_i)\) time, the
total time is

\begin{equation} n\cdot t_0 + \dots + n\cdot t_k = n\cdot (2^0 + \dots + 2^k) < n\cdot 2^{k+1} = 4\cdot n\cdot 2^{k-1} < 4\cdot n\cdot s = O(ns). \end{equation}

Note that this method does not (and indeed can not) reuse values from previous iterations, resulting in roughly a factor \(2\) overhead.

Spouge (1989) unifies the methods of
Fickett (1984) and Ukkonen (1985), and generalizes them to accept any
A* heuristic. In particular, a *computational volume* is a subgraph of the
alignment graph that contains *every* shortest path. Given a bound \(t\geq s\), some examples of
computational volumes are:

- \(\{u\}\), the entire \((n+1)\times (m+1)\) graph.
- \(\{u: \g(u) + \h(u)=s\}\), the vertices on a shortest paths.
- \(\{u: \g(u)\leq t\}\), the states at distance \(\leq t\) (Fickett 1984).
- \(\{u: \cgap(v_s, u) + \cgap(u, v_t) \leq t\}\) the static set of states possibly on a path of length \(\leq t\) (Ukkonen 1985).
- \(\{u: \g(u) + \cgap(u, v_t) \leq t\}\), as used by Edlib (Spouge 1991; Šošić and Šikić 2017).
- \(\{u: \g(u) + h(u) \leq t\}\), for any admissible heuristic \(h\).

As Spouge (1989) notes:

The order of computation (row major, column major or antidiagonal) is just a minor detail in most algorithms.

But this is exactly what was investigated a lot in the search for faster implementations.

#### 1.2.4 Implementation and parallelism

Since roughly \(1995\), the focus shifted from reducing the number of computed states to computing states faster through advancements in implementation and hardware (SIMD, GPUs). These speedups are often applied to the Smith-Waterman-(Gotoh) (Smith and Waterman 1981; Gotoh 1982) algorithm for (affine-cost) local alignment, where algorithmic improvements beyond \(\Theta(nm)\) are unknown.

The first technique in this direction is *microparallelism* (Alpern, Carter, and Gatlin 1995),
where each (64 bit) computer word is divided into multiple (e.g. 16 bit) parts,
and word-size operations modifying all (4) parts in parallel.
Alpern, Carter, and Gatlin (1995) applied this with *inter-sequence parallelism* to align a
given query to \(4\) reference sequences in parallel (see also
Rognes (2011)). Hughey (1996) was the first to note that
*antidiagonals* of the DP matrix can be computed in parallel, and
Wozniak (1997) applied SIMD for this purpose.

Rognes and Seeberg (2000) split 64bit words into 8 8-bit values, capping all
computations at \(255\) but doubling the speed. Further, it uses *vertical*
instead of antidiagonal vectors.

The advantage of this approach is the much-simplified and faster loading of the vector of substitution scores from memory. The disadvantage is that data dependencies within the vector must be handled.

In particular, Rognes and Seeberg (2000) introduce the *query profile*: Instead of
looking up the substitution score \(S[A[i]][B[j]]\) for the \(i\)’th and \(j\)’th
character of \(A\) and \(B\) respectively, it is more efficient to precompute the
*profile* \(P[c][j] := S[c][B[j]]\) for each character \(c\) in the alphabet. Then,
adjacent scores are simply found as adjacent values \(P[A[i]][j \dots j’]\).

Similarly, Myers (1999) introduces a *bitpacking* algorithm specifically
for edit distance that stores the differences between adjacent DP-states
bit-encoded in two 64-words \(P\) and \(M\), with \(P_i\) and \(M_i\) indicating whether
the \(i\)’th difference is \(+1\) resp. \(-1\). It then gives an efficient algorithm
using bitwise operations on these words.

TODO

- (Farrar 2006) Farrar’s striped; uses query profile; conditional prefix scan is moved outside inner loop. \(2-8\times\) faster than Wozniak and Rognes.
- Wu Manber 1992
- Baeza-Yates Gonnet 1992
- Hyyro and Navarro, 2005; Hyyro et al., 2005
- Benson 2013
- navarro 2004
- bergeron hamel 2002

#### 1.2.5 Tools

There are multiple semi-global aligners that implement \(O(nm)\) global alignment using numerous of the aforementioned implementation techniques, such as SeqAn (Döring et al. 2008), Parasail (Daily 2016), Opal (https://github.com/martinsos/opal), libssa (https://github.com/RonnySoak/libssa), SWIPE (Rognes 2011), SWPS3 (Szalkowski et al. 2008), SSW library (Zhao et al. 2013) (link), and KSW2 (Li 2018).

Dedicated global alignment implementations are much rarer.
Edlib (Šošić and Šikić 2017) implements the band doubling of Ukkonen (1985) using
the \(\g(u)+\cgap(u, v_t)\leq t\) computational volume of Spouge (1991) and
the bitpacking of Myers (1999).
WFA and BiWFA (Marco-Sola et al. 2020, 2022) implement the \(O(n+s^2)\) expected time *diagonal transition*
algorithm (Ukkonen 1985; Myers 1986).
Block aligner (Liu and Steinegger 2023) is an approximate aligner that can handle
position-specific scoring matrices whose main novelty is to divide the
computation into blocks.
Lastly, A*PA (Groot Koerkamp and Ivanov 2022) directly implements A* on the alignment graph using
the gap-chaining seed heuristic.

Opal: Šošic M. An simd dynamic programming c/c++ library: Thesis, University of Zagreb; 2015. https://bib.irb.hr/datoteka/758607.diplomski_Martin_ Sosic.pdf.

libssa: Frielingsdorf JT. Improving optimal sequence alignments through a simd-accelerated library: Thesis, University of Oslo; 2015. http://urn.nb.no/ URN:NBN:no-49935. Accessed 10 Dec 2015.

(Suzuki and Kasahara 2018) libgaba: SIMD with difference recurrence relation for affine cost alignment

(Loving, Hernandez, and Benson 2014) BitPAl

## 2 Methods

First, we explain in detail the algorithm and implementation used by Edlib and reduce the overhead in the implementation by using blocks and SIMD. Then, we improve the algorithm by avoiding recomputing states and speeding up the traceback algorithm. On top of that, we apply the A*PA heuristics for further speed gains on large/complex alignments, at the cost of larger precomputation time to build the heuristic.

### 2.1 Band-doubling and bitpacking in Edlib

As a baseline, we first outline the band-doubling method used by Edlib.

Start with edit-distance threshold \(t=1\).

Iterate over columns \(i\) from \(0\) to \(n\).

For each column, determine the range of rows \(R=(r_{start}, r_{end})\) to compute by finding the top- and bottommost state that can possibly have cost at most \(t\), taking into account the gap-cost to the end. Both \(r_{start}\) and \(r_{end}\) are rounded

*out*to the previous/next multiple of \(w\). a. If the range is empty, double \(t\) and go back to step 2. b. Otherwise, compute the range in blocks of \(w=64\) rows at a time using bitpacking and the standard*profile*of sequence \(B\).Only the last and current column are kept in memory.

**Traceback.**Once the last column has been reached, recursively repeat the algorithm using Hirschberg’s*meet-in-the-middle*approach to find the alignment. Continue until the sequences are of length \(\leq 1000\). For these small sequences all vertical differences can be stored and a backtrace is done to find the alignment.

### 2.2 Blocks

Our first improvement is to process \(B=256\) columns at a time. Instead of computing the range of rows \(R\) for each column individually, we compute it once and then use this one range for a block of \(B\) consecutive columns. While this computes some extra states in most columns, the time saved by not having to compute \(R\) for each column is larger.

Within each block, we iterate over the rows in *lanes* of \(w\) rows at a time, and for each
lane compute all \(B\) columns before moving on to the next lane.

See 2.8.1 for details on the computation of \(R\).

Where Edlib does not initially store intermediate values and uses meet-in-the-middle to
find the alignment, we *always* store vertical differences at the end of
each^{2}
block. This simplifies the implementation, and has sufficiently small memory
usage to be practical. See 2.6 for details on recovering the alignment.

### 2.3 SIMD

While it is tempting to use a SIMD vector as a single $W=256$-bit word, the four
$w=64$-bit words (SIMD lanes) are dependent on each other and require manual
work to shift bits between the lanes.
Instead, we let each $256$-bit AVX2 SIMD vector represent four $64$-bit words
(lanes) that are anti-diagonally
staggered (TODO FIG). This is similar to the original anti-diagonal tiling
introduced by Wozniak (1997), but using units of $64$-bit words instead of
single characters. This idea was already introduced in 2014 by the author of
Edlib^{3}, but to our
knowledge has never been implemented either in Edlib or elsewhere.

We achieve further speedup by improving instruction-level-parallelism. Modern CPUs can execute up to 4 instructions per cycle (IPC) and use execution pipelines that look ahead tens of instructions. The dependencies between the instructions for computing each SIMD vector do not allow such high parallelism. We improve this by processing two SIMD vectors in parallel, spanning a total of \(8\) anti-diagonally-aligned $64$-bit lanes covering \(2W = 512\) rows (TODO FIG).

When the number of lanes of rows to be computed is \(c=(r_{end}-r_{start})/64\), we process \(8\) lanes in parallel as long as \(c\geq 8\). If there are remaining rows, we end with another $8$-lane (\(5\leq c<8\)) or $4$-lane (\(1\leq c\leq 4\)) iteration that optionally includes some padding rows at the bottom. In case the horizontal differences along the original bottom row are needed (as is the case for incremental doubling 2.5), we do not use padding and instead fall back to trying a $4$-lane SIMD (\(c\geq 4\)), a $2$-lane SIMD (\(c\geq 2\)), and lastly a scalar iteration (\(c\geq 1\)).

TODO: How about padding upwards?

### 2.4 SIMD-friendly sequence profile

Myers’ bitpacking algorithm precomputes a *profile* \(P_{eq}[c][j]\) containing
\(\sigma \times m\) bits. For each character \(c\), it contains a bitvector of
$w$-bit words indicating the positions where \(c\) occurs in \(B\). We improve
memory locality by instead storing the profile as an array of blocks of \(\sigma\)
words: \(P_{eq}[j/w][c]\) containing \(\lceil m/w\rceil \times \sigma\) $w$-bit
words (FIG?).

A drawback of anti-diagonal tiling is that each column contains its own
character \(a_i\) that needs to be looked up. While SIMD offers `gather`

instructions to do multiple of these lookups in parallel, these instructions are
not always efficient. Thus, we introduce the following alternative scheme.

Let \(b = \lceil \log_2(\sigma)\rceil\) be the number of bits needed to encode each character, with \(b=2\) for DNA. The new profile \(P’\) contains \(b\) bitvectors, each indicating the negation of one bit of each character, stored as an \(\lceil m/w\rceil \times b\) array \(P’[j/w][p]\) of $w$-bit words.

To check whether row \(j\) contains character \(c\) with bit representation \(\overline{c_{b-1}\dots c_{0}}\), we compute \[(c_0 \xor P’[j/w][0][j\bmod w]) \and \dots \and (c_{b-1} \xor P’[j/w][b-1][j\bmod w]).\] This naturally extends to an efficient computation for $w$-bit words and larger SIMD vectors.

TODO: Tried BitPAl’s bitpacking method which is one less than Myers 99’s, but without success so far.

### 2.5 Incremental doubling

TODO: Rephrase \(g(u)\leq t\) to \(f(u) \leq t\).

TODO: The range-end only matters for the last columns of the block. Intermediate columns that go further down can be disregarded.

The original band doubling algorithm doubles the threshold from \(t\) to \(t’=2t\)
in each iteration and simply recomputes the distance to all states. On the
other hand, BFS, Dijkstra, and A*^{4} visit
states in increasing order of distance (\(g(u)\) for BFS and Dijkstra, \(f(u) =
g(u) + h(u)\) for A*), and the distance to a state is known to be correct
(*fixed*) as soon as it is expanded. This way a state is never expanded twice.

Indeed, the band-doubling algorithm can also avoid recomputations. After
completing the iteration for \(t\), it is guaranteed that the distance is fixed
for all states that are indeed at distance \(\leq t\). In fact a stronger result holds:
in any column the distance is fixed for *all* states between the topmost
and bottommost state with distance \(\leq t\).
Note that due to the
word-based computations, there will also be states whose computed distance is
\(>t\). These are *not* guaranteed to be correct.

After a range \(R=(r_{start}, r_{end})\) of rows for a block of \(B\) columns has
been computed, we determine the first row \(r’_{start} \geq r_{start}\) and last
row \(r’_{end}\leq r_{send}\) that are a multiple of \(w\) and for which all
computed distances in this block are at most \(t\)^{5}, if such rows exists. (See
2.8.2 for details.) We then store these values \((r’_{start}, r’_{end})\)
and the horizontal difference along row \(r’_{end}\). The next iteration for
\(t’=2t\) then skips the rows in this interval, and uses the stored differences as
input to compute rows \(r’_{end}\) to the new \(r_{end}\).

### 2.6 Traceback

The traceback stage takes as input the computed vertical differences at the end of each block of columns. We iteratively work backwards through the blocks of columns. In each step, we are given the distances \(D_i[j]\) to the states in column \(i\) (\(B|i\)) and the state \(u=\st{i+B}j\) in column \(i+B\) that is on the optimal path and has distance \(d_u\). The goal is to find an optimal path from column \(i\) to \(u\).

A naive approach is to simply recompute the entire block of columns for their entire range \(R\) while storing distances to all cells, but we introduce to faster methods.

#### 2.6.1 Optimistic block computation

Instead of computing the full range \(R=(r_{start}, r_{end})\) for this column, a first insight is that only rows up to \(j\) are needed, since the optimal path to \(u=\st{i+B}j\) can never go below row \(j\).

Secondly, the path crosses \(B=256\) rows, and so we optimistically assume that it will be contained in rows \(j-256-64=j-320\) [CHECK] to \(j\). Thus, we first compute the states in this range of rows (rounded out to multiples of \(w\)). If the distance to \(u\) computed this way agrees with the known distance, the path must lie within these rows. Otherwise, we repeatedly try again with double the number of lanes, until success. The exponential search ensures low overhead and good average case performance.

#### 2.6.2 Optimistic diagonal transition

A further improvement uses the *diagonal transition* algorithm backwards from
\(u\). We simply run the unmodified algorithm on the reverse graph covering
columns \(i\) to \(i+B\) and rows \(0\) to \(j\). When the distance \(d_j\) from \(u\) to a
state \(\st ij\) in column \(i\) is found, we check whether \(D_i[j] + d_j = d_u\).
If this is not the case, we continue until a suitable \(j\) is found. We then
infer the optimal path by a traceback on the diagonal transition algorithm.

One further optimization to this is that again we can be optimistic and assume
that the path will have distance at most \(d_0=TODO\), and ignore any states that are
at distance \(>d_0\). If all states at distance \(\leq d_0\) have been explored
without finding a *match* in column \(i\), \(d_0\) is doubled repeatedly until success.

Another optimization is the X-drop [TODO], meaning that all states that lag more than \(x\) columns behind the one with smallest \(j\) will be dropped.

### 2.7 A*

Edlib already uses a simple *gap-cost* heuristic that gives a lower bound on the
number of insertions and deletions on a path from each state to the end.
We simply replace this by the stronger heuristics introduced in A*PA.
We use three variants:

**No heuristic.**Only use the gap heuristic. No initialization needed.**Seed heuristic (SH).**This requires relatively simple precomputation, and little bookkeeping, but works well for low uniform error rate.**Gap-chaining seed heuristic (GCSH).**The strongest heuristic that requires more time to initialize and update, but is better able to penalize long indels.

The details of how these changes affects the ranges of rows being computed are in 2.8.

We make two modifications the previous version of the A*PA algorithm.

#### 2.7.1 Bulk-contours update

In A*PA, matches are *pruned* as soon as a shortest path to their start has been
found. This helps to penalize states *before* (left of) the match. Each
iteration of our new algorithm works left-to-right only, and thus pruning of
matches does not affect the current iteration. Instead of pruning on the fly, we
now collect all matches to be pruned at the end of each iteration, and prune
them in one right-to-left sweep.

To ensure pruning is a valid optimization, we never allow the range of rows for each block to shrink after increasing \(t\).

#### 2.7.2 Pre-pruning

Here we introduce an independent optimization that also applies to the original A*PA method.

Each of the heuristics \(h\) introduced in A*PA depends on the set of matches
\(\matches\). Given that \(\matches\) contains *all* matches, \(h\) was shown to be an
admissible [TODO] heuristic. Even after pruning some matches, \(h\) was shown to
still be a lower bound on the length of a path not going through already visited states.

Now consider a situation where there are two seeds and there is an exact match
\(m\) from \(u=v_s\) to \(v\) for seed \(s_0\), but going from \(v\) to the end of the
next seed \(s_1\) takes cost at least \(2\) (TODO FIG). The existence of the match
is a ‘promise’ that \(s_0\) can be crossed for free. In this case, this leads to
a seed heuristic value of \(1\) is \(u\), namely \(0\) for \(s_1\) plus \(1\) for \(s_1\). But we
already know that match \(m\) can *never* lead to a path of cost \(<2\) to the end
of \(s_1\). Thus, we may as well ignore \(m\)! This increases the value of the
seed heuristic in \(u\) to \(2\), which is indeed a lower bound on the actual distance.

More generally, consider a situation where there is a match \(m\) from \(u\) to \(v\)
in seed \(s_i\), and the lowest cost path from \(s_i\) to the *start* of \(s_{i+p}\)
has cost \(\geq p\). The seed heuristic penalizes the path from \(u\) (at the start
of \(s_i\)) to the start of \(s_{i+p}\) by at most \(p-1\), since there are at most
\(p-1\) seeds in \(\{s_{i+1}, \dots, s_{i+p-1}\}\) without match. Since in fact we
know that this path has cost at least \(p\), we can *pre-prune* the match \(m\) and
increase the value of the heuristic while keeping it *admissible*.

### 2.8 Appendix: Range-of-rows computations

#### 2.8.1 Computed range

Here we determine the range of rows that can possibly contain cells at distance \(\leq t\) in a block of \(B\) columns from \(i\) to \(i+B\). We assume that the heuristic \(h\) being used is consistent, i.e. that for any states \(u\preceq v\) we have \(h(u) \leq d(u,v) +h(v)\).

Let \(R=(r_{start}, r_{end})\) be the range of states in column \(i\) to which the distance has been computed. From this we can find the topmost and bottommost states \(r^t_{start}\) and \(r^t_{end}\) that are at distance \(\leq t\), see 2.8.2.

**Start of range.** Since row \(j=r^t_{start}\) is the first row in column \(i\) with distance \(\leq t\),
this means that states in columns \(i\) to \(i+B\) at rows \(<j\) can not be at
distance \(\leq t\). Thus, the first row that needs to be computed is row \(r^t_{start}\).
[TODO: Add a \(+1\) to this?]

**End of range.** We will now determine the bottommost row \(j\) that can contain a
state at distance \(\leq t\) in the block. Let \(u=\st{i}{r^t_{end})\) be the bottommost state in
column \(i\) with distance \(\leq t\). Let \(v = \st{i’}{j’}\) be a state in the
current block (\(i\leq i’\leq i+B\)) that is below the diagonal of \(u\) (\(j’-i’ \geq r^t_{end}-i\)).
Then, the distance to \(v\) is at least \(\g(v) \geq \g(u) + \cgap(u,v)\), and hence
\[
f(v) = g(v) + h(v) \geq \g(v) + h(v) \geq \g(u) + \cgap(u,v) + h(v) =: f_l(v).
\]
The end of the range can now be computed by finding the bottommost state \(v\) in each
column for which this lower bound \(f_l\) on \(f\) is \(\leq t\), using the following
algorithm^{6}.

**Algorithm (bottommost row computation).**

- Start with \(v = \st{i’}{j’} = u = \st{i}{r^t_{end}}\).
- While the below-neighbour $v’ = \st{i’}{j’+1}$$ of \(v\) has \(f_l(v)\leq t\), increment \(j’\).
- Go to the next column by incrementing \(i’\) and \(j’\) by \(1\) and repeat step 2, until \(i’=i+B\).

A drawback of this approach is that \(h\) is evaluated at least once per column, which is slow in practice.

We improve this using the following lemma.

**Lemma 1.** When \(h\) is a consistent heuristic and \(v\preceq v’\), then \(f_l(v’)
\geq f(v) - 2\cdot d(v, v’)\).

**Proof.** By consistency, \(h(v) \leq d(v, v’) + h(v’)\), so \(h(v’) \geq
h(v)-d(v,v’)\). Furthermore, \(\cgap(u,v) \leq \cgap(u,v’) + \cgap(v,v’)\leq
\cgap(u,v) + d(v,v’)\), and hence \(\cgap(u,v’) \geq \cgap(u,v) - d(v,v’)\).
Putting these together we obtain

\begin{align*} f_l(v’) &= \g(u) + \cgap(u,v’) + h(v’) \\ &\geq \g(u) + \cgap(u,v) - d(v,v’) + h(v) - d(v,v’) \\ &= f_l(v) - 2\cdot d(v,v’). \square % TODO \end{align*}

When we have \(f_l(v) > t+2x\), the lemma implies that \(f_l(v’)>t\) for any \(v’\)
with \(d(v,v’)\leq x\). This inspires the following algorithm^{7}, that first takes
just over \(B\) steps down, and then makes jumps to the right.

**Algorithm (sparse bottommost row computation).**

- Start with \(v = \st{i’}{j’} = u+\st{0}{B+8} = \st{i}{r^t_{end} + B + 8}\).
- If \(f_l(v) \leq t\), increase \(j’\) (go down) by \(8\).
- If \(f_l(v) > t\), increase \(i’\) (go right) by \(\min(\lceil(f_l(v)-t)/2\rceil, i+B-i’)\).
- Repeat from step 2, until \(i’ = i+B\).
- While \(f_l(v) > t\), decrease \(j’\) (go up) by \(\lceil(f_l(v)-t)/2\rceil\), but do not go above the diagonal of \(u\).

The resulting \(v\) is the bottommost state in column \(i+B\) with \(f_l(v) \leq t\), and its row is the last row that will be computed.

TODO: Run this algorithm directly on 64-row lanes.

TODO: Can we be optimistic and bump more than \((f_l(v)-t)/2\)? the next value will also block ‘backwards’.

#### 2.8.2 Fixed range

In a column, the *fixed* range is the range of rows between the topmost and
bottommost states with \(f(v)\leq t\), in rows \(r’_{start}\) and \(r’_{end}\)
respectively. Given a range \(R=(r_{start}, r_{end})\) of computed values, one
way to find \(r’_{start}\) and \(r’_{end}\) is by simply iterating from the
start/end of the range and dropping all states \(v\) with \(f(v)>t\).
Like before, this has the drawback that the heuristic must be invoked many
times.

Instead, we have the following lemma, somewhat analogous to Lemma 1:

**Lemma 2.** When \(h\) is a consistent heuristic we have
\[|f(v) - f(v’)| \leq 2 d(v, v’).\]

**Proof.**
The triangle inequality gives \(\g(v) \leq \g(v’) + d(v, v’)\), and consistency
gives \(h(v) \leq h(v’) + d(v,v’)\) TODO WHAT IF \(v\) and \(v’\) ARE IN THE OPPOSITE ORIENTATION??
Expanding the definitions of \(f\), we have

\begin{align*} f(v) - f(v’) &= (g(v) + h(v)) - (g(v’) + h(v’))\\ &= (\g(v) + h(v)) - (\g(v’) + h(v’))\\ &= (\g(v) - \g(v’)) - (h(v) + h(v’))\\ &\leq d(v,v’) + d(v,v’) = 2\cdot d(v,v’). \square \end{align*}

Now we can use a similar approach as before. To find the first row \(j’\) with \(f(\st ij)\leq t\), start with \(v=\st{i’}{j’}=\st{i}{r_{start}}\), and increment \(j’\) by \(\lceil(f(v)-t)/2\rceil\) as long as \(f(v)>t\). The last row is found in the same way.

TODO: Run this algorithm directly on 64-row lanes.

## 3 Results

Compare

- Edlib
- WFA
- A*PA
- A*PA2 no h
- A*PA2 SH+pruning
- A*PA2 GCSH+pruning

on

- synthetic data
- human ONT reads
- human ONT reads with genetic variation

Important:

- Find threshold where heuristics become worth the overhead
- Show benefit of each of the optimizations
- Show sensitivity to parameter tuning

## 4 Conclusion

### 4.1 Summary

### 4.2 Limitations

- Initialization takes time
- WFA is better when edit distance is
*very*low.

### 4.3 Future work

- Pre-pruning for seed&extend methods?
- Semi-global alignment
- Affine alignment
- Local doubling

## Acknowledgements

I am grateful to Daniel Liu for regular discussions, and suggesting additional papers that have been added to the introduction.

## References

*Proceedings of the 1995 Acm/Ieee Conference on Supercomputing (Cdrom) - Supercomputing ’95*. https://doi.org/10.1145/224170.224222.

*Bmc Bioinformatics*17 (1). https://doi.org/10.1186/s12859-016-0930-z.

*Numerische Mathematik*1 (1): 269–71. https://doi.org/10.1007/bf01386390.

*Bmc Bioinformatics*9 (1). https://doi.org/10.1186/1471-2105-9-11.

*Bioinformatics*23 (2): 156–61. https://doi.org/10.1093/bioinformatics/btl582.

*Nucleic Acids Research*12 (1Part1): 175–79. https://doi.org/10.1093/nar/12.1part1.175.

*Journal of Molecular Biology*162 (3): 705–8. https://doi.org/10.1016/0022-2836(82)90398-9.

*Congressus Numerantium*61: 263–74.

*Ieee Transactions on Systems Science and Cybernetics*4 (2): 100–107. https://doi.org/10.1109/tssc.1968.300136.

*Bioinformatics*12 (6): 473–79. https://doi.org/10.1093/bioinformatics/12.6.473.

*Biorxiv*. https://doi.org/10.1101/2021.11.05.467453.

*Siam Review*25 (2): 201–37. https://doi.org/10.1137/1025045.

*Soviet Physics. Doklady*10: 707–10. https://api.semanticscholar.org/CorpusID:60827152.

*Bioinformatics*34 (18): 3094–3100. https://doi.org/10.1093/bioinformatics/bty191.

*Bioinformatics*, btad487. https://doi.org/10.1093/bioinformatics/btad487.

*Bioinformatics*30 (22): 3166–73. https://doi.org/10.1093/bioinformatics/btu507.

*Bioinformatics*37 (4): 456–63. https://doi.org/10.1093/bioinformatics/btaa777.

*Algorithmica*1 (1–4): 251–66. https://doi.org/10.1007/bf01840446.

*Journal of the Acm*46 (3): 395–415. https://doi.org/10.1145/316542.316550.

*Acm Computing Surveys*33 (1): 31–88. https://doi.org/10.1145/375360.375365.

*Journal of Molecular Biology*48 (3): 443–53. https://doi.org/10.1016/0022-2836(70)90057-4.

*Bmc Bioinformatics*10 (S1). https://doi.org/10.1186/1471-2105-10-s1-s10.

*Bmc Bioinformatics*12 (1). https://doi.org/10.1186/1471-2105-12-221.

*Bioinformatics*16 (8): 699–706. https://doi.org/10.1093/bioinformatics/16.8.699.

*Ieee Transactions on Computers*C–23 (9): 907–14. https://doi.org/10.1109/t-c.1974.224054.

*Proceedings of the National Academy of Sciences*69 (1): 4–6. https://doi.org/10.1073/pnas.69.1.4.

*Siam Journal on Applied Mathematics*26 (4): 787–93. https://doi.org/10.1137/0126070.

*Journal of Molecular Biology*147 (1): 195–97. https://doi.org/10.1016/0022-2836(81)90087-5.

*Siam Journal on Applied Mathematics*49 (5): 1552–66. https://doi.org/10.1137/0149094.

*Bioinformatics*7 (1): 1–7. https://doi.org/10.1093/bioinformatics/7.1.1.

*BMC Bioinformatics*19 (S1). https://doi.org/10.1186/s12859-018-2014-8.

*Bmc Research Notes*1 (1): 107. https://doi.org/10.1186/1756-0500-1-107.

*Information and Control*64 (1–3): 100–118. https://doi.org/10.1016/s0019-9958(85)80046-2.

*Cybernetics*4 (1): 52–57. https://doi.org/10.1007/bf01074755.

*Journal of the Acm*21 (1): 168–73. https://doi.org/10.1145/321796.321811.

*Proceedings of the National Academy of Sciences*80 (3): 726–30. https://doi.org/10.1073/pnas.80.3.726.

*Bioinformatics*13 (2): 145–50. https://doi.org/10.1093/bioinformatics/13.2.145.

*Information Processing Letters*35 (6): 317–23. https://doi.org/10.1016/0020-0190(90)90035-v.

*Plos One*8 (12): e82138. https://doi.org/10.1371/journal.pone.0082138.

*Bioinformatics*33 (9): 1394–95. https://doi.org/10.1093/bioinformatics/btw753.

Although Ukkonen didn’t realize this faster runtime and only gave a bound of \(O(nm \log (nm))\). ↩︎

Even

*sparser*memory usage is possible by only storing differences every \(S=s\cdot B\) columns, but in practice this does not matter much. ↩︎A* with a

*consistent*heuristic. ↩︎More precisely, such that in each column there is a state of distance \(\leq t\) above (below) with distance \(\leq t\). ↩︎

Bound checks omitted. ↩︎

Bound checks omitted. ↩︎