This post is published at WABI24 (PDF). Content below is slightly out of sync.
Groot Koerkamp, Ragnar. 2024. “A*PA2: Up to 19 Faster Exact Global Alignment.” In 24th International Workshop on Algorithms in Bioinformatics (Wabi 2024), edited by Solon P. Pissis and Wing-Kin Sung, 312:17:1–17:25. Leibniz International Proceedings in Informatics (Lipics). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPIcs.WABI.2024.17.
Methods. We introduce A*PA2, an exact global pairwise aligner with respect to
edit distance. The goal of A*PA2 is to unify the near-linear runtime of A*PA on
similar sequences with the efficiency of dynamic programming (DP) based methods.
Like Edlib, A*PA2 uses Ukkonen’s band doubling in combination
with Myers’ bitpacking. A*PA2 1) extends this with SIMD (single instruction,
multiple data), 2) uses large block
sizes inspired by Block Aligner, 3) avoids recomputation of states where
possible as suggested before by Fickett, 4) introduces a new optimistic technique for
traceback based on diagonal transition, and 5) applies the heuristics
developed in A*PA and improves them using pre-pruning.
Results.
The average runtime of A*PA2 is faster than BiWFA and Edlib on kbp long ONT reads of a
human genome having divergence on average. On shorter ONT reads of
average divergence the speedup is (avg. length kbp)
and (avg. length bp).
The problem of global pairwise alignment is to find the shortest sequence of
edit operations (insertions, deletions, substitutions) to convert a string
into a second string (Needleman and Wunsch 1970; Vintsyuk 1968), where the number of such
operations is called the Levenshtein distance or edit distance
(Levenshtein 1965; Wagner and Fischer 1974).
Over time, the length of genomic reads has increased from hundreds of basepairs
to hundreds of thousands basepairs now. Meanwhile, the complexity of exact
algorithms has not been improved by more than a constant factor since the
introduction of the diagonal transition algorithm (Ukkonen 1985; Myers 1986).
Our recent work A*PA (Groot Koerkamp and Ivanov 2024) uses the A* shortest path algorithm to
speed up alignment and has near-linear runtime when divergence is low. A
drawback of A* is that it uses a queue and must store all computed distances,
causing large overhead compared to methods based on dynamic programming (DP).
This work introduces A*PA2, a method that unifies the heuristics and near-linear
runtime of A*PA with the efficiency of DP based methods.
As Fickett (1984, 1) stated 40 years ago and still true today,
at present one must choose between an algorithm which gives the best alignment
but is expensive, and an algorithm which is fast but may not give the best
alignment.
In this paper we narrow this gap and show that A*PA2 is nearly as fast as
approximate methods.
We introduce A*PA2, an exactly pairwise global sequence aligner with respect to
edit distance.
In A*PA2 we combine multiple existing techniques and introduce a number of new
ideas to obtain up to speedup over existing single-threaded exact
aligners. Furthermore, A*PA2 is often faster and never much slower than
approximate methods.
As a starting point, we take the band doubling algorithm implemented
by Edlib (Šošić and Šikić 2017) using bitpacking (Myers 1999). First, we speed up the
implementation (points 1., 2., and 3.). Then, we reduce the amount of work done
(points 4. and 5.). Lastly, we apply A* (point 6.).
Block-based computation. Edlib computes one column of the DP matrix at a
time, and for each column decides which range (subset of rows) of states to
compute. We significantly reduce this overhead by processing blocks of
columns at a time (Table 1f), taking inspiration from Block aligner
(Liu and Steinegger 2023). Correspondingly, we only store states of the
DP-matrix at block boundaries, reducing memory usage.
SIMD. We speed up the implementation using bit SIMD to compute each block,
allowing the processing of computer words in parallel.
Novel encoding. We introduce a novel encoding of the input sequence to
speed up SIMD operations by comparing characters bit-by-bit and avoiding slow
gather instructions.
This limits the current implementation to size- alphabets.
Incremental doubling. Both the band doubling methods of Ukkonen (1985)
and Edlib recompute states after doubling the threshold. We avoid this by
using the theory behind the A* algorithm, extending the incremental doubling
of Fickett (1984) to blocks and arbitrary heuristics.
Traceback. For the traceback, we optimistically use the diagonal transition
method (Ukkonen 1985; Myers 1986; Marco-Sola et al. 2020) within each block with a strong
adaptive heuristic, falling back to a full recomputation of the block when
needed.
A*. We apply the gap-chaining seed heuristic of A*PA (Groot Koerkamp and Ivanov 2024)
(Table 1fg), and improve it using pre-pruning. This technique discards most of
the spurious (off-path) matches ahead of time.
Table 1:
Alignment of two sequences of length bp with 20% divergence using different algorithms. Coloured pixels correspond to visited states in the edit graph or dynamic programming matrix, and the blue to red gradient indicates the order of computation.
In the following, we give a brief recap of developments that this work builds
on, in chronological order per approach. See also e.g. the reviews by
Kruskal (1983) and Navarro (2001), and the introduction of our
previous paper Groot Koerkamp and Ivanov (2024). 2 covers relevant topics more formally.
Pairwise alignment has classically been approached as a dynamic programming
problem. For input strings of lengths and , this method creates a table that is filled cell by cell using a recursive formula.
Needleman and Wunsch (1970) gave the first algorithm, and Sellers (1974) and
Wagner and Fischer (1974) improved this to what is now known as the Needleman-Wunsch algorithm, building on the quadratic algorithm for longest
common subsequence by Sankoff (1972).
It was already realized early on that an optimal alignment
corresponds to a shortest path in the edit graph
(Vintsyuk 1968; Ukkonen 1985) (see 2). Both Ukkonen (1985) and Myers (1986)
remarked that this can be solved using Dijkstra’s algorithm (Dijkstra 1959),
taking time (Table 1a), where is the edit distance between
the two strings and is typically much smaller than the string length.
(Although Ukkonen only gave a bound of .) However, Myers (1986, 2) observes that
the resulting algorithm involves a relatively complex discrete priority queue
and this queue may contain as many as entries even in the case where just
the length of the […] shortest edit script is being computed.
Hadlock (1988) realized that Dijkstra’s algorithm can be improved
upon by using A* (Hart, Nilsson, and Raphael 1968), a more informed algorithm that uses a
heuristic function that gives a lower bound on the remaining edit distance
between two suffixes. He uses two heuristics, the widely
used gap cost heuristic
(Ukkonen 1985; Hadlock 1988; Wu et al. 1990; Spouge 1989, 1991; Papamichail and Papamichail 2009)
that simply uses the difference between the lengths of the suffixes as lower
bound (Table 1d), and a new improved heuristic based on character frequencies in the two
suffixes. A*PA (Groot Koerkamp and Ivanov 2024) improves the seed heuristic of Ivanov, Bichsel, and Vechev (2022) to the gap-chaining seed heuristic with pruning
to obtain near-linear runtime when errors are uniform random (Table 1g).
Nevertheless, as Spouge (1991, 3) states,
algorithms exploiting the lattice structure of an alignment graph are usually faster.
This suggests a radical approach to A* search complexities: dispense with the
lists [of open states] if there is a natural order for vertex expansion.
In this work we follow this advice and replace the plain A* search in A*PA with a much
more efficient approach based on computational volumes that merges DP and A*.
Wilbur and Lipman (1983) is, to our knowledge, the first paper that speeds up
the DP algorithm, by only considering states near diagonals with many
k-mer matches, but at the cost of giving up the exactness of the method.
Fickett (1984) notes that for some chosen parameter that is at
least the edit distance , only those DP-states with cost at most need to
computed. This only requires time, which is fast when is an accurate bound on
the distance . For example can be set as a known upper bound for the
data being aligned, or as the length of a suboptimal alignment. When
turns out too small, a larger new bound can be chosen, and only
states with distance in between and have to be computed.
When increases by in each iteration, this closely mirrors Dijkstra’s algorithm.
Ukkonen (1985) introduces a very similar idea, statically bounding the
computation to only those states that can be contained in a path of length at most
from the start to the end of the graph (Table 1c).
On top of this, Ukkonen introduces band doubling: can be
doubled () until is at least the actual distance .
This find the alignment in time.
Spouge (1989) unifies the methods of
Fickett and Ukkonen in computational volumes
(see 2): small subgraphs of the full edit graph that are guaranteed to
contain the shortest paths. As Spouge notes:
The order of computation (row major, column major or antidiagonal) is just a
minor detail in most algorithms.
But this is exactly what was investigated a lot in the search for more efficient implementations.
In the 1990s, the focus shifted from reducing the number of computed states to
computing states faster through advancements in implementation and hardware.
This resulted in a plethora of new methods. While there many recent methods
optimizing the computation of arbitrary scoring schemes and affine costs
(Smith and Waterman 1981; Gotoh 1982; Bergeron and Hamel 2002; Suzuki and Kasahara 2018; Shao and Ruan 2024), here we focus on
methods for computing edit distance.
The first technique in this direction is microparallelism (Alpern, Carter, and Gatlin 1995),
also called SWAR (SIMD within a register),
where each ( bit) computer word is divided into multiple (e.g. bit) parts,
and word-size operations modify all () parts in parallel.
This was then applied with inter-sequence parallelism to align a
given query to multiple reference sequences in parallel
(Alpern, Carter, and Gatlin 1995; Baeza-Yates and Gonnet 1992; Wu and Manber 1992; Hyyrö, Fredriksson, and Navarro 2005; Rognes 2011).
Hughey (1996) notes that anti-diagonals of the DP matrix are independent
and can be computed in parallel, to speed up single alignments. Wozniak (1997) applied SIMD (single
instruction, multiple data) for this purpose, which are special CPU instructions
that operate on multiple (currently up to , for AVX-512) computer words at a time.
Rognes and Seeberg (2000, 702) also use microparallelism, but use vertical
instead of anti-diagonal vectors:
The advantage of this approach is the much-simplified and faster loading of the
vector of substitution scores from memory. The disadvantage is that data
dependencies within the vector must be handled.
To work around these dependencies, Farrar (2006) introduces an alternative striped SIMD scheme where lanes are
interleaved with each other. A*PA2 does not use this, but for example
Shao and Ruan (2024) does.
Myers (1999) introduces a bitpacking algorithm specifically for edit
distance (Table 1f). It bit-encodes the differences between states in a
column into two computer words and gives an efficient algorithm to operate on
them. This provides a significant speedup over previous methods. The supplement of
BitPAl (Loving, Hernandez, and Benson 2014; Benson, Hernandez, and Loving 2013) introduces an alternative scheme for edit
distance based on a different bit-encoding, but as both methods end up using
instructions (see 6.1) we did not pursue this further.
Dedicated global alignment implementations implementing band-doubling are much rarer.
Edlib (Šošić and Šikić 2017) implements band doubling and Myers’ bitpacking (Table 1d).
KSW2 implements band doubling for affine costs (Suzuki and Kasahara 2018; Li 2018).
WFA and BiWFA (Marco-Sola et al. 2020, 2023) implement the expected time diagonal transition
algorithm (Ukkonen 1985; Myers 1986) (Table 1b).
Block aligner (Liu and Steinegger 2023) is an approximate aligner that can handle
position-specific scoring matrices whose main novelty is to divide the
computation into larger blocks.
Recently Shao and Ruan (2024) provided a new implementation of band doubling based
on Farrar’s striped method that focusses on affine costs but also supports edit distance.
Lastly, A*PA (Groot Koerkamp and Ivanov 2024) directly implements A* on the alignment graph using
the gap-chaining seed heuristic.
Figure 1: An example of an edit graph (left) corresponding to the alignment of strings ABCA and ACBBA, adapted from Sellers (1974). Solid edges indicate insertion/deletion/substitution edges of cost (1), while dashed edges indicate matches of cost (0). All edges are directed from the top-left to the bottom-right. The shortest path of cost (2) is shown in blue. The right shows the corresponding dynamic programming (DP) matrix containing the distance (g(u)) to each state.
Edit graph. We take as input two zero-indexed sequences and over an alphabet of size
of lengths and . The edit graph (Figure 1) contains states (, ) as vertices. It further contains directed insertion and
deletion edges and of cost ,
and diagonal edges of cost when and
substitution cost otherwise. A shortest path from to in the edit graph corresponds to an alignment of and .
The distance from to is the length of the shortest (minimal
cost) path from to , and we use distance, length, and cost interchangeably.
Further we write
for the distance from the start to ,
for the distance from to the end, and for the minimal cost
of a path through .
A* is a shortest path algorithm based on a heuristic function (Hart, Nilsson, and Raphael 1968). A
heuristic is called admissible when underestimates the distance to the
end, i.e., , and admissible guarantee that A* finds a
shortest path. A* expands states in order of increasing , where is the best distance to found so far. We say that
is fixed when the distance to has been found, i.e., .
Computational volumes. Spouge (1989) defines a computational volume as a subgraph of the
alignment graph that contains all shortest paths . Given a bound , some examples of
computational volumes are:
, for any admissible heuristic , which we will
use and is similar to A* (Table 1fg).
Band-doubling is the following algorithm by Ukkonen (1985), that depends on the choice of
computational volume being used.
Start with edit-distance threshold .
Loop over columns from to .
For each column, determine the range of rows to be
computed according to the computational volume that’s being used.
a. If this range is empty or does not contain a state at distance , double and go back to step 1.
b. Otherwise, compute the distance to the states in the range, and continue
with the next column.
The algorithm stops when . For the
computational volume used by Ukkonen, each
test requires time, and hence the total time is
Note that this method does not (and indeed can not) reuse values from previous
iterations, resulting in roughly a factor overhead.
Myers’ bitpacking exploits that the difference in distance to adjacent states
is always in (Myers 1999). The method bit-encodes differences between
adjacent states in a columns in two
indicator words, indicating positions where the difference is and respectively.
Given also the similarly encoded difference along the top, a
rectangle can be computed in only bit operations (6.1).
We call each consecutive non-overlapping chunk of rows a lane, so that
there are lanes, where the last lane may be padded.
Note that this method originally only uses instructions, but some additional
instructions are needed to support multiple lanes when .
Profile. Instead of computing each substitution score for the states in a word one by one, Myers’ algorithm first builds a
profile (Rognes and Seeberg 2000). For each character , stores a bitvector indicating
which characters of equal . This way, adjacent scores in a column
are simply found as .
Edlib implements band doubling using the computational
volume and bitpacking (Šošić and Šikić 2017). For traceback, it uses Hirschberg’s meet-in-the-middle
approach: once the distance is found, the alignment is started over from both
sides towards the middle column, where a state on the shortest path is
determined. This is recursively applied to the left and right halves until the
sequences are short enough that memory can be used.
Conceptually, A*PA2 builds on Edlib.
First we describe how we make the implementation more efficient
using SIMD and blocks.
Then, we modify the algorithm itself by using a new traceback method and
avoiding unnecessary recomputation of states.
On top of that, we apply the A*PA heuristics for further speed gains on large/complex
alignments, at the cost of larger precomputation time to build the heuristic.
A*PA2 uses band-doubling with the computational volume.
That is, in each iteration of we compute the distance to all states with
. In its simple form, we use , like
Edlib does. We start doubling at , so that , where is the block size introduced below.
Instead of determining the range of rows to be computed for each column
individually, we determine it once per block and then reuse it for
consecutive columns. This computes some extra states, but reduces the overhead
by a lot. (From here on, stands for the block size, and not for the sequence
to be aligned.)
Within each block, we iterate over the lanes of rows at a time, and for
each lane compute all columns before moving on to the next lane.
3.8 explains in detail how the range of rows to be
computed is determined.
Where Edlib does not initially store intermediate values and uses
meet-in-the-middle to find the alignment, A*PA2 always stores the distance to
all states at the end of a block, encoded as the distance to the top-right state
of the block and the bit-encoded vertical differences along the right-most
column. This simplifies the traceback method (see 3.6), and has
sufficiently small memory usage to be practical.
Figure 2: SIMD processing of two times 4 lanes in parallel. This example uses 4-row (instead of 64-row) lanes. First the top-left triangle is computed lane by lane, and then 8-lane diagonals are computed by using two 4-lane SIMD vectors in parallel.
While it is tempting to use a SIMD vector as a single -bit word, the four
-bit words (SIMD lanes) are dependent on each other and require manual
work to shift bits between the lanes.
Instead, we let each -bit AVX2 SIMD vector represent four -bit words
(lanes) that are anti-diagonally
staggered as in Figure 2. This is similar to the original anti-diagonal tiling
introduced by Wozniak (1997), but using units of -bit words instead of
single characters. This idea was already introduced in 2014 by the author of
Edlib in a GitHub issue (https://github.com/Martinsos/edlib/issues/5), but to our
knowledge has never been implemented either in Edlib or elsewhere.
We further improve instruction-level-parallelism (ILP) by processing lanes
at a time using two SIMD vectors in parallel, spanning a total of rows (Figure 2).
When the number of remaining lanes to be computed is , we
process lanes in parallel as long as . If there are remaining
lanes, we end with another -lane () or -lane ()
iteration that optionally includes some padding lanes at the bottom.
In case the horizontal differences along the original bottom row are needed (as
required by incremental doubling 3.9), we
can not use padding and instead fall back to trying a -lane SIMD (),
a -lane SIMD (), and lastly a scalar iteration ().
A drawback of anti-diagonal tiling is that each column contains its own
character that needs to be looked up in the profile . While SIMD can do multiple
lookups in parallel using gather instructions, these instructions are
not always efficient. Thus, we introduce the following alternative scheme:
Let be the number of bits needed to encode
each character, with for DNA.
For each lane, the new profile stores words as an array . Each word
stores the negation of the th bit of each character.
To check which characters in lane contain character with bit representation
, we precompute words to
and then compute
, where denotes the xor operation.
As an example take and a lane with characters .
Then and ,
keeping in mind that bits are shown in reverse order in this notation.
If the column now contains character we initialize
and and compute
indicating
that -based positions and contain character .
This naturally extends to SIMD vectors, where each lane is initialized with its
own constants.
Figure 3: Traceback method. States expanded by the diagonal transition traceback in each block are shown in green. When the distance in a block is too large, a part of the block is fully recomputed as fallback, as shown in blue.
The traceback stage takes as input the computed vertical differences at
the end of each block of columns. We iteratively work backwards through the
blocks. In each step, we know the distances to
the states in column and the state in column
that is on the optimal path and has distance .
The goal is to find an optimal path from column to .
A naive approach is to simply recompute the entire block of columns while
storing distances to all states. Here we consider two more efficient methods.
Optimistic block computation.
Instead of computing the full range for this column, a
first insight is that only rows up to are needed, since the optimal path to
can never go below row .
Secondly, the path crosses columns, and so we optimistically assume that
it will be contained in rows to . Thus, we first compute the
distance to all states in this range of rows (rounded out to multiples of
). If the distance to computed this way agrees with the known
distance, there is a shortest path contained within the computed rows and we
trace it one state at a time. Otherwise, we repeatedly try again with double the
number of lanes, until success. The exponential search ensures low overhead and
good average case performance.
Optimistic diagonal transition traceback (DTT).
A second improvement uses the diagonal transition algorithm backwards from
. We simply run the unmodified algorithm on the reverse graph covering
columns to and rows to .
Whenever a state in column is reached, with distance from , we check
whether , and continue until a is found for which this holds.
We then know that lies on a shortest path and can find the path from to
by a usual traceback on the diagonal transition algorithm.
As an optimization, when no suitable is found after trying all states at
distance , we abort the DTT and fall back to the block doubling described above.
Another optimization is the WF-adaptive heuristic introduced by WFA: all states
that lag more than behind the furthest reaching diagonal are dropped.
Lastly, we abort early when after reaching distance , less than half
the columns were reached.
Figure 3 shows that in regions with low divergence, the DTT is sufficient to trace
the path, and only in noisy regions the algorithm falls back to recomputing full blocks.
Edlib already uses a simple gap-cost heuristic that gives a lower bound on the
number of insertions and deletions on a path from each state to the end.
We replace this by the much stronger gap-chaining seed heuristic (GCSH) introduced in A*PA.
In A*PA, matches are pruned as soon as a shortest path to their start has been
found. This helps to penalize states before (left of) the match. Each
iteration of our new algorithm works left-to-right only, and thus pruning of
matches does not affect the current iteration. Instead of pruning on the fly, we
collect all matches to be pruned at the end of each iteration, and update the
contours in one right-to-left sweep.
To ensure the band doubling approach remains valid after pruning, we ensure that
the range of computed rows never shrinks after an increase of and subsequent
pruning.
Table 2:Effect of pre-pruning on chaining seed heuristic (CSH) contours. The left shows contours and layers of the heuristic at the end of an A*PA alignment, after matches (black diagonals) on the path have been pruned (red). The right shows pre-pruned matches in purple and the states visited during pre-pruning in green. After pre-pruning, almost no off-path matches remain. This decreases the number of contours, making the heuristic stronger, and simplifies contours, making the heuristic faster to evaluate.
Here we introduce an independent optimization that also applies to the original
A*PA method.
Each of the heuristics introduced in A*PA depends on the set of matches
. Given that contains all matches, is an
admissible heuristic that never overestimates the true distance. Even after
pruning some matches, is still a lower bound on the length of a
path not going through already visited states.
Now consider an exact match from to for seed . The existence
of the match is a ‘promise’ that seed can be crossed for free. When
is a match outside the optimal alignment,
it is likely that can not be extended into a longer alignment. When indeed
can not be extended into an alignment of and of cost less
than , the existence of was a ‘false promise’, since crossing the two
seeds takes cost at least . Thus, we can ignore and remove from the
heuristic, making the heuristic more accurate.
More generally, we try to extend each match into an alignment covering seeds
up to (but excluding) for all . If any of these
extensions has cost at least , i.e. falsely promised that to
can be crossed for cost , we pre-prune (remove) .
We try to extend each match by running the diagonal transition algorithm
from the end of each match, and dropping any furthest reaching points that are
at distance while at most seeds have been covered.
As shown in Table 2b, the effect is that the number of off-path matches is
significantly reduced. This makes contours faster to initialize, update, and
query, and increases the value of the heuristic
For each block spanning columns to , only a subset of rows is computed in each iteration.
Namely, we only compute those rows that can possibly contain states on a
path/alignment of cost at most .
Intuitively, we try to ’trap’ the alignment inside a wall of states that can not lie
on a path of length at most (i.e. have ), as can be seen in Table 3a.
We determine this range of rows in two steps:
First, we determine the fixed range at the end of the preceding block.
I.e., we find the topmost and
bottom-most states and with . All in-between states with
are then fixed, meaning that the correct distance has been found and .
Then, we use the heuristic to find the bottom-most state at the
end of the to-be-computed block that can possibly lie on a path of length
.
We then compute rows to in columns to ,
rounding out to the previous/next multiple of the word size .
Step 1: Fixed range.
Suppose that states in rows were computed.
One way to find and is by simply iterating inward from the
start/end of the range and dropping all states with
, as indicated by the red columns in Table 3a.
Step 2: End of computed range.
We will now determine the bottom-most row that can contain a state at
distance at the end of the block. Let be the
bottom-most fixed state in column with distance . Let be a state in the current block () that is below
the diagonal of . Suppose lies on a path of length . This path
most cross column in or above , since states below have .
The distance to is now at least , and thus we define
as a lower bound on the length of the shortest path through , assuming is
below the diagonal of and . When , this implies
and also for all below .
The end of the range is now computed by finding the bottom-most state in each
column for which is at most , using the following
algorithm (omitting boundary checks).
Start with .
While the below-neighbour of has , increment .
Go to the next column by incrementing and by and repeat step 2, until .
The row of the last we find in this way is the bottom-most state
in column that can possibly have , and hence this is end of
the range we compute.
In Table 3a, we see that is evaluated at a diagonal of states just below
the bottommost green (fixed) state at the end of the preceding black, and that the to-be-computed range
(indicated in blue) includes exactly all states above the diagonal.
Table 3:Detail of computed ranges. Coloured states are invocations of . Red: , green: and is fixed, and blue: , but only tentatively. Vertical black rectangles indicated fixed states, and blue rectangles indicate the range of rows that must be computed for each block. The third block has no fixed states in its right column, indicating that must be increased.
A drawback of the previous method is that it requires a large number of
calls to and hence the heuristic : roughly one per column and one per row.
Here we present a sparse version that uses fewer calls to , based on two
similar lemmas.
Lemma 1. When is admissible and , then when .
Proof.
Since adjacent states differ in distance by , we
have and .
Now suppose that . Then is fixed and we have , and since is admissible . Thus:
This is in contradiction with , so we must have , as required.
Lemma 2. When is admissible, is below the diagonal of a computed state
, and , then when .
Proof.
We have , and .
From before we already know that , and we still
have and and .
The result follows directly:
Sparse fixed range. To find the first row with , start with , and increment by
as long as , since none of the intermediate
states can lie on a path of length by Lemma 1. The last row is found in the same
way. As seen in Table 3b, this sparse variant significantly reduces the number
of evaluations of the heuristic in the right-most columns of each block.
Sparse end of computed range.
Lemma 2 inspires the following algorithm (Table 3b). Instead of considering
one column at a time, we now first make a big just down and then jump to the right.
Start with .
If , increase (go down) by .
If , increase (go right) by , but do not exceed column .
Repeat from step 2, until .
While , decrease (go up) by , but
do not go above the diagonal of .
The resulting is again the bottom-most state in column that can
potentially have , and its row is the last row that will be computed.
Table 4:Incremental doubling detail. Blue rectangles show the ranges required to be computed, and grey the computed blocks. Vertical green rectangles show the fixed range at the end of each block, and horizontal rectangles a fixed row of states inside some blocks. In both figures the third column was just computed, in the first (left) and second (right) iteration of trying a threshold. The black horizontal rectangle indicates the new candidate for fixed horizontal region.
When the original band doubling algorithm doubles the threshold from to ,
it simply recomputes the distance to all states. On the
other hand, BFS, Dijkstra, and A* with a consistent heuristic visit
states in increasing order of distance ( for BFS and Dijkstra, for A*), and the distance to a state is known to be correct
(fixed) as soon as it is expanded. This way a state is never expanded twice.
Indeed, our band-doubling algorithm can also avoid recomputations. After
completing the iteration for , it is guaranteed that the distance is fixed
to all states that indeed satisfy . In fact a stronger result holds:
in any column the distance is fixed for all states between the topmost
and bottom-most state in that column with .
To be able to skip rows, we must store horizontal differences along
a row so we can continue from there. We choose this row (for fixed)
as the last row at a lane boundary before the end of the fixed states
in the last column of the preceding block, as indicated in Table 4 by a
horizontal black rectangle. In the first iteration, reusing values is not
possible, so we split the computation of the block into two parts (Table 4a): one above
, to extract and store the horizontal differences at , and the remainder below .
In the second and further iterations, the values at may be
reused and the block is split into three parts. The first part computes all
lanes covering states before the start of the already-fixed range at the end of the block (the
green column at the end of the third column in Table 4b). Then we skip the
lanes up to the previous , since the values at both the bottom and right of this
region are already fixed. Then, we compute the lanes between the old and its new
value . Lastly we compute
the lanes from to the end.
Our implementation A*PA is written in Rust and available at
github.com/RagnarGrootKoerkamp/astar-pairwise-aligner. We compare it against
other aligners on real datasets, report the impact of the individual
techniques we introduced, and measure time and memory usage.
Four datasets containing Oxford Nanopore Technologies (ONT) reads are reused
from the WFA, BiWFA, and A*PA evaluations (Marco-Sola et al. 2020, 2023; Groot Koerkamp and Ivanov 2024). Of these,
the ‘>500kbp’ and ‘>500kbp with genetic variation’ datasets have divergence
, while two ‘1kbp’ and ‘10kbp’ datasets are filtered for sequences of
length <1kbp and <50kbp have average divergence and average sequence length
bp and kbp.
A SARS-CoV-2 dataset was newly generated by downloading 500MB of viral sequences
from the COVID-19 Data Portal, covid19dataportal.org (Harrison et al. 2021),
filtering out non-ACTG characters, and selecting random pairs. This
dataset has average divergence and length kbp.
For each set, we sorted all sequence pairs by edit distance and split them
into files each containing multiple pairs, with the first file containing the
of pairs with the lowest divergence. Reported results are averaged over
the sequences in each file.
Algorithms and aligners.
We benchmark A*PA2 against state-of-the-art exact aligners Edlib,
BiWFA, and A*PA. We further compare against the approximate aligners
WFA-Adaptive (Marco-Sola et al. 2020) and Block Aligner.
For WFA-Adaptive we use default parameters , dropping states that lag behind by more than .
For Block Aligner we use block sizes from to of the input size.
Block Aligner only supports affine costs so we use gap-open cost instead of .
We compare two versions of A*PA2.
A*PA2-simple uses all engineering optimizations (bitpacking, SIMD,
blocks, new traceback) and uses the simple gap-heuristic.
A*PA2-full additionally uses more complicated techniques:
incremental-doubling, and the gap-chaining seed heuristic introduced by
A*PA with pre-pruning.
Parameters.
For A*PA2, we fix block size . For A*PA2-full, we use the gap-chaining seed
heuristic (GCSH) of A*PA with exact matches () and seed length . We
pre-prune matches by looking ahead up to seeds.
A detailed parameter comparison can be found in 6.2.
For A*PA, we use inexact matches () with seed length by default, and
only change this for the low-divergence SARS-CoV-2 dataset and divergence
synthetic data, where we use exact matches () instead.
Execution.
We ran all benchmarks using PaBench (github.com/pairwise-alignment/pa-bench) on
Arch Linux on an Intel Core i7-10750H with GB of memory and cores,
with hyper-threading disabled, frequency boost disabled, and CPU power saving
features disabled. The CPU frequency is fixed to GHz and we run
single-threaded job at a time with niceness . Reported running times are
the average wall-clock time per alignment and do not include the time to read
data from disk. For A*PA2-full, reported times do include the time to find matches and
initialize the heuristic.
Speedup on real data.Figure 4 compares the running time of aligners on real datasets.
6.2 contains a corresponding table of average runtimes.
For long ONT reads, with divergence, A*PA2-full is faster
than Edlib, BiWFA, and A*PA in average running time, and using the gap-chaining
seed heuristic in A*PA2-full provides speedup over A*PA2-simple.
On shorter sequences, the overhead of initializing the heuristic in A*PA2-full is large, and
A*PA2-simple is faster. For the 10kbp dataset, A*PA2-simple is
faster than other exat methods.
For the shortest (<1kbp ONT reads) and most similar sequences (SARS-CoV-2
with divergence), BiWFA is usually faster than Edlib and A*PA2-simple. In these cases,
the overhead of using wide blocks is relatively large compared to the
edit distance in combination with BiWFAs expected running time.
Figure 4: Runtime comparison (log). Each dot shows the running time of a single alignment (right two plots) or the average runtime over (2%) of the input pairs (left four plots). Box plots show the three quartiles, and the red circled dot shows the average running time over all alignments. For APA, exact matches ((r=1)) are used for the SARS-CoV-2 dataset, some alignments kbp time out, and the shown average is a lower bound on the true average. Approximate aligners WFA Adaptive and Block Aligner are indicated with a triangle. On the >500kbp reads, APA2-full is (20times) faster than other methods.
Comparison with approximate aligners.
For the smallest datasets, BiWFA is about as fast as the approximate methods WFA
Adaptive and Block Aligner, while for the largest datasets A*PA2-full is
significantly faster. Only for the set of kbp ONT reads is Block Aligner
significantly () faster than the fastest exact method. For the
two smallest datasets, approximate aligners do not significantly improve on
BiWFA.
Only on the kbp ONT reads dataset is Block Aligner faster than
A*PA2, but it only reports of the alignments correctly. All accuracy
numbers can be found in 6.2.
Scaling with divergence.Table 5 compares the runtime of aligners on synthetic sequences of increasing
divergence. BiWFA’s runtime grows quadratically, while Edlib grows
linearly and jumps up each time another doubling of the threshold is required.
A*PA is fast until the maximum potential is reached at resp. and
then becomes very slow. A*PA2 behaves similar to Edlib and jumps up each time
another doubling of the threshold is needed, but is much faster.
It outperforms BiWFA for divergence and A*PA for divergence
.
The runtime of A*PA2-full is near-constant up to divergence due to the
gap-chaining seed heuristic which can correct for up to of divergence, while
A*PA2-simple starts to slow down because of doubling at lower divergence.
For a fixed number of doublings of the threshold, A*PA2 is faster for higher
divergence because too low thresholds are rejected more quickly.
Table 5:Runtime scaling with divergence. Average running time of aligners over sequences of length kbp with varying uniform divergence. The right plot is the same but zoomed in.
Scaling with length.Table 6 compares the runtime of aligners on synthetic random sequences of increasing
length and constant uniform divergence.
BiWFA’s runtime is quadratic and is fast for sequences up to bp.
As expected, A*PA2-simple has very similar scaling to Edlib but is faster by a
constant factor. A*PA2-full includes the gap-chaining seed heuristic used by
A*PA, resulting in comparable speed and near-linear scaling for both of them
when . For more divergent sequences, A*PA2-full is faster than A*PA since
initializing the A*PA heuristic with inexact matches is relatively slow.
The reason A*PA2-full is slower than A*PA for sequences of length Mbp is
that A*PA2-full uses seed length instead of , causing the number of matches
to explode when approaches .
Table 6:Runtime scaling with length. Log-log plot of average running time of aligners on synthetic sequences of increasing length with divergence (left) and divergence (right). A*PA uses exact matches () for and inexact matches () for . For sequences of length , averages are over pairs. Lines are fitted in the log-log domain. The region between linear and quadratic growth is shaded in grey.
Memory usage of A*PA2 on >500kbp sequences is at most MB and MB in
median. For shorter sequences, memory usage is always less than MB (6.2).
Incremental improvements.Figure 5 shows the effect of one-by-one adding improvements to A*PA2 on
>500kbp long sequences, starting with Ukkonen’s band-doubling method using Myers’
bitpacking. We first change to the domain, making it
comparable to Edlib. Then we process blocks of columns at a time and only
store differences at block boundaries giving speedup. Adding
SIMD gives another speedup, and instruction level parallelism
(ILP) provides a further small improvement. The diagonal transition traceback
(DTT) and sparse heuristic computation do not improve performance of
A*PA2-simple much on long sequences, but their removal can be seen to slow it
down for shorter sequences in Figure 7.
Incremental doubling (ID), the gap-chaining seed heuristic (GCSH), pre-pruning
(PP), and the pruning of A*PA give another speedup on average and
speedup in the first quantile.
Figure 5: Effect of adding features. Box plots showing the performance improvements of APA2 when incrementally adding new methods one-by-one. APA2-simple corresponds to the rightmost red columns, and A*PA2-full corresponds to the rightmost blue column.
Runtime profile. In Figure 6 we see that for >500kbp long sequences,
A*PA2-full spends most of its time computing blocks, followed by the
initialization of the heuristic. For shorter sequences the heuristic is not
used, and for very short sequences <10kbp, up to half the time is spent on
tracing the optimal alignment.
Figure 6: Runtime distribution per stage of A*PA2, using APA2-simple for short sequences and APA2-full for the two rightmost >500kbp datasets. Each column corresponds to a (set of) alignment(s), which are sorted by total runtime. Overhead is the part of the runtime not measured in one of the other parts and includes the time to build the profile.
We have shown that by incorporating many existing techniques and by writing highly
performant code, A*PA2 achieves speedup over other methods when
aligning kbp ONT reads with divergence, speedup for
sequences of average length kbp, and only a slight slowdown over BiWFA for
very short ( bp) and very similar ( divergence) sequences.
A*PA2’s speed is also comparable to approximate aligners, and is faster for long
sequences, thereby nearly closing the gap between approximate and exact methods.
A*PA2
achieves this by building on Edlib, using band doubling, bitpacking, blocks,
SIMD, the gap-chaining seed heuristic, and pre-pruning. The effect of this is
that A*PA2-simple has similar scaling behaviour as Edlib in both length and
divergence, but with a significantly better constant. A*PA2-full additionally
includes the A*PA heuristics and achieves the best of both worlds:
the near-linear scaling with length of A*PA when divergence is small, and the efficiency of Edlib.
Limitations.
The main limitation of A*PA2-full is that the heuristic requires finding all
matches between the two input sequences, which can take long compared to the
alignment itself.
For sequences with divergence , BiWFA exploits the
sparse structure of the diagonal transition algorithm. In comparison, computing full
blocks of size around in A*PA2 has considerable overhead.
Only sequences over alphabet size are currently supported, so DNA
sequences containing e.g. N characters must be cleaned first.
Future work.
When divergence is low, performance could be improved by applying A* to the
diagonal transition algorithm directly, instead of using DP. As a middle
ground, it may be possible to compute individual blocks using DT when the
divergence is low.
Currently A*PA2 is completely unaware of the type of sequences it aligns.
Using an upper bound on the edit distance, either known or found using a
non-exact method, could avoid trying overly large thresholds and smoothen the
curve in Table 5.
It should be possible to extend A*PA2 to open-ended and semi-global
alignment, just like Edlib and WFA support these modes.
Extending A*PA2 to affine cost models should also be possible. This will
require adjusting the gap-chaining seed heuristic, and changing the
computation of the blocks from a bitpacking approach to one of the
SIMD-based methods for affine costs.
Lastly, TALCO (Tiling ALignment using COnvergence of traceback pointers,
https://turakhia.ucsd.edu/research/) provides an interesting idea: it may be
possible start traceback while still computing blocks, thereby saving memory.
I am grateful to Daniel Liu for discussions, feedback, and suggesting additional
related papers, to André Kahles, Harun Mustafa, and Gunnar Rätsch for feedback
on the manuscript, to Andrea Guarracino and Santiago Marco-Sola for sharing the
WFA and BiWFA benchmark datasets, and to Gary Benson for help with debugging the
BitPAl bitpacking code. RGK is financed by ETH Research Grant ETH-1721-1 to
Gunnar Rätsch.
Code Snippet 1 shows a SIMD version of Myers’ bitpacking algorithm, and
Code Snippet 2 shows a SIMD version of the edit distance bitpacking scheme explained
in the supplement of Loving, Hernandez, and Benson (2014). Both methods require instructions.
Both methods are usually reported to use fewer than instructions,
but exclude the shifting out of the bottom horizontal difference (four
instructions) and the initialization of the carry for BitPAl (one operation). We require
these additional outputs/inputs since we want to align multiple bit lanes
below each other, and the horizontal difference in between must be carried
through.
pubfncompute_block_simd_myers(hp0: &mutSimd<u64,4>,// 0 or 1. Indicates +1 difference on top.
hm0: &mutSimd<u64,4>,// 0 or 1. Indicates -1 difference on top.
vp: &mutSimd<u64,4>,// 64-bit indicator of +1 differences on left.
vm: &mutSimd<u64,4>,// 64-bit indicator of -1 differences on left.
eq: Simd<u64,4>,// 64-bit indicator which characters equal the top char.
){letvx=eq|*vm;leteq=eq|*hm0;// The addition carries information between rows.
lethx=(((eq&*vp)+*vp)^*vp)|eq;lethp=*vm|!(hx|*vp);lethm=*vp&hx;// Extract the high bit as bottom horizontal difference.
letright_shift=Simd::<u64,4>::splat(63);// Shift each lane by 63.
lethpw=hp>>right_shift;lethmw=hm>>right_shift;// Insert the top horizontal difference.
letleft_shift=Simd::<u64,4>::splat(1);// Shift each lane by 1.
lethp=(hp<<left_shift)|*hp0;lethm=(hm<<left_shift)|*hm0;// Update the input-output parameters.
*hp0=hpw;*hm0=hmw;*vp=hm|!(vx|hp);*vm=hp&vx;}
Code Snippet 1:Myers' bitpacking. Rust code for SIMD version of Myers' bitpacking algorithm. Computes four independent words on an antidiagonal in parallel in instructions.
pubfncompute_block_simd_bitpal(hz0: &mutSimd<u64,4>,// 0 or 1. Indicates 0 difference on top.
hp0: &mutSimd<u64,4>,// 0 or 1. Indicates -1 difference on top.
vm: &mutSimd<u64,4>,// 64-bit indicator of -1 differences on left.
vmz: &mutSimd<u64,4>,// 64-bit indicator of -1 and 0 differences on left.
eq: Simd<u64,4>,// 64-bit indicator which characters equal the top char.
){leteq=eq|*vm;letris=!eq;letnotmi=ris|*vmz;letcarry=*hp0|*hz0;// The addition carries information between rows.
letmasksum=(notmi+*vmz+carry)&ris;lethz=masksum^notmi^*vm;lethp=*vm|(masksum&*vmz);// Extract the high bit as bottom horizontal difference.
letright_shift=Simd::<u64,4>::splat(63);lethzw=hz>>right_shift;lethpw=hp>>right_shift;// Insert the top horizontal difference.
letleft_shift=Simd::<u64,4>::splat(1);lethz=(hz<<left_shift)|*hz0;lethp=(hp<<left_shift)|*hp0;// Update the input-output parameters.
*hz0=hzw;*hp0=hpw;*vm=eq&hp;*vmz=hp|(eq&hz);}
Code Snippet 2:
Rust code for SIMD version of BitPAl's bitpacking. Computes four independent words on an antidiagonal in parallel in instructions.
Here we provide further results on the comparison of aligners.
Dataset statistics. Detailed statistics on the datasets are provided in Table 7.
The ONT (Oxford Nanopore Technologies) read sets all have high divergence, and
the set with genetic variation (gen.var.) contains long gaps.
The SARS-CoV-2 dataset stands out for having only divergence.
Table 7:
Statistics of the real datasets. Lengths are in kbp, divergence in %. Max gap indicates the average length of the largest gap in each alignment.
Dataset
Source
#Pairs
len min
len mean
len max
div min
div mean
div max
max gap mean
max gap max
SARS-CoV-2
A*PA2
10000
27
30
30
0.0
1.5
12.8
0.1
1.0
ONT <1k
WFA
12477
0.04
0.8
1.1
0.0
10.4
22.5
0.01
0.1
ONT <10k
BiWFA
5000
0.2
3.6
10
3.0
12.1
20.1
0.04
0.5
ONT <50k
BiWFA
10000
0.2
11
50
3.0
11.6
19.2
0.07
3.4
ONT >500k
A*PA
50
500
594
849
2.7
6.1
16.7
0.1
1.3
ONT >500k + gen.var.
BiWFA
48
502
632
1053
4.3
7.2
18.2
1.9
42
Runtime comparison on real data.Table 8 shows the numeric value of the average runtime of each aligner in Figure 4.
Table 8:Average runtime per sequence of each aligner on each dataset. Cells marked with are a lower bound due to timeouts. Speedup is reported as the fastest A*PA2 variant compared to the fastest of Edlib, BiWFA, and A*PA.
SARS-CoV-2 pairs (ms)
1kbp ONT reads (ms)
10kbp ONT reads (ms)
>500kbp ONT reads (s)
>500kbp ONT reads + gen.var. (s)
Edlib
11.14
0.110
8.0
3.74
5.20
BiWFA
1.13
0.042
9.3
4.47
6.96
A*PA
6.25
0.514
>190.1
>14.01
>12.92
WFA Adaptive
0.85
0.038
3.0
1.04
0.81
Block Aligner
2.35
0.038
0.9
0.63
0.68
A*PA2 simple
0.89
0.052
1.4
0.58
0.78
A*PA2 full
2.00
0.083
1.7
0.20
0.27
Speedup
1.3×
0.81×
5.6×
18.8×
19.0×
Approximate aligner accuracy.Table 9 shows the percentage of alignments in
each dataset for which approximate methods report the correct distance.
The accuracy of WFA Adaptive drops a lot for the >500kbp dataset with genetic
variation, since these alignments contain gaps of thousands of basepairs, much
larger than the bp cutoff after which trailing diagonals are dropped.
Table 9:Percentage of correctly aligned reads for approximate aligners.
SARS-CoV-2 pairs
1kbp ONT reads
10kbp ONT reads
>500kbp ONT reads
>500kbp ONT reads + gen.var.
WFA Adaptive
92
93
49
60
4
Block Aligner
34
85
53
96
50
WFA Adaptive
92
93
49
60
4
Block Aligner
34
85
53
96
50
Memory usage.Table 10 shows the memory usage of all compared aligners.
Table 10:
Memory usage of aligners, measured as the increase in max_rss before and after aligning a pair of sequences.
Ablation.Figure 7 shows how the performance of A*PA2 changes as individual features are removed.
Figure 7: Ablation. Box plots showing how the performance of APA2-simple and APA2-full changes when removing features.
Parameters.Figure 8 compares A*PA2 with default parameters against versions where one of the
parameters is modified. As can be seen, running time is not very sensitive with
regards to most parameters. Of note are using inexact matches () for the
heuristic, which take significantly longer to find, larger seed length , which decreases the strength of the heuristic, and
smaller block sizes ( and ), which induce more overhead.
Figure 8: Changing parameters. Running time of APA2-simple (left, middle) and APA2-full (right) with one parameter modified. Default parameters are seed length (k=12), pre-pruning look-ahead (p=14), growth factor (f=2), block size (b=256), max traceback cost (g=40), and dropping diagonals that lag (fd=10) behind during traceback.
Alpern, Bowen, Larry Carter, and Kang Su Gatlin. 1995. “Microparallelism and High-Performance Protein Matching.” Proceedings of the 1995 Acm/Ieee Conference on Supercomputing (Cdrom) - Supercomputing ’95. https://doi.org/10.1145/224170.224222.
Baeza-Yates, Ricardo, and Gaston H. Gonnet. 1992. “A New Approach to Text Searching.” Communications of the Acm 35 (10): 74–82. https://doi.org/10.1145/135239.135243.
Benson, Gary, Yozen Hernandez, and Joshua Loving. 2013. “A Bit-Parallel, General Integer-Scoring Sequence Alignment Algorithm.” Lecture Notes in Computer Science, 50–61. https://doi.org/10.1007/978-3-642-38905-4_7.
Bergeron, Anne, and Sylvie Hamel. 2002. “Vector Algorithms for Approximate String Matching.” International Journal of Foundations of Computer Science 13 (01): 53–65. https://doi.org/10.1142/s0129054102000947.
Daily, Jeff. 2016. “Parasail: Simd c Library for Global, Semi-Global, and Local Pairwise Sequence Alignments.” Bmc Bioinformatics 17 (1). https://doi.org/10.1186/s12859-016-0930-z.
Dijkstra, E. W. 1959. “A Note on Two Problems in Connexion with Graphs.” Numerische Mathematik 1 (1): 269–71. https://doi.org/10.1007/bf01386390.
Döring, Andreas, David Weese, Tobias Rausch, and Knut Reinert. 2008. “Seqan an Efficient, Generic c++ Library for Sequence Analysis.” Bmc Bioinformatics 9 (1). https://doi.org/10.1186/1471-2105-9-11.
Farrar, Michael. 2006. “Striped Smith–Waterman Speeds Database Searches Six Times over Other Simd Implementations.” Bioinformatics 23 (2): 156–61. https://doi.org/10.1093/bioinformatics/btl582.
Groot Koerkamp, Ragnar. 2024. “A*PA2: Up to 19 Faster Exact Global Alignment.” In 24th International Workshop on Algorithms in Bioinformatics (Wabi 2024), edited by Solon P. Pissis and Wing-Kin Sung, 312:17:1–17:25. Leibniz International Proceedings in Informatics (Lipics). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPIcs.WABI.2024.17.
Groot Koerkamp, Ragnar, and Pesho Ivanov. 2024. “Exact Global Alignment Using A* with Chaining Seed Heuristic and Match Pruning.” Edited by Tobias Marschall. Bioinformatics 40 (3). https://doi.org/10.1093/bioinformatics/btae032.
Hadlock, Frank O. 1988. “Minimum Detour Methods for String or Sequence Comparison.” Congressus Numerantium 61: 263–74.
Harrison, Peter W, Rodrigo Lopez, Nadim Rahman, Stefan Gutnick Allen, Raheela Aslam, Nicola Buso, Carla Cummins, et al. 2021. “The Covid-19 Data Portal: Accelerating Sars-Cov-2 and Covid-19 Research through Rapid Open Access Data Sharing.” Nucleic Acids Research 49 (W1): W619–23. https://doi.org/10.1093/nar/gkab417.
Hart, Peter, Nils Nilsson, and Bertram Raphael. 1968. “A Formal Basis for the Heuristic Determination of Minimum Cost Paths.” Ieee Transactions on Systems Science and Cybernetics 4 (2): 100–107. https://doi.org/10.1109/tssc.1968.300136.
Hyyrö, Heikki, Kimmo Fredriksson, and Gonzalo Navarro. 2005. “Increased Bit-Parallelism for Approximate and Multiple String Matching.” Acm Journal of Experimental Algorithmics 10 (December). https://doi.org/10.1145/1064546.1180617.
Ivanov, Pesho, Benjamin Bichsel, and Martin Vechev. 2022. “Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds.” In Research in Computational Molecular Biology, 306–25. Springer International Publishing. https://doi.org/10.1007/978-3-031-04749-7_22.
Kruskal, Joseph B. 1983. “An Overview of Sequence Comparison: Time Warps, String Edits, and Macromolecules.” Siam Review 25 (2): 201–37. https://doi.org/10.1137/1025045.
Liu, Daniel, and Martin Steinegger. 2023. “Block Aligner: an adaptive SIMD-accelerated aligner for sequences and position-specific scoring matrices.” Bioinformatics, btad487. https://doi.org/10.1093/bioinformatics/btad487.
Loving, Joshua, Yozen Hernandez, and Gary Benson. 2014. “Bitpal: A Bit-Parallel, General Integer-Scoring Sequence Alignment Algorithm.” Bioinformatics 30 (22): 3166–73. https://doi.org/10.1093/bioinformatics/btu507.
Marco-Sola, Santiago, Jordan M Eizenga, Andrea Guarracino, Benedict Paten, Erik Garrison, and Miquel Moreto. 2023. “Optimal Gap-Affine Alignment in O(s) Space.” Edited by Pier Luigi Martelli. Bioinformatics 39 (2). https://doi.org/10.1093/bioinformatics/btad074.
Marco-Sola, Santiago, Juan Carlos Moure, Miquel Moreto, and Antonio Espinosa. 2020. “Fast gap-affine pairwise alignment using the wavefront algorithm.” Bioinformatics 37 (4): 456–63. https://doi.org/10.1093/bioinformatics/btaa777.
———. 1999. “A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming.” Journal of the Acm 46 (3): 395–415. https://doi.org/10.1145/316542.316550.
Needleman, Saul B., and Christian D. Wunsch. 1970. “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.” Journal of Molecular Biology 48 (3): 443–53. https://doi.org/10.1016/0022-2836(70)90057-4.
Papamichail, Dimitris, and Georgios Papamichail. 2009. “Improved Algorithms for Approximate String Matching (Extended Abstract).” Bmc Bioinformatics 10 (S1). https://doi.org/10.1186/1471-2105-10-s1-s10.
Rognes, Torbjørn, and Erling Seeberg. 2000. “Six-Fold Speed-up of Smith–Waterman Sequence Database Searches Using Parallel Processing on Common Microprocessors.” Bioinformatics 16 (8): 699–706. https://doi.org/10.1093/bioinformatics/16.8.699.
Sankoff, David. 1972. “Matching Sequences under Deletion/Insertion Constraints.” Proceedings of the National Academy of Sciences 69 (1): 4–6. https://doi.org/10.1073/pnas.69.1.4.
Sellers, Peter H. 1974. “On the Theory and Computation of Evolutionary Distances.” Siam Journal on Applied Mathematics 26 (4): 787–93. https://doi.org/10.1137/0126070.
Shao, Haojing, and Jue Ruan. 2024. “Bsalign: A Library for Nucleotide Sequence Alignment.” Genomics, Proteomics & Bioinformatics, March. https://doi.org/10.1093/gpbjnl/qzae025.
Smith, T.F., and M.S. Waterman. 1981. “Identification of Common Molecular Subsequences.” Journal of Molecular Biology 147 (1): 195–97. https://doi.org/10.1016/0022-2836(81)90087-5.
Spouge, John L. 1989. “Speeding up Dynamic Programming Algorithms for Finding Optimal Lattice Paths.” Siam Journal on Applied Mathematics 49 (5): 1552–66. https://doi.org/10.1137/0149094.
Suzuki, Hajime, and Masahiro Kasahara. 2018. “Introducing Difference Recurrence Relations for Faster Semi-Global Alignment of Long Sequences.” BMC Bioinformatics 19 (S1). https://doi.org/10.1186/s12859-018-2014-8.
Szalkowski, Adam, Christian Ledergerber, Philipp Krähenbühl, and Christophe Dessimoz. 2008. “Swps3 – Fast Multi-Threaded Vectorized Smith-Waterman for Ibm Cell/B.E. And 86/Sse2.” Bmc Research Notes 1 (1): 107. https://doi.org/10.1186/1756-0500-1-107.
Wagner, Robert A., and Michael J. Fischer. 1974. “The String-to-String Correction Problem.” Journal of the Acm 21 (1): 168–73. https://doi.org/10.1145/321796.321811.
Wilbur, W J, and D J Lipman. 1983. “Rapid Similarity Searches of Nucleic Acid and Protein Data Banks.” Proceedings of the National Academy of Sciences 80 (3): 726–30. https://doi.org/10.1073/pnas.80.3.726.
Zhao, Mengyao, Wan-Ping Lee, Erik P. Garrison, and Gabor T. Marth. 2013. “Ssw Library: An Simd Smith-Waterman c/c++ Library for Use in Genomic Applications.” Edited by Leonardo Mariño-Ram\’ırez. Plos One 8 (12): e82138. https://doi.org/10.1371/journal.pone.0082138.
Šošić, Martin, and Mile Šikić. 2017. “Edlib: A c/c++ Library for Fast, Exact Sequence Alignment Using Edit Distance.” Edited by John Hancock. Bioinformatics 33 (9): 1394–95. https://doi.org/10.1093/bioinformatics/btw753.