This is Chapter 5 of my thesis.


Summary

So far, we have considered only algorithms for global alignment. In this chapter, we consider semi-global alignment and its variants instead, where a pattern (query) is searched in a longer string (reference). There are many flavours of semi-global alignment, depending on the (relative) sizes of the inputs. We list these variants, and introduce some common approaches to solve this problem.

We then extend A*PA into A*Map as a new tool to solve these problems.

Attribution

All text in this chapter is my own. Some ideas on semi-global alignment are based on early discussions with Pesho Ivanov.

\[ \renewcommand{\st}[2]{\langle #1, #2\rangle} \]

In this chapter, we will look at the problem of semi-global alignment, as a next step after global alignment (blog post). In fact, we will quickly see that there is no such thing as "just" semi-global alignment. Rather, there are many variants that have applications in different domains.

In Section 1 we survey the different types of semi-global alignment. Then, in Section 2, we adapt the bitpacking and SIMD method of A*PA2 for \(O(n\lceil m/w\rceil)\) text searching. In Section 3, we extend the gap-chaining seed heuristic of A*PA to semi-global alignment and hence mapping. We use this as the seeding and chaining parts of the classic seed-chain-extend framework, and then run a separate alignment to find the optimal alignment.

1 Variants of semi-global alignment

Figure 1: Different types of pairwise alignment. Bold vertices and edges indicate where each type of alignment may start and end. Local-extension alignment can end anywhere, whereas local alignment can also start anywhere. Like clipping, these modes require a score for matching characters, while in other cases, a cost model (with cost-0 matches) suffices. This figure is based on an earlier version that was made in collaboration with Pesho Ivanov.


As we saw in Chapter 2 (blog post), there are many different types of pairwise alignment (Figure 1). Global alignment is used when two sequences are to be fully aligned; when that is not the case, there are many different variants. In semi-global alignment, we align a sequence to a substring of a longer sequence. While theoretically this captures the full problem, in practice there are many variants of semi-global alignment that admit different algorithms. In particular, solutions to the problem depend a lot on the absolute and relative sizes of the two sequences being aligned.

Figure 2: Depending on the absolute and relative size of the two sequences, instances of semi-global alignment fall into three categories: pairwise alignment, text searching, and mapping.


Pairwise semi-global alignment. In this chapter, we will use (pairwise) semi-global alignment to refer to instances where the two sequences to be aligned have comparable length. In this case, the number of skipped characters at the start and end of the longer sequence should be small, and not much more than the edit distance between the two sequences. We further assume that the alignment is one-off, i.e., that both sequences are aligned only once, and that only a single best alignment is needed. Instances of this problem can have length anywhere from 100 bp to 100 kbp. Pairwise semi-global alignment is especially useful when a pattern has already been approximately located in a text, and a final alignment is done to obtain the precise alignment.

Since the two sequences have similar length, the DP matrix is nearly square, and it suffices to compute states near "the" diagonal. Thus, semi-global alignment is similar to ends-free and global alignment, and the classic methods for global alignment can be applied.

Approximate string matching (text searching). In the remaining cases, one of the two sequences (the text) is significantly longer than the other (the pattern). When additionally the pattern is short, with a length on the order of the machine word size (64 classically, or 256 with SIMD), we classify the instance as approximate string matching (here, approximate means that we look for inexact matches with a limited number of mutations) or text searching. Here, we are typically interested in not just one optimal alignment, but in finding all sufficiently good alignments, with a cost below some (fixed) threshold. One of the reasons we consider this a separate problem is that there is a truly vast amount of literature on optimizing approximate string matching. See e.g. (Ukkonen 1985b; Landau and Vishkin 1989; Chang and Lampe 1992; Baeza-Yates and Gonnet 1992; Wu and Manber 1992; Alpern, Carter, and Gatlin 1995; Baeza-Yates and Navarro 1996; Hyyrö, Fredriksson, and Navarro 2005; Rognes 2011; NO_ITEM_DATA:leap) for a small selection1.

Initial solutions for this problem include computing the full matrix (\(O(nm)\)) and using various types of bitpacking (\(O(n\lceil m/w\rceil)\)) and parallelization. Additionally, when patterns with up to \(k\) errors are to be found, only a subset of the matrix has to be computed, and \(O(n\lceil k/w\rceil)\) expected time is sufficient (Ukkonen 1985b; Chang and Lampe 1992; Myers 1999) (Figure 3).

One characteristic of nearly all methods for text searching is that they scan the entire text on each search, and thus require at least \(\Omega(n)\) time. Typical applications are searching for a pattern in a set of files (as done by grep), and extracting tags and/or barcodes from multiplexed ONT (Oxford Nanopore Technologies) reads.

Mapping. The primary characteristic of mapping, compared to approximate string matching, is that many patterns are searched in a single text. Thus, \(\Omega(n)\) time per search is no longer acceptable, especially when the reference has a size anywhere from megabytes to gigabytes. Often, patterns are long reads with a length on the order of 50 kbp, but short reads of length 100 bp can be mapped as well. To avoid the \(\Omega(n)\) cost per search, the reference is indexed, so that regions similar to the pattern can be identified efficiently. Usually, this leads to approximate methods, where the reported alignments are not guaranteed to have minimal cost.

As with approximate string matching, one way to use mapping is by asking for all alignments with cost below some threshold. Another option is to ask only for the few best alignments. In practice, another mode is to first find the cost \(s\) of the best alignment (assuming its cost is below some threshold), and then find all alignments up to e.g. cost \(s' = 1.1 \cdot s\).

The classic method applied here is seed-chain-extend (Li 2018). This first finds matches between the sequences, then finds chains of these matches, and then fills the gaps in between consecutive matches using relatively small alignments.

2 Fast text searching

Figure 3: Text searching is the problem of finding a typically short (length \(O(w)\)) pattern in a longer text. The left shows how the classical Needleman-Wunsch algorithm fills the entire matrix column by column. On the right (adapted from (Myers 1999)), we search for all alignments with cost \(\leq k\), and states at distance \(\leq k\) are highlighted. The bloc\(k\)-based approach only computes blocks that contain at least one state at distance \(\leq k\), and takes \(O(n\lceil k/w\rceil)\) time in expectation on random strings (Chang and Lampe 1992).


In A*PA2 (blog post), we developed a bloc\(k\)-based method for pairwise alignment. At its core, these blocks are computed using a fast SIMD-based implementation of the bitpacking algorithm of Myers (Myers 1999; Chang and Lampe 1992). So far, we have only used this as a building block for global alignment, but now we will use it to directly support \(O(n\lceil m/w\rceil)\) text searching.

In essence, this requires two changes. First, we ensure that the alignment can start anywhere in the text by changing the horizontal differences along the top row of the matrix from \(1\) (as used by global alignment) to \(0\), as indicated by the bold lines in Figure 3.

Secondly, the alignment may end anywhere, and the user may be interested in more than just a single best alignment. To support this, we do not only report the score in the bottom-right of the DP matrix, but return a list of all scores along the bottom row. Based on this, the user can decide which scores are sufficiently low to warrant finding a full alignment.
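To make this concrete, below is a minimal scalar sketch of Myers' bit-vector algorithm (Myers 1999) with both changes applied. This is a simplification, not our actual SIMD- and block-based implementation, and it assumes \(m \leq 64\).

```rust
/// Minimal scalar sketch (not the actual SIMD/block-based implementation)
/// of Myers' bit-vector algorithm, adapted for text searching.
/// Assumes 1 <= m <= 64. Returns the bottom-row score for every column.
fn search(pattern: &[u8], text: &[u8]) -> Vec<usize> {
    let m = pattern.len();
    assert!((1..=64).contains(&m));
    // peq[c]: bit i is set iff pattern[i] == c.
    let mut peq = [0u64; 256];
    for (i, &c) in pattern.iter().enumerate() {
        peq[c as usize] |= 1u64 << i;
    }
    let mask = if m == 64 { !0 } else { (1u64 << m) - 1 };
    // Vertical +1 and -1 differences; the first column is all +1.
    let (mut pv, mut mv) = (mask, 0u64);
    let mut score = m; // value at the bottom of the current column
    let high = 1u64 << (m - 1);
    let mut scores = Vec::with_capacity(text.len());
    for &c in text {
        let eq = peq[c as usize];
        let xv = eq | mv;
        let xh = (((eq & pv).wrapping_add(pv)) ^ pv) | eq;
        let ph = mv | !(xh | pv);
        let mh = pv & xh;
        if ph & high != 0 { score += 1; }
        if mh & high != 0 { score -= 1; }
        // Change 1: shift in a 0, since the top-row horizontal differences
        // are 0. Global alignment would use `(ph << 1) | 1` instead.
        let (ph, mh) = (ph << 1, mh << 1);
        pv = mh | !(xv | ph);
        mv = ph & xv;
        // Change 2: report the score in every column of the bottom row.
        scores.push(score);
    }
    scores
}
```

Note that the only difference with the global-alignment variant is the single carry bit shifted into the horizontal differences.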

Tracing. Once the user decides which scores at the bottom of the matrix are sufficiently low, a traceback can be started from those positions. To save time and memory, the initial computation of the matrix only returns the output scores and does not store all \(nm\) values. Thus, to find an alignment ending in column \(i\), we recompute the matrix from column \(i-2m\) to column \(i\) and store all values for each column. We then do a usual trace through this matrix from \(\st im\) until we reach the top row (\(j=0\)).

2.1 Skip-cost for overlap alignments

Figure 4: By default, global alignment uses a cost of 1 along all edges of the matrix, while semi-global alignment and overlap/ends-free/extension variants have a cost of 0 along some edges. When a pattern only partially overlaps the text, as shown on the left, it may be preferable to have a skip-cost \(\alpha\) for each unmatched character that is in between \(0\) and \(1\). This can also be applied to global alignment (replacing ends-free alignment), and can be an alternative to local alignment.


In some applications, the pattern may be present but cut off at either its start or end, as shown on the left in Figure 4, for example when a read was cut short, or when aligning reads against an incomplete assembly (Abramova, Karkman, and Bengtsson-Palme 2024). In a classical semi-global alignment, the unmatched start of the pattern would incur a cost of 1 per unmatched character, but this may push the total cost of the pattern above the threshold. Instead, overlap alignment could be used (Figure 1), but this requires a bonus for matches, since otherwise the cheapest way to align the pattern could be to skip nearly all of its characters. Ends-free alignment solves this by only allowing a limited number of characters to be skipped. Still, this is suboptimal: when the pattern matches once in full, and once at the start of the sequence with \(50\%\) overlap, the scores of these two alignments are not directly comparable. In fact, the overlapping alignment has an advantage, because it only pays for mismatches over half its length.

To solve this, we introduce the skip-cost2 \(0\leq \alpha \leq 1\), which is the cost paid for each character at the start and/or end of the pattern that is not aligned because it extends outside the text. This concept can also be applied to global-alignment variants such as ends-free and overlap (Figure 4, middle), so that skipping characters in both sequences has a (not necessarily equal) cost.

In practice, handling fractional costs is inconvenient, especially in the case of edit distance, where the cost between adjacent states must be 0 or 1. To avoid this, we can initialize the first and last column (and row, for global alignment) with a mix of zeros and ones, so that the fraction of ones is approximately \(\alpha\), as shown in Figure 5 for \(\alpha=0.5\).
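As a small sketch of this discretization (the function name is hypothetical), the 0/1 edge costs of the first column can be chosen as the differences of the rounded-up prefix costs:

```rust
/// Hedged sketch: discretize a fractional skip-cost `alpha` into 0/1 edge
/// costs along the first column, so that skipping the first j characters
/// of the pattern costs ceil(alpha * j) in total.
fn first_column_edge_costs(m: usize, alpha: f64) -> Vec<u8> {
    (1..=m)
        .map(|j| {
            let cost_j = (alpha * j as f64).ceil();
            let cost_prev = (alpha * (j - 1) as f64).ceil();
            (cost_j - cost_prev) as u8 // 0 or 1 when 0 <= alpha <= 1
        })
        .collect()
}
```

For \(\alpha=0.5\), this yields the alternating edge costs 1, 0, 1, 0, … of Figure 5.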

Figure 5: Example of computing a semi-global alignment with a skip-cost of \(\alpha = 1/2\). In the first column of the graph, edges of cost 1 and 0 alternate. On the bottom, the graph is extended with matches until a multiple of the block size is reached. On the right, the final score in row \(j\) is increased by \(\lceil \alpha(m-j)\rceil = \lceil (m-j)/2\rceil\) to obtain the score including the skip-cost. Three alignments are shown, with their edits highlighted. Only half of the skipped characters (rounded up) incur a cost.


Applying the skip-cost. In Figure 6, we show example output when using a skip-cost of \(\alpha\in\{0, 0.5, 1\}\) for the setup shown in Figure 7. Using \(\alpha = 1\) corresponds to classical semi-global alignment (thin black), and we see that this correctly detects that the pattern matches in the middle of the sequence, ending at position 300, with a cost around 20. However, the occurrences overlapping the start and end of the text are completely missed. Overlap alignment, which corresponds to \(\alpha=0\) (bold black), does have local minima at positions 50 and 650 (indicating that the pattern extends 50 characters beyond the text). The drawback of these minima is that there are also global minima at positions 0 and 700, where the pattern is completely disjoint from the text, so that some additional logic is needed to separate these cases. We see that in regions where the pattern does not match, the alignment has a score around 50, or \(0.5\) per character. Thus, we choose a skip-cost of \(\alpha=0.5\) per skipped character. Using this (yellow), we recover clear local minima at positions 50 and 650, while the cost converges back to 50 as the overlap shrinks to 0.

Figure 6: Example of the output of the skip-cost alignment when aligning a length-100 pattern onto a length-600 text (as shown in Figure 7). Graphs are shown for \(\alpha=1\), corresponding to classical semi-global alignment, \(\alpha=0.5\), corresponding to the skip-cost introduced here, and \(\alpha=0\), corresponding to an overlap alignment. Vertical lines indicate the region inside of which the pattern fully matches within the text, and where the cost of the alignment does not depend on the skip-cost \(\alpha\).


Figure 7: The setup of the alignment results shown in Figure 6. A random pattern of length 100 is generated and overlaid on a length-600 text three times: once in the middle, and twice with a 50-base overlap at the start/end of the sequence. Before inserting the pattern into the text, a different number of mutations is applied to the full length-100 pattern.


2.2 Results

Figure 8: Log-log plot of the time to align a pattern of length \(m\) against a text of length 50 kbp, in nanoseconds per base of the text. Only the time needed to compute the minimal distance is reported, excluding alignment/traceback. Our SIMD search method (yellow) always computes the entire matrix. Edlib, on the other hand, by default uses a band doubling approach (solid lines). Disabling this via a fixed high threshold is shown dashed.


We benchmark the throughput of the search function in Figure 8, where we measure how long it takes (per text character) to align a pattern against a text. For Edlib (Šošić and Šikić 2017), we use the infix method for semi-global alignment and ask it to report the distance only, and likewise for our method, we measure only the time needed to compute the output distances. Experiments are run on an Intel i7-10750H with AVX2, running at a fixed CPU frequency of 2.6 GHz.

As can be seen, both methods take as long for pattern length 32 as for 64, since they pad patterns to 64-bit words. Our SIMD-based method has constant performance up to patterns of length 256, and then grows linearly with the pattern length. Edlib starts to grow at its word size \(w=64\) instead. On very divergent sequences (black), the growth is indeed linear, and even slightly worse because of redundant band doubling. For more similar sequences (grey), when the pattern is present in the text with a small divergence, band doubling reduces the part of the matrix that needs to be computed. Especially when the pattern can be found with a divergence of 1%, this makes the performance nearly independent of the pattern length, as also predicted by Myers' complexity of \(O(n \lceil k/w\rceil)=O(n\lceil 0.01 m/64\rceil)=O(n\lceil m/6400\rceil)\) (Myers 1999) and shown in Figure 3.

For shorter texts, on the order of the pattern length (not shown), there is an additional 50% to 100% overhead on the time per character that is spent on preprocessing the pattern.

When also tracing the optimal alignment, Edlib needs another 5-10% of time, while our method needs an additional 10-20%.

For patterns of length 128 to 256, our method ends up around \(1.7\times\) to \(2.0\times\) faster than Edlib. In practical terms, this implies that a pattern of length up to 256 bp can be found in a 1 kbp read in 13 μs (75000 searches per second) or in a 50 kbp text in 440 μs (2200 searches per second). Or alternatively, in one second, nearly 100 Mbp of text can be searched.

Future work. Currently, we only implement a naive \(O(n\lceil m/w\rceil)\) method that always computes the entire matrix. For patterns of length greater than 256, most of the matrix below the first 256 rows can likely be skipped, and this should provide a significant speedup.

3 Mapping using A*Map

Figure 9: An example of the seed-chain-extend method for mapping. First, seeds (black diagonals) are found, which are short matches between the two sequences. Then, these seeds are joined into chains (dashed lines). Each seed and each chain is scored based on the number of seeds in the chain and their relative positions. The chains with the highest scores are selected as candidate alignments. Then, short alignments are done to fill the gaps between the seeds and extend the chain into a full alignment. A drawback of seed-chain-extend is that it may not return optimal alignments. Instead, a full semi-global alignment could be done around the chain to obtain an exact alignment, leading to seed-chain-align. The bottom-left shows a semi-global alignment using Needleman-Wunsch, and the bottom-right shows a semi-global alignment using band-doubling.


The problem of mapping differs from text searching considered so far in a few ways. First, the text (reference) is fixed and is reused for many alignments. It can be anywhere from megabases to gigabases in size. Secondly, the patterns (reads) being mapped can have length 100 bp (short reads) up to 50 kbp (long reads). To enable efficient mapping, most tools build an index on the reference, and then query this for each read to be mapped. In practice, such methods are often approximate, in that they are not guaranteed to find a minimal-cost alignment. They work using seed-chain-extend: seeds3 (usually \(k\)-mer matches) are found via the index. Then these are joined into chains, and the best chains are extended into a full alignment, as shown in Figure 9.

In the remainder of this section, we briefly review strategies for the three parts, seeding, chaining, and extending.

A*Map builds on the same paradigm, and we review how A*PA’s gap-chaining seed heuristic can be applied here, and how A*PA and A*PA2 can be modified for exact mapping and semi-global alignment. Note that in A*Map, we replace the usual extend phase by a more thorough semi-global alignment that covers the full chain at once. This way, we can guarantee that optimal alignments are found.

3.1 Seeding

There are various strategies for seeding alignments.

Minimizers. The most popular mapper, minimap2 (Li 2018), uses minimizers (blog). By default, it uses \(k\)-mer size \(k\) from 15 to 19 and window size \(w\) from 10 to 19, to extract one out of each \(w\) consecutive \(k\)-mers. It first finds all minimizers of the reference and builds an index that maps each \(k\)-mer to the locations where it occurs as a minimizer. Then, the minimizer \(k\)-mers for each query are determined, and these are looked up in the index to find the \(k\)-mer matches that seed the alignment.
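The following is a hedged sketch of \((w, k)\)-minimizer selection under an arbitrary hash function; minimap2's actual implementation differs (it uses an invertible hash, handles reverse complements, and breaks ties robustly).

```rust
/// Hedged sketch of (w, k)-minimizer selection, not minimap2's actual
/// implementation: from every window of w consecutive k-mers, keep the
/// position of the smallest k-mer by hash.
fn minimizer_positions(seq: &[u8], k: usize, w: usize, hash: impl Fn(&[u8]) -> u64) -> Vec<usize> {
    let mut positions = Vec::new();
    if seq.len() < k + w - 1 {
        return positions;
    }
    for start in 0..=seq.len() - (k + w - 1) {
        // O(w) rescan per window; real implementations use a monotone queue.
        let best = (start..start + w)
            .min_by_key(|&i| hash(&seq[i..i + k]))
            .unwrap();
        // The same k-mer is the minimizer of many consecutive windows.
        if positions.last() != Some(&best) {
            positions.push(best);
        }
    }
    positions
}
```

On a random sequence, this keeps roughly \(2/(w+1)\) of all \(k\)-mer positions, which is what makes the index much smaller than a full \(k\)-mer index.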

\(k\)-min-mers. A different approach is taken by mapquik (Ekim et al. 2023), which is a mapper designed for highly similar sequences. Here, \(k\)-min-mers are used to seed the alignment. These are chains of 2 to 15 consecutive 31-mers. This way, each \(k\)-min-mer spans a much larger portion of the sequence, and fewer matches are needed to recover sufficiently good chains.

\(k\)-mers. In the seed heuristic in A*ix (Ivanov, Bichsel, and Vechev 2022) and A*PA (Groot Koerkamp and Ivanov 2024) (TODO chapter), plain \(k\)-mer matches are used. A drawback of this approach is that it creates more matches, since there are more \(k\)-mers than minimizers. The main benefit, on the other hand, is that it leads to an exact algorithm. For other seeding methods, a lack of matches does not imply a (good) lower bound on the minimal edit distance between consecutive matches, as we will see in Section 3.2.

Maximal-exact-matches. Maximal-exact-matches are a variant where \(k\)-mer matches are extended on either side as long as the two sequences match. This is similar to the seeding used by BLAST (Altschul et al. 1990).

Maximal-unique-matches. Yet another method is to seed the alignment using maximal-unique-matches, also known as MUMs. These are substrings of the query and reference that occur exactly once in each string, and that can not be extended into a longer matching substring. Thus, these matches consider global information, rather than just local matches. This technique is used by MUMmer (Delcher et al. 1999; Marçais et al. 2018).

3.2 Chaining

Figure 10: There are different models to give costs and scores to chains. Here we show three possible costs that can be given to the connection between two consecutive matches in a chain.


After finding all the seed matches, the next step is to find candidate regions where the query could align. This is done by finding chains consisting of multiple matches, and giving each chain a cost or score. Specifically, a chain is a sequence of seeds that can occur together in an alignment.

As for seeding, there are many different methods to score chains.

LCS\(k\). A simple method of scoring chains is to assume that the seeds are disjoint \(k\)-mer matches, and simply maximize the number of \(k\)-mers in the chain. This is also known as the LCS\(k\) metric (Benson, Levy, and Shalom 2014). Like the plain LCS, this score focuses only on matches, and disregards the mismatches and indels in between.
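As an illustration, a chain maximizing the number of matches can be computed in \(O(r \lg r)\) time for \(r\) matches via the classic reduction to the longest increasing subsequence. The sketch below is simplified: it ignores that consecutive \(k\)-mer matches in a chain should additionally be disjoint.

```rust
/// Hedged sketch: chain r matches in O(r lg r) under the (simplified)
/// LCSk objective, maximizing the number of chained k-mer matches.
/// A match (i2, j2) may follow (i1, j1) if i2 > i1 and j2 > j1.
fn max_chain_len(mut matches: Vec<(usize, usize)>) -> usize {
    // Sort by i; for equal i, sort j descending, so that two matches in
    // the same column can never be chained together.
    matches.sort_by(|a, b| a.0.cmp(&b.0).then(b.1.cmp(&a.1)));
    // tails[l] = smallest j ending a chain of l+1 matches (patience sorting).
    let mut tails: Vec<usize> = Vec::new();
    for (_, j) in matches {
        let pos = tails.partition_point(|&t| t < j);
        if pos == tails.len() {
            tails.push(j);
        } else {
            tails[pos] = j;
        }
    }
    tails.len()
}
```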

LCS\(k{+}{+}\). An extension of LCS\(k\) is LCS\(k{+}{+}\) (Pavetić, Žužić, and Šikić 2014). This method allows matches of arbitrary length, and maximizes the total length of the matches.

Anchored edit distance. As with edit distance and LCS, we can consider a cost equivalent of the score given by the LCS\(k{+}{+}\) metric. This is the anchored edit distance, where the focus is again on the mismatches and indels rather than the matches. As shown in Figure 10, the cost of joining two seeds is the maximum of the horizontal and vertical gap between them. (TODO citation)

Gap cost. We already saw that the gap cost (Ukkonen 1985a) is used a lot for pairwise alignment, and it is also useful as a cost for chaining matches: we can lower bound the cost of the alignment between two consecutive matches by the minimal number of horizontal or vertical steps needed to join them (Figure 10). Indeed, minimap2 (Li 2018) also uses a chaining score based on the gap cost. In fact, minimap2 uses a concave function of the size of the gap as the actual cost, so that longer gaps are penalized relatively less than short gaps, to admit e.g. spliced alignments.

Seed heuristic (SH). The seed heuristic, introduced by A*ix (Ivanov, Bichsel, and Vechev 2022; Groot Koerkamp and Ivanov 2024), provides a second, independent lower bound on the edit distance between two matches. We first find all \(k\)-mer matches. Then, say that there is a gap of \(\Delta_i \times \Delta_j\) bases between two matches in our chain. Assuming that there are no in-between matches, we know that there is no \(k\)-mer match on the path joining the two matches. Thus, we must incur an error at least every \(k\) steps, for at least \(\max(\lfloor \Delta_i/k\rfloor, \lfloor\Delta_j/k\rfloor)\) errors. (If we assume that the two initial matches are already maximally extended, we could replace the \(\lfloor\cdot\rfloor\) by a \(\lceil \cdot \rceil\).) In practice, the seed heuristic is implemented by splitting the reference sequence into adjacent disjoint \(k\)-mers, and only matches of those \(k\)-mers are found. Then, the distance between consecutive matches is always a multiple of \(k\), and the minimal cost to join them is simply the number of skipped \(k\)-mers, as shown in Figure 10.

Gap-chaining seed heuristic (GCSH). In A*PA, we extended the seed heuristic into the gap-chaining seed heuristic. Conceptually, this simply takes the maximum of the gap-cost and the seed heuristic cost, since the maximum of two lower bounds is still a lower bound. The main theoretical result of A*PA (Theorem 5, Lemma 7) is the following:

In an optimal path, two matches can only be chained if the gap cost between them is at most the value of the seed heuristic between them.

Thus, two matches that are \(d\) diagonals apart may only be chained if there are at least \(k\cdot d\) columns between them. This puts a strong limitation on how far chains can "stray away" from their diagonal. In A*PA, we provide an efficient \(O(r \lg r)\) algorithm for chaining \(r\) matches that is equivalent to the solution for LCS (Hirschberg 1977). It works by first applying a suitable transformation to the coordinates of the matches, followed by a plain LCS algorithm.
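The costs of Figure 10 and the chaining condition above can be summarized in a short sketch, where \(\Delta_i \times \Delta_j\) is the gap between two consecutive matches (the function names are ours, chosen for illustration):

```rust
/// Hedged sketch of the chain costs of Figure 10 for two consecutive
/// matches separated by a gap of di (reference) by dj (pattern) bases,
/// with seed length k.
fn anchored_cost(di: usize, dj: usize) -> usize {
    di.max(dj) // anchored edit distance: max of the two gap sizes
}
fn gap_cost(di: usize, dj: usize) -> usize {
    di.abs_diff(dj) // minimal number of indels to bridge the gap
}
fn seed_heuristic(di: usize, dj: usize, k: usize) -> usize {
    (di / k).max(dj / k) // at least one error per k unmatched steps
}
fn gcsh(di: usize, dj: usize, k: usize) -> usize {
    // The maximum of two lower bounds is itself a lower bound.
    gap_cost(di, dj).max(seed_heuristic(di, dj, k))
}
fn chainable(di: usize, dj: usize, k: usize) -> bool {
    // By the theorem above, matches on an optimal path are only chained
    // when the gap cost does not exceed the seed heuristic.
    gap_cost(di, dj) <= seed_heuristic(di, dj, k)
}
```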

The main benefit of the GCSH is that it gives mathematical guarantees. Suppose we are doing a global alignment between two sequences of lengths \(n\) (the one that is split into \(\ell = \lfloor n/k\rfloor\) \(k\)-mer seeds) and \(m\). If there is an alignment of cost \(s\), then we know for sure that there is also a chain of cost \(\leq s\). Thus, to find all alignments of cost up to \(s\), we only have to consider all chains with cost up to \(s\).

3.3 Aligning

After all matches have been chained and sufficiently good candidate chains have been determined, these chains can be extended into alignments. Minimap2 uses the KSW2 algorithm (Suzuki and Kasahara 2018) to do an approximate (banded) alignment to fill the gaps between matches. Other methods, such as mapquik, skip the alignment phase completely and only report the location and/or score of the chain.

A drawback of extending a chain is that the optimal alignment may not completely follow the chain, as exemplified in the bottom-left alignment in Figure 9. Instead, we can run a semi-global alignment around the chain using any of the global alignment methods discussed in Chapter 2 (blog), such as a plain Needleman-Wunsch DP or band doubling. Indeed, we can also use A*PA or A*PA2 for this semi-global alignment.

Updating GCSH for semi-global alignment. For global alignment, we can simply count the number of seeds that still have to be crossed to get to the end of the first sequence (the reference). In particular, when \(x=n-i\) characters of the first sequence remain, we still need to cross and pay for \(x/k -O(1)\) seeds. With semi-global alignment, we can end the alignment anywhere, and avoid crossing all seeds. If there are still \(y\) bases of the pattern remaining, it turns out this will incur a cost of at least \(y/(k+1)-O(1)\). This division by \(k+1\) rather than \(k\) could be avoided by swapping the roles of the pattern and reference and splitting the pattern into \(m/k\) seeds, but that turns out to be inefficient when it comes to indexing all \(k\)-mers. By splitting the reference into \(k\)-mers, we only need to index \(1/k\) of its \(k\)-mers, so that the index is much smaller.
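As a hedged illustration of these two bounds (ignoring any remaining matches and dropping the exact form of the \(O(1)\) terms):

```rust
/// Hedged sketch, ignoring remaining matches: lower bounds on the
/// remaining alignment cost when the reference is split into length-k seeds.
fn h_global(x: usize, k: usize) -> usize {
    // x = n - i remaining reference characters: every full seed that
    // still has to be crossed costs at least 1.
    x / k
}
fn h_semi_global(y: usize, k: usize) -> usize {
    // y remaining pattern characters; the alignment may end anywhere in
    // the reference, and paying 1 error per crossed seed advances at
    // most k+1 pattern characters.
    y / (k + 1)
}
```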

Secondly, in A*PA and A*PA2 we filter away all matches for which the gap cost to the end (\(\st nm\)) is larger than the seed heuristic cost to the end, since these can provably never be part of a chain. With semi-global alignment, chains can end anywhere, and thus this filter no longer applies.

Semi-global alignment using A*PA and A*PA2. We additionally make some modifications to A*PA and A*PA2. First, the alignment can start anywhere along the top of the grid, and so we do not only push the root state \(\st 00\) onto the A*PA priority queue, but push all states along the top row for which the heuristic has a local minimum. From there, we expand sideways as needed, both to the right and to the left. For A*PA2, we similarly make sure to cover all start positions with a sufficiently low value of the heuristic.

Similarly, the alignment may end anywhere on the bottom row, and so the termination condition is changed accordingly. During the traceback, we likewise ensure that it is stopped as soon as the top row is reached, rather than the top-left state.

3.4 A*Map

While it would be possible to use A*PA or A*PA2 directly as a mapping algorithm, this is inefficient because the index on reference \(k\)-mers is not reused between alignments. Thus, we develop A*Map as a dedicated mapper. As discussed, this consists of three components:

  • Seeding using \(k\)-mer matches: a static index is built containing exactly every \(k\)’th \(k\)-mer of the reference, and all query \(k\)-mers are looked up in this index to find their matches in the reference (a minimal sketch of this step follows after this list). Matches are sorted using an efficient radix sort.
  • Chaining using the gap-chaining seed heuristic (GCSH): all \(r\) matches are transformed as done by A*PA, and then an efficient implementation of the \(r \lg r\) LCS chaining algorithm is used. All chains with a cost below some fixed threshold \(t\) are candidates for alignment.
  • Candidate chains are semi-globally aligned using A*PA2 with band doubling. The best score is tracked and returned. To ensure the alignment is contained in the subsequence of the reference that is semi-globally aligned, a small buffer is added before the first match and after the last match, as shown in Figure 9.
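Below is a minimal sketch of the seeding step, assuming a plain hashtable; the actual implementation additionally sorts the resulting matches with a radix sort.

```rust
use std::collections::HashMap;

/// Hedged sketch of A*Map-style seeding (not the actual implementation):
/// index every k'th k-mer of the reference, then look up all query k-mers.
fn build_index(reference: &[u8], k: usize) -> HashMap<&[u8], Vec<usize>> {
    let mut index: HashMap<&[u8], Vec<usize>> = HashMap::new();
    // The reference is split into adjacent, disjoint seeds, so only
    // 1/k of its k-mers are indexed.
    let mut i = 0;
    while i + k <= reference.len() {
        index.entry(&reference[i..i + k]).or_default().push(i);
        i += k;
    }
    index
}

/// All (reference position, query position) matches of query k-mers
/// against the indexed seeds.
fn find_matches(index: &HashMap<&[u8], Vec<usize>>, query: &[u8], k: usize) -> Vec<(usize, usize)> {
    let mut matches = Vec::new();
    if query.len() < k {
        return matches;
    }
    // Every k-mer of the query is looked up (sliding window).
    for j in 0..=query.len() - k {
        if let Some(positions) = index.get(&query[j..j + k]) {
            matches.extend(positions.iter().map(|&i| (i, j)));
        }
    }
    matches
}
```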

When the goal is to find all alignments with divergence up to \(d=4\%\), one must use a value of \(k\) somewhat below \(0.9(1/d-1)=0.9(1/4\%-1) = 21.6\) to accommodate spurious matches, and to ensure that candidate chains contain at least one tenth of the maximum possible number of matches (i.e., chains should have length at least \(0.1 \cdot m/k\)). In this case, \(k=20\) would be a good choice. Generally, smaller \(k\) is preferred to improve the quality of the heuristic, but we also need \(k>\log_4 n\) to ensure that the number of spurious matches remains limited.

3.5 Results

We compare A*Map against minimap2 on synthetic long-read data. We use chromosome 1 as the reference, which has a length of around 235 Mbp. From this, we sample 1000 random reads of length 50 kbp. Then, we apply uniform random mutations to these strings at rates of \(1\%\), \(3\%\), and \(5\%\), to obtain divergences of \(0.9\%\), \(2.7\%\), and \(4.4\%\).

We run both methods on a single thread. For minimap2, we run with -x map-pb, -x map-ont (default), and -x map-hifi. For A*Map, we use A*PA2 with plain band doubling (Ukkonen 1985a) for the semi-global alignments.

Experiments are run on an Intel i7-10750H with AVX2, running at a fixed CPU frequency of 2.6 GHz.

Table 1: Results of aligning 1000 random subsequences of chromosome 1, with varying divergence. The first row shows the time in seconds to index the 235 Mbp chromosome, and the remaining rows show the total time to map the 1000 reads. For minimap2, we try various default configurations, while for A*Map we use \(k=20\) and \(k=28\). For \(k=28\), the alignments found with \(4.4\%\) divergence are not guaranteed to be exact, since \(k\) is larger than \(1/d=1/4.4\%\), and indeed, 3 reads remain unmapped. The PacBio mode uses homopolymer compression.
| Divergence | PB (k=19, w=10, hpc) | ONT (k=15, w=10) | HIFI (k=19, w=19) | A*Map (k=20) | A*Map (k=28) |
|------------|----------------------|------------------|-------------------|--------------|--------------|
| Indexing   | 8.7                  | 10.5             | 7.4               | 1.2          | 1.0          |
| 0.9%       | 31.2                 | 48.1             | 23.0              | 26.9         | 8.7          |
| 2.7%       | 31.4                 | 50.6             | 23.3              | 24.6         | 12.9         |
| 4.4%       | 28.8                 | 46.6             | 21.8              | 22.8         | 15.4 (*)     |

A*Map analysis. When \(k=20\), a bottleneck of A*Map is the large number of \(k\)-mer matches: 200000 on average per mapped read. For divergence \(2.7\%\), 3 seconds are spent collecting matches, 4.5 seconds are needed to sort them, and 5 seconds to chain them. Aligning the most likely chain for each read takes a total of 8 seconds. On average, there are 2.6 candidate chains per pattern. It appears that this is mostly due to reads falling into highly repetitive regions (e.g. in the centromere), where many overlapping starting positions for the semi-global alignment are considered. These (on average) 1.6 additional alignments per read take a total of 3.8 seconds.

When \(k=28\), the number of matches is significantly reduced, to only 30000 per read. This reduces the time spent sorting matches to 0.6 s, and the time for building contours to 0.8 s. Also, the number of candidate chains drops from 2.6 to 1.4 per read, since the larger \(k\) yields fewer spurious matches. The total time for aligning the best scoring chains is still 8 seconds.

Comparison. Compared to minimap2, A*Map is significantly faster at indexing the text, since it only needs to build a hashtable on every \(k\)’th \(k\)-mer. Minimap2, on the other hand, has to compute all minimizers. Nevertheless, minimap2’s indexing could probably be sped up by using SimdMinimizers (blog). On this data, minimap2 works best with the HIFI preset, with \(k=w=19\). For divergence \(0.9\%\), A*Map is \(2.6\times\) faster, and for divergence \(2.7\%\), A*Map is \(1.8\times\) faster. To put these results into perspective, on data with \(<1\%\) divergence, mapquik (Ekim et al. 2023) was shown to be more than an order of magnitude faster than minimap2.

Nevertheless, the main feature of A*Map is that it is able to guarantee exact results, where one can prove that no alignment below the threshold is missed. In cases where this is important, A*Map is a viable alternative to minimap2.

Future work. Currently, there are a few limitations. First, the semi-global alignment is independent of the preceding chaining. It could be beneficial to reuse the chains to build a heuristic, to reduce the size of the subsequent alignment. However, initial experiments show that the overhead of evaluating the heuristic quickly grows compared to simply computing more states. Alternatively, it may be possible to develop an exact alignment method that is bottom-up (like the usual extending) by building on ideas such as local pruning introduced by A*PA2.

A second issue is the large number of matches, and the time needed to query all \(k\)-mers of the read. One way to speed this up is to swap the roles of the query and the reference, so that only every \(k\)’th query \(k\)-mer has to be looked up. However, that comes at the cost of a \(k\times\) larger index. Alternatively, fine-tuning the value of \(k\) so that it is small enough for the given error rate and as large as possible to reduce false positive matches could also help. In parallel, it may be possible to build an efficient index on inexact matches of length \(2k\), so that there are simply fewer resulting matches that have to be sorted and chained.

References

Abramova, Anna, Antti Karkman, and Johan Bengtsson-Palme. 2024. “Metagenomic Assemblies Tend to Break around Antibiotic Resistance Genes.” BMC Genomics 25 (1). https://doi.org/10.1186/s12864-024-10876-0.
Alpern, Bowen, Larry Carter, and Kang Su Gatlin. 1995. “Microparallelism and High-Performance Protein Matching.” Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM) - Supercomputing ’95. https://doi.org/10.1145/224170.224222.
Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. 1990. “Basic Local Alignment Search Tool.” Journal of Molecular Biology 215 (3): 403–10. https://doi.org/10.1016/s0022-2836(05)80360-2.
Baeza-Yates, Ricardo, and Gaston H. Gonnet. 1992. “A New Approach to Text Searching.” Communications of the ACM 35 (10): 74–82. https://doi.org/10.1145/135239.135243.
Baeza-Yates, Ricardo, and Gonzalo Navarro. 1996. “A Faster Algorithm for Approximate String Matching.” Lecture Notes in Computer Science, 1–23. https://doi.org/10.1007/3-540-61258-0_1.
Benson, Gary, Avivit Levy, and Riva Shalom. 2014. “Longest Common Subsequence in K-Length Substrings.” arXiv. https://doi.org/10.48550/ARXIV.1402.2097.
Chang, William I., and Jordan Lampe. 1992. “Theoretical and Empirical Comparisons of Approximate String Matching Algorithms.” Lecture Notes in Computer Science, 175–84. https://doi.org/10.1007/3-540-56024-6_14.
Delcher, A. L., S. Kasif, R. D. Fleischmann, J. Peterson, O. White, and S. L. Salzberg. 1999. “Alignment of Whole Genomes.” Nucleic Acids Research 27 (11): 2369–76. https://doi.org/10.1093/nar/27.11.2369.
Ekim, Bariş, Kristoffer Sahlin, Paul Medvedev, Bonnie Berger, and Rayan Chikhi. 2023. “Efficient Mapping of Accurate Long Reads in Minimizer Space with Mapquik.” Genome Research, June. https://doi.org/10.1101/gr.277679.123.
Groot Koerkamp, Ragnar, and Pesho Ivanov. 2024. “Exact Global Alignment Using A* with Chaining Seed Heuristic and Match Pruning.” Edited by Tobias Marschall. Bioinformatics 40 (3). https://doi.org/10.1093/bioinformatics/btae032.
Hirschberg, Daniel S. 1977. “Algorithms for the Longest Common Subsequence Problem.” Journal of the ACM 24 (4): 664–75. https://doi.org/10.1145/322033.322044.
Hyyrö, Heikki, Kimmo Fredriksson, and Gonzalo Navarro. 2005. “Increased Bit-Parallelism for Approximate and Multiple String Matching.” ACM Journal of Experimental Algorithmics 10 (December). https://doi.org/10.1145/1064546.1180617.
Ivanov, Pesho, Benjamin Bichsel, and Martin Vechev. 2022. “Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds.” In Research in Computational Molecular Biology, 306–25. Springer International Publishing. https://doi.org/10.1007/978-3-031-04749-7_22.
Landau, Gad M, and Uzi Vishkin. 1989. “Fast Parallel and Serial Approximate String Matching.” Journal of Algorithms 10 (2): 157–69. https://doi.org/10.1016/0196-6774(89)90010-2.
Li, Heng. 2018. “Minimap2: Pairwise Alignment for Nucleotide Sequences.” Edited by Inanc Birol. Bioinformatics 34 (18): 3094–3100. https://doi.org/10.1093/bioinformatics/bty191.
Marçais, Guillaume, Arthur L. Delcher, Adam M. Phillippy, Rachel Coston, Steven L. Salzberg, and Aleksey Zimin. 2018. “MUMmer4: A Fast and Versatile Genome Alignment System.” Edited by Aaron E. Darling. PLOS Computational Biology 14 (1): e1005944. https://doi.org/10.1371/journal.pcbi.1005944.
Myers, Gene. 1999. “A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming.” Journal of the ACM 46 (3): 395–415. https://doi.org/10.1145/316542.316550.
Pavetić, Filip, Goran Žužić, and Mile Šikić. 2014. “LCSk++: Practical Similarity Metric for Long Strings.” arXiv. https://doi.org/10.48550/ARXIV.1407.2407.
Rognes, Torbjørn. 2011. “Faster Smith-Waterman Database Searches with Inter-Sequence SIMD Parallelisation.” BMC Bioinformatics 12 (1). https://doi.org/10.1186/1471-2105-12-221.
Suzuki, Hajime, and Masahiro Kasahara. 2018. “Introducing Difference Recurrence Relations for Faster Semi-Global Alignment of Long Sequences.” BMC Bioinformatics 19 (S1). https://doi.org/10.1186/s12859-018-2014-8.
Ukkonen, Esko. 1985a. “Algorithms for Approximate String Matching.” Information and Control 64 (1–3): 100–118. https://doi.org/10.1016/s0019-9958(85)80046-2.
———. 1985b. “Finding Approximate Patterns in Strings.” Journal of Algorithms 6 (1): 132–37. https://doi.org/10.1016/0196-6774(85)90023-9.
Wu, Sun, and Udi Manber. 1992. “Fast Text Searching.” Communications of the ACM 35 (10): 83–91. https://doi.org/10.1145/135239.135244.
Šošić, Martin, and Mile Šikić. 2017. “Edlib: A C/C++ Library for Fast, Exact Sequence Alignment Using Edit Distance.” Edited by John Hancock. Bioinformatics 33 (9): 1394–95. https://doi.org/10.1093/bioinformatics/btw753.
NO_ITEM_DATA:leap

  1. See https://curiouscoding.nl/posts/approximate-string-matching for a longer overview of relevant papers. ↩︎

  2. I would not be surprised if this has been done before. There are many tools applying similar techniques (either via local alignment or a clipping cost), but as far as I am aware, the technique as stated here has not been applied before. ↩︎

  3. We somewhat interchangeably use seeds and matches here. To me, a seed is a conceptual anchor that can be extended into an alignment. A match is the specific type of anchor we use: our seeds are usually matches between \(k\)-mers. ↩︎