Proof sketch for linear time seed heuristic alignment

24 April 2022 7-minute read

Table of Contents

Pairwise alignment in subquadratic time
Random model
Algorithm
- Seed heuristic
- Match pruning
Analysis
- Expanded states
  - Excess errors
- Algorithmic complexity

This post is a proof sketch to show that A* with the seed heuristic (Groot Koerkamp and Ivanov 2024) does exact pairwise alignment of random strings with random mutations in near linear time.

Pairwise alignment in subquadratic time

Backurs and Indyk (2018) show that computing edit distance can not be done in strongly subquadratic time (i.e. \(O(n^{2-\delta})\) for any \(\delta >0\)) assuming the Strong Exponential Time Hypothesis.

To break this barrier, there are multiple approaches:

Use algorithms with an output-sensitive runtime: \(O(ns)\) or \(O(n + s^2)\), which is \(o(n^2)\) when \(s = o(n)\).
Use approximation algorithms: these do not have to adhere to the \(n^2\) lower bound for exact algorithms.

Assume (uniform) random input, which is what we will do here. There are a few possibilities:

A given string, and a random string.
A non-random/arbitrary string, and a random mutation of it.
A random string, and a non-random/arbitrary/adversarial mutation of it.
Two independent random strings.
One random string, of which the other is a random mutation via some model.

	given \(A\)	random \(A\)
given \(B\)	\(\omega(n^{2-\delta})\)	-
\(B\) given at distance \(s\)	1: \(O(n+s^2)\)	3.3: ?
independent random \(B\)	3.1: ?	3.4: ?
\(B\) random \(e\%\leq f(n)\) mutations of \(A\)	3.2: ?	3.5: \(O(n)\) ¹ , this post

We will make the strongest possible assumption on the randomness of the input: that \(A\) is random, and that \(B\) is a random mutation of \(A\) with a limited/bounded error rate \(e \leq f(n)\).

Random model

There are multiple possible random models. One decision to make is the number of errors to introduce for a given error rate:

Fixed total: Use exactly \(e\cdot n\) errors.
Binomial total: Use \(Bin(n, e)\) errors.
Bernoulli per position: Loop over the positions, and for each character apply a mutation with probability \(e\).
Geometric per position: A position can be mutated multiple times. Use a geometric distribution with parameter \(p = 1/(1+e)\) for the number of mutations, so that each position has expected \(e\) mutations.

Another choice is the order in which mutations are applied:

Iterated errors: For each mutation, first choose whether it will be an insertion, deletion, or substitution. Then choose a random position in the string, and for insertions and substitutions a replacement character. This way, mutations can affect earlier mutations.
Up-front: Another possibility is to generate all errors up-front before applying any. This can work by generating a list of insert/delete/substitute instructions, and then executing them. Substitute instructions always refer to characters of the original string (never to inserted characters), and inserted characters are never deleted.

A final choice is whether or not to allow identity substitutions that do not change the current character.

For any model, we have a mapping between the characters of \(A\) and \(B\). In particular for the up-front model, each character of \(A\) maps to some interval of \(B\), and each character of \(B\) maps to a single character of \(A\). These characters correspond to each other.

The main difference between these methods is that iterative errors slightly favours longer runs of inserts: once a character is inserted, the probability that a second character will be inserted in the same ‘run’ is slightly larger because now there are two positions instead of just one.

The benefit of the up-front method is that it is easier to analyse the probability distribution of the number of errors in an interval, since the generated mutations do not depend on earlier mutations. I think this is the model of choice to analyse our probabilistic algorithm.

Previous work has used an indel channel Ganesh and Sy (2020) which covers all of the above points. This seems to be a promising approach since they already have results on it.

Algorithm

The algorithm runs an A* search from \(s\) to \(t\), using a heuristic \(h\). During the algorithm, a match is pruned as soon as its start is expanded. This increases the value of \(h\) and thus improves the A*.

Seed heuristic

We use a the seed heuristic used in A*PA. We partition \(A\) into seeds of length \(k\), and find all matches for each seed. The heuristic at a state \(u\) is simply the number of remaining seeds for which no matches exist. This is an admissible heuristic, i.e. lower bound on the remaining distance to \(t\), since all remaining seeds must be aligned, and each seed without a match will incur a cost of at least \(1\) for its alignment.

Note that the value of \(h(u)\) only depends on the position of the seed that ‘covers’ \(u\).

Match pruning

Once we expand the start of a match, we remove this match from consideration. If this is the last non-pruned match for a given seed, this means that \(h\) increases by \(1\) for all states left of \(u\). This can be implemented using e.g. a Fenwick tree: both queries and updates take \(O(\log (n/k))\) time this way.

Analysis

Expanded states

Here is a sketch to show that the number of expanded states is linear in \(n\).

Choose \(\log_\Sigma n^2 < k < 1/e\).

A proof could consist of the following steps:

With high probability, all matches are true positives. In particular, the number of false positives should be exponentially (or super-quadratically) small.
- Both this point and the next can probably reuse ideas/techniques/theorems from Ganesh and Sy (2020).
The optimal alignment of \(A\) and \(B\) goes through all matches.
In regions with \(x\) excess errors (see below), pruning ensures the local runtime is \(O(x^2)\).
Having \(x\) excess errors is exponentially rare, \(o(e^{-x})\).

A* works by keeping a priority queue of states ordered by \(f = g + h\). Our \(h\langle i, j\rangle\) only depends on \(i\), and is non-increasing. This means that \(f=g+h\) can only increase when \(g\) goes up by \(1\) (because of mismatch/indels) while \(h\) stays the same. If it weren’t for pruning, at any point in time all states have either value \(f\) or \(f+1\), where \(f\) is the smallest \(f(u)\) over explored (non-expanded) states \(u\).

In our model, we will pay for the expansion of all states at distance \(f\) as soon as \(f\) first reaches that value. We want to show that this number of states is \(O(1+x)\).

Excess errors

Let \(P(i)\) be the potential in column \(i\), i.e. the maximal number of errors the heuristic can anticipate from \(i\) to the end. The average potential per character is \(P(0)/n \approx p := r/k = 1/k\) for exact matches, which equals the maximal error rate the heuristic can anticipate.

Let \(E_i\) be the minimal number of errors on a path from column \(i\) to the end.² Write \(e_i := E_i - E_{i+1}\) as the number of errors in column \(i\).

We define \(x_i\) as the excess errors at \(i\): \(x_n = 0\),

\begin{align} x_i := \max(0, x_{i+1} + e_i - p) = \max(0, x_{i+1} + E_i - E_{i+1} - r/k) \end{align}

The total excess is \(X := \sum x_i\).

Hypothesis: The number of states expanded by A* with seed heuristic and match pruning is \(O(X)\).

Algorithmic complexity

Finding the matches is \(O(n)\) when there are no false-positives.
Evaluating the heuristic is \(O(1)\).
When there are only true positive matches, pruning is \(O(1)\).
Reordering states (updating their value in the priority queue) should only happen a logarithmic number of times per state. But it may well turn out to be \(O(1)\) in this case.

References

Backurs, Arturs, and Piotr Indyk. 2018. “Edit Distance Cannot Be Computed in Strongly Subquadratic Time (Unless Seth Is False).” Siam Journal on Computing 47 (3): 1087–97. https://doi.org/10.1137/15m1053128.

Ganesh, Arun, and Aaron Sy. 2020. “Near-Linear Time Edit Distance for Indel Channels.” arXiv. https://doi.org/10.48550/ARXIV.2007.03040.

Groot Koerkamp, Ragnar, and Pesho Ivanov. 2024. “Exact Global Alignment Using A* with Chaining Seed Heuristic and Match Pruning.” Edited by Tobias Marschall. Bioinformatics 40 (3). https://doi.org/10.1093/bioinformatics/btae032.

or \(O(n \log n)\)? ↩︎
Alternatively: given some shortest path \(\pi\) from \(s\) to \(t\), the distance from column \(i\) to the end along \(\pi\). ↩︎