AStarix

12 November 2021 2-minute read

Ragnar Groot Koerkamp

Papers

AStarix is a method for aligning sequences (reads) to graphs:

Input

A reference sequence or graph
Alignment costs \((\Delta_{match}, \Delta_{subst}, \Delta_{del}, \Delta_{ins})\) for a match, substitution, insertion and deletion
Sequence(s) to align

Output

An optimal alignment of each input sequence

The input is a reference graph (automaton really) \(G_r = (V_r, E_r)\) with edges \(E_r \subseteq V_r\times V_r\times \Sigma\) that indicate the transitions between states.

At the core of AStarix, there is an implicit alignment graph \(G_a^q\), whose states (nodes) \(\langle v, i\rangle\) are a pair of a position in \(G_r\) and the number of characters of the sequence \(q\) that has already been matched. Transitions (weighted edges) in \(G_a^q\) correspond to matching/substituting/deleting/inserting one character of the sequence \(q\) with their respective cost. Note that this is a generalization of the transition costs in the DP table for conventional quadratic time alignment algorithms.

The optimal alignment is a path from \(\langle u, 0\rangle\) to \(\langle v, |q|\rangle\) for some \(u,v\in V_r\) with minimal cost.

As the name suggests, the A* algorithm is used with a heuristic to find such a path. To make this work, two optimizations are done:

Seed heuristic: A* needs a heuristic function \(h\) that lower bounds the distance between vertices: \(d(u, v) \geq h(u, v)\).

[WIP]