[WIP] Homopolymer gap cost

Table of Contents

What do we actually want?
What can we compute
Possible definitions
Linear costs
Affine costs

A typical error in ONT reads is that for (long?) runs of the same character (homopolymers), the read may report an incorrect number of copies of the character. One way to work around this during alignment is by doing homopolymer compression (see e.g (NO_ITEM_DATA:hicanu-nurk2020)), where repeated copies (runs) of a character are replaced by a single copy.

In a discussion with Sergey Nurk and Pesho Ivanov, the idea came up to introduce a new type of mutation for pairwise alignment that accounts for this type of errors.

This posts goes over some possible implementations of this idea. Further work will be needed to choose the right one and find good parameters for the actual discrepancy between real and observed homopolymer lengths (e.g. Fig. 9 of (Delahaye and Nicolas 2021)).

What do we actually want? Link to heading

First some intuition:

Homopolymer indels should consist only of the same character.
- e.g. AC => ABBBC
They should be cheaper than normal indels in most cases. (Otherwise, what’s the point.)
- e.g. ABC => ABBBC should be cheaper than ABC => ABXXC.
Longer indels should cost more, given the same context.
- e.g. ABBC => ABBBBBC should be more expensive than ABBC => ABBBC.
The more copies of the character surround the indel, the cheaper it should be.
- e.g. ABBC => ABBBBC should be cheaper than ABC => ABBBC.

In general, we need a cost function for the cost of inserting/deleting \(m\) equal characters, given that \(k\) respectively \(l\) copies precede the indel in the two strings:

cost function

\(c(k, l; m)\) is the cost of an indel of size \(m>0\), when \(k\geq 0\) resp. \(l\geq 0\) copies of the character precede the indel.

I.e., the cost of transforming A (k*B) | C into A (l*B) | (m*B) C, where we assume everything before the | has already been aligned, and none of \(k\), \(l\), and \(m\) could increase.
(We assume that greedy matching another B is not possible.)

The intuition from before translates to:

\(c(k,l;m)\) is increasing¹ in \(m\);
\(c(k,l;m)\) is decreasing¹ in each of \(k\) and \(l\).

What can we compute Link to heading

Possible definitions Link to heading

Let’s first discuss some ways of defining homopolymer indels. For now we assume insertions and deletions are symmetric.

Basic idea: Any indel of equal characters is allowed.

One problem with this approach is that any indel becomes a homopolymer indel. That could be fixed by using affine costs, but that would penalize e.g. AAA => AAAA as much as AC => ABC, while the goal is to make homopolymer indels cheaper, even if they are only of length \(1\).

\(c(0,0;1) = \infty\)

Thus, we need a context aware definition.

Absolute contextual bound: Require that at least \(k>0\) preceding (non-indel) characters are equal to the indel char in both strings.
For \(k=2\) we may have ABBC => ABBBC. Note that ABC => BBBC would not be allowed in this case.

This has the benefit that an absolute lower bound on the number of equal preceding characters is easy to check algorithmically.

Relative contextual bound: For an indel of length \(l\), require at least \(f\cdot l\) adjacent copies of the same character, for some \(f> 0\).

Linear costs Link to heading

Affine costs Link to heading

References Link to heading

Delahaye, Clara, and Jacques Nicolas. 2021. “Sequencing Dna with Nanopores: Troubles and Biases.” Edited by Eduardo Andrés-León. Plos One 16 (10): e0257521. https://doi.org/10.1371/journal.pone.0257521.

NO_ITEM_DATA:hicanu-nurk2020

Non-strict. ↩︎ ↩︎