[WIP] Homopolymer gap cost

A typical error in ONT reads is that for (long?) runs of the same character (homopolymers), the read may report an incorrect number of copies of the character. One way to work around this during alignment is by doing homopolymer compression (see e.g (NO_ITEM_DATA:hicanu-nurk2020)), where repeated copies (runs) of a character are replaced by a single copy.

In a discussion with Sergey Nurk and Pesho Ivanov, the idea came up to introduce a new type of mutation for pairwise alignment that accounts for this type of errors.

This posts goes over some possible implementations of this idea. Further work will be needed to choose the right one and find good parameters for the actual discrepancy between real and observed homopolymer lengths (e.g. Fig. 9 of (Delahaye and Nicolas 2021)).

What do we actually want?

First some intuition:

  1. Homopolymer indels should consist only of the same character.
    • e.g. AC => ABBBC
  2. They should be cheaper than normal indels in most cases. (Otherwise, what’s the point.)
    • e.g. ABC => ABBBC should be cheaper than ABC => ABXXC.
  3. Longer indels should cost more, given the same context.
    • e.g. ABBC => ABBBBBC should be more expensive than ABBC => ABBBC.
  4. The more copies of the character surround the indel, the cheaper it should be.
    • e.g. ABBC => ABBBBC should be cheaper than ABC => ABBBC.

In general, we need a cost function for the cost of inserting/deleting \(m\) equal characters, given that \(k\) respectively \(l\) copies precede the indel in the two strings:

cost function
\(c(k, l; m)\) is the cost of an indel of size \(m>0\), when \(k\geq 0\) resp. \(l\geq 0\) copies of the character precede the indel.
  • I.e., the cost of transforming A (k*B) | C into A (l*B) | (m*B) C, where we assume everything before the | has already been aligned, and none of \(k\), \(l\), and \(m\) could increase.

    (We assume that greedy matching another B is not possible.)

The intuition from before translates to:

  • \(c(k,l;m)\) is increasing1 in \(m\);
  • \(c(k,l;m)\) is decreasing1 in each of \(k\) and \(l\).

What can we compute

Possible definitions

Let’s first discuss some ways of defining homopolymer indels. For now we assume insertions and deletions are symmetric.

Basic idea
Any indel of equal characters is allowed.

One problem with this approach is that any indel becomes a homopolymer indel. That could be fixed by using affine costs, but that would penalize e.g. AAA => AAAA as much as AC => ABC, while the goal is to make homopolymer indels cheaper, even if they are only of length \(1\).

  • \(c(0,0;1) = \infty\)

Thus, we need a context aware definition.

Absolute contextual bound
Require that at least \(k>0\) preceding (non-indel) characters are equal to the indel char in both strings.

For \(k=2\) we may have ABBC => ABBBC. Note that ABC => BBBC would not be allowed in this case.

This has the benefit that an absolute lower bound on the number of equal preceding characters is easy to check algorithmically.

Relative contextual bound
For an indel of length \(l\), require at least \(f\cdot l\) adjacent copies of the same character, for some \(f> 0\).

Linear costs

Affine costs

References

Delahaye, Clara, and Jacques Nicolas. 2021. “Sequencing Dna with Nanopores: Troubles and Biases.” Edited by Eduardo Andrés-León. Plos One 16 (10): e0257521. https://doi.org/10.1371/journal.pone.0257521.
NO_ITEM_DATA:hicanu-nurk2020

  1. Non-strict. ↩︎ ↩︎