Ideas on CuriousCoding

Practical minimizers

Thu, 12 Sep 2024 00:00:00 +0200

Table of Contents

1 Sampling schemes
- 1.1 Definitions
- 1.2 Miniception
- 1.3 Mod-minimizer
- 1.4 Forward scheme lower bound
- 1.5 Open syncmer minimizer
- 1.6 Open-closed minimizer
- 1.7 New: General mod-minimizer
- 1.8 Variant: Open-closed minimizer using offsets
2 Selection schemes
- 2.1 Definition
- 2.2 Bd-anchors
- 2.3 New: Smallest unique substring anchors
- 2.4 New: Anti lexicographic sorting
3 More sampling schemes
- 3.1 Anti-lex sus-anchors
- 3.2 Threshold anchors
- 3.3 The $t$-gap disappears for large alphabets
4 Computing the density of forward schemes
- 4.1 WIP: Anti lexicographic sus-anchor density
5 Open questions
6 Ideas

This post introduces some new practical sampling schemes. It builds on:

Mod-minimizers and other minimizers

Thu, 18 Jan 2024 00:00:00 +0100

\[ \newcommand{\d}{\mathrm{d}} \newcommand{\L}{\mathcal{L}} \]

This post introduces some background for minimizers and some experiments for a new minimizer variant. That new variant is now called the mod-minimizer and published at WABI24 (Groot Koerkamp and Pibiri 2024). This also includes a review of existing methods, including pseudocode for most of the methods covered below.

Notes on implementing Longest Common Repeat (LCR)

Wed, 06 Dec 2023 00:00:00 +0100

Table of Contents

Notes
Discussion / TODOs
- Evals

These are my running notes on implementing an algorithm for Longest Common Repeat using minimizers.

Notes

Coloured Tree Problem

See Lemma 3 here.

Generic sparse suffix array

For random strings and $b \leq n / \log n$, direct radix sort on $2\log n + \log\log n$-bit prefixes is sufficient for $O(n)$ runtime. In fact, since the computer word size satisfies $w \geq \log n$, we only need at most $2$ rounds of radix sort! (See simple-saca.)
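A minimal sketch of the prefix-sorting idea, not the simple-saca implementation. It assumes `positions` holds the $b$ sampled suffix starts and a DNA-like alphabet of size `sigma = 4`; both names, like the function itself, are made up for illustration. Python's built-in comparison sort stands in for radix sort, and the rare prefix ties are resolved by comparing full suffixes.

```python
import math


def sparse_suffix_array(text: str, positions: list[int], sigma: int = 4) -> list[int]:
    """Sort the sampled suffixes of `text`, looking only at short prefixes where possible."""
    n = max(len(text), 4)
    # Prefix length in bits as in the note: 2 log n + log log n.
    bits = 2 * math.log2(n) + math.log2(math.log2(n))
    # Convert bits to characters (assumption: log2(sigma) bits per character).
    plen = math.ceil(bits / math.log2(sigma))

    def prefix(i: int) -> str:
        return text[i:i + plen]

    # Stand-in for radix sort: sort on the fixed-length prefix only.
    order = sorted(positions, key=prefix)

    # Resolve groups with equal prefixes by full suffix comparison.
    result: list[int] = []
    k = 0
    while k < len(order):
        j = k + 1
        while j < len(order) and prefix(order[j]) == prefix(order[k]):
            j += 1
        group = order[k:j]
        if len(group) > 1:
            group.sort(key=lambda i: text[i:])
        result.extend(group)
        k = j
    return result
```

On random text almost every group has size one, so the fallback sort rarely runs.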

Research proposal: subquadratic string graph construction

Mon, 10 Jul 2023 00:00:00 +0200

Table of Contents

Introduction
Research plan

This is a research proposal for a 5-month internship at CWI during autumn/winter 2023-2024.

Introduction

An important problem in bioinformatics is genome assembly: DNA sequencing machines read substrings of a full DNA genome, and these pieces must be assembled together to recover the entire genome.

Doctoral plan

Mon, 12 Dec 2022 00:00:00 +0100

Research Proposal: Near-linear exact pairwise alignment

Abstract

Pairwise alignment, and edit distance specifically, is a problem that was first stated around 1968 (Needleman and Wunsch 1970; Vintsyuk 1968). It involves finding the minimal number of edits (substitutions, insertions, deletions) to transform one string/sequence into another. For sequences of length $n$, the original algorithm takes quadratic $O(n^2)$ time (Sellers 1974). In 1983, this was improved to $O(ns)$ for sequences with low edit distance $s$ using Band-Doubling. At the same time, a further improvement to $O(n+s^2)$ expected runtime was presented using the diagonal-transition method (Ukkonen 1983, 1985; Myers 1986).
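For reference, here is a minimal sketch of the classic quadratic dynamic program for unit-cost edit distance. This is the textbook baseline, not one of the faster methods mentioned above, and the function name is chosen just for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via the O(len(a) * len(b)) dynamic program."""
    # prev[j] is the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]
```

Only two rows are kept, so memory is linear in the length of `b` while time stays quadratic.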

String algorithm visualizations

Tue, 08 Nov 2022 00:00:00 +0100

Select the algorithm to visualize
Click the buttons, or click the canvas and use the indicated keys

Suffix-array construction is explained here and BWT is explained here.

Source code is on GitHub.

Thoughts on linear programming

Fri, 04 Nov 2022 00:00:00 +0100

Table of Contents

Linear programming
Assumptions
Idea for an algorithm

This note contains some ideas about linear programming and most-orthogonal faces. They’re mostly on an intuitive level and not very formal.

Postscriptum: The ideas here don’t work.

Linear programming

Maximize $\t\x$ subject to $A\x \leq \b$.

$\x$ is a vector of $n$ variables $x_i$.
$A$ is a $m\times n$ matrix: there are $m$ constraints $A_j \x \leq b_j$.

Assumptions

We make the following assumptions:

Local Doubling

Wed, 19 Oct 2022 00:00:00 +0200

Table of Contents

Notation
Needleman-Wunsch: where it all begins
Dijkstra/BFS: visiting fewer states
Band doubling: Dijkstra, but more efficient
GapCost: A first heuristic
Computational volumes: an even smaller search
Cheating: an oracle gave us $g^*$
A*: Better heuristics
Broken idea: A* and computational volumes
Local doubling
- Without heuristic
- With heuristic
Diagonal Transition
A* with Diagonal Transition and pruning: doing less work
Goal: Diagonal Transition + pruning + local doubling
Pruning: Improving A* heuristics on the go
Cheating more: an oracle gave us the optimal path
TODO: aspiration windows

\begin{equation*} \newcommand{\st}[2]{\langle #1,#2\rangle} \newcommand{\g}{g^*} \newcommand{\fm}{f_{max}} \newcommand{\gap}{\operatorname{Gap}} \end{equation*}

Reducing A* memory usage using fronts

Mon, 26 Sep 2022 00:00:00 +0200

Table of Contents

Motivation
Partitioning A* memory by fronts
- Non-consistent heuristics
- Front indexing
Tracing back the path

Here is an idea to reduce the memory usage of A* by only storing one front at a time, similar to what Edlib and WFA do. Note that for now this will not work, but I’m putting this online anyway.

Motivation

In our implementation of A*PA, we use a hashmap to store the value of $g$ of all visited (explored/expanded) states by A*. This can take up a lot of memory and simply reading/writing $g$ in the hashmap can take over half the total execution time.

Speeding up A*: computational volumes and path-pruning

Fri, 23 Sep 2022 00:00:00 +0200

Table of Contents

Motivation
Summary
Why is A* slow?
Computational volumes
Dealing with pruning
- Thoughts on more aggressive pruning
Algorithm summary
Challenges
Results
What about band-doubling?
- Maybe doubling can work after all?
TODOs
Extensions

This post builds on our recent preprint Groot Koerkamp and Ivanov (2024) and gives an overview of some of my new ideas to significantly speed up exact global pairwise alignment. It’s recommended you understand the seed heuristic and match pruning before reading this post.

Linear memory WFA?

Wed, 17 Aug 2022 00:00:00 +0200

Table of Contents

Motivation
Path traceback: two strategies
Observations
- What information is needed for path tracing
A pragmatic solution
Another interpretation
Affine costs
Conclusion

Figure 1: Only the red substitutions and blue indel need to be stored to trace the entire path.

In this post I’ll discuss an idea to run WFA using less memory, while still allowing us to trace back the optimal path from the target state back to the start of the search.

Transforming match bonus into cost

Tue, 16 Aug 2022 00:00:00 +0200

Table of Contents

Tricks with match bonus or how to fool Dijkstra’s limitations
Conclusion

Tricks with match bonus or how to fool Dijkstra’s limitations

The reader is assumed to have basic knowledge about pairwise alignment and graph theory.

Diamond optimisation for diagonal transition

Mon, 01 Aug 2022 00:00:00 +0200

Table of Contents

Diamond transition or how technicalities can break concepts
- But let’s take a closer look
- Conclusion

Diamond transition or how technicalities can break concepts

We assume the reader has some basic knowledge about pairwise alignment and in particular the WFA algorithm.

In this post we dive into a potential 2x speedup of WFA — one that turns out not to work.

Let’s take a look at one of the most important and efficient algorithms for pairwise alignment — WFA (Marco-Sola et al. 2020). It already looks good, and is pretty efficient. In Table 1, which copies the style of Figure 1 in Eizenga and Paten (2022), rows are wavefronts, and columns are diagonals. Light-blue states are stored in memory. Green shows the current state being computed, and dark-blue shows the cells the green cell depends on.

The BiWFA meeting condition

Mon, 11 Jul 2022 00:00:00 +0200

cross references: BiWFA GitHub issue

It seems that getting the meeting/overlap condition of BiWFA (Marco-Sola et al. (2023), Algorithm 1 and Lemma 2.1) correct is tricky.

Let $p := \max(x, o+e)$ be the maximal cost of any edge in the edit graph. As in the BiWFA paper, let $s_f$ and $s_r$ be the distances of the forward and reverse fronts computed so far.

We prove the following lemma:

Lemma. Once BiWFA has expanded the forward and reverse fronts up to $s_f$ and $s_r$ and has found some path of cost $s \leq s_f + s_r$, expanding the fronts until $s'_f + s'_r \geq s+p+o$ is guaranteed to find a shortest path.
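The stopping rule in the lemma can be sketched as a small check. This assumes the usual WFA meaning of the costs ($x$ for a mismatch, $o$ for gap open, $e$ for gap extend); the function names are hypothetical and do not come from the BiWFA code.

```python
def max_edge_cost(x: int, o: int, e: int) -> int:
    """p := max(x, o + e), the maximal cost of any edge in the edit graph."""
    return max(x, o + e)


def may_stop(s: int, sf: int, sr: int, x: int, o: int, e: int) -> bool:
    """True once the expanded fronts satisfy s'_f + s'_r >= s + p + o.

    `s` is the cost of the path found so far; `sf` and `sr` are the
    current forward and reverse front distances (s'_f and s'_r).
    """
    return sf + sr >= s + max_edge_cost(x, o, e) + o
```

For example, with $x=4$, $o=6$, $e=2$ we get $p=8$, so after finding a path of cost $s=20$ the fronts must reach a combined distance of $34$.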

Proof sketch for linear time seed heuristic alignment

Sun, 24 Apr 2022 00:00:00 +0200

Table of Contents

Pairwise alignment in subquadratic time
Random model
Algorithm
- Seed heuristic
- Match pruning
Analysis
- Expanded states
  - Excess errors
- Algorithmic complexity

This post is a proof sketch to show that A* with the seed heuristic (Groot Koerkamp and Ivanov 2024) does exact pairwise alignment of random strings with random mutations in near linear time.

Pairwise alignment in subquadratic time

Backurs and Indyk (2018) show that computing edit distance cannot be done in strongly subquadratic time (i.e. $O(n^{2-\delta})$ for any $\delta >0$) assuming the Strong Exponential Time Hypothesis.

Variations on the WFA recursion

Sun, 17 Apr 2022 03:14:00 +0200

Table of Contents

Gap open
Gap close
Symmetric alternatives
Another symmetry
Conclusions

cross references: BiWFA GitHub issue

In this post I will explore some variations of the recursion used by WFA/BiWFA for the affine version of the diagonal transition algorithm. In particular, we will go over a gap-close variant, and look into some more symmetric formulations.

Gap open

WFA (Marco-Sola et al. 2020) introduces the affine cost variant of the classic diagonal transition method. Let us call it a gap-open variant, because the gap-open cost $o$ is paid when opening the gap, that is, when jumping from the $M$ layer to the $I$ or $D$ layer.

Pruning for A* heuristics

Sat, 11 Dec 2021 00:00:00 +0100

Note: this post extends the concept of multiple-path pruning presented in Poole and Mackworth (2017).

Say we’re running A* in a graph from $s$ to $t$. $d(s,t)$ is the distance we are looking for.

An A* heuristic has to satisfy $h(u) \leq d(u, t)$ to be admissible: the estimated distance to the end should never be larger than the actual distance to guarantee that the algorithm finds a shortest path.
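To make the admissibility condition concrete, here is a small sketch that checks $h(u) \leq d(u, t)$ on an explicit graph. The graph representation (a dict from node to a list of `(neighbour, weight)` pairs with non-negative weights) and the function names are assumptions for this example; the true distances come from Dijkstra on the reversed graph.

```python
import heapq


def distances_to(graph: dict, t) -> dict:
    """d(u, t) for every node u that can reach t, via Dijkstra on reversed edges."""
    rev: dict = {}
    for u, edges in graph.items():
        rev.setdefault(u, [])
        for v, w in edges:
            rev.setdefault(v, []).append((u, w))
    dist = {t: 0}
    heap = [(0, t)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in rev.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def is_admissible(graph: dict, t, h) -> bool:
    """Check h(u) <= d(u, t) for all nodes that can reach t."""
    return all(h(u) <= d for u, d in distances_to(graph, t).items())
```

The zero heuristic always passes this check, and so does the exact distance; any heuristic that overestimates at even one node fails it.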

Spaced K-mer Seeded Distance

Wed, 20 Oct 2021 00:00:00 +0200

Table of Contents

Background
Introduction
Spaced $k$-mer Seeded Distance
Phylogeny reconstruction
- Running the algorithm
TODO Assembly

\[ \newcommand{\vp}{\varphi} \newcommand{\A}{\mathcal A} \newcommand{\O}{\mathcal O} \newcommand{\N}{\mathbb N} \newcommand{\ed}{\mathrm{ed}} \newcommand{\mh}{\mathrm{mh}} \newcommand{\hash}{\mathrm{hash}} \]

Background

Quickly finding similar pieces of DNA within large datasets is at the core of computational biology. This has many applications:

Alignment: Given two pieces of related DNA, align them to find where mutations (i.e. substitutions, insertions, or deletions) occur.

Ideas for assembling [long] reads

Fri, 09 Jul 2021 00:00:00 +0200

\[ \newcommand{\vp}{\varphi} \newcommand{\A}{\mathcal A} \newcommand{\O}{\mathcal O} \newcommand{\N}{\mathbb N} \newcommand{\Z}{\mathbb Z} \newcommand{\ed}{\mathrm{ed}} \newcommand{\mh}{\mathrm{mh}} \newcommand{\hash}{\mathrm{hash}} \]

Here is an idea for an algorithm to assemble long reads.

Go over all sequences and sketch their windows using the Hamming distance preserving sketch method described here. This method may need some tweaking to also work with an indel rate of around 10%.
Let’s say we find a pair of matching windows between reads $A$ and $B$ starting at positions $i$ and $j$. This indicates that $A$ and $B$ may be related with an offset of $j-i$.

Hamming Similarity Search

Thu, 08 Jul 2021 00:00:00 +0200

Table of Contents

Background
Introduction
Hamming Similarity Search
Phylogeny reconstruction
- Running the algorithm
Assembly

Background

Quickly finding similar pieces of DNA within large datasets is at the core of computational biology. This has many applications:

Alignment: Given two pieces of related DNA, align them to find where mutations (i.e. substitutions, insertions, or deletions) occur.