Note on CuriousCoding

Tools for suffix array searching

Fri, 14 Jun 2024 00:00:00 +0200

Table of Contents

1 Sapling
2 PLA-Index
3 LISA: learned index

Let’s summarize some tools for efficiently searching suffix arrays.

1 Sapling

Sapling (Kirsche, Das, and Schatz 2020) works as follows:

Choose a parameter $p$ and store, for each of the $2^p$ $p$-bit prefixes, the corresponding position in the suffix array.
When querying, first find the bucket for the query prefix. Then do a linear interpolation inside the bucket.
Search the area $[-E, +E]$ around the interpolated position, where $E$ is a bound on the error of the linear approximation. In practice $E$ is only a $95\%$-confidence bound, and if the true value is not in the range, a linear search with steps of size $E$ is done.
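The three steps above can be sketched as follows. This is a simplified illustration, not Sapling's actual code: it assumes a DNA alphabet packed at 2 bits per character, builds the suffix array naively by sorting, treats the query as a single $k$-mer, and takes the error bound as a plain parameter `err`. The function names `encode`, `build`, and `lookup` are made up for this sketch.

```python
from bisect import bisect_left

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(s, k):
    """Pack the first k characters of s into a 2k-bit integer, padding with A."""
    x = 0
    for i in range(k):
        x = (x << 2) | (CODE[s[i]] if i < len(s) else 0)
    return x

def build(text, k, p):
    """Return the suffix array and, for each p-bit prefix, the start of its bucket."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    keys = [encode(text[i:], k) for i in sa]               # non-decreasing along sa
    shift = 2 * k - p
    # starts[b] is the first SA index whose k-mer code has p-bit prefix >= b.
    starts = [bisect_left(keys, b << shift) for b in range((1 << p) + 1)]
    return sa, starts

def lookup(text, sa, starts, k, p, query, err):
    """Return an SA index whose suffix starts with the k-mer `query`, or None."""
    shift = 2 * k - p
    x = encode(query, k)
    b = x >> shift
    lo, hi = starts[b], starts[b + 1]
    if lo == hi:
        return None  # empty bucket
    # Linear interpolation inside the bucket, using the bits after the prefix.
    frac = (x & ((1 << shift) - 1)) / (1 << shift)
    guess = lo + int(frac * (hi - lo))
    kmer = lambda i: text[sa[i]:sa[i] + k]
    left, right = max(lo, guess - err), min(hi - 1, guess + err)
    # If the window does not bracket the query, widen it in steps of size err.
    while left > lo and kmer(left) > query:
        left = max(lo, left - err)
    while right < hi - 1 and kmer(right) < query:
        right = min(hi - 1, right + err)
    # Binary search inside the final window.
    while left < right:
        mid = (left + right) // 2
        if kmer(mid) < query:
            left = mid + 1
        else:
            right = mid
    return left if kmer(left) == query else None
```

With a good interpolation, the window `[guess - err, guess + err]` usually already contains the answer, so the widening loops rarely run.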

The paper also introduces a neural network approach to approximating buckets, but this takes over a day to learn and is slower to query in practice.

Crates for suffix array construction

Thu, 13 Jun 2024 00:00:00 +0200

Popular C libraries are:

Both have a ..64 variant that supports input strings longer than 2GB.

Rust wrappers:

divsufsort: Rust reimplementation, does not support large inputs.
cdivsufsort: C wrapper, does not support large inputs.
libdivsufsort-rs: C wrapper, does support large inputs.
sais: unrelated to the original library; does not implement a linear time algorithm anyway.
libsais-rs: Daniel Liu’s fork-of-fork of the original, but not on crates.io. Supports multithreading using OpenMP and wraps both the original and the 64-bit version.
simple-saca: Daniel Liu’s bounded-context suffix array construction that is faster than divsufsort and libsais, but does not return a true fully sorted suffix array.

References

Notes on SsHash

Mon, 15 Jan 2024 00:00:00 +0100

Table of Contents

Paper summary
Remarks
Ideas

\[\newcommand{\S}{\mathcal{S}}\]

Paper summary

Intro

SsHash (Pibiri 2022) is a data structure for indexing kmers. Given a set of kmers $\S$, it supports two operations:

$Lookup(g)$: return the unique id $i\in [|\S|]$ of the kmer $g$.
$Access(i)$: return the kmer corresponding to id $i$.

It also supports streaming queries, looking up all kmers from a longer string consecutively, by exploiting the overlap between them.

Notes on writing course

Tue, 14 Nov 2023 00:00:00 +0100

Some notes from the writing course I’m taking.

Lecture 1, 14 November

Resources

Searching phrases/alternatives in quotes in Google Scholar can tell which one is more frequently used.

BBHash: some ideas

Mon, 04 Sep 2023 00:00:00 +0200

Table of Contents

Possible speedup?

BBHash (Limasset et al. 2017) uses multiple layers to create a minimal perfect hash function (MPHF) that hashes some input set into $[n]$.

(See also my note on PTHash (Pibiri and Trani 2021).)

Simply put, it maps the $n$ elements into $[\gamma \cdot n]$ using a hash function $h_0$. The $k_0$ elements that have collisions are mapped into $[\gamma \cdot k_0]$ using $h_1$. Then, the $k_1$ elements with collisions are mapped into $[\gamma \cdot k_1]$ using $h_2$, and so on.
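A minimal sketch of this layered scheme, under some assumptions: the per-level hash functions $h_i$ are simulated by salting BLAKE2b with the level number, bit arrays are plain Python lists, and ranks are computed by summing rather than with a rank structure. The names `h`, `build`, and `lookup` are hypothetical and not taken from the BBHash code.

```python
import hashlib

def h(key, level, size):
    """Hash function h_level, mapping key into [size)."""
    digest = hashlib.blake2b(f"{level}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") % size

def build(keys, gamma=2.0, max_levels=64):
    """Return one bit array per level; a set bit marks a slot hit by exactly one key."""
    levels = []
    remaining = list(keys)
    while remaining:
        if len(levels) == max_levels:
            raise RuntimeError("too many levels; this sketch has no fallback")
        level = len(levels)
        size = max(1, int(gamma * len(remaining)))
        counts = [0] * size
        for key in remaining:
            counts[h(key, level, size)] += 1
        levels.append([c == 1 for c in counts])
        # Keys that collided move on to the next level.
        remaining = [k for k in remaining if counts[h(k, level, size)] != 1]
    return levels

def lookup(levels, key):
    """Return the id of key in [n), assuming key was part of the input set."""
    offset = 0
    for level, bits in enumerate(levels):
        pos = h(key, level, len(bits))
        if bits[pos]:
            # Rank of this bit, plus all set bits in earlier levels.
            return offset + sum(bits[:pos])
        offset += sum(bits)
    return None
```

Since every key sets exactly one bit across all levels, the ranks form a bijection onto $[n]$, which is what makes the function minimal.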

BitPAl bitpacking algorithm

Sun, 03 Sep 2023 00:00:00 +0200

Table of Contents

Problem
Input
Example
Discussion
Found the bug
Outlook

The supplement (download) of the Loving, Hernandez, and Benson (2014) paper introduces a $15$-operation version of Myers’ (1999) bitpacking algorithm, which uses $16$ operations when modified for edit distance.

I tried implementing it, but it seems to have a bug that I will describe below. The fix is here.

Problem

To recap, this algorithm solves the unit-cost edit distance problem by using bitpacking to compute a $1\times w$ block at a time. As input, it takes

Thoughts on linear programming

Fri, 04 Nov 2022 00:00:00 +0100

Table of Contents

Linear programming
Assumptions
Idea for an algorithm

This note contains some ideas about linear programming and most-orthogonal faces. They’re mostly on an intuitive level and not very formal.

Postscriptum: The ideas here don’t work.

Linear programming

Maximize $\t\x$ subject to $A\x \leq \b$.

$\x$ is a vector of $n$ variables $x_i$.
$A$ is an $m\times n$ matrix: there are $m$ constraints $A_j \x \leq b_j$.

Assumptions

We make the following assumptions:

A Combinatorial Identity

Sun, 16 Oct 2022 00:00:00 +0200

Some notes regarding the identity

\begin{equation} \sum_{k=0}^n \binom{2k}k \binom{2n-2k}{n-k} = 4^n \end{equation}

Gould has two derivations:
- The first, from Jensen’s equality, (18) in (Jensen 1902; Shijie 1303).
- A second via the Chu-Vandermonde convolution:
  
  \begin{equation} \sum_{k=0}^n \binom{x}k \binom{y}{n-k} = \binom{x+y}n \end{equation}
  
  using $x=y=-\frac 12$ and using the $-\frac 12$-transform:
  
  \begin{equation} \binom{-1/2}{n} = (-1)^n\binom{2n}{n}\frac 1 {2^{2n}} \end{equation}
Duarte and de Oliveira (2012) give a combinatorial proof.
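As a sanity check of the second derivation, the sketch below verifies the identity, the $-\frac 12$-transform, and the Chu-Vandermonde convolution at $x=y=-\frac 12$ for small $n$ using exact rational arithmetic. The helper names are made up for this check.

```python
from fractions import Fraction
from math import comb

def gbinom(x, n):
    """Generalized binomial coefficient binom(x, n) for rational x and integer n >= 0."""
    result = Fraction(1)
    for i in range(n):
        result *= Fraction(x - i) / (i + 1)
    return result

def central_convolution(n):
    """Left-hand side of the identity: sum of binom(2k, k) * binom(2n-2k, n-k)."""
    return sum(comb(2 * k, k) * comb(2 * (n - k), n - k) for k in range(n + 1))

def half_transform_holds(n):
    """Check binom(-1/2, n) = (-1)^n * binom(2n, n) / 2^(2n)."""
    return gbinom(Fraction(-1, 2), n) == Fraction((-1) ** n * comb(2 * n, n), 4 ** n)

def vandermonde_at_minus_half(n):
    """Chu-Vandermonde sum at x = y = -1/2; should equal binom(-1, n) = (-1)^n."""
    x = Fraction(-1, 2)
    return sum(gbinom(x, k) * gbinom(x, n - k) for k in range(n + 1))
```

Substituting the transform into the convolution gives $(-1)^n 4^{-n} \sum_k \binom{2k}k\binom{2n-2k}{n-k} = \binom{-1}{n} = (-1)^n$, which is the identity after multiplying by $(-4)^n$.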

References

Duarte, Rui, and António Guedes de Oliveira. 2012. “New Developments of an Old Identity.” https://doi.org/10.48550/ARXIV.1203.5424.

Jensen, J. L. W. V. 1902. “Sur Une Identité D’abel et Sur D’autres Formules Analogues.” Acta Mathematica 26 (0): 307–18. https://doi.org/10.1007/bf02415499.

Shijie, Zhu. 1303. Jade Mirror of the Four Unknowns.

Linear-time suffix array construction

Thu, 13 Oct 2022 00:00:00 +0200

Table of Contents

Notation
Small and Large suffixes
Building the suffix array from a smaller one
Visualization

These are some notes about linear-time suffix array (SA) construction algorithms (SACAs).

At the bottom you can find a visualization.
This page has an interactive demo.

History of suffix array construction algorithms:

1990 first algorithm: Manber and Myers (1993)
2002 small/large suffixes, explained below: Ko and Aluru (2005)
2009 recursion only on LMS suffixes: Nong, Zhang, and Chan (2009)

These slides from Stanford are a nice reference for the last algorithm.

Reducing A* memory usage using fronts

Mon, 26 Sep 2022 00:00:00 +0200

Table of Contents

Motivation
Partitioning A* memory by fronts
- Non-consistent heuristics
- Front indexing
Tracing back the path

Here is an idea to reduce the memory usage of A* by only storing one front at a time, similar to what Edlib and WFA do. Note that for now this will not work, but I’m putting this online anyway.

Motivation

In our implementation of A*PA, we use a hashmap to store the value of $g$ of all visited (explored/expanded) states by A*. This can take up a lot of memory and simply reading/writing $g$ in the hashmap can take over half the total execution time.

Bidirectional A*

Thu, 28 Jul 2022 17:59:00 +0200

These are some links and papers on bidirectional A* variants. Nothing insightful at the moment.

small lecture: introduces $h_f(u) = \frac 12 (\pi_f(u) - \pi_r(u))$. I have not found a paper yet.
An Improved Bidirectional Heuristic Search Algorithm (Champeaux 1977): introduces a bidirectional variant
Bidirectional Heuristic Search Again (Champeaux 1983): fixes a bug in the above paper
Efficient modified bidirectional A* algorithm for optimal route-finding: Didn’t read closely yet.
A new bidirectional algorithm for shortest paths (Pijls 2008): Actually a new method. Seems to cite useful papers.
The 2 papers that cite this one may also be interesting.

The BiWFA meeting condition

Mon, 11 Jul 2022 00:00:00 +0200

cross references: BiWFA GitHub issue

It seems that getting the meeting/overlap condition of BiWFA (Marco-Sola et al. (2023), Algorithm 1 and Lemma 2.1) correct is tricky.

Let $p := \max(x, o+e)$ be the maximal cost of any edge in the edit graph. As in the BiWFA paper, let $s_f$ and $s_r$ be the distances of the forward and reverse fronts computed so far.

We prove the following lemma:

Lemma: Once BiWFA has expanded the forward and reverse fronts up to $s_f$ and $s_r$ and has found some path of cost $s \leq s_f + s_r$, expanding the fronts until $s'_f + s'_r \geq s+p+o$ is guaranteed to find a shortest path.

Benchmark attention points

Thu, 28 Apr 2022 23:33:00 +0200

Benchmarking is harder than you think, even when taking into account this rule.

This post lists some lessons I learned while attempting to run benchmarks for A* pairwise aligner. I was doing this on a laptop, which likely has different characteristics from CPUs in a typical server rack. All the programs I run are single threaded.

Hardware

Do not run while charging the laptop: Charging makes the battery hot and causes throttling. Run either on battery power or with a completely full battery to prevent this.
Disable hyperthreading: Completely disable hyperthreading in the BIOS. Multiple programs running on the same core may fight for resources.

CPU settings

Pin CPU frequency: CPUs, especially those in laptops, have turbo boost, (thermal) throttling, and powersave features. Make sure to pin the CPU core frequency low enough that it can be sustained for a long time without throttling.
In my case, the performance governor can fix the CPU frequency. The base frequency of my CPU is 2.6GHz, so that’s where I pinned it.

Proof sketch for linear time seed heuristic alignment

Sun, 24 Apr 2022 00:00:00 +0200

Table of Contents

Pairwise alignment in subquadratic time
Random model
Algorithm
- Seed heuristic
- Match pruning
Analysis
- Expanded states
  - Excess errors
- Algorithmic complexity

This post is a proof sketch to show that A* with the seed heuristic (Groot Koerkamp and Ivanov 2024) does exact pairwise alignment of random strings with random mutations in near linear time.

Pairwise alignment in subquadratic time

Backurs and Indyk (2018) show that computing edit distance cannot be done in strongly subquadratic time (i.e. $O(n^{2-\delta})$ for any $\delta >0$) assuming the Strong Exponential Time Hypothesis.

AStarix

Fri, 12 Nov 2021 13:05:00 +0100

Papers

AStarix is a method for aligning sequences (reads) to graphs:

Input

A reference sequence or graph
Alignment costs $(\Delta_{match}, \Delta_{subst}, \Delta_{del}, \Delta_{ins})$ for a match, substitution, deletion, and insertion
Sequence(s) to align

Output

An optimal alignment of each input sequence

The input is a reference graph (automaton really) $G_r = (V_r, E_r)$ with edges $E_r \subseteq V_r\times V_r\times \Sigma$ that indicate the transitions between states.

Neighbour joining

Fri, 12 Nov 2021 11:57:00 +0100

Neighbour joining (NJ, paper) is a phylogeny reconstruction method. It differs from UPGMA in the way it computes the distances between clusters.

This algorithm first assumes that the phylogeny is a star graph. Then it finds the pair of vertices that, when merged and split out, gives the minimal total edge length $S_{i,j}$ of the new almost-star graph. (See eq. (4) and figure 2a and 2b in the paper.)

\[ S_{i,j} = \frac1{2(n-2)} \sum_{k\not\in \{i,j\}}(d(i, k)+d(j,k)) + \frac 12 d(i,j)+\frac 1{n-2} \sum_{k<l,\, k, l\not\in\{i,j\}}d(k,l). \]

After multiplying by $2(n-2)$ and subtracting twice the sum of all pairwise distances (which is a constant), we obtain the familiar

\[ Q(i, j) = (n-2) d(i, j) - \sum_{k=1}^n d(i, k) - \sum_{k=1}^n d(j, k). \]

Thus, we merge the two vertices that minimize $Q$. The distance from the merged vertex of $i$ and $j$ to each other vertex $k$ is $d((i,j), k) = (d(i,k) + d(j,k))/2$.
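The following sketch shows one merge step and checks numerically that $Q(i,j) = 2(n-2)\,S_{i,j} - 2T$, where $T$ is the sum of all pairwise distances. It assumes a symmetric distance matrix given as a list of lists with a zero diagonal; the function names and the example matrix are chosen only for illustration.

```python
def total_length(d, i, j):
    """S_{i,j}: total edge length of the almost-star graph with i and j split out."""
    n = len(d)
    rest = [k for k in range(n) if k not in (i, j)]
    to_pair = sum(d[i][k] + d[j][k] for k in rest) / (2 * (n - 2))
    among_rest = sum(d[k][l] for k in rest for l in rest if k < l) / (n - 2)
    return to_pair + d[i][j] / 2 + among_rest

def q_value(d, i, j):
    """Q(i, j) = (n-2) d(i,j) - sum_k d(i,k) - sum_k d(j,k)."""
    n = len(d)
    return (n - 2) * d[i][j] - sum(d[i]) - sum(d[j])

def best_pair(d):
    """Pair (i, j) with i < j that minimizes Q."""
    n = len(d)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return min(pairs, key=lambda ij: q_value(d, *ij))

def merge(d, i, j):
    """Merge i and j; the new vertex goes last, at distance (d(i,k) + d(j,k)) / 2."""
    keep = [k for k in range(len(d)) if k not in (i, j)]
    new_row = [(d[i][k] + d[j][k]) / 2 for k in keep]
    out = [[d[a][b] for b in keep] + [new_row[x]] for x, a in enumerate(keep)]
    out.append(new_row + [0.0])
    return out

# Example: five taxa; the first two are joined first.
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
i, j = best_pair(d)
smaller = merge(d, i, j)
```

Repeating `best_pair` and `merge` until three vertices remain gives the tree topology; computing branch lengths is left out of this sketch.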

UPGMA

Thu, 28 Oct 2021 11:56:00 +0200

Unweighted pair group method with arithmetic mean (UPGMA) is a phylogeny reconstruction method.

Input: Matrix of pairwise distances
Output: Phylogeny
Algorithm: Repeatedly merge the nearest two clusters. The distance between clusters is the average of all pairwise distances between them. When merging two clusters, the distances of the new cluster are the averages of the distances from the two clusters being merged, weighted by cluster size.
Complexity: $O(n^3)$ naive, $O(n^2 \ln n)$ using a heap.
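A minimal sketch of the naive $O(n^3)$ variant, assuming a symmetric distance matrix as a list of lists. The output format, nested tuples `(left, right, height)` with the merge height equal to half the cluster distance, is a choice made for this example.

```python
def upgma(d, names):
    """Naive UPGMA; returns the tree as nested (left, right, height) tuples."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, size)
    dist = {(i, j): d[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Find the nearest two clusters; keys are always (smaller id, larger id).
        (a, b), dab = min(dist.items(), key=lambda kv: kv[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        for c in clusters:
            dac = dist.pop((min(a, c), max(a, c)))
            dbc = dist.pop((min(b, c), max(b, c)))
            # Size-weighted average keeps this equal to the mean pairwise distance.
            dist[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        del dist[(a, b)]
        clusters[next_id] = ((tree_a, tree_b, dab / 2), size_a + size_b)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree

# Example with two clear pairs.
d = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
tree = upgma(d, ["A", "B", "C", "D"])
```

Replacing the linear scan for the minimum by a heap gives the faster variant mentioned above.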

Spaced k-mer and assembler methods

Wed, 14 Jul 2021 00:00:00 +0200

Table of Contents

Spaced $k$-mers
Minimap
SPAdes
MUMmer4
BLASR
Bowtie 2
Patternhunter
Spaced seeds improve $k$-mer-based metagenomic classification
LoMeX
Meeting notes

Concepts:

Mapping: Map a sequence onto a reference genome/dataset
Assembly: Build a genome from a set of reads
- de novo (implied): without using a reference genome
- Otherwise just called mapping

Typical complicating factors:

read errors
non-uniform coverage
insert size variation
chimeric reads (?)
bireads
non-uniform read coverage (as in metagenomics, i.e. multi cell assembly)

Spaced $k$-mers

Also called
