Note on CuriousCoding
https://curiouscoding.nl/tags/note/
Recent content in Note on CuriousCoding
Generator: Hugo. Language: en.
Fri, 14 Jun 2024 00:00:00 +0200

Tools for suffix array searching
https://curiouscoding.nl/posts/suffix-array-searching/
Fri, 14 Jun 2024 00:00:00 +0200
Table of Contents: 1 Sapling; 2 PLA-Index; 3 LISA: learned index
Let’s summarize some tools for efficiently searching suffix arrays.
1 Sapling
Sapling (Kirsche, Das, and Schatz 2020) works as follows:
- Choose a parameter \(p\) and store, for each of the \(2^p\) \(p\)-bit prefixes, the corresponding position in the suffix array.
- When querying, first find the bucket for the query prefix.
- Then do a linear interpolation inside the bucket.
- Search the area \([-E, +E]\) around the interpolated position, where \(E\) is a bound on the error of the linear approximation.

Crates for suffix array construction
https://curiouscoding.nl/posts/suffix-array-crates/
Thu, 13 Jun 2024 00:00:00 +0200
Popular C libraries are:
- divsufsort
- libsais
Both have a ..64 variant that supports input strings longer than 2 GB.
Rust wrappers:
- divsufsort: Rust reimplementation; does not support large inputs.
- cdivsufsort: C wrapper; does not support large inputs.
- libdivsufsort-rs: C wrapper; supports large inputs.
- sais: unrelated to the original library; does not implement a linear-time algorithm anyway.
- libsais-rs: Daniel Liu’s fork-of-fork of the original, but not on crates.io. Supports multithreading using OpenMP and wraps both the original and the 64-bit version.

Notes on SsHash
https://curiouscoding.nl/posts/sshash/
Mon, 15 Jan 2024 00:00:00 +0100
Table of Contents: Paper summary; Intro; Prelims; Related work; Sparse and skew hashing; Remarks; Ideas
\[\newcommand{\S}{\mathcal{S}}\]
Paper summary
Intro
SsHash (Pibiri 2022) is a data structure for indexing kmers. Given a set of kmers \(\S\), it supports two operations:
- \(Lookup(g)\): return the unique id \(i\in [|\S|]\) of the kmer \(g\).
- \(Access(i)\): return the kmer corresponding to id \(i\).
It also supports streaming queries, looking up all kmers from a longer string consecutively, by exploiting the overlap between them.

Notes on writing course
https://curiouscoding.nl/posts/writing-course/
Tue, 14 Nov 2023 00:00:00 +0100
Table of Contents: Lecture 1, 14 November (Resources; Reader friendliness; Typical problems); Lecture 2, 21 November (Paragraph level expectations; Flow; Assignment for next week); Lecture 3, 28 November (Bad organization; Figures; References to figures; Indicative vs Informative (ex. 7)); Lecture 4, December 5 (Introduction; Conclusion; Tense); Lecture 5, December 12 (Abstracts; Titles; Punctuation; Comma; Dashes)
Some notes from the writing course I’m taking.
Lecture 1, 14 November
Resources
Searching phrases/alternatives in quotes in Google Scholar can tell you which one is more frequently used.

BBHash: some ideas
https://curiouscoding.nl/posts/bbhash/
Mon, 04 Sep 2023 00:00:00 +0200
Table of Contents: Possible speedup?
BBHash (Limasset et al. 2017) uses multiple layers to create a minimal perfect hash function (MPHF) that hashes some input set into \([n]\).
(See also my note on PTHash (Pibiri and Trani 2021).)
Simply said, it maps the \(n\) elements into \([\gamma \cdot n]\) using a hash function \(h_0\). The \(k_0\) elements that have collisions are mapped into \([\gamma \cdot k_0]\) using \(h_1\). Then, the \(k_1\) elements with collisions are mapped into \([\gamma \cdot k_1]\) using \(h_2\), and so on.

BitPAl bitpacking algorithm
https://curiouscoding.nl/posts/bitpal/
Sun, 03 Sep 2023 00:00:00 +0200
Table of Contents: Problem; Input; Example; Discussion; Found the bug; Outlook
The supplement (download) of the Loving, Hernandez, and Benson (2014) paper introduces a \(15\)-operation version of the Myers (1999) bitpacking algorithm, which uses \(16\) operations when modified for edit distance.
I tried implementing it, but it seems to have a bug that I will describe below. The fix is here.
Problem
To recap, this algorithm solves the unit-cost edit distance problem by using bitpacking to compute a \(1\times w\) block at a time.

Thoughts on linear programming
https://curiouscoding.nl/posts/linear-programming/
Fri, 04 Nov 2022 00:00:00 +0100
Table of Contents: Linear programming; Assumptions; Idea for an algorithm
This note contains some ideas about linear programming and most-orthogonal faces. They’re mostly on an intuitive level and not very formal.
Postscriptum: The ideas here don’t work.
Linear programming
\begin{equation*} \newcommand{\v}[1]{\textbf{#1}} \newcommand{\x}{\v x} \newcommand{\t}{\v t} \newcommand{\b}{\v b} \end{equation*}
Maximize \(\t\x\) subject to \(A\x \leq \b\).
\(\x\) is a vector of \(n\) variables \(x_i\). \(A\) is an \(m\times n\) matrix: there are \(m\) constraints \(A_j \x \leq b_j\).

A Combinatorial Identity
https://curiouscoding.nl/posts/a-combinatorial-identity/
Sun, 16 Oct 2022 00:00:00 +0200
Some notes regarding the identity
\begin{equation} \sum_{k=0}^n \binom{2k}k \binom{2n-2k}{n-k} = 4^n \end{equation}
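Before looking at the derivations, the identity is easy to sanity-check numerically. The snippet below is a minimal sketch, not part of the original notes; the function name `central_binomial_convolution` is made up for illustration. It evaluates the left-hand side with exact integer arithmetic and prints it next to \(4^n\) for small \(n\).

```python
from math import comb


def central_binomial_convolution(n: int) -> int:
    """Left-hand side of the identity: sum over k of C(2k, k) * C(2n-2k, n-k)."""
    return sum(comb(2 * k, k) * comb(2 * n - 2 * k, n - k) for k in range(n + 1))


# Print the left-hand side next to 4^n for the first few n.
for n in range(6):
    print(n, central_binomial_convolution(n), 4 ** n)
```

A check like this only covers finitely many \(n\), so it is evidence, not a proof.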
Gould has two derivations. The first is from Jensen’s equality, (18) in (Jensen 1902; Shijie 1303).
The second is via the Chu-Vandermonde convolution:
\begin{equation} \sum_{k=0}^n \binom{x}k \binom{y}{n-k} = \binom{x+y}n \end{equation}
using \(x=y=-\frac 12\) and the \(-\frac 12\)-transform:
\begin{equation} \binom{-1/2}{n} = (-1)^n\binom{2n}{n}\frac 1 {2^{2n}} \end{equation}
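Spelled out, the second derivation presumably combines the two formulas as follows. This is a sketch that fills in the intermediate steps and is not taken verbatim from Gould. Setting \(x=y=-\frac 12\) in the Chu-Vandermonde convolution gives

\begin{equation*} \sum_{k=0}^n \binom{-1/2}{k} \binom{-1/2}{n-k} = \binom{-1}{n} = (-1)^n. \end{equation*}

Applying the \(-\frac 12\)-transform to both factors on the left-hand side gives

\begin{equation*} \sum_{k=0}^n (-1)^k \binom{2k}{k} \frac 1{2^{2k}} \cdot (-1)^{n-k} \binom{2n-2k}{n-k} \frac 1{2^{2n-2k}} = \frac{(-1)^n}{4^n} \sum_{k=0}^n \binom{2k}k \binom{2n-2k}{n-k}. \end{equation*}

Equating the two right-hand sides and multiplying by \((-1)^n 4^n\) yields the identity.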
Duarte and de Oliveira (2012) have a combinatorial proof.
References
Duarte, Rui, and António Guedes de Oliveira.

Linear-time suffix array construction
https://curiouscoding.nl/posts/suffix-array-construction/
Thu, 13 Oct 2022 00:00:00 +0200
Table of Contents: Notation; Small and Large suffixes; Building the suffix array from a smaller one; Visualization
These are some notes about linear-time suffix array (SA) construction algorithms (SACAs).
At the bottom you can find a visualization. This page has an interactive demo. History of suffix array construction algorithms:
- 1990, first algorithm: Manber and Myers (1993)
- 2002, small/large suffixes, explained below: Ko and Aluru (2005)
- 2009, recursion only on LMS suffixes: Nong, Zhang, and Chan (2009)
These slides from Stanford are a nice reference for the last algorithm.

Reducing A* memory usage using fronts
https://curiouscoding.nl/posts/astar-memory-usage/
Mon, 26 Sep 2022 00:00:00 +0200
Table of Contents: Motivation; Partitioning A* memory by fronts; Non-consistent heuristics; Front indexing; Tracing back the path
Here is an idea to reduce the memory usage of A* by only storing one front at a time, similar to what Edlib and WFA do. Note that for now this will not work, but I’m putting this online anyway.
Motivation
In our implementation of A*PA, we use a hashmap to store the value of \(g\) for all states visited (explored/expanded) by A*.

Bidirectional A*
https://curiouscoding.nl/posts/bidirectional-astar/
Thu, 28 Jul 2022 17:59:00 +0200
These are some links and papers on bidirectional A* variants. Nothing insightful at the moment.
- small lecture: introduces \(h_f(u) = \frac 12 (\pi_f(u) - \pi_r(u))\). Not found a paper yet.
- An Improved Bidirectional Heuristic Search Algorithm (Champeaux 1977): introduces a bidirectional variant.
- Bidirectional Heuristic Search Again (Champeaux 1983): fixes a bug in the above paper.
- Efficient modified bidirectional A* algorithm for optimal route-finding: Didn’t read closely yet.
- A new bidirectional algorithm for shortest paths (Pijls 2008): Actually a new method.

The BiWFA meeting condition
https://curiouscoding.nl/posts/biwfa-meeting-condition/
Mon, 11 Jul 2022 00:00:00 +0200
Cross references: BiWFA GitHub issue
It seems that getting the meeting/overlap condition of BiWFA (Marco-Sola et al. (2023), Algorithm 1 and Lemma 2.1) correct is tricky.
Let \(p := \max(x, o+e)\) be the maximal cost of any edge in the edit graph. As in the BiWFA paper, let \(s_f\) and \(s_r\) be the distances of the forward and reverse fronts computed so far.
We prove the following lemma:
Lemma
Once BiWFA has expanded the forward and reverse fronts up to \(s_f\) and \(s_r\) and has found some path of cost \(s \leq s_f + s_r\), expanding the fronts until \(s'_f + s'_r \geq s+p+o\) is guaranteed to find a shortest path.

Benchmark attention points
https://curiouscoding.nl/posts/benchmarks/
Thu, 28 Apr 2022 23:33:00 +0200
Benchmarking is harder than you think, even when taking into account this rule.
This post lists some lessons I learned while attempting to run benchmarks for A* pairwise aligner. I was doing this on a laptop, which likely has different characteristics from CPUs in a typical server rack. All the programs I run are single-threaded.
Hardware
Do not run while charging the laptop
Charging makes the battery hot and causes throttling.

Proof sketch for linear time seed heuristic alignment
https://curiouscoding.nl/posts/linear-time-pa/
Sun, 24 Apr 2022 00:00:00 +0200
Table of Contents: Pairwise alignment in subquadratic time; Random model; Algorithm; Seed heuristic; Match pruning; Analysis; Expanded states; Excess errors; Algorithmic complexity
This post is a proof sketch to show that A* with the seed heuristic (Groot Koerkamp and Ivanov 2024) does exact pairwise alignment of random strings with random mutations in near-linear time.
Pairwise alignment in subquadratic time
Backurs and Indyk (2018) show that computing edit distance cannot be done in strongly subquadratic time (i.

AStarix
https://curiouscoding.nl/posts/astarix/
Fri, 12 Nov 2021 13:05:00 +0100
Papers
- AStarix: Fast and Optimal Sequence-to-Graph Alignment
- Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds
AStarix is a method for aligning sequences (reads) to graphs:
Input:
- A reference sequence or graph
- Alignment costs \((\Delta_{match}, \Delta_{subst}, \Delta_{del}, \Delta_{ins})\) for a match, substitution, deletion and insertion
- Sequence(s) to align
Output:
- An optimal alignment of each input sequence
The input is a reference graph (automaton really) \(G_r = (V_r, E_r)\) with edges \(E_r \subseteq V_r\times V_r\times \Sigma\) that indicate the transitions between states.

Neighbour joining
https://curiouscoding.nl/posts/neighbour-joining/
Fri, 12 Nov 2021 11:57:00 +0100
Neighbour joining (NJ, paper) is a phylogeny reconstruction method. It differs from UPGMA in the way it computes the distances between clusters.
This algorithm first assumes that the phylogeny is a star graph. Then it finds the pair of vertices that, when merged and split out, gives the minimal total edge length \(S_{ij}\) of the new almost-star graph. (See eq. (4) and figures 2a and 2b in the paper.)
\[ S_{ij} = \frac1{2(n-2)} \sum_{k\not\in \{i,j\}}(d(i, k)+d(j,k)) + \frac 12 d(i,j)+\frac 1{n-2} \sum_{k<l,\, k, l\not\in\{i,j\}}d(k,l). \]

UPGMA
https://curiouscoding.nl/posts/upgma/
Thu, 28 Oct 2021 11:56:00 +0200
Unweighted pair group method with arithmetic mean (UPGMA) is a phylogeny reconstruction method.
Input: Matrix of pairwise distances
Output: Phylogeny
Algorithm: Repeatedly merge the nearest two clusters. The distance between clusters is the average of all pairwise distances between them. When merging two clusters, the distances of the new cluster are the weighted averages of distances from the two clusters being merged.
Complexity: \(O(n^3)\) naive, \(O(n^2 \ln n)\) using a heap.

Spaced k-mer and assembler methods
https://curiouscoding.nl/posts/spaced-kmer-review/
Wed, 14 Jul 2021 00:00:00 +0200
Table of Contents: Spaced \(k\)-mers; Minimap; SPAdes; MUMmer4; BLASR; Bowtie 2; Patternhunter; Spaced seeds improve \(k\)-mer-based metagenomic classification; LoMeX; Meeting notes
Concepts:
Mapping: Map a sequence onto a reference genome/dataset.
Assembly: Build a genome from a set of reads.
de novo (implied): without using a reference genome. Otherwise just called mapping.
Typical complicating factors:
- read errors
- non-uniform coverage
- insert size variation
- chimeric reads (?)
- bireads
- non-uniform read coverage (as in metagenomics, i.