Ideas on CuriousCoding
https://curiouscoding.nl/categories/ideas/
Recent content in Ideas on CuriousCoding
Thu, 18 Jan 2024 00:00:00 +0100

Mod-minimizers and other minimizers
https://curiouscoding.nl/posts/mod-minimizers/
Thu, 18 Jan 2024 00:00:00 +0100
Table of Contents Applications Background Minimizers Density bounds Robust minimizers PASHA Miniception Closed syncmers Bd-anchors New: Mod-minimizers Experiments Conclusion Small k experiments Search methods Directed minimizer \(k=1\), \(w=2\) \(k=1\), \(w=4\) \(k=1\), \(w=5\) \(k=2\), \(w=2\) \(k=2\), \(w=4\) Notes Reading list
\[ \newcommand{\d}{\mathrm{d}} \newcommand{\L}{\mathcal{L}} \]
This post introduces some background on minimizers and some experiments for a new minimizer variant. That new variant is now called the mod-minimizer and is available as a preprint on bioRxiv (Groot Koerkamp and Pibiri 2024).

Notes on implementing Longest Common Repeat (LCR)
https://curiouscoding.nl/posts/longest-common-repeat/
Wed, 06 Dec 2023 00:00:00 +0100
Table of Contents Notes Coloured Tree Problem Generic sparse suffix array Sparse suffix array on minimizers Discussion / TODOs Evals
These are my running notes on implementing an algorithm for Longest Common Repeat using minimizers.
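Both entries above rely on minimizers. As a rough illustration only, here is a minimal sketch of the classic random minimizer scheme: in every window of \(w\) consecutive \(k\)-mers, select the \(k\)-mer with the smallest hash. This is not the mod-minimizer from the preprint; the function names `kmer_hash` and `minimizers` and the choice of BLAKE2b as the pseudo-random order are assumptions made for this example.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic pseudo-random order on k-mers (assumed hash choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq: str, k: int, w: int) -> list[int]:
    """Return the sorted start positions of the selected k-mers.

    For each window of w consecutive k-mers, the k-mer with the smallest
    hash is selected; ties are broken by taking the leftmost position.
    """
    n_kmers = len(seq) - k + 1
    hashes = [kmer_hash(seq[i:i + k]) for i in range(n_kmers)]
    selected = set()
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        selected.add(min(window, key=lambda i: (hashes[i], i)))
    return sorted(selected)


if __name__ == "__main__":
    seq = "ACGTACGTTGCAACGTAGCT"
    positions = minimizers(seq, k=3, w=4)
    # Density: fraction of k-mers that are selected.
    print(positions, len(positions) / (len(seq) - 3 + 1))
```

By construction every window contains at least one selected position, so the density (the fraction of selected \(k\)-mers) is at least \(1/w\). This naive version rescans each window, which takes \(O(nw)\) time; it is meant to show the definition, not an efficient implementation.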
Notes
Coloured Tree Problem
See Lemma 3 here.
Generic sparse suffix array
paper: https://arxiv.org/pdf/2310.09023.pdf
code: https://github.com/lorrainea/SSA/blob/main/PA/ssa.cc
For random strings and \(b \leq n / \log n\), direct radix sort on \((2\log n + \log\log n)\)-bit prefixes is sufficient for \(O(n)\) runtime.

Research proposal: subquadratic string graph construction
https://curiouscoding.nl/posts/cwi-proposal/
Mon, 10 Jul 2023 00:00:00 +0200
Table of Contents Introduction Research plan Improve query performance using Heavy-Light Decomposition Add more query types Extend to non-exact suffix-prefix-overlap that allows for read errors Implement an algorithm to build string graphs, and possibly a full assembler
This is a research proposal for a 5-month internship at CWI during autumn/winter 2023-2024.
Introduction
An important problem in bioinformatics is genome assembly: DNA sequencing machines read substrings of a full DNA genome, and these pieces must be assembled together to recover the entire genome.

Doctoral plan
https://curiouscoding.nl/posts/research-proposal/
Mon, 12 Dec 2022 00:00:00 +0100
Table of Contents Research Proposal: Near-linear exact pairwise alignment Abstract Introduction and current state of research in the field Goals of the thesis Impact Progress to date Detailed work plan WP1: A*PA v1: initial version WP2: Visualizing aligners WP3: Benchmarking aligners WP4: Theory review WP5: A*PA v2: efficient implementation WP6: Affine costs WP7: Ends-free alignment and mapping WP8: Further extension and open ended research WP9: Thesis writing Publication plan Time schedule Teaching responsibilities Other duties Study plan Signatures
Research Proposal: Near-linear exact pairwise alignment
Abstract
Pairwise alignment, and edit distance specifically, is a problem that was first stated around 1968 (Needleman and Wunsch 1970; Vintsyuk 1968).

String algorithm visualizations
https://curiouscoding.nl/posts/alg-viz/
Tue, 08 Nov 2022 00:00:00 +0100
Select the algorithm to visualize. Click the buttons, or click the canvas and use the indicated keys. Suffix-array construction is explained here and BWT is explained here.
Source code is on GitHub.
Algorithm: Suffix Array Construction, Burrows-Wheeler Transform, Bidirectional BWT
Inputs: String, Query, Delay (s)
Keys: prev (←/backspace), next (→/space), faster (↑/+/f), slower (↓/-/s), pause/play (p/return)

Thoughts on linear programming
https://curiouscoding.nl/posts/linear-programming/
Fri, 04 Nov 2022 00:00:00 +0100
Table of Contents Linear programming Assumptions Idea for an algorithm
This note contains some ideas about linear programming and most-orthogonal faces. They’re mostly on an intuitive level and not very formal.
Postscriptum: The ideas here don’t work.
Linear programming
\begin{equation*} \newcommand{\v}[1]{\textbf{#1}} \newcommand{\x}{\v x} \newcommand{\t}{\v t} \newcommand{\b}{\v b} \end{equation*}
Maximize \(\t\x\) subject to \(A\x \leq \b\).
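To make this problem statement concrete, here is a small sketch that solves a two-variable instance by brute force: it intersects every pair of constraint boundaries and keeps the best feasible intersection point. It assumes the feasible region is bounded, so that the maximum is attained at a vertex. The function name `solve_lp_2d` is made up for this example, and this is not the algorithm idea explored in the post.

```python
from itertools import combinations


def solve_lp_2d(t, A, b, eps=1e-9):
    """Maximize t.x subject to A x <= b for x in R^2 by vertex enumeration.

    Returns (best value, best point), or None if no feasible vertex exists.
    Assumes the optimum is attained at a vertex (bounded feasible region).
    """
    best = None
    for (a1, b1), (a2, b2) in combinations(zip(A, b), 2):
        # Solve a1.x = b1 and a2.x = b2 with Cramer's rule.
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if abs(det) < eps:
            continue  # Parallel boundaries: no unique intersection.
        x = ((b1 * a2[1] - b2 * a1[1]) / det,
             (a1[0] * b2 - a2[0] * b1) / det)
        # Keep the point only if it satisfies all m constraints.
        if all(r[0] * x[0] + r[1] * x[1] <= bj + eps for r, bj in zip(A, b)):
            value = t[0] * x[0] + t[1] * x[1]
            if best is None or value > best[0]:
                best = (value, x)
    return best


if __name__ == "__main__":
    # Maximize x + y subject to x <= 2, y <= 3, x >= 0, y >= 0.
    A = [(1, 0), (0, 1), (-1, 0), (0, -1)]
    b = [2, 3, 0, 0]
    print(solve_lp_2d((1, 1), A, b))
```

Enumerating all pairs costs \(O(m^3)\) time for \(m\) constraints in two dimensions, so this only serves to illustrate the objective and the constraints.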
\(\x\) is a vector of \(n\) variables \(x_i\). \(A\) is an \(m\times n\) matrix: there are \(m\) constraints \(A_j \x \leq b_j\).

Local Doubling
https://curiouscoding.nl/posts/local-doubling/
Wed, 19 Oct 2022 00:00:00 +0200
Table of Contents Notation Needleman-Wunsch: where it all begins Dijkstra/BFS: visiting fewer states Band doubling: Dijkstra, but more efficient GapCost: A first heuristic Computational volumes: an even smaller search Cheating: an oracle gave us \(g^*\) A*: Better heuristics Broken idea: A* and computational volumes Local doubling Without heuristic With heuristic Diagonal Transition A* with Diagonal Transition and pruning: doing less work Goal: Diagonal Transition + pruning + local doubling Pruning: Improving A* heuristics on the go Cheating more: an oracle gave us the optimal path TODO: aspiration windows
\begin{equation*} \newcommand{\st}[2]{\langle #1,#2\rangle} \newcommand{\g}{g^*} \newcommand{\fm}{f_{max}} \newcommand{\gap}{\operatorname{Gap}} \end{equation*}

Reducing A* memory usage using fronts
https://curiouscoding.nl/posts/astar-memory-usage/
Mon, 26 Sep 2022 00:00:00 +0200
Table of Contents Motivation Partitioning A* memory by fronts Non-consistent heuristics Front indexing Tracing back the path
Here is an idea to reduce the memory usage of A* by only storing one front at a time, similar to what Edlib and WFA do. Note that for now this will not work, but I’m putting this online anyway.
Motivation
In our implementation of A*PA, we use a hashmap to store the value of \(g\) for all states visited (explored/expanded) by A*.

Speeding up A*: computational volumes and path-pruning
https://curiouscoding.nl/posts/speeding-up-astar/
Fri, 23 Sep 2022 00:00:00 +0200
Table of Contents Motivation Summary Why is A* slow? Computational volumes Dealing with pruning Thoughts on more aggressive pruning Algorithm summary Challenges Results What about band-doubling? Maybe doubling can work after all? TODOs Extensions
This post builds on our recent preprint (Groot Koerkamp and Ivanov 2024) and gives an overview of some of my new ideas to significantly speed up exact global pairwise alignment. It’s recommended that you understand the seed heuristic and match pruning before reading this post.

Linear memory WFA?
https://curiouscoding.nl/posts/linear-memory-wfa/
Wed, 17 Aug 2022 00:00:00 +0200
Table of Contents Motivation Path traceback: two strategies Observations What information is needed for path tracing A pragmatic solution Another interpretation Affine costs Conclusion
Figure 1: Only the red substitutions and blue indel need to be stored to trace the entire path.
In this post I’ll discuss an idea to run WFA using less memory, while still allowing us to trace the optimal path from the target state back to the start of the search.

Transforming match bonus into cost
https://curiouscoding.nl/posts/alignment-scores-transform/
Tue, 16 Aug 2022 00:00:00 +0200
Table of Contents Tricks with match bonus or how to fool Dijkstra’s limitations Edit graph Algorithms Potentials Multiple variants Some notes on algorithms WFA A* Extending to different cost models Affine costs Substitution matrices But not local alignment Evaluations Unequal string length Equal string lengths Conclusion
Tricks with match bonus or how to fool Dijkstra’s limitations
The reader is assumed to have basic knowledge about pairwise alignment and graph theory.

Diamond optimisation for diagonal transition
https://curiouscoding.nl/posts/diamond-optimization/
Mon, 01 Aug 2022 00:00:00 +0200
Table of Contents Diamond transition or how technicalities can break concepts But let’s take a closer look Conclusion
Diamond transition or how technicalities can break concepts
We assume the reader has some basic knowledge about pairwise alignment and in particular the WFA algorithm.
In this post we dive into a potential 2x speedup of WFA — one that turns out not to work.
Let’s take a look at one of the most important and efficient algorithms for pairwise alignment: WFA (Marco-Sola et al.

The BiWFA meeting condition
https://curiouscoding.nl/posts/biwfa-meeting-condition/
Mon, 11 Jul 2022 00:00:00 +0200
cross references: BiWFA GitHub issue
It seems that getting the meeting/overlap condition of BiWFA (Marco-Sola et al. (2023), Algorithm 1 and Lemma 2.1) correct is tricky.
Let \(p := \max(x, o+e)\) be the maximal cost of any edge in the edit graph. As in the BiWFA paper, let \(s_f\) and \(s_r\) be the distances of the forward and reverse fronts computed so far.
We prove the following lemma:
Lemma Once BiWFA has expanded the forward and reverse fronts up to \(s_f\) and \(s_r\) and has found some path of cost \(s \leq s_f + s_r\), expanding the fronts until \(s'_f + s'_r \geq s+p+o\) is guaranteed to find a shortest path.

Proof sketch for linear time seed heuristic alignment
https://curiouscoding.nl/posts/linear-time-pa/
Sun, 24 Apr 2022 00:00:00 +0200
Table of Contents Pairwise alignment in subquadratic time Random model Algorithm Seed heuristic Match pruning Analysis Expanded states Excess errors Algorithmic complexity
This post is a proof sketch to show that A* with the seed heuristic (Groot Koerkamp and Ivanov 2024) does exact pairwise alignment of random strings with random mutations in near linear time.
Pairwise alignment in subquadratic time
Backurs and Indyk (2018) show that computing edit distance cannot be done in strongly subquadratic time (i.

Variations on the WFA recursion
https://curiouscoding.nl/posts/wfa-variations/
Sun, 17 Apr 2022 03:14:00 +0200
Table of Contents Gap open Gap close Symmetric alternatives Another symmetry Conclusions
cross references: BiWFA GitHub issue
In this post I will explore some variations of the recursion used by WFA/BiWFA for the affine version of the diagonal transition algorithm. In particular, we will go over a gap-close variant, and look into some more symmetric formulations.
Gap open
WFA (Marco-Sola et al. 2020) introduces the affine cost variant of the classic diagonal transition method.

Pruning for A* heuristics
https://curiouscoding.nl/posts/pruning/
Sat, 11 Dec 2021 00:00:00 +0100
Note: this post extends the concept of multiple-path pruning presented in Poole and Mackworth (2017).
Say we’re running A* in a graph from \(s\) to \(t\). \(d(s,t)\) is the distance we are looking for.
An A* heuristic has to satisfy \(h(u) \leq d(u, t)\) to be admissible: the estimated distance to the end should never be larger than the actual distance to guarantee that the algorithm finds a shortest path.

Spaced K-mer Seeded Distance
https://curiouscoding.nl/posts/spaced-kmer-distance/
Wed, 20 Oct 2021 00:00:00 +0200
Table of Contents Background $k$-mers Sketching MinHash Terminology Introduction Spaced $k$-mer Seeded Distance Improving performance Analysis Pruning false positive candidate matches Phylogeny reconstruction Running the algorithm TODO Assembly
\[ \newcommand{\vp}{\varphi} \newcommand{\A}{\mathcal A} \newcommand{\O}{\mathcal O} \newcommand{\N}{\mathbb N} \newcommand{\ed}{\mathrm{ed}} \newcommand{\mh}{\mathrm{mh}} \newcommand{\hash}{\mathrm{hash}} \]
Background
Quickly finding similar pieces of DNA within large datasets is at the core of computational biology. This has many applications:
Alignment: Given two pieces of related DNA, align them to find where mutations (i.

Ideas for assembling [long] reads
https://curiouscoding.nl/posts/thoughts-on-assembling/
Fri, 09 Jul 2021 00:00:00 +0200
\[ \newcommand{\vp}{\varphi} \newcommand{\A}{\mathcal A} \newcommand{\O}{\mathcal O} \newcommand{\N}{\mathbb N} \newcommand{\Z}{\mathbb Z} \newcommand{\ed}{\mathrm{ed}} \newcommand{\mh}{\mathrm{mh}} \newcommand{\hash}{\mathrm{hash}} \]
Here is an idea for an algorithm to assemble long reads.
Go over all sequences and sketch their windows using the Hamming distance preserving sketch method described here. This method may need some tweaking to also work with an indel rate of around 10%.
Let’s say we find a pair of matching windows between reads \(A\) and \(B\) starting at positions \(i\) and \(j\).

Hamming Similarity Search
https://curiouscoding.nl/posts/hamming-similarity-search/
Thu, 08 Jul 2021 00:00:00 +0200
Table of Contents Background $k$-mers Sketching MinHash Introduction Hamming Similarity Search Improving performance Analysis Pruning false positive candidate matches Phylogeny reconstruction Running the algorithm Assembly
\[ \newcommand{\vp}{\varphi} \newcommand{\A}{\mathcal A} \newcommand{\O}{\mathcal O} \newcommand{\N}{\mathbb N} \newcommand{\ed}{\mathrm{ed}} \newcommand{\mh}{\mathrm{mh}} \newcommand{\hash}{\mathrm{hash}} \]
Background
Quickly finding similar pieces of DNA within large datasets is at the core of computational biology. This has many applications:
Alignment: Given two pieces of related DNA, align them to find where mutations (i.