home on CuriousCoding
https://curiouscoding.nl/
Recent content in home on CuriousCoding
Sat, 05 Oct 2024 00:00:00 +0200

A lemma on suffix array searching
https://curiouscoding.nl/posts/suffix-array-searching-lemma/
Sat, 05 Oct 2024 00:00:00 +0200
Table of Contents 1 Suffix arrays 2 Searching methods 2.1 Naive \(O(|P|\cdot \lg_2 n)\) search 2.2 Faster \(O(|P|\cdot \lg_2 n)\) search 2.3 LCP-based \(O(|P| + \lg_2 n)\) search 3 Analysing the faster search
We’ll prove that the “faster” binary search algorithm (see 2.2), which tracks the LCP with the left and right boundary of the remaining search interval, has amortized runtime
\[ O\Big(\lg_2(n) + |P| + |P| \cdot \lg_2(Occ(P))\Big), \] when \(P\) is a randomly sampled fixed-length pattern from the text and \(Occ(P)\) counts the number of occurrences of \(P\) in the text.

FM-index implementations
https://curiouscoding.nl/posts/fm-index-implementations/
Wed, 02 Oct 2024 00:00:00 +0200
Here I’ll briefly list some FM-index and related implementations around the web. Implementations seem relatively inconsistent, mostly because the FM-index is more of a ‘wrapper’ type around a given Burrows-Wheeler transform and an occurrences list. Both can be implemented in various ways. In particular, occurrences should be stored using a wavelet tree for optimal compression.
The nucleic-acid repo contains a completely unoptimised version. The Rust-bio crate contains a generic FM-index. It stores a sampled occurrences array, so that space is relatively small but lookups take \(O(k)\) time for sampling factor \(k\).[WIP] Progress on fast suffix array searchinghttps://curiouscoding.nl/posts/suffix-array-searching-log/Tue, 01 Oct 2024 00:00:00 +0000https://curiouscoding.nl/posts/suffix-array-searching-log/Here’s a lablog.
Background Compare with suffix arrays with a twist: https://www.cai.sk/ojs/index.php/cai/article/view/2019_3_555 Compare with https://github.com/mranisz/sa, which is based on Compact and hash based variants of the suffix array https://journals.pan.pl/dlibra/publication/121376/edition/105762/content Here’s a bike
A figure of a bike.
Binary searching Eytzinger Btrees MultithreadingPractical selection and sampling schemeshttps://curiouscoding.nl/posts/practical-selection-and-sampling/Thu, 12 Sep 2024 00:00:00 +0200https://curiouscoding.nl/posts/practical-selection-and-sampling/Table of Contents 1 Sampling schemes 1.1 Definitions and background 1.2 Mod-minimizer 1.3 Forward scheme lower bound 1.4 Open syncmer minimizer 1.5 Open-closed syncmer minimizer 1.6 New: Open-closed mod-minimizer 1.7 The $t$-gap 2 Selection schemes 2.1 Definition 2.2 Bd-anchors 2.3 New: Smallest unique substring anchors 2.4 New: Scrambled sort 2.5 TODO Scrambled sus-anchor density 3 Open questions This post introduces some new practical sampling schemes. It builds on:
The post and paper (Groot Koerkamp and Pibiri 2024) introducing the mod-minimizer.

Calling Rust from Python
https://curiouscoding.nl/posts/calling-rust-from-python/
Tue, 10 Sep 2024 00:00:00 +0200
Table of Contents 1 Steps 1.1 Using kwargs 2 TODOs
Using PyO3 and maturin, it’s very easy to call Rust code from Python. I’m mostly following the guide at pyo3.rs, but leaving out some things related to Python environments.
1 Steps Install maturin. I use the Arch package but you can also do a pip install in the environment below.
Make sure you have a lib target, and add cdylib as a crate-type.AI reading listhttps://curiouscoding.nl/posts/ai-reading-list/Mon, 09 Sep 2024 00:00:00 +0200https://curiouscoding.nl/posts/ai-reading-list/ https://transformer-circuits.pub/2021/framework/index.html https://infini-gram.io/ https://sleepinyourhat.github.io/checklist/[WIP] Faster binary searchhttps://curiouscoding.nl/posts/fast-binary-search/Sun, 08 Sep 2024 00:00:00 +0200https://curiouscoding.nl/posts/fast-binary-search/Table of Contents 1 High level ideas 1.1 Resources 1.2 Code 2 To measure 3 TODO Memory efficiency 3.1 B-tree 1 High level ideas Prefix table: for each 20-bit prefix, store the corresponding range of the array. Interpolation: Make one or more interpolation steps. Could store max resulting error. Drawback: can cause an unpredictable number of resulting iterations. Batching: process multiple (8-32) queries at the same time, hiding memory latency Query bucketing: given >>1M of queries, partition them into 1M buckets and answer bucket by bucket.PACE 24https://curiouscoding.nl/posts/pace24/Thu, 05 Sep 2024 00:00:00 +0200https://curiouscoding.nl/posts/pace24/Table of Contents 1 General observations 2 Heuristic track 3 Parameterized track 4 Exact track In this post I will collect some high level ideas and approaches used to solve the PACE 2024 challenge. Very briefly, the goal is to write fast solvers for NP-hard problems. The problem for the 2024 edition is one-side crossing minimization: Given is a bipartite graph \((A, B)\) that is drawn in standard way with the nodes of both \(A\) and \(B\) on a line, where the order of the nodes of \(A\) is fixed.[WIP] Feynman problemshttps://curiouscoding.nl/posts/feynman-problems/Mon, 12 Aug 2024 00:00:00 +0200https://curiouscoding.nl/posts/feynman-problems/ Table of Contents 1 Space dust 1 Space dust What is the total mass of space dust hitting the earth during the Perseids meteor shower?
ReferencesTitlehttps://curiouscoding.nl/slides/sample/Fri, 09 Aug 2024 00:00:00 +0200https://curiouscoding.nl/slides/sample/ Table of Contents 1 First part 1.1 subsection 2 Second part 1 First part hello
list 1 more list asdf
first second col col val val 1.1 subsection centered
col 1
tab more 2 1 col 2
tab more 2 1 2 Second partComputing random minimizers, fasthttps://curiouscoding.nl/posts/fast-minimizers/Fri, 12 Jul 2024 00:00:00 +0200https://curiouscoding.nl/posts/fast-minimizers/Table of Contents 1 Introduction 1.1 Results 2 Random minimizers 3 Algorithms 3.1 Problem statement Problem A: Only the set of minimizers Problem B: The minimizer of each window Problem C: Super-k-mers Which problem to solve Canonical k-mers 3.2 The naive algorithm Performance characteristics 3.3 Rephrasing as sliding window minimum 3.4 The queue Performance characteristics 3.5 Jumping: Away with the queue Performance characteristics 3.6 Re-scan Performance characteristics 3.7 Split windows Performance characteristics 4 Analysing what we have so far 4.Links for talks & postershttps://curiouscoding.nl/posts/links/Mon, 01 Jul 2024 00:00:00 +0200https://curiouscoding.nl/posts/links/Table of Contents A*PA PA-Bench Mod-minimizers There are links for my talks and posters.
A*PA A*PA (bioinformatic) Groot Koerkamp, Ragnar, and Pesho Ivanov. 2024. “Exact Global Alignment Using A* with Chaining Seed Heuristic and Match Pruning.” Edited by Tobias Marschall. Bioinformatics 40 (3). https://doi.org/10.1093/bioinformatics/btae032. A*PA2 (WABI24) Groot Koerkamp, Ragnar. 2024. “A*PA2: Up to 19× Faster Exact Global Alignment.” In 24th International Workshop on Algorithms in Bioinformatics (Wabi 2024), edited by Solon P.A near-tight lower bound on minimizer densityhttps://curiouscoding.nl/posts/minimizer-lower-bound/Tue, 25 Jun 2024 00:00:00 +0200https://curiouscoding.nl/posts/minimizer-lower-bound/Table of Contents Succinct background Definitions Lower bounds A new lower bound Discussion Post scriptum Acknowledgement The results of this post are now available in a pre-print: DOI, PDF:
Kille, Bryce, Ragnar Groot Koerkamp, Drake McAdams, Alan Liu, and Todd Treangen. 2024. “A near-Tight Lower Bound on the Density of Forward Sampling Schemes.” Biorxiv. https://doi.org/10.1101/2024.09.06.611668.
In this post I will prove a new lower bound on the density of any minimizer or forward sampling scheme: \[ d(f) \geq \frac{\lceil\frac{w+k}{w}\rceil}{w+k} = \frac{\lceil\frac{\ell+1}{w}\rceil}{\ell+1}. \]

[WIP] High throughput searching - Part 1
https://curiouscoding.nl/posts/high-throughput-searching-1/
Sun, 16 Jun 2024 00:00:00 +0200
Table of Contents Hardware Details of caches and memory Latency, bandwidth, and throughput Measuring latency Pointer chasing Bounds checking Padding elements Raw pointers Aligned memory & Hugepages Summary TODO Memory bandwidth TODO High throughput random access NOTES TODO
This (planned) series of posts aims to write a high performance search algorithm for suffix arrays. We will start with a classic binary search implementation and make incremental improvements to it.

Tools for suffix array searching
https://curiouscoding.nl/posts/suffix-array-searching/
Fri, 14 Jun 2024 00:00:00 +0200
Table of Contents 1 Sapling 2 PLA-Index 3 LISA: learned index
Let’s summarize some tools for efficiently searching suffix arrays.
1 Sapling Sapling (Kirsche, Das, and Schatz 2020) works as follows:
Choose a parameter \(p\) and store, for each of the \(2^p\) \(p\)-bit prefixes, the corresponding position in the suffix array.
When querying, first find the bucket for the query prefix.
Then do a linear interpolation inside the bucket.
Search the area \([-E, +E]\) around the interpolated position, where \(E\) is a bound on the error of the linear approximation.

Crates for suffix array construction
https://curiouscoding.nl/posts/suffix-array-crates/
Thu, 13 Jun 2024 00:00:00 +0200
Popular C libraries are:
divsufsort libsais Both have a ..64 variant that supports input strings longer than 2GB.
Rust wrappers:
divsufsort: Rust reimplementation, does not support large inputs.
cdivsufsort: C wrapper, does not support large inputs.
libdivsufsort-rs: C wrapper, does support large inputs.
sais: unrelated to the original library; does not implement a linear time algorithm anyway.
libsais-rs: Daniel Liu’s fork-of-fork of the original, but not on crates.io. Supports multithreading using OpenMP and wraps both the original and 64bit version.

Thoughts on POASTA
https://curiouscoding.nl/posts/poasta/
Tue, 28 May 2024 00:00:00 +0200
Table of Contents Summary Background Review comments DFS Supplementary methods Details of pruning Evals Discussion Code & repo
Here are some thoughts on POASTA (van Dijk et al. 2024), a recent affine-cost sequence-to-DAG (POA) aligner inspired by WFA and using A*.
Summary Take a query and a directed acyclic graph (DAG). Align the query to the full DAG. It’s like global alignment for graphs. In fact I think the graph doesn’t actually have to be acyclic, as long as it has a start and end.A*PA2: Up to 19x faster exact global alignmenthttps://curiouscoding.nl/posts/astarpa2/Sat, 23 Mar 2024 00:00:00 +0100https://curiouscoding.nl/posts/astarpa2/Table of Contents Abstract 1 Introduction 1.1 Contributions 1.2 Previous work 1.2.1 Needleman-Wunsch 1.2.2 Graph algorithms 1.2.3 Computational volumes 1.2.4 Parallelism 1.2.5 Tools 2 Preliminaries 3 Methods 3.1 Band-doubling 3.2 Blocks 3.3 Memory 3.4 SIMD 3.5 SIMD-friendly sequence profile 3.6 Traceback 3.7 A* 3.7.1 Bulk-contours update 3.7.2 Pre-pruning 3.8 Determining the rows to compute 3.8.1 Sparse heuristic invocation 3.9 Incremental doubling 4 Results 4.1 Setup 4.2 Comparison with other aligners 4.Review of refined minimizeshttps://curiouscoding.nl/posts/refined-minimizer/Fri, 26 Jan 2024 00:00:00 +0100https://curiouscoding.nl/posts/refined-minimizer/Table of Contents Summary Main issues 1. Introduction 2. Methods 2.3 heuristic 3. Results Discussion Code These are my review-like notes on refined minimizers, introduced in Pan and Reinert (2024).
Summary The paper introduces refined minimizers, a new scheme for sampling canonical minimizers that is less biased than the usual scheme.
Instead of taking the minimum of the minimizer of the forward and reverse strand, the minimizer of the strand with the higher GT density is chosen.Mod-minimizers and other minimizershttps://curiouscoding.nl/posts/mod-minimizers/Thu, 18 Jan 2024 00:00:00 +0100https://curiouscoding.nl/posts/mod-minimizers/Table of Contents Applications Background Minimizers Density bounds Robust minimizers PASHA Miniception Closed syncmers Bd-anchors New: Mod-minimizers Experiments Conclusion Small k experiments Search methods Directed minimizer \(k=1\), \(w=2\) \(k=1\), \(w=4\) \(k=1\), \(w=5\) \(k=2\), \(w=2\) \(k=2\), \(w=4\) Notes Reading list \[ \newcommand{\d}{\mathrm{d}} \newcommand{\L}{\mathcal{L}} \]
This post introduces some background for minimizers and some experiments for a new minimizer variant. That new variant is now called the mod-minimizer and published at WABI24 (Groot Koerkamp and Pibiri 2024).Intro to Rusthttps://curiouscoding.nl/posts/intro-to-rust/Tue, 16 Jan 2024 00:00:00 +0100https://curiouscoding.nl/posts/intro-to-rust/Table of Contents Overview Rust features Basics Basic syntax Expressions everywhere! Closures Pattern matching References Ownership Containers Traits Iterators Common libraries Ecosystem Useful links Hands-on Installation Create a project Hello, world! Small project ideas These are notes for a quick introduction to Rust.
Overview
Statically typed & compiled language.
Great developer experience: cargo build system, rust-analyzer LSP.

Rust features
Basics

C++                    Rust
std::size_t            usize
std::ptrdiff_t         isize
int                    i32
unsigned int           u32
long long              i64
unsigned long long     u64
string                 String
string_view            &str
byte                   u8
char                   char
vector<T>              Vec<T>
array<int, 4>          [u32; 4]
int[]                  &[u32]
T                      T
const T&               &T
T&                     &mut T
T*                     unsafe { T* }
unique_ptr<T>          Box<T>
optional<T>            Option<T>
variant<T, E>          Result<T, E>

C++                              Rust
for(int i = 0; i < n; ++i) {}    for i in 0.

Notes on bidirectional anchors
https://curiouscoding.nl/posts/bd-anchors/
Mon, 15 Jan 2024 00:00:00 +0100
Table of Contents Paper overview Remarks on the paper Thoughts
\[ \newcommand{\A}{\mathcal{A}_\ell} \newcommand{\T}{\mathcal{T}_\ell} \]
These are some notes on Bidirectional String Anchors (Loukides, Pissis, and Sweering 2023), also called bd-anchors.
Resources:
Loukides and Pissis (2021): preceding conference paper with a subset of the content.
Loukides, Pissis, and Sweering (2023): the paper discussed here.
Ayad, Loukides, and Pissis (2023): follow-up/second paper containing a faster average-case \(O(n)\) construction algorithm and a more memory-efficient construction algorithm for the index.

Notes on SsHash
https://curiouscoding.nl/posts/sshash/
Mon, 15 Jan 2024 00:00:00 +0100
Table of Contents Paper summary Intro Prelims Related work Sparse and skew hashing Remarks Ideas
\[\newcommand{\S}{\mathcal{S}}\]
Paper summary Intro SsHash (Pibiri 2022) is a datastructure for indexing kmers. Given a set of kmers \(\S\), it supports two operations:
\(Lookup(g)\) returns the unique id \(i\in [|\S|]\) of the kmer \(g\).
\(Access(i)\) returns the kmer corresponding to id \(i\).
It also supports streaming queries, looking up all kmers from a longer string consecutively, by exploiting the overlap between them.

One Billion Row Challenge
https://curiouscoding.nl/posts/1brc/
Wed, 03 Jan 2024 00:00:00 +0100
Table of Contents External links The problem Initial solution: 105s First flamegraph Bytes instead of strings: 72s Manual parsing: 61s Inline hash keys: 50s Faster hash function: 41s A new flame graph Perf it is Something simple: allocating the right size: 41s memchr for scanning: 47s memchr crate: 29s get_unchecked: 28s Manual SIMD: 29s Profiling Revisiting the key function: 23s PtrHash perfect hash function: 17s Larger masks: 15s Reduce pattern matching: 14s Memory map: 12s Parallelization: 2.

Perfect NtHash for Robust Minimizers
https://curiouscoding.nl/posts/nthash/
Sun, 31 Dec 2023 00:00:00 +0100
Table of Contents NtHash Minimizers Robust minimizers Is NtHash injective on kmers? Searching for a collision Proving perfection Alternatives SmHasher results TODO benchmark NtHash, NtHash2, FxHash
NtHash
NtHash (Mohamadi et al. 2016) is a rolling hash suitable for hashing any kind of text, but made for DNA originally. For a string of length \(k\) it is a \(64\) bit value computed as:
\begin{equation} h(x) = \bigoplus_{i=0}^{k-1} rot^i(h(x_i)) \end{equation}
where \(h(x_i)\) assigns a fixed \(64\) bit random value to each character, \(rot^i\) rotates the bits \(i\) places, and \(\bigoplus\) is the xor over all terms.A*PA talk @ CWIhttps://curiouscoding.nl/posts/astarpa-talk-cwi/Wed, 27 Dec 2023 00:00:00 +0100https://curiouscoding.nl/posts/astarpa-talk-cwi/I recently gave a talk about A*PA at CWI. Sadly the recording doesn’t show the blackboard, but either way, find it here.Notes on implementing Longest Common Repeat (LCR)https://curiouscoding.nl/posts/longest-common-repeat/Wed, 06 Dec 2023 00:00:00 +0100https://curiouscoding.nl/posts/longest-common-repeat/Table of Contents Notes Coloured Tree Problem Generic sparse suffix array Sparse suffix array on minimizers Discussion / TODOs Evals These are my running notes on implementing an algorithm for Longest Common Repeat using minimizers.
Notes
Coloured Tree Problem
See Lemma 3 here.
Generic sparse suffix array
paper: https://arxiv.org/pdf/2310.09023.pdf
code: https://github.com/lorrainea/SSA/blob/main/PA/ssa.cc
For random strings and \(b \leq n / \log n\), direct radix sort on \((2\log n + \log \log n)\)-bit prefixes is sufficient for \(O(n)\) runtime.

ALPACA/PANGAIA winter workshop notes
https://curiouscoding.nl/posts/winter-workshop-2023/
Mon, 20 Nov 2023 00:00:00 +0100
Table of Contents Monday Fimpera: bloom filter for kmers Progress of tools Order-preserving MPHF of minimizers Algorithmic bottlenecks in SSHash Fourier transform of the human genome? Tuesday Variant types Wednesday SSHash PTHash de Bruijn Graphs
These are notes of discussions at the ALPACA/PANGAIA conference in November 2023.
Monday I had interesting discussions with Giulio, Paul, and Lucas Robidou.
Fimpera: bloom filter for kmers Idea: instead of storing $k$mers in bloom filter, store all constituent $s$mers (\(s<k\)).Notes on writing coursehttps://curiouscoding.nl/posts/writing-course/Tue, 14 Nov 2023 00:00:00 +0100https://curiouscoding.nl/posts/writing-course/Table of Contents Lecture 1, 14 November Resources Reader friendlyness Typical problems Lecture 2, 21 November Paragraph level expectations Flow Assignment for next week Lecture 3, 28 November Bad organization Figures References to figures Indicative vs Informative (ex. 7) Lecture 4, December 5 Introduction Conclusion Tense Lecture 5, December 12 Abstracts Titles Punctuation Comma Dashes Some notes from the writing course I’m taking.
Lecture 1, 14 November Resources Searching phrases/alternatives in quotes in Google Scholar can tell which one is more frequently used.[WIP] PTRhash: Improving the PTHash Minimal Perfect Hash Functionhttps://curiouscoding.nl/posts/ptrhash-paper/Mon, 23 Oct 2023 00:00:00 +0200https://curiouscoding.nl/posts/ptrhash-paper/Table of Contents Abstract Abstract Motivation: Given a set \(S\) of \(n\) objects, a minimal perfect hash function (MPHF) is a collision-free bijective map \(f\) from the elements of \(S\) to \(\{0, \dots, n-1\}\). These functions have uses in databases, search engines, and are used in bioinformatics indexing tools such as Pufferfish (using BBHash), SSHash, and Piscem (both using PTHash). This work presents an MPHF that prioritizes query throughput and can be constructed efficiently for billions or more elements using \(2\) to \(4\) bits of memory per key.BAPCtools instructionhttps://curiouscoding.nl/posts/bapctools-demo/Tue, 17 Oct 2023 00:00:00 +0200https://curiouscoding.nl/posts/bapctools-demo/Steps:
Clone https://github.com/RagnarGrootKoerkamp/BAPCtools
Make an alias to the executable:
    ln -s ~/git/BAPCtools/bin/tools.py ~/bin/bt
Create a new problem:
    cd ~/problems
    bt new_problem my_problem_name
    cd ~/problems/my_problem_name
You now have the following:
    .
    ├── data
    │   ├── sample
    │   │   └── 1.in # Sample testcase input
    │   │   └── 1.

PTRHash: Notes on adapting PTHash in Rust
https://curiouscoding.nl/posts/ptrhash/
Thu, 21 Sep 2023 00:00:00 +0200
Table of Contents Questions and remarks on PTHash paper Ideas for improvement Parameters Align packed vectors to cachelines Prefetching Faster modulo operations Store dictionary \(D\) sorted using Elias-Fano coding How many bits of \(n\) and hash entropy do we need? Ideas for faster construction Implementation log Hashing function Bitpacking crates Construction Fastmod TODO Try out fastdivide and reciprocal crates First benchmark Faster bucket computation Branchless, for real now! (aka the trick-of-thirds) Compiling and benchmarking PTHash Compact encoding Find the \(x\) differences FastReduce revisited TODO Is there a problem if \(\gcd(m, n)\) is large?

BBHash: some ideas
https://curiouscoding.nl/posts/bbhash/
Mon, 04 Sep 2023 00:00:00 +0200
Table of Contents Possible speedup?
BBHash (Limasset et al. 2017) uses multiple layers to create a minimal perfect hash function (MPHF) that hashes some input set into \([n]\).
(See also my note on PTHash (Pibiri and Trani 2021).)
Simply said, it maps the \(n\) elements into \([\gamma \cdot n]\) using hashing function \(h_0\). The \(k_0\) elements that have collisions are mapped into \([\gamma \cdot k_0]\) using \(h_1\). Then, the \(k_1\) elements with collisions are mapped into \([\gamma \cdot k_1]\), and so on.BitPAl bitpacking algorithmhttps://curiouscoding.nl/posts/bitpal/Sun, 03 Sep 2023 00:00:00 +0200https://curiouscoding.nl/posts/bitpal/Table of Contents Problem Input Example Discussion Found the bug Outlook The supplement (download) of the Loving, Hernandez, and Benson (2014) paper introduces a \(15\) operation version of Myers (1999) bitpacking algorithm, which uses \(16\) operations when modified for edit distance.
I tried implementing it, but it seems to have a bug that I will describe below. The fix is here.
Problem To recap, this algorithm solves the unit-cost edit distance problem by using bitpacking to compute a \(1\times w\) at a time.[WIP] Bitpacking and string searchinghttps://curiouscoding.nl/posts/bitpacking/Fri, 11 Aug 2023 00:00:00 +0200https://curiouscoding.nl/posts/bitpacking/Table of Contents Intro Review papers DP methods LCS Allison and Dix (1986) Crochemore et al. (2001) Hyyrö (2004) Hyyrö (2017) Edit distance Wright (1994) Myers (1999) Hyyrö (2001) Hyyrö (2003) Hyyrö, Fredriksson, and Navarro (2004) Hyyrö, Fredriksson, and Navarro (2005) Hyyrö and Navarro (2002) Hyyrö and Navarro (2006) Indel distance Lipton and Lopresti (1985) Hyyrö, Pinzon, and Shinohara (2005a) Hyyrö, Pinzon, and Shinohara (2005b) Automata methods Hamming distance Landau and Vishkin (1986) Baeza-Yates and Gonnet (1992) TODO Baeza-Yates and Gonnet (1994) Edit distance Ukkonen (1985b) Landau and Vishkin (1985) Landau and Vishkin (1988) Wu and Manber (1992) Bergeron and Hamel (2002) Suffix array methods Hamming distance Galil and Giancarlo (1986) Grossi and Luccio (1989) Edit distance Landau and Vishkin (1989) Galil and Park (1990) Chang and Lawler (1990) Chang and Lawler (1994) Other Hyyrö (2008a) Hyyrö, Narisawa, and Inenaga (2010) TODO TODO Chang and Lampe (1992) Baeza-Yates 1989 Improved string searching Baeza-Yates 1989 Efficient text searching (PhD thesis) Baeza-Yates 1989 string searching algorithms revisited Baeza-Yates and Perleberg (1996) Baeza-Yates and Navarro (1996) Baeza-Yates and G.Shortest paths, bucket queues, and A* on the edit graphhttps://curiouscoding.nl/posts/shortest_path_history/Sat, 29 Jul 2023 00:00:00 +0200https://curiouscoding.nl/posts/shortest_path_history/Table of Contents Shortest path algorithms .. .. in general .. 
for circuit design Bucket queues Shortest path algorithms by Hadlock Grid graphs Strings Spouge’s computational volumes This note summarizes some papers I was reading while investigating the history of A* for pairwise alignment, and related to that the first usage of a bucket queue. Schrijver (2012) provides a nice overview of general shortest path methods.
Shortest path algorithms .Research proposal: subquadratic string graph constructionhttps://curiouscoding.nl/posts/cwi-proposal/Mon, 10 Jul 2023 00:00:00 +0200https://curiouscoding.nl/posts/cwi-proposal/Table of Contents Introduction Research plan Improve query performance using Heavy-Light Decomposition Add more query types Extend to non-exact suffix-prefix-overlap that allows for read errors Implement an algorithm to build string graphs, and possibly a full assembler This is a research proposal for a 5 month internship at CWI during autumn/winter 2023-2024.
Introduction An important problem in bioinformatics is genome assembly: DNA sequencing machines read substrings of a full DNA genome, and these pieces must be assembled together to recover the entire genome.Loukides, Pissis, Thankachan, Zuba :: Suffix-Prefix Queries on a Dictionaryhttps://curiouscoding.nl/posts/apsp/Fri, 07 Jul 2023 00:00:00 +0200https://curiouscoding.nl/posts/apsp/Table of Contents Comments Prelims One-to-One One-to-All Report and Count Top-\(K\) A small rant on $τ$-micro-macro trees Ideas for simplification Replace $τ$-micro-macro tree Heavy-Light-Decomposition (HLD) for \(Count\) queries in \(O(\log n)\) time Finding the largest \(l\) with \(Count(i, l) \geq K\) in \(O(\log n)\) time Reporting matching strings Comparison Closing thoughts \[\newcommand{\dol}{\$}\]
These are some comments and new ideas on the paper by Loukides, Pissis, Thankachan, and Zuba (2023).DSB 2023https://curiouscoding.nl/posts/dsb-2023/Mon, 20 Mar 2023 00:00:00 +0100https://curiouscoding.nl/posts/dsb-2023/These are notes for DSB 2023. They’re not very structured though. I usually find methods more interesting than results.
Day 1, Tuesday Practical data structures for longest common extensions, Alexander Herlez LCE: longest common extension: given \(i\), \(j\), the max \(k\) s.t. \(A[i, i+k) = A[j, j+k)\). alg: compare first k if same: sample a subset and use black-box datastructure. similar idea to minhash/mash kmer selection methods, same(?) as syncmers string synchronizing sets (SSS): rolling hash.Doctoral planhttps://curiouscoding.nl/posts/research-proposal/Mon, 12 Dec 2022 00:00:00 +0100https://curiouscoding.nl/posts/research-proposal/Table of Contents Research Proposal: Near-linear exact pairwise alignment Abstract Introduction and current state of research in the field Goals of the thesis Impact Progress to date Detailed work plan WP1: A*PA v1: initial version WP2: Visualizing aligners WP3: Benchmarking aligners WP4: Theory review WP5: A*PA v2: efficient implementation WP6: Affine costs WP7: Ends-free alignment and mapping WP8: Further extension and open ended research WP9: Thesis writing Publication plan Time schedule Teaching responsibilities Other duties Study plan Signatures Research Proposal: Near-linear exact pairwise alignment Abstract Pairwise alignment and edit distance specifically is a problem that was first stated around 1968 (Needleman and Wunsch 1970; Vintsyuk 1968).One Year Of Rusthttps://curiouscoding.nl/posts/one-year-of-rust/Thu, 17 Nov 2022 00:00:00 +0100https://curiouscoding.nl/posts/one-year-of-rust/Table of Contents Thoughts and remarks Good Bad My programming language journey Lego mindstorms LabVIEW C++ Python Rust These are some notes on my opinions on Rust after one year of using it.
Thoughts and remarks These pros and cons are mostly relative to C++, the language I used for the past ~10 years.
Good Sum types! Option and enum are so much nicer than optional and in particular variant.The complexity and performance of WFA and band doublinghttps://curiouscoding.nl/posts/wfa-edlib-perf/Thu, 17 Nov 2022 00:00:00 +0100https://curiouscoding.nl/posts/wfa-edlib-perf/Table of Contents Complexity analysis Complexity of edit distance Complexity of affine cost alignment Comparison Implementation efficiency Band doubling for affine scores was never implemented WFA vs band doubling for affine costs Conclusion Future work This note explores the complexity and performance of band doubling (Edlib) and WFA under varying cost models.
Edlib (Šošić and Šikić 2017) uses band doubling and runs in \(O(ns)\) time, for sequence length \(n\) and edit distance \(s\) between the two sequences.String algorithm visualizationshttps://curiouscoding.nl/posts/alg-viz/Tue, 08 Nov 2022 00:00:00 +0100https://curiouscoding.nl/posts/alg-viz/ Select the algorithm to visualize Click the buttons, or click the canvas and use the indicated keys Suffix-array construction is explained here and BWT is explained here.
Source code is on GitHub.
Algorithm Suffix Array Construction Burrows-Wheeler Transform Bidirectional BWT String Query prev (←/backspace) next (→/space) Delay (s) faster (↑/+/f) slower (↓/-/s) pause/play (p/return)Thoughts on linear programminghttps://curiouscoding.nl/posts/linear-programming/Fri, 04 Nov 2022 00:00:00 +0100https://curiouscoding.nl/posts/linear-programming/Table of Contents Linear programming Assumptions Idea for an algorithm This note contains some ideas about linear programming and most-orthogonal faces. They’re mostly on an intuitive level and not very formal.
Postscriptum: The ideas here don’t work.
Linear programming \begin{equation*} \newcommand{\v}[1]{\textbf{#1}} \newcommand{\x}{\v x} \newcommand{\t}{\v t} \newcommand{\b}{\v b} \end{equation*}
Maximize \(\t\x\) subject to \(A\x \leq \b\).
\(\x\) is a vector of \(n\) variables \(x_i\). \(A\) is a \(m\times n\) matrix: there are \(m\) constraints \(A_j \x \leq b_j\).Local Doublinghttps://curiouscoding.nl/posts/local-doubling/Wed, 19 Oct 2022 00:00:00 +0200https://curiouscoding.nl/posts/local-doubling/Table of Contents Notation Needleman-Wunsch: where it all begins Dijkstra/BFS: visiting fewer states Band doubling: Dijkstra, but more efficient GapCost: A first heuristic Computational volumes: an even smaller search Cheating: an oracle gave us \(g^*\) A*: Better heuristics Broken idea: A* and computational volumes Local doubling Without heuristic With heuristic Diagonal Transition A* with Diagonal Transition and pruning: doing less work Goal: Diagonal Transition + pruning + local doubling Pruning: Improving A* heuristics on the go Cheating more: an oracle gave us the optimal path TODO: aspriation windows \begin{equation*} \newcommand{\st}[2]{\langle #1,#2\rangle} \newcommand{\g}{g^*} \newcommand{\fm}{f_{max}} \newcommand{\gap}{\operatorname{Gap}} \end{equation*}BWT and FM-indexhttps://curiouscoding.nl/posts/bwt/Tue, 18 Oct 2022 00:00:00 +0200https://curiouscoding.nl/posts/bwt/Table of Contents Burrows-Wheeler Transformation (BWT) Last-to-first mapping (LF mapping) Pattern matching Visualization Bi-directional BWT These are some notes about the Burrows-Wheeler Transform (BWT), FM-index, and variants.
See my post on the linear time suffix array construction algorithm for notation and terminology.
At the bottom you can find a visualization. This page has an interactive demo. Source code for visualizations is this GitHub repo.
Burrows-Wheeler Transformation (BWT) The BWT of a string \(S\) is generated as follows:A Combinatorial Identityhttps://curiouscoding.nl/posts/a-combinatorial-identity/Sun, 16 Oct 2022 00:00:00 +0200https://curiouscoding.nl/posts/a-combinatorial-identity/Some notes regarding the identity
\begin{equation} \sum_{k=0}^n \binom{2k}k \binom{2n-2k}{n-k} = 4^n \end{equation}
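As a quick numeric sanity check of the identity (a sketch rather than a proof; the helper name `lhs` is made up for this illustration), the sum can be evaluated with exact integer arithmetic:

```python
from math import comb


def lhs(n: int) -> int:
    """Left-hand side: sum over k of C(2k, k) * C(2n - 2k, n - k)."""
    return sum(comb(2 * k, k) * comb(2 * n - 2 * k, n - k) for k in range(n + 1))


# Compare against 4^n for small n; Python integers are exact, so no rounding occurs.
for n in range(20):
    assert lhs(n) == 4 ** n
```

For example, for \(n=3\) the four terms are \(20 + 12 + 12 + 20 = 64 = 4^3\).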
Gould has two derivations. The first follows from Jensen’s equality, (18) in (Jensen 1902; Shijie 1303).
A second via the Chu-Vandermonde convolution:
\begin{equation} \sum_{k=0}^n \binom{x}k \binom{y}{n-k} = \binom{x+y}n \end{equation}
using \(x=y=-\frac 12\) and the \(-\frac 12\)-transform:
\begin{equation} \binom{-1/2}{n} = (-1)^n\binom{2n}{n}\frac 1 {2^{2n}} \end{equation}
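To spell out how these two ingredients combine (a short sketch; the only extra fact assumed is \(\binom{-1}{n} = (-1)^n\)): setting \(x=y=-\frac 12\) in the convolution gives

\begin{equation*} \sum_{k=0}^n \binom{-1/2}{k} \binom{-1/2}{n-k} = \binom{-1}{n} = (-1)^n. \end{equation*}

Applying the transform to both factors of each term gives

\begin{equation*} \binom{-1/2}{k} \binom{-1/2}{n-k} = (-1)^k\binom{2k}{k}\frac 1 {2^{2k}} \cdot (-1)^{n-k}\binom{2n-2k}{n-k}\frac 1 {2^{2n-2k}} = \frac{(-1)^n}{4^n} \binom{2k}k \binom{2n-2k}{n-k}. \end{equation*}

Summing over \(k\) and multiplying both sides by \((-1)^n 4^n\) yields \(\sum_{k=0}^n \binom{2k}k \binom{2n-2k}{n-k} = 4^n\).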
Duarte and de Oliveira (2012) has a combinatorial proof. References Duarte, Rui, and António Guedes de Oliveira.Tensor embedding preserves Hamming distancehttps://curiouscoding.nl/posts/tensor-embedding-distance/Fri, 14 Oct 2022 00:00:00 +0200https://curiouscoding.nl/posts/tensor-embedding-distance/Table of Contents Definitions Proof of Lemma 1 TODO Proof of Lemma 2 This is a proof that Tensor Embedding (Joudaki, Rätsch, and Kahles 2020) with $ℓ^2$-norm preserves the Hamming distance.
This is in collaboration with Amir Joudaki.
\begin{equation*} \newcommand{\I}{\mathcal I} \newcommand{\EE}{\mathbb E} \newcommand{\var}{\operatorname{Var}} \end{equation*}
Definitions Notation The alphabet is \(\Sigma\), of size \(|\Sigma| = \sigma\). The set of indices is \(\I := \{(i_1, \dots, i_t) \in [n]^t: i_1 < \dots < i_t\}\).Linear-time suffix array constructionhttps://curiouscoding.nl/posts/suffix-array-construction/Thu, 13 Oct 2022 00:00:00 +0200https://curiouscoding.nl/posts/suffix-array-construction/Table of Contents Notation Small and Large suffixes Building the suffix array from a smaller one Visualization These are some notes about linear time suffix array (SA) construction algorithms (SACA’s).
At the bottom you can find a visualization. This page has an interactive demo. History of suffix array construction algorithms:
1990 first algorithm: Manber and Myers (1993) 2002 small/large suffixes, explained below: Ko and Aluru (2005) 2009 recursion only on LMS suffixes: Nong, Zhang, and Chan (2009) These slides from Stanford are a nice reference for the last algorithm.Competitive Programming Lecturehttps://curiouscoding.nl/posts/competitive-programming-lecture/Wed, 28 Sep 2022 00:00:00 +0200https://curiouscoding.nl/posts/competitive-programming-lecture/Table of Contents Contest strategies Pairwise Alignment using A* Exercises Contest strategies Preparation Thinking costs energy! Sleep enough; early to bed the 2 nights before. No practising on contest day (and the day before); it just takes energy. During the contest Eat! At the very least take a break halfway with the entire team and eat some snacks. Make sure to read all the problems before the end of the contest.Reducing A* memory usage using frontshttps://curiouscoding.nl/posts/astar-memory-usage/Mon, 26 Sep 2022 00:00:00 +0200https://curiouscoding.nl/posts/astar-memory-usage/Table of Contents Motivation Parititioning A* memory by fronts Non-consistent heuristics Front indexing Tracing back the path Here is an idea to reduce the memory usage of A* by only storing one front at a time, similar to what Edlib and WFA do. Note that for now this will not work, but I’m putting this online anyway.
Motivation In our implementation of A*PA, we use a hashmap to store the value of \(g\) of all visited (explored/expanded) states by A*.Speeding up A*: computational volumes and path-pruninghttps://curiouscoding.nl/posts/speeding-up-astar/Fri, 23 Sep 2022 00:00:00 +0200https://curiouscoding.nl/posts/speeding-up-astar/Table of Contents Motivation Summary Why is A* slow? Computational volumes Dealing with pruning Thoughts on more aggressive pruning Algorithm summary Challenges Results What about band-doubling? Maybe doubling can work after all? TODOs Extensions This post build on top of our recent preprint Groot Koerkamp and Ivanov (2024) and gives an overview of some of my new ideas to significantly speed up exact global pairwise alignment. It’s recommended you understand the seed heuristic and match pruning before reading this post.Revised Oxford Bioinformatics latex templatehttps://curiouscoding.nl/posts/bioinformatics-template/Thu, 22 Sep 2022 12:13:00 +0200https://curiouscoding.nl/posts/bioinformatics-template/I made an improved version of the Oxford Bioinformatics latex template. See the Github repository.Linear memory WFA?https://curiouscoding.nl/posts/linear-memory-wfa/Wed, 17 Aug 2022 00:00:00 +0200https://curiouscoding.nl/posts/linear-memory-wfa/Table of Contents Motivation Path traceback: two strategies Observations What information is needed for path tracing A pragmatic solution Another interpretation Affine costs Conclusion Figure 1: Only the red substitutions and blue indel need to be stored to trace the entire path.
In this post I’ll discuss an idea to run WFA using less memory, while still allowing us to trace back the optimal path from the target state back to the start of the search.Transforming match bonus into costhttps://curiouscoding.nl/posts/alignment-scores-transform/Tue, 16 Aug 2022 00:00:00 +0200https://curiouscoding.nl/posts/alignment-scores-transform/Table of Contents Tricks with match bonus or how to fool Dijkstra’s limitations Edit graph Algorithms Potentials Multiple variants Some notes on algorithms WFA A* Extending to different cost models Affine costs Substitution matrices But not local alignment Evaluations Unequal string length Equal string lengths Conclusion Tricks with match bonus or how to fool Dijkstra’s limitations The reader is assumed to have basic knowledge about pairwise alignment and graph theory.Paper styleguidehttps://curiouscoding.nl/posts/styleguide/Sat, 06 Aug 2022 00:00:00 +0200https://curiouscoding.nl/posts/styleguide/Table of Contents Notation Naming and style This is a growing list of notation and style decisions Pesho and I made during the writing of our paper, written down so that we don’t have to spend time on it again next time.
Notation Math Modulo: \(a\bmod m\) for remainder, \(a\equiv b\pmod m\) for equivalence. Alphabet \(\Sigma\), \(|\Sigma| = 4\) Sequences \(A = \overline{a_0\dots a_{n-1}} \in \Sigma^*\), \(|A| = n\) \(B = \overline{b_0\dots b_{m-1}} \in \Sigma^*\), \(|B| = m\) Edit distance \(\mathrm{ed}(A, B)\) \(A_{<i} = \overline{a_0\dots a_{i-1}}\) \(A_{\geq i} = \overline{a_i\dots a_{n-1}}\) \(A_{i\dots i’} = \overline{a_i\dots a_{i’-1}}\) Edit graph State \(\langle i, j\rangle\) Graph \(G(V, E)\) where \(V = \{\langle i,j\rangle | 0\leq i\leq n, 0\leq j\leq m\}\) Root state \(v_s = \langle 0,0\rangle\) Target state \(v_t = \langle n,m\rangle\) Distance \(d(u, v)\) Path \(\pi\) Shortest path \(\pi^*\) Cost of path \(cost(\pi)\), \(cost(\pi^*) = d(v_s, v_t) = \mathrm{ed}(A, B)\).Diamond optimisation for diagonal transitionhttps://curiouscoding.nl/posts/diamond-optimization/Mon, 01 Aug 2022 00:00:00 +0200https://curiouscoding.nl/posts/diamond-optimization/Table of Contents Diamond transition or how technicalities can break concepts But let’s take a closer look Conclusion Diamond transition or how technicalities can break concepts We assume the reader has some basic knowledge about pairwise alignment and in particular the WFA algorithm.
In this post we dive into a potential 2x speedup of WFA — one that turns out not to work.
Let’s take a look at one of the most important and efficient algorithms for pairwise alignment — WFA (Marco-Sola et al.Bidirectional A*https://curiouscoding.nl/posts/bidirectional-astar/Thu, 28 Jul 2022 17:59:00 +0200https://curiouscoding.nl/posts/bidirectional-astar/These are some links and papers on bidirectional A* variants. Nothing insightful at the moment.
small lecture introduces \(h_f(u) = \frac 12 (\pi_f(u) - \pi_r)\). I have not found a paper yet. An Improved Bidirectional Heuristic Search Algorithm (Champeaux 1977) introduces a bidirectional variant Bidirectional Heuristic Search Again (Champeaux 1983) fixes a bug in the above paper Efficient modified bidirectional A* algorithm for optimal route-finding Didn’t read closely yet. A new bidirectional algorithm for shortest paths (Pijls 2008) Actually a new method.The BiWFA meeting conditionhttps://curiouscoding.nl/posts/biwfa-meeting-condition/Mon, 11 Jul 2022 00:00:00 +0200https://curiouscoding.nl/posts/biwfa-meeting-condition/cross references: BiWFA GitHub issue
It seems that getting the meeting/overlap condition of BiWFA (Marco-Sola et al. (2023), Algorithm 1 and Lemma 2.1) correct is tricky.
Let \(p := \max(x, o+e)\) be the maximal cost of any edge in the edit graph. As in the BiWFA paper, let \(s_f\) and \(s_r\) be the distances of the forward and reverse fronts computed so far.
We prove the following lemma:
Lemma Once BiWFA has expanded the forward and reverse fronts up to \(s_f\) and \(s_r\) and has found some path of cost \(s \leq s_f + s_r\), expanding the fronts until \(s’_f + s’_r \geq s+p+o\) is guaranteed to find a shortest path.A* variantshttps://curiouscoding.nl/posts/astar-variants/Sun, 12 Jun 2022 12:04:00 +0200https://curiouscoding.nl/posts/astar-variants/These are some quick notes listing papers related to A* itself and variants. In particular, here I’m interested in papers that update \(h\) during the A* search, as a background for pruning.
Specifically, our version of pruning increases \(h\) during a single A* search, and in fact the heuristic becomes inadmissible after pruning.
Changing \(h\) The original A* paper has a proof of optimality. Later papers consider this also with heuristics that change their value over time.IGGSY 22 Slideshttps://curiouscoding.nl/posts/iggsy-presentation-slides/Sun, 12 Jun 2022 12:04:00 +0200https://curiouscoding.nl/posts/iggsy-presentation-slides/These are the slides Pesho Ivanov and I presented at IGGSY 2022 on Astarix and A*PA.
Drive: here
Pdf: hereBenchmark attention pointshttps://curiouscoding.nl/posts/benchmarks/Thu, 28 Apr 2022 23:33:00 +0200https://curiouscoding.nl/posts/benchmarks/Benchmarking is harder than you think, even when taking into account this rule.
This post lists some lessons I learned while attempting to run benchmarks for A* pairwise aligner. I was doing this on a laptop, which likely has different characteristics from CPUs in a typical server rack. All the programs I run are single threaded.
Hardware Do not run while charging the laptop Charging makes the battery hot and causes throttling.Motivationhttps://curiouscoding.nl/posts/motivation/Thu, 28 Apr 2022 23:22:00 +0200https://curiouscoding.nl/posts/motivation/It’s not the need for faster software that motivates; it’s the mathematical discovery that needs sharing.Proof sketch for linear time seed heuristic alignmenthttps://curiouscoding.nl/posts/linear-time-pa/Sun, 24 Apr 2022 00:00:00 +0200https://curiouscoding.nl/posts/linear-time-pa/Table of Contents Pairwise alignment in subquadratic time Random model Algorithm Seed heuristic Match pruning Analysis Expanded states Excess errors Algorithmic complexity This post is a proof sketch to show that A* with the seed heuristic (Groot Koerkamp and Ivanov 2024) does exact pairwise alignment of random strings with random mutations in near linear time.
Pairwise alignment in subquadratic time Backurs and Indyk (2018) show that computing edit distance can not be done in strongly subquadratic time (i.Variations on the WFA recursionhttps://curiouscoding.nl/posts/wfa-variations/Sun, 17 Apr 2022 03:14:00 +0200https://curiouscoding.nl/posts/wfa-variations/Table of Contents Gap open Gap close Symmetric alternatives Another symmetry Conclusions cross references: BiWFA GitHub issue
In this post I will explore some variations of the recursion used by WFA/BiWFA for the affine version of the diagonal transition algorithm. In particular, we will go over a gap-close variant, and look into some more symmetric formulations.
Gap open WFA (Marco-Sola et al. 2020) introduces the affine cost variant of the classic diagonal transition method.Ongoing and future researchhttps://curiouscoding.nl/pages/todo/Fri, 15 Apr 2022 00:00:00 +0200https://curiouscoding.nl/pages/todo/Table of Contents In progress On hold Pending ideas/blogposts Smaller tasks Future plans Open questions Reading list Projects to pick up A*PA2 Minimizers High throughput k-mer indices Here I list projects that I’m currently working on, and ideas for future work.
This page is usually outdated.
In progress A* pairwise aligner [GitHub] Exact global pairwise alignment of random strings in expected linear time. Contains proof of correctness, implementation, evals and comparison with WFA and edlib on random data.Projectshttps://curiouscoding.nl/pages/projects/Fri, 15 Apr 2022 00:00:00 +0200https://curiouscoding.nl/pages/projects/1 Pairwise alignment 1.1 A*PA A pairwise aligner based on A*, in collaboration with Pesho Ivanov. A*PA uses A* with new heuristics to speed up global pairwise alignment. A*PA2 is much faster by using a DP-based method.
Code, slides A*PA paper: PDF, bioRxiv, Bioinformatics (supplement separate) A*PA2 paper: PDF, bioRxiv (outdated) Blogposts Pairwise alignment history Computational volumes A*PA2, the blogpost version of the paper. Local doubling, an idea that didn’t make it into the final paper.Publicationshttps://curiouscoding.nl/pages/publications/Fri, 15 Apr 2022 00:00:00 +0200https://curiouscoding.nl/pages/publications/Chronological list of publications Forward sampling scheme density lower bound (PDF) Kille, Bryce, Ragnar Groot Koerkamp, Drake McAdams, Alan Liu, and Todd Treangen. 2024. “A near-Tight Lower Bound on the Density of Forward Sampling Schemes.” Biorxiv. https://doi.org/10.1101/2024.09.06.611668. Mod-minimizer (PDF) Groot Koerkamp, Ragnar, and Giulio Ermanno Pibiri. 2024a. “The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers.” In 24th International Workshop on Algorithms in Bioinformatics (Wabi 2024), edited by Solon P.Glossaryhttps://curiouscoding.nl/posts/glossary/Thu, 14 Apr 2022 00:00:00 +0200https://curiouscoding.nl/posts/glossary/This is a growing list of ambiguous terms and their definitions. More of a place to store random remarks than a complete reference for now.
diagonal transition name introduced by Navarro (2001) approximate approximate algorithm: an algorithm that does not always give the correct answer.
$k$-approximate string matching: variant semi-global alignment where we find all matches of a pattern in a reference with at most \(k\) mistakes.
Also approximate string matching: alternative name for global pairwise alignment.A survey of exact global pairwise alignmenthttps://curiouscoding.nl/posts/pairwise-alignment-history/Fri, 01 Apr 2022 00:00:00 +0200https://curiouscoding.nl/posts/pairwise-alignment-history/Table of Contents Variants of pairwise alignment Cost models Alignment types A chronological overview of global pairwise alignment Algorithms in detail Classic DP algorithms Cubic algorithm of Needleman and Wunsch (1970) A quadratic DP Local alignment Affine costs Minimizing vs. maximizing duality Four Russians method TODO \(O(ns)\) methods TODO Exponential search on band TODO LCS: thresholds, $k$-candidates and contours TODO Diagonal transition: furthest reaching and wavefronts TODO Suffixtree for \(O(n+s^2)\) expected runtime Using less memory Computing the score in linear space Divide-and-conquer TODO LCSk[++] algorithms Theoretical lower bound TODO A note on DP (toposort) vs Dijkstra vs A* TODO Tools TODO Notes for other posts Semi-global alignment papers Approximate pairwise aligners Old vs new papers Note: This is a living document, and will likely remain so for a while.Pruning for A* heuristicshttps://curiouscoding.nl/posts/pruning/Sat, 11 Dec 2021 00:00:00 +0100https://curiouscoding.nl/posts/pruning/Note: this post extends the concept of multiple-path pruning presented in Poole and Mackworth (2017).
Say we’re running A* in a graph from \(s\) to \(t\). \(d(s,t)\) is the distance we are looking for.
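For reference, plain A* can be sketched in a few lines. This is a generic textbook-style sketch and not the code behind this post; the names `astar`, `graph`, and `h`, and the adjacency-list format, are assumptions made for illustration. It re-expands a state whenever a cheaper path to it is found, so it only relies on \(h\) being non-negative and never overestimating the remaining distance.

```python
import heapq


def astar(graph, h, s, t):
    """Return d(s, t) in a weighted digraph given as {u: [(v, cost), ...]}.

    h(u) is a heuristic estimate of d(u, t). Returns None if t is unreachable.
    """
    g = {s: 0}  # best known distance from s to each visited state
    queue = [(h(s), s)]  # priority queue ordered by f = g + h
    while queue:
        f, u = heapq.heappop(queue)
        if f > g[u] + h(u):
            continue  # stale entry: a cheaper path to u was found later
        if u == t:
            return g[u]
        for v, cost in graph.get(u, []):
            new_g = g[u] + cost
            if new_g < g.get(v, float("inf")):
                g[v] = new_g
                heapq.heappush(queue, (new_g + h(v), v))
    return None


# Toy example: the shortest path is s -> a -> b -> t.
graph = {"s": [("a", 1), ("b", 4)], "a": [("b", 1), ("t", 5)], "b": [("t", 1)]}
estimate = {"s": 2, "a": 2, "b": 1, "t": 0}  # never exceeds the true distance to t
distance = astar(graph, estimate.get, "s", "t")
```

With \(h \equiv 0\) the same function degrades to Dijkstra's algorithm.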
An A* heuristic has to satisfy \(h(u) \leq d(u, t)\) to be admissible: the estimated distance to the end should never be larger than the actual distance to guarantee that the algorithm finds a shortest path.AStarixhttps://curiouscoding.nl/posts/astarix/Fri, 12 Nov 2021 13:05:00 +0100https://curiouscoding.nl/posts/astarix/Papers
AStarix: Fast and Optimal Sequence-to-Graph Alignment Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds AStarix is a method for aligning sequences (reads) to graphs:
Input A reference sequence or graph Alignment costs \((\Delta_{match}, \Delta_{subst}, \Delta_{del}, \Delta_{ins})\) for a match, substitution, insertion and deletion Sequence(s) to align Output An optimal alignment of each input sequence The input is a reference graph (automaton really) \(G_r = (V_r, E_r)\) with edges \(E_r \subseteq V_r\times V_r\times \Sigma\) that indicate the transitions between states.Neighbour joininghttps://curiouscoding.nl/posts/neighbour-joining/Fri, 12 Nov 2021 11:57:00 +0100https://curiouscoding.nl/posts/neighbour-joining/Neighbour joining (NJ, paper) is a phylogeny reconstruction method. It differs from UPGMA in the way it computes the distances between clusters.
This algorithm first assumes that the phylogeny is a star graph. Then it finds the pair of vertices that when merged and split out gives the minimal total edge length \(S_{ij}\) of the new almost-star graph. (See eq. (4) and figure 2a and 2b in the paper.) \[ S_{i,j} = \frac1{2(n-2)} \sum_{k\not\in \{i,j\}}(d(i, k)+d(j,k)) + \frac 12 d(i,j)+\frac 1{n-2} \sum_{k<l,\, k, l\not\in\{i,j\}}d(k,l).UPGMAhttps://curiouscoding.nl/posts/upgma/Thu, 28 Oct 2021 11:56:00 +0200https://curiouscoding.nl/posts/upgma/Unweighted pair group method with arithmetic mean (UPGMA) is a phylogeny reconstruction method.
Input Matrix of pairwise distances Output Phylogeny Algorithm Repeatedly merge the nearest two clusters. The distance between clusters is the average of all pairwise distances between them. When merging two clusters, the distances of the new cluster are the weighted averages of distances from the two clusters being merged. Complexity \(O(n^3)\) naive, \(O(n^2 \ln n)\) using heap.RTFEhttps://curiouscoding.nl/posts/rfte/Fri, 22 Oct 2021 15:16:00 +0200https://curiouscoding.nl/posts/rfte/Read The F*ing Error
When you complain about an error without reading it first. When you assume you understand the problem halfway through reading the error, and only after more debugging you realize you failed to read properly.1st law of Procrastinationhttps://curiouscoding.nl/posts/procrastination/Fri, 22 Oct 2021 11:46:00 +0200https://curiouscoding.nl/posts/procrastination/Important deadlines require important procrastination.Data should be reviewedhttps://curiouscoding.nl/posts/data-should-be-reviewed/Fri, 22 Oct 2021 11:41:00 +0200https://curiouscoding.nl/posts/data-should-be-reviewed/Experiments and their analysis should be reproducible, and all data/figures in a paper should be reviewable. Pipelines (e.g. snakemake files) to generate them should be attached to the paper.
I’ve asked for automated scripts to reproduce test data on 3+ github repositories now, and got a satisfactory answer zero times:
WFA: https://github.com/smarco/WFA/issues/26
Link to a datadump on the block-aligner repository. Good to have actual data, but exactly how this data was created is unclear to me.Spaced K-mer Seeded Distancehttps://curiouscoding.nl/posts/spaced-kmer-distance/Wed, 20 Oct 2021 00:00:00 +0200https://curiouscoding.nl/posts/spaced-kmer-distance/Table of Contents Background $k$-mers Sketching MinHash Terminology Introduction Spaced $k$-mer Seeded Distance Improving performance Analysis Pruning false positive candidate matches Phylogeny reconstruction Running the algorithm TODO Assembly \[ \newcommand{\vp}{\varphi} \newcommand{\A}{\mathcal A} \newcommand{\O}{\mathcal O} \newcommand{\N}{\mathbb N} \newcommand{\ed}{\mathrm{ed}} \newcommand{\mh}{\mathrm{mh}} \newcommand{\hash}{\mathrm{hash}} \]
Background Quickly finding similar pieces of DNA within large datasets is at the core of computational biology. This has many applications:
Alignment: Given two pieces of related DNA, align them to find where mutations (i.Open Sciencehttps://curiouscoding.nl/posts/open-science/Tue, 19 Oct 2021 00:00:00 +0200https://curiouscoding.nl/posts/open-science/Let’s go over some reasons for why I’m writing this blog.
The internet is more accessible than papers The inspiration for this blog is the post on Succinct de Bruijn Graphs by Alex Bowe. I think blog posts are a great way to quickly learn about new ideas and concepts, since they are usually more accessible than papers. A blog post can omit some of the more formal text required in papers and spend more time explaining things on an intuitive level.Hugo and ox-hugohttps://curiouscoding.nl/posts/hugo/Thu, 14 Oct 2021 00:00:00 +0200https://curiouscoding.nl/posts/hugo/Here’s the customary how I made this site using X post.
This site is built using Hugo and ox-hugo.
The source is written in Org mode, which is converted to markdown by ox-hugo. To get started yourself, check out the initial commit of the source repository and build from there.
Some notes:
I’m using the Hugo-coder theme. Since the conversion from Org to markdown is done using an Emacs plugin, the emacs folder contains a simple init.Hello, World!https://curiouscoding.nl/posts/hello-world/Wed, 13 Oct 2021 00:00:00 +0200https://curiouscoding.nl/posts/hello-world/ 1 print("Hello, World!") 1 std::cout << "Hello, World!" << std::endl;Spaced k-mer and assembler methodshttps://curiouscoding.nl/posts/spaced-kmer-review/Wed, 14 Jul 2021 00:00:00 +0200https://curiouscoding.nl/posts/spaced-kmer-review/Table of Contents Spaced \(k\)-mers Minimap SPAdes MUMmer4 BLASR Bowtie 2 Patternhunter Spaced seeds improve \(k\)-mer-based metagenomic classification LoMeX Meeting notes Concepts:
Mapping Map a sequence onto a reference genome/dataset Assembly Build a genome from a set of reads de novo (implied): without using a reference genome Otherwise just called mapping Typical complicating factors:
read errors non-uniform coverage insert size variation chimeric reads (?) bireads non-uniform read coverage (as in metagenomics, i.Ideas for assembling [long] readshttps://curiouscoding.nl/posts/thoughts-on-assembling/Fri, 09 Jul 2021 00:00:00 +0200https://curiouscoding.nl/posts/thoughts-on-assembling/\[ \newcommand{\vp}{\varphi} \newcommand{\A}{\mathcal A} \newcommand{\O}{\mathcal O} \newcommand{\N}{\mathbb N} \newcommand{\Z}{\mathbb Z} \newcommand{\ed}{\mathrm{ed}} \newcommand{\mh}{\mathrm{mh}} \newcommand{\hash}{\mathrm{hash}} \]
Here is an idea for an algorithm to assemble long reads.
Go over all sequences and sketch their windows using the Hamming distance preserving sketch method described here. This method may need some tweaking to also work with an indel rate of around 10%.
Let’s say we find a pair of matching windows between reads \(A\) and \(B\) starting at positions \(i\) and \(j\).Hamming Similarity Searchhttps://curiouscoding.nl/posts/hamming-similarity-search/Thu, 08 Jul 2021 00:00:00 +0200https://curiouscoding.nl/posts/hamming-similarity-search/Table of Contents Background $k$-mers Sketching MinHash Introduction Hamming Similarity Search Improving performance Analysis Pruning false positive candidate matches Phylogeny reconstruction Running the algorithm Assembly \[ \newcommand{\vp}{\varphi} \newcommand{\A}{\mathcal A} \newcommand{\O}{\mathcal O} \newcommand{\N}{\mathbb N} \newcommand{\ed}{\mathrm{ed}} \newcommand{\mh}{\mathrm{mh}} \newcommand{\hash}{\mathrm{hash}} \]
Background Quickly finding similar pieces of DNA within large datasets is at the core of computational biology. This has many applications:
Alignment: Given two pieces of related DNA, align them to find where mutations (i.Detached fullscreen in Swayhttps://curiouscoding.nl/posts/sway-fullscreen/Fri, 02 Jul 2021 00:00:00 +0200https://curiouscoding.nl/posts/sway-fullscreen/Xrefs: PR for Sway | AUR package sway-inhibit-fullscreen-git
Once upon a time, Chromium had a bug where using $mod+f in i3 to fullscreen the Chromium window changed the window to occupy the entire screen, but didn’t actually make Chromium enter full screen mode. According to some, those were the good days. Watching 4 YouTube streams in parallel was still possible, back in those days:
Without patches, the best we can do nowadays7 is the followingOpen source contributionshttps://curiouscoding.nl/posts/oss-contributions/Fri, 02 Jul 2021 00:00:00 +0200https://curiouscoding.nl/posts/oss-contributions/Table of Contents My aur packages Some issues I reported/fixed My aur packages List on aur.archlinux.org
bapctools-git: BAPCtools is used for developing ICPC style programming contest problems. feh-preload-next-image-git: Branch of Feh that loads the next image to speed up browsing images in a remote directory. i3-focus-last-git: Window switcher for i3/sway. python-pyexiftool-nocheck: the original python-pyexiftool is outdated, orphaned, and still depends on python2. sway-inhibit-fullscreen-git: Sway branch that adds the inhibit_fullscreen toggle command.Powersearch with Vimiumhttps://curiouscoding.nl/posts/vimium/Fri, 02 Jul 2021 00:00:00 +0200https://curiouscoding.nl/posts/vimium/Related posts: Dark mode with Vimium
Vimium (Github, Chromium extension) is not only a great way to navigate webpages; it’s also a great help to quickly search many webpages.
I am using it many times a day to search for just the documentation I need. Some of the search engines I have configured:
# Documentation archwiki: https://wiki.Wayland utilitieshttps://curiouscoding.nl/posts/wayland/Fri, 02 Jul 2021 00:00:00 +0200https://curiouscoding.nl/posts/wayland/This post goes over some useful utilities I have been using on my Wayland system.
Screen brightness: light Light is a nice tool to manage screen and keyboard brightness.
Install light Add your user to the video group: usermod -aG video <user> I really like the light -T flag, which multiplies the current brightness by some value. This way you can have fine grained control both for very low and very high brightness values.Browsing in the dark with Vimium and Dark Readerhttps://curiouscoding.nl/posts/dark-mode/Thu, 01 Jul 2021 00:00:00 +0200https://curiouscoding.nl/posts/dark-mode/Table of Contents Chromium theme Dark Reader Vimium Let’s quickly go over some settings you can change for a better dark mode experience in Chromium.
Chromium theme First of all, you can make Chromium itself use a dark theme. This will ensure both a dark tab bar and nice dark settings pages. As explained here, you’ll need to change the following:
Run chromium with the flags
1 chromium --enable-features=WebUIDarkMode --force-dark-mode If you are already using other feature flags, they can be comma separated:Window switching in Swayhttps://curiouscoding.nl/posts/sway-window-switching/Thu, 01 Jul 2021 00:00:00 +0200https://curiouscoding.nl/posts/sway-window-switching/Sway has many commands for switching the active workspace and focused window. However, I find that most of my window switching comes down to a few simple commands that focus a specific application, or open it first when it has no open windows yet. E.g.:
$mod+s: open and/or focus slack $mod+i: open and/or focus signal $mod+m: open and/or focus emacs $mod+c: open and/or focus chromium In addition to this, some apps like emacs have a separate $mod+Shift+m command that always opens a new window/instance.Clean your homedir with XDG Base Dirhttps://curiouscoding.nl/posts/xdg-base-dir/Wed, 30 Jun 2021 00:00:00 +0200https://curiouscoding.nl/posts/xdg-base-dir/Xrefs: XDG specification | ArchWiki | Reddit post
In case you are, like me, tired of applications polluting your homedir with config and data files, the XDG Base Directory Specification (ArchWiki) has your back.
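To make the lookup rule concrete, here is a small sketch of how a program could resolve its config location. The helper names `xdg_config_home` and `config_file` are made up for illustration and do not come from any particular library; the fallback to `~/.config` when `XDG_CONFIG_HOME` is unset or empty, and ignoring relative paths, follow the specification as I understand it.

```python
import os
from pathlib import Path


def xdg_config_home(env=os.environ) -> Path:
    """Return $XDG_CONFIG_HOME, falling back to $HOME/.config when unset or empty."""
    value = env.get("XDG_CONFIG_HOME", "")
    # The spec requires absolute paths; a relative value is treated as invalid.
    if value and os.path.isabs(value):
        return Path(value)
    return Path(env.get("HOME", str(Path.home()))) / ".config"


def config_file(app: str, name: str, env=os.environ) -> Path:
    """Path where a (hypothetical) application `app` would keep config file `name`."""
    return xdg_config_home(env) / app / name
```

Passing the environment as a parameter keeps the functions easy to test without touching the real `os.environ`.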
You probably saw the ~/.config directory already, and in fact, many programs can be told to use this directory instead of polluting your homedir. The ArchWiki page has a list of many applications and which environment variables need to be set to change the location of their configuration.Emacs Doomhttps://curiouscoding.nl/posts/emacs/Wed, 30 Jun 2021 00:00:00 +0200https://curiouscoding.nl/posts/emacs/Table of Contents Configuration init.el config.el Running as server and client Wayland Useful commands Emacs as mail client Install Doom Emacs as explained in the readme.
Alongside it, you’ll want to install ripgrep and fd for better search integration, and possibly ttf-font-awesome for better icons.
Configuration Instead of the default ~/.emacs.d/ and ~/.doom.d/ config directories, you can also use ~/.config/emacs/ and ~/.config/doom/.
init.el My init.el is mostly default, and enables the languages I regularly use, with LSP support where possible:Environment variables done oncehttps://curiouscoding.nl/posts/environment-variables/Wed, 30 Jun 2021 00:00:00 +0200https://curiouscoding.nl/posts/environment-variables/Xrefs: GitHub issue
One problem I had with my Sway setup is that setting environment variables in my config.fish (the Fish equivalent to .bashrc or .zshrc) is not always sufficient.
In particular, I need my environment variables to be available in at least the following places:
my Fish shell, applications launched from Sway (e.g. using keybindings), applications launched as a systemd service (e.g. the Emacs server daemon). Setting variables in the shell profile has the problem that they are not picked up by systemd services.28000x speedup with Numba.CUDAhttps://curiouscoding.nl/posts/numba-cuda-speedup/Mon, 24 May 2021 00:00:00 +0200https://curiouscoding.nl/posts/numba-cuda-speedup/Table of Contents CUDA Overview Profiling Optimizing Tensor Sketch CPU code V0: Original python code V1: Numba V2: Multithreading GPU code V3: A first GPU version V4: Parallel kernel invocations V5: Single kernel with many blocks V6: Detailed profiling: Kernel Compute V7: Detailed profiling: Kernel Latency V8: Detailed profiling: Shared Memory Access Pattern V9: More work per thread V10: Cache seq to shared memory V11: Hashes and signs in shared memory V12: Revisiting blocks per kernel V13: Passing a tuple of sequences V14: Better hardware V15: Dynamic shared memory Wrap up Xrefs: r/CUDA, Numba discourseX1 Extreme Gen 3 - Migrating to Waylandhttps://curiouscoding.nl/posts/x1e3/Sun, 16 May 2021 00:00:00 +0200https://curiouscoding.nl/posts/x1e3/I got a new laptop, so this felt like the right time to migrate to Wayland.
Delta what before after hardware laptop Asus UX501V Lenovo X1 Extreme Gen 3 CPU i7-6700HQ i7-10750H GPU GTX 960M GTX 1650 RAM 16GB 64GB OS bootloader Grub EFISTUB OS Windows + Arch dualboot Windows + Arch dualboot networking netctl systemd-networkd dns/dhcp dhcpcd systemd-resolved wifi wpa_supplicant iwd Wayland display/login manager - - display server X Wayland window manager i3 Sway bar i3blocks waybar backlight xbacklight light night mode redshift gammastep clipboard - wl-clipboard, clipman program launcher rofi rofi [wayland] password finder rofi-pass rofi-pass-git key remapping setxkbmap, xcape, xmodmap interception-tools Tools terminal emulator urxvt foot shell zsh fish shell highlighting zsh-syntax-highlight - environment variables .SE Endurance: Early gamehttps://curiouscoding.nl/posts/factorio-early-game/Mon, 26 Apr 2021 00:00:00 +0200https://curiouscoding.nl/posts/factorio-early-game/Xrefs: Reddit
This is the start of a series of posts on our (philae, winston) play through Factorio with the Space Exploration mod.
After lots of struggling, we recently finished our first SE world after 624 in-game hours. Since this was also our first/second Factorio world, the start was very inefficient and we learned a lot of things along the way. In this new map, which we call Endurance (after the interplanetary spaceship in Interstellar), we will apply what we learned, and share it with the world :)Hashcode 2021 Finalshttps://curiouscoding.nl/posts/hashcode-2021-finals/Sat, 24 Apr 2021 00:00:00 +0200https://curiouscoding.nl/posts/hashcode-2021-finals/Xrefs: Problem | Scoreboard
Team: cat /dev/random | grep "to be or not to be"
Who: Jan-Willem Buurlage, Ragnar Groot Koerkamp, Timon Knigge, Abe Wits
Score: 274253375
Rank: 19 of 38
Not good.
Not bad.
Definitely ugly.
Linkerrijtje (aka top half).
I would have liked to write that I’m happy with the result, but to be fair–I’m not. Just the fact that I can’t sleep and feel the need to write this in the middle of the night surely is an indication of this.Hashcode 2021: A lucky ridehttps://curiouscoding.nl/posts/hashcode-2021/Mon, 01 Mar 2021 00:00:00 +0100https://curiouscoding.nl/posts/hashcode-2021/Xrefs: Problem | Scoreboard | Codeforces announcement, this blog | Hacker News
Team: cat /dev/random | grep "to be or not to be"
Who: Jan-Willem Buurlage, Ragnar Groot Koerkamp, Timon Knigge, Abe Wits
Score: 10282641
Rank: 16
Since we did quite well, here is a write-up of our participation in Hashcode 2021.
Prep All four of us had previously participated in Hashcode, but this was the first time in the current composition.Abouthttps://curiouscoding.nl/pages/about/Mon, 01 Jan 0001 00:00:00 +0000https://curiouscoding.nl/pages/about/Hi there ;) I’m doing a PhD in bioinformatics at the BMI lab at ETH Zurich. I’m working on near-linear algorithms for exact pairwise alignment, low-density minimizer schemes, and generally on high throughput bioinformatics software.
This blog is where I dump my thoughts on my PhD research. It includes numerous short notes/remarks/ideas for research, and a few longer posts, some of which may turn into papers.
Feel free to use this blog as inspiration and build on the ideas you see here, as long as you cite appropriately.Readmehttps://curiouscoding.nl/readme/Mon, 01 Jan 0001 00:00:00 +0000https://curiouscoding.nl/readme/Research notes This repository contains the source of my blog: https://curiouscoding.nl.
Feel free to comment on the code or create an issue if you see something off.
This blog is written in Org, converted to markdown by ox-hugo and built using Hugo.
License All written text (i.e. everything rendered on my blog) is licensed under CC BY-SA 4.0.
The Hugo, ox-hugo, and org mode related source code (everything in the initial commit) are licensed under MIT.