CuriousCoding
https://curiouscoding.nl/
Recent content on CuriousCoding
Mon, 20 Mar 2023 00:00:00 +0100

DSB 2023
https://curiouscoding.nl/notes/dsb-2023/
Mon, 20 Mar 2023 00:00:00 +0100
These are notes for DSB 2023.
Day 1, Tuesday
Practical data structures for longest common extensions, Alexander Herlez
notes go here
Pan-genome de Bruijn graph using the bidirectional FM-index, Lore Depuydt
Indexing large metagenomics projects with abundances, Pierre Peterlongo
µ-PBWT: Enabling the Storage and Use of UK Biobank Data on a Commodity Laptop, Davide Cozzi
Genome-on-Diet: Taming Large-Scale Genomic Analyses via Sparsified Genomics, Mohammed Alser
Spectrum preserving tilings enable sparse and modular reference indexing, Giulio Ermanno Pibiri
Towards a lower-memory chunked graph data structure inspired by Minecraft, Fawaz Dabbaghie
Optimal Worst-Case Design of Gapped k-mer Masks, Sven Rahmann
Locality-Preserving Hashing of k-mers, Yoshihiro Shibuya
Space-efficient k-mer counting using an implicit sequence representation, Miika Leinonen
VeChat: correcting errors in long reads using variation graphs, Alexander Schönhuth
Fixing homopolymer errors in HiFi reads using dictionary compression, Diego Diaz-Dominguez
Orthanq: orthogonal evidence based haplotype quantification, Hamdiye Uzuner
Day 2, Wednesday
Random Wheeler graphs, Riccardo Maso
The Graph Wheelerization Problem, Davide Tonetto
Sorting Wheeler NFA’s using relational partition refinement, Bojana Kodric
Prefix-sorting strings on deterministic finite automata, Sung-Hwan Kim
MARIA: Multiple alignment r-index with aggregation, Adrián Goga
Approximate pattern matching using search schemes and in-text verification, Luca Renders
Chaining of maximal exact matches in graphs, Nicola Rizzo
RecGraph: recombination-aware alignment of sequences to variation graphs, Jorge Avila Cartes
Exact string alignments to (E)D-texts, Nadia Pisanti
Periodicity of degenerate strings, Pengfei Wang
Deriving polygenic risk score using non-negative matrix factorization, Vu Lam Dang
Identifying antimicrobial resistance gene transfer between plasmids, Marco Teixeira

Doctoral plan
https://curiouscoding.nl/notes/research-proposal/
Mon, 12 Dec 2022 00:00:00 +0100
Table of Contents Research Proposal: Near-linear exact 
pairwise alignment Abstract Introduction and current state of research in the field Goals of the thesis Impact Progress to date Detailed work plan WP1: A*PA v1: initial version WP2: Visualizing aligners WP3: Benchmarking aligners WP4: Theory review WP5: A*PA v2: efficient implementation WP6: Affine costs WP7: Ends-free alignment and mapping WP8: Further extension and open ended research WP9: Thesis writing Publication plan Time schedule Teaching responsibilities Other duties Study plan Signatures Research Proposal: Near-linear exact pairwise alignment Abstract Pairwise alignment and edit distance specifically is a problem that was first stated around 1968 (Needleman and Wunsch 1970; Vintsyuk 1968).One Year Of Rusthttps://curiouscoding.nl/notes/one-year-of-rust/Thu, 17 Nov 2022 00:00:00 +0100https://curiouscoding.nl/notes/one-year-of-rust/Table of Contents Thoughts and remarks Good Bad Story time – my programming language journey Lego mindstorms LabVIEW C++ Python Rust These are some draft notes that should turn into a post on my opinions on Rust after one year of using it.
Thoughts and remarks These pros and cons are mostly relative to C++, the language I used for the past ~10 years.
Good Sum types! Option and enum are so much nicer than optional and in particular variant.The complexity and performance of WFA and band doublinghttps://curiouscoding.nl/posts/wfa-edlib-perf/Thu, 17 Nov 2022 00:00:00 +0100https://curiouscoding.nl/posts/wfa-edlib-perf/Table of Contents Complexity analysis Complexity of edit distance Complexity of affine cost alignment Comparison Implementation efficiency Band doubling for affine scores was never implemented WFA vs band doubling for affine costs Conclusion Future work This note explores the complexity and performance of band doubling (Edlib) and WFA under varying cost models.
Edlib (Šošić and Šikić 2017) uses band doubling and runs in \(O(ns)\) time, for sequence length \(n\) and edit distance \(s\) between the two sequences.String algorithm visualizationshttps://curiouscoding.nl/posts/alg-viz/Tue, 08 Nov 2022 00:00:00 +0100https://curiouscoding.nl/posts/alg-viz/ Select the algorithm to visualize Click the buttons, or click the canvas and use the indicated keys Suffix-array construction is explained here and BWT is explained here.
Source code is on GitHub.
Algorithm Suffix Array Construction Burrows-Wheeler Transform Bidirectional BWT String Query prev (←/backspace) next (→/space) Delay (s) faster (↑/+/f) slower (↓/-/s) pause/play (p/return)Thoughts on linear programminghttps://curiouscoding.nl/notes/linear-programming/Fri, 04 Nov 2022 00:00:00 +0100https://curiouscoding.nl/notes/linear-programming/Table of Contents Linear programming Assumptions Idea for an algorithm This note contains some ideas about linear programming and most-orthogonal faces. They’re mostly on an intuitive level and not very formal.
Linear programming \begin{equation*} \newcommand{\v}[1]{\textbf{#1}} \newcommand{\x}{\v x} \newcommand{\t}{\v t} \newcommand{\b}{\v b} \end{equation*}
Maximize \(\t\x\) subject to \(A\x \leq \b\).
\(\x\) is a vector of \(n\) variables \(x_i\). \(A\) is an \(m\times n\) matrix: there are \(m\) constraints \(A_j \x \leq b_j\).

Local Doubling
https://curiouscoding.nl/posts/local-doubling/
Wed, 19 Oct 2022 00:00:00 +0200
Table of Contents Notation Needleman-Wunsch: where it all begins Dijkstra/BFS: visiting fewer states Band doubling: Dijkstra, but more efficient GapCost: A first heuristic Computational volumes: an even smaller search Cheating: an oracle gave us \(g^*\) A*: Better heuristics Broken idea: A* and computational volumes Local doubling Without heuristic With heuristic Diagonal Transition A* with Diagonal Transition and pruning: doing less work Goal: Diagonal Transition + pruning + local doubling Pruning: Improving A* heuristics on the go Cheating more: an oracle gave us the optimal path TODO: aspiration windows
\begin{equation*} \newcommand{\st}[2]{\langle #1,#2\rangle} \newcommand{\g}{g^*} \newcommand{\fm}{f_{max}} \newcommand{\gap}{\operatorname{Gap}} \end{equation*}

BWT and FM-index
https://curiouscoding.nl/notes/bwt/
Tue, 18 Oct 2022 00:00:00 +0200
Table of Contents Burrows-Wheeler Transformation (BWT) Last-to-first mapping (LF mapping) Pattern matching Visualization Bi-directional BWT
These are some notes about the Burrows-Wheeler Transform (BWT), FM-index, and variants.
See my post on the linear time suffix array construction algorithm for notation and terminology.
At the bottom you can find a visualization. This page has an interactive demo. The source code for the visualizations is in this GitHub repo.
Burrows-Wheeler Transformation (BWT) The BWT of a string \(S\) is generated as follows:A Combinatorial Identityhttps://curiouscoding.nl/notes/a-combinatorial-identity/Sun, 16 Oct 2022 00:00:00 +0200https://curiouscoding.nl/notes/a-combinatorial-identity/Some notes regarding the identity
\begin{equation} \sum_{k=0}^n \binom{2k}k \binom{2n-2k}{n-k} = 4^n \end{equation}
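Before looking at proofs, the identity is easy to sanity-check numerically. The following is a minimal sketch in Python; the function name `central_convolution` is made up for illustration and only the standard library is assumed.

```python
from math import comb


def central_convolution(n: int) -> int:
    """Left-hand side of the identity: sum over k of C(2k, k) * C(2n-2k, n-k)."""
    return sum(comb(2 * k, k) * comb(2 * n - 2 * k, n - k) for k in range(n + 1))


# Compare against the right-hand side 4^n for small n.
for n in range(20):
    assert central_convolution(n) == 4**n
```

This only checks small cases with exact integer arithmetic; it is not a substitute for the derivations.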
Gould has two derivations: The first, from Jensen's equality, (18) in (Jensen 1902; Shijie 1303).
A second via the Chu-Vandermonde convolution:
\begin{equation} \sum_{k=0}^n \binom{x}k \binom{y}{n-k} = \binom{x+y}n \end{equation}
using \(x=y=-\frac 12\) and using the $-\frac 12$-transform:
\begin{equation} \binom{-1/2}{n} = (-1)^n\binom{2n}{n}\frac 1 {2^{2n}} \end{equation}
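To spell out the second derivation, here is a sketch of how the substitution works out. It assumes that the Chu-Vandermonde convolution may be applied at the non-integer values \(x=y=-\frac 12\) (it is a polynomial identity in \(x\) and \(y\)), and it uses \(\binom{-1}{n} = (-1)^n\).

```latex
\begin{align*}
(-1)^n = \binom{-1}{n}
  &= \sum_{k=0}^n \binom{-1/2}{k} \binom{-1/2}{n-k} \\
  &= \sum_{k=0}^n (-1)^k \binom{2k}{k} \frac{1}{2^{2k}}
     \cdot (-1)^{n-k} \binom{2n-2k}{n-k} \frac{1}{2^{2n-2k}} \\
  &= \frac{(-1)^n}{4^n} \sum_{k=0}^n \binom{2k}{k} \binom{2n-2k}{n-k}.
\end{align*}
```

Multiplying both sides by \((-1)^n 4^n\) gives the identity.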
Duarte and de Oliveira (2012) have a combinatorial proof.
References
Duarte, Rui, and António Guedes de Oliveira.

Tensor embedding preserves Hamming distance
https://curiouscoding.nl/notes/tensor-embedding-distance/
Fri, 14 Oct 2022 00:00:00 +0200
Table of Contents Definitions Proof of Lemma 1 TODO Proof of Lemma 2
This is a proof that Tensor Embedding (Joudaki, Rätsch, and Kahles 2020) with $ℓ^2$-norm preserves the Hamming distance.
This is in collaboration with Amir Joudaki.
\begin{equation*} \newcommand{\I}{\mathcal I} \newcommand{\EE}{\mathbb E} \newcommand{\var}{\operatorname{Var}} \end{equation*}
Definitions Notation The alphabet is \(\Sigma\), of size \(|\Sigma| = \sigma\). The set of indices is \(\I := \{(i_1, \dots, i_t) \in [n]^t: i_1 < \dots < i_t\}\).Linear-time suffix array constructionhttps://curiouscoding.nl/notes/suffix-array-construction/Thu, 13 Oct 2022 00:00:00 +0200https://curiouscoding.nl/notes/suffix-array-construction/Table of Contents Notation Small and Large suffixes Building the suffix array from a smaller one Visualization These are some notes about linear time suffix array (SA) construction algorithms (SACA’s).
At the bottom you can find a visualization. This page has an interactive demo. History of suffix array construction algorithms:
1990 first algorithm: Manber and Myers (1993)
2002 small/large suffixes, explained below: Ko and Aluru (2005)
2009 recursion only on LMS suffixes: Nong, Zhang, and Chan (2009)
These slides from Stanford are a nice reference for the last algorithm.

Competitive Programming Lecture
https://curiouscoding.nl/notes/competitive-programming-lecture/
Wed, 28 Sep 2022 00:00:00 +0200
Table of Contents Contest strategies Pairwise Alignment using A* Exercises
Contest strategies
Preparation
Thinking costs energy!
Sleep enough; early to bed the 2 nights before.
No practising on contest day (and the day before); it just takes energy.
During the contest
Eat! At the very least take a break halfway with the entire team and eat some snacks.
Make sure to read all the problems before the end of the contest.

Reducing A* memory usage using fronts
https://curiouscoding.nl/notes/astar-memory-usage/
Mon, 26 Sep 2022 00:00:00 +0200
Table of Contents Motivation Partitioning A* memory by fronts Non-consistent heuristics Front indexing Tracing back the path
Here is an idea to reduce the memory usage of A* by only storing one front at a time, similar to what Edlib and WFA do. Note that for now this will not work, but I’m putting this online anyway.
Motivation
In our implementation of A*PA, we use a hashmap to store the value of \(g\) for all states visited (explored/expanded) by A*.

Speeding up A*: computational volumes and pre-pruning
https://curiouscoding.nl/posts/speeding-up-astar/
Fri, 23 Sep 2022 00:00:00 +0200
Table of Contents Motivation Summary Why is A* slow? Computational volumes Dealing with pruning Thoughts on more aggressive pruning Algorithm summary Challenges Results What about band-doubling? Maybe doubling can work after all? TODOs Extensions
This post builds on top of our recent preprint Groot Koerkamp and Ivanov (2022) and gives an overview of some of my new ideas to significantly speed up exact global pairwise alignment. It’s recommended you understand the seed heuristic and match pruning before reading this post.

Revised Oxford Bioinformatics latex template
https://curiouscoding.nl/notes/bioinformatics-template/
Thu, 22 Sep 2022 12:13:00 +0200
I made an improved version of the Oxford Bioinformatics latex template. See the GitHub repository.

Linear memory WFA?
https://curiouscoding.nl/posts/linear-memory-wfa/
Wed, 17 Aug 2022 00:00:00 +0200
Table of Contents Motivation Path traceback: two strategies Observations What information is needed for path tracing A pragmatic solution Another interpretation Affine costs Conclusion
Figure 1: Only the red substitutions and blue indel need to be stored to trace the entire path.
In this post I’ll discuss an idea to run WFA using less memory, while still allowing us to trace back the optimal path from the target state back to the start of the search.Transforming match bonus into costhttps://curiouscoding.nl/posts/alignment-scores-transform/Tue, 16 Aug 2022 00:00:00 +0200https://curiouscoding.nl/posts/alignment-scores-transform/Table of Contents Tricks with match bonus or how to fool Dijkstra’s limitations Edit graph Algorithms Potentials Multiple variants Some notes on algorithms WFA A* Extending to different cost models Affine costs Substitution matrices But not local alignment Evaluations Unequal string length Equal string lengths Conclusion Tricks with match bonus or how to fool Dijkstra’s limitations The reader is assumed to have basic knowledge about pairwise alignment and graph theory.Paper styleguidehttps://curiouscoding.nl/notes/styleguide/Sat, 06 Aug 2022 00:00:00 +0200https://curiouscoding.nl/notes/styleguide/Table of Contents Notation Naming and style This is a growing list of notation and style decisions Pesho and I made during the writing of our paper, written down so that we don’t have to spend time on it again next time.
Notation Alphabet \(\Sigma\), \(|\Sigma| = 4\) Sequences \(A = \overline{a_0\dots a_{n-1}} \in \Sigma^*\), \(|A| = n\) \(B = \overline{b_0\dots b_{m-1}} \in \Sigma^*\), \(|B| = m\) Edit distance \(\mathrm{ed}(A, B)\) \(A_{<i} = \overline{a_0\dots a_{i-1}}\) \(A_{\geq i} = \overline{a_i\dots a_{n-1}}\) \(A_{i\dots i’} = \overline{a_i\dots a_{i’-1}}\) Edit graph State \(\langle i, j\rangle\) Graph \(G(V, E)\) where \(V = \{\langle i,j\rangle | 0\leq i\leq n, 0\leq j\leq m\}\) Root state \(v_s = \langle 0,0\rangle\) Target state \(v_t = \langle n,m\rangle\) Distance \(d(u, v)\) Path \(\pi\) Shortest path \(\pi^*\) Cost of path \(cost(\pi)\), \(cost(\pi^*) = d(v_s, v_t) = \mathrm{ed}(A, B)\).Diamond optimisation for diagonal transitionhttps://curiouscoding.nl/posts/diamond-optimization/Mon, 01 Aug 2022 00:00:00 +0200https://curiouscoding.nl/posts/diamond-optimization/Table of Contents Diamond transition or how technicalities can break concepts But let’s take a closer look Conclusion Diamond transition or how technicalities can break concepts We assume the reader has some basic knowledge about pairwise alignment and in particular the WFA algorithm.
In this post we dive into a potential 2x speedup of WFA — one that turns out not to work.
Let’s take a look at one of the most important and efficient algorithms for pairwise alignment — WFA (Marco-Sola et al.Bidirectional A*https://curiouscoding.nl/notes/bidirectional-astar/Thu, 28 Jul 2022 17:59:00 +0200https://curiouscoding.nl/notes/bidirectional-astar/These are some links and papers on bidirectional A* variants. Nothing insightful at the moment.
small lecture introduces \(h_f(u) = \frac 12 (\pi_f(u) - \pi_r)\). I have not found a paper yet.
An Improved Bidirectional Heuristic Search Algorithm (Champeaux 1977) introduces a bidirectional variant
Bidirectional Heuristic Search Again (Champeaux 1983) fixes a bug in the above paper
Efficient modified bidirectional A* algorithm for optimal route-finding Didn’t read closely yet.
A new bidirectional algorithm for shortest paths (Pijls 2008) Actually a new method.

The BiWFA meeting condition
https://curiouscoding.nl/notes/biwfa-meeting-condition/
Mon, 11 Jul 2022 00:00:00 +0200
cross references: BiWFA GitHub issue
It seems that getting the meeting/overlap condition of BiWFA (Marco-Sola et al. (2022), Algorithm 1 and Lemma 2.1) correct is tricky.
Let \(p := \max(x, o+e)\) be the maximal cost of any edge in the edit graph. As in the BiWFA paper, let \(s_f\) and \(s_r\) be the distances of the forward and reverse fronts computed so far.
We prove the following lemma:
Lemma Once BiWFA has expanded the forward and reverse fronts up to \(s_f\) and \(s_r\) and has found some path of cost \(s \leq s_f + s_r\), expanding the fronts until \(s’_f + s’_r \geq s+p+o\) is guaranteed to find a shortest path.A* variantshttps://curiouscoding.nl/notes/astar-variants/Sun, 12 Jun 2022 12:04:00 +0200https://curiouscoding.nl/notes/astar-variants/These are some quick notes listing papers related to A* itself and variants. In particular, here I’m interested in papers that update \(h\) during the A* search, as a background for pruning.
Specifically, our version of pruning increases \(h\) during a single A* search, and in fact the heuristic becomes inadmissible after pruning.
Changing \(h\) The original A* paper has a proof of optimality. Later papers consider this also with heuristics that change their value over time.IGGSY 22 Slideshttps://curiouscoding.nl/notes/iggsy-presentation-slides/Sun, 12 Jun 2022 12:04:00 +0200https://curiouscoding.nl/notes/iggsy-presentation-slides/These are the slides Pesho Ivanov and I presented at IGGSY 2022 on Astarix and A*PA.
Drive: here
Pdf: hereBenchmark attention pointshttps://curiouscoding.nl/notes/benchmarks/Thu, 28 Apr 2022 23:33:00 +0200https://curiouscoding.nl/notes/benchmarks/Benchmarking is harder than you think, even when taking into account this rule.
This post lists some lessons I learned while attempting to run benchmarks for A* pairwise aligner. I was doing this on a laptop, which likely has different characteristics from CPUs in a typical server rack. All the programs I run are single threaded.
Hardware Do not run while charging the laptop Charging makes the battery hot and causes throttling.Motivationhttps://curiouscoding.nl/notes/motivation/Thu, 28 Apr 2022 23:22:00 +0200https://curiouscoding.nl/notes/motivation/It’s not the need for faster software that motivates; it’s the mathematical discovery that needs sharing.Proof sketch for linear time seed heuristic alignmenthttps://curiouscoding.nl/notes/linear-time-pa/Sun, 24 Apr 2022 00:00:00 +0200https://curiouscoding.nl/notes/linear-time-pa/Table of Contents Pairwise alignment in subquadratic time Random model Algorithm Seed heuristic Match pruning Analysis Expanded states Excess errors Algorithmic complexity This post is a proof sketch to show that A* with the seed heuristic (Groot Koerkamp and Ivanov 2022) does exact pairwise alignment of random strings with random mutations in near linear time.
Pairwise alignment in subquadratic time
Backurs and Indyk (2018) show that computing edit distance cannot be done in strongly subquadratic time (i.

Variations on the WFA recursion
https://curiouscoding.nl/posts/wfa-variations/
Sun, 17 Apr 2022 03:14:00 +0200
Table of Contents Gap open Gap close Symmetric alternatives Another symmetry Conclusions
cross references: BiWFA GitHub issue
In this post I will explore some variations of the recursion used by WFA/BiWFA for the affine version of the diagonal transition algorithm. In particular, we will go over a gap-close variant, and look into some more symmetric formulations.
Gap open WFA (Marco-Sola et al. 2020) introduces the affine cost variant of the classic diagonal transition method.Ongoing and future researchhttps://curiouscoding.nl/pages/todo/Fri, 15 Apr 2022 00:00:00 +0200https://curiouscoding.nl/pages/todo/Table of Contents In progress On hold Pending ideas/blogposts Smaller tasks Future plans Open questions Here I list projects that I’m currently working on, and ideas for future work.
In progress A* pairwise aligner [GitHub] Exact global pairwise alignment of random strings in expected linear time. Contains proof of correctness, implementation, evals and comparison with WFA and edlib on random data.
Proof of expected linear time alignment I have a proof of concept to show that a simplified version of the algorithm currently implemented by A* pairwise aligner runs in expected linear time on random input with sufficiently low edit distance (\(|\Sigma|^{1/e} \ll n\)), but need to spend some time on details and writing it down.Publicationshttps://curiouscoding.nl/pages/publications/Fri, 15 Apr 2022 00:00:00 +0200https://curiouscoding.nl/pages/publications/ References Groot Koerkamp, Ragnar, and Pesho Ivanov. 2022. “Exact Global Alignment Using A* with Seed Heuristic and Match Pruning,” September. https://doi.org/10.1101/2022.09.19.508631. Groot Koerkamp, Ragnar, and Marieke van der Wegen. 2019. “Stable gonality is computable.” Discrete Mathematics & Theoretical Computer Science vol. 21 no. 1, ICGT 2018 (June). https://doi.org/10.23638/DMTCS-21-1-10. Groot Koerkamp, Ragnar, and Stanislav Živný. 2021. “On Rainbow-Free Colourings of Uniform Hypergraphs.” Theoretical Computer Science 885 (September): 69–76. https://doi.org/10.1016/j.tcs.2021.06.022.Glossaryhttps://curiouscoding.nl/pages/glossary/Thu, 14 Apr 2022 00:00:00 +0200https://curiouscoding.nl/pages/glossary/This is a growing list of ambiguous terms and their definitions. More of a place to store random remarks than a complete reference for now.
diagonal transition: name introduced by Navarro (2001)
approximate
approximate algorithm: an algorithm that does not always give the correct answer.
$k$-approximate string matching: variant of semi-global alignment where we find all matches of a pattern in a reference with at most \(k\) mistakes.
Also approximate string matching: alternative name for global pairwise alignment.A review of exact global pairwise alignmenthttps://curiouscoding.nl/posts/pairwise-alignment/Fri, 01 Apr 2022 00:00:00 +0200https://curiouscoding.nl/posts/pairwise-alignment/Table of Contents Variants of pairwise alignment Cost models Alignment types A chronological overview of global pairwise alignment Algorithms in detail Classic DP algorithms Cubic algorithm of Needleman and Wunsch (1970) A quadratic DP Local alignment Affine costs Minimizing vs. maximizing duality Four Russians method TODO \(O(ns)\) methods TODO Exponential search on band TODO LCS: thresholds, $k$-candidates and contours TODO Diagonal transition: furthest reaching and wavefronts TODO Suffixtree for \(O(n+s^2)\) expected runtime Using less memory Computing the score in linear space Divide-and-conquer TODO LCSk[++] algorithms Theoretical lower bound TODO A note on DP (toposort) vs Dijkstra vs A* TODO Tools TODO Notes for other posts Semi-global alignment papers Approximate pairwise aligners Old vs new papers Note: This is a living document, and will likely remain so for a while.Pruning for A* heuristicshttps://curiouscoding.nl/notes/pruning/Sat, 11 Dec 2021 00:00:00 +0100https://curiouscoding.nl/notes/pruning/Note: this post extends the concept of multiple-path pruning presented in Poole and Mackworth (2017).
Say we’re running A* in a graph from \(s\) to \(t\). \(d(s,t)\) is the distance we are looking for.
An A* heuristic has to satisfy \(h(u) \leq d(u, t)\) to be admissible: the estimated distance to the end should never be larger than the actual distance to guarantee that the algorithm finds a shortest path.AStarixhttps://curiouscoding.nl/notes/astarix/Fri, 12 Nov 2021 13:05:00 +0100https://curiouscoding.nl/notes/astarix/Papers
AStarix: Fast and Optimal Sequence-to-Graph Alignment Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds AStarix is a method for aligning sequences (reads) to graphs:
Input A reference sequence or graph Alignment costs \((\Delta_{match}, \Delta_{subst}, \Delta_{del}, \Delta_{ins})\) for a match, substitution, insertion and deletion Sequence(s) to align Output An optimal alignment of each input sequence The input is a reference graph (automaton really) \(G_r = (V_r, E_r)\) with edges \(E_r \subseteq V_r\times V_r\times \Sigma\) that indicate the transitions between states.Neighbor joininghttps://curiouscoding.nl/notes/neighbor-joining/Fri, 12 Nov 2021 11:57:00 +0100https://curiouscoding.nl/notes/neighbor-joining/Neighbor joining (NJ, paper) is a phylogeny reconstruction method. It differs from UPGMA in the way it computes the distances between clusters.
This algorithm first assumes that the phylogeny is a star graph. Then it finds the pair of vertices that when merged and split out gives the minimal total edge length \(S_{i,j}\) of the new almost-star graph. (See eq. (4) and figure 2a and 2b in the paper.)
\[ S_{i,j} = \frac1{2(n-2)} \sum_{k\not\in \{i,j\}}(d(i, k)+d(j,k)) + \frac 12 d(i,j)+\frac 1{n-2} \sum_{k<l,\, k, l\not\in\{i,j\}}d(k,l). \]

UPGMA
https://curiouscoding.nl/notes/upgma/
Thu, 28 Oct 2021 11:56:00 +0200
Unweighted pair group method with arithmetic mean (UPGMA) is a phylogeny reconstruction method.
Input: Matrix of pairwise distances
Output: Phylogeny
Algorithm: Repeatedly merge the nearest two clusters. The distance between clusters is the average of all pairwise distances between them. When merging two clusters, the distances of the new cluster are the weighted averages of distances from the two clusters being merged.
Complexity: \(O(n^3)\) naive, \(O(n^2 \ln n)\) using a heap.

RTFE
https://curiouscoding.nl/notes/rfte/
Fri, 22 Oct 2021 15:16:00 +0200
Read The F*ing Error
When you complain about an error without reading it first.
When you assume you understand the problem halfway through reading the error, and only after more debugging you realize you failed to read properly.

1st law of Procrastination
https://curiouscoding.nl/notes/procrastination/
Fri, 22 Oct 2021 11:46:00 +0200
Important deadlines require important procrastination.

Data should be reviewed
https://curiouscoding.nl/notes/data-should-be-reviewed/
Fri, 22 Oct 2021 11:41:00 +0200
Experiments and their analysis should be reproducible, and all data/figures in a paper should be reviewable. Pipelines (e.g. snakemake files) to generate them should be attached to the paper.
I’ve asked for automated scripts to reproduce test data on 3+ GitHub repositories now, and got a satisfactory answer zero times:
WFA: https://github.com/smarco/WFA/issues/26
Link to a datadump on the block-aligner repository. Good to have actual data, but exactly how this data was created is unclear to me.Spaced K-mer Seeded Distancehttps://curiouscoding.nl/posts/spaced-kmer-distance/Wed, 20 Oct 2021 00:00:00 +0200https://curiouscoding.nl/posts/spaced-kmer-distance/Table of Contents Background $k$-mers Sketching MinHash Terminology Introduction Spaced $k$-mer Seeded Distance Improving performance Analysis Pruning false positive candidate matches Phylogeny reconstruction Running the algorithm TODO Assembly \[ \newcommand{\vp}{\varphi} \newcommand{\A}{\mathcal A} \newcommand{\O}{\mathcal O} \newcommand{\N}{\mathbb N} \newcommand{\ed}{\mathrm{ed}} \newcommand{\mh}{\mathrm{mh}} \newcommand{\hash}{\mathrm{hash}} \]
Background Quickly finding similar pieces of DNA within large datasets is at the core of computational biology. This has many applications:
Alignment: Given two pieces of related DNA, align them to find where mutations (i.Open Sciencehttps://curiouscoding.nl/posts/open-science/Tue, 19 Oct 2021 00:00:00 +0200https://curiouscoding.nl/posts/open-science/Let’s go over some reasons for why I’m writing this blog.
The internet is more accessible than papers The inspiration for this blog is the post on Succinct de Bruijn Graphs by Alex Bowe. I think blog posts are a great way to quickly learn about new ideas and concepts, since they are usually more accessible than papers. A blog post can omit some of the more formal text required in papers and spend more time explaining things on an intuitive level.Hugo and ox-hugohttps://curiouscoding.nl/notes/hugo/Thu, 14 Oct 2021 00:00:00 +0200https://curiouscoding.nl/notes/hugo/Here’s the customary how I made this site using X post.
This site is built using Hugo and ox-hugo.
The source is written in Org mode, which is converted to markdown by ox-hugo. To get started yourself, check out the initial commit of the source repository and build from there.
Some notes:
I’m using the Hugo-coder theme. Since the conversion from Org to markdown is done using an Emacs plugin, the emacs folder contains a simple init.Hello, World!https://curiouscoding.nl/notes/hello-world/Wed, 13 Oct 2021 00:00:00 +0200https://curiouscoding.nl/notes/hello-world/print("Hello, World!") std::cout << "Hello, World!" << std::endl;Spaced k-mer and assembler methodshttps://curiouscoding.nl/notes/spaced-kmer-review/Wed, 14 Jul 2021 00:00:00 +0200https://curiouscoding.nl/notes/spaced-kmer-review/Table of Contents Spaced \(k\)-mers Minimap SPAdes MUMmer4 BLASR Bowtie 2 Patternhunter Spaced seeds improve \(k\)-mer-based metagenomic classification LoMeX Meeting notes Concepts:
Mapping Map a sequence onto a reference genome/dataset Assembly Build a genome from a set of reads de novo (implied): without using a reference genome Otherwise just called mapping Typical complicating factors:
read errors non-uniform coverage insert size variation chimeric reads (?) bireads non-uniform read coverage (as in metagenomics, i.Ideas for assembling [long] readshttps://curiouscoding.nl/posts/thoughts-on-assembling/Fri, 09 Jul 2021 00:00:00 +0200https://curiouscoding.nl/posts/thoughts-on-assembling/\[ \newcommand{\vp}{\varphi} \newcommand{\A}{\mathcal A} \newcommand{\O}{\mathcal O} \newcommand{\N}{\mathbb N} \newcommand{\Z}{\mathbb Z} \newcommand{\ed}{\mathrm{ed}} \newcommand{\mh}{\mathrm{mh}} \newcommand{\hash}{\mathrm{hash}} \]
Here is an idea for an algorithm to assemble long reads.
Go over all sequences and sketch their windows using the Hamming distance preserving sketch method described here. This method may need some tweaking to also work with an indel rate of around 10%.
Let’s say we find a pair of matching windows between reads \(A\) and \(B\) starting at positions \(i\) and \(j\).Hamming Similarity Searchhttps://curiouscoding.nl/posts/hamming-similarity-search/Thu, 08 Jul 2021 00:00:00 +0200https://curiouscoding.nl/posts/hamming-similarity-search/Table of Contents Background $k$-mers Sketching MinHash Introduction Hamming Similarity Search Improving performance Analysis Pruning false positive candidate matches Phylogeny reconstruction Running the algorithm Assembly \[ \newcommand{\vp}{\varphi} \newcommand{\A}{\mathcal A} \newcommand{\O}{\mathcal O} \newcommand{\N}{\mathbb N} \newcommand{\ed}{\mathrm{ed}} \newcommand{\mh}{\mathrm{mh}} \newcommand{\hash}{\mathrm{hash}} \]
Background Quickly finding similar pieces of DNA within large datasets is at the core of computational biology. This has many applications:
Alignment: Given two pieces of related DNA, align them to find where mutations (i.Detached fullscreen in Swayhttps://curiouscoding.nl/linux/sway-fullscreen/Fri, 02 Jul 2021 00:00:00 +0200https://curiouscoding.nl/linux/sway-fullscreen/Xrefs: PR for Sway | AUR package sway-inhibit-fullscreen-git
Once upon a time, Chromium had a bug where using $mod+f in i3 to fullscreen the Chromium window changed the window to occupy the entire screen, but didn’t actually make Chromium enter full screen mode. According to some, those were the good days. Watching 4 YouTube streams in parallel was still possible, back in those days:
Without patches, the best we can do nowadays is the following

Open source contributions
https://curiouscoding.nl/linux/oss-contributions/
Fri, 02 Jul 2021 00:00:00 +0200
Table of Contents My aur packages Some issues I reported/fixed
My aur packages
List on aur.archlinux.org
bapctools-git: BAPCtools is used for developing ICPC style programming contest problems.
feh-preload-next-image-git: Branch of Feh that loads the next image to speed up browsing images in a remote directory.
i3-focus-last-git: Window switcher for i3/sway.
python-pyexiftool-nocheck: the original python-pyexiftool is outdated, orphaned, and still depends on python2.
sway-inhibit-fullscreen-git: Sway branch that adds the inhibit_fullscreen toggle command.

Powersearch with Vimium
https://curiouscoding.nl/linux/vimium/
Fri, 02 Jul 2021 00:00:00 +0200
Related posts: Dark mode with Vimium
Vimium (GitHub, Chromium extension) is not only a great way to navigate webpages; it’s also a great help for quickly searching many websites.
I use it many times a day to search for just the documentation I need. Some of the search engines I have configured:
# Documentation
archwiki: https://wiki.archlinux.org/index.php?search=%s ArchWiki
aur: https://aur.archlinux.org/packages/?K=%s AUR
cpp: https://en.cppreference.com/mwiki/index.php?search=%s CppReference
github: https://github.com/search?q=%s GitHub
hoogle: https://www.haskell.org/hoogle/?hoogle=%s Hoogle
oeis: https://oeis.

Wayland utilities
https://curiouscoding.nl/linux/wayland/
Fri, 02 Jul 2021 00:00:00 +0200
This post goes over some useful utilities I have been using on my Wayland system.
Screen brightness: light
Light is a nice tool to manage screen and keyboard brightness.
Install light
Add your user to the video group: usermod -aG video <user>
I really like the light -T flag, which multiplies the current brightness by some value. This way you can have fine-grained control at both very low and very high brightness values.

Browsing in the dark with Vimium and Dark Reader
https://curiouscoding.nl/linux/dark-mode/
Thu, 01 Jul 2021 00:00:00 +0200
Table of Contents Chromium theme Dark Reader Vimium
Let’s quickly go over some settings you can change for a better dark mode experience in Chromium.
Chromium theme
First of all, you can make Chromium itself use a dark theme. This will ensure both a dark tab bar and nice dark settings pages. As explained here, you’ll need to change the following:
Run chromium with the flags
chromium --enable-features=WebUIDarkMode --force-dark-mode
If you are already using other feature flags, they can be comma separated:

Window switching in Sway
https://curiouscoding.nl/linux/sway-window-switching/
Thu, 01 Jul 2021 00:00:00 +0200
Sway has many commands for switching the active workspace and focused window. However, I find that most of my window switching comes down to a few simple commands that focus a specific application, or open it first when it has no open windows yet. E.g.:
$mod+s: open and/or focus slack
$mod+i: open and/or focus signal
$mod+m: open and/or focus emacs
$mod+c: open and/or focus chromium
In addition to this, some apps like emacs have a separate $mod+Shift+m command that always opens a new window/instance.

Clean your homedir with XDG Base Dir
https://curiouscoding.nl/linux/xdg-base-dir/
Wed, 30 Jun 2021 00:00:00 +0200
Xrefs: XDG specification | ArchWiki | Reddit post
In case you are, like me, tired of applications polluting your homedir with config and data files, the XDG Base Directory Specification (ArchWiki) has your back.
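As a rough sketch of what the specification prescribes (a reading of the spec, not code from any particular application): each base directory is taken from an environment variable, and a default under the home directory is used when the variable is unset, empty, or not an absolute path. The helper names `xdg_dir` and `app_config_dir` are made up for this illustration.

```python
import os
from pathlib import Path

# Defaults from the XDG Base Directory Specification, relative to $HOME.
XDG_DEFAULTS = {
    "XDG_CONFIG_HOME": ".config",
    "XDG_DATA_HOME": ".local/share",
    "XDG_STATE_HOME": ".local/state",
    "XDG_CACHE_HOME": ".cache",
}


def xdg_dir(variable, env=None, home=None):
    """Return the base directory for `variable`, falling back to the spec default."""
    env = os.environ if env is None else env
    home = Path(home) if home is not None else Path.home()
    value = env.get(variable, "")
    # Unset, empty, or relative values are ignored in favour of the default.
    if value and Path(value).is_absolute():
        return Path(value)
    return home / XDG_DEFAULTS[variable]


def app_config_dir(app, env=None, home=None):
    """Hypothetical helper: config directory of an XDG-aware application `app`."""
    return xdg_dir("XDG_CONFIG_HOME", env, home) / app
```

For example, `app_config_dir("emacs")` resolves to `~/.config/emacs` on a system where `XDG_CONFIG_HOME` is not set.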
You have probably seen the ~/.config directory already, and in fact, many programs can be told to use this directory instead of polluting your homedir. The ArchWiki page has a list of many applications and which environment variables need to be set to change the location of their configuration.

Emacs Doom
https://curiouscoding.nl/linux/emacs/
Wed, 30 Jun 2021 00:00:00 +0200
Table of Contents Configuration init.el config.el Running as server and client Wayland Useful commands Emacs as mail client
Install Doom Emacs as explained in the readme.
Alongside it, you’ll want to install ripgrep and fd for better search integration, and possibly ttf-font-awesome for better icons.
Configuration
Instead of the default ~/.emacs.d/ and ~/.doom.d/ config directories, you can also use ~/.config/emacs/ and ~/.config/doom/.
init.el
My init.el is mostly default, and enables the languages I regularly use, with LSP support where possible:

Environment variables done once
https://curiouscoding.nl/linux/environment-variables/
Wed, 30 Jun 2021 00:00:00 +0200
Xrefs: GitHub issue
One problem I had with my Sway setup is that setting environment variables in my config.fish (the Fish equivalent to .bashrc or .zshrc) is not always sufficient.
In particular, I need my environment variables to be available in at least the following places:
my Fish shell,
applications launched from Sway (e.g. using keybindings),
applications launched as a systemd service (e.g. the Emacs server daemon).
Setting variables in the shell profile has the problem that they are not picked up by systemd services.

28000x speedup with Numba.CUDA
https://curiouscoding.nl/posts/numba-cuda-speedup/
Mon, 24 May 2021 00:00:00 +0200
Table of Contents CUDA Overview Profiling Optimizing Tensor Sketch CPU code V0: Original python code V1: Numba V2: Multithreading GPU code V3: A first GPU version V4: Parallel kernel invocations V5: Single kernel with many blocks V6: Detailed profiling: Kernel Compute V7: Detailed profiling: Kernel Latency V8: Detailed profiling: Shared Memory Access Pattern V9: More work per thread V10: Cache seq to shared memory V11: Hashes and signs in shared memory V12: Revisiting blocks per kernel V13: Passing a tuple of sequences V14: Better hardware V15: Dynamic shared memory Wrap up
Xrefs: r/CUDA, Numba discourse

X1 Extreme Gen 3 - Migrating to Wayland
https://curiouscoding.nl/linux/x1e3/
Sun, 16 May 2021 00:00:00 +0200
I got a new laptop, so this felt like the right time to migrate to Wayland.
Delta
what | before | after
hardware
laptop | Asus UX501V | Lenovo X1 Extreme Gen 3
CPU | i7-6700HQ | i7-10750H
GPU | GTX 960M | GTX 1650
RAM | 16GB | 64GB
OS
bootloader | Grub | EFISTUB
OS | Windows + Arch dualboot | Windows + Arch dualboot
networking | netctl | systemd-networkd
dns/dhcp | dhcpcd | systemd-resolved
wifi | wpa_supplicant | iwd
Wayland
display/login manager | - | -
display server | X | Wayland
window manager | i3 | Sway
bar | i3blocks | waybar
backlight | xbacklight | light
night mode | redshift | gammastep
clipboard | - | wl-clipboard, clipman
program launcher | rofi | rofi [wayland]
password finder | rofi-pass | rofi-pass-git
key remapping | setxkbmap, xcape, xmodmap | interception-tools
Tools
terminal emulator | urxvt | foot
shell | zsh | fish
shell highlighting | zsh-syntax-highlight | -
environment variables .

SE Endurance: Early game
https://curiouscoding.nl/misc/factorio-early-game/
Mon, 26 Apr 2021 00:00:00 +0200
Xrefs: Reddit
This is the start of a series of posts on our (philae, winston) playthrough of Factorio with the Space Exploration mod.
After lots of struggling, we recently finished our first SE world after 624 in-game hours. Since this was also our first/second Factorio world, the start was very inefficient and we learned a lot of things along the way. In this new map, which we call Endurance (after the interplanetary spaceship in Interstellar), we will apply what we learned, and share it with the world :)

Hashcode 2021 Finals
https://curiouscoding.nl/misc/hashcode-2021-finals/
Sat, 24 Apr 2021 00:00:00 +0200
Xrefs: Problem | Scoreboard
Team: cat /dev/random | grep "to be or not to be"
Who: Jan-Willem Buurlage, Ragnar Groot Koerkamp, Timon Knigge, Abe Wits
Score: 274253375
Rank: 19 of 38
Not good.
Not bad.
Definitely ugly.
Linkerrijtje (aka top half).
I would have liked to write that I’m happy with the result, but to be fair, I’m not. Just the fact that I can’t sleep and feel the need to write this in the middle of the night is surely an indication of this.

Hashcode 2021: A lucky ride
https://curiouscoding.nl/misc/hashcode-2021/
Mon, 01 Mar 2021 00:00:00 +0100
Xrefs: Problem | Scoreboard | Codeforces announcement, this blog | Hacker News
Team: cat /dev/random | grep "to be or not to be"
Who: Jan-Willem Buurlage, Ragnar Groot Koerkamp, Timon Knigge, Abe Wits
Score: 10282641
Rank: 16
Since we did quite well, here is a write-up of our participation in Hashcode 2021.
Prep
All four of us had previously participated in Hashcode, but this was the first time in the current composition.

About
https://curiouscoding.nl/about/
Mon, 01 Jan 0001 00:00:00 +0000
Hi there ;) I’m doing a PhD in bioinformatics at the BMI lab at ETH Zurich. Currently I’m working on near-linear algorithms for exact pairwise alignment.
This blog is where I dump my thoughts on my PhD research. For now it includes some short notes/remarks/ideas for research, and a few longer posts that may eventually turn into papers.
Feel free to use this blog as inspiration and build on the ideas you see here, as long as you cite appropriately.

Readme
https://curiouscoding.nl/readme/
Mon, 01 Jan 0001 00:00:00 +0000
Research notes
This repository contains the source of my blog: https://curiouscoding.nl.
Feel free to comment on the code or create an issue if you see something off.
This blog is written in Org, converted to markdown by ox-hugo and built using Hugo.
License
All written text (i.e. everything rendered on my blog) is licensed under CC BY-SA 4.0.
The Hugo, ox-hugo, and Org mode related source code (everything in the initial commit) is licensed under MIT.