Pairwise-Alignment on CuriousCoding
https://curiouscoding.nl/tags/pairwise-alignment/
Recent content in Pairwise-Alignment on CuriousCoding
Sat, 23 Mar 2024 00:00:00 +0100

A*PA2: Up to 19x faster exact global alignment
https://curiouscoding.nl/posts/astarpa2/
Sat, 23 Mar 2024 00:00:00 +0100
Table of Contents Abstract 1 Introduction 1.1 Contributions 1.2 Previous work 1.2.1 Needleman-Wunsch 1.2.2 Graph algorithms 1.2.3 Computational volumes 1.2.4 Parallelism 1.2.5 Tools 2 Preliminaries 3 Methods 3.1 Band-doubling 3.2 Blocks 3.3 Memory 3.4 SIMD 3.5 SIMD-friendly sequence profile 3.6 Traceback 3.7 A* 3.7.1 Bulk-contours update 3.7.2 Pre-pruning 3.8 Determining the rows to compute 3.8.1 Sparse heuristic invocation 3.9 Incremental doubling 4 Results 4.1 Setup 4.2 Comparison with other aligners 4.

A*PA talk @ CWI
https://curiouscoding.nl/posts/astarpa-talk-cwi/
Wed, 27 Dec 2023 00:00:00 +0100
I recently gave a talk about A*PA at CWI. Sadly the recording doesn’t show the blackboard, but either way, find it here.

BitPAl bitpacking algorithm
https://curiouscoding.nl/posts/bitpal/
Sun, 03 Sep 2023 00:00:00 +0200
Table of Contents Problem Input Example Discussion Found the bug Outlook
The supplement (download) of the Loving, Hernandez, and Benson (2014) paper introduces a \(15\)-operation version of the Myers (1999) bitpacking algorithm, which uses \(16\) operations when modified for edit distance.
I tried implementing it, but it seems to have a bug that I will describe below. The fix is here.
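For readers who have not seen the bitpacking approach before, here is a minimal Python sketch of the standard bit-vector edit distance recurrence, in the widely used form of the Myers (1999) algorithm adapted to global edit distance (usually attributed to Hyyrö). It is not the BitPAl variant and not the code from this post. The function name `myers_edit_distance` is illustrative, and Python's unbounded integers stand in for a machine word, so the whole column of `len(a)` cells is packed into a single "word" here.

```python
def myers_edit_distance(a: str, b: str) -> int:
    """Unit-cost global edit distance via bit-parallel vertical deltas.

    Assumption: one Python integer holds all len(a) bits, so no blocking
    into w-bit words is needed (a real implementation would need that).
    """
    m = len(a)
    if m == 0:
        return len(b)
    mask = (1 << m) - 1
    high = 1 << (m - 1)  # bit of the last row, where the score is tracked

    # peq[c] has bit i set iff a[i] == c.
    peq = {}
    for i, c in enumerate(a):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv = mask, 0  # vertical deltas: first column is 0, 1, ..., m
    score = m
    for c in b:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = (mv | ~(xh | pv)) & mask  # horizontal delta +1
        mh = pv & xh                   # horizontal delta -1
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Shift in a 1: the top row of the matrix increases by 1 per column.
        ph = (ph << 1) | 1
        mh = mh << 1
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv & mask
    return score
```

Calling `myers_edit_distance("kitten", "sitting")` computes the distance one column at a time, with a constant number of word operations per character of `b`.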
Problem To recap, this algorithm solves the unit-cost edit distance problem by using bitpacking to compute a \(1\times w\) block at a time.

Shortest paths, bucket queues, and A* on the edit graph
https://curiouscoding.nl/posts/shortest_path_history/
Sat, 29 Jul 2023 00:00:00 +0200
Table of Contents Shortest path algorithms .. .. in general .. for circuit design Bucket queues Shortest path algorithms by Hadlock Grid graphs Strings Spouge’s computational volumes
This note summarizes some papers I was reading while investigating the history of A* for pairwise alignment, and, related to that, the first usage of a bucket queue. Schrijver (2012) provides a nice overview of general shortest path methods.
Shortest path algorithms .

The complexity and performance of WFA and band doubling
https://curiouscoding.nl/posts/wfa-edlib-perf/
Thu, 17 Nov 2022 00:00:00 +0100
Table of Contents Complexity analysis Complexity of edit distance Complexity of affine cost alignment Comparison Implementation efficiency Band doubling for affine scores was never implemented WFA vs band doubling for affine costs Conclusion Future work
This note explores the complexity and performance of band doubling (Edlib) and WFA under varying cost models.
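As a rough illustration of the band-doubling idea named in this summary, the following Python sketch computes unit-cost edit distance by repeatedly running a banded DP with a doubled threshold. It is only a sketch under stated assumptions: Edlib itself uses a bit-parallel implementation, and the names `banded_distance` and `band_doubling` as well as the starting threshold are hypothetical choices made for this example.

```python
def banded_distance(a: str, b: str, t: int):
    """Return the edit distance if it is at most t, else None.

    Only cells (i, j) with |i - j| <= t are computed, so one call
    does roughly O(len(a) * t) work.
    """
    n, m = len(a), len(b)
    if abs(n - m) > t:
        return None
    inf = n + m + 1  # larger than any possible distance
    prev = {j: j for j in range(min(m, t) + 1)}  # row 0, inside the band
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - t), min(m, i + t) + 1):
            if j == 0:
                cur[j] = i
            else:
                sub = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
                dele = prev.get(j, inf) + 1
                ins = cur.get(j - 1, inf) + 1
                cur[j] = min(sub, dele, ins)
        prev = cur
    d = prev.get(m, inf)
    return d if d <= t else None


def band_doubling(a: str, b: str) -> int:
    """Double the threshold until the banded DP certifies the distance."""
    t = max(1, abs(len(a) - len(b)))  # hypothetical starting threshold
    while True:
        d = banded_distance(a, b, t)
        if d is not None:
            return d
        t *= 2
```

Because the threshold doubles, the last call dominates the total work, which is where a bound of the form \(O(ns)\) comes from.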
Edlib (Šošić and Šikić 2017) uses band doubling and runs in \(O(ns)\) time, for sequence length \(n\) and edit distance \(s\) between the two sequences.

Local Doubling
https://curiouscoding.nl/posts/local-doubling/
Wed, 19 Oct 2022 00:00:00 +0200
Table of Contents Notation Needleman-Wunsch: where it all begins Dijkstra/BFS: visiting fewer states Band doubling: Dijkstra, but more efficient GapCost: A first heuristic Computational volumes: an even smaller search Cheating: an oracle gave us \(g^*\) A*: Better heuristics Broken idea: A* and computational volumes Local doubling Without heuristic With heuristic Diagonal Transition A* with Diagonal Transition and pruning: doing less work Goal: Diagonal Transition + pruning + local doubling Pruning: Improving A* heuristics on the go Cheating more: an oracle gave us the optimal path TODO: aspiration windows
\begin{equation*} \newcommand{\st}[2]{\langle #1,#2\rangle} \newcommand{\g}{g^*} \newcommand{\fm}{f_{max}} \newcommand{\gap}{\operatorname{Gap}} \end{equation*}

Competitive Programming Lecture
https://curiouscoding.nl/posts/competitive-programming-lecture/
Wed, 28 Sep 2022 00:00:00 +0200
Table of Contents Contest strategies Pairwise Alignment using A* Exercises
Contest strategies Preparation Thinking costs energy! Sleep enough; early to bed the 2 nights before. No practising on contest day (and the day before); it just takes energy. During the contest Eat! At the very least take a break halfway with the entire team and eat some snacks. Make sure to read all the problems before the end of the contest.

Speeding up A*: computational volumes and path-pruning
https://curiouscoding.nl/posts/speeding-up-astar/
Fri, 23 Sep 2022 00:00:00 +0200
Table of Contents Motivation Summary Why is A* slow? 
Computational volumes Dealing with pruning Thoughts on more aggressive pruning Algorithm summary Challenges Results What about band-doubling? Maybe doubling can work after all? TODOs Extensions
This post builds on our recent preprint (Groot Koerkamp and Ivanov 2024) and gives an overview of some of my new ideas to significantly speed up exact global pairwise alignment. It’s recommended that you understand the seed heuristic and match pruning before reading this post.

Linear memory WFA?
https://curiouscoding.nl/posts/linear-memory-wfa/
Wed, 17 Aug 2022 00:00:00 +0200
Table of Contents Motivation Path traceback: two strategies Observations What information is needed for path tracing A pragmatic solution Another interpretation Affine costs Conclusion
Figure 1: Only the red substitutions and blue indel need to be stored to trace the entire path.
In this post I’ll discuss an idea to run WFA using less memory, while still allowing us to trace the optimal path from the target state back to the start of the search.

Transforming match bonus into cost
https://curiouscoding.nl/posts/alignment-scores-transform/
Tue, 16 Aug 2022 00:00:00 +0200
Table of Contents Tricks with match bonus or how to fool Dijkstra’s limitations Edit graph Algorithms Potentials Multiple variants Some notes on algorithms WFA A* Extending to different cost models Affine costs Substitution matrices But not local alignment Evaluations Unequal string length Equal string lengths Conclusion
Tricks with match bonus or how to fool Dijkstra’s limitations The reader is assumed to have basic knowledge about pairwise alignment and graph theory.

Diamond optimisation for diagonal transition
https://curiouscoding.nl/posts/diamond-optimization/
Mon, 01 Aug 2022 00:00:00 +0200
Table of Contents Diamond transition or how technicalities can break concepts But let’s take a closer look Conclusion
Diamond transition or how technicalities can break concepts We assume the reader has some basic knowledge about pairwise alignment and in particular the WFA algorithm.
In this post we dive into a potential 2x speedup of WFA — one that turns out not to work.
Let’s take a look at one of the most important and efficient algorithms for pairwise alignment: WFA (Marco-Sola et al.

The BiWFA meeting condition
https://curiouscoding.nl/posts/biwfa-meeting-condition/
Mon, 11 Jul 2022 00:00:00 +0200
cross references: BiWFA GitHub issue
It seems that getting the meeting/overlap condition of BiWFA (Marco-Sola et al. (2023), Algorithm 1 and Lemma 2.1) correct is tricky.
Let \(p := \max(x, o+e)\) be the maximal cost of any edge in the edit graph. As in the BiWFA paper, let \(s_f\) and \(s_r\) be the distances of the forward and reverse fronts computed so far.
We prove the following lemma:
Lemma Once BiWFA has expanded the forward and reverse fronts up to \(s_f\) and \(s_r\) and has found some path of cost \(s \leq s_f + s_r\), expanding the fronts until \(s'_f + s'_r \geq s+p+o\) is guaranteed to find a shortest path.

Proof sketch for linear time seed heuristic alignment
https://curiouscoding.nl/posts/linear-time-pa/
Sun, 24 Apr 2022 00:00:00 +0200
Table of Contents Pairwise alignment in subquadratic time Random model Algorithm Seed heuristic Match pruning Analysis Expanded states Excess errors Algorithmic complexity
This post is a proof sketch to show that A* with the seed heuristic (Groot Koerkamp and Ivanov 2024) does exact pairwise alignment of random strings with random mutations in near linear time.
Pairwise alignment in subquadratic time Backurs and Indyk (2018) show that computing edit distance cannot be done in strongly subquadratic time (i.

Variations on the WFA recursion
https://curiouscoding.nl/posts/wfa-variations/
Sun, 17 Apr 2022 03:14:00 +0200
Table of Contents Gap open Gap close Symmetric alternatives Another symmetry Conclusions
cross references: BiWFA GitHub issue
In this post I will explore some variations of the recursion used by WFA/BiWFA for the affine version of the diagonal transition algorithm. In particular, we will go over a gap-close variant, and look into some more symmetric formulations.
Gap open WFA (Marco-Sola et al. 2020) introduces the affine cost variant of the classic diagonal transition method.

A survey of exact global pairwise alignment
https://curiouscoding.nl/posts/pairwise-alignment-history/
Fri, 01 Apr 2022 00:00:00 +0200
Table of Contents Variants of pairwise alignment Cost models Alignment types A chronological overview of global pairwise alignment Algorithms in detail Classic DP algorithms Cubic algorithm of Needleman and Wunsch (1970) A quadratic DP Local alignment Affine costs Minimizing vs. maximizing duality Four Russians method TODO \(O(ns)\) methods TODO Exponential search on band TODO LCS: thresholds, \(k\)-candidates and contours TODO Diagonal transition: furthest reaching and wavefronts TODO Suffixtree for \(O(n+s^2)\) expected runtime Using less memory Computing the score in linear space Divide-and-conquer TODO LCSk[++] algorithms Theoretical lower bound TODO A note on DP (toposort) vs Dijkstra vs A* TODO Tools TODO Notes for other posts Semi-global alignment papers Approximate pairwise aligners Old vs new papers
Note: This is a living document, and will likely remain so for a while.