Method on CuriousCoding

https://curiouscoding.nl/categories/method/
Recent content in Method on CuriousCoding
Mon, 15 Jan 2024 00:00:00 +0100

Notes on bidirectional anchors
https://curiouscoding.nl/posts/bd-anchors/
Mon, 15 Jan 2024 00:00:00 +0100

\[ \newcommand{\A}{\mathcal{A}_\ell} \newcommand{\T}{\mathcal{T}_\ell} \]
These are some notes on Bidirectional String Anchors (Loukides, Pissis, and Sweering 2023), also called bd-anchors.

Resources:
Loukides and Pissis (2021): the preceding conference paper, containing a subset of the content.
Loukides, Pissis, and Sweering (2023): the paper discussed here.
Ayad, Loukides, and Pissis (2023): a follow-up paper containing a faster average-case \(O(n)\) construction algorithm and more memory-efficient construction algorithms for the index.

Notes on SsHash
https://curiouscoding.nl/posts/sshash/
Mon, 15 Jan 2024 00:00:00 +0100

\[\newcommand{\S}{\mathcal{S}}\]
Paper summary

Intro

SsHash (Pibiri 2022) is a data structure for indexing kmers. Given a set of kmers \(\S\), it supports two operations:
\(Lookup(g)\): return the unique id \(i\in [|\S|]\) of the kmer \(g\).
\(Access(i)\): return the kmer corresponding to id \(i\).
It also supports streaming queries, looking up all kmers of a longer string consecutively by exploiting the overlap between them.

BBHash: some ideas
https://curiouscoding.nl/posts/bbhash/
Mon, 04 Sep 2023 00:00:00 +0200

BBHash (Limasset et al. 2017) uses multiple layers to create a minimal perfect hash function (MPHF) that hashes some input set into \([n]\). (See also my note on PTHash (Pibiri and Trani 2021).) Simply said, it maps the \(n\) elements into \([\gamma \cdot n]\) using hash function \(h_0\).
The \(k_0\) elements that have collisions are mapped into \([\gamma \cdot k_0]\) using \(h_1\). Then, the \(k_1\) elements with collisions are mapped into \([\gamma \cdot k_1]\) using \(h_2\), and so on.

BitPAl bitpacking algorithm
https://curiouscoding.nl/posts/bitpal/
Sun, 03 Sep 2023 00:00:00 +0200

The supplement (download) of the Loving, Hernandez, and Benson (2014) paper introduces a \(15\)-operation version of Myers' (1999) bitpacking algorithm, which uses \(16\) operations when modified for edit distance. I tried implementing it, but it seems to have a bug that I will describe below. The fix is here.

Problem

To recap, this algorithm solves the unit-cost edit distance problem by using bitpacking to compute a \(1\times w\) block of the DP matrix at a time.

BWT and FM-index
https://curiouscoding.nl/posts/bwt/
Tue, 18 Oct 2022 00:00:00 +0200

These are some notes about the Burrows-Wheeler Transform (BWT), FM-index, and variants. See my post on the linear-time suffix array construction algorithm for notation and terminology. At the bottom you can find a visualization. This page has an interactive demo. Source code for the visualizations is in this GitHub repo.

Burrows-Wheeler Transform (BWT)

The BWT of a string \(S\) is generated as follows:

Linear-time suffix array construction
https://curiouscoding.nl/posts/suffix-array-construction/
Thu, 13 Oct 2022 00:00:00 +0200

These are some notes about linear-time suffix array (SA) construction algorithms (SACAs). At the bottom you can find a visualization. This page has an interactive demo.
History of suffix array construction algorithms:
1990 first algorithm: Manber and Myers (1993)
2002 small/large suffixes, explained below: Ko and Aluru (2005)
2009 recursion only on LMS suffixes: Nong, Zhang, and Chan (2009)
These slides from Stanford are a nice reference for the last algorithm.

AStarix
https://curiouscoding.nl/posts/astarix/
Fri, 12 Nov 2021 13:05:00 +0100

Papers:
AStarix: Fast and Optimal Sequence-to-Graph Alignment
Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds

AStarix is a method for aligning sequences (reads) to graphs.

Input:
A reference sequence or graph
Alignment costs \((\Delta_{match}, \Delta_{subst}, \Delta_{del}, \Delta_{ins})\) for a match, substitution, deletion, and insertion
Sequence(s) to align

Output:
An optimal alignment of each input sequence

The input is a reference graph (an automaton, really) \(G_r = (V_r, E_r)\) with edges \(E_r \subseteq V_r\times V_r\times \Sigma\) that indicate the transitions between states.

Neighbour joining
https://curiouscoding.nl/posts/neighbour-joining/
Fri, 12 Nov 2021 11:57:00 +0100

Neighbour joining (NJ, paper) is a phylogeny reconstruction method. It differs from UPGMA in the way it computes the distances between clusters.

This algorithm first assumes that the phylogeny is a star graph. Then it finds the pair of vertices that, when merged and split out, gives the minimal total edge length \(S_{ij}\) of the new almost-star graph. (See eq. (4) and figure 2a and 2b in the paper.)
\[ S_{ij} = \frac1{2(n-2)} \sum_{k\not\in \{i,j\}}(d(i, k)+d(j,k)) + \frac 12 d(i,j)+\frac 1{n-2} \sum_{k<l,\, k, l\not\in\{i,j\}}d(k,l). \]

UPGMA
https://curiouscoding.nl/posts/upgma/
Thu, 28 Oct 2021 11:56:00 +0200

Unweighted pair group method with arithmetic mean (UPGMA) is a phylogeny reconstruction method.
Input: Matrix of pairwise distances
Output: Phylogeny
Algorithm: Repeatedly merge the nearest two clusters. The distance between clusters is the average of all pairwise distances between them. When merging two clusters, the distances from the new cluster to the others are the averages of the distances from the two merged clusters, weighted by cluster size.
Complexity: \(O(n^3)\) naive, \(O(n^2 \ln n)\) using a heap.
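
The UPGMA merge loop can be sketched in a few lines of Python. This is a minimal illustration of the naive \(O(n^3)\) variant, not code from the post: the function name `upgma`, the nested-tuple tree representation, and the dictionary of pairwise cluster distances are all assumptions made for this example, and it expects a symmetric matrix with at least one row.

```python
from itertools import combinations


def upgma(dist):
    """Naive UPGMA on a symmetric n x n distance matrix (list of lists).

    Hypothetical helper: returns the phylogeny as nested tuples whose
    leaves are the indices 0..n-1. Branch lengths are not tracked.
    """
    n = len(dist)
    # Each active cluster id maps to (subtree, number of leaves).
    clusters = {i: (i, 1) for i in range(n)}
    # Distances between active clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i, j in combinations(range(n), 2)}
    next_id = n
    while len(clusters) > 1:
        # Linear scan for the nearest pair of clusters (the naive step).
        a, b = min(d, key=d.get)
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        del d[(a, b)]
        # The distance from the merged cluster to any other cluster c is
        # the size-weighted average of the two old distances.
        for c in clusters:
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    ((tree, _),) = clusters.values()
    return tree
```

Weighting by cluster size is what keeps the cluster distance equal to the plain average over all leaf pairs. The heap-based variant would replace the linear scan in `min(d, key=d.get)` with a priority queue.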