<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Method on CuriousCoding</title><link>https://curiouscoding.nl/categories/method/</link><description>Recent content in Method on CuriousCoding</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sun, 31 May 2026 00:00:00 +0200</lastBuildDate><atom:link href="https://curiouscoding.nl/categories/method/index.xml" rel="self" type="application/rss+xml"/><item><title>Notes on bidirectional anchors</title><link>https://curiouscoding.nl/posts/bd-anchors/</link><pubDate>Mon, 15 Jan 2024 00:00:00 +0100</pubDate><guid>https://curiouscoding.nl/posts/bd-anchors/</guid><description>&lt;div class="ox-hugo-toc toc"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#paper-overview" &gt;Paper overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#remarks-on-the-paper" &gt;Remarks on the paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#thoughts" &gt;Thoughts&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;\[
\newcommand{\A}{\mathcal{A}_\ell}
\newcommand{\T}{\mathcal{T}_\ell}
\]&lt;/p&gt;
&lt;p&gt;These are some notes on &lt;em&gt;Bidirectional String Anchors&lt;/em&gt; (&lt;a href="#citeproc_bib_item_2"&gt;Loukides, Pissis, and Sweering 2023&lt;/a&gt;), also
called &lt;em&gt;bd-anchors&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Resources:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Loukides and Pissis (&lt;a href="#citeproc_bib_item_3"&gt;2021&lt;/a&gt;): preceding conference paper with a subset of the content.&lt;/li&gt;
&lt;li&gt;Loukides, Pissis, and Sweering (&lt;a href="#citeproc_bib_item_2"&gt;2023&lt;/a&gt;): The paper discussed here.&lt;/li&gt;
&lt;li&gt;Ayad, Loukides, and Pissis (&lt;a href="#citeproc_bib_item_1"&gt;2023&lt;/a&gt;): follow-up/second paper containing
&lt;ul&gt;
&lt;li&gt;a faster average-case \(O(n)\) construction algorithm;&lt;/li&gt;
&lt;li&gt;more memory-efficient construction algorithms for the index.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/solonas13/bd-anchors" class="external-link" target="_blank" rel="noopener"&gt;https://github.com/solonas13/bd-anchors&lt;/a&gt;: code for first paper&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/lorrainea/BDA-index" class="external-link" target="_blank" rel="noopener"&gt;https://github.com/lorrainea/BDA-index&lt;/a&gt;: code for follow-up paper&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The remainder of this post is split into &lt;a href="#paper-overview" &gt;an overview of the paper&lt;/a&gt;, &lt;a href="#remarks-on-the-paper" &gt;Remarks on the paper&lt;/a&gt;, and further &lt;a href="#thoughts" &gt;Thoughts&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Notes on SsHash</title><link>https://curiouscoding.nl/posts/sshash/</link><pubDate>Mon, 15 Jan 2024 00:00:00 +0100</pubDate><guid>https://curiouscoding.nl/posts/sshash/</guid><description>&lt;div class="ox-hugo-toc toc"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#paper-summary" &gt;Paper summary&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#intro" &gt;Intro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#prelims" &gt;Prelims&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#related-work" &gt;Related work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#sparse-and-skew-hashing" &gt;Sparse and skew hashing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#remarks" &gt;Remarks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#ideas" &gt;Ideas&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;\[\newcommand{\S}{\mathcal{S}}\]&lt;/p&gt;
&lt;h2 id="paper-summary"&gt;
 Paper summary
 &lt;a class="heading-link" href="#paper-summary"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;h3 id="intro"&gt;
 Intro
 &lt;a class="heading-link" href="#intro"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;p&gt;SsHash (&lt;a href="#citeproc_bib_item_7"&gt;Pibiri 2022&lt;/a&gt;) is a data structure for indexing kmers.
Given a set of kmers \(\S\), it supports two operations:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;\(Lookup(g)\)&lt;/dt&gt;
&lt;dd&gt;return the unique id \(i\in [|\S|]\) of the kmer \(g\).&lt;/dd&gt;
&lt;dt&gt;\(Access(i)\)&lt;/dt&gt;
&lt;dd&gt;return the kmer corresponding to id \(i\).&lt;/dd&gt;
&lt;/dl&gt;
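&lt;p&gt;To make the two operations concrete, here is a minimal Python sketch of the interface only. It is a naive stand-in built on a sorted list and a hash map, not the SsHash layout; the names &lt;code&gt;NaiveKmerDict&lt;/code&gt; and &lt;code&gt;kmers_of&lt;/code&gt;, the 0-based ids, and the choice to return \(-1\) for a missing kmer are all assumptions made for illustration.&lt;/p&gt;

```python
# Naive stand-in for the Lookup/Access interface.
# Assumption: this is NOT the SsHash data layout, only the same contract.
class NaiveKmerDict:
    def __init__(self, kmers):
        # Sort and deduplicate so that ids are deterministic: 0 .. |S|-1.
        self.kmers = sorted(set(kmers))
        self.ids = {g: i for i, g in enumerate(self.kmers)}

    def lookup(self, g):
        # Return the unique id of kmer g, or -1 if g is not in the set
        # (the -1 convention is an assumption of this sketch).
        return self.ids.get(g, -1)

    def access(self, i):
        # Return the kmer corresponding to id i.
        return self.kmers[i]


def kmers_of(text, k):
    # All overlapping kmers of text, in left-to-right order.
    return [text[j:j + k] for j in range(len(text) - k + 1)]


d = NaiveKmerDict(kmers_of("ACGTACGGT", 3))
# Lookup and Access are inverse to each other on the stored set.
roundtrip_ok = all(d.access(d.lookup(g)) == g for g in d.kmers)
```

&lt;p&gt;The sketch only shows that the two operations are inverses on the stored set.&lt;/p&gt;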
&lt;p&gt;It also supports &lt;em&gt;streaming&lt;/em&gt; queries, looking up all kmers from a longer string
consecutively, by exploiting the overlap between them.&lt;/p&gt;</description></item><item><title>BBHash: some ideas</title><link>https://curiouscoding.nl/posts/bbhash/</link><pubDate>Mon, 04 Sep 2023 00:00:00 +0200</pubDate><guid>https://curiouscoding.nl/posts/bbhash/</guid><description>&lt;div class="ox-hugo-toc toc"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#possible-speedup" &gt;Possible speedup?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;BBHash (&lt;a href="#citeproc_bib_item_1"&gt;Limasset et al. 2017&lt;/a&gt;) uses multiple &lt;em&gt;layers&lt;/em&gt; to create a minimal perfect
hash function (MPHF) that hashes some input set into \([n]\).&lt;/p&gt;
&lt;p&gt;(See also my &lt;a href="https://curiouscoding.nl/posts/ptrhash/" &gt;note on PTHash&lt;/a&gt; (&lt;a href="#citeproc_bib_item_2"&gt;Pibiri and Trani 2021&lt;/a&gt;).)&lt;/p&gt;
&lt;p&gt;Put simply, it maps the \(n\) elements into \([\gamma \cdot n]\) using a hash function \(h_0\).
The \(k_0\) elements that have collisions are mapped into \([\gamma \cdot k_0]\)
using \(h_1\).
Then, the \(k_1\) elements with collisions are mapped into \([\gamma \cdot k_1]\),
and so on.&lt;/p&gt;</description></item><item><title>BitPAl bitpacking algorithm</title><link>https://curiouscoding.nl/posts/bitpal/</link><pubDate>Sun, 03 Sep 2023 00:00:00 +0200</pubDate><guid>https://curiouscoding.nl/posts/bitpal/</guid><description>&lt;div class="ox-hugo-toc toc"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#problem" &gt;Problem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#input" &gt;Input&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#example" &gt;Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#discussion" &gt;Discussion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#found-the-bug" &gt;Found the bug&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#outlook" &gt;Outlook&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;The supplement (&lt;a href="https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bioinformatics/30/22/10.1093_bioinformatics_btu507/3/bioinformatics_30_22_3166_s1.zip?Expires=1695376479&amp;amp;Signature=vroWHrpg-P0tvOPcafVy~gh6mhZ-AZ8kj6lHr1DH7byZGTK2sy8chti7hDiWdbtGx6onKv94EAI5odd~GMBMG0GNXxfp1bZ~7ItGeNCXp0tosJpArez7Yo~PuKT77nJpgQYo5rabbkJ6qtvP3-V-41oznQ~Zh9Tl~GNLvjLo~5vq0D1wa4PMmqhc-C0zcEeh8ybqEK7hQdyvoxreWppOTZFIHIJwmZOSOeXBWM0fQhcPnM9ZU8cEsqAI64WuWt1AJgmDOPDTBVzQHmHpsl01F4Jt8Hf2gvDYwhmoM7t4U~qCIGFr4raran~hzr-eD2vhwexQhpC7e1U2~N2lMC7e7w__&amp;amp;Key-Pair-Id=APKAIE5G5CRDK6RD3PGA" class="external-link" target="_blank" rel="noopener"&gt;download&lt;/a&gt;) of the Loving, Hernandez, and Benson (&lt;a href="#citeproc_bib_item_1"&gt;2014&lt;/a&gt;) paper introduces a \(15\)
operation version of the bitpacking algorithm of Myers (&lt;a href="#citeproc_bib_item_2"&gt;1999&lt;/a&gt;), which uses \(16\)
operations when modified for edit distance.&lt;/p&gt;
&lt;p&gt;I tried implementing it, but it seems to have a bug that I will describe below.
The fix is &lt;a href="#found-the-bug" &gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="problem"&gt;
 Problem
 &lt;a class="heading-link" href="#problem"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;To recap, this algorithm solves the unit-cost edit distance problem by using
bitpacking to compute a \(1\times w\) block at a time. As input, it takes&lt;/p&gt;</description></item><item><title>BWT and FM-index</title><link>https://curiouscoding.nl/posts/bwt/</link><pubDate>Tue, 18 Oct 2022 00:00:00 +0200</pubDate><guid>https://curiouscoding.nl/posts/bwt/</guid><description>&lt;div class="ox-hugo-toc toc"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#burrows-wheeler-transformation--bwt" &gt;Burrows-Wheeler Transformation (BWT)&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#last-to-first-mapping--lf-mapping" &gt;Last-to-first mapping (LF mapping)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pattern-matching" &gt;Pattern matching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#visualization" &gt;Visualization&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#bi-directional-bwt" &gt;Bi-directional BWT&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;These are some notes about the &lt;a href="https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform" class="external-link" target="_blank" rel="noopener"&gt;Burrows-Wheeler Transform&lt;/a&gt; (BWT), &lt;a href="https://en.wikipedia.org/wiki/FM-index" class="external-link" target="_blank" rel="noopener"&gt;FM-index&lt;/a&gt;, and variants.&lt;/p&gt;
&lt;p&gt;See my post on the &lt;a href="../suffix-array-construction/" &gt;linear time suffix array construction algorithm&lt;/a&gt; for
notation and terminology.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;At &lt;a href="#visualization" &gt;the bottom&lt;/a&gt; you can find a visualization.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://curiouscoding.nl/posts/alg-viz/" &gt;&lt;strong&gt;&lt;strong&gt;This page&lt;/strong&gt;&lt;/strong&gt;&lt;/a&gt; has an &lt;strong&gt;&lt;strong&gt;interactive demo&lt;/strong&gt;&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Source code for the visualizations is in &lt;a href="https://github.com/RagnarGrootKoerkamp/suffix-array-construction" class="external-link" target="_blank" rel="noopener"&gt;this GitHub repo&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="burrows-wheeler-transformation--bwt"&gt;
 Burrows-Wheeler Transformation (BWT)
 &lt;a class="heading-link" href="#burrows-wheeler-transformation--bwt"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;The BWT of a string \(S\) is generated as follows:&lt;/p&gt;</description></item><item><title>Linear-time suffix array construction</title><link>https://curiouscoding.nl/posts/suffix-array-construction/</link><pubDate>Thu, 13 Oct 2022 00:00:00 +0200</pubDate><guid>https://curiouscoding.nl/posts/suffix-array-construction/</guid><description>&lt;div class="ox-hugo-toc toc"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#notation" &gt;Notation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#small-and-large-suffixes" &gt;Small and Large suffixes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#building-the-suffix-array-from-a-smaller-one" &gt;Building the suffix array from a smaller one&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#visualization" &gt;Visualization&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;These are some notes about linear-time suffix array (SA) construction algorithms (SACAs).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;At &lt;a href="#visualization" &gt;the bottom&lt;/a&gt; you can find a visualization.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://curiouscoding.nl/posts/alg-viz/" &gt;&lt;strong&gt;&lt;strong&gt;This page&lt;/strong&gt;&lt;/strong&gt;&lt;/a&gt; has an &lt;strong&gt;&lt;strong&gt;interactive demo&lt;/strong&gt;&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;History of suffix array construction algorithms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1990 first algorithm: Manber and Myers (&lt;a href="#citeproc_bib_item_2"&gt;1993&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;2002 small/large suffixes, explained below: Ko and Aluru (&lt;a href="#citeproc_bib_item_1"&gt;2005&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;2009 recursion only on &lt;em&gt;LMS&lt;/em&gt; suffixes: Nong, Zhang, and Chan (&lt;a href="#citeproc_bib_item_3"&gt;2009&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href="http://web.stanford.edu/class/archive/cs/cs166/cs166.1196/lectures/04/Small04.pdf" class="external-link" target="_blank" rel="noopener"&gt;These slides&lt;/a&gt; from Stanford are a nice reference for the last algorithm.&lt;/p&gt;</description></item><item><title>AStarix</title><link>https://curiouscoding.nl/posts/astarix/</link><pubDate>Fri, 12 Nov 2021 13:05:00 +0100</pubDate><guid>https://curiouscoding.nl/posts/astarix/</guid><description>&lt;p&gt;&lt;strong&gt;Papers&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.biorxiv.org/content/10.1101/2020.01.22.915496v2.full" class="external-link" target="_blank" rel="noopener"&gt;AStarix: Fast and Optimal Sequence-to-Graph Alignment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.biorxiv.org/content/10.1101/2021.11.05.467453v1" class="external-link" target="_blank" rel="noopener"&gt;Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;AStarix is a method for aligning sequences (reads) to graphs:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;Input&lt;/dt&gt;
&lt;dd&gt;&lt;ul&gt;
&lt;li&gt;A reference sequence or graph&lt;/li&gt;
&lt;li&gt;Alignment costs \((\Delta_{match}, \Delta_{subst}, \Delta_{del}, \Delta_{ins})\) for a match, substitution, deletion, and insertion&lt;/li&gt;
&lt;li&gt;Sequence(s) to align&lt;/li&gt;
&lt;/ul&gt;
&lt;/dd&gt;
&lt;dt&gt;Output&lt;/dt&gt;
&lt;dd&gt;An optimal alignment of each input sequence&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;The input is a reference graph (automaton really) \(G_r = (V_r, E_r)\) with edges \(E_r \subseteq
V_r\times V_r\times \Sigma\) that indicate the transitions between states.&lt;/p&gt;</description></item><item><title>Neighbour joining</title><link>https://curiouscoding.nl/posts/neighbour-joining/</link><pubDate>Fri, 12 Nov 2021 11:57:00 +0100</pubDate><guid>https://curiouscoding.nl/posts/neighbour-joining/</guid><description>&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Neighbor_joining" class="external-link" target="_blank" rel="noopener"&gt;Neighbour joining&lt;/a&gt; (NJ, &lt;a href="https://academic.oup.com/mbe/article/4/4/406/1029664" class="external-link" target="_blank" rel="noopener"&gt;paper&lt;/a&gt;) is a phylogeny reconstruction method.
It differs from UPGMA in the way it computes the distances between clusters.&lt;/p&gt;
&lt;p&gt;This algorithm first assumes that the phylogeny is a star graph.
Then it finds the pair of vertices that when merged and split out gives the
minimal total edge length \(S_{ij}\) of the new almost-star graph. (See eq. (4)
and figure 2a and 2b in the paper.)
\[
S_{i,j} = \frac1{2(n-2)} \sum_{k\not\in \{i,j\}}(d(i, k)+d(j,k)) + \frac 12
d(i,j)+\frac 1{n-2} \sum_{k&amp;lt;l,\, k, l\not\in\{i,j\}}d(k,l).
\]
After multiplying by \(2(n-2)\) and subtracting twice the sum of all pairwise distances (which is a constant), we obtain
the familiar
\[
Q(i, j) = (n-2) d(i, j) - \sum_{k=1}^n d(i, k) - \sum_{k=1}^n d(j, k).
\]
Thus, we merge the two vertices that minimize \(Q\).
The distance from the merging of vertices \(i\) and \(j\) to each other vertex
\(k\) is \(d_{(i-j)k} = (d_{i,k} + d_{j,k})/2\).&lt;/p&gt;</description></item><item><title>UPGMA</title><link>https://curiouscoding.nl/posts/upgma/</link><pubDate>Thu, 28 Oct 2021 11:56:00 +0200</pubDate><guid>https://curiouscoding.nl/posts/upgma/</guid><description>&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/UPGMA" class="external-link" target="_blank" rel="noopener"&gt;Unweighted pair group method with arithmetic mean&lt;/a&gt; (UPGMA) is a phylogeny reconstruction method.&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;Input&lt;/dt&gt;
&lt;dd&gt;Matrix of pairwise distances&lt;/dd&gt;
&lt;dt&gt;Output&lt;/dt&gt;
&lt;dd&gt;Phylogeny&lt;/dd&gt;
&lt;dt&gt;Algorithm&lt;/dt&gt;
&lt;dd&gt;Repeatedly merge the nearest two clusters. The distance between
clusters is the average of all pairwise distances between them. When merging
two clusters, the distances of the new cluster are the weighted averages of
distances from the two clusters being merged.&lt;/dd&gt;
&lt;dt&gt;Complexity&lt;/dt&gt;
&lt;dd&gt;\(O(n^3)\) naive, \(O(n^2 \ln n)\) using a heap.&lt;/dd&gt;
&lt;/dl&gt;</description></item></channel></rss>