<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hpc on CuriousCoding</title><link>https://curiouscoding.nl/tags/hpc/</link><description>Recent content in Hpc on CuriousCoding</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 04 Jun 2026 00:00:00 +0200</lastBuildDate><atom:link href="https://curiouscoding.nl/tags/hpc/index.xml" rel="self" type="application/rss+xml"/><item><title>Sassy: fuzzy searching DNA sequences using SIMD</title><link>https://curiouscoding.nl/posts/sassy/</link><pubDate>Thu, 04 Jun 2026 00:00:00 +0200</pubDate><guid>https://curiouscoding.nl/posts/sassy/</guid><description>&lt;div class="ox-hugo-toc toc has-section-numbers"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1&lt;/span&gt; &lt;a href="#why-should-you-care" &gt;Why should you care?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2&lt;/span&gt; &lt;a href="#using-sassy" &gt;Using Sassy&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.1&lt;/span&gt; &lt;a href="#the-cli" &gt;The CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.2&lt;/span&gt; &lt;a href="#rust-library" &gt;Rust library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.3&lt;/span&gt; &lt;a href="#python-bindings" &gt;Python bindings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.4&lt;/span&gt; &lt;a href="#c-and-r-bindings" &gt;C and R bindings&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3&lt;/span&gt; &lt;a href="#problem-statement-approximate-string-matching" &gt;Problem statement: Approximate String Matching&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.1&lt;/span&gt; &lt;a href="#overhang-when-the-pattern-extends-beyond-the-read" &gt;Overhang: when the pattern extends beyond the read&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4&lt;/span&gt; &lt;a href="#some-background" &gt;Some background&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5&lt;/span&gt; &lt;a href="#the-algorithm" &gt;The algorithm&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.1&lt;/span&gt; &lt;a href="#simd" &gt;SIMD&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.2&lt;/span&gt; &lt;a href="#sassy2" &gt;Sassy2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6&lt;/span&gt; &lt;a href="#results" &gt;Results&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.1&lt;/span&gt; &lt;a href="#sassy1" &gt;Sassy1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.2&lt;/span&gt; &lt;a href="#sassy2" &gt;Sassy2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.3&lt;/span&gt; &lt;a href="#applications" &gt;Applications&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;div style="color:grey"&gt;
&lt;p&gt;Discuss on &lt;a href="https://www.reddit.com/r/rust/comments/1twnnq0" style="color:grey"&gt;reddit&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Sassy is a tool for quickly searching short DNA patterns across large inputs that
&lt;a href="https://github.com/rickbeeloo" class="external-link" target="_blank" rel="noopener"&gt;Rick Beeloo&lt;/a&gt; and myself have been working on over the past year.
It just got published (&lt;a href="#citeproc_bib_item_3"&gt;Beeloo and Groot Koerkamp 2026b&lt;/a&gt;) (&lt;a href="https://curiouscoding.nl/papers/sassy.pdf" &gt;PDF&lt;/a&gt;, &lt;a href="https://doi.org/10.1093/bioinformatics/btag244" class="external-link" target="_blank" rel="noopener"&gt;DOI&lt;/a&gt;), and this blog post
briefly outlines some possible applications and how we developed it.
See also the &lt;a href="https://github.com/RagnarGrootKoerkamp/sassy" class="external-link" target="_blank" rel="noopener"&gt;GitHub readme&lt;/a&gt; and the &lt;a href="https://curiouscoding.nl/slides/sassy/slides/" &gt;slides&lt;/a&gt; for my &lt;a href="https://recomb-seq.github.io/seq2026/program" class="external-link" target="_blank" rel="noopener"&gt;RECOMB-Seq&lt;/a&gt; talk, which loosely
form the basis for this post.&lt;/p&gt;</description></item><item><title>Route Planning using Customizable Contraction Hierarchies</title><link>https://curiouscoding.nl/posts/cch/</link><pubDate>Mon, 02 Mar 2026 00:00:00 +0100</pubDate><guid>https://curiouscoding.nl/posts/cch/</guid><description>&lt;div class="ox-hugo-toc toc has-section-numbers"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1&lt;/span&gt; &lt;a href="#problem-statement-customizable-route-planning--crp" &gt;Problem Statement: Customizable Route Planning (CRP)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2&lt;/span&gt; &lt;a href="#contraction-hierarchies--ch" &gt;Contraction Hierarchies (CH)&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.1&lt;/span&gt; &lt;a href="#classic-contraction-hierarchies" &gt;&lt;em&gt;Classic&lt;/em&gt; Contraction Hierarchies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.2&lt;/span&gt; &lt;a href="#customizable-contraction-hierarchies--cch" &gt;&lt;em&gt;Customizable&lt;/em&gt; Contraction Hierarchies (CCH)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3&lt;/span&gt; &lt;a href="#analogy-with-trees" &gt;Analogy with trees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4&lt;/span&gt; &lt;a href="#shortest-paths-in-chs" &gt;Shortest Paths in CHs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5&lt;/span&gt; &lt;a href="#parents-faster-shortest-paths-in-cchs" &gt;Parents: Faster Shortest Paths in CCHs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6&lt;/span&gt; &lt;a href="#input-graph" &gt;Input Graph&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7&lt;/span&gt; &lt;a href="#initial-algorithm" &gt;Initial Algorithm&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1&lt;/span&gt; &lt;a href="#permute-input" &gt;Permute input&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.2&lt;/span&gt; &lt;a href="#chordal-completion-and-parents" &gt;Chordal Completion and Parents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.3&lt;/span&gt; &lt;a href="#customize" &gt;Customize&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.4&lt;/span&gt; &lt;a href="#query" &gt;Query&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8&lt;/span&gt; &lt;a href="#optimizing-things" &gt;Optimizing things&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.1&lt;/span&gt; &lt;a href="#binary-searching-in-find-edge" &gt;Binary searching in &lt;code&gt;find_edge&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.2&lt;/span&gt; &lt;a href="#hashmap-of-edges" &gt;&lt;code&gt;HashMap&lt;/code&gt; of edges&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.3&lt;/span&gt; &lt;a href="#ranges-of-neighbours" &gt;Ranges of neighbours&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.4&lt;/span&gt; &lt;a href="#linear-scan" &gt;Linear scan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.5&lt;/span&gt; &lt;a href="#proper-query-algorithm" &gt;Proper query algorithm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.6&lt;/span&gt; &lt;a href="#pruning-edges" &gt;Pruning edges&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.7&lt;/span&gt; &lt;a href="#pruning" &gt;Pruning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.8&lt;/span&gt; &lt;a href="#unconditional-edge-relaxing" &gt;Unconditional edge relaxing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.9&lt;/span&gt; &lt;a href="#early-edge-break" &gt;Early edge break&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.10&lt;/span&gt; &lt;a href="#dfs-ordering-the-nodes" &gt;DFS-ordering the nodes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.11&lt;/span&gt; &lt;a href="#not-inclining-queries" &gt;Not inclining queries&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;9&lt;/span&gt; &lt;a href="#some-stats" &gt;Some stats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;10&lt;/span&gt; &lt;a href="#serializing-the-final-structure" &gt;Serializing the final structure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;11&lt;/span&gt; &lt;a href="#merging-adjacent-edges" &gt;Merging adjacent edges&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;11.1&lt;/span&gt; &lt;a href="#perf-stat" &gt;Perf stat&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;11.2&lt;/span&gt; &lt;a href="#all-ranges-are-multiples-of-8" &gt;All ranges are multiples of 8&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;11.3&lt;/span&gt; &lt;a href="#all-ranges-have-size-8" &gt;All ranges have size 8&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;11.4&lt;/span&gt; &lt;a href="#finding-the-bottleneck" &gt;Finding the bottleneck&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;12&lt;/span&gt; &lt;a href="#bugfixing" &gt;Bugfixing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;13&lt;/span&gt; &lt;a href="#further-ideas" &gt;Further ideas&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;13.1&lt;/span&gt; &lt;a href="#failed-doubling-the-graph" &gt;Failed: Doubling the graph&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;13.2&lt;/span&gt; &lt;a href="#edge-pruning" &gt;Edge pruning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;13.3&lt;/span&gt; &lt;a href="#failed-expanding-a-node-and-its-parent-in-parallel" &gt;Failed: Expanding a node and its parent in parallel&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;14&lt;/span&gt; &lt;a href="#current-best-results" &gt;Current best results&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;14.1&lt;/span&gt; &lt;a href="#bottleneck" &gt;Bottleneck&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;15&lt;/span&gt; &lt;a href="#d41d8c" &gt;&lt;span class="org-todo todo TODO"&gt;TODO&lt;/span&gt; &lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;These are some notes on &lt;em&gt;customizable contraction hierarchies&lt;/em&gt;, based on talks
with Michael Zündorf and the survey paper by Bläsius, Buchhold, Wagner, Zeitz, and Zündorf (&lt;a href="#citeproc_bib_item_1"&gt;2025&lt;/a&gt;).&lt;/p&gt;</description></item><item><title>QuadRank: Engineering a High-Throughput Rank</title><link>https://curiouscoding.nl/slides/quadrank-text/</link><pubDate>Wed, 18 Feb 2026 00:00:00 +0100</pubDate><guid>https://curiouscoding.nl/slides/quadrank-text/</guid><description>
&lt;h2 id="problem-statement"&gt;
 Binary Rank: Problem statement
 &lt;a class="heading-link" href="#problem-statement"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;div display="none"&gt;
&lt;ul&gt;
&lt;li&gt;Input: a many-GB text \(T = t_0\dots t_{n-1}\) of \(n\) bits.&lt;/li&gt;
&lt;li&gt;Queries: given \(q\), find&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;p&gt;\begin{equation*}
\newcommand{\rank}{\mathsf{rank}}
\rank(q) := \sum_{0\leq i&amp;lt; q} t_i.
\end{equation*}&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;\(T = \underline{\texttt{1001001}}\texttt{110010100}\)
&lt;ul&gt;
&lt;li&gt;\(\rank(0) = 0\)&lt;/li&gt;
&lt;li&gt;\(\rank(7) = 3\)&lt;/li&gt;
&lt;li&gt;\(\rank(16) = 7\)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Why? Occurrences table in FM-index.&lt;/li&gt;
&lt;/ul&gt;
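&lt;p&gt;As a small illustration of the definition above, here is a minimal Python sketch that evaluates \(\rank(q)\) directly by counting. The function name &lt;code&gt;rank&lt;/code&gt; and the representation of \(T\) as a string of &lt;code&gt;0&lt;/code&gt;/&lt;code&gt;1&lt;/code&gt; characters are assumptions made for this example only; they are not taken from the QuadRank code.&lt;/p&gt;

```python
def rank(bits, q):
    """Number of 1-bits among the first q positions bits[0], ..., bits[q-1]."""
    return bits[:q].count("1")


# The example text from above: the underlined 7-bit prefix, then the remaining 9 bits.
T = "1001001" + "110010100"

print(rank(T, 0), rank(T, 7), rank(T, 16))
```

&lt;p&gt;This prints &lt;code&gt;0 3 7&lt;/code&gt;, matching the three example values listed above.&lt;/p&gt;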
&lt;h2 id="history"&gt;
 History
 &lt;a class="heading-link" href="#history"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;figure class="full post-large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/rank-overview.svg"&gt;
&lt;/figure&gt;

&lt;h2 id="naive"&gt;
 Naive solutions
 &lt;a class="heading-link" href="#naive"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Linear scan:
&lt;ul&gt;
&lt;li&gt;\(O(n/w)\) time using \(w\) bit popcount, no space overhead.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;figure class="post-large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/1-linear.svg"&gt;
&lt;/figure&gt;

&lt;ul&gt;
&lt;li&gt;Precompute all 64-bit answers \(r_i := \rank(i)\):
&lt;ul&gt;
&lt;li&gt;\(O(1)\) time, \(64\times\) overhead.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;figure class="post-large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/2-precompute.svg"&gt;
&lt;/figure&gt;

&lt;h2 id="blocks"&gt;
 Middle-ground: block-based offsets
 &lt;a class="heading-link" href="#blocks"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Use \(B=512\) bit blocks, and store all \(b_j := \rank(j\cdot B)\).&lt;/li&gt;
&lt;/ul&gt;
&lt;figure class="post-large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/3-blocks.svg"&gt;
&lt;/figure&gt;

&lt;ul&gt;
&lt;li&gt;Query \(q = j\cdot B + q&amp;rsquo;\):
\(\rank(q) = b_j + \sum_{jB\leq i &amp;lt; jB+q&amp;rsquo;} t_i\).&lt;/li&gt;
&lt;li&gt;\(O(B/w) = O(512/64) = O(1)\) time.&lt;/li&gt;
&lt;li&gt;\(64/512 = 12.5\%\) space overhead.&lt;/li&gt;
&lt;li&gt;2 cache misses:
&lt;ul&gt;
&lt;li&gt;in \(n/8\) bit array: offset \(b_j\)&lt;/li&gt;
&lt;li&gt;in \(n\) bit array: block \(j\) bits&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
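&lt;p&gt;The block-based scheme above could be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the implementation behind these slides: the class name &lt;code&gt;BlockRank&lt;/code&gt;, the helper &lt;code&gt;popcount&lt;/code&gt;, and the choice to store the text as 64-bit words with the least significant bit first are all hypothetical.&lt;/p&gt;

```python
import random

W = 64   # word size in bits
B = 512  # block size in bits, i.e. 8 words


def popcount(x):
    """Number of 1-bits in a non-negative integer."""
    return bin(x).count("1")


class BlockRank:
    def __init__(self, words):
        # Assumed layout: words[k] holds text bits 64k .. 64k+63,
        # least significant bit first.
        self.words = words
        # offsets[j] = rank(j * B), the number of 1-bits before block j.
        self.offsets = [0]
        per_block = B // W
        for start in range(0, len(words), per_block):
            ones = sum(popcount(w) for w in words[start:start + per_block])
            self.offsets.append(self.offsets[-1] + ones)

    def rank(self, q):
        # Split q = j * B + rem, then rem = full * W + r.
        j, rem = divmod(q, B)
        full, r = divmod(rem, W)
        first = j * (B // W)
        # Start from the stored offset b_j ...
        count = self.offsets[j]
        # ... add the words of block j that lie entirely before q ...
        count += sum(popcount(w) for w in self.words[first:first + full])
        if r:
            # ... and the low r bits of the partially covered word.
            count += popcount(self.words[first + full] % 2**r)
        return count


# Compare against a direct count on a random 2560-bit text (5 blocks).
random.seed(1)
words = [random.getrandbits(64) for _ in range(40)]
index = BlockRank(words)
text = "".join(format(w, "064b")[::-1] for w in words)
for q in (0, 1, 63, 64, 511, 512, 1000, 2560):
    assert index.rank(q) == text[:q].count("1")
```

&lt;p&gt;Each query reads one stored offset and at most \(B/w = 8\) words of the text, which corresponds to the two memory accesses listed above.&lt;/p&gt;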
&lt;h2 id="levels"&gt;
 Reducing overhead: PastaWide [and others]
 &lt;a class="heading-link" href="#levels"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;2 levels:
&lt;ul&gt;
&lt;li&gt;L2 with 16-bit &lt;em&gt;delta&lt;/em&gt; every block: 3.125% overhead&lt;/li&gt;
&lt;li&gt;L1 with 64-bit &lt;em&gt;offset&lt;/em&gt; every 128 blocks: 0.1% overhead&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;figure class="post-large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/4-pasta.svg"&gt;
&lt;/figure&gt;

&lt;h2 id="spider"&gt;
 Reducing cache misses: SPIDER
 &lt;a class="heading-link" href="#spider"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Inline the 16-bit L2 deltas into each cache line
&lt;ul&gt;
&lt;li&gt;Remaining 0.1% overhead L1 array fits in cache.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;figure class="post-large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/5-spider.svg"&gt;
&lt;/figure&gt;

&lt;h2 id="pairing"&gt;
 Reducing the popcount: Pairing
 &lt;a class="heading-link" href="#pairing"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Delta is to the &lt;em&gt;middle&lt;/em&gt; instead of &lt;em&gt;start&lt;/em&gt; of a block.
&lt;ul&gt;
&lt;li&gt;Count only 256 bits in first or second half of block.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;figure class="post-large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/6-pairing.svg"&gt;
&lt;/figure&gt;

&lt;h2 id="birank"&gt;
 All together now: BiRank
 &lt;a class="heading-link" href="#birank"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Inline 16-bit deltas&lt;/li&gt;
&lt;li&gt;Pairing&lt;/li&gt;
&lt;li&gt;32-bit &lt;em&gt;reduced&lt;/em&gt; L1 offsets: 0.05% overhead
&lt;ul&gt;
&lt;li&gt;Low 11 bits are stored in deltas&lt;/li&gt;
&lt;li&gt;Input up to \(2^{43}\) bits&lt;/li&gt;
&lt;li&gt;16 MiB cache supports 32 GiB input&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;figure class="post-large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/7-birank.svg"&gt;
&lt;/figure&gt;

&lt;h2 id="metric"&gt;
 What is &lt;em&gt;fast&lt;/em&gt;?
 &lt;a class="heading-link" href="#metric"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;Latency: 80 ns/q&lt;/p&gt;</description></item><item><title>QuadRank: Engineering a High Throughput Rank</title><link>https://curiouscoding.nl/posts/quadrank/</link><pubDate>Thu, 29 Jan 2026 00:00:00 +0100</pubDate><guid>https://curiouscoding.nl/posts/quadrank/</guid><description>&lt;div class="ox-hugo-toc toc has-section-numbers"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#abstract" &gt;Abstract&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1&lt;/span&gt; &lt;a href="#introduction" &gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2&lt;/span&gt; &lt;a href="#background" &gt;Background&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.1&lt;/span&gt; &lt;a href="#implementations" &gt;Further implementations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3&lt;/span&gt; &lt;a href="#birank" &gt;BiRank&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.1&lt;/span&gt; &lt;a href="#variants" &gt;Variants&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4&lt;/span&gt; &lt;a href="#quadrank" &gt;QuadRank&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.1&lt;/span&gt; &lt;a href="#variants" &gt;Variants&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5&lt;/span&gt; &lt;a href="#results" &gt;Results&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.1&lt;/span&gt; &lt;a href="#evals-birank" &gt;BiRank&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.2&lt;/span&gt; &lt;a href="#evals-quadrank" &gt;QuadRank&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6&lt;/span&gt; &lt;a href="#conclusion" &gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7&lt;/span&gt; &lt;a href="#acknowledgements" &gt;Acknowledgements&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8&lt;/span&gt; &lt;a href="#snippets" &gt;Code snippets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;9&lt;/span&gt; &lt;a href="#pairing-math" &gt;Pairing superblocks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;10&lt;/span&gt; &lt;a href="#additional-results" &gt;Additional results&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;10.1&lt;/span&gt; &lt;a href="#cache-misses" &gt;Cache misses per query&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;10.2&lt;/span&gt; &lt;a href="#small-n" &gt;Throughput for small inputs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;10.3&lt;/span&gt; &lt;a href="#epyc" &gt;AMD EPYC evals&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;11&lt;/span&gt; &lt;a href="#fm-index" &gt;QuadFm: A Batching FM-index&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;11.1&lt;/span&gt; &lt;a href="#results" &gt;Results&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;12&lt;/span&gt; &lt;a href="#further-code-optimization-ideas" &gt;Further code optimization ideas&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;\[\newcommand{\rank}{\mathsf{rank}}
\newcommand{\rankone}{\mathsf{rank}}
\newcommand{\rankall}{\mathsf{rank_4}}\]&lt;/p&gt;</description></item><item><title>Trying to understand DDR memory</title><link>https://curiouscoding.nl/posts/ddr/</link><pubDate>Tue, 20 Jan 2026 00:00:00 +0100</pubDate><guid>https://curiouscoding.nl/posts/ddr/</guid><description>&lt;div class="ox-hugo-toc toc has-section-numbers"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1&lt;/span&gt; &lt;a href="#questions" &gt;Questions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2&lt;/span&gt; &lt;a href="#a-load-of-articles-blogs-pages-to-read" &gt;A load of articles/blogs/pages to read&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.1&lt;/span&gt; &lt;a href="#wikipedia-articles" &gt;Wikipedia articles&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.2&lt;/span&gt; &lt;a href="#more-posts" &gt;More posts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.3&lt;/span&gt; &lt;a href="#notes" &gt;Notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.4&lt;/span&gt; &lt;a href="#my-own-ram" &gt;My own RAM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.5&lt;/span&gt; &lt;a href="#continued-notes" &gt;Continued notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.6&lt;/span&gt; &lt;a href="#address-mapping-notation" &gt;Address mapping notation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.7&lt;/span&gt; &lt;a href="#intel-spec" &gt;Intel spec&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.8&lt;/span&gt; &lt;a href="#rank-interleaving" &gt;Rank interleaving&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.9&lt;/span&gt; &lt;a href="#nontemporal-reads-writes" &gt;Nontemporal reads/writes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3&lt;/span&gt; &lt;a href="#remap-using-performance-counters" &gt;reMap: using Performance counters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4&lt;/span&gt; &lt;a href="#sudoku" &gt;Sudoku&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.1&lt;/span&gt; &lt;a href="#step-1-dram-addressing-functions" &gt;Step 1: DRAM addressing functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.2&lt;/span&gt; &lt;a href="#step-2-row-column-bits" &gt;Step 2: row/column bits&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.3&lt;/span&gt; &lt;a href="#step-3-validation" &gt;Step 3: validation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.4&lt;/span&gt; &lt;a href="#step-4-which-function-is-what" &gt;Step 4: which function is what?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.5&lt;/span&gt; &lt;a href="#refreshes" &gt;Refreshes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.6&lt;/span&gt; &lt;a href="#consecutive-accesses" &gt;Consecutive Accesses&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5&lt;/span&gt; &lt;a href="#sudoku-now-with-only-1-dimm" &gt;Sudoku, now with only 1 DIMM&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.1&lt;/span&gt; &lt;a href="#setup" &gt;setup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.2&lt;/span&gt; &lt;a href="#1-dot-reverse-functions" &gt;1. reverse functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.3&lt;/span&gt; &lt;a href="#2-dot-identify-bits" &gt;2. identify bits&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.4&lt;/span&gt; &lt;a href="#3-dot-validate-mapping" &gt;3. validate mapping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.5&lt;/span&gt; &lt;a href="#4-dot-decompose-functions" &gt;4. decompose functions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6&lt;/span&gt; &lt;a href="#results" &gt;Final results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7&lt;/span&gt; &lt;a href="#decode-dimms" &gt;&lt;code&gt;decode-dimms&lt;/code&gt;&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1&lt;/span&gt; &lt;a href="#bank-groups" &gt;Bank groups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.2&lt;/span&gt; &lt;a href="#refresh" &gt;Refresh&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.3&lt;/span&gt; &lt;a href="#random-access-throughput" &gt;Random access throughput&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8&lt;/span&gt; &lt;a href="#cpu-benchmarks" &gt;CPU benchmarks&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.1&lt;/span&gt; &lt;a href="#cpu-benchmarks" &gt;cpu-benchmarks&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.1.1&lt;/span&gt; &lt;a href="#random-access-throughput-1-dimm" &gt;random access throughput 1 DIMM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.1.2&lt;/span&gt; &lt;a href="#random-access-throughput-2-dimm" &gt;random access throughput 2 DIMM&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.2&lt;/span&gt; &lt;a href="#memory-read-experiment" &gt;memory-read-experiment&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.2.1&lt;/span&gt; &lt;a href="#strided-reading-1-dimm" &gt;strided reading 1 DIMM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.2.2&lt;/span&gt; &lt;a href="#strided-reading-2-dimm" &gt;strided reading 2 DIMM&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;9&lt;/span&gt; &lt;a href="#tinymembench" &gt;&lt;code&gt;tinymembench&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;10&lt;/span&gt; &lt;a href="#remaining-questions" &gt;Remaining questions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;These are chronological (and thus only lightly organized) notes on my attempt to
understand how DDR4 and DDR5 memory work.&lt;/p&gt;</description></item><item><title> Quotes from "The Evolution of Mathematical Software"</title><link>https://curiouscoding.nl/posts/evolution-of-mathematical-software/</link><pubDate>Sun, 18 Jan 2026 00:00:00 +0100</pubDate><guid>https://curiouscoding.nl/posts/evolution-of-mathematical-software/</guid><description>&lt;p&gt;These are some nice quotes from
&lt;a href="#citeproc_bib_item_1"&gt;“The Evolution of Mathematical Software”&lt;/a&gt;, Turing Lecture by the 2021
Turing Award winner Jack J. Dongarra, which talks about
algorithm and software development in the context of ever-improving hardware.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;a large infrastructure of mathematical libraries [&amp;hellip;] that must be maintained,
ported, and enhanced for many years to come if the value of the application
codes that depend on it are to be preserved and extended.
The software that encapsulates all this time, energy, and thought, routinely
outlasts the hardware it was originally designed to run on.&lt;/p&gt;</description></item><item><title>QuickHeap: the fastest priority queue</title><link>https://curiouscoding.nl/posts/quickheap/</link><pubDate>Mon, 11 Aug 2025 00:00:00 +0200</pubDate><guid>https://curiouscoding.nl/posts/quickheap/</guid><description>&lt;div class="ox-hugo-toc toc has-section-numbers"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1&lt;/span&gt; &lt;a href="#background" &gt;Background&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#priority-queue" &gt;Priority queue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#binary-heap" &gt;Binary heap&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#d-ary-heaps" &gt;D-ary heaps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#other-heaps" &gt;Other heaps&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2&lt;/span&gt; &lt;a href="#literature-on-quickheaps" &gt;Literature on Quickheaps&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#optimal-incremental-sorting" &gt;Optimal incremental sorting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#quickheap" &gt;Quickheap&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#randomized-quickheaps" &gt;Randomized quickheaps&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3&lt;/span&gt; &lt;a href="#implementation" &gt;Bucket-based implementation&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#data-structure" &gt;Data structure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#push" &gt;Push&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#pop" &gt;Pop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#partition" &gt;Partition&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4&lt;/span&gt; &lt;a href="#results" &gt;Results&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#libraries" &gt;Libraries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#datasets" &gt;Datasets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#results" &gt;Results&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5&lt;/span&gt; &lt;a href="#conclusion" &gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;&lt;strong&gt;Backlinks&lt;/strong&gt;: &lt;a href="https://bsky.app/profile/curiouscoding.nl/post/3mko6y5k2f22w" class="external-link" target="_blank" rel="noopener"&gt;bsky&lt;/a&gt;, &lt;a href="https://x.com/curious_coding/status/2049441073718518046" class="external-link" target="_blank" rel="noopener"&gt;X&lt;/a&gt;, &lt;a href="https://news.ycombinator.com/item?id=47994575" class="external-link" target="_blank" rel="noopener"&gt;hacker news&lt;/a&gt;, &lt;a href="https://lobste.rs/s/85wpyb/quickheap_fastest_comparison_based_heap" class="external-link" target="_blank" rel="noopener"&gt;lobste.rs&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;A preprint on the SimdQuickHeap (&lt;a href="https://github.com/ragnargrootkoerkamp/quickheap" class="external-link" target="_blank" rel="noopener"&gt;github:RagnarGrootKoerkamp/QuickHeap&lt;/a&gt;) with the latest results can now be found on arXiv (&lt;a href="https://doi.org/10.48550/ARXIV.2604.25681" class="external-link" target="_blank" rel="noopener"&gt;DOI&lt;/a&gt;, &lt;a href="https://curiouscoding.nl/papers/simdquickheap-preprint.pdf" &gt;PDF&lt;/a&gt;):&lt;/p&gt;</description></item><item><title>Chunking for Fasta Parsing</title><link>https://curiouscoding.nl/posts/fasta-parsing/</link><pubDate>Wed, 06 Aug 2025 00:00:00 +0200</pubDate><guid>https://curiouscoding.nl/posts/fasta-parsing/</guid><description>&lt;div class="ox-hugo-toc toc has-section-numbers"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1&lt;/span&gt; &lt;a href="#minimizing-critical-sections" &gt;Minimizing critical sections&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2&lt;/span&gt; &lt;a href="#rabbitfx-chunking" &gt;RabbitFx: chunking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3&lt;/span&gt; &lt;a href="#different-chunking" &gt;Different chunking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4&lt;/span&gt; &lt;a href="#experiments" &gt;Experiments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5&lt;/span&gt; &lt;a href="#outlook" &gt;Outlook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6&lt;/span&gt; &lt;a href="#some-helicase-benchmarks" &gt;Some helicase benchmarks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;This is a quick note on some thoughts &amp;amp; experiments on fasta parsing, alongside &lt;a href="https://github.com/RagnarGrootKoerkamp/fasta-parsing-playground" class="external-link" target="_blank" rel="noopener"&gt;github:RagnarGrootKoerkamp/fasta-parsing-playground&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s a common complaint these days that Fasta parsing is slow.
A common parser is Needletail (&lt;a href="https://github.com/onecodex/needletail" class="external-link" target="_blank" rel="noopener"&gt;github&lt;/a&gt;), which builds on seq_io (&lt;a href="https://github.com/markschl/seq_io" class="external-link" target="_blank" rel="noopener"&gt;github&lt;/a&gt;).
Another recent one is paraseq (&lt;a href="https://github.com/noamteyssier/paraseq" class="external-link" target="_blank" rel="noopener"&gt;github&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Paraseq helps the user by providing an interface for parallel processing of
reads, see e.g. &lt;a href="https://github.com/noamteyssier/paraseq/blob/main/examples/multi_parallel.rs#L33" class="external-link" target="_blank" rel="noopener"&gt;this example&lt;/a&gt;.
Unfortunately, this still has a bottleneck: while the user&amp;rsquo;s processing of reads
is multi-threaded, the parsing itself is still single-threaded.&lt;/p&gt;</description></item><item><title>40x Faster Binary Search</title><link>https://curiouscoding.nl/slides/p99-text/</link><pubDate>Sun, 20 Jul 2025 00:00:00 +0200</pubDate><guid>https://curiouscoding.nl/slides/p99-text/</guid><description>&lt;figure&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/p99-title-slide.png"&gt;
&lt;/figure&gt;

&lt;h2 id="me"&gt;
 Ragnar Groot Koerkamp
 &lt;a class="heading-link" href="#me"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Did IMO &amp;amp; ICPC; currently head-of-jury for NWERC.&lt;/li&gt;
&lt;li&gt;Some time at Google.&lt;/li&gt;
&lt;li&gt;Quit and solved all of &lt;a href="https://projecteuler.net/" class="external-link" target="_blank" rel="noopener"&gt;projecteuler.net&lt;/a&gt; (700+) during Covid.&lt;/li&gt;
&lt;li&gt;Just finished PhD on &lt;em&gt;high throughput bioinformatics&lt;/em&gt; @ ETH Zurich.
&lt;ul&gt;
&lt;li&gt;Lots of sequenced DNA that needs processing.&lt;/li&gt;
&lt;li&gt;Many &lt;em&gt;static&lt;/em&gt; datasets, e.g. a 3GB human genome.&lt;/li&gt;
&lt;li&gt;Revisiting basic algorithms and optimizing them to the limit.&lt;/li&gt;
&lt;li&gt;Good in theory \(\neq\) fast in practice.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="problem-statement"&gt;
 Problem Statement
 &lt;a class="heading-link" href="#problem-statement"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Input: a static sorted list of 32 bit integers.&lt;/li&gt;
&lt;li&gt;Queries: given \(q\), find the smallest value in the list \(\geq q\).&lt;/li&gt;
&lt;/ul&gt;
&lt;!--listend--&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-rust" data-lang="rust"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;trait&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;SearchIndex&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// Initialize the data structure.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sorted_data&lt;/span&gt;: &lt;span class="kp"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&amp;gt; &lt;span class="nc"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// Return the smallest value &amp;gt;=q.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;: &lt;span class="kt"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&amp;gt; &lt;span class="kt"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;Why? E.g. searching through a suffix array.&lt;/li&gt;
&lt;li&gt;Previous work on Algorithmica:&lt;br /&gt;
&lt;a href="https://en.algorithmica.org/hpc/data-structures/s-tree/" class="external-link" target="_blank" rel="noopener"&gt;en.algorithmica.org/hpc/data-structures/s-tree&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;This work: &lt;a href="https://curiouscoding.nl/posts/static-search-tree/" class="external-link" target="_blank" rel="noopener"&gt;curiouscoding.nl/posts/static-search-tree&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="binary-search"&gt;
 Binary Search: complexity \(O(\lg n)\)
 &lt;a class="heading-link" href="#binary-search"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Compare \(q=5\) with the middle element \(x\).&lt;/li&gt;
&lt;li&gt;Recurse on left half if \(q\leq x\), right half if \(q&amp;gt;x\).&lt;/li&gt;
&lt;li&gt;End when 1 element is left, after \(\lceil\lg(n+1)\rceil\) steps.&lt;/li&gt;
&lt;/ol&gt;
&lt;figure class="large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/binary-search.svg"&gt;
&lt;/figure&gt;
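&lt;p&gt;The three steps above can be sketched in a few lines of Rust. This is only an illustration, not code from the talk: the names &lt;code&gt;LEN&lt;/code&gt; and &lt;code&gt;binary_search_geq&lt;/code&gt; are hypothetical, the input is a small fixed-size array passed by value to keep the example self-contained, and returning &lt;code&gt;u32::MAX&lt;/code&gt; when no value is large enough is an assumption made here.&lt;/p&gt;

```rust
// Hypothetical sketch, not taken from the slides.
const LEN: usize = 8;

/// Returns the smallest value >= q, or u32::MAX (an assumed sentinel)
/// when every value in `sorted` is smaller than q.
fn binary_search_geq(sorted: [u32; LEN], q: u32) -> u32 {
    // Invariant: all values before `lo` are smaller than q,
    // and all values from `hi` onwards are >= q.
    let mut lo = 0;
    let mut hi = LEN;
    while hi > lo {
        let mid = lo + (hi - lo) / 2;
        if sorted[mid] >= q {
            // q is at most the middle element: continue in the left half.
            hi = mid;
        } else {
            // q is larger than the middle element: continue in the right half.
            lo = mid + 1;
        }
    }
    if lo == LEN { u32::MAX } else { sorted[lo] }
}
```

&lt;p&gt;For the sorted values 1, 3, 4, 6, 8, 9, 12, 15, a query for 5 returns 6, and a query for 16 returns the sentinel.&lt;/p&gt;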

&lt;h3 id="binary-search-latency"&gt;
 Binary Search: latency is more than \(O(\lg n)\)
 &lt;a class="heading-link" href="#binary-search-latency"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;figure class="large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/bs-1.svg"&gt;
&lt;/figure&gt;

&lt;h3 id="array-indexing-latency"&gt;
 Array Indexing: \(O(n^{0.35})\) latency!
 &lt;a class="heading-link" href="#array-indexing-latency"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;figure class="large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/bs-2.svg"&gt;
&lt;/figure&gt;

&lt;h3 id="caches"&gt;
 Heap Layout: efficient caching + prefetching
 &lt;a class="heading-link" href="#caches"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Binary search: top of tree is spread out; each cache line has 1 value.&lt;/li&gt;
&lt;li&gt;Eytzinger layout: top layers of tree are clustered in cache lines.
&lt;ul&gt;
&lt;li&gt;Also allows prefetching!&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;figure class="large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/bs-eytzinger.svg"&gt;
&lt;/figure&gt;
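&lt;p&gt;To make the layout concrete, here is a minimal sketch of building and searching an Eytzinger array. It is not the implementation benchmarked in the talk: the names &lt;code&gt;LEN&lt;/code&gt;, &lt;code&gt;eytzinger&lt;/code&gt; and &lt;code&gt;eytzinger_search&lt;/code&gt; are hypothetical, the input size is assumed to be one less than a power of two so that the tree is perfect, and the search is a plain branchy loop without prefetching. Node \(i\) is stored at index \(i\), with its children at \(2i\) and \(2i+1\).&lt;/p&gt;

```rust
// Hypothetical sketch, not taken from the slides.
// LEN is assumed to be 2^k - 1, so that the implicit tree is perfect.
const LEN: usize = 7;

/// Reorders a sorted array into Eytzinger (breadth-first) order.
/// Slot 0 is unused; the root lives at index 1.
fn eytzinger(sorted: [u32; LEN]) -> [u32; LEN + 1] {
    let mut tree = [0; LEN + 1];
    for node in 1..=LEN {
        let depth = node.ilog2(); // layer of this node, root = 0
        let width = 2usize.pow(depth); // number of nodes in this layer
        let pos = node - width; // position of the node within its layer
        // In-order rank of the node in a perfect tree with LEN nodes.
        let rank = (2 * pos + 1) * (LEN + 1) / (2 * width) - 1;
        tree[node] = sorted[rank];
    }
    tree
}

/// Returns the smallest value >= q, or u32::MAX (an assumed sentinel).
fn eytzinger_search(tree: [u32; LEN + 1], q: u32) -> u32 {
    let mut node = 1;
    let mut best = u32::MAX;
    while LEN >= node {
        if tree[node] >= q {
            // Candidate answer; a smaller candidate can only be in the left subtree.
            best = tree[node];
            node = 2 * node;
        } else {
            node = 2 * node + 1;
        }
    }
    best
}
```

&lt;p&gt;For the sorted values 1, 3, 4, 6, 8, 9, 12 the layout becomes 6, 3, 9, 1, 4, 8, 12 (after the unused slot 0), so the first three values visited by any query sit next to each other in memory.&lt;/p&gt;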

&lt;h3 id="eytzinger"&gt;
 Heap Layout: close to array indexing!
 &lt;a class="heading-link" href="#eytzinger"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;figure class="large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/bs-3.svg"&gt;
&lt;/figure&gt;

&lt;h2 id="static-search-trees"&gt;
 Static Search Trees / B-trees
 &lt;a class="heading-link" href="#static-search-trees"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Fully use each cache line.&lt;/li&gt;
&lt;/ul&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-rust" data-lang="rust"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#[repr(align(64))]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// Each block fills exactly one 512-bit cache line.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;Block&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;: &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;Tree&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tree&lt;/span&gt;: &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Block&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// Blocks.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;offsets&lt;/span&gt;: &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;usize&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// Index of first block in each layer.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;figure class="large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/full.svg"&gt;
&lt;/figure&gt;

&lt;h3 id="btree-plot"&gt;
 Static Search Trees: Slower than Eytzinger?!
 &lt;a class="heading-link" href="#btree-plot"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;figure class="large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/2-find-linear.svg"&gt;
&lt;/figure&gt;

&lt;h2 id="asm"&gt;
 Up next: assembly-level optimizations :)
 &lt;a class="heading-link" href="#asm"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;figure class="full"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/p99-section.png"&gt;
&lt;/figure&gt;

&lt;h3 id="find-baseline"&gt;
 Optimizing &lt;code&gt;find&lt;/code&gt;: linear scan baseline
 &lt;a class="heading-link" href="#find-baseline"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-rust" data-lang="rust"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;find_linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;: &lt;span class="kp"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nc"&gt;Block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;: &lt;span class="kt"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&amp;gt; &lt;span class="kt"&gt;usize&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// Early break causes branch mispredictions
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// and prevents auto-vectorization!
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id="find-simd"&gt;
 Optimizing &lt;code&gt;find&lt;/code&gt;: auto-vectorization
 &lt;a class="heading-link" href="#find-simd"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-rust" data-lang="rust"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;find_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;: &lt;span class="kp"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nc"&gt;Block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;: &lt;span class="kt"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&amp;gt; &lt;span class="kt"&gt;usize&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;+=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-asm" data-lang="asm"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vmovdqu&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;%rax&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nv"&gt;%rcx&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nv"&gt;%ymm1&lt;/span&gt; &lt;span class="c1"&gt;; load data[..8]
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vmovdqu&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;%rax&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nv"&gt;%rcx&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nv"&gt;%ymm2&lt;/span&gt; &lt;span class="c1"&gt;; load data[8..]
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpbroadcastd&lt;/span&gt; &lt;span class="nv"&gt;%xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt; &lt;span class="c1"&gt;; &amp;#39;splat&amp;#39; the query value
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpmaxud&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm3&lt;/span&gt; &lt;span class="c1"&gt;; v
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpcmpeqd&lt;/span&gt; &lt;span class="nv"&gt;%ymm3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm2&lt;/span&gt; &lt;span class="c1"&gt;; v
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpmaxud&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt; &lt;span class="c1"&gt;; v
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpcmpeqd&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt; &lt;span class="c1"&gt;; 4x compare query with values
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpackssdw&lt;/span&gt; &lt;span class="nv"&gt;%ymm2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt; &lt;span class="c1"&gt;;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpcmpeqd&lt;/span&gt; &lt;span class="nv"&gt;%ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm1&lt;/span&gt; &lt;span class="c1"&gt;; v
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpxor&lt;/span&gt; &lt;span class="nv"&gt;%ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt; &lt;span class="c1"&gt;; 2x negate result
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vextracti128&lt;/span&gt; &lt;span class="no"&gt;$1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%xmm1&lt;/span&gt; &lt;span class="c1"&gt;; v
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpacksswb&lt;/span&gt; &lt;span class="nv"&gt;%xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%xmm0&lt;/span&gt; &lt;span class="c1"&gt;; v
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpshufd&lt;/span&gt; &lt;span class="no"&gt;$216&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%xmm0&lt;/span&gt; &lt;span class="c1"&gt;; v
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpmovmskb&lt;/span&gt; &lt;span class="nv"&gt;%xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ecx&lt;/span&gt; &lt;span class="c1"&gt;; 4x extract mask
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;popcntl&lt;/span&gt; &lt;span class="nv"&gt;%ecx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ecx&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id="find-popcount"&gt;
 Optimizing &lt;code&gt;find&lt;/code&gt;: popcount
 &lt;a class="heading-link" href="#find-popcount"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-rust" data-lang="rust"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#![feature(portable_simd)]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;find_popcount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;: &lt;span class="kp"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nc"&gt;Block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;: &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&amp;gt; &lt;span class="kt"&gt;usize&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;: &lt;span class="nc"&gt;Simd&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Simd&lt;/span&gt;::&lt;span class="n"&gt;from_slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Simd&lt;/span&gt;::&lt;span class="n"&gt;splat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;simd_lt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// x[i] &amp;lt; q
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_bitmask&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;count_ones&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;usize&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// popcount: number of keys smaller than q
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;span class="lnt"&gt;2
&lt;/span&gt;&lt;span class="lnt"&gt;3
&lt;/span&gt;&lt;span class="lnt"&gt;4
&lt;/span&gt;&lt;span class="lnt"&gt;5
&lt;/span&gt;&lt;span class="lnt"&gt;6
&lt;/span&gt;&lt;span class="lnt"&gt;7
&lt;/span&gt;&lt;span class="lnt"&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-asm" data-lang="asm"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpcmpgtd&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;%rsi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nv"&gt;%rdi&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpcmpgtd&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;%rsi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nv"&gt;%rdi&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpackssdw&lt;/span&gt; &lt;span class="nv"&gt;%ymm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt; &lt;span class="c1"&gt;; interleave 16bit low halves
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vextracti128&lt;/span&gt; &lt;span class="no"&gt;$1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%ymm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%xmm1&lt;/span&gt; &lt;span class="c1"&gt;; 1
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpacksswb&lt;/span&gt; &lt;span class="nv"&gt;%xmm1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%xmm0&lt;/span&gt; &lt;span class="c1"&gt;; 2
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpshufd&lt;/span&gt; &lt;span class="no"&gt;$216&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%xmm0&lt;/span&gt; &lt;span class="c1"&gt;; 3 unshuffle interleaving
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;vpmovmskb&lt;/span&gt; &lt;span class="nv"&gt;%xmm0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%edi&lt;/span&gt; &lt;span class="c1"&gt;; 4 instructions to extract bitmask
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nf"&gt;popcntl&lt;/span&gt; &lt;span class="nv"&gt;%edi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;%edi&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id="find-manual"&gt;
 Optimizing &lt;code&gt;find&lt;/code&gt;: manual &lt;code&gt;movemask_epi8&lt;/code&gt;
 &lt;a class="heading-link" href="#find-manual"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-rust" data-lang="rust"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;find_manual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;: &lt;span class="kp"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nc"&gt;Block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;: &lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&amp;gt; &lt;span class="kt"&gt;usize&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// i32 now!
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;low&lt;/span&gt;: &lt;span class="nc"&gt;Simd&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Simd&lt;/span&gt;::&lt;span class="n"&gt;from_slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;high&lt;/span&gt;: &lt;span class="nc"&gt;Simd&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Simd&lt;/span&gt;::&lt;span class="n"&gt;from_slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;q_simd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Simd&lt;/span&gt;::&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;::&lt;span class="n"&gt;splat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask_low&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;q_simd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;simd_gt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;low&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// compare with low half
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask_high&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;q_simd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;simd_gt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;high&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// compare with high half
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;_mm256_packs_epi32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask_low&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask_high&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// interleave
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;: &lt;span class="kt"&gt;i32&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;_mm256_movemask_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// movemask_epi8!
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count_ones&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kt"&gt;usize&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="c1"&gt;// correct for double-counting
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-diff" data-lang="diff"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; vpcmpgtd (%rsi,%rdi), %ymm0, %ymm1 ; 1 cycle, in parallel
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; vpcmpgtd 32(%rsi,%rdi), %ymm0, %ymm0 ; 1 cycle, in parallel
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; vpackssdw %ymm0, %ymm1, %ymm0 ; 1 cycle
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gd"&gt;-vextracti128 $1, %ymm0, %xmm1 ;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gd"&gt;-vpacksswb %xmm1, %xmm0, %xmm0 ;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gd"&gt;-vpshufd $216, %xmm0, %xmm0 ;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gd"&gt;-vpmovmskb %xmm0, %edi ; 4 instructions emulating movemask_epi16
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gi"&gt;+vpmovmskb %ymm0, %edi ; 2 cycles movemask_epi8
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; popcntl %edi, %edi ; 1 cycle
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gi"&gt;+ ; /2 is folded into pointer arithmetic
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id="optimized-find-plot"&gt;
 The results: branchless is great!
 &lt;a class="heading-link" href="#optimized-find-plot"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;figure class="large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/3-find.svg"&gt;
&lt;/figure&gt;
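&lt;p&gt;The "correct for double-counting" step in &lt;code&gt;find_manual&lt;/code&gt; can be checked without any intrinsics. The sketch below is a scalar emulation, not the SIMD code itself: it assumes 16 keys per node and models the packed mask as two identical bits per key, which is what a byte-wise movemask sees once the 32-bit compare results are packed into 16-bit lanes. The real pack instruction also permutes the lanes, but a popcount does not depend on bit order. The function names are hypothetical.&lt;/p&gt;

```rust
/// Scalar reference: number of keys strictly smaller than `q`.
fn find_scalar(keys: [i32; 16], q: i32) -> usize {
    keys.iter().filter(|k| q > **k).count()
}

/// Emulates the byte-wise movemask over 16-bit lanes:
/// every key smaller than `q` sets two adjacent bits in a 32-bit mask.
fn find_movemask_emulated(keys: [i32; 16], q: i32) -> usize {
    let mut mask: u32 = 0;
    for (lane, key) in keys.iter().enumerate() {
        if q > *key {
            // Set both byte-level bits that belong to this 16-bit lane.
            mask |= 0b11u32.wrapping_shl(2 * lane as u32);
        }
    }
    // Each matching key contributed two bits, so halve the popcount.
    (mask.count_ones() / 2) as usize
}
```

&lt;p&gt;For any node and query, both functions should return the same count, which is why the vectorized version can skip the four-instruction bitmask extraction and divide the popcount by two instead.&lt;/p&gt;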

&lt;h2 id="throughput"&gt;
 Throughput, not latency
 &lt;a class="heading-link" href="#throughput"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;figure class="full"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/p99-section.png"&gt;
&lt;/figure&gt;

&lt;h3 id="batching"&gt;
 Batching: Many queries in parallel
 &lt;a class="heading-link" href="#batching"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;span class="lnt"&gt;17
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-diff" data-lang="diff"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gd"&gt;-fn query (&amp;amp;self, q : u32 ) -&amp;gt; u32 {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gi"&gt;+fn batch&amp;lt;const P: usize&amp;gt;(&amp;amp;self, qb: &amp;amp;[u32; P]) -&amp;gt; [u32; P] {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gd"&gt;- let mut k = 0 ; // current index
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gi"&gt;+ let mut k = [0; P]; // current indices
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; for [o, _o2] in self.offsets.array_windows() { // walk down the tree
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gi"&gt;+ for i in 0..P {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; let jump_to = self.node(o + k[i]).find(qb[i]); // call `find`
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; k[i] = k[i] * (B + 1) + jump_to; // update index
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gi"&gt;+ }
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; let o = self.offsets.last().unwrap();
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gi"&gt;+ from_fn(|i| {
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; let idx = self.node(o + k[i]).find(qb[i]);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; self.tree[o + k[i] + idx / N].data[idx % N] // return value
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gi"&gt;+ })
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id="batching-plot"&gt;
 Batching: up to 2.5x faster!
 &lt;a class="heading-link" href="#batching-plot"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;figure class="large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/4-batching.svg"&gt;
&lt;/figure&gt;

&lt;h3 id="prefetching"&gt;
 Prefetching
 &lt;a class="heading-link" href="#prefetching"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt; 1
&lt;/span&gt;&lt;span class="lnt"&gt; 2
&lt;/span&gt;&lt;span class="lnt"&gt; 3
&lt;/span&gt;&lt;span class="lnt"&gt; 4
&lt;/span&gt;&lt;span class="lnt"&gt; 5
&lt;/span&gt;&lt;span class="lnt"&gt; 6
&lt;/span&gt;&lt;span class="lnt"&gt; 7
&lt;/span&gt;&lt;span class="lnt"&gt; 8
&lt;/span&gt;&lt;span class="lnt"&gt; 9
&lt;/span&gt;&lt;span class="lnt"&gt;10
&lt;/span&gt;&lt;span class="lnt"&gt;11
&lt;/span&gt;&lt;span class="lnt"&gt;12
&lt;/span&gt;&lt;span class="lnt"&gt;13
&lt;/span&gt;&lt;span class="lnt"&gt;14
&lt;/span&gt;&lt;span class="lnt"&gt;15
&lt;/span&gt;&lt;span class="lnt"&gt;16
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-diff" data-lang="diff"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; fn batch&amp;lt;const P: usize&amp;gt;(&amp;amp;self, qb: &amp;amp;[u32; P]) -&amp;gt; [u32; P] {
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; let mut k = [0; P]; // current indices
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; for [o, o2] in self.offsets.array_windows() { // walk down the tree
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; for i in 0..P {
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; let jump_to = self.node(o + k[i]).find(qb[i]); // call `find`
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; k[i] = k[i] * (B + 1) + jump_to; // update index
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="gi"&gt;+ prefetch_index(&amp;amp;self.tree, o2 + k[i]); // prefetch next layer
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; let o = self.offsets.last().unwrap();
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; from_fn(|i| {
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; let idx = self.node(o + k[i]).find(qb[i]);
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; self.tree[o + k[i] + idx / N].data[idx % N] // return value
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; })
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; }
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id="prefetching-plot"&gt;
 Prefetching: 30% faster again!
 &lt;a class="heading-link" href="#prefetching-plot"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;figure class="large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/5-prefetch.svg"&gt;
&lt;/figure&gt;

&lt;h3 id="pointer-arithmetic-plot"&gt;
 Optimizing pointer arithmetic: more gains
 &lt;a class="heading-link" href="#pointer-arithmetic-plot"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Keep all offsets in bytes rather than node indices, to avoid unit conversions on every access.&lt;/li&gt;
&lt;/ul&gt;
&lt;figure class="large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/6-improvements.svg"&gt;
&lt;/figure&gt;
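&lt;p&gt;One way to read the byte-unit idea, together with the "/2 is folded into pointer arithmetic" note in the assembly diff: if the current position is kept as a byte offset, the halving of the popcount and the scaling by the node size merge into a single multiplication. The sketch below is illustrative only. It assumes a branching factor of 16 and 64-byte nodes (16 keys of 4 bytes); &lt;code&gt;step_nodes&lt;/code&gt; and &lt;code&gt;step_bytes&lt;/code&gt; are hypothetical names, not functions from the original code.&lt;/p&gt;

```rust
const B: usize = 16; // assumed branching factor (keys per node)
const NODE_BYTES: usize = 64; // assumed node size: 16 keys of 4 bytes each

/// Index update in node units, as in the batch code:
/// `jump_to` is the number of keys smaller than the query.
fn step_nodes(k: usize, jump_to: usize) -> usize {
    k * (B + 1) + jump_to
}

/// The same update with the position kept in bytes.
/// `popcount` is the raw bit count of the byte-wise movemask, which holds
/// two bits per matching key, so the division by 2 and the multiplication
/// by 64 collapse into one multiplication by 32.
fn step_bytes(k_bytes: usize, popcount: usize) -> usize {
    k_bytes * (B + 1) + popcount * (NODE_BYTES / 2)
}
```

&lt;p&gt;Under these assumptions, &lt;code&gt;step_bytes(k * 64, 2 * j)&lt;/code&gt; equals &lt;code&gt;step_nodes(k, j) * 64&lt;/code&gt; for every node index &lt;code&gt;k&lt;/code&gt; and every count &lt;code&gt;j&lt;/code&gt;, so no separate conversion is needed before the next memory access.&lt;/p&gt;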

&lt;h3 id="interleaving-plot"&gt;
 Interleaving: more pressure on the RAM, -20%
 &lt;a class="heading-link" href="#interleaving-plot"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Interleave multiple batches at different iterations.&lt;/li&gt;
&lt;/ul&gt;
&lt;figure class="large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/8-interleave.svg"&gt;
&lt;/figure&gt;

&lt;h3 id="tree-layout-before"&gt;
 Tree layout: internal nodes store minima
 &lt;a class="heading-link" href="#tree-layout-before"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;For query 5.5, we walk down the &lt;strong&gt;left&lt;/strong&gt; subtree.&lt;/li&gt;
&lt;li&gt;Returning 6 reads a &lt;strong&gt;new&lt;/strong&gt; cache line.&lt;/li&gt;
&lt;/ul&gt;
&lt;figure class="large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/full.svg"&gt;
&lt;/figure&gt;

&lt;h3 id="tree-layout-after"&gt;
 Tree layout: internal nodes store maxima
 &lt;a class="heading-link" href="#tree-layout-after"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;For query 5.5, we walk down the &lt;strong&gt;middle&lt;/strong&gt; subtree.&lt;/li&gt;
&lt;li&gt;Returning 6 reads &lt;strong&gt;the same&lt;/strong&gt; cache line.&lt;/li&gt;
&lt;/ul&gt;
&lt;figure class="large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/flipped.svg"&gt;
&lt;/figure&gt;
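&lt;p&gt;The difference between the two layouts can be reproduced on a toy tree. This is a minimal sketch with made-up data: three leaves of two keys each and a root with two separators. It is not the tree from the figures, and &lt;code&gt;LEAVES&lt;/code&gt;, &lt;code&gt;find&lt;/code&gt; and &lt;code&gt;lookup&lt;/code&gt; are hypothetical names. It assumes the query does not exceed the largest key.&lt;/p&gt;

```rust
/// Toy leaf level: three nodes, each standing in for one cache line.
const LEAVES: [[f64; 2]; 3] = [[2.0, 4.0], [6.0, 8.0], [10.0, 12.0]];

/// Number of keys strictly smaller than `q`; doubles as the child index.
fn find(keys: [f64; 2], q: f64) -> usize {
    keys.iter().filter(|k| q > **k).count()
}

/// Looks up the smallest key that is at least `q`.
/// Returns (leaf visited during the descent, leaf holding the answer, answer).
fn lookup(separators: [f64; 2], q: f64) -> (usize, usize, f64) {
    let leaf = find(separators, q);
    let idx = find(LEAVES[leaf], q);
    // If every key in the visited leaf is smaller than `q`,
    // the answer is the first key of the next leaf: an extra node read.
    let (answer_leaf, pos) = if idx == 2 { (leaf + 1, 0) } else { (leaf, idx) };
    (leaf, answer_leaf, LEAVES[answer_leaf][pos])
}
```

&lt;p&gt;With minima as separators, &lt;code&gt;lookup([6.0, 10.0], 5.5)&lt;/code&gt; descends into leaf 0 but finds the answer 6 in leaf 1. With maxima as separators, &lt;code&gt;lookup([4.0, 8.0], 5.5)&lt;/code&gt; descends directly into leaf 1, where the answer already sits.&lt;/p&gt;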

&lt;h3 id="tree-layout-plot"&gt;
 Tree layout: another 10% gained!
 &lt;a class="heading-link" href="#tree-layout-plot"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h3&gt;
&lt;figure class="large"&gt;&lt;img src="https://curiouscoding.nl/ox-hugo/9-left-max-tree.svg"&gt;
&lt;/figure&gt;

&lt;h2 id="conclusion"&gt;
 Conclusion
 &lt;a class="heading-link" href="#conclusion"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;With 4GB input: 40x speedup!&lt;/p&gt;</description></item><item><title>High Throughput Bioinformatics</title><link>https://curiouscoding.nl/posts/throughput/</link><pubDate>Sun, 06 Apr 2025 00:00:00 +0200</pubDate><guid>https://curiouscoding.nl/posts/throughput/</guid><description>&lt;div class="ox-hugo-toc toc has-section-numbers"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1&lt;/span&gt; &lt;a href="#introduction" &gt;Introduction&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.1&lt;/span&gt; &lt;a href="#overview" &gt;Overview&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2&lt;/span&gt; &lt;a href="#compute-bound" &gt;Optimizing Compute Bound Code: Random Minimizers&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.1&lt;/span&gt; &lt;a href="#avoiding-branch-misses" &gt;Avoiding Branch Misses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.2&lt;/span&gt; &lt;a href="#simd-processing-in-parallel" &gt;SIMD: Processing In Parallel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.3&lt;/span&gt; &lt;a href="#instruction-level-parallelism" &gt;Instruction Level Parallelism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.4&lt;/span&gt; &lt;a href="#input-format" &gt;Input Format&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3&lt;/span&gt; &lt;a href="#memory-bound" &gt;Optimizing Memory Bound Code: Minimal Perfect Hashing&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.1&lt;/span&gt; &lt;a href="#using-less-memory" &gt;Using Less Memory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.2&lt;/span&gt; &lt;a href="#reducing-memory-accesses" &gt;Reducing Memory Accesses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.3&lt;/span&gt; &lt;a href="#interleaving-memory-accesses" &gt;Interleaving Memory Accesses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.4&lt;/span&gt; &lt;a href="#batching-streaming-and-prefetching" &gt;Batching, Streaming, and Prefetching&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;This is Chapter 10 of my &lt;a href="https://curiouscoding.nl/posts/thesis/" &gt;thesis&lt;/a&gt; (&lt;a href="#citeproc_bib_item_12"&gt;Groot Koerkamp 2025a&lt;/a&gt;), to introduce the last part on High Throughput Bioinformatics.&lt;/p&gt;</description></item><item><title>SimdSketch: a fast bucket sketch</title><link>https://curiouscoding.nl/posts/simd-sketch/</link><pubDate>Sun, 09 Mar 2025 00:00:00 +0100</pubDate><guid>https://curiouscoding.nl/posts/simd-sketch/</guid><description>&lt;div class="ox-hugo-toc toc has-section-numbers"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1&lt;/span&gt; &lt;a href="#jaccard-similarity" &gt;Jaccard similarity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2&lt;/span&gt; &lt;a href="#hash-schemes" &gt;Hash schemes&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.1&lt;/span&gt; &lt;a href="#minhash" &gt;MinHash&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.2&lt;/span&gt; &lt;a href="#s-mins-sketch" &gt;$s$-mins sketch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.3&lt;/span&gt; &lt;a href="#bottom-s" &gt;Bottom-\(s\) sketch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.4&lt;/span&gt; &lt;a href="#fracminhash" &gt;FracMinHash&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.5&lt;/span&gt; &lt;a href="#bucket-sketch" &gt;Bucket sketch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.6&lt;/span&gt; &lt;a href="#mod-bucket-hash--new" &gt;Mod-bucket hash (new?)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.7&lt;/span&gt; &lt;a href="#variants" &gt;Variants&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3&lt;/span&gt; &lt;a href="#compressing-sketches" &gt;Compressing sketches&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.1&lt;/span&gt; &lt;a href="#b-bit-hashing" &gt;$b$-bit hashing&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.1.1&lt;/span&gt; &lt;a href="#accounting-for-collisions" &gt;Accounting for collisions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.2&lt;/span&gt; &lt;a href="#hyperminhash" &gt;HyperMinHash&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4&lt;/span&gt; &lt;a href="#densification-strategies" &gt;Densification strategies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5&lt;/span&gt; &lt;a href="#simdsketch" &gt;SimdSketch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6&lt;/span&gt; &lt;a href="#evaluation" &gt;Evaluation&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.1&lt;/span&gt; &lt;a href="#setup" &gt;Setup&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.1.1&lt;/span&gt; &lt;a href="#tools" &gt;Tools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.1.2&lt;/span&gt; &lt;a href="#inputs" &gt;Inputs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.1.3&lt;/span&gt; &lt;a href="#parameters" &gt;Parameters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.1.4&lt;/span&gt; &lt;a href="#metrics" &gt;Metrics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.2&lt;/span&gt; &lt;a href="#raw-results" &gt;Raw results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.3&lt;/span&gt; &lt;a href="#correlation" &gt;Correlation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.4&lt;/span&gt; &lt;a href="#comparison-speed" &gt;Comparison speed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.5&lt;/span&gt; &lt;a href="#low-similarity-data" &gt;Low-similarity data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7&lt;/span&gt; &lt;a href="#discussion" &gt;Discussion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8&lt;/span&gt; &lt;a href="#future-work" &gt;&lt;span class="org-todo todo TODO"&gt;TODO&lt;/span&gt; / Future work&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;\[
\newcommand{\sketch}{\mathsf{sketch}}
\]&lt;/p&gt;</description></item><item><title>Static search trees: 40x faster than binary search</title><link>https://curiouscoding.nl/posts/static-search-tree/</link><pubDate>Wed, 18 Dec 2024 00:00:00 +0100</pubDate><guid>https://curiouscoding.nl/posts/static-search-tree/</guid><description>&lt;div class="ox-hugo-toc toc has-section-numbers"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1&lt;/span&gt; &lt;a href="#introduction" &gt;Introduction&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.1&lt;/span&gt; &lt;a href="#problem-statement" &gt;Problem statement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.2&lt;/span&gt; &lt;a href="#motivation" &gt;Motivation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.3&lt;/span&gt; &lt;a href="#recommended-reading" &gt;Recommended reading&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.4&lt;/span&gt; &lt;a href="#binary-search-and-eytzinger-layout" &gt;Binary search and Eytzinger layout&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.5&lt;/span&gt; &lt;a href="#hugepages" &gt;Hugepages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.6&lt;/span&gt; &lt;a href="#a-note-on-benchmarking" &gt;A note on benchmarking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.7&lt;/span&gt; &lt;a href="#cache-lines" &gt;Cache lines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.8&lt;/span&gt; &lt;a href="#s-trees-and-b-trees" &gt;S-trees and B-trees&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2&lt;/span&gt; &lt;a href="#optimizing-find" &gt;Optimizing &lt;code&gt;find&lt;/code&gt;&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.1&lt;/span&gt; &lt;a href="#linear" &gt;Linear&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.2&lt;/span&gt; &lt;a href="#auto-vectorization" &gt;Auto-vectorization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.3&lt;/span&gt; &lt;a href="#trailing-zeros" &gt;Trailing zeros&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.4&lt;/span&gt; &lt;a href="#popcount" &gt;Popcount&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.5&lt;/span&gt; &lt;a href="#manual-simd" &gt;Manual SIMD&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3&lt;/span&gt; &lt;a href="#optimizing-the-search" &gt;Optimizing the search&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.1&lt;/span&gt; &lt;a href="#batching" &gt;Batching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.2&lt;/span&gt; &lt;a href="#prefetching" &gt;Prefetching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.3&lt;/span&gt; &lt;a href="#pointer-arithmetic" &gt;Pointer arithmetic&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.3.1&lt;/span&gt; &lt;a href="#up-front-splat" &gt;Up-front splat&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.3.2&lt;/span&gt; &lt;a href="#byte-based-pointers" &gt;Byte-based pointers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.3.3&lt;/span&gt; &lt;a href="#the-final-version" &gt;The final version&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.4&lt;/span&gt; &lt;a href="#skip-prefetch" &gt;Skip prefetch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.5&lt;/span&gt; &lt;a href="#interleave" &gt;Interleave&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4&lt;/span&gt; &lt;a href="#optimizing-the-tree-layout" &gt;Optimizing the tree layout&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.1&lt;/span&gt; &lt;a href="#left-tree" &gt;Left-tree&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.2&lt;/span&gt; &lt;a href="#memory-layouts" &gt;Memory layouts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.3&lt;/span&gt; &lt;a href="#node-size-b-15" &gt;Node size \(B=15\)&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.3.1&lt;/span&gt; &lt;a href="#data-structure-size" &gt;Data structure size&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.4&lt;/span&gt; &lt;a href="#summary" &gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5&lt;/span&gt; &lt;a href="#prefix-partitioning" &gt;Prefix partitioning&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.1&lt;/span&gt; &lt;a href="#full-layout" &gt;Full layout&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.2&lt;/span&gt; &lt;a href="#compact-subtrees" &gt;Compact subtrees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.3&lt;/span&gt; &lt;a href="#the-best-of-both-compact-first-level" &gt;The best of both: compact first level&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.4&lt;/span&gt; &lt;a href="#overlapping-trees" &gt;Overlapping trees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.5&lt;/span&gt; &lt;a href="#human-data" &gt;Human data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.6&lt;/span&gt; &lt;a href="#prefix-map" &gt;Prefix map&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.7&lt;/span&gt; &lt;a href="#prefix-summary" &gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6&lt;/span&gt; &lt;a href="#multi-threaded-comparison" &gt;Multi-threaded comparison&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7&lt;/span&gt; &lt;a href="#conclusion" &gt;Conclusion&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1&lt;/span&gt; &lt;a href="#future-work" &gt;Future work&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1.1&lt;/span&gt; &lt;a href="#branchy-search" &gt;Branchy search&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1.2&lt;/span&gt; &lt;a href="#interpolation-search" &gt;Interpolation search&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1.3&lt;/span&gt; &lt;a href="#packing-data-smaller" &gt;Packing data smaller&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1.4&lt;/span&gt; &lt;a href="#returning-indices-in-original-data" &gt;Returning indices in original data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1.5&lt;/span&gt; &lt;a href="#range-queries" &gt;Range queries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1.6&lt;/span&gt; &lt;a href="#sorting-queries" &gt;Sorting queries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1.7&lt;/span&gt; &lt;a href="#suffix-array-searching" &gt;Suffix array searching&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;In this post, we will implement a static search tree (S+ tree) for
high-throughput searching of sorted data, as &lt;a href="https://en.algorithmica.org/hpc/data-structures/s-tree/" class="external-link" target="_blank" rel="noopener"&gt;introduced&lt;/a&gt; on Algorithmica.
We&amp;rsquo;ll take the code presented there as a starting point and optimize it
to its limits. To a large extent, this means implementing the &amp;lsquo;future work&amp;rsquo; ideas of that post.
We will also spend a fair amount of time reading assembly
code to shave off every instruction we can.
Lastly, there will be one big addition to optimize throughput: &lt;em&gt;batching&lt;/em&gt;.&lt;/p&gt;</description></item><item><title>SimdMinimizers: Computing random minimizers, fast</title><link>https://curiouscoding.nl/posts/simd-minimizers/</link><pubDate>Fri, 12 Jul 2024 00:00:00 +0200</pubDate><guid>https://curiouscoding.nl/posts/simd-minimizers/</guid><description>&lt;div class="ox-hugo-toc toc has-section-numbers"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1&lt;/span&gt; &lt;a href="#introduction" &gt;Introduction&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.1&lt;/span&gt; &lt;a href="#intro-results" &gt;Results&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2&lt;/span&gt; &lt;a href="#random-minimizers" &gt;Random minimizers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3&lt;/span&gt; &lt;a href="#algorithms" &gt;Algorithms&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.1&lt;/span&gt; &lt;a href="#problem-statement" &gt;Problem statement&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#problem-a-only-the-set-of-minimizers" &gt;Problem A: Only the set of minimizers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#problem-b-the-minimizer-of-each-window" &gt;Problem B: The minimizer of each window&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#problem-c-super-k-mers" &gt;Problem C: Super-k-mers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#which-problem-to-solve" &gt;Which problem to solve&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#canonical-k-mers" &gt;Canonical k-mers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.2&lt;/span&gt; &lt;a href="#the-naive-algorithm" &gt;The naive algorithm&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#naive-performance" &gt;Performance characteristics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.3&lt;/span&gt; &lt;a href="#rephrasing-as-sliding-window-minimum" &gt;Rephrasing as sliding window minimum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.4&lt;/span&gt; &lt;a href="#the-queue" &gt;The queue&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#queue-performance" &gt;Performance characteristics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.5&lt;/span&gt; &lt;a href="#jumping-away-with-the-queue" &gt;Jumping: Away with the queue&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#jumping-performance" &gt;Performance characteristics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.6&lt;/span&gt; &lt;a href="#re-scan" &gt;Re-scan&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#rescan-performance" &gt;Performance characteristics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.7&lt;/span&gt; &lt;a href="#split-windows" &gt;Split windows&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#split-perfomance" &gt;Performance characteristics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4&lt;/span&gt; &lt;a href="#analysing-what-we-have-so-far" &gt;Analysing what we have so far&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.1&lt;/span&gt; &lt;a href="#counting-comparisons" &gt;Counting comparisons&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#open-problem-theoretical-lower-bounds" &gt;Open problem: Theoretical lower bounds&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.2&lt;/span&gt; &lt;a href="#setting-up-benchmarking" &gt;Setting up benchmarking&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#adding-criterion" &gt;Adding criterion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#making-criterion-fast" &gt;Making criterion fast&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-note-on-cpu-frequency" &gt;A note on CPU frequency&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.3&lt;/span&gt; &lt;a href="#runtime-comparison-with-other-implementations" &gt;Runtime comparison with other implementations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.4&lt;/span&gt; &lt;a href="#deeper-inspection-using-perf-stat" &gt;Deeper inspection using &lt;code&gt;perf stat&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.5&lt;/span&gt; &lt;a href="#a-first-optimization-pass" &gt;A first optimization pass&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#optimizing-buffered-reducing-branch-misses" &gt;Optimizing &lt;code&gt;Buffered&lt;/code&gt;: reducing branch misses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#queue-is-hopelessly-branchy" &gt;&lt;code&gt;Queue&lt;/code&gt; is hopelessly branchy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#jumping-is-already-very-efficient" &gt;&lt;code&gt;Jumping&lt;/code&gt; is already very efficient&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#optimizing-rescan" &gt;Optimizing &lt;code&gt;Rescan&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#optimizing-split" &gt;Optimizing &lt;code&gt;Split&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.6&lt;/span&gt; &lt;a href="#a-new-performance-comparison" &gt;A new performance comparison&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5&lt;/span&gt; &lt;a href="#rolling-our-own-hash" &gt;Rolling our own hash&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.1&lt;/span&gt; &lt;a href="#fxhash" &gt;FxHash&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#wyhash" &gt;WyHash&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.2&lt;/span&gt; &lt;a href="#nthash-a-rolling-hash" &gt;NtHash: a rolling hash&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#the-nthash-crate" &gt;The &lt;code&gt;nthash&lt;/code&gt; crate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#buffered-hash-values" &gt;Buffered hash values&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.3&lt;/span&gt; &lt;a href="#making-nthash-fast-going-branchless" &gt;Making ntHash fast: going branchless&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#drop-sanity-checks" &gt;Drop sanity checks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#drop-bound-checks" &gt;Drop bound checks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#efficiently-collecting-to-a-vector" &gt;Efficiently collecting to a vector&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.4&lt;/span&gt; &lt;a href="#rolling-a-bit-less" &gt;Rolling a bit less&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#analysing-the-assembly-code" &gt;Analysing the assembly code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.5&lt;/span&gt; &lt;a href="#parallel-it-is" &gt;Parallel it is&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#more-parallel" &gt;More parallel&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.6&lt;/span&gt; &lt;a href="#actual-simd-at-last" &gt;Actual SIMD, at last&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#simd-table-lookups" &gt;SIMD table lookups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#32-bit-hashes" &gt;32-bit hashes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#shared-offsets" &gt;Shared offsets&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.7&lt;/span&gt; &lt;a href="#simd-the-gathering" &gt;SIMD: The Gathering&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#gathering-4-characters-at-a-time" &gt;Gathering 4 characters at a time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#gathering-8-characters-at-a-time" &gt;Gathering 8 characters at a time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#gathering-32-characters-at-a-time" &gt;Gathering 32 characters at a time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#reusing-the-gathers" &gt;Reusing the gathers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.8&lt;/span&gt; &lt;a href="#cached-vec" &gt;Fixing the benchmark&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#one-last-branch" &gt;One last branch&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.9&lt;/span&gt; &lt;a href="#analysis-machine-code-analysis" &gt;Analysis: Machine code analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.10&lt;/span&gt; &lt;a href="#finals-thoughts" &gt;Final thoughts&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#doubling-down-again" &gt;Doubling down again&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#16-bit-hashes" &gt;16-bit hashes?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-about-a-simple-multiply-hash" &gt;What about a simple multiply hash&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6&lt;/span&gt; &lt;a href="#simd-sliding-window" &gt;SIMD sliding window&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.1&lt;/span&gt; &lt;a href="#sliding-window-results" &gt;Results&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#human-genome-results" &gt;Human genome results&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7&lt;/span&gt; &lt;a href="#extending-into-something-useful" &gt;Extending into something useful&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1&lt;/span&gt; &lt;a href="#collecting-minimizer-positions" &gt;Collecting minimizer positions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.2&lt;/span&gt; &lt;a href="#deduplicating-the-minimizer-positions" &gt;Deduplicating the minimizer positions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.3&lt;/span&gt; &lt;a href="#super-k-mers" &gt;Super-k-mers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.4&lt;/span&gt; &lt;a href="#canonical-k-mers" &gt;Canonical k-mers&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#nthash" &gt;NtHash&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#leftmost-rightmost-sliding-min" &gt;Leftmost-rightmost sliding min&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#tiebreaking" &gt;Tiebreaking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#further-reusing-iterated-bases" &gt;Further reusing iterated bases&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.5&lt;/span&gt; &lt;a href="#antilex-hash" &gt;AntiLex hash&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8&lt;/span&gt; &lt;a href="#conclusion" &gt;Conclusion&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.1&lt;/span&gt; &lt;a href="#future-work" &gt;Future work&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;SimdMinimizers has been published as a paper: &lt;a href="https://doi.org/10.1101/2025.01.27.634998" class="external-link" target="_blank" rel="noopener"&gt;DOI&lt;/a&gt;, &lt;a href="https://curiouscoding.nl/papers/simd-minimizers.pdf" &gt;PDF&lt;/a&gt;:&lt;/p&gt;</description></item><item><title>A*PA2: Up to 19x faster exact global alignment</title><link>https://curiouscoding.nl/posts/astarpa2/</link><pubDate>Sat, 23 Mar 2024 00:00:00 +0100</pubDate><guid>https://curiouscoding.nl/posts/astarpa2/</guid><description>&lt;div class="ox-hugo-toc toc has-section-numbers"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#abstract" &gt;Abstract&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1&lt;/span&gt; &lt;a href="#introduction" &gt;Introduction&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.1&lt;/span&gt; &lt;a href="#contributions" &gt;Contributions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.2&lt;/span&gt; &lt;a href="#previous-work" &gt;Previous work&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.2.1&lt;/span&gt; &lt;a href="#needleman-wunsch" &gt;Needleman-Wunsch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.2.2&lt;/span&gt; &lt;a href="#graph-algorithms" &gt;Graph algorithms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.2.3&lt;/span&gt; &lt;a href="#computational-volumes" &gt;Computational volumes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.2.4&lt;/span&gt; &lt;a href="#parallelism" &gt;Parallelism&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.2.5&lt;/span&gt; &lt;a href="#tools" &gt;Tools&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2&lt;/span&gt; &lt;a href="#preliminaries" &gt;Preliminaries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3&lt;/span&gt; &lt;a href="#methods" &gt;Methods&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.1&lt;/span&gt; &lt;a href="#band-doubling" &gt;Band-doubling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.2&lt;/span&gt; &lt;a href="#blocks" &gt;Blocks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.3&lt;/span&gt; &lt;a href="#memory" &gt;Memory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.4&lt;/span&gt; &lt;a href="#simd" &gt;SIMD&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.5&lt;/span&gt; &lt;a href="#simd-friendly-sequence-profile" &gt;SIMD-friendly sequence profile&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.6&lt;/span&gt; &lt;a href="#traceback" &gt;Traceback&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.7&lt;/span&gt; &lt;a href="#a" &gt;A*&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.7.1&lt;/span&gt; &lt;a href="#bulk-contours-update" &gt;Bulk-contours update&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.7.2&lt;/span&gt; &lt;a href="#pre-pruning" &gt;Pre-pruning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.8&lt;/span&gt; &lt;a href="#determining-the-rows-to-compute" &gt;Determining the rows to compute&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.8.1&lt;/span&gt; &lt;a href="#sparse-heuristic-invocation" &gt;Sparse heuristic invocation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.9&lt;/span&gt; &lt;a href="#incremental-doubling" &gt;Incremental doubling&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4&lt;/span&gt; &lt;a href="#results" &gt;Results&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.1&lt;/span&gt; &lt;a href="#setup" &gt;Setup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.2&lt;/span&gt; &lt;a href="#comparison-with-other-aligners" &gt;Comparison with other aligners&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.3&lt;/span&gt; &lt;a href="#effects-of-methods" &gt;Effects of methods&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5&lt;/span&gt; &lt;a href="#discussion" &gt;Discussion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#acknowledgements" &gt;Acknowledgements&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conflict-of-interest" &gt;Conflict of interest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6&lt;/span&gt; &lt;a href="#appendix" &gt;Appendix&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.1&lt;/span&gt; &lt;a href="#bitpacking" &gt;Bitpacking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.2&lt;/span&gt; &lt;a href="#app-comparison" &gt;Comparison with other aligners&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.3&lt;/span&gt; &lt;a href="#app-effects" &gt;Effects of methods&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;\begin{equation*}
\newcommand{\g}{g^*}
\newcommand{\h}{h^*}
\newcommand{\f}{f^*}
\newcommand{\cgap}{c_{\textrm{gap}}}
\newcommand{\xor}{\ \mathrm{xor}\ }
\newcommand{\and}{\ \mathrm{and}\ }
\newcommand{\st}[2]{\langle #1, #2\rangle}
\newcommand{\matches}{\mathcal M}
\end{equation*}&lt;/p&gt;</description></item><item><title>One Billion Row Challenge</title><link>https://curiouscoding.nl/posts/1brc/</link><pubDate>Wed, 03 Jan 2024 00:00:00 +0100</pubDate><guid>https://curiouscoding.nl/posts/1brc/</guid><description>&lt;div class="ox-hugo-toc toc"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#external-links" &gt;External links&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-problem" &gt;The problem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#initial-solution-105s" &gt;Initial solution: 105s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#first-flamegraph" &gt;First flamegraph&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#bytes-instead-of-strings-72s" &gt;Bytes instead of strings: 72s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#manual-parsing-61s" &gt;Manual parsing: 61s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#inline-hash-keys-50s" &gt;Inline hash keys: 50s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#faster-hash-function-41s" &gt;Faster hash function: 41s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-new-flame-graph" &gt;A new flame graph&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#perf-it-is" &gt;Perf it is&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#something-simple-allocating-the-right-size-41s" &gt;Something simple: allocating the right size: 41s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#memchr-for-scanning-47s" &gt;&lt;code&gt;memchr&lt;/code&gt; for scanning: 47s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#memchr-crate-29s" &gt;&lt;code&gt;memchr&lt;/code&gt; crate: 29s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#get-unchecked-28s" &gt;&lt;code&gt;get_unchecked&lt;/code&gt;: 28s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#manual-simd-29s" &gt;Manual SIMD: 29s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#profiling" &gt;Profiling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#revisiting-the-key-function-23s" &gt;Revisiting the key function: 23s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#ptrhash-perfect-hash-function-17s" &gt;PtrHash perfect hash function: 17s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#larger-masks-15s" &gt;Larger masks: 15s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#reduce-pattern-matching-14s" &gt;Reduce pattern matching: 14s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#memory-map-12s" &gt;Memory map: 12s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#parallelization-2-dot-0s" &gt;Parallelization: 2.0s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#branchless-parsing-1-dot-7s" &gt;Branchless parsing: 1.7s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#purging-all-branches-1-dot-67s" &gt;Purging all branches: 1.67s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#some-more-attempts" &gt;Some more attempts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#faster-perfect-hashing-1-dot-55s" &gt;Faster perfect hashing: 1.55s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#bug-time-back-up-to-1-dot-71s" &gt;Bug time: Back up to 1.71s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#temperatures-less-than-100-1-dot-62s" &gt;Temperatures less than 100: 1.62s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#computing-min-as-a-max-1-dot-50" &gt;Computing &lt;code&gt;min&lt;/code&gt; as a &lt;code&gt;max&lt;/code&gt;: 1.50&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#intermezzo-hyperthreading-1-dot-34s" &gt;Intermezzo: Hyperthreading: 1.34s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#not-parsing-negative-numbers-1-dot-48s" &gt;Not parsing negative numbers: 1.48s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#more-efficient-parsing-1-dot-44s" &gt;More efficient parsing: 1.44s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#fixing-undefined-behaviour-back-to-1-dot-56s" &gt;Fixing undefined behaviour: back to 1.56s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#lazily-subtracting-b-0-1-dot-52s" &gt;Lazily subtracting &lt;code&gt;b'0'&lt;/code&gt;: 1.52s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#min-max-without-parsing-1-dot-55s" &gt;Min/max without parsing: 1.55s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#parsing-using-a-single-multiplication-doesn-t-work" &gt;Parsing using a single multiplication: doesn&amp;rsquo;t work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#parsing-using-a-single-multiplication-does-work-after-all-1-dot-48s" &gt;Parsing using a single multiplication does work after all! 1.48s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-side-note-ascii" &gt;A side note: ASCII&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#skip-parsing-using-pdep-1-dot-42s" &gt;Skip parsing using &lt;code&gt;PDEP&lt;/code&gt;: 1.42s&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#improved" &gt;Improved&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-further-note" &gt;A further note&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#branchy-min-max-1-dot-37s" &gt;Branchy min/max: 1.37s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#no-counting-1-dot-34s" &gt;No counting: 1.34s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#arbitrary-long-city-names-1-dot-34" &gt;Arbitrarily long city names: 1.34&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#4-entries-in-parallel-1-dot-23s" &gt;4 entries in parallel: 1.23s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#mmap-per-thread" &gt;Mmap per thread&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#reordering-some-operations-1-dot-19s" &gt;Reordering some operations: 1.19s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#reordering-more-1-dot-11s" &gt;Reordering more: 1.11s&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#even-more-ilp-1-dot-05" &gt;Even more ILP: 1.05&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#compliance-1-ok-i-ll-count-1-dot-06" &gt;Compliance 1, OK I&amp;rsquo;ll count: 1.06&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#d41d8c" &gt;&lt;span class="org-todo todo TODO"&gt;TODO&lt;/span&gt; &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#postscript" &gt;Postscript&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;A YouTube video about this post is available &lt;a href="https://youtu.be/e_9ziFKcEhw?si=JHy4aVliKw9gfryf&amp;amp;t=896" class="external-link" target="_blank" rel="noopener"&gt;here&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>PtrHash: Notes on adapting PTHash in Rust</title><link>https://curiouscoding.nl/posts/ptrhash-log/</link><pubDate>Thu, 21 Sep 2023 00:00:00 +0200</pubDate><guid>https://curiouscoding.nl/posts/ptrhash-log/</guid><description>&lt;div class="ox-hugo-toc toc"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#questions-and-remarks-on-pthash-paper" &gt;Questions and remarks on PTHash paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#ideas-for-improvement" &gt;Ideas for improvement&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#parameters" &gt;Parameters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#align-packed-vectors-to-cachelines" &gt;Align packed vectors to cachelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#prefetching" &gt;Prefetching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#faster-modulo-operations" &gt;Faster modulo operations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#store-dictionary-d-sorted-using-elias-fano-coding" &gt;Store dictionary \(D\) sorted using Elias-Fano coding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#how-many-bits-of-n-and-hash-entropy-do-we-need" &gt;How many bits of \(n\) and hash entropy do we need?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#ideas-for-faster-construction" &gt;Ideas for faster construction&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#implementation-log" &gt;Implementation log&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#hashing-function" &gt;Hashing function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#bitpacking-crates" &gt;Bitpacking crates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#construction" &gt;Construction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#fastmod" &gt;Fastmod&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#try-out-fastdivide-and-reciprocal-crates" &gt;&lt;span class="org-todo todo TODO"&gt;TODO&lt;/span&gt; Try out &lt;code&gt;fastdivide&lt;/code&gt; and &lt;code&gt;reciprocal&lt;/code&gt; crates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#first-benchmark" &gt;First benchmark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#faster-bucket-computation" &gt;Faster bucket computation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#branchless-for-real-now--aka-the-trick-of-thirds" &gt;Branchless, for real now! (aka the trick-of-thirds)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#compiling-and-benchmarking-pthash" &gt;Compiling and benchmarking PTHash&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#compact-encoding" &gt;Compact encoding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#find-the-x-differences" &gt;Find the \(x\) differences&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#fastreduce-revisited" &gt;&lt;code&gt;FastReduce&lt;/code&gt; revisited&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#is-there-a-problem-if-gcd--m-n--is-large" &gt;&lt;span class="org-todo todo TODO"&gt;TODO&lt;/span&gt; Is there a problem if \(\gcd(m, n)\) is large?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#faster-hashing" &gt;Faster hashing&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#try-xxhash" &gt;&lt;span class="org-todo todo TODO"&gt;TODO&lt;/span&gt; Try xxhash&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#an-experiment" &gt;An experiment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#compiler-struggles" &gt;Compiler struggles&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#prefetching-at-last" &gt;Prefetching, at last&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#prefetching-with-vectorization" &gt;&lt;span class="org-todo todo TODO"&gt;TODO&lt;/span&gt; Prefetching with vectorization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#inverting-hki" &gt;Inverting \(h(k_i)\)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#another-day-of-progress" &gt;Another day of progress&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#possible-sorting-algorithms" &gt;&lt;span class="org-todo todo TODO"&gt;TODO&lt;/span&gt; Possible sorting algorithms&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#diving-into-the-inverse-hash-problem" &gt;Diving into the inverse hash problem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#bringing-it-home" &gt;Bringing it home&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#hash-inversion-for-faster-pthash-construction" &gt;Hash-inversion for faster PTHash construction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#fast-path-for-small-buckets" &gt;Fast path for small buckets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#dictionary-encoding" &gt;&lt;span class="org-todo todo TODO"&gt;TODO&lt;/span&gt; Dictionary encoding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#larger-buckets" &gt;&lt;span class="org-todo todo TODO"&gt;TODO&lt;/span&gt; Larger buckets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#prefetching-free-slots" &gt;&lt;span class="org-todo todo TODO"&gt;TODO&lt;/span&gt; Prefetching free slots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#filling-the-last-few-empty-slots-needs-very-high-k-i" &gt;Filling the last few empty slots needs very high \(k_i\)!&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#perfect-matching-for-the-tail" &gt;Perfect matching for the tail&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#peeling-for-size-1-buckets" &gt;Peeling for size-1 buckets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#greedy-peeling-1-assigning-from-hard-to-easy" &gt;Greedy peeling 1: Assigning from hard to easy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#peeling-and-cuckoo-hashing-for-larger-buckets-dot" &gt;&lt;span class="org-todo todo TODO"&gt;TODO&lt;/span&gt; Peeling and cuckoo hashing for larger buckets.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#sunday-morning-ideas" &gt;Sunday morning ideas&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#dinic" &gt;Dinic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#new-iterative-greedy-assignment-idea" &gt;New iterative greedy assignment idea&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#cuckoo-hashing-again" &gt;Cuckoo hashing, again&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#cuckoo-hashing-displacing-for-real-now" &gt;Cuckoo hashing / displacing, &lt;em&gt;for real now&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#displacing-globally" &gt;Displacing globally&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#running-it" &gt;Running it&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#limitations" &gt;Limitations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#cleanup-and-revisiting-defaults" &gt;Cleanup and revisiting defaults&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#sum-instead-of-xor" &gt;&lt;span class="org-todo todo TODO"&gt;TODO&lt;/span&gt; Sum instead of xor?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#revisiting-alpha-1" &gt;Revisiting \(\alpha &amp;lt; 1\)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#elias-fano-for-the-remap-dictionary" &gt;Elias-Fano for the remap-dictionary&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#global-iterative-prioritizing" &gt;Global iterative prioritizing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#cleanup-removing-peeling-and-suboptimal-displacing-code" &gt;Cleanup: removing peeling and suboptimal displacing code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#some-speedups-to-the-displacement-algorithm" &gt;Some speedups to the displacement algorithm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#runtime-analysis-of-displacement-algorithm" &gt;&lt;span class="org-todo todo TODO"&gt;TODO&lt;/span&gt; Runtime analysis of displacement algorithm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#optimal-prefetching-strategy" &gt;&lt;span class="org-todo todo TODO"&gt;TODO&lt;/span&gt; Optimal prefetching strategy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#are-we-close-to-the-memory-bandwidth" &gt;Are we close to the memory bandwidth?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#more-sorting-algorithm-resources" &gt;More sorting algorithm resources&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#and-some-resources-on-partitioning" &gt;And some resources on partitioning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#partitioning-to-reduce-memory-latency" &gt;Partitioning to reduce memory latency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#back-from-a-break" &gt;Back from a break!&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#speeding-up-the-search-for-pilots" &gt;Speeding up the search for pilots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#multiplyreduce" &gt;&lt;code&gt;MultiplyReduce&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#linux-hugepages" &gt;Linux hugepages?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#dropping-the-bucket-split" &gt;Dropping the bucket split?&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#build-performance" &gt;Build performance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#an-alternative" &gt;An alternative&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#query-performance" &gt;Query performance&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#query-memory-bandwidth" &gt;Query memory bandwidth&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#some-more-experiments" &gt;Some more experiments&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#multithreading-benchmark" &gt;Multithreading benchmark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#multithreading-queries-satisfaction-at-last" &gt;Multithreading queries: satisfaction at last&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#packing-difference-from-expected-position" &gt;Packing difference from expected position&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#local-packing-ideas" &gt;Local packing ideas&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#query-times-for-different-remapping-structures" &gt;Query times for different remapping structures&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#sharding" &gt;Sharding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#128bit-hashing" &gt;128bit hashing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#varying-the-partition-size" &gt;Varying the partition size&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#ptrhash-part-2" &gt;PtrHash, part 2&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#phobic" &gt;Phobic&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#for-ptrhash" &gt;&lt;span class="org-todo todo TODO"&gt;TODO&lt;/span&gt; for PtrHash&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;\[
%\newcommand{\mm}{\,\%\,}
\newcommand{\mm}{\bmod}
\newcommand{\lxor}{\oplus}
\newcommand{\K}{\mathcal K}
\]&lt;/p&gt;</description></item><item><title>Diamond optimisation for diagonal transition</title><link>https://curiouscoding.nl/posts/diamond-optimization/</link><pubDate>Mon, 01 Aug 2022 00:00:00 +0200</pubDate><guid>https://curiouscoding.nl/posts/diamond-optimization/</guid><description>&lt;div class="ox-hugo-toc toc"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#diamond-transition-or-how-technicalities-can-break-concepts" &gt;Diamond transition or how technicalities can break concepts&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#but-let-s-take-a-closer-look" &gt;But let’s take a closer look&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#conclusion" &gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;h2 id="diamond-transition-or-how-technicalities-can-break-concepts"&gt;
 Diamond transition or how technicalities can break concepts
 &lt;a class="heading-link" href="#diamond-transition-or-how-technicalities-can-break-concepts"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;We assume the reader has some basic knowledge of pairwise alignment,
and in particular of the WFA algorithm.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In this post we dive into a potential 2x speedup of WFA &amp;mdash; one that turns out not to work.&lt;/p&gt;</description></item><item><title>Benchmark attention points</title><link>https://curiouscoding.nl/posts/benchmarks/</link><pubDate>Thu, 28 Apr 2022 23:33:00 +0200</pubDate><guid>https://curiouscoding.nl/posts/benchmarks/</guid><description>&lt;p&gt;&lt;em&gt;Benchmarking is harder than you think, even when taking into account this rule.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This post lists some lessons I learned while attempting to run benchmarks for
&lt;a href="https://github.com/RagnarGrootKoerkamp/astar-pairwise-aligner" class="external-link" target="_blank" rel="noopener"&gt;A* pairwise aligner&lt;/a&gt;. I was doing this on a laptop, which likely has different
characteristics from CPUs in a typical server rack. All the programs I run are
single-threaded.&lt;/p&gt;
&lt;h2 id="hardware"&gt;
 Hardware
 &lt;a class="heading-link" href="#hardware"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;dl&gt;
&lt;dt&gt;Do not run while charging the laptop&lt;/dt&gt;
&lt;dd&gt;Charging makes the battery hot and causes throttling. Run either on
battery power or with a completely full battery to prevent this.&lt;/dd&gt;
&lt;dt&gt;Disable hyperthreading&lt;/dt&gt;
&lt;dd&gt;Completely disable hyperthreading in the BIOS.
Otherwise, multiple programs running on the same physical core may compete for its resources.&lt;/dd&gt;
&lt;/dl&gt;
&lt;h2 id="cpu-settings"&gt;
 CPU settings
 &lt;a class="heading-link" href="#cpu-settings"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;dl&gt;
&lt;dt&gt;Pin CPU frequency&lt;/dt&gt;
&lt;dd&gt;CPUs, especially those in laptops, have turbo boost, (thermal) throttling, and power-saving
features. Make sure to pin the CPU core frequency low enough that it can be
sustained for long periods without throttling.
&lt;p&gt;In my case, the &lt;code&gt;performance&lt;/code&gt; governor can fix the CPU frequency. The base
frequency of my CPU is &lt;code&gt;2.6GHz&lt;/code&gt;, so that&amp;rsquo;s where I pinned it.&lt;/p&gt;</description></item><item><title>28000x speedup with Numba.CUDA</title><link>https://curiouscoding.nl/posts/numba-cuda-speedup/</link><pubDate>Mon, 24 May 2021 00:00:00 +0200</pubDate><guid>https://curiouscoding.nl/posts/numba-cuda-speedup/</guid><description>&lt;div class="ox-hugo-toc toc"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#cuda-overview" &gt;CUDA Overview&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#profiling" &gt;Profiling&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#optimizing-tensor-sketch" &gt;Optimizing Tensor Sketch&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#cpu-code" &gt;CPU code&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#v0-original-python-code" &gt;V0: Original python code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v1-numba" &gt;V1: Numba&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v2-multithreading" &gt;V2: Multithreading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#gpu-code" &gt;GPU code&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#v3-a-first-gpu-version" &gt;V3: A first GPU version&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v4-parallel-kernel-invocations" &gt;V4: Parallel kernel invocations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v5-single-kernel-with-many-blocks" &gt;V5: Single kernel with many blocks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v6-detailed-profiling-kernel-compute" &gt;V6: Detailed profiling: Kernel Compute&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v7-detailed-profiling-kernel-latency" &gt;V7: Detailed profiling: Kernel Latency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v8-detailed-profiling-shared-memory-access-pattern" &gt;V8: Detailed profiling: Shared Memory Access Pattern&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v9-more-work-per-thread" &gt;V9: More work per thread&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v10-cache-seq-to-shared-memory" &gt;V10: Cache seq to shared memory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v11-hashes-and-signs-in-shared-memory" &gt;V11: Hashes and signs in shared memory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v12-revisiting-blocks-per-kernel" &gt;V12: Revisiting blocks per kernel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v13-passing-a-tuple-of-sequences" &gt;V13: Passing a tuple of sequences&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v14-better-hardware" &gt;V14: Better hardware&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v15-dynamic-shared-memory" &gt;V15: Dynamic shared memory&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#wrap-up" &gt;Wrap up&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;&lt;strong&gt;Backlinks&lt;/strong&gt;: &lt;a href="https://www.reddit.com/r/CUDA/comments/mq1yrm/28000x_speedup_with_numbacuda/" class="external-link" target="_blank" rel="noopener"&gt;r/CUDA&lt;/a&gt;, &lt;a href="https://numba.discourse.group/t/blog-28000x-speedup-with-numba-cuda/667" class="external-link" target="_blank" rel="noopener"&gt;Numba discourse&lt;/a&gt;&lt;/p&gt;</description></item></channel></rss>