<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Walkthrough on CuriousCoding</title><link>https://curiouscoding.nl/categories/walkthrough/</link><description>Recent content in Walkthrough on CuriousCoding</description><generator>Hugo</generator><language>en</language><lastBuildDate>Wed, 12 Feb 2025 00:00:00 +0100</lastBuildDate><atom:link href="https://curiouscoding.nl/categories/walkthrough/index.xml" rel="self" type="application/rss+xml"/><item><title>Binary search variants and the effects of batching</title><link>https://curiouscoding.nl/posts/binsearch/</link><pubDate>Wed, 12 Feb 2025 00:00:00 +0100</pubDate><guid>https://curiouscoding.nl/posts/binsearch/</guid><description>&lt;div class="ox-hugo-toc toc has-section-numbers"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1&lt;/span&gt; &lt;a href="#optimizing-binary-search-and-interpolation-search" &gt;Optimizing Binary Search And Interpolation Search&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.1&lt;/span&gt; &lt;a href="#problem-statement" &gt;Problem statement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.2&lt;/span&gt; &lt;a href="#inspiration-and-background" &gt;Inspiration and background&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.3&lt;/span&gt; &lt;a href="#benchmarking-setup" &gt;Benchmarking setup&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2&lt;/span&gt; &lt;a href="#binary-search" &gt;Binary search&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.1&lt;/span&gt; &lt;a href="#branchless-search" &gt;Branchless search&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.2&lt;/span&gt; &lt;a href="#explicit-prefetching" &gt;Explicit prefetching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.3&lt;/span&gt; &lt;a href="#batching" &gt;Batching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.4&lt;/span&gt; &lt;a href="#a-note-on-power-of-two-array-sizes" &gt;A note on power-of-two array sizes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3&lt;/span&gt; &lt;a href="#alternative-memory-layout" &gt;Eytzinger&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.1&lt;/span&gt; &lt;a href="#naive-implementation" &gt;Naive implementation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.2&lt;/span&gt; &lt;a href="#prefetching" &gt;Prefetching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.3&lt;/span&gt; &lt;a href="#branchless-eytzinger" &gt;Branchless Eytzinger&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.4&lt;/span&gt; &lt;a href="#batched-eytzinger" &gt;Batched Eytzinger&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.4.1&lt;/span&gt; &lt;a href="#non-prefetched" &gt;Non-prefetched&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.4.2&lt;/span&gt; &lt;a href="#prefetched" &gt;Prefetched&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4&lt;/span&gt; &lt;a href="#eytzinger-or-binsearch" &gt;Eytzinger or BinSearch?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5&lt;/span&gt; &lt;a href="#memory-efficiency-parallel-search-and-comparison-to-s-trees" &gt;Memory efficiency &amp;ndash; parallel search and comparison to S-trees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6&lt;/span&gt; &lt;a href="#interpolation-search" &gt;Interpolation search&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7&lt;/span&gt; &lt;a href="#conclusion-and-takeaways" &gt;Conclusion and takeaways&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;h2 id="optimizing-binary-search-and-interpolation-search"&gt;
 &lt;span class="section-num"&gt;1&lt;/span&gt; Optimizing Binary Search And Interpolation Search
 &lt;a class="heading-link" href="#optimizing-binary-search-and-interpolation-search"&gt;
 &lt;i class="fa-solid fa-link" aria-hidden="true" title="Link to heading"&gt;&lt;/i&gt;
 &lt;span class="sr-only"&gt;Link to heading&lt;/span&gt;
 &lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;This blog post is a precursor to the
&lt;a href="https://curiouscoding.nl/posts/static-search-tree/" class="external-link" target="_blank" rel="noopener"&gt;post on static
search trees&lt;/a&gt;. We will be looking into binary search and how it can be
optimized using different memory layouts (Eytzinger), branchless
techniques and careful use of prefetching. In addition, we will explore
batching. Our language of choice will be Rust.&lt;/p&gt;</description></item><item><title>Static search trees: 40x faster than binary search</title><link>https://curiouscoding.nl/posts/static-search-tree/</link><pubDate>Wed, 18 Dec 2024 00:00:00 +0100</pubDate><guid>https://curiouscoding.nl/posts/static-search-tree/</guid><description>&lt;div class="ox-hugo-toc toc has-section-numbers"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1&lt;/span&gt; &lt;a href="#introduction" &gt;Introduction&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.1&lt;/span&gt; &lt;a href="#problem-statement" &gt;Problem statement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.2&lt;/span&gt; &lt;a href="#motivation" &gt;Motivation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.3&lt;/span&gt; &lt;a href="#recommended-reading" &gt;Recommended reading&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.4&lt;/span&gt; &lt;a href="#binary-search-and-eytzinger-layout" &gt;Binary search and Eytzinger layout&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.5&lt;/span&gt; &lt;a href="#hugepages" &gt;Hugepages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.6&lt;/span&gt; &lt;a href="#a-note-on-benchmarking" &gt;A note on benchmarking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.7&lt;/span&gt; &lt;a href="#cache-lines" &gt;Cache lines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.8&lt;/span&gt; &lt;a href="#s-trees-and-b-trees" &gt;S-trees and B-trees&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2&lt;/span&gt; &lt;a href="#optimizing-find" &gt;Optimizing &lt;code&gt;find&lt;/code&gt;&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.1&lt;/span&gt; &lt;a href="#linear" &gt;Linear&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.2&lt;/span&gt; &lt;a href="#auto-vectorization" &gt;Auto-vectorization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.3&lt;/span&gt; &lt;a href="#trailing-zeros" &gt;Trailing zeros&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.4&lt;/span&gt; &lt;a href="#popcount" &gt;Popcount&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2.5&lt;/span&gt; &lt;a href="#manual-simd" &gt;Manual SIMD&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3&lt;/span&gt; &lt;a href="#optimizing-the-search" &gt;Optimizing the search&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.1&lt;/span&gt; &lt;a href="#batching" &gt;Batching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.2&lt;/span&gt; &lt;a href="#prefetching" &gt;Prefetching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.3&lt;/span&gt; &lt;a href="#pointer-arithmetic" &gt;Pointer arithmetic&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.3.1&lt;/span&gt; &lt;a href="#up-front-splat" &gt;Up-front splat&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.3.2&lt;/span&gt; &lt;a href="#byte-based-pointers" &gt;Byte-based pointers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.3.3&lt;/span&gt; &lt;a href="#the-final-version" &gt;The final version&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.4&lt;/span&gt; &lt;a href="#skip-prefetch" &gt;Skip prefetch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.5&lt;/span&gt; &lt;a href="#interleave" &gt;Interleave&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4&lt;/span&gt; &lt;a href="#optimizing-the-tree-layout" &gt;Optimizing the tree layout&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.1&lt;/span&gt; &lt;a href="#left-tree" &gt;Left-tree&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.2&lt;/span&gt; &lt;a href="#memory-layouts" &gt;Memory layouts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.3&lt;/span&gt; &lt;a href="#node-size-b-15" &gt;Node size \(B=15\)&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.3.1&lt;/span&gt; &lt;a href="#data-structure-size" &gt;Data structure size&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.4&lt;/span&gt; &lt;a href="#summary" &gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5&lt;/span&gt; &lt;a href="#prefix-partitioning" &gt;Prefix partitioning&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.1&lt;/span&gt; &lt;a href="#full-layout" &gt;Full layout&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.2&lt;/span&gt; &lt;a href="#compact-subtrees" &gt;Compact subtrees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.3&lt;/span&gt; &lt;a href="#the-best-of-both-compact-first-level" &gt;The best of both: compact first level&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.4&lt;/span&gt; &lt;a href="#overlapping-trees" &gt;Overlapping trees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.5&lt;/span&gt; &lt;a href="#human-data" &gt;Human data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.6&lt;/span&gt; &lt;a href="#prefix-map" &gt;Prefix map&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.7&lt;/span&gt; &lt;a href="#prefix-summary" &gt;Summary&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6&lt;/span&gt; &lt;a href="#multi-threaded-comparison" &gt;Multi-threaded comparison&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7&lt;/span&gt; &lt;a href="#conclusion" &gt;Conclusion&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1&lt;/span&gt; &lt;a href="#future-work" &gt;Future work&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1.1&lt;/span&gt; &lt;a href="#branchy-search" &gt;Branchy search&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1.2&lt;/span&gt; &lt;a href="#interpolation-search" &gt;Interpolation search&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1.3&lt;/span&gt; &lt;a href="#packing-data-smaller" &gt;Packing data smaller&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1.4&lt;/span&gt; &lt;a href="#returning-indices-in-original-data" &gt;Returning indices in original data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1.5&lt;/span&gt; &lt;a href="#range-queries" &gt;Range queries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1.6&lt;/span&gt; &lt;a href="#sorting-queries" &gt;Sorting queries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1.7&lt;/span&gt; &lt;a href="#suffix-array-searching" &gt;Suffix array searching&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;In this post, we will implement a static search tree (S+ tree) for
high-throughput searching of sorted data, as &lt;a href="https://en.algorithmica.org/hpc/data-structures/s-tree/" class="external-link" target="_blank" rel="noopener"&gt;introduced&lt;/a&gt; on Algorithmica.
We&amp;rsquo;ll mostly take the code presented there as a starting point, and optimize it
to its limits. For the most part, I&amp;rsquo;m simply taking the &amp;lsquo;future work&amp;rsquo; ideas from that post
and implementing them. After that, we will spend a good while looking at assembly
code to shave off every instruction we can.
Lastly, there will be one big addition to optimize throughput: &lt;em&gt;batching&lt;/em&gt;.&lt;/p&gt;</description></item><item><title>SimdMinimizers: Computing random minimizers, fast</title><link>https://curiouscoding.nl/posts/simd-minimizers/</link><pubDate>Fri, 12 Jul 2024 00:00:00 +0200</pubDate><guid>https://curiouscoding.nl/posts/simd-minimizers/</guid><description>&lt;div class="ox-hugo-toc toc has-section-numbers"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1&lt;/span&gt; &lt;a href="#introduction" &gt;Introduction&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;1.1&lt;/span&gt; &lt;a href="#intro-results" &gt;Results&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;2&lt;/span&gt; &lt;a href="#random-minimizers" &gt;Random minimizers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3&lt;/span&gt; &lt;a href="#algorithms" &gt;Algorithms&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.1&lt;/span&gt; &lt;a href="#problem-statement" &gt;Problem statement&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#problem-a-only-the-set-of-minimizers" &gt;Problem A: Only the set of minimizers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#problem-b-the-minimizer-of-each-window" &gt;Problem B: The minimizer of each window&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#problem-c-super-k-mers" &gt;Problem C: Super-k-mers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#which-problem-to-solve" &gt;Which problem to solve&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#canonical-k-mers" &gt;Canonical k-mers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.2&lt;/span&gt; &lt;a href="#the-naive-algorithm" &gt;The naive algorithm&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#naive-performance" &gt;Performance characteristics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.3&lt;/span&gt; &lt;a href="#rephrasing-as-sliding-window-minimum" &gt;Rephrasing as sliding window minimum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.4&lt;/span&gt; &lt;a href="#the-queue" &gt;The queue&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#queue-performance" &gt;Performance characteristics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.5&lt;/span&gt; &lt;a href="#jumping-away-with-the-queue" &gt;Jumping: Away with the queue&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#jumping-performance" &gt;Performance characteristics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.6&lt;/span&gt; &lt;a href="#re-scan" &gt;Re-scan&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#rescan-performance" &gt;Performance characteristics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;3.7&lt;/span&gt; &lt;a href="#split-windows" &gt;Split windows&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#split-perfomance" &gt;Performance characteristics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4&lt;/span&gt; &lt;a href="#analysing-what-we-have-so-far" &gt;Analysing what we have so far&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.1&lt;/span&gt; &lt;a href="#counting-comparisons" &gt;Counting comparisons&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#open-problem-theoretical-lower-bounds" &gt;Open problem: Theoretical lower bounds&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.2&lt;/span&gt; &lt;a href="#setting-up-benchmarking" &gt;Setting up benchmarking&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#adding-criterion" &gt;Adding criterion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#making-criterion-fast" &gt;Making criterion fast&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-note-on-cpu-frequency" &gt;A note on CPU frequency&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.3&lt;/span&gt; &lt;a href="#runtime-comparison-with-other-implementations" &gt;Runtime comparison with other implementations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.4&lt;/span&gt; &lt;a href="#deeper-inspection-using-perf-stat" &gt;Deeper inspection using &lt;code&gt;perf stat&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.5&lt;/span&gt; &lt;a href="#a-first-optimization-pass" &gt;A first optimization pass&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#optimizing-buffered-reducing-branch-misses" &gt;Optimizing &lt;code&gt;Buffered&lt;/code&gt;: reducing branch misses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#queue-is-hopelessly-branchy" &gt;&lt;code&gt;Queue&lt;/code&gt; is hopelessly branchy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#jumping-is-already-very-efficient" &gt;&lt;code&gt;Jumping&lt;/code&gt; is already very efficient&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#optimizing-rescan" &gt;Optimizing &lt;code&gt;Rescan&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#optimizing-split" &gt;Optimizing &lt;code&gt;Split&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;4.6&lt;/span&gt; &lt;a href="#a-new-performance-comparison" &gt;A new performance comparison&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5&lt;/span&gt; &lt;a href="#rolling-our-own-hash" &gt;Rolling our own hash&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.1&lt;/span&gt; &lt;a href="#fxhash" &gt;FxHash&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#wyhash" &gt;WyHash&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.2&lt;/span&gt; &lt;a href="#nthash-a-rolling-hash" &gt;NtHash: a rolling hash&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#the-nthash-crate" &gt;The &lt;code&gt;nthash&lt;/code&gt; crate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#buffered-hash-values" &gt;Buffered hash values&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.3&lt;/span&gt; &lt;a href="#making-nthash-fast-going-branchless" &gt;Making ntHash fast: going branchless&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#drop-sanity-checks" &gt;Drop sanity checks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#drop-bound-checks" &gt;Drop bound checks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#efficiently-collecting-to-a-vector" &gt;Efficiently collecting to a vector&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.4&lt;/span&gt; &lt;a href="#rolling-a-bit-less" &gt;Rolling a bit less&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#analysing-the-assembly-code" &gt;Analysing the assembly code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.5&lt;/span&gt; &lt;a href="#parallel-it-is" &gt;Parallel it is&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#more-parallel" &gt;More parallel&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.6&lt;/span&gt; &lt;a href="#actual-simd-at-last" &gt;Actual SIMD, at last&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#simd-table-lookups" &gt;SIMD table lookups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#32-bit-hashes" &gt;32-bit hashes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#shared-offsets" &gt;Shared offsets&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.7&lt;/span&gt; &lt;a href="#simd-the-gathering" &gt;SIMD: The Gathering&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#gathering-4-characters-at-a-time" &gt;Gathering 4 characters at a time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#gathering-8-characters-at-a-time" &gt;Gathering 8 characters at a time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#gathering-32-characters-at-a-time" &gt;Gathering 32 characters at a time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#reusing-the-gathers" &gt;Reusing the gathers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.8&lt;/span&gt; &lt;a href="#cached-vec" &gt;Fixing the benchmark&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#one-last-branch" &gt;One last branch&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.9&lt;/span&gt; &lt;a href="#analysis-machine-code-analysis" &gt;Analysis: Machine code analysis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;5.10&lt;/span&gt; &lt;a href="#finals-thoughts" &gt;Final thoughts&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#doubling-down-again" &gt;Doubling down again&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#16-bit-hashes" &gt;16-bit hashes?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#what-about-a-simple-multiply-hash" &gt;What about a simple multiply hash&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6&lt;/span&gt; &lt;a href="#simd-sliding-window" &gt;SIMD sliding window&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;6.1&lt;/span&gt; &lt;a href="#sliding-window-results" &gt;Results&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#human-genome-results" &gt;Human genome results&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7&lt;/span&gt; &lt;a href="#extending-into-something-useful" &gt;Extending into something useful&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.1&lt;/span&gt; &lt;a href="#collecting-minimizer-positions" &gt;Collecting minimizer positions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.2&lt;/span&gt; &lt;a href="#deduplicating-the-minimizer-positions" &gt;Deduplicating the minimizer positions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.3&lt;/span&gt; &lt;a href="#super-k-mers" &gt;Super-k-mers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.4&lt;/span&gt; &lt;a href="#canonical-k-mers" &gt;Canonical k-mers&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#nthash" &gt;NtHash&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#leftmost-rightmost-sliding-min" &gt;Leftmost-rightmost sliding min&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#tiebreaking" &gt;Tiebreaking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#further-reusing-iterated-bases" &gt;Further reusing iterated bases&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;7.5&lt;/span&gt; &lt;a href="#antilex-hash" &gt;AntiLex hash&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8&lt;/span&gt; &lt;a href="#conclusion" &gt;Conclusion&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="section-num"&gt;8.1&lt;/span&gt; &lt;a href="#future-work" &gt;Future work&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;SimdMinimizers has been published as a paper: &lt;a href="https://doi.org/10.1101/2025.01.27.634998" class="external-link" target="_blank" rel="noopener"&gt;DOI&lt;/a&gt;, &lt;a href="https://curiouscoding.nl/papers/simd-minimizers.pdf" &gt;PDF&lt;/a&gt;:&lt;/p&gt;</description></item><item><title>28000x speedup with Numba.CUDA</title><link>https://curiouscoding.nl/posts/numba-cuda-speedup/</link><pubDate>Mon, 24 May 2021 00:00:00 +0200</pubDate><guid>https://curiouscoding.nl/posts/numba-cuda-speedup/</guid><description>&lt;div class="ox-hugo-toc toc"&gt;
&lt;div class="heading"&gt;Table of Contents&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#cuda-overview" &gt;CUDA Overview&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#profiling" &gt;Profiling&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#optimizing-tensor-sketch" &gt;Optimizing Tensor Sketch&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#cpu-code" &gt;CPU code&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#v0-original-python-code" &gt;V0: Original Python code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v1-numba" &gt;V1: Numba&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v2-multithreading" &gt;V2: Multithreading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#gpu-code" &gt;GPU code&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#v3-a-first-gpu-version" &gt;V3: A first GPU version&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v4-parallel-kernel-invocations" &gt;V4: Parallel kernel invocations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v5-single-kernel-with-many-blocks" &gt;V5: Single kernel with many blocks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v6-detailed-profiling-kernel-compute" &gt;V6: Detailed profiling: Kernel Compute&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v7-detailed-profiling-kernel-latency" &gt;V7: Detailed profiling: Kernel Latency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v8-detailed-profiling-shared-memory-access-pattern" &gt;V8: Detailed profiling: Shared Memory Access Pattern&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v9-more-work-per-thread" &gt;V9: More work per thread&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v10-cache-seq-to-shared-memory" &gt;V10: Cache seq to shared memory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v11-hashes-and-signs-in-shared-memory" &gt;V11: Hashes and signs in shared memory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v12-revisiting-blocks-per-kernel" &gt;V12: Revisiting blocks per kernel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v13-passing-a-tuple-of-sequences" &gt;V13: Passing a tuple of sequences&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v14-better-hardware" &gt;V14: Better hardware&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#v15-dynamic-shared-memory" &gt;V15: Dynamic shared memory&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#wrap-up" &gt;Wrap up&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;!--endtoc--&gt;
&lt;p&gt;&lt;strong&gt;Backlinks&lt;/strong&gt;: &lt;a href="https://www.reddit.com/r/CUDA/comments/mq1yrm/28000x_speedup_with_numbacuda/" class="external-link" target="_blank" rel="noopener"&gt;r/CUDA&lt;/a&gt;, &lt;a href="https://numba.discourse.group/t/blog-28000x-speedup-with-numba-cuda/667" class="external-link" target="_blank" rel="noopener"&gt;Numba discourse&lt;/a&gt;&lt;/p&gt;</description></item></channel></rss>