# 1 Introduction

In this post, we will develop a fast implementation of random minimizers.

The goals here are:

• a library for fast random minimizers,
• explaining the final code, and the design decisions leading to it,
• walking you through the development process, hopefully teaching some tricks and methods for how to do this yourself.

Code for this post is in the benches/ directory of github:RagnarGrootKoerkamp/minimizers. Most code snippets are directly included from the repository. I have tried to write clean and readable code that follows the order of this post. Occasionally some code snippets contain ‘future proofing’ that will only become relevant later on.

Contents1:

## 1.1 Results

The end result that we will be working towards here is twofold:

1. An implementation of 32-bit ntHash rolling hash that takes around 0.4ns/kmer, or around 1 CPU cycle per hash. This is 7x faster than the nthash crate by Luiz Irber.

2. An implementation of random minimizers that takes around 0.6ns/window, or 1.5CPU cycle per window. This is 15x faster than Daniel Liu’s gist and 40x faster than the rust-seq/minimizer-iter crate by Igor Martayan2.

It computes all window minimizers of a 2-bit packed human genome (with non-ACGT stripped) in two seconds.

I plan to upstream these algorithms into the rust-seq/nthash and rust-seq/minimizer-iter crates soon, after polishing the implementation, adding tests3, and designing a nice API4.

I welcome any (critical) feedback on both the content and styling of this post! I plan to fine-tune it a bit more based on this, and will take this into account when writing similar posts in the future. If you’re feeling adventurous, open an issue or pull request against the source of this blog.

# 2 Random minimizers

Many tools in bioinformatics rely on k-mers. Given a string/text/DNA sequence t: &[u8] of length $$|t|=n$$, the k-mers of t are all the length-$$k$$ substrings. There are $$n-k+1$$ of them.

In order to speed up processing, and since consecutive k-mers are overlapping and contain redundancy, it is common to subsample the k-mers. This should be done in a deterministic way, so that similar but slightly distinct text still mostly sample the same k-mers. One such way is random minimizers, introduced in parallel by Schleimer, Wilkerson, and Aiken (2003) and Roberts et al. (2004):

1. First, a window size $$w$$ is chosen, which will guarantee that at least one k-mer is sampled every $$w$$ positions of $$t$$, to prevent long gaps between consecutive k-mers.
2. Then, every k-mer is hashed using a pseudo-random hash function $$h$$, and in each window of $$w$$ consecutive k-mers (of total length $$\ell=w+k-1$$), the leftmost5 one with the smallest hash is selected.

Consecutive windows (that overlap in $$w-1$$ k-mers) often sample the same k-mer, so that the total number of sampled k-mers is much smaller than $$n-k+1$$. In fact, random minimizers have a density of $$2/(w+1)$$ and thus sample approximately two k-mers per window.

My previous post on minimizers gives more background on density and other sampling schemes. One such alternative scheme is the mod-minimizer (Groot Koerkamp and Pibiri 2024), that has low density when $$k$$ is large compared to $$w$$. That paper also contains a review of methods and pseudocode for them. In this post, I will focus on the most widely used random minimizers.

# 3 Algorithms

## 3.1 Problem statement

Before going into algorithms to compute random minimizers, let’s first precisely state the problem itself. In fact, we’ll state three versions of the problem, that each include slightly more additional information in the output.

### Problem A: Only the set of minimizers

The most basic form of computing minimizers is to simply return the list of minimizer kmers and their positions. This information is sufficient to sketch the text.

 1 2 3 4  pub trait Minimizer { /// Problem A: The absolute positions of all minimizers in the text. fn minimizer_positions(&mut self, text: &[u8]) -> Vec; } 
A trait that defines Problem A and returns the set of kmers of the input. As to why this takes a &mut self instead of a &self, see 5.8.

### Problem B: The minimizer of each window

When building datastructures on top of a sequence, such as SSHash (Pibiri 2022), just the set of k-mers is not sufficient. Each window should be mapped to its minimizer, but there could be multiple minimizers in a single window. Thus, in this version of the problem, we return the minimizer of each window.

Note that while in Problem A we return one result per minimizer, here we return one result per window (roughly, per character). This difference can have implications for the optimal algorithm implementation, since it’s not unlikely that the main loop of the optimal solution has exactly one iteration per returned element.

 1 2 3 4 5 6  pub trait Minimizer { ... /// Problem B: For each window, the absolute position in the text of its minimizer. fn window_minimizers(&mut self, text: &[u8]) -> Vec; } 
A trait for Problem B that returns the minimizer of each window.

### Problem C: Super-k-mers

We can implement SSHash and other indexing datastructures using the output of Problem B, but in most such applications, it is probably more natural to directly iterate over the super-k-mers: the longest substrings such that all contained windows have the same minimizer. This can easily be obtained from the output of Problem B by grouping consecutive windows that share the minimal k-mer.

Note that here the result is again an iterator over minimizers, like in Problem A rather than windows as in Problem B.

  1 2 3 4 5 6 7 8 9 10 11 12 13 14  struct SuperKmer { /// The absolute position in the text where this super-k-mer starts. /// The end of the super-k-mer can be inferred from the start of the next super-k-mer. start_pos: usize, /// The absolute position in the text of the minimizer of this super-k-mer. minimizer_pos: usize, } pub trait Minimizer { ... /// Problem C: The super-k-mers of the text. fn super_kmers(&mut self, text: &[u8]) -> Vec; } 
A trait that defines Problem C and returns the set of super-k-mers.

### Which problem to solve

In the remainder, we mostly focus on Problem B, since that is what I initially started with. However, some methods more naturally lean themselves to solve the other variants. We will benchmark each method on whichever of the problems it runs fastest, and postpone optimizing a solution for each specific problem until later.

Most likely though, solving Problem C will yield the most useful tool for end users. I plan to revisit this at the end of this post.

Regardless, for convenience we can implement solutions to A and C in terms of B.6

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27  pub trait Minimizer { /// Problem A: The absolute positions of all minimizers in the text. fn minimizer_positions(&mut self, text: &[u8]) -> Vec { let mut minimizers = self.window_minimizers(text); minimizers.dedup(); minimizers } /// Problem B: For each window, the absolute position in the text of its minimizer. fn window_minimizers(&mut self, _text: &[u8]) -> Vec { unimplemented!() } /// Problem C: The super-k-mers of the text. fn super_kmers(&mut self, text: &[u8]) -> Vec { self.window_minimizers(text) .into_iter() .enumerate() .group_by(|(_idx, minimizer_pos)| *minimizer_pos) .into_iter() .map(|(minimizer_pos, mut group)| SuperKmer { start_pos: group.next().expect("groups are non-empty").0, minimizer_pos, }) .collect() } } 
Code Snippet 4: Implementing minimizer_positions and super_kmers in terms of window_minimizers.

### Canonical k-mers

In actual bioinformatics applications, it is common to consider canonical k-mers and corresponding canonical minimizers. These are defined such that the canonical k-mer of a string and its reverse complement are the same. This will significantly impact the complexity of our code, and hence we also postpone this for later.

## 3.2 The naive algorithm

The naive $$O(n \cdot w)$$ method simply finds the minimum of each window independently.

It looks like this:7

In code, it looks like this.

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26  pub struct NaiveMinimizer { pub w: usize, pub k: usize, pub hasher: H, } impl Minimizer for NaiveMinimizer where H::Out: Ord, { fn window_minimizers(&mut self, text: &[u8]) -> Vec { // Iterate over the windows of size l=w+k-1. text.windows(self.w + self.k - 1) .enumerate() // For each window, starting at pos j, find the lexicographically smallest k-mer. .map(|(j, window)| { j + window .windows(self.k) .enumerate() .min_by_key(|(_idx, kmer)| self.hasher.hash(kmer)) .expect("w > 0") .0 }) .collect() } } 
Code Snippet 5: An implementation of the naive algorithm. Note that hashes are recomputed for each window.

Note that we assume that hashing each k-mer takes constant time. This is reasonable in practice, since most hash functions hash many characters at once.

### Performance characteristics

I was planning to show some preliminary performance numbers here, to give some intuition about the relative performance of this and upcoming methods. Unfortunately though (and not completely surprising), performance depends heavily on small details of how the code is written. I could provide a comparison between the methods as-is, but the numbers would be hard to interpret. Instead, let’s consider some high level expectations of this algorithm for now, and postpone the quantitive comparison until later.

Since there are no if statements, the code should be highly predictable. Thus, we expect:

• few branch misses,
• high instructions per cycle,
• and generally running times linear in $$w$$.

## 3.3 Rephrasing as sliding window minimum

After hashing all k-mers, we basically have a sequence of $$n-k+1$$ pseudo-random integers. We would like to find the position of the leftmost minimum in each window of $$w$$ of those integers.

Thus, the problem of finding random minimizers can be reformulated in terms of the sliding window minima. We can model this problem using the following trait:

  1 2 3 4 5 6 7 8 9 10 11 12 13  /// A value at an absolute position. /// When comparing, ties between value are broken in favour of small position. #[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)] pub struct Elem { pub val: V, pub pos: usize, } pub trait SlidingMin { /// Take an iterator over values of type V. /// Return an iterator over the minima of windows of size w, and their positions. fn sliding_min(&self, w: usize, it: impl Iterator) -> impl Iterator>; } 
Code Snippet 6: Trait for the sliding window minimum problem.

A first implementation that mirrors the Naive implementation above could simply store the last $$w$$ hashes in a RingBuffer, and in each step recompute the minimum. The boiler plate for the ring buffer is simple:

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37  pub struct RingBuf { w: usize, idx: usize, data: Vec, } impl RingBuf { #[inline(always)] pub fn new(w: usize, v: V) -> Self { let data = vec![v; w]; RingBuf { w, idx: 0, data } } #[inline(always)] pub fn idx(&self) -> usize { self.idx } #[inline(always)] pub fn push(&mut self, v: V) { self.data[self.idx] = v; self.idx += 1; if self.idx == self.w { self.idx = 0; } } } /// A RingBuf can be used as a slice. impl std::ops::Deref for RingBuf { type Target = [V]; #[inline(always)] fn deref(&self) -> &[V] { &self.data } } 
Code Snippet 7: A simple ring buffer.

And then a sliding window implementation also becomes trivial:

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18  pub struct Buffered; impl SlidingMin for Buffered { fn sliding_min(&self, w: usize, it: impl Iterator) -> impl Iterator> { // A ring buffer that holds the w last elements. let mut ring_buf = RingBuf::new(w, Elem { val: V::MAX, pos: 0 }); // Iterate over items and their positions. it.enumerate() .map(move |(pos, val)| { ring_buf.push(Elem { val, pos }); // Return the minimum element in the buffer. // Ties between values are broken by the position. *ring_buf.iter().min().expect("w > 0") }) // The first w-1 elements do not complete a window. .skip(w - 1) } } 
Code Snippet 8: The Buffered implementation of SlidingMin that buffers the last $$w$$ values and recomputes the minimum on each push.

Using this trait, we can clean up8 the implementation of the Naive minimizer from before9. In fact, we can now generically implement a minimizer scheme using any implementation of the sliding window algorithm.

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22  /// A minimizer implementation based on a sliding window minimum algorithm. /// Also takes a custom hash function, that defaults to FxHash. pub struct SlidingWindowMinimizer { pub w: usize, pub k: usize, pub alg: SlidingMinAlg, pub hasher: H, } impl> Minimizer for SlidingWindowMinimizer { fn window_minimizers(&mut self, text: &[u8]) -> Vec { self.hasher // Iterate over k-mers and hash them. .hash_kmers(self.k, text) // (See source for the iterator 'extension trait'.) .sliding_min(self.w, &self.alg) .map(|elem| elem.pos) .collect() } } 
A generic minimizer algorithm using any solution to the sliding window minimum problem, and the corresponding Buffered minimizer.

## 3.4 The queue

Let’s try to improve this somewhat inefficient $$O(n\cdot w)$$ algorithm. A first idea is to simply keep a rolling prefix-minimum that tracks the lowest value seen so far. For example, when $$w=3$$ and the hashes are $$[10,9,8,7,6,…]$$, the rolling prefix minimum is exactly also the minimum of each window. But alas, this won’t work for increasing sequences: when the hashes are $$[1,2,3,4,5,…]$$ and the window shifts from $$[1,2,3]$$ to $$[2,3,4]$$, the prefix minimum is $$1$$ (at index $$0$$), but it is not a minimum of the new window anymore, which instead goes up to $$2$$.

Nevertheless, as the window slides right, each time we see a new value that’s smaller than everything in the window, we can basically ‘forget’ about all existing values and just keep the new smaller value in memory. Otherwise, we can still forget about all values in the window larger than the value that is shifted in.

This is formalized by using a queue of non-decreasing values in the window. More precisely, each queue element will be smaller than all values in the window coming after it. At each step, the minimum of the window is the value at the front of the queue. Let’s look at our example again.7

We see that there are two reasons elements can be removed from the queue:

1. a smaller element is pushed,
2. the window shifts so far that the leftmost/smallest element falls outside it.

To handle this second case, we don’t just store values in this queue, but also their position in the original text, so that we can pop them when needed.

In code, it looks like this:

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25  pub struct Queue; impl SlidingMin for Queue { fn sliding_min(&self, w: usize, it: impl Iterator) -> impl Iterator> { // A double ended queue that is non-decreasing in both pos and val, so that the // smallest value is always at the front. let mut q = std::collections::VecDeque::>::new(); // Iterate over the items and their positions. it.enumerate().map(move |(pos, val)| { // Strictly larger preceding values are removed, so that the queue remains // non-decreasing. while q.back().is_some_and(|back| back.val >= val) { q.pop_back(); } q.push_back(Elem { pos, val }); // Check if the front falls out of the window. let front = q.front().expect("We just pushed"); if pos - front.pos >= w { q.pop_front(); } *q.front().expect("w > 0") }) } } 
Code Snippet 10: An implementaion of the MonotoneQueue and the corresponding QueueMinimizer.

### Performance characteristics

Since each element is pushed once and popped once, each call to push takes amortized constant $$O(1)$$ time, so that the total runtime is $$O(n)$$. We may expect a $$w=11$$ times speedup over the buffered method, but most likely this will not be the case. This is due to the high number of unpredictable branches, which hurts pipelining/instruction level parallelism (ILP). Thus, while the complexity of the Queue algorithm is good, its practical performance may not be optimal.

## 3.5 Jumping: Away with the queue

While the queue algorithm is nice in theory, in practice it is not ideal due to the branch-misses. Thus, we would like to find some middle ground between keeping this queue and the naive approach of rescanning every window. A first idea is this: Once we find the minimizer of some window, we can jump to the first window after that position to find the next minimizer. This finds all distinct minimizers, although it does not compute the minimizer for each window individually. Thus, it solves Problem A instead of Problem B.7

Since the expected distance between random minimizers is $$(w+1)/2$$, we expect to scan each position just under two times.

Bug

Bryce Kille noticed that this method has a bug: when the sequence of hashes is e.g. 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, the minimizer of the first window is 7. Then we’d jump to the 6, 5, 4, 3 slice to find 3 as the next minimizer, but that is wrong, since 6, 5, and 4 are also minimizers. Instead of only taking the minimum of the new slice, all prefix minima that are less than the previous minimum are minimizers.

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36  pub struct JumpingMinimizer { pub w: usize, pub k: usize, pub hasher: H, } impl Minimizer for JumpingMinimizer where H::Out: Ord, { fn minimizer_positions(&mut self, text: &[u8]) -> Vec { let mut minimizer_positions = Vec::new(); // Precompute hashes of all k-mers. let hashes = self.hasher.hash_kmers(self.k, text).collect::>(); let mut start = 0; while start < hashes.len() - self.w { // Position_min returns the position of the leftmost minimal hash. let min_pos = start + hashes[start..start + self.w] .iter() .position_min() .expect("w > 0"); minimizer_positions.push(min_pos); start = min_pos + 1; } // Possibly add one last minimizer. let start = hashes.len() - self.w; let min_pos = start + hashes[start..].iter().position_min().expect("w > 0"); if minimizer_positions.last() != Some(&min_pos) { minimizer_positions.push(min_pos); } minimizer_positions } } 
Solving Problem A by jumping to the window after the previous minimizer. Note that this first computes all hashes in a separate pass.

### Performance characteristics

One benefit of this method is that it is completely branchless. Each iteration of the loop corresponds to exactly one output. The only unpredictable step is which element of a window is the minimizer, since that determines which is the next window to scan. Thus, even though it is branchless, consecutive iterations depend on each other and ILP (instruction level parallelism, aka pipelining) may only have a limited effect.

## 3.6 Re-scan

I learned of an improved method via Daniel Liu’s gist for winnowing, but I believe the method is folklore10. This is a slightly improved variant of the method above. Above, we do a new scan for each new minimizer, but it turns out this can sometimes be avoided. Now, we will only re-scan when a previous minimum falls outside the window, and the minimum of the window goes up.7

Thus, we keep a ring buffer of the last $$w$$ values, and scan it as needed. One point of attention is that we need the leftmost minimal value, but since the buffer is cyclic, the leftmost element in the buffer is not necessarily the leftmost one in the original text. Thus, we don’t only store elements but also their positions, and break ties by position.

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  pub struct Rescan; impl SlidingMin for Rescan { fn sliding_min(&self, w: usize, it: impl Iterator) -> impl Iterator> { let mut min = Elem { val: V::MAX, pos: 0 }; let mut ring_buf = RingBuf::new(w, min); it.enumerate() .map(move |(pos, val)| { let elem = Elem { val, pos }; ring_buf.push(elem); min = min.min(elem); // If the minimum falls out of the window, rescan to find the new minimum. if pos - min.pos == w { min = *ring_buf.iter().min().expect("w > 0"); } min }) .skip(w - 1) } } 
Code Snippet 12: Rescan implementation of SlidingMin.

### Performance characteristics

This method should have much fewer branch-misses than the queue, since we rescan a full window of size $$w$$ at a time. Nevertheless, these re-scans occur at somewhat random moments, so some branch-misses around this are still to be expected.

## 3.7 Split windows

In an ideal world, we would use a fully predictable algorithm, i.e., an algorithm without any data-dependent branches/if-statements.11

With some Googling12 I found this codeforces blog on computing sliding window minima. In the post, the idea is presented using two stacks. My explanation below goes in a slightly different direction, but in the end it comes down to the same.

Aside: Accumulating non-invertible functions. So, as we saw, most methods keep some form of rolling prefix minimum that is occasionally ‘reset’, since ‘old’ small elements should not affect the current window. This all has to do with the fact that the min operation is not invertible: knowing only the minimum of a set $$S$$ is not sufficient to know the minimum of $$S - \{x\}$$. This is unlike for example addition: sliding window sums are easy by just adding the new value to the sum and subtracting the value that falls out of the window.

This distinction is also exactly the difference between segment trees and Fenwick trees13: a Fenwick tree computes some function $$f$$ on a range $$[l, r)$$ by removing $$f(a_0, \dots, a_{l-1})$$ from $$f(a_0, \dots, a_{r-1})$$, by decomposing both ranges into intervals with power-of-$$2$$ lengths corresponding to the binary representation of their length. But this only works if $$f$$ is invertible. If that is not the case, a segment tree must be used, which partitions the interval $$[l, r)$$ itself, without going ‘outside’ it.

The algorithm. Based on this, we would also like to partition our windows such that we can compute their minimum as a minimum over the different parts. It turns out that this is possible, and in a very clean way as well!

First, we conceptually chunk the input text into parts with length $$w$$, as shown in the top row of Figure 5 for $$w=4$$. Then, each window of length $$w$$ either exactly matches such a chunk, or overlaps with two consecutive chunks. In the latter case, it covers a suffix of chunk $$i$$ and a prefix of chunk $$i+1$$. Thus, to compute the minimum for all windows, we need to compute the prefix and suffix minima of all chunks. We could do that as a separate pass over the data in a pre-processing step, but it can also be done on the fly:

1. Each time a new kmer is processed, its hash is written to a size-$$w$$ buffer. (The orange values in the top-right of Figure 5.)
2. After processing the window corresponding to a chunk, the buffer contains exactly the hashes of the chunk. (Fourth row.)
3. Then, we compute the suffix-minima of the chunk by scanning the buffer backwards. (Fifth row.)
4. For the next $$w$$ windows, we intialise a new rolling prefix minimum in the new chunk (green outline).
5. The minimum of each window is the minimum between the rolling prefix minimum in chunk $$i+1$$ (green outline), and the suffix minimum in chunk $$i$$ that we computed in the buffer (blue outline).7

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26  pub struct Split; impl SlidingMin for Split { fn sliding_min(&self, w: usize, it: impl Iterator) -> impl Iterator> { let mut prefix_min = Elem { val: V::MAX, pos: 0 }; let mut ring_buf = RingBuf::new(w, prefix_min); it.enumerate() .map(move |(pos, val)| { let elem = Elem { val, pos }; ring_buf.push(elem); prefix_min = prefix_min.min(elem); // After a chunk has been filled, compute suffix minima. if ring_buf.idx() == 0 { let mut suffix_min = ring_buf[w - 1]; for elem in ring_buf[0..w - 1].iter_mut().rev() { suffix_min = suffix_min.min(*elem); *elem = suffix_min; } prefix_min = Elem { val: V::MAX, pos: 0 }; } prefix_min.min(ring_buf[ring_buf.idx()]) }) .skip(w - 1) } } 
Code for the Split algorithm, that computes the suffix minima for each chunk exactly every $$w$$ iterations.

### Performance characteristics

The big benefit of this algorithm is that its execution is completely deterministic, similar to the jump algorithm: every $$w$$ iterations we compute the suffix-minima, and then we continue again with the rolling prefix minimum. Since this ‘side loop’ is taken exactly once in every $$w$$ iterations, the corresponding branch is easy to predict and does not cause branch-misses. Thus, we expect nearly $$0\%$$ of branch-misses, and up to $$4$$ instructions per cycle.

# 4 Analysing what we have so far

## 4.1 Counting comparisons

Before we look at runtime performance, let’s have a more theoretical look at the number of comparisons each method makes. We can do this by replacing the u64 hash type by a wrapper type that increments a counter each time .cmp() or .partial_cmp() is called on it. I exclude comparisons with the MAX sentinel value since those could possibly be optimized away anyway. The code is not very interesting, but you can find it here.

 1 2 3 4 5 6   w=10 w=20 formula Buffered: 9.000 19.000 w-1 Queue: 1.818 1.905 2-2/(w+1) Jumping: 1.457 1.595 ? Rescan: 1.818 1.905 2-2/(w+1) Split: 2.700 2.850 3-3/w 
The number of comparisons per character on a string of length $$10^8$$ for $$w=10$$ and $$w=20$$.

Observations:

• Buffered Does $$w-1$$ comparisons for each window to find the max of $$w$$ elements.
• Queue and Rescan both do exactly $$2-2/(w+1)$$ comparisons per element.
• For Queue, each pop_back is preceded by a comparison, and each push has one additional comparison with the last element in the queu that is smaller than the new element. Only when the new element is minimal (probability $$1/(w+1)$$), the queue becomes empty and that comparison doesn’t happen. Also when the smallest element is popped from the front (probability $$1/(w+1)$$), that means it won’t be popped from the back and hence also saves a comparison.
• Unsurprisingly, the Jumping algorithm uses the fewest comparisons, since it also returns strictly less information than the other methods.
• The Split method, which seems very promising because of its data-independence, actually uses around $$50\%$$ more comparisons ($$3-3/w$$, to be precise), because for each window, a minimum must be taken between the suffix and prefix.

### Open problem: Theoretical lower bounds

It would be cool to have a formal analysis showing that we cannot do better than $$2-2/(w+1)$$ expected comparisons per element to compute the random minimizer of all windows, and similarly it would be nice to have a lower bound on the minimal number of comparisons needed to only compute the positions of the minimizers.

## 4.2 Setting up benchmarking

Before we start our implementation, let’s set up a benchmark so we can easily compare them. We will use criterion, together with the cargo-criterion helper. First we add criterion as a dev-dependency to Cargo.toml.

  1 2 3 4 5 6 7 8 9 10  [dev-dependencies] criterion = "*" # Do not build the library as a benchmarking target. [lib] bench = false [[bench]] name = "bench" harness = false 
Code Snippet 15: Cargo.toml changes to set up criterion.

We add benches/bench.rs which looks like this:

Click to show code.
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34   optimized, ext_nthash, buffered, local_nthash, simd_minimizer, human_genome ); criterion_main!(group); fn initial_runtime_comparison(c: &mut Criterion) { // Create a random string of length 1Mbp. let text = &(0..1000000) .map(|_| b"ACGT"[rand::random::() as usize % 4]) .collect::>(); let hasher = FxHash; let w = 11; let k = 21; #[rustfmt::skip] let minimizers: &mut [(&str, &mut dyn Minimizer, bool)] = &mut [ ("naive", &mut NaiveMinimizer { w, k, hasher }, true), ("buffered", &mut SlidingWindowMinimizer { w, k, alg: Buffered, hasher }, true), ("queue", &mut SlidingWindowMinimizer { w, k, alg: Queue, hasher }, true), ("jumping", &mut JumpingMinimizer { w, k, hasher }, false), ("rescan", &mut SlidingWindowMinimizer { w, k, alg: Rescan, hasher }, true), ("split", &mut SlidingWindowMinimizer { w, k, alg: Split, hasher }, true), ("queue_igor", &mut QueueIgor { w, k }, false), ("rescan_daniel", &mut RescanDaniel { w, k }, true), ]; let mut g = c.benchmark_group("g"); for (name, m, window_minimizers) in minimizers { g.bench_function(*name, move |b| { if *window_minimizers { b.iter(|| m.window_minimizers(text)); } else { 
Code Snippet 16: A basic criterion benchmark for the comparison so far.

### Making criterion fast

By default, the benchmarks are quite slow. That’s for a couple of reasons:

• Before each benchmark, a 3 second warmup is done.
• Each benchmark is 5 seconds.
• After each benchmark, some plots are generated.
• At the end, some HTML reports are generated.

I’m impatient, and all this waiting really impacts my iteration time, so let’s make it faster:

• Reduce warmup time to 0.5s.
• Reduce benchmark time to 2s, and only take 10 samples.
• Disable plots and html generation.

The first two are done from code:

 1 2 3 4 5 6 7 8 9  criterion_group!( name = group; config = Criterion::default() // Make sure that benchmarks are fast. .warm_up_time(Duration::from_millis(500)) .measurement_time(Duration::from_millis(2000)) .sample_size(10); targets = initial_runtime_comparison, count_comparisons_bench ); 
Code Snippet 17: Initializing criterion with reduced warmup and measurement time.

The last is best done by adding the --plotting-backend disabled flag. For convenience, we add this rule to the justfile so we can just do just bench. I’m also adding quiet to hide the comparison between runs to simplify presentation.

 1 2  bench: cargo criterion --plotting-backend disabled --output-format quiet 
Code Snippet 18: Adding the bench rule to the justfile.

When we now do just bench, we see this (TODO update):

### A note on CPU frequency

Most consumer CPUs support turboboost to increase the clock frequency for short periods of time. That’s nice, but not good for stable measurements. Thus, I always pin the frequency of my i7-10750H to the default 2.6GHz:

 1  sudo cpupower frequency-set --governor powersave -d 2.6GHz -u 2.6GHz 
Code Snippet 19: Pinning CPU frequency to 2.6GHz.

This usually results in very stable measurements.

Similarly, I have hyper threading disabled, so that each program has a full core to itself, without interference with other programs.

## 4.3 Runtime comparison with other implementations

Let’s compare the runtime of our method so far with two other Rust implementations:

 1 2 3 4 5 6 7 8  g/naive time: [64.399 ms 64.401 ms 64.404 ms] g/buffered time: [55.154 ms 55.160 ms 55.173 ms] g/queue time: [22.305 ms 22.322 ms 22.335 ms] g/jumping time: [7.0913 ms 7.0969 ms 7.1042 ms] g/rescan time: [14.067 ms 14.071 ms 14.078 ms] g/split time: [14.021 ms 14.023 ms 14.027 ms] g/queue_igor time: [26.680 ms 26.689 ms 26.700 ms] g/rescan_daniel time: [9.0335 ms 9.0340 ms 9.0349 ms] 
Code Snippet 20: Runtime comparison between our and other's implementations.

Some initial observations:

• Measurements between samples are very stable, varying less than $$1\%$$.
• Each method takes 7-55ms to process 1 million characters. That’s 7-55ns per character, or 7s to 55s for 1Gbp.
• Our queue implementation performs similar to the one in minimizer-iter.
• Daniel’s implementation of Rescan is around twice as fast as ours! Time to start looking deeper into our code.

Looking back at our expectations:

• As expected, Buffered is the slowest method by far, but even for $$w=11$$, it is only $$2\times$$ slower than the next-slowest method.
• Split is about as fast as Rescan, and even a bit slower. We were expecting a big speedup here, so something is weird.
• RescanDaniel is even much faster than our Split, so clearly there must be a lot of headroom.

## 4.4 Deeper inspection using perf stat

Let’s look some more at performance statistics using perf stat. We can reuse the benchmarks for this, and call criterion with the --profile-time 2 flag, which simple runs a benchmark for $$2$$ seconds.

 1 2 3  stat test='': cargo build --profile bench --benches perf stat -d cargo criterion -- --profile-time 2 {{test}} 
Code Snippet 21: just stat will run a benchmark for 2 seconds under perf stat.

The metrics that are relevant to us now are summarized in the following table.

Table 1: perf stat results for all methods so far.
metricRuntimeinstructions/cyclebranches/secbranch misses
ns/charGHzM/sec%
Naive644.1317430.11%
Buffered552.179874.66%
Queue222.298104.29%
QueueIgor272.336624.29%
Jumping7.13.8914590.13%
Rescan143.0211162.26%
RescanDaniel9.03.2114100.85%
Split143.3711161.55%

Observe:

• Naive has an instructions per cycle (IPC) of over 4! I have never actually seen IPC that high. It’s not measurement noise, so most likely it comes from the fact that some instructions such as compare followed by conditional jump are fused.
• Naive also has pretty much no branch misses.
• Buffered only has half the IPC of Naive but is still slightly faster since it only hashes each kmer once. For some reason, there are a lot of branch misses, which we’ll have to investigate further.
• The two Queue variants also have relatively few instructions per cycle, and also relatively a lot of branch misses. Here, the branch-misses are expected, since the queue pushing and popping are inherently unpredictable.
• In fact, I argued that the last element of the Queue is popped with probability $$50\%$$. But even then, the branch predictor is able to get down to $$4\%$$ of branch misses. Most likely there are correlations that make these events somewhat predictable, and of course, there are also other easier branches that count towards the total.
• Jumping has an very good IPC of nearly $$4$$, as we somewhat hoped for, and correspondingly has almost no branch misses.
• The remaining methods all have a decent IPC just above $$3$$.
• RescanDaniel has much fewer branch misses than Rescan. We should investigate this.
• We argued that Split is completely predictable, but we see a good number of branch misses anyway. Another thing to investigate.

## 4.5 A first optimization pass

It’s finally time to do some actual optimizations! We’ll go over each of the algorithms we saw and optimize them. This step is ‘just’ so we can make a fair benchmark between them, and in the end we will only optimize one of them further, so I will try to keep it short. The final optimized version of each algorithm can be found in the repository, but intermediate diffs are only shown here.

### Optimizing Buffered: reducing branch misses

Let’s have a look at perf record:

 1 2 3 4 5 6   1.16 │ cmp %r8,%rbp ; compare two values 0.92 │ setne %r11b 0.70 │ ┌──jb 976 ; conditionally jump down *6.15*│ │ mov %r11b,%r13b ; overwrite the minimum (I think) 0.02 │ │ mov %r13d,%r12d ; overwrite the position (I think) *5.44*│976:└─→mov 0x8(%rdx),%r10 ; continue 
Code Snippet 22: just perf buffered shows very inefficient and branchy code. Lines marked with a * have over 5% of time spent in them. This pattern occurs multiple times due to loop unrolling.

Most of these branches come from the fact that computing the minimum of two Elem { val: usize, pos: usize } is very inefficient, since two comparisons are needed, and then the result is conditionally overwritten.

The only reason we are storing an Elem that includes the position is that we must return the leftmost minimum. But since e.g. position_min() returns the leftmost minimum anyway, we can try to avoid storing the position and always iterate over values from left to right. We’ll add a method to our RingBuf for this:

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29   &mut self.data } } /// Extensions for ringbuf impl RingBuf { pub fn forward_slices(&self) -> [&[V]; 2] { let (a, b) = self.data.split_at(self.idx); [b, a] } pub fn forward_min(&self) -> Elem where V: Copy + Ord + Max, { let mut min = Elem { val: V::MAX, pos: 0 }; for (idx, &v) in self .forward_slices() .into_iter() .flat_map(|part| part) .enumerate() { if v < min.val { min = Elem { val: v, pos: idx }; } } min } } 
Code Snippet 23: Split the RingBuf into two slices, such that the element of the first returned slice come before the second part, and compute their minimum.
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16  pub struct BufferedOpt; impl SlidingMin for BufferedOpt { fn sliding_min(&self, w: usize, it: impl Iterator) -> impl Iterator> { // Note: this only stores V now, not Elem. let mut ring_buf = RingBuf::new(w, V::MAX); it.enumerate() .map(move |(pos, val)| { ring_buf.push(val); let mut min = ring_buf.forward_min(); min.pos += pos - w + 1; min }) .skip(w - 1) } } 
Code Snippet 24: BufferedOpt that only stores values and finds the minimum from left to right. Runs in 26ms, down from 55ms.

The hot loop of the code looks much, much better now, using cmov instead of branches to overwrite the minimum.

New assembly
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16   1.59 │898:┌─→lea 0x8(%r14),%r13 4.85 │ │ mov %rbx,0xe0(%rsp) 2.72 │ │ mov %r9,0xf8(%rsp) 5.29 │ │ mov %r13,0xf0(%rsp) 1.85 │ │ lea 0x1(%rax),%rbp 4.78 │ │ mov %rbp,0x110(%rsp) 11.20 │ │ mov (%r14),%r14 7.24 │ │ cmp %rcx,%r14 ; compare 3.41 │ │ cmovb %rax,%r11 ; conditionally overwrite pos 4.04 │ │ cmovb %r14,%rcx ; conditionally overwrite min 2.50 │ │ mov %rbp,%rax 5.19 │ │ mov %r13,%r14 3.01 │8d4:│ test %r14,%r14 0.92 │ │↓ je 8de 0.44 │ ├──cmp %r9,%r14 3.90 │ └──jne 898 
Code Snippet 25: New hot-loop, using cmov, and very few branches.

Also the perf stat is great now: 4.23 IPC and 0.13% branch misses, so we’ll leave it at this.

Takeaway
• Compute minima from left to right to avoid comparing positions.
• Prefer cmov over branches to compute minima.

### Queue is hopelessly branchy

We can confirm that indeed a lot of time is spent on branch-misses around the popping of the queue.

Assembly code showing a badly predicted branch.
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16   1.20 │3a0: lea (%rdx,%rax,1),%rdi 0.63 │ ↑ cmp %rcx,%rdi 1.34 │ │ mov $0x0,%edi 0.97 │ │ cmovae %rcx,%rdi 0.46 │ │ shl$0x4,%rdi 0.49 │ │ mov %rsi,%r8 1.58 │ │ sub %rdi,%r8 1.52 │ │ ┌──cmp %rbp,(%r8) ; compare the new element with the back 1.97 │ │ ├──jb 3d2 ; if back is smaller, break loop *9.85*│ │ │ dec %rax ; pop: decrease capacity 3.70 │ │ │ mov %rax,0x28(%rsp) 0.41 │ │ │ add $0xfffffffffffffff0,%rsi 0.08 │ │ │ test %rax,%rax ; queue not yet empty 0.37 │ └──│─ jne 3a0 ; try popping another element 0.38 │ │ xor %eax,%eax *9.64*│3d2:└─→cmp %rcx,%rax  Code Snippet 26: Each of the two directions after the branch takes 10% of time. Sadly, I don’t really see much room for improvement here. Takeaway Data-dependent branches are terrible. ### Jumping is already very efficient If we do a perf report here, we see that only 25% of the time is spent in the minimizer_positions function:  1 2  63.69% as Iterator>::fold 24.25% as Minimizer>::minimizer_positions  Code Snippet 27: just perf jumping shows that most time is spent computing hashes. The hot loop for finding the minimum of a window is very compact again. Assembly code   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23  2.84 │170:┌─→lea 0x1(%rcx),%rsi 1.58 │ │ mov -0x18(%rdx,%rcx,8),%rdi 4.17 │ │ mov -0x10(%rdx,%rcx,8),%r8 0.09 │ │ cmp %rdi,%rax 3.60 │ │ cmovb %rax,%rdi 9.34 │ │ cmovbe %rbp,%rsi │ │ lea 0x2(%rcx),%rax 0.94 │ │ cmp %r8,%rdi 2.59 │ │ cmovae %r8,%rdi 9.36 │ │ cmovbe %rsi,%rax 1.35 │ │ mov -0x8(%rdx,%rcx,8),%rsi 0.27 │ │ lea 0x3(%rcx),%rbp 0.86 │ │ cmp %rsi,%rdi 5.69 │ │ cmovae %rsi,%rdi 5.94 │ │ cmovbe %rax,%rbp 1.48 │ │ mov (%rdx,%rcx,8),%rsi 0.72 │ │ add$0x4,%rcx 0.32 │ │ cmp %rsi,%rdi 4.98 │ │ mov %rdi,%rax 1.31 │ │ cmovae %rsi,%rax 5.76 │ │ cmova %rcx,%rbp │ ├──cmp %rbx,%rcx 0.13 │ └──jne 170 
Code Snippet 28: Hot loop of Jumping that finds the minimum of a window, with plenty cmov instructions and no branches.

More problematic is the 63% of time spent on hashing kmers, but we will postpone introducing a faster hash function a bit longer.

### Optimizing Rescan

The original Rescan implementation suffers from the same issue as the Buffered and compares positions, so let’s reuse the RingBuf::forward_min() function, and compare elements only by value.

Code for RescanOpt.
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24  pub struct RescanOpt; impl SlidingMin for RescanOpt { fn sliding_min(&self, w: usize, it: impl Iterator) -> impl Iterator> { let mut min = Elem { val: V::MAX, pos: 0 }; // Store V instead of Elem. let mut ring_buf = RingBuf::new(w, V::MAX); it.enumerate() .map(move |(pos, val)| { ring_buf.push(val); if val < min.val { min = Elem { val, pos }; } // If the minimum falls out of the window, rescan to find the new minimum. if pos - min.pos == w { min = ring_buf.forward_min(); min.pos += pos - w + 1; } min }) .skip(w - 1) } } 
Code Snippet 29: Optimized RescanOpt that runs in 10.2ms instead of 14.1ns.

This has 3.55 IPC up from 3.02, and 0.91% branch misses, down from 2.26%, which results in the runtime going down from 14.1ms to 10.2ms.

The hottest part of the remaining code is now the branch-miss when if pos - min.pos == w is true, taking 6% of time, since this is an unpredictable branch. This seems rather unavoidable. Otherwise most time is spent hashing kmers.

### Optimizing Split

Also here we start by avoiding position comparisons. This time it is slightly more tricky though: the RingBuf does have to store positions, since after taking suffix-minima, the value at each position does not have to correspond anymore to its original position.

The tricky part here is to ensure that the compiler generates a branchless minimum, since usually it creates a branch to avoid overwriting two values with their own value:

 1 2 3 4 5 6 7 8  -if ring_buf[i + 1].val <= ring_buf[i].val { - ring_buf[i] = ring_buf[i + 1]; -}; +ring_buf[i] = if ring_buf[i + 1].val <= ring_buf[i].val { + ring_buf[i + 1] +} else { + ring_buf[i] +}; 
Code Snippet 30: Making sure the suffix minima use cmov instructions.
Code for SplitOpt.

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35  pub struct SplitOpt; impl SlidingMin for SplitOpt { fn sliding_min(&self, w: usize, it: impl Iterator) -> impl Iterator> { let mut prefix_min = Elem { val: V::MAX, pos: 0 }; let mut ring_buf = RingBuf::new(w, prefix_min); it.enumerate() .map(move |(pos, val)| { let elem = Elem { val, pos }; ring_buf.push(elem); if val < prefix_min.val { prefix_min = elem; } // After a chunk has been filled, compute suffix minima. if ring_buf.idx() == 0 { for i in (0..w - 1).rev() { ring_buf[i] = if ring_buf[i + 1].val <= ring_buf[i].val { ring_buf[i + 1] } else { ring_buf[i] }; } prefix_min = Elem { val: V::MAX, pos: 0 }; } let suffix_min = ring_buf[ring_buf.idx()]; if suffix_min.val <= prefix_min.val { suffix_min } else { prefix_min } }) .skip(w - 1) } } 
Optimized SplitOpt that runs in 11.2ms instead of 14.0ns.

Now, it runs in 11.2ms instead of 14.0ms, and has 3.46 IPC (up from 3.37 IPC) and 0.63% branch misses, down from 1.55%. The remaining branch misses seem to be related to the kmer hashing, although this is somewhat surprising as Jumping doesn’t have them.

Takeaway

Data-independent branches are nearly free.

## 4.6 A new performance comparison

Now that we have our optimized versions, let’s summarize the new results.

Table 2: perf stat results for all methods so far, for w=11.
metricRuntimeinstructions/cyclebranches/secbranch misses
ns/charGHzM/sec%
Naive644.1317430.11%
Buffered552.179874.66%
BufferedOpt264.1914990.13%
Queue222.298104.29%
QueueIgor272.336624.29%
Jumping7.13.8914590.13%
Rescan143.0211162.26%
RescanOpt103.5813970.90%
RescanDaniel9.03.2114100.85%
Split143.3711161.55%
SplitOpt113.4714100.63%

Note how BufferedOpt is now almost as fast as Queue, even though it’s $$O(nw)$$ instead of $$O(n)$$. Unpredictable branches really are terrible. Also, RescanOpt is now nearly as fast as RescanDaniel, although Jumping is still quite a bit faster. For reasons I do not understand SplitOpt is still not faster than RescanOpt, but we’ll have to leave it at this.

# 5 Rolling14 our own hash

As we saw, Jumping spends two thirds of its time on hashing the kmers, and in fact, SplitOpt and RescanOpt spend around half of their time on hashing. Thus, it is finally time to investigate some different hash functions.

## 5.1 FxHash

So far, we have been using FxHash, a very simple hash function originally used by FireFox and also used in the Rust compiler. For each u64 chunk of 8 bytes, it bit-rotates the hash so far, xor’s it with the new words, and multiplies by a fixed constant. For inputs less than 8 characters, it’s only a single multiplication. Even though this method iterates over the input length, all k-mers have the same length, so that the branches involved are usually very predictable. Nevertheless, hashing each k-mer involves quite some instructions, and especially a lot of code is generated to to handle all lengths modulo $$8$$ efficiently.

### WyHash

WyHash is currently the fastest cryptographically secure hash function in the SMHasher benchmark, but using it instead of FxHash slows down the benchmarks by roughly 2ns per character. The reason is that FxHash itself is not secure, and hence can be simpler.

## 5.2 NtHash: a rolling hash

We could avoid this for loop for the hashing of each k-mer by using a rolling hash that only needs to update the hash based on the character being added and removed. The classsic rolling hash is the Karp-Rabin hash, where the hash of a kmer $$x$$ is $h(x) = \sum_{i=0}^{k-1} x_i \cdot p^i \pmod m,$ for some modulus $$m$$ (e.g. $$2^{64}$$) and coprime integer $$p$$. This can be computed in an incremental way by multiplying the previous hash by $$p$$ and adding/subtracting terms for the character being added and removed.

NtHash (Mohamadi et al. 2016) is a more efficient variant15 of this based on ‘buzhash’, using multiplication and addition $$(\times, +)$$ uses bitwise rotation and xor $$(\text{rot}, \oplus)$$: $h(x) = \bigoplus_{i=0}^{k-1} \text{rot}^i(f(x_i)),$ where $$f$$ is a static table lookup that maps each character to some fixed random value. (See also this post on ntHash).

The main benefit of this method over Karp-Rabin hashing is that typically table lookups and bit rotate instructions are much faster than multiplications.

One limitation of NtHash is that it is periodic: when $$k> 64$$, characters that are exactly 64 positions apart will have the same rotation applies to their $$f(x_i)$$. This means that the hashes of A....A and B....B are be the same, and similarly the hashes of A....B and B....A are the same. NtHash2 (Kazemi et al. 2022) fixes this by using multiple rotation periods, but we will stick with the original NtHash. In practice, minimizers with $$k> 64$$ are rarely used anyway, and even if so, then the probability of hash collisions between the $$w$$ k-mers in a window should be small enough.

### The nthash crate

Using the nthash crate, the implementation of the Hasher trait that we’ve implicitly been using so far is simple:

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29  pub trait Hasher: Clone { type Out; fn hash(&self, t: &[u8]) -> Self::Out; #[inline(always)] fn hash_kmers(&mut self, k: usize, t: &[u8]) -> impl Iterator { t.windows(k).map(|kmer| self.hash(kmer)) } } #[derive(Clone, Copy, Debug)] pub struct FxHash; impl Hasher for FxHash { type Out = u64; fn hash(&self, t: &[u8]) -> u64 { fxhash::hash64(t) } } #[derive(Clone, Copy, Debug)] pub struct ExtNtHash; impl Hasher for ExtNtHash { type Out = u64; fn hash(&self, t: &[u8]) -> u64 { nthash::ntf64(t, 0, t.len()) } fn hash_kmers(&mut self, k: usize, t: &[u8]) -> impl Iterator { nthash::NtHashForwardIterator::new(t, k).unwrap() } } 
Wrapping the nthash crate with the Hasher trait.

Unfortunately, the results are mixed:

Table 3: Runtime in ns/kmer of methods with FxHash and ntHash.
alg \ hash:FxHashntHash
BufferedOpt2630
Queue2224
Jumping7.15.0
RescanOpt1014
SplitOpt1116

All methods become 2-4ns/ker slower, apart from Jumping, which is actually significantly faster. Remember that for Jumping (Code Snippet 11) we first compute all hashes and collect them into a vector, while all other methods compute hashes on the fly. Thus, for some reason ExtNtHasher must be faster than FxHash when simply streaming through all kmers, while it’s slower when rolling the hash at more irregular intervals.

### Buffered hash values

Based on the difference above, let’s investigate whether first collecting all hashes to a vector speeds up the other methods as well. We provide a simple wrapper type for this:

  1 2 3 4 5 6 7 8 9 10 11 12  pub struct Buffer { pub hasher: H, } impl Hasher for Buffer { type Out = H::Out; fn hash(&self, t: &[u8]) -> Self::Out { self.hasher.hash(t) } fn hash_kmers(&mut self, k: usize, t: &[u8]) -> impl Iterator { self.hasher.hash_kmers(k, t).collect::>().into_iter() } } 
Code Snippet 33: Buffer<H> is a Hasher that first collects all hashes to a vector and then iterates over them.
Table 4: Runtime in ns/kmer of methods with FxHash, ntHash, and buffered versions.
alg \ hash:FxHashFxHashntHashntHash
buffered:noyesnoyes
BufferedOpt26313028
Queue22242422
Jumping7.17.15.05.0
RescanOpt10131411
SplitOpt11141612
Takeaway

Buffering hashes in an intermediate vector helps for ntHash, but not for FxHash.

## 5.3 Making ntHash fast: going branchless

Still now, ntHash take over half the time of the Jumping algorithm, so we’ll try to make it a bit faster.

 1 2  50.65% as SpecFromIter>::from_iter 36.68% as Minimizer>::minimizer_positions 
Code Snippet 34: just perf jumping_nt shows that most time is spent creating the Vec of hashes from the ntHash iterator.
Assembly for collecting ntHashes into a vector.
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46  1c5: mov %r14,(%rax,%r13,8) inc %r13 mov %r13,0x40(%rsp) if self.current_idx == self.max_idx { mov %rbx,%rcx add %r13,%rcx ↑ je 153 if self.current_idx != 0 { 1dd: mov %rbp,%rcx add %r13,%rcx ↓ je 24d let seqi = self.seq[i]; lea -0x1(,%r13,1),%rdi add %rbp,%rdi cmp %r15,%rdi ↓ jae 2e2 mov 0x20(%rsp),%rcx let seqk = self.seq[i + self.k]; add %r13,%rcx dec %rcx cmp %r15,%rcx ↓ jae 2cd let seqi = self.seq[i]; movzbl -0x1(%r12,%r13,1),%ecx lea anon.c37bfa7dce888be50e4a96f2eaf16e27.6.llvm.2840843246760040980+0x6f,%rsi let val = H_LOOKUP[c as usize]; mov (%rsi,%rcx,8),%rdx if val == 1 { cmp $0x1,%rdx ↓ je 272 mov 0x28(%rsp),%rcx let seqk = self.seq[i + self.k]; movzbl -0x1(%rcx,%r13,1),%ecx let val = H_LOOKUP[c as usize]; mov (%rsi,%rcx,8),%rsi if val == 1 { cmp$0x1,%rsi ↓ je 272 rorx $0x3f,%r14,%r14 mov 0x18(%rsp),%rcx rol %cl,%rdx self.fh = self.fh.rotate_left(1) ^ h(seqi).rotate_left(self.k as u32) ^ h(seqk); xor %rsi,%r14 xor %rdx,%r14 24d: cmp 0x30(%rsp),%r13 ↑ jne 1c5  Code Snippet 35: Assembly code for collecting all ntHashes into a vector. We see plenty of branches. Even though we know from perf stat that there are basically no branch misses, the assembly does contain a lot of branches, most of which correspond to assertions/bound checks in the original code. To fix these, will copy the source of the nthash crate and modify it locally. To start, ExtNtHash takes in 2.9ns/kmer to collect kmer hashes to a vector. ### Drop sanity checks First, we see that for each character it is checked whether it is ACGT. It would be faster to do that up-front if needed, so let’s remove the check.  1 2 3 4 5 6 7   fn h(c: u8) -> u64 { let val = H_LOOKUP[c as usize]; - if val == 1 { - panic!("Non-ACGTN nucleotide encountered! {}", c as char) - } val }  Code Snippet 36: Skip checking that characters are ACGT. Runtime: 2.9ns/kmer2.6ns/kmer. ### Drop bound checks Next, we sprinkle in some get_unchecked, to avoid unnecessary bound checks. Both checks are provably impossible to fail, yet the compiler isn’t able to elide them. I suspect it has something to do with the NtHashIt::new() function not being inlined, and hence the next() function not being aware of invariants checked during initialization. I also tried adding some additional assertions and using unchecked_add to indicate that overflow is undefined behaviour, but even then one of the branches remained. Even with full link time optimization enabled (lto = true in Cargo.toml), the branches are still there16.  1 2 3 4 5 6 7 8 9   if self.current_idx != 0 { let i = self.current_idx - 1; - let seqi = self.seq[i]; - let seqk = self.seq[i + self.k]; + let seqi = unsafe { *self.seq.get_unchecked(i) }; + let seqk = unsafe { *self.seq.get_unchecked(i + self.k) }; ... }  Code Snippet 37: Avoiding bound checks: 2.6ns/kmer1.9ns/kmer. Takeaway Not all provably redundant bound checks are always elided. ### Efficiently collecting to a vector At this point there are three branches left: 1. Checking whether the end of the sequence was reached. 2. The self.current_idx != 0 check that makes sure we don’t remove a character in the first iteration. 3. A check that the vector we are accumulating into still has space. Both the second and third should be redundant: we can handle the first iteration separately, and pre-allocate a vector that is sufficiently large. Let’s fix the latter one first. Ideally, the compiler would check that the size_hint provided by the NtHashIt is large enough and that the initial Vec::reserve() has sufficient capacity that an overflow can not happen, but alas, that’s not the case (also not with full lto). Generally, it seems that we can not get rid of the possible resize check while using .collect::<Vec<_>>(). Thus, let’s roll our own. First, we can manually reserve the vector capacity and then fill it:  1 2 3 4 5 6 7 8 9  fn hash_kmers(&mut self, k: usize, t: &[u8]) -> impl Iterator { let len = t.len() - k + 1; let mut v = Vec::with_capacity(len); let mut it = self.hasher.hash_kmers(k, t); for _ in 0..len { v.push(it.next().unwrap()); } v.into_iter() }  Code Snippet 38: Even with a manually reserved vector, there is an impossible-to-reach call to grow_one. Generated assembly   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25   0.04 │330:┌─→mov 0x10(%rsp),%rax 19.91 │ │ mov %r13,0x8(%rax,%rbx,8) 0.01 │ │ add$0x2,%rbx 0.11 │ │ mov %rbx,0x18(%rsp) │ │ mov %r12,%rax 19.65 │ │ mov %rbp,%rbx │ │ add %rbp,%rax │ │↓ je 38f ; unwrap failed? 0.06 │34e:│ lea 0x1(%rbx),%rbp 0.01 │ │ movzbl (%r15,%rbx,1),%eax 18.40 │ │ rorx $0x3f,%r13,%rdx 0.06 │ │ mov (%r14,%rax,8),%r13 0.16 │ │ mov 0x20(%rsp),%rcx 20.75 │ │ rol %cl,%r13 0.13 │ │ mov 0x60(%rsp),%rax 0.36 │ │ movzbl (%rax,%rbx,1),%eax 19.85 │ │ xor %rdx,%r13 0.41 │ │ xor (%r14,%rax,8),%r13 │ ├──cmp 0x8(%rsp),%rbp 0.06 │ └──jne 330 ; vector capacity was reached? │ lea 0x8(%rsp),%rdi │ vzeroupper │ → call alloc::raw_vec::RawVec::grow_one ; grow the vec │ ↑ jmp 330 │38f: mov 0x8(%rsp),%r12  Code Snippet 39: The assembly still shows a call to grow_one, which should be impossible to reach. Well this doesn’t work. I really trusted the compiler to optimize this away, but it doesn’t17. Guess I’ll have to adjust my trust level accordingly. Anyway, I suppose we’ll just make a default vector with the right size and then fill it, so we can completely drop the push call.   1 2 3 4 5 6 7 8 9 10 11 12 13   fn hash_kmers(&mut self, k: usize, t: &[u8]) -> impl Iterator { let len = t.len() - k + 1; - let mut v = vec![H::Out::default(); len]; + let mut v = Vec::with_capacity(len); let mut it = self.hasher.hash_kmers(k, t); - for x in v.iter_mut() { - *x = it.next().unwrap(); - } + for _ in 0..len { + v.push(it.next().unwrap()); + } v.into_iter() }  Code Snippet 40: Using a fixed-size vector drops the grow check! Runtime: 1.94ns/kmer1.71ns/kmer. The current assembly looks like this.   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15  300:┌─→mov %r11,%rcx ; read max_iterations │ add %r8,%rcx ; check if last iteration │↓ je 362 ; unwrap fails? │ movzbl (%r12,%r8,1),%ecx ; get seq[i] │ mov (%r9,%rcx,8),%r10 ; lookup hash of seq[i] │ mov %r13d,%ecx ; read 32-bit k into %ecx │ rol %cl,%r10 ; rotate hash(seq[i]) by %cl = low 8 bits of %ecx │ rorx$0x3f,%rsi,%rsi ; rotate rolling hash %rsi │ xor %r10,%rsi ; xor with hash of seq[i] │ movzbl (%rax,%r8,1),%ecx ; get seq[i+k] │ xor (%r9,%rcx,8),%rsi ; xor with hash of seq[i+k] │ mov %rsi,0x8(%r14,%r8,8) ; write result to vec │ inc %r8 ; increase i ├──add $0xfffffffffffffff8,%rdx ; decrease loop counter └──jne 300 ; loop  Code Snippet 41: Current assembly code for collecting ntHashes into a vector. Spend some time to understand the comments here. Takeaway Avoid push to fill know-size arrays. (But maybe this doesn’t apply in general.) Aside: x64 intricacies. This looks pretty great! The first three lines are somewhat redundant though, since we know the unwrap can never fail18. And also on line 6, there is a 32-bit mov from %r13d to %ecx of which the low 8 bits are used on line 7 as %cl19. But then why not directly use the low 8 bits of %r13, %r13b? Maybe the compiler insists on using the counter register c for this? After some more reading and discussing with Tommasso Fontana, that does indeed seem to be the case: this page lists the opcodes for rol instructions, and the only supported variants are rotating by 1, by cl, and by an immediate value.20. Aside: native compilation. I’m using -march=native. Without that, it generates a rol$1,%rsi instead of a rorx $0x3f,%rsi,%rsi, and the runtime is 1.83ns/kmer instead of 1.71ns/kmer. rorx is probably faster because the x suffix means it doesn’t set overflow/carry flags, and hence has fewer dependencies with other instructions. But rorx is only available with BMI2, so this requires the native compilation. Takeaway Use -march=native. Back to branches. Either way, let’s get rid of the last remaining branch:  1 2 3 4   for x in v.iter_mut() { - *x = it.next().unwrap(); + unsafe { *x = it.next().unwrap_unchecked() }; }  Code Snippet 42: Using unwrap_unchecked to avoid a branch. Runtime: 1.71ns/kmer2.20ns/kmer. Somehow this is worse. It seems two new branches came back when removing this one!21 Assembly, but worse.   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19   │2f0:┌─→cmp %rbx,%r8 │ │↑ je 2e0 10.88 │ │ test %r8,%r8 0.01 │ │↓ je 323 0.03 │ │ movzbl -0x1(%r13,%r8,1),%ecx 8.79 │ │ lea (%r8,%rbp,1),%r9 │ │ dec %r9 12.47 │ │ movzbl 0x0(%r13,%r9,1),%r9d 0.06 │ │ rorx$0x3f,%rdx,%r10 9.37 │ │ mov (%r11,%rcx,8),%rdx │ │ mov %ebp,%ecx 19.99 │ │ rol %cl,%rdx 0.09 │ │ xor %r10,%rdx 13.48 │ │ xor (%r11,%r9,8),%rdx │323:│ inc %r8 16.40 │ │ mov %rdx,(%r14,%rsi,1) │ │ add $0x8,%rsi │ ├──cmp %rsi,%r12 8.44 │ └──jne 2f0  Code Snippet 43: There is now a branch again to check if the index is 0. I’ve also tried adding an assert_eq!(v.len(), len), but that doesn’t help either.22  1 2  +let mut v = vec![H::Out::default(); len]; +assert_eq!(v.len(), len);  Code Snippet 44: Asserting on the length of the vector doesn't change the runtime. Addressing mode. Another thing we can do is improve the vector addressing mode. Instead of writing to vector_offset + index * 8, it’s sometimes faster to directly write to a *ptr address. Let’s try something:  1 2 3 4 5 6 7 8 9   let mut v = vec![H::Out::default(); len]; assert_eq!(v.len(), len); let mut it = self.hasher.hash_kmers(k, t); -for x in v.iter_mut() { - *x = it.next().unwrap(); -} +for i in 0..len { + unsafe { v.as_mut_ptr().add(i).write(it.next().unwrap()) }; +}  Using a pointer offset instead of iterating the vector again makes things worse: 1.71ns/kmer1.82ns/kmer. I’m not quite sure why, but this causes the assembly to be one instruction longer and hence slightly slower. Confusion, again! Ok let’s remove the assertion again.  1 2 3 4 5 6   let mut v = vec![H::Out::default(); len]; -assert_eq!(v.len(), len); let mut it = self.hasher.hash_kmers(k, t); for i in 0..len { unsafe { v.as_mut_ptr().add(i).write(it.next().unwrap()) }; }  Code Snippet 46: Removing the assertion gets us where we wanted to be all this time: 1.71ns/kmer1.59ns/kmer. Finally the assembly is how I like it:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21   0.02 │7b0: movzbl 0x0(%rbp,%rsi,1),%ecx 10.28 │ mov (%r14,%rcx,8),%rdi 3.67 │ mov %r12d,%ecx 11.31 │ rol %cl,%rdi 3.13 │ rorx$0x3f,%rdx,%rdx │ xor %rdi,%rdx 0.02 │ movzbl -0x1(%rbx,%rsi,1),%ecx 10.94 │ xor (%r14,%rcx,8),%rdx 5.91 │ mov %rdx,0x8(%rax,%rsi,8) 0.02 │ movzbl 0x1(%rbp,%rsi,1),%ecx 0.02 │ mov (%r14,%rcx,8),%rdi 10.70 │ mov %r12d,%ecx 4.53 │ rol %cl,%rdi 11.34 │ rorx $0x3f,%rdx,%rdx 5.42 │ xor %rdi,%rdx 0.58 │ movzbl (%rbx,%rsi,1),%ecx 0.17 │ xor (%r14,%rcx,8),%rdx 18.47 │ mov %rdx,0x10(%rax,%rsi,8) 3.48 │ add$0x2,%rsi │ cmp %rsi,%r8 │ ↑ jne 7b0 
Code Snippet 47: Look at it! Isn't it wonderful? No more redundant branches and some loop-unrolling.

At this point I am rather confused, but at least the code looks good:

1. Why does removing the assertion make things faster?
2. Why is the loop using a pointer offset (Code Snippet 45) faster/different at all? Shouldn’t the compiler have the same information on both cases?
Takeaway

Adding assertions to ‘help’ the compiler can be counterproductive.

Takeaway

Instead of using v[i] or iterating over a vector, use v.as_ptr().add(i).23

## 5.4 Rolling a bit24 less

As we saw before, the rotation by $$k$$ positions takes two instructions: one to move $$k$$ into %cl, and one for the rotation itself. But all rotations are over exactly $$k$$, so in fact, we can precompute these:

 1 2 3 4 5 6 7 8 9   Some(NtHashIt { ... + rot_k: H_LOOKUP.map(|x| x.rotate_left(k as u32)), }) ... -self.fh = self.fh.rotate_left(1) ^ h(seqi).rotate_left(self.k as u32) ^ h(seqk); +self.fh = self.fh.rotate_left(1) ^ self.rot_k[seqi as usize] ^ h(seqk); 
Code Snippet 48: Precomputing the rotations by $$k$$ positions. Runtime: 1.59ns/kmer1.41ns/kmer.
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15   │570: movzbl 0x0(%rbp,%rax,1),%ecx 16.03 │ movzbl -0x1(%r8,%rax,1),%edx 0.05 │ rorx $0x3f,%r12,%rsi * 0.34 │ xor 0xa58(%rsp,%rcx,8),%rsi * 0.05 │ xor (%rbx,%rdx,8),%rsi * 32.71 │ mov %rsi,0x8(%r15,%rax,8) 0.02 │ movzbl 0x1(%rbp,%rax,1),%ecx 0.02 │ rorx$0x3f,%rsi,%r12 * 0.34 │ xor 0xa58(%rsp,%rcx,8),%r12 * 17.62 │ movzbl (%r8,%rax,1),%ecx 0.49 │ xor (%rbx,%rcx,8),%r12 * 32.02 │ mov %r12,0x10(%r15,%rax,8) 0.08 │ add $0x2,%rax │ cmp %rax,%rdi 0.14 │ ↑ jne 570  Code Snippet 49: The corresponding assembly goes from 21 to just 15 instructions for 2 loops. ### Analysing the assembly code At this point, we cannot hope for much fewer instructions. The 15 instructions are: • 1: increase iteration count by 2, • 2: compare and jump, • Then for each of the two unrolled iterations: • 2: load two characters, • 1: rotate the rolling hash, • 2: xor in the two characters, • 1: write back the result. At 1.41ns/kmer, this loop should take 1.41ns/kmer * 2.6GHz = 3.7 cycles/kmer or 7.3 cycles/loop, which corresponds to just over 2 instructions per cycle. Indeed, perf stat shows 2.04 IPC. I have marked the critical path of xor’ing the hashes with * s in the assembly code above. The three rorx, xor, xor instructions have a total latency of 1+2+2=5 cycles, which is actually higher than the 3.7 cyckes/kmer we measure. Maybe the ’load’ micro-up in the xor can actually be done in parallel before the result of the previous xor is done? Regardless, this is pretty impressive performance! Still, instead of doing x = (x^a)^b, it would be more efficient to do x = x^(a^b). But just parenthesizing the code like that doesn’t improve the assembly. Probably because that would require a separate load instruction, since xor’ing with two addresses from memory (as opposed to a memory address and a register) is not a thing. ## 5.5 Parallel it is As we saw above, the current ntHash implementation only uses around 2 IPC, while each core should be able to execute up to 4 instructions every cycle. Thus, let’s simply try to run two copies of our main loop in parallel:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20  fn hash_kmers(&mut self, k: usize, t: &[u8]) -> impl Iterator { let num_kmers = t.len() - k + 1; // For odd num_kmers, we skip the last kmer. let kmers_per_part = num_kmers / 2; let part_len = kmers_per_part + k - 1; let t0 = &t[..part_len]; let t1 = &t[kmers_per_part..kmers_per_part + part_len]; let mut v = vec![H::Out::default(); 2 * kmers_per_part]; let mut it0 = self.hasher1.hash_kmers(k, t0); let mut it1 = self.hasher2.hash_kmers(k, t1); for i in 0..kmers_per_part { unsafe { v.as_mut_ptr().add(i).write(it0.next().unwrap()); v.as_mut_ptr() .add(kmers_per_part + i) .write(it1.next().unwrap()); } } v.into_iter() }  Code Snippet 50: Running two copies of the main loop in parallel. Runtime: 1.41ns/kmer1.36ns/kmer. Sadly, this is not nearly the two times speedup I was hoping for, but at least we’re at 2.5 IPC and now, and 3.5 cycles/kmer, even closer to the 3 cycles/kmer lower bound from the xor dependency chain. (Even though now that we process two independent chains, that lower bound really is only 1.5 cycles/kmer. Probably the code is just fully occupying some ports.) The assembly code is:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18   │8a0:┌─→cmp %rcx,%r9 │ │↓ je 9ae 11.67 │ │ movzbl (%r15,%rcx,1),%edx 0.42 │ │ movzbl (%r8,%rcx,1),%esi 5.91 │ │ rorx$0x3f,%r14,%r14 *1 1.89 │ │ xor 0x1298(%rsp,%rdx,8),%r14 *1 14.32 │ │ xor (%r12,%rsi,8),%r14 *1 0.25 │ │ mov (%rsp),%rdx 19.19 │ │ mov %r14,0x8(%rdx,%rcx,8) 0.15 │ │ movzbl (%rdi,%rcx,1),%edx 5.85 │ │ rorx $0x3f,%rbx,%rbx *2 2.24 │ │ xor 0x268(%rsp,%rdx,8),%rbx *2 12.19 │ │ movzbl 0x0(%r13,%rcx,1),%edx 4.63 │ │ xor (%r12,%rdx,8),%rbx *2 15.21 │ │ mov %rbx,(%rax,%rcx,8) 0.03 │ │ inc %rcx │ ├──cmp %rcx,%r9 6.03 │ └──jne 8a0  Code Snippet 51: The assembly code for the parallelized version. But for some reason, there are two compares again! But this is just silly: At the end we compare %rcx with %r9, and if they are not equal, we continue with the next iteration. Then the first thing we do: again compare %rcx with %r9, and panic on the unwrap if they are equal. But we just checked they are not equal… Again, I’m at a loss why this it the case. Clearly there are two unwrap calls in our inner loop, but since they check exactly the same, there really is no need for this. Takeaway Compilers are weird and not to be trusted?25 ### More parallel I tried to parallellize more by doing 3 or 4 chunks at a time, both manually as above and by using const generics and iterating over [_; B] types, but they all perform worse at the moment, and generate very branchy code again. Probably the better way to avoid all these issues is by inlining and interleaving the ntHash code. Instead of doing a ‘high level’ parallellization where we make $$B$$ calls to it.next() in parallel, it’s more efficient to ‘internalize’ the parallellization, so that all the accounting overhead is done only once. Thus, we first try this:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24  impl<'a, const L: usize> Iterator for NtHashParIt<'a, L> { // Each iteration now returns an array of L hashes. type Item = [u64; L]; #[inline(always)] fn next(&mut self) -> Option<[u64; L]> { if self.current_idx == self.n { return None; }; if self.current_idx != 0 { let i = self.current_idx - 1; // Do L steps consecutively. self.fh = from_fn(|l| { let seqi = unsafe { *self.seq.get_unchecked(l * self.n + i) }; let seqk = unsafe { *self.seq.get_unchecked(l * self.n + i + self.k) }; self.fh[l].rotate_left(1) ^ self.rot_k[seqi as usize] ^ h(seqk) }); } self.current_idx += 1; Some(self.fh) } }  Code Snippet 52: Splitting the text in $$L$$ parts that are processed in parallel, but without SIMD. Takes 1.15ns/kmer for L=2 and 1.19ns/kmer for L=3, down from 1.36ns/kmer. This is already slightly faster than the version we had. Let’s try more granular parallellization:  1 2 3 4 5 6 7 8  -self.fh = from_fn(|l| { - let seqi = unsafe { *self.seq.get_unchecked(l * self.n + i) }; - let seqk = unsafe { *self.seq.get_unchecked(l * self.n + i + self.k) }; - self.fh[l].rotate_left(1) ^ self.rot_k[seqi as usize] ^ h(seqk) -}); +let seqi: [u8; L] = from_fn(|l| unsafe { *self.seq.get_unchecked(l * self.n + i) }); +let seqk: [u8; L] = from_fn(|l| unsafe { *self.seq.get_unchecked(l * self.n + i + self.k) }); +self.fh = from_fn(|l| self.fh[l].rotate_left(1) ^ self.rot_k[seqi[l] as usize] ^ h(seqk[l]));  Code Snippet 53: With even more granular parallelization, we get 1.14ns/kmer for L=2 and 1.10ns/kmer for L=3, down from 1.15ns/kmer. This latest parallel version with $$L=3$$ runs at 1.10ns/kmer and 2.7 IPC, or exactly 3 cycles/kmer. I’m still not quite sure what the main bottleneck is though. Anyway, from Agner Fog’s instruction tables and uops.info, I gather that all of movzbl, rorx, xor, and mov can do 2 per cycle, but that all instructions that read from memory have latency up to ~5 cycles, which sounds like it may be a bottleneck. But it’s hard to say. Note that all the methods above return a [u64; L] at a time. For now I’m just collecting those into a Vec<[u64; L]>. We could then either return an iterator over those blocks, or return a sequential iterator that traverses the entire vector L times. Or we could even transpose the data back to a Vec<u64>, but that would be somewhat slower. Assumption It’s fine to return out-of-order data. ## 5.6 Actual SIMD, at last The first SIMD version is basically the same as before when using [u64, L], but using Simd<u64, L> instead, with #![feature(portable_simd)] enabled. SIMD version of ntHash  1 2 3 4 5 6 7 8  let i = self.current_idx - 1; let seqi: Simd = from_fn(|l| unsafe { *self.seq.get_unchecked(l * self.n + i) as u64 }).into(); let seqk: Simd = from_fn(|l| unsafe { *self.seq.get_unchecked(l * self.n + i + self.k) as u64 }) .into(); let hi: Simd = from_fn(|l| self.rot_k[seqi[l] as usize]).into(); let hk: Simd = from_fn(|l| h(seqk[l] as u8)).into(); self.fh = ((self.fh << 1) | (self.fh >> 63)) ^ hi ^ hk;  Code Snippet 54: The SIMD version is quite similar to before. It's slightly slower at 1.14ns/kmer for L=4. The main difference is that in SIMD, there is no bit-rotate instruction26, so that instead we have to implement this using two shifts. Let’s have a quick look at the generated code.   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34   0.68 │c50:┌─→cmp %r9,%rax 0.04 │ │↓ je d46 │ │ movzbl (%rsi,%r9,1),%r10d ; read char from seq 0.04 │ │ vmovq 0x940(%rsp,%r10,8),%xmm1 ; get hash of char into xmm1 128-bit register 10.70 │ │ movzbl (%rdx,%r9,1),%r10d 0.51 │ │ vmovq 0x940(%rsp,%r10,8),%xmm2 │ │ movzbl (%rcx,%r9,1),%r10d 0.16 │ │ vmovq 0x940(%rsp,%r10,8),%xmm3 9.83 │ │ movzbl (%r12,%r9,1),%r10d 0.58 │ │ vmovq 0x940(%rsp,%r10,8),%xmm4 0.45 │ │ movzbl (%r14,%r9,1),%r10d 0.16 │ │ vmovq 0x0(%r13,%r10,8),%xmm5 10.32 │ │ movzbl (%r15,%r9,1),%r10d 0.70 │ │ vmovq 0x0(%r13,%r10,8),%xmm6 0.30 │ │ movzbl (%r11,%r9,1),%r10d 0.14 │ │ vmovq 0x0(%r13,%r10,8),%xmm7 11.43 │ │ movzbl (%rbx,%r9,1),%r10d 0.86 │ │ vmovq 0x0(%r13,%r10,8),%xmm8 0.02 │ │ vpunpcklqdq %xmm1,%xmm2,%xmm1 ; merge xmm1 and xmm2 0.12 │ │ vpunpcklqdq %xmm3,%xmm4,%xmm2 ; merge xmm3 and xmm4 10.25 │ │ vinserti128$0x1,%xmm1,%ymm2,%ymm1 ; merge xmm1 and xmm2 into ymm1 256-bit register 0.72 │ │ vpunpcklqdq %xmm5,%xmm6,%xmm2 ;\ 0.21 │ │ vpunpcklqdq %xmm7,%xmm8,%xmm3 ; merge xmm5-8 into ymm2 1.45 │ │ vinserti128 $0x1,%xmm2,%ymm3,%ymm2 ;/ 16.62 │ │ vpxor %ymm1,%ymm2,%ymm1 ; compute the 'delta': ymm1 ^ ymm2 0.51 │ │ vpsrlq$0x3f,%ymm0,%ymm2 ; shift right │ │ vpaddq %ymm0,%ymm0,%ymm0 ; shift left = *2 = ymm0 + ymm0 │ │ vpor %ymm2,%ymm0,%ymm0 ; or the two results 10.66 │ │ vpxor %ymm0,%ymm1,%ymm0 ; xor in the 'delta' term 2.55 │ │ vmovdqu %ymm0,(%rdi) ; write to memory │ │ inc %r9 │ │ add $0x20,%rdi 0.02 │ ├──cmp %r9,%r8 9.94 │ └──jne c50  Code Snippet 55: Assembly for the SIMD ntHash. Still lots of memory accesses and many instructions spent on packing the SIMD registers. This looks rather terrible! There are many instructions to read the sequence characters and get their hashes, and then a number of instructions to merge everything into SIMD registers. The final rotate and xor operations are optimized, but they make up only a small part of the total runtime like this. ### SIMD table lookups So far, we have been assuming that the text can be anything. But actually, ntHash already is only ‘seeded’ for DNA characters ACGT, and so the [u64; 256] table really only contains 4 interesting entries. So we could just use a [u64; 4]. Assumption From now on, we assume that the text is a &[u8] containing only values in 0..=3. That’s exactly the size of a single SIMD register. Indeed, we can store them in a single SIMD register, and use a shuffle instruction to extract them. Or at least that’s what we would like to do. Sadly, there is no AVX227 dynamic28 cross-128bit lane29 64-bit30 shuffle available, but if we give up any one of these constraints we’re good. Thus, we can implement the 64-bit shuffle using the 32-bit shuffle. When we want to look up the 64-bit hash of a character with value/index c, now instead we look up the two 32-bit parts independently, which will be at indices 2c and 2c+1. But instead of implementing this, let’s do something simpler31. ### 32-bit hashes Everything becomes much simpler if we use 32-bit hashes directly! There really is no reason we need the full 64-bit hashes to select the minimum hash of a window of size $$w$$ that is rarely more than $$100$$. And since also the 4 values in the dictionary now fit in 128 bits, so that we do not need cross 128-bit lane shuffles anymore at all! So we can now exactly use the _mm256_permutevar_ps instruction, which even has 1 cycle latency. And also, we can now compute 8 hashes at a time instead of 4!32 Somewhat annoyingly there is no _epi32 integer version of this, so we’ll have to cast our data to 32-bit floats and back to make the type checker happy. Assumption We assume that 32-bit ntHashes are sufficient. The lower periodicity causes some collisions for $$k> 32$$, but they will still be rare and the number of k-mers in a window $$w$$ that we’re minimizing over is small anyway. The new code looks like this:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20  // Types const L: usize = 8; type S = Simd; ... // Initialization of 8x32 'tables', // containing a copy of the 4 32-bit values in both the high and low 128 bits. let table: S = b"ACGTACGT".map(|c| h(c) as u32).into(); let table_rot_k: S = b"ACGTACGT".map(|c| h(c).rotate_left(k as u32) as u32).into(); ... // Wrap _mm256_permutevar_ps, which is typed for floats, with integer types. use std::mem::transmute as t; let permutevar_epi32 = |table: S, idx: S| t(_mm256_permutevar_ps(t(table), t(idx))); // Look up the character values. let hi: S = permutevar_epi32(self.table_rot_k, seqi); let hk: S = permutevar_epi32(self.table, seqk); // Update the rolling hash. // There is no SIMD rotate in AVX2 so we do it manually. self.fh = ((self.fh << 1) | (self.fh >> 31)) ^ hi ^ hk;  Highlights of using SIMD permute to look up the hash values of characters. Runs in 1.66 ns/kmer, not yet close to our best of 1.10 ns/kmer. In the assembly, we see 16 copies of the code below, an shift-in and shift-out character for each of the 8 lanes.  1 2 3  mov 0x20(%rsp),%rbp ; read the offset in the sequence of current lane movzbl 0x0(%rbp,%rdi,1),%ebp ; read the character vpinsrd$0x2,%ebp,%xmm4,%xmm4 ; insert the read character into a SIMD register 
Code Snippet 57: Reading a single character from the sequence into a SIMD register.

### Shared offsets

Instead of storing all offsets on the stack, we can also put them all in a single SIMD register.

  1 2 3 4 5 6 7 8 9 10 11 12 13 14   // initialization Some(Self { ... + offsets: from_fn(|l| (l * n) as T).into(), }) ... // loop let s_ptr = self.seq.as_ptr(); -let seqi: S = from_fn(|l| *s_ptr.add(l * self.n + i) as T).into(); -let seqk: S = from_fn(|l| *s_ptr.add(l * self.n + i + self.k) as T).into(); +let oi = self.offsets + S::splat(i as T); +let ok = oi + S::splat(self.k as T); +let seqi = oi.as_array().map(|o| *s_ptr.add(o as usize) as T).into(); +let seqk = ok.as_array().map(|o| *s_ptr.add(o as usize) as T).into(); 
Code Snippet 58: Using a single SIMD register with offsets. It's much slower.

In fact, this turns out worse. Now the indices are computed in SIMD, but then have to be extracted anyway before any lookups are done. I was expecting a gather instruction to be generated here, but maybe the ‘zero extend’ of reading a u8 into a u32 is preventing that.

## 5.7 SIMD: The Gathering

### Gathering 4 characters at a time

But that gives a new idea! If we’re anyway reading from a memory location, we may as well directly read of full u32 worth of memory instead of just a u8. That could reduce the time spent on reading 4-fold. Thus, we make a read every fourth iteration, and then buffer the values.

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23  pub struct NtHashSimdIt<'a> { ... // New buffer fields in the struct. chars_i: S, chars_k: S, } ... // New main loop. if i % 4 == 0 { let p = self.seq.as_ptr() as *const u32; let oi = self.offsets + S::splat(i as u32); let ok = oi + S::splat(self.k as u32); // Read a u32 at given byte offsets. // Assumes little-endian. self.chars_i = oi.as_array().map(|o| *p.byte_offset(o as isize)).into(); self.chars_k = ok.as_array().map(|o| *p.byte_offset(o as isize)).into(); } // Extract the last 2 bits of each character. let seqi = self.chars_i & S::splat(0x03); let seqk = self.chars_k & S::splat(0x03); // Shift remaining characters in the buffer to the right. self.chars_i >>= S::splat(8); self.chars_k >>= S::splat(8); 
Reading a u32 every fourth iteration. Still runs in 1.65 ns/kmer, like Code Snippet 56.

The assembly looks very interesting now:

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46   0.04 │1450:┌─→vpsrld $0x1f,%ymm3,%ymm7 0.73 │ │ │ vpaddd %ymm3,%ymm3,%ymm3 0.89 │ │ │ vpor %ymm7,%ymm3,%ymm3 1.39 │ │ │ vpbroadcastd -0x544de(%rip),%ymm7 ; Splat [3,3,3,3] for the mask 0.05 │ │ │ vpand %ymm7,%ymm6,%ymm8 ; mask low 2 bits 0.90 │ │ │ vpermilps %ymm8,%ymm2,%ymm8 ; table lookup 1 0.78 │ │ │ vpxor %ymm3,%ymm8,%ymm3 ; xor in the results 1.71 │ │ │ vpand %ymm7,%ymm5,%ymm7 ; mask low 2 bits 0.03 │ │ │ vpermilps %ymm7,%ymm4,%ymm7 ; table lookup 2 2.23 │ │ │ vpxor %ymm7,%ymm3,%ymm3 ; xor in the result 0.01 │ │ │ vpsrld$0x8,%ymm5,%ymm5 ; shift remaining characters 1.34 │ │ │ vpsrld $0x8,%ymm6,%ymm6 ; shift remaining characters 0.01 │ │ │ vmovdqu %ymm3,(%rax) 1.42 │ │ │ add$0x20,%rax 0.01 │ │ │ inc %rdx 1.21 │ │ │ cmp %rdx,%rcx │ │ │↑ je 12de 0.01 │149e:│ cmp %rdx,%rdi │ │ │↓ je 152b 0.03 │ │ │ test $0x3,%dl 1.26 │ └──│─ jne 1450 ; loop when i%4 != 0 0.01 │ │ vmovd %edx,%xmm5 ; from here on: read sequence 0.73 │ │ vpbroadcastd %xmm5,%ymm5 │ │ vpaddd %ymm0,%ymm5,%ymm5 0.09 │ │ vpmovzxdq %xmm5,%ymm6 0.01 │ │ vextracti128$0x1,%ymm5,%xmm7 0.89 │ │ vpmovzxdq %xmm7,%ymm7 │ │ vpcmpeqd %xmm8,%xmm8,%xmm8 0.09 │ │ vpxor %xmm9,%xmm9,%xmm9 *21.19*│ │ vpgatherqd %xmm8,(%r12,%ymm7,1),%xmm9 ; slow gather instruction 0.64 │ │ vpcmpeqd %xmm7,%xmm7,%xmm7 │ │ vpxor %xmm8,%xmm8,%xmm8 *19.20*│ │ vpgatherqd %xmm7,(%r12,%ymm6,1),%xmm8 ; slow gather instruction 0.65 │ │ vpaddd %ymm1,%ymm5,%ymm5 │ │ vextracti128 $0x1,%ymm5,%xmm6 │ │ vpmovzxdq %xmm6,%ymm6 │ │ vpcmpeqd %xmm7,%xmm7,%xmm7 0.83 │ │ vpxor %xmm10,%xmm10,%xmm10 *20.09*│ │ vpgatherqd %xmm7,(%r12,%ymm6,1),%xmm10 ; slow gather instruction 0.61 │ │ vinserti128$0x1,%xmm9,%ymm8,%ymm6 │ │ vpmovzxdq %xmm5,%ymm5 │ │ vpcmpeqd %xmm7,%xmm7,%xmm7 │ │ vpxor %xmm8,%xmm8,%xmm8 *20.23*│ │ vpgatherqd %xmm7,(%r12,%ymm5,1),%xmm8 ; slow gather instruction 0.65 │ │ vinserti128 $0x1,%xmm10,%ymm8,%ymm5 │ └──jmp 1450  Code Snippet 60: Assembly code for gathering a u32 every fourth iteration. The 4 gather (vpgatherqd) instructions together take 80% of the time! The first very clear observation is that even though we only gather every fourth iteration, those 4 instructions take up 80% of the time! Clearly, getting the data from memory into the SIMD registers is the bottleneck here. What is somewhat surprising is that line 15 and 16 of Code Snippet 59 are each turned into two gather instructions. As it turns out, there are two very similar gather instructions: • _mm256_i64father_epi32, which takes a 256-bit SIMD register with four 64-bit indices and returns a 128-bit SIMD register with four 32-bit values. • _mm256_i32father_epi32, which takes a 256-bit SIMD register with eight 32-bit indices and returns a 256-bit SIMD register with eight 32-bit values. It would be slightly nicer to use the second version, but in the end I suspect it wouldn’t matter much for performance, since the same number of entries is read anyway. ### Gathering 8 characters at a time At this point, it is very clear that the gathering is the bottleneck. Since it is already split into two instructions anyway, we may as well fetch a u64 at a time. We then have to do some shuffling to those into two SIMD registers of u32 values, one with the first 32 bits of each u64, and one with the second 32 bits. Then after 4 iterations, we can cheaply swap in the remaining 32 bits.   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40  pub struct NtHashSimdIt<'a> { offsets: Simd<*const u8, 4>, offsets_next: Simd<*const u8, 4>, chars_i: Simd, chars_k: Simd, chars_i_next: Simd, chars_k_next: Simd, } ... // In main loop: // Every 8th iteration, read a u64 of incoming and outgoing data for each of the 8 chunks. if i % 8 == 0 { let oi1 = self.offsets.wrapping_add(Simd::splat(i)); let oi2 = self.offsets_next.wrapping_add(Simd::splat(i)); let ok1 = oi1.wrapping_add(Simd::splat(self.k)); let ok2 = oi2.wrapping_add(Simd::splat(self.k)); // i1, k1: The next 64bits for the first 4 chunks. // i2, k2: The next 64bits for the second 4 chunks. // The inner transmute makes the *u8 into a *u32, which assumes little-endian. // The outer transmute makes the [u64, 4] into a [u32, 8]. let chars_i1: Simd = transmute(Simd::::gather_ptr(transmute(oi1))); let chars_i2: Simd = transmute(Simd::::gather_ptr(transmute(oi2))); let chars_k1: Simd = transmute(Simd::::gather_ptr(transmute(ok1))); let chars_k2: Simd = transmute(Simd::::gather_ptr(transmute(ok2))); // The upcoming 32 bits for each chunk. // The 32 bits after that for each chunk. (self.chars_i, self.chars_i_next) = chars_i1.deinterleave(chars_i2); (self.chars_k, self.chars_k_next) = chars_k1.deinterleave(chars_k2); } // After 4 iterations, swap in the next 32 bits. if i % 8 == 4 { self.chars_i = self.chars_i_next; self.chars_k = self.chars_k_next; }  Code Snippet 61: Gathering a full u64 at a time requires slightly more messy code, but now runs in 1.05 ns/kmer! Finally improving the 1.10 ns/kmer non-SIMD best so far. perf stat now shows that the total time on gather instructions has gone down from 80% to 67%. Still quite some room for improvement if we could make the input faster. The deinterleave is translated to some vshufps (latency 1) and vpermpd (latency 3) instructions as follows:  1 2 3 4 5 6 7 8  *15.25*│ vpgatherqq %ymm12,0x0(,%ymm10,1),%ymm11 ; gather to yxx11 ... *17.27*│ vpgatherqq %ymm12,0x0(,%ymm7,1),%ymm10 ; gather to yxx10 ... 0.70 │ vshufps$0x88,%ymm10,%ymm11,%ymm9 ; extract low 32 bits 0.18 │ vpermpd $0xd8,%ymm9,%ymm9 ; shuffle │ vshufps$0xdd,%ymm10,%ymm11,%ymm10 ; extract high 32 bits 0.07 │ vpermpd $0xd8,%ymm10,%ymm10 ; shuffle  Code Snippet 62: Snippet of the generated assembly for the deinterleave after two gather instructions. ### Gathering 32 characters at a time Clearly, streaming the input is still a limitation. But we have one last trick available. We already assume that the input takes 2-bit values in 0..=3, so that of each byte we read, 6 bits are wasted. Instead, let’s assume that the input is already bit-packed! Assumption The text is 2-bit encoded and bit-packed. For now I’ll just keep using the same &[u8] type for the text, but in the final API it would be cleaner to create a wrapper type. The code is mostly similar to before, with the change that we now shift out only 2 bits instead of 8 at the end of every iteration. One more change is that we can’t simply read data at every 2-bit offset. Specifically, we can’t necessarily read a u64 at position i and a u64 at position i+k at the same time. Thus, we separate the updates for the start and end positions of the window. We should then also initialize self.chars_k with the leading bits.   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19  let read = |i| { let oi1 = self.offsets.wrapping_add(Simd::splat(i)); let oi2 = self.offsets_next.wrapping_add(Simd::splat(i)); let chars_i1: Simd = transmute(Simd::::gather_ptr(transmute(oi1))); let chars_i2: Simd = transmute(Simd::::gather_ptr(transmute(oi2))); chars_i1.deinterleave(chars_i2) }; if i % 32 == 0 { (self.chars_i, self.chars_i_next) = read(i); } if i % 32 == 16 { self.chars_i = self.chars_i_next; } if (i + self.k) % 32 == 0 { (self.chars_k, self.chars_k_next) = read(i + self.k); } if (i + self.k) % 32 == 16 { self.chars_k = self.chars_k_next; }  Code Snippet 63: When using bit-packed data, we must read the position-i and position-i+k data separately. Runtime goes from 1.05 ns/kmer to 0.17 ns/kmer. Yes, you read that right, 0.17 ns/kmer :) That’s 0.44 cycles/kmer, and way faster than anything we had before. Bug It turns out I made some bugs here. First, the array indexing in the code above should use index i/4 instead of i. Second, and more importantly, for an input of $$n$$ bytes, I was only doing $$n$$ instead of $$4n$$ iterations. Thus, all numbers reported from here onwards are roughly 4 times slower in practice. I will update them once I have some proper tests in place. The full code for SIMD-based hashing of bitpacked data is here.   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154  const L: usize = 8; type T = u32; type S = Simd; pub struct NtHashSimd; impl ParHasher for NtHashSimd { type Out = T; #[inline(always)] fn hash_kmers(&mut self, k: usize, t: &[u8]) -> impl Iterator { if SMALL_K { assert!(k <= 32); } NtHashPackedSimdIt::::new(t, k) .unwrap() .map(|x| x.into()) } } #[derive(Debug)] pub struct NtHashPackedSimdIt<'a, const SMALL_K: bool> { seq: &'a [u8], n: usize, k: usize, fh: S, current_idx: usize, table: S, table_rot_k: S, offsets: Simd<*const u8, 4>, offsets_next: Simd<*const u8, 4>, chars_i: S, chars_i_next: S, chars_k: S, chars_k_next: S, chars_k_copy: S, chars_k_next_copy: S, } impl<'a, const SMALL_K: bool> NtHashPackedSimdIt<'a, SMALL_K> { /// Creates a new NtHashIt with internal state properly initialized. #[inline(always)] pub fn new(seq: &'a [u8], k: usize) -> Option { if k > seq.len() { return None; } if k > MAXIMUM_K_SIZE { return None; } if seq.len() > u32::MAX as _ { return None; } let num_kmers = seq.len() - k + 1; if num_kmers % L != 0 { return None; } let n = num_kmers / L; // Each 128-bit half has a copy of the 4 32-bit hashes. let table: S = b"ACTGACTG".map(|c| h(c) as T).into(); let table_rot_k: S = b"ACTGACTG".map(|c| h(c).rotate_left(k as u32) as T).into(); let fh = from_fn(|l| { let mut fh = 0; for (i, v) in seq[l * n..l * n + k].iter().enumerate() { fh ^= (h(*v) as T).rotate_left((k - i - 1) as u32); } fh }) .into(); Some(Self { seq, n, k, fh, current_idx: 0, table, table_rot_k, offsets: from_fn(|l| unsafe { seq.as_ptr().add(l * n) }).into(), offsets_next: from_fn(|l| unsafe { seq.as_ptr().add((4 + l) * n) }).into(), chars_i: S::splat(0), chars_i_next: S::splat(0), // TODO: properly initialize the first (-k)%32 characters of chars_k. chars_k: S::splat(0), chars_k_next: S::splat(0), chars_k_copy: S::splat(0), chars_k_next_copy: S::splat(0), }) } } impl<'a, const SMALL_K: bool> Iterator for NtHashPackedSimdIt<'a, SMALL_K> { type Item = S; #[inline(always)] fn next(&mut self) -> Option { if self.current_idx == self.n { return None; }; if self.current_idx != 0 { let i = self.current_idx - 1; unsafe { let read = |i| { let oi1 = self.offsets.wrapping_add(Simd::splat(i)); let oi2 = self.offsets_next.wrapping_add(Simd::splat(i)); let chars_i1: Simd = transmute(Simd::::gather_ptr(transmute(oi1))); let chars_i2: Simd = transmute(Simd::::gather_ptr(transmute(oi2))); chars_i1.deinterleave(chars_i2) }; if i % 16 == 0 { if i % 32 == 0 { if SMALL_K { self.chars_i = self.chars_k_copy; self.chars_i_next = self.chars_k_next_copy; } else { (self.chars_i, self.chars_i_next) = read(i); } } else { self.chars_i = self.chars_i_next; } } if (i + self.k) % 16 == 0 { if (i + self.k) % 32 == 0 { (self.chars_k, self.chars_k_next) = read(i + self.k); self.chars_k_copy = self.chars_k; self.chars_k_next_copy = self.chars_k_next; } else { self.chars_k = self.chars_k_next; } } // Extract the last 2 bits of each character. let seqi = self.chars_i & S::splat(0x03); let seqk = self.chars_k & S::splat(0x03); // Shift remaining characters to the right. self.chars_i >>= S::splat(2); self.chars_k >>= S::splat(2); use std::mem::transmute; let permutevar_epi32 = |a: S, b: S| transmute(_mm256_permutevar_ps(transmute(a), transmute(b))); let hi: S = permutevar_epi32(self.table_rot_k, seqi); let hk: S = permutevar_epi32(self.table, seqk); self.fh = ((self.fh << 1) | (self.fh >> 31)) ^ hi ^ hk; } } self.current_idx += 1; Some(self.fh) } }  Code Snippet 64: Full source of the SIMD ntHash with bit-packed data. The four gather instructions only take around 30% of the total time now, and thus aren’t nearly as much of a bottleneck anymore. At this point, I thought we were done and not much further improvement would be possible. But I was wrong. ### Reusing the gathers Each character is currently read twice: once when it enters the sliding window, and once when it exits. It would be more efficient to reuse the first read instead of re-reading from memory. When $$k\geq 32$$, this is actually quite straightforward: At some point $$(i+k) \bmod 32 = 0$$, and the next 32 characters/64 bits of data are read at position $$i+k$$. Then $$k$$ iterations later, $$i$$ itself is equal to that same value, and exactly the same data is read. Thus, we can buffer the data read for position $$i+k$$ and use it the next time $$i\bmod 32 = 0$$. But note that this only works when $$k\leq 32$$. For larger $$k$$. For larger $$k$$, it could still be done but needs a larger buffer.   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27   pub struct NtHashPackedSimdIt<'a> { ... + chars_k_copy: S, + chars_k_next_copy: S, } ... if i % 32 == 0 { + if self.k <= 32 { + self.chars_i = self.chars_k_copy; + self.chars_i_next = self.chars_k_next_copy; + } else { (self.chars_i, self.chars_i_next) = read(i); + } } if i % 32 == 16 { self.chars_i = self.chars_i_next; } if (i + self.k) % 32 == 0 { (self.chars_k, self.chars_k_next) = read(i + self.k); + self.chars_k_copy = self.chars_k; + self.chars_k_next_copy = self.chars_k_next; } if (i + self.k) % 32 == 16 { self.chars_k = self.chars_k_next; }  Code Snippet 65: Reusing the gather at position $$i+k$$ for position $$i$$ when $$k\leq 32$$. Runtime goes from 0.17 ns/kmer to 0.155 ns/kmer. In each iteration we now check whether $$i\bmod 32 \equiv 0$$ and $$i\bmod 32\equiv 16$$. It’s more efficient to first do a single check whether $$i\bmod 16\equiv 0$$, and only then check which of the two cases it is, since that saves a compare and jump in the main loop. That brings the runtime down to 0.145 ns/kmer, or precisely 3 clock cycles per 8 hashes. One drawback of the above is that when $$k$$ is indeed small, we now have a useless if statement and a bunch of unnecessary code for read(i) is generated. So let’s put this condition behind a generic:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  -pub struct NtHashSimd; +pub struct NtHashSimd; -pub struct NtHashPackedSimdIt<'a> { .. } +pub struct NtHashPackedSimdIt<'a, const SMALL_K: bool> { .. } fn next(&mut self) -> Option { ... if i % 16 == 0 { if i % 32 == 0 { - if self.k <= 32 { + if SMALL_K { self.chars_i = self.chars_k_copy; self.chars_i_next = self.chars_k_next_copy; } else { (self.chars_i, self.chars_i_next) = read(i); } } } ... }  Code Snippet 66: Checking whether $$k\leq 32$$ ahead of time saves a branch.chars_i= and self.chars_k variables in registers again. Runtime goes from 0.145 ns/kmer to 0.135 ns/kmer. At this point, perf stat shows that there are nearly no branch misses, and 2.43 IPC, which is good but actually not that close to 4. Again this makes me wonder what the actual bottleneck is. Remember 5.5? Where we processed two chunks in parallel? Maybe we should eventually try that again. ## 5.8 Fixing the benchmark At this point, perf record shows:  1 2  43.89% bench-3dc28953a bench-3dc28953a01a51f9 as ParHasher<_>>::hash_kmers 41.57% bench-3dc28953a libc.so.6 0x0000000000167f4a  Code Snippet 67: perf record shows that only just over half the time is actually spent in hash_kmers. Surely, this other part has something to do with either allocating or dropping the vector to which we collect. Let’s fix that by reusing the vector between iterations. Thus, we now make all our Hasher instances statefull to give them the opportunity to reuse their allocated memory between iterations. This is what requires e.g. Hasher::hash_kmers to take a &mut self: so that it can use and modify this state. The new caching hasher looks like this:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38  #[derive(Debug)] pub struct BufferParCached> { hasher: H, /// A vector that can be reused between iterations. v: Vec<[H::Out; L]>, } impl> BufferParCached { pub fn new(hasher: H) -> Self { Self { hasher, v: vec![] } } } impl> ParHasher for BufferParCached where H::Out: Default + Copy, { type Out = H::Out; fn hash_kmers(&mut self, k: usize, t: &[u8]) -> impl Iterator { let mut len = t.len() - k + 1; let t = &t[..t.len() - len % L]; len -= len % L; let n = len / L; // Resize the vector to the right size. self.v.resize(n, [Self::Out::default(); L]); let mut it = self.hasher.hash_kmers(k, t); for i in 0..n { let hs = it.next().unwrap(); for j in 0..L { unsafe { self.v.get_unchecked_mut(i)[j] = hs[j] }; } } // We can't 'into_iter' our owned vector, so we hand out copies to the elements instead. self.v.iter().copied() } }  Code Snippet 68: The new Cached hasher reuses the allocated buffer between usages. Runtime goes from 0.135 ns/kmer to 0.102 ns/kmer. Now, perf again shows that over 90% of time is spent in the hash_kmers function, and indeed the runtime improves significantly: 0.102 ns/kmer!33 And also, we’re at exactly 3 IPC now, which at last seems quite reasonable. Takeaway We’ve gone from 2.9 ns/kmer for the nthash crate to 0.102 ns/kmer by using 32-bit hashes, a bit-packed input, and by using SIMD to parallellize over chunks of the input. 28x speedup! Note that instead of returning a vector, we could also benchmark just the time to compute all the results, e.g. by summing all the output hashes. That only takes 0.90 ns/kmer. But for consistency, it is simpler to always require an actual vector as output. ### One last branch At this point, there is also still one redundant branch due to the unwrap when iterating over the vector.  1 2 3 4 5 6  for i in 0..n { let hs = it.next().unwrap(); for j in 0..L { unsafe { self.v.get_unchecked_mut(i)[j] = hs[j] }; } }  Code Snippet 69: The it.next().unwrap() causes a redundant branch that's never taken. I tried replacing it by unwrap_unchecked, but that creates slower code. I also tried removing the case that returns None completely, which does remove branch, but anyway the runtime of the code is the same, most likely since parsing the instructions isn’t the bottleneck anyway. ## 5.9 Analysis: Machine code analysis It would be great to know what the actual bottleneck of the code is now. One way to (try to) find out is by using LLVM’s Machine Code Analyzer (LLVM-mca). First, we must determine the loop to analyse. For simplicity, let’s exclude the input gathering, and only focus on the nthash computation, that covers around 70% of the time: Assembly code of the hot loop.   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32   5.89 │1530:┌─→vmovdqa 0x1a0(%rsp),%ymm6 0.10 │1539:│ vmovdqa 0x160(%rsp),%ymm7 4.07 │ │ vpsrld$0x2,%ymm7,%ymm8 1.60 │ │ vmovdqa %ymm8,0x160(%rsp) 13.26 │ │ vpsrld $0x2,%ymm6,%ymm8 0.03 │ │ vmovdqa %ymm8,0x1a0(%rsp) 2.31 │ │ vpsrld$0x1f,%ymm2,%ymm8 0.53 │ │ vpaddd %ymm2,%ymm2,%ymm2 7.44 │ │ vpor %ymm2,%ymm8,%ymm2 0.06 │ │ vpbroadcastd -0x5b12c(%rip),%ymm8 # a8938 <.LCPI27_11> 2.14 │ │ vpand %ymm7,%ymm8,%ymm7 0.41 │ │ vpermilps %ymm7,%ymm0,%ymm7 8.60 │ │ vpxor %ymm7,%ymm2,%ymm2 0.07 │ │ vpand %ymm6,%ymm8,%ymm6 2.21 │ │ vpermilps %ymm6,%ymm3,%ymm6 2.13 │ │ vpxor %ymm6,%ymm2,%ymm2 │ │unsafe { v.get_unchecked_mut(i)[j] = hs[j] }; 9.16 │ │ vmovdqu %ymm2,(%rax) 0.01 │ │ inc %rdx 1.89 │ │ add $0x20,%rax 1.82 │ │ cmp %rdx,%rcx │ │↑ je 13e9 │ │if self.current_idx == self.n { 6.34 │15a2:│ cmp %rdx,%r12 │ │↓ je 16f2 │ │if i % 16 == 0 { 0.06 │ │ test$0xf,%dl │ │↓ je 15d0 │ │if (i + self.k) % 16 == 0 { 1.76 │ │ lea (%r8,%rdx,1),%rsi 0.10 │ ├──test $0xf,%sil 1.95 │ └──jne 1530  Code Snippet 70: The current hot loop. We can use carg-show-asm to conveniently extract the assembly code. First, we mark the hash_kmers function as #[inline(never)] so that we can find its function. Then we can get the assembly code of it using:  1  cargo asm --bench bench hash_kmers --att  Code Snippet 71: cargo asm invocation to get the AT&T style assembly for the hash_kmers function. This outputs the full function, including setup and cleanup. We manually copy the assembly corresponding to the inner loop to a separate file. Extracted assembly code   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27  vmovdqa 608(%rsp), %ymm7 vmovdqa 544(%rsp), %ymm8 vpsrld$2, %ymm8, %ymm9 vmovdqa %ymm9, 544(%rsp) vpsrld $2, %ymm7, %ymm9 vmovdqa %ymm9, 608(%rsp) vpand %ymm2, %ymm8, %ymm8 vpermilps %ymm8, %ymm3, %ymm8 vpand %ymm2, %ymm7, %ymm7 vpermilps %ymm7, %ymm4, %ymm7 vxorps %ymm7, %ymm8, %ymm7 vpsrld$31, %ymm0, %ymm8 vpaddd %ymm0, %ymm0, %ymm0 vpor %ymm0, %ymm8, %ymm0 vxorps %ymm0, %ymm7, %ymm0 vmovups %ymm0, (%rdx) incq %rsi addq $32, %rdx cmpq %rsi, %rcx je .LBB116_34 cmpq %rsi, %rax je .LBB116_35 testb$15, %sil je .LBB116_23 leaq 21(%rsi), %rdi testb $15, %dil jne .LBB116_30  Code Snippet 72: The extracted assembly code has some minor differences in ordering from the one before, but it mostly the same. Then, we can run LLVM-mca on this code: llvm-mca < assembly, optionally with some flags such as --all-stats, --all-view, --bottleneck-analysis, and --timeline. I highly encourage you to run it yourself and have a closer look at the output. First, the output contains some high level statistics:  1 2 3 4 5 6 7 8 9  Iterations: 100 Instructions: 2700 Total Cycles: 558 Total uOps: 3000 Dispatch Width: 6 uOps Per Cycle: 5.38 IPC: 4.84 Block RThroughput: 5.0  Code Snippet 73: According to LLVM-mca, the max inverse throughput is 5 cycles per loop. We see that the it reports the maximum throughput of this block at one iteration every 5 cycles. But our benchmark already showed that we can do 1 iteration every 3 cycles, so I’m confused here. Also the dispatch width of 6 and 4.84 IPC seem unreasonably high, but maybe I was using the wrong numbers for my CPU (a i7 10750H). Then, we get a table of latency and throughput values. Latency and throughput table   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35  [1]: #uOps [2]: Latency [3]: RThroughput [4]: MayLoad [5]: MayStore [6]: HasSideEffects (U) [1] [2] [3] [4] [5] [6] Instructions: 1 7 0.50 * vmovdqa 608(%rsp), %ymm7 1 7 0.50 * vmovdqa 544(%rsp), %ymm8 1 1 0.50 vpsrld$2, %ymm8, %ymm9 2 1 1.00 * vmovdqa %ymm9, 544(%rsp) 1 1 0.50 vpsrld $2, %ymm7, %ymm9 2 1 1.00 * vmovdqa %ymm9, 608(%rsp) 1 1 0.33 vpand %ymm2, %ymm8, %ymm8 1 1 1.00 vpermilps %ymm8, %ymm3, %ymm8 1 1 0.33 vpand %ymm2, %ymm7, %ymm7 1 1 1.00 vpermilps %ymm7, %ymm4, %ymm7 1 1 0.33 vxorps %ymm7, %ymm8, %ymm7 1 1 0.50 vpsrld$31, %ymm0, %ymm8 1 1 0.33 vpaddd %ymm0, %ymm0, %ymm0 1 1 0.33 vpor %ymm0, %ymm8, %ymm0 1 1 0.33 vxorps %ymm0, %ymm7, %ymm0 2 1 1.00 * vmovups %ymm0, (%rdx) 1 1 0.25 incq %rsi 1 1 0.25 addq $32, %rdx 1 1 0.25 cmpq %rsi, %rcx 1 1 0.50 je .LBB116_34 1 1 0.25 cmpq %rsi, %rax 1 1 0.50 je .LBB116_35 1 1 0.25 testb$15, %sil 1 1 0.50 je .LBB116_23 1 1 0.50 leaq 21(%rsi), %rdi 1 1 0.25 testb $15, %dil 1 1 0.50 jne .LBB116_30  Code Snippet 74: MicroOps, latency, and throughput numbers. And lastly, we see which ports every instruction can execute on. The top line is a summary of total usage of each port. Port usage table   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46  Resources: [0] - SKLDivider [1] - SKLFPDivider [2] - SKLPort0 [3] - SKLPort1 [4] - SKLPort2 [5] - SKLPort3 [6] - SKLPort4 [7] - SKLPort5 [8] - SKLPort6 [9] - SKLPort7 Resource pressure per iteration: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] - - 5.51 5.49 1.67 1.67 3.00 5.50 5.50 1.66 Resource pressure by instruction: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] Instructions: - - - - 0.33 0.67 - - - - vmovdqa 608(%rsp), %ymm7 - - - - 0.67 0.33 - - - - vmovdqa 544(%rsp), %ymm8 - - 0.05 0.95 - - - - - - vpsrld$2, %ymm8, %ymm9 - - - - 0.31 0.32 1.00 - - 0.37 vmovdqa %ymm9, 544(%rsp) - - 0.02 0.98 - - - - - - vpsrld $2, %ymm7, %ymm9 - - - - 0.32 0.34 1.00 - - 0.34 vmovdqa %ymm9, 608(%rsp) - - - - - - - 1.00 - - vpand %ymm2, %ymm8, %ymm8 - - - - - - - 1.00 - - vpermilps %ymm8, %ymm3, %ymm8 - - 0.51 0.49 - - - - - - vpand %ymm2, %ymm7, %ymm7 - - - - - - - 1.00 - - vpermilps %ymm7, %ymm4, %ymm7 - - 0.99 - - - - 0.01 - - vxorps %ymm7, %ymm8, %ymm7 - - 0.48 0.52 - - - - - - vpsrld$31, %ymm0, %ymm8 - - 0.51 0.48 - - - 0.01 - - vpaddd %ymm0, %ymm0, %ymm0 - - 0.48 - - - - 0.52 - - vpor %ymm0, %ymm8, %ymm0 - - 0.99 - - - - 0.01 - - vxorps %ymm0, %ymm7, %ymm0 - - - - 0.04 0.01 1.00 - - 0.95 vmovups %ymm0, (%rdx) - - 0.01 0.02 - - - - 0.97 - incq %rsi - - - 0.04 - - - 0.01 0.95 - addq $32, %rdx - - - 0.48 - - - 0.01 0.51 - cmpq %rsi, %rcx - - - - - - - - 1.00 - je .LBB116_34 - - 0.02 0.49 - - - 0.01 0.48 - cmpq %rsi, %rax - - 0.02 - - - - - 0.98 - je .LBB116_35 - - - 0.50 - - - 0.47 0.03 - testb$15, %sil - - 0.46 - - - - - 0.54 - je .LBB116_23 - - - 0.51 - - - 0.49 - - leaq 21(%rsi), %rdi - - - 0.03 - - - 0.96 0.01 - testb $15, %dil - - 0.97 - - - - - 0.03 - jne .LBB116_30  Code Snippet 75: Port usage for each instruction. Ports 1, 2, 5, and 6 are used 5.5 times every iteration. The timeline also looks cool. Timeline   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54  [5,0] . . . . . DeeeeeeeE-R . . . . . . vmovdqa 608(%rsp), %ymm7 [5,1] . . . . . DeeeeeeeE-R . . . . . . vmovdqa 544(%rsp), %ymm8 [5,2] . . . . . D=======eER . . . . . . vpsrld$2, %ymm8, %ymm9 [5,3] . . . . . D========eER . . . . . . vmovdqa %ymm9, 544(%rsp) [5,4] . . . . . .D=======eER . . . . . . vpsrld $2, %ymm7, %ymm9 [5,5] . . . . . .D========eER . . . . . . vmovdqa %ymm9, 608(%rsp) [5,6] . . . . . .D======eE--R . . . . . . vpand %ymm2, %ymm8, %ymm8 [5,7] . . . . . .D=======eE-R . . . . . . vpermilps %ymm8, %ymm3, %ymm8 [5,8] . . . . . .D=======eE-R . . . . . . vpand %ymm2, %ymm7, %ymm7 [5,9] . . . . . . D=======eER . . . . . . vpermilps %ymm7, %ymm4, %ymm7 [5,10] . . . . . . D========eER . . . . . . vxorps %ymm7, %ymm8, %ymm7 [5,11] . . . . . . D=======eE-R . . . . . . vpsrld$31, %ymm0, %ymm8 [5,12] . . . . . . D=======eE-R . . . . . . vpaddd %ymm0, %ymm0, %ymm0 [5,13] . . . . . . D========eER . . . . . . vpor %ymm0, %ymm8, %ymm0 [5,14] . . . . . . D=========eER. . . . . . vxorps %ymm0, %ymm7, %ymm0 [5,15] . . . . . . D=========eER . . . . . vmovups %ymm0, (%rdx) [5,16] . . . . . . D=eE--------R . . . . . incq %rsi [5,17] . . . . . . D===eE------R . . . . . addq $32, %rdx [5,18] . . . . . . D===eE------R . . . . . cmpq %rsi, %rcx [5,19] . . . . . . D====eE-----R . . . . . je .LBB116_34 [5,20] . . . . . . D====eE----R . . . . . cmpq %rsi, %rax [5,21] . . . . . . D=====eE---R . . . . . je .LBB116_35 [5,22] . . . . . . D======eE--R . . . . . testb$15, %sil [5,23] . . . . . . D=======eE-R . . . . . je .LBB116_23 [5,24] . . . . . . D=======eE-R . . . . . leaq 21(%rsi), %rdi [5,25] . . . . . . D========eER . . . . . testb $15, %dil [5,26] . . . . . . D========eER . . . . . jne .LBB116_30 [6,0] . . . . . . DeeeeeeeE--R . . . . . vmovdqa 608(%rsp), %ymm7 [6,1] . . . . . . DeeeeeeeE--R . . . . . vmovdqa 544(%rsp), %ymm8 [6,2] . . . . . . D=======eE-R . . . . . vpsrld$2, %ymm8, %ymm9 [6,3] . . . . . . D========eER . . . . . vmovdqa %ymm9, 544(%rsp) [6,4] . . . . . . .D======eE-R . . . . . vpsrld $2, %ymm7, %ymm9 [6,5] . . . . . . .D========eER . . . . . vmovdqa %ymm9, 608(%rsp) [6,6] . . . . . . .D=======eE-R . . . . . vpand %ymm2, %ymm8, %ymm8 [6,7] . . . . . . .D========eER . . . . . vpermilps %ymm8, %ymm3, %ymm8 [6,8] . . . . . . .D=======eE-R . . . . . vpand %ymm2, %ymm7, %ymm7 [6,9] . . . . . . . D========eER . . . . . vpermilps %ymm7, %ymm4, %ymm7 [6,10] . . . . . . . D=========eER. . . . . vxorps %ymm7, %ymm8, %ymm7 [6,11] . . . . . . . D=======eE--R. . . . . vpsrld$31, %ymm0, %ymm8 [6,12] . . . . . . . D=======eE--R. . . . . vpaddd %ymm0, %ymm0, %ymm0 [6,13] . . . . . . . D========eE-R. . . . . vpor %ymm0, %ymm8, %ymm0 [6,14] . . . . . . . D==========eER . . . . vxorps %ymm0, %ymm7, %ymm0 [6,15] . . . . . . . D==========eER . . . . vmovups %ymm0, (%rdx) [6,16] . . . . . . . D==eE--------R . . . . incq %rsi [6,17] . . . . . . . D===eE-------R . . . . addq $32, %rdx [6,18] . . . . . . . D====eE------R . . . . cmpq %rsi, %rcx [6,19] . . . . . . . D=====eE-----R . . . . je .LBB116_34 [6,20] . . . . . . . D=====eE----R . . . . cmpq %rsi, %rax [6,21] . . . . . . . D======eE---R . . . . je .LBB116_35 [6,22] . . . . . . . D======eE---R . . . . testb$15, %sil [6,23] . . . . . . . D=======eE--R . . . . je .LBB116_23 [6,24] . . . . . . . D=======eE--R . . . . leaq 21(%rsi), %rdi [6,25] . . . . . . . D========eE-R . . . . testb \$15, %dil [6,26] . . . . . . . D========eER . . . . jne .LBB116_30 
Code Snippet 76: Two iterations of the timeline, that show that the vmovdqa instructions that load the characters take a lot of time to execute.

The thing that struck me most in all this is that the vmovdqa instructions have a quite high latency of 7 cycles. Even though it is probably hidden, it would be nice to remove this. In fact, we can see that the buffers containing the characters are read from memory/the stack (608(%rsp) is an address relative to the stack pointer), then shifted by 2 bits, and then written back to the same place. I wonder why this value can’t just be kept in a register instead. In fact, I have seen a few cases where a register is used directly, but I can’t make it work reliably.

## 5.10 Finals thoughts

Now that we’ve reached 0.1 ns/kmer, and nearly 30x speedup over a previous implementation, I think it’s finally time to call it a day34.

To recap, our speed is:

• 0.1 ns/kmer
• 0.26 cycles/kmer
• 2.08 cycles/8kmers, i.e., 2 cycles per iteration of the main loop that produces 8 hashes in a SIMD.

And for completeness, here’s the full source code.

Full code for the SIMD NtHash implementation.
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154  const L: usize = 8; type T = u32; type S = Simd; pub struct NtHashSimd; impl ParHasher for NtHashSimd { type Out = T; #[inline(always)] fn hash_kmers(&mut self, k: usize, t: &[u8]) -> impl Iterator { if SMALL_K { assert!(k <= 32); } NtHashPackedSimdIt::::new(t, k) .unwrap() .map(|x| x.into()) } } #[derive(Debug)] pub struct NtHashPackedSimdIt<'a, const SMALL_K: bool> { seq: &'a [u8], n: usize, k: usize, fh: S, current_idx: usize, table: S, table_rot_k: S, offsets: Simd<*const u8, 4>, offsets_next: Simd<*const u8, 4>, chars_i: S, chars_i_next: S, chars_k: S, chars_k_next: S, chars_k_copy: S, chars_k_next_copy: S, } impl<'a, const SMALL_K: bool> NtHashPackedSimdIt<'a, SMALL_K> { /// Creates a new NtHashIt with internal state properly initialized. #[inline(always)] pub fn new(seq: &'a [u8], k: usize) -> Option { if k > seq.len() { return None; } if k > MAXIMUM_K_SIZE { return None; } if seq.len() > u32::MAX as _ { return None; } let num_kmers = seq.len() - k + 1; if num_kmers % L != 0 { return None; } let n = num_kmers / L; // Each 128-bit half has a copy of the 4 32-bit hashes. let table: S = b"ACTGACTG".map(|c| h(c) as T).into(); let table_rot_k: S = b"ACTGACTG".map(|c| h(c).rotate_left(k as u32) as T).into(); let fh = from_fn(|l| { let mut fh = 0; for (i, v) in seq[l * n..l * n + k].iter().enumerate() { fh ^= (h(*v) as T).rotate_left((k - i - 1) as u32); } fh }) .into(); Some(Self { seq, n, k, fh, current_idx: 0, table, table_rot_k, offsets: from_fn(|l| unsafe { seq.as_ptr().add(l * n) }).into(), offsets_next: from_fn(|l| unsafe { seq.as_ptr().add((4 + l) * n) }).into(), chars_i: S::splat(0), chars_i_next: S::splat(0), // TODO: properly initialize the first (-k)%32 characters of chars_k. chars_k: S::splat(0), chars_k_next: S::splat(0), chars_k_copy: S::splat(0), chars_k_next_copy: S::splat(0), }) } } impl<'a, const SMALL_K: bool> Iterator for NtHashPackedSimdIt<'a, SMALL_K> { type Item = S; #[inline(always)] fn next(&mut self) -> Option { if self.current_idx == self.n { return None; }; if self.current_idx != 0 { let i = self.current_idx - 1; unsafe { let read = |i| { let oi1 = self.offsets.wrapping_add(Simd::splat(i)); let oi2 = self.offsets_next.wrapping_add(Simd::splat(i)); let chars_i1: Simd = transmute(Simd::::gather_ptr(transmute(oi1))); let chars_i2: Simd = transmute(Simd::::gather_ptr(transmute(oi2))); chars_i1.deinterleave(chars_i2) }; if i % 16 == 0 { if i % 32 == 0 { if SMALL_K { self.chars_i = self.chars_k_copy; self.chars_i_next = self.chars_k_next_copy; } else { (self.chars_i, self.chars_i_next) = read(i); } } else { self.chars_i = self.chars_i_next; } } if (i + self.k) % 16 == 0 { if (i + self.k) % 32 == 0 { (self.chars_k, self.chars_k_next) = read(i + self.k); self.chars_k_copy = self.chars_k; self.chars_k_next_copy = self.chars_k_next; } else { self.chars_k = self.chars_k_next; } } // Extract the last 2 bits of each character. let seqi = self.chars_i & S::splat(0x03); let seqk = self.chars_k & S::splat(0x03); // Shift remaining characters to the right. self.chars_i >>= S::splat(2); self.chars_k >>= S::splat(2); use std::mem::transmute; let permutevar_epi32 = |a: S, b: S| transmute(_mm256_permutevar_ps(transmute(a), transmute(b))); let hi: S = permutevar_epi32(self.table_rot_k, seqi); let hk: S = permutevar_epi32(self.table, seqk); self.fh = ((self.fh << 1) | (self.fh >> 31)) ^ hi ^ hk; } } self.current_idx += 1; Some(self.fh) } } 
Code Snippet 77: Full code for the SIMD FxHash implementation.

### Doubling down again

Currently, the number of instructions per cycle is 3, while it could be over 4. One thing that we could try is to split the input in to 16 chunks and process them in parallel using two SIMD units of 8 lanes. That way, we could fully exploit the parallellism of SIMD in combination with the instruction level parallellism of the CPU. However, my feeling is that the throughput is already bounded by the number of available registers and ports, so that doing everything in parallel wouldn’t help much.

I did a very quick and ugly test for this, and it came out twice as slow. The main reasons seems to be that indeed many variables spill onto the stack, making things much slower.

### 16-bit hashes?

By going from 64bit to 32bit hashes, the SIMD code became much simpler and faster. One could imagine that going to 16bit hashes gives a similar speedup. Especially since now 80% of time is spent in the inner loop doing ‘plain’ SIMD instructions, having double the parallellism may be good. One drawback though is that there are no 16-bit shuffle/permute instructions to do the corresponding table lookup, so we’d have to:

1. Deinterleave the 16-bit characters into two vectors of 32-bit characters.
2. Do two 32-bit permutevar instructions.
3. Interleave the results back together.

Or alternatively, we could use 128-bit byte shuffles, but then still we have to do some extra accounting to turn the character values into appropriate byte-indices. Lastly, we’d also have to split the u64 of characters read in each place into 4 chunks of 16 instead of 2 chunks of 32, and then rotate the chunks every 8 instead of 16 characters. But this probably doesn’t cause much of a slowdown.

To summarize, I think some speedup may be possible here, but definitely not the 2x one may naively expect.

### What about a simple multiply hash

For small $$k\leq 16$$, a different approach is to simply multiply their 32-bit value by a constant and use that as a hash, as FxHash does. Surely, this can also be optimized to be very efficient.

By re-using all the work we already did to efficiently gather the data from the sequence, this is now fairly straightforward:

  1 2 3 4 5 6 7 8 9 10 11 12 13  // upcoming_chars contains the next 32bits for each lane, and is gathered as before. // Extract the last 2 bits of each character. let new_chars = self.upcoming_chars & S::splat(0x03); // Shift remaining characters to the right. self.upcoming_chars >>= S::splat(2); // Rotate the bit-representation. self.chars <<= S::splat(2); // Mask out the old bits. self.chars &= self.mask; // Or in the new bits. self.chars |= new_chars; // Multiply by a constant. return self.chars * self.constant; 
Code Snippet 78: Using FxHash instead of ntHash takes 75 ps/kmer, down from 102 ps/kmer!
Full code for the SIMD FxHash implementation.
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114  //////////////////////////// Bit-packed SIMD //////////////////////////// use super::*; use std::{ array::from_fn, mem::transmute, simd::{ptr::SimdConstPtr, Simd}, }; const L: usize = 8; type T = u32; type S = Simd; pub struct FxHashSimd; impl ParHasher for FxHashSimd { type Out = T; #[inline(always)] fn hash_kmers(&mut self, k: usize, t: &[u8]) -> impl Iterator { FxHashPackedSimdIt::new(t, k).unwrap().map(|x| x.into()) } } #[derive(Debug)] pub struct FxHashPackedSimdIt<'a> { seq: &'a [u8], n: usize, k: usize, chars: S, mask: S, constant: S, current_idx: usize, offsets: Simd<*const u8, 4>, offsets_next: Simd<*const u8, 4>, upcoming_chars: S, upcoming_chars_next: S, } impl<'a> FxHashPackedSimdIt<'a> { /// Creates a new FxHashIt with internal state properly initialized. #[inline(always)] pub fn new(seq: &'a [u8], k: usize) -> Option { assert!(k <= 16); if k > seq.len() { return None; } if k > MAXIMUM_K_SIZE { return None; } if seq.len() > u32::MAX as _ { return None; } let num_kmers = seq.len() - k + 1; if num_kmers % L != 0 { return None; } let n = num_kmers / L; Some(Self { seq, n, k, chars: S::splat(0), current_idx: 0, mask: S::splat((1 << (2 * k)) - 1), constant: S::splat(1234565323), offsets: from_fn(|l| unsafe { seq.as_ptr().add(l * n) }).into(), offsets_next: from_fn(|l| unsafe { seq.as_ptr().add((4 + l) * n) }).into(), upcoming_chars: S::splat(0), upcoming_chars_next: S::splat(0), }) } } impl<'a> Iterator for FxHashPackedSimdIt<'a> { type Item = S; #[inline(always)] fn next(&mut self) -> Option { if self.current_idx == self.n { return None; }; let i = self.current_idx; self.current_idx += 1; let read = |i| unsafe { let oi1 = self.offsets.wrapping_add(Simd::splat(i)); let oi2 = self.offsets_next.wrapping_add(Simd::splat(i)); let chars_i1: Simd = transmute(Simd::::gather_ptr(transmute(oi1))); let chars_i2: Simd = transmute(Simd::::gather_ptr(transmute(oi2))); chars_i1.deinterleave(chars_i2) }; if (i + self.k) % 16 == 0 { if (i + self.k) % 32 == 0 { (self.upcoming_chars, self.upcoming_chars_next) = read(i + self.k); } else { self.upcoming_chars = self.upcoming_chars_next; } } // Extract the last 2 bits of each character. let new_chars = self.upcoming_chars & S::splat(0x03); // Shift remaining characters to the right. self.upcoming_chars >>= S::splat(2); // Rotate the bit-representation. self.chars <<= S::splat(2); // Mask the previous self.chars &= self.mask; // Or in the new bits. self.chars |= new_chars; // Multiply by a constant. let hashes = self.chars * self.constant; Some(hashes) } } 
Code Snippet 79: Full code for the SIMD FxHash implementation.

Again faster, with 75 ps/kmer now, or 1.56 cycles per iteration!

One drawback of course is that this really only works for small $$k$$. For $$16<k\leq 32$$ the same could be done by doing two 32-bit multiplications, but that would probably hurt performance and make it slower than our ntHash implementation.

# 6 SIMD sliding window

At this point, we can go back to the sliding window minimum problem and use our new rolling hash for it. Naturally, we would like to also use SIMD here. The easiest is to adapt the Split method, since it is completely deterministic. That ensures that when processing multiple chunks of a sequence, the steps in each chunk (SIMD lane) are exactly the same.

Below is a relatively straightforward SIMD version of our initial code Code Snippet 13 (and Code Snippet 31 for the optimized version). It takes as input an iterator over [u32; 8] of 8 hashes (of independent chunks), and outputs a [u32; 8] for each window of width $$w$$ with the absolute position of the leftmost minimizer. Some details:

• Instead of comparing full 32-bit values, only the high 16-bits of each hash are compared.
• The low 16 bits of each 32-bit word are used to store the position, so that ties are automatically broken towards the minimal position. This means that the position can only go up to $$2^{16}$$, so the sequences needs to be chunked into pieces shorter than that, say of length $$2^{15}$$.
• I already used .get_unchecked where needed.
• Made sure that the .skip(w-1) at the end doesn’t become an if statement in the inner loop (or even puts the inner closure behind a function call some cases).
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46  impl SplitSimd { /// Assumes that the input iterator has less than 2^16 elements. /// Compares 32bit hashes by their upper 16 bits. /// Returns positions of all window minima. #[inline(always)] pub fn sliding_min( &self, w: usize, it: impl Iterator, ) -> impl Iterator { type S = Simd; assert!(w > 0); let mut prefix_min = S::splat(u32::MAX); let mut ring_buf = RingBuf::new(w, prefix_min); // We only compare the upper 16 bits of each hash. // Ties are broken automatically in favour of lower pos. let val_mask = S::splat(0xffff_0000); let pos_mask = S::splat(0x0000_ffff); let mut pos = S::splat(0); let mut it = it.map( #[inline(always)] move |val| { let elem = (S::from(val) & val_mask) | pos; pos += S::splat(1); ring_buf.push(elem); prefix_min = prefix_min.simd_min(elem); // After a chunk has been filled, compute suffix minima. if ring_buf.idx() == 0 { let mut suffix_min = ring_buf[w - 1]; for i in (0..w - 1).rev() { suffix_min = suffix_min.simd_min(ring_buf[i]); ring_buf[i] = suffix_min; } prefix_min = elem; // slightly faster than assigning S::splat(u32::MAX) } let suffix_min = unsafe { *ring_buf.get_unchecked(ring_buf.idx()) }; (prefix_min.simd_min(suffix_min) & pos_mask).into() }, ); // This optimizes better than it.skip(w-1). it.by_ref().take(w - 1).for_each(drop); it } } 
Code Snippet 80: SIMD version of the Split algorithm.

## 6.1 Results

We can benchmark this code in a few ways:

1. Take a test, compute hashes, and then directly stream them into the sliding window minima.
2. Take a text, compute hashes to a vector, then read the vector to compute sliding window minima.
3. Like 2, but only measure the time for the sliding window minima.

As expected, method 2 reports close to 0.1ns more than method 3, exactly the time taken to compute hashes. As output, we test both taking just the sum of all positions, and writing all positions to a vector.

Table 5: Runtime of computing hashes and taking sliding window minima. The first column indicates whether hashes are streamed directly to the window minima or first written to a vector.
ps/windowsum positionscollect positions
streaming hashes158226
buffered hashes192210

We see that the fully streaming option that doesn’t use any memory is clearly the fastest. Collecting the final result is faster when the intermediate results are already in a vector. I’m not quite sure why. Possibly the allocation is reused, but I don’t think that is actually the case.

Either way, this seems to be plenty fast, and I don’t see any more easy wins here. Comparing (hash, pos) tuples as a single integer seems inevitable for good performance. Maybe we could go down from (u16, u16) to (u8, u8), but using 8-bit hashes is unlikely to be good enough when $$w\sim 20$$. In that case, each window has a probability somewhere around order $$256/w \sim 12\%$$ of having a duplicate minimizer, which will start to hurt the density.

One more thing to try could be to again process two SIMD units in parallel, but also again, a quick and dirty test was quite a bit slower.

Takeaway

We can now do sliding window minima in 0.158 ns/kmer, or 3.3 cycles/8kmers. That’s 60x faster than RescanDaniel, and 160x faster than QueueIgor!

### Human genome results

For a human genome, consisting of roughly 3Gbp, we measure a runtime of 496 ms35, which corresponds to 0.156 ns/kmer. Thus, the results we see transfer to larger datasets without problems!

# TODO7 Cleanup, Testing, Super-k-mers, and canonical k-mers

1. Currently most code is only written to get measure throughput, and not set up to handle edge cases where e.g. the input length is not a multiple of $$8$$. This needs fixing, and then also some testing for correctness.

2. Most likely, an API that returns an iterator over super-k-mers (i.e., the range of windows have the same minimizer) that possibly also includes the bit-representation of the minimizer itself would be more convenient than the current API that only returns the minimizer for each window.

3. In most bioinformatics applications, canonical k-mers are used instead. Most likely this will require doubling large parts of the pipeline to compute reverse-complement hashes and their minimizers, and then take the minimum between the two, with some special tie-breaking in cases where the minimum is shared.

4. Also the API should probably contain a fast method to convert between byte and packed representations.

5. https://arxiv.org/pdf/2310.13212

6. https://uica.uops.info/

# References

Groot Koerkamp, Ragnar, and Giulio Ermanno Pibiri. 2024. “The Mod-Minimizer: A Simple and Efficient Sampling Algorithm for Long K-Mers,” May. https://doi.org/10.1101/2024.05.25.595898.
Kazemi, Parham, Johnathan Wong, Vladimir Nikolić, Hamid Mohamadi, René L Warren, and Inanç Birol. 2022. “Nthash2: Recursive Spaced Seed Hashing for Nucleotide Sequences.” Edited by Peter Robinson. Bioinformatics 38 (20): 4812–13. https://doi.org/10.1093/bioinformatics/btac564.
Mohamadi, Hamid, Justin Chu, Benjamin P. Vandervalk, and Inanc Birol. 2016. “Nthash: Recursive Nucleotide Hashing.” Bioinformatics 32 (22): 3492–94. https://doi.org/10.1093/bioinformatics/btw397.
Pibiri, Giulio Ermanno. 2022. “Sparse and Skew Hashing of K-Mers.” Bioinformatics 38 (Supplement_1): i185–94. https://doi.org/10.1093/bioinformatics/btac245.
Roberts, Michael, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, and James A. Yorke. 2004. “Reducing Storage Requirements for Biological Sequence Comparison.” Bioinformatics 20 (18): 3363–69. https://doi.org/10.1093/bioinformatics/bth408.
Schleimer, Saul, Daniel S. Wilkerson, and Alex Aiken. 2003. “Winnowing: Local Algorithms for Document Fingerprinting.” In Proceedings of the 2003 Acm Sigmod International Conference on Management of Data. Sigmod/Pods03. ACM. https://doi.org/10.1145/872757.872770.

2. I had a bug in earlier benchmarks that caused my timings to be 4x too fast. ↩︎

3. Headsup: all code I’m about to show is completely untested currently. ↩︎

4. On that matter, specific requests are welcome. ↩︎

6. Consider not skipping the code blocks. I made quite some effort to make them clean and readable. Most code is fairly straightforward assuming you’re comfortable reading Rust code. ↩︎

7. Legend:
Orange: processed state,
Bold outline: minimum of the prefix/suffix,
Bold character: minimum of the window,
Blue: state in memory,
Red: state removed from the queue,
Green: running prefix minimum. ↩︎ ↩︎ ↩︎ ↩︎ ↩︎

8. The hash_kmers function here is part of the Hasher trait that can be found in Code Snippet 32↩︎

9. In fact, the Buffered implementation is slightly more efficient since it stores hashes instead of recomputing them $$w$$ times. ↩︎

10. Citation needed, both for it being folklore and for the original idea. ↩︎

11. I’m not counting the for loop that iterates up to the variable length of the string as a branch miss, because that will only be a single branch-miss at the end, not $$O(1)$$ per iteration. ↩︎

12. It’s the third result. ↩︎

13. Feel free to skip this paragraph in case you are not familiar with segment and Fenwick trees. ↩︎

14. Hah, get it?
There’s at least two alternative interpretations ;) ↩︎

15. The ntHash paper doesn’t actually provide a benchmark for this though. But the consensus seems to be that buzhash is faster than classic multiplication based methods. ↩︎

16. The runtime does go down a bit to 2.3ns/kmer, but we’ll postpone looking into this more, since full lto is absolutely terrible for compile times, and thin lto is not sufficient. ↩︎

17. Maybe there is a reason for keeping it that I don’t see? Let me know! ↩︎

18. Again, the compiler disappoints a bit. Maybe there is some tricky reason why it’s not actually redundant? ↩︎

19. This article has a nice summary table of the register naming. ↩︎

21. Cf. Hydra (Greek mythology). ↩︎

22. Cf. Chekhov’s gun↩︎

23. In fact, Daniel Liu recommended me multiple times that pointer offsets are faster, so I’ll let this be my lesson. ↩︎

24. $$k$$ bits, actually ↩︎

25. I wouldn’t be surprised if I end up looking more into compilers eventually. ↩︎

26. At least not with AVX256. AVX512 does have it though. ↩︎

27. In AVX512 there is _mm256_permutexvar_epi64 (latency 3). ↩︎

28. There are static cross-128-bit lane shuffles that require a const shuffle argument, such as _mm256_permute4x64_epi64 (latency 3). ↩︎

29. There is a within-128-bit lane dynamic shuffle, _mm256_permutevar_pd (latency 1) ↩︎

30. There is a 32-bit variant, _mm256_permutevar8x32_ps (latency 3) ↩︎

31. And not waste more of the author’s time. ↩︎

32. You’d almost ask yourself why we didn’t do this from the start. ↩︎

33. Maybe it’s time to switch to picoseconds (ps)? ↩︎

34. Week, rather, for this section. ↩︎

35. Compare that to the 40 seconds it takes to convert the byte-encoded FASTA sequence to a bit-packed representation using some lazy code. Why aren’t we using bit-packed on-disk file formats anyway? ↩︎