- 1 Introduction
- 1.1 Results
- 2 Random minimizers
- 3 Algorithms
- 4 Analysing what we have so far
- 5 Rolling our own hash
- 6 SIMD sliding window
- 7 Extending into something useful
- 8 Conclusion
- 8.1 Future work
A 90 min recording of a talk I gave on this post can be found here.
1 Introduction
In this post, we will develop a fast implementation of random minimizers.
The goals here are:
- a library for fast random minimizers,
- explaining the final code, and the design decisions leading to it,
- walking you through the development process, hopefully teaching some tricks and methods for how to do this yourself.
Code for this post is in the simd-minimizers-bench/
directory of
github:rust-seq/simd-minimizers.
Most code snippets are directly included from the repository. I have tried to
write clean and readable code that follows the order of this post. Occasionally
some code snippets contain ‘future proofing’ that will only become relevant
later on.
Contents1:
- Background
- Overview of algorithms
- Analysing and optimizing the algorithms
- Rolling our own hash function
- SIMD sliding window
1.1 Results
The end result that we will be working towards here is twofold:
An implementation of 32-bit ntHash rolling hash that takes around
0.4ns/kmer
, or around 1 CPU cycle per hash. This is 7x faster than thenthash
crate by Luiz Irber.An implementation of random minimizers and super-k-mers that takes around
1.6 ns/window
, or 4 CPU cycles per window. This is 5x faster than Daniel Liu’s gist and 15x faster than therust-seq/minimizer-iter
crate by Igor Martayan2.It computes all window minimizers of a 2-bit packed human genome (with non-
ACGT
stripped) in 5.4 seconds.An implementation of canonical minimizers that runs in
1.9 ns/window
, or 6.4 seconds for a human genome.
Some of the code to iterate bases is already available in the packed-seq
crate
in the rust-seq
org.
The remaining algorithms are planned to be upstreamed to rust-seq/nthash
and/or
rust-seq/minimizer-iter
crates, after polishing the implementation,
adding tests3, and designing a nice API4.
I welcome any (critical) feedback on both the content and styling of this post! I plan to fine-tune it a bit more based on this, and will take this into account when writing similar posts in the future. If you’re feeling adventurous, open an issue or pull request against the source of this blog.
2 Random minimizers
Many tools in bioinformatics rely on k-mers.
Given a string/text/DNA sequence t: &[u8]
of length \(|t|=n\), the k-mers of t
are all the
length-\(k\) substrings. There are \(n-k+1\) of them.
In order to speed up processing, and since consecutive k-mers are overlapping and contain redundancy, it is common to subsample the k-mers. This should be done in a deterministic way, so that similar but slightly distinct text still mostly sample the same k-mers. One such way is random minimizers, introduced in parallel by Schleimer, Wilkerson, and Aiken (2003) and Roberts et al. (2004):
- First, a window size \(w\) is chosen, which will guarantee that at least one k-mer is sampled every \(w\) positions of \(t\), to prevent long gaps between consecutive k-mers.
- Then, every k-mer is hashed using a pseudo-random hash function \(h\), and in each window of \(w\) consecutive k-mers (of total length \(\ell=w+k-1\)), the leftmost5 one with the smallest hash is selected.
Consecutive windows (that overlap in \(w-1\) k-mers) often sample the same k-mer, so that the total number of sampled k-mers is much smaller than \(n-k+1\). In fact, random minimizers have a density of \(2/(w+1)\) and thus sample approximately two k-mers per window.
My previous post on minimizers gives more background on density and other sampling schemes. One such alternative scheme is the mod-minimizer (Groot Koerkamp and Pibiri 2024), that has low density when \(k\) is large compared to \(w\). That paper also contains a review of methods and pseudocode for them. In this post, I will focus on the most widely used random minimizers.
3 Algorithms
3.1 Problem statement
Before going into algorithms to compute random minimizers, let’s first precisely state the problem itself. In fact, we’ll state three versions of the problem, that each include slightly more additional information in the output.
Problem A: Only the set of minimizers
The most basic form of computing minimizers is to simply return the list of minimizer kmers and their positions. This information is sufficient to sketch the text.
|
|
Problem B: The minimizer of each window
When building datastructures on top of a sequence, such as SSHash (Pibiri 2022), just the set of k-mers is not sufficient. Each window should be mapped to its minimizer, but there could be multiple minimizers in a single window. Thus, in this version of the problem, we return the minimizer of each window.
Note that while in Problem A we return one result per minimizer, here we return one result per window (roughly, per character). This difference can have implications for the optimal algorithm implementation, since it’s not unlikely that the main loop of the optimal solution has exactly one iteration per returned element.
|
|
Problem C: Super-k-mers
We can implement SSHash and other indexing datastructures using the output of Problem B, but in most such applications, it is probably more natural to directly iterate over the super-k-mers: the longest substrings such that all contained windows have the same minimizer. This can easily be obtained from the output of Problem B by grouping consecutive windows that share the minimal k-mer.
Note that here the result is again an iterator over minimizers, like in Problem A rather than windows as in Problem B.
|
|
Which problem to solve
In the remainder, we mostly focus on Problem B, since that is what I initially started with. However, some methods more naturally lean themselves to solve the other variants. We will benchmark each method on whichever of the problems it runs fastest, and postpone optimizing a solution for each specific problem until later.
Most likely though, solving Problem C will yield the most useful tool for end users. I plan to revisit this at the end of this post.
Regardless, for convenience we can implement solutions to A and C in terms of B.6
|
|
Canonical k-mers
In actual bioinformatics applications, it is common to consider canonical k-mers and corresponding canonical minimizers. These are defined such that the canonical k-mer of a string and its reverse complement are the same. This will significantly impact the complexity of our code, and hence we also postpone this for later.
3.2 The naive algorithm
The naive \(O(n \cdot w)\) method simply finds the minimum of each window independently.
It looks like this:7
In code, it looks like this.
|
|
Note that we assume that hashing each k-mer takes constant time. This is reasonable in practice, since most hash functions hash many characters at once.
Performance characteristics
I was planning to show some preliminary performance numbers here, to give some intuition about the relative performance of this and upcoming methods. Unfortunately though (and not completely surprising), performance depends heavily on small details of how the code is written. I could provide a comparison between the methods as-is, but the numbers would be hard to interpret. Instead, let’s consider some high level expectations of this algorithm for now, and postpone the quantitive comparison until later.
Since there are no if
statements, the code should be highly predictable. Thus,
we expect:
- few branch misses,
- high instructions per cycle,
- and generally running times linear in \(w\).
3.3 Rephrasing as sliding window minimum
After hashing all k-mers, we basically have a sequence of \(n-k+1\) pseudo-random integers. We would like to find the position of the leftmost minimum in each window of \(w\) of those integers.
Thus, the problem of finding random minimizers can be reformulated in terms of the sliding window minima. We can model this problem using the following trait:
|
|
A first implementation that mirrors the Naive
implementation above could simply store the last \(w\) hashes in a RingBuffer
,
and in each step recompute the minimum. The boiler plate for the ring buffer is simple:
|
|
And then a sliding window implementation also becomes trivial:
|
|
Using this trait, we can clean up8 the implementation of the Naive
minimizer from
before9. In fact, we can now generically implement a minimizer scheme using any
implementation of the sliding window algorithm.
|
|
3.4 The queue
Let’s try to improve this somewhat inefficient \(O(n\cdot w)\) algorithm. A first idea is to simply keep a rolling prefix-minimum that tracks the lowest value seen so far. For example, when \(w=3\) and the hashes are \([10,9,8,7,6,…]\), the rolling prefix minimum is exactly also the minimum of each window. But alas, this won’t work for increasing sequences: when the hashes are \([1,2,3,4,5,…]\) and the window shifts from \([1,2,3]\) to \([2,3,4]\), the prefix minimum is \(1\) (at index \(0\)), but it is not a minimum of the new window anymore, which instead goes up to \(2\).
Nevertheless, as the window slides right, each time we see a new value that’s smaller than everything in the window, we can basically ‘forget’ about all existing values and just keep the new smaller value in memory. Otherwise, we can still forget about all values in the window larger than the value that is shifted in.
This is formalized by using a queue of non-decreasing values in the window. More precisely, each queue element will be smaller than all values in the window coming after it. At each step, the minimum of the window is the value at the front of the queue. Let’s look at our example again.7
We see that there are two reasons elements can be removed from the queue:
- a smaller element is pushed,
- the window shifts so far that the leftmost/smallest element falls outside it.
To handle this second case, we don’t just store values in this queue, but also their position in the original text, so that we can pop them when needed.
In code, it looks like this:
|
|
Performance characteristics
Since each element is pushed once and popped once, each call to
push
takes amortized constant \(O(1)\) time, so that the total runtime is
\(O(n)\).
We may expect a \(w=11\) times speedup over the buffered method, but most likely
this will not be the case.
This is due to the high number of unpredictable branches,
which hurts pipelining/instruction level parallelism (ILP). Thus, while the
complexity of the Queue
algorithm is good, its practical performance may not be optimal.
3.5 Jumping: Away with the queue
While the queue algorithm is nice in theory, in practice it is not ideal due to the branch-misses. Thus, we would like to find some middle ground between keeping this queue and the naive approach of rescanning every window. A first idea is this: Once we find the minimizer of some window, we can jump to the first window after that position to find the next minimizer. This finds all distinct minimizers, although it does not compute the minimizer for each window individually. Thus, it solves Problem A instead of Problem B.7
Since the expected distance between random minimizers is \((w+1)/2\), we expect to scan each position just under two times.
Bryce Kille noticed that this method has a bug: when the sequence of hashes is e.g.
10, 9, 8, 7, 6, 5, 4, 3, 2, 1
, the minimizer of the first window is 7
. Then
we’d jump to the 6, 5, 4, 3
slice to find 3
as the next minimizer, but that
is wrong, since 6
, 5
, and 4
are also minimizers. Instead of only taking
the minimum of the new slice, all prefix minima that are less than the previous
minimum are minimizers.
|
|
Performance characteristics
One benefit of this method is that it is completely branchless. Each iteration of the loop corresponds to exactly one output. The only unpredictable step is which element of a window is the minimizer, since that determines which is the next window to scan. Thus, even though it is branchless, consecutive iterations depend on each other and ILP (instruction level parallelism, aka pipelining) may only have a limited effect.
3.6 Re-scan
I learned of an improved method via Daniel Liu’s gist for winnowing, but I believe the method is folklore10. This is a slightly improved variant of the method above. Above, we do a new scan for each new minimizer, but it turns out this can sometimes be avoided. Now, we will only re-scan when a previous minimum falls outside the window, and the minimum of the window goes up.7
Thus, we keep a ring buffer of the last \(w\) values, and scan it as needed. One point of attention is that we need the leftmost minimal value, but since the buffer is cyclic, the leftmost element in the buffer is not necessarily the leftmost one in the original text. Thus, we don’t only store elements but also their positions, and break ties by position.
|
|
Performance characteristics
This method should have much fewer branch-misses than the queue, since we rescan a full window of size \(w\) at a time. Nevertheless, these re-scans occur at somewhat random moments, so some branch-misses around this are still to be expected.
3.7 Split windows
In an ideal world, we would use a fully predictable algorithm, i.e., an algorithm without any data-dependent branches/if-statements.11
With some Googling12 I found this codeforces blog on
computing sliding window minima (see also this stackoverflow answer and an
implementation in the moving_min_max
crate).
In the post, the idea is presented using two
stacks. My explanation below goes in a slightly different direction, but in the
end it comes down to the same.
Aside: Accumulating non-invertible functions.
So, as we saw, most methods keep some form of rolling prefix minimum that is
occasionally ‘reset’, since ‘old’ small elements should not affect the current
window.
This all has to do with the fact that the min
operation is not invertible:
knowing only the minimum of a set \(S\) is not sufficient to know the minimum of
\(S - \{x\}\).
This is unlike for example addition: sliding window sums are easy by just adding
the new value to the sum and subtracting the value that falls out of the window.
This distinction is also exactly the difference between segment trees and Fenwick trees13: a Fenwick tree computes some function \(f\) on a range \([l, r)\) by removing \(f(a_0, \dots, a_{l-1})\) from \(f(a_0, \dots, a_{r-1})\), by decomposing both ranges into intervals with power-of-\(2\) lengths corresponding to the binary representation of their length. But this only works if \(f\) is invertible. If that is not the case, a segment tree must be used, which partitions the interval \([l, r)\) itself, without going ‘outside’ it.
The algorithm. Based on this, we would also like to partition our windows such that we can compute their minimum as a minimum over the different parts. It turns out that this is possible, and in a very clean way as well!
First, we conceptually chunk the input text into parts with length \(w\), as shown in the top row of Figure 5 for \(w=4\). Then, each window of length \(w\) either exactly matches such a chunk, or overlaps with two consecutive chunks. In the latter case, it covers a suffix of chunk \(i\) and a prefix of chunk \(i+1\). Thus, to compute the minimum for all windows, we need to compute the prefix and suffix minima of all chunks. We could do that as a separate pass over the data in a pre-processing step, but it can also be done on the fly:
- Each time a new kmer is processed, its hash is written to a size-\(w\) buffer. (The orange values in the top-right of Figure 5.)
- After processing the window corresponding to a chunk, the buffer contains exactly the hashes of the chunk. (Fourth row.)
- Then, we compute the suffix-minima of the chunk by scanning the buffer backwards. (Fifth row.)
- For the next \(w\) windows, we intialise a new rolling prefix minimum in the new chunk (green outline).
- The minimum of each window is the minimum between the rolling prefix minimum in chunk \(i+1\) (green outline), and the suffix minimum in chunk \(i\) that we computed in the buffer (blue outline).7
|
|
Performance characteristics
The big benefit of this algorithm is that its execution is completely deterministic, similar to the jump algorithm: every \(w\) iterations we compute the suffix-minima, and then we continue again with the rolling prefix minimum. Since this ‘side loop’ is taken exactly once in every \(w\) iterations, the corresponding branch is easy to predict and does not cause branch-misses. Thus, we expect nearly \(0\%\) of branch-misses, and up to \(4\) instructions per cycle.
4 Analysing what we have so far
4.1 Counting comparisons
Before we look at runtime performance, let’s have a more theoretical look at the
number of comparisons each method makes.
We can do this by replacing the u64
hash type by a wrapper type that increments a
counter each time .cmp()
or .partial_cmp()
is called on it. I exclude
comparisons with the MAX
sentinel value since those could possibly be
optimized away anyway. The code is not very interesting, but you can find it here.
|
|
Observations:
Buffered
Does \(w-1\) comparisons for each window to find the max of \(w\) elements.Queue
andRescan
both do exactly \(2-2/(w+1)\) comparisons per element.- For
Queue
, eachpop_back
is preceded by a comparison, and eachpush
has one additional comparison with the last element in the queu that is smaller than the new element. Only when the new element is minimal (probability \(1/(w+1)\)), the queue becomes empty and that comparison doesn’t happen. Also when the smallest element is popped from the front (probability \(1/(w+1)\)), that means it won’t be popped from the back and hence also saves a comparison. - Unsurprisingly, the
Jumping
algorithm uses the fewest comparisons, since it also returns strictly less information than the other methods. - The
Split
method, which seems very promising because of its data-independence, actually uses around \(50\%\) more comparisons (\(3-3/w\), to be precise), because for each window, a minimum must be taken between the suffix and prefix.
Open problem: Theoretical lower bounds
It would be cool to have a formal analysis showing that we cannot do better than \(2-2/(w+1)\) expected comparisons per element to compute the random minimizer of all windows, and similarly it would be nice to have a lower bound on the minimal number of comparisons needed to only compute the positions of the minimizers.
4.2 Setting up benchmarking
Adding criterion
Before we start our implementation, let’s set up a benchmark so we can easily
compare them. We will use criterion, together with the cargo-criterion helper.
First we add criterion
as a dev-dependency to Cargo.toml
.
|
|
We add benches/bench.rs
which looks like this:
Click to show code.
|
|
Making criterion fast
By default, the benchmarks are quite slow. That’s for a couple of reasons:
- Before each benchmark, a 3 second warmup is done.
- Each benchmark is 5 seconds.
- After each benchmark, some plots are generated.
- At the end, some HTML reports are generated.
I’m impatient, and all this waiting really impacts my iteration time, so let’s make it faster:
- Reduce warmup time to 0.5s.
- Reduce benchmark time to 2s, and only take 10 samples.
- Disable plots and html generation.
The first two are done from code:
|
|
The last is best done by adding the --plotting-backend disabled
flag. For
convenience, we add this rule to the justfile
so we can just do just bench
. I’m also adding quiet
to hide the comparison between runs to simplify presentation.
|
|
When we now do just bench
, we see this:
A note on CPU frequency
Most consumer CPUs support turboboost to increase the clock frequency for short
periods of time. That’s nice, but not good for stable measurements. Thus, I
always pin the frequency of my i7-10750H
to the default 2.6GHz
:
|
|
This usually results in very stable measurements.
Similarly, I have hyper threading disabled, so that each program has a full core to itself, without interference with other programs.
4.3 Runtime comparison with other implementations
Let’s compare the runtime of our method so far with two other Rust implementations:
- minimizer-iter is an implementation written by Igor Martayan. It returns an iterator over all distinct minimizers and uses the queue algorithm.
- Daniel Liu’s implementation of the rescan algorithm.
|
|
Some initial observations:
- Measurements between samples are very stable, varying less than \(1\%\).
- Each method takes 7-55ms to process 1 million characters. That’s 7-55ns per character, or 7s to 55s for 1Gbp.
- Our queue implementation performs similar to the one in
minimizer-iter
. - Daniel’s implementation of
Rescan
is around twice as fast as ours! Time to start looking deeper into our code.
Looking back at our expectations:
- As expected,
Buffered
is the slowest method by far, but even for \(w=11\), it is only \(2\times\) slower than the next-slowest method. Split
is about as fast asRescan
, and even a bit slower. We were expecting a big speedup here, so something is weird.RescanDaniel
is even much faster than ourSplit
, so clearly there must be a lot of headroom.
4.4 Deeper inspection using perf stat
Let’s look some more at performance statistics using perf stat
. We can reuse
the benchmarks for this, and call criterion
with the --profile-time 2
flag,
which simple runs a benchmark for \(2\) seconds.
|
|
The metrics that are relevant to us now are summarized in the following table.
metric | Runtime | instructions/cycle | branches/sec | branch misses |
---|---|---|---|---|
ns/char | GHz | M/sec | % | |
Naive | 64 | 4.13 | 1743 | 0.11% |
Buffered | 55 | 2.17 | 987 | 4.66% |
Queue | 22 | 2.29 | 810 | 4.29% |
QueueIgor | 27 | 2.33 | 662 | 4.29% |
Jumping | 7.1 | 3.89 | 1459 | 0.13% |
Rescan | 14 | 3.02 | 1116 | 2.26% |
RescanDaniel | 9.0 | 3.21 | 1410 | 0.85% |
Split | 14 | 3.37 | 1116 | 1.55% |
Observe:
Naive
has an instructions per cycle (IPC) of over 4! I have never actually seen IPC that high. It’s not measurement noise, so most likely it comes from the fact that some instructions such as compare followed by conditional jump are fused.Naive
also has pretty much no branch misses.Buffered
only has half the IPC ofNaive
but is still slightly faster since it only hashes each kmer once. For some reason, there are a lot of branch misses, which we’ll have to investigate further.- The two
Queue
variants also have relatively few instructions per cycle, and also relatively a lot of branch misses. Here, the branch-misses are expected, since the queue pushing and popping are inherently unpredictable. - In fact, I argued that the last element of the
Queue
is popped with probability \(50\%\). But even then, the branch predictor is able to get down to \(4\%\) of branch misses. Most likely there are correlations that make these events somewhat predictable, and of course, there are also other easier branches that count towards the total. Jumping
has an very good IPC of nearly \(4\), as we somewhat hoped for, and correspondingly has almost no branch misses.- The remaining methods all have a decent IPC just above \(3\).
RescanDaniel
has much fewer branch misses thanRescan
. We should investigate this.- We argued that
Split
is completely predictable, but we see a good number of branch misses anyway. Another thing to investigate.
4.5 A first optimization pass
It’s finally time to do some actual optimizations! We’ll go over each of the algorithms we saw and optimize them. This step is ‘just’ so we can make a fair benchmark between them, and in the end we will only optimize one of them further, so I will try to keep it short. The final optimized version of each algorithm can be found in the repository, but intermediate diffs are only shown here.
Optimizing Buffered
: reducing branch misses
Let’s have a look at perf record
:
|
|
Most of these branches come from the fact that computing the minimum of two
Elem { val: usize, pos: usize }
is very inefficient, since two comparisons are
needed, and then the result is conditionally overwritten.
The only reason we are storing an Elem
that includes the position is that we
must return the leftmost minimum. But since e.g. position_min()
returns the
leftmost minimum anyway, we can try to avoid storing the position and always
iterate over values from left to right. We’ll add a method to our RingBuf
for this:
|
|
|
|
The hot loop of the code looks much, much better now, using cmov
instead of
branches to overwrite the minimum.
New assembly
|
|
Also the perf stat
is great now: 4.23 IPC
and 0.13%
branch misses, so
we’ll leave it at this.
- Compute minima from left to right to avoid comparing positions.
- Prefer
cmov
over branches to compute minima.
Queue
is hopelessly branchy
We can confirm that indeed a lot of time is spent on branch-misses around the popping of the queue.
Assembly code showing a badly predicted branch.
|
|
Sadly, I don’t really see much room for improvement here.
Data-dependent branches are terrible.
Jumping
is already very efficient
If we do a perf report
here, we see that only 25%
of the time is spent in
the minimizer_positions
function:
|
|
The hot loop for finding the minimum of a window is very compact again.
Assembly code
|
|
More problematic is the 63%
of time spent on hashing kmers, but we will
postpone introducing a faster hash function a bit longer.
Optimizing Rescan
The original Rescan
implementation suffers from the same issue as the
Buffered
and compares positions, so let’s reuse the RingBuf::forward_min()
function, and compare elements only by value.
Code for RescanOpt.
|
|
This has 3.55 IPC
up from 3.02
, and 0.91%
branch misses, down from
2.26%
, which results in the runtime going down from 14.1ms
to 10.2ms
.
The hottest part of the remaining code is now the branch-miss when if pos - min.pos == w
is true, taking 6%
of time, since this is an unpredictable
branch. This seems rather unavoidable. Otherwise most time is spent hashing kmers.
Optimizing Split
Also here we start by avoiding position comparisons.
This time it is slightly more tricky though: the RingBuf
does have to store
positions, since after taking suffix-minima, the value at each position does not
have to correspond anymore to its original position.
The tricky part here is to ensure that the compiler generates a branchless minimum, since usually it creates a branch to avoid overwriting two values with their own value:
|
|
Code for SplitOpt.
|
|
Now, it runs in 11.2ms
instead of 14.0ms
, and has 3.46 IPC
(up from 3.37 IPC
) and 0.63%
branch misses, down from 1.55%
.
The remaining branch misses seem to be related to the kmer hashing, although
this is somewhat surprising as Jumping
doesn’t have them.
Data-independent branches are nearly free.
4.6 A new performance comparison
Now that we have our optimized versions, let’s summarize the new results.
metric | Runtime | instructions/cycle | branches/sec | branch misses |
---|---|---|---|---|
ns/char | GHz | M/sec | % | |
Naive | 64 | 4.13 | 1743 | 0.11% |
Buffered | 55 | 2.17 | 987 | 4.66% |
BufferedOpt | 26 | 4.19 | 1499 | 0.13% |
Queue | 22 | 2.29 | 810 | 4.29% |
QueueIgor | 27 | 2.33 | 662 | 4.29% |
Jumping | 7.1 | 3.89 | 1459 | 0.13% |
Rescan | 14 | 3.02 | 1116 | 2.26% |
RescanOpt | 10 | 3.58 | 1397 | 0.90% |
RescanDaniel | 9.0 | 3.21 | 1410 | 0.85% |
Split | 14 | 3.37 | 1116 | 1.55% |
SplitOpt | 11 | 3.47 | 1410 | 0.63% |
Note how BufferedOpt
is now almost as fast as Queue
, even though it’s
\(O(nw)\) instead of \(O(n)\). Unpredictable branches really are terrible. Also,
RescanOpt
is now nearly as fast as
RescanDaniel
, although Jumping
is still quite a bit faster.
For reasons I do not understand SplitOpt
is still not faster than RescanOpt
,
but we’ll have to leave it at this.
5 Rolling14 our own hash
As we saw, Jumping
spends two thirds of its time on hashing the kmers, and in
fact, SplitOpt
and RescanOpt
spend around half of their time on hashing.
Thus, it is finally time to investigate some different hash functions.
5.1 FxHash
So far, we have been using FxHash
, a very simple hash function originally used by FireFox and also used
in the Rust compiler. For each u64
chunk of 8 bytes, it bit-rotates the hash
so far, xor’s it with the new words, and multiplies by a fixed constant. For
inputs less than 8 characters, it’s only a single multiplication.
Even though this method iterates over the input length, all k-mers have the same
length, so that the branches involved are usually very predictable.
Nevertheless, hashing each k-mer involves quite some instructions, and
especially a lot of code is generated to to handle all lengths modulo \(8\) efficiently.
WyHash
WyHash is currently the fastest cryptographically secure hash function in the
SMHasher benchmark, but using it instead of FxHash slows down
the benchmarks by roughly 2ns
per character.
The reason is that FxHash itself is not secure, and hence can be simpler.
5.2 NtHash: a rolling hash
We could avoid this for loop for the hashing of each k-mer by using a rolling hash that only needs to update the hash based on the character being added and removed. The classsic rolling hash is the Karp-Rabin hash, where the hash of a kmer \(x\) is \[ h(x) = \sum_{i=0}^{k-1} x_i \cdot p^i \pmod m, \] for some modulus \(m\) (e.g. \(2^{64}\)) and coprime integer \(p\). This can be computed in an incremental way by multiplying the previous hash by \(p\) and adding/subtracting terms for the character being added and removed.
NtHash
(Mohamadi et al. 2016) is a more efficient variant15 of this based on ‘buzhash’,
using multiplication and addition \((\times, +)\) uses bitwise rotation and xor \((\text{rot}, \oplus)\):
\[
h(x) = \bigoplus_{i=0}^{k-1} \text{rot}^i(f(x_i)),
\]
where \(f\) is a static table lookup that maps each character to some fixed random
value. (See also this post on ntHash).
The main benefit of this method over Karp-Rabin hashing is that typically table lookups and bit rotate instructions are much faster than multiplications.
One limitation of NtHash is that it is periodic: when \(k> 64\), characters
that are exactly 64 positions apart will have the same rotation applies to their
\(f(x_i)\). This means that the hashes of A....A
and B....B
are be the same,
and similarly the hashes of A....B
and B....A
are the same. NtHash2
(Kazemi et al. 2022) fixes this by using multiple rotation periods, but we will stick
with the original NtHash. In practice, minimizers with \(k> 64\) are rarely
used anyway, and even if so, then the probability of hash collisions between the
\(w\) k-mers in a window should be small enough.
The nthash
crate
Using the nthash
crate, the implementation of the Hasher
trait that we’ve
implicitly been using so far is simple:
|
|
Unfortunately, the results are mixed:
alg \ hash: | FxHash | ntHash |
---|---|---|
BufferedOpt | 26 | 30 |
Queue | 22 | 24 |
Jumping | 7.1 | 5.0 |
RescanOpt | 10 | 14 |
SplitOpt | 11 | 16 |
All methods become 2-4ns/ker
slower, apart from Jumping
, which is actually
significantly faster. Remember that for Jumping
(Code Snippet 11) we first compute all hashes
and collect them into a vector, while all other methods compute hashes on the
fly. Thus, for some reason ExtNtHasher
must be faster than FxHash when
simply streaming through all kmers, while it’s slower when rolling the hash at
more irregular intervals.
Buffered hash values
Based on the difference above, let’s investigate whether first collecting all hashes to a vector speeds up the other methods as well. We provide a simple wrapper type for this:
|
|
alg \ hash: | FxHash | FxHash | ntHash | ntHash |
---|---|---|---|---|
buffered: | no | yes | no | yes |
BufferedOpt | 26 | 31 | 30 | 28 |
Queue | 22 | 24 | 24 | 22 |
Jumping | 7.1 | 7.1 | 5.0 | 5.0 |
RescanOpt | 10 | 13 | 14 | 11 |
SplitOpt | 11 | 14 | 16 | 12 |
Buffering hashes in an intermediate vector helps for ntHash, but not for FxHash.
5.3 Making ntHash fast: going branchless
Still now, ntHash take over half the time of the Jumping
algorithm, so we’ll
try to make it a bit faster.
|
|
Assembly for collecting ntHashes into a vector.
|
|
Even though we know from perf stat
that there are basically no branch misses,
the assembly does contain a lot of branches, most of which correspond to
assertions/bound checks in the original code. To fix these, will copy the source
of the nthash
crate and modify it locally.
To start, ExtNtHash
takes in 2.9ns/kmer
to collect kmer hashes to a vector.
Drop sanity checks
First, we see that for each character it is checked whether it is ACGT
. It
would be faster to do that up-front if needed, so let’s remove the check.
|
|
Drop bound checks
Next, we sprinkle in some get_unchecked
, to avoid unnecessary bound checks.
Both checks are provably impossible to fail, yet the compiler isn’t able to
elide them. I suspect it has something to do with the
NtHashIt::new()
function not being inlined, and hence the
next()
function not being aware of invariants checked during initialization. I
also tried adding some additional assertions and using unchecked_add
to
indicate that overflow is undefined behaviour, but even then one of the branches remained.
Even with full link time optimization enabled (lto = true
in Cargo.toml
),
the branches are still there16.
|
|
Not all provably redundant bound checks are always elided.
Efficiently collecting to a vector
At this point there are three branches left:
- Checking whether the end of the sequence was reached.
- The
self.current_idx != 0
check that makes sure we don’t remove a character in the first iteration. - A check that the vector we are accumulating into still has space.
Both the second and third should be redundant: we can handle the first
iteration separately, and pre-allocate a vector that is sufficiently large.
Let’s fix the latter one first. Ideally, the compiler would check that the
size_hint
provided by the NtHashIt
is large enough and that the
initial Vec::reserve()
has sufficient capacity that an overflow can not
happen, but alas, that’s not the case (also not with full lto).
Generally, it seems that we can not get rid of the possible resize check while
using .collect::<Vec<_>>()
. Thus, let’s roll our own.
First, we can manually reserve the vector capacity and then fill it:
|
|
Generated assembly
|
|
Well this doesn’t work. I really trusted the compiler to optimize this away, but it doesn’t17. Guess I’ll have to adjust my trust level accordingly.
Anyway, I suppose we’ll just make a default vector with the right
size and then fill it, so we can completely drop the push
call.
|
|
The current assembly looks like this.
|
|
Avoid push
to fill know-size arrays. (But maybe this doesn’t apply in general.)
Aside: x64 intricacies. This looks pretty great! The first three lines are somewhat redundant though,
since we know the unwrap can never fail18. And also on line 6, there is a 32-bit mov
from %r13d
to %ecx
of
which the low 8 bits are used on line 7 as %cl
19. But then why not directly use the low 8
bits of %r13
, %r13b
? Maybe the compiler insists on using the counter
register c
for this? After some more reading and discussing with Tommasso Fontana, that does indeed seem to be the
case: this page lists the opcodes for rol
instructions, and the only supported
variants are rotating by 1
, by cl
, and by an immediate value.20.
Aside: native compilation. I’m using -march=native
. Without that, it generates a rol $1,%rsi
instead of a rorx $0x3f,%rsi,%rsi
, and the runtime is 1.83ns/kmer
instead of
1.71ns/kmer
. rorx
is probably faster because the x
suffix means it doesn’t
set overflow/carry flags, and hence has fewer dependencies with other
instructions. But rorx
is only available with BMI2, so this requires the
native compilation.
Use -march=native
.
Back to branches. Either way, let’s get rid of the last remaining branch:
|
|
Somehow this is worse. It seems two new branches came back when removing this one!21
Assembly, but worse.
|
|
I’ve also tried adding an assert_eq!(v.len(), len)
, but that doesn’t help
either.22
|
|
Addressing mode.
Another thing we can do is improve the vector addressing mode. Instead of
writing to vector_offset + index * 8
, it’s sometimes faster to directly write
to a *ptr
address. Let’s try something:
|
|
I’m not quite sure why, but this causes the assembly to be one instruction longer and hence slightly slower.
Confusion, again! Ok let’s remove the assertion again.
|
|
Finally the assembly is how I like it:
|
|
At this point I am rather confused, but at least the code looks good:
- Why does removing the assertion make things faster?
- Why is the loop using a pointer offset (Code Snippet 45) faster/different at all? Shouldn’t the compiler have the same information on both cases?
Adding assertions to ‘help’ the compiler can be counterproductive.
Instead of using v[i]
or iterating over a vector, use v.as_ptr().add(i)
.23
5.4 Rolling a bit24 less
As we saw before, the rotation by \(k\) positions takes two instructions: one to
move \(k\) into %cl
, and one for the rotation itself. But all rotations are over
exactly \(k\), so in fact, we can precompute these:
|
|
|
|
Analysing the assembly code
At this point, we cannot hope for much fewer instructions. The 15 instructions are:
- 1: increase iteration count by 2,
- 2: compare and jump,
- Then for each of the two unrolled iterations:
- 2: load two characters,
- 1: rotate the rolling hash,
- 2: xor in the two characters,
- 1: write back the result.
At 1.41ns/kmer
, this loop should take 1.41ns/kmer * 2.6GHz = 3.7 cycles/kmer
or 7.3 cycles/loop
, which corresponds to just over 2
instructions per cycle. Indeed, perf stat
shows 2.04 IPC
. I have
marked the critical path of xor’ing the hashes with *
s in the assembly code
above.
The three rorx, xor, xor
instructions have a total latency of 1+2+2=5
cycles, which is actually higher than the 3.7 cyckes/kmer
we measure. Maybe
the ’load’ micro-up in the xor can actually be done in parallel before the
result of the previous xor is done? Regardless, this is pretty impressive performance!
Still, instead of doing x = (x^a)^b
, it would be more efficient to do x = x^(a^b)
. But just parenthesizing the code like that doesn’t improve the
assembly. Probably because that would require a separate load instruction,
since xor’ing with two addresses from memory (as opposed to a memory address and
a register) is not a thing.
5.5 Parallel it is
As we saw above, the current ntHash implementation only uses around 2 IPC
,
while each core should be able to execute up to 4 instructions every cycle.
Thus, let’s simply try to run two copies of our main loop in parallel:
|
|
Sadly, this is not nearly the two times speedup I was hoping for, but at least
we’re at 2.5 IPC
and now, and 3.5 cycles/kmer
, even closer to the 3 cycles/kmer
lower bound from the xor
dependency chain. (Even though now that
we process two independent chains, that lower bound really is only 1.5 cycles/kmer
. Probably the code is just fully occupying some ports.) The assembly code is:
|
|
But this is just silly: At the end we compare %rcx
with %r9
, and if they
are not equal, we continue with the next iteration. Then the first thing we do:
again compare %rcx
with %r9
, and panic on the unwrap
if they are equal.
But we just checked they are not equal… Again, I’m at a loss why this it the
case. Clearly there are two unwrap
calls in our inner loop, but since they
check exactly the same, there really is no need for this.
Compilers are weird and not to be trusted?25
More parallel
I tried to parallellize more by doing 3 or 4 chunks at a time, both manually as
above and by using const generics and iterating over [_; B]
types, but they all
perform worse at the moment, and generate very branchy code again.
Probably the better way to avoid all these issues is by inlining and
interleaving the ntHash code.
Instead of doing a ‘high level’ parallellization where we make \(B\)
calls to it.next()
in parallel, it’s more efficient to ‘internalize’ the
parallellization, so that all the accounting overhead is done only once.
Thus, we first try this:
|
|
This is already slightly faster than the version we had. Let’s try more granular parallellization:
|
|
This latest parallel version with \(L=3\) runs at 1.10ns/kmer
and 2.7 IPC
, or
exactly 3 cycles/kmer
. I’m still not quite sure what the main bottleneck is
though. Anyway, from Agner Fog’s instruction tables and uops.info, I gather that
all of movzbl
, rorx
, xor
, and mov
can do 2 per cycle, but that all
instructions that read from memory have latency up to ~5 cycles, which sounds
like it may be a bottleneck. But it’s hard to say.
Note that all the methods above return a [u64; L]
at a time. For now I’m just
collecting those into a Vec<[u64; L]>
. We could then either return an iterator
over those blocks, or return a sequential iterator that traverses the entire
vector L
times. Or we could even transpose the data back to a Vec<u64>
, but
that would be somewhat slower.
It’s fine to return out-of-order data.
5.6 Actual SIMD, at last
The first SIMD version is basically the same as before when using [u64, L]
,
but using Simd<u64, L>
instead, with #![feature(portable_simd)]
enabled.
SIMD version of ntHash
|
|
The main difference is that in SIMD, there is no bit-rotate instruction26, so that instead we have to implement this using two shifts. Let’s have a quick look at the generated code.
|
|
This looks rather terrible! There are many instructions to read the sequence characters and get their hashes, and then a number of instructions to merge everything into SIMD registers. The final rotate and xor operations are optimized, but they make up only a small part of the total runtime like this.
SIMD table lookups
So far, we have been assuming that the text can be anything. But actually,
ntHash already is only ‘seeded’ for DNA characters ACGT
, and so the [u64; 256]
table really only contains 4 interesting entries. So we could just use a
[u64; 4]
.
From now on, we assume that the text is a &[u8]
containing only values in 0..=3
.
That’s exactly the size of a single SIMD register. Indeed, we can store them in a single SIMD register, and use a shuffle instruction to extract them.
Or at least that’s what we would like to do.
Sadly, there is no AVX227
dynamic28 cross-128bit lane29 64-bit30 shuffle available, but
if we give up any one of these constraints we’re good. Thus, we can implement the
64-bit shuffle using the 32-bit shuffle. When we want to look up the 64-bit hash
of a character with value/index c
, now instead we look up the two
32-bit parts independently, which will be at indices 2c
and 2c+1
.
But instead of implementing this, let’s do something simpler31.
32-bit hashes
Everything becomes much simpler if we
use 32-bit hashes directly! There really is no reason we need the full
64-bit hashes to select the minimum hash of a window of size \(w\) that is
rarely more than \(100\).
And since also the 4 values in the
dictionary now fit in 128 bits, so that we do not need cross 128-bit lane
shuffles anymore at all! So we can now exactly use the _mm256_permutevar_ps
instruction, which even has 1 cycle latency. And also, we can now compute 8 hashes at a time instead
of 4!32
Somewhat annoyingly there is no _epi32
integer version of this, so we’ll have
to cast our data to 32-bit floats and back to make the type checker happy.
We assume that 32-bit ntHashes are sufficient.
The lower periodicity causes some collisions for \(k> 32\), but they will still be rare and the number of k-mers in a window \(w\) that we’re minimizing over is small anyway.
The new code looks like this:
|
|
In the assembly, we see 16 copies of the code below, an shift-in and shift-out character for each of the 8 lanes.
|
|
Shared offsets
Instead of storing all offsets on the stack, we can also put them all in a single SIMD register.
|
|
In fact, this turns out worse. Now the indices are computed in SIMD, but then
have to be extracted anyway before any lookups are done.
I was expecting a gather
instruction to be generated here, but maybe the ‘zero
extend’ of reading a u8
into a u32
is preventing that.
5.7 SIMD: The Gathering
Gathering 4 characters at a time
But that gives a new idea! If we’re anyway reading from a memory location, we
may as well directly read of full u32
worth of memory instead of just a u8
.
That could reduce the time spent on reading 4-fold.
Thus, we make a read every fourth iteration, and then buffer the values.
|
|
The assembly looks very interesting now:
|
|
The first very clear observation is that even though we only gather every fourth
iteration, those 4 instructions take up 80% of the time! Clearly, getting the
data from memory into the SIMD registers is the bottleneck here. What is
somewhat surprising is that line 15 and 16 of Code Snippet 59 are each turned into
two gather
instructions. As it turns out, there are two very similar gather
instructions:
_mm256_i64gather_epi32
, which takes a 256-bit SIMD register with four 64-bit indices and returns a 128-bit SIMD register with four 32-bit values._mm256_i32gather_epi32
, which takes a 256-bit SIMD register with eight 32-bit indices and returns a 256-bit SIMD register with eight 32-bit values.
It would be slightly nicer to use the second version, but in the end I suspect it wouldn’t matter much for performance, since the same number of entries is read anyway.
Gathering 8 characters at a time
At this point, it is very clear that the gathering is the bottleneck. Since
it is already split into two instructions anyway, we may as well fetch a u64
at a time. We then have to do some shuffling to those into two SIMD registers of
u32
values, one with the first 32 bits of each u64
, and one with the second 32 bits.
Then after 4 iterations, we can cheaply swap in the remaining 32 bits.
|
|
perf stat
now shows that the total time on gather
instructions has gone down
from 80%
to 67%
. Still quite some room for improvement if we could make the
input faster. The deinterleave
is translated to some vshufps
(latency 1) and
vpermpd
(latency 3) instructions as follows:
|
|
Gathering 32 characters at a time
Clearly, streaming the input is still a limitation. But we have one last trick
available. We already assume that the input takes 2-bit values in 0..=3
, so that of
each byte we read, 6 bits are wasted. Instead, let’s assume that the input is
already bit-packed!
The text is 2-bit encoded and bit-packed.
For now I’ll just keep using the same &[u8]
type for the text, but in the
final API it would be cleaner to create a wrapper type.
The code is mostly similar to before, with the change that we now shift out only
2 bits instead of 8 at the end of every iteration.
One more change is that we can’t simply read data at every 2-bit offset.
Specifically, we can’t necessarily read a u64
at position i
and a u64
at
position i+k
at the same time. Thus, we separate the updates for the start and
end positions of the window. We should then also initialize self.chars_k
with the
leading bits.
|
|
Almost a 2x
improvement!
Originally, I made a bug here where I measured sequences of length \(n/4\) rather
than length \(n\). This has been fixed now, and timings are corrected.
They appear to be around 0.10 ns/kmer
slower than my original benchmarks, but
it is what it is.
The full code for SIMD-based hashing of bitpacked data is here.
|
|
The four gather
instructions only take around 30% of the total time now, and
thus aren’t nearly as much of a bottleneck anymore.
At this point, I thought we were done and not much further improvement would be possible. But I was wrong.
Reusing the gathers
Each character is currently read twice: once when it enters the sliding window, and once when it exits. It would be more efficient to reuse the first read instead of re-reading from memory. When \(k\geq 32\), this is actually quite straightforward: At some point \((i+k) \bmod 32 = 0\), and the next 32 characters/64 bits of data are read at position \(i+k\). Then \(k\) iterations later, \(i\) itself is equal to that same value, and exactly the same data is read. Thus, we can buffer the data read for position \(i+k\) and use it the next time \(i\bmod 32 = 0\). But note that this only works when \(k\leq 32\). For larger \(k\). For larger \(k\), it could still be done but needs a larger buffer.
|
|
In each iteration we now check whether \(i\bmod 32 \equiv 0\) and \(i\bmod 32\equiv
16\). It’s more efficient to first do a single check whether \(i\bmod 16\equiv 0\),
and only then check which of the two cases it is, since that saves a compare and
jump in the main loop. That brings the runtime down to 0.57 ns/kmer
, or
around 1.5 clock cycle per hash.
One drawback of the above is that when \(k\) is indeed small, we now have a
useless if statement and a bunch of unnecessary code for read(i)
is generated.
So let’s put this condition behind a generic:
|
|
At this point, perf stat
shows that there are nearly no branch misses, and
2.43 IPC
, which is good but actually not that close to 4
. Again this makes
me wonder what the actual bottleneck is. Remember 5.5? Where we
processed two chunks in parallel? Maybe we should eventually try that again.
5.8 Fixing the benchmark
At this point, perf record
shows:
|
|
Surely, this other part has something to do with either allocating or dropping the vector to which we collect. Let’s fix that by reusing the vector between iterations.
Thus, we now make all our Hasher
instances statefull to give them the
opportunity to reuse their allocated memory between iterations. This is what
requires e.g. Hasher::hash_kmers
to take a &mut self
: so that it can use and
modify this state.
The new caching hasher looks like this:
|
|
Now, perf
again shows that over 90% of time is spent in the hash_kmers
function, and indeed the runtime improves significantly: 0.39 ns/kmer
, or exactly one cycle per kmer!33
And also, we’re at exactly 3 IPC
now, which at last seems quite reasonable.
We’ve gone from 2.9 ns/kmer
for the nthash
crate to 0.39 ns/kmer
by using
32-bit hashes, a bit-packed input, and by using SIMD to parallellize over chunks
of the input. 7x speedup! Roughly, the first factor 2 is by making nthash more
efficient, while the remaining factor 3.5 is from using SIMD.
In the end, we need one cycle per kmer.
Note that instead of returning a vector, we could also benchmark just the time
to compute all the results, e.g. by summing all the output hashes. That only takes
0.35 ns/kmer
. But for consistency, it is simpler to always require an actual
vector as output.
One last branch
At this point, there is also still one redundant branch due to the unwrap
when
iterating over the vector.
|
|
I tried replacing it by unwrap_unchecked
, but that creates slower code. I
also tried removing the case that returns None
completely, which does remove
branch, but anyway the runtime of the code is the same, most likely since
parsing the instructions isn’t the bottleneck anyway.
5.9 Analysis: Machine code analysis
It would be great to know what the actual bottleneck of the code is now. One way to (try to) find out is by using LLVM’s Machine Code Analyzer (LLVM-mca). First, we must determine the loop to analyse. For simplicity, let’s exclude the input gathering, and only focus on the nthash computation, that covers around 70% of the time:
Assembly code of the hot loop.
|
|
We can use carg-show-asm
to conveniently extract the assembly code.
First, we mark the hash_kmers
function as #[inline(never)]
so that we can
find its function. Then we can get the assembly code of it using:
|
|
This outputs the full function, including setup and cleanup. We manually copy the assembly corresponding to the inner loop to a separate file.
Extracted assembly code
|
|
Then, we can run LLVM-mca on this code: llvm-mca < assembly
, optionally with
some flags such as --all-stats
, --all-view
, --bottleneck-analysis
, and --timeline
. I
highly encourage you to run it yourself and have a closer look at the output.
First, the output contains some high level statistics:
|
|
We see that the it reports the maximum throughput of this block at one iteration
every 5 cycles. Our benchmarks need 1 cycle per kmer, or 8 cycles per iteration
that computes 8 values, which
is indeed just above the 5, and makes sense taking into account that a
relatively large portion of time is spent on the gather
instructions that
occur only once every 32
iterations.
Then, we get a table of latency and throughput values.
Latency and throughput table
|
|
And lastly, we see which ports every instruction can execute on. The top line is a summary of total usage of each port.
Port usage table
|
|
The timeline also looks cool.
Timeline
|
|
The thing that struck me most in all this is that the vmovdqa
instructions
have a quite high latency of 7 cycles. Even though it is probably hidden, it
would be nice to remove this. In fact, we can see that the buffers containing
the characters are read from memory/the stack (608(%rsp)
is an address
relative to the stack pointer), then shifted by 2 bits, and then written back to
the same place. I wonder why this value can’t just be kept in a register
instead. In fact, I have seen a few cases where a register is used directly,
but I can’t make it work reliably.
5.10 Finals thoughts
Now that we’ve reached 0.39 ns/kmer
, and nearly 7.5x speedup over a previous
implementation, I think it’s finally time to call it a day34.
To recap, our speed is:
0.39 ns/kmer
1.01 cycles/kmer
8.08 cycles/8kmers
, i.e., 8 cycles per iteration of the main loop that produces 8 hashes in a SIMD.
And for completeness, here’s the full source code.
Full code for the SIMD NtHash implementation.
|
|
Doubling down again
Currently, the number of instructions per cycle is 3, while it could be over 4. One thing that we could try is to split the input in to 16 chunks and process them in parallel using two SIMD units of 8 lanes. That way, we could fully exploit the parallellism of SIMD in combination with the instruction level parallellism of the CPU. However, my feeling is that the throughput is already bounded by the number of available registers and ports, so that doing everything in parallel wouldn’t help much.
I did a very quick and ugly test for this, and it came out twice as slow. The main reasons seems to be that indeed many variables spill onto the stack, making things much slower.
16-bit hashes?
By going from 64bit to 32bit hashes, the SIMD code became much simpler and faster. One could imagine that going to 16bit hashes gives a similar speedup. Especially since now 80% of time is spent in the inner loop doing ‘plain’ SIMD instructions, having double the parallellism may be good. One drawback though is that there are no 16-bit shuffle/permute instructions to do the corresponding table lookup, so we’d have to:
- Deinterleave the 16-bit characters into two vectors of 32-bit characters.
- Do two 32-bit
permutevar
instructions. - Interleave the results back together.
Or alternatively, we could use 128-bit byte shuffles, but then still we have to
do some extra accounting to turn the character values into appropriate
byte-indices.
Lastly, we’d also have to split the u64
of characters read in each place into
4 chunks of 16 instead of 2 chunks of 32, and then rotate the chunks every 8
instead of 16 characters. But this probably doesn’t cause much of a slowdown.
To summarize, I think some speedup may be possible here, but definitely not the 2x one may naively expect.
What about a simple multiply hash
For small \(k\leq 16\), a different approach is to simply multiply their 32-bit value by a constant and use that as a hash, as FxHash does. Surely, this can also be optimized to be very efficient.
By re-using all the work we already did to efficiently gather the data from the sequence, this is now fairly straightforward:
|
|
Full code for the SIMD FxHash implementation.
|
|
Again faster, with 0.32 ns/kmer
now, or 0.83 cycles per kmer!
One drawback of course is that this really only works for small \(k\). For \(16<k\leq 32\) the same could be done by doing two 32-bit multiplications, but that would probably hurt performance and make it slower than our ntHash implementation.
6 SIMD sliding window
At this point, we can go back to the sliding window minimum problem and use our
new rolling hash for it. Naturally, we would like to also use SIMD here. The
easiest is to adapt the Split
method, since it is completely deterministic.
That ensures that when processing multiple chunks of a sequence, the steps in
each chunk (SIMD lane) are exactly the same.
Below is a relatively straightforward SIMD version of our initial code
Code Snippet 13 (and Code Snippet 31 for the optimized version). It takes as input an
iterator over [u32; 8]
of 8 hashes (of independent chunks), and outputs a
[u32; 8]
for each window of width \(w\) with the absolute position of the
leftmost minimizer. Some details:
- Instead of comparing full 32-bit values, only the high 16-bits of each hash are compared.
- The low 16 bits of each 32-bit word are used to store the position, so that ties are automatically broken towards the minimal position. This means that the position can only go up to \(2^{16}\), so the sequences needs to be chunked into pieces shorter than that, say of length \(2^{15}\).
- I already used
.get_unchecked
where needed. - Made sure that the
.skip(w-1)
at the end doesn’t become anif
statement in the inner loop (or even puts the inner closure behind a function call some cases).
|
|
6.1 Results
We can benchmark this code in a few ways:
- Take a test, compute hashes, and then directly stream them into the sliding window minima.
- Take a text, compute hashes to a (reused) vector, then read the vector to compute sliding window minima.
- Like 2, but only measure the time for the sliding window minima.
As expected, method 2 reports close to 0.4ns
more than method 3, exactly the
time taken to compute hashes.
As output, we test both taking just the sum of all positions, and writing all
positions to a vector.
ns/window | sum | collect |
---|---|---|
streaming hashes | 0.63 | 1.21 |
buffered hashes | 0.78 | 0.81 |
We see that the fully streaming option that doesn’t use any memory is clearly
the fastest. Collecting the final result is faster when the intermediate results
are already in a vector. I’m not quite sure why. Possibly the allocation is
reused, but I don’t think that is actually the case. Another reason could be
that our returned iterators currently do not implement ExactSizeIterator
, and
hence calling Vec::extend(iterator)
may not be as efficient as it could be.
Either way, this seems to be plenty fast, and I don’t see any more easy wins
here. Comparing (hash, pos)
tuples as a single integer seems inevitable for
good performance. Maybe we could go down from (u16, u16)
to (u8, u8)
, but
using 8-bit hashes is unlikely to be good enough when \(w\sim 20\). In that case,
each window has a probability somewhere around order \(256/w \sim 12\%\) of having
a duplicate minimizer, which will start to hurt the density.
One more thing to try could be to again process two SIMD units in parallel, but also again, a quick and dirty test was quite a bit slower.
We can now do sliding window minima in 0.63 ns/kmer
, or 1.6 cycles/kmers
.
That’s 15x faster than RescanDaniel
, and 40x faster than QueueIgor
!
Human genome results
For a human genome, consisting of roughly 3Gbp, we measure a runtime of 2 s
35,
which corresponds to 0.66 ns/kmer
. Thus, the results we see transfer to
larger datasets without problems!
7 Extending into something useful
(This section was added December 2024.)
7.1 Collecting minimizer positions
At this point, we compute for every window the absolute position of its
minimizer. This is returned as a Vec<Simd<u32, 8>>
, where the 8
SIMD lanes
correspond to separate chunks of the input.
In practice, we would prefer a simple Vec<u32>
, without processing 8
lanes
in parallel, which should be a hidden implementation detail.
We transpose our output as follows: first, we collect 8 Simd<u32, 8>
elements into an [Simd<u32, 8>; 8]
, corresponding to the next elements for
each of the 8 lanes. Then, we transpose this 8x8 matrix, to get for each lane
a single SIMD vector with the next 8 elements. Then those 8 SIMDs of upcoming
elements can be efficiently written to the output Vec
in parallel.
The matrix transform code comes from stackoverflow and looks like this.
|
|
The code to collect the transposed data into a single vector is as follows, and
takes 1.3 ns/kmer
, compared to 1.0 ns/kmer
for the version that naively
collects into a Vec<Simd<u32, 8>>
.
|
|
7.2 Deduplicating the minimizer positions
Now that we have a Vec<u32>
of the minimizers positions of all windows, we
would like to deduplicate it, so that each unique position is listed only once.
For this, we adapt this post on Daniel Lemire’s blog, which contains a function
that takes a Simd<u32, 8>
of new
values and a corresponding one of old
values (to decide whether the first element of new
is distinct). It computes
which of the new
elements are distinct from their previous element and appends
those to a vector.
|
|
Then all that’s left is to write a simple wrapper that iterates over a
Vec<u32>
and calls the function above for each 8 elements, which takes 1.6 ns/kmer
, up from 1.3 ns/kmer
for the version without deduplication.
In practice, it is slightly inefficient to first collect the transposed data
into a Vec<u32>
that is only then deduplicated. Instead, we can use a
[Vec<u32>; 8]
with a buffer for each lane, and deduplicate each lane directly
in collect()
, before writing its results.
For small inputs, this does not make much of a difference, but for larger inputs
this version does not use an intermediate buffer, and hence slows down less.
7.3 Super-k-mers
We can now efficiently compute the list of all minimizer positions, but unfortunately that information is not sufficient to exactly recover all super-k-mers, i.e., all minimizers together with the interval of windows for which they are minimal.
We can fix this by not only returning minimizer positions, but also the index of
the first window for which they are minimal.
This turns out quite simple by slightly modifying the write_unique()
function
above. If we assume that the number of kmers in a window \(w\) is below \(2^{16}\),
distinct consecutive minimizer positions always differ in the low 16 bits. Thus,
it is sufficient to do a 16-bit dedup. Then, we can use the high 16 bits to
store (the low 16 bits of) the absolute index of each element.
We add an additional argument to the write_unique
function, so that the mask
of values-to-be-stored is computed with only the minimizer positions, but the
actual stored values are those that also include the absolute positions of the windows.
The returned values will now contain both the minimizer positions, and the first windows where they occurred. Retrieving absolute positions rather than mod-\(2^{16}\) positions is easy by incrementing an offset whenever the index goes down.
This runs in 1.65 ns/kmer
, up from 1.60 ns/kmer
for simply returning
deduplicated minimizer positions without including super-k-mer information.
7.4 Canonical k-mers
In most bioinformatics applications, canonical k-mers are used instead, that are invariant under taking the reverse-complement of the sequence. This requires a few modifications.
While the ideas in this section are generally known, the precise steps on tiebreaking outlined below were mostly suggested to me by Igor Martayan.
NtHash
First, we use the canonical version of NtHash that sums the hash of a kmer and its reverse complement, so that it’s invariant under reverse complement. This requires duplicating most of the inner NtHash code, but is quite straightforward. Since the stream of input characters only has to be read once, this is less than 2x overhead.
This takes 1.83 ns/kmer
, up from 1.60 ns/kmer
for the version that does not
use canonical NtHash.
Leftmost-rightmost sliding min
When the minimum kmer hash occurs twice in a single window, we must make sure that the two strands choose the same occurrence. To do that, we will track both the leftmost and rightmost occurrence of the minimum. (Other options are also possible.)
Most of the code in the sliding window algorithm can simply be duplicated, where
for the rightmost position of the minimum, we instead invert the hash, and take
the max of (!hash, pos)
, alongside the existing minimum of (hash, pos)
.
Tiebreaking
Lastly, we must consistently choose between choosing the leftmost and rightmost
occurrence.
We do this using the idea introduced by Pan and Reinert (2024): we count
the number of AC
characters in a window, and say the window is canonical
when this is more than half the characters. When the length of the window is
odd, it can never equal exactly half, and hence every window is either canonical
or a reverse-complement. For canonical windows we choose the leftmost occurrence
of the minimum, and for reverse-complement windows we choose the rightmost
occurrence instead.
Overall, this runs in 1.91 ns/kmer
, which is less than 2x slower than the 1.0 ns/kmer
we needed for simply collecting all (non-canonical) minimizer positions
for all windows. That’s quite impressive! Especially since we do quite a bit
more work now:
- transposing the output,
- deduplicating the output,
- canonical NtHash,
- left-and-rightmost sliding min, and
- tiebreaking.
The number of characters in a window is odd, so strandedness is well-defined.
Note that we do not require \(k\) itself to be odd! Tiebreaking is an operation that happens for a window, and hence only those need to have odd length.
Further reusing iterated bases
Currently we use the input strings’ bases in four ways:
- NtHash iterates over pairs of the character entering the new kmer, and the character
leaving the new kmer, which reuses the same
gather
(Reusing the gathers). - The tiebreaking iterates over pairs of the character entering the new window (which is the same as the one entering the new kmer), and the character leaving the new window.
We would like to only do a single gather
and use that for all three required positions.
Up to now, NtHash and the sliding minimum had a signature as in Code Snippet 84, which somewhat hinders this: since NtHash takes as input the input sequence directly, and constructs the iterator over characters internally, we can’t use it for the tiebreaking iterator.
|
|
Instead, we change our design to the following: the hash function and sliding minimum simply return a closure with some internal state, and each time the closure is called (with a pair of characters to be added and removed, or the new hash), the internal state is updated and the next hash or next sliding minimum is returned. This turns out much simpler, and also optimized slightly better in some cases, Code Snippet 85.
|
|
For example, the canonical_mapper
looks like Code Snippet 86. Note though
that the prefix of values returned while the first window isn’t completely
‘filled’ yet should be skipped/ignored.
|
|
Using this, we can now extract the base-iteration to a separate function
returning triples of characters (new, last-in-kmer, last-in-window)
, and map
them as in Code Snippet 87.
|
|
To implement the delayed_2
iterator that returns triples of characters, the
new one and two ‘delayed’ streams, we extend the idea from Reusing the gathers:
instead of storing the last read gather
in a temporary variable, we write all
gathered data into a Vec<S>
ring buffer that is long enough to cover the
entire window. The delayed streams then replicate the original logic, but read
from this buffer instead of issuing new gather instructions.
7.5 AntiLex hash
In my practical minimizers post, I introduce the ‘anti-lexicographical’
(AntiLex) kmer order. This is very close to a classical lexicographical sort,
but whereas the classical sort has density worse than random (\(2/(w+1)\)),
AntiLex has better-than-random density. They work by inverting the first
character of the string (as in A <-> Z
or 0 <-> 3
) and then doing a normal comparison.
The benefit of this method is that unlike NtHash, it does not require a table lookup, and thus avoids some of the more complicated SIMD instructions.
I have not yet implemented and tested a reverse-complement variant of this, but
the forward version takes 1.40 ns/kmer
to collect and deduplicate, compared to
1.55 ns/kmer
for forward NtHash.
8 Conclusion
We now have a very optimized SIMD implementation to compute canonical minimizers,
taking under 2 ns/kmer
(or 5 cycles/kmer
).
time (ns/kmer) | |
---|---|
NtHash (sum) | 0.39 |
+ Sliding window min (sum) | 0.63 |
+ Vector output | 1.00 |
+ Collect/transpose | 1.30 |
+ Dedup | 1.60 |
+ Super-k-mer | 1.65 |
+ Canonical | 1.90 |
8.1 Future work
Next up is to find some usage for this library, for example in my Rust implementation of SSHash. Additional extensions are possible of course:
- Implement and test a canonical version of
AntiLex
. - Generalize the matrix-transpose and deduplication SIMD code to ARM and NEON instruction sets.
- In general, implement and test other low-density schemes such as the mod-minimizer.
Also, I think it has not been much investigated how general low-density schemes can be transformed in to a canonical version, and how well they then perform.
References
Read this on a wide screen to see the table of contents on the left, and footnotes on the right. ↩︎
I had a bug in earlier benchmarks that caused my timings to be 4x too fast. ↩︎
Headsup: all code I’m about to show is completely untested currently. ↩︎
On that matter, specific requests are welcome. ↩︎
Some foreshadowing here.. ↩︎
Consider not skipping the code blocks. I made quite some effort to make them clean and readable. Most code is fairly straightforward assuming you’re comfortable reading Rust code. ↩︎
Legend:
Orange: processed state,
Bold outline: minimum of the prefix/suffix,
Bold character: minimum of the window,
Blue: state in memory,
Red: state removed from the queue,
Green: running prefix minimum. ↩︎ ↩︎ ↩︎ ↩︎ ↩︎The
hash_kmers
function here is part of theHasher
trait that can be found in Code Snippet 32. ↩︎In fact, the
Buffered
implementation is slightly more efficient since it stores hashes instead of recomputing them \(w\) times. ↩︎Citation needed, both for it being folklore and for the original idea. ↩︎
I’m not counting the for loop that iterates up to the variable length of the string as a branch miss, because that will only be a single branch-miss at the end, not \(O(1)\) per iteration. ↩︎
It’s the third result. ↩︎
Feel free to skip this paragraph in case you are not familiar with segment and Fenwick trees. ↩︎
Hah, get it?
There’s at least two alternative interpretations ;) ↩︎The ntHash paper doesn’t actually provide a benchmark for this though. But the consensus seems to be that buzhash is faster than classic multiplication based methods. ↩︎
The runtime does go down a bit to
2.3ns/kmer
, but we’ll postpone looking into this more, since full lto is absolutely terrible for compile times, and thin lto is not sufficient. ↩︎Maybe there is a reason for keeping it that I don’t see? Let me know! ↩︎
Again, the compiler disappoints a bit. Maybe there is some tricky reason why it’s not actually redundant? ↩︎
This article has a nice summary table of the register naming. ↩︎
See also this wiki page and this nice article for some more information about special purpose registers ↩︎
Cf. Chekhov’s gun. ↩︎
In fact, Daniel Liu recommended me multiple times that pointer offsets are faster, so I’ll let this be my lesson. ↩︎
\(k\) bits, actually ↩︎
I wouldn’t be surprised if I end up looking more into compilers eventually. ↩︎
At least not with AVX256. AVX512 does have it though. ↩︎
In AVX512 there is
_mm256_permutexvar_epi64
(latency 3). ↩︎There are static cross-128-bit lane shuffles that require a
const
shuffle argument, such as_mm256_permute4x64_epi64
(latency 3). ↩︎There is a within-128-bit lane dynamic shuffle,
_mm256_permutevar_pd
(latency 1) ↩︎There is a 32-bit variant,
_mm256_permutevar8x32_ps
(latency 3) ↩︎And not waste more of the author’s time. ↩︎
You’d almost ask yourself why we didn’t do this from the start. ↩︎
Maybe it’s time to switch to picoseconds (ps)? ↩︎
Week, rather, for this section. ↩︎
Compare that to the 40 seconds it takes to convert the byte-encoded FASTA sequence to a bit-packed representation using some lazy code. Why aren’t we using bit-packed on-disk file formats anyway? ↩︎