In this post, we will develop a fast implementation of random minimizers.
The goals here are:
a library for fast random minimizers,
explaining the final code, and the design decisions leading to it,
walking you through the development process, hopefully teaching some tricks
and methods for how to do this yourself.
Code for this post is in the benches/ directory of
github:RagnarGrootKoerkamp/minimizers.
Most code snippets are directly included from the repository. I have tried to
write clean and readable code that follows the order of this post. Occasionally
some code snippets contain ‘future proofing’ that will only become relevant
later on.
The end result that we will be working towards here is twofold:
An implementation of 32-bit ntHash rolling hash that takes
around 0.4ns/kmer, or around 1 CPU cycle per hash. This is 7x faster than the nthash crate by Luiz Irber.
An implementation of random minimizers that takes around 0.6ns/window, or
1.5CPU cycle per window. This is 15x faster than Daniel Liu’s gist and 40x
faster than the rust-seq/minimizer-iter crate by Igor Martayan2.
It computes all window minimizers of a 2-bit packed human genome (with
non-ACGT stripped) in two seconds.
I plan to upstream these algorithms into the rust-seq/nthash and
rust-seq/minimizer-iter crates soon, after polishing the implementation,
adding tests3, and designing a nice API4.
I welcome any (critical) feedback on both the content and styling of this post!
I plan to fine-tune it a bit more based on this, and will take this into account
when writing similar posts in the future. If you’re feeling adventurous, open an
issue or pull request against the source of this blog.
2 Random minimizers
Many tools in bioinformatics rely on k-mers.
Given a string/text/DNA sequence t: &[u8] of length \(|t|=n\), the k-mers of t are all the
length-\(k\) substrings. There are \(n-k+1\) of them.
In order to speed up processing, and since consecutive k-mers are overlapping
and contain redundancy, it is common to subsample the k-mers. This should be
done in a deterministic way, so that similar but slightly distinct text
still mostly sample the same k-mers. One such way is random minimizers,
introduced in parallel by Schleimer, Wilkerson, and Aiken (2003) and Roberts et al. (2004):
First, a window size \(w\) is chosen, which will guarantee that at least one
k-mer is sampled every \(w\) positions of \(t\), to prevent long gaps between
consecutive k-mers.
Then, every k-mer is hashed using a pseudo-random hash function \(h\), and in
each window of \(w\) consecutive k-mers (of total length \(\ell=w+k-1\)), the leftmost5 one with the smallest
hash is selected.
Consecutive windows (that overlap in \(w-1\) k-mers) often sample the same k-mer,
so that the total number of sampled k-mers is much smaller than \(n-k+1\). In
fact, random minimizers have a density of \(2/(w+1)\) and thus sample
approximately two k-mers per window.
My previous post on minimizers gives more background on density and other
sampling schemes. One such alternative scheme is the mod-minimizer
(Groot Koerkamp and Pibiri 2024), that has low density when \(k\) is large compared to \(w\). That
paper also contains a review of methods and pseudocode for them.
In this post, I will focus on the most widely used random minimizers.
3 Algorithms
3.1 Problem statement
Before going into algorithms to compute random minimizers, let’s first precisely state the
problem itself. In fact, we’ll state three versions of the problem, that each
include slightly more additional information in the output.
Problem A: Only the set of minimizers
The most basic form of computing minimizers is to simply return the list of
minimizer kmers and their positions. This information is sufficient to sketch
the text.
1
2
3
4
pubtraitMinimizer{/// Problem A: The absolute positions of all minimizers in the text.
fnminimizer_positions(&mutself,text: &[u8])-> Vec<usize>;}
Code Snippet 1:
A trait that defines Problem A and returns the set of kmers of the input. As to why this takes a &mut self instead of a &self, see 5.8.
Problem B: The minimizer of each window
When building datastructures on top of a sequence, such as SSHash
(Pibiri 2022), just the set of k-mers is not sufficient. Each window should be
mapped to its minimizer, but there could be multiple minimizers in a single
window. Thus, in this version of the problem, we return the minimizer of each window.
Note that while in Problem A we return one result per minimizer, here we return
one result per window (roughly, per character). This difference can have
implications for the optimal algorithm implementation, since it’s not unlikely
that the main loop of the optimal solution has exactly one iteration per
returned element.
1
2
3
4
5
6
pubtraitMinimizer{.../// Problem B: For each window, the absolute position in the text of its minimizer.
fnwindow_minimizers(&mutself,text: &[u8])-> Vec<usize>;}
Code Snippet 2:
A trait for Problem B that returns the minimizer of each window.
Problem C: Super-k-mers
We can implement SSHash and other indexing datastructures using the output of Problem B,
but in most such applications, it is probably more natural to directly iterate over the super-k-mers: the longest substrings such that all contained windows have the same
minimizer. This can easily be obtained from the output of Problem B by
grouping consecutive windows that share the minimal k-mer.
Note that here the result is again an iterator over minimizers, like in
Problem A rather than windows as in Problem B.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
structSuperKmer{/// The absolute position in the text where this super-k-mer starts.
/// The end of the super-k-mer can be inferred from the start of the next super-k-mer.
start_pos: usize,/// The absolute position in the text of the minimizer of this super-k-mer.
minimizer_pos: usize,}pubtraitMinimizer{.../// Problem C: The super-k-mers of the text.
fnsuper_kmers(&mutself,text: &[u8])-> Vec<SuperKmer>;}
Code Snippet 3:
A trait that defines Problem C and returns the set of super-k-mers.
Which problem to solve
In the remainder, we mostly focus on Problem B, since that is what I initially
started with. However, some methods more naturally lean themselves to solve the
other variants. We will benchmark each method on whichever of the problems it
runs fastest, and postpone optimizing a solution for each specific problem until
later.
Most likely though, solving Problem C will yield the most useful tool for end
users. I plan to revisit this at the end of this post.
Regardless, for convenience we can implement solutions to A and C in terms of
B.6
pubtraitMinimizer{/// Problem A: The absolute positions of all minimizers in the text.
fnminimizer_positions(&mutself,text: &[u8])-> Vec<usize>{letmutminimizers=self.window_minimizers(text);minimizers.dedup();minimizers}/// Problem B: For each window, the absolute position in the text of its minimizer.
fnwindow_minimizers(&mutself,_text: &[u8])-> Vec<usize>{unimplemented!()}/// Problem C: The super-k-mers of the text.
fnsuper_kmers(&mutself,text: &[u8])-> Vec<SuperKmer>{self.window_minimizers(text).into_iter().enumerate().group_by(|(_idx,minimizer_pos)|*minimizer_pos).into_iter().map(|(minimizer_pos,mutgroup)|SuperKmer{start_pos: group.next().expect("groups are non-empty").0,minimizer_pos,}).collect()}}
Code Snippet 4:
Implementing minimizer_positions and super_kmers in terms of window_minimizers.
Canonical k-mers
In actual bioinformatics applications, it is common to consider canonical
k-mers and corresponding canonical minimizers. These are defined such that
the canonical k-mer of a string and its reverse complement are the same.
This will significantly impact the complexity of our code, and hence we also
postpone this for later.
3.2 The naive algorithm
The naive \(O(n \cdot w)\) method simply finds the minimum of each window independently.
pubstructNaiveMinimizer<H=FxHash>{pubw: usize,pubk: usize,pubhasher: H,}impl<H: Hasher>MinimizerforNaiveMinimizer<H>whereH::Out: Ord,{fnwindow_minimizers(&mutself,text: &[u8])-> Vec<usize>{// Iterate over the windows of size l=w+k-1.
text.windows(self.w+self.k-1).enumerate()// For each window, starting at pos j, find the lexicographically smallest k-mer.
.map(|(j,window)|{j+window.windows(self.k).enumerate().min_by_key(|(_idx,kmer)|self.hasher.hash(kmer)).expect("w > 0").0}).collect()}}
Code Snippet 5:
An implementation of the naive algorithm. Note that hashes are recomputed for each window.
Note that we assume that hashing each k-mer takes constant time. This is
reasonable in practice, since most hash functions hash many characters at once.
Performance characteristics
I was planning to show some preliminary performance numbers here, to give some
intuition about the relative performance of this and upcoming methods.
Unfortunately though (and not completely surprising), performance depends
heavily on small details of how the code is written. I could provide a
comparison between the methods as-is, but the numbers would be hard to interpret.
Instead, let’s consider some high level expectations of this algorithm for now,
and postpone the quantitive comparison until later.
Since there are no if statements, the code should be highly predictable. Thus,
we expect:
few branch misses,
high instructions per cycle,
and generally running times linear in \(w\).
3.3 Rephrasing as sliding window minimum
After hashing all k-mers, we basically have a sequence of \(n-k+1\) pseudo-random
integers. We would like to find the position of the leftmost minimum in each
window of \(w\) of those integers.
Thus, the problem of finding random minimizers can be reformulated in terms of the sliding
window minima.
We can model this problem using the following trait:
1
2
3
4
5
6
7
8
9
10
11
12
13
/// A value at an absolute position.
/// When comparing, ties between value are broken in favour of small position.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]pubstructElem<V>{pubval: V,pubpos: usize,}pubtraitSlidingMin<V>{/// Take an iterator over values of type V.
/// Return an iterator over the minima of windows of size w, and their positions.
fnsliding_min(&self,w: usize,it: implIterator<Item=V>)-> implIterator<Item=Elem<V>>;}
Code Snippet 6:
Trait for the sliding window minimum problem.
A first implementation that mirrors the Naive implementation above could simply store the last \(w\) hashes in a RingBuffer,
and in each step recompute the minimum. The boiler plate for the ring buffer is simple:
pubstructRingBuf<V>{w: usize,idx: usize,data: Vec<V>,}impl<V: Clone>RingBuf<V>{#[inline(always)]pubfnnew(w: usize,v: V)-> Self{letdata=vec![v;w];RingBuf{w,idx: 0,data}}#[inline(always)]pubfnidx(&self)-> usize{self.idx}#[inline(always)]pubfnpush(&mutself,v: V){self.data[self.idx]=v;self.idx+=1;ifself.idx==self.w{self.idx=0;}}}/// A RingBuf can be used as a slice.
impl<V>std::ops::DerefforRingBuf<V>{typeTarget=[V];#[inline(always)]fnderef(&self)-> &[V]{&self.data}}
Code Snippet 7:
A simple ring buffer.
And then a sliding window implementation also becomes trivial:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
pubstructBuffered;impl<V: Copy+Max+Ord>SlidingMin<V>forBuffered{fnsliding_min(&self,w: usize,it: implIterator<Item=V>)-> implIterator<Item=Elem<V>>{// A ring buffer that holds the w last elements.
letmutring_buf=RingBuf::new(w,Elem{val: V::MAX,pos: 0});// Iterate over items and their positions.
it.enumerate().map(move|(pos,val)|{ring_buf.push(Elem{val,pos});// Return the minimum element in the buffer.
// Ties between values are broken by the position.
*ring_buf.iter().min().expect("w > 0")})// The first w-1 elements do not complete a window.
.skip(w-1)}}
Code Snippet 8:
The Buffered implementation of SlidingMin that buffers the last \(w\) values and recomputes the minimum on each push.
Using this trait, we can clean up8 the implementation of the Naive minimizer from
before9. In fact, we can now generically implement a minimizer scheme using any
implementation of the sliding window algorithm.
/// A minimizer implementation based on a sliding window minimum algorithm.
/// Also takes a custom hash function, that defaults to FxHash.
pubstructSlidingWindowMinimizer<SlidingMinAlg,H=FxHash>{pubw: usize,pubk: usize,pubalg: SlidingMinAlg,pubhasher: H,}impl<H: Hasher,SlidingMinAlg: SlidingMin<H::Out>>MinimizerforSlidingWindowMinimizer<SlidingMinAlg,H>{fnwindow_minimizers(&mutself,text: &[u8])-> Vec<usize>{self.hasher// Iterate over k-mers and hash them.
.hash_kmers(self.k,text)// (See source for the iterator 'extension trait'.)
.sliding_min(self.w,&self.alg).map(|elem|elem.pos).collect()}}
Code Snippet 9:
A generic minimizer algorithm using any solution to the sliding window minimum problem, and the corresponding Buffered minimizer.
3.4 The queue
Let’s try to improve this somewhat inefficient \(O(n\cdot w)\) algorithm. A first
idea is to simply keep a rolling prefix-minimum that tracks the lowest
value seen so far. For example, when \(w=3\) and the hashes are
\([10,9,8,7,6,…]\), the rolling prefix minimum is exactly also the minimum of
each window. But alas, this won’t work for
increasing sequences: when the hashes are \([1,2,3,4,5,…]\) and the window
shifts from \([1,2,3]\) to \([2,3,4]\), the prefix minimum is \(1\) (at index \(0\)),
but it is not a minimum of the new window anymore, which instead goes up to \(2\).
Nevertheless, as the window slides right, each time we see a new value that’s smaller than
everything in the window, we can basically ‘forget’ about all existing values
and just keep the new smaller value in memory. Otherwise, we can still forget
about all values in the window larger than the value that is shifted in.
This is formalized by using a queue of non-decreasing values in the window.
More precisely, each queue element will be smaller than all values in the window
coming after it.
At each step, the minimum of the window is the value at the front of the queue.
Let’s look at our example again.7
We see that there are two reasons elements can be removed from the queue:
a smaller element is pushed,
the window shifts so far that the leftmost/smallest element falls outside it.
To handle this second case, we don’t just store values in this queue, but also
their position in the original text, so that we can pop them when needed.
pubstructQueue;impl<V: Ord+Copy>SlidingMin<V>forQueue{fnsliding_min(&self,w: usize,it: implIterator<Item=V>)-> implIterator<Item=Elem<V>>{// A double ended queue that is non-decreasing in both pos and val, so that the
// smallest value is always at the front.
letmutq=std::collections::VecDeque::<Elem<V>>::new();// Iterate over the items and their positions.
it.enumerate().map(move|(pos,val)|{// Strictly larger preceding values are removed, so that the queue remains
// non-decreasing.
whileq.back().is_some_and(|back|back.val>=val){q.pop_back();}q.push_back(Elem{pos,val});// Check if the front falls out of the window.
letfront=q.front().expect("We just pushed");ifpos-front.pos>=w{q.pop_front();}*q.front().expect("w > 0")})}}
Code Snippet 10:
An implementaion of the MonotoneQueue and the corresponding QueueMinimizer.
Performance characteristics
Since each element is pushed once and popped once, each call to
push takes amortized constant \(O(1)\) time, so that the total runtime is
\(O(n)\).
We may expect a \(w=11\) times speedup over the buffered method, but most likely
this will not be the case.
This is due to the high number of unpredictable branches,
which hurts pipelining/instruction level parallelism (ILP). Thus, while the
complexity of the Queue algorithm is good, its practical performance may not be optimal.
3.5 Jumping: Away with the queue
While the queue algorithm is nice in theory, in practice it is not ideal due to
the branch-misses.
Thus, we would like to find some middle ground between keeping this queue
and the naive approach of rescanning every window. A first idea is this: Once we find the
minimizer of some window, we can jump to the first window after that position to
find the next minimizer. This
finds all distinct minimizers, although it does not compute the minimizer for
each window individually. Thus, it solves Problem A instead of Problem B.7
Since the expected distance between random minimizers is \((w+1)/2\), we expect to
scan each position just under two times.
Bug
Bryce Kille noticed that this method has a bug: when the sequence of hashes is e.g.
10, 9, 8, 7, 6, 5, 4, 3, 2, 1, the minimizer of the first window is 7. Then
we’d jump to the 6, 5, 4, 3 slice to find 3 as the next minimizer, but that
is wrong, since 6, 5, and 4 are also minimizers. Instead of only taking
the minimum of the new slice, all prefix minima that are less than the previous
minimum are minimizers.
pubstructJumpingMinimizer<H=FxHash>{pubw: usize,pubk: usize,pubhasher: H,}impl<H: Hasher>MinimizerforJumpingMinimizer<H>whereH::Out: Ord,{fnminimizer_positions(&mutself,text: &[u8])-> Vec<usize>{letmutminimizer_positions=Vec::new();// Precompute hashes of all k-mers.
lethashes=self.hasher.hash_kmers(self.k,text).collect::<Vec<_>>();letmutstart=0;whilestart<hashes.len()-self.w{// Position_min returns the position of the leftmost minimal hash.
letmin_pos=start+hashes[start..start+self.w].iter().position_min().expect("w > 0");minimizer_positions.push(min_pos);start=min_pos+1;}// Possibly add one last minimizer.
letstart=hashes.len()-self.w;letmin_pos=start+hashes[start..].iter().position_min().expect("w > 0");ifminimizer_positions.last()!=Some(&min_pos){minimizer_positions.push(min_pos);}minimizer_positions}}
Code Snippet 11:
Solving Problem A by jumping to the window after the previous minimizer. Note that this first computes all hashes in a separate pass.
Performance characteristics
One benefit of this method is that it is completely branchless. Each iteration of
the loop corresponds to exactly one output. The only unpredictable step is which
element of a window is the minimizer, since that determines which is the next
window to scan. Thus, even though it is branchless, consecutive iterations
depend on each other and ILP (instruction level parallelism, aka pipelining) may only have a limited effect.
3.6 Re-scan
I learned of an improved method via Daniel Liu’s gist for winnowing,
but I believe the method is folklore10.
This is a slightly improved variant of the method above. Above, we do a new scan
for each new minimizer, but it turns out this can sometimes be avoided.
Now, we will only re-scan when a previous minimum falls outside the window, and
the minimum of the window goes up.7
Thus, we keep
a ring buffer of the last \(w\) values, and scan it as needed. One point of attention
is that we need the leftmost minimal value, but since the buffer is cyclic,
the leftmost element in the buffer is not necessarily the leftmost one in the
original text. Thus, we don’t only store elements but also their positions,
and break ties by position.
pubstructRescan;impl<V: Ord+Copy+Max>SlidingMin<V>forRescan{fnsliding_min(&self,w: usize,it: implIterator<Item=V>)-> implIterator<Item=Elem<V>>{letmutmin=Elem{val: V::MAX,pos: 0};letmutring_buf=RingBuf::new(w,min);it.enumerate().map(move|(pos,val)|{letelem=Elem{val,pos};ring_buf.push(elem);min=min.min(elem);// If the minimum falls out of the window, rescan to find the new minimum.
ifpos-min.pos==w{min=*ring_buf.iter().min().expect("w > 0");}min}).skip(w-1)}}
Code Snippet 12:Rescan implementation of SlidingMin.
Performance characteristics
This method should have much fewer branch-misses than the queue, since we rescan
a full window of size \(w\) at a time. Nevertheless, these re-scans occur at
somewhat random moments, so some branch-misses around this are still to be expected.
3.7 Split windows
In an ideal world, we would use a fully predictable algorithm, i.e., an
algorithm without any data-dependent branches/if-statements.11
With some Googling12 I found this codeforces blog on
computing sliding window minima (see also this stackoverflow answer and an
implementation in the moving_min_max crate).
In the post, the idea is presented using two
stacks. My explanation below goes in a slightly different direction, but in the
end it comes down to the same.
Aside: Accumulating non-invertible functions.
So, as we saw, most methods keep some form of rolling prefix minimum that is
occasionally ‘reset’, since ‘old’ small elements should not affect the current
window.
This all has to do with the fact that the min operation is not invertible:
knowing only the minimum of a set \(S\) is not sufficient to know the minimum of
\(S - \{x\}\).
This is unlike for example addition: sliding window sums are easy by just adding
the new value to the sum and subtracting the value that falls out of the window.
This distinction is also exactly the difference between segment trees and
Fenwick trees13: a Fenwick tree computes some function
\(f\) on a range \([l, r)\) by removing \(f(a_0, \dots, a_{l-1})\) from \(f(a_0, \dots, a_{r-1})\), by
decomposing both ranges into intervals with power-of-\(2\) lengths corresponding
to the binary representation of their length. But this only works if \(f\) is
invertible. If that is not the case, a segment tree must be used, which
partitions the interval \([l, r)\) itself, without going ‘outside’ it.
The algorithm.
Based on this, we would also like to partition our windows such that we can
compute their minimum as a minimum over the different parts. It turns out that
this is possible, and in a very clean way as well!
First, we conceptually chunk the input text into parts with length \(w\), as
shown in the top row of Figure 5 for \(w=4\). Then, each window of length \(w\)
either
exactly matches such a chunk, or overlaps with two consecutive chunks. In the latter
case, it covers a suffix of chunk \(i\) and a prefix of chunk \(i+1\). Thus, to
compute the minimum for all windows, we need to compute the prefix and suffix
minima of all chunks. We could do that as a separate pass over the data in a
pre-processing step, but it can also be done on the fly:
Each time a new kmer is processed, its hash is written to a size-\(w\) buffer.
(The orange values in the top-right of Figure 5.)
After processing the window corresponding to a chunk, the buffer contains
exactly the hashes of the chunk. (Fourth row.)
Then, we compute the suffix-minima of the chunk by scanning the buffer
backwards. (Fifth row.)
For the next \(w\) windows, we intialise a new rolling prefix minimum in the
new chunk (green outline).
The minimum of each window is the minimum between the rolling prefix minimum
in chunk \(i+1\) (green outline), and the suffix minimum in chunk \(i\) that we computed in the
buffer (blue outline).7
pubstructSplit;impl<V: Ord+Copy+Max>SlidingMin<V>forSplit{fnsliding_min(&self,w: usize,it: implIterator<Item=V>)-> implIterator<Item=Elem<V>>{letmutprefix_min=Elem{val: V::MAX,pos: 0};letmutring_buf=RingBuf::new(w,prefix_min);it.enumerate().map(move|(pos,val)|{letelem=Elem{val,pos};ring_buf.push(elem);prefix_min=prefix_min.min(elem);// After a chunk has been filled, compute suffix minima.
ifring_buf.idx()==0{letmutsuffix_min=ring_buf[w-1];foreleminring_buf[0..w-1].iter_mut().rev(){suffix_min=suffix_min.min(*elem);*elem=suffix_min;}prefix_min=Elem{val: V::MAX,pos: 0};}prefix_min.min(ring_buf[ring_buf.idx()])}).skip(w-1)}}
Code Snippet 13:
Code for the Split algorithm, that computes the suffix minima for each chunk exactly every \(w\) iterations.
Performance characteristics
The big benefit of this algorithm is that its execution is completely deterministic, similar
to the jump algorithm: every \(w\) iterations we compute the suffix-minima, and
then we continue again with the rolling prefix minimum. Since this ‘side loop’
is taken exactly once in every \(w\) iterations, the corresponding branch is easy
to predict and does not cause branch-misses. Thus, we expect nearly \(0\%\) of
branch-misses, and up to \(4\) instructions per cycle.
4 Analysing what we have so far
4.1 Counting comparisons
Before we look at runtime performance, let’s have a more theoretical look at the
number of comparisons each method makes.
We can do this by replacing the u64 hash type by a wrapper type that increments a
counter each time .cmp() or .partial_cmp() is called on it. I exclude
comparisons with the MAX sentinel value since those could possibly be
optimized away anyway. The code is not very interesting, but you can find it here.
Code Snippet 14:
The number of comparisons per character on a string of length \(10^8\) for \(w=10\) and \(w=20\).
Observations:
Buffered Does \(w-1\) comparisons for each window to find the max of \(w\) elements.
Queue and Rescan both do exactly \(2-2/(w+1)\) comparisons per element.
For Queue, each pop_back is preceded by a comparison, and each push has
one additional comparison with the last element in the queu that is smaller
than the new element. Only when the new element is minimal (probability
\(1/(w+1)\)), the queue becomes empty and that comparison doesn’t happen. Also
when the smallest element is popped from the front (probability \(1/(w+1)\)), that means it won’t be
popped from the back and hence also saves a comparison.
Unsurprisingly, the Jumping algorithm uses the fewest comparisons, since it also
returns strictly less information than the other methods.
The Split method, which seems very promising because of its
data-independence, actually uses around \(50\%\) more comparisons (\(3-3/w\), to be
precise), because for each window, a minimum must be taken between the suffix
and prefix.
Open problem: Theoretical lower bounds
It would be cool to have a formal analysis showing that we cannot do
better than \(2-2/(w+1)\) expected comparisons per element to compute the random
minimizer of all windows, and similarly it would be nice to have a lower bound
on the minimal number of comparisons needed to only compute the positions of the minimizers.
4.2 Setting up benchmarking
Adding criterion
Before we start our implementation, let’s set up a benchmark so we can easily
compare them. We will use criterion, together with the cargo-criterion helper.
First we add criterion as a dev-dependency to Cargo.toml.
1
2
3
4
5
6
7
8
9
10
[dev-dependencies]criterion="*"# Do not build the library as a benchmarking target.[lib]bench=false[[bench]]name="bench"harness=false
Code Snippet 15:Cargo.toml changes to set up criterion.
optimized,ext_nthash,buffered,local_nthash,simd_minimizer,human_genome);criterion_main!(group);fninitial_runtime_comparison(c: &mutCriterion){// Create a random string of length 1Mbp.
lettext=&(0..1000000).map(|_|b"ACGT"[rand::random::<u8>()asusize%4]).collect::<Vec<_>>();lethasher=FxHash;letw=11;letk=21;#[rustfmt::skip]letminimizers: &mut[(&str,&mutdynMinimizer,bool)]=&mut[("naive",&mutNaiveMinimizer{w,k,hasher},true),("buffered",&mutSlidingWindowMinimizer{w,k,alg: Buffered,hasher},true),("queue",&mutSlidingWindowMinimizer{w,k,alg: Queue,hasher},true),("jumping",&mutJumpingMinimizer{w,k,hasher},false),("rescan",&mutSlidingWindowMinimizer{w,k,alg: Rescan,hasher},true),("split",&mutSlidingWindowMinimizer{w,k,alg: Split,hasher},true),("queue_igor",&mutQueueIgor{w,k},false),("rescan_daniel",&mutRescanDaniel{w,k},true),];letmutg=c.benchmark_group("g");for(name,m,window_minimizers)inminimizers{g.bench_function(*name,move|b|{if*window_minimizers{b.iter(||m.window_minimizers(text));}else{
Code Snippet 16:
A basic criterion benchmark for the comparison so far.
Making criterion fast
By default, the benchmarks are quite slow. That’s for a couple of reasons:
Before each benchmark, a 3 second warmup is done.
Each benchmark is 5 seconds.
After each benchmark, some plots are generated.
At the end, some HTML reports are generated.
I’m impatient, and all this waiting really impacts my iteration time, so let’s
make it faster:
Reduce warmup time to 0.5s.
Reduce benchmark time to 2s, and only take 10 samples.
Disable plots and html generation.
The first two are done from code:
1
2
3
4
5
6
7
8
9
criterion_group!(name=group;config=Criterion::default()// Make sure that benchmarks are fast.
.warm_up_time(Duration::from_millis(500)).measurement_time(Duration::from_millis(2000)).sample_size(10);targets=initial_runtime_comparison,count_comparisons_bench);
Code Snippet 17:
Initializing criterion with reduced warmup and measurement time.
The last is best done by adding the --plotting-backend disabled flag. For
convenience, we add this rule to the justfile so we can just do just bench. I’m also adding quiet to hide the comparison between runs to simplify presentation.
Code Snippet 18:
Adding the bench rule to the justfile.
When we now do just bench, we see this (TODO update):
A note on CPU frequency
Most consumer CPUs support turboboost to increase the clock frequency for short
periods of time. That’s nice, but not good for stable measurements. Thus, I
always pin the frequency of my i7-10750H to the default 2.6GHz:
g/naive time: [64.399 ms 64.401 ms 64.404 ms]
g/buffered time: [55.154 ms 55.160 ms 55.173 ms]
g/queue time: [22.305 ms 22.322 ms 22.335 ms]
g/jumping time: [7.0913 ms 7.0969 ms 7.1042 ms]
g/rescan time: [14.067 ms 14.071 ms 14.078 ms]
g/split time: [14.021 ms 14.023 ms 14.027 ms]
g/queue_igor time: [26.680 ms 26.689 ms 26.700 ms]
g/rescan_daniel time: [9.0335 ms 9.0340 ms 9.0349 ms]
Code Snippet 20:
Runtime comparison between our and other's implementations.
Some initial observations:
Measurements between samples are very stable, varying less than \(1\%\).
Each method takes 7-55ms to process 1 million characters. That’s 7-55ns per
character, or 7s to 55s for 1Gbp.
Our queue implementation performs similar to the one in minimizer-iter.
Daniel’s implementation of Rescan is around twice as fast as ours! Time to
start looking deeper into our code.
Looking back at our expectations:
As expected, Buffered is the slowest method by far, but even for \(w=11\), it
is only \(2\times\) slower than the next-slowest method.
Split is about as fast as Rescan, and even a bit slower. We were expecting
a big speedup here, so something is weird.
RescanDaniel is even much faster than our Split, so clearly there must be
a lot of headroom.
4.4 Deeper inspection using perf stat
Let’s look some more at performance statistics using perf stat. We can reuse
the benchmarks for this, and call criterion with the --profile-time 2 flag,
which simple runs a benchmark for \(2\) seconds.
1
2
3
stat test='':
cargo build --profile bench --benches
perf stat -d cargo criterion -- --profile-time 2 {{test}}
Code Snippet 21:just stat will run a benchmark for 2 seconds under perf stat.
The metrics that are relevant to us now are summarized in the following table.
Table 1:perf stat results for all methods so far.
metric
Runtime
instructions/cycle
branches/sec
branch misses
ns/char
GHz
M/sec
%
Naive
64
4.13
1743
0.11%
Buffered
55
2.17
987
4.66%
Queue
22
2.29
810
4.29%
QueueIgor
27
2.33
662
4.29%
Jumping
7.1
3.89
1459
0.13%
Rescan
14
3.02
1116
2.26%
RescanDaniel
9.0
3.21
1410
0.85%
Split
14
3.37
1116
1.55%
Observe:
Naive has an instructions per cycle (IPC) of over 4! I have never actually seen IPC that high.
It’s not measurement noise, so most likely it comes from the fact that some
instructions such as compare followed by conditional jump are fused.
Naive also has pretty much no branch misses.
Buffered only has half the IPC of Naive but is still slightly faster since
it only hashes each kmer once. For some reason, there are a lot of branch
misses, which we’ll have to investigate further.
The two Queue variants also have
relatively few instructions per cycle, and
also relatively a lot of branch misses. Here, the branch-misses are expected,
since the queue pushing and popping are inherently unpredictable.
In fact, I argued that the last element of the Queue is popped with
probability \(50\%\). But even then, the branch predictor is able to get down to
\(4\%\) of branch misses. Most likely there are correlations that make these
events somewhat predictable, and of course, there are also other easier
branches that count towards the total.
Jumping has an very good IPC of nearly \(4\), as we somewhat hoped for, and
correspondingly has almost no branch misses.
The remaining methods all have a decent IPC just above \(3\).
RescanDaniel has much fewer branch misses than Rescan. We should investigate this.
We argued that Split is completely predictable, but we see a good number of
branch misses anyway. Another thing to investigate.
4.5 A first optimization pass
It’s finally time to do some actual optimizations!
We’ll go over each of the algorithms we saw and optimize them.
This step is ‘just’ so we can make a fair benchmark between them, and in the end
we will only optimize one of them further, so I will try to keep it short.
The final optimized version of each algorithm can be found in the repository,
but intermediate diffs are only shown here.
Optimizing Buffered: reducing branch misses
Let’s have a look at perf record:
1
2
3
4
5
6
1.16│cmp%r8,%rbp; compare two values
0.92│setne%r11b0.70│┌──jb976; conditionally jump down
*6.15*││mov%r11b,%r13b; overwrite the minimum (I think)
0.02││mov%r13d,%r12d; overwrite the position (I think)
*5.44*│976:└─→mov0x8(%rdx),%r10; continue
Code Snippet 22:just perf buffered shows very inefficient and branchy code. Lines marked with a * have over 5% of time spent in them. This pattern occurs multiple times due to loop unrolling.
Most of these branches come from the fact that computing the minimum of two
Elem { val: usize, pos: usize } is very inefficient, since two comparisons are
needed, and then the result is conditionally overwritten.
The only reason we are storing an Elem that includes the position is that we
must return the leftmost minimum. But since e.g. position_min() returns the
leftmost minimum anyway, we can try to avoid storing the position and always
iterate over values from left to right. We’ll add a method to our RingBuf for this:
Code Snippet 23:
Split the RingBuf into two slices, such that the element of the first returned slice come before the second part, and compute their minimum.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
pubstructBufferedOpt;impl<V: Copy+Max+Ord>SlidingMin<V>forBufferedOpt{fnsliding_min(&self,w: usize,it: implIterator<Item=V>)-> implIterator<Item=Elem<V>>{// Note: this only stores V now, not Elem<V>.
letmutring_buf=RingBuf::new(w,V::MAX);it.enumerate().map(move|(pos,val)|{ring_buf.push(val);letmutmin=ring_buf.forward_min();min.pos+=pos-w+1;min}).skip(w-1)}}
Code Snippet 24:BufferedOpt that only stores values and finds the minimum from left to right. Runs in 26ms, down from 55ms.
The hot loop of the code looks much, much better now, using cmov instead of
branches to overwrite the minimum.
Code Snippet 25:
New hot-loop, using cmov, and very few branches.
Also the perf stat is great now: 4.23 IPC and 0.13% branch misses, so
we’ll leave it at this.
Takeaway
Compute minima from left to right to avoid comparing positions.
Prefer cmov over branches to compute minima.
Queue is hopelessly branchy
We can confirm that indeed a lot of time is spent on branch-misses around
the popping of the queue.
Assembly code showing a badly predicted branch.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
1.20│3a0:lea(%rdx,%rax,1),%rdi0.63│↑cmp%rcx,%rdi1.34││mov$0x0,%edi0.97││cmovae%rcx,%rdi0.46││shl$0x4,%rdi0.49││mov%rsi,%r81.58││sub%rdi,%r81.52││┌──cmp%rbp,(%r8); compare the new element with the back
1.97││├──jb3d2; if back is smaller, break loop
*9.85*│││dec%rax; pop: decrease capacity
3.70│││mov%rax,0x28(%rsp)0.41│││add$0xfffffffffffffff0,%rsi0.08│││test%rax,%rax; queue not yet empty
0.37│└──│─jne3a0; try popping another element
0.38││xor%eax,%eax*9.64*│3d2:└─→cmp%rcx,%rax
Code Snippet 26:
Each of the two directions after the branch takes 10% of time.
Sadly, I don’t really see much room for improvement here.
Takeaway
Data-dependent branches are terrible.
Jumping is already very efficient
If we do a perf report here, we see that only 25% of the time is spent in
the minimizer_positions function:
1
2
63.69% <map::Map<I,F> as Iterator>::fold
24.25% <JumpingMinimizer<H> as Minimizer>::minimizer_positions
Code Snippet 27:just perf jumping shows that most time is spent computing hashes.
The hot loop for finding the minimum of a window is very compact again.
Code Snippet 28:
Hot loop of Jumping that finds the minimum of a window, with plenty cmov instructions and no branches.
More problematic is the 63% of time spent on hashing kmers, but we will
postpone introducing a faster hash function a bit longer.
Optimizing Rescan
The original Rescan implementation suffers from the same issue as the
Buffered and compares positions, so let’s reuse the RingBuf::forward_min()
function, and compare elements only by value.
pubstructRescanOpt;impl<V: Ord+Copy+Max>SlidingMin<V>forRescanOpt{fnsliding_min(&self,w: usize,it: implIterator<Item=V>)-> implIterator<Item=Elem<V>>{letmutmin=Elem{val: V::MAX,pos: 0};// Store V instead of Elem.
letmutring_buf=RingBuf::new(w,V::MAX);it.enumerate().map(move|(pos,val)|{ring_buf.push(val);ifval<min.val{min=Elem{val,pos};}// If the minimum falls out of the window, rescan to find the new minimum.
ifpos-min.pos==w{min=ring_buf.forward_min();min.pos+=pos-w+1;}min}).skip(w-1)}}
Code Snippet 29:
Optimized RescanOpt that runs in 10.2ms instead of 14.1ns.
This has 3.55 IPC up from 3.02, and 0.91% branch misses, down from
2.26%, which results in the runtime going down from 14.1ms to 10.2ms.
The hottest part of the remaining code is now the branch-miss when if pos - min.pos == w is true, taking 6% of time, since this is an unpredictable
branch. This seems rather unavoidable. Otherwise most time is spent hashing kmers.
Optimizing Split
Also here we start by avoiding position comparisons.
This time it is slightly more tricky though: the RingBufdoes have to store
positions, since after taking suffix-minima, the value at each position does not
have to correspond anymore to its original position.
The tricky part here is to ensure that the compiler generates a branchless
minimum, since usually it creates a branch to avoid overwriting two values with
their own value:
pubstructSplitOpt;impl<V: Ord+Copy+Max>SlidingMin<V>forSplitOpt{fnsliding_min(&self,w: usize,it: implIterator<Item=V>)-> implIterator<Item=Elem<V>>{letmutprefix_min=Elem{val: V::MAX,pos: 0};letmutring_buf=RingBuf::new(w,prefix_min);it.enumerate().map(move|(pos,val)|{letelem=Elem{val,pos};ring_buf.push(elem);ifval<prefix_min.val{prefix_min=elem;}// After a chunk has been filled, compute suffix minima.
ifring_buf.idx()==0{foriin(0..w-1).rev(){ring_buf[i]=ifring_buf[i+1].val<=ring_buf[i].val{ring_buf[i+1]}else{ring_buf[i]};}prefix_min=Elem{val: V::MAX,pos: 0};}letsuffix_min=ring_buf[ring_buf.idx()];ifsuffix_min.val<=prefix_min.val{suffix_min}else{prefix_min}}).skip(w-1)}}
Code Snippet 31:
Optimized SplitOpt that runs in 11.2ms instead of 14.0ns.
Now, it runs in 11.2ms instead of 14.0ms, and has 3.46 IPC (up from 3.37 IPC) and 0.63% branch misses, down from 1.55%.
The remaining branch misses seem to be related to the kmer hashing, although
this is somewhat surprising as Jumping doesn’t have them.
Takeaway
Data-independent branches are nearly free.
4.6 A new performance comparison
Now that we have our optimized versions, let’s summarize the new results.
Table 2:perf stat results for all methods so far, for w=11.
metric
Runtime
instructions/cycle
branches/sec
branch misses
ns/char
GHz
M/sec
%
Naive
64
4.13
1743
0.11%
Buffered
55
2.17
987
4.66%
BufferedOpt
26
4.19
1499
0.13%
Queue
22
2.29
810
4.29%
QueueIgor
27
2.33
662
4.29%
Jumping
7.1
3.89
1459
0.13%
Rescan
14
3.02
1116
2.26%
RescanOpt
10
3.58
1397
0.90%
RescanDaniel
9.0
3.21
1410
0.85%
Split
14
3.37
1116
1.55%
SplitOpt
11
3.47
1410
0.63%
Note how BufferedOpt is now almost as fast as Queue, even though it’s
\(O(nw)\) instead of \(O(n)\). Unpredictable branches really are terrible. Also,
RescanOpt is now nearly as fast as
RescanDaniel, although Jumping is still quite a bit faster.
For reasons I do not understand SplitOpt is still not faster than RescanOpt,
but we’ll have to leave it at this.
As we saw, Jumping spends two thirds of its time on hashing the kmers, and in
fact, SplitOpt and RescanOpt spend around half of their time on hashing.
Thus, it is finally time to investigate some different hash functions.
5.1 FxHash
So far, we have been using FxHash, a very simple hash function originally used by FireFox and also used
in the Rust compiler. For each u64 chunk of 8 bytes, it bit-rotates the hash
so far, xor’s it with the new words, and multiplies by a fixed constant. For
inputs less than 8 characters, it’s only a single multiplication.
Even though this method iterates over the input length, all k-mers have the same
length, so that the branches involved are usually very predictable.
Nevertheless, hashing each k-mer involves quite some instructions, and
especially a lot of code is generated to to handle all lengths modulo \(8\) efficiently.
WyHash
WyHash is currently the fastest cryptographically secure hash function in the
SMHasher benchmark, but using it instead of FxHash slows down
the benchmarks by roughly 2ns per character.
The reason is that FxHash itself is not secure, and hence can be simpler.
5.2 NtHash: a rolling hash
We could avoid this for loop for the hashing of each k-mer by using a rolling
hash that only needs to update the hash based on the character being added and
removed. The classsic rolling hash is the Karp-Rabin hash, where the hash of a kmer \(x\)
is
\[
h(x) = \sum_{i=0}^{k-1} x_i \cdot p^i \pmod m,
\]
for some modulus \(m\) (e.g. \(2^{64}\)) and coprime integer \(p\). This can be computed in an incremental
way by multiplying the previous hash by \(p\) and adding/subtracting terms for the
character being added and removed.
NtHash (Mohamadi et al. 2016) is a more efficient variant15 of this based on ‘buzhash’,
using multiplication and addition \((\times, +)\) uses bitwise rotation and xor \((\text{rot}, \oplus)\):
\[
h(x) = \bigoplus_{i=0}^{k-1} \text{rot}^i(f(x_i)),
\]
where \(f\) is a static table lookup that maps each character to some fixed random
value. (See also this post on ntHash).
The main benefit of this method over Karp-Rabin hashing is that typically table
lookups and bit rotate instructions are much faster than multiplications.
One limitation of NtHash is that it is periodic: when \(k> 64\), characters
that are exactly 64 positions apart will have the same rotation applies to their
\(f(x_i)\). This means that the hashes of A....A and B....B are be the same,
and similarly the hashes of A....B and B....A are the same. NtHash2
(Kazemi et al. 2022) fixes this by using multiple rotation periods, but we will stick
with the original NtHash. In practice, minimizers with \(k> 64\) are rarely
used anyway, and even if so, then the probability of hash collisions between the
\(w\) k-mers in a window should be small enough.
The nthash crate
Using the nthashcrate, the implementation of the Hasher trait that we’ve
implicitly been using so far is simple:
Code Snippet 32:
Wrapping the nthash crate with the Hasher trait.
Unfortunately, the results are mixed:
Table 3:
Runtime in ns/kmer of methods with FxHash and ntHash.
alg \ hash:
FxHash
ntHash
BufferedOpt
26
30
Queue
22
24
Jumping
7.1
5.0
RescanOpt
10
14
SplitOpt
11
16
All methods become 2-4ns/ker slower, apart from Jumping, which is actually
significantly faster. Remember that for Jumping (Code Snippet 11) we first compute all hashes
and collect them into a vector, while all other methods compute hashes on the
fly. Thus, for some reason ExtNtHasher must be faster than FxHash when
simply streaming through all kmers, while it’s slower when rolling the hash at
more irregular intervals.
Buffered hash values
Based on the difference above, let’s investigate whether first collecting all
hashes to a vector speeds up the other methods as well. We provide a simple
wrapper type for this:
Code Snippet 35:
Assembly code for collecting all ntHashes into a vector. We see plenty of branches.
Even though we know from perf stat that there are basically no branch misses,
the assembly does contain a lot of branches, most of which correspond to
assertions/bound checks in the original code. To fix these, will copy the source
of the nthash crate and modify it locally.
To start, ExtNtHash takes in 2.9ns/kmer to collect kmer hashes to a vector.
Drop sanity checks
First, we see that for each character it is checked whether it is ACGT. It
would be faster to do that up-front if needed, so let’s remove the check.
1
2
3
4
5
6
7
fn h(c: u8) -> u64 {
let val = H_LOOKUP[c as usize];
- if val == 1 {
- panic!("Non-ACGTN nucleotide encountered! {}", c as char)
- }
val
}
Code Snippet 36:
Skip checking that characters are ACGT. Runtime: 2.9ns/kmer ↦ 2.6ns/kmer.
Drop bound checks
Next, we sprinkle in some get_unchecked, to avoid unnecessary bound checks.
Both checks are provably impossible to fail, yet the compiler isn’t able to
elide them. I suspect it has something to do with the
NtHashIt::new() function not being inlined, and hence the
next() function not being aware of invariants checked during initialization. I
also tried adding some additional assertions and using unchecked_add to
indicate that overflow is undefined behaviour, but even then one of the branches remained.
Even with full link time optimization enabled (lto = true in Cargo.toml),
the branches are still there16.
1
2
3
4
5
6
7
8
9
if self.current_idx != 0 {
let i = self.current_idx - 1;
- let seqi = self.seq[i];
- let seqk = self.seq[i + self.k];
+ let seqi = unsafe { *self.seq.get_unchecked(i) };
+ let seqk = unsafe { *self.seq.get_unchecked(i + self.k) };
...
}
Not all provably redundant bound checks are always elided.
Efficiently collecting to a vector
At this point there are three branches left:
Checking whether the end of the sequence was reached.
The self.current_idx != 0 check that makes sure we don’t remove a character
in the first iteration.
A check that the vector we are accumulating into still has space.
Both the second and third should be redundant: we can handle the first
iteration separately, and pre-allocate a vector that is sufficiently large.
Let’s fix the latter one first. Ideally, the compiler would check that the
size_hint provided by the NtHashIt is large enough and that the
initial Vec::reserve() has sufficient capacity that an overflow can not
happen, but alas, that’s not the case (also not with full lto).
Generally, it seems that we can not get rid of the possible resize check while
using .collect::<Vec<_>>(). Thus, let’s roll our own.
First, we can manually reserve the vector capacity and then fill it:
0.04│330:┌─→mov0x10(%rsp),%rax19.91││mov%r13,0x8(%rax,%rbx,8)0.01││add$0x2,%rbx0.11││mov%rbx,0x18(%rsp)││mov%r12,%rax19.65││mov%rbp,%rbx││add%rbp,%rax││↓je38f; unwrap failed?
0.06│34e:│lea0x1(%rbx),%rbp0.01││movzbl(%r15,%rbx,1),%eax18.40││rorx$0x3f,%r13,%rdx0.06││mov(%r14,%rax,8),%r130.16││mov0x20(%rsp),%rcx20.75││rol%cl,%r130.13││mov0x60(%rsp),%rax0.36││movzbl(%rax,%rbx,1),%eax19.85││xor%rdx,%r130.41││xor(%r14,%rax,8),%r13│├──cmp0x8(%rsp),%rbp0.06│└──jne330; vector capacity was reached?
│lea0x8(%rsp),%rdi│vzeroupper│→callalloc::raw_vec::RawVec<T,A>::grow_one; grow the vec
│↑jmp330│38f:mov0x8(%rsp),%r12
Code Snippet 39:
The assembly still shows a call to grow_one, which should be impossible to reach.
Well this doesn’t work. I really trusted the compiler to optimize this away, but
it doesn’t17. Guess I’ll have to adjust my trust level accordingly.
Anyway, I suppose we’ll just make a default vector with the right
size and then fill it, so we can completely drop the push call.
1
2
3
4
5
6
7
8
9
10
11
12
13
fn hash_kmers(&mut self, k: usize, t: &[u8]) -> impl Iterator<Item = Self::Out> {
let len = t.len() - k + 1;
- let mut v = vec![H::Out::default(); len];
+ let mut v = Vec::with_capacity(len);
let mut it = self.hasher.hash_kmers(k, t);
- for x in v.iter_mut() {
- *x = it.next().unwrap();
- }
+ for _ in 0..len {
+ v.push(it.next().unwrap());
+ }
v.into_iter()
}
Code Snippet 40:
Using a fixed-size vector drops the grow check! Runtime: 1.94ns/kmer ↦ 1.71ns/kmer.
The current assembly looks like this.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
300:┌─→mov%r11,%rcx; read max_iterations
│add%r8,%rcx; check if last iteration
│↓je362; unwrap fails?
│movzbl(%r12,%r8,1),%ecx; get seq[i]
│mov(%r9,%rcx,8),%r10; lookup hash of seq[i]
│mov%r13d,%ecx; read 32-bit k into %ecx
│rol%cl,%r10; rotate hash(seq[i]) by %cl = low 8 bits of %ecx
│rorx$0x3f,%rsi,%rsi; rotate rolling hash %rsi
│xor%r10,%rsi; xor with hash of seq[i]
│movzbl(%rax,%r8,1),%ecx; get seq[i+k]
│xor(%r9,%rcx,8),%rsi; xor with hash of seq[i+k]
│mov%rsi,0x8(%r14,%r8,8); write result to vec
│inc%r8; increase i
├──add$0xfffffffffffffff8,%rdx; decrease loop counter
└──jne300; loop
Code Snippet 41:
Current assembly code for collecting ntHashes into a vector. Spend some time to understand the comments here.
Takeaway
Avoid push to fill know-size arrays. (But maybe this doesn’t apply in general.)
Aside: x64 intricacies. This looks pretty great! The first three lines are somewhat redundant though,
since we know the unwrap can never fail18. And also on line 6, there is a 32-bit mov from %r13d to %ecx of
which the low 8 bits are used on line 7 as %cl19. But then why not directly use the low 8
bits of %r13, %r13b? Maybe the compiler insists on using the counter
register c for this? After some more reading and discussing with Tommasso Fontana, that does indeed seem to be the
case: this page lists the opcodes for rol instructions, and the only supported
variants are rotating by 1, by cl, and by an immediate value.20.
Aside: native compilation. I’m using -march=native. Without that, it generates a rol $1,%rsi
instead of a rorx $0x3f,%rsi,%rsi, and the runtime is 1.83ns/kmer instead of
1.71ns/kmer. rorx is probably faster because the x suffix means it doesn’t
set overflow/carry flags, and hence has fewer dependencies with other
instructions. But rorx is only available with BMI2, so this requires the
native compilation.
Takeaway
Use -march=native.
Back to branches. Either way, let’s get rid of the last remaining branch:
1
2
3
4
for x in v.iter_mut() {
- *x = it.next().unwrap();
+ unsafe { *x = it.next().unwrap_unchecked() };
}
Code Snippet 42:
Using unwrap_unchecked to avoid a branch. Runtime: 1.71ns/kmer ↦ 2.20ns/kmer.
Somehow this is worse. It seems two new branches came back when removing this
one!21
Code Snippet 43:
There is now a branch again to check if the index is 0.
I’ve also tried adding an assert_eq!(v.len(), len), but that doesn’t help
either.22
1
2
+let mut v = vec![H::Out::default(); len];
+assert_eq!(v.len(), len);
Code Snippet 44:
Asserting on the length of the vector doesn't change the runtime.
Addressing mode.
Another thing we can do is improve the vector addressing mode. Instead of
writing to vector_offset + index * 8, it’s sometimes faster to directly write
to a *ptr address. Let’s try something:
1
2
3
4
5
6
7
8
9
let mut v = vec![H::Out::default(); len];
assert_eq!(v.len(), len);
let mut it = self.hasher.hash_kmers(k, t);
-for x in v.iter_mut() {
- *x = it.next().unwrap();
-}
+for i in 0..len {
+ unsafe { v.as_mut_ptr().add(i).write(it.next().unwrap()) };
+}
Code Snippet 45:
Using a pointer offset instead of iterating the vector again makes things worse: 1.71ns/kmer ↦ 1.82ns/kmer.
I’m not quite sure why, but this causes the assembly to be one instruction
longer and hence slightly slower.
Confusion, again! Ok let’s remove the assertion again.
1
2
3
4
5
6
let mut v = vec![H::Out::default(); len];
-assert_eq!(v.len(), len);
let mut it = self.hasher.hash_kmers(k, t);
for i in 0..len {
unsafe { v.as_mut_ptr().add(i).write(it.next().unwrap()) };
}
Code Snippet 46:
Removing the assertion gets us where we wanted to be all this time: 1.71ns/kmer ↦ 1.59ns/kmer.
As we saw before, the rotation by \(k\) positions takes two instructions: one to
move \(k\) into %cl, and one for the rotation itself. But all rotations are over
exactly \(k\), so in fact, we can precompute these:
1
2
3
4
5
6
7
8
9
Some(NtHashIt {
...
+ rot_k: H_LOOKUP.map(|x| x.rotate_left(k as u32)),
})
...
-self.fh = self.fh.rotate_left(1) ^ h(seqi).rotate_left(self.k as u32) ^ h(seqk);
+self.fh = self.fh.rotate_left(1) ^ self.rot_k[seqi as usize] ^ h(seqk);
Code Snippet 48:
Precomputing the rotations by \(k\) positions. Runtime: 1.59ns/kmer ↦ 1.41ns/kmer.
Code Snippet 49:
The corresponding assembly goes from 21 to just 15 instructions for 2 loops.
Analysing the assembly code
At this point, we cannot hope for much fewer instructions. The 15 instructions are:
1: increase iteration count by 2,
2: compare and jump,
Then for each of the two unrolled iterations:
2: load two characters,
1: rotate the rolling hash,
2: xor in the two characters,
1: write back the result.
At 1.41ns/kmer, this loop should take 1.41ns/kmer * 2.6GHz = 3.7 cycles/kmer
or 7.3 cycles/loop, which corresponds to just over 2
instructions per cycle. Indeed, perf stat shows 2.04 IPC. I have
marked the critical path of xor’ing the hashes with * s in the assembly code
above.
The three rorx, xor, xor instructions have a total latency of 1+2+2=5
cycles, which is actually higher than the 3.7 cyckes/kmer we measure. Maybe
the ’load’ micro-up in the xor can actually be done in parallel before the
result of the previous xor is done? Regardless, this is pretty impressive performance!
Still, instead of doing x = (x^a)^b, it would be more efficient to do x = x^(a^b). But just parenthesizing the code like that doesn’t improve the
assembly. Probably because that would require a separate load instruction,
since xor’ing with two addresses from memory (as opposed to a memory address and
a register) is not a thing.
5.5 Parallel it is
As we saw above, the current ntHash implementation only uses around 2 IPC,
while each core should be able to execute up to 4 instructions every cycle.
Thus, let’s simply try to run two copies of our main loop in parallel:
fnhash_kmers(&mutself,k: usize,t: &[u8])-> implIterator<Item=Self::Out>{letnum_kmers=t.len()-k+1;// For odd num_kmers, we skip the last kmer.
letkmers_per_part=num_kmers/2;letpart_len=kmers_per_part+k-1;lett0=&t[..part_len];lett1=&t[kmers_per_part..kmers_per_part+part_len];letmutv=vec![H::Out::default();2*kmers_per_part];letmutit0=self.hasher1.hash_kmers(k,t0);letmutit1=self.hasher2.hash_kmers(k,t1);foriin0..kmers_per_part{unsafe{v.as_mut_ptr().add(i).write(it0.next().unwrap());v.as_mut_ptr().add(kmers_per_part+i).write(it1.next().unwrap());}}v.into_iter()}
Code Snippet 50:
Running two copies of the main loop in parallel. Runtime: 1.41ns/kmer ↦ 1.36ns/kmer.
Sadly, this is not nearly the two times speedup I was hoping for, but at least
we’re at 2.5 IPC and now, and 3.5 cycles/kmer, even closer to the 3 cycles/kmer lower bound from the xor dependency chain. (Even though now that
we process two independent chains, that lower bound really is only 1.5 cycles/kmer. Probably the code is just fully occupying some ports.) The assembly code is:
Code Snippet 51:
The assembly code for the parallelized version. But for some reason, there are two compares again!
But this is just silly: At the end we compare %rcx with %r9, and if they
are not equal, we continue with the next iteration. Then the first thing we do:
again compare %rcx with %r9, and panic on the unwrap if they are equal.
But we just checked they are not equal… Again, I’m at a loss why this it the
case. Clearly there are two unwrap calls in our inner loop, but since they
check exactly the same, there really is no need for this.
I tried to parallellize more by doing 3 or 4 chunks at a time, both manually as
above and by using const generics and iterating over [_; B] types, but they all
perform worse at the moment, and generate very branchy code again.
Probably the better way to avoid all these issues is by inlining and
interleaving the ntHash code.
Instead of doing a ‘high level’ parallellization where we make \(B\)
calls to it.next() in parallel, it’s more efficient to ‘internalize’ the
parallellization, so that all the accounting overhead is done only once.
impl<'a,constL: usize>IteratorforNtHashParIt<'a,L>{// Each iteration now returns an array of L hashes.
typeItem=[u64;L];#[inline(always)]fnnext(&mutself)-> Option<[u64;L]>{ifself.current_idx==self.n{returnNone;};ifself.current_idx!=0{leti=self.current_idx-1;// Do L steps consecutively.
self.fh=from_fn(|l|{letseqi=unsafe{*self.seq.get_unchecked(l*self.n+i)};letseqk=unsafe{*self.seq.get_unchecked(l*self.n+i+self.k)};self.fh[l].rotate_left(1)^self.rot_k[seqiasusize]^h(seqk)});}self.current_idx+=1;Some(self.fh)}}
Code Snippet 52:
Splitting the text in \(L\) parts that are processed in parallel, but without SIMD. Takes 1.15ns/kmer for L=2 and 1.19ns/kmer for L=3, down from 1.36ns/kmer.
This is already slightly faster than the version we had. Let’s try more granular parallellization:
Code Snippet 53:
With even more granular parallelization, we get 1.14ns/kmer for L=2 and 1.10ns/kmer for L=3, down from 1.15ns/kmer.
This latest parallel version with \(L=3\) runs at 1.10ns/kmer and 2.7 IPC, or
exactly 3 cycles/kmer. I’m still not quite sure what the main bottleneck is
though. Anyway, from Agner Fog’s instruction tables and uops.info, I gather that
all of movzbl, rorx, xor, and mov can do 2 per cycle, but that all
instructions that read from memory have latency up to ~5 cycles, which sounds
like it may be a bottleneck. But it’s hard to say.
Note that all the methods above return a [u64; L] at a time. For now I’m just
collecting those into a Vec<[u64; L]>. We could then either return an iterator
over those blocks, or return a sequential iterator that traverses the entire
vector L times. Or we could even transpose the data back to a Vec<u64>, but
that would be somewhat slower.
Assumption
It’s fine to return out-of-order data.
5.6 Actual SIMD, at last
The first SIMD version is basically the same as before when using [u64, L],
but using Simd<u64, L> instead, with #![feature(portable_simd)] enabled.
Code Snippet 54:
The SIMD version is quite similar to before. It's slightly slower at 1.14ns/kmer for L=4.
The main difference is that in SIMD, there is no bit-rotate instruction26, so
that instead we have to implement this using two shifts. Let’s have a quick look
at the generated code.
0.68│c50:┌─→cmp%r9,%rax0.04││↓jed46││movzbl(%rsi,%r9,1),%r10d; read char from seq
0.04││vmovq0x940(%rsp,%r10,8),%xmm1; get hash of char into xmm1 128-bit register
10.70││movzbl(%rdx,%r9,1),%r10d0.51││vmovq0x940(%rsp,%r10,8),%xmm2││movzbl(%rcx,%r9,1),%r10d0.16││vmovq0x940(%rsp,%r10,8),%xmm39.83││movzbl(%r12,%r9,1),%r10d0.58││vmovq0x940(%rsp,%r10,8),%xmm40.45││movzbl(%r14,%r9,1),%r10d0.16││vmovq0x0(%r13,%r10,8),%xmm510.32││movzbl(%r15,%r9,1),%r10d0.70││vmovq0x0(%r13,%r10,8),%xmm60.30││movzbl(%r11,%r9,1),%r10d0.14││vmovq0x0(%r13,%r10,8),%xmm711.43││movzbl(%rbx,%r9,1),%r10d0.86││vmovq0x0(%r13,%r10,8),%xmm80.02││vpunpcklqdq%xmm1,%xmm2,%xmm1; merge xmm1 and xmm2
0.12││vpunpcklqdq%xmm3,%xmm4,%xmm2; merge xmm3 and xmm4
10.25││vinserti128$0x1,%xmm1,%ymm2,%ymm1; merge xmm1 and xmm2 into ymm1 256-bit register
0.72││vpunpcklqdq%xmm5,%xmm6,%xmm2;\
0.21││vpunpcklqdq%xmm7,%xmm8,%xmm3; merge xmm5-8 into ymm2
1.45││vinserti128$0x1,%xmm2,%ymm3,%ymm2;/
16.62││vpxor%ymm1,%ymm2,%ymm1; compute the 'delta': ymm1 ^ ymm2
0.51││vpsrlq$0x3f,%ymm0,%ymm2; shift right
││vpaddq%ymm0,%ymm0,%ymm0; shift left = *2 = ymm0 + ymm0
││vpor%ymm2,%ymm0,%ymm0; or the two results
10.66││vpxor%ymm0,%ymm1,%ymm0; xor in the 'delta' term
2.55││vmovdqu%ymm0,(%rdi); write to memory
││inc%r9││add$0x20,%rdi0.02│├──cmp%r9,%r89.94│└──jnec50
Code Snippet 55:
Assembly for the SIMD ntHash. Still lots of memory accesses and many instructions spent on packing the SIMD registers.
This looks rather terrible! There are many instructions to read the sequence
characters and get their hashes, and then a number of instructions to merge
everything into SIMD registers. The final rotate and xor operations are
optimized, but they make up only a small part of the total runtime like this.
SIMD table lookups
So far, we have been assuming that the text can be anything. But actually,
ntHash already is only ‘seeded’ for DNA characters ACGT, and so the [u64; 256] table really only contains 4 interesting entries. So we could just use a
[u64; 4].
Assumption
From now on, we assume that the text is a &[u8] containing only values in 0..=3.
That’s exactly the size of a single SIMD register.
Indeed, we can store them in a single SIMD register, and use a shuffle instruction to
extract them.
Or at least that’s what we would like to do.
Sadly, there is no AVX227
dynamic28 cross-128bit lane29 64-bit30 shuffle available, but
if we give up any one of these constraints we’re good. Thus, we can implement the
64-bit shuffle using the 32-bit shuffle. When we want to look up the 64-bit hash
of a character with value/index c, now instead we look up the two
32-bit parts independently, which will be at indices 2c and 2c+1.
But instead of implementing this, let’s do something simpler31.
32-bit hashes
Everything becomes much simpler if we
use 32-bit hashes directly! There really is no reason we need the full
64-bit hashes to select the minimum hash of a window of size \(w\) that is
rarely more than \(100\).
And since also the 4 values in the
dictionary now fit in 128 bits, so that we do not need cross 128-bit lane
shuffles anymore at all! So we can now exactly use the _mm256_permutevar_ps
instruction, which even has 1 cycle latency. And also, we can now compute 8 hashes at a time instead
of 4!32
Somewhat annoyingly there is no _epi32 integer version of this, so we’ll have
to cast our data to 32-bit floats and back to make the type checker happy.
Assumption
We assume that 32-bit ntHashes are sufficient.
The lower periodicity causes some
collisions for \(k> 32\), but they will still be rare and the number of k-mers in
a window \(w\) that we’re minimizing over is small anyway.
// Types
constL: usize=8;typeS=Simd<u32,L>;...// Initialization of 8x32 'tables',
// containing a copy of the 4 32-bit values in both the high and low 128 bits.
lettable: S=b"ACGTACGT".map(|c|h(c)asu32).into();lettable_rot_k: S=b"ACGTACGT".map(|c|h(c).rotate_left(kasu32)asu32).into();...// Wrap _mm256_permutevar_ps, which is typed for floats, with integer types.
usestd::mem::transmuteast;letpermutevar_epi32=|table: S,idx: S|t(_mm256_permutevar_ps(t(table),t(idx)));// Look up the character values.
lethi: S=permutevar_epi32(self.table_rot_k,seqi);lethk: S=permutevar_epi32(self.table,seqk);// Update the rolling hash.
// There is no SIMD rotate in AVX2 so we do it manually.
self.fh=((self.fh<<1)|(self.fh>>31))^hi^hk;
Code Snippet 56:
Highlights of using SIMD permute to look up the hash values of characters. Runs in 1.66 ns/kmer, not yet close to our best of 1.10 ns/kmer.
In the assembly, we see 16 copies of the code below, an shift-in and shift-out
character for each of the 8 lanes.
1
2
3
mov0x20(%rsp),%rbp; read the offset in the sequence of current lane
movzbl0x0(%rbp,%rdi,1),%ebp; read the character
vpinsrd$0x2,%ebp,%xmm4,%xmm4; insert the read character into a SIMD register
Code Snippet 57:
Reading a single character from the sequence into a SIMD register.
Shared offsets
Instead of storing all offsets on the stack, we can also put them all in a
single SIMD register.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
// initialization
Some(Self {
...
+ offsets: from_fn(|l| (l * n) as T).into(),
})
...
// loop
let s_ptr = self.seq.as_ptr();
-let seqi: S = from_fn(|l| *s_ptr.add(l * self.n + i) as T).into();
-let seqk: S = from_fn(|l| *s_ptr.add(l * self.n + i + self.k) as T).into();
+let oi = self.offsets + S::splat(i as T);
+let ok = oi + S::splat(self.k as T);
+let seqi = oi.as_array().map(|o| *s_ptr.add(o as usize) as T).into();
+let seqk = ok.as_array().map(|o| *s_ptr.add(o as usize) as T).into();
Code Snippet 58:
Using a single SIMD register with offsets. It's much slower.
In fact, this turns out worse. Now the indices are computed in SIMD, but then
have to be extracted anyway before any lookups are done.
I was expecting a gather instruction to be generated here, but maybe the ‘zero
extend’ of reading a u8 into a u32 is preventing that.
5.7 SIMD: The Gathering
Gathering 4 characters at a time
But that gives a new idea! If we’re anyway reading from a memory location, we
may as well directly read of full u32 worth of memory instead of just a u8.
That could reduce the time spent on reading 4-fold.
Thus, we make a read every fourth iteration, and then buffer the values.
pubstructNtHashSimdIt<'a>{...// New buffer fields in the struct.
chars_i: S,chars_k: S,}...// New main loop.
ifi%4==0{letp=self.seq.as_ptr()as*constu32;letoi=self.offsets+S::splat(iasu32);letok=oi+S::splat(self.kasu32);// Read a u32 at given byte offsets.
// Assumes little-endian.
self.chars_i=oi.as_array().map(|o|*p.byte_offset(oasisize)).into();self.chars_k=ok.as_array().map(|o|*p.byte_offset(oasisize)).into();}// Extract the last 2 bits of each character.
letseqi=self.chars_i&S::splat(0x03);letseqk=self.chars_k&S::splat(0x03);// Shift remaining characters in the buffer to the right.
self.chars_i>>=S::splat(8);self.chars_k>>=S::splat(8);
Code Snippet 59:
Reading a u32 every fourth iteration. Still runs in 1.65 ns/kmer, like Code Snippet 56.
0.04│1450:┌─→vpsrld$0x1f,%ymm3,%ymm70.73│││vpaddd%ymm3,%ymm3,%ymm30.89│││vpor%ymm7,%ymm3,%ymm31.39│││vpbroadcastd-0x544de(%rip),%ymm7; Splat [3,3,3,3] for the mask
0.05│││vpand%ymm7,%ymm6,%ymm8; mask low 2 bits
0.90│││vpermilps%ymm8,%ymm2,%ymm8; table lookup 1
0.78│││vpxor%ymm3,%ymm8,%ymm3; xor in the results
1.71│││vpand%ymm7,%ymm5,%ymm7; mask low 2 bits
0.03│││vpermilps%ymm7,%ymm4,%ymm7; table lookup 2
2.23│││vpxor%ymm7,%ymm3,%ymm3; xor in the result
0.01│││vpsrld$0x8,%ymm5,%ymm5; shift remaining characters
1.34│││vpsrld$0x8,%ymm6,%ymm6; shift remaining characters
0.01│││vmovdqu%ymm3,(%rax)1.42│││add$0x20,%rax0.01│││inc%rdx1.21│││cmp%rdx,%rcx│││↑je12de0.01│149e:│cmp%rdx,%rdi│││↓je152b0.03│││test$0x3,%dl1.26│└──│─jne1450; loop when i%4 != 0
0.01││vmovd%edx,%xmm5; from here on: read sequence
0.73││vpbroadcastd%xmm5,%ymm5││vpaddd%ymm0,%ymm5,%ymm50.09││vpmovzxdq%xmm5,%ymm60.01││vextracti128$0x1,%ymm5,%xmm70.89││vpmovzxdq%xmm7,%ymm7││vpcmpeqd%xmm8,%xmm8,%xmm80.09││vpxor%xmm9,%xmm9,%xmm9*21.19*││vpgatherqd%xmm8,(%r12,%ymm7,1),%xmm9; slow gather instruction
0.64││vpcmpeqd%xmm7,%xmm7,%xmm7││vpxor%xmm8,%xmm8,%xmm8*19.20*││vpgatherqd%xmm7,(%r12,%ymm6,1),%xmm8; slow gather instruction
0.65││vpaddd%ymm1,%ymm5,%ymm5││vextracti128$0x1,%ymm5,%xmm6││vpmovzxdq%xmm6,%ymm6││vpcmpeqd%xmm7,%xmm7,%xmm70.83││vpxor%xmm10,%xmm10,%xmm10*20.09*││vpgatherqd%xmm7,(%r12,%ymm6,1),%xmm10; slow gather instruction
0.61││vinserti128$0x1,%xmm9,%ymm8,%ymm6││vpmovzxdq%xmm5,%ymm5││vpcmpeqd%xmm7,%xmm7,%xmm7││vpxor%xmm8,%xmm8,%xmm8*20.23*││vpgatherqd%xmm7,(%r12,%ymm5,1),%xmm8; slow gather instruction
0.65││vinserti128$0x1,%xmm10,%ymm8,%ymm5│└──jmp1450
Code Snippet 60:
Assembly code for gathering a u32 every fourth iteration. The 4 gather (vpgatherqd) instructions together take 80% of the time!
The first very clear observation is that even though we only gather every fourth
iteration, those 4 instructions take up 80% of the time! Clearly, getting the
data from memory into the SIMD registers is the bottleneck here. What is
somewhat surprising is that line 15 and 16 of Code Snippet 59 are each turned into
two gather instructions. As it turns out, there are two very similar gather
instructions:
_mm256_i64father_epi32, which takes a 256-bit SIMD register with four 64-bit
indices and returns a 128-bit SIMD register with four 32-bit values.
_mm256_i32father_epi32, which takes a 256-bit SIMD register with eight 32-bit
indices and returns a 256-bit SIMD register with eight 32-bit values.
It would be slightly nicer to use the second version, but in the end I suspect
it wouldn’t matter much for performance, since the same number of entries is
read anyway.
Gathering 8 characters at a time
At this point, it is very clear that the gathering is the bottleneck. Since
it is already split into two instructions anyway, we may as well fetch a u64
at a time. We then have to do some shuffling to those into two SIMD registers of
u32 values, one with the first 32 bits of each u64, and one with the second 32 bits.
Then after 4 iterations, we can cheaply swap in the remaining 32 bits.
pubstructNtHashSimdIt<'a>{offsets: Simd<*constu8,4>,offsets_next: Simd<*constu8,4>,chars_i: Simd<u32,4>,chars_k: Simd<u32,4>,chars_i_next: Simd<u32,4>,chars_k_next: Simd<u32,4>,}...// In main loop:
// Every 8th iteration, read a u64 of incoming and outgoing data for each of the 8 chunks.
ifi%8==0{letoi1=self.offsets.wrapping_add(Simd::splat(i));letoi2=self.offsets_next.wrapping_add(Simd::splat(i));letok1=oi1.wrapping_add(Simd::splat(self.k));letok2=oi2.wrapping_add(Simd::splat(self.k));// i1, k1: The next 64bits for the first 4 chunks.
// i2, k2: The next 64bits for the second 4 chunks.
// The inner transmute makes the *u8 into a *u32, which assumes little-endian.
// The outer transmute makes the [u64, 4] into a [u32, 8].
letchars_i1: Simd<u32,8>=transmute(Simd::<u64,4>::gather_ptr(transmute(oi1)));letchars_i2: Simd<u32,8>=transmute(Simd::<u64,4>::gather_ptr(transmute(oi2)));letchars_k1: Simd<u32,8>=transmute(Simd::<u64,4>::gather_ptr(transmute(ok1)));letchars_k2: Simd<u32,8>=transmute(Simd::<u64,4>::gather_ptr(transmute(ok2)));// The upcoming 32 bits for each chunk.
// The 32 bits after that for each chunk.
(self.chars_i,self.chars_i_next)=chars_i1.deinterleave(chars_i2);(self.chars_k,self.chars_k_next)=chars_k1.deinterleave(chars_k2);}// After 4 iterations, swap in the next 32 bits.
ifi%8==4{self.chars_i=self.chars_i_next;self.chars_k=self.chars_k_next;}
Code Snippet 61:
Gathering a full u64 at a time requires slightly more messy code, but now runs in 1.05 ns/kmer! Finally improving the 1.10 ns/kmer non-SIMD best so far.
perf stat now shows that the total time on gather instructions has gone down
from 80% to 67%. Still quite some room for improvement if we could make the
input faster. The deinterleave is translated to some vshufps (latency 1) and
vpermpd (latency 3) instructions as follows:
1
2
3
4
5
6
7
8
*15.25*│vpgatherqq%ymm12,0x0(,%ymm10,1),%ymm11; gather to yxx11
...*17.27*│vpgatherqq%ymm12,0x0(,%ymm7,1),%ymm10; gather to yxx10
...0.70│vshufps$0x88,%ymm10,%ymm11,%ymm9; extract low 32 bits
0.18│vpermpd$0xd8,%ymm9,%ymm9; shuffle
│vshufps$0xdd,%ymm10,%ymm11,%ymm10; extract high 32 bits
0.07│vpermpd$0xd8,%ymm10,%ymm10; shuffle
Code Snippet 62:
Snippet of the generated assembly for the deinterleave after two gather instructions.
Gathering 32 characters at a time
Clearly, streaming the input is still a limitation. But we have one last trick
available. We already assume that the input takes 2-bit values in 0..=3, so that of
each byte we read, 6 bits are wasted. Instead, let’s assume that the input is
already bit-packed!
Assumption
The text is 2-bit encoded and bit-packed.
For now I’ll just keep using the same &[u8] type for the text, but in the
final API it would be cleaner to create a wrapper type.
The code is mostly similar to before, with the change that we now shift out only
2 bits instead of 8 at the end of every iteration.
One more change is that we can’t simply read data at every 2-bit offset.
Specifically, we can’t necessarily read a u64 at position i and a u64 at
position i+k at the same time. Thus, we separate the updates for the start and
end positions of the window. We should then also initialize self.chars_k with the
leading bits.
Code Snippet 63:
When using bit-packed data, we must read the position-i and position-i+k data separately. Runtime goes from 1.05 ns/kmer to 0.17 ns/kmer.
Yes, you read that right, 0.17 ns/kmer :) That’s 0.44 cycles/kmer, and way
faster than anything we had before.
Bug
It turns out I made some bugs here. First, the
array indexing in the code above should use index i/4 instead of i.
Second, and more importantly, for an input of \(n\) bytes, I was only doing \(n\)
instead of \(4n\) iterations. Thus, all numbers reported from here onwards
are roughly 4 times slower in practice. I will update them once I have some
proper tests in place.
The full code for SIMD-based hashing of bitpacked data is here.
constL: usize=8;typeT=u32;typeS=Simd<T,L>;pubstructNtHashSimd<constSMALL_K: bool>;impl<constSMALL_K: bool>ParHasher<L>forNtHashSimd<SMALL_K>{typeOut=T;#[inline(always)]fnhash_kmers(&mutself,k: usize,t: &[u8])-> implIterator<Item=[Self::Out;L]>{ifSMALL_K{assert!(k<=32);}NtHashPackedSimdIt::<SMALL_K>::new(t,k).unwrap().map(|x|x.into())}}#[derive(Debug)]pubstructNtHashPackedSimdIt<'a,constSMALL_K: bool>{seq: &'a[u8],n: usize,k: usize,fh: S,current_idx: usize,table: S,table_rot_k: S,offsets: Simd<*constu8,4>,offsets_next: Simd<*constu8,4>,chars_i: S,chars_i_next: S,chars_k: S,chars_k_next: S,chars_k_copy: S,chars_k_next_copy: S,}impl<'a,constSMALL_K: bool>NtHashPackedSimdIt<'a,SMALL_K>{/// Creates a new NtHashIt with internal state properly initialized.
#[inline(always)]pubfnnew(seq: &'a[u8],k: usize)-> Option<Self>{ifk>seq.len(){returnNone;}ifk>MAXIMUM_K_SIZE{returnNone;}ifseq.len()>u32::MAXas_{returnNone;}letnum_kmers=seq.len()-k+1;ifnum_kmers%L!=0{returnNone;}letn=num_kmers/L;// Each 128-bit half has a copy of the 4 32-bit hashes.
lettable: S=b"ACTGACTG".map(|c|h(c)asT).into();lettable_rot_k: S=b"ACTGACTG".map(|c|h(c).rotate_left(kasu32)asT).into();letfh=from_fn(|l|{letmutfh=0;for(i,v)inseq[l*n..l*n+k].iter().enumerate(){fh^=(h(*v)asT).rotate_left((k-i-1)asu32);}fh}).into();Some(Self{seq,n,k,fh,current_idx: 0,table,table_rot_k,offsets: from_fn(|l|unsafe{seq.as_ptr().add(l*n)}).into(),offsets_next: from_fn(|l|unsafe{seq.as_ptr().add((4+l)*n)}).into(),chars_i: S::splat(0),chars_i_next: S::splat(0),// TODO: properly initialize the first (-k)%32 characters of chars_k.
chars_k: S::splat(0),chars_k_next: S::splat(0),chars_k_copy: S::splat(0),chars_k_next_copy: S::splat(0),})}}impl<'a,constSMALL_K: bool>IteratorforNtHashPackedSimdIt<'a,SMALL_K>{typeItem=S;#[inline(always)]fnnext(&mutself)-> Option<S>{ifself.current_idx==self.n{returnNone;};ifself.current_idx!=0{leti=self.current_idx-1;unsafe{letread=|i|{letoi1=self.offsets.wrapping_add(Simd::splat(i));letoi2=self.offsets_next.wrapping_add(Simd::splat(i));letchars_i1: Simd<u32,8>=transmute(Simd::<u64,4>::gather_ptr(transmute(oi1)));letchars_i2: Simd<u32,8>=transmute(Simd::<u64,4>::gather_ptr(transmute(oi2)));chars_i1.deinterleave(chars_i2)};ifi%16==0{ifi%32==0{ifSMALL_K{self.chars_i=self.chars_k_copy;self.chars_i_next=self.chars_k_next_copy;}else{(self.chars_i,self.chars_i_next)=read(i);}}else{self.chars_i=self.chars_i_next;}}if(i+self.k)%16==0{if(i+self.k)%32==0{(self.chars_k,self.chars_k_next)=read(i+self.k);self.chars_k_copy=self.chars_k;self.chars_k_next_copy=self.chars_k_next;}else{self.chars_k=self.chars_k_next;}}// Extract the last 2 bits of each character.
letseqi=self.chars_i&S::splat(0x03);letseqk=self.chars_k&S::splat(0x03);// Shift remaining characters to the right.
self.chars_i>>=S::splat(2);self.chars_k>>=S::splat(2);usestd::mem::transmute;letpermutevar_epi32=|a: S,b: S|transmute(_mm256_permutevar_ps(transmute(a),transmute(b)));lethi: S=permutevar_epi32(self.table_rot_k,seqi);lethk: S=permutevar_epi32(self.table,seqk);self.fh=((self.fh<<1)|(self.fh>>31))^hi^hk;}}self.current_idx+=1;Some(self.fh)}}
Code Snippet 64:
Full source of the SIMD ntHash with bit-packed data.
The four gather instructions only take around 30% of the total time now, and
thus aren’t nearly as much of a bottleneck anymore.
At this point, I thought we were done and not much further improvement would be
possible. But I was wrong.
Reusing the gathers
Each character is currently read twice: once when it enters the sliding window,
and once when it exits. It would be more efficient to reuse the first read
instead of re-reading from memory.
When \(k\geq 32\), this is actually quite straightforward: At some point \((i+k)
\bmod 32 = 0\),
and the next 32 characters/64 bits of data are read at position \(i+k\).
Then \(k\) iterations later, \(i\) itself is equal to that same value, and exactly
the same data is read. Thus, we can buffer the data read for position \(i+k\) and
use it the next time \(i\bmod 32 = 0\). But note that this only works when \(k\leq
32\). For larger \(k\). For larger \(k\), it could still be done but needs a larger buffer.
Code Snippet 65:
Reusing the gather at position \(i+k\) for position \(i\) when \(k\leq 32\). Runtime goes from 0.17 ns/kmer to 0.155 ns/kmer.
In each iteration we now check whether \(i\bmod 32 \equiv 0\) and \(i\bmod 32\equiv
16\). It’s more efficient to first do a single check whether \(i\bmod 16\equiv 0\),
and only then check which of the two cases it is, since that saves a compare and
jump in the main loop. That brings the runtime down to 0.145 ns/kmer, or
precisely 3 clock cycles per 8 hashes.
One drawback of the above is that when \(k\) is indeed small, we now have a
useless if statement and a bunch of unnecessary code for read(i) is generated.
So let’s put this condition behind a generic:
Code Snippet 66:
Checking whether \(k\leq 32\) ahead of time saves a branch.chars_i= and self.chars_k variables in registers again. Runtime goes from 0.145 ns/kmer to 0.135 ns/kmer.
At this point, perf stat shows that there are nearly no branch misses, and
2.43 IPC, which is good but actually not that close to 4. Again this makes
me wonder what the actual bottleneck is. Remember 5.5? Where we
processed two chunks in parallel? Maybe we should eventually try that again.
5.8 Fixing the benchmark
At this point, perf record shows:
1
2
43.89% bench-3dc28953a bench-3dc28953a01a51f9 <BufferPar<_,H> as ParHasher<_>>::hash_kmers
41.57% bench-3dc28953a libc.so.6 0x0000000000167f4a
Code Snippet 67:perf record shows that only just over half the time is actually spent in hash_kmers.
Surely, this other part has something to do with either allocating or dropping
the vector to which we collect. Let’s fix that by reusing the vector between iterations.
Thus, we now make all our Hasher instances statefull to give them the
opportunity to reuse their allocated memory between iterations. This is what
requires e.g. Hasher::hash_kmers to take a &mut self: so that it can use and
modify this state.
#[derive(Debug)]pubstructBufferParCached<constL: usize,H: ParHasher<L>>{hasher: H,/// A vector that can be reused between iterations.
v: Vec<[H::Out;L]>,}impl<constL: usize,H: ParHasher<L>>BufferParCached<L,H>{pubfnnew(hasher: H)-> Self{Self{hasher,v: vec![]}}}impl<constL: usize,H: ParHasher<L>>ParHasher<L>forBufferParCached<L,H>whereH::Out: Default+Copy,{typeOut=H::Out;fnhash_kmers(&mutself,k: usize,t: &[u8])-> implIterator<Item=[H::Out;L]>{letmutlen=t.len()-k+1;lett=&t[..t.len()-len%L];len-=len%L;letn=len/L;// Resize the vector to the right size.
self.v.resize(n,[Self::Out::default();L]);letmutit=self.hasher.hash_kmers(k,t);foriin0..n{leths=it.next().unwrap();forjin0..L{unsafe{self.v.get_unchecked_mut(i)[j]=hs[j]};}}// We can't 'into_iter' our owned vector, so we hand out copies to the elements instead.
self.v.iter().copied()}}
Code Snippet 68:
The new Cached hasher reuses the allocated buffer between usages. Runtime goes from 0.135 ns/kmer to 0.102 ns/kmer.
Now, perf again shows that over 90% of time is spent in the hash_kmers
function, and indeed the runtime improves significantly: 0.102 ns/kmer!33
And also, we’re at exactly 3 IPC now, which at last seems quite reasonable.
Takeaway
We’ve gone from 2.9 ns/kmer for the nthash crate to 0.102 ns/kmer by using
32-bit hashes, a bit-packed input, and by using SIMD to parallellize over chunks
of the input. 28x speedup!
Note that instead of returning a vector, we could also benchmark just the time
to compute all the results, e.g. by summing all the output hashes. That only takes
0.90 ns/kmer. But for consistency, it is simpler to always require an actual
vector as output.
One last branch
At this point, there is also still one redundant branch due to the unwrap when
iterating over the vector.
Code Snippet 69:
The it.next().unwrap() causes a redundant branch that's never taken.
I tried replacing it by unwrap_unchecked, but that creates slower code. I
also tried removing the case that returns None completely, which does remove
branch, but anyway the runtime of the code is the same, most likely since
parsing the instructions isn’t the bottleneck anyway.
5.9 Analysis: Machine code analysis
It would be great to know what the actual bottleneck of the code is now. One way
to (try to) find out is by using LLVM’s Machine Code Analyzer (LLVM-mca). First,
we must determine the loop to analyse. For simplicity, let’s exclude the input
gathering, and only focus on the nthash computation, that covers around 70% of
the time:
We can use carg-show-asm to conveniently extract the assembly code.
First, we mark the hash_kmers function as #[inline(never)] so that we can
find its function. Then we can get the assembly code of it using:
1
cargo asm --bench bench hash_kmers --att
Code Snippet 71:cargo asm invocation to get the AT&T style assembly for the hash_kmers function.
This outputs the full function, including setup and cleanup. We manually copy
the assembly corresponding to the inner loop to a separate file.
Code Snippet 72:
The extracted assembly code has some minor differences in ordering from the one before, but it mostly the same.
Then, we can run LLVM-mca on this code: llvm-mca < assembly, optionally with
some flags such as --all-stats, --all-view, --bottleneck-analysis, and --timeline. I
highly encourage you to run it yourself and have a closer look at the output.
First, the output contains some high level statistics:
1
2
3
4
5
6
7
8
9
Iterations: 100
Instructions: 2700
Total Cycles: 558
Total uOps: 3000
Dispatch Width: 6
uOps Per Cycle: 5.38
IPC: 4.84
Block RThroughput: 5.0
Code Snippet 73:
According to LLVM-mca, the max inverse throughput is 5 cycles per loop.
We see that the it reports the maximum throughput of this block at one iteration
every 5 cycles. But our benchmark already showed that we can do 1 iteration
every 3 cycles, so I’m confused here. Also the dispatch width of 6 and 4.84 IPC seem unreasonably high, but maybe I was using the wrong numbers for my CPU
(a i7 10750H).
Then, we get a table of latency and throughput values.
Code Snippet 76:
Two iterations of the timeline, that show that the vmovdqa instructions that load the characters take a lot of time to execute.
The thing that struck me most in all this is that the vmovdqa instructions
have a quite high latency of 7 cycles. Even though it is probably hidden, it
would be nice to remove this. In fact, we can see that the buffers containing
the characters are read from memory/the stack (608(%rsp) is an address
relative to the stack pointer), then shifted by 2 bits, and then written back to
the same place. I wonder why this value can’t just be kept in a register
instead. In fact, I have seen a few cases where a register is used directly,
but I can’t make it work reliably.
5.10 Finals thoughts
Now that we’ve reached 0.1 ns/kmer, and nearly 30x speedup over a previous
implementation, I think it’s finally time to call it a day34.
To recap, our speed is:
0.1 ns/kmer
0.26 cycles/kmer
2.08 cycles/8kmers, i.e., 2 cycles per iteration of the main loop that produces 8
hashes in a SIMD.
And for completeness, here’s the full source code.
constL: usize=8;typeT=u32;typeS=Simd<T,L>;pubstructNtHashSimd<constSMALL_K: bool>;impl<constSMALL_K: bool>ParHasher<L>forNtHashSimd<SMALL_K>{typeOut=T;#[inline(always)]fnhash_kmers(&mutself,k: usize,t: &[u8])-> implIterator<Item=[Self::Out;L]>{ifSMALL_K{assert!(k<=32);}NtHashPackedSimdIt::<SMALL_K>::new(t,k).unwrap().map(|x|x.into())}}#[derive(Debug)]pubstructNtHashPackedSimdIt<'a,constSMALL_K: bool>{seq: &'a[u8],n: usize,k: usize,fh: S,current_idx: usize,table: S,table_rot_k: S,offsets: Simd<*constu8,4>,offsets_next: Simd<*constu8,4>,chars_i: S,chars_i_next: S,chars_k: S,chars_k_next: S,chars_k_copy: S,chars_k_next_copy: S,}impl<'a,constSMALL_K: bool>NtHashPackedSimdIt<'a,SMALL_K>{/// Creates a new NtHashIt with internal state properly initialized.
#[inline(always)]pubfnnew(seq: &'a[u8],k: usize)-> Option<Self>{ifk>seq.len(){returnNone;}ifk>MAXIMUM_K_SIZE{returnNone;}ifseq.len()>u32::MAXas_{returnNone;}letnum_kmers=seq.len()-k+1;ifnum_kmers%L!=0{returnNone;}letn=num_kmers/L;// Each 128-bit half has a copy of the 4 32-bit hashes.
lettable: S=b"ACTGACTG".map(|c|h(c)asT).into();lettable_rot_k: S=b"ACTGACTG".map(|c|h(c).rotate_left(kasu32)asT).into();letfh=from_fn(|l|{letmutfh=0;for(i,v)inseq[l*n..l*n+k].iter().enumerate(){fh^=(h(*v)asT).rotate_left((k-i-1)asu32);}fh}).into();Some(Self{seq,n,k,fh,current_idx: 0,table,table_rot_k,offsets: from_fn(|l|unsafe{seq.as_ptr().add(l*n)}).into(),offsets_next: from_fn(|l|unsafe{seq.as_ptr().add((4+l)*n)}).into(),chars_i: S::splat(0),chars_i_next: S::splat(0),// TODO: properly initialize the first (-k)%32 characters of chars_k.
chars_k: S::splat(0),chars_k_next: S::splat(0),chars_k_copy: S::splat(0),chars_k_next_copy: S::splat(0),})}}impl<'a,constSMALL_K: bool>IteratorforNtHashPackedSimdIt<'a,SMALL_K>{typeItem=S;#[inline(always)]fnnext(&mutself)-> Option<S>{ifself.current_idx==self.n{returnNone;};ifself.current_idx!=0{leti=self.current_idx-1;unsafe{letread=|i|{letoi1=self.offsets.wrapping_add(Simd::splat(i));letoi2=self.offsets_next.wrapping_add(Simd::splat(i));letchars_i1: Simd<u32,8>=transmute(Simd::<u64,4>::gather_ptr(transmute(oi1)));letchars_i2: Simd<u32,8>=transmute(Simd::<u64,4>::gather_ptr(transmute(oi2)));chars_i1.deinterleave(chars_i2)};ifi%16==0{ifi%32==0{ifSMALL_K{self.chars_i=self.chars_k_copy;self.chars_i_next=self.chars_k_next_copy;}else{(self.chars_i,self.chars_i_next)=read(i);}}else{self.chars_i=self.chars_i_next;}}if(i+self.k)%16==0{if(i+self.k)%32==0{(self.chars_k,self.chars_k_next)=read(i+self.k);self.chars_k_copy=self.chars_k;self.chars_k_next_copy=self.chars_k_next;}else{self.chars_k=self.chars_k_next;}}// Extract the last 2 bits of each character.
letseqi=self.chars_i&S::splat(0x03);letseqk=self.chars_k&S::splat(0x03);// Shift remaining characters to the right.
self.chars_i>>=S::splat(2);self.chars_k>>=S::splat(2);usestd::mem::transmute;letpermutevar_epi32=|a: S,b: S|transmute(_mm256_permutevar_ps(transmute(a),transmute(b)));lethi: S=permutevar_epi32(self.table_rot_k,seqi);lethk: S=permutevar_epi32(self.table,seqk);self.fh=((self.fh<<1)|(self.fh>>31))^hi^hk;}}self.current_idx+=1;Some(self.fh)}}
Code Snippet 77:
Full code for the SIMD FxHash implementation.
Doubling down again
Currently, the number of instructions per cycle is 3, while it could be over 4.
One thing that we could try is to split the input in to 16 chunks and process
them in parallel using two SIMD units of 8 lanes. That way, we could fully
exploit the parallellism of SIMD in combination with the instruction level
parallellism of the CPU. However, my feeling is that the throughput is already
bounded by the number of available registers and ports, so that doing everything
in parallel wouldn’t help much.
I did a very quick and ugly test for this, and it came out twice as slow. The
main reasons seems to be that indeed many variables spill onto the stack, making
things much slower.
16-bit hashes?
By going from 64bit to 32bit hashes, the SIMD code became much simpler and
faster. One could imagine that going to 16bit hashes gives a similar speedup.
Especially since now 80% of time is spent in the inner loop doing ‘plain’ SIMD
instructions, having double the parallellism may be good.
One drawback though is that there are no 16-bit shuffle/permute instructions to
do the corresponding table lookup, so we’d have to:
Deinterleave the 16-bit characters into two vectors of 32-bit characters.
Do two 32-bit permutevar instructions.
Interleave the results back together.
Or alternatively, we could use 128-bit byte shuffles, but then still we have to
do some extra accounting to turn the character values into appropriate
byte-indices.
Lastly, we’d also have to split the u64 of characters read in each place into
4 chunks of 16 instead of 2 chunks of 32, and then rotate the chunks every 8
instead of 16 characters. But this probably doesn’t cause much of a slowdown.
To summarize, I think some speedup may be possible here, but definitely not the
2x one may naively expect.
What about a simple multiply hash
For small \(k\leq 16\), a different approach is to simply multiply their 32-bit value by
a constant and use that as a hash, as FxHash does. Surely, this can also be
optimized to be very efficient.
By re-using all the work we already did to efficiently gather the data from the
sequence, this is now fairly straightforward:
1
2
3
4
5
6
7
8
9
10
11
12
13
// upcoming_chars contains the next 32bits for each lane, and is gathered as before.
// Extract the last 2 bits of each character.
letnew_chars=self.upcoming_chars&S::splat(0x03);// Shift remaining characters to the right.
self.upcoming_chars>>=S::splat(2);// Rotate the bit-representation.
self.chars<<=S::splat(2);// Mask out the old bits.
self.chars&=self.mask;// Or in the new bits.
self.chars|=new_chars;// Multiply by a constant.
returnself.chars*self.constant;
Code Snippet 78:
Using FxHash instead of ntHash takes 75 ps/kmer, down from 102 ps/kmer!
//////////////////////////// Bit-packed SIMD ////////////////////////////
usesuper::*;usestd::{array::from_fn,mem::transmute,simd::{ptr::SimdConstPtr,Simd},};constL: usize=8;typeT=u32;typeS=Simd<T,L>;pubstructFxHashSimd;implParHasher<L>forFxHashSimd{typeOut=T;#[inline(always)]fnhash_kmers(&mutself,k: usize,t: &[u8])-> implIterator<Item=[Self::Out;L]>{FxHashPackedSimdIt::new(t,k).unwrap().map(|x|x.into())}}#[derive(Debug)]pubstructFxHashPackedSimdIt<'a>{seq: &'a[u8],n: usize,k: usize,chars: S,mask: S,constant: S,current_idx: usize,offsets: Simd<*constu8,4>,offsets_next: Simd<*constu8,4>,upcoming_chars: S,upcoming_chars_next: S,}impl<'a>FxHashPackedSimdIt<'a>{/// Creates a new FxHashIt with internal state properly initialized.
#[inline(always)]pubfnnew(seq: &'a[u8],k: usize)-> Option<Self>{assert!(k<=16);ifk>seq.len(){returnNone;}ifk>MAXIMUM_K_SIZE{returnNone;}ifseq.len()>u32::MAXas_{returnNone;}letnum_kmers=seq.len()-k+1;ifnum_kmers%L!=0{returnNone;}letn=num_kmers/L;Some(Self{seq,n,k,chars: S::splat(0),current_idx: 0,mask: S::splat((1<<(2*k))-1),constant: S::splat(1234565323),offsets: from_fn(|l|unsafe{seq.as_ptr().add(l*n)}).into(),offsets_next: from_fn(|l|unsafe{seq.as_ptr().add((4+l)*n)}).into(),upcoming_chars: S::splat(0),upcoming_chars_next: S::splat(0),})}}impl<'a>IteratorforFxHashPackedSimdIt<'a>{typeItem=S;#[inline(always)]fnnext(&mutself)-> Option<S>{ifself.current_idx==self.n{returnNone;};leti=self.current_idx;self.current_idx+=1;letread=|i|unsafe{letoi1=self.offsets.wrapping_add(Simd::splat(i));letoi2=self.offsets_next.wrapping_add(Simd::splat(i));letchars_i1: Simd<u32,8>=transmute(Simd::<u64,4>::gather_ptr(transmute(oi1)));letchars_i2: Simd<u32,8>=transmute(Simd::<u64,4>::gather_ptr(transmute(oi2)));chars_i1.deinterleave(chars_i2)};if(i+self.k)%16==0{if(i+self.k)%32==0{(self.upcoming_chars,self.upcoming_chars_next)=read(i+self.k);}else{self.upcoming_chars=self.upcoming_chars_next;}}// Extract the last 2 bits of each character.
letnew_chars=self.upcoming_chars&S::splat(0x03);// Shift remaining characters to the right.
self.upcoming_chars>>=S::splat(2);// Rotate the bit-representation.
self.chars<<=S::splat(2);// Mask the previous
self.chars&=self.mask;// Or in the new bits.
self.chars|=new_chars;// Multiply by a constant.
lethashes=self.chars*self.constant;Some(hashes)}}
Code Snippet 79:
Full code for the SIMD FxHash implementation.
Again faster, with 75 ps/kmer now, or 1.56 cycles per iteration!
One drawback of course is that this really only works for small \(k\). For
\(16<k\leq 32\) the same could be done by doing two 32-bit multiplications, but
that would probably hurt performance and make it slower than our ntHash implementation.
6 SIMD sliding window
At this point, we can go back to the sliding window minimum problem and use our
new rolling hash for it. Naturally, we would like to also use SIMD here. The
easiest is to adapt the Split method, since it is completely deterministic.
That ensures that when processing multiple chunks of a sequence, the steps in
each chunk (SIMD lane) are exactly the same.
Below is a relatively straightforward SIMD version of our initial code
Code Snippet 13 (and Code Snippet 31 for the optimized version). It takes as input an
iterator over [u32; 8] of 8 hashes (of independent chunks), and outputs a
[u32; 8] for each window of width \(w\) with the absolute position of the
leftmost minimizer. Some details:
Instead of comparing full 32-bit values, only the high 16-bits of each hash are compared.
The low 16 bits of each 32-bit word are used to store the position, so that
ties are automatically broken towards the minimal position. This means that
the position can only go up to \(2^{16}\), so the sequences needs to be chunked
into pieces shorter than that, say of length \(2^{15}\).
I already used .get_unchecked where needed.
Made sure that the .skip(w-1) at the end doesn’t become an if statement in
the inner loop (or even puts the inner closure behind a function call some cases).
implSplitSimd{/// Assumes that the input iterator has less than 2^16 elements.
/// Compares 32bit hashes by their upper 16 bits.
/// Returns positions of all window minima.
#[inline(always)]pubfnsliding_min(&self,w: usize,it: implIterator<Item=[u32;8]>,)-> implIterator<Item=[u32;8]>{typeS=Simd<u32,8>;assert!(w>0);letmutprefix_min=S::splat(u32::MAX);letmutring_buf=RingBuf::new(w,prefix_min);// We only compare the upper 16 bits of each hash.
// Ties are broken automatically in favour of lower pos.
letval_mask=S::splat(0xffff_0000);letpos_mask=S::splat(0x0000_ffff);letmutpos=S::splat(0);letmutit=it.map(#[inline(always)]move|val|{letelem=(S::from(val)&val_mask)|pos;pos+=S::splat(1);ring_buf.push(elem);prefix_min=prefix_min.simd_min(elem);// After a chunk has been filled, compute suffix minima.
ifring_buf.idx()==0{letmutsuffix_min=ring_buf[w-1];foriin(0..w-1).rev(){suffix_min=suffix_min.simd_min(ring_buf[i]);ring_buf[i]=suffix_min;}prefix_min=elem;// slightly faster than assigning S::splat(u32::MAX)
}letsuffix_min=unsafe{*ring_buf.get_unchecked(ring_buf.idx())};(prefix_min.simd_min(suffix_min)&pos_mask).into()},);// This optimizes better than it.skip(w-1).
it.by_ref().take(w-1).for_each(drop);it}}
Code Snippet 80:
SIMD version of the Split algorithm.
6.1 Results
We can benchmark this code in a few ways:
Take a test, compute hashes, and then directly stream them into the sliding window minima.
Take a text, compute hashes to a vector, then read the vector to compute
sliding window minima.
Like 2, but only measure the time for the sliding window minima.
As expected, method 2 reports close to 0.1ns more than method 3, exactly the
time taken to compute hashes.
As output, we test both taking just the sum of all positions, and writing all
positions to a vector.
Table 5:
Runtime of computing hashes and taking sliding window minima. The first column indicates whether hashes are streamed directly to the window minima or first written to a vector.
ps/window
sum positions
collect positions
streaming hashes
158
226
buffered hashes
192
210
We see that the fully streaming option that doesn’t use any memory is clearly
the fastest. Collecting the final result is faster when the intermediate results
are already in a vector. I’m not quite sure why. Possibly the allocation is
reused, but I don’t think that is actually the case.
Either way, this seems to be plenty fast, and I don’t see any more easy wins
here. Comparing (hash, pos) tuples as a single integer seems inevitable for
good performance. Maybe we could go down from (u16, u16) to (u8, u8), but
using 8-bit hashes is unlikely to be good enough when \(w\sim 20\). In that case,
each window has a probability somewhere around order \(256/w \sim 12\%\) of having
a duplicate minimizer, which will start to hurt the density.
One more thing to try could be to again process two SIMD units in parallel, but
also again, a quick and dirty test was quite a bit slower.
Takeaway
We can now do sliding window minima in 0.158 ns/kmer, or 3.3 cycles/8kmers.
That’s 60x faster than RescanDaniel, and 160x faster than QueueIgor!
Human genome results
For a human genome, consisting of roughly 3Gbp, we measure a runtime of 496 ms35,
which corresponds to 0.156 ns/kmer. Thus, the results we see transfer to
larger datasets without problems!
TODO7 Cleanup, Testing, Super-k-mers, and canonical k-mers
Currently most code is only written to get measure throughput, and not set up to
handle edge cases where e.g. the input length is not a multiple of \(8\). This
needs fixing, and then also some testing for correctness.
Most likely, an API that returns an iterator over super-k-mers (i.e., the
range of windows have the same minimizer) that possibly also includes the
bit-representation of the minimizer itself would be more convenient than the
current API that only returns the minimizer for each window.
In most bioinformatics applications, canonical k-mers are used instead. Most
likely this will require doubling large parts of the pipeline to compute
reverse-complement hashes and their minimizers, and then take the minimum
between the two, with some special tie-breaking in cases where the minimum is
shared.
Also the API should probably contain a fast method to convert between byte
and packed representations.
Groot Koerkamp, Ragnar, and Giulio Ermanno Pibiri. 2024. “The mod-minimizer: A Simple and Efficient Sampling Algorithm for Long k-mers.” In 24th International Workshop on Algorithms in Bioinformatics (Wabi 2024), edited by Solon P. Pissis and Wing-Kin Sung, 312:11:1–11:23. Leibniz International Proceedings in Informatics (Lipics). Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPIcs.WABI.2024.11.
Kazemi, Parham, Johnathan Wong, Vladimir Nikolić, Hamid Mohamadi, René L Warren, and Inanç Birol. 2022. “Nthash2: Recursive Spaced Seed Hashing for Nucleotide Sequences.” Edited by Peter Robinson. Bioinformatics 38 (20): 4812–13. https://doi.org/10.1093/bioinformatics/btac564.
Mohamadi, Hamid, Justin Chu, Benjamin P. Vandervalk, and Inanc Birol. 2016. “Nthash: Recursive Nucleotide Hashing.” Bioinformatics 32 (22): 3492–94. https://doi.org/10.1093/bioinformatics/btw397.
Roberts, Michael, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, and James A. Yorke. 2004. “Reducing Storage Requirements for Biological Sequence Comparison.” Bioinformatics 20 (18): 3363–69. https://doi.org/10.1093/bioinformatics/bth408.
Schleimer, Saul, Daniel S. Wilkerson, and Alex Aiken. 2003. “Winnowing: Local Algorithms for Document Fingerprinting.” In Proceedings of the 2003 Acm Sigmod International Conference on Management of Data. Sigmod/Pods03. ACM. https://doi.org/10.1145/872757.872770.
Read this on a wide screen to see the table of contents on the
left, and footnotes on the right. ↩︎
I had a
bug in earlier benchmarks that caused my timings to be 4x too fast. ↩︎
Headsup: all code I’m about to show is completely untested currently. ↩︎
Consider not skipping the code blocks. I made quite some effort to make
them clean and readable. Most code is fairly straightforward assuming you’re
comfortable reading Rust code. ↩︎
Legend: Orange: processed state, Bold outline: minimum of
the prefix/suffix, Bold character: minimum of the window, Blue: state in memory, Red: state removed from the queue, Green: running prefix minimum. ↩︎↩︎↩︎↩︎↩︎
The hash_kmers function here is part of
the Hasher trait that can be found in Code Snippet 32. ↩︎
In fact, the Buffered implementation is slightly more efficient
since it stores hashes instead of recomputing them \(w\) times. ↩︎
Citation needed, both for it being folklore and
for the original idea. ↩︎
I’m not
counting the for loop that iterates up to the variable length of the string as a
branch miss, because that will only be a single branch-miss at the end, not
\(O(1)\) per iteration. ↩︎
Feel free to skip this paragraph in case you are not
familiar with segment and Fenwick trees. ↩︎
Hah, get it? There’s at least two alternative interpretations ;) ↩︎
The ntHash paper doesn’t
actually provide a benchmark for this though. But the consensus seems to be that
buzhash is faster than classic multiplication based methods. ↩︎
The runtime does go down a bit to 2.3ns/kmer,
but we’ll postpone looking into this more, since full lto is absolutely terrible
for compile times, and thin lto is not sufficient. ↩︎
Maybe there is a reason for keeping it that I don’t see? Let me know! ↩︎
Again, the compiler disappoints a
bit. Maybe there is some tricky reason why it’s not actually redundant? ↩︎
This article has a nice
summary table of the register naming. ↩︎
Compare that to the 40 seconds it takes to convert the byte-encoded
FASTA sequence to a bit-packed representation using some lazy code. Why aren’t
we using bit-packed on-disk file formats anyway? ↩︎