- Questions and remarks on PTHash paper
- Ideas for improvement
- Implementation log
- Hashing function
- Bitpacking crates
- Construction
- Fastmod
- TODO Try out fastdivide and reciprocal crates
- First benchmark
- Faster bucket computation
- Branchless, for real now! (aka the trick-of-thirds)
- Compiling and benchmarking PTHash
- Compact encoding
- Find the \(x\) differences
- FastReduce revisited
- TODO Is there a problem if \(\gcd(m, n)\) is large?
- Faster hashing
- An experiment
- Compiler struggles
- Prefetching, at last
- TODO Prefetching with vectorization
- Inverting \(h(k_i)\)
- Another day of progress
- TODO Possible sorting algorithms
- Diving into the inverse hash problem
- Bringing it home
- Hash-inversion for faster PTHash construction
- Fast path for small buckets
- TODO Dictionary encoding
- TODO Larger buckets
- TODO Prefetching free slots
- Filling the last few empty slots needs very high \(k_i\)!
- Perfect matching for the tail
- Peeling for size-1 buckets
- Greedy peeling 1: Assigning from hard to easy
- TODO Peeling and cuckoo hashing for larger buckets.
- Sunday morning ideas
- Cuckoo hashing / displacing, for real now
- Displacing globally
- Cleanup and revisiting defaults
- TODO Sum instead of xor?
- Revisiting \(\alpha < 1\)
- Elias-Fano for the remap-dictionary
- Global iterative prioritizing
- Cleanup: removing peeling and suboptimal displacing code
- Some speedups to the displacement algorithm
- TODO Runtime analysis of displacement algorithm
- TODO Optimal prefetching strategy
- Are we close to the memory bandwidth?
- More sorting algorithm resources
- Partitioning to reduce memory latency
- Back from a break!
- Speeding up the search for pilots
- MultiplyReduce
- Linux hugepages?
- Dropping the bucket split?
- Query memory bandwidth
- Packing difference from expected position
- Sharding
- 128bit hashing
- Varying the partition size
- PtrHash, part 2
\[ %\newcommand{\mm}{\,\%\,} \newcommand{\mm}{\bmod} \newcommand{\lxor}{\oplus} \newcommand{\K}{\mathcal K} \]
NOTE: WIP paper is here.
Daniel got me excited about minimal perfect hashing functions (MPHFs), and then
twitter Rob asked for a rust implementation of PTHash (Pibiri and Trani 2021b, 2021a), and also
I have some ideas related to memory prefetching I want to play with, so here we
are: I’m working on a Rust implementation ptr-hash
at https://github.com/ragnargrootkoerkamp/ptrhash.
This post is a collection of thoughts, ideas, implementation notes, and experiments that came up while writing the code.
Twitter discussion is here.
(See also my note on BBHash (Limasset et al. 2017).)
Questions and remarks on PTHash paper
Probably some of these questions are answered by reading into the original FCH paper (Fox, Chen, and Heath 1992) – that’ll come at some later time.
FCH uses \(cm/\log_2 n\) buckets because each bucket needs \(\log_2 n\) bits storage, but for PTHash the bits per bucket is lower so this isn’t really the right formula maybe.
- Tweet: another option is to fix \(c = \alpha * \log(n) / d\) for some constant \(d\).
FCH uses \(\lceil cn / (\log_2n+1)\rceil\) buckets, but PTHash uses \(\lceil cn/\log_2 n\rceil\) buckets. To me, \(\lceil cn/\lceil \log_2n\rceil\rceil\) actually sounds most intuitive.
\(p_1=0.6n\) and \(p_2=0.3m\) seem somewhat arbitrary. Can this be tuned better? Could partitioning into more chunks help?
Anyway, this puts \(60\%\) of data in the first \(30\%\) of buckets (i.e. double the average size), and \(40\%\) of data in \(70\%\) of buckets (i.e. roughly half the average size).
I suppose the goal is to stitch two binomial distributions of bucket sizes together in a nice way to get a more even distribution over sizes.
How many bits of entropy are really needed? When \(n\) is a power of \(2\), we make sure to use all \(64\) bits instead of just \(\log_2 n\). But how many are really needed? Maybe as long as \(n \ll 2^{32}\), a simpler/faster \(32\) bit hash is sufficient?
And how large will \(n\) be at most? Can we reasonably assume \(2^{32} \approx 4\cdot 10^9\) as a bound?
Is it reasonable to compute and store hashes of all keys, and then change the API to let the user pass in the hash instead of the key?
What’s the probability of failure with the first global seed \(s\) that’s tried?
What are the largest \(k_i\) found in typical cases?
What if \(\gcd(m, n)\) is large?
Ideas for improvement
Parameters
- Better tuning of parameters may work?
- Partitioning into three or more chunks may speed up construction?
Align packed vectors to cachelines
Instead of packing \(r\) bit values throughout the entire vector, only pack them inside cachelines.
Prefetching
The query performance of this and other MPHFs seem to be limited by the latency of memory lookups. For this reason, most MPHFs minimize the number of memory lookups, and in particular PTHash uses one random memory access followed by one lookup in a small (cached) dictionary.
When processing a number of lookups sequentially, each lookup currently incurs a cache miss. This could be made much more efficient by doing \(k\) (e.g. \(k=16\)) cache lookups in parallel:
- first compute the array location to lookup for \(k\) keys (possibly using SIMD),
- then prefetch the \(k\) memory locations,
- then do the rest of the computation in parallel.
- If needed, do lookups in the \(free\) table in parallel.
This should mostly hide the memory latency and could give significant speedup. I’m not quite sure yet whether this would work best when done in batches (as described), or in streaming fashion, where we iterate over elements and prefetch memory for the element \(k\) iterations ahead. Algorithmica.org has a nice article on the streaming approach.
Faster modulo operations
There are a lot of % n, % p1, and % (m-p1) operations throughout the code.
I didn’t look into it yet, but possibly these are the bottleneck on the CPU latency.
First note that
\[
a\, \%\, b = a - \lfloor a/b\rfloor * b.
\]
This division by a constant can be computed efficiently using a trick which replaces the division by a multiplication with a precomputed inverse.
Using the formula of the wikipedia article we can precompute some constants to evaluate \(\lfloor a/b\rfloor\) in \(6\) operations and a % b in \(8\) operations.
(It might be that compilers already do this, but I don’t expect so.)
Some blogposts by Daniel Lemire (Thanks Daniel Liu ;)
- A fast alternative to the modulo reduction: Instead of a % b, compute a * b >> 64, assuming that \(a\) is uniform in \([2^{64}-1]\). This doesn’t seem to work well in practice though for PTHash, probably since this only uses the entropy in the high-order bits of \(a\).
- Faster remainders when the divisor is a constant: beating compilers and libdivide: Indeed, the C++ PTHash implementation already uses the fastmod library.
- More fun with fast remainders when the divisor is a constant, Lemire, Kaser, and Kurz (2019)
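To make the ‘multiply by the inverse’ trick concrete, here is a hedged sketch of the 32-bit variant (my own illustration of the idea, not the fastmod crate’s API): precompute \(M = \lceil 2^{64}/d\rceil\) once, after which a remainder costs two multiplications and a shift.

```rust
/// Precomputed magic for `x % d` with a fixed 32-bit divisor `d`.
/// A sketch of the idea behind Lemire-style fastmod, not the library itself.
struct FastMod32 {
    d: u64,
    m: u64, // ceil(2^64 / d)
}

impl FastMod32 {
    fn new(d: u32) -> Self {
        assert!(d > 1);
        Self { d: d as u64, m: u64::MAX / d as u64 + 1 }
    }

    /// Computes x % d without a division instruction.
    fn rem(&self, x: u32) -> u32 {
        let lowbits = self.m.wrapping_mul(x as u64);
        ((lowbits as u128 * self.d as u128) >> 64) as u32
    }
}
```

The 64-bit variant additionally needs the high half of a u128 * u64 product, which is what the next idea is about.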
Store dictionary \(D\) sorted using Elias-Fano coding
I doubt whether the memory savings here are worth the time overhead, but it’s an idea :shrug:.
How many bits of \(n\) and hash entropy do we need?
One bottleneck of fastmod64 is that it needs to compute the highest \(64\) bits of a u128 * u64 product.
If we can assume that \(n\) is at most \(40\) bits, and that \(44\) bits are sufficient entropy, then I think we could do away with the u128 * u64 multiplication and do everything inside u128.
Ideas for faster construction
Once the table is almost filled, determining the \(k_i\) becomes slower. Some ideas to speed this up:
- reversing murmurhash
- Finding a \(k_i\) such that \(position(x) := (h(x) \lxor h(k_i))\mm n\) is not taken becomes slow. Instead, we could keep a list of empty positions \(p\) and determine \(k_i\) as \(h^{inv}((p + j \cdot n) \lxor h(x))\) for different \(j\), assuming we can invert murmurhash. As it turns out, murmurhash2 is an invertible function with \(O(1)\) inverse! See this blogpost. Thus, we can easily find many \(k_i\) such that \(position(x, k_i)\) maps to an empty position.
- All buckets of size \(1\) can be directly assigned to slots this way.
- Size \(2\) and \(3\) buckets also need fewer tries than before.
The big drawback though is that the \(k_i\) values found will be uniform in \([0, 2^{64}]\).
- Cuckoo-hashing
- For sets of size \(1\) and \(2\) (and maybe \(3\)?) we could
displace an already taken slot if that is the last remaining slot needed to
fix the current bucket. Probably we want to only displace buckets of the same
size and never buckets of larger size.
I wonder though how useful this actually is. If the current bucket is hard to place, there is not really any reason a different bucket of the same size would be easier to fix.
Implementation log
A somewhat chronological list of notes and remarks.
Hashing function
For now I use murmur64a
, documented on the SMHasher
GitHub wiki.
Bitpacking crates
There are a lot of bitvector and bitpacking crates!
- bitvectors
- All of the below seem to do the same:
  - bitvec: \(30M\) downloads
  - bit-vec: \(30M\) downloads
  - fixedbitset: \(55M\) downloads
  No idea which is best; probably I’ll settle for the one below in sucds.
- sucds
- only \(60K\) downloads, but contains:
  - BitVector
  - fixed-width integer packing: CompactVector
    - Decoding seems somewhat inefficient
  - increasing-integer sequence packing: EliasFano
    - Giulio has lecture notes on this.
- succinct
- IntVector
  - Can not be constructed from slice/iterator of values.
  - Decoding seems somewhat inefficient
  - No updates in the past 2 years.
Construction
- Storing buckets as Vec<Vec<Key>> is bad for large keys, so now I store Vec<Vec<usize>>, but the nested Vecs still waste a lot of space and will cause allocation slowdowns. PTHash pushes onto a vector which is sorted later, which seems more efficient (see the sketch below).
- When testing \(k_i\), not only do we need to test that positions are not filled by previous buckets, but also we have to check that elements within the bucket do not collide. It is not sufficient that \(h(x, s)\) does not collide within buckets, since they could collide after taking the % n.
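A hedged sketch of that flat layout (my own illustration, not the actual PTHash or pthash-rs code): push (bucket, hash) pairs into a single vector and sort it, so each bucket ends up as a contiguous slice.

```rust
/// Group hashes by bucket without nested Vecs: one flat allocation, sorted by bucket.
/// `bucket_of` is a placeholder; the real bucket function is discussed below.
fn group_by_bucket(hashes: &[u64], m: u64) -> Vec<(u32, u64)> {
    let bucket_of = |hx: u64| ((hx as u128 * m as u128) >> 64) as u32;
    let mut pairs: Vec<(u32, u64)> = hashes.iter().map(|&hx| (bucket_of(hx), hx)).collect();
    // Sorting by (bucket, hash) also makes duplicate hashes within a bucket easy to spot.
    pairs.sort_unstable();
    pairs
}
```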
Fastmod
It seems that Daniel Lemire’s fastmod
C++ library has not yet been ported to
Rust, so I converted the few parts I need.
There is also strength_reduce
, which contains a similar but distinct algorithm
for a % b
that computes the remainder from the quotient.
TODO Try out fastdivide and reciprocal crates
First benchmark
I implemented these under a generic Reduce
trait.
just bench
at the linked commit at 2.6GHz
gives the following for \(10^7\) keys:
method | construction (s) | query (ns) |
---|---|---|
u64 | 10.474591 | 91 |
fastmod64 | 10.731583 | 55 |
fastmod32 | 9.911751 | 50 |
strengthreduce64 | 11.520939 | 56 |
strengthreduce32 | 10.002017 | 50 |
The u32
versions simply only use the lower \(32\) bits of the \(64\) bit hash.
This is not yet as fast as the fastest 28ns
reported in the PTHash paper (for
C-C encoding), but I also haven’t optimized anything else yet. Time for profiling.
Profiling: Looking at the flamegraph (cargo flamegraph
), and zooming in on the hash function, we see
A lot of time is spent on fold! The murmur2
function I use has signature
murmur2(bytes: &[u8], seed: u64)
, and even though my keys/bytes always correspond
to just a u64
, it’s iterating over them!
In the generated perf report
, we see
|
|
We can ignore the \(33\%\) for construction and only focus on querying here, where
we see that the index
function calls murmur2,
and a lot of time is spent
in both. In fact, murmur2
is not inlined at all! That explains the iterator
appearing in the flamegraph.
Thin-LTO: This is fixed by enabling link-time optimization: add lto = "thin"
to
cargo.toml
.
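Concretely, in Cargo.toml:

```toml
[profile.release]
lto = "thin"
```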
Rerunning the benchmark we get
method | construction, no LTO (s) | construction, thin-LTO (s) | query, no LTO (ns) | query, thin-LTO (ns) |
---|---|---|---|---|
u64 | 10.5 | 8.9 | 91 | 60 |
fastmod64 | 10.7 | 8.3 | 55 | 34 |
fastmod32 | 9.9 | 8.5 | 50 | 26 |
strengthreduce64 | 11.5 | 8.3 | 56 | 38 |
strengthreduce32 | 10.0 | 8.8 | 50 | 31 |
Sweet! 26ns
is faster than any of the numbers in table 5 of Pibiri and Trani (2021b)!
(Admittedly, there is no compression yet and the dictionary size is \(10\times\)
smaller, but still!)
More inlining:
Actually, we don’t even want the index()
function call to show up in our logs:
inlining it should give better instruction pipelining in the benchmarking hot-loop
|
|
and indeed, we now get
query (ns) | no LTO | thin-LTO | inline index() |
---|---|---|---|
u64 | 91 | 60 | 55 |
fastmod64 | 55 | 34 | 33 |
fastmod32 | 50 | 26 | 24 |
strengthreduce64 | 56 | 38 | 33 |
strengthreduce32 | 50 | 31 | 26 |
Conclusion: From now on let’s only use fastmod64
and fastmod32
. (I suspect
the 32bit
variant does not have sufficient entropy for large key sets, so we
keep the original 64bit
variant as well.)
Faster bucket computation
After inlining everything, the generated assembly for our test is just one big
\(\sim 100\) line assembly function. Currently, the bucket(hx)
function (that
computes the bucket for the given hash hx = hash(x, s)
) looks like
|
|
The assembly looks like this:
|
|
We see that there are quite some branches:
- The first and second branch of the
bucket()
function are both fully written out. - They use the same number of instructions.
- One branch does add 0, I suppose because the CPU likes equal-sized branches.
- There are redundant index-out-of-bounds checks.
- The last line, the array index itself, has \(8000\) samples: \(57\%\) of the total samples is this single assembly instruction!
Branchless bucket index:
I tried rewriting the bucket()
function into a branchless form as follows:
|
|
but this turns out to be slower than the original, probably because the new
assembly now needs a lot of cmov
instructions. (In particular, rem
contains
a u128
and a u64
, so needs \(3\) mov
’s and \(3\) cmov
’s.)
|
|
No bounds check:
We can replace k[index]
by unsafe { *k.get_unchecked(index) }
.
This doesn’t give much performance gain (less than the few ns
of measurement
noise I have), but can’t hurt. It removes the final cmp; jbe
lines from the assembly.
Fix tests: Instead of ignoring test results we can accumulate the resulting
indices and pass them to black_box(sum)
. This prevents the compiler from
optimizing away all queries. Somehow this affects the reported timings. I now get:
query (ns) | no LTO | thin-LTO | inline index() | fixed tests |
---|---|---|---|---|
u64 | 91 | 60 | 55 | 63 |
fastmod64 | 55 | 34 | 33 | 35 |
fastmod32 | 50 | 26 | 24 | 20 |
strengthreduce64 | 56 | 38 | 33 | 38 |
strengthreduce32 | 50 | 31 | 26 | 30 |
I’m confused how the fastmod32
timing went down, but the fastmod64
went up.
(Typical situation when you do constant profiling and there are more numbers
than you can make sense of, sadly.)
Branchless, for real now! (aka the trick-of-thirds)
I’m still annoyed by this branching. Branches are bad! They may be fast for now,
but I kinda have the long term goal to put SIMD on top of this and that doesn’t
go well with branching. Also, branch-misprediction is a thing, and the \(70\% -
30\%\) uniform random split is about as bad as you can do to a branch predictor.
The code from earlier does fix it, but at the cost of a whole bunch of mov
’s and
cmov
’s.
But there is a trick we can do! \(p_1\) and \(p_2\) are sort of arbitrarily chosen, and all the original paper (Fox, Chen, and Heath 1992) has to say about it is
Good values for these two parameters are experimentally determined to be around \(0.6n\) and \(0.3m\), respectively.
Thus I feel at liberty to change the value of \(p_2\) from \(0.3m\) to \(m/3\). This gives: \[m-p_2 = m-m/3 = \frac 23 m = 2p_2.\] The cute trick is that now we can use that \[x \mm p_2 = (x \mm (2p_2)) \mm p_2 = (x \mm (m - p_2)) \mm p_2,\] and since \(0\leq x \mm (2p_2) < 2p_2\), computing that value modulo \(p_2\) is as simple as comparing the value to \(p_2\) and subtracting \(p_2\) if needed.
Thus, we modify the initialization to round \(m\) up to the next multiple of \(3\), and change the bucket function to
|
|
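Roughly, the resulting bucket function looks like this sketch (my own reconstruction from the description above, with plain % standing in for the precomputed fastmod instances):

```rust
/// Sketch of the trick-of-thirds bucket function. `p1` is the 60% threshold on
/// the full 64-bit hash, `p2 = m / 3`, and `m` is rounded up to a multiple of 3.
fn bucket(hx: u64, p1: u64, p2: u64) -> u64 {
    // One reduction modulo 2*p2 = m - p2 is shared by both cases.
    let t = hx % (2 * p2);
    if hx < p1 {
        // Dense buckets [0, p2): fold t into [0, p2) with a compare-and-subtract
        // instead of a second modulo.
        if t >= p2 { t - p2 } else { t }
    } else {
        // Sparse buckets [p2, m).
        p2 + t
    }
}
```

The point is that the single expensive reduction modulo \(2p_2 = m - p_2\) is shared by both cases; the remaining compares can become cmov-style code rather than a second modulo.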
The new timings are
query (ns) | no LTO | thin-LTO | inline index() | fixed tests | \(p_2 = m/3\) |
---|---|---|---|---|---|
u64 | 91 | 60 | 55 | 63 | 54 |
fastmod64 | 55 | 34 | 33 | 35 | 27 |
fastmod32 | 50 | 26 | 24 | 20 | 19 |
strengthreduce64 | 56 | 38 | 33 | 38 | 33 |
strengthreduce32 | 50 | 31 | 26 | 30 | 21 |
fastmod32
didn’t get much faster, but all others went down a lot! Let’s check
out the generated assembly for fastmod64
:
|
|
Huh what?! I don’t really know what is going on here, but I do know that the compiler
just vectorized our code for us! All the vp
instructions are vector/packed
instructions! Magic! This probably explains the big speedup we get for fastmod64
.
Closer inspection: As it turns out, the 32bit
versions were already
auto-vectorized before we implemented this last optimization. Probably because
the FastMod32
type is smaller (two u64
) than the Fastmod64
type (u128
and u64
) and hence easier to vectorize (and similar for StrengthReduce32
).
But either way this last trick helps a lot for the 64bit
variants that will
be needed for large hashmaps.
Compiling and benchmarking PTHash
Compiling PTHash was very smooth; just a git clone
, submodule init, and
building just worked :)
Running a benchmark similar to the ones here:
|
|
reports a query performance of 26ns/key
, similar to the fastmod64
performance I get.
Note that PTHash uses fixed-width bitpacking here, while I just store u64
’s
directly, but this shouldn’t affect the time too much.
Vectorization: More interestingly, PTHash is not auto-vectorized by my
compiler, so I’m surprised it performs this well. Maybe the vpgatherqq
instruction just doesn’t give that much speedup over sequential lookups – I
don’t know yet. But still, my equivalent code using fastmod64
with \(p_2 =
0.3m\) has 35ns/key
vs 26ns/key
for PTHash. Confusing.
Branching: PTHash compiles to a branchy version of fastmod(x, p2) or fastmod(x, m-p2)
, but is still fast.
Compact encoding
Adding fixed-width encoding was easy using the sucds
CompactVector
type.
The generated code doesn’t look so pretty though – it branches on whether the
bits cross a usize
boundary, whereas PTHash’s implementation does an unaligned
read from a *u8
to avoid this, which seems nicer.
Find the \(x\) differences
At this point, both pthash-rs
and the original PTHash
support encoding by a
single compacted vector, but there is still quite some time difference: 31ns
vs 25ns
. Time to find all the differences.
This may or may not be the best approach, but I decided to put the assemblies side-by-side.
Exhibit A: the missing modulo. Ok, I won’t bore you with the full assembly, but I found this in the PTHash assembly:
|
|
with nothing similar in the rust version. Turns out that this is \(0.6 \cdot (2^{64}-1)\). Indeed, the code is:
|
|
Note how it does \(hash < 0.6 \cdot 2^{64}\) instead of \(hash \mm n < 0.6 n\) as written in the paper for what FCH does. Basically we can completely drop the \(\mm n\) there! Sweet! That’s \(1\) of \(3\) modulo operations gone :)
FastReduce revisited
Earlier I mentioned the blogpost A fast alternative to the modulo reduction: to map \(h\in[2^k]\) to \([n]\) (\([n] := \{0, \dots, n-1\}\) here), instead of taking \(h\mm n\), one can do \(h*n/2^k\), which is much faster to evaluate. The problem with this approach is that it only uses the \(\log_2 n\) high-order bits of \(h\), discarding some necessary entropy.
On second thought, it seems like this may still be useful though. There are two modulo operations in the PTHash algorithm:
- In the
bucket()
function, mapping from \([2^{64}]\) to \([p_2]\) or \([m - p_2]\). - In the
position()
function, mapping from \([2^{64}]\) to \([n]\).
And there is one related check:
- In the
bucket()
function, check whether \(h < p_1 \cdot 2^{64}\) (or \(h \mm n \leq p_1 \cdot n\)), which is actually also a reduction operation.
While we cannot use FastReduce
twice, I think it may still be possible to use
it once. In particular, it should be fine to use a low-entropy hash for
bucket-selection since we anyway have collisions there – that’s the point of
bucketing in the first place.
A second optimization may be possible if we could find a function that uses the
lower \(\log_2 n\) bits of the hash for position()
. Then we can access \(2\cdot
\log_2 n\) bits of entropy in total which should be sufficient to avoid
collisions with constant probability (via the birthday paradox).
One idea is something like \(h \mm 2^{\lceil{\log_2 n\rceil}}\), but this is not quite fair so it may not work out nicely. But then again, maybe we could use that for the bucketing modulo, since it possibly doesn’t require (as much) fairness.
Or maybe we can just take the lower \(32\) bits of \(h\) and do \((h\mm 2^{32}) * n / 2^{32}\). That should probably work just fine :)
So, we have 7 possible reduction functions now:
- FastMod64, same as mod but faster.
- FastMod32L, taking the lower \(32\) bits modulo \(n\).
- FastMod32H, taking the higher \(32\) bits modulo \(n\).
- FastReduce64: \((h * n) \gg 64\)
- FastReduce32L, fastreduce on the \(32\) low bits: \(((h \mm 2^{32}) * n) \gg 32\)
- FastReduce32H, fastreduce on the \(32\) high bits: \(((h \gg 32) * n) \gg 32\)
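For reference, one-line sketches of the Reduce variants above (assuming \(n < 2^{32}\) for the 32-bit ones; illustrative, not the exact trait implementations):

```rust
/// (h * n) >> 64: uses the full 64-bit hash.
fn fast_reduce64(h: u64, n: u64) -> u64 {
    ((h as u128 * n as u128) >> 64) as u64
}
/// Uses only the low 32 bits of the hash; assumes n < 2^32.
fn fast_reduce32l(h: u64, n: u64) -> u64 {
    ((h & 0xffff_ffff) * n) >> 32
}
/// Uses only the high 32 bits of the hash; assumes n < 2^32.
fn fast_reduce32h(h: u64, n: u64) -> u64 {
    ((h >> 32) * n) >> 32
}
```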
Excluding mod (which is never better than FastMod64), we can make \(36\) combinations from this to use for the two modulo operations.
Not all combinations end up working (because of lack of entropy when e.g. only
the \(32\) high bits are used). It’s clear that FastMod64
tends to be slow, and
that Reduce
operations are usually (but not always) faster than Mod
operations.
The big problem with this benchmark though seems to be that timings are quite inconsistent (variance of a few nanoseconds), and that the quality of generated code (auto-vectorization and loop unrolling) depends on a lot of things and is somewhat unpredictable/inconsistent.
bucket reduce (\(\mm m\)) \ position reduce (\(\mm n\)) | FM32L | FM32H | FM64 | FR32L | FR32H | FR64 |
---|---|---|---|---|---|---|
FastMod32L | 20 | 27 | 19 | 20 | | |
FastMod32H | 18 | 26 | 19 | | | |
FastMod64 | 29 | 27 | 33 | 28 | 26 | 27 |
FastReduce32L | 18 | 23 | 18 | | | |
FastReduce32H | | | | | | |
FastReduce64 | | | | | | |
Note that the last two rows fail in particular, because they strongly correlate with the check whether elements belong in a small or large bucket, \(h(x, s) < p_1\).
TODO Is there a problem if \(\gcd(m, n)\) is large?
Faster hashing
The current implementation uses Murmur2
to hash both the key \(x\mapsto h(x, s)\) and the
pilot \(k_i \mapsto h(k_i, s)\). While this hash is fast, it’s still quite some
instructions. Instead, especially for the \(h(k_i, s)\), we may be able to get
away with either no or a much simpler hash.
No hash: A first try of \(h_0(k_i, s) := k_i\) results in failures because the \(k_i\) become too large.
Multiplication hash (MulHash
): So we do need more mixing of bits rather than just
incrementally increasing \(k_i\) starting at \(0\). One common way of doing that is
simply to multiply by a large semi-random \(64\)-bit integer. In particular,
Murmur
also does this, so let’s just reuse their mixing constant and set:
\[
h_1(k_i, s) := (0xc6a4a7935bd1e995 \cdot k_i) \mm 2^{64}.
\]
(I tried looking for documentation on why this constant was chosen, but there
doesn’t seem to be more to it than it works.)
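In code this is a single wrapping multiplication (sketch):

```rust
/// MulHash: multiply by Murmur2's 64-bit mixing constant, keeping the low 64 bits.
const C: u64 = 0xc6a4a7935bd1e995;

#[inline]
fn mul_hash(k: u64) -> u64 {
    C.wrapping_mul(k)
}
```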
In experiments, this gives anywhere between \(1\) and \(4\) nanoseconds of speedup.
TODO Try xxhash
See github.
An experiment
Another fun comparison is here, where I use MulHash
for \(k_i\) and replace the
MurmurHash
for \(h(x)\) by simply the identity operation (since we’re testing on
uniform random \(x\) anyway):
|
|
|
|
MurmurHash
takes slightly less than half the instructions, but removing them
gives almost \(2.5\times\) speedup! My current thinking is that this is not so
much due to the reduced instruction count itself (the CPU is stalling anyway to
wait for memory), but rather due to the better pipelining it results in: when
loop iterations are shorter (in number of assembly instructions), pipelining can
look ahead more iterations, and hence does a better job at prefetching memory.
But even with this short loop, around two thirds of the time is still spent
waiting for memory.
Conclusion 1.: I should really write code with prefetching.
Conclusion 2.: It’s time to use perf stat
for some metrics on branch
mispredictions and instructions per cycle.
Compiler struggles
Auto-vectorization: The compiler is quite eager to generate vectorized assembly code. It’s quite unpredictable when auto-vectorization triggers, and it seems I have to keep at least one branch in the hot loop to prevent it. The vectorized code seems bad for a few reasons:
- Gather instructions (
vpgatherqq
) are slow. - Pipelining: It seems that pipelining works much better for the scalar version, being able to look ahead further and keeping busy while waiting for memory to load.
Even worse: Also, it did the following terrible thing. Starting with this piece of innocent looking code:
|
|
the compiler decided to create a branchless implementation of this if statement where the
false
branch is always executed. But that branch is super slow! Basically a
completely unnecessary read from main memory!
For now I’ll just completely remove the false
branch to prevent this issue…
Prefetching, at last
Without further ado, here we go:
|
|
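The real code is in the repository; a simplified, hedged sketch of the streaming variant looks roughly like this (a single FastReduce64-style position function instead of the full bucket/pilot logic, and the x86-only prefetch intrinsic):

```rust
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

/// Streaming queries: prefetch the pilot needed LOOKAHEAD iterations ahead,
/// so the memory latency overlaps with the computation of earlier iterations.
/// This is a sketch, not the ptr-hash implementation.
fn index_stream<const LOOKAHEAD: usize>(pilots: &[u8], hashes: &[u64], n: usize) -> usize {
    let bucket_of = |hx: u64| ((hx as u128 * pilots.len() as u128) >> 64) as usize;
    let slot_of = |hx: u64, pilot: u8| {
        let hp = (pilot as u64).wrapping_mul(0xc6a4a7935bd1e995);
        (((hx ^ hp) as u128 * n as u128) >> 64) as usize
    };
    let mut sum = 0;
    for (i, &hx) in hashes.iter().enumerate() {
        // Issue a prefetch for the iteration LOOKAHEAD steps in the future.
        if let Some(&hx_ahead) = hashes.get(i + LOOKAHEAD) {
            let ptr = &pilots[bucket_of(hx_ahead)] as *const u8;
            unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8) };
        }
        // By now the pilot for the current hash should already be in cache.
        sum += slot_of(hx, pilots[bucket_of(hx)]);
    }
    sum
}
```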
For \(L = 64\), this is around twice as fast as the non-streaming/non-prefetching version!
In our \(n=10^7\) benchmark, this reduces latency to 4.2ns
!!
For the larger \(n=10^8\) benchmark, latency is 7.5ns
, down from 28ns
of
the original PTHash paper! (But note that I don’t do any compression here.)
And this is without vectorization still :)
TODO Prefetching with vectorization
Preliminary results: this seems tricky to get right and tends to be slower. It
sometimes generates unwanted gather
instructions, but even when it doesn’t
it’s slow although I don’t know exactly why yet. Does pipelining work with SIMD instructions?
Inverting \(h(k_i)\)
When there are only a few empty slots left, we want to find \(k\) such that \[ R(h_1(x) \lxor h_2(k), n) = p \] for some fixed position \(p\).
Using FastReduce64
as reduction \(R\), we want
\[
\left\lfloor\frac{(h_1(x) \lxor h_2(k)) * n}{2^{64}}\right\rfloor = p
\]
i.e.
\[
2^{64} p \leq (h_1(x) \lxor h_2(k)) \cdot n < 2^{64}(p+1)
\]
i.e.
\[
\frac{2^{64} p}{n} \leq h_1(x) \lxor h_2(k) < \frac{2^{64}(p+1)}{n}.
\]
This basically fixes the high \(r:=\log_2(n)\) bits of \(h_1(x) \lxor h_2(k)\).
(I suspect it only leaves two possible options for the highest \(\lfloor
r\rfloor\) bits.)
Thus, we infer (with some handwaving)
\[
h_2(k) = \frac{2^{64}(p+\frac 12)}{n} \oplus h_1(x) \oplus A =: X \oplus A
\]
where \(X\) is the target constant and \(A\)’s upper \(r\) bits are \(0\), i.e. \(A < 2^{64-r}\).
(TODO: Maybe we can just set the lower \(64-r\) bits of \(X\) to \(0\). Not sure.)
We could solve this when using Murmur
for \(h_2\), but in practice we use a much
simpler MulHash
: \(h_2(k) = (C\cdot k) \mm 2^{64}\). Thus, we must solve
\[
(C \cdot k) \mm 2^{64} \in \{ X \oplus A : 0\leq A < 2^{64-r}\}
\]
for \(k\).
Now, since \(C\) is odd it has a multiplicative inverse \(C’\) modulo \(2^{64}\), with \((C \cdot C’)\mm 2^{64} = 1\), so we’re looking for \[ k \in \{ (C’ \cdot (X \oplus A)) \mm 2^{64} : 0\leq A < 2^{64-r}\}. \]
This makes it very easy to find arbitrary solutions! The big drawback is that this results in ‘random’ \(k\) in \([2^{64}]\), while for compression purposes we much prefer finding small (or even minimal) \(k\).
Given the size of the set, the expected value of the smallest solution \(k\) is \(2^r = n\). But I’m not sure if this is solvable efficiently. We could first look for solutions where the \(64-r\) high bits of \(k_i\) are \(0\), and if none are found we can relax to \(64-r-1\) high bits of \(0\), and so on. But does that even help?
Remember that \(C\) is a mixing constant and hence both \(C\) and \(C’\) are quite ‘random’ with lots of \(1\) bits. Maybe we can choose \(C\) to have good mixing but also simplify this inverse? Or can it always be done? I don’t know..
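(As an aside, the inverse itself is cheap to compute for any odd \(C\); a sketch via Newton–Hensel lifting, where each step doubles the number of correct low bits:)

```rust
/// Multiplicative inverse of an odd constant modulo 2^64.
fn mul_inverse(c: u64) -> u64 {
    assert!(c & 1 == 1, "only odd constants are invertible mod 2^64");
    // x = c is already correct modulo 8 (odd^2 = 1 mod 8), i.e. 3 correct bits;
    // five Newton steps give 3 -> 6 -> 12 -> 24 -> 48 -> 96 >= 64 correct bits.
    let mut x = c;
    for _ in 0..5 {
        x = x.wrapping_mul(2u64.wrapping_sub(c.wrapping_mul(x)));
    }
    debug_assert_eq!(c.wrapping_mul(x), 1);
    x
}
```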
Another day of progress
- Explored ways to solve the hash-inversion problem above. Seems that an \(O(64-r)\) solution is possible, and maybe even \(O( r)\) but that’s probably more annoying. (Either way both will be \(O(1)\) for \(n < 2^{64}\) ;) There are a lot of interesting things going on to be written down and formalized later.
- Added a binary src/bin/bucket_sizes.rs -n <n> -c <c> -a <a> to print a nice table with number of buckets of each size and related statistics.
- Replaced Vec::sort() by radsort::sort, which is roughly \(2\times\) faster for uniform u64 hashes. TODO: Probably it will be faster to only sort by the highest \(n\) bits (or \(32\) bits), i.e. by only sorting on buckets, since most buckets will have size \(1\).
- I’m playing with the idea of implementing some kind of interpolation sort
algorithm that just inserts things directly in the right place in an array of
Option<NonZero<usize>>
of size \((1+\epsilon)n\) or maybe \(n + C \cdot \sqrt n\) and then runs a collect on this. Should work quite well I think.
TODO Possible sorting algorithms
- Robinhoodsort
- Bounds on linear probing
- Flashsort (wikipedia, article)
- Drawback: bad cache locality when writing out buckets. Maybe just write to \(O(\sqrt n)\) buckets (should fit in L2 cache ideally) and then sort each bucket individually.
Diving into the inverse hash problem
So we want to find the minimal \(k\) such that
\[
(C\cdot k) \mm 2^{64} \in \{X \oplus A : 0 \leq A < 2^{64-r}\}
\]
for some given \(X\), i.e. the \(r\) high bits of C*k
as \(64\) bit integers should
be equal to the \(r\) high bits of \(X\). Let \(\K\) be the set of \(k\) for which this
holds, so that we are interested in \(\min \K\).
A first thought is: can we first find some valid \(k=k_0\), and then hope that the space of solutions \(\K\) has some kind of affine structure so that we can easily translate the initial \(k_0\) to a minimal solution? So let’s investigate the linear structure first: let’s find solutions that have \(r\) leading zeros, i.e. \[\K_0 := \{k : C \times k < 2^{r’}\},\] where I will use \(\times\) for \(64\)-bit multiplication and \(r’ := 64-r\).
We definitely expect \(\K_0\) to have some structure: if \(C\times k_1 < 2^{r’}\) and \(C\times k_2 < 2^{r’}\), then \(C\times (k_1 + k_2) < 2^{r’+1}\), which is almost good enough, and similarly, \(-2^{r’} \leq C\times (k_1 - k_2) \leq 2^{r’}\), so that either \(k_1 - k_2\) or \(k_2 - k_1\) is also a solution.
Thus, consider those \(k\in \K_0\) with small absolute value (modulo \(2^{64}\), so the absolute value \(|k|\) of \(k\) is \(\min(k, 2^{64}-k)\)). Then, each difference \(k_d\) between two consecutive elements of \(\K_0\) must occur in \(\K_0\) either as \(k_d\) or as \(-k_d\).
So let’s write some code to determine all possible differences. I’m doing this for \(32\)-bit integers first since we can quickly iterate over them:
|
|
Result:
|
|
WOW: For each \(r\), it turns out there are only \(3\) possible differences between adjacent elements of \(\K_0\)! And also note that (when there are \(3\) distinct values) the largest one is always the sum of the smaller two. What is more: when we increase \(r\) by one, the new differences always include one or two of the previous differences, and some small linear combination of them. I.e. \((a,b,c) \mapsto (b, c, b+c)\) or \((a,b,c) \mapsto (a, a+c, b+c)\).
This inspires a much faster algorithm to find these possible differences: for \(r\) from \(1\) to \(32\), start with differences \(\Delta_0 := \{1\}\) for \(r=0\), and then to get \(\Delta_r\) from \(\Delta_{r-1}\) first make all small linear combinations of elements of \(\Delta_{r-1}\) and then start with \(k=0 \in K_0^r\), and then keep increasing \(k\) by all the possible delta’s to see if \(k+\delta\in K_0^r\). If this is the case add \(\delta\) to \(\Delta_r\), and otherwise keep going. Once \(\Delta_r\) reaches size \(3\), we can stop assuming our hypothesis that \(|\Delta_r| \leq 3\) is true. (TODO: This will require a proof eventually.) See the implementation here on GitHub.
Also, let’s have a look at what \(C \times \delta\) looks like in binary for all \(\delta \in \Delta_r\):
|
|
We see that the first \(r\) bits of \(C\times \delta_r\) are either all \(0\) or all \(1\), so \(-2^{32-r} < C \times \delta_r < 2^{32-r}\).
So using this knowledge, how do we find \(\min \K\) for an arbitrary constant \(X\)? Well, we can start with \(k = 0\) and \(r=0\), and then keep trying to add elements of \(\Delta_{r+1}\) until the \(r\)’th-highest bit of \(C\times k\) is the same as the one in \(X\). Then we increase \(r\) and repeat until all highest \(r\) bits are correct.
|
|
Running this on some random input for \(r=32\) gives for example:
|
|
In this example we are somewhat lucky and \(k=2\) happens to have the \(10\) highest bits correct, but generally the number of correct bits grows by a few each step.
I have written some code to verify the correctness and minimality of the result, and indeed results are always identical to a bruteforce search starting at \(k=0\).
We can also make some statistics on the value of \(\min \K^r\) compared to \(2^r\). Since \(|\K^r| = 2^{64-r}\) we expect the smallest solution to be around \(2^r / 2\). In practice this seems to be mostly the case, but not quite exactly. Over a total of \(10^8\) samples (which takes \(40\) seconds) we get the following ratios of \(k/2^r = \min(\K^r)/2^r\):
|
|
Indeed, the average value of \(k/2^r\) is typically just above \(0.5\) or at least below \(1.0\), but occasionally jumps to above \(4\), in particular around \(r\in \{20, 21, 22, 23\}\). I’m really not quite sure why, but this particular range seems to have something weird going on, as also seen by the maximum \(k/2^r\) ratio of \(16\). (Note that this is not a bug with our fast inversion algorithm. The bruteforce algorithm confirms these results.)
Most likely this is caused in some way by the presence of a very small
\(\delta_r\) for these \(r\) (marked with a *
), and this correlating with all
\(C\times \delta_r\) not only being \(< 2^{64-r}\) (in absolute value) but being either very close to it
or very small.
|
|
Conclusion: We now have a fast \(O( r)\) algorithm to solve \(C\times k = X \lxor A\) with \(0\leq A < 2^{64-r}\). But this does not yet solve the full problem since we actually wanted \[ \frac{2^{64} p}{n} \leq X \lxor (C\times k) < \frac{2^{64}(p+1)}{n} \] and we did some handwaving to assume that implies the topmost \(r\) bits of \(C\times k\).
Bringing it home
Let’s start with a (slightly pathological, but illustrative) \(32\) bit example: \(n=511\), \(p=127\), so we want \[ \frac{2^{32}\cdot 127}{511} \leq X \lxor (C\times k) < \frac{2^{32} \cdot 128}{511}. \] In binary, the lower and (inclusive) upper bound are:
|
|
These have a longest common prefix 0
of length \(r=1\), so we could now use the
solution above to enumerate solutions \(k\) for \(r=1\) and then for each of them
check if \(X \lxor (C\times k)\) is indeed in this interval. But this is quite
inefficient, as the actual interval only has length \(2^{32}/511 \approx 2^{23}\), so that
each possible \(k\) only works with probability \(2^{23} / 2^{32 - 1} = 2^{23-31} =
2^{-8} = 1/256\), indicating that we may have to try a lot of distinct \(k\).
Instead, we can split the interval into two ranges:
|
|
Both range \(r_1\) and \(r_2\) now cover more than half of the values with the prefix of length \(r=9\) resp. \(r=10\) bits, so a suitable \(k\) should be found within a few tries in each case, and we can simply take the minimum of the two solutions.
One additional optimization that we can do is that instead of computing
solutions for the two ranges independently, we can first compute a \(k_0\) that
gives the right bits in the LCP of low
and high
, and then extend from there
towards the two disjoint intervals.
The final code is as follows (and includes some more handling of edge cases):
|
|
This is \(O(64)\) and takes around a minute to invert \(10^8\) hashes.
Conclusion: We can now efficiently find \(k_i = O(n)\) such that \(h_1(x) \lxor
h_2(k_i)\) maps to a chosen free slot (when \(h_2\) is a FastReduce
instance).
This should allow us to fill the last slots of the table much faster.
Hash-inversion for faster PTHash construction
So now we have a fast way to find \(k_i\) for the tail of the last \(t\) buckets. We will assume that these buckets all have size \(1\). (Otherwise decrease \(t\).) Let \(F\) be the set of free positions once the head of \(m-t\) buckets has been processed. We always have \(|F| \geq t\) and when \(\alpha = 1\) we have \(|F| = t\). We can then implement two strategies:
- Sequential
- Iterate over buckets and free slots in parallel, matching each bucket to a slot. Then compute the \(k_i\) that sends each bucket to the corresponding free slot. This will give \(k_i\sim n\) in expectation, uses \(t \cdot \log_2(n)\) bits in total, and runs in \(O(t)\).
- Greedy
- For each bucket (in order), compute \(k_i(f)\) for each candidate slot \(f\), and choose the minimal value. When \(\alpha=1\) this gives \(k_i \sim n/f\) and runs in \(O(t^2)\). The total number of bits is \[ \sum_{f=1}^t \log_2(n/f) = t\log_2(n) - \log_2(t!) \sim t (\log_2(n) - \log_2(t)) \] For \(t=O(\sqrt n)\), this saves up to half the bits for these numbers.
In some quick experiments with \(n=10^8\), the sequential strategy seems to give at most around \(15\%\) speedup (\(35\) to \(30\) seconds for \(t=10000\)), which is not as much as I had hoped. This seems to be because relatively a lot of time is also spent on finding \(k_i\) for the last buckets of size \(2\) and \(3\).
Fast path for small buckets
For small buckets (size \(\leq 4\)) it pays off to use a code path that knows the
explicit bucket size and processes a fixed size &[Hash; BUCKET_SIZE]
array
instead of an arbitrary sized slice &[Hash]
. This allows for better code generation.
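A hedged sketch of the dispatch (the position function and the free-slot representation are simplified stand-ins, and collisions within a bucket are ignored):

```rust
/// Dispatch small buckets to a monomorphized fixed-size search so the inner
/// loop over the bucket's hashes is fully unrolled.
fn find_pilot(bucket: &[u64], free: &[bool]) -> Option<u64> {
    match bucket.len() {
        1 => find_pilot_fixed::<1>(bucket.try_into().unwrap(), free),
        2 => find_pilot_fixed::<2>(bucket.try_into().unwrap(), free),
        3 => find_pilot_fixed::<3>(bucket.try_into().unwrap(), free),
        4 => find_pilot_fixed::<4>(bucket.try_into().unwrap(), free),
        _ => find_pilot_slice(bucket, free),
    }
}

fn find_pilot_fixed<const L: usize>(bucket: &[u64; L], free: &[bool]) -> Option<u64> {
    (0u64..256).find(|&p| bucket.iter().all(|&hx| free[slot_of(hx, p, free.len())]))
}

fn find_pilot_slice(bucket: &[u64], free: &[bool]) -> Option<u64> {
    (0u64..256).find(|&p| bucket.iter().all(|&hx| free[slot_of(hx, p, free.len())]))
}

/// Placeholder position function: reduce(hx ^ MulHash(pilot), n).
fn slot_of(hx: u64, pilot: u64, n: usize) -> usize {
    (((hx ^ pilot.wrapping_mul(0xc6a4a7935bd1e995)) as u128 * n as u128) >> 64) as usize
}
```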
TODO Dictionary encoding
The dictionary will be quite dense for numbers up to some threshold (say \(1024\)), and sparser afterwards. We can encode the small numbers directly and only do the dictionary lookup for larger ones.
- TODO: Figure out if the branch is worth the savings of the lookup.
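A sketch of the hybrid scheme (threshold and layout are illustrative):

```rust
/// Hybrid pilot encoding: values below THRESHOLD are stored inline, larger
/// values are stored as THRESHOLD + an index into a small dictionary.
const THRESHOLD: u32 = 1024;

fn decode(stored: u32, dict: &[u64]) -> u64 {
    if stored < THRESHOLD {
        stored as u64
    } else {
        // The branch predictor should almost always take the cheap path above.
        dict[(stored - THRESHOLD) as usize]
    }
}
```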
TODO Larger buckets
The largest bucket should be able to have size \(O(\sqrt n)\) without issues. From there it should slowly decay (TODO: figure out the math) to constant size. This could put the elements that are currently in the largest \(\sim 1\%\) of buckets all together in a few buckets, reducing the average size of the remaining buckets. Although the reduction seems only minimal, so this may not give too much benefit.
One way of achieving such a skew distribution might be to replace the partitioning of \(h\in [0, 2^{64})\) in \(m\) chunks, by a partitioning of \(h^2 \in [0, 2^{128})\) in \(m\) chunks.
TODO Prefetching free slots
Looking up whether slots in the array for a certain \(k_i\) are free is quite slow and memory bound. Maybe we can prefetch values for a few \(k_i\) ahead.
Also, the computation of position
could be vectorized.
Filling the last few empty slots needs very high \(k_i\)!
Have a look at the last two columns of this table, indicating the number of newly appearing \(k_i\) in each class and total number of distinct \(k_i\) up to then:
|
|
The buckets of size \(2\) and larger only need \(3649\) distinct \(k_i\) which can be represented using \(12\) bits each with dictionary encoding. But after the last single-element buckets we need \(17733\) distinct \(k_i\) which would require \(15\) bits! If we instead use \(\alpha=0.99\) things look much better:
|
|
Now no new pilot values are needed at all for the single-element buckets, and \(12\) bits is enough for each value, saving \(20\%\) of memory!
Looking closer at the buckets of size \(1\), we see that most new values are added for the last few elements, where large \(k_i\) are needed.
|
|
Perfect matching for the tail
Let’s for now only consider some tail of length \(t\) of buckets of size \(1\). We already saw two techniques to assign them: sequentially and greedily (giving the minimal \(k_i\)).
We also already discussed the possibility of cuckoo hashing, where inserting an element can ’evict’ already placed elements. This seems to need a bit more bookkeeping though, as for each placed element we must track the bucket it belongs to.
Another more complex way would be a perfect matching: Compute \(k_i(f)\) for each bucket \(i\) and each free slot \(f\) and find the weighted bipartite matching assigning buckets to free slots that minimizes e.g. \(\sum \log_2 k_i(f)\) (the total number of bits) or \(\max k_i(f)\) (the maximal number of bits).
This last case is interesting: we could fix a maximal \(K=2^b\) up front and then find all possible \(k_i < K\) for all buckets by simply trying them all. Then we find a perfect matching (if one exists) and use these \(k_i\) for the final buckets.
This raises the question:
Let \(G\) be a bipartite graph on \(|A|+|B|=n+n\) vertices. Let each vertex in \(A\) have \(d\) edges towards random vertices in \(B\). What is the smallest \(d\) for which a perfect matching exists with high probability?
If we can answer this, then we can find the \(d\) smallest \(k_i\) solutions for the last \(t\) buckets and solve the perfect matching problem. In this case all final \(k_i\) would be \(O(d\cdot n/t)\), for \(O(t \log_2(dn/t))\) bits total.
A closely related question is answered on stackoverflow which implies that we should take \(d = c + \ln n\).
I implemented this and it seems to work quite well! Take for example the tail of \(n=10^8\), \(c=7\), \(\alpha=1\) without this optimization:
|
|
and now the same where we run perfect matching on the last \(500000\) buckets/slots where we find all \(k_i\) below \(2^{12} = 4096\):
|
|
Observe that:
- The maximum \(k_i\) overall is now \(6655\) instead of \(50M\) (the value for one of the last buckets of size \(2\)), requiring half the bits to store.
- The number of distinct \(k_i\) has gone down from \(17695\) to \(4198\), requiring one less bit to store.
- Both the maximum \(k_i\) and the number of distinct \(k_i\) is between \(2^{12}=4096\) and \(2^{13}=8192\). I.e. dictionary compression on the full array does not provide any additional benefits anymore! (I’m not yet sure whether separate front-back encoding would provide benefits.)
For reference, here are the corresponding results when using \(\alpha = 0.99\) without tail optimizations.
|
|
- Both the maximum \(k_i\) and number of distinct \(k_i\) are a bit smaller here, mostly because the buckets of size \(2\) are much easier to place.
Conclusion: This works extremely well to handle the last size-\(1\) buckets. It seems to be a bit slower for now, but I think this can be fixed. (The matching itself is super fast – it’s rather that we now need to search up to a threshold for all buckets, rather than only for the last few buckets.)
TODO: See if we can make something similar work for the tail of buckets of size \(2\). There we don’t need perfect matching though, and simpler techniques such as peeling may suffice (as used in XOR-filters (Graf and Lemire 2020)). This is also somewhat similar to cuckoo hashing (Fan et al. 2014), which has a similar result but uses a slightly less structured approach. In particular Cuckoo hashing does a greedy ‘DFS’ without backtracking for augmenting paths, compared to matching/flow algorithms which do a normal DFS and are able to ‘reset’ \(k_i\) back to the lowest solution after an augmenting path through them has been found.
Peeling for size-1 buckets
Instead of full matching, let’s try peeling to match the size-1 buckets: Again we first choose a threshold \(K=2^b\) and find all suitable \(k_i < K\) for each bucket using bruteforce. Also as before, consider the induced bipartite graph.
Suppose that there is a vertex \(u\) of degree \(1\). Then we know that the single edge incident to it (say \(uv\)) must be part of the final matching, so we can greedily take it and remove any other edges incident to \(v\).
We can repeat this process: repeatedly choose a vertex \(u\) of minimal remaining degree, choosing a random edge \(uv\) incident to it, add \(uv\) to the matching, and remove any other edges incident to \(u\) and \(v\).
I was hoping that this would result in a perfect matching with high probability, but it seems that in practice this procedure typically leaves behind a few percent of vertices without any remaining edges. One example situation where this ‘deadlock’ could occur is when we have two fully connected subgraphs \(A=K_{x,x}\) and \(B=K_{y,y}\), and additionally a few extra edges from \(a \in A\) to \(B\). In this case, we must never take these ‘cross’ edges, but the algorithm described above doesn’t have such non-local information and could create an imbalance.
Not all hope is lost though:
- This approach may actually work well for buckets of size \(2\) and larger, where there are spare empty slots and we are not looking for a perfect matching.
- The partial matching found by peeling could be input to Dinic’ full matching algorithm, possibly reducing the number of iterations needed and providing overall speedup. But on the other hand, Dinic also matches all but a few percent of elements in its first round, so it’s not clear if it would help much.
Greedy peeling 1: Assigning from hard to easy
A simpler alternative to peeling is the following. As before choose a number of bits \(b\) and threshold \(K = 2^b\), and find all solutions \(k_i < K\) for all buckets of size \(2\). Then sort the buckets by increasing number of solutions. This leaves the buckets with more freedom of choice for last, hopefully giving them a larger probability of finding a match.
I had high hopes for this, but it seems to not work great. As expected it reduces the maximum \(k_i\) required, but (for \(n=10^8\)) it only goes down from \(3000\) to \(2500\), which is not even enough to save a bit in a direct representation.
TODO Peeling and cuckoo hashing for larger buckets.
We can process the buckets in various ways:
- By increasing number of total edges, as in previous section.
- By increasing number of remaining edges.
- By increasing number of remaining free slots covered by all edges.
This last method sounds more complicated than the second one, but is simpler in practice: Whenever we fill a slot we simply decrease all buckets mapping to it by \(1\), instead of having to check first whether those edges are still alive.
Sunday morning ideas
Dinic
Possible speedups for Dinic algorithm:
- It’s never needed to visit vertices with level \(\geq level(t)\) (other than
\(t\) itself). Do such vertices exist? How many of them? Would it speed things
up to skip over them?
Answer: Yes, they exist. But when matching \(1M\) buckets, there are only around \(100\) at level \(>level(t)\). Sometimes there are quite some (up to \(n/2\)) at level \(=level(t)\) though, so we could skip those.
So during level computation, we can stop as soon as we determine the level of \(t\), and leave all other levels at \(-1\), so they will never be processed during augmentation, which only visits vertices \(v\) with \(level(v) > level(u)\).
This saves up to \(25\%\) of time when matching \(1M\) buckets.
- For vertices in the left component at \(level(u) = level(t)-2\), instead of
doing a DFS, we could first check if any of the outgoing \(v\) has a free edge
to \(t\), by storing all taken/free \(v\) in a bitmask.
- Done: Instead I implemented a check that for vertices in level \(level(t)-1\), we
only check the edge to \(t\). This saves around \(30\%\) time of the
augment
step, and around \(15\%\) time of the entire Dinic algorithm.
Together these two optimizations bring the runtime on \(1M\) elements down from around \(3.5\) seconds to \(2.5\) seconds: \(30\%\) speedup.
New iterative greedy assignment idea
Possible improvement to Greedy peeling 1:
- Run the greedy linear-pass algorithm once.
- Then fix all the remaining unmatched ones to their first outgoing edge.
- Rerun the algorithm starting with these ‘hard cases’ already fixed.
- Repeat.
Will this converge? How many iterations? Is it better to fix hard cases on the fly, evicting previous ones as soon as the hard one is found?
Answer: Around 10-20 iterations are typically needed. These iterations are relatively fast, but the iterating to find all solutions below some threshold is quite slow.
We can try different sorting orders in which the buckets are processed:
- no sorting,
- by number of solutions \(k_i\) below the threshold (going from few to many),
- by smallest solution \(k_i\) (going from high to low).
Experiments show that sorting by number of solutions is the best, but this requires precomputing all solutions below the threshold which is quite costly in practice. Sorting by smallest solution is probably a better idea in the end, since this is something that needs to be computed for each bucket anyway.
Cuckoo hashing, again
The reason that this keeps coming up is because I think it will work very well, but I don’t yet see exactly how to implement this in a sufficiently nice/clean way. So I procrastinate by first implementing simpler (to implement) ideas.
Basically it’s the same as the iterative greedy assignment idea above, but displacing others on the fly, doing a random walk DFS until success.
- TODO: Do we choose outgoing edges at random? sequentially (wrapping around)? Always starting at the start of the list (leading to smaller \(k_i\))?
- Prefer displacing edges that only include one occupied slot. (Or if possible better, never take an edge with \(2\) (or more) slots already taken.)
Some statistics for bucket size \(2\) (\(n=10^8\)):
- \(\sim20\%\) of the buckets has size \(2\), and \(\sim20\%\) of the buckets has size \(1\).
- Bucket size \(2\) covers elements from percentile \(85\%\) to \(95\%\).
- The ’early’ buckets have success rate one in \(1/0.15^2 \sim 45\).
- The ’late’ buckets have success rate one in \(1/0.05^2 \sim 400\).
- So only \(1\) in \(9\) that works initially still works at the end.
- This is problematic, because even though in expectation most will find a solution with \(20\) candidates, some of the \(f = O(10^6)\) buckets will be unlucky and will need more like \(O(9\cdot \ln f)\) candidates, which corresponds to quite large \(k_i\) actually (\(45\cdot 9\cdot \ln f \sim 4000\)).
- Basically, all these cuckoo/displacement strategies are really mostly needed for the few buckets that lie a factor \(\ln f\) (or say more than \(3\times\)) above the average, so if we can fix those the rest should be relatively OK.
- Coming back to cuckoo hashing: Each candidate \(k_i\) maps the two elements to the \(15\%\) remaining slots. After most size-\(2\) buckets have been assigned, \(5\%\) of slots still remains empty. That means that every candidate edge consisting of \(2\) slots in this \(15\%\) has quite a high probability of including at least one empty slot: \(1-(2/3)^2 = 55\%\). So basically cuckoo hashing has a very high probability of finding edges with only one slot already filled, and \(1/9\) of the edges has both slots empty. So I think the DFS only needs to see roughly \(10\) edges in total before success, which is very little actually!
Cuckoo hashing / displacing, for real now
Ok so I implemented a quick first version and, lo and behold: It works great!
So for completeness the algorithm I run now:
- Put all buckets of fixed size (from \(2\) to \(5\) or so, but not \(1\) (TODO: maybe \(1\) works as well?)) on a stack.
- While the stack is not empty: find a \(k_i\) for the top of the stack that maps to all empty slots.
- If not possible, find a \(k_i\) that minimizes the number of collisions. Drop the colliding buckets from the taken slot array and push those buckets on the stack.
- Repeat until the stack is empty.
One tricky point is how to tie-break the selection of the minimal bucket. Always choosing the smallest \(k_i\) with the minimal number of collisions can lead to cycles where two buckets keep displacing each other. What seems to work: Choose the first \(k_i\) after any existing \(k_i\) (modulo \(2^b\)).
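For concreteness, here is a hedged sketch of the displacement loop (simplified position function, no special handling of collisions within a bucket, and a plain linear scan over pilots; the real implementation differs in these details):

```rust
/// Sketch of the displacement ("cuckoo") construction for one size class.
/// `buckets[b]` holds the hashes of bucket b; `n` is the number of slots and
/// `bits` the pilot width.
fn displace(buckets: &[Vec<u64>], n: usize, bits: u32) -> Vec<u64> {
    let max_pilot = 1u64 << bits;
    // Simplified stand-in for the real position function: reduce(hx ^ MulHash(pilot), n).
    let position = |hx: u64, p: u64| -> usize {
        let hp = p.wrapping_mul(0xc6a4a7935bd1e995);
        (((hx ^ hp) as u128 * n as u128) >> 64) as usize
    };
    let mut pilots = vec![0u64; buckets.len()];
    let mut slots: Vec<Option<usize>> = vec![None; n]; // slot -> occupying bucket
    let mut stack: Vec<usize> = (0..buckets.len()).collect();
    while let Some(b) = stack.pop() {
        // Search pilots starting just past the bucket's previous pilot (the
        // tie-breaking rule above), keeping the one with the fewest collisions.
        let start = pilots[b] + 1;
        let mut best_p = 0;
        let mut best_cost = usize::MAX;
        for i in 0..max_pilot {
            let p = (start + i) % max_pilot;
            let cost = buckets[b].iter().filter(|&&hx| slots[position(hx, p)].is_some()).count();
            if cost < best_cost {
                best_p = p;
                best_cost = cost;
                if cost == 0 { break; }
            }
        }
        // Evict the buckets currently occupying our slots: clear all their
        // slots and push them back on the stack to be re-placed later.
        for &hx in &buckets[b] {
            if let Some(other) = slots[position(hx, best_p)] {
                for &ohx in &buckets[other] {
                    slots[position(ohx, pilots[other])] = None;
                }
                stack.push(other);
            }
        }
        // Claim our slots.
        pilots[b] = best_p;
        for &hx in &buckets[b] {
            slots[position(hx, best_p)] = Some(b);
        }
    }
    pilots
}
```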
For \(n=10^7\):
- the \(450k\) buckets of size \(3\) need \(4k\) displacements, around \(1\%\);
- the \(600k\) buckets of size \(2\) need \(20k\) displacements, around \(3\%\).
For now the bottleneck is twofold:
Iterating over all buckets to find all solutions (edges) up-front is slower than necessary. In practice most solutions \(k_i\) are quite small and there is no need to iterate all the way to the upper bound up-front.
The bit width \(b\) has to be large enough to ensure each bucket has at least one solution. When the expected value of the smallest solution is \(\overline{k}\), the maximum value we need is roughly \(\overline{k} \ln b\) when there are \(b\) buckets of the current size. This is unfortunate because in practice \(b\) is of order \(1M\), so we need to go up to \(20 \overline{k}\), whereas maybe \(2\overline{k}\) would be enough for most of them.
TODO: We should fix this by allowing some buckets to displace larger buckets. Probably displacing buckets one larger than itself is good enough.
Displacing globally
A drawback with the implementation above is that it first finds all edges, and only then starts the algorithm. Much more efficient is to do things right from the start. We can make a single global displacement algorithm:
- Set a threshold \(k_{max} = 2^b\).
- Determine buckets and order by decreasing size.
- For each bucket, try to find the smallest \(k_i<2^b\) that places it without collisions. If so, move to the next bucket.
- Otherwise, find the \(k_i<2^b\) that minimizes the number of collisions, and (more importantly) minimizes the size of the colliding buckets. Then evict those other buckets (remove their \(k_i\) and push them on a stack), and set the \(k_i\) for the current bucket.
- While the stack is non-empty, repeat steps 2 and 3 for the top of the stack.
One drawback is that this needs some more memory since for each slot it needs to store the bucket it came from, and it’s also a bit slow since (in case collision-free matching is not possible) we need to lookup the size of each bucket in another random-memory-access.
Running it
This seems to work super well! My implementation isn’t as streamlined yet as the original that just finds the smallest \(k_i\) per bucket (which has a simpler and thus faster inner loop), but nevertheless gets the job done. Only around the last buckets of size \(2\) do displacements go above \(1\%\) of buckets. The main drawback is that it somewhat explodes at the last \(100\) to \(1000\) size-\(1\) buckets, where it has a harder time finding chains that lead to a free slot. Probably falling back to deterministic matching using Dinic is better here.
The average \(k_i\) per bucket-size using this method seems similar to the
original algorithm, but the main benefit is that even for \(n=10^8\) we can now
find solutions with all \(k_i < 2^8\), meaning that we can store the pilots
simply as a vector of u8
instead of needing a compact representation and/or
dictionary encoding. In fact I’m now thinking it’s probably simpler to tune \(c\)
(i.e. the number of buckets) such that we can easily find a solution where all
pilots fit in \(8\) bits, rather than having a variable packing size.
Some further observations:
- Only rarely is more than one collision needed.
- TODO: In this case can we backtrack instead of displacing two buckets?
- When filling buckets of size \(s\), at worst buckets of size \(s+1\) are evicted, never larger ones.
Limitations
As already mentioned, one problem with this is that matching the last few buckets is hard. Basically, for the last bucket we have to find a size-\(1\) bucket for which one of its \(256\) possible pilots maps it to the remaining free slot, which only happens for \(0.5\%\) of buckets (which is large for \(n=10^8\)).
Alternatively we can use Dinic for the size \(1\) buckets again, but this also has a limitation as mentioned before, that there will always be some size-\(1\) buckets that do not map to any free slot. Thus, it seems we need some balance where:
- Some size-\(1\) buckets displace already-placed size-\(2\) buckets.
- For the remaining size-\(1\) buckets we find some/all solutions, build the graph, and run Dinic.
One other tricky problem is that not only do we have to ensure that each bucket maps to a free slot, but also that each free slot has some bucket mapping to it. I’m not quite sure yet how exactly to solve this, since the probability of such a position existing is also quite large. One option is to track for each slot one or two \((bucket, pilot)\) pairs that maps to it, and then use that in case no other remaining values map there. But this might then trigger a chain of displacements again in the other (‘pulling’) direction. Although that’s somewhat inevitable in the \(\alpha = 1\) setting.
Cleanup and revisiting defaults
So far I have been using Vec<u64> as backing storage for pilots, MurmurHash to hash keys, and \(16\) iteration lookahead for prefetching.
- Now that we aim to have \(k_i < 2^8\), let's switch the pilot storage to a simple Vec<u8>.
- I added support for FxHash, which processes a u64 at a time using a simple rotate-xor-multiply: state = (state.rotate(5) ^ x) * constant (sketched below).
- While benchmarking this, I realized that the reason NoHash (which is a toy implementation that does exactly nothing) was so much faster than other hashes is not only because it spends fewer instructions hashing, but also because my random keys were sorted (to check for duplicates), and this led to linear access patterns rather than random access patterns.
- It now seems that $32$-iteration lookahead is around \(10\%\) faster for large \(n\).
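For reference, a minimal sketch of that FxHash-style mixing step (the rotate direction and the constant are illustrative):

```rust
// One mixing step: rotate the state, xor in the next word, multiply.
const K: u64 = 0x517cc1b727220a95; // illustrative multiplier

fn fx_step(state: u64, x: u64) -> u64 {
    (state.rotate_left(5) ^ x).wrapping_mul(K)
}

// Hashing a single u64 key is then one step starting from state 0.
fn fx_hash(key: u64) -> u64 {
    fx_step(0, key)
}
```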
Some results for \(n=10^8\) (\(c=7\), \(\alpha=1\)):
(1), (16/32/64) indicate no prefetching resp. prefetching 16/32/64 iterations ahead.
Observations:
- Switching from Vec<u64> to Vec<u8> saves around \(2ns\) per query.
- Going from Murmur to FxHash typically saves a bit.
- Going further to NoHash doesn't save more.
- At \(32\) lookahead, the runtime does not depend much on hash function, so probably it's memory bound rather than cpu bound.
- CompactVector and CompactArray do not support prefetching and hence aren't nearly as good.
TODO Sum instead of xor?
PTHash replaces the summation in \((h(x) + d)\mm n\) by a xor. It's not exactly clear to me why this is beneficial. I think the xor has slightly better mixing, especially when the \(\mm n\) is implemented as a reduction of the high bits, and \(h(k_i) = C\cdot k_i\) has a linear structure as well.
Revisiting \(\alpha < 1\)
After playing some more with \(\alpha<1\) values, it seems that we can do this with only very minor loss in query performance. Some experiments for \(n=10^8\), \(c=10\), with prefetching \(32\) iterations ahead:
\(\alpha\) | index | index_remap |
---|---|---|
\(1.0\) | 5.7 | 5.7 |
\(0.999\) | 5.6 | 5.8 |
\(0.995\) | 5.7 | 5.8 |
\(0.99\) | 5.7 | 5.9 |
\(0.95\) | 5.7 | 7.1 |
\(0.9\) | 5.7 | 8.7 |
As can be seen, index
is independent of \(\alpha\) as expected, since it doesn’t
do any extra work for out-of-range elements. index_remap
becomes slow once too
many elements fall out of bounds around \(\alpha = 0.95\). But we can safely
choose \(\alpha=0.99\) without significant loss of query performance.
Elias-Fano for the remap-dictionary
I added support to use sucds::EliasFano
to store the dictionary elements. This
saves quite some bits, since (for \(\alpha=0.99\)) it only needs \(2+\log_2(100) =
9\) bits per element as opposed to the default which encodes each index using a
u32
.
So let’s see the query performance.
\(\alpha\) | index | index_remap |
---|---|---|
\(1.0\) | 5.7 | 6.6 |
\(0.999\) | 5.8 | 6.7 |
\(0.995\) | 5.8 | 7.2 |
\(0.99\) | 5.7 | 7.6 |
\(0.95\) | 5.7 | 11.1 |
\(0.9\) | 5.7 | 16.1 |
This comes out quite a bit slower than using a direct Vec<u32>
encoding. On
the other hand, it does save space: for \(\alpha=0.99\), the vector encoding uses
\(0.01 \cdot 32 = 0.32\) bits per element overall, while EliasFano only needs
\(0.01\cdot 9 = 0.09\) bits per element overall, saving \(0.2\) bits/element.
Sadly the EliasFano hot-loop has slightly worse codegen even in the non-EF part, introducing a branch where previously branchless assembly was generated for the small >> shift in determining the right bucket.
Global iterative prioritizing
Instead of displacing using DFS, another idea I had that may work for \(\alpha<1\):
- Iterate over all buckets by decreasing size; place those that can be placed without collisions.
- Collect the rest in a vector of priority_buckets.
- Repeat, but place the priority buckets before iterating over the others.
This worked well with only buckets of a fixed size, but seems to not work as well globally, even when we try to assign the priority buckets so that they have minimal collisions with others. Instead of converging to a stable solution, each iteration adds roughly the same number of buckets (\(0.3\%\) for \(n=10^8\), \(c=10\), \(\alpha=0.99\)) to the priority list. It seems that the changes made by the few extra priority buckets propagate a lot, changing the optimal pilot for \(40\%\) of buckets. This basically means the set of taken slots is completely re-randomized each iteration, preventing convergence to a stable solution where the number of added priority elements goes down each round.
Basically, just DFSing the displacement using the earlier strategy seems much better.
Cleanup: removing peeling and suboptimal displacing code
The code base is getting messy. I'm going to drop all the peeling experiments, and the 'displacement per bucket size' will also go. Basically, all methods that require computing edges up front are annoying in this regard. Matching also seems not-so-useful anymore now that we can settle on \(\alpha=0.99\) without impacting the query throughput. It may still be useful in cases where we want to minimize memory usage, but even then we can go with \(\alpha=0.999\), which takes longer to construct but should still be fine, and avoids the complexity of having to balance between perfect matching and displacing.
So we’re dropping all these algorithms which required all edges to be precomputed up-front:
- Peeling algorithms
- Didn’t really work in the first place
- Dinic matching
- Works well for size-\(1\) buckets, but needs a large threshold to guarantee every bucket and slot have at least one edge. For larger bucket sizes the balancing act is annoying.
- Per-bucket-size displacing:
- The greedy non-iterative version that simply runs once and only fills \(95\%\) of buckets.
- The greedy iterative version works well but only does one bucket size at a time.
- The DFS version that works per bucket size isn’t quite good enough because sometimes we must displace to larger buckets.
- The global iteration method of the previous subsection also goes – it just doesn’t work well.
- The code to invert hashes \(s = C \times k_i\) is also dropped. Inverting returns values that are \(O(n)\) while we are now only interested in values at most \(2^8\). I’m kinda sad about this because the math there was really cool.
- The ‘fast small buckets’ optimization is also dropped for now, to be reintroduced later, to clean up code a bit.
All together, this drops around 1900 lines of code :)
Some speedups to the displacement algorithm
- Use a fast-path for detecting placement without collisions, which is sufficient for \(99\%\) of buckets.
- Use a taken bitvector for the fast-path, instead of a full slot-to-bucket-index mapping.
- Attempt at prefetching of positions in the taken vector for upcoming \(k_i\). Doubtful whether it actually improves construction time though.
TODO Runtime analysis of displacement algorithm
It would be nice to be able to estimate the runtime complexity of the displacement algorithm in terms of \(n\), \(m\) (/\(c\)), \(\alpha\), and the maximum allowed pilot value.
TODO Optimal prefetching strategy
For each class of buckets of the same size we could devise an optimal prefetch strategy: it's wasteful to prefetch the slot corresponding to the last element of each bucket when the probability of failing on an earlier element of the bucket is high. E.g. for size-\(2\) buckets only roughly \(1\) in \(6\) of the second elements needs to be queried. Not prefetching those may give a nice speedup.
Are we close to the memory bandwidth?
I have DDR4 @ 3200MHz, which has a theoretical bandwidth limit of \(25GB/s\).
For \(n=10^9\) (1G), the taken vector has size 1G*1bit=125MB and does not fit in my 12MB L3 cache. Thus let's say for simplicity that all lookups in it are cache misses, and thus that each lookup reads a single $64$byte cacheline from main memory.
But how many lookups do we need?
- there are 0.3G buckets
- the average \(k_i\) is \(40\)
- buckets have around \(3\) elements on average
- total number of lookups: 0.3G * 40 * 3 ~ 40G
- But that's an upper bound, since in practice we do an early break after the first taken slot.
Simply counting the number of lookups gives me 20G, which sounds reasonable: it's lower and of the same order of magnitude.
So how long does it take to do 20G lookups? \(20G * 64byte = 1.28TB\), which at \(25GB/s\) is \(51s\). In practice the construction takes longer, at \(350s\), so this indicates we're not yet memory-bandwidth bound (assuming we could indeed saturate the full \(25GB/s\)), but it's not too far from it.
More sorting algorithm resources
I found some more implementations and posts about sorting algorithms, so listing them here for future reference:
- Robinhoodsort
- https://github.com/mlochbaum/rhsort
Sorts uniform random data by placing values at their interpolated location in a \(2n\) size array.
- Ska_sort
- https://github.com/skarupke/ska_sort
In-place radix sort with CPU-pipelining-friendly swapping of values to their buckets. See posts investigating radix sort, I wrote a faster sorting algorithm, and faster sorting algorithm part 2.
- Wolfsort
- https://github.com/scandum/wolfsort
Didn’t really look into this.
- Voracious sort
- https://github.com/lakwet/voracious_sort
Rust library with various radix sort implementations, with the default voracious sort being an improved ska_sort. I tried all implementations in here for sorting \(10^8\) random 64-bit hashes, and none beats radsort::sort_by_key(|h| h>>32) followed by manually sorting equal buckets.
- More blogs
- http://stereopsis.com/radix.html
- pdqsort
- Pattern defeating quicksort, https://github.com/orlp/pdqsort, by Orson Peters
- glidesort
- https://github.com/orlp/glidesort, by Orson Peters
And some resources on partitioning
- Branchless partitioning
- https://github.com/Voultapher/sort-research-rs/blob/main/writeup/lomcyc_partition/text.md by Lukas Bergdoll
- Blog on Lomuto partitioning
- https://orlp.net/blog/branchless-lomuto-partitioning/ by Orson Peters
Partitioning to reduce memory latency
As discussed just before, memory latency and bandwidth are a bottleneck in construction times when \(n=10^9\). For a typical bucket with \(2\) elements and \(k_i=50\), we must look up \(100\) positions in taken, corresponding to fetching \(100\) cachelines from main memory. It would be much better if these cachelines were already available in a low-level cache. So how about if we already roughly know where the elements of this bucket are supposed to go? Maybe a window of say \(10\) cachelines (i.e. \(10\cdot 64\cdot 8 = 5120\) bits) should be wide enough to find an empty slot with high probability. So we could say that bucket \(i\) of \(m\) maps to a window/slice of some length starting at \(i/m \cdot s\) (where \(s=n/\alpha\) is the total number of slots). But actually we can simplify this:
- We can divide the \(m\) buckets into \(P\) partitions, and then the first \(1/P\) of buckets map to the first \(1/P\) of slots.
So there are \(P\) partitions with \(m_p = round(m/P)\) buckets each, and each partition corresponds to \(s_p = round(n/\alpha/P)\) slots.
So then, the position for an element \(x\) with bucket \(i\) becomes \[p(x, i) = \lfloor i/m_p \rfloor \cdot s_p + \left((h(x) \lxor h(k_i)) \mm s_p\right).\]
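In code, this is roughly (a sketch; names are illustrative):

```rust
// `m_p` = buckets per part, `s_p` = slots per part,
// `hx` = h(x), `hk` = h(k_i) for the bucket's pilot.
fn position(bucket: usize, hx: u64, hk: u64, m_p: usize, s_p: u64) -> usize {
    let part = bucket / m_p;                          // floor(i / m_p)
    part * s_p as usize + ((hx ^ hk) % s_p) as usize  // offset + slot within part
}
```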
As usual, we can remap values larger than \(n\) to the free slots in the rest of
the array. Note that this is slightly different from the PTHash-HEM partitioning in
Pibiri and Trani (2021a) since that builds a number of independent MPHFs, while
this method builds a single one and only constructs a single free
array.
One point needs some careful consideration: this method does not guarantee that each partition has the same number \(n_p = round(n/P)\) of elements, so the total load factor per partition will vary. This is not fatal though: the expected number of elements per partition is indeed \(n_p\), and since this is a binomial distribution for large \(n\) and small probability \(p=1/P\), the standard deviation is only \(\sqrt{n_p}\). This means that as long as we take, say, \(4\sqrt{n_p}\) additional slots per partition, the probability of a partition exceeding its number of slots is less than \(1\) in \(16000\). (Otherwise we rehash everything with a new global seed.) Since we anyway reserve free slots with \(\alpha<1\), we can simply make some more free slots. Construction time may then be slightly concentrated in the partitions with significantly more elements than average, but the memory latency savings should counter this easily.
So the new algorithm would be:
- Compute bucket for each element.
- Partition buckets into \(P\) partitions of \(m_p\) buckets each.
- Sort the buckets in each partition by decreasing size.
- (Optional) Prefetch the \(s_p\) bits of the taken array corresponding to the current partition into low-level cache.
- Run the usual algorithm for each partition, searching for the smallest pilot that maps each bucket to empty slots only.
- Remap positions above \(n\) to the free slots.
As said before, the main benefit here is that step 5 should now run much much faster, since each lookup for a free slot can go directly to L1 or L2 cache rather than main memory. So let’s see how many slots \(s_p\) we should aim for in each partition.
- My L1 cache is \(32kb = 256kbit\), so we could choose \(n_p = 200k\) elements per partition.
- My L2 cache is \(256kb = 2Mbit\), so we could choose \(n_p = 2M\) elements per partition.
The optimal size is probably determined using experimentation. The benefit of smaller partitions is that they fit in a faster cache, while \(10\) times larger partitions have a relatively \(\sqrt{10} \approx 3\times\) smaller standard deviation, hence needing less buffer slots.
As an example let’s take \(n=10^9\) with \(n_p = 2M\) elements per bucket. This creates \(500\) buckets, and needs \(4\sqrt{n_p} \approx 5k\) buffer slots. \(5k/2M = 0.25\%\), so actually choosing \(\alpha=0.99\) (which gives \(1\%\) buffer slots) already gives plenty of guarantee that buckets are not too full and leaves at least \(0.75\%\) of free slots at the end to speed up construction.
Note that this should also be very good for parallelization, since each thread can fill its own part of the taken vector in L1 or L2 without interfering with the other threads at all.
Implementation efficiency
- \(s_p\) should be a multiple of the \(512\) bit cache-line size to avoid sharing cachelines between threads. The entire bitvector should be cacheline-aligned as well. But this is really nitpicking.
- We should either use fast-divide for the bucket partitioning, or we can simply ensure that there’s \(2^k\) buckets per partition so we can just do this via a bitshift.
Oh some problem? Ughhh, so we manually create imbalanced bucket sizes, and the way we do this is by putting \(60\%\) of the elements in the first \(30\%\) of buckets. That means that if we partition buckets simply by id, the first buckets will be much larger than the last buckets…. And also, that means there will be very few small buckets to fill gaps among those large buckets… So it sounds like we want to either distribute those large buckets equally over the range, or partition the buckets e.g. via their id modulo \(P\).
A fix? So instead of making the first \(30\%\) of buckets large, we have to distribute the large buckets roughly equally over the partitions. Some ideas for this:
Make every third bucket large, as follows:
Remap large buckets from 0..m/3 to 0..m by taking \(i \mapsto 3i-1\), filling positions \(2\mm 3\).
Remap small buckets from 0..2m/3 to 0..m by taking \(i\mapsto \lfloor 3i/2 \rfloor\), filling positions \(0\mm 3\) and \(1\mm 3\).
(This is a specific instance of the following more general result: for irrational \(\alpha\) and \(\beta\) with \(1/\alpha + 1/\beta = 1\), the sets \(\{\lfloor i\alpha\rfloor: i\in \mathbb N\}\) and \(\{\lfloor i\beta\rfloor: i\in \mathbb N\}\) are disjoint and partition \(\mathbb N\).)
First compute the bucket id as \(h(x) \mm m\). Then remap elements that should go to large buckets by setting the last two bits of the bucket id to \(0\). That maps some percentage of elements to buckets \(0 \mm 4\), creating large buckets there. (We could also map them to buckets \(0\mm 3\) by doing i/3*3, but that requires more operations.)
Simply do the original logic of partitioning by small or large hash value, but choose a cut-off within each partition. For reference, the bucketing function we currently use is
\begin{equation*} bucket(x) = \begin{cases} \phantom{p_2+{}}\lfloor h(x) \cdot p_2 / 2^{64}\rfloor & \text{if } h(x) < p_1 \\ p_2 + \lfloor h(x) \cdot (n-p_2) / 2^{64}\rfloor & \text{if } h(x) \geq p_1. \end{cases} \end{equation*}
If we have \(2^k\) partitions, we can mask the first \(k\) bits of \(h(x)\) and check \(mask_k(h(x)) < p_1 / 2^k\).
Back from a break!
So I finally managed to implement partitioning and it works great! It’s super fast when putting ~200k elements per part, which exactly fits in L1.
This now also allows for parallelization, since each part can be processed independently, and since each part only needs little memory, they don’t interfere with each other at all!
I now use FastReduce32H to determine the part and bucket.
One issue though: using FastReduce32L to determine the slot does not provide sufficient entropy, since it only gives \(\lg(N/P)\) bits of entropy, whereas we need more like \(\lg N\) bits. This is somewhat fixed by reverting to FastMod32L, but that still lacks entropy for \(N = 10^9\), where we need the much slower FastMod64. More on this later.
Speeding up the search for pilots
Currently, the loop that searches for pilots tries candidate pilots in increasing order and, for each pilot, checks every element of the bucket against the taken bitvector, returning the first pilot without collisions.
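Roughly, it looks like this (a reconstructed sketch rather than the actual code; position stands in for the real hash-to-slot mapping, and the bitvector is simplified to a slice of bools):

```rust
/// Find the smallest pilot in 0..256 that maps every element of `bucket`
/// (given by its element hashes) to a free slot, or None if there is none.
/// `taken[s]` is true iff slot `s` is already occupied.
fn find_pilot(
    bucket: &[u64],
    taken: &[bool],
    position: impl Fn(u64, u64) -> usize,
) -> Option<u64> {
    'pilots: for p in 0..256u64 {
        for &hx in bucket {
            if taken[position(hx, p)] {
                continue 'pilots; // collision: try the next pilot
            }
        }
        return Some(p); // all elements land on free slots
    }
    None
}
```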
This is quite efficient, but has a problem: each iteration of the inner loop is quite unpredictable, since looking up random bits of taken returns true or false with some probability, so there will be lots of branch mispredictions. Indeed, running under perf stat -d confirms a high branch-miss rate and only around \(1.5\) instructions per cycle.
To reduce branches and increase instruction-level parallelism, we can check for success only every \(4\) lookups, which has much more predictable results (i.e. the check fails most of the time).
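A sketch of this variant, with the same illustrative helpers as above:

```rust
/// Same search, but OR together the lookups of 4 elements at a time and
/// branch only once per chunk, which mispredicts far less often.
fn find_pilot_chunked(
    bucket: &[u64],
    taken: &[bool],
    position: impl Fn(u64, u64) -> usize,
) -> Option<u64> {
    'pilots: for p in 0..256u64 {
        for chunk in bucket.chunks(4) {
            let mut collision = false;
            for &hx in chunk {
                collision |= taken[position(hx, p)];
            }
            if collision {
                continue 'pilots;
            }
        }
        return Some(p);
    }
    None
}
```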
The new perf stat -d numbers show that this is roughly \(10\%\) faster (in both wall-clock time and cycles). It does use \(50\%\) more instructions, but is able to execute those much more efficiently, at \(2.5\) instructions per cycle vs. \(1.5\) before. This is probably because of the $4$-fold reduction in branch-misses!
The remaining branch misses seem to mostly come from the less-optimized case where a bucket requires displacing some other buckets.
MultiplyReduce
For \(n\) close to \(10^9\) or \(2^{32}\), we really must use all \(64\) bits of
entropy. FastReduce
only uses the high bits, and FastMod
works but is quite
slow (taking \(2\) multiplications for the $32$bit variant and more for $64$bits).
When the target modulus is a power of \(2\), \(2^d\), an alternative is what I’ll call MultiplyReduce
:
\[
h \mapsto \lfloor \frac {C\cdot h}{2^{64}}\rfloor \mm 2^d.
\]
This simply multiplies the hash by a random $64$bit constant and then takes the
\(d\) low bits of the high word of the result.
We can use this for the reduction to slots in each part, so that all entropy is used.
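A sketch of MultiplyReduce for \(d < 64\) (the constant is illustrative):

```rust
// Multiply by a random odd 64-bit constant, take the high word of the
// 128-bit product, and keep its d low bits.
const C: u64 = 0x9e3779b97f4a7c15;

fn multiply_reduce(h: u64, d: u32) -> u64 {
    let hi = ((C as u128 * h as u128) >> 64) as u64; // floor(C*h / 2^64)
    hi & ((1u64 << d) - 1)                           // mod 2^d
}
```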
The one problem with this function (and also MulHash) is that $128$bit multiplications don't work well with SIMD, but so far SIMD hasn't paid off yet anyway.
Linux hugepages?
I may or may not have this already enabled on my laptop, but if not, it may make prefetches faster since fewer TLB entries are needed.
It seems that jemalloc
has builtin support for optional hugepages so I may try that.
See also this reddit post.
After trying a bit more (tweet, example code), it seems that a simple vec![0u8; len] already uses transparent huge pages on my machine: the kernel's reported amount of anonymous transparent-hugepage memory goes up by around 300MB when filling the vector of pilots, and similarly the count of allocated 2MB hugepages increases by \(150\) at the same time (which matches: \(150\cdot 2MB = 300MB\)).
Either way, for completeness, I also tried allocating with an explicit huge-page hint.
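A minimal sketch of one way to do this, using madvise(MADV_HUGEPAGE) via the libc crate (names and sizes are illustrative, not necessarily the exact code):

```rust
use std::alloc::{alloc_zeroed, Layout};

fn alloc_hugepage_backed(len: usize) -> *mut u8 {
    // Align to the 2MiB huge-page size so the range can be fully backed.
    let layout = Layout::from_size_align(len, 2 * 1024 * 1024).unwrap();
    unsafe {
        let ptr = alloc_zeroed(layout);
        // Ask the kernel to back this range with transparent huge pages.
        libc::madvise(ptr as *mut libc::c_void, len, libc::MADV_HUGEPAGE);
        ptr
    }
}
```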
Dropping the bucket split?
So far we have been mapping \(\beta = 0.6\) of elements to \(\gamma = 0.3\) of buckets to create two classes of buckets: large and small ones. This helps the performance of the construction algorithm by putting more elements in large and easy-to-place buckets so that in the end we have more smaller buckets and especially more free slots remaining once we get to the last buckets of size \(2\).
In my current implementation it's kind of annoying to find the bucket for a given hash: first FastReduce for the part: multiply \(h\) by \(P\), the number of parts. Then P*h>>64 is the part, and we use P*h % (2^64) to compute the bucket: the high bits of that remainder determine whether the element belongs in a small or large bucket, and a linear transformation then sends it there.
If we drop these two classes, things would be much simpler: the part is P*h>>64, and similarly the bucket is B*h>>64, where B is the total number of buckets over all parts.
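A sketch of that simplified mapping (illustrative names):

```rust
// part = P*h >> 64 and bucket = B*h >> 64, with P parts and B buckets total.
fn part_and_bucket(h: u64, p: u64, b: u64) -> (usize, usize) {
    let part = ((p as u128 * h as u128) >> 64) as usize;
    let bucket = ((b as u128 * h as u128) >> 64) as usize;
    (part, bucket)
}
```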
Build performance
This is a bit more tricky: dropping the two classes roughly doubles the number of displacements and slows the construction by around \(10\%\). Also, the minimal \(c\) for which construction succeeds at all goes up slightly.
An alternative
Maybe we can come up with a 'local' way to imbalance buckets, so that each bucket roughly maps to position h*B >> 64. Maybe we could OR the last bit of the bucket index with another random bit, so that \(75\%\) of elements go to \(50\%\) of buckets?
Query performance
for-loop | prefetching | |
---|---|---|
2 classes | 14.5 | 7.6 |
1 class | 10.3 | 7.5 |
As you can see, when we already do prefetching the added simplicity doesn’t affect performance much, but without it, the simpler and shorter code helps considerably.
Query memory bandwidth
Most likely, the prefetching variant is already saturating the memory bandwidth. Each cacheline is $64$bytes, which comes down to \(64B / 7.5ns = 8.5GB/s\). That’s a third of the \(25GB/s\) peak bandwidth for my RAM, but I’m not sure whether the peak is achievable at all for random-access reads.
I did find this quora answer that talks about Line-Fill Buffer. To quote:
If you test performance from a single thread on a machine that’s mostly idle (ie, you’re not competing with other threads for access to the memory bus), then you’ll find that on many chips, you can perform roughly twice as many sequential reads per second as random reads. The explanation for this is that the CPU has clever circuitry that watches your memory access pattern and tries to predict which cache-lines your code is going to want next. Why does this prefetching help? Well, this gets a bit deep. My current understanding is that in the random access case, your parallelism is limited by that fact that each core can only have a limited number of L1 misses outstanding at one time. Intel calls this the “Line-Fill Buffer” and it has 10 slots on recent chip models, and there are similar mechanisms on AMD and Arm. DRAM physics dictate that reads have ~60ns of latency (and this barely changes as the tech evolves), which is ~400 clock cycles, so that would rate limit throughput of random parallel reads to at most 1 per ~40 clock cycles. Versus when the hardware prefetch mechanisms kick in, they don’t consume line-fill slots and can effectively perform more reads in parallel. Sadly, if you issue software prefetch instructions, many CPUs treat that as needing a line-fill entry, so it’s only reads triggered by hardware-prefetching that are able to achieve greater parallelism than the size of your line-fill buffer.
So with this in mind, maybe the next question is whether we can get the last factor \(3\) improvement from \(8GB/s\) to \(25GB/s\) by using multithreading, since each CPU core has its own copy of these Line-Fill Buffers between L2 and L1.
So I tried multithreading, but it seems not to be faster at all! It’s usually slightly slower. Weirdly though, the threads seem to do the work sequentially, with the $i$th of \(k\) threads finishing roughly after \(i/k\) of the total time. Some threads just seem to be busy waiting for memory that never comes until other threads are done.
But then does that mean that my memory bandwidth is actually capped at \(8GB/s\)? That would be disappointing.
Maybe another solution could be to store two copies of the pilot array, one on each of the two \(32GB\) RAM sticks I have? I don't really know whether the main memory or the processor is the bottleneck, so no idea if this would help.
Some more experiments
I wrote some experiments to measure memory bandwidth, where I sum the first byte of each cacheline of a \(4GB\) array.
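A sketch of the measurement loop (illustrative):

```rust
// Touch the first byte of every 64-byte cacheline of a 4GB buffer and
// divide the bytes covered by the elapsed time.
use std::time::Instant;

fn main() {
    let len: usize = 4 << 30; // 4GB
    let data = vec![1u8; len];
    let start = Instant::now();
    let mut sum = 0u64;
    for i in (0..len).step_by(64) {
        sum += data[i] as u64; // one cacheline per iteration
    }
    let secs = start.elapsed().as_secs_f64();
    println!("sum={sum}, ~{:.1} GB/s", len as f64 / secs / 1e9);
}
```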
Basically, this caps out around \(9GB/s\) even for very predictable access patterns, and prefetching also doesn't help.
So the \(8.5GB/s\) we currently get with the queries is already super good and basically saturates the full memory bandwidth!
Multithreading benchmark
So can we get to the full \(25GB/s\) using multithreading? It seems that the answer is yes! Running multiple threads in parallel, each with a different (coprime) stride, the total throughput converges to just above \(25GB/s\). It's probably a bit higher than the theoretical peak because some reads are cached in L3, and some reads are shared between threads.
If we give each thread its own copy of the \(4GB\) array, the total throughput comes out quite a bit lower!
Changing to a \(7GB\) array may be interesting, since \(6\cdot 7GB = 42GB\) must be spread over the two \(32GB\) ram sticks. (\(8GB\) crashed because I don't have \(48GB\) free currently.) The results are very similar to the above, so probably we won't be able to gain speedup by distributing reads over the two ram sticks.
Multithreading queries: satisfaction at last
I had a silly bug at first, but with that fixed, here’s the inverse throughput when using up to \(6\) threads in parallel, each prefetching \(64\) iterations ahead:
threads | ns/query | memory throughput (GB/s) |
---|---|---|
1 | 7.59 | 8.4 |
2 | 4.17 | 15.3 |
3 | 3.09 | 20.7 |
4 | 2.81 | 22.8 |
5 | 2.69 | 23.8 |
6 | 2.76 | 23.2 |
Satisfaction, at last! \(2.69ns/query\) means a memory throughput of \(64B/2.69ns = 23.79GB/s\), quite close to the full memory bandwidth.
Each thread is also reading the keys to be queried from main memory. Those are efficiently packed, so each \(8B\) key is only $1/8$th of a \(64B\) cacheline and hence only requires $1/8$th of a read. But adding those in we get \((1+1/8) \cdot 23.79 GB/s = 26.76GB/s\), quite a bit above the full memory bandwidth. I suppose this is because some queries are already in L3 cache, sparing the main memory bus. Let's see more precisely: the memory used for all pilots is \(2.43bit/elem\cdot 10^9 elem = 2.43\cdot 10^9 bit = 304MB\). My L3 cache is \(12MB\), so each query hits the cache with probability \(12MB/304MB=4\%\), getting to
\[
\frac 18 \cdot \frac{64B}{2.69ns} + (1-0.04) \cdot\frac{64B}{2.69ns} = 25.8GB/s.
\]
This is still slightly above the \(25GB/s\) max throughput my RAM should have, but I'll call it done here. We're within a few percent of the expected maximum performance, which is as good as we can get.
Maybe the last bit is L1 and L2 caches containing a few more elements than L3?
Packing difference from expected position
Since the values we pack are a random subset of \([0, n]\), it may be possible to encode them efficiently by only storing the deviation from their expected value. I.e. we store \(r=(1/\alpha - 1)n\) values where position \(i\) stores the difference with \(i/r \cdot n\).
Result: for \(n=10^8\) with \(\alpha = 0.99\), the largest difference between actual and expected position is around \(500\ 000\). I.e., this would still need at least \(20\) bits per value, which hardly improves on what we started with.
Local packing ideas
Instead we could store a fixed offset \(c\) every \(k\) positions, and approximate the
following \(k\) values as \(c + i/r\cdot n\), where \(n/r\) is the expected difference
between adjacent remap targets (i.e. the distance between adjacent empty slots
in the taken
array).
For example, we could store a u32
offset every \(30\) positions, one per
cacheline, totalling to \(16 + 32/30 \sim 17 bits\) per remapped position.
Or maybe we could pack slightly more efficient and only use somewhere between \(10\) to \(15\) bits for the variance from the expected value. Some maths will be needed to figure out a safe upper bound on the max deviation.
Some possibilities:
- Absolute: Store \(32bit\) offset, and \(16bit\) deltas to it.
- Linear: Store \(32bit\) offset, and \(16bit\) deltas to \(offset + i\cdot n/r\).
- Custom linear: Store \(32bit\) offset, \(32bit\) (or less) difference \(d\), and \(16bit\) deltas to \(offset + i \cdot d\).
- Tiny EF: Store \(32bit\) offset, \(128\) bits indicating where \(256bit\) boundaries are crossed similar to EF encoding, and \(44\) \(1byte\) values.
This last idea needs \(64/44\cdot 8bit=11.6bits\) per value. That's slightly more than using EF encoding directly on the full vector (which uses around \(9bits\)), but since this only touches a single cacheline, lookups are much faster, and it uses \(3\times\) fewer bits than the direct Vec<u32>.
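One possible cacheline layout matching this description (illustrative, not necessarily the real TinyEF struct):

```rust
// 32-bit offset + 128-bit EF-style high part + 44 low bytes = 512 bits.
#[repr(C, align(64))]
struct TinyEfLine {
    offset: u32,     // absolute value of the first element in this line
    high: [u32; 4],  // 128 bits marking where multiples of 256 are crossed
    low: [u8; 44],   // low 8 bits of each of the 44 stored values
}
```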
Query times for different remapping structures
threads | no remap | Vec<u32> | TinyEF | EliasFano | Vec<u32> | TinyEF | EliasFano |
---|---|---|---|---|---|---|---|
\(\alpha\) | | \(0.99\) | \(0.99\) | \(0.99\) | \(0.98\) | \(0.98\) | \(0.98\) |
1 | 7.4 | 7.9 | 7.9 | 10.1 | 8.1 | 8.2 | 11.6 |
2 | 4.1 | 4.3 | 4.3 | 5.1 | 4.8 | 4.5 | 6.0 |
3 | 3.0 | 3.2 | 3.2 | 3.6 | 3.5 | 3.3 | 4.3 |
4 | 2.8 | 2.9 | 2.8 | 3.2 | 3.0 | 2.9 | 3.6 |
5 | 2.7 | 2.7 | 2.8 | 3.0 | 2.8 | 2.7 | 3.2 |
6 | 2.7 | 2.7 | 2.7 | 3.0 | 3.0 | 2.8 | 3.1 |
bits/elem | 2.43 | +0.33 | +0.12 | +0.09 | +0.66 | +0.24 | +0.17 |
As can be seen, TinyEF uses only about a third of the memory of Vec<u32>, while being at least as fast (and faster for \(\alpha=0.98\)). EliasFano uses another \(25\%\) less memory, but significantly slows down queries due to its non-locality.
The code for TinyEF is here.
Sharding
When the number of keys is very large, they may not all fit in memory at the same time. To solve this, we can shard the keys and process one shard at a time.
In particular, the \(P\) parts are split into \(S\) shards, where part \(0\) to \(P/S\) go in the first shard, and so on. We make sure that \(P\) is a multiple of \(S\). Then we process one shard at a time: we hash the keys, sort them into their parts, and then compute the pilots for each part independently.
I implemented two ways to shard keys:
- On-disk sharding: create a file for each shard, iterate over the keys, and append the hash of each key to the file of the shard it belongs to.
- In-memory sharding: iterate over the keys multiple times, and each time only return those which belong to the current shard.
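A sketch of the in-memory variant (hash stands in for the real hash function; parts 0..P/S go to shard 0, and so on):

```rust
/// Return an iterator over the hashes of `keys` whose part falls in `shard`.
/// Called once per shard, so the keys are iterated S times in total.
fn shard_hashes<'a>(
    keys: &'a [u64],
    hash: impl Fn(u64) -> u64 + 'a,
    shard: usize,
    num_shards: usize,
    num_parts: usize,
) -> impl Iterator<Item = u64> + 'a {
    let parts_per_shard = num_parts / num_shards;
    keys.iter().map(move |&k| hash(k)).filter(move |&h| {
        let part = ((h as u128 * num_parts as u128) >> 64) as usize;
        part / parts_per_shard == shard
    })
}
```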
128bit hashing
In order to scale more than \(2^{32}\) keys, we need to use $128$bit hash
functions (github issue). There are many hash functions out there, all with
different speed vs. randomness tradeoffs. I implemented many of them. Although I
have no idea of the quality of the implementations I used, these are some quick
results for hashing u64
keys:
Hasher | Bits | seq. | prefetch |
---|---|---|---|
fxhash | 64 | 14.2 | 7.48 |
murmur2 | 64 | 15.8 | 7.70 |
murmur3 | 128 | 16.5 | 9.40 |
xx | 64 | 39.0 | 14.04 |
xx | 128 | 32.3 | 13.94 |
city | 64 | 28.0 | 12.19 |
city | 128 | 45.1 | 20.09 |
metro | 64 | 29.9 | 13.21 |
metro | 128 | 40.3 | 16.71 |
highway | 64 | 107.6 | 57.20 |
highway | 128 | 114.5 | 47.84 |
spooky | 64 | 115.7 | 63.43 |
spooky | 128 | 115.5 | 67.06 |
From this we conclude that FxHash, Murmur2, and Murmur3 are good fast hash functions where throughput is close to the memory bandwidth, while most other hash functions are quite a bit slower and lower the throughput.
I also added support for hashing of &[u8]
strings. Results are here:
Hasher | Bits | seq. | prefetch |
---|---|---|---|
fxhash | 64 | 31.7 | 21.26 |
murmur2 | 64 | 11.5 | 6.23 |
murmur3 | 128 | 24.7 | 15.35 |
wy | 64 | 19.5 | 11.58 |
xx | 64 | 23.2 | 14.26 |
xx | 128 | 21.0 | 12.90 |
city | 64 | 18.0 | 10.97 |
city | 128 | 27.5 | 17.90 |
metro | 64 | 22.2 | 15.36 |
metro | 128 | 24.3 | 17.71 |
highway | 64 | 61.1 | 36.23 |
highway | 128 | 67.0 | 42.51 |
spooky | 64 | 74.1 | 62.48 |
spooky | 128 | 74.1 | 62.87 |
Observations:
- fxhash is slow for strings.
- murmur2 is super fast!
- city and wy are great for 64-bit hashes.
- xx is best for 128-bit hashes.
Varying the partition size
Let’s construct for \(n=10^9\) keys, \(c=8\), \(\alpha=0.98\), with different partition sizes. \(\max \alpha\) is the maximum load factor over all parts.
slots per part | num parts | displace time | total time | \(\max\alpha\) |
---|---|---|---|---|
\(2^{15}\) | $31$k | - | - | \(>1\) |
\(2^{16}\) | $16$k | \(20.7\) | \(27.0\) | \(99.49\) |
\(2^{17}\) | $7.8$k | \(22.8\) | \(29.0\) | \(99.18\) |
\(2^{18}\) | $3.9$k | \(22.7\) | \(29.0\) | \(98.59\) |
\(2^{19}\) | \(1947\) | \(25.8\) | \(31.6\) | \(98.40\) |
\(2^{20}\) | \(974\) | \(30.3\) | \(36.1\) | \(98.21\) |
\(2^{21}\) | \(487\) | \(42.3\) | \(48.0\) | \(98.14\) |
\(2^{22}\) | \(244\) | \(41.2\) | \(48.1\) | \(97.83\) |
\(2^{23}\) | \(122\) | \(40.9\) | \(47.0\) | \(97.81\) |
\(2^{24}\) | \(61\) | \(72.5\) | \(78.3\) | \(97.77\) |
\(2^{25}\) | \(31\) | \(88.8\) | \(94.9\) | \(96.17\) |
Observe that:
- Using too small partitions causes too many parts, with too much load-unbalancing, consistently giving \(\alpha > 1\).
- \(\max \alpha\) decreases as the parts get larger, reducing the worst-case outliers. In particular the delta with \(98\%\) roughly halves each step.
- When there are very few parts, \(\max \alpha\) goes below \(98\%\) due to rounding, because each part is fixed to have a power-of-two size.
- Smaller parts are faster to build, due to data fitting in lower caches.
- \(2^{17}\) slots per part was my default, but \(2^{19}\) is not much slower and more reliable.
- From \(2^{21}\), the displacement time quickly grows. Storing \(2^{21}\) taken bits takes \(256kb\), which is exactly the L2 cache size. But the displacement algorithm also needs to store for each slot which bucket is occupying it, taking \(32bit/slot\). This does not fit in L3 and goes to main memory instead.
- At \(2^{24}\) slots, even the taken arrays do not fit in L3 cache anymore and things slow down even more. (In fact I'm surprised it's only a \(2\times\) slowdown. The latency is hidden pretty well.)
One way in which PTRHash differs from PTHash is that apart from just the taken
bitvector, it needs to store an additional \(32bit\) index for each slot to be able to
displace elements, which requires a lot more memory. If we could prevent this,
memory usage would go down and it may be possible to use larger parts without
slowing down the construction time so much.
PtrHash, part 2
After leaving this dormant for a year, it’s time to pick it up again and write a paper.
TODO:
- compare with RecSplit-GPU: Bez et al. (2023)
- Incorporate Phobic (Hermann et al. 2024; Hermann 2023)
Phobic
Results:
- 0.17bits/key less space
- 2.17bits/key GPU construction in 28ns/key
- query at 37ns/key
Contributions:
- Near-optimal bucket size distribution, for reduced construction time and space.
- Storing seeds ’transposed’, so that they compress better.
- GPU implementation.
Theoretical results:
- Using large bucket sizes quickly gives near-optimal size.
- Filling a bucket takes \(\Omega(e^\lambda)\) time.
- Presented method runs in \(O(e^{\lambda(1+\epsilon)})\) time, close to theoretical lower bound.
Algorithm:
- Split data into partitions of expected size \(\rho\).
- Construct MPHF per part.
- Concatenate into single MPHF.
- \(B=\rho/\lambda\) buckets per part.
- \(\gamma: [0, 1]\to [0,1]\) is the bucket assignment function, taking a hash and returning a bucket.
- Intuitive optimal bucket assignment function:
\[
\gamma = \Beta(x) = x + (1-x) \ln (1-x)
\]
- First (very) large buckets, then smaller buckets.
- In practice, use \(\epsilon x + (1-\epsilon) \Beta(x)\), to avoid too large buckets (a sketch follows after this list).
- Instead of evaluating \(\Beta\), a $2048$-piecewise linear approximation is made.
- Given a bucket seed/pilot \(s\), pthash finds the position as \(h(key) \oplus h(s)\)
- Phobic: Splits seed \(s\) in low and high bits \(s_l\) and \(s_h\), and maps to position \(h(key, s_h) + s_l\). Then when the seed is incremented, first all rotations (up to the part size) are searched before rehashing is required.
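A sketch of that bucket assignment function with the \(\epsilon\) mix (illustrative; Phobic evaluates a piecewise-linear approximation instead):

```rust
// gamma(x) = eps*x + (1-eps) * (x + (1-x)*ln(1-x)), mapping a normalized
// hash x in [0, 1) to a normalized bucket index in [0, 1).
fn bucket_assignment(x: f64, eps: f64) -> f64 {
    let beta = x + (1.0 - x) * (1.0 - x).ln();
    eps * x + (1.0 - eps) * beta
}
```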
Practicalities:
- Many small parts (size 2500 only; PTRhash uses around 100k-500k); encode their starts relative to the expected start.
- Compact encoding: all values have the same bit length.
- Interleaved coding: first encode the first bucket of every part, then the second bucket, and so on, since the \(i\)'th bucket of each part has a more similar distribution and thus compresses better.
- One simple benefit of having all large buckets next to each other is more efficient caching. This will also be great for PTRhash.
TODO for PTRhash
- Try the optimal bucket function and see how much it improves construction time.
- Try ‘rotation’ displacement inside a part instead of re-hashing.
- This will probably be bad, because \((h(x)+d)\bmod p\) (for part size \(p\)) does not resolve collisions when \(h(x)\bmod p = h(y)\bmod p\).
- Maybe \((h(x)\oplus d)\bmod p\) is good enough?