# PTRHash: Notes on adapting PTHash in Rust

$%\newcommand{\mm}{\,\%\,} \newcommand{\mm}{\bmod} \newcommand{\lxor}{\oplus} \newcommand{\K}{\mathcal K}$

NOTE: WIP paper is here.

Daniel got me excited about minimal perfect hashing functions (MPHFs), and then twitter Rob asked for a rust implementation of PTHash (Pibiri and Trani 2021b, 2021a), and also I have some ideas related to memory prefetching I want to play with, so here we are: I’m working on a Rust implementation ptr-hash at https://github.com/ragnargrootkoerkamp/ptrhash.

This post is a collection of thoughts, ideas, implementation nodes, and experiments that came up while writing the code.

## Questions and remarks on PTHash paper

Probably some of these questions are answered by reading into the original FCH paper (Fox, Chen, and Heath 1992) – that’ll come at some later time.

• FCH uses $$cm/\log_2 n$$ buckets because each bucket needs $$\log_2 n$$ bits storage, but for PTHash the bits per bucket is lower so this isn’t really the right formula maybe.

• Tweet: another option is to fix $$c = \alpha * \log(n) / d$$ for some constant $$d$$.
• FCH uses $$\lceil cn / (\log_2n+1)\rceil$$ buckets, but PTHash uses $$\lceil cn/\log_2 n\rceil$$ buckets. To me, $$\lceil cn/\lceil \log_2n\rceil\rceil$$ actually sounds most intuitive.

• $$p_1=0.6n$$ and $$p_2=0.3m$$ seem somewhat arbitrary. Can this be tuned better? Could partitioning into more chunks help?

Anyway, this puts $$60\%$$ of data in the first $$30\%$$ of buckets (i.e. double the average size), and $$40\%$$ of data in $$70\%$$ of buckets (i.e. roughly half the average size).

I suppose the goal is to stitch two binomial distributions of bucket sizes together in a nice way to get a more even distribution over sizes.

• This is an open question (tweet)
• Proof in appendix of paper could possibly be extended.
• How many bits of entropy are really needed? When $$n$$ is a power of $$2$$, we make sure to use all $$64$$ bits instead of just $$\log_2 n$$. But how many are really needed. Maybe as long as $$n \ll 2^{32}$$, a simpler/faster $$32$$ bit hash is sufficient?

• And how large will $$n$$ be at most? Can we reasonably assume $$2^{32} \approx 4\cdot 10^9$$ as a bound?

• Is it reasonable to compute and store hashes of all keys, and then changing the API to let the user pass in the hash instead of the key?

• What’s the probability of failure with the first global seed $$s$$ that’s tried?

• What are the largest $$k_i$$ found in typical cases?

• What if $$\gcd(m, n)$$ is large?

## Ideas for improvement

### Parameters

• Better tuning of parameters may work?
• Partitioning into three or more chunks may speed up construction?

### Align packed vectors to cachelines

Instead of packing $$r$$ bit values throughout the entire vector, only pack them inside cachelines.

### Prefetching

The query performance of this and other MPHFs seem to be limited by the latency of memory lookups. For this reason, most MPHFs minimize the number of memory lookups, and in particular PTHash uses one random memory access followed by one lookup in a small (cached) dictionary.

When processing a number of lookups sequentially, each lookup currently incurs a cache miss. This could be made much more efficient by doing $$k$$ (e.g. $$k=16$$) cache lookups in parallel:

1. first compute the array location to lookup for $$k$$ keys (possibly using SIMD),
2. then prefetch the $$k$$ memory locations,
3. then do the rest of the computation in parallel.
4. If needed, do lookups in the $$free$$ table in parallel.

This should mostly hide the memory latency and could give significant speedup. I’m not quite sure yet whether this would work best when done in batches (as described), or in streaming fashion, where we iterate over elements and prefetch memory for the element $$k$$ iterations ahead. Algorithmica.org has a nice article on the streaming approach.

### Faster modulo operations

There are a lot of % n, % p1 and % (m-p1) operations throughout the code. I didn’t look into it yet, but possibly these are the bottleneck on the CPU latency.

First note that $a\, \%\, b = a - \lfloor a/b\rfloor * b.$ This division by a constant can be computed efficiently using a trick which replaces division by multiplication with the inversion. Using the formula of the wikipedia article we can precompute some constants to evaluate $$\lfloor a/b\rfloor$$ in $$6$$ operations and a % b in $$8$$ operations.

(Note that it might be possible compilers already do this, but I don’t expect so.)

Some blogposts by Daniel Lemire (Thanks Daniel Liu ;)

### Store dictionary $$D$$ sorted using Elias-Fano coding

I doubt whether the memory savings here are worth the time overhead, but it’s an idea :shrug:.

### How many bits of $$n$$ and hash entropy do we need?

One bottleneck of fastmod64 is that it needs to compute the highest $$64$$ bits of a u128 * u64 product. If we can assume that $$n$$ is at most $$40$$ bits, and that $$44$$ bits are sufficient entropy, then I think we could do away with the u128 x u64 multiplication and do everything inside u128.

### Ideas for faster construction

Once the table is almost filled, determining the $$k_i$$ becomes slower. Some ideas to speed this up:

reversing murmurhash
Instead of finding a $$k_i$$ such that $$position(x) := (h(x) \lxor h(k_i))\mm n$$ is not taken becomes slow. Instead, we could keep a list of empty positions $$p$$ and determine $$k_i$$ as $$h^{inv}((p + j \cdot n) \lxor h(x))$$ for different $$j$$, assuming we can invert murmurhash. As it turns out, murmurhash2 is an invertible function with $$O(1)$$ inverse! See this blogpost. Thus, we can easily find many $$k_i$$ such that $$position(x, k_i)$$ maps to an empty position.
• All buckets of size $$1$$ can be directly assigned to slots this way.
• Size $$2$$ and $$3$$ buckets also need fewer tries than before.

The big drawback though is that the $$k_i$$ values found will be uniform in $$[0, 2^{64}]$$.

Cuckoo-hashing
For sets of size $$1$$ and $$2$$ (and maybe $$3$$?) we could displace an already taken slot if that is the last remaining slot needed to fix the current bucket. Probably we want to only displace buckets of the same size and never buckets of larger size.

I wonder though how useful this actually is. If the current bucket is hard to place, there is not really any reason a different bucket of the same size would be easier to fix.

## Implementation log

A somewhat chronological list of notes and remarks.

### Hashing function

For now I use murmur64a, documented on the SMHasher GitHub wiki.

### Bitpacking crates

There are a lot of bitvector and bitpacking crates!

bitvectors
All of the below seem to do the same
• bitvec: $$30M$$ downloads
• bit-vec: $$30M$$ downloads
• fixedbitset: $$55M$$ downloads

No idea which is best; probably I’ll settle for the one below in sucds.

sucds
only $$60K$$ downloads, but contains

succinct - IntVector - Can not be constructed from slice/iterator of values. - Decoding seems somewhat inefficient - No updates in the past 2 years.

### Construction

• Storing buckets as Vec<Vec<Key>> is bad for large keys, so now I store Vec<Vec<usize>>, but the nested Vecs still waste a lot of space and will cause allocation slowdowns. PTHash pushes onto a vector which is sorted later, which seems more efficient.
• When testing $$k_i$$, not only do we need to test that positions are not filled by previous buckets, but also we have to check that elements within the bucket do not collide. It is not sufficient that $$h(x, s)$$ does not collide within buckets, since they could collide after taking the % n.

### Fastmod

It seems that Daniel Lemire’s fastmod C++ library has not yet been ported to Rust, so I converted the few parts I need.

There is also strength_reduce, which contains a similar but distinct algorithm for a % b that computes the remainder from the quotient.

### First benchmark

I implemented these under a generic Reduce trait.

just bench at the linked commit at 2.6GHz gives the following for $$10^7$$ keys:

methodconstruction (s)query (ns)
u6410.47459191
fastmod6410.73158355
fastmod329.91175150
strengthreduce6411.52093956
strengthreduce3210.00201750

The u32 versions simply only use the lower $$32$$ bits of the $$64$$ bit hash.

This is not yet as fast as the fastest 28ns reported in the PTHash paper (for C-C encoding), but I also haven’t optimized anything else yet. Time for profiling.

Profiling: Looking at the flamegraph (cargo flamegraph), and zooming in on the hash function, we see

A lot of time is spend on fold! The murmur2 function I use has signature murmur2(bytes: &[u8], seed: u64), and even though my keys/bytes always correspond to just a u64, it’s iterating over them!

In the generated perf report, we see

 1 2 3  33.14% 27328 test::queries_e pthash_rs-f15b4648f77f672b [.] pthash_rs::PTHash::new 18.18% 14823 test::queries_e pthash_rs-f15b4648f77f672b [.] pthash_rs::PTHash::index 13.76% 11245 test::queries_e pthash_rs-f15b4648f77f672b [.] murmur2::murmur64ane 

We can ignore the $$33\%$$ for construction and only focus on querying here, where we see that the index function calls to murmur2 and a lot of time is spent in both. In fact, murmur2 is not inlined at all! That explains the iterator appearing in the flamegraph.

Thin-LTO: This is fixed by enabling link-time optimization: add lto = "thin" to cargo.toml.

Rerunning the benchmark we get

construction (s)construction (s)query (ns)query (ns)
methodno LTOthin-LTOno LTOthin-LTO
u6410.58.99160
fastmod6410.78.35534
fastmod329.98.55026
strengthreduce6411.58.35638
strengthreduce3210.08.85031

Sweet! 26ns is faster than any of the numbers in table 5 of Pibiri and Trani (2021b)! (Admittedly, there is no compression yet and the dictionary size is $$10\times$$ smaller, but still!)

More inlining: Actually, we don’t even want the index() function call to show up in our logs: inlining it should give better instruction pipelining in the benchmarking hot-loop

 1 2 3  for key in &keys { mphf.index(key); } 

and indeed, we now get

query (ns)no LTOthin-LTOinline index()
u64916055
fastmod64553433
fastmod32502624
strengthreduce64563833
strengthreduce32503126

Conclusion: From now on let’s only use fastmod64 and fastmod32. (I suspect the 32bit variant does not have sufficient entropy for large key sets, so we keep the original 64bit variant as well.)

### Faster bucket computation

After inlining everything, the generated assembly for our test is just one big $$\sim 100$$ line assembly function. Currently, the bucket(hx) function (that computes the bucket for the given hash hx = hash(x, s)) looks like

 1 2 3 4 5 6 7  fn bucket(&self, hx: u64) -> u64 { if (hx % self.rem_n) < self.p1 { // Compare hx % self.rem_p2 } else { self.p2 + hx % self.rem_mp2 } } 

The assembly looks like this:

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27   5 │ ┌──cmp %rdi,0xc0(%rsp) # compare 1 │ ├──jbe 370 102 │ │ mov 0x98(%rsp),%rdx # first branch: fastmod 41 │ │ mulx %r9,%rdx,%rdi 4 │ │ imul 0xa0(%rsp),%r9 85 │ │ mulx %r10,%r13,%r13 12 │ │ add %rdi,%r9 7 │ │ mov %r9,%rdx 72 │ │ mulx %r10,%rdx,%rdi 17 │ │ add %r13,%rdx 3 │ │ adc $0x0,%rdi # add 0 │ │ cmp %rdi,0x58(%rsp) # index-out-of-bound check for k array 56 │ │↓ ja 3ac # ok: continue below at line 3ac: │ │↓ jmp 528 # panic! │ │ cs nopw 0x0(%rax,%rax,1) 128 │370:└─→mov 0xa8(%rsp),%rdx # second branch: fastmod 41 │ mulx %r9,%rdx,%rdi │ imul 0xb0(%rsp),%r9 66 │ mulx %rcx,%r13,%r13 12 │ add %rdi,%r9 │ mov %r9,%rdx 58 │ mulx %rcx,%rdx,%rdi 14 │ add %r13,%rdx │ adc 0xb8(%rsp),%rdi # add p2 54 │ cmp %rdi,0x58(%rsp) # out-of-bound check for k array 1 │ ↓ jbe 528 # panic! 8100 │3ac: mov (%r11,%rdi,8),%rdx # Do array index.  We see that there are quite some branches: • The first and second branch of the bucket() function are both fully written out. • They use the same number of instructions. • One branch does add 0, I suppose because the CPU likes equal-sized branches. • There are redundant index-out-of-bounds checks. • The last line, the array index itself, has $$8000$$ samples: $$57\%$$ of the total samples is this single assembly instruction! Branchless bucket index: I tried rewriting the bucket() function into a branchless form as follows:  1 2 3 4 5  fn bucket(&self, hx: u64) -> u64 { let is_large = (hx % self.rem_n) >= self.p1; let rem = if is_large { self.rem_mp2 } else { self.rem_p2 }; is_large as u64 * self.p2 + hx % rem }  but this turns out to be slower than the original, probably because the new assembly now needs a lot of cmov instructions. (In particular, rem contains a u128 and a u64, so needs $$3$$ mov’s and $$3$$ cmov’s.)   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22   │ cmp %rdi,0xd0(%rsp) # comparison 112 │ mov 0xb8(%rsp),%rdi # load rem_p2 29 │ cmova 0xc8(%rsp),%rdi # conditionally overwrite with rem_mp2 │ mov 0xb0(%rsp),%rdx 137 │ cmova 0xc0(%rsp),%rdx │ mov 0xa0(%rsp),%r14 137 │ cmova 0xa8(%rsp),%r14 26 │ mov %r9,%r10 │ mov$0x0,%r11d # set offset to 0 90 │ cmova %r11,%r10 # conditionally overwrite offset 38 │ imul %r8,%rdi # start computation 2 │ mulx %r8,%rdx,%r8 122 │ add %r8,%rdi 48 │ mulx %r14,%r8,%r8 │ mov %rdi,%rdx 163 │ mulx %r14,%rdx,%rdi │ add %r8,%rdx │ adc %r10,%rdi 184 │ cmp %rdi,0x60(%rsp) # index-out-of-bounds check │ ↓ jbe 52f # panic 38 │ mov 0x98(%rsp),%rdx 10798 │ mov (%rdx,%rdi,8),%rdx # Do array index. 

No bounds check: We can replace k[index] by unsafe { *k.get_unchecked(index) }. This doesn’t give much performance gain (less than the few ns of measurement noise I have), but can’t hurt. It removes the final cmp; jbe lines from the assembly.

Fix tests: Instead of ignoring test results we can accumulate the resulting indices and pass them to black_box(sum). This prevents the compiler from optimizing away all queries. Somehow this affects the reported timings. I now get:

query (ns)no LTOthin-LTOinline index()fixed tests
u6491605563
fastmod6455343335
fastmod3250262420
strengthreduce6456383338
strengthreduce3250312630

I’m confused how the fastmod32 timing went down, but the fastmod64 went up. (Typical situation when you do constant profiling and there are more numbers than you can make sense of, sadly.)

### Branchless, for real now! (aka the trick-of-thirds)

I’m still annoyed by this branching. Branches are bad! They may be fast for now, but I kinda have the long term goal to put SIMD on top of this and that doesn’t go well with branching. Also, branch-misprediction is a thing, and the $$70\% - 30\%$$ uniform random split is about as bad as you can do to a branch predictor. The code from earlier does fix it, but at the cost of a whole bunch of mov’s and cmov’s.

But there is a trick we can do! $$p_1$$ and $$p_2$$ are sort of arbitrarily chosen, and all the original paper (Fox, Chen, and Heath 1992) has to say about it is

Good values for these two parameters are experimentally determined to be around $$0.6n$$ and $$0.3m$$, respectively.

Thus I feel at liberty to change the value of $$p_2$$ from $$0.3m$$ to $$m/3$$. This gives: $m-p_2 = m-m/3 = \frac 23 m = 2p_2.$ The cute tick is that now we can use that $x \mm p_2 = (x \mm (2p_2)) \mm p_2 = (x \mm (m - p_2)) \mm p_2,$ and since $0≤ x \mm (2p_2) < 2p_2$, computing that value modulo $$p_2$$ is as simple as comparing the value to $$p_2$$ and subtracting $$p_2$$ if needed.

Thus, we modify the initialization to round $$m$$ up to the next multiple of $$3$$, and change the bucket function to

 1 2 3 4 5 6  fn bucket(&self, hx: u64) -> u64 { let mod_mp2 = hx % self.rem_mp2; let mod_p2 = mod_mp2 - self.p2 * (mod_mp2 >= self.p2) as u64; let large = (hx % self.rem_n) >= self.p1; self.p2 * large as u64 + if large { mod_mp2 } else { mod_p2 } } 

The new timings are

query (ns)no LTOthin-LTOinline index()fixed tests$$p_2 = m/3$$
u649160556354
fastmod645534333527
fastmod325026242019
strengthreduce645638333833
strengthreduce325031263021

fastmod32 didn’t get much faster, but all others went down a lot! Let’s check out the generated assembly for fastmod64:

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24   10 │ vpunpcklqdq %xmm3,%xmm5,%xmm3 3 │ vmovq %r15,%xmm12 1 │ vpunpcklqdq %xmm6,%xmm7,%xmm5 33 │ vinserti128 $0x1,%xmm3,%ymm5,%ymm5 14 │ vpunpcklqdq %xmm8,%xmm14,%xmm3 2 │ vpunpcklqdq %xmm15,%xmm12,%xmm6 2 │ vinserti128$0x1,%xmm3,%ymm6,%ymm3 32 │ vmovdqu 0x570(%rsp),%ymm14 15 │ vpxor %ymm3,%ymm14,%ymm3 2 │ vpxor 0x610(%rsp),%ymm14,%ymm12 2 │ vpxor %ymm5,%ymm14,%ymm6 29 │ vmovdqu 0x630(%rsp),%ymm11 13 │ vpxor %ymm14,%ymm11,%ymm15 2 │ vpcmpgtq %ymm6,%ymm15,%ymm6 2 │ vpcmpgtq %ymm3,%ymm12,%ymm3 39 │ vpandn %ymm11,%ymm6,%ymm6 13 │ vpandn %ymm11,%ymm3,%ymm7 4 │ vpand %ymm6,%ymm3,%ymm3 2 │ vpaddq %ymm5,%ymm7,%ymm5 31 │ vpcmpeqd %ymm7,%ymm7,%ymm7 8 │ vpsubq %ymm3,%ymm5,%ymm3 │ vpxor %xmm6,%xmm6,%xmm6 2 │ mov 0x1c0(%rsp),%r14 4264 │ vpgatherqq %ymm7,(%r14,%ymm3,8),%ymm6 # Do array index 

Huh what?! I don’t really what is going on here, but I do know that the compiler just vectorized our code for us! All the vp instructions are vector/packed instructions! Magic! This probably explains the big speedup we get for fastmod64.

Closer inspection: As it turns out, the 32bit versions were already auto-vectorized before we implemented this last optimization. Probably because the FastMod32 type is smaller (two u64) than the Fastmod64 type (u128 and u64) and hence easier to vectorize (and similar for StrengthReduce32). But either way this last trick helps a lot for the 64bit variants that will be needed for large hashmaps.

### Compiling and benchmarking PTHash

Compiling PTHash was very smooth; just a git clone, submodule init, and building just worked :)

Running a benchmark similar to the ones here:

 1  ./build -n 10000000 -c 7.0 -a 1 -e compact_compact -s 1234567890 --minimal --verbose --lookup 

reports a query performance of 26ns/key, similar to the fastmod64 performance I get.

Note that PTHash uses fixed-width bitpacking here, while I just store u64’s directly, but this shouldn’t affect the time too much.

Vectorization: More interestingly, PTHash is not auto-vectorized by my compiler, so I’m surprised it performs this well. Maybe the vpgatherqq instruction just doesn’t give that much speedup over sequential lookups – I don’t know yet. But still, my equivalent code using fastmod64 with $$p_2 = 0.3m$$ has 35ns/key vs 26ns/key for PTHash. Confusing.

Branching: PTHash compiles to a branchy version of fastmod(x, p2) or fastmod(x, m-p2), but is still fast.

### Compact encoding

Adding fixed-width encoding was easy using the sucds CompactVector type. The generated code doesn’t look so pretty though – it branches on whether the bits cross a usize boundary, whereas PTHash’s implementation does an unaligned read from a *u8 to avoid this, which seems nicer.

### Find the $$x$$ differences

At this point, both pthash-rs and the original PTHash support encoding by a single compacted vector, but there is still quite some time difference: 31ns vs 25ns. Time to find all the differences.

This may or may not be the best approach, but I decided to put the assemblies side-by-side.

Exhibit A: the missing modulo Ok, I won’t bore you with the full assembly, but I found this in the PTHash assembly:

 1  movabs $0x999999ffffffffff,%rbx  with nothing similar in the rust version. Turns out that this is $$0.6 \cdot (2^{64}-1)$$. Indeed, the code is:  1 2 3 4 5 6  inline uint64_t bucket(uint64_t hash) const { static const uint64_t T = constants::a * UINT64_MAX; return (hash < T) ? fastmod::fastmod_u64(hash, m_M_num_dense_buckets, m_num_dense_buckets) : m_num_dense_buckets + fastmod::fastmod_u64(hash, m_M_num_sparse_buckets, m_num_sparse_buckets); }  note how it does $$hash < 0.6 2^{64}$$ instead of $$hash \mm n < 0.6 n$$ as written in the paper for what FCH does. Basically we can completely drop the $$\mm n$$ there! Sweet! That’s $$1$$ of $$3$$ modulo operations gone :) ### FastReduce revisited Earlier I mentioned the blogpost A fast alternative to the modulo reduction: to map $$h\in[2^k]$$ to $$[n]$$ ($$[n] := \{0, \dots, n-1\}$$ here), instead of taking $$h\mm n$$, one can do $$h*n/2^k$$ which is must faster to evaluate. The problem with this approach is that it only uses the $$\log_2 n$$ high-order bits of $$n$$, discarding some necessary entropy. On second thought, it seems like this may still be useful though. There are two modulo operations in the PTHash algorithm: 1. In the bucket() function, mapping from $$[2^{64}]$$ to $$[p_2]$$ or $$[m - p_2]$$. 2. In the position() function, mapping from $$[2^{64}]$$ to $$[n]$$. And there is one related check: 1. In the bucket() function, check whether $$h < p_1 \cdot 2^{64}$$ (or $$h \mm n \leq p_1 \cdot n$$), which is actually also a reduction operation. While we cannot use FastReduce twice, I think it may still be possible to use it once. In particular, it should be fine to use a low-entropy hash for bucket-selection since we anyway have collisions there – that’s the point of bucketing in the first place. A second optimization may be possible if we could find a function that uses the lower $$\log_2 n$$ bits of the hash for position(). Then we can access $$2\cdot \log_2 n$$ bits of entropy in total which should be sufficient to avoid collisions with constant probability (via the birthday paradox). One idea is something like $$h \mm 2^{\lceil{\log_2 n\rceil}}$$, but this is not quite fair so it may not work out nicely. But then again, maybe we could use that for the bucketing modulo, since it possibly doesn’t require (as much) fairness. Or maybe we can just take the lower $$32$$ bits of $$h$$ and do $$(h\mm 2^{32}) * n / 2^{32}$$. That should probably work just fine :) So, we have 7 possible reduction functions now: 1. FastMod64, same as mod but faster. 2. FastMod32L, taking the lower $$32$$ bits modulo $$n$$. 3. FastMod32H, taking the higher $$32$$ bits modulo $$n$$. 4. FastReduce64: $$(h * n) » 64$$ 5. FastReduce32L, fastreduce on the $$32$$ low bits: $$((h \mm 2^{32}) * n) » 32$$ 6. FastReduce32H, fastreduce on the $$32$$ high bits: $$((h » 32) * n) » 32$$ Excluding mod (which is never better than FastMod64, we can make $$36$$ combinations from this to use for the two modulo operations. Not all combinations end up working (because of lack of entropy when e.g. only the $$32$$ high bits are used). It’s clear that FastMod64 tends to be slow, and that Reduce operations are usually (but not always) faster than Mod operations. The big problem with this benchmark though seems to be that timings are quite inconsistent (variance of a few nanoseconds), and that the quality of generated code (auto-vectorization and loop unrolling) depends on a lot of things and is somewhat unpredictable/inconsistent. Table 1: Query time (ns) for different combinations of reduction functions. Missing entries either fail or slow down construction. bucket reduce ($$\mm m$$) \ position reduce ($$\mm n$$)FM32LFM32HFM64FR32LFR32HFR64 FastMod32L20271920 FastMod32H182619 FastMod64292733282627 FastReduce32L182318 FastReduce32H FastReduce64 Note that the last two rows fail in particular, because they strongly correlate with the check whether elements belong in a small or large bucket, $$h(x, s) < p_1$$. ### TODO Is there a problem if $$\gcd(m, n)$$ is large? ### Faster hashing The current implementation uses Murmur2 to hash both the key $$x\mapsto h(x, s)$$ and the pilot $$k_i \mapsto h(k_i, s)$$. While this hash is fast, it’s still quite some instructions. Instead, especially for the $$h(k_i, s)$$, we may be able to get away with either no or a much simpler hash. No hash: A first try of $$h_0(k_i, s) := k_i$$ returns in failures because the $$k_i$$ become too large. Multiplication hash (MulHash): So we do need more mixing of bits rather than just incrementally increasing $$k_i$$ starting at $$0$$. One common way of doing that is simply to multiply by a large semi-random$64$bit integer. In particular, Murmur also does this, so let’s just reuse their mixing constant and set: $h_1(k_i, s) := (0xc6a4a7935bd1e995 \cdot k_i) \mm 2^{64}.$ (I tried looking for documentation on why this constant was chosen, but there doesn’t seem to be more to it than it works.) In experiments, this gives anywhere between $$1$$ and $$4$$ nanoseconds of speedup. #### TODO Try xxhash See github. ### An experiment Another fun comparison is here, where I use MulHash for $$k_i$$ and replace the MurmurHash for $$h(x)$$ by simply the identity operation (since we’re testing on uniform random $$x$$ anyway):   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38   17 │1f0: mov (%rsi,%rax,1),%rdx 11 │ imul %r13,%rdx # Start of Murmur 32 │ mov %rdx,%rbx 106 │ shr$0x2f,%rbx 7 │ xor %rdx,%rbx 23 │ imul %r13,%rbx 24 │ mov %rdi,%rdx 99 │ movabs $0x35253c9ade8f4ca8,%rbp 11 │ xor %rbp,%rdx 27 │ xor %rbx,%rdx 28 │ imul %r13,%rdx 92 │ mov %rdx,%rbx 15 │ shr$0x2f,%rbx 15 │ xor %rdx,%rbx 28 │ imul %r13,%rbx 95 │ mov %rbx,%rbp 16 │ shr $0x2f,%rbp 17 │ xor %rbx,%rbp # End of Murmur 25 │ mov %ebp,%edx 64 │ cmp %rbp,%r8 43 │ ↓ jbe 250 # Branch for bucket index 9 │ imul %r14,%rdx 12 │ shr$0x20,%rdx 27 │ ↓ jmp 25d │ cs nopw 0x0(%rax,%rax,1) # nop; for code alignment 7 │250: imul %r15,%rdx 4 │ shr $0x20,%rdx 11 │ add 0x10(%rsp),%rdx 5088 │25d: mov (%r9,%rdx,8),%rdx # Memory lookup -- most waiting is here. 301 │ imul %r13,%rdx 98 │ xor %rbp,%rdx 453 │ mulx %r11,%rdx,%rdx 100 │ cmp %rdx,%r10 2 │ ↓ jbe 6e8 13 │ add %rdx,%r12 10 │ add$0x8,%rax 37 │ cmp %rax,%rcx 100 │ ↑ jne 1f0 
Code Snippet 1: Assembly of $$index()$$ function when using Murmur (which takes $$17/38$$ instructions): 18ns/query.
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21   72 │4a0: mov (%rcx,%rbp,1),%rbx 38 │ mov %ebx,%edx 21 │ cmp %rbx,%rsi 36 │ ↓ jbe 4c0 # Branch for bucket index 16 │ imul %r15,%rdx 22 │ shr 0x20,%rdx 24 │ ↓ jmp 4cb │ data16 cs nopw 0x0(%rax,%rax,1) # code alignment │4c0: imul %r14,%rdx 20 │ shr0x20,%rdx 21 │ add %r10,%rdx 1895 │4cb: mov (%rdi,%rdx,8),%rdx # Memory lookup -- most waiting is here 172 │ imul %r13,%rdx 75 │ xor %rbx,%rdx 252 │ mulx %r9,%rdx,%rdx 56 │ cmp %rdx,%r8 │ ↓ jbe 706 43 │ add %rdx,%r11 26 │ add $0x8,%rbp │ cmp %rbp,%rax 34 │ ↑ jne 4a0  Code Snippet 2: Assembly of $$index()$$ function when using $$h(x) = x$$ instead: 7ns/query. MurmurHash takes slightly less than half the instructions, but removing them gives almost $$2.5\times$$ speedup! My current thinking is that this is not so much due to the reduced instruction count itself (the CPU is stalling anyway to wait for memory), but rather due to the better pipelining it results in: when loop iterations are shorter (in number of assembly instructions), pipelining can look ahead more iterations, and hence does a better job at prefetching memory. But even with this short loop, around two thirds of the time is still spend waiting for memory. Conclusion 1.: I should really write code with prefetching. Conclusion 2.: It’s time to use perf stat for some metrics on branch mispredictions and instructions per cycle. ### Compiler struggles Auto-vectorization: The compiler is quite eager to generated vectorized assembly code. It’s quite unpredictable when auto-vectorization triggers, and it seems I have to keep at least one branch in the hot loop to prevent it. The vectorized code seems bad for a few reasons: • Gather instructions (vpgatherqq) are slow. • Pipelining: It seems that pipelining works much better for the scalar version, being able to look ahead further and keeping busy while waiting for memory to load. Even worse: Also, it did the following terrible thing. Starting with this piece of innocent looking code:  1 2 3 4 5  if likely(p < self.n0) { p } else { unsafe { *self.free.get_unchecked(p - self.n0) } }  the compiler decided to generate: Basically: it created a branchless implementation of this if statement where the false branch is always executed. But that branch is super slow! Basically a completely unnecessary read from main memory! For now I’ll just completely remove the false branch to prevent this issue… ### Prefetching, at last Without further ado, here we go:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23  #[inline(always)] pub fn index_stream<'a, const L: usize>( &'a self, xs: &'a [Key], ) -> impl Iterator + 'a { let mut next_hx: [Hash; L] = xs.split_array_ref().0.map(|x| self.hash_key(&x)); let mut next_i: [usize; L] = next_hx.map(|hx| self.bucket(hx)); xs[L..].iter().enumerate().map(move |(idx, next_x)| { let idx = idx % L; let cur_hx = next_hx[idx]; let cur_i = next_i[idx]; next_hx[idx] = self.hash_key(next_x); next_i[idx] = self.bucket(next_hx[idx]); // TODO: Use 0 or 3 here? // I.e. populate caches or do a 'Non-temporal access', meaning the // cache line can skip caches and be immediately discarded after // reading. unsafe { prefetch_read_data(self.k.address(next_i[idx]), 3) }; let ki = self.k.index(cur_i); let p = self.position(cur_hx, ki); p }) }  For $$L = 64$$, this is around twice as fast as the non-streaming/non-prefetching version! In our $$n=10^7$$ benchmark, this reduces latency to 4.2ns!! For the larger $$n=10^8$$ benchmark, latency is 7.5ns, down from 28ns of the original PTHash paper! (But note that I don’t do any compression here.) And this is without vectorization still :) ### TODO Prefetching with vectorization Preliminary results: this seems tricky to get right and tends to be slower. It sometimes generates unwanted gather instructions, but even when it doesn’t it’s slow although I don’t know exactly why yet. Does pipelining work with SIMD instructions? ### Inverting $$h(k_i)$$ When there are only a few empty slots left, we want to find $$k$$ such that $R(h_1(x) \lxor h_2(k), n) = p$ for some fixed position $$p$$. Using FastReduce64 as reduction $$R$$, we want $\left\lfloor\frac{(h_1(x) \lxor h_2(k)) * n}{2^{64}}\right\rfloor = p$ i.e. $2^{64} p \leq (h_1(x) \lxor h_2(k)) \cdot n < 2^{64}(p+1)$ i.e. $\frac{2^{64} p}{n} \leq h_1(x) \lxor h_2(k) < \frac{2^{64}(p+1)}{n}.$ This basically fixes the high $$r:=\log_2(n)$$ bits of $$h_1(x) \lxor h_2(k)$$. (I suspect it only leaves two possible options for the highest $$\lfloor r\rfloor$$ bits.) Thus, we infer (with some handwaving) $h_2(k) = \frac{2^{64}(p+\frac 12)}{n} \oplus h_1(x) \oplus A =: X \oplus A$ where $$X$$ is the target constant and $$A$$’s upper $$r$$ bits are $$0$$, i.e. $$A < 2^{64-r}$$. (TODO: Maybe we can just for the lower $$64-r$$ bits of $$X$$ to $$0$$. Not sure.) We could solve this when using Murmur for $$h_2$$, but in practice we use a much simpler MulHash: $$h_2(k) = (C\cdot k) \mm 2^{64}$$. Thus, we must solve $(C \cdot k) \mm 2^{64} \in \{ X \oplus A : 0\leq A < 2^{64-r}\}$ for $$k$$. Now, since $$C$$ is odd it has a multiplicative inverse $$C’$$ over $$F_{2^{32}}$$ with $$(C \cdot C’)\mm 2^{64} = 1$$, so we’re looking for $k \in \{ (C’ \cdot (X \oplus A)) \mm 2^{64} : 0\leq A < 2^{64-r}\}.$ This makes it very easy to find arbitrary solutions! The big drawback is that this results in ‘random’ $$k$$ in $$[2^{64}]$$, while for compression purposes we much prefer finding small (or even minimal) $$k$$. Given the size of the set, the expected value of the smallest solution $$k$$ is $$2^r = n$$. But I’m not sure if this is solvable efficiently. We could first look for solutions where the $$64-r$$ high bits of $$k_i$$ are $$0$$, and if none are found we can relax to $$64-r-1$$ high bits of $$0$$, and so on. But does that even help? Remember that $$C$$ is a mixing constant and hence both $$C$$ and $$C’$$ are quite ‘random’ with lots of $$1$$ bits. Maybe we can choose $$C$$ to have good mixing but also simplify this inverse? Or can it always be done? I don’t know.. ### Another day of progress • Explored ways to solve the hash-inversion problem above. Seems that an $$O(64-r)$$ solution is possible, and maybe even $$O( r)$$ but that’s probably more annoying. (Either way both will be $$O(1)$$ for $$n < 2^{64}$$ ;) There are a lot of interesting things going on to be written down and formalized later. • Added a binary src/bin/bucket_sizes.rs -n <n> -c <c> -a <a> to print a nice table with number of buckets of each size and related statistics. • Replaced Vec::sort() by radsort::sort which is roughly $$2\times$$ faster for uniform u64 hashes. TODO: Probably it will be faster to only sort by the highest $$n$$ bits (or $$32$$ bits), i.e. by only sorting on buckets, since most buckets will have size $$1$$. • I’m playing with the idea of implementing some kind of interpolation sort algorithm that just inserts things directly in the right place in an array of Option<NonZero<usize>> of size $$(1+\epsilon)n$$ or maybe $$n + C \cdot \sqrt n$$ and then runs a collect on this. Should work quite well I think. ### TODO Possible sorting algorithms ### Diving into the inverse hash problem So we want to find the minimal $$k$$ such that $(C\cdot k) \mm 2^{64} \in \{X \oplus A : 0 \leq A < 2^{64}\}$ for some given $$X$$, i.e. the $$r$$ high bits of C*k as $$64$$ bit integers should be equal to the $$r$$ high bits of $$X$$. Let $$\K$$ be the set of $$k$$ for which this holds, so that we are interested in $$\min \K$$. A first thought is: can we first find some valid $$k=k_0$$, and then hope that the space of solutions $$\K$$ has some kind of affine structure so that we can easily translate the initial $$k_0$$ to a minimal solution? So let’s investigate the linear structure first: let’s find solutions that have $$r$$ leading zeros, i.e $\K_0 := \{k : C \times k < 2^{r’}\},$ where I will use $$\times$$ for$64$bit multiplication and $$r’ := 64-r$$. We definitely expect $$\K_0$$ to have some structure: if $$C\times k_1 < 2^{r’}$$ and $$C\times k_2 < 2^{r’}$$, then $$C\times (k_1 + k_2) < 2^{r’+1}$$, which is almost good enough, and similarly, $$-2^{r’} \leq C\times (k_1 - k_2) \leq 2^{r’}$$, so that either $$k_1 - k_2$$ or $$k_2 - k_1$$ is also a solution. Thus, consider those $$k\in \K_0$$ with small absolute value (modulo $$2^{64}$$, so the absolute value $$|k|$$ of $$k$$ is $$\min(k, 2^{64}-k)$$). Then, each difference $$k_d$$ between two consecutive elements of $$\K_0$$ must occur in $$\K_0$$ either as $$k_d$$ or as $$-k_d$$. So let’s write some code to determine all possible differences. I’m doing this for$32$bit integers first since we can quickly iterate over them   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15  type T = u32; fn find_diffs_bruteforce(c: T) { for r in 1..T::BITS { let mut last = 0; let mut diffs = HashSet::new(); for i in 0..=T::MAX { let ci = c.wrapping_mul(i); if ci.leading_zeros() >= r { diffs.insert(i - last); last = i; } } eprintln!("r = {r:>2}: {diffs:?}"); } }  Result:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31  r = 1: {1, 2, 3} r = 2: {3, 5, 8} r = 3: {3, 11, 14} r = 4: {11, 14, 25} r = 5: {14, 39, 53} r = 6: {39, 53, 92} r = 7: {92, 145, 237} r = 8: {92, 329, 421} r = 9: {421, 513, 934} r = 10: {421, 934, 1355} r = 11: {1355, 2289, 3644} r = 12: {1355, 3644, 4999} r = 13: {3644, 8643, 12287} r = 14: {8643, 29573, 38216} r = 15: {8643, 46859, 55502} r = 16: {55502, 64145, 119647} r = 17: {55502, 119647, 175149} r = 18: {175149, 230651, 405800} r = 19: {405800, 1042251, 1448051} r = 20: {405800, 2665451, 3071251} r = 21: {405800, 3477051, 3882851} r = 22: {3882851, 4288651, 8171502} r = 23: {3882851, 27585757, 31468608} r = 24: {3882851, 35351459, 39234310} r = 25: {3882851, 39234310, 43117161} r = 26: {43117161, 90117173, 133234334} r = 27: {43117161, 133234334, 176351495} r = 28: {133234334, 309585829, 442820163} r = 29: {442820163, 1195226155} r = 30: {442820163, 2966506807} r = 31: {3852147133}  WOW: For each $$r$$, it turns out there are only $$3$$ possible differences between adjacent elements of $$\K_0$$! And also note that (when there are $$3$$ distinct values) the largest one is always the sum of the smaller two. What is more: when we increase $$r$$ by one, the new differences always include one or two of the previous differences, and some small linear combination of them. I.e. $$(a,b,c) \mapsto (b, c, b+c)$$ or $$(a,b,c) \mapsto (a, a+c, b+c)$$. This inspires a much faster algorithm to find these possible differences: for $$r$$ from $$1$$ to $$32$$, start with differences $$\Delta_0 := \{1\}$$ for $$r=0$$, and then to get $$\Delta_r$$ from $$\Delta_{r-1}$$ first make all small linear combinations of elements of $$\Delta_{r-1}$$ and then start with $$k=0 \in K_0^r$$, and then keep increasing $$k$$ by all the possible delta’s to see if $$k+\delta\in K_0^r$$. If this is the case add $$\delta$$ to $$\Delta_r$$, and otherwise keep going. Once $$\Delta_r$$ reaches size $$3$$, we can stop assuming our hypothesis that $$|\Delta_r| \leq 3$$ is true. (TODO: This will require a proof eventually.) See the implementation here on GitHub. Also, let’s have a look at what $$C \times \delta$$ looks like in binary for all $$\delta \in \Delta_r$$:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121  r = 1 1 1011011110100011110100110010101 2 10110111101000111101001100101010 3 10011011101011011110010111111 r = 2 3 10011011101011011110010111111 5 11001011000110011000111111101001 8 11011110100011110100110010101000 r = 3 3 10011011101011011110010111111 11 11110010000001010000100101100111 14 101011110101100011000100110 r = 4 11 11110010000001010000100101100111 14 101011110101100011000100110 25 11110111011111111100111110001101 r = 5 14 101011110101100011000100110 39 11111100111110101001010110110011 53 10011101010101101111011001 r = 6 39 11111100111110101001010110110011 53 10011101010101101111011001 92 11111111011011111111000110001100 r = 7 92 11111111011011111111000110001100 145 1111001010100110101100101 237 1010101010011111011110001 r = 8 92 11111111011011111111000110001100 329 110001010011000001111101 421 1101010010001000001001 r = 9 421 1101010010001000001001 513 11111111101001010001001110010101 934 11111111110110100011010110011110 r = 10 421 1101010010001000001001 934 11111111110110100011010110011110 1355 11110101011110100111 r = 11 1355 11110101011110100111 2289 11111111111010011000110101000101 3644 11111111111110001110010011101100 r = 12 1355 11110101011110100111 3644 11111111111110001110010011101100 4999 10000011110010010011 r = 13 3644 11111111111110001110010011101100 8643 10010000101111111 12287 11111111111110100000011001101011 r = 14 8643 10010000101111111 29573 11111111111111000100100101101001 38216 11111111111111010110101011101000 r = 15 8643 10010000101111111 46859 11111111111111101000110001100111 55502 11111111111111111010110111100110 r = 16 55502 11111111111111111010110111100110 64145 1100111101100101 119647 111110101001011 r = 17 55502 11111111111111111010110111100110 119647 111110101001011 175149 10101100110001 r = 18 175149 10101100110001 230651 11111111111111111101100100010111 405800 10001001000 r = 19 405800 10001001000 1042251 11111111111111111110000110100111 1448051 11111111111111111110010111101111 r = 20 405800 10001001000 2665451 11111111111111111111001011000111 3071251 11111111111111111111011100001111 r = 21 405800 10001001000 3477051 11111111111111111111101101010111 3882851 11111111111111111111111110011111 r = 22 3882851 11111111111111111111111110011111 4288651 1111100111 8171502 1110000110 r = 23 3882851 11111111111111111111111110011111 27585757 110100001 31468608 101000000 r = 24 3882851 11111111111111111111111110011111 35351459 11011111 39234310 1111110 r = 25 3882851 11111111111111111111111110011111 39234310 1111110 43117161 11101 r = 26 43117161 11101 90117173 11111111111111111111111111011001 133234334 11111111111111111111111111110110 r = 27 43117161 11101 133234334 11111111111111111111111111110110 176351495 10011 r = 28 133234334 11111111111111111111111111110110 309585829 1001 442820163 11111111111111111111111111111111 r = 29 442820163 11111111111111111111111111111111 1195226155 111 r = 30 442820163 11111111111111111111111111111111 2966506807 11 r = 31 442820163 11111111111111111111111111111111 3852147133 1  We see that the first $$r$$ bits of $$C\times \delta_r$$ are either $$0$$ or $$1$$, so $$-2^{32-r} < C \times \delta_r < 2^{32-r}$$. So using this knowledge, how do we find $$\min \K$$ for an arbitrary constant $$X$$? Well, we can start with $$k = 0$$ and $$r=0$$, and then keep trying to add elements of $$\Delta_{r+1}$$ until the $$r$$’th-highest bit of $$C\times k$$ is the same as the one in $$X$$. Then we increase $$r$$ and repeat until all highest $$r$$ bits are correct.   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21  fn find_inverse_fast(X: T, r: u32, diffs: &Vec>) -> T { let mut k = 0; let mut rr = (C.wrapping_mul(k) ^ X).leading_zeros(); 'rr: while rr < r { for &d in &diffs[rr as usize] { let new_rr = (C.wrapping_mul(k + d) ^ X).leading_zeros(); if new_rr >= rr { k += d; rr = new_rr; eprintln!( "k+={d:10} = {k:10}: {:032b} {:032b} {rr:>2}", C.wrapping_mul(k), C.wrapping_mul(k) ^ X ); continue 'rr; } } unreachable!(); } k }  Running this on some random input for $$r=32$$ gives for example:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27  r = 32 X = 10110111100010000001010011000110 delta k k*C (k*C)^X r k+= 1 = 1: 01011011110100011110100110010101 11101100010110011111110101010011 0 k+= 1 = 2: 10110111101000111101001100101010 00000000001010111100011111101100 10 k+= 1355 = 1357: 10110111101100110010101011010001 00000000001110110011111000010111 10 k+= 934 = 2291: 10110111100011010110000001101111 00000000000001010111010010101001 13 k+= 8643 = 10934: 10110111100011101000000111101110 00000000000001101001010100101000 13 k+= 8643 = 19577: 10110111100011111010001101101101 00000000000001111011011110101011 13 k+= 3644 = 23221: 10110111100010001000100001011001 00000000000000001001110010011111 16 k+= 55502 = 78723: 10110111100010000011011000111111 00000000000000000010001011111001 18 k+= 230651 = 309374: 10110111100010000000111101010110 00000000000000000001101110010000 19 k+= 405800 = 715174: 10110111100010000001001110011110 00000000000000000000011101011000 21 k+= 405800 = 1120974: 10110111100010000001011111100110 00000000000000000000001100100000 22 k+= 3882851 = 5003825: 10110111100010000001011110000101 00000000000000000000001101000011 22 k+= 3882851 = 8886676: 10110111100010000001011100100100 00000000000000000000001111100010 22 k+= 3882851 = 12769527: 10110111100010000001011011000011 00000000000000000000001000000101 22 k+= 3882851 = 16652378: 10110111100010000001011001100010 00000000000000000000001010100100 22 k+= 3882851 = 20535229: 10110111100010000001011000000001 00000000000000000000001011000111 22 k+= 3882851 = 24418080: 10110111100010000001010110100000 00000000000000000000000101100110 23 k+= 3882851 = 28300931: 10110111100010000001010100111111 00000000000000000000000111111001 23 k+= 3882851 = 32183782: 10110111100010000001010011011110 00000000000000000000000000011000 27 k+= 133234334 = 165418116: 10110111100010000001010011010100 00000000000000000000000000010010 27 k+= 133234334 = 298652450: 10110111100010000001010011001010 00000000000000000000000000001100 28 k+= 133234334 = 431886784: 10110111100010000001010011000000 00000000000000000000000000000110 29 k+=1195226155 = 1627112939: 10110111100010000001010011000111 00000000000000000000000000000001 31 k+= 442820163 = 2069933102: 10110111100010000001010011000110 00000000000000000000000000000000 32  In this example we are somewhat lucky and $$k=2$$ happens to have the $$10$$ highest bits correct, but generally the number of correct bits grows by a few each step. I have written some code to verify the correctness and minimality of the result, and indeed results are always identical to a bruteforce search starting at $$k=0$$. We can also make some statistics on the value of $$\min \K^r$$ compared to $$2^r$$. Since $$|\K^r| = 2^{64-r}$$ we expect the smallest solution to be around $$2^r / 2$$. In practice this seems to be mostly the case, but not quite exactly. Over a total of $$10^8$$ samples (which takes $$40$$ seconds) we get the following ratios of $$k/2^r = \min(\K^r)/2^r$$:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66   r avg(k/2^r) max(k/2^r) r = 0: 0.000 0.000 r = 1: 0.250 0.500 r = 2: 0.375 0.750 r = 3: 0.500 1.000 r = 4: 0.694 2.375 r = 5: 0.741 1.750 r = 6: 0.538 1.922 r = 7: 0.600 1.422 r = 8: 0.588 1.645 r = 9: 0.644 1.539 r = 10: 0.564 1.718 r = 11: 0.536 1.245 r = 12: 0.552 1.677 r = 13: 0.549 1.365 r = 14: 0.553 1.784 r = 15: 0.527 2.319 r = 16: 0.762 2.060 r = 17: 0.599 1.475 r = 18: 0.571 1.991 r = 19: 0.501 1.309 r = 20: 4.275 16.086 r = 21: 4.580 12.026 r = 22: 3.083 7.008 r = 23: 1.763 3.691 r = 24: 0.939 1.908 r = 25: 0.515 1.923 r = 26: 0.801 2.416 r = 27: 0.671 1.692 r = 28: 0.556 1.571 r = 29: 0.526 1.209 r = 30: 0.529 1.602 r = 31: 0.645 1.798 r = 32: 0.513 1.148 r = 33: 0.515 1.473 r = 34: 0.934 2.784 r = 35: 0.865 2.160 r = 36: 0.561 1.208 r = 37: 0.532 1.748 r = 38: 0.584 1.446 r = 39: 0.555 1.732 r = 40: 0.603 3.102 r = 41: 1.291 4.149 r = 42: 1.100 2.724 r = 43: 0.690 1.470 r = 44: 0.902 4.465 r = 45: 1.530 4.437 r = 46: 1.191 2.770 r = 47: 0.721 1.569 r = 48: 0.560 1.523 r = 49: 0.525 1.154 r = 50: 0.600 1.534 r = 51: 0.516 1.823 r = 52: 0.598 1.439 r = 53: 0.535 1.703 r = 54: 0.578 1.343 r = 55: 0.583 1.589 r = 56: 0.522 1.130 r = 57: 0.614 1.528 r = 58: 0.523 1.329 r = 59: 0.849 2.757 r = 60: 0.777 1.902 r = 61: 0.502 1.082 r = 62: 0.565 1.492 r = 63: 0.532 1.254 r = 64: 0.500 1.000  Indeed, the average value of $$k/2^r$$ is typically just above $$0.5$$ or at least below $$1.0$$, but occasionally jumps to above $$4$$, in particular around $$r\in \{20, 21, 22, 23\}$$. I’m really not quite sure why, but this particular range seems to have something weird going on, as also seen by the maximum $$k/2^r$$ ratio of $$16$$. (Note that this is not a bug with our fast inversion algorithm. The bruteforce algorithm confirms these results.) Most likely this is caused in some way by the presence of a very small $$\delta_r$$ for these $$r$$ (marked with a *), and this correlating with all $$C\times \delta_r$$ not only being $$< 2^{64-r}$$ (in absolute value) but being either very close to it or very small.   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29   r delta_r C*delta_r r = 19 164249 111111111101001111010010100101010111100001101 521986 1111111111111111111111110111101110100011000101011001011000101010 686235 111110111011000011101011010000100010100110111 r = 20 * 521986 1111111111111111111111110111101110100011000101011001011000101010 16345815 11111111001100111001111011111101111000100011 16867801 11110110111011011101000001010111010001001101 r = 21 * 521986 1111111111111111111111110111101110100011000101011001011000101010 24697591 1111010110101101011010010010100000011000011 25219577 1110010100100001110010111101101011011101101 r = 22 * 521986 1111111111111111111111110111101110100011000101011001011000101010 28873479 111000101010000011111101011111001000010011 29395465 110000011000100111000010111000100000111101 r = 23 * 521986 1111111111111111111111110111101110100011000101011001011000101010 30439437 11111110101101101001101101011010010010001 30961423 10111100100010000010011000100101010111011 r = 24 * 521986 1111111111111111111111110111101110100011000101011001011000101010 31483409 1111010010110011011000011110000011100101 32005395 111000001010110011101110111011100001111 r = 25 32005395 111000001010110011101110111011100001111 32527381 1111111111111111111111111110101111111001100011010000110100111001 64532776 101110001010000000001001000010001001000  Conclusion: We now have a fast $$O( r)$$ algorithm to solve $$C\times k = X \lxor A$$ with $$0\leq A < 2^{64-r}$$. But this does not yet solve the full problem since we actually wanted $\frac{2^{64} p}{n} \leq X \lxor (C\times k) < \frac{2^{64}(p+1)}{n}$ and we did some handwaving to assume that implies the topmost $$r$$ bits of $$C\times k$$. ### Bringing it home Let’s start with a (slightly pathological, but illustrating) $$32$$ bit example: $$n=511$$, $$p=127$$, so we want $\frac{2^{32}\cdot 127}{511} \leq X \lxor (C\times k) < \frac{2^{32} \cdot 128}{511}.$ In binary, the lower and (inclusive) upper bound are:  1 2 3  low : 0011 1111 1001 1111 1100 1111 1110 0111 high: 0100 0000 0010 0000 0001 0000 0000 1000 lcp : 0 (r=1)  These have a longest common prefix 0 of length $$r=1$$, so we could now use the solution above to enumerate solutions $$k$$ for $$r=1$$ and then for each of them check if $$X \lxor (C\times k)$$ is indeed in this interval. But this is quite inefficient, as the actual interval only has length $$2^{32}/511 \approx 2^{23}$$, so that each possible $$k$$ only works with probability $$2^{23} / 2^{32 - 1} = 2^{23-31} = 2^{-8} = 1/256$$, indicating that we may have to try a lot of distinct $$k$$. Instead, we can split the interval into two ranges:  1 2 3 4 5 6 7  low = r1 start: 0011 1111 1001 1111 1100 1111 1110 0111 r1 end : 0011 1111 1111 1111 1111 1111 1111 1111 lcp : 0011 1111 1 (r=9) r2 start: 0100 0000 0000 0000 0000 0000 0000 0000 high = r2 end : 0100 0000 0010 0000 0001 0000 0000 1000 lcp : 0100 0000 00 (r=10)  Both range $$r_1$$ and $$r_2$$ now cover more than half of the values with the prefix of length $$r=9$$ resp. $$r=10$$ bits, so a suitable $$k$$ should be found within a few tries in each case, and we can simply take the minimum of the two solutions. One additional optimization that we can do is that instead of computing solutions for the two ranges independently, we can first compute a $$k_0$$ that gives the right bits in the LCP of low and high, and then extend from there towards the two disjoint intervals. The final code is as follows (and includes some more handling of edge cases):   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41  /// Solve FastReduce(x ^ MH(k), n) == p efficiently. fn invert_fr64_fast(x: Hash, n: usize, p: usize, diffs: &Vec>) -> u64 { // low = 2^64 * p/n <= x^FR(k) < 2^64 * (p+1)/n = high+1 let low = ((1u128 << 64) * p as u128 / n as u128) as u64; // high is an inclusive bound: (2^64 * (p+1) - 1)/n let high = (((1u128 << 64) * (p + 1) as u128 - 1) / n as u128 - 1) as u64; // In this case the partitioning into two intervals doesn't work. if low == high { return find_inverse_fast_with_test( low ^ x.get(), 64, // r |k| { let xck = x.get() ^ C.wrapping_mul(k); low <= xck && xck < high }, 0, // k0 diffs, ); } // Split [low, high] into two pieces [low, low_end] and [high_start, high] // that have (much) longer LCP. let lcp = (low ^ high).leading_zeros(); // First find the solution for the LCP. let k0 = find_inverse_fast(low ^ x.get(), lcp, diffs); let low_end = low | ((1u64 << (63 - lcp)) - 1); let high_start = low_end + 1; let low_lcp = (low ^ low_end).leading_zeros(); let high_lcp = (high_start ^ high).leading_zeros(); let test = |k| low <= x.get() ^ C.wrapping_mul(k) && x.get() ^ C.wrapping_mul(k) < high; let low_k = find_inverse_fast_with_test(low ^ x.get(), low_lcp, test, k0, diffs); let high_k = find_inverse_fast_with_test(high_start ^ x.get(), high_lcp, test, k0, diffs); min(low_k, high_k) }  This is $$O(64)$$ and takes around a minute to invert $$10^8$$ hashes. Conclusion: We can now efficiently find $$k_i = O(n)$$ such that $$h_1(x) \lxor h_2(k_i)$$ maps to a chosen free slot (when $$h_2$$ is a FastReduce instance). This should allow us to fill the last slots of the table much faster. ### Hash-inversion for faster PTHash construction So now we have a fast way to find $$k_i$$ for the tail of the last $$t$$ buckets. We will assume that these buckets all have size $$1$$. (Otherwise decrease $$t$$.) Let $$F$$ be the set of free positions once the head of $$m-t$$ buckets has been processed. We always have $$|F| \geq t$$ and when $$\alpha = 1$$ we have $$|F| = t$$. We can then implement two strategies: Sequential Iterate over buckets and free slots in parallel, matching each bucket to a slot. Then compute the $$k_i$$ that sends each bucket to the corresponding free slot. This will give $$k_i\sim n$$ in expectation, uses $$t \cdot \log_2(n)$$ bits in total, and runs in $$O(t)$$. Greedy For each bucket (in order), compute $$k_i(f)$$ for each candidate slot $$f$$, and choose the minimal value. When $$\alpha=1$$ this gives $$k_i \sim n/f$$ and runs in $$O(t^2)$$. The total number of bits is $\sum_{f=1}^t \log_2(n/f) = t\log_2(n) - \log_2(t!) \sim t (\log_2(n) - \log_2(t))$ For $$t=O(\sqrt n)$$, this saves up to half the bits for these numbers. In some quick experiments with $$n=10^8$$, the sequential strategy seems to give at most around $$15\%$$ speedup ($$35$$ to $$30$$ seconds for $$t=10000$$), which is not as much as I had hoped. This seems to be because relatively a lot of time is also spent on finding $$k_i$$ for the last buckets of size $$2$$ and $$3$$. ### Fast path for small buckets For small buckets (size $$\leq 4$$) it pays of to use a code path that knows the explicit bucket size and processes a fixed size &[Hash; BUCKET_SIZE] array instead of an arbitrary sized slice &[Hash]. This allows for better code generation. ### TODO Dictionary encoding The dictionary will be quite dense for numbers up to some threshold (say $$1024$$), and sparser afterwards. We can encode the small numbers directly and only do the dictionary lookup for larger ones. • TODO: Figure out if the branch is worth the savings of the lookup. ### TODO Larger buckets The largest bucket should be able to have size $$O(\sqrt n)$$ without issues. From there it should slowly decay (TODO: figure out the math) to constant size. This could put the elements that are currently in the largest $$\sim 1\%$$ of buckets all together in a few buckets, reducing the average size of the remaining buckets. Although the reduction seems only minimal, so this may not give too much benefit. One way of achieving such a skew distribution might be to replace the partitioning of $$h\in [0, 2^{64})$$ in $$m$$ chunks, by a partitioning of $$h^2 \in [0, 2^{128})$$ in $$m$$ chunks. ### TODO Prefetching free slots Looking up whether slots in the array for a certain $$k_i$$ are free is quite slow and memory bound. Maybe we can prefetch values for a few $$k_i$$ ahead. Also, the computation of position could be vectorized. ### Filling the last few empty slots needs very high $$k_i$$! Have a look at the last two columns of this table, indicating the number of newly appearing $$k_i$$ in each class and total number of distinct $$k_i$$ up to then:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27  sz cnt bucket% cuml% elem% cuml% avg ki max ki new ki # ki 25: 1 0.00 0.00 0.00 0.00 0.0 0 1 1 24: 1 0.00 0.00 0.00 0.00 0.0 0 0 1 23: 5 0.00 0.00 0.00 0.00 0.0 0 0 1 22: 11 0.00 0.00 0.00 0.00 0.0 0 0 1 21: 63 0.00 0.00 0.00 0.00 0.0 0 0 1 20: 181 0.00 0.00 0.00 0.01 0.0 0 0 1 19: 566 0.00 0.00 0.01 0.02 0.0 1 1 2 18: 1494 0.01 0.01 0.03 0.04 0.0 1 0 2 17: 4143 0.02 0.02 0.07 0.11 0.0 2 1 3 16: 10119 0.04 0.06 0.16 0.28 0.0 2 0 3 15: 23730 0.09 0.15 0.36 0.63 0.1 3 1 4 14: 52393 0.20 0.35 0.73 1.36 0.2 6 3 7 13: 107711 0.41 0.76 1.40 2.77 0.4 8 2 9 12: 205081 0.78 1.54 2.46 5.23 0.8 14 6 15 11: 359440 1.36 2.90 3.95 9.18 1.5 27 10 25 10: 581154 2.21 5.11 5.81 14.99 2.9 57 23 48 9: 854333 3.24 8.35 7.69 22.68 6.0 123 44 92 8: 1149239 4.36 12.72 9.19 31.87 12.7 264 95 187 7: 1418308 5.38 18.10 9.93 41.80 25.6 539 181 368 6: 1685178 6.40 24.50 10.11 51.91 46.4 873 288 656 5: 2096913 7.96 32.46 10.48 62.40 73.6 1578 391 1047 4: 2880342 10.94 43.40 11.52 73.92 107.5 2276 580 1627 3: 4046307 15.36 58.76 12.14 86.06 150.3 4334 956 2583 2: 4887755 18.56 77.31 9.78 95.83 171.2 8071 1066 3649 * 1: 4166126 15.82 93.13 4.17 100.00 382.2 126928083 14084 17733 * 0: 1809532 6.87 100.00 0.00 100.00 0.0 0 0 17733  The buckets of size $$2$$ and larger only need $$3649$$ distinct $$k_i$$ which can be represented using $$12$$ bits each with dictionary encoding. But after the last single-element buckets we need $$17733$$ distinct $$k_i$$ which would require $$15$$ bits! If we instead use $$\alpha=0.99$$ things look much better:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29   sz cnt bucket% cuml% elem% cuml% avg ki max ki new ki # ki 27: 1 0.00 0.00 0.00 0.00 0.0 0 1 1 26: 1 0.00 0.00 0.00 0.00 0.0 0 0 1 24: 4 0.00 0.00 0.00 0.00 0.0 0 0 1 23: 4 0.00 0.00 0.00 0.00 0.0 0 0 1 22: 20 0.00 0.00 0.00 0.00 0.0 0 0 1 21: 53 0.00 0.00 0.00 0.00 0.0 0 0 1 20: 165 0.00 0.00 0.00 0.01 0.0 1 1 2 19: 514 0.00 0.00 0.01 0.01 0.0 1 0 2 18: 1378 0.01 0.01 0.02 0.04 0.0 1 0 2 17: 3736 0.01 0.02 0.06 0.10 0.0 2 1 3 16: 9386 0.04 0.06 0.15 0.25 0.0 2 0 3 15: 22223 0.08 0.14 0.33 0.59 0.1 3 1 4 14: 49298 0.19 0.33 0.69 1.28 0.2 5 2 6 13: 102501 0.39 0.71 1.33 2.61 0.4 7 2 8 12: 196895 0.74 1.45 2.36 4.97 0.7 14 7 15 11: 348772 1.31 2.76 3.84 8.81 1.3 22 8 23 10: 568406 2.14 4.90 5.68 14.49 2.7 49 19 42 9: 843818 3.17 8.07 7.59 22.09 5.4 118 45 87 8: 1143451 4.30 12.37 9.15 31.23 11.4 210 94 181 7: 1426213 5.36 17.74 9.98 41.22 22.9 395 160 341 6: 1701482 6.40 24.14 10.21 51.43 41.3 9910 261 602 5: 2111700 7.94 32.08 10.56 61.99 65.0 1494 332 934 4: 2897775 10.90 42.98 11.59 73.58 93.5 2247 472 1406 3: 4079515 15.34 58.32 12.24 85.82 125.3 3209 752 2158 2: 4964046 18.67 76.98 9.93 95.74 126.5 4074 437 2595 * 1: 4256736 16.01 92.99 4.26 100.00 38.6 1061 0 2595 * 0: 1863589 7.01 100.00 0.00 100.00 0.0 0 0 2595  Now no new pilot values are needed at all for the single-element buckets, and $$12$$ buts is enough for each value, saving $$20\%$$ of memory! Looking closer at the buckets of size $$1$$, we see that most new values are added for the last few elements, where large $$k_i$$ are needed.   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16  1: 263401 1.00 79.00 0.26 96.28 25.0 355 0 3649 1: 263401 1.00 80.00 0.26 96.54 27.0 364 0 3649 1: 263402 1.00 81.00 0.26 96.80 29.3 381 0 3649 1: 263401 1.00 82.00 0.26 97.07 31.8 429 0 3649 1: 263401 1.00 83.00 0.26 97.33 35.1 463 0 3649 1: 263401 1.00 84.00 0.26 97.60 38.7 562 0 3649 1: 263402 1.00 85.00 0.26 97.86 43.4 623 0 3649 1: 263401 1.00 86.00 0.26 98.12 49.3 630 0 3649 1: 263401 1.00 87.00 0.26 98.39 57.1 720 0 3649 1: 263401 1.00 88.00 0.26 98.65 67.2 896 0 3649 1: 263402 1.00 89.00 0.26 98.91 82.0 1160 0 3649 1: 263401 1.00 90.00 0.26 99.18 105.0 1333 0 3649 1: 263401 1.00 91.00 0.26 99.44 146.4 2047 0 3649 1: 263401 1.00 92.00 0.26 99.70 240.6 3272 0 3649 1: 263402 1.00 93.00 0.26 99.97 824.0 33353 3762 7411 * 1: 263401 1.00 94.00 0.03 100.00 4227.2 126928083 10322 17733 *  ### Perfect matching for the tail Let’s for now only consider some tail of length $$t$$ of buckets of size $$1$$. We already saw two techniques to assign them: sequentially and greedily (giving the minimal $$k_i$$). We also already discussed the possibility of cuckoo hashing, where inserting an element can ’evict’ already placed elements. This seems to need a bit more bookkeeping though, as for each placed element we must track the bucket it belongs to. Another more complex way would be a perfect matching: Compute $$k_i(f)$$ for each bucket $$i$$ and each free slot $$f$$ and find the weighted bipartite matching assigning buckets to free slots that minimizes e.g. $$\sum \log_2 k_i(f)$$ (the total number of bits) or $$\max k_i(f)$$ (the maximal number of bits). This last case is interesting: we could fix a maximal $$K=2^b$$ up front and then find all possible $$k_i < K$$ for all buckets by simply trying them all. Then we find a perfect matching (if one exists) and use these $$k_i$$ for the final buckets. This raises the question: Let $$G$$ be a bipartite graph on $$|A|+|B|=n+n$$ vertices. Let each vertex in $$A$$ have degree $$d$$ edges towards random vertices in $$B$$. What is the smallest $$d$$ for which a perfect matching exists with high probability. If we can answer this, than we can find the $$d$$ smallest $$k_i$$ solutions for the last $$t$$ buckets and solve the perfect matching problem. In this case all final $$k_i$$ would be $$O(d\cdot n/t)$$, for $$O(t \log_2(dn/t))$$ bits total. A closely related question is answered on stackoverflow which implies that we should take $$d = c + \ln n$$. I implemented this and it seems to work quite well! Take for example the tail of $$n=10^8$$, $$c=7$$, $$\alpha=1$$ without this optimization:   1 2 3 4 5 6 7 8 9 10 11  n: 100000000 m: 26340126 sz cnt bucket% cuml% elem% cuml% avg ki max ki new ki # ki ... 1: 263401 1.00 79.00 0.26 96.28 25.0 390 0 3680 ... 1: 263401 1.00 91.00 0.26 99.44 146.2 1626 0 3680 1: 263401 1.00 92.00 0.26 99.70 241.8 3699 3 3683 1: 263402 1.00 93.00 0.26 99.97 835.7 21967 3825 7508 1: 263401 1.00 94.00 0.03 100.00 4135.5 52284963 10187 17695 : 26340126 100.00 100.00 100.00 100.00 137.3 52284963 17695 17695  and now the same where we run perfect matching on the last $$500000$$ buckets/slots where we find all $$k_i$$ below $$2^{12} = 4096$$:   1 2 3 4 5 6 7 8 9 10 11  n: 100000000 m: 26340126 sz cnt bucket% cuml% elem% cuml% avg ki max ki new ki # ki ... 1: 263401 1.00 79.00 0.26 96.28 25.0 418 0 3662 ... 1: 263401 1.00 91.00 0.26 99.44 146.0 1819 0 3662 1: 263401 1.00 92.00 0.26 99.70 299.7 4094 474 4136 1: 263402 1.00 93.00 0.26 99.97 753.4 4095 62 4198 1: 263401 1.00 94.00 0.03 100.00 193.9 4095 0 4198 : 26340126 100.00 100.00 100.00 100.00 97.7 6655 4198 4198  Observe that: • The maximum $$k_i$$ overall is now $$6655$$ instead of $$50M$$ (the value for one of the last buckets of size $$2$$), requiring half the bits to store. • The number of distinct $$k_i$$ has gone down from $$17695$$ to $$4198$$, requiring one less bit to store. • Both the maximum $$k_i$$ and the number of distinct $$k_i$$ is between $$2^{12}=4096$$ and $$2^{13}=8192$$. I.e. dictionary compression on the full array does not provide any additional benefits anymore! (I’m not yet sure whether separate front-back encoding would provide benefits.) For reference, here are the corresponding results when using $$\alpha = 0.99$$ without tail optimizations.  1 2 3 4 5 6   sz cnt bucket% cuml% elem% cuml% avg ki max ki new ki # ki ... 1: 265917 1.00 91.00 0.27 99.48 60.5 804 0 2595 1: 265917 1.00 92.00 0.27 99.74 72.2 887 0 2595 1: 265917 1.00 93.00 0.26 100.00 86.4 1342 0 2595 : 26591682 100.00 100.00 100.00 100.00 69.0 4431 2595 2595  • Both the maximum $$k_i$$ and number of distinct $$k_i$$ are a bit smaller here, mostly because the buckets of size $$2$$ are much easier to place. Conclusion: This works extremely well to handle the last size-$$1$$ buckets. It seems to be a bit slower for now, but I think this can be fixed. (The matching itself is super fast – it’s rather that we now need to search up to a threshold for all buckets, rather than only for the last few buckets.) TODO: See if we can make something similar work for the tail of buckets of size $$2$$. There we don’t need perfect matching though, and simpler techniques such as peeling may suffice (as used in XOR-filters (Graf and Lemire 2020)). This is also somewhat similar to cuckoo hashing (Fan et al. 2014), which has a similar result but uses a slightly less structured approach. In particular Cuckoo hashing does a greedy ‘DFS’ without backtracking for augmenting paths, compared to matching/flow algorithms which do a normal DFS and are able to ‘reset’ $$k_i$$ back to the lowest solution after an augmenting path through them has been found. ### Peeling for size-1 buckets Instead of full matching, let’s try peeling to match the size-1 buckets: Again we first choose a threshold $$K=2^b$$ and find all suitable $$k_i < K$$ for each bucket using bruteforce. Also as before, consider the induced bipartite graph. Suppose that there is a vertex $$u$$ of degree $$1$$. Then we know that the single edge incident to it (say $$uv$$) must be part of the final matching, so we can greedily take it and remove any other edges incident to $$v$$. We can repeat this process: repeatedly choose a vertex $$u$$ of minimal remaining degree, choosing a random edge $$uv$$ incident to it, add $$uv$$ to the matching, and remove any other edges incident to $$u$$ and $$v$$. I was hoping that this would result in a perfect matching with high probability, but it seems that in practice this procedure typically leaves behind a few percent of vertices without any remaining edges. One example situation is where this ‘deadlock’ could occur is when we have two fully connected subgraphs $$A=K_{x,x}$$ and $$B=K_{y,y}$$, and additionally a few extra edges from $$a \in A$$ to $$B$$. In this case, we must never take these ‘cross’ edges, but the algorithm described above doesn’t have such non-local information and could create an imbalance. Not all hope is lost though: 1. This approach may actually work well for buckets of size $$2$$ and larger, where there are spare empty slots and we are not looking for a perfect matching. 2. The partial matching found by peeling could be input to Dinic’ full matching algorithm, possibly reducing the number of iterations needed and providing overall speedup. But on the other hand, Dinic also matches all but a few percent of elements in its first round, so it’s not clear if it would help much. ### Greedy peeling 1: Assigning from hard to easy A simpler alternative to peeling is the following. As before choose a number of bits $$b$$ and threshold $$K = 2^b$$, and find all solutions $$k_i < K$$ for all buckets of size $$2$$. Then sort the buckets by increasing number of solutions. This leaves the buckets with more freedom of choice for last, hopefully giving them a larger probability of finding a match. I had high hopes for this, but it seems to not work great. As expected it reduces the maximum $$k_i$$ required, but (for $$n=10^8$$) it only goes down from $$3000$$ to $$2500$$, which is not even enough to save a bit in a direct representation. ### TODO Peeling and cuckoo hashing for larger buckets. We can processing the buckets in various ways: 1. By increasing number of total edges, as in previous section. 2. By increasing number of remaining edges. 3. By increasing number of remaining free slots covered by all edges. This last method sounds more complicated than the second one, but is simpler in practice: Whenever we fill a slot we simply decrease all buckets mapping to it by $$1$$, instead of having to check first whether those edges are still alive. ### Sunday morning ideas #### Dinic Possible speedups for Dinic algorithm: • It’s never needed to visit vertices with level $$\geq level(t)$$ (other than $$t$$ itself). Do such vertices exist? How many of them? Would it speed things up to skip over them? • Answer: Yes, they exist. But when matching $$1M$$ buckets, there are only around $$100$$ at level $$>level(t)$$. Sometimes there are quite some (up to $$n/2$$) at level $$=level(t)$$ though, so we could skip those. So during level computation, we can stop as soon as we determine the level of $$t$$, and leave all other levels at $$-1$$, so they will never be processed during augmentation, which only visits vertices $$v$$ with $$level(v) > level(u)$$. This saves up to $$25\%$$ of time when matching $$1M$$ buckets. • For vertices in the left component at $$level(u) = level(t)-2$$, instead of doing a DFS, we could first check if any of the outgoing $$v$$ has a free edge to $$t$$, by storing all taken/free $$v$$ in a bitmask. • Done: Instead I implemented a check that for vertices in level $$level(t)-1$$, we only check the edge to $$t$$. This saves around $$30\%$$ time of the augment step, and around $$15\%$$ time of the entire Dinic algorithm. Together these two optimizations bring the runtime on $$1M$$ elements down from around $$3.5$$ seconds to $$2.5$$ seconds: $$30\%$$ speedup. #### New iterative greedy assignment idea Possible improvement to Greedy peeling 1: • Run the greedy linear-pass algorithm once. • Then fix all the remaining unmatched ones to their first outgoing edge. • Rerun the algorithm starting with these ‘hard cases’ already fixed. • Repeat. Will this converge? How many iterations? Is it better to fix hard cases on the fly, evicting previous ones as soon as the hard one is found? Answer: Around 10-20 iterations are typically needed. These iterations are relatively fast, but the iterating to find all solutions below some threshold is quite slow. We can try different sorting orders in which the buckets are processed: • no sorting, • by number of solutions $$k_i$$ below the threshold (going from few to many), • by smallest solution $$k_i$$ (going from high to low). Experiments show that sorting by number of solutions is the best, but this requires precomputing all solutions below the threshold which is quite costly in practice. Sorting by smallest solution is probably a better idea in the end, since this is something that needs to be computed for each bucket anyway. #### Cuckoo hashing, again The reason that this keeps coming up is because I think it will work very well, but I don’t yet see exactly how to implement this in a sufficiently nice/clean way. So I procrastinate by first implementing simpler (to implement) ideas. Basically it’s the same as the iterative greedy assignment idea above, but displacing others on the fly, doing a random walk DFS until success. • TODO: Do we choose outgoing edges at random? sequentially (wrapping around)? Always starting at the start of the list (leading to smaller $$k_i$$)? • Prefer displacing edges that only include one occupied slot. (Or if possible better, never take an edge with $$2$$ (or more) slots already taken.) Some statistics for bucket size $$2$$ ($$n=10^8$$): • $$\sim20\%$$ of the buckets has size $$2$$, and $$\sim20\%$$ of the buckets has size $$1$$. • Bucket size $$2$$ covers elements from percentile $$85\%$$ to $$95\%$$. • The ’early’ buckets have success rate one in $$1/0.15^2 \sim 45$$. • The ’late’ buckets have success rate one in $$1/0.05^2 \sim 400$$. • So only $$1$$ in $$9$$ that works initially still works at the end. • This is problematic, because even though in expectation most will find a solution with $$20$$ candidates, some of the $$f = O(10^6)$$ buckets will be unlucky and will need more like $$O(9\cdot \ln f)$$ candidates, which corresponds to quite large $$k_i$$ actually ($$45\cdot 9\cdot \ln f \sim 4000$$). • Basically, all these cuckoo/displacement strategies are really mostly needed for the few buckets that lie a factor $$\ln f$$ (or say more than $$3\times$$) above the average, so if we can fix those the rest should be relatively OK. • Coming back to cuckoo hashing: Each candidate $$k_i$$ maps the two elements to the $$15\%$$ remaining slots. After most size-$$2$$ buckets have been assigned, $$5\%$$ of slots still remains empty. That means that every candidate edge consisting of $$2$$ slots in this $$15\%$$ has quite a high probability of including at least one empty slot: $$1-(2/3)^2 = 55\%$$. So basically cuckoo hashing has a very high probability of finding edges with only one slot already filled, and $$1/9$$ of the edges has both slots empty. So I think the DFS only needs to see roughly $$10$$ edges in total before success, which is very little actually! ### Cuckoo hashing / displacing, for real now Ok so I implemented a quick first version and, lo and behold: It works great!. So for completeness the algorithm I run now: 1. Put all buckets of fixed size (from $$2$$ to $$5$$ or so, but not $$1$$ (TODO: maybe $$1$$ works as well?)) on a stack. 2. While the stack is not empty: find a $$k_i$$ for the top of the stack that maps to all empty slots. 3. If not possible, find a $$k_i$$ that minimizes the number of collisions. Drop the colliding buckets from the taken slot array and push those buckets on the stack. 4. Repeat until the stack is empty. One tricky point is how to tie-break the selection of the minimal bucket. Always choosing the smallest $$k_i$$ with the minimal number of collisions can lead to cycles where two buckets keep displacing each other. What seems to work: Choose the first $$k_i$$ after any existing $$k_i$$ (modulo $$2^b$$). For $$n=10^7$$: • the $$450k$$ buckets of size $$3$$ needs $$4k$$ displacements, around $$1\%$$; • the $$600k$$ buckets of size $$2$$ need $$20k$$ displacements, around $$3\%$$. For now the bottleneck is twofold: • Iterating over all buckets to find all solutions (edges) up-front is slower than necessary. In practice most solutions $$k_i$$ are quite small and there is no need to iterate all the way to the upper bound up-front. • The bit width $$b$$ has to be large enough to ensure each bucket has at least one solution. When the expected value of the smallest solution is $$\overline{k}$$, the maximum value we need is roughly $$\overline{k} \ln b$$ when there are $$b$$ buckets of the current size. This is unfortunate because in practice $$b$$ is of order $$1M$$, so we need to go up to $$20 \overline{k}$$, whereas maybe $$2\overline{k}$$ would be enough for most of them. TODO: We should fix this by allowing some buckets to displace larger buckets. Probably displacing buckets one larger than itself is good enough. ### Displacing globally A drawback with the implementation above is that it first finds all edges, and only then starts the algorithm. Much more efficient is to do things right from the start. We can make a single global displacement algorithm: 1. Set a threshold $$k_{max} = 2^b$$. 2. Determine buckets and order by decreasing size. 3. For each bucket, try to find the smallest $$k_i<2^b$$ that places it without collisions. If so, move to the next bucket. 4. Otherwise, find the $$k_i<2^b$$ that minimizes the number of collisions, and (more importantly) minimizes the size of the colliding buckets. Then evict those other buckets (remove their $$k_i$$ and push them on a stack), and set the $$k_i$$ for the current bucket. 5. While the stack is non-empty, repeat steps 2 and 3 for the top of the stack. One drawback is that this needs some more memory since for each slot it needs to store the bucket it came from, and it’s also a bit slow since (in case collision-free matching is not possible) we need to lookup the size of each bucket in another random-memory-access. #### Running it This seems to work super well! My implementation isn’t as streamlined yet as the original that just finds the smallest $$k_i$$ per bucket (which has a simpler and thus faster inner loop), but nevertheless gets the job done. Only around the last buckets of size $$2$$ do displacements go above $$1\%$$ of buckets. The main drawback is that it somewhat explodes at the last $$100$$ to $$1000$$ size-$$1$$ buckets, where it has a harder time finding chains that lead to a free slot. Probably falling back to deterministic matching using Dinic is better here. The average $$k_i$$ per bucket-size using this method seems similar to the original algorithm, but the main benefit is that even for $$n=10^8$$ we can now find solutions that with all $$k_i < 2^8$$, meaning that we can store the pilots simply as a vector of u8 instead of needing a compact representation and/or dictionary encoding. In fact I’m now thinking it’s probably simpler to tune $$c$$ (i.e. the number of buckets) such that we can easily find a solution where all pilots fit in $$8$$ bits, rather than having a variable packing size. Some further observations: • Only rarely is more than one collision needed. • TODO: In this case can we backtrack instead of displacing two buckets? • When filling buckets of size $$s$$, at worst buckets of size $$s+1$$ are evicted, never larger ones. #### Limitations As already mentioned, one problem with this is that matching the last few buckets is hard. Basically, for the last bucket we have to find a size-$$1$$ bucket for which one of its $$256$$ possible pilots maps it to the remaining free slot, which only happens for $$0.5\%$$ of buckets (which is large for $$n=10^8$$). Alternatively we can use Dinic for the size $$1$$ buckets again, but this also has a limitation as mentioned before, that there will always be some size-$$1$$ buckets that do not map to any free slot. Thus, it seems we need some balance where: • Some size-$$1$$ buckets displace already-placed size-$$2$$ buckets. • For the remaining size-$$1$$ buckets we find some/all solutions, build the graph, and run Dinic. One other tricky problem is that not only do we have to ensure that each bucket maps to a free slot, but also that each free slot has some bucket mapping to it. I’m not quite sure yet how exactly to solve this, since the probability of such a position existing is also quite large. One option is to track for each slot one or two $$(bucket, pilot)$$ pairs that maps to it, and then use that in case no other remaining values map there. But this might then trigger a chain of displacements again in the other (‘pulling’) direction. Although that’s somewhat inevitable in the $$\alpha = 1$$ setting. ### Cleanup and revisiting defaults So far I have been using Vec<u64> as backing storage for pilots, MurmurHash to hash keys, and $$16$$ iteration lookahead for prefetching. 1. Now that we aim to have $$k_i < 2^8$$, let’s switch that to a simple Vec<u8>. 2. I added support for FxHash, which processes a u64 at a time using a simple rotate-xor-multiply: state=(state.rotate(5) ^ x)*constant. While benchmarking this, I realized that the reason NoHash (which is a toy implementation that does exactly nothing) was so much faster than other hashes is not only because it spends less instructions hashing, but also because my random keys were sorted (to check for duplicates), and this lead to linear access patterns rather than random access patterns. 3. It now seems that$32$-iteration lookahead is around $$10\%$$ faster for large $$n$$. Some results for $$n=10^8$$ ($$c=7$$, $$\alpha=1$$):   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16  Vec murmur (1): 20.3 (16): 7.6 (32): 7.2 (64): 7.3 fxhash (1): 16.3 (16): 7.6 (32): 7.2 (64): 7.2 nohash (1): 15.4 (16): 7.6 (32): 7.2 (64): 7.2 Vec murmur (1): 17.4 (16): 5.8 (32): 5.1 (64): 5.2 fxhash (1): 13.8 (16): 5.5 (32): 4.9 (64): 5.6 nohash (1): 13.0 (16): 5.5 (32): 4.9 (64): 4.8 sucds::CompactVector murmur (1): 20.9 (16): 27.4 (32): 27.4 (64): 27.5 fxhash (1): 62.9 (16): 17.2 (32): 17.6 (64): 17.6 nohash (1): 60.5 (16): 16.8 (32): 17.4 (64): 17.3 sux::CompactArray murmur (1): 21.5 (16): 20.9 (32): 21.0 (64): 20.9 fxhash (1): 16.7 (16): 16.9 (32): 16.1 (64): 17.8 nohash (1): 15.6 (16): 15.3 (32): 15.5 (64): 16.1  (1), (16/32/64) indicate no prefetching resp. prefetching 16/32/64 iterations ahead. Observations: • Switching from Vec<u64> to Vec<u8> saves around $$2ns$$ per query. • Going from Murmur to FxHash typically saves a bit. • Going further to NoHash doesn’t save more. • At $$32$$ lookahead, the runtime does not depend much on hash function, so probably it’s memory bound rather than cpu bound. • CompactVector and CompactArray do not support prefetching and hence aren’t nearly as good. ### TODO Sum instead of xor? PTHash replaces the summation in $$(h(x) + d)\mm n$$ by a xor. It’s not exactly clear to me why this is beneficial. I the xor has slightly better mixing, especially when the $$\mm n$$ is implemented as a reduction of the high bits, and $$h(k_i) = C\cdot k_i$$ has a linear structure as well. ### Revisiting $$\alpha < 1$$ After playing some more with $$\alpha<1$$ values, it seems that we can do this with only very minor loss in query performance. Some experiments for $$n=10^8$$, $$c=10$$, with prefetching $$32$$ iterations ahead: Table 2: Query time (ns) for distinct alpha. index returns the computed $$position(x)$$ directly as a lowerbound on time, while index_remap looks up the target position for positions after $$n$$. $$\alpha$$indexindex_remap $$1.0$$5.75.7 $$0.999$$5.65.8 $$0.995$$5.75.8 $$0.99$$5.75.9 $$0.95$$5.77.1 $$0.9$$5.78.7 As can be seen, index is independent of $$\alpha$$ as expected, since it doesn’t do any extra work for out-of-range elements. index_remap becomes slow once too many elements fall out of bounds around $$\alpha = 0.95$$. But we can safely choose $$\alpha=0.99$$ without significant loss of query performance. ### Elias-Fano for the remap-dictionary I added support to use sucds::EliasFano to store the dictionary elements. This saves quite some bits, since (for $$\alpha=0.99$$) it only needs $$2+\log_2(100) = 9$$ bits per element as opposed to the default which encodes each index using a u32. So let’s see the query performance. Table 3: Query time (ns) for distinct alpha, using EliasFano encoding for the remapping. $$\alpha$$indexindex_remap $$1.0$$5.76.6 $$0.999$$5.86.7 $$0.995$$5.87.2 $$0.99$$5.77.6 $$0.95$$5.711.1 $$0.9$$5.716.1 This comes out quite a bit slower than using a direct Vec<u32> encoding. On the other hand, it does save space: for $$\alpha=0.99$$, the vector encoding uses $$0.01 \cdot 32 = 0.32$$ bits per element overall, while EliasFano only needs $$0.01\cdot 9 = 0.09$$ bits per element overall, saving $$0.2$$ bits/element. Sadly the EliasFano hot-loop has slightly worse codegen even in the non-EF part, introducing a branch where previously non-branching assembly was generated for the >>small shift in determining the right bucket. ### Global iterative prioritizing Instead of displacing using DFS, another idea I had that may work for $$\alpha<1$$: 1. Iterate over all buckets by decreasing size; place those that can be placed without collisions. 2. Collect the rest in a vector of priority_buckets. 3. Repeat, but place the priority buckets before iterating over the others. This worked well with only buckets of a fixed size, but seems to not work as well globally, even when we try to assign the priority buckets to have minimal collisions with others. Instead of converging to a stable solution, each iteration about the same amount ($$0.3\%$$ for $$n=10^8$$, $$c=10$$, $$\alpha=0.99$$) of buckets is added to the priority list. It seems that the changes made by the few extra priority buckets propagate a lot, changing the optimal pilot for $$40\%$$ of buckets. This basically means the taken slots are completely newly random each iteration, preventing converging to a stable solution where the number of added priority elements goes down each round. Basically, just DFSing the displacement using the earlier strategy seems much better. ### Cleanup: removing peeling and suboptimal displacing code The code base is getting messy. I’m going to drop all the peeling experiments, and also the ‘displacement per bucket size’ will go. Basically all methods that require computing edges up front are annoying in this regard. Also matching seems not-so-useful anymore now that we can settle on $$\alpha=0.99$$ without impacting the query throughput. It may still be useful though in cases where we want to minimize memory usage, but even then we can go with $$\alpha=0.999$$ which would take longer to construct but should still be fine, and avoids the complexity of having to balance between perfect matching and displacing. So we’re dropping all these algorithms which required all edges to be precomputed up-front: • Peeling algorithms • Didn’t really work in the first place • Dinic matching • Works well for size-$$1$$ buckets, but needs a large threshold to guarantee every bucket and slot have at least one edge. For larger bucket sizes the balancing act is annoying. • Per-bucket-size displacing: • The greedy non-iterative version that simply runs once and only fills $$95\%$$ of buckets. • The greedy iterative version works well but only does one bucket size at a time. • The DFS version that works per bucket size isn’t quite good enough because sometimes we must displace to larger buckets. • The global iteration method of the previous subsection also goes – it just doesn’t work well. • The code to invert hashes $$s = C \times k_i$$ is also dropped. Inverting returns values that are $$O(n)$$ while we are now only interested in values at most $$2^8$$. I’m kinda sad about this because the math there was really cool. • The ‘fast small buckets’ optimization is also dropped for now, to be reintroduced later, to clean up code a bit. All together, this drops around 1900 lines of code :) ### Some speedups to the displacement algorithm • Use a fast-path for detecting placement without collisions, which is sufficient for $$99\%$$ of buckets. • Use a taken bitvector for the fast-path, instead of a full slot-to-bucket-index mapping . • Attempt at prefetching of positions in the taken vector for upcoming $$k_i$$. Doubtful whether it actually improves construction time though. ### TODO Runtime analysis of displacement algorithm It would be nice to be able to estimate the runtime complexity of the displacement algorithm in terms of $$n$$, $$m$$ (/$$c$$), $$\alpha$$, and the maximum allowed pilot value. ### TODO Optimal prefetching strategy For each class of buckets of the same size we could device some optimal prefetch strategy: It’s wasteful to prefetch the slot corresponding to the last element of each bucket when the probability of failing in earlier elements in the bucket is high. E.g. for size $$2$$ buckets only roughly $$1$$ in $$6$$ of the second elements needs to be queried. Not prefetching those may give a nice speedup. ### Are we close to the memory bandwidth? I have DDR4 @ 3200MHz, which has a theoretical bandwidth limit of $$25GB/s$$. For $$n=10^9$$ (1G), the taken vector has size 1G*1bit=125MB and does not fit in my 12MB L3 cache. Thus let’s say for simplicity that all lookups in it are cache misses, and thus that each lookup reads a single$64byte cacheline from main memory. But how many lookups do we need? • there are 0.3G buckets • the average $$k_i$$ is $$40$$ • buckets have around $$3$$ elements on average • total number of lookups: 0.3G * 40 * 3 ~ 40G • But that’s an upper bound, since in practice we do an early break after the first taken slot. Simply counting the number of lookups gives me 20G, which sounds reasonable: it’s lower and the same order of magnitude. So how long does it take to do 20G lookups? $$20G * 64byte = 1.28TB$$, which at $$25GB/s$$ is $$51s$$. In practice the construction takes longer at $$350s$$, so this indicates we’re not yet memory-bandwidth bound, assuming we could indeed saturate the full $$25GB/s$$, but it’s not too far from it. ### More sorting algorithm resources I found some more implementations and posts about sorting algorithms, so listing them here for future reference: Robinhoodsort https://github.com/mlochbaum/rhsort Sorts uniform random data by placing values at their interpolated location in a $$2n$$ size array. Ska_sort https://github.com/skarupke/ska_sort In-place radix sort with CPU-pipelining-friendly swapping of values to their buckets. See posts investigating radix sort, I wrote a faster sorting algorithm, and faster sorting algorithm part 2. Wolfsort https://github.com/scandum/wolfsort Didn’t really look into this. Voracious sort https://github.com/lakwet/voracious_sort Rust library with various radix sort implementations, with the default voracious sort being an improved ska_sort. I tried all implementations in here for sorting random $$10^8$$ random 64=bit hashes, and none beats =radsort::sort_by_key(|h| h>>32) followed by manually sorting equal buckets. More blogs http://stereopsis.com/radix.html pqdsort Pattern defeating quicksort, https://github.com/orlp/pdqsort, by Orson Peters glidesort https://github.com/orlp/glidesort, by Orson Peters #### And some resources on partitioning Branchles partitioning https://github.com/Voultapher/sort-research-rs/blob/main/writeup/lomcyc_partition/text.md by lukas Bergdoll Blog on Lomoto partitioning https://orlp.net/blog/branchless-lomuto-partitioning/ by Orson Peters ### Partitioning to reduce memory latency As discussed just before, memory latency and bandwidth are a bottleneck in construction times when $$n=10^9$$. For a typical bucket with $$2$$ elements and $$k_i=50$$, we must lookup $$100$$ positions in taken, corresponding to fetching $$100$$ cachelines from main memory. It would be much better if these cachelines were already available in a low level cache. So how about if we already roughly know where the elements of this bucket are supposed to go? Maybe a window of say $$10$$ cachelines (i.e. $$10\cdot 64\cdot 8 = 5120$$ bits) should be wide enough to find an empty slot with high probability. So we could say that bucket $$i$$ of $$m$$ maps to a window/slice of some length starting at $$i/m \cdot s$$ (where $$s=n/\alpha$$ is the total number of slots). But actually we can simplify this: • We can divide the $$m$$ buckets into $$P$$ partitions, and then the first $$1/P$$ of buckets map to the first $$1/P$$ of slots. So there are $$P$$ partitions with $$m_p = round(m/P)$$ buckets each, and each partition corresponds to $$s_p = round(n/\alpha/P)$$ slots. So then, the position for an element $$x$$ with bucket $$i$$ becomes $p(x, i) = \lfloor i/m_p \rfloor \cdot s_p + ((h(x) \lxor h(k_i)) \mm s_p.$ As usual, we can remap values larger than $$n$$ to the free slots in the rest of the array. Note that this is slightly different from the PTHash-HEM partitioning in Pibiri and Trani (2021a) since that builds a number of independent MPHFs, while this method builds a single one and only constructs a single free array. One point needs some careful consideration: this method does not guarantee that each partition has the same number $$n_p = round(n/\alpha)$$ of elements, so the total load factor per partition will vary. This is not fatal though: The expected number of elements per partition is indeed $$n_p$$, and since this is a binomial distribution for large $$n$$ and small probability $$p=1/P$$ the standard deviation is only $$\sqrt{n_p}$$. This means that as long as we take, say, $$4\sqrt{n_p}$$ additional slots per partition, the probability of a bucket exceeding the number of slots is less than $$1$$ in $$16000$$. (Otherwise we rehash everything with a new global seed.) Since we anyway reserve free slots with $$\alpha<1$$, we can simply make some more free slots. Construction time may then be slightly concentrated in the buckets with number of elements significantly above the mean, but the memory latency savings should counter this easily. So the new algorithm would be: 1. Compute bucket for each element. 2. Partition buckets into $$P$$ partitions of $$m_p$$ buckets each. 3. Sort the buckets in each partition by decreasing size. 4. (Optional) prefetch the $$s_p$$ bits of the taken array corresponding to the current partition into low-level cache. 5. Run the usual algorithm for each partition, searching for the smallest pilot that maps each bucket to empty slots only. 6. Remap position above $$n$$ to the free slots. As said before, the main benefit here is that step 5 should now run much much faster, since each lookup for a free slot can go directly to L1 or L2 cache rather than main memory. So let’s see how many slots $$s_p$$ we should aim for in each partition. • My L1 cache is $$32kb = 256kbit$$, so we could choose $$n_p = 200k$$ elements per partition. • My L2 cache is $$256kb = 2048Mbit$$, so we could choose $$n_p = 2M$$ elements per partition. The optimal size is probably determined using experimentation. The benefit of smaller partitions is that they fit in a faster cache, while $$10$$ times larger partitions have a relatively $$\sqrt{10} \approx 3\times$$ smaller standard deviation, hence needing less buffer slots. As an example let’s take $$n=10^9$$ with $$n_p = 2M$$ elements per bucket. This creates $$500$$ buckets, and needs $$4\sqrt{n_p} \approx 5k$$ buffer slots. $$5k/2M = 0.25\%$$, so actually choosing $$\alpha=0.99$$ (which gives $$1\%$$ buffer slots) already gives plenty of guarantee that buckets are not too full and leaves at least $$0.75\%$$ of free slots at the end to speed up construction. Note that this should also be very good for parallelization, since each thread can fill it’s own part of the taken vector in L1 or L2 without interfering with the other threads at all. Implementation efficiency • $$s_p$$ should be a multiple of the $$512$$ bit cache-line size to avoid sharing cachelines between threads. The entire bitvector should be cacheline-aligned as well. But this is really nitpicking. • We should either use fast-divide for the bucket partitioning, or we can simply ensure that there’s $$2^k$$ buckets per partition so we can just do this via a bitshift. Oh some problem? Ughhh, so we manually create imbalanced bucket sizes, and the way we do this is by putting $$60\%$$ of the elements in the first $$30\%$$ of buckets. That means that if we partition buckets simply by id, the first buckets will be much larger than the last buckets…. And also, that means there will be very few small buckets to fill gaps among those large buckets… So it sounds like we want to either distribute those large buckets equally over the range, or partition the buckets e.g. via their id modulo $$P$$. A fix? So instead of making the first $$30\%$$ of buckets large, we have to distribute the large buckets roughly equally over the partitions. Some ideas for this: 1. Make every third bucket large is as follows: • Remap large buckets from 0..m/3 to 0..m by taking $$i \mapsto 3i-1$$, filling positions $$2\mm 3$$. • Remap small buckets from 0..2/m3 t0 0..m by taking $$i\mapsto \lfloor 3i/2 \rfloor$$. Filling positions $$0\mm 3$$ and $$1\mm 3$$ (This is a specific instance of the more general following result: for irrational $$\alpha$$ and $$\beta$$ with $$1/\alpha + 1/\beta = 1$$ the sets $$\{\lfloor i\alpha\rfloor: i\in \mathbb N\}$$ and $$\{\lfloor i\beta\rfloor: i\in \mathbb N\}$$ are disjoint and partition $$\mathbb N$$.) 2. First compute the bucket id as $$h(x) \mm m$$. Then remap elements that should go to large buckets by setting the last two bits of the bucket id to $$0$$. That maps some percentage of elements to buckets $$0 \mm 4$$, creating large buckets there. (We could also map them to buckets $$0\mm 3$$ by doing i/3*3, but that requires more operations.) 3. Simply do the original logic or partitioning by small or large hash value, but choose a cut-off within each partition. So for reference the bucketing function we currently use is \begin{equation*} bucket(x) = \begin{cases} \phantom{p_2+{}}\lfloor h(x) \cdot p_2 / 2^{64}\rfloor & \text{if } h(x) < p_1 \\ p_2 + \lfloor h(x) \cdot (n-p_2) / 2^{64}\rfloor & \text{if } h(x) \geq p_1. \end{cases} \end{equation*} If we have $$2^k$$ partitions, we can mask the first $$k$$ bits of $$h(x)$$ and check $$mask_k(h(x)) < p_1 / 2^k$$. ### Back from a break! So I finally managed to implement partitioning and it works great! It’s super fast when putting ~200k elements per part, which exactly fits in L1. This now also allows for parallelization, since each part can be processed independently, and since each part only needs little memory, they don’t interfere with each other at all! I now use FastReduce32H to determine the part and bucket. One issue though: Using FastReduce32L to determine the slot does not provide sufficient entropy, since it only gives $$\lg(N/P)$$ bits of entropy, whereas we need more like $$\lg N$$ bit of entropy. This is somewhat fixed by reverting to FastMod32L, but that still lacks entropy for $$N = 10^9$$, where we need the much slower FastMod64. More on this later. ### Speeding up the search for pilots Currently the loop that searches for pilots is as follows:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15  'p: for p in 0u64..kmax { let hp = self.hash_pilot(p); // True when the slot for hx is already taken. let check = |hx| unsafe { *taken.get_unchecked(self.position_in_part_hp(hx, hp)) }; for hx in bucket { if check(*hx) { continue 'p; } } if self.try_take_pilot(bucket, hp, taken) { return Some((p, hp)); } }  This is quite efficient, but has a problem: each iteration of the inner loop is quite unpredictable: looking up random bits of taken gives random bits true or false with some probability, so there will be lots of branch-mispredictions. Indeed, running under perf stat -d shows:  1 2 3 4 5   16,324.28 msec task-clock:u # 1.000 CPUs utilized 57,487,973,942 cycles:u # 3.522 GHz 87,101,496,658 instructions:u # 1.52 insn per cycle 10,513,721,875 branches:u # 644.054 M/sec 643,749,689 branch-misses:u # 6.12% of all branches  To reduce branches and increase parallelization, we can only check for success every $$4$$ lookups, which has much more predictable results (i.e. failing most of the time):   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26  'p: for p in 0u64..kmax { let hp = self.hash_pilot(p); // True when the slot for hx is already taken. let check = |hx| unsafe { *taken.get_unchecked(self.position_in_part_hp(hx, hp)) }; // Process chunks of 4 bucket elements at a time. let chunks = bucket.array_chunks::<4>(); for &hxs in chunks.clone() { // Check all 4 elements of the chunk without early break. if hxs.map(check).iter().any(|&bad| bad) { continue 'p; } } // Check remaining elements. let mut bad = false; for &hx in chunks.remainder() { bad |= check(hx); } if bad { continue 'p; } if self.try_take_pilot(bucket, hp, taken) { return Some((p, hp)); } }  The new perf stat -d:  1 2 3 4 5   14,310.45 msec task-clock:u # 0.999 CPUs utilized (was 16k) 50,239,412,832 cycles:u # 3.511 GHz (was 57G) 123,285,550,459 instructions:u # 2.45 insn per cycle (was 87G; 1.5) 10,003,769,536 branches:u # 699.053 M/sec (was 10.5G) 170,547,753 branch-misses:u # 1.70% of all branches (was 650M; 6.1%)  Note how this roughly $$10\%$$ faster (in clock and cycles). It does use $$50\%$$ more instructions, but is able to execute those much more efficiently, at $$2.5$$ instructions per cycle vs. $$1.5$$ before. This is probably because of the4$-fold reduction in branch-misses! The remaining branch misses seem to mostly come from the less-optimized case where a bucket requires displacing some other buckets. ### MultiplyReduce For $$n$$ close to $$10^9$$ or $$2^{32}$$, we really must use all $$64$$ bits of entropy. FastReduce only uses the high bits, and FastMod works but is quite slow (taking $$2$$ multiplications for the$32$bit variant and more for$64$bits). When the target modulus is a power of $$2$$, $$2^d$$, an alternative is what I’ll call MultiplyReduce: $h \mapsto \lfloor \frac {C\cdot h}{2^{64}}\rfloor \mm 2^d.$ This simply multiplies the hash by a random$64$bit constant and then takes the $$d$$ low bits of the high word of the result. We can use this for the reduction to slots in each part, so that all entropy is used. The one problem with this function (and also MulHash) is that$128$bit multiplications don’t work well with SIMD, but so far SIMD hasn’t payed off yet anyway. ### Linux hugepages? I may or may not have this already enabled on my laptop, but if not, if may make prefetches faster by having a smaller TLB cache. It seems that jemalloc has builtin support for optional hugepages so I may try that. See also this reddit post. After trying a bit more (tweet, example code), it seems that a simple vec![0u8; len] already uses transparent huge pages on my machine:  1  $ grep AnonHugePages /proc/meminfo 

goes up by around 300MB when filling the vector of pilots, and similarly

 1  grep thp_fault_alloc /proc/vmstat  increases by $$150$$ at the same time, counting the total number of allocated 2MB hugepages. Either way, here is the full allocation code I tried for completeness:  1 2 3 4 5 6 7 8 9  let len = 300_000_000; const HUGE_PAGE_SIZE_BYTES: usize = 2 * 1024 * 1024; unsafe { let ptr = std::alloc::alloc( std::alloc::Layout::from_size_align(len, HUGE_PAGE_SIZE_BYTES).unwrap(), ); libc::madvise(ptr as _, len, libc::MADV_HUGEPAGE); Vec::from_raw_parts(ptr, len, len) }  ### Dropping the bucket split? So far we have been mapping $$\beta = 0.6$$ of elements to $$\gamma = 0.3$$ of slots to create two classes of buckets: large and small ones. This helps the performance of the construction algorithm by putting more elements in large and easy-to-place buckets so that in the end we have more smaller buckets and especially more free slots remaining once we get to the last buckets of size $$2$$. In my current implementation it’s kind of annoying to find the bucket for a given hash: First FastReduce for the part: multiply $$h$$ by $$P$$, the number of parts. The P*h>>64 is the part, and we use P*h % (2^64) to compute the bucket, which uses the high bits of that result to determine whether it belongs in a small or large bucket and then sends it there via a linear transformation. If we drop these two classes things would be much simpler: the part is P*h>>64, and similarly the bucket is B*h>>64, where B is the total number of buckets over all parts. #### Build performance This is a bit more tricky: dropping the two classes roughly doubles the number of displacements and slows the construction by around $$10\%$$. Also, the minimal $$c$$ for which construction succeeds at all goes up slightly. #### An alternative Maybe we can come up with a ’local’ way to imbalance buckets, so that each bucket roughly maps to position h*B >> 64. Maybe we could or the last bit with another random bit, so that $$75\%$$ of elements goes to $$50\%$$ of buckets? #### Query performance Table 4: Query performance (ns) for $$n=10^9$$ for-loopprefetching 2 classes14.57.6 1 class10.37.5 As you can see, when we already do prefetching the added simplicity doesn’t affect performance much, but without it, the simpler and shorter code helps considerably. ### Query memory bandwidth Most likely, the prefetching variant is already saturating the memory bandwidth. Each cacheline is64$bytes, which comes down to $$64B / 7.5ns = 8.5GB/s$$. That’s a third of the $$25GB/s$$ peak bandwidth for my RAM, but I’m not sure whether the peak is achievable at all for random-access reads. I did find this quora answer that talks about Line-Fill Buffer. To quote: If you test performance from a single thread on a machine that’s mostly idle (ie, you’re not competing with other threads for access to the memory bus), then you’ll find that on many chips, you can perform roughly twice as many sequential reads per second as random reads. The explanation for this is that the CPU has clever circuitry that watches your memory access pattern and tries to predict which cache-lines your code is going to want next. Why does this prefetching help? Well, this gets a bit deep. My current understanding is that in the random access case, your parallelism is limited by that fact that each core can only have a limited number of L1 misses outstanding at one time. Intel calls this the “Line-Fill Buffer” and it has 10 slots on recent chip models, and there are similar mechanisms on AMD and Arm. DRAM physics dictate that reads have ~60ns of latency (and this barely changes as the tech evolves), which is ~400 clock cycles, so that would rate limit throughput of random parallel reads to at most 1 per ~40 clock cycles. Versus when the hardware prefetch mechanisms kick in, they don’t consume line-fill slots and can effectively perform more reads in parallel. Sadly, if you issue software prefetch instructions, many CPUs treat that as needing a line-fill entry, so it’s only reads triggered by hardware-prefetching that are able to achieve greater parallelism than the size of your line-fill buffer. So with this in mind, maybe the next question is whether we can get the last factor $$3$$ improvement from $$8GB/s$$ to $$25GB/s$$ by using multithreading, since each CPU has it’s own copy of these Line-Fill Buffers between L2 and L1. So I tried multithreading, but it seems not to be faster at all! It’s usually slightly slower. Weirdly though, the threads seem to do the work sequentially, with the$i$th of $$k$$ threads finishing roughly after $$i/k$$ of the total time. Some threads just seem to be busy waiting for memory that never comes until other threads are done. But then does that mean that my memory bandwidth is actually capped at $$8GB/s$$? That would be disappointing. Maybe another solution could be to store two copies of the pilot array, one on each of the two $$32GB$$ memory cards I have? I don’t really know whether the main memory or the processor is the bottleneck, so no idea if this would help. #### Some more experiments I wrote some experiments to measure memory bandwidth where I sum the first byte of each cacheline of a $$4GB$$ array. Results:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24  stride 1 strided 1.78s 22.513GB/s prefetch 1.77s 22.553GB/s stride 2 strided 2.73s 14.678GB/s prefetch 2.79s 14.357GB/s stride 4 strided 3.61s 11.079GB/s prefetch 3.48s 11.501GB/s stride 8 strided 3.85s 10.403GB/s prefetch 3.76s 10.649GB/s stride 16 strided 4.05s 9.876GB/s prefetch 4.00s 9.995GB/s stride 32 strided 4.53s 8.828GB/s prefetch 4.50s 8.893GB/s stride 64 strided 3.91s 10.228GB/s prefetch 3.89s 10.279GB/s stride 128 strided 4.01s 9.976GB/s prefetch 4.02s 9.957GB/s  so basically it caps out around $$9GB/s$$ even for very predictable access patterns. Prefetching also doesn’t help. So the $$8.5GB/s$$ we currently get with the queries is already super good and basically saturates the full memory bandwidth! #### Multithreading benchmark So can we get to the full $$25GB/s$$ using multithreading? It seems that the answer is yes! Running multiple threads in parallel each with a different (coprime) stride, I get the following throughput per thread and in total:  1 2 3 4 5 6  Threads: 1 3.85s 10.396GB/s Threads: 2 4.35s 18.393GB/s Threads: 3 5.03s 23.845GB/s Threads: 4 6.15s 26.035GB/s Threads: 5 7.52s 26.578GB/s Threads: 6 8.86s 27.074GB/s  As you can see, this converges to just above $$25GB/s$$. Probably it’s a bit higher because some reads are cached in L3, and some reads are shared between threads. If we give each thread it’s own copy of the $$4GB$$ array, this changes to:  1 2 3 4 5 6  Threads: 1 4.55s 8.800GB/s Threads: 2 5.37s 14.900GB/s Threads: 3 6.85s 17.512GB/s Threads: 4 8.51s 18.791GB/s Threads: 5 10.53s 18.985GB/s Threads: 6 12.15s 19.756GB/s  This is quite a bit lower! Changing to a $$7GB$$ array may be interesting, since $$6\cdot 7GB = 42GB$$ must be spread over two $$32GB$$ ram sticks. ($$8GB$$ crashed because I don’t have $$48GB$$ free currently.) This gives:  1 2 3 4 5 6  Threads: 1 7.76s 9.016GB/s Threads: 2 9.54s 14.679GB/s Threads: 3 12.20s 17.211GB/s Threads: 4 14.98s 18.686GB/s Threads: 5 18.15s 19.282GB/s Threads: 6 22.93s 18.313GB/s  I.e., very similar to results above. So probably we won’t be able to gain speedup by distributing reads over the two ram sticks. #### Multithreading queries: satisfaction at last I had a silly bug at first, but with that fixed, here’s the inverse throughput when using up to $$6$$ threads in parallel, each prefetching $$64$$ iterations ahead: threadsns/querymemory throughput (GB/s) 17.598.4 24.1715.3 33.0920.7 42.8122.8 52.6923.8 62.7623.2 Satisfaction, at last! $$2.69ns/query$$ means a memory throughput of $$64B/2.69ns = 23.79GB/s$$, quite close to the full memory bandwidth. Also, each thread is also reading the keys to be queried from main memory. Those are efficiently packed so each $$8B$$ key is only$1/8$th of a $$64B$$ cacheline and hence only requires$1/8$th of a read. But adding those in we get $$(1+1/8) \cdot 23.79 GB/s = 26.76GB/s$$, quite a bit above the full memory bandwidth. I suppose this is because some queries are already in L3 cache, sparing the main memory bus. Let’s see more precisely: The memory used for all pilots is $$2.43bit/elem\cdot 10^9 elem = 2.43\cdot 10^9 bit = 304MB$$. My L3 cache is $$12MB$$, so each query hits the cache with probability $$12MB/304MB=4\%$$, getting to $$\frac 18 ⋅ \frac{64B}{2.69ns} • (1-0.04) ⋅\frac{64B}{2.69ns} = 25.8GB/s$$ This is still slightly above the $$25GB/s$$ max throughput my ram should have, but I’ll call it done here. We’re within a few percent of the expected maximum performance, which is as good as we can get. Maybe the last bit is L1 and L2 caches containing a few more elements than L3? ### Packing difference from expected position Since the values we pack are a random subset of $$[0, n]$$, it may be possible to encode them efficiently by only storing the deviation from their expected value. I.e. we store $$r=(1/\alpha - 1)n$$ values where position $$i$$ stores the difference with $$i/r \cdot n$$. Result: for $$n=10^8$$ with $$\alpha = 0.99$$, the largest difference between actual and expected position is around $$500\ 000$$. I.e., this would still need at least $$20$$ compared to the $$20$$ we started with. #### Local packing ideas Instead we could store a fixed offset $$c$$ every $$k$$ positions, and approximate the following $$k$$ values as $$c + i/r\cdot n$$, where $$n/r$$ is the expected difference between adjacent remap targets (i.e. the distance between adjacent empty slots in the taken array). For example, we could store a u32 offset every $$30$$ positions, one per cacheline, totalling to $$16 + 32/30 \sim 17 bits$$ per remapped position. Or maybe we could pack slightly more efficient and only use somewhere between $$10$$ to $$15$$ bits for the variance from the expected value. Some maths will be needed to figure out a safe upper bound on the max deviation. Some possibilities: 1. Absolute: Store $$32bit$$ offset, and $$16bit$$ deltas to it. 2. Linear: Store $$32bit$$ offset, and $$16bit$$ deltas to $$offset + i\cdot n/r$$. 3. Custom linear: Store $$32bit$$ offset, $$32bit$$ (or less) difference $$d$$, and $$16bit$$ deltas to $$offset + i \cdot d$$. 4. Tiny EF: Store $$32bit$$ offset, $$128$$ bits indicating where $$256bit$$ boundaries are crossed similar to EF encoding, and $$44$$ $$1byte$$ values. This last idea needs $$64/44\cdot 8bit=11.6bits$$ per value. That’s slightly more than using EF encoding directly on the full vector (which uses around $$9bits$$), but since this only uses a single cache line lookups are much faster, and it uses $$3\times$$ less bits that the direct Vec<u32>. #### Query times for different remapping structures Table 5: Query times in nanoseconds without and with remapping, for different remap structures. $$n=10^9$$, $$c=9$$, $$\alpha\in\{0.99, 0.98\}$$. The bottom row indicates bits/element needed for pilots under no remap and additional memory for remapping datastructures. threadsno remapVec<u32>TinyEFEliasFanoVec<u32>TinyEFEliasFano $$\alpha$$$$0.99$$$$0.99$$$$0.99$$$$0.98$$$$0.98$$$$0.98$$ 17.47.97.910.18.18.211.6 24.14.34.35.14.84.56.0 33.03.23.23.63.53.34.3 42.82.92.83.23.02.93.6 52.72.72.83.02.82.73.2 62.72.72.73.03.02.83.1 bits/elem2.43+0.33+0.12+0.09+0.66+0.24+0.17 As can be seen, TinyEF uses only about a third of the memory of Vec<u32>, while being at least as fast (and faster for $$\alpha=0.98$$). EliasFano uses another $$25\%$$ less memory, but significantly slows down queries due to its non-locality. The code for TinyEF is here. ### Sharding When the number of keys is very large, they may not all fit in memory at the same time. To solve this, we can shard the keys and process one shard at a time. In particular, the $$P$$ parts are split into $$S$$ shards, where part $$0$$ to $$P/S$$ go in the first shard, and so on. We make sure that $$P$$ is a multiple of $$S$$. Then we process one shard at a time: we hash the keys, sort them into their parts, and then compute the pilots for each part independently. I implemented two ways to shard keys: 1. On-disk sharding: create a file for each shard, iterate over the keys, and append the hash of each key to file for the shard it belongs to. 2. In-memory sharding: iterate over the keys multiple times, and each time only return those which belong to the current shard. ### 128bit hashing In order to scale more than $$2^{32}$$ keys, we need to use$128$bit hash functions (github issue). There are many hash functions out there, all with different speed vs. randomness tradeoffs. I implemented many of them. Although I have no idea of the quality of the implementations I used, these are some quick results for hashing u64 keys: Table 6: Query throughput (ns) for different hash functions on $$10^9$$ u64 keys for sequential querying and streaming with$64$-ahead prefetching. HasherBitsseq.prefetch fxhash6414.27.48 murmur26415.87.70 murmur312816.59.40 xx6439.014.04 xx12832.313.94 city6428.012.19 city12845.120.09 metro6429.913.21 metro12840.316.71 highway64107.657.20 highway128114.547.84 spooky64115.763.43 spooky128115.567.06 From this we conclude that FxHash, Murmur2, and Murmur3 are good fast hash functions where throughput is close to the memory bandwidth, while most other hash functions are quite a bit slower and lower the throughput. I also added support for hashing of &[u8] strings. Results are here: Table 7: Query throughput (ns) for different hash functions on $$4\cdot 10^6$$ URLs (avg. $$71$$ bytes long) for sequential querying and streaming with$64$-ahead prefetching. HasherBitsseq.prefetch fxhash6431.721.26 murmur26411.56.23 murmur312824.715.35 wy6419.511.58 xx6423.214.26 xx12821.012.90 city6418.010.97 city12827.517.90 metro6422.215.36 metro12824.317.71 highway6461.136.23 highway12867.042.51 spooky6474.162.48 spooky12874.162.87 Observations: • fxhash is slow for strings. • murmur2 is super fast! • city and wy are great for =64=bit hashes. • xx is best for =128=bit hashes. ### Varying the partition size Let’s construct for $$n=10^9$$ keys, $$c=8$$, $$\alpha=0.98$$, with different partition sizes. $$\max \alpha$$ is the maximum load factor over all parts. slots per partnum partsdisplace timetotal time$$\max\alpha$$ $$2^{15}$$$31$k--$$>1$$ $$2^{16}$$$16$k$$20.7$$$$27.0$$$$99.49$$ $$2^{17}$$$7.8$k$$22.8$$$$29.0$$$$99.18$$ $$2^{18}$$$3.9\$k$$22.7$$$$29.0$$$$98.59$$
$$2^{19}$$$$1947$$$$25.8$$$$31.6$$$$98.40$$
$$2^{20}$$$$974$$$$30.3$$$$36.1$$$$98.21$$
$$2^{21}$$$$487$$$$42.3$$$$48.0$$$$98.14$$
$$2^{22}$$$$244$$$$41.2$$$$48.1$$$$97.83$$
$$2^{23}$$$$122$$$$40.9$$$$47.0$$$$97.81$$
$$2^{24}$$$$61$$$$72.5$$$$78.3$$$$97.77$$
$$2^{25}$$$$31$$$$88.8$$$$94.9$$$$96.17$$

Observe that:

• Using too small partitions causes too many parts, with too much load-unbalancing, consistently giving $$\alpha > 1$$.
• $$\max \alpha$$ decreases as the parts get larger, reducing the worst-case outliers. In particular the delta with $$98\%$$ roughly halves each step.
• When there are very few parts, $$\max \alpha$$ goes below $$98\%$$ due to rounding, because each part is fixed to have a power-of-two size.
• Smaller parts are faster to build, due to data fitting in lower caches.
• $$2^{17}$$ slots per part was my default, but $$2^{19}$$ is not much slower and more reliable.
• From $$2^{21}$$, the displacement time quickly grows. Storing $$2^{21}$$ taken bits takes $$256kb$$, which is exactly the L2 cache size. But the displacement algorithm also needs to store for each slot which bucket is occupying it, taking $$32bit/slot$$. This does not fit in L3 and goes to main memory instead.
• At $$2^{24}$$ slots, even the taken arrays do not fit in L3 cache anymore and things slow down even more. (In fact I’m surprised it’s only a $$2\times$$ slowdown. The latency is hidden pretty well.)

One way in which PTRHash differs from PTHash is that apart from just the taken bitvector, it needs to store an additional $$32bit$$ index for each slot to be able to displace elements, which requires a lot more memory. If we could prevent this, memory usage would go down and it may be possible to use larger parts without slowing down the construction time so much.

## References

Fan, Bin, Dave G. Andersen, Michael Kaminsky, and Michael D. Mitzenmacher. 2014. “Cuckoo Filter.” Proceedings of the 10th Acm International on Conference on Emerging Networking Experiments and Technologies, December. https://doi.org/10.1145/2674005.2674994.
Fox, Edward A., Qi Fan Chen, and Lenwood S. Heath. 1992. “A Faster Algorithm for Constructing Minimal Perfect Hash Functions.” Proceedings of the 15th Annual International Acm Sigir Conference on Research and Development in Information Retrieval - Sigir ’92. https://doi.org/10.1145/133160.133209.
Graf, Thomas Mueller, and Daniel Lemire. 2020. “Xor Filters.” Acm Journal of Experimental Algorithmics 25 (March): 1–16. https://doi.org/10.1145/3376122.
Lemire, Daniel, Owen Kaser, and Nathan Kurz. 2019. “Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries.” Software: Practice and Experience 49 (6): 953–70. https://doi.org/10.1002/spe.2689.
Limasset, Antoine, Guillaume Rizk, Rayan Chikhi, and Pierre Peterlongo. 2017. “Fast and Scalable Minimal Perfect Hashing for Massive Key Sets.” arXiv. https://doi.org/10.48550/ARXIV.1702.03154.
Pibiri, Giulio Ermanno, and Roberto Trani. 2021a. “Parallel and External-Memory Construction of Minimal Perfect Hash Functions with Pthash.” arXiv. https://doi.org/10.48550/ARXIV.2106.02350.
———. 2021b. “Pthash: Revisiting Fch Minimal Perfect Hashing.” Proceedings of the 44th International Acm Sigir Conference on Research and Development in Information Retrieval, July. https://doi.org/10.1145/3404835.3462849.