- Questions and remarks on PTHash paper
- Ideas for improvement
- Implementation log
- Hashing function
- Bitpacking crates
- Construction
- Fastmod
- TODO Try out
`fastdivide`

and`reciprocal`

crates - First benchmark
- Faster bucket computation
- Branchless, for real now! (aka the trick-of-thirds)
- Compiling and benchmarking PTHash
- Compact encoding
- Find the \(x\) differences
`FastReduce`

revisited- TODO Is there a problem if \(\gcd(m, n)\) is large?
- Faster hashing
- An experiment
- Compiler struggles
- Prefetching, at last
- TODO Prefetching with vectorization
- Inverting \(h(k_i)\)
- Another day of progress
- TODO Possible sorting algorithms
- Diving into the inverse hash problem
- Bringing it home
- Hash-inversion for faster PTHash construction
- Fast path for small buckets
- TODO Dictionary encoding
- TODO Larger buckets
- TODO Prefetching free slots
- Filling the last few empty slots needs very high \(k_i\)!
- Perfect matching for the tail
- Peeling for size-1 buckets
- Greedy peeling 1: Assigning from hard to easy
- TODO Peeling and cuckoo hashing for larger buckets.
- Sunday morning ideas
- Cuckoo hashing / displacing,
*for real now* - Displacing globally
- Cleanup and revisiting defaults
- TODO Sum instead of xor?
- Revisiting \(\alpha < 1\)
- Elias-Fano for the remap-dictionary
- Global iterative prioritizing
- Cleanup: removing peeling and suboptimal displacing code
- Some speedups to the displacement algorithm
- TODO Runtime analysis of displacement algorithm
- TODO Optimal prefetching strategy
- Are we close to the memory bandwidth?
- More sorting algorithm resources
- Partitioning to reduce memory latency
- Back from a break!
- Speeding up the search for pilots
`MultiplyReduce`

- Linux hugepages?
- Dropping the bucket split?
- Query memory bandwidth
- Packing difference from expected position
- Sharding
- 128bit hashing
- Varying the partition size

\[ %\newcommand{\mm}{\,\%\,} \newcommand{\mm}{\bmod} \newcommand{\lxor}{\oplus} \newcommand{\K}{\mathcal K} \]

**NOTE:** WIP paper is here.

Daniel got me excited about minimal perfect hashing functions (MPHFs), and then
~~twitter~~ Rob asked for a rust implementation of PTHash (Pibiri and Trani 2021b, 2021a), and also
I have some ideas related to memory prefetching I want to play with, so here we
are: I’m working on a Rust implementation `ptr-hash`

at https://github.com/ragnargrootkoerkamp/ptrhash.

This post is a collection of thoughts, ideas, implementation nodes, and experiments that came up while writing the code.

Twitter discussion is here.

(See also my note on BBHash (Limasset et al. 2017).)

## Questions and remarks on PTHash paper

Probably some of these questions are answered by reading into the original FCH paper (Fox, Chen, and Heath 1992) – that’ll come at some later time.

FCH uses \(cm/\log_2 n\) buckets because each bucket needs \(\log_2 n\) bits storage, but for PTHash the bits per bucket is lower so this isn’t really the right formula maybe.

- Tweet: another option is to fix \(c = \alpha * \log(n) / d\) for some constant \(d\).

FCH uses \(\lceil cn / (\log_2n+1)\rceil\) buckets, but PTHash uses \(\lceil cn/\log_2 n\rceil\) buckets. To me, \(\lceil cn/\lceil \log_2n\rceil\rceil\) actually sounds most intuitive.

\(p_1=0.6n\) and \(p_2=0.3m\) seem somewhat arbitrary. Can this be tuned better? Could partitioning into more chunks help?

Anyway, this puts \(60\%\) of data in the first \(30\%\) of buckets (i.e. double the average size), and \(40\%\) of data in \(70\%\) of buckets (i.e. roughly half the average size).

I suppose the goal is to stitch two binomial distributions of bucket sizes together in a nice way to get a more even distribution over sizes.

How many bits of entropy are really needed? When \(n\) is a power of \(2\), we make sure to use all \(64\) bits instead of just \(\log_2 n\). But how many are really needed. Maybe as long as \(n \ll 2^{32}\), a simpler/faster \(32\) bit hash is sufficient?

And how large will \(n\) be at most? Can we reasonably assume \(2^{32} \approx 4\cdot 10^9\) as a bound?

Is it reasonable to compute and store hashes of all keys, and then changing the API to let the user pass in the hash instead of the key?

What’s the probability of failure with the first global seed \(s\) that’s tried?

What are the largest \(k_i\) found in typical cases?

What if \(\gcd(m, n)\) is large?

## Ideas for improvement

### Parameters

- Better tuning of parameters may work?
- Partitioning into three or more chunks may speed up construction?

### Align packed vectors to cachelines

Instead of packing \(r\) bit values throughout the entire vector, only pack them inside cachelines.

### Prefetching

The query performance of this and other MPHFs seem to be limited by the latency of memory lookups. For this reason, most MPHFs minimize the number of memory lookups, and in particular PTHash uses one random memory access followed by one lookup in a small (cached) dictionary.

When processing a number of lookups sequentially, each lookup currently incurs a cache miss. This could be made much more efficient by doing \(k\) (e.g. \(k=16\)) cache lookups in parallel:

- first compute the array location to lookup for \(k\) keys (possibly using SIMD),
- then prefetch the \(k\) memory locations,
- then do the rest of the computation in parallel.
- If needed, do lookups in the \(free\) table in parallel.

This should mostly hide the memory latency and could give significant speedup. I’m not quite sure yet whether this would work best when done in batches (as described), or in streaming fashion, where we iterate over elements and prefetch memory for the element \(k\) iterations ahead. Algorithmica.org has a nice article on the streaming approach.

### Faster modulo operations

There are a lot of `% n`

, `% p1`

and `% (m-p1)`

operations throughout the code.
I didn’t look into it yet, but possibly these are the bottleneck on the CPU
latency.

First note that
\[
a\, \%\, b = a - \lfloor a/b\rfloor * b.
\]
This division by a constant can be computed efficiently using a trick which
replaces division by multiplication with the inversion.
Using the formula of the wikipedia article we can precompute some constants to
evaluate \(\lfloor a/b\rfloor\) in \(6\) operations and `a % b`

in \(8\) operations.

~~(Note that it might be possible compilers already do this, but I don’t expect so.)~~

Some blogposts by Daniel Lemire (Thanks Daniel Liu ;)

A fast alternative to the modulo reduction

Instead of

`a % b`

, compute`a * b >> 64`

, assuming that \(a\) is uniform in \([2^{64}-1]\).This doesn’t seem to work well in practice though for PTHash, probably since this only uses the entropy in the high-order bits of \(a\).

Faster remainders when the divisor is a constant: beating compilers and libdivide

**Indeed, the C++ PTHash implementation already uses the**`fastmod`

library.More fun with fast remainders when the divisor is a constant

Lemire, Kaser, and Kurz (2019)

### Store dictionary \(D\) sorted using Elias-Fano coding

I doubt whether the memory savings here are worth the time overhead, but it’s an idea :shrug:.

### How many bits of \(n\) and hash entropy do we need?

One bottleneck of `fastmod64`

is that it needs to compute the
highest \(64\) bits of a `u128 * u64`

product.
If we can assume that \(n\) is at most \(40\) bits, and that \(44\) bits are
sufficient entropy, then I think we could do away with the `u128 x u64`

multiplication and do everything inside `u128`

.

### Ideas for faster construction

Once the table is almost filled, determining the \(k_i\) becomes slower. Some ideas to speed this up:

- reversing
`murmurhash`

- Instead of finding a \(k_i\) such that \(position(x) :=
(h(x) \lxor
h(k_i))\mm n\) is not taken becomes slow. Instead, we could keep a list of
empty positions \(p\) and determine \(k_i\) as \(h^{inv}((p + j \cdot n) \lxor
h(x))\) for different \(j\), assuming we can invert
`murmurhash`

. As it turns out,`murmurhash2`

is an invertible function with \(O(1)\) inverse! See this blogpost. Thus, we can easily find many \(k_i\) such that \(position(x, k_i)\) maps to an empty position.- All buckets of size \(1\) can be directly assigned to slots this way.
- Size \(2\) and \(3\) buckets also need fewer tries than before.

The big drawback though is that the \(k_i\) values found will be uniform in \([0, 2^{64}]\).

- Cuckoo-hashing
- For sets of size \(1\) and \(2\) (and maybe \(3\)?) we could
*displace*an already taken slot if that is the last remaining slot needed to fix the current bucket. Probably we want to only displace buckets of the same size and never buckets of larger size.I wonder though how useful this actually is. If the current bucket is hard to place, there is not really any reason a different bucket of the same size would be easier to fix.

## Implementation log

A somewhat chronological list of notes and remarks.

### Hashing function

For now I use `murmur64a`

, documented on the `SMHasher`

GitHub wiki.

### Bitpacking crates

There are **a lot** of bitvector and bitpacking crates!

- bitvectors
- All of the below seem to do the same
`bitvec`

: \(30M\) downloads`bit-vec`

: \(30M\) downloads`fixedbitset`

: \(55M\) downloads

No idea which is best; probably I’ll settle for the one below in

`sucds`

. - sucds
- only \(60K\) downloads, but contains
- BitVector
- fixed-width integer packing: CompactVector
- Decoding seems somewhat inefficient

- increasing-integer sequence packing: EliasFano
- Giulio has lecture notes on this.

succinct - IntVector - Can not be constructed from slice/iterator of values. - Decoding seems somewhat inefficient - No updates in the past 2 years.

### Construction

- Storing buckets as
`Vec<Vec<Key>>`

is bad for large keys, so now I store`Vec<Vec<usize>>`

, but the nested ~Vec~s still waste a lot of space and will cause allocation slowdowns. PTHash pushes onto a vector which is sorted later, which seems more efficient. - When testing \(k_i\), not only do we need to test that positions are not filled
by previous buckets, but also we have to check that elements within the bucket
do not collide.
**It is not sufficient that \(h(x, s)\) does not collide within buckets,**since they could collide after taking the`% n`

.

### Fastmod

It seems that Daniel Lemire’s `fastmod`

C++ library has not yet been ported to
Rust, so I converted the few parts I need.

There is also `strength_reduce`

, which contains a similar but distinct algorithm
for `a % b`

that computes the remainder from the quotient.

### TODO Try out `fastdivide`

and `reciprocal`

crates

### First benchmark

I implemented these under a generic `Reduce`

trait.

`just bench`

at the linked commit at `2.6GHz`

gives the following for \(10^7\) keys:

method | construction (s) | query (ns) |
---|---|---|

u64 | 10.474591 | 91 |

fastmod64 | 10.731583 | 55 |

fastmod32 | 9.911751 | 50 |

strengthreduce64 | 11.520939 | 56 |

strengthreduce32 | 10.002017 | 50 |

The `u32`

versions simply only use the lower \(32\) bits of the \(64\) bit hash.

This is not yet as fast as the fastest `28ns`

reported in the PTHash paper (for
C-C encoding), but I also haven’t optimized anything else yet. Time for profiling.

**Profiling:** Looking at the flamegraph (`cargo flamegraph`

), and zooming in on the hash function, we see

A lot of time is spend on fold! The `murmur2`

function I use has signature
`murmur2(bytes: &[u8], seed: u64)`

, and even though my keys/bytes always correspond
to just a `u64`

, it’s iterating over them!

In the generated `perf report`

, we see

```
33.14% 27328 test::queries_e pthash_rs-f15b4648f77f672b [.] pthash_rs::PTHash<P,R>::new
18.18% 14823 test::queries_e pthash_rs-f15b4648f77f672b [.] pthash_rs::PTHash<P,R>::index
13.76% 11245 test::queries_e pthash_rs-f15b4648f77f672b [.] murmur2::murmur64ane
```

We can ignore the \(33\%\) for construction and only focus on querying here, where
we see that the `index`

function calls to `murmur2`

and a lot of time is spent
in both. In fact, `murmur2`

is not inlined at all! That explains the iterator
appearing in the flamegraph.

**Thin-LTO:** This is fixed by enabling link-time optimization: add `lto = "thin"`

to
`cargo.toml`

.

Rerunning the benchmark we get

construction (s) | construction (s) | query (ns) | query (ns) | |
---|---|---|---|---|

method | no LTO | thin-LTO | no LTO | thin-LTO |

u64 | 10.5 | 8.9 | 91 | 60 |

fastmod64 | 10.7 | 8.3 | 55 | 34 |

fastmod32 | 9.9 | 8.5 | 50 | 26 |

strengthreduce64 | 11.5 | 8.3 | 56 | 38 |

strengthreduce32 | 10.0 | 8.8 | 50 | 31 |

Sweet! `26ns`

is faster than any of the numbers in table 5 of Pibiri and Trani (2021b)!
(Admittedly, there is no compression yet and the dictionary size is \(10\times\)
smaller, but still!)

**More inlining:**
Actually, we don’t even want the `index()`

function call to show up in our logs:
inlining it should give better instruction pipelining in the benchmarking hot-loop

```
for key in &keys {
mphf.index(key);
}
```

and indeed, we now get

query (ns) | no LTO | thin-LTO | inline index() |
---|---|---|---|

u64 | 91 | 60 | 55 |

fastmod64 | 55 | 34 | 33 |

fastmod32 | 50 | 26 | 24 |

strengthreduce64 | 56 | 38 | 33 |

strengthreduce32 | 50 | 31 | 26 |

**Conclusion:** From now on let’s only use `fastmod64`

and `fastmod32`

. (I suspect
the `32bit`

variant does not have sufficient entropy for large key sets, so we
keep the original `64bit`

variant as well.)

### Faster bucket computation

After inlining everything, the generated assembly for our test is just one big
\(\sim 100\) line assembly function. Currently, the `bucket(hx)`

function (that
computes the bucket for the given hash `hx = hash(x, s)`

) looks like

```
fn bucket(&self, hx: u64) -> u64 {
if (hx % self.rem_n) < self.p1 { // Compare
hx % self.rem_p2
} else {
self.p2 + hx % self.rem_mp2
}
}
```

The assembly looks like this:

```
5 │ ┌──cmp %rdi,0xc0(%rsp) # compare
1 │ ├──jbe 370
102 │ │ mov 0x98(%rsp),%rdx # first branch: fastmod
41 │ │ mulx %r9,%rdx,%rdi
4 │ │ imul 0xa0(%rsp),%r9
85 │ │ mulx %r10,%r13,%r13
12 │ │ add %rdi,%r9
7 │ │ mov %r9,%rdx
72 │ │ mulx %r10,%rdx,%rdi
17 │ │ add %r13,%rdx
3 │ │ adc $0x0,%rdi # add 0
│ │ cmp %rdi,0x58(%rsp) # index-out-of-bound check for k array
56 │ │↓ ja 3ac # ok: continue below at line 3ac:
│ │↓ jmp 528 # panic!
│ │ cs nopw 0x0(%rax,%rax,1)
128 │370:└─→mov 0xa8(%rsp),%rdx # second branch: fastmod
41 │ mulx %r9,%rdx,%rdi
│ imul 0xb0(%rsp),%r9
66 │ mulx %rcx,%r13,%r13
12 │ add %rdi,%r9
│ mov %r9,%rdx
58 │ mulx %rcx,%rdx,%rdi
14 │ add %r13,%rdx
│ adc 0xb8(%rsp),%rdi # add p2
54 │ cmp %rdi,0x58(%rsp) # out-of-bound check for k array
1 │ ↓ jbe 528 # panic!
8100 │3ac: mov (%r11,%rdi,8),%rdx # Do array index.
```

We see that there are quite some branches:

- The first and second branch of the
`bucket()`

function are both fully written out. - They use the same number of instructions.
- One branch does
`add 0`

, I suppose because the CPU likes equal-sized branches. - There are redundant index-out-of-bounds checks.
- The last line, the array index itself, has \(8000\) samples: \(57\%\) of the total
samples is
**this single assembly instruction**!

**Branchless bucket index:**
I tried rewriting the `bucket()`

function into a branchless form as follows:

```
fn bucket(&self, hx: u64) -> u64 {
let is_large = (hx % self.rem_n) >= self.p1;
let rem = if is_large { self.rem_mp2 } else { self.rem_p2 };
is_large as u64 * self.p2 + hx % rem
}
```

but this turns out to be **slower** than the original, probably because the new
assembly now needs a lot of `cmov`

instructions. (In particular, `rem`

contains
a `u128`

and a `u64`

, so needs \(3\) `mov`

’s and \(3\) `cmov`

’s.)

```
│ cmp %rdi,0xd0(%rsp) # comparison
112 │ mov 0xb8(%rsp),%rdi # load rem_p2
29 │ cmova 0xc8(%rsp),%rdi # conditionally overwrite with rem_mp2
│ mov 0xb0(%rsp),%rdx
137 │ cmova 0xc0(%rsp),%rdx
│ mov 0xa0(%rsp),%r14
137 │ cmova 0xa8(%rsp),%r14
26 │ mov %r9,%r10
│ mov $0x0,%r11d # set offset to 0
90 │ cmova %r11,%r10 # conditionally overwrite offset
38 │ imul %r8,%rdi # start computation
2 │ mulx %r8,%rdx,%r8
122 │ add %r8,%rdi
48 │ mulx %r14,%r8,%r8
│ mov %rdi,%rdx
163 │ mulx %r14,%rdx,%rdi
│ add %r8,%rdx
│ adc %r10,%rdi
184 │ cmp %rdi,0x60(%rsp) # index-out-of-bounds check
│ ↓ jbe 52f # panic
38 │ mov 0x98(%rsp),%rdx
10798 │ mov (%rdx,%rdi,8),%rdx # Do array index.
```

**No bounds check:**
We can replace `k[index]`

by `unsafe { *k.get_unchecked(index) }`

.
This doesn’t give much performance gain (less than the few `ns`

of measurement
noise I have), but can’t hurt. It removes the final `cmp; jbe`

lines from the assembly.

**Fix tests:** Instead of ignoring test results we can accumulate the resulting
indices and pass them to `black_box(sum)`

. This prevents the compiler from
optimizing away all queries. *Somehow* this affects the reported timings. I now get:

query (ns) | no LTO | thin-LTO | inline index() | fixed tests |
---|---|---|---|---|

u64 | 91 | 60 | 55 | 63 |

fastmod64 | 55 | 34 | 33 | 35 |

fastmod32 | 50 | 26 | 24 | 20 |

strengthreduce64 | 56 | 38 | 33 | 38 |

strengthreduce32 | 50 | 31 | 26 | 30 |

I’m confused how the `fastmod32`

timing went down, but the `fastmod64`

went up.
(Typical situation when you do constant profiling and there are more numbers
than you can make sense of, sadly.)

### Branchless, for real now! (aka the trick-of-thirds)

I’m still annoyed by this branching. Branches are bad! They may be fast for now,
but I kinda have the long term goal to put SIMD on top of this and that doesn’t
go well with branching. Also, branch-misprediction is a thing, and the \(70\% -
30\%\) uniform random split is about as bad as you can do to a branch predictor.
The code from earlier does fix it, but at the cost of a whole bunch of `mov`

’s and
`cmov`

’s.

But there is a trick we can do! \(p_1\) and \(p_2\) are sort of arbitrarily chosen, and all the original paper (Fox, Chen, and Heath 1992) has to say about it is

Good values for these two parameters are experimentally determined to be around \(0.6n\) and \(0.3m\), respectively.

Thus I feel at liberty to change the value of \(p_2\) from \(0.3m\) to \(m/3\). This gives: \[m-p_2 = m-m/3 = \frac 23 m = 2p_2.\] The cute tick is that now we can use that \[x \mm p_2 = (x \mm (2p_2)) \mm p_2 = (x \mm (m - p_2)) \mm p_2,\] and since $ 0≤ x \mm (2p_2) < 2p_2$, computing that value modulo \(p_2\) is as simple as comparing the value to \(p_2\) and subtracting \(p_2\) if needed.

Thus, we modify the initialization to round \(m\) up to the next multiple of \(3\), and change the bucket function to

```
fn bucket(&self, hx: u64) -> u64 {
let mod_mp2 = hx % self.rem_mp2;
let mod_p2 = mod_mp2 - self.p2 * (mod_mp2 >= self.p2) as u64;
let large = (hx % self.rem_n) >= self.p1;
self.p2 * large as u64 + if large { mod_mp2 } else { mod_p2 }
}
```

The new timings are

query (ns) | no LTO | thin-LTO | inline index() | fixed tests | \(p_2 = m/3\) |
---|---|---|---|---|---|

u64 | 91 | 60 | 55 | 63 | 54 |

fastmod64 | 55 | 34 | 33 | 35 | 27 |

fastmod32 | 50 | 26 | 24 | 20 | 19 |

strengthreduce64 | 56 | 38 | 33 | 38 | 33 |

strengthreduce32 | 50 | 31 | 26 | 30 | 21 |

`fastmod32`

didn’t get much faster, but all others went down a lot! Let’s check
out the generated assembly for `fastmod64`

:

```
10 │ vpunpcklqdq %xmm3,%xmm5,%xmm3
3 │ vmovq %r15,%xmm12
1 │ vpunpcklqdq %xmm6,%xmm7,%xmm5
33 │ vinserti128 $0x1,%xmm3,%ymm5,%ymm5
14 │ vpunpcklqdq %xmm8,%xmm14,%xmm3
2 │ vpunpcklqdq %xmm15,%xmm12,%xmm6
2 │ vinserti128 $0x1,%xmm3,%ymm6,%ymm3
32 │ vmovdqu 0x570(%rsp),%ymm14
15 │ vpxor %ymm3,%ymm14,%ymm3
2 │ vpxor 0x610(%rsp),%ymm14,%ymm12
2 │ vpxor %ymm5,%ymm14,%ymm6
29 │ vmovdqu 0x630(%rsp),%ymm11
13 │ vpxor %ymm14,%ymm11,%ymm15
2 │ vpcmpgtq %ymm6,%ymm15,%ymm6
2 │ vpcmpgtq %ymm3,%ymm12,%ymm3
39 │ vpandn %ymm11,%ymm6,%ymm6
13 │ vpandn %ymm11,%ymm3,%ymm7
4 │ vpand %ymm6,%ymm3,%ymm3
2 │ vpaddq %ymm5,%ymm7,%ymm5
31 │ vpcmpeqd %ymm7,%ymm7,%ymm7
8 │ vpsubq %ymm3,%ymm5,%ymm3
│ vpxor %xmm6,%xmm6,%xmm6
2 │ mov 0x1c0(%rsp),%r14
4264 │ vpgatherqq %ymm7,(%r14,%ymm3,8),%ymm6 # Do array index
```

Huh what?! I don’t really what is going on here, but I do know that the compiler
just vectorized our code for us! All the `vp`

instructions are vector/packed
instructions! Magic! This probably explains the big speedup we get for `fastmod64`

.

**Closer inspection:** As it turns out, the `32bit`

versions were already
auto-vectorized before we implemented this last optimization. Probably because
the `FastMod32`

type is smaller (two `u64`

) than the `Fastmod64`

type (`u128`

and `u64`

) and hence easier to vectorize (and similar for `StrengthReduce32`

).
But either way this last trick helps a lot for the `64bit`

variants that will
be needed for large hashmaps.

### Compiling and benchmarking PTHash

Compiling PTHash was very smooth; just a `git clone`

, submodule init, and
building *just worked* :)

Running a benchmark similar to the ones here:

```
./build -n 10000000 -c 7.0 -a 1 -e compact_compact -s 1234567890 --minimal --verbose --lookup
```

reports a query performance of `26ns/key`

, similar to the `fastmod64`

performance I get.

Note that PTHash uses fixed-width bitpacking here, while I just store `u64`

’s
directly, but this shouldn’t affect the time too much.

**Vectorization:** More interestingly, PTHash is not auto-vectorized by my
compiler, so I’m surprised it performs this well. Maybe the `vpgatherqq`

instruction just doesn’t give that much speedup over sequential lookups – I
don’t know yet. But still, my equivalent code using `fastmod64`

with \(p_2 =
0.3m\) has `35ns/key`

vs `26ns/key`

for PTHash. Confusing.

**Branching:** PTHash compiles to a branchy version of `fastmod(x, p2) or fastmod(x, m-p2)`

, but is still fast.

### Compact encoding

Adding fixed-width encoding was easy using the `sucds`

`CompactVector`

type.
The generated code doesn’t look so pretty though – it branches on whether the
bits cross a `usize`

boundary, whereas PTHash’s implementation does an unaligned
read from a `*u8`

to avoid this, which seems nicer.

### Find the \(x\) differences

At this point, both `pthash-rs`

and the original `PTHash`

support encoding by a
single compacted vector, but there is still quite some time difference: `31ns`

vs `25ns`

. Time to find all the differences.

This may or may not be the best approach, but I decided to put the assemblies side-by-side.

**Exhibit A: the missing modulo** Ok, I won’t bore you with the full assembly, but I found this in
the PTHash assembly:

```
movabs $0x999999ffffffffff,%rbx
```

with nothing similar in the rust version. Turns out that this is \(0.6 \cdot (2^{64}-1)\). Indeed, the code is:

```
inline uint64_t bucket(uint64_t hash) const {
static const uint64_t T = constants::a * UINT64_MAX;
return (hash < T) ? fastmod::fastmod_u64(hash, m_M_num_dense_buckets, m_num_dense_buckets)
: m_num_dense_buckets + fastmod::fastmod_u64(hash, m_M_num_sparse_buckets,
m_num_sparse_buckets);
}
```

note how it does \(hash < 0.6 2^{64}\) instead of \(hash \mm n < 0.6 n\) as written in the paper for what FCH does. Basically we can completely drop the \(\mm n\) there! Sweet! That’s \(1\) of \(3\) modulo operations gone :)

`FastReduce`

revisited

Earlier I mentioned the blogpost A fast alternative to the modulo reduction: to map \(h\in[2^k]\) to \([n]\) (\([n] := \{0, \dots, n-1\}\) here), instead of taking \(h\mm n\), one can do \(h*n/2^k\) which is must faster to evaluate. The problem with this approach is that it only uses the \(\log_2 n\) high-order bits of \(n\), discarding some necessary entropy.

On second thought, it seems like this may still be useful though. There are two modulo operations in the PTHash algorithm:

- In the
`bucket()`

function, mapping from \([2^{64}]\) to \([p_2]\) or \([m - p_2]\). - In the
`position()`

function, mapping from \([2^{64}]\) to \([n]\).

And there is one related check:

- In the
`bucket()`

function, check whether \(h < p_1 \cdot 2^{64}\) (or \(h \mm n \leq p_1 \cdot n\)), which is actually also a reduction operation.

While we cannot use `FastReduce`

twice, I think it may still be possible to use
it once. In particular, it should be fine to use a low-entropy hash for
bucket-selection since we anyway have collisions there – that’s the point of
bucketing in the first place.

A second optimization may be possible if we could find a function that uses the
lower \(\log_2 n\) bits of the hash for `position()`

. Then we can access \(2\cdot
\log_2 n\) bits of entropy in total which should be sufficient to avoid
collisions with constant probability (via the birthday paradox).

One idea is something like \(h \mm 2^{\lceil{\log_2 n\rceil}}\), but this is not quite fair so it may not work out nicely. But then again, maybe we could use that for the bucketing modulo, since it possibly doesn’t require (as much) fairness.

Or maybe we can just take the lower \(32\) bits of \(h\) and do \((h\mm 2^{32}) * n / 2^{32}\). That should probably work just fine :)

So, we have 7 possible reduction functions now:

`FastMod64`

, same as`mod`

but faster.`FastMod32L`

, taking the lower \(32\) bits modulo \(n\).`FastMod32H`

, taking the higher \(32\) bits modulo \(n\).`FastReduce64`

: \((h * n) » 64\)`FastReduce32L`

, fastreduce on the \(32\) low bits: \(((h \mm 2^{32}) * n) » 32\)`FastReduce32H`

, fastreduce on the \(32\) high bits: \(((h » 32) * n) » 32\)

Excluding `mod`

(which is never better than `FastMod64`

, we can make \(36\)
combinations from this to use for the two modulo operations.

Not all combinations end up working (because of lack of entropy when e.g. only
the \(32\) high bits are used). It’s clear that `FastMod64`

tends to be slow, and
that `Reduce`

operations are usually (but not always) faster than `Mod`

operations.

The big problem with this benchmark though seems to be that timings are quite inconsistent (variance of a few nanoseconds), and that the quality of generated code (auto-vectorization and loop unrolling) depends on a lot of things and is somewhat unpredictable/inconsistent.

`bucket` reduce (\(\mm m\)) \ `position` reduce (\(\mm n\)) | `FM32L` | `FM32H` | `FM64` | `FR32L` | `FR32H` | `FR64` |
---|---|---|---|---|---|---|

`FastMod32L` | 20 | 27 | 19 | 20 | ||

`FastMod32H` | 18 | 26 | 19 | |||

`FastMod64` | 29 | 27 | 33 | 28 | 26 | 27 |

`FastReduce32L` | 18 | 23 | 18 | |||

`FastReduce32H` | ||||||

`FastReduce64` |

Note that the last two rows fail in particular, because they strongly correlate with the check whether elements belong in a small or large bucket, \(h(x, s) < p_1\).

### TODO Is there a problem if \(\gcd(m, n)\) is large?

### Faster hashing

The current implementation uses `Murmur2`

to hash both the key \(x\mapsto h(x, s)\) and the
pilot \(k_i \mapsto h(k_i, s)\). While this hash is fast, it’s still quite some
instructions. Instead, especially for the \(h(k_i, s)\), we may be able to get
away with either no or a much simpler hash.

**No hash:** A first try of \(h_0(k_i, s) := k_i\) returns in failures because the
\(k_i\) become too large.

**Multiplication hash ( MulHash):** So we do need more

*mixing*of bits rather than just incrementally increasing \(k_i\) starting at \(0\). One common way of doing that is simply to multiply by a large semi-random $64$bit integer. In particular,

`Murmur`

also does this, so let’s just reuse their mixing constant and set:
\[
h_1(k_i, s) := (0xc6a4a7935bd1e995 \cdot k_i) \mm 2^{64}.
\]
(I tried looking for documentation on why this constant was chosen, but there
doesn’t seem to be more to it than *it works*.)

In experiments, this gives anywhere between \(1\) and \(4\) nanoseconds of speedup.

#### TODO Try xxhash

See github.

### An experiment

Another fun comparison is here, where I use `MulHash`

for \(k_i\) and replace the
`MurmurHash`

for \(h(x)\) by simply the identity operation (since we’re testing on
uniform random \(x\) anyway):

```
17 │1f0: mov (%rsi,%rax,1),%rdx
11 │ imul %r13,%rdx # Start of Murmur
32 │ mov %rdx,%rbx
106 │ shr $0x2f,%rbx
7 │ xor %rdx,%rbx
23 │ imul %r13,%rbx
24 │ mov %rdi,%rdx
99 │ movabs $0x35253c9ade8f4ca8,%rbp
11 │ xor %rbp,%rdx
27 │ xor %rbx,%rdx
28 │ imul %r13,%rdx
92 │ mov %rdx,%rbx
15 │ shr $0x2f,%rbx
15 │ xor %rdx,%rbx
28 │ imul %r13,%rbx
95 │ mov %rbx,%rbp
16 │ shr $0x2f,%rbp
17 │ xor %rbx,%rbp # End of Murmur
25 │ mov %ebp,%edx
64 │ cmp %rbp,%r8
43 │ ↓ jbe 250 # Branch for bucket index
9 │ imul %r14,%rdx
12 │ shr $0x20,%rdx
27 │ ↓ jmp 25d
│ cs nopw 0x0(%rax,%rax,1) # nop; for code alignment
7 │250: imul %r15,%rdx
4 │ shr $0x20,%rdx
11 │ add 0x10(%rsp),%rdx
5088 │25d: mov (%r9,%rdx,8),%rdx # Memory lookup -- most waiting is here.
301 │ imul %r13,%rdx
98 │ xor %rbp,%rdx
453 │ mulx %r11,%rdx,%rdx
100 │ cmp %rdx,%r10
2 │ ↓ jbe 6e8
13 │ add %rdx,%r12
10 │ add $0x8,%rax
37 │ cmp %rax,%rcx
100 │ ↑ jne 1f0
```

```
72 │4a0: mov (%rcx,%rbp,1),%rbx
38 │ mov %ebx,%edx
21 │ cmp %rbx,%rsi
36 │ ↓ jbe 4c0 # Branch for bucket index
16 │ imul %r15,%rdx
22 │ shr $0x20,%rdx
24 │ ↓ jmp 4cb
│ data16 cs nopw 0x0(%rax,%rax,1) # code alignment
│4c0: imul %r14,%rdx
20 │ shr $0x20,%rdx
21 │ add %r10,%rdx
1895 │4cb: mov (%rdi,%rdx,8),%rdx # Memory lookup -- most waiting is here
172 │ imul %r13,%rdx
75 │ xor %rbx,%rdx
252 │ mulx %r9,%rdx,%rdx
56 │ cmp %rdx,%r8
│ ↓ jbe 706
43 │ add %rdx,%r11
26 │ add $0x8,%rbp
│ cmp %rbp,%rax
34 │ ↑ jne 4a0
```

`MurmurHash`

takes slightly less than half the instructions, but removing them
gives almost \(2.5\times\) speedup! My current thinking is that this is not so
much due to the reduced instruction count itself (the CPU is stalling anyway to
wait for memory), but rather due to the better pipelining it results in: when
loop iterations are shorter (in number of assembly instructions), pipelining can
look ahead more iterations, and hence does a better job at prefetching memory.
But even with this short loop, around two thirds of the time is still spend
waiting for memory.

**Conclusion 1.:** I should really write code with prefetching.

**Conclusion 2.:** It’s time to use `perf stat`

for some metrics on *branch
mispredictions* and *instructions per cycle*.

### Compiler struggles

**Auto-vectorization:** The compiler is quite eager to generated vectorized assembly code.
It’s quite unpredictable when auto-vectorization triggers, and it seems I have
to keep at least one branch in the hot loop to prevent it. The vectorized code
seems bad for a few reasons:

**Gather instructions**(`vpgatherqq`

) are slow.**Pipelining:**It seems that pipelining works much better for the scalar version, being able to look ahead further and keeping busy while waiting for memory to load.

**Even worse:** Also, it did the following terrible thing. Starting with this piece of innocent looking code:

```
if likely(p < self.n0) {
p
} else {
unsafe { *self.free.get_unchecked(p - self.n0) }
}
```

the compiler decided to generate:

Basically: it created a branchless implementation of this if statement where the
`false`

branch is always executed. But that branch is super slow! Basically a
completely unnecessary read from main memory!
For now I’ll just completely remove the `false`

branch to prevent this issue…

### Prefetching, at last

Without further ado, here we go:

```
#[inline(always)]
pub fn index_stream<'a, const L: usize>(
&'a self,
xs: &'a [Key],
) -> impl Iterator<Item = usize> + 'a {
let mut next_hx: [Hash; L] = xs.split_array_ref().0.map(|x| self.hash_key(&x));
let mut next_i: [usize; L] = next_hx.map(|hx| self.bucket(hx));
xs[L..].iter().enumerate().map(move |(idx, next_x)| {
let idx = idx % L;
let cur_hx = next_hx[idx];
let cur_i = next_i[idx];
next_hx[idx] = self.hash_key(next_x);
next_i[idx] = self.bucket(next_hx[idx]);
// TODO: Use 0 or 3 here?
// I.e. populate caches or do a 'Non-temporal access', meaning the
// cache line can skip caches and be immediately discarded after
// reading.
unsafe { prefetch_read_data(self.k.address(next_i[idx]), 3) };
let ki = self.k.index(cur_i);
let p = self.position(cur_hx, ki);
p
})
}
```

For \(L = 64\), this is around twice as fast as the non-streaming/non-prefetching version!

In our \(n=10^7\) benchmark, **this reduces latency to 4.2ns**!!
For the larger \(n=10^8\) benchmark, latency is

**, down from**

`7.5ns`

**of the original PTHash paper! (But note that I don’t do any compression here.)**

`28ns`

And this is without vectorization still :)

### TODO Prefetching with vectorization

Preliminary results: this seems tricky to get right and tends to be slower. It
sometimes generates unwanted `gather`

instructions, but even when it doesn’t
it’s slow although I don’t know exactly why yet. **Does pipelining work with SIMD instructions?**

### Inverting \(h(k_i)\)

When there are only a few empty slots left, we want to find \(k\) such that \[ R(h_1(x) \lxor h_2(k), n) = p \] for some fixed position \(p\).

Using `FastReduce64`

as reduction \(R\), we want
\[
\left\lfloor\frac{(h_1(x) \lxor h_2(k)) * n}{2^{64}}\right\rfloor = p
\]
i.e.
\[
2^{64} p \leq (h_1(x) \lxor h_2(k)) \cdot n < 2^{64}(p+1)
\]
i.e.
\[
\frac{2^{64} p}{n} \leq h_1(x) \lxor h_2(k) < \frac{2^{64}(p+1)}{n}.
\]
This basically fixes the high \(r:=\log_2(n)\) bits of \(h_1(x) \lxor h_2(k)\).
(I suspect it only leaves two possible options for the highest \(\lfloor
r\rfloor\) bits.)
Thus, we infer (with some handwaving)
\[
h_2(k) = \frac{2^{64}(p+\frac 12)}{n} \oplus h_1(x) \oplus A =: X \oplus A
\]
where \(X\) is the target constant and \(A\)’s upper \(r\) bits are \(0\), i.e. \(A < 2^{64-r}\).
(TODO: Maybe we can just for the lower \(64-r\) bits of \(X\) to \(0\). Not sure.)

We could solve this when using `Murmur`

for \(h_2\), but in practice we use a much
simpler `MulHash`

: \(h_2(k) = (C\cdot k) \mm 2^{64}\). Thus, we must solve
\[
(C \cdot k) \mm 2^{64} \in \{ X \oplus A : 0\leq A < 2^{64-r}\}
\]
for \(k\).

Now, since \(C\) is odd it has a multiplicative inverse \(C’\) over \(F_{2^{32}}\) with \((C \cdot C’)\mm 2^{64} = 1\), so we’re looking for \[ k \in \{ (C’ \cdot (X \oplus A)) \mm 2^{64} : 0\leq A < 2^{64-r}\}. \]

This makes it very easy to find arbitrary solutions! The big drawback is that this results in ‘random’ \(k\) in \([2^{64}]\), while for compression purposes we much prefer finding small (or even minimal) \(k\).

Given the size of the set, the expected value of the smallest solution \(k\) is \(2^r = n\). But I’m not sure if this is solvable efficiently. We could first look for solutions where the \(64-r\) high bits of \(k_i\) are \(0\), and if none are found we can relax to \(64-r-1\) high bits of \(0\), and so on. But does that even help?

Remember that \(C\) is a mixing constant and hence both \(C\) and \(C’\) are quite ‘random’ with lots of \(1\) bits. Maybe we can choose \(C\) to have good mixing but also simplify this inverse? Or can it always be done? I don’t know..

### Another day of progress

- Explored ways to solve the hash-inversion problem above. Seems that an \(O(64-r)\) solution is possible, and maybe even \(O( r)\) but that’s probably more annoying. (Either way both will be \(O(1)\) for \(n < 2^{64}\) ;) There are a lot of interesting things going on to be written down and formalized later.
- Added a binary
`src/bin/bucket_sizes.rs -n <n> -c <c> -a <a>`

to print a nice table with number of buckets of each size and related statistics. - Replaced
`Vec::sort()`

by`radsort::sort`

which is roughly \(2\times\) faster for uniform`u64`

hashes. TODO: Probably it will be faster to only sort by the highest \(n\) bits (or \(32\) bits), i.e. by only sorting on buckets, since most buckets will have size \(1\). - I’m playing with the idea of implementing some kind of interpolation sort
algorithm that just inserts things directly in the right place in an array of
`Option<NonZero<usize>>`

of size \((1+\epsilon)n\) or maybe \(n + C \cdot \sqrt n\) and then runs a collect on this. Should work quite well I think.

### TODO Possible sorting algorithms

- Robinhoodsort
- Bounds on linear probing
- Flashsort (wikipedia, article)
- Drawback: bad cache locality when writing out buckets. Maybe just write to \(O(\sqrt n)\) buckets (should fit in L2 cache ideally) and then sort each bucket individually.

### Diving into the inverse hash problem

So we want to find the minimal \(k\) such that
\[
(C\cdot k) \mm 2^{64} \in \{X \oplus A : 0 \leq A < 2^{64}\}
\]
for some given \(X\), i.e. the \(r\) high bits of `C*k`

as \(64\) bit integers should
be equal to the \(r\) high bits of \(X\). Let \(\K\) be the set of \(k\) for which this
holds, so that we are interested in \(\min \K\).

A first thought is: can we first find *some* valid \(k=k_0\), and then hope that the
space of solutions \(\K\) has some kind of affine structure so that we can easily
translate the initial \(k_0\) to a minimal solution? So let’s investigate the
linear structure first: let’s find solutions that have \(r\) leading zeros, i.e
\[\K_0 := \{k : C \times k < 2^{r’}\},\]
where I will use \(\times\) for $64$bit multiplication and \(r’ := 64-r\).

We definitely expect \(\K_0\) to have *some* structure: if \(C\times k_1 < 2^{r’}\)
and \(C\times k_2 < 2^{r’}\), then \(C\times (k_1 + k_2) < 2^{r’+1}\), which is
almost good enough, and similarly, \(-2^{r’} \leq C\times (k_1 - k_2) \leq
2^{r’}\), so that either \(k_1 - k_2\) or \(k_2 - k_1\) is also a solution.

Thus, consider those \(k\in \K_0\) with small absolute value (modulo \(2^{64}\), so the absolute value \(|k|\) of \(k\) is \(\min(k, 2^{64}-k)\)). Then, each difference \(k_d\) between two consecutive elements of \(\K_0\) must occur in \(\K_0\) either as \(k_d\) or as \(-k_d\).

So let’s write some code to determine all possible differences. I’m doing this for $32$bit integers first since we can quickly iterate over them

```
type T = u32;
fn find_diffs_bruteforce(c: T) {
for r in 1..T::BITS {
let mut last = 0;
let mut diffs = HashSet::new();
for i in 0..=T::MAX {
let ci = c.wrapping_mul(i);
if ci.leading_zeros() >= r {
diffs.insert(i - last);
last = i;
}
}
eprintln!("r = {r:>2}: {diffs:?}");
}
}
```

Result:

```
r = 1: {1, 2, 3}
r = 2: {3, 5, 8}
r = 3: {3, 11, 14}
r = 4: {11, 14, 25}
r = 5: {14, 39, 53}
r = 6: {39, 53, 92}
r = 7: {92, 145, 237}
r = 8: {92, 329, 421}
r = 9: {421, 513, 934}
r = 10: {421, 934, 1355}
r = 11: {1355, 2289, 3644}
r = 12: {1355, 3644, 4999}
r = 13: {3644, 8643, 12287}
r = 14: {8643, 29573, 38216}
r = 15: {8643, 46859, 55502}
r = 16: {55502, 64145, 119647}
r = 17: {55502, 119647, 175149}
r = 18: {175149, 230651, 405800}
r = 19: {405800, 1042251, 1448051}
r = 20: {405800, 2665451, 3071251}
r = 21: {405800, 3477051, 3882851}
r = 22: {3882851, 4288651, 8171502}
r = 23: {3882851, 27585757, 31468608}
r = 24: {3882851, 35351459, 39234310}
r = 25: {3882851, 39234310, 43117161}
r = 26: {43117161, 90117173, 133234334}
r = 27: {43117161, 133234334, 176351495}
r = 28: {133234334, 309585829, 442820163}
r = 29: {442820163, 1195226155}
r = 30: {442820163, 2966506807}
r = 31: {3852147133}
```

**WOW**: For each \(r\), it turns out there are only \(3\) possible differences
between adjacent elements of \(\K_0\)! And also note that (when there are \(3\)
distinct values) the largest one is always the sum of the smaller two. What is
more: when we increase \(r\) by one, the new differences always include one or two
of the previous differences, and some small linear combination of them. I.e.
\((a,b,c) \mapsto (b, c, b+c)\) or \((a,b,c) \mapsto (a, a+c, b+c)\).

This inspires a much faster algorithm to find these possible differences: for \(r\) from \(1\) to \(32\), start with differences \(\Delta_0 := \{1\}\) for \(r=0\), and then to get \(\Delta_r\) from \(\Delta_{r-1}\) first make all small linear combinations of elements of \(\Delta_{r-1}\) and then start with \(k=0 \in K_0^r\), and then keep increasing \(k\) by all the possible delta’s to see if \(k+\delta\in K_0^r\). If this is the case add \(\delta\) to \(\Delta_r\), and otherwise keep going. Once \(\Delta_r\) reaches size \(3\), we can stop assuming our hypothesis that \(|\Delta_r| \leq 3\) is true. (TODO: This will require a proof eventually.) See the implementation here on GitHub.

Also, let’s have a look at what \(C \times \delta\) looks like in binary for all \(\delta \in \Delta_r\):

```
r = 1
1 1011011110100011110100110010101
2 10110111101000111101001100101010
3 10011011101011011110010111111
r = 2
3 10011011101011011110010111111
5 11001011000110011000111111101001
8 11011110100011110100110010101000
r = 3
3 10011011101011011110010111111
11 11110010000001010000100101100111
14 101011110101100011000100110
r = 4
11 11110010000001010000100101100111
14 101011110101100011000100110
25 11110111011111111100111110001101
r = 5
14 101011110101100011000100110
39 11111100111110101001010110110011
53 10011101010101101111011001
r = 6
39 11111100111110101001010110110011
53 10011101010101101111011001
92 11111111011011111111000110001100
r = 7
92 11111111011011111111000110001100
145 1111001010100110101100101
237 1010101010011111011110001
r = 8
92 11111111011011111111000110001100
329 110001010011000001111101
421 1101010010001000001001
r = 9
421 1101010010001000001001
513 11111111101001010001001110010101
934 11111111110110100011010110011110
r = 10
421 1101010010001000001001
934 11111111110110100011010110011110
1355 11110101011110100111
r = 11
1355 11110101011110100111
2289 11111111111010011000110101000101
3644 11111111111110001110010011101100
r = 12
1355 11110101011110100111
3644 11111111111110001110010011101100
4999 10000011110010010011
r = 13
3644 11111111111110001110010011101100
8643 10010000101111111
12287 11111111111110100000011001101011
r = 14
8643 10010000101111111
29573 11111111111111000100100101101001
38216 11111111111111010110101011101000
r = 15
8643 10010000101111111
46859 11111111111111101000110001100111
55502 11111111111111111010110111100110
r = 16
55502 11111111111111111010110111100110
64145 1100111101100101
119647 111110101001011
r = 17
55502 11111111111111111010110111100110
119647 111110101001011
175149 10101100110001
r = 18
175149 10101100110001
230651 11111111111111111101100100010111
405800 10001001000
r = 19
405800 10001001000
1042251 11111111111111111110000110100111
1448051 11111111111111111110010111101111
r = 20
405800 10001001000
2665451 11111111111111111111001011000111
3071251 11111111111111111111011100001111
r = 21
405800 10001001000
3477051 11111111111111111111101101010111
3882851 11111111111111111111111110011111
r = 22
3882851 11111111111111111111111110011111
4288651 1111100111
8171502 1110000110
r = 23
3882851 11111111111111111111111110011111
27585757 110100001
31468608 101000000
r = 24
3882851 11111111111111111111111110011111
35351459 11011111
39234310 1111110
r = 25
3882851 11111111111111111111111110011111
39234310 1111110
43117161 11101
r = 26
43117161 11101
90117173 11111111111111111111111111011001
133234334 11111111111111111111111111110110
r = 27
43117161 11101
133234334 11111111111111111111111111110110
176351495 10011
r = 28
133234334 11111111111111111111111111110110
309585829 1001
442820163 11111111111111111111111111111111
r = 29
442820163 11111111111111111111111111111111
1195226155 111
r = 30
442820163 11111111111111111111111111111111
2966506807 11
r = 31
442820163 11111111111111111111111111111111
3852147133 1
```

We see that the first \(r\) bits of \(C\times \delta_r\) are either \(0\) or \(1\), so \(-2^{32-r} < C \times \delta_r < 2^{32-r}\).

So using this knowledge, how do we find \(\min \K\) for an arbitrary constant \(X\)? Well, we can start with \(k = 0\) and \(r=0\), and then keep trying to add elements of \(\Delta_{r+1}\) until the \(r\)’th-highest bit of \(C\times k\) is the same as the one in \(X\). Then we increase \(r\) and repeat until all highest \(r\) bits are correct.

```
fn find_inverse_fast(X: T, r: u32, diffs: &Vec<Vec<T>>) -> T {
let mut k = 0;
let mut rr = (C.wrapping_mul(k) ^ X).leading_zeros();
'rr: while rr < r {
for &d in &diffs[rr as usize] {
let new_rr = (C.wrapping_mul(k + d) ^ X).leading_zeros();
if new_rr >= rr {
k += d;
rr = new_rr;
eprintln!(
"k+={d:10} = {k:10}: {:032b} {:032b} {rr:>2}",
C.wrapping_mul(k),
C.wrapping_mul(k) ^ X
);
continue 'rr;
}
}
unreachable!();
}
k
}
```

Running this on some random input for \(r=32\) gives for example:

```
r = 32 X = 10110111100010000001010011000110
delta k k*C (k*C)^X r
k+= 1 = 1: 01011011110100011110100110010101 11101100010110011111110101010011 0
k+= 1 = 2: 10110111101000111101001100101010 00000000001010111100011111101100 10
k+= 1355 = 1357: 10110111101100110010101011010001 00000000001110110011111000010111 10
k+= 934 = 2291: 10110111100011010110000001101111 00000000000001010111010010101001 13
k+= 8643 = 10934: 10110111100011101000000111101110 00000000000001101001010100101000 13
k+= 8643 = 19577: 10110111100011111010001101101101 00000000000001111011011110101011 13
k+= 3644 = 23221: 10110111100010001000100001011001 00000000000000001001110010011111 16
k+= 55502 = 78723: 10110111100010000011011000111111 00000000000000000010001011111001 18
k+= 230651 = 309374: 10110111100010000000111101010110 00000000000000000001101110010000 19
k+= 405800 = 715174: 10110111100010000001001110011110 00000000000000000000011101011000 21
k+= 405800 = 1120974: 10110111100010000001011111100110 00000000000000000000001100100000 22
k+= 3882851 = 5003825: 10110111100010000001011110000101 00000000000000000000001101000011 22
k+= 3882851 = 8886676: 10110111100010000001011100100100 00000000000000000000001111100010 22
k+= 3882851 = 12769527: 10110111100010000001011011000011 00000000000000000000001000000101 22
k+= 3882851 = 16652378: 10110111100010000001011001100010 00000000000000000000001010100100 22
k+= 3882851 = 20535229: 10110111100010000001011000000001 00000000000000000000001011000111 22
k+= 3882851 = 24418080: 10110111100010000001010110100000 00000000000000000000000101100110 23
k+= 3882851 = 28300931: 10110111100010000001010100111111 00000000000000000000000111111001 23
k+= 3882851 = 32183782: 10110111100010000001010011011110 00000000000000000000000000011000 27
k+= 133234334 = 165418116: 10110111100010000001010011010100 00000000000000000000000000010010 27
k+= 133234334 = 298652450: 10110111100010000001010011001010 00000000000000000000000000001100 28
k+= 133234334 = 431886784: 10110111100010000001010011000000 00000000000000000000000000000110 29
k+=1195226155 = 1627112939: 10110111100010000001010011000111 00000000000000000000000000000001 31
k+= 442820163 = 2069933102: 10110111100010000001010011000110 00000000000000000000000000000000 32
```

In this example we are somewhat lucky and \(k=2\) happens to have the \(10\) highest bits correct, but generally the number of correct bits grows by a few each step.

I have written some code to verify the correctness and minimality of the result, and indeed results are always identical to a bruteforce search starting at \(k=0\).

We can also make some statistics on the value of \(\min \K^r\) compared to \(2^r\). Since \(|\K^r| = 2^{64-r}\) we expect the smallest solution to be around \(2^r / 2\). In practice this seems to be mostly the case, but not quite exactly. Over a total of \(10^8\) samples (which takes \(40\) seconds) we get the following ratios of \(k/2^r = \min(\K^r)/2^r\):

```
r avg(k/2^r) max(k/2^r)
r = 0: 0.000 0.000
r = 1: 0.250 0.500
r = 2: 0.375 0.750
r = 3: 0.500 1.000
r = 4: 0.694 2.375
r = 5: 0.741 1.750
r = 6: 0.538 1.922
r = 7: 0.600 1.422
r = 8: 0.588 1.645
r = 9: 0.644 1.539
r = 10: 0.564 1.718
r = 11: 0.536 1.245
r = 12: 0.552 1.677
r = 13: 0.549 1.365
r = 14: 0.553 1.784
r = 15: 0.527 2.319
r = 16: 0.762 2.060
r = 17: 0.599 1.475
r = 18: 0.571 1.991
r = 19: 0.501 1.309
r = 20: 4.275 16.086
r = 21: 4.580 12.026
r = 22: 3.083 7.008
r = 23: 1.763 3.691
r = 24: 0.939 1.908
r = 25: 0.515 1.923
r = 26: 0.801 2.416
r = 27: 0.671 1.692
r = 28: 0.556 1.571
r = 29: 0.526 1.209
r = 30: 0.529 1.602
r = 31: 0.645 1.798
r = 32: 0.513 1.148
r = 33: 0.515 1.473
r = 34: 0.934 2.784
r = 35: 0.865 2.160
r = 36: 0.561 1.208
r = 37: 0.532 1.748
r = 38: 0.584 1.446
r = 39: 0.555 1.732
r = 40: 0.603 3.102
r = 41: 1.291 4.149
r = 42: 1.100 2.724
r = 43: 0.690 1.470
r = 44: 0.902 4.465
r = 45: 1.530 4.437
r = 46: 1.191 2.770
r = 47: 0.721 1.569
r = 48: 0.560 1.523
r = 49: 0.525 1.154
r = 50: 0.600 1.534
r = 51: 0.516 1.823
r = 52: 0.598 1.439
r = 53: 0.535 1.703
r = 54: 0.578 1.343
r = 55: 0.583 1.589
r = 56: 0.522 1.130
r = 57: 0.614 1.528
r = 58: 0.523 1.329
r = 59: 0.849 2.757
r = 60: 0.777 1.902
r = 61: 0.502 1.082
r = 62: 0.565 1.492
r = 63: 0.532 1.254
r = 64: 0.500 1.000
```

Indeed, the average value of \(k/2^r\) is typically just above \(0.5\) or at least below \(1.0\), but occasionally jumps to above \(4\), in particular around \(r\in \{20, 21, 22, 23\}\). I’m really not quite sure why, but this particular range seems to have something weird going on, as also seen by the maximum \(k/2^r\) ratio of \(16\). (Note that this is not a bug with our fast inversion algorithm. The bruteforce algorithm confirms these results.)

Most likely this is caused in some way by the presence of a very small
\(\delta_r\) for these \(r\) (marked with a `*`

), and this correlating with all
\(C\times \delta_r\) not only being \(< 2^{64-r}\) (in absolute value) but being either very close to it
or very small.

```
r delta_r C*delta_r
r = 19
164249 111111111101001111010010100101010111100001101
521986 1111111111111111111111110111101110100011000101011001011000101010
686235 111110111011000011101011010000100010100110111
r = 20
* 521986 1111111111111111111111110111101110100011000101011001011000101010
16345815 11111111001100111001111011111101111000100011
16867801 11110110111011011101000001010111010001001101
r = 21
* 521986 1111111111111111111111110111101110100011000101011001011000101010
24697591 1111010110101101011010010010100000011000011
25219577 1110010100100001110010111101101011011101101
r = 22
* 521986 1111111111111111111111110111101110100011000101011001011000101010
28873479 111000101010000011111101011111001000010011
29395465 110000011000100111000010111000100000111101
r = 23
* 521986 1111111111111111111111110111101110100011000101011001011000101010
30439437 11111110101101101001101101011010010010001
30961423 10111100100010000010011000100101010111011
r = 24
* 521986 1111111111111111111111110111101110100011000101011001011000101010
31483409 1111010010110011011000011110000011100101
32005395 111000001010110011101110111011100001111
r = 25
32005395 111000001010110011101110111011100001111
32527381 1111111111111111111111111110101111111001100011010000110100111001
64532776 101110001010000000001001000010001001000
```

**Conclusion:** We now have a fast \(O( r)\) algorithm to solve \(C\times k = X \lxor A\) with
\(0\leq A < 2^{64-r}\). But this does not yet solve the full problem since we actually wanted
\[
\frac{2^{64} p}{n} \leq X \lxor (C\times k) < \frac{2^{64}(p+1)}{n}
\]
and we did some handwaving to assume that implies the topmost \(r\) bits of
\(C\times k\).

### Bringing it home

Let’s start with a (slightly pathological, but illustrating) \(32\) bit example: \(n=511\), \(p=127\), so we want \[ \frac{2^{32}\cdot 127}{511} \leq X \lxor (C\times k) < \frac{2^{32} \cdot 128}{511}. \] In binary, the lower and (inclusive) upper bound are:

```
low : 0011 1111 1001 1111 1100 1111 1110 0111
high: 0100 0000 0010 0000 0001 0000 0000 1000
lcp : 0 (r=1)
```

These have a longest common prefix `0`

of length \(r=1\), so we could now use the
solution above to enumerate solutions \(k\) for \(r=1\) and then for each of them
check if \(X \lxor (C\times k)\) is indeed in this interval. But this is quite
inefficient, as the actual interval only has length \(2^{32}/511 \approx 2^{23}\), so that
each possible \(k\) only works with probability \(2^{23} / 2^{32 - 1} = 2^{23-31} =
2^{-8} = 1/256\), indicating that we may have to try a lot of distinct \(k\).

Instead, we can split the interval into two ranges:

```
low = r1 start: 0011 1111 1001 1111 1100 1111 1110 0111
r1 end : 0011 1111 1111 1111 1111 1111 1111 1111
lcp : 0011 1111 1 (r=9)
r2 start: 0100 0000 0000 0000 0000 0000 0000 0000
high = r2 end : 0100 0000 0010 0000 0001 0000 0000 1000
lcp : 0100 0000 00 (r=10)
```

Both range \(r_1\) and \(r_2\) now cover more than half of the values with the prefix of length \(r=9\) resp. \(r=10\) bits, so a suitable \(k\) should be found within a few tries in each case, and we can simply take the minimum of the two solutions.

One additional optimization that we can do is that instead of computing
solutions for the two ranges independently, we can first compute a \(k_0\) that
gives the right bits in the LCP of `low`

and `high`

, and then extend from there
towards the two disjoint intervals.

The final code is as follows (and includes some more handling of edge cases):

```
/// Solve FastReduce(x ^ MH(k), n) == p efficiently.
fn invert_fr64_fast(x: Hash, n: usize, p: usize, diffs: &Vec<Vec<T>>) -> u64 {
// low = 2^64 * p/n <= x^FR(k) < 2^64 * (p+1)/n = high+1
let low = ((1u128 << 64) * p as u128 / n as u128) as u64;
// high is an inclusive bound: (2^64 * (p+1) - 1)/n
let high = (((1u128 << 64) * (p + 1) as u128 - 1) / n as u128 - 1) as u64;
// In this case the partitioning into two intervals doesn't work.
if low == high {
return find_inverse_fast_with_test(
low ^ x.get(),
64, // r
|k| {
let xck = x.get() ^ C.wrapping_mul(k);
low <= xck && xck < high
},
0, // k0
diffs,
);
}
// Split [low, high] into two pieces [low, low_end] and [high_start, high]
// that have (much) longer LCP.
let lcp = (low ^ high).leading_zeros();
// First find the solution for the LCP.
let k0 = find_inverse_fast(low ^ x.get(), lcp, diffs);
let low_end = low | ((1u64 << (63 - lcp)) - 1);
let high_start = low_end + 1;
let low_lcp = (low ^ low_end).leading_zeros();
let high_lcp = (high_start ^ high).leading_zeros();
let test = |k| low <= x.get() ^ C.wrapping_mul(k) && x.get() ^ C.wrapping_mul(k) < high;
let low_k = find_inverse_fast_with_test(low ^ x.get(), low_lcp, test, k0, diffs);
let high_k = find_inverse_fast_with_test(high_start ^ x.get(), high_lcp, test, k0, diffs);
min(low_k, high_k)
}
```

This is \(O(64)\) and takes around a minute to invert \(10^8\) hashes.

**Conclusion:** We can now efficiently find \(k_i = O(n)\) such that \(h_1(x) \lxor
h_2(k_i)\) maps to a chosen free slot (when \(h_2\) is a `FastReduce`

instance).
This should allow us to fill the last slots of the table much faster.

### Hash-inversion for faster PTHash construction

So now we have a fast way to find \(k_i\) for the *tail* of the last \(t\) buckets.
We will assume that these buckets all have size \(1\). (Otherwise decrease \(t\).)
Let \(F\) be the set of free positions once the *head* of \(m-t\) buckets has been processed.
We always have \(|F| \geq t\) and when \(\alpha = 1\) we have \(|F| = t\).
We can then implement two strategies:

- Sequential
- Iterate over buckets and free slots in parallel, matching each bucket to a slot. Then compute the \(k_i\) that sends each bucket to the corresponding free slot. This will give \(k_i\sim n\) in expectation, uses \(t \cdot \log_2(n)\) bits in total, and runs in \(O(t)\).
- Greedy
- For each bucket (in order), compute \(k_i(f)\) for each candidate slot \(f\), and choose the minimal value. When \(\alpha=1\) this gives \(k_i \sim n/f\) and runs in \(O(t^2)\). The total number of bits is \[ \sum_{f=1}^t \log_2(n/f) = t\log_2(n) - \log_2(t!) \sim t (\log_2(n) - \log_2(t)) \] For \(t=O(\sqrt n)\), this saves up to half the bits for these numbers.

In some quick experiments with \(n=10^8\), the sequential strategy seems to give at most around \(15\%\) speedup (\(35\) to \(30\) seconds for \(t=10000\)), which is not as much as I had hoped. This seems to be because relatively a lot of time is also spent on finding \(k_i\) for the last buckets of size \(2\) and \(3\).

### Fast path for small buckets

For small buckets (size \(\leq 4\)) it pays of to use a code path that knows the
explicit bucket size and processes a fixed size `&[Hash; BUCKET_SIZE]`

array
instead of an arbitrary sized slice `&[Hash]`

. This allows for better code generation.

### TODO Dictionary encoding

The dictionary will be quite dense for numbers up to some threshold (say \(1024\)), and sparser afterwards. We can encode the small numbers directly and only do the dictionary lookup for larger ones.

- TODO: Figure out if the branch is worth the savings of the lookup.

### TODO Larger buckets

The largest bucket should be able to have size \(O(\sqrt n)\) without issues. From there it should slowly decay (TODO: figure out the math) to constant size. This could put the elements that are currently in the largest \(\sim 1\%\) of buckets all together in a few buckets, reducing the average size of the remaining buckets. Although the reduction seems only minimal, so this may not give too much benefit.

One way of achieving such a skew distribution might be to replace the partitioning of \(h\in [0, 2^{64})\) in \(m\) chunks, by a partitioning of \(h^2 \in [0, 2^{128})\) in \(m\) chunks.

### TODO Prefetching free slots

Looking up whether slots in the array for a certain \(k_i\) are free is quite slow and memory bound. Maybe we can prefetch values for a few \(k_i\) ahead.

Also, the computation of `position`

could be vectorized.

### Filling the last few empty slots needs very high \(k_i\)!

Have a look at the last two columns of this table, indicating the number of newly appearing \(k_i\) in each class and total number of distinct \(k_i\) up to then:

```
sz cnt bucket% cuml% elem% cuml% avg ki max ki new ki # ki
25: 1 0.00 0.00 0.00 0.00 0.0 0 1 1
24: 1 0.00 0.00 0.00 0.00 0.0 0 0 1
23: 5 0.00 0.00 0.00 0.00 0.0 0 0 1
22: 11 0.00 0.00 0.00 0.00 0.0 0 0 1
21: 63 0.00 0.00 0.00 0.00 0.0 0 0 1
20: 181 0.00 0.00 0.00 0.01 0.0 0 0 1
19: 566 0.00 0.00 0.01 0.02 0.0 1 1 2
18: 1494 0.01 0.01 0.03 0.04 0.0 1 0 2
17: 4143 0.02 0.02 0.07 0.11 0.0 2 1 3
16: 10119 0.04 0.06 0.16 0.28 0.0 2 0 3
15: 23730 0.09 0.15 0.36 0.63 0.1 3 1 4
14: 52393 0.20 0.35 0.73 1.36 0.2 6 3 7
13: 107711 0.41 0.76 1.40 2.77 0.4 8 2 9
12: 205081 0.78 1.54 2.46 5.23 0.8 14 6 15
11: 359440 1.36 2.90 3.95 9.18 1.5 27 10 25
10: 581154 2.21 5.11 5.81 14.99 2.9 57 23 48
9: 854333 3.24 8.35 7.69 22.68 6.0 123 44 92
8: 1149239 4.36 12.72 9.19 31.87 12.7 264 95 187
7: 1418308 5.38 18.10 9.93 41.80 25.6 539 181 368
6: 1685178 6.40 24.50 10.11 51.91 46.4 873 288 656
5: 2096913 7.96 32.46 10.48 62.40 73.6 1578 391 1047
4: 2880342 10.94 43.40 11.52 73.92 107.5 2276 580 1627
3: 4046307 15.36 58.76 12.14 86.06 150.3 4334 956 2583
2: 4887755 18.56 77.31 9.78 95.83 171.2 8071 1066 3649 *
1: 4166126 15.82 93.13 4.17 100.00 382.2 126928083 14084 17733 *
0: 1809532 6.87 100.00 0.00 100.00 0.0 0 0 17733
```

The buckets of size \(2\) and larger only need \(3649\) distinct \(k_i\) which can be represented using \(12\) bits each with dictionary encoding. But after the last single-element buckets we need \(17733\) distinct \(k_i\) which would require \(15\) bits! If we instead use \(\alpha=0.99\) things look much better:

```
sz cnt bucket% cuml% elem% cuml% avg ki max ki new ki # ki
27: 1 0.00 0.00 0.00 0.00 0.0 0 1 1
26: 1 0.00 0.00 0.00 0.00 0.0 0 0 1
24: 4 0.00 0.00 0.00 0.00 0.0 0 0 1
23: 4 0.00 0.00 0.00 0.00 0.0 0 0 1
22: 20 0.00 0.00 0.00 0.00 0.0 0 0 1
21: 53 0.00 0.00 0.00 0.00 0.0 0 0 1
20: 165 0.00 0.00 0.00 0.01 0.0 1 1 2
19: 514 0.00 0.00 0.01 0.01 0.0 1 0 2
18: 1378 0.01 0.01 0.02 0.04 0.0 1 0 2
17: 3736 0.01 0.02 0.06 0.10 0.0 2 1 3
16: 9386 0.04 0.06 0.15 0.25 0.0 2 0 3
15: 22223 0.08 0.14 0.33 0.59 0.1 3 1 4
14: 49298 0.19 0.33 0.69 1.28 0.2 5 2 6
13: 102501 0.39 0.71 1.33 2.61 0.4 7 2 8
12: 196895 0.74 1.45 2.36 4.97 0.7 14 7 15
11: 348772 1.31 2.76 3.84 8.81 1.3 22 8 23
10: 568406 2.14 4.90 5.68 14.49 2.7 49 19 42
9: 843818 3.17 8.07 7.59 22.09 5.4 118 45 87
8: 1143451 4.30 12.37 9.15 31.23 11.4 210 94 181
7: 1426213 5.36 17.74 9.98 41.22 22.9 395 160 341
6: 1701482 6.40 24.14 10.21 51.43 41.3 9910 261 602
5: 2111700 7.94 32.08 10.56 61.99 65.0 1494 332 934
4: 2897775 10.90 42.98 11.59 73.58 93.5 2247 472 1406
3: 4079515 15.34 58.32 12.24 85.82 125.3 3209 752 2158
2: 4964046 18.67 76.98 9.93 95.74 126.5 4074 437 2595 *
1: 4256736 16.01 92.99 4.26 100.00 38.6 1061 0 2595 *
0: 1863589 7.01 100.00 0.00 100.00 0.0 0 0 2595
```

Now no new pilot values are needed at all for the single-element buckets, and \(12\) buts is enough for each value, saving \(20\%\) of memory!

Looking closer at the buckets of size \(1\), we see that most new values are added for the last few elements, where large \(k_i\) are needed.

```
1: 263401 1.00 79.00 0.26 96.28 25.0 355 0 3649
1: 263401 1.00 80.00 0.26 96.54 27.0 364 0 3649
1: 263402 1.00 81.00 0.26 96.80 29.3 381 0 3649
1: 263401 1.00 82.00 0.26 97.07 31.8 429 0 3649
1: 263401 1.00 83.00 0.26 97.33 35.1 463 0 3649
1: 263401 1.00 84.00 0.26 97.60 38.7 562 0 3649
1: 263402 1.00 85.00 0.26 97.86 43.4 623 0 3649
1: 263401 1.00 86.00 0.26 98.12 49.3 630 0 3649
1: 263401 1.00 87.00 0.26 98.39 57.1 720 0 3649
1: 263401 1.00 88.00 0.26 98.65 67.2 896 0 3649
1: 263402 1.00 89.00 0.26 98.91 82.0 1160 0 3649
1: 263401 1.00 90.00 0.26 99.18 105.0 1333 0 3649
1: 263401 1.00 91.00 0.26 99.44 146.4 2047 0 3649
1: 263401 1.00 92.00 0.26 99.70 240.6 3272 0 3649
1: 263402 1.00 93.00 0.26 99.97 824.0 33353 3762 7411 *
1: 263401 1.00 94.00 0.03 100.00 4227.2 126928083 10322 17733 *
```

### Perfect matching for the tail

Let’s for now only consider some tail of length \(t\) of buckets of size \(1\).
We already saw two techniques to assign them: *sequentially* and *greedily*
(giving the minimal \(k_i\)).

We also already discussed the possibility of *cuckoo hashing*, where inserting
an element can ’evict’ already placed elements. This seems to need a bit more
bookkeeping though, as for each placed element we must track the bucket it
belongs to.

Another more complex way would be a *perfect matching*:
Compute \(k_i(f)\) for each bucket \(i\) and each free slot \(f\) and find the
*weighted bipartite matching* assigning buckets to free slots that minimizes e.g.
\(\sum \log_2 k_i(f)\) (the total number of bits) or \(\max k_i(f)\) (the maximal
number of bits).

This last case is interesting: we could fix a maximal \(K=2^b\) up front and then find all possible \(k_i < K\) for all buckets by simply trying them all. Then we find a perfect matching (if one exists) and use these \(k_i\) for the final buckets.

This raises the question:

Let \(G\) be a bipartite graph on \(|A|+|B|=n+n\) vertices. Let each vertex in \(A\) have degree \(d\) edges towards random vertices in \(B\). What is the smallest \(d\) for which a perfect matching exists with high probability.

If we can answer this, than we can find the \(d\) smallest \(k_i\) solutions for the last \(t\) buckets and solve the perfect matching problem. In this case all final \(k_i\) would be \(O(d\cdot n/t)\), for \(O(t \log_2(dn/t))\) bits total.

A closely related question is answered on stackoverflow which implies that we should take \(d = c + \ln n\).

I implemented this and it seems to work quite well! Take for example the tail of \(n=10^8\), \(c=7\), \(\alpha=1\) without this optimization:

```
n: 100000000
m: 26340126
sz cnt bucket% cuml% elem% cuml% avg ki max ki new ki # ki
...
1: 263401 1.00 79.00 0.26 96.28 25.0 390 0 3680
...
1: 263401 1.00 91.00 0.26 99.44 146.2 1626 0 3680
1: 263401 1.00 92.00 0.26 99.70 241.8 3699 3 3683
1: 263402 1.00 93.00 0.26 99.97 835.7 21967 3825 7508
1: 263401 1.00 94.00 0.03 100.00 4135.5 52284963 10187 17695
: 26340126 100.00 100.00 100.00 100.00 137.3 52284963 17695 17695
```

and now the same where we run perfect matching on the last \(500000\) buckets/slots where we find all \(k_i\) below \(2^{12} = 4096\):

```
n: 100000000
m: 26340126
sz cnt bucket% cuml% elem% cuml% avg ki max ki new ki # ki
...
1: 263401 1.00 79.00 0.26 96.28 25.0 418 0 3662
...
1: 263401 1.00 91.00 0.26 99.44 146.0 1819 0 3662
1: 263401 1.00 92.00 0.26 99.70 299.7 4094 474 4136
1: 263402 1.00 93.00 0.26 99.97 753.4 4095 62 4198
1: 263401 1.00 94.00 0.03 100.00 193.9 4095 0 4198
: 26340126 100.00 100.00 100.00 100.00 97.7 6655 4198 4198
```

Observe that:

- The maximum \(k_i\) overall is now \(6655\) instead of \(50M\) (the value for one of the last buckets of size \(2\)), requiring half the bits to store.
- The number of distinct \(k_i\) has gone down from \(17695\) to \(4198\), requiring one less bit to store.
- Both the maximum \(k_i\) and the number of distinct \(k_i\) is between \(2^{12}=4096\) and \(2^{13}=8192\). I.e. dictionary compression on the full array does not provide any additional benefits anymore! (I’m not yet sure whether separate front-back encoding would provide benefits.)

For reference, here are the corresponding results when using \(\alpha = 0.99\) without tail optimizations.

```
sz cnt bucket% cuml% elem% cuml% avg ki max ki new ki # ki
...
1: 265917 1.00 91.00 0.27 99.48 60.5 804 0 2595
1: 265917 1.00 92.00 0.27 99.74 72.2 887 0 2595
1: 265917 1.00 93.00 0.26 100.00 86.4 1342 0 2595
: 26591682 100.00 100.00 100.00 100.00 69.0 4431 2595 2595
```

- Both the maximum \(k_i\) and number of distinct \(k_i\) are a bit smaller here, mostly because the buckets of size \(2\) are much easier to place.

**Conclusion:** This works extremely well to handle the last size-\(1\) buckets.
It seems to be a bit slower for now, but I think this can be fixed. (The
matching itself is super fast – it’s rather that we now need to search up to a
threshold for *all* buckets, rather than only for the last few buckets.)

**TODO:** See if we can make something similar work for the tail of buckets of
size \(2\). There we don’t need perfect matching though, and simpler techniques
such as *peeling* may suffice (as used in XOR-filters (Graf and Lemire 2020)). This
is also somewhat similar to cuckoo hashing (Fan et al. 2014),
which has a similar result but uses a slightly less structured approach. In
particular Cuckoo hashing does a greedy ‘DFS’ without backtracking for
augmenting paths, compared to matching/flow algorithms which do a normal DFS and
are able to ‘reset’ \(k_i\) back to the lowest solution after an augmenting path
through them has been found.

### Peeling for size-1 buckets

Instead of full matching, let’s try *peeling* to match the size-1 buckets: Again
we first choose a threshold \(K=2^b\) and find all suitable \(k_i < K\) for each
bucket using bruteforce. Also as before, consider the induced bipartite graph.

Suppose that there is a vertex \(u\) of degree \(1\). Then we know that the single edge
incident to it (say \(uv\)) *must* be part of the final matching, so we can greedily take it
and remove any other edges incident to \(v\).

We can repeat this process: repeatedly choose a vertex \(u\) of minimal remaining degree, choosing a random edge \(uv\) incident to it, add \(uv\) to the matching, and remove any other edges incident to \(u\) and \(v\).

I was hoping that this would result in a perfect matching with high probability, but it seems that in practice this procedure typically leaves behind a few percent of vertices without any remaining edges. One example situation is where this ‘deadlock’ could occur is when we have two fully connected subgraphs \(A=K_{x,x}\) and \(B=K_{y,y}\), and additionally a few extra edges from \(a \in A\) to \(B\). In this case, we must never take these ‘cross’ edges, but the algorithm described above doesn’t have such non-local information and could create an imbalance.

Not all hope is lost though:

- This approach may actually work well for buckets of size \(2\) and larger, where there are spare empty slots and we are not looking for a perfect matching.
- The partial matching found by peeling could be input to Dinic’ full matching algorithm, possibly reducing the number of iterations needed and providing overall speedup. But on the other hand, Dinic also matches all but a few percent of elements in its first round, so it’s not clear if it would help much.

### Greedy peeling 1: Assigning from hard to easy

A simpler alternative to peeling is the following. As before choose a number of bits \(b\) and threshold \(K = 2^b\), and find all solutions \(k_i < K\) for all buckets of size \(2\). Then sort the buckets by increasing number of solutions. This leaves the buckets with more freedom of choice for last, hopefully giving them a larger probability of finding a match.

I had high hopes for this, but it seems to not work great. As expected it reduces the maximum \(k_i\) required, but (for \(n=10^8\)) it only goes down from \(3000\) to \(2500\), which is not even enough to save a bit in a direct representation.

### TODO Peeling and cuckoo hashing for larger buckets.

We can processing the buckets in various ways:

- By increasing number of total edges, as in previous section.
- By increasing number of remaining edges.
- By increasing number of remaining free slots covered by all edges.

This last method sounds more complicated than the second one, but is simpler in practice: Whenever we fill a slot we simply decrease all buckets mapping to it by \(1\), instead of having to check first whether those edges are still alive.

### Sunday morning ideas

#### Dinic

Possible speedups for Dinic algorithm:

- It’s never needed to visit vertices with level \(\geq level(t)\) (other than
\(t\) itself). Do such vertices exist? How many of them? Would it speed things
up to skip over them?
**Answer:**Yes, they exist. But when matching \(1M\) buckets, there are only around \(100\) at level \(>level(t)\). Sometimes there are quite some (up to \(n/2\)) at level \(=level(t)\) though, so we could skip those.So during level computation, we can stop as soon as we determine the level of \(t\), and leave all other levels at \(-1\), so they will never be processed during augmentation, which only visits vertices \(v\) with \(level(v) > level(u)\).

This saves up to \(25\%\) of time when matching \(1M\) buckets.

- For vertices in the left component at \(level(u) = level(t)-2\), instead of
doing a DFS, we could first check if any of the outgoing \(v\) has a free edge
to \(t\), by storing all taken/free \(v\) in a bitmask.
**Done:**Instead I implemented a check that for vertices in level \(level(t)-1\), we only check the edge to \(t\). This saves around \(30\%\) time of the`augment`

step, and around \(15\%\) time of the entire Dinic algorithm.

Together these two optimizations bring the runtime on \(1M\) elements down from around \(3.5\) seconds to \(2.5\) seconds: \(30\%\) speedup.

#### New iterative greedy assignment idea

Possible improvement to *Greedy peeling 1*:

Run the greedy linear-pass algorithm once.

- Then fix all the remaining unmatched ones to their first outgoing edge.
- Rerun the algorithm starting with these ‘hard cases’ already fixed.
- Repeat.

Will this converge? How many iterations? Is it better to fix hard cases on the fly, evicting previous ones as soon as the hard one is found?

**Answer:** Around 10-20 iterations are typically needed. These iterations are
relatively fast, but the iterating to find *all* solutions below some threshold
is quite slow.

We can try different sorting orders in which the buckets are processed:

- no sorting,
- by number of solutions \(k_i\) below the threshold (going from few to many),
- by smallest solution \(k_i\) (going from high to low).

Experiments show that sorting by number of solutions is the best, but this requires precomputing all solutions below the threshold which is quite costly in practice. Sorting by smallest solution is probably a better idea in the end, since this is something that needs to be computed for each bucket anyway.

#### Cuckoo hashing, again

The reason that this keeps coming up is because I think it will work very well,
but I don’t yet see *exactly* how to implement this in a sufficiently nice/clean
way. So I procrastinate by first implementing simpler (to implement) ideas.

Basically it’s the same as the iterative greedy assignment idea above, but displacing others on the fly, doing a random walk DFS until success.

- TODO: Do we choose outgoing edges at random? sequentially (wrapping around)? Always starting at the start of the list (leading to smaller \(k_i\))?
- Prefer displacing edges that only include one occupied slot. (Or if possible better, never take an edge with \(2\) (or more) slots already taken.)

Some statistics for bucket size \(2\) (\(n=10^8\)):

- \(\sim20\%\) of the buckets has size \(2\), and \(\sim20\%\) of the buckets has size \(1\).
- Bucket size \(2\) covers elements from percentile \(85\%\) to \(95\%\).
- The ’early’ buckets have success rate one in \(1/0.15^2 \sim 45\).
- The ’late’ buckets have success rate one in \(1/0.05^2 \sim 400\).
- So only \(1\) in \(9\) that works initially still works at the end.
- This is problematic, because even though in expectation most will find a solution with \(20\) candidates, some of the \(f = O(10^6)\) buckets will be unlucky and will need more like \(O(9\cdot \ln f)\) candidates, which corresponds to quite large \(k_i\) actually (\(45\cdot 9\cdot \ln f \sim 4000\)).
- Basically, all these cuckoo/displacement strategies are really mostly needed for the few buckets that lie a factor \(\ln f\) (or say more than \(3\times\)) above the average, so if we can fix those the rest should be relatively OK.
- Coming back to cuckoo hashing: Each candidate \(k_i\) maps the two elements to the \(15\%\) remaining slots. After most size-\(2\) buckets have been assigned, \(5\%\) of slots still remains empty. That means that every candidate edge consisting of \(2\) slots in this \(15\%\) has quite a high probability of including at least one empty slot: \(1-(2/3)^2 = 55\%\). So basically cuckoo hashing has a very high probability of finding edges with only one slot already filled, and \(1/9\) of the edges has both slots empty. So I think the DFS only needs to see roughly \(10\) edges in total before success, which is very little actually!

### Cuckoo hashing / displacing, *for real now*

Ok so I implemented a quick first version and, lo and behold: **It works great!**.

So for completeness the algorithm I run now:

- Put all buckets of fixed size (from \(2\) to \(5\) or so, but not \(1\) (TODO: maybe \(1\) works as well?)) on a stack.
- While the stack is not empty: find a \(k_i\) for the top of the stack that maps to all empty slots.
- If not possible, find a \(k_i\) that minimizes the number of collisions.
*Drop*the colliding buckets from the*taken*slot array and push those buckets on the stack. - Repeat until the stack is empty.

One tricky point is how to tie-break the selection of the minimal bucket. Always choosing the smallest \(k_i\) with the minimal number of collisions can lead to cycles where two buckets keep displacing each other. What seems to work: Choose the first \(k_i\) after any existing \(k_i\) (modulo \(2^b\)).

For \(n=10^7\):

- the \(450k\) buckets of size \(3\) needs \(4k\) displacements, around \(1\%\);
- the \(600k\) buckets of size \(2\) need \(20k\) displacements, around \(3\%\).

For now the bottleneck is twofold:

Iterating over all buckets to find all solutions (

*edges*) up-front is slower than necessary. In practice most solutions \(k_i\) are quite small and there is no need to iterate all the way to the upper bound up-front.The bit width \(b\) has to be large enough to ensure each bucket has

*at least one*solution. When the expected value of the smallest solution is \(\overline{k}\), the maximum value we need is roughly \(\overline{k} \ln b\) when there are \(b\) buckets of the current size. This is unfortunate because in practice \(b\) is of order \(1M\), so we need to go up to \(20 \overline{k}\), whereas maybe \(2\overline{k}\) would be enough for most of them.TODO: We should fix this by allowing some buckets to displace larger buckets. Probably displacing buckets one larger than itself is good enough.

### Displacing globally

A drawback with the implementation above is that it first finds all edges, and
only then starts the algorithm. Much more efficient is to do things right from
the start. We can make a single *global displacement* algorithm:

- Set a threshold \(k_{max} = 2^b\).
- Determine buckets and order by decreasing size.
- For each bucket, try to find the smallest \(k_i<2^b\) that places it without collisions. If so, move to the next bucket.
- Otherwise, find the \(k_i<2^b\) that minimizes the number of collisions, and (more importantly) minimizes the size of the colliding buckets. Then evict those other buckets (remove their \(k_i\) and push them on a stack), and set the \(k_i\) for the current bucket.
- While the stack is non-empty, repeat steps 2 and 3 for the top of the stack.

One drawback is that this needs some more memory since for each slot it needs to store the bucket it came from, and it’s also a bit slow since (in case collision-free matching is not possible) we need to lookup the size of each bucket in another random-memory-access.

#### Running it

This seems to work super well! My implementation isn’t as streamlined yet as the original that just finds the smallest \(k_i\) per bucket (which has a simpler and thus faster inner loop), but nevertheless gets the job done. Only around the last buckets of size \(2\) do displacements go above \(1\%\) of buckets. The main drawback is that it somewhat explodes at the last \(100\) to \(1000\) size-\(1\) buckets, where it has a harder time finding chains that lead to a free slot. Probably falling back to deterministic matching using Dinic is better here.

The average \(k_i\) per bucket-size using this method seems similar to the
original algorithm, but the main benefit is that even for \(n=10^8\) we can now
find solutions that with all \(k_i < 2^8\), meaning that we can store the pilots
simply as a vector of `u8`

instead of needing a compact representation and/or
dictionary encoding. In fact I’m now thinking it’s probably simpler to tune \(c\)
(i.e. the number of buckets) such that we can easily find a solution where all
pilots fit in \(8\) bits, rather than having a variable packing size.

Some further observations:

- Only rarely is more than one collision needed.
- TODO: In this case can we backtrack instead of displacing two buckets?

- When filling buckets of size \(s\), at worst buckets of size \(s+1\) are evicted, never larger ones.

#### Limitations

As already mentioned, one problem with this is that matching the last few buckets is hard. Basically, for the last bucket we have to find a size-\(1\) bucket for which one of its \(256\) possible pilots maps it to the remaining free slot, which only happens for \(0.5\%\) of buckets (which is large for \(n=10^8\)).

Alternatively we can use Dinic for the size \(1\) buckets again, but this also has a limitation as mentioned before, that there will always be some size-\(1\) buckets that do not map to any free slot. Thus, it seems we need some balance where:

- Some size-\(1\) buckets displace already-placed size-\(2\) buckets.
- For the remaining size-\(1\) buckets we find some/all solutions, build the graph, and run Dinic.

One other tricky problem is that not only do we have to ensure that each bucket maps to a free slot, but also that each free slot has some bucket mapping to it. I’m not quite sure yet how exactly to solve this, since the probability of such a position existing is also quite large. One option is to track for each slot one or two \((bucket, pilot)\) pairs that maps to it, and then use that in case no other remaining values map there. But this might then trigger a chain of displacements again in the other (‘pulling’) direction. Although that’s somewhat inevitable in the \(\alpha = 1\) setting.

### Cleanup and revisiting defaults

So far I have been using `Vec<u64>`

as backing storage for pilots, `MurmurHash`

to hash keys, and \(16\) iteration lookahead for prefetching.

Now that we aim to have \(k_i < 2^8\), let’s switch that to a simple

`Vec<u8>`

.I added support for

`FxHash`

, which processes a`u64`

at a time using a simple rotate-xor-multiply:`state=(state.rotate(5) ^ x)*constant`

.While benchmarking this, I realized that the reason

`NoHash`

(which is a toy implementation that does exactly nothing) was so much faster than other hashes is not only because it spends less instructions hashing, but also because my random keys were sorted (to check for duplicates), and this lead to linear access patterns rather than random access patterns.It now seems that $32$-iteration lookahead is around \(10\%\) faster for large \(n\).

Some results for \(n=10^8\) (\(c=7\), \(\alpha=1\)):

```
Vec<u64>
murmur (1): 20.3 (16): 7.6 (32): 7.2 (64): 7.3
fxhash (1): 16.3 (16): 7.6 (32): 7.2 (64): 7.2
nohash (1): 15.4 (16): 7.6 (32): 7.2 (64): 7.2
Vec<u8>
murmur (1): 17.4 (16): 5.8 (32): 5.1 (64): 5.2
fxhash (1): 13.8 (16): 5.5 (32): 4.9 (64): 5.6
nohash (1): 13.0 (16): 5.5 (32): 4.9 (64): 4.8
sucds::CompactVector
murmur (1): 20.9 (16): 27.4 (32): 27.4 (64): 27.5
fxhash (1): 62.9 (16): 17.2 (32): 17.6 (64): 17.6
nohash (1): 60.5 (16): 16.8 (32): 17.4 (64): 17.3
sux::CompactArray
murmur (1): 21.5 (16): 20.9 (32): 21.0 (64): 20.9
fxhash (1): 16.7 (16): 16.9 (32): 16.1 (64): 17.8
nohash (1): 15.6 (16): 15.3 (32): 15.5 (64): 16.1
```

(1), (16/32/64) indicate no prefetching resp. prefetching 16/32/64 iterations ahead.

Observations:

- Switching from
`Vec<u64>`

to`Vec<u8>`

saves around \(2ns\) per query. - Going from
`Murmur`

to`FxHash`

typically saves a bit. - Going further to
`NoHash`

doesn’t save more. - At \(32\) lookahead, the runtime does not depend much on hash function, so probably it’s memory bound rather than cpu bound.
`CompactVector`

and`CompactArray`

do not support prefetching and hence aren’t nearly as good.

### TODO Sum instead of xor?

PTHash replaces the summation in \((h(x) + d)\mm n\) by a xor. It’s not exactly clear to me why this is beneficial. I the xor has slightly better mixing, especially when the \(\mm n\) is implemented as a reduction of the high bits, and \(h(k_i) = C\cdot k_i\) has a linear structure as well.

### Revisiting \(\alpha < 1\)

After playing some more with \(\alpha<1\) values, it seems that we can do this with only very minor loss in query performance. Some experiments for \(n=10^8\), \(c=10\), with prefetching \(32\) iterations ahead:

\(\alpha\) | `index` | `index_remap` |
---|---|---|

\(1.0\) | 5.7 | 5.7 |

\(0.999\) | 5.6 | 5.8 |

\(0.995\) | 5.7 | 5.8 |

\(0.99\) | 5.7 | 5.9 |

\(0.95\) | 5.7 | 7.1 |

\(0.9\) | 5.7 | 8.7 |

As can be seen, `index`

is independent of \(\alpha\) as expected, since it doesn’t
do any extra work for out-of-range elements. `index_remap`

becomes slow once too
many elements fall out of bounds around \(\alpha = 0.95\). But we can safely
choose \(\alpha=0.99\) without significant loss of query performance.

### Elias-Fano for the remap-dictionary

I added support to use `sucds::EliasFano`

to store the dictionary elements. This
saves quite some bits, since (for \(\alpha=0.99\)) it only needs \(2+\log_2(100) =
9\) bits per element as opposed to the default which encodes each index using a
`u32`

.

So let’s see the query performance.

\(\alpha\) | `index` | `index_remap` |
---|---|---|

\(1.0\) | 5.7 | 6.6 |

\(0.999\) | 5.8 | 6.7 |

\(0.995\) | 5.8 | 7.2 |

\(0.99\) | 5.7 | 7.6 |

\(0.95\) | 5.7 | 11.1 |

\(0.9\) | 5.7 | 16.1 |

This comes out quite a bit slower than using a direct `Vec<u32>`

encoding. On
the other hand, it does save space: for \(\alpha=0.99\), the vector encoding uses
\(0.01 \cdot 32 = 0.32\) bits per element overall, while EliasFano only needs
\(0.01\cdot 9 = 0.09\) bits per element overall, saving \(0.2\) bits/element.

Sadly the EliasFano hot-loop has slightly worse codegen even in the non-EF part,
introducing a branch where previously non-branching assembly was generated for
the `>>small`

shift in determining the right bucket.

### Global iterative prioritizing

Instead of displacing using DFS, another idea I had that may work for \(\alpha<1\):

- Iterate over all buckets by decreasing size; place those that can be placed without collisions.
- Collect the rest in a vector of
`priority_buckets`

. - Repeat, but place the priority buckets before iterating over the others.

This worked well with only buckets of a fixed size, but seems to not work as
well globally, even when we try to assign the priority buckets to have minimal
collisions with others. Instead of converging to a stable solution, each
iteration about the same amount (\(0.3\%\) for \(n=10^8\), \(c=10\), \(\alpha=0.99\)) of
buckets is added to the priority list. It seems that the changes made by the few
extra priority buckets propagate a lot, changing the optimal pilot for \(40\%\) of
buckets. This basically means the `taken`

slots are completely newly random each
iteration, preventing converging to a stable solution where the number of added
priority elements goes down each round.

Basically, just DFSing the displacement using the earlier strategy seems much better.

### Cleanup: removing peeling and suboptimal displacing code

The code base is getting messy. I’m going to drop all the peeling experiments, and also the ‘displacement per bucket size’ will go. Basically all methods that require computing edges up front are annoying in this regard. Also matching seems not-so-useful anymore now that we can settle on \(\alpha=0.99\) without impacting the query throughput. It may still be useful though in cases where we want to minimize memory usage, but even then we can go with \(\alpha=0.999\) which would take longer to construct but should still be fine, and avoids the complexity of having to balance between perfect matching and displacing.

So we’re dropping all these algorithms which required all edges to be precomputed up-front:

- Peeling algorithms
- Didn’t really work in the first place

- Dinic matching
- Works well for size-\(1\) buckets, but needs a large threshold to guarantee every bucket and slot have at least one edge. For larger bucket sizes the balancing act is annoying.

- Per-bucket-size displacing:
- The greedy non-iterative version that simply runs once and only fills \(95\%\) of buckets.
- The greedy iterative version works well but only does one bucket size at a time.
- The DFS version that works per bucket size isn’t quite good enough because sometimes we must displace to larger buckets.
- The global iteration method of the previous subsection also goes – it just doesn’t work well.

- The code to invert hashes \(s = C \times k_i\) is also dropped. Inverting returns values that are \(O(n)\) while we are now only interested in values at most \(2^8\). I’m kinda sad about this because the math there was really cool.
- The ‘fast small buckets’ optimization is also dropped for now, to be reintroduced later, to clean up code a bit.

All together, this drops around 1900 lines of code :)

### Some speedups to the displacement algorithm

- Use a fast-path for detecting placement without collisions, which is sufficient for \(99\%\) of buckets.
- Use a
`taken`

bitvector for the fast-path, instead of a full slot-to-bucket-index mapping . - Attempt at prefetching of positions in the
`taken`

vector for upcoming \(k_i\). Doubtful whether it actually improves construction time though.

### TODO Runtime analysis of displacement algorithm

It would be nice to be able to estimate the runtime complexity of the displacement algorithm in terms of \(n\), \(m\) (/\(c\)), \(\alpha\), and the maximum allowed pilot value.

### TODO Optimal prefetching strategy

For each class of buckets of the same size we could device some optimal prefetch strategy: It’s wasteful to prefetch the slot corresponding to the last element of each bucket when the probability of failing in earlier elements in the bucket is high. E.g. for size \(2\) buckets only roughly \(1\) in \(6\) of the second elements needs to be queried. Not prefetching those may give a nice speedup.

### Are we close to the memory bandwidth?

I have DDR4 @ 3200MHz, which has a theoretical bandwidth limit of \(25GB/s\).

For \(n=10^9\) (`1G`

),
the taken vector has size `1G*1bit=125MB`

and does not fit in my `12MB`

L3
cache. Thus let’s say for simplicity that all lookups in it are cache misses,
and thus that each lookup reads a single $64$byte cacheline from main memory.

But how many lookups do we need?

- there are
`0.3G`

buckets - the average \(k_i\) is \(40\)
- buckets have around \(3\) elements on average
- total number of lookups:
`0.3G * 40 * 3 ~ 40G`

- But that’s an upper bound, since in practice we do an early break after the first taken slot.

Simply counting the number of lookups gives me `20G`

, which sounds reasonable:
it’s lower and the same order of magnitude.

So how long does it take to do `20G`

lookups? \(20G * 64byte = 1.28TB\), which at
\(25GB/s\) is \(51s\). In practice the construction takes longer at \(350s\), so this
indicates we’re not yet memory-bandwidth bound, assuming we could indeed saturate the full
\(25GB/s\), but it’s not too far from it.

### More sorting algorithm resources

I found some more implementations and posts about sorting algorithms, so listing them here for future reference:

- Robinhoodsort
- https://github.com/mlochbaum/rhsort
Sorts uniform random data by placing values at their interpolated location in a \(2n\) size array.

- Ska_sort
- https://github.com/skarupke/ska_sort
In-place radix sort with CPU-pipelining-friendly swapping of values to their buckets. See posts investigating radix sort, I wrote a faster sorting algorithm, and faster sorting algorithm part 2.

- Wolfsort
- https://github.com/scandum/wolfsort
Didn’t really look into this.

- Voracious sort
- https://github.com/lakwet/voracious_sort
Rust library with various radix sort implementations, with the default voracious sort being an improved ska_sort. I tried all implementations in here for sorting random \(10^8\) random

`64=bit hashes, and none beats =radsort::sort_by_key(|h| h>>32)`

followed by manually sorting equal buckets. - More blogs
- http://stereopsis.com/radix.html
- pqdsort
- Pattern defeating quicksort, https://github.com/orlp/pdqsort, by Orson Peters
- glidesort
- https://github.com/orlp/glidesort, by Orson Peters

#### And some resources on partitioning

- Branchles partitioning
- https://github.com/Voultapher/sort-research-rs/blob/main/writeup/lomcyc_partition/text.md by lukas Bergdoll
- Blog on Lomoto partitioning
- https://orlp.net/blog/branchless-lomuto-partitioning/ by Orson Peters

### Partitioning to reduce memory latency

As discussed just before, memory latency and bandwidth are a bottleneck in
construction times when \(n=10^9\). For a typical bucket with \(2\) elements and
\(k_i=50\), we must lookup \(100\) positions in `taken`

, corresponding to fetching \(100\)
cachelines from main memory. It would be much better if these cachelines were
already available in a low level cache. So how about if we already roughly know
where the elements of this bucket are supposed to go? Maybe a window of say \(10\)
cachelines (i.e. \(10\cdot 64\cdot 8 = 5120\) bits) should be wide enough to find
an empty slot with high probability. So we could say that bucket \(i\) of \(m\) maps
to a window/slice of some length starting at \(i/m \cdot s\) (where
\(s=n/\alpha\) is the total number of slots). But actually we can simplify this:

- We can divide the \(m\) buckets into \(P\) partitions, and then the first \(1/P\) of buckets map to the first \(1/P\) of slots.

So there are \(P\) partitions with \(m_p = round(m/P)\) buckets each, and each partition corresponds to \(s_p = round(n/\alpha/P)\) slots.

So then, the position for an element \(x\) with bucket \(i\) becomes \[p(x, i) = \lfloor i/m_p \rfloor \cdot s_p + ((h(x) \lxor h(k_i)) \mm s_p.\]

As usual, we can remap values larger than \(n\) to the free slots in the rest of
the array. Note that this is slightly different from the PTHash-HEM partitioning in
Pibiri and Trani (2021a) since that builds a number of independent MPHFs, while
this method builds a single one and only constructs a single `free`

array.

One point needs some careful consideration: this method does not guarantee that each partition has the same number \(n_p = round(n/\alpha)\) of elements, so the total load factor per partition will vary. This is not fatal though: The expected number of elements per partition is indeed \(n_p\), and since this is a binomial distribution for large \(n\) and small probability \(p=1/P\) the standard deviation is only \(\sqrt{n_p}\). This means that as long as we take, say, \(4\sqrt{n_p}\) additional slots per partition, the probability of a bucket exceeding the number of slots is less than \(1\) in \(16000\). (Otherwise we rehash everything with a new global seed.) Since we anyway reserve free slots with \(\alpha<1\), we can simply make some more free slots. Construction time may then be slightly concentrated in the buckets with number of elements significantly above the mean, but the memory latency savings should counter this easily.

So the new algorithm would be:

- Compute bucket for each element.
- Partition buckets into \(P\) partitions of \(m_p\) buckets each.
- Sort the buckets in each partition by decreasing size.
- (Optional) prefetch the \(s_p\) bits of the
`taken`

array corresponding to the current partition into low-level cache. - Run the usual algorithm for each partition, searching for the smallest pilot that maps each bucket to empty slots only.
- Remap position above \(n\) to the free slots.

As said before, the main benefit here is that step 5 should now run much much faster, since each lookup for a free slot can go directly to L1 or L2 cache rather than main memory. So let’s see how many slots \(s_p\) we should aim for in each partition.

- My L1 cache is \(32kb = 256kbit\), so we could choose \(n_p = 200k\) elements per partition.
- My L2 cache is \(256kb = 2048Mbit\), so we could choose \(n_p = 2M\) elements per partition.

The optimal size is probably determined using experimentation. The benefit of smaller partitions is that they fit in a faster cache, while \(10\) times larger partitions have a relatively \(\sqrt{10} \approx 3\times\) smaller standard deviation, hence needing less buffer slots.

As an example let’s take \(n=10^9\) with \(n_p = 2M\) elements per bucket. This creates \(500\) buckets, and needs \(4\sqrt{n_p} \approx 5k\) buffer slots. \(5k/2M = 0.25\%\), so actually choosing \(\alpha=0.99\) (which gives \(1\%\) buffer slots) already gives plenty of guarantee that buckets are not too full and leaves at least \(0.75\%\) of free slots at the end to speed up construction.

Note that this should also be **very** good for parallelization, since each thread
can fill it’s own part of the `taken`

vector in L1 or L2 without interfering
with the other threads at all.

**Implementation efficiency**

- \(s_p\) should be a multiple of the \(512\) bit cache-line size to avoid sharing cachelines between threads. The entire bitvector should be cacheline-aligned as well. But this is really nitpicking.
- We should either use fast-divide for the bucket partitioning, or we can simply ensure that there’s \(2^k\) buckets per partition so we can just do this via a bitshift.

**Oh some problem?** Ughhh, so we manually create imbalanced bucket sizes, and the
way we do this is by putting \(60\%\) of the elements in the first \(30\%\) of
buckets. That means that if we partition buckets simply by id, the first buckets
will be much larger than the last buckets…. And also, that means there will be
very few small buckets to fill gaps among those large buckets…
So it sounds like we want to either distribute those large buckets equally over
the range, or partition the buckets e.g. via their id modulo \(P\).

**A fix?** So instead of making the first \(30\%\) of buckets large, we have to
distribute the large buckets roughly equally over the partitions.
Some ideas for this:

Make every third bucket large is as follows:

Remap large buckets from

`0..m/3`

to`0..m`

by taking \(i \mapsto 3i-1\), filling positions \(2\mm 3\).Remap small buckets from

`0..2/m3`

t0`0..m`

by taking \(i\mapsto \lfloor 3i/2 \rfloor\). Filling positions \(0\mm 3\) and \(1\mm 3\)(This is a specific instance of the more general following result: for irrational \(\alpha\) and \(\beta\) with \(1/\alpha + 1/\beta = 1\) the sets \(\{\lfloor i\alpha\rfloor: i\in \mathbb N\}\) and \(\{\lfloor i\beta\rfloor: i\in \mathbb N\}\) are disjoint and partition \(\mathbb N\).)

First compute the bucket id as \(h(x) \mm m\). Then remap elements that should go to large buckets by setting the last two bits of the bucket id to \(0\). That maps some percentage of elements to buckets \(0 \mm 4\), creating large buckets there. (We could also map them to buckets \(0\mm 3\) by doing

`i/3*3`

, but that requires more operations.)Simply do the original logic or partitioning by small or large hash value, but choose a cut-off within each partition. So for reference the bucketing function we currently use is

\begin{equation*} bucket(x) = \begin{cases} \phantom{p_2+{}}\lfloor h(x) \cdot p_2 / 2^{64}\rfloor & \text{if } h(x) < p_1 \\ p_2 + \lfloor h(x) \cdot (n-p_2) / 2^{64}\rfloor & \text{if } h(x) \geq p_1. \end{cases} \end{equation*}

If we have \(2^k\) partitions, we can mask the first \(k\) bits of \(h(x)\) and check \(mask_k(h(x)) < p_1 / 2^k\).

### Back from a break!

So I finally managed to implement partitioning and it works great! It’s super fast when putting ~200k elements per part, which exactly fits in L1.

This now also allows for parallelization, since each part can be processed independently, and since each part only needs little memory, they don’t interfere with each other at all!

I now use `FastReduce32H`

to determine the part and bucket.

One issue though: Using `FastReduce32L`

to determine the slot does not provide
sufficient entropy, since it only gives \(\lg(N/P)\) bits of entropy, whereas we
need more like \(\lg N\) bit of entropy. This is somewhat fixed by reverting to
`FastMod32L`

, but that still lacks entropy for \(N = 10^9\), where we need the
much slower `FastMod64`

. More on this later.

### Speeding up the search for pilots

Currently the loop that searches for pilots is as follows:

```
'p: for p in 0u64..kmax {
let hp = self.hash_pilot(p);
// True when the slot for hx is already taken.
let check = |hx| unsafe { *taken.get_unchecked(self.position_in_part_hp(hx, hp)) };
for hx in bucket {
if check(*hx) {
continue 'p;
}
}
if self.try_take_pilot(bucket, hp, taken) {
return Some((p, hp));
}
}
```

This is quite efficient, but has a problem: each iteration of the inner loop is
quite unpredictable: looking up random bits of `taken`

gives random bits `true`

or `false`

with some probability, so there will be lots of branch-mispredictions. Indeed,
running under `perf stat -d`

shows:

```
16,324.28 msec task-clock:u # 1.000 CPUs utilized
57,487,973,942 cycles:u # 3.522 GHz
87,101,496,658 instructions:u # 1.52 insn per cycle
10,513,721,875 branches:u # 644.054 M/sec
643,749,689 branch-misses:u # 6.12% of all branches
```

To reduce branches and increase parallelization, we can only check for success every \(4\) lookups, which has much more predictable results (i.e. failing most of the time):

```
'p: for p in 0u64..kmax {
let hp = self.hash_pilot(p);
// True when the slot for hx is already taken.
let check = |hx| unsafe { *taken.get_unchecked(self.position_in_part_hp(hx, hp)) };
// Process chunks of 4 bucket elements at a time.
let chunks = bucket.array_chunks::<4>();
for &hxs in chunks.clone() {
// Check all 4 elements of the chunk without early break.
if hxs.map(check).iter().any(|&bad| bad) {
continue 'p;
}
}
// Check remaining elements.
let mut bad = false;
for &hx in chunks.remainder() {
bad |= check(hx);
}
if bad {
continue 'p;
}
if self.try_take_pilot(bucket, hp, taken) {
return Some((p, hp));
}
}
```

The new `perf stat -d`

:

```
14,310.45 msec task-clock:u # 0.999 CPUs utilized (was 16k)
50,239,412,832 cycles:u # 3.511 GHz (was 57G)
123,285,550,459 instructions:u # 2.45 insn per cycle (was 87G; 1.5)
10,003,769,536 branches:u # 699.053 M/sec (was 10.5G)
170,547,753 branch-misses:u # 1.70% of all branches (was 650M; 6.1%)
```

Note how this roughly \(10\%\) faster (in clock and cycles).
It does use \(50\%\) more instructions, but is able to execute those **much** more
efficiently, at \(2.5\) instructions per cycle vs. \(1.5\) before. This is probably
because of the **$4$-fold reduction** in branch-misses!

The remaining branch misses seem to mostly come from the less-optimized case where a bucket requires displacing some other buckets.

`MultiplyReduce`

For \(n\) close to \(10^9\) or \(2^{32}\), we really must use all \(64\) bits of
entropy. `FastReduce`

only uses the high bits, and `FastMod`

works but is quite
slow (taking \(2\) multiplications for the $32$bit variant and more for $64$bits).

When the target modulus is a power of \(2\), \(2^d\), an alternative is what I’ll call `MultiplyReduce`

:
\[
h \mapsto \lfloor \frac {C\cdot h}{2^{64}}\rfloor \mm 2^d.
\]
This simply multiplies the hash by a random $64$bit constant and then takes the
\(d\) low bits of the high word of the result.

We can use this for the reduction to slots in each part, so that all entropy is used.

The one problem with this function (and also `MulHash`

) is that $128$bit
multiplications don’t work well with SIMD, but so far SIMD hasn’t payed off yet anyway.

### Linux hugepages?

I may or may not have this already enabled on my laptop, but if not, if may make prefetches faster by having a smaller TLB cache.

It seems that `jemalloc`

has builtin support for optional hugepages so I may try that.
See also this reddit post.

After trying a bit more (tweet, example code), it seems that a simple `vec![0u8; len]`

already uses transparent huge pages on my machine:

```
$ grep AnonHugePages /proc/meminfo
```

goes up by around `300MB`

when filling the vector of pilots, and similarly

```
$ grep thp_fault_alloc /proc/vmstat
```

increases by \(150\) at the same time, counting the total number of allocated
`2MB`

hugepages.

Either way, here is the full allocation code I tried for completeness:

```
let len = 300_000_000;
const HUGE_PAGE_SIZE_BYTES: usize = 2 * 1024 * 1024;
unsafe {
let ptr = std::alloc::alloc(
std::alloc::Layout::from_size_align(len, HUGE_PAGE_SIZE_BYTES).unwrap(),
);
libc::madvise(ptr as _, len, libc::MADV_HUGEPAGE);
Vec::from_raw_parts(ptr, len, len)
}
```

### Dropping the bucket split?

So far we have been mapping \(\beta = 0.6\) of elements to \(\gamma = 0.3\) of slots
to create two *classes* of buckets: *large* and *small* ones. This helps the
performance of the construction algorithm by putting more elements in large and
easy-to-place buckets so that in the end we have more smaller buckets and
especially more free slots remaining once we get to the last buckets of size
\(2\).

In my current implementation it’s kind of annoying to find the bucket for a
given hash:
First `FastReduce`

for the part: multiply \(h\) by \(P\), the number of parts. The
`P*h>>64`

is the part, and we use `P*h % (2^64)`

to compute the bucket, which
uses the high bits of that result to determine whether it belongs in a small or
large bucket and then sends it there via a linear transformation.

If we drop these two classes things would be much simpler:
the part is `P*h>>64`

, and similarly the bucket is `B*h>>64`

, where `B`

is the
total number of buckets over all parts.

#### Build performance

This is a bit more tricky: dropping the two classes roughly doubles the number of displacements and slows the construction by around \(10\%\). Also, the minimal \(c\) for which construction succeeds at all goes up slightly.

#### An alternative

Maybe we can come up with a ’local’ way to imbalance buckets, so that each
bucket roughly maps to position `h*B >> 64`

. Maybe we could `or`

the last bit
with another random bit, so that \(75\%\) of elements goes to \(50\%\) of buckets?

#### Query performance

for-loop | prefetching | |
---|---|---|

2 classes | 14.5 | 7.6 |

1 class | 10.3 | 7.5 |

As you can see, when we already do prefetching the added simplicity doesn’t affect performance much, but without it, the simpler and shorter code helps considerably.

### Query memory bandwidth

Most likely, the prefetching variant is already saturating the memory bandwidth. Each cacheline is $64$bytes, which comes down to \(64B / 7.5ns = 8.5GB/s\). That’s a third of the \(25GB/s\) peak bandwidth for my RAM, but I’m not sure whether the peak is achievable at all for random-access reads.

I did find this quora answer that talks about *Line-Fill Buffer*. To quote:

If you test performance from a single thread on a machine that’s mostly idle (ie, you’re not competing with other threads for access to the memory bus), then you’ll find that on many chips, you can perform roughly twice as many sequential reads per second as random reads. The explanation for this is that the CPU has clever circuitry that watches your memory access pattern and tries to predict which cache-lines your code is going to want next. Why does this prefetching help? Well, this gets a bit deep. My current understanding is that in the random access case, your parallelism is limited by that fact that each core can only have a limited number of L1 misses outstanding at one time. Intel calls this the “Line-Fill Buffer” and it has 10 slots on recent chip models, and there are similar mechanisms on AMD and Arm. DRAM physics dictate that reads have ~60ns of latency (and this barely changes as the tech evolves), which is ~400 clock cycles, so that would rate limit throughput of random parallel reads to at most 1 per ~40 clock cycles. Versus when the hardware prefetch mechanisms kick in, they don’t consume line-fill slots and can effectively perform more reads in parallel. Sadly, if you issue software prefetch instructions, many CPUs treat that as needing a line-fill entry, so it’s only reads triggered by hardware-prefetching that are able to achieve greater parallelism than the size of your line-fill buffer.

So with this in mind, maybe the next question is whether we can get the last factor \(3\) improvement from \(8GB/s\) to \(25GB/s\) by using multithreading, since each CPU has it’s own copy of these Line-Fill Buffers between L2 and L1.

So I tried multithreading, but it seems not to be faster at all! It’s usually slightly slower. Weirdly though, the threads seem to do the work sequentially, with the $i$th of \(k\) threads finishing roughly after \(i/k\) of the total time. Some threads just seem to be busy waiting for memory that never comes until other threads are done.

But then does that mean that my memory bandwidth is actually capped at \(8GB/s\)? That would be disappointing.

Maybe another solution could be to store two copies of the pilot array, one on each of the two \(32GB\) memory cards I have? I don’t really know whether the main memory or the processor is the bottleneck, so no idea if this would help.

#### Some more experiments

I wrote some experiments to measure memory bandwidth where I sum the first byte of each cacheline of a \(4GB\) array. Results:

```
stride 1
strided 1.78s 22.513GB/s
prefetch 1.77s 22.553GB/s
stride 2
strided 2.73s 14.678GB/s
prefetch 2.79s 14.357GB/s
stride 4
strided 3.61s 11.079GB/s
prefetch 3.48s 11.501GB/s
stride 8
strided 3.85s 10.403GB/s
prefetch 3.76s 10.649GB/s
stride 16
strided 4.05s 9.876GB/s
prefetch 4.00s 9.995GB/s
stride 32
strided 4.53s 8.828GB/s
prefetch 4.50s 8.893GB/s
stride 64
strided 3.91s 10.228GB/s
prefetch 3.89s 10.279GB/s
stride 128
strided 4.01s 9.976GB/s
prefetch 4.02s 9.957GB/s
```

so basically it caps out around \(9GB/s\) even for very predictable access patterns. Prefetching also doesn’t help.

So the \(8.5GB/s\) we currently get with the queries is already super good and basically saturates the full memory bandwidth!

#### Multithreading benchmark

So can we get to the full \(25GB/s\) using multithreading? It seems that the answer is yes! Running multiple threads in parallel each with a different (coprime) stride, I get the following throughput per thread and in total:

```
Threads: 1 3.85s 10.396GB/s
Threads: 2 4.35s 18.393GB/s
Threads: 3 5.03s 23.845GB/s
Threads: 4 6.15s 26.035GB/s
Threads: 5 7.52s 26.578GB/s
Threads: 6 8.86s 27.074GB/s
```

As you can see, this converges to just above \(25GB/s\). Probably it’s a bit higher because some reads are cached in L3, and some reads are shared between threads.

If we give each thread it’s own copy of the \(4GB\) array, this changes to:

```
Threads: 1 4.55s 8.800GB/s
Threads: 2 5.37s 14.900GB/s
Threads: 3 6.85s 17.512GB/s
Threads: 4 8.51s 18.791GB/s
Threads: 5 10.53s 18.985GB/s
Threads: 6 12.15s 19.756GB/s
```

This is quite a bit lower!

Changing to a \(7GB\) array may be interesting, since \(6\cdot 7GB = 42GB\) must be spread over two \(32GB\) ram sticks. (\(8GB\) crashed because I don’t have \(48GB\) free currently.) This gives:

```
Threads: 1 7.76s 9.016GB/s
Threads: 2 9.54s 14.679GB/s
Threads: 3 12.20s 17.211GB/s
Threads: 4 14.98s 18.686GB/s
Threads: 5 18.15s 19.282GB/s
Threads: 6 22.93s 18.313GB/s
```

I.e., very similar to results above. So probably we won’t be able to gain speedup by distributing reads over the two ram sticks.

#### Multithreading queries: satisfaction at last

I had a silly bug at first, but with that fixed, here’s the inverse throughput when using up to \(6\) threads in parallel, each prefetching \(64\) iterations ahead:

threads | ns/query | memory throughput (GB/s) |
---|---|---|

1 | 7.59 | 8.4 |

2 | 4.17 | 15.3 |

3 | 3.09 | 20.7 |

4 | 2.81 | 22.8 |

5 | 2.69 | 23.8 |

6 | 2.76 | 23.2 |

**Satisfaction**, at last! \(2.69ns/query\) means a memory throughput of \(64B/2.69ns =
23.79GB/s\), quite close to the full memory bandwidth.

Also, each thread is also reading the keys to be queried from main memory. Those
are efficiently packed so each \(8B\) key is only $1/8$th of a \(64B\) cacheline and
hence only requires $1/8$th of a read. But adding those in we get \((1+1/8)
\cdot 23.79 GB/s = 26.76GB/s\), quite a bit **above** the full memory bandwidth. I
suppose this is because some queries are already in L3 cache, sparing the main
memory bus. Let’s see more precisely:
The memory used for all pilots is \(2.43bit/elem\cdot 10^9 elem = 2.43\cdot 10^9
bit = 304MB\). My L3 cache is \(12MB\), so each query hits the cache with
probability \(12MB/304MB=4\%\), getting to
$$
\frac 18 ⋅ \frac{64B}{2.69ns}

- (1-0.04) ⋅\frac{64B}{2.69ns} = 25.8GB/s

$$ This is still slightly above the \(25GB/s\) max throughput my ram should have, but I’ll call it done here. We’re within a few percent of the expected maximum performance, which is as good as we can get.

Maybe the last bit is L1 and L2 caches containing a few more elements than L3?

### Packing difference from expected position

Since the values we pack are a random subset of \([0, n]\), it may be possible to encode them efficiently by only storing the deviation from their expected value. I.e. we store \(r=(1/\alpha - 1)n\) values where position \(i\) stores the difference with \(i/r \cdot n\).

Result: for \(n=10^8\) with \(\alpha = 0.99\), the largest difference between actual and expected position is around \(500\ 000\). I.e., this would still need at least \(20\) compared to the \(20\) we started with.

#### Local packing ideas

Instead we could store a fixed offset \(c\) every \(k\) positions, and approximate the
following \(k\) values as \(c + i/r\cdot n\), where \(n/r\) is the expected difference
between adjacent remap targets (i.e. the distance between adjacent empty slots
in the `taken`

array).

For example, we could store a `u32`

offset every \(30\) positions, one per
cacheline, totalling to \(16 + 32/30 \sim 17 bits\) per remapped position.

Or maybe we could pack slightly more efficient and only use somewhere between \(10\) to \(15\) bits for the variance from the expected value. Some maths will be needed to figure out a safe upper bound on the max deviation.

Some possibilities:

**Absolute:**Store \(32bit\) offset, and \(16bit\) deltas to it.**Linear:**Store \(32bit\) offset, and \(16bit\) deltas to \(offset + i\cdot n/r\).**Custom linear:**Store \(32bit\) offset, \(32bit\) (or less) difference \(d\), and \(16bit\) deltas to \(offset + i \cdot d\).**Tiny EF:**Store \(32bit\) offset, \(128\) bits indicating where \(256bit\) boundaries are crossed similar to EF encoding, and \(44\) \(1byte\) values.

This last idea needs \(64/44\cdot 8bit=11.6bits\) per value. That’s slightly more
than using EF encoding directly on the full vector (which uses around \(9bits\)),
but since this only uses a single cache line lookups are much faster, and it
uses \(3\times\) less bits that the direct `Vec<u32>`

.

#### Query times for different remapping structures

threads | no remap | `Vec<u32>` | `TinyEF` | `EliasFano` | `Vec<u32>` | `TinyEF` | `EliasFano` |
---|---|---|---|---|---|---|---|

\(\alpha\) | \(0.99\) | \(0.99\) | \(0.99\) | \(0.98\) | \(0.98\) | \(0.98\) | |

1 | 7.4 | 7.9 | 7.9 | 10.1 | 8.1 | 8.2 | 11.6 |

2 | 4.1 | 4.3 | 4.3 | 5.1 | 4.8 | 4.5 | 6.0 |

3 | 3.0 | 3.2 | 3.2 | 3.6 | 3.5 | 3.3 | 4.3 |

4 | 2.8 | 2.9 | 2.8 | 3.2 | 3.0 | 2.9 | 3.6 |

5 | 2.7 | 2.7 | 2.8 | 3.0 | 2.8 | 2.7 | 3.2 |

6 | 2.7 | 2.7 | 2.7 | 3.0 | 3.0 | 2.8 | 3.1 |

bits/elem | 2.43 | +0.33 | +0.12 | +0.09 | +0.66 | +0.24 | +0.17 |

As can be seen, `TinyEF`

uses only about a third of the memory of `Vec<u32>`

,
while being at least as fast (and faster for \(\alpha=0.98\)).
`EliasFano`

uses another \(25\%\) less memory, but significantly slows down
queries due to its non-locality.

The code for TinyEF is here.

### Sharding

When the number of keys is very large, they may not all fit in memory at the
same time. To solve this, we can *shard* the keys and process one shard at a
time.

In particular, the \(P\) parts are split into \(S\) shards, where part \(0\) to \(P/S\) go in the first shard, and so on. We make sure that \(P\) is a multiple of \(S\). Then we process one shard at a time: we hash the keys, sort them into their parts, and then compute the pilots for each part independently.

I implemented two ways to shard keys:

- On-disk sharding: create a file for each shard, iterate over the keys, and append the hash of each key to file for the shard it belongs to.
- In-memory sharding: iterate over the keys multiple times, and each time only return those which belong to the current shard.

### 128bit hashing

In order to scale more than \(2^{32}\) keys, we need to use $128$bit hash
functions (github issue). There are many hash functions out there, all with
different speed vs. randomness tradeoffs. I implemented many of them. Although I
have no idea of the quality of the implementations I used, these are some quick
results for hashing `u64`

keys:

Hasher | Bits | seq. | prefetch |
---|---|---|---|

fxhash | 64 | 14.2 | 7.48 |

murmur2 | 64 | 15.8 | 7.70 |

murmur3 | 128 | 16.5 | 9.40 |

xx | 64 | 39.0 | 14.04 |

xx | 128 | 32.3 | 13.94 |

city | 64 | 28.0 | 12.19 |

city | 128 | 45.1 | 20.09 |

metro | 64 | 29.9 | 13.21 |

metro | 128 | 40.3 | 16.71 |

highway | 64 | 107.6 | 57.20 |

highway | 128 | 114.5 | 47.84 |

spooky | 64 | 115.7 | 63.43 |

spooky | 128 | 115.5 | 67.06 |

From this we conclude that `FxHash`

, `Murmur2`

, and `Murmur3`

are good fast hash
functions where throughput is close to the memory bandwidth, while most other
hash functions are quite a bit slower and lower the throughput.

I also added support for hashing of `&[u8]`

strings. Results are here:

Hasher | Bits | seq. | prefetch |
---|---|---|---|

fxhash | 64 | 31.7 | 21.26 |

murmur2 | 64 | 11.5 | 6.23 |

murmur3 | 128 | 24.7 | 15.35 |

wy | 64 | 19.5 | 11.58 |

xx | 64 | 23.2 | 14.26 |

xx | 128 | 21.0 | 12.90 |

city | 64 | 18.0 | 10.97 |

city | 128 | 27.5 | 17.90 |

metro | 64 | 22.2 | 15.36 |

metro | 128 | 24.3 | 17.71 |

highway | 64 | 61.1 | 36.23 |

highway | 128 | 67.0 | 42.51 |

spooky | 64 | 74.1 | 62.48 |

spooky | 128 | 74.1 | 62.87 |

Observations:

`fxhash`

is slow for strings.`murmur2`

is super fast!`city`

and`wy`

are great for =64=bit hashes.`xx`

is best for =128=bit hashes.

### Varying the partition size

Let’s construct for \(n=10^9\) keys, \(c=8\), \(\alpha=0.98\), with different partition sizes. \(\max \alpha\) is the maximum load factor over all parts.

slots per part | num parts | displace time | total time | \(\max\alpha\) |
---|---|---|---|---|

\(2^{15}\) | $31$k | - | - | \(>1\) |

\(2^{16}\) | $16$k | \(20.7\) | \(27.0\) | \(99.49\) |

\(2^{17}\) | $7.8$k | \(22.8\) | \(29.0\) | \(99.18\) |

\(2^{18}\) | $3.9$k | \(22.7\) | \(29.0\) | \(98.59\) |

\(2^{19}\) | \(1947\) | \(25.8\) | \(31.6\) | \(98.40\) |

\(2^{20}\) | \(974\) | \(30.3\) | \(36.1\) | \(98.21\) |

\(2^{21}\) | \(487\) | \(42.3\) | \(48.0\) | \(98.14\) |

\(2^{22}\) | \(244\) | \(41.2\) | \(48.1\) | \(97.83\) |

\(2^{23}\) | \(122\) | \(40.9\) | \(47.0\) | \(97.81\) |

\(2^{24}\) | \(61\) | \(72.5\) | \(78.3\) | \(97.77\) |

\(2^{25}\) | \(31\) | \(88.8\) | \(94.9\) | \(96.17\) |

Observe that:

- Using too small partitions causes too many parts, with too much load-unbalancing, consistently giving \(\alpha > 1\).
- \(\max \alpha\) decreases as the parts get larger, reducing the worst-case outliers. In particular the delta with \(98\%\) roughly halves each step.
- When there are very few parts, \(\max \alpha\) goes below \(98\%\) due to rounding, because each part is fixed to have a power-of-two size.
- Smaller parts are faster to build, due to data fitting in lower caches.
- \(2^{17}\) slots per part was my default, but \(2^{19}\) is not much slower and more reliable.
- From \(2^{21}\), the displacement time quickly grows. Storing \(2^{21}\)
`taken`

bits takes \(256kb\), which is exactly the L2 cache size. But the displacement algorithm also needs to store for each slot which bucket is occupying it, taking \(32bit/slot\). This does not fit in L3 and goes to main memory instead. - At \(2^{24}\) slots, even the
`taken`

arrays do not fit in L3 cache anymore and things slow down even more. (In fact I’m surprised it’s only a \(2\times\) slowdown. The latency is hidden pretty well.)

One way in which PTRHash differs from PTHash is that apart from just the `taken`

bitvector, it needs to store an additional \(32bit\) index for each slot to be able to
displace elements, which requires a lot more memory. If we could prevent this,
memory usage would go down and it may be possible to use larger parts without
slowing down the construction time so much.

## References

*Proceedings of the 10th Acm International on Conference on Emerging Networking Experiments and Technologies*, December. https://doi.org/10.1145/2674005.2674994.

*Proceedings of the 15th Annual International Acm Sigir Conference on Research and Development in Information Retrieval - Sigir ’92*. https://doi.org/10.1145/133160.133209.

*Acm Journal of Experimental Algorithmics*25 (March): 1–16. https://doi.org/10.1145/3376122.

*Software: Practice and Experience*49 (6): 953–70. https://doi.org/10.1002/spe.2689.

*Proceedings of the 44th International Acm Sigir Conference on Research and Development in Information Retrieval*, July. https://doi.org/10.1145/3404835.3462849.