PTRHash: Notes on adapting PTHash in Rust


\[ %\newcommand{\mm}{\,\%\,} \newcommand{\mm}{\bmod} \newcommand{\lxor}{\oplus} \newcommand{\K}{\mathcal K} \]

NOTE: WIP paper is here.

Daniel got me excited about minimal perfect hashing functions (MPHFs), and then twitter Rob asked for a rust implementation of PTHash (Pibiri and Trani 2021b, 2021a), and also I have some ideas related to memory prefetching I want to play with, so here we are: I’m working on a Rust implementation ptr-hash at https://github.com/ragnargrootkoerkamp/ptrhash.

This post is a collection of thoughts, ideas, implementation notes, and experiments that came up while writing the code.

Twitter discussion is here.

(See also my note on BBHash (Limasset et al. 2017).)

Questions and remarks on PTHash paper

Probably some of these questions are answered by reading into the original FCH paper (Fox, Chen, and Heath 1992) – that’ll come at some later time.

  • FCH uses \(cn/\log_2 n\) buckets because each bucket needs \(\log_2 n\) bits of storage, but for PTHash the bits per bucket are lower, so this formula may not really be the right one.

    • Tweet: another option is to fix \(c = \alpha * \log(n) / d\) for some constant \(d\).
  • FCH uses \(\lceil cn / (\log_2n+1)\rceil\) buckets, but PTHash uses \(\lceil cn/\log_2 n\rceil\) buckets. To me, \(\lceil cn/\lceil \log_2n\rceil\rceil\) actually sounds most intuitive.

  • \(p_1=0.6n\) and \(p_2=0.3m\) seem somewhat arbitrary. Can this be tuned better? Could partitioning into more chunks help?

    Anyway, this puts \(60\%\) of data in the first \(30\%\) of buckets (i.e. double the average size), and \(40\%\) of data in \(70\%\) of buckets (i.e. roughly half the average size).

    I suppose the goal is to stitch two binomial distributions of bucket sizes together in a nice way to get a more even distribution over sizes.

    • This is an open question (tweet)
      • Proof in appendix of paper could possibly be extended.
  • How many bits of entropy are really needed? When \(n\) is a power of \(2\), we make sure to use all \(64\) bits instead of just \(\log_2 n\). But how many are really needed? Maybe as long as \(n \ll 2^{32}\), a simpler/faster \(32\) bit hash is sufficient?

  • And how large will \(n\) be at most? Can we reasonably assume \(2^{32} \approx 4\cdot 10^9\) as a bound?

  • Is it reasonable to compute and store hashes of all keys, and then change the API to let the user pass in the hash instead of the key?

  • What’s the probability of failure with the first global seed \(s\) that’s tried?

  • What are the largest \(k_i\) found in typical cases?

  • What if \(\gcd(m, n)\) is large?

Ideas for improvement

Parameters

  • Better tuning of parameters may work?
  • Partitioning into three or more chunks may speed up construction?

Align packed vectors to cachelines

Instead of packing \(r\)-bit values across the entire vector, only pack them within cachelines, so that a single value never crosses a cacheline boundary.
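
As a rough sketch of the indexing this implies (a hypothetical helper, assuming a 512-bit cacheline and that values never straddle a boundary):

// Locate the i-th r-bit value when values are packed per 64-byte (512-bit)
// cacheline and never cross a boundary. When r does not divide 512, the
// remaining bits of each cacheline are wasted.
fn locate(i: usize, r: usize) -> (usize, usize) {
    let per_line = 512 / r;       // values stored per cacheline
    let line = i / per_line;      // which cacheline
    let bit = (i % per_line) * r; // bit offset within that cacheline
    (line, bit)
}

fn main() {
    // Example: r = 9 bits gives 56 values per cacheline and 8 wasted bits.
    assert_eq!(locate(0, 9), (0, 0));
    assert_eq!(locate(56, 9), (1, 0));
}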

Prefetching

The query performance of this and other MPHFs seems to be limited by the latency of memory lookups. For this reason, most MPHFs minimize the number of memory lookups, and in particular PTHash uses one random memory access followed by one lookup in a small (cached) dictionary.

When processing a number of lookups sequentially, each lookup currently incurs a cache miss. This could be made much more efficient by doing \(k\) (e.g. \(k=16\)) cache lookups in parallel:

  1. first compute the array location to lookup for \(k\) keys (possibly using SIMD),
  2. then prefetch the \(k\) memory locations,
  3. then do the rest of the computation in parallel.
  4. If needed, do lookups in the \(free\) table in parallel.

This should mostly hide the memory latency and could give significant speedup. I’m not quite sure yet whether this would work best when done in batches (as described), or in streaming fashion, where we iterate over elements and prefetch memory for the element \(k\) iterations ahead. Algorithmica.org has a nice article on the streaming approach.
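
A minimal sketch of the batched variant, with a placeholder modulo standing in for the real hash and bucket computation (the function and its signature are hypothetical; it uses the stable x86_64 prefetch intrinsic, so it is x86_64-only):

use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

// Look up K keys at once: compute all slots, prefetch them, then do the loads.
fn lookup_batch<const K: usize>(table: &[u64], keys: &[u64; K]) -> [u64; K] {
    // 1. Compute the array location for each key (placeholder for hash + bucket).
    let slots: [usize; K] = keys.map(|x| x as usize % table.len());
    // 2. Prefetch all K cache lines before any of them is needed.
    for &s in &slots {
        unsafe { _mm_prefetch::<_MM_HINT_T0>(table.as_ptr().add(s) as *const i8) };
    }
    // 3. Do the dependent loads; by now the cache lines are (hopefully) in flight.
    slots.map(|s| table[s])
}

fn main() {
    let table: Vec<u64> = (0..1024).collect();
    let keys = [3u64, 1_000_003, 77, 123_456_789];
    println!("{:?}", lookup_batch(&table, &keys));
}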

Faster modulo operations

There are a lot of % n, % p1 and % (m-p1) operations throughout the code. I didn’t look into it yet, but possibly these are the bottleneck on the CPU latency.

First note that \[ a\, \%\, b = a - \lfloor a/b\rfloor \cdot b. \] This division by a constant can be computed efficiently using a trick that replaces the division by a multiplication with a precomputed inverse. Using the formula of the wikipedia article we can precompute some constants to evaluate \(\lfloor a/b\rfloor\) in \(6\) operations and a % b in \(8\) operations.

(Note that compilers might already do this, but I don’t expect so.)
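
A minimal sketch of the trick (this follows Lemire's published fastmod algorithm; it is not necessarily the exact code used in ptr-hash):

// Precompute M = floor((2^128 - 1) / d) + 1 for a fixed divisor d >= 2.
fn compute_m_u64(d: u64) -> u128 {
    (u128::MAX / d as u128) + 1
}

// Compute a % d without a hardware division, using the precomputed M.
fn fastmod_u64(a: u64, m: u128, d: u64) -> u64 {
    // lowbits = (M * a) mod 2^128
    let lowbits = m.wrapping_mul(a as u128);
    // Return the high 64 bits of the 128x64-bit product lowbits * d.
    let bottom = ((lowbits & u64::MAX as u128) * d as u128) >> 64;
    let top = (lowbits >> 64) * d as u128;
    ((bottom + top) >> 64) as u64
}

fn main() {
    let d = 1_000_000_007u64;
    let m = compute_m_u64(d);
    for a in [0u64, 1, 123_456_789_123_456_789, u64::MAX] {
        assert_eq!(fastmod_u64(a, m, d), a % d);
    }
}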

Some blogposts by Daniel Lemire (Thanks Daniel Liu ;)

Store dictionary \(D\) sorted using Elias-Fano coding

I doubt whether the memory savings here are worth the time overhead, but it’s an idea :shrug:.

How many bits of \(n\) and hash entropy do we need?

One bottleneck of fastmod64 is that it needs to compute the highest \(64\) bits of a u128 * u64 product. If we can assume that \(n\) is at most \(40\) bits, and that \(44\) bits are sufficient entropy, then I think we could do away with the u128 x u64 multiplication and do everything inside u128.

Ideas for faster construction

Once the table is almost filled, determining the \(k_i\) becomes slower. Some ideas to speed this up:

reversing murmurhash
Finding a \(k_i\) such that \(position(x) := (h(x) \lxor h(k_i))\mm n\) is not yet taken becomes slow once the table is almost full. Instead, we could keep a list of empty positions \(p\) and determine \(k_i\) as \(h^{inv}((p + j \cdot n) \lxor h(x))\) for different \(j\), assuming we can invert murmurhash. As it turns out, murmurhash2 is an invertible function with \(O(1)\) inverse! See this blogpost. Thus, we can easily find many \(k_i\) such that \(position(x, k_i)\) maps to an empty position.
  • All buckets of size \(1\) can be directly assigned to slots this way.
  • Size \(2\) and \(3\) buckets also need fewer tries than before.

The big drawback though is that the \(k_i\) values found will be uniform in \([0, 2^{64})\).

Cuckoo-hashing
For sets of size \(1\) and \(2\) (and maybe \(3\)?) we could displace an already taken slot if that is the last remaining slot needed to fix the current bucket. Probably we want to only displace buckets of the same size and never buckets of larger size.

I wonder though how useful this actually is. If the current bucket is hard to place, there is not really any reason a different bucket of the same size would be easier to fix.

Implementation log

A somewhat chronological list of notes and remarks.

Hashing function

For now I use murmur64a, documented on the SMHasher GitHub wiki.

Bitpacking crates

There are a lot of bitvector and bitpacking crates!

bitvectors
All of the below seem to do the same
  • bitvec: \(30M\) downloads
  • bit-vec: \(30M\) downloads
  • fixedbitset: \(55M\) downloads

No idea which is best; probably I’ll settle for the one below in sucds.

sucds
only \(60K\) downloads, but contains the CompactVector type used below.

succinct
  • IntVector
  • Can not be constructed from a slice/iterator of values.
  • Decoding seems somewhat inefficient.
  • No updates in the past 2 years.

Construction

  • Storing buckets as Vec<Vec<Key>> is bad for large keys, so now I store Vec<Vec<usize>>, but the nested Vecs still waste a lot of space and will cause allocation slowdowns. PTHash pushes onto a vector which is sorted later, which seems more efficient.
  • When testing \(k_i\), not only do we need to test that positions are not filled by previous buckets, but we also have to check that elements within the bucket do not collide with each other. It is not sufficient that \(h(x, s)\) does not collide within buckets, since hashes could still collide after taking them \(\mm n\) (see the sketch below).
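
A minimal sketch of that in-bucket collision check (the helper is hypothetical, not the ptr-hash code):

// Reject a pilot if two elements of the same bucket land on the same slot,
// even when none of those slots is taken by earlier buckets.
fn positions_distinct(mut positions: Vec<usize>) -> bool {
    positions.sort_unstable();
    positions.windows(2).all(|w| w[0] != w[1])
}

fn main() {
    assert!(positions_distinct(vec![3, 7, 1]));
    assert!(!positions_distinct(vec![3, 7, 3]));
}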

Fastmod

It seems that Daniel Lemire’s fastmod C++ library has not yet been ported to Rust, so I converted the few parts I need.

There is also strength_reduce, which contains a similar but distinct algorithm for a % b that computes the remainder from the quotient.

TODO Try out fastdivide and reciprocal crates

First benchmark

I implemented these under a generic Reduce trait.

just bench at the linked commit at 2.6GHz gives the following for \(10^7\) keys:

method            construction (s)  query (ns)
u64                      10.474591          91
fastmod64                10.731583          55
fastmod32                 9.911751          50
strengthreduce64         11.520939          56
strengthreduce32         10.002017          50

The u32 versions simply only use the lower \(32\) bits of the \(64\) bit hash.

This is not yet as fast as the fastest 28ns reported in the PTHash paper (for C-C encoding), but I also haven’t optimized anything else yet. Time for profiling.

Profiling: Looking at the flamegraph (cargo flamegraph), and zooming in on the hash function, we see

A lot of time is spent on fold! The murmur2 function I use has signature murmur2(bytes: &[u8], seed: u64), and even though my keys/bytes always correspond to just a u64, it’s iterating over them!

In the generated perf report, we see

33.14%         27328  test::queries_e  pthash_rs-f15b4648f77f672b           [.] pthash_rs::PTHash<P,R>::new
18.18%         14823  test::queries_e  pthash_rs-f15b4648f77f672b           [.] pthash_rs::PTHash<P,R>::index
13.76%         11245  test::queries_e  pthash_rs-f15b4648f77f672b           [.] murmur2::murmur64ane

We can ignore the \(33\%\) for construction and only focus on querying here, where we see that the index function calls murmur2 and a lot of time is spent in both. In fact, murmur2 is not inlined at all! That explains the iterator appearing in the flamegraph.

Thin-LTO: This is fixed by enabling link-time optimization: add lto = "thin" to Cargo.toml.

Rerunning the benchmark we get

                  construction (s)    query (ns)
method            no LTO  thin-LTO  no LTO  thin-LTO
u64                 10.5       8.9      91        60
fastmod64           10.7       8.3      55        34
fastmod32            9.9       8.5      50        26
strengthreduce64    11.5       8.3      56        38
strengthreduce32    10.0       8.8      50        31

Sweet! 26ns is faster than any of the numbers in table 5 of Pibiri and Trani (2021b)! (Admittedly, there is no compression yet and the dictionary size is \(10\times\) smaller, but still!)

More inlining: Actually, we don’t even want the index() function call to show up in our logs: inlining it should give better instruction pipelining in the benchmarking hot-loop

for key in &keys {
    mphf.index(key);
}

and indeed, we now get

query (ns)        no LTO  thin-LTO  inline index()
u64                   91        60              55
fastmod64             55        34              33
fastmod32             50        26              24
strengthreduce64      56        38              33
strengthreduce32      50        31              26

Conclusion: From now on let’s only use fastmod64 and fastmod32. (I suspect the 32bit variant does not have sufficient entropy for large key sets, so we keep the original 64bit variant as well.)

Faster bucket computation

After inlining everything, the generated assembly for our test is just one big \(\sim 100\) line assembly function. Currently, the bucket(hx) function (that computes the bucket for the given hash hx = hash(x, s)) looks like

fn bucket(&self, hx: u64) -> u64 {
    if (hx % self.rem_n) < self.p1 { // Compare
        hx % self.rem_p2
    } else {
        self.p2 + hx % self.rem_mp2
    }
}

The assembly looks like this:

   5     ┌──cmp        %rdi,0xc0(%rsp)       # compare
   1     ├──jbe        370
 102       mov        0x98(%rsp),%rdx       # first branch: fastmod
  41       mulx       %r9,%rdx,%rdi
   4       imul       0xa0(%rsp),%r9
  85       mulx       %r10,%r13,%r13
  12       add        %rdi,%r9
   7       mov        %r9,%rdx
  72       mulx       %r10,%rdx,%rdi
  17       add        %r13,%rdx
   3       adc        $0x0,%rdi             # add 0
           cmp        %rdi,0x58(%rsp)       # index-out-of-bound check for k array
  56     │↓ ja         3ac                   # ok: continue below at line 3ac:
         │↓ jmp        528                   # panic!
           cs         nopw 0x0(%rax,%rax,1)
 128 │370:└─→mov        0xa8(%rsp),%rdx       # second branch: fastmod
  41        mulx       %r9,%rdx,%rdi
            imul       0xb0(%rsp),%r9
  66        mulx       %rcx,%r13,%r13
  12        add        %rdi,%r9
            mov        %r9,%rdx
  58        mulx       %rcx,%rdx,%rdi
  14        add        %r13,%rdx
            adc        0xb8(%rsp),%rdi       # add p2
  54        cmp        %rdi,0x58(%rsp)       # out-of-bound check for k array
   1       jbe        528                   # panic!
8100 │3ac:   mov        (%r11,%rdi,8),%rdx    # Do array index.

We see that there are quite some branches:

  • The first and second branch of the bucket() function are both fully written out.
  • They use the same number of instructions.
  • One branch does add 0, I suppose because the CPU likes equal-sized branches.
  • There are redundant index-out-of-bounds checks.
  • The last line, the array index itself, has \(8000\) samples: \(57\%\) of the total samples is this single assembly instruction!

Branchless bucket index: I tried rewriting the bucket() function into a branchless form as follows:

fn bucket(&self, hx: u64) -> u64 {
    let is_large = (hx % self.rem_n) >= self.p1;
    let rem = if is_large { self.rem_mp2 } else { self.rem_p2 };
    is_large as u64 * self.p2 + hx % rem
}

but this turns out to be slower than the original, probably because the new assembly now needs a lot of cmov instructions. (In particular, rem contains a u128 and a u64, so needs \(3\) mov’s and \(3\) cmov’s.)

             cmp        %rdi,0xd0(%rsp)     # comparison
  112        mov        0xb8(%rsp),%rdi     # load rem_p2
   29        cmova      0xc8(%rsp),%rdi     # conditionally overwrite with rem_mp2
             mov        0xb0(%rsp),%rdx
  137        cmova      0xc0(%rsp),%rdx
             mov        0xa0(%rsp),%r14
  137        cmova      0xa8(%rsp),%r14
   26        mov        %r9,%r10
             mov        $0x0,%r11d          # set offset to 0
   90        cmova      %r11,%r10           # conditionally overwrite offset
   38        imul       %r8,%rdi            # start computation
    2        mulx       %r8,%rdx,%r8
  122        add        %r8,%rdi
   48        mulx       %r14,%r8,%r8
             mov        %rdi,%rdx
  163        mulx       %r14,%rdx,%rdi
             add        %r8,%rdx
             adc        %r10,%rdi
  184        cmp        %rdi,0x60(%rsp)     # index-out-of-bounds check
            jbe        52f                 # panic
   38        mov        0x98(%rsp),%rdx
10798        mov        (%rdx,%rdi,8),%rdx  # Do array index.

No bounds check: We can replace k[index] by unsafe { *k.get_unchecked(index) }. This doesn’t give much performance gain (less than the few ns of measurement noise I have), but can’t hurt. It removes the final cmp; jbe lines from the assembly.

Fix tests: Instead of ignoring test results we can accumulate the resulting indices and pass them to black_box(sum). This prevents the compiler from optimizing away all queries. Somehow this affects the reported timings. I now get:

query (ns)        no LTO  thin-LTO  inline index()  fixed tests
u64                   91        60              55           63
fastmod64             55        34              33           35
fastmod32             50        26              24           20
strengthreduce64      56        38              33           38
strengthreduce32      50        31              26           30

I’m confused how the fastmod32 timing went down, but the fastmod64 went up. (Typical situation when you do constant profiling and there are more numbers than you can make sense of, sadly.)

Branchless, for real now! (aka the trick-of-thirds)

I’m still annoyed by this branching. Branches are bad! They may be fast for now, but I kinda have the long term goal to put SIMD on top of this and that doesn’t go well with branching. Also, branch-misprediction is a thing, and the \(70\% - 30\%\) uniform random split is about as bad as you can do to a branch predictor. The code from earlier does fix it, but at the cost of a whole bunch of mov’s and cmov’s.

But there is a trick we can do! \(p_1\) and \(p_2\) are sort of arbitrarily chosen, and all the original paper (Fox, Chen, and Heath 1992) has to say about it is

Good values for these two parameters are experimentally determined to be around \(0.6n\) and \(0.3m\), respectively.

Thus I feel at liberty to change the value of \(p_2\) from \(0.3m\) to \(m/3\). This gives: \[m-p_2 = m-m/3 = \frac 23 m = 2p_2.\] The cute trick is that now we can use that \[x \mm p_2 = (x \mm (2p_2)) \mm p_2 = (x \mm (m - p_2)) \mm p_2,\] and since \(0\leq x \mm (2p_2) < 2p_2\), computing that value modulo \(p_2\) is as simple as comparing the value to \(p_2\) and subtracting \(p_2\) if needed.

Thus, we modify the initialization to round \(m\) up to the next multiple of \(3\), and change the bucket function to

fn bucket(&self, hx: u64) -> u64 {
    let mod_mp2 = hx % self.rem_mp2;
    let mod_p2 = mod_mp2 - self.p2 * (mod_mp2 >= self.p2) as u64;
    let large = (hx % self.rem_n) >= self.p1;
    self.p2 * large as u64 + if large { mod_mp2 } else { mod_p2 }
}

The new timings are

query (ns)        no LTO  thin-LTO  inline index()  fixed tests  \(p_2 = m/3\)
u64                   91        60              55           63             54
fastmod64             55        34              33           35             27
fastmod32             50        26              24           20             19
strengthreduce64      56        38              33           38             33
strengthreduce32      50        31              26           30             21

fastmod32 didn’t get much faster, but all others went down a lot! Let’s check out the generated assembly for fastmod64:

  10           vpunpcklqdq  %xmm3,%xmm5,%xmm3
   3           vmovq        %r15,%xmm12
   1           vpunpcklqdq  %xmm6,%xmm7,%xmm5
  33           vinserti128  $0x1,%xmm3,%ymm5,%ymm5
  14           vpunpcklqdq  %xmm8,%xmm14,%xmm3
   2           vpunpcklqdq  %xmm15,%xmm12,%xmm6
   2           vinserti128  $0x1,%xmm3,%ymm6,%ymm3
  32           vmovdqu      0x570(%rsp),%ymm14
  15           vpxor        %ymm3,%ymm14,%ymm3
   2           vpxor        0x610(%rsp),%ymm14,%ymm12
   2           vpxor        %ymm5,%ymm14,%ymm6
  29           vmovdqu      0x630(%rsp),%ymm11
  13           vpxor        %ymm14,%ymm11,%ymm15
   2           vpcmpgtq     %ymm6,%ymm15,%ymm6
   2           vpcmpgtq     %ymm3,%ymm12,%ymm3
  39           vpandn       %ymm11,%ymm6,%ymm6
  13           vpandn       %ymm11,%ymm3,%ymm7
   4           vpand        %ymm6,%ymm3,%ymm3
   2           vpaddq       %ymm5,%ymm7,%ymm5
  31           vpcmpeqd     %ymm7,%ymm7,%ymm7
   8           vpsubq       %ymm3,%ymm5,%ymm3
               vpxor        %xmm6,%xmm6,%xmm6
   2           mov          0x1c0(%rsp),%r14
4264           vpgatherqq   %ymm7,(%r14,%ymm3,8),%ymm6 # Do array index

Huh what?! I don’t really know what is going on here, but I do know that the compiler just vectorized our code for us! All the vp instructions are vector/packed instructions! Magic! This probably explains the big speedup we get for fastmod64.

Closer inspection: As it turns out, the 32bit versions were already auto-vectorized before we implemented this last optimization. Probably because the FastMod32 type is smaller (two u64) than the Fastmod64 type (u128 and u64) and hence easier to vectorize (and similar for StrengthReduce32). But either way this last trick helps a lot for the 64bit variants that will be needed for large hashmaps.

Compiling and benchmarking PTHash

Compiling PTHash was very smooth; just a git clone, submodule init, and building just worked :)

Running a benchmark similar to the ones here:

./build -n 10000000 -c 7.0 -a 1 -e compact_compact -s 1234567890 --minimal --verbose --lookup

reports a query performance of 26ns/key, similar to the fastmod64 performance I get.

Note that PTHash uses fixed-width bitpacking here, while I just store u64’s directly, but this shouldn’t affect the time too much.

Vectorization: More interestingly, PTHash is not auto-vectorized by my compiler, so I’m surprised it performs this well. Maybe the vpgatherqq instruction just doesn’t give that much speedup over sequential lookups – I don’t know yet. But still, my equivalent code using fastmod64 with \(p_2 = 0.3m\) has 35ns/key vs 26ns/key for PTHash. Confusing.

Branching: PTHash compiles to a branchy version of fastmod(x, p2) or fastmod(x, m-p2), but is still fast.

Compact encoding

Adding fixed-width encoding was easy using the sucds CompactVector type. The generated code doesn’t look so pretty though – it branches on whether the bits cross a usize boundary, whereas PTHash’s implementation does an unaligned read from a *u8 to avoid this, which seems nicer.

Find the \(x\) differences

At this point, both pthash-rs and the original PTHash support encoding by a single compacted vector, but there is still quite some time difference: 31ns vs 25ns. Time to find all the differences.

This may or may not be the best approach, but I decided to put the assemblies side-by-side.

Exhibit A: the missing modulo. Ok, I won’t bore you with the full assembly, but I found this in the PTHash assembly:

movabs     $0x999999ffffffffff,%rbx

with nothing similar in the rust version. Turns out that this is \(0.6 \cdot (2^{64}-1)\). Indeed, the code is:

inline uint64_t bucket(uint64_t hash) const {
    static const uint64_t T = constants::a * UINT64_MAX;
    return (hash < T) ? fastmod::fastmod_u64(hash, m_M_num_dense_buckets, m_num_dense_buckets)
                        : m_num_dense_buckets + fastmod::fastmod_u64(hash, m_M_num_sparse_buckets,
                                                                    m_num_sparse_buckets);
}

Note how it compares \(hash < 0.6\cdot 2^{64}\) instead of \(hash \mm n < 0.6\, n\) as written in the paper for what FCH does. Basically we can completely drop the \(\mm n\) there! Sweet! That’s \(1\) of \(3\) modulo operations gone :)

FastReduce revisited

Earlier I mentioned the blogpost A fast alternative to the modulo reduction: to map \(h\in[2^k]\) to \([n]\) (\([n] := \{0, \dots, n-1\}\) here), instead of taking \(h\mm n\), one can do \(h\cdot n/2^k\), which is much faster to evaluate. The problem with this approach is that it only uses the \(\log_2 n\) high-order bits of \(h\), discarding some necessary entropy.

On second thought, it seems like this may still be useful though. There are two modulo operations in the PTHash algorithm:

  1. In the bucket() function, mapping from \([2^{64}]\) to \([p_2]\) or \([m - p_2]\).
  2. In the position() function, mapping from \([2^{64}]\) to \([n]\).

And there is one related check:

  1. In the bucket() function, check whether \(h < p_1 \cdot 2^{64}\) (or \(h \mm n \leq p_1 \cdot n\)), which is actually also a reduction operation.

While we cannot use FastReduce twice, I think it may still be possible to use it once. In particular, it should be fine to use a low-entropy hash for bucket-selection since we anyway have collisions there – that’s the point of bucketing in the first place.

A second optimization may be possible if we could find a function that uses the lower \(\log_2 n\) bits of the hash for position(). Then we can access \(2\cdot \log_2 n\) bits of entropy in total which should be sufficient to avoid collisions with constant probability (via the birthday paradox).

One idea is something like \(h \mm 2^{\lceil\log_2 n\rceil}\), but this is not quite fair so it may not work out nicely. But then again, maybe we could use that for the bucketing modulo, since it possibly doesn’t require (as much) fairness.

Or maybe we can just take the lower \(32\) bits of \(h\) and do \((h\mm 2^{32}) * n / 2^{32}\). That should probably work just fine :)

So, counting the plain mod, we have 7 possible reduction functions now; the other 6 are listed below, with a code sketch after the list:

  1. FastMod64, same as mod but faster.
  2. FastMod32L, taking the lower \(32\) bits modulo \(n\).
  3. FastMod32H, taking the higher \(32\) bits modulo \(n\).
  4. FastReduce64: \((h \cdot n) \gg 64\)
  5. FastReduce32L, fastreduce on the \(32\) low bits: \(((h \mm 2^{32}) \cdot n) \gg 32\)
  6. FastReduce32H, fastreduce on the \(32\) high bits: \(((h \gg 32) \cdot n) \gg 32\)
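
As a sketch of what each variant computes (the FastMod variants are shown with a plain %, which the real implementations replace by the precomputed-constant trick, and the \(32\)-bit variants assume \(n < 2^{32}\)):

// The six reductions, each mapping a 64-bit hash h into [0, n).
fn fastmod64(h: u64, n: u64) -> u64 { h % n }
fn fastmod32_l(h: u64, n: u64) -> u64 { (h & 0xffff_ffff) % n }
fn fastmod32_h(h: u64, n: u64) -> u64 { (h >> 32) % n }
fn fastreduce64(h: u64, n: u64) -> u64 { ((h as u128 * n as u128) >> 64) as u64 }
fn fastreduce32_l(h: u64, n: u64) -> u64 { ((h & 0xffff_ffff) * n) >> 32 }
fn fastreduce32_h(h: u64, n: u64) -> u64 { ((h >> 32) * n) >> 32 }

fn main() {
    let (h, n) = (0x0123_4567_89ab_cdefu64, 1_000_000u64);
    let fns: [fn(u64, u64) -> u64; 6] =
        [fastmod64, fastmod32_l, fastmod32_h, fastreduce64, fastreduce32_l, fastreduce32_h];
    for f in fns {
        assert!(f(h, n) < n);
    }
}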

Excluding mod (which is never better than FastMod64), we can make \(36\) combinations from this to use for the two modulo operations.

Not all combinations end up working (because of lack of entropy when e.g. only the \(32\) high bits are used). It’s clear that FastMod64 tends to be slow, and that Reduce operations are usually (but not always) faster than Mod operations.

The big problem with this benchmark though seems to be that timings are quite inconsistent (variance of a few nanoseconds), and that the quality of generated code (auto-vectorization and loop unrolling) depends on a lot of things and is somewhat unpredictable/inconsistent.

Table 1: Query time (ns) for different combinations of reduction functions. Missing entries either fail or slow down construction.
bucket reduce (\(\mm m\)) \ position reduce (\(\mm n\)): FM32L, FM32H, FM64, FR32L, FR32H, FR64
FastMod32L       20  27  19  20
FastMod32H       18  26  19
FastMod64        29  27  33  28  26  27
FastReduce32L    18  23  18
FastReduce32H
FastReduce64

Note that the last two rows fail in particular, because they strongly correlate with the check whether elements belong in a small or large bucket, \(h(x, s) < p_1\).

TODO Is there a problem if \(\gcd(m, n)\) is large?

Faster hashing

The current implementation uses Murmur2 to hash both the key \(x\mapsto h(x, s)\) and the pilot \(k_i \mapsto h(k_i, s)\). While this hash is fast, it’s still quite some instructions. Instead, especially for the \(h(k_i, s)\), we may be able to get away with either no or a much simpler hash.

No hash: A first try of \(h_0(k_i, s) := k_i\) results in failures because the \(k_i\) become too large.

Multiplication hash (MulHash): So we do need more mixing of bits rather than just incrementally increasing \(k_i\) starting at \(0\). One common way of doing that is simply to multiply by a large semi-random \(64\)-bit integer. In particular, Murmur also does this, so let’s just reuse their mixing constant and set: \[ h_1(k_i, s) := (0xc6a4a7935bd1e995 \cdot k_i) \mm 2^{64}. \] (I tried looking for documentation on why this constant was chosen, but there doesn’t seem to be more to it than that it works.)
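
In code, MulHash is nothing more than a single wrapping multiplication by the Murmur mixing constant (a sketch matching the formula above):

// MulHash: multiply by the (odd) Murmur mixing constant, modulo 2^64.
const C: u64 = 0xc6a4a7935bd1e995;

fn mulhash(k: u64) -> u64 {
    C.wrapping_mul(k)
}

fn main() {
    assert_eq!(mulhash(1), C);
}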

In experiments, this gives anywhere between \(1\) and \(4\) nanoseconds of speedup.

TODO Try xxhash

See github.

An experiment

Another fun comparison is here, where I use MulHash for \(k_i\) and replace the MurmurHash for \(h(x)\) by simply the identity operation (since we’re testing on uniform random \(x\) anyway):

  17 │1f0:   mov        (%rsi,%rax,1),%rdx
  11        imul       %r13,%rdx           # Start of Murmur
  32        mov        %rdx,%rbx
 106        shr        $0x2f,%rbx
   7        xor        %rdx,%rbx
  23        imul       %r13,%rbx
  24        mov        %rdi,%rdx
  99        movabs     $0x35253c9ade8f4ca8,%rbp
  11        xor        %rbp,%rdx
  27        xor        %rbx,%rdx
  28        imul       %r13,%rdx
  92        mov        %rdx,%rbx
  15        shr        $0x2f,%rbx
  15        xor        %rdx,%rbx
  28        imul       %r13,%rbx
  95        mov        %rbx,%rbp
  16        shr        $0x2f,%rbp
  17        xor        %rbx,%rbp           # End of Murmur
  25        mov        %ebp,%edx
  64        cmp        %rbp,%r8
  43       jbe        250                 # Branch for bucket index
   9        imul       %r14,%rdx
  12        shr        $0x20,%rdx
  27       jmp        25d
            cs         nopw 0x0(%rax,%rax,1) # nop; for code alignment
   7 │250:   imul       %r15,%rdx
   4        shr        $0x20,%rdx
  11        add        0x10(%rsp),%rdx
5088 │25d:   mov        (%r9,%rdx,8),%rdx   # Memory lookup -- most waiting is here.
 301        imul       %r13,%rdx
  98        xor        %rbp,%rdx
 453        mulx       %r11,%rdx,%rdx
 100        cmp        %rdx,%r10
   2       jbe        6e8
  13        add        %rdx,%r12
  10        add        $0x8,%rax
  37        cmp        %rax,%rcx
 100       jne        1f0
Code Snippet 1: Assembly of \(index()\) function when using Murmur (which takes \(17/38\) instructions): 18ns/query.
  72 │4a0:   mov        (%rcx,%rbp,1),%rbx
  38        mov        %ebx,%edx
  21        cmp        %rbx,%rsi
  36       jbe        4c0                 # Branch for bucket index
  16        imul       %r15,%rdx
  22        shr        $0x20,%rdx
  24       jmp        4cb
            data16     cs nopw 0x0(%rax,%rax,1) # code alignment
     │4c0:   imul       %r14,%rdx
  20        shr        $0x20,%rdx
  21        add        %r10,%rdx
1895 │4cb:   mov        (%rdi,%rdx,8),%rdx  # Memory lookup -- most waiting is here
 172        imul       %r13,%rdx
  75        xor        %rbx,%rdx
 252        mulx       %r9,%rdx,%rdx
  56        cmp        %rdx,%r8
           jbe        706
  43        add        %rdx,%r11
  26        add        $0x8,%rbp
            cmp        %rbp,%rax
  34       jne        4a0
Code Snippet 2: Assembly of \(index()\) function when using \(h(x) = x\) instead: 7ns/query.

MurmurHash takes slightly less than half the instructions, but removing them gives almost \(2.5\times\) speedup! My current thinking is that this is not so much due to the reduced instruction count itself (the CPU is stalling anyway to wait for memory), but rather due to the better pipelining it results in: when loop iterations are shorter (in number of assembly instructions), pipelining can look ahead more iterations, and hence does a better job at prefetching memory. But even with this short loop, around two thirds of the time is still spent waiting for memory.

Conclusion 1.: I should really write code with prefetching.

Conclusion 2.: It’s time to use perf stat for some metrics on branch mispredictions and instructions per cycle.

Compiler struggles

Auto-vectorization: The compiler is quite eager to generate vectorized assembly code. It’s quite unpredictable when auto-vectorization triggers, and it seems I have to keep at least one branch in the hot loop to prevent it. The vectorized code seems bad for a few reasons:

  • Gather instructions (vpgatherqq) are slow.
  • Pipelining: It seems that pipelining works much better for the scalar version, being able to look ahead further and keeping busy while waiting for memory to load.

Even worse: it also did the following terrible thing. Starting with this piece of innocent-looking code:

if likely(p < self.n0) {
    p
} else {
    unsafe { *self.free.get_unchecked(p - self.n0) }
}

the compiler decided to generate a branchless implementation of this if statement where the false branch is always executed. But that branch is super slow! Basically a completely unnecessary read from main memory! For now I’ll just completely remove the false branch to prevent this issue…

Prefetching, at last

Without further ado, here we go:

#[inline(always)]
pub fn index_stream<'a, const L: usize>(
    &'a self,
    xs: &'a [Key],
) -> impl Iterator<Item = usize> + 'a {
    let mut next_hx: [Hash; L] = xs.split_array_ref().0.map(|x| self.hash_key(&x));
    let mut next_i: [usize; L] = next_hx.map(|hx| self.bucket(hx));
    xs[L..].iter().enumerate().map(move |(idx, next_x)| {
        let idx = idx % L;
        let cur_hx = next_hx[idx];
        let cur_i = next_i[idx];
        next_hx[idx] = self.hash_key(next_x);
        next_i[idx] = self.bucket(next_hx[idx]);
        // TODO: Use 0 or 3 here?
        // I.e. populate caches or do a 'Non-temporal access', meaning the
        // cache line can skip caches and be immediately discarded after
        // reading.
        unsafe { prefetch_read_data(self.k.address(next_i[idx]), 3) };
        let ki = self.k.index(cur_i);
        let p = self.position(cur_hx, ki);
        p
    })
}

For \(L = 64\), this is around twice as fast as the non-streaming/non-prefetching version!

In our \(n=10^7\) benchmark, this reduces latency to 4.2ns!! For the larger \(n=10^8\) benchmark, latency is 7.5ns, down from 28ns of the original PTHash paper! (But note that I don’t do any compression here.)

And this is without vectorization still :)

TODO Prefetching with vectorization

Preliminary results: this seems tricky to get right and tends to be slower. It sometimes generates unwanted gather instructions, but even when it doesn’t it’s slow although I don’t know exactly why yet. Does pipelining work with SIMD instructions?

Inverting \(h(k_i)\)

When there are only a few empty slots left, we want to find \(k\) such that \[ R(h_1(x) \lxor h_2(k), n) = p \] for some fixed position \(p\).

Using FastReduce64 as reduction \(R\), we want \[ \left\lfloor\frac{(h_1(x) \lxor h_2(k)) * n}{2^{64}}\right\rfloor = p \] i.e. \[ 2^{64} p \leq (h_1(x) \lxor h_2(k)) \cdot n < 2^{64}(p+1) \] i.e. \[ \frac{2^{64} p}{n} \leq h_1(x) \lxor h_2(k) < \frac{2^{64}(p+1)}{n}. \] This basically fixes the high \(r:=\log_2(n)\) bits of \(h_1(x) \lxor h_2(k)\). (I suspect it only leaves two possible options for the highest \(\lfloor r\rfloor\) bits.) Thus, we infer (with some handwaving) \[ h_2(k) = \frac{2^{64}(p+\frac 12)}{n} \oplus h_1(x) \oplus A =: X \oplus A \] where \(X\) is the target constant and \(A\)’s upper \(r\) bits are \(0\), i.e. \(A < 2^{64-r}\). (TODO: Maybe we can just set the lower \(64-r\) bits of \(X\) to \(0\). Not sure.)

We could solve this when using Murmur for \(h_2\), but in practice we use a much simpler MulHash: \(h_2(k) = (C\cdot k) \mm 2^{64}\). Thus, we must solve \[ (C \cdot k) \mm 2^{64} \in \{ X \oplus A : 0\leq A < 2^{64-r}\} \] for \(k\).

Now, since \(C\) is odd it has a multiplicative inverse \(C’\) modulo \(2^{64}\), with \((C \cdot C’)\mm 2^{64} = 1\), so we’re looking for \[ k \in \{ (C’ \cdot (X \oplus A)) \mm 2^{64} : 0\leq A < 2^{64-r}\}. \]
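
For illustration, here is a minimal sketch (not the ptr-hash code) of computing \(C’\) with a Newton/Hensel iteration and using it to invert MulHash:

const C: u64 = 0xc6a4a7935bd1e995;

// Compute the multiplicative inverse of an odd constant modulo 2^64.
// Each Newton/Hensel step doubles the number of correct low bits,
// so 6 steps starting from 1 correct bit give all 64.
fn inverse_mod_2_64(c: u64) -> u64 {
    assert!(c & 1 == 1, "only odd numbers are invertible mod 2^64");
    let mut x: u64 = 1;
    for _ in 0..6 {
        x = x.wrapping_mul(2u64.wrapping_sub(c.wrapping_mul(x)));
    }
    x
}

fn main() {
    let c_inv = inverse_mod_2_64(C);
    assert_eq!(C.wrapping_mul(c_inv), 1);
    // Any k = C' * (X ^ A) mod 2^64 satisfies C * k = X ^ A mod 2^64.
    let x: u64 = 0x0123_4567_89ab_cdef; // arbitrary example target
    let a: u64 = 42;                    // some A with the top r bits zero
    let k = c_inv.wrapping_mul(x ^ a);
    assert_eq!(C.wrapping_mul(k), x ^ a);
}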

This makes it very easy to find arbitrary solutions! The big drawback is that this results in ‘random’ \(k\) in \([2^{64}]\), while for compression purposes we much prefer finding small (or even minimal) \(k\).

Given the size of the set, the expected value of the smallest solution \(k\) is \(2^r = n\). But I’m not sure if this is solvable efficiently. We could first look for solutions where the \(64-r\) high bits of \(k_i\) are \(0\), and if none are found we can relax to \(64-r-1\) high bits of \(0\), and so on. But does that even help?

Remember that \(C\) is a mixing constant and hence both \(C\) and \(C’\) are quite ‘random’ with lots of \(1\) bits. Maybe we can choose \(C\) to have good mixing but also simplify this inverse? Or can it always be done? I don’t know..

Another day of progress

  • Explored ways to solve the hash-inversion problem above. Seems that an \(O(64-r)\) solution is possible, and maybe even \(O( r)\) but that’s probably more annoying. (Either way both will be \(O(1)\) for \(n < 2^{64}\) ;) There are a lot of interesting things going on to be written down and formalized later.
  • Added a binary src/bin/bucket_sizes.rs -n <n> -c <c> -a <a> to print a nice table with number of buckets of each size and related statistics.
  • Replaced Vec::sort() by radsort::sort which is roughly \(2\times\) faster for uniform u64 hashes. TODO: Probably it will be faster to only sort by the highest \(n\) bits (or \(32\) bits), i.e. by only sorting on buckets, since most buckets will have size \(1\).
  • I’m playing with the idea of implementing some kind of interpolation sort algorithm that just inserts things directly in the right place in an array of Option<NonZero<usize>> of size \((1+\epsilon)n\) or maybe \(n + C \cdot \sqrt n\) and then runs a collect on this. Should work quite well I think.

TODO Possible sorting algorithms

Diving into the inverse hash problem

So we want to find the minimal \(k\) such that \[ (C\cdot k) \mm 2^{64} \in \{X \oplus A : 0 \leq A < 2^{64-r}\} \] for some given \(X\), i.e. the \(r\) high bits of \(C\times k\) (as a \(64\)-bit integer) should be equal to the \(r\) high bits of \(X\). Let \(\K\) be the set of \(k\) for which this holds, so that we are interested in \(\min \K\).

A first thought is: can we first find some valid \(k=k_0\), and then hope that the space of solutions \(\K\) has some kind of affine structure so that we can easily translate the initial \(k_0\) to a minimal solution? So let’s investigate the linear structure first: let’s find solutions that have \(r\) leading zeros, i.e. \[\K_0 := \{k : C \times k < 2^{r’}\},\] where I will use \(\times\) for \(64\)-bit multiplication and \(r’ := 64-r\).

We definitely expect \(\K_0\) to have some structure: if \(C\times k_1 < 2^{r’}\) and \(C\times k_2 < 2^{r’}\), then \(C\times (k_1 + k_2) < 2^{r’+1}\), which is almost good enough, and similarly, \(-2^{r’} \leq C\times (k_1 - k_2) \leq 2^{r’}\), so that either \(k_1 - k_2\) or \(k_2 - k_1\) is also a solution.

Thus, consider those \(k\in \K_0\) with small absolute value (modulo \(2^{64}\), so the absolute value \(|k|\) of \(k\) is \(\min(k, 2^{64}-k)\)). Then, each difference \(k_d\) between two consecutive elements of \(\K_0\) must occur in \(\K_0\) either as \(k_d\) or as \(-k_d\).

So let’s write some code to determine all possible differences. I’m doing this for \(32\)-bit integers first since we can quickly iterate over them:

use std::collections::HashSet;

type T = u32;
fn find_diffs_bruteforce(c: T) {
    for r in 1..T::BITS {
        let mut last = 0;
        let mut diffs = HashSet::new();
        for i in 0..=T::MAX {
            let ci = c.wrapping_mul(i);
            if ci.leading_zeros() >= r {
                // Skip the trivial difference at i = 0.
                if i != 0 {
                    diffs.insert(i - last);
                }
                last = i;
            }
        }
        eprintln!("r = {r:>2}: {diffs:?}");
    }
}

Result:

r =  1: {1, 2, 3}
r =  2: {3, 5, 8}
r =  3: {3, 11, 14}
r =  4: {11, 14, 25}
r =  5: {14, 39, 53}
r =  6: {39, 53, 92}
r =  7: {92, 145, 237}
r =  8: {92, 329, 421}
r =  9: {421, 513, 934}
r = 10: {421, 934, 1355}
r = 11: {1355, 2289, 3644}
r = 12: {1355, 3644, 4999}
r = 13: {3644, 8643, 12287}
r = 14: {8643, 29573, 38216}
r = 15: {8643, 46859, 55502}
r = 16: {55502, 64145, 119647}
r = 17: {55502, 119647, 175149}
r = 18: {175149, 230651, 405800}
r = 19: {405800, 1042251, 1448051}
r = 20: {405800, 2665451, 3071251}
r = 21: {405800, 3477051, 3882851}
r = 22: {3882851, 4288651, 8171502}
r = 23: {3882851, 27585757, 31468608}
r = 24: {3882851, 35351459, 39234310}
r = 25: {3882851, 39234310, 43117161}
r = 26: {43117161, 90117173, 133234334}
r = 27: {43117161, 133234334, 176351495}
r = 28: {133234334, 309585829, 442820163}
r = 29: {442820163, 1195226155}
r = 30: {442820163, 2966506807}
r = 31: {3852147133}

WOW: For each \(r\), it turns out there are only \(3\) possible differences between adjacent elements of \(\K_0\)! And also note that (when there are \(3\) distinct values) the largest one is always the sum of the smaller two. What is more: when we increase \(r\) by one, the new differences always include one or two of the previous differences, and some small linear combination of them. I.e. \((a,b,c) \mapsto (b, c, b+c)\) or \((a,b,c) \mapsto (a, a+c, b+c)\).

This inspires a much faster algorithm to find these possible differences: start with differences \(\Delta_0 := \{1\}\) for \(r=0\), and then for \(r\) from \(1\) to \(32\), to get \(\Delta_r\) from \(\Delta_{r-1}\), first make all small linear combinations of elements of \(\Delta_{r-1}\); then start with \(k=0 \in \K_0^r\), and keep increasing \(k\) by all the possible deltas to see if \(k+\delta\in \K_0^r\). If this is the case add \(\delta\) to \(\Delta_r\), and otherwise keep going. Once \(\Delta_r\) reaches size \(3\), we can stop, assuming our hypothesis that \(|\Delta_r| \leq 3\) is true. (TODO: This will require a proof eventually.) See the implementation here on GitHub.

Also, let’s have a look at what \(C \times \delta\) looks like in binary for all \(\delta \in \Delta_r\):

r = 1
         1  1011011110100011110100110010101
         2 10110111101000111101001100101010
         3    10011011101011011110010111111
r = 2
         3    10011011101011011110010111111
         5 11001011000110011000111111101001
         8 11011110100011110100110010101000
r = 3
         3    10011011101011011110010111111
        11 11110010000001010000100101100111
        14      101011110101100011000100110
r = 4
        11 11110010000001010000100101100111
        14      101011110101100011000100110
        25 11110111011111111100111110001101
r = 5
        14      101011110101100011000100110
        39 11111100111110101001010110110011
        53       10011101010101101111011001
r = 6
        39 11111100111110101001010110110011
        53       10011101010101101111011001
        92 11111111011011111111000110001100
r = 7
        92 11111111011011111111000110001100
       145        1111001010100110101100101
       237        1010101010011111011110001
r = 8
        92 11111111011011111111000110001100
       329         110001010011000001111101
       421           1101010010001000001001
r = 9
       421           1101010010001000001001
       513 11111111101001010001001110010101
       934 11111111110110100011010110011110
r = 10
       421           1101010010001000001001
       934 11111111110110100011010110011110
      1355             11110101011110100111
r = 11
      1355             11110101011110100111
      2289 11111111111010011000110101000101
      3644 11111111111110001110010011101100
r = 12
      1355             11110101011110100111
      3644 11111111111110001110010011101100
      4999             10000011110010010011
r = 13
      3644 11111111111110001110010011101100
      8643                10010000101111111
     12287 11111111111110100000011001101011
r = 14
      8643                10010000101111111
     29573 11111111111111000100100101101001
     38216 11111111111111010110101011101000
r = 15
      8643                10010000101111111
     46859 11111111111111101000110001100111
     55502 11111111111111111010110111100110
r = 16
     55502 11111111111111111010110111100110
     64145                 1100111101100101
    119647                  111110101001011
r = 17
     55502 11111111111111111010110111100110
    119647                  111110101001011
    175149                   10101100110001
r = 18
    175149                   10101100110001
    230651 11111111111111111101100100010111
    405800                      10001001000
r = 19
    405800                      10001001000
   1042251 11111111111111111110000110100111
   1448051 11111111111111111110010111101111
r = 20
    405800                      10001001000
   2665451 11111111111111111111001011000111
   3071251 11111111111111111111011100001111
r = 21
    405800                      10001001000
   3477051 11111111111111111111101101010111
   3882851 11111111111111111111111110011111
r = 22
   3882851 11111111111111111111111110011111
   4288651                       1111100111
   8171502                       1110000110
r = 23
   3882851 11111111111111111111111110011111
  27585757                        110100001
  31468608                        101000000
r = 24
   3882851 11111111111111111111111110011111
  35351459                         11011111
  39234310                          1111110
r = 25
   3882851 11111111111111111111111110011111
  39234310                          1111110
  43117161                            11101
r = 26
  43117161                            11101
  90117173 11111111111111111111111111011001
 133234334 11111111111111111111111111110110
r = 27
  43117161                            11101
 133234334 11111111111111111111111111110110
 176351495                            10011
r = 28
 133234334 11111111111111111111111111110110
 309585829                             1001
 442820163 11111111111111111111111111111111
r = 29
 442820163 11111111111111111111111111111111
1195226155                              111
r = 30
 442820163 11111111111111111111111111111111
2966506807                               11
r = 31
 442820163 11111111111111111111111111111111
3852147133                                1

We see that the first \(r\) bits of \(C\times \delta_r\) are either all \(0\) or all \(1\), so \(-2^{32-r} < C \times \delta_r < 2^{32-r}\).

So using this knowledge, how do we find \(\min \K\) for an arbitrary constant \(X\)? Well, we can start with \(k = 0\) and \(r=0\), and then keep trying to add elements of \(\Delta_{r+1}\) until the \(r\)’th-highest bit of \(C\times k\) is the same as the one in \(X\). Then we increase \(r\) and repeat until all highest \(r\) bits are correct.

fn find_inverse_fast(X: T, r: u32, diffs: &Vec<Vec<T>>) -> T {
    let mut k = 0;
    let mut rr = (C.wrapping_mul(k) ^ X).leading_zeros();
    'rr: while rr < r {
        for &d in &diffs[rr as usize] {
            let new_rr = (C.wrapping_mul(k + d) ^ X).leading_zeros();
            if new_rr >= rr {
                k += d;
                rr = new_rr;
                eprintln!(
                    "k+={d:10} = {k:10}: {:032b} {:032b}  {rr:>2}",
                    C.wrapping_mul(k),
                    C.wrapping_mul(k) ^ X
                );
                continue 'rr;
            }
        }
        unreachable!();
    }
    k
}

Running this on some random input for \(r=32\) gives for example:

r = 32                  X = 10110111100010000001010011000110

        delta            k               k*C                              (k*C)^X               r
k+=         1 =          1: 01011011110100011110100110010101 11101100010110011111110101010011   0
k+=         1 =          2: 10110111101000111101001100101010 00000000001010111100011111101100  10
k+=      1355 =       1357: 10110111101100110010101011010001 00000000001110110011111000010111  10
k+=       934 =       2291: 10110111100011010110000001101111 00000000000001010111010010101001  13
k+=      8643 =      10934: 10110111100011101000000111101110 00000000000001101001010100101000  13
k+=      8643 =      19577: 10110111100011111010001101101101 00000000000001111011011110101011  13
k+=      3644 =      23221: 10110111100010001000100001011001 00000000000000001001110010011111  16
k+=     55502 =      78723: 10110111100010000011011000111111 00000000000000000010001011111001  18
k+=    230651 =     309374: 10110111100010000000111101010110 00000000000000000001101110010000  19
k+=    405800 =     715174: 10110111100010000001001110011110 00000000000000000000011101011000  21
k+=    405800 =    1120974: 10110111100010000001011111100110 00000000000000000000001100100000  22
k+=   3882851 =    5003825: 10110111100010000001011110000101 00000000000000000000001101000011  22
k+=   3882851 =    8886676: 10110111100010000001011100100100 00000000000000000000001111100010  22
k+=   3882851 =   12769527: 10110111100010000001011011000011 00000000000000000000001000000101  22
k+=   3882851 =   16652378: 10110111100010000001011001100010 00000000000000000000001010100100  22
k+=   3882851 =   20535229: 10110111100010000001011000000001 00000000000000000000001011000111  22
k+=   3882851 =   24418080: 10110111100010000001010110100000 00000000000000000000000101100110  23
k+=   3882851 =   28300931: 10110111100010000001010100111111 00000000000000000000000111111001  23
k+=   3882851 =   32183782: 10110111100010000001010011011110 00000000000000000000000000011000  27
k+= 133234334 =  165418116: 10110111100010000001010011010100 00000000000000000000000000010010  27
k+= 133234334 =  298652450: 10110111100010000001010011001010 00000000000000000000000000001100  28
k+= 133234334 =  431886784: 10110111100010000001010011000000 00000000000000000000000000000110  29
k+=1195226155 = 1627112939: 10110111100010000001010011000111 00000000000000000000000000000001  31
k+= 442820163 = 2069933102: 10110111100010000001010011000110 00000000000000000000000000000000  32

In this example we are somewhat lucky and \(k=2\) happens to have the \(10\) highest bits correct, but generally the number of correct bits grows by a few each step.

I have written some code to verify the correctness and minimality of the result, and indeed results are always identical to a bruteforce search starting at \(k=0\).

We can also make some statistics on the value of \(\min \K^r\) compared to \(2^r\). Since \(|\K^r| = 2^{64-r}\) we expect the smallest solution to be around \(2^r / 2\). In practice this seems to be mostly the case, but not quite exactly. Over a total of \(10^8\) samples (which takes \(40\) seconds) we get the following ratios of \(k/2^r = \min(\K^r)/2^r\):

     r  avg(k/2^r) max(k/2^r)
r =  0:      0.000      0.000
r =  1:      0.250      0.500
r =  2:      0.375      0.750
r =  3:      0.500      1.000
r =  4:      0.694      2.375
r =  5:      0.741      1.750
r =  6:      0.538      1.922
r =  7:      0.600      1.422
r =  8:      0.588      1.645
r =  9:      0.644      1.539
r = 10:      0.564      1.718
r = 11:      0.536      1.245
r = 12:      0.552      1.677
r = 13:      0.549      1.365
r = 14:      0.553      1.784
r = 15:      0.527      2.319
r = 16:      0.762      2.060
r = 17:      0.599      1.475
r = 18:      0.571      1.991
r = 19:      0.501      1.309
r = 20:      4.275     16.086
r = 21:      4.580     12.026
r = 22:      3.083      7.008
r = 23:      1.763      3.691
r = 24:      0.939      1.908
r = 25:      0.515      1.923
r = 26:      0.801      2.416
r = 27:      0.671      1.692
r = 28:      0.556      1.571
r = 29:      0.526      1.209
r = 30:      0.529      1.602
r = 31:      0.645      1.798
r = 32:      0.513      1.148
r = 33:      0.515      1.473
r = 34:      0.934      2.784
r = 35:      0.865      2.160
r = 36:      0.561      1.208
r = 37:      0.532      1.748
r = 38:      0.584      1.446
r = 39:      0.555      1.732
r = 40:      0.603      3.102
r = 41:      1.291      4.149
r = 42:      1.100      2.724
r = 43:      0.690      1.470
r = 44:      0.902      4.465
r = 45:      1.530      4.437
r = 46:      1.191      2.770
r = 47:      0.721      1.569
r = 48:      0.560      1.523
r = 49:      0.525      1.154
r = 50:      0.600      1.534
r = 51:      0.516      1.823
r = 52:      0.598      1.439
r = 53:      0.535      1.703
r = 54:      0.578      1.343
r = 55:      0.583      1.589
r = 56:      0.522      1.130
r = 57:      0.614      1.528
r = 58:      0.523      1.329
r = 59:      0.849      2.757
r = 60:      0.777      1.902
r = 61:      0.502      1.082
r = 62:      0.565      1.492
r = 63:      0.532      1.254
r = 64:      0.500      1.000

Indeed, the average value of \(k/2^r\) is typically just above \(0.5\) or at least below \(1.0\), but occasionally jumps to above \(4\), in particular around \(r\in \{20, 21, 22, 23\}\). I’m really not quite sure why, but this particular range seems to have something weird going on, as also seen by the maximum \(k/2^r\) ratio of \(16\). (Note that this is not a bug with our fast inversion algorithm. The bruteforce algorithm confirms these results.)

Most likely this is caused in some way by the presence of a very small \(\delta_r\) for these \(r\) (marked with a *), and this correlating with all \(C\times \delta_r\) not only being \(< 2^{64-r}\) (in absolute value) but being either very close to it or very small.

     r       delta_r                                                        C*delta_r
r = 19
              164249                    111111111101001111010010100101010111100001101
              521986 1111111111111111111111110111101110100011000101011001011000101010
              686235                    111110111011000011101011010000100010100110111
r = 20
        *     521986 1111111111111111111111110111101110100011000101011001011000101010
            16345815                     11111111001100111001111011111101111000100011
            16867801                     11110110111011011101000001010111010001001101
r = 21
        *     521986 1111111111111111111111110111101110100011000101011001011000101010
            24697591                      1111010110101101011010010010100000011000011
            25219577                      1110010100100001110010111101101011011101101
r = 22
        *     521986 1111111111111111111111110111101110100011000101011001011000101010
            28873479                       111000101010000011111101011111001000010011
            29395465                       110000011000100111000010111000100000111101
r = 23
        *     521986 1111111111111111111111110111101110100011000101011001011000101010
            30439437                        11111110101101101001101101011010010010001
            30961423                        10111100100010000010011000100101010111011
r = 24
        *     521986 1111111111111111111111110111101110100011000101011001011000101010
            31483409                         1111010010110011011000011110000011100101
            32005395                          111000001010110011101110111011100001111
r = 25
            32005395                          111000001010110011101110111011100001111
            32527381 1111111111111111111111111110101111111001100011010000110100111001
            64532776                          101110001010000000001001000010001001000

Conclusion: We now have a fast \(O(r)\) algorithm to solve \(C\times k = X \lxor A\) with \(0\leq A < 2^{64-r}\). But this does not yet solve the full problem, since we actually wanted \[ \frac{2^{64} p}{n} \leq X \lxor (C\times k) < \frac{2^{64}(p+1)}{n} \] and we did some handwaving to reduce that to fixing the topmost \(r\) bits of \(C\times k\).

Bringing it home

Let’s start with a (slightly pathological, but illustrative) \(32\)-bit example: \(n=511\), \(p=127\), so we want \[ \frac{2^{32}\cdot 127}{511} \leq X \lxor (C\times k) < \frac{2^{32} \cdot 128}{511}. \] In binary, the lower and (inclusive) upper bound are:

low : 0011 1111 1001 1111 1100 1111 1110 0111
high: 0100 0000 0010 0000 0001 0000 0000 1000
lcp : 0  (r=1)

These have a longest common prefix 0 of length \(r=1\), so we could now use the solution above to enumerate solutions \(k\) for \(r=1\) and then for each of them check if \(X \lxor (C\times k)\) is indeed in this interval. But this is quite inefficient, as the actual interval only has length \(2^{32}/511 \approx 2^{23}\), so that each possible \(k\) only works with probability \(2^{23} / 2^{32 - 1} = 2^{23-31} = 2^{-8} = 1/256\), indicating that we may have to try a lot of distinct \(k\).

Instead, we can split the interval into two ranges:

low  = r1 start: 0011 1111 1001 1111 1100 1111 1110 0111
       r1 end  : 0011 1111 1111 1111 1111 1111 1111 1111
       lcp     : 0011 1111 1   (r=9)

       r2 start: 0100 0000 0000 0000 0000 0000 0000 0000
high = r2 end  : 0100 0000 0010 0000 0001 0000 0000 1000
       lcp     : 0100 0000 00  (r=10)

Both ranges \(r_1\) and \(r_2\) now cover more than half of the values with the prefix of length \(r=9\) resp. \(r=10\) bits, so a suitable \(k\) should be found within a few tries in each case, and we can simply take the minimum of the two solutions.

One additional optimization that we can do is that instead of computing solutions for the two ranges independently, we can first compute a \(k_0\) that gives the right bits in the LCP of low and high, and then extend from there towards the two disjoint intervals.

The final code is as follows (and includes some more handling of edge cases):

/// Solve FastReduce(x ^ MH(k), n) == p efficiently.
fn invert_fr64_fast(x: Hash, n: usize, p: usize, diffs: &Vec<Vec<T>>) -> u64 {
    // low = 2^64 * p/n <= x^FR(k) < 2^64 * (p+1)/n = high+1
    let low = ((1u128 << 64) * p as u128 / n as u128) as u64;
    // high is an inclusive bound:  (2^64 * (p+1) - 1)/n
    let high = (((1u128 << 64) * (p + 1) as u128 - 1) / n as u128 - 1) as u64;

    // In this case the partitioning into two intervals doesn't work.
    if low == high {
        return find_inverse_fast_with_test(
            low ^ x.get(),
            64, // r
            |k| {
                let xck = x.get() ^ C.wrapping_mul(k);
                low <= xck && xck < high
            },
            0, // k0
            diffs,
        );
    }

    // Split [low, high] into two pieces [low, low_end] and [high_start, high]
    // that have (much) longer LCP.
    let lcp = (low ^ high).leading_zeros();

    // First find the solution for the LCP.
    let k0 = find_inverse_fast(low ^ x.get(), lcp, diffs);

    let low_end = low | ((1u64 << (63 - lcp)) - 1);
    let high_start = low_end + 1;

    let low_lcp = (low ^ low_end).leading_zeros();
    let high_lcp = (high_start ^ high).leading_zeros();

    let test = |k| low <= x.get() ^ C.wrapping_mul(k) && x.get() ^ C.wrapping_mul(k) < high;

    let low_k = find_inverse_fast_with_test(low ^ x.get(), low_lcp, test, k0, diffs);
    let high_k = find_inverse_fast_with_test(high_start ^ x.get(), high_lcp, test, k0, diffs);

    min(low_k, high_k)
}

This is \(O(64)\) and takes around a minute to invert \(10^8\) hashes.

Conclusion: We can now efficiently find \(k_i = O(n)\) such that \(h_1(x) \lxor h_2(k_i)\) maps to a chosen free slot (when \(h_2\) is a FastReduce instance). This should allow us to fill the last slots of the table much faster.

Hash-inversion for faster PTHash construction

So now we have a fast way to find \(k_i\) for the tail: the last \(t\) buckets. We will assume that these buckets all have size \(1\). (Otherwise decrease \(t\).) Let \(F\) be the set of free positions once the head of \(m-t\) buckets has been processed. We always have \(|F| \geq t\), and when \(\alpha = 1\) we have \(|F| = t\). We can then implement two strategies:

Sequential
Iterate over buckets and free slots in parallel, matching each bucket to a slot. Then compute the \(k_i\) that sends each bucket to the corresponding free slot. This will give \(k_i\sim n\) in expectation, uses \(t \cdot \log_2(n)\) bits in total, and runs in \(O(t)\).
Greedy
For each bucket (in order), compute \(k_i(f)\) for each candidate slot \(f\), and choose the minimal value. When \(\alpha=1\) this gives \(k_i \sim n/f\) and runs in \(O(t^2)\). The total number of bits is \[ \sum_{f=1}^t \log_2(n/f) = t\log_2(n) - \log_2(t!) \sim t (\log_2(n) - \log_2(t)) \] For \(t=O(\sqrt n)\), this saves up to half the bits for these numbers.
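
To make the two strategies concrete, here is a minimal sketch of both (my own, not the ptr-hash code), assuming a hypothetical helper k_for_slot(hx, f) that returns the smallest \(k\) sending hash hx to free slot f (for instance via the hash inversion above), and representing hashes as plain u64:

/// Sequential: match bucket i to free slot i; O(t), k_i ~ n in expectation.
fn assign_sequential(
    tail: &[u64],
    free: &[usize],
    k_for_slot: impl Fn(u64, usize) -> u64,
) -> Vec<u64> {
    tail.iter().zip(free).map(|(&hx, &f)| k_for_slot(hx, f)).collect()
}

/// Greedy: for each bucket pick the free slot giving the smallest k; O(t^2),
/// but the resulting k_i are smaller, roughly n/f when f free slots remain.
fn assign_greedy(
    tail: &[u64],
    free: &[usize],
    k_for_slot: impl Fn(u64, usize) -> u64,
) -> Vec<u64> {
    let mut free = free.to_vec();
    tail.iter()
        .map(|&hx| {
            // Find the remaining free slot with the minimal k for this bucket.
            let (best, k) = free
                .iter()
                .enumerate()
                .map(|(i, &f)| (i, k_for_slot(hx, f)))
                .min_by_key(|&(_, k)| k)
                .unwrap();
            free.swap_remove(best);
            k
        })
        .collect()
}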

In some quick experiments with \(n=10^8\), the sequential strategy seems to give at most around a \(15\%\) speedup (\(35\) to \(30\) seconds for \(t=10000\)), which is not as much as I had hoped. This seems to be because a relatively large amount of time is also spent on finding \(k_i\) for the last buckets of size \(2\) and \(3\).

Fast path for small buckets

For small buckets (size \(\leq 4\)) it pays off to use a code path that knows the explicit bucket size and processes a fixed size &[Hash; BUCKET_SIZE] array instead of an arbitrarily sized slice &[Hash]. This allows for better code generation.

TODO Dictionary encoding

The dictionary will be quite dense for numbers up to some threshold (say \(1024\)), and sparser afterwards. We can encode the small numbers directly and only do the dictionary lookup for larger ones.

  • TODO: Figure out if the branch is worth the savings of the lookup.
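
As a concrete (hypothetical) sketch of this hybrid encoding, assuming a sorted dictionary of the distinct values at or above the threshold:

const THRESHOLD: u64 = 1024;

/// Values below THRESHOLD are stored as-is; larger values are stored as
/// THRESHOLD + their index in a sorted dictionary of the remaining distinct values.
struct HybridDict {
    dict: Vec<u64>, // sorted distinct values >= THRESHOLD
}

impl HybridDict {
    fn encode(&self, v: u64) -> u64 {
        if v < THRESHOLD {
            v
        } else {
            // Assumes v was added to the dictionary during construction.
            THRESHOLD + self.dict.binary_search(&v).unwrap() as u64
        }
    }

    fn decode(&self, code: u64) -> u64 {
        if code < THRESHOLD {
            code // fast path: no dictionary lookup
        } else {
            self.dict[(code - THRESHOLD) as usize] // the branch + lookup in question
        }
    }
}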

TODO Larger buckets

The largest bucket should be able to have size \(O(\sqrt n)\) without issues. From there it should slowly decay (TODO: figure out the math) to constant size. This could put the elements that are currently in the largest \(\sim 1\%\) of buckets all together in a few buckets, reducing the average size of the remaining buckets. The reduction seems only minimal though, so this may not give too much benefit.

One way of achieving such a skewed distribution might be to replace the partitioning of \(h\in [0, 2^{64})\) in \(m\) chunks, by a partitioning of \(h^2 \in [0, 2^{128})\) in \(m\) chunks.
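
A rough sketch of what that could look like (my own, and only approximate, since it uses just the top \(64\) bits of \(h^2\)):

/// Map hash h to one of m buckets via h^2 instead of h, skewing the distribution:
/// early buckets receive many more elements, late buckets fewer than average.
fn bucket_squared(h: u64, m: usize) -> usize {
    // Top 64 bits of h^2, i.e. floor(h^2 / 2^64); partitioning these into m chunks
    // approximates partitioning h^2 in [0, 2^128) into m chunks.
    let h2_hi = ((h as u128 * h as u128) >> 64) as u64;
    ((h2_hi as u128 * m as u128) >> 64) as usize
}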

TODO Prefetching free slots

Looking up whether slots in the array for a certain \(k_i\) are free is quite slow and memory bound. Maybe we can prefetch values for a few \(k_i\) ahead.

Also, the computation of position could be vectorized.

Filling the last few empty slots needs very high \(k_i\)!

Have a look at the last two columns of this table, indicating the number of newly appearing \(k_i\) in each class and total number of distinct \(k_i\) up to then:

sz          cnt bucket%  cuml%  elem%  cuml%     avg ki     max ki     new ki       # ki
25:           1    0.00   0.00   0.00   0.00        0.0          0          1          1
24:           1    0.00   0.00   0.00   0.00        0.0          0          0          1
23:           5    0.00   0.00   0.00   0.00        0.0          0          0          1
22:          11    0.00   0.00   0.00   0.00        0.0          0          0          1
21:          63    0.00   0.00   0.00   0.00        0.0          0          0          1
20:         181    0.00   0.00   0.00   0.01        0.0          0          0          1
19:         566    0.00   0.00   0.01   0.02        0.0          1          1          2
18:        1494    0.01   0.01   0.03   0.04        0.0          1          0          2
17:        4143    0.02   0.02   0.07   0.11        0.0          2          1          3
16:       10119    0.04   0.06   0.16   0.28        0.0          2          0          3
15:       23730    0.09   0.15   0.36   0.63        0.1          3          1          4
14:       52393    0.20   0.35   0.73   1.36        0.2          6          3          7
13:      107711    0.41   0.76   1.40   2.77        0.4          8          2          9
12:      205081    0.78   1.54   2.46   5.23        0.8         14          6         15
11:      359440    1.36   2.90   3.95   9.18        1.5         27         10         25
10:      581154    2.21   5.11   5.81  14.99        2.9         57         23         48
 9:      854333    3.24   8.35   7.69  22.68        6.0        123         44         92
 8:     1149239    4.36  12.72   9.19  31.87       12.7        264         95        187
 7:     1418308    5.38  18.10   9.93  41.80       25.6        539        181        368
 6:     1685178    6.40  24.50  10.11  51.91       46.4        873        288        656
 5:     2096913    7.96  32.46  10.48  62.40       73.6       1578        391       1047
 4:     2880342   10.94  43.40  11.52  73.92      107.5       2276        580       1627
 3:     4046307   15.36  58.76  12.14  86.06      150.3       4334        956       2583
 2:     4887755   18.56  77.31   9.78  95.83      171.2       8071       1066       3649 *
 1:     4166126   15.82  93.13   4.17 100.00      382.2  126928083      14084      17733 *
 0:     1809532    6.87 100.00   0.00 100.00        0.0          0          0      17733

The buckets of size \(2\) and larger only need \(3649\) distinct \(k_i\) which can be represented using \(12\) bits each with dictionary encoding. But after the last single-element buckets we need \(17733\) distinct \(k_i\) which would require \(15\) bits! If we instead use \(\alpha=0.99\) things look much better:


sz          cnt bucket%  cuml%  elem%  cuml%     avg ki     max ki     new ki       # ki
27:           1    0.00   0.00   0.00   0.00        0.0          0          1          1
26:           1    0.00   0.00   0.00   0.00        0.0          0          0          1
24:           4    0.00   0.00   0.00   0.00        0.0          0          0          1
23:           4    0.00   0.00   0.00   0.00        0.0          0          0          1
22:          20    0.00   0.00   0.00   0.00        0.0          0          0          1
21:          53    0.00   0.00   0.00   0.00        0.0          0          0          1
20:         165    0.00   0.00   0.00   0.01        0.0          1          1          2
19:         514    0.00   0.00   0.01   0.01        0.0          1          0          2
18:        1378    0.01   0.01   0.02   0.04        0.0          1          0          2
17:        3736    0.01   0.02   0.06   0.10        0.0          2          1          3
16:        9386    0.04   0.06   0.15   0.25        0.0          2          0          3
15:       22223    0.08   0.14   0.33   0.59        0.1          3          1          4
14:       49298    0.19   0.33   0.69   1.28        0.2          5          2          6
13:      102501    0.39   0.71   1.33   2.61        0.4          7          2          8
12:      196895    0.74   1.45   2.36   4.97        0.7         14          7         15
11:      348772    1.31   2.76   3.84   8.81        1.3         22          8         23
10:      568406    2.14   4.90   5.68  14.49        2.7         49         19         42
 9:      843818    3.17   8.07   7.59  22.09        5.4        118         45         87
 8:     1143451    4.30  12.37   9.15  31.23       11.4        210         94        181
 7:     1426213    5.36  17.74   9.98  41.22       22.9        395        160        341
 6:     1701482    6.40  24.14  10.21  51.43       41.3       9910        261        602
 5:     2111700    7.94  32.08  10.56  61.99       65.0       1494        332        934
 4:     2897775   10.90  42.98  11.59  73.58       93.5       2247        472       1406
 3:     4079515   15.34  58.32  12.24  85.82      125.3       3209        752       2158
 2:     4964046   18.67  76.98   9.93  95.74      126.5       4074        437       2595 *
 1:     4256736   16.01  92.99   4.26 100.00       38.6       1061          0       2595 *
 0:     1863589    7.01 100.00   0.00 100.00        0.0          0          0       2595

Now no new pilot values are needed at all for the single-element buckets, and \(12\) bits are enough for each value, saving \(20\%\) of memory!

Looking closer at the buckets of size \(1\), we see that most new values are added for the last few elements, where large \(k_i\) are needed.

1:      263401    1.00  79.00   0.26  96.28       25.0        355          0       3649
1:      263401    1.00  80.00   0.26  96.54       27.0        364          0       3649
1:      263402    1.00  81.00   0.26  96.80       29.3        381          0       3649
1:      263401    1.00  82.00   0.26  97.07       31.8        429          0       3649
1:      263401    1.00  83.00   0.26  97.33       35.1        463          0       3649
1:      263401    1.00  84.00   0.26  97.60       38.7        562          0       3649
1:      263402    1.00  85.00   0.26  97.86       43.4        623          0       3649
1:      263401    1.00  86.00   0.26  98.12       49.3        630          0       3649
1:      263401    1.00  87.00   0.26  98.39       57.1        720          0       3649
1:      263401    1.00  88.00   0.26  98.65       67.2        896          0       3649
1:      263402    1.00  89.00   0.26  98.91       82.0       1160          0       3649
1:      263401    1.00  90.00   0.26  99.18      105.0       1333          0       3649
1:      263401    1.00  91.00   0.26  99.44      146.4       2047          0       3649
1:      263401    1.00  92.00   0.26  99.70      240.6       3272          0       3649
1:      263402    1.00  93.00   0.26  99.97      824.0      33353       3762       7411 *
1:      263401    1.00  94.00   0.03 100.00     4227.2  126928083      10322      17733 *

Perfect matching for the tail

Let’s for now only consider some tail of length \(t\) of buckets of size \(1\). We already saw two techniques to assign them: sequentially and greedily (giving the minimal \(k_i\)).

We also already discussed the possibility of cuckoo hashing, where inserting an element can ’evict’ already placed elements. This seems to need a bit more bookkeeping though, as for each placed element we must track the bucket it belongs to.

Another more complex way would be a perfect matching: Compute \(k_i(f)\) for each bucket \(i\) and each free slot \(f\) and find the weighted bipartite matching assigning buckets to free slots that minimizes e.g. \(\sum \log_2 k_i(f)\) (the total number of bits) or \(\max k_i(f)\) (the maximal number of bits).

This last case is interesting: we could fix a maximal \(K=2^b\) up front and then find all possible \(k_i < K\) for all buckets by simply trying them all. Then we find a perfect matching (if one exists) and use these \(k_i\) for the final buckets.

This raises the question:

Let \(G\) be a bipartite graph on \(|A|+|B|=n+n\) vertices. Let each vertex in \(A\) have degree \(d\) edges towards random vertices in \(B\). What is the smallest \(d\) for which a perfect matching exists with high probability?

If we can answer this, then we can find the \(d\) smallest \(k_i\) solutions for the last \(t\) buckets and solve the perfect matching problem. In this case all final \(k_i\) would be \(O(d\cdot n/t)\), for \(O(t \log_2(dn/t))\) bits total.

A closely related question is answered on stackoverflow which implies that we should take \(d = c + \ln n\).

I implemented this and it seems to work quite well! Take for example the tail of \(n=10^8\), \(c=7\), \(\alpha=1\) without this optimization:

n: 100000000
m: 26340126
sz          cnt bucket%  cuml%  elem%  cuml%     avg ki     max ki     new ki       # ki
...
 1:      263401    1.00  79.00   0.26  96.28       25.0        390          0       3680
...
 1:      263401    1.00  91.00   0.26  99.44      146.2       1626          0       3680
 1:      263401    1.00  92.00   0.26  99.70      241.8       3699          3       3683
 1:      263402    1.00  93.00   0.26  99.97      835.7      21967       3825       7508
 1:      263401    1.00  94.00   0.03 100.00     4135.5   52284963      10187      17695
  :    26340126  100.00 100.00 100.00 100.00      137.3   52284963      17695      17695

and now the same where we run perfect matching on the last \(500000\) buckets/slots where we find all \(k_i\) below \(2^{12} = 4096\):

n: 100000000
m: 26340126
 sz          cnt bucket%  cuml%  elem%  cuml%     avg ki     max ki     new ki       # ki
...
  1:      263401    1.00  79.00   0.26  96.28       25.0        418          0       3662
...
  1:      263401    1.00  91.00   0.26  99.44      146.0       1819          0       3662
  1:      263401    1.00  92.00   0.26  99.70      299.7       4094        474       4136
  1:      263402    1.00  93.00   0.26  99.97      753.4       4095         62       4198
  1:      263401    1.00  94.00   0.03 100.00      193.9       4095          0       4198
   :    26340126  100.00 100.00 100.00 100.00       97.7       6655       4198       4198

Observe that:

  • The maximum \(k_i\) overall is now \(6655\) instead of \(50M\) (the value for one of the last buckets of size \(2\)), requiring half the bits to store.
  • The number of distinct \(k_i\) has gone down from \(17695\) to \(4198\), requiring one less bit to store.
  • Both the maximum \(k_i\) and the number of distinct \(k_i\) are between \(2^{12}=4096\) and \(2^{13}=8192\). I.e. dictionary compression on the full array does not provide any additional benefits anymore! (I’m not yet sure whether separate front-back encoding would provide benefits.)

For reference, here are the corresponding results when using \(\alpha = 0.99\) without tail optimizations.

 sz          cnt bucket%  cuml%  elem%  cuml%     avg ki     max ki     new ki       # ki
...
  1:      265917    1.00  91.00   0.27  99.48       60.5        804          0       2595
  1:      265917    1.00  92.00   0.27  99.74       72.2        887          0       2595
  1:      265917    1.00  93.00   0.26 100.00       86.4       1342          0       2595
   :    26591682  100.00 100.00 100.00 100.00       69.0       4431       2595       2595
  • Both the maximum \(k_i\) and number of distinct \(k_i\) are a bit smaller here, mostly because the buckets of size \(2\) are much easier to place.

Conclusion: This works extremely well to handle the last size-\(1\) buckets. It seems to be a bit slower for now, but I think this can be fixed. (The matching itself is super fast – it’s rather that we now need to search up to a threshold for all buckets, rather than only for the last few buckets.)

TODO: See if we can make something similar work for the tail of buckets of size \(2\). There we don’t need perfect matching though, and simpler techniques such as peeling may suffice (as used in XOR-filters (Graf and Lemire 2020)). This is also somewhat similar to cuckoo hashing (Fan et al. 2014), which has a similar result but uses a slightly less structured approach. In particular Cuckoo hashing does a greedy ‘DFS’ without backtracking for augmenting paths, compared to matching/flow algorithms which do a normal DFS and are able to ‘reset’ \(k_i\) back to the lowest solution after an augmenting path through them has been found.

Peeling for size-1 buckets

Instead of full matching, let’s try peeling to match the size-1 buckets: Again we first choose a threshold \(K=2^b\) and find all suitable \(k_i < K\) for each bucket by brute force. Also as before, consider the induced bipartite graph.

Suppose that there is a vertex \(u\) of degree \(1\). Then we know that the single edge incident to it (say \(uv\)) must be part of the final matching, so we can greedily take it and remove any other edges incident to \(v\).

We can repeat this process: repeatedly choose a vertex \(u\) of minimal remaining degree, choosing a random edge \(uv\) incident to it, add \(uv\) to the matching, and remove any other edges incident to \(u\) and \(v\).
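
A minimal sketch of this peeling procedure (my own reconstruction, simplified to only peel from the bucket side and to pick the first free slot instead of a random one):

use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// edges[u] lists the candidate free slots for bucket u (slots reachable with some k < K).
/// Returns the chosen slot per bucket, or None for buckets left without any free edge.
fn peel(num_slots: usize, edges: &[Vec<usize>]) -> Vec<Option<usize>> {
    // Reverse adjacency: which buckets have an edge to each slot.
    let mut slot_to_buckets: Vec<Vec<usize>> = vec![Vec::new(); num_slots];
    for (u, es) in edges.iter().enumerate() {
        for &v in es {
            slot_to_buckets[v].push(u);
        }
    }
    let mut slot_taken = vec![false; num_slots];
    let mut degree: Vec<usize> = edges.iter().map(|e| e.len()).collect();
    let mut matched: Vec<Option<usize>> = vec![None; edges.len()];
    // Min-heap on remaining degree; outdated entries are skipped when popped.
    let mut heap: BinaryHeap<(Reverse<usize>, usize)> =
        degree.iter().enumerate().map(|(u, &d)| (Reverse(d), u)).collect();
    while let Some((Reverse(d), u)) = heap.pop() {
        if matched[u].is_some() || d != degree[u] {
            continue; // already matched, or stale heap entry
        }
        // Take the first still-free slot incident to u, if any remains.
        let Some(&v) = edges[u].iter().find(|&&v| !slot_taken[v]) else {
            continue; // u is stuck: all its candidate slots were taken
        };
        matched[u] = Some(v);
        slot_taken[v] = true;
        // Removing slot v lowers the remaining degree of every other bucket that could use it.
        for &w in &slot_to_buckets[v] {
            if matched[w].is_none() {
                degree[w] -= 1;
                heap.push((Reverse(degree[w]), w));
            }
        }
    }
    matched
}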

I was hoping that this would result in a perfect matching with high probability, but it seems that in practice this procedure typically leaves behind a few percent of vertices without any remaining edges. One example situation where this ‘deadlock’ could occur is when we have two fully connected subgraphs \(A=K_{x,x}\) and \(B=K_{y,y}\), and additionally a few extra edges from \(a \in A\) to \(B\). In this case, we must never take these ‘cross’ edges, but the algorithm described above doesn’t have such non-local information and could create an imbalance.

Not all hope is lost though:

  1. This approach may actually work well for buckets of size \(2\) and larger, where there are spare empty slots and we are not looking for a perfect matching.
  2. The partial matching found by peeling could be input to Dinic’s full matching algorithm, possibly reducing the number of iterations needed and providing overall speedup. But on the other hand, Dinic also matches all but a few percent of elements in its first round, so it’s not clear if it would help much.

Greedy peeling 1: Assigning from hard to easy

A simpler alternative to peeling is the following. As before choose a number of bits \(b\) and threshold \(K = 2^b\), and find all solutions \(k_i < K\) for all buckets of size \(2\). Then sort the buckets by increasing number of solutions. This leaves the buckets with more freedom of choice for last, hopefully giving them a larger probability of finding a match.

I had high hopes for this, but it seems to not work great. As expected it reduces the maximum \(k_i\) required, but (for \(n=10^8\)) it only goes down from \(3000\) to \(2500\), which is not even enough to save a bit in a direct representation.

TODO Peeling and cuckoo hashing for larger buckets.

We can process the buckets in various ways:

  1. By increasing number of total edges, as in previous section.
  2. By increasing number of remaining edges.
  3. By increasing number of remaining free slots covered by all edges.

This last method sounds more complicated than the second one, but is simpler in practice: Whenever we fill a slot we simply decrease all buckets mapping to it by \(1\), instead of having to check first whether those edges are still alive.

Sunday morning ideas

Dinic

Possible speedups for Dinic algorithm:

  • It’s never needed to visit vertices with level \(\geq level(t)\) (other than \(t\) itself). Do such vertices exist? How many of them? Would it speed things up to skip over them?
    • Answer: Yes, they exist. But when matching \(1M\) buckets, there are only around \(100\) at level \(>level(t)\). Sometimes there are quite some (up to \(n/2\)) at level \(=level(t)\) though, so we could skip those.

      So during level computation, we can stop as soon as we determine the level of \(t\), and leave all other levels at \(-1\), so they will never be processed during augmentation, which only visits vertices \(v\) with \(level(v) > level(u)\).

      This saves up to \(25\%\) of time when matching \(1M\) buckets.

  • For vertices in the left component at \(level(u) = level(t)-2\), instead of doing a DFS, we could first check if any of the outgoing \(v\) has a free edge to \(t\), by storing all taken/free \(v\) in a bitmask.
    • Done: Instead I implemented a check that for vertices in level \(level(t)-1\), we only check the edge to \(t\). This saves around \(30\%\) time of the augment step, and around \(15\%\) time of the entire Dinic algorithm.

Together these two optimizations bring the runtime on \(1M\) elements down from around \(3.5\) seconds to \(2.5\) seconds: \(30\%\) speedup.

New iterative greedy assignment idea

Possible improvement to Greedy peeling 1:

  • Run the greedy linear-pass algorithm once.

    • Then fix all the remaining unmatched ones to their first outgoing edge.
    • Rerun the algorithm starting with these ‘hard cases’ already fixed.
    • Repeat.

    Will this converge? How many iterations? Is it better to fix hard cases on the fly, evicting previous ones as soon as the hard one is found?

Answer: Around 10-20 iterations are typically needed. These iterations are relatively fast, but iterating to find all solutions below some threshold is quite slow.

We can try different sorting orders in which the buckets are processed:

  • no sorting,
  • by number of solutions \(k_i\) below the threshold (going from few to many),
  • by smallest solution \(k_i\) (going from high to low).

Experiments show that sorting by number of solutions is the best, but this requires precomputing all solutions below the threshold which is quite costly in practice. Sorting by smallest solution is probably a better idea in the end, since this is something that needs to be computed for each bucket anyway.

Cuckoo hashing, again

The reason that this keeps coming up is because I think it will work very well, but I don’t yet see exactly how to implement this in a sufficiently nice/clean way. So I procrastinate by first implementing simpler (to implement) ideas.

Basically it’s the same as the iterative greedy assignment idea above, but displacing others on the fly, doing a random walk DFS until success.

  • TODO: Do we choose outgoing edges at random? sequentially (wrapping around)? Always starting at the start of the list (leading to smaller \(k_i\))?
  • Prefer displacing edges that only include one occupied slot. (Or if possible better, never take an edge with \(2\) (or more) slots already taken.)

Some statistics for bucket size \(2\) (\(n=10^8\)):

  • \(\sim20\%\) of the buckets has size \(2\), and \(\sim20\%\) of the buckets has size \(1\).
  • Bucket size \(2\) covers elements from percentile \(85\%\) to \(95\%\).
  • The ’early’ buckets have success rate one in \(1/0.15^2 \sim 45\).
  • The ’late’ buckets have success rate one in \(1/0.05^2 \sim 400\).
  • So only \(1\) in \(9\) that works initially still works at the end.
  • This is problematic, because even though in expectation most will find a solution with \(20\) candidates, some of the \(f = O(10^6)\) buckets will be unlucky and will need more like \(O(9\cdot \ln f)\) candidates, which corresponds to quite large \(k_i\) actually (\(45\cdot 9\cdot \ln f \sim 4000\)).
  • Basically, all these cuckoo/displacement strategies are really mostly needed for the few buckets that lie a factor \(\ln f\) (or say more than \(3\times\)) above the average, so if we can fix those the rest should be relatively OK.
  • Coming back to cuckoo hashing: Each candidate \(k_i\) maps the two elements to the \(15\%\) remaining slots. After most size-\(2\) buckets have been assigned, \(5\%\) of slots still remains empty. That means that every candidate edge consisting of \(2\) slots in this \(15\%\) has quite a high probability of including at least one empty slot: \(1-(2/3)^2 = 55\%\). So basically cuckoo hashing has a very high probability of finding edges with only one slot already filled, and \(1/9\) of the edges has both slots empty. So I think the DFS only needs to see roughly \(10\) edges in total before success, which is very little actually!

Cuckoo hashing / displacing, for real now

Ok so I implemented a quick first version and, lo and behold: it works great!

So for completeness the algorithm I run now:

  1. Put all buckets of fixed size (from \(2\) to \(5\) or so, but not \(1\) (TODO: maybe \(1\) works as well?)) on a stack.
  2. While the stack is not empty: find a \(k_i\) for the top of the stack that maps to all empty slots.
  3. If not possible, find a \(k_i\) that minimizes the number of collisions. Drop the colliding buckets from the taken slot array and push those buckets on the stack.
  4. Repeat until the stack is empty.

One tricky point is how to break ties when several \(k_i\) give the minimal number of collisions. Always choosing the smallest such \(k_i\) can lead to cycles where two buckets keep displacing each other. What seems to work: choose the first \(k_i\) after any existing \(k_i\) (modulo \(2^b\)).

For \(n=10^7\):

  • the \(450k\) buckets of size \(3\) need \(4k\) displacements, around \(1\%\);
  • the \(600k\) buckets of size \(2\) need \(20k\) displacements, around \(3\%\).

For now the bottleneck is twofold:

  • Iterating over all buckets to find all solutions (edges) up-front is slower than necessary. In practice most solutions \(k_i\) are quite small and there is no need to iterate all the way to the upper bound up-front.

  • The bit width \(b\) has to be large enough to ensure each bucket has at least one solution. When the expected value of the smallest solution is \(\overline{k}\), the maximum value we need is roughly \(\overline{k} \ln B\) when there are \(B\) buckets of the current size. This is unfortunate because in practice \(B\) is of order \(1M\), so we need to go up to \(20 \overline{k}\), whereas maybe \(2\overline{k}\) would be enough for most of them.

    TODO: We should fix this by allowing some buckets to displace larger buckets. Probably displacing buckets one larger than itself is good enough.

Displacing globally

A drawback of the implementation above is that it first finds all edges, and only then starts the algorithm. It is much more efficient to do things right from the start. We can make a single global displacement algorithm:

  1. Set a threshold \(k_{max} = 2^b\).
  2. Determine buckets and order by decreasing size.
  3. For each bucket, try to find the smallest \(k_i<2^b\) that places it without collisions. If so, move to the next bucket.
  4. Otherwise, find the \(k_i<2^b\) that minimizes the number of collisions, and (more importantly) minimizes the size of the colliding buckets. Then evict those other buckets (remove their \(k_i\) and push them on a stack), and set the \(k_i\) for the current bucket.
  5. While the stack is non-empty, repeat steps 3 and 4 for the top of the stack.

One drawback is that this needs some more memory, since for each slot it needs to store the bucket it came from, and it’s also a bit slow since (in case collision-free matching is not possible) we need to look up the size of each bucket in another random memory access.
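
Here is a condensed sketch of this global displacement loop (my own, heavily simplified: no hashing or partitioning, no cycle detection, and positions(b, p) is a hypothetical function giving the slots that bucket b occupies under pilot p):

fn displace(
    num_slots: usize,
    bucket_size: &[usize],
    positions: impl Fn(usize, u64) -> Vec<usize>,
    kmax: u64,
) -> Vec<u64> {
    const NONE: usize = usize::MAX;
    let num_buckets = bucket_size.len();
    let mut slot_owner = vec![NONE; num_slots]; // which bucket currently occupies each slot
    let mut pilot = vec![0u64; num_buckets];
    // Process buckets by decreasing size.
    let mut order: Vec<usize> = (0..num_buckets).collect();
    order.sort_by_key(|&b| std::cmp::Reverse(bucket_size[b]));
    let mut stack = Vec::new();
    for b in order {
        stack.push(b);
        while let Some(b) = stack.pop() {
            // First try to place the bucket without any collision.
            let found = (0..kmax)
                .find(|&p| positions(b, p).iter().all(|&s| slot_owner[s] == NONE));
            // Otherwise pick the pilot that evicts the smallest total bucket size.
            // (A real implementation also needs tie-breaking to avoid eviction cycles.)
            let p = found.unwrap_or_else(|| {
                (0..kmax)
                    .min_by_key(|&p| {
                        positions(b, p)
                            .iter()
                            .filter(|&&s| slot_owner[s] != NONE)
                            .map(|&s| bucket_size[slot_owner[s]])
                            .sum::<usize>()
                    })
                    .unwrap()
            });
            // Claim our slots, evicting whoever is in the way (they go back on the stack).
            for s in positions(b, p) {
                let owner = slot_owner[s];
                if owner != NONE && owner != b {
                    for s2 in positions(owner, pilot[owner]) {
                        slot_owner[s2] = NONE;
                    }
                    stack.push(owner);
                }
                slot_owner[s] = b;
            }
            pilot[b] = p;
        }
    }
    pilot
}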

Running it

This seems to work super well! My implementation isn’t as streamlined yet as the original that just finds the smallest \(k_i\) per bucket (which has a simpler and thus faster inner loop), but nevertheless gets the job done. Only around the last buckets of size \(2\) do displacements go above \(1\%\) of buckets. The main drawback is that it somewhat explodes at the last \(100\) to \(1000\) size-\(1\) buckets, where it has a harder time finding chains that lead to a free slot. Probably falling back to deterministic matching using Dinic is better here.

The average \(k_i\) per bucket-size using this method seems similar to the original algorithm, but the main benefit is that even for \(n=10^8\) we can now find solutions with all \(k_i < 2^8\), meaning that we can store the pilots simply as a vector of u8 instead of needing a compact representation and/or dictionary encoding. In fact I’m now thinking it’s probably simpler to tune \(c\) (i.e. the number of buckets) such that we can easily find a solution where all pilots fit in \(8\) bits, rather than having a variable packing size.

Some further observations:

  • Only rarely is more than one collision needed.
    • TODO: In this case can we backtrack instead of displacing two buckets?
  • When filling buckets of size \(s\), at worst buckets of size \(s+1\) are evicted, never larger ones.

Limitations

As already mentioned, one problem with this is that matching the last few buckets is hard. Basically, for the last bucket we have to find a size-\(1\) bucket for which one of its \(256\) possible pilots maps it to the remaining free slot, which only happens for \(0.5\%\) of buckets (which is large for \(n=10^8\)).

Alternatively we can use Dinic for the size \(1\) buckets again, but this also has a limitation as mentioned before, that there will always be some size-\(1\) buckets that do not map to any free slot. Thus, it seems we need some balance where:

  • Some size-\(1\) buckets displace already-placed size-\(2\) buckets.
  • For the remaining size-\(1\) buckets we find some/all solutions, build the graph, and run Dinic.

One other tricky problem is that not only do we have to ensure that each bucket maps to a free slot, but also that each free slot has some bucket mapping to it. I’m not quite sure yet how exactly to solve this, since the probability of such a position existing is also quite large. One option is to track for each slot one or two \((bucket, pilot)\) pairs that map to it, and then use that in case no other remaining values map there. But this might then trigger a chain of displacements again in the other (‘pulling’) direction. Although that’s somewhat inevitable in the \(\alpha = 1\) setting.

Cleanup and revisiting defaults

So far I have been using Vec<u64> as backing storage for pilots, MurmurHash to hash keys, and \(16\) iteration lookahead for prefetching.

  1. Now that we aim to have \(k_i < 2^8\), let’s switch that to a simple Vec<u8>.

  2. I added support for FxHash, which processes a u64 at a time using a simple rotate-xor-multiply: state=(state.rotate(5) ^ x)*constant.

    While benchmarking this, I realized that the reason NoHash (which is a toy implementation that does exactly nothing) was so much faster than other hashes is not only because it spends fewer instructions hashing, but also because my random keys were sorted (to check for duplicates), and this led to linear access patterns rather than random access patterns.

  3. It now seems that \(32\)-iteration lookahead is around \(10\%\) faster for large \(n\).

Some results for \(n=10^8\) (\(c=7\), \(\alpha=1\)):

Vec<u64>
 murmur (1): 20.3 (16):  7.6 (32):  7.2 (64):  7.3
 fxhash (1): 16.3 (16):  7.6 (32):  7.2 (64):  7.2
 nohash (1): 15.4 (16):  7.6 (32):  7.2 (64):  7.2
Vec<u8>
 murmur (1): 17.4 (16):  5.8 (32):  5.1 (64):  5.2
 fxhash (1): 13.8 (16):  5.5 (32):  4.9 (64):  5.6
 nohash (1): 13.0 (16):  5.5 (32):  4.9 (64):  4.8
sucds::CompactVector
 murmur (1): 20.9 (16): 27.4 (32): 27.4 (64): 27.5
 fxhash (1): 62.9 (16): 17.2 (32): 17.6 (64): 17.6
 nohash (1): 60.5 (16): 16.8 (32): 17.4 (64): 17.3
sux::CompactArray
 murmur (1): 21.5 (16): 20.9 (32): 21.0 (64): 20.9
 fxhash (1): 16.7 (16): 16.9 (32): 16.1 (64): 17.8
 nohash (1): 15.6 (16): 15.3 (32): 15.5 (64): 16.1

(1), (16/32/64) indicate no prefetching resp. prefetching 16/32/64 iterations ahead.

Observations:

  • Switching from Vec<u64> to Vec<u8> saves around \(2ns\) per query.
  • Going from Murmur to FxHash typically saves a bit.
  • Going further to NoHash doesn’t save more.
  • At \(32\) lookahead, the runtime does not depend much on hash function, so probably it’s memory bound rather than cpu bound.
  • CompactVector and CompactArray do not support prefetching and hence aren’t nearly as good.

TODO Sum instead of xor?

PTHash replaces the summation in \((h(x) + d)\mm n\) by a xor. It’s not exactly clear to me why this is beneficial. I think the xor has slightly better mixing, especially when the \(\mm n\) is implemented as a reduction of the high bits, and \(h(k_i) = C\cdot k_i\) has a linear structure as well.

Revisiting \(\alpha < 1\)

After playing some more with \(\alpha<1\) values, it seems that we can do this with only very minor loss in query performance. Some experiments for \(n=10^8\), \(c=10\), with prefetching \(32\) iterations ahead:

Table 2: Query time (ns) for distinct alpha. index returns the computed \(position(x)\) directly as a lower bound on time, while index_remap looks up the target position for positions after \(n\).
\(\alpha\)      index  index_remap
\(1.0\)         5.7    5.7
\(0.999\)       5.6    5.8
\(0.995\)       5.7    5.8
\(0.99\)        5.7    5.9
\(0.95\)        5.7    7.1
\(0.9\)         5.7    8.7

As can be seen, index is independent of \(\alpha\) as expected, since it doesn’t do any extra work for out-of-range elements. index_remap becomes slow once too many elements fall out of bounds around \(\alpha = 0.95\). But we can safely choose \(\alpha=0.99\) without significant loss of query performance.

Elias-Fano for the remap-dictionary

I added support to use sucds::EliasFano to store the dictionary elements. This saves quite some bits, since (for \(\alpha=0.99\)) it only needs \(2+\log_2(100) = 9\) bits per element as opposed to the default which encodes each index using a u32.

So let’s see the query performance.

Table 3: Query time (ns) for distinct alpha, using EliasFano encoding for the remapping.
\(\alpha\)      index  index_remap
\(1.0\)         5.7    6.6
\(0.999\)       5.8    6.7
\(0.995\)       5.8    7.2
\(0.99\)        5.7    7.6
\(0.95\)        5.7    11.1
\(0.9\)         5.7    16.1

This comes out quite a bit slower than using a direct Vec<u32> encoding. On the other hand, it does save space: for \(\alpha=0.99\), the vector encoding uses \(0.01 \cdot 32 = 0.32\) bits per element overall, while EliasFano only needs \(0.01\cdot 9 = 0.09\) bits per element overall, saving \(0.2\) bits/element.

Sadly the EliasFano hot-loop has slightly worse codegen even in the non-EF part, introducing a branch where previously branchless assembly was generated for the small >> shift that determines the right bucket.

Global iterative prioritizing

Instead of displacing using DFS, another idea I had that may work for \(\alpha<1\):

  1. Iterate over all buckets by decreasing size; place those that can be placed without collisions.
  2. Collect the rest in a vector of priority_buckets.
  3. Repeat, but place the priority buckets before iterating over the others.

This worked well with only buckets of a fixed size, but seems to not work as well globally, even when we try to assign the priority buckets so that they collide minimally with others. Instead of converging to a stable solution, each iteration adds about the same number of buckets (\(0.3\%\) for \(n=10^8\), \(c=10\), \(\alpha=0.99\)) to the priority list. It seems that the changes made by the few extra priority buckets propagate a lot, changing the optimal pilot for \(40\%\) of buckets. This basically means the set of taken slots is completely re-randomized each iteration, preventing convergence to a stable solution where the number of added priority elements goes down each round.

Basically, just DFSing the displacement using the earlier strategy seems much better.

Cleanup: removing peeling and suboptimal displacing code

The code base is getting messy. I’m going to drop all the peeling experiments, and also the ‘displacement per bucket size’ will go. Basically all methods that require computing edges up front are annoying in this regard. Also matching seems not-so-useful anymore now that we can settle on \(\alpha=0.99\) without impacting the query throughput. It may still be useful though in cases where we want to minimize memory usage, but even then we can go with \(\alpha=0.999\) which would take longer to construct but should still be fine, and avoids the complexity of having to balance between perfect matching and displacing.

So we’re dropping all these algorithms which required all edges to be precomputed up-front:

  • Peeling algorithms
    • Didn’t really work in the first place
  • Dinic matching
    • Works well for size-\(1\) buckets, but needs a large threshold to guarantee every bucket and slot have at least one edge. For larger bucket sizes the balancing act is annoying.
  • Per-bucket-size displacing:
    • The greedy non-iterative version that simply runs once and only fills \(95\%\) of buckets.
    • The greedy iterative version works well but only does one bucket size at a time.
    • The DFS version that works per bucket size isn’t quite good enough because sometimes we must displace to larger buckets.
    • The global iteration method of the previous subsection also goes – it just doesn’t work well.
  • The code to invert hashes \(s = C \times k_i\) is also dropped. Inverting returns values that are \(O(n)\) while we are now only interested in values at most \(2^8\). I’m kinda sad about this because the math there was really cool.
  • The ‘fast small buckets’ optimization is also dropped for now, to be reintroduced later, to clean up code a bit.

Altogether, this drops around 1900 lines of code :)

Some speedups to the displacement algorithm

  • Use a fast-path for detecting placement without collisions, which is sufficient for \(99\%\) of buckets.
  • Use a taken bitvector for the fast path, instead of a full slot-to-bucket-index mapping.
  • An attempt at prefetching positions in the taken vector for upcoming \(k_i\). It’s doubtful whether this actually improves construction time though.

TODO Runtime analysis of displacement algorithm

It would be nice to be able to estimate the runtime complexity of the displacement algorithm in terms of \(n\), \(m\) (/\(c\)), \(\alpha\), and the maximum allowed pilot value.

TODO Optimal prefetching strategy

For each class of buckets of the same size we could devise some optimal prefetch strategy: It’s wasteful to prefetch the slot corresponding to the last element of each bucket when the probability of failing on earlier elements in the bucket is high. E.g. for size \(2\) buckets only roughly \(1\) in \(6\) of the second elements needs to be queried. Not prefetching those may give a nice speedup.

Are we close to the memory bandwidth?

I have DDR4 @ 3200MHz, which has a theoretical bandwidth limit of \(25GB/s\).

For \(n=10^9\) (1G), the taken vector has size 1G*1bit=125MB and does not fit in my 12MB L3 cache. So let’s say for simplicity that all lookups in it are cache misses, i.e. that each lookup reads a single \(64\)-byte cacheline from main memory.

But how many lookups do we need?

  • there are 0.3G buckets
  • the average \(k_i\) is \(40\)
  • buckets have around \(3\) elements on average
  • total number of lookups: 0.3G * 40 * 3 ~ 40G
  • But that’s an upper bound, since in practice we do an early break after the first taken slot.

Simply counting the number of lookups gives me 20G, which sounds reasonable: it’s lower and the same order of magnitude.

So how long does it take to do 20G lookups? \(20G * 64byte = 1.28TB\), which at \(25GB/s\) is \(51s\). In practice the construction takes longer at \(350s\), so this indicates we’re not yet memory-bandwidth bound, assuming we could indeed saturate the full \(25GB/s\), but it’s not too far from it.

More sorting algorithm resources

I found some more implementations and posts about sorting algorithms, so listing them here for future reference:

Robinhoodsort
https://github.com/mlochbaum/rhsort

Sorts uniform random data by placing values at their interpolated location in a \(2n\) size array.

Ska_sort
https://github.com/skarupke/ska_sort

In-place radix sort with CPU-pipelining-friendly swapping of values to their buckets. See posts investigating radix sort, I wrote a faster sorting algorithm, and faster sorting algorithm part 2.

Wolfsort
https://github.com/scandum/wolfsort

Didn’t really look into this.

Voracious sort
https://github.com/lakwet/voracious_sort

Rust library with various radix sort implementations, with the default voracious sort being an improved ska_sort. I tried all implementations in here for sorting \(10^8\) random 64-bit hashes, and none beats radsort::sort_by_key(|h| h>>32) followed by manually sorting equal buckets.

More blogs
http://stereopsis.com/radix.html
pqdsort
Pattern defeating quicksort, https://github.com/orlp/pdqsort, by Orson Peters
glidesort
https://github.com/orlp/glidesort, by Orson Peters

And some resources on partitioning

Branchless partitioning
https://github.com/Voultapher/sort-research-rs/blob/main/writeup/lomcyc_partition/text.md by Lukas Bergdoll
Blog on Lomuto partitioning
https://orlp.net/blog/branchless-lomuto-partitioning/ by Orson Peters

Partitioning to reduce memory latency

As discussed just before, memory latency and bandwidth are a bottleneck in construction times when \(n=10^9\). For a typical bucket with \(2\) elements and \(k_i=50\), we must lookup \(100\) positions in taken, corresponding to fetching \(100\) cachelines from main memory. It would be much better if these cachelines were already available in a low level cache. So how about if we already roughly know where the elements of this bucket are supposed to go? Maybe a window of say \(10\) cachelines (i.e. \(10\cdot 64\cdot 8 = 5120\) bits) should be wide enough to find an empty slot with high probability. So we could say that bucket \(i\) of \(m\) maps to a window/slice of some length starting at \(i/m \cdot s\) (where \(s=n/\alpha\) is the total number of slots). But actually we can simplify this:

  • We can divide the \(m\) buckets into \(P\) partitions, and then the first \(1/P\) of buckets map to the first \(1/P\) of slots.

So there are \(P\) partitions with \(m_p = round(m/P)\) buckets each, and each partition corresponds to \(s_p = round(n/\alpha/P)\) slots.

So then, the position for an element \(x\) with bucket \(i\) becomes \[p(x, i) = \lfloor i/m_p \rfloor \cdot s_p + ((h(x) \lxor h(k_i)) \mm s_p).\]
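
In code this could look roughly as follows (a sketch with plain u64 hashes and a plain modulo; the real implementation would use one of the faster reductions discussed elsewhere):

/// Position of an element with hash hx in bucket `bucket`, with pilot hash hk,
/// when there are m_p buckets and s_p slots per part.
fn position(hx: u64, bucket: usize, hk: u64, m_p: usize, s_p: usize) -> usize {
    // Bucket i of m lives in part floor(i / m_p); within the part we use s_p slots.
    let part = bucket / m_p;
    part * s_p + ((hx ^ hk) as usize) % s_p
}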

As usual, we can remap values larger than \(n\) to the free slots in the rest of the array. Note that this is slightly different from the PTHash-HEM partitioning in Pibiri and Trani (2021a) since that builds a number of independent MPHFs, while this method builds a single one and only constructs a single free array.

One point needs some careful consideration: this method does not guarantee that each partition has the same number \(n_p = round(n/P)\) of elements, so the total load factor per partition will vary. This is not fatal though: The expected number of elements per partition is indeed \(n_p\), and since this is a binomial distribution for large \(n\) and small probability \(p=1/P\), the standard deviation is only \(\sqrt{n_p}\). This means that as long as we take, say, \(4\sqrt{n_p}\) additional slots per partition, the probability of a partition exceeding its number of slots is less than \(1\) in \(16000\). (Otherwise we rehash everything with a new global seed.) Since we anyway reserve free slots with \(\alpha<1\), we can simply make some more free slots. Construction time may then be slightly concentrated in the partitions with a number of elements significantly above the mean, but the memory latency savings should counter this easily.

So the new algorithm would be:

  1. Compute bucket for each element.
  2. Partition buckets into \(P\) partitions of \(m_p\) buckets each.
  3. Sort the buckets in each partition by decreasing size.
  4. (Optional) prefetch the \(s_p\) bits of the taken array corresponding to the current partition into low-level cache.
  5. Run the usual algorithm for each partition, searching for the smallest pilot that maps each bucket to empty slots only.
  6. Remap position above \(n\) to the free slots.

As said before, the main benefit here is that step 5 should now run much much faster, since each lookup for a free slot can go directly to L1 or L2 cache rather than main memory. So let’s see how many slots \(s_p\) we should aim for in each partition.

  • My L1 cache is \(32KB = 256Kbit\), so we could choose \(n_p = 200k\) elements per partition.
  • My L2 cache is \(256KB = 2Mbit\), so we could choose \(n_p = 2M\) elements per partition.

The optimal size is probably determined using experimentation. The benefit of smaller partitions is that they fit in a faster cache, while \(10\) times larger partitions have a relatively \(\sqrt{10} \approx 3\times\) smaller standard deviation, hence needing less buffer slots.

As an example let’s take \(n=10^9\) with \(n_p = 2M\) elements per partition. This creates \(500\) partitions, and needs \(4\sqrt{n_p} \approx 5k\) buffer slots. \(5k/2M = 0.25\%\), so actually choosing \(\alpha=0.99\) (which gives \(1\%\) buffer slots) already gives plenty of guarantee that partitions are not too full and leaves at least \(0.75\%\) of free slots at the end to speed up construction.

Note that this should also be very good for parallelization, since each thread can fill its own part of the taken vector in L1 or L2 without interfering with the other threads at all.

Implementation efficiency

  • \(s_p\) should be a multiple of the \(512\) bit cache-line size to avoid sharing cachelines between threads. The entire bitvector should be cacheline-aligned as well. But this is really nitpicking.
  • We should either use fast-divide for the bucket partitioning, or we can simply ensure that there’s \(2^k\) buckets per partition so we can just do this via a bitshift.

Oh some problem? Ughhh, so we manually create imbalanced bucket sizes, and the way we do this is by putting \(60\%\) of the elements in the first \(30\%\) of buckets. That means that if we partition buckets simply by id, the first buckets will be much larger than the last buckets…. And also, that means there will be very few small buckets to fill gaps among those large buckets… So it sounds like we want to either distribute those large buckets equally over the range, or partition the buckets e.g. via their id modulo \(P\).

A fix? So instead of making the first \(30\%\) of buckets large, we have to distribute the large buckets roughly equally over the partitions. Some ideas for this:

  1. Make every third bucket large, as follows:

    • Remap large buckets from 0..m/3 to 0..m by taking \(i \mapsto 3i-1\), filling positions \(2\mm 3\).

    • Remap small buckets from 0..2m/3 to 0..m by taking \(i\mapsto \lfloor 3i/2 \rfloor\), filling positions \(0\mm 3\) and \(1\mm 3\).

      (This is a specific instance of the following more general result, Beatty’s theorem: for irrational \(\alpha\) and \(\beta\) with \(1/\alpha + 1/\beta = 1\) the sets \(\{\lfloor i\alpha\rfloor: i\in \mathbb N\}\) and \(\{\lfloor i\beta\rfloor: i\in \mathbb N\}\) are disjoint and partition \(\mathbb N\).)

  2. First compute the bucket id as \(h(x) \mm m\). Then remap elements that should go to large buckets by setting the last two bits of the bucket id to \(0\). That maps some percentage of elements to buckets \(0 \mm 4\), creating large buckets there. (We could also map them to buckets \(0\mm 3\) by doing i/3*3, but that requires more operations.)

  3. Simply do the original logic of partitioning by small or large hash value, but choose a cut-off within each partition. So for reference, the bucketing function we currently use is

    \begin{equation*} bucket(x) = \begin{cases} \phantom{p_2+{}}\lfloor h(x) \cdot p_2 / 2^{64}\rfloor & \text{if } h(x) < p_1 \\ p_2 + \lfloor h(x) \cdot (n-p_2) / 2^{64}\rfloor & \text{if } h(x) \geq p_1. \end{cases} \end{equation*}

    If we have \(2^k\) partitions, we can mask the first \(k\) bits of \(h(x)\) and check \(mask_k(h(x)) < p_1 / 2^k\).

Back from a break!

So I finally managed to implement partitioning and it works great! It’s super fast when putting ~200k elements per part, which exactly fits in L1.

This now also allows for parallelization, since each part can be processed independently, and since each part only needs little memory, they don’t interfere with each other at all!

I now use FastReduce32H to determine the part and bucket.

One issue though: Using FastReduce32L to determine the slot does not provide sufficient entropy, since it only gives \(\lg(N/P)\) bits of entropy, whereas we need more like \(\lg N\) bits of entropy. This is somewhat fixed by reverting to FastMod32L, but that still lacks entropy for \(N = 10^9\), where we need the much slower FastMod64. More on this later.

Speeding up the search for pilots

Currently the loop that searches for pilots is as follows:

'p: for p in 0u64..kmax {
    let hp = self.hash_pilot(p);
    // True when the slot for hx is already taken.
    let check = |hx| unsafe { *taken.get_unchecked(self.position_in_part_hp(hx, hp)) };

    for hx in bucket {
        if check(*hx) {
            continue 'p;
        }
    }

    if self.try_take_pilot(bucket, hp, taken) {
        return Some((p, hp));
    }
}

This is quite efficient, but has a problem: each iteration of the inner loop is quite unpredictable: looking up random bits of taken gives random bits true or false with some probability, so there will be lots of branch-mispredictions. Indeed, running under perf stat -d shows:

     16,324.28 msec task-clock:u                     #    1.000 CPUs utilized
57,487,973,942      cycles:u                         #    3.522 GHz
87,101,496,658      instructions:u                   #    1.52  insn per cycle
10,513,721,875      branches:u                       #  644.054 M/sec
   643,749,689      branch-misses:u                  #    6.12% of all branches

To reduce branches and increase parallelization, we can check for success only every \(4\) lookups, which has much more predictable results (i.e. failing most of the time):

'p: for p in 0u64..kmax {
    let hp = self.hash_pilot(p);
    // True when the slot for hx is already taken.
    let check = |hx| unsafe { *taken.get_unchecked(self.position_in_part_hp(hx, hp)) };

    // Process chunks of 4 bucket elements at a time.
    let chunks = bucket.array_chunks::<4>();
    for &hxs in chunks.clone() {
        // Check all 4 elements of the chunk without early break.
        if hxs.map(check).iter().any(|&bad| bad) {
            continue 'p;
        }
    }
    // Check remaining elements.
    let mut bad = false;
    for &hx in chunks.remainder() {
        bad |= check(hx);
    }
    if bad {
        continue 'p;
    }

    if self.try_take_pilot(bucket, hp, taken) {
        return Some((p, hp));
    }
}

The new perf stat -d:

      14,310.45 msec task-clock:u                     #    0.999 CPUs utilized      (was 16k)
 50,239,412,832      cycles:u                         #    3.511 GHz                (was 57G)
123,285,550,459      instructions:u                   #    2.45  insn per cycle     (was 87G; 1.5)
 10,003,769,536      branches:u                       #  699.053 M/sec              (was 10.5G)
    170,547,753      branch-misses:u                  #    1.70% of all branches    (was 650M; 6.1%)

Note how this is roughly \(10\%\) faster (in clock and cycles). It does use \(50\%\) more instructions, but is able to execute those much more efficiently, at \(2.5\) instructions per cycle vs. \(1.5\) before. This is probably because of the \(4\)-fold reduction in branch-misses!

The remaining branch misses seem to mostly come from the less-optimized case where a bucket requires displacing some other buckets.

MultiplyReduce

For \(n\) close to \(10^9\) or \(2^{32}\), we really must use all \(64\) bits of entropy. FastReduce only uses the high bits, and FastMod works but is quite slow (taking \(2\) multiplications for the \(32\)-bit variant and more for \(64\) bits).

When the target modulus is a power of \(2\), \(2^d\), an alternative is what I’ll call MultiplyReduce: \[ h \mapsto \lfloor \frac {C\cdot h}{2^{64}}\rfloor \mm 2^d. \] This simply multiplies the hash by a random \(64\)-bit constant and then takes the \(d\) low bits of the high word of the result.

We can use this for the reduction to slots in each part, so that all entropy is used.
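
A small sketch of MultiplyReduce for a part with \(2^d\) slots (C here is just an arbitrary odd \(64\)-bit constant standing in for the multiplier):

const C: u64 = 0x9e3779b97f4a7c15; // an arbitrary odd 64-bit constant

/// floor(C * h / 2^64) mod 2^d: the d low bits of the high word of the product.
fn multiply_reduce(h: u64, d: u32) -> usize {
    let hi = ((C as u128 * h as u128) >> 64) as u64;
    (hi & ((1u64 << d) - 1)) as usize
}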

The one problem with this function (and also MulHash) is that \(128\)-bit multiplications don’t work well with SIMD, but so far SIMD hasn’t paid off yet anyway.

Linux hugepages?

I may or may not have this already enabled on my laptop, but if not, it may make prefetches faster by putting less pressure on the TLB.

It seems that jemalloc has builtin support for optional hugepages so I may try that. See also this reddit post.


After trying a bit more (tweet, example code), it seems that a simple vec![0u8; len] already uses transparent huge pages on my machine:

$ grep AnonHugePages /proc/meminfo

goes up by around 300MB when filling the vector of pilots, and similarly

$ grep thp_fault_alloc /proc/vmstat

increases by \(150\) at the same time, counting the total number of allocated 2MB hugepages.

Either way, here is the full allocation code I tried for completeness:

let len = 300_000_000;
const HUGE_PAGE_SIZE_BYTES: usize = 2 * 1024 * 1024;
// Allocate a 2MB-aligned, zero-initialized buffer and advise the kernel to back it with huge pages.
let v: Vec<u8> = unsafe {
    let ptr = std::alloc::alloc_zeroed(
        std::alloc::Layout::from_size_align(len, HUGE_PAGE_SIZE_BYTES).unwrap(),
    );
    libc::madvise(ptr as _, len, libc::MADV_HUGEPAGE);
    Vec::from_raw_parts(ptr, len, len)
};

Dropping the bucket split?

So far we have been mapping \(\beta = 0.6\) of the elements to \(\gamma = 0.3\) of the buckets to create two classes of buckets: large and small ones. This helps the performance of the construction algorithm by putting more elements in large and easy-to-place buckets, so that in the end we have more small buckets and especially more free slots remaining once we get to the last buckets of size \(2\).

In my current implementation it’s kind of annoying to find the bucket for a given hash: First FastReduce for the part: multiply \(h\) by \(P\), the number of parts. Then P*h>>64 is the part, and we use P*h % (2^64) to compute the bucket: the high bits of that result determine whether it belongs in a small or large bucket, and a linear transformation then sends it there.

If we drop these two classes things would be much simpler: the part is P*h>>64, and similarly the bucket is B*h>>64, where B is the total number of buckets over all parts.
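
In other words (a small sketch, with P and B as plain parameters):

/// Single-bucket-class mapping: both the part and the bucket are FastReduce-style
/// multiply-shift reductions of the full 64-bit hash.
fn part_and_bucket(h: u64, p: usize, b: usize) -> (usize, usize) {
    let part = ((h as u128 * p as u128) >> 64) as usize;
    let bucket = ((h as u128 * b as u128) >> 64) as usize;
    (part, bucket)
}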

Build performance

This is a bit more tricky: dropping the two classes roughly doubles the number of displacements and slows the construction by around \(10\%\). Also, the minimal \(c\) for which construction succeeds at all goes up slightly.

An alternative

Maybe we can come up with a ’local’ way to imbalance buckets, so that each bucket roughly maps to position h*B >> 64. Maybe we could OR the last bit with another random bit, so that \(75\%\) of elements go to \(50\%\) of buckets?

Query performance

Table 4: Query performance (ns) for \(n=10^9\)
             for-loop  prefetching
2 classes    14.5      7.6
1 class      10.3      7.5

As you can see, when we already do prefetching the added simplicity doesn’t affect performance much, but without it, the simpler and shorter code helps considerably.

Query memory bandwidth

Most likely, the prefetching variant is already saturating the memory bandwidth. Each cacheline is \(64\) bytes, which comes down to \(64B / 7.5ns = 8.5GB/s\). That’s a third of the \(25GB/s\) peak bandwidth for my RAM, but I’m not sure whether the peak is achievable at all for random-access reads.

I did find this quora answer that talks about Line-Fill Buffer. To quote:

If you test performance from a single thread on a machine that’s mostly idle (ie, you’re not competing with other threads for access to the memory bus), then you’ll find that on many chips, you can perform roughly twice as many sequential reads per second as random reads. The explanation for this is that the CPU has clever circuitry that watches your memory access pattern and tries to predict which cache-lines your code is going to want next. Why does this prefetching help? Well, this gets a bit deep. My current understanding is that in the random access case, your parallelism is limited by that fact that each core can only have a limited number of L1 misses outstanding at one time. Intel calls this the “Line-Fill Buffer” and it has 10 slots on recent chip models, and there are similar mechanisms on AMD and Arm. DRAM physics dictate that reads have ~60ns of latency (and this barely changes as the tech evolves), which is ~400 clock cycles, so that would rate limit throughput of random parallel reads to at most 1 per ~40 clock cycles. Versus when the hardware prefetch mechanisms kick in, they don’t consume line-fill slots and can effectively perform more reads in parallel. Sadly, if you issue software prefetch instructions, many CPUs treat that as needing a line-fill entry, so it’s only reads triggered by hardware-prefetching that are able to achieve greater parallelism than the size of your line-fill buffer.

So with this in mind, maybe the next question is whether we can get the last factor \(3\) improvement from \(8GB/s\) to \(25GB/s\) by using multithreading, since each CPU has its own copy of these Line-Fill Buffers between L2 and L1.

So I tried multithreading, but it seems not to be faster at all! It’s usually slightly slower. Weirdly though, the threads seem to do the work sequentially, with the \(i\)-th of \(k\) threads finishing roughly after \(i/k\) of the total time. Some threads just seem to be busy waiting for memory that never comes until other threads are done.

But then does that mean that my memory bandwidth is actually capped at \(8GB/s\)? That would be disappointing.

Maybe another solution could be to store two copies of the pilot array, one on each of the two \(32GB\) memory cards I have? I don’t really know whether the main memory or the processor is the bottleneck, so no idea if this would help.

Some more experiments

I wrote some experiments to measure memory bandwidth where I sum the first byte of each cacheline of a \(4GB\) array. Results:

stride 1
strided       1.78s     22.513GB/s
prefetch      1.77s     22.553GB/s
stride 2
strided       2.73s     14.678GB/s
prefetch      2.79s     14.357GB/s
stride 4
strided       3.61s     11.079GB/s
prefetch      3.48s     11.501GB/s
stride 8
strided       3.85s     10.403GB/s
prefetch      3.76s     10.649GB/s
stride 16
strided       4.05s      9.876GB/s
prefetch      4.00s      9.995GB/s
stride 32
strided       4.53s      8.828GB/s
prefetch      4.50s      8.893GB/s
stride 64
strided       3.91s     10.228GB/s
prefetch      3.89s     10.279GB/s
stride 128
strided       4.01s      9.976GB/s
prefetch      4.02s      9.957GB/s

So apart from the smallest strides, this basically caps out around \(9GB/s\), even for very predictable access patterns, and software prefetching doesn’t help either.

So the \(8.5GB/s\) we currently get for the queries is already very good and basically saturates the single-threaded memory bandwidth!
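For reference, the strided variant of the benchmark is roughly the following sketch. The function and timing harness here are mine, not the actual benchmark code, which differs in details (it also has a software-prefetching variant).

```rust
use std::time::Instant;

/// Sum the first byte of every cacheline of `data`, visiting cachelines with the
/// given stride. Each access pulls in a full 64-byte cacheline, so the total
/// memory traffic is `data.len()` bytes.
fn strided_sum(data: &[u8], stride: usize) -> (u8, f64) {
    const CACHELINE: usize = 64;
    let lines = data.len() / CACHELINE;
    let mut sum = 0u8;
    let start = Instant::now();
    // Visit lines 0, s, 2s, ..., then 1, 1+s, ..., so every cacheline is read once.
    for offset in 0..stride {
        let mut i = offset;
        while i < lines {
            sum = sum.wrapping_add(data[i * CACHELINE]);
            i += stride;
        }
    }
    let gb_per_s = data.len() as f64 / 1e9 / start.elapsed().as_secs_f64();
    (sum, gb_per_s)
}

fn main() {
    let data = vec![1u8; 4usize << 30]; // the 4GB array
    for stride in [1, 2, 4, 8, 16, 32, 64, 128] {
        let (sum, bw) = strided_sum(&data, stride);
        println!("stride {stride:3}: {bw:6.3} GB/s (checksum {sum})");
    }
}
```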

Multithreading benchmark

So can we get to the full \(25GB/s\) using multithreading? It seems that the answer is yes! Running multiple threads in parallel each with a different (coprime) stride, I get the following throughput per thread and in total:

Threads: 1     3.85s     10.396GB/s
Threads: 2     4.35s     18.393GB/s
Threads: 3     5.03s     23.845GB/s
Threads: 4     6.15s     26.035GB/s
Threads: 5     7.52s     26.578GB/s
Threads: 6     8.86s     27.074GB/s

As you can see, this converges to just above \(25GB/s\). Probably it’s a bit higher because some reads are cached in L3, and some reads are shared between threads.

If we give each thread its own copy of the \(4GB\) array, this changes to:

Threads: 1     4.55s      8.800GB/s
Threads: 2     5.37s     14.900GB/s
Threads: 3     6.85s     17.512GB/s
Threads: 4     8.51s     18.791GB/s
Threads: 5    10.53s     18.985GB/s
Threads: 6    12.15s     19.756GB/s

This is quite a bit lower!

Changing to a \(7GB\) array may be interesting, since \(6\cdot 7GB = 42GB\) must be spread over my two \(32GB\) RAM sticks. (An \(8GB\) array crashed because I don’t have \(48GB\) of memory free currently.) This gives:

Threads: 1     7.76s      9.016GB/s
Threads: 2     9.54s     14.679GB/s
Threads: 3    12.20s     17.211GB/s
Threads: 4    14.98s     18.686GB/s
Threads: 5    18.15s     19.282GB/s
Threads: 6    22.93s     18.313GB/s

I.e., very similar to the results above. So we probably won’t be able to gain speedup by distributing reads over the two RAM sticks.
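For reference, the multithreaded runs above were roughly of the following shape. This is only a sketch using std::thread::scope: the specific strides are placeholders (odd, hence coprime with a power-of-two number of cachelines), and thread pinning is omitted.

```rust
use std::time::Instant;

/// Have `threads` readers scan the same array in parallel, each with its own
/// stride, and report the combined throughput. A sketch, not the exact code.
fn bench_threads(data: &[u8], threads: usize) -> f64 {
    const CACHELINE: usize = 64;
    let lines = data.len() / CACHELINE;
    // Odd strides are coprime with a power-of-two number of lines.
    let strides = [17usize, 19, 23, 29, 31, 37];
    let start = Instant::now();
    std::thread::scope(|s| {
        for t in 0..threads {
            let stride = strides[t % strides.len()];
            s.spawn(move || {
                let mut sum = 0u8;
                let mut i = t * lines / threads; // start threads at different lines
                for _ in 0..lines {
                    sum = sum.wrapping_add(data[i * CACHELINE]);
                    i = (i + stride) % lines;
                }
                std::hint::black_box(sum); // keep the reads from being optimized away
            });
        }
    });
    // Every thread reads every cacheline exactly once.
    (threads * data.len()) as f64 / 1e9 / start.elapsed().as_secs_f64()
}
```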

Multithreading queries: satisfaction at last

I had a silly bug at first, but with that fixed, here’s the inverse throughput when using up to \(6\) threads in parallel, each prefetching \(64\) iterations ahead:

| threads | ns/query | memory throughput (GB/s) |
|---------|----------|--------------------------|
| 1       | 7.59     | 8.4                      |
| 2       | 4.17     | 15.3                     |
| 3       | 3.09     | 20.7                     |
| 4       | 2.81     | 22.8                     |
| 5       | 2.69     | 23.8                     |
| 6       | 2.76     | 23.2                     |

Satisfaction, at last! \(2.69ns/query\) means a memory throughput of \(64B/2.69ns = 23.79GB/s\), quite close to the full memory bandwidth.

Each thread is also reading the keys to be queried from main memory. Those are efficiently packed, so each \(8B\) key is only \(1/8\)th of a \(64B\) cacheline and hence only requires \(1/8\)th of a read. But adding those in we get \((1+1/8) \cdot 23.79GB/s = 26.76GB/s\), quite a bit above the full memory bandwidth. I suppose this is because some queries are already in L3 cache, sparing the main memory bus. Let’s see more precisely: the memory used for all pilots is \(2.43bit/elem\cdot 10^9 elem = 2.43\cdot 10^9 bit = 304MB\). My L3 cache is \(12MB\), so each query hits the cache with probability \(12MB/304MB=4\%\), getting to

$$ \frac 18 \cdot \frac{64B}{2.69ns} + (1-0.04) \cdot\frac{64B}{2.69ns} = 25.8GB/s. $$

This is still slightly above the \(25GB/s\) max throughput my RAM should have, but I’ll call it done here. We’re within a few percent of the expected maximum performance, which is as good as we can get.

Maybe the remaining discrepancy is because the L1 and L2 caches hold a few more elements on top of L3?
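For completeness, each thread’s streaming query loop is roughly of the following shape. This is only a sketch: `index` stands in for the real hash-to-slot computation, the pilots are shown as plain bytes, and the x86_64 prefetch intrinsic is used directly, so none of this is the exact ptr-hash code.

```rust
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

/// Stream over `hashes`, prefetching the pilot cacheline B iterations ahead.
/// Assumes hashes.len() >= B; `index` maps a hash to its pilot's position.
fn query_stream<const B: usize>(
    hashes: &[u64],
    pilots: &[u8],
    index: impl Fn(u64) -> usize,
) -> u64 {
    let mut buf = [0usize; B]; // positions computed up to B iterations ago
    let mut sum = 0u64;
    for (i, &h) in hashes.iter().enumerate() {
        let pos = index(h);
        // Ask the CPU to pull in the cacheline holding pilots[pos]; by the time we
        // actually read it, B iterations from now, it should be in cache.
        unsafe { _mm_prefetch::<_MM_HINT_T0>(pilots.as_ptr().add(pos) as *const i8) };
        let ready = std::mem::replace(&mut buf[i % B], pos);
        if i >= B {
            sum += pilots[ready] as u64;
        }
    }
    // Drain the last B buffered positions.
    for &pos in &buf {
        sum += pilots[pos] as u64;
    }
    sum
}
```

The multithreaded numbers above then simply come from splitting the list of hashes into one chunk per thread and running this loop (e.g. `query_stream::<64>`) on each chunk.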

Packing difference from expected position

Since the values we pack are a random subset of \([0, n]\), it may be possible to encode them efficiently by only storing the deviation from their expected value. I.e. we store \(r=(1/\alpha - 1)n\) values where position \(i\) stores the difference with \(i/r \cdot n\).

Result: for \(n=10^8\) with \(\alpha = 0.99\), the largest difference between actual and expected position is around \(500\,000\). I.e., this would still need at least \(20\) bits per value, not much less than the \(\lceil\log_2 n\rceil = 27\) bits needed to store positions directly.

Local packing ideas

Instead we could store a fixed offset \(c\) every \(k\) positions, and approximate the following \(k\) values as \(c + i\cdot n/r\), where \(n/r\) is the expected difference between adjacent remap targets (i.e. the distance between adjacent empty slots in the taken array).

For example, we could store a u32 offset plus \(30\) \(16\)-bit deltas per cacheline, for a total of \(16 + 32/30 \approx 17\) bits per remapped position.

Or maybe we could pack slightly more efficiently and use only somewhere between \(10\) and \(15\) bits for the deviation from the expected value. Some maths will be needed to figure out a safe upper bound on the maximum deviation.

Some possibilities:

  1. Absolute: Store \(32bit\) offset, and \(16bit\) deltas to it.
  2. Linear: Store \(32bit\) offset, and \(16bit\) deltas to \(offset + i\cdot n/r\).
  3. Custom linear: Store \(32bit\) offset, \(32bit\) (or less) difference \(d\), and \(16bit\) deltas to \(offset + i \cdot d\).
  4. Tiny EF: Store a \(32bit\) offset, \(128\) bits indicating where multiples of \(256\) are crossed (similar to EF encoding), and \(44\) \(1byte\) values.

This last idea needs \(512/44=11.6\) bits per value. That’s slightly more than using EF encoding directly on the full vector (which uses around \(9\) bits), but since this only uses a single cacheline, lookups are much faster, and it uses \(3\times\) fewer bits than the direct Vec<u32>.
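A sketch of what one such cacheline-sized block could look like; the field layout and the naive select routine are illustrative only, not the actual TinyEF code linked below.

```rust
/// One 64-byte block storing 44 increasing values relative to a 32-bit offset:
/// for the j-th value, bit (j + (value >> 8)) of `high_bits` is set (Elias-Fano
/// style), and the low 8 bits go into `low_bytes[j]`.
#[repr(C, align(64))]
struct TinyEfLine {
    high_bits: [u64; 2], // 128 bits marking where multiples of 256 are crossed
    offset: u32,         // base value of this block
    low_bytes: [u8; 44], // low 8 bits of each value
}

impl TinyEfLine {
    /// Decode the j-th value stored in this block.
    fn get(&self, j: usize) -> u64 {
        // Position of the j-th set bit; subtracting j (the rank) gives the high part.
        let p = select(&self.high_bits, j);
        let high = (p - j) as u64;
        self.offset as u64 + (high << 8) + self.low_bytes[j] as u64
    }
}

/// Naive select: position of the i-th (0-based) set bit across two words.
fn select(words: &[u64; 2], mut i: usize) -> usize {
    for (w_idx, &word) in words.iter().enumerate() {
        let ones = word.count_ones() as usize;
        if i < ones {
            let mut w = word;
            loop {
                if i == 0 {
                    return w_idx * 64 + w.trailing_zeros() as usize;
                }
                w &= w - 1; // clear the lowest set bit
                i -= 1;
            }
        }
        i -= ones;
    }
    unreachable!("fewer set bits than requested")
}
```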

Query times for different remapping structures

Table 5: Query times in nanoseconds without and with remapping, for different remap structures. \(n=10^9\), \(c=9\), \(\alpha\in\{0.99, 0.98\}\). The bottom row indicates the bits/element needed for the pilots (no remap) plus the additional memory needed by each remapping data structure.
| threads    | no remap | Vec<u32> | TinyEF   | EliasFano | Vec<u32> | TinyEF   | EliasFano |
|------------|----------|----------|----------|-----------|----------|----------|-----------|
| \(\alpha\) |          | \(0.99\) | \(0.99\) | \(0.99\)  | \(0.98\) | \(0.98\) | \(0.98\)  |
| 1          | 7.4      | 7.9      | 7.9      | 10.1      | 8.1      | 8.2      | 11.6      |
| 2          | 4.1      | 4.3      | 4.3      | 5.1       | 4.8      | 4.5      | 6.0       |
| 3          | 3.0      | 3.2      | 3.2      | 3.6       | 3.5      | 3.3      | 4.3       |
| 4          | 2.8      | 2.9      | 2.8      | 3.2       | 3.0      | 2.9      | 3.6       |
| 5          | 2.7      | 2.7      | 2.8      | 3.0       | 2.8      | 2.7      | 3.2       |
| 6          | 2.7      | 2.7      | 2.7      | 3.0       | 3.0      | 2.8      | 3.1       |
| bits/elem  | 2.43     | +0.33    | +0.12    | +0.09     | +0.66    | +0.24    | +0.17     |

As can be seen, TinyEF uses only about a third of the memory of Vec<u32>, while being at least as fast (and faster for \(\alpha=0.98\)). EliasFano uses another \(25\%\) less memory, but significantly slows down queries due to its non-locality.

The code for TinyEF is here.

Sharding

When the number of keys is very large, they may not all fit in memory at the same time. To solve this, we can shard the keys and process one shard at a time.

In particular, the \(P\) parts are split into \(S\) shards, where parts \(0\) to \(P/S-1\) go into the first shard, and so on; we make sure that \(P\) is a multiple of \(S\). Then we process one shard at a time: we hash its keys, sort the hashes into their parts, and then compute the pilots for each part independently.

I implemented two ways to shard keys:

  1. On-disk sharding: create a file for each shard, iterate over the keys, and append the hash of each key to the file of the shard it belongs to.
  2. In-memory sharding: iterate over the keys multiple times, and each time only return those which belong to the current shard.
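A minimal sketch of the in-memory variant. The names are mine, and mapping a hash to its part via the high bits is an assumption that may differ from the actual ptr-hash code.

```rust
/// Return the hashes of all keys that fall in the given shard. Called once per
/// shard, so the keys are iterated S times but only one shard of hashes is ever
/// held in memory at a time.
fn shard_hashes<K>(
    keys: &[K],
    hash: impl Fn(&K) -> u64,
    parts: usize,  // P, assumed to be a multiple of the number of shards
    shards: usize, // S
    shard: usize,  // current shard index in 0..S
) -> Vec<u64> {
    let parts_per_shard = parts / shards;
    keys.iter()
        .map(|k| hash(k))
        .filter(|&h| {
            // Map the hash to its part via its high bits, then keep it only if
            // that part belongs to the current shard.
            let part = ((h as u128 * parts as u128) >> 64) as usize;
            part / parts_per_shard == shard
        })
        .collect()
}
```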

128bit hashing

In order to scale to more than \(2^{32}\) keys, we need to use \(128\)-bit hash functions (github issue). There are many hash functions out there, all with different speed vs. randomness tradeoffs, and I added support for many of them. Although I have no idea about the quality of the implementations I used, here are some quick results for hashing u64 keys:

Table 6: Query throughput (ns) for different hash functions on \(10^9\) u64 keys for sequential querying and streaming with $64$-ahead prefetching.
| Hasher  | Bits | seq.  | prefetch |
|---------|------|-------|----------|
| fxhash  | 64   | 14.2  | 7.48     |
| murmur2 | 64   | 15.8  | 7.70     |
| murmur3 | 128  | 16.5  | 9.40     |
| xx      | 64   | 39.0  | 14.04    |
| xx      | 128  | 32.3  | 13.94    |
| city    | 64   | 28.0  | 12.19    |
| city    | 128  | 45.1  | 20.09    |
| metro   | 64   | 29.9  | 13.21    |
| metro   | 128  | 40.3  | 16.71    |
| highway | 64   | 107.6 | 57.20    |
| highway | 128  | 114.5 | 47.84    |
| spooky  | 64   | 115.7 | 63.43    |
| spooky  | 128  | 115.5 | 67.06    |

From this we conclude that FxHash, Murmur2, and Murmur3 are fast hash functions whose throughput stays close to the memory bandwidth, while most of the other hash functions are quite a bit slower and lower the overall throughput.
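These hashers sit behind a small trait roughly along the following lines. This is a sketch, not the actual ptr-hash trait, and it assumes the fxhash and xxhash-rust crates for the two example implementations.

```rust
/// Sketch: every hasher produces 128 bits; 64-bit hashers just fill the low half.
trait KeyHasher<K: ?Sized> {
    fn hash(key: &K, seed: u64) -> u128;
}

struct FxHash64;
impl<K: std::hash::Hash + ?Sized> KeyHasher<K> for FxHash64 {
    fn hash(key: &K, seed: u64) -> u128 {
        // fxhash takes no seed, so mix the seed in by hashing the (seed, key) pair.
        fxhash::hash64(&(seed, key)) as u128
    }
}

struct Xx128;
impl KeyHasher<[u8]> for Xx128 {
    fn hash(key: &[u8], seed: u64) -> u128 {
        xxhash_rust::xxh3::xxh3_128_with_seed(key, seed)
    }
}
```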

I also added support for hashing of &[u8] strings. Results are here:

Table 7: Query throughput (ns) for different hash functions on \(4\cdot 10^6\) URLs (avg. \(71\) bytes long) for sequential querying and streaming with $64$-ahead prefetching.
| Hasher  | Bits | seq. | prefetch |
|---------|------|------|----------|
| fxhash  | 64   | 31.7 | 21.26    |
| murmur2 | 64   | 11.5 | 6.23     |
| murmur3 | 128  | 24.7 | 15.35    |
| wy      | 64   | 19.5 | 11.58    |
| xx      | 64   | 23.2 | 14.26    |
| xx      | 128  | 21.0 | 12.90    |
| city    | 64   | 18.0 | 10.97    |
| city    | 128  | 27.5 | 17.90    |
| metro   | 64   | 22.2 | 15.36    |
| metro   | 128  | 24.3 | 17.71    |
| highway | 64   | 61.1 | 36.23    |
| highway | 128  | 67.0 | 42.51    |
| spooky  | 64   | 74.1 | 62.48    |
| spooky  | 128  | 74.1 | 62.87    |

Observations:

  • fxhash is slow for strings.
  • murmur2 is super fast!
  • city and wy are great for 64-bit hashes.
  • xx is best for 128-bit hashes.

Varying the partition size

Let’s construct an MPHF for \(n=10^9\) keys, \(c=8\), and \(\alpha=0.98\), with different partition sizes. \(\max \alpha\) is the maximum load factor over all parts.

| slots per part | num parts | displace time (s) | total time (s) | \(\max\alpha\) (%) |
|----------------|-----------|-------------------|----------------|--------------------|
| \(2^{15}\)     | \(31\)k   | -                 | -              | \(>100\)           |
| \(2^{16}\)     | \(16\)k   | \(20.7\)          | \(27.0\)       | \(99.49\)          |
| \(2^{17}\)     | \(7.8\)k  | \(22.8\)          | \(29.0\)       | \(99.18\)          |
| \(2^{18}\)     | \(3.9\)k  | \(22.7\)          | \(29.0\)       | \(98.59\)          |
| \(2^{19}\)     | \(1947\)  | \(25.8\)          | \(31.6\)       | \(98.40\)          |
| \(2^{20}\)     | \(974\)   | \(30.3\)          | \(36.1\)       | \(98.21\)          |
| \(2^{21}\)     | \(487\)   | \(42.3\)          | \(48.0\)       | \(98.14\)          |
| \(2^{22}\)     | \(244\)   | \(41.2\)          | \(48.1\)       | \(97.83\)          |
| \(2^{23}\)     | \(122\)   | \(40.9\)          | \(47.0\)       | \(97.81\)          |
| \(2^{24}\)     | \(61\)    | \(72.5\)          | \(78.3\)       | \(97.77\)          |
| \(2^{25}\)     | \(31\)    | \(88.8\)          | \(94.9\)       | \(96.17\)          |

Observe that:

  • Using too small partitions causes too many parts, with too much load imbalance, consistently giving \(\alpha > 1\).
  • \(\max \alpha\) decreases as the parts get larger, reducing the worst-case outliers. In particular the delta with \(98\%\) roughly halves each step.
  • When there are very few parts, \(\max \alpha\) goes below \(98\%\) due to rounding, because each part is fixed to have a power-of-two size.
  • Smaller parts are faster to build, due to data fitting in lower caches.
  • \(2^{17}\) slots per part was my default, but \(2^{19}\) is not much slower and more reliable.
  • From \(2^{21}\) slots, the displacement time quickly grows. Storing \(2^{21}\) taken bits takes \(256KB\), which is exactly the L2 cache size. But the displacement algorithm also needs to store, for each slot, which bucket currently occupies it, taking \(32bit/slot\). This does not fit in L3 and goes to main memory instead.
  • At \(2^{24}\) slots, even the taken arrays do not fit in L3 cache anymore and things slow down even more. (In fact I’m surprised it’s only a \(2\times\) slowdown. The latency is hidden pretty well.)

One way in which PTRHash differs from PTHash is that apart from just the taken bitvector, it needs to store an additional \(32bit\) index for each slot to be able to displace elements, which requires a lot more memory. If we could prevent this, memory usage would go down and it may be possible to use larger parts without slowing down the construction time so much.
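For concreteness, with \(s\) slots per part, these two construction-time arrays take roughly

$$ \underbrace{s/8}_{\text{taken bits}} + \underbrace{4s}_{\text{slot}\to\text{bucket index}} \text{ bytes}, $$

i.e. \(0.25MB + 8MB\) at \(s=2^{21}\) and \(2MB + 64MB\) at \(s=2^{24}\), so the \(32bit\) index is by far the dominant term.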

References

Fan, Bin, Dave G. Andersen, Michael Kaminsky, and Michael D. Mitzenmacher. 2014. “Cuckoo Filter.” Proceedings of the 10th ACM International Conference on Emerging Networking Experiments and Technologies, December. https://doi.org/10.1145/2674005.2674994.
Fox, Edward A., Qi Fan Chen, and Lenwood S. Heath. 1992. “A Faster Algorithm for Constructing Minimal Perfect Hash Functions.” Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR ’92. https://doi.org/10.1145/133160.133209.
Graf, Thomas Mueller, and Daniel Lemire. 2020. “Xor Filters.” ACM Journal of Experimental Algorithmics 25 (March): 1–16. https://doi.org/10.1145/3376122.
Lemire, Daniel, Owen Kaser, and Nathan Kurz. 2019. “Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries.” Software: Practice and Experience 49 (6): 953–70. https://doi.org/10.1002/spe.2689.
Limasset, Antoine, Guillaume Rizk, Rayan Chikhi, and Pierre Peterlongo. 2017. “Fast and Scalable Minimal Perfect Hashing for Massive Key Sets.” arXiv. https://doi.org/10.48550/ARXIV.1702.03154.
Pibiri, Giulio Ermanno, and Roberto Trani. 2021a. “Parallel and External-Memory Construction of Minimal Perfect Hash Functions with PTHash.” arXiv. https://doi.org/10.48550/ARXIV.2106.02350.
———. 2021b. “PTHash: Revisiting FCH Minimal Perfect Hashing.” Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, July. https://doi.org/10.1145/3404835.3462849.