Thoughts on Consensus MPHF and tiny pointers

These are some thoughts on the Consensus-based MPHF presented in Lehmann et al. (2025), and how it could be applied to PtrHash.

Below are also some thoughts on the papers on tiny pointers, used to achieve hash tables with load factors very close to 1: Bender et al. (2021), Farach-Colton, Krapivin, and Kuszmaul (2025).

1 Consensus

Setting. Suppose we have \(n\) Bernoulli processes \(B_i\), each with probability¹ of success \(p_i\). For each \(B_i\), we retry it until it first succeeds, at attempt \(S_i \in \{0,1,2,\dots\}\).

Goal. Store all \(S_i\) using minimal space.

Optimal. The optimal space to store some successful \(S_i\) for each \(i\) is \(M^{opt} := \sum_i \log_2(1/p_i)\) bits. And in expectation, we must do at least \(T_i^{opt} := 1/p_i\) trials of experiment \(i\).

Main result. The main result states that with \(\varepsilon\) extra bits per element, for total space \(M^{opt} + \varepsilon n + O(\log n)\) bits, it is possible to encode seeds \(S_i\) that are found using \(O(T_i^{opt} / \varepsilon)\) tries in expectation.

Method. Consider a bitstring of length \(\sum_j (\varepsilon + \log_2(1/p_j))\). As the seed \(S_i\), use the 64 bits ending at position \(L_i:=\left\lceil\sum_{j\leq i} (\varepsilon + \log_2(1/p_j))\right\rceil\).
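
For concreteness, a minimal sketch (in Rust; the function name is hypothetical) of how these end positions \(L_i\) could be computed from the success probabilities:

```rust
/// Bit positions L_i at which each seed window ends, following
/// L_i = ceil(sum_{j<=i} (eps + log2(1/p_j))).
fn seed_end_positions(p: &[f64], eps: f64) -> Vec<u64> {
    let mut sum = 0.0;
    p.iter()
        .map(|&p_j| {
            sum += eps + (1.0 / p_j).log2();
            sum.ceil() as u64
        })
        .collect()
}
```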

Then, seeds \(S_i\) can be found using a greedy DFS (sketched in code below). Suppose seeds have already been found up to \(S_{i-1}\).

  1. Using the additional bits between \(L_{i-1}\) and \(L_i\), find a seed for \(B_i\).
  2. On success, move on to \(B_{i+1}\).
  3. On failure, go back to \(B_{i-1}\) and search for its next seed. If needed, backtrack further up.
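
A minimal sketch of this backtracking search, under the simplifying assumption that each \(B_i\) owns its own `bits(i)` \(\approx \varepsilon + \log_2(1/p_i)\) fresh bits (ignoring the overlap of the 64-bit windows); `try_seed` is a hypothetical oracle that runs \(B_i\) with a given seed value:

```rust
/// Greedy DFS over seed values. `try_seed(i, s)` reports whether B_i
/// succeeds with seed s; `bits(i)` is the number of fresh bits for B_i.
fn search_seeds(
    n: usize,
    bits: impl Fn(usize) -> u32,
    try_seed: impl Fn(usize, u64) -> bool,
) -> Option<Vec<u64>> {
    let mut seeds = vec![0u64; n];
    let mut i = 0;
    while i < n {
        let max = 1u64 << bits(i);
        // 1. Try the remaining seed values for B_i.
        while seeds[i] < max && !try_seed(i, seeds[i]) {
            seeds[i] += 1;
        }
        if seeds[i] < max {
            // 2. Success: move on to B_{i+1}.
            i += 1;
        } else {
            // 3. Failure: reset S_i, go back to B_{i-1}, and try its next seed.
            seeds[i] = 0;
            if i == 0 {
                return None;
            }
            i -= 1;
            seeds[i] += 1;
        }
    }
    Some(seeds)
}
```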

1.1 Consensus-RecSplit

The MPHF they introduce using consensus is rather pretty:

  1. Assume that \(n=2^d\). We will build a depth-\(d\) splitting tree.
  2. First find a seed \(S_0\) that splits the \(n\) keys into two equal halves.
  3. Then, on the next level, find seeds \(S_0\) and \(S_1\) that split the two halves equally into four sets (seed indices restart at each level).
  4. Repeat, until in the last level we find seeds \(S_0, \dots, S_{2^{d-1}-1}\) that split each leaf of two keys into individual keys.
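
As a sketch of the search at a single node of the tree (with a stand-in seeded `hash` helper, reused in later sketches):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in seeded hash; a real implementation would use something faster.
fn hash(key: u64, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    (key, seed).hash(&mut h);
    h.finish()
}

/// Find the smallest seed that splits `keys` into two equal halves,
/// where one bit of the seeded hash decides the side.
fn find_split_seed(keys: &[u64]) -> u64 {
    (0u64..)
        .find(|&s| {
            let left = keys.iter().filter(|&&k| hash(k, s) & 1 == 0).count();
            2 * left == keys.len()
        })
        .unwrap()
}
```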

The beauty is now that we can use Consensus to compactly store the seeds of each level using only minimal overhead. And as it turns out, this construction is space-optimal at its core: the only overhead is in the \(\varepsilon\) we use to encode the seeds.

The only drawback is that querying a key now requires looking up \(\log_2 n\) seeds.

A bucketed version reduces the height of the tree from \(O(\log_2 n)\) to \(O(\log_2(1/\varepsilon))\) by ‘merging’ the top layers into a single \(k\)-perfect hash function.

Result. This yields the first MPHF that can be constructed in time \(O(n/\varepsilon)\) rather than \(\Omega(n\cdot \exp(1/\varepsilon))\).

Commentary. In a way, RecSplit and PTHash/PHOBIC/PtrHash are on opposite ends of a spectrum:

  • RecSplit recursively splits the keys into equal halves, requiring \(\log n\) lookups;
  • PTHash and variants split the keys into \(n\) parts at once, requiring only a single lookup.

The benefit of the first approach is that it is much more balanced: at each point, the number of keys in each node is known exactly, and searching for a seed that splits a set into equal halves is relatively easy.

For PTHash on the other hand, keys are necessarily distributed over buckets of varying sizes, following some distribution. This requires processing buckets from large to small and/or using a load factor \(\alpha < 1\).

2 IDEA: Consensus-PtrHash

An example. As a slightly artificial application, suppose we have \(n\) keys \(k_i\) that we want to hash into \(n\) slots. For the first key, the probability of hitting an empty slot is \(1\), so \(S_0 = 0\). For the second key, the probability of hitting an empty slot is \(p_1=(n-1)/n\), and so \(S_1\) follows a geometric distribution, requiring \(n/(n-1)\) trials in expectation. Generally, \(S_i\) has success probability \(p_i = (n-i)/n\) and follows a geometric distribution requiring \(n/(n-i)\) trials in expectation. Now we could simply write all \(S_i\) back-to-back as, say, 64-bit integers, but this is inefficient.
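
A sketch of this sequential placement (reusing the seeded `hash` helper from the RecSplit sketch above):

```rust
/// Place keys one at a time: S_i is the smallest seed such that
/// hash(k_i, S_i) lands in a still-empty slot, so S_i is geometric
/// with success probability (n - i)/n.
fn sequential_seeds(keys: &[u64]) -> Vec<u64> {
    let n = keys.len();
    let mut taken = vec![false; n];
    keys.iter()
        .map(|&k| {
            let s = (0u64..)
                .find(|&s| !taken[hash(k, s) as usize % n])
                .unwrap();
            taken[hash(k, s) as usize % n] = true;
            s
        })
        .collect()
}
```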

Suppose for the moment that all \(S_i\) are instead uniformly distributed between 0 and \(1/p_i\), rather than geometric. Then we need \(\log_2(1/p_i)\) bits to encode each \(S_i\), and it turns out that \(M^{opt} = \sum_i \log_2(1/p_i)\) is indeed the optimal number of bits to encode some successful \(S_i\) for each \(i\).

New idea. Instead of partitioning keys into buckets, we can directly assign each key a slice of, say, 16 bits in the output buffer of \(n\log_2 e + n\varepsilon\) bits. Basically things work exactly as in the example, but instead of knowing that a key has ID \(i\), we take \(x = h(k)/2^{64}\) as the relative position, and let it use the bits at position \((x + (1-x)\ln (1-x))\cdot n\cdot (\log_2 e + \varepsilon)\), which is the perfect bucket assignment function of PHOBIC (Hermann et al. 2024) and corresponds to the integral of \(\ln(1/p)\), the number of bits needed per key.
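
A sketch of this position computation (function name hypothetical):

```rust
/// Map a key's 64-bit hash to the bit position of its seed slice, via the
/// PHOBIC bucket assignment function gamma(x) = x + (1 - x) ln(1 - x),
/// scaled to an output buffer of n * (log2(e) + eps) bits.
fn seed_bit_position(h: u64, n: usize, eps: f64) -> u64 {
    // Relative position x in [0, 1); clamp to avoid ln(0) for the largest hash.
    let x = (h as f64 / 2f64.powi(64)).min(1.0 - 1e-9);
    let gamma = x + (1.0 - x) * (1.0 - x).ln(); // increases from 0 to 1
    let total_bits = n as f64 * (std::f64::consts::LOG2_E + eps);
    (gamma * total_bits) as u64
}
```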

Compared to the current buckets of around \(\lambda=4\) elements on average, in this new situation each element gets 1.5 to 2 bits of its own on average. This should speed up construction, since effectively we now have more but smaller buckets, meaning that elements can be placed one-by-one instead of a full bucket at once. (I.e., instead of doing up to 256 tries to place 4 elements at once, we now do 4 times 4 tries to place one element at a time.)

A drawback of this approach is that on the one hand we want to choose seeds from left to right, while on the other hand we want to assign them from large buckets (dense regions) to small buckets (sparse regions). Probably a mix of both can work. Anyway, because of the perfect bucket assignment function, the largest buckets will be at the start, and so going left to right should somewhat work. Still, it is to be expected that the distribution won’t be perfect for the small buckets towards the end, causing, say, 4 keys to map to the same bit position. One solution could be to add 8 buffer bits at the end of each cache line, so that dependencies never propagate too far. Then, we can again prioritize cache lines or bytes with a heavy load, and work outwards from those.

Either way, the hash-evict strategy as in PtrHash probably won’t work, since updating an existing seed value in the middle would throw off adjacent seeds as well.

Using the \(\varepsilon n\) buffer. We can spread out the buffer bits in various ways:

  • \(\varepsilon\) bits per element, so that the dense region at the start relatively gets a lot of buffer;
  • or uniformly spread out over the output bits;
  • or a mix of the two.

Further strategies to increase the probability of success could be:

  • As before, use load factor \(\alpha \approx 0.99 < 1\).
  • We could find \(\approx \sqrt n\) elements in the most dense regions, and separately store their hash values in a hash table. This would add a second cache miss for each element though, to check whether the value is special.
  • We could also write all-ones (\(11\dots1\)) to any region where we failed to place a key, and then look such keys up in a fallback table.

3 Tiny pointers and optimal open addressing hash tables

Tiny pointers are introduced in Bender et al. (2021) (hackernews). The core idea feels quite similar to most minimal perfect hashing algorithms: instead of storing the hashed position of a key using \(\log n\) bits of memory, it is sufficient to store a tiny pointer of only \(\log(1/\delta)\) bits (where \(1-\delta\) is the load factor), and to then look up keys using both the key and the tiny pointer.

The main data structures used to achieve these tiny pointers are:

  • Load-balancing tables, which split the input into blocks of size \(b=\delta^{-2} \log \delta^{-1}\). A block is chosen at random via the key, and if possible, the key is inserted into an empty position in the block and that position is returned.
  • In case insertion fails, a sparsely filled power-of-two-choices table is used, with buckets of size \(\log \log n\).
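
A rough sketch of the first of these (names hypothetical; the tiny pointer is simply the key’s slot index within the block it hashes to, i.e. \(\log_2 b\) bits; the seeded `hash` helper from earlier is reused):

```rust
/// Load-balancing table with blocks of `block_size` slots each.
struct LoadBalancingTable {
    slots: Vec<Option<u64>>, // num_blocks * block_size slots in total
    block_size: usize,       // b = delta^-2 log(1/delta)
}

impl LoadBalancingTable {
    /// Try to allocate a slot for `key`; on success, return the tiny pointer.
    fn allocate(&mut self, key: u64) -> Option<usize> {
        let num_blocks = self.slots.len() / self.block_size;
        let start = hash(key, 0) as usize % num_blocks * self.block_size;
        let block = &mut self.slots[start..start + self.block_size];
        let ptr = block.iter().position(|s| s.is_none())?; // fails if the block is full
        block[ptr] = Some(key);
        Some(ptr)
    }

    /// Dereference: the slot is fully determined by (key, tiny pointer).
    fn slot_of(&self, key: u64, ptr: usize) -> usize {
        let num_blocks = self.slots.len() / self.block_size;
        hash(key, 0) as usize % num_blocks * self.block_size + ptr
    }
}
```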

In the end, they manage to get down to variable-width tiny pointers of average size \(O(1 + \log \delta^{-1})\).

Funnel hashing. In Farach-Colton, Krapivin, and Kuszmaul (2025) (quanta, hackernews, youtube) the authors introduce new hash tables that achieve significantly higher load factors than previous methods, using techniques similar to (but, IMO, simpler than) tiny pointers. The simpler one is funnel hashing:

  1. Set the bucket size \(\beta=2\log \delta^{-1}\).
  2. Create arrays \(A_1, A_2, \dots\), each of size a multiple of \(\beta\), with \(|A_{i+1}| \approx \frac{3}{4} |A_i|\). Also keep a small separate fallback array \(A'\), so that the total size is \(n\).
  3. To try to insert a key into \(A_i\), hash it to a bucket of \(A_i\) and insert it into an empty slot of that bucket, if any.

To insert a key into the full data structure, try inserting into \(\beta\) random buckets of \(A_1\), then \(\beta\) random buckets of \(A_2\), and so on, as sketched below.
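
A minimal sketch of this insertion procedure (structure hypothetical; again reusing the seeded `hash` helper):

```rust
/// Funnel hashing: arrays A_1, A_2, ... of shrinking size, each split into
/// buckets of `beta` slots, plus a small fallback array A'.
struct FunnelTable {
    levels: Vec<Vec<Option<u64>>>, // A_1, A_2, ...; |A_{i+1}| ~ 3/4 |A_i|
    fallback: Vec<Option<u64>>,    // the small fallback array A'
    beta: usize,                   // bucket size beta = 2 log2(1/delta)
}

impl FunnelTable {
    /// Try beta random buckets of A_1, then of A_2, and so on.
    fn insert(&mut self, key: u64) -> bool {
        for (i, level) in self.levels.iter_mut().enumerate() {
            let num_buckets = level.len() / self.beta;
            for attempt in 0..self.beta as u64 {
                let seed = ((i as u64) << 32) | attempt;
                let start = hash(key, seed) as usize % num_buckets * self.beta;
                let bucket = &mut level[start..start + self.beta];
                if let Some(slot) = bucket.iter_mut().find(|s| s.is_none()) {
                    *slot = Some(key);
                    return true;
                }
            }
        }
        // All levels failed: spill into A' (linear probing, for simplicity).
        let m = self.fallback.len();
        let start = hash(key, u64::MAX) as usize % m;
        for j in 0..m {
            let idx = (start + j) % m;
            if self.fallback[idx].is_none() {
                self.fallback[idx] = Some(key);
                return true;
            }
        }
        false
    }
}
```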

The main benefit of this recursive data structure is that \(A_i\) slowly fills, but once it is somewhat full, we start falling back to a much emptier \(A_{i+1}\), so that insertions never get too hard. Also, this spreads the hardness of inserting into \(A_i\) over more keys, as all keys are tried against \(A_i\) before going to \(A_{i+1}\).

TODO: Can we somehow make a smooth version of this? Where the load factor slowly decreases as a function of position in the original array, and we don’t need the explicit levels?

TODO: Can we somehow use this as the basis of an MPHF? It may be hard to use together with the consensus scheme, since we explicitly first want to try inserting into \(A_i\) before \(A_{i+1}\), and having “random” high-order bits overlapping a previous \(S_i\) may mess this up.

Elastic hashing is a variant that batches insertions and fills one layer at a time (spilling into the next one). It uses some clever probing to trade off searching in a nearly full \(A_i\) against easy insertions in a mostly empty \(A_{i+1}\).

References

Bender, Michael A., Alex Conway, Martín Farach-Colton, William Kuszmaul, and Guido Tagliavini. 2021. “Tiny Pointers.” arXiv. https://doi.org/10.48550/ARXIV.2111.12800.
Farach-Colton, Martín, Andrew Krapivin, and William Kuszmaul. 2025. “Optimal Bounds for Open Addressing without Reordering.” arXiv. https://doi.org/10.48550/ARXIV.2501.02305.
Hermann, Stefan, Hans-Peter Lehmann, Giulio Ermanno Pibiri, Peter Sanders, and Stefan Walzer. 2024. “PHOBIC: Perfect Hashing with Optimized Bucket Sizes and Interleaved Coding.” In 32nd Annual European Symposium on Algorithms. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPICS.ESA.2024.69.
Lehmann, Hans-Peter, Peter Sanders, Stefan Walzer, and Jonatan Ziegler. 2025. “Combined Search and Encoding for Seeds, with an Application to Minimal Perfect Hashing.” arXiv. https://doi.org/10.48550/ARXIV.2502.05613.

  1. It is allowed for the success of \(B_i\) to depend on \(B_{i'}\) with \(i'<i\); see appendix C. ↩︎