These are some thoughts on the Consensus-based MPHF presented in Lehmann et al. (2025), and how this could be applied to PtrHash:
Lehmann, Hans-Peter, Peter Sanders, Stefan Walzer, and Jonatan Ziegler. 2025. “Combined Search and Encoding for Seeds, with an Application to Minimal Perfect Hashing.” arXiv. https://doi.org/10.48550/ARXIV.2502.05613.
Below are also some thoughts on the papers on tiny pointers, used to achieve hash tables with load factors very close to 1: Bender et al. (2021), Farach-Colton, Krapivin, and Kuszmaul (2025).
# 1 Consensus
**Setting.** Suppose we have $n$ tasks, where task $i$ succeeds for a given seed $s_i$ with probability $p_i$.[^1]
**Goal.** Store all seeds $s_i$ in as little space as possible.
**Optimal.** The optimal space to store some successful seed for task $i$ is $\log_2(1/p_i)$ bits, for a total of $\sum_i \log_2(1/p_i)$ bits.
**Main result.** The main result states that with only $\varepsilon$ extra bits per seed, so $\sum_i \log_2(1/p_i) + n\varepsilon$ bits in total, all seeds can be found and stored while doing only an expected $O(1/\varepsilon)$ factor more work than finding each seed once.
**Method.**
Consider a bitstring of length $\sum_i (\log_2(1/p_i) + \varepsilon)$, and let $b_i = \sum_{j<i} (\log_2(1/p_j) + \varepsilon)$ be the position where the seed of task $i$ starts. Then, seeds are searched and stored left to right:

- Using the additional bits between $b_i$ and $b_{i+1}$, find a seed for task $i$.
- On success, move on to task $i+1$.
- On failure, go back to task $i-1$, and search its next seed. If needed, move back up more.
## 1.1 Consensus-RecSplit
The MPHF they introduce using consensus is rather pretty:
- Assume that $n = 2^\ell$. We will build a depth-$\ell$ splitting tree.
- First find a seed that splits the keys into two equal sets.
- Then find two seeds that split the two sets equally into four sets.
- Repeat, until in the last level we find seeds that split each leaf of two keys into individual keys.
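The splitting tree can be sketched like this. It is a toy rendering with hypothetical helper names; in the real construction the per-node seeds would be handed to Consensus per level for compact storage, rather than collected in one flat list.

```python
import hashlib

def h(key, seed):
    # Toy hash: one deterministic bit deciding left/right child.
    d = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return d[0] & 1

def find_split_seed(keys):
    """Find a seed that sends exactly half of `keys` left (bit 0)."""
    seed = 0
    while True:
        left = [k for k in keys if h(k, seed) == 0]
        if len(left) == len(keys) // 2:
            right = [k for k in keys if h(k, seed) == 1]
            return seed, left, right
        seed += 1

def build_tree(keys, seeds):
    """Recursively split `keys` (length a power of two) down to single
    keys, appending the seed found at each internal node to `seeds`.
    The returned leaf order defines the MPHF: a key's index is its hash."""
    if len(keys) == 1:
        return keys
    seed, left, right = find_split_seed(keys)
    seeds.append(seed)
    return build_tree(left, seeds) + build_tree(right, seeds)
```

Building on $2^3 = 8$ keys finds $7$ seeds (one per internal node), and the position of each key among the leaves is its perfect-hash value.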
The beauty is now that we can use Consensus to compactly store the seeds in each
level using only minimal overhead. And as it turns out, this construction is
essentially space-optimal: the only overhead is the $\varepsilon$ extra bits per seed from Consensus.
The only drawback is that this now requires looking up one seed per level, so $\log_2 n$ seeds per query.
A bucketed version first distributes the keys over small buckets, reducing the height of each tree from $\log_2 n$ to the logarithm of the bucket size.
**Result.** The result is the first MPHF that can be constructed in time linear in the number of keys while getting within a factor $(1+\varepsilon)$ of the optimal $\log_2 e \approx 1.44$ bits per key.
**Commentary.** In a way, RecSplit and PTHash/PHOBIC/PtrHash are on opposite ends of a spectrum:
- RecSplit recursively splits the keys into equal halves, requiring $\log_2 n$ lookups.
- PTHash and variants split the keys into many buckets at once, requiring only a single lookup.
The benefit of the first approach is that it is much more balanced: at each point, the number of keys in each node is completely known, and searching for a seed that splits a node in half is relatively easy.
For PTHash on the other hand, keys are necessarily distributed over buckets of
distinct sizes, following some distribution. This requires processing buckets
from large to small and/or using a load factor $\alpha < 1$, so that free slots remain for the last buckets.
# 2 IDEA: Consensus-PtrHash
**An example.**
As a slightly artificial application, suppose we have $n$ keys, each of which needs its own seed.
Suppose for the moment that all seeds have the same success probability.
**New idea.**
Instead of partitioning keys into buckets, we can directly assign each key a
slice of, say, 16 bits in the shared output buffer of seed bits.
Compared to the current buckets of around $\lambda$ keys, this makes the assignment of keys to seeds much more fine-grained.
A drawback of this approach is that on the one hand we want to choose seeds from left to right, while on the other hand we want to assign them from large buckets (dense regions) to small buckets (sparse regions). Probably a mix of both can work. Anyway, because of the perfect bucket assignment function, the largest buckets will be at the start, and so going left to right should somewhat work. Still, it is to be expected that the distribution won’t be perfect for the small buckets towards the end, causing, say, 4 keys to map to the same bit position. One solution could be to add 8 buffer bits at the end of each cache line, so that dependencies never travel too far. Then, we can again prioritize cache lines or bytes with a heavy load factor, and work outwards from those.
Either way, the hash-evict strategy as in PtrHash probably won’t work, since updating an existing seed value in the middle would throw off the adjacent seeds as well.
Using the available buffer bits, these could be allocated:

- a fixed number of bits per element, so that the dense region at the start relatively gets a lot of buffer;
- or uniformly spread out over the output bits;
- or a mix of the two.
Further strategies to increase the probability of success could be:
- As before, use a load factor $\alpha < 1$.
- We could find the hardest few elements in the most dense regions, and separately store their hash values in a hash table. This would add a second cache miss for each element though, to check whether the value is special.
- We could also write `1111111` to any region that we failed to place, and then look it up in a fallback table.
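The fallback-table lookup from the last point could look roughly like this. All names here are hypothetical, and `eval_seed` stands in for whatever seed-based evaluation the scheme would use.

```python
FALLBACK_MARKER = 0xFFFF  # an all-ones 16-bit slice marks "not placed"

def lookup(key, slices, slice_of, eval_seed, fallback):
    """Read the key's 16-bit slice; if it is the all-ones marker, the
    key could not be placed there and lives in a fallback table instead.

    slices:    list of 16-bit seed slices
    slice_of:  maps a key to its slice index
    eval_seed: evaluates the hash of `key` under a given seed slice
    fallback:  explicit table for the keys that failed to place
    """
    s = slices[slice_of(key)]
    if s == FALLBACK_MARKER:
        return fallback[key]
    return eval_seed(key, s)
```

The cost of this design is an extra branch on the hot path, plus a (rare) second lookup for the fallback keys.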
# 3 Tiny pointers and optimal open addressing hash tables
Tiny pointers are introduced in Bender et al. (2021) (hackernews).
The core idea feels quite similar to most minimal perfect hashing algorithms:
instead of storing the hashed position of a key using $\log_2 n$ bits, it suffices to store a much smaller hint (the tiny pointer) that, together with the key itself, recovers the position.
The main data structures used to achieve these tiny pointers are:

- Load-balancing tables, that split the input into blocks of size $b$. Then a block is chosen at random via the key, and if possible, the key is inserted into an empty position in the block and that position is returned.
- In case insertion fails, a sparsely filled power-of-two-choices table is used, with buckets of size $O(\log \log n)$.
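The load-balancing table can be sketched as follows. This is my own toy rendering, not the paper's exact construction: the point is that the returned "tiny pointer" is only the slot index within the key's block, so it needs $\log_2 b$ bits rather than the $\log_2 n$ bits of a full position.

```python
class DereferenceTable:
    """Toy load-balancing table in the spirit of tiny pointers."""

    def __init__(self, num_blocks, b):
        self.b = b
        # num_blocks blocks of b slots each; None marks a free slot.
        self.slots = [[None] * b for _ in range(num_blocks)]

    def block_of(self, key):
        # The block is determined by the key's hash, so it need not
        # be stored: only the within-block slot index is returned.
        return hash(key) % len(self.slots)

    def allocate(self, key):
        """Return a tiny pointer (slot index within the key's block),
        or None if the block is full (the real structure falls back
        to a power-of-two-choices table in that case)."""
        block = self.slots[self.block_of(key)]
        for i in range(self.b):
            if block[i] is None:
                block[i] = key
                return i
        return None

    def dereference(self, key, tiny_ptr):
        """Recover the stored value from the key plus its tiny pointer."""
        return self.slots[self.block_of(key)][tiny_ptr]
```

Note that dereferencing needs both the key and the tiny pointer, which is exactly why the pointer itself can be so small.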
In the end, they manage to get down to variable-width tiny pointers whose average size is far smaller than the $\log_2 n$ bits of an explicit position.
**Funnel hashing.** In Farach-Colton, Krapivin, and Kuszmaul (2025) (quanta, hackernews, youtube) the authors introduce new hash tables that achieve significantly higher load factors than previously possible, using techniques similar to (but simpler than, IMO) tiny pointers. The simpler one is funnel hashing:
- Set the bucket size $b$.
- Create arrays $A_1, \dots, A_k$ of geometrically decreasing size, each of which is a multiple of $b$ in size. Also keep a small separate fallback array, so that the total size is $n$.
- To try to insert a key into $A_i$, hash it to a bucket of $A_i$, and insert it into a free slot of that bucket if there is one.
To insert a key in the full data structure, try inserting into $A_1$, then $A_2$, and so on, resorting to the fallback array if all levels fail.
The main benefit of this recursive data structure is that each $A_i$ can be filled nearly completely before keys overflow into the smaller $A_{i+1}$, so that very high load factors are reached while probe sequences stay short.
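A toy sketch of this scheme, under my own assumptions (levels halving in size, one bucket probed per level, and a plain list standing in for the special fallback structure):

```python
class FunnelHash:
    """Toy funnel hashing: levels A_1..A_k of geometrically
    decreasing size, each split into buckets of b slots."""

    def __init__(self, n, b=4, shrink=2):
        self.b = b
        self.levels = []
        size = n // 2
        while size >= b:
            # Round each level down to a multiple of the bucket size.
            self.levels.append([None] * (size - size % b))
            size //= shrink
        self.overflow = []  # stand-in for the special fallback array

    def _bucket(self, level, key):
        return (hash(key) % (len(level) // self.b)) * self.b

    def insert(self, key):
        for level in self.levels:
            start = self._bucket(level, key)
            for i in range(start, start + self.b):
                if level[i] is None:
                    level[i] = key
                    return
            # Bucket full: fall through to the next, smaller level.
        self.overflow.append(key)

    def contains(self, key):
        for level in self.levels:
            start = self._bucket(level, key)
            if key in level[start:start + self.b]:
                return True
        return key in self.overflow
```

Both insertion and lookup probe at most one bucket per level, so the probe length is bounded by $k \cdot b$ regardless of how full the table is.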
TODO: Can we somehow make a smooth version of this? Where the load factor slowly decreases as a function of position in the original array, and we don’t need the explicit levels?
TODO: Can we somehow use this as the basis of an MPHF? It may be hard to use
together with the consensus scheme, since we explicitly first want to try
inserting into $A_1$ before moving on to the later levels.
Elastic hashing is a variant that batches insertions and fills one layer at a
time (spilling into the next one). It uses some clever probing to trade
searching in a nearly full $A_i$ against searching in the much emptier $A_{i+1}$.
# References
[^1]: It is allowed for the success of task $i$ to depend on the seeds $s_j$ with $j < i$; see appendix C.