- Abstract
- 1 Introduction
- 2 Related work
- 3 PtrHash
- 4 Results
- 4.1 Construction
- 4.1.1 Bucket Functions
- 4.1.2 Tuning Parameters for Construction
- 4.1.3 Remap
- 4.2 Comparison to Other Methods
- 5 Conclusions and Future Work
- Acknowledgements
- Funding
- 6 Appendix: Query throughput
- 7 Appendix: Sharding
- 7.1 Evaluation
- 8 Appendix: Evaluating Hash Functions
- 9 TODO
\[ \newcommand{\part}{\mathsf{part}} \newcommand{\bucket}{\mathsf{bucket}} \newcommand{\slot}{\mathsf{slot}} \newcommand{\reduce}{\mathsf{reduce}} \newcommand{\h}{\mathsf{h}} \newcommand{\hp}{\mathsf{h}_{\mathsf{p}}} \newcommand{\C}{\mathsf{C}} \newcommand{\select}{\mathsf{select}} \newcommand{\free}{F} \newcommand{\mphf}{\mathsf{H_{mphf}}} \]
This is the work-in-progress paper on PtrHash. The original blog post on its development is here.
Abstract
Motivation. Given a set \(S\) of \(n\) keys, a minimal perfect hash function (MPHF) is a collision-free bijective map \(\mphf\) from \(S\) to \(\{0, \dots, n-1\}\). These functions are used in databases and search engines, and in bioinformatics indexing tools such as Pufferfish (using BBHash) and Piscem (using PTHash). PTHash is also used in SSHash, a data structure on $k$-mers that supports membership queries. PTHash only takes around \(5\%\) of the total space of SSHash, and thus, trading slightly more space for faster queries is beneficial. To this end, this work presents a (minimal) perfect hash function that first prioritizes query throughput, while also allowing efficient construction for \(10^9\) or more elements using up to 3.0 bits of memory per key.
Contributions. Both PTHash and PHOBIC first map all \(n\) keys to \(n/\lambda < n\) buckets. Then, each bucket stores a pilot that controls the final hash value of the keys mapping to it. PtrHash builds on this by using 1) fixed-width (uncompressed) 8-bit pilots, and 2) a construction algorithm similar to cuckoo-hashing to find suitable pilot values. Further, it 3) uses the same number of buckets and slots for each part, with 4) a single remap table to map intermediate positions \(\geq n\) to \(<n\), 5) encoded using per-cacheline Elias-Fano coding. Lastly, 6) PtrHash supports streaming queries, where we use prefetching to answer a stream of multiple queries more efficiently than one-by-one processing.
Results. With default parameters, PtrHash takes 2.40 bits per key. On 300 million string keys, PtrHash is as fast or faster to build than other MPHFs, and at least \(1.75\times\) faster to query. When streaming multiple queries, this improves to \(3.1\times\) speedup over the fastest alternative, while also being significantly faster to construct. When using \(10^9\) integer keys instead, query times are as low as 12 ns/key when iterating in a for loop, or even down to 8 ns/key when using the streaming approach, within \(10\%\) of the maximum memory-bound throughput.
Source: github.com/RagnarGrootKoerkamp/ptrhash
1 Introduction
Given a set of \(n\) keys \(\{k_0, \dots, k_{n-1}\}\), a hash function maps them to some co-domain \([m] := \{0, \dots, m-1\}\). When \(m\geq n\) and the hash is injective (collision-free), it is also called perfect. When additionally \(m=n\) and it is surjective onto \([n]\), it is minimal. Thus, a minimal perfect hash function (MPHF) bijectively maps a set of \(n\) keys onto \([n]\).
Metrics. Various aspects of MPHF data structures can be optimized. First, one could minimize its space usage and try to approach the \(\log_2(e)=1.44\) bits/key lower bound (Mehlhorn 1982). Indeed, there are many recent works in this direction, such as Bipartite ShockHash-RS, which gets below 1.5 bits/key (Lehmann, Sanders, and Walzer 2024, 2023a; Lehmann 2024).
In this paper, we focus primarily on optimizing for query throughput and secondarily on construction speed, while relaxing space usage up to 3 bits/key. This continues the line of work of FCH (Fox, Chen, and Heath 1992), PTHash (Pibiri and Trani 2021, 2024), and PHOBIC (Hermann et al. 2024a), that all provide relatively fast queries.
Problem statement. Construct a minimal perfect hash function data structure \(\mphf\) that is fast to query, ideally using one memory access per lookup, and fast to construct, while staying below 3 bits/key of space.
Motivation. Our main motivating application is to optimize the use of PTHash in SSHash (Pibiri 2022), a data structure to index a set of $k$-mers (sequences of \(k\) DNA bases). There, the MPHF only takes around \(5\%\) of the total space. Thus, a slightly increased space usage of the MPHF has little effect on the total space, while faster lookups could significantly improve the overall query speed. In this application, $k$-mers are typically encoded as 64-bit integers, and thus we will focus our attention on integer keys.
Further applications can be found in domains such as networking (Lu, Prabhakar, and Bonomi 2006), databases (Chang 2005), and full-text indexing (Belazzougui and Navarro 2014), where one could imagine hashing IP addresses, URLs, or (compact) suffix-trie edge labels.
Contributions. We introduce PtrHash, a minimal perfect hash function that is primarily optimized for query throughput and construction speed, at the cost of slightly more memory usage.
Compared to PTHash and PHOBIC, the main novelties are:
- a fixed number of buckets and slots per part;
- the use of fixed-width 8-bit pilots, leading to very simple lookups;
- a pilot search based on cuckoo-hashing;
- remapping using a single remap table, again simplifying lookups;
- a remap table based on a per-cacheline Elias-Fano encoding (Elias 1974; Fano 1971), CacheLineEF;
- the use of prefetching to stream multiple queries in parallel.
Results. When using 300 million string keys, PtrHash with default parameters takes 2.40 bits/key and is nearly as fast to construct as the fastest other methods, while being much faster to query. Compared to the next-fastest method to query, PtrHash provides \(1.75\times\) faster queries when looping naively, or \(3.1\times\) faster when streaming.
When using \(10^9\) integer keys instead, PtrHash can achieve a throughput of up to 12 ns/key when looping over queries, or even 8 ns/key when streaming, close to matching the maximum throughput of random memory accesses of a single thread. In a multi-threaded setting, PtrHash can fully saturate the DDR4 memory bandwidth with 4 threads.
2 Related work
There is a vast amount of literature on (minimal) perfect hashing. Here we only highlight recent approaches. We refer the reader to Section 2 of (Pibiri and Trani 2024) and Sections 4 and 8 of the thesis of Hans-Peter Lehmann (Lehmann 2024), which contain a nice overview of the different approaches taken by various tools.
Space lower bound. There is a lower bound of \(n \log_2(e)\) bits to store a minimal perfect hash function on \(n\) random keys (Mehlhorn 1982). To get some feeling for this bound, consider any hash function. Intuitively, the probability that it is an MPHF is \(n!/n^n\). From this, it follows that at most around \(\log_2(n^n/n!)\approx n\log_2(e)\) bits of information are needed to ‘‘steer’’ the hash function in the right direction, where the approximation follows from Stirling's formula \(n!\approx (n/e)^n\). Now, a naive approach is to use a seeded hash function, and try \(O(e^n)\) seeds until a perfect hash function is found. Clearly, that is not feasible in practice.
Brute-force. When \(n\) is small, \(e^n\) can be sufficiently small to allow a brute-force search for a suitable seed. RecSplit exploits this by first partitioning the input keys into buckets, and then recursively splitting buckets until they have size at most \(\ell \leq 16\). These leaves can then be solved using brute-force, and the overall space usage can be as low as 1.56 bits/key. SIMDRecSplit significantly improves the construction time by using a meet-in-the-middle approach for the leaves, and generally speeds up the implementation. Consensus-RecSplit (Lehmann et al. 2025) is a recent MPHF that is the first to achieve construction time linear in \(1/\varepsilon\), where \(\varepsilon\) is the bits-per-key space overhead on top of the \(\log_2(e)\) lower bound. Its core idea is to efficiently encode the seeds for multiple sub-problems together.
Graphs. SicHash (Lehmann, Sanders, and Walzer 2023b) and its predecessor BPZ (Botelho, Pagh, and Ziviani 2007) are based on hypergraph peeling, which was first introduced in Majewski et al. (1996) and analyzed further in Molloy (2005): nodes are the \(n\) hash values, and each key corresponds to a size-\(r\) hyper-edge. Then keys can be assigned a value one-by-one as long as each set of \(k\) keys covers at least \(k+1\) values. This is also akin to cuckoo hashing, where each key has \(r=2\) target locations. ShockHash (Lehmann, Sanders, and Walzer 2024) then takes the RecSplit framework and uses an \(r=2\) cuckoo table for the base case. It then tries \(O((e/2)^n)\) seeds until one is found that allows building the cuckoo hash table. Bipartite ShockHash-RS (Lehmann, Sanders, and Walzer 2023a) further improves this by using meet-in-the-middle on the seeds, improving the construction time to \(O((\sqrt{e/2})^n) = O(1.166^n)\). This is currently the most space efficient approach. Bipartite ShockHash-Flat is a variant that trades space for more efficient queries.
Fingerprinting. A completely different technique was introduced by (Chapman et al. 2011; Müller et al. 2014), and used in BBHash (Limasset et al. 2017). Here, the idea is to start with any hash function mapping into \([\gamma n]\) for some \(\gamma \geq 1\). Any slots that have exactly one element mapping to them are marked with a 1, and the remaining \(n_1\) elements are processed recursively, mapping them to \([\gamma n_1]\). Lookups are then done using rank queries on this bitvector. FMPH (Beling 2023) has a much faster implementation of the construction that goes down to 3.0 bits/key. FiPS (Lehmann 2024) is a variant that trades some space in the rank data structure for faster queries. FMPHGO (Beling 2023) is a variant that first splits keys into buckets, then uses a seeded hash function that has a low number of collisions, and only then recurses into colliding keys. This reduces the space usage and number of recursion steps, leading to faster queries, but takes longer to construct.
Bucket placement. PtrHash builds on methods that first group the keys into buckets of a few keys. Then, keys in the buckets are assigned their hash value one bucket at a time, such that newly assigned values do not collide with previously taken values. All methods iterate different possible key assignments for each bucket until a collision-free one is found, but differ in the way hash values are determined. To speed up this search, large buckets are placed before small buckets.
FCH (Fox, Chen, and Heath 1992) uses a fixed number of bits to encode the seed for each bucket and uses a skew distribution of bucket sizes. The seed stored in each bucket determines how far the keys are displaced (shifted) to the right from their initially hashed position. A fallback hash can be used if needed, and construction can fail if that also does not work. CHD (Belazzougui, Botelho, and Dietzfelbinger 2009) uses uniform bucket sizes, but uses a variable-width encoding for the seeds. PTHash (Pibiri and Trani 2021) combines these two ideas and introduces a number of compression schemes for the seed values, which are called pilots. Instead of directly generating an MPHF, it first generates a PHF to \([n']\) for \(n'=n/\alpha \approx n/0.99\), and values mapping to positions \(\geq n\) are remapped to the skipped values in \([n]\). PTHash-HEM (Pibiri and Trani 2024) first partitions the keys, and uses this to build multiple parts in parallel. This also enables external-memory construction. Lastly, PHOBIC (Hermann et al. 2024a) improves from the simple skew distribution of FCH to an optimal bucket assignment function, which speeds up construction and enables smaller space usage. Secondly, it partitions the input into parts of expected size 2500 and uses the same number of buckets for each part. Then, it uses that the pilot values of the \(i\)'th bucket of each part follow the same distribution, and encodes them together. Together, this saves 0.17 bits/key over PTHash.
3 PtrHash
The core design goal of PtrHash1 is to simplify PTHash to speed up both query speed and construction time, at the cost of possibly using slightly more memory. We first give a high level overview of PtrHash (3.1). Then, we explain specific parts of PtrHash in more detail.
In Appendix 6, we investigate batching to improve query throughput, and in Appendix 7 we give details on the sharding of the input to construct PtrHash on large inputs.
3.1 Overview
Figure 1: Overview of PtrHash on \(n=23\) keys. The keys are hashed into \([H] = [2^{64}]\) and this range is split into \(P=2\) parts and \(B=5\) buckets per part. In red are four keys hashing to the same bucket in the first part, and in blue are three keys belonging to the same bucket in the second part. The pilots of the \(P\cdot B=10\) buckets in the highlighted area are the main component of the data structure, and control to which slots keys in the bucket are mapped to avoid collisions. The blue highlighted key is initially mapped to a position \(\geq n\), and thus (along with the other yellow cells) remapped into an empty slot \(<n\) via a (compressed) table of free slots.
Before going into details, we first briefly explain the fully constructed PtrHash data structure and how to query it, see Figure 1 and Code Snippet 1. We also highlight differences to PTHash (Pibiri and Trani 2021) and PHOBIC (Hermann et al. 2024a).
Parts and buckets. The input is a set of \(n\) keys \(\{k_0, \dots, k_{n-1}\}\) that we want to hash to \(n\) slots \([n]:=\{0, \dots, n-1\}\). We first hash the keys using a 64-bit hash function \(\h\) into \(\{\h(k_0), \dots, \h(k_{n-1})\}\). The total space of hashes \([2^{64}]\) is equally partitioned into \(P\) parts, and the part of a key is easily found as \(\left\lfloor P\cdot \h(k_i) / 2^{64}\right\rfloor\) (Lemire 2019). Then, the expected \(n/P\) keys in each part are partitioned into exactly \(B\) non-uniform buckets: each key has a relative position \(x\) inside the part, and this is passed through a bucket assignment function \(\gamma: [0,1)\mapsto[0,1)\) such as \(\gamma(x)=x^2\) that controls the distribution of expected bucket sizes (Hermann et al. 2024a), as explained in detail in 3.4. The result is then scaled to a bucket index in \([B]\):
\begin{align} \begin{split} \part(k_i) &:= \left\lfloor P\cdot \h(k_i) / 2^{64}\right\rfloor,\\ x &:= \big((P\cdot \h(k_i)) \bmod 2^{64}\big)/2^{64},\\ \bucket(k_i) &:= \left\lfloor B\cdot \gamma(x)\right\rfloor. \end{split}\label{eq:partbucket} \end{align}
Slots and pilots. Now, the goal and core of the data structure is to map the \(n/P\) expected keys in each part to \(S\approx (n/P)/\alpha\) slots, where \(\alpha\approx 0.99\) gives us roughly \(1\%\) extra slots to play with. The pilot for each bucket controls to which slots its keys map. PtrHash uses fixed-width $8$-bit pilots \(\{p_0, \dots, p_{P\cdot B-1}\}\), one for each bucket. Specifically, key \(k_i\) in bucket \(\bucket(k_i)\) with pilot \(p_{\bucket(k_i)}\) maps to slot
\begin{equation} \slot(k_i) := \part(k_i) \cdot S + \reduce(\h(k_i) \oplus \hp(p_{\bucket(k_i)}), S),\label{eq:slot} \end{equation}
where \(\reduce(\cdot, S)\) maps the random 64-bit integer into \([S]\) as explained below.
Compared to PHOBIC and PTHash(-HEM) (Pibiri and Trani 2024), there are two differences here. First, while we still split the input into parts, we assign each part not only the same number of buckets, but also the same number of slots, instead of scaling the number of slots with the actual size of each part. This removes the need to store a prefix sum of part sizes, and avoids one memory access at query time to look up the offset of the part. Second, previous methods search for arbitrarily large pilot values that require some form of compression to store efficiently. Our 8-bit pilots can simply be stored in an array so that lookups are simple.
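To make the lookup concrete, below is a minimal Rust sketch of the query path, assuming 64-bit integer keys hashed with FxHash, \(S=2^s\) slots per part, and the quadratic bucket assignment function \(\gamma(x)=x^2\); the \(\reduce\) step (a multiply-high by \(\C\) followed by a mask of the low \(s\) bits) is detailed in 3.2. All names are illustrative and not the actual library API.

```rust
/// Illustrative sketch of a PtrHash lookup; not the library's actual API.
struct PtrHashSketch {
    parts: u64,            // P
    buckets_per_part: u64, // B
    slot_bits: u32,        // s, with S = 2^s slots per part
    pilots: Vec<u8>,       // P * B pilots, 8 bits each
}

const C: u64 = 0x517cc1b727220a95; // FxHash mixing constant

/// High 64 bits of the 128-bit product, i.e. floor(a*b / 2^64).
fn mul_high(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) >> 64) as u64
}

impl PtrHashSketch {
    fn slot(&self, key: u64) -> usize {
        let h = key.wrapping_mul(C);        // h(k): FxHash for integer keys
        let part = mul_high(h, self.parts); // floor(P * h / 2^64)
        let x = h.wrapping_mul(self.parts); // (P*h) mod 2^64: position within the part
        let bucket = mul_high(mul_high(x, x), self.buckets_per_part); // floor(B * x^2)
        let pilot = self.pilots[(part * self.buckets_per_part + bucket) as usize];
        let hp = (pilot as u64).wrapping_mul(C); // h_p(p) = C * p
        // reduce(h ^ hp, 2^s): multiply-high by C, then keep the low s bits.
        let r = mul_high(h ^ hp, C) & ((1u64 << self.slot_bits) - 1);
        ((part << self.slot_bits) + r) as usize // part * S + reduce(...)
    }
}
```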
Construction. The main difficulty of PtrHash is during construction (3.3), where we must find values of the pilots \(p_j\) such that all keys indeed map to different slots. Like other methods, PtrHash processes multiple parts in parallel. Within each part, it sorts the buckets from large to small and ‘greedily’ assigns them the smallest pilot value that maps the keys in the bucket to slots that are still free. Unlike other methods though, PtrHash only allows pilots up to \(255\). When no suitable pilot is found, we use a method similar to (blocked) cuckoo hashing (Pagh and Rodler 2001; Fotakis et al. 2004): a pilot with a minimal number of collisions is chosen, and the colliding buckets are ’evicted’ and will have to search for a new pilot. A similar approach was discovered independently by Stefan Hermann (Section 4.5 Hermann 2023).
3.2 Details
We now go over some specific details.
Hash functions. The 8-bit pilots \(p_b\) are hashed into pseudo-random 64-bit integers by using FxHash (Breeden, n.d.) for \(\hp\), which simply multiplies the pilot with a mixing constant \(\C\):
\begin{equation} \hp(p) := \C \cdot p. \end{equation}
When the keys are 64-bit integers, we use this same FxHash algorithm to hash them (\(\h(k) := \C\cdot k\)), since multiplication by an odd constant is invertible modulo \(2^{64}\) and hence collision-free. For other types of keys, the hash function depends on the number of elements. When the number of elements is not too far above \(10^9\), the probability of hash collisions with a 64-bit hash function is sufficiently small, and we use the 64-bit variant of xxHash (Collet, n.d.; Martirosian, n.d.). When the number of keys goes beyond \(2^{32} \approx 4\cdot 10^9\), the probability of 64-bit hash collisions increases. In this case, we use the \(128\) bit variant of xxHash. The high 64 bits determine the part and bucket in Equation \ref{eq:partbucket}, and the low 64 bits are used in Equation \ref{eq:slot} to determine the slot.
The reduce function. When 64-bit hashes are used, we must ensure that all bits of the hash are used to avoid collisions. A simple choice would be \(\reduce(x,S) = x\bmod S\), which uses all bits when \(S\) is not a power of \(2\) and takes two multiplications using ‘fast mod’ (Lemire, Kaser, and Kurz 2019). Instead, we use \(S=2^s\), so that \(x\bmod 2^s\) is a simple bit-mask. Unfortunately, this only uses the lower \(s\) bits of the hash, while the \(\part\) and \(\bucket\) functions use the high \(\log_2(P\cdot B)\) bits, leaving some entropy in the middle bits unused.
As a solution, we first multiply \(x\) by the mixing constant \(\C\), and then take the low \(s\) bits of the high half. This uses all input bits and only needs a single multiplication, giving a small speedup over fast mod:
\begin{equation} \reduce(x, 2^s) := \left\lfloor \C\cdot x/2^{64}\right\rfloor \bmod 2^s. \end{equation}
Remapping. Since each part has slightly (\(\approx 1\%\)) more slots than keys, some keys will map to an index \(\geq n\), leading to a non-minimal perfect hash function. To fix this, those are remapped back into the ‘gaps’ left behind in slots \(<n\) using a (possibly compressed) lookup table. This is explained in detail in 3.5.
Whereas PTHash-HEM uses a separate remap per part, PtrHash only has a single ‘global’ remap table. PHOBIC directly builds a full \(\alpha=1\) table, and does not need any remapping.
Parameter values. In practice, we usually use \(\alpha=0.99\). Similar to PHOBIC, the number of buckets per part is set to \(B = \lceil(\alpha\cdot S)/\lambda\rceil\), where \(\lambda\) is the expected size of each bucket and is around \(3\) to \(4\). The number of parts is \(P=\lceil n/(\alpha S)\rceil\).
Choosing the number of slots per part \(S\). PTHash-HEM and PHOBIC randomly split the keys into parts, and a part with \(n_i\) elements gets \(S_i=n_i/\alpha\) slots. In PtrHash, each part has the same number of slots \(S\). We would prefer many small parts, since smaller parts fit better in cache and hence are faster to construct. On the other hand, there is some variance in the part sizes, and the largest parts will contain more than \(n/P\) keys. In particular, for a given \(S\) and \(P=P(S)=\lceil n/(\alpha S)\rceil\), we estimate the size of the largest part as \(n/P + \sqrt{n/P}\cdot \sqrt{2 \ln P}\). We then choose \(S\) as the smallest power of two for which this is below \(S-1.5\sqrt{n/P}\), where the buffer ensures that, at least in practice, a larger-than-expected largest part still fits.
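As an illustration, the following sketch picks \(S\) according to this estimate (a hypothetical helper, not the library's exact routine):

```rust
/// Sketch: smallest power-of-two number of slots per part S such that the
/// estimated largest part still fits with a 1.5*sqrt(n/P) buffer.
fn slots_per_part(n: u64, alpha: f64) -> u64 {
    let mut s = 1u64 << 10; // some small power of two to start from
    loop {
        let p = (n as f64 / (alpha * s as f64)).ceil(); // P = ceil(n / (alpha * S))
        let avg = n as f64 / p;                         // expected part size n/P
        let largest = avg + avg.sqrt() * (2.0 * p.ln()).sqrt();
        if largest <= s as f64 - 1.5 * avg.sqrt() {
            return s;
        }
        s *= 2;
    }
}
```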
3.3 Construction
Both PTHash-HEM and PHOBIC first partition the keys into parts, and then build an MPHF part-by-part, optionally in parallel on multiple threads. Within each part, the keys are randomly partitioned (Figure 1) into buckets of average size \(\lambda\). Then, the buckets are sorted from large to small, and one-by-one greedily assigned a pilot, such that the keys in the bucket map to slots not yet covered by earlier buckets.
As more buckets are placed, there are fewer remaining empty slots, and searching for pilots becomes harder. Hence, PTHash uses \(n/\alpha > n\) slots to ensure there are sufficiently many empty slots for the last pilots. This speeds up the search and reduces the values of the pilots. PHOBIC, on the other hand, uses relatively small parts of expected size 2500, so that the search for the last empty slot usually should not take much more than 2500 attempts. Nevertheless, a drawback of the greedy approach is that pilot values have an uneven distribution, making it somewhat harder to compress them while still allowing fast access (e.g. requiring the interleaved coding of PHOBIC).
Hash-evict2. In PtrHash, we instead use fixed-width, single-byte pilots. To achieve this, we use a technique resembling cuckoo hashing (Pagh and Rodler 2001) that was also independently found in (Section 4.5 Hermann 2023). As before, buckets are greedily inserted from large to small. For some buckets, there may be no pilot in \([2^8]\) such that all its keys map to empty slots. When this happens, a pilot is found with the lowest weighted number of collisions. The weight of a collision with an element of a bucket of size \(s\) is \(s^2\), to prevent evicting large buckets, as those are harder to place. The colliding buckets are evicted by emptying the slots they map to and pushing them back onto the priority queue of remaining buckets. Then, the new bucket is inserted, and the next largest remaining or evicted bucket is inserted.
Implementation details. In order to speed up the search for pilots, a number of optimizations are made.
- taken bit mask. Like PTHash and PHOBIC, we keep a taken bit mask that indicates for each slot whether it was taken. This keeps the array small so it can be cached efficiently.
- Collision-free hot path. When searching for pilots, we first test if there is a pilot without any collisions. This is usually the case, and is faster since it only needs access to the bit vector. Additionally, when there is a collision, we know a pilot is optimal when it collides with exactly one bucket of minimal size, allowing for an early break.
- Avoiding loops. To avoid repeated patterns of the same buckets evicting each other, the search for a pilot starts at a random number in \([2^8]\), rather than always restarting at 0.
- Avoiding loops more. Each time a bucket is placed that evicted some other bucket(s), it is added to a list of the 16 most recently placed buckets. Buckets in this list are never evicted. This avoids short cycles, where for example two buckets keep evicting each other from the same slot.
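Putting the eviction strategy and the optimizations above together, the pilot search for a single part could look roughly like the sketch below. Helper names are illustrative; for brevity this sketch uses a plain stack instead of a size-ordered priority queue, omits the protection of recently placed buckets, and does not reject pilots that collide within the bucket itself.

```rust
// Sketch of the hash-evict search for one part. `slot_in_part` follows the
// slot equation from 3.1, restricted to a single part.
const C: u64 = 0x517cc1b727220a95;

fn slot_in_part(h: u64, pilot: u8, slot_bits: u32) -> usize {
    let hp = (pilot as u64).wrapping_mul(C);
    ((((h ^ hp) as u128 * C as u128) >> 64) as u64 & ((1u64 << slot_bits) - 1)) as usize
}

/// `buckets[b]` holds the hashes of bucket b; buckets are sorted by decreasing size.
fn find_pilots(buckets: &[Vec<u64>], slot_bits: u32) -> Vec<u8> {
    let mut pilots = vec![0u8; buckets.len()];
    let mut taken: Vec<Option<usize>> = vec![None; 1usize << slot_bits]; // occupying bucket per slot
    let mut todo: Vec<usize> = (0..buckets.len()).rev().collect();       // largest bucket on top

    while let Some(b) = todo.pop() {
        // Search all 256 pilots, starting at a pseudo-random offset to avoid cycles.
        let start = (b as u64).wrapping_mul(C) as u16 % 256;
        let mut best: Option<(u64, u8)> = None; // (weighted collision cost, pilot)
        for d in 0..256u16 {
            let p = ((start + d) % 256) as u8;
            let cost: u64 = buckets[b]
                .iter()
                .filter_map(|&h| taken[slot_in_part(h, p, slot_bits)])
                .map(|other| (buckets[other].len() as u64).pow(2)) // weight s^2
                .sum();
            if cost == 0 { best = Some((0, p)); break; } // collision-free hot path
            if best.map_or(true, |(c, _)| cost < c) { best = Some((cost, p)); }
        }
        let (cost, p) = best.unwrap();
        if cost > 0 {
            // Evict the colliding buckets: free their slots and push them back.
            let mut evicted: Vec<usize> = buckets[b]
                .iter()
                .filter_map(|&h| taken[slot_in_part(h, p, slot_bits)])
                .collect();
            evicted.sort_unstable();
            evicted.dedup();
            for e in evicted {
                for &h in &buckets[e] { taken[slot_in_part(h, pilots[e], slot_bits)] = None; }
                todo.push(e);
            }
        }
        for &h in &buckets[b] { taken[slot_in_part(h, p, slot_bits)] = Some(b); }
        pilots[b] = p;
    }
    pilots
}
```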
Analysis. Unfortunately, we do not currently have a formal analysis showing that the hash-evict method works with high probability given that certain criteria are met. Ideally, the analysis of (Section 5 Hermann 2023) would be extended to fully cover our method. In 4, we show some practical results.
3.4 Bucket Assignment Functions
During construction, slots fill up as more buckets are placed. Because of this, the first buckets are much easier to place than the later ones, when only few empty slots are left. To compensate for this, we can introduce an uneven distribution of bucket sizes, so that the first buckets are much larger and the last buckets are smaller. FCH (Fox, Chen, and Heath 1992) accomplishes this by a skew mapping that assigns \(60\%\) of the elements to \(30\%\) of the buckets, so that those \(30\%\) are large buckets while the remaining \(70\%\) is small (Table 1). This is also the scheme used by PTHash.
The optimal bucket function. PHOBIC (Hermann et al. 2024a) provides a more thorough analysis and uses the optimal function \(\gamma_p(x) = x + (1-x)\ln (1-x)\) when the target load factor is \(\alpha=1\). A small modification is optimal for \(\alpha<1\) (Hermann et al. 2024b Appendix B), but for simplicity we only consider the original \(\gamma_p\). This function has derivative \(0\) at \(x=0\), so that many \(x\) values map close to \(0\). In practice, this causes the largest buckets to have size much larger than \(\sqrt S\). Such buckets are hard to place, because by the birthday paradox they are likely to have multiple elements hashing to the same slot. To fix this, PHOBIC ensures the slope of \(\gamma\) is at least \(\varepsilon=1/\big(5 \sqrt S\big)\) by using \(\gamma_{p,\varepsilon}(x) = x + (1-\varepsilon)(1-x)\ln(1-x)\) instead. For simplicity in the implementation, we fix \(\varepsilon = 1/{2^8}\) which works well in practice.
Approximations. For PtrHash, we aim for high query throughput, and thus we would like to only use simple computations and avoid additional lookups as much as possible. To this end, we replace the \(\ln (1-x)\) by its first order Taylor approximation at \(x=0\), \(\ln(1-x) \approx -x\), giving the quadratic \(\gamma_2(x) := x^2\). Using the second order approximation \(\ln(1-x) \approx -x-x^2/2\) results in the cubic \(\gamma(x) = (x^2+x^3)/2\). This version again suffers from too large buckets, so in practice we use \(\gamma_3(x) := \frac{255}{2^8}\cdot (x^2+x^3)/2 + \frac{1}{2^8}\cdot x\). We also test the trivial \(\gamma_1(x):=x\).
These values can all be computed efficiently by using that the input and output of \(\gamma\) are 64-bit unsigned integers representing a fraction of \(2^{64}\), so that e.g. \(x^2\) can be computed as the upper 64-bits of the widening \(64\times64\to 128\) bit product \(x\cdot x\).
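For illustration, fixed-point versions of \(\gamma_2\) and \(\gamma_3\) could be written as follows (a sketch; the exact constants and rounding in the implementation may differ):

```rust
/// Interpret u64 values as fractions of 2^64 and return the high half of a product.
fn mul_frac(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) >> 64) as u64
}

/// gamma_2(x) = x^2 on fractions of 2^64.
fn gamma2(x: u64) -> u64 {
    mul_frac(x, x)
}

/// gamma_3(x) = (255/2^8) * (x^2 + x^3)/2 + (1/2^8) * x on fractions of 2^64.
fn gamma3(x: u64) -> u64 {
    let x2 = mul_frac(x, x);
    let x3 = mul_frac(x2, x);
    let cubic = (x2 >> 1) + (x3 >> 1); // (x^2 + x^3) / 2
    cubic - (cubic >> 8) + (x >> 8)    // mix in a 1/256 linear term
}
```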
3.5 Remapping using CacheLineEF
Like PTHash, PtrHash uses a parameter \(0<\alpha\leq 1\) to use a total of \(n'=n/\alpha\) slots, introducing \(n'-n\) additional free slots. As a result of the additional slots, some, say \(R\), of the keys will map to positions \(n\leq q_0<\dots< q_{R-1}< n'\), causing the perfect hash function to not be minimal.
Remapping. Since there are a total of \(n\) keys, this means there are exactly \(R\) empty slots (‘gaps’) left behind in \([n]\), say at positions \(L_0\) to \(L_{R-1}\). We remap the keys that map to positions \(\geq n\) to the empty slots at positions \(< n\) to obtain a minimal perfect hash function.
A simple way to store the remap is as a plain array \(\free\), such that \(\free[q_i-n] = L_i\). PTHash encodes this array using Elias-Fano coding (Elias 1974; Fano 1971), after setting undefined positions of \(\free\) equal to their predecessor. The benefit of a plain \(\free\) array is fast and cache-local lookups, whereas Elias-Fano coding provides a more compact encoding that typically requires multiple lookups to memory.
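With a plain array, the final minimal index is then obtained as in this small sketch:

```rust
/// Sketch: turn the perfect-hash slot into a minimal index using the remap table.
fn minimal_index(slot: usize, n: usize, free: &[u32]) -> usize {
    if slot < n {
        slot                    // already in [n]
    } else {
        free[slot - n] as usize // remap to an empty slot < n
    }
}
```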
CacheLineEF. We would like to answer each query by reading only a single cache line from memory. To do this, we use a method based on interleaving data. First, the list of non-decreasing \(\free\) positions is split into chunks of \(C=44\) values \(\{v_0, \dots, v_{43}\}\), with the last chunk possibly containing fewer values. We assume that values are at most 40 bits, and that the average stride in each block is not more than 500. Then, each chunk is encoded into 64 bytes that can be stored as single cache line, as shown in Figure 2.
We first split all values into their 8 low bits (\(v_i \bmod 2^8\)) and 32 high bits (\(\lfloor v_i/2^8\rfloor\)). Further, the high part is split into an offset (the high part of \(v_0\)) and the relative high part:
\begin{equation} v_i = 2^8\cdot\underbrace{\lfloor v_0/2^8\rfloor}_{\text{Offset}} + 2^8\cdot \underbrace{\left(\lfloor v_i/2^8\rfloor - \lfloor v_0/2^8\rfloor\right)}_{\text{Relative high part}} +\underbrace{(v_i\bmod 2^8)}_{\text{Low bits}}. \label{eq:clef} \end{equation}
This is stored as follows.
- First, the 32 bit offset \(\lfloor v_0/2^8\rfloor\) is stored.
- Then, the relative high parts are encoded into 128 bits. For each \(i\in[44]\), bit \(i + \lfloor v_i/2^8\rfloor - \lfloor v_0/2^8\rfloor\) is set to 1. Since the \(v_i\) are increasing, each \(i\) sets a distinct bit, for a total of 44 set bits.
- Lastly, the low 8 bits of each \(v_i\) are directly written to the 44 trailing bytes.
Figure 2: Overview of the CacheLineEF data structure.
Lookup. The value at position \(i\) is found by summing the terms of Equation
\ref{eq:clef}. The offset and low bits can be read directly.
The relative high part can be found as \(2^8\cdot(\select(i)-i)\), where \(\select(i)\) gives
the position of the \(i\)’th 1 bit in the 128-bit-encoded relative high parts. In practice, this can be implemented
efficiently using the PDEP
instruction provided by the BMI2 bit manipulation
instruction set (Pandey, Bender, and Johnson 2017).
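The following Rust sketch shows one possible layout and lookup for a single CacheLineEF block under the assumptions above (44 values of at most 40 bits, spanning at most roughly 21 500 per block). Field names are illustrative, and a portable select is shown in place of the PDEP-based one.

```rust
/// One CacheLineEF block: 44 non-decreasing values in 64 bytes (sketch).
#[repr(C, align(64))]
struct CacheLineEf {
    low: [u8; 44],  // v_i mod 2^8
    offset: u32,    // floor(v_0 / 2^8)
    high: [u64; 2], // 128-bit mask: bit i + floor(v_i/2^8) - floor(v_0/2^8) is set
}

impl CacheLineEf {
    fn new(values: &[u64; 44]) -> Self {
        let offset = (values[0] >> 8) as u32;
        let (mut high, mut low) = ([0u64; 2], [0u8; 44]);
        for (i, &v) in values.iter().enumerate() {
            let rel = i as u64 + (v >> 8) - offset as u64; // a distinct bit per i
            high[(rel / 64) as usize] |= 1u64 << (rel % 64);
            low[i] = (v & 0xff) as u8;
        }
        CacheLineEf { low, offset, high }
    }

    fn get(&self, i: usize) -> u64 {
        let ones0 = self.high[0].count_ones() as usize;
        // select(i): position of the i'th set bit in the 128-bit mask.
        let sel = if i < ones0 {
            select_in_word(self.high[0], i)
        } else {
            64 + select_in_word(self.high[1], i - ones0)
        };
        ((self.offset as u64 + (sel - i) as u64) << 8) | self.low[i] as u64
    }
}

/// Portable select-in-word; the real code can use the PDEP instruction instead.
fn select_in_word(mut word: u64, mut rank: usize) -> usize {
    let mut pos = 0;
    loop {
        let step = word.trailing_zeros() as usize;
        pos += step;
        if rank == 0 { return pos; }
        word >>= step + 1;
        pos += 1;
        rank -= 1;
    }
}
```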
Limitations. CacheLineEF uses \(64/44\cdot 8 = 11.6\) bits per value, which is more than the usual Elias-Fano, which for example takes \(8+2=10\) bits per value for data with an average stride of \(2^8\). Furthermore, values are limited to \(40\) bits, covering \(10^{12}\) items. The range could be increased to \(48\) bit numbers by storing \(5\) bytes of the offset, but this has not been necessary so far. Lastly, each CacheLineEF can only span a range of around \((128-44)\cdot 2^8 = 21\ 504\), or an average stride of \(500\). This means that for PtrHash, we only use CacheLineEF when \(\alpha\leq 0.99\), so that the average distance between empty slots is \(100\) and the average stride of \(500\) is not exceeded in practice. When \(\alpha > 0.99\), a simple plain array can be used instead without much overhead.
Comparison. Compared to Elias-Fano coding, CacheLineEF stores the low order bits as exactly a single byte, removing the need for unaligned reads. Further, the select data structure on the high-order bits is replaced by a few local bit-wise operations. CacheLineEF is also somewhat similar to the (Uniform) Partitioned Elias-Fano Index of Ottaviano and Venturini (2014), in that both split the data. The uniform partitioned index also uses fixed part sizes, but encodes them with variable widths, and adds a second level of EF to encode the part offsets. Instead, CacheLineEF prefers simplicity and uses fixed part sizes with a constant width encoding and simply stores the offsets directly.
4 Results
We now evaluate PtrHash construction and query throughput for different parameters, and compare PtrHash to other minimal perfect hash functions. All experiments are run on an Intel Core i7-10750H CPU with 6 cores and hyper-threading disabled. The frequency is pinned to 2.6 GHz. Cache sizes are 32 KiB L1 and 256 KiB L2 per core, and 12 MiB shared L3 cache. Main memory is 64 GiB DDR4 at 3200 MHz, split over two 32 GiB banks.
In 4.1, we compare the effect of various parameters and configurations on the size, construction speed, and query speed of PtrHash. In Section 4.2, we compare PtrHash to other methods.
Additionally, in Appendix 6.2 we investigate query throughput with batching and when using multiple threads. In Appendix 7.1 we compare various sharding methods, and lastly in 8 we compare the effect of different input types and hash functions on query throughput.
4.1 Construction
The construction experiments use \(10^9\) random 64-bit integer keys, for which the data structure takes over 300 MB and thus is much larger than L3 cache. Unless otherwise mentioned, construction is in parallel using 6 cores. For the query throughput experiments, we also test on 20 million keys, for which the data structure takes around 6 MB and easily fits in L3 cache. To avoid the time needed for hashing keys, and since our motivating application is indexing $k$-mers that fit in 64 bits, we always use random 64-bit integer keys, and hash them using FxHash.
4.1.1 Bucket Functions
Figure 3: Bucket size distribution (red) and average number of evictions (black) per additionally placed bucket during construction of the pilot table, for different bucket assignment functions. Parameters are \(n=10^9\) keys, \(S=2^{18}\) slots per part, and \(\alpha=0.98\), and the red shaded load factor ranges from \(0\) to \(\alpha\). In the first five plots \(\lambda=3.5\) so that the pilots take \(2.29\) bits/key. For \(\lambda=4.0\) (bottom-right), the linear, skewed, and optimal bucket assignment functions cause endless evictions, and construction fails. The cubic function does work, resulting in \(2.0\) bits/key for the pilots.
In Figure 3, we compare the performance of different bucket assignment functions \(\gamma\) in terms of the bucket size distribution and the number of evictions for each additionally placed bucket. We see that the linear \(\gamma_1(x) = x\) has a lot of evictions for the last buckets of size \(3\) and \(2\), but like all methods it is fast for the last buckets of size \(1\) due to the load factor \(\alpha < 1\). The optimal distribution of PHOBIC performs only slightly better than the skewed one of FCH and PTHash, and can be seen to create more large buckets since the load factor increases fast for the first buckets. The cubic \(\gamma_3\) is clearly much better than all other functions, and is also tested with larger buckets of average size \(\lambda = 4\), where all other functions fail.
In the remainder, we will test the linear \(\gamma_1\) for simplicity and lookup speed, and the cubic \(\gamma_3\) for space efficiency.
4.1.2 Tuning Parameters for Construction
Figure 4: This plot shows the construction time (blue and red, left axis) and data structure size (black, green, and yellow, right axis) as a function of \(\lambda\) for \(n=10^9\) keys. Parallel construction time on 6 threads is shown for both the linear and cubic \(\gamma\), and for various values of \(\alpha\) (thickness). The curves stop because construction times out when \(\lambda\) is too large. For each \(\lambda\), the black line shows the space taken by the array of pilots. For larger \(\lambda\) there are fewer buckets, and hence the pilots take less space. The total size including the remap table is shown in green (plain vector) and yellow (CacheLineEF) for various \(\alpha\). The blue (fast), black (default), and red (compact) dots highlight the chosen parameter configurations.
In Figure 4 we compare the multi-threaded construction time and space usage of PtrHash on \(n=10^9\) keys for various parameters \(\gamma\in \{\gamma_1, \gamma_3\}\), \(2.7\leq \lambda\leq 4.2\), \(\alpha\in \{0.98, 0.99, 0.995, 0.998\}\), and plain remapping or CacheLineEF. We see that for fixed \(\gamma\) and \(\alpha\), the construction time appears to increase exponentially as \(\lambda\) increases, until it times out due to a never-ending chain of evictions. Load factors \(\alpha\) closer to \(1\) (thinner lines) achieve smaller overall data structure size, but take longer to construct and time out at smaller \(\lambda\). The cubic \(\gamma_3\) is faster to construct than the identity \(\gamma_1\) for small \(\lambda \leq 3.5\). Unlike \(\gamma_1\), it also scales to much larger \(\lambda\) up to \(4\), and thereby achieves significantly smaller overall size.
We note that for small \(\lambda\), construction time does converge to around 19 ns/key. A rough time breakdown is that for each key, 1 ns is spent on hashing, 5 ns on sorting all the keys, 12 ns to find pilots, and lastly 1 ns on remapping to empty slots.
Recommended parameters. Based on these results, we choose three sets of parameters for further evaluation, as indicated with blue, black, and red dots in Figure 4:
- Fast (blue), aiming for query speed: using the linear \(\gamma_1\), \(\lambda=3.0\), \(\alpha=0.99\), and a plain vector for remapping. Construction takes only just over 20 ns/key, close to the apparent lower bound, and space usage is 3 bits/key. This can be used when \(n\) is small, or more generally when memory usage is not a bottleneck.
- Compact (red), aiming for small space: using the cubic \(\gamma_3\), \(\lambda=4.0\), \(\alpha=0.99\), and CacheLineEF remapping. Construction now takes around 50 ns/key, but the data structure only uses 2.12 bits/key. In practice, this configuration sometimes ends up in endless eviction cycles and \(\lambda=3.9\) may be better.
- Default (black), a trade-off between fast construction and small space: using cubic \(\gamma_3\), \(\lambda=3.5\), and \(\alpha=0.99\), with CacheLineEF remapping.
4.1.3 Remap
| Parameters | Pilots (bits/key) | PHF query, loop (ns) | PHF query, stream (ns) | Remap | Remap (bits/key) | MPHF query, loop (ns) | MPHF query, stream (ns) |
|---|---|---|---|---|---|---|---|
| Fast | 2.67 | 11.6 | 8.6 | Vec<u32> | 0.33 | 12.7 | 8.9 |
| | | | | CacheLineEF | 0.12 | 12.1 | 8.9 |
| | | | | EF | 0.09 | 14.4 | 9.7 |
| Default | 2.29 | 17.6 | 7.9 | Vec<u32> | 0.33 | 20.0 | 8.6 |
| | | | | CacheLineEF | 0.12 | 21.0 | 8.7 |
| | | | | EF | 0.09 | 21.2 | 9.6 |
| Compact | 2.00 | 17.5 | 7.9 | Vec<u32> | 0.33 | 20.0 | 8.5 |
| | | | | CacheLineEF | 0.12 | 21.0 | 8.6 |
| | | | | EF | 0.09 | 21.3 | 9.5 |
In Table 2, we compare the space usage and query throughput of the different remap data structures for both the fast and compact parameters, for \(n=10^9\) keys. We observe that the overhead of CacheLineEF is \(2.75\times\) smaller than a plain vector, and only \(40\%\) larger than true Elias-Fano encoding.
The speed of non-minimal (PHF) queries that do not remap does not depend on the remap structure used.
For minimal (MPHF) queries with the for loop, EF is significantly slower (14.2 ns) with the fast parameters than the plain vector (12.5 ns), while CacheLineEF (12.9 ns) is only slightly slower. The difference is much smaller with the compact parameters, because the additional computations for the cubic \(\gamma_3\) reduce the number of iterations the processor can work ahead. When streaming queries, for both parameter choices CacheLineEF is less than 0.1 ns slower than the plain vector, while EF is 1 ns slower.
In the end, we choose CacheLineEF when using compact parameters, but prefer the simpler and slightly faster plain vector for fast parameters.
4.2 Comparison to Other Methods
| Approach | Configuration | Space (bits/key) | Construction (6 threads, ns/key) | Query (ns/query) |
|---|---|---|---|---|
| Brute-force | | | | |
| SIMDRecSplit | \(n{=}5\), \(b{=}5\) | 2.96 | 26 | 310 |
| SIMDRecSplit | \(n{=}8\), \(b{=}100\) | 1.81 | 66 | 258 |
| Bip. ShockHash-Flat | \(n{=}64\) | 1.62 | 2140* (357) | 201 |
| Consensus | \(k{=}256\), \(\varepsilon{=}0.10\) | 1.58 | 521* (87) | 565 |
| Consensus | \(k{=}512\), \(\varepsilon{=}0.03\) | 1.49 | 1199* (200) | 528 |
| Fingerprinting | | | | |
| FMPH | \(\gamma{=}2.0\) | 3.40 | 44 | 168 |
| FMPH | \(\gamma{=}1.0\) | 2.80 | 69 | 236 |
| FMPHGO | \(s{=}4\), \(b{=}16\), \(\gamma{=}2.0\) | 2.86 | 298 | 160 |
| FMPHGO | \(s{=}4\), \(b{=}16\), \(\gamma{=}1.0\) | 2.21 | 423 | 212 |
| FiPS | \(\gamma{=}2.0\) | 3.52 | 93* (16) | 109 |
| FiPS | \(\gamma{=}1.5\) | 3.12 | 109* (18) | 124 |
| Graphs | | | | |
| SicHash | \(p_1{=}0.21\), \(p_2{=}0.78\), \(\alpha{=}0.90\) | 2.41 | 48 | 149 |
| SicHash | \(p_1{=}0.45\), \(p_2{=}0.31\), \(\alpha{=}0.97\) | 2.08 | 63 | 141 |
| Bucket placement | | | | |
| CHD | \(\lambda{=}3.0\) | 2.27 | 1059* (177) | 542 |
| PTHash | \(\lambda{=}4.0\), \(\alpha{=}0.99\), C-C | 3.19 | 403 | 77 |
| + HEM | | | 173 | |
| PTHash | \(\lambda{=}5.0\), \(\alpha{=}0.99\), EF | 2.17 | 765 | 156 |
| + HEM | | | 323 | |
| PHOBIC | \(\lambda{=}3.9\), \(\alpha{=}1.0\), IC-C | 4.14 | 62 | 116 |
| PHOBIC | \(\lambda{=}4.5\), \(\alpha{=}1.0\), IC-R | 2.34 | 80 | 179 |
| PHOBIC | \(\lambda{=}6.5\), \(\alpha{=}1.0\), IC-R | 1.94 | 215 | 163 |
| PHOBIC | \(\lambda{=}6.5\), \(\alpha{=}1.0\), IC-C | 2.44 | 220 | 108 |
| PHOBIC | \(\lambda{=}7.0\), \(\alpha{=}1.0\), IC-R | 1.86 | 446 | 157 |
| PtrHash, fast | \(\lambda{=}3.0\), \(\alpha{=}0.99\), Vec | 2.99 | 26 | 38 |
| + streaming | | | | 20 |
| PtrHash, default | \(\lambda{=}3.5\), \(\alpha{=}0.99\), CLEF | 2.40 | 32 | 44 |
| + streaming | | | | 25 |
| PtrHash, compact | \(\lambda{=}4.0\), \(\alpha{=}0.99\), CLEF | 2.12 | 62 | 42 |
| + streaming | | | | 23 |
In Table 3 we compare the performance of PtrHash against other methods on short, random strings. In particular, we compare against methods and configurations that are reasonably fast to construct: SIMDRecSplit (Esposito, Graf, and Vigna 2020; Bez et al. 2023), Bipartite ShockHash-Flat (Lehmann, Sanders, and Walzer 2024, 2023a), which is able to use relatively little space with fast construction time, Consensus-RecSplit (Lehmann et al. 2025), FMPH and FMPHGO (Beling 2023), FiPS (Lehmann 2024), SicHash (Lehmann, Sanders, and Walzer 2023b), CHD (Belazzougui, Botelho, and Dietzfelbinger 2009), PTHash (Pibiri and Trani 2021, 2024), and PHOBIC (Hermann et al. 2024a). The specific parameters are based on Table 1 of (Hermann et al. 2024a), Table 8.1 of (Lehmann 2024), and Table 3 of (Beling 2023). These results were obtained using the excellent MPHF-Experiments library (Lehmann 2025) by Hans-Peter Lehmann. Construction is done on 6 threads in parallel when supported. By default, the framework queries one key at a time. For PtrHash with streaming queries, we modified this to query all keys at once.
Input. The input is 300 million random strings of random length between 10 and 50 characters. This input size is such that the MPHF data structures take around 75 MB, which is much larger than the 12 MB L3 cache.
PtrHash. As expected, the space usage of PtrHash matches the numbers of Table 2. In general, PtrHash can be slightly larger due to rounding in the number of parts and slots per part, but for large inputs like here this effect is small. Construction times per key are slightly slower than predicted by Figure 4, while we might expect slightly faster construction due to the lower number of keys. Likely, the slowdown is caused by hashing the input strings. The hashing of input strings has a much worse effect on query throughput. In Figure 6, we obtained query throughput of 12 ns and 18 ns for the fast and compact configurations when looping, and as low as 8 ns when streaming queries. With string inputs, these numbers more than double to 38 ns resp. 42 ns when looping, and 20 ns when streaming. A similar effect can be seen when comparing Tables 3 and 4 of (Beling 2023). Appendix 8 further investigates this.
Speed. We observe that PtrHash with fast parameters is the fastest to construct alongside SIMDRecSplit (26 ns/key) and FiPS (16 ns/key, assuming optimal scaling to 6 threads), resulting in around 3 bits/key for all three methods. However, query throughput of PtrHash is \(8\times\) (SIMDRecSplit) resp. \(2.8\times\) (FiPS) faster, going up to \(15\times\) resp. \(5\times\) faster when streaming all queries at once. Compared to the next-fastest method to query, PTHash-CC (HEM), PtrHash is twice as fast to query (or nearly \(4\times\) when streaming), is \(6.5\times\) faster to build, and even slightly smaller.
With default parameters, PtrHash is \(1.75\times\) faster to query than the fastest configuration of PTHash, and \(3.1\times\) faster when using streaming, while being over \(5\times\) faster to construct.
Indeed, the speedup in query speed is explained by the fact that only a single memory access is needed for most queries (compared to \(\geq 2\) for PTHash-HEM and PHOBIC), and generally by the fact that the code for querying is short.
Space. PtrHash with the fast parameters is larger (2.99 bits/key) than some other methods, but compensates by being significantly faster to construct and/or query. When space is of importance, the compact version can be used (2.12 bits/key). This takes \(2.4\times\) longer to build at 62 ns/key, and has only slightly slower queries. Compared to methods that are smaller, PtrHash is over \(3\times\) faster to build than PHOBIC. Consensus, SIMDRecSplit, and SicHash achieve smaller space of 1.58, 1.81 and 2.08 bits/key in comparable time (63-87 ns/key), but again are at least \(3\times\) slower to query, or over \(6\times\) compared to streaming queries.
5 Conclusions and Future Work
We have introduced PtrHash, a minimal perfect hash function that builds on PTHash and PHOBIC. Its main novelty is the use of fixed-width 8-bit pilots that simplify queries. To make this possible, we use hash-evict, similar to cuckoo hashing: when there is no pilot that leads to a collision-free placement of the corresponding keys, some other buckets are evicted and must search for a new pilot value.
The result is an MPHF with queries that are twice as fast (38 ns/key) as any other method (at least 77 ns/key) for datasets larger than L3 cache. Further, due to its simplicity, queries can be processed in streaming fashion, giving another twofold speedup (20 ns/key). At this point, the hashing of string inputs becomes a bottleneck. For integer keys, such as $k$-mers, much higher throughput of up to 8 ns/key can be obtained, fully saturating the RAM throughput of each core, or, when using multiple cores, even saturating the main memory (2.5 ns/key).
Future work. A theoretical analysis of our method is currently missing. While the hash-evict strategy works well in practice, we currently have no relation between the bucket size \(\lambda\), load factor \(\alpha\), and the number of evicts arising during construction. Such an analysis could help to better understand the optimal bucket assignment function, like PHOBIC (Hermann et al. 2024a) did for the case without eviction.
Second, the size of pilots could possibly be improved by further parameter tuning. In particular we use 8-bit pilots, while slightly fewer or more bits may lead to smaller data structures. An experiment with 4-bit pilots was not promising, however.
Lastly, to further improve the throughput, we suggest that more attention is given to the exact input format. As already seen, hashing all queries at once can provide significant performance gains via prefetching. For string input specifically, it is more efficient when the strings are consecutively packed in memory rather than separately allocated, and it might be more efficient to explicitly hash multiple strings in parallel. More generally, applications should investigate whether they can be rewritten to take advantage of streaming queries.
Acknowledgements
First, I thank Giulio Ermanno Pibiri for his ongoing feedback in various stages of this project. Further, I thank Sebastiano Vigna for feedback from trying to construct PtrHash on \(10^{12}\) keys and integrating $ε$-serde, and lastly I thank Hans-Peter Lehmann for feedback on the text.
Funding
This work was supported by ETH Research Grant ETH-1721-1 to Gunnar Rätsch.
6 Appendix: Query throughput
6.1 Batching and streaming
Throughput. In practice in bioinformatics applications such as SSHash, we expect many independent queries to the MPHF. This means that queries can be answered in parallel, instead of one by one. Thus, we should optimize for query throughput rather than individual query latency. We report throughput as inverse throughput in amortized nanoseconds per query, rather than the usual queries per second.
Out-of-order execution. An MPHF on \(10^9\) keys requires memory at least \(1.5\mathrm{bits}/\mathrm{key} \cdot 10^9 \mathrm{keys} = 188\) MB, which is much larger than the L3 cache of size around 16 MB. Thus, most queries require reading a pilot from main memory (RAM), which usually has a latency around 80 ns. Nevertheless, existing MPHFs such as FCH (Fox, Chen, and Heath 1992) achieve an inverse throughput as low as 35 ns/query on such a dataset (Pibiri and Trani 2021). This is achieved by pipelining and the reorder buffer. For example, Intel Skylake CPUs can execute over 200 instructions ahead while waiting for memory to become available (Wong 2013; Downs 2017). This allows the CPU to already start processing ‘future’ queries and fetch the required cache lines from RAM while waiting for the current query. Thus, when each iteration requires less than 100 instructions and there are no branch-misses, this effectively makes up to two reads in parallel. A large part of speeding up queries is then to reduce the length of each iteration so that out-of-order execution can fetch memory more iterations ahead.
Prefetching. Instead of relying on the CPU hardware to parallelize requests to memory, we can also explicitly prefetch3 cache lines from our code. Each prefetch requires a line fill buffer to store the result before it is copied into the L1 cache. Skylake has 12 line fill buffers (Downs 2018), and hence can support up to 12 parallel reads from memory. In theory, this gives a maximal random memory throughput around $80/12 = 6.67$ ns per read from memory, but in practice experiments show that the limit is 7.4 ns per read. Thus, our goal is to achieve a query throughput of 7.4 ns.
We consider two models to implement prefetching: batching and streaming.
Figure 5: Simplified schematic of in-progress reads from main memory (RAM) when using two different prefetching approaches processing (up to) \(8\) reads in parallel. Each horizontal line indicates the duration a read is in progress, from the moment it is prefetched (left vertical bar) to the moment it is available in L1 cache and its corresponding line fill buffer is free again (right vertical bar). Streaming (right) provides better parallelism than batching (left).
Batching. In this approach, the queries are split into batches (chunks) of size \(B\), and are then processed one batch at a time (Figure 5, left). In each batch, two passes are made over all keys. In the first pass, each key is hashed, its bucket is determined, and the cache line containing the corresponding pilot is prefetched. In the second pass, the hashes are iterated again, and the corresponding slots are computed.
Streaming. A drawback of batching is that at the start and end of each batch, the memory bandwidth is not fully saturated. Streaming fixes this by prefetching the cache line for the pilot \(B\) iterations ahead of the current one, and is able to sustain the maximum possible number of parallel prefetches throughout, apart from at the very start and end (Figure 5, right).
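A minimal sketch of the streaming approach is shown below (x86-64 only); bucket_of and slot_of stand for the two halves of the query described in 3.1, and the lookahead of 32 matches the value chosen in 6.2. Names are illustrative, not the library API.

```rust
/// Sketch: answer a stream of queries, prefetching the pilot needed
/// LOOKAHEAD iterations from now (see the footnote on prefetcht0).
#[cfg(target_arch = "x86_64")]
fn query_stream(
    pilots: &[u8],
    keys: &[u64],
    bucket_of: impl Fn(u64) -> usize,   // key -> global bucket index
    slot_of: impl Fn(u64, u8) -> usize, // (key, pilot) -> slot
) -> Vec<usize> {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    const LOOKAHEAD: usize = 32;
    let mut out = Vec::with_capacity(keys.len());
    for i in 0..keys.len() {
        if i + LOOKAHEAD < keys.len() {
            // Prefetch the cache line holding a pilot we will need later.
            let b = bucket_of(keys[i + LOOKAHEAD]);
            unsafe { _mm_prefetch::<_MM_HINT_T0>(pilots.as_ptr().add(b) as *const i8) };
        }
        let b = bucket_of(keys[i]);
        out.push(slot_of(keys[i], pilots[b]));
    }
    out
}
```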
6.2 Evaluation
A note on benchmarking throughput.
To our knowledge, all recent papers on (minimal) perfect hashing measure query
speed by first creating a list of keys, and then querying all keys in the list,
as in for key in keys { ptr_hash.query(key); }. One might think this measures the average
latency of a query, but that is not the case, as the CPU will execute
instructions from adjacent iterations at the same time.
Indeed, as can be seen in Table 2, this loop can be as fast as 12
ns/key for \(n=10^9\), which is over 6 times faster than the RAM latency of
around 80 ns (for an input of size 300 MB),
and thus, at least 6 iterations are being processed in parallel.
Hence, we argue that existing benchmarks measure (and optimize for)
throughput and that they assume that the list of keys to query is known in advance.
We make this assumption explicit by changing the API to benchmark all queries at
once, as in ptr_hash.query_all(keys). This way, we can explicitly process
multiple queries in parallel.
We also argue that properly optimizing for throughput is relevant for applications. SSHash, for example, queries all minimizers of a DNA sequence, which can be done by first computing and storing those minimizers, followed by querying them all at once.
We now explore the effect of the batch size and number of parallel threads on query throughput.
Batching and Streaming.
Figure 6: Query throughput of prefetching via batching (dotted) and streaming (dashed) with various batch/lookahead sizes, compared to a plain for loop (solid), for \(n=20\cdot 10^6\) (left) and \(n=10^9\) (right) keys. Blue shows the results for the fast parameters, and red for the compact parameters. Default parameters give performance nearly identical to the compact parameters, since the main differentiating factor is the use of \(\gamma_1\) versus \(\gamma_3\). All times are measured over a total of \(10^9\) queries, and for (non-minimal) perfect hashing only, without remapping.
In Figure 6, we compare the query throughput of a simple for loop with the batching and streaming variants with various batch/lookahead sizes. We see that both for small \(n=20\cdot 10^6\) and large \(n=10^9\), the fast parameters yield higher throughput than the compact parameters when using a for loop. This is because of the overhead of computing \(\gamma_3(x)\). For small \(n\), batching and streaming do not provide much benefit, indicating that memory latency is not a bottleneck. However, for large \(n\), both batching and streaming improve over the plain for loop. As expected, streaming is faster than batching here. For streaming, throughput saturates when prefetching around 16 iterations ahead. At this point, memory throughput is the bottleneck, and the difference between the compact and fast parameters disappears. In fact, compact parameters with \(\gamma_3\) are slightly faster. This is because \(\gamma_3\) has a more skew distribution of bucket sizes with more large buckets. When the pilots for these large buckets are cached, they are more likely to be hit by subsequent queries, and hence avoid some accesses to main memory.
For further experiments we choose streaming over batching, and use a lookahead of 32 iterations. The final throughput of 8 ns per query is very close to the optimal throughput of 7.4 ns per random memory read.
6.3 Multi-threaded Throughput
Figure 7: In this plot we compare the throughput of a for loop (solid) versus streaming (dashed) for multiple threads, for both non-minimal (dimmed) and minimal (bright) perfect hashing. The left shows results for \(n=20\cdot 10^6\), and the right shows results for \(n=10^9\). In blue the results for the fast parameters with \(\gamma_1\), and in red the results for compact parameters with \(\gamma_3\), which performs nearly identically to the default parameters. On the right, the black lines show the maximum throughput based on 7.4 ns per random memory access per thread, and the maximum throughput based on the total memory bandwidth of 25.6 GB/s.
In Figure 7 we compare the throughput of the fast and compact parameters for multiple threads. When \(n=20\cdot 10^6\) is small and the entire data structure fits in L3 cache, the scaling to multiple threads is nearly perfect. As expected, minimal perfect hashing (bright) tends to be slightly slower than perfect hashing (dimmed), but the difference is small. The fast \(\gamma_1\) is faster than the compact \(\gamma_3\), and streaming provides only a small benefit over a for loop. For large \(n=10^9\), all methods converge towards the limit imposed by the full RAM throughput of 25.6 GB/s. Streaming variants hit this starting at around 4 threads, and remain faster than the for loop. As before, the compact version is slightly faster because of its more efficient use of the caches, and is even slightly better than the maximum throughput of random reads to RAM. Minimal perfect hashing is only slightly slower than perfect hashing.
7 Appendix: Sharding
When the number of keys is large, say over \(10^{10}\), their 64-bit (or 128-bit) hashes may not all fit in memory at the same time, even though the final PtrHash data structure (the list of pilots) would fit. Thus, we can not simply sort all hashes in memory to partition them. Instead, we split the set of all \(n\) hashes into, say \(s=\lceil n/2^{32}\rceil\) shards of \(\approx 2^{32}\) elements each, where the \(i\)’th shard corresponds to hash values in \(s_i:=[2^{64}\cdot i/s, 2^{64}\cdot (i+1)/s)\). Then, shards are processed one at a time. The hashes in each shard are sorted and split into parts, after which the parts are constructed as usual. This way, the shards only play a role during construction, and the final constructed data structure is independent of which sharding strategy was used.
In-memory sharding. The first approach to sharding is to iterate over the set of keys \(s\) times. In the \(i\)’th iteration, all keys are hashed, and only those hashes in the corresponding interval \(s_i\) are stored and processed. This way, no disk space is needed for construction.
On-disk sharding. A drawback of the first approach is that keys are potentially hashed many times. This can be avoided by writing hashes to disk. Specifically, we can create one file per shard and append hashes to their corresponding file. These files are then read and processed one by one.
Hybrid sharding. A hybrid of the two approaches above only requires disk space for \(D<s\) shards. This iterates and hashes the keys \(\lceil s/D\rceil\) times, and in each iteration writes hashes for \(D\) shards to disk. Those are then processed one by one as before.
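A rough sketch of the hybrid strategy is given below; file handling is simplified, and the shard of a hash \(h\) is computed as \(\lfloor s\cdot h/2^{64}\rfloor\), mirroring the part computation.

```rust
use std::io::{BufWriter, Write};

/// Hybrid sharding sketch: s shards in total, at most d shard files on disk at once.
fn hybrid_shards(keys: &[u64], s: u64, d: u64, hash: impl Fn(u64) -> u64) -> std::io::Result<()> {
    let mut group_start = 0;
    while group_start < s {
        let group_end = (group_start + d).min(s);
        let mut files: Vec<_> = (group_start..group_end)
            .map(|i| std::fs::File::create(format!("shard_{i}.bin")).map(BufWriter::new))
            .collect::<Result<_, _>>()?;
        // One pass over all keys: keep only hashes falling in the current group of shards.
        for &k in keys {
            let h = hash(k);
            let shard = ((h as u128 * s as u128) >> 64) as u64; // floor(s * h / 2^64)
            if (group_start..group_end).contains(&shard) {
                files[(shard - group_start) as usize].write_all(&h.to_le_bytes())?;
            }
        }
        // ... now read each of these files back, sort, split into parts, search pilots ...
        group_start = group_end;
    }
    Ok(())
}
```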
On-disk PtrHash. When the number of keys is so large that even the pilots do not fit in memory, they can also be stored to disk and read on-demand while querying. This is supported using $ε$-serde (Vigna and Fontana 2024; Fontana, Vigna, and Zacchiroli 2024).
7.1 Evaluation
TODO: update results
We tested the in-memory and hybrid sharding by constructing PtrHash with default parameters on \(5\cdot 10^{10}\) random integer keys on a laptop with only 64 GB of memory, using 6 cores in parallel. All 64-bit hashes would take 400 GB, so we use 24 shards of around \(2^{31}\) keys, that each take 16 GB. The final data structure takes 2.17 bits/key, or 13.6 GB in total, and the peak memory usage is around 50 GB.
The in-memory strategy iterates through and hashes the integer keys 24 times, and takes 3996 seconds in total or 166s per shard. Of this, 65s (39%) is spent on hashing the keys, 15s (9%) is spent sorting hashes into buckets, and 82s (49%) is spent searching for pilots.
The hybrid strategy is allowed to use up to 128 GB of disk space, and thus writes hashes to disk in 3 batches of 8 shards at a time. This brings the total time down to 3314s (17% faster), and uses 138s per shard. Of this, 24s is spent writing hashes to disk, and 21s is spent reading hashes from disk, which together is faster than the 65s that was previously spent on hashing all keys.
8 Appendix: Evaluating Hash Functions
| Input | Loop, FxHash | Loop, XXH3-64 | Loop, XXH3-128 | Stream, FxHash | Stream, XXH3-64 | Stream, XXH3-128 |
|---|---|---|---|---|---|---|
| u64 | 11.1 | 24.4 | 29.9 | 7.2 | 9.1 | 10.5 |
| Box<u64> | 12.7 | 30.1 | 31.2 | 8.7 | 11.1 | 12.4 |
| &[u8; 10] | 19.4 | 27.7 | 32.9 | 10.1 | 12.5 | 14.2 |
| &[u8; 50] | 34.1 | 28.6 | 32.8 | 16.5 | 12.7 | 14.1 |
| &[u8] | 39.2 | 37.0 | 50.9 | 27.2 | 17.8 | 23.1 |
| Vec<u8> | 40.2 | 40.6 | 52.7 | 28.3 | 20.2 | 25.3 |
In Table 4, we compare the throughput of various hash functions on different types of inputs, both when iterating and streaming through queries. The hash functions being compared are 64-bit FxHash (Breeden, n.d.), and the 64-bit and 128-bit variants of XxHash (XXH3 specifically). We test various inputs, of increasing complexity:
- plain integers keys, the easy case;
- allocated integer keys, to measure the overhead of the memory indirection;
- short back-to-back packed fixed-length strings, to measure the overhead of string hashing;
- longer packed fixed-length strings, to measure the overhead of iterating over the string characters;
- variable-length packed strings, to measure the overhead of branch-mispredictions;
- allocated variable-length strings, to again measure pointer indirection overhead.
The string slices are all packed back-to-back into a single large allocation, so
that their contents can be efficiently prefetched. The Vec<u8>
version on the
other hand uses the default allocator, which may or may not put things close to
each other and/or in order.
Results. Both when iterating and when streaming queries, FxHashing plain integers is fastest, as expected. Both XxHash variants are quite a bit slower, especially when using plain iteration. On boxed integers, most methods become around 1 ns/key slower. FxHash maintains this speed advantage for short strings, but for longer strings XXH3-64 becomes faster. In fact, XXH3 is only marginally slower for 50 byte strings than for boxed integers, which is quite impressive!
When moving on to variable-length strings, all methods take quite a hit of at least 5 ns/key due to the branch-mispredictions. XXH3-64 remains fastest, but FxHash is nearly as fast when looping over queries. The same is true when moving to allocated strings, which is again around 2 ns/key slower.
Overall, FxHash is the best hash to use for integer keys, and XXH3-64 is a good choice for strings. XXH3-128 is slower and should only be used when really needed. Hashing is slightly faster when the keys are back to back in memory.
9 TODO
- update sharding results
- make into latex
- check double equation
References
The PT in PTHash stand for Pilot Table. The author of the present paper mistakenly understood it to stand for Pibiri and Trani, the authors of the PTHash paper. Due to the current author’s unconventional last name, and PTGK not sounding great, the first initial (R) was appended instead. As things go, nothing is as permanent as a temporary name. Furthermore, we follow the Google style guide and avoid a long run of uppercase letters, and write PtrHash instead of PTRHash. ↩︎
We would have preferred to call this method hash-displace, as displace is the term used instead of evict in e.g. the cuckoo filter Fan et al. (2014). Unfortunately, hash and displace is already taken by hash-and-displace Pagh (1999; Belazzougui, Botelho, and Dietzfelbinger 2009). ↩︎
There are typically multiple types of prefetching instructions that prefetch into a different level of the cache hierarchy. We prefetch into all levels of cache using prefetcht0. ↩︎