- Abstract
- 1 Introduction
- 1.1 Motivation
- 1.2 Contributions
- 2 Background
- 2.1 Single large seed.
- 2.2 Other work
- 2.3 Recsplit?
- 2.4 FCH
- 2.5 PTHash
- 2.6 PTHash-HEM
- 2.7 PHOBIC
- 2.8 EF coding
- 2.8.1 Uniform Partitioned EF
- 2.9 Cachelines
- 3 PtrHash
- 4 Results
- 4.1 Construction
- 4.1.1 Bucket functions
- 4.1.2 Construction and size
- 4.1.3 Remap
- 4.1.4 Sharding
- 4.2 Query throughput
- 4.2.1 A note on benchmarking query throughput
- 4.2.2 Batch size
- 4.2.3 Throughput
- 4.3 Comparison with competitors
- 4.1 Construction
- 5 Conclusion and future work
- 6 Acknowledgements
- 7 Appendix
- 8 DONE Failed ideas
This is the work-in-progress paper on PtrHash. The original blog post on its development is here.
Abstract
Motivation. Given a set \(K\) of \(n\) keys, a minimal perfect hash function (MPHF) is a collision-free bijective map \(f\) from \(K\) to \(\{0, \dots, n-1\}\). These functions have uses in databases, search engines, and are used in bioinformatics indexing tools such as Pufferfish (using BBHash), and Piscem (both using PTHash). Our main motivating usage is that as in SSHash, where PTHash only takes \(5\%\) of the total space. Thus, this work presents a (minimal) perfect hash function that first prioritizes query throughput, while also allowing efficient construction for \(10^9\) or more elements using under \(3\) bits of memory per key.
Contributions. PtrHash builds on PTHash by 1) partitioning the table for faster construction, and 2) using cuckoo hashing to find \(8\) bit pilots, which 3) makes further compression redundant. We further speed up queries by 4) simplifying hash functions and modulo operations, and 5) streaming multiple queries in parallel and prefetching memory reads.
Results. (TODO: update numbers. TODO: add something about speed of memory) We implemented PtrHash in Rust and show that at \(3\) bits/key memory usage it is \(3\times\) faster to construct than PTHash while achieving $8$ns/key query throughput, \(5\times\) faster than state-of-the-art methods.
Source: github.com/RagnarGrootKoerkamp/ptrhash
1 Introduction
- Perfect hashing
- Minimal perfect hashing
- Application in bioinformatics (SSHash)
- Focus on query speed/throughput;
- Only a single RAM access per query, ideally
- Instead of one \(n \log_2(e)\) bit number, search for many 8-bit numbers
1.1 Motivation
SSHash, throughput first
1.2 Contributions
We introduce PtrHash, a minimal perfect hash function that is primarily optimized for query throughput and construction speed, at the cost of some more memory usage. Its main novelties are:
- the use of fixed-width \(8\) bit pilots, leading to very simple lookups;
- cuckoo-hashing-like based construction
- construction in parts, with a single remap table, leading to fast construction;
2 Background
- GB vs GiB
2.1 Single large seed.
2.2 Other work
- sichash (Lehmann, Sanders, and Walzer 2023b)
- shockhash (Lehmann, Sanders, and Walzer 2024)
- bipartite shockhash (Lehmann, Sanders, and Walzer 2023a)
2.3 Recsplit?
2.4 FCH
2.5 PTHash
2.6 PTHash-HEM
2.7 PHOBIC
2.8 EF coding
2.8.1 Uniform Partitioned EF
2.9 Cachelines
3 PtrHash
The core principle of PtrHash1 is to simplify PTHash to speed up both query speed and construction time, at the cost of using slightly more memory. We first give a high level overview of PtrHash (3.1). Then, we highlight the differences compared to PTHash and PHOBIC (3.2), after which we explain specific parts of PtrHash in more detail.
3.1 Overview
Before going into details, we first briefly explain the fully constructed PtrHash data structure and how to query it, see Figure 1 and Code Snippet 1.
The input is a set of \(n\) keys \(\{k₀, ̣\dots, k_{n-1}\}\) that we want to hash to \(n\) slots \([n]:=\{0, \dots, n-1\}\). These keys are hashed using a hash function \(h\) into \(\{h(k_0), \dots, h(k_{n-1})\}\). The total space of hashes \([2^{64}]\) is equally partitioned into \(P\) parts. Further, each part is partitioned into \(B\) buckets. Each key \(k_i\) is uniformly mapped to a part using the ‘fast range’ method (Lemire 2019) that interprets \(h(k_i)/2^{64}\) as a value in \([0, 1)\). Then, its relative position inside the part, \(x\), is passed through a bucket assignment function \(\gamma: [0,1)\mapsto[0,1)\) such as \(x\mapsto x\) or \(x\mapsto x^2\) that controls the distribution of expected bucket sizes, as explained in detail in 3.5. The result is then scaled to a bucket index in \([B]\):
\begin{align} part(k_i) &:= \left\lfloor P\cdot h(k_i) / 2^{64}\right\rfloor,\\ x &:= \big((P\cdot h(k_i)) \bmod 2^{64}\big)/2^{64},\\ bucket(k_i) &:= \left\lfloor B\cdot \gamma(x)\right\rfloor. \end{align}
Now, the goal is to map the \(n/P\) expected keys in each part to \(S\approx (n/P)/\alpha\) slots, where \(\alpha\approx 0.98\) gives us \(\approx 2\%\) extra slots to play with. The main data structure to achieve this mapping is a list of $8$-bit pilots \(\{p_0, \dots, p_{P\cdot B-1}\}\), one for each bucket. The pilots control to which slot the keys in each bucket map. Specifically, key \(k_i\) in bucket \(b=bucket(k_i)\) with pilot \(p_b\) maps to slot
\begin{equation} slot(k_i) := part(k_i) \cdot S + reduce(h(k_i) \oplus h_p(p_b), S),\label{eq:slot} \end{equation}
where \(reduce(x, S)\) maps the random \(64\) bit integer to \([S]\) as explained below.
Hash functions. Starting simple, the pilots \(p_b\) are hashed into pseudo-random \(64\) integers by using FxHash for \(h_p\), which simply multiplies the pilot with a mixing constant \(C\).
When the keys are \(64\) bit integers we use this same FxHash algorithm to hash them, since multiplying by an odd constant is invertible modulo \(2^{64}\) and hence a perfect hash. For other keys, the hash function depends on the number of elements. When the number of elements is not too far above \(10^9\), the probability of hash collisions with a \(64\) bit hash function is sufficiently small, and we use the \(64\) bit variant of xxHash. When the number of keys goes beyond \(2^{32} \approx 4\cdot 10^9\), the probability of \(64\) bit hash collisions increases, and collisions may be found after sorting the hashes. In this case, the \(128\) bit variant of xxHash is used, where the high \(64\) bits determine the part and bucket, and the low \(64\) bits are used in Equation \ref{eq:slot} to determine the slot.
The reduce function. When \(64\) bit hashes are used, we must ensure that the full entropy of the hash is used. A simple choice would be \(reduce(x,S) = x\bmod S\), which uses all bits when \(S\) is not a power of \(2\) and takes two multiplications using ‘fast mod’ (Lemire, Kaser, and Kurz 2019). Instead, we take \(S=2^s\), so that \(x\bmod 2^s\) is a simple bit-mask. Unfortunately, this only uses the lower \(s\) bits of the hash, while the \(part\) and \(bucket\) functions use the high \(\log_2(P\cdot B)\) bits, leaving some entropy in the middle bits unused.
As a solution, we first multiply \(x\) by the mixing constant \(C\), and then take the low \(s\) bits of the high half. This uses the full entropy of the input and takes only a single multiplication, giving around \(5\%\) query speedup over fast mod.
\begin{equation} reduce(x, 2^s) := \left\lfloor C\cdot x/2^{64}\right\rfloor \bmod 2^s \end{equation}
Remapping. Since each part has slightly more slots than keys, some keys will map to an index \(\geq n\), leading to a non-minimal perfect hash function. To fix this, those are remapped back into the ‘gaps’ left behind in slots \(<n\), which is explained in detail in 3.4.
Construction. The main difficulty of PtrHash is during construction (4.1), where we must find values of the pilots \(p_j\) such that all keys indeed map to different slots. Like other methods, PtrHash processes multiple parts in parallel. It sorts the buckets within each part from large to small and ‘greedily’ assigns them the smallest pilot that maps the corresponding keys to slots that are still free. Unlike other methods though, PtrHash only allows pilots up to \(255\). When no suitable pilot is found, we use a method similar to (blocked) cuckoo hashing: a pilot with a minimal number of collisions is chosen, and the colliding buckets are ’evicted’ and have to find a different pilot.
Parameter values. In practice, we fix the number of slots per part, \(S\), to be \(2^{20}\approx 10^6\), since a \(2^{20}\) bit mask of taken slots takes \(128KiB\) and hence comfortable fits in a typical L2 cache. The number of parts \(P\) is then \((n/S)/\alpha\). The load factor \(\alpha\) is around \(0.98\), so that each part of \(S\) slots has \(\alpha \cdot S\) keys in expectation. In order to avoid overly full parts with more keys than slots, a smaller \(\alpha\) is sometimes required when the number of parts is large, for example when using \(10^{12}\) keys. Similar to PHOBIC, the number of buckets per part is set to \(B = (\alpha\cdot S)/\lambda\), where \(\lambda\) is the expected size of each bucket and is around \(4\).
Name | Definition |
---|---|
\(S = 2^{18}\) | Number of slots per part. |
\(\alpha = 0.98\) | Load factor. Expected number of keys per part is \(\alpha\cdot S\). |
\(\lambda=4\) | Expected number of elements per bucket. |
\(\gamma(x) = \frac{255}{256}\cdot (x^2+x^3)/2 + \frac{1}{256}\cdot x\) | Bucket function controlling relative bucket sizes. |
\(n\) | Total number of keys. |
\(P = \lceil n/(\alpha \cdot S)\rceil\) | Number of parts. |
\(B = \lceil(\alpha \cdot S)/\lambda\rceil\) | Number of buckets per part. |
|
|
3.2 Comparison to PTHash and PHOBIC
Compared to PTHash (Pibiri and Trani 2021b), PtrHash has a few differences:
Single bucket size. Following FCH (Fox, Chen, and Heath 1992), PTHash (Pibiri and Trani 2021b) uses small and large buckets to speed up the construction and decrease memory usage. Similarly, PHOBIC uses a function to create a near-optimal distribution of bucket sizes that saves up to \(0.14\) bits/kmer over the small/large buckets (Hermann et al. 2024). For PtrHash we prefer simplicity and all buckets have the same expected size.
Pilot encoding. PTHash offers a few encoding schemes for the pilots: compact encoding (storing each pilot with exactly as many bits as are needed for the largest pilot), dictionary encoding (storing a list of all pilot values, and replacing each pilot with an index in the list), and Elias-Fano encoding. Additionally the small and large buckets can use different encoding schemes. PHOBIC offers an additional space saving of \(0.06\) bits/key by interleaving the pilots of each part.
For PtrHash, all pilots are exactly \(8\) bits, and we simply store them as a vector of bytes, removing the need for additional logic and memory accesses during their lookup.
Parts. PTHash-HEM (Pibiri and Trani 2021a) (TODO: Replace by published version) and PHOBIC split the keys into parts, and then work on each part independently. For a part containing \(P’\) keys, they use \(P’/\alpha\) slots (with \(\alpha=1\) for PHOBIC). This means that for each query, a lookup is required to determine the slot where the current part starts.
PtrHash, on the other hand, assigns the same number of slots to each part, so that no additional lookups are needed.
Part size. PTHash only uses a single part for all keys. PHOBIC, instead, uses relatively small parts of expected size \(2500\). PtrHash chooses the part size such that construction of each part roughly fits in the L2 cache of each CPU core, which is around \(250\ 000\) in practice.
Remapping. PTHash-HEM supports construction by parts and ensures that each part of \(P\) elements maps to \(P\) consecutive slots, by remapping per part. PHOBIC does not use remapping since it does not use an \(\alpha<1\). PtrHash, instead, does a global remap over all parts.
Additionally, PtrHash introduces a new encoding for the remapped values, see 3.4.
Streaming queries. Lastly, PtrHash supports streaming queries, allowing it to prefetch pilots from memory and better use the available memory bandwidth.
3.3 Construction
Both PTHash-HEM and PHOBIC first partition the keys into parts, and then build an MPHF part-by-part, optionally in parallel on multiple threads. Within each part, the keys are randomly partitioned into buckets of expected size \(\lambda\) (Figure 1). Then, the buckets are sorted from large to small, and one-by-one greedily assigned a pilot, such that the keys in the bucket map to slots not yet covered by earlier buckets.
[Drop/dedup with overview/move to background?] As observed for PTHash, searching for pilots becomes harder as the load factor (fraction of already filled slots) increases. Hence, PTHash uses \(n/\alpha > n\) slots to reduce the construction time and decrease the pilots, making their encoding more efficient. PHOBIC, on the other hand, uses relatively small parts of size \(2500\), so that the search for the last empty slot usually shouldn’t take much more than \(2500\) attempts. Nevertheless, a drawback of the greedy approach is that pilots have an uneven distribution, causing sub-optimal fixed-width compression.
Hash-evict2. In PtrHash, we instead always use fixed width single byte pilots. To achieve this, we use a technique resembling cuckoo hashing (Pagh and Rodler 2001). As before, buckets are greedily inserted from large to small. For some buckets, there may be no pilot in \([256]\) such that all its keys map to empty slots. When this happens, a pilot is found with the lowest weighted number of collisions. The weight of a collision with a bucket of size \(s\) is \(s^2\), to prevent collisions with large buckets, as those are harder to place. The colliding buckets are then evicted by emptying the slots they map to and pushing them back onto the priority queue of remaining buckets. Then, the new bucket is inserted.
|
|
Optimizations. In order to speed up the code to search for pilots, a number of optimizations are made to the conceptual idea of Code Snippet 2.
taken
bit mask. Instead of determining whether a slot is free by checking theslots
array for the optional index of the bucket mapping there, we keep a separate bit masktaken
that takes only \(1\) bit instead of \(32\) bits per element. This allows for better caching and hence faster access.- Collision-free hot path. When searching for pilots, we first test if there
is a pilot without any collisions. This is usually the case, and is faster
since it only needs access to
taken
, notslots
. Additionally, where there is a collision, we know a pilot is optimal when it collides with exactly one bucket of minimal size. - Avoiding loops. To avoid repeated patterns of the same buckets evicting each other, the search for a pilot starts at a random number in \([256]\), rather than at \(0\).
- Avoiding loops more. Each time a bucket is placed that evicted some other bucket(s), it is added to a list of the \(16\) most recently placed buckets. Buckets in this list are never evicted. This avoids short cycles, where for example two buckets keep evicting each other for the same slot.
Analysis. Unfortunately, we do not currently have a formal analysis showing that the hash-evict method works with high probability given that certain criteria are met. In 4, we will show some practical results.
3.4 Remapping using CacheLineEF
Both PTHash and PtrHash use a parameter \(0<\alpha\leq 1\) to use a total of \(n’=n/\alpha\) slots, introducing \(n’-n\) additional free slots. As a result of the additional slots, some, say \(R\), of the keys will map to positions \(n\leq p_0<\dots< p_{R-1}< n’\), causing the perfect hash function to not be minimal.
Remapping. Since there are a total of \(n\) keys, this means there are exactly \(R\) empty slots (‘gaps’) left behind in \([n]\), say at positions \(L_0\) to \(L_{R-1}\). We remap the keys that map to positions \(\geq n\) to the empty slots at positions \(< n\) to obtain a minimal perfect hash function.
A simple way to store the remap is as a plain array \(free\), such that \(free[p_i-n] = L_i\). PTHash encodes this array using Elias-Fano coding (Elias 1974; Fano 1971), after setting undefined positions of \(free\) equal to their predecessor. The benefit of the plain \(free\) array is fast and cache-local lookups, whereas Elias-Fano coding provides a more compact encoding that requires multiple lookups to memory.
CacheLineEF. We propose using Elias-Fano coding on a per-cache line basis, so that each lookup only requires a single read from memory. First, the list of non-decreasing \(free\) positions is split into chunks of \(C=44\) values \(\{v_0, \dots, v_{43}\}\), with the last chunk possibly containing fewer values. Then, each chunk is encoded into \(64\) bytes that can be stored as single cache line, as shown in Figure 2.
We first split all indices into their \(8\) low bits (\(v_i \bmod 2^8\)) and \(32\) high bits (\(\lfloor v_i/2^8\rfloor\)). Further, the high part is split into an offset (the high part of \(v_0\)) and the relative high part: \[ v_i = (v_i\bmod 2^8) + 2^8\cdot\lfloor v_0/256\rfloor + 2^8\cdot \left(\lfloor v_i/256\rfloor - \lfloor v_0/256\rfloor\right). \] This is stored as follows.
- First, the \(8\) low bits of each \(v_i\) are directly written to the \(44\) trailing bytes.
- Next, the \(32\) bit offset \(\lfloor v_0/256\rfloor\) is stored.
- Lastly, the relative high parts are encoded into \(128\) bits. For each \(i\in[44]\), bit \(i + \lfloor
v_i/256\rfloor - \lfloor v_0/256\rfloor\) is set to
1
. Since the \(v_i\) are increasing, each \(i\) sets a distinct bit, for a total of \(44\) set bits.
Lookup. The value at position \(i\) is found by summing (1) the \(8\) low bits,
(2) the offset multiplied by \(256\), and (3) the relative high part.
This last part can be found as \(256\cdot(select(i)-i)\), where \(select(i)\) gives
the position of the \(i\)’th 1
bit. In practice, this can be implemented
efficiently using the PDEP
instruction provided by the BMI2 bit manipulation
instruction set (Pandey, Bender, and Johnson 2017):
this operation can deposit the mask 1<<i
onto our bit pattern, so that the
1
ends up at the position of the \(i\)’th one of our pattern. Then, it suffices
to count the number of trailing zeros, which is provided by the TZCNT
instruction in BMI1.
Limitations. CacheLineEF uses \(64/44\cdot 8 = 11.6\) bits per value, which is more than the usual Elias-Fano, which for example takes \(8+2=10\) bits per value for data with an average stride of \(256\). Furthermore, values are limited to \(40\) bits, covering \(10^{12}\) items. The range could be increased to \(48\) bit numbers by storing \(5\) bytes of the offset, but this has not been necessary so far. Lastly, each CacheLineEF can only span a range of around \((128-44)\cdot 256 = 21\ 504\), or an average stride of \(500\). For PtrHash, we use \(\alpha\leq 0.99\), and hence the average distance between empty slots is at most \(100\), so that in practice the average distance never exceeds \(500\).
Comparison. Compared to Elias-Fano coding, CacheLineEF stores the low order bits as exactly a single byte, removing the need for unaligned reads. Further, the select data structure on the high-order bits is replaced by a few local bit-wise operations. CacheLineEF is also somewhat similar to the (Uniform) Partitioned Elias-Fano Index of Ottaviano and Venturini (2014), in that both split the data. The uniform partitioned index also uses fixed part sizes, but encodes them with variable widths, and adds a second level of EF to encode the part offsets. Instead, CacheLineEF prefers simplicity and uses fixed part sizes with a constant width encoding and simply stores the offsets directly.
|
|
3.5 Bucket assignment functions
During construction, slots slowly fill up as more buckets are placed. Because of this, the first buckets are much easier to place than the later ones, when only few empty slots are left. To compensate for this, we can introduce an uneven distribution of bucket sizes, so that the first buckets are much larger and the last buckets are smaller. FCH (Fox, Chen, and Heath 1992) accomplishes this by a skew mapping that assigns \(60\%\) of the elements to \(30\%\) of the buckets, so that those \(30\%\) are large buckets while the remaining \(70\%\) is small (Table 2). This is also the scheme used by PTHash.
The perfect bucket function. PHOBIC (Hermann et al. 2024) provides a more thorough analysis and uses the optimal3 function \(\gamma(x) = x + (1-x)\ln (1-x)\). This function has derivative \(0\) at \(x=0\), which in practice causes the largest buckets to have size much larger than \(\sqrt S\). Such buckets are hard to place, because by the birthday paradox they are likely to have multiple elements hashing to the same slot. To fix this, they ensure the slope of \(\gamma\) is at least \(\varepsilon=1/(5 \sqrt S)\) by using \(\gamma(x) = x + (1-\varepsilon)(1-x)\ln(1-x)\) instead. Since this function is slow to compute in practice, PHOBIC uses a $2048$-piecewise linear approximation using a lookup table and linear interpolation.
Approximations. For PtrHash, we would like to only use simple computations and avoid lookups as much as possible, to avoid the CPU becoming a bottleneck in query throughput. To this end, we replace the \(\ln (1-x)\) by its first order Taylor approximation at \(x=0\), \(\ln(1-x) \approx -x\), giving the quadratic \(\gamma(x) = x^2\). Using the second order approximation \(\ln(1-x) \approx -x-x^2/2\) results in the cubic \(\gamma(x) = (x^2+x^3)/2\). This version again suffers from too large buckets, so in practice we use \(\gamma(x) = \frac{255}{256}\cdot (x^2+x^3)/2 + \frac{1}{256}\cdot x\).
These values can all be computed efficiently by using that the input and output of \(\gamma\) are \(64\) bit unsigned integers representing a fraction of \(2^{64}\), so that e.g. \(x^2\) can be computed as the upper \(64\) bits of the widening \(64\times64\to 128\) bit product \(x\cdot x\).
TODO: $α$-adjusted perfect function.
3.6 Parallel queries
Throughput. In practice in bioinformatics applications (such as SSHash), we expect many independent queries to the MPHF. This means that queries can be answered in parallel, instead of one by one. Thus, we should optimize for query throughput (queries per second, but usually implicitly reported as inverse throughput in amortized seconds per query) rather than individual query latency (seconds per query).
Out-of-order execution. A typical MPHF on \(10^9\) keys requires memory at least \(2bits/key \cdot 10^9 keys = 250MB\), which is much larger than the L3 cache of size around \(16MB\). Thus, most queries require reading from main memory (RAM), which usually has a latency around \(80ns\). Nevertheless, existing MPHFs such as FCH (Fox, Chen, and Heath 1992) achieve an inverse throughput as low as \(35ns/query\) on such a dataset (Pibiri and Trani 2021b). This is achieved by pipelining and the reorder buffer. For example, Intel Skylake CPUs can execute over \(200\) instructions ahead while waiting for memory to become available (Wong 2013; Downs 2017). This allows the CPU to already start processing ‘future’ queries and fetch the required cache lines from RAM while waiting for the current query. Thus, when each iteration requires less than \(100\) instructions and there are no branch-misses, this effectively makes up to two reads in parallel. A large part of ensuring faster queries is then to reduce the length of each iteration to out-of-order execution to fetch memory more iterations ahead.
Prefetching. Instead of relying on the CPU hardware to parallellize requests to memory, we can also explicitly prefetch4 cache lines from software. Each prefetch requires a line fill buffer to store the result before it is copied into the L1 cache. Skylake has \(12\) line fill buffers (Downs 2018), and hence can support up to \(12\) parallel reads from memory. This gives a maximal theoretical throughput around \(80ns/query/12 = 6.67 ns/query\) as long as each query only requires a single read from main memory.
We consider two models to implement prefetching: batching and streaming.
TODO: in practice, batching will also have some overlap due to pipelining, but streaming doesn’t need this.
Batching. In this approach, the queries are split into chunks of size \(B\), and chunks are then processed one by one (Figure 3, left). In each chunk, each key is hashed, its bucket it determined, and the cache line containing the corresponding pilot is prefetched. Then, the buffered hashes are iterated again, and the corresponding slots are computed.
Streaming. A drawback of batching is that at the start and end of each batch, the maximal number of parallel prefetches is not fully saturated. The streaming approach fixes this by prefetching the cache line for the pilot \(B\) iterations ahead of the current one, and is able to sustain the maximum possible number of parallel prefetches throughout, apart from at the very start and end (Figure 3, right).
3.7 Sharding
When the number of keys is large, say \(>10^{10}\), their \(64\) bit hashes may not all fit in memory at the same time, even though the final PtrHash datastructure (the list of pilots) would fit. Thus, we can not simply sort all hashes in memory to partition them. Instead, we split the set of all \(n\) hashes into, say \(s=\lceil n/2^{32}\rceil\) shards, where the \(i\)’th shard corresponds to hash values in \(s_i:=[2^{64}\cdot i/s, 2^{64}\cdot (i+1)/s)\). Then, the hashes in each shard are sorted and split into parts, after which the parts are constructed as usual. This way, the shards only play a role during construction, and the final constructed data structure is independent of which sharding strategy is used.
In-memory sharding. The first approach to sharding is to iterate over the set of keys \(s\) times. In the \(i\)’th iteration, all keys are hashed, and only those hashes in the corresponding interval \(s_i\) are stored and processed. This way, no disk space is needed for construction.
On-disk sharding. A drawback of the first approach is that keys are potentially hashed many times. This can be avoided by writing hashes to disk. Specifically, we can create one file per shard and append hashes to their corresponding file. These files are then read and processed one by one.
On-disk PtrHash. When the number of keys is so large that the pilots do not fit in memory, they can also be stored to disk and read on-demand while querying. This is supported using $ε$-serde (Vigna and Fontana 2024; Fontana, Vigna, and Zacchiroli 2024).
4 Results
In this section we investigate PtrHash construction and query throughput for different parameters, and compare PtrHash to competitors. All experiments are run on an Intel Core i7-10750H CPU with 6 cores and hyper-threading disabled. The frequency is pinned to 2.6GHz. Cache sizes are 32KiB L1 and 256KiB L2 per core, and 12MiB shared L3 cache. Main memory is 64GiB DDR4 at 3200MHz, split over two 32GiB banks.
Input data. For construction, all experiments use \(10^9\) keys, for which the pilots take around 300MB and are much larger than L3 cache. For the query throughput experiments, we also test on 20 million keys, for which the pilots take around 6MB and easily fit in L3 cache. To avoid the time needed for hashing keys, and since our motivating application is indexing $k$-mers that fit in \(64\) bits, we always use random \(64\) bit integer keys, and hash them using FxHash.
Parameter overview. The number of slots per part \(S\) is fixed to \(2^{20}\). Slightly smaller parts of \(S=2^{18}\) slots results in up to \(20\%\) faster construction times because of better cache locality. However, this turns out to be less reliable when the number of keys and parts is large, because the variance in number of keys per part is large enough to have parts with load factor too close to \(1\) for construction to succeed, or even \(>1\).
The load factor \(\alpha\) ranges from \(0.98\), which is sufficiently small to allow fast construction, to \(0.995\), which is about as large as possible to still allow buckets of size \(1\) to find a pilot within $256 tries.
Overall, we propose two sets of parameters, simple and compact.
4.1 Construction
4.1.1 Bucket functions
- Linear is simple
- Cubic is best for construction
- In the remainder, we use linear for speed and simplicity (especially for small datastructures), and cubic for space efficiency.
4.1.2 Construction and size
- \(n=10^9\)
- \(\alpha \in \{0.98, 0.99, 0.995\}\)
- Larger alpha such as \(0.995\) are not reliable:
- too sparse for cacheline ef
- overly large parts for large n
- Larger alpha such as \(0.995\) are not reliable:
- bucket functions linear and cubic
- lambda variable \(\{3.0, …, 4.5\}\)
- remap using CacheLineEF
- time using 6 threads.
- observations:
- construction time grows exponentially, until it fails
- fails usually because eviction chains become too long.
- cubic can handle much large \(\lambda\), and makes smaller overall size
- \(\alpha=0.98\) is faster, but around \(0.1bit/key\) larger
- \(\alpha=0.99\) fails before \(\alpha=0.98\), partially because for large \(n\), there will be variance in slots per part, and some parts will have load factor too close to \(1\), or even larger than \(1\).
- \(\alpha=0.999\) would
Slots per part.
- remark that \(2^{18}\) is up to \(20\%\) faster to construct, but has too large variance for \(n=10^{12}\) keys.
Construction time breakdown.
- What are these \(20ns\) spent on?
- 1ns hashing
- 5ns radix sorting
- 12ns finding pilots
- 1ns remapping
Recommended parameters.
- two proposed configurations:
- Simple: Linear, \(\lambda = 3.0\), \(\alpha=0.99\), 2.79 bits/key
- when \(n\) is small and memory doesn’t matter too much
- Compact: Cubic, \(\lambda = 4.0\), \(\alpha=0.98\), 2.24 bits/key (for large \(\lambda\) we avoid \(\alpha=0.99\) since it is less reliable)
- Simple: Linear, \(\lambda = 3.0\), \(\alpha=0.99\), 2.79 bits/key
4.1.3 Remap
Parameters | Pilots (bits/key) | q1_phf | q32_phf | remap_type | remap | q1_mphf | q32_mphf |
---|---|---|---|---|---|---|---|
simple: \(\alpha=0.99\), \(\lambda=3.0\), \(\gamma\) linear | 2.67 | 14.01 | 8.57 | Vec < u32 > | 0.33 | 12.56 | 8.80 |
13.80 | 8.58 | CacheLineEF | 0.12 | 12.77 | 9.19 | ||
13.91 | 8.65 | EF | 0.09 | 14.37 | 9.84 | ||
compact: \(\alpha=0.98\), \(\lambda=4.0\), \(\gamma\) cubic | 2.00 | 17.82 | 8.07 | Vec < u32 > | 0.66 | 20.37 | 9.11 |
17.83 | 8.00 | CacheLineEF | 0.24 | 20.97 | 10.45 | ||
18.32 | 8.33 | EF | 0.17 | 23.17 | 14.14 |
alpha | lambda | bucketfn | pilots | q1_phf | q32_phf | remap_type | remap | q1_mphf | q32_mphf |
---|---|---|---|---|---|---|---|---|---|
0.995 | 3.000 | Linear | 2.668 | 13.713 | 8.575 | Vec < u32 > | 0.179 | 12.164 | 8.707 |
0.995 | 3.000 | Linear | 2.668 | 13.779 | 8.588 | EF | 0.056 | 13.400 | 9.199 |
0.990 | 4.000 | CubicEps | 2.000 | 17.663 | 7.841 | Vec < u32 > | 0.330 | 19.514 | 8.427 |
0.990 | 4.000 | CubicEps | 2.000 | 17.707 | 7.933 | CacheLineEF | 0.120 | 19.857 | 10.520 |
0.990 | 4.000 | CubicEps | 2.000 | 17.711 | 7.941 | EF | 0.094 | 20.841 | 11.110 |
- Remap:
- plain vec of
u32
- cachelineef
- ef
- plain vec of
- observations:
- CLEF is 2.75x smaller than plain vec
- CLEF is only 1.4x larger than EF
- CLEF has slightly slower queries, than plain vec, but much faster than EF, both when doing queries one at a time and with explicit streaming and prefetching.
- Params:
- Simple:
Vec<u32>
- Compact:
CacheLineEF
- Simple:
4.1.4 Sharding
- Use \(2^{20} \approx 10^6\) slots per part, to avoid having parts with more keys
than slots (even for \(\alpha=0.99\)):
- Say there are \(n=\alpha\cdot 10^{12}\) keys and \(S=10^6\) slots per part, and \(P=n/S\) parts.
- keys per part distribution: \(Binom(n, p=1/P)\sim N(np, np(1-p))\), so stddev \(\sqrt(np(1-p))\approx \sqrt(n/P) =1000\). With \(\alpha=0.99\) and \(990000\) keys/part expected with \(10000\) buffer, that’s \(10\sigma\) space, which is plenty for a reliable construction.
- 5*10^10:
- 24 shards
- In-memory: 3314s
- Hybrid: 3996s
- pilots: ~82s/shard
- sorting buckets: ~15s/shard
- hybrid: ~45s/shard to hash/write/read keys
- 197s to hash + write *1/8
- 21s to read
- memory: ~65s/shard to re-hash all keys
- Around 50GB of RAM needed.
- 12.8GB pilots
- 6.3GB taken bitvec
- size:
- pilots: 2.05 bits/key
- remap: 0.12 bits/key
- total: 2.17 bits/key, 13.6GB
- 10^11 and larger would be possible with support for constructing the datastructure itself on disk instead of in memory.
4.2 Query throughput
- Latency around 80ns
- Max single-threaded throughput:
- Max RAM transfer rate: 25.6GB/s
- Single bucket size vs 2-way split vs PHOBIC
- Construction speed
- Sequential lookup
- Prefetching
4.2.1 A note on benchmarking query throughput
To our knowledge, all recent papers on (minimal) perfect hashing measure query speed by first creating a list of keys, and then querying all keys in the list, as shown in Code Snippet 4. One might think this measures the average latency of a query, but that is not the case, as the CPU will execute instructions from adjacent iterations at the same time. Indeed, as we will see (TODO table ref), this loop can be as fast as \(12 ns/key\) for \(n=10^9\), which is over \(6\) times faster than the RAM latency of \(\approx 80ns\), and thus, at least \(6\) iterations are being processed in parallel.
Hence, we argue that existing benchmarks measure (and optimize for) throughput and that they assume that the list of keys to query is known in advance. We make this assumption explicit by changing the API to benchmark all queries at once, as shown in Code Snippet 5. This way, we can explicitly process multiple queries in parallel as described in 3.6.
We also argue that properly optimizing for throughput is relevant for applications. SSHash, for example, queries all minimizers of a DNA sequence, which can be done by first computing and storing those minimizers, followed by querying them all at once.
|
|
|
|
4.2.2 Batch size
- Compact is slow on small inputs because the loop is vectorized with SIMD, which ends up being slower.
- Query experiments always do 1G queries.
- n=20M n=1G
- linear/0.99/3.0 vs cubic/0.98/4.0
- no remap
- loop vs streaming vs batching
- 24: 1..128 batch size: 1,2,3,4,5,6,7,8,10,12,14,16,20,24,28,32,40,48,56,64,80,96,112,128
4.2.3 Throughput
- 2: Linear/0.99/3.0 vs Cubic/0.98/4.0
- 2: loop vs streaming with known batch size
- 2: with/out remap (Vec & CachelineEF)
- 6: with 1..6 threads
- random access memory throughput
4.3 Comparison with competitors
- check out https://github.com/ByteHamster/MPHF-Experiments for evals
5 Conclusion and future work
Future work.
- Faster than memory is possible, when most queries can be answered from a smaller cache.
- SIMD, although the widening multiplication complicates things
6 Acknowledgements
- Giulio for ongoing discussions
- Sebastiano for trying the 10^12 construction.
7 Appendix
7.1 Rust and assembly code for streaming
Code Snippet 6 shows the Rust code for the streaming version of PtrHash, and
Code Snippet 7 shows the corresponding assembly code with perf record
results.
|
|
|
|
DONE 8 Failed ideas
- always compute remap to avoid branch:
- Instead, an additional layer of prefetching helps a bit, but too complicated and annoying.
- rattle kicking?
- 4bit pilots with buckets of half the size -> doesn’t work.
References
The PT in PTHash stand for Pilot Table. The author of the present paper mistakenly understood it to stand for Pibiri and Trani, the authors of the PTHash paper. Due to the current author’s unconventional last name, and PTGK not sounding great, the first initial (R) was appended instead. As things go, nothing is as permanent as a temporary name. Furthermore, we follow the Google style guide and avoid a long run of uppercase letters, and write PtrHash instead of PTRHash. ↩︎
We would have preferred to call this method hash-displace, as displace is the term used instead of evict in e.g. the cuckoo filter by Fan et al. (2014). Unfortunately, hash and displace is also the name of another MPHF introduced by Pagh (1999), that was then extended into compressed hand-and-displace (CHD) by Belazzougui, Botelho, and Dietzfelbinger (2009). There, the to-be-inserted key (rather than the existing key) is displaced by applying a linear shift to its initial position. ↩︎
Under the assumption that bucket sizes are continuous, and that the target load factor is \(1\). ↩︎
There are typically multiple types of prefetching instructions that prefetch into a different level of the cache hierarchy. We prefetch into all levels of cache using
prefetcht0
. ↩︎