Abstract
Motivation: Given a set \(S\) of \(n\) keys, a minimal perfect hash function (MPHF) is a collision-free bijective map \(f\) from the elements of \(S\) to \(\{0, \dots, n-1\}\). These functions have applications in databases and search engines, and are used in bioinformatics indexing tools such as Pufferfish (using BBHash), and SSHash and Piscem (both using PTHash). This work presents an MPHF that prioritizes query throughput and can be constructed efficiently for billions of elements using \(2\) to \(4\) bits of memory per key.
Contributions: PtrHash builds on PTHash by 1) partitioning the table for faster construction, 2) using cuckoo hashing to find \(8\)-bit pilots, and 3) thereby making compression of the pilots redundant. We further speed up queries by 4) simplifying hash functions and modulo operations, and 5) prefetching reads from memory.
Results: We implemented PtrHash in Rust and show that at \(3\) bits/key memory usage, it is \(3\times\) faster to construct than PTHash while achieving a query throughput of \(5\) ns/key, \(5\times\) faster than state-of-the-art methods.
Source: github.com/RagnarGrootKoerkamp/pthash-rs
Introduction
- Perfect hashing
- Minimal perfect hashing
- Application in bioinformatics (SSHash)
Background
PtHash
Phobic
PtrHash
Overview
Before going into details, we briefly explain the fully constructed PtrHash data structure and how to query it; see Figure 1 and Code Snippet 1.
The input is a set of \(n\) keys \(\{k_0, \dots, k_{n-1}\}\) that we want to hash to \(n\) slots \([n]:=\{0, \dots, n-1\}\). These keys are hashed using a hash function \(h\) into \(\{h(k_0), \dots, h(k_{n-1})\}\). The total space of hashes \([H]\) (\(H\in \{2^{64},2^{128}\}\)) is equally partitioned into \(P\) parts. Further, each part is partitioned into \(B\) buckets. Thus, key \(k_i\) corresponds to part \(part(k_i) = \left\lfloor \frac{h(k_i)}{H} \cdot P\right\rfloor\) and bucket \(bucket(k_i) = \left\lfloor \frac{h(k_i)}{H} \cdot P\cdot B\right\rfloor\), where the fractional hash \(h(k_i)/H \in [0, 1)\) is scaled to the output range.
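Since \(H\) is a power of two, the scaling \(\left\lfloor \frac{h}{H}\cdot P\right\rfloor\) needs no division: it is the high word of a widening multiplication. A minimal sketch in Rust, assuming \(H = 2^{64}\) (the function and parameter names here are illustrative, not the implementation's):

```rust
/// Map a 64-bit hash `h` to `[0, p)` as the high 64 bits of the 128-bit
/// product `h * p`. This computes floor(h / 2^64 * p), i.e. the fractional
/// hash scaled to the output range, without any division or modulo.
fn fastrange(h: u64, p: u64) -> u64 {
    ((h as u128 * p as u128) >> 64) as u64
}

/// Part and global bucket index of a hash, given P parts and B buckets/part.
fn part_and_bucket(h: u64, p: u64, b: u64) -> (u64, u64) {
    (fastrange(h, p), fastrange(h, p * b))
}
```

Note that the bucket index is global (in \([P\cdot B)\)), and a key's bucket always lies inside its part because both are computed by scaling the same fractional hash.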
Now, the goal is to map the roughly \(n/P\) keys in each part to \(S\approx (n/P)/\alpha\) slots, where \(\alpha\approx 0.98\) gives us \(\approx 2\%\) extra slots to play with.1 The main data structure to achieve this mapping is a list of pilots \(\{p_0, \dots, p_{P\cdot B-1}\}\), one for each bucket. The pilot \(p_b\) of bucket \(b\) controls to which slots the keys in that bucket map. Specifically, key \(k_i\) in bucket \(b=bucket(k_i)\) with pilot \(p_b\) maps to \[ slot(k_i) = part(k_i) \cdot S + (h(k_i) \oplus h_2(p_b))\bmod S. \]
The main difficulty is during construction, where we must find values of the pilots \(p_b\) such that all keys indeed map to different slots.
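Putting the pieces together, a query could look as follows. This is a sketch rather than the actual implementation: the struct layout and the pilot mixer `h2` are placeholder assumptions, and the real code exploits that \(S\) is a power of two (see the footnote) to replace the modulo by a bitmask.

```rust
fn fastrange(h: u64, p: u64) -> u64 {
    ((h as u128 * p as u128) >> 64) as u64
}

/// Placeholder pilot mixer; any decent 64-bit mixer works here.
fn h2(p: u8) -> u64 {
    (p as u64).wrapping_mul(0x9E3779B97F4A7C15)
}

/// Minimal sketch of the queryable structure: P, B, S, and one
/// 8-bit pilot per bucket (P*B pilots in total).
struct PtrHashSketch {
    parts: u64,            // P
    buckets_per_part: u64, // B
    slots_per_part: u64,   // S
    pilots: Vec<u8>,
}

impl PtrHashSketch {
    /// slot(k) = part(k) * S + (h(k) xor h2(pilot)) mod S
    fn slot(&self, h: u64) -> u64 {
        let part = fastrange(h, self.parts);
        let bucket = fastrange(h, self.parts * self.buckets_per_part);
        let pilot = self.pilots[bucket as usize];
        part * self.slots_per_part + (h ^ h2(pilot)) % self.slots_per_part
    }
}
```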
- simplify all the things
- overview figure comparing PTHash, Phobic, PtrHash
Dropping lookups: 8-bit pilots and displacing
- Drop EF encoding of pivots; no rank/select needed for querying
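As a rough illustration of hash-and-displace with 8-bit pilots, the following sketch searches pilots within one part: each bucket tries all 256 pilots for a collision-free placement, and otherwise evicts the buckets in its way, cuckoo-style, re-queueing them. This is a simplified stand-in for the actual construction (for instance, the real algorithm chooses the eviction pilot more carefully instead of defaulting to 0, and retries with a fresh seed on failure):

```rust
/// Placeholder pilot mixer, as in the query path.
fn h2(p: u8) -> u64 {
    (p as u64).wrapping_mul(0x9E3779B97F4A7C15)
}

/// Find an 8-bit pilot per bucket so that all keys of one part map to
/// distinct slots in [0, slots). `bucket_keys[b]` holds the key hashes of
/// bucket `b`. Returns None when the bounded search gives up.
fn find_pilots(bucket_keys: &[Vec<u64>], slots: usize) -> Option<Vec<u8>> {
    let n_buckets = bucket_keys.len();
    let slot_of = |h: u64, p: u8| ((h ^ h2(p)) % slots as u64) as usize;
    let mut pilots = vec![0u8; n_buckets];
    let mut owner = vec![usize::MAX; slots]; // bucket occupying each slot
    let mut taken: Vec<Vec<usize>> = vec![Vec::new(); n_buckets];
    // Sort ascending by size so the largest (hardest) buckets pop first.
    let mut stack: Vec<usize> = (0..n_buckets).collect();
    stack.sort_by_key(|&b| bucket_keys[b].len());
    let mut steps = 0usize;
    while let Some(b) = stack.pop() {
        steps += 1;
        if steps > 100 * n_buckets.max(1) {
            return None; // give up; a full implementation retries with a new seed
        }
        // Prefer a pilot mapping this bucket to distinct, currently free slots.
        let found = (0u8..=255).find(|&p| {
            let mut s: Vec<usize> = bucket_keys[b].iter().map(|&h| slot_of(h, p)).collect();
            s.sort_unstable();
            s.windows(2).all(|w| w[0] != w[1])
                && s.iter().all(|&s| owner[s] == usize::MAX)
        });
        // Otherwise fall back to pilot 0 and evict whoever is in the way.
        let p = found.unwrap_or(0);
        let targets: Vec<usize> = bucket_keys[b].iter().map(|&h| slot_of(h, p)).collect();
        for &s in &targets {
            let old = owner[s];
            if old != usize::MAX && old != b {
                // Free all slots of the colliding bucket and re-queue it.
                for &os in &taken[old] {
                    owner[os] = usize::MAX;
                }
                taken[old].clear();
                stack.push(old);
            }
        }
        for &s in &targets {
            owner[s] = b;
        }
        pilots[b] = p;
        taken[b] = targets;
    }
    Some(pilots)
}
```

Because pilots are a single byte each, they are stored as a plain byte array and queried with one load, with no Elias–Fano decoding or rank/select structure in the query path.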
Fast construction: Parts and global remap
- Faster construction by localized (to L2 cache) displacement only.
CachelineEF instead of plain EF
Larger inputs: Shards
- Allowing construction for inputs larger than memory by processing one shard of keys at a time.
Faster queries: Batching/prefetching
- Plot of max random-access memory throughput.
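The batching idea can be sketched as a two-pass loop: first compute the bucket index of every hash in a batch and prefetch its pilot's cacheline, then resolve the batch while those loads are in flight, so the memory accesses of a batch overlap instead of serializing. A sketch under assumptions (the batch size, mixer, and function signature are illustrative):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

/// Resolve a stream of key hashes in batches of 16:
/// pass 1 prefetches the pilot of every hash in the batch,
/// pass 2 computes the slots, with the pilot reads (mostly) hitting cache.
fn query_batched(hashes: &[u64], pilots: &[u8], parts: u64, buckets: u64, slots: u64, out: &mut Vec<u64>) {
    const BATCH: usize = 16;
    let fastrange = |h: u64, p: u64| ((h as u128 * p as u128) >> 64) as u64;
    let h2 = |p: u8| (p as u64).wrapping_mul(0x9E3779B97F4A7C15);
    for chunk in hashes.chunks(BATCH) {
        // Pass 1: bucket indices, and a prefetch per pending pilot read.
        let idx: Vec<usize> = chunk
            .iter()
            .map(|&h| fastrange(h, parts * buckets) as usize)
            .collect();
        #[cfg(target_arch = "x86_64")]
        for &b in &idx {
            unsafe { _mm_prefetch::<_MM_HINT_T0>(pilots.as_ptr().add(b) as *const i8) };
        }
        // Pass 2: resolve the batch.
        for (&h, &b) in chunk.iter().zip(&idx) {
            let part = fastrange(h, parts);
            out.push(part * slots + (h ^ h2(pilots[b])) % slots);
        }
    }
}
```

Without prefetching, each query stalls on a random pilot read before the next one starts; batching keeps several of these cache misses outstanding at once, which is what pushes throughput toward the memory system's random-access limit.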
TODO Perfect bucket function
TODO Shift displacement instead of rehashing
TODO Chunk vs stream batching?
Results
For convenience later on, \(P\) is chosen such that \(S = 2^s\) is a power of two of configurable size. ↩︎