[WIP] PtrHash: perfect hashing at the speed of memory

This is the work-in-progress paper on PtrHash. The original blog post on its development is here.

Abstract

Motivation: Given a set \(K\) of \(n\) keys, a minimal perfect hash function (MPHF) is a collision-free bijective map \(f\) from \(K\) to \(\{0, \dots, n-1\}\). Such functions have applications in databases and search engines, and are used in bioinformatics indexing tools such as Pufferfish (using BBHash), SSHash, and Piscem (the latter two using PTHash). This work presents an MPHF that prioritizes query throughput and can be constructed efficiently for a billion or more elements using under \(3\) bits of memory per key.

Contributions: PtrHash builds on PTHash by 1) partitioning the table for faster construction, and 2) using cuckoo-hashing-like eviction to find \(8\) bit pilots, which 3) makes further compression of the pilots unnecessary. We further speed up queries by 4) simplifying the hash functions and modulo operations, and 5) batching queries and prefetching reads from memory.

Results: (TODO: update numbers. TODO: add something about speed of memory) We implemented PtrHash in Rust and show that at \(3\) bits/key memory usage it is \(3\times\) faster to construct than PTHash while achieving \(5\) ns/key query throughput, \(5\times\) faster than state-of-the-art methods.

Source: github.com/RagnarGrootKoerkamp/ptrhash

1 Introduction

  • Perfect hashing
  • Minimal perfect hashing
  • Application in bioinformatics (SSHash)
  • Focus on query speed/throughput;

1.1 Contributions

We introduce PtrHash, a minimal perfect hash function that is primarily optimized for query throughput and construction speed, at the cost of slightly more memory usage. Its main novelties are:

  1. the use of fixed-width \(8\) bit pilots, leading to very simple lookups;
  2. a cuckoo-hashing-like construction that evicts buckets when no collision-free pilot exists;
  3. construction in parts, with a single global remap table, leading to fast construction.

2 Background

  • GB vs GiB

2.1 Single large seed

2.2 Recsplit?

2.3 FCH

2.4 PTHash

2.5 PTHash-HEM

2.6 PHOBIC

2.7 EF coding

2.7.1 Uniform Partitioned EF

2.8 Cachelines

3 PtrHash

The core principle of PtrHash1 is to simplify PTHash to speed up both queries and construction, at the cost of using slightly more memory. We first give a high-level overview of PtrHash (3.1). Then, we highlight the differences compared to PTHash and PHOBIC (3.2), after which we explain specific parts of PtrHash in more detail.

3.1 Overview

Figure 1: Overview of PtrHash. In red are four keys belonging to the same bucket in the first part, and in blue are three keys belonging to the same bucket in the second part. The \(P\cdot B=10\) buckets in the highlighted area are the main component of the data structure, mapping corresponding keys to slots for the same part. The blue highlighted key is initially mapped to a position \(\geq n\), and thus remapped into one of the empty slots \(<n\), along with the other yellow marked cells.


Before going into details, we first briefly explain the fully constructed PtrHash data structure and how to query it, see Figure 1 and Code Snippet 1.

The input is a set of \(n\) keys \(\{k_0, \dots, k_{n-1}\}\) that we want to hash to \(n\) slots \([n]:=\{0, \dots, n-1\}\). These keys are hashed using a hash function \(h\) into \(\{h(k_0), \dots, h(k_{n-1})\}\). The total space of hashes \([2^{64}]\) is equally partitioned into \(P\) parts. Further, each part is partitioned into \(B\) buckets. Each key \(k_i\) is uniformly mapped to a part using the ‘fast range’ method (Lemire 2019) that interprets \(h(k_i)/2^{64}\) as a value in \([0, 1)\). Then, its relative position inside the part, \(x\), is passed through a \([0,1)\mapsto[0,1)\) bucket assignment function \(\beta\) such as \(x\mapsto x\) or \(x\mapsto x^2\). This controls the distribution of expected bucket sizes, as explained in detail in 3.5. The result is then scaled to a bucket index in \([B]\):

\begin{align} part(k_i) &:= \left\lfloor P\cdot h(k_i) / 2^{64}\right\rfloor,\\ x &:= \big((P\cdot h(k_i)) \bmod 2^{64}\big)/2^{64},\\ bucket(k_i) &:= \left\lfloor B\cdot \beta(x)\right\rfloor. \end{align}
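
As a concrete illustration, the following sketch (not the actual implementation) computes the part and bucket of a 64-bit hash using only multiplications; `beta` is assumed to be a bucket assignment function operating on 64-bit fixed-point fractions of \(2^{64}\).

// Minimal sketch of the part/bucket computation above (not the actual implementation).
// `h` is the 64-bit hash, `p` the number of parts, `b` the buckets per part, and
// `beta` an assumed bucket assignment function on 64-bit fixed-point fractions of 2^64.
fn part_and_bucket(h: u64, p: u64, b: u64, beta: impl Fn(u64) -> u64) -> (usize, usize) {
    // part = floor(p * h / 2^64): the high 64 bits of the product ("fast range").
    let part = ((p as u128 * h as u128) >> 64) as usize;
    // x = (p * h) mod 2^64, i.e. the position of h within its part as a fraction of 2^64.
    let x = p.wrapping_mul(h);
    // bucket = floor(b * beta(x) / 2^64), again via the high half of a product.
    let bucket = ((b as u128 * beta(x) as u128) >> 64) as usize;
    (part, bucket)
}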

Now, the goal is to map the \(n/P\) expected keys in each part to \(S\approx (n/P)/\alpha\) slots, where \(\alpha\approx 0.98\) gives us \(\approx 2\%\) extra slots to play with. The main data structure to achieve this mapping is a list of \(8\)-bit pilots \(\{p_0, \dots, p_{P\cdot B-1}\}\), one for each bucket. The pilots control to which slot the keys in each bucket map. Specifically, key \(k_i\) in bucket \(b=bucket(k_i)\) with pilot \(p_b\) maps to slot

\begin{equation} slot(k_i) := part(k_i) \cdot S + reduce(h(k_i) \oplus h_p(p_b), S),\label{eq:slot} \end{equation}

where \(reduce(x, S)\) maps the random \(64\) bit integer to \([S]\) as explained below.

Hash functions. Starting simple, the pilots \(p_b\) are hashed into pseudo-random \(64\) bit integers using Fx Hash for \(h_p\), which simply multiplies the pilot by a mixing constant \(C\).
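
As a sketch, this amounts to a single wrapping multiplication; the constant below is an example and not necessarily the one used in the implementation.

// Sketch of FxHash-style mixing of a pilot: a single wrapping multiplication by an
// odd 64-bit constant. The constant is an example, not necessarily the one used by PtrHash.
const C: u64 = 0x517cc1b727220a95;

fn hash_pilot(pilot: u8) -> u64 {
    C.wrapping_mul(pilot as u64)
}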

When the keys are \(64\) bit integers we use this same Fx Hash algorithm to hash them, since multiplying by an odd constant is invertible modulo \(2^{64}\) and hence a perfect hash. For other types of keys, the hash function depends on the number of elements. When the number of elements is not too far above \(10^9\), the probability of hash collisions with a \(64\) bit hash function is sufficiently small, and we use the \(64\) bit variant of xxHash (TODO). When the number of keys goes beyond \(2^{32} \approx 4\cdot 10^9\), the probability of \(64\) bit hash collisions increases; such collisions can be detected after sorting the hashes. In this case, the \(128\) bit variant of xxHash (TODO) is used, where the high \(64\) bits determine the part and bucket, and the low \(64\) bits are used in Equation \ref{eq:slot} to determine the slot.

TODO murmur vs xx vs wy, and actually implement this.

The reduce function. When \(64\) bit hashes are used, we must ensure that the full entropy of the hash is used. A simple choice would be \(reduce(x,S) = x\bmod S\), which uses all bits when \(S\) is not a power of \(2\) and takes two multiplications using ‘fast mod’ (Lemire, Kaser, and Kurz 2019). Instead, we take \(S=2^s\), so that \(x\bmod 2^s\) is a simple bit-mask. Unfortunately, this only uses the lower \(s\) bits of the hash, while the \(part\) and \(bucket\) functions use the high \(\log_2(P\cdot B)\) bits, leaving some entropy in the middle bits unused.

As a solution, we first multiply \(x\) by the mixing constant \(C\), and then take the low \(s\) bits of the high half. This uses the full entropy of the input and takes only a single multiplication, giving around \(5\%\) query speedup over fast mod.

\begin{equation} reduce(x, 2^s) := \left\lfloor C\cdot x/2^{64}\right\rfloor \bmod 2^s \end{equation}
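
A sketch of this reduce step, assuming \(s<64\) and an example mixing constant:

// Sketch of reduce(x, 2^s): take the high 64 bits of C*x, then mask out the low s bits.
// Assumes s < 64; the constant is an example, not necessarily the one used by PtrHash.
fn reduce(x: u64, s: u32) -> usize {
    const C: u64 = 0x517cc1b727220a95;               // example odd mixing constant
    let hi = ((C as u128 * x as u128) >> 64) as u64; // floor(C * x / 2^64)
    (hi & ((1u64 << s) - 1)) as usize                // ... mod 2^s via a bit mask
}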

Remapping. Since each part has slightly more slots than keys, some keys will map to an index \(\geq n\), leading to a non-minimal perfect hash function. To fix this, those are remapped back into the ‘gaps’ left behind in slots \(<n\), which is explained in detail in 3.4.

Construction. The main difficulty of PtrHash lies in its construction (3.3), where we must find values of the pilots \(p_j\) such that all keys indeed map to different slots. Like other methods, PtrHash processes multiple parts in parallel. It sorts the buckets within each part from large to small and ‘greedily’ assigns them the smallest pilot that maps the corresponding keys to slots that are still free. Unlike other methods though, PtrHash only allows pilots up to \(255\). When no suitable pilot is found, we use a method similar to (blocked) cuckoo hashing: a pilot with a minimal number of collisions is chosen, and the colliding buckets are ‘evicted’ and have to find a different pilot.

Parameter values. In practice, we fix the number of slots per part, \(S\), to be around \(2^{18}\approx 262\ 000\), since this allows the construction of each part to fit in the L2 cache of each core. The number of parts \(P\) is then \((n/S)/\alpha\). The load factor \(\alpha\) is around \(0.98\), so that each part of \(S\) slots has \(\alpha \cdot S\) keys in expectation. In order to avoid overly full parts with more keys than slots, a smaller \(\alpha\) is sometimes required when there are many parts.

Similar to FCH and PTHash, the total number of buckets \(P\cdot B\) is roughly \(c\cdot n / \log_2(n)\), where \(c\) ranges from \(\approx 6\) to \(\approx 10\) TODO.

TODO: Instead use average bucket size \(\lambda\)?
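
The following sketch summarizes these parameter choices; the exact rounding and defaults are assumptions for illustration, not the implementation's.

// Sketch of the parameter choices described above. The rounding and default values
// are assumptions for illustration; assumes n is large enough that log2(n) > 0.
fn parameters(n: usize, alpha: f64, c: f64) -> (usize, usize, usize) {
    let s = 1usize << 18;                                     // slots per part, S = 2^18
    let p = ((n as f64 / alpha) / s as f64).ceil() as usize;  // number of parts P
    let total_buckets = (c * n as f64 / (n as f64).log2()).ceil() as usize;
    let b = total_buckets.div_ceil(p);                        // buckets per part B
    (p, b, s)
}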

struct PtrHash {
    n: usize,         // Number of elements
    P: usize,         // Number of parts
    B: usize,         // Buckets per part
    S: usize,         // Slots per part
    lgS: usize,       // S = 2^lgS
    pilots: Vec<u8>,  // P*B pilots
    free: Vec<usize>, // P*S-n remap indices
}

/// Multiply a and b as if they are fractions of 2^64.
/// Compiles to taking the high 64 bits of the 64x64->128 multiplication.
fn mul(a: usize, b: usize) -> usize {
    ((a as u128 * b as u128) >> 64) as usize
}

impl PtrHash {
    fn query(&self, key: Key) -> usize {
        let h = self.hash(key);
        // Part of h, and its position within the part as a fraction of 2^64.
        let part = mul(self.P, h);
        let x = self.P.wrapping_mul(h);
        // Global bucket index: the part offset plus the bucket within the part.
        let bucket = part * self.B + mul(self.B, self.beta(x));
        let pilot = self.pilots[bucket];
        // reduce(h ^ hash_pilot(pilot), S), using that S = 2^lgS.
        let slot_in_part = mul(self.C, h ^ self.hash_pilot(pilot)) & (self.S - 1);
        let slot = (part << self.lgS) + slot_in_part;
        if slot < self.n {
            slot
        } else {
            self.free[slot - self.n]
        }
    }
}
Code Snippet 1: Rust code for a simple implementation of the data structure and query function.

3.2 Comparison to PTHash and PHOBIC

Compared to PTHash (Pibiri and Trani 2021b), PtrHash has a few differences:

  • Single bucket size. Following FCH (Fox, Chen, and Heath 1992), PTHash (Pibiri and Trani 2021b) uses small and large buckets to speed up the construction and decrease memory usage. Similarly, PHOBIC uses a function that creates a near-optimal distribution of bucket sizes, saving up to \(0.14\) bits/key over the small/large buckets (Hermann et al. 2024). For PtrHash we prefer simplicity, and all buckets have the same expected size.

  • Pilot encoding. PTHash offers a few encoding schemes for the pilots: compact encoding (storing each pilot with exactly as many bits as are needed for the largest pilot), dictionary encoding (storing a list of all pilot values, and replacing each pilot with an index in the list), and Elias-Fano encoding. Additionally the small and large buckets can use different encoding schemes. PHOBIC offers an additional space saving of \(0.06\) bits/key by interleaving the pilots of each part.

    For PtrHash, all pilots are exactly \(8\) bits, and we simply store them as a vector of bytes, removing the need for additional logic and memory accesses during their lookup.

  • Parts. PTHash-HEM (Pibiri and Trani 2021a) and PHOBIC split the keys into parts, and then work on each part independently. For a part containing \(m\) keys, they use \(m/\alpha\) slots (with \(\alpha=1\) for PHOBIC). This means that for each query, a lookup is required to determine the slot where the current part starts.

    PtrHash, on the other hand, assigns the same number of slots to each part, so that no additional lookups are needed.

  • Part size. PTHash only uses a single part for all keys. PHOBIC, instead, uses relatively small parts of expected size \(2500\). PtrHash chooses the part size such that construction of each part roughly fits in the L2 cache of each CPU core, which is around \(250\ 000\) in practice.

  • Remapping. PTHash-HEM supports construction by parts and ensures that each part of \(m\) elements maps to \(m\) consecutive slots, by remapping per part. PHOBIC does not use remapping since it does not use an \(\alpha<1\). PtrHash, instead, does a single global remap over all parts.

    Additionally, PtrHash introduces a new encoding for the remapped values, see 3.4.

  • Streaming queries. Lastly, PtrHash supports streaming queries, allowing it to prefetch pilots from memory and better use the available memory bandwidth.

3.3 Construction

Both PTHash-HEM and PHOBIC first partition the keys into parts, and then build an MPHF part-by-part, optionally in parallel on multiple threads. Within each part, the keys are randomly partitioned into buckets of expected size \(\lambda\) (Figure 1). Then, the buckets are sorted from large to small, and one-by-one greedily assigned a pilot, such that the keys in the bucket map to slots not yet covered by earlier buckets.

[Drop/dedup with overview/move to background?] As observed for PTHash, searching for pilots becomes harder as the load factor (the fraction of already filled slots) increases. Hence, PTHash uses \(n/\alpha > n\) slots to reduce the construction time and to decrease the pilot values, making their encoding more efficient. PHOBIC, on the other hand, uses relatively small parts of size \(2500\), so that the search for the last empty slot usually does not take much more than \(2500\) attempts. Nevertheless, a drawback of the greedy approach is that the pilots have an uneven distribution, causing sub-optimal fixed-width compression.

Hash-evict2. In PtrHash, we instead always use fixed-width single byte pilots. To achieve this, we use a technique resembling cuckoo hashing (Pagh and Rodler 2001). As before, buckets are greedily inserted from large to small. For some buckets, there may be no pilot in \([256]\) such that all of its keys map to empty slots. When this happens, a pilot is chosen with the lowest weighted number of collisions. The weight of a collision with a bucket of size \(s\) is \(s^2\), to discourage collisions with large buckets, as those are harder to place. The colliding buckets are then evicted by emptying the slots they map to and pushing them back onto the priority queue of remaining buckets. Then, the new bucket is inserted.

/// Given the buckets of hashed keys for a part, search for pilot values.
fn pilots_for_part(&self, buckets: Vec<&[Hash]>) -> Vec<u8> {
    let mut pilots = vec![0; self.B];                    // One pilot per bucket.
    let mut slots = vec![None; self.S];       // Bucket idx mapping to each slot.

    // A priority queue (max-heap) of buckets, ordered by bucket size.
    let mut queue = BinaryHeap::from_iter(
        (0..buckets.len()).map(|i| (buckets[i].len(), i))
    );

    while let Some((_, i)) = queue.pop() {       // Insert next-largest bucket i.
        pilots[i] = self.find_pilot(buckets[i], &mut slots);
        for &h in buckets[i] {
            let slot = self.slot_for_hashed_key(h, pilots[i]);
            if let Some(j) = slots[slot] {           // Evict colliding bucket j.
                for &h_j in buckets[j] {
                    let slot_j = self.slot_for_hashed_key(h_j, pilots[j]);
                    slots[slot_j] = None;
                }
                queue.push((buckets[j].len(), j));
            }
            slots[slot] = Some(i);
        }
    }

    pilots
}
Code Snippet 2: Conceptual Rust code for determining the pilot values for each part. In practice, a number of optimizations are made.

Optimizations. To speed up the search for pilots, a number of optimizations are made to the conceptual idea of Code Snippet 2.

  1. taken bit mask. Instead of determining whether a slot is free by checking the slots array for the optional index of the bucket mapping there, we keep a separate bit mask taken that takes only \(1\) bit instead of \(32\) bits per element. This allows for better caching and hence faster access.
  2. Collision-free hot path. When searching for pilots, we first test if there is a pilot without any collisions. This is usually the case, and is faster since it only needs access to taken, not slots. Additionally, when there is a collision, we know a pilot is optimal when it collides with exactly one bucket of minimal size. A sketch of this hot path follows after this list.
  3. Avoiding loops. To avoid repeated patterns of the same buckets evicting each other, the search for a pilot starts at a random number in \([256]\), rather than at \(0\).
  4. Avoiding loops more. Each time a bucket is placed that evicted some other bucket(s), it is added to a list of the \(16\) most recently placed buckets. Buckets in this list are never evicted. This avoids short cycles, where for example two buckets keep evicting each other for the same slot.
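
The sketch below illustrates the collision-free hot path of optimization 2, using only the taken bit mask. For brevity it ignores collisions between keys of the same bucket, which the real implementation also has to check; slot_in_part is an assumed helper computing reduce(h ^ hash_pilot(p), S).

// Sketch of the collision-free hot path: scan pilots starting at a pseudo-random
// offset and return the first one for which all slots are free according to the
// `taken` bit mask (one bit per slot of the part).
fn find_pilot_free(
    hashes: &[u64],                           // hashes of the keys in this bucket
    taken: &[u64],                            // S bits: which slots are already filled
    start: u8,                                // pseudo-random starting pilot
    slot_in_part: impl Fn(u64, u8) -> usize,  // reduce(h ^ hash_pilot(p), S)
) -> Option<u8> {
    'pilots: for d in 0u16..256 {
        let p = start.wrapping_add(d as u8);
        for &h in hashes {
            let slot = slot_in_part(h, p);
            if (taken[slot / 64] >> (slot % 64)) & 1 == 1 {
                continue 'pilots;             // Slot occupied: try the next pilot.
            }
        }
        return Some(p);                       // All slots free: use this pilot.
    }
    None                                      // No collision-free pilot: evict instead.
}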

Analysis. Unfortunately, we do not currently have a formal analysis showing that the hash-evict method works with high probability given that certain criteria are met. In 4, we will show some practical results.

3.4 Remapping using CacheLineEF

Both PTHash and PtrHash use a parameter \(0<\alpha\leq 1\) to use a total of \(n’=n/\alpha\) slots, introducing \(n’-n\) additional free slots. As a result of the additional slots, some, say \(R\), of the keys will map to positions \(n\leq p_0<\dots< p_{R-1}< n’\), causing the perfect hash function to not be minimal.

Remapping. Since there are a total of \(n\) keys, this means there are exactly \(R\) empty slots (‘gaps’) left behind in \([n]\), say at positions \(L_0\) to \(L_{R-1}\). We remap the keys that map to positions \(\geq n\) to the empty slots at positions \(< n\) to obtain a minimal perfect hash function.

A simple way to store the remap is as a plain array \(free\), such that \(free[p_i-n] = L_i\). PTHash encodes this array using Elias-Fano coding (Elias 1974; Fano 1971), after setting undefined positions of \(free\) equal to their predecessor. The benefit of the plain \(free\) array is fast and cache-local lookups, whereas Elias-Fano coding provides a more compact encoding that requires multiple lookups to memory.
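
As a sketch, the plain \(free\) array can be built in a single pass over the slots; here `taken[i]` indicates whether slot \(i\) received a key, and entries for unused slots \(\geq n\) get a placeholder since they are never queried.

// Sketch of building the plain `free` remap array: keys that landed in occupied
// slots >= n are redirected, in order, to the empty slots ("gaps") < n.
fn build_free(taken: &[bool], n: usize) -> Vec<usize> {
    let mut gaps = (0..n).filter(|&i| !taken[i]);   // empty slots below n, increasing
    (n..taken.len())
        .map(|i| if taken[i] { gaps.next().unwrap() } else { usize::MAX })
        .collect()
}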

CacheLineEF. We propose using Elias-Fano coding on a per-cache line basis, so that each lookup only requires a single read from memory. First, the list of non-decreasing \(free\) positions is split into chunks of \(L=44\) values \(\{v_0, \dots, v_{43}\}\), with the last chunk possibly containing fewer values. Then, each chunk is encoded into \(64\) bytes that can be stored as a single cache line, as shown in Figure 2.

We first split all indices into their \(8\) low bits (\(v_i \bmod 2^8\)) and \(32\) high bits (\(\lfloor v_i/2^8\rfloor\)). Further, the high part is split into an offset (the high part of \(v_0\)) and the relative high part: \[ v_i = (v_i\bmod 2^8) + 2^8\cdot\lfloor v_0/256\rfloor + 2^8\cdot \left(\lfloor v_i/256\rfloor - \lfloor v_0/256\rfloor\right). \] This is stored as follows.

  • First, the \(8\) low bits of each \(v_i\) are directly written to the \(44\) trailing bytes.
  • Next, the \(32\) bit offset \(\lfloor v_0/256\rfloor\) is stored.
  • Lastly, the relative high parts are encoded into \(128\) bits. For each \(i\in[44]\), bit \(i + \lfloor v_i/256\rfloor - \lfloor v_0/256\rfloor\) is set to 1. Since the \(v_i\) are increasing, each \(i\) sets a distinct bit, for a total of \(44\) set bits.

Figure 2: Overview of the CacheLineEF data structure.


Lookup. The value at position \(i\) is found by summing (1) the \(8\) low bits, (2) the offset multiplied by \(256\), and (3) the relative high part. This last part can be found as \(256\cdot(select(i)-i)\), where \(select(i)\) gives the position of the \(i\)’th 1 bit. In practice, this can be implemented efficiently using the PDEP instruction provided by the BMI2 bit manipulation instruction set (Pandey, Bender, and Johnson 2017): this operation can deposit the mask 1<<i onto our bit pattern, so that the 1 ends up at the position of the \(i\)’th one of our pattern. Then, it suffices to count the number of trailing zeros, which is provided by the TZCNT instruction in BMI1.
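
A sketch of this select operation on a single 64-bit word; it assumes an x86_64 CPU with BMI2 and a build that enables it (e.g. -C target-cpu=native).

// Sketch of select(i) within a 64-bit word using PDEP + TZCNT, as described above.
#[cfg(target_arch = "x86_64")]
fn select_in_word(word: u64, i: u32) -> u32 {
    // Deposit the single bit 1<<i onto the set bits of `word`; the result has a 1
    // exactly at the position of the i'th set bit of `word`.
    let deposited = unsafe { core::arch::x86_64::_pdep_u64(1u64 << i, word) };
    deposited.trailing_zeros() // TZCNT: index of that bit.
}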

Limitations. CacheLineEF uses \(64/44\cdot 8 = 11.6\) bits per value, which is more than plain Elias-Fano coding, which takes for example \(8+2=10\) bits per value for data with an average stride of \(256\). Furthermore, values are limited to \(40\) bits, covering up to \(2^{40}\approx 10^{12}\) items. The range could be increased to \(48\) bit numbers by storing \(5\) bytes of the offset, but this has not been necessary so far. Lastly, each CacheLineEF can only span a range of around \((128-44)\cdot 256 = 21\ 504\), or an average stride of around \(500\) between consecutive values. For PtrHash, we use \(\alpha\leq 0.99\), and hence the average distance between empty slots is at most \(100\), so that in practice the average stride never exceeds \(500\).

Comparison. Compared to Elias-Fano coding, CacheLineEF stores the low order bits as exactly a single byte, removing the need for unaligned reads. Further, the select data structure on the high-order bits is replaced by a few local bit-wise operations. CacheLineEF is also somewhat similar to the (Uniform) Partitioned Elias-Fano Index of Ottaviano and Venturini (2014), in that both split the data. The uniform partitioned index also uses fixed part sizes, but encodes them with variable widths, and adds a second level of EF to encode the part offsets. Instead, CacheLineEF prefers simplicity and uses fixed part sizes with a constant width encoding and simply stores the offsets directly.

const L: usize = 44; // The number of elements per cache line.

#[repr(C)]
#[repr(align(64))]   // Align the 64byte object to cache lines.
pub struct CacheLineEF {
    high: [u64; 2],  // Encoding of the high bits.
    offset: u32,     // Offset of the first element.
    low: [u8; L],    // Low 8 bits of each element.
}

impl CacheLineEF {
    fn new(vals: &[u64; L]) -> Self {
        let offset = vals[0] >> 8;
        let mut low = [0u8; L];
        for (i, &v) in vals.iter().enumerate() {
            low[i] = (v & 0xff) as u8;
        }
        let mut high = [0u64; 2];
        for (i, &v) in vals.iter().enumerate() {
            let idx = i + ((v >> 8) - offset) as usize;
            high[idx / 64] |= 1 << (idx % 64);
        }
        Self {
            offset: offset as u32,
            high,
            low,
        }
    }

    fn get(&self, idx: usize) -> u64 {
        let p = self.high[0].count_ones() as usize;
        // Select the position of the 1 using the BMI2 PDEP instruction.
        let one_pos = if idx < p {
            self.high[0].select_in_word(idx)
        } else {
            64 + self.high[1].select_in_word(idx - p)
        };

        self.low[idx] as u64
            + 256 * self.offset as u64
            + 256 * (one_pos - idx) as u64
    }
}
Code Snippet 3: Code for constructing and querying CacheLineEF.

3.5 Bucket assignment functions

The bucket assignment function \(\beta\colon[0,1)\mapsto[0,1)\) controls the distribution of expected bucket sizes; a reference sketch follows the list below.

  • Function: \(\beta_*(x) = x+(1-x)\ln(1-x)\)
  • Practical function: \(\beta_\varepsilon(x) = x + (1-\varepsilon)(1-x)\ln(1-x)\).
  • Implementation: for a \(64\) bit number \(x\) interpreted as a fraction of \(2^{64}\), \(1-x\) is its negation. \(\log_2(1-x)\) is approximately the negative of the number of leading zeros, which we scale by \(\ln 2\) to approximate \(\ln(1-x)\).
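
For reference, a floating-point sketch of \(\beta_\varepsilon\) on 64-bit fixed-point fractions of \(2^{64}\); this only illustrates what the function computes and is not the integer approximation described above.

// Floating-point reference sketch of beta_eps(x) = x + (1-eps)(1-x)ln(1-x),
// where x is a 64-bit fixed-point fraction of 2^64.
fn beta_eps(x: u64, eps: f64) -> u64 {
    let xf = x as f64 / 2f64.powi(64);           // x as a real number in [0, 1)
    let rest = 1.0 - xf;
    let y = if rest > 0.0 {
        xf + (1.0 - eps) * rest * rest.ln()      // (1-x) ln(1-x) <= 0, so y <= x
    } else {
        xf
    };
    // Map back to a 64-bit fraction, clamping to stay within [0, 2^64).
    (y.clamp(0.0, 1.0 - f64::EPSILON) * 2f64.powi(64)) as u64
}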

3.6 Parallel queries

Throughput. In practice, in bioinformatics applications such as SSHash, we expect many independent queries to the MPHF. This means that queries can be answered in parallel, instead of one by one. Thus, we should optimize for query throughput (queries per second, usually reported as inverse throughput in amortized seconds per query) rather than individual query latency (seconds per query).

Pipelining. A typical MPHF on \(10^9\) keys requires at least \(2\,\text{bits/key} \cdot 10^9\,\text{keys} = 250\,\text{MB}\) of memory, which is much larger than the L3 cache of size around \(16\,\text{MB}\). Thus, most queries require reading from main memory (RAM), which usually has a latency around \(80\,\text{ns}\). Nevertheless, existing MPHFs such as FCH (Fox, Chen, and Heath 1992) achieve an inverse throughput as low as \(35\,\text{ns}\)/query on such a dataset (Pibiri and Trani 2021b). This is achieved by pipelining and the reorder buffer: Intel Skylake CPUs can execute over \(200\) instructions ahead while waiting for memory to become available (Wong 2013; Downs 2017). This allows the CPU to already start processing ‘future’ queries and fetch the required cache lines from RAM while waiting for the current query. Thus, when each iteration requires fewer than \(100\) instructions and there are no branch misses, this effectively allows up to two reads from memory in parallel. A large part of ensuring faster queries is then to reduce the length of each iteration, so that pipelining can fetch memory more iterations ahead.

Prefetching. Instead of relying on the CPU hardware to parallelize requests to memory, we can also explicitly prefetch3 cache lines from software. Each prefetch requires a line fill buffer to store the result before it is copied into the L1 cache. Skylake has \(12\) line fill buffers (Downs 2018), and hence can support up to \(12\) parallel reads from memory. This gives a maximal theoretical throughput of around \(80/12 = 6.67\,\text{ns}\)/query, as long as each query only requires a single read from main memory.

We consider two models to implement prefetching: batching and streaming.

Figure 3: Simplified schematic of in-progress reads from main memory (RAM) when using two different prefetching approaches, processing up to \(8\) reads in parallel. Each horizontal line indicates the duration a read is in progress, from the moment it is prefetched (left vertical bar) to the moment it is available in L1 cache and its corresponding line fill buffer is free again (right vertical bar). Streaming (right) provides better parallelism than batching (left).


Batching. In this approach, the queries are split into chunks of size \(B\), and the chunks are then processed one by one (Figure 3, left). In each chunk, each key is hashed, its bucket is determined, and the cache line containing the corresponding pilot is prefetched. Then, the buffered hashes are iterated again, and the corresponding slots are computed.
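
A sketch of this batching scheme, reusing the helpers of Code Snippet 4 (hash_key, bucket, slot, prefetch_index); it is not the actual implementation and omits the final remap of slots \(\geq n\).

// Sketch of batched queries, reusing the helpers of Code Snippet 4.
// (The remap of slots >= n is omitted for brevity; see Code Snippet 4.)
fn index_batch<const B: usize>(&self, keys: &[Key; B]) -> [usize; B] {
    let mut hashes: [Hx::H; B] = [Hx::H::default(); B];
    let mut buckets: [usize; B] = [0; B];
    // Pass 1: hash all keys and prefetch the cache lines holding their pilots.
    for i in 0..B {
        hashes[i] = self.hash_key(&keys[i]);
        buckets[i] = self.bucket(hashes[i]);
        crate::util::prefetch_index(self.pilots, buckets[i]);
    }
    // Pass 2: the pilots are now (hopefully) in cache; compute the slots.
    core::array::from_fn(|i| self.slot(hashes[i], self.pilots[buckets[i]]))
}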

Streaming. A drawback of batching is that at the start and end of each batch, the maximal number of parallel prefetches is not fully saturated. The streaming approach fixes this by prefetching the cache line for the pilot \(B\) iterations ahead of the current one, and is able to sustain the maximum possible number of parallel prefetches throughout, apart from at the very start and end (Figure 3, right).

3.7 External memory construction: Shards

When the number of keys is large, say \(>10^{10}\), their \(64\) bit hashes may not all fit in memory at the same time, even though the final PtrHash data structure (the list of pilots) would fit. Thus, we cannot simply sort all hashes in memory to partition them. Instead, we split the set of all \(n\) hashes into, say, \(s=\lceil n/2^{32}\rceil\) shards, where the \(i\)’th shard corresponds to hash values in \(s_i:=[2^{64}\cdot i/s, 2^{64}\cdot (i+1)/s)\). Then, the hashes in each shard are sorted and split into parts, after which the parts are constructed as usual. This way, the shards only play a role during construction, and the final data structure is independent of the sharding strategy used.

In memory sharding. The first approach to sharding is to iterate over the set of keys \(s\) times. In the \(i\)’th iteration, all keys are hashed, but only the hashes in the corresponding interval \(s_i\) are stored and processed. This way, no additional disk space is needed for construction.
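
A sketch of one such pass; the helper names are assumptions for illustration.

// Sketch of in-memory sharding: one pass over the keys that keeps only the hashes
// falling into shard i, i.e. those with floor(s * h / 2^64) == i.
fn hashes_for_shard<K>(keys: &[K], hash: impl Fn(&K) -> u64, i: usize, s: usize) -> Vec<u64> {
    keys.iter()
        .map(hash)
        .filter(|&h| ((s as u128 * h as u128) >> 64) as usize == i)
        .collect()
}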

On disk sharding. A drawback of the first approach is that keys are potentially hashed many times. This can be avoided by writing hashes to disk. Specifically, we can create one file per shard and append hashes to their corresponding file. These files are then read and processed one by one.
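
A sketch of the on-disk variant; the file layout is a hypothetical example and error handling is minimal.

use std::{fs::File, io::{BufWriter, Write}, path::Path};

// Sketch of on-disk sharding: append each 64-bit hash to the file of its shard,
// so that the shards can later be read back and processed one by one.
fn write_shards(
    hashes: impl Iterator<Item = u64>,
    s: usize,
    dir: &Path,
) -> std::io::Result<()> {
    let mut files = Vec::with_capacity(s);
    for i in 0..s {
        files.push(BufWriter::new(File::create(dir.join(format!("shard-{i}.bin")))?));
    }
    for h in hashes {
        let shard = ((s as u128 * h as u128) >> 64) as usize; // same split as above
        files[shard].write_all(&h.to_le_bytes())?;
    }
    Ok(())
}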

4 Results

  • lscpu | grep CPU CPU: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz
    • Running at 3.6GHz
    • Cache sizes 32KiB/256KiB/12MiB
    • 12 line fill buffers
    • hyperthreading disabled
  • Ram: sudo lshw -short -C memory 2x 32GiB SODIMM DDR4 synchronous 3200 MHz.
  • Query throughput vs threads
    • Plot of max random-access memory throughput.
  • Construction time vs part size
  • Single bucket size vs 2-way split vs PHOBIC
    • Construction speed
    • Sequential lookup
    • Prefetching
  • Measure at 25M keys (which at 10MB fits in 12MB L3 cache) and 1G keys (which is 20x larger than L3)
  • Comparison of bucketing functions; plot over ‘construction time’ with pilot value and number of evictions.
  • Prefetching
    • streaming/batching
    • distance 1..128
    • CPU freq 2.6 and 3.6 GHz?
    • with/without remap
    • Compare prefetch0/1/2/NTA
  • Parameters for small cases,
  • 10^10 or 10^11 construction with both sharding variants.

5 Conclusion and future work

Future work.

  • Faster than memory is possible, when most queries can be answered from a smaller cache.

6 Acknowledgements

  • Giulio for ongoing discussions
  • Sebastiano for trying the 10^12 construction.

7 Appendix

7.1 Rust and assembly code for streaming

Code Snippet 4 shows the Rust code for the streaming version of PtrHash, and Code Snippet 5 shows the corresponding assembly code with perf record results.

pub fn index_stream<'a, const B: usize, const MINIMAL: bool>(
    &'a self,
    keys: impl IntoIterator<Item = &'a Key> + 'a,
) -> impl Iterator<Item = usize> + 'a {
    // Append B values at the end of the iterator to make sure we wrap sufficiently.
    let mut hashes = keys.into_iter().map(|x| self.hash_key(x)).chain([0; B]);

    // Ring buffers to cache the hash and bucket of upcoming queries.
    let mut next_hashes: [Hx::H; B] = [Hx::H::default(); B];
    let mut next_buckets: [usize; B] = [0; B];

    // Initialize and prefetch first B values.
    for idx in 0..B {
        next_hashes[idx] = hashes.next().unwrap();
        next_buckets[idx] = self.bucket(next_hashes[idx]);
        crate::util::prefetch_index(self.pilots, next_buckets[idx]);
    }
    hashes.enumerate().map(move |(idx, next_hash)| {
        let idx = idx % B;
        let cur_hash = next_hashes[idx];
        let cur_bucket = next_buckets[idx];
        let pilot = self.pilots[cur_bucket];
        let mut slot = self.slot(cur_hash, pilot);
        if MINIMAL && slot >= self.n {
            slot = self.remap.index(slot - self.n) as usize;
        };

        // Prefetch B iterations ahead.
        next_hashes[idx] = next_hash;
        next_buckets[idx] = self.bucket(next_hashes[idx]);
        crate::util::prefetch_index(self.pilots, next_buckets[idx]);

        slot
    })
}
Code Snippet 4: Rust code for streaming indexing that prefetches \(B\) iterations ahead.

 2.57  a0:   lea        (%r14,%rbp,1),%r12d
 0.95        mov        0x8(%rsp),%rdx
16.93        mov        (%rdx,%r14,8),%rdx
 0.80        imul       %r11,%rdx
 2.30        and        $0x1f,%r12d
 0.90        mov        0x8(%rcx,%r12,8),%rsi
 1.36        mulx       %rbx,%r8,%r9
 2.24        mov        0x108(%rcx,%r12,8),%r10
 0.92        mov        %rdx,0x8(%rcx,%r12,8)
 0.48        mov        %r8,%rdx
 2.99        mulx       %r8,%rdx,%rdx
 1.03        mov        0x20(%rsp),%r8
 1.44        mulx       %r8,%rdx,%rdx
 2.15        imul       0x18(%rsp),%r9
 1.08        add        %rdx,%r9
 0.83        mov        %r9,0x108(%rcx,%r12,8)
46.61        prefetcht0 (%r15,%r9,1)            ; Nearly half the time is spent here.
 1.39        movzbl     (%r15,%r10,1),%r8d
 0.54        imul       %r11,%r8
 0.31        xor        %rsi,%r8
 2.34        mov        %rsi,%rdx
 1.43        mulx       %rbx,%rdx,%rdx
 0.30        shlx       %r13,%rdx,%rdx
 2.43        add        %rdx,%rax
 0.87        mov        %r8,%rdx
 0.72        mulx       %r11,%rdx,%rdx
 2.37        and        %rdi,%rdx
 0.98        add        %rdx,%rax
 0.51        inc        %r14
             cmp        %r14,0x28(%rsp)
 0.23       jne        a0
Code Snippet 5: Assembly code of streaming indexing (without the final remap) that prefetches 32 iterations ahead, with perf record measurement of the time spent on each line. TODO: Update for latest version.

References

Belazzougui, Djamal, Fabiano C. Botelho, and Martin Dietzfelbinger. 2009. “Hash, Displace, and Compress.” In Algorithms - ESA 2009, 682–93. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-04128-0_61.
Downs, Travis. 2017. “Measuring Reorder Buffer Capacity on Skylake.” https://www.realworldtech.com/forum/?threadid=166772&curpostid=167685.
———. 2018. “Measuring Reorder Buffer Capacity on Skylake.” https://github.com/Kobzol/hardware-effects/issues/1#issuecomment-441111396.
Elias, Peter. 1974. “Efficient Storage and Retrieval by Content and Address of Static Files.” Journal of the ACM 21 (2): 246–60. https://doi.org/10.1145/321812.321820.
Fan, Bin, Dave G. Andersen, Michael Kaminsky, and Michael D. Mitzenmacher. 2014. “Cuckoo Filter.” Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, December. https://doi.org/10.1145/2674005.2674994.
Fano, R.M. 1971. On the Number of Bits Required to Implement an Associative Memory. Memorandum 61, Computation Structures Group. MIT Project MAC Computer Structures Group. https://books.google.ch/books?id=07DeGwAACAAJ.
Fox, Edward A., Qi Fan Chen, and Lenwood S. Heath. 1992. “A Faster Algorithm for Constructing Minimal Perfect Hash Functions.” Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR ’92. https://doi.org/10.1145/133160.133209.
Hermann, Stefan, Hans-Peter Lehmann, Giulio Ermanno Pibiri, Peter Sanders, and Stefan Walzer. 2024. “PHOBIC: Perfect Hashing with Optimized Bucket Sizes and Interleaved Coding.” Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPICS.ESA.2024.69.
Lemire, Daniel. 2019. “Fast Random Integer Generation in an Interval.” ACM Transactions on Modeling and Computer Simulation 29 (1): 1–12. https://doi.org/10.1145/3230636.
Lemire, Daniel, Owen Kaser, and Nathan Kurz. 2019. “Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries.” Software: Practice and Experience 49 (6): 953–70. https://doi.org/10.1002/spe.2689.
Ottaviano, Giuseppe, and Rossano Venturini. 2014. “Partitioned Elias-Fano Indexes.” In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. SIGIR ’14. ACM. https://doi.org/10.1145/2600428.2609615.
Pagh, Rasmus. 1999. “Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions.” In Algorithms and Data Structures, 49–54. Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-48447-7_5.
Pagh, Rasmus, and Flemming Friche Rodler. 2001. “Cuckoo Hashing.” In Algorithms — ESA 2001, 121–33. Springer Berlin Heidelberg. https://doi.org/10.1007/3-540-44676-1_10.
Pandey, Prashant, Michael A. Bender, and Rob Johnson. 2017. “A Fast x86 Implementation of Select.” arXiv. https://doi.org/10.48550/ARXIV.1706.00990.
Pibiri, Giulio Ermanno, and Roberto Trani. 2021a. “Parallel and External-Memory Construction of Minimal Perfect Hash Functions with PTHash.” arXiv. https://doi.org/10.48550/ARXIV.2106.02350.
———. 2021b. “PTHash: Revisiting FCH Minimal Perfect Hashing.” Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, July. https://doi.org/10.1145/3404835.3462849.
Wong, Henry. 2013. “Measuring Reorder Buffer Capacity.” https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/.

  1. The PT in PTHash stands for Pilot Table. The author of the present paper mistakenly understood it to stand for Pibiri and Trani, the authors of the PTHash paper. Due to the current author’s unconventional last name, and PTGK not sounding great, the first initial (R) was appended instead. As things go, nothing is as permanent as a temporary name. Furthermore, we follow the Google style guide and avoid a long run of uppercase letters. ↩︎

  2. We would have preferred to call this method hash-displace, as displace is the term used instead of evict in e.g. the cuckoo filter by Fan et al. (2014). Unfortunately, hash and displace is also the name of another MPHF introduced by Pagh (1999), which was then extended into compressed hash-and-displace (CHD) by Belazzougui, Botelho, and Dietzfelbinger (2009). There, the to-be-inserted key (rather than the existing key) is displaced by applying a linear shift to its initial position. ↩︎

  3. There are typically multiple types of prefetching instructions that prefetch into different levels of the cache hierarchy. We prefetch into all levels of cache using prefetcht0. ↩︎