This part is based on two preprints:
SimdMinimizers, submitted to SEA25.
Groot Koerkamp, Ragnar, and Igor Martayan. 2025. “SimdMinimizers: Computing Random Minimizers, Fast,” January. https://doi.org/10.1101/2025.01.27.634998.
PtrHash, submitted to SEA25.
Groot Koerkamp, Ragnar. 2025. “Ptrhash: Minimal Perfect Hashing at RAM Throughput.” arXiv. https://doi.org/10.48550/arXiv.2502.15539.
Further, the introduction of this chapter contains some unpublished experiments on high throughput computing.
The SimdMinimizers code was largely developed by myself, with contributions from Igor Martayan to support the NEON architecture. The paper itself is an equal contribution of Igor Martayan and myself.
PtrHash is my own work.
In this chapter, the goal is fast code. Specifically, the goal is to solve a given problem as fast as possible by fully exhausting the hardware. In fact, the goal is not just fast code in itself, but provably fast code: ideally, we can prove that, given some assumptions, the code runs within some percentage of the best the given hardware could achieve.
We start by introducing a few common techniques to achieve speedups. For compute-bound problems, these include SIMD, avoiding unpredictable branches, and instruction-level parallelism. For memory-bound problems, the core principle is to optimize for throughput rather than latency. Solutions include interleaving multiple queries using batching or streaming, prefetching, and cache line-aware optimization of the memory layout.
We also include some general background on benchmarking and optimizing code.
We show these techniques by optimizing parts of the SSHash $k$-mer index. This data structure takes as input a text, and builds a dictionary on its $k$-mers. It can then efficiently answer queries whether given $k$-mers occur in the input text.
The first step is to compute the minimizers of the text. As the number of minimizers is typically much lower than the number of characters in the text, we want this to be fast, so that this "compression" step is not the bottleneck.
Then, when answering queries, SSHash uses a minimal perfect hash function (MPHF) to find the index of each minimizer in its internal data structure. With PtrHash, we develop an MPHF that only needs a single memory access for most queries, and can answer queries nearly as fast as the memory can support random access reads.
This part covers:
- Large-scale text indices: more data calls for faster algorithms, optimized for throughput rather than latency.
- Text indexing with SSHash (future work).
- SimdMinimizers.
- PtrHash.
Both these problems are relevant for indexing genomic data, and specifically, they are both components of the SSHash $k$-mer index (Pibiri 2022).
1 Optimizing Compute Bound Code: Random Minimizers
We start with an overview of techniques that can be used to optimize compute-bound code.
As an example application, we apply these techniques to the problem of efficiently computing the minimizers of a sequence. There are many indices and tools using minimizers, such as SSHash (Pibiri 2022) and minimizer-space De Bruijn graphs (Ekim, Berger, and Chikhi 2021). In some cases, minimizers are also specifically used as a sketch of the text (Grabowski and Raniszewski 2017; Ayad et al. 2025). Specifically there, this ``compression’’ step of computing the minimizers can easily become a bottleneck, since all subsequent operations only have to operate on the much smaller sketched space. Thus, this is a classic compute bound problem, where the input is a DNA sequence, and the output is the set of minimizer positions or kmers.
A*PA2. Most of the techniques mentioned below are already used in the A*PA2 pairwise aligner (TODO REF), which is also compute bound. It processes parts of the DP matrix in large blocks, so that the execution is very predictable and branch misses are avoided. It also uses SIMD (on top of bitpacking) to compute even more states in parallel, and exploits instruction level parallelism by independently processing two SIMD vectors at a time. It also uses a bit-packed input format to reduce the memory pressure.
1.1 Avoiding Branch Misses
Modern CPUs have execution pipelines that are hundreds of instructions long. Thus, if one instruction is waiting for some data (from memory), the CPU will already start executing upcoming instructions. When a branch occurs, the CPU has to predict which of the two paths will be taken in order to proceed with this speculative execution, since waiting for the condition to be resolved would remove most of the benefits of pipelining.
Thus, the CPU has a branch predictor that fulfils this task. Very much simplified, it tracks for each branch instruction whether it is usually taken or not, and makes a prediction based on this. Modern branch predictors can perfectly recognize patterns like taking a branch every 10th iteration.
When a branch misprediction happens, the CPU has to unwind the speculative computations that depended on the wrong assumption, and then start over with the correct sequence of instructions. In practice, this can cause a delay of 10 to 20 clock cycles, and can easily become the bottleneck for performance. Thus, we should aim to design algorithms without data-dependent branches, so the branches that remain are all predictable and quick to compute.
Application. For the problem of computing minimizers, we apply this technique by replacing the classic queue-based algorithm for minimizers with an efficient version of the two-stacks method, which only uses a single branch every \(w\) iterations.
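To illustrate the idea, the sketch below is a scalar version of the two-stacks (block-based) sliding-window minimum, not the actual simd-minimizers implementation (function name and tie-breaking are ours; real random minimizers typically break ties by hash). The only work guarded by a rarely-taken branch is the recomputation of suffix minima, once every \(w\) iterations; the per-iteration minima are plain comparisons that the compiler can usually turn into conditional moves.

```rust
/// Scalar sketch of the two-stacks / block-based sliding-window minimum.
/// `vals[i]` is, for example, the hash of the k-mer starting at position i.
/// Returns the position of the leftmost minimum of every window of size `w`.
fn sliding_window_min(vals: &[u64], w: usize) -> Vec<usize> {
    assert!(w >= 1 && vals.len() >= w);
    const SENTINEL: (u64, usize) = (u64::MAX, usize::MAX);
    let mut out = Vec::with_capacity(vals.len() - w + 1);
    // Suffix minima (value, position) of the previous block; index w is a sentinel.
    let mut suf = vec![SENTINEL; w + 1];
    // Running prefix minimum (value, position) of the current block.
    let mut pre = SENTINEL;

    for (i, &v) in vals.iter().enumerate() {
        let o = i % w;
        // The only branch that is taken once every w iterations: when a new
        // block starts, recompute the suffix minima of the previous block.
        if o == 0 {
            if i >= w {
                suf[w] = SENTINEL;
                for j in (0..w).rev() {
                    let cand = (vals[i - w + j], i - w + j);
                    suf[j] = if cand.0 <= suf[j + 1].0 { cand } else { suf[j + 1] };
                }
            }
            pre = SENTINEL;
        }
        // Branch-free update of the prefix minimum of the current block.
        pre = if v < pre.0 { (v, i) } else { pre };
        if i + 1 >= w {
            // The window ending at i is a suffix of the previous block
            // followed by a prefix of the current block.
            let m = if suf[o + 1].0 <= pre.0 { suf[o + 1] } else { pre };
            out.push(m.1);
        }
    }
    out
}
```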
1.2 SIMD: Processing In Parallel
A common technique to speed up computations on modern hardware is using SIMD, or single-instruction-multiple-data, instructions. There are, for example, 256 bit registers that contain four 64 bit integers at once, or eight 32 bit integers. The processor can then do arithmetic on all lanes in parallel, providing up to \(4\times\) or \(8\times\) speedup over scalar arithmetic.
In order to use SIMD instructions, we have to make sure that the input data is sufficiently homogeneous: we need to fill the lanes with integers that require exactly the same computation. And since these computations happen in parallel, they can not depend on each other.
Application. Unfortunately, the problem of computing minimizers is (locally) very sequential, since it requires taking a rolling minimum. To circumvent this, we can split each input sequence into 8 chunks that are independent and can be processed in parallel via 256 bit AVX2 SIMD instructions on 8 32 bit lanes.
Because we use a data-independent method to compute the minimizers, the data-flow and executed instructions in each of the 8 chunks are exactly the same. This is the perfect case for SIMD, since there is no divergence between the lanes.
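As a minimal illustration of this kind of lane-parallelism (not the actual simd-minimizers kernel; the function name is ours), the snippet below uses a single AVX2 instruction to update 8 independent 32 bit running minima, one lane per chunk.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m256i, _mm256_min_epu32};

/// Update 8 independent running minima at once: lane j holds the current
/// minimum of chunk j, and `new_hashes` holds the next hash of each chunk.
/// A single instruction computes all 8 unsigned 32-bit minima in parallel.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn update_minima(current: __m256i, new_hashes: __m256i) -> __m256i {
    _mm256_min_epu32(current, new_hashes)
}
```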
1.3 Instruction Level Parallelism
Modern CPUs can not only execute many instructions ahead, but they can also execute many instructions in parallel. For example, typical Intel CPUs can execute up to 4 instructions each clock cycle. However, in very simple for loops, e.g., one that sums the values of an array, there is a loop-carried dependency: each iteration depends on the previous one. Thus, only one addition can be executed at a time, so that the CPU is not fully utilized.
One way to increase the amount of parallelism available in the code is by solving two instances in parallel. For example, to sum the integers in a vector, we can split it in two halves (or even four quarters!) and sum them at the same time.
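A minimal sketch of this (the function name is ours; in practice the compiler may already auto-vectorize such a loop): the two accumulators form independent dependency chains, so the CPU can overlap their additions.

```rust
/// Sum a slice using two independent accumulators, so that the two
/// additions of each iteration can execute in parallel.
fn sum_ilp2(v: &[u64]) -> u64 {
    let (lo, hi) = v.split_at(v.len() / 2);
    let (mut a, mut b) = (0u64, 0u64);
    // The two loop-carried chains (a and b) are independent, so the CPU can
    // overlap their additions instead of waiting on a single chain.
    for i in 0..lo.len() {
        a = a.wrapping_add(lo[i]);
        b = b.wrapping_add(hi[i]);
    }
    // If the length is odd, `hi` has one extra element.
    if hi.len() > lo.len() {
        b = b.wrapping_add(hi[lo.len()]);
    }
    a.wrapping_add(b)
}
```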
Application. We tried to apply this to the computation of minimizers by splitting the input into 16 chunks, and then running two instances of the 8-lane algorithm interleaved. In this case, the gains were marginal, probably because the additional instructions increase the register pressure too much.
1.4 Input Format
Lastly, the input format, and more generally memory I/O, can have a big impact on performance, since highly optimized code usually processes a lot of data.
Specifically, the SIMD gather instruction, which reads from 8 arbitrary addresses, and the scatter instruction, which writes to 8 arbitrary addresses, are often slow. More generally, any kind of shuffling of data, either by writing spread out over memory or by reading from random parts of memory, tends to be much slower than simply sequentially iterating over some input.
Application. The input for the SIMD version of our minimizer algorithm is 8 streams of text, initially encoded as plain 8 bit ASCII characters. Thus, while we could read one character from each stream at a time, it is much more efficient to gather 8 32 bit integers at once, each containing 4 characters. In practice, it is better to read a full 64 bit integer at a time, rather than splitting this into 2 32 bit reads.
Still, this is not maximally efficient. For DNA, each ASCII character can only really be one of four values, \(\nuc{ACGT}\). (TODO NUC) Thus, each 8 bit character wastes 6 bits. We can avoid this by first 2-bit packing the input in a separate linear pass. Then, the algorithm itself can read 64 bits at a time from each lane, containing 32 characters.
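A sketch of such a packing pass, assuming the common 2-bit encoding obtained from bits 1 and 2 of the ASCII code (so A=0, C=1, T=2, G=3); the actual packed layout used by simd-minimizers may differ.

```rust
/// Pack ASCII DNA into 2 bits per base, 32 bases per u64 (illustrative).
fn pack_dna(seq: &[u8]) -> Vec<u64> {
    let mut packed = vec![0u64; (seq.len() + 31) / 32];
    for (i, &c) in seq.iter().enumerate() {
        // Bits 1 and 2 of the ASCII codes of A, C, G, T are distinct:
        // A=00, C=01, T=10, G=11.
        let code = ((c >> 1) & 3) as u64;
        packed[i / 32] |= code << (2 * (i % 32));
    }
    packed
}
```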
2 Optimizing Memory Bound Code: Minimal Perfect Hashing
We now consider techniques for optimizing memory bound code.
As an application, we consider the minimal perfect hash function in SSHash. SSHash first collects all minimizers, and then builds a hash table on these minimizers as a part of its data structure. Building a classic hash table is possible, but this would take a lot of space, since it has to store all the keys. Instead, we can use the fact that the data structure is static: the set of \(m\) minimizers is fixed. Thus, we can build a minimal perfect hash function (MPHF) that takes this set, and bijectively maps it to the range \(\{0, \dots, m-1\}\). Then, queries can use this function to find the right slot in an array storing additional data for each minimizer.
We focus on designing an MPHF that can answer queries quickly. Specifically, we optimize for throughput, i.e., to answer as many independent queries per second as possible. When the number of keys (minimizers) is large, say \(10^9\), the MPHF data structure will not fit in L3 cache, and hence, most of the queries will need to access main memory. Thus, this problem is memory-bound.
We note that code can be memory bound in two ways: by memory latency, where it is usually waiting for one read to come through, or by memory throughput, where the entire bandwidth is saturated. We should avoid being bound by latency, and instead aim to get as much work done as possible given the available throughput.
2.1 Using Less Memory
A first way to reduce a memory latency or throughput bottleneck is by simply using less memory. CPUs have a hierarchy of caches, typically with L1, L2, and L3 cache, with L1 being the closest to the CPU and hence fastest, but also the smallest. This means that if the data fits in L1, random accesses to it will be significantly faster (a few cycles) than for data that only fits in L2 (around 10 cycles), L3 (around 40 cycles), or main memory (up to 200 cycles). Thus, smaller data fits in a smaller cache, and hence will have faster accesses. Even when the data is much larger than L3, reducing its size can still help, because then, a larger fraction of it can be cached in L3.
One way to apply this is by reducing the size of integers from 64 bits to 32 bits, when this is still sufficiently large to hold the data.
2.2 Reducing Memory Accesses
Another way to reduce the memory bottleneck is to avoid memory accesses as much as possible. Completely removing a dependency on some data is usually not possible, but instead, it is often possible to organize data more efficiently.
In particular, RAM works in units of cache lines, which (usually) consist of 64 bytes. Thus, whenever an integer is read from main memory, the entire corresponding cache line must be fetched into the L1 cache. This means that it may be more efficient to store a single array of structs rather than a struct of arrays if elements of the struct are usually accessed together.
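For example (with hypothetical field names), when the position and length of an element are always read together, the array-of-structs layout puts both in the same cache line, whereas the struct-of-arrays layout needs two memory accesses per element.

```rust
/// Array-of-structs: `pos` and `len` of one element share a cache line,
/// so looking up a single element touches only one line.
struct Entry {
    pos: u32,
    len: u32,
}
struct IndexAos {
    entries: Vec<Entry>,
}

/// Struct-of-arrays: reading `pos[i]` and `len[i]` touches two different
/// cache lines, but scanning only `pos` is more cache-efficient.
struct IndexSoa {
    pos: Vec<u32>,
    len: Vec<u32>,
}
```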
Additionally, one should avoid chains of dependent memory accesses, where the result of one read determines the address of the next, since these can not be executed in parallel.
Application. A common application of this technique is in B-trees, which are balanced search trees holding a set of sorted elements. Classic binary search trees have an indirection at every level of the tree. B-trees on the other hand store \(B\) values in each node. This reduces the height of the tree from \(\log_2(n)\) to \(\log_{B+1}(n)\), and efficiently uses a cache line by reading \(B\) values from it at once, rather than just a single value.
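For instance (a sketch with made-up parameters, not tied to any particular B-tree library): with 4-byte keys, a node of \(B=15\) keys plus a length byte fits in a single 64 byte cache line, so one memory access retrieves everything needed to pick the next child.

```rust
/// A B-tree node sized to fit a single 64-byte cache line (illustration).
#[repr(C, align(64))]
struct BTreeNode {
    keys: [u32; 15], // 60 bytes of sorted keys
    len: u8,         // number of keys actually in use
}

/// Index of the first key >= `x`, i.e., which child to descend into.
fn find_child(node: &BTreeNode, x: u32) -> usize {
    node.keys[..node.len as usize]
        .iter()
        .position(|&k| k >= x)
        .unwrap_or(node.len as usize)
}
```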
Our MPHF, PtrHash, internally uses Elias-Fano (EF) coding (Elias 1974; Fano 1971) to compactly encode sequences of integers. We introduce a CacheLineEF version, that overall uses a bit more space, but stores the information to retrieve each value in a single cache line. That way, we can still compress the data, while not paying with more memory accesses.
2.3 Interleaving Memory Accesses
As already discussed, CPU pipelines can execute many instructions ahead. This also means that the CPU will already fetch memory for upcoming instructions whenever it can. For example, in a for-loop where each iteration reads a single independent memory address, the CPU can fetch memory a number of iterations ahead.
More precisely, each core in the CPU has a number of line fill buffers (12, on my CPU). Each time the core requests a new cache line to be read from memory, it reserves one of these buffers so that the result can be stored there when it is available. Thus, the latency of each individual access can be hidden by keeping around 10 reads in flight in parallel, resulting in up to 10 times higher memory throughput.
This can be used by clustering independent memory accesses, so that they indeed occur in parallel. Additionally, it can help to have as little code as possible in between consecutive reads, so that the CPU can look relatively more iterations ahead.
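The difference is easiest to see by contrasting a loop of independent lookups with a pointer-chasing loop (both are sketches with made-up names).

```rust
/// Independent reads: the address used in iteration i+1 does not depend on
/// the value read in iteration i, so the CPU keeps many reads in flight.
fn sum_lookups(table: &[u64], idx: &[usize]) -> u64 {
    idx.iter().fold(0u64, |acc, &i| acc.wrapping_add(table[i]))
}

/// Dependent reads: each address is the result of the previous read, so only
/// one memory request is in flight at a time and we pay the full latency.
fn pointer_chase(next: &[usize], mut i: usize, steps: usize) -> usize {
    for _ in 0..steps {
        i = next[i];
    }
    i
}
```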
2.4 Batching, Streaming, and Prefetching
One way to make the interleaving of memory accesses more explicit is by using batching. If we have to process \(n\) independent iterations of a for loop, and each requires a read to memory, we can group (chunk) them into batches of size \(B\), say \(B=16\) or \(B=32\). Then, we can first make \(B\) reads to memory, and then process the results.
To make this slightly more efficient, prefetching can be used, where instead of directly reading the \(B\) values into a register, we first ask the CPU to read them into L1 cache using a dedicated prefetch instruction. Then we process the elements in the batch as usual, and all the data should already be present.
A slight variant of this is streaming, where instead of processing chunks of size \(B\), we prefetch the data required for the iteration \(B\) ahead of the current one.
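A sketch of the streaming variant for a simple table lookup (the function and parameter names are ours, not the PtrHash API): while handling query \(i\), we prefetch the cache line needed by the query \(B\) iterations ahead.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

/// Stream over `queries`, prefetching the data for the query `lookahead`
/// iterations ahead, so its cache line is (hopefully) in L1 when we need it.
fn lookup_stream(table: &[u64], queries: &[usize], lookahead: usize) -> u64 {
    let mut sum = 0u64;
    for i in 0..queries.len() {
        if i + lookahead < queries.len() {
            #[cfg(target_arch = "x86_64")]
            unsafe {
                let ptr = table.as_ptr().add(queries[i + lookahead]) as *const i8;
                _mm_prefetch::<_MM_HINT_T0>(ptr);
            }
        }
        sum = sum.wrapping_add(table[queries[i]]);
    }
    sum
}
```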
Application. We apply both batching and streaming in PtrHash, and achieve up to \(2\times\) speedup compared to plain for-loops. In particular, using these techniques, each iteration takes just over 8 ns on average, which, on my CPU, is very close to the maximum random memory throughput a single core can achieve.
TODO 3 Writing High Performance Code
We end this introduction to high performance code with some tips on benchmarking, profiling, and estimating performance.
TODO 3.1 Benchmarking
3.2 Writing and Optimizing High Performance Code
3.3 DROP? Performance Metrics
We end this section with a summary of useful performance metrics. These should all be taken with some margin, as they can vary wildly between different CPUs. Still, they should provide a useful starting point for back-of-the-envelope performance estimates.
- TODO
- memory latency
- throughput
- back-of-the-envelope stuff