CPU performance

This (planned) series of posts aims to develop a high-performance search algorithm for suffix arrays. We will start with a classic binary search implementation and make incremental improvements to it. But that is planned for Part 3.

Before that, we will review existing methods for binary searching plain integer arrays as a warm-up exercise. There is a great paper by Khuong and Morin (2017), “Array Layouts for Comparison-Based Searching”, and an algorithmica.org case study based on it, which together will form the basis of Part 2.

First, (in a typical case of nerd-sniping,) we must understand our hardware. In this part, we will look at some simple benchmarks to understand the capabilities and limits of a modern CPU. This part is very much inspired by Chapter 9 of algorithmica.org. In fact, I will probably end up directly replicating some of the experiments from there for educational purposes (at least for myself). Algorithmica also lists this post by Igor Ostrovsky and What Every Programmer Should Know About Memory by Ulrich Drepper as useful resources, but I haven’t read them closely myself.

Even at the start, I can already tell you that each part will have the same conclusion:

  1. Reading from main memory has very high latency.
  2. Batching and prefetching reads hides latency and can increase throughput over 10x.

In this post, Part 1, we will mostly investigate latency and bandwidth of L1/L2/L3 caches and main memory in different workloads.

The final goal is to develop a static data structure supporting a high throughput of read-only queries, and so for now we focus on reading from memory rather than writing to memory.

Code can be found in this repository: github:RagnarGrootKoerkamp/cpu-benchmarks.

Introduction

Hardware

First, let's get some idea of my hardware.

  • CPU: lscpu | grep model: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz, a high end laptop CPU released in 2020.
    • Cores: This CPU has 6 cores. Hyper threading is disabled.
    • Frequency: All experiments run at the base frequency of 2.6GHz, with turbo boost disabled.
  • Caches: lscpu | grep cache
    • L1: 32K per core
    • L2: 256K per core
    • L3: 12M shared
  • Main memory: 64G, as 2 32G DDR4 banks at 3.2GHz.

You can also use lstopo to generate a figure of the CPU and memory layout:

Details of caches and memory

Before discovering the characteristics and limitations of memory through experiments, I will summarize the highlights and some relevant notes-to-self here. Google is your friend to learn more about the details of any of these.

Cache lines
Caches work in units of 64-byte cache lines [1]. See getconf -a | grep CACHE_LINESIZE. Each time you read a byte, the entire 64-byte cache line is moved from RAM to L3 to L2 to L1 [2].
Memory pages
The CPU has to translate between virtual addresses (the pointers in your program) and physical addresses. The usual page size is 4K, meaning that memory gets allocated in 4K chunks, and data within a page is consecutive in main memory. The translation lookaside buffer (TLB) is a small dedicated cache for recent page lookups. My CPU has 64 page entries in the L1 TLB, and 1.5K page entries in the L2 TLB.
  • Huge pages are 2M or 1G instead. When addressing a lot of memory, the 4K memory pages may not all fit in the TLB: the 1.5K L2 TLB entries cover only ~6M of 4K pages, versus ~3G of 2M hugepages. Switching to larger page sizes fixes this, at the cost of wasting some memory.

    I have transparent huge pages enabled, which automatically switches to huge pages for large data.

Main memory
  • Memory is stored as rows of 1K to 8K. Consecutive reads in the same row may be slightly faster.
  • My memory has a CAS latency of 22 cycles, i.e., after receiving a request it takes 22 cycles (7ns) to return the data.
Instruction cache
The processor doesn’t only need to fetch the data, it also needs the code itself. Thus, there is a dedicated 32K L1i instruction cache [3] alongside the 32K L1d data cache. There is no dedicated L2 instruction cache, so L2 is shared between code and data.
Line fill buffers
Each core has a small number of line fill buffers (Intel) or miss address buffers (AMD): separate 64-byte buffers that are reserved for pending requests to L2/L3/RAM for cache lines that are not yet in L1. On my CPU there are 12 of them, as we will see later.

Cache associativity
Caches are set-associative: each memory address can only be cached in a small number (4-16) of possible locations, determined by some bits of its address. We will see the consequences of this below.

Eviction policies
It’s generally hard to find information on which cache eviction policies are used in practice.

Understanding and predicting the behaviour of CPUs and the memory system appears to be complex, given that many details are only known (it seems) through reverse engineering. A benefit of attempting to make highly accurate plots is that we can see a lot of details, but this comes with the drawback that I do not always have answers (yet) explaining these details. I have marked some questions in bold below.

Latency, bandwidth, and throughput

The CPU memory, caches, and instructions (and algorithms in general) have two important properties:

Latency
The time it takes to fetch/process some data to the CPU after requesting it, e.g. 10ns.
Bandwidth
The upper bound on the amount of data that can be read per second, e.g. 1GB/s.

Also related:

Throughput
The actual amount of data processed, in items or size per second. The inverse throughput is the average time between completion of consecutive units of work, e.g. 1ns. This can be easily compared to latency, and is typically lower/better than the latency.
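
All numbers reported below are inverse throughputs: the total wall-clock time of the benchmark loop divided by the number of iterations. A minimal sketch of such a harness (the repo's version differs in details such as repetitions and warm-up):

use std::hint::black_box;
use std::time::Instant;

/// Time `steps` iterations of a dependent-add loop and report
/// the average time per iteration in nanoseconds.
fn main() {
    let steps: u64 = 100_000_000;
    let mut x: u64 = 0;
    let start = Instant::now();
    for i in 0..steps {
        // A dependent chain, so iterations cannot overlap.
        x = black_box(x.wrapping_add(i));
    }
    black_box(x);
    let ns_per_step = start.elapsed().as_nanos() as f64 / steps as f64;
    println!("{ns_per_step:.2} ns per step");
}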

Latency & pointer chasing

In this section we measure the latency of caches and RAM.

TL;DR: RAM is slow! Each lookup takes 78ns, or ~200 cycles.

Pointer chasing

We start with a simple pointer chasing experiment: we create a large array in which each position contains the index of another position and then follow the chain. In particular, we ensure that the array is a random derangement, or rather, a permutation that is just one long cycle, so that memory cannot be prefetched by the hardware prefetcher.
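
The derangement helper lives in the repo; one standard way to build such a single-cycle permutation is Sattolo's algorithm. A sketch (the rand crate and its 0.8-style API are assumptions here):

use rand::Rng;

/// A permutation of 0..n consisting of a single cycle (Sattolo's algorithm),
/// so that repeatedly following `i = v[i]` visits every element once
/// before returning to the start.
fn single_cycle(n: usize) -> Vec<usize> {
    let mut rng = rand::thread_rng();
    let mut v: Vec<usize> = (0..n).collect();
    // Like Fisher-Yates, but swap with a position strictly *below* i;
    // this guarantees the result is one big cycle.
    for i in (1..n).rev() {
        let j = rng.gen_range(0..i);
        v.swap(i, j);
    }
    v
}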

The first experiment looks like this:

let v: Vec<usize> = derangement(size);
let mut i: usize = 0;
for _ in 0..STEPS {
    i = v[i];
}
black_box(i);
Code Snippet 1: A simple pointer-chasing experiment.

Figure 1: Latency of pointer chasing for various sized arrays. The horizontal axis shows the size of the input array in bytes on a logarithmic scale. Red lines show the L1, L2, and L3 cache sizes. All experiments are run 3 times and the plot shows minimum, median, and maximum runtime.

First, the vector v is initialized with a derangement with the given total size in bytes (not length). Then we iterate the main loop for some configurable number of STEPS, and in each iteration we read one element. At the end we pass i to black_box to prevent everything from being optimized away. Note that we only time the main loop, not the initialization. The code can be found at [TODO]. The result is in Figure 1.

Observe that:

  • Latency goes up as the array size increases.
  • After crossing a cache-size boundary the increase is smooth, not stepwise, because part of the data still fits in the smaller but faster cache.
  • The latency stabilizes once the smaller caches become negligible.
  • Since L3 cache is shared between all cores/processes, the array cannot completely occupy it, and we see a slowdown already at slightly smaller array sizes.
  • A similar effect happens when crossing from L2 to L3, probably because L2 is also used for the program code itself.

For reference, here is the corresponding assembly code:

 2.34  60:┌─→cmp        %rsi,%rdi             ; Compare rdi (=i) to the array length.
          │↓ jae        2fc                   ; If i >= array length, bail.
97.60       mov        (%rcx,%rdi,8),%rdi    ; Load the next index, v[i], from the array at rcx (8-byte elements).
 0.01     ├──dec        %rax                  ; Decrease remaining iteration counter
 0.05     └──jne        60                    ; If not 0, continue

Bounds checking

Since we are writing Rust, array indexing in v[i] is always checked, and when i is not a valid index the code panics. That is nice, but since we are looking for high performance here, we’ll avoid the bounds checks by using get_unchecked. Since that looks kinda ugly, from here on I will use an UncheckedVec wrapper type and just write v[i] for simplicity.

let v: Vec<usize> = derangement(size);
let mut i: usize = 0;
for _ in 0..STEPS {
    i = unsafe { *v.get_unchecked(i) };
}
black_box(i);
Code Snippet 2: Pointer chasing without bound checks.
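
For reference, the UncheckedVec wrapper mentioned above can be as simple as the following sketch (the repo's actual type may differ):

use std::ops::Index;

/// Thin Vec wrapper whose indexing skips bounds checks.
struct UncheckedVec<T>(Vec<T>);

impl<T> Index<usize> for UncheckedVec<T> {
    type Output = T;
    #[inline(always)]
    fn index(&self, i: usize) -> &T {
        // Safety: the caller must guarantee i < self.0.len();
        // for our derangements this holds by construction.
        unsafe { self.0.get_unchecked(i) }
    }
}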

Figure 2: The unchecked version is basically as fast, since branch prediction makes the check cheap.

Although not faster, the generated assembly is much more concise.

12.12  70:┌─→mov        (%rax,%rdx,8),%rdx        ; Do 8 consecutive lookups.
12.62       mov        (%rax,%rdx,8),%rdx
12.80       mov        (%rax,%rdx,8),%rdx
12.59       mov        (%rax,%rdx,8),%rdx
12.71       mov        (%rax,%rdx,8),%rdx
12.08       mov        (%rax,%rdx,8),%rdx
12.58       mov        (%rax,%rdx,8),%rdx
12.48       mov        (%rax,%rdx,8),%rdx
          ├──add        $0xfffffffffffffff8,%rsi  ; Decrease counter by 8.
 0.02     └──jne        70
Code Snippet 3: The compiled assembly code simply contains a list of array lookups, with the for loop unrolled 8 times.

Padding elements

One thing we did not yet account for is that each cache line of the array contains multiple (64B/8B = 8) elements, so in some cases the next index may already be cached because it is in the same cache line as a recently seen element. To prevent this effect, we pad each element to occupy 64 bytes.

/// 64B sized object that is aligned to a cache line.
#[repr(align(64))]
struct PaddedUsize {
    value: usize,
    _padding: [u8; 56],
}
let v: UncheckedVec<PaddedUsize> = derangement(size);
let mut i: usize = 0;
for _ in 0..STEPS {
    i = v[i].value;
}
Code Snippet 4: Pointer chasing with one element per cache line.

Figure 3: When data does not fit in L1, the padded version is slightly slower, as expected.

As expected, we see in Figure 3 that the padded version is consistently slower than the original version.

  • In L1, we can see that one additional cycle per lookup is needed to compute the 64 * i offset, since a scale of 64 is too large to encode directly in the mov addressing mode, unlike the 8 * i we had before.
  • In L2, the running time is initially exactly flat, not a smooth transition. Most likely this is because once space in L1 runs out, the least recently used cache line is evicted. Since our ‘walk’ through the array is cyclic, elements are always evicted from L1 before we loop around to them again, basically making L1 useless (a toy simulation of this effect follows after this list).
  • As L2 gets fuller, we observe a slowdown before it is completely full. We’ll get back to this in a bit.
  • Question: Unlike the L1->L2 transition, the L2->L3 transition is smooth. Maybe L2 has a different strategy for which elements are evicted?
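
To see how a cyclic walk defeats LRU, here is a toy simulation of a fully-associative LRU cache (illustrative only; real caches are set-associative and more subtle):

use std::collections::VecDeque;

/// Hit rate of an LRU cache holding `capacity` lines under a cyclic walk
/// over `n` distinct lines.
fn lru_hit_rate(capacity: usize, n: usize, steps: usize) -> f64 {
    let mut cache: VecDeque<usize> = VecDeque::new();
    let mut hits = 0;
    for step in 0..steps {
        let line = step % n; // cyclic access pattern
        if let Some(pos) = cache.iter().position(|&l| l == line) {
            hits += 1;
            cache.remove(pos);
        }
        cache.push_back(line); // most recently used at the back
        if cache.len() > capacity {
            cache.pop_front(); // evict the least recently used line
        }
    }
    hits as f64 / steps as f64
}

fn main() {
    println!("{:.2}", lru_hit_rate(512, 512, 100_000)); // ~1.00: everything fits
    println!("{:.2}", lru_hit_rate(512, 513, 100_000)); // 0.00: one line too many
}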

Raw pointers

So far, we weren’t really chasing pointers: we were chasing indices, which adds a slight indirection since v[i] needs to add i to the pointer to the start of the array (&v[0]). We can instead store actual pointers in a Vec<*const usize> and avoid the offset computation.

let mut v: Vec<PaddedPointer> = ...;
let mut i: *const usize = v[0];
for _ in 0..*STEPS {
    i = unsafe { *i } as *const usize;
}
Code Snippet 5: Pointer chasing with raw pointers (elements still padded to cache lines).

Figure 4: Direct pointer chasing is usually slightly faster than using array offsets, because the explicit multiplication by 64 isn’t needed anymore.

12.40  70:┌─→mov        (%r12),%rdx
12.44       mov        (%rdx),%rdx
12.44       mov        (%rdx),%rdx
12.44       mov        (%rdx),%rdx
12.44       mov        (%rdx),%rdx
12.44       mov        (%rdx),%rdx
12.44       mov        (%rdx),%rdx
12.38       mov        (%rdx),%r12
          ├──add        $0xfffffffffffffff8,%rcx
          └──jne        70
Code Snippet 6: This code is even simpler than Code Snippet 3, and contains 8 unrolled direct pointer dereferences.
Takeaway

Raw pointer indexing is slightly faster than array indexing.

Aligned memory & Hugepages

There is a weird but consistent improvement in performance once the array reaches size 2^25=32MiB. This appears to be the point where the kernel decides that instead of reusing some memory from the already allocated heap, it will make a completely new allocation. Most likely, the reason this is faster is that transparent hugepages kick in: the operating system can automatically detect large allocations and use 2MiB hugepages for them instead of the default 4KiB pages. This reduces pressure on the translation lookaside buffer (TLB) that maps virtual memory pages to physical memory addresses.

To avoid the slight slowdown just before 32MiB, we can instead always use hugepages, by allocating a multiple of 2MiB at a 2MiB boundary. I use the alloc-madvise crate for this, which also indicates to the system that hugepages should be used. To make this work reliably, we over-allocate all arrays to the next multiple of 2MiB. In fact, it turns out that arrays of, say, 2MiB still get allocated on the program heap, which uses 4KiB pages. If we instead round up to the next multiple of (and at least) 32MiB, this is fixed, and hugepages work consistently.
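
In the repo this goes through the alloc-madvise crate; mechanically it boils down to something like the following sketch, which uses the libc crate directly rather than the crate's actual API:

use std::alloc::{alloc, Layout};

/// Allocate `size` bytes, rounded up to a multiple of 2MiB and aligned to
/// a 2MiB boundary, and advise the kernel to back the range with hugepages.
/// Sketch only: error handling and deallocation are omitted.
fn alloc_huge(size: usize) -> *mut u8 {
    const HUGE: usize = 2 * 1024 * 1024;
    let size = size.next_multiple_of(HUGE);
    let layout = Layout::from_size_align(size, HUGE).unwrap();
    unsafe {
        let p = alloc(layout);
        assert!(!p.is_null());
        // Ask transparent hugepages to kick in for this range.
        libc::madvise(p as *mut libc::c_void, size, libc::MADV_HUGEPAGE);
        p
    }
}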

Figure 5: Hugepages and 2MiB aligned allocations.

Indeed, the spike at 2^25 is gone! Very satisfying! And generally performance improves across the L3 range. Sizes just below L3 capacity are slightly noisy, since other ongoing processes also compete for the last bit of L3 cache.

Also, performance is now perfectly constant for all L2 sizes. Before, the 4KiB (2^12) sized blocks were probably at random physical offsets. Due to associativity, each memory address can only be cached in a small (4-16) number of possible locations. When the pages are randomly positioned, some cache sets will be over-used while others are under-used. This means that even though the array size is less than the size of L2, it may not be possible to cache it in its entirety. With 2MiB page sizes, the entire allocation is a single block, and the distribution over cache sets is perfectly uniform. Thus, the entire array can be cached in L2 at once.
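
To make the associativity argument concrete, here is a toy model of the address-to-set mapping (the 4-way figure is an assumption about my CPU's L2):

// Assumed L2 parameters: 256KiB, 64B lines, 4-way set-associative.
const LINE: usize = 64;
const WAYS: usize = 4;
const L2_SIZE: usize = 256 * 1024;
const SETS: usize = L2_SIZE / LINE / WAYS; // 1024 sets

/// The cache set a physical address maps to: bits 6..16 of the address.
fn l2_set(phys_addr: usize) -> usize {
    (phys_addr / LINE) % SETS
}

fn main() {
    // With 4KiB pages only the low 12 bits survive virtual->physical
    // translation, so bits 12..16 of the set index depend on where the
    // kernel happened to place each page; with 2MiB pages, bits 0..21
    // are preserved and a contiguous array uses all sets uniformly.
    println!("{}", l2_set(0x0123_4000)); // set of the first line of some page
}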

Takeaway

To effectively use hugepages, we must allocate at least 32MiB to avoid the pre-allocated heap. When used, they slightly improve performance in L2 and L3 and make results more consistent.

Summary

To wrap up, here is a summary of results.

Table 1: Latency of each method, evaluated at sizes L1/2=2^14, L2/2=2^17, L3/3=2^22, and RAM=2^28. Note that L1 operations take an exact number of clock cycles. Key metrics to remember in bold.
                                     ns                        cycles
                                     L1    L2    L3    RAM     L1    L2    L3    RAM
Pointer Chasing Checked              1.9   4.9   18.9  77.3    5.1   12.6  49.2  200.9
Pointer Chasing                      1.9   5.2   19.6  77.4    5.1   13.6  51.0  201.3
Pointer Chasing Padded               2.3   6.1   20.3  78.3    6.1   15.9  52.7  203.6
Raw Pointer Chasing                  1.7   4.0   19.1  77.7    3.2   10.3  49.6  202.1
Raw Pointer Chasing Padded           1.6   4.6   18.7  78.5    4.0   12.1  48.7  204.2
Pointer Chasing Padded Aligned       2.3   5.1   15.7  78.8    6.0   13.3  40.8  204.9
Raw Pointer Chasing Padded Aligned   1.5   4.6   15.3  78.6    4.0   12.1  39.8  204.3

Based on this evaluation, we will from now on assume all the above optimizations:

  • unchecked indexing,
  • cache-line-sized array elements,
  • aligned 2MiB hugepages, allocated as a multiple of (at least) 32MiB,
  • raw pointers instead of indices.
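
Concretely, the element type used in the remaining experiments looks roughly like this (a sketch; the definitions in the repo may differ slightly):

/// One cache line holding a single pointer to the next element.
#[repr(align(64))]
struct PaddedPointer {
    next: *const usize,
    _padding: [u8; 56],
}
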
Takeaway

RAM has a latency of just below 80ns.

Random access throughput & batching

In this section we measure the random access throughput of caches and RAM.

TL;DR: By using batches of size \(>12\) and prefetching, we can fully saturate the memory bandwidth with random memory accesses.

Batching

So far we have only been looking at latency, where we process a single RAM access at a time. Instead, we can also consider the throughput of random accesses, where we do multiple independent accesses in parallel.

To measure this, we use batching with batches of size B: instead of a single pointer chasing chain, we process B chains in parallel. To ensure that they do not interfere with each other or access the same cache lines, we initialize them equally spaced on the single long cycle of pointers.

let v: AlignedVec<PaddedPointer> = ...;
// B pointers to element 0, n/B, 2n/B, 3n/B, ... of the cycle.
let mut is: [*const usize; B] = ...;
for _ in 0..*STEPS / B {
    for i in &mut is {
        // `i` has type `&mut *const usize`.
        // First deref gets us `*const usize`,
        // and second gets the actual pointed-to-value,
        // which is casted to a pointer and written to `*i`.
        *i = unsafe { **i } as *const usize;
    }
}
black_box(is);
Code Snippet 7: With batching, we advance B independent pointers at a time, for a total of STEPS / B iterations.

Figure 6: When using a batch size B>1, there is a B-fold speedup, until it saturates around B=16. The best result from before is dotted and labelled Latency.

As expected, batch size 1 behaves the same as the best result from before. We observe that larger batch sizes improve throughput linearly, until it saturates at batch size 8 in L3 and 16 in RAM! (Batch sizes 64 and 128 (not shown) provide no further improvement.)

Takeaway

The best single-threaded random access throughput in RAM is 7.4ns per cache line, or 8.6GB/s.

That’s around a third of the total RAM throughput.
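
For reference, the conversion is simply \(64\,\mathrm{B} / 7.4\,\mathrm{ns} \approx 8.6\,\mathrm{GB/s}\): each random access pulls in one full 64-byte cache line.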

The reason that the CPU is able to improve so much over the latency variant of before is pipelining and out-of-order execution: multiple instructions can be executed in parallel, and when some instructions are slow, for example because they are waiting for memory, upcoming instructions that do not depend on the result can already be started. This way, the CPU is able to execute over 10 instructions/memory requests at a time.

3.53  b0:   movq       (%r8),%r8
3.14        movq       (%r9),%r9
3.28        movq       (%r10),%r10
2.77        movq       (%r11),%r11
2.77        movq       (%r12),%r12
2.77        movq       (%r13),%r13
2.77        movq       (%r14),%r14
2.77        movq       (%r15),%r15
          ; ...
          ; three more copies of the above
          ; ...
            addq       $-4,%rcx
           jne        b0
Code Snippet 8: Assembly code for batch size 8: a four times unrolled loop of 8 pointer accesses.
Takeaway

The CPU uses out-of-order execution and pipelining to execute multiple instructions in advance and in parallel while previous ones are e.g. pending memory. This hides latency of independent reads.

Line fill buffers

Looking closer, we see that the throughput saturates at batch size 12. This corresponds to my CPU having 12 line fill buffers: each request to memory that is not already present in L1 reserves a line fill buffer as a placeholder where the cache line will be stored once it is available. This means that each CPU core can only have 12 ongoing requests at a time, and hence batch sizes beyond this do not increase throughput.

Figure 7: The throughput saturates at batch size 12.

Takeaway

We should use a batch size of at least 12, or in practice usually 16, to fully use the random access RAM bandwidth available to each core.

The reorder-buffer

We just saw that the CPU can execute up to 12 memory requests at a time by ’looking ahead’ and executing multiple instructions in parallel. But this has its limits.

Let’s say that we are not just chasing pointers, but also doing some work on each result, as in Code Snippet 9. We have a compile-time parameter WORK that controls how many iterations of work we do on the result of each lookup. The body of those iterations is just there to keep the CPU busy, and includes a \(64\times64\to128\) bit widening_mul to prevent SIMD vectorization.

Figure 8 shows the results for 3, 6, and 12 iterations of work.

 let v: AlignedVec<PaddedPointer> = ...;
 // B pointers to element 0, n/B, 2n/B, 3n/B, ... of the cycle.
 let mut is: [*const usize; B] = ...;

+let loops = black_box(WORK);
+let v0 = v.as_ptr() as *const usize;
+let mut sum = 0;

 for _ in 0..*STEPS / B {
     for i in &mut is {
         *i = unsafe { **i } as *const usize;

+        // Take the index of the current pointer
+        // and do `WORK` iterations on it.
+        let mut x = unsafe { (*i).offset_from(v0) } as usize;
+        let mut y = x;
+        let mut z = x;
+        for _ in 0..loops {
+            x = x + (x >> 1);
+            y = y.widening_mul(x).1;
+            z += y;
+        }
+        sum += z;
     }
 }
 black_box(is);
+black_box(sum);
Code Snippet 9: We let the CPU work a bit for WORK iterations on each looked up pointer.

Figure 8: Batching as before, but with some additional work (3/6/12 iterations) on each result. Runtimes explode once things overflow to RAM.

We observe a few things:

  1. Naturally, doing more work is slower when everything hits in L1, but all methods are at most 14ns.
  2. All three methods with work slow down significantly in RAM.
  3. The 12-work version is consistently slower than just the latency of pointer chasing.
  4. The 6-work version is around 2x faster, so the CPU must effectively be fetching about one iteration ahead.
  5. The 3-work version is roughly another 2x faster, so there the CPU must be fetching about three iterations ahead.

Thus, the CPU can only look so far ahead. In fact, my CPU should have a reorder buffer (ROB) of around 200 instructions, but for this task it manages less than 12 work iterations of lookahead, i.e., around 80 instructions.

Takeaway

While out-of-order execution can hide memory latency, this only works when sufficiently many independent memory accesses are sufficiently close together.

More in-between work can prevent the CPU from seeing far enough ahead to the next memory access.

Prefetching

But! We don’t have to rely on the CPU to parallelize memory accesses. We can also do this explicitly using prefetch instructions, which simply tell the CPU to read some data (a cache line) from memory into the L1 cache.
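
The prefetch function used below is not spelled out in the snippet; it could be a thin wrapper around the x86 prefetch intrinsic, roughly like this sketch (x86_64 only; other architectures would need their own intrinsic or a no-op):

#[inline(always)]
fn prefetch<T>(p: *const T) {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        // Request the cache line containing `p` into all cache levels.
        _mm_prefetch::<{ _MM_HINT_T0 }>(p as *const i8);
    }
    #[cfg(not(target_arch = "x86_64"))]
    let _ = p;
}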

 let v: AlignedVec<PaddedPointer> = ...;
 // B pointers to element 0, n/B, 2n/B, 3n/B, ... of the cycle.
 let mut is: [*const usize; B] = ...;

 let loops = black_box(WORK);
 let v0 = v.as_ptr() as *const usize;
 let mut sum = 0;

 for _ in 0..*STEPS / B {
     for i in &mut is {
         *i = unsafe { **i } as *const usize;
+        // Prefetch the next cache line pointed to by `*i`.
+        prefetch(*i);

         // Take the index of the current pointer
         // and do `WORK` iterations on it.
         let mut x = unsafe { (*i).offset_from(v0) } as usize;
         let mut y = x;
         let mut z = x;
         for _ in 0..loops {
             x = x + (x >> 1);
             y = y.widening_mul(x).1;
             z += y;
         }
         sum += z;
     }
 }
 black_box(is);
 black_box(sum);
Code Snippet 10: We explicitly prefetch the cache line needed in the next iteration of the batch.

Figure 9: With explicit prefetching, we can almost completely hide the memory latency again.

Takeaway

Prefetching is an effective way to hide memory latency when iterations are too long for the CPU’s reorder buffer.

TODO Memory bandwidth

  • Experiments measuring the maximum speed of linearly reading an array.
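
As a placeholder, a minimal sketch of such an experiment could look like the following (a naive sequential sum; the eventual measurement may need unrolling or SIMD to avoid being compute-bound):

use std::hint::black_box;
use std::time::Instant;

fn main() {
    // 1 GiB of u64s: far beyond L3, so we stream from RAM.
    let n = (1usize << 30) / 8;
    let v: Vec<u64> = vec![1; n];
    let start = Instant::now();
    let mut sum = 0u64;
    for &x in &v {
        sum = sum.wrapping_add(x);
    }
    black_box(sum);
    let gb_per_s = (n * 8) as f64 / start.elapsed().as_secs_f64() / 1e9;
    println!("{gb_per_s:.1} GB/s");
}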

TODO Multithreading

Further links

References

Khuong, Paul-Virak, and Pat Morin. 2017. “Array Layouts for Comparison-Based Searching.” ACM Journal of Experimental Algorithmics 22 (May): 1–39. https://doi.org/10.1145/3053370.

  1. On most modern hardware, at least. I believe 128-byte cache lines also exist. ↩︎

  2. In some cases it’s possible to skip L3 and L2 and fetch data to L1 directly. ↩︎

  3. On my CPU. ↩︎