This (planned) series of posts aims to develop a high-performance search algorithm for suffix arrays. We will start with a classic binary search implementation and make incremental improvements to it. But that is planned for Part 3.
Before that, we will review existing methods for binary searching plain integer arrays as a warm-up exercise. There is a great paper by Khuong and Morin (2017), “Array Layouts for Comparison-Based Searching”, and an algorithmica.org case study based on it; these will form the basis of that part. See Part 2.
First, (in a typical case of nerd-sniping,) we must understand our hardware. In this part, we will look at some simple benchmarks to understand the capabilities and limits of a modern CPU. This part is very much inspired by Chapter 9 of algorithmica.org. In fact, I will probably end up directly replicating some of the experiments from there for educational purposes (at least for myself). Algorithmica also lists this post by Igor Ostrovsky and What Every Programmer Should Know About Memory by Ulrich Drepper as useful resources, but I haven’t read them closely myself.
Even at the start, I can already tell you that each part will have the same conclusion:
- Reading from main memory has very high latency.
- Batching and prefetching reads hide latency and can increase throughput over 10x.
In this post, Part 1, we will mostly investigate latency and bandwidth of L1/L2/L3 caches and main memory in different workloads.
The final goal is to develop a static data structure supporting a high throughput of read-only queries, and so for now we focus on reading from memory rather than writing to memory.
Code can be found in this repository: github:RagnarGrootKoerkamp/cpu-benchmarks.
Introduction
Hardware
First, let's get some idea of my hardware.
- CPU: `lscpu | grep model`: Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz, a high-end laptop CPU released in 2020.
- Cores: This CPU has 6 cores. Hyperthreading is disabled.
- Frequency: All experiments run at the base frequency of 2.6GHz, with turbo boost disabled.
- Caches: `lscpu | grep cache`
- L1: 32K per core
- L2: 256K per core
- L3: 12M shared
- Main memory: 64G, as two 32G DDR4 banks at 3.2GHz.
You can also use `lstopo` to generate a figure of the CPU and memory layout.
Details of caches and memory
Before exploring the characteristics and limitations of memory through experiments, I will summarize the highlights and some relevant notes-to-self here. Google is your friend to learn more about the details of any of these.
- Cache lines
- Caches work in units of 64-byte cache lines. See `getconf -a | grep CACHE_LINESIZE`. Each time you read a byte, the entire 64-byte cache line is moved from RAM to L3 to L2 to L1.
- Memory pages
- The CPU has to translate between virtual addresses (the pointers in your program) and physical addresses. The usual page size is 4K, meaning that memory gets allocated in 4K chunks, and data within a page is consecutive in main memory. The translation lookaside buffer (TLB) is a small dedicated cache for recent page lookups. My CPU has 64 page entries in the L1 TLB, and 1.5K page entries in the L2 TLB.
Huge pages are 2M or 1G instead. When addressing a lot of memory, the 4K memory pages may not all fit in the TLB. Switching to larger page sizes fixes this, at the cost of wasting some memory.
I have transparent huge pages enabled, which automatically switches to huge pages for large data.
- Main memory
- Memory is stored as rows of 1K to 8K. Consecutive reads in the same row may be slightly faster.
- My memory has a CAS latency of 22 cycles, i.e., after receiving a request it takes 22 cycles (7ns) to return the data.
- Instruction cache
- The processor doesn't only need to fetch the data, it also needs the code itself. Thus, there is a dedicated 32K L1i instruction cache alongside the 32K L1d data cache. There is no dedicated L2 instruction cache, so L2 is shared between code and data.
- Line fill buffers
- Each core has 10 line fill buffers (Intel) or miss address buffers (AMD): separate 64-byte buffers that are reserved for pending requests to L2/L3/RAM for cache lines that are not yet in L1.
- Cache associativity
- Each memory address can only be cached in a small number (typically 4-16) of possible cache-line slots. We will see an effect of this in the hugepages section below.
- Eviction policies
- It's generally hard to find information on which cache eviction policies are used.
Understanding and predicting the behaviour of CPUs and the memory system appears to be complex, given that many details are only known (it seems) through reverse engineering. A benefit of attempting to make highly accurate plots is that we can see a lot of details, but this comes with the drawback that I do not always have answers (yet) explaining these details. I have marked some questions in bold below.
Latency, bandwidth, and throughput
The CPU memory, caches, and instructions (and algorithms in general) have two important properties:
- Latency
- The time it takes to fetch/process some data to the CPU after requesting it, e.g. 10ns.
- Bandwidth
- The upper bound on the amount of data that can be read per second, e.g. 1GB/s.
Also related:
- Throughput
- The actual amount of data processed, in items or size per second. The inverse throughput is the average time between completion of consecutive units of work, e.g. 1ns. This can be easily compared to latency, and is typically lower/better than the latency.
Latency & pointer chasing
In this section we measure the latency of caches and RAM.
TL;DR: RAM is slow! Each lookup takes 78ns, or ~200 cycles.
Pointer chasing
We start with a simple pointer chasing experiment: we create a large array in which each position contains the index of another position, and then follow the chain. In particular, we ensure that the array is a random derangement, or rather, a permutation consisting of a single long cycle, so that the accesses cannot be predicted by the hardware prefetcher.
The first experiment looks like this:
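The exact code lives in the repository; below is a minimal sketch of the idea. The `derangement` helper, the LCG, and the constants are my reconstruction, not the original code.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Build a permutation of 0..n that forms a single long cycle, so the hardware
/// prefetcher cannot predict the next address.
fn derangement(size_bytes: usize) -> Vec<usize> {
    let n = size_bytes / std::mem::size_of::<usize>();
    // Random visiting order via Fisher-Yates, using a simple LCG to stay dependency-free.
    let mut order: Vec<usize> = (0..n).collect();
    let mut state = 0x1234_5678_9abc_def0u64;
    for i in (1..n).rev() {
        state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        order.swap(i, (state as usize) % (i + 1));
    }
    // v[order[k]] = order[k + 1] closes one cycle over all positions.
    let mut v = vec![0; n];
    for k in 0..n {
        v[order[k]] = order[(k + 1) % n];
    }
    v
}

/// Average time per lookup, in nanoseconds, for an array of `size_bytes`.
fn lookup_latency(size_bytes: usize, steps: usize) -> f64 {
    let v = derangement(size_bytes);
    let mut i = 0;
    let start = Instant::now(); // only the main loop is timed, not the setup
    for _ in 0..steps {
        i = v[i]; // each load depends on the previous one
    }
    black_box(i); // prevent the loop from being optimized away
    start.elapsed().as_nanos() as f64 / steps as f64
}

fn main() {
    for log_size in 12..28 {
        println!("2^{log_size} B: {:.1} ns", lookup_latency(1 << log_size, 10_000_000));
    }
}
```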
First, the vector `v` is initialized with a derangement with the given total `size` in bytes (not length). Then we iterate the main loop for some configurable number of `STEPS`, and in each iteration we read one element. At the end we pass `i` to `black_box` to prevent everything from being optimized away. Note that we only time the main loop, not the initialization. The code can be found at [TODO]. The result is in Figure 1.
Observe that:
- Latency goes up as the array size increases.
- After crossing a cache-size boundary the increase is smooth, not stepwise, because part of the data still fits in the smaller but faster cache.
- The latency stabilizes once the smaller caches become negligible.
- Since L3 cache is shared between all cores/processes, the array cannot completely occupy it, and we see a slowdown already at slightly smaller array sizes.
- A similar effect happens when crossing from L2 to L3, probably because L2 is also used for the program code itself.
For reference, here is the corresponding assembly code:
[Assembly listing: the checked indexing loop.]
Bounds checking
Since we are writing Rust, array indexing in `v[i]` is always checked, and when `i` is not a valid index the code panics. That is nice, but since we are looking for high performance here, we'll avoid the bounds checks by using `get_unchecked`. But since that looks kinda ugly, from here on I will use an `UncheckedVec` wrapper type and just write `v[i]` for simplicity.
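Such a wrapper might look roughly like this (a sketch; the actual `UncheckedVec` in the repository may differ):

```rust
use std::ops::Index;

/// A thin wrapper around Vec<T> whose indexing skips the bounds check.
pub struct UncheckedVec<T>(pub Vec<T>);

impl<T> Index<usize> for UncheckedVec<T> {
    type Output = T;
    #[inline(always)]
    fn index(&self, i: usize) -> &T {
        // SAFETY: the caller must guarantee that `i < self.0.len()`.
        unsafe { self.0.get_unchecked(i) }
    }
}
```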
Although not faster, the generated assembly is much more concise.
[Assembly listing: the unchecked indexing loop.]
Padding elements
One thing we did not yet account for is that each cache line of the array contains multiple (64B/8B = 8) elements, so in some cases the next index may already be cached because it is in the same cache line as a recently seen element. To prevent this effect, we pad each element to occupy 64 bytes.
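One way to do this (a sketch; the field name is mine) is to align each element to a full cache line:

```rust
/// One index per 64-byte cache line: the alignment attribute pads the struct
/// from 8 to 64 bytes, so consecutive elements never share a cache line.
#[repr(align(64))]
#[derive(Clone, Copy)]
struct Padded {
    next: usize,
}

fn chase_padded(v: &[Padded], steps: usize) -> usize {
    let mut i = 0;
    for _ in 0..steps {
        // The address offset is now 64 * i instead of 8 * i.
        i = v[i].next;
    }
    i
}
```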
As expected, we see in Figure 3 that the padded version is consistently slower than the original version.
- In L1, we can see that one additional cycle per lookup is needed to compute the `64 * i` offset, since a scale of 64 cannot be encoded directly in the `mov` instruction's addressing mode like the `8 * i` before.
- In L2, the running time is initially exactly flat, and not a smooth transition. Most likely this is because once space in L2 runs out, it throws away the least recently used cache line. Since our 'walk' through the array is cyclic, elements will be evicted from L1 before we loop around, basically making the L1 useless.
- As L2 gets fuller, we observe a slowdown before it is completely full. We’ll get back to this in a bit.
- Question: Unlike the L1->L2 transition, the L2->L3 transition is smooth. Maybe L2 has a different strategy for which elements are evicted?
Raw pointers
So far, we weren't really chasing pointers. Instead, we were chasing indices, which involve a slight indirection, since `v[i]` needs to add `i` to the pointer to the start of the array (`&v[0]`). To avoid this, we can store actual pointers in a `Vec<*const usize>` and skip the offset computation.
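A sketch of this variant, reusing the hypothetical `derangement` helper from above (the pointer juggling here is my reconstruction):

```rust
/// Store in every slot a raw pointer to the next slot, so that following the
/// chain is a single load with no index arithmetic.
fn build_pointer_cycle(size_bytes: usize) -> Vec<*const usize> {
    let idx = derangement(size_bytes); // indices forming one long cycle
    let mut v: Vec<*const usize> = vec![std::ptr::null(); idx.len()];
    let base = v.as_ptr();
    for i in 0..idx.len() {
        // Element i holds the address of element idx[i].
        v[i] = unsafe { base.add(idx[i]) } as *const usize;
    }
    v
}

fn chase_pointers(v: &[*const usize], steps: usize) -> *const usize {
    let mut p = v[0];
    for _ in 0..steps {
        // The value stored at `p` is itself the pointer to the next element.
        p = unsafe { *(p as *const *const usize) };
    }
    std::hint::black_box(p)
}
```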
Raw pointer indexing is slightly faster than array indexing.
Aligned memory & Hugepages
There is a weird but consistent improvement in performance once the array reaches size 2^25 = 32MiB. This appears to be the point where the kernel decides that instead of reusing some memory from the already allocated heap, it will make a completely new allocation.
Most likely, the reason this is faster is that transparent hugepages kick in: the operating system can automatically detect large allocations and use 2MiB hugepages for them instead of the default 4KiB pages. This reduces pressure on the translation lookaside buffer (TLB) that maps virtual memory pages to physical memory addresses.
To avoid the slight slowdown just before 32MiB, we can instead always use hugepages, by allocating a multiple of 2MiB at a 2MiB boundary. I use the alloc-madvise crate for this, which also indicates to the system that hugepages should be used. To make this work reliably, we over-allocate all arrays to the next size that is a multiple of 2MiB. In fact, it turns out that arrays of, say, 2MiB still get allocated on the program heap, which uses 4KiB pages. If we instead allocate the next multiple of (and at least) 32MiB, this is fixed and hugepages work consistently.
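I haven't shown the alloc-madvise code here; the sketch below captures the same idea using the standard allocator plus a `madvise` call via the libc crate (the rounding and error handling are simplified, and, as noted above, in practice the size is rounded up further to a multiple of 32MiB):

```rust
use std::alloc::{alloc_zeroed, Layout};

const HUGE_PAGE: usize = 2 * 1024 * 1024; // 2 MiB

/// Allocate `size` bytes rounded up to a multiple of 2 MiB, aligned to a
/// 2 MiB boundary, and hint the kernel to back the range with hugepages.
fn alloc_huge(size: usize) -> (*mut u8, Layout) {
    let size = size.next_multiple_of(HUGE_PAGE);
    let layout = Layout::from_size_align(size, HUGE_PAGE).unwrap();
    unsafe {
        let ptr = alloc_zeroed(layout);
        assert!(!ptr.is_null());
        // Linux-specific: requires transparent hugepages set to "madvise" or "always".
        libc::madvise(ptr as *mut libc::c_void, size, libc::MADV_HUGEPAGE);
        (ptr, layout)
    }
}
```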
Indeed, the spike at 2^25 is gone! Very satisfying! And generally performance improves across the L3 range. Sizes just below L3 capacity are slightly noisy, since other ongoing processes also compete for the last bit of L3 cache.
Also, performance is now perfectly constant for all L2 sizes. Before, the 4KiB (2^12) sized pages were probably at random hardware offsets. Due to associativity, each memory address can only be cached at a small number (4-16) of possible cache lines. When the pages are randomly positioned, some cache sets will be over-used while others are under-used, which means that even though the array is smaller than L2, it may not be possible to cache it in its entirety. (As a rough sanity check: assuming, say, a 4-way associative 256KiB L2, there are 1024 sets, indexed by address bits 6 to 15. With 4KiB pages, bits 12 to 15 of the set index depend on which physical pages the kernel happens to hand out, so an unlucky assignment overloads some sets.) With 2MiB page sizes, the entire allocation is a single block, and the distribution over cache sets is perfectly uniform. Thus, the entire array can be cached in L2 at once.
To effectively use hugepages, we must allocate at least 32MiB to avoid the pre-allocated heap. When used, they slightly improve performance in L2 and L3 and make results more consistent.
Summary
To wrap up, here is a summary of results.
| | L1 (ns) | L2 (ns) | L3 (ns) | RAM (ns) | L1 (cycles) | L2 (cycles) | L3 (cycles) | RAM (cycles) |
|---|---|---|---|---|---|---|---|---|
| Pointer Chasing Checked | 1.9 | 4.9 | 18.9 | 77.3 | 5.1 | 12.6 | 49.2 | 200.9 |
| Pointer Chasing | 1.9 | 5.2 | 19.6 | 77.4 | 5.1 | 13.6 | 51.0 | 201.3 |
| Pointer Chasing Padded | 2.3 | 6.1 | 20.3 | 78.3 | 6.1 | 15.9 | 52.7 | 203.6 |
| Raw Pointer Chasing | 1.7 | 4.0 | 19.1 | 77.7 | 3.2 | 10.3 | 49.6 | 202.1 |
| Raw Pointer Chasing Padded | 1.6 | 4.6 | 18.7 | 78.5 | 4.0 | 12.1 | 48.7 | 204.2 |
| Pointer Chasing Padded Aligned | 2.3 | 5.1 | 15.7 | 78.8 | 6.0 | 13.3 | 40.8 | 204.9 |
| Raw Pointer Chasing Padded Aligned | 1.5 | 4.6 | 15.3 | 78.6 | 4.0 | 12.1 | 39.8 | 204.3 |
Based on this evaluation, we will from now on assume all the above optimizations:
- unchecked indexing,
- cache-line-sized (64-byte) array elements,
- aligned 2MiB hugepages, allocated as a multiple of 32MiB,
- raw pointers instead of indices.
RAM has a latency of just below 80ns.
Random access throughput & batching
In this section we measure the random access throughput of caches and RAM.
TL;DR: By using batches of size \(>12\) and prefetching, we can fully saturate the memory bandwidth with random memory accesses.
Batching
So far we have only been looking at latency, where we process a single RAM access at a time. Instead, we can also consider the throughput of random accesses, where we do multiple independent accesses in parallel.
To measure this, we use batching with batches of size `B`: instead of a single pointer-chasing chain, we process `B` chains in parallel. To ensure that they do not interfere with each other or access the same cache lines, we initialize them equally spaced on the single long cycle of pointers.
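A sketch of the batched loop (the chain starting points are simplified here compared to spacing them exactly evenly along the cycle):

```rust
use std::hint::black_box;

/// Follow B independent chains in lockstep, so that up to B loads can be in
/// flight at the same time.
fn chase_batch<const B: usize>(v: &[usize], steps: usize) -> usize {
    // Start the chains at spread-out positions.
    let mut idx = [0usize; B];
    for b in 0..B {
        idx[b] = b * (v.len() / B);
    }
    for _ in 0..steps {
        for b in 0..B {
            // The B loads only depend on the previous element of their own
            // chain, not on each other, so the CPU can overlap them.
            idx[b] = v[idx[b]];
        }
    }
    black_box(idx[0])
}
```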
As expected, batch size 1 behaves the same as the best result from before. We observe that larger batch sizes improve throughput linearly, until it saturates at batch size 8 in L3 and 16 in RAM! (Batch sizes 64 and 128 (not shown) provide no further improvement.)
The best single-threaded random access throughput in RAM is 7.4ns per cache line, or 8.6GB/s. That's around a third of the total RAM throughput.
The reason that the CPU is able to improve so much over the latency variant of before is pipelining and out-of-order execution: multiple instructions can be executed in parallel, and when some instructions are slow, for example because they are waiting for memory, upcoming instructions that do not depend on the result can already be started. This way, the CPU is able to execute over 10 instructions/memory requests at a time.
The CPU uses out-of-order execution and pipelining to execute multiple instructions in advance and in parallel while previous ones are still waiting, e.g. for memory. This hides the latency of independent reads.
Line fill buffers
Looking closer, we see that the throughput saturates at batch size 12. This corresponds to my CPU having 12 line fill buffers: each request to memory that is not already present in L1 reserves a line fill buffer as a placeholder where the cache line will be stored once it is available. This means that each CPU core can only have 12 ongoing requests at a time, and hence batch sizes beyond this do not increase throughput.
We should use a batch size of at least 12, or in practice usually 16, to fully use the random access RAM bandwidth available to each core.
The reorder buffer
We just saw that the CPU can execute up to 12 memory requests at a time by 'looking ahead' and executing multiple instructions in parallel. But this has its limits.
Let's say that we are not just chasing pointers, but also doing some work on each result, as in Code Snippet 9. We have a template parameter `WORK` that controls how often we iterate over the result of each lookup. The content of the iterations is just there to keep the CPU busy, and includes a \(64\times64\to128\) bit `widening_mul` to prevent SIMD vectorization.
Figure 8 shows the results for 3, 6, and 12 iterations of work.
We observe a few things:
- Naturally, doing more work is slower when everything hits in L1, but all methods are at most 14ns.
- All three methods with work slow down significantly in RAM.
- The 12-work version is consistently slower than just the latency of pointer chasing.
- The 6-work version is around 2x faster, so the CPU must be fetching roughly one iteration ahead.
- The 3-work version is roughly another 2x faster, so the CPU must be fetching roughly three iterations ahead.
Thus, the CPU can only look so far ahead. In fact, my CPU should have a reorder buffer (ROB) of around 200 instructions, but for this task it only manages to look ahead by fewer than 12 work iterations, i.e. around 80 instructions.
While out-of-order execution can hide memory latency, this only works when sufficiently many independent memory accesses are sufficiently close together. More in-between work can prevent the CPU from seeing far enough ahead to the next memory access.
Prefetching
But! We don’t have to rely on the CPU to parallelize memory accesses. We can
also do this explicitly using prefetch
instructions, that simply tell the CPU
to read some data (a cache line) from memory into L1 cache.
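The snippet is in the repository; as a sketch of the idea, the batched chains can be combined with explicit prefetching (the exact structure is my assumption):

```rust
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0}; // x86_64 only
use std::hint::black_box;

/// For each of the B chains, load the next index and immediately prefetch the
/// cache line it points to, then do the per-element work. The prefetch retires
/// quickly, so the memory request overlaps with the work even when the work is
/// too long for the reorder buffer to look across.
fn chase_work_prefetch<const B: usize, const WORK: usize>(v: &[usize], steps: usize) -> u64 {
    let mut idx = [0usize; B];
    for b in 0..B {
        idx[b] = b * (v.len() / B);
    }
    let mut checksum = 0u64;
    for _ in 0..steps {
        for b in 0..B {
            let next = v[idx[b]];
            // Start fetching the element we will need in the next round ...
            unsafe { _mm_prefetch::<_MM_HINT_T0>(v.as_ptr().add(next) as *const i8) };
            // ... while we do the serial work on the current element.
            let mut x = idx[b] as u64 | 1;
            for _ in 0..WORK {
                let wide = (x as u128) * 0x9E37_79B9_7F4A_7C15u128;
                x = (wide >> 64) as u64 ^ (wide as u64);
            }
            checksum = checksum.wrapping_add(x);
            idx[b] = next;
        }
    }
    black_box(checksum)
}
```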
Prefetching is an effective way to hide memory latency when iterations are too long for the CPU’s reorder buffer.
TODO Memory bandwidth
- Experiments measuring the maximum speed of linearly reading an array.
TODO Multithreading
Further links
- L3 is typically inclusive of L1 and L2: https://community.intel.com/t5/Software-Tuning-Performance/cache-eviction-policy-of-Intel-newer-CPUs/m-p/922774#M1359
- L2 is typically neither inclusive nor exclusive of L1; they just do their own thing and may overlap.
- https://travisdowns.github.io/blog/2020/05/18/icelake-zero-opt.html#fn:l1port
- https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt.html#fnref:melty
- https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#fnref:rmwnote
- Investigate the branch predictor state size.
- https://github.com/Kobzol/hardware-effects/tree/master/write-combining