- 1 Optimizing Binary Search And Interpolation Search
- 2 Binary search
- 3 Eytzinger
- 3.1 Naive implementation
- 3.2 Prefetching
- 3.3 Branchless Eytzinger
- 3.4 Batched Eytzinger
- 3.4.1 Non-prefetched
- 3.4.2 Prefetched
- 4 Eytzinger or BinSearch?
- 5 Memory efficiency – parallel search and comparison to B-trees
- 6 Interpolation search
- 7 Conclusion and takeaways
1 Optimizing Binary Search And Interpolation Search
This blogpost is a preliminary to the post on static search trees. We will be looking into binary search and how it can be optimized using different memory layouts (Eytzinger), branchless techniques and careful use of prefetching. In addition, we will explore batching and different implementations of it, some of them using vector instructions. Our language of choice will be Rust.
The goal of this text is mainly educational, as we’ll mostly be replicating research that has already been done. In addition, we will investigate how much batching can improve the performance of these search schemes.
1.1 Problem statement
- Input: a sorted array of n unsigned 32-bit integers.
- Output: for a query q, the smallest element >= q, or u32::MAX if no such element exists.
- Metric: we optimize throughput. In plots, we usually report amortized time per query, i.e. not queries/time, but time/query.
Optimizing for throughput is relevant because for many use cases, e.g. in bioinformatics, the queries are often independent and the latency of a single one is not critical. It also allows for batching, which helps us hide memory latency.
1.2 Inspiration and background
This article is inspired by a post in Algorithmica on binary search. We start by reimplementing its experiments and expand upon them by optimizing for throughput rather than latency. We also took inspiration from the notable paper Array Layouts For Comparison-Based Searching by Paul-Virak Khuong and Pat Morin, which goes in-depth on a number of more-or-less successful array layouts.
The paper is great because it goes step by step and introduces nice models for why the different schemes behave the way they do. It is worth a read!
1.3 Benchmarking setup
The benchmarking was done using rustc 1.85.0-nightly
. The CPU is an AMD
Ryzen 4750U, and the DRAM is DDR4-2666. For reproducibility, clock speed
was locked at 1.7GHz and turbo boost was disabled.
All measurements are done 5 times. The line follows the median, and we show the spread of the 2nd to 4th value (i.e., after discarding the minimum and maximum). Note that the results are often very deterministic and the spread is therefore not very visible. The measurements are done at array sizes of \(1.17^k\) so as to be similar to Algorithmica. This also mitigates some artefacts of limited cache set associativity, which we describe in a later section.
2 Binary search
We start with a textbook binary search implementation, and we see if we can replicate the results from Algorithmica. Here’s the code:
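Roughly, it looks like this (a simplified sketch; the helper and function names are assumptions and may differ from the actual benchmark code):

```rust
/// Unchecked read; every index we generate is known to be in bounds.
fn get(vals: &[u32], index: usize) -> u32 {
    unsafe { *vals.get_unchecked(index) }
}

/// Textbook binary search: smallest element >= q, or u32::MAX if none exists.
fn binary_search_naive(vals: &[u32], q: u32) -> u32 {
    let (mut lo, mut hi) = (0, vals.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if get(vals, mid) < q {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    if lo < vals.len() { get(vals, lo) } else { u32::MAX }
}
```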
We note that by default, Rust wraps all accesses to vectors with bounds
checks. Since we are comparing against C++ implementations and there is
no risk of accessing unallocated memory, we implement the get
function, which performs a Rust unsafe get_unchecked
to get rid of the
bounds checks. This will also be the case for all the further
experiments.
So, how does it perform? Let’s compare to the Rust standard library version:
We can see that our naive version is slower, except for large array sizes. Checking out the
standard library source code, we can see that the implementation already
has some optimizations in place. In the binary_search_by
function, we
see the following:
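The core loop has roughly the following shape (paraphrased as a standalone function over u32; not a verbatim copy of the standard library source):

```rust
use std::cmp::Ordering;
use std::hint::select_unpredictable;

// The shape of the standard library's loop, paraphrased.
fn binary_search_std_shape(s: &[u32], q: u32) -> usize {
    let mut size = s.len();
    let mut base = 0usize;
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        let cmp = unsafe { s.get_unchecked(mid) }.cmp(&q);
        // No early exit on Equal: the iteration count stays fixed and
        // predictable, and select_unpredictable asks the compiler for a
        // conditional move (cmov) instead of a branch where available.
        base = select_unpredictable(cmp == Ordering::Greater, base, mid);
        size -= half;
    }
    // The real function then compares the element at `base` once more to
    // build its Ok/Err result.
    base
}
```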
So they do a constant number of iterations instead of stopping early when the value is found, and, via the select_unpredictable function, they try to use the cmov instruction if it is available. Both these optimizations are done so that the branch
predictor has an easier time (as mispredictions are expensive). The cmov
instruction is useful when the result of the comparison can’t be
reliably predicted (which here it really can’t)1. They are also both
recommended by the Algorithmica post, and make the code effectively branchless.
It now makes sense that our naive version is faster on large array sizes.
Algorithmica explains this by the fact that with cmov
, the branch predictor can’t
start to speculatively prefetch data from main memory (as there is no branch). The cmov
-optimized version
therefore suffers more memory latency, as it can’t be hidden by prefetching.
Note that originally, I intended to write here that I did not see this effect of missing speculation. I then found out that testing on arrays of power-of-two size can give very skewed results; but more on that soon when we talk about batching.
Now let us implement these branchless optimizations as well and see how we do then. Here’s the code:
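A sketch of the branchless version, reusing the unchecked get helper from above (select_unpredictable was a nightly-only API at the time of writing):

```rust
use std::hint::select_unpredictable;

/// Branchless lower bound: a fixed number of iterations per array size,
/// with a conditional move instead of an unpredictable branch.
fn binary_search_branchless(vals: &[u32], q: u32) -> u32 {
    if vals.is_empty() {
        return u32::MAX;
    }
    let mut base = 0usize;
    let mut len = vals.len();
    while len > 1 {
        let half = len / 2;
        // Move the base past the probed element only if it is still < q.
        base = select_unpredictable(get(vals, base + half - 1) < q, base + half, base);
        len -= half;
    }
    let candidate = get(vals, base);
    if candidate >= q { candidate } else { u32::MAX }
}
```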
When first implementing this, I, being a Rust newbie, immediately went
for the cmov
crate, as I was unable to make the compiler generate
the cmov
on its own just with an if expression. Trying this, I found
out that it is still plenty slower than the select_unpredictable
function that is used in binary_search_by
, so I followed the approach
that the standard library has.
We can see that we now match the performance of the library version, even surpassing it. We assume this is due to our function being specialized and not having an error condition at the end, making it slightly faster.
To speed the search up for large array sizes, the Algorithmica post recommends explicit prefetching.
This compensates for the CPU being unable to prefetch speculatively when we use the cmov instruction.
We use the following construction for prefetching in Rust:
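A sketch of such a helper, wrapping the `_mm_prefetch` intrinsic (the exact construction in the benchmark code may differ):

```rust
/// Hint the CPU to pull s[index] into cache. On non-x86_64 targets this
/// compiles to nothing.
fn prefetch_index(s: &[u32], index: usize) {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        // wrapping_add keeps the pointer arithmetic well-defined even if the
        // index runs past the end; the prefetch instruction itself never faults.
        _mm_prefetch::<_MM_HINT_T0>(s.as_ptr().wrapping_add(index) as *const i8);
    }
    #[cfg(not(target_arch = "x86_64"))]
    let _ = (s, index);
}
```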
And using this function, we explicitly prefetch both the locations where the binary search could lead us from a given iteration:
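A sketch of how this looks, reusing prefetch_index and the unchecked get from above:

```rust
use std::hint::select_unpredictable;

/// Branchless lower bound that also prefetches both locations the next
/// iteration may probe.
fn binary_search_branchless_prefetched(vals: &[u32], q: u32) -> u32 {
    if vals.is_empty() {
        return u32::MAX;
    }
    let mut base = 0usize;
    let mut len = vals.len();
    while len > 1 {
        let half = len / 2;
        let rest = len - half;
        // The next probe sits at new_base + rest / 2 - 1, where new_base is
        // either base or base + half; prefetch both candidates. The address
        // is only a hint, so the saturating_sub corner case is harmless.
        prefetch_index(vals, base + (rest / 2).saturating_sub(1));
        prefetch_index(vals, base + half + (rest / 2).saturating_sub(1));
        base = select_unpredictable(get(vals, base + half - 1) < q, base + half, base);
        len = rest;
    }
    let candidate = get(vals, base);
    if candidate >= q { candidate } else { u32::MAX }
}
```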
The prefetching does its part, giving us a nice small ~10-15% speedup.
So far we have been replicating the work Algorithmica has done, just in Rust. Now it is time to use the fact that we only care about throughput, and talk about batching.
In this context, batching is just what it sounds like: we will take
several requests at once, and we will handle them concurrently within a
single function. In every loop iteration, we do a comparison for each of
the queries, and we move the base
index for it accordingly. The
branchless version can be seen below; P is a generic parameter specifying the batch size.
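A sketch of the batched version (simplified; the exact signature may differ):

```rust
use std::hint::select_unpredictable;

/// Process P queries in lockstep: one comparison per query per iteration,
/// so the P independent memory accesses can overlap.
fn batched_binary_search<const P: usize>(vals: &[u32], qs: &[u32; P]) -> [u32; P] {
    let mut results = [u32::MAX; P];
    if vals.is_empty() {
        return results;
    }
    let mut bases = [0usize; P];
    let mut len = vals.len();
    while len > 1 {
        let half = len / 2;
        for i in 0..P {
            // The same cmov-friendly update as in the single-query version.
            let go_right = get(vals, bases[i] + half - 1) < qs[i];
            bases[i] = select_unpredictable(go_right, bases[i] + half, bases[i]);
        }
        len -= half;
    }
    for i in 0..P {
        let candidate = get(vals, bases[i]);
        if candidate >= qs[i] {
            results[i] = candidate;
        }
    }
    results
}
```

For example, `batched_binary_search::<8>(&vals, &qs)` resolves 8 queries in one call.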
The reason this improves performance is that it allows us to “amortize” memory latency; while comparing and computing the next relevant address for the search, we can already query the memory for the next value. Since we don’t care about latency but only throughput, we can do this at essentially no cost!
When I first thought about this, I figured that explicit prefetching should not be needed. But in the S+-tree post, Ragnar found that explicitly prefetching memory that was going to be accessed at the next interval size was also helpful. We therefore add it as well; we’ll compare versions with and without prefetching.
We compare the two best variants to see their differences:
We see that the prefetching is not really helping at large batch sizes. My intuition for it is that the batching effectively hides the memory latency anyway and the prefetching only adds unnecessary extra memory traffic.
2.1 A note on power-of-two array sizes
In the benchmarking setup section, we wrote about not running the benchmarks on power-of-two-sized arrays. Now is the time to talk about
why. Let us repeat the previous experiment with multiple batch sizes
with arrays of size \(2^k\), \(5/4 \cdot 2^k\) , \(3/2 \cdot 2^k\) and
\(7/4 \cdot 2^k\).
Notice the sawtooth pattern. We see that when the size of the searched
array is a power of two, the time per query jumps higher. This effect
also gets more pronounced with more batching. Why is this?
After consulting the array layouts paper and the Algorithmica post, we find that the answer is poor cache utilization. The CPU cache sets have limited associativity, and when our memory accesses are regularly spaced (a large power of two apart from each other), they map to the same cache sets and tend to kick each other out of the cache, leading to more loading from main memory. The article Binary Search is a Pathological Case for Caches goes more in-depth on this, if you are interested. I personally was puzzled by this at first and had to think hard about why the program seemed faster for a batch size of 4 at large sizes, only to find that it actually is not.
3 Eytzinger
An issue with the standard array layout is that caches are not optimally exploited. When you think about it, the first few queries in the array are really far apart from each other, and for each of them, we need to fetch a whole cacheline, but we only use one element from that cacheline. We can only exploit spatial locality in the bottom layers of the search. The Eytzinger layout can fix this, while also being friendly to efficient prefetching.
First, as a personal note: when I first encountered the layout, I had no idea it actually had this name. It was for a university programming homework and the task was to code a binary heap. To not have to deal with pointers, the heap layout was specified by indices in arrays. When at position \(i\), the left descendant is at position \(2i\) and the right one is at position \(2i + 1\). I think it is a very common exercise, so maybe you have encountered it in the same way. An illustration of the layout is shown below:

Figure 1: A picture of the Eytzinger layout (stolen from Algorithmica)
As for how to build the layout, there is a simple recursive algorithm which is well described in Algorithmica, so we will not waste space here and will refer the reader there if interested.
Eytzinger should give us better cache utilization by grouping together
commonly accessed elements on the top of the tree (the first 4 layers of the tree fit in a single cache line for u32
values).
This is very helpful for small array sizes and speeds up the searching as compared to vanilla binary
search. The layout is worse at the bottom of the tree, where subsequent memory queries are
very far apart.
3.1 Naive implementation
The API stays the same as for normal binary search; we get a query and
we return the lower bound or u32::MAX
when the lower bound does not
exist.
Notice that indexing starts from one. This makes the layout a bit easier to implement, is a bit more pleasant to caches (layers of the tree will be aligned to multiples of the cache line size), and allows us to easily handle the case where the lower bound does not exist, as the way we calculate the final index will then result in zero.
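A sketch of the search (the struct and method names are assumptions; search_result_to_index is explained and shown below):

```rust
struct Eytzinger {
    // vals[0] holds the u32::MAX default; the tree itself lives at indices 1..
    vals: Vec<u32>,
}

impl Eytzinger {
    fn get(&self, index: usize) -> u32 {
        unsafe { *self.vals.get_unchecked(index) }
    }

    fn search(&self, q: u32) -> u32 {
        let mut idx = 1;
        while idx < self.vals.len() {
            // Go left (2 * idx) if this element is a lower-bound candidate,
            // right (2 * idx + 1) if it is too small.
            idx = 2 * idx + (self.get(idx) < q) as usize;
        }
        self.get(search_result_to_index(idx))
    }
}
```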
The while loop walks down the tree, but the index it produces at the end will be out of bounds. How do we get the index of the lower bound from it?
I needed some time to grok this from the Algorithmica post, so I will
write it here in my own words. Essentially, each iteration of the
while
loop resembles either going to the left or to the right in the
binary tree represented by the layout. By the end of the loop, the index
will resemble our trajectory through the tree in a bitwise format; each
bit will represent whether we went right (1) or left (0) in the tree,
with the most significant bit representing the decision on the top of
the tree.
Now, let’s think about what the trajectory of a lower-bound search looks like. Either we do not find the lower bound, in which case the trajectory is all ones, since q was greater than every element we compared against (including the maximum). In that case we want to return the default value, which we have stored at index 0 of the self.vals array.
In the case the lower bound was found, we know that we compared q against it at some point of the trajectory, went left there, and then only went right afterwards (it is the smallest value >= q, so everything in its left subtree is smaller than q). Therefore, we have to strip all the trailing right turns (ones) from the trajectory, and then one more bit for the final left turn.
Putting this together, what we want to do is this (hidden in the function
search_result_to_index
):
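In code, this is a one-liner (sketch):

```rust
fn search_result_to_index(idx: usize) -> usize {
    // Drop the trailing right turns (ones) and the final left turn;
    // an all-ones trajectory ends up at index 0, where the default lives.
    idx >> (idx.trailing_ones() + 1)
}
```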
Okay, let us see how it performs!
Okay, so we see the layout is a bit slower at the smaller sizes and not too great at the large array sizes. Notice the bumps at small array sizes; they’re not random and we’ll come back to them.
3.2 Prefetching
The great thing about Eytzinger is that prefetching can be super effective. This is due to the fact that if we are at index \(i\), the next index is going to be at \(2i\) or \(2i + 1\). That means that if we prefetch, we can actually prefetch both of the possible options within the same cacheline!
We can abuse this effect up to the cache line size. A usual cache line is 64 bytes long, meaning that it can fit 16 u32
values.
values.
If we prefetch 4 Eytzinger iterations ahead, e.g. to position \(16i\),
we can get all the possible options at that search level in a single
cache line! So, let’s implement this:
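A sketch, extending the struct from above and reusing the prefetch_index helper:

```rust
impl Eytzinger {
    fn search_prefetched(&self, q: u32) -> u32 {
        let mut idx = 1;
        while idx < self.vals.len() {
            // 16 * idx is the start of the 16-element block holding every
            // possible descendant 4 levels down; it may run past the end,
            // which is harmless for a prefetch hint.
            prefetch_index(&self.vals, 16 * idx);
            idx = 2 * idx + (self.get(idx) < q) as usize;
        }
        self.get(search_result_to_index(idx))
    }
}
```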
As for the performance, it gets a lot better at large sizes:
And we can see that prefetching 4 iterations ahead is really the best, which makes sense: we fetch one cache line per iteration either way, we just utilize it better.
3.3 Branchless Eytzinger
Now, we go on to fixing the bumpiness in the Eytzinger graph. This is caused by branch mispredictions on when to end the loop; if the array size is close to a power of two, the ending is easy to predict, but otherwise, it is difficult for the CPU. We proceed as Algorithmica suggests, doing a fixed number of iterations and then doing one conditional move if still needed. We also still do prefetching:
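A sketch of this version; the conditional last step is written with plain index arithmetic here, standing in for the explicit conditional move discussed next:

```rust
impl Eytzinger {
    fn search_branchless(&self, q: u32) -> u32 {
        // Number of completely filled levels of the tree.
        let iters = self.vals.len().ilog2();
        let mut idx = 1usize;
        for _ in 0..iters {
            prefetch_index(&self.vals, 16 * idx);
            idx = 2 * idx + (self.get(idx) < q) as usize;
        }
        // One final, possibly dummy, step for the partially filled last level.
        idx = self.get_next_index_branchless(idx, q);
        self.get(search_result_to_index(idx))
    }

    fn get_next_index_branchless(&self, idx: usize, q: u32) -> usize {
        let in_bounds = idx < self.vals.len();
        // Out-of-range indices read slot 0 (the u32::MAX default) and are
        // forced to take a dummy right turn, which the final shift strips
        // again. The actual implementation does this with an explicit cmov.
        let safe_idx = idx * in_bounds as usize;
        let go_right = !in_bounds | (self.get(safe_idx) < q);
        2 * idx + go_right as usize
    }
}
```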
Where the get_next_index_branchless
uses an explicit cmov
from the
cmov
crate. It was surprisingly difficult to get the compiler to
accept this optimization, as select_unpredictable
did not quite work.
On the performance plot, we see that this helps remove the bumps and also slightly helps the performance when the array size is big.
3.4 Batched Eytzinger
Now, let us do batching the same way we did with binary search. We will consider two variants, prefetched and non-prefetched. The prefetching shouldn’t really be needed; the batching should already overlap the memory requests on its own. But modern computers are strange beasts, so we’ll try it and see. See the source code below.
3.4.1 Non-prefetched
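A sketch, with P again being the batch size and building on the earlier helpers:

```rust
impl Eytzinger {
    fn batched_search<const P: usize>(&self, qs: &[u32; P]) -> [u32; P] {
        let iters = self.vals.len().ilog2();
        let mut idxs = [1usize; P];
        for _ in 0..iters {
            // One tree step per query; the P independent loads can overlap.
            for i in 0..P {
                idxs[i] = 2 * idxs[i] + (self.get(idxs[i]) < qs[i]) as usize;
            }
        }
        let mut out = [0u32; P];
        for i in 0..P {
            let idx = self.get_next_index_branchless(idxs[i], qs[i]);
            out[i] = self.get(search_result_to_index(idx));
        }
        out
    }
}
```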
3.4.2 Prefetched
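The prefetched variant only differs in the inner loop (again a sketch):

```rust
impl Eytzinger {
    fn batched_search_prefetched<const P: usize>(&self, qs: &[u32; P]) -> [u32; P] {
        let iters = self.vals.len().ilog2();
        let mut idxs = [1usize; P];
        for _ in 0..iters {
            for i in 0..P {
                // Prefetch the cache line with all descendants 4 levels down.
                prefetch_index(&self.vals, 16 * idxs[i]);
                idxs[i] = 2 * idxs[i] + (self.get(idxs[i]) < qs[i]) as usize;
            }
        }
        let mut out = [0u32; P];
        for i in 0..P {
            let idx = self.get_next_index_branchless(idxs[i], qs[i]);
            out[i] = self.get(search_result_to_index(idx));
        }
        out
    }
}
```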
We compare the two best options from the two graphs, one prefetched and one non-prefetched:
We see that the prefetched version is a few percent faster on large input sizes. Therefore, we select it as our best Eytzinger version.
4 Eytzinger or BinSearch?
Now, to compare batched Eytzinger to batched binary search:
We see that batched Eytzinger beats batched binary search by some amount, especially at larger array sizes. If we compare the two layouts, we know that Eytzinger provides better locality at the top of the search while the normal sorted array layout for binary search provides better locality at the bottom of the search. Both of these effects are largely offset by batching (because it hides the latency of memory accesses quite well). So the limit is likely memory throughput.
When it comes to memory throughput, Eytzinger has the advantage. The frequently accessed top levels of the tree are cached more efficiently and can be reused between queries. This leads to fewer cache lines fetched from main memory overall compared to binary search.
5 Memory efficiency – parallel search and comparison to B-trees
Now let us push memory to its limits and compare the layouts when we are allowed to use multiple threads to query. For this test, I have turned off hyperthreading and locked the CPU to 8 cores. The first interesting aspect of this is whether prefetching will help now. Let’s first look at binary search:
We see that the prefetching, since it increases pressure on memory, is again mostly counterproductive in the multithreaded setting. In the next plot, we see that batching is helpful up to a batch size of roughly 32, and then it levels out.
We will use batch size 32 as a reference.
As far as Eytzinger goes:
We use batch size 16 as a reference for Eytzinger, as it nicely combines speedup at small and large sizes. So for the final comparison:
We see that Eytzinger is a bit faster, likely due to better memory efficiency in the top tree levels, as we already saw in the single-threaded case. There, the difference between Eytzinger and binary search was roughly 8-10% at 1GB; here, the difference is also in the single-digit percent range.
Overall, the speedup from using 8 threads was roughly 4× (at an array size of 1GB). This clearly indicates that we’re memory bound. If we wanted more speed and better cache utilization, we could lay out the first \(\lg(n)/2\) layers in the Eytzinger order and the bottom \(\lg(n)/2\) layers as standard sorted arrays. However, we won’t delve into this here, as there is more efficient stuff one can do; check out Ragnar’s post on S-trees!
6 Interpolation search
In the static search tree post, Ragnar suggested looking at interpolation search as an option to do less accesses to main memory. For completeness, we will implement it here as well to check out how it performs.
The idea behind interpolation search is based on the fact that if the data is drawn from a uniform random distribution, then when we sort it and plot the indices on the x-axis and the values on the y-axis, we should get roughly a straight line. Using that, given a query, we can efficiently interpolate (“guess”) where the value corresponding to the query should be.
When the input data is nicely evenly distributed, the complexity is \(O(\lg \lg n)\) iterations, rather than \(O(\lg n)\) for binary search. When the data is not well distributed, the worst case complexity is \(O(n)\), which is illustrated by the following example. Imagine we’re searching for 2 in the following array of 10000 elements:
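One array with this behaviour could look as follows (an illustrative reconstruction, not the exact original data):

```rust
// 9,998 copies of 1, then the 2 we are searching for, then one much larger
// element. Every interpolation estimate lands right at the start of the
// remaining range (the very first one points at the second element), so the
// search crawls forward one element per iteration -- O(n) in total.
fn pathological_input() -> Vec<u32> {
    let mut vals = vec![1u32; 9_998];
    vals.push(2);
    vals.push(10_000);
    vals
}
```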
Every time we do the interpolation, we suspect that the 2 is on the second position of the array. It is therefore very easy to construct pathological examples for interpolation search. Even in non-adversarial settings, like with the human genome, we could get into trouble with non-uniform distribution of input data. But let’s try it out anyway and see how it goes.
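A sketch of the implementation (the exact bookkeeping may differ):

```rust
/// Interpolation search: smallest element >= q, or u32::MAX if none exists.
fn interpolation_search(vals: &[u32], q: u32) -> u32 {
    // If even the largest element is smaller than q, there is no answer.
    if vals.last().map_or(true, |&last| last < q) {
        return u32::MAX;
    }
    let (mut lo, mut hi) = (0usize, vals.len() - 1);
    // Invariant: the answer lives at an index in [lo, hi].
    while lo < hi {
        let (a, b) = (vals[lo] as u64, vals[hi] as u64);
        // Guess where q would sit if the values were spread uniformly between
        // vals[lo] and vals[hi]. The integer division is what makes each
        // iteration expensive.
        let offset = (q as u64).saturating_sub(a) * (hi - lo) as u64 / (b - a).max(1);
        let pos = (lo + offset as usize).min(hi - 1);
        if vals[pos] < q {
            lo = pos + 1;
        } else {
            hi = pos;
        }
    }
    vals[lo]
}
```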
We see that the performance is mostly terrible, multiple times slower than even binary search. Looking at `perf` outputs, we see that the issue is two-fold. Firstly, there is a data hazard on the if condition in each iteration. But secondly, integer division is just very slow.
We try whether batching can hide some of this, as it did before:
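A sketch of a simple batched variant, where every query keeps its own range and all of them take one interpolation step per round (P is again the batch size):

```rust
/// Batched interpolation search: P queries advance one step per round.
fn batched_interpolation_search<const P: usize>(vals: &[u32], qs: &[u32; P]) -> [u32; P] {
    let mut out = [u32::MAX; P];
    let mut lo = [0usize; P];
    let mut hi = [vals.len().saturating_sub(1); P];
    let mut active = [false; P];
    for i in 0..P {
        // Queries whose answer cannot exist are finished immediately.
        active[i] = vals.last().map_or(false, |&last| last >= qs[i]);
    }
    loop {
        let mut any_active = false;
        for i in 0..P {
            if !active[i] {
                continue;
            }
            if lo[i] == hi[i] {
                out[i] = vals[lo[i]];
                active[i] = false;
                continue;
            }
            any_active = true;
            let (a, b) = (vals[lo[i]] as u64, vals[hi[i]] as u64);
            let offset =
                (qs[i] as u64).saturating_sub(a) * (hi[i] - lo[i]) as u64 / (b - a).max(1);
            let pos = (lo[i] + offset as usize).min(hi[i] - 1);
            if vals[pos] < qs[i] {
                lo[i] = pos + 1;
            } else {
                hi[i] = pos;
            }
        }
        if !any_active {
            break;
        }
    }
    out
}
```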
Overall, I did not see this as a priority and did not spend too much time optimizing it, as it seems like a bit of a dead end. I would appreciate ideas; if you have them, please let me know.
An interesting question for interpolation search is also how well it performs on non-random data. Therefore, we download a part of the human genome from here and compute the 32-bit prefixes of all its suffixes. We then search in a subset of them and measure performance. This should be slower, as the data is not going to be exactly uniformly distributed.
We try at first with our original benchmarking setup; we take the first X 16-mers, sort them and work on them. The results are a bit strange:
The reason for this strange result is that the human data is strongly non-uniform. As interpolation search performs badly with increasing non-uniformity, we can assume that the start of the genome is really, really badly distributed and the distribution goes back to something resembling a uniform one as we increase the size of the sample we’re searching.
We will fix this by not always starting from the beginning, but instead taking a random starting index in the unsorted array of 16-mers and a contiguous segment from there. That way, the results will be realistic (it makes sense to search through a contiguous segment of the genome) but we will avoid the skewed start.
7 Conclusion and takeaways
Overall, we found that the conclusions from the Algorithmica article and from the array layouts paper mostly hold even for batched settings. Eytzinger is the best choice for a simple algorithm that is also very fast. It beats standard binary search due to its better cache use characteristics and ease of prefetching. The other major takeaway is of course that batching is essentially free performance and if you can, you should always do it.
As for interpolation search, I do not consider the scheme too worthwhile; it is difficult to optimize and relies on the characteristics of the data for its performance. Given that there are schemes like Eytzinger or S-trees that are well suited to modern hardware optimizations, I think you should mostly use those, even though their asymptotics are worse.
When writing this, I was surprised to see that the Rust standard library already has some optimizations for binary search implemented, but not all of those recommended by our sources; notably, prefetching is missing. This is surprising, because prefetching arguably costs almost nothing. Is it due to the unavailability of prefetch instructions on some platforms? If you know, please let me know.
Here’s Linus talking about it ↩︎