There was a Hacker News post recently on a new single-source shortest-path algorithm (Duan et al. 2025) that takes \(O(m (\log n)^{2/3})\) time, which is less than \(O(m + n \log n)\) (the complexity of Dijkstra with Fibonacci heaps with \(O(1)\) inserts) when \(m \leq n (\log n) ^{1/3}\), or equivalently, when the average degree satisfies \(d=m/n=O((\log n) ^{1/3})\). Anyway – this got me thinking about heaps. Specifically, about how fast we can make them.

The bottom line is that there is the quickheap data structure (Navarro and Paredes 2010) and its improved version (Navarro et al. 2011), and my implementation of this (github:ragnargrootkoerkamp/quickheap) is 2 to 3\(\times\) faster than the other heaps/priority queues from crates.io that I tested.

As for the title: there are many papers on I/O-efficient and cache-oblivious heaps and priority queues (Brodal 2013), but I was not able to find a modern implementation of any of them – somehow they don’t seem to be used in practice at all. On the other hand, there is the radix heap (Ahuja et al. 1990), which is sometimes faster, but it can only be used in cases where the extracted keys increase over time.

Jump straight to Implementation or Results.

1 Background

Priority queue

As a quick recap, for the purposes of this post, a priority queue (wikipedia) is a data structure that supports the following two operations:

Push (insert): Add a value \(x\) to the data structure.
Pop (extract min): Remove and return the smallest value \(x\) in the data structure.

Other possible operations include find min (peek), delete, and decrease key, but we will not consider these.

Note: most literature seems to present the min-queue as above, but Wikipedia and implementations often use a max-queue, where a pop returns the maximum element instead.

Binary heap

The binary heap (wikipedia) works by using the Eytzinger layout, also known as simply the heap layout. This represents \(n\) values as a binary tree that is stored layer by layer. If we put the root at index 1, the children of node \(i\) are at \(2i\) and \(2i+1\). The heap invariant, which makes the tree into a heap (wikipedia), is that each value is at most the value of its two children, which implies that the root is the smallest value.

Push is implemented by first appending the new value to the array, which adds it as the next node in the bottom layer of the tree (possibly creating a new layer). Then, we ensure the invariant is preserved by sifting up: swapping the new element with its parent as long as the parent is larger.

Pop is implemented by first extracting the root and replacing it with the last element of the queue. Then, we again ensure the invariant is preserved by sifting down: we swap the root with the smaller of its two children, as long as it is larger than that child.

We can see that in the worst case, both these operations take \(\lg n = \log_2(n)\) steps.
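
As a concrete (unoptimized) sketch of both operations, here is a minimal binary min-heap using the 1-based layout from above, with a placeholder at index 0 (my code, not any particular crate):

struct BinHeap {
    /// values[0] is an unused placeholder; the root is at index 1,
    /// and the children of node i are at 2i and 2i+1.
    values: Vec<u32>,
}

impl BinHeap {
    fn new() -> Self {
        BinHeap { values: vec![0] }
    }

    fn push(&mut self, x: u32) {
        self.values.push(x);
        let mut i = self.values.len() - 1;
        // Sift up: swap the new element with its parent while the parent is larger.
        while i > 1 && self.values[i / 2] > self.values[i] {
            self.values.swap(i / 2, i);
            i /= 2;
        }
    }

    fn pop(&mut self) -> Option<u32> {
        if self.values.len() == 1 {
            return None;
        }
        // Extract the root and replace it by the last element.
        let min = self.values.swap_remove(1);
        // Sift down: swap with the smaller child while that child is smaller.
        let mut i = 1;
        loop {
            let mut smallest = i;
            for c in [2 * i, 2 * i + 1] {
                if c < self.values.len() && self.values[c] < self.values[smallest] {
                    smallest = c;
                }
            }
            if smallest == i {
                break;
            }
            self.values.swap(i, smallest);
            i = smallest;
        }
        Some(min)
    }
}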

D-ary heaps

D-ary heaps (wikipedia) generalize binary heaps so that each node has up to \(d\) children. Their main benefit is better cache locality, and fewer levels of the tree to traverse. The drawback is that sifting down requires finding the minimum of \(d\) elements rather than just 2, but this is fast in practice using SIMD instructions.
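
With 0-based indexing (root at index 0), the index arithmetic generalizes as in this sketch (my naming); sifting down then scans children(i) for the minimum, which is where SIMD helps:

const D: usize = 4;

/// Parent of node `i` (0-based, root at index 0).
fn parent(i: usize) -> usize {
    (i - 1) / D
}

/// Indices of the (up to) `D` children of node `i`.
fn children(i: usize) -> std::ops::Range<usize> {
    D * i + 1..D * i + D + 1
}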

Other heaps

Other heaps include the radix heap (wikipedia, nice blog explanation) (Ahuja et al. 1990), which implements a monotone priority queue, where the priority of the popped elements may only increase. This is for example the case when running Dijkstra’s algorithm on a graph with non-negative edge weights.

The sequence heap (Sanders 2000) and funnel heap (Brodal and Fagerberg 2002) are two similar I/O-efficient / cache-oblivious data structures. (But, according to Brodal (2013), they serve slightly different purposes.) They build on the idea of binary mergers: given two input buffers containing sorted data, they merge those into a sorted output buffer as needed. These “gadgets” are used to build \(k\)-way mergers, and a chain of these is made for exponentially increasing \(k\).
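
As a sketch of that gadget (my code; the real structures run such mergers lazily, producing output only as it is consumed downstream):

/// Eagerly merge two sorted slices into one sorted vector. The sequence and
/// funnel heaps build their k-way mergers out of trees of such binary mergers.
fn binary_merge(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::with_capacity(a.len() + b.len());
    while i < a.len() && j < b.len() {
        if a[i] <= b[j] {
            out.push(a[i]);
            i += 1;
        } else {
            out.push(b[j]);
            j += 1;
        }
    }
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}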

The cascade heap (Babenko, Kolesnichenko, and Smirnov 2017) has \(O(1)\) inserts, and \(O(\log^* n + \log k)\) pop for the \(k\)’th popped element. Thus, this avoids the overhead of doing too much up-front work. The logarithmic funnel (Loeffeld 2017) takes a similar approach of building high dimensional trees.

A broader survey is given by Brodal (2013), which surprisingly does not mention the quickheap described below.

2 Literature on Quickheaps

Sorting algorithms go back a long way. One way to sort a list is to put all elements in a heap and then extract them one by one, called heapsort (Williams 1964).

Another method is quicksort, which works as follows: first, choose a random element of the array as the pivot. Then compare all elements to this pivot, moving the smaller ones to the front of the array and the larger ones to the back. Then recurse on the two halves. This runs in expected \(O(n\log n)\) time, since the number of levels in the recursion is expected to be \(O(\log n)\).

Optimal incremental sorting

A related problem is to sort only the smallest \(k\leq n\) elements, which can be done in \(O(n + k \log k)\). Heapsort can do this too, but building the heap and popping \(k\) elements costs \(O(n + k \log n)\). Paredes and Navarro (2006) give an incremental \(O(n+k\log k)\) method that returns the smallest elements one by one in sorted order. Their method, IncrementalQuickSelect (IQS), builds on the ideas of quicksort:

  1. Choose a random pivot and partition the array.
  2. Push the position of the pivot to a stack.
  3. Recurse on the left part with smaller values until only a single element is left. This is the minimum.
  4. Pop that element, go up the stack, and recurse on the elements between this pivot and the next as needed.
  5. Repeat until the array is empty.

Figure 1: Example from Paredes and Navarro (2006) of the IQS method: a random (here: first) element is chosen as pivot to partition the array. This is repeated until only a single element is left, and the positions of all pivots are stored on a stack.

On random input, IQS takes average time \(O(n + k \log k)\).
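
A compact sketch of these steps (my code, simplified from the paper’s pseudocode, using the rand crate):

/// Partition `a` around a random pivot (Lomuto scan); returns the pivot's final index.
fn partition(a: &mut [u32]) -> usize {
    let n = a.len();
    a.swap(rand::random_range(0..n), n - 1);
    let pivot = a[n - 1];
    let mut store = 0;
    for i in 0..n - 1 {
        if a[i] < pivot {
            a.swap(i, store);
            store += 1;
        }
    }
    a.swap(store, n - 1);
    store
}

/// Return the i'th smallest element of `a`. The stack holds pivot positions
/// and is initialized to vec![a.len()] as a sentinel.
fn iqs_next(a: &mut [u32], i: usize, stack: &mut Vec<usize>) -> u32 {
    // Partition between i and the preceding pivot until a[i] is isolated.
    while *stack.last().unwrap() != i {
        let hi = *stack.last().unwrap();
        let p = i + partition(&mut a[i..hi]);
        stack.push(p);
    }
    stack.pop();
    a[i]
}

Calling iqs_next(&mut a, i, &mut stack) for i = 0, 1, 2, … yields the elements in increasing order, doing only the partitioning work needed so far.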

Quickheap

The conclusions of the 2006 paper above already mention that this same idea can be used to build a heap. This is presented in detail in Navarro and Paredes (2010).

Compared to the incremental sorting method above, the one additional operation it needs to support is pushing new elements.

Figure 2: Example from Navarro and Paredes (2010) of inserting an element (35) into the quickheap.

Like with binary heaps, we first push the new element to the back of the array, and then sift it (down, in this case) to its right location: as long as the new element is less than the preceding pivot, that pivot (e.g. 51 above) is shifted one position to the right to make space for the new element, and the new element is inserted on its left.
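
A conceptual sketch of this sift (my code, not the paper’s exact pseudocode): heap is the flat array with the smallest values at the front, and pivot_pos holds the pivot positions ordered from shallowest (highest index, largest pivot value) to deepest.

fn qh_insert(heap: &mut Vec<u32>, pivot_pos: &mut [usize], x: u32) {
    heap.push(0); // Open a free slot at the end of the array.
    let mut free = heap.len() - 1;
    for p in pivot_pos.iter_mut() {
        if x >= heap[*p] {
            break; // x belongs in the region just right of this pivot.
        }
        // Shift the pivot one slot to the right: the region element at
        // *p + 1 moves into the free slot, the pivot moves into *p + 1,
        // and the pivot's old slot becomes the new free slot.
        heap[free] = heap[*p + 1];
        heap[*p + 1] = heap[*p];
        free = *p;
        *p += 1;
    }
    heap[free] = x; // Insert x to the left of the last shifted pivot.
}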

Since the tree is expected to have \(O(\log n)\) levels on random input, this takes \(O(\log n)\) steps, as for binary heaps. Additionally, it is shown that the I/O cost of push and pop is \(O((1/B) \log (m/M))\), where \(B\) is the block size and \(M\) is the total available memory, which is close to optimal. Thus, quickheaps make efficient use of the memory bandwidth.

Figure 3: On a sequence of m times (ins, (del, ins)^2) followed by m times (del, (ins, del)^2), quickheap (QH) is faster than the binary heap (BH) and pairing heap (PH), but slightly slower than the sequence heap (SH), which are optimized for cases where all elements are extracted. Note that the y-axis reports the time divided by m lg(m).

Quickheaps are also shown to have much lower I/O cost than radix heaps.

A drawback of quickheaps is that the analysis only works for randomized operations.

Randomized quickheaps

When the keys being inserted into the quickheap are mostly decreasing, the number of layers/pivots grows over time. This results in worst-case linear-time inserts, since elements have to sift down through linearly many layers.

Randomized quickheaps (RQH) (Navarro et al. 2011) solve this: every time an element is inserted and moves down one layer, the entire subtree starting in that layer is flattened with probability \(1/s\) when it has \(s\) elements. That is, all pivots in the subtree are dropped, and the next pop operation will pick a new random pivot. This way, each subtree is re-randomized roughly once each time it doubles in size.
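
The flattening decision itself is cheap; a sketch of the 1/s rule (my code):

/// When a sifted element moves down into a subtree currently holding
/// `s >= 1` elements, flatten it (drop its pivots) with probability 1/s.
fn should_flatten(s: usize) -> bool {
    rand::random_range(0..s) == 0
}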

Figure 4: On a sequence of m times (ins, (del, ins)^2) followed by m times (del, (ins, del)^2), the random quickheap (RQH) is faster than both the binary heap (BH) and quickheap (QH) for sufficiently large inputs.

The paper by Regla and Paredes (2015) takes a more practical approach to optimal worst-case behaviour. One could use the median-of-medians algorithm to select a pivot in the 30%-70% interval in linear time, but this is slow in practice. Instead, they first try a random pivot, and only fall back to median-of-medians in case this random pivot is not in the 30%-70% interval.
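
A sketch of that check (my code; in their algorithm the rank of the random pivot is already known after partitioning, so no extra scan is needed):

/// Returns true if `pivot` lands in the 30%-70% rank interval of `a`,
/// in which case the random pivot is kept; otherwise we fall back to the
/// linear-time (but practically slow) median-of-medians pivot.
fn pivot_is_balanced(a: &[u32], pivot: u32) -> bool {
    let n = a.len();
    let rank = a.iter().filter(|&&x| x < pivot).count();
    10 * rank >= 3 * n && 10 * rank <= 7 * n
}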

3 Bucket-based implementation

Data structure

The original quickheap papers store everything in a flat array, without additional memory. In my implementation (github), I instead use a single bucket (vector) per layer. This simplifies the partition steps, since they do not have to be in-place, but comes at the cost of somewhat higher memory usage. Additionally, I store a flat list of the pivot values for all layers, rather than their positions. Lastly, I stop the recursion when a layer has at most 16 elements; that layer is then simply scanned to extract the smallest element.

Figure 5: A schematic overview of the quickheap: elements are stored in layers. Deeper layers have smaller values, with each layer’s elements being at most its pivot. The bottom layer has at most 16 (here: 4) elements. Popping is done by scanning the bottom layer, and pushing is done by comparing the new value to all pivots and appending to the correct layer. To split a layer, a random pivot is selected as the median of 3 and smaller elements are pushed to the next layer, while larger elements stay. Values equal to the pivot are pushed down if they are on its left.

type T = u32;
struct QuickHeap {
    /// The index of the active (bottom) layer.
    layer: usize,
    /// A decreasing array of the pivots for all layers.
    /// pivots[0] = T::MAX
    pivots: Vec<T>,
    /// The values in each layer.
    /// pivots[i] >= elements of buckets[i] >= pivots[i+1]
    buckets: Vec<Vec<T>>,
}
Code Snippet 1: My version of the QuickHeap data structure.

The index of the active/last layer is tracked explicitly, so that exhausted buckets keep their allocation and we can reuse buffers instead of deallocating them.

Push

Push is implemented by simply comparing the new element against all the pivots, and then inserting it into the right layer.

Code for pushing a new value.
fn push(&mut self, x: T) {
    let mut target_layer = 0;
    for &p in &self.pivots[..=self.layer] {
        if p > x {
            target_layer += 1;
        }
    }
    // pivots[0] = T::MAX, so target_layer >= 1 for any x < T::MAX.
    self.buckets[target_layer - 1].push(x);
}
Code Snippet 2: We count the number of pivots larger than the new element.

To enable maximum efficiency of SIMD comparisons, in practice we do this:

 fn push(&mut self, x: T) {
     let mut target_layer = 0;
-    for &p in &self.pivots[..=self.layer] {
+    for &p in &self.pivots[..(self.layer+1).next_multiple_of(8)] {
         if p > x {
             target_layer += 1;
         }
     }
     self.buckets[target_layer - 1].push(x);
 }
Code Snippet 3: By always comparing to a multiple of 8 number of elements, each block of 8 is compiled to some SIMD instructions.

This has complexity \(O((\log n)/L)\) when using \(L\) SIMD lanes, which in practice is fast, especially when \(L=8\) for u32 values. One option for some speedup could be to turn this into a 2-level B-tree, with a root node that divides the levels into \(L+1\) chunks.

Pop

Pop is trickier. We split the current (bottom) layer as long as it has more than 16 elements. Then, we find the position of the minimum and remove it by swapping it with the last element in the layer. Lastly, we decrease the active layer if we exhausted it.

Code for popping the smallest value.
fn pop(&mut self) -> Option<T> {
    // Only the top layer can be empty.
    if self.buckets[self.layer].len() == 0 {
        return None;
    }
    // Split the current layer as long as it is too large.
    while self.buckets[self.layer].len() > 16 {
        self.partition();
    }
    // Find and extract the minimum.
    // (`position_min` is from the itertools crate.)
    let layer = &mut self.buckets[self.layer];
    let min_pos = layer.iter().position_min().unwrap();
    let min = layer.swap_remove(min_pos);

    // Update the active layer; reset the stale pivot to 0,
    // so that pushes reading past the active layer ignore it.
    if layer.is_empty() && self.layer > 0 {
        self.pivots[self.layer] = 0;
        self.layer -= 1;
    }
    Some(min)
}
Code Snippet 4: Popping works by first splitting the layer as long as it has more than 16 elements, and then scanning the remaining elements for the minimum.

Partition

This leaves only the partitioning of the layers. Of note: I use the median of 3 candidate pivots, and the partitioning is based on AVX2 SIMD instructions inspired by Daniel Lemire’s blog (see github for the detailed code).

Code for partitioning the bottom layer.
fn partition(&mut self) {
    // Reserve space for an additional 8 layers when needed.
    if self.layer + 2 == self.pivots.len() {
        self.pivots.extend(repeat_n(0, 8));
        self.buckets.extend(repeat_n(vec![], 8));
    }
    // Alias the current layer (to be split) and the next layer.
    let [cur_layer, next_layer] = &mut self.buckets[self.layer..=self.layer + 1] else {
        unreachable!()
    };
    let n = cur_layer.len();

    // Select 3 random pivots, and compute their median.
    let mut pivots: [T; 3] = from_fn(|_| cur_layer[rand::random_range(0..n)]);
    pivots.sort();
    // Pivots are exclusive.
    let pivot = pivots[1] + 1;
    self.pivots[self.layer + 1] = pivot;

    // Reserve space in the next layer,
    // and make sure the current layer can hold a spare SIMD register.
    next_layer.resize(n + 8, 0);
    cur_layer.resize(n + 8, 0);

    // Partition a list into two using SIMD.
    let mut cur_len = 0;
    let mut next_len = 0;
    for i in (0..n).step_by(8) {
        let vals = *cur_layer[i..i + 8].as_array().unwrap();
        simd::partition(
            u32x8::from_array(vals),
            n - i, // Only use the at most n-i remaining elements.
            pivot,
            cur_layer,
            &mut cur_len,
            next_layer,
            &mut next_len,
        );
    }
    cur_layer.resize(cur_len, 0);
    next_layer.resize(next_len, 0);

    // If we extracted all elements to the next layer
    // because the pivot was the largest one,
    // undo and try again.
    if cur_len == 0 {
        std::mem::swap(cur_layer, next_layer);
        return;
    }

    // Increment the active layer.
    self.layer += 1;
}
Code Snippet 5: Partitioning the bottom layer works by reserving two sufficiently large arrays, and then using SIMD instructions to append values < pivot to one and values >= pivot to the other.

4 Results

Libraries

I benchmarked against a few heap and priority queue crates. I did not find any implementation of (randomized) quickheap either online or in the papers¹, but there are some d-ary heaps:

Excluded implementations.

We do compare against a radix heap, which is specialized for cases where the popped keys increase with time.

Datasets

I test on a few types of data. First off, we test keys with types:

  • u32
  • u64

For each type, we construct a number of test cases; a sketch of the resulting operation sequence is shown below the list. Each has the structure \[ F_k(n) := (\mathsf{push}\circ(\mathsf{pop}\circ\mathsf{push})^k)^n \circ(\mathsf{pop}\circ(\mathsf{push}\circ\mathsf{pop})^k)^n. \]

  1. Heapsort: \(F_0(n)\): \(n\) random pushes, followed by \(n\) pops.
  2. Random: \(F_4(n)\) with random pushes. This simulates a heap that slowly grows and then slowly shrinks.
  3. Linear: \(F_4(n)\), where the \(i\)’th push pushes value \(i\).
  4. Increasing: \(F_4(n)\), but pushes increase by a random amount:
    • for u32: the last popped value plus a random value up to 1000.
    • for u64: the last popped value plus a random value up to \(2^{32}\).
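
A sketch of the driver for this pattern (my code; it uses std’s BinaryHeap, a max-queue, with Reverse to turn it into a min-queue, and a caller-supplied next closure that generates keys according to the dataset):

use std::{cmp::Reverse, collections::BinaryHeap};

/// Run F_k(n): n rounds of "push, then k times (pop, push)" growing the
/// heap, followed by n rounds of "pop, then k times (push, pop)" shrinking
/// it again. In total this does (2k+1)n pushes and (2k+1)n pops.
fn run_f_k(k: usize, n: usize, mut next: impl FnMut() -> u32) {
    let mut heap = BinaryHeap::new();
    for _ in 0..n {
        heap.push(Reverse(next()));
        for _ in 0..k {
            heap.pop();
            heap.push(Reverse(next()));
        }
    }
    for _ in 0..n {
        heap.pop();
        for _ in 0..k {
            heap.push(Reverse(next()));
            heap.pop();
        }
    }
}

For the random dataset, next could be || rand::random(); the linear dataset instead feeds a counter.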

Results

Figure 6: (Click to enlarge.) Log-log plots of the average time per push-pop pair. Top row: u32 values, bottom row: u64 values. Left to right corresponds to the four datasets listed above: heapsort, random, linear, and increasing. The shown time is the total time divided by \((2k+1)n\), and thus is the average time for an element to be pushed and popped again. Experiments are stopped once they take >100 ns. Memory usage is 4 or 8 bytes times \(n\), and ranges from 8 KiB to 128 MiB for u32 and double that for u64.

Some observations:

  • The binary heap (blue) and d-ary heaps (orange, green) have similar performance for u32 and u64.
  • The 4-ary heap (green) in orx_priority_queue is consistently slightly faster than the default binary heap.
  • The 8-ary heap in dary_heap is usually much slower, but slightly faster for large \(n\).
    • (Other d-ary heap variants are only rarely better than both of these two.)
  • The quickheap (red) is the fastest for both heapsort and the interleaved variant with random pushes.
    • For heapsort, it’s up to 2x faster for u32 for large \(n\), because most time is spent partitioning lists, and this is 2x faster for the smaller data type.
    • For random interleaved pushes, this difference is nearly gone. Most likely this is because roughly 75% of pushed elements will directly be popped again. In fact, the performance is nearly independent of \(n\) here!
    • Exact linear pushes are probably the worst for quickheap, as these always get pushed to the top layer and then need to sift down through the maximum possible number of layers.
  • The radix heap (purple) has constant performance for heapsort, since its \(O(n\log C)\) term only depends on the number of bits in the input values, which is 32 or 64. This is also the worst possible input for radix sort.
    • Random pushes interleaved with pops are not supported, since the minimum value in the heap may only increase.
    • The radix heap is the fastest for linear input, likely because of its cache locality.
    • For u32 input that increases by 0 to 1000 on each push, the radix heap gets faster as \(n\) increases, probably because there are very many duplicate elements.
    • For u64 input that increases by 0 to \(2^{32}\) on each push, the radix heap is not as good, because now \(\log C=\log 2^{32}\) is quite large (compared to \(\log 1000\) before). As in the heapsort case, the performance is constant though, because \(\log C\) is constant.

5 Conclusion

Overall, the radix heap is probably a good choice in cases where it can be used, specifically for small integer input. Otherwise, the quickheap is significantly faster than binary/d-ary heaps, especially as \(n\) grows. This is primarily because of its better cache efficiency and corresponding I/O complexity: Binary/d-ary heaps must access \(O(\log n)\) memory locations and have a cache miss (of cost \(\sqrt {2^i}\), see this blog on The Myth of RAM) at each layer \(i\), whereas the quickheap pushes to one of \(O(\log n)\) known (cached) locations, pops from a single location, and has memory-efficient partitioning as well.

Still TODO

  • I have not yet implemented the ideas from the randomized quickheap (Navarro et al. 2011) to prevent the worst-case quadratic behaviour (linear-time inserts) on adversarial input.
  • I’d like to run experiments on some more realistic datasets. I could compare Dijkstra with the radix heap and quickheap, but then I need some interesting graphs to test on.
  • Prim’s minimum spanning tree algorithm (wikipedia) is a case where the radix heap will not work, and might be a good case to show the improvement of the quickheap over d-ary heaps. But again, this needs some graphs to test on.
  • Are there non-graph applications of heaps? Things like an ’event queue’ come to mind: when processing a sorted list of intervals, we push the end of the interval after processing the start. But again a radix heap (or even a bucket heap) is probably better here, unless the timestamps are very high precision.

References

Ahuja, Ravindra K., Kurt Mehlhorn, James Orlin, and Robert E. Tarjan. 1990. “Faster Algorithms for the Shortest Path Problem.” Journal of the ACM 37 (2): 213–23. https://doi.org/10.1145/77600.77615.
Babenko, Maxim, Ignat Kolesnichenko, and Ivan Smirnov. 2017. “Cascade Heap: Towards Time-Optimal Extractions.” In Computer Science – Theory and Applications, 62–70. Springer International Publishing. https://doi.org/10.1007/978-3-319-58747-9_8.
Brodal, Gerth Stølting. 2013. “A Survey on Priority Queues.” In Space-Efficient Data Structures, Streams, and Algorithms, 150–63. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-40273-9_11.
Brodal, Gerth Stølting, and Rolf Fagerberg. 2002. “Funnel Heap - a Cache Oblivious Priority Queue.” In Proceedings of the 13th International Symposium on Algorithms and Computation, 219–28. ISAAC ’02. Berlin, Heidelberg: Springer-Verlag.
Duan, Ran, Jiayi Mao, Xiao Mao, Xinkai Shu, and Longhui Yin. 2025. “Breaking the Sorting Barrier for Directed Single-Source Shortest Paths.” arXiv. https://doi.org/10.48550/ARXIV.2504.17033.
Loeffeld, Christian. 2017. “The Logarithmic Funnel Heap: An Efficient Priority Queue for Extremely Large Sets.” arXiv. https://doi.org/10.48550/ARXIV.1705.10648.
Navarro, Gonzalo, Rodrigo Paredes, Patricio V. Poblete, and Peter Sanders. 2011. “Stronger Quickheaps.” International Journal of Foundations of Computer Science 22 (04): 945–69. https://doi.org/10.1142/s0129054111008507.
Navarro, Gonzalo, and Rodrigo Paredes. 2010. “On Sorting, Heaps, and Minimum Spanning Trees.” Algorithmica 57 (4): 585–620. https://doi.org/10.1007/s00453-010-9400-6.
Paredes, Rodrigo, and Gonzalo Navarro. 2006. “Optimal Incremental Sorting.” In Proceedings of the Meeting on Algorithm Engineering & Experiments, 171–82. Miami, Florida: Society for Industrial and Applied Mathematics.
Regla, Erik, and Rodrigo Paredes. 2015. “Worst-Case Optimal Incremental Sorting.” In 2015 34th International Conference of the Chilean Computer Science Society (SCCC), 1–6. IEEE. https://doi.org/10.1109/sccc.2015.7416566.
Sanders, Peter. 2000. “Fast Priority Queues for Cached Memory.” ACM Journal of Experimental Algorithmics 5 (December): 7. https://doi.org/10.1145/351827.384249.
Williams, J.W.J. 1964. “Algorithm 232: Heapsort.” Communications of the ACM 7 (6): 347–48. https://doi.org/10.1145/512274.3734138.

  1. There is https://github.com/emmt/QuickHeaps.jl, but it seems to just be a “quick” binary heap.