- 1 Problem Statement: Customizable Route Planning (CRP)
- 2 Contraction Hierarchies (CH)
- 3 Analogy with trees
- 4 Shortest Paths in CHs
- 5 Parents: Faster Shortest Paths in CCHs
- 6 Input Graph
- 7 Initial Algorithm
- 7.1 Permute input
- 7.2 Chordal Completion and Parents
- 7.3 Customize
- 7.4 Query
- 8 Optimizing things
- 9 Some stats
- 10 Serializing the final structure
- 11 Merging adjacent edges
- 12 Bugfixing
- 13 Further ideas
- 14 Current best results
- 14.1 Bottleneck
- 15 TODO
These are some notes on customizable contraction hierarchies, based on talks with Michael Zündorf and the survey paper by Bläsius, Buchhold, Wagner, Zeitz, and Zündorf (2025).
1 Problem Statement: Customizable Route Planning (CRP) Link to heading
Given is a static directed graph modelling a road network (e.g. the map of Europe). Separately, we get a list of edge weights that can change frequently (say, every few minutes). The goal is to build a data structure that efficiently supports the following two operations:
- Update all the edge weights.
- Compute the length of the shortest path between two nodes \(s\) and \(t\).
The customizable in the name of the problem refers to the fact that edge weights can be changed.
2 Contraction Hierarchies (CH) Link to heading
The Wikipedia page on contraction hierarchies seems quite complete; just go and read that.
The mathematical tl;dr is: we are given a total order \(\prec\) (also called the rank, let’s say \(r(u)\)) on the nodes, such that typically, ‘unimportant’ nodes come early (small rank) and ‘important’ nodes come late (large rank). Then, we repeatedly contract the smallest node \(u\) by turning its set of neighbours into a clique. For edges \(vu\) and \(uw\), we set or lower the weight of \(vw\) to \(\min(l(v,u) + l(u,w), l(v,w))\). This way, we can skip over \(u\) when computing shortest paths between other vertices.
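As a minimal sketch of this contraction step (on a hypothetical adjacency-map representation of an undirected weighted graph, not the post's actual data structure):

```rust
use std::collections::HashMap;

// Hypothetical sketch: the graph as a weight map on (small, large) node pairs.
type Graph = HashMap<(usize, usize), u32>;

fn key(a: usize, b: usize) -> (usize, usize) {
    (a.min(b), a.max(b))
}

/// Contract node `u`: for every pair of neighbours (v, w), add or lower
/// the shortcut v-w to the length of the path v-u-w.
fn contract(g: &mut Graph, u: usize, neighbours: &[usize]) {
    for (i, &v) in neighbours.iter().enumerate() {
        for &w in &neighbours[i + 1..] {
            let via_u = g[&key(v, u)] + g[&key(u, w)];
            let e = g.entry(key(v, w)).or_insert(u32::MAX);
            *e = (*e).min(via_u);
        }
    }
}
```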
2.1 Classic Contraction Hierarchies Link to heading
The classic approach of the 2010s builds the CH bottom-up. It repeatedly finds “good” nodes to contract, which are those that only generate a few new edges between their neighbours. In particular, when a node \(u\) is contracted, not all pairwise edges between its neighbours are added/updated. Instead, only those edges are added/changed for which the shortest path goes through \(u\). This requires computing all-vs-all distances between the neighbours of \(u\), and thus depends on the weights of the edges.
In the end, the high rank nodes typically represent for example highways near cities, or more generally ‘bottlenecks’ that have a lot of shortest paths going through them.
A drawback of this approach is that it needs the weights up-front, and that it needs to do a lot of shortest-path queries during construction, in order to prevent adding too many edges.
2.2 Customizable Contraction Hierarchies (CCH) Link to heading
Customizable contraction hierarchies use a similar but slightly different approach that does not require knowing the edge weights. This means that eagerly pruning edges is not possible, and thus, all edges must be added pessimistically. As a consequence, the hierarchy/ranks must be chosen in such a way as to minimize the number of edges that are added.
Instead of building the CH bottom-up, here we build the CH top-down. One way to find such an order is the following idea from InertialFlow (Schild and Sommer 2015): take a road network and contract the 25% most Northern vertices and the 25% most Southern vertices into two super-nodes. Then find a minimal cut between these two parts and use the cut nodes as the top levels of the hierarchy. Since this is a cut, it splits the graph into two parts (North and South), so now we can recurse on the two halves by splitting each into East and West, and so on.
3 Analogy with trees Link to heading
Suppose we know the input graph is a tree. Then we can find the distance between any nodes \(s\) and \(t\) by computing the distance from \(s\) and \(t\) to each of their ancestors, and taking the sum of these at the lowest common ancestor. In particular, when the tree is balanced, this runs in \(\lg n\) time.
Unfortunately, graphs are not trees, and the same does not quite work. For CCHs, we turn a graph into a tree by taking a set of nodes (a cut) that we contract and then consider as the root. Since this is a cut, the remainder of the graph is disconnected and falls apart into two components on which we recurse.
At the end, each node of this tree of contracted sets will be (roughly) a clique, and there are additional edges from each node to a subset of its ancestors. Then, we can use the same query algorithms as for trees: compute the distance from each node to all its ancestors, and then find the (not necessarily lowest) common ancestor with minimal sum.
4 Shortest Paths in CHs Link to heading
Let \(s = u_1, u_2, \dots, u_k=t\) be a shortest path from \(s\) to \(t\). Then there is some node \(u_m\) of maximal rank. We can then find the increasing subsequence \(s=u_1 \prec u_{i_2}\prec u_{i_3}\prec \dots\prec u_m\) of nodes that are not preceded by any node of larger rank. Symmetrically, we can find the decreasing subsequence \(u_m\succ \dots \succ u_{j_3} \succ u_{j_2} \succ u_k = t\) of all nodes on the path that are not followed by any node of larger rank. Because of the CH construction, there are shortcuts between all consecutive elements of these subsequences, so we can skip over low-rank intermediate nodes. Thus, we can do a single upward search from each of \(s\) and \(t\) to find the minimum distance to all higher-ranked nodes they can reach, and then take the minimum total distance over all nodes where the two searches meet.
For CH, this is done by using Dijkstra and only relaxing edges to higher-ranked nodes. A priority queue is needed to visit the nodes in order of distance to \(s\) (or \(t\)).
5 Parents: Faster Shortest Paths in CCHs Link to heading
Since CCHs contain all edges above each node, a different approach is possible that does not need the priority queue. In order to visit all nodes above \(s\), we can store for each node \(u\) a parent pointer to the lowest ranked node \(p\) above \(u\). Since \(p\) has edges to all other up-neighbours of \(u\), those will eventually be visited as well. In the end, following parent pointers from \(s\) will traverse exactly all higher-ranked vertices reachable from it.
This removes the need for a priority queue, and makes the algorithm highly predictable.
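As a sketch, the traversal itself is then just a pointer chase (assuming a hypothetical `parent` array where the root points to itself):

```rust
/// Visit all nodes on the parent-pointer chain from `s` to the root.
/// In a CCH, this chain is exactly the upward search space of `s`:
/// since each parent has edges to all other up-neighbours, relaxing
/// edges along the chain reaches every node above `s`.
fn upward_chain(parent: &[usize], s: usize) -> Vec<usize> {
    let mut chain = vec![s];
    let mut u = s;
    while parent[u] != u {
        u = parent[u];
        chain.push(u);
    }
    chain
}
```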
6 Input Graph Link to heading
We use the PTV-DIMACS-Europe-Graph mentioned here:
Each file is a binary little-endian dump of a `&[u32]`, without a leading length.
- `lat` and `lng` are the latitude and longitude of each node.
- The two `*.order` files are different orders for the CH.
- `first_out` indicates the index of the first edge for each node.
- `head` indicates for each edge the node that it points to.
- `dist` indicates the length of each edge.
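Reading such a dump could look like this sketch (the helper name is hypothetical, not the post's actual loader):

```rust
use std::fs;

/// Read a binary little-endian dump of u32 values, without a leading length.
fn read_u32_vec(path: &str) -> std::io::Result<Vec<u32>> {
    let bytes = fs::read(path)?;
    assert!(bytes.len() % 4 == 0, "file size must be a multiple of 4");
    Ok(bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}
```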
7 Initial Algorithm Link to heading
The construction mostly follows the original code at github:mzuenni/Separator-Based-Alternative-Paths-in-CCHs.
7.1 Permute input Link to heading
First, we re-number all nodes and edges by increasing rank.
7.2 Chordal Completion and Parents Link to heading
Iterate over all nodes from low to high rank (= id), and turn their upper neighbours into a clique. The parent of each node is its lowest-ranked upper neighbour, and we append the remaining upper neighbours of each node to its parent.
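A sketch of this completion on an adjacency-set representation (nodes already renumbered by rank, so ‘higher rank’ is just ‘larger id’; the representation is a hypothetical simplification):

```rust
use std::collections::BTreeSet;

/// Chordal completion: for each node u from low to high rank, record its
/// parent (lowest-ranked upper neighbour) and append u's other upper
/// neighbours to that parent. Processed inductively, this turns every
/// upper neighbourhood into a clique.
fn chordal_completion(up: &mut Vec<BTreeSet<usize>>) -> Vec<usize> {
    let n = up.len();
    let mut parent = vec![usize::MAX; n]; // MAX marks the root
    for u in 0..n {
        let ups: Vec<usize> = up[u].iter().copied().collect();
        if let Some(&p) = ups.first() {
            parent[u] = p;
            // Append u's remaining upper neighbours to its parent p.
            for &w in &ups[1..] {
                up[p].insert(w);
            }
        }
    }
    parent
}
```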
7.3 Customize Link to heading
- First, copy the input weights into the larger CCH graph. Note that all edges are directed here.
- For each node \(u\) from low to high, ‘relax’ the edges between its upper-neighbours.
- Drop edges that are never optimal. For each node \(u\) from high to low, iterate over the triangles between its upper neighbours, and in each, relax the lower and middle edge. (The upper edge was already relaxed before.) If an edge is relaxed, mark it for deletion but first keep it around for further propagation. Finally, prune all the deleted edges.
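The upward relaxation step can be sketched as follows, on a weight map over node pairs (a hypothetical undirected simplification; the real code works on flat directed edge arrays):

```rust
use std::collections::HashMap;

/// Basic customization: for each node u from low to high rank, relax the
/// edge between every pair (v, x) of its upper neighbours via the lower
/// triangle u-v-x.
fn customize(up: &[Vec<usize>], w: &mut HashMap<(usize, usize), u32>) {
    for u in 0..up.len() {
        for (i, &v) in up[u].iter().enumerate() {
            // up[u] is sorted by rank, so v < x below.
            for &x in &up[u][i + 1..] {
                let via_u = w[&(u, v)].saturating_add(w[&(u, x)]);
                let e = w.get_mut(&(v, x)).unwrap();
                *e = (*e).min(via_u);
            }
        }
    }
}
```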
7.4 Query Link to heading
As a lazy initial approach for a query from \(s\) to \(t\):
- For each of \(s\) and \(t\), initialize a vector of distances to all other nodes.
- Then relax the upward edges of all nodes on the parent-pointer chain, until the root is reached.
- Then iterate over all nodes and find the one with minimum sum of distances to \(s\) and \(t\).
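The steps above can be sketched as follows (the `up`/`parent` representation is a hypothetical simplification; the root’s parent points to itself):

```rust
/// Distances from `s` to all nodes above it: follow the parent-pointer
/// chain from `s` and relax the upward edges of every node on it.
fn upward_dists(up: &[Vec<(usize, u32)>], parent: &[usize], s: usize) -> Vec<u64> {
    let mut dist = vec![u64::MAX; up.len()];
    dist[s] = 0;
    let mut u = s;
    loop {
        if dist[u] < u64::MAX {
            let du = dist[u];
            for &(v, w) in &up[u] {
                dist[v] = dist[v].min(du + w as u64);
            }
        }
        if parent[u] == u {
            break;
        }
        u = parent[u];
    }
    dist
}

/// Shortest s-t distance: the minimum, over all nodes, of the sum of the
/// two upward distances (non-meeting nodes saturate to u64::MAX).
fn query(up: &[Vec<(usize, u32)>], parent: &[usize], s: usize, t: usize) -> u64 {
    let ds = upward_dists(up, parent, s);
    let dt = upward_dists(up, parent, t);
    ds.iter()
        .zip(&dt)
        .map(|(a, b)| a.saturating_add(*b))
        .min()
        .unwrap()
}
```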
8 Optimizing things Link to heading
- Reading the graph: a few seconds
- Relaxing upper edges: 18s
- Finding redundant edges: 18s
- Compressing edges: 1s
- 100 queries: 2s
- Total time: 44s
A quick `perf record; perf report` after enabling debug symbols via `[profile.release] debug = true` shows that around 70% of time is spent iterating through neighbour lists, searching for the \(xy\) edges that complete a triangle from \(u\).
8.1 Binary searching in find_edge Link to heading
Since edges are already sorted by out-neighbour anyway, we can also find them via binary search.
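A sketch under the CSR layout of the input format (`first_out`/`head`), with `find_edge` as a hypothetical name:

```rust
/// Find the index of edge u -> v, assuming each node's out-edges in
/// `head` are sorted by target node.
fn find_edge(first_out: &[u32], head: &[u32], u: usize, v: u32) -> Option<usize> {
    let start = first_out[u] as usize;
    let end = first_out[u + 1] as usize;
    head[start..end]
        .binary_search(&v)
        .ok()
        .map(|i| start + i)
}
```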
This reduces the total time to 22s, with only 7s for each of the relax and prune stage, but still, most of the time is in the binary search.
8.2 HashMap of edges Link to heading
Better, we replace the binary search with a HashMap that directly tells us the index of each edge.
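A sketch of building such a map (the helper name is hypothetical):

```rust
use std::collections::HashMap;

/// Build a map from (u, v) to the index of edge u -> v,
/// from CSR-style `first_out`/`head` arrays.
fn build_edge_index(first_out: &[u32], head: &[u32]) -> HashMap<(u32, u32), usize> {
    let mut map = HashMap::with_capacity(head.len());
    for u in 0..first_out.len() - 1 {
        for e in first_out[u] as usize..first_out[u + 1] as usize {
            map.insert((u as u32, head[e]), e);
        }
    }
    map
}
```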
Surprisingly, this takes 7s to build, and then 23s for relax and prune each, for 65s total. Not good.
8.3 Ranges of neighbours Link to heading
Because of the CCH structure, most triangles are inside the cliques formed in each cut. This means that many nodes have large intervals of consecutive neighbours. So we can store for each node a list of start nodes of each interval, and then binary search that instead. This reduces the total time to 19s.
8.4 Linear scan Link to heading
It turns out that during relaxing and pruning, `find_edge_index` is always called repeatedly for a single \(x\) and increasing \(v\), so we can replace the repeated searches by a single linear scan. This lowers the entire construction to 7s.
8.5 Proper query algorithm Link to heading
So far, I was allocating two size-n vectors for each query and doing a separate search from \(s\) and \(t\). By interleaving the searches, we can efficiently reuse a single allocation and incrementally clean up used values. Query time starts at 230us/q for random pairs.
8.6 Pruning edges Link to heading
If we enable pruning of sub-optimal edges, query time drops to 108us/q.
8.7 Pruning the search Link to heading
When we found some \(st\) path of length \(d\), we can skip expanding nodes at distance at least \(d\) from either \(s\) or \(t\). The gains are marginal though, from 108 us to 103 us (and these numbers have high variance).
8.8 Unconditional edge relaxing Link to heading
To avoid branch misses, we write relaxed edge values unconditionally: instead of a compare-and-branch, we always store the minimum of the old and the new value.
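A minimal sketch of this branchless relax (hypothetical names):

```rust
/// Relax distance slot `e` with a candidate value, without branching:
/// always write back the minimum instead of conditionally storing.
/// This typically compiles to a cmov/min rather than a branch.
fn relax_unconditional(w: &mut [u32], e: usize, candidate: u32) {
    w[e] = w[e].min(candidate);
}
```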
8.9 Early edge break Link to heading
For each node, we can sort outgoing edges by length, and only process the prefix that results in relaxations with cost at most the best s-t path so far. This gives around 10% speedup, from 97 us/q to 88 us/q.
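A sketch of this early break, assuming out-edges are pre-sorted by weight and distances use a `u64::MAX` sentinel (both assumptions):

```rust
/// Relax the out-edges of a node with distance `du`, whose (target, weight)
/// pairs are sorted by weight, stopping as soon as a relaxation can no
/// longer beat the best s-t distance found so far. Returns the number of
/// edges actually processed.
fn relax_sorted(dist: &mut [u64], du: u64, edges: &[(usize, u32)], best: u64) -> usize {
    let mut relaxed = 0;
    for &(v, w) in edges {
        if du + w as u64 >= best {
            break; // all remaining edges are at least this long
        }
        dist[v] = dist[v].min(du + w as u64);
        relaxed += 1;
    }
    relaxed
}
```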
8.10 DFS-ordering the nodes Link to heading
Currently the nodes are already ordered by their rank, but in a layer-by-layer way. Instead we can use a DFS order to get more locality.
Unfortunately, this appears to give pretty much no speedup, and actually slows things down slightly: 85 us/q to 89 us/q.
8.11 Not inlining queries Link to heading
If we mark the query function as `#[inline(never)]`, times go down from 89 us/q to 81 us/q. I only did this to isolate the results in perf, so I’m not exactly sure why it helps.
9 Some stats Link to heading
With default .order:
- 18.0M nodes
- 42.2M directed input edges
- 18.1M added undirected edges for chordal closure
- 17.4M pruned undirected edges for perfect customization
With .ifc.order:
- 16.7M added edges for chordal closure
- 16.6M pruned edges for perfect customization
- 42.3M edges remaining, only slightly more than the input (but the input will contain many edges in both directions).
10 Serializing the final structure Link to heading
For iteration speed on queries, we dump the final customized data structure to disk, so we can load it directly instead of redoing the preprocessing each time. We use epserde, which simply dumps the in-memory binary representation to disk; this requires zero-copy annotations on the stored types.
Writing and reading the file both take under 1s, compared to ~15s of preprocessing.
11 Merging adjacent edges Link to heading
Many nodes are connected to ranges of consecutive vertices. We will try to process these using SIMD. First we transpose the distances vector so that adjacent nodes have their distances adjacent as well.
- Initial query time: 92 us/q
- Transposed `dist`: 110 us/q
- Iterating all ranges separately: 178 us/q
- Unchecked indexing: 159 us/q
- Merging the first and second phase into one, so we only have to write the code once: 166 us/q
- Flattened edge weights: 177 us/q
- Relaxing 8 edges at once using SIMD: 94 us/q
- Merging ranges with gap <= 4: 69 us/q
- Including the first `NodeId` with the range, avoiding a memory indirection: 56 us/q
- Moving the `splat` for the masking outside the inner loop: 53 us/q
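The key step, relaxing 8 edges at once, can be sketched with a plain fixed-size loop that the compiler can auto-vectorize (a simplified stand-in for the explicit SIMD code; names are hypothetical):

```rust
/// Relax 8 consecutive target distances with 8 consecutive edge weights,
/// as one 8-wide block. The fixed trip count lets the compiler emit a
/// single vector min instead of 8 scalar compare-and-branches.
fn relax8(dist: &mut [u32; 8], du: u32, weights: &[u32; 8]) {
    for i in 0..8 {
        // saturating_add keeps u32::MAX usable as an "infinity" sentinel.
        let cand = du.saturating_add(weights[i]);
        dist[i] = dist[i].min(cand);
    }
}
```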
11.1 Perf stat Link to heading
- 12% branch misses is a lot!
- 1.6 instructions/cycle is quite low
The hot loop over all ranges for a node and the edges in each range now dominates the profile: a large fraction of samples lands on just a few instructions.
We see that two instructions are ‘wasted’ splatting the iteration count. Fixing that makes things slightly cleaner.
More generally, we do a bunch of bookkeeping to handle ranges with non-multiple-of-8 lengths. Instead, we can also just bloat the set of edges so that all ranges have multiple-of-8 lengths.
11.2 All ranges are multiples of 8 Link to heading
We can ’expand’ all small groups of edges to cover ranges of size 8, adding up to 7 dummies if needed. This blows up the size of the data structure in memory, as the number of edges increases from 42.2M to 230.7M, but gets rid of the blending operations. This lowers the time to 48 us/q.
Still, we have 8% branch misses and now only 1.4 instructions/cycle.
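The padding itself could be sketched like this, assuming edge weights are stored per node in a flat `Vec<u32>` and `u32::MAX` serves as the dummy weight (both assumptions):

```rust
/// Pad a node's edge weights to a multiple of 8 with "infinite" dummies,
/// so every range can be processed in full 8-wide blocks without masking.
/// Dummy edges never win a (saturating) relaxation.
fn pad_to_8(weights: &mut Vec<u32>) {
    let target = (weights.len() + 7) / 8 * 8;
    weights.resize(target, u32::MAX);
}
```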
11.3 All ranges have size 8 Link to heading
Only a few ranges are actually larger than 8, so instead, we can also just make all ranges have exactly size 8 and replace the double for loop by a single loop where each range corresponds to a single SIMD iteration. This increases the number of ranges from 27385521 to 28838911, and lowers queries to 39 us/q just by reducing branch misses; if we drop the inner loop completely, we get 34 us/q.
The inner loop is now very tight, and 80% of query time is spent in it.
Reading the edge index is redundant here: every range covers exactly 8 weights, so we can replace it by a `+= 8` instead, which gives 30 us/q.
In fact, we don’t even need the range start and end at all any more; the original range of edges of the node suffices. This reduces the `addq` to only add 4, for a single u32. Result: 28 us/q.
Anyway, the assembly is very pretty and very hot: 72% of time is spent here, branch misses are down to 3.5%, and instructions per cycle is up to 1.8 (but still far from 4).
11.4 Finding the bottleneck Link to heading
It’s unclear to me what the current bottleneck is (CPU or memory). Let’s try to find out using `perf stat`’s top-down analysis, following https://perfwiki.github.io/main/top-down-analysis/. (This also required relaxing the perf event permissions.)
The picture seems to be quite mixed, with no clear target for improvement. Both CPU and L1/RAM are a bottleneck at different times it seems.
12 Bugfixing Link to heading
I finally added a Dijkstra implementation as a baseline. Unfortunately, results are very different… Turns out there’s a bug in the pruning of edges where I iterated over the wrong vertex.
Also, I was pruning edges in both directions, even if I marked them for deletion in only one direction.
With all that fixed, and with making sure I apply the same permutation to the input of Dijkstra, all is good now.
13 Further ideas Link to heading
13.1 Failed: Doubling the graph Link to heading
In fact, we can prune edges that are redundant in only one direction, by doubling the graph and storing separate data for the up and down graph. I implemented this, but it comes out quite a bit slower, probably because it requires more IO to read the edges. (It might be better with an interleaved layout, which I didn’t test yet.)
13.2 Edge pruning Link to heading
We can sort the blocks of 8 outgoing edges from each node by the minimum weight, and then early break as soon as all remaining edges have too large weight.
This seems 1 to 2 us/q faster, but whether pruning actually helps depends a lot on the queries: for random queries, the searches from \(s\) and \(t\) only meet very late, while in specific cases the speedup can be up to 50%.
13.3 Failed: Expanding a node and its parent in parallel Link to heading
We can manually relax the node from \(u\) to \(p\), and then unify their neighbours and relax them in a single loop. This saves some iteration overhead and storing and reading some distances, as long as the neighbour sets are sufficiently similar.
First results are 20% or so slower (50 instead of 40 us/q). Transposing the weights array to have weights for \(u\) and \(p\) adjacent in two halves of a cacheline helps a tiny bit, but not enough.
It’s a bit unfortunate that the inner-loop assembly now has a bunch of `mov` instructions that weren’t there before. We can get rid of these extra `mov`s by replacing `self.weights.get_unchecked` by `self.weights.as_ptr().offset(...).read()`, which forces the compiler to inline the variables (even with the higher register pressure of the double-expanding), and gives back a few of the microseconds (2-3 us/q) that we lost.
14 Current best results Link to heading
- With everything up to and including sorting of edges.
- Without separate up and down graphs.
- Without expanding nodes and their parents in parallel.
We test on Dijkstra rank queries, where we run a Dijkstra search from 1000 random start nodes and then use as destination the \(2^i\)’th expanded vertex. This gives queries that are close together for small \(i\) and far apart for large \(i\).
Running at a fixed 3.6 GHz on an i7-10750H laptop CPU. We compare against the C++ implementation of Michael Zündorf at github:mzuenni/Separator-Based-Alternative-Paths-in-CCHs.
| dist | C++ baseline (us) | this code (us) |
|---|---|---|
| 2^ 0 | 1 | 2 |
| 2^ 1 | 2 | 3 |
| 2^ 2 | 1 | 3 |
| 2^ 3 | 2 | 3 |
| 2^ 4 | 2 | 3 |
| 2^ 5 | 2 | 3 |
| 2^ 6 | 2 | 4 |
| 2^ 7 | 3 | 4 |
| 2^ 8 | 3 | 4 |
| 2^ 9 | 5 | 5 |
| 2^10 | 5 | 5 |
| 2^11 | 8 | 7 |
| 2^12 | 10 | 7 |
| 2^13 | 13 | 9 |
| 2^14 | 16 | 10 |
| 2^15 | 19 | 13 |
| 2^16 | 24 | 14 |
| 2^17 | 30 | 17 |
| 2^18 | 36 | 20 |
| 2^19 | 46 | 24 |
| 2^20 | 60 | 29 |
| 2^21 | 76 | 35 |
| 2^22 | 93 | 41 |
| 2^23 | 95 | 42 |
| 2^24 | 50 | 24 |
Results are slightly (10% to 2x) faster than CH in Fig 5 of (Buchhold, Sanders, and Wagner 2019), up to 2 to 2.5x faster than CCH-tree-fast for large Dijkstra ranks, and also 2x faster than the baseline C++ implementation.
But also, note that this comparison is not quite fair yet, since the baseline also stores metadata to recover the optimal path, while we don’t.
14.1 Bottleneck Link to heading
At this point, I’m still not quite sure whether things are CPU or memory bound; it seems to be a bit of both. If anything, one thing to improve could be the data layout, to use cache lines as efficiently as possible. For example, the blocks of neighbouring nodes could be aligned to 32 bytes (half a cacheline) to avoid needlessly crossing cacheline boundaries. That is, each block should always start at a `NodeId` that is a multiple of 8.
Other possible improvements could be prefetching of:
- Distances to nodes, although these are largely overlapping already anyway.
- Edge weights, since these are fresh for each node.
- Node data and parent pointers, since this is a chain of 500 to 2000 pointers.
15 TODO Link to heading
- metadata for path recovery
- Interleaving up and down search, somehow
- special in-clique handling
- ‘super’ parents to detect separators, prefetch node data for the next one, and use SIMD for pruning
- parallel construction
- change order to cluster edges, so we have fewer ranges?
- reorder based on down-neighbourhood
- stats: average degree on path to root before/after pruning
- transpose distances and interleave up and down costs
- detect best dist on relax instead of expand
- potentials: for each edge, do something with the cheapest upward edge?