- 1 Problem Statement: Customizable Route Planning (CRP)
- 2 Contraction Hierarchies (CH)
- 3 Analogy with trees
- 4 Shortest Paths in CHs
- 5 Parents: Faster Shortest Paths in CCHs
- 6 Input Graph
- 7 Initial Algorithm
- 7.1 Permute input
- 7.2 Chordal Completion and Parents
- 7.3 Customize
- 7.4 Query
- 8 Optimizing things
- 9 Some stats
- 10 Serializing the final structure
- 11 Merging adjacent edges
- 12 Bugfixing
- 13 Further ideas
- 14 Current best results
- 14.1 Bottleneck
- 15 TODO
These are some notes on customizable contraction hierarchies, based on talks with Michael Zündorf and the survey paper by Bläsius, Buchhold, Wagner, Zeitz, and Zündorf (2025).
1 Problem Statement: Customizable Route Planning (CRP) Link to heading
Given is a static directed graph modelling a road network (e.g. the map of Europe). Separately, we get a list of edge weights that can change frequently (say, every few minutes). The goal is to build a data structure that efficiently supports the following two operations:
- Update all the edge weights.
- Compute the length of the shortest path between two nodes \(s\) and \(t\).
The customizable in the name of the problem refers to the fact that edge weights can be changed.
2 Contraction Hierarchies (CH) Link to heading
The Wikipedia page on contraction hierarchies seems quite complete; just go and read that.
The mathematical tl;dr is: we are given a total order \(\prec\) (also called the rank, let’s say \(r(u)\)) on the nodes, such that typically, ‘unimportant’ nodes come early (small rank) and ‘important’ nodes come late (large rank). Then, we repeatedly contract the smallest node \(u\) by turning its set of neighbours into a clique. For edges \(vu\) and \(uw\), we set or lower the weight of \(vw\) to \(\min(l(v,u) + l(u,w), l(v,w))\). This way, we can skip over \(u\) when computing shortest paths between other vertices.
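As a minimal sketch of this contraction step (on a hypothetical adjacency-map representation of an undirected weighted graph, not the post's actual data structure):

```rust
use std::collections::HashMap;

// Hypothetical sketch: the graph as a weight map on (small, large) node pairs.
type Graph = HashMap<(usize, usize), u32>;

fn key(a: usize, b: usize) -> (usize, usize) {
    (a.min(b), a.max(b))
}

/// Contract node `u`: for every pair of neighbours (v, w), add or lower
/// the shortcut v-w to the length of the path v-u-w.
fn contract(g: &mut Graph, u: usize, neighbours: &[usize]) {
    for (i, &v) in neighbours.iter().enumerate() {
        for &w in &neighbours[i + 1..] {
            let via_u = g[&key(v, u)] + g[&key(u, w)];
            let e = g.entry(key(v, w)).or_insert(u32::MAX);
            *e = (*e).min(via_u);
        }
    }
}
```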
2.1 Classic Contraction Hierarchies Link to heading
The classic approach of the 2010s builds the CH bottom-up. It repeatedly finds “good” nodes to contract, which are those that only generate a few new edges between their neighbours. In particular, when a node \(u\) is contracted, not all pairwise edges between its neighbours are added/updated. Instead, only those edges are added/changed for which the shortest path goes through \(u\). This requires computing all-vs-all distances between the neighbours of \(u\), and thus depends on the weights of the edges.
In the end, the high rank nodes typically represent for example highways near cities, or more generally ‘bottlenecks’ that have a lot of shortest paths going through them.
A drawback of this approach is that it needs the weights up-front, and that it needs to do a lot of shortest-path queries during construction, in order to prevent adding too many edges.
2.2 Customizable Contraction Hierarchies (CCH) Link to heading
Customizable contraction hierarchies use a similar but slightly different approach that does not require knowing the edge weights. This means that eagerly pruning edges is not possible, and thus, all edges must be added pessimistically. As a consequence, the hierarchy/ranks must be chosen in such a way as to minimize the number of edges that are added.
Instead of building the CH bottom-up, here we build the CH top-down. One way to find such an order is the following idea from InertialFlow (Schild and Sommer 2015): take a road network and contract the 25% most Northern vertices and the 25% most Southern vertices into two super-nodes. Then find a minimal cut between these two parts and use the cut nodes as the top levels of the hierarchy. Since this is a cut, it splits the graph into two parts (North and South), so now we can recurse on the two halves by splitting each into East and West, and so on.
3 Analogy with trees Link to heading
Suppose we know the input graph is a tree. Then we can find the distance between any nodes \(s\) and \(t\) by computing the distance from \(s\) and \(t\) to each of their ancestors, and taking the sum of these at the lowest common ancestor. In particular, when the tree is balanced, this runs in \(\lg n\) time.
Unfortunately, graphs are not trees, and the same does not quite work. For CCHs, we turn a graph into a tree by taking a set of nodes (a cut) that we contract and then consider as the root. Since this is a cut, the remainder of the graph is disconnected and falls apart into two components on which we recurse.
At the end, each node of this tree of contracted sets will be (roughly) a clique, and there are additional edges from each node to a subset of its ancestors. Then, we can use the same query algorithms as for trees: compute the distance from each node to all its ancestors, and then find the (not necessarily lowest) common ancestor with minimal sum.
4 Shortest Paths in CHs Link to heading
Let \(s = u_1, u_2, \dots, u_k=t\) be a shortest path from \(s\) to \(t\). Then there is some node \(u_m\) of maximal rank. We can then find the increasing subsequence \(s=u_1 \prec u_{i_2}\prec u_{i_3}\prec \dots\prec u_m\) of nodes that are not preceded by any node of larger rank. Symmetrically, we can find the decreasing subsequence \(u_m\succ \dots \succ u_{j_3} \succ u_{j_2} \succ u_k = t\) of all nodes on the path that are not followed by any node of larger rank. Because of the CH construction, there are shortcuts between all consecutive elements of these subsequences, so we can skip over low-rank intermediate nodes. Thus, we can do a single upward search from each of \(s\) and \(t\) to find the minimum distance to all higher-ranked nodes they can reach, and then take the minimum total distance over all nodes where the two searches meet.
For CH, this is done by using Dijkstra and only relaxing edges to higher-ranked nodes. A priority queue is needed to visit the nodes in order of distance to \(s\) (or \(t\)).
5 Parents: Faster Shortest Paths in CCHs Link to heading
Since CCHs contain all edges above each node, a different approach is possible that does not need the priority queue. In order to visit all nodes above \(s\), we can store for each node \(u\) a parent pointer to the lowest ranked node \(p\) above \(u\). Since \(p\) has edges to all other up-neighbours of \(u\), those will eventually be visited as well. In the end, following parent pointers from \(s\) will traverse exactly all higher-ranked vertices reachable from it.
This removes the need for a priority queue, and makes the algorithm highly predictable.
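As a sketch, the traversal itself is then just a pointer chase (assuming a hypothetical `parent` array where the root points to itself):

```rust
/// Visit all nodes on the parent-pointer chain from `s` to the root.
/// In a CCH, this chain is exactly the upward search space of `s`:
/// since each parent has edges to all other up-neighbours, relaxing
/// edges along the chain reaches every node above `s`.
fn upward_chain(parent: &[usize], s: usize) -> Vec<usize> {
    let mut chain = vec![s];
    let mut u = s;
    while parent[u] != u {
        u = parent[u];
        chain.push(u);
    }
    chain
}
```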
6 Input Graph Link to heading
We use the PTV-DIMACS-Europe-Graph mentioned here:
Each file is a binary little-endian dump of a `&[u32]`, without a leading length.
- `lat` and `lng` are the latitude and longitude of each node.
- The two `*.order` files are different orders for the CH.
- `first_out` indicates the index of the first edge for each node.
- `head` indicates for each edge the node that it points to.
- `dist` indicates the length of each edge.
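Reading such a dump could look like this sketch (the helper name is hypothetical, not the post's actual loader):

```rust
use std::fs;

/// Read a binary little-endian dump of u32 values, without a leading length.
fn read_u32_vec(path: &str) -> std::io::Result<Vec<u32>> {
    let bytes = fs::read(path)?;
    assert!(bytes.len() % 4 == 0, "file size must be a multiple of 4");
    Ok(bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}
```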
7 Initial Algorithm Link to heading
The construction mostly follows the original code at github:mzuenni/Separator-Based-Alternative-Paths-in-CCHs.
7.1 Permute input Link to heading
First, we re-number all nodes and edges by increasing rank.
7.2 Chordal Completion and Parents Link to heading
Iterate over all nodes from low to high rank (= id), and turn their upper neighbours into a clique. The parent of each node is its lowest-ranked upper neighbour, and we append the remaining upper neighbours of each node to its parent.
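A sketch of this completion on an adjacency-set representation (nodes already renumbered by rank, so ‘higher rank’ is just ‘larger id’; the representation is a hypothetical simplification):

```rust
use std::collections::BTreeSet;

/// Chordal completion: for each node u from low to high rank, record its
/// parent (lowest-ranked upper neighbour) and append u's other upper
/// neighbours to that parent. Processed inductively, this turns every
/// upper neighbourhood into a clique.
fn chordal_completion(up: &mut Vec<BTreeSet<usize>>) -> Vec<usize> {
    let n = up.len();
    let mut parent = vec![usize::MAX; n]; // MAX marks the root
    for u in 0..n {
        let ups: Vec<usize> = up[u].iter().copied().collect();
        if let Some(&p) = ups.first() {
            parent[u] = p;
            // Append u's remaining upper neighbours to its parent p.
            for &w in &ups[1..] {
                up[p].insert(w);
            }
        }
    }
    parent
}
```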
7.3 Customize Link to heading
- First, copy the input weights into the larger CCH graph. Note that all edges are directed here.
- For each node \(u\) from low to high, ‘relax’ the edges between its upper-neighbours.
- Drop edges that are never optimal. For each node \(u\) from high to low, iterate over the triangles between its upper neighbours, and in each, relax the lower and middle edge. (The upper edge was already relaxed before.) If an edge is relaxed, mark it for deletion but first keep it around for further propagation. Finally, prune all the deleted edges.
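The upward relaxation step can be sketched as follows, on a weight map over node pairs (a hypothetical undirected simplification; the real code works on flat directed edge arrays):

```rust
use std::collections::HashMap;

/// Basic customization: for each node u from low to high rank, relax the
/// edge between every pair (v, x) of its upper neighbours via the lower
/// triangle u-v-x.
fn customize(up: &[Vec<usize>], w: &mut HashMap<(usize, usize), u32>) {
    for u in 0..up.len() {
        for (i, &v) in up[u].iter().enumerate() {
            // up[u] is sorted by rank, so v < x below.
            for &x in &up[u][i + 1..] {
                let via_u = w[&(u, v)].saturating_add(w[&(u, x)]);
                let e = w.get_mut(&(v, x)).unwrap();
                *e = (*e).min(via_u);
            }
        }
    }
}
```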
7.4 Query Link to heading
As a lazy initial approach for a query from \(s\) to \(t\):
- For each of \(s\) and \(t\), initialize a vector of distances to all other nodes.
- Then relax the upward edges of all nodes on the parent-pointer chain, until the root is reached.
- Then iterate over all nodes and find the one with minimum sum of distances to \(s\) and \(t\).
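The steps above can be sketched as follows (the `up`/`parent` representation is a hypothetical simplification; the root’s parent points to itself):

```rust
/// Distances from `s` to all nodes above it: follow the parent-pointer
/// chain from `s` and relax the upward edges of every node on it.
fn upward_dists(up: &[Vec<(usize, u32)>], parent: &[usize], s: usize) -> Vec<u64> {
    let mut dist = vec![u64::MAX; up.len()];
    dist[s] = 0;
    let mut u = s;
    loop {
        if dist[u] < u64::MAX {
            let du = dist[u];
            for &(v, w) in &up[u] {
                dist[v] = dist[v].min(du + w as u64);
            }
        }
        if parent[u] == u {
            break;
        }
        u = parent[u];
    }
    dist
}

/// Shortest s-t distance: the minimum, over all nodes, of the sum of the
/// two upward distances (non-meeting nodes saturate to u64::MAX).
fn query(up: &[Vec<(usize, u32)>], parent: &[usize], s: usize, t: usize) -> u64 {
    let ds = upward_dists(up, parent, s);
    let dt = upward_dists(up, parent, t);
    ds.iter()
        .zip(&dt)
        .map(|(a, b)| a.saturating_add(*b))
        .min()
        .unwrap()
}
```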
8 Optimizing things Link to heading
- Reading the graph: a few seconds
- Relaxing upper edges: 18s
- Finding redundant edges: 18s
- Compressing edges: 1s
- 100 queries: 2s
- Total time: 44s
A quick `perf record; perf report` after enabling debug symbols via `[profile.release] debug = true` shows that around 70% of time is spent iterating through neighbour lists, searching for the \(xy\) edges that complete a triangle from \(u\).
8.1 Binary searching in find_edge Link to heading
Since edges are already sorted by out-neighbour anyway, we can also find them via binary search.
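A sketch under the CSR layout of the input format (`first_out`/`head`), with `find_edge` as a hypothetical name:

```rust
/// Find the index of edge u -> v, assuming each node's out-edges in
/// `head` are sorted by target node.
fn find_edge(first_out: &[u32], head: &[u32], u: usize, v: u32) -> Option<usize> {
    let start = first_out[u] as usize;
    let end = first_out[u + 1] as usize;
    head[start..end]
        .binary_search(&v)
        .ok()
        .map(|i| start + i)
}
```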
This reduces the total time to 22s, with only 7s for each of the relax and prune stage, but still, most of the time is in the binary search.
8.2 HashMap of edges Link to heading
Better, we replace the binary search with a HashMap that directly tells us the index of each edge.
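A sketch of building such a map (the helper name is hypothetical):

```rust
use std::collections::HashMap;

/// Build a map from (u, v) to the index of edge u -> v,
/// from CSR-style `first_out`/`head` arrays.
fn build_edge_index(first_out: &[u32], head: &[u32]) -> HashMap<(u32, u32), usize> {
    let mut map = HashMap::with_capacity(head.len());
    for u in 0..first_out.len() - 1 {
        for e in first_out[u] as usize..first_out[u + 1] as usize {
            map.insert((u as u32, head[e]), e);
        }
    }
    map
}
```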
Surprisingly, this takes 7s to build, and then 23s for relax and prune each, for 65s total. Not good.
8.3 Ranges of neighbours Link to heading
Because of the CCH structure, most triangles are inside the cliques formed in each cut. This means that many nodes have large intervals of consecutive neighbours. So we can store for each node a list of start nodes of each interval, and then binary search that instead. This reduces the total time to 19s.
8.4 Linear scan Link to heading
It turns out that during relaxing and pruning, `find_edge_index` is always called repeatedly for a single \(x\) and increasing \(v\), so we can replace the repeated searches by a single linear scan. This lowers the entire construction to 7s.
8.5 Proper query algorithm Link to heading
So far, I was allocating two size-n vectors for each query and doing a separate search from \(s\) and \(t\). By interleaving the searches, we can efficiently reuse a single allocation and incrementally clean up used values. Query time starts at 230us/q for random pairs.
8.6 Pruning edges Link to heading
If we enable pruning of sub-optimal edges, query time drops to 108us/q.
8.7 Pruning the search Link to heading
When we found some \(st\) path of length \(d\), we can skip expanding nodes at distance at least \(d\) from either \(s\) or \(t\). The gains are marginal though, from 108 us to 103 us (and these numbers have high variance).
8.8 Unconditional edge relaxing Link to heading
To avoid branch misses, we write relaxed edge values unconditionally: instead of a compare-and-branch, we always store the minimum of the old and the new value.
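A minimal sketch of this branchless relax (hypothetical names):

```rust
/// Relax distance slot `e` with a candidate value, without branching:
/// always write back the minimum instead of conditionally storing.
/// This typically compiles to a cmov/min rather than a branch.
fn relax_unconditional(w: &mut [u32], e: usize, candidate: u32) {
    w[e] = w[e].min(candidate);
}
```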
8.9 Early edge break Link to heading
For each node, we can sort outgoing edges by length, and only process the prefix that results in relaxations with cost at most the best s-t path so far. This gives around 10% speedup, from 97 us/q to 88 us/q.
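A sketch of this early break, assuming out-edges are pre-sorted by weight and distances use a `u64::MAX` sentinel (both assumptions):

```rust
/// Relax the out-edges of a node with distance `du`, whose (target, weight)
/// pairs are sorted by weight, stopping as soon as a relaxation can no
/// longer beat the best s-t distance found so far. Returns the number of
/// edges actually processed.
fn relax_sorted(dist: &mut [u64], du: u64, edges: &[(usize, u32)], best: u64) -> usize {
    let mut relaxed = 0;
    for &(v, w) in edges {
        if du + w as u64 >= best {
            break; // all remaining edges are at least this long
        }
        dist[v] = dist[v].min(du + w as u64);
        relaxed += 1;
    }
    relaxed
}
```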
8.10 DFS-ordering the nodes Link to heading
Currently the nodes are already ordered by their rank, but in a layer-by-layer way. Instead we can use a DFS order to get more locality.
Unfortunately, this appears to give pretty much no speedup, and actually slows things down slightly: 85 us/q to 89 us/q.
8.11 Not inlining queries Link to heading
If we mark the query function as `#[inline(never)]`, times go down from 89 us/q to 81 us/q. I only did this to isolate the results in perf, so I’m not exactly sure why it helps.
9 Some stats Link to heading
With default .order:
- 18.0M nodes
- 42.2M directed input edges
- 18.1M added undirected edges for chordal closure
- 17.4M pruned undirected edges for perfect customization
With .ifc.order:
- 16.7M added edges for chordal closure
- 16.6M pruned edges for perfect customization
- 42.3M edges remaining, only slightly more than the input (but the input will contain many edges in both directions).
10 Serializing the final structure Link to heading
For iteration speed on queries, we dump the final customized data structure to disk, so we can load it directly instead of redoing the preprocessing each time. We use epserde, which simply dumps the in-memory binary representation to disk; this requires zero-copy annotations on the stored types.
Writing and reading the file both take under 1s, compared to ~15s of preprocessing.
11 Merging adjacent edges Link to heading
Many nodes are connected to ranges of consecutive vertices. We will try to process these using SIMD. First we transpose the distances vector so that adjacent nodes have their distances adjacent as well.
- Initial query time: 92 us/q
- Transposed `dist`: 110 us/q
- Iterating all ranges separately: 178 us/q
- Unchecked indexing: 159 us/q
- Merging the first and second phase into one, so we only have to write the code once: 166 us/q
- Flattened edge weights: 177 us/q
- Relaxing 8 edges at once using SIMD: 94 us/q
- Merging ranges with gap <= 4: 69 us/q
- Including the first `NodeId` with the range, avoiding a memory indirection: 56 us/q
- Moving the `splat` for the masking outside the inner loop: 53 us/q
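The key step, relaxing 8 edges at once, can be sketched with a plain fixed-size loop that the compiler can auto-vectorize (a simplified stand-in for the explicit SIMD code; names are hypothetical):

```rust
/// Relax 8 consecutive target distances with 8 consecutive edge weights,
/// as one 8-wide block. The fixed trip count lets the compiler emit a
/// single vector min instead of 8 scalar compare-and-branches.
fn relax8(dist: &mut [u32; 8], du: u32, weights: &[u32; 8]) {
    for i in 0..8 {
        // saturating_add keeps u32::MAX usable as an "infinity" sentinel.
        let cand = du.saturating_add(weights[i]);
        dist[i] = dist[i].min(cand);
    }
}
```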
11.1 Perf stat Link to heading
- 12% branch misses is a lot!
- 1.6 instructions/cycle is quite low
The hot loop over all ranges for a node and the edges in each range now dominates the profile: a large fraction of samples lands on just a few instructions.
We see that two instructions are ‘wasted’ splatting the iteration count. Fixing that makes things slightly cleaner.
More generally, we do a bunch of bookkeeping to handle ranges with non-multiple-of-8 lengths. Instead, we can also just bloat the set of edges so that all ranges have multiple-of-8 lengths.
11.2 All ranges are multiples of 8 Link to heading
We can ’expand’ all small groups of edges to cover ranges of size 8, adding up to 7 dummies if needed. This blows up the size of the data structure in memory, as the number of edges increases from 42.2M to 230.7M, but gets rid of the blending operations. This lowers the time to 48 us/q.
Still, we have 8% branch misses and now only 1.4 instructions/cycle.
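The padding itself could be sketched like this, assuming edge weights are stored per node in a flat `Vec<u32>` and `u32::MAX` serves as the dummy weight (both assumptions):

```rust
/// Pad a node's edge weights to a multiple of 8 with "infinite" dummies,
/// so every range can be processed in full 8-wide blocks without masking.
/// Dummy edges never win a (saturating) relaxation.
fn pad_to_8(weights: &mut Vec<u32>) {
    let target = (weights.len() + 7) / 8 * 8;
    weights.resize(target, u32::MAX);
}
```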
11.3 All ranges have size 8 Link to heading
Only a few ranges are actually larger than 8, so instead, we can also just make all ranges have exactly size 8 and replace the double for loop by a single loop where each range corresponds to a single SIMD iteration. This increases the number of ranges from 27385521 to 28838911, and lowers queries to 39 us/q just by reducing branch misses; if we drop the inner loop completely, we get 34 us/q.
The inner loop is now very tight, and 80% of query time is spent in it.
Reading the edge index is redundant here: every range covers exactly 8 weights, so we can replace it by a `+= 8` instead, which gives 30 us/q.
In fact, we don’t even need the range start and end at all any more; the original range of edges of the node suffices. This reduces the `addq` to only add 4, for a single u32. Result: 28 us/q.
Anyway, the assembly is very pretty and very hot: 72% of time is spent here, branch misses are down to 3.5%, and instructions per cycle is up to 1.8 (but still far from 4).
11.4 Finding the bottleneck Link to heading
It’s unclear to me what the current bottleneck is (CPU or memory). Let’s try to find out using `perf stat`’s top-down analysis, following https://perfwiki.github.io/main/top-down-analysis/. (This also required relaxing the perf event permissions.)
The picture seems to be quite mixed, with no clear target for improvement. Both CPU and L1/RAM are a bottleneck at different times it seems.
12 Bugfixing Link to heading
I finally added a Dijkstra implementation as a baseline. Unfortunately, results are very different… Turns out there’s a bug in the pruning of edges where I iterated over the wrong vertex.
Also, I was pruning edges in both directions, even if I marked them for deletion in only one direction.
With all that fixed, and with making sure I apply the same permutation to the input of Dijkstra, all is good now.
13 Further ideas Link to heading
13.1 Failed: Doubling the graph Link to heading
In fact, we can prune edges that are redundant in only one direction, by doubling the graph and storing separate data for the up and down graph. I implemented this, but it comes out quite a bit slower, probably because it requires more IO to read the edges. (It might be better with an interleaved layout, which I didn’t test yet.)
13.2 Edge pruning Link to heading
We can sort the blocks of 8 outgoing edges from each node by the minimum weight, and then early break as soon as all remaining edges have too large weight.
This seems 1 to 2 us/q faster, but whether pruning actually helps depends a lot on the queries: for random queries, the searches from \(s\) and \(t\) only meet very late, while in specific cases the speedup can be up to 50%.
13.3 Failed: Expanding a node and its parent in parallel Link to heading
We can manually relax the node from \(u\) to \(p\), and then unify their neighbours and relax them in a single loop. This saves some iteration overhead and storing and reading some distances, as long as the neighbour sets are sufficiently similar.
First results are 20% or so slower (50 instead of 40 us/q). Transposing the weights array to have weights for \(u\) and \(p\) adjacent in two halves of a cacheline helps a tiny bit, but not enough.
It’s a bit unfortunate that the inner-loop assembly now has a bunch of `mov` instructions that weren’t there before. We can get rid of these extra `mov`s by replacing `self.weights.get_unchecked` by `self.weights.as_ptr().offset(...).read()`, which forces the compiler to inline the variables (even with the higher register pressure of the double-expanding), and gives back a few of the microseconds (2-3 us/q) that we lost.
14 Current best results Link to heading
- With everything up to and including sorting of edges.
- Without separate up and down graphs.
- Without expanding nodes and their parents in parallel.
We test on Dijkstra rank queries, where we run a Dijkstra search from 1000 random start nodes and then use as destination the \(2^i\)’th expanded vertex. This gives queries that are close together for small \(i\) and far apart for large \(i\).
Running at a fixed 3.6 GHz on an i7-10750H laptop CPU. We compare against the C++ implementation of Michael Zündorf at github:mzuenni/Separator-Based-Alternative-Paths-in-CCHs.
| dist | C++ baseline (us) | this code (us) |
|---|---|---|
| 2^ 0 | 1 | 2 |
| 2^ 1 | 2 | 3 |
| 2^ 2 | 1 | 3 |
| 2^ 3 | 2 | 3 |
| 2^ 4 | 2 | 3 |
| 2^ 5 | 2 | 3 |
| 2^ 6 | 2 | 4 |
| 2^ 7 | 3 | 4 |
| 2^ 8 | 3 | 4 |
| 2^ 9 | 5 | 5 |
| 2^10 | 5 | 5 |
| 2^11 | 8 | 7 |
| 2^12 | 10 | 7 |
| 2^13 | 13 | 9 |
| 2^14 | 16 | 10 |
| 2^15 | 19 | 13 |
| 2^16 | 24 | 14 |
| 2^17 | 30 | 17 |
| 2^18 | 36 | 20 |
| 2^19 | 46 | 24 |
| 2^20 | 60 | 29 |
| 2^21 | 76 | 35 |
| 2^22 | 93 | 41 |
| 2^23 | 95 | 42 |
| 2^24 | 50 | 24 |
Results are slightly (10% to 2x) faster than CH in Fig 5 of (Buchhold, Sanders, and Wagner 2019), up to 2 to 2.5x faster than CCH-tree-fast for large Dijkstra ranks, and also 2x faster than the baseline C++ implementation.
But also, note that this comparison is not quite fair yet, since the baseline also stores metadata to recover the optimal path, while we don’t.
14.1 Bottleneck Link to heading
At this point, I’m still not quite sure whether things are CPU or memory bound; it seems to be a bit of both. If anything, one thing to improve could be the data layout, to use cache lines as efficiently as possible. For example, the blocks of neighbouring nodes could be aligned to 32 bytes (half a cacheline) to avoid needlessly crossing cacheline boundaries. That is, each block should always start at a `NodeId` that is a multiple of 8.
Other possible improvements could be prefetching of:
- Distances to nodes, although these are largely overlapping already anyway.
- Edge weights, since these are fresh for each node.
- Node data and parent pointers, since this is a chain of 500 to 2000 pointers.
15 TODO Link to heading
- metadata for path recovery
- Interleaving up and down search, somehow
- special in-clique handling
- ‘super’ parents to detect separators, prefetch node data for the next one, and use SIMD for pruning
- parallel construction
- change order to cluster edges, so we have fewer ranges?
- reorder based on down-neighbourhood
- stats: average degree on path to root before/after pruning
- transpose distances and interleave up and down costs
- detect best dist on relax instead of expand
- potentials: for each edge, do something with the cheapest upward edge?