One Billion Row Challenge

Since everybody is doing it, I’m also going to take a stab at the One Billion Row Challenge.

Rules I set for myself:

  • Rust (of course)
  • External libraries are allowed at first for convenience.
  • I’m assuming that the set of cities is fixed.
    • Or at least, that each row is sampled randomly from the set of available cities.
    • Or really, I’m just scanning a prefix of 100k rows, assuming all cities appear there.
  • Along the way, I use that the longest line in the input file has 33 characters, and that cities are uniquely determined by their first and last 8 characters.

This post is a log of ideas, timings, and general progress.

My processor is an i7-10750H CPU @ 2.6GHz.

  • Clock speed is capped at 3.60GHz. (It can run at 4.6GHz, but 3.6GHz tends to be more stable for benchmarking.)
  • 64GB RAM.
  • It has 6 cores.
  • Hyperthreading is disabled.
  • The file should be cached in memory by the OS.
Table 1: Results after improving ILP, @4.6GHz, measured inside the program, excluding munmap at exit.
threads   HT off   HT on
      1     4.93    5.05
      6     0.90    0.97
     12     0.90    0.92

The problem

Input: a 13GB file containing \(10^9\) lines of the form Amsterdam;10.5.

  • First a city name. There are \(413\) cities, and each row has a name chosen at random from this set. Their lengths range from \(3\) to \(26\) bytes.
  • Then a temperature, formatted as -?\d?\d\.\d, i.e. a possibly negative number with one or two integral digits and exactly one decimal. Each temperature is drawn from a per-city normal distribution.

Output: a sorted list of cities of the form <city>: <min>/<avg>/<max>, each formatted with one decimal place.

Initial solution: 105s

Here's a first version, with zero optimizations.

use std::{collections::HashMap, env::args, io::Read};

struct Record {
    count: u32,
    min: f32,
    max: f32,
    sum: f32,
}

impl Record {
    fn default() -> Self {
        Self {
            count: 0,
            min: 1000.0,
            max: -1000.0,
            sum: 0.0,
        }
    }
    fn add(&mut self, value: f32) {
        self.count += 1;
        self.sum += value;
        self.min = self.min.min(value);
        self.max = self.max.max(value);
    }
    fn avg(&self) -> f32 {
        self.sum / self.count as f32
    }
}

fn main() {
    let filename = args().nth(1).unwrap_or("measurements.txt".to_string());
    let mut data = String::new();
    {
        let mut file = std::fs::File::open(filename).unwrap();
        file.read_to_string(&mut data).unwrap();
    }
    let mut h = HashMap::new();
    for line in data.trim().split('\n') {
        let (name, value) = line.split_once(';').unwrap();
        let value = value.parse::<f32>().unwrap();
        h.entry(name).or_insert(Record::default()).add(value);
    }

    let mut v = h.into_iter().collect::<Vec<_>>();
    v.sort_unstable_by_key(|p| p.0);
    for (name, r) in v {
        println!("{name}: {:.1}/{:.1}/{:.1}", r.min, r.avg(), r.max);
    }
}
Code Snippet 1: Commit.

First flamegraph

Let’s see what’s slow here: cargo flamegraph --open (or just flamegraph).

Figure 1: A flamegraph. Open in a new tab to see the interactive version with zooming and ctrl-F support.

Takeaways:

  • 35% of time is next_match, i.e. searching for \n and/or ;.
  • 14% of time is parsing the f32.
  • 35% of time is accessing the hashmap.
  • Not sure what exactly is the remainder. We’ll figure that out once it becomes relevant.

Bytes instead of strings: 72s

Strings in Rust are checked to be valid UTF-8. Using byte slices (&[u8]) is usually faster. We have to do some slightly ugly conversions from byte slices back to strings for parsing floats and printing, but it's worth it. This basically removes next_match from the flamegraph.
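For illustration, the change leads to roughly this kind of code (a sketch, not the actual commit; the helper name is made up):

// Split one raw byte line into name and value, only converting back to &str
// where the std float parser requires it.
fn handle_line(line: &[u8]) -> (&[u8], f32) {
    let sep = line.iter().position(|&c| c == b';').unwrap();
    let (name, value) = (&line[..sep], &line[sep + 1..]);
    let value: f32 = std::str::from_utf8(value).unwrap().parse().unwrap();
    (name, value)
}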

Commit here. (It’s neither pretty nor interesting.)

This already saves 21 seconds, from 105 to 84. Pretty great!

Manual parsing: 61s

Instead of parsing the temperature as an f32 float, we can parse it manually into a fixed-precision i32 signed integer.

type V = i32;
fn parse(mut s: &[u8]) -> V {
    let neg = if s[0] == b'-' {
        s = &s[1..];
        true
    } else {
        false
    };
    // s = abc.d
    let (a, b, c, d) = match s {
        [c, b'.', d] => (0, 0, c - b'0', d - b'0'),
        [b, c, b'.', d] => (0, b - b'0', c - b'0', d - b'0'),
        [a, b, c, b'.', d] => (a - b'0', b - b'0', c - b'0', d - b'0'),
        [c] => (0, 0, 0, c - b'0'),
        [b, c] => (0, b - b'0', c - b'0', 0),
        [a, b, c] => (a - b'0', b - b'0', c - b'0', 0),
        _ => panic!("Unknown pattern {:?}", std::str::from_utf8(s).unwrap()),
    };
    let v = a as V * 1000 + b as V * 100 + c as V * 10 + d as V;
    if neg {
        -v
    } else {
        v
    }
}
Code Snippet 2: A custom parsing function using matching on the pattern. commit.

Inline hash keys: 50s

Currently the hashmap is from &str to Record, where all &str are slices of the input string. All this indirection is probably slow. So we instead would like to store keys inline as [u8; 8] (basically a u64). It turns out that the first 8 characters of each city name are almost enough for uniqueness. Only Alexandra and Alexandria coincide, so we’ll xor in the length of the string to make them unique. One drawback is that the hashmap must now store the full name corresponding to the key as well.

+fn to_key(name: &[u8]) -> u64 {
+    let mut key = [0u8; 8];
+    let l = name.len().min(8);
+    key[..l].copy_from_slice(&name[..l]);
+    key[0] ^= name.len() as u8;
+    u64::from_ne_bytes(key)
+}
 ...
-        h.entry(name).or_insert(Record::default()).add(parse(value));
+        h.entry(to_key(name))
+            .or_insert((Record::default(), name))
+            .0
+            .add(parse(value));
Code Snippet 3: The new key function. commit.

Faster hash function: 41s

The default hash table in rust uses a pretty slow hash function. Let’s instead use fxhash::FxHashMap. For u64 keys, the hash function is simply multiplication by a constant. This gives another 10 seconds speedup.

-    let mut h = HashMap::new();
+    let mut h = FxHashMap::default();
Code Snippet 4: Switching to FxHash. commit.
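For intuition, the FxHash of a single u64 key boils down to something like the sketch below (not the exact fxhash implementation; the constant is the one that shows up in the assembly later in this post):

// The whole hash is essentially one multiplication by a fixed odd constant,
// which compiles to a single imul instruction.
fn fx_style_hash(key: u64) -> u64 {
    const K: u64 = 0x517cc1b727220a95;
    key.wrapping_mul(K)
}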

A new flame graph

Now that we’ve addressed the obvious hot parts, let’s make a new graph.

Figure 2: A useless flamegraph.

Yeah well, great… I suppose everything is inlined or something. But the debug info should still be there. idk…

Perf it is

cargo flamegraph uses perf record under the hood, so we can just run perf report and see what's there.

Some snippets. Numbers on the left are percentage of samples on that line.

3.85 2d0:┌─→movzbl       0x0(%rbp,%rbx,1),%r15d // read a byte
1.24       cmp          $0xa,%r15b             // compare to \n
0.69     │↓ je           300                    // handle the line if \n
2.07       inc          %rbx                   // increment position
         ├──cmp          %rbx,%rcx              // compare to end of data
5.43     └──jne          2d0                    // next iteration
Code Snippet 5: The column on the left indicates that in total 13% of time is spent looking for newlines.
6.25 330:┌─→cmpb         $0x3b,0x0(%rbp,%r13,1) // read a byte
3.40     │↓ je           350                    // handle if found
3.28       inc          %r13                   // increment position
         ├──cmp          %r13,%rbx              // compare to length of the line
2.53     └──jne          330                    // next iteration
           jmp          c0e                    // fall through to panic handler
Code Snippet 6: 15% of time is spent looking for semicolons.
           key[0] ^= name.len() as u8;
 3.79        xor          %r13b,0x40(%rsp)
           u64::from_ne_bytes(key)
11.77        mov          0x40(%rsp),%r12
Code Snippet 7: Converting from [u8; 8] to u64, i.e. an unaligned read, is surprisingly slow?

Then there are quite a few instructions for indexing the hash table, adding up to around 20% in total.

Parsing takes around 5%.

Something simple: allocating the right size: 41s

We can stat the input file for its size and allocate exactly the right amount of space. This saves around half a second.

     let mut data = vec![];
+    let stat = std::fs::metadata(filename).unwrap();
+    data.reserve(stat.len() as usize + 1);
     let mut file = std::fs::File::open(filename).unwrap();
     file.read_to_end(&mut data).unwrap();
Code Snippet 8: reserving space

memchr for scanning: 47s

memchr(byte, text) is a libc function that returns the first index of the byte in the text. But, well… it turns out this is a non-inlined function call after all, and things slow down. Anyway, here's the diff:

     let mut h = FxHashMap::default();
-    for line in data.split(|&c| c == b'\n') {
-        let (name, value) = line.split_once(|&c| c == b';').unwrap();
+    let mut data = &data[..];
+    loop {
+        let Some(separator) = memchr(b';', data) else {
+            break;
+        };
+        let end = memchr(b'\n', &data[separator..]).unwrap();
+        let name = &data[..separator];
+        let value = &data[separator + 1..separator + end];
         h.entry(to_key(name))
             .or_insert((Record::default(), name))
             .0
             .add(parse(value));
+        data = &data[separator + end + 1..];
     }
Code Snippet 9: Scanning with memchr. commit.

memchr crate: 29s

It also turns out the default memchr function doesn’t use SIMD. But there is the nice memchr crate which is heavily optimized and does use SIMD.

This brings us down from the previous best of 42s to 29s!
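For reference, using the crate looks roughly like this (the dependency version is an assumption):

// Cargo.toml: memchr = "2"
use memchr::memchr;

// Find the name/value split of a single line.
fn split_line(line: &[u8]) -> Option<(&[u8], &[u8])> {
    let sep = memchr(b';', line)?;
    Some((&line[..sep], &line[sep + 1..]))
}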

get_unchecked: 28s

By default, all array accesses are bounds-checked. We don't really need that; removing the checks saves half a second.

The code is now a bit uglier sadly: commit.

Manual SIMD: 29s

One ‘problem’ with memchr is that it is made for scanning long ranges, and is not super flexible. So let’s roll our own.

We make sure that data is aligned to SIMD boundaries and iterate over it \(32\) characters at a time. We compare all \(32\) characters at once against the target character and convert the result into a bitmask. The number of trailing zeros indicates the position of the first match. If the bitmask is \(0\), there is no match and we try the next \(32\) characters.

This turns out to be slightly slower. I’m not exactly sure why, but we can profile and iterate from here.

// This uses nightly portable SIMD (#![feature(portable_simd)]) for the Simd type and simd_eq.

/// Number of SIMD lanes. AVX2 has 256 bits, so 32 lanes.
const L: usize = 32;
/// The Simd type.
type S = Simd<u8, L>;

/// Find the regions between \n and ; (names) and between ; and \n (values),
/// and calls `callback` for each line.
#[inline(always)]
fn iter_lines<'a>(data: &'a [u8], mut callback: impl FnMut(&'a [u8], &'a [u8])) {
    unsafe {
        // TODO: Handle the tail.
        let simd_data: &[S] = data.align_to::<S>().1;

        let sep = S::splat(b';');
        let end = S::splat(b'\n');
        let mut start_pos = 0;
        let mut i = 0;
        let mut eq_sep = sep.simd_eq(simd_data[i]).to_bitmask();
        let mut eq_end = end.simd_eq(simd_data[i]).to_bitmask();

        // TODO: Handle the tail.
        while i < simd_data.len() - 2 {
            // find ; separator
            // TODO if?
            while eq_sep == 0 {
                i += 1;
                eq_sep = sep.simd_eq(simd_data[i]).to_bitmask();
                eq_end = end.simd_eq(simd_data[i]).to_bitmask();
            }
            let offset = eq_sep.trailing_zeros();
            eq_sep ^= 1 << offset;
            let sep_pos = L * i + offset as usize;

            // find \n newline
            // TODO if?
            while eq_end == 0 {
                i += 1;
                eq_sep = sep.simd_eq(simd_data[i]).to_bitmask();
                eq_end = end.simd_eq(simd_data[i]).to_bitmask();
            }
            let offset = eq_end.trailing_zeros();
            eq_end ^= 1 << offset;
            let end_pos = L * i + offset as usize;

            callback(
                data.get_unchecked(start_pos..sep_pos),
                data.get_unchecked(sep_pos + 1..end_pos),
            );

            start_pos = end_pos + 1;
        }
    }
}
Code Snippet 10: Simd code to search for semicolon and newline characters. commit.

Profiling

Running perf stat -d cargo run -r gives:

      28,367.09 msec task-clock:u                     #    1.020 CPUs utilized
              0      context-switches:u               #    0.000 /sec
              0      cpu-migrations:u                 #    0.000 /sec
         31,249      page-faults:u                    #    1.102 K/sec
 92,838,268,117      cycles:u                         #    3.273 GHz
153,099,184,152      instructions:u                   #    1.65  insn per cycle
 19,317,651,322      branches:u                       #  680.988 M/sec
  1,712,837,337      branch-misses:u                  #    8.87% of all branches
 27,760,594,151      L1-dcache-loads:u                #  978.620 M/sec
    339,143,832      L1-dcache-load-misses:u          #    1.22% of all L1-dcache accesses
     25,000,151      LLC-loads:u                      #  881.308 K/sec
      4,546,946      LLC-load-misses:u                #   18.19% of all L1-icache accesses
Code Snippet 11: Output of perf stat profiling.

Observe:

  • The actual clock speed is only 3.3GHz, whereas it should be 3.6GHz. Not sure why; it might be waiting for IO.
  • 1.65 instructions per cycle is quite low. It can be up to 4 and is often at least 2.5.
  • A branch-miss rate of 8.87% is also quite high. Usually this is at most 1% and typically lower. Each branch mispredict causes a stall of 5ns or so, which adds up to multiple seconds here, and I suspect the impact is even larger.
  • 18.19% of last-level-cache load misses. Also quite high, but I’m not sure if this is a problem, since the total number of LLC loads is relatively low.

Revisiting the key function: 23s

Looking at perf report we see that the hottest instruction is a call to memcpy to read up to name.len() bytes from the &[u8] name to a u64.

            core::intrinsics::copy_nonoverlapping:
 0.15         lea          0xa8(%rsp),%rdi
 0.64         mov          %rsi,0x168(%rsp)
 1.18         vzeroupper
 0.68        call         *0x46449(%rip)        # 5f8e8 <memcpy@GLIBC_2.14>
11.31         mov          0xa8(%rsp),%r15
 0.19         mov          %rbx,0x160(%rsp)
Code Snippet 12: 12% of time is spent on casting the name into a u64.

We can avoid this memcpy call entirely by just doing a (possibly out-of-bounds) u64 read of the name, and then shifting away the bits corresponding to the out-of-bounds part. We'll also improve the hash by adding the first and last (up to) 8 characters.

fn to_key(name: &[u8]) -> u64 {
    // Hash the first and last 8 bytes.
    let head: [u8; 8] = unsafe { *name.get_unchecked(..8).split_array_ref().0 };
    let tail: [u8; 8] = unsafe { *name.get_unchecked(name.len() - 8..).split_array_ref().0 };
    let shift = 64usize.saturating_sub(8 * name.len());
    let khead = u64::from_ne_bytes(head) << shift;
    let ktail = u64::from_ne_bytes(tail) >> shift;
    khead + ktail
}
Code Snippet 13: The new key function. commit.

This brings the runtime down from 28s to 23s!

In perf stat, we can also see that the number of branches and branch-misses went down around 30%.

PtrHash perfect hash function: 17s

Now, the hottest instructions are all part of the hashmap lookup.

           hashbrown::raw::RawTable<T,A>::find:
0.27         mov          (%rsp),%rcx
0.16         mov          0x8(%rsp),%rax
           hashbrown::raw::h2:
0.41         mov          %rbp,%rdx
0.56         shr          $0x39,%rdx
1.19         mov          %rdx,0x158(%rsp)
0.13         vmovd        %edx,%xmm0
0.89         vpbroadcastb %xmm0,%xmm0
0.20         lea          -0x28(%rcx),%rdx
0.16         xor          %esi,%esi
0.16         mov          %rbp,%r11
           hashbrown::raw::RawTableInner::find_inner:
1.41  586:   and          %rax,%r11
           core::intrinsics::copy_nonoverlapping:
3.29         vmovdqu      (%rcx,%r11,1),%xmm1
           core::core_arch::x86::sse2::_mm_movemask_epi8:
5.60         vpcmpeqb     %xmm0,%xmm1,%xmm2                    ; compare key to stored keys
0.02         vpmovmskb    %xmm2,%r8d
           hashbrown::raw::bitmask::BitMask::lowest_set_bit:
0.31         nop
0.97  5a0:┌─→test         %r8w,%r8w
          │<hashbrown::raw::bitmask::BitMaskIter as core::iter::traits::iterator::Iterator>::next:
0.80      │↓ je           5d0
          hashbrown::raw::bitmask::BitMask::lowest_set_bit:
5.59        tzcnt        %r8d,%r9d                            ; find position of match in bitmask
          hashbrown::raw::bitmask::BitMask::remove_lowest_bit:
0.03        blsr         %r8d,%r8d
          hashbrown::raw::RawTableInner::find_inner:
0.61        add          %r11,%r9
0.53        and          %rax,%r9
          core::ptr::mut_ptr::<impl *mut T>::sub:
1.93        neg          %r9
          core::ptr::mut_ptr::<impl *mut T>::offset:
0.57        lea          (%r9,%r9,4),%r9
          core::cmp::impls::<impl core::cmp::PartialEq for u64>::eq:
8.40      ├──cmp          %r14,(%rdx,%r9,8)                    ; check equal
          hashbrown::raw::RawTableInner::find_inner:
0.69      └──jne          5a0
0.11        jmp          600
           core::core_arch::x86::sse2::_mm_movemask_epi8:
             data16       cs nopw 0x0(%rax,%rax,1)
7.55  5d0:   vpcmpeqb     -0x47c8(%rip),%xmm1,%xmm1            ; more equality checking
0.00         vpmovmskb    %xmm1,%r8d
           hashbrown::raw::bitmask::BitMask::any_bit_set:
          ┌──test         %r8d,%r8d
          hashbrown::raw::RawTableInner::find_inner:
          ├──jne          6f6
Code Snippet 14: The hashmap takes a lot of time. There are four instructions taking over 5% here, for a total of around 35% of runtime.

Observe:

  • There is a loop for linear probing.
  • There are a lot of equality checks to test if a slot corresponds to the requested key.
  • Generally, this code is long, complex, and branchy.

It would be much better to use a perfect hash function that we build once. Then none of these equality checks are needed.

For this, I will use PtrHash, a (minimal) perfect hash function I developed based on PtHash (PtHash paper). I still have to write a paper on PtrHash, but I do have a long roundabout blogpost.

  1. Find all city names in the first 100k rows. Since each row has a random city, all names will occur here.

  2. Build a perfect hash function. For the given dataset, PtrHash outputs a metadata pilot array of \(63\) bytes.

  3. On each lookup, the u64 hash is mapped to one of the \(63\) buckets. Then the hash is xored by C * pilots[b] where \(C\) is a random mixing constant. This is then reduced to an integer less than \(512\), which is the index in the array of Records we are looking for.

    The pilots are constructed such that each hash results in a different index.

The full code is here. The diff in the hot loop is this.

     let callback = |name, value| {
         let key = to_key(name);
-        let entry = h.entry(key).or_insert((Record::default(), name)).0;
+        let index = phf.index(&key);
+        let entry = unsafe { records.get_unchecked_mut(index) };
         entry.add(parse(value));
     };
     iter_lines(data, callback);
Code Snippet 15: Using a perfect hash function for lookups. Before, h was a HashMap<u64, (Record, &str)>. After, records is simply a [Record; 512], and phf.index(key) is the perfect hash function.
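For intuition, the lookup described above corresponds roughly to this sketch (simplified; not the actual PtrHash API, and the mixing constant and reduction are assumptions), where hash is the already-hashed u64 key:

// Map a 64-bit hash to 0..n via the high half of a widening multiplication.
fn reduce(hash: u64, n: usize) -> usize {
    ((hash as u128 * n as u128) >> 64) as usize
}

fn phf_index(hash: u64, pilots: &[u8]) -> usize {
    const C: u64 = 0x517cc1b727220a95; // some mixing constant
    let b = reduce(hash, pilots.len()); // bucket, here < 63
    let hp = hash ^ C.wrapping_mul(pilots[b] as u64); // mix in the pilot for this bucket
    reduce(hp, 512) // slot in the Record array, < 512
}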

In assembly code, it looks like this:

0.24         movabs       $0x517cc1b727220a95,%rsi // Load the multiplication constant C
2.22         imul         %rsi,%rdx                // Hash the key by multiplying by C
0.53         mov          0xf8(%rsp),%rax          // Some instructions to compute bucket b < 63
3.16         mulx         %rax,%rax,%rax
0.55         mov          0x10(%rsp),%r8
5.67         movzbl       (%r8,%rax,1),%eax        // Read the pilot for this bucket. This is slow.
0.03         mov          0x110(%rsp),%r8
0.57         mulx         %r8,%r12,%r12
7.09         imul         %rsi,%rax                // Some instructions to get the slot < 512.
0.81         xor          %rdx,%rax
0.05         mov          %rax,%rdx
3.87         mulx         %rsi,%rdx,%rdx
Code Snippet 16: Assembly code for the perfect hash function lookup. Just note how short it is compared to the hash table. It's still 20% of the total time though.

The new running time is now 17s!

Larger masks: 15s

Currently we store u32 masks on which we do .trailing_zeros() to find character offsets. We can also compare two 32-byte SIMD chunks in parallel and combine the results into a single u64 mask. This gives a small speedup, I think mostly because there are now slightly fewer branch-misses (593M now vs 675M before): commit.
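A sketch of the idea (shapes and names assumed; this uses the same nightly portable SIMD feature as the rest of the code):

use std::simd::prelude::*;

// Compare two adjacent 32-byte chunks against `needle` and pack both 32-bit
// masks into one u64, so a single trailing_zeros finds the next match.
fn eq_mask64(chunk0: Simd<u8, 32>, chunk1: Simd<u8, 32>, needle: u8) -> u64 {
    let n: Simd<u8, 32> = Simd::splat(needle);
    let lo = n.simd_eq(chunk0).to_bitmask(); // match bits 0..32
    let hi = n.simd_eq(chunk1).to_bitmask(); // match bits 0..32
    lo | (hi << 32)
}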

Reduce pattern matching: 14s

I modified the generator I’m using to always print exactly one decimal. This saves some branches.

     // s = abc.d
     let (a, b, c, d) = match s {
         [c, b'.', d] => (0, 0, c - b'0', d - b'0'),
         [b, c, b'.', d] => (0, b - b'0', c - b'0', d - b'0'),
         [a, b, c, b'.', d] => (a - b'0', b - b'0', c - b'0', d - b'0'),
-        [c] => (0, 0, 0, c - b'0'),
-        [b, c] => (0, b - b'0', c - b'0', 0),
-        [a, b, c] => (a - b'0', b - b'0', c - b'0', 0),
+        // [c] => (0, 0, 0, c - b'0'),
+        // [b, c] => (0, b - b'0', c - b'0', 0),
+        // [a, b, c] => (a - b'0', b - b'0', c - b'0', 0),
         _ => panic!("Unknown pattern {:?}", to_str(s)),
     };
Code Snippet 17: Assume fixed single-decimal formatting.

Memory map: 12s

Instead of first reading the file into memory and then processing that, we can memory map it and transparently read parts as needed. This saves the 2 seconds spent reading the file at the start.

     let filename = &args().nth(1).unwrap_or("measurements.txt".to_string());
-    let mut data = vec![];
+    let mut mmap: Mmap;
+    let mut data: &[u8];
     {
         let mut file = std::fs::File::open(filename).unwrap();
         let start = std::time::Instant::now();
-        let stat = std::fs::metadata(filename).unwrap();
-        data.reserve(stat.len() as usize + 1);
-        file.read_to_end(&mut data).unwrap();
+        mmap = unsafe { Mmap::map(&file).unwrap() };
+        data = &*mmap;
         eprintln!("{}", format!("{:>5.1?}", start.elapsed()).bold().green());
     }
Code Snippet 18: memory mapping using memmap2 crate.

Parallelization: 2.0s

Parallelizing the code is fairly straightforward. First we split the data into one chunk per thread. Then we spawn a thread for each chunk, each with its own vector to accumulate results. At the end, each thread merges its results into the global accumulator.

This gives pretty much exactly \(6\times\) speedup on my 6-core machine, since accumulating is only a small fraction of the total time.

fn run_parallel(data: &[u8], phf: &PtrHash, num_slots: usize) -> Vec<Record> {
    let mut slots = std::sync::Mutex::new(vec![Record::default(); num_slots]);

    // Spawn one thread per core.
    let num_threads = std::thread::available_parallelism().unwrap();
    std::thread::scope(|s| {
        let chunks = data.chunks(data.len() / num_threads + 1);
        for chunk in chunks {
            s.spawn(|| {
                // Each thread has its own accumulator.
                let thread_slots = run(chunk, phf, num_slots);

                // Merge results.
                let mut slots = slots.lock().unwrap();
                for (thread_slot, slot) in thread_slots.into_iter().zip(slots.iter_mut()) {
                    slot.merge(&thread_slot);
                }
            });
        }
    });

    slots.into_inner().unwrap()
}
Code Snippet 19: Code to process data in parallel.

Branchless parsing: 1.7s

The match statement on the number of digits in the temperature generated quite a lot of branches and perf stat cargo run -r was showing 440M branch-misses, i.e. almost one every other line. That’s about as bad as it can be with half the numbers having a single integer digit and half the numbers having two integer digits.

I was able to pinpoint it to the branching by running perf record -b -g cargo run -r followed by perf report.

Changing this to a branch-less version is quite a bit faster, and now only 140M branch-misses remain.

// s = abc.d
let a = unsafe { *s.get_unchecked(s.len() - 5) };
let b = unsafe { *s.get_unchecked(s.len() - 4) };
let c = unsafe { *s.get_unchecked(s.len() - 3) };
let d = unsafe { *s.get_unchecked(s.len() - 1) };
let v = a as V * 1000 * (s.len() >= 5) as V
      + b as V * 100  * (s.len() >= 4) as V
      + c as V * 10
      + d as V;
Code Snippet 20: Branchless float parsing.

Purging all branches: 1.67s

The remaining branch misses are in the while eq_sep == 0 loops that scan for the ; and \n characters (Manual SIMD: 29s). Since cities and temperatures have variable lengths, iterating over the input will always need some branching to decide whether to move on to the next block of input.

We can work around this by doing an independent scan for the next occurrence of ; and \n in each iteration. It turns out the longest line in the input contains 33 characters including newline. This means that a single 32-character SIMD comparison is exactly sufficient to determine the next occurrence of each character.

In code, it looks like this.

#[inline(always)]
fn iter_lines<'a>(mut data: &'a [u8], mut callback: impl FnMut(&'a [u8], &'a [u8])) {
    let sep = S::splat(b';');
    let end = S::splat(b'\n');

    // Find the next occurence of the given separator character.
    let mut find = |mut last: usize, sep: S| {
        let simd = S::from_array(unsafe { *data.get_unchecked(last..).as_ptr().cast() });
        let eq = sep.simd_eq(simd).to_bitmask();
        let offset = eq.trailing_zeros() as usize;
        last + offset
    };

    // Pointers to the last match of ; or \n.
    let mut sep_pos = 0;
    let mut start_pos = 0;

    while start_pos < data.len() - 32 {
        // Both start searching from the last semicolon, so that the unaligned SIMD read can be reused.
        sep_pos = find(sep_pos+1, sep) ;
        let end_pos = find(sep_pos+1, end) ;

        unsafe {
            let name = data.get_unchecked(start_pos + 1..sep_pos);
            let value = data.get_unchecked(sep_pos + 1..end_pos);
            callback(name, value);
        }

        start_pos = end_pos;
    }
}

It turns out this does not actually give a speedup, but we will use this as a starting point for further improvements. Note also that perf stat changes considerably:

BEFORE
    35,409,579,588      cycles:u                         #    3.383 GHz
    96,408,277,646      instructions:u                   #    2.72  insn per cycle
     4,463,603,931      branches:u                       #  426.463 M/sec
       148,274,976      branch-misses:u                  #    3.32% of all branches

AFTER
    35,217,349,810      cycles:u                         #    3.383 GHz
    87,571,263,997      instructions:u                   #    2.49  insn per cycle
     1,102,455,316      branches:u                       #  105.904 M/sec
         4,148,835      branch-misses:u                  #    0.38% of all branches
Code Snippet 22: Selection of perf stat before and after

Note:

  • The total number of CPU cycles is about the same.
  • The number of instructions has gone down 10%.
  • The number of branches went from 4.4G (4 per line) to 1.1G (1 per line).
  • The number of branch-misses went from 150M (once every 7 lines) to 4M (once every 250 lines).

To illustrate, at this point the main loop looks like this. Note that it is indeed branchless, and only 87 instructions long.

 0.48 210:┌─→vpcmpeqb     %ymm1,%ymm0,%ymm0
 1.16       vpmovmskb    %ymm0,%eax
 1.03       tzcnt        %eax,%eax
 0.11       mov          %rax,0x38(%rsp)
 0.40       lea          (%r14,%rax,1),%r11
 1.21       lea          (%r12,%r11,1),%r9
 5.25       vmovdqu      0x2(%rdi,%r9,1),%ymm0
 1.53       vpcmpeqb     %ymm2,%ymm0,%ymm3
 0.45       vpmovmskb    %ymm3,%esi
 2.20       tzcnt        %esi,%ebx
 0.91       movzbl       0x2(%rdi,%r9,1),%esi
 0.04       xor          %ebp,%ebp
 0.43       cmp          $0x2d,%sil
 1.56       sete         %bpl
 0.93       mov          %rbx,%r10
 0.06       mov          %r11,%rax
 0.41       sub          %rbp,%rax
 1.50       sub          %rbp,%r10
 0.99       add          %rbp,%rax
 0.08       add          %r12,%rax
 0.62       add          %rbx,%rax
 3.78       movzbl       -0x3(%rdi,%rax,1),%ebp
 0.93       movzbl       -0x2(%rdi,%rax,1),%r8d
 3.18       imul         $0x3e8,%ebp,%ebp
 0.22       cmp          $0x5,%r10
 0.86       mov          $0x0,%edx
 1.82       cmovb        %edx,%ebp
 0.84       imul         $0x64,%r8d,%r8d
 0.22       cmp          $0x4,%r10
 1.27       cmovb        %edx,%r8d
 2.10       add          %ebp,%r8d
 0.40       movzbl       -0x1(%rdi,%rax,1),%r10d
 0.16       lea          (%r10,%r10,4),%r10d
 1.42       lea          (%r8,%r10,2),%r8d
 0.98       movzbl       0x1(%rdi,%rax,1),%eax
 1.07       add          %eax,%r8d
 0.08       mov          %r8d,%ebp
 2.25       neg          %ebp
 0.51       cmp          $0x2d,%sil
 1.63       cmovne       %r8d,%ebp
 0.09       mov          %r11,%rax
 0.68       sub          %rcx,%rax
 0.56       shl          $0x3,%rax
 1.62       mov          $0x40,%esi
 0.06       add          %r12,%rcx
 0.69       sub          %rax,%rsi
 0.58       mov          $0x0,%eax
 1.65       cmovae       %rsi,%rax
 0.05       and          $0x38,%al
 1.16       shlx         %rax,0x1(%rdi,%rcx,1),%rsi
 1.63       shrx         %rax,-0x7(%rdi,%r9,1),%rcx
 0.75       add          %rsi,%rcx
 0.50       movabs       $0x517cc1b727220a95,%r8
 1.53       imul         %r8,%rcx
 0.06       mov          %rcx,%rdx
 0.70       mov          0x28(%rsp),%rax
 1.97       mulx         %rax,%rax,%rax
 0.07       mov          0x20(%rsp),%rdx
 0.77       movzbl       (%rdx,%rax,1),%esi
 0.52       mov          0x30(%rsp),%rdx
 1.60       mulx         %rcx,%rax,%rax
 0.70       imul         %r8,%rsi
 0.62       mov          0x18(%rsp),%rdx
 1.50       shlx         %rdx,%rax,%rax
 0.10       xor          %rcx,%rsi
 0.76       mov          %rsi,%rdx
 2.37       mulx         %r8,%rcx,%rcx
 0.14       and          0x10(%rsp),%rcx
 0.93       add          %rax,%rcx
 0.76       shl          $0x6,%rcx
13.57       incl         0x0(%r13,%rcx,1)
 4.95       add          %ebp,0xc(%r13,%rcx,1)
 0.81       mov          0x4(%r13,%rcx,1),%eax
 0.14       mov          0x8(%r13,%rcx,1),%edx
 0.63       cmp          %ebp,%eax
 1.21       cmovge       %ebp,%eax
 2.32       mov          %eax,0x4(%r13,%rcx,1)
 0.11       cmp          %ebp,%edx
 0.52       cmovg        %edx,%ebp
 1.56       mov          %ebp,0x8(%r13,%rcx,1)
 1.12       mov          0x38(%rsp),%rax
 0.09       add          %rax,%r14
 0.42       inc          %r14
 1.25       lea          (%rbx,%r11,1),%rcx
 1.03       add          $0x2,%rcx
          ├──cmp          %r15,%rcx
 0.09     └──jb           210
Code Snippet 23: Main loop of the program. The first column shows the percentage of time in each line.

Some more attempts

Possible improvements at this point are improving instruction-level parallelism to get more than 2.49 instructions per cycle, and using SIMD to process multiple lines at a time.

I quickly hacked something that splits the data: &[u8] for each thread into two to four chunks that are processed at the same time, hoping multiple independent code paths would improve parallelism, but that didn’t work out immediately. Probably I need to interleave all the instructions everywhere, and manually use SIMD where possible, which is slightly annoying and for a later time.

I also know that the PtrHash perfect hash function contains a few redundant instructions that are needed in the general case but not here. Removing those would be nice.

Faster perfect hashing: 1.55s

Turns out I added a function to PtrHash for lookups on small tables, but wasn’t actually using it. Saves some cycles again :)

Bug time: Back up to 1.71s

I accidentally dropped the - b'0' part when making the floating point parsing branch free. Adding them back in bumps the times quite a bit, given that it’s only 4 instructions extra.

-    let a = unsafe { *s.get_unchecked(s.len().wrapping_sub(5)) };
-    let b = unsafe { *s.get_unchecked(s.len().wrapping_sub(4)) };
-    let c = unsafe { *s.get_unchecked(s.len().wrapping_sub(3)) };
-    let d = unsafe { *s.get_unchecked(s.len().wrapping_sub(1)) };
+    let a = unsafe { *s.get_unchecked(s.len().wrapping_sub(5)) - b'0' };
+    let b = unsafe { *s.get_unchecked(s.len().wrapping_sub(4)) - b'0' };
+    let c = unsafe { *s.get_unchecked(s.len().wrapping_sub(3)) - b'0' };
+    let d = unsafe { *s.get_unchecked(s.len().wrapping_sub(1)) - b'0' };
Code Snippet 24: Bugfix

Temperatures less than 100: 1.62s

Assuming that temperatures are less than 100 helps quite a bit.

-    // s = abc.d
+    // s = bc.d
-    let a = unsafe { *s.get_unchecked(s.len().wrapping_sub(5)) - b'0' };
     let b = unsafe { *s.get_unchecked(s.len().wrapping_sub(4)) - b'0' };
     let c = unsafe { *s.get_unchecked(s.len().wrapping_sub(3)) - b'0' };
     let d = unsafe { *s.get_unchecked(s.len().wrapping_sub(1)) - b'0' };
-    let v = a as V * 1000 * (s.len() >= 5) as V
+    let v = 0
         + b as V * 100 * (s.len() >= 4) as V
         + c as V * 10
         + d as V;

Computing min as a max: 1.50s

Instead of record.min = min(value, record.min) we can do record.min = max(-value, record.min) and negate the value at the end. This turns out to generate slightly faster code, because the two max calls can now be done using SIMD.
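A minimal sketch of the trick (field names are illustrative, not the exact ones in the code):

// Store the minimum negated, so that both updates become a max().
struct MinMax {
    neg_min: i32,
    max: i32,
}

impl MinMax {
    fn add(&mut self, value: i32) {
        self.neg_min = self.neg_min.max(-value); // the real min is -self.neg_min
        self.max = self.max.max(value);
    }
}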

Intermezzo: Hyperthreading: 1.34s

Turns out enabling hyperthreading speeds up the parallel run by around 10%!

Surprisingly, the single-threaded version becomes a bit slower, from 7s to 9s.

Here’s a perf stat on 12 threads with hyperthreading:

     15,888.56 msec task-clock:u                     #    9.665 CPUs utilized
54,310,677,591      cycles:u                         #    3.418 GHz
72,478,697,756      instructions:u                   #    1.33  insn per cycle
 1,095,632,420      branches:u                       #   68.957 M/sec
     1,658,837      branch-misses:u                  #    0.15% of all branches

   1.643962284 seconds time elapsed
  15.107088000 seconds user
   0.767016000 seconds sys

Instructions per cycle is very low, probably since the hyperthreads are competing for cycles.

And here’s a perf stat for 6 threads with hyperthreading disabled:

      9,059.31 msec task-clock:u                     #    5.220 CPUs utilized
30,369,716,994      cycles:u                         #    3.352 GHz
72,476,925,632      instructions:u                   #    2.39  insn per cycle
 1,095,268,277      branches:u                       #  120.900 M/sec
     1,641,589      branch-misses:u                  #    0.15% of all branches

   1.735524152 seconds time elapsed
   8.495397000 seconds user
   0.556457000 seconds sys

Notice how elapsed time is a bit higher, but instructions/cycle is higher, and task-clock and user time are lower. In fact, the number of instructions, branches, and branch-misses is pretty much the same. The hyperthreaded variant just has more contention for the available cycles.

(I’ll now disable hyperthreading again to make numbers easier to interpret.)

Not parsing negative numbers: 1.48s

Having to deal with positive and negative numbers at the same time is kinda annoying for further parsing optimizations. To fix this, we will create separate hash map entries for positive and negative numbers. In particular, for cities with a negative value I will act as if the ; separator was located at the position of the minus. That way, the value is always positive, and the city name gets a ; appended for negative cases.

This now saves some instructions in the parsing where we can assume the number is positive.

Overall it’s pretty much performance-neutral.

iter_lines(data, |mut name, mut value| {
    unsafe {
        if value.get_unchecked(0) == &b'-' {
            value = value.get_unchecked(1..);
            name = name.get_unchecked(..name.len() + 1);
        }
    }
    ...
}
Code Snippet 25: adjusting the name and value &[u8] slice for negative numbers.

More efficient parsing: 1.44s

It turns out subtracting b'0' from each character is quite slow: since each u8 subtraction could overflow, they all have to be done independently, as we can see in the generated assembly:

          let b = unsafe { *s.get_unchecked(s.len().wrapping_sub(4)) - b'0' };
0.38        movzbl       -0x4(%rax,%r11,1),%ebp
          let c = unsafe { *s.get_unchecked(s.len().wrapping_sub(3)) - b'0' };
2.08        movzbl       -0x3(%rax,%r11,1),%r15d
          let d = unsafe { *s.get_unchecked(s.len().wrapping_sub(1)) - b'0' };
0.37        movzbl       -0x1(%rax,%r11,1),%r11d
          let b = unsafe { *s.get_unchecked(s.len().wrapping_sub(4)) - b'0' };
0.04        add          $0xd0,%bpl
          let c = unsafe { *s.get_unchecked(s.len().wrapping_sub(3)) - b'0' };
1.55        add          $0xd0,%r15b
          b as V * 100 * (s.len() >= 4) as V + c as V * 10 + d as V
0.17        movzbl       %bpl,%edx
2.20        imul         $0x64,%edx,%edx
            movzbl       %r15b,%esi
1.59        lea          (%rsi,%rsi,4),%esi
0.12        lea          (%rdx,%rsi,2),%esi
          let d = unsafe { *s.get_unchecked(s.len().wrapping_sub(1)) - b'0' };
0.10        add          $0xd0,%r11b
          b as V * 100 * (s.len() >= 4) as V + c as V * 10 + d as V
1.72        movzbl       %r11b,%edx
0.14        add          %esi,%edx

To fix this, we can do all subtractions on i32. That way, the compiler merges them into a single subtraction of 111 * b'0'.

     // s = bc.d
-    let b = unsafe { *s.get_unchecked(s.len().wrapping_sub(4)) - b'0' };
-    let c = unsafe { *s.get_unchecked(s.len().wrapping_sub(3)) - b'0' };
-    let d = unsafe { *s.get_unchecked(s.len().wrapping_sub(1)) - b'0' };
+    let b = unsafe { *s.get_unchecked(s.len().wrapping_sub(4)) as V - b'0' as V };
+    let c = unsafe { *s.get_unchecked(s.len().wrapping_sub(3)) as V - b'0' as V };
+    let d = unsafe { *s.get_unchecked(s.len().wrapping_sub(1)) as V - b'0' as V };
     b as V * 100 * (s.len() >= 4) as V + c as V * 10 + d as V

New assembly:

          let b = unsafe { *s.get_unchecked(s.len().wrapping_sub(4)) as V - b'0' as V };
1.16        movzbl       -0x4(%rbp,%rsi,1),%edx
          let c = unsafe { *s.get_unchecked(s.len().wrapping_sub(3)) as V - b'0' as V };
0.08        movzbl       -0x3(%rbp,%rsi,1),%r11d
          let d = unsafe { *s.get_unchecked(s.len().wrapping_sub(1)) as V - b'0' as V };
1.16        movzbl       -0x1(%rbp,%rsi,1),%esi
          b as V * 100 * (s.len() >= 4) as V + c as V * 10 + d as V
1.18        imul         $0x64,%edx,%edx
1.08        lea          (%r11,%r11,4),%r11d
0.57        lea          (%rdx,%r11,2),%r11d
          one_billion_row_challenge::Record::add:
          self.min = self.min.max(-value);
0.98        mov          $0x14d0,%ebp         // the constant is 111*48 = 111*b'0'
1.05        sub          %r11d,%ebp
1.26        sub          %esi,%ebp
          one_billion_row_challenge::parse:
          b as V * 100 * (s.len() >= 4) as V + c as V * 10 + d as V
1.60        lea          (%rsi,%r11,1),%edx
0.23        add          $0xffffeb30,%edx     // the constant is -111*48

I’m still confused, because the 111 * 48 constant appears twice, which seems unnecessary, but the code is quite a bit shorter for sure.

I’m also not quite sure exactly how the * (s.len() >= 4) ends up in here. It seems that both values are computed and the right one is automatically picked. But then I would expect to see 11 * 48 as a constant also, but that doesn’t appear.

Fixing undefined behaviour: back to 1.56s

So, it turns out that doing the unsafe equivalent of s[s.len()-4] on a slice of length 3 is no good. On stable Rust it happens to work, but on nightly it does not. Apparently get_unchecked with out-of-bounds indices is actually undefined behaviour. Rewriting everything to use indices into the larger data: &[u8] slice instead of small slices ends up giving quite a bit of slowdown. Maybe enough to reconsider some earlier choices…

Lazily subtracting b'0': 1.52s

So actually we don't even have to do the - b'0' subtraction in the hot loop at all for the c and d digits! Since we count how many entries there are, we can just keep the ASCII offsets and compensate at the end by subtracting count * 11 * b'0' there.

     let b = unsafe { *data.get_unchecked(end - 4) as V - b'0' as V };
-    let c = unsafe { *data.get_unchecked(end - 3) as V - b'0' as V };
+    let c = unsafe { *data.get_unchecked(end - 3) as V };
-    let d = unsafe { *data.get_unchecked(end - 1) as V - b'0' as V };
+    let d = unsafe { *data.get_unchecked(end - 1) as V };
     b as V * 100 * (end - sep >= 5) as V + c as V * 10 + d as V
Code Snippet 26: Diff. full commit
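A sketch of the compensation this relies on (hypothetical helper; the subtraction happens once per record when finalizing):

// Each accumulated value still contains 10 * b'0' (from c) plus 1 * b'0' (from d)
// of ASCII offset, so subtract 11 * b'0' per counted entry at the end.
fn corrected_sum(raw_sum: i64, count: i64) -> i64 {
    raw_sum - count * 11 * (b'0' as i64)
}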

Min/max without parsing: 1.55s

Instead of doing min and max operations on parsed integers, we can do them on the raw string bytes directly. We only have to be careful to mask out the ; in ;2.3, since the ASCII value of ; is larger than the digits.

The commit is kinda ugly. Highlight is this function that reads the bytes in big-endian order, and then shifts away out-of-range bytes.

fn parse_to_raw(data: &[u8], start: usize, end: usize) -> u32 {
    let raw = u32::from_be_bytes(unsafe { *data.get_unchecked(start..).as_ptr().cast() });
    raw >> (8 * (4 - (end - start)))
}

This is only a bit slower, and I suspect this will allow speedups later if we can drop the other parse function completely.
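To see why comparing these raw values directly is valid, here is a small usage example of parse_to_raw (inputs chosen for illustration): the digits sit most-significant-first and a shorter value gets a leading zero byte, so the integer order of the raw u32 matches the numeric order of the temperature.

fn demo_raw_order() {
    let small = parse_to_raw(b"9.9\n", 0, 3); // 0x00392e39
    let large = parse_to_raw(b"10.0", 0, 4);  // 0x31302e30
    assert!(small < large);
}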

Parsing using a single multiplication: doesn’t work

The previous parse_to_raw reads the bytes in reverse order (big-endian, while my system is little-endian) and returns something like 0x3b_3c_??_3d (where ?? is the value of ., and 3b is the ASCII value of digit b). Most of this is constant and can be subtracted at the end in bulk. The interesting bits are 0x0b_0c_00_0d. From this we could almost get the target value by multiplying with 100 + (10 << 8) + (1 << 24) and then shifting right by 24:

              0b    0c    00    0d
----------------------------------
            100b  100c     0  100d | * 100
       10b   10c     0   10d       | * 10 << 8
b  c     0     d                   | * 1 << 24
b  c   10b     T  100c   10d  100d | sum
Code Snippet 27: Long multiplication using base 256.

The problem is that the 100c term could get larger than 256 and interfere with the target value T. Also, T itself could be up to 1000 and run into the 10b stored one position higher.

Parsing using a single multiplication does work after all! 1.48s

Alternatively, we could read the bytes in little-endian order into 0x0d_00_0c_0b and multiply by 1 + (10 << 16) + (100 << 24), again followed by a right shift of 24:

                 0d    00  0c  0b
---------------------------------
                  d     0   c   b | * 1
      10d    0  10c   10b         | * 10 << 16
100d    0 100c 100b               | * 100 << 24
100d  10d 100c    T   10b   c   b | sum
Code Snippet 28: Long multiplication using base 256, attempt 2.

In this case, we see that the 10b term in the lower significant position can never exceed 90 < 256, so there are no problems there. On the other side, we still have the issue that \(T = 100b + 10c + d\) can be up to \(999\) and more than \(256\). But it turns out we're lucky! (Very lucky?!) The 100c term does occupy the next byte, but 100 is divisible by 4, and hence the low 2 bits of that byte are always 0. And those 2 bits are exactly what we need! Taking them together with the 8 bits of the target byte, we have a 10-bit value, which can store anything up to 1023, larger than the maximum value of T!

OMG It’s hard to overstate my happiness here! Just think for a moment about what just happened. I wonder if there is some ‘deeper’ reason this works. We’re very much using that 2 divides 10, and that the position of the . is such that only the 100c term is close by.

The new parse function looks like this:

/// data[start..end] is the input slice.
fn parse(data: &[u8], start: usize, end: usize) -> V {
    // Start with a slice b"bc.d" or b"c.d".
    // Read it as a little-endian u32: 0x3d2e3c3b or 0x??3d2e3c (?? is whatever comes after d).
    let raw = u32::from_le_bytes(unsafe { *data.get_unchecked(start..).as_ptr().cast() });
    // Shift out the ?? and align d with the top byte, so we have 0x3d2e3c3b or 0x3d2e3c00.
    let raw = raw << (8 * (4 - (end - start)));
    // Extract only the relevant bits corresponding to digits.
    // A digit x is 0x3x in ASCII, so we simply mask out the high half of each byte.
    let raw = raw & 0x0f000f0f;
    // Long multiplication.
    const C: u64 = 1 + (10 << 16) + (100 << 24);
    // Shift right by 24 and take the low 10 bits.
    let v = ((raw as u64 * C) >> 24) & ((1 << 10) - 1);
    // let v = unsafe { core::arch::x86_64::_bextr_u64(v, 24, 10) };
    v as _
}

Note: I just learned of the bextr instruction that could replace this last shift and mask sequence, but it doesn’t seem to be faster than the native implementation.

A side note: ASCII

One thing that really annoys me is that the bit representation of ; is very close to that of the digits 0..9 (ASCII table). This means that we can't nicely mask it away, and are somewhat forced to do the shift in the algorithm above to get 0 bits.

Skip parsing using PDEP: 1.42s

The new parsing function is pretty sweet, but there is another idea worth exploring.

Using PDEP and PEXT instructions (parallel bit deposit/extract), we can shuffle bits around in 3 clock cycles (same as a multiplication). So we could do:

                                0b0011bbbb0011cccc........0011dddd // Input string.
                                0b    1111    1111            1111 // Extract these bits (PEXT)
                                0b                    bbbbccccdddd // Result of extraction
0b                  1111                 1111                 1111 // Deposit bits here in u64 (PDEP)
0b                  bbbb                 cccc                 dddd // Result of deposit
                       ^42                  ^21                  ^0
Code Snippet 29: playing with PDEP and PEXT.

Now we have a u64 integer containing three 21-bit integers inside it. Since each digit has a value less than \(10\), we can accumulate \(2^{21}/10 \approx 200\)k temperatures in this before overflow happens. So we could process, say, 10M rows (where each city should only appear around 25k times) and then accumulate things into a main buffer.
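A sketch of how the packed accumulator could be unpacked when flushing into the main buffer (assumed helper, not from the original code):

// The three per-digit sums sit at bit offsets 0, 21 and 42 of the u64
// accumulator; each stays below 2^21 as long as we flush often enough.
fn unpack_sum(acc: u64) -> u64 {
    let d = acc & ((1 << 21) - 1);
    let c = (acc >> 21) & ((1 << 21) - 1);
    let b = acc >> 42;
    100 * b + 10 * c + d // total temperature sum, in tenths of a degree
}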

Improved

We can even get rid of the PEXT instruction and directly PDEP the bits to where we want them, since all bits only move left. Basically we deposit some bloat bits (the 0x3 high nibbles of the ASCII digits, and the byte for the .), but then we simply drop them using an &.

                                0b0011bbbb0011cccc........0011dddd // Input string.
                                          xxxx    yyyyyyyyyyyy     // name the trash bits
0b                  bbbb             xxxxcccc     yyyyyyyyyyyydddd // Deposit here
0b                  1111                 1111                 1111 // Mask out trash using &
                       ^42                  ^21                  ^0
Code Snippet 30: More efficient bit deposit.
fn parse_pdep(data: &[u8], start: usize, end: usize) -> u64 {
    // Start with a slice b"bc.d" or b"c.d"
    // Read it as big-endian value 0xbbcc..dd or 0xcc..dd?? (?? is what comes after d)
    let raw = u32::from_be_bytes(unsafe { *data.get_unchecked(start..).as_ptr().cast() });
    // Shift out the ??, so we now have 0xbbcc..dd or 0x00cc..dd.
    let raw = raw >> (8 * (4 - (end - start)));

    //         0b                  bbbb             xxxxcccc     yyyyyyyyyyyydddd // Deposit here
    //         0b                  1111                 1111                 1111 // Mask out trash using &
    let pdep = 0b0000000000000000001111000000000000011111111000001111111111111111u64;
    let mask = 0b0000000000000000001111000000000000000001111000000000000000001111u64;

    let v = unsafe { core::arch::x86_64::_pdep_u64(raw as u64, pdep) };
    v & mask
}
Code Snippet 31: parsing using PDEP
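As an aside, here is a small self-contained illustration of what the _pdep_u64 intrinsic itself does (values chosen arbitrarily; needs a BMI2-capable x86-64 CPU):

fn main() {
    // PDEP takes source bits from the least-significant end and deposits them,
    // in order, into the positions where the mask has 1-bits.
    let src: u64 = 0b1011; // source bits (LSB first): 1, 1, 0, 1
    let mask: u64 = 0b1010101; // deposit positions: 0, 2, 4, 6
    let out = unsafe { core::arch::x86_64::_pdep_u64(src, mask) };
    assert_eq!(out, 0b1000101);
}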

Note that we have to do a bit of extra logic around the accumulation, but that’s fairly straightforward.

The length of the assembly code is now down to 62 instructions!

      │210:┌─→vpcmpeqb     %ymm0,%ymm2,%ymm2
 0.66       vpmovmskb    %ymm2,%edx
 0.00       tzcnt        %edx,%ebp
 2.83       mov          %r13,%r15
            add          %rbp,%r15
 0.81       xor          %r10d,%r10d
 2.79       cmpb         $0x2d,0x1(%r12,%r15,1)
 0.01       sete         %r10b
 0.70       lea          (%r15,%r10,1),%r11
 0.01       mov          %r11,%rdx
 2.80       sub          %rax,%rdx
 0.00       lea          0x0(,%rdx,8),%rbx
 0.78       mov          $0x40,%esi
 0.00       sub          %rbx,%rsi
 2.63       lea          (%r12,%rax,1),%rbx
            mov          $0x0,%r8d
 0.78       cmovb        %r8,%rsi
 0.00       and          $0x38,%sil
 2.96       shrx         %rsi,-0x8(%rdx,%rbx,1),%rdx
 0.82       shlx         %rsi,(%r12,%rax,1),%rax
 6.41       vmovdqu      0x1(%r12,%r15,1),%ymm2
 0.04       add          %rax,%rdx
 1.02       imul         %rcx,%rdx
            mov          0x20(%rsp),%rax
 3.37       mulx         %rax,%rax,%rax
 0.66       mov          (%rsp),%rsi
 2.55       movzbl       (%rsi,%rax,1),%eax
 4.62       imul         %rcx,%rax
 1.19       xor          %rdx,%rax
 0.62       vpcmpeqb     %ymm1,%ymm2,%ymm3
 0.01       vpmovmskb    %ymm3,%edx
 1.93       tzcnt        %edx,%esi
 1.22       lea          (%rsi,%r15,1),%rdx
 0.62       inc          %rdx
 0.00       add          %r15,%r10
 1.91       inc          %r10
 1.21       sub          %edx,%r10d
 0.63       movbe        0x1(%r12,%r11,1),%r11d
 1.84       mov          %rax,%rdx
 2.82       mulx         %rcx,%rdx,%rdx
 2.66       and          %r9,%rdx
 2.79       shl          $0x5,%rdx
 0.03       shl          $0x3,%r10d
23.24       incl         0x8(%r14,%rdx,1)
 0.00       shrx         %r10d,%r11d,%eax
 0.06       movabs       $0x3c001fe0ffff,%r8
 0.04       pdep         %r8,%rax,%r10
 3.37       movabs       $0x3c0001e0000f,%r8
 0.03       and          %r8,%r10
 5.64       add          %r10,(%r14,%rdx,1)
            add          %rbp,%r13
 0.07       inc          %r13
            mov          %eax,%r10d
 3.53       not          %r10d
            vmovd        %r10d,%xmm3
 0.09       vpinsrd      $0x1,%eax,%xmm3,%xmm3
 3.46       lea          0x2(%rsi,%r15,1),%rax
 0.00       vmovq        0xc(%r14,%rdx,1),%xmm4
 1.00       vpmaxud      %xmm3,%xmm4,%xmm3
            vmovq        %xmm3,0xc(%r14,%rdx,1)
          ├──cmp          %rdi,%rax
 2.74     └──jb           210

One slightly ugly part in the above is:

0.06       movabs       $0x3c001fe0ffff,%r8
0.04       pdep         %r8,%rax,%r10
3.37       movabs       $0x3c0001e0000f,%r8
0.03       and          %r8,%r10

Both the pdep shuffle and mask require a dedicated mov. The and-mask can also be done on the u32 value before the pdep, which merges the mov and and to and $0xf0f000f,%edx.

A further note

Not all systems support PDEP. Also, AMD Zen 1 and Zen 2 processors do support it, but implement it in microcode taking 18 cycles (Wikipedia). In our case we can emulate the PDEP relatively easily using some mask and shift operations:

let raw = raw as u64;
(raw & 15) | ((raw & (15 << 16)) << (21 - 16)) | ((raw & (15 << 24)) << (42 - 24))
Code Snippet 32: Manually implementing PDEP.

This is only around 3% slower (1.13s vs 1.10s).

Branchy min/max: 1.37s

Inspired by this nice post on Algorithmica.org, we replace r.min = min(r.min, value) by a branchy version:

-        self.min = self.min.max(!raw_value);
+        if raw_value < self.min {
+            self.min = raw_value;
+        }
-        self.max = self.max.max(raw_value);
+        if raw_value > self.max {
+            self.max = raw_value;
+        }

Indeed, branch-mispredicts are slow, but it turns out these branches are almost never taken! In fact, since the data is random, they will be true only about \(\ln(n)\) times for each city, where \(n\) is the number of times the city appears. Each city appears 1G/400 = 2.5M times, and so for each city we only expect around 15 updates, for 400*15=6000 branch misses in total (per thread), which is very small compared to the 1G writes and cmovs we save.

I tried adding an unlikely(..) hint to this, but that didn’t change the generated assembly code.

No counting: 1.34s

Ok, this is where you start saying I'm cheating, but it turns out we don't need to count how often each city appears in the input. Since each row has a random city, the number of times a city appears is \(Bin(R, 1/C)\), where \(R=10^9\) and \(C=413\). This has expectation \(R/C \approx 2.4\cdot 10^6\) and standard deviation \(\approx \sqrt{R/C} \approx 1500\). So the relative deviation around the expected value is only about \(1/1500\). That means that if in the end we simply divide by the expected number of occurrences instead of the actual number of occurrences, we're only off by roughly \(1/1500\) relatively. There are \(\sim 400\) cities, so the worst deviation will be somewhat larger, maybe \(4/1500 \approx 1/400\), which is usually still too small to push a rounded average to a different value.
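A sketch of the corresponding final averaging (names assumed; as before, the sum is in tenths of a degree):

// Divide by the expected number of occurrences instead of a tracked count.
fn approx_avg(sum_tenths: i64) -> f64 {
    let expected_count = 1_000_000_000.0 / 413.0;
    sum_tenths as f64 / 10.0 / expected_count
}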

Indeed, this ends up giving the correct output for the evaluation set.

Edit: There was a bug in my verification script. As it turns out, after this optimization the temperature for Baghdad is incorrectly reported as 22.7 instead of 22.8.

Arbitrarily long city names: 1.34s

Turns out supporting longer city names doesn’t even come at a cost since the branch is never taken!

 let find = |mut last: usize, sep: S| {
     let simd = S::from_array(unsafe { *data.get_unchecked(last..).as_ptr().cast() });
     let mut eq = sep.simd_eq(simd).to_bitmask() as u32;
+    if eq == 0 {
+        while eq == 0 {
+            last += 32;
+            let simd = S::from_array(unsafe { *data.get_unchecked(last..).as_ptr().cast() });
+            eq = sep.simd_eq(simd).to_bitmask() as u32;
+        }
+    }
     let offset = eq.trailing_zeros() as usize;
     last + offset
 };

Note that just doing while eq == 0 {..} slows down the code, but wrapping it in an if statement that is almost never true avoids the slowdown.

4 entries in parallel: 1.23s

Currently we only process one line at a time. This limits how much we can exploit ILP (instruction level parallelism). So instead we will process 4 lines at a time.

We don’t want to process consecutive lines since they depend on each other. Instead, we chunk the data into 4 parts and process them independently of each other in parallel.

To improve the generated code, I created a simple macro that interleaves the instructions. This gets the IPC (instructions per cycle) up to 2.66 (up from 2.20).

let mut state0 = init_state(0);
let mut state1 = init_state(data.len() / 4);
let mut state2 = init_state(2 * data.len() / 4);
let mut state3 = init_state(3 * data.len() / 4);

// Duplicate each line for each input state.
macro_rules! step {
    [$($s:expr),*] => {
        $($s.sep_pos = find_long($s.sep_pos, sep) + 1;)*
            $($s.end_pos = find($s.sep_pos, end) + 1;)*
            $(callback(data, $s.start_pos, $s.sep_pos - 1, $s.end_pos - 1);)*
            $($s.start_pos = $s.end_pos;)*
    }
}

while state3.start_pos < data.len() {
    step!(state0, state1, state2, state3);
}

Mmap per thread

Currently the main thread spends 180ms on unmapping the file at the end. Instead I tried making a separate Mmap instance per thread, so the unmapping can happen in parallel. Sadly this doesn't make a difference: now the 180ms is simply divided over all threads, and since munmap takes a program-wide lock, the drop(mmap) calls in each thread are just waiting for each other.

Some workarounds:

  • Make each thread do its own memory (un)mapping. This does not yet speed things up since they all unmap at the same time.
  • Spawn two or three times as many threads, so that other work can be done instead of waiting for the kernel lock. This ends up having around 100ms overhead, which is better than the 180ms we started with but still significant.

For now both these options are slightly too messy and I will just keep things as they are. I might revisit this later.

Reordering some operations: 1.19s

Turns out doing the increments for the start/end of each slice in a slightly different order gives quite a bit better performance. :shrug:

     macro_rules! step {
         [$($s:expr),*] => {
-            $($s.sep_pos = find_long($s.sep_pos, sep) + 1;)*
+            $($s.sep_pos = find_long($s.sep_pos + 1, sep);)*
-                $($s.end_pos = find($s.sep_pos, end) + 1;)*
+                $($s.end_pos = find($s.sep_pos + 1, end) ;)*
-                $(callback(data, $s.start_pos, $s.sep_pos - 1, $s.end_pos - 1);)*
+                $(callback(data, $s.start_pos, $s.sep_pos, $s.end_pos);)*
-                $($s.start_pos = $s.end_pos;)*
+                $($s.start_pos = $s.end_pos + 1;)*
         }
     }

Reordering more: 1.11s

The above still processes the callback sequentially for the 4 parallel states. In particular, each callback spends a relatively long time waiting for the right record to be loaded from L1 cache into registers.

Let’s go deeper and call the callback with 4 states. Ugly commit, around 0.01s slowdown for some reason.

Now we can interleave the record loading instructions:

 // If value is negative, extend name by one character.
 s0.sep += (data.get_unchecked(s0.sep + 1) == &b'-') as usize;
 let name0 = data.get_unchecked(s0.start..s0.sep);
 let key0 = to_key(name0);
 let index0 = phf.index_single_part(&key0);
-let entry0 = slots.get_unchecked_mut(index0);
 let raw0 = parse_to_raw(data, s0.sep + 1, s0.end);
-entry0.add(raw0, raw_to_pdep(raw0));

 s1.sep += (data.get_unchecked(s1.sep + 1) == &b'-') as usize;
 let name1 = data.get_unchecked(s1.start..s1.sep);
 let key1 = to_key(name1);
 let index1 = phf.index_single_part(&key1);
-let entry1 = slots.get_unchecked_mut(index1);
 let raw1 = parse_to_raw(data, s1.sep + 1, s1.end);
-entry1.add(raw1, raw_to_pdep(raw1));

 s2.sep += (data.get_unchecked(s2.sep + 1) == &b'-') as usize;
 let name2 = data.get_unchecked(s2.start..s2.sep);
 let key2 = to_key(name2);
 let index2 = phf.index_single_part(&key2);
-let entry2 = slots.get_unchecked_mut(index2);
 let raw2 = parse_to_raw(data, s2.sep + 1, s2.end);
-entry2.add(raw2, raw_to_pdep(raw2));

 s3.sep += (data.get_unchecked(s3.sep + 1) == &b'-') as usize;
 let name3 = data.get_unchecked(s3.start..s3.sep);
 let key3 = to_key(name3);
 let index3 = phf.index_single_part(&key3);
-let entry3 = slots.get_unchecked_mut(index3);
 let raw3 = parse_to_raw(data, s3.sep + 1, s3.end);
+
+let entry0 = slots.get_unchecked_mut(index0);
+entry0.add(raw0, raw_to_pdep(raw0));
+let entry1 = slots.get_unchecked_mut(index1);
+entry1.add(raw1, raw_to_pdep(raw1));
+let entry2 = slots.get_unchecked_mut(index2);
+entry2.add(raw2, raw_to_pdep(raw2));
+let entry3 = slots.get_unchecked_mut(index3);
 entry3.add(raw3, raw_to_pdep(raw3));

This brings us down to 1.11s, quite a big win really! Comparing perf stat before and after, the number of instructions went from 62.7G to 63.0G (0.5% more; I’m not sure why), but instructions per cycle went up from 2.55 to 2.80, almost 10% better!

Even more ILP: 1.05s

Of course, we can do better and interleave all the instructions fully. I played around a bit and the following gives the best results:

 s0.sep += (data.get_unchecked(s0.sep + 1) == &b'-') as usize;
 let name0 = data.get_unchecked(s0.start..s0.sep);
-let key0 = to_key(name0);
-let index0 = phf.index_single_part(&key0);
-let raw0 = parse_to_raw(data, s0.sep + 1, s0.end);

 s1.sep += (data.get_unchecked(s1.sep + 1) == &b'-') as usize;
 let name1 = data.get_unchecked(s1.start..s1.sep);
-let key1 = to_key(name1);
-let index1 = phf.index_single_part(&key1);
-let raw1 = parse_to_raw(data, s1.sep + 1, s1.end);

 s2.sep += (data.get_unchecked(s2.sep + 1) == &b'-') as usize;
 let name2 = data.get_unchecked(s2.start..s2.sep);
-let key2 = to_key(name2);
-let index2 = phf.index_single_part(&key2);
-let raw2 = parse_to_raw(data, s2.sep + 1, s2.end);

 s3.sep += (data.get_unchecked(s3.sep + 1) == &b'-') as usize;
 let name3 = data.get_unchecked(s3.start..s3.sep);
+
+let key0 = to_key(name0);
+let key1 = to_key(name1);
+let key2 = to_key(name2);
 let key3 = to_key(name3);
+
+let index0 = phf.index_single_part(&key0);
+let index1 = phf.index_single_part(&key1);
+let index2 = phf.index_single_part(&key2);
 let index3 = phf.index_single_part(&key3);
+
+let raw0 = parse_to_raw(data, s0.sep + 1, s0.end);
+let raw1 = parse_to_raw(data, s1.sep + 1, s1.end);
+let raw2 = parse_to_raw(data, s2.sep + 1, s2.end);
 let raw3 = parse_to_raw(data, s3.sep + 1, s3.end);

 let entry0 = slots.get_unchecked_mut(index0);
 entry0.add(raw0, raw_to_pdep(raw0));
 let entry1 = slots.get_unchecked_mut(index1);
 entry1.add(raw1, raw_to_pdep(raw1));
 let entry2 = slots.get_unchecked_mut(index2);
 entry2.add(raw2, raw_to_pdep(raw2));
 let entry3 = slots.get_unchecked_mut(index3);
 entry3.add(raw3, raw_to_pdep(raw3));

For some reason keeping the sep += ... and name = ... instructions as-is is around 0.01s better. Still the overall win is very nice! The number of instructions remains the same, but instructions per cycle is up to 2.95! For 1.05s in total.

At this point it turns out that hyperthreading actually slows down the code!

Table 2: results at 4.6GHz.
threads   HT off   HT on
      1     4.93    5.05
      6     0.90    0.97
     12     0.90    0.92

Compliance 1, OK I'll count: 1.06s

Ok I’m adding back the record counting. Being non-compliant this way is annoying, and it’s really quite brittle to rely on the randomness here.

Surprisingly, performance is much (~0.05s) better when I make count: u64 instead of count: u32. I didn’t bother to figure out why.

TODO

  • Use AES instructions for hashing
  • Verify hash on lookup
  • Rebuild PHF on hash collision / new key.
  • Simplify perfect hash function
  • SIMD
  • Munmap per chunk/thread

Postscript

As things go, my attention shifted to other things before wrapping everything up, and by now I have forgotten the details of where I left things :")

Anyway, here's a blogpost comparing non-Java implementations, and here is a long discussion of possible optimizations.