- External links
- The problem
- Initial solution: 105s
- First flamegraph
- Bytes instead of strings: 72s
- Manual parsing: 61s
- Inline hash keys: 50s
- Faster hash function: 41s
- A new flame graph
- Perf it is
- Something simple: allocating the right size: 41s
- `memchr` for scanning: 47s
- `memchr` crate: 29s
- `get_unchecked`: 28s
- Manual SIMD: 29s
- Profiling
- Revisiting the key function: 23s
- PtrHash perfect hash function: 17s
- Larger masks: 15s
- Reduce pattern matching: 14s
- Memory map: 12s
- Parallelization: 2.0s
- Branchless parsing: 1.7s
- Purging all branches: 1.67s
- Some more attempts
- Faster perfect hashing: 1.55s
- Bug time: Back up to 1.71s
- Temperatures less than 100: 1.62s
- Computing `min` as a `max`: 1.50
- Intermezzo: Hyperthreading: 1.34s
- Not parsing negative numbers: 1.48s
- More efficient parsing: 1.44s
- Fixing undefined behaviour: back to 1.56s
- Lazily subtracting `b'0'`: 1.52s
- Min/max without parsing: 1.55s
- Parsing using a single multiplication: doesn’t work
- Parsing using a single multiplication does work after all! 1.48s
- A side note: ASCII
- Skip parsing using `PDEP`: 1.42s
- Branchy min/max: 1.37s
- No counting: 1.34s
- Arbitrarily long city names: 1.34
- 4 entries in parallel: 1.23s
- Mmap per thread
- Reordering some operations: 1.19s
- Reordering more: 1.11s
- Even more ILP: 1.05
- Compliance 1, OK I’ll count: 1.06
- TODO
- Postscript
A YouTube video on this post is here.
Since everybody is doing it, I’m also going to take a stab at the One Billion Row Challenge.
Rules I set for myself:
- Rust (of course)
- External libraries are allowed at first for convenience.
- I’m assuming that the set of cities is fixed.
- Or at least, that each row is sampled randomly from the set of available cities.
- Or really, I’m just scanning a prefix of 100k rows, assuming all cities appear there.
- Along the way, I use that the longest line in the input file has 33 characters, and that cities are uniquely determined by their first and last 8 characters.
This post is a log of ideas, timings, and general progress.
My processor is an i7-10750H CPU @ 2.6GHz.
- Clock speed is capped at 3.6GHz. (It can run at 4.6GHz, but 3.6GHz tends to be more stable for benchmarking.)
- 64GB RAM.
- It has 6 cores.
- Hyperthreading is disabled.
- The file should be cached in memory by the OS.
Results @4.6GHz, measured inside the program, excluding the `munmap` at exit:

| threads | HT off | HT on |
|---------|--------|-------|
| 1       | 4.93   | 5.05  |
| 6       | 0.90   | 0.97  |
| 12      | 0.90   | 0.92  |
External links Link to heading
- My code: https://github.com/RagnarGrootKoerkamp/1brc
- My twitter: https://twitter.com/curious_coding
- `lehuyduc`'s repo (C++): https://github.com/lehuyduc/1brc-simd
  - Stackoverflow question/discussion, continued on github
- `buybackoff`'s repo (C#): https://github.com/buybackoff/1brc
  - Corresponding blogpost comparing non-Java code: https://hotforknowledge.com/2024/01/13/7-1brc-in-dotnet-even-faster-than-java-cpp/#results
The problem Link to heading
Input: a 13GB file containing \(10^9\) lines of the form `Amsterdam;10.5`.
- First a city name. There are \(413\) cities, and each row has a name chosen at random from the set. Their lengths are from \(3\) to \(26\) bytes.
- Then a temperature, formatted as `-?\d?\d.\d`, i.e. a possibly negative number with one or two integral digits and exactly one decimal. Each temperature is drawn from a normal distribution for each city.
Output: a sorted list of cities of the form `<city>: <min>/<avg>/<max>`, each formatted with one decimal place.
Initial solution: 105s Link to heading
Here’s a first version, with zero optimizations.
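Roughly, such a zero-optimization version looks like this (a sketch of the approach, not the exact code from the commit):

```rust
use std::collections::HashMap;

struct Record {
    count: u64,
    min: f32,
    max: f32,
    sum: f32,
}

fn main() {
    // Read the entire file into memory and keep everything as UTF-8 strings.
    let data = std::fs::read_to_string("measurements.txt").unwrap();
    let mut map: HashMap<&str, Record> = HashMap::new();
    for line in data.lines() {
        let (city, value) = line.split_once(';').unwrap();
        let value: f32 = value.parse().unwrap();
        let r = map.entry(city).or_insert(Record {
            count: 0,
            min: f32::INFINITY,
            max: f32::NEG_INFINITY,
            sum: 0.0,
        });
        r.count += 1;
        r.min = r.min.min(value);
        r.max = r.max.max(value);
        r.sum += value;
    }
    // Sort by city name and print min/avg/max with one decimal.
    let mut cities: Vec<_> = map.into_iter().collect();
    cities.sort_unstable_by_key(|(city, _)| *city);
    for (city, r) in cities {
        println!("{city}: {:.1}/{:.1}/{:.1}", r.min, r.sum / r.count as f32, r.max);
    }
}
```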
First flamegraph Link to heading
Let’s see what’s slow here: `cargo flamegraph --open` (or just `flamegraph`).
Figure 1: A flamegraph. Open in a new tab to see the interactive version with zooming and ctrl-F support.
Takeaways:
- 35% of time is `next_match`, i.e. searching for `\n` and/or `;`.
- 14% of time is parsing the `f32`.
- 35% of time is accessing the hashmap.
- Not sure what exactly is the remainder. We’ll figure that out once it becomes relevant.
Bytes instead of strings: 72s Link to heading
Strings in Rust are checked to be valid UTF-8. Using byte slices (`&[u8]`) is usually faster. We have to do some slightly ugly conversions from byte slices back to strings for parsing floats and printing, but it’s worth it. This basically removes `next_match` from the flamegraph.
Commit here. (It’s neither pretty nor interesting.)
This already saves 21 seconds, from 105 to 84. Pretty great!
Manual parsing: 61s Link to heading
Instead of parsing the input as an `f32` float, we can parse manually to a fixed-precision `i32` signed integer.
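As a sketch of the idea (the real code differs in details), the temperature becomes an integer number of tenths of a degree:

```rust
/// Parse "-?d?d.d" into tenths of a degree, e.g. "-12.3" -> -123.
fn parse(mut s: &[u8]) -> i32 {
    let neg = s[0] == b'-';
    if neg {
        s = &s[1..];
    }
    let mut value = 0;
    for &c in s {
        if c != b'.' {
            value = value * 10 + (c - b'0') as i32;
        }
    }
    if neg { -value } else { value }
}
```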
Inline hash keys: 50s Link to heading
Currently the hashmap is from `&str` to `Record`, where all `&str` are slices of the input string. All this indirection is probably slow.
So we instead would like to store keys inline as `[u8; 8]` (basically a `u64`).
It turns out that the first 8 characters of each city name are almost enough for uniqueness. Only `Alexandra` and `Alexandria` coincide, so we’ll xor in the length of the string to make them unique.
One drawback is that the hashmap must now store the full name corresponding to
the key as well.
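A sketch of such a key function (the real one may differ in details, but the idea is: take the first 8 bytes, xor in the length):

```rust
/// Inline key: the first (up to) 8 bytes of the name as a u64, xored with the
/// length so that e.g. Alexandra and Alexandria get different keys.
fn to_key(name: &[u8]) -> u64 {
    let mut bytes = [0u8; 8];
    let l = name.len().min(8);
    bytes[..l].copy_from_slice(&name[..l]);
    u64::from_le_bytes(bytes) ^ name.len() as u64
}
```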
Faster hash function: 41s Link to heading
The default hash table in Rust uses a pretty slow hash function. Let’s instead use `fxhash::FxHashMap`. For `u64` keys, the hash function is simply a multiplication by a constant. This gives another 10 seconds of speedup.
A new flame graph Link to heading
Now that we’ve addressed the obvious hot parts, let’s make a new graph.
Figure 2: A useless flamegraph.
Yeah well great… I suppose everything is inlined or so. But actually the debuginfo should still be there. idk…
Perf it is Link to heading
`cargo flamegraph` uses `perf record` under the hood. So we can just `perf report` and see what’s there.
Looking at some snippets of the `perf report` output (numbers on the left are the percentage of samples on that line):
- The `[u8; 8]` to `u64` conversion, i.e. an unaligned read, is surprisingly slow?
- There are quite some instructions for indexing the hash table, adding up to around 20% in total.
- Parsing takes around 5%.
Something simple: allocating the right size: 41s Link to heading
We can `stat` the input file for its size and allocate exactly the right amount of space. This saves around half a second.
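Something along these lines (a sketch; the path is a placeholder):

```rust
use std::io::Read;

fn read_input(path: &str) -> Vec<u8> {
    // Stat the file first so the Vec is allocated with exactly the right capacity
    // and never has to grow while reading.
    let size = std::fs::metadata(path).unwrap().len() as usize;
    let mut data = Vec::with_capacity(size);
    std::fs::File::open(path).unwrap().read_to_end(&mut data).unwrap();
    data
}
```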
`memchr` for scanning: 47s Link to heading
`memchr(byte, text)` is a `libc` function that returns the first index of the byte in the text.
But well… it turns out this is a non-inlined function call after all, and things slow down.
`memchr` crate: 29s Link to heading
It also turns out the default `memchr` function doesn’t use SIMD. But there is the nice `memchr` crate, which is heavily optimized and does use SIMD.
This brings us down from the previous best of 42s to 29s!
`get_unchecked`: 28s Link to heading
By default all array accesses are bound checked. We don’t really need that. Removing them saves half a second.
The code is now a bit uglier sadly: commit.
Manual SIMD: 29s Link to heading
One ‘problem’ with `memchr` is that it is made for scanning long ranges, and is not super flexible. So let’s roll our own.
We make sure that `data` is aligned to SIMD boundaries and iterate over it \(32\) characters at a time. For all \(32\) characters at once, we check whether they equal the target character, and convert these results to a bitmask. The number of trailing zeros indicates the position of the match. If the bitmask is \(0\), there are no matches and we try the next \(32\) characters.
This turns out to be slightly slower. I’m not exactly sure why, but we can profile and iterate from here.
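The core of it is an AVX2 compare that turns 32 bytes into a bitmask; a sketch (assuming AVX2 is available; the real code additionally keeps `data` aligned and loops over the masks):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Bitmask with bit i set iff chunk[i] == needle. The position of the first match
/// is mask.trailing_zeros(); a zero mask means no match in these 32 bytes, and we
/// move on to the next 32.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn eq_mask(chunk: *const u8, needle: u8) -> u32 {
    let v = _mm256_loadu_si256(chunk as *const __m256i);
    let eq = _mm256_cmpeq_epi8(v, _mm256_set1_epi8(needle as i8));
    _mm256_movemask_epi8(eq) as u32
}
```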
Profiling Link to heading
Running `perf stat -d cargo run -r` and looking at the output, observe:
- The actual clock speed is only 3.3GHz, whereas it should be 3.6GHz. Not sure why; it might be waiting for IO.
- 1.65 instructions per cycle is quite low. It can be up to 4 and is often at least 2.5.
- 8.87% branch misses is also quite high. Usually this is at most 1% and typically lower. Each branch mispredict causes a stall of 5ns or so, which is over 1 second total, but I suspect the impact is larger.
- 18.19% last-level-cache load misses is also quite high, but I’m not sure if this is a problem, since the total number of LLC loads is relatively low.
Revisiting the key function: 23s Link to heading
Looking at `perf report` we see that the hottest instruction is a call to `memcpy` to read up to `name.len()` bytes from the `&[u8]` name into a `u64`.
We can avoid this `memcpy` call entirely by just doing a (possibly out-of-bounds) `u64` read of the name, and then shifting away the bits corresponding to the out-of-bounds part. We’ll also improve the hash to add the first and last (up to) 8 characters.
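A sketch of the masked-read idea (kept in bounds here for simplicity; the real code just reads past the end of the name and relies on the rest of the input buffer being there):

```rust
/// First (up to) 8 bytes of the name as a u64: one unaligned read, then mask away
/// the bytes that fall outside the name. The improved hash does the same for the
/// last (up to) 8 bytes and combines the two.
fn first8(data: &[u8], start: usize, len: usize) -> u64 {
    let word = u64::from_le_bytes(data[start..start + 8].try_into().unwrap());
    word & (u64::MAX >> (64 - 8 * len.min(8)))
}
```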
This brings the runtime down from 28s to 23s!
In `perf stat`, we can also see that the number of branches and branch-misses went down around 30%.
PtrHash perfect hash function: 17s Link to heading
Now, the hottest instructions are all part of the hashmap lookup.
Observe:
- There is a loop for linear probing.
- There are a lot of equality checks to test if a slot corresponds to the requested key.
- Generally, this code is long, complex, and branchy.
It would be much better to use a perfect hash function that we build once. Then none of these equality checks are needed.
For this, I will use PtrHash, a (minimal) perfect hash function I developed based on PtHash (PtHash paper). I still have to write a paper on PtrHash, but I do have a long roundabout blogpost.
- Find all city names in the first 100k rows. Since each row has a random city, all names will occur there.
- Build a perfect hash function. For the given dataset, PtrHash outputs a metadata pilot array of \(63\) bytes.
- On each lookup, the `u64` hash is mapped to one of the \(63\) buckets. Then the hash is xored by `C * pilots[b]`, where \(C\) is a random mixing constant. This is then reduced to an integer less than \(512\), which is the index into the array of `Record`s we are looking for (a rough sketch of this lookup follows below).
- The pilots are constructed such that each hash results in a different index.
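The lookup then has roughly this shape (a rough sketch with made-up reduction steps and constants, not PtrHash’s actual internals):

```rust
/// Map a key's u64 hash to an index < 512 into the records array.
/// `pilots` is the 63-byte metadata array and `c` a fixed mixing constant.
fn phf_index(hash: u64, pilots: &[u8; 63], c: u64) -> usize {
    // Pick one of the 63 buckets.
    let bucket = ((hash as u128 * 63) >> 64) as usize;
    // Mix that bucket's pilot into the hash.
    let h = hash ^ c.wrapping_mul(pilots[bucket] as u64);
    // Reduce to a slot index < 512.
    ((h as u128 * 512) >> 64) as usize
}
```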
The full code is here. The diff in the hot loop is this.
`h` was a `HashMap<u64, (Record, &str)>`. After, `records` is simply a `[Record; 512]`, and `phf.index(key)` is the perfect hash function.
The new running time is now 17s!
Larger masks: 15s Link to heading
Currently we store `u32` masks on which we do `.trailing_zeros()` to find character offsets. We can also check two 32-character SIMD lanes in parallel and combine them into a single `u64` mask. This gives a small speedup, I think mostly because there are now slightly fewer branch-misses (593M now vs 675M before): commit.
Reduce pattern matching: 14s Link to heading
I modified the generator I’m using to always print exactly one decimal. This saves some branches.
Memory map: 12s Link to heading
Instead of first reading the file into memory and then processing that, we can memory map it and transparently read parts as needed. This saves the 2 seconds spent reading the file at the start.
This uses the `memmap2` crate.
Parallelization: 2.0s Link to heading
Parallelizing code is fairly straightforward. First we split the data into one chunk per thread. Then we fire a thread for each chunk, each with its own vector to accumulate results. Then at the end each thread merges its results into the global accumulator.
This gives pretty much exactly \(6\times\) speedup on my 6-core machine, since accumulating is only a small fraction of the total time.
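A sketch of the structure with scoped threads (chunk boundaries would additionally be rounded to line breaks, and `process` stands in for the existing single-threaded loop):

```rust
use std::collections::HashMap;

struct Record {
    count: u64,
    min: i32,
    max: i32,
    sum: i64,
}

impl Record {
    fn merge(&mut self, other: &Record) {
        self.count += other.count;
        self.min = self.min.min(other.min);
        self.max = self.max.max(other.max);
        self.sum += other.sum;
    }
}

// Stand-in for the existing per-chunk loop: returns a thread-local accumulator.
fn process(chunk: &[u8]) -> HashMap<u64, Record> {
    let _ = chunk;
    HashMap::new()
}

fn run(data: &[u8], threads: usize) -> HashMap<u64, Record> {
    let chunk_size = data.len().div_ceil(threads).max(1);
    let mut global: HashMap<u64, Record> = HashMap::new();
    std::thread::scope(|s| {
        // One thread per chunk, each with its own accumulator.
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || process(chunk)))
            .collect();
        // At the end, merge the per-thread results into the global accumulator.
        for handle in handles {
            for (key, record) in handle.join().unwrap() {
                global
                    .entry(key)
                    .or_insert_with(|| Record { count: 0, min: i32::MAX, max: i32::MIN, sum: 0 })
                    .merge(&record);
            }
        }
    });
    global
}
```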
Branchless parsing: 1.7s Link to heading
The `match` statement on the number of digits in the temperature generated quite a lot of branches, and `perf stat cargo run -r` was showing 440M branch-misses, i.e. almost one every other line. That’s about as bad as it can be, with half the numbers having a single integer digit and half the numbers having two integer digits.
I was able to pinpoint it to the branching by running `perf record -b -g cargo run -r` followed by `perf report`.
Changing this to a branch-less version is quite a bit faster, and now only 140M branch-misses remain.
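A sketch of the branch-free shape (the real version works on raw positions in the input rather than on a small sub-slice):

```rust
/// Parse "-?d?d.d" into tenths of a degree without branching on the digit count.
fn parse_branchless(s: &[u8]) -> i32 {
    let neg = (s[0] == b'-') as usize;       // 1 for a leading '-', else 0
    let s = &s[neg..];                       // skip the sign
    let d = |i: usize| (s[i] - b'0') as i32; // digit value at position i
    let long = (s.len() == 4) as i32;        // 1 if there are two integral digits
    // The ones digit is always at len-3 and the decimal at len-1;
    // the tens digit only contributes when the number is long.
    let value = 100 * d(0) * long + 10 * d(s.len() - 3) + d(s.len() - 1);
    (1 - 2 * neg as i32) * value
}
```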
Purging all branches: 1.67s Link to heading
The remaining branch misses are in the `while eq_sep == 0` loop in the scanning for `;` and `\n` characters (Manual SIMD: 29s). Since cities and temperatures have variable lengths, iterating over the input will always have to do some branching to decide whether to move on to the next bit of input or not.
We can work around this by doing an independent scan for the next occurrence of `;` and `\n` in each iteration. It turns out the longest line in the input contains 33 characters including the newline. This means that a single 32-character SIMD comparison is exactly sufficient to determine the next occurrence of each character.
In code, it looks like this.
It turns out this does not actually give a speedup, but we will use it as a starting point for further improvements. Note also that `perf stat` changes considerably (comparing before and after):
- The total CPU cycles is the same.
- The number of instructions has gone down 10%.
- The number of branches went from 4.4G (4 per line) to 1.1G (1 per line).
- The number of branch-misses went from 150M (once every 7 lines) to 4M (once every 250 lines).
To illustrate, at this point the main loop looks like this. Note that it is indeed branchless, and only 87 instructions long.
Some more attempts Link to heading
Possible improvements at this point are increasing instruction-level parallelism to get more than 2.49 instructions per cycle, and using SIMD to process multiple lines at a time.
I quickly hacked something that splits the `data: &[u8]` for each thread into two to four chunks that are processed at the same time, hoping multiple independent code paths would improve parallelism, but that didn’t work out immediately. Probably I need to interleave all the instructions everywhere, and manually use SIMD where possible, which is slightly annoying and for a later time.
I also know that the PtrHash perfect hash function contains a few redundant instructions that are needed in the general case but not here. Removing those would be nice.
Faster perfect hashing: 1.55s Link to heading
Turns out I added a function to `PtrHash` for lookups on small tables, but wasn’t actually using it. Saves some cycles again :)
Bug time: Back up to 1.71s Link to heading
I accidentally dropped the `- b'0'` part when making the floating point parsing branch-free. Adding it back in bumps the times quite a bit, given that it’s only 4 extra instructions.
Temperatures less than 100: 1.62s Link to heading
Assuming that temperatures are less than 100 helps quite a bit.
Computing `min` as a `max`: 1.50 Link to heading
Instead of `record.min = min(value, record.min)` we can do `record.min = max(-value, record.min)` and negate the value at the end. This turns out to generate slightly faster code, because the two `max` calls can now be done using SIMD.
Intermezzo: Hyperthreading: 1.34s Link to heading
Turns out enabling hyperthreading speeds up the parallel run by around 10%!
Surprisingly, the single-threaded version becomes a bit slower, going from 7s to 9s.
Looking at a `perf stat` on 12 threads with hyperthreading, instructions per cycle is very low, probably since the hyperthreads are competing for cycles.
Comparing with a `perf stat` for 6 threads with hyperthreading disabled:
Notice how elapsed time is a bit higher, but instructions/cycle, task-clock, and user time are lower. In fact, the number of instructions, branches, and branch-misses is pretty much the same. The hyperthreaded variant just has more contention for the available cycles.
(I’ll now disable hyperthreading again to make numbers easier to interpret.)
Not parsing negative numbers: 1.48s Link to heading
Having to deal with positive and negative numbers at the same time is kinda
annoying for further parsing optimizations. To fix this, we will create separate
hash map entries for positive and negative numbers. In particular, for cities with a negative value I will act as if the `;` separator was located at the position of the minus. That way, the value is always positive, and the city name gets a `;` appended for negative cases.
This now saves some instructions in the parsing where we can assume the number is positive.
Overall it’s pretty much performance-neutral.
More efficient parsing: 1.44s Link to heading
It turns out subtracting `b'0'` from each character is quite slow: since each `u8` subtraction could overflow, they all have to be done independently, as we can see in the generated assembly.
To fix this, we can do all subtractions on `i32`. That way, the compiler merges them into a single subtraction of `111 * b'0'`.
In the new assembly, I’m still confused, because the `111 * 48` constant appears twice, which seems unnecessary, but the code is quite a bit shorter for sure.
I’m also not quite sure exactly how the * (s.len() >= 4)
ends up in here. It
seems that both values are computed and the right one is automatically picked.
But then I would expect to see 11 * 48
as a constant also, but that doesn’t appear.
Fixing undefined behaviour: back to 1.56s Link to heading
So, it turns out that doing the unsafe equivalent of `s[s.len()-4]` on a slice of length 3 is no good. On stable Rust it happens to work, but on nightly it does not. Apparently `get_unchecked` with out-of-bounds indices is actually undefined behaviour.
Rewriting everything to use indices into the larger data: &[u8]
slice instead of
small slices ends up giving quite a bit of slowdown. Maybe enough to reconsider
some earlier choices…
Lazily subtracting `b'0'`: 1.52s Link to heading
So actually we don’t even have to do the `- b'0'` subtraction in the hot loop at all for `c.d`!
Since we count how many entries there are, we can just keep them, and compensate
in the end by subtracting count * 11 * b'0'
there.
Min/max without parsing: 1.55s Link to heading
Instead of doing `min` and `max` operations on parsed integers, we can do them on the raw string bytes directly. We only have to be careful to mask out the `;` in `;2.3`, since the ASCII value of `;` is larger than that of the digits.
The commit is kinda ugly. Highlight is this function that reads the bytes in big-endian order, and then shifts away out-of-range bytes.
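Roughly (a sketch; the real function does an unaligned read straight from the input and assumes the number spans 3 or 4 bytes):

```rust
/// Read the 4 bytes ending at the last digit as a big-endian u32 and zero out the
/// leading bytes (e.g. the ';' in ";2.3") that are not part of the number.
/// The raw values then compare in the same order as the temperatures themselves,
/// so min/max can operate on them directly.
fn parse_to_raw(data: &[u8], start: usize, end: usize) -> u32 {
    let raw = u32::from_be_bytes(data[end - 4..end].try_into().unwrap());
    raw & (u32::MAX >> (8 * (4 - (end - start))))
}
```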
This is only a bit slower, and I suspect this will allow speedups later if we
can drop the other parse
function completely.
Parsing using a single multiplication: doesn’t work Link to heading
The previous `parse_to_raw` reads bytes in reverse order (big endian, while my system is little endian) and returns something like `0x3b_3c_??_3d` (where `??` is the value of `.`, and `3b` is the ASCII value of digit `b`).
Most of this is constant and can be subtracted at the end in bulk.
The interesting bits are `0x0b_0c_00_0d`. From this we could almost get the target value by multiplying with `100 + (10 << 8) + (1 << 24)` and then shifting right by `24`.
The problem is that the `100c` term could get larger than 256 and interfere with the target value `T`. Also, `T` itself could be up to 1000 and run into the `10b + c` stored one position higher.
Parsing using a single multiplication does work after all! 1.48s Link to heading
Alternatively, we could read the bytes in little-endian order into `0x0d_00_0c_0b` and multiply by `1 + (10 << 16) + (100 << 24)`, again followed by a right shift of `24`.
In this case, we see that the `10b` term in the less significant position can never exceed `90 < 256`, so there are no problems there.
On the other side, we still have the issue that \(T = 100b + 10c + d\) can be up
to \(999\) and more than \(256\). But it turns out we’re lucky! (Very lucky?!) The
100c
term does occupy the next byte, but 100 is divisible by 4, and hence the
low 2 bits of that byte are always 0. And those 2 bits are exactly what we need!
Taking them together with the 8 bits of the target byte, we have a 10 bit value,
which can store anything up to 1024, larger than the max value of T!
OMG It’s hard to overstate my happiness here! Just think for a moment about
what just happened. I wonder if there is some ‘deeper’ reason this works. We’re
very much using that 2 divides 10, and that the position of the .
is such that
only the 100c
term is close by.
The new `parse` function looks roughly like the sketch below.
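This is not the exact original code (that works on the raw unaligned read and masks in place), but the shape is the same:

```rust
/// Parse a non-negative "d.d" or "dd.d" into tenths using a single multiplication.
fn parse_mul(s: &[u8]) -> u32 {
    // Little-endian layout [tens, ones, '.', tenths], padding a missing tens digit.
    let buf = if s.len() == 4 {
        [s[0], s[1], s[2], s[3]]
    } else {
        [b'0', s[0], s[1], s[2]]
    };
    // Subtract the constant ASCII offsets in bulk: digits become 0..=9, '.' becomes 0.
    let v = u32::from_le_bytes(buf) - u32::from_le_bytes([b'0', b'0', b'.', b'0']);
    // One multiplication gathers 100*tens + 10*ones + tenths into bits 24..34; the two
    // bits borrowed from the next byte are free because 100*ones is divisible by 4.
    const M: u64 = 1 + (10 << 16) + (100 << 24);
    ((v as u64 * M) >> 24) as u32 & 0x3FF
}
```

For example, `parse_mul(b"12.3")` gives `123` and `parse_mul(b"2.3")` gives `23`.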
Note: I just learned of the `bextr` instruction that could replace this last shift-and-mask sequence, but it doesn’t seem to be faster than the native implementation.
A side note: ASCII Link to heading
One thing that really annoys me a lot is that the bit representation of `;` is very close to that of the digits `0..9` (ASCII table). This means that we can’t nicely mask it away, and are somewhat forced to do the shift in the algorithm above to get 0 bits.
Skip parsing using `PDEP`: 1.42s Link to heading
The new parsing function is pretty sweet, but there is another idea worth exploring.
Using the `PDEP` and `PEXT` instructions (parallel bit deposit/extract), we can shuffle bits around in 3 clock cycles (same as a multiplication).
So we could use `PDEP` to deposit the three digit values into separate lanes of a `u64`.
Now we have a `u64` integer containing three 21-bit integers inside it. Since each digit has a value below \(10\), we can accumulate `2^21/10 = 200k` temperatures in this before overflow happens. So we could process, say, `10M` rows (where each city should only appear `25k` times) and then accumulate things into a main buffer.
Improved Link to heading
We can even get rid of the `PEXT` instruction and directly `PDEP` the bits where we want them, since all bits only move left. Basically we deposit some bloat bits (the `3` of the ASCII digits, and the representation of the `.`), but then we simply drop them using an `&`.
Note that we have to do a bit of extra logic around the accumulation, but that’s fairly straightforward.
The length of the assembly code is now down to 62 instructions!
One slightly ugly part in the above: both the `pdep` shuffle and the mask require a dedicated `mov`. The and-mask can also be done on the `u32` value before the `pdep`, which merges the `mov` and `and` into `and $0xf0f000f,%edx`.
A further note Link to heading
Not all systems support `PDEP`. Also, AMD Zen 1 and Zen 2 processors do support it, but implement it in microcode taking 18 cycles (wikipedia). In our case we can emulate the `PDEP` relatively easily using some mask and shift operations.
This is only around 3% slower (`1.13s` vs `1.10s`).
Branchy min/max: 1.37s Link to heading
Inspired by this nice post on Algorithmica.org, we replace `r.min = min(r.min, value)` by a branchy version.
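A sketch of what that looks like:

```rust
struct Record {
    min: i32,
    max: i32,
}

/// Only write when the value actually improves the min or max. For random input
/// these branches are almost never taken, so they predict essentially perfectly.
fn update(record: &mut Record, value: i32) {
    if value < record.min {
        record.min = value;
    }
    if value > record.max {
        record.max = value;
    }
}
```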
Indeed, branch mispredicts are slow, but it turns out these branches are almost never taken! In fact, since the data is random, they will be true only around \(\ln(n)\) times for each city, where \(n\) is the number of times the city appears. Each city appears `10^9/400 = 2.5M` times, and so for each city we only expect around 15 updates, for `400*15 = 6000` branch misses in total (per thread), which is very small compared to the `1G` writes and `cmov`s we save.
I tried adding an unlikely(..)
hint to this, but that didn’t change the
generated assembly code.
No counting: 1.34s Link to heading
Ok this is where you start saying I’m cheating, but it turns out we don’t need to count how often each city appears in the input. Since each row has a random city, the number of times a city appears is \(Bin(R, 1/C)\), where \(R=10^9\) and \(C=413\). This has expectation \(R/C\sim 15\cdot 10^6\) and standard deviation \(\approx \sqrt{R/C} \sim 4000\). So the relative deviation around the expected value is only about \(1/4000\). That means that if in the end we simply divide by the expected number of occurrences instead of the actual number of occurrences, we’re only off by \(1/4000\) relatively. There are \(400\) cities, so the worst deviation will be slightly more, maybe \(4/4000=1/1000\) or so. And since there are \(400<1000\) cities, the probability that one of them has an average value within \(1/1000\) of a rounding boundary is less than a half.
Indeed, this ends up giving the correct output for the evaluation set.
Edit: There was a bug in my verification script. As it turns out, after this optimization the temperature for `Baghdad` is incorrectly reported as `22.7` instead of `22.8`.
Arbitrarily long city names: 1.34 Link to heading
Turns out supporting longer city names doesn’t even come at a cost since the branch is never taken!
Note that just doing `while eq == 0 {..}` slows down the code, but wrapping it in an always-false if statement makes it not slow down the code.
4 entries in parallel: 1.23s Link to heading
Currently we only process one line at a time. This limits how much we can exploit ILP (instruction level parallelism). So instead we will process 4 lines at a time.
We don’t want to process consecutive lines since they depend on each other. Instead, we chunk the data into 4 parts and process them independently of each other in parallel.
To improve the generated code, I created a simple macro that interleaves the instructions. This gets the IPC (instructions per cycle) up to 2.66 (up from 2.20).
Mmap per thread Link to heading
Currently the main thread spends `180ms` on unmapping the file at the end. Instead I tried making a separate `Mmap` instance per thread, so the unmapping can happen in parallel. Sadly this doesn’t make a difference: now the `180ms` is simply divided over all threads, and since `munmap` has a program-wide lock, the `drop(mmap)` calls in each thread are just waiting for each other.
Some workarounds:
- Make each thread do its own memory (un)mapping. This does not yet speed things up since they all unmap at the same time.
- Make two or three times the number of threads, so that other work can be done instead of waiting for the kernel lock. This ends up having around `100ms` overhead, which is better than the `180ms` we started with but still significant.
For now both these options are slightly too messy and I will just keep things as they are. I might revisit this later.
Reordering some operations: 1.19s Link to heading
Turns out doing the increments for the start/end of each slice in a slightly different order gives quite a bit better performance. :shrug:
Reordering more: 1.11s Link to heading
The above still processes the callback sequentially for the 4 parallel states. In particular, each callback spends relatively a lot of time waiting for the right record to be loaded from L1 cache into registers.
Let’s go deeper and call the callback with 4 states. Ugly commit, around 0.01s slowdown for some reason.
Now we can interleave the record loading instructions:
This brings us down to 1.11s, quite a big win really!
Comparing `perf stat` before and after, the number of instructions went from `62.7G` to `63.0G` (0.5% more; I’m not sure why), but instructions per cycle went up from 2.55 to 2.80, almost 10% better!
Even more ILP: 1.05 Link to heading
Of course, we can do better and interleave all the instructions fully. I played around a bit and the following gives the best results:
For some reason, keeping the `sep += ...` and `name = ...` instructions as-is is around 0.01s better. Still, the overall win is very nice! The number of instructions remains the same, but instructions per cycle is up to 2.95, for 1.05s in total.
At this point it turns out that hyperthreading actually slows down the code!
Timings at 4.6GHz:

| threads | HT off | HT on |
|---------|--------|-------|
| 1       | 4.93   | 5.05  |
| 6       | 0.90   | 0.97  |
| 12      | 0.90   | 0.92  |
Compliance 1, OK I’ll count: 1.06 Link to heading
Ok I’m adding back the record counting. Being non-compliant this way is annoying, and it’s really quite brittle to rely on the randomness here.
Surprisingly, performance is much (~0.05s) better when I make `count: u64` instead of `count: u32`. I didn’t bother to figure out why.
TODO Link to heading
- Use AES instructions for hashing
- Verify hash on lookup
- Rebuild PHF on hash collision / new key.
- Simplify perfect hash function
- SIMD
- Munmap per chunk/thread
Postscript Link to heading
As things go, my attention shifted to other things before wrapping everything up, and by now I have forgotten the details of where I left things :")
Anyway, here’s a blogpost comparing non-Java implementations, and here is a long discussion of possible optimizations.