Benchmarking is harder than you think, even when taking into account this rule.
This post lists some lessons I learned while attempting to run benchmarks for the A* pairwise aligner. I was doing this on a laptop, which likely has different characteristics from CPUs in a typical server rack. All the programs I run are single-threaded.
Hardware
- Do not run while charging the laptop
- Charging makes the battery hot and causes throttling. Run either on battery power or with a completely full battery to prevent this.
- Disable hyperthreading
- Completely disable hyperthreading in the BIOS. Multiple programs running on the same core may fight for resources.
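After changing the BIOS setting, it can be worth confirming from the running system that hyperthreading is really off. The following is a minimal sketch, assuming a reasonably recent Linux kernel that exposes the SMT state in /sys/devices/system/cpu/smt/active. The helper name smt_status and its optional file argument are made up for this example.

```shell
#!/bin/sh
# Report whether SMT (hyperthreading) is currently active.
# Assumption: the kernel exposes /sys/devices/system/cpu/smt/active
# (1 = sibling threads in use, 0 = not in use).
# smt_status is a hypothetical helper; the optional argument points it
# at a different file, which is mostly useful for testing.
smt_status() {
    smt_file="${1:-/sys/devices/system/cpu/smt/active}"
    if [ ! -r "$smt_file" ]; then
        # Older kernel or non-Linux system: we cannot tell.
        echo "unknown"
        return 0
    fi
    case "$(cat "$smt_file")" in
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        *) echo "unknown" ;;
    esac
}

echo "hyperthreading: $(smt_status)"
```

If this reports "enabled" after the BIOS change, the setting probably did not stick and is worth rechecking before benchmarking.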
CPU settings
- Pin CPU frequency
- CPUs, especially laptops, have turboboost, (thermal) throttling, and powersave
features. Make sure to pin the CPU core frequency low enough that it can be
sustained for long times without throttling.
In my case, the performance governor can fix the CPU frequency. The base
frequency of my CPU is 2.6GHz, so that’s where I pinned it.
sudo cpupower frequency-set -g performance
sudo cpupower frequency-set -u 2.6GHz
sudo cpupower frequency-set -d 2.6GHz
Note that even with a pinned CPU frequency, thermal throttling can reduce it.
- Pin program to core
- Make sure your program only executes on one core. Do this using e.g.
taskset -c 0 <shell invocation>
When running multiple experiments in parallel, use distinct ids instead of 0.
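Since thermal throttling can still pull the frequency below the pinned value, it may help to check the actual frequency of the benchmark core right before and after a run. The sketch below assumes the Linux cpufreq sysfs interface, which reports the current frequency in kHz in scaling_cur_freq. The helper freq_within, its argument order, and the 50MHz tolerance are illustrative choices, not part of any tool.

```shell
#!/bin/sh
# Check that a core's current frequency is close to the pinned target.
# Assumptions: the cpufreq sysfs interface is available and reports kHz.
# freq_within is a hypothetical helper written for this sketch.
# usage: freq_within TARGET_KHZ TOLERANCE_KHZ [FILE]
# returns 0 if within tolerance, 1 if not, 2 if the file is unreadable.
freq_within() {
    target_khz="$1"
    tolerance_khz="$2"
    freq_file="${3:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq}"
    [ -r "$freq_file" ] || return 2
    current_khz="$(cat "$freq_file")"
    diff=$((current_khz - target_khz))
    # Take the absolute value of the difference.
    if [ "$diff" -lt 0 ]; then
        diff=$((0 - diff))
    fi
    [ "$diff" -le "$tolerance_khz" ]
}

# 2.6GHz = 2600000 kHz; allow 50MHz of jitter (an arbitrary choice).
if freq_within 2600000 50000; then
    echo "core 0 is at the pinned frequency"
else
    echo "core 0 is off target or cpufreq is unavailable; results may be noisy"
fi
```

When pinning the benchmark to a core other than 0, pass the matching cpuN path as the third argument.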
Software
- Use a low job niceness
- At any point in time, multiple jobs need CPU resources. Use a low job
niceness (like -20, needs root) to give your experiment a higher priority. As an
example, input (keyboard) and audio processing usually run with niceness -20.
This should reduce the number of (kernel) interrupts to your program.
nice -n -20 <command>
- Do not use Snakemake for benchmarking memory usage
- It turns out that Snakemake’s polling-based memory-usage measurement
can be very imprecise. Apart from the first 30s (or really 15s actually), it
polls every 30s. This means that for programs whose memory usage grows linearly
with time, the measured memory usage can be off by a factor of 2 when the
program runs for 59s: the last poll happens around 30s, when only about half of
the final memory is in use.
- Limit the number of parallel jobs
- Memory-bound programs share resources, even when running on disjoint CPUs. In my
case, using all 6 cores (running 6 benchmarks simultaneously) gives a 30%
slowdown compared to only using 1 core at a time (on some specific experiment).
Using 3 cores simultaneously gives only a 10% slowdown, which is acceptable in
my case.
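To make the factor of 2 from the Snakemake item above concrete, here is a small back-of-the-envelope calculation. It uses a deliberately simplified model that is my assumption, not Snakemake's actual code: memory is sampled only at multiples of the polling interval, and the program's memory grows linearly from zero until it exits. The helper name measured_percent is hypothetical.

```shell
#!/bin/sh
# Simplified model (an assumption, not Snakemake's implementation):
# memory is sampled only at multiples of INTERVAL seconds, and memory
# usage grows linearly from zero until the program exits at RUNTIME.
# Prints the measured peak as a percentage of the true peak, rounded down.
# usage: measured_percent RUNTIME_SECONDS INTERVAL_SECONDS
measured_percent() {
    runtime="$1"
    interval="$2"
    # Time of the last sample taken before the program exits.
    last_poll=$(( (runtime / interval) * interval ))
    echo $(( 100 * last_poll / runtime ))
}

for t in 59 89 300; do
    echo "runtime ${t}s: measured $(measured_percent "$t" 30)% of the true peak"
done
```

Under this model a 59s run is measured at 50% of its true peak, and the error shrinks as the runtime grows relative to the 30s interval.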