Benchmark attention points

Benchmarking is harder than you think, even when you take this rule into account.

This post lists some lessons I learned while attempting to run benchmarks for the A* pairwise aligner. I was doing this on a laptop, which likely has different characteristics from the CPUs in a typical server rack. All the programs I run are single-threaded.

Hardware

Do not run while charging the laptop
Charging heats up the battery, which can cause thermal throttling. To prevent this, run either on battery power or plugged in with a completely full battery.
Disable hyperthreading
Completely disable hyperthreading in the BIOS. With hyperthreading enabled, two programs can be scheduled on the same physical core, where they compete for its resources.
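
Both points can be checked from a script before starting a run. The following is a minimal sketch, assuming a Linux system that exposes the battery as BAT0 under /sys/class/power_supply and the SMT (hyperthreading) state under /sys/devices/system/cpu/smt/active. The paths and the function names are assumptions for illustration, so adjust them to your machine.

```shell
#!/bin/sh
# Pre-flight checks before benchmarking.
# ASSUMPTION: the sysfs paths below match your system; the battery in
# particular may be called BAT1 or similar.

# Succeeds if the battery status file reports "Charging".
is_charging() {
    status_file="${1:-/sys/class/power_supply/BAT0/status}"
    [ -r "$status_file" ] && [ "$(cat "$status_file")" = "Charging" ]
}

# Succeeds if the kernel reports SMT (hyperthreading) as active.
smt_active() {
    smt_file="${1:-/sys/devices/system/cpu/smt/active}"
    [ -r "$smt_file" ] && [ "$(cat "$smt_file")" = "1" ]
}

# Print a warning for every check that fails; print nothing if all is well.
preflight() {
    if is_charging; then echo "warning: battery is charging"; fi
    if smt_active; then echo "warning: hyperthreading is active"; fi
}

preflight
```

If either file is missing, the corresponding check is silently skipped, so an empty output is only meaningful when the paths exist.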

CPU settings

Pin CPU frequency
CPUs, especially those in laptops, have turbo boost, (thermal) throttling, and power-saving features. Make sure to pin the CPU core frequency low enough that it can be sustained for long periods without throttling.

In my case, the performance governor can fix the CPU frequency. The base frequency of my CPU is 2.6GHz, so that’s where I pinned it.

sudo cpupower frequency-set -g performance
sudo cpupower frequency-set -u 2.6GHz
sudo cpupower frequency-set -d 2.6GHz

Note that even with a pinned CPU frequency, thermal throttling can reduce it.
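
One way to notice such throttling is to compare the kernel's throttle event counter before and after a run. This is a minimal sketch, assuming an Intel CPU on Linux where the counter is exposed as thermal_throttle/core_throttle_count in sysfs. The function names and the THROTTLE_FILE override are hypothetical helpers introduced here, not part of any tool.

```shell
#!/bin/sh
# Print the thermal throttle event count for the given core.
# ASSUMPTION: the sysfs file below exists (Intel-specific); prints 0 if it
# does not. Set THROTTLE_FILE to read the counter from another location.
throttle_count() {
    file="${THROTTLE_FILE:-/sys/devices/system/cpu/cpu$1/thermal_throttle/core_throttle_count}"
    if [ -r "$file" ]; then cat "$file"; else echo 0; fi
}

# Run a command and warn on stderr if the core was throttled while it ran.
# Usage: run_checked <core> <command> [args...]
run_checked() {
    core="$1"; shift
    before=$(throttle_count "$core")
    status=0
    "$@" || status=$?
    after=$(throttle_count "$core")
    if [ "$after" -gt "$before" ]; then
        echo "warning: core $core throttled $((after - before)) time(s) during the run" >&2
    fi
    return "$status"
}

# Example with a hypothetical benchmark binary:
#   run_checked 0 ./my-benchmark input.txt
```

A run that triggers the warning is probably best discarded and repeated once the machine has cooled down.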

Pin program to core
Make sure your program only executes on one core, for example using:
taskset -c 0 <shell invocation>

When running multiple experiments in parallel, give each one a distinct core id instead of 0.
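
As a sketch of the parallel case, the loop below starts each command on its own core and waits for all of them. It assumes the machine has at least as many cores as commands and that cores are numbered from 0. The name run_pinned and the benchmark commands in the usage comment are hypothetical.

```shell
#!/bin/sh
# Run every argument as a shell command, each pinned to its own core:
# the first command on core 0, the second on core 1, and so on.
# ASSUMPTION: there are at least as many cores as commands.
run_pinned() {
    core=0
    for cmd in "$@"; do
        taskset -c "$core" sh -c "$cmd" &
        core=$((core + 1))
    done
    # Block until all background jobs have finished.
    wait
}

# Example with hypothetical commands:
#   run_pinned './my-benchmark a.txt' './my-benchmark b.txt'
```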

Software

Use a low job niceness
At any point in time, multiple jobs need CPU resources. Use a low niceness (such as -20, which requires root) to give your experiment a higher priority. As an example, input (keyboard) and audio processing usually run with niceness -20.

This should reduce how often the scheduler interrupts your program to run other jobs.

nice -n -20 <command>
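
Without root, GNU nice prints a warning and then runs the command anyway at the unchanged niceness, which is easy to miss in a long log. The sketch below refuses to start the command unless the requested niceness can actually be obtained. It relies on the fact that nice without a command prints the current niceness, and it assumes the calling shell itself runs at niceness 0. The function names are hypothetical.

```shell
#!/bin/sh
# Print the niceness a process started via `nice -n <adjustment>` ends up with.
# `nice` without a command prints the current niceness.
# ASSUMPTION: the calling shell runs at niceness 0, so the adjustment equals
# the resulting niceness.
effective_niceness() {
    nice -n "$1" nice 2>/dev/null
}

# Run a command at the requested niceness, or fail if it cannot be obtained
# (for example -20 without root).
# Usage: run_niced <niceness> <command> [args...]
run_niced() {
    want="$1"; shift
    got=$(effective_niceness "$want")
    if [ "$got" != "$want" ]; then
        echo "error: wanted niceness $want but would get $got" >&2
        return 1
    fi
    nice -n "$want" "$@"
}

# Example with a hypothetical benchmark binary (needs root for -20):
#   run_niced -20 ./my-benchmark input.txt
```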
Do not use Snakemake for benchmarking memory usage
It turns out that Snakemake’s polling-based memory-usage measurement can be very imprecise. Apart from the first 30s (or really 15s actually), it polls every 30s. This means that for programs whose memory usage grows linearly with time, the measured memory usage can be off by a factor of 2 when the program runs for 59s.
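
The factor of 2 follows from a short calculation. As a sketch, assume memory usage grows as m(t) = r t for some constant rate r (a symbol introduced here for illustration) and that the last poll before the program exits happens at t = 30s:

```latex
\[
m_{\text{measured}} = m(30) = 30r, \qquad
m_{\text{peak}} = m(59) = 59r, \qquad
\frac{m_{\text{peak}}}{m_{\text{measured}}} = \frac{59}{30} \approx 1.97 .
\]
```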
Limit the number of parallel jobs
Memory-bound programs share resources such as memory bandwidth, even when running on disjoint CPUs. In my case, using all 6 cores (running 6 benchmarks simultaneously) gives a 30% slowdown compared to only using 1 core at a time (on some specific experiment). Using 3 cores simultaneously gives only a 10% slowdown, which is acceptable in my case.
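
To pick a job count for your own machine, one option is to time the same experiment once alone and once alongside other jobs, then compare the two. The helper below is a small sketch of that comparison. The name slowdown_pct is hypothetical, and the timings in the usage line are made-up numbers, not measurements.

```shell
#!/bin/sh
# Print the percentage slowdown of a run that shared the machine with other
# jobs, relative to the same experiment run alone.
# Usage: slowdown_pct <seconds_alone> <seconds_shared>
slowdown_pct() {
    awk -v alone="$1" -v shared="$2" \
        'BEGIN { printf "%.0f\n", (shared / alone - 1) * 100 }'
}

# Example with made-up timings: 10s alone, 13s with other jobs running.
slowdown_pct 10 13   # prints 30
```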