Many of my Rust crates, and the binaries building on them, use SIMD instructions. Notably:
- packed-seq, a library for 2-bit encoding of DNA (gh, docs.rs),
- simd-minimizers, a library for fast computation of minimizers (gh, docs.rs, paper),
- deacon, a tool for fast decontamination of reads (gh, preprint),
- sassy, a SIMD-based approximate string searching library and binary (gh, docs.rs, preprint),
- barbell, a tool for demultiplexing based on sassy (gh, preprint).
all support fast algorithms based on both x86-64 (x64, henceforth) AVX2 and aarch64 NEON instructions. The question of this post is: how to effectively distribute binaries using these libraries?
As a solution, the ensure_simd crate (gh) does a compile-time check that
either NEON or AVX2 instructions are enabled.
It also contains instructions for effectively distributing such binaries,
which are explained in more detail here.
1 What’s inside
Initially, most of my code used Rust’s unstable #![feature(portable_simd)].
Unfortunately, users and packagers don’t like unstable Rust, so by now I have
converted everything to wide (gh, docs.rs). Wide implements types like u64x4
roughly as follows (omitting irrelevant details):
```rust
// Simplified from wide's source; field names and derives omitted.
pick! {
  if #[cfg(target_feature = "avx2")] {
    #[repr(C, align(32))]
    pub struct u64x4 { avx2: m256i }
  } else {
    #[repr(C, align(32))]
    pub struct u64x4 { a: u64x2, b: u64x2 }
  }
}
```
Thus, it detects if the avx2 feature is enabled during compilation. If so, it
directly uses a 256-bit type, and otherwise it falls back to a tuple of 128-bit
elements, which in turn is implemented as:
```rust
// Simplified from wide's source.
pick! {
  if #[cfg(target_feature = "sse2")] {
    #[repr(C, align(16))]
    pub struct u64x2 { sse: m128i }
  } else if #[cfg(target_feature = "simd128")] {
    pub struct u64x2 { simd: v128 }
  } else if #[cfg(all(target_feature = "neon", target_arch = "aarch64"))] {
    #[repr(C, align(16))]
    pub struct u64x2 { neon: uint64x2_t }
  } else {
    #[repr(C, align(16))]
    pub struct u64x2 { arr: [u64; 2] }
  }
}
```
This is more interesting:
whereas u64x4 only has a native version on x64, 128-bit SIMD is also
supported in WebAssembly (wasm) and on Arm chips with the NEON feature.
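To see which of these branches a given build takes, you can probe the compile-time target features with the cfg! macro. This is a minimal standalone sketch (not part of wide; the feature list is illustrative):

```rust
/// Report the SIMD-related target features this binary was compiled with.
/// `cfg!` is evaluated at compile time, just like the `#[cfg]` attributes
/// that wide uses to pick its internal representation.
fn compiled_simd_features() -> Vec<&'static str> {
    let mut features = Vec::new();
    if cfg!(target_feature = "avx2") {
        features.push("avx2");
    }
    if cfg!(all(target_arch = "x86_64", target_feature = "sse2")) {
        features.push("sse2");
    }
    if cfg!(all(target_arch = "aarch64", target_feature = "neon")) {
        features.push("neon");
    }
    features
}

fn main() {
    // On a default x64 build this includes "sse2" but not "avx2";
    // with -C target-cpu=native on a modern machine it also includes "avx2".
    println!("compiled with: {:?}", compiled_simd_features());
}
```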
2 Compile-time feature detection
What is important here is that the implementation is chosen at compile time.
Thus, the entire binary contains only one instance of the wide crate, and we must
make sure that the right version is used during compilation; otherwise the
binary will silently have degraded performance.
In particular, wide will work even when the avx2 feature is not set, but in
that case we will not get the expected performance, since it has to emulate
everything using 128-bit sse2 instructions instead. Thus, all my libraries
include a specific check that AVX2 or NEON SIMD instructions are indeed enabled
when doing a release build.
```rust
// A simplified version of the check; the exact conditions and message may differ.
#[cfg(all(
    not(debug_assertions),
    not(feature = "scalar"),
    not(any(
        target_feature = "avx2",
        all(target_arch = "aarch64", target_feature = "neon")
    ))
))]
compile_error!(
    "Neither AVX2 nor NEON is enabled. \
     Build with RUSTFLAGS=\"-C target-cpu=native\", \
     or enable the `scalar` feature to accept the slower fallback."
);
```

If this is a release build, neither AVX2 nor NEON is available, and the
scalar feature is not enabled on the crate, then this forces a compile error.
The check can be ignored by passing the -F scalar feature flag to cargo, in
which case the less efficient fallback types will be used.
2.1 Other solutions
There are other solutions to this. One is run-time feature detection.
Unfortunately, while this works great when you’re hardcoding the intrinsics
manually, it’s not a solution in our case, because wide instantiates
everything at compile time.
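For reference, this is what classic run-time dispatch looks like with hand-written feature-gated functions. The function names here are made up for illustration (not from any of the crates above); the point is that each dispatched function must be duplicated, which doesn’t map onto a library whose types are fixed at compile time:

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(data: &[u64]) -> u64 {
    // Inside this function the compiler is allowed to emit AVX2 instructions.
    data.iter().sum()
}

fn sum_scalar(data: &[u64]) -> u64 {
    data.iter().sum()
}

/// Dispatch to the best available implementation at run time.
fn sum(data: &[u64]) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safety: we just verified at run time that AVX2 is available.
            return unsafe { sum_avx2(data) };
        }
    }
    sum_scalar(data)
}

fn main() {
    println!("{}", sum(&[1, 2, 3, 4])); // 10
}
```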
The multiversion crate provides a macro that compiles a single function under
different architectures and then chooses the one to dispatch to at runtime. But
again, here we’re talking about an entire library that has to change rather than
a single function.
There is also cargo-multivers that can build a binary with different features
enabled and then combines those into a single binary that chooses the one to use
at runtime, but it seems conceptually quite heavy and has a significant startup
overhead due to decompressing the binaries.
3 Rust’s default target-cpu
This is all nice, but it causes some problems: on x64, Rust’s default target is x86-64-v1, which does not include all the features we need. Instead, we can force the compiler to use all features available on the host CPU, and to optimize the code for it, using:
```toml
# .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]
```
or by passing RUSTFLAGS directly into cargo:
```sh
RUSTFLAGS="-C target-cpu=native" cargo build --release
```
Note that this is only required on x64, where the default is missing AVX2, but even on aarch64 platforms (where NEON is always available), telling the compiler to optimize the code for a specific CPU can be beneficial: even when two CPUs implement the same instructions, they might have different costs for different instructions, and the compiler can use this information to generate better code.
3.1 cargo build vs cargo install
By default, cargo build will search for .cargo/config.toml configuration files in
the workspace root and up, and thus you can also provide e.g.
~/.cargo/config.toml, and always ask for target-cpu=native from there.
Similarly, cargo install --path . will build and install the current directory
using the provided config. Unfortunately, though, this config is not part of
the crate that is pushed to crates.io. Thus, cargo install sassy does not
automatically build with all available features enabled, and the user has to ask
for this manually (otherwise, on x64, the compile_error! will trigger):
```sh
RUSTFLAGS="-C target-cpu=native" cargo install sassy
```
4 Hardware support
In practice, AVX2 has been supported by Intel since Haswell (2013), and by all AMD Zen architectures (since 2017). In fact, x64 CPU features come in groups called microarchitecture levels. This Wikipedia page has a great overview, and shows that the aforementioned processors support all of x86-64-v3.
Similarly, NEON is a required feature for the aarch64 architecture, and thus supported on all Apple silicon chips since M1 (2020).
5 Distributing binaries
We want to ensure that distributed binaries are maximally portable.
This means that we explicitly should not use target-cpu=native.
For example, many CI servers support x86-64-v4, which includes AVX512. Such
binaries will be fast, but will not work on systems (such as my laptop)
that only support x86-64-v3 and thus do not have AVX512.
Thus, instead of compiling for the native target, we fix x86-64-v3 for x64
systems. Furthermore, for builds on an aarch64 M chip, we specifically target
the apple-m1 architecture, corresponding to the M1:
```toml
# .cargo/config-portable.toml -- a sketch; adjust target triples as needed.
[target.'cfg(target_arch = "x86_64")']
rustflags = ["-C", "target-cpu=x86-64-v3"]

[target.aarch64-apple-darwin]
rustflags = ["-C", "target-cpu=apple-m1"]
```
We would like to only include this config for building portable binaries, so
that builds directly from the repository still achieve maximal performance.
That is easily achieved by using mv .cargo/config-portable.toml .cargo/config.toml before cargo build.
5.1 GitHub Releases
GitHub releases are set up via cargo-dist (gh).
It uses a small dist-workspace.toml file to generate (via dist init) a
massive workflow that, in the end, simply builds the binaries and makes a new
release with a changelog whenever a new tag is pushed.
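For context, the dist-workspace.toml driving this is tiny. The snippet below is a sketch with assumed field values; dist init generates the real file for a given project:

```toml
[workspace]
members = ["cargo:."]

[dist]
# Pin the cargo-dist version used by the generated workflow.
cargo-dist-version = "0.28.0"
# Generate a GitHub Actions release workflow.
ci = "github"
# Produce a shell installer script in addition to plain archives.
installers = ["shell"]
# Portable targets to build for.
targets = ["x86_64-unknown-linux-gnu", "x86_64-apple-darwin", "aarch64-apple-darwin", "x86_64-pc-windows-msvc"]
```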
5.2 Bioconda
The Barbell Bioconda workflow is in this PR.
5.3 PyPI
The sassy PyPI release workflow is here.
6 Open questions
- Should we look into cargo-multivers to ship a single portable binary per architecture?
- Should the repository contain target-cpu=native by default to enable maximum performance, or should we always prefer portability?
- Should we add an additional run-time check that AVX2 or NEON is actually supported by the CPU running the binary? This sounds nice, but in practice, something like argument parsing will already use the maximum SIMD instructions available at compilation time, which will trigger an illegal instruction error very early on. The only way to avoid this is to have every binary call ensure_simd() at the very start of the program, but this feels a bit brittle.