Many of my Rust crates, and the binaries building on them, use SIMD instructions. Notably:

  • packed-seq, a library for 2-bit encoding DNA (gh, docs.rs),
  • simd-minimizers, a library for fast computation of minimizers (gh, docs.rs, paper),
  • deacon, a tool for fast decontamination of reads (gh, preprint),
  • sassy, a SIMD-based approximate string searching library and binary (gh, docs.rs, preprint),
  • barbell, a demultiplexing tool based on sassy (gh, preprint)

all support fast algorithms based on both x86-64 (henceforth x64) AVX2 and aarch64 NEON instructions. The question of this post is: how to effectively distribute binaries using these libraries?

As a solution, the ensure_simd crate (gh) performs a compile-time check that either NEON or AVX2 instructions are enabled. It also contains instructions for effectively distributing such binaries, which are explained in more detail here.

1 What’s inside

Initially, most of my code used Rust’s unstable #![feature(portable_simd)]. Unfortunately, users and packagers don’t like unstable Rust, so by now I have converted everything to wide (gh, docs.rs). Wide implements types like u64x4 as follows (omitting irrelevant details):

pick! {
  if #[cfg(target_feature="avx2")] { // x64 variant
    struct u64x4 { avx2: m256i }
  } else {
    struct u64x4 { a : u64x2, b : u64x2 }
  }
}

Thus, wide detects whether the avx2 feature is enabled during compilation. If so, it directly uses a native 256-bit type; otherwise, it falls back to a pair of 128-bit elements, which in turn are implemented as:

pick! {
  if #[cfg(target_feature="sse2")] { // x64 variant
    struct u64x2 { sse: m128i }
  } else if #[cfg(target_feature="simd128")] { // wasm variant
    struct u64x2 { simd: v128 }
  } else if #[cfg(all(target_feature="neon",target_arch="aarch64"))]{ // arm/aarch64 variant
    struct u64x2 { neon : uint64x2_t }
  } else { // fallback
    struct u64x2 { arr: [u64; 2] }
  }
}

This is more interesting: whereas u64x4 only has a native version on x64, 128-bit SIMD is also supported in WebAssembly (wasm) and on ARM chips with the NEON feature.
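Code built on top of these types looks identical on every platform. As a minimal sketch (assuming wide’s From<[u64; 4]> and to_array conversions, which the crate provides for its vector types):

use wide::u64x4;

// Toy example: add two vectors of four u64 lanes at once.
// Whether this compiles to AVX2, NEON, wasm simd128, or scalar code
// is decided entirely by the target features at compile time.
fn add4(a: [u64; 4], b: [u64; 4]) -> [u64; 4] {
    (u64x4::from(a) + u64x4::from(b)).to_array()
}

fn main() {
    assert_eq!(add4([1, 2, 3, 4], [10, 20, 30, 40]), [11, 22, 33, 44]);
}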

2 Compile-time feature detection

What is important here is that the implementation is chosen at compile time. The entire binary contains only one instance of the wide crate, so we must make sure that the right version is selected during compilation; otherwise the binary will silently have degraded performance.

In particular, wide will still work when the avx2 feature is not set, but in that case we will not get the expected performance, since it then emulates all 256-bit operations using pairs of 128-bit sse2 instructions instead. Thus, all my libraries include a check that AVX2 or NEON SIMD instructions are indeed enabled when doing a release build.

#[cfg(not(any(
    debug_assertions,
    doc,
    target_feature = "avx2",
    target_feature = "neon",
    feature = "scalar"
)))]
compile_error!("
Sassy uses AVX2 or NEON SIMD instructions for performance.
To get the expected performance, compile/install using eg:
RUSTFLAGS=\"-C target-cpu=native\" cargo install sassy
Alternatively, silence this error by activating the `scalar` feature (eg `cargo install -F scalar ...`).
See the sassy README for details."
);
Code Snippet 1: Force a compile error when 1) it is not a debug build, 2) it is not a documentation build, 3) neither the AVX2 nor the NEON target feature is enabled, and 4) the scalar crate feature is not enabled either.

The check can be bypassed by passing the -F scalar feature flag to cargo, in which case the less efficient fallback types are used.
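On the crate side, scalar can simply be an empty marker feature whose only purpose is to silence the check. A hypothetical Cargo.toml excerpt (check each crate’s manifest for the real declaration):

# Cargo.toml (hypothetical excerpt)
[features]
# Empty marker feature: enabling it only disables the compile-time SIMD check.
scalar = []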

2.1 Other solutions

There are other solutions to this. One is run-time feature detection. Unfortunately, while this works great when you’re writing intrinsics by hand, it’s not a solution in our case, because wide instantiates everything at compile time.
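For completeness, run-time dispatch with the standard library looks roughly like this; the two specialized functions are hypothetical placeholders:

// Run-time dispatch on x64: pick an implementation once the actual CPU is known.
// This works for hand-written intrinsics, but cannot retroactively change how a
// dependency like wide was monomorphized at compile time.
#[cfg(target_arch = "x86_64")]
fn search(haystack: &[u8]) {
    if std::arch::is_x86_feature_detected!("avx2") {
        search_avx2(haystack); // hypothetical AVX2-specialized version
    } else {
        search_fallback(haystack); // hypothetical scalar/SSE2 version
    }
}

#[cfg(target_arch = "x86_64")]
fn search_avx2(haystack: &[u8]) { let _ = haystack; /* AVX2 intrinsics here */ }

#[cfg(target_arch = "x86_64")]
fn search_fallback(haystack: &[u8]) { let _ = haystack; /* portable code here */ }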

The multiversion crate provides a macro that compiles a single function for multiple target architectures and then chooses the version to dispatch to at runtime. But again, here an entire library has to change, rather than a single function.
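As a sketch of that approach (attribute syntax as I remember it from the multiversion 0.7 docs; treat the exact form as an assumption):

use multiversion::multiversion;

// multiversion compiles one copy of this function per listed target and
// inserts a run-time dispatcher that picks the best supported version.
#[multiversion(targets("x86_64+avx2", "aarch64+neon"))]
fn sum(xs: &[u64]) -> u64 {
    xs.iter().sum()
}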

There is also cargo-multivers, which can build a binary multiple times with different features enabled and then combine the results into a single binary that chooses which version to use at runtime. However, it seems conceptually quite heavy, and it has significant startup overhead due to decompressing the embedded binaries.

3 Rust’s default target-cpu

This is all nice, but it causes some problems: on x64, Rust has a default target of x86-64-v1, which does not include all the features we need. Instead, we can tell the compiler to use all features available on the host CPU, and to optimize the code for it, using:

# .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]

or by passing RUSTFLAGS directly to cargo:

RUSTFLAGS="-C target-cpu=native" cargo build -r

Note that this is only required on x64, where the default is missing AVX2. But even on aarch64 platforms (where NEON is always available), telling the compiler to optimize for a specific CPU can be beneficial: even when two CPUs implement the same instructions, those instructions may have different costs on each, and the compiler can use this information to generate better code.
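To see exactly which target features a given target-cpu enables, you can ask rustc directly, e.g. for the host CPU:

rustc --print cfg -C target-cpu=native | grep target_feature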

3.1 cargo build vs cargo install

By default, cargo build will search for .cargo/config.toml configuration files from the workspace root upwards, so you can also provide e.g. ~/.cargo/config.toml and always ask for target-cpu=native there.

Similarly, cargo install --path . will build and install the current directory using the provided config. Unfortunately, this config is not part of the crate that is pushed to crates.io. Thus, cargo install sassy does not automatically build with all available features enabled, and the user has to ask for this manually (otherwise, on x64, the compile_error! will trigger):

RUSTFLAGS="-C target-cpu=native" cargo install sassy

4 Hardware support

In practice, AVX2 has been supported by Intel since Haswell (2013), and by AMD on all Zen architectures (2017). In fact, x64 CPU features come in groups called microarchitecture levels. This Wikipedia page has a great overview, and shows that the aforementioned processors support all of x86-64-v3 (which adds AVX, AVX2, BMI1/2, and FMA, among others, on top of v2).

Similarly, NEON is a required feature of the aarch64 architecture, and is thus supported on all Apple silicon chips, starting with the M1 (2020).

5 Distributing binaries

We want to ensure that distributed binaries are maximally portable, which means we explicitly should not use target-cpu=native. For example, many CI servers support x86-64-v4, which includes AVX512. Such binaries will be fast, but will not work on systems (such as my laptop) that only support x86-64-v3 and thus do not have AVX512.

Thus, instead of compiling for the native target, we fix x86-64-v3 for x64 systems. Furthermore, for builds on an aarch64 M-series chip, we specifically target the apple-a14 CPU, which corresponds to the M1:

# .cargo/config-portable.toml

[target.'cfg(target_arch="x86_64")']
# x86-64-v2 does not have AVX2, but we need that.
# x86-64-v4 has AVX512 which we explicitly do not include for portability.
rustflags = ["-C", "target-cpu=x86-64-v3"]

[target.'cfg(all(target_arch="aarch64", target_os="macos"))']
# For aarch64 macos builds, specifically target M1 rather than generic aarch64.
rustflags = ["-C", "target-cpu=apple-a14"]

We would like to include this config only when building portable binaries, so that builds directly from the repository still achieve maximal performance. That is easily achieved by running mv .cargo/config-portable.toml .cargo/config.toml before cargo build.
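In a release workflow, this amounts to something like:

# In CI, right before building the portable release binaries:
mv .cargo/config-portable.toml .cargo/config.toml
cargo build -r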

5.1 GitHub Releases

GitHub releases are set up via cargo-dist (gh). It uses a small dist-workspace.toml file to generate (via dist init) a massive workflow that, in the end, simply builds the binaries and creates a new release with a changelog whenever a new tag is pushed.
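For reference, such a dist-workspace.toml can be quite small; a hypothetical minimal example (versions and targets will differ per project):

# dist-workspace.toml (hypothetical minimal example)
[workspace]
members = ["cargo:."]

[dist]
cargo-dist-version = "0.28.0"
ci = "github"
targets = ["x86_64-unknown-linux-gnu", "aarch64-apple-darwin", "x86_64-pc-windows-msvc"]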

5.2 Bioconda

The barbell Bioconda workflow is in this PR.

5.3 PyPI

The sassy PyPI release workflow is here.

6 Open questions

  1. Should we look into cargo-multivers to ship a single portable binary per architecture?
  2. Should the repository contain target-cpu=native by default to enable maximum performance, or should we always prefer portability?
  3. Should we add an additional run-time check that AVX2 or NEON is actually supported by the CPU running the binary? This sounds nice, but in practice, something like argument parsing will already use the maximum SIMD instructions available at compile time, which will trigger an illegal instruction error very early on. The only way to avoid this is to have every binary call ensure_simd() at the very start of the program (a sketch follows below), but this feels a bit brittle.
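Note that such a guard cannot simply use is_x86_feature_detected!: when the binary is already compiled with AVX2 enabled, that macro statically evaluates to true. A hypothetical sketch that instead queries CPUID directly on x64 (AVX2 support is CPUID leaf 7, sub-leaf 0, EBX bit 5):

// Hypothetical run-time guard; must run before any code that may execute AVX2.
#[cfg(target_arch = "x86_64")]
fn ensure_simd() {
    let ebx = unsafe { std::arch::x86_64::__cpuid_count(7, 0).ebx };
    if ebx & (1 << 5) == 0 {
        eprintln!("error: this binary was compiled for AVX2, which this CPU does not support");
        std::process::exit(1);
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn ensure_simd() {}

fn main() {
    ensure_simd();
    // ... rest of the program ...
}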
