This is a quick note on some thoughts & experiments on fasta parsing, alongside github:RagnarGrootKoerkamp/fasta-parsing-playground.

It’s a common complaint these days that Fasta parsing is slow. A popular parser is Needletail (github), which builds on seq_io (github). Another recent one is paraseq (github).

Paraseq helps the user by providing an interface for parallel processing of reads, see eg this example. Unfortunately, this still has a bottleneck: while the user’s processing of reads is multi-threaded, the parsing itself is still single-threaded.

1 Minimizing critical sections

In Langmead et al. (2018), some approaches are discussed for FASTQ parsing.

  1. Deferred parsing: search for the end of the current block of 4 lines inside a critical section, and then parse those lines in separate threads.
  2. Batch deferred: read \(4N\) lines at a time, and then parse those. This reduces contention on the locks; see the sketch after this list.
  3. Block deferred: read \(B\) bytes, assuming that this exactly matches a single record. This can be done by padding records with spaces to give them all the same size, but is more of a hypothetical best case.
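
For concreteness, here is a minimal Rust sketch of the batch-deferred approach (the structure and names are mine, not from the paper): only the line reading happens inside the critical section, and splitting the lines into records is deferred to the worker threads.

```rust
use std::io::BufRead;
use std::sync::Mutex;

/// Read the next `4 * records` lines while holding the lock.
/// Reading a multiple of 4 lines keeps FASTQ records whole,
/// so the actual parsing can be deferred to the worker thread.
fn read_batch<R: BufRead>(reader: &Mutex<R>, records: usize) -> Vec<String> {
    let mut reader = reader.lock().unwrap(); // critical section starts
    let mut lines = Vec::with_capacity(4 * records);
    for _ in 0..4 * records {
        let mut line = String::new();
        if reader.read_line(&mut line).unwrap() == 0 {
            break; // EOF
        }
        lines.push(line);
    }
    lines // critical section ends: the guard is dropped here
}

fn main() {
    let file = std::fs::File::open("reads.fastq").unwrap();
    let reader = Mutex::new(std::io::BufReader::new(file));
    std::thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| loop {
                let batch = read_batch(&reader, 1024);
                if batch.is_empty() {
                    break;
                }
                // Deferred parsing, outside the critical section:
                // each group of 4 consecutive lines is one record.
                for record in batch.chunks_exact(4) {
                    let (_header, _seq, _qual) = (&record[0], &record[1], &record[3]);
                }
            });
        }
    });
}
```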

2 RabbitFx: chunking

RabbitFx (Zhang et al. 2023) (github) is a recent parser written in C++ that does something new: it splits the entire input file into fixed-size chunks (of 8 MB by default) that are distributed over the threads. Reads that cross chunk boundaries store a link to the next chunk:

Figure 1: The chunking & linking of RabbitFx (Zhang et al. 2023).
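
To make the linking concrete, here is a small Rust sketch of the idea (RabbitFx itself is C++, and these types are mine, not theirs):

```rust
/// One fixed-size chunk of the input file (8 MB by default in RabbitFx).
struct Chunk {
    data: Vec<u8>,
    /// Index of the next chunk, so a record that starts near the end of
    /// this chunk can be completed from the following one.
    next: Option<usize>,
}

/// Reassemble a record that starts at offset `start` in chunk `i` and whose
/// last `tail` bytes spill over into the linked next chunk.
fn cross_chunk_record(chunks: &[Chunk], i: usize, start: usize, tail: usize) -> Vec<u8> {
    let mut record = chunks[i].data[start..].to_vec();
    if let Some(j) = chunks[i].next {
        record.extend_from_slice(&chunks[j].data[..tail]);
    }
    record
}
```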

The RabbitFx paper then reimplements some basic algorithms on top of this parser and shows significantly higher throughput, reaching up to 2.4 GB/s1 for counting bases on a server with two 20-core CPUs, compared to FQFeeder’s 740 MB/s.

Unfortunately, all of this doesn’t work well with gzipped input, which requires sequential unzipping; indeed, this can be used as an argument for why single-block gzipping is bad. But anyway, I was curious to see how much performance this would give in practice.

3 Different chunking

In my 1brc post, I took a different approach. When there are \(t\) threads, simply partition the input file into \(t\) non-overlapping chunks, and give one chunk to each thread. There might then be some reads that cross chunk boundaries, but this is not a problem: each thread first finds the start of its first record (the first > character at or after its chunk start), and then processes records from there. Each thread stops processing when the next > character it runs into lies in the next chunk (or when the end of the file is reached).

Figure 2: My proposed chunking: each thread just takes its own slice of the input file and does its own thing. The middle of the three threads processes the reads shown in yellow.

In this model, the parsing does not require any synchronization at all!
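
In code, the boundary alignment could look like this Rust sketch (following the same simplification as above, that a > always starts a record):

```rust
/// The slice of `data` that thread `t` (of `threads`) should process:
/// from the first '>' at or after its naive chunk start, up to (but excluding)
/// the first '>' at or after its naive chunk end. Adjacent threads compute the
/// same boundary position, so every record is processed exactly once, lock-free.
fn thread_slice(data: &[u8], t: usize, threads: usize) -> &[u8] {
    let chunk_len = (data.len() + threads - 1) / threads; // ceiling division
    let record_start = |from: usize| {
        let from = from.min(data.len());
        from + data[from..]
            .iter()
            .position(|&b| b == b'>')
            .unwrap_or(data.len() - from)
    };
    let start = if t == 0 { 0 } else { record_start(t * chunk_len) };
    let end = record_start((t + 1) * chunk_len);
    &data[start..end]
}
```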

In the code, I simply scan for all > characters and collect batches of reads spanning at least 1 MB each. Each batch is sent to a queue, to be picked up by a worker thread.2 The worker thread then searches for the first newline inside each record to find the end of the header; the rest is the read itself.

Note that the underlying file is memory mapped, so that accessing it in multiple streams is trivial from the code perspective.
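
Putting this together, here is a sketch of the driver, assuming the memmap2 crate and the thread_slice helper from the snippet above (the 1 MB batches and the queue are elided; each thread just counts its own records):

```rust
use std::fs::File;

fn main() -> std::io::Result<()> {
    let file = File::open("reads.fa")?;
    // Memory-map the file, so every thread can read its own slice directly.
    let mmap = unsafe { memmap2::Mmap::map(&file)? };
    let data = &mmap[..];
    let threads = 6;
    let total: usize = std::thread::scope(|s| {
        let handles: Vec<_> = (0..threads)
            .map(|t| {
                s.spawn(move || {
                    // Each thread scans its own slice independently.
                    thread_slice(data, t, threads)
                        .iter()
                        .filter(|&&b| b == b'>')
                        .count()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    });
    println!("{total} records");
    Ok(())
}
```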

4 Experiments

I did some experiments both on the CHM13v2.0.fa human genome and on a set of simulated reads3 from Zenodo by Bede Constantinides that are used for benchmarking Deacon (github).

See the github code for the implementation.

rsviruses17900.fasta is 5.4 GB and has reads of average length 2000 bp; each read is a single line (ie no linebreaks inside records).

Adding RabbitFx to these benchmarks is still TODO, since I don’t feel like messing with C/C++ bindings at the moment.4

Table 1: Throughput of parsing reads (average len 2000 bp) in GB/s. Note that both needletail and paraseq don't benefit from multiple threads, because I don't do any work on their output.
| threads | needletail | paraseq | us    |
|---------|------------|---------|-------|
| 1       | 6.91       | 5.28    | 7.81  |
| 2       | 6.95       | 5.09    | 12.46 |
| 3       | 7.04       | 4.97    | 15.26 |
| 4       | 6.81       | 4.63    | 17.27 |
| 5       | 6.84       | 4.70    | 18.32 |
| 6       | 6.82       | 4.66    | 17.83 |
Table 2: Throughput of human genome in GB/s. Note that the fasta file has linebreaks, and I don't find/strip them.
| threads | needletail | paraseq | us    |
|---------|------------|---------|-------|
| 1       | 2.77       | bug     | 8.26  |
| 2       | 2.80       | 4.01    | 12.24 |
| 3       | 2.80       | 3.96    | 15.38 |
| 4       | 2.78       | 4.02    | 16.30 |
| 5       | 2.80       | 3.92    | 17.08 |
| 6       | 2.77       | 3.96    | 17.28 |

Neither needletail nor paraseq allocates a newline-free record unless requested, and I don’t request one in this benchmark. However, I believe at least needletail unconditionally stores the positions of all newlines, and so does slightly more work.

The new method caps out at 17 GB/s, while the peak memory transfer rate of my DDR4 RAM5 is around 23 GB/s, so this is not too bad. If I drop the fixed 2.6 GHz CPU frequency that I used for the experiments, I instead get speeds up to 22.7 GB/s, which is actually very close!

Thus, if the combined throughput of your code over all threads is more than 2-4 GB/s, you might try this technique to avoid file parsing becoming the point of contention.

5 Outlook

I think this is a neat idea, but probably of limited practical use, since most large inputs will be zipped. And in that case, you may have many input files, and it’s better to spin up one thread/program per input file, so that decompression can be done in parallel.

6 Some helicase benchmarks

See https://github.com/imartayan/helicase

Running the helicase benchmarks: cd bench && cargo run -r <path>.

Input files are taken/modified from https://zenodo.org/records/15411280:

| file                    | what                     | size   | avg len | newlines | distribution  |
|-------------------------|--------------------------|--------|---------|----------|---------------|
| rsviruses17900.fa       | Viruses                  | 539 MB | 30 kbp  | no       | high variance |
| rsviruses17900.fastq    | Simulated ONT reads      | 11 GB  | 2 kbp   | no       | high variance |
| rsviruses17900.fasta    | (converted to .fasta)    | 5.4 GB | 2 kbp   | no       | high variance |
| rsviruses17900.r1.fastq | Simulated Illumina reads | 6.4 GB | 150 bp  | no       | all same      |
| rsviruses17900.r1.fasta | (converted to .fasta)    | 3.6 GB | 150 bp  | no       | all same      |
| chm13v2.0.fasta         | human genome             | 3.2 GB | 140 Mbp | yes      | high variance |

Benchmarks take different inputs:

  • path takes a file path and reads the file.
  • mmap takes a file path and mmaps the file.
  • slice and reader take a &[u8] from an already-read file.
Table 3: Throughput in GB/s. (Close to) best in bold, excluding regex, header-only, and DNA len.
| method               | virus    | ONT .fq   | ONT .fa  | Illumina .fq | Illumina .fa | human    |
|----------------------|----------|-----------|----------|--------------|--------------|----------|
| size                 | 539 MB   | 11 GB     | 5.4 GB   | 6.4 GB       | 3.6 GB       | 3.2 GB   |
| avg record len       | 30 kbp   | 2 kbp     | 2 kbp    | 150 bp       | 150 bp       | 140 Mbp  |
| newlines             | no       | no        | no       | no           | no           | yes      |
| Regex header (slice) | 15.71    | 20.00     | 9.05     | 18.31        | 0.77         | 20.13    |
| Needletail (file)    | 7.44     | 8.73      | 7.16     | 4.07         | 3.13         | 1.89     |
| Needletail (reader)  | **8.42** | 11.16     | **8.51** | 4.88         | 3.39         | 2.07     |
| Paraseq (file)       | 5.25     | 2.58      | 3.40     | 1.61         | 1.39         | 1.94     |
| Paraseq (reader)     | 7.70     | 2.71      | 4.48     | 1.90         | 1.66         | 2.29     |
| Header only (file)   | 6.73     | 6.99      | 6.07     | 5.25         | 3.75         | 4.48     |
| Header only (mmap)   | 8.56     | 9.12      | 8.90     | 6.55         | 5.72         | 5.69     |
| Header only (slice)  | 8.70     | 11.60     | 7.92     | 8.59         | 5.42         | 6.02     |
| DNA string (file)    | 4.71     | 5.91      | 4.63     | 4.81         | 3.24         | 2.89     |
| DNA string (mmap)    | 5.53     | 9.10      | 5.92     | 6.61         | **3.85**     | **3.42** |
| DNA string (slice)   | 5.47     | **11.60** | 5.08     | **8.59**     | 3.52         | 3.23     |
| DNA packed (file)    | 3.33     | 4.42      | 3.15     | 3.16         | 2.22         | 2.00     |
| DNA packed (mmap)    | 4.03     | 5.03      | 3.98     | 3.35         | 2.57         | 2.35     |
| DNA packed (slice)   | 3.77     | 5.32      | 3.55     | 3.64         | 2.40         | 2.18     |
| DNA columnar (file)  | 3.62     | 4.62      | 3.45     | 3.73         | 2.67         | 2.45     |
| DNA columnar (mmap)  | 4.16     | 5.59      | 4.17     | 4.13         | 2.96         | 3.06     |
| DNA columnar (slice) | 3.66     | 5.89      | 3.50     | 4.57         | 2.64         | 2.78     |
| DNA len (slice)      | 6.37     | 6.27      | 5.84     | 4.60         | 4.50         | 4.69     |

Observations:

  • Surprisingly, helicase is much faster for fastq than for fasta, both for &str and columnar/packed output.
  • For fastq, helicase is best.
  • For short reads, helicase is best (less per-record overhead).
  • For fasta with newlines, helicase is slower than on no-newline data, but faster than needletail/paraseq.
  • Needletail is quite good, but slow on short reads and human data with newlines.
  • Paraseq is only comparable in speed for long chromosomes.
  • In all cases, processing an in-memory &[u8] is faster than file input.
  • mmap is faster than usual file parsing, but usually (not always) slower than bytes from memory.

We can modify our mmap-based multithreading to call helicase (slice input, string output) instead of our stub parser in each thread, as in this code. Note that we simply split the input into 6 chunks (one for each thread) and ignore cross-chunk reads in helicase.

Recall that our baseline simply searches for > characters and then for the first following \n to determine the end of the header, and does nothing else.
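
In Rust, that baseline could look roughly like this (my reconstruction, using the memchr crate; not the exact benchmark code):

```rust
use memchr::memchr;

/// Count-only baseline: find each '>' and the first '\n' after it, which
/// delimits the header; everything up to the next '>' is the read itself,
/// and nothing is copied or validated.
fn count_records(mut data: &[u8]) -> usize {
    let mut records = 0;
    while let Some(start) = memchr(b'>', data) {
        records += 1;
        data = &data[start + 1..];
        // The first newline after '>' ends the header line.
        match memchr(b'\n', data) {
            Some(end) => data = &data[end + 1..],
            None => break, // truncated final header
        }
    }
    records
}
```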

Table 4: Throughput in GB/s.
| method                 | threads | virus  | ONT .fq | ONT .fa | Illumina .fq | Illumina .fa | human   |
|------------------------|---------|--------|---------|---------|--------------|--------------|---------|
| size                   |         | 539 MB | 11 GB   | 5.4 GB  | 6.4 GB       | 3.6 GB       | 3.2 GB  |
| avg record len         |         | 30 kbp | 2 kbp   | 2 kbp   | 150 bp       | 150 bp       | 140 Mbp |
| newlines               |         | no     | no      | no      | no           | no           | yes     |
| us (count-only)        | 6       | 27.1   | 26.2    | 25.8    | 15.9         | 16.9         | 23.0    |
| helicase slice default | 6       | 20.9   | 24.5    | 21.8    | 20.8         | 20.4         | 10.9    |
| us (count-only)        | 3       | 24.5   | 22.5    | 20.7    | 14.1         | 16.1         | 22.0    |
| helicase slice default | 3       | 13.6   | 19.5    | 13.4    | 15.5         | 12.0         | 8.0     |
| us (count-only)        | 1       | 14.3   | 11.8    | 16.18   | 7.76         | 8.97         | 12.22   |
| helicase slice default | 1       | 5.5    | 8.7     | 5.94    | 6.42         | 4.03         | 3.43    |
| Needletail             | 1       | 9.1    | 8.5     | 8.68    | 4.30         | 4.47         | 3.22    |
| Paraseq                | 1       | 4.3    | 6.0     | 4.80    | 3.33         | 2.75         | 3.79    |
| helicase path default  | 1       | 4.6    | 5.7     | 4.44    | 4.48         | 3.09         | 3.16    |
| helicase path ignore   | 1       | 6.7    | 7.1     | 6.27    | 5.94         | 4.28         | 5.12    |
| helicase path header   | 1       | 6.7    | 7.0     | 6.15    | 5.31         | 3.82         | 5.00    |
| helicase path string   | 1       | 4.6    | 5.8     | 4.55    | 4.75         | 3.25         | 3.24    |
| helicase path columnar | 1       | 3.5    | 4.5     | 3.46    | 3.65         | 2.66         | 2.77    |
| helicase path packed   | 1       | 3.2    | 4.3     | 2.83    | 3.13         | 2.28         | 2.16    |

Observations:

  • Our trivial method is already saturated at 3 threads.
  • Helicase needs the full 6 threads for reads and viruses.
  • Helicase is quite a bit slower on the line-wrapped human data, probably because it does a lot of additional (and mostly inevitable) I/O.
  • Needletail is quite fast in all single-threaded cases (but note that we do not call .seq()).

References

Langmead, Ben, Christopher Wilks, Valentin Antonescu, and Rone Charles. 2018. “Scaling Read Aligners to Hundreds of Threads on General-Purpose Processors.” Edited by John Hancock. Bioinformatics 35 (3): 421–32. https://doi.org/10.1093/bioinformatics/bty648.
Zhang, Hao, Honglei Song, Xiaoming Xu, Qixin Chang, Mingkai Wang, Yanjie Wei, Zekun Yin, Bertil Schmidt, and Weiguo Liu. 2023. “RabbitFX: Efficient Framework for FASTA/Q File Parsing on Modern Multi-Core Platforms.” IEEE/ACM Transactions on Computational Biology and Bioinformatics 20 (3): 2341–48. https://doi.org/10.1109/tcbb.2022.3219114.

  1. I haven’t looked into the details of their code, but this really is not so fast. As a baseline, SimdMinimizers computes minimizers at over 500 MB/s on a single core, and that is much more involved than simply counting the number of A/C/G/T characters. ↩︎

  2. Really the queue isn’t needed so much here – we could just pass a callback to a worker function, similar to what paraseq does with the map-reduce model. Then, each thread simply processes its own data. ↩︎

  3. Download rsviruses17900.fastq.gz, unzip, and then convert the .fastq to a .fa. ↩︎

  4. I get build errors on the Examples.cpp. Thanks C++. ↩︎

  5. Note that the file is read from RAM, not disk/SSD, since it’s already cached by the kernel. ↩︎