This is a quick note with some thoughts & experiments on FASTA parsing, alongside github:RagnarGrootKoerkamp/fasta-parsing-playground.

It’s a common complaint these days that FASTA parsing is slow. A commonly used parser is Needletail (github), which builds on seq_io (github). Another recent one is paraseq (github).

Paraseq helps the user by providing an interface for parallel processing of reads; see e.g. this example. Unfortunately, this still has a bottleneck: while the user’s processing of reads is multi-threaded, the parsing itself is still single-threaded.

1 Minimizing critical sections

In Langmead et al. (2018), some approaches are discussed for FASTQ parsing.

  1. Deferred parsing: search for the end of the current block of 4 lines inside a critical section, and then parse those lines outside of it, in separate threads.
  2. Batch deferred: read \(4N\) lines at once, and then parse those. This reduces contention on the locks; see the sketch after this list.
  3. Block deferred: read \(B\) bytes, assuming that this exactly matches a single record. This can be done by padding records with spaces to give them all the same size, but this is more of a hypothetical best case.
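
To make the batch-deferred variant concrete, here is a minimal Rust sketch (my own illustration, not code from the paper; the name `next_batch` and the plain `String` line representation are assumptions): only reading the \(4N\) lines happens inside the critical section, and splitting them into records is left to the calling worker thread.

```rust
use std::io::{BufRead, BufReader, Read};
use std::sync::Mutex;

/// Batch-deferred FASTQ reading: grab the next `4 * n` lines (i.e. `n`
/// records) under the lock; all further parsing happens outside of it.
fn next_batch<R: Read>(reader: &Mutex<BufReader<R>>, n: usize) -> Vec<String> {
    let mut lines = Vec::with_capacity(4 * n);
    let mut reader = reader.lock().unwrap(); // critical section starts here
    for _ in 0..4 * n {
        let mut line = String::new();
        if reader.read_line(&mut line).unwrap() == 0 {
            break; // EOF
        }
        lines.push(line);
    }
    lines // the lock is released here; workers parse the records themselves
}
```

Each worker thread repeatedly calls `next_batch` and then interprets consecutive groups of 4 lines as header, sequence, separator, and quality string, all without holding the lock.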

2 RabbitFx: chunking

RabbitFx (Zhang et al. 2023) (github) is a recent parser written in C++ that does something new: it splits the entire input file into fixed-size chunks (of 8 MB by default) that are then distributed over the threads. Reads that cross chunk boundaries store a link to the next chunk:

Figure 1: The chunking & linking of RabbitFx (Zhang et al. 2023).

The RabbitFx paper then reimplements some basic algorithms on top of this parser and shows significantly higher throughput, reaching up to 2.4 GB/s1 for counting bases on a server with two 20-core CPUs, compared to FQFeeder’s 740 MB/s.

Unfortunately, all of this doesn’t work well with gzipped input, which requires sequential decompression. (If anything, this can be used as an argument for why single-block gzipping is bad.) But anyway, I was curious to see how much performance this would give in practice.

3 Different chunking

In my 1brc post, I took a different approach. When there are \(t\) threads, simply partition the input file into \(t\) non-overlapping chunks, and give one chunk to each thread. Some reads might cross chunk boundaries, but this is not a problem: each thread first finds the start of the first record inside its chunk (the first > character), and then processes records from there. Each thread stops processing once the next > character it runs into lies in the next chunk (or the end of the file is reached).

Figure 2: My proposed chunking: each thread just takes its own slice of the input file and does its own thing. The middle of the three threads processes the reads shown in yellow.

In this model, the parsing does not require any synchronization at all!
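
As a minimal sketch of this scheme (my own illustration; the actual playground code differs in details such as batching), each thread processes exactly those records whose > lies inside its own chunk:

```rust
/// Process exactly those records whose leading '>' lies in data[start..end].
/// `data` is the entire input file; records may extend past `end`.
fn process_chunk(data: &[u8], start: usize, end: usize, mut on_record: impl FnMut(&[u8])) {
    // Find the first record starting inside this chunk. (A more careful
    // version would search for "\n>", since '>' may occur inside headers.)
    let mut pos = match data[start..].iter().position(|&c| c == b'>') {
        Some(p) => start + p,
        None => return, // no record starts in this chunk
    };
    while pos < end {
        // The record runs until the next '>' or the end of the file.
        let next = data[pos + 1..]
            .iter()
            .position(|&c| c == b'>')
            .map_or(data.len(), |p| pos + 1 + p);
        on_record(&data[pos..next]); // may reach into the next thread's chunk
        pos = next;
    }
}
```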

In the code, I simply scan for all > characters and then collect a batch of reads spanning at least 1 MB. Then I send this to a queue to be picked up by a worker thread.2 In the worker thread, I search for the first newline inside each record to find the end of the header; the rest is the read itself.
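
The per-record split is then just a scan for the first newline. A small sketch, with the function name being mine:

```rust
/// Split a raw record (starting at '>') into header and sequence bytes.
/// With single-line reads, `seq` is the read plus its trailing newline.
fn split_record(record: &[u8]) -> (&[u8], &[u8]) {
    let nl = record.iter().position(|&c| c == b'\n').unwrap_or(record.len());
    let header = &record[1..nl]; // text between '>' and the first newline
    let seq = record.get(nl + 1..).unwrap_or(&[]);
    (header, seq)
}
```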

Note that the underlying file is memory-mapped, so that accessing it from multiple threads is trivial from the code’s perspective.
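
Putting the pieces together, here is a hedged sketch of the driver, assuming the memmap2 crate and the process_chunk function sketched above, with byte counting standing in for real work:

```rust
use std::fs::File;

fn parse_parallel(path: &str, threads: usize) -> std::io::Result<()> {
    let file = File::open(path)?;
    // Safety: we assume the file is not modified while it is mapped.
    let mmap = unsafe { memmap2::Mmap::map(&file)? };
    let data: &[u8] = &mmap;
    let chunk = data.len().div_ceil(threads);
    std::thread::scope(|s| {
        for t in 0..threads {
            // Non-overlapping byte ranges; no synchronization needed.
            let (start, end) = (t * chunk, ((t + 1) * chunk).min(data.len()));
            s.spawn(move || {
                let mut bytes = 0;
                process_chunk(data, start, end, |record| bytes += record.len());
                eprintln!("thread {t}: {bytes} record bytes");
            });
        }
    });
    Ok(())
}
```

Scoped threads make it easy to borrow the mapping directly, without wrapping it in an Arc.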

4 Experiments

I did some experiments both on the CHM13v2.0.fa human genome and on a set of simulated reads3 from Zenodo by Bede Constantinides that is used for benchmarking Deacon (github).

See the github code for the implementation.

rsviruses17900.fa is 5.4 GB and has reads of average length 2000 bp, and each read is a single line (i.e. no line breaks inside records).

Adding RabbitFx to these benchmarks is still TODO, since I don’t feel like messing with C/C++ bindings at the moment.

Table 1: Throughput of parsing reads (average length 2000 bp) in GB/s. Note that neither needletail nor paraseq benefits from multiple threads, because I don’t do any work on their output.

| threads | needletail | paraseq | us    |
|---------|------------|---------|-------|
| 1       | 6.91       | 5.28    | 7.81  |
| 2       | 6.95       | 5.09    | 12.46 |
| 3       | 7.04       | 4.97    | 15.26 |
| 4       | 6.81       | 4.63    | 17.27 |
| 5       | 6.84       | 4.70    | 18.32 |
| 6       | 6.82       | 4.66    | 17.83 |
Table 2: Throughput of parsing the human genome in GB/s. Note that the FASTA file has line breaks, and I don’t find/strip them.

| threads | needletail | paraseq | us    |
|---------|------------|---------|-------|
| 1       | 2.77       | bug     | 8.26  |
| 2       | 2.80       | 4.01    | 12.24 |
| 3       | 2.80       | 3.96    | 15.38 |
| 4       | 2.78       | 4.02    | 16.30 |
| 5       | 2.80       | 3.92    | 17.08 |
| 6       | 2.77       | 3.96    | 17.28 |

Both needletail and paraseq only allocate a newline-free record on request, which I don’t do in this benchmark, but I believe at least needletail unconditionally stores the positions of all newlines, and so does slightly more work.

The new method caps out at 17 GB/s, while the peak memory transfer rate of my DDR4 RAM4 is around 23 GB/s, so this is not too bad. If I drop the fixed 2.6 GHz CPU frequency that I used for the experiments, I instead get speeds of up to 22.5 GB/s, which is actually very close!

Thus, if the throughput of your code over all threads combined is more than 2–4 GB/s, you might try this technique to avoid file parsing becoming the point of contention.

5 Outlook

I think this is a neat idea, but probably one with limited practical use, since most large inputs will be zipped. And in that case, you may have many input files, and it’s better to spin up one thread/program per input file, so that decompression can be done in parallel.

References

Langmead, Ben, Christopher Wilks, Valentin Antonescu, and Rone Charles. 2018. “Scaling Read Aligners to Hundreds of Threads on General-Purpose Processors.” Edited by John Hancock. Bioinformatics 35 (3): 421–32. https://doi.org/10.1093/bioinformatics/bty648.
Zhang, Hao, Honglei Song, Xiaoming Xu, Qixin Chang, Mingkai Wang, Yanjie Wei, Zekun Yin, Bertil Schmidt, and Weiguo Liu. 2023. “RabbitFx: Efficient Framework for Fasta/Q File Parsing on Modern Multi-Core Platforms.” IEEE/ACM Transactions on Computational Biology and Bioinformatics 20 (3): 2341–48. https://doi.org/10.1109/tcbb.2022.3219114.

  1. I haven’t looked into the details of their code, but this really is not so fast. As a baseline, SimdMinimizers computes minimizers at over 500 MB/s on a single core, and that is much more involved than simply counting the number of A/C/G/T characters. ↩︎

  2. Really the queue isn’t needed so much here – we could just pass a callback to a worker function, similar to what paraseq does with the map-reduce model. Then, each thread simply processes its own data. ↩︎

  3. Download rsviruses17900.fastq.gz, unzip, and then convert the .fastq to a .fa. ↩︎

  4. Note that the file is read from RAM, not disk/SSD, since it’s already cached by the kernel. ↩︎