1 File formats
This stuff is all insanely confusing. My summary:
- DEFLATE
- The ‘original’ compression method (RFC 1951). The data is a sequence of blocks of arbitrary size, and each block starts with a small header giving the compression mode, followed for dynamic blocks by the Huffman code tables. It applies Lempel-Ziv '77 (LZ77) compression, replacing repeated strings by back-references to an earlier occurrence. It uses a 32 KiB context window for this, which may extend beyond the start of the block.
- zlib
- An implementation of DEFLATE, and also a file format. The zlib format wraps the raw DEFLATE blocks in a 2-byte header and a 4-byte Adler-32 checksum footer.
- gzip (GNU zip)
- Another file format around DEFLATE, consisting of a small header containing e.g. the original file name, then a list of DEFLATE blocks, and lastly a CRC32 checksum and the uncompressed size modulo 2^32.
- blocked gzip (BGZF, blocked gzip format)
- A file format developed for bioinformatics that is just multiple gzip files concatenated, each holding at most 64 KiB of data. This allows faster compression and decompression by parallelizing over independent blocks, as well as random access via a small auxiliary index of block starts. This is backwards compatible with plain gzip.
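To make the difference between the three wrappers concrete, here is a minimal sketch using Python's standard zlib module, where the wbits parameter selects the container. The helper name compress_as and the test data are made up for illustration; the sketch assumes the default 10-byte gzip header that zlib writes when no file name is set.

```python
import struct
import zlib

data = b"ACGTACGTACGTACGT" * 64

def compress_as(payload, wbits):
    """Compress payload; wbits picks the container around the DEFLATE blocks."""
    c = zlib.compressobj(level=6, wbits=wbits)
    return c.compress(payload) + c.flush()

raw = compress_as(data, -15)         # raw DEFLATE, no header or footer
zlib_stream = compress_as(data, 15)  # zlib: 2-byte header + Adler-32
gzip_stream = compress_as(data, 31)  # gzip: 10-byte header + CRC32 + size

# Strip the wrappers to get back the same raw DEFLATE blocks.
zlib_payload = zlib_stream[2:-4]
gzip_payload = gzip_stream[10:-8]

# The footers are checksums of the *uncompressed* data.
zlib_footer = struct.pack(">I", zlib.adler32(data))          # big-endian
gzip_footer = struct.pack("<II", zlib.crc32(data), len(data))  # little-endian

print(len(raw), len(zlib_stream), len(gzip_stream))
print(zlib_payload == raw, gzip_payload == raw)
print(zlib_stream[-4:] == zlib_footer, gzip_stream[-8:] == gzip_footer)
```

The gzip magic bytes 1f 8b at the start of gzip_stream are what tools use to recognise the format.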
2 Implementations
- zlib
- The original C library.
- zlib-ng
- Modern C implementation of zlib using SIMD.
- libz-sys crate
- Rust bindings to zlib and zlib-ng.
- zlib-rs crate
- Pure Rust re-implementation.
- zlib-rs-sys crate
- zlib-compatible C-API to zlib-rs.
- flate2 crate
- High level Rust crate with uniform bindings to multiple zlib implementations.
3 Containers
Figure 1: Overview of the containers in a (blocked) GZIP file. Mim stores checkpoints at the start of DEFLATE blocks, which do not coincide with the start of GZIP blocks!
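The outer layer of containers can be illustrated with a small sketch that concatenates independent gzip members and keeps an index of their start offsets. This is not real BGZF (which also records each block's size in a gzip extra field), and the names write_blocked and read_member are hypothetical; it only shows that each gzip block decompresses on its own while the whole file is still valid gzip.

```python
import gzip
import zlib

def write_blocked(chunks):
    """Concatenate one gzip member per chunk; return the file and member starts."""
    out = bytearray()
    starts = []
    for chunk in chunks:
        starts.append(len(out))
        out += gzip.compress(chunk)
    return bytes(out), starts

def read_member(blob, start):
    """Decompress only the gzip member that begins at byte offset start."""
    d = zlib.decompressobj(wbits=31)  # 31 = expect a gzip header and footer
    # Decompression stops at the end of this member; later members
    # are left untouched in d.unused_data.
    return d.decompress(blob[start:])

chunks = [b">r1\nACGT\n", b">r2\nGGCC\n", b">r3\nTTAA\n"]
blob, starts = write_blocked(chunks)

whole = gzip.decompress(blob)        # plain gzip reads all members in order
third = read_member(blob, starts[2])  # random access to the last block
print(whole == b"".join(chunks), third == chunks[2])
```

Random access at this level needs no context window, because back-references never cross a gzip block boundary.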
4 Zran
zran (zlib/zran.h, zlib/zran.c) is a small library that comes with zlib and allows random-access
decompression. It stores checkpoints at spaced-out DEFLATE block boundaries, each holding (up to)
the last 32 KiB of decompressed data before that block, so that decompression can start from the
middle of the file. DEFLATE blocks at the start of a gzip block do not have a context window,
since back-references cannot cross gzip block boundaries.
It uses Z_BLOCK to decompress up to a DEFLATE block boundary, inflatePrime to
insert the leftover bits of the byte in which the block starts (DEFLATE blocks are not
byte-aligned), and inflateSetDictionary to set the 32 KiB context window.
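The idea of resuming from a checkpoint can be sketched in Python, whose zlib module exposes inflateSetDictionary for raw DEFLATE through the zdict argument. Python does not expose Z_BLOCK or inflatePrime, so this sketch sidesteps the bit-alignment problem by assuming the compressor ended a block on a byte boundary with Z_SYNC_FLUSH; real zran has to handle blocks that start mid-byte. The function names are made up for illustration.

```python
import zlib

WINDOW = 32 * 1024  # size of the DEFLATE context window

def compress_with_checkpoint(first, second):
    """Raw DEFLATE of first + second; returns the stream and the byte offset
    at which the blocks for `second` begin."""
    c = zlib.compressobj(level=6, wbits=-15)  # negative wbits = raw DEFLATE
    # Z_SYNC_FLUSH ends the current block and pads to a byte boundary,
    # but keeps the window, so later blocks may still refer back to `first`.
    head = c.compress(first) + c.flush(zlib.Z_SYNC_FLUSH)
    tail = c.compress(second) + c.flush(zlib.Z_FINISH)
    return head + tail, len(head)

def resume(stream, offset, window):
    """Start decompressing at offset, with window as the preceding text."""
    d = zlib.decompressobj(wbits=-15, zdict=window)
    return d.decompress(stream[offset:])

first = bytes((i * 7 + i // 13) % 251 for i in range(4000))
second = first[1000:3000]  # repeats earlier text, so it compresses to back-references

stream, offset = compress_with_checkpoint(first, second)
window = first[-WINDOW:]  # what a checkpoint at `offset` would store
restored = resume(stream, offset, window)
print(restored == second)
```

A checkpoint here is just the pair (offset, window); zran additionally stores the number of leftover bits to feed to inflatePrime.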
5 Mim
mim extends zran by also storing FASTA/FASTQ information for each checkpoint.
Specifically, it stores the absolute position in the decompressed text and the index of the first
record starting after each checkpoint.
Unfortunately, inflateSetDictionary is currently only available via the zlib-rs-sys
C-compatible API.
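As a rough illustration of the per-checkpoint record information described above, the sketch below computes, for given positions in a decompressed FASTA text, the index of the first record that starts at or after each position. The Checkpoint type, its field names, and the at-or-after convention are assumptions made for this example and not mim's actual data layout; FASTQ is left out because an '@' can also appear in quality lines.

```python
from bisect import bisect_left
from dataclasses import dataclass

@dataclass
class Checkpoint:
    text_pos: int      # absolute offset into the decompressed text
    first_record: int  # index of the first record starting at or after text_pos

def record_starts(text):
    """Offsets of FASTA headers: a '>' at the start of a line."""
    starts = [0] if text.startswith(b">") else []
    pos = text.find(b"\n>")
    while pos != -1:
        starts.append(pos + 1)
        pos = text.find(b"\n>", pos + 1)
    return starts

def annotate(text, checkpoint_positions):
    """Attach record indices to checkpoint positions via binary search."""
    starts = record_starts(text)
    return [Checkpoint(p, bisect_left(starts, p)) for p in checkpoint_positions]

text = b">r0\nACGT\n>r1\nGGCCAA\n>r2\nTT\n"
checkpoints = annotate(text, [0, 5, 9, 21])
for cp in checkpoints:
    print(cp)
```

A first_record equal to the number of records means that no record starts after that checkpoint.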