HPRC v2 stats

The Movi 2 paper (Zakeri et al. 2025) builds a fast index on HPRCv2 (Human Pangenome Reference Consortium 2025), a collection of 466 human genomes. This post collects some statistics on the number of BWT runs (\(r\)) in this dataset and uses them to estimate the number of unique mutations.

Table 1: Number of BWT runs and average run length for a random 3.2Gbp string, a human genome, and HPRCv1 and v2. The average run-length in HPRC is taken from Zakeri et al. (2025).

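| dataset   | copies | rc? | length (Gbp) | avg run-len | runs (G) | est. total mutations | unique mutation rate |
|-----------|--------|-----|--------------|-------------|----------|----------------------|----------------------|
| random    |        | no  | 3.2          | 1.33        | 2.40     |                      |                      |
| CHM13v2.0 | 1      | no  | 3.2          | 1.85        | 1.72     |                      |                      |
| HPRCv1    | 94     | yes | 2x 301       | 134         | 2.25     | 33M                  | 1/8900               |
| HPRCv2    | 466    | yes | 2x 1500      | 535         | 2.80     | 68M                  | 1/22000              |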

TODO: Update for the fact that the HPRC run lengths are for the version with reverse complements (rc)! (Include a single human genome with rc.)

https://github.com/mohsenzakeri/Movi/issues/29#issuecomment-3694268341

We see that going from 1 to 94 copies increases the number of runs by only 31%, and going from 94 to 466 copies adds roughly another 25%.
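As a quick sanity check on these percentages, the following Python sketch recomputes them from the values in Table 1. The function names are made up for illustration. It also compares the runs column against length divided by average run length, under the assumption that the HPRC rows use the single-strand length (301 and 1500 Gbp), which is what the table's numbers appear to be consistent with.

```python
# Values copied from Table 1:
# (single-strand length in Gbp, average run length, runs in G).
TABLE = {
    "random":    (3.2,    1.33, 2.40),
    "CHM13v2.0": (3.2,    1.85, 1.72),
    "HPRCv1":    (301.0,  134,  2.25),
    "HPRCv2":    (1500.0, 535,  2.80),
}


def runs_from_avg(length_gbp: float, avg_run_len: float) -> float:
    """Number of runs (in G) implied by a length and an average run length."""
    return length_gbp / avg_run_len


def relative_increase(old: float, new: float) -> float:
    """Relative growth from old to new, e.g. 0.31 for +31%."""
    return (new - old) / old


# Compare the tabulated run counts with length / average run length.
for name, (length, avg, runs) in TABLE.items():
    implied = runs_from_avg(length, avg)
    print(f"{name:10s} table: {runs:.2f}G   length/avg: {implied:.2f}G")

# Growth in the number of runs when adding more genomes.
r1 = TABLE["CHM13v2.0"][2]
r94 = TABLE["HPRCv1"][2]
r466 = TABLE["HPRCv2"][2]
print(f"1 -> 94 copies:   +{100 * relative_increase(r1, r94):.1f}%")
print(f"94 -> 466 copies: +{100 * relative_increase(r94, r466):.1f}%")
```

Because the table entries are rounded, the recomputed percentages can differ from the quoted ones by about a percentage point.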

In the BWT of copies of a random string of length \(n\), each point mutation creates very roughly (within a factor \(2\) of) \(\log_\sigma n\) additional runs. Taking the 1.72G runs of a single genome (CHM13v2.0) as the baseline, HPRCv1 (Liao et al. 2023) thus contains around \((2.25G - 1.72G) / \log_4(3.2G) \approx 33M\) mutations, or 360k per copy. So each copy has a unique mutation roughly every 8900 bases, i.e., a rate of \(1/8900\).

For HPRCv2, the same calculation gives 68M mutations, or 150k per copy, or one new unique mutation every 22k bases.
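The estimate can be reproduced with a few lines of Python. This is only a sketch of the back-of-the-envelope calculation in this post: the function name and constants are chosen here for illustration, and it inherits the rough assumption that each mutation adds about \(\log_\sigma n\) runs.

```python
import math

GENOME_LEN = 3.2e9      # bases in one human genome (Table 1)
SIGMA = 4               # DNA alphabet size
BASELINE_RUNS = 1.72e9  # runs in a single genome, CHM13v2.0 (Table 1)


def estimate_mutations(runs: float, copies: int):
    """Rough estimate of unique mutations from the number of BWT runs.

    Assumes each point mutation adds about log_sigma(n) runs on top of
    the single-genome baseline, which is only accurate to within a
    factor of roughly 2.
    Returns (total mutations, mutations per copy, bases per mutation).
    """
    runs_per_mutation = math.log(GENOME_LEN, SIGMA)
    total = (runs - BASELINE_RUNS) / runs_per_mutation
    per_copy = total / copies
    bases_per_mutation = GENOME_LEN / per_copy
    return total, per_copy, bases_per_mutation


for name, runs, copies in [("HPRCv1", 2.25e9, 94), ("HPRCv2", 2.80e9, 466)]:
    total, per_copy, spacing = estimate_mutations(runs, copies)
    print(f"{name}: {total / 1e6:.0f}M mutations, "
          f"{per_copy / 1e3:.0f}k per copy, "
          f"one every {spacing:.0f} bases")
```

The results match the rounded figures in Table 1 up to rounding of the intermediate values.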

References

Human Pangenome Reference Consortium. 2025. “HPRC Data Release 2.” https://humanpangenome.org/hprc-data-release-2.

Liao, Wen-Wei, Mobin Asri, Jana Ebler, Daniel Doerr, Marina Haukness, Glenn Hickey, Shuangjia Lu, et al. 2023. “A Draft Human Pangenome Reference.” Nature 617 (7960): 312–24. https://doi.org/10.1038/s41586-023-05896-x.

Zakeri, Mohsen, Nathaniel K. Brown, Travis Gagie, and Ben Langmead. 2025. “Movi 2: Fast and Space-Efficient Queries on Pangenomes,” October. https://doi.org/10.1101/2025.10.16.682873.