The Movi 2 paper (Zakeri et al. 2025) builds a fast index on HPRCv2 (Human Pangenome Reference Consortium 2025), a collection of 466 human genomes. This post collects some statistics on the number of BWT runs (\(r\)) of this dataset, and uses them to estimate the number of unique mutations.

Table 1: Number of BWT runs and average run length for a random 3.2Gbp string, a human genome, and HPRCv1 and v2. The average run-length in HPRC is taken from Zakeri et al. (2025).
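| dataset | copies | rc? | length (Gbp) | avg run-len | runs (G) | est total mut’s | unique mut rate |
|---|---|---|---|---|---|---|---|
| random | | no | 3.2 | 1.33 | 2.40 | | |
| CHM13v2.0 | 1 | no | 3.2 | 1.85 | 1.72 | | |
| HPRCv1 | 94 | yes | 2x 301 | 134 | 2.25 | 33M | 1/8900 |
| HPRCv2 | 466 | yes | 2x 1500 | 535 | 2.80 | 68M | 1/22000 |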

TODO: Update for the fact that HPRC run-lengths are for the version with rc! (Include single human genome with rc)

We see that going from 1 to 94 copies increases the number of runs by only 31%, and that going from 94 to 466 copies adds a further 25% relative to HPRCv1.

In the BWT of copies of a random string of length \(n\), each point mutation creates very roughly (within a factor \(2\) of) \(\log_\sigma n\) additional runs. Thus, HPRCv1 (Liao et al. 2023) contains around \((2.25G - 1.72G) / \log_4(3.2G) = 33M\) mutations, or 360k per copy. So each copy has one unique mutation roughly every 8900 bases, a rate of about \(1/8900\) per base.

For HPRCv2, the same estimate gives \((2.80G - 1.72G) / \log_4(3.2G) = 68M\) mutations, or 150k per copy, or one new unique mutation every 22k bases.
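The arithmetic behind both estimates can be replayed with a short script. This is only a sketch of the back-of-the-envelope calculation above: the function name `estimate_mutations` and the constant names are made up for illustration, the inputs are the rounded values from Table 1, and the "\(\log_\sigma n\) runs per mutation" assumption is only good to within a factor of 2.

```python
import math

GENOME_LEN = 3.2e9   # bases in one human genome (Table 1)
SIGMA = 4            # DNA alphabet size
BASE_RUNS = 1.72e9   # BWT runs of a single genome, CHM13v2.0 (Table 1)


def estimate_mutations(total_runs: float, copies: int) -> dict:
    """Estimate mutation counts from the excess of BWT runs over one genome.

    Assumes each point mutation adds about log_sigma(n) runs, which is
    only accurate to within a factor of 2.
    """
    runs_per_mutation = math.log(GENOME_LEN, SIGMA)
    total = (total_runs - BASE_RUNS) / runs_per_mutation
    per_copy = total / copies
    return {
        "total": total,
        "per_copy": per_copy,
        "bases_per_mutation": GENOME_LEN / per_copy,
    }


# Run counts and copy numbers taken from Table 1.
for name, runs, copies in [("HPRCv1", 2.25e9, 94), ("HPRCv2", 2.80e9, 466)]:
    est = estimate_mutations(runs, copies)
    print(
        f"{name}: {est['total'] / 1e6:.1f}M total, "
        f"{est['per_copy'] / 1e3:.0f}k per copy, "
        f"one per {est['bases_per_mutation']:.0f} bases"
    )
```

Since the inputs are rounded, the printed numbers land close to, but not exactly on, the figures quoted in the text.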

References

Human Pangenome Reference Consortium. 2025. “HPRC Data Release 2.” https://humanpangenome.org/hprc-data-release-2.
Liao, Wen-Wei, Mobin Asri, Jana Ebler, Daniel Doerr, Marina Haukness, Glenn Hickey, Shuangjia Lu, et al. 2023. “A Draft Human Pangenome Reference.” Nature 617 (7960): 312–24. https://doi.org/10.1038/s41586-023-05896-x.
Zakeri, Mohsen, Nathaniel K. Brown, Travis Gagie, and Ben Langmead. 2025. “Movi 2: Fast and Space-Efficient Queries on Pangenomes,” October. https://doi.org/10.1101/2025.10.16.682873.