The Movi 2 paper (Zakeri et al. 2025) builds a fast index on HPRCv2 (Human Pangenome Reference Consortium 2025), a collection of 466 human genomes. This post collects some statistics on the number of BWT runs (\(r\)) of this dataset and uses them to estimate the number of unique mutations.
| dataset | copies | rc? | length (Gbp) | avg run-len | runs (G) | est. total mutations | unique mutation rate |
|---|---|---|---|---|---|---|---|
| random | | no | 3.2 | 1.33 | 2.40 | | |
| CHM13v2.0 | 1 | no | 3.2 | 1.85 | 1.72 | | |
| HPRCv1 | 94 | yes | 2x 301 | 134 | 2.25 | 33M | 1/8900 |
| HPRCv2 | 466 | yes | 2x 1500 | 535 | 2.80 | 68M | 1/22000 |
TODO: Update for the fact that HPRC run-lengths are for the version with rc! (Include single human genome with rc)
We see that going from 1 to 94 copies increases the number of runs by only 31%, and going from 94 to 466 copies adds roughly another 25%.
In the BWT of copies of a random string of length \(n\), each point mutation creates very roughly (within a factor \(2\) of) \(\log_\sigma n\) additional runs. Taking the run count of the single CHM13v2.0 genome as the baseline, HPRCv1 (Liao et al. 2023) thus contains around \((2.25G - 1.72G) / \log_4(3.2G) \approx 33M\) mutations, or 360k per copy. So each copy has a unique mutation roughly every 8900 bases, i.e. a rate of \(1/8900\) per base.
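To get a feel for the \(\log_\sigma n\) rule of thumb, here is a toy experiment. It is only a sketch under simplifying assumptions: a naive BWT via sorted rotations, a single `$` sentinel for the whole concatenation (real tools handle multiple sequences differently), and tiny inputs. The helper names `bwt`, `count_runs`, and `extra_runs_per_mutation` are made up for this illustration.

```python
import random

def bwt(text: str) -> str:
    """Naive BWT via sorted rotations; only suitable for toy inputs."""
    s = text + "$"  # sentinel, sorts before A, C, G, T
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def count_runs(s: str) -> int:
    """Number of maximal blocks of equal characters."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])

def extra_runs_per_mutation(n: int = 500, copies: int = 4, seed: int = 0) -> float:
    """Average number of BWT runs added per point mutation (toy setting)."""
    rng = random.Random(seed)
    base = "".join(rng.choice("ACGT") for _ in range(n))
    mutated = []
    for _ in range(copies - 1):
        # Each extra copy gets exactly one substitution at a random position.
        pos = rng.randrange(n)
        new = rng.choice([c for c in "ACGT" if c != base[pos]])
        mutated.append(base[:pos] + new + base[pos + 1:])
    exact = count_runs(bwt(base * copies))
    with_mutations = count_runs(bwt(base + "".join(mutated)))
    return (with_mutations - exact) / (copies - 1)

# Compare the result against log_4(500), which is about 4.5.
print(extra_runs_per_mutation())
```

With such small inputs the result is noisy, so it should only be read as a sanity check of the order of magnitude.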
For HPRCv2, the same calculation gives 68M mutations, or 150k per copy, which is one new unique mutation every 22k bases.
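The arithmetic behind these estimates can be reproduced in a few lines. This is a sketch that simply plugs the rounded table values into the formula above; the function name `estimate_mutations` and the constants are only for illustration, and the result inherits the factor-2 uncertainty of the \(\log_\sigma n\) rule.

```python
import math

GENOME_LEN = 3.2e9    # length of one human genome, in bases
RUNS_SINGLE = 1.72e9  # BWT runs of CHM13v2.0 alone (baseline)

def estimate_mutations(runs: float, copies: int,
                       genome_len: float = GENOME_LEN,
                       baseline: float = RUNS_SINGLE,
                       sigma: int = 4):
    """Return (total mutations, mutations per copy, bases per unique mutation)."""
    runs_per_mutation = math.log(genome_len, sigma)  # log_sigma(n), about 15.8
    total = (runs - baseline) / runs_per_mutation
    per_copy = total / copies
    spacing = genome_len / per_copy
    return total, per_copy, spacing

for name, runs, copies in [("HPRCv1", 2.25e9, 94), ("HPRCv2", 2.80e9, 466)]:
    total, per_copy, spacing = estimate_mutations(runs, copies)
    print(f"{name}: {total / 1e6:.1f}M total, "
          f"{per_copy / 1e3:.0f}k per copy, one per {spacing:.0f} bases")
```

Because the inputs are rounded to three digits, the printed values agree with the table only up to rounding.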