- 1 Results
- 2 Faster construction by skipping runs
- 3 TODO Queries
- 4 TODO suffix links
- 5 Ideas
- 6 Analysing and reducing space usage
- 6.1 Movi 1
- 6.2 Movi 2
- 6.2.1 Block-based run-indices
- 6.3 STPD
- 6.4 Compressing STPD: STPD for pBWT
- 6.5 Jump Index
- 6.5.1 Compact link
- 6.5.2 Mutable
- 6.6 TODO
- 7 CDAWG
- 7.1 Further papers
- 8 Fast construction
- 9 HPRCv2
- 10 Questions
- 10.1 Notes
- 10.2 Jump Index
- 11 Shrinking practical memory usage
These are ongoing notes and logs on improving the STPD (Becker et al. 2025) and a new variant of it that I’ll call the jump index.
1 Results Link to heading
twice the same table; bottom one in millions
| n | k | r | e | CDAWG nodes | CDAWG edges | stpd | \(\vert\mathsf{stpd}\vert\) | #(src) | #(src,c) | #(links) | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| influenza | 154808560 | 78k | 3022821 | 17163707 | 7734059 | 17163736 | pos- | 1928406 | 1815632 | 2337554 | 2938832 |
| influenza | lex- | 1815859 | 1580462 | 2087389 | 2982427 | ||||||
| influenza | colex- | 1805729 | 4081396 | 5054791 | 5054791 | ||||||
| dna.001.1 | 104857600 | 100 | 1716805 | 6003074 | 12911887 | pos- | 1250622 | 764033 | 1270872 | 1609646 | |
| dna.001.1 | lex- | 1102787 | 793048 | 1285939 | 1707288 | ||||||
| dna.001.1 | colex- | 1100566 | 3248971 | 3874620 | 3874621 | ||||||
| einstein.de.txt | 92758440 | 2.1k | 101368 | 238338 | 79468 | 238234 | pos- | 61606 | 43335 | 94804 | 104020 |
| einstein.de.txt | lex- | 61492 | 37442 | 88424 | 103639 | ||||||
| einstein.de.txt | colex- | 62429 | 51986 | 117392 | 117392 | ||||||
| Escherichia_Coli | 112689512 | 23 | 15044485 | 31340395 | 12011071 | 31340396 | pos- | 11677571 | 7065572 | 11877332 | 14968636 |
| Escherichia_Coli | lex- | 9817074 | 6840748 | 11527829 | 14987339 | ||||||
| Escherichia_Coli | colex- | 9825970 | 8027661 | 13337512 | 13337512 |
| (in M) | n | k | r | e | CDAWG-n | -e | stpd | \(\vert\mathsf{stpd}\vert\) | #(src) | #(src,c) | #(links) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| influenza | 154.81 | 78k | 3.02 | 17.16 | 7.73 | 17.15 | pos- | 1.93 | 1.82 | 2.34 | 2.94 |
| influenza | lex- | 1.82 | 1.58 | 2.09 | 2.98 | ||||||
| influenza | colex- | 1.81 | 4.08 | 5.05 | 5.05 | ||||||
| dna.001.1 | 104.86 | 100 | 1.72 | 6.00 | 12.91 | pos- | 1.25 | 0.76 | 1.27 | 1.61 | |
| dna.001.1 | lex- | 1.10 | 0.79 | 1.29 | 1.71 | ||||||
| dna.001.1 | colex- | 1.10 | 3.25 | 3.87 | 3.87 | ||||||
| einstein.de.txt | 92.76 | 2.1k | 0.10 | 0.24 | 0.08 | 0.23 | pos- | 0.06 | 0.04 | 0.09 | 0.10 |
| einstein.de.txt | lex- | 0.06 | 0.04 | 0.09 | 0.10 | ||||||
| einstein.de.txt | colex- | 0.06 | 0.05 | 0.12 | 0.12 | ||||||
| Escherichia_Coli | 112.69 | 23 | 15.04 | 31.34 | 12.01 | 31.33 | pos- | 11.68 | 7.07 | 11.88 | 14.97 |
| Escherichia_Coli | lex- | 9.82 | 6.84 | 11.53 | 14.99 | ||||||
| Escherichia_Coli | colex- | 9.83 | 8.03 | 13.34 | 13.34 |
short repetitive sequence, 1% error rate, alphabet size 2 and 4
| n | r | CDAWG-n | -e | stpd | \(\vert\mathsf{stpd}\vert\) | #(src) | #(src,c) | #(links) | |
|---|---|---|---|---|---|---|---|---|---|
| relative(1*1000@0.01,2) | 1000 | 488 | 740 | 1486 | pos- | 401 | 336 | 336 | 484 |
| relative(1*1000@0.01,2) | 1000 | 488 | 740 | 1486 | lex- | 309 | 301 | 302 | 487 |
| relative(1*1000@0.01,2) | 1000 | 488 | 740 | 1486 | colex- | 305 | 380 | 380 | 380 |
| relative(2*1000@0.01,2) | 2000 | 612 | 909 | 1825 | pos- | 513 | 402 | 402 | 622 |
| relative(2*1000@0.01,2) | 2000 | 612 | 909 | 1825 | lex- | 376 | 376 | 377 | 606 |
| relative(2*1000@0.01,2) | 2000 | 612 | 909 | 1825 | colex- | 375 | 470 | 470 | 470 |
| relative(128*1000@0.01,2) | 128000 | 7766 | 43553 | 87165 | pos- | 5690 | 5541 | 5541 | 7814 |
| relative(128*1000@0.01,2) | 128000 | 7766 | 43553 | 87165 | lex- | 4672 | 4737 | 4738 | 7582 |
| relative(128*1000@0.01,2) | 128000 | 7766 | 43553 | 87165 | colex- | 4652 | 21795 | 21795 | 21795 |
| relative(1024*1000@0.01,2) | 1024000 | 31984 | 252733 | 505545 | pos- | 22931 | 24898 | 24898 | 32440 |
| relative(1024*1000@0.01,2) | 1024000 | 31984 | 252733 | 505545 | lex- | 20692 | 20374 | 20375 | 31186 |
| relative(1024*1000@0.01,2) | 1024000 | 31984 | 252733 | 505545 | colex- | 20783 | 128607 | 128607 | 128607 |
| relative(1*1000@0.01,4) | 1000 | 744 | 542 | 1475 | pos- | 615 | 353 | 633 | 775 |
| relative(1*1000@0.01,4) | 1000 | 744 | 542 | 1475 | lex- | 501 | 353 | 620 | 773 |
| relative(1*1000@0.01,4) | 1000 | 744 | 542 | 1475 | colex- | 497 | 375 | 657 | 657 |
| relative(2*1000@0.01,4) | 2000 | 874 | 621 | 1668 | pos- | 683 | 391 | 692 | 847 |
| relative(2*1000@0.01,4) | 2000 | 874 | 621 | 1668 | lex- | 572 | 375 | 662 | 841 |
| relative(2*1000@0.01,4) | 2000 | 874 | 621 | 1668 | colex- | 576 | 432 | 750 | 750 |
| relative(128*1000@0.01,4) | 128000 | 8282 | 43824 | 99987 | pos- | 5671 | 4351 | 6739 | 8374 |
| relative(128*1000@0.01,4) | 128000 | 8282 | 43824 | 99987 | lex- | 5140 | 3726 | 6002 | 8076 |
| relative(128*1000@0.01,4) | 128000 | 8282 | 43824 | 99987 | colex- | 5155 | 22762 | 29810 | 29812 |
| relative(1024*1000@0.01,4) | 1024000 | 37602 | 234132 | 703386 | pos- | 25721 | 21231 | 30606 | 38203 |
| relative(1024*1000@0.01,4) | 1024000 | 37602 | 234132 | 703386 | lex- | 24160 | 18321 | 26770 | 36844 |
| relative(1024*1000@0.01,4) | 1024000 | 37602 | 234132 | 703386 | colex- | 24107 | 156444 | 323502 | 323503 |
long repetitive sequence, 0.1% error rate, alphabet size 2 and 4
| n | r | CDAWG-n | -e | stpd | \(\vert\mathsf{stpd}\vert\) | #(src) | #(src,c) | #(links) | |
|---|---|---|---|---|---|---|---|---|---|
| relative(1*10000@0.001,2) | 10000 | 5013 | 7540 | 15091 | pos- | 4122 | 3361 | 3361 | 4966 |
| lex- | 3073 | 3115 | 3116 | 4980 | |||||
| colex- | 3087 | 3763 | 3764 | 3764 | |||||
| relative(2*10000@0.001,2) | 20000 | 5152 | 7816 | 15641 | pos- | 4224 | 3459 | 3459 | 5140 |
| lex- | 3216 | 3213 | 3214 | 5140 | |||||
| colex- | 3223 | 3908 | 3908 | 3908 | |||||
| relative(128*10000@0.001,2) | 1280000 | 16527 | 98398 | 196848 | pos- | 12502 | 11256 | 11256 | 16218 |
| lex- | 10033 | 10178 | 10179 | 16339 | |||||
| colex- | 10032 | 50831 | 50832 | 50832 | |||||
| relative(1024*10000@0.001,2) | 10240000 | 79282 | 2987410 | 5975276 | pos- | 57803 | 58012 | 58012 | 80795 |
| lex- | 48235 | 48634 | 48635 | 77778 | |||||
| colex- | 48250 | 1481795 | 1481796 | 1481796 | |||||
| relative(1*10000@0.001,4) | 10000 | 7534 | 5459 | 14639 | pos- | 5905 | 3418 | 5955 | 7426 |
| lex- | 5020 | 3360 | 5856 | 7488 | |||||
| colex- | 5012 | 3762 | 6484 | 6484 | |||||
| relative(2*10000@0.001,4) | 20000 | 7627 | 5581 | 14957 | pos- | 6058 | 3491 | 6105 | 7624 |
| lex- | 5028 | 3399 | 5914 | 7589 | |||||
| colex- | 5087 | 3838 | 6642 | 6642 | |||||
| relative(128*10000@0.001,4) | 1280000 | 16924 | 91260 | 192530 | pos- | 12520 | 8407 | 13586 | 16859 |
| lex- | 10881 | 7721 | 12728 | 16850 | |||||
| colex- | 10958 | 47333 | 54212 | 54212 | |||||
| relative(1024*10000@0.001,4) | 10240000 | 81066 | 2958481 | 6587814 | pos- | 55824 | 40501 | 63548 | 79097 |
| lex- | 51343 | 36891 | 59656 | 80204 | |||||
| colex- | 51324 | 1602100 | 1973240 | 1973241 |
Datasets from pizza&chili repetitive corpus:
influenza: 78k copies of influenza; probably with very high variancedna.001.1: is 100 synthetic copies of a 1 MB string, each with 0.1% independent mutationseinstein.de.txt: revisions of the German Einstein Wikipedia articleEscherichia_Coli: 23 copies of E. Coli.
Variables:
- \(n\): size of input
- \(k\): the number of copies
- \(r\): BWT-runs
- \(e = |\mathsf{CDAWG}|\) is the number of edges in the CDAWG and is copied from Cleary et al. (2024).
- My numbers in the CDAWG-e column are just slightly (<100) off. Maybe because I don’t append a sentinel?
- \(|\mathsf{stpd}|\): STPD size
- equal to number of CDAWG nodes with a non-path incoming edge?
- src: number of text positions from which we jump
- (src, c): number of (text positions, character) for which there is a jump
- #(links): total number of links in Jump Index.
- TODO: Why not equal to |CDAWG|/2?
Observations:
- For all but
influenza, we have roughly \(r = n/(k/2) = 2\cdot n/k\). - the STPD size is always 60% to 75% of \(r\)
- Number of links for pos and lex is very close to \(r\)
- Number of links for colex is much worse, but why? Especially when there are many repeats.
2 Faster construction by skipping runs Link to heading
before: 60s
| (in M) | n | r | stpd | #(src) | #(src,c) | #(links) | |
|---|---|---|---|---|---|---|---|
| influenza | 154.81 | 3.02 | pos- | 1928406.00 | 1815632.00 | 2337554.00 | 2938832.00 |
| influenza | 154.81 | 3.02 | lex- | 1815859.00 | 1580462.00 | 2087389.00 | 2982427.00 |
| influenza | 154.81 | 3.02 | colex- | 1805729.00 | 4081396.00 | 5054791.00 | 5054791.00 |
after: 35s
| (in M) | n | r | stpd | #(src) | #(src,c) | #(links) | |
|---|---|---|---|---|---|---|---|
| influenza | 154.81 | 3.02 | pos- | 1928405.00 | 1815631.00 | 2337552.00 | 2938830.00 |
| influenza | 154.81 | 3.02 | lex- | 1815859.00 | 1580462.00 | 2087388.00 | 2982426.00 |
| influenza | 154.81 | 3.02 | colex- | 1805728.00 | 4081395.00 | 5054789.00 | 5054789.00 |
Idea: if an SA interval \([l, r)\) is contained inside a BWT run, we can skip processing it completely.
This results in close to \(4r\) samples in total. Probably because each ‘critical’ interval is processed once for each character somehow?
- After reusing BWT between variants: 29s
- After switching to non-precomputed RMQ: still 29s
- After switching to S=256 instead of S=64: still 29s
- using
par_iterforpermuted_pi: 22s - using S=128: 21s
- build RMQ using par_iter for block mins: 20s
New overall results: (67s)
| (in M) | n | r | stpd | #(src) | #(src,c) | #(links) | |
|---|---|---|---|---|---|---|---|
| dna.001.1 | 104.86 | 1.72 | pos- | 1250622.00 | 764033.00 | 1270872.00 | 1609647.00 |
| dna.001.1 | 104.86 | 1.72 | lex- | 1102787.00 | 793048.00 | 1285939.00 | 1707288.00 |
| dna.001.1 | 104.86 | 1.72 | colex- | 1100566.00 | 3248974.00 | 3874623.00 | 3874624.00 |
| influenza | 154.81 | 3.02 | pos- | 1928405.00 | 1815631.00 | 2337552.00 | 2938830.00 |
| influenza | 154.81 | 3.02 | lex- | 1815859.00 | 1580462.00 | 2087388.00 | 2982426.00 |
| influenza | 154.81 | 3.02 | colex- | 1805728.00 | 4081395.00 | 5054789.00 | 5054789.00 |
| Escherichia_Coli | 112.69 | 15.04 | pos- | 11677571.00 | 7065572.00 | 11877332.00 | 14968636.00 |
| Escherichia_Coli | 112.69 | 15.04 | lex- | 9817074.00 | 6840748.00 | 11527829.00 | 14987339.00 |
| Escherichia_Coli | 112.69 | 15.04 | colex- | 9825970.00 | 8027661.00 | 13337512.00 | 13337512.00 |
| einstein.de.txt | 92.76 | 0.10 | pos- | 61606.00 | 43335.00 | 94804.00 | 104020.00 |
| einstein.de.txt | 92.76 | 0.10 | lex- | 61492.00 | 37443.00 | 88425.00 | 103640.00 |
| einstein.de.txt | 92.76 | 0.10 | colex- | 62431.00 | 51988.00 | 117394.00 | 117394.00 |
TODO 2.1 We loose 1 or 2 samples in the new setting. Why? Link to heading
TODO 2.2 More fine-grained pruning to directly construct the final result without duplicates? Link to heading
TODO 2.3 Construction direction via run-length BWT? Link to heading
TODO 2.4 PHF on (pos, c) Link to heading
TODO 2.5 compressed targets Link to heading
Instead of storing all jump targets, store a list of all target locations (as in STPD). But now partition it by the last character. Then store one list per final character (sorted by text pos, for lg(n/r) bits each) and the index in the list (lg r bits each). This should help compression when targets are reused.
TODO 2.6 Prefix table up to k=10 Link to heading
TODO 3 Queries Link to heading
- 1-PHF to map \((i,a)\) to its range
TODO 4 suffix links Link to heading
- How good are our suffix links??? How many do we need to traverse to get to the next position on average?
5 Ideas Link to heading
- max LCP length relates to max run length: both increase for more repetitive texts
- pBWT: highlight jump-targets
- Either binary search for them in \(O(l)\) (\(l = \Theta(n/r)\) copies), or store target in \(\lg l\) bits.
- STPD, say of size \(s\):
- \(s\) \(\lg n\) bit samples, shuffled
- can we compress them?
- typically, we will jump from pos \(i\) in one copy to pos \(\approx i\) in another copy? So then a binary search over \(\ell\) things is enough?
- Consecutive STPD samples are in similar text positions?
- Jump targets will usually point to roughly the same position in the text
- Compressing the Jump Index:
- MPHF for \((i, a)\)
- LCP \(\geq 255\) can be stored in a secondary structure with 32-bit values, so 8 bits is enough for the primary structure
- Use a prefix lookup for the first ~10 bp = ~20 bits
6 Analysing and reducing space usage Link to heading
6.1 Movi 1 Link to heading
Movi (Zakeri et al. 2024, 2025) is an engineered version of a run-length FM-index. Similar to the r-index, but avoiding all the rank and select structures and instead inlining just the right data so that most LF-mappings can be done with only a single cache miss.
Movi1 uses 16 bytes per run.
6.2 Movi 2 Link to heading
Movi2 splits some runs:
- Runs longer than \(2^\ell\) are split into two (+5% for HPRCv2), so the length can be stored in \(\ell\) bits
- In a run of
A, if we instead want to append aC, the best match is in the run ofC’s preceding or succeeding the current run. Most of the time, this choice is the same for all elements of the run, and a single bit up/down pointer for each of \(\sigma-1\) of the non-run characters suffices. If this is not the case, the run is split, for 10% extra runs on HPRCv2.
It then stores:
- 2-bit character
- \(\ell=11\)-bit run length
- the \(\log_2(n)\)-bit position where the entire run maps to:
- a \(\log_2( r)=36\)-bit run-index \(\xi\)
- an 11-bit index in the run
- \(\sigma-1=3\) bits indicating whether to round up or down for each non-run characters.
For a total of 63 bits or 8 bytes, with the bulk being the \(r\log_2n\) bits for the target indices.
6.2.1 Block-based run-indices Link to heading
The largest component are the run-indices \(\chi\). But these can be compressed further, since \(\xi\) is increasing over all runs for the same character. An elias-fano structure would be space efficient, but a practical solution is to only store \(\xi\) values (for all characters) every \(b\) rows and then linearly interpolate between them, only storing the delta to the prediction. In this mode, \(\ell=10\) and \(\Delta=23\) (I think), for a total of 6 bytes.
Instead of storing \(\Delta\), a further down-sampling followed by a linear scan is also possible, bringing the space per run to just 3 bytes. (I do not yet fully understand the details here.)
6.3 STPD Link to heading
Suppose we have \(k\) similar sequences of total length \(n=k\cdot m\). In practice, the STPD has roughly \(s=\frac 23 r\) samples (see tables above).
Each sample is a text position, for a total space of \(s\log_2 n \approx r\log_2 n\), i.e., similar to the move-index. In fact, 6 bytes (48 bits) per sample is plenty, matching the block-based movi2 (but not the sampled version).
Do note though that this excludes the relative-Lempel-Ziv compressed text, as well as a 2-4 bit/element RMQ structure that selects the best match in the interval of the STPD (sparse prefix array) that matches the pattern prefix:
- For
colex-, this cam be omitted and we take the start of the interval instead. - For
pos-, we could also iterate over the entire matching interval. Not sure if that would be slow. - For
lex-, the RMQ structure is needed.
Unlike for movi, where the target runs are increasing and can thus be compressed, here we must sort the \(s\) integers in colex order, meaning that (a priory) we cannot compress them via something like Elias-Fano coding.
But there is hope!
6.4 Compressing STPD: STPD for pBWT Link to heading
Consider the setting where the \(k\) sequences only differ by say 1% of substitutions, and nothing else. Then we can put them below each other (as in the pBWT), and for each column where the pattern matches, we could report the first input sequence in which it matches.
Thus, we get something like this:
| |
Here, the pattern matches a prefix of \(T_1\).
Then, if we extend the pattern by Z, it does not match \(T_1\) anymore, and
instead we have to jump to \(T_3\). Thus, the Z in \(T_3\) would be an STPD-sample
(in the concatenation of the texts).
Thus, in every column where “things happen” we get one or more STPD samples. But in this setting, we don’t have to binary search over all samples in \(O(\log s)\)! Instead, we only have to binary search the samples in the current column, which is more like \(O(\log s/m)\), or either way at most \(O(\log k)\).
Thus, we could ideally use some perfect hash function to map active columns (with at least one sample) to a variable-length list of samples in around 2 bits per active column. Or we could store in \(\log_2 k\) bits the number of STPD samples in each active column, and then distribute columns over multiple PHFs based on the number of samples.
Then, all that’s needed is to store the subset of the \(k\) sequences that has an STPD sample in this column, which is only \(\log_2 k\) bits (around 9 for HPRCv2) rather than \(\log_2 n\) (around 44).
Thus, this scheme needs only \(m\log k + 2m +s\log_2 k\) bits.
A big restriction, though, is that this does not allow jumping between different columns: once the pattern in anchored to some column, it can only switch between rows.
Nevertheless, it feels like this should provide some possible ways to optimize the generic STPD as well: most of the time, jumping happens between ’equivalent positions’ in different genomes, and if we have some course multiple-sequence-alignment or so between them, then it might be sufficient to only store low bits most of the time to indicate ‘jump to the equivalent location in the input sequence with ID \(x\)’.
6.5 Jump Index Link to heading
- HPRCv2: 466 copies of Gbp => 40.5 bits of text
- At 8 bytes/run and including rc and 15% extra runs, movi2 has 47.3 GB.
- 5.1 G runs
- 2.55 G runs for 466 copies of fwd-only
- 1.72 G runs for a single copy of a human genome
Initial space we need per pointer:
- source: 41 bits
- hide in MPHF on (pos, c) => 2 bits
- EF to store all lengths?
- Or EF for source locations + EF for lengths?
- <1 byte total?
- target: 41 bits = 5 bytes
- LCP len: 16 bits (probably 8 + fallback) = 2 bytes
6.5.1 Compact link Link to heading
Current
| |
- naive: 24 bytes
- packed: 100 bits = 12.5 bytes
- 41 bit src
- 2 bit char
- 16 bit LCP
- 41 bit target
- EF: we have ~2G runs, so we encode the top 31 bits with only 2 bits => 71 bits each
6.5.2 Mutable Link to heading
- Giulio’s mutable rank/select B-tree on the high part of the EF
- maybe 20% overhead; that’s all fine
- Does it support insertions though… It’ll need to be modified for this
- B-tree on the low parts of the EF values, so insertion is fast
- Needs special support for indexing
- Split so that low parts are 64 bits??? 9-bit LCP?
TODO 6.6 Link to heading
- Number of edges in graph on STPD samples?
- (this is larger than just the number of links)
- equivalent to CDAWG? Smaller?
- Number of suffix links?
7 CDAWG Link to heading
What is the relation between suffixient sets/STPD and the CDAWG?
- STPD is a path decomposition of the CDAWG
- Size is the number of nodes with >1 incoming edge, ie at least one incoming edge not in the path decomposition. Up to 10x smaller than the CDAWG, see table at the top.
- Jump Index has size equal to the number of non-path edges in the CDAWG?
- In the table, the size of the jump index varies between 20% and 50% of the CDAWG. I’m not yet sure why it’s not always at least 50%, ie where at most half the edges are heavy-path and the other (at least) half light edges needing a jump pointer.
- Hmm; the JI jumps to a different location depending on the length of the substring already matched, whereas the CDAWG is independent of the already-matched prefix and so is more like COLEX-? Or is this not an issue because such cases have different CDAWG nodes?
- CDAWG is bad (\(O(n)\)) for
AAAAAAAA#. But maybe fine in practice? - CDAWG nodes are maximal repeats \(S\), and have an equivalence class of
right-maximal strings \(\{S[0..), S[1..), S[2..), \dots, S[k..)\}\) such that all of
those (but the first) are not left-maximal, but \(S[k+1\dots)\) is
left-maximal. (Belazzougui and Cunial 2017)
- All strings in the equivalence class have the same range-length in the BWT and for \(i\geq 1\), all \(S[i..)\) the range is (contained in) a single run with character \(S[i-1]\). OTOH, the BWT interval for \(S[0..)\) contains at least two characters.
- All paths to a node in the CDAWG are a suffix of one another.
- The in-neighbours of a node exactly partition the range of ‘active lengths’.
CDAWG:
- A node for every maximal repeat, that is: a substring \(S\) that is repeated and can be extended both left and right in at least 2 ways. Or equivalently, \(aS\) and \(Sa\) occur stricly less than \(S\) for all symbols \(a\).
- Space: number of right-extensions of maximal repeats
“Composite Repetition-Aware Data Structures” (Belazzougui et al. 2015)
- (google scholar citations to this paper give a good source of papers)
- introduces \(e\), the number of arcs in the CDAWG
- Index combining run-length BWT with CDAWG
“Representing the Suffix Tree with the Cdawg” (Belazzougui and Cunial 2017)
- Heavy-path decomposition of CDAWG; supports additional suffix-tree operations
7.1 Further papers Link to heading
“Suffix Trees, Dawgs and Cdawgs for Forward and Backward Tries” (Inenaga 2020)
- Takes as input not a string, but a suffix tree or so.
“The Cdawg Index and Pattern Matching on Grammar-Compressed Strings” (Cleary et al. 2024)
- Builds a CDAWG on grammars; 4x smaller than the original CDAWG
“Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures” (Navarro 2021b)
“Indexing Highly Repetitive String Collections, Part Ii: Compressed Indexes” (Navarro 2021a)
8 Fast construction Link to heading
- Goal: \(O( r)\) space; \(O(n)\) is wasteful
- BWT+LCP in \(O( r)\) is possible (?)
- Skip constructing SA on repetitive parts?
- Simply dynamically add them and don’t process them at all?
- For each mutation find the longest extension to left/right to make it unique. Then only process that window?
9 HPRCv2 Link to heading
Get the AGC (assembled genomes compressor) version as described here (3.3 GB .agc):
| |
Or get all of them as alignments against a reference from here (5.2 GB .paf.gz).
Rust AGC api: https://docs.rs/ragc-core/0.1.1/ragc_core/
10 Questions Link to heading
- Why is the number of jumps so close to \(r\)
- data layout
- dynamic vs static implementation?
- lg(k) instead of lg(n) bits per jump?
- suffix links?
10.1 Notes Link to heading
- BML: faster MEM finding
- moni align
- phi (and inverse) also give LCP values
- classic: find all occurrences of all MEMs, but not what is needed in practice
- 100 matches to 3 locations is not super useful
- Only care about distinct locations in pangenome graphs
- Paper Joni + Paten, WABI 2025: tag array
- only store distinct mapping locations for runs
- Latin 2025 paper Gonzalo, Joni, Travis
- Jannick Olbrich: MSA
- sample runs; how many columns you move left after LF mapping
- Heng:
- vg-surject: maps alignment to graph back to standard reference
- tag array of vg-surject values
- Goal: mix tag arrays with suffixient sets
- Problem: these don’t give BWT interval
- Can only enumerate the interval; not “RMQ” query it
- not count query, but interval query
- lg(n) vs lg(h)
- movi: r lg n, but r lg(n/r)
- move structure: row-major
- compression: col-major
- paper from Roberto’s festschrift
- SMEM: MEM aligned by pBWT columns
- suffixient set for findings MEMs; then r-index to locate all occurrences?
- SPIRE paper:
- Heng: grlBWT; LMS grammar
- Ben: PFP parsing
- z-fast trie vs centroid decomposition
- WABI 22 ??
Demo:
- patterns up to length 20
- obvious that most stuff is useless
10.2 Jump Index Link to heading
- dynamic construction
- quickly discard existing regions
- dynamic predecessor structure is all that’s needed
- log(u/n) ideally
- similar to ukkonen’s SA construction
- Store lg(h) instead of lg(n), relative to MSA
- all relative to reference?
- break each pointer into 2 pieces:
- this genome lg(h)
- roughly this position lg(w)
- if mapping is monotone; use checkpoints
- Can enumerate all occurrences via DFS
- Associate TAG array information with jump links?
- But we still
- Can we do counts?
- Draw as automaton
- What happens when we add a copy?
- How many states do we gain?
11 Shrinking practical memory usage Link to heading
- SA: 24GB = 8B * 3G, because
libsaisneedsi64for \(2^{31.5}\) elements - PLCP: 24GB
- Shrink SA to
u32for 12GB - Shrink PLCP to
u16for 6GB - Build LCP, also 6GB
- Peak so far: 48GB
New:
- Text: 1 B -> 3 GB
- BWT: 1 B -> 3 GB
- SA: 32 bit = 4 B -> 12 GB
- LCP: 32 bit = 4 B -> 12 GB
- Dropped range boundaries
- RMQ via segtree instead of sparse table
- SA RMQ: 1.5 b -> 500 MB
- LCP RMQ: 1.5 b -> 500 MB
Total: 31GB
32 byte links: ~23GB for ~1/3rd of the data
16 byte links: ~24GB for 560/750 intervals = 3/4th of the data
Replace LCP by Compact PLCP version
- 12.4 GB => 0.8 GB
20 GB now: 3+3+12+0.5+0.5+0.8
- 40 GB left for links :)
collecting links is SLOW because of RMQ requiring up to 128 select queries…
- 36 GB before dedup; ~56 GB total
- 27 GB after dedup; ~47 GB total
POS- for human genome in 9.5min wall time (576s) on 6 cores / 12 threads.
| n | r | CDAWG nodes | CDAWG edges | #(src) | #(src,c) | #(links) | |
|---|---|---|---|---|---|---|---|
| human | 3 117 292 032 | 1 702 564 736 | 1 437 127 296 | 3 741 263 872 | 815 311 808 | 1 345 582 592 | 1 697 342 720 |
Max LCP: 3111690 => 22 bits
- links can fit in 12 bytes, but that makes sorting them much much slower.
- EF compresses links from 27 GB to 13 GB!
- MPHF would make it even smaller but need something to counter false positives.