STPD dev notes

Table of Contents

1 Results
2 Faster construction by skipping runs
3 TODO Queries
4 TODO suffix links
5 Ideas
6 Analysing and reducing space usage
- 6.1 Movi 1
- 6.2 Movi 2
  - 6.2.1 Block-based run-indices
- 6.3 STPD
- 6.4 Compressing STPD: STPD for pBWT
- 6.5 Jump Index
  - 6.5.1 Compact link
  - 6.5.2 Mutable
- 6.6 TODO
7 CDAWG
- 7.1 Further papers
8 Fast construction
9 HPRCv2
10 Questions
- 10.1 Notes
- 10.2 Jump Index
11 Shrinking practical memory usage

These are ongoing notes and logs on improving the STPD (Becker et al. 2025) and a new variant of it that I’ll call the jump index.

1 Results Link to heading

twice the same table; bottom one in millions

	n	k	r	e	CDAWG nodes	CDAWG edges	stpd	\(\vert\mathsf{stpd}\vert\)	#(src)	#(src,c)	#(links)
influenza	154808560	78k	3022821	17163707	7734059	17163736	pos-	1928406	1815632	2337554	2938832
influenza							lex-	1815859	1580462	2087389	2982427
influenza							colex-	1805729	4081396	5054791	5054791
dna.001.1	104857600	100	1716805		6003074	12911887	pos-	1250622	764033	1270872	1609646
dna.001.1							lex-	1102787	793048	1285939	1707288
dna.001.1							colex-	1100566	3248971	3874620	3874621
einstein.de.txt	92758440	2.1k	101368	238338	79468	238234	pos-	61606	43335	94804	104020
einstein.de.txt							lex-	61492	37442	88424	103639
einstein.de.txt							colex-	62429	51986	117392	117392
Escherichia_Coli	112689512	23	15044485	31340395	12011071	31340396	pos-	11677571	7065572	11877332	14968636
Escherichia_Coli							lex-	9817074	6840748	11527829	14987339
Escherichia_Coli							colex-	9825970	8027661	13337512	13337512

(in M)	n	k	r	e	CDAWG-n	-e	stpd	\(\vert\mathsf{stpd}\vert\)	#(src)	#(src,c)	#(links)
influenza	154.81	78k	3.02	17.16	7.73	17.15	pos-	1.93	1.82	2.34	2.94
influenza							lex-	1.82	1.58	2.09	2.98
influenza							colex-	1.81	4.08	5.05	5.05
dna.001.1	104.86	100	1.72		6.00	12.91	pos-	1.25	0.76	1.27	1.61
dna.001.1							lex-	1.10	0.79	1.29	1.71
dna.001.1							colex-	1.10	3.25	3.87	3.87
einstein.de.txt	92.76	2.1k	0.10	0.24	0.08	0.23	pos-	0.06	0.04	0.09	0.10
einstein.de.txt							lex-	0.06	0.04	0.09	0.10
einstein.de.txt							colex-	0.06	0.05	0.12	0.12
Escherichia_Coli	112.69	23	15.04	31.34	12.01	31.33	pos-	11.68	7.07	11.88	14.97
Escherichia_Coli							lex-	9.82	6.84	11.53	14.99
Escherichia_Coli							colex-	9.83	8.03	13.34	13.34

short repetitive sequence, 1% error rate, alphabet size 2 and 4

	n	r	CDAWG-n	-e	stpd	\(\vert\mathsf{stpd}\vert\)	#(src)	#(src,c)	#(links)
relative(1*1000@0.01,2)	1000	488	740	1486	pos-	401	336	336	484
relative(1*1000@0.01,2)	1000	488	740	1486	lex-	309	301	302	487
relative(1*1000@0.01,2)	1000	488	740	1486	colex-	305	380	380	380
relative(2*1000@0.01,2)	2000	612	909	1825	pos-	513	402	402	622
relative(2*1000@0.01,2)	2000	612	909	1825	lex-	376	376	377	606
relative(2*1000@0.01,2)	2000	612	909	1825	colex-	375	470	470	470
relative(128*1000@0.01,2)	128000	7766	43553	87165	pos-	5690	5541	5541	7814
relative(128*1000@0.01,2)	128000	7766	43553	87165	lex-	4672	4737	4738	7582
relative(128*1000@0.01,2)	128000	7766	43553	87165	colex-	4652	21795	21795	21795
relative(1024*1000@0.01,2)	1024000	31984	252733	505545	pos-	22931	24898	24898	32440
relative(1024*1000@0.01,2)	1024000	31984	252733	505545	lex-	20692	20374	20375	31186
relative(1024*1000@0.01,2)	1024000	31984	252733	505545	colex-	20783	128607	128607	128607
relative(1*1000@0.01,4)	1000	744	542	1475	pos-	615	353	633	775
relative(1*1000@0.01,4)	1000	744	542	1475	lex-	501	353	620	773
relative(1*1000@0.01,4)	1000	744	542	1475	colex-	497	375	657	657
relative(2*1000@0.01,4)	2000	874	621	1668	pos-	683	391	692	847
relative(2*1000@0.01,4)	2000	874	621	1668	lex-	572	375	662	841
relative(2*1000@0.01,4)	2000	874	621	1668	colex-	576	432	750	750
relative(128*1000@0.01,4)	128000	8282	43824	99987	pos-	5671	4351	6739	8374
relative(128*1000@0.01,4)	128000	8282	43824	99987	lex-	5140	3726	6002	8076
relative(128*1000@0.01,4)	128000	8282	43824	99987	colex-	5155	22762	29810	29812
relative(1024*1000@0.01,4)	1024000	37602	234132	703386	pos-	25721	21231	30606	38203
relative(1024*1000@0.01,4)	1024000	37602	234132	703386	lex-	24160	18321	26770	36844
relative(1024*1000@0.01,4)	1024000	37602	234132	703386	colex-	24107	156444	323502	323503

long repetitive sequence, 0.1% error rate, alphabet size 2 and 4

	n	r	CDAWG-n	-e	stpd	\(\vert\mathsf{stpd}\vert\)	#(src)	#(src,c)	#(links)
relative(1*10000@0.001,2)	10000	5013	7540	15091	pos-	4122	3361	3361	4966
					lex-	3073	3115	3116	4980
					colex-	3087	3763	3764	3764
relative(2*10000@0.001,2)	20000	5152	7816	15641	pos-	4224	3459	3459	5140
					lex-	3216	3213	3214	5140
					colex-	3223	3908	3908	3908
relative(128*10000@0.001,2)	1280000	16527	98398	196848	pos-	12502	11256	11256	16218
					lex-	10033	10178	10179	16339
					colex-	10032	50831	50832	50832
relative(1024*10000@0.001,2)	10240000	79282	2987410	5975276	pos-	57803	58012	58012	80795
					lex-	48235	48634	48635	77778
					colex-	48250	1481795	1481796	1481796
relative(1*10000@0.001,4)	10000	7534	5459	14639	pos-	5905	3418	5955	7426
					lex-	5020	3360	5856	7488
					colex-	5012	3762	6484	6484
relative(2*10000@0.001,4)	20000	7627	5581	14957	pos-	6058	3491	6105	7624
					lex-	5028	3399	5914	7589
					colex-	5087	3838	6642	6642
relative(128*10000@0.001,4)	1280000	16924	91260	192530	pos-	12520	8407	13586	16859
					lex-	10881	7721	12728	16850
					colex-	10958	47333	54212	54212
relative(1024*10000@0.001,4)	10240000	81066	2958481	6587814	pos-	55824	40501	63548	79097
					lex-	51343	36891	59656	80204
					colex-	51324	1602100	1973240	1973241

Datasets from pizza&chili repetitive corpus:

influenza: 78k copies of influenza; probably with very high variance
dna.001.1: is 100 synthetic copies of a 1 MB string, each with 0.1% independent mutations
einstein.de.txt: revisions of the German Einstein Wikipedia article
Escherichia_Coli: 23 copies of E. Coli.

Variables:

\(n\): size of input
\(k\): the number of copies
\(r\): BWT-runs
\(e = |\mathsf{CDAWG}|\) is the number of edges in the CDAWG and is copied from Cleary et al. (2024).
- My numbers in the CDAWG-e column are just slightly (<100) off. Maybe because I don’t append a sentinel?
\(|\mathsf{stpd}|\): STPD size
- equal to number of CDAWG nodes with a non-path incoming edge?
src: number of text positions from which we jump
(src, c): number of (text positions, character) for which there is a jump
#(links): total number of links in Jump Index.
- TODO: Why not equal to |CDAWG|/2?

Observations:

For all but influenza, we have roughly \(r = n/(k/2) = 2\cdot n/k\).
the STPD size is always 60% to 75% of \(r\)
Number of links for pos and lex is very close to \(r\)
Number of links for colex is much worse, but why? Especially when there are many repeats.

2 Faster construction by skipping runs Link to heading

before: 60s

(in M)	n	r		stpd	#(src)	#(src,c)	#(links)
influenza	154.81	3.02	pos-	1928406.00	1815632.00	2337554.00	2938832.00
influenza	154.81	3.02	lex-	1815859.00	1580462.00	2087389.00	2982427.00
influenza	154.81	3.02	colex-	1805729.00	4081396.00	5054791.00	5054791.00

after: 35s

(in M)	n	r		stpd	#(src)	#(src,c)	#(links)
influenza	154.81	3.02	pos-	1928405.00	1815631.00	2337552.00	2938830.00
influenza	154.81	3.02	lex-	1815859.00	1580462.00	2087388.00	2982426.00
influenza	154.81	3.02	colex-	1805728.00	4081395.00	5054789.00	5054789.00

Idea: if an SA interval \([l, r)\) is contained inside a BWT run, we can skip processing it completely.

This results in close to \(4r\) samples in total. Probably because each ‘critical’ interval is processed once for each character somehow?

After reusing BWT between variants: 29s
After switching to non-precomputed RMQ: still 29s
After switching to S=256 instead of S=64: still 29s
using par_iter for permuted_pi: 22s
using S=128: 21s
build RMQ using par_iter for block mins: 20s

New overall results: (67s)

(in M)	n	r		stpd	#(src)	#(src,c)	#(links)
dna.001.1	104.86	1.72	pos-	1250622.00	764033.00	1270872.00	1609647.00
dna.001.1	104.86	1.72	lex-	1102787.00	793048.00	1285939.00	1707288.00
dna.001.1	104.86	1.72	colex-	1100566.00	3248974.00	3874623.00	3874624.00
influenza	154.81	3.02	pos-	1928405.00	1815631.00	2337552.00	2938830.00
influenza	154.81	3.02	lex-	1815859.00	1580462.00	2087388.00	2982426.00
influenza	154.81	3.02	colex-	1805728.00	4081395.00	5054789.00	5054789.00
Escherichia_Coli	112.69	15.04	pos-	11677571.00	7065572.00	11877332.00	14968636.00
Escherichia_Coli	112.69	15.04	lex-	9817074.00	6840748.00	11527829.00	14987339.00
Escherichia_Coli	112.69	15.04	colex-	9825970.00	8027661.00	13337512.00	13337512.00
einstein.de.txt	92.76	0.10	pos-	61606.00	43335.00	94804.00	104020.00
einstein.de.txt	92.76	0.10	lex-	61492.00	37443.00	88425.00	103640.00
einstein.de.txt	92.76	0.10	colex-	62431.00	51988.00	117394.00	117394.00

TODO 2.1 We loose 1 or 2 samples in the new setting. Why? Link to heading

TODO 2.2 More fine-grained pruning to directly construct the final result without duplicates? Link to heading

TODO 2.3 Construction direction via run-length BWT? Link to heading

TODO 2.4 PHF on (pos, c) Link to heading

TODO 2.5 compressed targets Link to heading

Instead of storing all jump targets, store a list of all target locations (as in STPD). But now partition it by the last character. Then store one list per final character (sorted by text pos, for lg(n/r) bits each) and the index in the list (lg r bits each). This should help compression when targets are reused.

TODO 2.6 Prefix table up to k=10 Link to heading

TODO 3 Queries Link to heading

1-PHF to map \((i,a)\) to its range

TODO 4 suffix links Link to heading

How good are our suffix links??? How many do we need to traverse to get to the next position on average?

5 Ideas Link to heading

max LCP length relates to max run length: both increase for more repetitive texts
pBWT: highlight jump-targets
- Either binary search for them in \(O(l)\) (\(l = \Theta(n/r)\) copies), or store target in \(\lg l\) bits.
STPD, say of size \(s\):
- \(s\) \(\lg n\) bit samples, shuffled
- can we compress them?
  - typically, we will jump from pos \(i\) in one copy to pos \(\approx i\) in another copy? So then a binary search over \(\ell\) things is enough?
  - Consecutive STPD samples are in similar text positions?
- Jump targets will usually point to roughly the same position in the text
Compressing the Jump Index:
- MPHF for \((i, a)\)
- LCP \(\geq 255\) can be stored in a secondary structure with 32-bit values, so 8 bits is enough for the primary structure
Use a prefix lookup for the first ~10 bp = ~20 bits

6 Analysing and reducing space usage Link to heading

6.1 Movi 1 Link to heading

Movi (Zakeri et al. 2024, 2025) is an engineered version of a run-length FM-index. Similar to the r-index, but avoiding all the rank and select structures and instead inlining just the right data so that most LF-mappings can be done with only a single cache miss.

Movi1 uses 16 bytes per run.

6.2 Movi 2 Link to heading

Movi2 splits some runs:

Runs longer than \(2^\ell\) are split into two (+5% for HPRCv2), so the length can be stored in \(\ell\) bits
In a run of A, if we instead want to append a C, the best match is in the run of C’s preceding or succeeding the current run. Most of the time, this choice is the same for all elements of the run, and a single bit up/down pointer for each of \(\sigma-1\) of the non-run characters suffices. If this is not the case, the run is split, for 10% extra runs on HPRCv2.

It then stores:

2-bit character
\(\ell=11\)-bit run length
the \(\log_2(n)\)-bit position where the entire run maps to:
- a \(\log_2( r)=36\)-bit run-index \(\xi\)
- an 11-bit index in the run
\(\sigma-1=3\) bits indicating whether to round up or down for each non-run characters.

For a total of 63 bits or 8 bytes, with the bulk being the \(r\log_2n\) bits for the target indices.

6.2.1 Block-based run-indices Link to heading

The largest component are the run-indices \(\chi\). But these can be compressed further, since \(\xi\) is increasing over all runs for the same character. An elias-fano structure would be space efficient, but a practical solution is to only store \(\xi\) values (for all characters) every \(b\) rows and then linearly interpolate between them, only storing the delta to the prediction. In this mode, \(\ell=10\) and \(\Delta=23\) (I think), for a total of 6 bytes.

Instead of storing \(\Delta\), a further down-sampling followed by a linear scan is also possible, bringing the space per run to just 3 bytes. (I do not yet fully understand the details here.)

6.3 STPD Link to heading

Suppose we have \(k\) similar sequences of total length \(n=k\cdot m\). In practice, the STPD has roughly \(s=\frac 23 r\) samples (see tables above).

Each sample is a text position, for a total space of \(s\log_2 n \approx r\log_2 n\), i.e., similar to the move-index. In fact, 6 bytes (48 bits) per sample is plenty, matching the block-based movi2 (but not the sampled version).

Do note though that this excludes the relative-Lempel-Ziv compressed text, as well as a 2-4 bit/element RMQ structure that selects the best match in the interval of the STPD (sparse prefix array) that matches the pattern prefix:

For colex-, this cam be omitted and we take the start of the interval instead.
For pos-, we could also iterate over the entire matching interval. Not sure if that would be slow.
For lex-, the RMQ structure is needed.

Unlike for movi, where the target runs are increasing and can thus be compressed, here we must sort the \(s\) integers in colex order, meaning that (a priory) we cannot compress them via something like Elias-Fano coding.

But there is hope!

6.4 Compressing STPD: STPD for pBWT Link to heading

Consider the setting where the \(k\) sequences only differ by say 1% of substitutions, and nothing else. Then we can put them below each other (as in the pBWT), and for each column where the pattern matches, we could report the first input sequence in which it matches.

Thus, we get something like this:

1
2
3
4
5
Pat: ABCDE
T1   ABCDEFGHIJX
T2   ABCDEFGHIJY
T3   ABCDEZGHIJX
T4   ABCDEZGHIXX

Here, the pattern matches a prefix of \(T_1\). Then, if we extend the pattern by Z, it does not match \(T_1\) anymore, and instead we have to jump to \(T_3\). Thus, the Z in \(T_3\) would be an STPD-sample (in the concatenation of the texts).

Thus, in every column where “things happen” we get one or more STPD samples. But in this setting, we don’t have to binary search over all samples in \(O(\log s)\)! Instead, we only have to binary search the samples in the current column, which is more like \(O(\log s/m)\), or either way at most \(O(\log k)\).

Thus, we could ideally use some perfect hash function to map active columns (with at least one sample) to a variable-length list of samples in around 2 bits per active column. Or we could store in \(\log_2 k\) bits the number of STPD samples in each active column, and then distribute columns over multiple PHFs based on the number of samples.

Then, all that’s needed is to store the subset of the \(k\) sequences that has an STPD sample in this column, which is only \(\log_2 k\) bits (around 9 for HPRCv2) rather than \(\log_2 n\) (around 44).

Thus, this scheme needs only \(m\log k + 2m +s\log_2 k\) bits.

A big restriction, though, is that this does not allow jumping between different columns: once the pattern in anchored to some column, it can only switch between rows.

Nevertheless, it feels like this should provide some possible ways to optimize the generic STPD as well: most of the time, jumping happens between ’equivalent positions’ in different genomes, and if we have some course multiple-sequence-alignment or so between them, then it might be sufficient to only store low bits most of the time to indicate ‘jump to the equivalent location in the input sequence with ID \(x\)’.

6.5 Jump Index Link to heading

HPRCv2: 466 copies of Gbp => 40.5 bits of text
At 8 bytes/run and including rc and 15% extra runs, movi2 has 47.3 GB.
- 5.1 G runs
- 2.55 G runs for 466 copies of fwd-only
- 1.72 G runs for a single copy of a human genome

Initial space we need per pointer:

source: 41 bits
- hide in MPHF on (pos, c) => 2 bits
- EF to store all lengths?
- Or EF for source locations + EF for lengths?
- <1 byte total?
target: 41 bits = 5 bytes
LCP len: 16 bits (probably 8 + fallback) = 2 bytes

6.5.1 Compact link Link to heading

Current

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
pub struct Link {
    /// 5 bytes?
    pub source: usize,
    /// 2 bits?
    pub c: char,
    /// 2 bytes?
    pub lcp: usize,
    /// 5 bytes?
    pub target: usize,
}

naive: 24 bytes
packed: 100 bits = 12.5 bytes
- 41 bit src
- 2 bit char
- 16 bit LCP
- 41 bit target
EF: we have ~2G runs, so we encode the top 31 bits with only 2 bits => 71 bits each

6.5.2 Mutable Link to heading

Giulio’s mutable rank/select B-tree on the high part of the EF
- maybe 20% overhead; that’s all fine
- Does it support insertions though… It’ll need to be modified for this
B-tree on the low parts of the EF values, so insertion is fast
- Needs special support for indexing
- Split so that low parts are 64 bits??? 9-bit LCP?

TODO 6.6 Link to heading

Number of edges in graph on STPD samples?
- (this is larger than just the number of links)
- equivalent to CDAWG? Smaller?
Number of suffix links?

7 CDAWG Link to heading

What is the relation between suffixient sets/STPD and the CDAWG?

STPD is a path decomposition of the CDAWG
- Size is the number of nodes with >1 incoming edge, ie at least one incoming edge not in the path decomposition. Up to 10x smaller than the CDAWG, see table at the top.
Jump Index has size equal to the number of non-path edges in the CDAWG?
- In the table, the size of the jump index varies between 20% and 50% of the CDAWG. I’m not yet sure why it’s not always at least 50%, ie where at most half the edges are heavy-path and the other (at least) half light edges needing a jump pointer.
- Hmm; the JI jumps to a different location depending on the length of the substring already matched, whereas the CDAWG is independent of the already-matched prefix and so is more like COLEX-? Or is this not an issue because such cases have different CDAWG nodes?
CDAWG is bad (\(O(n)\)) for AAAAAAAA#. But maybe fine in practice?
CDAWG nodes are maximal repeats \(S\), and have an equivalence class of right-maximal strings \(\{S[0..), S[1..), S[2..), \dots, S[k..)\}\) such that all of those (but the first) are not left-maximal, but \(S[k+1\dots)\) is left-maximal. (Belazzougui and Cunial 2017)
- All strings in the equivalence class have the same range-length in the BWT and for \(i\geq 1\), all \(S[i..)\) the range is (contained in) a single run with character \(S[i-1]\). OTOH, the BWT interval for \(S[0..)\) contains at least two characters.
- All paths to a node in the CDAWG are a suffix of one another.
The in-neighbours of a node exactly partition the range of ‘active lengths’.

CDAWG:

A node for every maximal repeat, that is: a substring \(S\) that is repeated and can be extended both left and right in at least 2 ways. Or equivalently, \(aS\) and \(Sa\) occur stricly less than \(S\) for all symbols \(a\).
Space: number of right-extensions of maximal repeats

“Composite Repetition-Aware Data Structures” (Belazzougui et al. 2015)

(google scholar citations to this paper give a good source of papers)
introduces \(e\), the number of arcs in the CDAWG
Index combining run-length BWT with CDAWG

“Representing the Suffix Tree with the Cdawg” (Belazzougui and Cunial 2017)

Heavy-path decomposition of CDAWG; supports additional suffix-tree operations

7.1 Further papers Link to heading

“Suffix Trees, Dawgs and Cdawgs for Forward and Backward Tries” (Inenaga 2020)

Takes as input not a string, but a suffix tree or so.

“The Cdawg Index and Pattern Matching on Grammar-Compressed Strings” (Cleary et al. 2024)

Builds a CDAWG on grammars; 4x smaller than the original CDAWG

“Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures” (Navarro 2021b)

“Indexing Highly Repetitive String Collections, Part Ii: Compressed Indexes” (Navarro 2021a)

8 Fast construction Link to heading

Goal: \(O( r)\) space; \(O(n)\) is wasteful
- BWT+LCP in \(O( r)\) is possible (?)
Skip constructing SA on repetitive parts?
- Simply dynamically add them and don’t process them at all?
- For each mutation find the longest extension to left/right to make it unique. Then only process that window?

9 HPRCv2 Link to heading

Get the AGC (assembled genomes compressor) version as described here (3.3 GB .agc):

1
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/submissions/B4174A5F-F20E-4DCF-8470-F8A907B640BC--HPRCv2_0.6.1_pr_agc_submission/HPRC_r2_assemblies_0.6.1.agc

Or get all of them as alignments against a reference from here (5.2 GB .paf.gz).

Rust AGC api: https://docs.rs/ragc-core/0.1.1/ragc_core/

10 Questions Link to heading

Why is the number of jumps so close to \(r\)
data layout
- dynamic vs static implementation?
lg(k) instead of lg(n) bits per jump?
suffix links?

10.1 Notes Link to heading

BML: faster MEM finding
moni align
phi (and inverse) also give LCP values
classic: find all occurrences of all MEMs, but not what is needed in practice
- 100 matches to 3 locations is not super useful
- Only care about distinct locations in pangenome graphs
  - Paper Joni + Paten, WABI 2025: tag array
  - only store distinct mapping locations for runs
Latin 2025 paper Gonzalo, Joni, Travis
Jannick Olbrich: MSA
- sample runs; how many columns you move left after LF mapping
Heng:
- vg-surject: maps alignment to graph back to standard reference
- tag array of vg-surject values
Goal: mix tag arrays with suffixient sets
- Problem: these don’t give BWT interval
- Can only enumerate the interval; not “RMQ” query it
- not count query, but interval query
lg(n) vs lg(h)
movi: r lg n, but r lg(n/r)
- move structure: row-major
- compression: col-major
paper from Roberto’s festschrift
SMEM: MEM aligned by pBWT columns
suffixient set for findings MEMs; then r-index to locate all occurrences?
SPIRE paper:
- Heng: grlBWT; LMS grammar
- Ben: PFP parsing
z-fast trie vs centroid decomposition
WABI 22 ??

Demo:

patterns up to length 20
- obvious that most stuff is useless

10.2 Jump Index Link to heading

dynamic construction
- quickly discard existing regions
dynamic predecessor structure is all that’s needed
- log(u/n) ideally
similar to ukkonen’s SA construction
Store lg(h) instead of lg(n), relative to MSA
- all relative to reference?
- break each pointer into 2 pieces:
  - this genome lg(h)
  - roughly this position lg(w)
- if mapping is monotone; use checkpoints
Can enumerate all occurrences via DFS
Associate TAG array information with jump links?
- But we still
Can we do counts?
Draw as automaton
- What happens when we add a copy?
- How many states do we gain?

11 Shrinking practical memory usage Link to heading

SA: 24GB = 8B * 3G, because libsais needs i64 for \(2^{31.5}\) elements
PLCP: 24GB
Shrink SA to u32 for 12GB
Shrink PLCP to u16 for 6GB
Build LCP, also 6GB
Peak so far: 48GB

New:

Text: 1 B -> 3 GB
BWT: 1 B -> 3 GB
SA: 32 bit = 4 B -> 12 GB
LCP: 32 bit = 4 B -> 12 GB
Dropped range boundaries
RMQ via segtree instead of sparse table
SA RMQ: 1.5 b -> 500 MB
LCP RMQ: 1.5 b -> 500 MB

Total: 31GB

32 byte links: ~23GB for ~1/3rd of the data
16 byte links: ~24GB for 560/750 intervals = 3/4th of the data
Replace LCP by Compact PLCP version
- 12.4 GB => 0.8 GB
20 GB now: 3+3+12+0.5+0.5+0.8
- 40 GB left for links :)
collecting links is SLOW because of RMQ requiring up to 128 select queries…
- 36 GB before dedup; ~56 GB total
- 27 GB after dedup; ~47 GB total
POS- for human genome in 9.5min wall time (576s) on 6 cores / 12 threads.

	n	r	CDAWG nodes	CDAWG edges	#(src)	#(src,c)	#(links)
human	3 117 292 032	1 702 564 736	1 437 127 296	3 741 263 872	815 311 808	1 345 582 592	1 697 342 720

Max LCP: 3111690 => 22 bits

links can fit in 12 bytes, but that makes sorting them much much slower.
EF compresses links from 27 GB to 13 GB!
- MPHF would make it even smaller but need something to counter false positives.

References Link to heading

Becker, Ruben, Davide Cenzato, Travis Gagie, Ragnar Groot Koerkamp, Sung-Hwan Kim, Giovanni Manzini, and Nicola Prezza. 2025. “Compressing Suffix Trees by Path Decompositions.” arXiv; arXiv. https://doi.org/10.48550/ARXIV.2506.14734.

Belazzougui, Djamal, Fabio Cunial, Travis Gagie, Nicola Prezza, and Mathieu Raffinot. 2015. “Composite Repetition-Aware Data Structures.” In Combinatorial Pattern Matching, 26–39. Springer International Publishing. https://doi.org/10.1007/978-3-319-19929-0_3.

Belazzougui, Djamal, and Fabio Cunial. 2017. “Representing the Suffix Tree with the Cdawg.” In Lipics, Volume 78, Cpm 2017, 78:7:1–7:13. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPICS.CPM.2017.7.

Cleary, Alan M., Joseph Winjum, Jordan Dood, and Shunsuke Inenaga. 2024. “The Cdawg Index and Pattern Matching on Grammar-Compressed Strings.” Arxiv. https://doi.org/10.48550/ARXIV.2407.08826.

Inenaga, Shunsuke. 2020. “Suffix Trees, Dawgs and Cdawgs for Forward and Backward Tries.” In Latin 2020: Theoretical Informatics, 194–206. Springer International Publishing. https://doi.org/10.1007/978-3-030-61792-9_16.

Navarro, Gonzalo. 2021a. “Indexing Highly Repetitive String Collections, Part Ii: Compressed Indexes.” Acm Computing Surveys 54 (2): 1–32. https://doi.org/10.1145/3432999.

———. 2021b. “Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures.” Acm Computing Surveys 54 (2): 1–31. https://doi.org/10.1145/3434399.

Zakeri, Mohsen, Nathaniel K. Brown, Omar Y. Ahmed, Travis Gagie, and Ben Langmead. 2024. “Movi: A Fast and Cache-Efficient Full-Text Pangenome Index.” Iscience 27 (12): 111464. https://doi.org/10.1016/j.isci.2024.111464.

Zakeri, Mohsen, Nathaniel K. Brown, Travis Gagie, and Ben Langmead. 2025. “Movi 2: Fast and Space-Efficient Queries on Pangenomes.” Biorxiv, October. https://doi.org/10.1101/2025.10.16.682873.