These are ongoing notes and logs on improving the STPD (Becker et al. 2025) and a new variant of it that I’ll call the jump index.

1 Results Link to heading

twice the same table; bottom one in millions

nkreCDAWG nodesCDAWG edgesstpd\(\vert\mathsf{stpd}\vert\)#(src)#(src,c)#(links)
influenza15480856078k302282117163707773405917163736pos-1928406181563223375542938832
influenzalex-1815859158046220873892982427
influenzacolex-1805729408139650547915054791
dna.001.11048576001001716805600307412911887pos-125062276403312708721609646
dna.001.1lex-110278779304812859391707288
dna.001.1colex-1100566324897138746203874621
einstein.de.txt927584402.1k10136823833879468238234pos-616064333594804104020
einstein.de.txtlex-614923744288424103639
einstein.de.txtcolex-6242951986117392117392
Escherichia_Coli1126895122315044485313403951201107131340396pos-1167757170655721187733214968636
Escherichia_Colilex-981707468407481152782914987339
Escherichia_Colicolex-982597080276611333751213337512
(in M)nkreCDAWG-n-estpd\(\vert\mathsf{stpd}\vert\)#(src)#(src,c)#(links)
influenza154.8178k3.0217.167.7317.15pos-1.931.822.342.94
influenzalex-1.821.582.092.98
influenzacolex-1.814.085.055.05
dna.001.1104.861001.726.0012.91pos-1.250.761.271.61
dna.001.1lex-1.100.791.291.71
dna.001.1colex-1.103.253.873.87
einstein.de.txt92.762.1k0.100.240.080.23pos-0.060.040.090.10
einstein.de.txtlex-0.060.040.090.10
einstein.de.txtcolex-0.060.050.120.12
Escherichia_Coli112.692315.0431.3412.0131.33pos-11.687.0711.8814.97
Escherichia_Colilex-9.826.8411.5314.99
Escherichia_Colicolex-9.838.0313.3413.34

short repetitive sequence, 1% error rate, alphabet size 2 and 4

nrCDAWG-n-estpd\(\vert\mathsf{stpd}\vert\)#(src)#(src,c)#(links)
relative(1*1000@0.01,2)10004887401486pos-401336336484
relative(1*1000@0.01,2)10004887401486lex-309301302487
relative(1*1000@0.01,2)10004887401486colex-305380380380
relative(2*1000@0.01,2)20006129091825pos-513402402622
relative(2*1000@0.01,2)20006129091825lex-376376377606
relative(2*1000@0.01,2)20006129091825colex-375470470470
relative(128*1000@0.01,2)12800077664355387165pos-5690554155417814
relative(128*1000@0.01,2)12800077664355387165lex-4672473747387582
relative(128*1000@0.01,2)12800077664355387165colex-4652217952179521795
relative(1024*1000@0.01,2)102400031984252733505545pos-22931248982489832440
relative(1024*1000@0.01,2)102400031984252733505545lex-20692203742037531186
relative(1024*1000@0.01,2)102400031984252733505545colex-20783128607128607128607
relative(1*1000@0.01,4)10007445421475pos-615353633775
relative(1*1000@0.01,4)10007445421475lex-501353620773
relative(1*1000@0.01,4)10007445421475colex-497375657657
relative(2*1000@0.01,4)20008746211668pos-683391692847
relative(2*1000@0.01,4)20008746211668lex-572375662841
relative(2*1000@0.01,4)20008746211668colex-576432750750
relative(128*1000@0.01,4)12800082824382499987pos-5671435167398374
relative(128*1000@0.01,4)12800082824382499987lex-5140372660028076
relative(128*1000@0.01,4)12800082824382499987colex-5155227622981029812
relative(1024*1000@0.01,4)102400037602234132703386pos-25721212313060638203
relative(1024*1000@0.01,4)102400037602234132703386lex-24160183212677036844
relative(1024*1000@0.01,4)102400037602234132703386colex-24107156444323502323503

long repetitive sequence, 0.1% error rate, alphabet size 2 and 4

nrCDAWG-n-estpd\(\vert\mathsf{stpd}\vert\)#(src)#(src,c)#(links)
relative(1*10000@0.001,2)100005013754015091pos-4122336133614966
lex-3073311531164980
colex-3087376337643764
relative(2*10000@0.001,2)200005152781615641pos-4224345934595140
lex-3216321332145140
colex-3223390839083908
relative(128*10000@0.001,2)12800001652798398196848pos-12502112561125616218
lex-10033101781017916339
colex-10032508315083250832
relative(1024*10000@0.001,2)102400007928229874105975276pos-57803580125801280795
lex-48235486344863577778
colex-48250148179514817961481796
relative(1*10000@0.001,4)100007534545914639pos-5905341859557426
lex-5020336058567488
colex-5012376264846484
relative(2*10000@0.001,4)200007627558114957pos-6058349161057624
lex-5028339959147589
colex-5087383866426642
relative(128*10000@0.001,4)12800001692491260192530pos-1252084071358616859
lex-1088177211272816850
colex-10958473335421254212
relative(1024*10000@0.001,4)102400008106629584816587814pos-55824405016354879097
lex-51343368915965680204
colex-51324160210019732401973241

Datasets from pizza&chili repetitive corpus:

  • influenza: 78k copies of influenza; probably with very high variance
  • dna.001.1: is 100 synthetic copies of a 1 MB string, each with 0.1% independent mutations
  • einstein.de.txt: revisions of the German Einstein Wikipedia article
  • Escherichia_Coli: 23 copies of E. Coli.

Variables:

  • \(n\): size of input
  • \(k\): the number of copies
  • \(r\): BWT-runs
  • \(e = |\mathsf{CDAWG}|\) is the number of edges in the CDAWG and is copied from Cleary et al. (2024).
    • My numbers in the CDAWG-e column are just slightly (<100) off. Maybe because I don’t append a sentinel?
  • \(|\mathsf{stpd}|\): STPD size
    • equal to number of CDAWG nodes with a non-path incoming edge?
  • src: number of text positions from which we jump
  • (src, c): number of (text positions, character) for which there is a jump
  • #(links): total number of links in Jump Index.
    • TODO: Why not equal to |CDAWG|/2?

Observations:

  • For all but influenza, we have roughly \(r = n/(k/2) = 2\cdot n/k\).
  • the STPD size is always 60% to 75% of \(r\)
  • Number of links for pos and lex is very close to \(r\)
  • Number of links for colex is much worse, but why? Especially when there are many repeats.

2 Faster construction by skipping runs Link to heading

before: 60s

(in M)nrstpd#(src)#(src,c)#(links)
influenza154.813.02pos-1928406.001815632.002337554.002938832.00
influenza154.813.02lex-1815859.001580462.002087389.002982427.00
influenza154.813.02colex-1805729.004081396.005054791.005054791.00

after: 35s

(in M)nrstpd#(src)#(src,c)#(links)
influenza154.813.02pos-1928405.001815631.002337552.002938830.00
influenza154.813.02lex-1815859.001580462.002087388.002982426.00
influenza154.813.02colex-1805728.004081395.005054789.005054789.00

Idea: if an SA interval \([l, r)\) is contained inside a BWT run, we can skip processing it completely.

This results in close to \(4r\) samples in total. Probably because each ‘critical’ interval is processed once for each character somehow?

  • After reusing BWT between variants: 29s
  • After switching to non-precomputed RMQ: still 29s
  • After switching to S=256 instead of S=64: still 29s
  • using par_iter for permuted_pi: 22s
  • using S=128: 21s
  • build RMQ using par_iter for block mins: 20s

New overall results: (67s)

(in M)nrstpd#(src)#(src,c)#(links)
dna.001.1104.861.72pos-1250622.00764033.001270872.001609647.00
dna.001.1104.861.72lex-1102787.00793048.001285939.001707288.00
dna.001.1104.861.72colex-1100566.003248974.003874623.003874624.00
influenza154.813.02pos-1928405.001815631.002337552.002938830.00
influenza154.813.02lex-1815859.001580462.002087388.002982426.00
influenza154.813.02colex-1805728.004081395.005054789.005054789.00
Escherichia_Coli112.6915.04pos-11677571.007065572.0011877332.0014968636.00
Escherichia_Coli112.6915.04lex-9817074.006840748.0011527829.0014987339.00
Escherichia_Coli112.6915.04colex-9825970.008027661.0013337512.0013337512.00
einstein.de.txt92.760.10pos-61606.0043335.0094804.00104020.00
einstein.de.txt92.760.10lex-61492.0037443.0088425.00103640.00
einstein.de.txt92.760.10colex-62431.0051988.00117394.00117394.00

TODO 2.1 We loose 1 or 2 samples in the new setting. Why? Link to heading

TODO 2.2 More fine-grained pruning to directly construct the final result without duplicates? Link to heading

TODO 2.3 Construction direction via run-length BWT? Link to heading

TODO 2.4 PHF on (pos, c) Link to heading

TODO 2.5 compressed targets Link to heading

Instead of storing all jump targets, store a list of all target locations (as in STPD). But now partition it by the last character. Then store one list per final character (sorted by text pos, for lg(n/r) bits each) and the index in the list (lg r bits each). This should help compression when targets are reused.

TODO 2.6 Prefix table up to k=10 Link to heading

TODO 3 Queries Link to heading

  • 1-PHF to map \((i,a)\) to its range
  • How good are our suffix links??? How many do we need to traverse to get to the next position on average?

5 Ideas Link to heading

  • max LCP length relates to max run length: both increase for more repetitive texts
  • pBWT: highlight jump-targets
    • Either binary search for them in \(O(l)\) (\(l = \Theta(n/r)\) copies), or store target in \(\lg l\) bits.
  • STPD, say of size \(s\):
    • \(s\) \(\lg n\) bit samples, shuffled
    • can we compress them?
      • typically, we will jump from pos \(i\) in one copy to pos \(\approx i\) in another copy? So then a binary search over \(\ell\) things is enough?
      • Consecutive STPD samples are in similar text positions?
    • Jump targets will usually point to roughly the same position in the text
  • Compressing the Jump Index:
    • MPHF for \((i, a)\)
    • LCP \(\geq 255\) can be stored in a secondary structure with 32-bit values, so 8 bits is enough for the primary structure
  • Use a prefix lookup for the first ~10 bp = ~20 bits

6 Analysing and reducing space usage Link to heading

6.1 Movi 1 Link to heading

Movi (Zakeri et al. 2024, 2025) is an engineered version of a run-length FM-index. Similar to the r-index, but avoiding all the rank and select structures and instead inlining just the right data so that most LF-mappings can be done with only a single cache miss.

Movi1 uses 16 bytes per run.

6.2 Movi 2 Link to heading

Movi2 splits some runs:

  • Runs longer than \(2^\ell\) are split into two (+5% for HPRCv2), so the length can be stored in \(\ell\) bits
  • In a run of A, if we instead want to append a C, the best match is in the run of C’s preceding or succeeding the current run. Most of the time, this choice is the same for all elements of the run, and a single bit up/down pointer for each of \(\sigma-1\) of the non-run characters suffices. If this is not the case, the run is split, for 10% extra runs on HPRCv2.

It then stores:

  • 2-bit character
  • \(\ell=11\)-bit run length
  • the \(\log_2(n)\)-bit position where the entire run maps to:
    • a \(\log_2( r)=36\)-bit run-index \(\xi\)
    • an 11-bit index in the run
  • \(\sigma-1=3\) bits indicating whether to round up or down for each non-run characters.

For a total of 63 bits or 8 bytes, with the bulk being the \(r\log_2n\) bits for the target indices.

6.2.1 Block-based run-indices Link to heading

The largest component are the run-indices \(\chi\). But these can be compressed further, since \(\xi\) is increasing over all runs for the same character. An elias-fano structure would be space efficient, but a practical solution is to only store \(\xi\) values (for all characters) every \(b\) rows and then linearly interpolate between them, only storing the delta to the prediction. In this mode, \(\ell=10\) and \(\Delta=23\) (I think), for a total of 6 bytes.

Instead of storing \(\Delta\), a further down-sampling followed by a linear scan is also possible, bringing the space per run to just 3 bytes. (I do not yet fully understand the details here.)

6.3 STPD Link to heading

Suppose we have \(k\) similar sequences of total length \(n=k\cdot m\). In practice, the STPD has roughly \(s=\frac 23 r\) samples (see tables above).

Each sample is a text position, for a total space of \(s\log_2 n \approx r\log_2 n\), i.e., similar to the move-index. In fact, 6 bytes (48 bits) per sample is plenty, matching the block-based movi2 (but not the sampled version).

Do note though that this excludes the relative-Lempel-Ziv compressed text, as well as a 2-4 bit/element RMQ structure that selects the best match in the interval of the STPD (sparse prefix array) that matches the pattern prefix:

  • For colex-, this cam be omitted and we take the start of the interval instead.
  • For pos-, we could also iterate over the entire matching interval. Not sure if that would be slow.
  • For lex-, the RMQ structure is needed.

Unlike for movi, where the target runs are increasing and can thus be compressed, here we must sort the \(s\) integers in colex order, meaning that (a priory) we cannot compress them via something like Elias-Fano coding.

But there is hope!

6.4 Compressing STPD: STPD for pBWT Link to heading

Consider the setting where the \(k\) sequences only differ by say 1% of substitutions, and nothing else. Then we can put them below each other (as in the pBWT), and for each column where the pattern matches, we could report the first input sequence in which it matches.

Thus, we get something like this:

1
2
3
4
5
Pat: ABCDE
T1   ABCDEFGHIJX
T2   ABCDEFGHIJY
T3   ABCDEZGHIJX
T4   ABCDEZGHIXX

Here, the pattern matches a prefix of \(T_1\). Then, if we extend the pattern by Z, it does not match \(T_1\) anymore, and instead we have to jump to \(T_3\). Thus, the Z in \(T_3\) would be an STPD-sample (in the concatenation of the texts).

Thus, in every column where “things happen” we get one or more STPD samples. But in this setting, we don’t have to binary search over all samples in \(O(\log s)\)! Instead, we only have to binary search the samples in the current column, which is more like \(O(\log s/m)\), or either way at most \(O(\log k)\).

Thus, we could ideally use some perfect hash function to map active columns (with at least one sample) to a variable-length list of samples in around 2 bits per active column. Or we could store in \(\log_2 k\) bits the number of STPD samples in each active column, and then distribute columns over multiple PHFs based on the number of samples.

Then, all that’s needed is to store the subset of the \(k\) sequences that has an STPD sample in this column, which is only \(\log_2 k\) bits (around 9 for HPRCv2) rather than \(\log_2 n\) (around 44).

Thus, this scheme needs only \(m\log k + 2m +s\log_2 k\) bits.

A big restriction, though, is that this does not allow jumping between different columns: once the pattern in anchored to some column, it can only switch between rows.

Nevertheless, it feels like this should provide some possible ways to optimize the generic STPD as well: most of the time, jumping happens between ’equivalent positions’ in different genomes, and if we have some course multiple-sequence-alignment or so between them, then it might be sufficient to only store low bits most of the time to indicate ‘jump to the equivalent location in the input sequence with ID \(x\)’.

6.5 Jump Index Link to heading

  • HPRCv2: 466 copies of Gbp => 40.5 bits of text
  • At 8 bytes/run and including rc and 15% extra runs, movi2 has 47.3 GB.
    • 5.1 G runs
    • 2.55 G runs for 466 copies of fwd-only
    • 1.72 G runs for a single copy of a human genome

Initial space we need per pointer:

  • source: 41 bits
    • hide in MPHF on (pos, c) => 2 bits
    • EF to store all lengths?
    • Or EF for source locations + EF for lengths?
    • <1 byte total?
  • target: 41 bits = 5 bytes
  • LCP len: 16 bits (probably 8 + fallback) = 2 bytes

Current

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
pub struct Link {
    /// 5 bytes?
    pub source: usize,
    /// 2 bits?
    pub c: char,
    /// 2 bytes?
    pub lcp: usize,
    /// 5 bytes?
    pub target: usize,
}
  • naive: 24 bytes
  • packed: 100 bits = 12.5 bytes
    • 41 bit src
    • 2 bit char
    • 16 bit LCP
    • 41 bit target
  • EF: we have ~2G runs, so we encode the top 31 bits with only 2 bits => 71 bits each

6.5.2 Mutable Link to heading

  • Giulio’s mutable rank/select B-tree on the high part of the EF
    • maybe 20% overhead; that’s all fine
    • Does it support insertions though… It’ll need to be modified for this
  • B-tree on the low parts of the EF values, so insertion is fast
    • Needs special support for indexing
    • Split so that low parts are 64 bits??? 9-bit LCP?

TODO 6.6 Link to heading

  • Number of edges in graph on STPD samples?
    • (this is larger than just the number of links)
    • equivalent to CDAWG? Smaller?
  • Number of suffix links?

7 CDAWG Link to heading

What is the relation between suffixient sets/STPD and the CDAWG?

  • STPD is a path decomposition of the CDAWG
    • Size is the number of nodes with >1 incoming edge, ie at least one incoming edge not in the path decomposition. Up to 10x smaller than the CDAWG, see table at the top.
  • Jump Index has size equal to the number of non-path edges in the CDAWG?
    • In the table, the size of the jump index varies between 20% and 50% of the CDAWG. I’m not yet sure why it’s not always at least 50%, ie where at most half the edges are heavy-path and the other (at least) half light edges needing a jump pointer.
    • Hmm; the JI jumps to a different location depending on the length of the substring already matched, whereas the CDAWG is independent of the already-matched prefix and so is more like COLEX-? Or is this not an issue because such cases have different CDAWG nodes?
  • CDAWG is bad (\(O(n)\)) for AAAAAAAA#. But maybe fine in practice?
  • CDAWG nodes are maximal repeats \(S\), and have an equivalence class of right-maximal strings \(\{S[0..), S[1..), S[2..), \dots, S[k..)\}\) such that all of those (but the first) are not left-maximal, but \(S[k+1\dots)\) is left-maximal. (Belazzougui and Cunial 2017)
    • All strings in the equivalence class have the same range-length in the BWT and for \(i\geq 1\), all \(S[i..)\) the range is (contained in) a single run with character \(S[i-1]\). OTOH, the BWT interval for \(S[0..)\) contains at least two characters.
    • All paths to a node in the CDAWG are a suffix of one another.
  • The in-neighbours of a node exactly partition the range of ‘active lengths’.

CDAWG:

  • A node for every maximal repeat, that is: a substring \(S\) that is repeated and can be extended both left and right in at least 2 ways. Or equivalently, \(aS\) and \(Sa\) occur stricly less than \(S\) for all symbols \(a\).
  • Space: number of right-extensions of maximal repeats

“Composite Repetition-Aware Data Structures” (Belazzougui et al. 2015)

  • (google scholar citations to this paper give a good source of papers)
  • introduces \(e\), the number of arcs in the CDAWG
  • Index combining run-length BWT with CDAWG

“Representing the Suffix Tree with the Cdawg” (Belazzougui and Cunial 2017)

  • Heavy-path decomposition of CDAWG; supports additional suffix-tree operations

7.1 Further papers Link to heading

“Suffix Trees, Dawgs and Cdawgs for Forward and Backward Tries” (Inenaga 2020)

  • Takes as input not a string, but a suffix tree or so.

“The Cdawg Index and Pattern Matching on Grammar-Compressed Strings” (Cleary et al. 2024)

  • Builds a CDAWG on grammars; 4x smaller than the original CDAWG

“Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures” (Navarro 2021b)

“Indexing Highly Repetitive String Collections, Part Ii: Compressed Indexes” (Navarro 2021a)

8 Fast construction Link to heading

  • Goal: \(O( r)\) space; \(O(n)\) is wasteful
    • BWT+LCP in \(O( r)\) is possible (?)
  • Skip constructing SA on repetitive parts?
    • Simply dynamically add them and don’t process them at all?
    • For each mutation find the longest extension to left/right to make it unique. Then only process that window?

9 HPRCv2 Link to heading

Get the AGC (assembled genomes compressor) version as described here (3.3 GB .agc):

1
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/submissions/B4174A5F-F20E-4DCF-8470-F8A907B640BC--HPRCv2_0.6.1_pr_agc_submission/HPRC_r2_assemblies_0.6.1.agc

Or get all of them as alignments against a reference from here (5.2 GB .paf.gz).

Rust AGC api: https://docs.rs/ragc-core/0.1.1/ragc_core/

10 Questions Link to heading

  • Why is the number of jumps so close to \(r\)
  • data layout
    • dynamic vs static implementation?
  • lg(k) instead of lg(n) bits per jump?
  • suffix links?

10.1 Notes Link to heading

  • BML: faster MEM finding
  • moni align
  • phi (and inverse) also give LCP values
  • classic: find all occurrences of all MEMs, but not what is needed in practice
    • 100 matches to 3 locations is not super useful
    • Only care about distinct locations in pangenome graphs
      • Paper Joni + Paten, WABI 2025: tag array
      • only store distinct mapping locations for runs
  • Latin 2025 paper Gonzalo, Joni, Travis
  • Jannick Olbrich: MSA
    • sample runs; how many columns you move left after LF mapping
  • Heng:
    • vg-surject: maps alignment to graph back to standard reference
    • tag array of vg-surject values
  • Goal: mix tag arrays with suffixient sets
    • Problem: these don’t give BWT interval
    • Can only enumerate the interval; not “RMQ” query it
    • not count query, but interval query
  • lg(n) vs lg(h)
  • movi: r lg n, but r lg(n/r)
    • move structure: row-major
    • compression: col-major
  • paper from Roberto’s festschrift
  • SMEM: MEM aligned by pBWT columns
  • suffixient set for findings MEMs; then r-index to locate all occurrences?
  • SPIRE paper:
    • Heng: grlBWT; LMS grammar
    • Ben: PFP parsing
  • z-fast trie vs centroid decomposition
  • WABI 22 ??

Demo:

  • patterns up to length 20
    • obvious that most stuff is useless

10.2 Jump Index Link to heading

  • dynamic construction
    • quickly discard existing regions
  • dynamic predecessor structure is all that’s needed
    • log(u/n) ideally
  • similar to ukkonen’s SA construction
  • Store lg(h) instead of lg(n), relative to MSA
    • all relative to reference?
    • break each pointer into 2 pieces:
      • this genome lg(h)
      • roughly this position lg(w)
    • if mapping is monotone; use checkpoints
  • Can enumerate all occurrences via DFS
  • Associate TAG array information with jump links?
    • But we still
  • Can we do counts?
  • Draw as automaton
    • What happens when we add a copy?
    • How many states do we gain?

11 Shrinking practical memory usage Link to heading

  • SA: 24GB = 8B * 3G, because libsais needs i64 for \(2^{31.5}\) elements
  • PLCP: 24GB
  • Shrink SA to u32 for 12GB
  • Shrink PLCP to u16 for 6GB
  • Build LCP, also 6GB
  • Peak so far: 48GB

New:

  • Text: 1 B -> 3 GB
  • BWT: 1 B -> 3 GB
  • SA: 32 bit = 4 B -> 12 GB
  • LCP: 32 bit = 4 B -> 12 GB
  • Dropped range boundaries
  • RMQ via segtree instead of sparse table
  • SA RMQ: 1.5 b -> 500 MB
  • LCP RMQ: 1.5 b -> 500 MB

Total: 31GB

  • 32 byte links: ~23GB for ~1/3rd of the data

  • 16 byte links: ~24GB for 560/750 intervals = 3/4th of the data

  • Replace LCP by Compact PLCP version

    • 12.4 GB => 0.8 GB
  • 20 GB now: 3+3+12+0.5+0.5+0.8

    • 40 GB left for links :)
  • collecting links is SLOW because of RMQ requiring up to 128 select queries…

    • 36 GB before dedup; ~56 GB total
    • 27 GB after dedup; ~47 GB total
  • POS- for human genome in 9.5min wall time (576s) on 6 cores / 12 threads.

nrCDAWG nodesCDAWG edges#(src)#(src,c)#(links)
human3 117 292 0321 702 564 7361 437 127 2963 741 263 872815 311 8081 345 582 5921 697 342 720

Max LCP: 3111690 => 22 bits

  • links can fit in 12 bytes, but that makes sorting them much much slower.
  • EF compresses links from 27 GB to 13 GB!
    • MPHF would make it even smaller but need something to counter false positives.

References Link to heading

Becker, Ruben, Davide Cenzato, Travis Gagie, Ragnar Groot Koerkamp, Sung-Hwan Kim, Giovanni Manzini, and Nicola Prezza. 2025. “Compressing Suffix Trees by Path Decompositions.” arXiv; arXiv. https://doi.org/10.48550/ARXIV.2506.14734.
Belazzougui, Djamal, Fabio Cunial, Travis Gagie, Nicola Prezza, and Mathieu Raffinot. 2015. “Composite Repetition-Aware Data Structures.” In Combinatorial Pattern Matching, 26–39. Springer International Publishing. https://doi.org/10.1007/978-3-319-19929-0_3.
Belazzougui, Djamal, and Fabio Cunial. 2017. “Representing the Suffix Tree with the Cdawg.” In Lipics, Volume 78, Cpm 2017, 78:7:1–7:13. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPICS.CPM.2017.7.
Cleary, Alan M., Joseph Winjum, Jordan Dood, and Shunsuke Inenaga. 2024. “The Cdawg Index and Pattern Matching on Grammar-Compressed Strings.” Arxiv. https://doi.org/10.48550/ARXIV.2407.08826.
Inenaga, Shunsuke. 2020. “Suffix Trees, Dawgs and Cdawgs for Forward and Backward Tries.” In Latin 2020: Theoretical Informatics, 194–206. Springer International Publishing. https://doi.org/10.1007/978-3-030-61792-9_16.
Navarro, Gonzalo. 2021a. “Indexing Highly Repetitive String Collections, Part Ii: Compressed Indexes.” Acm Computing Surveys 54 (2): 1–32. https://doi.org/10.1145/3432999.
———. 2021b. “Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures.” Acm Computing Surveys 54 (2): 1–31. https://doi.org/10.1145/3434399.
Zakeri, Mohsen, Nathaniel K. Brown, Omar Y. Ahmed, Travis Gagie, and Ben Langmead. 2024. “Movi: A Fast and Cache-Efficient Full-Text Pangenome Index.” Iscience 27 (12): 111464. https://doi.org/10.1016/j.isci.2024.111464.
Zakeri, Mohsen, Nathaniel K. Brown, Travis Gagie, and Ben Langmead. 2025. “Movi 2: Fast and Space-Efficient Queries on Pangenomes.” Biorxiv, October. https://doi.org/10.1101/2025.10.16.682873.