STPD: Compressing Suffix Trees by Path Decompositions

The (lost?) art of suffix trees of repetitive texts

Ruben Becker, Davide Cenzato, Travis Gagie, Ragnar {Groot Koerkamp},
Sung-Hwan Kim, Giovanni Manzini, Nicola Prezza

RECOMB-Seq 2026, Thessaloniki

curiouscoding.nl/slides/stpd/slides https://doi.org/10.48550/ARXIV.2506.14734

Text indexing: Locate all matches of a pattern

stpd-all-matches.svg

Focus: Locate leftmost match of a pattern

stpd-one-match.svg

FM/r-index: cache miss per character

stpd-fm-index.png

Suffix trees of repetitive text are repetitive too!

Suffixient arrays, Cenzato+

stpd-suffix-tree-repetitive.png

Suffix tree: \(O(d + m/B + \mathit{occ})\) I/O complexity

  • \(m\): pattern length
  • \(B\): block size
  • \(d\): depth in suffix tree, usually small
  • \(r\): runs in Burrows Wheeler Transform (BWT)

Goal: \(O(r)\) space, \(O(d+m/B+\mathit{occ})\) I/Os

STPD: \(O(r)\) space, \(O(d\color{red}{\cdot \log m}+m/B+\mathit{occ})\) I/Os

stpd-sa.svg

Results: Small & fast (\(m=15\))

Locate one 19x chr19 Locate all

stpd-results-15.png

Results: Small & fast (\(m=150\))

Locate one 19x chr19 Locate all

stpd-results-150.png

Stringology time!

Definition 1 Right maximal substring (RMS).

Substring x for which at least two extensions xA and xB exist in \(T\). Aka: suffix-tree nodes.

stpd-rms.svg

Stringology time!

Definition 2 Right maximal substring (RMS).

Substring x for which at least two extensions xA and xB exist in \(T\). Aka: suffix-tree nodes.

Definition 3 Right maximal extension (RME).

Extension xA of an RMS x. Aka: suffix-tree edges.

stpd-rme.svg

Suffixient set

Definition 4 Suffixient set.

A set of text positions \(S\) sorted in co-lex order, such that for each RME xA, there is \(i\in S\) s.t. \(T[.. i)\) ends in xA.

\(|S|\leq 2r\) is possible. Note that access to the (compressed) text is needed.

stpd-prefix-array.svg

Suffixient set queries

Definition 5 Query algorithm.

Start greedily matching at start of \(T\). Say that prefix \(P[..j)\) of \(P\) matches.

If the extension \(P[..j]\) occurs in \(T\), it is an RME, and we can binary search a prefix from \(S\) that ends in it. Jump there and repeat.

stpd-query-1.svg

Suffixient set queries

Definition 6 Query algorithm.

Start greedily matching at start of \(T\). Say that prefix \(P[..j)\) of \(P\) matches.

If the extension \(P[..j]\) occurs in \(T\), it is an RME, and we can binary search a prefix from \(S\) that ends in it. Jump there and repeat.

stpd-query-2.svg

Suffixient set queries

Definition 7 Query algorithm.

Start greedily matching at start of \(T\). Say that prefix \(P[..j)\) of \(P\) matches.

If the extension \(P[..j]\) occurs in \(T\), it is an RME, and we can binary search a prefix from \(S\) that ends in it. Jump there and repeat.

stpd-query-3.svg

Suffixient set queries

Definition 8 Query algorithm.

Start greedily matching at start of \(T\). Say that prefix \(P[..j)\) of \(P\) matches.

If the extension \(P[..j]\) occurs in \(T\), it is an RME, and we can binary search a prefix from \(S\) that ends in it. Jump there and repeat.

stpd-query-4.svg

Suffixient set

Definition 9 Suffixient set (repeated).

A set of text positions \(S\) sorted in co-lex order, such that for each RME xA, there is \(i\in S\) s.t. \(T[.. i)\) ends in xA.

Note that access to the (compressed) text is needed.

stpd-prefix-array.svg

STPD

Definition 10 STPD.

The set of text positions \(S\), such that for each RME xA that is not the lefmost occurrence of x, there is \(i\in S\) s.t. \(T[.. i)\) is the lefmost occurrence of xA.

stpd-stpd.svg

Larger lexicographic example

stpd-fig1.png

Sizes on Pizza & Chili repetitive corpus

(in millions) influenza cere escherichia
n 155.0 461.0 112.0
BWT runs \(r\) 3.0 11.6 15.0
suffixient set \(\chi\leq 2r\) 2.2 9.9 13.1
STPD: leftmost 1.9 8.9 11.7
STPD: lex min \(\leq r\) 1.8 7.5 9.8

Conclusion

  • \(O(r)\) space alongside the text-oracle, in practice less than r-index
  • Cache-friendly lookups, 30× faster than r-index

Outlook:

  • Incremental construction, allowing to grow the text
  • Direct pointer-based navigation on the text works great too???

Pointers

stpd-pointers.png