1 De Bruijn graph
Consider an edge-centric De Bruijn graph, where each edge corresponds to a k-mer, and nodes are the \(k-1\) overlaps between adjacent k-mers. In the figures, all edges are directed towards the right.
2 k-mers
The goal is now to store all edges / k-mers of the graph efficiently. A spectrum preserving string set (SPSS) is a set of strings whose k-mers are the k-mers of the input graph, that does not contain duplicate k-mers (Rahman and Medvedev 2020).
The most simple solution is to just store all k-mers individually.
3 Unitigs
The non-branching paths in a the graph. These can be stored as compressed strings, but are shown as k-mer-by-k-mer paths in the graph.
4 Simplitigs
Adjacent unitigs overlap in \(k-1\) characters, so it’s more efficient to store them adjacently (Břinda, Baym, and Kucherov 2021). One way to construct simplitigs is by simply greedily extending paths, and the resulting strings are called greedy simplitigs.
It turns out that an optimal decomposition into simplitigs can be found in linear time by considering an Eulerian circuit. The resulting optimal simplitigs are called eulertigs (Schmidt and Alanko 2022).
5 Matchtigs
We can achieve more compression by joining unitigs ‘at a distance’, and duplicating some in-between k-mers (Schmidt et al. 2021).
6 Masked superstrings
We can achieve even more compression by introducing some new k-mers during the overlapping process, and then mask these out (shown as dashed) (Sladký, Veselý, and Břinda 2023).