These are some notes on Wheeler DFAs after talking to Nicola Prezza and others from the RAVEN lab at DSB 2026 in Venice.
1 Deterministic Finite Automaton (DFA) Link to heading
A DFA is a graph where edges are labelled by characters. Each node can have at most one outgoing edge with each label. (Otherwise it would be non-deterministic.)
Each node has a set of paths/walks ending there, and each of those paths spells a (possibly left-infinite) string.
2 Wheeler-DFA Link to heading
A Wheeler-DFA is a DFA, together with an order \(\preceq\) on the nodes, that satisfies two properties:
- If \(u\) is a node and \(u\xrightarrow{a} x\) and \(u\xrightarrow{b} y\) are two edges labelled with characters \(a\leq b\), then \(x\preceq y\).
- If \(u\preceq v\) are two nodes with outgoing edges \(u\xrightarrow{a}x\) and \(v\xrightarrow{a}y\), then \(x\preceq y\).
In practice, these axioms have a strong implication1:
- \(\preceq\) sorts the set of all nodes, such that for any \(u\preceq v\), all paths ending in \(u\) are co-lex smaller than all paths ending in \(v\).
If we represent a node by a path ending there with spelling \(\alpha\), the first axiom states that \(\alpha a \leq \alpha b\) (in co-lex order), and the second axiom states that \(\alpha a \leq \beta a\) when \(\alpha \leq \beta\). Both of these are of course completely natural, and apparently sufficient as well to fully capture the ordering of all strings.
3 Linear graphs: Prefix array Link to heading
To start with, let’s consider a plain linear text. From the graph perspective, a text of \(n\) characters has \(n+1\) nodes with \(n\) labelled transitions between them from each to the next. The prefix array on the text corresponds to a permutation of the nodes, such that the nodes are co-lex sorted (from right to left) by the value of the string ending in the node (ie the path to the node). And indeed, that order is exactly the order \(\preceq\).
Figure 1: Co-lex sorted prefixes of BANANA (i.e., the prefix array), drawn as DFAs.
4 De Bruijn graphs are Wheeler Link to heading
For example, all De Bruijn graphs are Wheeler. There is exactly one node for each present k-mer, and all paths in ending in that node end in spelling the k-mer. Thus, the nodes naturally partition the set of all paths by their last \(k\) characters. It follows that they can be sorted using the co-lex order on k-mers.
Figure 2: In a De Bruijn graph, all paths to a note labelled by a given k-mer end by spelling this k-mer.
5 Not all DFAs are Wheeler Link to heading
Many practically relevant graphs, such as for example variation graphs, are not Wheeler. But every graph can be “Wheelerized”. A trivial way to do this is to duplicate every node for each incoming edge (where each copy has 1 incoming edge, and the full set of outgoing edges). Or effectively the same, we can duplicate the entire subgraph for each node with multiple incoming edges, so that a tree remains. Then, every node has exactly one unique incoming path, and these can trivially be sorted.
A more conservative way is this: as long as the axioms are violated, only split violating nodes into two copies with the same set of outgoing edges but each only a subset of the incoming edges. This process will always terminate, and can be done in such a way to create the unique minimal Wheeler-DFA corresponding to the input. Unfortunately, though, this may still require an exponential number of steps.
An example is given below.
Figure 3: Wheelerizing a variation graph (top). Middle: by lazily making it into a tree. Bottom: by only expanding nodes as needed.
We can co-lex sort the paths in the graph at the top as follows, labelled by their final node:
| |
The graph is not Wheeler, because there are multiple nodes (3 and 4, marked with a star) whose corresponding paths do not form an interval. By turning the graph into a tree, each path has its own node and the Wheeler condition holds trivially. The optimal Wheelerization of this graph duplicates only nodes 3 and 4, as shown in the bottom of the figure.
My intuition is this: any time we want to merge two paths, we have to first
ensure that they are sufficiently similar, in that they are consecutive in the
set of all co-lex sorted paths. Thus, in a random text we would have to
“right-extend” the two paths for around \(\lg n\) steps until they become
‘sufficiently similar’ again. In a variation graph, the situation is more
tricky: if another copy of the path following the variation is present somewhere
else in the graph, we may have to extend until the two copies diverge. For
example, if there is a variation like {A,C}UVWXYZ and an occurrence of
BUVWXYA somewhere else, then we have to split the paths (duplicate nodes) to
get {AUVWXY,CUVWXY}Z, since only after appending Z do the two paths become
“more similar” to each other than to the secondary BUVWXYA.
6 Locating patterns via binary search Link to heading
We can test if a string is present in a text by binary searching the prefix array, and then returning the positions in the interval that match. Similarly, we can binary search the nodes in a Wheeler-DFA. One option is to store for each node two (out of possibly many) incoming edges, corresponding to the smallest and largest path ending in the node. This graph will have at most \(2n\), and co-lex comparisons can be made via a ‘backwards’ traversal.
This narrows down the interval for a pattern right-to-left.
7 Locating patterns via the BOSS table Link to heading
The BOSS table (Bowe et al. 2012) extends the idea of the LF-mapping in an FM-index to De Bruijn graphs, and the same technique works on Wheeler graphs. We can extend the pattern one-by-one on the right (not left, since we use the prefix rather than suffix array), and find the interval of nodes with paths ending in this pattern via an LF-like step.
To do this, we make a table entry for each outgoing edge, and track the boundaries between nodes. For the last graph in Figure 3, we get
| |
\(\newcommand{\occ}{\mathsf{occ}}\).
To match eg BAAB, we do the usual process:
- Start with the full interval \([0, 9)\).
- Count the number of
Binnextto the start and end of the interval: 0 and 4. - Update to \(\occ[B] + [0,4) = [5, 9)\)
- Count
Ato start and end of range: \([1, 4)\) - Update to \(\occ[A] + [1,4) = [2, 5)\)
- Count
Ato start and end of range: \([0, 1)\) - Update to \(\occ[A] + [0,1) = [1, 2)\)
- Count
Bto start and end of range: \([1, 2)\) - Update to \(\occ[B] + [1, 2) = [6, 7)\)
And indeed, index \(6\) corresponds to BAAB.
References Link to heading
I’m not sure if this implies that \(\preceq\) must be exactly the total order as written here, or just that such a \(\preceq\) exists. ↩︎