Linear-time suffix array construction

These are some notes about linear time suffix array (SA) construction algorithms (SACA’s).

History of suffix array construction algorithms:

  • 1990 first algorithm: Manber and Myers (1993)
  • 2002 small/large suffixes, explained below: Ko and Aluru (2005)
  • 2009 recursion only on LMS suffixes: Nong, Zhang, and Chan (2009)

These slides from Stanford are a nice reference for the last algorithm.

Notation

  • String \(S = s_0\dots s_{n-1}s_n\).
  • $s_n=\$$ is the sentinel, smaller than all other characters.
  • Suffix from \(i\): \(S_i := s_i\dots s_n\).
  • \(i\) indexes the string \(S\).
  • \(j\) indexes the suffix array \(A\).
  • Let \(P_i\) be the position of suffix \(S_i\) in the suffix array.
    • \(A_{P_i} = i\) and \(P_{A_j}=j\) by definition.
  • The previous-suffix and next-suffix of \(S_i\) are \(S_{i-1}\) and \(S_{i+1}\) respectively.

Small and Large suffixes

  • Suffix \(S_i\) is small (S-suffix) when \(S_i < S_{i+1}\). \(S_n\) is small by definition.
  • Suffix \(S_i\) is large (L-suffix) when \(S_i > S_{i+1}\).
  • Equality isn’t possible because of the sentinel.

Then by definition:

  • when \(S_i\) is small: \(P_i < P_{i+1}\), ie S-suffixes come before their next-suffix.
  • when \(S_i\) is large: \(P_i > P_{i+1}\), ie L-suffixes come after their next-suffix.

Lemma 1: When an S-suffix and L-suffix start with the same character \(c\), the L-suffix comes before the S-suffix in the suffix array.

Proof:

  • The S-suffix is of the form $c^x d …$, with \(d>c\) and \(x>0\).
  • The L-suffix is of the form $c^y b …$, with \(b<c\) and \(y>0\).
  • Since \(b<c<d\), no matter what \(x\) and \(y\) are, the L-suffix is less than the S-suffix.

Building the suffix array from a smaller one

It turns out we can build the SA of \(S\) once we know the SA-order of all small suffixes.

Lemma 2: The set of L-suffixes $\{S_i : s_i =c \text{ and $S_i$ is large }\}$ starting with the same character \(c\) occurs in the same order in the SA as the corresponding next-suffixes \(S_{i+1}\).

Proof: This follows simply from the fact that a set of strings starting with \(c\) can be sorted by only comparing them from the second character onward.

Extension algorithm: This gives the following algorithm to build \(A\) from the partial suffix array of small suffixes:

  1. Start by initializing a bucket for each character \(c\) in \(S\): Determine the position in \(A\) of the first and last suffix starting with \(c\) by counting the number of characters in \(S\) that are \(<c\) and \(\leq c\) respectively.
  2. By Lemma 1, the S-suffixes starting with character \(c\) come after the L-suffixes starting with \(c\), and hence can be copied in their already-sorted order to the end of bucket \(c\).
  3. Now we iterate over the suffix array position \(j\) from \(0\) to \(n\) to fill in the large suffixes in the right places. Recall that \(S_{A_j}\) is the suffix at position \(j\) of the suffix array, and \(S_{A_j-1}\) is its previous-suffix. Also, \(A_0 = n\) by definition, since $S_n = \$$ is minimal.
  4. If the previous-suffix \(S_{A_j-1}\) is small, it is smaller than \(S_{A_j}\) and hence already positioned at a smaller position \(j’ = P_{A_j-1} < j\).
  5. If the previous-suffix \(S_{A_j-1}\) is large, this is the next-smallest suffix starting with character \(c=s_{A_j-1}\). Indeed, by Lemma 2, any smaller L-suffix starting with \(c\) has a smaller next-suffix, which has been processed already (seen at position \(< j\)), while any larger L-suffix has a larger next-suffix, which has not been processed already (not seen at position \(\leq j\)).
    Write \(A_j-1\) to the first empty slot in the bucket for \(c\).

Once \(j\) reaches \(n\), this algorithm has completely filled in \(A\). Note that it can never run into empty slots.1

Visualization

I wrote a small visualizer for the algorithm above, see this repo and this post. You can use it to visualize the algorithm for any input string.

Figure 1: Constructing the complete suffix array starting after step 2 above, with the small suffixes already filled at the ends of their buckets.

Figure 1: Constructing the complete suffix array starting after step 2 above, with the small suffixes already filled at the ends of their buckets.

References

Ko, Pang, and Srinivas Aluru. 2005. “Space Efficient Linear Time Construction of Suffix Arrays.” Journal of Discrete Algorithms 3 (2–4): 143–56. https://doi.org/10.1016/j.jda.2004.08.002.
Manber, Udi, and Gene Myers. 1993. “Suffix Arrays: A New Method for on-Line String Searches.” Siam Journal on Computing 22 (5): 935–48. https://doi.org/10.1137/0222058.
Nong, Ge, Sen Zhang, and Wai Hong Chan. 2009. “Linear Suffix Array Construction by Almost Pure Induced-Sorting.” 2009 Data Compression Conference, March. https://doi.org/10.1109/dcc.2009.42.

  1. Proof omitted. ↩︎