Tools for suffix array searching

Let’s summarize some tools for efficiently searching suffix arrays.

1 Sapling

Sapling (Kirsche, Das, and Schatz 2020) works as follows:

  1. Choose a parameter \(p\) store for each of the \(2^p\) $p$-bit prefixes the corresponding position in the suffix array.
  2. When querying, first find the bucket for the query prefix. Then do a linear interpolation inside the bucket.
  3. Search the area \([-E, +E]\) around the interpolated position, where \(E\) is a bound on the error of the linear approximation. In practice \(E\) is only a $95\%$-confidence bound, and if the true value is not in the range, a linear search with steps of size \(E\) is done.

The paper also introduces a neural network approach to approximating buckets, but this takes over a day to learn and is slower to query in practice.

Results:

  • The prefix table gives up to \(3\times\) speedup over plain binary search.
  • The binary search implementation is relatively native and extends/compares one character at a time.
    • Currently the LCP function takes a string instead of a string&, although the copy is surely optimized away.
    • A non-recursive version could be faster, but may just be equivalent to the tail-recursive version.

2 PLA-Index

The PLA-Index (Abrar and Medvedev 2024) takes the ideas from Sapling a step further.

  • Sapling’s interpolation can fail when the suffix distribution is very skewed.
  • Thus, the PLA-index uses variable x-coordinates.

3 LISA: learned index

https://www.biorxiv.org/content/10.1101/2020.12.22.423964v3

References

Abrar, Md. Hasin, and Paul Medvedev. 2024. “Pla-Complexity of $k$-Mer Multisets,” February. https://doi.org/10.1101/2024.02.08.579510.
Kirsche, Melanie, Arun Das, and Michael C Schatz. 2020. “Sapling: Accelerating Suffix Array Queries with Learned Data Models.” Edited by Robinson Peter. Bioinformatics 37 (6): 744–49. https://doi.org/10.1093/bioinformatics/btaa911.