Tools for suffix array searching

Table of Contents

1 Sapling
2 PLA-Index
3 LISA: learned index

Let’s summarize some tools for efficiently searching suffix arrays.

1 Sapling Link to heading

Sapling (Kirsche, Das, and Schatz 2020) works as follows:

Choose a parameter $p$ store for each of the $2^p$ $p$-bit prefixes the corresponding position in the suffix array.
When querying, first find the bucket for the query prefix. Then do a linear interpolation inside the bucket.
Search the area $[-E, +E]$ around the interpolated position, where $E$ is a bound on the error of the linear approximation. In practice $E$ is only a $95\%$-confidence bound, and if the true value is not in the range, a linear search with steps of size $E$ is done.

The paper also introduces a neural network approach to approximating buckets, but this takes over a day to learn and is slower to query in practice.

Results:

The prefix table gives up to $3\times$ speedup over plain binary search.
The binary search implementation is relatively native and extends/compares one character at a time.
- Currently the LCP function takes a string instead of a string&, although the copy is surely optimized away.
- A non-recursive version could be faster, but may just be equivalent to the tail-recursive version.

2 PLA-Index Link to heading

The PLA-Index (Abrar and Medvedev 2024) takes the ideas from Sapling a step further.

Sapling’s interpolation can fail when the suffix distribution is very skewed.
Thus, the PLA-index uses variable x-coordinates.

3 LISA: learned index Link to heading

https://www.biorxiv.org/content/10.1101/2020.12.22.423964v3

References Link to heading

Abrar, Md. Hasin, and Paul Medvedev. 2024. “Pla-Complexity of $k$-Mer Multisets,” February. https://doi.org/10.1101/2024.02.08.579510.

Kirsche, Melanie, Arun Das, and Michael C Schatz. 2020. “Sapling: Accelerating Suffix Array Queries with Learned Data Models.” Edited by Robinson Peter. Bioinformatics 37 (6): 744–49. https://doi.org/10.1093/bioinformatics/btaa911.