Let’s summarize some tools for efficiently searching suffix arrays.
1 Sapling
Sapling (Kirsche, Das, and Schatz 2020) works as follows:
- Choose a parameter \(p\) and store, for each of the \(2^p\) \(p\)-bit prefixes, the corresponding position in the suffix array.
- When querying, first find the bucket for the query's prefix, then interpolate linearly inside the bucket.
- Search the range \([-E, +E]\) around the interpolated position, where \(E\) is a bound on the error of the linear approximation. In practice \(E\) is only a \(95\%\)-confidence bound, and if the true value is not in the range, a linear search with steps of size \(E\) is done. (A sketch of this lookup scheme follows the list.)
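Below is a minimal sketch of such a lookup in C++. Everything here, including the names `bucket_start` and `sapling_like_lookup`, the 2-bit encoding, and the parameter layout, is my own assumption for illustration; it is not Sapling's actual code, and the step-by-\(E\) fallback is omitted.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Sketch of a Sapling-style lookup, under these assumptions (not the paper's API):
// - `text` is over {A,C,G,T} and `sa` is its suffix array;
// - `bucket_start` has 2^p + 1 entries; bucket_start[b] is the first suffix-array
//   position whose suffix, encoded on its first k bases (p <= 2k), has top p bits >= b;
// - `E` bounds the error of the linear interpolation.
uint32_t sapling_like_lookup(const std::string& text,
                             const std::vector<uint32_t>& sa,
                             const std::vector<uint32_t>& bucket_start,
                             uint32_t p, uint32_t k, uint32_t E,
                             const std::string& query) {
    // Encode the first k bases of the query as a 2k-bit integer, with A<C<G<T.
    auto code = [](char c) -> uint64_t {
        switch (c) { case 'A': return 0; case 'C': return 1; case 'G': return 2; default: return 3; }
    };
    uint64_t x = 0;
    for (uint32_t i = 0; i < k; ++i)
        x = (x << 2) | (i < query.size() ? code(query[i]) : 0);

    // Bucket = top p bits; interpolate linearly on the remaining 2k - p bits.
    uint64_t span = uint64_t(1) << (2 * k - p);
    uint64_t b    = x >> (2 * k - p);
    uint64_t rem  = x & (span - 1);
    uint64_t lo = bucket_start[b], hi = bucket_start[b + 1];
    uint64_t pred = lo + (hi - lo) * rem / span;

    // Only binary search the window [pred - E, pred + E] for the first suffix >= query.
    // The linear fallback for when the true position lies outside the window is omitted.
    uint64_t wlo = pred > E ? pred - E : 0;
    uint64_t whi = std::min<uint64_t>(pred + E + 1, sa.size());
    auto it = std::lower_bound(sa.begin() + wlo, sa.begin() + whi, query,
        [&](uint32_t suffix, const std::string& q) {
            return text.compare(suffix, q.size(), q) < 0;
        });
    return uint32_t(it - sa.begin());
}
```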
The paper also introduces a neural network approach to approximating buckets, but this takes over a day to learn and is slower to query in practice.
Results:
- The prefix table gives up to \(3\times\) speedup over plain binary search.
- The binary search implementation is relatively naive and extends/compares one character at a time; a sketch of this style of search follows the list.
- Currently the LCP function takes a `string` instead of a `string&`, although the copy is surely optimized away.
- A non-recursive version could be faster, but may just be equivalent to the tail-recursive version.
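For concreteness, here is a minimal sketch (my own illustration, not Sapling's code) of a character-at-a-time suffix-array binary search, with the LCP helper comparing one character per step and taking its arguments by reference:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Length of the longest common prefix of text[suffix..] and query,
// computed one character at a time.
static size_t lcp(const std::string& text, size_t suffix, const std::string& query) {
    size_t i = 0;
    while (suffix + i < text.size() && i < query.size() && text[suffix + i] == query[i])
        ++i;
    return i;
}

// First suffix-array position whose suffix is >= query (plain binary search).
size_t sa_lower_bound(const std::string& text,
                      const std::vector<uint32_t>& sa,
                      const std::string& query) {
    size_t lo = 0, hi = sa.size();
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        size_t l = lcp(text, sa[mid], query);
        // Decide the branch from the first mismatching character (or a suffix
        // that ends before the query does, which compares as smaller).
        bool suffix_less = l < query.size() &&
            (sa[mid] + l == text.size() || text[sa[mid] + l] < query[l]);
        if (suffix_less) lo = mid + 1; else hi = mid;
    }
    return lo;
}
```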
2 PLA-Index
The PLA-Index (Abrar and Medvedev 2024) takes the ideas from Sapling a step further.
- Sapling’s interpolation can fail when the suffix distribution is very skewed.
- Thus, the PLA-Index uses a piecewise linear approximation whose segment breakpoints have variable x-coordinates, so the segments adapt to the data instead of using fixed-width buckets. (A sketch of the idea follows.)
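The following is only a sketch of the general piecewise-linear-approximation idea; the `Segment` layout, the name `PlaSketch`, and the error bound `eps` are my own assumptions and not the PLA-Index's actual data structure or construction:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One linear segment of the approximation: it starts at a data-dependent key
// (the "variable x-coordinate") and predicts suffix-array positions along a line.
struct Segment {
    uint64_t first_key;  // encoded k-mer where this segment starts
    double   intercept;  // predicted position at first_key
    double   slope;      // positions per key unit
};

struct PlaSketch {
    std::vector<Segment> segments;  // sorted by first_key; assumed non-empty,
                                    // with segments.front().first_key == 0
    uint32_t eps;                   // assumed max prediction error

    // Predict the position of `key`; the caller then only needs to search the
    // window [pred - eps, pred + eps] of the suffix array.
    uint64_t predict(uint64_t key) const {
        // Last segment whose first_key <= key.
        auto it = std::upper_bound(segments.begin(), segments.end(), key,
            [](uint64_t k, const Segment& s) { return k < s.first_key; });
        const Segment& s = *(it - 1);
        double pred = s.intercept + s.slope * double(key - s.first_key);
        return pred < 0 ? 0 : uint64_t(pred);
    }
};
```

Because the breakpoints are placed where the data bends, a skewed key distribution costs extra segments only locally, rather than blowing up the error inside a fixed-width bucket.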
3 LISA: learned index
https://www.biorxiv.org/content/10.1101/2020.12.22.423964v3
References
Abrar, Md. Hasin, and Paul Medvedev. 2024. “PLA-Complexity of \(k\)-Mer Multisets,” February. https://doi.org/10.1101/2024.02.08.579510.
Kirsche, Melanie, Arun Das, and Michael C. Schatz. 2020. “Sapling: Accelerating Suffix Array Queries with Learned Data Models.” Edited by Peter Robinson. Bioinformatics 37 (6): 744–49. https://doi.org/10.1093/bioinformatics/btaa911.