Table of Contents

Let’s summarize some tools for efficiently searching suffix arrays.

## 1 Sapling

Sapling (Kirsche, Das, and Schatz 2020) works as follows:

- Choose a parameter \(p\) store for each of the \(2^p\)
**$p$-bit prefixes**the corresponding position in the suffix array. - When querying, first find the bucket for the query prefix. Then do a
**linear interpolation**inside the bucket. - Search the area \([-E, +E]\) around the interpolated position, where \(E\) is a bound on the error of the linear approximation. In practice \(E\) is only a $95\%$-confidence bound, and if the true value is not in the range, a linear search with steps of size \(E\) is done.

The paper also introduces a neural network approach to approximating buckets, but this takes over a day to learn and is slower to query in practice.

**Results:**

- The prefix table gives up to \(3\times\) speedup over plain binary search.
- The binary search implementation is relatively native and extends/compares one
character at a time.
- Currently the LCP function takes a
`string`

instead of a`string&`

, although the copy is surely optimized away. - A non-recursive version could be faster, but may just be equivalent to the tail-recursive version.

- Currently the LCP function takes a

## 2 PLA-Index

The PLA-Index (Abrar and Medvedev 2024) takes the ideas from Sapling a step further.

- Sapling’s interpolation can fail when the suffix distribution is very skewed.
- Thus, the PLA-index uses variable x-coordinates.

## 3 LISA: learned index

https://www.biorxiv.org/content/10.1101/2020.12.22.423964v3

## References

Abrar, Md. Hasin, and Paul Medvedev. 2024. “Pla-Complexity of $k$-Mer Multisets,” February. https://doi.org/10.1101/2024.02.08.579510.

Kirsche, Melanie, Arun Das, and Michael C Schatz. 2020. “Sapling: Accelerating Suffix Array Queries with Learned Data Models.” Edited by Robinson Peter.

*Bioinformatics*37 (6): 744–49. https://doi.org/10.1093/bioinformatics/btaa911.