Suffix-Array on CuriousCoding
https://curiouscoding.nl/tags/suffix-array/
Recent content in Suffix-Array on CuriousCoding
Hugo
en
Sat, 05 Oct 2024 00:00:00 +0200

A lemma on suffix array searching
https://curiouscoding.nl/posts/suffix-array-searching-lemma/
Sat, 05 Oct 2024 00:00:00 +0200

Table of Contents
1 Suffix arrays
2 Searching methods
2.1 Naive \(O(|P|\cdot \lg_2 n)\) search
2.2 Faster \(O(|P|\cdot \lg_2 n)\) search
2.3 LCP-based \(O(|P| + \lg_2 n)\) search
3 Analysing the faster search

We’ll prove that the “faster” binary search algorithm (see 2.2), which tracks the LCP of the pattern with the left and right boundaries of the remaining search interval, has amortized runtime
\[ O\Big(\lg_2(n) + |P| + |P| \cdot \lg_2(Occ(P))\Big), \]
when \(P\) is a randomly sampled fixed-length pattern from the text and \(Occ(P)\) is the number of occurrences of \(P\) in the text.

[WIP] Progress on fast suffix array searching
https://curiouscoding.nl/posts/suffix-array-searching-log/
Tue, 01 Oct 2024 00:00:00 +0000

Here’s a lablog.
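For reference, the LCP-tracking binary search analysed in the lemma post above can be sketched roughly as follows. This is a minimal illustration, not the code from either post: the names `naive_suffix_array`, `lcp_from`, `lower_bound` and `count_occurrences` are hypothetical, the suffix array is built by plain sorting, and occurrences are counted with a linear scan to keep the sketch short.

```rust
/// Build a suffix array by directly sorting all suffixes.
/// (Assumption: fine for small demo inputs only; not a linear-time construction.)
pub fn naive_suffix_array(text: &[u8]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}

/// Length of the common prefix of `a` and `b`, given that the first
/// `from` characters are already known to match.
fn lcp_from(a: &[u8], b: &[u8], from: usize) -> usize {
    let mut i = from;
    while i < a.len() && i < b.len() && a[i] == b[i] {
        i += 1;
    }
    i
}

/// Number of suffixes that are lexicographically smaller than `pattern`,
/// i.e. the index in `sa` of the first suffix that is >= `pattern`.
pub fn lower_bound(text: &[u8], sa: &[usize], pattern: &[u8]) -> usize {
    // Invariant: suffixes in sa[..lo] are < pattern, suffixes in sa[hi..] are >= pattern.
    let (mut lo, mut hi) = (0usize, sa.len());
    // LCP of the pattern with the suffix just left of `lo` and with the suffix at `hi`
    // (0 while the corresponding boundary does not exist yet).
    let (mut l_lcp, mut r_lcp) = (0usize, 0usize);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let suffix = &text[sa[mid]..];
        // Every suffix between the two boundaries shares at least
        // min(l_lcp, r_lcp) characters with the pattern, so skip those.
        let k = lcp_from(suffix, pattern, l_lcp.min(r_lcp));
        let suffix_is_smaller =
            k < pattern.len() && (k == suffix.len() || suffix[k] < pattern[k]);
        if suffix_is_smaller {
            lo = mid + 1;
            l_lcp = k;
        } else {
            hi = mid;
            r_lcp = k;
        }
    }
    lo
}

/// Count occurrences of `pattern` by scanning forward from the lower bound.
/// (Simplification: a second binary search for the upper bound would avoid the scan.)
pub fn count_occurrences(text: &[u8], sa: &[usize], pattern: &[u8]) -> usize {
    let start = lower_bound(text, sa, pattern);
    sa[start..]
        .iter()
        .take_while(|&&i| text[i..].starts_with(pattern))
        .count()
}
```

For the text `banana`, `lower_bound` with pattern `ana` should land on the first of the two suffixes starting with `ana`, and `count_occurrences` should report both of them.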
Background
Compare with suffix arrays with a twist: https://www.cai.sk/ojs/index.php/cai/article/view/2019_3_555
Compare with https://github.com/mranisz/sa, which is based on Compact and hash based variants of the suffix array: https://journals.pan.pl/dlibra/publication/121376/edition/105762/content

Here’s a bike:
A figure of a bike.
Binary searching Eytzinger Btrees Multithreading

[WIP] Faster binary search
https://curiouscoding.nl/posts/fast-binary-search/
Sun, 08 Sep 2024 00:00:00 +0200

Table of Contents
1 High level ideas
1.1 Resources
1.2 Code
2 To measure
3 TODO Memory efficiency
3.1 B-tree

1 High level ideas
Prefix table: for each 20-bit prefix, store the corresponding range of the array.
Interpolation: make one or more interpolation steps. Could store the maximum resulting error. Drawback: can cause an unpredictable number of resulting iterations.
Batching: process multiple (8-32) queries at the same time, hiding memory latency.
Query bucketing: given >>1M queries, partition them into 1M buckets and answer bucket by bucket.

Tools for suffix array searching
https://curiouscoding.nl/posts/suffix-array-searching/
Fri, 14 Jun 2024 00:00:00 +0200

Table of Contents
1 Sapling
2 PLA-Index
3 LISA: learned index

Let’s summarize some tools for efficiently searching suffix arrays.
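The prefix-table idea shows up both in the binary search notes above and in Sapling below, so a small sketch may help make it concrete. It assumes a sorted array of `u32` keys and a configurable number of prefix bits (the notes mention 20); the type `PrefixTable` and its methods are hypothetical names for illustration and are not taken from any of the posts or tools.

```rust
use std::ops::Range;

/// For each `bits`-bit prefix, the start of its range in a sorted `u32` array.
/// (Hypothetical type, for illustration only.)
pub struct PrefixTable {
    bits: u32,
    /// starts[p]..starts[p + 1] is the index range of keys whose top `bits` bits equal p.
    starts: Vec<usize>,
}

impl PrefixTable {
    pub fn new(sorted: &[u32], bits: u32) -> Self {
        assert!(bits >= 1 && bits <= 24, "keep the table reasonably small");
        let buckets = 1usize << bits;
        let mut starts = vec![0usize; buckets + 1];
        // Count how many keys fall into each bucket ...
        for &x in sorted {
            starts[(x >> (32 - bits)) as usize + 1] += 1;
        }
        // ... and turn the counts into start offsets with a prefix sum.
        for i in 0..buckets {
            starts[i + 1] += starts[i];
        }
        Self { bits, starts }
    }

    /// Index range of all keys sharing the top `bits` bits with `q`.
    pub fn range(&self, q: u32) -> Range<usize> {
        let p = (q >> (32 - self.bits)) as usize;
        self.starts[p]..self.starts[p + 1]
    }

    /// Index of the first key >= `q`, binary searching only inside q's bucket.
    pub fn lower_bound(&self, sorted: &[u32], q: u32) -> usize {
        let r = self.range(q);
        let start = r.start;
        start + sorted[r].partition_point(|&x| x < q)
    }
}
```

The table costs one `usize` per prefix, and a query then only searches the keys in its own bucket; how much that saves depends on how evenly the keys are spread over the prefixes.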
1 Sapling
Sapling (Kirsche, Das, and Schatz 2020) works as follows:

Choose a parameter \(p\) and store, for each of the \(2^p\) \(p\)-bit prefixes, the corresponding position in the suffix array.
When querying, first find the bucket for the query prefix.
Then do a linear interpolation inside the bucket.
Search the area \([-E, +E]\) around the interpolated position, where \(E\) is a bound on the error of the linear approximation.

Crates for suffix array construction
https://curiouscoding.nl/posts/suffix-array-crates/
Thu, 13 Jun 2024 00:00:00 +0200

Popular C libraries are:
divsufsort
libsais

Both have a ..64 variant that supports input strings longer than 2GB.
Rust wrappers:
divsufsort: Rust reimplementation; does not support large inputs.
cdivsufsort: C wrapper; does not support large inputs.
libdivsufsort-rs: C wrapper; does support large inputs.
sais: unrelated to the original library; does not implement a linear time algorithm anyway.
libsais-rs: Daniel Liu’s fork-of-fork of the original, but not on crates.io. Supports multithreading using OpenMP and wraps both the original and the 64-bit version.

String algorithm visualizations
https://curiouscoding.nl/posts/alg-viz/
Tue, 08 Nov 2022 00:00:00 +0100

Select the algorithm to visualize.
Click the buttons, or click the canvas and use the indicated keys.

Suffix-array construction is explained here and BWT is explained here.
Source code is on GitHub.
Algorithm: Suffix Array Construction, Burrows-Wheeler Transform, or Bidirectional BWT.
Inputs: String, Query, Delay (s).
Keys: prev (←/backspace), next (→/space), faster (↑/+/f), slower (↓/-/s), pause/play (p/return).

BWT and FM-index
https://curiouscoding.nl/posts/bwt/
Tue, 18 Oct 2022 00:00:00 +0200

Table of Contents
Burrows-Wheeler Transformation (BWT)
Last-to-first mapping (LF mapping)
Pattern matching
Visualization
Bi-directional BWT

These are some notes about the Burrows-Wheeler Transform (BWT), FM-index, and variants.
See my post on the linear time suffix array construction algorithm for notation and terminology.
At the bottom you can find a visualization. This page has an interactive demo. Source code for the visualizations is in this GitHub repo.
Burrows-Wheeler Transformation (BWT)
The BWT of a string \(S\) is generated as follows:

Linear-time suffix array construction
https://curiouscoding.nl/posts/suffix-array-construction/
Thu, 13 Oct 2022 00:00:00 +0200

Table of Contents
Notation
Small and Large suffixes
Building the suffix array from a smaller one
Visualization

These are some notes about linear time suffix array (SA) construction algorithms (SACAs).
At the bottom you can find a visualization. This page has an interactive demo.

History of suffix array construction algorithms:

1990 first algorithm: Manber and Myers (1993)
2002 small/large suffixes, explained below: Ko and Aluru (2005)
2009 recursion only on LMS suffixes: Nong, Zhang, and Chan (2009)

These slides from Stanford are a nice reference for the last algorithm.