# Notes on implementing Longest Common Repeat (LCR)

These are my running notes on implementing an algorithm for Longest Common Repeat using minimizers.

## Notes

### Coloured Tree Problem

See Lemma 3 at here

### Generic sparse suffix array

For random strings and $$b \leq n / \log n$$, direct radix sort on $2log n + log log n$-bit prefixes is sufficient for $$O(n)$$ runtime. In fact, since computer word size $$w\geq \log n$$, we only need at most $$2$$ rounds of radix sort! (See simple-saca.)

### Sparse suffix array on minimizers

1. Fix $$w$$, the window length, and $$k$$, the minimizer length.
2. Find all minimizers. We expect a density of $$2/w$$, and there is one at least every $$w$$ positions.
3. Radix sort the suffixes of length $$2\cdot w$$ starting at each minimizer in $$O(2/w \cdot 2w \cdot n) = O(n)$$ time.
4. Use prefix doubling to resolve missing elements.

## Discussion / TODOs

• Write kmer code to actually do LCR
• Bench against divsufsort
• Solons papers on rotations instead of minimizers:
• Linear time LCS from SA:
• Minimizer-space suffix array?
• Fast SSA:
• First one round of radix sort
• Then comparison-sort on small context. (Or more radix sort rounds?)
• Then fall back to PA.
• Can we be $$o(n)$$ and avoid hashing the full string? Maybe some kind of lazy hashing? (I.e. only cache parts of the hash that are actually computed?) Maybe in a lazy segment-tree-like structure, where nodes are only computed on demand.
• Binary search $$l$$ for LCR length. (Try $$n/2$$, $$n/4$$, $$n/8$$, …)
• Reuse minimizer for fixed $$k$$, varying the window size.
• $$O(m + \log n)$$ mapping
• Compute matching statistics: for strings $$A$$ and $$B$$, for each $$A[i..]$$ the longest common prefix with some suffix of $$B$$.
• Segtree for caching rolling hashes.
• Minimizer doubling
• Fir radix sort, only then sort unresolved suffices using minimizer doubling or PA.
• NtHash2 for longer hashes
• NtHash for minimizers https://gist.github.com/Daniel-Liu-c0deb0t/7078ebca04569068f15507aa856be6e8

### Evals

• Plot space/time tradeoff in $$l$$
• Plot space/time in $$n$$
• Compare to full SA/LCP construction algorithm