Results on CuriousCoding

A lemma on suffix array searching

Sat, 05 Oct 2024 00:00:00 +0200

Table of Contents

1 Suffix arrays
2 Searching methods
3 Analysing the faster search

We’ll prove that the “faster” binary search algorithm (see 2.2), which tracks the LCP of the pattern with the left and right boundaries of the remaining search interval, has amortized runtime

\[ O\Big(\lg_2(n) + |P| + |P| \cdot \lg_2(Occ(P))\Big), \] when $P$ is a randomly sampled fixed-length pattern from the text and $Occ(P)$ counts the number of occurrences of $P$ in the text.
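As a rough illustration of what tracking the LCP with both boundaries can look like, here is a minimal sketch of one common variant: every comparison skips the first `min(l, r)` characters, where `l` and `r` are the LCPs of the pattern with the suffixes at the current left and right boundaries. Whether this matches the post's section 2.2 in every detail is an assumption, and the names `build_suffix_array`, `lcp_from`, `lower_bound`, and `occurrences` are made up for this example.

```python
def build_suffix_array(text):
    # Naive construction by sorting all suffixes; fine for a small illustration.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_from(text, start, pattern, skip):
    # Length of the common prefix of `pattern` and text[start:], given that
    # the first `skip` characters are already known to match.
    k = skip
    while k < len(pattern) and start + k < len(text) and text[start + k] == pattern[k]:
        k += 1
    return k


def lower_bound(text, sa, pattern):
    # Smallest index i such that text[sa[i]:] >= pattern (len(sa) if none).
    lo, hi = -1, len(sa)  # sentinels: suffix at lo < pattern <= suffix at hi
    l = r = 0             # LCP of pattern with the suffix at lo / at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # All suffixes strictly between lo and hi share min(l, r) characters
        # with the pattern, so the comparison can start there.
        k = lcp_from(text, sa[mid], pattern, min(l, r))
        if k == len(pattern):
            hi, r = mid, k  # pattern is a prefix of this suffix
        elif sa[mid] + k == len(text) or text[sa[mid] + k] < pattern[k]:
            lo, l = mid, k  # suffix is smaller than the pattern
        else:
            hi, r = mid, k  # suffix is larger than the pattern
    return hi


def occurrences(text, sa, pattern):
    # Collect all start positions of `pattern` by scanning from the lower bound.
    i = lower_bound(text, sa, pattern)
    out = []
    while i < len(sa) and text.startswith(pattern, sa[i]):
        out.append(sa[i])
        i += 1
    return sorted(out)
```

For example, with `text = "banana"` and `sa = build_suffix_array(text)`, calling `occurrences(text, sa, "ana")` returns `[1, 3]`.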

Computing random minimizers, fast

Fri, 12 Jul 2024 00:00:00 +0200

Table of Contents

1 Introduction
- 1.1 Results
2 Random minimizers
3 Algorithms
4 Analysing what we have so far
5 Rolling our own hash
6 SIMD sliding window
- 6.1 Results
  - Human genome results
7 TODO Cleanup, Testing, Super-k-mers, and canonical k-mers

1 Introduction

In this post, we will develop a fast implementation of random minimizers.
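For orientation, the quantity being computed can be sketched naively as follows. This is a quadratic-time baseline under stated assumptions, not the fast implementation the post develops: `kmer_hash` uses BLAKE2b as a stand-in for a random hash, ties are broken towards the leftmost position, and all function names are hypothetical.

```python
import hashlib


def kmer_hash(kmer):
    # Stand-in for a random hash function on k-mers (an assumption of this sketch).
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_positions(seq, k, w, hash_fn=kmer_hash):
    # Hash every k-mer once, then take the leftmost minimum of each window
    # of w consecutive k-mers. Repeated positions are reported only once.
    hashes = [hash_fn(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    positions = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        pos = start + window.index(min(window))
        if not positions or positions[-1] != pos:
            positions.append(pos)
    return positions
```

Passing `hash_fn=lambda s: s` orders k-mers lexicographically instead, which is convenient for checking small cases by hand.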

A near-tight lower bound on minimizer density

Tue, 25 Jun 2024 00:00:00 +0200

Table of Contents

Succinct background
- Definitions
- Lower bounds
A new lower bound
Discussion
Post scriptum
Acknowledgement

The results of this post are now available in a pre-print:

Kille, Bryce, Ragnar Groot Koerkamp, Drake McAdams, Alan Liu, and Todd Treangen. 2024. “A near-Tight Lower Bound on the Density of Forward Sampling Schemes.” bioRxiv. https://doi.org/10.1101/2024.09.06.611668.

In this post I will prove a new lower bound on the density of any minimizer or forward sampling scheme: \[ d(f) \geq \frac{\lceil\frac{w+k}{w}\rceil}{w+k} = \frac{\lceil\frac{\ell+1}{w}\rceil}{\ell+1}. \]
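To get a feel for the numbers, the bound can be evaluated exactly with rational arithmetic. The following is a small sketch; the function name `density_lower_bound` is made up here, and the relation $\ell + 1 = w + k$ is read off from the equality in the formula above.

```python
from fractions import Fraction


def density_lower_bound(w, k):
    # ceil((w + k) / w) / (w + k), computed exactly.
    n = w + k                 # equals l + 1 in the formula above
    ceil_ratio = -(-n // w)   # integer ceiling of n / w
    return Fraction(ceil_ratio, n)
```

For instance, `density_lower_bound(5, 1)` gives `Fraction(1, 3)`, compared to the trivial bound of $1/w = 1/5$.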

Perfect NtHash for Robust Minimizers

Sun, 31 Dec 2023 00:00:00 +0100

Table of Contents

NtHash
Minimizers
- Robust minimizers
Is NtHash injective on kmers?
- Searching for a collision
- Proving perfection
Alternatives
SmHasher results
TODO benchmark NtHash, NtHash2, FxHash

NtHash

NtHash (Mohamadi et al. 2016) is a rolling hash that was originally designed for DNA but is suitable for hashing any kind of text. For a string $x$ of length $k$ it is a 64-bit value computed as:

\begin{equation} h(x) = \bigoplus_{i=0}^{k-1} rot^i(h(x_i)) \end{equation}
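The formula above translates fairly directly into code. In the sketch below, the per-character values in `CHAR_HASH` are arbitrary illustrative constants and not necessarily the ones NtHash uses; `rotl64`, `nthash_direct`, and `roll` are names chosen for this example. The `roll` step follows from the formula as written: remove the outgoing character, rotate right by one, and add the incoming character rotated by $k-1$.

```python
MASK = (1 << 64) - 1

# Arbitrary illustrative 64-bit values, one per character (an assumption).
CHAR_HASH = {
    "A": 0x0123456789ABCDEF,
    "C": 0xFEDCBA9876543210,
    "G": 0x0F1E2D3C4B5A6978,
    "T": 0x8796A5B4C3D2E1F0,
}


def rotl64(x, r):
    # Rotate the 64-bit value x left by r bits.
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK


def nthash_direct(kmer):
    # XOR of rot^i(h(x_i)) over all positions i, as in the formula above.
    h = 0
    for i, c in enumerate(kmer):
        h ^= rotl64(CHAR_HASH[c], i)
    return h


def roll(h, k, outgoing, incoming):
    # Update the hash of a k-mer when the window slides by one character.
    h ^= CHAR_HASH[outgoing]   # the outgoing character has rotation 0
    h = rotl64(h, 63)          # rotating left by 63 is rotating right by 1
    return h ^ rotl64(CHAR_HASH[incoming], k - 1)
```

Rolling the hash of `"ACGT"` with `roll(h, 4, "A", "G")` gives the same value as `nthash_direct("CGTG")`.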

PTRHash: Notes on adapting PTHash in Rust

Thu, 21 Sep 2023 00:00:00 +0200

Table of Contents

Questions and remarks on PTHash paper
Ideas for improvement
Implementation log
TODO


Tensor embedding preserves Hamming distance

Fri, 14 Oct 2022 00:00:00 +0200

Table of Contents

Definitions
Proof of Lemma 1
TODO Proof of Lemma 2

This is a proof that Tensor Embedding (Joudaki, Rätsch, and Kahles 2020) with $\ell^2$-norm preserves the Hamming distance.

This is in collaboration with Amir Joudaki.

\begin{equation*} \newcommand{\I}{\mathcal I} \newcommand{\EE}{\mathbb E} \newcommand{\var}{\operatorname{Var}} \end{equation*}

Definitions

Notation

The alphabet is $\Sigma$, of size $|\Sigma| = \sigma$.
The set of indices is $\I := \{(i_1, \dots, i_t) \in [n]^t: i_1 < \dots < i_t\}$.
Given a string $a_1\dots a_n = a\in \Sigma^n$ and an index tuple $I = (i_1, \dots, i_t) \in \I$, we define the $I$-index as $a_I = (a_{i_1}, \dots, a_{i_t})$.
We write $[ X ]$ for the indicator variable of event $X$, which is $1$ when $X$ holds and $0$ otherwise.

Definition 1: Tensor embedding

Given $a\in \Sigma^n$, the tensor embedding $T_a$ is the tensor with $\sigma^t$ entries given by $T_a[s] = \sum_{I\in \I} [a_I = s]$ for each $s\in \Sigma^t$.
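In other words, $T_a[s]$ counts the increasing index tuples $I$ for which $a_I = s$, i.e. the number of times $s$ occurs as a length-$t$ subsequence of $a$. A brute-force sketch of this definition follows; it stores the tensor sparsely as a `Counter` keyed by $s$, so absent keys stand for zero entries, and the function name `tensor_embedding` is chosen for this example.

```python
from collections import Counter
from itertools import combinations


def tensor_embedding(a, t):
    # combinations() yields exactly the increasing index tuples i_1 < ... < i_t.
    # Each tuple I contributes 1 to the entry indexed by a_I.
    return Counter(
        tuple(a[i] for i in I) for I in combinations(range(len(a)), t)
    )
```

For `a = "aba"` and `t = 2`, the three index pairs give the entries `('a', 'b')`, `('a', 'a')`, and `('b', 'a')`, each with count 1.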

The normalized tensor embedding distance $d_{te}$ between two sequences $a$ and $b$ is defined as

28000x speedup with Numba.CUDA

Mon, 24 May 2021 00:00:00 +0200

Xrefs: r/CUDA, Numba discourse