Highlight on CuriousCoding

Practical selection and sampling schemes

Thu, 12 Sep 2024 00:00:00 +0200

Table of Contents

1 Sampling schemes
- 1.1 Definitions and background
- 1.2 Mod-minimizer
- 1.3 Forward scheme lower bound
- 1.4 Open syncmer minimizer
- 1.5 Open-closed minimizer
- 1.6 New: General mod-minimizer
- 1.7 Variant: Open-closed minimizer using offsets
2 Selection schemes
- 2.1 Definition
- 2.2 Bd-anchors
- 2.3 New: Smallest unique substring anchors
- 2.4 New: Anti lexicographic sorting
3 More sampling schemes
- 3.1 Anti-lex sus-anchors
- 3.2 Threshold anchors
- 3.3 The $t$-gap disappears for large alphabets
4 Computing the density of forward schemes
- 4.1 WIP: Anti lexicographic sus-anchor density
5 Open questions

This post introduces some new practical sampling schemes. It builds on:

Computing random minimizers, fast

Fri, 12 Jul 2024 00:00:00 +0200

Table of Contents

1 Introduction
- 1.1 Results
2 Random minimizers
3 Algorithms
4 Analysing what we have so far
5 Rolling our own hash
6 SIMD sliding window
- 6.1 Results
  - Human genome results
7 TODO Cleanup, Testing, Super-k-mers, and canonical k-mers

1 Introduction

In this post, we will develop a fast implementation of random minimizers.
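For reference, a naive baseline for what such an implementation computes could look like the sketch below. It is an illustration under stated assumptions, not the implementation developed in the post: the function name `random_minimizers` is hypothetical, `zlib.crc32` stands in for a pseudo-random k-mer hash, and ties are broken towards the leftmost k-mer.

```python
import zlib


def random_minimizers(text: str, w: int, k: int) -> list[int]:
    """Return the sampled k-mer start positions, one pass per window (O(n*w))."""
    # Hash every k-mer once. crc32 is only a stand-in for a random hash.
    hashes = [zlib.crc32(text[i:i + k].encode()) for i in range(len(text) - k + 1)]
    positions: list[int] = []
    # Each window covers w consecutive k-mers.
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        # list.index returns the first occurrence, i.e. the leftmost minimum.
        pos = start + window.index(min(window))
        # Consecutive windows often agree; record each position only once.
        if not positions or positions[-1] != pos:
            positions.append(pos)
    return positions


text = "ACGTACGTTGCAACGGTA"
w, k = 4, 3
sampled = random_minimizers(text, w, k)
density = len(sampled) / (len(text) - k + 1)
```

The quadratic-looking inner `min` is exactly the part a fast implementation would replace with a sliding-window minimum.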

A near-tight lower bound on minimizer density

Tue, 25 Jun 2024 00:00:00 +0200

Table of Contents

Succinct background
- Definitions
- Lower bounds
A new lower bound
Discussion
Post scriptum
Acknowledgement

The results of this post are now available in a pre-print:

Kille, Bryce, Ragnar Groot Koerkamp, Drake McAdams, Alan Liu, and Todd Treangen. 2024. “A Near-Tight Lower Bound on the Density of Forward Sampling Schemes.” bioRxiv. https://doi.org/10.1101/2024.09.06.611668.

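In this post I will prove a new lower bound on the density of any minimizer or forward sampling scheme $f$ that samples one $k$-mer from every window of $w$ consecutive $k$-mers, where $\ell = w+k-1$ is the window length in characters: \[ d(f) \geq \frac{\lceil\frac{w+k}{w}\rceil}{w+k} = \frac{\lceil\frac{\ell+1}{w}\rceil}{\ell+1}. \]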
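To get a feel for the bound, it can be evaluated exactly for a few parameter choices. The snippet below is a small sketch; the function name `density_lower_bound` is made up for illustration, and it simply evaluates the formula above with exact fractions.

```python
from fractions import Fraction


def density_lower_bound(w: int, k: int) -> Fraction:
    """Evaluate ceil((w + k) / w) / (w + k) exactly."""
    n = w + k
    ceil_ratio = -(-n // w)  # integer ceiling of n / w
    return Fraction(ceil_ratio, n)


# The bound is never below the trivial 1/w bound, and exceeds it
# whenever w does not divide w + k.
for w, k in [(2, 1), (5, 1), (4, 4), (11, 21)]:
    print(w, k, density_lower_bound(w, k), Fraction(1, w))
```

For example, $w=5$ and $k=1$ give $\lceil 6/5\rceil/6 = 1/3$, compared to the trivial bound of $1/5$.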

A*PA2: Up to 19x faster exact global alignment

Sat, 23 Mar 2024 00:00:00 +0100

Table of Contents

Abstract
1 Introduction
- 1.1 Contributions
- 1.2 Previous work
  - 1.2.1 Needleman-Wunsch
  - 1.2.2 Graph algorithms
  - 1.2.3 Computational volumes
  - 1.2.4 Parallelism
  - 1.2.5 Tools
2 Preliminaries
3 Methods
- 3.1 Band-doubling
- 3.2 Blocks
- 3.3 Memory
- 3.4 SIMD
- 3.5 SIMD-friendly sequence profile
- 3.6 Traceback
- 3.7 A*
  - 3.7.1 Bulk-contours update
  - 3.7.2 Pre-pruning
- 3.8 Determining the rows to compute
  - 3.8.1 Sparse heuristic invocation
- 3.9 Incremental doubling
4 Results
- 4.1 Setup
- 4.2 Comparison with other aligners
- 4.3 Effects of methods
5 Discussion
Acknowledgements
Conflict of interest
6 Appendix
- 6.1 Bitpacking
- 6.2 Comparison with other aligners
- 6.3 Effects of methods

Mod-minimizers and other minimizers

Thu, 18 Jan 2024 00:00:00 +0100

This post introduces some background on minimizers and some experiments with a new minimizer variant. That new variant is now called the mod-minimizer and was published at WABI24 (Groot Koerkamp and Pibiri 2024). This also includes a review of existing methods, including pseudocode for most of the methods covered below.

One Billion Row Challenge

Wed, 03 Jan 2024 00:00:00 +0100

Table of Contents

External links
The problem
Initial solution: 105s
First flamegraph
Bytes instead of strings: 72s
Manual parsing: 61s
Inline hash keys: 50s
Faster hash function: 41s
A new flame graph
Perf it is
Something simple: allocating the right size: 41s
memchr for scanning: 47s
memchr crate: 29s
get_unchecked: 28s
Manual SIMD: 29s
Profiling
Revisiting the key function: 23s
PtrHash perfect hash function: 17s
Larger masks: 15s
Reduce pattern matching: 14s
Memory map: 12s
Parallelization: 2.0s
Branchless parsing: 1.7s
Purging all branches: 1.67s
Some more attempts
Faster perfect hashing: 1.55s
Bug time: Back up to 1.71s
Temperatures less than 100: 1.62s
Computing min as a max: 1.50s
Intermezzo: Hyperthreading: 1.34s
Not parsing negative numbers: 1.48s
More efficient parsing: 1.44s
Fixing undefined behaviour: back to 1.56s
Lazily subtracting b'0': 1.52s
Min/max without parsing: 1.55s
Parsing using a single multiplication: doesn’t work
Parsing using a single multiplication does work after all! 1.48s
A side note: ASCII
Skip parsing using PDEP: 1.42s
- Improved
- A further note
Branchy min/max: 1.37s
No counting: 1.34s
Arbitrarily long city names: 1.34s
4 entries in parallel: 1.23s
Mmap per thread
Reordering some operations: 1.19s
Reordering more: 1.11s
Even more ILP: 1.05s
Compliance 1, OK I’ll count: 1.06s
TODO
Postscript

Since everybody is doing it, I’m also going to take a stab at the One Billion Row Challenge.

Perfect NtHash for Robust Minimizers

Sun, 31 Dec 2023 00:00:00 +0100

Table of Contents

NtHash
Minimizers
- Robust minimizers
Is NtHash injective on kmers?
- Searching for a collision
- Proving perfection
Alternatives
SmHasher results
TODO benchmark NtHash, NtHash2, FxHash

NtHash

NtHash (Mohamadi et al. 2016) is a rolling hash that was originally designed for DNA but is suitable for hashing any kind of text. For a string $x$ of length $k$, it is a $64$-bit value computed as:

\begin{equation} h(x) = \bigoplus_{i=0}^{k-1} rot^i(h(x_i)) \end{equation}
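As a sketch of how this formula behaves, the snippet below evaluates it directly and also rolls it from one k-mer to the next. Several things here are assumptions for illustration: $rot$ is taken to be a left rotation of a 64-bit word, the per-character constants in `CHAR_HASH` are arbitrary placeholders rather than the seeds used by the NtHash library, and the function names are hypothetical.

```python
MASK = (1 << 64) - 1

# Placeholder per-character hashes h(x_i); NOT the real NtHash seeds.
CHAR_HASH = {
    "A": 0x0123456789ABCDEF,
    "C": 0xFEDCBA9876543210,
    "G": 0x0F1E2D3C4B5A6978,
    "T": 0x8796A5B4C3D2E1F0,
}


def rotl(v: int, r: int) -> int:
    """Rotate a 64-bit value left by r bits."""
    r %= 64
    return ((v << r) | (v >> (64 - r))) & MASK


def rotr(v: int, r: int) -> int:
    """Rotate a 64-bit value right by r bits."""
    return rotl(v, (64 - r) % 64)


def hash_kmer(kmer: str) -> int:
    """Direct evaluation: xor of rot^i(h(x_i)) over all positions i."""
    h = 0
    for i, c in enumerate(kmer):
        h ^= rotl(CHAR_HASH[c], i)
    return h


def rolling_hashes(text: str, k: int) -> list[int]:
    """Hash every k-mer of text, updating in O(1) per step."""
    if len(text) < k:
        return []
    h = hash_kmer(text[:k])
    out = [h]
    for j in range(k, len(text)):
        # Remove the outgoing character (rotation 0), shift every remaining
        # term down by one rotation, then add the incoming character at
        # rotation k - 1. This works because rotation distributes over xor.
        h = rotr(h ^ CHAR_HASH[text[j - k]], 1) ^ rotl(CHAR_HASH[text[j]], k - 1)
        out.append(h)
    return out


text = "ACGTTGCAACGT"
k = 4
direct = [hash_kmer(text[i:i + k]) for i in range(len(text) - k + 1)]
rolled = rolling_hashes(text, k)
```

Both lists agree, which is what makes the hash "rolling": each new k-mer costs a constant number of rotations and xors instead of $k$.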

PTRHash: Notes on adapting PTHash in Rust

Thu, 21 Sep 2023 00:00:00 +0200

Table of Contents

Questions and remarks on PTHash paper
Ideas for improvement
Implementation log
TODO

String algorithm visualizations

Tue, 08 Nov 2022 00:00:00 +0100

Select the algorithm to visualize
Click the buttons, or click the canvas and use the indicated keys

Suffix-array construction is explained here and BWT is explained here.

Source code is on GitHub.

A survey of exact global pairwise alignment

Fri, 01 Apr 2022 00:00:00 +0200

Note: This is a living document, and will likely remain so for a while. Feel free to suggest missing papers or make a pull request.

28000x speedup with Numba.CUDA

Mon, 24 May 2021 00:00:00 +0200

Xrefs: r/CUDA, Numba discourse