ALPACA/PANGAIA winter workshop notes

These are notes from discussions at the ALPACA/PANGAIA conference in November 2023.

Monday

I had interesting discussions with Giulio, Paul, and Lucas Robidou.

Fimpera: a Bloom filter for $k$mers

Idea: instead of storing $k$mers in a Bloom filter, store all their constituent $s$mers (\(s<k\)). This allows single-memory-lookup membership queries when streaming $k$mers, since each next $k$mer introduces only one new $s$mer (toy sketch after the list below).

  • Speedup by Lucas: For each $k$mer only query the first and last $s$mer.
  • Question: can we make a locality-preserving Bloom filter?
  • Answer: locality-preserving hashing of the $s$-mers using their minimizers. (It’s minimizers all the way down.)
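A toy sketch of the streaming idea (my own illustration, not Fimpera's actual interface; a single hash function stands in for a real Bloom filter):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal Bloom filter over s-mers. One hash function for brevity;
/// a real filter would use several.
struct SmerFilter {
    bits: Vec<u64>,
    num_bits: u64,
    s: usize,
}

impl SmerFilter {
    fn new(num_bits: u64, s: usize) -> Self {
        SmerFilter {
            bits: vec![0; (num_bits as usize + 63) / 64],
            num_bits,
            s,
        }
    }

    fn slot(&self, smer: &[u8]) -> usize {
        let mut h = DefaultHasher::new();
        smer.hash(&mut h);
        (h.finish() % self.num_bits) as usize
    }

    /// Index a reference by inserting every s-mer.
    fn insert_seq(&mut self, seq: &[u8]) {
        for smer in seq.windows(self.s) {
            let i = self.slot(smer);
            self.bits[i / 64] |= 1 << (i % 64);
        }
    }

    fn contains_smer(&self, smer: &[u8]) -> bool {
        let i = self.slot(smer);
        self.bits[i / 64] >> (i % 64) & 1 == 1
    }

    /// Stream all k-mers of a query with a single new lookup per k-mer:
    /// sliding the k-mer window by one introduces exactly one new s-mer,
    /// and a k-mer is (approximately) present iff all of its w = k-s+1
    /// s-mers are present, i.e. iff the current run of present s-mers
    /// has length at least w.
    fn stream_kmers(&self, query: &[u8], k: usize) -> Vec<bool> {
        let w = k - self.s + 1;
        let mut run = 0;
        let mut present = Vec::new();
        for (i, smer) in query.windows(self.s).enumerate() {
            run = if self.contains_smer(smer) { run + 1 } else { 0 };
            if i + 1 >= w {
                present.push(run >= w);
            }
        }
        present
    }
}

fn main() {
    let (k, s) = (21, 11);
    let reference = b"ACGTACGTTTGACCAGTACGTACCATGGTTACGTAGCTAGCTG";
    let mut filter = SmerFilter::new(1 << 20, s);
    filter.insert_seq(reference);
    let hits = filter.stream_kmers(&reference[3..], k);
    println!(
        "{}/{} k-mers (approximately) present",
        hits.iter().filter(|&&b| b).count(),
        hits.len()
    );
}
```

Lucas' first-and-last variant would replace the run counter with just two lookups per $k$mer, presumably at the cost of a somewhat higher false-positive rate.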

Progression of tools

  1. Generic datastructure.
  2. Datastructure for $k$mers.
  3. Datastructure for streaming of $k$mers.
  4. Locality-preserving datastructure for streaming of $k$mers.

Order-preserving MPHF of minimizers

The big question:

Can we build an order-preserving MPHF/index on minimizers in \(o(n\lg n)\) space?

In SSHash, most memory is used by mapping each minimizer back into the right order.

  • There are \(n!\) orders, and since \(\lg n! \sim n\lg n\), this already takes \(\lg n\) bits per element.
  • Can we make some kind of \(\varepsilon\)-order-preserving MPHF?
  • Maybe we can bucket them and only preserve order between buckets?
    • If we make \(B\) buckets, each $k$mer still needs to be mapped to one of the \(B\) buckets, which already requires \(\lg B\) bits per element. For \(10^9\) elements, maybe you want \(10^5\) buckets to get down to small L1-sized buckets, but that still requires basically \(\frac 12 n\lg n\) bits (worked out right after this list)…
  • Can we exploit that we’re hashing minimizers?
  • Are consecutive non-overlapping minimizers correlated???
    • If not, it’s basically inevitable to encode a permutation of all minimizers, requiring \(n\lg n\) bits of memory and most likely a cache miss per lookup.
  • For mapping specifically: what about only looking up say \(1\%\) of minimizers? That gives \(100\times\) fewer cache misses. What are the drawbacks?
  • Doubling \(k\) almost halves the number of minimizers and shrinks (almost halves) the size of the datastructure.
    • If we build the datastructure for \(k=\{31,63,127,255,\dots\}\), we can jump ahead a lot at low cost, and make queries for reads with various accuracies.
  • Is there correlation between the hash value of closely related minimizers?
    • Maybe small values correlate or large values correlate? Or maybe small-large pairs are more likely? Whatever pairs co-occur could be placed close to each other in cache.
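For concreteness, the bucketing arithmetic from above:

\[
n = 10^9,\quad B = 10^5:\qquad n\lg B \approx 16.6\,n \approx \tfrac12\, n\lg n \qquad (\lg n \approx 29.9),
\]

so the bucket index alone already costs about half of what storing the full permutation would.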

Algorithmic bottlenecks in SSHash

  • Maybe the cache-miss latency of SSHash can be hidden by prefetching/working in blocks?
  • All minimizers in the query are known up-front, so we don’t actually have to stream things; prefetching is easy.
  • Is the bottleneck memory throughput?
  • After prefetching, random access in itself isn’t much slower than sequential access, but sequential access benefits from hardware prefetching, while non-sequential access requires manual software prefetching and the corresponding overhead (sketch below).
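A sketch of what the batched lookups could look like (everything here is hypothetical: a flat `table` stands in for the minimizer index, a multiplicative hash for the MPHF, and the prefetch distance of \(16\) is a guess to be tuned):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

/// Hypothetical index: one u64 entry per minimizer slot.
struct Index {
    table: Vec<u64>,
}

impl Index {
    /// Stand-in for the MPHF evaluation; any cheap hash will do here.
    fn slot(&self, minimizer: u64) -> usize {
        (minimizer.wrapping_mul(0x9e3779b97f4a7c15) >> 32) as usize % self.table.len()
    }

    /// Look up a batch of minimizers. All minimizers of a query are known
    /// up-front, so we can issue a software prefetch DIST iterations ahead
    /// and overlap many cache misses instead of paying them one by one.
    fn lookup_batch(&self, minimizers: &[u64], out: &mut Vec<u64>) {
        const DIST: usize = 16; // prefetch distance: tune per machine
        for (i, &m) in minimizers.iter().enumerate() {
            #[cfg(target_arch = "x86_64")]
            if let Some(&ahead) = minimizers.get(i + DIST) {
                unsafe {
                    _mm_prefetch::<_MM_HINT_T0>(
                        self.table.as_ptr().add(self.slot(ahead)) as *const i8
                    );
                }
            }
            out.push(self.table[self.slot(m)]);
        }
    }
}

fn main() {
    let index = Index {
        table: (0..1u64 << 20).collect(),
    };
    let minimizers: Vec<u64> = (0..1000).map(|i| i * 2654435761).collect();
    let mut out = Vec::with_capacity(minimizers.len());
    index.lookup_batch(&minimizers, &mut out);
    println!("looked up {} minimizers", out.len());
}
```

On non-x86 targets the prefetch statement compiles away and the loop degrades to plain lookups.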

Fourier transform of the human genome?

  • Can we locate $k$mers in the human genome using some inherent properties?
  • Are there some properties of $k$mers that change slowly over ‘time’, looking something like a sine wave?
  • Suppose something like CG fraction changes slowly over time. Then CG-content of a $k$mer can be used to estimate its position.
  • Sounds like a stretch, but: can we make an FFT of the human genome, throw away the high-frequency information, and compute the corresponding low-frequency signals for shorter $k$mers? Then we could use this to locate them (toy sketch below).
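A toy version of this, using a moving average as a crude low-pass filter instead of an actual FFT (window size and tolerance are made up):

```rust
/// Smoothed GC fraction over windows of size `w`: a slowly-varying signal
/// along the genome. (An FFT with the high frequencies dropped would give
/// a similar low-frequency signal.)
fn gc_signal(genome: &[u8], w: usize) -> Vec<f64> {
    let mut prefix = vec![0u32; genome.len() + 1];
    for (i, &c) in genome.iter().enumerate() {
        prefix[i + 1] = prefix[i] + matches!(c, b'G' | b'C') as u32;
    }
    (0..=genome.len() - w)
        .map(|i| (prefix[i + w] - prefix[i]) as f64 / w as f64)
        .collect()
}

/// Candidate positions for a k-mer: all windows whose local GC fraction is
/// within `tol` of the k-mer's own GC fraction. These would then still be
/// verified by exact matching.
fn candidates(signal: &[f64], kmer: &[u8], tol: f64) -> Vec<usize> {
    let gc = kmer.iter().filter(|&&c| matches!(c, b'G' | b'C')).count() as f64
        / kmer.len() as f64;
    signal
        .iter()
        .enumerate()
        .filter(|&(_, &s)| (s - gc).abs() <= tol)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Pseudo-random toy genome.
    let mut state = 7u64;
    let genome: Vec<u8> = (0..10_000)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            b"ACGT"[(state >> 62) as usize]
        })
        .collect();
    let signal = gc_signal(&genome, 1000);
    let kmer = &genome[4321..4321 + 31];
    let cands = candidates(&signal, kmer, 0.01);
    println!("{} candidate windows out of {}", cands.len(), signal.len());
}
```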

Tuesday

Variant types

From Tobias’ talk.

Table 1: E. Eichler, NEJM, 2019
| class | size of variant (bp) | Number/genome | size of region affected (Mbp) | percent of genome |
|---|---|---|---|---|
| SNV | 1 | 4M | 4-5 | 0.078 |
| indel | 1-49 | 700k | 5-5 | 0.069 |
| structural variant | >50 | 25k | 10-12 | 0.19 |
| inversions | >50 | 150 | 23 | 0.397 |
| multi-copy-number | >1000 | 500 | 12-15 | 0.232 |

Wednesday

Many more interesting discussions about de Bruijn Graphs, SSHash, and more.

Main insights:

  • A de Bruijn Graph is literally just a set of $k$mers.
  • If you store an array of \(n\) things, you can encode \(\lg n! \approx n\lg n\) bits of information in just the order in which they are stored on disk (small demo below).
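A small demo of that second point: any integer \(x < n!\) can be stored purely in the order of an \(n\)-element array, via the factorial number system (Lehmer code):

```rust
/// Encode an integer x in [0, n!) as a permutation of 0..n,
/// using the factorial number system (Lehmer code).
fn encode(mut x: u128, n: usize) -> Vec<usize> {
    let mut pool: Vec<usize> = (0..n).collect();
    let mut fact: u128 = (1..n as u128).product(); // (n-1)!
    let mut perm = Vec::with_capacity(n);
    for i in 0..n {
        let d = (x / fact) as usize; // digit i is in [0, n - i)
        x %= fact;
        perm.push(pool.remove(d));
        if n - 1 - i > 0 {
            fact /= (n - 1 - i) as u128;
        }
    }
    perm
}

/// Recover x from the permutation.
fn decode(perm: &[usize]) -> u128 {
    let n = perm.len();
    let mut pool: Vec<usize> = (0..n).collect();
    let mut x: u128 = 0;
    for (i, &p) in perm.iter().enumerate() {
        let d = pool.iter().position(|&q| q == p).unwrap();
        pool.remove(d);
        x = x * (n - i) as u128 + d as u128;
    }
    x
}

fn main() {
    // 8! = 40320, so the order of just 8 elements stores lg(8!) ≈ 15.3 bits.
    let x = 12345;
    let perm = encode(x, 8);
    assert_eq!(decode(&perm), x);
    println!("{x} <-> {perm:?}");
}
```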

SSHash

  • We can shuffle the unitigs in the output in any order we want. Roughly half the unitigs only have a single minimizer, so we have around \(\lg (n/2)! \approx n/2 \cdot \lg n\) bits of information that we may be able to save from the order table by permuting the output to have the same order as the MPHF outputs.
  • We don’t need to store the full-precision offset: indicating the start of the cacheline should be good enough. Each cacheline holds \(64\cdot 8 / 2 = 256\) basepairs, so we can just drop the last \(8\) bits of each value and search inside the cacheline for the right place (sketch after this list).
  • We could exploit that most minimizers occur only once, and skip the sizes table: instead, store a single offset and …
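A toy sketch of the quantized-offset idea above (my own version; plain bytes stand in for SSHash's 2-bit-packed sequence, so a "cacheline" here is just a block of 256 positions):

```rust
/// Instead of a full-precision offset, store only `offset >> 8`: with
/// 2-bit-packed bases a 64-byte cacheline holds 64*8/2 = 256 bp, so this
/// drops exactly the 8 low bits and still identifies the cacheline.
fn coarse(offset: usize) -> u32 {
    (offset >> 8) as u32
}

/// Recover the exact offset by scanning the up-to-256 positions of the
/// block for the k-mer. In the packed representation this stays (mostly)
/// within the single cacheline we had to fetch anyway.
fn exact_offset(text: &[u8], coarse: u32, kmer: &[u8]) -> Option<usize> {
    let start = (coarse as usize) << 8;
    let end = (start + 256).min(text.len() - kmer.len() + 1);
    (start..end).find(|&p| &text[p..p + kmer.len()] == kmer)
}

fn main() {
    // Pseudo-random toy text; the 31-mer at position 700 is almost surely unique.
    let mut state = 42u64;
    let text: Vec<u8> = (0..1000)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            b"ACGT"[(state >> 62) as usize]
        })
        .collect();
    let offset = 700;
    let kmer = &text[offset..offset + 31];
    assert_eq!(exact_offset(&text, coarse(offset), kmer), Some(offset));
    println!("recovered offset {offset} from coarse value {}", coarse(offset));
}
```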

PTHash

  • There are also others working on PTHash:
    • GPU implementation
    • Non-linear formula for optimal bucket size distribution.
  • Why is PTHash not closer to optimal? Where are we not succinct in encoding information?
    • Some \(10\%\) of buckets are empty, which is memory that is directly thrown away.
    • The average pilot value is around \(30\) to \(50\), but each bucket can store values up to \(256\), so the average could be up to \(128\). So we throw away \(1\) to \(2\) bits (out of \(8\)) for each bucket! That’s another \(\approx 20\%\) of wasted overhead.
    • (On the other hand, the displacement algorithm is currently not able to construct parts with higher average pilot values.)
  • It would be nice to handle overly full partitions separately. We currently need to choose \(\alpha\) a bit smaller than necessary to ensure that even the relatively full parts still have a sufficiently small \(\alpha\), but this wastes some performance for the parts with a lower fill factor.
    • We could store \(1\) bit per part, indicating whether it is large, meaning that we increase its number of slots by \(10\%\) (or whatever is necessary). Then, we can run the algorithm as usual but use this larger \(s'>s\) for the within-part offset. If we end up with an index \(<s\), we proceed as usual. If we end up with an index between \(s\) and \(s'\), we put these at the very back, using a count of how many such large parts occurred before, and then remap them accordingly (sketch after this list).

      This would slow down queries because there are more edge cases, but would also speed them up because we can use a larger \(\alpha\), reducing other edge cases.

    • Alternatively, we could store prefix sums of the number of slots allocated to each part. This adds quite some additional metadata that needs to be read for each query, and doesn’t sound so nice.
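A sketch of the remapping from the first option (the layout and all names are my invention, just to make the indexing concrete):

```rust
/// Hypothetical layout: every part has s regular slots; parts flagged as
/// "large" get an extra (s_big - s) overflow slots, all packed together at
/// the very back of the slot array.
struct PartedSlots {
    s: u64,                 // regular slots per part
    s_big: u64,             // within-part search range for large parts
    num_parts: u64,
    large_before: Vec<u32>, // #large parts among parts 0..p (exclusive prefix counts)
}

impl PartedSlots {
    /// Map (part, within-part index) to a global slot.
    fn global_slot(&self, part: u64, in_part: u64) -> u64 {
        if in_part < self.s {
            // The common case: a regular slot of this part.
            part * self.s + in_part
        } else {
            // Overflow slot: this must be the j-th large part; its overflow
            // slots live at the back, after all regular slots.
            let j = self.large_before[part as usize] as u64;
            self.num_parts * self.s + j * (self.s_big - self.s) + (in_part - self.s)
        }
    }
}

fn main() {
    // Four parts; parts 1 and 3 are large.
    let large = [false, true, false, true];
    let mut large_before = Vec::new();
    let mut c = 0u32;
    for &l in &large {
        large_before.push(c);
        c += l as u32;
    }
    let p = PartedSlots { s: 100, s_big: 110, num_parts: 4, large_before };
    assert_eq!(p.global_slot(2, 57), 257);  // regular slot of part 2
    assert_eq!(p.global_slot(3, 105), 415); // 2nd large part, overflow slot 5: 400 + 10 + 5
    println!("total slots: {}", p.num_parts * p.s + c as u64 * (p.s_big - p.s));
}
```

In practice the prefix counts would be a small rank structure over the 1-bit-per-part vector rather than a plain array.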

de Bruijn Graphs

  • Metagraph (BOSS table): 3-4 bits/kmer: 2 for the characters, and 1-2 more for metadata.
  • Spectral BWT (SBWT): 4-5 bits/kmer.
  • SSHash: ~5 bits/kmer for \(k=63\); see table here.

Interestingly, the first two methods do not depend on \(k\), and involve a lot of cache misses for each lookup, while SSHash becomes better for larger \(k\), since it stores an SPSS (spectrum preserving string set) of fixed size (\(\approx 2\) bits/kmer in the best case, but usually more) and on top of that only stores \(O(\lg n)\) bits of data per minimizer, of which there are fewer for larger \(k\). (And also this incurs a cache miss per minimizer instead of per basepair.)
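In formula form, the paragraph above says roughly:

\[
\text{bits per } k\text{mer} \;\approx\; \underbrace{2+\varepsilon}_{\text{SPSS}} \;+\; \frac{\#\text{minimizers}}{\#k\text{mers}}\cdot O(\lg n),
\]

where the second term shrinks as \(k\) (and with it the window covered by each minimizer) grows.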