These are notes for DSB 2023. They’re not very structured though. I usually find methods more interesting than results.
Day 1, Tuesday
Practical data structures for longest common extensions, Alexander Herlez
- LCE: longest common extension: given \(i\), \(j\), the max \(k\) s.t. \(A[i, i+k) = A[j, j+k)\).
- alg:
- compare first k
- if same: sample a subset and use black-box datastructure.
- similar idea to minhash/mash kmer selection methods, same(?) as syncmers
- string synchronizing sets (SSS):
- rolling hash. sample a position when it has minimal hash in it’s window
- https://github.com/herlaz/alx
- useful for constant time LCE when extensions are length 1000 or longer. Not faster for shorter LCEs.
- note: WFA uses this a lot and it’s actually the bottleneck, but really only for short extensions.
Pan-genome de Bruijn graph using the bidirectional FM-index, Lore Depuydt
- Goal: graph of pangenome with correspondences to reads
- Implicit graph by Beller and Ohlebusch: build on top of FM-index of concatenation
- New in Nexus:
- Bidirectional FM-index for bidirectional traversal
- down/upstream neighbourhoods can be visualized efficiently using traversal
- Search schemes for lossless approximate pattern matching (separate talk)
- query -> candidate matches -> graph paths -> read paths
- Lossless aligner; reports many more occurrences than PuffAligner, although this does make it slower for >0 edit distance
- checkpoints k-mers inside long nodes
- this way, to find the start of a long (compressed) node goes down significantly.
- Bidirectional FM-index for bidirectional traversal
- Memory usage: Bidirectional FM-index is >80%.
- Bidir FM is linear in total size of pangenome.
- Future: r index, which uses less memory.
Indexing large metagenomics projects with abundances, Pierre Peterlongo
- Indexing read sets with abundance in <100GB with fast querying with low ram.
- Uses counting bloom filters
fimpera
decreases the overestimation, allowing less memory usage.- Once k-mers are sorted, all processes are sequential! 1000x speedup
- instead of querying k-mers, store all slightly smaller s-mers and query all of them for much better false positive rates
µ-PBWT: Enabling the Storage and Use of UK Biobank Data on a Commodity Laptop, Davide Cozzi
- r-index equivalent of PBWT, using run-length encoding
- 10-100x less memory than PBWT.
- https://github.com/dlcgold/muPBWT
Genome-on-Diet: Taming Large-Scale Genomic Analyses via Sparsified Genomics, Mohammed Alser
- Building an index on spaced kmers/patterns
- Only some positions are sampled. i.e.: sample every other basepair and build kmers on those.
- Index is built on the fly; not preprocessed
- to 2x faster and 2x less memory than minimap2.
- to 50x faster and 700x less memory than other tools.
- summary: subsample 1 in m bits and run minimap on top of that.
- similar to spaced seeds, but additionally subsamples the number of kmers.
- Faster and better accuracy than spaced seeds.
Spectrum preserving tilings enable sparse and modular reference indexing, Giulio Ermanno Pibiri
- Spectrum preserving tiling: Given a set of reads and their DBG, we can for each read store location information per unitig in this larger graph
- Use
SSHash
to store all unitigs. - Sample a subset of unitigs to store location information for. Non-sampled unitigs can still be recovered by querying adjacent unitig locations instead.
- Main contribution: reducing space usage of the reverse index: mapping kmers to locations.
- https://github.com/COMBINE-lab/pufferfish2
Towards a lower-memory chunked graph data structure inspired by Minecraft, Fawaz Dabbaghie
- Chunk big graphs as in minecraft chunks.
- First approach: split human genome graph in 1000 parts using BFS’s from random positions
- Already big speedup!
- TODO: simple idea and super effective. Maybe play around with this at some point
Optimal Worst-Case Design of Gapped k-mer Masks, Sven Rahmann
- Gapped kmers are better in the worst case than normal kmers:
- If you can make \(x\) substitutions in a length \(n\) strings, gapped kmers need a higher \(x\) to mutate (break) all of them.
- Second optimization goal: count number of recovered positions, instead of number of kmers.
#########_#########_#########
(27,29)-mer: in a \(n=100\) window, lack of such matches implies at least 4 errors. Normal 27-mers imply at least 3 errors.- boundary effects (changing \(n\)) are not super strong.
- TODO: read old papers and see if this could be used for A*PA
- How about inexact spaced matches?
Locality-Preserving Hashing of k-mers, Yoshihiro Shibuya
- Mapping of kmers to \([0, 4^n)\) such that kmers with same minimizer are close.
- Split mapping based on whether the minimizer is left-maximal and/or right-maximal inside its super-k-mer.
- https://github.com/jermp/lphash
- Less than 1.44 bits/element for large k (which is the generic lower bound).
Space-efficient k-mer counting using an implicit sequence representation, Miika Leinonen
- Map kmers into a hashtable storing at each index:
- count of kmers mapping here
- the last character of the kmer
- the index of the preceding kmer
- Memory usage: 25%-50% of normal hash table
- saves more memory for larger k.
VeChat: correcting errors in long reads using variation graphs, Alexander Schönhuth
- Error correction using graph cleaning
- Clean using repeated steps of:
- remove edges with low coverage
- clean edges with low confidence: the relative weight this edge has with respect to total weight of edges going out of predecessor or going into successor
- clean isolated nodes and leaf edges
- Up to 10x less remaining errors than other tools.
Fixing homopolymer errors in HiFi reads using dictionary compression, Diego Diaz-Dominguez
- Encode sequence recursively using grammer based
- Grammar compression is good on its own
- TODO: read paper
Orthanq: orthogonal evidence based haplotype quantification, Hamdiye Uzuner
- Variant calling pipeline
- https://github.com/orthanq
Day 2, Wednesday
Random Wheeler graphs, Riccardo Maso
missed it :(
The Graph Wheelerization Problem, Davide Tonetto
- problem: Given a directed graph, color the edges so that it becomes a wheeler graph, with smallest possible alphabet size.
- Properties of the bipartite graph:
- All incoming edges in each vertex are of the same color
- Edges of the same color do not cross
- target-vertices are sorted by color.
- Translate to bipartite graph variant.
- applications:
- Graph can be compressed into \(O(|E| \ln \sigma)\) bits.
- Can help with adjacency queries.
- Greedy linear algorithm can solve assuming total order is know
- ILP formulation for general case.
- Local search heuristic:
- given an order, swap any pair
- TODO: Wheeler graphs seem interesting from theoretical point of view.
Sorting Wheeler NFA’s using relational partition refinement, Bojana Kodric
- Problem: Given an input- consistent NFA, compute an ordered partition of states that is consistent with some Wheeler graph, assuming the input is Wheeler.
- Define forward stable condition, and then itaratively refine the condition.
- Existing algo: Alanko et al. (2021), runtime: \(n^2m\).
- Reusing partition refinement (Paige and Tarjan ‘87). => \(O(m\lg n)\).
Prefix-sorting strings on deterministic finite automata, Sung-Hwan Kim
- Basically: A suffix array for DFA’s.
- Algorithm: Use prefix doubling on the graph for min and max suffixes in each vertex.
- Then, obtain interval for each point and use the implicit partial order on these intervals.
MARIA: Multiple alignment r-index with aggregation, Adrián Goga
- Buils on MONI
- lossless compression of dataset
- finds maximal exact matches.
Approximate pattern matching using search schemes and in-text verification, Luca Renders
- Finding all matches with up to some edit distance, ie to align 100-long reads.
- tool: Columba
- Uses search schemes: set of k searches
- Reduce cache misses of Occurrences table by interleaving them per char.
- In-text verification: once number of candidate matching is small, compare them in the text instead of the index.
- Takeaway: Exact aligment can be as fast as approximate alignment using an efficient implementation
- https://github.com/biointec/columba
- TODO: Follow up on this and search schemes in general – very similar to all the things we considered for A*PA
Chaining of maximal exact matches in graphs, Nicola Rizzo
- Algorithm for near-linear co-linear chaining.
- Problem: symmetrick co-linear chaining with overlaps
- Given a set of anchors between query and graph, find the longest chain.
- Base paper: minimum path cover in parameterized linear time in \(O(k^2 |V| + |E|)\), used as bases for co-linear chaining algorithm.
- New paper on arxiv
RecGraph: recombination-aware alignment of sequences to variation graphs, Jorge Avila Cartes
- Sequence to graph alignment that allows recombinations
- Defines displacement as measure for how large the jump is
- Add new term to DP that uses the displacement
- Finds recombination position within 5bp 98% of time.
- >1 recombination is not implemented
- https://github.com/algolab/recgraph
Exact string alignments to (E)D-texts, Nadia Pisanti
- problem: aligning a string to an (elastic) degenerate string
- algo: align to DS in O(Nd) using WFA with affine costs
- algo: align to EDS in O(NW), W=|P|
- non-elastic: all variants have the same length. WFA applies almost directly.
- elastic: variants can have different lengths. Diagonal tracking as in WFA becomes more complicated.
- MSA corresponds to elastic degenerate string
- IDEA: How about banded alignment in elastic case?
- TODO: WFA for partial elastic string alignment
Periodicity of degenerate strings, Pengfei Wang
- Degenerate string here has only 1 char per set
- Define strong/medium/weak periodiciy of degenerate strings
- Some simple necessary conditions ensure that each \(P_s\subseteq P_m\subseteq P_w\) occurs in practice.
Deriving polygenic risk score using non-negative matrix factorization, Vu Lam Dang
- Matrix factorization over positive numbers using ML
- Can ’extract’ 20-40 features
Identifying antimicrobial resistance gene transfer between plasmids, Marco Teixeira
- method to detect transfer of genes between different plasmids based of graph similarity
Ideas from discussions
- inexact spaced kmers
- search schemes for finding inexact matches by querying multiple exact matches
- Banded alignment for elastic-degenerate string alignment.
- WFA for partial elastic string alignment