Concepts:
- Mapping: map a sequence onto a reference genome/dataset
- Assembly: build a genome from a set of reads
- de novo (implied): without using a reference genome
- With a reference genome, it is just called mapping
Typical complicating factors:
- read errors
- non-uniform coverage
- insert size variation
- chimeric reads (?)
- bi-reads
- non-uniform read coverage (as in metagenomics, i.e. multi-cell assembly)
Spaced \(k\)-mers
Also called
- [gapped] \(q\)-grams
- shapes
- patterns (patternhunter)
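A minimal sketch of what a spaced \(k\)-mer is, assuming the pattern is given as a string of `1` (position kept) and `0` (position ignored). The function name `spaced_kmers` and the toy pattern `1101` are made up for illustration.

```python
def spaced_kmers(seq, pattern):
    """Yield (start, spaced k-mer) for every window of len(pattern) in seq.

    Only the positions where pattern has a '1' contribute to the k-mer.
    """
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    for start in range(len(seq) - len(pattern) + 1):
        yield start, "".join(seq[start + i] for i in keep)


kmers = list(spaced_kmers("ACGTACGT", "1101"))
# A mismatch at an ignored position does not change the spaced k-mer.
same = list(spaced_kmers("ACGT", "1101")) == list(spaced_kmers("ACTT", "1101"))
```

The weight of the pattern (number of `1`s) plays the role of \(k\); the span of the pattern is the window length.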
Minimap
- minimap 1
- minimap 2
- Suzuki-Kasahara algorithm to stop aligning early
- minigraph
- GAF format
- miniasm
- table of tools for stages of assembly
- pack sequence id, position, and strand into one 64-bit integer
- instead of a hash table hash->vector<position>, use a sorted vector of positions and make each entry in the hash table point to a range of positions.
- OLC: Overlap-Layout-Consensus
- build overlap graph
- extract sequence
- make consensus to reduce error rate from reads
- bulge/tip removal heuristic: clean up small components of the overlap graph
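The two index tricks from the minimap notes above (packing a location into one 64-bit integer, and replacing per-key vectors by ranges into one sorted vector) can be sketched as follows. The bit layout (32 bits sequence id, 31 bits position, 1 bit strand) and all function names are assumptions for illustration, not taken from the minimap source.

```python
def pack(seq_id, pos, strand):
    """Pack (sequence id, position, strand) into one integer below 2**64.

    Assumed layout: seq_id in the high 32 bits, pos in the next 31, strand in bit 0.
    """
    assert 0 <= seq_id < 2**32 and 0 <= pos < 2**31 and strand in (0, 1)
    return (seq_id << 32) | (pos << 1) | strand


def unpack(x):
    """Inverse of pack."""
    return x >> 32, (x >> 1) & 0x7FFFFFFF, x & 1


def build_index(entries):
    """entries: iterable of (hash, packed location).

    Returns (ranges, locations): locations is one flat list sorted by hash,
    and ranges maps each hash to a half-open [start, end) slice of it.
    """
    ordered = sorted(entries)
    locations = [loc for _, loc in ordered]
    ranges = {}
    for i, (h, _) in enumerate(ordered):
        start, _ = ranges.get(h, (i, i))
        ranges[h] = (start, i + 1)
    return ranges, locations


def lookup(ranges, locations, h):
    """All packed locations stored under hash h (empty list if absent)."""
    start, end = ranges.get(h, (0, 0))
    return locations[start:end]
```

Compared to one vector per key, this keeps all positions in a single contiguous allocation, and each hash table entry only stores two offsets.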
SPAdes
Uses bi-reads and average insert size between them for better long-range information when assembling short reads.
Bi-reads: short reads are usually of length 100-200 and are read from the start and the end of a fragment of length 400-600. Thus, we get 2 reads with a 'known' distance between them.
- multisized De Bruijn Graph
- low k for low coverage regions, to not miss potential matches
- large k for high coverage regions, for high precision
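As a rough illustration of the trade-off between small and large \(k\), the sketch below builds a plain De Bruijn graph (nodes are \((k-1)\)-mers, edges are \(k\)-mers with their multiplicity) for two values of \(k\) on toy reads. It only builds the graphs separately; how SPAdes combines them into one multisized graph is not shown, and the name `de_bruijn_edges` is made up.

```python
from collections import Counter


def de_bruijn_edges(reads, k):
    """Count every k-mer in reads as an edge (prefix (k-1)-mer, suffix (k-1)-mer)."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


reads = ["ACGTAC", "GTACGG"]
graphs = {k: de_bruijn_edges(reads, k) for k in (3, 5)}
```

With \(k=3\) the two reads share edges (higher multiplicity, but also more ambiguity at the shared node `CG`); with \(k=5\) every edge is unique and the graph is a single path.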
MetaSPAdes
- metagenome (single/multicell sample) assembly
- Some graph simplifications for better coverage
- To extend the current edge, do a BFS and skip low-coverage edges.
- To fix (close) repeats, project ‘bulges’ onto each other and reconstruct the consensus later.
MUMmer4
- based on suffix trees (MUMmer 3) / suffix arrays (MUMmer 4)
BLASR
- aligns long reads with a high error rate (15-20%) to a reference genome
Bowtie 2
- \(n^2\) alignment
- SIMD-accelerated DP
- 'full-text minute index' (FM-index)
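To make the FM-index concrete, here is a toy sketch of exact matching by backward search over the Burrows-Wheeler transform. It builds the suffix array naively and stores full occurrence tables, so it is only meant for tiny inputs; real tools such as Bowtie 2 use sampled, compressed structures. The names `build_fm_index` and `find` are made up.

```python
def build_fm_index(text):
    """Return (suffix array, first-column offsets, occurrence tables) for text."""
    text += "$"  # sentinel; '$' sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    bwt = [text[i - 1] for i in sa]  # i == 0 wraps around to the sentinel
    # first[c]: number of characters in text that are smaller than c
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] for c in first}
    for b in bwt:
        for c in occ:
            occ[c].append(occ[c][-1] + (b == c))
    return sa, first, occ


def find(index, pattern):
    """Sorted start positions of all exact occurrences of pattern."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix array rows
    for c in reversed(pattern):  # extend the match one character to the left
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = build_fm_index("ACGTACGTAC")
hits = find(index, "ACG")
```

Each pattern character costs two table lookups, independent of the text length, which is what makes the index attractive for seeding short-read alignments.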
other aligners:
- BWA: Burrows-Wheeler Aligner
- BWA-SW: Smith-Waterman
Patternhunter
- uses spaced kmers / ‘patterns’
- Uses a fixed, sensitivity-optimal pattern 111010010100110111 of weight 11, which overlaps itself in at most 5 positions for any shift.
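The weight and self-overlap of the pattern can be checked directly. The sketch below counts, for every non-zero shift, how many `1` positions of the pattern land on `1` positions of its shifted copy, and compares against a contiguous seed of the same weight. The helper name `self_overlaps` is made up.

```python
def self_overlaps(pattern):
    """Map each shift s >= 1 to the number of '1' positions shared with the shifted pattern."""
    ones = {i for i, c in enumerate(pattern) if c == "1"}
    return {s: len(ones & {i + s for i in ones}) for s in range(1, len(pattern))}


PATTERN = "111010010100110111"
weight = PATTERN.count("1")
overlaps = self_overlaps(PATTERN)
max_overlap = max(overlaps.values())
# A contiguous 11-mer shares 10 positions with its copy shifted by one.
contiguous_max = max(self_overlaps("1" * 11).values())
```

Low self-overlap means that hits at neighbouring positions are less correlated than for a contiguous seed, which is the usual explanation for the higher sensitivity at equal weight.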
Patternhunter 2
- Multiple seed patterns for increased sensitivity
- greedy method for near optimal multiple seeds
- one hashtable per seed pattern
Spaced seeds improve \(k\)-mer-based metagenomic classification
- Modifies Kraken to use spaced \(k\)-mers and reports that it works better.
LoMeX
- For counting k-mer spectrum/frequencies
- Find occurrences of a spaced \(k\)-mer, and fill gaps by consensus of matches.
- memory efficient because it writes intermediate results to disk
Meeting notes
- use read count to infer duplicate regions
- Loman lab for long reads of a smaller reference: https://lomanlab.github.io/mockcommunity/