Spaced k-mer and assembler methods

14 July 2021 2-minute read

Ragnar Groot Koerkamp

assembly • spaced-kmers

Table of Contents

Spaced \(k\)-mers
Minimap
SPAdes
MUMmer4
BLASR
Bowtie 2
Patternhunter
Spaced seeds improve \(k\)-mer-based metagenomic classification
LoMeX
Meeting notes

Concepts:

Mapping Map a sequence onto a reference genome/dataset
Assembly Build a genome from a set of reads
- de novo (implied): without using a reference genome
- Otherwise just called mapping

Typical complicating factors:

read errors
non-uniform coverage
insert size variation
chimeric reads (?)
bireads
non-uniform read coverage (as in metagenomics, i.e. multi cell assembly)

Spaced \(k\)-mers

Also called

[gapped] \(q\)-grams
shapes
patterns (patternhunter)

Minimap

minimap 1
minimap 2
- Suzuki Kasahara algorithm to stop aligning early
minigraph
- GAF format
minimasm
- table of tools for stages of assembly
- pack sequence id, position, and strand into one 64-bit integer
- instead of a hash table hash->vector<position>, use a sorted vector of positions and make each entry in the hash table point to a range of positions.
- OLC: Overlap-Layout-Concensus
  1. build overlap graph
  2. extract sequence
  3. make consensus to reduce error rate from reads
- bulge/tip removal heuristic: clean up small components of the overlap graph

SPAdes

Uses bi-reads and average insert size between them for better long-range information when assembling short reads.

Bi-reads: short reads are usually read of length 100-200 at the start and end of a 400-600 long sequence. Thus, we get 2 reads with an ‘known’ distance between them.

multisized De Bruijn Graph
- low k for low coverage regions, to not miss potential matches
- large k for high coverage regions, for high precision

MetaSPAdes

metagenome (single/multicell sample) assembly
Some graph simplifications for better coverage
To extend the current edge, do a BFS and skip low-coverage edges.
To fix (close) repeats, project ‘bulges’ onto each other and reconstruct the consensus later.

MUMmer4

based on suffix trees (3) / suffix arrays (4)

BLASR

aligning for long high error rate (15-20%) to a reference genome

Bowtie 2

n^2 aligning
simbd accelerated DP
‘ful-text minute index’

other aligners:

BWA: Burrows-Wheeler Aligner
BWA-SW: Smith-Waterman

Patternhunter

uses spaced kmers / ‘patterns’
Uses a fixed optimal sensitivity pattern 111010010100110111 of weight 11, which only has at most 5 overlap for any shift.

Patternhunter 2

Multiple seed patterns for increased sensitivity
greedy method for near optimal multiple seeds
one hashtable per seed pattern

Spaced seeds improve \(k\)-mer-based metagenomic classification

Modified KRAKEN to use spaces kmers and reports that it works better.

LoMeX

For counting k-mer spectrum/frequencies
Find occurences of a spaced kmer, and fill gaps by consensus of matches.
memory efficient because it writes intermediate results to disk

Meeting notes

use read count to infer duplicate regions
lohman lab for long reads of smaller reference https://lomanlab.github.io/mockcommunity/