This is a research proposal for a 5 month internship at CWI during autumn/winter 2023-2024.
Introduction Link to heading
An important problem in bioinformatics is genome assembly: DNA sequencing machines read substrings of a full DNA genome, and these pieces must be assembled together to recover the entire genome.
Let
- The overlap graph, where each read
is a node, and nodes and are connected by a directed edge with length if there is a (sufficiently long) length suffix of that equals a prefix of . - The string graph (Myers 2005) is a simplified graph containing only the irreducible edges of the overlap graph. I.e. it is the smallest graph whose transitive closure is the overlap graph. Also, reads contained in other reads are removed.
- The De Bruijn graph of the reads, which
contains a node for each
-mer present in the set of reads, and for each -mer it adds an edge from its prefix -mer to the suffix -mer. Velvet and SPAdes use this approach (Zerbino and Birney 2008; Bankevich et al. 2012; Nurk et al. 2013).
After additional cleaning of the graph, the assembled genome is found as a set of paths/walks through it covering all nodes (for string graphs) or edges (depending on the exact type of De Bruijn graph used).
In the overlap graph and string graph approach, an important step is to find the
longest suffix-prefix overlap between all pairs of reads
- BLAST stores
-mers per read in a hashmap (Altschul et al. 1990) and counts matching -mers. - DAligner efficiently finds matching
-mers between two (sets of) reads by merge-sorting the -mers of the two (sets of) reads (Myers 2014; Rasmussen, Stoye, and Myers 2005). - MHAP sketches the
-mers in a sequence using MinHash (Berlin et al. 2015). - Minimap sketches the minimizers in a sequence using MinHash (Li 2016).
Alternatively, efficient datastructures can be used to compute all exact maximal pairwise suffix-prefix overlaps:
- Gusfield (Gusfield 1997) gives an
suffix-tree based algorithm. - SGA (Simpson and Durbin 2011) uses the FM-index for
time construction of all irreducible edges (Simpson and Durbin 2010). - Hifiasm (Cheng et al. 2021) is unclear but also seems to only use exact sufix-prefix matches, given that hifi reads have sufficient quality for exact overlaps.
- Loukides et al. (2023) takes a different approach and instead of
computing all (irreducible) pairwise overlaps up-front, it introduces multiple queries:
computes the longest overlap between and in . : computes the longest overlap between and each other in . reports all overlaps longer than in 11This and the methods below can also be done with
. complexity instead of using more advanced geometric algorithms. counts the overlaps longer than in . returns the top longest overlaps of in .
Research plan Link to heading
The plan for this internship is to improve, extend, and apply the results of this last paper, Loukides et al. (2023).
Improve query performance using Heavy-Light Decomposition Link to heading
(Loukides et al. 2023) uses advanced computational geometry techniques to
answer
In particular, it seems that a heavy-light decomposition (HLD) can possibly be used to
obtain an algorithm linear time construction and similar
Add more query types Link to heading
The current presented query types focus on one string at a time. In practice,
often we care about the answer to these queries for all strings, so we could
introduce e.g. the query
Furthermore, it may be possible to restrict queries to only count and return
irreducible overlaps: suffix-prefix overlaps
Extend to non-exact suffix-prefix-overlap that allows for read errors Link to heading
Modern DNA sequencing technologies usually introduce errors in the reads:
- ONT Long reads are very noisy (up to
), so up to errors can be present in read-read overlaps. This is worked around by using approximate methods such as -mer counting and minimizers. - New PacBio HiFi reads have error rates as low as
after cleaning. This makes algorithms based on string data structures useful again, given that they can indeed handle small error rates.
The goal here is to extend the various queries to also count/report matches with at most a fixed number of errors or at most some fixed error rate.
Implement an algorithm to build string graphs, and possibly a full assembler Link to heading
I would like to implement a fast algorithm to build the string graph, based on the queries provided above and/or existing string graph methods such as (Simpson and Durbin 2010). This could turn into a new string graph based assembler.
This first requires a thorough review of existing string graph algorithms and assemblers (Koren et al. 2017; Nurk et al. 2020; Cheng et al. 2021; Simpson and Durbin 2011; Rasmussen, Stoye, and Myers 2005), including new developments for diploid assembly that are able to create separate assemblies for the two copies of each chromosome.
References Link to heading
This and the methods below can also be done with
complexity instead of using more advanced geometric algorithms. ↩︎