This is a research proposal for a 5 month internship at CWI during autumn/winter 2023-2024.
Introduction
An important problem in bioinformatics is genome assembly: DNA sequencing machines read substrings of a full DNA genome, and these pieces must be assembled together to recover the entire genome.
Let \(k\) be the number of reads and let \(R = \{S_1, \dots, S_k\}\) be the set of reads with total length \(n:= |S_1| + \dots + |S_k|\). An important part of the assembly process is to construct an assembly graph of all the reads (after first cleaning the reads). Three approaches to this, ignoring details regarding reverse complements, are (Liao et al. 2019):
- The overlap graph, where each read \(S_i\) is a node, and nodes \(S_i\) and \(S_j\) are connected by a directed edge \(S_i\rightarrow S_j\) with length \(|S_i| - \ell\) if there is a (sufficiently long) length \(\ell\) suffix of \(S_i\) that equals a prefix of \(S_j\).
- The string graph (Myers 2005) is a simplified graph containing only the irreducible edges of the overlap graph. I.e. it is the smallest graph whose transitive closure is the overlap graph. Also, reads contained in other reads are removed.
- The De Bruijn graph of the reads, which contains a node for each $k$-mer present in the set of reads, and for each $(k+1)$-mer it adds an edge from its prefix $k$-mer to the suffix $k$-mer. Velvet and SPAdes use this approach (Zerbino and Birney 2008; Bankevich et al. 2012; Nurk et al. 2013).
After additional cleaning of the graph, the assembled genome is found as a set of paths/walks through it covering all nodes (for string graphs) or edges (depending on the exact type of De Bruijn graph used).
In the overlap graph and string graph approach, an important step is to find the longest suffix-prefix overlap between all pairs of reads \((S_i, S_j)\). Since a full alignment per pair is slow and long reads are often noisy, this is usually sped up using an (approximate) filter (Li 2016):
- BLAST stores $k$-mers per read in a hashmap (Altschul 1997) and counts matching $k$-mers.
- DAligner efficiently finds matching $k$-mers between two (sets of) reads by merge-sorting the $k$-mers of the two (sets of) reads (Myers 2014; Rasmussen, Stoye, and Myers 2005).
- MHAP sketches the $k$-mers in a sequence using MinHash (Berlin et al. 2015).
- Minimap sketches the minimizers in a sequence using MinHash (Li 2016).
Alternatively, efficient datastructures can be used to compute all exact maximal pairwise suffix-prefix overlaps:
- Gusfield (Gusfield 1997) gives an \(O(n+k^2)\) suffix-tree based algorithm.
- SGA (Simpson and Durbin 2011) uses the FM-index for \(O(n)\) time construction of all irreducible edges (Simpson and Durbin 2010).
- Hifiasm (Cheng et al. 2021) is unclear but also seems to only use exact sufix-prefix matches, given that hifi reads have sufficient quality for exact overlaps.
- Loukides et al. (2023) takes a different approach and instead of
computing all (irreducible) pairwise overlaps up-front, it introduces multiple queries:
- \(OneToOne(i,j)\) computes the longest overlap between \(S_i\) and \(S_j\) in \(O(\log \log k)\).
- \(OneToAll(i)\): computes the longest overlap between \(S_i\) and each other \(S_j\) in \(O(k)\).
- \(Report(i,l)\) reports all overlaps longer than \(l\) in \(O(\log n + output)\)1.
- \(Count(i,l)\) counts the overlaps longer than \(l\) in \(O(\log n)\).
- \(Top(i,K)\) returns the top \(K\) longest overlaps of \(S_i\) in \(O(\log^2 n + K)\).
Research plan
The plan for this internship is to improve, extend, and apply the results of this last paper, Loukides et al. (2023).
Improve query performance using Heavy-Light Decomposition
(Loukides et al. 2023) uses advanced computational geometry techniques to answer \(Report\), \(Count\), and \(Top\) queries. My aim is to replace these by simpler methods on graphs directly, and improve both the construction complexity (to \(O(n)\)) and the query complexity (to \(O(\log n+output)\) or even \(O(1)\)).
In particular, it seems that a heavy-light decomposition (HLD) can possibly be used to obtain an algorithm linear time construction and similar \(O(\log n+output)\) query complexity.
Add more query types
The current presented query types focus on one string at a time. In practice, often we care about the answer to these queries for all strings, so we could introduce e.g. the query \(AllTop(K)\) which answers \(Top(i, K)\) for all \(i\).
Furthermore, it may be possible to restrict queries to only count and return irreducible overlaps: suffix-prefix overlaps \(S_i \rightarrow S_j\) such that there is no \(S_k\) that ‘fits in between’, where both \(S_i \rightarrow S_k\) and \(S_k \rightarrow S_j\) are valid suffix-prefix overlaps.
Extend to non-exact suffix-prefix-overlap that allows for read errors
Modern DNA sequencing technologies usually introduce errors in the reads:
- ONT Long reads are very noisy (up to \(10\%\)), so up to \(20\%\) errors can be present in read-read overlaps. This is worked around by using approximate methods such as $k$-mer counting and minimizers.
- New PacBio HiFi reads have error rates as low as \(0.1\%\) after cleaning. This makes algorithms based on string data structures useful again, given that they can indeed handle small error rates.
The goal here is to extend the various queries to also count/report matches with at most a fixed number of errors or at most some fixed error rate.
Implement an algorithm to build string graphs, and possibly a full assembler
I would like to implement a fast algorithm to build the string graph, based on the queries provided above and/or existing string graph methods such as (Simpson and Durbin 2010). This could turn into a new string graph based assembler.
This first requires a thorough review of existing string graph algorithms and assemblers (Koren et al. 2017; Nurk et al. 2020; Cheng et al. 2021; Simpson and Durbin 2011; Rasmussen, Stoye, and Myers 2005), including new developments for diploid assembly that are able to create separate assemblies for the two copies of each chromosome.
References
This and the methods below can also be done with \(\log n / \log \log n\) complexity instead of \(\log n\) using more advanced geometric algorithms. ↩︎