# Research proposal: subquadratic string graph construction

This is a research proposal for a 5 month internship at CWI during autumn/winter 2023-2024.

## Introduction

An important problem in bioinformatics is genome assembly: DNA sequencing machines read substrings of a full DNA genome, and these pieces must be assembled together to recover the entire genome.

Let $$k$$ be the number of reads and let $$R = \{S_1, \dots, S_k\}$$ be the set of reads with total length $$n:= |S_1| + \dots + |S_k|$$. An important part of the assembly process is to construct an assembly graph of all the reads (after first cleaning the reads). Three approaches to this, ignoring details regarding reverse complements, are (Liao et al. 2019):

• The overlap graph, where each read $$S_i$$ is a node, and nodes $$S_i$$ and $$S_j$$ are connected by a directed edge $$S_i\rightarrow S_j$$ with length $$|S_i| - \ell$$ if there is a (sufficiently long) length $$\ell$$ suffix of $$S_i$$ that equals a prefix of $$S_j$$.
• The string graph (Myers 2005) is a simplified graph containing only the irreducible edges of the overlap graph. I.e. it is the smallest graph whose transitive closure is the overlap graph. Also, reads contained in other reads are removed.
• The De Bruijn graph of the reads, which contains a node for each $k$-mer present in the set of reads, and for each $(k+1)$-mer it adds an edge from its prefix $k$-mer to the suffix $k$-mer. Velvet and SPAdes use this approach (Zerbino and Birney 2008; Bankevich et al. 2012; Nurk et al. 2013).

After additional cleaning of the graph, the assembled genome is found as a set of paths/walks through it covering all nodes (for string graphs) or edges (depending on the exact type of De Bruijn graph used).

In the overlap graph and string graph approach, an important step is to find the longest suffix-prefix overlap between all pairs of reads $$(S_i, S_j)$$. Since a full alignment per pair is slow and long reads are often noisy, this is usually sped up using an (approximate) filter (Li 2016):

• BLAST stores $k$-mers per read in a hashmap (Altschul 1997) and counts matching $k$-mers.
• DAligner efficiently finds matching $k$-mers between two (sets of) reads by merge-sorting the $k$-mers of the two (sets of) reads (NO_ITEM_DATA:daligner; Rasmussen, Stoye, and Myers 2005).
• MHAP sketches the $k$-mers in a sequence using MinHash (Berlin et al. 2015).
• Minimap sketches the minimizers in a sequence using MinHash (Li 2016).

Alternatively, efficient datastructures can be used to compute all exact maximal pairwise suffix-prefix overlaps:

• Gusfield (Gusfield 1997) gives an $$O(n+k^2)$$ suffix-tree based algorithm.
• SGA (Simpson and Durbin 2011) uses the FM-index for $$O(n)$$ time construction of all irreducible edges (Simpson and Durbin 2010).
• Hifiasm (Cheng et al. 2021) is unclear but also seems to only use exact sufix-prefix matches, given that hifi reads have sufficient quality for exact overlaps.
• Loukides et al. (2023) takes a different approach and instead of computing all (irreducible) pairwise overlaps up-front, it introduces multiple queries:
• $$OneToOne(i,j)$$ computes the longest overlap between $$S_i$$ and $$S_j$$ in $$O(\log \log k)$$.
• $$OneToAll(i)$$: computes the longest overlap between $$S_i$$ and each other $$S_j$$ in $$O(k)$$.
• $$Report(i,l)$$ reports all overlaps longer than $$l$$ in $$O(\log n + output)$$1.
• $$Count(i,l)$$ counts the overlaps longer than $$l$$ in $$O(\log n)$$.
• $$Top(i,K)$$ returns the top $$K$$ longest overlaps of $$S_i$$ in $$O(\log^2 n + K)$$.

## Research plan

The plan for this internship is to improve, extend, and apply the results of this last paper, Loukides et al. (2023).

### Improve query performance using Heavy-Light Decomposition

(Loukides et al. 2023) uses advanced computational geometry techniques to answer $$Report$$, $$Count$$, and $$Top$$ queries. My aim is to replace these by simpler methods on graphs directly, and improve both the construction complexity (to $$O(n)$$) and the query complexity (to $$O(\log n+output)$$ or even $$O(1)$$).

In particular, it seems that a heavy-light decomposition (HLD) can possibly be used to obtain an algorithm linear time construction and similar $$O(\log n+output)$$ query complexity.

The current presented query types focus on one string at a time. In practice, often we care about the answer to these queries for all strings, so we could introduce e.g. the query $$AllTop(K)$$ which answers $$Top(i, K)$$ for all $$i$$.

Furthermore, it may be possible to restrict queries to only count and return irreducible overlaps: suffix-prefix overlaps $$S_i \rightarrow S_j$$ such that there is no $$S_k$$ that ‘fits in between’, where both $$S_i \rightarrow S_k$$ and $$S_k \rightarrow S_j$$ are valid suffix-prefix overlaps.

### Extend to non-exact suffix-prefix-overlap that allows for read errors

Modern DNA sequencing technologies usually introduce errors in the reads:

• ONT Long reads are very noisy (up to $$10\%$$), so up to $$20\%$$ errors can be present in read-read overlaps. This is worked around by using approximate methods such as $k$-mer counting and minimizers.
• New PacBio HiFi reads have error rates as low as $$0.1\%$$ after cleaning. This makes algorithms based on string data structures useful again, given that they can indeed handle small error rates.

The goal here is to extend the various queries to also count/report matches with at most a fixed number of errors or at most some fixed error rate.

### Implement an algorithm to build string graphs, and possibly a full assembler

I would like to implement a fast algorithm to build the string graph, based on the queries provided above and/or existing string graph methods such as (Simpson and Durbin 2010). This could turn into a new string graph based assembler.

This first requires a thorough review of existing string graph algorithms and assemblers (Koren et al. 2017; Nurk et al. 2020; Cheng et al. 2021; Simpson and Durbin 2011; Rasmussen, Stoye, and Myers 2005), including new developments for diploid assembly that are able to create separate assemblies for the two copies of each chromosome.

## References

Altschul, S. 1997. “Gapped Blast and Psi-Blast: A New Generation of Protein Database Search Programs.” Nucleic Acids Research 25 (17): 3389–3402. https://doi.org/10.1093/nar/25.17.3389.
Bankevich, Anton, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, et al. 2012. “Spades: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing.” Journal of Computational Biology 19 (5): 455–77. https://doi.org/10.1089/cmb.2012.0021.
Berlin, Konstantin, Sergey Koren, Chen-Shan Chin, James P Drake, Jane M Landolin, and Adam M Phillippy. 2015. “Assembling Large Genomes with Single-Molecule Sequencing and Locality-Sensitive Hashing.” Nature Biotechnology 33 (6): 623–30. https://doi.org/10.1038/nbt.3238.
Cheng, Haoyu, Gregory T. Concepcion, Xiaowen Feng, Haowen Zhang, and Heng Li. 2021. “Haplotype-Resolved de Novo Assembly Using Phased Assembly Graphs with Hifiasm.” Nature Methods 18 (2): 170–75. https://doi.org/10.1038/s41592-020-01056-5.
Gusfield, Dan. 1997. “Algorithms on Strings, Trees and Sequences,” May. https://doi.org/10.1017/cbo9780511574931.
Koren, Sergey, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Nicholas H. Bergman, and Adam M. Phillippy. 2017. “Canu: Scalable and Accurate Long-Read Assembly via Adaptive K-Mer Weighting and Repeat Separation.” Genome Research 27 (5): 722–36. https://doi.org/10.1101/gr.215087.116.
Li, Heng. 2016. “Minimap and Miniasm: Fast Mapping and de Novo Assembly for Noisy Long Sequences.” Bioinformatics 32 (14): 2103–10. https://doi.org/10.1093/bioinformatics/btw152.
Liao, Xingyu, Min Li, You Zou, Fang-Xiang Wu, Yi-Pan, and Jianxin Wang. 2019. “Current Challenges and Solutions of de Novo Assembly.” Quantitative Biology 7 (2): 90–109. https://doi.org/10.1007/s40484-019-0166-9.
Loukides, Grigorios, Solon P. Pissis, Sharma V. Thankachan, and Wiktor Zuba. 2023. “Suffix-Prefix Queries on a Dictionary.” Schloss Dagstuhl - Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPICS.CPM.2023.21.
Myers, E. W. 2005. “The Fragment Assembly String Graph.” Bioinformatics 21 (Suppl 2): ii79–ii85. https://doi.org/10.1093/bioinformatics/bti1114.
Nurk, Sergey, Anton Bankevich, Dmitry Antipov, Alexey Gurevich, Anton Korobeynikov, Alla Lapidus, Andrey Prjibelsky, et al. 2013. “Assembling Genomes and Mini-Metagenomes from Highly Chimeric Reads.” Research in Computational Molecular Biology, 158–70. https://doi.org/10.1007/978-3-642-37195-0_13.
Nurk, Sergey, Brian P. Walenz, Arang Rhie, Mitchell R. Vollger, Glennis A. Logsdon, Robert Grothe, Karen H. Miga, Evan E. Eichler, Adam M. Phillippy, and Sergey Koren. 2020. “Hicanu: Accurate Assembly of Segmental Duplications, Satellites, and Allelic Variants from High-Fidelity Long Reads.” Genome Research 30 (9): 1291–1305. https://doi.org/10.1101/gr.263566.120.
Rasmussen, Kim R., Jens Stoye, and Eugene W. Myers. 2005. “Efficient Q-Gram Filters for Finding All Ε-Matches over a given Length.” Research in Computational Molecular Biology, 189–203. https://doi.org/10.1007/11415770_15.
Simpson, Jared T., and Richard Durbin. 2010. “Efficient Construction of an Assembly String Graph Using the Fm-Index.” Bioinformatics 26 (12): i367–73. https://doi.org/10.1093/bioinformatics/btq217.
———. 2011. “Efficient de Novo Assembly of Large Genomes Using Compressed Data Structures.” Genome Research 22 (3): 549–56. https://doi.org/10.1101/gr.126953.111.
Zerbino, Daniel R., and Ewan Birney. 2008. “Velvet: Algorithms for de Novo Short Read Assembly Using de Bruijn Graphs.” Genome Research 18 (5): 821–29. https://doi.org/10.1101/gr.074492.107.
NO_ITEM_DATA:daligner

1. This and the methods below can also be done with $$\log n / \log \log n$$ complexity instead of $$\log n$$ using more advanced geometric algorithms. ↩︎