Quantitative Biology

ISSN 2095-4689

ISSN 2095-4697(Online)

CN 10-1028/TM

Postal distribution code 80-971

Quantitative Biology  2020, Vol. 8 Issue (3): 203-215   https://doi.org/10.1007/s40484-020-0214-5
A survey on de novo assembly methods for single-molecular sequencing
Ying Chen, Chuan-Le Xiao
State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangzhou 510275, China
Abstract

Background: Single-molecular sequencing (SMS) is developing rapidly and generating increasingly long and accurate sequences. De novo assembly of genomes from SMS sequences is a critical step in many genomic studies. To keep pace with these developments, many de novo assemblers for SMS data have been released. Their workflows fall into two categories: the correction-and-assembly strategy and the assembly-and-correction strategy, both of which are attracting growing attention.

Results: In this article we discuss the characteristics of errors in SMS sequences. We then review the de novo assemblers currently in wide use for SMS sequences. We also describe computational methods relevant to de novo assembly, including alignment methods and error correction methods. Benchmarks are provided to analyze their performance on different datasets and to offer guidance on applying these computational methods.

Conclusion: We review in detail the latest developments in de novo assembly and related algorithms for SMS, including their rationales, solutions and results. We also provide usage guidance for the algorithms based on their benchmark results. Finally, we conclude the review by outlining some development trends of third generation sequencing (TGS).

Keywords: third generation sequencing; single-molecular real-time sequencing; sequence alignment; sequence error correction; de novo assembly
Received: 2020-03-10      Published: 2020-09-25
Corresponding Author(s): Chuan-Le Xiao   
Cite this article:
Ying Chen, Chuan-Le Xiao. A survey on de novo assembly methods for single-molecular sequencing. Quant. Biol., 2020, 8(3): 203-215.
Link to this article:
https://academic.hep.com.cn/qb/CN/10.1007/s40484-020-0214-5
https://academic.hep.com.cn/qb/CN/Y2020/V8/I3/203
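| Method type | Method | Target sequencing platform | Assembly workflow |
|---|---|---|---|
| Alignment methods | BLASR | PacBio | FALCON |
| Alignment methods | DALIGN | PacBio | FALCON |
| Alignment methods | Minimap2 | PacBio, ONT | |
| Alignment methods | edlib | PacBio, ONT | Canu |
| Consensus methods | DAGCon | PacBio | PBcR, MECAT |
| Consensus methods | FALCON-sense | PacBio, ONT | FALCON, Canu |
| Correction-and-assembly workflow | FALCON | PacBio | |
| Correction-and-assembly workflow | Canu | PacBio, ONT | |
| Correction-and-assembly workflow | MECAT | PacBio | |
| Assembly-and-correction workflow | Flye | PacBio, ONT | |
| Assembly-and-correction workflow | Wtdbg2 | PacBio, ONT | |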
Tab.1  
Dataset Metric CANU FALCON Flye MECAT Wtdbg2
C. elegans (80X)
Assembly size 106.5 Mb 100.8 Mb 102.0 Mb 102.1 Mb 104.8 Mb
% reference cover 99.58 99.16 99.29 99.51 99.37
NG75 1,884,280 935,802 1,275,590 1,424,674 2,255,274
NG50 2,677,990 1,629,544 1,926,198 2,113,456 3,596,268
Wall-clock time 9 h 30 m 2 h 6 m 2 h 58 m 3 h 8 m 26 m
A. thaliana (75X)
Assembly size 118.3 Mb 119.3 Mb 116.4 Mb 117.9 Mb 117.8 Mb
% reference cover 90.53 91.07 90.62 90.91 90.79
NG75 589,667 6,083,367 3,857,946 1,339,557 6,073,475
NG50 1,098,921 10,401,798 6,726,569 4,070,235 11,218,688
Wall-clock time 15 h 50 m (by PacBio) 3 h 33 m 4 h 31 m 1 h 6 m
D. melanogaster (120X)
Assembly size 138.5 Mb 131.2 Mb 134.8 Mb 131.9 Mb
% reference cover 91.09 89.69 90.26 89.63
NG75 8,173,560 2,198,937 5,557,762 5,109,865
NG50 20,301,392 9,318,110 16,078,851 17,035,952
Wall-clock time 16 h 35 m 5 h 13 m 5 h 3 m 46 m
Human (100X)
Assembly size 2,837 Mb 2,938 Mb 2,712 Mb
% reference cover 89.33 90.13 86.03
NG75 3,793,440 7,726,658 4,387,668
NG50 17,570,750 26,132,317 18,220,221
Total CPU hours (pre-polish CPU hours) 22,750 68,789 2,506 (632)
Tab.2  
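The NG50 and NG75 rows in Tab.2 are contiguity metrics. As a minimal sketch, assuming the standard definition (NGx is the length of the shortest contig in the smallest set of longest contigs that together cover at least x% of the reference genome size), they could be computed as follows. The function name `ngx` and the toy contig lengths are hypothetical and are not taken from the benchmarked assemblies.

```python
from typing import Iterable, Optional


def ngx(contig_lengths: Iterable[int], genome_size: int, x: float = 50.0) -> Optional[int]:
    """Return NGx for an assembly, or None if the contigs never reach x% of genome_size.

    Hypothetical helper for illustration; it assumes the standard NGx definition.
    """
    if genome_size <= 0 or not 0 < x <= 100:
        raise ValueError("genome_size must be positive and x must be in (0, 100]")
    target = genome_size * x / 100.0
    covered = 0
    # Walk contigs from longest to shortest, accumulating covered bases.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return None  # assembly too small to cover x% of the genome


# Toy data (made up): four contigs against a reference genome of size 100.
toy_contigs = [40, 30, 20, 10]
toy_genome_size = 100
ng50 = ngx(toy_contigs, toy_genome_size, 50)  # 40 + 30 = 70 >= 50, so NG50 is 30
ng75 = ngx(toy_contigs, toy_genome_size, 75)  # 40 + 30 + 20 = 90 >= 75, so NG75 is 20
print(ng50, ng75)
```

Because the threshold is tied to the reference genome size rather than to the assembly size, NGx values stay comparable across assemblers whose total assembly sizes differ, as they do in the table.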
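| Dataset | Workflow | Assembly size (bp) | Contigs | NG50 (AP) | CPU hours |
|---|---|---|---|---|---|
| A. thaliana | Canu | 113408765 | 288 | 6,522,919 (28%) | 1423 |
| A. thaliana | Flye | 126772040 | 160 | 12,043,133 (51%) | 46.1 |
| A. thaliana | Wtdbg2 | 115323989 | 349 | 9,840,213 (42%) | 14.4 |
| D. melanogaster | Canu | 146764973 | 499 | 3,508,917 (14%) | 1548.8 |
| D. melanogaster | Flye | 138916413 | 553 | 10,420,459 (41%) | 116.5 |
| D. melanogaster | Wtdbg2 | 138929862 | 872 | 6,633,247 (26%) | 26.0 |
| C. reinhardtii | Canu | 116421921 | 93 | 4,563,858 (59%) | 18320 |
| C. reinhardtii | Flye | 112517660 | 52 | 6,720,472 (86%) | 136.3 |
| C. reinhardtii | Wtdbg2 | 115667394 | 344 | 4,289,786 (55%) | 35.4 |
| O. sativa | Canu | 383923158 | 385 | 5,041,373 (16%) | 19568.0 |
| O. sativa | Flye | 380980383 | 177 | 8,315,232 (27%) | 454.2 |
| O. sativa | Wtdbg2 | 394595916 | 2554 | 2,432,307 (8%) | 154.3 |
| S. pennellii | Canu | 961827720 | 2010 | 1,663,626 (66%) | 21131.5 |
| S. pennellii | Flye | 1003210907 | 2807 | 1,847,300 (73%) | 1687.2 |
| S. pennellii | Wtdbg2 | 934260260 | 4986 | 1,227,952 (49%) | 439.0 |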
Tab.3  