CloudLCA: finding the lowest common ancestor in metagenome analysis using cloud computing
Guoguang Zhao1,4, Dechao Bu1,4, Changning Liu1, Jing Li1, Jian Yang3, Zhiyong Liu1, Yi Zhao1, Runsheng Chen1,2
1. Bioinformatics Research Group, Key Laboratory of Intelligent Information Processing, Advanced Computing Research Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 2. Bioinformatics Laboratory and National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China; 3. State Key Laboratory for Molecular Virology and Genetic Engineering, National Institute for Viral Disease Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing 100176, China; 4. Graduate School of the Chinese Academy of Sciences, Beijing 100190, China
Abstract Estimating taxonomic content is a key problem in the analysis of metagenomic sequencing data. However, extracting such content from high-throughput next-generation sequencing data is very time-consuming with currently available software. Here, we present CloudLCA, a parallel lowest common ancestor (LCA) algorithm that significantly improves the efficiency of determining taxonomic composition in metagenomic data analysis. Results show that CloudLCA (1) has a running time that grows nearly linearly with dataset size, (2) displays linear speedup as the number of processors grows, especially for large datasets, and (3) reaches a speed of nearly 215 million reads per minute on a cluster with ten thin nodes. Compared with MEGAN, a well-known metagenome analyzer, CloudLCA is up to five times faster, and its peak memory usage is approximately 18.5% of MEGAN's when both run on a fat node. CloudLCA can be run on a single multiprocessor node or on a cluster. It is expected to become part of MEGAN to accelerate read analysis: it generates the same output as MEGAN, which can be imported directly into MEGAN to complete the downstream analysis. Moreover, CloudLCA is a general solution for finding the lowest common ancestor and can be applied in other fields that require an LCA algorithm.
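To make the core operation concrete, the following is a minimal single-process sketch of LCA-based read assignment. It is not CloudLCA's implementation, which the abstract does not describe in detail. The toy taxonomy `PARENT`, the `(read_id, taxon)` hit format, and all function names are assumptions made for illustration only. Each read is assigned the deepest taxonomy node that is an ancestor of every taxon it matched.

```python
from collections import defaultdict

# Hypothetical toy taxonomy (child -> parent); the root points to itself.
PARENT = {
    "root": "root",
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Salmonella": "Proteobacteria",
    "Bacillus": "Firmicutes",
}


def path_to_root(taxon, parent=PARENT):
    """Return the nodes from `taxon` up to the root, inclusive."""
    path = [taxon]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(taxa, parent=PARENT):
    """Return the deepest node that is an ancestor of every taxon in `taxa`."""
    taxa = list(taxa)
    if not taxa:
        raise ValueError("need at least one taxon")
    # Root-first path of the first taxon is the initial candidate chain.
    common = path_to_root(taxa[0], parent)[::-1]
    for taxon in taxa[1:]:
        other = path_to_root(taxon, parent)[::-1]
        # Keep only the shared prefix of the two root-first paths.
        shared = 0
        for a, b in zip(common, other):
            if a != b:
                break
            shared += 1
        common = common[:shared]
    return common[-1]


def assign_reads(hits, parent=PARENT):
    """Group (read_id, taxon) hits by read and assign each read its LCA."""
    by_read = defaultdict(list)
    for read_id, taxon in hits:
        by_read[read_id].append(taxon)
    return {
        read_id: lowest_common_ancestor(taxa, parent)
        for read_id, taxa in by_read.items()
    }


if __name__ == "__main__":
    example_hits = [
        ("r1", "Escherichia"), ("r1", "Salmonella"),
        ("r2", "Escherichia"), ("r2", "Bacillus"),
        ("r3", "Bacillus"),
    ]
    print(assign_reads(example_hits))
```

In this sketch each read's LCA depends only on that read's own hits, so the grouping and per-read steps could in principle be distributed across processors or nodes. How CloudLCA actually partitions the work is not specified here.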
Keywords
CloudLCA
metagenome analysis
cloud computing
Corresponding Author(s):
Zhao Yi, Email: biozy@ict.ac.cn; Chen Runsheng, Email: chenrs@sun5.ibp.ac.cn
Issue Date: 01 February 2012