Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal distribution code: 80-970

2019 Impact Factor: 1.275

Frontiers of Computer Science  2023, Vol. 17 Issue (3): 173902   https://doi.org/10.1007/s11704-022-2011-y
AE-TPGG: a novel autoencoder-based approach for single-cell RNA-seq data imputation and dimensionality reduction
Shuchang ZHAO1,2, Li ZHANG1,3, Xuejun LIU1,2
1. MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
2. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210023, China
3. College of Computer Science and Technology, Nanjing Forestry University, Nanjing 210037, China
Abstract

Single-cell RNA sequencing (scRNA-seq) technology has become an effective tool for high-throughput transcriptomic studies. It circumvents the averaging artifacts of bulk RNA-seq technology and yields new perspectives on the cellular diversity of populations that may appear superficially homogeneous. Although various sequencing techniques have reduced the amplification bias and improved the capture efficiency, both of which are affected by the low amount of starting material, technical noise and biological variation are inevitably introduced during the experimental process. This results in frequent dropout events, which greatly hinder downstream analysis. Considering the bimodal expression pattern and the right-skewed characteristic present in normalized scRNA-seq data, we propose a customized autoencoder based on a two-part-generalized-gamma distribution (AE-TPGG) for scRNA-seq data analysis. The model accounts for the mixed discrete-continuous random variables of scRNA-seq data with a two-part model and uses the generalized gamma (GG) distribution to fit the positive, right-skewed continuous data. The adopted autoencoder enables AE-TPGG to capture the inherent relationships between genes. In addition to producing a low-dimensional representation, the AE-TPGG model also provides a denoised imputation according to the statistical characteristics of gene expression. Results on real datasets demonstrate that the proposed model is competitive with current imputation methods and improves a diverse set of typical scRNA-seq data analyses.
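The two-part idea in the abstract can be illustrated with a minimal sketch: a point mass at zero for dropouts plus a generalized gamma density for positive expression values. The parameterization below (Stacy form with scale `a` and shapes `d`, `p`), the function names, and the toy values are assumptions made for illustration; the paper's own parameterization and its autoencoder loss are not given on this page.

```python
import math

def gg_logpdf(x, a, d, p):
    """Log-density of a generalized gamma (Stacy form) at x > 0.

    Assumed parameterization: a = scale, d and p = shapes (all > 0).
    With p = 1 this reduces to a gamma with shape d and scale a.
    """
    return (math.log(p) - d * math.log(a) + (d - 1.0) * math.log(x)
            - (x / a) ** p - math.lgamma(d / p))

def tpgg_logpdf(x, pi0, a, d, p):
    """Two-part log-likelihood of one value.

    pi0 is the probability of an exact zero; positive values follow
    the generalized gamma with total weight (1 - pi0).
    """
    if x == 0.0:
        return math.log(pi0)
    return math.log1p(-pi0) + gg_logpdf(x, a, d, p)

def tpgg_nll(values, pi0, a, d, p):
    """Negative log-likelihood of a sample under the two-part model."""
    return -sum(tpgg_logpdf(v, pi0, a, d, p) for v in values)

# Hypothetical normalized expression values for one gene.
expr = [0.0, 0.0, 1.2, 0.0, 2.5, 0.7]
print(tpgg_nll(expr, pi0=0.5, a=1.0, d=2.0, p=1.5))
```

In a model of this kind, a network would typically output the per-gene parameters and be trained by minimizing such a negative log-likelihood; that training setup is likewise an assumption here rather than a description of AE-TPGG.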

Key words: scRNA-seq, autoencoder, TPGG, data imputation, dimensionality reduction
Received: 2022-01-07      Published: 2022-10-25
Corresponding Author(s): Xuejun LIU   
Cite this article:
Shuchang ZHAO, Li ZHANG, Xuejun LIU. AE-TPGG: a novel autoencoder-based approach for single-cell RNA-seq data imputation and dimensionality reduction. Front. Comput. Sci., 2023, 17(3): 173902.
Link to this article:
https://academic.hep.com.cn/fcs/CN/10.1007/s11704-022-2011-y
https://academic.hep.com.cn/fcs/CN/Y2023/V17/I3/173902
Fig.1  
Fig.2  
Gene      Min   Max    Mean   Variance   Zero ratio   Skewness
Cpe       0     3.51   0.62   0.56       0.53         0.57
Hsd11b1   0     2.56   0.17   0.15       0.81         1.53
Tab.1  
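The per-gene summary statistics in this table can be computed along the following lines. This is a minimal sketch: the function name and the toy input are hypothetical, and it assumes population (divide-by-n) variance and the moment-based skewness coefficient, since the page does not say which estimators were used.

```python
def summarize(values):
    """Summary statistics for one gene's normalized expression values."""
    n = len(values)
    mean = sum(values) / n
    # Second and third central moments (population form, divide by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "variance": m2,
        # Fraction of cells in which the gene has exactly zero expression.
        "zero_ratio": sum(1 for v in values if v == 0) / n,
        # Moment coefficient of skewness; positive means right-skewed.
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else float("nan"),
    }

# Hypothetical expression values across four cells.
print(summarize([0.0, 0.0, 1.0, 3.0]))
```

A high zero ratio together with positive skewness, as reported for both genes, is the pattern that motivates a two-part model with a right-skewed continuous component.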
Two-part model   Gene      MLE estimates                    Est. mean   Est. variance   Est. skewness   KS-test   Log-likelihood   AIC
TPGM             Cpe       π̂_1=0.53, α̂=6.40, β̂=4.86        0.62        0.56            0.79            0.97      -2787.02         5580.04
TPGM             Hsd11b1   π̂_1=0.81, α̂=7.31, β̂=8.29        0.17        0.14            0.74            0.80      -1460.48         2936.96
TPLNM            Cpe       π̂_2=0.53, μ̂=0.20, σ̂=0.41        0.62        0.59            1.36            0.25      -2803.16         5612.31
TPLNM            Hsd11b1   π̂_2=0.81, μ̂=-0.20, σ̂=0.37       0.17        0.14            1.19            1.00      -1442.60         2891.20
Tab.2  
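Some of the tabulated quantities can be recomputed from the estimates as a sanity check. The sketch below assumes that AIC is 2k - 2 ln L with k = 3 free parameters per gene, that the log-likelihood is negative (-2787.02 for Cpe under TPGM), that β is a rate parameter of the gamma part, and that π is the zero probability. These are assumptions chosen because they reproduce the Cpe rows; the function names are hypothetical.

```python
import math

def aic(log_likelihood, n_params=3):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def tpgm_mean(pi0, alpha, beta):
    """Mean of a two-part gamma model; beta is assumed to be a rate."""
    return (1 - pi0) * alpha / beta

def tplnm_mean(pi0, mu, sigma):
    """Mean of a two-part log-normal model."""
    return (1 - pi0) * math.exp(mu + sigma ** 2 / 2)

# Cpe rows of the table.
print(round(aic(-2787.02), 2))
print(round(tpgm_mean(0.53, 6.40, 4.86), 2))
print(round(tplnm_mean(0.53, 0.20, 0.41), 2))
```

Under these assumptions the three printed values are 5580.04, 0.62, and 0.62, in line with the AIC and estimated means listed for Cpe.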
Fig.3  
Fig.4  
Fig.5  
  
Dataset         Sequencing protocol   Cell types   No. of cells   No. of genes   Zero ratio
Deng            Smart-seq             10           268            22,431         0.60
Kolodziejczyk   SMARTer               3            704            38,616         0.71
Klein           inDrop                4            2,717          24,175         0.66
pbmc1-10Xv2     10x Chromium (v2)     9            6,444          22,280         0.96
Tab.3  
Fig.6  
Fig.7  
Fig.8  
Fig.9  
Fig.10  
Fig.11  
Fig.12  
Fig.13  
Fig.14  
Fig.15  
Fig.16  
  
  
  
  