AE-TPGG: a novel autoencoder-based approach for single-cell RNA-seq data imputation and dimensionality reduction
Shuchang ZHAO1,2, Li ZHANG1,3, Xuejun LIU1,2
1. MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
2. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210023, China
3. College of Computer Science and Technology, Nanjing Forestry University, Nanjing 210037, China
Abstract Single-cell RNA sequencing (scRNA-seq) technology has become an effective tool for high-throughput transcriptomic studies. It circumvents the averaging artifacts of bulk RNA-seq technology and yields new perspectives on the cellular diversity of seemingly homogeneous populations. Although various sequencing techniques have reduced the amplification bias and improved the low capture efficiency that result from the small amount of starting material, technical noise and biological variation are inevitably introduced during the experimental process, resulting in frequent dropout events that greatly hinder downstream analysis. Considering the bimodal expression pattern and the right-skewed distribution present in normalized scRNA-seq data, we propose a customized autoencoder based on a two-part generalized gamma distribution (AE-TPGG) for scRNA-seq data analysis. The model accounts for the mixed discrete-continuous random variables in scRNA-seq data using a two-part model, and uses the generalized gamma (GG) distribution to fit the positive, right-skewed continuous data. The adopted autoencoder enables AE-TPGG to capture the inherent relationships between genes. In addition to producing a low-dimensional representation, the AE-TPGG model provides a denoised imputation based on the statistical characteristics of gene expression. Results on real datasets demonstrate that the proposed model is competitive with current imputation methods and improves a diverse set of typical scRNA-seq data analyses.
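The two-part model described in the abstract can be illustrated with a minimal sketch. It assumes the Stacy parametrization of the generalized gamma density, f(x; a, d, p) = p x^(d-1) exp(-(x/a)^p) / (a^d Γ(d/p)) for x > 0, and a single zero-probability parameter `pi`. The function names and parameter names are hypothetical, and the parametrization actually used by AE-TPGG may differ. The sketch shows only the per-value log-likelihood, not the autoencoder that would predict these parameters.

```python
import math


def gg_logpdf(x, a, d, p):
    """Log-density of the generalized gamma (Stacy form, an assumption here).

    a: scale > 0, d and p: shape parameters > 0, x: strictly positive value.
    """
    if x <= 0:
        raise ValueError("the generalized gamma density is defined for x > 0")
    return (
        math.log(p)
        - d * math.log(a)
        + (d - 1.0) * math.log(x)
        - (x / a) ** p
        - math.lgamma(d / p)
    )


def tpgg_loglik(x, pi, a, d, p):
    """Log-likelihood of one normalized expression value under a two-part model.

    pi is the probability of an exact zero (discrete part); positive values
    follow the generalized gamma density, weighted by (1 - pi).
    """
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must lie strictly between 0 and 1")
    if x == 0:
        return math.log(pi)
    return math.log1p(-pi) + gg_logpdf(x, a, d, p)


# Example: a zero and a positive value for one hypothetical gene.
values = [0.0, 2.0]
total = sum(tpgg_loglik(v, pi=0.3, a=2.0, d=1.0, p=1.0) for v in values)
```

With d = p = 1 the generalized gamma reduces to an exponential distribution with scale a, which gives a simple sanity check for the density.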
Keywords: scRNA-seq, autoencoder, TPGG, data imputation, dimensionality reduction
Corresponding Author(s): Xuejun LIU
Just Accepted Date: 22 April 2022
Issue Date: 25 October 2022