Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2023, Vol. 17 Issue (3) : 173902    https://doi.org/10.1007/s11704-022-2011-y
RESEARCH ARTICLE
AE-TPGG: a novel autoencoder-based approach for single-cell RNA-seq data imputation and dimensionality reduction
Shuchang ZHAO1,2, Li ZHANG1,3, Xuejun LIU1,2
1. MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
2. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210023, China
3. College of Computer Science and Technology, Nanjing Forestry University, Nanjing 210037, China
Abstract

Single-cell RNA sequencing (scRNA-seq) technology has become an effective tool for high-throughput transcriptomic studies. It circumvents the averaging artifacts of bulk RNA-seq technology and yields new perspectives on the cellular diversity of superficially homogeneous populations. Although various sequencing techniques have reduced the amplification bias caused by the low amount of starting material and have improved capture efficiency, technical noise and biological variation are inevitably introduced during the experimental process, resulting in frequent dropout events that greatly hinder downstream analysis. Considering the bimodal expression pattern and the right-skewed characteristic of normalized scRNA-seq data, we propose a customized autoencoder based on a two-part-generalized-gamma distribution (AE-TPGG) for scRNA-seq data analysis. The model accounts for the mixed discrete-continuous nature of scRNA-seq data using a two-part model and uses the generalized gamma (GG) distribution to fit the positive, right-skewed continuous data. The adopted autoencoder enables AE-TPGG to capture the inherent relationships between genes. In addition to producing a low-dimensional representation, the AE-TPGG model also provides a denoised imputation according to the statistical characteristics of gene expression. Results on real datasets demonstrate that our proposed model is competitive with current imputation methods and improves a diverse set of typical scRNA-seq data analyses.

Keywords: scRNA-seq, autoencoder, TPGG, data imputation, dimensionality reduction
Corresponding Author(s): Xuejun LIU   
Just Accepted Date: 22 April 2022   Issue Date: 25 October 2022
 Cite this article:   
Shuchang ZHAO,Li ZHANG,Xuejun LIU. AE-TPGG: a novel autoencoder-based approach for single-cell RNA-seq data imputation and dimensionality reduction[J]. Front. Comput. Sci., 2023, 17(3): 173902.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-2011-y
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I3/173902
Fig.1  Histogram of dropout ratio of the 23,840 genes across the 2,717 cells in Klein dataset
Fig.2  Histograms of the overall and positive expression distribution for Apoo, Bmp2, Ccdc63, and Deb1 genes in Klein dataset
Gene      Min   Max    Mean   Variance   Zero ratio   Skewness
Cpe       0     3.51   0.62   0.56       0.53         0.57
Hsd11b1   0     2.56   0.17   0.15       0.81         1.53
Tab.1  Relevant expression statistics of the expression of Cpe and Hsd11b1 in Klein dataset
Two-part model   Gene      MLE estimates                       Estimated mean   Estimated variance   Estimated skewness   KS-test   Log-likelihood   AIC
TPGM             Cpe       π̂1 = 0.53, α̂ = 6.40, β̂ = 4.86       0.62             0.56                 0.79                 0.97      -2787.02         5580.04
TPGM             Hsd11b1   π̂1 = 0.81, α̂ = 7.31, β̂ = 8.29       0.17             0.14                 0.74                 0.80      -1460.48         2936.96
TPLNM            Cpe       π̂2 = 0.53, μ̂ = 0.20, σ̂ = 0.41       0.62             0.59                 1.36                 0.25      -2803.16         5612.31
TPLNM            Hsd11b1   π̂2 = 0.81, μ̂ = -0.20, σ̂ = 0.37      0.17             0.14                 1.19                 1.00      -1442.60         2891.20
Tab.2  Results obtained from the MLE of TPGM and TPLNM for the expression level of Cpe and Hsd11b1 in Klein dataset
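As a rough consistency check on the TPGM rows of Tab.2, the following minimal sketch recomputes the estimated mean, variance, and AIC from the reported MLE values. It assumes that π is the probability of a zero, that β is a rate parameter of the gamma part, and that the model has three free parameters; the function names are hypothetical and this is not the authors' code.

```python
def two_part_gamma_moments(pi_zero, alpha, beta):
    """Mean and variance of X = 0 with prob. pi_zero, else Gamma(shape=alpha, rate=beta).

    ASSUMPTION: beta is a rate parameter and pi_zero is the zero probability.
    """
    m1 = alpha / beta                      # E[X | X > 0]
    m2 = alpha * (alpha + 1) / beta ** 2   # E[X^2 | X > 0]
    mean = (1 - pi_zero) * m1
    var = (1 - pi_zero) * m2 - mean ** 2
    return mean, var


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


# Reported TPGM estimates for Cpe (Tab.2).
mean, var = two_part_gamma_moments(pi_zero=0.53, alpha=6.40, beta=4.86)
print(f"{mean:.2f} {var:.2f}")          # 0.62 0.56
print(f"{aic(-2787.02, 3):.2f}")        # 5580.04, assuming 3 free parameters
```

Under these assumptions the recomputed values for Cpe agree with the estimated mean, estimated variance, and AIC listed in the table.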
Fig.3  The sample distributions and estimated gamma distributions of the positive expression values of Cpe and Hsd11b1 in Klein dataset
Fig.4  The percentage of different model selections according to the AIC values for five batches of randomly selected genes from Klein dataset
Fig.5  The framework of the proposed AE-TPGG. First, the encoder module of AE-TPGG automatically extracts a high-level compressed representation of the gene expression profile using multiple fully connected layers. Subsequently, the decoder component of AE-TPGG derives the parameters α, β, γ and π, from which the imputation output is obtained as the expectation of each gene expression level. The input is associated with the output by optimizing the negative log-likelihood of the TPGG shown at the top of the diagram
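The loss and the imputation output named in the Fig.5 caption can be illustrated with a minimal per-value sketch of a two-part generalized gamma model. This is an assumption-laden illustration, not the authors' implementation: it uses Stacy's parametrization of the generalized gamma with the hypothetical names `scale`, `d` and `p` (how these map to the paper's α, β, γ is not stated on this page), and `pi_zero` is assumed to be the probability of a zero.

```python
import math


def gg_log_pdf(x, scale, d, p):
    """Log-density of Stacy's generalized gamma at x > 0 (ASSUMED parametrization):
    f(x) = p / scale**d * x**(d - 1) * exp(-(x / scale)**p) / Gamma(d / p)."""
    return (math.log(p) - d * math.log(scale) + (d - 1) * math.log(x)
            - (x / scale) ** p - math.lgamma(d / p))


def gg_mean(scale, d, p):
    """Mean of the generalized gamma: scale * Gamma((d + 1) / p) / Gamma(d / p)."""
    return scale * math.exp(math.lgamma((d + 1) / p) - math.lgamma(d / p))


def tpgg_nll(x, pi_zero, scale, d, p):
    """Negative log-likelihood of one expression value under the two-part model."""
    if x == 0:
        return -math.log(pi_zero)                       # discrete part: zero
    return -(math.log1p(-pi_zero) + gg_log_pdf(x, scale, d, p))  # continuous part


def tpgg_expectation(pi_zero, scale, d, p):
    """Expected expression level, used here as the imputed value."""
    return (1 - pi_zero) * gg_mean(scale, d, p)


# With p = 1 the generalized gamma reduces to a gamma with shape d and the given scale.
print(gg_mean(scale=2.0, d=3.0, p=1.0))
print(tpgg_nll(0.0, pi_zero=0.5, scale=1.0, d=1.0, p=1.0))
```

In an autoencoder, the decoder would emit one such parameter set per gene and cell, and the summed negative log-likelihood would be minimized; that training loop is omitted here.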
  
Dataset         Sequencing protocol   No. of cell types   No. of cells   No. of genes   Zero ratio
Deng            Smart-seq             10                  268            22,431         0.60
Kolodziejczyk   SMARTer               3                   704            38,616         0.71
Klein           inDrop                4                   2,717          24,175         0.66
pbmc1-10Xv2     10x Chromium (v2)     9                   6,444          22,280         0.96
Tab.3  Statistics of the four real datasets
Fig.6  The two-dimensional visualization of Deng dataset (1st row), Kolodziejczyk dataset (2nd row), Klein dataset (3rd row), pbmc1-10Xv2 dataset (4th row). Columns correspond to the raw data of With dropout-C and With dropout-N, the imputed data using MAGIC-C, MAGIC-N, DCA, SAVER, scImpute and AE-TPGG. Cells are colored according to cell types
Fig.7  The clustering performance of the raw data of With dropout-C and With dropout-N, and the imputed data using MAGIC-C, MAGIC-N, DCA, SAVER, scImpute and AE-TPGG on the Deng dataset, Kolodziejczyk dataset, Klein dataset and pbmc1-10Xv2 dataset measured by ACC, ARI, NMI and F1
Fig.8  The t-SNE (left) and AE-TPGG (right) projections of 68K PBMCs, colored according to the 10 purified cell subtypes
Fig.9  The clustering performance of the PBMC 68K dataset measured by ACC, ARI, NMI and F1
Fig.10  The t-SNE visualization of the transcriptomic profiles of cord blood mononuclear cells from Stoeckius. Cell types are labeled with marker genes
Fig.11  The t-SNE visualizations of the protein expression (1st row), RNA expression derived from the original data (2nd row), imputation data using AE-TPGG (3rd row). Columns correspond to CD3 (1st column), CD8 (2nd column), CD56 (3rd column) proteins and corresponding RNAs CD3E, CD8A and NCAM1
Fig.12  The t-SNE visualizations of imputation data of RNA expression using MAGIC (1st row), DCA (2nd row), SAVER (3rd row), and scImpute (4th row). Columns correspond to RNAs CD3E (1st column), CD8A (2nd column) and NCAM1 (3rd column)
Fig.13  Spearman correlation coefficients of the six protein-RNA pairs for the original and imputed data using DCA, SAVER, scImpute, MAGIC and AE-TPGG
Fig.14  Heatmaps of the top 200 highly variable genes, consisting of 100 positive genes and 100 negative genes associated with time course within the dataset, using expression data without dropout, with dropout, and the imputed data from AE-TPGG, MAGIC, DCA, SAVER, scImpute, respectively. Yellow and blue colors represent relatively high and low expression levels, respectively. Zero values are colored grey
Fig.15  Boxplots of Pearson correlation coefficients between gene expression and the known developmental pattern across the 500 most highly correlated genes within the dataset using the imputed data from AE-TPGG, DCA, MAGIC, SAVER, scImpute, and expression data with dropout, without dropout, respectively. The box represents the interquartile range, the horizontal line in the box is the median, and the whiskers represent 1.5 times the interquartile range. Black dots represent outliers
Fig.16  Gene expression trajectory for the exemplary anti-correlated gene pair tbx-36 and his-8 over time for the data without dropout, with dropout, and the imputed data using AE-TPGG, MAGIC, DCA, SAVER and scImpute
  
  
  
  Fig.A1 The t-SNE visualizations of the protein expression (1st row), RNA expression derived from the original data (2nd row), imputation data using AE-TPGG (3rd row), MAGIC (4th row), DCA (5th row), SAVER (6th row) and scImpute (7th row). Columns correspond to CD16 (1st column), CD11c (2nd column), CD14 (3rd column) proteins and corresponding RNAs FCGR3A, ITGAX and CD14
1 S S Potter. Single-cell RNA sequencing for the study of development, physiology and disease. Nature Reviews Nephrology, 2018, 14(8): 479–492
2 H Li, E T Courtois, D Sengupta, Y Tan, K H Chen, J J L Goh, S L Kong, C Chua, L K Hon, W S Tan, M Wong, P J Choi, L J K Wee, A M Hillmer, I B Tan, P Robson, S Prabhakar. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nature Genetics, 2017, 49(5): 708–718
3 Y Cao, B Su, X Guo, W Sun, Y Deng, L Bao, Q Zhu, X Zhang, Y Zheng, C Geng, X Chai, R He, X Li, Q Lv, H Zhu, W Deng, Y Xu, Y Wang, L Qiao, Y Tan, L Song, G Wang, X Du, N Gao, J Liu, J Xiao, X Su, Z Du, Y Feng, C Qin, C Qin, R Jin, X S Xie. Potent neutralizing antibodies against SARS-CoV-2 identified by high-throughput single-cell sequencing of convalescent patients’ B cells. Cell, 2020, 182(1): 73–84.e16
4 P V Kharchenko, L Silberstein, D T Scadden. Bayesian approach to single-cell differential expression analysis. Nature Methods, 2014, 11(7): 740–742
5 G Finak, A McDavid, M Yajima, J Deng, V Gersuk, A K Shalek, C K Slichter, H W Miller, M J Mcelrath, M Prlic, P S Linsley, R Gottardo. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biology, 2015, 16(1): 278
6 A T L Lun, K Bach, J C Marioni. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biology, 2016, 17(1): 75
7 W V Li, J J Li. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nature Communications, 2018, 9(1): 997
8 M Huang, J Wang, E Torre, H Dueck, S Shaffer, R Bonasio, J I Murray, A Raj, M Li, N R Zhang. SAVER: gene expression recovery for single-cell RNA sequencing. Nature Methods, 2018, 15(7): 539–542
9 D van Dijk, R Sharma, J Nainys, K Yim, P Kathail, A J Carr, C Burdziak, K R Moon, C L Chaffer, D Pattabiraman, B Bierie, L Mazutis, G Wolf, S Krishnaswamy, D Pe’er. Recovering gene interactions from single-cell data using data diffusion. Cell, 2018, 174(3): 716–729.e27
10 Z Basharat, S Majeed, H Saleem, I A Khan, A Yasmin. An overview of algorithms and associated applications for single cell RNA-seq data imputation. Current Genomics, 2021, 22(5): 319–327
11 Y LeCun, Y Bengio, G Hinton. Deep learning. Nature, 2015, 521(7553): 436–444
12 Y Bengio, A Courville, P Vincent. Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(8): 1798–1828
13 K Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 1991, 4(2): 251–257
14 G E Hinton, R R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006, 313(5786): 504–507
15 A Kadurin, S Nikolenko, K Khrabrov, A Aliper, A Zhavoronkov. druGAN: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico. Molecular Pharmaceutics, 2017, 14(9): 3098–3104
16 G Eraslan, L M Simon, M Mircea, N S Mueller, F J Theis. Single-cell RNA-seq denoising using a deep count autoencoder. Nature Communications, 2019, 10(1): 390
17 Z Zhang, F Cui, C Wang, L Zhao, Q Zou. Goals and approaches for each processing step for single-cell RNA sequencing data. Briefings in Bioinformatics, 2021, 22(4): bbaa314
18 A Mortazavi, B A Williams, K McCue, L Schaeffer, B Wold. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods, 2008, 5(7): 621–628
19 J K Pickrell, J C Marioni, A A Pai, J F Degner, B E Engelhardt, E Nkadori, J B Veyrieras, M Stephens, Y Gilad, J K Pritchard. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature, 2010, 464(7289): 768–772
20 D Risso, K Schwartz, G Sherlock, S Dudoit. GC-content normalization for RNA-seq data. BMC Bioinformatics, 2011, 12(1): 480
21 C A Vallejos, D Risso, A Scialdone, S Dudoit, J C Marioni. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nature Methods, 2017, 14(6): 565–571
22 B Li, V Ruotti, R M Stewart, J A Thomson, C N Dewey. RNA-seq gene expression estimation with read mapping uncertainty. Bioinformatics, 2010, 26(4): 493–500
23 F Belotti, P Deb, W G Manning, E C Norton. Twopm: two-part models. The Stata Journal: Promoting communications on statistics and Stata, 2015, 15(1): 3–20
24 J F Lawless. Inference in the generalized gamma and log gamma distributions. Technometrics, 1980, 22(3): 409–419
25 D Risso, F Perraudeau, S Gribkova, S Dudoit, J P Vert. A general and flexible method for signal extraction from single-cell RNA-seq data. Nature Communications, 2018, 9(1): 284
26 A M Klein, L Mazutis, I Akartuna, N Tallapragada, A Veres, V Li, L Peshkin, D A Weitz, M W Kirschner. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell, 2015, 161(5): 1187–1201
27 T P Minka. Estimating a gamma distribution. Microsoft Research, 2002, 1(3): 3–5
28 F Chollet. Keras. See Github.com/fchollet/keras website
29 M Abadi, A Agarwal, P Barham, E Brevdo, Z Chen, et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. 2016, arXiv preprint arXiv: 1603.04467
30 Q Deng, D Ramsköld, B Reinius, R Sandberg. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science, 2014, 343(6167): 193–196
31 A A Kolodziejczyk, J K Kim, J C H Tsang, T Ilicic, J Henriksson, K N Natarajan, A C Tuck, X Gao, M Bühler, P Liu, J C Marioni, S A Teichmann. Single cell RNA-Sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell, 2015, 17(4): 471–485
32 J Ding, X Adiconis, S K Simmons, M S Kowalczyk, C C Hession, N D Marjanovic, T K Hughes, M H Wadsworth, T Burks, L T Nguyen, J Y H Kwon, B Barak, W Ge, A J Kedaigle, S Carroll, S Li, N Hacohen, O Rozenblatt-Rosen, A K Shalek, A C Villani, A Regev, J Z Levin. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nature Biotechnology, 2020, 38(6): 737–746
33 D Arthur, S Vassilvitskii. k-means++: the advantages of careful seeding. In: Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms. 2007, 1027–1035
34 G X Y Zheng, J M Terry, P Belgrader, P Ryvkin, Z W Bent, R Wilson, S B Ziraldo, T D Wheeler, G P McDermott, J Zhu, M T Gregory, J Shuga, L Montesclaros, J G Underwood, D A Masquelier, S Y Nishimura, M Schnall-Levin, P W Wyatt, C M Hindson, R Bharadwaj, A Wong, K D Ness, L W Beppu, H J Deeg, C Mcfarland, K R Loeb, W J Valente, N G Ericson, E A Stevens, J P Radich, T S Mikkelsen, B J Hindson, J H Bielas. Massively parallel digital transcriptional profiling of single cells. Nature Communications, 2017, 8(1): 14049
35 M Stoeckius, C Hafemeister, W Stephenson, B Houck-Loomis, P K Chattopadhyay, H Swerdlow, R Satija, P Smibert. Simultaneous epitope and transcriptome measurement in single cells. Nature Methods, 2017, 14(9): 865–868
36 C Xu, Z Su. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics, 2015, 31(12): 1974–1980
37 J Levine, E Simonds, S Bendall, K Davis, E A Amir, M Tadmor, O Litvin, H Fienberg, A Jager, E Zunder, R Finck, A Gedman, I Radtke, J Downing, D Pe’er, G Nolan. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell, 2015, 162(1): 184–197
38 M Francesconi, B Lehner. The effects of genetic variation on gene expression dynamics during development. Nature, 2014, 505(7482): 208–211
39 M E Boeck, C Huynh, L Gevirtzman, O A Thompson, G Wang, D M Kasper, V Reinke, L W Hillier, R H Waterston. The time-resolved transcriptome of C. elegans. Genome Research, 2016, 26(10): 1441–1450