Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal distribution code: 80-970

2019 Impact Factor: 1.275

Frontiers of Computer Science  2023, Vol. 17 Issue (1): 171310   https://doi.org/10.1007/s11704-022-1172-z
Distortion-free PCA on sample space for highly variable gene detection from single-cell RNA-seq data
Momo MATSUDA, Yasunori FUTAMURA, Xiucai YE, Tetsuya SAKURAI
Department of Computer Science, University of Tsukuba, Tsukuba 305-8573, Japan
Abstract

Single-cell RNA-seq (scRNA-seq) measures gene expression in individual cells, which enables the detection of highly variable genes (HVGs) that contribute to cell-to-cell variation within a homogeneous cell population. HVG detection is a necessary step before clustering analysis because it improves the clustering result. scRNA-seq data include genes that are expressed with a certain probability in all cells, which makes the cells indistinguishable; these genes are referred to as background noise. To remove the background noise and select informative genes for clustering analysis, we propose an effective HVG detection method based on principal component analysis (PCA). The proposed method uses PCA to evaluate the genes (features) on the sample space. The distortion-free principal components are selected, and the distance from the origin to each gene in these components serves as the weight of that gene. The genes with the greatest distances from the origin are selected for clustering analysis. Experimental results on both synthetic and gene expression datasets show that the proposed method not only removes the background noise and selects informative genes for clustering analysis, but also outperforms existing HVG detection methods.
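
The weighting step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not state how the distortion-free principal components are chosen, so the `components` argument below is a placeholder that the caller must supply. The function names `gene_weights` and `select_genes`, the per-gene centering, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def gene_weights(X, components):
    """Weight each gene by its distance from the origin in selected PCs.

    X          : (n_cells, n_genes) expression matrix.
    components : indices of the principal components to keep. How the
                 distortion-free components are chosen is NOT specified in
                 the abstract, so this argument is a placeholder.
    """
    # Center each gene across cells (an assumption of this sketch).
    Xc = X - X.mean(axis=0, keepdims=True)
    # Thin SVD: Xc = U @ diag(s) @ Vt. The columns of U are orthonormal
    # directions in sample space (one coordinate per cell).
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Row j holds gene j's coordinates along those directions,
    # i.e. Xc[:, j] @ U, which equals Vt[:, j] * s.
    coords = Vt.T * s
    return np.linalg.norm(coords[:, components], axis=1)

def select_genes(X, components, n_top):
    """Return indices of the n_top genes farthest from the origin."""
    w = gene_weights(X, components)
    return np.argsort(w)[::-1][:n_top]

# Toy data (hypothetical): 60 cells in two groups, 50 genes, of which
# genes 0..4 differ between the groups and the rest are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 50))
X[30:, :5] += 5.0
top = select_genes(X, components=[0], n_top=5)
print(sorted(top.tolist()))
```

On this toy matrix, keeping only the first component should rank the five group-specific genes highest. If all components are kept, each weight reduces to the norm of the centered gene vector, so the choice of components is what distinguishes this weighting from a plain variance ranking.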

Key words: single-cell RNA-sequencing, feature selection, principal component analysis, highly variable gene detection, background noise, clustering analysis
Received: 2021-04-13      Published: 2022-03-17
Corresponding Author(s): Xiucai YE, Tetsuya SAKURAI
Cite this article:
Momo MATSUDA, Yasunori FUTAMURA, Xiucai YE, Tetsuya SAKURAI. Distortion-free PCA on sample space for highly variable gene detection from single-cell RNA-seq data. Front. Comput. Sci., 2023, 17(1): 171310.
Link to this article:
https://academic.hep.com.cn/fcs/CN/10.1007/s11704-022-1172-z
https://academic.hep.com.cn/fcs/CN/Y2023/V17/I1/171310
Fig.1  
Fig.2  
Fig.3  
Fig.4  
Dataset    # of samples  # of features  # of classes  Accession number
Synthetic  1,000         4,500          5             -
Gierahn    4,296         6,713          6             GSM2486333
Pollen     276           13,007         7             GSE71315
Tab.1  
Fig.5  
Fig.6  
Fig.7  
Fig.8  
Fig.9  
Fig.10  
Fig.11  
Fig.12  
Gene    Rank  p_val     avg_logFC  pct_1  pct_2  p_val_adj  Cluster
IGKC    2     8.9E-172  2.235      0.410  0.030  4.5E-170   Bcell
MS4A1   6     0.0E+00   2.157      0.723  0.044  0.0E+00    Bcell
BANK1   10    3.1E-168  1.280      0.380  0.024  1.0E-166   Bcell
IL7R    9     1.4E-130  1.204      0.765  0.311  7.1E-129   CD4
CD3D    49    3.2E-110  1.016      0.577  0.182  8.2E-109   CD4
TRAC    43    1.2E-90   1.003      0.481  0.146  1.2E-89    CD4
IL7R    9     1.4E-96   1.540      0.851  0.353  4.7E-95    CD8
CD2     28    2.6E-97   1.492      0.755  0.225  1.8E-95    CD8
CD3D    49    9.9E-56   1.128      0.603  0.222  2.2E-54    CD8
TXN     13    2.1E-72   3.102      1.000  0.492  2.3E-71    DC
IDO1    18    1.1E-135  2.658      0.971  0.150  2.9E-134   DC
TBC1D4  22    1.0E-260  2.443      0.817  0.036  8.0E-259   DC
KYNU    93    1.7E-181  1.094      0.616  0.157  1.7E-179   Myeloid
CST3    175   3.9E-173  1.010      0.658  0.207  1.9E-171   Myeloid
CSTA    166   3.9E-140  1.015      0.437  0.078  1.3E-138   Myeloid
IFITM1  78    1.5E-117  1.354      0.652  0.210  1.1E-115   NK
IFITM2  26    1.9E-58   0.972      0.624  0.333  7.4E-57    NK
IL32    128   2.9E-29   0.822      0.442  0.228  7.6E-28    NK
Tab.2  
Gene      Rank  p_val    avg_logFC  pct_1  pct_2  p_val_adj  Cluster
MKI67     3     8.9E-25  3.825      1.000  0.108  1.7E-23    Dividing R.G.
KIF15     8     3.2E-26  3.662      1.000  0.092  1.8E-23    Dividing R.G.
TPX2      7     4.5E-24  3.641      1.000  0.108  6.3E-23    Dividing R.G.
MT-RNR1   79    2.1E-04  2.348      1.000  0.956  2.8E-23    Endothelia
ATP1A2    83    3.3E-01  1.965      0.333  0.207  4.5E-01    Endothelia
RPS6      70    1.1E-03  1.583      1.000  0.970  8.6E-03    Endothelia
FAM60A    74    3.8E-05  2.156      0.691  0.247  1.4E-03    Inter Progenitor
CCND2     42    2.5E-04  0.942      1.000  0.937  4.1E-04    Inter Progenitor
PRDX1     71    3.3E-04  0.885      0.905  0.671  4.1E-03    Inter Progenitor
DLX6-AS1  10    1.6E-36  6.610      1.000  0.299  1.9E-35    Interneuron
GAD1      87    3.8E-31  5.617      0.673  0.041  1.8E-30    Interneuron
FAM65B    81    3.3E-31  4.984      0.691  0.050  1.8E-30    Interneuron
SATB2     37    8.6E-20  2.029      0.972  0.329  2.2E-18    Maturing Neuron
MPP6      49    5.8E-11  2.001      0.750  0.258  1.6E-10    Maturing Neuron
MCTP1     69    5.7E-12  1.965      0.611  0.142  2.4E-11    Maturing Neuron
SEMA3C    86    1.3E-15  1.819      0.773  0.330  1.6E-14    Newborn Neuron
MLLT3     51    4.1E-17  1.195      0.979  0.782  1.0E-15    Newborn Neuron
ENC1      80    2.4E-08  0.937      0.938  0.715  1.2E-07    Newborn Neuron
CLU       39    5.7E-30  4.123      0.956  0.242  1.3E-28    Radial Glia
GPX3      99    1.9E-31  4.108      0.733  0.052  1.3E-29    Radial Glia
VIM       66    1.1E-30  3.496      0.956  0.216  3.6E-29    Radial Glia
Tab.3  
Fig.13  
Fig.14  
  
  
  
  