Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2024, Vol. 18 Issue (1) : 181901    https://doi.org/10.1007/s11704-022-2111-8
Interdisciplinary
cKBET: assessing goodness of batch effect correction for single-cell RNA-seq
Yameng ZHAO, Yin GUO, Limin LI
School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China
Abstract

Single-cell RNA sequencing reveals the gene structure and gene expression status of a single cell, and can therefore reflect the heterogeneity between cells. However, batch effects caused by non-biological factors may hinder data integration and downstream analysis. Although batch effects can be evaluated by visualizing the data, such evaluation is subjective and inaccurate. In this work, we propose a quantitative method, cKBET, which considers batch and cell type information simultaneously. cKBET assesses batch effects by comparing the global and local fractions of cells from different batches within different cell types. We verify the performance of cKBET on simulated and real biological data sets. The experimental results show that cKBET is superior to existing methods in most cases. In general, cKBET can detect batch effects with either balanced or unbalanced cell types, and can thus be used to evaluate batch correction methods.

Keywords: single-cell RNA-seq dataset; batch effect assessment; cKBET method
Corresponding Author(s): Limin LI   

Issue Date: 21 February 2023
Cite this article:
Yameng ZHAO, Yin GUO, Limin LI. cKBET: assessing goodness of batch effect correction for single-cell RNA-seq[J]. Front. Comput. Sci., 2024, 18(1): 181901.
URL:
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-2111-8
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I1/181901
Fig.1  The principle of cKBET. cKBET compares the batch label distribution of the cells of the same type within a fixed-size neighborhood with that of all cells of the same cell type. (a) The cells in different batches are well mixed, but the cells of different types are not separated; cKBET correctly assesses this data set as having a batch effect; (b) The cells in different batches are well mixed and the cells of different types are separated; cKBET correctly assesses this data set as having no batch effect
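The comparison described in the caption of Fig.1 can be sketched roughly as follows. This is a minimal illustration of one possible reading of that caption, not the authors' implementation: the function name `type_aware_rejection_rate`, the parameters `k`, `alpha` and `min_same`, the Pearson chi-square test, the brute-force neighbour search, and the restriction to exactly two batches are all assumptions made for the example. The toy data at the end are likewise hypothetical.

```python
import math
from collections import Counter


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def type_aware_rejection_rate(points, batches, cell_types, k=10, alpha=0.05, min_same=5):
    """Share of tested cells whose local batch mix deviates from that of their cell type.

    Assumption of this sketch: exactly two batch labels, so each test has 1 degree of freedom.
    """
    labels = sorted(set(batches))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two batches")

    # Global reference: batch counts inside each cell type.
    type_counts = {}
    for b, t in zip(batches, cell_types):
        type_counts.setdefault(t, Counter())[b] += 1

    tested = rejected = 0
    for i, p in enumerate(points):
        ref = type_counts[cell_types[i]]
        if any(ref[b] == 0 for b in labels):
            continue  # this type occurs in one batch only; nothing to compare
        # Fixed-size neighbourhood: the k nearest other cells (brute force).
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))[:k]
        # Keep only the neighbours that share the cell type of cell i.
        same = [j for j in order if cell_types[j] == cell_types[i]]
        if len(same) < min_same:
            continue  # too few same-type neighbours for a meaningful test
        local = Counter(batches[j] for j in same)
        total = sum(ref.values())
        stat = 0.0
        for b in labels:
            expected = len(same) * ref[b] / total
            stat += (local[b] - expected) ** 2 / expected
        tested += 1
        rejected += chi2_sf_df1(stat) < alpha
    return rejected / tested if tested else float("nan")


def toy(shift):
    """Two cell types (far apart in y), two batches; batch 1 is moved by `shift` in x."""
    points, batches, types = [], [], []
    for t in (0, 1):
        for i in range(40):
            b = i % 2
            points.append((i + shift * b, 100.0 * t))
            batches.append(b)
            types.append(t)
    return points, batches, types


mixed_rate = type_aware_rejection_rate(*toy(shift=0))       # batches interleaved
shifted_rate = type_aware_rejection_rate(*toy(shift=1000))  # batches far apart
print(mixed_rate, shifted_rate)  # 0.0 1.0
```

In the toy data a lower rejection rate corresponds to better batch mixing within each cell type: the interleaved batches are never rejected, while the separated batches are rejected for every cell.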
  
Fig.2  t-SNE visualization of simulated data with balanced cell types in different batches
| Method | Uncorrected | MNN | Limma | ComBat | Harmony | BC-t-SNE | DESC | BERMUDA | MMD-ResNet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| kBET | 0.3152 | 0.0399 | 0.0214 | 0.0277 | 0.028 | 0.0191 | 0.0345 | 0.2487 | 0.8366 |
| cKBET | 0.2325 | 0.044 | 0.027 | 0.026 | 0.0241 | 0.0257 | 0.0351 | 0.2944 | 0.7023 |

Tab.1  Quantitative results of the kBET and cKBET methods for simulated data with balanced cell types in different batches. The best quality dataset assessed by each method is bolded
Fig.3  t-SNE visualization of simulated data with different proportions of cell types in different batches. (a) l = 2, π1 = (1/4, 3/4), π2 = (3/4, 1/4); (b) l = 2, π1 = (1/6, 5/6), π2 = (5/6, 1/6); (c) l = 2, π1 = (1/10, 9/10), π2 = (9/10, 1/10)
| Method | Uncorrected | MNN | Limma | ComBat | Harmony | BC-t-SNE | DESC | BERMUDA | MMD-ResNet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| π1 = [1/4, 3/4]^T and π2 = [3/4, 1/4]^T | | | | | | | | | |
| SIL | 0.2424 | 0.2309 | 0.1167 | 0.1170 | 0.2259 | 0.3719 | 0.0011 | 0.1448 | 0.3681 |
| PcR | 0.9511 | 0.9468 | 0.0000 | 0.0000 | 0.9229 | 0.9527 | 0.0000 | 0.8901 | 0.8010 |
| kBET | 0.3232 | 0.1014 | 0.0212 | 0.0510 | 0.1000 | 0.2678 | 0.0328 | 0.2789 | 0.8544 |
| cKBET | 0.3174 | 0.1109 | 0.2813 | 0.2199 | 0.0620 | 0.2895 | 0.3017 | 0.1002 | 0.7157 |
| π1 = [1/6, 5/6]^T and π2 = [5/6, 1/6]^T | | | | | | | | | |
| SIL | 0.3081 | 0.3003 | 0.1196 | 0.1204 | 0.2957 | 0.4322 | 0.0011 | 0.0320 | 0.4228 |
| PcR | 0.9510 | 0.9426 | 0.0000 | 0.0000 | 0.9228 | 0.9507 | 0.0000 | 0.8220 | 0.7912 |
| kBET | 0.3295 | 0.1187 | 0.0182 | 0.0325 | 0.1315 | 0.2895 | 0.0340 | 0.4614 | 0.8294 |
| cKBET | 0.2201 | 0.0844 | 0.2125 | 0.2061 | 0.0413 | 0.2328 | 0.1861 | 0.1924 | 0.3609 |
| π1 = [1/10, 9/10]^T and π2 = [9/10, 1/10]^T | | | | | | | | | |
| SIL | 0.5046 | 0.5038 | 0.0744 | 0.0748 | 0.5018 | 0.6232 | 0.0011 | 0.0107 | 0.5723 |
| PcR | 0.9511 | 0.9187 | 0.0000 | 0.0000 | 0.9260 | 0.9487 | 0.0000 | 0.3606 | 0.7014 |
| kBET | 0.3847 | 0.2000 | 0.0326 | 0.1097 | 0.2253 | 0.2974 | 0.2274 | 0.8334 | 0.8251 |
| cKBET | 0.0878 | 0.1609 | 0.0880 | 0.0849 | 0.0490 | 0.0818 | 0.1177 | 0.0576 | 0.1942 |

Tab.2  Quantitative results of the SIL, PcR, kBET and cKBET methods for simulated data with different proportions of cell types in each batch. The best quality dataset assessed by each method is bolded
Fig.4  t-SNE visualization of simulated data with multiple cell types, where (a) l=4 and (b) l=8
| Method | Uncorrected | MNN | Limma | ComBat | Harmony | BC-t-SNE | DESC | BERMUDA | MMD-ResNet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4 cell types | | | | | | | | | |
| SIL | 0.0485 | 0.0562 | 0.0462 | 0.0288 | 0.0429 | 0.0986 | 0.0182 | 0.0639 | 0.1961 |
| PcR | 0.5794 | 0.0041 | 0.0000 | 0.0000 | 0.0000 | 0.9508 | 0.1900 | 0.8911 | 0.5699 |
| kBET | 0.1788 | 0.9942 | 0.0275 | 0.4798 | 0.0230 | 0.2463 | 0.5640 | 0.6430 | 0.8415 |
| cKBET | 0.1275 | 0.0731 | 0.0468 | 0.1138 | 0.0684 | 0.1283 | 0.1480 | 0.2747 | 0.4510 |
| 8 cell types | | | | | | | | | |
| SIL | 0.0354 | 0.0106 | 0.0212 | 0.0096 | 0.0154 | 0.1011 | 0.0035 | 0.0222 | 0.0892 |
| PcR | 0.0062 | 0.0010 | 0.0000 | 0.0000 | 0.3943 | 1.0000 | 0.2282 | 0.0370 | 0.0389 |
| kBET | 0.2743 | 0.8875 | 0.0488 | 0.7702 | 0.0309 | 0.3114 | 0.4428 | 0.5656 | 0.8409 |
| cKBET | 0.0795 | 0.1661 | 0.0335 | 0.2958 | 0.0564 | 0.0830 | 0.1446 | 0.1602 | 0.6351 |

Tab.3  Quantitative results of the SIL, PcR, kBET and cKBET methods for simulated data with multiple cell types. The best quality dataset assessed by each method is bolded.
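| Cell type | Dataset 1, Batch1 | Dataset 1, Batch2 | Dataset 2, Batch1 | Dataset 2, Batch2 | Dataset 3, Batch1 | Dataset 3, Batch2 |
| --- | --- | --- | --- | --- | --- | --- |
| Type 1 | 143 | 145 | 99 | 199 | 52 | 243 |
| Type 2 | 157 | 155 | 201 | 101 | 248 | 57 |

Tab.4  The number of cells in different cell types and batches for 3 uncorrected simulation data sets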
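As a small worked example of the "global fraction" idea, the counts in Tab.4 can be turned into batch fractions, both over all cells and within each cell type. The snippet below is only arithmetic on the table values; the variable and function names (`counts`, `batch1_fraction`, `summary`) are hypothetical and chosen for this illustration.

```python
# Cell counts from Tab.4: counts[dataset][cell_type] = (Batch1, Batch2)
counts = {
    "Dataset 1": {"Type 1": (143, 145), "Type 2": (157, 155)},
    "Dataset 2": {"Type 1": (99, 199), "Type 2": (201, 101)},
    "Dataset 3": {"Type 1": (52, 243), "Type 2": (248, 57)},
}


def batch1_fraction(pairs):
    """Fraction of cells that belong to Batch1 among the given (Batch1, Batch2) pairs."""
    batch1 = sum(p[0] for p in pairs)
    total = sum(p[0] + p[1] for p in pairs)
    return batch1 / total


summary = {}
for name, by_type in counts.items():
    overall = batch1_fraction(by_type.values())  # ignores cell type
    per_type = {t: batch1_fraction([c]) for t, c in by_type.items()}
    summary[name] = (overall, per_type)
    print(name, round(overall, 3), {t: round(f, 3) for t, f in per_type.items()})

# Dataset 1 0.5 {'Type 1': 0.497, 'Type 2': 0.503}
# Dataset 2 0.5 {'Type 1': 0.332, 'Type 2': 0.666}
# Dataset 3 0.5 {'Type 1': 0.176, 'Type 2': 0.813}
```

In all three data sets each batch holds 300 of the 600 cells, so the overall Batch1 fraction is 0.5, while the within-type fractions move further from 0.5 from Dataset 1 to Dataset 3. A reference distribution computed per cell type therefore differs from one computed over all cells whenever cell types are unbalanced across batches.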
Fig.5  Results for the 3 simulated data sets in Tab.4. (a) t-SNE visualization; (b) Rejection rate under different k values
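| Cell type | MESC, Batch1 | MESC, Batch2 | MTC, Batch1 | MTC, Batch2 | MBC, Batch1 | MBC, Batch2 |
| --- | --- | --- | --- | --- | --- | --- |
| Type 1 | 78/167 | 39/167 | 2/4 | 1/4 | 2/9 | 4/9 |
| Type 2 | 50/167 | 50/167 | 1/4 | 2/4 | 4/9 | 2/9 |
| Type 3 | 39/167 | 78/167 | 1/4 | 1/4 | 3/9 | 3/9 |

Tab.5  Proportions of different cell types in different batches for the real data sets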
Fig.6  t-SNE and UMAP visualization of the MESC dataset with 3 cell types
Fig.7  t-SNE and UMAP visualization of the MTC dataset with 3 cell types
Fig.8  t-SNE and UMAP visualization of the MBC dataset with 3 cell types
| Method | Uncorrected | MNN | Limma | ComBat | Harmony | BC-t-SNE | DESC | BERMUDA | MMD-ResNet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MESC | | | | | | | | | |
| SIL | 0.0265 | 0.0721 | 0.0223 | 0.0013 | 0.0198 | 0.7064 | 0.0443 | 0.2064 | 0.3169 |
| PcR | 0.4969 | 0.4274 | 0.0000 | 0.2200 | 0.0000 | 0.5501 | 0.1904 | 0.8258 | 0.5868 |
| kBET | 0.6129 | 0.5456 | 0.3606 | 0.5879 | 0.3562 | 0.5947 | 0.4059 | 0.5750 | 0.6565 |
| cKBET | 0.3973 | 0.0673 | 0.0579 | 0.3157 | 0.0546 | 0.1031 | 0.2404 | 0.3920 | 0.7938 |
| MTC | | | | | | | | | |
| SIL | 0.0321 | 0.1447 | 0.0560 | 0.0166 | 0.0262 | 0.0867 | 0.1076 | 0.1117 | 0.1648 |
| PcR | 0.3452 | 0.3596 | 0.0000 | 0.0000 | 0.0886 | 0.4269 | 0.6477 | 0.8649 | 0.4476 |
| kBET | 0.5987 | 0.7618 | 0.6229 | 0.5678 | 0.5527 | 0.6527 | 0.6784 | 0.7668 | 0.8247 |
| cKBET | 0.5164 | 0.3562 | 0.3761 | 0.4544 | 0.3417 | 0.4208 | 0.4002 | 0.4821 | 0.7601 |
| MBC | | | | | | | | | |
| SIL | 0.0851 | 0.1276 | 0.0850 | 0.0856 | 0.0726 | 0.0611 | 0.0693 | 0.0435 | 0.2014 |
| PcR | 0.6082 | 0.1136 | 0.0000 | 0.0000 | 0.0000 | 0.2295 | 0.4669 | 0.7305 | 0.8228 |
| kBET | 0.7249 | 0.7369 | 0.7021 | 0.6697 | 0.6613 | 0.7197 | 0.7118 | 0.5222 | 0.7602 |
| cKBET | 0.4518 | 0.3428 | 0.3853 | 0.3968 | 0.3559 | 0.4869 | 0.5848 | 0.5029 | 0.7004 |

Tab.6  Quantitative results of the SIL, PcR, kBET and cKBET methods for the real data. The best quality dataset assessed by each method is bolded
  
  
  