Frontiers of Computer Science
Front. Comput. Sci.    2023, Vol. 17 Issue (5) : 175330    https://doi.org/10.1007/s11704-022-2135-0
RESEARCH ARTICLE
Unsupervised spectral feature selection algorithms for high dimensional data
Mingzhao WANG1, Henry HAN2, Zhao HUANG1, Juanying XIE1
1. School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
2. Department of Computer Science, School of Engineering & Computer Science, Baylor University, Waco, TX 76798, USA
Abstract

Detecting the informative features of high dimensional data, so as to carry out explainable analyses, is a significant and challenging task, especially when the data contain very few samples. Feature selection, in particular unsupervised feature selection, is the right way to address this challenge. Therefore, two unsupervised spectral feature selection algorithms are proposed in this paper. They group features using an advanced self-tuning spectral clustering algorithm based on local standard deviation, so as to detect feature clusters that are as close to the global optimum as possible. Two feature ranking techniques, cosine-similarity-based feature ranking and entropy-based feature ranking, are then proposed, so that the representative feature of each cluster can be detected to compose the feature subset on which the explainable classification system is built. The effectiveness of the proposed algorithms is tested on high dimensional benchmark omics datasets and compared with that of peer methods, and statistical tests are conducted to determine whether the proposed spectral feature selection algorithms differ significantly from the peer methods. The extensive experiments demonstrate that the proposed unsupervised spectral feature selection algorithms outperform the peer ones, especially the one based on the cosine-similarity feature ranking technique, while the statistical tests show that the entropy-based spectral feature selection algorithm performs best. The detected features demonstrate strong discriminative capability in downstream classifiers for omics data, so that an AI system built on them would be reliable and explainable. This is especially significant for building transparent and trustworthy medical diagnostic systems from an interpretable-AI perspective.
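Since this page reproduces only the abstract, the following is a minimal Python sketch of such a pipeline rather than the authors' implementation: features are clustered by spectral clustering with a locally scaled Gaussian affinity (a stand-in for the paper's local-standard-deviation scale), and one representative feature per cluster is kept via a cosine-similarity or entropy ranking. Every scoring rule, parameter, and function name below is an assumption for illustration.

```python
# Hypothetical sketch of the abstract's pipeline: spectral clustering of
# features with a self-tuning-style affinity, then one representative
# feature per cluster by cosine-similarity or entropy ranking.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import pairwise_distances


def _entropy(v, bins=10):
    # Shannon entropy of a feature's value histogram (assumed scoring rule).
    p, _ = np.histogram(v, bins=bins)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log(p)).sum())


def spectral_feature_select(X, n_clusters=50, k=7, ranking="cosine"):
    """X: (n_samples, n_features) matrix; returns selected feature indices."""
    F = X.T                                    # treat each feature as a point
    D = pairwise_distances(F)                  # feature-to-feature distances
    # Self-tuning-style local scale: distance to each feature's k-th neighbor
    # (our stand-in for the paper's local standard deviation).
    sigma = np.sort(D, axis=1)[:, k] + 1e-12
    A = np.exp(-(D ** 2) / (sigma[:, None] * sigma[None, :]))
    labels = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                                random_state=0).fit_predict(A)
    selected = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if idx.size == 0:
            continue
        if ranking == "cosine":
            # Representative = feature most cosine-similar to the cluster mean.
            center = F[idx].mean(axis=0)
            sims = F[idx] @ center / (np.linalg.norm(F[idx], axis=1)
                                      * np.linalg.norm(center) + 1e-12)
            selected.append(idx[np.argmax(sims)])
        else:
            # Representative = lowest-entropy feature (direction is assumed).
            selected.append(idx[np.argmin([_entropy(F[i]) for i in idx])])
    return np.array(sorted(selected))
```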

Keywords: feature selection; spectral clustering; feature ranking techniques; entropy; cosine similarity
Corresponding Author(s): Zhao HUANG, Juanying XIE
Just Accepted Date: 26 July 2022    Issue Date: 15 December 2022
 Cite this article:   
Mingzhao WANG, Henry HAN, Zhao HUANG, et al. Unsupervised spectral feature selection algorithms for high dimensional data[J]. Front. Comput. Sci., 2023, 17(5): 175330.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-2135-0
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I5/175330
  
Datasets         #Features (Genes)   #Samples            #Classes   Sources
Colon            2000                62 (40+22)          2          Alon et al. (1999) [49]
DLBCL            4026                47 (24+23)          2          Alizadeh et al. (2000) [50]
Lymphoma         4026                45 (23+22)          2          Alizadeh et al. (2000) [50]
DLBCL Outcome    6817                58 (32+26)          2          Shipp et al. (2002) [51]
Adenoma          7457                36 (18+18)          2          Notterman et al. (2001) [52]
Prostate1        12558               43 (25+18)          2          Chandran et al. (2007) [53]
Prostate2        12600               136 (77+59)         2          Singh et al. (2002) [54]
Prostate3        12625               102 (50+52)         2          Singh et al. (2002) [54]
SRBCT            2308                83 (29+25+11+18)    4          Khan et al. (2001) [55]
GLIOMA           4434                50 (14+7+14+15)     4          Li et al. (2018) [56]
TOX_171          5748                171 (45+45+39+42)   4          Bajwa et al. (2016) [57]
Tab.1  Descriptions of the biological datasets used in this study
Fig.1  Overall diagram of the experimental design of this paper
Datasets         FSSC_SE   FSSC_LE   Difference   FSSC_SC   FSSC_LC   Difference
Colon            0.8429    0.8405    0.0024       0.8238    0.8119    0.0119
DLBCL            0.8417    0.7100    0.1317       0.8750    0.7600    0.1150
Lymphoma         0.8717    0.6883    0.1833       0.8417    0.7750    0.0667
DLBCL Outcome    0.6143    0.6410    −0.0267      0.6819    0.5862    0.0957
Adenoma          0.9500    0.9000    0.0500       0.9750    0.9250    0.0500
Prostate1        0.9800    0.9750    0.0050       1.0000    0.9550    0.0450
Prostate2        0.7128    0.7051    0.0077       0.7200    0.7039    0.0160
Prostate3        0.8727    0.8436    0.0291       0.8618    0.8424    0.0191
SRBCT            0.9653    0.9653    0.0000       0.9292    0.9497    −0.0206
GLIOMA           0.8643    0.7481    0.1162       0.8310    0.7717    0.0593
TOX_171          0.7972    0.8511    −0.0539      0.8054    0.8652    −0.0598
Tab.2  Acc differences between FSSC_SE and FSSC_LE, and between FSSC_SC and FSSC_LC
Fig.2  Comparison of the classification capability of the feature subsets detected by the 10 feature selection algorithms on DLBCL in terms of six classification measures: Acc, Sn, Sp, AUC, MCC, and F2. (a) Acc; (b) Sn; (c) Sp; (d) AUC; (e) MCC; (f) F2
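For readers reproducing these panels, the six measures map onto standard scikit-learn metrics. A minimal sketch, assuming the positive class is labeled 1, Sn and Sp denote sensitivity and specificity, and F2 is the F-beta score with beta = 2 (for the four-class datasets the captions use MAUC, the multi-class AUC, which roc_auc_score supports via multi_class="ovr" with class-probability scores):

```python
# Assumed mapping of the six measures in Figs. 2-4 to scikit-learn metrics.
from sklearn.metrics import (accuracy_score, recall_score, roc_auc_score,
                             matthews_corrcoef, fbeta_score)

def six_measures(y_true, y_pred, y_score):
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "Sn": recall_score(y_true, y_pred, pos_label=1),   # sensitivity
        "Sp": recall_score(y_true, y_pred, pos_label=0),   # specificity
        "AUC": roc_auc_score(y_true, y_score),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "F2": fbeta_score(y_true, y_pred, beta=2, pos_label=1),
    }
```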
Algorithms   FSSC_SE   FSSC_SC   FSSC_LE   FSSC_LC   DGFS     Laplacian   MCFS     RUFS     NDFS     FSSC-SD   Baseline
FSSC_SE      0/11/0    5/0/6     8/1/2     10/0/1    11/0/0   10/0/1      7/2/2    11/0/0   11/0/0   8/2/1     8/0/3
FSSC_SC      6/0/5     0/11/0    8/0/3     9/0/2     11/0/0   10/1/0      6/1/4    8/0/3    11/0/0   6/1/4     4/1/6
FSSC_LE      2/1/8     3/0/8     0/11/0    6/0/5     8/1/2    8/0/3       4/0/7    8/0/3    11/0/0   5/0/6     5/1/5
FSSC_LC      1/0/10    2/0/9     5/0/6     0/11/0    9/0/2    9/1/1       2/0/9    7/0/4    10/0/1   3/0/8     2/0/9
Tab.3  Win/draw/loss results of the algorithms in terms of the maximal average Acc
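One plausible reading of the win/draw/loss entries, consistent with the 0/11/0 diagonal over the 11 datasets, is a per-dataset comparison of maximal average Acc; a sketch under that assumption:

```python
# Hypothetical tally behind Tab. 3: for each algorithm pair, count datasets
# where the row algorithm's maximal average Acc beats/ties/trails the column's.
import numpy as np

def win_draw_loss(acc):
    """acc: dict mapping algorithm name -> per-dataset Acc array (length 11)."""
    names = list(acc)
    return {(a, b): "{}/{}/{}".format(int(np.sum(acc[a] > acc[b])),
                                      int(np.sum(acc[a] == acc[b])),
                                      int(np.sum(acc[a] < acc[b])))
            for a in names for b in names}
```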
Fig.3  Comparison of the classification capability of the feature subsets detected by the 10 feature selection algorithms on Prostate2 in terms of six classification measures: Acc, Sn, Sp, AUC, MCC, and F2. (a) Acc; (b) Sn; (c) Sp; (d) AUC; (e) MCC; (f) F2
Fig.4  Comparison of the classification capability of the feature subsets detected by the 10 feature selection algorithms on GLIOMA in terms of six classification measures: Acc, Sn, Sp, MAUC, MCC, and F2. (a) Acc; (b) Sn; (c) Sp; (d) MAUC; (e) MCC; (f) F2
Fig.5  Comparison of the classification capability of the feature subsets detected by the 10 unsupervised feature selection algorithms on the 11 datasets, in terms of the maximal mean Acc of the classifiers built on the selected feature subsets in 10-fold cross-validation experiments
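The protocol suggested by this caption can be approximated as follows; the subset sizes, the linear SVM (LIBSVM [58] corresponds to sklearn's SVC), and the selector hook are assumptions. Since the selectors are unsupervised, running them on X without labels before cross-validation does not leak label information.

```python
# Sketch of Fig. 5's protocol: 10-fold CV accuracy of an SVM on each
# candidate feature subset, reporting the maximal mean Acc over subset sizes.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def max_mean_acc(X, y, select, sizes=(10, 20, 50, 100)):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    means = []
    for m in sizes:
        cols = select(X, n_clusters=m)   # e.g., spectral_feature_select above
        means.append(cross_val_score(SVC(kernel="linear"),
                                     X[:, cols], y, cv=cv).mean())
    return max(means)
```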
Fig.6  Pairwise significance comparison of the 10 feature selection algorithms using the Nemenyi test
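The Friedman test [59] with the Nemenyi post-hoc critical difference [60] underlying this figure can be sketched with SciPy; the q value below is the standard alpha = 0.05 critical value of the studentized range statistic for k = 10 algorithms, and the ranking convention (rank 1 = best) is an assumption about the authors' setup.

```python
# Friedman test plus Nemenyi critical difference (CD) over an accuracy
# matrix acc of shape (n_datasets, n_algorithms).
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_nemenyi(acc, q_alpha=3.164):      # q for k = 10, alpha = 0.05
    stat, p = friedmanchisquare(*acc.T)        # one sample per algorithm
    ranks = rankdata(-acc, axis=1)             # rank 1 = best per dataset
    n, k = acc.shape
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))
    # Two algorithms differ significantly if their average ranks differ by > cd.
    return p, ranks.mean(axis=0), cd
```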
References

1 Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learning, 2002, 46(1): 389–422
2 Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A, Benítez J M, Herrera F. A review of microarray datasets and applied feature selection methods. Information Sciences, 2014, 282: 111–135
3 Xie J, Wang M, Xu S, Huang Z, Grant P W. The unsupervised feature selection algorithms based on standard deviation and cosine similarity for genomic data analysis. Frontiers in Genetics, 2021, 12: 684100
4 Xie J Y, Wang M Z, Zhou Y, Gao H C, Xu S Q. Differential expression gene selection algorithms for unbalanced gene datasets. Chinese Journal of Computers, 2019, 42(6): 1232–1251
5 Wang M, Ding L, Xu M, Xie J, Wu S, Xu S, Yao Y, Liu Q. A novel method detecting the key clinic factors of portal vein system thrombosis of splenectomy & cardia devascularization patients for cirrhosis & portal hypertension. BMC Bioinformatics, 2019, 20(22): 720
6 Xie J, Wu Z, Zheng Q. An adaptive 2D feature selection algorithm based on information gain and Pearson correlation coefficient. Journal of Shaanxi Normal University: Natural Science Edition, 2020, 48(6): 69–81
7 Hu X, Zhou P, Li P, Wang J, Wu X. A survey on online feature selection with streaming features. Frontiers of Computer Science, 2018, 12(3): 479–493
8 Khan Z U, Pi D, Yao S, Nawaz A, Ali F, Ali S. piEnPred: a bi-layered discriminative model for enhancers and their subtypes via novel cascade multi-level subset feature selection algorithm. Frontiers of Computer Science, 2021, 15(6): 156904
9 Chen J, Zeng Y, Li Y, Huang G B. Unsupervised feature selection based extreme learning machine for clustering. Neurocomputing, 2020, 386: 198–207
10 Lim H, Kim D W. Pairwise dependence-based unsupervised feature selection. Pattern Recognition, 2021, 111: 107663
11 Feng J, Jiao L, Liu F, Sun T, Zhang X. Unsupervised feature selection based on maximum information and minimum redundancy for hyperspectral images. Pattern Recognition, 2016, 51: 295–309
12 Xie J Y, Gao H C. Statistical correlation and k-means based distinguishable gene subset selection algorithms. Journal of Software, 2014, 25(9): 2050–2075
13 Xie J, Gao H, Xie W, Liu X, Grant P W. Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors. Information Sciences, 2016, 354: 19–40
14 Bhattacharjee P, Mitra P. A survey of density based clustering algorithms. Frontiers of Computer Science, 2021, 15(1): 151308
15 Bhattacharjee P, Mitra P. iMass: an approximate adaptive clustering algorithm for dynamic data using probability based dissimilarity. Frontiers of Computer Science, 2021, 15(2): 1–3
16 Song Q, Ni J, Wang G. A fast clustering-based feature subset selection algorithm for high-dimensional data. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(1): 1–14
17 Xie J, Wang M, Zhou Y, Li J. Coordinating discernibility and independence scores of variables in a 2D space for efficient and accurate feature selection. In: Proceedings of the 12th International Conference on Intelligent Computing. 2016, 116–127
18 Xue H, Li S, Chen X, Wang Y. A maximum margin clustering algorithm based on indefinite kernels. Frontiers of Computer Science, 2019, 13(4): 813–827
19 Likas A, Vlassis N, Verbeek J J. The global k-means clustering algorithm. Pattern Recognition, 2003, 36(2): 451–461
20 Xie J Y, Jiang S, Wang C X, Zhang Y, Xie W X. An improved global k-means clustering algorithm. Journal of Shaanxi Normal University: Natural Science Edition, 2010, 38(2): 18–22
21 von Luxburg U. A tutorial on spectral clustering. Statistics and Computing, 2007, 17(4): 395–416
22 Zhang X, You Q. An improved spectral clustering algorithm based on random walk. Frontiers of Computer Science in China, 2011, 5(3): 268–278
23 Ng A Y, Jordan M I, Weiss Y. On spectral clustering: analysis and an algorithm. In: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. 2001, 849–856
24 Shi J, Malik J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(8): 888–905
25 Zelnik-Manor L, Perona P. Self-tuning spectral clustering. In: Proceedings of the 17th International Conference on Neural Information Processing Systems. 2004, 1601–1608
26 Alpert C J, Yao S Z. Spectral partitioning: the more eigenvectors, the better. In: Proceedings of the 32nd Design Automation Conference. 1995, 195–200
27 Weiss Y. Segmentation using eigenvectors: a unifying view. In: Proceedings of the 7th IEEE International Conference on Computer Vision. 1999, 975–982
28 Xie J, Zhou Y, Ding L. Local standard deviation spectral clustering. In: Proceedings of 2018 IEEE International Conference on Big Data and Smart Computing (BigComp). 2018, 242–250
29 Xie J Y, Ding L J. The true self-adaptive spectral clustering algorithms. Acta Electronica Sinica, 2019, 47(5): 1000–1008
30 Zhao Z, Liu H. Spectral feature selection for supervised and unsupervised learning. In: Proceedings of the 24th International Conference on Machine Learning. 2007, 1151–1157
31 García-García D, Santos-Rodríguez R. Spectral clustering and feature selection for microarray data. In: Proceedings of 2009 International Conference on Machine Learning and Applications. 2009, 425–428
32 Zhou S, Liu X, Zhu C, Liu Q, Yin J. Spectral clustering-based local and global structure preservation for feature selection. In: Proceedings of 2014 International Joint Conference on Neural Networks (IJCNN). 2014, 550–557
33 He X, Cai D, Niyogi P. Laplacian score for feature selection. In: Proceedings of the 18th International Conference on Neural Information Processing Systems. 2005, 507–514
34 Cai D, Zhang C, He X. Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2010, 333–342
35 Qian M, Zhai C. Robust unsupervised feature selection. In: Proceedings of the 23rd International Joint Conference on Artificial Intelligence. 2013, 1621–1627
36 Li Z, Yang Y, Liu J, Zhou X, Lu H. Unsupervised feature selection using nonnegative spectral analysis. In: Proceedings of the 26th AAAI Conference on Artificial Intelligence. 2012, 1026–1032
37 He J, Bi Y, Ding L, Li Z, Wang S. Unsupervised feature selection based on decision graph. Neural Computing and Applications, 2017, 28(10): 3047–3059
38 Xie J Y, Ding L J, Wang M Z. Spectral clustering based unsupervised feature selection algorithms. Journal of Software, 2020, 31(4): 1009–1024
39 Baldi P, Brunak S, Chauvin Y, Andersen C A F, Nielsen H. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 2000, 16(5): 412–424
40 Davis J, Goadrich M. The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning. 2006, 233–240
41 Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters, 2006, 27(8): 861–874
42 Vapnik V N. The Nature of Statistical Learning Theory. Berlin: Springer Science & Business Media, 2013
43 Dash M, Liu H. Feature selection for clustering. In: Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Current Issues and New Applications. 2000, 110–121
44 Dash M, Choi K, Scheuermann P, Liu H. Feature selection for clustering – a filter solution. In: Proceedings of the 2002 IEEE International Conference on Data Mining. 2002, 115–122
45 Han J, Pei J, Kamber M. Data Mining: Concepts and Techniques. Amsterdam: Elsevier, 2011
46 Luo F, Huang H, Ma Z, Liu J. Semisupervised sparse manifold discriminative analysis for feature extraction of hyperspectral images. IEEE Transactions on Geoscience and Remote Sensing, 2016, 54(10): 6197–6211
47 Luo F, Zou Z, Liu J, Lin Z. Dimensionality reduction and classification of hyperspectral image via multistructure unified discriminative embedding. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5517916
48 Zhao F, Jiao L, Liu H, Gao X, Gong M. Spectral clustering with eigenvector selection based on entropy ranking. Neurocomputing, 2010, 73(10–12): 1704–1717
49 Alon U, Barkai N, Notterman D A, Gish K, Ybarra S, Mack D, Levine A J. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America, 1999, 96(12): 6745–6750
50 Alizadeh A A, Eisen M B, Davis R E, Ma C, Lossos I S, Rosenwald A, Boldrick J C, Sabet H, Tran T, Yu X, Powell J I, Yang L, Marti G E, Moore T, Hudson J Jr, Lu L, Lewis D B, Tibshirani R, Sherlock G, Chan W C, Greiner T C, Weisenburger D D, Armitage J O, Warnke R, Levy R, Wilson W, Grever M R, Byrd J C, Botstein D, Brown P O, Staudt L M. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 2000, 403(6769): 503–511
51 Shipp M A, Ross K N, Tamayo P, Weng A P, Kutok J L, Aguiar R C T, Gaasenbeek M, Angelo M, Reich M, Pinkus G S, Ray T S, Koval M A, Last K W, Norton A, Lister T A, Mesirov J, Neuberg D S, Lander E S, Aster J C, Golub T R. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 2002, 8(1): 68–74
52 Notterman D A, Alon U, Sierk A J, Levine A J. Transcriptional gene expression profiles of colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays. Cancer Research, 2001, 61(7): 3124–3130
53 Chandran U R, Ma C, Dhir R, Bisceglia M, Lyons-Weiler M, Liang W, Michalopoulos G, Becich M, Monzon F A. Gene expression profiles of prostate cancer reveal involvement of multiple molecular pathways in the metastatic process. BMC Cancer, 2007, 7(1): 64
54 Singh D, Febbo P G, Ross K, Jackson D G, Manola J, Ladd C, Tamayo P, Renshaw A A, D’Amico A V, Richie J P, Lander E S, Loda M, Kantoff P W, Golub T R, Sellers W R. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 2002, 1(2): 203–209
55 Khan J, Wei J S, Ringnér M, Saal L H, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu C R, Peterson C, Meltzer P S. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 2001, 7(6): 673–679
56 Li J, Cheng K, Wang S, Morstatter F, Trevino R P, Tang J, Liu H. Feature selection: a data perspective. ACM Computing Surveys, 2018, 50(6): 94
57 Bajwa G, DeBerardinis R J, Shao B, Hall B, Farrar J D, Gill M A. Cutting edge: critical role of glycolysis in human plasmacytoid dendritic cell antiviral responses. The Journal of Immunology, 2016, 196(5): 2004–2009
58 Chang C C, Lin C J. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011, 2(3): 27
59 Friedman M. A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 1940, 11(1): 86–92
60 Nemenyi P B. Distribution-free multiple comparisons. Dissertation, Princeton University, 1963
Supplementary material: FCS-22135-OF-NW_suppl_1