A data representation method using distance correlation
Xinyan LIANG1, Yuhua QIAN1,2, Qian GUO3,4, Keyin ZHENG1
1. Institute of Big Data Science and Industry, Shanxi University, Taiyuan 030006, China
2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China
3. School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China
4. Shanxi Key Laboratory of Big Data Analysis and Parallel Computing, Taiyuan University of Science and Technology, Taiyuan 030024, China
Associations between features have been shown to improve the representation ability of data. However, the original association data reconstruction method faces two issues: the dimension of the reconstructed data is necessarily higher than that of the original data, and the adopted association measure does not balance effectiveness and efficiency well. To address these two issues, this paper proposes a novel association-based representation improvement method named AssoRep. AssoRep first obtains the association between features via distance correlation, which has some advantages over Pearson’s correlation coefficient. Then an enhancement matrix is formed by stacking the association values of every pair of features. Next, an improved feature representation is obtained by aggregating the original features with the enhancement matrix. Finally, the improved feature representation is mapped to a low-dimensional space via principal component analysis. The effectiveness of AssoRep is validated on 120 datasets, and the results further improve on our previous work on association data reconstruction.
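The four steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`distance_correlation`, `association_matrix`, `assorep_sketch`) are hypothetical, and the aggregation step is assumed here to be the matrix product of the data with the enhancement matrix, since the abstract does not specify the operator. Distance correlation follows the standard sample definition based on double-centered pairwise distance matrices, and principal component analysis is computed with a plain singular value decomposition.

```python
import numpy as np


def _centered_distances(v):
    """Double-centered pairwise distance matrix of a 1-D sample."""
    d = np.abs(v[:, None] - v[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation between two 1-D arrays of equal length."""
    a = _centered_distances(np.asarray(x, dtype=float))
    b = _centered_distances(np.asarray(y, dtype=float))
    dcov2 = (a * b).mean()                      # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0.0:                            # at least one constant input
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def association_matrix(X):
    """d x d matrix of pairwise distance correlations between the columns of X."""
    d = X.shape[1]
    A = np.eye(d)                               # a feature is fully associated with itself
    for i in range(d):
        for j in range(i + 1, d):
            A[i, j] = A[j, i] = distance_correlation(X[:, i], X[:, j])
    return A


def assorep_sketch(X, n_components):
    """Association matrix -> aggregation (assumed: X @ A) -> PCA projection."""
    X = np.asarray(X, dtype=float)
    A = association_matrix(X)
    X_enh = X @ A                               # ASSUMPTION: aggregation as a matrix product
    X_centered = X_enh - X_enh.mean(axis=0)
    # PCA via SVD: rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 5))
    Z = assorep_sketch(X, n_components=2)
    print(Z.shape)
```

One advantage of distance correlation over Pearson’s coefficient is visible on a symmetric nonlinear relation: for `x = np.linspace(-1, 1, 101)` and `y = x**2`, Pearson’s coefficient is numerically zero while `distance_correlation(x, y)` is clearly positive. The sketch stores an n × n distance matrix per feature, so it is only suitable for small samples.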