Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2019, Vol. 13 Issue (5) : 996-1009    https://doi.org/10.1007/s11704-018-7182-1
RESEARCH ARTICLE
Transfer synthetic over-sampling for class-imbalance learning with limited minority class data
Xu-Ying LIU1,2,3(), Sheng-Tao WANG1,2,3, Min-Ling ZHANG1,2,3
1. School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
2. Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, Nanjing 210096, China
3. Collaborative Innovation Center for Wireless Communications Technology, Nanjing 210096, China
Abstract

The problem of limited minority class data is encountered in many class-imbalanced applications but has received little attention. Synthetic over-sampling, a popular family of class-imbalance learning methods, can introduce considerable noise when the minority class has limited data, since the synthetic samples are not i.i.d. samples of the minority class. Most sophisticated synthetic sampling methods tackle this problem by denoising or by generating samples more consistent with the ground-truth data distribution, but their assumptions about the true noise or the ground-truth distribution may not hold. To adapt synthetic sampling to the problem of limited minority class data, the proposed Traso framework treats synthetic minority class samples as an additional data source and exploits transfer learning to transfer knowledge from them to the minority class. As an implementation, the TrasoBoost method first generates synthetic samples to balance the class sizes. Then, in each boosting iteration, the weights of synthetic samples and original data decrease and increase, respectively, when they are misclassified, and remain unchanged otherwise. Misclassified synthetic samples are potential noise and thus have less influence in subsequent iterations. In addition, the weights of minority class instances change more than those of majority class instances, making them more influential. Only the original data are used to estimate the error rate, so the estimate is immune to noise in the synthetic samples. Finally, since the synthetic samples are highly related to the minority class, all of the weak learners are aggregated for prediction. Experimental results show that TrasoBoost outperforms many popular class-imbalance learning methods.
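The boosting loop described above can be sketched roughly as follows. This is an illustrative TrAdaBoost-style reconstruction from the abstract alone, not the authors' algorithm: the decision-stump base learner, the down-weighting factor `beta`, and the exact weight-update formulas are assumptions, and the abstract's larger weight change for minority-class instances is omitted for brevity.

```python
import numpy as np

def stump_fit(X, y, w):
    """Weighted decision stump: pick the (feature, threshold, polarity)
    minimising weighted error over labels in {-1, +1}."""
    best = (0, 0.0, 1, np.inf)  # feature, threshold, polarity, error
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, f] - t) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (f, t, pol, err)
    return best[:3]

def stump_predict(stump, X):
    f, t, pol = stump
    return np.where(pol * (X[:, f] - t) >= 0, 1, -1)

def trasoboost_sketch(X_o, y_o, X_s, y_s, rounds=10, beta=0.8):
    """Sketch of the abstract's scheme: misclassified synthetic samples are
    down-weighted (potential noise), misclassified original samples are
    up-weighted, and the error rate is estimated on original data only.
    `beta` is a hypothetical down-weighting factor, not a paper parameter."""
    X = np.vstack([X_o, X_s])
    y = np.concatenate([y_o, y_s])
    n_o = len(y_o)
    w = np.ones(len(y)) / len(y)
    learners, alphas = [], []
    for _ in range(rounds):
        stump = stump_fit(X, y, w / w.sum())
        miss = stump_predict(stump, X) != y
        # error rate from ORIGINAL data only, immune to synthetic noise
        eps = w[:n_o][miss[:n_o]].sum() / w[:n_o].sum()
        eps = min(max(eps, 1e-10), 0.499)
        alpha = 0.5 * np.log((1 - eps) / eps)
        # original samples: up-weight on mistakes (AdaBoost-style)
        w[:n_o][miss[:n_o]] *= np.exp(alpha)
        # synthetic samples: down-weight on mistakes (treated as noise)
        w[n_o:][miss[n_o:]] *= beta
        learners.append(stump)
        alphas.append(alpha)

    def predict(Xq):
        # all weak learners are aggregated for the final prediction
        score = sum(a * stump_predict(s, Xq) for s, a in zip(learners, alphas))
        return np.where(score >= 0, 1, -1)
    return predict
```

In this sketch the error estimate `eps` deliberately ignores the synthetic portion of the training set, so a noisy synthetic sample can lose influence without distorting the weak learner's weight `alpha`.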

Keywords: machine learning; data mining; class imbalance; over-sampling; boosting; transfer learning
Corresponding Author(s): Xu-Ying LIU   
Just Accepted Date: 14 June 2018   Online First Date: 07 January 2019    Issue Date: 25 June 2019
 Cite this article:   
Xu-Ying LIU, Sheng-Tao WANG, Min-Ling ZHANG. Transfer synthetic over-sampling for class-imbalance learning with limited minority class data[J]. Front. Comput. Sci., 2019, 13(5): 996-1009.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-018-7182-1
https://academic.hep.com.cn/fcs/EN/Y2019/V13/I5/996