Frontiers of Computer Science

Front. Comput. Sci., 2018, Vol. 12, Issue 2: 331-350    https://doi.org/10.1007/s11704-016-5306-z
RESEARCH ARTICLE
Evolutionary under-sampling based bagging ensemble method for imbalanced data classification
Bo SUN1,2, Haiyan CHEN1,2, Jiandong WANG1, Hua XIE2
1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
2. National Key Lab of ATFM, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
Abstract

In class imbalanced learning, traditional machine learning algorithms that optimize overall accuracy tend to classify poorly on the minority class, which is usually the class of greatest interest. Many effective approaches have been proposed to address this problem. Among them, bagging ensemble methods that integrate under-sampling techniques have demonstrated better performance than alternatives such as bagging ensembles integrated with over-sampling techniques or cost-sensitive methods. Although these under-sampling techniques promote diversity among the generated base classifiers by randomly partitioning or sampling the majority class, they take no measure to ensure the performance of the individual classifiers, which limits the ensemble performance that can be achieved. On the other hand, evolutionary under-sampling (EUS), a novel under-sampling technique, has been successfully applied to search for the majority class subset that yields the best-performing nearest neighbor classifier. Inspired by EUS, this paper introduces it into the under-sampling bagging framework and proposes an EUS-based bagging ensemble method (EUS-Bag), with a new fitness function that considers three factors to make EUS better suited to the framework. With this fitness function, EUS-Bag generates a set of accurate and diverse base classifiers. To verify the effectiveness of EUS-Bag, we conduct a series of comparison experiments on 22 two-class imbalanced classification problems. Experimental results measured by recall, geometric mean (G-mean), and AUC all demonstrate its superior performance.
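The abstract outlines the mechanism but not the exact formulas, so the following is a minimal Python sketch of the EUS-Bag idea, under stated assumptions rather than the paper's definitive implementation: binary labels with 1 as the minority class, decision trees as base learners, and a simple weighted sum of individual G-mean and disagreement with previously built ensemble members standing in for the paper's three-factor fitness function. The names `eus_bag`, `g_mean`, and `predict`, the genetic-operator details, and the 0.7/0.3 weights are all illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of an EUS-based under-sampling bagging ensemble.
# Assumptions (not from the paper): label 1 = minority class, decision
# trees as base learners, and a weighted sum of G-mean and disagreement
# as a stand-in for the paper's three-factor fitness function.
import numpy as np
from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier


def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls."""
    recalls = recall_score(y_true, y_pred, average=None)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))


def eus_bag(X, y, n_estimators=10, pop_size=20, n_gens=30, seed=0):
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == 1)   # minority class indices (label 1 assumed)
    maj_idx = np.flatnonzero(y == 0)   # majority class indices
    n_min = len(min_idx)
    ensemble = []

    def train_on(subset):
        idx = np.concatenate([min_idx, subset])
        return DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])

    def fitness(subset):
        pred = train_on(subset).predict(X)
        perf = g_mean(y, pred)                     # factor: individual performance
        if ensemble:                               # factor: diversity, measured as
            prev = np.array([m.predict(X) for m in ensemble])  # disagreement with
            div = float(np.mean(prev != pred))     # previously built members
        else:
            div = 0.0
        return 0.7 * perf + 0.3 * div              # illustrative weights only

    for _ in range(n_estimators):
        # Each chromosome is a balanced subset of majority-class indices.
        pop = [rng.choice(maj_idx, size=n_min, replace=False)
               for _ in range(pop_size)]
        for _ in range(n_gens):                    # simple truncation-selection GA
            order = np.argsort([fitness(s) for s in pop])[::-1]
            parents = [pop[i] for i in order[: pop_size // 2]]
            children = []
            for p in parents:                      # mutation: swap in fresh majority
                child = p.copy()                   # samples (duplicates tolerated
                k = int(rng.integers(1, max(2, n_min // 10)))  # in this sketch)
                child[rng.choice(n_min, size=k, replace=False)] = rng.choice(maj_idx, size=k)
                children.append(child)
            pop = parents + children
        ensemble.append(train_on(max(pop, key=fitness)))
    return ensemble


def predict(ensemble, X):
    """Majority vote of the base classifiers."""
    votes = np.mean([m.predict(X) for m in ensemble], axis=0)
    return (votes >= 0.5).astype(int)
```

In a faithful implementation, fitness would be evaluated on held-out data rather than the training set, and the third factor balancing individual performance against diversity would follow the definition given in the full text.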

Keywords: class imbalanced problem; under-sampling; bagging; evolutionary under-sampling; ensemble learning; machine learning; data mining
Corresponding Author(s): Bo SUN, Haiyan CHEN
Just Accepted Date: 23 June 2016   Online First Date: 22 September 2017    Issue Date: 23 March 2018
 Cite this article:   
Bo SUN, Haiyan CHEN, Jiandong WANG, et al. Evolutionary under-sampling based bagging ensemble method for imbalanced data classification[J]. Front. Comput. Sci., 2018, 12(2): 331-350.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-016-5306-z
https://academic.hep.com.cn/fcs/EN/Y2018/V12/I2/331