Frontiers of Computer Science

Front. Comput. Sci., 2017, 11(5): 836-851    https://doi.org/10.1007/s11704-016-5250-y
RESEARCH ARTICLE
Boosting imbalanced data learning with Wiener process oversampling
Qian LI 1, Gang LI 2, Wenjia NIU 1, Yanan CAO 1, Liang CHANG 3, Jianlong TAN 1, Li GUO 1
1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
2. School of Information Technology, Deakin University, Geelong VIC 3125, Australia
3. Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China
Abstract

Learning from imbalanced data is a challenging task in a wide range of applications, and it attracts significant research effort from the machine learning and data mining communities. As a natural approach to this issue, oversampling balances the training set by replicating existing minority samples or by synthesizing new ones. In general, synthesis outperforms replication because it supplies additional information about the minority class. However, the additional information needs to follow the same normal distribution as the training set, which further constrains the new samples to the predefined range of the training set. In this paper, we present the Wiener process oversampling (WPO) technique, which brings a physical phenomenon into sample synthesis. WPO constructs a robust decision region by expanding the attribute ranges of the training set while preserving its normal distribution. WPO achieves satisfactory performance with much lower computational complexity. In addition, by integrating WPO with ensemble learning, the resulting WPOBoost algorithm outperforms many prevalent imbalanced-learning solutions.
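To make the idea concrete, below is a minimal, illustrative sketch of Wiener-process-style oversampling in Python/NumPy. It is not the authors' exact WPO algorithm: the function name and the horizon, n_steps, and scale parameters are assumptions made for illustration. Each synthetic minority sample is an existing minority point perturbed by the endpoint of a simulated Wiener path; because the path's increments are independent N(0, dt) Gaussians, the perturbation stays normally distributed yet can land slightly outside the original attribute range, matching the range-expansion effect described in the abstract.

    # Illustrative sketch only -- not the authors' exact WPO algorithm.
    # The horizon/n_steps/scale parameters are assumptions for demonstration.
    import numpy as np

    def wiener_oversample(X_min, n_new, horizon=1.0, n_steps=32,
                          scale=0.1, seed=None):
        """Synthesize n_new minority samples by perturbing randomly chosen
        minority points with the endpoint of a simulated Wiener process.

        Each attribute gets an independent path whose increments are
        N(0, dt), so the perturbation is Gaussian but may push samples
        slightly beyond the original attribute range.
        """
        rng = np.random.default_rng(seed)
        n, d = X_min.shape
        dt = horizon / n_steps
        sigma = X_min.std(axis=0) + 1e-12      # per-attribute noise scale
        idx = rng.integers(0, n, size=n_new)   # seed points to perturb
        # Sum of n_steps independent N(0, dt) increments = W(horizon),
        # i.e., a N(0, horizon) endpoint per attribute.
        steps = rng.normal(0.0, np.sqrt(dt), size=(n_new, n_steps, d))
        endpoint = steps.sum(axis=1)
        return X_min[idx] + scale * sigma * endpoint

    # Hypothetical WPOBoost-style usage (assumes minority class label 1):
    # synthesize enough samples to balance the classes, then boost.
    # from sklearn.ensemble import AdaBoostClassifier
    # X_syn = wiener_oversample(X[y == 1],
    #                           n_new=(y == 0).sum() - (y == 1).sum())
    # X_bal = np.vstack([X, X_syn])
    # y_bal = np.concatenate([y, np.ones(len(X_syn), dtype=y.dtype)])
    # clf = AdaBoostClassifier(n_estimators=50).fit(X_bal, y_bal)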

Keywords: imbalanced-data learning; oversampling; ensemble learning; Wiener process; AdaBoost
Corresponding Author(s): Wenjia NIU   
Just Accepted Date: 16 March 2016   Online First Date: 17 March 2017    Issue Date: 26 September 2017
 Cite this article:   
Qian LI, Gang LI, Wenjia NIU, et al. Boosting imbalanced data learning with Wiener process oversampling[J]. Front. Comput. Sci., 2017, 11(5): 836-851.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-016-5250-y
https://academic.hep.com.cn/fcs/EN/Y2017/V11/I5/836