Frontiers of Computer Science


Front. Comput. Sci., 2018, Vol. 12, Issue (4): 694-713    https://doi.org/10.1007/s11704-018-7314-7
RESEARCH ARTICLE
Dropout training for SVMs with data augmentation
Ning CHEN1, Jun ZHU2, Jianfei CHEN2, Ting CHEN2
1. MOE Key lab of Bioinformatics, Bioinformatics Division and Center for Synthetic and Systems Biology, TNLIST, Tsinghua University, Beijing 100084, China
2. State Key Lab of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Abstract

Dropout and other feature noising schemes have shown promise in controlling over-fitting by artificially corrupting the training data. Though extensive studies have been performed for generalized linear models, little has been done for support vector machines (SVMs), one of the most successful approaches to supervised learning. This paper presents dropout training for both linear SVMs and a nonlinear extension with latent representation learning. For linear SVMs, to deal with the intractable expectation of the non-smooth hinge loss under corrupting distributions, we develop an iteratively reweighted least squares (IRLS) algorithm by exploiting data augmentation techniques. Our algorithm iteratively minimizes the expectation of a reweighted least squares problem, where the re-weights are updated analytically. For nonlinear latent SVMs, we consider learning one layer of latent representations in SVMs and combine the data augmentation technique with a first-order Taylor expansion to handle the intractable expected hinge loss and the nonlinearity of the latent representations. Finally, we apply similar data augmentation ideas to develop a new IRLS algorithm for the expected logistic loss under corrupting distributions, and we further develop a nonlinear extension of logistic regression by incorporating one layer of latent representations. Our algorithms offer insights into the connection and difference between the hinge loss and the logistic loss in dropout training. Empirical results on several real datasets demonstrate the effectiveness of dropout training in significantly boosting the classification accuracy of both linear and nonlinear SVMs.
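
As a rough illustration of the setup described in the abstract (the notation below is ours, and the exact regularizer, corruption model, and updates used in the paper may differ), dropout training of a linear SVM replaces each input with a randomly corrupted copy and minimizes the hinge loss in expectation over the corruption, e.g.,

\min_{w}\ \frac{\gamma}{2}\|w\|^{2} \;+\; \sum_{n=1}^{N}\mathbb{E}_{q(\tilde{x}_{n}\mid x_{n})}\!\left[\max\!\left(0,\,1-y_{n}w^{\top}\tilde{x}_{n}\right)\right],
\qquad
\tilde{x}_{nd}=\frac{z_{nd}\,x_{nd}}{1-p},\quad z_{nd}\sim\mathrm{Bernoulli}(1-p).

The data augmentation step rests on the scale-mixture representation of the hinge loss due to Polson and Scott,

e^{-2\max(0,\,\zeta_{n})}
=\int_{0}^{\infty}\frac{1}{\sqrt{2\pi\lambda_{n}}}
\exp\!\left(-\frac{(\lambda_{n}+\zeta_{n})^{2}}{2\lambda_{n}}\right)d\lambda_{n},
\qquad
\zeta_{n}=1-y_{n}w^{\top}\tilde{x}_{n}.

Taking expectations over both the corruption q and the augmented variables \lambda_{n} turns each term into a quadratic in w with analytically computable weights, which is the reweighted least squares problem solved at each IRLS iteration.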

Keywords: dropout, SVMs, logistic regression, data augmentation, iteratively reweighted least squares
Corresponding Author(s): Ning CHEN   
Just Accepted Date: 09 October 2017   Online First Date: 20 December 2017    Issue Date: 14 June 2018
 Cite this article:   
Ning CHEN, Jun ZHU, Jianfei CHEN, et al. Dropout training for SVMs with data augmentation[J]. Front. Comput. Sci., 2018, 12(4): 694-713.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-018-7314-7
https://academic.hep.com.cn/fcs/EN/Y2018/V12/I4/694