Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP


Front. Comput. Sci.    2020, Vol. 14 Issue (5) : 145309    https://doi.org/10.1007/s11704-019-8457-x
RESEARCH ARTICLE
Adam revisited: a weighted past gradients perspective
Hui ZHONG1, Zaiyi CHEN2, Chuan QIN1, Zai HUANG1, Vincent W. ZHENG3, Tong XU1, Enhong CHEN1
1. Anhui Province Key Laboratory of Big Data Analysis and Application, University of Science and Technology of China, Hefei 230027, China
2. Zhejiang Cainiao Supply Chain Management Co. Ltd, Hangzhou 311122, China
3. Advanced Digital Sciences Center, Singapore 138602, Singapore
Abstract

Adaptive learning rate methods have been successfully applied in many fields, especially in training deep neural networks. Recent results have shown that adaptive methods with exponentially increasing weights on squared past gradients (i.e., ADAM, RMSPROP) may fail to converge to the optimal solution. Although many algorithms, such as AMSGRAD and ADAMNC, have been proposed to fix this non-convergence issue, achieving a data-dependent regret bound similar to or better than that of ADAGRAD remains a challenge for these methods. In this paper, we propose a novel adaptive method, the weighted adaptive algorithm (WADA), to tackle the non-convergence issue. Unlike AMSGRAD and ADAMNC, we adopt a milder weighting strategy on squared past gradients, in which the weights grow linearly. Based on this idea, we propose the weighted adaptive gradient method framework (WAGMF) and implement the WADA algorithm within this framework. Moreover, we prove that WADA achieves a weighted data-dependent regret bound, which can be better than the original regret bound of ADAGRAD when the gradients decrease rapidly. This bound may partially explain the good practical performance of ADAM. Finally, extensive experiments demonstrate the effectiveness of WADA and its variants in comparison with several variants of ADAM on training convex problems and deep neural networks.
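
As a rough illustration of the linear-weighting idea summarized above, the sketch below contrasts ADAM-style exponential averaging of squared gradients with an update that weights the squared gradient at step t by the factor t and normalizes by the total weight t(t+1)/2. The function name wada_like_update, the 1/sqrt(t) step-size decay, and the ADAM-style first-moment term are illustrative assumptions for this sketch, not the authors' exact WADA algorithm.

import numpy as np

def wada_like_update(x0, grad_fn, steps, eta=0.1, eps=1e-8, beta1=0.9):
    # Sketch only: squared past gradients receive linearly growing weights,
    # as described in the abstract; all schedule details are assumptions.
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)   # ADAM-style first moment (assumed for illustration)
    v = np.zeros_like(x)   # weighted sum of squared past gradients
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g
        v += t * (g * g)               # weight w_t = t grows linearly,
                                       # unlike ADAM's exponential weights
        v_hat = v / (t * (t + 1) / 2)  # normalize by the total weight
        x -= (eta / np.sqrt(t)) * m / (np.sqrt(v_hat) + eps)
    return x

# Usage: minimize f(x) = ||x||^2, whose gradient is 2x; the iterate should
# approach the zero vector.
x_final = wada_like_update(np.random.randn(5), lambda x: 2.0 * x, steps=500)
print(x_final)

Compared with ADAGRAD, which weights every squared past gradient equally, linear weights place more emphasis on recent gradients while still retaining the full history, which is consistent with the abstract's claim that the resulting bound can improve on ADAGRAD's when the gradients decrease rapidly.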

Keywords: adaptive learning rate methods; stochastic gradient descent; online learning
Corresponding Author(s): Enhong CHEN   
Issue Date: 10 March 2020
 Cite this article:   
Hui ZHONG, Zaiyi CHEN, Chuan QIN, et al. Adam revisited: a weighted past gradients perspective[J]. Front. Comput. Sci., 2020, 14(5): 145309.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-019-8457-x
https://academic.hep.com.cn/fcs/EN/Y2020/V14/I5/145309