Frontiers of Computer Science  2025, Vol. 19 Issue (1): 191301   https://doi.org/10.1007/s11704-023-3245-z
New logarithmic step size for stochastic gradient descent
Mahsa Soheil SHAMAEE1, Sajad Fathi HAFSHEJANI2, Zeinab SAEIDIAN3
1. Department of Computer Science, Faculty of Mathematical Science, University of Kashan, Kashan 87317-53153, Iran
2. Department of Applied Mathematics, Shiraz University of Technology, Shiraz 13876-71557, Iran
3. Department of Mathematical Sciences, University of Kashan, Kashan 87317-53153, Iran
Full text: PDF (5696 KB)   HTML
Abstract

In this paper, we propose a novel warm restart technique that uses a new logarithmic step size for the stochastic gradient descent (SGD) approach. For smooth and non-convex functions, we establish an O(1/√T) convergence rate for SGD with this step size. We conduct comprehensive experiments to demonstrate the efficiency of the newly proposed step size on the FashionMNIST, CIFAR10, and CIFAR100 datasets. Moreover, we compare our results with nine other existing approaches and show that the new logarithmic step size improves test accuracy by 0.9% on the CIFAR100 dataset when a convolutional neural network (CNN) model is used.
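To make the idea concrete, here is a minimal sketch of SGD driven by a logarithmic step size with warm restarts. The exact formula and restart schedule are given in the full paper; the sketch assumes a schedule of the form η_t = η_0·(1 − ln t / ln T) within each stage, restarted every T iterations, and uses a toy least-squares problem purely for illustration.

```python
# Illustrative sketch only: a logarithmic step-size schedule of the ASSUMED
# form eta_t = eta0 * (1 - ln(t) / ln(T)), combined with warm restarts,
# driving SGD on a toy least-squares problem. The paper's exact formula,
# constants, and restart schedule may differ.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = A x* + noise
A = rng.normal(size=(1000, 20))
x_true = rng.normal(size=20)
y = A @ x_true + 0.01 * rng.normal(size=1000)


def log_step_size(t, T, eta0):
    """Assumed logarithmic schedule: eta0 at t = 1, decaying to 0 at t = T."""
    return eta0 * (1.0 - np.log(t) / np.log(T))


def sgd_with_warm_restarts(eta0=0.05, T=2000, n_restarts=3, batch=32):
    x = np.zeros(A.shape[1])
    for _ in range(n_restarts):          # each restart replays the schedule
        for t in range(1, T + 1):
            eta = log_step_size(t, T, eta0)
            idx = rng.integers(0, A.shape[0], size=batch)
            grad = A[idx].T @ (A[idx] @ x - y[idx]) / batch  # mini-batch gradient
            x -= eta * grad
    return x


x_hat = sgd_with_warm_restarts()
print("final mean squared error:", np.mean((A @ x_hat - y) ** 2))
```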

Key words: stochastic gradient descent; logarithmic step size; warm restart technique
Received: 2023-03-25      Published: 2024-03-14
Corresponding Author(s): Sajad Fathi HAFSHEJANI   
Cite this article:
Mahsa Soheil SHAMAEE, Sajad Fathi HAFSHEJANI, Zeinab SAEIDIAN. New logarithmic step size for stochastic gradient descent. Front. Comput. Sci., 2025, 19(1): 191301.
Link to this article:
https://academic.hep.com.cn/fcs/CN/10.1007/s11704-023-3245-z
https://academic.hep.com.cn/fcs/CN/Y2025/V19/I1/191301
Step sizes | Conditions | Convergence rate | Reference
η_t = (η_0/2)(1 + cos(πt/T)) | η_0 = 1/(Lc), c = O(√T) | O(1/√T) | [18]
η_t = η_0 (β/T)^(t/T) | η_0 = 1/(Lc), c = O(√T) | O(1/√T) | [18]
η_t = η_0 (β/T)^(t/T) | β = O(T) | O(ln T/√T) | [20]
η_t = constant | η_0 = 1/√T | O(1/√T) | [15]
New proposed method | η_0 = 1/(Lc), c = O(√T/ln T) | O(1/√T) | New
Tab.1  Step-size schedules with their conditions and convergence rates for smooth non-convex objectives
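For intuition about how the schedules in Tab.1 differ, the short script below evaluates each decay profile over a single horizon of T iterations; the values of η_0 and β and the logarithmic form η_0(1 − ln t / ln T) are illustrative assumptions, not the constants used in the paper's analysis.

```python
# Sketch: compare the decay profiles of the schedules listed in Tab.1.
# eta0, beta, and the logarithmic form are illustrative assumptions,
# not the constants used in the paper.
import numpy as np

T = 1000
eta0 = 0.1
beta = np.sqrt(T)
t = np.arange(1, T + 1)

schedules = {
    "cosine":      eta0 / 2 * (1 + np.cos(np.pi * t / T)),
    "exponential": eta0 * (beta / T) ** (t / T),
    "constant":    np.full(T, 1 / np.sqrt(T)),
    "logarithmic": eta0 * (1 - np.log(t) / np.log(T)),  # assumed form
}

for name, eta in schedules.items():
    print(f"{name:12s} first={eta[0]:.4f}  middle={eta[T // 2]:.4f}  last={eta[-1]:.6f}")
```

With these illustrative formulas, the cosine curve collapses quadratically as t approaches T, whereas the assumed logarithmic curve decays roughly linearly in (T − t)/ln T, so the final iterations retain comparatively larger steps.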
Fig.1
Methods | FashionMNIST (η_0 / α / c) | CIFAR10 (η_0 / α / c) | CIFAR100 (η_0 / α / c)
Constant step size | 0.007 / – / – | 0.07 / – / – | 0.07 / – / –
O(1/t) step size | 0.05 / 0.00038 / – | 0.1 / 0.00023 / – | 0.8 / 0.004 / –
O(1/√t) step size | 0.05 / 0.00653 / – | 0.2 / 0.07907 / – | 0.1 / 0.015 / –
Adam | 0.0009 / – / – | 0.0009 / – / – | 0.0009 / – / –
SGD+Armijo | 0.5 / – / 0.1 | 2.5 / – / 0.1 | 5 / – / 0.5
ReduceLROnPlateau | 0.04 / 0.5 / – | 0.07 / 0.1 / – | 0.1 / 0.5 / –
Stagewise - 1 Milestone | 0.04 / 0.1 / – | 0.1 / 0.1 / – | 0.07 / 0.1 / –
Stagewise - 2 Milestones | 0.04 / 0.1 / – | 0.2 / 0.1 / – | 0.07 / 0.1 / –
Cosine step size | 0.05 / – / – | 0.25 / – / – | 0.09 / – / –
Ln step size | 0.05 / – / – | 0.25 / – / – | 0.53 / – / –
Tab.2  Hyper-parameter values (η_0, α, c) used for each method on the three datasets
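Most of the baselines in Tab.2 correspond to standard PyTorch optimizers and schedulers. The fragment below sketches one plausible instantiation for the CIFAR100 column; reading η_0 as the initial learning rate and α as the decay factor is our assumption, and the placeholder model, milestone positions, and horizon T are not the authors' exact setup.

```python
# Sketch: one possible PyTorch instantiation of a few baselines from Tab.2
# (CIFAR100 column). Mapping eta0 -> lr and alpha -> decay factor is an
# assumption; the model, milestones, and horizon T are placeholders.
import math

import torch
import torch.nn as nn

model = nn.Linear(3 * 32 * 32, 100)   # placeholder model

# Constant step size: plain SGD with lr = eta0
sgd_const = torch.optim.SGD(model.parameters(), lr=0.07)

# Adam with the listed eta0
adam = torch.optim.Adam(model.parameters(), lr=0.0009)

# ReduceLROnPlateau: shrink lr by the listed alpha when the tracked metric stalls
sgd_plateau = torch.optim.SGD(model.parameters(), lr=0.1)
plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(sgd_plateau, mode="min", factor=0.5)

# Stagewise (2 milestones): MultiStepLR with gamma = alpha; milestones are placeholders
sgd_stage = torch.optim.SGD(model.parameters(), lr=0.07)
stagewise = torch.optim.lr_scheduler.MultiStepLR(sgd_stage, milestones=[60, 120], gamma=0.1)

# Cosine step size over T epochs via LambdaLR
T = 200  # placeholder horizon
sgd_cos = torch.optim.SGD(model.parameters(), lr=0.09)
cosine = torch.optim.lr_scheduler.LambdaLR(
    sgd_cos, lr_lambda=lambda t: 0.5 * (1 + math.cos(math.pi * t / T))
)
```

SGD+Armijo and the proposed Ln step size have no off-the-shelf scheduler in torch.optim and would need custom code, for example a LambdaLR wrapping the logarithmic formula in the same way as the cosine one above.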
Fig.2  
Fig.3  
Fig.4  
Fig.5  
Methods | Training loss | Test accuracy
Constant step size | 0.0007 ± 0.0003 | 0.9299 ± 0.0016
O(1/t) step size | 0.0006 ± 0.0000 | 0.9311 ± 0.0012
O(1/√t) step size | 0.0011 ± 0.0003 | 0.9275 ± 0.0007
Adam | 0.0131 ± 0.0017 | 0.9166 ± 0.0019
SGD+Armijo | 6.73e−05 ± 0.0000 | 0.9277 ± 0.0012
ReduceLROnPlateau | 0.0229 ± 0.0032 | 0.9299 ± 0.0014
Stagewise - 1 Milestone | 0.0004 ± 0.0000 | 0.9303 ± 0.0007
Stagewise - 2 Milestones | 0.0015 ± 0.0000 | 0.9298 ± 0.0014
Cosine step size | 0.0004 ± 1.1e−05 | 0.9284 ± 0.0005
Ln step size | 0.0007 ± 0.0000 | 0.9314 ± 0.0018
Tab.3  
Methods | Training loss | Test accuracy
Constant step size | 0.1870 ± 0.0257 | 0.8776 ± 0.0060
O(1/t) step size | 0.0261 ± 0.0021 | 0.8946 ± 0.0020
O(1/√t) step size | 0.2634 ± 0.3493 | 0.8747 ± 0.0077
Adam | 0.1140 ± 0.0060 | 0.8839 ± 0.0031
SGD+Armijo | 0.0463 ± 0.0169 | 0.8834 ± 0.0059
ReduceLROnPlateau | 0.0646 ± 0.0041 | 0.9081 ± 0.0019
Stagewise - 1 Milestone | 0.0219 ± 0.0012 | 0.9111 ± 0.0034
Stagewise - 2 Milestones | 0.0344 ± 0.0062 | 0.9151 ± 0.0039
Cosine step size | 0.0102 ± 0.0007 | 0.9201 ± 0.0017
Ln step size | 0.0089 ± 0.0011 | 0.9012 ± 0.0027
Tab.4  
Methods | Training loss | Test accuracy
Constant step size | 1.19599 ± 0.0183 | 0.5579 ± 0.0039
O(1/t) step size | 1.43381 ± 0.0305 | 0.54436 ± 0.0038
O(1/√t) step size | 1.14579 ± 0.0125 | 0.5624 ± 0.0020
Adam | 1.53433 ± 0.0180 | 0.5206 ± 0.0053
SGD+Armijo | 1.3046 ± 0.0476 | 0.5244 ± 0.0053
ReduceLROnPlateau | 1.6000 ± 0.055 | 0.5148 ± 0.1929
Stagewise - 1 Milestone | 1.1889 ± 0.0077 | 0.5706 ± 0.0049
Stagewise - 2 Milestones | 1.09061 ± 0.0040 | 0.5779 ± 0.0017
Cosine step size | 1.12694 ± 0.0081 | 0.5739 ± 0.0025
Ln step size | 1.0805 ± 0.0327 | 0.5869 ± 0.0035
Tab.5  
Methods | Training loss | Test accuracy
Constant step size | 1.0789 ± 0.06336 | 0.6089 ± 0.01485
O(1/t) step size | 0.2709 ± 0.02047 | 0.6885 ± 0.0051
O(1/√t) step size | 0.7584 ± 0.04068 | 0.6411 ± 0.00469
Adam | 0.6008 ± 0.02396 | 0.6555 ± 0.00430
SGD+Armijo | 0.1162 ± 0.01175 | 0.6869 ± 0.00460
ReduceLROnPlateau | 0.0863 ± 0.00896 | 0.7440 ± 0.00695
Stagewise - 1 Milestone | 0.15300 ± 0.00382 | 0.7456 ± 0.00188
Stagewise - 2 Milestones | 0.21032 ± 0.01067 | 0.73102 ± 0.00402
Cosine step size | 0.0622 ± 0.00349 | 0.7568 ± 0.00311
Ln step size | 0.06011 ± 0.0018 | 0.7579 ± 0.0022
Tab.6  
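The entries in Tab.3–6 are of the form mean ± standard deviation, presumably aggregated over repeated runs with different random seeds. A minimal helper for producing such entries could look like the following; the per-seed numbers and the choice of sample standard deviation are placeholders and assumptions, not data from the paper.

```python
# Sketch: aggregating per-seed results into "mean ± std" entries as in Tab.3-6.
# The numbers below are made-up placeholders, and whether the paper uses the
# sample or population standard deviation is an assumption.
import numpy as np

test_acc_per_seed = np.array([0.757, 0.759, 0.756, 0.760, 0.758])  # placeholder values

mean = test_acc_per_seed.mean()
std = test_acc_per_seed.std(ddof=1)   # sample standard deviation over seeds
print(f"{mean:.4f} ± {std:.4f}")
```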
1 Robbins H, Monro S. A stochastic approximation method. The Annals of Mathematical Statistics, 1951, 22(3): 400–407
2 Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 2017, 60(6): 84–90
3 Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images. Toronto: University of Toronto, Department of Computer Science, 2009
4 Redmon J, Farhadi A. YOLO9000: better, faster, stronger. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 6517–6525
5 Zhang J, Zong C. Deep neural networks in machine translation: an overview. IEEE Intelligent Systems, 2015, 30(5): 16–25
6 Mishra P, Sarawadekar K. Polynomial learning rate policy with warm restart for deep neural network. In: Proceedings of 2019 IEEE Region 10 Conference (TENCON). 2019, 2087–2092
7 Vaswani S, Mishkin A, Laradji I, Schmidt M, Gidel G, Lacoste-Julien S. Painless stochastic gradient: interpolation, line-search, and convergence rates. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 335
8 Gower R M, Loizou N, Qian X, Sailanbayev A, Shulgin E, Richtárik P. SGD: general analysis and improved rates. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 5200–5209
9 Huang G, Liu Z, Van Der Maaten L, Weinberger K Q. Densely connected convolutional networks. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 2261–2269
10 Smith L N. Cyclical learning rates for training neural networks. In: Proceedings of 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). 2017, 464–472
11 Loshchilov I, Hutter F. SGDR: stochastic gradient descent with warm restarts. In: Proceedings of the 5th International Conference on Learning Representations. 2017
12 Vrbančič G, Podgorelec V. Efficient ensemble for image-based identification of pneumonia utilizing deep CNN and SGD with warm restarts. Expert Systems with Applications, 2022, 187: 115834
13 Xu G, Cao H, Dong Y, Yue C, Zou Y. Stochastic gradient descent with step cosine warm restarts for pathological lymph node image classification via PET/CT images. In: Proceedings of the 5th IEEE International Conference on Signal and Image Processing (ICSIP). 2020, 490–493
14 Nemirovski A, Juditsky A, Lan G, Shapiro A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009, 19(4): 1574–1609
15 Ghadimi S, Lan G H. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 2013, 23(4): 2341–2368
16 Bach F, Moulines E. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Proceedings of the 24th International Conference on Neural Information Processing Systems. 2011, 451–459
17 Rakhlin A, Shamir O, Sridharan K. Making gradient descent optimal for strongly convex stochastic optimization. In: Proceedings of the 29th International Conference on Machine Learning. 2012
18 Li X, Zhuang Z, Orabona F. A second look at exponential and cosine step sizes: simplicity, adaptivity, and performance. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 6553–6564
19 Ge R, Kakade S M, Kidambi R, Netrapalli P. The step decay schedule: a near optimal, geometrically decaying learning rate procedure for least squares. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 1341
20 Wang X, Magnússon S, Johansson M. On the convergence of step decay step-size for stochastic optimization. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021, 14226–14238
21 Nocedal J, Wright S J. Numerical Optimization. New York: Springer, 1999
22 He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 770–778
23 Kingma D P, Ba J. Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations. 2015
24 Aybat N S, Fallah A, Gurbuzbalaban M, Ozdaglar A. A universally optimal multistage accelerated stochastic gradient method. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 765
25 Thomas G B, Finney R L, Weir M D, Giordano F R. Thomas' Calculus, Early Transcendentals. 10th ed. Boston: Addison Wesley, 2002
Supplementary material: FCS-23245-OF-MSS_suppl_1