Frontiers of Computer Science

Front. Comput. Sci.    2025, Vol. 19 Issue (1) : 191301    https://doi.org/10.1007/s11704-023-3245-z
Artificial Intelligence
New logarithmic step size for stochastic gradient descent
Mahsa Soheil SHAMAEE1, Sajad Fathi HAFSHEJANI2(), Zeinab SAEIDIAN3
1. Department of Computer Science, Faculty of Mathematical Science, University of Kashan, Kashan 87317-53153, Iran
2. Department of Applied Mathematics, Shiraz University of Technology, Shiraz 13876-71557, Iran
3. Department of Mathematical Sciences, University of Kashan, Kashan 87317-53153, Iran
Abstract

In this paper, we propose a novel warm restart technique using a new logarithmic step size for the stochastic gradient descent (SGD) approach. For smooth and non-convex functions, we establish an O(1/√T) convergence rate for SGD. We conduct a comprehensive implementation to demonstrate the efficiency of the newly proposed step size on the FashionMNIST, CIFAR10, and CIFAR100 datasets. Moreover, we compare our results with nine other existing approaches and demonstrate that the new logarithmic step size improves test accuracy by 0.9% on the CIFAR100 dataset when we utilize a convolutional neural network (CNN) model.
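To make the idea concrete, the sketch below runs plain SGD with a logarithmically decaying step size on a toy least-squares problem. The exact schedule is defined in the paper; the form η_t = η_0(1 − ln t / ln(T + 1)) used here, along with the synthetic data and all constants, is only an illustrative assumption.

```python
import numpy as np

# Minimal sketch: SGD with an assumed logarithmic step size
# eta_t = eta0 * (1 - ln(t) / ln(T + 1)) on a toy least-squares problem.
# The schedule form, data, and constants are illustrative, not the paper's setup.

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)

def log_step_size(t, T, eta0):
    """Assumed logarithmic decay: near eta0 early on, shrinking towards the horizon T."""
    return eta0 * (1.0 - np.log(t) / np.log(T + 1))

T = 5000        # total number of SGD iterations (the horizon)
eta0 = 0.01     # initial step size, playing the role of 1/(Lc) in the analysis
x = np.zeros(d)

for t in range(1, T + 1):
    i = rng.integers(n)                    # sample one data point uniformly
    g = (A[i] @ x - b[i]) * A[i]           # stochastic gradient of 0.5 * (a_i^T x - b_i)^2
    x -= log_step_size(t, T, eta0) * g

print("final mean squared error:", np.mean((A @ x - b) ** 2))
```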

Keywords: stochastic gradient descent; logarithmic step size; warm restart technique
Corresponding Author(s): Sajad Fathi HAFSHEJANI   
Just Accepted Date: 18 October 2023   Issue Date: 14 March 2024
 Cite this article:   
Mahsa Soheil SHAMAEE, Sajad Fathi HAFSHEJANI, Zeinab SAEIDIAN. New logarithmic step size for stochastic gradient descent[J]. Front. Comput. Sci., 2025, 19(1): 191301.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-3245-z
https://academic.hep.com.cn/fcs/EN/Y2025/V19/I1/191301
Step sizes | Conditions | Convergence rate | Reference
η_t = (η_0/2)(1 + cos(πt/T)) | η_0 = 1/(Lc), c = O(√T) | O(1/√T) | [18]
η_t = η_0(β/T)^(t/T) | η_0 = 1/(Lc), c = O(√T) | O(1/√T) | [18]
η_t = η_0(β/T)^(t/T) | β = O(T) | O(ln T/√T) | [20]
η_t = constant | η_0 = 1/√T | O(1/√T) | [15]
New proposed method | η_0 = 1/(Lc), c = O(√T/ln T) | O(1/√T) | New
Tab.1  Convergence rates of SGD based on specific values of the initial step size or of the parameters associated with the step size
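The baseline schedules in Tab.1 can be written as one-line functions of the iteration counter t, as in the hedged sketch below; T, L, c, and β are placeholder values chosen only so the code runs, not the constants used in the analysis or the experiments.

```python
import math

# Schedules from Tab.1 as functions of t (1 <= t <= T), with placeholder constants.
T = 100
L, c, beta = 1.0, math.sqrt(100), 4.0
eta0 = 1.0 / (L * c)

cosine      = lambda t: (eta0 / 2) * (1 + math.cos(math.pi * t / T))  # [18]
exponential = lambda t: eta0 * (beta / T) ** (t / T)                  # [18], [20]
constant    = lambda t: 1.0 / math.sqrt(T)                            # [15]
# The new logarithmic schedule is sketched after the abstract above.

for t in (1, T // 2, T):
    print(t, cosine(t), exponential(t), constant(t))
```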
Fig.1  (a) Representation of the value η_t/∑_{t=1}^{T} η_t for both the new step size and the cosine step size in [18]; (b) warm restarts simulated every T = 30 (blue line), T = 70 (orange line), T = 100 (green line), and T = 200 (magenta line) epochs with η_0 = 1; (c) comparison of the warm restart Algorithm 1 with the new proposed step size and the cosine step size on the FashionMNIST dataset
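The restart bookkeeping behind Fig.1(b) amounts to resetting the schedule's counter every T epochs with η_0 = 1; the small sketch below reproduces that simulation, again under the assumed logarithmic form used in the sketch after the abstract.

```python
import math

def log_step_size(t, T, eta0=1.0):
    # Assumed logarithmic form; t runs from 1 to T within each restart cycle.
    return eta0 * (1.0 - math.log(t) / math.log(T + 1))

def warm_restart_schedule(num_epochs, restart_every):
    """Per-epoch step sizes when the counter is reset every `restart_every` epochs,
    as simulated in Fig.1(b) with eta0 = 1."""
    lrs = []
    for epoch in range(num_epochs):
        t = (epoch % restart_every) + 1   # position inside the current cycle
        lrs.append(log_step_size(t, restart_every))
    return lrs

# For example, the T = 30 curve from Fig.1(b), evaluated over 200 epochs.
print(warm_restart_schedule(200, 30)[:35])
```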
  
Methods | FashionMNIST (η_0 / α / c) | CIFAR10 (η_0 / α / c) | CIFAR100 (η_0 / α / c)
Constant step size | 0.007 / – / – | 0.07 / – / – | 0.07 / – / –
O(1/√t) step size | 0.05 / 0.00038 / – | 0.1 / 0.00023 / – | 0.8 / 0.004 / –
O(1/t) step size | 0.05 / 0.00653 / – | 0.2 / 0.07907 / – | 0.1 / 0.015 / –
Adam | 0.0009 / – / – | 0.0009 / – / – | 0.0009 / – / –
SGD+Armijo | 0.5 / – / 0.1 | 2.5 / – / 0.1 | 5 / – / 0.5
ReduceLROnPlateau | 0.04 / 0.5 / – | 0.07 / 0.1 / – | 0.1 / 0.5 / –
Stagewise - 1 Milestone | 0.04 / 0.1 / – | 0.1 / 0.1 / – | 0.07 / 0.1 / –
Stagewise - 2 Milestones | 0.04 / 0.1 / – | 0.2 / 0.1 / – | 0.07 / 0.1 / –
Cosine step size | 0.05 / – / – | 0.25 / – / – | 0.09 / – / –
Ln step size | 0.05 / – / – | 0.25 / – / – | 0.53 / – / –
Tab.2  Hyperparameter values for the methods employed on the FashionMNIST, CIFAR10, and CIFAR100 datasets ("–" indicates the parameter is not used by that method)
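Several baselines in Tab.2 correspond to standard PyTorch optimizers and learning-rate schedulers; the sketch below wires a few of them up with the CIFAR10 column's values. Whether the authors used these exact classes is not stated here, and the stand-in model, the plateau patience, and the milestone epochs are assumptions, since Tab.2 does not report them.

```python
from torch import nn, optim

# Stand-in model; the experiments use a CNN / ResNet / DenseNet-BC instead.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

# Constant step size: plain SGD with the tuned eta0.
sgd_const = optim.SGD(model.parameters(), lr=0.07)

# Adam with the tuned eta0.
adam = optim.Adam(model.parameters(), lr=0.0009)

# ReduceLROnPlateau: eta0 = 0.07, decay factor alpha = 0.1 (patience assumed).
sgd_plateau = optim.SGD(model.parameters(), lr=0.07)
plateau = optim.lr_scheduler.ReduceLROnPlateau(sgd_plateau, factor=0.1, patience=10)

# Stagewise, 2 milestones: eta0 = 0.2, decay alpha = 0.1 at assumed epochs.
sgd_stage = optim.SGD(model.parameters(), lr=0.2)
stagewise = optim.lr_scheduler.MultiStepLR(sgd_stage, milestones=[60, 120], gamma=0.1)

# Cosine step size: eta0 = 0.25 annealed to zero over the training horizon.
sgd_cos = optim.SGD(model.parameters(), lr=0.25)
cosine = optim.lr_scheduler.CosineAnnealingLR(sgd_cos, T_max=200)
```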
Fig.2  Comparison of the new proposed step size and five other step sizes on the FashionMNIST (a), CIFAR10 (b), and CIFAR100 (c) datasets
Fig.3  Comparison of the new proposed step size and four other step sizes on the FashionMNIST (a), CIFAR10 (b), and CIFAR100 (c) datasets
Fig.4  Comparison of the new proposed step size and five other step sizes on the CIFAR100 dataset using the DenseNet-BC model
Fig.5  Comparison of the new proposed step size and four other step sizes on the CIFAR100 dataset using the DenseNet-BC model
Methods | Training loss | Test accuracy
Constant step size | 0.0007±0.0003 | 0.9299±0.0016
O(1/√t) step size | 0.0006±0.0000 | 0.9311±0.0012
O(1/t) step size | 0.0011±0.0003 | 0.9275±0.0007
Adam | 0.0131±0.0017 | 0.9166±0.0019
SGD+Armijo | 6.73e−05±0.0000 | 0.9277±0.0012
ReduceLROnPlateau | 0.0229±0.0032 | 0.9299±0.0014
Stagewise - 1 Milestone | 0.0004±0.0000 | 0.9303±0.0007
Stagewise - 2 Milestones | 0.0015±0.0000 | 0.9298±0.0014
Cosine step size | 0.0004±1.1e−05 | 0.9284±0.0005
Ln step size | 0.0007±0.0000 | 0.9314±0.0018
Tab.3  Average final training loss and test accuracy on the FashionMNIST dataset. Each value is the mean over 5 runs from different random seeds; ± denotes the 95% confidence interval of the mean
Methods | Training loss | Test accuracy
Constant step size | 0.1870±0.0257 | 0.8776±0.0060
O(1/√t) step size | 0.0261±0.0021 | 0.8946±0.0020
O(1/t) step size | 0.2634±0.3493 | 0.8747±0.0077
Adam | 0.1140±0.0060 | 0.8839±0.0031
SGD+Armijo | 0.0463±0.0169 | 0.8834±0.0059
ReduceLROnPlateau | 0.0646±0.0041 | 0.9081±0.0019
Stagewise - 1 Milestone | 0.0219±0.0012 | 0.9111±0.0034
Stagewise - 2 Milestones | 0.0344±0.0062 | 0.9151±0.0039
Cosine step size | 0.0102±0.0007 | 0.9201±0.0017
Ln step size | 0.0089±0.0011 | 0.9012±0.0027
Tab.4  Average final training loss and test accuracy on the CIFAR10 dataset over 5 runs from different random seeds; ± denotes the estimated margin of error at 95% confidence
Methods | Training loss | Test accuracy
Constant step size | 1.19599±0.0183 | 0.5579±0.0039
O(1/√t) step size | 1.43381±0.0305 | 0.54436±0.0038
O(1/t) step size | 1.14579±0.0125 | 0.5624±0.0020
Adam | 1.53433±0.0180 | 0.5206±0.0053
SGD+Armijo | 1.3046±0.0476 | 0.5244±0.0053
ReduceLROnPlateau | 1.6000±0.055 | 0.5148±0.1929
Stagewise - 1 Milestone | 1.1889±0.0077 | 0.5706±0.0049
Stagewise - 2 Milestones | 1.09061±0.0040 | 0.5779±0.0017
Cosine step size | 1.12694±0.0081 | 0.5739±0.0025
Ln step size | 1.0805±0.0327 | 0.5869±0.0035
Tab.5  Average final training loss and test accuracy on the CIFAR100 dataset with the convolutional neural network (CNN) model, over 5 runs from different random seeds; ± denotes the estimated margin of error at 95% confidence
Methods | Training loss | Test accuracy
Constant step size | 1.0789±0.06336 | 0.6089±0.01485
O(1/√t) step size | 0.2709±0.02047 | 0.6885±0.0051
O(1/t) step size | 0.7584±0.04068 | 0.6411±0.00469
Adam | 0.6008±0.02396 | 0.6555±0.00430
SGD+Armijo | 0.1162±0.01175 | 0.6869±0.00460
ReduceLROnPlateau | 0.0863±0.00896 | 0.7440±0.00695
Stagewise - 1 Milestone | 0.15300±0.00382 | 0.7456±0.00188
Stagewise - 2 Milestones | 0.21032±0.01067 | 0.73102±0.00402
Cosine step size | 0.0622±0.00349 | 0.7568±0.00311
Ln step size | 0.06011±0.0018 | 0.7579±0.0022
Tab.6  Average final training loss and test accuracy on the CIFAR100 dataset with the DenseNet-BC model, over 5 runs from different random seeds; ± denotes the estimated margin of error at 95% confidence
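The "±" margins in Tab.3−Tab.6 are 95% confidence half-widths over five seeds; one common way to compute such a margin is a Student-t interval, sketched below (the paper's exact estimator may differ, and the per-seed accuracies in the example are made up).

```python
import numpy as np
from scipy import stats

def mean_and_margin(values, confidence=0.95):
    """Mean and Student-t confidence half-width for a small sample of per-seed results."""
    values = np.asarray(values, dtype=float)
    n = values.size
    sem = values.std(ddof=1) / np.sqrt(n)              # standard error of the mean
    t_crit = stats.t.ppf(0.5 + confidence / 2.0, df=n - 1)
    return values.mean(), t_crit * sem

# Hypothetical test accuracies from 5 random seeds (not the paper's raw data).
acc_runs = [0.585, 0.590, 0.586, 0.589, 0.584]
mean, margin = mean_and_margin(acc_runs)
print(f"{mean:.4f} ± {margin:.4f}")
```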
  
  
  
1 Robbins H, Monro S. A stochastic approximation method. The Annals of Mathematical Statistics, 1951, 22(3): 400–407
2 Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 2017, 60(6): 84–90
3 Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images. Toronto: University of Toronto, Department of Computer Science, 2009
4 Redmon J, Farhadi A. YOLO9000: better, faster, stronger. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 6517–6525
5 Zhang J, Zong C. Deep neural networks in machine translation: an overview. IEEE Intelligent Systems, 2015, 30(5): 16–25
6 Mishra P, Sarawadekar K. Polynomial learning rate policy with warm restart for deep neural network. In: Proceedings of 2019 IEEE Region 10 Conference (TENCON). 2019, 2087–2092
7 Vaswani S, Mishkin A, Laradji I, Schmidt M, Gidel G, Lacoste-Julien S. Painless stochastic gradient: interpolation, line-search, and convergence rates. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 335
8 Gower R M, Loizou N, Qian X, Sailanbayev A, Shulgin E, Richtárik P. SGD: general analysis and improved rates. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 5200–5209
9 Huang G, Liu Z, Van Der Maaten L, Weinberger K Q. Densely connected convolutional networks. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 2261–2269
10 Smith L N. Cyclical learning rates for training neural networks. In: Proceedings of 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). 2017, 464–472
11 Loshchilov I, Hutter F. SGDR: stochastic gradient descent with warm restarts. In: Proceedings of the 5th International Conference on Learning Representations. 2017
12 Vrbančič G, Podgorelec V. Efficient ensemble for image-based identification of pneumonia utilizing deep CNN and SGD with warm restarts. Expert Systems with Applications, 2022, 187: 115834
13 Xu G, Cao H, Dong Y, Yue C, Zou Y. Stochastic gradient descent with step cosine warm restarts for pathological lymph node image classification via PET/CT images. In: Proceedings of the 5th IEEE International Conference on Signal and Image Processing (ICSIP). 2020, 490–493
14 Nemirovski A, Juditsky A, Lan G, Shapiro A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009, 19(4): 1574–1609
15 Ghadimi S, Lan G H. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 2013, 23(4): 2341–2368
16 Bach F, Moulines E. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Proceedings of the 24th International Conference on Neural Information Processing Systems. 2011, 451–459
17 Rakhlin A, Shamir O, Sridharan K. Making gradient descent optimal for strongly convex stochastic optimization. In: Proceedings of the 29th International Conference on Machine Learning. 2012
18 Li X, Zhuang Z, Orabona F. A second look at exponential and cosine step sizes: simplicity, adaptivity, and performance. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 6553–6564
19 Ge R, Kakade S M, Kidambi R, Netrapalli P. The step decay schedule: a near optimal, geometrically decaying learning rate procedure for least squares. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 1341
20 Wang X, Magnússon S, Johansson M. On the convergence of step decay step-size for stochastic optimization. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021, 14226–14238
21 Nocedal J, Wright S J. Numerical Optimization. New York: Springer, 1999
22 He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 770–778
23 Kingma D P, Ba J. Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations. 2015
24 Aybat N S, Fallah A, Gurbuzbalaban M, Ozdaglar A. A universally optimal multistage accelerated stochastic gradient method. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 765
25 Thomas G B, Finney R L, Weir M D, Giordano F R. Thomas' Calculus, Early Transcendentals. 10th ed. Boston: Addison Wesley, 2002
Supplementary material: FCS-23245-OF-MSS_suppl_1