Frontiers of Computer Science

Front. Comput. Sci.    2025, Vol. 19 Issue (1) : 191301    https://doi.org/10.1007/s11704-023-3245-z
Artificial Intelligence
New logarithmic step size for stochastic gradient descent
Mahsa Soheil SHAMAEE1, Sajad Fathi HAFSHEJANI2(), Zeinab SAEIDIAN3
1. Department of Computer Science, Faculty of Mathematical Science, University of Kashan, Kashan 87317-53153, Iran
2. Department of Applied Mathematics, Shiraz University of Technology, Shiraz 13876-71557, Iran
3. Department of Mathematical Sciences, University of Kashan, Kashan 87317-53153, Iran
Abstract

In this paper, we propose a novel warm restart technique using a new logarithmic step size for the stochastic gradient descent (SGD) approach. For smooth and non-convex functions, we establish an O(1/√T) convergence rate for SGD. We conduct a comprehensive implementation to demonstrate the efficiency of the newly proposed step size on the FashionMNIST, CIFAR10, and CIFAR100 datasets. Moreover, we compare our results with nine other existing approaches and demonstrate that the new logarithmic step size improves test accuracy by 0.9% on the CIFAR100 dataset when we utilize a convolutional neural network (CNN) model.
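To make the idea concrete, the sketch below runs plain SGD with a logarithmically decaying step size on a toy least-squares problem. The exact schedule is defined in the paper; the form η_t = η_0(1 − ln t / ln(T + 1)) used here, along with the synthetic data and all constants, is only an illustrative assumption.

```python
import numpy as np

# Minimal sketch: SGD with an assumed logarithmic step size
# eta_t = eta0 * (1 - ln(t) / ln(T + 1)) on a toy least-squares problem.
# The schedule form, data, and constants are illustrative, not the paper's setup.

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)

def log_step_size(t, T, eta0):
    """Assumed logarithmic decay: near eta0 early on, shrinking towards the horizon T."""
    return eta0 * (1.0 - np.log(t) / np.log(T + 1))

T = 5000        # total number of SGD iterations (the horizon)
eta0 = 0.01     # initial step size, playing the role of 1/(Lc) in the analysis
x = np.zeros(d)

for t in range(1, T + 1):
    i = rng.integers(n)                    # sample one data point uniformly
    g = (A[i] @ x - b[i]) * A[i]           # stochastic gradient of 0.5 * (a_i^T x - b_i)^2
    x -= log_step_size(t, T, eta0) * g

print("final mean squared error:", np.mean((A @ x - b) ** 2))
```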

Keywords: stochastic gradient descent; logarithmic step size; warm restart technique
Corresponding Author(s): Sajad Fathi HAFSHEJANI   
Just Accepted Date: 18 October 2023   Issue Date: 14 March 2024
 Cite this article:   
Mahsa Soheil SHAMAEE, Sajad Fathi HAFSHEJANI, Zeinab SAEIDIAN. New logarithmic step size for stochastic gradient descent[J]. Front. Comput. Sci., 2025, 19(1): 191301.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-3245-z
https://academic.hep.com.cn/fcs/EN/Y2025/V19/I1/191301
Step sizes | Conditions | Convergence rate | Reference
η_t = (η_0/2)(1 + cos(πt/T)) | η_0 = 1/(Lc), c = O(√T) | O(1/√T) | [18]
η_t = η_0(β/T)^(t/T) | η_0 = 1/(Lc), c = O(√T) | O(1/√T) | [18]
η_t = η_0(β/T)^(t/T) | β = O(T) | O(ln T/√T) | [20]
η_t = constant | η_0 = 1/√T | O(1/√T) | [15]
New proposed method | η_0 = 1/(Lc), c = O(√T/ln T) | O(1/√T) | New
Tab.1  Convergence rates of SGD based on specific values of the initial step size or of the parameters associated with the step size
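The baseline schedules in Tab.1 can be written as one-line functions of the iteration counter t, as in the hedged sketch below; T, L, c, and β are placeholder values chosen only so the code runs, not the constants used in the analysis or the experiments.

```python
import math

# Schedules from Tab.1 as functions of t (1 <= t <= T), with placeholder constants.
T = 100
L, c, beta = 1.0, math.sqrt(100), 4.0
eta0 = 1.0 / (L * c)

cosine      = lambda t: (eta0 / 2) * (1 + math.cos(math.pi * t / T))  # [18]
exponential = lambda t: eta0 * (beta / T) ** (t / T)                  # [18], [20]
constant    = lambda t: 1.0 / math.sqrt(T)                            # [15]
# The new logarithmic schedule is sketched after the abstract above.

for t in (1, T // 2, T):
    print(t, cosine(t), exponential(t), constant(t))
```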
Fig.1  (a) Representation of the value η_t/∑_{t=1}^{T} η_t for both the new step size and the cosine step size in [18]; (b) warm restarts simulated every T = 30 (blue line), T = 70 (orange line), T = 100 (green line), and T = 200 (magenta line) epochs with η_0 = 1; (c) comparison of the warm restart Algorithm 1 with the new proposed step size and the cosine step size on the FashionMNIST dataset
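The restart bookkeeping behind Fig.1(b) amounts to resetting the schedule's counter every T epochs with η_0 = 1; the small sketch below reproduces that simulation, again under the assumed logarithmic form used in the sketch after the abstract.

```python
import math

def log_step_size(t, T, eta0=1.0):
    # Assumed logarithmic form; t runs from 1 to T within each restart cycle.
    return eta0 * (1.0 - math.log(t) / math.log(T + 1))

def warm_restart_schedule(num_epochs, restart_every):
    """Per-epoch step sizes when the counter is reset every `restart_every` epochs,
    as simulated in Fig.1(b) with eta0 = 1."""
    lrs = []
    for epoch in range(num_epochs):
        t = (epoch % restart_every) + 1   # position inside the current cycle
        lrs.append(log_step_size(t, restart_every))
    return lrs

# For example, the T = 30 curve from Fig.1(b), evaluated over 200 epochs.
print(warm_restart_schedule(200, 30)[:35])
```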
  
Methods | FashionMNIST (η_0 / α / c) | CIFAR10 (η_0 / α / c) | CIFAR100 (η_0 / α / c)
Constant step size | 0.007 / – / – | 0.07 / – / – | 0.07 / – / –
O(1/√t) step size | 0.05 / 0.00038 / – | 0.1 / 0.00023 / – | 0.8 / 0.004 / –
O(1/t) step size | 0.05 / 0.00653 / – | 0.2 / 0.07907 / – | 0.1 / 0.015 / –
Adam | 0.0009 / – / – | 0.0009 / – / – | 0.0009 / – / –
SGD+Armijo | 0.5 / – / 0.1 | 2.5 / – / 0.1 | 5 / – / 0.5
ReduceLROnPlateau | 0.04 / 0.5 / – | 0.07 / 0.1 / – | 0.1 / 0.5 / –
Stagewise - 1 Milestone | 0.04 / 0.1 / – | 0.1 / 0.1 / – | 0.07 / 0.1 / –
Stagewise - 2 Milestones | 0.04 / 0.1 / – | 0.2 / 0.1 / – | 0.07 / 0.1 / –
Cosine step size | 0.05 / – / – | 0.25 / – / – | 0.09 / – / –
Ln step size | 0.05 / – / – | 0.25 / – / – | 0.53 / – / –
Tab.2  Hyperparameter values for the methods employed on the FashionMNIST, CIFAR10, and CIFAR100 datasets ("–" indicates the parameter is not used by that method)
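Several baselines in Tab.2 correspond to standard PyTorch optimizers and learning-rate schedulers; the sketch below wires a few of them up with the CIFAR10 column's values. Whether the authors used these exact classes is not stated here, and the stand-in model, the plateau patience, and the milestone epochs are assumptions, since Tab.2 does not report them.

```python
from torch import nn, optim

# Stand-in model; the experiments use a CNN / ResNet / DenseNet-BC instead.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

# Constant step size: plain SGD with the tuned eta0.
sgd_const = optim.SGD(model.parameters(), lr=0.07)

# Adam with the tuned eta0.
adam = optim.Adam(model.parameters(), lr=0.0009)

# ReduceLROnPlateau: eta0 = 0.07, decay factor alpha = 0.1 (patience assumed).
sgd_plateau = optim.SGD(model.parameters(), lr=0.07)
plateau = optim.lr_scheduler.ReduceLROnPlateau(sgd_plateau, factor=0.1, patience=10)

# Stagewise, 2 milestones: eta0 = 0.2, decay alpha = 0.1 at assumed epochs.
sgd_stage = optim.SGD(model.parameters(), lr=0.2)
stagewise = optim.lr_scheduler.MultiStepLR(sgd_stage, milestones=[60, 120], gamma=0.1)

# Cosine step size: eta0 = 0.25 annealed to zero over the training horizon.
sgd_cos = optim.SGD(model.parameters(), lr=0.25)
cosine = optim.lr_scheduler.CosineAnnealingLR(sgd_cos, T_max=200)
```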
Fig.2  Comparison of the new proposed step size and five other step sizes on the FashionMNIST (a), CIFAR10 (b), and CIFAR100 (c) datasets
Fig.3  Comparison of the new proposed step size and four other step sizes on the FashionMNIST (a), CIFAR10 (b), and CIFAR100 (c) datasets
Fig.4  Comparison of the new proposed step size and five other step sizes on the CIFAR100 dataset using the DenseNet-BC model
Fig.5  Comparison of the new proposed step size and four other step sizes on the CIFAR100 dataset using the DenseNet-BC model
Methods | Training loss | Test accuracy
Constant step size | 0.0007±0.0003 | 0.9299±0.0016
O(1/√t) step size | 0.0006±0.0000 | 0.9311±0.0012
O(1/t) step size | 0.0011±0.0003 | 0.9275±0.0007
Adam | 0.0131±0.0017 | 0.9166±0.0019
SGD+Armijo | 6.73e−05±0.0000 | 0.9277±0.0012
ReduceLROnPlateau | 0.0229±0.0032 | 0.9299±0.0014
Stagewise - 1 Milestone | 0.0004±0.0000 | 0.9303±0.0007
Stagewise - 2 Milestones | 0.0015±0.0000 | 0.9298±0.0014
Cosine step size | 0.0004±1.1e−05 | 0.9284±0.0005
Ln step size | 0.0007±0.0000 | 0.9314±0.0018
Tab.3  Average final training loss and test accuracy on the FashionMNIST dataset. Each value is the mean over 5 runs from different random seeds; ± denotes the 95% confidence interval of the mean
Methods | Training loss | Test accuracy
Constant step size | 0.1870±0.0257 | 0.8776±0.0060
O(1/√t) step size | 0.0261±0.0021 | 0.8946±0.0020
O(1/t) step size | 0.2634±0.3493 | 0.8747±0.0077
Adam | 0.1140±0.0060 | 0.8839±0.0031
SGD+Armijo | 0.0463±0.0169 | 0.8834±0.0059
ReduceLROnPlateau | 0.0646±0.0041 | 0.9081±0.0019
Stagewise - 1 Milestone | 0.0219±0.0012 | 0.9111±0.0034
Stagewise - 2 Milestones | 0.0344±0.0062 | 0.9151±0.0039
Cosine step size | 0.0102±0.0007 | 0.9201±0.0017
Ln step size | 0.0089±0.0011 | 0.9012±0.0027
Tab.4  Average final training loss and test accuracy on the CIFAR10 dataset over 5 runs from different random seeds; ± denotes the estimated margin of error at 95% confidence
Methods | Training loss | Test accuracy
Constant step size | 1.19599±0.0183 | 0.5579±0.0039
O(1/√t) step size | 1.43381±0.0305 | 0.54436±0.0038
O(1/t) step size | 1.14579±0.0125 | 0.5624±0.0020
Adam | 1.53433±0.0180 | 0.5206±0.0053
SGD+Armijo | 1.3046±0.0476 | 0.5244±0.0053
ReduceLROnPlateau | 1.6000±0.055 | 0.5148±0.1929
Stagewise - 1 Milestone | 1.1889±0.0077 | 0.5706±0.0049
Stagewise - 2 Milestones | 1.09061±0.0040 | 0.5779±0.0017
Cosine step size | 1.12694±0.0081 | 0.5739±0.0025
Ln step size | 1.0805±0.0327 | 0.5869±0.0035
Tab.5  Average final training loss and test accuracy on the CIFAR100 dataset with the convolutional neural network (CNN) model, over 5 runs from different random seeds; ± denotes the estimated margin of error at 95% confidence
Methods | Training loss | Test accuracy
Constant step size | 1.0789±0.06336 | 0.6089±0.01485
O(1/√t) step size | 0.2709±0.02047 | 0.6885±0.0051
O(1/t) step size | 0.7584±0.04068 | 0.6411±0.00469
Adam | 0.6008±0.02396 | 0.6555±0.00430
SGD+Armijo | 0.1162±0.01175 | 0.6869±0.00460
ReduceLROnPlateau | 0.0863±0.00896 | 0.7440±0.00695
Stagewise - 1 Milestone | 0.15300±0.00382 | 0.7456±0.00188
Stagewise - 2 Milestones | 0.21032±0.01067 | 0.73102±0.00402
Cosine step size | 0.0622±0.00349 | 0.7568±0.00311
Ln step size | 0.06011±0.0018 | 0.7579±0.0022
Tab.6  Average final training loss and test accuracy on the CIFAR100 dataset with the DenseNet-BC model, over 5 runs from different random seeds; ± denotes the estimated margin of error at 95% confidence
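The "±" margins in Tab.3−Tab.6 are 95% confidence half-widths over five seeds; one common way to compute such a margin is a Student-t interval, sketched below (the paper's exact estimator may differ, and the per-seed accuracies in the example are made up).

```python
import numpy as np
from scipy import stats

def mean_and_margin(values, confidence=0.95):
    """Mean and Student-t confidence half-width for a small sample of per-seed results."""
    values = np.asarray(values, dtype=float)
    n = values.size
    sem = values.std(ddof=1) / np.sqrt(n)              # standard error of the mean
    t_crit = stats.t.ppf(0.5 + confidence / 2.0, df=n - 1)
    return values.mean(), t_crit * sem

# Hypothetical test accuracies from 5 random seeds (not the paper's raw data).
acc_runs = [0.585, 0.590, 0.586, 0.589, 0.584]
mean, margin = mean_and_margin(acc_runs)
print(f"{mean:.4f} ± {margin:.4f}")
```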
  
  
  
1 Robbins H, Monro S. A stochastic approximation method. The Annals of Mathematical Statistics, 1951, 22(3): 400–407
2 Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 2017, 60(6): 84–90
3 Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images. Toronto: University of Toronto, Department of Computer Science, 2009
4 Redmon J, Farhadi A. YOLO9000: better, faster, stronger. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 6517–6525
5 Zhang J, Zong C. Deep neural networks in machine translation: an overview. IEEE Intelligent Systems, 2015, 30(5): 16–25
6 Mishra P, Sarawadekar K. Polynomial learning rate policy with warm restart for deep neural network. In: Proceedings of 2019 IEEE Region 10 Conference (TENCON). 2019, 2087–2092
7 Vaswani S, Mishkin A, Laradji I, Schmidt M, Gidel G, Lacoste-Julien S. Painless stochastic gradient: interpolation, line-search, and convergence rates. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 335
8 Gower R M, Loizou N, Qian X, Sailanbayev A, Shulgin E, Richtárik P. SGD: general analysis and improved rates. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 5200–5209
9 Huang G, Liu Z, Van Der Maaten L, Weinberger K Q. Densely connected convolutional networks. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 2261–2269
10 Smith L N. Cyclical learning rates for training neural networks. In: Proceedings of 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). 2017, 464–472
11 Loshchilov I, Hutter F. SGDR: stochastic gradient descent with warm restarts. In: Proceedings of the 5th International Conference on Learning Representations. 2017
12 Vrbančič G, Podgorelec V. Efficient ensemble for image-based identification of pneumonia utilizing deep CNN and SGD with warm restarts. Expert Systems with Applications, 2022, 187: 115834
13 Xu G, Cao H, Dong Y, Yue C, Zou Y. Stochastic gradient descent with step cosine warm restarts for pathological lymph node image classification via PET/CT images. In: Proceedings of the 5th IEEE International Conference on Signal and Image Processing (ICSIP). 2020, 490–493
14 Nemirovski A, Juditsky A, Lan G, Shapiro A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009, 19(4): 1574–1609
15 Ghadimi S, Lan G H. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 2013, 23(4): 2341–2368
16 Bach F, Moulines E. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Proceedings of the 24th International Conference on Neural Information Processing Systems. 2011, 451–459
17 Rakhlin A, Shamir O, Sridharan K. Making gradient descent optimal for strongly convex stochastic optimization. In: Proceedings of the 29th International Conference on Machine Learning. 2012
18 Li X, Zhuang Z, Orabona F. A second look at exponential and cosine step sizes: simplicity, adaptivity, and performance. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 6553–6564
19 Ge R, Kakade S M, Kidambi R, Netrapalli P. The step decay schedule: a near optimal, geometrically decaying learning rate procedure for least squares. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 1341
20 Wang X, Magnússon S, Johansson M. On the convergence of step decay step-size for stochastic optimization. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021, 14226–14238
21 Nocedal J, Wright S J. Numerical Optimization. New York: Springer, 1999
22 He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 770–778
23 Kingma D P, Ba J. Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations. 2015
24 Aybat N S, Fallah A, Gurbuzbalaban M, Ozdaglar A. A universally optimal multistage accelerated stochastic gradient method. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 765
25 Thomas G B, Finney R L, Weir M D, Giordano F R. Thomas' Calculus, Early Transcendentals. 10th ed. Boston: Addison Wesley, 2002
Supplementary material: FCS-23245-OF-MSS_suppl_1