A robust optimization method for label noisy datasets based on adaptive threshold: Adaptive-k

doi:10.1007/s11704-023-2430-4

Front. Comput. Sci., 2024, Vol. 18, Issue (4): 184315, https://doi.org/10.1007/s11704-023-2430-4

Artificial Intelligence


Enes DEDEOGLU, Himmet Toprak KESGIN, Mehmet Fatih AMASYALI

Department of Computer Engineering, Yildiz Technical University, Istanbul 34220, Turkey


Abstract

Using all samples in the optimization process does not produce robust results on datasets with label noise, because the gradients computed from the losses of noisy samples push the optimization in the wrong direction. In this paper, we recommend using only the samples whose loss is below a threshold determined during optimization, instead of using all samples in the mini-batch. Our proposed method, Adaptive-k, aims to exclude samples with noisy labels from the optimization process and thereby make it robust. On noisy datasets, we found that a threshold-based approach such as Adaptive-k produces better results than using all samples or a fixed number of low-loss samples in the mini-batch. Based on our theoretical analysis and experimental results, we show that Adaptive-k comes closest to the performance of the Oracle, in which noisy samples are entirely removed from the dataset. Adaptive-k is a simple but effective method: it does not require prior knowledge of the noise ratio of the dataset, does not require additional model training, and does not increase training time significantly. In the experiments, we also show that Adaptive-k is compatible with different optimizers such as SGD, SGDM, and Adam. The code for Adaptive-k is available on GitHub.
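The selection idea in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released code: the logistic model, the `RunningMean` tracker, the warm-up length, and the choice of the running mean loss as the threshold are all assumptions made here (the Fig.9 caption suggests the threshold is the mean loss $μ_D$, and the figure captions mention a vanilla warm-up phase).

```python
import numpy as np


def per_sample_loss(w, X, y):
    """Binary cross-entropy of a logistic model, one value per sample."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    p = np.clip(p, 1e-12, 1.0 - 1e-12)  # avoid log(0)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


class RunningMean:
    """Running mean of all loss values seen so far (assumed threshold rule)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, values):
        values = np.asarray(values, dtype=float)
        self.total += float(values.sum())
        self.count += values.size

    @property
    def value(self):
        return self.total / self.count if self.count else float("inf")


def threshold_step(w, X, y, threshold, lr=0.1):
    """Gradient step on the mini-batch samples whose loss is below threshold."""
    losses = per_sample_loss(w, X, y)
    mask = losses < threshold
    if not mask.any():  # nothing selected: skip the update
        return w, mask
    p = 1.0 / (1.0 + np.exp(-(X[mask] @ w)))
    grad = X[mask].T @ (p - y[mask]) / mask.sum()
    return w - lr * grad, mask


# Toy usage: 2-D data with 30% of the labels flipped, mini-batches of 32.
rng = np.random.default_rng(0)
X = rng.normal(size=(512, 2))
y_true = (X[:, 0] + X[:, 1] > 0).astype(float)
flip = rng.random(512) < 0.3
y = np.where(flip, 1.0 - y_true, y_true)

w = np.zeros(2)
tracker = RunningMean()
warmup_epochs = 5  # hypothetical; all samples are used during warm-up
for epoch in range(20):
    for start in range(0, len(X), 32):
        xb, yb = X[start:start + 32], y[start:start + 32]
        tracker.update(per_sample_loss(w, xb, yb))
        threshold = np.inf if epoch < warmup_epochs else tracker.value
        w, _ = threshold_step(w, xb, yb, threshold)
```

The warm-up matters in this sketch: with zero-initialized weights every sample has the same loss, so a strict "below the mean" rule would select nothing and training would never start.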

Keywords: robust optimization, label noise, noisy label, deep learning, noisy datasets, noise ratio estimation, robust training

Corresponding Author(s): Himmet Toprak KESGIN

Just Accepted Date: 04 April 2023 Issue Date: 05 June 2023

Cite this article:

Enes DEDEOGLU, Himmet Toprak KESGIN, Mehmet Fatih AMASYALI. A robust optimization method for label noisy datasets based on adaptive threshold: Adaptive-k[J]. Front. Comput. Sci., 2024, 18(4): 184315.

URL:

https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-2430-4
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I4/184315

Fig.1 Precision and recall values in MNIST for noise ratio $τ = 0.4$ and $k = 0.6$. The first 30 epochs are vanilla training, and the next 50 are adaptive training for Adaptive-k
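Precision and recall here presumably measure how well the selected samples match the truly clean ones. A minimal sketch under that assumption follows (precision: share of selected samples that are clean; recall: share of clean samples that are selected). The function name is hypothetical and the definitions are not taken from the paper.

```python
import numpy as np


def selection_precision_recall(selected, is_clean):
    """Precision and recall of a sample selection against ground-truth clean flags."""
    selected = np.asarray(selected, dtype=bool)
    is_clean = np.asarray(is_clean, dtype=bool)
    true_pos = np.sum(selected & is_clean)  # selected and actually clean
    precision = true_pos / selected.sum() if selected.any() else 0.0
    recall = true_pos / is_clean.sum() if is_clean.any() else 0.0
    return float(precision), float(recall)


# Five samples: three selected, three clean, two in both sets.
precision, recall = selection_precision_recall(
    [True, True, True, False, False],
    [True, True, False, True, False],
)
```

High precision means few noisy samples leak into the update; high recall means few clean samples are discarded.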

Fig.2 Rate of clean samples and selected samples in MNIST for noise ratio $τ = 0.4$. This figure shows the iterations in any given epoch during the training

Notation	Description
$X_1$	Clean samples
$X_2$	Noisy samples
$μ_1$	Mean loss of clean samples
$μ_2$	Mean loss of noisy samples
$σ_1$	Standard deviation of losses for clean samples
$σ_2$	Standard deviation of losses for noisy samples
$N$	Normal distribution
$f_1(x)$	Probability density function of losses for clean samples
$f_2(x)$	Probability density function of losses for noisy samples
$F_1(x)$	Cumulative distribution function of losses for clean samples
$F_2(x)$	Cumulative distribution function of losses for noisy samples
$f_D(x)$	Probability density function of losses for all samples
$F_D(x)$	Cumulative distribution function of losses for all samples
$μ_D$	Mean loss of all samples
$σ_D$	Standard deviation of losses for all samples
$f_{adk}(x)$	Probability density function of Adaptive-k distribution
$μ_{adk}$	Mean of Adaptive-k distribution
$σ_{adk}$	Standard deviation of Adaptive-k distribution
$MSE_{adk}$	Mean squared error of Adaptive-k distribution
$f_{MKL}(x)$	Probability density function of MKL distribution
$μ_{MKL}$	Mean of MKL distribution
$σ_{MKL}$	Standard deviation of MKL distribution
$MSE_{MKL}$	Mean squared error of MKL distribution

Tab.1 Explanation of symbols used in equations

Fig.3 Mixture distribution for noisy dataset, where $μ_1 = 0$, $σ_1 = 1$, $μ_2 = 5$, $σ_2 = 2$, $τ = 0.4$
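As a worked check of the Fig.3 parameters in the notation of Tab.1, suppose the losses of clean and noisy samples are normal and are mixed with weights $1 - τ$ and $τ$. That weighting is an assumption made here for illustration; this page does not reproduce the paper's equations. Under it, the mixture density and its first two moments would be:

```latex
\begin{align*}
f_D(x) &= (1-\tau)\, f_1(x) + \tau\, f_2(x),
  \qquad f_i \text{ the pdf of } N(\mu_i, \sigma_i^2) \\
\mu_D &= (1-\tau)\,\mu_1 + \tau\,\mu_2
       = 0.6 \cdot 0 + 0.4 \cdot 5 = 2 \\
\sigma_D^2 &= (1-\tau)\,(\sigma_1^2 + \mu_1^2) + \tau\,(\sigma_2^2 + \mu_2^2) - \mu_D^2
            = 0.6 \cdot 1 + 0.4 \cdot 29 - 4 = 8.2 \\
\sigma_D &= \sqrt{8.2} \approx 2.86
\end{align*}
```

The mean $μ_D = 2$ sits between the clean mode at 0 and the noisy mode at 5, which is what makes it a candidate separating value for these parameters.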

Fig.4 MKL distribution’s pdf for noisy dataset, where $μ_1 = 0$, $σ_1 = 1$, $μ_2 = 5$, $σ_2 = 2$, $τ = 0.4$

Fig.5 Comparison of MSE for SGD and MKL

Fig.6 Adaptive-k distribution’s pdf for noisy dataset, where $μ_1 = 0$, $σ_1 = 1$, $μ_2 = 5$, $σ_2 = 2$, $τ = 0.4$
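To get a feel for what thresholding does to the mixture with these parameters, the sketch below estimates how much of the clean and noisy mass falls below a threshold. It assumes normal loss distributions mixed with weights $1 - τ$ and $τ$, and it takes the threshold to be the mixture mean $μ_D$, following the Fig.9 caption. The helper names are hypothetical and this is not the authors' code.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def kept_fractions(threshold, mu1, sigma1, mu2, sigma2, tau):
    """Share of the whole dataset that is clean (or noisy) and below threshold."""
    clean_kept = (1.0 - tau) * normal_cdf(threshold, mu1, sigma1)
    noisy_kept = tau * normal_cdf(threshold, mu2, sigma2)
    return clean_kept, noisy_kept


# Parameters from the figure caption.
mu1, sigma1, mu2, sigma2, tau = 0.0, 1.0, 5.0, 2.0, 0.4
mu_d = (1.0 - tau) * mu1 + tau * mu2  # assumed threshold: mixture mean

clean_kept, noisy_kept = kept_fractions(mu_d, mu1, sigma1, mu2, sigma2, tau)
purity = clean_kept / (clean_kept + noisy_kept)  # clean share of selected samples
clean_recall = clean_kept / (1.0 - tau)          # share of clean samples retained
```

With these parameters about 98% of the clean mass and about 7% of the noisy mass lie below the threshold, so roughly 96% of the selected samples are clean under this idealized model.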

Fig.7 Comparison of MSE for Adaptive-k and MKL

Fig.8 Loss distributions of clean and noisy data throughout training for noise ratio $τ = 0.3$

Dataset	$τ$	Oracle	Vanilla	MKL [7]	Vanilla + MKL	Adaptive-k	Trimloss [16]	Co-teaching [10]
MNIST	0.10	98.54	97.73	96.23	98.06	98.32	98.22	98.11
MNIST	0.20	98.33	96.59	97.20	97.97	98.14	98.00	97.58
MNIST	0.30	98.21	94.25	96.93	97.31	97.66	97.11	96.70
MNIST	0.40	98.08	89.19	94.88	96.18	95.72	92.10	95.62
FMNIST	0.10	85.91	85.06	82.01	84.89	85.39	85.41	85.37
FMNIST	0.20	85.78	82.93	82.17	84.31	84.30	84.76	84.19
FMNIST	0.30	85.96	79.92	81.01	83.18	83.36	81.76	81.93
FMNIST	0.40	85.21	72.99	78.09	79.73	80.46	72.46	77.72
Cifar 10	0.10	81.42	77.94	77.07	78.62	80.15	78.87	80.31
Cifar 10	0.20	80.78	75.39	77.57	79.09	79.13	73.28	78.99
Cifar 10	0.30	79.52	71.76	73.23	76.85	78.12	63.48	76.49
Cifar 10	0.40	78.56	63.28	61.68	69.75	75.72	50.26	73.81
IMDB	0.10	88.09	87.50	80.31	87.37	87.17	87.63	87.78
IMDB	0.20	87.72	86.18	80.89	85.50	86.31	84.13	86.02
IMDB	0.30	87.17	83.00	78.88	82.95	81.40	68.94	83.18
IMDB	0.40	87.29	68.11	64.60	66.01	66.35	54.23	67.59
HOTEL	0.10	61.05	58.41	52.55	57.96	58.73	60.11	59.72
HOTEL	0.20	60.03	56.40	52.86	56.73	57.16	54.49	55.47
HOTEL	0.30	59.68	52.59	53.73	54.34	53.57	49.95	54.70
HOTEL	0.40	58.38	51.44	50.22	51.45	51.65	38.26	51.72
SARCASM	0.10	82.38	81.21	76.19	80.17	79.98	80.99	80.82
SARCASM	0.20	81.93	79.74	77.85	79.71	79.00	78.07	79.78
SARCASM	0.30	81.55	78.18	77.19	78.31	77.49	72.78	77.50
SARCASM	0.40	81.47	74.26	71.90	73.77	72.84	60.38	72.30
20_NEWS	0.10	70.21	66.94	62.49	66.50	70.20	68.23	68.18
20_NEWS	0.20	69.25	63.99	62.21	67.66	68.67	62.50	66.68
20_NEWS	0.30	68.32	60.85	56.37	66.18	66.46	51.83	63.50
20_NEWS	0.40	67.30	52.58	45.21	56.42	59.55	44.65	59.62

Tab.2 Comparison of Adaptive-k and Vanilla + MKL with Oracle, Vanilla, MKL, Trimloss, and Co-teaching on average test set accuracy across different datasets. For each dataset, the best accuracy is in green, the second best is in blue, and the third best is in red
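One way to read the table is as a gap to the Oracle. The short sketch below computes that gap for the MNIST rows only, using the accuracies exactly as listed above; the dictionary layout and variable names are hypothetical.

```python
# MNIST rows from the table: accuracy (%) per noise ratio.
mnist = {
    0.10: {"Oracle": 98.54, "Vanilla": 97.73, "Adaptive-k": 98.32},
    0.20: {"Oracle": 98.33, "Vanilla": 96.59, "Adaptive-k": 98.14},
    0.30: {"Oracle": 98.21, "Vanilla": 94.25, "Adaptive-k": 97.66},
    0.40: {"Oracle": 98.08, "Vanilla": 89.19, "Adaptive-k": 95.72},
}

# Accuracy lost relative to the Oracle, in percentage points.
gaps = {
    tau: {
        method: round(row["Oracle"] - row[method], 2)
        for method in ("Vanilla", "Adaptive-k")
    }
    for tau, row in mnist.items()
}

for tau, gap in gaps.items():
    print(f"tau={tau:.2f}  Vanilla: {gap['Vanilla']:.2f}  Adaptive-k: {gap['Adaptive-k']:.2f}")
```

On these rows the Vanilla gap grows from 0.81 to 8.89 points as the noise ratio rises from 0.10 to 0.40, while the Adaptive-k gap stays between 0.19 and 2.36 points.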

Tab.3 Cost comparison of algorithms

SGD
Dataset	$τ$	Oracle	Vanilla	MKL [7]	Vanilla + MKL	Adaptive-k	Trimloss [16]	Co-teaching [10]
MNIST	0.10	98.54	97.73	96.23	98.06	98.32	98.22	98.11
MNIST	0.20	98.33	96.59	97.20	97.97	98.14	98.00	97.58
MNIST	0.30	98.21	94.25	96.93	97.31	97.66	97.11	96.70
MNIST	0.40	98.08	89.19	94.88	96.18	95.72	92.10	95.62
Fashion MNIST	0.10	85.91	85.06	82.01	84.89	85.39	85.41	85.37
Fashion MNIST	0.20	85.78	82.93	82.17	84.31	84.30	84.76	84.19
Fashion MNIST	0.30	85.96	79.92	81.01	83.18	83.36	81.76	81.93
Fashion MNIST	0.40	85.21	72.99	78.09	79.73	80.46	72.46	77.72
Cifar 10	0.10	81.42	77.94	77.07	78.62	80.15	78.87	80.31
Cifar 10	0.20	80.78	75.39	77.57	79.09	79.13	73.28	78.99
Cifar 10	0.30	79.52	71.76	73.23	76.85	78.12	63.48	76.49
Cifar 10	0.40	78.56	63.28	61.68	69.75	75.72	50.26	73.81
SGDM
Dataset	$τ$	Oracle	Vanilla	MKL [7]	Vanilla + MKL	Adaptive-k	Trimloss [16]	Co-teaching [10]
MNIST	0.10	97.62	97.10	87.54	97.37	97.76	97.48	97.13
MNIST	0.20	97.60	95.73	88.75	97.28	97.69	97.19	96.32
MNIST	0.30	97.38	92.56	91.56	96.82	97.27	96.50	94.17
MNIST	0.40	97.30	86.07	86.66	94.38	96.51	96.13	86.12
Fashion MNIST	0.10	85.10	82.78	76.73	83.30	83.59	84.45	83.05
Fashion MNIST	0.20	83.76	80.92	76.06	82.88	82.84	82.34	81.40
Fashion MNIST	0.30	84.75	78.99	76.26	81.23	81.65	77.70	77.87
Fashion MNIST	0.40	83.89	69.58	68.57	77.05	77.81	68.22	70.29
Cifar 10	0.10	81.67	77.95	75.68	78.93	79.95	79.82	80.40
Cifar 10	0.20	80.36	75.84	75.88	79.20	79.12	70.86	78.75
Cifar 10	0.30	79.66	71.89	72.06	76.07	77.89	59.28	76.28
Cifar 10	0.40	78.89	64.80	57.80	69.80	68.29	49.16	71.70
ADAM
Dataset	$τ$	Oracle	Vanilla	MKL [7]	Vanilla + MKL	Adaptive-k	Trimloss [16]	Co-teaching [10]
MNIST	0.10	98.46	97.62	97.12	97.66	98.23	98.33	97.86
MNIST	0.20	98.37	96.58	97.34	97.51	97.92	97.66	97.28
MNIST	0.30	98.27	93.98	96.51	97.16	97.38	97.34	96.76
MNIST	0.40	98.13	84.47	93.63	94.65	96.39	94.81	94.25
Fashion MNIST	0.10	86.04	84.89	83.55	85.26	85.84	85.48	85.76
Fashion MNIST	0.20	85.89	83.09	83.75	84.49	84.83	84.70	84.32
Fashion MNIST	0.30	85.81	80.04	81.63	82.22	82.84	79.18	81.86
Fashion MNIST	0.40	85.42	72.41	74.99	79.01	79.49	77.19	78.37
Cifar 10	0.10	80.83	78.80	77.42	79.19	80.55	79.14	80.62
Cifar 10	0.20	80.16	76.58	74.75	78.58	79.07	71.51	79.44
Cifar 10	0.30	79.33	73.03	69.54	76.57	77.99	59.60	77.08
Cifar 10	0.40	78.36	65.24	58.55	71.68	74.93	49.63	73.04

Tab.4 Results for the SGD, SGDM, and ADAM optimizers, shown in three parts in that order

Fig.9 Mean losses of clean and noisy samples in the MNIST dataset for $τ = 0.3$, and the threshold value determined by Adaptive-k to separate these samples. The blue line shows the mean loss of noisy samples, the red line shows the mean loss of clean samples, and the green dashed line shows the calculated and used threshold value ($μ_D$ in Adaptive-k Algorithm 2). The shaded areas show the range mean ± 1.5 × standard deviation. The first 30 epochs are vanilla training, and the next 50 are adaptive training for Adaptive-k

Fig.10 Adaptive training only, for the MNIST dataset. After the first 30 epochs of vanilla training, Adaptive-k training runs for 50 epochs. Solid lines indicate the ratios of samples identified as clean by Adaptive-k, while dashed lines indicate the actual clean sample ratios

1	Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 2021, 64(3): 107–115
2	Liao S, Jiang X, Ge Z. Weakly supervised multilayer perceptron for industrial fault classification with inaccurate and incomplete labels. IEEE Transactions on Automation Science and Engineering, 2022, 19(2): 1192–1201
3	Ortego D, Arazo E, Albert P, O’Connor N E, McGuinness K. Towards robust learning with different label noise distributions. In: Proceedings of the 25th International Conference on Pattern Recognition (ICPR). 2021, 7020–7027
4	Arazo E, Ortego D, Albert P, O’Connor N, McGuinness K. Unsupervised label noise modeling and loss correction. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 312–321
5	Nishi K, Ding Y, Rich A, Höllerer T. Augmentation strategies for learning with noisy labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 8018–8027
6	Majidi N, Amid E, Talebi H, Warmuth M K. Exponentiated gradient reweighting for robust training under label noise and beyond. 2021, arXiv preprint arXiv:2104.01493
7	Shah V, Wu X, Sanghavi S. Choosing the sample with lowest loss makes SGD robust. In: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics. 2020, 2120–2130
8	Bengio Y, Louradour J, Collobert R, Weston J. Curriculum learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. 2009, 41–48
9	Kesgin H T, Amasyali M F. Cyclical curriculum learning. 2022, arXiv preprint arXiv:2202.05531
10	Han B, Yao Q, Yu X, Niu G, Xu M, Hu W, Tsang I W, Sugiyama M. Co-teaching: robust training of deep neural networks with extremely noisy labels. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. 2018, 8536–8546
11	Shi X, Che W. Combating with extremely noisy samples in weakly supervised slot filling for automatic diagnosis. Frontiers of Computer Science, 2023, 17(5): 175333
12	Yang H, Jin Y, Li Z, Wang D B, Miao L, Geng X, Zhang M L. Learning from noisy labels via dynamic loss thresholding. 2021, arXiv preprint arXiv:2104.02570
13	Wei Y, Xue M, Liu X, Xu P. Data fusing and joint training for learning with noisy labels. Frontiers of Computer Science, 2022, 16(6): 166338
14	Yao Q, Yang H, Han B, Niu G, Kwok J T. Searching to exploit memorization effect in learning with noisy labels. In: Proceedings of the 37th International Conference on Machine Learning. 2020, 1000
15	Chi Y, Li Y, Zhang H, Liang Y. Median-truncated gradient descent: a robust and scalable nonconvex approach for signal estimation. In: Proceedings of the 3rd International MATHEON Conference on Compressed Sensing and Its Applications. 2019, 237–261
16	Shen Y, Sanghavi S. Learning with bad training data via iterative trimmed loss minimization. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 5739–5748
17	Nakamura K, Hong B W. Regularization in neural network optimization via trimmed stochastic gradient descent with noisy label. 2020, arXiv preprint arXiv:2012.11073
18	Kingma D P, Ba J. Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations. 2015
19	Deng L. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 2012, 29(6): 141–142
20	Xiao H, Rasul K, Vollgraf R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. 2017, arXiv preprint arXiv:1708.07747
21	Krizhevsky A. Learning multiple layers of features from tiny images. Technical Report, 2009
22	He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks. In: Proceedings of the 14th European Conference on Computer Vision. 2016, 630–645
23	Maas A L, Daly R E, Pham P T, Huang D, Ng A Y, Potts C. Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2011, 142–150
24	comet-examples/comet-keras-cnn-lstm-example.py at master • comet-ml/comet-examples • github. See qwone.com/~jason/20Newsgroups website, 2021
25	Misra R, Arora P. Sarcasm detection using hybrid neural network. 2019, arXiv preprint arXiv:1908.07414
26	kaggle. Sarcasm detection: a guide for ML and DL approach. See kaggle.com/subbhashit/sarcasm-detection-a-guide-for-ml-and-dl-approach website. 2021
27	M H, Alam W J, Ryu S Lee . Joint multi-grain topic sentiment: modeling semantic aspects for online reviews. Information Sciences, 2016, 339: 206–223
28	kaggle. Hotel reviews sentiment prediction. See kaggle.com/code/shahraizanwar/hotel-reviews-sentiment-prediction/notebook website. 2021
29	Home page for 20 newsgroups data set. See qwone.com/~jason/20Newsgroups website, 2014
30	Team K. Using pre-trained word embeddings. See keras.io/examples/nlp/pretrained_word_embeddings website, 2021


