Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2024, Vol. 18 Issue (4) : 184315    https://doi.org/10.1007/s11704-023-2430-4
Artificial Intelligence
A robust optimization method for label noisy datasets based on adaptive threshold: Adaptive-k
Enes DEDEOGLU, Himmet Toprak KESGIN, Mehmet Fatih AMASYALI
Department of Computer Engineering, Yildiz Technical University, Istanbul 34220, Turkey
Abstract

Using all samples in the optimization process does not produce robust results on datasets with label noise, because the gradients calculated from the losses of noisy samples push the optimization in the wrong direction. In this paper, we recommend using only the samples whose loss is below a threshold determined during optimization, instead of using all samples in the mini-batch. Our proposed method, Adaptive-k, aims to exclude samples with noisy labels from the optimization process and thereby make it robust. On noisy datasets, we found that a threshold-based approach such as Adaptive-k produces better results than using all samples or a fixed number of low-loss samples in the mini-batch. Based on our theoretical analysis and experimental results, we show that Adaptive-k comes closest to the performance of the Oracle, in which noisy samples are entirely removed from the dataset. Adaptive-k is simple but effective: it does not require prior knowledge of the dataset's noise ratio, does not require additional model training, and does not increase training time significantly. In the experiments, we also show that Adaptive-k is compatible with different optimizers such as SGD, SGDM, and Adam. The code for Adaptive-k is available on GitHub.
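The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the example losses, the fallback for an empty selection, and the choice of the mini-batch mean as the threshold are all assumptions made for illustration. The page only states that the threshold is determined during optimization.

```python
from statistics import fmean


def select_low_loss(losses, threshold):
    """Return indices of samples whose loss is strictly below the threshold."""
    return [i for i, loss in enumerate(losses) if loss < threshold]


def thresholded_batch_loss(losses, threshold):
    """Average the loss over the selected samples only.

    Falls back to the full mini-batch when no sample is below the threshold
    (an assumption made here so the update is never empty).
    """
    kept = select_low_loss(losses, threshold)
    if not kept:
        kept = list(range(len(losses)))
    return fmean(losses[i] for i in kept), kept


# Hypothetical per-sample losses for one mini-batch; the large values stand
# in for samples with wrong labels.
batch_losses = [0.2, 0.1, 3.5, 0.4, 4.0, 0.3]

# Assumed threshold: the mean loss of the current mini-batch.
threshold = fmean(batch_losses)
loss, kept = thresholded_batch_loss(batch_losses, threshold)
```

In a real training loop, only the samples in `kept` would contribute to the gradient, while a fixed-k variant would always keep the same number of lowest-loss samples regardless of how the losses are distributed.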

Keywords: robust optimization, label noise, noisy label, deep learning, noisy datasets, noise ratio estimation, robust training
Corresponding Author(s): Himmet Toprak KESGIN   
Just Accepted Date: 04 April 2023   Issue Date: 05 June 2023
 Cite this article:   
Enes DEDEOGLU,Himmet Toprak KESGIN,Mehmet Fatih AMASYALI. A robust optimization method for label noisy datasets based on adaptive threshold: Adaptive-k[J]. Front. Comput. Sci., 2024, 18(4): 184315.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-2430-4
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I4/184315
  
Fig.1  Precision and recall values on MNIST for noise ratio τ=0.4 and k=0.6. The first 30 epochs are vanilla training, and the next 50 are adaptive training for Adaptive-k
  
Fig.2  Rate of clean samples and selected samples in MNIST for noise ratio τ=0.4. This figure shows the iterations in any given epoch during the training
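As a rough illustration of the quantities plotted in Fig.1 and Fig.2, the sketch below computes precision and recall of a sample selection against ground-truth clean flags. The page does not define these metrics, so the usual definitions are assumed here, with "clean" as the positive class. The function name and the example data are hypothetical.

```python
def selection_precision_recall(selected, is_clean):
    """Precision and recall of a selection, treating clean samples as positives.

    selected: indices of samples kept for the update.
    is_clean: list of booleans, True where the label is correct.
    """
    selected = set(selected)
    true_pos = sum(1 for i in selected if is_clean[i])
    total_clean = sum(is_clean)
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / total_clean if total_clean else 0.0
    return precision, recall


# Hypothetical mini-batch of 8 samples, 5 of them clean.
is_clean = [True, True, False, True, False, True, True, False]
selected = [0, 1, 2, 3]  # one noisy sample (index 2) slipped through

precision, recall = selection_precision_recall(selected, is_clean)
```

High precision means few noisy samples enter the update, while high recall means few clean samples are discarded.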
Notation Description
X1 Clean samples
X2 Noisy samples
μ1 Mean loss of clean samples
μ2 Mean loss of noisy samples
σ1 Standard deviation of losses for clean samples
σ2 Standard deviation of losses for noisy samples
N Normal distribution
f1(x) Probability density function of losses for clean samples
f2(x) Probability density function of losses for noisy samples
F1(x) Cumulative distribution function of losses for clean samples
F2(x) Cumulative distribution function of losses for noisy samples
fD(x) Probability density function of losses for all samples
FD(x) Cumulative distribution function of losses for all samples
μD Mean loss of all samples
σD Standard deviation of losses for all samples
fadk(x) Probability density function of Adaptive-k distribution
μadk Mean of Adaptive-k distribution
σadk Standard deviation of Adaptive-k distribution
MSEadk Mean squared error of Adaptive-k distribution
fMKL(x) Probability density function of MKL distribution
μMKL Mean of MKL distribution
σMKL Standard deviation of MKL distribution
MSEMKL Mean squared error of MKL distribution
Tab.1  Explanation of symbols used in equations
Fig.3  Mixture distribution for noisy dataset, where μ1=0, σ1=1, μ2=5, σ2=2, τ=0.4
Fig.4  MKL distribution’s pdf for noisy dataset, where μ1=0, σ1=1, μ2=5, σ2=2, τ=0.4
Fig.5  Comparison of MSE for SGD and MKL
Fig.6  Adaptive-k distribution’s pdf for noisy dataset, where μ1=0, σ1=1, μ2=5, σ2=2, τ=0.4
Fig.7  Comparison of MSE for Adaptive-k and MKL
Fig.8  Loss distributions of clean and noisy data throughout training for noise ratio τ=0.3
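The mixture parameters quoted in the captions of Fig.3, Fig.4, and Fig.6 (μ1=0, σ1=1, μ2=5, σ2=2, τ=0.4) can be worked through numerically. The sketch below assumes that the loss distribution of all samples is the two-component normal mixture fD(x) = (1−τ)·f1(x) + τ·f2(x), and that the mean loss μD is used as the selection threshold. The mixture form is an assumption inferred from the notation in Tab.1, not a formula given on this page.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2.0 * pi))


def normal_cdf(x, mu, sigma):
    """Cumulative distribution of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


# Parameters from the figure captions.
MU1, SIGMA1 = 0.0, 1.0   # clean-sample losses
MU2, SIGMA2 = 5.0, 2.0   # noisy-sample losses
TAU = 0.4                # noise ratio


def mixture_pdf(x):
    """Assumed density of losses over all samples (clean and noisy)."""
    return (1.0 - TAU) * normal_pdf(x, MU1, SIGMA1) + TAU * normal_pdf(x, MU2, SIGMA2)


# Mean loss of all samples under the assumed mixture.
mu_d = (1.0 - TAU) * MU1 + TAU * MU2

# Probability mass below the threshold mu_d, split by component.
clean_below = (1.0 - TAU) * normal_cdf(mu_d, MU1, SIGMA1)
noisy_below = TAU * normal_cdf(mu_d, MU2, SIGMA2)

selected_fraction = clean_below + noisy_below
clean_share = clean_below / selected_fraction
```

Under these assumptions μD is 2.0, about 61% of samples fall below it, and roughly 96% of those are clean, which gives a feel for why thresholding at the mean loss filters out most noisy samples when the two loss distributions are well separated.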
Dataset τ Oracle Vanilla MKL [7] Vanilla + MKL Adaptive-k Trimloss [16] Co-teaching [10]
MNIST 0.10 98.54 97.73 96.23 98.06 98.32 98.22 98.11
MNIST 0.20 98.33 96.59 97.20 97.97 98.14 98.00 97.58
MNIST 0.30 98.21 94.25 96.93 97.31 97.66 97.11 96.70
MNIST 0.40 98.08 89.19 94.88 96.18 95.72 92.10 95.62
FMNIST 0.10 85.91 85.06 82.01 84.89 85.39 85.41 85.37
FMNIST 0.20 85.78 82.93 82.17 84.31 84.30 84.76 84.19
FMNIST 0.30 85.96 79.92 81.01 83.18 83.36 81.76 81.93
FMNIST 0.40 85.21 72.99 78.09 79.73 80.46 72.46 77.72
Cifar 10 0.10 81.42 77.94 77.07 78.62 80.15 78.87 80.31
Cifar 10 0.20 80.78 75.39 77.57 79.09 79.13 73.28 78.99
Cifar 10 0.30 79.52 71.76 73.23 76.85 78.12 63.48 76.49
Cifar 10 0.40 78.56 63.28 61.68 69.75 75.72 50.26 73.81
IMDB 0.10 88.09 87.50 80.31 87.37 87.17 87.63 87.78
IMDB 0.20 87.72 86.18 80.89 85.50 86.31 84.13 86.02
IMDB 0.30 87.17 83.00 78.88 82.95 81.40 68.94 83.18
IMDB 0.40 87.29 68.11 64.60 66.01 66.35 54.23 67.59
HOTEL 0.10 61.05 58.41 52.55 57.96 58.73 60.11 59.72
HOTEL 0.20 60.03 56.40 52.86 56.73 57.16 54.49 55.47
HOTEL 0.30 59.68 52.59 53.73 54.34 53.57 49.95 54.70
HOTEL 0.40 58.38 51.44 50.22 51.45 51.65 38.26 51.72
SARCASM 0.10 82.38 81.21 76.19 80.17 79.98 80.99 80.82
SARCASM 0.20 81.93 79.74 77.85 79.71 79.00 78.07 79.78
SARCASM 0.30 81.55 78.18 77.19 78.31 77.49 72.78 77.50
SARCASM 0.40 81.47 74.26 71.90 73.77 72.84 60.38 72.30
20_NEWS 0.10 70.21 66.94 62.49 66.5 70.2 68.23 68.18
20_NEWS 0.20 69.25 63.99 62.21 67.66 68.67 62.5 66.68
20_NEWS 0.30 68.32 60.85 56.37 66.18 66.46 51.83 63.5
20_NEWS 0.40 67.3 52.58 45.21 56.42 59.55 44.65 59.62
Tab.2  Comparison of Adaptive-k and Vanilla+MKL in different datasets with Oracle, Vanilla, MKL, Trimloss, and co-teaching on average test set accuracy. For each dataset, the best accuracy is in green, the second best is in blue, and the third best is in red
Algorithm Noise ratio hyperparameter required Training multiple models required
Adaptive-k No No
MKL Yes No
Trimloss Yes No
Co-teaching Yes Yes
Tab.3  Cost comparison of algorithms
SGD
Dataset τ Oracle Vanilla MKL [7] Vanilla + MKL Adaptive-k Trimloss [16] Co-teaching [10]
MNIST 0.10 98.54 97.73 96.23 98.06 98.32 98.22 98.11
MNIST 0.20 98.33 96.59 97.20 97.97 98.14 98.00 97.58
MNIST 0.30 98.21 94.25 96.93 97.31 97.66 97.11 96.70
MNIST 0.40 98.08 89.19 94.88 96.18 95.72 92.10 95.62
Fashion MNIST 0.10 85.91 85.06 82.01 84.89 85.39 85.41 85.37
Fashion MNIST 0.20 85.78 82.93 82.17 84.31 84.30 84.76 84.19
Fashion MNIST 0.30 85.96 79.92 81.01 83.18 83.36 81.76 81.93
Fashion MNIST 0.40 85.21 72.99 78.09 79.73 80.46 72.46 77.72
Cifar 10 0.10 81.42 77.94 77.07 78.62 80.15 78.87 80.31
Cifar 10 0.20 80.78 75.39 77.57 79.09 79.13 73.28 78.99
Cifar 10 0.30 79.52 71.76 73.23 76.85 78.12 63.48 76.49
Cifar 10 0.40 78.56 63.28 61.68 69.75 75.72 50.26 73.81
SGDM
Dataset τ Oracle Vanilla MKL [7] Vanilla + MKL Adaptive-k Trimloss [16] Co-teaching [10]
MNIST 0.10 97.62 97.10 87.54 97.37 97.76 97.48 97.13
MNIST 0.20 97.60 95.73 88.75 97.28 97.69 97.19 96.32
MNIST 0.30 97.38 92.56 91.56 96.82 97.27 96.50 94.17
MNIST 0.40 97.30 86.07 86.66 94.38 96.51 96.13 86.12
Fashion MNIST 0.10 85.10 82.78 76.73 83.30 83.59 84.45 83.05
Fashion MNIST 0.20 83.76 80.92 76.06 82.88 82.84 82.34 81.40
Fashion MNIST 0.30 84.75 78.99 76.26 81.23 81.65 77.70 77.87
Fashion MNIST 0.40 83.89 69.58 68.57 77.05 77.81 68.22 70.29
Cifar 10 0.10 81.67 77.95 75.68 78.93 79.95 79.82 80.4
Cifar 10 0.20 80.36 75.84 75.88 79.20 79.12 70.86 78.75
Cifar 10 0.30 79.66 71.89 72.06 76.07 77.89 59.28 76.28
Cifar 10 0.40 78.89 64.80 57.80 69.80 68.29 49.16 71.7
ADAM
Dataset τ Oracle Vanilla MKL [7] Vanilla + MKL Adaptive-k Trimloss [16] Co-teaching [10]
MNIST 0.10 98.46 97.62 97.12 97.66 98.23 98.33 97.86
MNIST 0.20 98.37 96.58 97.34 97.51 97.92 97.66 97.28
MNIST 0.30 98.27 93.98 96.51 97.16 97.38 97.34 96.76
MNIST 0.40 98.13 84.47 93.63 94.65 96.39 94.81 94.25
Fashion MNIST 0.10 86.04 84.89 83.55 85.26 85.84 85.48 85.76
Fashion MNIST 0.20 85.89 83.09 83.75 84.49 84.83 84.70 84.32
Fashion MNIST 0.30 85.81 80.04 81.63 82.22 82.84 79.18 81.86
Fashion MNIST 0.40 85.42 72.41 74.99 79.01 79.49 77.19 78.37
Cifar 10 0.10 80.83 78.80 77.42 79.19 80.55 79.14 80.62
Cifar 10 0.20 80.16 76.58 74.75 78.58 79.07 71.51 79.44
Cifar 10 0.30 79.33 73.03 69.54 76.57 77.99 59.6 77.08
Cifar 10 0.40 78.36 65.24 58.55 71.68 74.93 49.63 73.04
Tab.4  This table consists of three parts and shows the results for SGD, SGDM and ADAM optimizers respectively
Fig.9  Mean losses of clean and noisy samples in the MNIST dataset for τ = 0.3, and the threshold value determined by Adaptive-k to separate these samples. The blue line shows the mean loss of noisy samples, the red line shows the mean loss of clean samples, and the green dashed line shows the calculated and used threshold value (μD in Adaptive-k Algorithm 2). The shaded area shows the range mean ± 1.5 * standard deviation. The first 30 epochs are vanilla training, and the next 50 are adaptive training for Adaptive-k
Fig.10  Adaptive training phase only, on the MNIST dataset: the first 30 epochs are vanilla training, followed by 50 epochs of Adaptive-k training. Solid lines indicate the ratio of samples identified as clean by Adaptive-k, while dashed lines indicate the actual clean sample ratios
  
  
  
1 Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 2021, 64(3): 107–115
2 Liao S, Jiang X, Ge Z. Weakly supervised multilayer perceptron for industrial fault classification with inaccurate and incomplete labels. IEEE Transactions on Automation Science and Engineering, 2022, 19(2): 1192–1201
3 Ortego D, Arazo E, Albert P, O’Connor N E, McGuinness K. Towards robust learning with different label noise distributions. In: Proceedings of the 25th International Conference on Pattern Recognition (ICPR). 2021, 7020–7027
4 Arazo E, Ortego D, Albert P, O’Connor N, McGuinness K. Unsupervised label noise modeling and loss correction. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 312–321
5 Nishi K, Ding Y, Rich A, Höllerer T. Augmentation strategies for learning with noisy labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 8018–8027
6 Majidi N, Amid E, Talebi H, Warmuth M K. Exponentiated gradient reweighting for robust training under label noise and beyond. 2021, arXiv preprint arXiv: 2104.01493
7 Shah V, Wu X, Sanghavi S. Choosing the sample with lowest loss makes SGD robust. In: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics. 2020, 2120–2130
8 Bengio Y, Louradour J, Collobert R, Weston J. Curriculum learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. 2009, 41–48
9 Kesgin H T, Amasyali M F. Cyclical curriculum learning. 2022, arXiv preprint arXiv: 2202.05531
10 Han B, Yao Q, Yu X, Niu G, Xu M, Hu W, Tsang I W, Sugiyama M. Co-teaching: robust training of deep neural networks with extremely noisy labels. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. 2018, 8536–8546
11 Shi X, Che W. Combating with extremely noisy samples in weakly supervised slot filling for automatic diagnosis. Frontiers of Computer Science, 2023, 17(5): 175333
12 Yang H, Jin Y, Li Z, Wang D B, Miao L, Geng X, Zhang M L. Learning from noisy labels via dynamic loss thresholding. 2021, arXiv preprint arXiv: 2104.02570
13 Wei Y, Xue M, Liu X, Xu P. Data fusing and joint training for learning with noisy labels. Frontiers of Computer Science, 2022, 16(6): 166338
14 Yao Q, Yang H, Han B, Niu G, Kwok J T. Searching to exploit memorization effect in learning with noisy labels. In: Proceedings of the 37th International Conference on Machine Learning. 2020, 1000
15 Chi Y, Li Y, Zhang H, Liang Y. Median-truncated gradient descent: a robust and scalable nonconvex approach for signal estimation. In: Proceedings of the 3rd International MATHEON Conference on Compressed Sensing and Its Applications. 2019, 237–261
16 Shen Y, Sanghavi S. Learning with bad training data via iterative trimmed loss minimization. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 5739–5748
17 Nakamura K, Hong B W. Regularization in neural network optimization via trimmed stochastic gradient descent with noisy label. 2020, arXiv preprint arXiv: 2012.11073
18 Kingma D P, Ba J. Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations. 2015
19 Deng L. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 2012, 29(6): 141–142
20 Xiao H, Rasul K, Vollgraf R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. 2017, arXiv preprint arXiv: 1708.07747
21 Krizhevsky A. Learning multiple layers of features from tiny images. Technical Report, 2009
22 He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks. In: Proceedings of the 14th European Conference on Computer Vision. 2016, 630–645
23 Maas A L, Daly R E, Pham P T, Huang D, Ng A Y, Potts C. Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2011, 142–150
24 comet-examples/comet-keras-cnn-lstm-example.py at master • comet-ml/comet-examples • github. See qwone.com/~jason/20Newsgroups website, 2021
25 Misra R, Arora P. Sarcasm detection using hybrid neural network. 2019, arXiv preprint arXiv: 1908.07414
26 kaggle. Sarcasm detection: a guide for ML and DL approach. See kaggle.com/subbhashit/sarcasm-detection-a-guide-for-ml-and-dl-approach website. 2021
27 Alam M H, Ryu W J, Lee S. Joint multi-grain topic sentiment: modeling semantic aspects for online reviews. Information Sciences, 2016, 339: 206–223
28 kaggle. Hotel reviews sentiment prediction. See kaggle.com/code/shahraizanwar/hotel-reviews-sentiment-prediction/notebook website. 2021
29 Home page for 20 newsgroups data set. See qwone.com/~jason/20Newsgroups website, 2014
30 Team K. Using pre-trained word embeddings. See keras.io/examples/nlp/pretrained_word_embeddings website, 2021