Frontiers of Computer Science

Front. Comput. Sci.    2022, Vol. 16 Issue (2) : 162601
Cancer classification with data augmentation based on generative adversarial networks
Kaimin WEI1,2, Tianqi LI1,2, Feiran HUANG1,2, Jinpeng CHEN3, Zefan HE1,2
1. College of Information Science and Technology, Jinan University, Guangzhou 510632, China
2. Guangdong Key Laboratory of Data Security and Privacy Protection, Guangzhou 510632, China
3. School of Software Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
Accurate diagnosis is a significant step in cancer treatment. Machine learning can support doctors in prognosis decision-making, but its performance is often weakened by the high dimensionality and small quantity of genetic data. Deep learning can process high-dimensional data effectively; however, the problem of inadequate data remains unsolved and lowers its performance. To address this, we propose a generative adversarial model that uses non-target cancer data to help train the generator for the target cancer. We use a reconstruction loss to further stabilize model training and improve the quality of generated samples. We also present a cancer classification model to optimize classification performance. Experimental results show that the mean absolute error of the cancer gene expression data generated by our model is 19.3% lower than that of DCGAN, and that classification accuracy on our generated data is higher than on data created by GAN. The classification accuracy of our model reaches 92.6%, which is 7.6 percentage points higher than that of the model without any generated data.

Keywords: data mining, cancer data analysis, deep learning, generative adversarial networks
Corresponding Author(s): Feiran HUANG   
Just Accepted Date: 27 April 2020   Online First Date: 31 August 2021    Issue Date: 03 September 2021
Kaimin WEI, Tianqi LI, Feiran HUANG, et al. Cancer classification with data augmentation based on generative adversarial networks[J]. Front. Comput. Sci., 2022, 16(2): 162601.
Fig.1  The structure of the generative adversarial model for cancer gene expression data
Fig.2  The structure of cancer classification model based on generated data
Fig.3  The difference between real samples and generated ones
Fig.4  An image and its corresponding label
Fig.5  A generated image sample with its label vector
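| Cancer name | # Normal samples | # Ill samples |
|---|---|---|
| Lung | 59 | 78 |
| Breast | 113 | 150 |
| Prostate | 52 | 71 |
| Colon | 41 | 57 |
| Gastric | 32 | 47 |
| Liver | 50 | 68 |
| Rectal | 10 | 22 |
| Esophageal | 11 | 23 |
| Thyroid | 58 | 77 |
| CCRCC | 72 | 91 |
| Uterine | 35 | 50 |
| HNSCC | 44 | 61 |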
Tab.1  The statistics of cancer datasets
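| Generative model | Normal sample MAE (mean) | Normal sample MAE (SD) | Ill sample MAE (mean) | Ill sample MAE (SD) |
|---|---|---|---|---|
| GAN | 0.238 | 0.008 | 0.225 | 0.014 |
| DCGAN | 0.344 | 0.011 | 0.276 | 0.008 |
| VAE | 0.323 | 0.021 | 0.238 | 0.004 |
| Gene-GAN | 0.151 | 0.002 | 0.103 | 0.003 |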
Tab.2  The MAE of different generative models
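The MAE values in Tab.2 compare generated data with real data. As a minimal sketch, assuming MAE is the average absolute element-wise difference between a real and a generated expression vector of the same length (how the authors pair samples is not stated on this page), the metric and the gap between DCGAN and Gene-GAN on normal samples can be worked through as follows. The function name `mean_absolute_error` and the toy vectors are illustrative assumptions; only the two table values are taken from Tab.2.

```python
import numpy as np

def mean_absolute_error(real, generated):
    """Average absolute element-wise difference between two equal-shaped arrays."""
    real = np.asarray(real, dtype=float)
    generated = np.asarray(generated, dtype=float)
    if real.shape != generated.shape:
        raise ValueError("real and generated must have the same shape")
    return float(np.mean(np.abs(real - generated)))

# Toy vectors (hypothetical values, not taken from the paper).
real = [0.2, 0.5, 0.9, 0.1]
generated = [0.1, 0.7, 0.8, 0.1]
toy_mae = mean_absolute_error(real, generated)

# Mean MAE on normal samples, copied from Tab.2.
mae_dcgan = 0.344
mae_gene_gan = 0.151

# Gap expressed two ways: as a raw difference and relative to DCGAN.
absolute_gap = round(mae_dcgan - mae_gene_gan, 3)
relative_gap = round((mae_dcgan - mae_gene_gan) / mae_dcgan, 3)
```

On the Tab.2 numbers the absolute gap is 0.193, which lines up with the 19.3 figure in the abstract if that figure is read as a difference in MAE; measured relative to DCGAN, the reduction would be about 56%. Which reading the authors intend is not stated here.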
Fig.6  The convergence of different generative models
| Generative model | Accuracy (mean) | Accuracy (SD) |
|---|---|---|
| GAN | 0.843 | 0.025 |
| DCGAN | 0.877 | 0.044 |
| VAE | 0.872 | 0.065 |
| Gene-GAN | 0.892 | 0.046 |
Tab.3  Classification results of different generative models
| Classifier | Accuracy (mean) | Accuracy (SD) |
|---|---|---|
| Decision tree | 0.608 | 0 |
| KNN (k=3) | 0.864 | 0 |
| Support vector machine | 0.84 | 0 |
| VGG | 0.781 | 0.144 |
| ResNet | 0.849 | 0.012 |
| Gene-GAN (non-amplified) | 0.85 | 0.027 |
| Gene-GAN (first fake then real) | 0.872 | 0.047 |
| Gene-GAN (mixed) | 0.892 | 0.046 |
Tab.4  Classification results for different classifiers
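Tab.4 lists two ways of using generated data, "first fake then real" and "mixed", without describing them on this page. As a rough sketch, assuming the former means presenting all generated samples before the real ones and the latter means shuffling both together, the two orderings could be built as below. The function name `build_training_order`, the strategy strings, and the toy arrays are all hypothetical.

```python
import numpy as np

def build_training_order(real_x, real_y, fake_x, fake_y, strategy, seed=0):
    """Arrange real and generated samples under one of two assumed strategies."""
    # Generated ("fake") samples are placed first, real samples after.
    x = np.concatenate([fake_x, real_x], axis=0)
    y = np.concatenate([fake_y, real_y], axis=0)
    if strategy == "first_fake_then_real":
        return x, y
    if strategy == "mixed":
        # Shuffle real and generated samples together, keeping labels aligned.
        order = np.random.default_rng(seed).permutation(len(x))
        return x[order], y[order]
    raise ValueError(f"unknown strategy: {strategy}")

# Toy data: 3 real and 5 generated samples with 4 features each.
real_x, real_y = np.ones((3, 4)), np.array([0, 1, 1])
fake_x, fake_y = np.zeros((5, 4)), np.array([1, 0, 1, 0, 1])

staged_x, staged_y = build_training_order(
    real_x, real_y, fake_x, fake_y, "first_fake_then_real")
mixed_x, mixed_y = build_training_order(
    real_x, real_y, fake_x, fake_y, "mixed")
```

Both strategies return the same samples; only their order differs, which matters when a model is trained in sequential mini-batches.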
Fig.7  The ROC curve of different classifiers
| Label smooth rate | Accuracy (mean) | Accuracy (SD) |
|---|---|---|
| 1 | 0.889 | 0.025 |
| 0.8 | 0.899 | 0.022 |
| 0.6 | 0.912 | 0.021 |
| 0.4 | 0.926 | 0.008 |
| 0.2 | 0.919 | 0.017 |
| 0 | 0.892 | 0.046 |
Tab.5  Accuracy results under different label smooth rate
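Tab.5 varies a label smooth rate, but this page does not show how the rate enters the training labels. As a rough sketch, assuming the common formulation in which a one-hot label y over K classes becomes (1 - rate) * y + rate / K, smoothing could look like the following. The function name `smooth_labels` and the two-class example are assumptions, and the authors' definition may differ (for instance, it may apply only to generated samples).

```python
import numpy as np

def smooth_labels(one_hot, rate):
    """Blend one-hot labels with a uniform distribution over the classes."""
    one_hot = np.asarray(one_hot, dtype=float)
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    num_classes = one_hot.shape[-1]
    # rate = 0 leaves labels unchanged; rate = 1 makes them fully uniform.
    return (1.0 - rate) * one_hot + rate / num_classes

# Two-class example (normal vs. ill) with the best-performing rate in Tab.5.
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
smoothed = smooth_labels(labels, rate=0.4)
unchanged = smooth_labels(labels, rate=0.0)
```

Under this assumed formulation each row still sums to 1, so the smoothed labels remain valid probability distributions for a cross-entropy loss.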
Fig.8  The ROC curve under different smooth rate
| # Amplified data (N) | Accuracy (mean) | Accuracy (SD) |
|---|---|---|
| 100 | 0.913 | 0.016 |
| 300 | 0.917 | 0.012 |
| 600 | 0.930 | 0.016 |
| 1000 | 0.926 | 0.008 |
| 1500 | 0.918 | 0.012 |
Tab.6  Results with different quantities of amplified data
Fig.9  The ROC curve with different quantities of amplified data
