Frontiers of Earth Science


Front. Earth Sci.    2024, Vol. 18 Issue (3) : 538-553    https://doi.org/10.1007/s11707-022-1079-x
A Grad-CAM and capsule network hybrid method for remote sensing image scene classification
Zhan HE 1,4, Chunju ZHANG 2, Shu WANG 3, Jianwei HUANG 4, Xiaoyun ZHENG 1, Weijie JIANG 4, Jiachen BO 4, Yucheng YANG 4
1. Shenzhen Data Management Center of Planning and Natural Resources, Key Laboratory of Urban Land Resources Monitoring and Simulation (Ministry of Natural Resources), Shenzhen 518000, China
2. Key Laboratory of Jianghuai Arable Land Resources Protection and Eco-restoration (Ministry of Natural Resources), Hefei 230088, China
3. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
4. School of Civil Engineering, Hefei University of Technology, Hefei 230009, China
Abstract

Remote sensing image scene classification is a hot topic in remote sensing applications. Although CNN-based models reach high average accuracy, some classes are still frequently misclassified, such as "freeway," "sparse residential," and "commercial area." These classes are characterized by decisive features, by spatial-relation features, or by a mixture of the two, and this mixture limits high-quality scene classification. To address this issue, this paper proposes a Grad-CAM and capsule network hybrid method for image scene classification. The Grad-CAM and capsule network structures can recognize decisive features and spatial-relation features, respectively. By combining a pre-trained model, a hybrid structure, and structure adjustment, the proposed model recognizes both decisive and spatial-relation features. A group of experiments is designed on three popular data sets of increasing classification difficulty. On the most challenging data set, NWPU-RESISC45, an average accuracy of 92.67% is achieved; specifically, accuracies of 83%, 75%, and 86% are obtained in the classes "church," "palace," and "commercial area," respectively. This research demonstrates that the hybrid structure effectively improves performance by considering both decisive and spatial-relation features. Grad-CAM-CapsNet is therefore a promising and powerful structure for remote sensing image scene classification.

Keywords: image scene classification; CNN; Grad-CAM; CapsNet; DenseNet
Corresponding Author(s): Chunju ZHANG, Shu WANG
Online First Date: 03 July 2024    Issue Date: 29 September 2024
 Cite this article:   
Zhan HE, Chunju ZHANG, Shu WANG, et al. A Grad-CAM and capsule network hybrid method for remote sensing image scene classification[J]. Front. Earth Sci., 2024, 18(3): 538-553.
 URL:  
https://academic.hep.com.cn/fesci/EN/10.1007/s11707-022-1079-x
https://academic.hep.com.cn/fesci/EN/Y2024/V18/I3/538
Fig.1  Scene images taken from the NWPU-RESISC45 data set showing the similarity of different land covers: (a) freeway; (b) railway; (c) runway; (d) dense residential; (e) medium residential; (f) sparse residential; (g) church; (h) palace; (i) commercial area.
Fig.2  The Grad-CAM-CapsNet architecture contains three parts: an attention block, a feature fusion block, and a CapsNet block.
Procedure 1: Grad-CAM-CapsNet
Step 1: Attention block
  Input: Image X
  Output: Attention image X_A
Substep 1: Calculate the weight coefficients α_i^c in Grad-CAM according to Eq. (1)
Substep 2: Calculate the attention map X_am according to Eq. (2)
Substep 3: Resize X_am by upsampling to match the size of the input image X
Substep 4: Input X into the pre-trained CNN model to obtain the feature map F_p
Substep 5: Input X_am into the customized CNN model to obtain the feature map F_c
Substep 6: Fuse F_p and F_c by element-wise multiplication to obtain the attention-masked image X_A
Substep 7: Return X_A
Step 2: CapsNet block
  Input: Attention-masked image X_A
  Output: The class probabilities P of the input image
Substep 1: Convert the attention image X_A into capsule form with the PrimaryCaps layer
Substep 2: Obtain the category capsules after processing by the DigitCaps layer
Substep 3: Obtain the probabilities P by computing the length of each category capsule according to Eq. (4)
Substep 4: Return the probabilities P of the input image
Tab.1  The whole process of the Grad-CAM-CapsNet
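
To make Step 1 concrete, the following is a minimal sketch of Substeps 1-3 in PyTorch (the framework choice is an assumption; the paper does not publish its code). The weights α_i^c are the global-average-pooled gradients of the class score with respect to the feature maps (Eq. (1)), the attention map is the ReLU of the weighted sum of feature maps (Eq. (2)), and the map is then upsampled to the input size. The names model, target_layer, and all tensor shapes are illustrative:

# Hedged sketch of Procedure 1, Step 1, Substeps 1-3 (not the authors' code).
import torch
import torch.nn.functional as F

def grad_cam_map(model, target_layer, x, class_idx=None):
    """Return the upsampled, [0, 1]-normalized Grad-CAM attention map X_am.

    x: input batch of shape (N, 3, H, W); target_layer: the last conv layer.
    """
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    logits = model(x)                                   # forward pass
    if class_idx is None:
        class_idx = logits.argmax(dim=1)                # predicted class c per image
    score = logits.gather(1, class_idx.view(-1, 1)).sum()
    model.zero_grad()
    score.backward()                                    # d(y^c)/d(feature maps)
    h1.remove(); h2.remove()

    A, dA = feats[0], grads[0]                          # both of shape (N, K, h, w)
    alpha = dA.mean(dim=(2, 3), keepdim=True)           # Eq. (1): GAP of gradients
    cam = F.relu((alpha * A).sum(dim=1, keepdim=True))  # Eq. (2): ReLU(sum_i alpha_i^c A_i)
    cam = F.interpolate(cam, size=x.shape[2:],          # Substep 3: upsample to H x W
                        mode='bilinear', align_corners=False)
    return cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-8)

Masking an input image with its attention map (as visualized in Fig.5) then reduces to an element-wise product, e.g., x_masked = x * grad_cam_map(model, layer, x).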
Fig.3  Calculation process of generating the attention map using Grad-CAM.
Fig.4  Grad-CAM implementation.
Fig.5  Input images masked with attention maps.
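
Substeps 4-6 fuse the pre-trained and customized feature streams by element-wise multiplication. Below is a minimal sketch assuming the two feature maps share spatial dimensions; pretrained_cnn and custom_cnn are placeholder modules, not the authors' exact networks:

# Hedged sketch of Procedure 1, Step 1, Substeps 4-6.
def fuse_features(pretrained_cnn, custom_cnn, x, x_am):
    f_p = pretrained_cnn(x)      # Substep 4: feature map F_p from the pre-trained CNN
    f_c = custom_cnn(x_am)       # Substep 5: feature map F_c from the customized CNN
    return f_p * f_c             # Substep 6: element-wise product gives X_A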
Fig.6  The architecture of CapsNet.
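
Step 2 reads class probabilities off capsule lengths. The following is a minimal sketch assuming the standard squash nonlinearity of Sabour et al. (2017); the layer shapes are illustrative, not taken from the paper:

# Hedged sketch of Procedure 1, Step 2: capsule squash and class probabilities.
import torch

def squash(s, dim=-1, eps=1e-8):
    """Squash nonlinearity: preserves direction, maps length into [0, 1)."""
    sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq / (1.0 + sq)) * s / torch.sqrt(sq + eps)

def class_probabilities(digit_caps):
    """Eq. (4): the probability of each class is the length of its category capsule.

    digit_caps: (N, num_classes, capsule_dim), the DigitCaps output.
    """
    return digit_caps.norm(dim=-1)          # (N, num_classes)

# Usage: P = class_probabilities(squash(digit_caps)); pred = P.argmax(dim=1)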
Model ID Model Attention CapsNet Core features
1 CaffeNet (Xia et al., 2017) No No Dropout structure
2 GoogLeNet (Szegedy et al., 2017) No No Inception structure
3 VGG16 (Liu et al., 2017) No No Multiple small convolution kernels
4 CNN-ELM (Weng et al., 2017) No No ELM structure
5 Fine-tuned GoogLeNet (Weng et al., 2017) No No Fine-tune
6 Fine-tuned VGG19 (Castelluccio et al., 2015) No No Fine-tune
7 Deep CNN Transfer (Marmanis et al., 2016) No No Different scale features
8 Triple networks (Liu and Huang, 2018) No No Label training replacement
9 Attention-based residual network (Fan et al., 2019) Yes No CNN with attention
10 Two-stream fusion (Yu and Liu, 2018a) Yes No Separate spatial and temporal features
11 VGG16-CapsNet (Zhang et al., 2019) No Yes CapsNet
12 D-CapsNet (Raza et al., 2020) Yes Yes Spatial attention & CapsNets
13 Grad-CAM-CapsNet (our proposed model) Yes Yes Pretrained attention & CapsNets
Tab.2  List of experimental comparative models
Model ID Model Type Accuracy ± standard deviation/%
1 CaffeNet Without attention or CapsNet 95.02±0.81
2 GoogLeNet 94.31±0.89
3 VGG16 95.21±1.20
4 CNN-ELM 95.62
7 Deep CNN transfer 98.49
9 D-CNN with VGGNet16 98.93±0.10
5 Fine-tuned GoogLeNet 97.10
6 Fine-tuned VGG19 98.1
10 Two-stream fusion With attention or CapsNet 98.02±1.03
9 Attention-based residual network 98.81±0.30
13 Ours (Grad-CAM-CapsNet) 99.05±0.15
Tab.3  The performances of different models on the UC Merced Land-Use data set
Fig.7  Confusion matrix of the proposed Grad-CAM-CapsNet model on the UC Merced Land-Use data set with a training ratio of 80%.
Model ID Model Type Accuracy/% (50% training ratio) Accuracy/% (20% training ratio)
2 GoogLeNet Without attention or CapsNet 86.39±0.55 83.44±0.40
1 CaffeNet 89.53±0.31 86.86±0.47
3 VGG16 89.64±0.36 86.59±0.29
10 Two-stream fusion Attention 94.58±0.25 92.32±0.41
11 VGG16-CapsNet CapsNet 94.74±0.17 91.63±0.19
12 D-CapsNet Attention & CapsNet 96.15±0.14 92.73±0.15
13 Ours 96.43±0.12 93.68±0.14
Tab.4  The performances of different models on the AID data set
Fig.8  Confusion matrix of our proposed model on the AID data set with a 20% training ratio.
Model ID Model Type Accuracy/% (20% training ratio) Accuracy/% (10% training ratio)
2 GoogLeNet Without attention or CapsNet 86.39±0.55 83.44±0.40
3 VGG-16 89.53±0.31 86.86±0.47
6 Fine-tuned VGG-16 90.36±0.18 87.15±0.45
8 Triple networks 92.33±0.20 87.15±0.45
10 Two-stream fusion Attention 83.16±0.18 80.22±0.22
11 VGG16-CapsNet CapsNet 89.18±0.14 85.08±0.13
12 D-CapsNet Attention & CapsNet 92.46±0.14 88.18±0.19
13 Ours 92.67±0.08 89.34±0.20
Tab.5  The performances of different models on the NWPU-RESISC45 data set
Fig.9  Confusion matrix of our proposed model on the NWPU-RESISC45 data set with a 20% training ratio.
ID Model type Church Palace Commercial area Total accuracy
1 Without attention or CapsNet 64% 61% 76% 87.65%
2 Attention 79% 68% 86% 83.16%
3 CapsNet 74% 85% 80% 89.18%
4 Attention & CapsNet 83% 75% 86% 92.67%
Tab.6  The performances of different models with specific classes on the NWPU-RESISC45 data set
Data set Accuracy of pretrained model (using Grad-CAM)-CapsNet/% Accuracy of pretrained model (without Grad-CAM)-CapsNet/% Increment
UC Merced 99.05 98.02 + 1.03%
AID 96.43 95.99 + 0.44%
NWPU-RESISC45 92.67 92.34 + 0.33%
Tab.7  The effectiveness of the Grad-CAM mechanism with CapsNet in different data sets
Data set Accuracy of pretrained model (using Grad-CAM)- FC layer/% Accuracy of pretrained model (without Grad-CAM)- FC layer/% Increment
UC Merced 97.62 95.95 + 1.67%
AID 95.36 94.82 + 0.54%
NWPU-RESISC45 91.06 88.21 + 2.85%
Tab.8  The effectiveness of the Grad-CAM mechanism with an FC layer in different data sets
Data set Grad-CAM-CapsNet accuracy/% Grad-CAM-FC layer accuracy/% Increment Without-Grad-CAM CapsNet accuracy/% Without-Grad-CAM FC layer accuracy/% Increment
UC Merced 99.05 97.62 + 1.43% 98.02 95.95 + 2.07%
AID 96.43 95.36 + 1.07% 95.99 94.82 + 1.17%
NWPU-RESISC45 92.67 91.06 + 1.61% 92.34 88.21 + 4.13%
Tab.9  The effectiveness of CapsNet in different data sets
Data set CNN-CapsNet accuracy/% DenseNet-CapsNet (frozen weights) accuracy/% DenseNet-CapsNet (trainable weights) accuracy/%
UC Merced 61.02 98.12 99.05
AID 64.38 95.21 96.43
NWPU-RESISC45 45.67 91.04 92.67
Tab.10  The effectiveness of the pre-trained model in different data sets
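
The Tab.10 comparison hinges on whether the pre-trained backbone's weights are frozen. A minimal sketch using a torchvision DenseNet-121 follows; the exact backbone depth and checkpoint are assumptions, since the paper only specifies a pre-trained DenseNet:

# Hedged sketch of the Tab.10 ablation: frozen vs. trainable pre-trained backbone.
import torch.nn as nn
from torchvision import models

def densenet_backbone(freeze: bool) -> nn.Module:
    backbone = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
    features = backbone.features              # convolutional feature extractor
    if freeze:
        for p in features.parameters():
            p.requires_grad = False           # the "frozen weights" variant
    return features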
1 Z Abai, N Rajmalwar (2019). DenseNet models for tiny ImageNet classification. arXiv preprint arXiv: 1904.10429
2 A Ahmed, A Jalal, K Kim (2020). A novel statistical method for scene classification based on multi-object categorization and logistic regression. Sensors (Basel), 20(14): 3871
https://doi.org/10.3390/s20143871
3 S Bai (2016). Growing random forest on deep convolutional neural networks for scene categorization. Expert Systems with Applications, 71: 279–287
https://doi.org/10.1016/j.eswa.2016.10.038
4 M Castelluccio, G Poggi, C Sansone (2015). Land use classification in remote sensing images by convolutional neural networks. arXiv preprint arXiv: 1508.00092
5 S Chaib, H Liu, Y Gu, H Yao (2017). Deep feature fusion for VHR remote sensing scene classification. IEEE Trans Geosci Remote Sens, 55(8): 4775–4784
https://doi.org/10.1109/TGRS.2017.2700322
6 J Chen, C Wang, Z Ma, J Chen, D He, S Ackland (2018). Remote sensing scene classification based on convolutional neural networks pre-trained using attention-guided sparse filters. Remote Sens (Basel), 10(2): 290
https://doi.org/10.3390/rs10020290
7 G Cheng, J Han, X Lu (2017a). Remote sensing image scene classification: benchmark and state of the art. Proc IEEE, 105(10): 1865–1883
https://doi.org/10.1109/JPROC.2017.2675998
8 G Cheng, Z Li, X Yao, L Guo, Z Wei (2017b). Remote sensing image scene classification using bag of convolutional features. IEEE Geosci Remote Sens Lett, 14(10): 1735–1739
https://doi.org/10.1109/LGRS.2017.2731997
9 G Cheng, C Yang, X Yao, L Guo, J Han (2018). When deep learning meets metric learning: remote sensing image scene classification via learning discriminative CNNs. IEEE Trans Geosci Remote Sens, 56(5): 2811–2821
https://doi.org/10.1109/TGRS.2017.2783902
10 G Cheng, P Zhou, J Han (2016). Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans Geosci Remote Sens, 54(12): 7405–7415
https://doi.org/10.1109/TGRS.2016.2601622
11 R Fan, L Wang, R Feng (2019). Attention based residual network for high-resolution remote sensing imagery scene classification. In: IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 1346–1349
12 J Gan, Q Li, Z Zhang, J Wang (2016). Two-level feature representation for aerial scene classification. IEEE Geosci Remote Sens Lett, 13(11): 1626–1630
https://doi.org/10.1109/LGRS.2016.2598567
13 C Gong, J Han, X Lu (2017). Remote sensing image scene classification: benchmark and state of the art. In: Proceedings of the IEEE, 105(10): 1865–1883
https://doi.org/10.1109/JPROC.2017.2675998
14 Q Hou, D Zhou, J Feng (2021). Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13713–13722
15 D P Kingma, J Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv: 1412.6980
16 J Knorn, A Rabe, V C Radeloff, T Kuemmerle, J Kozak, P Hostert (2009). Land cover mapping of large areas using chain classification of neighboring Landsat satellite images. Remote Sens Environ, 113(5): 957–964
https://doi.org/10.1016/j.rse.2009.01.010
17 R Lei, C Zhang, W Liu, L Zhang, X Zhang, Y Yang, J Huang, Z Li, Z Zhou (2021). Hyperspectral remote sensing image classification using deep convolutional capsule network. IEEE J Sel Top Appl Earth Obs Remote Sens, 14: 8297–8315
https://doi.org/10.1109/JSTARS.2021.3101511
18 R Lei, C Zhang, X Zhang, J Huang, Z Li, W Liu, H Cui (2022). Multiscale feature aggregation capsule neural network for hyperspectral remote sensing image classification. Remote Sens (Basel), 14(7): 1652
https://doi.org/10.3390/rs14071652
19 J Li, D Lin, Y Wang, G Xu, Y Zhang, C Ding, Y Zhou (2020). Deep discriminative representation learning with attention map for scene classification. Remote Sens (Basel), 12(9): 1366
https://doi.org/10.3390/rs12091366
20 Y Liu, M M Cheng, X Hu (2017). Richer convolutional features for edge detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5872–5881
21 Y Liu, C Huang (2018). Scene classification via triplet networks. IEEE J Sel Top Appl Earth Obs Remote Sens, 11(1): 220–237
https://doi.org/10.1109/JSTARS.2017.2761800
22 D Marmanis, M Datcu, T Esch, U Stilla (2016). Deep learning earth observation classification using ImageNet pretrained networks. IEEE Geosci Remote Sens Lett, 13(1): 105–109
https://doi.org/10.1109/LGRS.2015.2499239
23 X Mei, E Pan, Y Ma, X Dai, J Huang, F Fan, Q Du, H Zheng, J Ma (2019). Spectral-spatial attention networks for hyperspectral image classification. Remote Sens (Basel), 11(8): 963
https://doi.org/10.3390/rs11080963
24 Z Pan, J Xu, Y Guo, Y Hu, G Wang (2020). Deep learning segmentation and classification for urban village using a WorldView satellite image based on U-Net. Remote Sens (Basel), 12(10): 1574
https://doi.org/10.3390/rs12101574
25 R Pires de Lima, K Marfurt (2019). Convolutional neural network for remote-sensing scene classification: transfer learning analysis. Remote Sens (Basel), 12(1): 86
https://doi.org/10.3390/rs12010086
26 K Raiyani, T Gonçalves, L Rato, P Salgueiro, J R Marques da Silva (2021). Sentinel-2 image scene classification: a comparison between Sen2Cor and a machine learning approach. Remote Sens (Basel), 13(2): 300
https://doi.org/10.3390/rs13020300
27 A Raza, H Huo, S Sirajuddin, T Fang (2020). Diverse capsules network combining multiconvolutional layers for remote sensing image scene classification. IEEE J Sel Top Appl Earth Obs Remote Sens, 13: 5297–5313
https://doi.org/10.1109/JSTARS.2020.3021045
28 S Sabour, N Frosst, G E Hinton (2017). Dynamic routing between capsules. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), 3859–3869
29 G Sheng, W Yang, T Xu, H Sun (2012). High-resolution satellite scene classification using a sparse coding based multiple feature combination. Int J Remote Sens, 33(8): 2395–2412
https://doi.org/10.1080/01431161.2011.608740
30 X Sun, Q Zhu, Q Qin (2021). A multi-level convolution pyramid semantic fusion framework for high-resolution remote sensing image scene classification and annotation. IEEE Access, 9: 18195–18208
https://doi.org/10.1109/ACCESS.2021.3052977
31 C Szegedy, S Ioffe, V Vanhoucke, A A Alemi (2017). Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI'17). AAAI Press, 4278–4284
32 T Tian, X Liu, L Wang (2019a). Remote sensing scene classification based on Res-CapsNet. In: IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 525–528
33 X Tian, J An, G Mu (2019b). Power system transient stability assessment method based on CapsNet. In: 2019 IEEE Innovative Smart Grid Technologies-Asia (ISGT Asia). IEEE, 1159–1164
34 W Tong, W Chen, W Han, X Li, L Wang (2020). Channel-attention-based DenseNet network for remote sensing image scene classification. IEEE J Sel Top Appl Earth Obs Remote Sens, 13: 4121–4132
https://doi.org/10.1109/JSTARS.2020.3009352
35 T Vo, D Tran, W Ma (2015). Tensor decomposition and application in image classification with histogram of oriented gradients. Neurocomputing, 165: 38–45
https://doi.org/10.1016/j.neucom.2014.06.093
36 Y Wang, J Zhang, M Kan (2020). Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12275–12284
37 Q Weng, Z Mao, J Lin, W Guo (2017). Land-use classification via extreme learning classifier based on deep convolutional features. IEEE Geosci Remote Sens Lett, 14(5): 704–708
https://doi.org/10.1109/LGRS.2017.2672643
38 G S Xia, J Hu, F Hu, B Shi, X Bai, Y Zhong, L Zhang, X Lu (2017). AID: a benchmark data set for performance evaluation of aerial scene classification. IEEE Trans Geosci Remote Sens, 55(7): 3965–3981
https://doi.org/10.1109/TGRS.2017.2685945
39 Y Yang, S Newsam (2010). Bag-of-visual-words and spatial extensions for land-use classification. In: Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, 270–279
40 C Yu, C Gao, J Wang, G Yu, C Shen, N Sang (2021). BiSeNet V2: bilateral network with guided aggregation for real-time semantic segmentation. Int J Comput Vis, 129(11): 3051–3068
https://doi.org/10.1007/s11263-021-01515-2
41 Y Yu, F Liu (2018a). A two-stream deep fusion framework for high-resolution aerial scene classification. Comput Intell Neurosci, 2018: 8639367
https://doi.org/10.1155/2018/8639367
42 Y Yu, F Liu (2018b). Dense connectivity based two-stream deep feature fusion framework for aerial scene classification. Remote Sens (Basel), 10(7): 1158
https://doi.org/10.3390/rs10071158
43 W Zhang, P Tang, L Zhao (2019). Remote sensing image scene classification using CNN-CapsNet. Remote Sens (Basel), 11(5): 494
https://doi.org/10.3390/rs11050494
44 X Zhang, G Wang, S G Zhao (2022). CapsNet-COVID19: lung CT image classification method based on CapsNet model. Math Biosci Eng, 19(5): 5055–5074
https://doi.org/10.3934/mbe.2022236
45 B Zhao, Y Zhong, L Zhang, B Huang (2016). The Fisher kernel coding framework for high spatial resolution scene classification. Remote Sens (Basel), 8(2): 157
https://doi.org/10.3390/rs8020157
46 D Zhao, Y Chen, L Lv (2017). Deep reinforcement learning with visual attention for vehicle classification. IEEE Trans Cogn Dev Syst, 9(4): 356–367
https://doi.org/10.1109/TCDS.2016.2614675
47 X Zhao, J Zhang, J Tian, L Zhuo, J Zhang (2020). Residual dense network based on channel-spatial attention for the scene classification of a high-resolution remote sensing image. Remote Sens (Basel), 12(11): 1887
https://doi.org/10.3390/rs12111887
48 B Zhou, A Khosla, A Lapedriza (2016). Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2921–2929