1. Shenzhen Data Management Center of Planning and Natural Resources, Key Laboratory of Urban Land Resources Monitoring and Simulation (Ministry of Natural Resources), Shenzhen 518000, China
2. Key Laboratory of Jianghuai Arable Land Resources Protection and Eco-restoration (Ministry of Natural Resources), Hefei 230088, China
3. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
4. School of Civil Engineering, Hefei University of Technology, Hefei 230009, China
Remote sensing image scene classification is a hot research topic in remote sensing applications. Although CNN-based models achieve high average accuracy, some classes, such as “freeway,” “sparse residential,” and “commercial area,” are still frequently misclassified. These classes are characterized by decisive features, spatial-relation features, or a mixture of the two, which limits high-quality image scene classification. To address this issue, this paper proposes a Grad-CAM and capsule network hybrid method for image scene classification. The Grad-CAM and capsule network structures are suited to recognizing decisive features and spatial-relation features, respectively. By combining a pre-trained model, a hybrid structure, and structural adjustments, the proposed model can recognize both decisive and spatial-relation features. A group of experiments is designed on three popular data sets of increasing classification difficulty. In the most challenging experiment, an average accuracy of 92.67% is achieved, with per-class accuracies of 83%, 75%, and 86% for “church,” “palace,” and “commercial area,” respectively. This research demonstrates that the hybrid structure effectively improves performance by considering both decisive and spatial-relation features. Grad-CAM-CapsNet is therefore a promising and powerful structure for remote sensing image scene classification.
Online First Date: 03 July 2024
Issue Date: 29 September 2024
Cite this article:
Zhan HE, Chunju ZHANG, Shu WANG, et al. A Grad-CAM and capsule network hybrid method for remote sensing image scene classification[J]. Front. Earth Sci., 2024, 18(3): 538–553.
Fig.1 Scene images taken from the NWPU-RESISC45 data set showing the similarity of different land covers: (a) freeway; (b) railway; (c) runway; (d) dense residential; (e) medium residential; (f) sparse residential; (g) church; (h) palace; (i) commercial area.
Fig.2 The Grad-CAM-CapsNet architecture contains three parts: an attention block, a feature fusion block, and a CapsNet block.
Procedure 1: Grad-CAM-CapsNet
Step 1: Attention block
Input: Image X
Output: Attention image XA
● Substep 1: Calculate the weight coefficients in the Grad-CAM according to Eq. (1)
● Substep 2: Calculate the attention map Xam according to Eq. (2)
● Substep 3: Xam is resized to match the size of input image X by upsampling
● Substep 4: Input X to the pre-trained CNN model to obtain feature map Fp
● Substep 5: Input Xam into the customized CNN model to obtain feature map Fc
● Substep 6: Fuse Fp and Fc by multiplying them to obtain attention masked image XA
● Substep 7: Return XA
Step 2: CapsNet block
Input: Attention masked image XA
Output: The probability P of the input image
● Substep 1: The attention image XA is converted into capsule form by the PrimaryCaps layer
● Substep 2: After processing by the DigitCaps layer, the category capsules are obtained
● Substep 3: Obtain the probability P by computing the length of each category capsule according to Eq. (4)
● Substep 4: Return the probability P of the input image
Tab.1 The whole process of the Grad-CAM-CapsNet
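As a concrete illustration of the attention block (Substeps 1–6), the sketch below computes the Grad-CAM weights by global-average-pooling the class-score gradients over each feature map (Eq. (1)), builds the attention map as the ReLU-weighted sum of the feature maps (Eq. (2)), upsamples it to the input size, and masks the input with it. This is a minimal sketch, not the authors' released code; the DenseNet-121 backbone, the hooked layer, and the normalization step are assumptions, and Substeps 4–6 are simplified to a direct image masking.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Assumed backbone: ImageNet-pretrained DenseNet-121; hook its last convolutional block.
model = models.densenet121(weights="IMAGENET1K_V1").eval()
target_layer = model.features

feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))               # A^k
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))   # dy^c/dA^k

def grad_cam(x, class_idx=None):
    """Substeps 1-3: Grad-CAM weights, attention map, and upsampling to the input size."""
    scores = model(x)                                    # class scores y^c
    if class_idx is None:
        class_idx = scores.argmax(dim=1)                 # top predicted class per image
    model.zero_grad()
    scores[torch.arange(x.size(0)), class_idx].sum().backward()

    # Eq. (1): alpha_k = global average pooling of the gradients over each feature map
    alpha = grads["a"].mean(dim=(2, 3), keepdim=True)
    # Eq. (2): attention map = ReLU(sum_k alpha_k * A^k)
    cam = F.relu((alpha * feats["a"]).sum(dim=1, keepdim=True))
    # Substep 3: resize to the input resolution; normalization to [0, 1] is an assumption
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    cam_min = cam.amin(dim=(2, 3), keepdim=True)
    cam_max = cam.amax(dim=(2, 3), keepdim=True)
    return (cam - cam_min) / (cam_max - cam_min + 1e-8)

# Simplified stand-in for Substeps 4-6: here the raw image is masked with the attention
# map; the paper instead fuses the pre-trained and customized CNN feature maps (Fp, Fc).
x = torch.rand(1, 3, 224, 224)
x_attention = x * grad_cam(x)
```

In the full model, the attention-masked image is then passed to the CapsNet block described in Step 2; the sketch stops at the masked image.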
Fig.3 Calculation process of generating the attention map using Grad-CAM.
Fig.4 Grad-CAM implementation.
Fig.5 Input images masked with attention maps.
Fig.6 The architecture of CapsNet.
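Two generic CapsNet ingredients underpin Fig. 6 and Step 2 of the procedure: the squashing nonlinearity, which keeps a capsule's orientation while mapping its length into (0, 1), and the length read-out that turns each category capsule into a class probability (Eq. (4)). The fragment below is a minimal sketch of just these two pieces, following Sabour et al. (2017); the 45 × 16 capsule shape is an assumption matching the NWPU-RESISC45 setting, not a configuration reported by the paper.

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    """Capsule squashing (Sabour et al., 2017): v = (|s|^2 / (1 + |s|^2)) * s / |s|."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

def class_probabilities(digit_caps):
    """Eq. (4)-style read-out: the length of each category capsule is its probability."""
    return digit_caps.norm(dim=-1)

# Assumed shape: batch of 1, 45 category capsules (NWPU-RESISC45 classes), 16-D each.
digit_caps = squash(torch.randn(1, 45, 16))
probs = class_probabilities(digit_caps)      # shape (1, 45), values in (0, 1)
predicted_class = probs.argmax(dim=1)
```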
Model ID | Model | Attention | CapsNet | Core features
1 | CaffeNet (Xia et al., 2017) | No | No | Dropout structure
2 | GoogLeNet (Szegedy et al., 2017) | No | No | Inception structure
3 | VGG16 (Liu et al., 2017) | No | No | Multiple small convolution kernels
4 | CNN-ELM (Weng et al., 2017) | No | No | ELM structure
5 | Fine-tuned GoogLeNet (Weng et al., 2017) | No | No | Fine-tuning
6 | Fine-tuned VGG19 (Castelluccio et al., 2015) | No | No | Fine-tuning
7 | Deep CNN Transfer (Marmanis et al., 2016) | No | No | Different scale features
8 | Triple networks (Liu and Huang, 2018) | No | No | Label training replacement
9 | Attention-based residual network (Fan et al., 2019) | Yes | No | CNN with attention
10 | Two-stream fusion (Yu and Liu, 2018a) | Yes | No | Separate spatial and temporal features
11 | VGG16-CapsNet (Zhang et al., 2019) | No | Yes | CapsNet
12 | D-CapsNet (Raza et al., 2020) | Yes | Yes | Spatial attention & CapsNets
13 | Grad-CAM-CapsNet (our proposed model) | Yes | Yes | Pretrained attention & CapsNets
Tab.2 List of experimental comparative models
Model ID | Model | Types | Accuracy and standard deviation/%
1 | CaffeNet | Without attention or CapsNet | 95.02±0.81
2 | GoogLeNet | Without attention or CapsNet | 94.31±0.89
3 | VGG16 | Without attention or CapsNet | 95.21±1.20
4 | CNN-ELM | Without attention or CapsNet | 95.62
7 | Deep CNN transfer | Without attention or CapsNet | 98.49
9 | D-CNN with VGGNet16 | Without attention or CapsNet | 98.93±0.10
5 | Fine-tuned GoogLeNet | Without attention or CapsNet | 97.10
6 | Fine-tuned VGG19 | Without attention or CapsNet | 98.1
10 | Two-stream fusion | With attention or CapsNet | 98.02±1.03
9 | Attention-based residual network | With attention or CapsNet | 98.81±0.30
12 | Ours | With attention or CapsNet | 99.05±0.15
Tab.3 The performances of different models on the UC Merced Land-Use data set
Fig.7 Confusion matrix of the proposed Grad-CAM-CapsNet model on the UC Merced Land-Use data set with a training ratio of 80%.
Model ID | Model | Types | Accuracy (50% training ratio) | Accuracy (20% training ratio)
2 | GoogLeNet | Without attention or CapsNet | 86.39±0.55 | 83.44±0.40
1 | CaffeNet | Without attention or CapsNet | 89.53±0.31 | 86.86±0.47
3 | VGG16 | Without attention or CapsNet | 89.64±0.36 | 86.59±0.29
12 | Two-stream fusion | Attention | 94.58±0.25 | 92.32±0.41
13 | VGG-16-CapsNet | CapsNet | 94.74±0.17 | 91.63±0.19
14 | D-CapsNet | Attention & CapsNet | 96.15±0.14 | 92.73±0.15
15 | Ours | Attention & CapsNet | 96.43±0.12 | 93.68±0.14
Tab.4 The performances of different models on the AID data set
Fig.8 Confusion matrix of our proposed model on the AID data set with a 20% training ratio.
Model ID | Model | Types | Accuracy (20% training ratio) | Accuracy (10% training ratio)
2 | GoogLeNet | Without attention or CapsNet | 86.39±0.55 | 83.44±0.40
3 | VGG-16 | Without attention or CapsNet | 89.53±0.31 | 86.86±0.47
6 | Fine-tuned VGG-16 | Without attention or CapsNet | 90.36±0.18 | 87.15±0.45
8 | Triple networks | Without attention or CapsNet | 92.33±0.20 | 87.15±0.45
10 | Two-stream fusion | Attention | 83.16±0.18 | 80.22±0.22
11 | VGG-16-CapsNet | CapsNet | 89.18±0.14 | 85.08±0.13
12 | D-CapsNet | Attention & CapsNet | 92.46±0.14 | 88.18±0.19
13 | Ours | Attention & CapsNet | 92.67±0.08 | 89.34±0.20
Tab.5 The performances of different models on the NWPU-RESISC45 data set
Fig.9 Confusion matrix of our proposed model on the NWPU-RESISC45 data set with a 20% training ratio.
ID | Model type | Church | Palace | Commercial area | Total accuracy
1 | Without attention or CapsNet | 64% | 61% | 76% | 87.65%
2 | Attention | 79% | 68% | 86% | 83.16%
3 | CapsNet | 74% | 85% | 80% | 89.18%
4 | Attention & CapsNet | 83% | 75% | 86% | 92.67%
Tab.6 The performances of different models with specific classes on the NWPU-RESISC45 data set
Data set | Accuracy of pretrained model (using Grad-CAM)-CapsNet/% | Accuracy of pretrained model (without Grad-CAM)-CapsNet/% | Increment
UC Merced | 99.05 | 98.02 | +1.03%
AID | 96.43 | 95.99 | +0.44%
NWPU-RESISC45 | 92.67 | 92.34 | +0.33%
Tab.7 The effectiveness of the Grad-CAM mechanism with CapsNet in different data sets
Data set | Accuracy of pretrained model (using Grad-CAM)-FC layer/% | Accuracy of pretrained model (without Grad-CAM)-FC layer/% | Increment
UC Merced | 97.62 | 95.95 | +1.67%
AID | 95.36 | 94.82 | +0.54%
NWPU-RESISC45 | 91.06 | 88.21 | +2.85%
Tab.8 The effectiveness of the Grad-CAM mechanism with an FC layer in different data sets
Data set | Grad-CAM-CapsNet accuracy/% | Grad-CAM-FC layer accuracy/% | Increment | Without Grad-CAM-CapsNet accuracy/% | Without Grad-CAM-FC layer accuracy/% | Increment
UC Merced | 99.05 | 97.62 | +1.43% | 98.02 | 95.95 | +2.07%
AID | 96.43 | 95.36 | +1.07% | 95.99 | 94.82 | +1.17%
NWPU-RESISC45 | 92.67 | 91.06 | +1.61% | 92.34 | 88.21 | +4.13%
Tab.9 The effectiveness of CapsNet in different data sets
Data set | CNN-CapsNet | DenseNet-CapsNet with weight freeze | DenseNet-CapsNet without weight freeze
UC Merced | 61.02 | 98.12 | 99.05
AID | 64.38 | 95.21 | 96.43
NWPU-RESISC45 | 45.67 | 91.04 | 92.67
Tab.10 The effectiveness of the pre-trained model in different data sets
1. Z Abai, N Rajmalwar (2019). DenseNet models for tiny ImageNet classification. arXiv preprint arXiv: 1904.10429
2. A Ahmed, A Jalal, K Kim (2020). A novel statistical method for scene classification based on multi-object categorization and logistic regression. Sensors (Basel), 20(14): 3871. https://doi.org/10.3390/s20143871
3. S Bai (2016). Growing random forest on deep convolutional neural networks for scene categorization. Expert Systems with Applications, 71: 279–287. https://doi.org/10.1016/j.eswa.2016.10.038
4. M Castelluccio, G Poggi, C Sansone (2015). Land use classification in remote sensing images by convolutional neural networks. arXiv preprint arXiv: 1508.00092
5. S Chaib, H Liu, Y Gu, H Yao (2017). Deep feature fusion for VHR remote sensing scene classification. IEEE Trans Geosci Remote Sens, 55(8): 4775–4784. https://doi.org/10.1109/TGRS.2017.2700322
6. J Chen, C Wang, Z Ma, J Chen, D He, S Ackland (2018). Remote sensing scene classification based on convolutional neural networks pre-trained using attention-guided sparse filters. Remote Sens (Basel), 10(2): 290. https://doi.org/10.3390/rs10020290
7. G Cheng, J Han, X Lu (2017a). Remote sensing image scene classification: benchmark and state of the art. Proc IEEE, 105(10): 1865–1883. https://doi.org/10.1109/JPROC.2017.2675998
8. G Cheng, Z Li, X Yao, L Guo, Z Wei (2017b). Remote sensing image scene classification using bag of convolutional features. IEEE Geosci Remote Sens Lett, 14(10): 1735–1739. https://doi.org/10.1109/LGRS.2017.2731997
9. G Cheng, C Yang, X Yao, L Guo, J Han (2018). When deep learning meets metric learning: remote sensing image scene classification via learning discriminative CNNs. IEEE Trans Geosci Remote Sens, 56(5): 2811–2821. https://doi.org/10.1109/TGRS.2017.2783902
10. G Cheng, P Zhou, J Han (2016). Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans Geosci Remote Sens, 54(12): 7405–7415. https://doi.org/10.1109/TGRS.2016.2601622
11. R Fan, L Wang, R Feng (2019). Attention based residual network for high-resolution remote sensing imagery scene classification. In: IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 1346–1349
12. J Gan, Q Li, Z Zhang, J Wang (2016). Two-level feature representation for aerial scene classification. IEEE Geosci Remote Sens Lett, 13(11): 1626–1630. https://doi.org/10.1109/LGRS.2016.2598567
13. C Gong, J Han, X Lu (2017). Remote sensing image scene classification: benchmark and state of the art. Proc IEEE, 105(10): 1865–1883. https://doi.org/10.1109/JPROC.2017.2675998
14. Q Hou, D Zhou, J Feng (2021). Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13713–13722
15. D P Kingma, J Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv: 1412.6980
16. J Knorn, A Rabe, V C Radeloff, T Kuemmerle, J Kozak, P Hostert (2009). Land cover mapping of large areas using chain classification of neighboring Landsat satellite images. Remote Sens Environ, 113(5): 957–964. https://doi.org/10.1016/j.rse.2009.01.010
17. R Lei, C Zhang, W Liu, L Zhang, X Zhang, Y Yang, J Huang, Z Li, Z Zhou (2021). Hyperspectral remote sensing image classification using deep convolutional capsule network. IEEE J Sel Top Appl Earth Obs Remote Sens, 14: 8297–8315. https://doi.org/10.1109/JSTARS.2021.3101511
18. R Lei, C Zhang, X Zhang, J Huang, Z Li, W Liu, H Cui (2022). Multiscale feature aggregation capsule neural network for hyperspectral remote sensing image classification. Remote Sens (Basel), 14(7): 1652. https://doi.org/10.3390/rs14071652
19. J Li, D Lin, Y Wang, G Xu, Y Zhang, C Ding, Y Zhou (2020). Deep discriminative representation learning with attention map for scene classification. Remote Sens (Basel), 12(9): 1366. https://doi.org/10.3390/rs12091366
20. Y Liu, M M Cheng, X Hu (2017). Richer convolutional features for edge detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5872–5881
D Marmanis, M Datcu, T Esch, U Stilla (2016). Deep learning earth observation classification using ImageNet pretrained networks. IEEE Geosci Remote Sens Lett, 13(1): 105–109. https://doi.org/10.1109/LGRS.2015.2499239
23. X Mei, E Pan, Y Ma, X Dai, J Huang, F Fan, Q Du, H Zheng, J Ma (2019). Spectral-spatial attention networks for hyperspectral image classification. Remote Sens (Basel), 11(8): 963. https://doi.org/10.3390/rs11080963
24. Z Pan, J Xu, Y Guo, Y Hu, G Wang (2020). Deep learning segmentation and classification for urban village using a WorldView satellite image based on U-Net. Remote Sens (Basel), 12(10): 1574. https://doi.org/10.3390/rs12101574
25. R Pires de Lima, K Marfurt (2019). Convolutional neural network for remote-sensing scene classification: transfer learning analysis. Remote Sens (Basel), 12(1): 86. https://doi.org/10.3390/rs12010086
26. K Raiyani, T Gonçalves, L Rato, P Salgueiro, J R Marques da Silva (2021). Sentinel-2 image scene classification: a comparison between Sen2Cor and a machine learning approach. Remote Sens (Basel), 13(2): 300. https://doi.org/10.3390/rs13020300
27. A Raza, H Huo, S Sirajuddin, T Fang (2020). Diverse capsules network combining multiconvolutional layers for remote sensing image scene classification. IEEE J Sel Top Appl Earth Obs Remote Sens, 13: 5297–5313. https://doi.org/10.1109/JSTARS.2020.3021045
28. S Sabour, N Frosst, G E Hinton (2017). Dynamic routing between capsules. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), 3859–3869
29. G Sheng, W Yang, T Xu, H Sun (2012). High-resolution satellite scene classification using a sparse coding based multiple feature combination. Int J Remote Sens, 33(8): 2395–2412. https://doi.org/10.1080/01431161.2011.608740
30. X Sun, Q Zhu, Q Qin (2021). A multi-level convolution pyramid semantic fusion framework for high-resolution remote sensing image scene classification and annotation. IEEE Access, 9: 18195–18208. https://doi.org/10.1109/ACCESS.2021.3052977
31. C Szegedy, S Ioffe, V Vanhoucke, A A Alemi (2017). Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI'17). AAAI Press, 4278–4284
32. T Tian, X Liu, L Wang (2019a). Remote sensing scene classification based on Res-CapsNet. In: IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 525–528
33. X Tian, J An, G Mu (2019b). Power system transient stability assessment method based on CapsNet. In: 2019 IEEE Innovative Smart Grid Technologies-Asia (ISGT Asia). IEEE, 1159–1164
34. W Tong, W Chen, W Han, X Li, L Wang (2020). Channel-attention-based DenseNet network for remote sensing image scene classification. IEEE J Sel Top Appl Earth Obs Remote Sens, 13: 4121–4132. https://doi.org/10.1109/JSTARS.2020.3009352
35. T Vo, D Tran, W Ma (2015). Tensor decomposition and application in image classification with histogram of oriented gradients. Neurocomputing, 165: 38–45. https://doi.org/10.1016/j.neucom.2014.06.093
36. Y Wang, J Zhang, M Kan (2020). Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12275–12284
37. Q Weng, Z Mao, J Lin, W Guo (2017). Land-use classification via extreme learning classifier based on deep convolutional features. IEEE Geosci Remote Sens Lett, 14(5): 704–708. https://doi.org/10.1109/LGRS.2017.2672643
38. G S Xia, J Hu, F Hu, B Shi, X Bai, Y Zhong, L Zhang, X Lu (2017). AID: a benchmark data set for performance evaluation of aerial scene classification. IEEE Trans Geosci Remote Sens, 55(7): 3965–3981. https://doi.org/10.1109/TGRS.2017.2685945
39. Y Yang, S Newsam (2010). Bag-of-visual-words and spatial extensions for land-use classification. In: Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, 270–279
40. C Yu, C Gao, J Wang, G Yu, C Shen, N Sang (2021). BiSeNet V2: bilateral network with guided aggregation for real-time semantic segmentation. Int J Comput Vis, 129(11): 3051–3068. https://doi.org/10.1007/s11263-021-01515-2
41. Y Yu, F Liu (2018a). A two-stream deep fusion framework for high-resolution aerial scene classification. Comput Intell Neurosci, 2018: 8639367. https://doi.org/10.1155/2018/8639367
42. Y Yu, F Liu (2018b). Dense connectivity based two-stream deep feature fusion framework for aerial scene classification. Remote Sens (Basel), 10(7): 1158. https://doi.org/10.3390/rs10071158
43. W Zhang, P Tang, L Zhao (2019). Remote sensing image scene classification using CNN-CapsNet. Remote Sens (Basel), 11(5): 494. https://doi.org/10.3390/rs11050494
44. X Zhang, G Wang, S G Zhao (2022). CapsNet-COVID19: lung CT image classification method based on CapsNet model. Math Biosci Eng, 19(5): 5055–5074. https://doi.org/10.3934/mbe.2022236
45. B Zhao, Y Zhong, L Zhang, B Huang (2016). The Fisher kernel coding framework for high spatial resolution scene classification. Remote Sens (Basel), 8(2): 157. https://doi.org/10.3390/rs8020157
46. D Zhao, Y Chen, L Lv (2017). Deep reinforcement learning with visual attention for vehicle classification. IEEE Trans Cogn Dev Syst, 9(4): 356–367. https://doi.org/10.1109/TCDS.2016.2614675
47. X Zhao, J Zhang, J Tian, L Zhuo, J Zhang (2020). Residual dense network based on channel-spatial attention for the scene classification of a high-resolution remote sensing image. Remote Sens (Basel), 12(11): 1887. https://doi.org/10.3390/rs12111887
48. B Zhou, A Khosla, A Lapedriza (2016). Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2921–2929