A Grad-CAM and capsule network hybrid method for remote sensing image scene classification

doi:10.1007/s11707-022-1079-x

Front. Earth Sci.

2024, Vol. 18

Issue (3) : 538-553 https://doi.org/10.1007/s11707-022-1079-x

A Grad-CAM and capsule network hybrid method for remote sensing image scene classification

Zhan HE^1,⁴, Chunju ZHANG²(

), Shu WANG³(

), Jianwei HUANG⁴, Xiaoyun ZHENG¹, Weijie JIANG⁴, Jiachen BO⁴, Yucheng YANG⁴

¹. Shenzhen Data Management Center of Planning and Natural Resources, Key Laboratory of Urban Land Resources Monitoring and Simulation (Ministry of Natural Resources), Shenzhen 518000, China
². Key Laboratory of Jianghuai Arable Land Resources Protection and Eco-restoration (Ministry of Natural Resources), Hefei 230088, China
³. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
⁴. School of Civil Engineering, Hefei University of Technology, Hefei 230009, China

Download: PDF(6041 KB) HTML
Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks

Abstract

Remote sensing image scene classification and remote sensing technology applications are hot research topics. Although CNN-based models have reached high average accuracy, some classes are still misclassified, such as “freeway,” “spare residential,” and “commercial_area.” These classes contain typical decisive features, spatial-relation features, and mixed decisive and spatial-relation features, which limit high-quality image scene classification. To address this issue, this paper proposes a Grad-CAM and capsule network hybrid method for image scene classification. The Grad-CAM and capsule network structures have the potential to recognize decisive features and spatial-relation features, respectively. By using a pre-trained model, hybrid structure, and structure adjustment, the proposed model can recognize both decisive and spatial-relation features. A group of experiments is designed on three popular data sets with increasing classification difficulties. In the most advanced experiment, 92.67% average accuracy is achieved. Specifically, 83%, 75%, and 86% accuracies are obtained in the classes of “church,” “palace,” and “commercial_area,” respectively. This research demonstrates that the hybrid structure can effectively improve performance by considering both decisive and spatial-relation features. Therefore, Grad-CAM-CapsNet is a promising and powerful structure for image scene classification.

Keywords image scene classification CNN Grad-CAM CapsNet DenseNet

Corresponding Author(s): Chunju ZHANG,Shu WANG

Online First Date: 03 July 2024 Issue Date: 29 September 2024

Cite this article:

Zhan HE,Chunju ZHANG,Shu WANG, et al. A Grad-CAM and capsule network hybrid method for remote sensing image scene classification[J]. Front. Earth Sci., 2024, 18(3): 538-553.

URL:

https://academic.hep.com.cn/fesci/EN/10.1007/s11707-022-1079-x
https://academic.hep.com.cn/fesci/EN/Y2024/V18/I3/538

Fig.1 Scene images taken from the NWPU-RESISC45 data set showing the similarity of different land covers, (a) freeway; (b) railway; (c) runway; (d) dense residential; (e) medium residential; (f) sparse residential; (g) church (h) palace; (i) commercial area.

Fig.2 The Grad-CAM-CapsNet architecture contains three parts: an attention block, a feature fusion block, and a CapsNet block.

Procedure 1: Grad-CAM-CapsNet
Step1: Attention block
Input: Image X
Output: Attention image XA
● Substep 1: Calculate the weight coefficients $α i c$ in the Grad-CAM according to Eq. (1)
● Substep 2: Calculate the attention map Xam according to Eq. (2)
● Substep 3: Xam is resized to match the size of input image X by upsampling
● Substep 4: Input X to the pre-trained CNN model to obtain feature map Fp
● Substep 5: Input Xam into the customized CNN model to obtain feature map Fc
● Substep 6: Fuse Fp and Fc by multiplying them to obtain attention masked image XA
● Substep 7: Return XA
Step2: CapsNet block
Input: Attention masked image XA
Output: The probability P of the input image
● Substep 1: The attention image XA is converted to capsule form by the PrimaryCaps layer
● Substep 2: After processing the DigitCaps layer, the category capsules can be obtained
● Substep 3: Obtain the probability P by computing the length of each category capsule according to Eq. (4)
● Substep 4: Return the probability P of the input image

Tab.1 The whole process of the Grad-CAM-CapsNet

Fig.3 Calculation process of generating the attention map using Grad-CAM.

Fig.4 Grad-CAM implementation.

Fig.5 Input images masked with attention maps.

Fig.6 The architecture of CapsNet.

Tab.2 List of experimental comparative models

Tab.3 The performances of different models on the UC Merced Land-Use data set

Fig.7 Confusion matrix of the proposed Grad-CAM-CapsNet model on the UC Merced Land-Use data set with a training ratio of 80%.

Tab.4 The performances of different models on the AID data set

Fig.8 Confusion matrix of our proposed model on the AID data set with a 20% training ratio.

Tab.5 The performances of different models on the NWPU-RESISC45 data set

Fig.9 Confusion matrix of our proposed model on the NWPU-RESISC45 data set with a 20% training ratio.

Tab.6 The performances of different models with specific classes on the NWPU-RESISC45 Data set

Tab.7 The performances of different models with specific classes on the NWPU-RESISC45 data set

Tab.8 The effectiveness of the Grad-CAM mechanism with an FC layer in different data sets

Tab.9 The effectiveness of CapsNet in different data sets

Tab.10 The effectiveness of the pre-trained model in different data sets

1	Z Abai, N Rajmalwar (2019). DenseNet models for tiny imagenet classification. arXiv preprint arXiv: 1904.10429
2	A, Ahmed A, Jalal K Kim (2020). A novel statistical method for scene classification based on multi-object categorization and logistic regression.Sensors (Basel), 20(14): 3871 https://doi.org/10.3390/s20143871
3	S Bai (2016). Growing random forest on deep convolutional neural networks for scene categorization.Expert Systems with Applications, 71: 279–287 https://doi.org/10.1016/j.eswa.2016.10.038
4	M Castelluccio, G Poggi, C Sansone (2015). Land use classification in remote sensing images by convolutional neural networks. arXiv preprint arXiv: 1508.00092
5	S, Chaib H, Liu Y, Gu H Yao (2017). Deep feature fusion for VHR remote sensing scene classification.IEEE Trans Geosci Remote Sens, 55(8): 4775–4784 https://doi.org/10.1109/TGRS.2017.2700322
6	J, Chen C, Wang Z, Ma J, Chen D, He S Ackland (2018). Remote sensing scene classification based on convolutional neural networks pre-trained using attention-guided sparse filters.Remote Sens (Basel), 10(2): 290 https://doi.org/10.3390/rs10020290
7	G, Cheng J, Han X Lu (2017a). Remote sensing image scene classification: benchmark and state of the art.Proc IEEE, 105(10): 1865–1883 https://doi.org/10.1109/JPROC.2017.2675998
8	G, Cheng Z, Li X, Yao L, Guo Z Wei (2017b). Remote sensing image scene classification using bag of convolutional features.IEEE Geosci Remote Sens Lett, 14(10): 1735–1739 https://doi.org/10.1109/LGRS.2017.2731997
9	G, Cheng C, Yang X, Yao L, Guo J Han (2018). When deep learning meets metric learning: remote sensing image scene classification via learning discriminative CNNs.IEEE Trans Geosci Remote Sens, 56(5): 2811–2821 https://doi.org/10.1109/TGRS.2017.2783902
10	G, Cheng P, Zhou J Han (2016). Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images.IEEE Trans Geosci Remote Sens, 54(12): 7405–7415 https://doi.org/10.1109/TGRS.2016.2601622
11	R Fan, L Wang, R Feng (2019). Attention based residual network for high-resolution remote sensing imagery scene classification. In: IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 1346–1349
12	J, Gan Q, Li Z, Zhang J Wang (2016). Two-level feature representation for aerial scene classification.IEEE Geosci Remote Sens Lett, 13(11): 1626–1630 https://doi.org/10.1109/LGRS.2016.2598567
13	C, Gong J, Han X Lu (2017). Remote sensing image scene classification: benchmark and state of the art.In: Proceedings of the IEEE, 105(10): 1865–1883 https://doi.org/10.1109/JPROC.2017.2675998
14	Q Hou, D Zhou, J Feng (2021). Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13713–13722
15	D P Kingma, J Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv: 1412.6980
16	J, Knorn A, Rabe V C, Radeloff T, Kuemmerle J, Kozak P Hostert (2009). Land cover mapping of large areas using chain classification of neighboring Landsat satellite images.Remote Sens Environ, 113(5): 957–964 https://doi.org/10.1016/j.rse.2009.01.010
17	R, Lei C, Zhang W, Liu L, Zhang X, Zhang Y, Yang J, Huang Z, Li Z Zhou (2021). Hyperspectral remote sensing image classification using deep convolutional capsule network.IEEE J Sel Top Appl Earth Obs Remote Sens, 14: 8297–8315 https://doi.org/10.1109/JSTARS.2021.3101511
18	R, Lei C, Zhang X, Zhang J, Huang Z, Li W, Liu H Cui (2022). Multiscale feature aggregation capsule neural network for hyperspectral remote sensing image classification.Remote Sens (Basel), 14(7): 1652 https://doi.org/10.3390/rs14071652
19	J, Li D, Lin Y, Wang G, Xu Y, Zhang C, Ding Y Zhou (2020). Deep discriminative representation learning with attention map for scene classification.Remote Sens (Basel), 12(9): 1366 https://doi.org/10.3390/rs12091366
20	Y Liu, M M Cheng, X Hu (2017). Richer convolutional features for edge detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5872–5881
21	Y, Liu C Huang (2018). Scene classification via triplet networks.IEEE J Sel Top Appl Earth Obs Remote Sens, 11(1): 220–237 https://doi.org/10.1109/JSTARS.2017.2761800
22	D, Marmanis M, Datcu T, Esch U Stilla (2016). Deep learning earth observation classification using imagenet pretrained networks.IEEE Geosci Remote Sens Lett, 13(1): 105–109 https://doi.org/10.1109/LGRS.2015.2499239
23	X, Mei E, Pan Y, Ma X, Dai J, Huang F, Fan Q, Du H, Zheng J Ma (2019). Spectral-spatial attention networks for hyperspectral image classification.Remote Sens (Basel), 11(8): 963 https://doi.org/10.3390/rs11080963
24	Z, Pan J, Xu Y, Guo Y, Hu G Wang (2020). Deep learning segmentation and classification for urban village using a worldview satellite image based on U-Net.Remote Sens (Basel), 12(10): 1574 https://doi.org/10.3390/rs12101574
25	de Lima R, Pires K Marfurt (2019). Convolutional neural network for remote-sensing scene classification: transfer learning analysis.Remote Sens (Basel), 12(1): 86 https://doi.org/10.3390/rs12010086
26	K, Raiyani T, Gonçalves L, Rato P, Salgueiro da Silva J R Marques (2021). Sentinel-2 image scene classification: a comparison between Sen2Cor and a machine learning approach.Remote Sens (Basel), 13(2): 300 https://doi.org/10.3390/rs13020300
27	A, Raza H, Huo S, Sirajuddin T Fang (2020). Diverse capsules network combining multiconvolutional layers for remote sensing image scene classification.IEEE J Sel Top Appl Earth Obs Remote Sens, 13: 5297–5313 https://doi.org/10.1109/JSTARS.2020.3021045
28	S Sabour, N Frosst, G E Hinton (2017). Dynamic routing between capsules. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), 3859–3869
29	G, Sheng W, Yang T, Xu H Sun (2012). High-resolution satellite scene classification using a sparse coding based multiple feature combination.Int J Remote Sens, 33(8): 2395–2412 https://doi.org/10.1080/01431161.2011.608740
30	X, Sun Q, Zhu Q Qin (2021). A multi-level convolution pyramid semantic fusion framework for high-resolution remote sensing image scene classification and annotation.IEEE Access, 9: 18195–18208 https://doi.org/10.1109/ACCESS.2021.3052977
31	C, Szegedy S, Ioffe V, Vanhoucke A A (2017) Alemi . Inception-v4, inception-ResNet and the impact of residual connections on learning. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI'17). AAAI Press, 4278–4284
32	T, Tian X, Liu L Wang (2019a). Remote sensing scene classification based on res-capsnet. In: IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium.IEEE, 2019: 525–528
33	X, Tian J, An G Mu (2019b). Power System Transient Stability Assessment Method Based on CapsNet. In: 2019 IEEE Innovative Smart Grid Technologies-Asia (ISGT Asia).IEEE, 2019: 1159–1164
34	W, Tong W, Chen W, Han X, Li L Wang (2020). Channel-attention-based DenseNet network for remote sensing image scene classification.IEEE J Sel Top Appl Earth Obs Remote Sens, 13: 4121–4132 https://doi.org/10.1109/JSTARS.2020.3009352
35	T, Vo D, Tran W Ma (2015). Tensor decomposition and application in image classification with histogram of oriented gradients.Neurocomputing, 165: 38–45 https://doi.org/10.1016/j.neucom.2014.06.093
36	Y, Wang J, Zhang M Kan (2020). Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation.In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 12275–12284
37	Q, Weng Z, Mao J, Lin W Guo (2017). Land-use classification via extreme learning classifier based on deep convolutional features.IEEE Geosci Remote Sens Lett, 14(5): 704–708 https://doi.org/10.1109/LGRS.2017.2672643
38	G S, Xia J, Hu F, Hu B, Shi X, Bai Y, Zhong L, Zhang X Lu (2017). AID: A benchmark data set for performance evaluation of aerial scene classification.IEEE Trans Geosci Remote Sens, 55(7): 3965–3981 https://doi.org/10.1109/TGRS.2017.2685945
39	Y, Yang S Newsam (2010). Bag-of-visual-words and spatial extensions for land-use classification.In: Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2010: 270–279
40	C, Yu C, Gao J, Wang G, Yu C, Shen N Sang (2021). Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation.Int J Comput Vis, 129(11): 3051–3068 https://doi.org/10.1007/s11263-021-01515-2
41	Y, Yu F Liu (2018a). A two-stream deep fusion framework for high-resolution aerial scene classification.Comput Intell Neurosci, 2018: 8639367 https://doi.org/10.1155/2018/8639367
42	Y, Yu F Liu (2018b). Dense connectivity based two-stream deep feature fusion framework for aerial scene classification.Remote Sens (Basel), 10(7): 1158 https://doi.org/10.3390/rs10071158
43	W, Zhang P, Tang L Zhao (2019). Remote sensing image scene classification using CNN-CapsNet.Remote Sens (Basel), 11(5): 494 https://doi.org/10.3390/rs11050494
44	X, Zhang G, Wang S G Zhao (2022). CapsNet-COVID19: Lung CT image classification method based on CapsNet model.Math Biosci Eng, 19(5): 5055–5074 https://doi.org/10.3934/mbe.2022236
45	B, Zhao Y, Zhong L, Zhang B Huang (2016). The Fisher kernel coding framework for high spatial resolution scene classification.Remote Sens (Basel), 8(2): 157 https://doi.org/10.3390/rs8020157
46	D, Zhao Y, Chen L Lv (2017). Deep reinforcement learning with visual attention for vehicle classification.IEEE Trans Cogn Dev Syst, 9(4): 356–367 https://doi.org/10.1109/TCDS.2016.2614675
47	X, Zhao J, Zhang J, Tian L, Zhuo J Zhang (2020). Residual dense network based on channel-spatial attention for the scene classification of a high-resolution remote sensing image.Remote Sens (Basel), 12(11): 1887 https://doi.org/10.3390/rs12111887
48	B, Zhou A, Khosla A Lapedriza (2016). Learning deep features for discriminative localization.Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 2921–2929

Viewed

Full text

Abstract

Cited

Shared

Discussed