Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2025, Vol. 19 Issue (2) : 192702    https://doi.org/10.1007/s11704-024-3571-9
Image and Graphics
SSA: semantic structure aware inference on CNN networks for weakly pixel-wise dense predictions without cost
Yanpeng SUN, Zechao LI()
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210014, China
Abstract

Pixel-wise dense prediction tasks under weak supervision currently use class attention maps (CAMs) to generate pseudo masks as ground truth. However, existing methods often incorporate trainable modules to expand the immature CAMs, which brings significant computational overhead and complicates the training process. In this work, we investigate the semantic structure information concealed within CNN networks and propose a semantic structure aware inference (SSA) method that exploits this information to obtain high-quality CAMs without any additional training cost. Specifically, the semantic structure modeling module (SSM) is first proposed to generate a class-agnostic semantic correlation representation, in which each item denotes the affinity between one category of objects and all the others. The immature CAMs are then refined through a dot-product operation that uses this semantic structure information. Finally, the polished CAMs from different backbone stages are fused as the output. The advantage of SSA lies in its parameter-free nature and the absence of additional training cost, which makes it suitable for various weakly supervised pixel-wise dense prediction tasks. Extensive experiments on weakly supervised object localization and weakly supervised semantic segmentation confirm the effectiveness of SSA.
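The inference pipeline described above — build a class-agnostic semantic correlation from backbone features, multiply it into the seed CAM, and fuse the polished CAMs across stages — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names are hypothetical, and interpreting the semantic structure as a cosine-similarity pixel affinity with mean fusion is an assumption for illustration only (the actual SSM and fusion scheme are specified in Figs. 2 and 3).

```python
import numpy as np

def semantic_affinity(feat):
    """Class-agnostic pairwise affinity from one backbone stage (assumption:
    cosine similarity between pixel features, rectified and row-normalized).

    feat: (C, H, W) feature map.  Returns an (HW, HW) affinity matrix.
    """
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)                       # flatten spatial dims
    x = x / (np.linalg.norm(x, axis=0, keepdims=True) + 1e-8)
    aff = np.maximum(x.T @ x, 0)                     # cosine similarity, ReLU
    aff /= aff.sum(axis=1, keepdims=True) + 1e-8     # row-normalize
    return aff

def refine_cam(cam, feats):
    """Propagate a seed CAM with per-stage affinities and fuse (mean fusion
    is an assumption; stage feature maps are assumed resized to CAM size).

    cam: (K, H, W) seed class activation maps.
    feats: list of (C_i, H, W) feature maps from different stages.
    """
    k, h, w = cam.shape
    flat = cam.reshape(k, h * w)
    refined = [(flat @ semantic_affinity(f)).reshape(k, h, w) for f in feats]
    return np.mean(refined, axis=0)                  # fuse polished CAMs
```

No parameters are introduced and nothing is trained, which mirrors the parameter-free, inference-only property claimed for SSA.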

Keywords: class attention maps; semantic structure; weakly-supervised object localization; weakly-supervised semantic segmentation
Corresponding Author(s): Zechao LI   
About author: Li Liu and Yanqing Liu contributed equally to this work.
Just Accepted Date: 04 January 2024   Issue Date: 22 April 2024
 Cite this article:   
Yanpeng SUN, Zechao LI. SSA: semantic structure aware inference on CNN networks for weakly pixel-wise dense predictions without cost[J]. Front. Comput. Sci., 2025, 19(2): 192702.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-024-3571-9
https://academic.hep.com.cn/fcs/EN/Y2025/V19/I2/192702
Fig.1  Visualization of the semantic structure information in different backbone stages. Pixels of the same class as the marked pixel appear brightly colored; the brighter the color, the higher the similarity. Our motivation comes from this phenomenon
Fig.2  The overall network architecture of the proposed semantic structure aware inference (SSA). Since SSA is used only at the CAM inference stage, it is applicable to all CNN-based models
Fig.3  The details of the semantic structure modeling module (SSM)
  
Method          N-SA   Top-1   Top-5   GT-known
CAM [3]         –      55.9    47.8    44.0
CAM + SSA       1      45.1    33.2    29.2
CAM + SSA       2      44.9    33.0    29.1
CAM + SSA       3      49.2    38.2    34.2
SPA + SCG [32]  –      40.49   28.4    23.0
SPA + SSA       1      38.6    25.1    19.8
SPA + SSA       2      38.1    24.9    19.4
SPA + SSA       3      41.8    29.8    24.6
Tab.1  Ablation study on the number of SA blocks in the SSM module (error rate/%). N-SA denotes the number of SA blocks
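Tab.1 varies only how many times the SA block is applied, with two passes performing best. A minimal sketch of that axis, under the hypothetical assumption (the paper's SA block is detailed in Fig.3) that one SA pass amounts to an affinity-weighted aggregation of the flattened CAM:

```python
import numpy as np

def apply_sa_blocks(cam_flat, aff, n_sa=2):
    """Repeat affinity propagation n_sa times (the N-SA axis of Tab.1).

    cam_flat: (K, HW) flattened seed CAMs.
    aff: (HW, HW) row-normalized pixel affinity matrix.
    """
    for _ in range(n_sa):
        cam_flat = cam_flat @ aff   # one assumed SA pass: each pixel's score
                                    # becomes an affinity-weighted average
    return cam_flat
```

Each extra pass diffuses activations further along the affinity graph, which is consistent with Tab.1's pattern: one or two passes sharpen the map, while a third over-smooths it and the error rate rises again.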
Fig.4  Visualization results of SSA with different numbers of SA-blocks. (a) Input image, (b) seed CAM, (c) and (d) are the results of SSA after using SA block once and twice, respectively
Stage                        Top-1   Top-5   GT-known
–                            49.0    38.5    34.1
Stage 3                      66.4    59.2    56.5
Stage 4                      44.8    33.0    28.1
Stage 5                      40.1    27.3    21.9
Stage 3 + Stage 4            49.7    39.3    34.7
Stage 3 + Stage 5            40.0    27.0    21.6
Stage 4 + Stage 5            38.1    24.9    19.4
Stage 3 + Stage 4 + Stage 5  39.0    26.1    20.6
Tab.2  Ablation study of SSA on CUB-200-2011 (error rate/%). Stage indicates which stage's semantic structure information is used to expand the seed CAM
Fig.5  Visualization results of SSA with semantic structure information of different stages. (a) Input image, (b) using Stage 3, (c) using Stage 4 and (d) using Stage 5
Fig.6  Compared results in terms of IoU curve on the ILSVRC dataset. (a) The IoU curve results based on VGG16; (b) the IoU curve results based on Inception V3
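The Top-1/Top-5/GT-known error rates and the IoU curves above all rest on box IoU: a predicted box counts as a correct localization when its IoU with the ground-truth box reaches a threshold (0.5 is the standard WSOL setting; GT-known further assumes the class label is given). A minimal sketch with illustrative function names:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area (0 if none)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def gt_known_correct(pred_box, gt_box, thr=0.5):
    """GT-known criterion: localization alone, class label assumed known."""
    return box_iou(pred_box, gt_box) >= thr
```

Sweeping `thr` over a range of IoU values and plotting the fraction of correct localizations at each value yields curves of the kind shown in Fig.6.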
Method          Backbone     Top-1   Top-5   GT-known
CAM [3]         GoogLeNet    58.9    49.3    44.9
SPG [9]         GoogLeNet    53.4    42.8    –
CAM [3]         InceptionV3  53.8    42.8    38.3
DANet [42]      InceptionV3  50.6    39.5    33.0
SEM [31]        GoogLeNet    47.0    –       30.0
ADL [8]         InceptionV3  47.0    –       –
SPA + SCG [32]  InceptionV3  46.4    33.5    27.9
CAM + SSA       InceptionV3  44.9    31.5    25.5
CAM [3]         VGG16        55.9    47.8    44.0
ACoL [43]       VGG16        54.1    43.5    45.9
SPG [9]         VGG16        51.1    42.2    41.1
ADL [8]         VGG16        47.6    –       –
DANet [42]      VGG16        47.5    38.0    32.3
I2C [44]        VGG16        44.0    31.6    –
MEIL [45]       VGG16        42.5    –       –
SPA + SCG [32]  VGG16        40.5    28.4    22.9
CAM + SSA       VGG16        44.9    33.0    29.1
SPA + SSA       VGG16        38.1    24.9    19.4
Tab.3  Comparison with the state of the art on the CUB-200-2011 test set (error rate/%)
Method          Backbone  Top-1   Top-5   GT-known
CAM [3]         VGG16     57.2    45.1    –
CutMix [41]     VGG16     56.6    –       –
SEM [31]        VGG16     55.4    –       39.2
ADL [8]         VGG16     55.1    –       –
ACoL [43]       VGG16     54.2    40.6    37.0
MEIL [45]       VGG16     53.2    –       –
I2C [44]        VGG16     52.6    41.5    36.1
SPA + SCG [32]  VGG16     51.1    39.5    35.1
SPA + SSA       VGG16     50.7    39.0    34.4
Tab.4  Comparison with the state of the art on the ILSVRC test set (error rate/%)
Method          Sup.    Val    Test
Fully supervised
FCN [46]        P.      –      62.2
Deeplab [40]    P.      67.7   70.3
Weakly supervised
BoxSup [47]     B.      62.0   64.6
SDI [48]        B.      65.7   67.5
MCIS [49]       I.S.W.  67.7   67.5
OAA+ [50]       I.S.    66.1   67.2
AttnBN [51]     I.S.    62.1   63.0
OAA [52]        I.S.    63.9   65.6
FickleNet [21]  I.S.    64.9   65.3
MCIS [49]       I.S.    66.2   66.9
CIAN [53]       I.S.    64.3   65.3
SEC [54]        I.      50.7   51.7
AE-PSL [19]     I.      55.0   55.7
IRNet [39]      I.      63.5   64.8
SSDD [55]       I.      64.9   65.5
SEAM [56]       I.      64.5   65.7
SC-CAM [57]     I.      66.1   65.9
CONTA [16]      I.      66.1   66.7
ECS [58]        I.      66.6   67.6
AdvCAM [18]     I.      68.1   68.0
IRNet + SSA     I.      67.4   67.9
IRNet¹ + SSA    I.      68.9   68.2
Tab.5  Comparison with the state of the art on the PASCAL VOC 2012 val and test sets (mIoU/%). Sup. denotes the supervision type: P. pixel-level labels, B. bounding boxes, I. image-level labels, S. saliency maps, W. web data
Method          Backbone   Sup.  Val
Weakly supervised
SEC [54]        VGG16      I.    22.4
SEAM [56]       ResNet38   I.    31.9
IRNet [39]      ResNet50   I.    32.6
CONTA [16]      ResNet101  I.    33.4
URN [59]        ResNet101  I.    40.7
MCTformer [27]  ResNet38   I.    42.0
L2G [60]        ResNet50   I.S.  42.7
IRNet + SSA     ResNet50   I.    43.2
Tab.6  Comparison with the state of the art on the COCO val set (mIoU/%)
  
  
1 Cheng Z, Qiao P, Li K, Li S, Wei P, Ji X, Yuan L, Liu C, Chen J. Out-of-candidate rectification for weakly supervised semantic segmentation. In: Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 23673−23684
2 Cheng T, Wang X, Chen S, Zhang Q, Liu W. BoxTeacher: exploring high-quality pseudo labels for weakly supervised instance segmentation. In: Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 3145−3154
3 Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 2921−2929
4 Selvaraju R R, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of 2017 IEEE International Conference on Computer Vision. 2017, 618−626
5 Wang H, Naidu R, Michael J, Kundu S S. SS-CAM: smoothed score-CAM for sharper visual feature localization. 2020, arXiv preprint arXiv: 2006.14255
6 Chattopadhay A, Sarkar A, Howlader P, Balasubramanian V N. Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. In: Proceedings of 2018 IEEE Winter Conference on Applications of Computer Vision. 2018, 839−847
7 Zeng C, Yan K, Wang Z, Yu Y, Xia S, Zhao N. Abs-CAM: a gradient optimization interpretable approach for explanation of convolutional neural networks. Signal, Image and Video Processing, 2023, 17(4): 1069–1076
8 Choe J, Shim H. Attention-based dropout layer for weakly supervised object localization. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 2214−2223
9 Zhang X, Wei Y, Kang G, Yang Y, Huang T. Self-produced guidance for weakly-supervised object localization. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 610−625
10 Zhang C, Zhong W, Li C, Deng H. Random walk-based erasing data augmentation for deep learning. Signal, Image and Video Processing, 2023, 17(5): 2447–2454
11 Zhong Z, Zheng L, Kang G, Li S, Yang Y. Random erasing data augmentation. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence, the 32nd Innovative Applications of Artificial Intelligence Conference, the 10th AAAI Symposium on Educational Advances in Artificial Intelligence. 2020, 13001−13008
12 Fu R, Hu Q, Dong X, Guo Y, Gao Y, Li B. Axiom-based Grad-CAM: towards accurate visualization and explanation of CNNs. In: Proceedings of the 31st British Machine Vision Conference. 2020
13 He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 770−778
14 Omeiza D, Speakman S, Cintas C, Weldermariam K. Smooth Grad-CAM++: an enhanced inference level visualization technique for deep convolutional neural network models. 2019, arXiv preprint arXiv: 1908.01224
15 Zhang Q, Rao L, Yang Y. Group-CAM: group score-weighted visual explanations for deep convolutional networks. 2021, arXiv preprint arXiv: 2103.13859
16 Zhang D, Zhang H, Tang J, Hua X S, Sun Q. Causal intervention for weakly-supervised semantic segmentation. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 56
17 Xie J, Xiang J, Chen J, Hou X, Zhao X, Shen L. C2AM: contrastive learning of class-agnostic activation map for weakly supervised object localization and semantic segmentation. In: Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 989−998
18 Lee J, Kim E, Yoon S. Anti-adversarially manipulated attributions for weakly and semi-supervised semantic segmentation. In: Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 4071−4080
19 Wei Y, Feng J, Liang X, Cheng M M, Zhao Y, Yan S. Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 6488−6496
20 DeVries T, Taylor G W. Improved regularization of convolutional neural networks with cutout. 2017, arXiv preprint arXiv: 1708.04552
21 Lee J, Kim E, Lee S, Lee J, Yoon S. FickleNet: weakly and semi-supervised semantic image segmentation using stochastic inference. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 5262−5271
22 Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An image is worth 16x16 words: transformers for image recognition at scale. In: Proceedings of the 9th International Conference on Learning Representations. 2021
23 Ru L, Zhan Y, Yu B, Du B. Learning affinity from attention: end-to-end weakly-supervised semantic segmentation with transformers. In: Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 16825−16834
24 Ru L, Zheng H, Zhan Y, Du B. Token contrast for weakly-supervised semantic segmentation. In: Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 3093−3102
25 Caron M, Touvron H, Misra I, Jégou H, Mairal J, Bojanowski P, Joulin A. Emerging properties in self-supervised vision transformers. In: Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. 2021, 9630−9640
26 Gao W, Wan F, Pan X, Peng Z, Tian Q, Han Z, Zhou B, Ye Q. TS-CAM: token semantic coupled attention map for weakly supervised object localization. In: Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. 2021, 2866−2875
27 Xu L, Ouyang W, Bennamoun M, Boussaid F, Xu D. Multi-class token transformer for weakly supervised semantic segmentation. In: Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 4300−4309
28 Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Köpf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S. PyTorch: an imperative style, high-performance deep learning library. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 721
29 Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg A C, Fei-Fei L. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211–252
30 Wah C, Branson S, Welinder P, Perona P, Belongie S. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001. California Institute of Technology, 2011
31 Zhang X, Wei Y, Yang Y, Wu F. Rethinking localization map: towards accurate object perception with self-enhancement maps. 2020, arXiv preprint arXiv: 2006.05220
32 Pan X, Gao Y, Lin Z, Tang F, Dong W, Yuan H, Huang F, Xu C. Unveiling the potential of structure preserving for weakly supervised object localization. In: Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 11637−11646
33 Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of the 3rd International Conference on Learning Representations. 2015
34 Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 2818−2826
35 Everingham M, Van Gool L, Williams C K I, Winn J, Zisserman A. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 2010, 88(2): 303–338
36 Hariharan B, Arbeláez P, Bourdev L, Maji S, Malik J. Semantic contours from inverse detectors. In: Proceedings of 2011 International Conference on Computer Vision. 2011, 991−998
37 Li Z, Sun Y, Zhang L, Tang J. CTNet: context-based tandem network for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(12): 9904–9917
38 Sun Y, Chen Q, He X, Wang J, Feng H, Han J, Ding E, Cheng J, Li Z, Wang J. Singular value fine-tuning: few-shot segmentation requires few-parameters fine-tuning. In: Proceedings of the 36th Conference on Neural Information Processing Systems. 2022, 37484−37496
39 Ahn J, Cho S, Kwak S. Weakly supervised learning of instance segmentation with inter-pixel relations. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 2204−2213
40 Chen L C, Papandreou G, Kokkinos I, Murphy K, Yuille A L. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834–848
41 Yun S, Han D, Chun S, Oh S J, Yoo Y, Choe J. CutMix: regularization strategy to train strong classifiers with localizable features. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 6022−6031
42 Xue H, Liu C, Wan F, Jiao J, Ji X, Ye Q. DANet: divergent activation for weakly supervised object localization. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 6588−6597
43 Zhang X, Wei Y, Feng J, Yang Y, Huang T. Adversarial complementary learning for weakly supervised object localization. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 1325−1334
44 Zhang X, Wei Y, Yang Y. Inter-image communication for weakly supervised localization. In: Proceedings of the 16th European Conference on Computer Vision. 2020, 271−287
45 Mai J, Yang M, Luo W. Erasing integrated learning: a simple yet effective approach for weakly supervised object localization. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 8763−8772
46 Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. 2015, 3431−3440
47 Dai J, He K, Sun J. BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In: Proceedings of 2015 IEEE International Conference on Computer Vision. 2015, 1635−1643
48 Khoreva A, Benenson R, Hosang J, Hein M, Schiele B. Simple does it: weakly supervised instance and semantic segmentation. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 1665−1674
49 Sun G, Wang W, Dai J, Van Gool L. Mining cross-image semantics for weakly supervised semantic segmentation. In: Proceedings of the 16th European Conference on Computer Vision. 2020, 347−365
50 Jiang P T, Han L H, Hou Q, Cheng M M, Wei Y. Online attention accumulation for weakly supervised semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(10): 7062–7077
51 Li K, Zhang Y, Li K, Li Y, Fu Y. Attention bridging network for knowledge transfer. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 5197−5206
52 Jiang P T, Hou Q, Cao Y, Cheng M M, Wei Y, Xiong H K. Integral object mining via online attention accumulation. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 2070−2079
53 Fan J, Zhang Z, Tan T, Song C, Xiao J. CIAN: cross-image affinity net for weakly supervised semantic segmentation. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence, the 32nd Innovative Applications of Artificial Intelligence Conference, the 10th AAAI Symposium on Educational Advances in Artificial Intelligence. 2020, 10762−10769
54 Kolesnikov A, Lampert C H. Seed, expand and constrain: three principles for weakly-supervised image segmentation. In: Proceedings of the 14th European Conference on Computer Vision. 2016, 695−711
55 Shimoda W, Yanai K. Self-supervised difference detection for weakly-supervised semantic segmentation. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019, 5207−5216
56 Wang Y, Zhang J, Kan M, Shan S, Chen X. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 12272−12281
57 Chang Y T, Wang Q, Hung W C, Piramuthu R, Tsai Y H, Yang M H. Weakly-supervised semantic segmentation via sub-category exploration. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 8988−8997
58 Sun K, Shi H, Zhang Z, Huang Y. ECS-Net: improving weakly supervised semantic segmentation by using connections between class activation maps. In: Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. 2021, 7263−7272
59 Li Y, Duan Y, Kuang Z, Chen Y, Zhang W, Li X. Uncertainty estimation via response scaling for pseudo-mask noise mitigation in weakly-supervised semantic segmentation. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence, the 34th Conference on Innovative Applications of Artificial Intelligence, the 12th Symposium on Educational Advances in Artificial Intelligence. 2022, 1447−1455
60 Jiang P T, Yang Y, Hou Q, Wei Y. L2G: a simple local-to-global knowledge transfer framework for weakly supervised semantic segmentation. In: Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 16865−16875