Graph-Segmenter: graph transformer with boundary-aware attention for semantic segmentation
Zizhang WU1, Yuanzhu GAN1, Tianhao XU1,2, Fan WANG1
1. Computer Vision Perception Department of ZongMu Technology, Shanghai 201203, China 2. Faculty of Electrical Engineering, Information Technology, Physics, Technical University of Braunschweig, Braunschweig 38106, Germany
Transformer-based semantic segmentation approaches, which divide the image into regions with sliding windows and model the relations inside each window, have achieved outstanding success. However, relation modeling between windows was not the primary emphasis of previous work and has not been fully exploited. To address this issue, we propose Graph-Segmenter, a network comprising a graph transformer and a boundary-aware attention module. It simultaneously models the relations between windows in a global view and between the pixels inside each window in a local view, and it performs substantial boundary adjustment at low cost. Specifically, we treat every window, and every pixel inside a window, as a node to construct graphs for both views, and devise the graph transformer on these graphs. The boundary-aware attention module refines the edge information of target objects by modeling the relationships between pixels on object edges. Extensive experiments on three widely used semantic segmentation datasets (Cityscapes, ADE-20k and PASCAL Context) demonstrate that the proposed network, a graph transformer with boundary-aware attention, can achieve state-of-the-art segmentation performance.
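The two-view graph construction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`partition_windows`, `graph_attention`, `two_view_graphs`), the mean pooling used to form window nodes, the dense softmax attention used as edge weights, and the additive fusion of the two views are all assumptions made for illustration, and the boundary-aware attention module is not covered.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def partition_windows(feat, win):
    """Split an (H, W, C) feature map into (num_windows, win*win, C)."""
    H, W, C = feat.shape
    assert H % win == 0 and W % win == 0, "H and W must be divisible by win"
    x = feat.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, win, win, C)
    return x.reshape(-1, win * win, C)


def graph_attention(nodes):
    """Treat the last-but-one axis as graph nodes.

    Returns a dense, row-normalized adjacency (edge weights from scaled
    dot-product similarity) and the node features after one aggregation step.
    """
    C = nodes.shape[-1]
    scores = nodes @ np.swapaxes(nodes, -1, -2) / np.sqrt(C)
    adj = softmax(scores, axis=-1)
    return adj, adj @ nodes


def two_view_graphs(feat, win):
    """Build a global graph over windows and a local graph inside each window."""
    windows = partition_windows(feat, win)  # (Nw, P, C)
    # Assumption: one node per window, summarized by mean pooling.
    window_nodes = windows.mean(axis=1)     # (Nw, C)
    global_adj, global_out = graph_attention(window_nodes)  # windows as nodes
    local_adj, local_out = graph_attention(windows)         # pixels as nodes
    # Assumption: fuse by adding each window's global context to its pixels.
    fused = local_out + global_out[:, None, :]
    return global_adj, local_adj, fused


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((8, 8, 4))   # toy 8x8 feature map, 4 channels
    g_adj, l_adj, fused = two_view_graphs(feat, win=4)
    print(g_adj.shape, l_adj.shape, fused.shape)
```

In this toy setting an 8×8 map with 4×4 windows yields four window nodes in the global graph and sixteen pixel nodes in each local graph; a learned model would replace the raw features with projected queries, keys, and values.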