Semantic image parsing, which refers to the process of decomposing images into semantic regions and constructing the structure representation of the input, has recently aroused widespread interest in the field of computer vision. The recent application of deep representation learning has driven this field into a new stage of development. In this paper, we summarize three aspects of the progress of research on semantic image parsing, i.e., category-level semantic segmentation, instance-level semantic segmentation, and beyond segmentation. Specifically, we first review the general frameworks for each task and introduce the relevant variants. The advantages and limitations of each method are also discussed. Moreover, we present a comprehensive comparison of different benchmark datasets and evaluation metrics. Finally, we explore the future trends and challenges of semantic image parsing.
Zhao H S, Shi J P, Qi X J, Wang X G, Jia J Y. Pyramid scene parsing network. In: Proceedings of International Conference on Computer Vision and Pattern Recognition. 2017, 2881–2890 https://doi.org/10.1109/CVPR.2017.660
2
He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. In: Proceedings of IEEE International Conference on Computation Vision. 2017, 2980–2988 https://doi.org/10.1109/ICCV.2017.322
3
Tu Z, Chen X, Yuille A L, Zhu S C. Image parsing: unifying segmentation, detection, and recognition. International Journal of Computer Vision, 2005, 63(2): 113–140 https://doi.org/10.1007/s11263-005-6642-x
4
Tu Z, Zhu S C. Parsing images into region and curve processes. In: Proceedings of European Conference on Computer Vision. 2002, 393–407 https://doi.org/10.1007/3-540-47977-5_26
5
Han F, Zhu S C. Bottom-up/top-down image parsing with attribute grammar. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(1): 59–73 https://doi.org/10.1109/TPAMI.2008.65
6
Lin L, Wang G, Zhang R, Zhang R, Liang X, Zuo W. Deep structured scene parsing by learning with image descriptions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2016, 2276–2284 https://doi.org/10.1109/CVPR.2016.250
Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2005, 886–893 https://doi.org/10.1109/CVPR.2005.177
9
Ahonen T, Hadid A, Pietikäinen M. Face recognition with local binary patterns. In: Proceedings of European Conference on Computer Vision. 2004, 469–481 https://doi.org/10.1007/978-3-540-24670-1_36
10
Liu Z, Li X, Luo P, Loy C C, Tang X. Semantic image segmentation via deep parsing network. In: Proceedings of IEEE International Conference on Computer Vision. 2015, 1377–1385 https://doi.org/10.1109/ICCV.2015.162
11
Chen L C, Papandreou G, Kokkinos I, Murphy K, Yuille A L. Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834–848 https://doi.org/10.1109/TPAMI.2017.2699184
12
Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2015, 3431–3440 https://doi.org/10.1109/CVPR.2015.7298965
13
Chen L C, Papandreou G, Kokkinos I, Murphy K, Yuille A L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. 2014, arXiv preprint arXiv:1412.7062
14
Peng C, Zhang X, Yu G, Luo G, Sun J. Large kernel matters-improve semantic segmentation by global convolutional network. 2017, arXiv preprint arXiv:1703.02719
15
Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information Processing Systems. 2012, 1097–1105
16
Socher R, Manning C D, Ng A Y. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In: Proceedings of NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop. 2010, 1–9
17
Li Y, Qi H, Dai J, Ji X, Wei Y. Fully convolutional instance-aware semantic segmentation. 2016, arXiv preprint arXiv:1611.07709
18
Bengio Y. Deep learning of representations: looking forward. In: Proceedings of International Conference on Statistical Language and Speech Processing. 2013, 1–37 https://doi.org/10.1007/978-3-642-39593-2_1
19
Bengio Y. Deep learning of representations for unsupervised and transfer learning. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning. 2012, 17–36
20
Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis andMachine Intelligence, 2013, 35(8): 1798–1828
21
LeCun Y, Boser B, Denker J S, Henderson D, Howard R E, Hubbard W, Jackel L D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989, 1(4): 541–551 https://doi.org/10.1162/neco.1989.1.4.541
22
Dai J, He K, Li Y, Ren S, Sun J. Instance-sensitive fully convolutional networks. In: Proceedings of European Conference on Computer Vision. 2016, 534–549 https://doi.org/10.1007/978-3-319-46466-4_32
23
Islam M A, Naha S, Rochan M, Bruce N, Wang Y. Label refinement network for coarse-to-fine semantic segmentation. 2017, arXiv preprint arXiv:1703.00551
24
Lipton Z C, Berkowitz J, Elkan C. A critical review of recurrent neural networks for sequence learning. 2015, arXiv preprint arXiv:1506.00019
Liang X, Shen X, Xiang D, Feng J, Lin L, Yan S. Semantic object parsing with local-global long short-term memory. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2016, 3185–3193 https://doi.org/10.1109/CVPR.2016.347
27
Karpathy A, Li F F. Deep visual-semantic alignments for generating image descriptions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2015, 3128–3137 https://doi.org/10.1109/CVPR.2015.7298932
28
Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. In: Proceedings of Advances in Neural Information Processing Systems. 2014, 3104–3112
29
Li Z, Gan Y, Liang X, Yu Y, Cheng H, Lin L. LSTM-CF: unifying context modeling and fusion with LSTMS for RGB-D scene labeling. In: Proceedings of European Conference on Computer Vision. 2016, 541–557 https://doi.org/10.1007/978-3-319-46475-6_34
30
Peng Z, Zhang R, Liang X, Liu X, Lin L. Geometric scene parsing with hierarchical LSTM. 2016, arXiv preprint arXiv:1604.01931
31
Byeon W, Breuel T M, Raue F, Liwicki M. Scene labeling with LSTM recurrent neural networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2015, 3547–3555 https://doi.org/10.1109/CVPR.2015.7298977
32
Liang X, Shen X, Feng J, Lin L, Yan S. Semantic object parsing with graph LSTM. In: Proceedings of European Conference on Computer Vision. 2016, 125–143 https://doi.org/10.1007/978-3-319-46448-0_8
33
Liang X, Lin L, Shen X, Feng J, Yan S, Xing E P. Interpretable structure-evolving LSTM. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2017, 2175–2184 https://doi.org/10.1109/CVPR.2017.234
34
Zhang R, Yang W, Peng Z, Wang X, Lin L. Progressively diffused networks for semantic image segmentation. 2017, arXiv preprint arXiv:1702.05839
35
Elman J L. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 1991, 7(2-3): 195–225 https://doi.org/10.1007/BF00114844
36
Liu W, Rabinovich A, Berg A C. Parsenet: looking wider to see better. 2015, arXiv preprint arXiv:1506.04579
37
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2015, 1–9 https://doi.org/10.1109/CVPR.2015.7298594
38
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2016, 770–778 https://doi.org/10.1109/CVPR.2016.90
39
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 2014, arXiv preprint arXiv:1409.1556
40
Pinheiro P H O, Collobert R. Recurrent convolutional neural networks for scene labeling. In: Proceedings of International Conference on Machine Learning. 2014, 82–90
41
Graves A, Fernández S, Schmidhuber J. Multi-dimensional recurrent neural networks. In: Proceedings of the International Conference on Artificial Neural Networks. 2007, 549–558 https://doi.org/10.1007/978-3-540-74690-4_56
42
Lin L, Huang L, Chen T, Gan Y, Cheng H. Knowledge-guided recurrent neural network learning for task-oriented action prediction. In: Proceedings of IEEE International Conference on Multimedia and Expo. 2017, 625–630 https://doi.org/10.1109/ICME.2017.8019345
43
Farabet C, Couprie C, Najman L, LeCun Y. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(8): 1915–1929 https://doi.org/10.1109/TPAMI.2012.231
44
Gupta S, Girshick R, Arbeláez P, Malik J. Learning rich features from RGB-D images for object detection and segmentation. In: Proceedings of European Conference on Computer Vision. 2014, 345–360 https://doi.org/10.1007/978-3-319-10584-0_23
45
Ning F, Delhomme D, LeCun Y, Piano F, Bottou L, Barbano P E. Toward automatic phenotyping of developing embryos from videos. IEEE Transactions on Image Processing, 2005, 14(9): 1360–1371 https://doi.org/10.1109/TIP.2005.852470
46
Liang X, Liu S, Shen X, Yang J, Liu L, Dong J, Lin L, Yan S. Deep human parsing with active template regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(12): 2402–2414 https://doi.org/10.1109/TPAMI.2015.2408360
47
Liang X, Xu C, Shen X, Yang J, Liu S, Tang J, Lin L, Yan S. Human parsing with contextualized convolutional neural network. In: Proceedings of IEEE International Conference on Computer Vision. 2015, 1386–1394 https://doi.org/10.1109/ICCV.2015.163
48
Krähenbühl P, Koltun V. Efficientcient inference in fully connected CRFS with gaussian edge potentials. In: Proceedings of Advances in Neural Information Processing Systems. 2011, 109–117
49
Noh H, Hong S, Han B. Learning deconvolution network for semantic segmentation. In: Proceedings of IEEE International Conference on Computer Vision. 2015, 1520–1528 https://doi.org/10.1109/ICCV.2015.178
50
Badrinarayanan V, Handa A, Cipolla R. Segnet: a deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. 2015, arXiv preprint arXiv:1505.07293
51
Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention. 2015, 234–241 https://doi.org/10.1007/978-3-319-24574-4_28
52
Lin G, Milan A, Shen C, Reid I. Refinenet: multi-path refinement networks with identity mappings for high-resolution semantic segmentation. 2016, arXiv preprint arXiv:1611.06612
53
Chen L C, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation. 2017, arXiv preprint arXiv:1706.05587
Li X, Liu Z, Luo P, Loy C C, Tang X. Not all pixels are equal: difficulty-aware semantic segmentation via deep layer cascade. 2017, arXiv preprint arXiv:1704.01344
56
Zhou Y, Xie L, Shen W, Wang Y, Fishman E K, Yuille A L. A fixedpoint model for pancreas segmentation in abdominal ct scans. In: Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention. 2017, 693–701
57
Li Q, Wang J, Wipf D, Tu Z. Fixed-point model for structured labeling. In: Proceedings of International Conference on Machine Learning. 2013, 214–221
58
Wang G, Luo P, Lin L, Wang X. Learning object interactions and descriptions for semantic image segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2017, 5859–5867 https://doi.org/10.1109/CVPR.2017.556
59
Luo P, Wang G, Lin L, Wang X. Deep dual learning for semantic image segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2017, 2718–2726 https://doi.org/10.1109/ICCV.2017.296
60
Schwing A G, Urtasun R. Fully connected deep structured networks. 2015, arXiv preprint arXiv:1503.02351
61
Yang W, Luo P, Lin L. Clothing co-parsing by joint image segmentation and labeling. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2014, 3182–3189 https://doi.org/10.1109/CVPR.2014.407
62
Byeon W, Liwicki M, Breuel T M. Texture classification using 2D LSTM networks. In: Proceedings of International Conference on Pattern Recognition. 2014, 1144–1149 https://doi.org/10.1109/ICPR.2014.206
63
Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of IEEE International Conference on Computer Vision. 2015, 2650–2658 https://doi.org/10.1109/ICCV.2015.304
64
Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2014, 580–587 https://doi.org/10.1109/CVPR.2014.81
65
Reza M, Kosecka J. Reinforcement learning for semantic segmentation in indoor scenes. 2016, arXiv preprint arXiv:1606.01178
66
Van Oord A, Kalchbrenner N, Kavukcuoglu K. Pixel recurrent neural networks. In: Proceedings of International Conference on Machine Learning. 2016, 1747–1756
67
Kalchbrenner N, Danihelka I, Graves A. Grid long short-term memory. 2015, arXiv preprint arXiv:1507.01526
68
Hariharan B, Arbeláez P, Girshick R, Malik J. Simultaneous detection and segmentation. In: Proceedings of European Conference on Computer Vision. 2014, 297–312 https://doi.org/10.1007/978-3-319-10584-0_20
69
Liang X, Wei Y, Shen X, Jie Z, Feng J, Lin L, Yan S. Reversible recursive instance-level object segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2016, 633–641 https://doi.org/10.1109/CVPR.2016.75
70
Liang X, Wei Y, Shen X, Yang J, Lin L, Yan S. Proposal-free network for instance-level object segmentation. 2015, arXiv preprint arXiv:1509.02636
71
Abtahi F, Zhu Z, Burry A M. A deep reinforcement learning approach to character segmentation of license plate images. In: Proceedings of International Conference on Machine Vision Applications. 2015, 539–542 https://doi.org/10.1109/MVA.2015.7153249
72
Lin L, Wang K, Zuo W, Wang M, Luo J, Zhang L. A deep structured model with radius–margin bound for 3D human activity recognition. International Journal of Computer Vision, 2016, 118(2): 256–273 https://doi.org/10.1007/s11263-015-0876-z
73
Hariharan B, Arbeláez P, Girshick R, Malik J. Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2015, 447–456 https://doi.org/10.1109/CVPR.2015.7298642
74
Chen Y T, Liu X, Yang M H. Multi-instance object segmentation with occlusion handling. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2015, 3470–3478 https://doi.org/10.1109/CVPR.2015.7298969
75
Arbeláez P, Pont-Tuset J, Barron J T, Marques F, Malik J. Multiscale combinatorial grouping. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2014, 328–335 https://doi.org/10.1109/CVPR.2014.49
76
Li G, Xie Y, Lin L, Yu Y. Instance-level salient object segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2017, 247–256 https://doi.org/10.1109/CVPR.2017.34
77
Dai J, He K, Sun J. Instance-aware semantic segmentation via multitask network cascades. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2016, 3150–3158
He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. In: Proceedings of European Conference on Computer Vision. 2014, 346–361 https://doi.org/10.1007/978-3-319-10578-9_23
80
Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. In: Proceedings of Advances in Neural Information Processing Systems. 2015, 91–99
81
Newell A, Huang Z, Deng J. Associative embedding: end-to-end learning for joint detection and grouping. In: Proceedings of Advances in Neural Information Processing Systems. 2017, 2274–2284
82
Harley A W, Derpanis K G, Kokkinos I. Learning dense convolutional embeddings for semantic segmentation. 2015, arXiv preprint arXiv:1511.04377
83
Fathi A, Wojna Z, Rathod V, Wang P, Song H O, Guadarrama S, Murphy K P. Semantic instance segmentation via deep metric learning. 2017, arXiv preprint arXiv:1703.10277
84
Yang L, Jin R. Distance metric learning: a comprehensive survey. Michigan State Universiy, 2006, 2(2): 1–51
85
Xu J, Schwing A G, Urtasun R. Tell me what you see and I will show you where it is. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2014, 3190–3197 https://doi.org/10.1109/CVPR.2014.408
86
Miller G A, Beckwith R, Fellbaum C, Gross D, Miller K J. Introduction to wordnet: an on-line lexical database. International Journal of Lexicography, 1990, 3(4): 235–244 https://doi.org/10.1093/ijl/3.4.235
87
Socher R, Bauer J, Manning C D, Ng A Y. Parsing with compositional vector grammars. In: Proceedings of Annual Meeting of the Association for Computational Linguistics. 2013, 455–465
88
Everingham M, Van Gool L,Williams C K, Winn J, Zisserman A. The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 2010, 88(2): 303–338 https://doi.org/10.1007/s11263-009-0275-4
89
Chen X, Mottaghi R, Liu X, Fidler S, Urtasun R, Yuille A L. Detect what you can: detecting and representing objects using holistic models and body parts. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2014, 1971–1978 https://doi.org/10.1109/CVPR.2014.254
90
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211–252 https://doi.org/10.1007/s11263-015-0816-y
91
Zhou B, Zhao H, Puig X, Fidler S, Barriuso A, Torralba A. Semantic understanding of scenes through the ADE20K dataset. 2016, arXiv preprint arXiv:1608.05442
92
Lin T Y, Maire M, Belongie S, Bourdev L, Girshick R, Hays J, Perona P, Ramanan D, Zitnick C L, Dollár P. Microsoft COCO: common objects in context. 2014, arXiv preprint arXiv:1405.0312
93
Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick C L. Microsoft COCO: common objects in context. In: Proceedings of European Conference on Computer Vision. 2014, 740–755 https://doi.org/10.1007/978-3-319-10602-1_48
94
Liu C, Yuen J, Torralba A. Nonparametric scene parsing: label transfer via dense scene alignment. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2009, 1972–1979 https://doi.org/10.1109/CVPR.2009.5206536
95
Silberman N, Hoiem D, Kohli P, Fergus R. Indoor segmentation and support inference from RGBD images. In: Proceedings of European Conference on Computer Vision. 2012, 746–760 https://doi.org/10.1007/978-3-642-33715-4_54
96
Gupta S, Arbelaez P, Malik J. Perceptual organization and recognition of indoor scenes from RGB-D images. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2013, 564–571 https://doi.org/10.1109/CVPR.2013.79
97
Song S, Lichtenberg S P, Xiao J. Sun RGB-D: a RGB-D scene understanding benchmark suite. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2015, 567–576 https://doi.org/10.1109/CVPR.2015.7298655
98
Janoch A, Karayev S, Jia Y, Barron J T, Fritz M, Saenko K, Darrell T. A category-level 3D object dataset: putting the kinect to work. Consumer Depth Cameras for Computer Vision. 2013, 141–165
99
Xiao J, Owens A, Torralba A. SUN3D: a database of big spaces reconstructed using SFM and object labels. In: Proceedings of IEEE International Conference on Computer Vision. 2013, 1625–1632 https://doi.org/10.1109/ICCV.2013.458
100
Yamaguchi K, Kiapour M H, Ortiz L E, Berg T L. Parsing clothing in fashion photographs. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2012, 3570–3577 https://doi.org/10.1109/CVPR.2012.6248101
101
Liu S, Feng J, Domokos C, Xu H, Huang J, Hu Z, Yan S. Fashion parsing with weak color-category labels. IEEE Transactions on Multimedia, 2014, 16(1): 253–265 https://doi.org/10.1109/TMM.2013.2285526
102
Dong J, Chen Q, Xia W, Huang Z, Yan S. A deformable mixture parsing model with parselets. In: Proceedings of IEEE International Conference on Computer Vision. 2013, 3408–3415 https://doi.org/10.1109/ICCV.2013.423
103
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B. The cityscapes dataset for semantic urban scene understanding. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2016, 3213–3223 https://doi.org/10.1109/CVPR.2016.350