Frontiers of Computer Science

Front. Comput. Sci., 2024, 18(5): 185706. https://doi.org/10.1007/s11704-023-3242-2
Image and Graphics
GRAMO: geometric resampling augmentation for monocular 3D object detection
He GUAN 1,2, Chunfeng SONG 1,2, Zhaoxiang ZHANG 1,2
1. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
2. Center for Research on Intelligent Perception and Computing, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Abstract

Data augmentation is widely recognized as an effective means of bolstering model robustness. However, when applied to monocular 3D object detection, non-geometric image augmentation severs the critical link between the image and physical space, causing semantic collapse in the augmented scene. To address this issue, we propose two geometric-level data augmentation operators, Geometric-Copy-Paste (Geo-CP) and Geometric-Crop-Shrink (Geo-CS). Both operators enforce geometric consistency based on the principle of perspective projection, complementing the augmentation options available for monocular 3D detection. Specifically, Geo-CP replicates local patches and reorders object depths to avoid perspective occlusion conflicts, while Geo-CS re-crops local patches and rescales distance and size simultaneously so that appearance and annotation remain consistent. These operations alleviate class imbalance in the monocular paradigm by increasing both the quantity and the diversity of geometrically consistent samples. Experiments demonstrate that our geometric-level augmentation operators effectively improve robustness and performance on the KITTI and Waymo monocular 3D detection benchmarks.
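To make the geometric consistency concrete, the sketch below is our own illustration under a standard pinhole-camera assumption; the function names, interfaces, and the object-dictionary layout are ours, not the authors' released code. It shows the two rules the abstract describes: pasting copied patches far-to-near so nearer objects occlude farther ones (the Geo-CP side), and coupling a patch's scale change to a depth change along the same viewing ray so the 3D annotation stays consistent (the Geo-CS side).

```python
import numpy as np

def paste_far_to_near(objects):
    """Geo-CP side (hypothetical interface): order copied patches by
    descending depth so nearer objects are pasted last and correctly
    occlude farther ones, avoiding perspective occlusion conflicts."""
    return sorted(objects, key=lambda o: o["depth"], reverse=True)

def geo_crop_shrink(box2d, center3d, K, scale):
    """Geo-CS side: shrink a patch by `scale` (0 < scale < 1) around the
    object's projected center and return the consistent new annotation.

    Pinhole model: (X, Y, Z) projects to u = fx*X/Z + cx, v = fy*Y/Z + cy,
    and projected size falls off as 1/Z, so shrinking the patch by `scale`
    is equivalent to moving the object to depth Z' = Z / scale along the
    same viewing ray.
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    X, Y, Z = center3d

    # The projected center stays fixed; only the depth changes.
    u, v = fx * X / Z + cx, fy * Y / Z + cy
    new_Z = Z / scale  # smaller patch <=> farther object
    new_center = np.array([(u - cx) * new_Z / fx,
                           (v - cy) * new_Z / fy,
                           new_Z])

    # Shrink the 2D box by `scale` around the projected center (u, v).
    x1, y1, x2, y2 = box2d
    new_box = np.array([u + scale * (x1 - u), v + scale * (y1 - v),
                        u + scale * (x2 - u), v + scale * (y2 - v)])
    return new_box, new_center
```

With KITTI-style intrinsics (focal length roughly 721 px), shrinking a patch to half size (scale = 0.5) doubles the annotated depth while leaving the projected center and the object's physical dimensions untouched, which is what keeps appearance and annotation unified.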

Keywords: 3D detection, monocular, augmentation, geometry
Corresponding Author(s): Zhaoxiang ZHANG   
Just Accepted Date: 10 October 2023   Issue Date: 10 January 2024
 Cite this article:   
He GUAN, Chunfeng SONG, Zhaoxiang ZHANG. GRAMO: geometric resampling augmentation for monocular 3D object detection[J]. Front. Comput. Sci., 2024, 18(5): 185706.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-3242-2
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I5/185706
Fig.1  Performance comparison on the Car category of the KITTI 3D object detection benchmark. The results show consistent performance gains from the proposed geometric augmentation operators; “Vanilla” denotes a clean baseline with the existing augmentation operators removed
Fig.2  Visual illustration of the Geometric-Copy-Paste operation
Fig.3  Visual illustration of the Geometric-Crop-Shrink operation
AP_BEV / AP_3D | R40 (IoU = 0.7, Car)
Augmentation      | Easy        | Mod.        | Hard
Vanilla           | 17.50/11.47 | 14.62/10.04 | 12.36/8.30
+ Random Flipping | 22.00/15.39 | 17.64/12.68 | 15.93/10.67
+ Color Jittering | 23.64/16.84 | 18.84/13.83 | 16.25/11.65
+ Affine Resize   | 27.06/19.78 | 21.40/15.50 | 18.77/13.77
+ Geo-CP          | 28.86/20.67 | 22.83/16.99 | 20.80/14.65
+ Geo-CS          | 29.83/22.47 | 23.15/17.94 | 21.01/15.37
Tab.1  Ablations on the KITTI val set for augmentation operations
AP_BEV / AP_3D | R40 (IoU = 0.7, Car)
IoU threshold | Easy        | Mod.        | Hard
0.0           | 30.26/21.13 | 22.57/15.66 | 19.45/13.07
0.05          | 29.83/22.47 | 23.15/17.94 | 21.01/15.37
0.1           | 28.25/22.13 | 22.62/17.57 | 19.68/14.99
0.3           | 29.40/21.15 | 23.26/17.16 | 20.96/14.73
0.5           | 28.05/20.88 | 22.44/17.03 | 20.46/14.84
0.7           | 27.88/21.45 | 22.51/16.75 | 19.84/14.91
Tab.2  Ablations on the KITTI val set for different IoU thresholds
AP_BEV / AP_3D | R40 (IoU = 0.7)
Car:Pedestrian:Cyclist | Easy        | Mod.        | Hard
5:0:0   | 26.57/19.95 | 21.27/15.40 | 18.79/13.80
5:2:2   | 28.06/21.24 | 21.90/15.98 | 19.27/14.31
10:0:0  | 25.20/18.34 | 19.95/14.76 | 17.91/12.43
10:1:1  | 26.85/19.02 | 21.98/15.62 | 19.27/14.03
10:3:3  | 29.83/22.47 | 23.15/17.94 | 21.01/15.37
10:5:5  | 28.13/20.10 | 22.65/16.32 | 19.62/14.34
15:2:2  | 26.60/19.98 | 21.67/15.98 | 19.14/14.24
15:5:5  | 26.38/18.99 | 21.23/15.87 | 19.30/14.25
Tab.3  Ablations on the KITTI val set for different sampling proportions
AP_BEV / AP_3D | R40 (IoU = 0.7, Car)
Method                   | Easy        | Mod.        | Hard
MonoDETR [45] (baseline) | 35.88/26.26 | 24.78/18.59 | 20.92/15.34
MonoDETR (ours)          | 38.26/27.04 | 27.15/19.93 | 23.04/16.10
Improvement              | +2.38/+0.78 | +2.37/+1.34 | +2.12/+0.76
Tab.4  Ablations on the KITTI val set with another detector
Method | Runtime/ms | AP_BEV|R40 (IoU = 0.7): Easy / Moderate / Hard | AP_3D|R40 (IoU = 0.7): Easy / Moderate / Hard
MonoGRNet [5] | 400 | 18.19 / 11.17 / 8.73 | 15.74 / 9.61 / 4.25
M3D-RPN [14] | 160 | 21.02 / 13.67 / 10.23 | 14.76 / 9.71 / 7.42
MonoPair [20] | 60 | 19.28 / 14.83 / 12.89 | 13.04 / 9.99 / 8.65
PatchNet [11] | 400 | 22.97 / 16.86 / 14.97 | 15.68 / 11.12 / 10.17
D4LCN [4] | 200 | 22.51 / 16.02 / 12.55 | 16.65 / 11.72 / 9.51
GrooMeD-NMS [46] | 120 | 26.19 / 18.27 / 14.05 | 18.10 / 12.32 / 9.65
MonoRCNN [13] | 70 | 25.48 / 18.11 / 14.10 | 18.36 / 12.65 / 10.03
CaDDN [12] | 630 | 27.94 / 18.91 / 17.19 | 19.17 / 13.41 / 11.46
MonoFlex [25] | 35 | 28.23 / 19.75 / 16.89 | 19.94 / 13.89 / 12.07
AutoShape [23] | 40 | 30.66 / 20.08 / 15.95 | 22.47 / 14.17 / 11.36
GUPNet [2] | 34 | 30.29 / 21.19 / 18.20 | 22.26 / 15.02 / 13.12
MonoCon [19] | 26 | 31.12 / 22.10 / 19.00 | 22.50 / 16.46 / 13.95
MonoDLE [24] (baseline) | 40 | 24.79 / 18.89 / 16.00 | 17.23 / 12.26 / 10.29
MonoDLE (ours) | 40 | 32.44 / 21.74 / 18.38 | 22.34 / 15.67 / 13.12
Improvements | – | +7.65 / +2.85 / +2.38 | +5.11 / +3.41 / +2.83
Tab.5  Comparative analyses on the KITTI test set. We highlight the best results in bold and underline the second-best
Fig.4  Qualitative results on the KITTI dataset. From left to right: image, BEV, and LiDAR views. Predictions are colored red and ground truths green, covering Car, Pedestrian, and Cyclist. LiDAR signals are shown for visualization only. Best viewed in color with zoom-in
Method | LEVEL_1: 3D AP@0.7 / APH@0.7 / AP@0.5 / APH@0.5 | LEVEL_2: 3D AP@0.7 / APH@0.7 / AP@0.5 / APH@0.5
M3D-RPN [14] | 0.35 / 0.34 / 3.79 / 3.63 | 0.33 / 0.33 / 3.61 / 3.46
PatchNet [11] | 0.39 / 0.37 / 2.92 / 2.74 | 0.38 / 0.36 / 2.42 / 2.28
GUPNet [2] | 2.28 / 2.27 / 10.02 / 9.94 | 2.14 / 2.12 / 9.39 / 9.31
CaDDN [12] | 5.03 / 4.99 / 17.54 / 17.31 | 4.49 / 4.45 / 16.51 / 16.28
DID-M3D [42] | – / – / 20.66 / 20.47 | – / – / 19.37 / 19.19
BEVFormer [47] | – / 7.70 / – / 30.80 | – / 6.90 / – / 27.70
MonoFlex [25] | 11.70 / 11.64 / 32.26 / 32.06 | 10.96 / 10.90 / 30.31 / 30.12
DCD [26] | 12.57 / 12.50 / 33.44 / 33.24 | 11.78 / 11.72 / 31.43 / 31.25
MonoDLE (baseline) | 10.93 / 9.93 / 28.35 / 27.93 | 9.66 / 9.58 / 27.90 / 27.55
MonoDLE (ours) | 12.83 / 12.74 / 33.01 / 32.38 | 11.87 / 12.04 / 32.13 / 31.58
Improvements | +1.90 / +2.81 / +4.67 / +4.45 | +2.21 / +2.46 / +4.23 / +4.03
Tab.6  Results on the Waymo val set. We highlight the best results in bold and underline the second-best
1 van Dijk T, de Croon G. How do neural networks see depth in single images? In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, 2183−2191
2 Lu Y, Ma X, Yang L, Zhang T, Liu Y, Chu Q, Yan J, Ouyang W. Geometry uncertainty projection network for monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, 3111−3121
3 Qin Z, Li X. MonoGround: detecting monocular 3D objects from the ground. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 3793−3802
4 Ding M, Huo Y, Yi H, Wang Z, Shi J, Lu Z, Luo P. Learning depth-guided convolutions for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020, 1000−1001
5 Qin Z, Wang J, Lu Y. MonoGRNet: a geometric reasoning network for monocular 3D object localization. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 8851−8858
6 Wang L, Du L, Ye X, Fu Y, Guo G, Xue X, Feng J, Zhang L. Depth-conditioned dynamic message propagation for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 454−463
7 Park D, Ambrus R, Guizilini V, Li J, Gaidon A. Is pseudo-lidar needed for monocular 3D object detection? In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, 3142−3152
8 Wang Y, Chao W, Garg D, Hariharan B, Campbell M, Weinberger K. Pseudo-LiDAR from visual depth estimation: bridging the gap in 3D object detection for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 8445−8453
9 Qian R, Garg D, Wang Y, You Y, Belongie S, Hariharan B, Campbell M, Weinberger K, Chao W. End-to-end Pseudo-LiDAR for image-based 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 5881−5890
10 Chen Y, Dai H, Ding Y. Pseudo-Stereo for monocular 3D object detection in autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 887−897
11 Ma X, Liu S, Xia Z, Zhang H, Zeng X, Ouyang W. Rethinking Pseudo-LiDAR representation. In: Proceedings of European Conference on Computer Vision. 2020, 311−327
12 Reading C, Harakeh A, Chae J, Waslander S. Categorical depth distribution network for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 8555−8564
13 Shi X, Ye Q, Chen X, Chen C, Chen Z, Kim T. Geometry-based distance decomposition for monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, 15172−15181
14 Brazil G, Liu X. M3D-RPN: monocular 3D region proposal network for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, 9287−9296
15 Luo S, Dai H, Shao L, Ding Y. M3DSSD: monocular 3D single stage object detector. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 6145−6154
16 Wang T, Zhu X, Pang J, Lin D. FCOS3D: fully convolutional one-stage monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, 913−922
17 Mousavian A, Anguelov D, Flynn J, Kosecka J. 3D bounding box estimation using deep learning and geometry. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 7074−7082
18 Shi X, Chen Z, Kim T. Distance-normalized unified representation for monocular 3D object detection. In: Proceedings of European Conference on Computer Vision. 2020, 91−107
19 Liu X, Xue N, Wu T. Learning auxiliary monocular contexts helps monocular 3D object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2022, 1810−1818
20 Chen Y, Tai L, Sun K, Li M. MonoPair: monocular 3D object detection using pairwise spatial relationships. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 12093−12102
21 Gu J, Wu B, Fan L, Huang J, Cao S, Xiang Z, Hua X. Homography loss for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 1080−1089
22 Chabot F, Chaouch M, Rabarisoa J, Teuliere C, Chateau T. Deep MANTA: a coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 2040−2049
23 Liu Z, Zhou D, Lu F, Fang J, Zhang L. AutoShape: real-time shape-aware monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, 15641−15650
24 Ma X, Zhang Y, Xu D, Zhou D, Yi S, Li H, Ouyang W. Delving into localization errors for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 4721−4730
25 Zhang Y, Lu J, Zhou J. Objects are different: flexible monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 3289−3298
26 Li Y, Chen Y, He J, Zhang Z. Densely constrained depth estimator for monocular 3D object detection. In: European Conference on Computer Vision. 2022, 718−734
27 Chen H, Huang Y, Tian W, Gao Z, Xiong L. MonoRUn: monocular 3D object detection by reconstruction and uncertainty propagation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 10379−10388
28 Chen H, Wang P, Wang F, Tian W, Xiong L, Li H. EPro-PnP: generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 2781−2790
29 Fang H, Sun J, Wang R, Gou M, Li Y, Lu C. InstaBoost: boosting instance segmentation via probability map guided copy-pasting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, 682−691
30 Georgakis G, Mousavian A, Berg A, Kosecka J. Synthesizing training data for object detection in indoor scenes. 2017, arXiv preprint arXiv: 1702.07836
31 Dvornik N, Mairal J, Schmid C. Modeling visual context is key to augmenting object detection datasets. In: Proceedings of the European Conference on Computer Vision. 2018, 364−380
32 Dwibedi D, Misra I, Hebert M. Cut, paste and learn: surprisingly easy synthesis for instance detection. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, 1301−1310
33 Wang H, Huang D, Wang Y. GridNet: efficiently learning deep hierarchical representation for 3D point cloud understanding. Frontiers of Computer Science, 2022, 16(1): 161301
34 Xian Y, Xiao J, Wang Y. A fast registration algorithm of rock point cloud based on spherical projection and feature extraction. Frontiers of Computer Science, 2019, 13(1): 170−182
35 Yan Y, Mao Y, Li B. SECOND: sparsely embedded convolutional detection. Sensors, 2018, 18(10): 3337
36 Xiao A, Huang J, Guan D, Cui K, Lu S, Shao L. PolarMix: a general data augmentation technique for LiDAR point clouds. In: Proceedings of Advances in Neural Information Processing Systems. 2022, 11035−11048
37 Zhang W, Wang Z, Loy C. Exploring data augmentation for multi-modality 3D object detection. 2021, arXiv preprint arXiv: 2012.12741
38 Wang C, Ma C, Zhu M, Yang X. Point augmenting: cross-modal augmentation for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 11794−11803
39 Jiang H, Cheng M, Li S, Borji A, Wang J. Joint salient object detection and existence prediction. Frontiers of Computer Science, 2019, 13(1): 778−788
40 Yang X, Xue T, Luo H, Guo J. Fast and accurate visual odometry from a monocular camera. Frontiers of Computer Science, 2019, 13(1): 1326−1336
41 Lian Q, Ye B, Xu R, Yao W, Zhang T. Exploring geometric consistency for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 1685−1694
42 Peng L, Wu X, Yang Z, Liu H, Cai D. DID-M3D: decoupling instance depth for monocular 3D object detection. In: Proceedings of European Conference on Computer Vision. 2022, 71−88
43 Chen X, Kundu K, Zhang Z, Ma H, Fidler S, Urtasun R. Monocular 3D object detection for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, 2147−2156
44 Yu F, Wang D, Shelhamer E, Darrell T. Deep layer aggregation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, 2403−2412
45 Zhang R, Qiu H, Wang T, Guo Z, Qiao Y, Li H, Gao P. MonoDETR: depth-guided transformer for monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, 9155−9166
46 Kumar A, Brazil G, Liu X. GrooMeD-NMS: grouped mathematically differentiable NMS for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 8973−8983
47 Li Z, Wang W, Li H, Xie E, Sima C, Lu T, Yu Q, Dai J. BEVFormer: learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In: Proceedings of European Conference on Computer Vision. 2022, 1−18
Supplementary material: FCS-23242-OF-HG_suppl_1