Frontiers of Computer Science

Front. Comput. Sci., 2024, 18(5): 185706. https://doi.org/10.1007/s11704-023-3242-2
Image and Graphics
GRAMO: geometric resampling augmentation for monocular 3D object detection
He GUAN 1,2, Chunfeng SONG 1,2, Zhaoxiang ZHANG 1,2
1. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
2. Center for Research on Intelligent Perception and Computing, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Abstract

Data augmentation is widely recognized as an effective means of bolstering model robustness. However, when applied to monocular 3D object detection, non-geometric image augmentation severs the critical link between the image and physical space, causing semantic collapse in the augmented scene. To address this issue, we propose two geometric-level data augmentation operators, Geometric-Copy-Paste (Geo-CP) and Geometric-Crop-Shrink (Geo-CS). Both operators enforce geometric consistency based on the principle of perspective projection, complementing the augmentation options available for monocular 3D detection. Specifically, Geo-CP replicates local patches and reorders object depths to avoid perspective occlusion conflicts, while Geo-CS re-crops local patches and rescales distance and size simultaneously so that appearance and annotation remain consistent. These operations alleviate class imbalance in the monocular paradigm by increasing both the quantity and the diversity of geometrically consistent samples. Experiments demonstrate that our geometric-level augmentation operators effectively improve robustness and performance on the KITTI and Waymo monocular 3D detection benchmarks.
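To make the geometric consistency concrete, the sketch below is our own illustration under a standard pinhole-camera assumption; the function names, interfaces, and the object-dictionary layout are ours, not the authors' released code. It shows the two rules the abstract describes: pasting copied patches far-to-near so nearer objects occlude farther ones (the Geo-CP side), and coupling a patch's scale change to a depth change along the same viewing ray so the 3D annotation stays consistent (the Geo-CS side).

```python
import numpy as np

def paste_far_to_near(objects):
    """Geo-CP side (hypothetical interface): order copied patches by
    descending depth so nearer objects are pasted last and correctly
    occlude farther ones, avoiding perspective occlusion conflicts."""
    return sorted(objects, key=lambda o: o["depth"], reverse=True)

def geo_crop_shrink(box2d, center3d, K, scale):
    """Geo-CS side: shrink a patch by `scale` (0 < scale < 1) around the
    object's projected center and return the consistent new annotation.

    Pinhole model: (X, Y, Z) projects to u = fx*X/Z + cx, v = fy*Y/Z + cy,
    and projected size falls off as 1/Z, so shrinking the patch by `scale`
    is equivalent to moving the object to depth Z' = Z / scale along the
    same viewing ray.
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    X, Y, Z = center3d

    # The projected center stays fixed; only the depth changes.
    u, v = fx * X / Z + cx, fy * Y / Z + cy
    new_Z = Z / scale  # smaller patch <=> farther object
    new_center = np.array([(u - cx) * new_Z / fx,
                           (v - cy) * new_Z / fy,
                           new_Z])

    # Shrink the 2D box by `scale` around the projected center (u, v).
    x1, y1, x2, y2 = box2d
    new_box = np.array([u + scale * (x1 - u), v + scale * (y1 - v),
                        u + scale * (x2 - u), v + scale * (y2 - v)])
    return new_box, new_center
```

With KITTI-style intrinsics (focal length roughly 721 px), shrinking a patch to half size (scale = 0.5) doubles the annotated depth while leaving the projected center and the object's physical dimensions untouched, which is what keeps appearance and annotation unified.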

Keywords: 3D detection, monocular, augmentation, geometry
Corresponding Author(s): Zhaoxiang ZHANG   
Just Accepted Date: 10 October 2023   Issue Date: 10 January 2024
 Cite this article:   
He GUAN, Chunfeng SONG, Zhaoxiang ZHANG. GRAMO: geometric resampling augmentation for monocular 3D object detection[J]. Front. Comput. Sci., 2024, 18(5): 185706.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-3242-2
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I5/185706
Fig.1  Performance comparison on the Car category of the KITTI 3D object detection benchmark. The results show consistent performance gains from the proposed geometric augmentation operators; “Vanilla” denotes a clean baseline with the existing augmentation operators removed
Fig.2  Visual illustration of the Geometric-Copy-Paste operation
Fig.3  Visual illustration of the Geometric-Crop-Shrink operation
AP_BEV / AP_3D | R40 (IoU = 0.7, Car)
Augmentation      | Easy        | Mod.        | Hard
Vanilla           | 17.50/11.47 | 14.62/10.04 | 12.36/8.30
+ Random Flipping | 22.00/15.39 | 17.64/12.68 | 15.93/10.67
+ Color Jittering | 23.64/16.84 | 18.84/13.83 | 16.25/11.65
+ Affine Resize   | 27.06/19.78 | 21.40/15.50 | 18.77/13.77
+ Geo-CP          | 28.86/20.67 | 22.83/16.99 | 20.80/14.65
+ Geo-CS          | 29.83/22.47 | 23.15/17.94 | 21.01/15.37
Tab.1  Ablations on the KITTI val set for augmentation operations
AP_BEV / AP_3D | R40 (IoU = 0.7, Car)
IoU threshold | Easy        | Mod.        | Hard
0.0           | 30.26/21.13 | 22.57/15.66 | 19.45/13.07
0.05          | 29.83/22.47 | 23.15/17.94 | 21.01/15.37
0.1           | 28.25/22.13 | 22.62/17.57 | 19.68/14.99
0.3           | 29.40/21.15 | 23.26/17.16 | 20.96/14.73
0.5           | 28.05/20.88 | 22.44/17.03 | 20.46/14.84
0.7           | 27.88/21.45 | 22.51/16.75 | 19.84/14.91
Tab.2  Ablations on the KITTI val set for different IoU thresholds
AP_BEV / AP_3D | R40 (IoU = 0.7)
Car:Pedestrian:Cyclist | Easy        | Mod.        | Hard
5:0:0   | 26.57/19.95 | 21.27/15.40 | 18.79/13.80
5:2:2   | 28.06/21.24 | 21.90/15.98 | 19.27/14.31
10:0:0  | 25.20/18.34 | 19.95/14.76 | 17.91/12.43
10:1:1  | 26.85/19.02 | 21.98/15.62 | 19.27/14.03
10:3:3  | 29.83/22.47 | 23.15/17.94 | 21.01/15.37
10:5:5  | 28.13/20.10 | 22.65/16.32 | 19.62/14.34
15:2:2  | 26.60/19.98 | 21.67/15.98 | 19.14/14.24
15:5:5  | 26.38/18.99 | 21.23/15.87 | 19.30/14.25
Tab.3  Ablations on the KITTI val set for different sampling proportions
AP_BEV / AP_3D | R40 (IoU = 0.7, Car)
Method                   | Easy        | Mod.        | Hard
MonoDETR [45] (baseline) | 35.88/26.26 | 24.78/18.59 | 20.92/15.34
MonoDETR (ours)          | 38.26/27.04 | 27.15/19.93 | 23.04/16.10
Improvement              | +2.38/+0.78 | +2.37/+1.34 | +2.12/+0.76
Tab.4  Ablations on the KITTI val set with another detector
Method | Runtime/ms | AP_BEV|R40 (IoU = 0.7): Easy / Moderate / Hard | AP_3D|R40 (IoU = 0.7): Easy / Moderate / Hard
MonoGRNet [5] | 400 | 18.19 / 11.17 / 8.73 | 15.74 / 9.61 / 4.25
M3D-RPN [14] | 160 | 21.02 / 13.67 / 10.23 | 14.76 / 9.71 / 7.42
MonoPair [20] | 60 | 19.28 / 14.83 / 12.89 | 13.04 / 9.99 / 8.65
PatchNet [11] | 400 | 22.97 / 16.86 / 14.97 | 15.68 / 11.12 / 10.17
D4LCN [4] | 200 | 22.51 / 16.02 / 12.55 | 16.65 / 11.72 / 9.51
GrooMeD-NMS [46] | 120 | 26.19 / 18.27 / 14.05 | 18.10 / 12.32 / 9.65
MonoRCNN [13] | 70 | 25.48 / 18.11 / 14.10 | 18.36 / 12.65 / 10.03
CaDDN [12] | 630 | 27.94 / 18.91 / 17.19 | 19.17 / 13.41 / 11.46
MonoFlex [25] | 35 | 28.23 / 19.75 / 16.89 | 19.94 / 13.89 / 12.07
AutoShape [23] | 40 | 30.66 / 20.08 / 15.95 | 22.47 / 14.17 / 11.36
GUPNet [2] | 34 | 30.29 / 21.19 / 18.20 | 22.26 / 15.02 / 13.12
MonoCon [19] | 26 | 31.12 / 22.10 / 19.00 | 22.50 / 16.46 / 13.95
MonoDLE [24] (baseline) | 40 | 24.79 / 18.89 / 16.00 | 17.23 / 12.26 / 10.29
MonoDLE (ours) | 40 | 32.44 / 21.74 / 18.38 | 22.34 / 15.67 / 13.12
Improvements | – | +7.65 / +2.85 / +2.38 | +5.11 / +3.41 / +2.83
Tab.5  Comparative analyses on the KITTI test set. We highlight the best results in bold and underline the second-best
Fig.4  Qualitative results on the KITTI dataset. From left to right: image, BEV, and LiDAR views. Predictions are colored red and ground truths green, covering Car, Pedestrian, and Cyclist. LiDAR signals are shown for visualization only. Best viewed in color with zoom-in
Method | LEVEL_1: 3D AP@0.7 / APH@0.7 / AP@0.5 / APH@0.5 | LEVEL_2: 3D AP@0.7 / APH@0.7 / AP@0.5 / APH@0.5
M3D-RPN [14] | 0.35 / 0.34 / 3.79 / 3.63 | 0.33 / 0.33 / 3.61 / 3.46
PatchNet [11] | 0.39 / 0.37 / 2.92 / 2.74 | 0.38 / 0.36 / 2.42 / 2.28
GUPNet [2] | 2.28 / 2.27 / 10.02 / 9.94 | 2.14 / 2.12 / 9.39 / 9.31
CaDDN [12] | 5.03 / 4.99 / 17.54 / 17.31 | 4.49 / 4.45 / 16.51 / 16.28
DID-M3D [42] | – / – / 20.66 / 20.47 | – / – / 19.37 / 19.19
BEVFormer [47] | – / 7.70 / – / 30.80 | – / 6.90 / – / 27.70
MonoFlex [25] | 11.70 / 11.64 / 32.26 / 32.06 | 10.96 / 10.90 / 30.31 / 30.12
DCD [26] | 12.57 / 12.50 / 33.44 / 33.24 | 11.78 / 11.72 / 31.43 / 31.25
MonoDLE (baseline) | 10.93 / 9.93 / 28.35 / 27.93 | 9.66 / 9.58 / 27.90 / 27.55
MonoDLE (ours) | 12.83 / 12.74 / 33.01 / 32.38 | 11.87 / 12.04 / 32.13 / 31.58
Improvements | +1.90 / +2.81 / +4.67 / +4.45 | +2.21 / +2.46 / +4.23 / +4.03
Tab.6  Results on the Waymo val set. We highlight the best results in bold and underline the second-best
1 van Dijk T, de Croon G. How do neural networks see depth in single images? In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, 2183−2191
2 Lu Y, Ma X, Yang L, Zhang T, Liu Y, Chu Q, Yan J, Ouyang W. Geometry uncertainty projection network for monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, 3111−3121
3 Qin Z, Li X. MonoGround: detecting monocular 3D objects from the ground. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 3793−3802
4 Ding M, Huo Y, Yi H, Wang Z, Shi J, Lu Z, Luo P. Learning depth-guided convolutions for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020, 1000−1001
5 Qin Z, Wang J, Lu Y. MonoGRNet: a geometric reasoning network for monocular 3D object localization. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 8851−8858
6 Wang L, Du L, Ye X, Fu Y, Guo G, Xue X, Feng J, Zhang L. Depth-conditioned dynamic message propagation for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 454−463
7 Park D, Ambrus R, Guizilini V, Li J, Gaidon A. Is pseudo-lidar needed for monocular 3D object detection? In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, 3142−3152
8 Wang Y, Chao W, Garg D, Hariharan B, Campbell M, Weinberger K. Pseudo-LiDAR from visual depth estimation: bridging the gap in 3D object detection for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 8445−8453
9 Qian R, Garg D, Wang Y, You Y, Belongie S, Hariharan B, Campbell M, Weinberger K, Chao W. End-to-end Pseudo-LiDAR for image-based 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 5881−5890
10 Chen Y, Dai H, Ding Y. Pseudo-Stereo for monocular 3D object detection in autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 887−897
11 Ma X, Liu S, Xia Z, Zhang H, Zeng X, Ouyang W. Rethinking Pseudo-LiDAR representation. In: Proceedings of European Conference on Computer Vision. 2020, 311−327
12 Reading C, Harakeh A, Chae J, Waslander S. Categorical depth distribution network for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 8555−8564
13 Shi X, Ye Q, Chen X, Chen C, Chen Z, Kim T. Geometry-based distance decomposition for monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, 15172−15181
14 Brazil G, Liu X. M3D-RPN: monocular 3D region proposal network for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, 9287−9296
15 Luo S, Dai H, Shao L, Ding Y. M3DSSD: monocular 3D single stage object detector. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 6145−6154
16 Wang T, Zhu X, Pang J, Lin D. FCOS3D: fully convolutional one-stage monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, 913−922
17 Mousavian A, Anguelov D, Flynn J, Kosecka J. 3D bounding box estimation using deep learning and geometry. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 7074−7082
18 Shi X, Chen Z, Kim T. Distance-normalized unified representation for monocular 3D object detection. In: Proceedings of European Conference on Computer Vision. 2020, 91−107
19 Liu X, Xue N, Wu T. Learning auxiliary monocular contexts helps monocular 3D object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2022, 1810−1818
20 Chen Y, Tai L, Sun K, Li M. MonoPair: monocular 3D object detection using pairwise spatial relationships. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 12093−12102
21 Gu J, Wu B, Fan L, Huang J, Cao S, Xiang Z, Hua X. Homography loss for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 1080−1089
22 Chabot F, Chaouch M, Rabarisoa J, Teuliere C, Chateau T. Deep MANTA: a coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 2040−2049
23 Liu Z, Zhou D, Lu F, Fang J, Zhang L. AutoShape: real-time shape-aware monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, 15641−15650
24 Ma X, Zhang Y, Xu D, Zhou D, Yi S, Li H, Ouyang W. Delving into localization errors for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 4721−4730
25 Zhang Y, Lu J, Zhou J. Objects are different: flexible monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 3289−3298
26 Li Y, Chen Y, He J, Zhang Z. Densely constrained depth estimator for monocular 3D object detection. In: European Conference on Computer Vision. 2022, 718−734
27 Chen H, Huang Y, Tian W, Gao Z, Xiong L. MonoRUn: monocular 3D object detection by reconstruction and uncertainty propagation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 10379−10388
28 Chen H, Wang P, Wang F, Tian W, Xiong L, Li H. EPro-PnP: generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 2781−2790
29 Fang H, Sun J, Wang R, Gou M, Li Y, Lu C. InstaBoost: boosting instance segmentation via probability map guided copy-pasting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, 682−691
30 Georgakis G, Mousavian A, Berg A, Kosecka J. Synthesizing training data for object detection in indoor scenes. 2017, arXiv preprint arXiv: 1702.07836
31 Dvornik N, Mairal J, Schmid C. Modeling visual context is key to augmenting object detection datasets. In: Proceedings of the European Conference on Computer Vision. 2018, 364−380
32 Dwibedi D, Misra I, Hebert M. Cut, paste and learn: surprisingly easy synthesis for instance detection. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, 1301−1310
33 Wang H, Huang D, Wang Y. GridNet: efficiently learning deep hierarchical representation for 3D point cloud understanding. Frontiers of Computer Science, 2022, 16(1): 161301
34 Xian Y, Xiao J, Wang Y. A fast registration algorithm of rock point cloud based on spherical projection and feature extraction. Frontiers of Computer Science, 2019, 13(1): 170−182
35 Yan Y, Mao Y, Li B. SECOND: sparsely embedded convolutional detection. Sensors, 2018, 18(10): 3337
36 Xiao A, Huang J, Guan D, Cui K, Lu S, Shao L. PolarMix: a general data augmentation technique for LiDAR point clouds. In: Proceedings of Advances in Neural Information Processing Systems. 2022, 11035−11048
37 Zhang W, Wang Z, Loy C. Exploring data augmentation for multi-modality 3D object detection. 2021, arXiv preprint arXiv: 2012.12741
38 Wang C, Ma C, Zhu M, Yang X. Point augmenting: cross-modal augmentation for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 11794−11803
39 Jiang H, Cheng M, Li S, Borji A, Wang J. Joint salient object detection and existence prediction. Frontiers of Computer Science, 2019, 13(1): 778−788
40 Yang X, Xue T, Luo H, Guo J. Fast and accurate visual odometry from a monocular camera. Frontiers of Computer Science, 2019, 13(1): 1326−1336
41 Lian Q, Ye B, Xu R, Yao W, Zhang T. Exploring geometric consistency for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 1685−1694
42 Peng L, Wu X, Yang Z, Liu H, Cai D. DID-M3D: decoupling instance depth for monocular 3D object detection. In: Proceedings of European Conference on Computer Vision. 2022, 71−88
43 Chen X, Kundu K, Zhang Z, Ma H, Fidler S, Urtasun R. Monocular 3D object detection for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, 2147−2156
44 Yu F, Wang D, Shelhamer E, Darrell T. Deep layer aggregation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, 2403−2412
45 Zhang R, Qiu H, Wang T, Guo Z, Qiao Y, Li H, Gao P. MonoDETR: depth-guided transformer for monocular 3D object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, 9155−9166
46 Kumar A, Brazil G, Liu X. GrooMeD-NMS: grouped mathematically differentiable NMS for monocular 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, 8973−8983
47 Li Z, Wang W, Li H, Xie E, Sima C, Lu T, Yu Q, Dai J. BEVFormer: learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In: Proceedings of European Conference on Computer Vision. 2022, 1−18
Supplementary material: FCS-23242-OF-HG_suppl_1