Multi-View 3D object detection (MV3D) has made tremendous progress by leveraging multiple perspective features through surrounding cameras. Despite demonstrating promising prospects in various applications, accurately detecting objects through camera view in the 3D space is extremely difficult due to the ill-posed issue in monocular depth estimation. Recently, Graph-DETR3D presents a novel graph-based 3D-2D query paradigm in aggregating multi-view images for 3D object detection and achieves competitive performance. Although it enriches the query representations with 2D image features through a learnable 3D graph, it still suffers from limited depth and velocity estimation abilities due to the adoption of a single-frame input setting. To solve this problem, we introduce a unified spatial-temporal graph modeling framework to fully leverage the multi-view imagery cues under the multi-frame inputs setting. Thanks to the flexibility and sparsity of the dynamic graph architecture, we lift the original 3D graph into the 4D space with an effective attention mechanism to automatically perceive imagery information at both spatial and temporal levels. Moreover, considering the main latency bottleneck lies in the image backbone, we propose a novel dense-sparse distillation framework for multi-view 3D object detection, to reduce the computational budget while sacrificing no detection accuracy, making it more suitable for real-world deployment. To this end, we propose Graph-DETR4D, a faster and stronger multi-view 3D object detection framework, built on top of Graph-DETR3D. Extensive experiments on nuScenes and Waymo benchmarks demonstrate the effectiveness and efficiency of Graph-DETR4D. Notably, our best model achieves 62.0% NDS on nuScenes test leaderboard. Code is available at https://github.com/zehuichen123/Graph-DETR4D.
Download full-text PDF |
Source |
---|---|
http://dx.doi.org/10.1109/TIP.2024.3430473 | DOI Listing |
Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!