3-D object detection is a fundamental task in autonomous driving. In the literature, inexpensive monocular image-based methods perform significantly worse than algorithms based on costly LiDAR or stereo images. In this article, we aim to close this performance gap by bridging the representation capability of the 2-D and 3-D domains. We propose a novel monocular 3-D object detection model that uses self-supervised learning and auxiliary learning to mimic representations learned over 3-D point clouds. Specifically, given a 2-D region proposal and the corresponding instance point cloud, we supervise the feature activations of our image-based convolutional network at the training stage so that they mimic the latent features of a point-based neural network. Whereas state-of-the-art (SOTA) monocular 3-D detection algorithms typically convert images to pseudo-LiDAR via depth estimation and then regress 3-D detections with LiDAR-based methods, our approach exploits the 2-D neural network directly and enhances the capability of the 2-D module with latent spatial-aware representations learned through contrastive learning. We empirically validate the performance improvement from feature mimicking on the KITTI and ApolloScape datasets and achieve SOTA performance on the KITTI and ApolloScape leaderboards.
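The feature-mimicking supervision described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact loss, so everything below is an assumption made for illustration: the function names `mimic_mse` and `mimic_info_nce`, the `temperature` parameter, and the choice of a mean-squared-error term and an InfoNCE-style contrastive term are hypothetical and are not taken from the paper. Each row of `image_feats` stands for the feature of one 2-D region proposal from the image network, and the matching row of `point_feats` stands for the latent feature of the same instance from the point-based network.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def mimic_mse(image_feats, point_feats):
    """Mean squared error between paired image and point-cloud features.

    Assumption: a plain regression-style mimic loss, used here only as the
    simplest way to pull image features towards point-cloud features.
    """
    total, count = 0.0, 0
    for img, pts in zip(image_feats, point_feats):
        for a, b in zip(img, pts):
            total += (a - b) ** 2
            count += 1
    return total / count


def mimic_info_nce(image_feats, point_feats, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of paired features.

    Assumption: the i-th image feature is the positive for the i-th point
    feature, and all other point features in the batch act as negatives.
    """
    img = [l2_normalize(v) for v in image_feats]
    pts = [l2_normalize(v) for v in point_feats]
    loss = 0.0
    for i, query in enumerate(img):
        # Cosine similarity to every point feature, scaled by temperature.
        logits = [sum(a * b for a, b in zip(query, key)) / temperature
                  for key in pts]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(img)
```

In a training loop of this kind, the point-based features would typically be treated as fixed targets so that only the image network is updated, and at inference time only the image branch would be needed; both points are assumptions about a typical mimicking setup rather than details confirmed by the abstract.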
DOI: http://dx.doi.org/10.1109/TCYB.2021.3090370