Publications by authors named "Jiwen Lu"

In this paper, we propose a point-based cross-attention transformer named CrossPoints with a parametric Global Porous Sampling (GPS) strategy. The attention module is crucial for capturing the correlations between different tokens in transformers. Most existing point-based transformers design multi-scale self-attention operations on point clouds down-sampled with the widely used Farthest Point Sampling (FPS) strategy.
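For context, Farthest Point Sampling greedily selects points that maximize the minimum distance to the points already chosen. A minimal NumPy sketch of this FPS baseline (not of the proposed GPS strategy, whose details are not given in this excerpt):

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int) -> np.ndarray:
    """Select k indices from an (N, 3) point cloud, greedily maximizing
    the minimum distance to the already-chosen points."""
    n = points.shape[0]
    chosen = np.zeros(k, dtype=np.int64)
    chosen[0] = np.random.randint(n)          # arbitrary seed point
    min_dist = np.full(n, np.inf)             # distance to nearest chosen point
    for i in range(1, k):
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        min_dist = np.minimum(min_dist, d)
        chosen[i] = int(np.argmax(min_dist))  # farthest from all chosen so far
    return chosen
```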

In this paper, we propose an effective plug-and-play module called the structural relation network (SRN) to model structural dependencies in 3D point clouds for feature representation. Existing network architectures such as PointNet++ and RS-CNN capture local structures individually and ignore the interactions between different sub-clouds. Motivated by the fact that structural relation modeling plays a critical role in how humans understand 3D objects, our SRN exploits local information by modeling structural relations in 3D space.

Accurately detecting lanes plays a significant role in various autonomous and assisted driving scenarios. It is a highly structured task, as lanes in the 3D world are continuous and parallel to each other. While most existing methods focus on injecting structural priors into the representation of each lane, we propose StructLane, a method that further leverages the structural relations among lanes for more accurate and robust lane detection.

In this paper, we propose an anycost network quantization method for efficient image super-resolution under variable resource budgets. Conventional quantization approaches produce discrete network parameters for deployment under a fixed complexity constraint, whereas image super-resolution networks are usually deployed on mobile devices whose resource budgets change frequently with battery level or computing chip. Hence, exhaustively optimizing a quantized network for each complexity constraint incurs unacceptable training costs.
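To illustrate the knob such a method varies, here is a hedged sketch of plain symmetric uniform quantization at a runtime-selectable bit-width; the paper's actual anycost scheme, which avoids retraining per budget, is more involved:

```python
import torch

def uniform_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform quantization of a tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

w = torch.randn(64, 64)
for bits in (8, 6, 4, 2):  # lower resource budget -> fewer bits
    err = (w - uniform_quantize(w, bits)).abs().mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```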

In this paper, we present a Structure-aware Cross-Modal Transformer (SCMT) to fully capture the 3D structures hidden in sparse depths for depth completion. Most existing methods learn to predict dense depths by treating depth as an additional channel of the RGB image or by learning 2D affinities to perform depth propagation. However, they fail to exploit the 3D structures implied in the depth channel, losing informative 3D knowledge that provides important priors for distinguishing foreground from background features.
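Cross-modal fusion of this kind is commonly built on cross-attention, with queries from one branch attending to keys and values from the other. A generic PyTorch sketch, with dimensions and branch names that are illustrative rather than the paper's:

```python
import torch
import torch.nn as nn

# Queries from the RGB branch attend to keys/values from the 3D branch.
cross_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
rgb_feats = torch.randn(1, 300, 64)     # (batch, tokens, channels)
struct_feats = torch.randn(1, 300, 64)  # features carrying 3D structure
fused, _ = cross_attn(rgb_feats, struct_feats, struct_feats)
```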

Nowadays, pre-training big models on large-scale datasets has achieved great success and dominates many downstream tasks in natural language processing and 2D vision, while pre-training in 3D vision is still under development. In this paper, we provide a new perspective on transferring pre-trained knowledge from the 2D domain to the 3D domain, with Point-to-Pixel Prompting in data space and Pixel-to-Point distillation in feature space, exploiting the knowledge shared between images and point clouds that depict the same visual world. Following the principles of prompt engineering, Point-to-Pixel Prompting transforms point clouds into colorful images with geometry-preserved projection and geometry-aware coloring.
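As a toy illustration of turning a point cloud into an image, the sketch below orthographically projects points onto a pixel grid and colors each pixel by depth; the paper's geometry-preserved projection and geometry-aware coloring are more sophisticated:

```python
import numpy as np

def project_points_to_image(points: np.ndarray, res: int = 64) -> np.ndarray:
    """Toy orthographic projection of an (N, 3) point cloud onto a
    res x res image whose pixel intensity encodes normalized depth."""
    xyz = points - points.min(axis=0)
    xyz = xyz / (xyz.max() + 1e-8)                        # normalize to [0, 1]
    u = np.clip((xyz[:, 0] * (res - 1)).astype(int), 0, res - 1)
    v = np.clip((xyz[:, 1] * (res - 1)).astype(int), 0, res - 1)
    img = np.zeros((res, res))
    np.maximum.at(img, (v, u), xyz[:, 2])                 # simple z-buffer
    return img
```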

In this paper, we present a new framework named DIML to achieve more interpretable deep metric learning. Unlike traditional deep metric learning methods that simply produce a global similarity for a pair of images, DIML computes the overall similarity as a weighted sum of multiple local part-wise similarities, making it easier for humans to understand how the model distinguishes two images. Specifically, we propose a structural matching strategy that explicitly aligns the spatial embeddings by computing an optimal matching flow between the feature maps of the two images.
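A minimal sketch of the weighted-sum idea, using uniform weights for brevity; DIML instead derives the weights from the optimal matching flow between the two feature maps:

```python
import torch
import torch.nn.functional as F

def partwise_similarity(fa: torch.Tensor, fb: torch.Tensor) -> torch.Tensor:
    """Overall similarity as a weighted sum of local part-wise similarities.
    fa, fb: (C, H, W) feature maps from the two images."""
    a = F.normalize(fa.flatten(1), dim=0)   # (C, HW), unit-norm per location
    b = F.normalize(fb.flatten(1), dim=0)
    local_sims = (a * b).sum(dim=0)         # cosine similarity per location
    weights = torch.full_like(local_sims, 1.0 / local_sims.numel())
    return (weights * local_sims).sum()
```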

In this paper, we propose a dynamic 3D object detector named HyperDet3D, which is adaptively adjusted on the fly based on scene-level knowledge. Existing methods strive for object-level representations of local elements and their relations without scene-level priors, and thus suffer from ambiguity between similarly structured objects when relying only on an understanding of individual points and object candidates. Instead, we design scene-conditioned hypernetworks that simultaneously learn scene-agnostic embeddings, which exploit sharable abstractions across various 3D scenes, and scene-specific knowledge, which adapts the 3D detector to the given scene at test time.
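The core mechanism here is a hypernetwork: a network whose output parameterizes another network. A minimal sketch with illustrative dimensions, in which a scene embedding generates the weights of a single linear layer:

```python
import torch
import torch.nn as nn

class SceneConditionedLayer(nn.Module):
    """A scene embedding generates the weights of a linear layer,
    so the layer's behavior adapts to each scene on the fly."""
    def __init__(self, scene_dim: int, in_dim: int, out_dim: int):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.hyper = nn.Linear(scene_dim, in_dim * out_dim)

    def forward(self, x: torch.Tensor, scene_emb: torch.Tensor) -> torch.Tensor:
        w = self.hyper(scene_emb).view(self.out_dim, self.in_dim)
        return x @ w.t()

layer = SceneConditionedLayer(scene_dim=32, in_dim=128, out_dim=64)
out = layer(torch.randn(100, 128), torch.randn(32))  # 100 points -> adapted features
```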

In this paper, we propose a Transformer encoder-decoder architecture, called PoinTr, which reformulates point cloud completion as a set-to-set translation problem and employs a geometry-aware block to model local geometric relationships explicitly. Migrating Transformers to this setting enables our model to better learn structural knowledge and preserve detailed information for point cloud completion. Taking a step toward more complicated and diverse situations, we further propose AdaPoinTr, which develops an adaptive query generation mechanism and introduces a novel denoising task during completion.

In this paper, we propose a weakly-supervised approach for 3D object detection, which makes it possible to train a strong 3D detector with position-level annotations (i.e., annotations of object centers and categories).

This paper proposes an introspective deep metric learning (IDML) framework for uncertainty-aware comparisons of images. Conventional deep metric learning methods focus on learning a discriminative embedding to describe the semantic features of images, ignoring the uncertainty present in each image due to noise or semantic ambiguity. Training without awareness of these uncertainties causes the model to overfit the annotated labels during training and to produce overconfident judgments during inference.
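One common way to make comparisons uncertainty-aware is to embed each image as a Gaussian (a mean plus a per-dimension variance) and let the predicted variance inflate the distance. A generic sketch of that idea, not necessarily IDML's exact formulation:

```python
import torch

def uncertainty_aware_distance(mu1: torch.Tensor, var1: torch.Tensor,
                               mu2: torch.Tensor, var2: torch.Tensor) -> torch.Tensor:
    """Distance between two stochastic embeddings: a semantic term from
    the means plus an uncertainty term from the predicted variances."""
    semantic = (mu1 - mu2).pow(2).sum()
    uncertainty = (var1 + var2).sum()
    return semantic + uncertainty
```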

Face clustering is a promising method for annotating unlabeled face images. Recent supervised approaches have greatly boosted face clustering accuracy; however, their performance is still far from satisfactory. These methods can be roughly divided into global-based and local-based ones.

In this paper, we propose Point-Voxel Correlation Fields to explore the relations between two consecutive point clouds and estimate scene flow that represents 3D motion. Most existing works consider only local correlations, which can handle small movements but fail under large displacements. It is therefore essential to introduce all-pair correlation volumes that are free from local-neighbor restrictions and cover both short- and long-term dependencies.
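An all-pair correlation volume is simply the matrix of feature similarities between every point in one frame and every point in the next, so no point is restricted to a local neighborhood. A minimal sketch:

```python
import torch

def all_pair_correlation(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """f1: (N1, C) features of frame 1, f2: (N2, C) features of frame 2.
    Returns an (N1, N2) volume of scaled dot-product correlations."""
    return f1 @ f2.t() / f1.shape[1] ** 0.5
```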

In this paper, we propose a discrepancy-aware meta-learning approach for zero-shot face manipulation detection, which aims to learn a discriminative model that maximizes generalization to unseen face manipulation attacks under the guidance of a discrepancy map. Unlike existing face manipulation detection methods, which usually present algorithmic solutions to known attacks so that the same attack types are used to train and test the models, we define face manipulation detection as a zero-shot problem. We formulate model learning as a meta-learning process and generate zero-shot face manipulation tasks for the model to learn meta-knowledge shared across diversified attacks.

Generative data-free quantization has emerged as a practical compression approach that quantizes deep neural networks to low bit-width without accessing real data. It generates data from the batch normalization (BN) statistics of the full-precision network and uses that data to quantize the network. In practice, however, it faces serious accuracy degradation.
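A common recipe for the generation step is to optimize synthetic inputs so that the activation statistics they induce match the BN running statistics stored in the pretrained network. A hedged sketch of such a BN-matching loss (the paper's full method involves more than this term):

```python
import torch
import torch.nn as nn

def bn_statistics_loss(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Penalize the mismatch between the batch statistics that synthetic
    inputs x induce at each BatchNorm2d layer and that layer's running
    (real-data) statistics."""
    feats, hooks = [], []
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            hooks.append(m.register_forward_hook(
                lambda mod, inp, out, store=feats: store.append((mod, inp[0]))))
    model(x)
    for h in hooks:
        h.remove()
    loss = x.new_zeros(())
    for bn, a in feats:
        mu = a.mean(dim=(0, 2, 3))
        var = a.var(dim=(0, 2, 3), unbiased=False)
        loss = loss + (mu - bn.running_mean).pow(2).mean() \
                    + (var - bn.running_var).pow(2).mean()
    return loss
```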

Deep learning based fusion methods have achieved promising performance in image fusion tasks. This is largely attributed to the network architecture, which plays a very important role in the fusion process. In general, however, it is hard to specify a good fusion architecture, and consequently the design of fusion networks remains a black art rather than a science.

In this paper, we present a new approach for model acceleration that exploits the spatial sparsity of visual data. We observe that the final prediction of a vision Transformer is based on only a subset of the most informative regions, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework that prunes redundant tokens progressively and dynamically, conditioned on the input, to accelerate vision Transformers.
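The pruning step itself can be as simple as keeping the top-scoring fraction of tokens, where the scores come from a small prediction head (learned end-to-end in the paper; a plain top-k is used here for clarity):

```python
import torch

def prune_tokens(tokens: torch.Tensor, scores: torch.Tensor,
                 keep_ratio: float) -> torch.Tensor:
    """tokens: (B, N, C), scores: (B, N) predicted informativeness.
    Keeps the top `keep_ratio` fraction of tokens per sample."""
    b, n, c = tokens.shape
    k = max(1, int(n * keep_ratio))
    idx = scores.topk(k, dim=1).indices                   # (B, k)
    return tokens.gather(1, idx.unsqueeze(-1).expand(b, k, c))
```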

Recent advances in self-attention and pure multi-layer perceptron (MLP) models for vision have shown great potential for achieving promising performance with fewer inductive biases. These models generally learn interactions among spatial locations from raw data. The complexity of self-attention and MLP grows quadratically with image size, which makes these models hard to scale up when high-resolution features are required.
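To make the quadratic growth concrete: with N = H x W tokens, self-attention forms N^2 pairwise interactions, so doubling the image side quadruples N and multiplies the pair count by sixteen:

```python
# Doubling the image side: 4x the tokens, 16x the attention pairs.
for side in (14, 28, 56):
    n = side * side
    print(f"{side}x{side} -> {n:5d} tokens, {n * n:,} attention pairs")
```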

Existing image-based rendering methods usually adopt a depth-based image warping operation to synthesize novel views. In this paper, we argue that the essential limitations of the traditional warping operation are its limited neighborhood and its purely distance-based interpolation weights. To this end, we propose content-aware warping, which adaptively learns the interpolation weights for pixels over a relatively large neighborhood from their contextual information via a lightweight neural network.
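A minimal sketch of the idea, assuming K candidate neighbors per target pixel: a small MLP maps the neighbors' features to normalized blending weights, replacing fixed distance-based bilinear weights. Names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class ContentAwareWeights(nn.Module):
    """Predict interpolation weights for K neighbors from their features,
    instead of using fixed distance-based weights."""
    def __init__(self, feat_dim: int, k: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(k * feat_dim, 64), nn.ReLU(), nn.Linear(64, k))

    def forward(self, neighbor_feats: torch.Tensor) -> torch.Tensor:
        # neighbor_feats: (B, K, C) -> (B, K) weights that sum to one
        logits = self.mlp(neighbor_feats.flatten(1))
        return torch.softmax(logits, dim=-1)
```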

In this paper, we propose a deep metric learning method with adaptively composite dynamic constraints (DML-DC) for image retrieval and clustering. Most existing deep metric learning methods impose pre-defined constraints on the training samples, which might not be optimal at all stages of training. To address this, we propose a learnable constraint generator that adaptively produces dynamic constraints to train the metric toward good generalization.

In this article, we propose extremely low-precision vision transformers, called Quantformer, for efficient inference. Conventional network quantization methods directly quantize the weights and activations of fully-connected layers without considering the properties of transformer architectures. Quantization sizably distorts self-attention relative to the full-precision counterpart, and a quantization strategy shared across diversely distributed patch features causes severe quantization errors.
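The effect of a shared quantization scale on diversely distributed features can be seen in a small experiment: when channels have very different magnitudes, one shared scale wastes precision, while finer-grained (here, per-channel) scales cut the error. This is a generic demonstration, not Quantformer's specific scheme:

```python
import torch

def quantize(x: torch.Tensor, bits: int, scale: torch.Tensor) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

x = torch.randn(196, 384) * torch.logspace(-1, 1, 384)  # diverse channel scales
shared = quantize(x, 4, x.abs().max() / 7)              # one scale for everything
per_ch = quantize(x, 4, x.abs().amax(dim=0, keepdim=True) / 7)
print("shared-scale error:", (x - shared).abs().mean().item())
print("per-channel error:", (x - per_ch).abs().mean().item())
```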

In this work, we present NerfingMVS, a new multi-view depth estimation method that utilizes both conventional reconstruction and learning-based priors on top of the recently proposed neural radiance fields (NeRF). Unlike existing neural-network-based optimization methods that rely on estimated correspondences, our method directly optimizes over implicit volumes, eliminating the challenging step of matching pixels in indoor scenes. The key to our approach is to use the learning-based priors to guide the optimization process of NeRF.
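One way learned priors can guide NeRF optimization is by concentrating ray samples in a band around a prior depth estimate rather than spreading them uniformly along the ray. A heavily simplified sketch of that guided-sampling idea (illustrative only, not the paper's exact procedure):

```python
import torch

def sample_depths_around_prior(prior_depth: torch.Tensor, sigma: float,
                               n_samples: int) -> torch.Tensor:
    """Place ray samples in a band of half-width sigma around a learned
    depth prior, instead of uniformly along the whole ray.
    prior_depth: (R,) per-ray prior -> returns (R, n_samples) depths."""
    t = torch.linspace(-1.0, 1.0, n_samples)
    return prior_depth.unsqueeze(-1) + sigma * t
```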

In this paper, we investigate the problem of abductive visual reasoning (AVR), which requires vision systems to infer the most plausible explanation for visual observations. Unlike previous work that performs visual reasoning on static images or synthesized scenes, we exploit long-term reasoning from instructional videos, which contain a wealth of detailed information about the physical world. We conceptualize two tasks for this emerging and challenging topic.

In this paper, we propose PointRas, an uncertainty-aware multi-resolution learning framework for point cloud segmentation. Most existing works on point cloud segmentation design encoder networks to obtain better representations of local space in the point cloud. However, few of them exploit the lower-resolution features produced by the encoder or consider contextual learning across resolutions in the decoder network.

Most existing point cloud instance and semantic segmentation methods rely heavily on strong supervision signals, which require a point-level label for every point in the scene. However, such strong supervision incurs large annotation costs, motivating the study of efficient annotation. In this paper, we discover that the locations of instances matter for both instance and semantic 3D scene segmentation.
