Publications by authors named "Huchuan Lu"

Cross-modal metric learning is a prominent research topic that bridges the semantic heterogeneity between vision and language. Existing methods frequently rely on either simple cosine similarity or complex learned distance metrics to transform pairwise features into a similarity score, leaving them either too weak or too inefficient for accurate distance measurement. We therefore propose a Generalized Structural Sparse Function that dynamically captures thorough and powerful cross-modal relationships for pairwise similarity learning while remaining concise and efficient.
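
As a rough illustration of what this line of work replaces, the sketch below contrasts a fixed cosine score with a small learnable similarity head in PyTorch; the diagonal-plus-projection parameterization is a hypothetical stand-in, not the paper's Generalized Structural Sparse Function, whose exact form the snippet does not give.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LearnableSimilarity(nn.Module):
        """Hypothetical learnable metric in place of a fixed cosine score."""
        def __init__(self, dim):
            super().__init__()
            self.diag = nn.Parameter(torch.ones(dim))    # per-dimension weights
            self.proj = nn.Linear(dim, dim, bias=False)  # shared projection

        def forward(self, v, t):
            # v: (B, D) image features; t: (B, D) text features
            v, t = F.normalize(v, dim=-1), F.normalize(t, dim=-1)
            # Weighted elementwise interaction instead of a plain dot product.
            return (self.proj(v) * self.diag * self.proj(t)).sum(dim=-1)

    sim = LearnableSimilarity(256)
    cosine = F.cosine_similarity(torch.randn(4, 256), torch.randn(4, 256))  # fixed
    learned = sim(torch.randn(4, 256), torch.randn(4, 256))                 # trainable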


Video object segmentation (VOS) has witnessed notable progress due to the establishment of video training datasets and the introduction of diverse, innovative network architectures. However, video mask annotation is a highly intricate and labor-intensive task, as meticulous frame-by-frame comparisons are needed to ascertain the positions and identities of targets in the subsequent frames. Current VOS benchmarks often annotate only a few instances in each video to save costs, which, however, hinders the model's understanding of the complete context of the video scenes.

Article Synopsis
  • Recovering clear images from motion-blurred photos is hard, and existing methods are limited to a fixed number of frames after training.
  • This study introduces a new framework that uses event camera technology to recover sharp frames at any time interval.
  • The proposed method uses a bi-directional recurrent network to process event and blurry-image features jointly, and shows improved performance in generating sharp sequences compared to current techniques (see the sketch below).
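
A minimal sketch of the recurrent fusion idea described above, assuming one event feature per requested time interval and a single blurry-image feature; all module names and dimensions are illustrative, not the paper's architecture.

    import torch
    import torch.nn as nn

    class BiRecurrentFusion(nn.Module):
        def __init__(self, dim=128):
            super().__init__()
            # Bi-directional GRU propagates information forward and backward in time.
            self.rnn = nn.GRU(dim * 2, dim, bidirectional=True, batch_first=True)
            self.head = nn.Linear(dim * 2, dim)

        def forward(self, event_feats, blur_feat):
            # event_feats: (B, T, D), one feature per time interval
            # blur_feat:   (B, D), feature of the single blurry image
            blur = blur_feat.unsqueeze(1).expand_as(event_feats)
            h, _ = self.rnn(torch.cat([event_feats, blur], dim=-1))
            return self.head(h)  # (B, T, D): one sharp-frame feature per interval

    net = BiRecurrentFusion()
    frames = net(torch.randn(2, 7, 128), torch.randn(2, 128))  # T = 7 time steps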
Article Synopsis
  • The study focuses on improving blurry image deblurring by using event cameras, which are bio-inspired visual sensors, to recover sharp frames from blurry images.
  • It critiques existing methods that overlook the importance of blur modeling and proposes a new approach that uses an event-assisted blurriness encoder to implicitly model blur in images.
  • The method integrates blurriness information into a base image unfolding network using novel modulation and aggregation techniques, demonstrating superior performance compared to current state-of-the-art methods on various datasets (a modulation sketch follows below).
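
A minimal sketch of blurriness-conditioned feature modulation, assuming a FiLM-style scale-and-shift scheme; the paper's actual modulation and aggregation designs are not given in the synopsis.

    import torch
    import torch.nn as nn

    class BlurModulation(nn.Module):
        def __init__(self, dim):
            super().__init__()
            # Predict per-pixel scale and shift from the blurriness feature.
            self.to_scale = nn.Conv2d(dim, dim, 1)
            self.to_shift = nn.Conv2d(dim, dim, 1)

        def forward(self, img_feat, blur_feat):
            # img_feat, blur_feat: (B, C, H, W)
            return img_feat * (1 + self.to_scale(blur_feat)) + self.to_shift(blur_feat)

    mod = BlurModulation(64)
    out = mod(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))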

Introducing deep trackers to thermal infrared (TIR) tracking is hampered by the scarcity of large training datasets. To alleviate this predicament, a common approach is full fine-tuning (FFT) of pretrained RGB parameters. However, because FFT trains inefficiently and risks representation collapse, several parameter-efficient fine-tuning (PEFT) alternatives have recently been promoted.
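
As one common PEFT pattern, the sketch below freezes a pretrained linear layer and learns only a low-rank update (LoRA-style); the snippet does not say which PEFT variant the paper actually studies.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen pretrained weight plus a small trainable low-rank update."""
        def __init__(self, base: nn.Linear, rank=4):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False          # keep RGB-pretrained weights intact
            self.down = nn.Linear(base.in_features, rank, bias=False)
            self.up = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.up.weight)       # start as an identity update

        def forward(self, x):
            return self.base(x) + self.up(self.down(x))

    layer = LoRALinear(nn.Linear(256, 256))      # only ~2k of ~66k params trainable
    y = layer(torch.randn(8, 256))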


In this article, we address the challenges of unsupervised video object segmentation (UVOS) by proposing an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. Unlike previous methods that focus solely on integrating appearance with motion or on modeling temporal relations, our method combines both aspects within a unified framework. MTNet effectively merges appearance and motion features during feature extraction inside the encoder, promoting a more complementary representation.
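
A minimal sketch of encoder-level appearance-motion merging via a learned gate; the fusion operator and names are assumptions, not MTNet's actual design.

    import torch
    import torch.nn as nn

    class AppearanceMotionFusion(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Sequential(nn.Conv2d(dim * 2, dim, 1), nn.Sigmoid())

        def forward(self, app_feat, mot_feat):
            # app_feat from the RGB frame, mot_feat from optical flow: (B, C, H, W)
            g = self.gate(torch.cat([app_feat, mot_feat], dim=1))
            return g * app_feat + (1 - g) * mot_feat  # complementary mixture

    fuse = AppearanceMotionFusion(64)
    merged = fuse(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56))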


Recent camouflaged object detection (COD) attempts to segment objects that are visually blended into their surroundings, which is extremely complex and difficult in real-world scenarios. Apart from the high intrinsic similarity between camouflaged objects and their background, the objects are usually diverse in scale, fuzzy in appearance, and even severely occluded. To this end, we propose an effective unified collaborative pyramid network that mimics human behavior when observing vague images and videos.


Pyramid-based deformation decomposition is a promising registration framework, which gradually decomposes the deformation field into multi-resolution subfields for precise registration. However, most pyramid-based methods directly produce one subfield per resolution level, which does not fully depict the spatial deformation. In this paper, we propose a novel registration model, called GroupMorph.
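
A minimal sketch of pyramid-style deformation composition, in which coarse subfields are upsampled and summed into the final field; predicting several grouped subfields per level, as GroupMorph's name suggests, and the displacement-magnitude rescaling are both simplified away here.

    import torch
    import torch.nn.functional as F

    def compose_pyramid(subfields, out_size):
        # subfields: list of (B, 2, h_i, w_i) 2-D displacement fields, coarse to fine
        total = 0
        for f in subfields:
            # Real registration code would also rescale displacement magnitudes.
            total = total + F.interpolate(f, size=out_size, mode="bilinear",
                                          align_corners=False)
        return total  # (B, 2, H, W) full-resolution deformation field

    field = compose_pyramid([torch.randn(1, 2, 16, 16),
                             torch.randn(1, 2, 32, 32)], out_size=(64, 64))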


Image-text matching remains a challenging task due to heterogeneous semantic diversity across modalities and insufficient distance separability within triplets. Unlike previous approaches that focus on enhancing multi-modal representations or exploiting cross-modal correspondence for more accurate retrieval, in this paper we leverage knowledge transfer between peer branches in a boosting manner to seek a more powerful matching model. Specifically, we propose a brand-new Deep Boosting Learning (DBL) algorithm, in which an anchor branch is first trained to provide insight into the data properties, and a target branch then absorbs this knowledge to develop better features and distance metrics.
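
A minimal sketch of peer-branch knowledge transfer, written here as a ranking loss plus a distillation term from the anchor branch; the snippet does not give DBL's actual transfer objective, so this formulation is an assumption.

    import torch
    import torch.nn.functional as F

    def boosting_loss(target_sim, anchor_sim, margin=0.2, tau=0.1):
        # target_sim, anchor_sim: (B, B) image-text similarity matrices,
        # with matched pairs on the diagonal.
        pos = target_sim.diag().unsqueeze(1)
        rank = F.relu(margin - pos + target_sim).mean()       # triplet-style term
        kd = F.kl_div(F.log_softmax(target_sim / tau, dim=1),
                      F.softmax(anchor_sim.detach() / tau, dim=1),
                      reduction="batchmean")                  # anchor -> target
        return rank + kd

    loss = boosting_loss(torch.randn(8, 8), torch.randn(8, 8))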


With the growing demands of applications on online devices, the speed-accuracy trade-off is critical in semantic segmentation systems. Recently, the bilateral segmentation network has shown a promising capacity to balance favorable accuracy with fast speed, and has become the mainstream backbone in real-time semantic segmentation. Segmentation of target objects relies on high-level semantics, yet it also requires detailed low-level features to model specific local patterns for accurate localization.
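
A minimal sketch of the bilateral two-branch pattern: a shallow high-resolution detail branch plus a deeper, heavily downsampled semantic branch, fused before the prediction head. Channel widths and the fusion step are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def block(cin, cout, stride):
        return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1), nn.ReLU())

    class BilateralNet(nn.Module):
        def __init__(self, n_classes=19):
            super().__init__()
            # Shallow branch keeps spatial detail (output stride 8).
            self.detail = nn.Sequential(block(3, 64, 2), block(64, 64, 2),
                                        block(64, 128, 2))
            # Deep branch captures context (output stride 32).
            self.semantic = nn.Sequential(block(3, 64, 2), block(64, 128, 2),
                                          block(128, 128, 2), block(128, 128, 2),
                                          block(128, 128, 2))
            self.head = nn.Conv2d(128, n_classes, 1)

        def forward(self, x):
            d = self.detail(x)                                # (B, 128, H/8, W/8)
            s = F.interpolate(self.semantic(x), size=d.shape[-2:],
                              mode="bilinear", align_corners=False)
            return self.head(d + s)                           # logits at 1/8 scale

    net = BilateralNet()
    logits = net(torch.randn(1, 3, 256, 256))                 # (1, 19, 32, 32)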


Existing image inpainting methods often produce artifacts because they build on vanilla convolution layers, which treat all image regions equally, and generate training holes at random locations with equal probability. This design neither differentiates missing regions from valid regions at inference nor considers the predictability of missing regions during training. To address these issues, we propose a deformable dynamic sampling (DDS) mechanism built on deformable convolutions (DCs), together with a constraint that keeps the deformably sampled elements from falling into corrupted regions.
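
A minimal sketch of mask-aware deformable sampling using torchvision's deformable convolution: offsets are predicted from features plus the hole mask, and the corrupted-region constraint is only indicated in a comment, since its exact form is not in the snippet.

    import torch
    import torch.nn as nn
    from torchvision.ops import deform_conv2d

    class DeformSample(nn.Module):
        def __init__(self, dim, k=3):
            super().__init__()
            # Offsets are conditioned on both features and the hole mask.
            self.offset = nn.Conv2d(dim + 1, 2 * k * k, k, padding=k // 2)
            self.weight = nn.Parameter(torch.randn(dim, dim, k, k) * 0.01)
            self.k = k

        def forward(self, feat, mask):
            # feat: (B, C, H, W); mask: (B, 1, H, W), 1 = valid, 0 = hole
            off = self.offset(torch.cat([feat, mask], dim=1))
            # A training-time constraint would penalize offsets whose sampled
            # locations land inside the hole (mask == 0).
            return deform_conv2d(feat, off, self.weight, padding=self.k // 2)

    m = DeformSample(32)
    out = m(torch.randn(1, 32, 16, 16), torch.ones(1, 1, 16, 16))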


Depth data, whose discriminative power lies predominantly in object location, is advantageous for accurate salient object detection (SOD). Existing RGB-D SOD methods have focused on how to properly use depth information for complementary fusion with RGB data, and have achieved great success. In this work, we attempt a far more ambitious use of the depth information by injecting the depth maps into the encoder of a single-stream model.


Existing works mainly focus on the crowd itself and ignore confusion regions, background areas whose appearance is extremely similar to that of the crowd, even though crowd counting must handle both at the same time. To address this issue, we propose a novel end-to-end trainable confusion-region discriminating and erasing network called CDENet. Specifically, CDENet is composed of two modules: a confusion region mining module (CRM) and a guided erasing module (GEM).


Recently, referring image segmentation has attracted wide attention given its huge potential in human-robot interaction. Networks that identify the referred region must have a deep understanding of both the image and the language semantics. To this end, existing works tend to design various mechanisms for cross-modality fusion, for example, tile-and-concatenation and vanilla non-local manipulation.
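
A minimal sketch of the tile-and-concatenation fusion mentioned above: a pooled sentence feature is tiled over the spatial grid and concatenated with the visual feature map; dimensions are illustrative.

    import torch
    import torch.nn as nn

    class TileConcatFusion(nn.Module):
        def __init__(self, v_dim, l_dim, out_dim):
            super().__init__()
            self.fuse = nn.Conv2d(v_dim + l_dim, out_dim, 1)

        def forward(self, vis, lang):
            # vis: (B, Cv, H, W) visual features; lang: (B, Cl) pooled expression
            B, _, H, W = vis.shape
            tiled = lang[:, :, None, None].expand(B, lang.size(1), H, W)
            return self.fuse(torch.cat([vis, tiled], dim=1))

    f = TileConcatFusion(256, 300, 256)
    fused = f(torch.randn(2, 256, 20, 20), torch.randn(2, 300))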


Advanced deep convolutional neural networks (CNNs) have shown great success in video-based person re-identification (Re-ID). However, they usually focus on the most obvious regions of persons and have limited global representation ability. Recently, Transformers have been shown to explore inter-patch relationships with global observation, yielding performance improvements.


Both salient object detection (SOD) and camouflaged object detection (COD) are typical object segmentation tasks. They are intuitively contradictory, but are intrinsically related. In this paper, we explore the relationship between SOD and COD, and then borrow successful SOD models to detect camouflaged objects to save the design cost of COD models.


Exploiting fine-grained correspondence and visual-semantic alignment has shown great potential in image-text matching. Generally, recent approaches first employ a cross-modal attention unit to capture latent region-word interactions and then integrate all the alignments to obtain the final similarity. However, most of them adopt one-time forward association or aggregation strategies with complex architectures or additional information, ignoring the regulation ability of network feedback.
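
A minimal sketch of a one-pass region-word attention step in the spirit of stacked cross attention (the "one-time forward association" criticized above); names and the aggregation are illustrative.

    import torch
    import torch.nn.functional as F

    def region_word_similarity(regions, words, tau=10.0):
        # regions: (R, D) image region features; words: (W, D) word features
        regions = F.normalize(regions, dim=-1)
        words = F.normalize(words, dim=-1)
        attn = F.softmax(tau * words @ regions.t(), dim=-1)   # (W, R) word-to-region
        attended = attn @ regions                             # region context per word
        sim = F.cosine_similarity(attended, words, dim=-1)    # word-level alignment
        return sim.mean()                                     # single matching score

    score = region_word_similarity(torch.randn(36, 256), torch.randn(12, 256))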


Deep learning-based methods for infrared and visible image fusion generally rely on an unsupervised mechanism with elaborately designed loss functions to retain vital information. However, even a well-designed loss function cannot guarantee that all vital information in the source images is sufficiently extracted. In this work, we propose a novel interactive feature embedding in a self-supervised learning framework for infrared and visible image fusion, attempting to overcome the problem of vital information degradation.


Most existing bi-modal (RGB-D and RGB-T) salient object detection methods utilize the convolution operation and construct complex interwoven fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation imposes a performance ceiling on such methods. In this work, we rethink these tasks from the perspective of global information alignment and transformation.


While deep-learning-based tracking methods have achieved substantial progress, they require large-scale, high-quality annotated data for sufficient training. To eliminate expensive and exhaustive annotation, we study self-supervised (SS) learning for visual tracking. In this work, we develop the crop-transform-paste operation, which synthesizes sufficient training data by simulating the appearance variations encountered during tracking, including object appearance changes and background interference.
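
A minimal NumPy sketch of the crop-transform-paste idea: crop the target, alter its appearance, and paste it onto another frame at a known location, so a pseudo-labeled training pair comes for free; the specific transforms the paper uses are not listed in the snippet.

    import numpy as np

    def crop_transform_paste(frame, box, background):
        x, y, w, h = box
        patch = frame[y:y + h, x:x + w].astype(np.float32)
        patch = np.clip(patch * 1.2 + 10, 0, 255).astype(np.uint8)  # appearance change
        patch = patch[:, ::-1]                                      # horizontal flip
        out = background.copy()
        px, py = 40, 30                           # chosen paste location
        out[py:py + h, px:px + w] = patch
        return out, (px, py, w, h)                # image and its free box label

    frame = np.random.randint(0, 255, (240, 320, 3), np.uint8)
    bg = np.random.randint(0, 255, (240, 320, 3), np.uint8)
    sample, label = crop_transform_paste(frame, (100, 80, 50, 60), bg)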


Correlation plays a critical role in the tracking field, especially in recent popular Siamese-based trackers. The correlation operation is a simple fusion method that considers the similarity between the template and the search region. However, correlation is a local linear matching process that loses semantic information and easily falls into a local optimum, which may be the bottleneck in designing high-accuracy tracking algorithms.
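
For reference, the standard correlation step looks like the sketch below: the template feature acts as a convolution kernel slid over the search-region feature, yielding a response map. This is the generic Siamese cross-correlation being criticized, not this paper's replacement for it.

    import torch
    import torch.nn.functional as F

    def xcorr(template, search):
        # template: (B, C, th, tw); search: (B, C, sh, sw)
        B, C = template.shape[:2]
        # Grouped convolution treats each sample's template as its own kernel.
        resp = F.conv2d(search.reshape(1, B * C, *search.shape[-2:]),
                        template, groups=B)
        return resp.reshape(B, 1, *resp.shape[-2:])

    score = xcorr(torch.randn(2, 64, 7, 7), torch.randn(2, 64, 31, 31))  # (2, 1, 25, 25)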


Thanks to the color independence, illumination invariance, and location discrimination of the depth map, it can provide important supplemental information for extracting salient objects in complex environments. However, high-quality depth sensors are expensive and cannot be widely deployed, while consumer-grade depth sensors produce noisy and sparse depth information that introduces irreversible interference into depth-based networks.


Benefiting from deep learning, defocus blur detection (DBD) has made prominent progress. Existing DBD methods generally study multiscale and multilevel features to improve performance. In this article, from a different perspective, we explore generating adversarial images to attack DBD networks.
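
A minimal sketch of one standard way to craft such an attack image, a single FGSM step; the paper's actual attack on DBD networks may differ, and the tiny stand-in model below is purely illustrative.

    import torch
    import torch.nn.functional as F

    def fgsm_attack(model, image, target, eps=2.0 / 255):
        image = image.clone().requires_grad_(True)
        loss = F.binary_cross_entropy_with_logits(model(image), target)
        loss.backward()
        # One signed-gradient step pushes the prediction away from the target.
        return (image + eps * image.grad.sign()).clamp(0, 1).detach()

    model = torch.nn.Conv2d(3, 1, 3, padding=1)   # stand-in for a DBD network
    adv = fgsm_attack(model, torch.rand(1, 3, 64, 64), torch.zeros(1, 1, 64, 64))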


This paper focuses on referring segmentation, which aims to segment the visual region of an image (or video) that corresponds to a referring expression. Existing methods usually consider the interaction between multi-modal features only at the decoding end of the network; specifically, they fuse the visual features of each scale with the language features separately, thus ignoring the correlation between multi-scale features.


Previous 2D saliency detection methods extract salient cues from a single view and directly predict the expected results. Neither traditional nor deep-learning-based 2D methods consider the geometric information of 3D scenes. Therefore, the relationship between scene understanding and salient objects cannot be effectively established.
