IEEE Trans Pattern Anal Mach Intell
October 2024
Deep learning-based low-light image enhancement (LLIE) is the task of leveraging deep neural networks to enhance image illumination while keeping the image content unchanged. From the perspective of training data, existing methods complete the LLIE task driven by one of three data types: paired data, unpaired data, and zero-reference data. Each of these data-driven method types has its own advantages.
IEEE Trans Image Process
September 2024
Multi-human parsing is an image segmentation task that requires both instance-level and fine-grained category-level information. However, prior research has typically processed these two types of information through distinct branch types and output formats, leading to inefficient and redundant frameworks. This paper introduces UniParser, which integrates instance-level and category-level representations in three key aspects: 1) we propose a unified correlation representation learning approach, allowing our network to learn instance and category features within the cosine space; 2) we unify the output form of each module as pixel-level results while supervising instance and category features with a homogeneous label accompanied by an auxiliary loss; and 3) we design a joint optimization procedure to fuse instance and category representations.
IEEE Trans Pattern Anal Mach Intell
December 2024
In deep learning, different kinds of deep networks typically need different optimizers, which have to be chosen after multiple trials, making the training process inefficient. To relieve this issue and consistently improve the model training speed across deep networks, we propose the ADAptive Nesterov momentum algorithm, Adan for short. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra overhead of computing the gradient at the extrapolation point.
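A minimal sketch of the momentum-estimation idea described in the abstract, written in plain NumPy; the coefficient names, default values, and the weight-decay handling below are illustrative assumptions rather than the exact published Adan update.

```python
import numpy as np

def adan_like_step(param, grad, prev_grad, state, lr=1e-3,
                   beta1=0.98, beta2=0.92, beta3=0.99, eps=1e-8, wd=0.0):
    """One illustrative Adan-style step (assumed form, not official code).

    Instead of evaluating the gradient at an extrapolated point (classic
    Nesterov), the extrapolated gradient is *estimated* from the current
    gradient and its difference to the previous gradient.
    """
    diff = grad - prev_grad
    state["m"] = (1 - beta1) * state["m"] + beta1 * grad        # gradient moment
    state["v"] = (1 - beta2) * state["v"] + beta2 * diff        # gradient-difference moment
    est = grad + (1 - beta2) * diff                             # estimated extrapolated gradient
    state["n"] = (1 - beta3) * state["n"] + beta3 * est ** 2    # second moment of the estimate
    update = (state["m"] + (1 - beta2) * state["v"]) / (np.sqrt(state["n"]) + eps)
    return (param - lr * update) / (1 + lr * wd)                # decoupled weight decay
```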
IEEE Trans Pattern Anal Mach Intell
December 2024
While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs can still suffer from several commonly seen limitations, e.g., coarse-grained cross-modal alignment, under-modeling of temporal dynamics, and a detached video-language view.
IEEE Trans Pattern Anal Mach Intell
September 2024
AdamW modifies Adam by adding a decoupled weight decay that decays the network weights at each training iteration. For adaptive algorithms, this decoupled weight decay does not change the adaptive optimization steps themselves, and thus differs from the widely used l2-regularizer, which alters the optimization steps by changing the first- and second-order gradient moments. Despite its great practical success, a characterization of AdamW's convergence behavior and of its generalization improvement over Adam and l2-regularized Adam (l2-Adam) is still absent.
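The contrast between the two regularization styles can be made concrete with a small sketch (simplified Adam without bias correction; hyper-parameter names and defaults are illustrative):

```python
import numpy as np

def adam_moments(grad, m, v, beta1=0.9, beta2=0.999):
    # Standard exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    return m, v

def l2_adam_step(param, grad, m, v, lr=1e-3, lam=1e-2, eps=1e-8):
    # l2 regularization: lam * param is folded into the gradient, so the
    # penalty is rescaled by the adaptive denominator and changes the moments.
    g = grad + lam * param
    m, v = adam_moments(g, m, v)
    return param - lr * m / (np.sqrt(v) + eps), m, v

def adamw_step(param, grad, m, v, lr=1e-3, lam=1e-2, eps=1e-8):
    # Decoupled weight decay: the moments see only the raw gradient, and the
    # decay term bypasses the adaptive rescaling entirely.
    m, v = adam_moments(grad, m, v)
    return param - lr * m / (np.sqrt(v) + eps) - lr * lam * param, m, v
```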
IEEE Trans Pattern Anal Mach Intell
May 2024
Vision-Language Pre-Training (VLP) has demonstrated remarkable potential in aligning image and text pairs, paving the way for a wide range of cross-modal learning tasks. Nevertheless, we have observed that VLP models often fall short in terms of visual grounding and localization capabilities, which are crucial for many downstream tasks, such as visual reasoning. In response, we introduce a novel Position-guided Text Prompt (PTP) paradigm to bolster the visual grounding abilities of cross-modal models trained with VLP.
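As a rough illustration of what a position-guided text prompt could look like, the snippet below builds a prompt for an image divided into a grid of blocks; the block layout, wording, and function name are hypothetical and not necessarily the exact template used by PTP.

```python
def position_prompt(block_id: int, obj: str, grid: int = 3) -> str:
    """Build a hypothetical position-guided prompt for a grid-divided image."""
    row, col = divmod(block_id, grid)
    return f"The block at row {row}, column {col} has a {obj}."

# e.g., appended to a caption during pre-training:
# "A girl reading on a bench. " + position_prompt(4, "book")
```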
IEEE Trans Pattern Anal Mach Intell
May 2024
Existing Transformers for monocular 3D human shape and pose estimation typically have a quadratic computation and memory complexity with respect to the feature length, which hinders the exploitation of fine-grained information in high-resolution features that is beneficial for accurate reconstruction. In this work, we propose an SMPL-based Transformer framework (SMPLer) to address this issue. SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation, which allow effective utilization of high-resolution features in the Transformer.
MetaFormer, the abstracted architecture of the Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer by once again shifting our focus away from token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers, and demonstrate their gratifying performance. We summarize our observations as follows: (1) MetaFormer ensures a solid lower bound of performance.
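A compact sketch of the abstracted MetaFormer block (normalization, an interchangeable token mixer, and a channel MLP, each with a residual connection); module names and sizes here are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MetaFormerBlock(nn.Module):
    """Generic MetaFormer block: the token mixer is a plug-in component."""

    def __init__(self, dim: int, token_mixer: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = token_mixer          # e.g. pooling, identity, or attention
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                       # x: (batch, tokens, dim)
        x = x + self.token_mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

# Even a trivially simple mixer still yields a working block:
block = MetaFormerBlock(dim=64, token_mixer=nn.Identity())
```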
IEEE Trans Pattern Anal Mach Intell
November 2023
Unsupervised domain adaptation has been widely adopted in tasks with scarce annotated data. Unfortunately, mapping the target-domain distribution to the source domain unconditionally may distort the essential structural information of the target-domain data, leading to inferior performance. To address this issue, we first propose to introduce active sample selection to assist domain adaptation for the semantic segmentation task.
IEEE Trans Pattern Anal Mach Intell
November 2023
We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) it proposes a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations, and their dynamics for complex spatio-temporal reasoning; 2) it designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of using a multi-modal transformer for answer classification.
Deep long-tailed learning, one of the most challenging problems in visual recognition, aims to train well-performing deep models from a large number of images that follow a long-tailed class distribution. In the last decade, deep learning has emerged as a powerful recognition model for learning high-quality image representations and has led to remarkable breakthroughs in generic visual recognition. However, long-tailed class imbalance, a common problem in practical visual recognition tasks, often limits the practicality of deep network-based recognition models in real-world applications, since they can be easily biased towards dominant classes and perform poorly on tail classes.
IEEE Trans Pattern Anal Mach Intell
November 2023
Zero-shot learning (ZSL) tackles the novel class recognition problem by transferring semantic knowledge from seen classes to unseen ones. Semantic knowledge is typically represented by attribute descriptions shared between different classes, which act as strong priors for localizing object attributes and yield discriminative region features, enabling the rich visual-semantic interaction needed to advance ZSL. Existing attention-based models, however, rely solely on unidirectional attention and thus learn inferior region features from a single image, ignoring the transferable and discriminative attribute localization of visual features that carries the key semantic knowledge for effective knowledge transfer in ZSL.
IEEE Trans Neural Netw Learn Syst
April 2024
Studying the relationship between linear discriminant analysis (LDA) and least squares regression (LSR) is of great theoretical and practical significance. It is well known that two-class LDA is equivalent to an LSR problem; directly casting multiclass LDA as an LSR problem, however, is more challenging. A recent study reveals that the equivalence between multiclass LDA and LSR can be established with a special class indicator matrix, but only under a mild condition that may not hold for low-dimensional or oversampled data.
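For the two-class case, the equivalence mentioned above can be stated compactly. The target coding below is one standard textbook choice, not necessarily the indicator matrix used in the cited study:

```latex
% Code the regression targets as t_i = n/n_1 for class 1 and t_i = -n/n_2 for class 2.
\min_{w,\,b}\ \sum_{i=1}^{n} \left( w^{\top} x_i + b - t_i \right)^2
\qquad\Longrightarrow\qquad
w \,\propto\, S_w^{-1}\left( \mu_1 - \mu_2 \right)
```

Here S_w is the within-class scatter matrix and μ1, μ2 are the class means, so the least-squares solution recovers the Fisher discriminant direction.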
IEEE Trans Image Process
October 2022
Few-shot segmentation aims at learning to segment query images guided by only a few annotated images from the support set. Previous methods rely on mining the feature embedding similarity between the query and the support images to achieve successful segmentation. However, these models tend to perform poorly when the query instances differ substantially from the support ones.
IEEE Trans Neural Netw Learn Syst
April 2024
The field of fashion compatibility learning has attracted great attention from both the academic and industrial communities in recent years. Many studies have been carried out for fashion compatibility prediction, collocated outfit recommendation, artificial intelligence (AI)-enabled compatible fashion design, and related topics. In particular, AI-enabled compatible fashion design can be used to synthesize compatible fashion items or outfits to improve the design experience for designers or the efficacy of recommendations for customers.
Recently, Vision Transformers (ViTs) have been broadly explored in visual recognition. Owing to their low efficiency in encoding fine-level features, the performance of ViTs is still inferior to that of state-of-the-art CNNs when trained from scratch on a midsize dataset like ImageNet. Through experimental analysis, we find this is due to two reasons: 1) the simple tokenization of input images fails to model important local structure such as edges and lines, leading to low training sample efficiency; 2) the redundant attention backbone design of ViTs leads to limited feature richness under fixed computation budgets and limited training samples.
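The "simple tokenization" referred to above is typically a single non-overlapping patch projection; a minimal sketch (patch size and dimensions are illustrative) shows why local structure inside and across patches is flattened away:

```python
import torch
import torch.nn as nn

class NaivePatchTokenizer(nn.Module):
    """ViT-style tokenization: each 16x16 patch is flattened into one token,
    so edges and lines crossing patch borders are never modeled jointly."""

    def __init__(self, in_ch=3, dim=384, patch=16):
        super().__init__()
        # A strided convolution is equivalent to split-and-project.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                                # x: (B, 3, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)   # (B, num_patches, dim)
```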
IEEE Trans Pattern Anal Mach Intell
January 2023
In this paper, we present Vision Permutator, a conceptually simple and data-efficient MLP-like architecture for visual recognition. Recognizing the importance of the positional information carried by 2D feature representations, Vision Permutator encodes the feature representations along the height and width dimensions separately with linear projections, unlike recent MLP-like models that encode spatial information along the flattened spatial dimensions. This allows Vision Permutator to capture long-range dependencies while avoiding the attention-building process in transformers.
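A simplified sketch of encoding height and width separately with linear projections; the actual Permute-MLP in Vision Permutator additionally splits channels into segments, so treat this as an approximation rather than the published layer.

```python
import torch
import torch.nn as nn

class SeparateHWMixer(nn.Module):
    """Simplified sketch: mix tokens along height and width independently
    with linear projections, then fuse (an approximation of Permute-MLP)."""

    def __init__(self, dim: int, height: int, width: int):
        super().__init__()
        self.mix_h = nn.Linear(height, height)   # operates along the H axis
        self.mix_w = nn.Linear(width, width)     # operates along the W axis
        self.mix_c = nn.Linear(dim, dim)         # channel mixing

    def forward(self, x):                         # x: (B, H, W, C)
        h = self.mix_h(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        w = self.mix_w(x.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)
        c = self.mix_c(x)
        return h + w + c
```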
IEEE Trans Pattern Anal Mach Intell
November 2022
In this paper, we address the makeup transfer and removal tasks simultaneously, which aim to transfer the makeup from a reference image to a source image and to remove the makeup from a with-makeup image, respectively. Existing methods have made considerable progress in constrained scenarios, but it is still very challenging for them to transfer makeup between images with large pose and expression differences, or to handle makeup details like blush on the cheeks or highlight on the nose. In addition, they are hardly able to control the degree of makeup during transfer or to transfer a specified part of the input face.
IEEE Trans Image Process
May 2021
Single Image Deraining (SID) is a relatively new and still challenging topic in emerging vision applications, and most recently proposed deraining methods are trained in a supervised manner that depends on ground truth (i.e., paired data).
IEEE Trans Pattern Anal Mach Intell
September 2022
Vision and language understanding techniques have achieved remarkable progress, but they still have difficulty handling problems that involve very fine-grained details. For example, when a robot is told to "bring me the book in the girl's left hand", most existing methods would fail if the girl holds a book in each hand. In this work, we introduce a new task named human-centric relation segmentation (HRS), a fine-grained case of human-object interaction detection (HOI-det).
Learning to capture dependencies between spatial positions is essential to many visual tasks, especially dense labeling problems like scene parsing. Existing methods can effectively capture long-range dependencies with the self-attention mechanism and short-range ones with local convolution. However, there is still a large gap between long-range and short-range dependencies, which greatly reduces the models' flexibility when applied to the diverse spatial scales and relationships in complicated natural scene images.
Despite remarkable progress in face recognition related technologies, reliably recognizing faces across ages remains a big challenge. The appearance of a human face changes substantially over time, resulting in significant intra-class variations. As opposed to current techniques for age-invariant face recognition, which either directly extract age-invariant features for recognition or first synthesize a face that matches the target age before feature extraction, we argue that it is more desirable to perform both tasks jointly so that they can leverage each other.
IEEE Trans Neural Netw Learn Syst
May 2021
Unsupervised domain adaptation (UDA) aims at reducing the distribution discrepancy when transferring knowledge from a labeled source domain to an unlabeled target domain. Previous UDA methods assume that the source and target domains share an identical label space, which is unrealistic in practice since the label information of the target domain is unknown. This article focuses on a more realistic UDA scenario.
Scene graph generation has received increasing attention in recent years. Enhancing the predicate representations is an important entry point to this task, and various methods have been proposed to investigate contextual information for representation enhancement.
In this article, we propose a deep extension of sparse subspace clustering (SSC), termed deep subspace clustering with L1-norm (DSC-L1). Regularized by a unit-sphere distribution assumption on the learned deep features, DSC-L1 can infer a new data affinity matrix by simultaneously satisfying the sparsity principle of SSC and the nonlinearity provided by neural networks. One of the appealing advantages of DSC-L1 is that when the original real-world data do not meet the class-specific linear subspace distribution assumption, DSC-L1 can employ neural networks to make the assumption valid through its nonlinear transformations.
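A rough sketch of the self-expressive objective that sparse-subspace-style methods optimize on learned deep features; the loss weighting and the network that produces the features are assumptions here, not the exact DSC-L1 formulation.

```python
import torch

def self_expression_loss(z, coeff, lam=1.0):
    """Sparse self-expression on deep features z (N x d).

    Each sample is reconstructed from the others, z ≈ C z, with an L1 penalty
    on C encouraging sparsity; C (with its diagonal removed) then serves as
    the data affinity matrix for spectral clustering.
    """
    c = coeff - torch.diag(torch.diag(coeff))     # forbid trivial self-reconstruction
    recon = c @ z
    return ((recon - z) ** 2).sum() + lam * c.abs().sum()
```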