Publications by authors named "Liqiang Nie"

Teaching Visual Question Answering (VQA) models to refrain from answering unanswerable questions is necessary for building a trustworthy AI system. Existing studies have explored various aspects of VQA but have largely overlooked this particular attribute. This paper aims to bridge the research gap by contributing a comprehensive dataset, called UNK-VQA.

Despite remarkable successes in unimodal learning tasks, backdoor attacks against cross-modal learning remain underexplored because of their limited generalization and inferior stealthiness when multiple modalities are involved. Notably, since work in this area mainly inherits ideas from unimodal visual attacks, existing methods struggle to handle diverse cross-modal attack circumstances and to craft imperceptible trigger samples, which hinders their practicality in real-world applications. In this paper, we introduce a novel bilateral backdoor to fill in the missing pieces of the puzzle in the cross-modal backdoor and propose a generalized invisible backdoor framework against cross-modal learning (BadCM).

Misinformation has become a pressing issue. Fake media, in both visual and textual forms, is widespread on the web. While various DeepFake detection and text fake news detection methods have been proposed, they are designed only for single-modality forgery detection based on binary classification and cannot analyze or reason about subtle forgery traces across different modalities.

Image-level labels have prevailed in weakly supervised semantic segmentation tasks due to their easy availability. Since image-level labels can only indicate the presence or absence of specific object categories, visualization-based techniques have been widely adopted to provide object location clues. Because class activation maps (CAMs) can only locate the most discriminative part of objects, recent approaches usually adopt an expansion strategy to enlarge the activation area for more integral object localization.

Although stereo image restoration has been extensively studied, most existing work focuses on restoring stereo images with limited horizontal parallax due to the binocular symmetry constraint. Stereo images with unlimited parallax (e.g.

The query-oriented micro-video summarization task aims to generate a concise sentence with two properties: (a) summarizing the main semantics of the micro-video and (b) being expressed in the form of a search query to facilitate retrieval. Despite its enormous application value in the retrieval area, this direction has barely been explored. Previous studies of summarization mostly focus on content summarization for traditional long videos.

The composed image retrieval (CIR) task aims to retrieve the desired target image for a given multimodal query, i.e., a reference image with its corresponding modification text.

Talking face generation is the process of synthesizing a lip-synchronized video when given a reference portrait and an audio clip. However, generating a fine-grained talking video is nontrivial due to several challenges: 1) capturing vivid facial expressions, such as muscle movements; 2) ensuring smooth transitions between consecutive frames; and 3) preserving the details of the reference portrait. Existing efforts have only focused on modeling rigid lip movements, resulting in low-fidelity videos with jerky facial muscle deformations.

Visual Question Answering (VQA) is fundamentally compositional in nature, and many questions can be answered simply by decomposing them into modular sub-problems. The recently proposed Neural Module Network (NMN) applies this strategy to question answering, but it relies heavily on an off-the-shelf layout parser or an additional expert policy for the network architecture design instead of learning from the data. These strategies adapt poorly to semantically complex variation in the inputs, thereby hindering the representational capacity and generalizability of the model.

Partial person re-identification (ReID) aims to solve the problem of image spatial misalignment due to occlusions or out-of-views. Despite significant progress through the introduction of additional information, such as human pose landmarks, mask maps, and spatial information, partial person ReID remains challenging due to noisy keypoints and impressionable pedestrian representations. To address these issues, we propose a unified attribute-guided collaborative learning scheme for partial person ReID.

Visual Commonsense Reasoning (VCR), regarded as a challenging extension of Visual Question Answering (VQA), pursues higher-level visual comprehension. VCR includes two complementary processes: question answering over a given image and rationale inference to explain the answer. Over the years, a variety of VCR methods have advanced performance on the benchmark dataset.

The goal of talking face generation is to synthesize a sequence of face images of the specified identity, ensuring the mouth movements are synchronized with the given audio. Recently, image-based talking face generation has emerged as a popular approach. It can generate talking face images synchronized with the audio using only a facial image of arbitrary identity and an audio clip.

Semi-supervised learning has been well established in the area of image classification but remains to be explored in video-based action recognition. FixMatch is a state-of-the-art semi-supervised method for image classification, but it does not work well when transferred directly to the video domain since it only utilizes the single RGB modality, which contains insufficient motion information. Moreover, it only leverages highly-confident pseudo-labels to explore consistency between strongly-augmented and weakly-augmented samples, resulting in limited supervised signals, long training time, and insufficient feature discriminability.
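
The confidence-thresholded consistency idea described above can be illustrated with a minimal sketch. This is not the method proposed in the listed article; it only shows the generic FixMatch-style unlabeled loss, in plain Python. The function names `softmax` and `fixmatch_unlabeled_loss`, the list-of-logits input format, and the default threshold of 0.95 are assumptions made for this illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fixmatch_unlabeled_loss(weak_logits, strong_logits, threshold=0.95):
    """FixMatch-style consistency loss on unlabeled samples (illustrative).

    weak_logits / strong_logits: per-sample logits predicted for the weakly
    and strongly augmented views of the same unlabeled inputs (hypothetical
    input format). A sample contributes only if the weak-view prediction is
    at least `threshold` confident; the loss is averaged over the whole batch.
    """
    if not weak_logits:
        return 0.0
    total = 0.0
    for weak, strong in zip(weak_logits, strong_logits):
        weak_probs = softmax(weak)
        confidence = max(weak_probs)
        if confidence < threshold:
            continue  # low-confidence pseudo-label: no supervision signal
        pseudo_label = weak_probs.index(confidence)
        strong_probs = softmax(strong)
        # Cross-entropy between the hard pseudo-label and the strong view.
        total += -math.log(strong_probs[pseudo_label])
    return total / len(weak_logits)
```

In this sketch, every sample that falls below the threshold is skipped, which is one way to see why relying only on highly confident pseudo-labels yields few supervision signals early in training.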

Fashion Compatibility Modeling (FCM), which aims to automatically evaluate whether a given set of fashion items makes a compatible outfit, has attracted increasing research attention. Recent studies have demonstrated the benefits of item representation disentanglement for FCM. Although these efforts have achieved prominent progress, they still perform unsatisfactorily, as they mainly investigate the visual content of fashion items while overlooking the semantic attributes of items (e.

Recently, fashion compatibility modeling, which scores the matching degree of several complementary fashion items, has gained increasing research attention. Previous studies have primarily learned the features of fashion items and utilized their interaction as the fashion compatibility. However, the try-on look of an outfit helps us to learn the fashion compatibility in a combined manner, where items are spatially distributed and partially covered by other items.

Review-involved recommender systems, which use review information to enhance recommendation, have received increasing interest over the past years. Among them, one advanced branch is to extract salient aspects from textual reviews (i.e.

Recent studies have pointed out that many well-developed Visual Question Answering (VQA) models are heavily affected by the language prior problem, in which a model makes predictions based on the co-occurrence patterns between textual questions and answers instead of reasoning over the visual content. To tackle this problem, most existing methods focus on strengthening the visual feature learning capability to reduce the influence of this textual shortcut on model decisions.

This paper tackles the problem of temporal language localization in videos, which aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video. The task is non-trivial, since it requires not only a comprehensive understanding of the video and the sentence query but also accurate capture of the semantic correspondence between them. Existing efforts mainly explore the sequential relation among video clips and query words to reason about the video and sentence query, neglecting the other intra-modal relations (e.

Conversational image search, a revolutionary search mode, interactively elicits user responses to clarify their intents step by step. Several efforts have been dedicated to the conversation part, namely automatically asking the right question at the right time for user preference elicitation, while few studies focus on the image search part given a well-prepared conversational query. In this paper, we work towards conversational image search, which is much more difficult than the traditional image search task due to the following challenges: 1) understanding complex user intents from a multimodal conversational query; 2) utilizing multiform knowledge associated with images from a memory network; and 3) enhancing the image representation with distilled knowledge.

Video moment localization, as an important branch of video content analysis, has attracted extensive attention in recent years. However, it is still in its infancy due to the following challenges: cross-modal semantic alignment and localization efficiency. To address these impediments, we present a cross-modal semantic alignment network.

Due to the continuous booming of surveillance and Web videos, video moment localization, as an important branch of video content analysis, has attracted wide attention from both industry and academia in recent years. It is, however, a non-trivial task due to the following challenges: temporal context modeling, intelligent moment candidate generation, as well as the necessary efficiency and scalability in practice. To address these impediments, we present a deep end-to-end cross-modal hashing network.

Video person re-identification (video Re-ID) plays an important role in surveillance video analysis and has gained increasing attention recently. However, existing supervised methods require vast labeled identities across cameras, resulting in poor scalability in practical applications. Although some unsupervised approaches have been exploited for video Re-ID, they are still in their infancy due to the complex nature of learning discriminative features on unlabelled data.

Cross-modal retrieval has recently attracted much interest along with the rapid development of multimodal data, and effectively utilizing the complementary relationship of different modal data and eliminating the heterogeneous gap as much as possible are the two key challenges. In this paper, we present a novel network model termed cross-modal Dual Subspace learning with Adversarial Network (DSAN). The main contributions are as follows: (1) Dual subspaces (visual subspace and textual subspace) are proposed, which can better mine the underlying structure information of different modalities as well as modality-specific information.

Efficient hashing techniques have attracted extensive research interest in both storage and retrieval of high-dimensional data, such as images and videos. In existing hashing methods, a linear model is commonly utilized owing to its efficiency. To obtain better accuracy, linear-based hashing methods focus on designing a generalized linear objective function with different constraints or penalty terms that consider the inherent characteristics and neighborhood information of samples.

A GWI survey has highlighted the flourishing use of multiple social networks: the average number of social media accounts per Internet user is 5.54, of which 2.82 are actively used.
