IEEE Trans Pattern Anal Mach Intell
September 2024
IEEE Trans Pattern Anal Mach Intell
December 2024
While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs can still suffer from certain commonly seen limitations, e.g., coarse-grained cross-modal aligning, under-modeling of temporal dynamics, detached video-language view.
View Article and Find Full Text PDFIEEE Trans Pattern Anal Mach Intell
August 2023
Video Question Answering (VideoQA) is the task of answering questions about a video. At its core is the understanding of the alignments between video scenes and question semantics to yield the answer. In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), tends to over-exploit the spurious correlations between question-irrelevant scenes and answers, instead of inspecting the causal effect of question-critical scenes, which undermines the prediction with unreliable reasoning.
View Article and Find Full Text PDFIEEE Trans Pattern Anal Mach Intell
November 2023
We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations and dynamics, for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of multi-modal transformer for answer classification.
View Article and Find Full Text PDFIEEE Trans Pattern Anal Mach Intell
October 2023
Temporal grounding is the task of locating a specific segment from an untrimmed video according to a query sentence. This task has achieved significant momentum in the computer vision community as it enables activity grounding beyond pre-defined activity classes by utilizing the semantic diversity of natural language descriptions. The semantic diversity is rooted in the principle of compositionality in linguistics, where novel semantics can be systematically described by combining known words in novel ways (compositional generalization).
View Article and Find Full Text PDFIEEE Trans Neural Netw Learn Syst
April 2024
The present machine learning schema typically uses a one-pass model inference (e.g., forward propagation) to make predictions in the testing phase.
View Article and Find Full Text PDFIEEE Trans Pattern Anal Mach Intell
August 2023
In recommender systems, users' behavior data are driven by the interactions of user-item latent factors. To improve recommendation effectiveness and robustness, recent advances focus on latent factor disentanglement via variational inference. Despite significant progress, uncovering the underlying interactions, i.
View Article and Find Full Text PDFIEEE Trans Image Process
June 2022
Although the single-image super-resolution (SISR) methods have achieved great success on the single degradation, they still suffer performance drop with multiple degrading effects in real scenarios. Recently, some blind and non-blind models for multiple degradations have been explored. However, these methods usually degrade significantly for distribution shifts between the training and test data.
View Article and Find Full Text PDFIEEE Trans Pattern Anal Mach Intell
February 2023
Explainability is crucial for probing graph neural networks (GNNs), answering questions like "Why the GNN model makes a certain prediction?". Feature attribution is a prevalent technique of highlighting the explanatory subgraph in the input graph, which plausibly leads the GNN model to make its prediction. Various attribution methods have been proposed to exploit gradient-like or attention scores as the attributions of edges, then select the salient edges with top attribution scores as the explanation.
View Article and Find Full Text PDFIEEE Trans Image Process
January 2022
The task of video moment retrieval (VMR) is to retrieve the specific video moment from an untrimmed video, according to a textual query. It is a challenging task that requires effective modeling of complex cross-modal matching relationship. Recent efforts primarily model the cross-modal interactions by hand-crafted network architectures.
View Article and Find Full Text PDFIEEE Trans Pattern Anal Mach Intell
October 2022
Images can convey rich semantics and induce various emotions in viewers. Recently, with the rapid advancement of emotional intelligence and the explosive growth of visual data, extensive research efforts have been dedicated to affective image content analysis (AICA). In this survey, we will comprehensively review the development of AICA in the recent two decades, especially focusing on the state-of-the-art methods with respect to three main challenges - the affective gap, perception subjectivity, and label noise and absence.
View Article and Find Full Text PDFIEEE Trans Image Process
June 2021
Food recognition has captured numerous research attention for its importance for health-related applications. The existing approaches mostly focus on the categorization of food according to dish names, while ignoring the underlying ingredient composition. In reality, two dishes with the same name do not necessarily share the exact list of ingredients.
View Article and Find Full Text PDFMeta-learning has been proposed as a framework to address the challenging few-shot learning setting. The key idea is to leverage a large number of similar few-shot tasks in order to learn how to adapt a base-learner to a new task for which only a few labeled samples are available. As deep neural networks (DNNs) tend to overfit using a few samples only, typical meta-learning models use shallow neural networks, thus limiting its effectiveness.
View Article and Find Full Text PDFWith superhuman-level performance of face recognition, we are more concerned about the recognition of fine-grained attributes, such as emotion, age, and gender. However, given that the label space is extremely large and follows a long-tail distribution, it is quite expensive to collect sufficient samples for fine-grained attributes. This results in imbalanced training samples and inferior attribute recognition models.
View Article and Find Full Text PDFIEEE Trans Neural Netw Learn Syst
August 2020
Matrix factorization (MF) has been widely used to discover the low-rank structure and to predict the missing entries of data matrix. In many real-world learning systems, the data matrix can be very high dimensional but sparse. This poses an imbalanced learning problem since the scale of missing entries is usually much larger than that of the observed entries, but they cannot be ignored due to the valuable negative signal.
View Article and Find Full Text PDFRecently, a great progress in automatic image captioning has been achieved by using semantic concepts detected from the image. However, we argue that existing concepts-to-caption framework, in which the concept detector is trained using the image-caption pairs to minimize the vocabulary discrepancy, suffers from the deficiency of insufficient concepts. The reasons are two-fold: 1) the extreme imbalance between the number of occurrence positive and negative samples of the concept and 2) the incomplete labeling in training captions caused by the biased annotation and usage of synonyms.
View Article and Find Full Text PDFDetecting events from massive social media data in social networks can facilitate browsing, search, and monitoring of real-time events by corporations, governments, and users. The short, conversational, heterogeneous, and real-time characteristics of social media data bring great challenges for event detection. The existing event detection approaches rely mainly on textual information, while the visual content of microblogs and the intrinsic correlation among the heterogeneous data are scarcely explored.
View Article and Find Full Text PDFEURASIP J Bioinform Syst Biol
December 2016
Online community-based health services accumulate a huge amount of unstructured health question answering (QA) records at a continuously increasing pace. The ability to organize these health QA records has been found to be effective for data access. The existing approaches for organizing information are often not applicable to health domain due to its domain nature as characterized by complex relation among entities, large vocabulary gap, and heterogeneity of users.
View Article and Find Full Text PDFIEEE Trans Image Process
March 2016
We present a deep learning strategy to fuse multiple semantic cues for complex event recognition. In particular, we tackle the recognition task by answering how to jointly analyze human actions (who is doing what), objects (what), and scenes (where). First, each type of semantic features (e.
View Article and Find Full Text PDFNonnegative matrix factorization (NMF) has received considerable attention in image processing, computer vision, and patter recognition. An important variant of NMF is nonnegative graph embedding (NGE), which encodes the statistical or geometric information of data in the process of matrix factorization. The NGE offers a general framework for unsupervised/supervised settings.
View Article and Find Full Text PDFIEEE Trans Image Process
April 2012
User interaction is an effective way to handle the semantic gap problem in image annotation. To minimize user effort in the interactions, many active learning methods were proposed. These methods treat the semantic concepts individually or correlatively.
View Article and Find Full Text PDFRecently, extensive research efforts have been dedicated to view-based methods for 3-D object retrieval due to the highly discriminative property of multiviews for 3-D object representation. However, most of state-of-the-art approaches highly depend on their own camera array settings for capturing views of 3-D objects. In order to move toward a general framework for 3-D object retrieval without the limitation of camera array restriction, a camera constraint-free view-based (CCFV) 3-D object retrieval algorithm is proposed in this paper.
View Article and Find Full Text PDF