AI Article Synopsis

  • Visual knowledge bases like Visual Genome are crucial for computer vision tasks, but they struggle with sparse and incomplete data regarding visual relationships.
  • Existing scene graph models train only on the small set of visual relationships that have thousands of labels each; annotating more data by hand is expensive, and text-based knowledge base completion methods do not transfer to visual data.
  • The paper presents a semi-supervised approach that uses a generative model to assign probabilistic relationship labels to a large number of unlabeled images from only a few labeled examples, significantly improving scene graph prediction performance.

Article Abstract

Visual knowledge bases such as Visual Genome power numerous applications in computer vision, including visual question answering and captioning, but suffer from sparse, incomplete relationships. All scene graph models to date are limited to training on a small set of visual relationships that have thousands of training labels each. Hiring human annotators is expensive, and textual knowledge base completion methods are incompatible with visual data. In this paper, we introduce a semi-supervised method that assigns probabilistic relationship labels to a large number of unlabeled images using few labeled examples. We analyze visual relationships to suggest two types of image-agnostic features that are used to generate noisy heuristics, whose outputs are aggregated using a factor graph-based generative model. With as few as 10 labeled examples per relationship, the generative model creates enough training data to train any existing state-of-the-art scene graph model. We demonstrate that our method outperforms all baseline approaches on scene graph prediction by 5.16 recall@100 for PREDCLS. In our limited label setting, we define a complexity metric for relationships that serves as an indicator (R = 0.778) for conditions under which our method succeeds over transfer learning, the de facto approach for training with limited labels.
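The abstract describes the pipeline only at a high level. Below is a minimal, hypothetical illustration of the weak-supervision step it mentions: several noisy heuristics vote on unlabeled relationship candidates, and their votes are combined into probabilistic labels. The paper aggregates votes with a factor graph-based generative model; this sketch substitutes a simple accuracy-weighted vote estimated from the few labeled examples, and all function names and the toy data are invented for illustration.

```python
import numpy as np

def heuristic_accuracies(votes_labeled, y_labeled):
    """Estimate each heuristic's accuracy on the small labeled set.

    votes_labeled: (n_labeled, n_heuristics) array over {-1, 0, +1};
                   0 means the heuristic abstains on that example.
    y_labeled:     (n_labeled,) array of gold labels in {-1, +1}.
    """
    accs = []
    for j in range(votes_labeled.shape[1]):
        voted = votes_labeled[:, j] != 0
        if not voted.any():
            accs.append(0.5)  # heuristic never fired: treat as a coin flip
        else:
            accs.append((votes_labeled[voted, j] == y_labeled[voted]).mean())
    return np.clip(np.array(accs), 1e-3, 1 - 1e-3)

def probabilistic_labels(votes_unlabeled, accs):
    """Combine heuristic votes into P(y = +1) for each unlabeled example,
    weighting each heuristic by the log-odds of its estimated accuracy."""
    weights = np.log(accs / (1.0 - accs))   # confident heuristics count more
    scores = votes_unlabeled @ weights      # abstentions (0) contribute nothing
    return 1.0 / (1.0 + np.exp(-scores))    # soft labels in (0, 1)

# Toy usage: 3 heuristics, 10 labeled and 5 unlabeled relationship candidates.
rng = np.random.default_rng(0)
votes_l = rng.choice([-1, 0, 1], size=(10, 3))
y_l = rng.choice([-1, 1], size=10)
votes_u = rng.choice([-1, 0, 1], size=(5, 3))

soft_labels = probabilistic_labels(votes_u, heuristic_accuracies(votes_l, y_l))
print(soft_labels)
```

The soft labels produced this way play the role the generative model's output plays in the paper: they serve as training targets for an existing scene graph prediction model.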

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7098690 (PMC)
http://dx.doi.org/10.1109/iccv.2019.00267 (DOI Listing)

Publication Analysis

Top Keywords

scene graph (16)
graph prediction (8)
limited labels (8)
visual relationships (8)
labeled examples (8)
generative model (8)
visual (6)
scene (4)
limited (4)
prediction limited (4)

Similar Publications

Temporal Multi-Modal Knowledge Graphs (TMMKGs) can be regarded as a synthesis of Temporal Knowledge Graphs (TKGs) and Multi-Modal Knowledge Graphs (MMKGs), combining the characteristics of both. TMMKGs can effectively model dynamic real-world phenomena, particularly in scenarios involving multiple heterogeneous information sources and time series characteristics, such as e-commerce websites, scene recording data, and intelligent transportation systems. We propose a Temporal Multi-Modal Knowledge Graph Generation (TMMKGG) method that can automatically construct TMMKGs, aiming to reduce construction costs.
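As a purely illustrative aside (not the TMMKGG construction method described above), a temporal multi-modal knowledge graph can be thought of as timestamped triples whose entities carry attachments in other modalities; the hypothetical sketch below makes that structure concrete.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    # Paths or URIs to associated non-textual data (images, audio, sensor streams).
    modalities: dict = field(default_factory=dict)

@dataclass
class TemporalFact:
    subject: Entity
    relation: str
    obj: Entity
    timestamp: str  # a single timestamp here; validity intervals are another common choice

# Toy e-commerce example: a purchase event whose object carries an image attachment.
product = Entity("product_42", modalities={"image": "img/product_42.jpg"})
user = Entity("user_7")
fact = TemporalFact(user, "purchased", product, "2024-12-01")
print(fact.subject.name, fact.relation, fact.obj.name, "on", fact.timestamp)
```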

Visual semantic decoding aims to extract perceived semantic information from the visual responses of the human brain and convert it into interpretable semantic labels. Although significant progress has been made in semantic decoding across individual visual cortices, studies on the semantic decoding of the ventral and dorsal cortical visual pathways remain limited. This study proposed a graph neural network (GNN)-based semantic decoding model on a natural scene dataset (NSD) to investigate the decoding differences between the dorsal and ventral pathways in processing various parts of speech, including verbs, nouns, and adjectives.

DICCR: Double-gated intervention and confounder causal reasoning for vision-language navigation.

Neural Netw

December 2024

School of Computer and Electronic Information, Guangxi University, University Road, Nanning, 530004, Guangxi, China.

Vision-language navigation (VLN) is a challenging task that requires agents to capture the correlations between modalities from redundant information according to instructions, and then make sequential decisions over visual scenes and text instructions in the action space. Recent research has focused on extracting visual features and enhancing textual knowledge, ignoring potential biases in multi-modal data and the problem of spurious correlations between vision and text. This paper therefore studies the relational structure between multi-modal data from the perspective of causality and weakens spurious correlations between modalities through cross-modal causal reasoning.

Visual question generation involves generating meaningful questions about an image. Although significant progress has been made in automatically generating a single high-quality question for an image, existing methods often ignore the diversity and interpretability of the generated questions, which are important for many daily tasks that require clear question sources. In this paper, we propose an explicitly diverse visual question generation model that aims to generate diverse questions based on interpretable question sources.

VIIDA and InViDe: computational approaches for generating and evaluating inclusive image paragraphs for the visually impaired.

Disabil Rehabil Assist Technol

December 2024

Department of Informatics, Universidade Federal de Viçosa - UFV, Viçosa, Brazil.

Article Synopsis
  • Existing image description methods for blind or low vision individuals are often inadequate, either oversimplifying visuals into short captions or overwhelming users with lengthy descriptions.
  • VIIDA is introduced as a new procedure to enhance image description specifically for webinar scenes, along with InViDe, a metric for evaluating these descriptions based on accessibility for BLV people.
  • Utilizing advanced tech like a multimodal Visual Question Answering model and Natural Language Processing, VIIDA effectively creates descriptions closely matching image content, while InViDe provides insights into the effectiveness of various methods, fostering further development in Assistive Technologies.