Visual knowledge bases such as Visual Genome power numerous applications in computer vision, including visual question answering and captioning, but suffer from sparse, incomplete relationships. All scene graph models to date are limited to training on a small set of visual relationships that have thousands of training labels each. Hiring human annotators is expensive, and textual knowledge base completion methods are incompatible with visual data. In this paper, we introduce a semi-supervised method that assigns probabilistic relationship labels to a large number of unlabeled images using few labeled examples. We analyze visual relationships to suggest two types of image-agnostic features that are used to generate noisy heuristics, whose outputs are aggregated using a factor graph-based generative model. With as few as 10 labeled examples per relationship, the generative model creates enough training data to train any existing state-of-the-art scene graph model. We demonstrate that our method outperforms all baseline approaches on scene graph prediction by 5.16 recall@100 for PREDCLS. In our limited-label setting, we define a complexity metric for relationships that serves as an indicator (R = 0.778) for conditions under which our method succeeds over transfer learning, the de facto approach for training with limited labels.
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7098690 | PMC |
| http://dx.doi.org/10.1109/iccv.2019.00267 | DOI Listing |
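To make the pipeline above concrete, here is a minimal Python sketch of the heuristic-labeling idea. The heuristic rules, feature names, and hand-set weights are all hypothetical, and the simple weighted vote stands in for the paper's factor graph-based generative model, which learns the heuristics' accuracies rather than assuming them.

```python
# Minimal sketch of noisy-heuristic labeling for one relationship ("riding").
# Rules and weights are illustrative assumptions, not the paper's heuristics.
from dataclasses import dataclass

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

@dataclass
class Candidate:
    subj: str        # subject category, e.g. "person"
    obj: str         # object category, e.g. "horse"
    dy: float        # normalized vertical offset (positive = subject above)
    overlap: float   # box IoU between subject and object

def lf_subject_above(c: Candidate) -> int:
    # spatial heuristic: "riding" usually places the subject above the object
    return POSITIVE if c.dy > 0.1 else ABSTAIN

def lf_boxes_overlap(c: Candidate) -> int:
    # spatial heuristic: rider and mount boxes should overlap substantially
    return POSITIVE if c.overlap > 0.3 else NEGATIVE

def lf_category_prior(c: Candidate) -> int:
    # categorical heuristic: people ride rideable things
    rideable = {"horse", "bike", "motorcycle", "elephant"}
    return POSITIVE if c.subj == "person" and c.obj in rideable else ABSTAIN

def probabilistic_label(c: Candidate, lfs, weights) -> float:
    """Weighted vote over non-abstaining heuristics -> soft label in [0, 1]."""
    num = den = 0.0
    for lf, w in zip(lfs, weights):
        vote = lf(c)
        if vote != ABSTAIN:
            num += w * vote
            den += w
    return 0.5 if den == 0 else num / den

lfs = [lf_subject_above, lf_boxes_overlap, lf_category_prior]
weights = [1.0, 0.5, 2.0]  # hand-set here; learned by the generative model in the paper
c = Candidate(subj="person", obj="horse", dy=0.4, overlap=0.5)
print(probabilistic_label(c, lfs, weights))  # soft label for "riding"
```

The point of the sketch is that each heuristic is image-agnostic: it looks only at spatial layout and category statistics, so it can assign soft labels to relationship candidates in images no human has annotated.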
Neural Netw
January 2025
Harbin University of Science and Technology, Harbin, 150006, China.
Temporal Multi-Modal Knowledge Graphs (TMMKGs) can be regarded as a synthesis of Temporal Knowledge Graphs (TKGs) and Multi-Modal Knowledge Graphs (MMKGs), combining the characteristics of both. TMMKGs can effectively model dynamic real-world phenomena, particularly in scenarios involving multiple heterogeneous information sources and time series characteristics, such as e-commerce websites, scene recording data, and intelligent transportation systems. We propose a Temporal Multi-Modal Knowledge Graph Generation (TMMKGG) method that can automatically construct TMMKGs, aiming to reduce construction costs.
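As a rough illustration of what such a graph stores, here is a minimal Python sketch of a temporal multi-modal fact. The quadruple-style schema and field names are assumptions for illustration; the summary above does not specify the paper's actual representation.

```python
# Hypothetical sketch of a TMMKG fact: a timestamped triple whose entities
# can carry payloads from multiple modalities (image, text).
from dataclasses import dataclass
from typing import Optional

@dataclass
class MultiModalEntity:
    name: str
    image_uri: Optional[str] = None    # visual modality, if present
    description: Optional[str] = None  # textual modality, if present

@dataclass
class TemporalFact:
    head: MultiModalEntity
    relation: str
    tail: MultiModalEntity
    timestamp: str  # the time-series component, e.g. "2024-06-01"

# Example: an e-commerce event modeled as a temporal multi-modal fact.
buyer = MultiModalEntity("user_42", description="frequent electronics buyer")
item = MultiModalEntity("camera_x100", image_uri="img/camera_x100.jpg")
fact = TemporalFact(buyer, "purchased", item, "2024-06-01")
print(fact.head.name, fact.relation, fact.tail.name, "@", fact.timestamp)
```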
Int J Neural Syst
January 2025
Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou 313001, P. R. China.
Visual semantic decoding aims to extract perceived semantic information from the visual responses of the human brain and convert it into interpretable semantic labels. Although significant progress has been made in semantic decoding across individual visual cortices, studies on the semantic decoding of the ventral and dorsal cortical visual pathways remain limited. This study proposed a graph neural network (GNN)-based semantic decoding model on a natural scene dataset (NSD) to investigate the decoding differences between the dorsal and ventral pathways in processing various parts of speech, including verbs, nouns, and adjectives.
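For readers unfamiliar with GNN-based decoders, the toy Python sketch below shows the general shape of such a model: nodes as cortical regions, node features as voxel responses, one round of neighborhood averaging, and a graph-level read-out into verb/noun/adjective logits. The architecture, dimensions, and graph construction are all hypothetical, not the study's actual model.

```python
# Toy GNN read-out for semantic decoding; purely illustrative.
import torch
import torch.nn as nn

class TinyGNNDecoder(nn.Module):
    def __init__(self, in_dim: int, hidden: int, n_classes: int = 3):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden)
        self.lin2 = nn.Linear(hidden, n_classes)  # verb / noun / adjective

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # one round of neighborhood averaging (a basic graph convolution)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        h = torch.relu(self.lin1((adj @ x) / deg))
        return self.lin2(h.mean(dim=0))  # graph-level read-out -> class logits

n_regions, feat = 8, 16
adj = torch.eye(n_regions) + torch.rand(n_regions, n_regions).round()  # toy graph
x = torch.randn(n_regions, feat)                                       # toy responses
logits = TinyGNNDecoder(feat, 32)(x, adj)
print(logits.shape)  # torch.Size([3])
```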
Neural Netw
December 2024
School of Computer and Electronic Information, Guangxi University, University Road, Nanning, 530004, Guangxi, China. Electronic address:
Vision-language navigation (VLN) is a challenging task that requires agents to capture the correlations between different modalities from redundant information according to instructions, and then make sequential decisions in the action space based on visual scenes and text instructions. Recent research has focused on extracting visual features and enhancing textual knowledge, ignoring the potential bias in multi-modal data and the problem of spurious correlations between vision and text. Therefore, this paper studies the relationship structure between multi-modal data from the perspective of causality and weakens the potential correlations between different modalities through cross-modal causal reasoning.
Neural Netw
December 2024
Department of Computing, Hong Kong Polytechnic University, Hong Kong SAR, China.
Visual question generation involves generating meaningful questions about an image. Although significant progress has been made in automatically generating a single high-quality question related to an image, existing methods often ignore the diversity and interpretability of generated questions, which are important for various daily tasks that require clear question sources. In this paper, we propose an explicitly diverse visual question generation model that aims to generate diverse questions based on interpretable question sources.
Disabil Rehabil Assist Technol
December 2024
Department of Informatics, Universidade Federal de Viçosa - UFV, Viçosa, Brazil.