Integrating information from vision and language modalities has sparked interesting applications in the fields of computer vision and natural language processing. Existing methods, though promising in tasks like image captioning and visual question answering, face challenges in understanding real-life issues and offering step-by-step solutions. In particular, they typically limit their scope to solutions with a sequential structure, thus ignoring complex inter-step dependencies. To bridge this gap, we propose a graph-based approach to vision-language problem solving. It leverages a novel integrated attention mechanism that jointly considers the importance of features within each step as well as across multiple steps. Together with a graph neural network method, this attention mechanism can be progressively learned to predict sequential and non-sequential solution graphs depending on the characterization of the problem-solving process. To tightly couple attention with the problem-solving procedure, we further design new learning objectives with attention metrics that quantify this integrated attention, which better aligns visual and language information within steps, and more accurately captures information flow between steps. Experimental results on VisualHow, a comprehensive dataset of varying solution structures, show significant improvements in predicting steps and dependencies, demonstrating the effectiveness of our approach in tackling various vision-language problems.
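The distinction between sequential and non-sequential solution graphs can be illustrated with a minimal sketch. It is not the paper's model: it assumes a hypothetical representation in which each step id maps to the ids of the steps it depends on, and the names `SolutionGraph`, `is_sequential`, and `execution_order` are made up for illustration. It only shows what it means for a predicted solution to be a chain rather than a graph with branching dependencies.

```python
from collections import deque
from typing import Dict, List

# Hypothetical representation (an assumption, not from the paper):
# each step id maps to the list of step ids it depends on.
SolutionGraph = Dict[int, List[int]]


def is_sequential(graph: SolutionGraph) -> bool:
    """True if the steps form one chain, each depending only on its predecessor.

    Assumes step ids are numbered in their intended order.
    """
    order = sorted(graph)
    for idx, step in enumerate(order):
        expected = [] if idx == 0 else [order[idx - 1]]
        if sorted(graph[step]) != expected:
            return False
    return True


def execution_order(graph: SolutionGraph) -> List[int]:
    """Kahn's algorithm: return one valid step order, raising on a cycle."""
    remaining = {step: len(set(deps)) for step, deps in graph.items()}
    dependents: Dict[int, List[int]] = {step: [] for step in graph}
    for step, deps in graph.items():
        for dep in set(deps):
            dependents[dep].append(step)

    # Steps with no unmet dependencies can be carried out right away.
    ready = deque(sorted(s for s, n in remaining.items() if n == 0))
    order: List[int] = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for nxt in dependents[step]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(graph):
        raise ValueError("dependency cycle detected")
    return order


# A purely sequential solution: 1 -> 2 -> 3.
chain = {1: [], 2: [1], 3: [2]}
# A non-sequential solution: steps 1 and 2 are independent, step 3 needs both.
branching = {1: [], 2: [], 3: [1, 2], 4: [3]}
```

Here `is_sequential(chain)` holds while `is_sequential(branching)` does not, since step 3 depends on two earlier steps that are independent of each other. A method restricted to sequential structure could not express that dependency pattern.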

Source

DOI: http://dx.doi.org/10.1109/TPAMI.2024.3357631

Publication Analysis

Top Keywords

integrated attention: 12
vision-language problems: 8
attention mechanism: 8
attention: 6
problem step: 4
step focus: 4
focus learning: 4
learning solve: 4
solve vision-language: 4
problems integrated: 4

Similar Publications

Wolbachia-based mosquito control strategies have gained significant attention as a sustainable approach to reduce the transmission of vector-borne diseases such as dengue, Zika, and chikungunya. These endosymbiotic bacteria can limit the ability of mosquitoes to transmit pathogens, offering a promising alternative to traditional chemical-based interventions. With the growing impact of climate change on mosquito population dynamics and disease transmission, Wolbachia interventions represent an adaptable and resilient strategy for mitigating the public health burden of vector-borne diseases.

FP-YOLOv8: Surface Defect Detection Algorithm for Brake Pipe Ends Based on Improved YOLOv8n.

Sensors (Basel)

December 2024

School of Mechanical and Power Engineering, Zhengzhou University, Zhengzhou 450000, China.

To address the limitations of existing deep learning-based algorithms in detecting surface defects on brake pipe ends, a novel lightweight detection algorithm, FP-YOLOv8, is proposed. The algorithm builds on the YOLOv8n framework and aims to improve accuracy while keeping the model lightweight. First, the C2f_GhostV2 module is designed to replace the original C2f module.

The attention mechanism is essential to convolutional neural network (CNN) vision backbones used for sensing and imaging systems. Conventional attention modules are designed heuristically, relying heavily on empirical tuning. To tackle the challenge of designing attention mechanisms, this paper proposes a novel probabilistic attention mechanism.

Despite the accuracy and robustness attained in the field of object tracking, algorithms based on Siamese neural networks often over-rely on information from the initial frame and neglect necessary updates to the template. In prolonged tracking, such methods also struggle to handle complete occlusion or cases where the target exits the frame. To tackle these issues, this study enhances the SiamRPN algorithm by integrating the convolutional block attention module (CBAM), which strengthens spatial and channel attention. Additionally, it integrates kernelized correlation filters (KCFs) for enhanced feature template representation.

Attention-Based PSO-LSTM for Emotion Estimation Using EEG.

Sensors (Basel)

December 2024

Department of Information and Electronic Engineering, International Hellenic University, 57001 Thessaloniki, Greece.

Recent advances in emotion recognition through Artificial Intelligence (AI) have demonstrated potential applications in various fields (e.g., healthcare, advertising, and driving technology), with electroencephalogram (EEG)-based approaches achieving superior accuracy compared to facial or vocal methods due to their resistance to intentional manipulation.
