Video question answering (Video-QA) is an intensively studied task in Artificial Intelligence and a useful benchmark for evaluating AI capabilities. In this paper, we propose a Modality Attention Fusion framework with Hybrid Multi-head Self-attention (MAF-HMS). MAF-HMS focuses on answering multiple-choice questions about a video-subtitle-QA representation by fusing attention and self-attention across modalities. We use BERT to extract text features and Faster R-CNN to extract visual features, providing a useful input representation for our model to answer questions. In addition, we construct a Modality Attention Fusion (MAF) framework that builds an attention fusion matrix over the different modalities (video, subtitles, QA), and apply Hybrid Multi-head Self-attention (HMS) to determine the correct answer. Experiments on three separate scene datasets show that our overall model outperforms the baseline methods by a large margin. Finally, we conducted extensive ablation studies to verify the contribution of each component of the network, and we demonstrate the effectiveness and advantages of our method over existing approaches through experiments broken down by question type and required modality.
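The abstract's pipeline, cross-modal attention from the QA representation over video and subtitle features, followed by multi-head self-attention over the fused result, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the learned query/key/value projection matrices are omitted (identity projections are used), the feature dimensions are arbitrary, and all function names (`cross_modal_attention`, `multi_head_self_attention`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feats, key_feats):
    """Attend from one modality (e.g. QA tokens) over another (e.g. video regions)."""
    scores = query_feats @ key_feats.T / np.sqrt(query_feats.shape[-1])
    return softmax(scores, axis=-1) @ key_feats

def multi_head_self_attention(x, num_heads):
    """Toy multi-head self-attention with identity projections, no learned weights."""
    n, d = x.shape
    assert d % num_heads == 0
    head_dim = d // num_heads
    heads = []
    for h in range(num_heads):
        xh = x[:, h * head_dim:(h + 1) * head_dim]
        scores = xh @ xh.T / np.sqrt(head_dim)
        heads.append(softmax(scores, axis=-1) @ xh)
    return np.concatenate(heads, axis=1)

rng = np.random.default_rng(0)
video = rng.normal(size=(10, 8))  # stand-in for 10 Faster R-CNN region features
subs = rng.normal(size=(6, 8))    # stand-in for 6 BERT subtitle-token features
qa = rng.normal(size=(4, 8))      # stand-in for 4 BERT QA-token features

# Fuse: QA attends over video and subtitles, then self-attention refines the result.
fused = qa + cross_modal_attention(qa, video) + cross_modal_attention(qa, subs)
out = multi_head_self_attention(fused, num_heads=2)
print(out.shape)  # (4, 8)
```

In the full model, each attention step would use learned projections and the output would feed a classifier over the multiple-choice candidates; this sketch only shows the fusion wiring.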

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9536548
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0275156

Publication Analysis

Top Keywords

attention fusion (16)
modality attention (12)
hybrid multi-head (8)
multi-head self-attention (8)
modality (5)
fusion (5)
fusion model (4)
model hybrid (4)
self-attention (4)
self-attention video (4)

Similar Publications

OCDet: A comprehensive ovarian cell detection model with channel attention on immunohistochemical and morphological pathology images.

Comput Biol Med

January 2025

Department of Pathology, Peking University Health Science Center, 38 College Road, Haidian, Beijing, 100191, China; Department of Pathology, School of Basic Medical Sciences, Third Hospital, Peking University Health Science Center, Beijing, 100191, China.

Background: Ovarian cancer is among the most lethal gynecologic malignancies threatening women's lives. Pathological diagnosis is a key tool for early detection and diagnosis of ovarian cancer and guides treatment strategies. Evaluating the various ovarian cancer-related cells in morphological and immunohistochemical pathology images is deemed an important step.

A multiscale molecular structural neural network for molecular property prediction.

Mol Divers

January 2025

Key Laboratory for Macromolecular Science of Shaanxi Province, School of Chemistry and Chemical Engineering, Shaanxi Normal University, Xi'an, 710119, People's Republic of China.

Molecular Property Prediction (MPP) is a fundamental task in research fields such as chemistry, materials science, biology, and medicine, where traditional computational chemistry methods based on quantum mechanics often consume substantial time and computing power. In recent years, machine learning has been increasingly used in computational chemistry; graph neural networks in particular have shown good performance on molecular property prediction tasks, but they have limitations in generalizability, interpretability, and certainty. To address these challenges, this paper proposes a Multiscale Molecular Structural Neural Network (MMSNet), which obtains rich multiscale molecular representations by fusing information between bonded and non-bonded "message passing" structures at the atomic scale and spatial-feature "encoder-decoder" structures at the molecular scale. A multi-level attention mechanism, grounded in a theoretical analysis of molecular mechanics, is introduced to enhance the model's interpretability. Finally, the predictions of MMSNet are used as label values and clustered in the molecular library by the K-NN (K-Nearest Neighbors) algorithm to reverse-match the spatial structure of the molecules, and the model's certainty is quantified by comparing virtual screening results across different K-values.
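The K-NN matching step described above, finding the library molecules closest to a predicted representation, can be sketched in a few lines of NumPy. This is an illustrative toy, not the MMSNet implementation: the one-dimensional "features", the library values, and the function name `knn_match` are all made up for the example.

```python
import numpy as np

def knn_match(query, library, k=3):
    """Return indices of the k library entries nearest to `query` (Euclidean)."""
    dists = np.linalg.norm(library - query, axis=1)
    return np.argsort(dists)[:k]

# Toy molecular "library" of 1-D feature values.
library = np.array([[0.0], [1.0], [2.0], [10.0]])
idx = knn_match(np.array([1.2]), library, k=2)
print(idx.tolist())  # [1, 2]
```

In the paper's setting, comparing how stable such matches remain across different values of k is what lets the authors quantify the model's certainty.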

A Feature-Enhanced Small Object Detection Algorithm Based on Attention Mechanism.

Sensors (Basel)

January 2025

School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China.

With the rapid development of AI algorithms and computational power, object recognition based on deep learning frameworks has become a major research direction in computer vision. UAVs equipped with object detection systems are increasingly used in fields like smart transportation, disaster warning, and emergency rescue. However, due to factors such as the environment, lighting, altitude, and angle, UAV images face challenges like small object sizes, high object density, and significant background interference, making object detection tasks difficult.

The issue of obstacle avoidance and safety for visually impaired individuals has been a major topic of research. However, complex street environments still pose significant challenges for blind obstacle detection systems. Existing solutions often fail to provide real-time, accurate obstacle avoidance decisions.

Remaining useful life (RUL) prediction is a cornerstone of Prognostic and Health Management (PHM) for power machinery, playing a crucial role in ensuring the reliability and safety of these critical systems. In recent years, deep learning techniques have shown great promise in RUL prediction, providing more reliable and accurate outcomes. However, existing models often struggle with comprehensive feature extraction, especially in capturing the complex behavior of power machinery, where non-linear degradation patterns arise under varying operational conditions.
