Insufficient training data is a common barrier to effectively learning multimodal information interactions and question semantics in existing medical Visual Question Answering (VQA) models. This paper proposes a new Asymmetric Cross Modal Attention network, called ACMA, which constructs an image-guided attention and a question-guided attention to improve multimodal interactions when data are insufficient. In addition, a newly designed Semantic Understanding Auxiliary (SUA) in the question-guided attention integrates word-level and sentence-level information to learn rich semantic embeddings, improving the model's question understanding. Moreover, we propose a new data augmentation method called Multimodal Augmented Mixup (MAM) to train the ACMA; the resulting model is denoted ACMA-MAM. The MAM combines various data augmentations with a vanilla mixup strategy to generate more non-repetitive data, which avoids time-consuming manual data annotation and improves the model's generalization capability. Our ACMA-MAM outperforms state-of-the-art models on three publicly accessible medical VQA datasets (VQA-Rad, VQA-Slake, and PathVQA), with accuracies of 76.14%, 83.13%, and 53.83%, respectively, improvements of 2.00%, 1.32%, and 1.59%. Our model also achieves F1 scores of 78.33%, 82.83%, and 51.86%, surpassing the state-of-the-art models by 2.80%, 1.15%, and 1.37%, respectively.
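The vanilla mixup strategy mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' MAM, whose augmentation pipeline is not detailed here; it shows only the standard mixup idea of blending two training samples with a coefficient drawn from a Beta(alpha, alpha) distribution. The names `blend` and `mixup_pair`, the representation of each sample as flat lists of floats (image features, question embedding, one-hot answer), and the choice to blend the question embedding as well are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple

Vector = List[float]
# Hypothetical sample layout: (image features, question embedding, one-hot answer).
Sample = Tuple[Sequence[float], Sequence[float], Sequence[float]]


def blend(a: Sequence[float], b: Sequence[float], lam: float) -> Vector:
    """Return the convex combination lam * a + (1 - lam) * b, element by element."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def mixup_pair(
    sample_a: Sample,
    sample_b: Sample,
    alpha: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Tuple[Tuple[Vector, Vector, Vector], float]:
    """Vanilla mixup of two samples.

    Draws lam from Beta(alpha, alpha) and applies the same lam to the image
    features, the question embedding, and the answer label, so that the soft
    label matches the blended inputs. Blending the question embedding is an
    assumption of this sketch, not something the abstract specifies.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    img_a, q_a, y_a = sample_a
    img_b, q_b, y_b = sample_b
    mixed = (blend(img_a, img_b, lam), blend(q_a, q_b, lam), blend(y_a, y_b, lam))
    return mixed, lam


if __name__ == "__main__":
    # Toy two-class example with made-up feature values.
    a = ([1.0, 0.0], [0.5, 0.5], [1.0, 0.0])
    b = ([0.0, 1.0], [0.1, 0.9], [0.0, 1.0])
    (img, question, label), lam = mixup_pair(a, b, alpha=1.0, rng=random.Random(0))
    print("lambda:", lam)
    print("mixed label:", label)
```

Because the label is blended with the same coefficient as the inputs, the mixed label remains a valid probability distribution, and each new pair of samples yields a new training example without additional annotation.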
DOI: http://dx.doi.org/10.1016/j.artmed.2023.102667