Recent studies have shown that many well-developed Visual Question Answering (VQA) models are heavily affected by the language prior problem: they make predictions based on the co-occurrence patterns between textual questions and answers instead of reasoning over the visual content. To tackle this problem, most existing methods focus on strengthening visual feature learning to reduce the influence of this textual shortcut on model decisions. However, few efforts have been devoted to analyzing its inherent cause and providing an explicit interpretation. The research community therefore lacks clear guidance for moving forward in a purposeful way, which leaves it unclear how models should be constructed to overcome this non-trivial problem. In this paper, we propose to interpret the language prior problem in VQA from a class-imbalance view. Concretely, we design a novel interpretation scheme in which the losses of mis-predicted frequent and sparse answers from the same question type are distinctly exhibited during the late training phase. It explicitly reveals why a VQA model tends to produce a frequent yet obviously wrong answer to a question whose correct answer is sparse in the training set. Based on this observation, we further propose a novel loss re-scaling approach that assigns a different weight to each answer, according to the training data statistics, when computing the final loss. We apply our approach to six strong baselines, and the experimental results on two VQA-CP benchmark datasets demonstrate its effectiveness. In addition, we also justify the validity of the class-imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
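
The idea of re-scaling the loss by training data statistics can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the inverse-frequency weight used here is an assumption (a common class-imbalance heuristic), not the method of the paper. The function names `answer_weights` and `rescaled_cross_entropy` are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter, defaultdict


def answer_weights(train_pairs):
    """Map (question_type, answer) -> weight from training counts.

    ASSUMPTION: inverse-frequency weighting within each question type.
    The abstract does not specify the actual formula.
    """
    counts = defaultdict(Counter)
    for qtype, answer in train_pairs:
        counts[qtype][answer] += 1

    weights = {}
    for qtype, counter in counts.items():
        total = sum(counter.values())
        n_answers = len(counter)
        for answer, count in counter.items():
            # Equals 1.0 when all answers of this question type are
            # equally frequent; sparse answers get weights above 1.0.
            weights[(qtype, answer)] = total / (n_answers * count)
    return weights


def rescaled_cross_entropy(logits, answers, target, qtype, weights):
    """Cross-entropy for one example, scaled by the ground-truth answer's weight."""
    # Numerically stable log-sum-exp over the candidate answers.
    max_logit = max(logits)
    log_z = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    log_prob = logits[answers.index(target)] - log_z
    # Answers unseen in training fall back to a neutral weight of 1.0.
    return -weights.get((qtype, target), 1.0) * log_prob


# Toy training set: "white" is frequent, "yellow" is sparse.
train_pairs = [("what color", "white")] * 8 + [("what color", "yellow")] * 2
weights = answer_weights(train_pairs)

answers = ["white", "yellow"]
logits = [0.0, 0.0]  # an undecided model
loss_sparse = rescaled_cross_entropy(logits, answers, "yellow", "what color", weights)
loss_frequent = rescaled_cross_entropy(logits, answers, "white", "what color", weights)
```

In this toy setting the sparse answer "yellow" receives a weight of 2.5 and the frequent answer "white" a weight of 0.625, so an equally uncertain prediction is penalized four times as heavily when the correct answer is the sparse one.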

Source
http://dx.doi.org/10.1109/TIP.2021.3128322
