DICCR: Double-gated intervention and confounder causal reasoning for vision-language navigation.

Neural Networks

School of Computer and Electronic Information, Guangxi University, University Road, Nanning, 530004, Guangxi, China.

Published: December 2024

Vision-language navigation (VLN) is a challenging task that requires agents to capture cross-modal correlations from redundant information according to instructions, and then make sequential decisions over the action space based on visual scenes and text instructions. Recent research has focused on extracting visual features and enhancing textual knowledge, ignoring the potential bias in multi-modal data and the problem of spurious correlations between vision and text. Therefore, this paper studies the relational structure of multi-modal data from a causal perspective and weakens spurious correlations between modalities through cross-modal causal reasoning. We propose a novel vision-language navigation model based on double-gated intervention and confounder causal reasoning (DICCR). First, we decouple the dataset's visual-text factors to construct a causal graph of confounding factors for cross-modal reasoning navigation. On this basis, we learn the causal relation between vision and text via posterior probability and use the confounding factors to block spurious association paths from interfering with the agent's decision-making. Then, we propose front-door and back-door causal intervention modules guided by semantic relations to reduce spurious biases in vision and semantics. Building on these modules, we design a joint local-global causal attention module that aggregates global feature representations through two different gated interventions. Finally, we design a multi-modal feature fusion matching algorithm (FFM), which combines the agent's motion trajectory with multi-modal features to provide auxiliary feedback for continuous decision-making. We verify the model's effectiveness on three benchmark datasets: R2R, REVERIE, and RxR. Experimental results show that DICCR improves SPL and SR by 3.25% and 4.13%, respectively, on the R2R dataset compared with the baseline model, achieving state-of-the-art performance.
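As a hedged illustration of the causal machinery the abstract names, the sketch below shows (in PyTorch) a back-door-style intervention that marginalizes a cross-modal feature over a small learned confounder dictionary, approximating P(Y | do(X)) = sum_z P(Y | X, z) P(z), followed by a double-gated fusion of local and global features. All class names, dimensions, and wiring here are illustrative assumptions rather than the authors' DICCR implementation.

```python
# Minimal sketch of two ideas named in the abstract (assumptions, not the
# authors' code): (1) back-door adjustment over a learned confounder
# dictionary, and (2) a double-gated local-global fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BackdoorIntervention(nn.Module):
    """Approximate back-door adjustment with a fixed-size confounder dictionary."""

    def __init__(self, dim: int, num_confounders: int = 32):
        super().__init__()
        # Learned stand-ins for confounder strata z; real systems often build
        # these from dataset statistics (e.g., clustered visual/text features).
        self.confounders = nn.Parameter(torch.randn(num_confounders, dim))
        self.prior = nn.Parameter(torch.zeros(num_confounders))  # log P(z)
        self.query = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) cross-modal feature
        attn = self.query(x) @ self.confounders.t() / self.confounders.shape[1] ** 0.5
        # Adding a prior term that does not depend on x mirrors the do-operator's
        # adjustment, where each confounder stratum is weighted by P(z).
        weights = F.softmax(attn + self.prior, dim=-1)   # (batch, K)
        z = weights @ self.confounders                   # (batch, dim)
        return x + z                                     # deconfounded feature


class DoubleGatedFusion(nn.Module):
    """Gate between a local (per-step) and a global (trajectory-level) feature."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_local = nn.Linear(2 * dim, dim)
        self.gate_global = nn.Linear(2 * dim, dim)

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([local_feat, global_feat], dim=-1)
        g_local = torch.sigmoid(self.gate_local(joint))
        g_global = torch.sigmoid(self.gate_global(joint))
        return g_local * local_feat + g_global * global_feat


if __name__ == "__main__":
    batch, dim = 4, 256
    vis_text = torch.randn(batch, dim)    # fused vision-language feature (assumed)
    trajectory = torch.randn(batch, dim)  # global trajectory summary (assumed)

    deconfound = BackdoorIntervention(dim)
    fuse = DoubleGatedFusion(dim)

    action_feat = fuse(deconfound(vis_text), trajectory)
    print(action_feat.shape)  # torch.Size([4, 256])
```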


Source
http://dx.doi.org/10.1016/j.neunet.2024.107078

Publication Analysis

Top Keywords

double-gated intervention (8); intervention confounder (8); vision-language navigation (8); correlation modalities (8); multi-modal data (8); vision text (8); causality reasoning (8); confounder factors (8); causality (5); diccr (4)

