Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. It becomes increasingly crucial in the field of embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we propose to address a more practical yet challenging counterpart setting - vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments. ETPNav performs online topological mapping of environments by self-organizing predicted waypoints along a traversed path, without prior environmental experience. It privileges the agent to break down the navigation procedure into high-level planning and low-level control. Concurrently, ETPNav utilizes a transformer-based cross-modal planner to generate navigation plans based on topological maps and instructions. The plan is then performed through an obstacle-avoiding controller that leverages a trial-and-error heuristic to prevent navigation from getting stuck in obstacles. Experimental results demonstrate the effectiveness of the proposed method. ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets, respectively. Our code is available at https://github.com/MarSaKi/ETPNav.
Download full-text PDF |
Source |
---|---|
http://dx.doi.org/10.1109/TPAMI.2024.3386695 | DOI Listing |
Neural Netw
December 2024
School of Computer and Electronic Information, Guangxi University, University Road, Nanning, 530004, Guangxi, China. Electronic address:
Vision-language navigation (VLN) is a challenging task that requires agents to capture the correlation between different modalities from redundant information according to instructions, and then make sequential decisions on visual scenes and text instructions in the action space. Recent research has focused on extracting visual features and enhancing text knowledge, ignoring the potential bias in multi-modal data and the problem of spurious correlations between vision and text. Therefore, this paper studies the relationship structure between multi-modal data from the perspective of causality and weakens the potential correlation between different modalities through cross-modal causality reasoning.
View Article and Find Full Text PDFThe ability to detect and track the dynamic objects in different scenes is fundamental to real-world applications, e.g., autonomous driving and robot navigation.
View Article and Find Full Text PDFIEEE Trans Pattern Anal Mach Intell
December 2024
Vision-Language Navigation (VLN) requires the agent to follow language instructions to reach a target position. A key factor for successful navigation is to align the landmarks implied in the instruction with diverse visual observations. However, previous VLN agents fail to perform accurate modality alignment especially in unexplored scenes, since they learn from limited navigation data and lack sufficient open-world alignment knowledge.
View Article and Find Full Text PDFIEEE Trans Pattern Anal Mach Intell
April 2024
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. It becomes increasingly crucial in the field of embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we propose to address a more practical yet challenging counterpart setting - vision-language navigation in continuous environments (VLN-CE).
View Article and Find Full Text PDFCyborg Bionic Syst
March 2024
School of Future Technology, Shanghai University, Shanghai, China.
As humans, we can naturally break down a task into individual steps in our daily lives and we are able to provide feedback or dynamically adjust the plan when encountering obstacles. Similarly, our aim is to facilitate agents in comprehending and carrying out natural language instructions in a more efficient and cost-effective manner. For example, in Vision-Language Navigation (VLN) tasks, the agent needs to understand instructions such as "go to the table by the fridge".
View Article and Find Full Text PDFEnter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!