A Multi-Modal Foundation Model to Assist People with Blindness and Low Vision in Environmental Interaction.

Yu Hao Fan Yang Hao Huang Shuaihang Yuan Sundeep Rangan John-Ross Rizzo Yao Wang Yi Fang

J Imaging

Tandon School of Engineering, New York University, Brooklyn, NY 11201, USA.

Published: April 2024

People with blindness and low vision (pBLV) encounter substantial challenges when it comes to comprehensive scene recognition and precise object identification in unfamiliar environments. Additionally, due to the vision loss, pBLV have difficulty in accessing and identifying potential tripping hazards independently. Previous assistive technologies for the visually impaired often struggle in real-world scenarios due to the need for constant training and lack of robustness, which limits their effectiveness, especially in dynamic and unfamiliar environments, where accurate and efficient perception is crucial. Therefore, we frame our research question in this paper as: We hypothesize that by leveraging large pretrained foundation models and prompt engineering, we can create a system that effectively addresses the challenges faced by pBLV in unfamiliar environments. Motivated by the prevalence of large pretrained foundation models, particularly in assistive robotics applications, due to their accurate perception and robust contextual understanding in real-world scenarios induced by extensive pretraining, we present a pioneering approach that leverages foundation models to enhance visual perception for pBLV, offering detailed and comprehensive descriptions of the surrounding environment and providing warnings about potential risks. Specifically, our method begins by leveraging a large-image tagging model (i.e., Recognize Anything Model (RAM)) to identify all common objects present in the captured images. The recognition results and user query are then integrated into a prompt, tailored specifically for pBLV, using prompt engineering. By combining the prompt and input image, a vision-language foundation model (i.e., InstructBLIP) generates detailed and comprehensive descriptions of the environment and identifies potential risks in the environment by analyzing environmental objects and scenic landmarks, relevant to the prompt. We evaluate our approach through experiments conducted on both indoor and outdoor datasets. Our results demonstrate that our method can recognize objects accurately and provide insightful descriptions and analysis of the environment for pBLV.

Download full-text PDF	Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11122237	PMC
http://dx.doi.org/10.3390/jimaging10050103	DOI Listing

Publication Analysis

Top Keywords

unfamiliar environments

foundation models

foundation model

people blindness

blindness low

low vision

real-world scenarios

large pretrained

pretrained foundation

prompt engineering

Similar Publications

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!