Simignore: Exploring and enhancing multimodal large model complex reasoning via similarity computation.

Neural Netw

Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, Shanghai, China.

Published: December 2024

Multimodal large language models (MLLMs) have advanced rapidly, and many Large Vision-Language Models (LVLMs) rely on sequential visual representations: an image is broken down into numerous tokens that are fed into the Large Language Model (LLM) alongside the text prompt. However, the opaque nature of these models poses significant challenges to their interpretability, particularly in complex reasoning tasks. To address this issue, we used Grad-CAM to investigate the interaction dynamics between images and text during complex reasoning. Our information flow analysis revealed a distinct pattern: information flow tends to converge in the initial layers and then disperse as it progresses through deeper layers. This pattern suggests that the early stages of processing focus on the interaction between visual and textual elements, while later stages engage in deeper reasoning. Based on this insight, we developed Simignore, a novel image token reduction technique. Simignore enhances the model's complex reasoning capabilities by calculating the similarity between image and text embeddings and ignoring image tokens that are not semantically relevant. Extensive experiments across different MLLM architectures show that our approach consistently improves performance in complex reasoning tasks. This work not only contributes to the advancement of MLLM interpretability but also provides a robust framework for future research in this area. The source code is available at https://github.com/FanshuoZeng/Simignore.
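
The similarity-based token reduction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). The abstract does not state which similarity measure is used, how per-token scores are aggregated, or how many tokens are kept, so the sketch assumes cosine similarity, scores each image token by its maximum similarity to any text token, and keeps a fixed number of tokens. The names `cosine`, `select_image_tokens`, and `keep` are hypothetical, and the toy vectors stand in for real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_image_tokens(image_embs, text_embs, keep):
    """Return the indices of the `keep` image tokens most similar to the text.

    Assumptions (not taken from the paper): cosine similarity, max over text
    tokens as the per-image-token score, and a fixed token budget `keep`.
    `text_embs` must be non-empty.
    """
    # Score each image token by its best match among the text tokens.
    scores = [max(cosine(img, txt) for txt in text_embs) for img in image_embs]
    # Rank image-token indices from most to least similar.
    ranked = sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)
    # Keep the top `keep` indices, restored to their original sequence order.
    return sorted(ranked[:keep])


# Toy example: four 2-dimensional "image token" embeddings and one "text token".
image_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
text_embs = [[1.0, 0.1]]
kept = select_image_tokens(image_embs, text_embs, keep=2)
print(kept)  # [0, 2]
```

In this toy case the two image tokens pointing in roughly the same direction as the text embedding are kept, and the other two would be ignored. How the ignored tokens are then handled inside the model (for example dropped from the sequence or masked) is not specified in the abstract and is left out of the sketch.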


Source: http://dx.doi.org/10.1016/j.neunet.2024.107059

