LCM-Captioner: A lightweight text-based image captioning method with collaborative mechanism between vision and text.

Qi Wang, Hongyu Deng, Xue Wu, Zhenguo Yang, Yun Liu, Yazhou Wang, Gefei Hao

Neural Netw

State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, China.

Published: May 2023

Text-based image captioning (TextCap) aims to improve image descriptions by combining visual and textual information, addressing a limitation of conventional captioning methods, which often overlook text that appears in the image.
Existing techniques rely on complex network architectures, which leads to long running times and high resource usage while still not adequately modeling the relationship between vision and text.
The LCM-Captioner method offers a more efficient solution, using a feature-lightening transformation (TextLighT) and a collaborative attention module (VTCAM) for better semantic alignment; experiments on the TextCaps dataset demonstrate its effectiveness.

Text-based image captioning (TextCap) aims to remedy the shortcoming of conventional image captioning, which ignores text content when describing images. It requires models to recognize and describe images using both visual and textual content, achieving a deeper level of image comprehension. However, existing methods tend to rely on numerous complex network architectures to improve performance. On the one hand, these still fail to adequately model the relationship between vision and text; on the other hand, they lead to long running times, high memory consumption, and other deployment problems. To address these issues, we develop a lightweight captioning method with a collaborative mechanism, LCM-Captioner, which balances efficiency and performance. First, we propose a feature-lightening transformation for the TextCap task, named TextLighT, which learns rich multimodal representations while mapping features to lower dimensions, thereby reducing memory costs. Next, we present a collaborative attention module for visual and text information, VTCAM, which facilitates the semantic alignment of multimodal information to uncover important visual objects and textual content. Finally, extensive experiments on the TextCaps dataset demonstrate the effectiveness of our method. Code is available at https://github.com/DengHY258/LCM-Captioner.
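
The abstract does not say how TextLighT or VTCAM are built, so the following is only an illustrative sketch of the two general ideas it names: projecting features to a lower dimension, and letting visual and text features attend to each other. Every name, every size, the random weights, and the choice of plain scaled dot-product attention are assumptions made for illustration; this is not the authors' implementation, which is available in the repository linked above.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random matrix standing in for learned weights or extracted features."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query row becomes a weighted
    mix of the rows of keys_values. Returns (context, attention weights)."""
    d = len(queries[0])
    scores = matmul(queries, transpose(keys_values))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, keys_values), weights


rng = random.Random(0)

# Hypothetical sizes: 5 visual objects and 3 OCR tokens, 16-dim features
# reduced to 4 dims. None of these numbers come from the paper.
D_IN, D_LOW = 16, 4
N_OBJ, N_OCR = 5, 3

visual = random_matrix(N_OBJ, D_IN, rng)  # stand-in for object features
ocr = random_matrix(N_OCR, D_IN, rng)     # stand-in for OCR token features

# Idea 1 (assumed form): a linear projection to a lower dimension, so
# everything downstream stores and multiplies D_LOW-sized vectors.
w_vis = random_matrix(D_IN, D_LOW, rng)
w_ocr = random_matrix(D_IN, D_LOW, rng)
visual_low = matmul(visual, w_vis)
ocr_low = matmul(ocr, w_ocr)

# Idea 2 (assumed form): attention in both directions, so each modality
# is re-expressed in terms of the other.
vis_ctx, vis_weights = cross_attend(visual_low, ocr_low)  # objects look at text
ocr_ctx, ocr_weights = cross_attend(ocr_low, visual_low)  # text looks at objects
```

In this sketch, each row of `vis_weights` shows how strongly one visual object attends to each OCR token, which is one simple way to read "semantic alignment" between the two modalities. Reducing `D_IN` to `D_LOW` shrinks every stored feature vector by the same ratio, which is the kind of memory saving the abstract attributes to feature lightening.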

DOI: http://dx.doi.org/10.1016/j.neunet.2023.03.010

