Exploring refined dual visual features cross-combination for image captioning.

Neural Networks

College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China.

Published: December 2024

Transformer-based encoders have become commonplace for encoding region and grid features in current image captioning tasks: their multi-head self-attention mechanism lets the encoder better capture the relationships between different regions of an image as well as contextual information. However, stacking Transformer blocks requires self-attention whose cost is quadratic in the number of visual features, which not only computes many redundant features but also significantly increases computational overhead. This paper presents a novel Distilled Cross-Combination Transformer (DCCT) network. Technically, we first introduce a distillation cascade fusion encoder (DCFE), in which a probabilistic sparse self-attention layer filters out redundant and distracting features that disturb attention focus, yielding more refined visual features and improving encoding efficiency. Next, we develop a parallel cross-fusion attention module (PCFA) that fully exploits the complementarity and correlation between grid and region features to better fuse the encoded dual visual features. Extensive experiments on the MSCOCO dataset demonstrate that the proposed DCCT achieves outstanding performance, rivaling current state-of-the-art approaches.
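The distillation step hinges on probabilistic sparse self-attention. The abstract does not give the exact formulation, so the sketch below follows the ProbSparse idea it most closely resembles (Informer, Zhou et al., 2021): score each query by how far its attention distribution deviates from uniform, compute attention only for the top-u "active" queries, and let the remaining "lazy" queries fall back to the mean of the values. All module and parameter names here are illustrative, not the paper's.

```python
# Minimal sketch of a probabilistic sparse self-attention layer in PyTorch.
# For clarity this version scores all queries against all keys; the published
# ProbSparse trick additionally samples keys to keep the measure sub-quadratic.
import math
import torch
import torch.nn as nn


class ProbSparseSelfAttention(nn.Module):
    def __init__(self, d_model: int, factor: int = 5):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.factor = factor  # controls how many queries stay active

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, d_model) -- e.g. grid or region features
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (b, n, n)
        # Sparsity measure: max score minus mean score per query. Queries whose
        # attention is nearly uniform carry little information and are skipped.
        sparsity = scores.max(dim=-1).values - scores.mean(dim=-1)  # (b, n)
        u = min(n, max(1, int(self.factor * math.ceil(math.log(n + 1)))))
        top_idx = sparsity.topk(u, dim=-1).indices  # (b, u)

        # Default output for lazy queries: the mean of V; then overwrite the
        # top-u active queries with their exact attention result.
        out = v.mean(dim=1, keepdim=True).expand(b, n, d).clone()
        active_scores = scores.gather(1, top_idx.unsqueeze(-1).expand(b, u, n))
        attn = active_scores.softmax(dim=-1)           # (b, u, n)
        out.scatter_(1, top_idx.unsqueeze(-1).expand(b, u, d), attn @ v)
        return self.out(out)
```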
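Similarly, the parallel cross-fusion attention module can be pictured as two cross-attention streams run in parallel: grid features query region features while region features query grid features, after which the refined streams are merged. The sketch below is a hedged reading of that description; the gated fusion and mean pooling are one simple alignment choice among many, not the paper's exact PCFA design.

```python
# Illustrative sketch of parallel cross-fusion between grid and region features.
import torch
import torch.nn as nn


class ParallelCrossFusion(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.grid_to_region = nn.MultiheadAttention(
            d_model, n_heads, batch_first=True)
        self.region_to_grid = nn.MultiheadAttention(
            d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, grid: torch.Tensor, region: torch.Tensor) -> torch.Tensor:
        # grid:   (batch, n_grid, d_model)   -- e.g. CNN grid features
        # region: (batch, n_region, d_model) -- e.g. detector region features
        g, _ = self.grid_to_region(grid, region, region)  # grid queries regions
        r, _ = self.region_to_grid(region, grid, grid)    # regions query grid
        # Pool the region stream so both streams share the grid's length,
        # then blend them token-wise with a learned sigmoid gate.
        r_pooled = r.mean(dim=1, keepdim=True).expand_as(g)
        gate = torch.sigmoid(self.gate(torch.cat([g, r_pooled], dim=-1)))
        return gate * g + (1 - gate) * r_pooled
```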

Source
http://dx.doi.org/10.1016/j.neunet.2024.106710

Publication Analysis

Top Keywords

visual features: 16
features: 9
dual visual: 8
region features: 8
exploring refined: 4
refined dual: 4
visual: 4
features cross-combination: 4
cross-combination image: 4
image captioning: 4
