Decoupled Cross-Modal Phrase-Attention Network for Image-Sentence Matching.

Zhangxiang Shi, Tianzhu Zhang, Xi Wei, Feng Wu, Yongdong Zhang

IEEE Trans Image Process

Published: February 2024

Mainstream studies of image-sentence matching currently focus on fine-grained alignment between image regions and sentence words. However, these methods miss a crucial fact: the correspondence between an image and a sentence comes not simply from alignments between individual regions and words, but from alignments between the phrases that those regions and words respectively form. In this work, we propose a novel Decoupled Cross-modal Phrase-Attention network (DCPA) for image-sentence matching that models the relationships between textual phrases and visual phrases. Furthermore, we design a novel decoupled scheme for training and inference that relieves the trade-off in bi-directional retrieval: image-to-sentence matching is executed in the textual semantic space, and sentence-to-image matching is executed in the visual semantic space. Extensive experimental results on Flickr30K and MS-COCO demonstrate that the proposed method outperforms state-of-the-art methods by a large margin and can compete with some methods that introduce external knowledge.
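The decoupled inference described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how phrases are built or scored, so the function names, the use of cosine similarity, and the assumption that each modality has already been embedded into both semantic spaces are all hypothetical. The sketch shows only the routing idea, in which each retrieval direction ranks candidates with the similarity computed in its own space.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def decoupled_retrieval(img_text, sent_text, img_visual, sent_visual):
    """Rank candidates separately for each retrieval direction.

    Hypothetical inputs (assumed, not taken from the paper):
      img_text, sent_text     : images and sentences embedded in the
                                textual semantic space
      img_visual, sent_visual : images and sentences embedded in the
                                visual semantic space
    """
    # Image-to-sentence: score in the textual space, shape (n_img, n_sent).
    sim_i2t = cosine_similarity_matrix(img_text, sent_text)
    # Sentence-to-image: score in the visual space, shape (n_sent, n_img).
    sim_t2i = cosine_similarity_matrix(sent_visual, img_visual)
    # Sort each row in descending order of similarity.
    i2t_ranking = np.argsort(-sim_i2t, axis=1)
    t2i_ranking = np.argsort(-sim_t2i, axis=1)
    return i2t_ranking, t2i_ranking


# Toy data: item i in one modality is closest to item i in the other.
img_text = np.eye(4)
sent_text = np.eye(4) + 0.1
img_visual = 2.0 * np.eye(4)
sent_visual = np.eye(4) + 0.2

i2t, t2i = decoupled_retrieval(img_text, sent_text, img_visual, sent_visual)
print(i2t[:, 0])  # top-1 sentence index for each image
print(t2i[:, 0])  # top-1 image index for each sentence
```

Because the two directions never share a similarity matrix, each space can in principle be tuned for its own direction, which is one way to read the abstract's claim about relieving the bi-directional trade-off.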

DOI: http://dx.doi.org/10.1109/TIP.2022.3197972

