Recently, the accuracy of image-text matching has been greatly improved by multimodal pretrained models, all of which rely on millions or billions of paired images and texts for supervised model learning. In contrast, the human brain can match images with texts well using its stored multimodal knowledge. Inspired by this, this paper studies a new scenario, unpaired image-text matching, in which paired images and texts are assumed to be unavailable during model learning. To address it, we propose a simple yet effective method named Multimodal Aligned Conceptual Knowledge (MACK). First, we collect a set of words and their related image regions from publicly available datasets, and compute prototypical region representations to obtain pretrained general knowledge. To make the obtained knowledge better suit particular datasets, we refine it using unpaired images and texts in a self-supervised learning manner, yielding fine-tuned domain knowledge. Then, to match given images with texts based on this knowledge, we represent the words parsed from the texts by their prototypical region representations and compute region-word similarity scores. Finally, the scores are aggregated via bidirectional similarity pooling into an image-text similarity score, which can be used directly for unpaired image-text matching. The proposed MACK is complementary to existing models and can be easily extended as a re-ranking method to substantially improve their zero-shot and cross-dataset image-text matching performance.
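To make the matching step more concrete, the sketch below illustrates one plausible reading of the region-word scoring and bidirectional similarity pooling described in the abstract. It is not the authors' implementation: the function names, the cosine-similarity choice, and the max-then-mean pooling in each direction are assumptions made for illustration only.

```python
# Hypothetical sketch of region-word matching with bidirectional
# similarity pooling; details are assumptions, not the MACK code.
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (n x d) and b (m x d)."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T


def image_text_similarity(region_feats: np.ndarray,
                          word_prototypes: np.ndarray) -> float:
    """Aggregate region-word scores into a single image-text score.

    region_feats:    (num_regions, d) features of detected image regions.
    word_prototypes: (num_words, d) prototypical region representations
                     of the words parsed from the text.
    """
    sim = cosine_sim(region_feats, word_prototypes)   # (regions, words)
    # Bidirectional pooling (assumed max-then-mean in both directions):
    text_to_image = sim.max(axis=0).mean()            # best region per word
    image_to_text = sim.max(axis=1).mean()            # best word per region
    return 0.5 * (text_to_image + image_to_text)


# Usage with random stand-in features:
rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 512))   # e.g. 36 detected regions
words = rng.normal(size=(8, 512))      # prototypes for 8 parsed words
print(image_text_similarity(regions, words))
```

Because the score is computed from precomputed prototypes rather than learned pairwise supervision, such a function could also re-rank the top candidates returned by an existing matching model, as the abstract suggests.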
DOI: 10.1109/TPAMI.2024.3432552 (http://dx.doi.org/10.1109/TPAMI.2024.3432552)