Recent advances in multimodal pretrained models have greatly improved image-text matching by training on vast datasets of paired images and texts, but these models rely heavily on such paired supervision.
This paper introduces a novel approach called Multimodal Aligned Conceptual Knowledge (MACK) for unpaired image-text matching, where paired data is not available during training.
MACK collects conceptual knowledge from unpaired datasets and refines it through self-supervised learning, which enables image-text similarity scores to be computed without paired training data; it can also be combined with existing models to enhance their performance, particularly in zero-shot and cross-dataset scenarios.
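As a rough illustration of how aligned conceptual knowledge could yield an image-text similarity score without paired training data, the minimal sketch below matches image region features against per-word prototype features assumed to have been collected from unpaired images. The function names, dictionary-based knowledge store, and the cosine/max/mean aggregation are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: scoring an image-text pair with pre-built
# conceptual knowledge. All names and the aggregation scheme are
# assumptions for illustration only.
import numpy as np

def cosine(a, b):
    # Cosine similarity between two sets of vectors: (n, d) x (m, d) -> (n, m)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def image_text_score(region_feats, words, concept_knowledge):
    """region_feats: (n_regions, d) visual features of one image.
    words: tokens of the sentence.
    concept_knowledge: dict mapping a word to its (d,) prototype image
    feature, assumed to be collected from unpaired data."""
    word_scores = []
    for w in words:
        proto = concept_knowledge.get(w)
        if proto is None:          # skip words without collected knowledge
            continue
        sims = cosine(region_feats, proto[None, :])   # (n_regions, 1)
        word_scores.append(sims.max())                # best-matching region
    # Average over words that have conceptual knowledge; no match -> 0.
    return float(np.mean(word_scores)) if word_scores else 0.0

# Toy usage with random features (d = 4).
rng = np.random.default_rng(0)
knowledge = {"dog": rng.normal(size=4), "grass": rng.normal(size=4)}
regions = rng.normal(size=(3, 4))
print(image_text_score(regions, ["a", "dog", "on", "grass"], knowledge))
```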