MedCLIP: Contrastive Learning from Unpaired Medical Images and Text.

Proc Conf Empir Methods Nat Lang Process

Department of Computer Science, University of Illinois Urbana-Champaign.

Published: December 2022

Existing vision-text contrastive learning methods such as CLIP (Radford et al., 2021) aim to match paired image and caption embeddings while pushing unpaired ones apart, which improves representation transferability and supports zero-shot prediction. However, medical image-text datasets are orders of magnitude smaller than the general-domain images and captions available on the internet. Moreover, previous methods encounter many false negatives: images and reports from different patients may carry the same semantics yet are wrongly treated as negatives. In this paper, we decouple images and texts for multimodal contrastive learning, scaling the usable training data combinatorially at low cost. We also propose to replace the InfoNCE loss with a semantic matching loss based on medical knowledge to eliminate false negatives in contrastive learning. We show that MedCLIP is a simple yet effective framework: it outperforms state-of-the-art methods on zero-shot prediction, supervised classification, and image-text retrieval. Surprisingly, we observe that with only 20K pre-training data, MedCLIP outperforms the state-of-the-art method that uses ≈200K data.
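To make the idea concrete, below is a minimal sketch of how a semantic matching loss of this kind could be implemented, assuming each image and each report comes with a multi-hot vector of medical-concept labels (e.g., extracted findings) from which soft matching targets are built. The function and variable names are illustrative assumptions for this sketch, not the authors' released code.

```python
# Sketch: replace InfoNCE's hard one-hot targets with soft targets derived from
# medical-knowledge labels, so semantically matching image-report pairs from
# different patients are not forced apart as negatives.
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, targets):
    """Cross-entropy against row-normalized soft targets."""
    targets = targets / targets.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def semantic_matching_loss(img_emb, txt_emb, img_labels, txt_labels, temperature=0.07):
    """img_emb: (N, d), txt_emb: (M, d) -- decoupled image/text batches, so N and M
    need not match and pairs need not come from the same patient.
    img_labels: (N, K), txt_labels: (M, K) -- multi-hot medical-concept labels
    used to construct the soft matching targets (an assumption of this sketch)."""
    # Cosine-similarity logits between every image and every text.
    logits = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t() / temperature

    # Soft targets from label similarity: image-report pairs that share medical
    # semantics receive positive target weight instead of being treated as negatives.
    sem = F.normalize(img_labels.float(), dim=-1) @ F.normalize(txt_labels.float(), dim=-1).t()

    # Symmetric loss over both the image-to-text and text-to-image directions.
    return 0.5 * (soft_cross_entropy(logits, sem) + soft_cross_entropy(logits.t(), sem.t()))
```

Because the targets depend only on label similarity rather than on which image and report were originally paired, any image can be contrasted against any report, which is what lets the usable training pairs grow combinatorially from decoupled data.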


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11323634
DOI: http://dx.doi.org/10.18653/v1/2022.emnlp-main.256

Publication Analysis

Top Keywords (frequency): contrastive learning (16), zero-shot prediction (8), false negatives (8), medclip contrastive (4), learning (4), learning unpaired (4), unpaired medical (4), images (4), medical images (4), images text (4)

