Recently, contrastive learning has shown significant progress in learning visual representations from unlabeled data. The core idea is to train the backbone to be invariant to different augmentations of an instance. While most methods only maximize the feature similarity between two augmented views, we further generate more challenging training samples and force the model to keep predicting the aggregated representation on these hard samples. In this article, we propose MixIR, a mixture-based approach built upon the traditional Siamese network. On the one hand, we input two augmented images of an instance to the backbone and obtain the aggregated representation by taking the elementwise maximum of the two features. On the other hand, we take the mixture of these augmented images as input and expect the model prediction to be close to the aggregated representation. In this way, the model can access more varied data samples of an instance and keep predicting invariant representations for them. Thus, the learned model is more discriminative than those of previous contrastive learning methods. Extensive experiments on large-scale datasets show that MixIR steadily improves the baseline and achieves results competitive with state-of-the-art methods. Our code is available at https://github.com/happytianhao/MixIR.
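The two branches described above can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). The abstract does not say how the mixture is formed or which distance is used, so the mixup-style convex combination in `mix`, the negative cosine similarity in `neg_cosine`, and the toy one-layer `backbone` are all assumptions made for illustration. Only the elementwise maximum in `aggregate` comes directly from the text.

```python
import math
import random

def backbone(x, weights):
    """Hypothetical toy encoder: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

def aggregate(f1, f2):
    """Elementwise maximum of two feature vectors (the aggregated representation)."""
    return [max(a, b) for a, b in zip(f1, f2)]

def mix(x1, x2, lam=0.5):
    """ASSUMPTION: mixup-style convex combination of the two augmented inputs."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]

def neg_cosine(p, z, eps=1e-12):
    """ASSUMPTION: negative cosine similarity as the closeness measure."""
    dot = sum(a * b for a, b in zip(p, z))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in z))
    return -dot / (norm + eps)

def mixir_loss(x1, x2, weights, lam=0.5):
    """Pull the prediction on the mixed input toward the aggregated target."""
    # Branch 1: encode both augmented views and aggregate their features.
    target = aggregate(backbone(x1, weights), backbone(x2, weights))
    # Branch 2: encode the mixture of the two views.
    pred = backbone(mix(x1, x2, lam), weights)
    return neg_cosine(pred, target)

# Toy usage with random vectors standing in for two augmented images.
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
x1 = [rng.uniform(0, 1) for _ in range(4)]
x2 = [rng.uniform(0, 1) for _ in range(4)]
loss = mixir_loss(x1, x2, weights)
```

Under these assumptions, minimizing `loss` drives the feature of the mixed input toward the elementwise maximum of the two view features, which is the invariance objective the abstract describes.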
DOI: http://dx.doi.org/10.1109/TNNLS.2024.3439538