Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images. Existing methods employ prior knowledge from single-modality pre-training to facilitate learning, but lack multi-modal correspondence information. Vision-Language Pre-training, such as CLIP (Contrastive Language-Image Pretraining), can address this limitation. However, CLIP falls short in capturing fine-grained information, so its powerful capacity is not fully leveraged in TIReID. Besides, the popular explicit local matching paradigm for mining fine-grained information relies heavily on the quality of local parts and on cross-modal inter-part interaction/guidance, leading to intra-modal information distortion and ambiguity problems. Accordingly, in this paper, we propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID. To transfer the multi-modal knowledge effectively, we conduct fine-grained information excavation to mine modality-shared discriminative details for global alignment. Specifically, we propose a multi-level global feature learning (MGF) module that fully mines the discriminative local information within each modality, thereby emphasizing identity-related discriminative clues through enhanced interaction between the global image (text) and informative local patches (words). MGF generates a set of enhanced global features for later inference. Furthermore, we design cross-grained feature refinement (CFR) and fine-grained correspondence discovery (FCD) modules to establish cross-modal correspondence at both coarse and fine-grained levels (image-word, sentence-patch, word-patch), ensuring the reliability of informative local patches/words. CFR and FCD are removed during inference to improve computational efficiency. Extensive experiments on multiple benchmarks demonstrate the superior performance of our method in TIReID.
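To make the retrieval setting concrete, the following is a minimal sketch of ranking a gallery of images against one text query using a set of global features, as the abstract describes for inference. It is not the authors' implementation: the function names, the array layout (K global features per text and per image), and the choice to sum cosine similarities over the K levels are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length (eps guards against zero norm)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def rank_gallery(text_feats, image_feats):
    """Rank gallery images for a single text query.

    text_feats:  (K, D) array, K global features of the query text.
    image_feats: (N, K, D) array, K global features for each of N images.
    The (K, D) layout and the sum over levels are hypothetical choices,
    not taken from the paper.
    Returns (order, scores): gallery indices sorted best-first, and the
    per-image similarity scores.
    """
    t = l2_normalize(text_feats)    # (K, D)
    v = l2_normalize(image_feats)   # (N, K, D)
    # Cosine similarity at each level, summed over the K levels.
    scores = np.einsum("kd,nkd->n", t, v)
    order = np.argsort(-scores)     # descending by score
    return order, scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    query = rng.normal(size=(3, 16))       # K=3 levels, D=16 dimensions
    gallery = rng.normal(size=(5, 3, 16))  # N=5 candidate images
    order, scores = rank_gallery(query, gallery)
    print("ranking:", order)
```

Only global features enter the score here, which mirrors the statement that the cross-modal modules (CFR and FCD) are used during training and dropped at inference.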

Source
http://dx.doi.org/10.1109/TIP.2023.3327924
