Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection.

Yifan Xu, Mengdan Zhang, Xiaoshan Yang, Changsheng Xu

IEEE Trans Image Process

Published: November 2024

We explore multi-modal contextual knowledge learned through multi-modal masked language modeling to provide explicit localization guidance for novel classes in open-vocabulary object detection (OVD). Intuitively, a well-modeled and correctly predicted masked concept word should capture the textual context, the visual context, and the cross-modal correspondence between text and regions, and should therefore automatically place high attention on the corresponding regions. In light of this, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, which explicitly supervises a student detector with the context-aware attention of the masked concept words in a teacher fusion transformer. The teacher fusion transformer is trained with our newly proposed diverse multi-modal masked language modeling (D-MLM) strategy, which significantly enhances fine-grained region-level visual context modeling in the fusion transformer. The distillation process provides additional contextual guidance for the detector's concept-region matching, further improving OVD performance. Extensive experiments on various detection datasets show the effectiveness of our multi-modal context learning strategy.
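The abstract does not give the form of the distillation loss, so the following is only a minimal sketch of the general idea, assuming the teacher's attention for each masked concept word is treated as a target distribution over candidate regions and the student's concept-region matching scores are pulled toward it with a KL divergence. The function name `attention_distillation_loss`, the `temperature` parameter, and the choice of KL divergence are illustrative assumptions, not the MMC-Det formulation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def attention_distillation_loss(teacher_attn, student_logits,
                                temperature=1.0, eps=1e-8):
    """Hypothetical loss: KL(teacher || student), averaged over concept words.

    teacher_attn:   (num_concepts, num_regions) non-negative attention weights
                    of masked concept words over regions (assumed input).
    student_logits: (num_concepts, num_regions) concept-region matching
                    scores from the student detector (assumed input).
    """
    teacher_attn = np.asarray(teacher_attn, dtype=np.float64)
    student_logits = np.asarray(student_logits, dtype=np.float64)

    # Normalize teacher attention into a distribution over regions.
    p = teacher_attn / (teacher_attn.sum(axis=-1, keepdims=True) + eps)
    # Turn student matching scores into a distribution over regions.
    q = softmax(student_logits / temperature, axis=-1)

    # Per-concept KL divergence, then mean over concepts.
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())


# Toy example: two concept words, three candidate regions.
teacher = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.1, 0.8]])
aligned_student = np.log(teacher)       # student already matches the teacher
uniform_student = np.zeros_like(teacher)  # student has no region preference

loss_aligned = attention_distillation_loss(teacher, aligned_student)
loss_uniform = attention_distillation_loss(teacher, uniform_student)
```

In this toy setting the loss is approximately zero when the student's region distribution already matches the teacher's attention and is larger when the student spreads its scores uniformly, which is the behavior an attention-guided matching signal would need.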

DOI: http://dx.doi.org/10.1109/TIP.2024.3485518

Publication Analysis

Top Keywords

multi-modal contextual

contextual knowledge

fusion transformer

open-vocabulary object

object detection

multi-modal masked

masked language

language modeling

masked concept

teacher fusion
