Cross-Modality Pyramid Alignment for Visual Intention Understanding.

IEEE Trans Image Process

Published: April 2023

Visual intention understanding is the task of exploring the potential and underlying meaning expressed in images. Simply modeling the objects or backgrounds within the image content leads to unavoidable comprehension bias. To alleviate this problem, this paper proposes a Cross-modality Pyramid Alignment with Dynamic optimization (CPAD) to enhance the global understanding of visual intention with hierarchical modeling. The core idea is to exploit the hierarchical relationship between visual content and textual intention labels. For visual hierarchy, we formulate the visual intention understanding task as a hierarchical classification problem, capturing multiple granular features in different layers, which corresponds to hierarchical intention labels. For textual hierarchy, we directly extract the semantic representation from intention labels at different levels, which supplements the visual content modeling without extra manual annotations. Moreover, to further narrow the domain gap between different modalities, a cross-modality pyramid alignment module is designed to dynamically optimize the performance of visual intention understanding in a joint learning manner. Comprehensive experiments intuitively demonstrate the superiority of our proposed method, outperforming existing visual intention understanding methods.

Download full-text PDF	Source
http://dx.doi.org/10.1109/TIP.2023.3261743	DOI Listing

Publication Analysis

Top Keywords

visual intention

intention understanding

cross-modality pyramid

pyramid alignment

intention labels

visual

intention

understanding visual

understanding task

visual content

Similar Publications

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!