Moment retrieval aims to localize the most relevant moment in an untrimmed video according to a given natural language query. Existing works often focus on only one aspect of this emerging task, such as query representation learning, video context modeling, or multi-modal fusion, and thus fail to develop a comprehensive system for further performance improvement. In this paper, we introduce a novel Cross-Modal Interaction Network (CMIN) that considers multiple crucial factors for this challenging task, including the syntactic dependencies of natural language queries, long-range semantic dependencies in the video context, and sufficient cross-modal interaction. Specifically, we devise a syntactic GCN that leverages the syntactic structure of queries for fine-grained representation learning, and propose a multi-head self-attention mechanism to capture long-range semantic dependencies from the video context. Next, we employ a multi-stage cross-modal interaction to explore the potential relations between video and query contents, and we also treat query reconstruction from the cross-modal representations of the target moment as an auxiliary task to strengthen those representations. Extensive experiments on ActivityNet Captions and TACoS demonstrate the effectiveness of our proposed method.
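The abstract names two representation modules: a syntactic GCN over the query's dependency parse and multi-head self-attention over video clip features. Below is a minimal PyTorch sketch of what such modules could look like; this is not the authors' implementation, and the class names, feature dimensions, and the identity-matrix adjacency placeholder are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyntacticGCNLayer(nn.Module):
    """One graph-convolution layer over a query's dependency tree.

    `adj` is a (batch, num_words, num_words) adjacency matrix built from
    the syntactic dependency parse (self-loops included).
    """
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, word_feats, adj):
        # Normalize by node degree so aggregation averages over neighbors.
        degree = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        agg = torch.bmm(adj, word_feats) / degree
        return F.relu(self.linear(agg))


class VideoSelfAttention(nn.Module):
    """Multi-head self-attention over clip features, intended to capture
    long-range semantic dependencies in the video context."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, clip_feats):
        out, _ = self.attn(clip_feats, clip_feats, clip_feats)
        return out + clip_feats  # residual connection


# Toy shapes: 2 videos of 64 clips, queries of 12 words, 256-d features.
video = torch.randn(2, 64, 256)
words = torch.randn(2, 12, 256)
adj = torch.eye(12).unsqueeze(0).repeat(2, 1, 1)  # placeholder parse graph

q_ctx = SyntacticGCNLayer(256)(words, adj)
v_ctx = VideoSelfAttention(256)(video)
print(q_ctx.shape, v_ctx.shape)  # (2, 12, 256) and (2, 64, 256)
```

In a full system, the query and video contexts produced above would then be fused by the multi-stage cross-modal interaction the abstract describes, with query reconstruction as an auxiliary training loss.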

Source: http://dx.doi.org/10.1109/TIP.2020.2965987

Publication Analysis

Top Keywords
cross-modal interaction (16)
video context (12)
moment retrieval (8)
query reconstruction (8)
natural language (8)
representation learning (8)
long-range semantic (8)
semantic dependencies (8)
dependencies video (8)
cross-modal representations (8)
