Temporal language grounding (TLG) is one of the most challenging cross-modal video understanding tasks; it aims to retrieve the most relevant video segment from an untrimmed video according to a natural language sentence. Existing methods fall into two dominant types: 1) proposal-based and 2) proposal-free methods, where the former conduct contextual interactions and the latter localize timestamps flexibly. However, the constant-scale candidates in proposal-based methods limit localization precision and incur extra computational cost. In contrast, proposal-free methods perform well on high-precision metrics thanks to their fine-grained features but lack coarse-grained interactions, which causes performance to degrade when the video becomes complex. In this article, we propose a novel framework termed semantic decoupling network (SDN) that combines the advantages of proposal-based and proposal-free methods and overcomes their defects. It contains three key components: 1) a semantic decoupling module (SDM); 2) a context modeling block (CMB); and 3) a semantic cross-level aggregation module (SCAM). By capturing video-text contexts at multiple semantic levels, the SDM and CMB effectively exploit the benefits of proposal-based methods. Meanwhile, the SCAM retains the merit of proposal-free methods in that it localizes timestamps precisely. Experiments on three challenging datasets, i.e., Charades-STA, TACoS, and ActivityNet-Caption, show that the proposed SDN method significantly outperforms recent state-of-the-art methods, especially the proposal-free ones. Extensive analyses, as well as the implementation code of the proposed SDN method, are provided at https://github.com/CFM-MSG/Code_SDN.
DOI: http://dx.doi.org/10.1109/TNNLS.2022.3211850