AI Article Synopsis

  • Temporal action localization (TAL) is gaining attention, but existing methods struggle due to limited annotated untrimmed video data.
  • The authors propose a feature augmentation approach using a learnable Mask-based Feature Augmentation Module (MFAM) that enhances video features while preserving essential action-related information.
  • Extensive testing shows that this framework improves the robustness and performance of TAL models across four benchmark datasets, achieving state-of-the-art results without added computational costs during testing.

Article Abstract

Temporal action localization (TAL) has drawn much attention in recent years; however, the performance of previous methods is still far from satisfactory due to the lack of annotated untrimmed video data. To deal with this issue, we propose to improve the utilization of current data through feature augmentation. Given an input video, we first extract video features with pre-trained video encoders, and then randomly mask various semantic contents of the video features to obtain different views of them. To avoid damaging important action-related semantic information, we further develop a learnable feature augmentation framework to generate better views of videos. In particular, a Mask-based Feature Augmentation Module (MFAM) is proposed. The MFAM has three advantages: 1) it captures the temporal and semantic relationships of the original video features, 2) it generates masked features that retain indispensable action-related information, and 3) it randomly recycles some masked information to ensure diversity. Finally, we input the masked features and the original features separately into shared action detectors, and perform action classification and localization jointly for model learning. The proposed framework can improve the robustness and generalization of action detectors by learning from more and better views of videos. In the testing stage, the MFAM can be removed, so it adds no extra computational cost. Extensive experiments are conducted on four TAL benchmark datasets. Our proposed framework significantly improves different TAL models and achieves state-of-the-art performance.
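
The random masking step with recycling described in the abstract can be illustrated with a minimal sketch. This is not the authors' MFAM, which is learnable; it only shows the simpler random variant under stated assumptions: features are a T x C grid (T time steps, C channels), each channel stands in for one "semantic content", and the function name, `mask_ratio`, and `recycle_ratio` are hypothetical and do not come from the paper.

```python
import random


def random_mask_features(features, mask_ratio=0.3, recycle_ratio=0.2, rng=None):
    """Return a masked view of T x C features plus the 0/1 channel keep-mask.

    Assumption (not from the paper): each channel is treated as one
    "semantic content", and masking means zeroing that channel at every
    time step.
    """
    if rng is None:
        rng = random.Random()
    num_channels = len(features[0])

    # Pick the channels that are candidates for masking.
    num_masked = int(round(mask_ratio * num_channels))
    masked_channels = rng.sample(range(num_channels), num_masked)

    # "Recycle" a random subset of the masked channels, i.e. restore them,
    # so that different calls produce more diverse views.
    num_recycled = int(round(recycle_ratio * len(masked_channels)))
    recycled = set(rng.sample(masked_channels, num_recycled))
    dropped = set(masked_channels) - recycled

    keep = [0.0 if c in dropped else 1.0 for c in range(num_channels)]
    # Build a new grid so the original features stay untouched.
    masked = [[value * k for value, k in zip(frame, keep)] for frame in features]
    return masked, keep


# Toy input: 4 time steps, 10 channels, all ones.
features = [[1.0] * 10 for _ in range(4)]
masked_view, keep_mask = random_mask_features(
    features, mask_ratio=0.5, recycle_ratio=0.2, rng=random.Random(0)
)
```

During training, both `features` and `masked_view` would be passed to the same detector and their losses combined; at test time only the original features are needed, which matches the abstract's point that the augmentation adds no inference cost.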

Source
http://dx.doi.org/10.1109/TIP.2024.3413599

Publication Analysis

Top Keywords

feature augmentation: 16
video features: 16
learnable feature: 8
augmentation framework: 8
temporal action: 8
action localization: 8
better views: 8
views videos: 8
masked features: 8
action detectors: 8
