As a challenging task in high-level video understanding, weakly supervised temporal action localization has attracted increasing attention recently. Because only video-level category labels are available, this task is usually formulated as a classification problem, which suffers from the contradiction between classification and detection. In this paper, we describe a novel approach that alleviates this contradiction and detects more complete action instances by explicitly modeling sub-actions. Our method relies on three innovations to model the latent sub-actions. First, our framework uses prototypes to represent sub-actions, and these prototypes can be learned automatically in an end-to-end way. Second, we regard the relations among sub-actions as a graph and construct the correspondences between sub-actions and actions through a graph pooling operation. Doing so not only makes the sub-actions inter-dependent, which facilitates the multi-label setting, but also naturally uses the video-level labels as weak supervision. Third, we devise three complementary loss functions, namely a representation loss, a balance loss, and a relation loss, to ensure that the learned sub-actions are diverse and have clear semantic meanings. Experimental results on the THUMOS14 and ActivityNet1.3 datasets demonstrate the effectiveness of our method and its superior performance over state-of-the-art approaches.
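
The prototype idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function names, the cosine-similarity scoring, the temperature value, and the fixed `assignment` matrix (a stand-in for the learned graph pooling) are all assumptions made for illustration, and the three loss functions are not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sub_action_responses(snippets, prototypes, temperature=0.1):
    """For each snippet, a distribution over sub-action prototypes.

    Assumption: similarity is cosine and the temperature is hand-picked.
    """
    return [
        softmax([cosine(f, p) / temperature for p in prototypes])
        for f in snippets
    ]

def video_level_scores(responses, assignment):
    """Pool per-snippet sub-action responses into video-level action scores.

    assignment[k][c] is a hypothetical fixed weight linking sub-action k
    to action c; the paper learns this correspondence via graph pooling.
    """
    num_sub = len(assignment)
    num_act = len(assignment[0])
    # Temporal mean of each sub-action's response over all snippets.
    mean_resp = [
        sum(r[k] for r in responses) / len(responses) for k in range(num_sub)
    ]
    return [
        sum(mean_resp[k] * assignment[k][c] for k in range(num_sub))
        for c in range(num_act)
    ]

# Toy example: three 2-D snippet features and two prototypes.
snippets = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
assignment = [[1.0, 0.0], [0.0, 1.0]]  # sub-action k maps to action k

responses = sub_action_responses(snippets, prototypes)
scores = video_level_scores(responses, assignment)
```

In this toy setup two of the three snippets lie close to the first prototype, so the first action receives the larger video-level score. Because the scores are pooled from snippet-level responses, the same responses could be thresholded over time to propose action segments, which is the link between classification and localization that the abstract refers to.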

Source: http://dx.doi.org/10.1109/TIP.2021.3078324

Similar Publications

Action Quality Assessment (AQA), which evaluates how well specific actions such as sports activities are performed, plays an important role in video analysis.

A sequential learning model with GNN for EEG-EMG-based stroke rehabilitation BCI.

Front Neurosci

April 2023

Translational Research Center, Shanghai Yangzhi Rehabilitation Hospital (Shanghai Sunshine Rehabilitation Center), School of Electronic and Information Engineering, Tongji University, Shanghai, China.

Introduction: Brain-computer interfaces (BCIs) have the potential to provide neurofeedback that helps stroke patients improve motor rehabilitation. However, current BCIs often detect only general motor intentions and lack the precise information needed for complex movement execution, mainly because EEG signals contain insufficient movement execution features.

Methods: This paper presents a sequential learning model incorporating a Graph Isomorphism Network (GIN) that processes a sequence of graph-structured data derived from EEG and EMG signals.

Weakly Supervised Temporal Action Localization (WTAL) aims to localize action segments in untrimmed videos using only video-level category labels during training. In WTAL, an action generally consists of a series of sub-actions, and different action categories may share common sub-actions. However, to distinguish action categories using only video-level class labels, current WTAL models tend to focus on the discriminative sub-actions of an action while ignoring the common sub-actions shared across categories.

In recent years, reinforcement learning has achieved excellent results in low-dimensional static action spaces such as games and simple robotics. In practical tasks, however, the action space is usually composite, made up of multiple sub-actions with different functions, and time-varying. Existing sub-actions might become temporarily invalid because of the external environment, while previously unseen sub-actions can be added to the current system.
