Weakly supervised temporal action localization aims to identify and localize action instances in untrimmed videos using only video-level labels. Most methods are based on a multiple instance learning framework that uses a top-K strategy to select salient segments to represent the entire video. As a result, fine-grained video information cannot be learned, leading to poor action classification and localization performance. In this paper, we propose a Multilevel Semantic and Adaptive Actionness Learning Network (SAL), mainly composed of a multilevel semantic learning (MSL) branch and an adaptive actionness learning (AAL) branch. The MSL branch introduces second-order video semantics, which capture fine-grained information in videos and improve video-level classification performance. Furthermore, we propagate second-order semantics to action segments to enhance the differences between actions. The AAL branch uses pseudo labels to learn class-agnostic action information. It introduces a video segment mix-up strategy to enhance foreground generalization and adds an adaptive actionness mask to balance the quality and quantity of pseudo labels, thereby improving training stability. Extensive experiments show that SAL achieves state-of-the-art results on three benchmarks. Code: https://github.com/lizhilin-ustc/SAL.
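The top-K multiple-instance-learning pooling the abstract critiques, and the segment mix-up augmentation it proposes, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the array shapes, the averaging of the top-K scores, and the Beta-distributed mixing coefficient are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-segment class scores for one untrimmed video:
# T temporal segments, C action classes, top-k segments kept per class.
T, C, k = 16, 4, 4
segment_scores = rng.normal(size=(T, C))

# Top-K MIL pooling: average the k highest-scoring segments per class to
# form a video-level score, so only salient segments drive classification
# (which is why fine-grained segment information is lost).
topk = np.sort(segment_scores, axis=0)[-k:]   # (k, C) highest scores per class
video_scores = topk.mean(axis=0)              # (C,) video-level class scores

# Segment mix-up (a sketch of the AAL branch's augmentation idea): blend
# foreground segment features from two videos to improve generalization.
def mixup(feat_a: np.ndarray, feat_b: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    lam = rng.beta(alpha, alpha)              # mixing coefficient in (0, 1)
    return lam * feat_a + (1.0 - lam) * feat_b
```

Because only the k best segments per class reach the video-level score, the pooled score is never below the all-segment average, which is the bias the MSL branch's second-order semantics are meant to offset.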
DOI: 10.1016/j.neunet.2024.106905
Neural Networks, February 2025
National Engineering Laboratory for Brain-inspired Intelligence Technology and Application (NEL-BITA), University of Science and Technology of China, Hefei, 230026, China.