With the explosive growth of videos, weakly-supervised temporal action localization (WS-TAL) task has become a promising research direction in pattern analysis and machine learning. WS-TAL aims to detect and localize action instances with only video-level labels during training. Modern approaches have achieved impressive progress via powerful deep neural networks. However, robust and reliable WS-TAL remains challenging and underexplored due to considerable uncertainty caused by weak supervision, noisy evaluation environment, and unknown categories in the open world. To this end, we propose a new paradigm, named vectorized evidential learning (VEL), to explore local-to-global evidence collection for facilitating model performance. Specifically, a series of learnable meta-action units (MAUs) are automatically constructed, which serve as fundamental elements constituting diverse action categories. Since the same meta-action unit can manifest as distinct action components within different action categories, we leverage MAUs and category representations to dynamically and adaptively learn action components and action-component relations. After performing uncertainty estimation at both category-level and unit-level, the local evidence from action components is accumulated and optimized under the Subject Logic theory. Extensive experiments on the regular, noisy, and open-set settings of three popular benchmarks show that VEL consistently obtains more robust and reliable action localization performance than state-of-the-arts.
Download full-text PDF |
Source |
---|---|
http://dx.doi.org/10.1109/TPAMI.2023.3311447 | DOI Listing |
Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!