Action Quality Assessment (AQA) plays an important role in video analysis, where it is used to evaluate the quality of specific actions such as sports activities. It remains challenging because many actions differ only subtly while sharing similar backgrounds, yet current approaches mostly adopt holistic video representations and therefore cannot capture fine-grained intra-class variations. To address this challenge, we propose a Fine-grained Spatio-temporal Parsing Network (FSPN), composed of an intra-sequence action parsing module and a spatiotemporal multiscale transformer module, to learn fine-grained spatiotemporal sub-action representations for more reliable AQA. The intra-sequence action parsing module performs semantic sub-action parsing by mining sub-actions at fine-grained levels, which makes it possible to describe the subtle differences between action sequences accurately. The spatiotemporal multiscale transformer module learns motion-oriented action features and captures the long-range dependencies among sub-actions at different scales. Furthermore, we design a group contrastive loss to train the model and learn more discriminative feature representations for sub-actions without explicit supervision. We evaluate the proposed approach extensively on the FineDiving, AQA-7, and MTL-AQA datasets. The experimental results demonstrate its effectiveness and feasibility: it outperforms state-of-the-art methods by a significant margin.
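
The abstract does not give the form of the group contrastive loss, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes an InfoNCE-style objective in which sub-action features that share a group index are treated as positives and all other features as negatives. The function name `group_contrastive_loss`, the `group_ids` argument, and the `temperature` value are all hypothetical choices made for this example.

```python
import numpy as np


def group_contrastive_loss(features, group_ids, temperature=0.1):
    """Illustrative group-wise contrastive loss (assumed form, not from the paper).

    features:  (n, d) array of sub-action feature vectors.
    group_ids: (n,) array; equal ids mark features assumed to belong together.
    """
    features = np.asarray(features, dtype=np.float64)
    group_ids = np.asarray(group_ids)

    # L2-normalise so the dot product is a cosine similarity.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    # Exclude each feature's similarity with itself.
    n = z.shape[0]
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)

    # Numerically stable log-softmax over all other features.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(
        np.exp(sim - row_max).sum(axis=1, keepdims=True)
    )

    # Positives: same group, different index.
    pos_mask = (group_ids[:, None] == group_ids[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0  # no anchor has a positive partner

    # Average negative log-probability of the positives, per anchor.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    loss_per_anchor = -pos_log_prob[valid] / pos_count[valid]
    return float(loss_per_anchor.mean())


# Toy usage: two tight clusters of 2-D features.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matched = group_contrastive_loss(toy_features, [0, 0, 1, 1])
mismatched = group_contrastive_loss(toy_features, [0, 1, 0, 1])
```

In this toy setup the loss is lower when the group indices agree with the feature clusters than when they cut across them, which is the behaviour a contrastive objective of this kind is meant to reward.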

Source
http://dx.doi.org/10.1109/TIP.2023.3331212

Similar Publications

Aiming at the problem that the existing human skeleton behavior recognition methods are insensitive to human local movements and show inaccurate recognition in distinguishing similar behaviors, a multi-scale spatio-temporal graph convolution method incorporating multi-granularity features is proposed for human behavior recognition. Firstly, a skeleton fine-grained partitioning strategy is proposed, which initializes the skeleton data into data streams of different granularities. An adaptive cross-scale feature fusion layer is designed using a normalized Gaussian function to perform feature fusion among different granularities, guiding the model to focus on discriminative feature representations among similar behaviors through fine-grained features.

Humans typically make decisions based on past experiences and observations. In robotic manipulation, by contrast, action prediction often relies solely on current observations, which tends to make robots overlook environmental changes or become ineffective when current observations are suboptimal. To address this challenge, and inspired by human cognitive processes, we propose a method that integrates historical learning and multi-view attention to improve the performance of robotic manipulation. Based on a spatio-temporal attention mechanism, our method not only combines observations from current and past steps but also integrates historical actions to better perceive changes in robots' behaviours and their impacts on the environment.

MTLPM: a long-term fine-grained PM2.5 prediction method based on spatio-temporal graph neural network.

Environ Monit Assess

November 2024

School of Electronic and Electrical Engineering, Wuhan Textile University, Wuhan, 430200, China.

The concentration of PM2.5 is one of the air quality indicators that the public pays the most attention to. Existing methods for PM2.

Development of over 30-years of high spatiotemporal resolution air pollution models and surfaces for California.

Environ Int

November 2024

Research Division, California Air Resources Board, Sacramento, CA 95812, the United States of America.

California's diverse geography and meteorological conditions necessitate models capturing fine-grained patterns of air pollution distribution. This study presents the development of high-resolution (100 m) daily land use regression (LUR) models spanning 1989-2021 for nitrogen dioxide (NO2), fine particulate matter (PM2.5), and ozone (O3) across California. These machine learning LUR algorithms integrated comprehensive data sources, including traffic, land use, land cover, meteorological conditions, vegetation dynamics, and satellite data.

Cut-and-Paste: Subject-driven video editing with attention control.

Neural Netw

January 2025

School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China.

Article Synopsis
  • The paper introduces a new video editing framework called Cut-and-Paste that allows for more precise semantic editing by using a combination of text prompts and reference images.
  • It addresses the challenges of traditional text-only video editing methods, which often require complex descriptions to achieve fine-grained control over edited details and regions.
  • By integrating cross attention control from image editing and applying it to video, the system effectively maintains background consistency while allowing for targeted edits, proving to be more effective compared to existing techniques.