Recognizing expressions from dynamic facial videos can find more natural affect states of humans, and it becomes a more challenging task in real-world scenes due to pose variations of face, partial occlusions and subtle dynamic changes of emotion sequences. Existing transformer-based methods often focus on self-attention to model the global relations among spatial features or temporal features, which cannot well focus on important expression-related locality structures from both spatial and temporal features for the in-the-wild expression videos. To this end, we incorporate diverse graph structures into transformers and propose a CDGT method to construct diverse graph transformers for efficient emotion recognition from in-the-wild videos. Specifically, our method contains a spatial dual-graphs transformer and a temporal hyperbolic-graph transformer. The former deploys a dual-graph constrained attention to capture latent emotion-related graph geometry structures among local spatial tokens for efficient feature representation, especially for the video frames with pose variations and partial occlusions. The latter adopts a hyperbolic-graph constrained self-attention that explores important temporal graph structure information under hyperbolic space to model more subtle changes of dynamic emotion. Extensive experimental results on in-the-wild video-based facial expression databases show that our proposed CDGT outperforms other state-of-the-art methods.

Download full-text PDF

Source
http://dx.doi.org/10.1016/j.neunet.2024.106573DOI Listing

Publication Analysis

Top Keywords

diverse graph
12
graph transformers
8
emotion recognition
8
facial videos
8
pose variations
8
partial occlusions
8
temporal features
8
graph
5
cdgt constructing
4
constructing diverse
4

Similar Publications

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!