Human-object relationship detection reveals the fine-grained relationships between humans and objects, supporting comprehensive video understanding. Previous human-object relationship detection approaches rely mainly on object features and relation features without exploiting human-specific information. In this paper, we propose a novel Relation-Pose Transformer (RPT) for human-object relationship detection. Inspired by the coordination of eye-head-body movements in cognitive science, we employ the head pose to find the crucial objects that humans focus on and use the body pose with skeleton information to represent multiple actions. We then utilize a spatial encoder to capture spatially contextualized information about the relation pair, integrating the relation features and pose features. Next, a temporal decoder models the temporal dependencies of the relationship. Finally, we adopt multiple classifiers to predict different types of relationships. Extensive experiments on the Action Genome benchmark validate the effectiveness of our proposed method and show state-of-the-art performance compared with related methods.
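The pipeline described above (pose-aware spatial encoding per frame, temporal decoding across frames, then multiple classifier heads) can be sketched as follows. This is a hypothetical illustration only, assuming simple placeholder operations: the function names, feature fusion by concatenation, temporal mean-pooling, and the single linear head are illustrative stand-ins, not the authors' RPT implementation.

```python
# Illustrative sketch of the RPT data flow from the abstract.
# All operations are placeholders, not the paper's actual modules.

def spatial_encoder(relation_feat, head_pose_feat, body_pose_feat):
    """Fuse relation features with head/body pose features into one
    spatial context vector (concatenation stands in for attention)."""
    return relation_feat + head_pose_feat + body_pose_feat  # list concat

def temporal_decoder(frame_contexts):
    """Model temporal dependency across frames
    (per-dimension mean stands in for a transformer decoder)."""
    n = len(frame_contexts)
    dim = len(frame_contexts[0])
    return [sum(f[i] for f in frame_contexts) / n for i in range(dim)]

def classifier_head(context, weights):
    """One linear classifier head per relationship type."""
    return sum(c * w for c, w in zip(context, weights))

# Toy inputs: two frames, each with 2-dim relation/pose features.
frames = [
    spatial_encoder([0.1, 0.2], [0.3, 0.4], [0.5, 0.6]),
    spatial_encoder([0.2, 0.1], [0.4, 0.3], [0.6, 0.5]),
]
temporal = temporal_decoder(frames)
# Multiple heads would predict different relationship types;
# a single toy head is shown here.
score = classifier_head(temporal, [0.5] * len(temporal))
```

In the abstract's terms, the encoder output per frame corresponds to the fused relation-pose context, and the decoder output is the temporally aggregated representation fed to the per-type classifiers.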
DOI: http://dx.doi.org/10.1109/TIP.2023.3270040