Scene graph generation (SGG) and human-object interaction (HOI) detection are two important visual tasks that aim, respectively, at localising and recognising relationships between objects and interactions between humans and objects. Prevailing works treat them as distinct tasks, leading to task-specific models tailored to individual datasets. However, we posit that visual relationships furnish crucial contextual and relational cues that significantly aid the inference of human-object interactions. This motivates us to ask whether there is a natural intrinsic relationship between the two tasks, in which scene graphs can serve as a source for inferring human-object interactions. In light of this, we introduce SG2HOI+, a unified one-step model based on the Transformer architecture. Our approach employs two interactive hierarchical Transformers to seamlessly unify SGG and HOI detection. Concretely, we first devise a relation Transformer that generates relation triples from a suite of visual features; we then employ another Transformer-based decoder to predict human-object interactions from the generated triples. Comprehensive experiments on established benchmark datasets, including Visual Genome, V-COCO, and HICO-DET, demonstrate the compelling performance of SG2HOI+ in comparison to prevalent one-stage SGG models. Remarkably, our approach achieves competitive performance when compared to state-of-the-art HOI methods. Additionally, we observe that SG2HOI+ jointly trained on both SGG and HOI tasks in an end-to-end manner yields substantial improvements for both tasks compared to individualised training paradigms.
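The pipeline above feeds the output of the relation Transformer (scene-graph triples) into a second decoder that predicts human-object interactions. A minimal, framework-free sketch of that interface is shown below; the data types and the `triples_to_hoi_candidates` helper are illustrative placeholders, not the authors' API, and the toy filtering stands in for the learned HOI decoder.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical data types for the SGG -> HOI hand-off: the relation
# Transformer emits <subject, predicate, object> triples, and the HOI
# stage consumes them as relational context. Names are illustrative only.

@dataclass
class RelationTriple:
    subject: str     # detected category of the subject, e.g. "person"
    predicate: str   # visual relationship, e.g. "riding", "next to"
    obj: str         # detected category of the object

@dataclass
class HOICandidate:
    verb: str
    obj: str

def triples_to_hoi_candidates(triples: List[RelationTriple]) -> List[HOICandidate]:
    """Toy stand-in for the second decoder: keep triples whose subject is a
    person; their predicates seed candidate human-object interactions."""
    return [HOICandidate(verb=t.predicate, obj=t.obj)
            for t in triples if t.subject == "person"]

scene_graph = [
    RelationTriple("person", "riding", "bicycle"),
    RelationTriple("bicycle", "on", "road"),
    RelationTriple("person", "wearing", "helmet"),
]
print(triples_to_hoi_candidates(scene_graph))
```

The point of the sketch is the data flow, not the rule: in the actual model the second Transformer attends over the generated triples rather than filtering them by subject category.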
DOI: http://dx.doi.org/10.1109/TIP.2023.3330304
Sci Rep
November 2024
Cognitive Neuroscience Laboratory, German Primate Center - Leibniz Institute for Primate Research, Göttingen, Germany.
Human object perception depends on the proper integration of multiple visual features, such as color and motion. When features are integrated incorrectly, they are perceptually misbound and can cause illusions. This study investigates the phenomenon of continuous misbinding of color and motion features in peripheral vision, addressing the role of spatial continuity and color configuration in binding processes.
Neuron
December 2024
McGovern Institute for Brain Research, Massachusetts Institute of Technology (MIT), Cambridge, MA, USA; Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology (MIT), Cambridge, MA, USA.
Characterizing the functional organization of cerebral cortex is a fundamental step in understanding how different kinds of information are processed in the brain. However, it is still unclear how these areas are organized during naturalistic visual and auditory stimulation. Here, we used high-resolution functional MRI data from 176 human subjects to map the macro-architecture of the entire cerebral cortex based on responses to a 60-min audiovisual movie stimulus.
IEEE Trans Vis Comput Graph
November 2024
We present HOIMotion - a novel approach for human motion forecasting during human-object interactions that integrates information about past body poses and egocentric 3D object bounding boxes. Human motion forecasting is important in many augmented reality applications but most existing methods have only used past body poses to predict future motion. HOIMotion first uses an encoder-residual graph convolutional network (GCN) and multi-layer perceptrons to extract features from body poses and egocentric 3D object bounding boxes, respectively.
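The encoder-residual GCN mentioned above builds on a residual graph-convolution step, roughly H_out = H + A·H·W (a graph convolution over the pose/object graph with an identity skip connection). The following framework-free sketch shows that step under stated assumptions: the adjacency matrix, feature sizes, and weight matrix are toy placeholders, not HOIMotion's actual parameters.

```python
# Minimal residual graph-convolution step: propagate node features over the
# adjacency matrix A, mix channels with a weight matrix W, then add the
# identity skip connection. All matrices here are illustrative toys.

def matmul(a, b):
    """Plain-Python matrix multiply for small dense matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def residual_gcn_layer(h, adj, w):
    """One residual graph convolution: h + (adj @ h) @ w."""
    out = matmul(matmul(adj, h), w)
    return [[h[i][j] + out[i][j] for j in range(len(h[0]))]
            for i in range(len(h))]

# Two connected graph nodes (e.g. two body joints), 2-d features,
# identity channel-mixing weights:
h = [[1.0, 0.0], [0.0, 1.0]]
adj = [[0.0, 1.0], [1.0, 0.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
print(residual_gcn_layer(h, adj, w))  # -> [[1.0, 1.0], [1.0, 1.0]]
```

The skip connection is what the "residual" in encoder-residual GCN refers to: each node keeps its own features while accumulating information from its graph neighbours.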
Sensors (Basel)
July 2024
School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
Human-object interaction (HOI) detection identifies a set of interactions in an image, involving both the recognition of interacting instances and the classification of interaction categories. The complexity and variety of image content make this task challenging. Recently, Transformers have been applied in computer vision and have received attention in the HOI detection task.
IEEE Trans Vis Comput Graph
June 2024
Human eye gaze plays a significant role in many virtual and augmented reality (VR/AR) applications, such as gaze-contingent rendering, gaze-based interaction, or eye-based activity recognition. However, prior works on gaze analysis and prediction have only explored eye-head coordination and were limited to human-object interactions. We first report a comprehensive analysis of eye-body coordination in various human-object and human-human interaction activities based on four public datasets collected in real-world (MoGaze), VR (ADT), as well as AR (GIMO and EgoBody) environments.