We present EgoACO, a deep neural architecture for video action recognition that learns to pool action-context-object descriptors from frame-level features by leveraging the verb-noun structure of action labels in egocentric video datasets. The core component is class activation pooling (CAP), a differentiable pooling layer that combines ideas from bilinear pooling for fine-grained recognition and from feature learning for discriminative localization. CAP uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions. Through CAP, EgoACO learns to decode object and scene context descriptors from video frame features. For temporal modeling we design a recurrent version of class activation pooling termed Long Short-Term Attention (LSTA). LSTA extends convolutional gated LSTM with built-in spatial attention and a re-designed output gate. Action, object, and context descriptors are fused by a multi-head prediction layer that accounts for the inter-dependencies between the verb, noun, and action components of structured egocentric action labels. EgoACO also provides built-in visual explanations, which support both learning and the interpretation of discriminative information in video. Results on the two largest egocentric action recognition datasets currently available, EPIC-KITCHENS and EGTEA Gaze+, show that by decoding action-context-object descriptors the model achieves state-of-the-art recognition performance.
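The abstract leaves the pooling mechanism implicit, so the following is a minimal sketch of how a CAP-style layer could be realized: a dictionary of learnable query vectors attends over the spatial locations of a frame feature map and pools one descriptor per dictionary entry. The class name, tensor shapes, and the exact attention formulation (scaled dot-product attention with softmax normalization) are assumptions made for illustration, not the authors' implementation.

```python
# Hedged sketch of class activation pooling (CAP) as described in the abstract:
# a dictionary of learnable query vectors attends over spatial frame features
# and pools one descriptor per dictionary entry. Names, shapes, and the exact
# attention formulation are assumptions, not the published implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClassActivationPooling(nn.Module):
    def __init__(self, feat_dim: int, num_queries: int):
        super().__init__()
        # Learnable dictionary: one query vector per pooled descriptor.
        self.dictionary = nn.Parameter(torch.randn(num_queries, feat_dim) * 0.02)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, C, H, W) convolutional features of a frame.
        B, C, H, W = frame_feats.shape
        feats = frame_feats.flatten(2).transpose(1, 2)                 # (B, H*W, C)
        # Attention logits between each dictionary entry and each spatial location.
        logits = torch.einsum('bnc,kc->bkn', feats, self.dictionary)   # (B, K, H*W)
        attn = F.softmax(logits / C ** 0.5, dim=-1)                    # weights over locations
        # Attention-weighted pooling: one C-dimensional descriptor per dictionary entry.
        return torch.einsum('bkn,bnc->bkc', attn, feats)               # (B, K, C)


# Usage: pool K=4 context/object descriptors from a 2048-channel, 7x7 feature map.
cap = ClassActivationPooling(feat_dim=2048, num_queries=4)
descriptors = cap(torch.randn(2, 2048, 7, 7))   # -> (2, 4, 2048)
```

A recurrent variant of this idea, applied over the frame sequence with gating as in a convolutional LSTM, corresponds to the LSTA module described above.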
DOI: http://dx.doi.org/10.1109/TPAMI.2021.3058649
Related articles:

Front Neurosci, November 2024. German Center for Vertigo and Balance Disorders, LMU University Hospital, Munich, Germany.
IEEE J Biomed Health Inform, November 2024. Objective: Hand function is central to interactions with our environment. Developing a comprehensive model of hand grasps in naturalistic environments is crucial across various disciplines, including robotics, ergonomics, and rehabilitation. Creating such a taxonomy poses challenges due to the significant variation in grasping strategies that individuals may employ.
Data Brief, December 2024. Faculty of Computing, Universiti Teknologi Malaysia, 81310 Skudai, Johor Bahru, Malaysia. The dataset presents raw data from egocentric (first-person view) and exocentric (third-person view) perspectives, comprising 47,166 frame images. Egocentric and exocentric frames are extracted simultaneously from the original iPhone videos. The egocentric view captures the details of proximity hand gestures and the attentiveness of the iPhone wearer, while the exocentric view captures the hand gestures of all participants from a top-down viewpoint.
Sensors (Basel), October 2024. Department of Informatics, Indiana University Bloomington, Bloomington, IN 47408, USA. Rock climbing has propelled from a niche sport to a mainstream free-time activity and an Olympic sport. Moreover, climbing can be studied as an example of a high-stakes perception-action task. However, understanding what constitutes an expert climber is neither simple nor straightforward.
We introduce the Visual Experience Dataset (VEDB), a compilation of more than 240 hours of egocentric video combined with gaze- and head-tracking data that offers an unprecedented view of the visual world as experienced by human observers. The dataset consists of 717 sessions, recorded by 56 observers ranging from 7 to 46 years of age. This article outlines the data collection, processing, and labeling protocols undertaken to ensure a representative sample and discusses potential sources of error or bias within the dataset.