We present EgoACO, a deep neural architecture for video action recognition that learns to pool action-context-object descriptors from frame-level features by leveraging the verb-noun structure of action labels in egocentric video datasets. The core component is class activation pooling (CAP), a differentiable pooling layer that combines ideas from bilinear pooling for fine-grained recognition and from feature learning for discriminative localization. CAP uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions. Through CAP, EgoACO learns to decode object and scene context descriptors from video frame features. For temporal modeling we design a recurrent version of class activation pooling termed Long Short-Term Attention (LSTA). LSTA extends convolutional gated LSTM with built-in spatial attention and a redesigned output gate. Action, object, and context descriptors are fused by a multi-head prediction layer that accounts for the inter-dependencies of the noun-verb-action structured labels. EgoACO features built-in visual explanations, which aid both learning and the interpretation of discriminative information in video. Results on the two largest egocentric action recognition datasets currently available, EPIC-KITCHENS and EGTEA Gaze+, show that by decoding action-context-object descriptors the model achieves state-of-the-art recognition performance.
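To make the CAP idea concrete, below is a minimal PyTorch sketch of a class-activation-pooling-style layer: a dictionary of learnable query weights attends over the spatial locations of a frame feature map and pools one descriptor per dictionary entry. The module name CAPooling, the parameter num_queries, and the initialization details are illustrative assumptions rather than the paper's implementation.

```python
# Hypothetical sketch of a CAP-style layer (not the authors' code):
# a dictionary of learnable queries attends over spatial positions of a
# frame feature map and pools one descriptor per dictionary entry.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CAPooling(nn.Module):
    def __init__(self, in_channels: int, num_queries: int):
        super().__init__()
        # Learnable dictionary: one query vector per pooled descriptor.
        self.queries = nn.Parameter(torch.randn(num_queries, in_channels) * 0.02)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) frame-level feature map from a CNN backbone.
        B, C, H, W = feats.shape
        x = feats.flatten(2).transpose(1, 2)                   # (B, H*W, C)
        # Attention logits between each dictionary entry and each location.
        logits = torch.einsum("qc,bnc->bqn", self.queries, x)  # (B, Q, H*W)
        attn = F.softmax(logits, dim=-1)                       # softmax over locations
        # Weighted spatial pooling: one C-dim descriptor per query.
        return torch.einsum("bqn,bnc->bqc", attn, x)           # (B, Q, C)


if __name__ == "__main__":
    layer = CAPooling(in_channels=512, num_queries=3)          # e.g. action/context/object
    frame_feats = torch.randn(2, 512, 7, 7)                    # dummy backbone output
    descriptors = layer(frame_feats)
    print(descriptors.shape)                                   # torch.Size([2, 3, 512])
```

Under these assumptions, separate dictionary entries (or separate CAP instances) would yield the action, context, and object descriptors that a downstream multi-head prediction layer then fuses; the attention maps themselves can be visualized as the built-in explanations mentioned in the abstract.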


Source
http://dx.doi.org/10.1109/TPAMI.2021.3058649

Publication Analysis

Top Keywords

egocentric video (12); action recognition (8); action-context-object descriptors (8); labels egocentric (8); video datasets (8); class activation (8); activation pooling (8); context descriptors (8); video (6); learning recognize (4)

Similar Publications

Article Synopsis
  • Spatial memory and orientation issues are often early signs of dementia, making early detection important for effective treatment.
  • The study involved 135 participants with varying cognitive abilities, who were assessed using subjective and objective spatial orientation tests, ensuring they had normal vestibular function.
  • Results showed a significant correlation between self-reported spatial discomfort and actual spatial impairment, with cognitively impaired individuals experiencing greater discomfort and higher angular deviations in tasks demanding spatial transformation.

Objective: Hand function is central to interactions with our environment. Developing a comprehensive model of hand grasps in naturalistic environments is crucial across various disciplines, including robotics, ergonomics, and rehabilitation. Creating such a taxonomy poses challenges due to the significant variation in grasping strategies that individuals may employ.


The dataset presents raw data from egocentric (first-person view) and exocentric (third-person view) perspectives, comprising 47,166 frame images. Egocentric and exocentric frames are recorded simultaneously from the original iPhone videos. The egocentric view captures the details of proximity hand gestures and the attentiveness of the iPhone wearer, while the exocentric view captures the hand gestures of all participants from a top-down view.


Rock climbing has grown from a niche sport into a mainstream free-time activity and an Olympic sport. Moreover, climbing can be studied as an example of a high-stakes perception-action task. However, understanding what constitutes an expert climber is not simple or straightforward.


We introduce the Visual Experience Dataset (VEDB), a compilation of more than 240 hours of egocentric video combined with gaze- and head-tracking data that offer an unprecedented view of the visual world as experienced by human observers. The dataset consists of 717 sessions, recorded by 56 observers ranging from 7 to 46 years of age. This article outlines the data collection, processing, and labeling protocols undertaken to ensure a representative sample and discusses the potential sources of error or bias within the dataset.

