Effective spatio-temporal modeling, the core of video representation learning, is challenged by complex scale variations in the spatio-temporal cues of videos, especially the different visual tempos of actions and the varying spatial sizes of moving objects. Most existing works handle these scale variations with input-level or feature-level pyramid mechanisms, which either rely on expensive multistream architectures or explore multiscale spatio-temporal features in a fixed manner. To capture the complex scale dynamics of spatio-temporal cues both effectively and efficiently, this article proposes a single-stream, single-input architecture (SS-Arch.), namely the adaptive multi-granularity spatio-temporal network (AMS-Net), which models adaptive multi-granularity (Multi-Gran.) spatio-temporal cues for video action recognition. AMS-Net has two core components: the competitive progressive temporal modeling (CPTM) block and the collaborative spatio-temporal pyramid (CSTP) module. They, respectively, capture fine-grained temporal cues and fuse coarse-level spatio-temporal features in an adaptive manner. As a result, AMS-Net can handle subtle variations in visual tempos and fair-sized spatio-temporal dynamics in a unified architecture. AMS-Net can be flexibly instantiated on top of existing deep convolutional neural networks (CNNs) by adding the proposed CPTM block and CSTP module. Experiments on eight video benchmarks show that AMS-Net establishes state-of-the-art (SOTA) performance on fine-grained action recognition (i.e., Diving48 and FineGym), while performing very competitively on the widely used Something-Something and Kinetics.
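
The abstract gives no implementation details for the CPTM block or the CSTP module, so the following is only an illustrative sketch of the general idea of input-dependent ("adaptive") fusion across temporal scales. It is not the authors' method. The function names (`temporal_average`, `adaptive_fusion`), the moving-average branches, and the variance-based scoring are all assumptions chosen to keep the example small.

```python
import math


def temporal_average(seq, k):
    """Moving average over a centered window of size k (windows are clipped at the edges)."""
    half = k // 2
    out = []
    for t in range(len(seq)):
        lo, hi = max(0, t - half), min(len(seq), t + half + 1)
        window = seq[lo:hi]
        out.append(sum(window) / len(window))
    return out


def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fusion(seq, kernel_sizes=(1, 3, 5)):
    """Fuse several temporal scales with input-dependent weights.

    Each branch smooths the 1-D feature sequence at one temporal scale.
    Branches then compete through a softmax over a hypothetical score
    (here: the variance of each branch's response), and the output is the
    weighted sum of the branches.
    """
    branches = [temporal_average(seq, k) for k in kernel_sizes]
    scores = [variance(b) for b in branches]  # assumed scoring rule
    weights = softmax(scores)
    fused = [
        sum(w * b[t] for w, b in zip(weights, branches))
        for t in range(len(seq))
    ]
    return fused, weights


if __name__ == "__main__":
    fast_action = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # toy fast-tempo feature
    fused, weights = adaptive_fusion(fast_action)
    print(weights)
```

In this toy setting a rapidly changing sequence gives the finest branch (window size 1) the largest weight, while a constant sequence gives all branches equal weight. A real network would learn the scoring from data rather than use a fixed statistic.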

Source
http://dx.doi.org/10.1109/TNNLS.2023.3321141

Publication Analysis

Top Keywords

spatio-temporal cues: 16
adaptive multi-granularity: 12
action recognition: 12
spatio-temporal: 11
multi-granularity spatio-temporal: 8
cues video: 8
video action: 8
complex scale: 8
scale variations: 8
visual tempos: 8

Similar Publications

Silent Signals in the Snow: Tracking the Spatio-Temporal Territorial Marking Behavior of Snow Leopards (Panthera uncia) in the Mountainous Region of Baltistan, Pakistan.

Ecol Evol

December 2024

Baltistan Wildlife Conservation and Development Organization (Reg) Apixoq Abbas Town Skardu Gilgit Baltistan Pakistan.

Mammals, being social creatures, communicate through a variety of signal cues; it is therefore vital to understand how wild carnivores create and maintain connections with their neighbors for their survival. However, observing elusive species in their natural habitats poses significant challenges, leading to a scarcity of data. In this study, we aimed to provide a detailed long-term observation of snow leopards in the northern region of Pakistan; to this end, we used data from 136 camera traps collected between 2018 and 2023 to investigate the territorial marking behavior of snow leopards in Baltistan.

Article Synopsis
  • Animacy perception is the skill animals use to recognize whether objects are alive, essential for identifying social partners or threats for survival.
  • Research indicates that both vertebrates and arthropods demonstrate this perceptual ability, though the term "animacy" has been less frequently used in studies involving arthropods.
  • The review highlights evidence of biological motion detection, the use of static visual cues for individual recognition, particularly in paper wasps, and behaviors like thanatosis, where an animal pretends to be dead to manipulate perception of liveliness.

Effects of virus-induced immunogenic cues on oncolytic virotherapy.

Sci Rep

November 2024

Groningen Institute for Evolutionary Life Sciences, University of Groningen, Groningen, The Netherlands.

Oncolytic virotherapy is a promising form of cancer treatment that uses viruses to infect and kill cancer cells. In addition to their direct effects on cancer cells, the viruses stimulate various immune responses partly directed against the tumour. Efforts are made to genetically engineer oncolytic viruses to enhance their immunogenic potential.

TRPV4-A Multifunctional Cellular Sensor Protein with Therapeutic Potential.

Sensors (Basel)

October 2024

Faculty of Medicine and Health Technology, Tampere University, 33520 Tampere, Finland.

Transient receptor potential vanilloid (TRPV) channel proteins belong to the superfamily of TRP proteins that form cationic channels in animal cell membranes. These proteins have various subtype-specific functions, serving, for example, as sensors for pain, pressure, pH, and mechanical extracellular stimuli. The sensing of extracellular cues by TRPV4 triggers Ca²⁺ influx through the channel, subsequently coordinating numerous intracellular signaling cascades in a spatio-temporal manner.

Referring segmentation is a fundamental vision-language task that aims to segment out an object from an image or video in accordance with a natural language description. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image or video frames. A paradigm for tackling this problem in both the image and the video domains is to leverage a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder.
