Effective spatio-temporal modeling, as the core of video representation learning, is challenged by complex scale variations in spatio-temporal cues, especially the different visual tempos of actions and the varying spatial sizes of moving objects. Most existing works handle these variations with input-level or feature-level pyramid mechanisms, which, however, either rely on expensive multistream architectures or explore multiscale spatio-temporal features in a fixed manner. To capture the complex scale dynamics of spatio-temporal cues both effectively and efficiently, this article proposes a single-stream, single-input architecture, namely the adaptive multi-granularity spatio-temporal network (AMS-Net), which models adaptive multi-granularity spatio-temporal cues for video action recognition. To this end, AMS-Net introduces two core components: a competitive progressive temporal modeling (CPTM) block and a collaborative spatio-temporal pyramid (CSTP) module. They, respectively, capture fine-grained temporal cues and fuse coarse-level spatio-temporal features in an adaptive manner, allowing AMS-Net to handle both subtle variations in visual tempos and spatio-temporal dynamics of various sizes within a unified architecture. Note that AMS-Net can be flexibly instantiated on top of existing deep convolutional neural networks (CNNs) by inserting the proposed CPTM block and CSTP module. Experiments on eight video benchmarks show that AMS-Net establishes state-of-the-art (SOTA) performance on fine-grained action recognition (i.e., Diving48 and FineGym) while performing very competitively on the widely used Something-Something and Kinetics benchmarks.
DOI: http://dx.doi.org/10.1109/TNNLS.2023.3321141
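The abstract names the CPTM block and CSTP module but does not give their internals, so the following is only a rough PyTorch sketch of how such components might be composed. Everything beyond the component names taken from the abstract, including the parallel depthwise temporal branches, the softmax "competition" gate, the pyramid scales, and the channel dimensions, is an assumption, not the authors' implementation.

```python
# Illustrative sketch only: CPTM/CSTP internals are assumed, not taken
# from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPTMBlock(nn.Module):
    """Competitive progressive temporal modeling (sketch): parallel
    depthwise temporal convolutions with different receptive fields,
    fused by learned softmax "competition" weights."""
    def __init__(self, channels, temporal_kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=(k, 1, 1),
                      padding=(k // 2, 0, 0), groups=channels)
            for k in temporal_kernels
        )
        self.gate = nn.Linear(channels, len(temporal_kernels))

    def forward(self, x):            # x: (N, C, T, H, W)
        ctx = x.mean(dim=(2, 3, 4))  # global context, (N, C)
        w = F.softmax(self.gate(ctx), dim=-1)  # per-branch weights, (N, B)
        feats = torch.stack([b(x) for b in self.branches], dim=1)
        return x + (w[:, :, None, None, None, None] * feats).sum(dim=1)

class CSTPModule(nn.Module):
    """Collaborative spatio-temporal pyramid (sketch): pool features at
    several spatio-temporal scales, upsample back, and fuse."""
    def __init__(self, channels, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.fuse = nn.Conv3d(channels * len(scales), channels, kernel_size=1)

    def forward(self, x):
        n, c, t, h, w = x.shape
        levels = [
            F.interpolate(
                F.adaptive_avg_pool3d(
                    x, (max(t // s, 1), max(h // s, 1), max(w // s, 1))),
                size=(t, h, w), mode='trilinear', align_corners=False)
            for s in self.scales
        ]
        return x + self.fuse(torch.cat(levels, dim=1))
```

In such a design, CPTM blocks would typically be inserted into the stages of an existing 2D/3D CNN backbone and the CSTP module applied to the resulting features, consistent with the abstract's claim that AMS-Net can be instantiated on top of existing CNNs; the paper's actual placement and fusion rules may differ.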
Ecol Evol
December 2024
Baltistan Wildlife Conservation and Development Organization (Reg), Apixoq Abbas Town, Skardu, Gilgit-Baltistan, Pakistan.
Mammals, being social creatures, communicate through a variety of signal cues, so it is vital to understand how wild carnivores create and maintain connections with their neighbors for their survival. However, observing elusive species in their natural habitats poses significant challenges, leading to a scarcity of data. In this study, we aimed to provide detailed long-term observations of snow leopards in the northern region of Pakistan, using data from 136 camera traps deployed between 2018 and 2023 to investigate their territorial marking behavior in Baltistan.
Front Psychol
November 2024
Center for Mind/Brain Sciences, University of Trento, Trento, Italy.
Sci Rep
November 2024
Groningen Institute for Evolutionary Life Sciences, University of Groningen, Groningen, The Netherlands.
Oncolytic virotherapy is a promising form of cancer treatment that uses viruses to infect and kill cancer cells. In addition to their direct effects on cancer cells, the viruses stimulate various immune responses that are partly directed against the tumour. Efforts are being made to genetically engineer oncolytic viruses to enhance their immunogenic potential.
Sensors (Basel)
October 2024
Faculty of Medicine and Health Technology, Tampere University, 33520 Tampere, Finland.
Transient receptor potential vanilloid (TRPV) channel proteins belong to the superfamily of TRP proteins that form cationic channels in animal cell membranes. These proteins have various subtype-specific functions, serving, for example, as sensors for pain, pressure, pH, and mechanical extracellular stimuli. The sensing of extracellular cues by TRPV4 triggers Ca²⁺ influx through the channel, which subsequently coordinates numerous intracellular signaling cascades in a spatio-temporal manner.
IEEE Trans Pattern Anal Mach Intell
September 2024
Referring segmentation is a fundamental vision-language task that aims to segment out an object from an image or video in accordance with a natural language description. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image or video frames. A paradigm for tackling this problem in both the image and the video domains is to leverage a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder.
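The paradigm described above, fusing independently extracted vision and language features with a cross-modal decoder, is commonly instantiated with transformer cross-attention. The sketch below shows one such generic instantiation in PyTorch; the module names, dimensions, and mask head are illustrative assumptions, not the specific model proposed in the cited paper.

```python
# Generic cross-modal decoder sketch for referring segmentation:
# visual tokens attend to language tokens, then a small head predicts
# a per-pixel mask. Not the cited paper's architecture.
import torch
import torch.nn as nn

class CrossModalDecoderLayer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, vis, lang):
        # vis:  (N, H*W, dim) flattened image/frame features
        # lang: (N, L, dim)   token features of the referring expression
        attn, _ = self.cross_attn(query=vis, key=lang, value=lang)
        vis = self.norm1(vis + attn)          # language-conditioned update
        return self.norm2(vis + self.ffn(vis))

class ReferringSegHead(nn.Module):
    """Decode language-conditioned visual features into mask logits."""
    def __init__(self, dim=256, layers=2):
        super().__init__()
        self.layers = nn.ModuleList(CrossModalDecoderLayer(dim)
                                    for _ in range(layers))
        self.to_mask = nn.Linear(dim, 1)

    def forward(self, vis_feats, lang_feats, hw):
        x = vis_feats
        for layer in self.layers:
            x = layer(x, lang_feats)
        h, w = hw
        return self.to_mask(x).squeeze(-1).view(-1, h, w)  # (N, h, w)
```

The key design point the abstract highlights is that the two encoders run independently, so all vision-language interaction is concentrated in decoder layers like the one sketched here.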