We propose a new statistical generative model for spatiotemporal video segmentation. The objective is to partition a video sequence into homogeneous segments that can serve as "building blocks" for semantic video segmentation. The baseline framework is a Gaussian mixture model (GMM)-based video modeling approach that operates in a six-dimensional spatiotemporal feature space. Specifically, we introduce the concept of frame saliency to quantify how relevant each video frame is to the GMM-based spatiotemporal video model. This allows a small set of salient frames to drive model training, reducing data redundancy and irrelevance. A modified expectation-maximization (EM) algorithm is developed for simultaneous GMM training and frame saliency estimation, and the frames with the highest saliency values are extracted to refine the GMM estimation for video segmentation. Interestingly, frame saliency is also found to be indicative of certain object behaviors, which makes the proposed method applicable to other frame-related video analysis tasks, such as key-frame extraction and video skimming. Experiments on real videos demonstrate the effectiveness and efficiency of the proposed method.
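To make the joint training idea concrete, the sketch below shows a saliency-weighted EM loop for a GMM over 6-D spatiotemporal features. Note that the feature layout (x, y, t, L, u, v), the saliency update (a softmax over each frame's average log-likelihood), and all function names are illustrative assumptions; the paper's exact modified EM derivation is not reproduced here.

```python
import numpy as np
from scipy.special import logsumexp, softmax
from scipy.stats import multivariate_normal

def saliency_weighted_gmm(X, frame_ids, K, n_iter=30, seed=0):
    """Fit a K-component GMM to 6-D spatiotemporal features while jointly
    estimating a saliency weight per frame (illustrative sketch only).

    X         : (N, 6) array, one row per pixel, e.g. (x, y, t, L, u, v)
    frame_ids : (N,) int array mapping each row to its source frame
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    T = int(frame_ids.max()) + 1

    # Initialization: uniform mixing weights, random means drawn from the
    # data, shared covariance, and uniform frame saliency.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    cov = np.tile(np.cov(X.T) + 1e-3 * np.eye(D), (K, 1, 1))
    saliency = np.full(T, 1.0 / T)

    for _ in range(n_iter):
        # E-step: per-sample log joint probabilities and responsibilities.
        log_p = np.stack(
            [np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], cov[k])
             for k in range(K)], axis=1)                    # (N, K)
        log_lik = logsumexp(log_p, axis=1)                  # (N,)
        resp = np.exp(log_p - log_lik[:, None])

        # Saliency update (assumed form): score each frame by the mean
        # log-likelihood of its pixels, then normalize across frames.
        totals = np.bincount(frame_ids, weights=log_lik, minlength=T)
        counts = np.bincount(frame_ids, minlength=T).clip(min=1)
        saliency = softmax(totals / counts)

        # M-step: weight every sample by the saliency of its frame, so
        # redundant or irrelevant frames contribute less to the fit.
        w = saliency[frame_ids][:, None] * resp             # (N, K)
        Nk = w.sum(axis=0) + 1e-12
        pi = Nk / Nk.sum()
        mu = (w.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mu[k]
            cov[k] = (w[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)

    return pi, mu, cov, saliency
```

The m most salient frames could then be selected with, e.g., np.argsort(saliency)[-m:] and used to refit the GMM, mirroring the refinement step described in the abstract.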


Source: http://dx.doi.org/10.1109/tip.2007.908283

Publication Analysis

Top Keywords

spatiotemporal video: 12
video modeling: 12
video segmentation: 12
frame saliency: 12
video: 10
salient frames: 8
proposed method: 8
selecting salient: 4
spatiotemporal: 4
frames spatiotemporal: 4

Similar Publications

Compressed ultrafast photography (CUP) is a high-speed imaging technique with a frame rate of up to ten trillion frames per second (fps) and a sequence depth of hundreds of frames, making it a powerful tool for investigating ultrafast processes. However, since the reconstruction process is an ill-posed inverse problem, image reconstruction becomes increasingly difficult as the number of reconstructed frames and the number of pixels per frame grow.


In the medical field, endoscopic video analysis is crucial for disease diagnosis and minimally invasive surgery. Endoscopic Foundation Models (Endo-FM) use large-scale self-supervised pre-training on endoscopic video data and leverage video transformers to capture long-range spatiotemporal dependencies. However, detecting complex lesions such as gastrointestinal metaplasia (GIM) in endoscopic videos remains challenging due to unclear boundaries and indistinct features, and Endo-FM has not demonstrated good performance on this task.


To address missed detections caused by weak shape and texture features and blurred boundaries in existing detection methods, this paper introduces a novel moving-vehicle detection approach for satellite videos. The proposed method leverages frame differencing and convolution to effectively integrate spatiotemporal information. First, a frame difference module (FDM) is designed that combines frame differencing and convolution, as sketched below.
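As a rough illustration of the frame-difference-plus-convolution idea only (the snippet does not specify the FDM's actual architecture, so the fixed mean-filter kernel and function name below are assumptions), a minimal version could look like:

```python
import numpy as np
from scipy.ndimage import convolve

def frame_difference_map(prev_frame, curr_frame, kernel_size=3):
    """Toy frame-difference module: absolute temporal difference between
    consecutive grayscale frames, followed by a spatial mean-filter
    convolution to suppress pixel-level noise. The real FDM is a learned
    convolutional block; this only sketches the underlying operation."""
    diff = np.abs(curr_frame.astype(np.float32) - prev_frame.astype(np.float32))
    kernel = np.full((kernel_size, kernel_size), 1.0 / kernel_size ** 2)
    return convolve(diff, kernel, mode="nearest")
```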


Arbitrary-translation Six Degrees of Freedom (6DoF) video represents a transitional stage towards immersive terminal video, allowing users to freely switch viewpoints for a 3D scene experience. However, this increased freedom of movement introduces new distortions that significantly affect perceived visual quality. It is therefore crucial to explore quality assessment (QA) methods to validate the feasibility of such applications.


Neural processing of naturalistic audiovisual events in space and time.

Commun Biol

January 2025

Western Institute for Neuroscience, Western University, London, ON, Canada.

Our brain seamlessly integrates distinct sensory information to form a coherent percept. However, when real-world audiovisual events are perceived, the specific brain regions involved, and the timing with which different levels of information are processed, remain under-investigated. To address this, we curated naturalistic videos and recorded functional magnetic resonance imaging (fMRI) and electroencephalography (EEG) data while participants viewed the videos with accompanying sounds.

