Video moment retrieval and highlight detection have received considerable attention in the current era of video content proliferation. The two tasks aim to localize moments and estimate clip relevance based on user-specific queries. Most existing methods approach them from a discriminative learning perspective, learning the correspondence between the query and activity boundary locations through complex cross-modal interactions. However, the continuous nature of video content often leaves the boundaries between temporal events unclear. This boundary ambiguity may confuse models and lead to subpar performance in predicting target boundaries. To alleviate this problem, we propose to solve the two tasks jointly from the perspective of denoising generation, so that the target boundary can be localized clearly through iterative coarse-to-fine refinement. Specifically, we propose a novel framework, DiffusionVMR, which uses a diffusion model to redefine the two tasks as a unified conditional denoising generation process. During training, Gaussian noise is added to corrupt the ground truth (GT), producing noisy candidates as input, and the model is trained to reverse this noise addition process. At inference, DiffusionVMR starts directly from Gaussian noise and progressively refines the proposals from noise into meaningful output. Notably, DiffusionVMR inherits the ability of diffusion models to refine results iteratively during inference, sharpening boundary predictions from coarse to fine. Furthermore, the training and inference of DiffusionVMR are decoupled: inference settings can be chosen freely and need not match those used in training. Extensive experiments on five widely used benchmarks (i.e., QVHighlights, Charades-STA, TACoS, YouTubeHighlights, and TVSum) across two tasks (moment retrieval and/or highlight detection) demonstrate the effectiveness and flexibility of the proposed DiffusionVMR.
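The train-time corruption and inference-time refinement described above can be illustrated with a minimal sketch. This is not the authors' implementation: the (center, width) span encoding, the cosine noise schedule, the DDIM-style update, and the names `cosine_alpha_bar`, `corrupt_spans`, `refine`, and `denoiser` are all assumptions introduced for illustration. The abstract does not specify any of these choices.

```python
import numpy as np

def cosine_alpha_bar(num_steps, s=0.008):
    """Assumed cosine schedule: cumulative signal retention per timestep."""
    t = np.linspace(0, num_steps, num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return np.clip(f[1:] / f[0], 1e-5, 1.0)

def corrupt_spans(gt_spans, t, alpha_bar, rng):
    """Training side: add Gaussian noise to ground-truth spans.

    gt_spans is an (N, 2) array of hypothetical (center, width) pairs.
    Returns the noisy spans x_t and the noise that was added.
    """
    eps = rng.standard_normal(gt_spans.shape)
    a = alpha_bar[t]
    return np.sqrt(a) * gt_spans + np.sqrt(1.0 - a) * eps, eps

def refine(denoiser, num_proposals, alpha_bar, steps, rng):
    """Inference side: start from pure noise and refine step by step.

    denoiser(x, t) is a hypothetical stand-in for the trained network and
    is assumed to return an estimate of the clean spans. `steps` and
    `num_proposals` are picked at inference time, independent of training.
    """
    x = rng.standard_normal((num_proposals, 2))
    ts = np.linspace(len(alpha_bar) - 1, 0, steps).astype(int)
    for i, t in enumerate(ts):
        x0_hat = denoiser(x, t)
        if i == len(ts) - 1:
            return x0_hat  # final step: keep the clean estimate
        a_t, a_prev = alpha_bar[t], alpha_bar[ts[i + 1]]
        # Recover the implied noise, then re-noise to the next (lower) level.
        eps_hat = (x - np.sqrt(a_t) * x0_hat) / np.sqrt(1.0 - a_t)
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
    return x
```

In this sketch, the decoupling mentioned in the abstract shows up as the `steps` and `num_proposals` arguments of `refine`: both can be changed at inference without retraining the (hypothetical) denoiser.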


Source
http://dx.doi.org/10.1109/TNNLS.2024.3516033

