Video Temporal Grounding (VTG) aims to locate the time interval in a video that is semantically relevant to a language query. Existing VTG methods let the query interact with entangled video features and treat the instances in a dataset independently; intra-video entanglement and inter-video connections are thus rarely considered, leading to mismatches between video and language. To this end, we propose a novel method, dubbed Hierarchically Semantic Associating (HiSA), which precisely aligns the video with the language and obtains discriminative representations for subsequent location regression. Specifically, action factors and background factors are disentangled from adjacent video segments, enforcing precise multimodal interaction and alleviating intra-video entanglement. In addition, a cross-guided contrastive objective is carefully designed to capture inter-video connections, which benefits multimodal understanding for locating the time interval. Extensive experiments on three benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods. The project page is available at: https://github.com/zhexu1997/HiSA.
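To make the two ideas in the abstract concrete, here is a minimal sketch in PyTorch. The layer sizes, mean pooling, and InfoNCE-style loss form are assumptions for illustration, not the authors' released implementation: segment features are split into action and background factors by two projections, and each query is contrasted against its own video's action factors across a batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangleHead(nn.Module):
    """Sketch: split segment features into action and background factors.

    Two plain linear projections are an assumption; the paper's actual
    disentanglement module may differ.
    """
    def __init__(self, dim: int = 512):
        super().__init__()
        self.to_action = nn.Linear(dim, dim)
        self.to_background = nn.Linear(dim, dim)

    def forward(self, segments: torch.Tensor):
        # segments: (B, T, D) features of adjacent video segments
        return self.to_action(segments), self.to_background(segments)

def cross_guided_contrast(query: torch.Tensor,
                          action: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: each query attracts its own video's action
    factors and repels those of other videos in the batch.

    query:  (B, D) pooled language features
    action: (B, T, D) action factors, mean-pooled here for simplicity
    """
    video = F.normalize(action.mean(dim=1), dim=-1)       # (B, D)
    query = F.normalize(query, dim=-1)                    # (B, D)
    logits = query @ video.t() / temperature              # (B, B)
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)
```

Matched query-video pairs sit on the diagonal of the similarity matrix, so cross-entropy against the diagonal indices is what makes the contrast "cross-guided" across videos in this sketch.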


Source
http://dx.doi.org/10.1109/TIP.2022.3191841

Publication Analysis

Top Keywords

hierarchically semantic (8); semantic associating (8); video temporal (8); temporal grounding (8); locate time (8); time interval (8); intra-video entanglement (8); inter-video connection (8); video language (8); video (7)

Similar Publications

When we listen to speech, our brain's neurophysiological responses "track" its acoustic features, but it is less well understood how these auditory responses are enhanced by linguistic content. Here, we recorded magnetoencephalography (MEG) responses while subjects of both sexes listened to four types of continuous-speech-like passages: speech-envelope modulated noise, English-like non-words, scrambled words, and a narrative passage. Temporal response function (TRF) analysis provides strong neural evidence for the emergent features of speech processing in cortex, from acoustics to higher-level linguistics, as incremental steps in neural processing.
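As a rough illustration of the TRF idea, the sketch below estimates a TRF by ridge regression. It is a hedged simplification, not the study's pipeline: a single stimulus feature, one response channel, and a closed-form ridge solve are all simplifying assumptions.

```python
import numpy as np

def estimate_trf(stimulus: np.ndarray, response: np.ndarray,
                 n_lags: int = 40, alpha: float = 1.0) -> np.ndarray:
    """Ridge-regression TRF: the linear kernel mapping time-lagged
    copies of a stimulus feature (e.g. the speech envelope) onto one
    neural response channel.
    """
    # Design matrix of lagged stimulus values: X[t, k] = stimulus[t - k]
    X = np.zeros((len(stimulus), n_lags))
    for k in range(n_lags):
        X[k:, k] = stimulus[:len(stimulus) - k]
    # Closed-form ridge solution: (X'X + alpha*I)^{-1} X'y
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_lags),
                           X.T @ response)
```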


Weakly-supervised thyroid ultrasound segmentation: Leveraging multi-scale consistency, contextual features, and bounding box supervision for accurate target delineation.

Comput Biol Med

January 2025

Department of Artificial Intelligence, Faculty of Artificial Intelligence, Egyptian Russian University, 11829, Badr City, Egypt.

Weakly-supervised learning (WSL) methods have gained significant attention in medical image segmentation, but they often face challenges in accurately delineating boundaries due to overfitting to weak annotations such as bounding boxes. This issue is particularly pronounced in thyroid ultrasound images, where low contrast and noisy backgrounds hinder precise segmentation. In this paper, we propose a novel weakly-supervised segmentation framework that addresses these challenges.
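One common form of multi-scale consistency is sketched below in PyTorch: predictions on rescaled inputs are encouraged to agree with the full-resolution prediction. The MSE agreement term and scale choices are assumptions; the paper's exact losses and its use of bounding-box supervision are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multiscale_consistency(model: nn.Module, image: torch.Tensor,
                           scales=(0.75, 1.25)) -> torch.Tensor:
    """Consistency regularizer: predictions on rescaled inputs should
    agree with the full-resolution prediction once mapped back.

    model: any module mapping (B, 3, H, W) -> (B, C, H, W) logits
    """
    h, w = image.shape[-2:]
    ref = torch.softmax(model(image), dim=1)  # reference prediction
    loss = image.new_zeros(())
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear",
                               align_corners=False)
        pred = F.interpolate(model(scaled), size=(h, w), mode="bilinear",
                             align_corners=False)
        loss = loss + F.mse_loss(torch.softmax(pred, dim=1), ref)
    return loss / len(scales)
```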


During the Covid-19 pandemic, the widespread use of social media platforms facilitated the dissemination of information, fake news, and propaganda, while also serving as a vital source of self-reported symptoms related to Covid-19. Existing graph-based models, such as Graph Neural Networks (GNNs), have achieved notable success in Natural Language Processing (NLP). However, applying GNN-based models to propaganda detection remains difficult because of the need to mine distinct word interactions and to capture nonconsecutive, long-range contextual information.


Glaucoma, a severe eye disease leading to irreversible vision loss if untreated, remains a significant challenge in healthcare due to the complexity of its detection. Traditional methods rely on clinical examinations of fundus images, assessing features like optic cup and disc sizes, rim thickness, and other ocular deformities. Recent advancements in artificial intelligence have introduced new opportunities for enhancing glaucoma detection.


Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality retrieval task to match a person across different spectral camera views. Most existing works focus on learning shared feature representations from the final embedding space of advanced networks to alleviate modality differences between visible and infrared images. However, exclusively relying on high-level semantic information from the network's final layers can restrict shared feature representations and overlook the benefits of low-level details.
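The general idea of going beyond the final embedding space can be sketched as fusing low-level and high-level backbone features into one shared embedding. Dimensions, layer choices, and the additive fusion below are assumptions for illustration, not this paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelEmbedding(nn.Module):
    """Combine pooled early-layer (low-level) and final-layer
    (high-level) features instead of relying on the final embedding
    space alone."""
    def __init__(self, low_dim: int = 256, high_dim: int = 2048,
                 out_dim: int = 512):
        super().__init__()
        self.low_proj = nn.Linear(low_dim, out_dim)
        self.high_proj = nn.Linear(high_dim, out_dim)

    def forward(self, low_feat: torch.Tensor,
                high_feat: torch.Tensor) -> torch.Tensor:
        # low_feat: (B, low_dim), high_feat: (B, high_dim), both pooled
        z = self.low_proj(low_feat) + self.high_proj(high_feat)
        return F.normalize(z, dim=-1)  # unit-norm embedding for retrieval
```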

