Understanding videos, especially aligning them with textual data, remains a significant challenge in computer vision. The advent of vision-language models (VLMs) such as CLIP has sparked interest in leveraging their capabilities for video understanding, yielding marked gains in both performance and efficiency. However, current methods often neglect vital user-generated metadata such as video titles. In this paper, we present Cap4Video++, a universal framework that leverages auxiliary captions to enrich video understanding. Cap4Video++ harnesses the synergy of VLMs and large language models (LLMs) such as ChatGPT to generate video captions, which are utilized in three key stages: (i) the input stage employs Semantic Pair Sampling to extract beneficial samples from captions, aiding contrastive learning; (ii) the intermediate stage combines Video-Caption Cross-modal Interaction with Adaptive Caption Selection to bolster video and caption representations; (iii) the output stage introduces a Complementary Caption-Text Matching branch that augments the primary video branch by improving similarity calculations. Comprehensive experiments on text-video retrieval and video action recognition across nine benchmarks demonstrate Cap4Video++'s superiority over existing models, highlighting the effectiveness of automatically generated captions for advancing video understanding.
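The output-stage idea above can be illustrated with a minimal sketch: the retrieval score for a query text is a weighted blend of the primary video-text similarity and the complementary caption-text similarity. This is an assumption-laden illustration, not the paper's actual formulation — the cosine metric, the `alpha` mixing weight, and the function names are all hypothetical.

```python
import math

def cosine(u, v):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fused_similarity(video_emb, caption_emb, text_emb, alpha=0.8):
    """Blend the primary video-text score with the complementary
    caption-text score. `alpha` is a hypothetical mixing weight,
    not a value taken from the paper."""
    s_video_text = cosine(video_emb, text_emb)      # primary video branch
    s_caption_text = cosine(caption_emb, text_emb)  # complementary branch
    return alpha * s_video_text + (1 - alpha) * s_caption_text
```

In a retrieval setting, candidates would be ranked by this fused score, so a generated caption that matches the query can lift a video whose visual embedding alone scores poorly.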

DOI: http://dx.doi.org/10.1109/TPAMI.2024.3410329


