Understanding videos, and in particular aligning them with textual data, remains a significant challenge in computer vision. The advent of vision-language models (VLMs) such as CLIP has sparked interest in leveraging their capabilities for video understanding, yielding marked gains in both performance and efficiency. However, current methods often neglect valuable user-generated metadata such as video titles. More recently, large language models (LLMs) such as ChatGPT have flourished. In this paper, we present Cap4Video++, a universal framework that leverages auxiliary captions to enrich video understanding. Cap4Video++ harnesses the synergy of VLMs and LLMs to generate video captions, which are used in three key stages: (i) at the input stage, Semantic Pair Sampling extracts beneficial samples from the captions to aid contrastive learning; (ii) at the intermediate stage, Video-Caption Cross-modal Interaction and Adaptive Caption Selection work together to strengthen the video and caption representations; (iii) at the output stage, a complementary Caption-Text Matching branch augments the primary video branch by refining the similarity calculation. Comprehensive experiments on text-video retrieval and video action recognition across nine benchmarks demonstrate Cap4Video++'s superiority over existing models, highlighting its effectiveness in using automatically generated captions to advance video understanding.
DOI: http://dx.doi.org/10.1109/TPAMI.2024.3410329
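The output-stage idea lends itself to a compact illustration. The sketch below shows one plausible way to fuse the primary video-text similarity with the complementary caption-text similarity for retrieval. It is a minimal sketch under assumed conventions, not the authors' implementation: the function name `fused_similarity`, the `fuse_weight` parameter, and the pooled-embedding inputs are hypothetical placeholders.

```python
# Minimal sketch (assumed names, not the authors' code) of the output-stage fusion:
# the final query-video score combines the primary video-text similarity with a
# complementary caption-text similarity computed from generated captions.
import torch
import torch.nn.functional as F

def fused_similarity(text_emb, video_emb, caption_emb, fuse_weight=0.5):
    """Fuse video-text and caption-text cosine similarities.

    text_emb:    (B_t, D) text-query embeddings
    video_emb:   (B_v, D) pooled video embeddings
    caption_emb: (B_v, D) pooled embeddings of the captions generated for each video
    fuse_weight: hypothetical scalar balancing the two branches
    """
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)

    sim_video = text_emb @ video_emb.t()      # primary video branch
    sim_caption = text_emb @ caption_emb.t()  # complementary caption branch
    return fuse_weight * sim_video + (1.0 - fuse_weight) * sim_caption
```

In this reading, the caption branch acts as an auxiliary matching signal at inference time, so the fusion weight can be tuned on a validation split without changing the video branch itself.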