Temporal sentence grounding in videos (TSGV), also known as natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve from an untrimmed video the temporal moment that semantically corresponds to a language query. Connecting computer vision and natural language processing, TSGV has drawn significant attention from researchers in both communities. This survey summarizes the fundamental concepts of TSGV, the current state of research, and future research directions. As background, we present a common structure of functional components in TSGV in a tutorial style, from feature extraction over the raw video and language query to prediction of the target moment. We then review techniques for multimodal understanding and interaction, the key focus of TSGV for effective alignment between the two modalities. We construct a taxonomy of TSGV techniques and elaborate on the methods in each category, along with their strengths and weaknesses. Lastly, we discuss issues with current TSGV research and share our insights on promising research directions.
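The component pipeline the abstract outlines (feature extraction, cross-modal interaction, moment prediction) can be illustrated with a toy, proposal-free sketch: video clip features attend over query word features, and the fused features score each clip as a candidate start or end boundary. All function names, shapes, and the random features below are illustrative assumptions, not part of the survey itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_modal_attention(video_feats, query_feats):
    """Attend each video clip over the query words (scaled dot-product),
    then fuse by concatenating the clip feature with its attended query."""
    d = query_feats.shape[-1]
    scores = video_feats @ query_feats.T / np.sqrt(d)        # (T, L)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)            # softmax over words
    attended = weights @ query_feats                         # (T, d)
    return np.concatenate([video_feats, attended], axis=1)   # (T, 2d)

def predict_span(fused, w_start, w_end):
    """Score each clip as a start/end boundary; return the best valid pair."""
    s = fused @ w_start                                      # (T,) start scores
    e = fused @ w_end                                        # (T,) end scores
    best, span = -np.inf, (0, 0)
    for i in range(len(s)):
        for j in range(i, len(s)):                           # enforce start <= end
            if s[i] + e[j] > best:
                best, span = s[i] + e[j], (i, j)
    return span

# Toy sizes: T clips, L query words, feature dim d (random stand-ins for
# real video/text encoder outputs such as C3D or GloVe/BERT features).
T, L, d = 8, 5, 16
video = rng.normal(size=(T, d))
query = rng.normal(size=(L, d))
fused = cross_modal_attention(video, query)
start, end = predict_span(fused, rng.normal(size=2 * d), rng.normal(size=2 * d))
print(start, end)  # predicted moment as (start clip index, end clip index)
```

In a trained model the boundary scorers would be learned layers and the softmax would operate over richer attention (e.g., multi-head, bidirectional); the sketch only shows how the three functional stages connect.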
DOI: http://dx.doi.org/10.1109/TPAMI.2023.3258628