Unimodal and cross-modal information provided by faces and voices contributes to identity percepts. To examine how these sources of information interact, we devised a novel audio-visual sorting task in which participants were required to group video-only and audio-only clips into two identities. In a series of three experiments, we show that unimodal face and voice sorting were more accurate than cross-modal sorting: While face sorting was consistently the most accurate, followed by voice sorting, cross-modal sorting was at or below chance level. In Experiment 1, we compared performance in our novel audio-visual sorting task to a traditional identity matching task, showing that unimodal and cross-modal identity perception were overall moderately more accurate in the sorting task than in the matching task. In Experiment 2, separating unimodal from cross-modal sorting led to small improvements in accuracy for unimodal sorting but no change in cross-modal sorting performance. In Experiment 3, we explored the effect of minimal audio-visual training: Participants were shown a clip of the two identities in conversation before completing the sorting task. This led to small, nonsignificant improvements in accuracy for unimodal and cross-modal sorting. Our results indicate that unfamiliar face and voice perception operate relatively independently, with no evidence of mutual benefit, suggesting that extracting reliable cross-modal identity information is challenging.
Download full-text PDF | Source
---|---
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8763756 | PMC
http://dx.doi.org/10.3758/s13421-021-01198-7 | DOI Listing
IEEE Trans Pattern Anal Mach Intell
February 2025
With only video-level event labels, this paper targets the task of weakly-supervised audio-visual event perception (WS-AVEP), which aims to temporally localize and categorize events belonging to each modality. Despite recent progress, most existing approaches either ignore the unsynchronized nature of audio-visual tracks or overlook the complementary modality as a source of explicit enhancement. We argue that a modality should provide ample presence evidence for an event, while the complementary modality offers absence evidence as a reference.
IEEE Trans Med Imaging
January 2025
Histo-genomic multi-modal methods have emerged as a powerful paradigm with significant potential for cancer prognosis. However, genome sequencing, unlike histopathology imaging, is still not widely accessible in underdeveloped regions, limiting the application of these multi-modal approaches in clinical settings. To address this, we propose a Genome-informed Hyper-Attention Network, termed G-HANet, which, for the first time, learns histo-genomic associations during training to enhance uni-modal whole slide image (WSI)-based inference.
IEEE Trans Pattern Anal Mach Intell
December 2024
Social images are often associated with rich but noisy tags contributed by the community. Although social tags can provide valuable semantic training information for image retrieval, existing studies fail to effectively filter this noise by exploiting the cross-modal correlation between image content and tags. Current cross-modal vision-and-language representation learning methods, which selectively attend to the relevant parts of the image and text, point to a promising direction.
J Exp Psychol Hum Percept Perform
March 2025
Queensland University of Technology, School of Psychology and Counselling.
The relative timing between sensory signals strongly determines whether they are integrated in the brain. Two classical measures of temporal integration are simultaneity judgments, in which observers judge whether cross-modal stimuli are synchronous, and violations of the race model inequality (RMI), in which responses to cross-modal stimuli are faster than a parallel race between the unimodal signals can explain. While simultaneity judgments are subject to trial-history effects (rapid temporal recalibration) and long-term experience (musical training), it is unknown whether RMI violations are similarly affected.
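As background (not stated in the abstract itself), the race model inequality is usually given in Miller's (1982) form as a bound on the cumulative response-time distribution for bimodal stimuli:

```latex
% Race model inequality (Miller, 1982): for every response time t,
% the bimodal (audio-visual) response-time CDF should not exceed the
% sum of the two unimodal CDFs if the signals are processed in a
% parallel race with no integration.
F_{AV}(t) \;\le\; F_{A}(t) + F_{V}(t)
```

A violation, i.e. a range of t over which the bimodal distribution exceeds this bound, is conventionally taken as evidence for cross-modal integration rather than independent parallel processing.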
Sheng Wu Yi Xue Gong Cheng Xue Za Zhi
February 2025
School of Life Health Information Science and Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China.
In audiovisual emotion recognition, representation learning is a research direction receiving considerable attention, and the key lies in constructing effective affective representations with both consistency and variability. However, accurately realizing such affective representations remains challenging. For this reason, we propose a cross-modal audiovisual recognition model based on a multi-head cross-attention mechanism.
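As a rough illustration of the kind of mechanism named here (not the authors' actual model; the module layout, dimensions, and framework choice below are assumptions), a multi-head cross-attention step in which audio features attend to visual features can be sketched with PyTorch's built-in nn.MultiheadAttention:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: audio features query visual features via
# multi-head cross-attention. Dimensions are illustrative only.
class AudioVisualCrossAttention(nn.Module):
    def __init__(self, embed_dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, audio_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, T_audio, embed_dim), used as queries
        # visual_feats: (batch, T_visual, embed_dim), used as keys and values
        attended, _ = self.cross_attn(query=audio_feats, key=visual_feats, value=visual_feats)
        # Residual connection plus layer norm, a common choice in attention blocks
        return self.norm(audio_feats + attended)

# Example usage with random features
audio = torch.randn(2, 50, 256)   # e.g. 50 audio frames
visual = torch.randn(2, 30, 256)  # e.g. 30 video frames
fused = AudioVisualCrossAttention()(audio, visual)
print(fused.shape)  # torch.Size([2, 50, 256])
```

The idea of querying one modality with the other is what distinguishes cross-attention from self-attention; a symmetric block with visual features as queries could be added analogously.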