Synthetic faces generated with the facial action coding system or deep neural networks improve speech-in-noise perception, but not as much as real faces.

Yingjia Yu, Anastasia Lado, Yue Zhang, John F Magnotti, Michael S Beauchamp

Front Neurosci

Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.

Published: May 2024

The prevalence of synthetic talking faces in both commercial and academic environments is increasing as the technology to generate them grows more powerful and available. While it has long been known that seeing the face of the talker improves human perception of speech-in-noise, recent studies have shown that synthetic talking faces generated by deep neural networks (DNNs) are also able to improve human perception of speech-in-noise. However, in previous studies the benefit provided by DNN synthetic faces was only about half that of real human talkers. We sought to determine whether synthetic talking faces generated by an alternative method would provide a greater perceptual benefit. The facial action coding system (FACS) is a comprehensive system for measuring visually discernible facial movements. Because the action units that comprise FACS are linked to specific muscle groups, synthetic talking faces generated by FACS might have greater verisimilitude than DNN synthetic faces, which do not reference an explicit model of the facial musculature. We tested the ability of human observers to identify speech-in-noise accompanied by a blank screen; the real face of the talker; and synthetic talking faces generated either by DNN or FACS. We replicated previous findings of a large benefit for seeing the face of a real talker for speech-in-noise perception and a smaller benefit for DNN synthetic faces. FACS faces also improved perception, but only to the same degree as DNN faces. Analysis at the phoneme level showed that the performance of DNN and FACS faces was particularly poor for phonemes that involve interactions between the teeth and lips, such as /f/, /v/, and /th/. Inspection of single video frames revealed that the characteristic visual features for these phonemes were weak or absent in synthetic faces. Modeling the real vs. synthetic difference showed that increasing the realism of a few phonemes could substantially increase the overall perceptual benefit of synthetic faces.
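
The modeling result in the abstract's final sentence can be illustrated with a minimal Python sketch. It is not the authors' model: it assumes overall accuracy is a frequency-weighted mean of per-phoneme accuracy, and every number, phoneme weight, and function name below is a hypothetical placeholder rather than data from the study.

```python
from typing import Dict, Set


def overall_accuracy(acc: Dict[str, float], freq: Dict[str, float]) -> float:
    """Frequency-weighted mean accuracy across phoneme categories."""
    total = sum(freq.values())
    return sum(acc[p] * freq[p] for p in freq) / total


def raise_to_real(
    synthetic: Dict[str, float], real: Dict[str, float], targets: Set[str]
) -> Dict[str, float]:
    """Copy synthetic-face accuracy, replacing target phonemes with real-face accuracy."""
    return {p: (real[p] if p in targets else a) for p, a in synthetic.items()}


# Hypothetical placeholder values, NOT results from the paper.
freq = {"f": 1.0, "v": 1.0, "th": 1.0, "other": 7.0}       # relative phoneme frequency
real = {"f": 0.8, "v": 0.8, "th": 0.8, "other": 0.6}       # accuracy with a real face
synthetic = {"f": 0.4, "v": 0.4, "th": 0.4, "other": 0.6}  # accuracy with a synthetic face

baseline = overall_accuracy(synthetic, freq)
improved = overall_accuracy(raise_to_real(synthetic, real, {"f", "v", "th"}), freq)

print(f"baseline synthetic accuracy: {baseline:.2f}")
print(f"after improving /f/, /v/, /th/: {improved:.2f}")
```

Under these assumptions, the gain from fixing a small set of phonemes scales with how often those phonemes occur and how large the real vs. synthetic gap is for each one.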

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11111898
DOI: http://dx.doi.org/10.3389/fnins.2024.1379988
