VPT: Video portraits transformer for realistic talking face generation.

Neural Netw

School of Automation Science and Engineering, South China University of Technology, China.

Published: January 2025

Talking face generation is a promising technique for various domains, such as digital assistants, video editing, and virtual video conferencing. Previous work on audio-driven talking faces focused primarily on the synchronization between audio and video. However, existing methods still fall short of synthesizing photo-realistic video with high identity preservation, audiovisual synchronization, and facial details such as blink movements. To address these problems, a novel talking face generation framework with controllable blink movements, termed the video portraits transformer (VPT), is proposed. It separates video generation into two stages: an audio-to-landmark stage and a landmark-to-face stage. In the audio-to-landmark stage, a transformer encoder serves as the generator, predicting full facial landmarks from the given audio and a continuous eye aspect ratio (EAR) signal. In the landmark-to-face stage, a video-to-video (vid-to-vid) network transfers the landmarks into realistic talking face videos. Moreover, to imitate real blink movements during inference, a transformer-based spontaneous blink generation module is devised to generate the EAR sequence. Extensive experiments demonstrate that VPT produces photo-realistic talking face videos with natural blink movements, and that the spontaneous blink generation module generates blinks whose duration distribution and frequency are close to those of real blinks.
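The abstract uses the eye aspect ratio (EAR) as the per-frame blink signal but does not spell out its definition. As a point of reference, below is a minimal sketch of the widely used six-landmark EAR formulation; the landmark ordering and the open-eye range quoted in the comments are assumptions for illustration, not values taken from the paper.

import numpy as np

def eye_aspect_ratio(eye: np.ndarray) -> float:
    """EAR for one eye from a (6, 2) array of landmarks p1..p6 around the eye contour."""
    v1 = np.linalg.norm(eye[1] - eye[5])  # vertical distance |p2 - p6|
    v2 = np.linalg.norm(eye[2] - eye[4])  # vertical distance |p3 - p5|
    h = np.linalg.norm(eye[0] - eye[3])   # horizontal distance |p1 - p4|
    return float((v1 + v2) / (2.0 * h))

# An open eye typically gives EAR around 0.2-0.35; during a blink the value drops
# sharply toward zero and recovers, so a per-frame EAR sequence is a compact
# control signal for blink timing.
open_eye = np.array([[0, 2], [2, 3], [4, 3], [6, 2], [4, 1], [2, 1]], dtype=float)
print(eye_aspect_ratio(open_eye))  # about 0.33 for this synthetic example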

Source
http://dx.doi.org/10.1016/j.neunet.2025.107122

Publication Analysis

Top Keywords

blink movements (20); talking face (16); face generation (12); video portraits (8); portraits transformer (8); realistic talking (8); talking faces (8); blink (8); stages audio-to-landmark (8); real blink (8)

Similar Publications

Locked-in syndrome is a rare neurological disorder. It is characterized by tetraparesis, paralysis of the facial and masticatory muscles, anarthria, and pseudobulbar syndrome, with possible preservation of vertical eye movements, blinking, and consciousness. A serious problem in locked-in syndrome is the patient's inability to socialize, which causes suffering no less severe than that caused by the physical limitations.

Eye metrics are a marker of visual conscious awareness and neural processing in cerebral blindness.

bioRxiv

January 2025

Laboratory of Brain and Cognition (LBC), National Institute of Mental Health (NIMH), National Institutes of Health (NIH), Bethesda, Maryland (MD), USA.

Damage to the primary visual pathway can cause vision loss. Some cerebrally blind people retain degraded vision or sensations and can perform visually guided behaviors. These cases motivate investigation of, and debate about, conscious awareness in the blind field and the residual neural processing linked to it.

Fast, bioluminescent blinks attract group members of the nocturnal flashlight fish Anomalops katoptron (Bleeker, 1856).

Front Zool

January 2025

Department of General Zoology and Neurobiology, Institute of Biology and Biotechnology, Ruhr-University Bochum, 44801, Bochum, Germany.

Background: During their nighttime shoaling, the flashlight fish Anomalops katoptron produce fascinating bioluminescent blink patterns, which have been related to the localization of food, the determination of nearest-neighbor distance, and the initiation of the shoal's movement direction.

Explicit metrics for implicit emotions: investigating physiological and gaze indices of learner emotions.

Front Psychol

December 2024

Department of Learning, Data-Analytics and Technology, Faculty of Behavioural, Management and Social Sciences, University of Twente, Enschede, Netherlands.

Learning experiences are intertwined with emotions, which in turn have a significant effect on learning outcomes. Therefore, digital learning environments can benefit from taking the learner's emotional state into account. To do so, the first step is real-time emotion detection, which is made possible by sensors that continuously collect physiological and eye-tracking data.
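The abstract only mentions that sensors continuously collect physiological and eye-tracking streams. As an illustration of turning such a stream into an explicit gaze index, the hypothetical helper below counts blinks in a per-frame eye aspect ratio (EAR) trace; the frame rate and EAR threshold are assumptions for the sketch, not parameters reported by the study.

import numpy as np

def blink_rate_per_minute(ear: np.ndarray, fps: float = 30.0, threshold: float = 0.2) -> float:
    """Estimate blinks per minute from a per-frame EAR trace by counting open-to-closed transitions."""
    closed = ear < threshold
    # A blink onset is a frame where the eye switches from open to closed.
    onsets = int(np.count_nonzero(closed[1:] & ~closed[:-1]))
    if closed[0]:
        onsets += 1  # the trace starts mid-blink
    minutes = len(ear) / fps / 60.0
    return onsets / minutes if minutes > 0 else 0.0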
