VividFace: Real-Time and Realistic Facial Expression Shadowing for Humanoid Robots


A demonstration of VividFace. The humanoid robot faithfully imitates the facial expressions of the human performer in real time. Shadowing subtle details, such as frowning, gaze direction, and head pose, enhances realism.

Real-world examples of humanoid robots performing realistic facial expression imitation.

Abstract

Humanoid facial expression shadowing enables robots to realistically imitate human facial expressions in real time, which is critical for lifelike, facially expressive humanoid robots and affective human–robot interaction. Existing progress in humanoid facial expression imitation remains limited, often failing to achieve either real-time performance or realistic expressiveness due to offline video-based inference designs and insufficient ability to capture and transfer subtle expression details. To address these limitations, we present VividFace, a real-time and realistic facial expression shadowing system for humanoid robots. An optimized imitation framework, X2CNet++, enhances expressiveness by fine-tuning the human-to-humanoid facial motion transfer module and introducing a feature-adaptation training strategy for better alignment across different image sources. Real-time shadowing is further enabled by a video-stream-compatible inference pipeline and a streamlined workflow based on asynchronous I/O for efficient communication across devices. VividFace produces vivid humanoid faces by mimicking human facial expressions within 0.05 seconds, while generalizing across diverse facial configurations. Extensive real-world demonstrations validate its practical utility.
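To make the streaming workflow concrete, the sketch below shows one way an asynchronous I/O pipeline of this kind can be organized in Python: a camera coroutine streams the newest frame into a single-slot queue, and an inference coroutine runs the model off the event loop and forwards control values to the robot. All function and class names here are illustrative placeholders rather than the released VividFace implementation, and the frame rate and inference time are assumed values.

```python
import asyncio
import random
import time

# Minimal sketch of an asynchronous streaming workflow, assuming hypothetical
# stand-ins for the real camera driver, the X2CNet++ inference call, and the
# robot actuation interface (none of which are shown on this page).

async def camera_producer(queue: asyncio.Queue, n_frames: int = 30) -> None:
    """Capture frames from the RGB camera and stream them to the server."""
    for i in range(n_frames):
        frame = {"id": i, "timestamp": time.perf_counter()}  # stand-in for an image
        # Keep only the newest frame so the robot never lags behind the performer.
        if queue.full():
            queue.get_nowait()
        await queue.put(frame)
        await asyncio.sleep(1 / 30)  # assumed ~30 FPS camera


async def inference_consumer(queue: asyncio.Queue) -> None:
    """Run the imitation framework on each frame and drive the robot."""
    loop = asyncio.get_running_loop()
    while True:
        frame = await queue.get()
        # Heavy model inference runs in an executor so frame I/O stays responsive.
        controls = await loop.run_in_executor(None, run_imitation, frame)
        latency = time.perf_counter() - frame["timestamp"]
        print(f"frame {frame['id']}: end-to-end latency {latency * 1000:.1f} ms")
        send_controls(controls)


def run_imitation(frame):
    """Placeholder for M1 + M2 inference (image -> control values)."""
    time.sleep(0.02)  # pretend the model takes ~20 ms
    return [random.random() for _ in range(10)]  # fake control vector


def send_controls(controls) -> None:
    """Placeholder for transmitting control values to the physical robot."""
    pass


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=1)
    producer = asyncio.create_task(camera_producer(queue))
    consumer = asyncio.create_task(inference_consumer(queue))
    await producer
    consumer.cancel()
    try:
        await consumer
    except asyncio.CancelledError:
        pass


if __name__ == "__main__":
    asyncio.run(main())
```

The single-slot queue is the key design choice: dropping stale frames keeps the robot synchronized with the performer instead of replaying a backlog when inference momentarily falls behind the camera.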

System Overview

An overview of the VividFace workflow. An RGB camera captures human facial expression dynamics (A), and the image frames (each frame denoted by \(I_d\)) are streamed to the server and processed by the imitation framework, which consists of the motion transfer module \(\mathcal{M}_1\) and the mapping network \(\mathcal{M}_2\). The motion transfer module produces an intermediate expression representation \(I_m = \mathcal{M}_1(I_d; f_s, x_{c,s})\) that integrates human motion with a virtual robot face. The mapping network then predicts control values \(\hat{\mathbf{y}} = \mathcal{M}_2(I_m)\), which are used to drive the physical robot to reproduce the expression (B). The intermediate data flow for three example frames is visualized on the right (C).
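For readers who prefer code to notation, the following minimal sketch mirrors the two-stage inference step from the caption: the motion transfer module \(\mathcal{M}_1\) fuses the captured frame \(I_d\) with the virtual robot face (described by \(f_s\) and \(x_{c,s}\)) into \(I_m\), and the mapping network \(\mathcal{M}_2\) turns \(I_m\) into a control vector \(\hat{\mathbf{y}}\). The module bodies, tensor shapes, and actuator count below are assumptions made purely for illustration; the actual X2CNet++ architecture and weights are not reproduced here.

```python
import torch

# Sketch of one inference step, following the caption's notation.
# MotionTransfer, MappingNetwork, f_s, x_c_s, and n_actuators are all
# hypothetical placeholders, not the released X2CNet++ interfaces.

class MotionTransfer(torch.nn.Module):  # M1: human frame -> intermediate expression image
    def forward(self, I_d, f_s, x_c_s):
        # A real module would transfer the human motion in I_d onto the
        # virtual robot face described by (f_s, x_c_s).
        return I_d  # identity stand-in, keeps tensor shapes consistent


class MappingNetwork(torch.nn.Module):  # M2: intermediate image -> control values
    def __init__(self, n_actuators: int = 30):  # assumed actuator count
        super().__init__()
        self.head = torch.nn.Linear(3 * 256 * 256, n_actuators)

    def forward(self, I_m):
        return torch.sigmoid(self.head(I_m.flatten(1)))  # normalized control values


M1, M2 = MotionTransfer(), MappingNetwork()

I_d = torch.rand(1, 3, 256, 256)    # one captured human frame (assumed resolution)
f_s = torch.rand(1, 256)            # appearance feature of the virtual robot face (assumed shape)
x_c_s = torch.rand(1, 68, 2)        # canonical keypoints of the robot face (assumed shape)

with torch.no_grad():
    I_m = M1(I_d, f_s, x_c_s)       # I_m = M1(I_d; f_s, x_{c,s})
    y_hat = M2(I_m)                 # y_hat = M2(I_m)

print(y_hat.shape)                  # control vector sent to the physical robot
```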

Real-Time Video Demonstrations