X2C: A Benchmark Featuring
Nuanced Facial Expressions
for Realistic Humanoid Imitation

*Project lead, Advisor

A video demonstration showcasing the realistic imitation capabilities of our proposed framework, Mimetician, where the emotional nuances of the human performer are captured and transferred to the humanoid robot.

Examples of realistic humanoid imitation. Different individuals express a wide range of facial expressions, with nuances reflected in features such as frowning, gaze direction, eye openness, nose wrinkles, and mouth openness. These nuanced human facial expressions extend beyond canonical emotions and can be regarded either as blends of different canonical emotions or as a single emotion with varying intensities. The humanoid robot, Ameca, mimics every detail, resulting in a realistic imitation.

Abstract

We present a new benchmark along with a novel human-to-robot motion transfer baseline that enables a humanoid robot to realistically imitate nuanced facial expressions from humans. Prior work on human facial expression imitation for humanoid robots typically operates within limited emotion categories, failing to capture the subtle emotional nuances inherent in human expressions, thereby hindering realistic imitation. To address this limitation, we introduce X2C (Anything to Control), the first benchmark featuring nuanced facial expressions for realistic humanoid imitation. X2C comprises a training set and a test set. The training set consists of 104,987 images of a virtual humanoid robot in a simulation environment, each depicting nuanced facial expressions and annotated with 30 low-level control values. The test set contains nuanced human facial expressions with the same control value annotations as in the training set. Equipped with this benchmark, we propose Mimetician, a novel human-to-robot motion transfer baseline for realistic imitation. In this framework, a mapping network is trained on X2C to map images of the virtual robot to control values that encode emotional nuances within the robot's action space. Extensive experiments are conducted both on the test set and on the physical humanoid robot, with both quantitative and qualitative evaluations. For more details, please refer to our project page.

Comparison with Current Approaches

Existing methods typically follow a recognition and imitation paradigm, operating on a limited set of emotions and failing to capture emotional nuances. For example, the robot might always display the same happy face, even if the human performer exhibits happiness with varying intensities. In contrast, the proposed Mimetician framework does not impose restrictions on canonical emotions and directly predicts low-level control values in the robot's action space. These control values represent the movement of actuators in the robot's face, thereby encoding subtle variations in facial expressions.
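Since Mimetician directly regresses low-level control values rather than classifying a fixed set of emotions, the core of the idea can be sketched as a regressor from a robot-face image to 30 actuator controls. The network architecture, sizes, and names below are illustrative assumptions, not the authors' code; only the output dimensionality (30 controls) comes from the paper.

```python
import numpy as np

NUM_CONTROLS = 30  # number of low-level control values annotated in X2C


class TinyControlRegressor:
    """Hypothetical minimal image-to-control regressor (a stand-in for
    Mimetician's mapping network, which the paper does not fully specify)."""

    def __init__(self, image_dim, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        # Small random weights; a real model would be trained on X2C pairs.
        self.w1 = rng.normal(0.0, 0.01, (image_dim, hidden))
        self.w2 = rng.normal(0.0, 0.01, (hidden, NUM_CONTROLS))

    def predict(self, image):
        h = np.maximum(image.flatten() @ self.w1, 0.0)   # ReLU hidden layer
        logits = h @ self.w2
        return 1.0 / (1.0 + np.exp(-logits))             # squash to [0, 1]


model = TinyControlRegressor(image_dim=48 * 48)
controls = model.predict(np.zeros((48, 48)))
print(controls.shape)  # (30,)
```

Because the output is a continuous vector rather than an emotion label, varying intensities of the same emotion map to different control vectors, which is what lets the robot reproduce nuance.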

X2C Rollout

X2C training set examples demonstration. Each example in the X2C dataset consists of (1) an image depicting the virtual robot, shown in the middle, and (2) the corresponding control values, visualized at the bottom. To facilitate an understanding of the relationship between the physical and virtual robots, images of the physical robot are also included at the top. The physical robot and its virtual counterpart share the same set of controls.
X2C test set examples demonstration. Each example is an image-control value pair, where the image depicts a nuanced human facial expression. The control values correspond to a robotic facial expression and serve as the ground truth for the imitation task. Note that quantitative evaluation can be conducted on the test set even without the physical robot.
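Because each test example carries ground-truth control values, predictions can be scored entirely offline. As one illustration, a per-control error such as mean absolute error could be computed; the choice of MAE here is an assumption for demonstration, not necessarily the paper's metric.

```python
import numpy as np

def control_mae(pred, gt):
    """Mean absolute error between predicted and ground-truth control
    vectors (illustrative offline metric; the 30-dim vectors come from X2C)."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    return float(np.mean(np.abs(pred - gt)))

gt = np.full(30, 0.5)    # hypothetical ground-truth controls
pred = np.full(30, 0.4)  # hypothetical model predictions
print(round(control_mae(pred, gt), 3))  # 0.1
```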

Dataset Collection

The pipeline for training set collection. A. We first construct facial expression animations and record videos in the simulation environment. B. In the annotation module, both the control values and image frames are sampled at a timestep of 0.05 seconds, which are used to construct temporally aligned pairs.
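The 0.05-second pairing step above can be sketched as sampling a control snapshot and a frame index at each timestep. The render frame rate and helper name are assumptions for illustration; only the 0.05 s timestep comes from the description.

```python
TIMESTEP = 0.05  # seconds between sampled pairs (from the pipeline text)
FPS = 60         # assumed simulator render rate (not stated in the paper)

def aligned_pairs(duration_s):
    """Yield (time, frame_index) pairs sampled every TIMESTEP seconds,
    giving temporally aligned image/control-value samples."""
    n = int(duration_s / TIMESTEP)
    for i in range(n + 1):
        t = round(i * TIMESTEP, 2)
        yield t, round(t * FPS)

pairs = list(aligned_pairs(0.2))
print(pairs)  # 5 pairs, from (0.0, 0) up to (0.2, 12)
```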
The pipeline for test set collection. A. We first construct pairs of robot images and corresponding control values. B. Volunteers are then invited to replicate the robot's facial expressions precisely. C. The captured human facial images are subsequently annotated with the control values associated with the robot images.

Framework

An overview of Mimetician, the proposed human-to-robot motion transfer framework for realistic humanoid imitation. A. During training, we learn a mapping from the virtual robot to the control values within the robot's action space. B. During inference, the human motion is first transferred to the virtual robot, and the resulting virtual-robot image then passes through the learned mapping network to predict control values that encode nuanced facial expressions. The physical robot is driven by these control values to imitate human facial expressions in a realistic manner.
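The inference flow described above can be summarized as three chained stages. The stage functions below (`retarget_to_virtual_robot`, `mapping_network`, `send_to_robot`) are hypothetical placeholders sketching the data flow, not the authors' API; only the sequence human motion → virtual robot → control values → physical robot comes from the framework description.

```python
import numpy as np

def retarget_to_virtual_robot(human_image):
    # Stage A (placeholder): transfer human motion onto the virtual robot.
    return human_image  # identity stand-in for the retargeting step

def mapping_network(virtual_image):
    # Stage B (placeholder): the X2C-trained network predicting 30 controls.
    return np.clip(virtual_image.mean() * np.ones(30), 0.0, 1.0)

def send_to_robot(controls):
    # Stage C (placeholder): forward control values to the physical robot.
    return {"num_controls": len(controls)}

def imitate(human_image):
    """Chain the three stages: human frame in, robot command out."""
    virtual = retarget_to_virtual_robot(human_image)
    controls = mapping_network(virtual)
    return send_to_robot(controls)

status = imitate(np.full((48, 48), 0.5))
print(status)  # {'num_controls': 30}
```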