X2C: A Dataset Featuring Nuanced Facial Expressions for Realistic Humanoid Imitation

A video demonstration showcases the realistic imitation capabilities of our proposed framework, X2CNet, which learns the correspondence between expression representations in image space and the robot's action space from the X2C dataset. This framework validates the value of our dataset in advancing research on realistic humanoid facial expression imitation. Notably, our dataset and imitation framework are applicable to multiple humanoid robots with different facial appearances.

Examples of realistic humanoid imitation. Different individuals express a wide range of facial expressions, with nuances reflected in features such as frowns, gaze direction, eye openness, nose wrinkles, and mouth openness. These nuanced human facial expressions extend beyond canonical emotions and can be regarded either as blends of different canonical emotions or as a single emotion with varying intensity. The humanoid robot, Ameca, mimics every detail, resulting in a realistic imitation.

The ability to imitate realistic facial expressions is essential for humanoid robots engaged in affective human-robot communication. However, the lack of datasets containing diverse humanoid facial expressions with proper annotations hinders progress in realistic humanoid facial expression imitation. To address this gap, we introduce X2C (Anything to Control), a dataset featuring nuanced facial expressions for realistic humanoid imitation. With X2C, we contribute: 1) a high-quality, high-diversity, large-scale dataset comprising 100,000 (image, control value) pairs, where each image depicts a humanoid robot displaying one of a diverse range of facial expressions and is annotated with 30 control values representing the ground-truth expression configuration; 2) X2CNet, a novel human-to-humanoid facial expression imitation framework that learns the correspondence between nuanced humanoid expressions and their underlying control values from X2C. It enables facial expression imitation in the wild for different human performers, providing a baseline for the imitation task and showcasing the potential value of our dataset; 3) real-world demonstrations on a physical humanoid robot, highlighting its capability to advance realistic humanoid facial expression imitation.

Demonstration of X2C dataset examples. Each example in the X2C dataset consists of: (1) an image depicting the virtual robot, shown in the middle; and (2) the corresponding control values, visualized at the bottom. In these visualizations, the height of each blue bar represents the magnitude of the corresponding value, while the orange dots indicate the values in the neutral state.
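As a minimal sketch of how one such example could be held in memory, the Python below pairs an image with its 30 control values and computes how far each value sits from the neutral state, which is the comparison the bar-versus-dot visualization conveys. The class name `X2CSample`, the `image_path` field, the helper `deviation_from_neutral`, and all numeric values are illustrative assumptions, not the dataset's actual file format or value range.

```python
from dataclasses import dataclass
from typing import List

# Stated in the text: each image is annotated with 30 control values.
NUM_CONTROLS = 30


@dataclass
class X2CSample:
    """One (image, control value) pair. Field names are hypothetical."""
    image_path: str              # assumed: where the rendered robot image is stored
    control_values: List[float]  # the 30 ground-truth control values

    def __post_init__(self) -> None:
        if len(self.control_values) != NUM_CONTROLS:
            raise ValueError(
                f"expected {NUM_CONTROLS} control values, "
                f"got {len(self.control_values)}"
            )


def deviation_from_neutral(sample: X2CSample, neutral: List[float]) -> List[float]:
    """Per-control difference between a sample and the neutral state."""
    if len(neutral) != NUM_CONTROLS:
        raise ValueError(f"expected {NUM_CONTROLS} neutral values")
    return [value - rest for value, rest in zip(sample.control_values, neutral)]


if __name__ == "__main__":
    # Placeholder numbers only; real values come from the dataset.
    sample = X2CSample("example.png", [0.5] * NUM_CONTROLS)
    neutral_state = [0.25] * NUM_CONTROLS
    print(deviation_from_neutral(sample, neutral_state)[:4])
```

Validating the length at construction time catches annotation files that are truncated or padded before they reach a training loop.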

The pipeline for dataset collection. We first curate humanoid facial expression animations covering all basic emotions and beyond. Images and their corresponding control values are then sampled at the same timestamps (e.g., if an image is sampled at t = 2.0, its control value annotation is also sampled at t = 2.0) to obtain temporally aligned pairs.
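The temporal alignment described above can be sketched as follows: a single list of timestamps drives both the image sampler and the control-value sampler, so every pair shares exactly the same t. The callables `render_frame` and `read_controls`, the uniform sampling step, and the animation duration are hypothetical stand-ins; the actual rendering tool and sampling rate are not specified here.

```python
import math
from typing import Callable, List, Sequence, Tuple


def sample_timestamps(duration: float, step: float) -> List[float]:
    """Evenly spaced timestamps from 0 up to and including `duration`."""
    if step <= 0:
        raise ValueError("step must be positive")
    # Small epsilon guards against floating-point error in the division.
    count = math.floor(duration / step + 1e-9) + 1
    # Multiplying an integer index avoids accumulating rounding error.
    return [i * step for i in range(count)]


def sample_aligned_pairs(
    render_frame: Callable[[float], str],
    read_controls: Callable[[float], Sequence[float]],
    timestamps: Sequence[float],
) -> List[Tuple[float, str, List[float]]]:
    """Query the image and the control values at the same timestamp t."""
    return [(t, render_frame(t), list(read_controls(t))) for t in timestamps]


if __name__ == "__main__":
    # Stub samplers standing in for the real animation (assumptions).
    frames = lambda t: f"frame@{t:.1f}"
    controls = lambda t: [t] * 30
    for t, image, values in sample_aligned_pairs(
        frames, controls, sample_timestamps(2.0, 0.5)
    ):
        print(t, image, values[:2])
```

Using one shared timestamp list, rather than sampling the two streams independently, is what keeps an image from being paired with control values from a neighboring frame.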

An illustration of the correspondence between control values and control units. In the control value visualization, the first 4 values control the brow movements, the next 4 control eyelid motions, and so on for the other units.
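The index layout described above can be expressed as slices over the 30-value vector. This is only a sketch: the brow and eyelid ranges follow the caption, while the remaining 22 values are lumped under a placeholder key `other` because their unit assignments are not enumerated here. The names `CONTROL_UNITS` and `split_by_unit` are hypothetical.

```python
from typing import Dict, List, Sequence

NUM_CONTROLS = 30

# Brow and eyelid ranges come from the caption; "other" is a placeholder
# for the units whose index ranges are not listed in this text.
CONTROL_UNITS: Dict[str, slice] = {
    "brow": slice(0, 4),
    "eyelid": slice(4, 8),
    "other": slice(8, NUM_CONTROLS),
}


def split_by_unit(values: Sequence[float]) -> Dict[str, List[float]]:
    """Group a 30-value control vector by the unit each value drives."""
    if len(values) != NUM_CONTROLS:
        raise ValueError(f"expected {NUM_CONTROLS} values, got {len(values)}")
    return {unit: list(values[span]) for unit, span in CONTROL_UNITS.items()}


if __name__ == "__main__":
    # Using indices as values makes the grouping easy to inspect.
    groups = split_by_unit(list(range(NUM_CONTROLS)))
    print(groups["brow"], groups["eyelid"], len(groups["other"]))
```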

Value distributions of 30 controls. Controls for different expression-relevant units are indicated by different colors.

An overview of X2CNet, the proposed imitation framework. The first module captures facial expression subtleties from humans, while the mapping network learns the correspondence between various humanoid expressions and their underlying control values using the X2C dataset.
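The two-stage structure can be sketched as a simple composition, assuming the output of the first module is what the mapping network consumes (the overview above does not spell out the interface between them). The function `imitate` and its parameters `capture_expression` and `map_to_controls` are hypothetical placeholders for the two modules, not the authors' implementation.

```python
from typing import Callable, List, Sequence, TypeVar

HumanImage = TypeVar("HumanImage")
Representation = TypeVar("Representation")

NUM_CONTROLS = 30  # stated in the text: 30 control values per expression


def imitate(
    human_image: HumanImage,
    capture_expression: Callable[[HumanImage], Representation],
    map_to_controls: Callable[[Representation], Sequence[float]],
) -> List[float]:
    """Run the two stages in order and return the robot's control values."""
    # Stage 1 (assumed interface): extract the expression from the human image.
    representation = capture_expression(human_image)
    # Stage 2 (assumed interface): predict the control values for the robot.
    controls = list(map_to_controls(representation))
    if len(controls) != NUM_CONTROLS:
        raise ValueError(
            f"mapping network returned {len(controls)} values, "
            f"expected {NUM_CONTROLS}"
        )
    return controls


if __name__ == "__main__":
    # Stub modules so the sketch runs end to end; real models would go here.
    stub_capture = lambda image: image.upper()
    stub_mapping = lambda rep: [float(len(rep))] * NUM_CONTROLS
    print(imitate("face", stub_capture, stub_mapping)[:3])
```

Keeping the two stages behind plain callables means either one can be swapped out without touching the other.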

