REALM: A Coarse-to-Fine Generative Framework
for Embodied Reactive Listening

REALM synthesizes lifelike, reactive listener motions driven purely by speaker audio. As demonstrated on the Ameca humanoid robot above, our coarse-to-fine framework successfully models natural reaction delays and disentangles smooth head trajectories from rapid facial micro-expressions.

Abstract

Generating responsive listener facial motion is a fundamental challenge in dyadic interactions, carrying profound implications for embodied conversational AI. However, existing approaches face two primary limitations. First, they fail to model listening as a fundamentally reactive process, often disregarding natural reaction delays and historical motion context. Active speakers produce pronounced signals while listeners remain predominantly quiescent; without a mechanism to reconcile these distinct behavioral profiles, generated motions deviate unnaturally from the listener's ground truth. Second, they ignore varying temporal scales, failing to separate smooth head movements from rapid facial expressions. To address these issues, we propose REALM (Reactive Embodied Audio-driven Listening Model), a coarse-to-fine generative framework explicitly designed for audio-driven reactive listening. To capture the reactive nature of listening and prevent unnatural deviation, we introduce a Reactive Gated Speaker-Listener Fusion module. It leverages a shifted variant of Attention with Linear Biases (ALiBi) to model realistic reaction delays, while a gating mechanism dynamically balances the speaker's acoustic trigger against the listener's motion history. To resolve temporal scale mismatches, a coarse decoder first establishes smooth trajectories for head pose and expressions; a refinement module then injects audio-modulated stochastic noise into the expression features to synthesize lifelike, high-frequency subtleties. Extensive evaluations demonstrate that REALM outperforms state-of-the-art baselines in both distributional realism and temporal synchrony. Finally, we validate our model's physical viability by deploying the synthesized motions onto a humanoid robot, bridging the gap between digital avatars and physically embodied agents.
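
The shifted ALiBi idea mentioned in the abstract can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the REALM implementation: the function names, the single scalar `slope`, the fixed integer `delay`, and the absolute-distance form of the penalty are all hypothetical choices. Standard ALiBi penalizes attention by the distance between query and key positions; here the penalty is assumed to be centered on the key that lies `delay` frames in the past, so a listener frame prefers the speaker frame that occurred one reaction delay earlier.

```python
import numpy as np


def shifted_alibi_bias(num_queries, num_keys, slope, delay):
    """Additive attention bias that peaks `delay` frames in the past.

    Hypothetical form: bias[i, j] = -slope * |(i - j) - delay|.
    Speaker frames in the future (j > i) are masked with -inf.
    """
    i = np.arange(num_queries)[:, None]  # listener (query) frame index
    j = np.arange(num_keys)[None, :]     # speaker (key) frame index
    bias = -slope * np.abs((i - j) - delay).astype(np.float64)
    return np.where(j > i, -np.inf, bias)


def softmax(x, axis=-1):
    """Numerically stable softmax; -inf entries receive zero weight."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


# Content scores are set to zero so that only the bias shapes the weights.
bias = shifted_alibi_bias(num_queries=12, num_keys=12, slope=0.5, delay=3)
weights = softmax(bias)
print(int(np.argmax(weights[10])))  # 7: frame 10 attends most to frame 10 - 3
```

In a real attention layer this bias would be added to the query-key scores before the softmax, typically with a different slope per head; whether REALM learns the delay or fixes it is not stated here, so the constant `delay` above is purely illustrative.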

Core challenges in responsive listener motion generation. (a) Natural Reaction Delay: Listening is a fundamentally reactive process characterized by an inherent temporal delay (\(\tau\)) between the speaker's acoustic stimulus and the listener's response. (b) Distinct Behavioral Patterns: Active speakers produce pronounced signals, while listeners exhibit predominantly quiescent resting behaviors. When driven purely by the speaker's acoustic stimulus, the generated motion drifts away from the ground-truth manifold (red dashed line). (c) Varying Temporal Scales: Facial dynamics operate at distinct frequencies. Smooth, overall head poses (\(r^{(i)}\)) must be disentangled from rapid, high-frequency facial expressions (\(x^{(j)}\)) to prevent over-smoothing.
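
To make challenge (c) concrete, the sketch below shows one possible shape of the refinement step that the abstract describes: a smooth coarse expression trajectory receives stochastic detail whose strength follows the speaker audio. Everything specific here is an assumption for illustration, including the function name `refine_expressions`, the use of a per-frame audio energy as the modulating signal, the min-max normalization, and the `max_noise_std` parameter. Head pose is deliberately not passed in, mirroring the statement that noise is injected into expression features only.

```python
import numpy as np


def refine_expressions(coarse_expr, audio_energy, max_noise_std=0.05, rng=None):
    """Add audio-modulated Gaussian noise to a coarse expression trajectory.

    coarse_expr:  (T, D) smooth expression coefficients from a coarse stage.
    audio_energy: (T,) per-frame speaker audio energy (hypothetical input).
    Returns a (T, D) array; frames with the lowest energy stay unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    coarse_expr = np.asarray(coarse_expr, dtype=np.float64)
    energy = np.asarray(audio_energy, dtype=np.float64)

    # Normalize energy to [0, 1]; a flat signal yields zero modulation.
    span = energy.max() - energy.min()
    gain = (energy - energy.min()) / span if span > 0 else np.zeros_like(energy)

    noise = rng.standard_normal(coarse_expr.shape)
    return coarse_expr + max_noise_std * gain[:, None] * noise


# Toy input: silence for the first 25 frames, speech for the last 25.
rng = np.random.default_rng(0)
coarse = np.zeros((50, 4))
energy = np.concatenate([np.zeros(25), np.ones(25)])
refined = refine_expressions(coarse, energy, rng=rng)
```

In this toy setup the silent half of `refined` is identical to the coarse input, while the speech half carries small high-frequency perturbations. A learned refinement module would replace the fixed gain with a network conditioned on audio features.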

REALM (Reactive Embodied Audio-driven Listening Model)

Overview of the REALM framework. The system ingests speaker audio ($\mathbf{A}$) and listener motion history ($\mathbf{H}$). It consists of a Reactive Gated Speaker-Listener Fusion module to model reaction delays, a coarse-to-fine motion refinement module to resolve varying temporal scales, and a physical grounding pipeline for robotic embodiment.
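
The gating idea in the fusion module can be sketched as a per-feature convex blend of speaker-derived features and listener-history features. This is a minimal sketch under assumptions, not the module's actual architecture: the names `gated_fusion`, `w_gate`, and `b_gate`, the single linear gate layer, and the assumption that both feature streams share the shape `(T, D)` are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(speaker_feat, history_feat, w_gate, b_gate):
    """Blend speaker and listener-history features with a learned gate.

    gate  = sigmoid([speaker; history] @ w_gate + b_gate)   shape (T, D)
    fused = gate * speaker + (1 - gate) * history
    A gate near 1 follows the acoustic trigger; near 0 follows the history.
    """
    joint = np.concatenate([speaker_feat, history_feat], axis=-1)  # (T, 2D)
    gate = sigmoid(joint @ w_gate + b_gate)
    fused = gate * speaker_feat + (1.0 - gate) * history_feat
    return fused, gate


# Toy example with random features and random (untrained) gate weights.
rng = np.random.default_rng(0)
T, D = 6, 8
speaker = rng.standard_normal((T, D))
history = rng.standard_normal((T, D))
w_gate = 0.1 * rng.standard_normal((2 * D, D))
b_gate = np.zeros(D)
fused, gate = gated_fusion(speaker, history, w_gate, b_gate)
```

Because the gate lies strictly between 0 and 1, every fused value sits between the corresponding speaker and history values, which is one simple way to keep the output anchored to the listener's own motion when the acoustic trigger is weak.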

Robotic Embodiment

Qualitative comparison of physical embodiment on the Ameca robot. To illustrate the performance observed across the ViCo test set, we visualize two representative video sequences (frames $t_1, t_2, t_3$ and $t_1', t_2', t_3'$) comparing the reactive listener motion generated by our method (REALM) against the human Ground Truth and the ListenFormer [Liu et al., 2024] baseline. Driven by the speaker's context (Top Row), ListenFormer frequently suffers from deterministic over-smoothing, failing to synthesize intended expressions such as a conversational smile or a natural blink (highlighted in red). In contrast, REALM effectively anchors the listener to their natural reactive manifold, accurately recovering contextually appropriate expressions and subtle stochastic micro-dynamics (e.g., smiling and blinking, highlighted in green) that closely match the human Ground Truth.