UniLS: End-to-End Audio-Driven Framework Eliminates the ‘Poker Face’ in Digital Human Dialogue
UniLS, the first end-to-end audio-driven framework to jointly generate speaking and listening facial motions for digital humans, achieves state-of-the-art speaking accuracy, improves listening naturalness by 44.1%, and runs at over 500 FPS, as demonstrated in the CVPR 2026-accepted paper through extensive quantitative evaluations and user studies.
Digital human dialogue requires realistic facial motions for both speaking and listening, yet existing methods handle only one side: speak-only models (e.g., ARTalk, DiffPoseTalk) generate speaking motions, while listen-only models produce listening reactions. The only prior joint model, DualTalk, depends on pre-computed facial sequences of the speaker, which prevents end-to-end training and real-time deployment.
UniLS (Unified Listening and Speaking) addresses this gap by decomposing listening behavior into an internal motion prior and external audio modulation, enabling a fully end‑to‑end framework that drives both speaking and listening facial actions using only dual‑track audio.
Core finding: audio‑motion imbalance. t‑SNE analysis of audio features versus facial parameters shows that during speaking, audio and motion are tightly coupled, while during listening the correlation is weak because many listening gestures (blinks, micro‑expressions) are independent of the interlocutor’s speech. This imbalance causes naïve end‑to‑end training to overfit the speaking branch and produce stiff, low‑variance listening expressions.
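This kind of coupling analysis can be approximated with a minimal sketch along the following lines, assuming frame-aligned audio features and FLAME motion parameters have already been extracted; the function names, shapes, and the use of CCA as a coupling measure are illustrative, not the paper's exact procedure.

```python
# Illustrative sketch of the audio-motion coupling analysis described above.
# Assumes frame-aligned audio features and FLAME parameters are pre-extracted;
# array names and shapes are placeholders, not the authors' pipeline.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.manifold import TSNE

def coupling_strength(audio_feats, motion_params, n_components=4):
    """Estimate linear audio-motion coupling via mean canonical correlation.

    audio_feats: (T, D_audio), motion_params: (T, D_flame).
    Returns a value near 1 for tightly coupled signals, near 0 for weak ones.
    """
    cca = CCA(n_components=n_components)
    a_c, m_c = cca.fit_transform(audio_feats, motion_params)
    return float(np.mean([np.corrcoef(a_c[:, i], m_c[:, i])[0, 1]
                          for i in range(n_components)]))

def joint_embedding_2d(audio_feats, motion_params, perplexity=30):
    """2-D t-SNE embedding of concatenated audio + motion frames for plotting."""
    joint = np.concatenate([audio_feats, motion_params], axis=1)
    return TSNE(n_components=2, perplexity=perplexity).fit_transform(joint)

# Expected pattern per the paper's observation:
#   coupling_strength(audio_speak, motion_speak)   -> high
#   coupling_strength(audio_listen, motion_listen) -> low
```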
Two-stage training. Stage 1 trains an autoregressive generator without audio on 546.5 h of unpaired multi-scene video (CelebV, TalkingHead-1KH, TEDTalk, VFHQ), using FLAME 3D parameters discretized by a multi-scale VQ-autoencoder, so the model learns intrinsic motion priors such as blink frequency and head micro-movements. Stage 2 fine-tunes the generator on paired dialogue from Seamless Interaction (251.5 h of speaking and 406.0 h of listening) by adding two cross-attention layers: one conditioned on the speaker's own audio for speaking, the other on the interlocutor's audio for listening. The pre-trained weights are adapted with LoRA, preserving the learned motion priors while enabling audio-driven modulation.
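A hedged PyTorch sketch of how the Stage-2 conditioning could look: a pre-trained motion block gains two cross-attention layers (one over the avatar's own audio, one over the interlocutor's) while the original projections receive low-rank LoRA updates. Module names, dimensions, and the block layout are assumptions, not the released architecture.

```python
# Sketch of Stage-2 audio conditioning under stated assumptions: frozen
# pre-trained weights + LoRA adapters, plus dual audio cross-attention.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

class AudioConditionedBlock(nn.Module):
    """Motion self-attention + dual audio cross-attention (illustrative)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_self_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_peer_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(LoRALinear(nn.Linear(dim, 4 * dim)),
                                 nn.GELU(),
                                 LoRALinear(nn.Linear(4 * dim, dim)))
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))

    def forward(self, motion_tokens, self_audio, peer_audio):
        x = motion_tokens
        h = self.norms[0](x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Speaking branch: condition on the avatar's own speech track.
        x = x + self.cross_self_audio(self.norms[1](x), self_audio, self_audio,
                                      need_weights=False)[0]
        # Listening branch: condition on the interlocutor's speech track.
        x = x + self.cross_peer_audio(self.norms[2](x), peer_audio, peer_audio,
                                      need_weights=False)[0]
        return x + self.ffn(self.norms[3](x))
```

In this sketch only the LoRA adapters and the two new cross-attention layers are trainable, which mirrors the paper's goal of keeping the Stage-1 motion prior intact while adding audio modulation.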
Quantitative results. On the Seamless Interaction test set, UniLS achieves the best scores across all metrics. Speaking accuracy improves to LVE 5.83 and MHD 1.89. Listening quality improves by 44.1% on distribution-level metrics: FDD drops from 43.58 (DualTalk) to 17.12, F-FID from 13.143 to 4.304, and P-FID from 0.079 to 0.038.
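For readers unfamiliar with these metrics, the following is a hedged sketch of lip vertex error (LVE) and a Fréchet-style distribution distance of the kind typically behind F-FID/P-FID, following common usage in the talking-head literature; the paper's exact definitions may differ.

```python
# Illustrative metric implementations; definitions follow common practice,
# not necessarily the paper's exact formulation.
import numpy as np
from scipy.linalg import sqrtm

def lip_vertex_error(pred, gt, lip_idx):
    """Max L2 error over lip vertices per frame, averaged over the sequence.

    pred, gt: (T, V, 3) vertex trajectories; lip_idx: indices of lip vertices.
    """
    err = np.linalg.norm(pred[:, lip_idx] - gt[:, lip_idx], axis=-1)  # (T, L)
    return float(err.max(axis=1).mean())

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussian fits of two feature sets (N, D)."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical imaginary residue
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a + cov_b - 2 * covmean))
```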
User study. More than 91 % of participants preferred UniLS’s listening reactions, 90 % preferred its facial expression naturalness, and 86 % preferred its lip‑sync quality over DualTalk.
Real‑time performance. UniLS runs at 560.6 FPS on a single RTX 5090 GPU (421.3 M parameters), surpassing ARTalk’s 357.7 FPS (489.5 M parameters). DualTalk cannot achieve real‑time inference due to its non‑end‑to‑end design.
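Throughput figures like these can be sanity-checked with a simple timing loop such as the one below; the model handle, input chunk, and frames-per-call value are placeholders, not UniLS's actual interface.

```python
# Minimal FPS measurement sketch for an audio-to-motion model on GPU.
# The model and inputs are assumed placeholders, not the released API.
import time
import torch

@torch.inference_mode()
def frames_per_second(model, audio_chunk, frames_per_call, n_warmup=10, n_iters=100):
    model.eval()
    for _ in range(n_warmup):          # warm up kernels / caches
        model(audio_chunk)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        model(audio_chunk)
    torch.cuda.synchronize()           # wait for all GPU work before timing
    elapsed = time.perf_counter() - start
    return n_iters * frames_per_call / elapsed
```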
In summary, UniLS is the first end‑to‑end audio‑driven framework that jointly generates speaking and listening facial motions, solving the “listening stiffness” problem through a two‑stage training strategy that separates motion priors from audio modulation, and delivering state‑of‑the‑art quality, naturalness, and real‑time speed for interactive digital humans.