Beyond Simple Motions: How SentiAvatar Redefines 3D Digital Human Action Generation
SentiAvatar introduces a two‑stage plan‑then‑infill framework that separates sentence‑level semantic planning from frame‑level prosody‑driven motion infill, leveraging a 200K‑sequence Motion Foundation Model and the newly released 21k‑clip SuSuInterActs dataset to achieve state‑of‑the‑art, real‑time expressive 3D digital human animation.
When interacting with a 3D digital human, users often notice a disjoint between speech and motion, leading to stiff facial expressions and mismatched gestures that trigger the uncanny valley effect. The article identifies three long‑standing research gaps: lack of high‑quality multimodal data (especially Chinese dialogue with synchronized facial expressions), degradation of model understanding for compound semantic actions, and misalignment of generated motion with speech prosody.
To address these gaps, the AI startup SentiPulse, together with a research team from Renmin University, proposes SentiAvatar, a new paradigm for expressive and interactive 3D digital humans. The framework is built around a two-stage plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven motion generation.
The first component, a large-language-model (LLM) semantic planner, takes behavior-tagged dialogue scripts and sparse audio tokens and outputs a sparse sequence of key-frame action tokens. It maintains continuity across dialogue turns by carrying over the last two key-frame audio-action token pairs as context.
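The article stays at the architecture level, so the sketch below is only a rough mental model of the planner's interface: the class and method names (SemanticPlanner, plan_keyframes), the token layout, and the greedy decoding are assumptions for illustration, not the released implementation.

```python
# Minimal sketch of the sentence-level semantic planner interface.
# Names, token layout, and decoding are illustrative assumptions.
import random
from dataclasses import dataclass, field

@dataclass
class PlannerContext:
    # Last two key-frame (audio_token, action_token) pairs, carried across
    # dialogue turns so motion stays continuous between turns.
    history: list = field(default_factory=list)

class SemanticPlanner:
    """Wraps an LLM that maps a behavior-tagged script plus sparse audio tokens
    to a sparse sequence of key-frame action tokens (one token per key frame)."""

    def __init__(self, llm, action_vocab_size=2048):
        self.llm = llm                      # any next-token model: sequence -> logits
        self.action_vocab_size = action_vocab_size

    def plan_keyframes(self, script_tokens, audio_tokens, ctx: PlannerContext):
        # Prepend the carried-over context, then the new turn's inputs.
        prompt = [t for pair in ctx.history[-2:] for t in pair]
        prompt += script_tokens + audio_tokens
        keyframe_actions = []
        for audio_tok in audio_tokens:      # one key-frame action per audio token
            logits = self.llm(prompt)
            action_tok = max(range(self.action_vocab_size), key=lambda i: logits[i])
            keyframe_actions.append(action_tok)
            prompt += [audio_tok, action_tok]
        # Keep only the newest two key-frame pairs as rolling context.
        ctx.history = list(zip(audio_tokens, keyframe_actions))[-2:]
        return keyframe_actions

# Stand-in LLM returning random logits, just so the sketch runs end to end.
dummy_llm = lambda seq: [random.random() for _ in range(2048)]
planner = SemanticPlanner(dummy_llm)
ctx = PlannerContext()
print(planner.plan_keyframes(script_tokens=[11, 42, 7], audio_tokens=[5, 9, 13], ctx=ctx))
```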
The second component, the Body Infill Transformer, fills three intermediate frames between each pair of key frames using continuous HuBERT audio features (768-dim, 20 FPS) as conditioning. An iterative confidence-based decoding strategy (default six steps) ensures high-quality predictions without the degradation typical of one-shot generation.
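Iterative confidence-based decoding of this kind is usually a mask-and-commit loop. The sketch below shows one plausible form of it; the cosine masking schedule, the model interface, and the toy dimensions are assumptions, not the paper's exact procedure.

```python
# Sketch of confidence-based iterative decoding for masked infill frames.
import math
import torch

def confidence_infill(model, keyframe_tokens, audio_feats, num_frames, mask_id, steps=6):
    """model(tokens, keyframe_tokens, audio_feats) -> logits of shape (num_frames, vocab).
    Masked frame tokens are committed over several passes, most confident first."""
    tokens = torch.full((num_frames,), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens, keyframe_tokens, audio_feats)          # (T, V)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)               # per-frame confidence
        committed = tokens != mask_id
        tokens = torch.where(committed, tokens, pred)                 # keep earlier decisions
        conf = torch.where(committed, torch.full_like(conf, float("inf")), conf)
        # Cosine schedule: progressively fewer frames stay masked for the next pass.
        keep_masked = int(math.cos(math.pi / 2 * (step + 1) / steps) * num_frames)
        if keep_masked > 0:
            lowest = conf.topk(keep_masked, largest=False).indices    # least confident frames
            tokens[lowest] = mask_id
    return tokens

# Toy model with random logits, just so the loop runs stand-alone.
toy_model = lambda toks, key_toks, audio: torch.randn(toks.shape[0], 2048)
print(confidence_infill(toy_model, keyframe_tokens=torch.tensor([3, 17]),
                        audio_feats=torch.randn(6, 768), num_frames=3,
                        mask_id=2048, steps=6))
```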
Facial expression generation runs through a parallel Face Infill Transformer that bypasses the LLM planner, mapping audio features directly to facial tokens and decoding them into 51-dim ARKit blendshape coefficients.
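The face channel can be pictured as a small audio-to-token-to-coefficient pipeline. The module below is an illustrative stand-in: the layer sizes, codebook, and sigmoid decoder are assumptions chosen only to show the data flow from HuBERT features to 51 blendshape values per frame.

```python
# Rough sketch of the parallel face channel: audio features -> facial tokens ->
# 51-dim ARKit blendshape coefficients. Module names and sizes are illustrative.
import torch
import torch.nn as nn

class FaceInfillSketch(nn.Module):
    def __init__(self, audio_dim=768, vocab=1024, blendshape_dim=51):
        super().__init__()
        self.to_token_logits = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.GELU(), nn.Linear(512, vocab))
        self.codebook = nn.Embedding(vocab, 256)        # facial token embeddings
        self.to_blendshapes = nn.Linear(256, blendshape_dim)

    def forward(self, audio_feats):                      # (T, 768) at 20 FPS
        token_ids = self.to_token_logits(audio_feats).argmax(dim=-1)   # (T,)
        coeffs = self.to_blendshapes(self.codebook(token_ids))         # (T, 51)
        return coeffs.sigmoid()                          # coefficients kept in [0, 1]

face = FaceInfillSketch()
print(face(torch.randn(120, 768)).shape)                 # 6 s of audio -> (120, 51)
```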
Both channels share the same HuBERT feature extractor, enabling end‑to‑end latency of roughly 0.53 seconds to produce a 6‑second motion segment, supporting unlimited multi‑turn streaming.
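That shared-extractor design is easy to picture as a per-segment streaming loop: each incoming audio chunk is featurized once and both channels consume the same features. The chunking, stub names, and frame counts below are assumptions for illustration.

```python
# Minimal picture of the streaming setup with one shared HuBERT pass per segment.
import torch

def stream_segments(audio_chunks, hubert, body_channel, face_channel):
    """Yield (body, face) outputs per audio segment, reusing one feature pass."""
    for chunk in audio_chunks:                        # e.g. 6 s of 16 kHz samples per chunk
        feats = hubert(chunk)                         # (frames, 768) shared audio features
        yield body_channel(feats), face_channel(feats)

# Stubs standing in for the real feature extractor and the two infill channels.
hubert_stub = lambda chunk: torch.randn(120, 768)                     # 6 s at 20 FPS
body_stub = lambda f: torch.randint(0, 2048, (f.shape[0],))           # motion tokens per frame
face_stub = lambda f: torch.sigmoid(torch.randn(f.shape[0], 51))      # blendshape coefficients
for body, face in stream_segments([torch.randn(96000)] * 2, hubert_stub, body_stub, face_stub):
    print(body.shape, face.shape)
```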
The authors also release the SuSuInterActs dataset, collected with an optical motion‑capture system, MANUS gloves, and iPhone ARKit. It contains 21,133 clips (36.9 hours) of synchronized Chinese dialogue, audio, full‑body skeleton (63 joints, 6D rotation), and facial blendshape coefficients. Over 14k clips include non‑default body actions and 9.4k include non‑default facial expressions, providing a rich resource for multimodal research.
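From that description, a single clip can be imagined with roughly the schema below; the field names, array shapes per clip, and the sample behavior tag are hypothetical, since the released file format is not detailed in the article.

```python
# Hypothetical schema for one SuSuInterActs clip, inferred from the description.
from dataclasses import dataclass
import numpy as np

@dataclass
class SuSuInterActsClip:
    audio: np.ndarray          # mono waveform of the Chinese dialogue turn
    transcript: str            # behavior-tagged dialogue text
    skeleton: np.ndarray       # (T, 63, 6) 6-D rotations for 63 joints
    blendshapes: np.ndarray    # (T, 51) iPhone-ARKit facial coefficients
    has_body_action: bool      # True for the ~14k clips with non-default body actions
    has_face_action: bool      # True for the ~9.4k clips with non-default expressions

clip = SuSuInterActsClip(
    audio=np.zeros(16000 * 6, dtype=np.float32),
    transcript="<nod> 好的，我明白了。",  # "<nod> OK, I understand." (hypothetical tag)
    skeleton=np.zeros((120, 63, 6), dtype=np.float32),
    blendshapes=np.zeros((120, 51), dtype=np.float32),
    has_body_action=True,
    has_face_action=False,
)
print(clip.skeleton.shape, clip.blendshapes.shape)
```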
To endow the system with broad motion priors, a Motion Foundation Model is pretrained on more than 200k heterogeneous motion sequences (≈ 676 hours). The model uses Qwen‑0.5B as its backbone and extends its vocabulary with 2,048 motion tokens (from an R‑VQVAE with 4‑layer residual quantization) and discrete audio tokens (HuBERT features clustered with k‑means). The pretraining task is text‑to‑motion generation, with all textual descriptions translated into Chinese for language consistency.
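The residual tokenization step that produces those motion codes can be sketched as below; the latent dimensionality, per-layer codebook size, and this minimal encode loop are assumptions for illustration, with the resulting code indices being what gets added to the LLM's vocabulary.

```python
# Sketch of 4-layer residual vector quantization for motion latents (sizes illustrative).
import torch

def residual_quantize(x, codebooks):
    """x: (T, D) motion latents; codebooks: list of 4 tensors, each (K, D).
    Returns per-layer code indices and the quantized reconstruction."""
    residual = x
    indices, quantized = [], torch.zeros_like(x)
    for codebook in codebooks:                         # 4 quantization layers
        dists = torch.cdist(residual, codebook)        # (T, K) distances to codes
        idx = dists.argmin(dim=-1)                     # nearest code per frame
        chosen = codebook[idx]                         # (T, D)
        quantized = quantized + chosen
        residual = residual - chosen                   # next layer encodes what is left
        indices.append(idx)
    return indices, quantized

latents = torch.randn(120, 256)
codebooks = [torch.randn(512, 256) for _ in range(4)]
codes, recon = residual_quantize(latents, codebooks)
print([c.shape for c in codes], recon.shape)
```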
Quantitative evaluation shows SentiAvatar achieving state‑of‑the‑art results on both the newly introduced SuSuInterActs and the public BEATv2 benchmark. On SuSuInterActs, it reaches R@1 = 43.64 % (nearly double the T2M‑GPT baseline) and FID = 8.912. On BEATv2, it records FGD = 4.941 and BC = 8.078, surpassing the previous best Language‑of‑Motion (FGD = 5.301) and SynTalker (BC = 7.971). It also attains the lowest Event Sync Distance (ESD = 0.456 s) among all compared methods.
Ablation studies confirm that removing the LLM planner drops R@1 to 28.06 % and inflates FID to 27.567, while removing the Infill Transformer reduces R@1 to 27.52 % and worsens ESD to 0.503 s, demonstrating the indispensability of both stages. Further audio‑condition ablations reveal that continuous HuBERT features drive frame‑level synchronization, whereas discrete audio tokens from the LLM contribute to overall motion quality and rhythmic planning.
The system’s engineering efficiency is highlighted by its ability to generate a 6‑second motion sequence in under 0.3 seconds, enabling continuous real‑time interaction without waiting for sentence boundaries.
All code, the SuSuInterActs dataset, and the pretrained models have been open‑sourced on GitHub, inviting the research community to advance 3D digital human technology toward the next generation of “digital life” that can perceive context, understand emotion, and express autonomously.
