How Kling-Avatar Generates Long, Emotionally Rich Digital Human Videos with Multimodal LLMs
Kuaishou's Kling-Avatar uses a two-stage generation framework directed by a multimodal large language model to produce minute-long digital-human videos that synchronize lip movements, facial expressions, and body gestures with audio. It achieves high visual quality, identity consistency, and controllable storytelling across diverse scenarios.
Kling-Avatar is a new digital-human feature launched on Kuaishou's Keling platform. Its accompanying technical report details how a model that previously performed only audio-driven lip sync evolved into a system capable of expressive, user-intent-driven performances.
1. Multimodal Understanding and Storyline Generation
The system uses a multimodal large language model (MLLM) as a director to convert three types of inputs (audio, image, and textual prompts) into a coherent storyline. Audio provides the speech content and emotional trajectory, images supply the portrait and scene elements, and text guides actions, camera language, and emotion changes. The director outputs a structured narrative that is injected into a video diffusion model via cross-attention layers, producing a global blueprint video that defines rhythm, style, and key expressive nodes.
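The report does not publish implementation code, but a minimal PyTorch sketch can make the conditioning mechanism concrete: video latent tokens attend to an embedding of the director's storyline through a cross-attention layer with a residual connection. The class name `StorylineCrossAttention` and all dimensions below are illustrative assumptions, not details from the report.

```python
import torch
import torch.nn as nn

class StorylineCrossAttention(nn.Module):
    """Illustrative cross-attention block: noisy video tokens (queries)
    attend to storyline embeddings produced by the MLLM director."""
    def __init__(self, latent_dim: int, text_dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=n_heads,
            kdim=text_dim, vdim=text_dim, batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, video_latents, storyline_emb):
        # video_latents: (B, N_video_tokens, latent_dim)
        # storyline_emb: (B, N_text_tokens, text_dim), encoded narrative
        attended, _ = self.attn(
            query=self.norm(video_latents),
            key=storyline_emb, value=storyline_emb)
        return video_latents + attended  # residual injection

# Toy usage: 16 video tokens conditioned on a 77-token storyline.
layer = StorylineCrossAttention(latent_dim=320, text_dim=768)
z = torch.randn(1, 16, 320)
s = torch.randn(1, 77, 768)
out = layer(z, s)  # (1, 16, 320)
```

In a full diffusion backbone, a block like this would sit inside each transformer or U-Net stage, so the storyline steers rhythm and expression at every denoising step rather than only at the input.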
2. Two‑Stage Cascaded Long‑Video Generation Framework
After the blueprint video is created, the system selects high‑quality key frames based on identity consistency, motion diversity, occlusion avoidance, and facial clarity. Adjacent key frames serve as start‑and‑end conditions for generating sub‑segments in parallel. An audio‑aligned frame‑insertion strategy ensures frame‑level synchronization of lip movements with the acoustic rhythm.
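A small Python sketch can illustrate the two mechanics described above: scoring candidate frames on the four criteria, then pairing adjacent keyframes as (start, end) anchors for parallel sub-segment generation. The weighted-sum scoring and all numeric values are assumptions made for the example, not the report's actual selection rule.

```python
from dataclasses import dataclass

@dataclass
class FrameScore:
    index: int
    identity: float    # similarity to the reference portrait, e.g. face-embedding cosine
    motion: float      # motion diversity contributed by this frame
    visibility: float  # 1.0 when the face is fully unoccluded
    clarity: float     # facial sharpness estimate

def select_keyframes(scores: list[FrameScore], k: int,
                     w=(0.4, 0.2, 0.2, 0.2)) -> list[int]:
    """Pick the k best frames by a weighted quality score; the weights
    are illustrative stand-ins for the report's selection criteria."""
    ranked = sorted(
        scores,
        key=lambda s: w[0] * s.identity + w[1] * s.motion
                      + w[2] * s.visibility + w[3] * s.clarity,
        reverse=True)
    return sorted(f.index for f in ranked[:k])

def segment_jobs(keyframes: list[int]) -> list[tuple[int, int]]:
    """Adjacent keyframes become (start, end) anchor conditions, so each
    sub-segment can be synthesized independently, i.e., in parallel."""
    return list(zip(keyframes, keyframes[1:]))

# Toy run over 120 candidate frames with made-up scores.
candidates = [FrameScore(i, 0.9, (i % 7) / 7, 1.0, 0.8) for i in range(120)]
anchors = select_keyframes(candidates, k=5)
print(segment_jobs(anchors))  # pairs of (start_frame, end_frame)
```

Because every sub-segment is pinned to a start frame and an end frame from the blueprint, parallel generation cannot drift away from the global identity and storyline.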
3. Training and Evaluation Data Pipeline
The team collected thousands of hours of high-quality video covering speeches, dialogues, and singing; trained expert models to assess mouth clarity, camera switching, audio-visual sync, and aesthetic quality; and passed the expert-selected clips through manual review to build the final high-quality training set.
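As a concrete illustration of the automated stage of this pipeline, here is a minimal sketch that gates clips on expert-model scores before manual review. The field names mirror the criteria above; the threshold values and dictionary layout are invented for the example.

```python
def filter_clips(clips, thresholds):
    """Keep only clips that pass every automated expert check before
    they are handed to human reviewers."""
    kept = []
    for clip in clips:
        if (clip["mouth_clarity"] >= thresholds["mouth_clarity"]
                and not clip["has_camera_cut"]
                and clip["av_sync"] >= thresholds["av_sync"]
                and clip["aesthetics"] >= thresholds["aesthetics"]):
            kept.append(clip)
    return kept

candidates = filter_clips(
    clips=[{"mouth_clarity": 0.92, "has_camera_cut": False,
            "av_sync": 0.88, "aesthetics": 0.75}],
    thresholds={"mouth_clarity": 0.8, "av_sync": 0.8, "aesthetics": 0.6})
print(len(candidates))  # 1 clip survives the automated filter
```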
For evaluation, a benchmark of 375 reference‑image/audio/text‑prompt triples was constructed, covering real and AI‑generated faces, multiple ethnicities, diverse languages (Chinese, English, Japanese, Korean), and varied textual instructions for emotion, action, and camera control.
4. Experimental Results
Using a Good/Same/Bad (GSB) user-preference protocol, Kling-Avatar was compared against the state-of-the-art systems OmniHuman-1 and HeyGen. Across overall effect, lip-sync, visual quality, instruction following, and identity consistency, it achieved superior scores, frequently leading in all dimensions simultaneously.
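The report does not spell out how GSB votes are aggregated; one common convention is the (G+S)/(B+S) ratio, where values above 1.0 indicate raters prefer the candidate system. A minimal sketch under that assumption:

```python
def gsb_score(good: int, same: int, bad: int) -> float:
    """Aggregate Good/Same/Bad votes as (G+S)/(B+S); values above 1.0
    favor the candidate system. The report may aggregate differently;
    this is only to make the metric concrete."""
    return (good + same) / (bad + same)

# Hypothetical vote tally for one dimension (e.g., lip-sync):
print(gsb_score(good=180, same=120, bad=75))  # ~1.54 -> preferred
```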
Qualitative examples show precise lip‑sync even for challenging phonemes, natural facial expressions aligned with speech, and accurate execution of textual cues for emotion, motion, and camera movements in complex scenes such as singing and speaking.
5. Conclusion
Kling-Avatar demonstrates a new paradigm for digital‑human generation, moving from simple lip‑sync to full‑performance video synthesis that maintains identity and emotional richness over minute‑long clips. The system is publicly available on the Kuaishou Keling platform for users to experience end‑to‑end AI‑driven digital‑human creation.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
