How MIDAS Achieves Real‑Time Multimodal Digital‑Human Video Generation
The MIDAS framework, introduced by the Kling Team, combines autoregressive video generation with a lightweight diffusion denoising head to deliver real‑time, high‑quality digital‑human synthesis under multimodal control. It achieves sub‑500 ms end‑to‑end latency and 64× compression, and performs robustly across multilingual dialogue, singing, and interactive world‑modeling tasks.
Introduction
Digital human video generation is becoming a core technique for enhancing human‑computer interaction, but existing methods struggle with low latency, multimodal control, and long‑term temporal consistency.
MIDAS Framework
The Kling Team proposes MIDAS (Multimodal Interactive Digital‑human Synthesis), which integrates autoregressive video generation with a lightweight diffusion denoising head to enable real‑time, smooth synthesis under multimodal conditions.
A 64× high‑compression autoencoder reduces each frame to at most 60 tokens, dramatically lowering computational load.
End‑to‑end generation latency stays below 500 ms, supporting real‑time streaming interaction.
Four‑step diffusion denoising balances efficiency and visual quality.
Multimodal Instruction Control
MIDAS accepts audio, pose, and text signals. A unified multimodal condition projector maps different modalities into a shared latent space, forming global instruction tokens that are injected frame‑by‑frame to guide the autoregressive model in generating semantically and spatially consistent actions and expressions.
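The projector idea can be sketched as follows. This is a toy illustration under assumed dimensions and a simple linear mapping, not the authors' exact architecture: each modality embedding is projected into a shared latent space by its own matrix, and the projected vectors are stacked into per‑frame instruction tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
D_SHARED = 64  # shared latent width (illustrative choice)

# One projection matrix per modality (feature widths are assumptions:
# audio 32-d, pose 16-d, text 48-d).
proj = {
    "audio": rng.standard_normal((32, D_SHARED)) * 0.02,
    "pose":  rng.standard_normal((16, D_SHARED)) * 0.02,
    "text":  rng.standard_normal((48, D_SHARED)) * 0.02,
}

def instruction_tokens(frame_conditions: dict) -> np.ndarray:
    """Map each present modality into the shared space and stack the
    results into the instruction tokens injected for this frame."""
    tokens = [feat @ proj[name] for name, feat in frame_conditions.items()]
    return np.stack(tokens)  # shape: (num_modalities, D_SHARED)

# A frame conditioned on audio and text only (pose absent).
frame = {"audio": rng.standard_normal(32), "text": rng.standard_normal(48)}
toks = instruction_tokens(frame)
print(toks.shape)  # (2, 64): one shared-space token per conditioning signal
```

Because every modality lands in the same latent space, downstream layers need no modality‑specific branches; the instruction tokens are simply prepended to each frame's input.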
Causal Latent Prediction + Diffusion Rendering
The framework can plug in any autoregressive backbone (e.g., Qwen2.5‑3B) to predict latent representations frame by frame; a lightweight diffusion head then denoises and renders them at high resolution, ensuring temporal coherence while keeping inference latency low.
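The two‑stage loop can be illustrated with a minimal toy, using stand‑in functions rather than the paper's networks: an autoregressive step predicts the next frame latent from history, and a few‑step "diffusion head" refines a noisy sample toward that prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 8  # toy latent size

def ar_predict(history):
    """Stand-in for the AR backbone (the real model is a transformer
    such as Qwen2.5-3B): here, a decaying average over past latents."""
    if not history:
        return np.zeros(LATENT_DIM)
    weights = np.array([0.5 ** i for i in range(len(history), 0, -1)])
    return np.average(history, axis=0, weights=weights)

def diffusion_head(target, steps=4):
    """Stand-in for the 4-step denoiser: start from Gaussian noise and
    move a fixed fraction toward the AR-predicted latent each step."""
    x = rng.standard_normal(LATENT_DIM)
    for _ in range(steps):
        x = x + 0.5 * (target - x)  # each step halves the residual
    return x

history = []
for t in range(5):                      # generate 5 frame latents causally
    pred = ar_predict(history)          # condition only on past frames
    history.append(diffusion_head(pred, steps=4))

print(len(history), history[-1].shape)
```

The design point this mirrors is the split of responsibilities: the AR model carries long‑range temporal structure cheaply in latent space, while the small denoiser spends only four steps per frame recovering visual detail.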
High‑Compression Autoencoder (DC‑AE)
A 64× compression autoencoder encodes each frame into up to 60 tokens, supporting reconstruction up to 384×640 resolution and employing causal temporal convolutions with RoPE attention to guarantee temporal consistency.
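The 60‑token budget is consistent with the stated numbers if the 64× factor is read as per‑axis spatial downsampling (an assumption that the arithmetic supports): a 384×640 frame then maps to a 6×10 latent grid.

```python
# Per-frame token count from the compression factor (assuming 64x
# downsampling along each spatial axis, which matches the stated figures).
H, W, F = 384, 640, 64
tokens = (H // F) * (W // F)  # 6 x 10 latent grid
print(tokens)  # 60 latent tokens per frame
```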
Large‑Scale Multimodal Dialogue Dataset
For training, the authors built a ~20 000‑hour dialogue dataset covering single‑ and dual‑speaker scenarios, multiple languages, and diverse styles, providing rich context for the model.
Method Overview
Model architecture: Qwen2.5‑3B serves as the autoregressive backbone, while the diffusion head follows a PixArt‑α/MLP structure.
Training strategy: Controlled noise injection with 20 noise buckets and corresponding embeddings mitigates exposure bias during inference.
Inference mechanism: Chunked streaming generation (6 frames per chunk) achieves approximately 480 ms response time.
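The noise‑bucket idea in the training strategy can be sketched as follows. The bucket boundaries, embedding size, and corruption model here are assumptions for illustration; the mechanism shown is discretizing a continuous noise level into 20 buckets, each with a learned embedding that tells the model how corrupted its conditioning context is.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_BUCKETS = 20   # as stated in the training strategy
EMB_DIM = 16       # illustrative embedding width
bucket_embeddings = rng.standard_normal((NUM_BUCKETS, EMB_DIM))

def noise_bucket(sigma, sigma_max=1.0):
    """Map a continuous noise level sigma in [0, sigma_max] to one of
    the 20 discrete bucket indices."""
    idx = int(sigma / sigma_max * NUM_BUCKETS)
    return min(idx, NUM_BUCKETS - 1)

def corrupt_context(latent, sigma):
    """Inject Gaussian noise into a past-frame latent and return it
    together with the embedding of its noise bucket."""
    noisy = latent + sigma * rng.standard_normal(latent.shape)
    return noisy, bucket_embeddings[noise_bucket(sigma)]

latent = rng.standard_normal(EMB_DIM)
noisy, emb = corrupt_context(latent, sigma=0.37)
print(noise_bucket(0.37), emb.shape)  # bucket 7, embedding shape (16,)
```

Training on deliberately noised context teaches the model to tolerate its own imperfect past predictions at inference time, which is the exposure‑bias problem the bucket scheme targets.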
Results
Dual‑speaker dialogue: Real‑time processing of two‑person audio streams produces synchronized lip‑sync, facial expression, and listening posture, enabling natural turn‑taking conversation.
Cross‑language singing synthesis: The system accurately synchronizes lip movements for Chinese, Japanese, and English songs without explicit language tags, generating up to 4‑minute videos without noticeable drift.
Interactive world modeling: Trained on a Minecraft dataset, MIDAS responds to directional control signals while maintaining scene consistency and memory, highlighting its potential as an interactive world model.
Conclusion
MIDAS delivers an end‑to‑end solution for real‑time digital‑human generation, balancing efficiency and quality, and opens avenues for virtual‑human live streaming, metaverse interaction, and multimodal AI agents. Future work will explore higher resolutions, more complex interaction logic, and deployment in production environments.
Paper: https://arxiv.org/pdf/2508.19320
Project page: https://chenmingthu.github.io/milm/
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
