How MIDAS Achieves Real‑Time Multimodal Digital‑Human Video Generation

The MIDAS framework introduced by the Kling Team combines autoregressive video generation with a lightweight diffusion denoising head to deliver real‑time, high‑quality digital‑human synthesis under multimodal control, achieving sub‑500 ms latency, 64× compression, and robust performance across multilingual dialogue, singing, and interactive world modeling tasks.


Introduction

Digital human video generation is becoming a core technique for enhancing human‑computer interaction, but existing methods struggle to deliver low latency, flexible multimodal control, and long‑term temporal consistency at the same time.

[Figure: Digital human generation illustration]

MIDAS Framework

The Kling Team proposes MIDAS (Multimodal Interactive Digital‑human Synthesis), which integrates autoregressive video generation with a lightweight diffusion denoising head to enable real‑time, smooth synthesis under multimodal conditions.

64× high‑compression autoencoder reduces each frame to at most 60 tokens, dramatically lowering computational load.

End‑to‑end generation latency below 500 ms, supporting real‑time streaming interaction.

Four‑step diffusion denoising balances efficiency and visual quality.

Multimodal Instruction Control

MIDAS accepts audio, pose, and text signals. A unified multimodal condition projector maps different modalities into a shared latent space, forming global instruction tokens that are injected frame‑by‑frame to guide the autoregressive model in generating semantically and spatially consistent actions and expressions.
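
A minimal sketch of how such a projector might look, assuming simple per-modality linear projections into a shared width (all dimensions and names below are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class ConditionProjector(nn.Module):
    """Unified multimodal condition projector (illustrative sketch).

    All dimensions and module names are assumptions for illustration,
    not values taken from the MIDAS paper.
    """

    def __init__(self, audio_dim=1024, pose_dim=132, text_dim=768, d_model=2048):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.pose_proj = nn.Linear(pose_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)

    def forward(self, audio_feat, pose_feat, text_feat):
        # Project each modality into the shared latent space, then concatenate
        # along the token axis to form the per-frame global instruction tokens
        # that condition the autoregressive model.
        return torch.cat(
            [
                self.audio_proj(audio_feat),   # (B, T_audio, d_model)
                self.pose_proj(pose_feat),     # (B, T_pose, d_model)
                self.text_proj(text_feat),     # (B, T_text, d_model)
            ],
            dim=1,
        )
```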

[Figure: Multimodal control diagram]

Causal Latent Prediction + Diffusion Rendering

The framework can adopt any autoregressive backbone (e.g., Qwen2.5‑3B) to predict latent representations frame by frame; a lightweight diffusion head then denoises these latents and renders high‑resolution frames, preserving temporal coherence while keeping inference latency low.
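
A rough sketch of one generation step under this design, with placeholder interfaces for the backbone and the diffusion head (the function names, signatures, and conditioning pathway are assumptions for illustration):

```python
import torch

@torch.no_grad()
def generate_next_frame(ar_backbone, diffusion_head, context_tokens, cond_tokens, num_steps=4):
    # 1. The autoregressive backbone (e.g., a Qwen2.5-3B-style transformer)
    #    reads previous frame tokens plus the instruction tokens and predicts
    #    a coarse latent for the next frame.
    hidden = ar_backbone(torch.cat([context_tokens, cond_tokens], dim=1))
    coarse_latent = hidden[:, -1:, :]  # use the final position as the frame condition

    # 2. The lightweight diffusion head refines pure noise into the frame latent
    #    in only a few denoising steps, conditioned on the backbone prediction.
    latent = torch.randn_like(coarse_latent)
    for step in reversed(range(num_steps)):
        t = torch.full((latent.shape[0],), step, device=latent.device)
        latent = diffusion_head(latent, t, coarse_latent)  # one denoising update
    return latent  # decoded to pixels by the autoencoder decoder
```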

High‑Compression Autoencoder (DC‑AE)

A 64× compression autoencoder encodes each frame into at most 60 tokens, supports reconstruction at resolutions up to 384×640, and employs causal temporal convolutions with RoPE attention to maintain temporal consistency.
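
Assuming the 64× factor refers to per-dimension spatial downsampling, as in DC‑AE‑style autoencoders, the 60-token budget follows directly:

```python
# Token budget per frame, assuming the 64x factor is a per-dimension spatial
# downsampling rate (as in DC-AE-style autoencoders); this reading is an
# assumption, but it reproduces the 60-token figure exactly.
height, width, downsample = 384, 640, 64
tokens_per_frame = (height // downsample) * (width // downsample)
print(tokens_per_frame)  # (384/64) * (640/64) = 6 * 10 = 60
```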

Large‑Scale Multimodal Dialogue Dataset

For training, the authors built a ~20 000‑hour dialogue dataset covering single‑ and dual‑speaker scenarios, multiple languages, and diverse styles, providing rich context for the model.

Method Overview

Model architecture: Qwen2.5‑3B serves as the autoregressive backbone, while the diffusion head follows a PixArt‑α/MLP structure.

Training strategy: Controlled noise injection with 20 noise buckets and corresponding embeddings mitigates exposure bias during inference.
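
A minimal sketch of what this controlled noise injection could look like; only the 20-bucket count is from the article, while the noise schedule and embedding mechanics are assumptions:

```python
import torch
import torch.nn as nn

def corrupt_context(clean_latents, bucket_embed: nn.Embedding, num_buckets=20, max_sigma=0.5):
    """Controlled noise injection on context latents (illustrative sketch).

    The 20-bucket count comes from the article; the linear noise schedule,
    max_sigma, and the (B, T, D) latent layout are assumptions.
    """
    batch = clean_latents.shape[0]
    # Draw a noise bucket per sample; each bucket corresponds to a fixed noise level.
    bucket_idx = torch.randint(0, num_buckets, (batch,), device=clean_latents.device)
    sigma = (bucket_idx.float() + 1.0) / num_buckets * max_sigma

    # Corrupt the context so the model trains on the kind of imperfect history
    # it will produce at inference time, mitigating exposure bias.
    noisy = clean_latents + sigma.view(-1, 1, 1) * torch.randn_like(clean_latents)

    # The matching bucket embedding tells the model how corrupted its context is.
    return noisy, bucket_embed(bucket_idx)
```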

Inference mechanism: Chunked streaming generation (6 frames per chunk) achieves approximately 480 ms response time.
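
A schematic of the chunked streaming loop, with placeholder callables rather than the actual MIDAS interfaces:

```python
def stream_video(generate_chunk, decode_chunk, num_chunks, frames_per_chunk=6):
    """Chunked streaming generation (schematic sketch).

    `generate_chunk` and `decode_chunk` are placeholder callables standing in
    for the autoregressive model and the autoencoder decoder; only the 6-frame
    chunk size comes from the article.
    """
    history = []
    for _ in range(num_chunks):
        # Predict the latents for the next 6-frame chunk, conditioned on history.
        chunk_latents = generate_chunk(history, num_frames=frames_per_chunk)
        history.append(chunk_latents)
        # Decode and emit frames as soon as the chunk is ready, so the first
        # frames reach the user at roughly the reported ~480 ms response time.
        yield decode_chunk(chunk_latents)
```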

Results

Dual‑speaker dialogue: Real‑time processing of two‑person audio streams produces synchronized lip movements, facial expressions, and listening postures, enabling natural turn‑taking conversation.

Cross‑language singing synthesis: The system accurately synchronizes lip movements for Chinese, Japanese, and English songs without explicit language tags, generating up to 4‑minute videos without noticeable drift.

Interactive world modeling: Trained on a Minecraft dataset, MIDAS responds to directional control signals, demonstrating scene consistency and memory, highlighting its potential as an interactive world model.

Conclusion

MIDAS delivers an end‑to‑end solution for real‑time digital‑human generation, balancing efficiency and quality, and opens avenues for virtual‑human live streaming, metaverse interaction, and multimodal AI agents. Future work will explore higher resolutions, more complex interaction logic, and deployment in production environments.

Paper: https://arxiv.org/pdf/2508.19320

Project page: https://chenmingthu.github.io/milm/

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: AI, diffusion, digital human, real-time video, multimodal generation, autoregressive
Written by Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.