Yan: Tencent’s Real‑Time High‑Fidelity Interactive Video Generation
Tencent's newly released Yan system advances interactive video generation, delivering high-fidelity, real-time, editable content for games, virtual worlds and AIGC. It is built from three modules: Yan-Sim for AAA-level simulation, Yan-Gen for multimodal generation, and Yan-Edit for multi-granular editing. The release also introduces a large-scale, high-quality interactive video dataset and a set of efficient inference optimizations.
What is Interactive Generative Video (IGV)?
Interactive Generative Video (IGV) refers to AI systems that continuously generate video content in response to user input, breaking the static, one-way nature of traditional video generation and enabling personalized, immersive experiences. Its value shows up in three areas:
Content creation: greatly enhances the diversity and controllability of AIGC content, empowering games, virtual worlds, film, education and other domains.
Agent training: provides unlimited, controllable simulation environments for general‑purpose agents.
Human‑AI interaction: enables more natural, real‑time AI‑human interaction, expanding the boundaries of entertainment and social scenarios.
Existing Interactive Video Generation Approaches
World models: e.g., Genie 3, which builds interactive, navigable environments from text or image inputs but still leaves room for improvement in resolution, interaction richness and session duration.
Game-based IGV: e.g., The Matrix and Matrix-Game, which focus on game scenes; some support real-time interaction, but they lack generalization, high resolution, complex physics simulation and content editability.
Yan System Overview
Yan is an end‑to‑end interactive video generation framework consisting of three core modules: Yan‑Sim (AAA‑level simulation), Yan‑Gen (multimodal generation) and Yan‑Edit (fine‑grained editing). All modules share a unified high‑quality interactive video dataset collected from a 3D game environment (Meta‑Dream‑Star).
Yan‑Sim: Real‑Time High‑Fidelity Simulation
Yan-Sim pairs a highly compressed 3D-VAE with KV-cache shift-window denoising to deliver high-fidelity real-time simulation at 1080p/60 FPS, supporting complex physical interactions and multiple visual styles. The model follows an autoregressive diffusion paradigm with spatial, motion and temporal attention, and uses causal temporal attention to generate frames one at a time. Inference is accelerated by reducing DDIM sampling to 4 steps, parallelizing denoising across a shifting window of frames, reusing the KV-cache, structural pruning and FP8 quantization, which together yield a 1.5-2× speed-up on multi-GPU setups.
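To make the scheduling concrete, here is a minimal sketch of shift-window parallel denoising with a stand-in denoiser. Everything below (the `denoise_step` stub, the latent shape, the warm-up handling) is an illustrative assumption rather than Yan's actual implementation; only the 4-step DDIM budget comes from the article.

```python
import numpy as np

NUM_STEPS = 4               # DDIM steps per frame, per the article
LATENT_SHAPE = (16, 16, 8)  # assumed compressed latent size (hypothetical)

def denoise_step(latent, steps_done, action):
    """Stand-in for one DDIM step of the causal diffusion model. A real
    model would condition on the noise level (steps_done) and read the
    KV-cache of previously generated frames instead of recomputing it."""
    return latent * 0.5 + np.tanh(latent + action) * 0.5

def generate(actions):
    """Keep a sliding window of frames at staggered noise levels, so each
    model call advances every frame in the window by one step. The oldest
    frame exits fully denoised while a fresh-noise frame enters."""
    window = []  # entries: [latent, steps_done]
    for action in actions:
        window.append([np.random.randn(*LATENT_SHAPE), 0])
        for frame in window:                      # one parallel pass
            frame[0] = denoise_step(frame[0], frame[1], action)
            frame[1] += 1
        if window[0][1] == NUM_STEPS:             # oldest frame is clean
            yield window.pop(0)[0]

for i, latent in enumerate(generate([0.0] * 8)):
    print(f"frame {i}: latent {latent.shape} ready for the 3D-VAE decoder")
```

After a warm-up of NUM_STEPS - 1 actions, each new action yields exactly one frame, which is how per-frame latency stays at a single model call despite the 4-step sampler.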
Yan‑Gen: Multimodal Interactive Generation
Yan-Gen accepts text, image and action inputs and generates diverse, controllable interactive content. Global captions anchor static scene attributes (layout, style, lighting), while local captions describe dynamic events, which prevents long-term drift. Multimodal conditions are injected via cross-attention layers in a DiT backbone: text is encoded by umt5-xxl, images by ViT-H-14, and actions by a dedicated action encoder. The system also leverages VLM-generated annotations covering 98M frames.
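As a rough illustration of that conditioning path, the toy DiT block below injects a concatenated stream of condition tokens through a cross-attention layer. The block layout, dimensions and token counts are assumptions made for the sketch; only the encoder names (umt5-xxl, ViT-H-14) come from the article.

```python
import torch
import torch.nn as nn

class ConditionedDiTBlock(nn.Module):
    """Toy transformer block: self-attention over video tokens, then
    cross-attention into the multimodal condition tokens, then an MLP."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, cond):
        # cond: text/image/action tokens from their respective encoders,
        # assumed already projected into the shared model dimension.
        h = self.n1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.cross_attn(self.n2(x), cond, cond)[0]
        return x + self.mlp(self.n3(x))

block = ConditionedDiTBlock()
video_tokens = torch.randn(1, 256, 512)            # noised video latents
cond_tokens = torch.randn(1, 77 + 257 + 16, 512)   # text + image + action
print(block(video_tokens, cond_tokens).shape)      # torch.Size([1, 256, 512])
```

Injecting conditions through cross-attention, rather than concatenating them with the video tokens, keeps the self-attention cost independent of how many modalities are active.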
Yan‑Edit: Multi‑Granular Interactive Editing
Yan-Edit enables real-time structural and style editing through text prompts. Its architecture decouples interaction simulation (Yan-Sim) from visual rendering (Yan-Gen with ControlNet). Structural edits are injected via cross-attention into the depth-map VAE, while style edits enter through the ControlNet weights. Structure or style prompts can be changed at any point during a session while preserving interactive consistency and spatio-temporal coherence, as the sketch below illustrates.
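Here is a minimal sketch of that decoupled loop with stand-in functions; all names, shapes and the scalar "world state" are hypothetical. The simulator advances the interactive state and emits a per-frame depth map, and the renderer turns that structure, plus whatever prompts are currently active, into a frame.

```python
import numpy as np

def simulate_structure(state, action):
    """Stand-in for the interaction simulator (the Yan-Sim role): advances
    the world state from the user action and returns a depth map."""
    state = state + action
    return state, np.full((64, 64), state)

def render_frame(depth, structure_prompt, style_prompt):
    """Stand-in for the renderer (the Yan-Gen + ControlNet role): structure
    edits arrive with the depth map, style edits via the style prompt."""
    return {"depth_mean": depth.mean(), "structure": structure_prompt,
            "style": style_prompt}

state = 0.0
prompts = {"structure": "add a floating platform", "style": "oil painting"}
for t, action in enumerate([1.0, 0.0, -1.0, 1.0]):
    if t == 2:
        prompts["style"] = "pixel art"   # prompts may change mid-session
    state, depth = simulate_structure(state, action)
    print(t, render_frame(depth, prompts["structure"], prompts["style"]))
```

Because the simulator never sees the style prompt, switching styles mid-session cannot disturb the underlying interaction, which is the point of the decoupling.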
Data Collection Pipeline
An automated pipeline of reinforcement-learning agents explores a modern 3D game environment, collecting diverse interaction data. Multi-stage filtering (visual quality, anomaly detection, rule-based checks) removes low-quality samples, and balanced sampling across position, survival and collision states improves generalization, as sketched below. The resulting dataset covers 90+ scenes and 400M frames (roughly 3,700 hours) at 1080p and 30 FPS, with high-precision motion alignment.
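The filtering and balancing stages can be sketched as plain functions. The specific predicates, thresholds and bucket keys below are illustrative assumptions, since the article does not spell them out.

```python
import random

# Hypothetical stage predicates standing in for the visual, anomaly and
# rule-based filters; a real pipeline would use learned quality models.
def visual_ok(clip):   return clip["sharpness"] > 0.5
def anomaly_ok(clip):  return not clip["stuck_agent"]
def rule_ok(clip):     return clip["duration_s"] >= 2.0

def filter_clips(clips):
    """Multi-stage filtering: a clip must pass every stage to survive."""
    return [c for c in clips if visual_ok(c) and anomaly_ok(c) and rule_ok(c)]

def balanced_sample(clips, per_bucket, key=lambda c: (c["zone"], c["collided"])):
    """Cap each (position, collision) bucket so rare interactions are not
    drowned out by common ones."""
    buckets = {}
    for c in clips:
        buckets.setdefault(key(c), []).append(c)
    return [c for b in buckets.values()
            for c in random.sample(b, min(per_bucket, len(b)))]

clips = [{"sharpness": random.random(), "stuck_agent": random.random() < 0.1,
          "duration_s": random.uniform(0, 10), "zone": random.randrange(4),
          "collided": random.random() < 0.3} for _ in range(1000)]
print(len(balanced_sample(filter_clips(clips), per_bucket=20)), "clips kept")
```

Capping each bucket keeps rare but important cases (e.g. collisions) from being swamped by common ones, which is what the balanced-sampling step is for.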
Performance and Results
Yan delivers real-time simulation at 1080p/60 FPS, supports videos of unlimited length, and exhibits strong temporal consistency. Multi-scene generation demonstrates faithful mechanics (inertia, electric-shock effects, bouncing) and style fidelity. Demos include electric-shock and wind simulation, infinite exploration, and cross-domain generation that turns real images into playable segments.
Limitations and Future Work
Long-term spatio-temporal consistency still has room for improvement, with occasional drift in complex interactions.
Lightweight models and edge deployment need further optimization.
Action space and interaction complexity are constrained by the underlying game engine; extending to real‑world scenarios remains an open challenge.
Future directions include scaling data and model size, enhancing efficiency and generalisation, and exploring real‑world extensions.
References
Jiwen Yu, Yiran Qin, Haoxuan Che, et al. A survey of interactive generative video. arXiv preprint arXiv:2504.21853, 2025.
Genie 3: A new frontier for world models. Google DeepMind, 2025.
Ruili Feng, Han Zhang, et al. The Matrix: Infinite‑horizon world generation with real‑time moving control. arXiv preprint arXiv:2412.03568, 2024.
Yifan Zhang, Chunli Peng, et al. Matrix-Game: Interactive world foundation model. arXiv preprint, 2025.
Mingyu Yang, Junyou Li, et al. Playable game generation. arXiv preprint arXiv:2412.00887, 2024.
Jiwen Yu, Yiran Qin, et al. GameFactory: Creating new games with generative interactive videos. arXiv preprint arXiv:2501.08325, 2025.
Zeyinzi Jiang, Zhen Han, et al. VACE: All‑in‑one video creation and editing. arXiv preprint arXiv:2503.07598, 2025.