Xiaomi Auto Unveils Integrated Reconstruction‑Generation World Model Framework Achieving SOTA on Major Benchmarks

Xiaomi Auto introduces a novel world‑model framework that tightly couples 3D reconstruction and generative prediction, delivering state‑of‑the‑art performance on Waymo and nuScenes benchmarks while enabling high‑fidelity, long‑duration video synthesis for autonomous‑driving scenarios.

Xiaomi Tech

May 26, 2026

World Model Overview

World models predict the future evolution of the surrounding environment, enabling autonomous vehicles to anticipate rare high‑risk events such as sudden tire loss, falling rocks, or unexpected pedestrians.

Technical Paths: Reconstruction vs Generation

Reconstruction (WorldRec): Recovers geometrically accurate 3D scenes from multi‑view observations. It provides high fidelity and strong cross‑view consistency but can only reproduce content that has been observed.

Generation (WorldGen): Uses diffusion models to directly predict future frames, allowing imagination of unseen viewpoints and events. It lacks an explicit 3D structure and can drift over long sequences.

Naive pipelines that first reconstruct a scene and then feed the result to a generator run into a fundamental conflict: reconstruction is optimised for deterministic geometric fidelity, whereas generation is optimised to sample from a diverse output distribution. This mismatch in objectives erodes the strengths of both components.

Joint Reconstruction‑Generation Framework

Reconstruction anchors generation: WorldRec maintains a global 4D Gaussian representation that is projected into the vehicle’s view and supplied to the generator as a rendering prior. This locks the geometry (lane layout, building positions, cross‑camera consistency) while the generator fills in lighting, texture, and unseen regions.

Generation expands reconstruction: WorldGen synthesises content for future frames, unseen viewpoints, and occluded areas, removing the “road‑only” limitation.

Joint drift suppression: Deterministic geometry from WorldRec continuously corrects the generative process, limiting the accumulation of exposure‑bias errors and keeping minute‑long videos stable.
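The drift‑suppression argument can be illustrated with a toy one‑dimensional rollout. This is only a sketch of the general idea, not the paper's mechanism: the constant per‑step bias, the `anchor_weight` blend, and the function name `rollout` are all assumptions made for illustration.

```python
def rollout(steps, bias, anchor_weight):
    """Return the absolute error after each step of a toy 1-D rollout.

    The true position advances by 1.0 per step. The hypothetical "generator"
    predicts the next position from its own previous output and adds a
    constant bias. anchor_weight blends that prediction with the true
    position, standing in for a deterministic geometric prior; 0.0 means
    the generator runs without any anchor.
    """
    true_pos = 0.0
    estimate = 0.0
    errors = []
    for _ in range(steps):
        true_pos += 1.0
        predicted = estimate + 1.0 + bias  # biased generator step
        estimate = (1.0 - anchor_weight) * predicted + anchor_weight * true_pos
        errors.append(abs(estimate - true_pos))
    return errors


free = rollout(steps=60, bias=0.1, anchor_weight=0.0)
anchored = rollout(steps=60, bias=0.1, anchor_weight=0.5)
print(f"free-running error after 60 steps: {free[-1]:.2f}")
print(f"anchored error after 60 steps:     {anchored[-1]:.2f}")
```

Without the anchor the error grows linearly with the number of steps. With it, the error settles at `bias * (1 - anchor_weight) / anchor_weight`, which is the intuition behind letting deterministic geometry correct every generated frame.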

WorldRec Design

Replaces dense per‑pixel Gaussian clouds with sparse 3D query points (anchors), reconstructing a 10‑second video in 10 seconds.

Each anchor point corresponds to a unique 3D location, eliminating multi‑view conflicts.

Aggregates features from multiple cameras and timestamps, forming a cross‑view consistent scene representation.

Visibility‑weighted fusion down‑weights occluded or reflective views and up‑weights clean observations.
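A minimal sketch of visibility‑weighted fusion for a single anchor point follows, assuming each observation (one camera at one timestamp) yields a feature vector and a non‑negative visibility score. The function name `fuse_anchor_features`, the plain weighted average, and the zero‑vector fallback are illustrative assumptions; the source does not specify how WorldRec computes or applies its weights.

```python
def fuse_anchor_features(features, visibility, eps=1e-8):
    """Visibility-weighted average of per-view feature vectors for one anchor.

    features:   list of equal-length feature vectors, one per observation
    visibility: list of non-negative scores, near 0 for occluded or
                reflective views and larger for clean observations
    """
    if not features or len(features) != len(visibility):
        raise ValueError("need at least one observation and one score per observation")
    dim = len(features[0])
    total = sum(visibility)
    if total < eps:
        # The anchor is not reliably visible anywhere: fall back to zeros.
        return [0.0] * dim
    fused = [0.0] * dim
    for feat, vis in zip(features, visibility):
        weight = vis / total  # normalise so the weights sum to 1
        for i in range(dim):
            fused[i] += weight * feat[i]
    return fused


# Two clean views agree; the third is a reflection outlier with low visibility.
views = [[1.0, 0.0], [0.8, 0.2], [0.0, 5.0]]
scores = [0.9, 0.9, 0.05]
print(fuse_anchor_features(views, scores))
```

Because the outlier carries a small weight, the fused feature stays close to the two consistent observations.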

WorldGen Design

Two‑stage training:

Full‑bidirectional temporal‑attention pre‑training builds a global spatio‑temporal understanding by exposing the model to all frames simultaneously.

Causal fine‑tuning with teacher forcing and ODE distillation compresses denoising from 50 steps to 4, speeding up inference 12×. Distribution‑matching distillation is also applied to suppress long‑sequence drift.

Generates a single‑view frame in 0.19 s on an H20 GPU and supports up to 81 fps for videos up to one minute long. The per‑frame latency is roughly 5.6× lower than the 1.06 s/frame of the autoregressive baseline Epona.
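The difference between the two training stages can be sketched as a frame‑level attention mask. This is a simplified illustration: real video diffusion models apply masks over tokens rather than whole frames, and the helper name `frame_attention_mask` is hypothetical.

```python
def frame_attention_mask(num_frames, causal):
    """Build a frame-level attention mask.

    mask[i][j] is True when frame i is allowed to attend to frame j.
    With causal=False every frame sees every other frame (bidirectional
    pre-training); with causal=True a frame sees only itself and earlier
    frames (causal fine-tuning).
    """
    return [
        [(j <= i) if causal else True for j in range(num_frames)]
        for i in range(num_frames)
    ]


for name, causal in (("bidirectional", False), ("causal", True)):
    print(name)
    for row in frame_attention_mask(4, causal):
        print("  " + " ".join("1" if allowed else "0" for allowed in row))
```

The causal mask is lower‑triangular, which is what allows frames to be generated one after another at inference time without access to the future.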

Benchmark Results

On the Waymo dataset, WorldRec attains a PSNR of 28.48 dB, surpassing the previous SOTA, DGGT, by roughly 1 dB. It also retains leading performance in zero‑shot tests on nuScenes, demonstrating strong generalisation to unseen scenes.
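To put a gain of about 1 dB in context, the standard PSNR definition can be inverted to show how much the mean squared error shrinks. The sketch below assumes pixel values normalised to [0, 1]; the helper names are illustrative and the figures are generic arithmetic, not results from the paper.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE is multiplied for a PSNR gain of gain_db."""
    return 10.0 ** (-gain_db / 10.0)


print(f"PSNR at MSE 0.01: {psnr(0.01):.2f} dB")
print(f"MSE factor for a 1 dB gain: {mse_ratio_for_gain(1.0):.3f}")
```

A 1 dB improvement corresponds to an MSE factor of about 0.794, that is, roughly 21% less squared reconstruction error.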

WorldGen achieves an FVD of 64.97 and an FID of 7.04 on nuScenes. Its FVD is better than that of all comparable bidirectional and autoregressive models, and its FID remains competitive.
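FID and FVD are both Fréchet distances between Gaussian fits to feature embeddings of real and generated samples (image features for FID, video features for FVD), and lower is better. The sketch below computes that distance for the simplified case of diagonal covariances. This is an assumption made to keep the example dependency‑free, since the actual metrics use full covariance matrices and a matrix square root.

```python
import math


def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Frechet distance between two diagonal-covariance Gaussians.

    For diagonal covariances the general formula
        |mu1 - mu2|^2 + Tr(S1 + S2 - 2 * sqrt(S1 * S2))
    reduces to a sum over dimensions of squared differences of the means
    plus squared differences of the standard deviations.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var1, var2))
    return mean_term + cov_term


real_mu, real_var = [0.0, 0.0], [1.0, 1.0]
fake_mu, fake_var = [0.3, -0.1], [1.2, 0.9]
print(frechet_distance_diag(real_mu, real_var, fake_mu, fake_var))
```

Identical distributions give a distance of zero, and the value grows as the generated features shift or spread differently from the real ones.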

Deployments

Synthetic data generation: Delivered more than 100,000 high‑quality clips for perception model training, improving vehicle recognition in hazardous scenarios.

Simulation testing: Built a closed‑loop simulation environment that reproduces real accidents for targeted optimisation, improving test efficiency and coverage.

Driving‑school videos: Dynamically generated first‑person teaching videos for complex road conditions, now available across all Xiaomi vehicle models.

References

Technical homepage: https://JointWM.github.io/

Paper (arXiv): https://arxiv.org/pdf/2605.18137

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.