Xiaomi OneVL: A Breakthrough Open‑Source Model for Fast, Accurate Autonomous Driving

Xiaomi unveils OneVL, an open‑source stepwise latent language‑vision reasoning framework that unifies vision‑language‑action (VLA) modeling, world modeling and latent inference. It reports higher accuracy than explicit chain‑of‑thought (CoT), inference speed comparable to answer‑only models, and SOTA results on several autonomous‑driving benchmarks.

Xiaomi Tech

May 13, 2026

Today Xiaomi officially released OneVL, a stepwise latent language‑vision reasoning framework for autonomous driving that extends the XLA (eXplainable Latent Architecture) line.

Problem Statement

Once large models can reason, the key challenge is achieving both speed and accuracy. Explicit chain‑of‑thought (CoT) improves trajectory quality but generates its reasoning token by token, and the resulting latency is hard to accommodate in real‑time driving. Answer‑only inference avoids that latency but gives up the causal reasoning behind the decision.

Latent CoT Background

Previous industry work introduced latent CoT, which replaces sequential token generation with high‑dimensional latent representations (a kind of "machine language") to retain reasoning quality while sharply reducing inference latency.

OneVL’s Three Key Technologies

Dual‑modal latent tokens: visual latent tokens encode the scene’s physical causal structure, while language latent tokens encode driving intent, allowing the model to “think clearly” before speaking.

Dual decoders (training‑only): a visual decoder predicts future frames 0.5 s/1 s ahead, granting world‑model predictive ability; a language decoder reconstructs readable CoT text for interpretability. Both decoders are removed during inference, incurring zero extra cost.

Pre‑fill inference: all latent tokens are injected into the context and the reasoning completes in a single parallel pass, making latency almost identical to answer‑only models and up to 2.3× faster than explicit CoT.
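To make the pre‑fill argument concrete, here is a minimal toy cost model in Python. It is a sketch under stated assumptions, not OneVL's implementation: the class `DecodeCost`, the helper functions and all the numbers are hypothetical. It assumes latency is one parallel pre‑fill pass plus a fixed cost per sequentially generated token, and it ignores the small extra pre‑fill cost of the latent tokens.

```python
from dataclasses import dataclass


@dataclass
class DecodeCost:
    """Toy latency model (hypothetical): one prefill pass plus sequential steps."""
    prefill_s: float    # assumed cost of one parallel pass over the context
    per_token_s: float  # assumed cost of each sequentially generated token

    def latency(self, sequential_tokens: int) -> float:
        return self.prefill_s + sequential_tokens * self.per_token_s


def explicit_cot_latency(cost: DecodeCost, cot_tokens: int, answer_tokens: int) -> float:
    # Reasoning text and answer are both generated token by token.
    return cost.latency(cot_tokens + answer_tokens)


def answer_only_latency(cost: DecodeCost, answer_tokens: int) -> float:
    # Only the answer is generated sequentially.
    return cost.latency(answer_tokens)


def latent_prefill_latency(cost: DecodeCost, answer_tokens: int) -> float:
    # Latent tokens sit in the context and are consumed by the prefill pass,
    # so they add no sequential decode steps in this simplified model.
    return cost.latency(answer_tokens)


if __name__ == "__main__":
    # Illustrative numbers only; they are not measurements from the report.
    cost = DecodeCost(prefill_s=0.10, per_token_s=0.02)
    explicit = explicit_cot_latency(cost, cot_tokens=60, answer_tokens=20)
    latent = latent_prefill_latency(cost, answer_tokens=20)
    print(f"explicit CoT: {explicit:.2f} s, latent prefill: {latent:.2f} s")
    print(f"speedup: {explicit / latent:.1f}x")
```

In this model the latent variant costs exactly as much as answer‑only inference, because the only sequential work left is decoding the answer. The speedup over explicit CoT depends on how long the CoT text is relative to the answer, so the made‑up numbers above should not be expected to reproduce the reported 2.3× figure.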

Unified Architecture

OneVL unifies the previously separate VLA (vision‑language‑action) and world‑model pipelines into a single framework, achieving both richer multimodal cognition and strong scene understanding.

Benchmark Performance

Achieves SOTA on ROADWork, Impromptu and Alpamayo‑R1 benchmarks.

On NAVSIM, obtains a PDM‑score of 88.84, surpassing explicit CoT’s 88.29 – the first latent‑inference method to beat explicit autoregressive CoT on all benchmarks.

With an MLP regression head, inference latency drops to 0.24 s (4.16 Hz), only 5.4 % of VLA autoregressive latency, providing a viable path for real‑time vehicle deployment.

Ablation experiments show that compressing dynamic physical information yields significant performance gains.
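The headline numbers above can be cross‑checked with a few lines of arithmetic. The sketch below uses only figures quoted in this article; the function names are hypothetical, and the implied autoregressive latency is a derived estimate, not a value stated in the source.

```python
def latency_to_hz(latency_s: float) -> float:
    """Convert a per-inference latency in seconds to a rate in Hz."""
    return 1.0 / latency_s


def implied_baseline_latency(latency_s: float, fraction_of_baseline: float) -> float:
    """If latency_s is a given fraction of a baseline, recover the baseline."""
    return latency_s / fraction_of_baseline


# Figures quoted in the article.
mlp_head_latency_s = 0.24      # latency with the MLP regression head
fraction_of_vla = 0.054        # "only 5.4 %" of VLA autoregressive latency
pdm_onevl, pdm_explicit_cot = 88.84, 88.29

rate_hz = latency_to_hz(mlp_head_latency_s)
baseline_s = implied_baseline_latency(mlp_head_latency_s, fraction_of_vla)
pdm_margin = round(pdm_onevl - pdm_explicit_cot, 2)

print(f"rate: {rate_hz:.2f} Hz")                      # prints "rate: 4.17 Hz"
print(f"implied VLA latency: {baseline_s:.2f} s")     # prints "implied VLA latency: 4.44 s"
print(f"PDM-score margin: {pdm_margin}")              # prints "PDM-score margin: 0.55"
```

A latency of 0.24 s works out to roughly 4.17 Hz, which agrees with the quoted 4.16 Hz up to rounding of the latency figure. Taking the 5.4 % ratio at face value, the autoregressive VLA baseline would need about 4.4 s per inference, which illustrates why token‑by‑token decoding is hard to deploy in a vehicle.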

Interpretability

OneVL offers dual‑modal interpretability: textual explanations describe “why” a maneuver is chosen, while predicted future frames visualize “what will happen next,” concretely realizing XLA’s goal of understanding and reasoning.

Open‑Source Release

The model weights, training code and inference code are fully open‑source (https://github.com/xiaomi-research/onevl). The technical report is available on arXiv (https://arxiv.org/abs/2604.18486) and the project homepage (https://Xiaomi-Embodied-Intelligence.github.io/OneVL).
