Xiaomi OneVL: A Breakthrough Open‑Source Model for Fast, Accurate Autonomous Driving
Xiaomi unveils OneVL, an open‑source stepwise latent language‑vision reasoning framework that unifies VLA, world‑model and latent inference, delivering higher accuracy than explicit CoT and inference speed comparable to answer‑only models, with SOTA benchmark results across multiple autonomous‑driving tests.
Today Xiaomi officially released OneVL, a stepwise latent language‑vision reasoning framework for autonomous driving that extends the XLA (eXplainable Latent Architecture) line.
Problem Statement
Once large models can reason, the key challenge is getting both speed and accuracy. Explicit chain‑of‑thought (CoT) improves trajectory quality but generates its reasoning token by token, and that latency is hard to accommodate in real‑time driving; answer‑only inference avoids the latency but gives up the causal reasoning that sound driving decisions depend on.
Latent CoT Background
Earlier industry work introduced latent CoT, which replaces sequential generation of text tokens with high‑dimensional latent representations (a "machine language" rather than human‑readable text), aiming to retain reasoning quality while sharply cutting inference latency.
OneVL’s Three Key Technologies
Dual‑modal latent tokens: visual latent tokens encode the scene’s physical causal structure, while language latent tokens encode driving intent, allowing the model to “think clearly” before speaking.
Dual decoders (training‑only): a visual decoder predicts future frames 0.5 s and 1 s ahead, giving the model world‑model predictive ability; a language decoder reconstructs readable CoT text for interpretability. Both decoders are removed at inference time, so they add no inference cost.
Pre‑fill inference: all latent tokens are injected into the context and the reasoning completes in a single parallel pass, making latency almost identical to answer‑only models and up to 2.3× faster than explicit CoT.
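To make the latency argument concrete, here is a toy cost model in Python that counts sequential steps rather than measuring real latency. The function names and token counts are hypothetical and are not taken from the OneVL report; the model also simplifies by charging one step for the prefill pass and one step per generated token.

```python
# Toy cost model: counts sequential steps, not wall-clock time.
# Function names and token counts are illustrative assumptions,
# not figures from the OneVL report.

def explicit_cot_steps(cot_tokens: int, answer_tokens: int) -> int:
    """One prefill pass, then one decode step per reasoning and answer token."""
    return 1 + cot_tokens + answer_tokens


def answer_only_steps(answer_tokens: int) -> int:
    """One prefill pass, then one decode step per answer token."""
    return 1 + answer_tokens


def latent_prefill_steps(latent_tokens: int, answer_tokens: int) -> int:
    """Latent tokens are placed in the context, so the single prefill pass
    covers them in parallel; they lengthen that pass but add no decode steps."""
    return 1 + answer_tokens


# Hypothetical sizes: 100 reasoning tokens, 16 latent tokens, 20 answer tokens.
cot = explicit_cot_steps(cot_tokens=100, answer_tokens=20)
latent = latent_prefill_steps(latent_tokens=16, answer_tokens=20)
plain = answer_only_steps(answer_tokens=20)

print(cot, latent, plain)  # 121 21 21
```

In this simplified view the latent variant needs the same number of sequential steps as answer‑only inference. The slightly longer prefill pass is one plausible reason the latency is described as almost, rather than exactly, identical. The reported 2.3× speedup is a measured figure and is not reproduced by this step count.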
Unified Architecture
OneVL unifies the previously separate VLA (vision‑language‑action) and world‑model pipelines into a single framework, achieving both richer multimodal cognition and strong scene understanding.
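The idea of one shared model with training‑only auxiliary decoders can be sketched in plain Python. This is a structural illustration under stated assumptions, not OneVL's implementation: `ToyDriver`, `for_inference`, and the placeholder arithmetic are all hypothetical, and a real system would use neural networks whose auxiliary losses shape the shared latents during training.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class ToyDriver:
    """Hypothetical stand-in: shared backbone, action head, optional decoders."""
    aux_decoders: Dict[str, Callable[[Vector], str]] = field(default_factory=dict)

    def backbone(self, observation: Vector) -> Vector:
        # Placeholder for the shared backbone that produces latent features.
        return [2.0 * x for x in observation]

    def action_head(self, latent: Vector) -> float:
        # Placeholder for trajectory prediction: the mean of the latent.
        return sum(latent) / len(latent)

    def forward(self, observation: Vector) -> dict:
        latent = self.backbone(observation)
        out = {"action": self.action_head(latent)}
        # Auxiliary decoders run only while they are still attached (training).
        for name, decode in self.aux_decoders.items():
            out[name] = decode(latent)
        return out

    def for_inference(self) -> "ToyDriver":
        # Dropping the decoders leaves the action path unchanged.
        return ToyDriver(aux_decoders={})


train_model = ToyDriver(aux_decoders={
    "future_frame": lambda z: f"frame predicted from {len(z)} latents",
    "cot_text": lambda z: "reconstructed reasoning text",
})
obs = [0.1, 0.2, 0.3]
train_out = train_model.forward(obs)
infer_out = train_model.for_inference().forward(obs)
```

Here `train_out` carries the action plus both auxiliary outputs, while `infer_out` carries only the action, with the same value. That mirrors the claim that the decoders can be removed without changing or slowing the inference path.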
Benchmark Performance
Achieves SOTA on ROADWork, Impromptu and Alpamayo‑R1 benchmarks.
On NAVSIM, OneVL obtains a PDM‑score of 88.84, surpassing explicit CoT’s 88.29. According to the release, this makes it the first latent‑inference method to beat explicit autoregressive CoT on all benchmarks.
With an MLP regression head, inference latency drops to 0.24 s (4.16 Hz), only 5.4 % of VLA autoregressive latency, providing a viable path for real‑time vehicle deployment.
Ablation experiments show that compressing dynamic physical information into the latent tokens yields significant performance gains.
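The latency and score figures above can be cross‑checked with a few lines of arithmetic. The inputs are the numbers reported in this article; the derived values, in particular the implied autoregressive baseline latency, are not stated in the source and are only what those numbers imply.

```python
# Figures reported in the article.
latency_s = 0.24        # inference latency with the MLP regression head
latency_ratio = 0.054   # stated fraction of autoregressive VLA latency
pdms_onevl = 88.84      # NAVSIM PDM-score, OneVL
pdms_cot = 88.29        # NAVSIM PDM-score, explicit CoT

# Derived values (implied by the figures above, not stated in the article).
rate_hz = 1.0 / latency_s                        # about 4.17 Hz; the article gives 4.16
implied_baseline_s = latency_s / latency_ratio   # about 4.44 s
pdms_margin = pdms_onevl - pdms_cot              # about 0.55 points

print(f"{rate_hz:.2f} Hz, {implied_baseline_s:.2f} s, {pdms_margin:.2f} points")
```

The small gap between 4.17 Hz and the quoted 4.16 Hz is consistent with the 0.24 s figure itself being rounded.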
Interpretability
OneVL offers dual‑modal interpretability: textual explanations describe “why” a maneuver is chosen, while predicted future frames visualize “what will happen next,” concretely realizing XLA’s goal of understanding and reasoning.
Open‑Source Release
The model weights, training code and inference code are fully open‑source (https://github.com/xiaomi-research/onevl). The technical report is available on arXiv (https://arxiv.org/abs/2604.18486) and on the project homepage (https://Xiaomi-Embodied-Intelligence.github.io/OneVL).
