How AutoMoT Leverages Large‑Model Understanding for End‑to‑End Driving Decisions and Trajectory Planning

AutoMoT introduces a unified Vision‑Language‑Action model that combines a 4B Qwen3‑VL understanding expert with a 1.6B action expert via layer‑wise shared attention and asynchronous inference, achieving state‑of‑the‑art results on Bench2Drive and nuScenes while preserving general VLM capabilities.

Machine Heart

Introduction

Large vision-language models (VLMs) excel at scene understanding, but autonomous driving also requires immediate action decisions such as braking, lane changes, and trajectory planning. The key challenge is turning high-level understanding into concrete driving commands.

Model Architecture

AutoMoT consists of two experts:

Understanding Expert (UE): a 4B-parameter Qwen3-VL backbone that ingests multi-frame RGB images and navigation prompts to produce reasoning tokens.

Action Expert (AE): a 1.6B-parameter module that receives the current RGB image, LiDAR BEV, decision queries, target points, and planning queries to generate decision and planning tokens.

The two experts are linked by Layer-wise Shared Attention, which lets the AE read the UE's intermediate representations at each transformer layer. A cross-task causal mask governs the information flow: Decision tokens can attend to Understanding tokens, and Planning tokens can attend to both Understanding and Decision tokens, while attention within each task remains bidirectional.
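The masking rule described above can be sketched in a few lines of plain Python. This is only an illustration, not AutoMoT's implementation: the task ids, the function name `cross_task_mask`, and the token counts are hypothetical, and the sketch assumes that Understanding tokens never attend to Decision or Planning tokens.

```python
# Hypothetical task ids; their order encodes the information flow
# Understanding -> Decision -> Planning described in the text.
UNDERSTANDING, DECISION, PLANNING = 0, 1, 2


def cross_task_mask(task_ids):
    """Return mask[q][k] == True when query token q may attend to key token k.

    A query sees every key of its own task (bidirectional within a task)
    and every key of an earlier task, but no key of a later task.
    """
    return [[k_task <= q_task for k_task in task_ids] for q_task in task_ids]


# Toy sequence: 3 understanding tokens, 2 decision tokens, 2 planning tokens.
task_ids = [UNDERSTANDING] * 3 + [DECISION] * 2 + [PLANNING] * 2
mask = cross_task_mask(task_ids)

for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Each printed row is one query token. Rows for Understanding tokens have ones only in the first three columns, Decision rows extend to the first five, and Planning rows are all ones.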

Asynchronous Inference

To meet real-time constraints, the UE updates its high-level understanding at a lower frequency and caches its key-value (KV) states. The AE runs at a higher frequency, reusing the cached UE states for multiple action steps without recomputing the full VLM forward pass. This decouples costly scene reasoning from fast trajectory refreshes.
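A minimal sketch of this scheduling idea follows. It simulates the two update rates in a single loop rather than running the experts concurrently, and everything in it is an assumption made for illustration: the function names, the `CachedUnderstanding` container, and the refresh period `ue_period` (the article does not state the actual frequencies).

```python
from dataclasses import dataclass


@dataclass
class CachedUnderstanding:
    step: int   # control step at which the cache was computed
    kv: tuple   # stand-in for the UE's cached key-value states


def run_understanding_expert(step, frames):
    # Placeholder for the expensive VLM forward pass.
    return CachedUnderstanding(step=step, kv=tuple(frames))


def run_action_expert(observation, cache):
    # Placeholder for the fast action pass that reads the cached UE states.
    return {"obs": observation, "ue_step": cache.step}


def drive(observations, ue_period=5):
    """Refresh the UE cache every `ue_period` steps; run the AE every step."""
    cache = None
    ue_calls = 0
    outputs = []
    for step, obs in enumerate(observations):
        if cache is None or step % ue_period == 0:
            cache = run_understanding_expert(step, [obs])
            ue_calls += 1
        outputs.append(run_action_expert(obs, cache))
    return outputs, ue_calls


outputs, ue_calls = drive(list(range(12)), ue_period=5)
print(ue_calls)                            # 3
print([o["ue_step"] for o in outputs])     # [0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 10, 10]
```

Over 12 control steps the expensive pass runs 3 times while the action pass runs 12 times, which is the cost saving the cached states are meant to provide.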

Experimental Validation

Closed-loop (Bench2Drive): AutoMoT achieves a Driving Score (DS) of 87.34 and a success rate (SR) of 70.00%, surpassing SimLingo (85.07 DS / 67.27% SR). Adding the Action Refiner (AutoMoT+) raises performance to 89.42 DS / 74.09% SR, establishing a new SOTA.

Open-loop (nuScenes): Average L2 error is 0.32 m and average collision rate is 0.07%, with per-second L2 values of 0.14 / 0.29 / 0.54 m and collision rates of 0.01% / 0.06% / 0.15%.
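The reported averages are consistent with a plain mean over the three per-second values, as the short check below shows. It assumes the three values correspond to successive planning horizons and that the average is an unweighted mean; the variable names are illustrative.

```python
# Per-second values quoted in the text.
l2_m = [0.14, 0.29, 0.54]            # L2 error in metres
collision_pct = [0.01, 0.06, 0.15]   # collision rate in percent


def mean(values):
    return sum(values) / len(values)


avg_l2 = round(mean(l2_m), 2)
avg_collision = round(mean(collision_pct), 2)
print(avg_l2, avg_collision)  # 0.32 0.07
```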

On understanding benchmarks, AutoMoT scores 67.00 on LingoQA (close to ReCogDrive's 67.20), 0.89 on OmniDrive (above ReCogDrive's 0.82), and 6.07 on CODA-LM, with strong results on the general-purpose TallyQA (81.40) and InfographicVQA (89.30) benchmarks, demonstrating retained general reasoning ability.

Fine‑Tuning Analysis

Fine‑tuning the backbone yields marginal gains on pure understanding tasks (LingoQA improves from 67.00 to 67.20) but large improvements on planning‑related tasks (OmniDrive rises from 18.20 to 67.80). However, extensive fine‑tuning degrades performance on generic VQA benchmarks: TallyQA drops to 52.40, InfographicVQA to 50.20, and VizWiz to 50.20, indicating a trade‑off between domain specialization and general reasoning.

Conclusion

AutoMoT’s core contribution is reorganizing the relationship between “understanding” and “action” in autonomous driving. By preserving the UE’s pre‑trained VLM capabilities and delegating decision‑making to a dedicated AE, the system achieves SOTA driving performance while maintaining broad VLM competence, offering a scalable path for real‑world VLA deployments.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.