How FSDrive Uses Spatio‑Temporal CoT to Revolutionize Autonomous Driving
FSDrive introduces a spatio-temporal chain-of-thought approach that lets visual language models generate future driving scenes as images, improving trajectory planning accuracy and safety by closing the cross-modal gap and enforcing physical constraints.
Abstract
FSDrive is an end-to-end autonomous driving large model built on a visual language model. Instead of reasoning purely in text, it generates a unified image of the future scene, including lane markings and 3D detection boxes, as a spatio-temporal chain of thought, and then plans the trajectory from both the current observation and this visualized future, avoiding the information loss introduced by converting visual information into text.
Introduction
Current autonomous driving large models use discrete textual chain‑of‑thought (CoT) as an intermediate reasoning step, which abstracts visual information and can blur spatio‑temporal relationships, leading to loss of fine‑grained details. Inspired by human drivers visualizing future scenes, the authors propose a spatio‑temporal CoT method that allows visual language models (VLMs) to think visually and plan trajectories based on both current observations and predicted future worlds.
Key Contributions
FSDrive unifies future scenes and perception results into a single image, guiding the model's attention and enforcing physical constraints.
It eliminates the semantic gap caused by cross‑modal conversion (visual to text).
It establishes an end‑to‑end visual reasoning pipeline enabling VLMs to perform causal reasoning directly from visual inputs.
Unified Pre‑training Paradigm for Visual Generation and Understanding
The method consists of two stages: a pre-training stage that endows the VLM with visual generation capability, and a fine-tuning stage that enables visual thinking. Existing autoregressive image-generation models rely on VQ-VAE tokens that carry little semantic information. The proposed approach instead incorporates the VQ-VAE codebook into the model's vocabulary, extending it into a multimodal space that covers both visual and textual tokens while preserving the original MLLM architecture.
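To make the vocabulary extension concrete, here is a minimal PyTorch-style sketch. The vocabulary, codebook, and hidden sizes are illustrative assumptions rather than values from the paper, and this is not FSDrive's actual implementation; it only shows how VQ-VAE codebook indices can be appended to a text vocabulary so one autoregressive head predicts both token types.

```python
import torch.nn as nn

# All sizes below are illustrative assumptions, not values from the paper.
TEXT_VOCAB_SIZE = 32_000   # assumed size of the original text vocabulary
VQ_CODEBOOK_SIZE = 8_192   # assumed number of VQ-VAE codebook entries
HIDDEN_DIM = 4_096         # assumed MLLM hidden size

UNIFIED_VOCAB_SIZE = TEXT_VOCAB_SIZE + VQ_CODEBOOK_SIZE

# Extend the input embedding and output head; rows for the original text
# tokens are copied over so the pretrained language ability is preserved.
old_embed = nn.Embedding(TEXT_VOCAB_SIZE, HIDDEN_DIM)
new_embed = nn.Embedding(UNIFIED_VOCAB_SIZE, HIDDEN_DIM)
new_embed.weight.data[:TEXT_VOCAB_SIZE] = old_embed.weight.data

old_head = nn.Linear(HIDDEN_DIM, TEXT_VOCAB_SIZE, bias=False)
new_head = nn.Linear(HIDDEN_DIM, UNIFIED_VOCAB_SIZE, bias=False)
new_head.weight.data[:TEXT_VOCAB_SIZE] = old_head.weight.data

def visual_token_id(codebook_index: int) -> int:
    """Map a VQ-VAE codebook index to its id in the unified vocabulary."""
    return TEXT_VOCAB_SIZE + codebook_index
```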
Visual Understanding Pre‑training: Uses VQA tasks to retain semantic understanding.
Visual Generation Pre‑training: Predicts future visual tokens autoregressively, leveraging abundant video data without extra annotations.
Progressive Image Generation: Generates lane markings first as a structural skeleton, then predicts 3D detection boxes, ensuring compliance with static and dynamic physical constraints before rendering full future frames (a token-ordering sketch follows this list).
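The progressive ordering can be read as composing one autoregressive target sequence. The sketch below is purely illustrative: the helper function, the token lists, and the separator id are hypothetical placeholders, not the paper's actual data format.

```python
from typing import List

def build_progressive_target(
    lane_tokens: List[int],          # step 1: static road skeleton (lane markings)
    box_tokens: List[int],           # step 2: dynamic agents as 3D detection boxes
    future_frame_tokens: List[int],  # step 3: the full future frame
    sep_token_id: int,               # hypothetical separator token
) -> List[int]:
    """Compose the training target in the progressive order described above."""
    target: List[int] = []
    target += lane_tokens + [sep_token_id]
    target += box_tokens + [sep_token_id]
    target += future_frame_tokens
    return target
```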
Visualization‑Based Spatio‑Temporal CoT
The model acts as a world model, generating a unified image that predicts future lane markings and 3D boxes, providing coarse visual cues that guide attention to drivable areas and key objects while enforcing physical constraints. The spatio‑temporal CoT serves as an intermediate reasoning step, allowing the VLM to function as an inverse dynamics model that plans trajectories based on current observations and visualized future predictions.
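The two roles described above (world model, then inverse dynamics model) can be read as a two-step generation loop. The sketch below is conceptual: `vlm.generate`, the prompts, and the waypoint format are hypothetical stand-ins for illustration, not FSDrive's real interface.

```python
# Conceptual two-step reasoning flow; every name here is a hypothetical
# stand-in for illustration, not FSDrive's actual API.
def plan_with_spatiotemporal_cot(vlm, camera_images, instruction):
    # Step 1 (world model): generate the unified CoT image that visualizes
    # future lane markings and 3D boxes for the coming scene.
    future_scene = vlm.generate(
        images=camera_images,
        prompt="Predict the future lane markings and 3D boxes as an image.",
    )

    # Step 2 (inverse dynamics model): plan waypoints conditioned on both the
    # current observation and the visualized future.
    trajectory = vlm.generate(
        images=camera_images + [future_scene],
        prompt=f"Plan the ego trajectory. Driving instruction: {instruction}",
    )
    return trajectory  # e.g. a list of future (x, y) waypoints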
Experiment
Evaluations on the nuScenes dataset show that FSDrive achieves state‑of‑the‑art performance in average planning error and collision rate, demonstrating that visual thinking of future scenes significantly reduces risk. Ablation studies confirm the effectiveness of the spatio‑temporal CoT; removing it leads to substantial trajectory deviation and higher collision risk.
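For reference, the open-loop planning metric typically reported on nuScenes is the L2 distance between predicted and ground-truth waypoints at 1s/2s/3s horizons. The sketch below assumes waypoints sampled at 2 Hz and is a generic illustration of that metric, not FSDrive's exact evaluation code; collision rate additionally requires occupancy maps of other agents and is omitted.

```python
import numpy as np

def average_l2_error(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Average L2 planning error at 1s/2s/3s horizons.

    pred, gt: (T, 2) arrays of BEV waypoints at an assumed 2 Hz,
    so the first 2/4/6 points cover 1s/2s/3s.
    """
    errors = {}
    for seconds, steps in ((1, 2), (2, 4), (3, 6)):
        diff = pred[:steps] - gt[:steps]
        errors[f"{seconds}s"] = float(np.linalg.norm(diff, axis=-1).mean())
    errors["avg"] = float(np.mean([errors["1s"], errors["2s"], errors["3s"]]))
    return errors
```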
Conclusion
FSDrive presents a spatio‑temporal CoT‑based autonomous driving framework that enables VLMs to think visually. By unifying future scene generation with perception, it removes cross‑modal semantic gaps and establishes an end‑to‑end visual reasoning pipeline. The unified pre‑training paradigm and progressive generation method enhance visual generation quality, and extensive experiments validate the approach’s effectiveness, advancing autonomous driving toward visual reasoning and spatial intelligence.
Paper: https://arxiv.org/abs/2505.17685
Project page: https://miv-xjtu.github.io/FSDrive.github.io/
Code: https://github.com/MIV-XJTU/FSDrive
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.