How FSDrive Uses Spatio‑Temporal CoT to Revolutionize Autonomous Driving
FSDrive introduces a spatio-temporal chain-of-thought approach that lets visual language models generate future driving scenes as images, improving trajectory planning accuracy and safety by closing the cross-modal gap and enforcing physical constraints.
Abstract
FSDrive is an end-to-end autonomous driving large model built on a visual language model. Instead of reasoning purely in text, it generates a unified image of the future scene, including lane markings and 3D detection boxes, as a spatio-temporal chain of thought, and then plans the trajectory from both the current observation and this visualized future, avoiding the information loss introduced by converting visual information into text.
Introduction
Current autonomous driving large models use discrete textual chain‑of‑thought (CoT) as an intermediate reasoning step, which abstracts visual information and can blur spatio‑temporal relationships, leading to loss of fine‑grained details. Inspired by human drivers visualizing future scenes, the authors propose a spatio‑temporal CoT method that allows visual language models (VLMs) to think visually and plan trajectories based on both current observations and predicted future worlds.
Key Contributions
FSDrive unifies future scenes and perception results into a single image, guiding the model's attention and enforcing physical constraints.
It eliminates the semantic gap caused by cross‑modal conversion (visual to text).
It establishes an end‑to‑end visual reasoning pipeline enabling VLMs to perform causal reasoning directly from visual inputs.
Unified Pre‑training Paradigm for Visual Generation and Understanding
The method consists of two stages: a pre-training stage that endows the VLM with visual generation capability, and a fine-tuning stage that enables visual thinking. Existing autoregressive image-generation models rely on VQ-VAE tokens that carry little semantic information. The proposed approach instead incorporates the VQ-VAE codebook into the model's vocabulary, extending it into a multimodal space that covers both visual and textual tokens while preserving the original MLLM architecture.
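To make the vocabulary extension concrete, here is a minimal PyTorch-style sketch. The vocabulary, codebook, and hidden sizes are illustrative assumptions rather than values from the paper, and this is not FSDrive's actual implementation; it only shows how VQ-VAE codebook indices can be appended to a text vocabulary so one autoregressive head predicts both token types.

```python
import torch.nn as nn

# All sizes below are illustrative assumptions, not values from the paper.
TEXT_VOCAB_SIZE = 32_000   # assumed size of the original text vocabulary
VQ_CODEBOOK_SIZE = 8_192   # assumed number of VQ-VAE codebook entries
HIDDEN_DIM = 4_096         # assumed MLLM hidden size

UNIFIED_VOCAB_SIZE = TEXT_VOCAB_SIZE + VQ_CODEBOOK_SIZE

# Extend the input embedding and output head; rows for the original text
# tokens are copied over so the pretrained language ability is preserved.
old_embed = nn.Embedding(TEXT_VOCAB_SIZE, HIDDEN_DIM)
new_embed = nn.Embedding(UNIFIED_VOCAB_SIZE, HIDDEN_DIM)
new_embed.weight.data[:TEXT_VOCAB_SIZE] = old_embed.weight.data

old_head = nn.Linear(HIDDEN_DIM, TEXT_VOCAB_SIZE, bias=False)
new_head = nn.Linear(HIDDEN_DIM, UNIFIED_VOCAB_SIZE, bias=False)
new_head.weight.data[:TEXT_VOCAB_SIZE] = old_head.weight.data

def visual_token_id(codebook_index: int) -> int:
    """Map a VQ-VAE codebook index to its id in the unified vocabulary."""
    return TEXT_VOCAB_SIZE + codebook_index
```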
Visual Understanding Pre‑training: Uses VQA tasks to retain semantic understanding.
Visual Generation Pre‑training: Predicts future visual tokens autoregressively, leveraging abundant video data without extra annotations.
Progressive Image Generation: Generates lane markings first as a structural skeleton, then predicts 3D detection boxes, ensuring compliance with static and dynamic physical constraints before rendering full future frames (a token-ordering sketch follows this list).
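The progressive ordering can be read as composing one autoregressive target sequence. The sketch below is purely illustrative: the helper function, the token lists, and the separator id are hypothetical placeholders, not the paper's actual data format.

```python
from typing import List

def build_progressive_target(
    lane_tokens: List[int],          # step 1: static road skeleton (lane markings)
    box_tokens: List[int],           # step 2: dynamic agents as 3D detection boxes
    future_frame_tokens: List[int],  # step 3: the full future frame
    sep_token_id: int,               # hypothetical separator token
) -> List[int]:
    """Compose the training target in the progressive order described above."""
    target: List[int] = []
    target += lane_tokens + [sep_token_id]
    target += box_tokens + [sep_token_id]
    target += future_frame_tokens
    return target
```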
Visualization‑Based Spatio‑Temporal CoT
The model acts as a world model, generating a unified image that predicts future lane markings and 3D boxes, providing coarse visual cues that guide attention to drivable areas and key objects while enforcing physical constraints. The spatio‑temporal CoT serves as an intermediate reasoning step, allowing the VLM to function as an inverse dynamics model that plans trajectories based on current observations and visualized future predictions.
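The two roles described above (world model, then inverse dynamics model) can be read as a two-step generation loop. The sketch below is conceptual: `vlm.generate`, the prompts, and the waypoint format are hypothetical stand-ins for illustration, not FSDrive's real interface.

```python
# Conceptual two-step reasoning flow; every name here is a hypothetical
# stand-in for illustration, not FSDrive's actual API.
def plan_with_spatiotemporal_cot(vlm, camera_images, instruction):
    # Step 1 (world model): generate the unified CoT image that visualizes
    # future lane markings and 3D boxes for the coming scene.
    future_scene = vlm.generate(
        images=camera_images,
        prompt="Predict the future lane markings and 3D boxes as an image.",
    )

    # Step 2 (inverse dynamics model): plan waypoints conditioned on both the
    # current observation and the visualized future.
    trajectory = vlm.generate(
        images=camera_images + [future_scene],
        prompt=f"Plan the ego trajectory. Driving instruction: {instruction}",
    )
    return trajectory  # e.g. a list of future (x, y) waypoints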
Experiment
Evaluations on the nuScenes dataset show that FSDrive achieves state‑of‑the‑art performance in average planning error and collision rate, demonstrating that visual thinking of future scenes significantly reduces risk. Ablation studies confirm the effectiveness of the spatio‑temporal CoT; removing it leads to substantial trajectory deviation and higher collision risk.
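For reference, the open-loop planning metric typically reported on nuScenes is the L2 distance between predicted and ground-truth waypoints at 1s/2s/3s horizons. The sketch below assumes waypoints sampled at 2 Hz and is a generic illustration of that metric, not FSDrive's exact evaluation code; collision rate additionally requires occupancy maps of other agents and is omitted.

```python
import numpy as np

def average_l2_error(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Average L2 planning error at 1s/2s/3s horizons.

    pred, gt: (T, 2) arrays of BEV waypoints at an assumed 2 Hz,
    so the first 2/4/6 points cover 1s/2s/3s.
    """
    errors = {}
    for seconds, steps in ((1, 2), (2, 4), (3, 6)):
        diff = pred[:steps] - gt[:steps]
        errors[f"{seconds}s"] = float(np.linalg.norm(diff, axis=-1).mean())
    errors["avg"] = float(np.mean([errors["1s"], errors["2s"], errors["3s"]]))
    return errors
```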
Conclusion
FSDrive presents a spatio‑temporal CoT‑based autonomous driving framework that enables VLMs to think visually. By unifying future scene generation with perception, it removes cross‑modal semantic gaps and establishes an end‑to‑end visual reasoning pipeline. The unified pre‑training paradigm and progressive generation method enhance visual generation quality, and extensive experiments validate the approach’s effectiveness, advancing autonomous driving toward visual reasoning and spatial intelligence.
Paper: https://arxiv.org/abs/2505.17685
Project page: https://miv-xjtu.github.io/FSDrive.github.io/
Code: https://github.com/MIV-XJTU/FSDrive
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.