Can Diffusion Models Turn Noisy GPS into Sub‑Meter Visual Localization?

The DiffVL framework redefines visual localization as a diffusion‑based GPS denoising task, using BEV‑conditioned visual cues and standard‑definition maps to achieve sub‑meter accuracy without high‑definition maps, and validates the approach through extensive experiments on autonomous‑driving benchmarks.

Amap Tech

Background

Road‑level navigation traditionally relies on GNSS and high‑definition (HD) maps. GNSS suffers from multipath interference in urban canyons, and HD maps incur prohibitive mapping and maintenance costs, limiting large‑scale deployment. Standard‑definition (SD) maps such as OpenStreetMap are scalable but lack the geometric precision required for high‑accuracy localization.

Problem

Existing visual‑map matching pipelines treat GPS processing and visual localization as separate stages, which restricts accuracy and robustness when only SD maps are available.

DiffVL Overview

DiffVL reframes visual localization as a GPS denoising problem conditioned on bird’s‑eye‑view (BEV) visual features and SD map information. A conditional diffusion model learns the full posterior distribution of the vehicle pose given noisy GPS.

Formulation

Problem reformulation: Instead of regressing a deterministic pose, the model predicts the probability distribution of the true pose conditioned on noisy GPS.

Forward diffusion: During training, ground‑truth poses are progressively corrupted with Gaussian noise until the distribution matches real GPS noise.

Reverse denoising: At inference, the model iteratively removes noise from the observed GPS using BEV visual features and SD map cues.

Conditional guidance: Each denoising step is guided by multimodal conditions—visual BEV features provide geometric context, while the SD map supplies topological priors.

Pose generation: After multiple denoising steps, the output is a high‑precision, noise‑free vehicle pose estimate.
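The forward step above can be made concrete with a small sketch. This is illustrative only, not DiffVL's released code: the pose is reduced to (x, y, yaw) and the linear variance schedule is an assumption (the paper calibrates corruption to real GPS noise).

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100
betas = np.linspace(1e-4, 0.05, T)   # noise schedule (assumed linear)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # \bar{alpha}_t = prod_{s<=t} alpha_s

def corrupt(p0: np.ndarray, t: int) -> tuple[np.ndarray, np.ndarray]:
    """Closed-form forward diffusion: jump from p_0 directly to p_t."""
    eps = rng.standard_normal(p0.shape)
    pt = np.sqrt(alpha_bars[t]) * p0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return pt, eps

p0 = np.array([12.0, -3.5, 0.1])     # ground-truth pose (x, y, yaw)
pt, eps = corrupt(p0, t=T - 1)
# After T steps the pose is nearly pure noise, mimicking a very bad GPS fix.
```

The reverse process then walks back from such a corrupted pose, with the conditioning features steering each step.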

Architecture

Multimodal conditional encoder: Aggregates BEV visual features, vector map embeddings, and GPS trajectory embeddings to condition the diffusion process.

Visual BEV encoder: Converts monocular or multi‑camera images into top‑down feature maps that capture lane markings, road layout, and surrounding objects.

SD map encoder: Processes vector map data (e.g., OpenStreetMap) to extract road network topology.

GPS trajectory encoder: Encodes the raw noisy GPS sequence.

Diffusion denoising network: Predicts pose residuals at each denoising step, integrating all conditioned features.
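The article names three input streams but not the fusion mechanism, so the following is a hypothetical sketch: the BEV, SD‑map, and GPS embeddings are concatenated and passed through a small MLP to produce one conditioning vector; the feature dimensions and the fusion head are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(x, w1, w2):
    """Tiny two-layer MLP with ReLU, standing in for a learned fusion head."""
    return np.maximum(x @ w1, 0.0) @ w2

bev_feat = rng.standard_normal(256)  # from the visual BEV encoder
map_feat = rng.standard_normal(64)   # from the SD-map (vector road graph) encoder
gps_feat = rng.standard_normal(32)   # from the noisy GPS trajectory encoder

w1 = rng.standard_normal((256 + 64 + 32, 128)) * 0.01
w2 = rng.standard_normal((128, 128)) * 0.01

# One shared conditioning vector guides every denoising step.
cond = mlp(np.concatenate([bev_feat, map_feat, gps_feat]), w1, w2)
```

Concatenation-then-MLP is the simplest plausible choice; attention-based fusion would follow the same interface of producing a single condition vector.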

Training Procedure

For each training sample, the ground‑truth pose p_0 is corrupted over T diffusion steps:

p_t = \sqrt{\alpha_t}\,p_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_t,

where \epsilon_t is sampled from the empirical GPS noise distribution. The network is trained to predict the added noise \epsilon_t given the noisy pose p_t and the multimodal conditioning.
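A single training step under this objective might look like the sketch below. The epsilon‑predictor is a placeholder random linear map (the real one is the conditioned denoising network), and Gaussian noise stands in for the empirical GPS noise distribution the paper describes.

```python
import numpy as np

rng = np.random.default_rng(2)

T = 100
betas = np.linspace(1e-4, 0.05, T)
alpha_bars = np.cumprod(1.0 - betas)

def predict_eps(p_t, t, cond, W):
    """Placeholder noise predictor; DiffVL conditions this on BEV/map features."""
    inp = np.concatenate([p_t, [t / T], cond])
    return inp @ W

p0 = np.array([12.0, -3.5, 0.1])      # ground-truth pose
cond = rng.standard_normal(16)        # multimodal conditioning vector
W = rng.standard_normal((3 + 1 + 16, 3)) * 0.1

t = rng.integers(0, T)                # random diffusion step
eps = rng.standard_normal(3)          # target noise (Gaussian here for simplicity)
p_t = np.sqrt(alpha_bars[t]) * p0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# Standard denoising objective: MSE between predicted and true noise.
loss = np.mean((predict_eps(p_t, t, cond, W) - eps) ** 2)
```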

Inference

Given a noisy GPS observation p_T, the model runs the reverse diffusion process for T steps, each time using the conditioned BEV and map features to predict and subtract the noise component, finally yielding a denoised pose p_0. The mean of the resulting distribution is used as the final pose estimate.
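The reverse pass can be sketched as follows. To keep the example self-contained, an "oracle" noise predictor that knows the true pose replaces the learned network; in DiffVL that role is played by the model conditioned on BEV and SD‑map features. Each step takes the posterior mean (the stochastic term is dropped for clarity).

```python
import numpy as np

rng = np.random.default_rng(3)

T = 100
betas = np.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

p0_true = np.array([12.0, -3.5, 0.1])

def oracle_eps(p_t, t):
    """Exact noise implied by p_t given p_0 (stands in for the learned model)."""
    return (p_t - np.sqrt(alpha_bars[t]) * p0_true) / np.sqrt(1.0 - alpha_bars[t])

# Start from a heavily corrupted "GPS" pose and walk back from t = T-1 to 0.
p = np.sqrt(alpha_bars[-1]) * p0_true + np.sqrt(1.0 - alpha_bars[-1]) * rng.standard_normal(3)
for t in range(T - 1, -1, -1):
    eps_hat = oracle_eps(p, t)
    # Posterior-mean update: remove the predicted noise component at step t.
    p = (p - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
# p now matches the ground-truth pose.
```

With a perfect noise predictor the loop recovers the true pose exactly; in practice the learned, conditioned predictor approximates this, and the mean of the sampled poses is reported.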

Experimental Evaluation

DiffVL was evaluated on several public autonomous‑driving datasets (e.g., nuScenes, Argoverse 2, KITTI) and compared against state‑of‑the‑art SD‑map‑based visual localization methods such as OrienterNet.

Quantitative results: Using only OpenStreetMap and monocular images, DiffVL achieved sub‑meter average localization error (≈0.8 m), outperforming traditional BEV‑feature matching baselines by ~30%.

Qualitative results: The method produces smooth, stable trajectories from noisy GPS inputs, demonstrating effective multi‑source fusion.

Ablation studies: Removing the diffusion component or the multimodal conditioning increases error to >2 m, confirming the necessity of each module.

Future Directions

Planned extensions include end‑to‑end integration with planning and control, multi‑agent cooperative localization, and incorporation of additional sensors (LiDAR, radar) and richer map modalities.

DiffVL architecture diagram
Tags: diffusion model, autonomous driving, BEV, visual localization, GPS denoising, SD map
Written by Amap Tech, the official Amap technology account showcasing Amap's technical innovations.
