LocDiff: Achieving Global-Scale Precise Image Geolocation Without Grids or Reference Libraries

The LocDiff framework combines a spherical‑harmonics Dirac‑delta (SHDD) encoding with a conditional Siren‑UNet diffusion model, enabling accurate worldwide image geolocation without predefined grids or external image libraries and outperforming prior methods in precision, generalization, and computational efficiency.

HyperAI Super Neural

Background and Motivation

Image geolocation aims to infer latitude‑longitude coordinates from visual content and is crucial for tasks such as wildlife monitoring and urban street‑view analysis. Traditional approaches either regress coordinates directly—yielding errors of hundreds of kilometres on global datasets—or discretize the problem, which limits spatial resolution and geographic coverage.

Why Diffusion Models?

Recent diffusion‑based generative techniques excel at modeling continuous data distributions. The authors observed that geographic coordinates live on a Riemannian manifold (the sphere) embedded in Euclidean space, so naïvely injecting Gaussian noise into raw coordinates distorts the manifold's geometry; moreover, raw coordinates carry no multi‑scale spatial information.

Core Innovation: SHDD Encoding and LocDiff Framework

The team defines an ideal position‑encoding space as an injective (one‑to‑one) mapping from the unit sphere C (parameterized by (θ, φ)) into a high‑dimensional Euclidean space ℝᵈ, together with a decoder that maps ℝᵈ surjectively back onto C and is stable under small perturbations of the encoding. To satisfy these properties they propose the Spherical Harmonics Dirac Delta (SHDD) encoding:

Convert a spherical point (θ₀, φ₀) to a Dirac delta function δ₍θ₀,φ₀₎ on the sphere.

Encode this function into a vector of spherical‑harmonic coefficients; truncating at order L yields a compact (L+1)²‑dimensional representation.
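The two steps above can be sketched in a few lines, assuming a standard real orthonormal spherical‑harmonic basis built from SciPy's associated Legendre functions; the paper's exact basis and normalization conventions may differ. The Dirac delta at (θ₀, φ₀) has coefficients c_lm = Y_lm(θ₀, φ₀), so truncating at order L gives an (L+1)²‑dimensional vector:

```python
import math

import numpy as np
from scipy.special import lpmv

def real_sph_harm(l, m, theta, phi):
    """Real orthonormal spherical harmonic Y_lm at polar angle theta,
    azimuth phi (one common convention; the paper's basis may differ)."""
    norm = math.sqrt((2 * l + 1) / (4 * math.pi)
                     * math.factorial(l - abs(m)) / math.factorial(l + abs(m)))
    leg = lpmv(abs(m), l, math.cos(theta))  # associated Legendre P_l^|m|
    if m > 0:
        return math.sqrt(2) * norm * leg * math.cos(m * phi)
    if m < 0:
        return math.sqrt(2) * norm * leg * math.sin(abs(m) * phi)
    return norm * leg

def shdd_encode(theta0, phi0, L):
    """Truncated spherical-harmonic coefficients of the Dirac delta at
    (theta0, phi0): c_lm = Y_lm(theta0, phi0), giving (L+1)**2 numbers."""
    return np.array([real_sph_harm(l, m, theta0, phi0)
                     for l in range(L + 1) for m in range(-l, l + 1)])

vec = shdd_encode(math.pi / 3, math.pi / 4, L=8)
print(vec.shape)  # (81,)
```

For L = 8 this yields 81 coefficients, matching the (L+1)² count stated above.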

SHDD provides a dense encoding where each vector uniquely corresponds to a spherical function, and the KL‑divergence between SHDD representations aligns with the Wasserstein‑2 distance, guaranteeing a continuous similarity measure.

Modal‑Search Decoder

Using the reverse KL divergence as a modal‑search objective, the decoder locates the region of highest probability mass on the sphere. A hyper‑parameter ρ balances resolution and stability: larger ρ yields coarser, more stable predictions; smaller ρ improves precision but is sensitive to local noise. This design eliminates the need for pre‑defined spherical grids or external reference image libraries.
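The modal search can be illustrated as follows. Here a heat‑kernel‑style damping of high‑order coefficients stands in for the ρ hyper‑parameter (a hypothetical choice, not necessarily the paper's smoothing), and the reconstructed spherical function is simply evaluated over candidate anchor points with an argmax:

```python
import math

import numpy as np
from scipy.special import lpmv

def real_sph_harm(l, m, theta, phi):
    """Real orthonormal spherical harmonic (illustrative convention)."""
    norm = math.sqrt((2 * l + 1) / (4 * math.pi)
                     * math.factorial(l - abs(m)) / math.factorial(l + abs(m)))
    leg = lpmv(abs(m), l, math.cos(theta))
    if m > 0:
        return math.sqrt(2) * norm * leg * math.cos(m * phi)
    if m < 0:
        return math.sqrt(2) * norm * leg * math.sin(abs(m) * phi)
    return norm * leg

L = 10
theta0, phi0 = 1.1, 2.3  # the location whose delta was encoded

# SHDD coefficients of the delta at (theta0, phi0).
coeffs = np.array([real_sph_harm(l, m, theta0, phi0)
                   for l in range(L + 1) for m in range(-l, l + 1)])

# Heat-kernel-style damping as a stand-in for rho: larger rho suppresses
# high orders, giving a smoother (coarser but more stable) mode.
rho = 0.01
damping = np.array([math.exp(-rho * l * (l + 1))
                    for l in range(L + 1) for m in range(-l, l + 1)])

# Candidate anchors: random points on the sphere plus the true location.
rng = np.random.default_rng(0)
n = 1000
anchors = np.column_stack([np.arccos(1 - 2 * rng.random(n)),  # theta in [0, pi]
                           2 * math.pi * rng.random(n)])      # phi in [0, 2pi)
anchors = np.vstack([anchors, [theta0, phi0]])

# Modal search: evaluate the smoothed reconstruction at every anchor
# and return the anchor where it is largest (the mode).
basis = np.array([[real_sph_harm(l, m, t, p)
                   for l in range(L + 1) for m in range(-l, l + 1)]
                  for t, p in anchors])
best = anchors[int(np.argmax(basis @ (coeffs * damping)))]
print(best)  # [1.1 2.3]
```

By the spherical‑harmonic addition theorem, the damped kernel attains its maximum exactly at the source point, so the argmax recovers the encoded location whenever an anchor sits there.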

Conditional Siren‑UNet (CS‑UNet) Backbone

The conditional generation network builds on SirenNet because spherical‑harmonic coefficients are sums of sine‑cosine terms; sine activations preserve gradient flow for these features. CS‑UNet integrates:

Image embedding e_I from a frozen CLIP encoder.

Latent vector x.

Diffusion timestep t transformed into scale and offset vectors.

A C‑Siren module that fuses x, e_I, and t to produce denoised features.

The resulting architecture enables efficient conditioning on visual input during diffusion.
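A toy numpy forward pass suggests how one C‑Siren block might fuse the three inputs. All layer shapes, the fusion order, and the FiLM‑style timestep modulation here are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def siren_layer(x, w, b, w0=30.0):
    """Sine-activated linear layer (SIREN-style)."""
    return np.sin(w0 * (x @ w + b))

def timestep_embedding(t, dim):
    """Sinusoidal embedding of the diffusion timestep (standard DDPM trick)."""
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

d_x, d_img, d_h = 81, 512, 128   # SHDD dim for L=8, CLIP dim, hidden width

# Hypothetical parameters of one C-Siren block (random, for illustration).
W_in  = rng.normal(0, 1 / d_x, (d_x, d_h))
b_in  = np.zeros(d_h)
W_img = rng.normal(0, 1 / d_img, (d_img, d_h))  # fuses the CLIP embedding
W_t   = rng.normal(0, 1 / d_h, (d_h, 2 * d_h))  # timestep -> scale, offset

def c_siren_block(x, e_img, t):
    h = siren_layer(x, W_in, b_in)                 # sine features of the latent
    h = h + e_img @ W_img                          # condition on the image
    scale, offset = np.split(timestep_embedding(t, d_h) @ W_t, 2)
    return (1 + scale) * h + offset                # FiLM-style time modulation

x = rng.normal(size=d_x)        # noisy SHDD latent
e_img = rng.normal(size=d_img)  # stand-in for the frozen-CLIP embedding
out = c_siren_block(x, e_img, t=10)
print(out.shape)  # (128,)
```

The sine activation is the key design choice: it matches the sine‑cosine structure of spherical‑harmonic coefficients, which is the motivation quoted above.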

Training and Inference

Training follows the standard DDPM pipeline: each (image, spherical coordinate) pair is encoded as (e_I, SHDD). The SHDD representation is progressively noised to pure Gaussian noise; CS‑UNet learns to reverse this process under the guidance of e_I. The loss is the SHDD KL divergence, which is more stable than spherical MSE and retains multi‑scale information.
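The forward (noising) half of this pipeline is the standard DDPM process applied to the SHDD vector; a minimal numpy sketch, with a linear noise schedule assumed as an example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard DDPM forward process on an SHDD vector x0:
#   q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)

def noise_shdd(x0, t):
    """Sample x_t from q(x_t | x_0); also return the noise eps, whose
    effect the denoiser (CS-UNet) is trained to reverse."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

x0 = rng.normal(size=81)             # stand-in SHDD encoding (L = 8)
x_T, _ = noise_shdd(x0, T - 1)
# By the final step almost no signal remains: x_T is near-pure Gaussian noise.
```

The only LocDiff‑specific ingredients on top of this are the SHDD latent and the SHDD KL‑divergence loss mentioned above.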

During inference, the model starts from random Gaussian noise, iteratively denoises it into an SHDD coefficient vector, and finally decodes that vector to (θ, φ) via the modal‑search decoder. The integrals involved are approximated by sums over a discretized set of global anchor points; during training, these anchors are resampled randomly at each step.
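Drawing an area‑uniform anchor set on the sphere is one plausible way to realize the random anchors (the paper may use a different sampling scheme); it amounts to sampling cos θ uniformly:

```python
import numpy as np

def sample_anchors(n, rng):
    """Draw n area-uniform points on the unit sphere:
    phi ~ U[0, 2*pi), cos(theta) ~ U[-1, 1]."""
    phi = 2.0 * np.pi * rng.random(n)
    theta = np.arccos(1.0 - 2.0 * rng.random(n))
    return theta, phi

rng = np.random.default_rng(0)
theta, phi = sample_anchors(100_000, rng)
z = np.cos(theta)            # z-coordinates of the sampled anchors
print(abs(z.mean()) < 0.01)  # uniform sampling => mean z near 0
```

Sampling cos θ (rather than θ) uniformly is what prevents anchors from clustering at the poles.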

Experimental Evaluation

Datasets follow the GeoCLIP benchmark: MP16 (4.72 M training images), and three test sets—Im2GPS3k, YFCC26k, and GWS15k. Five spatial scales are evaluated: street (1 km), city (25 km), region (200 km), country (750 km), and continent (2 500 km).

LocDiff consistently outperforms baselines. A hybrid model, LocDiff‑H, restricts GeoCLIP retrieval to a 200 km radius around LocDiff’s generated location, achieving the best results on Im2GPS3k and YFCC26k but lagging on GWS15k due to distribution shift.

Compared with generative baselines DiffR³ and FMR³ on OSM‑5M and YFCC‑4k, LocDiff attains higher accuracy, confirming the advantage of multi‑scale diffusion.

Generalization and Efficiency

LocDiff’s performance remains stable when anchors are drawn from the MP16 library or a uniform grid, and when the number of anchors varies from 21 k to 1 M, demonstrating robustness to anchor selection.

Computationally, SHDD encoding/decoding are near‑constant‑time closed‑form operations; training converges in ~2 M steps on YFCC, far fewer than the ~10 M steps required by competing models. Inference uses simple matrix multiplications and argmax, yielding linear space complexity.

Broader Context

The paper “LocDiff: Identifying Locations on Earth by Diffusing in the Hilbert Space” was accepted to NeurIPS 2025 (https://openreview.net/forum?id=ghybX0Qlls). Related works from MIT‑CSAIL and the GeoCoT framework further illustrate rapid progress in spherical position encoding and multi‑step geographic reasoning.

References

LocDiff paper (arXiv 2503.18142): https://arxiv.org/abs/2503.18142

LocDiff diagram

Tags: Diffusion Models, AI research, image geolocation, LocDiff, spherical harmonics
Written by HyperAI Super Neural

Deconstructing the sophistication and universality of technology, covering cutting-edge AI for Science case studies.