DA3 Enables Arbitrary‑View 3D Reconstruction with a Single Transformer

The ByteDance‑Seed team introduces Depth Anything 3 (DA3), a minimalist visual‑geometry model that uses a vanilla Transformer backbone and depth‑ray representation to jointly predict depth and camera pose from any number of images, achieving state‑of‑the‑art performance with a 35.7% gain in pose accuracy and a 23.6% improvement in geometric precision over prior methods.

HyperAI Super Neural

Background

Three‑dimensional perception from visual inputs underlies tasks such as monocular depth estimation, structure‑from‑motion, multi‑view stereo, and simultaneous localization and mapping (SLAM). Existing approaches typically develop separate, highly specialized models for each task, which hinders the reuse of large‑scale pretrained knowledge.

DA3 Design

Depth Anything 3 (DA3) is a single Transformer model trained on a dedicated visual‑geometry benchmark. It adopts a vanilla DINO encoder as the backbone and does not incorporate any task‑specific architectural customisation. By predicting a unified depth‑ray representation, DA3 jointly estimates per‑view depth and camera pose from an arbitrary set of images, regardless of whether the poses are known beforehand.
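To make the input/output contract concrete, here is a minimal sketch of what a unified depth-ray predictor looks like at the interface level. All names and shapes here are assumptions for illustration, not the released DA3 API; the point is that one model consumes any number of views and emits a single representation per view, with no separate pose head or pose input.

```python
import numpy as np

def predict_depth_rays(images: np.ndarray) -> dict:
    """Hypothetical sketch of a DA3-style contract (not the real API).

    images: (N, H, W, 3) array holding any number N of views.

    Returns, per view:
      depth: (N, H, W)    per-pixel depth along the ray
      rays:  (N, H, W, 6) per-pixel ray origin (3) + direction (3)

    The camera pose is implicit in the ray map, so no pose input is
    required and no pose-specific output head is needed.
    """
    n, h, w, _ = images.shape
    # Placeholder values standing in for the Transformer's prediction.
    depth = np.ones((n, h, w))
    rays = np.zeros((n, h, w, 6))
    rays[..., 5] = 1.0  # dummy output: all rays point along +z
    return {"depth": depth, "rays": rays}
```

Because depth and rays come out of the same head, downstream consumers never have to reconcile a depth network with a separate pose estimator.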

Key Findings

Only a standard Transformer backbone is required; no custom structures are added.

A single depth‑ray prediction target yields strong performance, eliminating the need for separate multi‑task learning pipelines.
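The geometric intuition behind a depth-ray target can be sketched in a few lines (an assumption about the representation, not DA3's actual code): given a per-pixel ray origin o and unit direction d, a predicted depth t places the 3D point at p = o + t·d, so every view's points land directly in one shared world frame.

```python
import numpy as np

def backproject(depth, origins, directions):
    """depth: (H, W); origins, directions: (H, W, 3) -> points (H, W, 3).

    Each pixel's 3D point is origin + depth * direction, so geometry
    and camera pose are covered by a single prediction target.
    """
    return origins + depth[..., None] * directions

# Toy example: a camera at the world origin looking down +z.
h, w = 2, 2
depth = np.full((h, w), 3.0)
origins = np.zeros((h, w, 3))
directions = np.zeros((h, w, 3))
directions[..., 2] = 1.0  # unit rays along +z
points = backproject(depth, origins, directions)
# every point lands at z = 3 in the world frame
```

This is why no multi-task pipeline is needed: depth, rays, and (implicitly) pose are facets of the same output.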

Benchmark and Results

The authors built a new visual‑geometry benchmark that evaluates camera pose estimation, arbitrary‑view geometry, and visual rendering. On this suite DA3 achieves state‑of‑the‑art results across all tasks. Compared with the VGGT baseline, camera pose accuracy improves by an average of 35.7%, geometric precision rises by 23.6%, and monocular depth estimation surpasses the previous DA2 model.

Experiments show that the minimalist design can reconstruct a consistent visual space from any number of input images, irrespective of whether camera poses are provided.

Example

(Figure: DA3 effect example)
Tags: Transformer, depth estimation, 3D vision, camera pose, DA3, visual geometry
Written by

HyperAI Super Neural

Deconstructing sophisticated, broadly applicable technology, with coverage of cutting‑edge AI for Science case studies.
