DA3 Enables Arbitrary‑View 3D Reconstruction with a Single Transformer
The ByteDance‑Seed team introduces Depth Anything 3 (DA3), a minimalist visual‑geometry model built on a vanilla Transformer backbone and a depth‑ray representation. DA3 jointly predicts depth and camera pose from any number of images, achieving state‑of‑the‑art performance with a 35.7% gain in pose accuracy and a 23.6% improvement in geometric precision over prior methods.
Background
Three‑dimensional perception from visual inputs underlies tasks such as monocular depth estimation, structure‑from‑motion, multi‑view stereo, and simultaneous localization and mapping (SLAM). Existing approaches typically develop separate, highly specialized models for each task, which hinders the reuse of large‑scale pretrained knowledge.
DA3 Design
Depth Anything 3 (DA3) is a single Transformer model trained on a dedicated visual‑geometry benchmark. It adopts a vanilla DINO encoder as the backbone and does not incorporate any task‑specific architectural customization. By predicting a unified depth‑ray representation, DA3 jointly estimates per‑view depth and camera pose from an arbitrary set of images, regardless of whether the poses are known beforehand.
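The geometric intuition behind a depth‑ray representation can be sketched in a few lines: each pixel is associated with a camera ray (origin plus unit direction), and a predicted depth lifts that pixel to a 3D point via point = origin + depth × direction. The helper names and the pinhole ray construction below are illustrative assumptions for exposition, not DA3's actual parameterization or API.

```python
import numpy as np

def pixel_rays(K, H, W):
    """Build per-pixel camera rays from pinhole intrinsics K (3x3).
    Returns unit ray directions of shape (H, W, 3) in camera coordinates.
    (Illustrative helper; DA3's actual ray parameterization may differ.)"""
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixel coords
    dirs = pix @ np.linalg.inv(K).T                    # back-project through K
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

def unproject(depth, rays, origin=np.zeros(3)):
    """Lift a depth map to 3D points: point = origin + depth * ray_direction.
    With unit rays, 'depth' here is distance along the ray, not z-depth."""
    return origin + depth[..., None] * rays

# Toy example: 4x4 image, focal length 2, principal point at the center.
K = np.array([[2.0, 0.0, 2.0],
              [0.0, 2.0, 2.0],
              [0.0, 0.0, 1.0]])
rays = pixel_rays(K, 4, 4)
points = unproject(np.full((4, 4), 5.0), rays)  # constant depth of 5
print(points.shape)  # (4, 4, 3)
```

Because the rays are unit vectors and the camera origin is at zero, every lifted point sits at distance 5 from the origin; predicting depth and rays per pixel is thus sufficient to recover a 3D point cloud, and (in DA3's formulation) camera pose falls out of the same unified target.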
Key Findings
Only a standard Transformer backbone is required; no custom structures are added.
A single depth‑ray prediction target yields strong performance, eliminating the need for separate multi‑task learning pipelines.
Benchmark and Results
The authors built a new visual‑geometry benchmark that evaluates camera pose estimation, arbitrary‑view geometry, and visual rendering. On this suite DA3 achieves state‑of‑the‑art results across all tasks. Compared with the VGGT baseline, camera pose accuracy improves by an average of 35.7%, geometric precision rises by 23.6%, and monocular depth estimation surpasses the previous DA2 model.
Experiments show that the minimalist design can reconstruct a consistent visual space from any number of input images, irrespective of whether camera poses are provided.
