Meta’s VLM³ Boosts Depth Accuracy to 0.9 Using Qwen3‑VL‑4B for Unified 3D Tasks
Meta and Princeton introduce VLM³, a unified vision‑language framework built on Qwen3‑VL‑4B that models depth estimation, object‑level 3D understanding, pixel matching and camera pose estimation without extra encoders, achieving up to 0.90 depth accuracy and outperforming larger specialist models on multiple benchmarks.
Motivation: 3D Perception
Three‑dimensional perception—recovering real‑world geometry, scale, and spatial relationships from 2D images—is fundamental for autonomous driving, robotics, and 3D reconstruction. Unlike 2D tasks such as classification, 3D perception demands precise spatial reasoning and geometric modeling, making it one of the most challenging problems in computer vision.
Limitations of Existing Vision‑Language Models
Recent Vision‑Language Models (VLMs) have excelled on 2D tasks through large‑scale pre‑training, yet they still lag behind specialized 3D models on depth estimation, pixel matching, and camera pose estimation. The 3D vision community currently lacks a universal foundation model; most state‑of‑the‑art methods rely on task‑specific architectures, loss functions, and training pipelines.
Research Question
Can a standard VLM, without additional encoders, visual prompts, or task‑specific modules, perform a broad set of fine‑grained 3D perception tasks?
VLM³ Framework
Meta and Princeton propose VLM³ (VLM Cubed), which builds on a standard VLM and unifies four 3D tasks (object‑level 3D understanding, metric depth estimation, pixel matching, and camera pose estimation) through a consistent data organization and training paradigm. The work is documented in the arXiv pre‑print “VLM3: Vision Language Models Are Native 3D Learners”.
Research Highlights
On the SpatialRGPT benchmark, VLM³‑4B surpasses the larger SpatialRGPT‑8B without any extra encoder.
Compared with the previous best VLM DepthLM‑7B, VLM³‑4B raises average depth accuracy δ₁ from 0.84 to 0.90, matching the specialist model UniDepthV2.
VLM³ reduces the endpoint error (EPE) of the baseline VLM by an order of magnitude, outperforming classic expert models DKM and RoMa.
VLM³ lifts the AUC₃₀° metric from a near‑random 5 % to 94 %, exceeding VGGT and approaching DA3‑Giant.
Mixed Multi‑Task Dataset
The authors construct a hybrid dataset covering single‑view and multi‑view scenarios, spanning metric depth, object‑level 3D understanding, pixel matching, and camera pose estimation.
For metric depth, they extend the DepthLM base data with Argoverse2, Waymo, nuScenes, ScanNet++, Taskonomy, HM3D, and Matterport3D, and add 10 million self‑collected outdoor street images, growing the training set from 16 million to 26 million images. The overall figure quoted for the depth data is approximately 32 million images and 3.2 billion depth points.
Sampling weights are adjusted per dataset: smaller datasets receive reduced weights, because ablation experiments show that sampling all datasets uniformly leads to over‑fitting.
Object‑level 3D understanding reuses the SpatialRGPT dataset (≈1 million images with qualitative and quantitative Q&A), which lacks camera intrinsics for many samples, better reflecting real‑world conditions.
For pixel matching and pose estimation, a unified multi‑view dataset aggregates 14 sources (BlendedMVS, DynamicReplica, SailVOS3D, ScanNet++, etc.), containing about 9.9 million image pairs. Only pairs with >25 % visual overlap are kept, and 30 independent ScanNet++ scenes are reserved for testing to avoid data leakage. Dataset weights follow the original pair counts.
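The filtering and hold-out rules above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record fields `scene` and `overlap`, the function name, and the choice to apply the overlap filter to test pairs as well are all assumptions.

```python
def split_pairs(pairs, test_scenes, min_overlap=0.25):
    """Keep pairs whose overlap exceeds `min_overlap`, then split by scene.

    `pairs` is a list of dicts with hypothetical keys "scene" and "overlap"
    (overlap as a fraction in [0, 1]). Splitting by scene rather than by
    pair keeps every image of a held-out scene out of training.
    """
    train, test = [], []
    for pair in pairs:
        if pair["overlap"] <= min_overlap:
            continue  # too little shared content to supervise matching
        if pair["scene"] in test_scenes:
            test.append(pair)
        else:
            train.append(pair)
    return train, test


# Made-up records for illustration only.
pairs = [
    {"scene": "scene_a", "overlap": 0.60},
    {"scene": "scene_a", "overlap": 0.10},
    {"scene": "scene_b", "overlap": 0.40},
]
train, test = split_pairs(pairs, test_scenes={"scene_b"})
```

Splitting at the scene level is what prevents leakage: a pair-level random split could place two views of the same room on both sides of the train/test boundary.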
Model Design: Minimal‑Change Principle
VLM³ does not introduce a new 3D architecture; it retains the native structure of the base VLM. The framework follows a “minimal‑change” principle, avoiding extra encoders, custom loss functions, or task‑specific modules, and instead optimizes three aspects: input representation, spatial localization, and data organization.
Qwen3‑VL‑4B serves as the backbone, and the entire system is trained with standard supervised fine‑tuning (SFT), identical to existing VLM pre‑training and fine‑tuning pipelines.
Image Standardization
Because multi‑source datasets have inconsistent camera parameters, VLM³ normalizes all images to a standard focal‑length space. Missing intrinsics are estimated using a single‑image calibration model, reducing distribution shift caused by varying imaging conditions.
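One common way to realize such a normalization is to resize each image so that its focal length in pixels equals a fixed target. The sketch below assumes a pinhole camera and isotropic resizing; the target of 1000 px and the function name are hypothetical and not taken from the paper.

```python
def standardize_size(width, height, focal_px, target_focal_px=1000.0):
    """Return (new_width, new_height, scale) for a focal-normalized resize.

    Under a pinhole model, resizing an image by a factor s also multiplies
    its focal length in pixels by s, so s = target_focal_px / focal_px.
    """
    if focal_px <= 0:
        raise ValueError("focal length must be positive")
    scale = target_focal_px / focal_px
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    return new_w, new_h, scale


# A 1920x1080 image with a 2000 px focal length is halved in size.
new_w, new_h, scale = standardize_size(1920, 1080, 2000.0)
```

After this step, an object of a given physical size at a given distance covers about the same number of pixels regardless of the source camera, which is what makes metric depth learnable across datasets.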
Textual Spatial Localization
Instead of visual prompts or dedicated positional encoders, VLM³ normalizes image coordinates to a common space and expresses positions as text. This enables the model to leverage its language modeling capability for pixel‑level, region‑level, and cross‑view correspondence learning without extra visual modules. A single sample provides roughly ten times more supervision for depth estimation while keeping computational cost unchanged.
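To make the idea concrete, here is a minimal sketch of expressing a pixel position as text. The 0 to 999 integer grid, the prompt wording, and both function names are assumptions for illustration; the article does not specify the exact coordinate range or template.

```python
def pixel_to_text(x, y, width, height, bins=1000):
    """Map pixel (x, y) to integers in [0, bins) and format them as text."""
    nx = min(bins - 1, max(0, int(x / width * bins)))
    ny = min(bins - 1, max(0, int(y / height * bins)))
    return f"({nx}, {ny})"


def depth_question(x, y, width, height):
    """Build a hypothetical depth query that locates the pixel in plain text."""
    point = pixel_to_text(x, y, width, height)
    return f"What is the metric depth in meters at point {point}?"


print(depth_question(320, 240, 640, 480))
# prints: What is the metric depth in meters at point (500, 500)?
```

Because the location is ordinary text, the same mechanism serves single points, box corners, and corresponding points in a second view, with no change to the vision encoder.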
Fine‑Grained Data Mixing
Extensive experiments show that naïvely enlarging data scale or using equal‑weight mixing can saturate or degrade performance. By designing differentiated sampling strategies based on dataset size and task difficulty, VLM³ improves 3D representation ability, making data proportion a core component of the framework.
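A simple way to experiment with such mixtures is to derive sampling weights from dataset sizes through a single exponent. This is a hedged sketch of one generic scheme, not the weighting used in the paper: the exponent `alpha`, the dataset names, and the counts are made up.

```python
import random


def mixing_weights(sizes, alpha=1.0):
    """Turn dataset sizes into sampling probabilities proportional to size**alpha.

    alpha = 1.0 samples in proportion to size; alpha = 0.0 samples every
    dataset equally, which revisits the samples of small datasets far more often.
    """
    raw = {name: float(n) ** alpha for name, n in sizes.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


def sample_sources(weights, k, seed=0):
    """Draw k dataset names according to the given probabilities."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


sizes = {"large_set": 9_000_000, "small_set": 1_000_000}  # made-up counts
proportional = mixing_weights(sizes, alpha=1.0)  # 0.9 / 0.1
uniform = mixing_weights(sizes, alpha=0.0)       # 0.5 / 0.5
batch_sources = sample_sources(proportional, k=8)
```

Under the uniform setting, each image of the small dataset is drawn nine times as often as each image of the large one, which illustrates why equal-weight mixing can over-fit small sources.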
Unified Modeling of Four 3D Tasks
Depth estimation is cast as textual pixel‑position supervision; object‑level 3D understanding uses textual coordinate boxes instead of mask encoders; pixel matching transforms cross‑view correspondence into coordinate prediction; camera pose estimation decomposes pose parameters into translation distance, direction, and rotation angle expressed as Q&A. All tasks are handled within the standard VLM autoregressive generation framework.
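The pose decomposition can be illustrated with standard rigid-body geometry. The sketch below assumes the relative pose is given as a 3x3 rotation matrix and a translation vector; the function name and the output layout are hypothetical, and how the paper phrases each quantity as a question is not shown here.

```python
import math


def decompose_pose(R, t):
    """Split a relative pose into (distance, unit direction, rotation angle in degrees).

    R is a 3x3 rotation matrix as nested lists, t a length-3 translation.
    The rotation angle follows from the trace: cos(theta) = (trace(R) - 1) / 2.
    """
    distance = math.sqrt(sum(c * c for c in t))
    if distance > 0:
        direction = tuple(c / distance for c in t)
    else:
        direction = (0.0, 0.0, 0.0)  # direction is undefined for zero translation
    trace = R[0][0] + R[1][1] + R[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    angle_deg = math.degrees(math.acos(cos_angle))
    return distance, direction, angle_deg


# 90-degree rotation about the z axis combined with a translation of length 5.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
distance, direction, angle_deg = decompose_pose(R, (3.0, 0.0, 4.0))
```

Each of the three outputs is a small, bounded quantity that can be asked about and answered in text, which fits the autoregressive Q&A format better than a raw 4x4 matrix.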
Comprehensive Evaluation
Experiments compare VLM³ against generic VLMs and leading specialist models across the four tasks.
Metric Depth Estimation
Evaluated on nine public datasets and five representative benchmarks using δ₁ as the primary metric, VLM³‑4B raises average accuracy from 0.84 to 0.90, surpassing DepthLM‑7B and reaching the performance of UniDepthV2 and MoGe‑2.
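For readers unfamiliar with the metric, δ₁ is conventionally the fraction of valid pixels whose predicted depth is within a factor of 1.25 of the ground truth. The sketch below implements that standard definition on flat lists; the paper's exact evaluation protocol (masking, point sampling, averaging across datasets) is not reproduced here.

```python
def delta1(pred, gt, threshold=1.25):
    """Fraction of valid points with max(pred/gt, gt/pred) < threshold."""
    valid = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not valid:
        raise ValueError("no valid depth pairs")
    hits = sum(1 for p, g in valid if max(p / g, g / p) < threshold)
    return hits / len(valid)


pred = [1.0, 2.0, 3.0, 10.0]  # predicted depths in meters (made up)
gt = [1.1, 2.0, 4.0, 10.5]    # ground-truth depths in meters (made up)
print(delta1(pred, gt))       # prints 0.75: only the 3.0 vs 4.0 pair misses
```

Read this way, a δ₁ of 0.90 means that 90 % of evaluated points fall within 25 % of the true depth.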
Object‑Level 3D Understanding
Using the SpatialRGPT benchmark, the 4 B‑parameter VLM³ outperforms the 8 B‑parameter SpatialRGPT, despite the latter’s extra mask encoder, demonstrating the strength of unified textual localization.
Pixel Matching
On the UFM benchmark, VLM³ reduces endpoint error by an order of magnitude relative to the baseline VLM and exceeds classic expert models DKM and RoMa, approaching the top‑ranked UFM method.
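Endpoint error is the mean Euclidean distance, in pixels, between predicted and ground-truth corresponding points. A minimal sketch of that standard definition follows; the function name and the sample points are illustrative.

```python
import math


def endpoint_error(pred_pts, gt_pts):
    """Mean Euclidean distance between predicted and true 2D points."""
    if len(pred_pts) != len(gt_pts) or not pred_pts:
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred_pts, gt_pts)
    )
    return total / len(pred_pts)


pred_pts = [(3.0, 4.0), (10.0, 10.0)]  # made-up predicted matches
gt_pts = [(0.0, 0.0), (10.0, 10.0)]    # made-up ground truth
print(endpoint_error(pred_pts, gt_pts))  # prints 2.5
```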
Camera Pose Estimation
On ETH3D and ScanNet++ with the AUC₃₀° metric, VLM³ lifts performance from near‑random to 94 %, surpassing VGGT and MapAnything and matching the state‑of‑the‑art DA3‑Giant.
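AUC at 30° summarizes how many image pairs have a pose error below each angle up to 30°. The sketch below follows a widely used convention (per-pair error is the larger of the rotation and translation-direction angular errors, and the recall curve is integrated with the trapezoid rule); whether the paper uses exactly this variant is an assumption.

```python
import bisect


def pair_error(rot_err_deg, trans_err_deg):
    """Per-pair pose error: the worse of the two angular errors."""
    return max(rot_err_deg, trans_err_deg)


def pose_auc(errors_deg, threshold_deg=30.0):
    """Normalized area under the recall-vs-error curve on [0, threshold_deg]."""
    if not errors_deg:
        raise ValueError("need at least one pose error")
    errs = sorted(errors_deg)
    n = len(errs)
    xs = [0.0] + errs
    ys = [0.0] + [(i + 1) / n for i in range(n)]
    # Keep points up to the threshold, then extend the curve flat to it.
    last = bisect.bisect_right(xs, threshold_deg)
    xs = xs[:last] + [threshold_deg]
    ys = ys[:last] + [ys[last - 1]]
    area = 0.0
    for i in range(1, len(xs)):
        area += (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2.0
    return area / threshold_deg


errors = [pair_error(1.0, 2.0), pair_error(40.0, 5.0)]  # made-up errors
score = pose_auc(errors)
```

A score near 1.0 requires almost every pair to have a very small error, while a model whose errors all exceed 30° scores 0.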
Conclusion
Historically, 3D vision research has followed a task‑driven path, designing dedicated models for depth, matching, or pose. VLM³ demonstrates that, with standardized image processing, textual spatial modeling, and refined data mixing, a standard vision‑language model can achieve or exceed specialist performance across multiple fine‑grained 3D tasks, suggesting that generic VLMs possess far greater 3D representation capability than previously assumed and providing empirical support for a unified foundation model in 3D vision.
HyperAI Super Neural