Meta’s VLM³ Boosts Depth Accuracy to 0.9 Using Qwen3‑VL‑4B for Unified 3D Tasks

Meta and Princeton introduce VLM³, a unified vision‑language framework built on Qwen3‑VL‑4B that models depth estimation, object‑level 3D understanding, pixel matching and camera pose estimation without extra encoders, achieving up to 0.90 depth accuracy and outperforming larger specialist models on multiple benchmarks.

HyperAI Super Neural

Jun 8, 2026

Motivation: 3D Perception

Three‑dimensional perception—recovering real‑world geometry, scale, and spatial relationships from 2D images—is fundamental for autonomous driving, robotics, and 3D reconstruction. Unlike 2D tasks such as classification, 3D perception demands precise spatial reasoning and geometric modeling, making it one of the most challenging problems in computer vision.

Limitations of Existing Vision‑Language Models

Recent Vision‑Language Models (VLMs) have excelled on 2D tasks through large‑scale pre‑training, yet they still lag behind specialized 3D models on depth estimation, pixel matching, and camera pose estimation. The 3D vision community currently lacks a universal foundation model; most state‑of‑the‑art methods rely on task‑specific architectures, loss functions, and training pipelines.

Research Question

Can a standard VLM, without additional encoders, visual prompts, or task‑specific modules, perform a broad set of fine‑grained 3D perception tasks?

VLM³ Framework

Meta and Princeton propose VLM³ (VLM Cubed), which builds on a standard VLM and unifies four 3D tasks (object‑level 3D understanding, metric depth estimation, pixel matching, and camera pose estimation) through a consistent data organization and training paradigm. The work is documented in the arXiv pre‑print “VLM3: Vision Language Models Are Native 3D Learners”.

Research Highlights

On the SpatialRGPT benchmark, VLM³‑4B surpasses the larger SpatialRGPT‑8B without any extra encoder.

Compared with the previous best VLM DepthLM‑7B, VLM³‑4B raises average depth accuracy δ₁ from 0.84 to 0.90, matching the specialist model UniDepthV2.

VLM³ reduces the endpoint error (EPE) of the baseline VLM by an order of magnitude, outperforming classic expert models DKM and RoMa.

VLM³ lifts the AUC₃₀° metric from a near‑random 5 % to 94 %, exceeding VGGT and approaching DA3‑Giant.

Mixed Multi‑Task Dataset

The authors construct a hybrid dataset covering single‑view and multi‑view scenarios, spanning metric depth, object‑level 3D understanding, pixel matching, and camera pose estimation.

For metric depth, they extend the DepthLM base with Argoverse2, Waymo, NuScenes, ScanNet++, Taskonomy, HM3D, and Matterport3D, and add 10 million self‑collected outdoor street images, increasing the training set from 16 million to 26 million images. Overall, the depth data is reported as approximately 32 million images and 3.2 billion depth points.

Sampling weights are adjusted per dataset: smaller datasets receive reduced weights, because the ablation experiments show that uniform sampling causes over‑fitting on them.

Object‑level 3D understanding reuses the SpatialRGPT dataset (≈1 million images with qualitative and quantitative Q&A), which lacks camera intrinsics for many samples, better reflecting real‑world conditions.

For pixel matching and pose estimation, a unified multi‑view dataset aggregates 14 sources (BlendedMVS, DynamicReplica, SailVOS3D, ScanNet++, etc.), containing about 9.9 million image pairs. Only pairs with >25 % visual overlap are kept, and 30 independent ScanNet++ scenes are reserved for testing to avoid data leakage. Dataset weights follow the original pair counts.
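The pair filtering and scene hold-out described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record fields `scene` and `overlap`, the function name `split_pairs`, and the choice to apply the overlap filter to test pairs as well are all assumptions.

```python
def split_pairs(pairs, held_out_scenes, min_overlap=0.25):
    """Drop low-overlap pairs, then route pairs from held-out scenes to the test split.

    `pairs` is a list of dicts with hypothetical keys "scene" and "overlap"
    (overlap as a fraction in [0, 1]). Field names are assumptions.
    """
    train, test = [], []
    for pair in pairs:
        # Keep only pairs with strictly more than 25% visual overlap.
        if pair["overlap"] <= min_overlap:
            continue
        # Splitting by scene (not by pair) keeps test scenes unseen in training.
        if pair["scene"] in held_out_scenes:
            test.append(pair)
        else:
            train.append(pair)
    return train, test


pairs = [
    {"scene": "scene_a", "overlap": 0.60},
    {"scene": "scene_a", "overlap": 0.10},  # dropped: too little overlap
    {"scene": "scene_b", "overlap": 0.40},  # held-out scene
]
train, test = split_pairs(pairs, held_out_scenes={"scene_b"})
print(len(train), len(test))
```

Splitting at the scene level matters because two pairs from the same scene share geometry; a pair-level split would let near-duplicates of test images leak into training.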

Model Design: Minimal‑Change Principle

VLM³ does not introduce a new 3D architecture; it retains the native structure of the base VLM. The framework follows a “minimal‑change” principle, avoiding extra encoders, custom loss functions, or task‑specific modules, and instead optimizes three aspects: input representation, spatial localization, and data organization.

Qwen3‑VL‑4B serves as the backbone, and the entire system is trained with standard supervised fine‑tuning (SFT), identical to existing VLM pre‑training and fine‑tuning pipelines.

Image Standardization

Because multi‑source datasets have inconsistent camera parameters, VLM³ normalizes all images to a standard focal‑length space. Missing intrinsics are estimated using a single‑image calibration model, reducing distribution shift caused by varying imaging conditions.
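One simple way to realize focal-length normalization is to resize each image so that its focal length, measured in pixels, matches a fixed canonical value. The sketch below assumes that approach; the canonical value of 1000 px and the function name are illustrative, since the article does not give the actual target or procedure.

```python
def normalize_focal_length(width, height, focal_px, canonical_focal_px=1000.0):
    """Return the resized (width, height) and the scale factor that maps
    the image's focal length to the canonical one.

    canonical_focal_px is an assumed placeholder, not a value from the paper.
    """
    if focal_px <= 0:
        raise ValueError("focal length must be positive")
    # Resizing an image by s multiplies its focal length (in pixels) by s.
    scale = canonical_focal_px / focal_px
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    return new_width, new_height, scale


# A 1920x1080 image with f = 2000 px is shrunk by half to reach f = 1000 px.
print(normalize_focal_length(1920, 1080, 2000.0))  # (960, 540, 0.5)
```

Metric depth values are unchanged by the resize. What changes is that an object of a given physical size at a given distance now covers roughly the same number of pixels regardless of which camera captured it, which removes one source of ambiguity between datasets.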

Textual Spatial Localization

Instead of visual prompts or dedicated positional encoders, VLM³ normalizes image coordinates to a common space and expresses positions as text. This enables the model to leverage its language modeling capability for pixel‑level, region‑level, and cross‑view correspondence learning without extra visual modules. A single sample provides roughly ten times more supervision for depth estimation while keeping computational cost unchanged.
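To make textual localization concrete, the sketch below maps a pixel position to a resolution-independent integer grid and embeds it in a question string. The 0 to 999 grid, the prompt wording, and the function names are assumptions for illustration; the article does not specify the exact coordinate range or template.

```python
def pixel_to_text(x, y, width, height, bins=1000):
    """Map a pixel position to integer coordinates in [0, bins) and format as text."""
    nx = min(bins - 1, int(x / width * bins))
    ny = min(bins - 1, int(y / height * bins))
    return f"({nx}, {ny})"


def depth_question(x, y, width, height):
    """Build a hypothetical depth query that refers to the point purely in text."""
    point = pixel_to_text(x, y, width, height)
    return f"What is the metric depth in meters at point {point}?"


# The image center maps to the same text regardless of resolution.
print(pixel_to_text(320, 240, 640, 480))    # (500, 500)
print(pixel_to_text(960, 540, 1920, 1080))  # (500, 500)
print(depth_question(320, 240, 640, 480))
```

Because the location lives in the prompt rather than in a drawn marker, several points can be queried against one image without re-rendering or re-encoding it.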

Fine‑Grained Data Mixing

Extensive experiments show that naïvely enlarging data scale or using equal‑weight mixing can saturate or degrade performance. By designing differentiated sampling strategies based on dataset size and task difficulty, VLM³ improves 3D representation ability, making data proportion a core component of the framework.
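The article does not give the sampling formula, so the following is only one common way to interpolate between uniform and size-proportional mixing: raise each dataset size to a power `alpha` and normalize. With `alpha = 0` every dataset is sampled equally; with `alpha = 1` sampling follows raw size. The exponent, the function names, and the example sizes are all hypothetical.

```python
import random


def mixing_weights(sizes, alpha=0.5):
    """Sampling probability per dataset, proportional to size ** alpha."""
    scaled = {name: n ** alpha for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}


def sample_dataset(weights, rng):
    """Draw one dataset name according to the mixing weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


sizes = {"small_set": 100, "large_set": 900}  # illustrative sizes only
for alpha in (0.0, 0.5, 1.0):
    print(alpha, mixing_weights(sizes, alpha))

rng = random.Random(0)
print(sample_dataset(mixing_weights(sizes, alpha=1.0), rng))
```

Moving `alpha` toward 1 lowers the share of small datasets, which is the direction the ablations described above favor over equal-weight mixing.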

Unified Modeling of Four 3D Tasks

Depth estimation is cast as textual pixel‑position supervision; object‑level 3D understanding uses textual coordinate boxes instead of mask encoders; pixel matching transforms cross‑view correspondence into coordinate prediction; camera pose estimation decomposes pose parameters into translation distance, direction, and rotation angle expressed as Q&A. All tasks are handled within the standard VLM autoregressive generation framework.
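As an illustration of the pose decomposition, the sketch below takes a relative rotation matrix and translation vector and extracts the three quantities named above, using the standard identity that the rotation angle satisfies cos θ = (trace(R) − 1) / 2. How the paper words the resulting questions, and whether it also encodes the rotation axis, is not stated in the article; the function name is hypothetical.

```python
import math


def decompose_relative_pose(R, t):
    """Split a relative pose into translation distance, unit direction, and rotation angle.

    R is a 3x3 rotation matrix (nested lists); t is a length-3 translation vector.
    """
    trace = R[0][0] + R[1][1] + R[2][2]
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    angle_deg = math.degrees(math.acos(cos_angle))
    distance = math.sqrt(sum(c * c for c in t))
    direction = [c / distance for c in t] if distance > 0 else [0.0, 0.0, 0.0]
    return distance, direction, angle_deg


# 90 degree rotation about the z-axis, camera moved 5 units.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [3.0, 0.0, 4.0]
distance, direction, angle = decompose_relative_pose(R, t)
print(f"distance={distance:.2f}, direction={direction}, angle={angle:.1f} deg")
```

Each of the three outputs is a small number or short vector, so it can be asked for and answered as plain text in the same way as a depth value.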

Comprehensive Evaluation

Experiments compare VLM³ against generic VLMs and leading specialist models across the four tasks.

Metric Depth Estimation

Evaluated on nine public datasets and five representative benchmarks using δ₁ as the primary metric, VLM³‑4B reaches an average accuracy of 0.90, up from the 0.84 of the previous best VLM, DepthLM‑7B, and on par with the specialist models UniDepthV2 and MoGe‑2.
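For readers unfamiliar with δ₁: it is conventionally the fraction of valid pixels whose predicted depth is within a factor of 1.25 of the ground truth, that is, max(pred/gt, gt/pred) < 1.25. The snippet below computes it on plain lists; the sample values are made up, and the paper's exact masking of invalid pixels may differ.

```python
def delta1(pred, gt, threshold=1.25):
    """Fraction of valid depth pairs with max(pred/gt, gt/pred) < threshold."""
    # Non-positive depths are treated as invalid and skipped.
    valid = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not valid:
        raise ValueError("no valid depth pairs")
    hits = sum(1 for p, g in valid if max(p / g, g / p) < threshold)
    return hits / len(valid)


pred = [1.0, 2.0, 3.0, 10.0]  # illustrative predictions in meters
gt = [1.1, 2.0, 4.0, 8.5]     # illustrative ground truth in meters
print(delta1(pred, gt))  # 0.75: only the 3.0 vs 4.0 pair misses the 1.25 band
```

A δ₁ of 0.90 therefore means that nine in ten evaluated points fall within 25% of the true depth.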

Object‑Level 3D Understanding

Using the SpatialRGPT benchmark, the 4B‑parameter VLM³ outperforms the 8B‑parameter SpatialRGPT, despite the latter’s extra mask encoder, demonstrating the strength of unified textual localization.

Pixel Matching

On the UFM benchmark, VLM³ reduces endpoint error by an order of magnitude relative to the baseline VLM and exceeds classic expert models DKM and RoMa, approaching the top‑ranked UFM method.
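Endpoint error is the mean Euclidean distance, in pixels, between predicted and ground-truth match locations. A minimal sketch with made-up points follows; benchmark-specific details such as validity masks are omitted.

```python
import math


def endpoint_error(pred_points, gt_points):
    """Mean Euclidean distance between predicted and ground-truth (x, y) points."""
    if len(pred_points) != len(gt_points) or not pred_points:
        raise ValueError("point lists must be non-empty and of equal length")
    total = sum(
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred_points, gt_points)
    )
    return total / len(pred_points)


pred = [(3.0, 4.0), (10.0, 10.0)]  # illustrative predicted matches
gt = [(0.0, 0.0), (10.0, 10.0)]    # illustrative ground truth
print(endpoint_error(pred, gt))  # 2.5: one point is off by 5 px, the other is exact
```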

Camera Pose Estimation

On ETH3D and ScanNet++ with the AUC₃₀° metric, VLM³ lifts performance from near‑random to 94 %, surpassing VGGT and MapAnything and matching the state‑of‑the‑art DA3‑Giant.
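AUC at 30° summarizes how many image pairs have a small pose error across all thresholds up to 30 degrees. A common convention takes each pair's error as the larger of its rotation error and its translation-direction error, then averages the recall over the thresholds. The version below uses integer thresholds from 1° to 30°; the article does not say which integration scheme the paper uses, so treat this as one plausible variant.

```python
def pose_auc(rot_err_deg, trans_err_deg, max_threshold=30):
    """Average recall over integer angular thresholds 1..max_threshold (degrees).

    A pair counts as correct at a threshold if both its rotation error and
    translation-direction error are below that threshold.
    """
    errors = [max(r, t) for r, t in zip(rot_err_deg, trans_err_deg)]
    if not errors:
        raise ValueError("no pose errors given")
    n = len(errors)
    recalls = [
        sum(1 for e in errors if e < threshold) / n
        for threshold in range(1, max_threshold + 1)
    ]
    return sum(recalls) / max_threshold


# One near-perfect pair and one badly wrong pair average to 0.5.
print(pose_auc([0.5, 100.0], [0.5, 100.0]))  # 0.5
```

Under this definition, a score of 94% means almost all pairs are correct even at tight thresholds, while a model with mostly large errors scores near zero.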

Conclusion

Historically, 3D vision research has followed a task‑driven path, designing dedicated models for depth, matching, or pose. VLM³ demonstrates that, with standardized image processing, textual spatial modeling, and refined data mixing, a standard vision‑language model can achieve or exceed specialist performance across multiple fine‑grained 3D tasks, suggesting that generic VLMs possess far greater 3D representation capability than previously assumed and providing empirical support for a unified foundation model in 3D vision.
