How Meta’s SAM 3D Turns a Single Photo into Detailed 3D Models
Meta’s newly released SAM 3 and SAM 3D models deliver promptable segmentation and single‑image 3D reconstruction, outperforming prior methods on benchmarks. The key ingredients are a shared Perception Encoder, a Presence Head that reduces hallucinations, and a two‑stage generation pipeline that produces high‑fidelity geometry and texture.
Overview of SAM 3 and SAM 3D
Meta’s MSL lab released two new 3D models: SAM 3D Objects for general object and scene reconstruction, and SAM 3D Body for full‑body human reconstruction. Both take a single 2D image and output a detailed 3D mesh, handling small objects, unusual viewpoints, and occlusions.
Performance Highlights
SAM 3D Objects outperforms prior 3D reconstruction methods, generalizing across diverse image types and supporting dense scene reconstruction. In head‑to‑head human preference tests, its win rate is at least five times that of competing models. SAM 3D Body remains robust under unusual poses, partial occlusions, and multi‑person scenes, achieving state‑of‑the‑art results on standard benchmarks.
Promptable Segmentation with SAM 3
SAM 3 extends the SAM 2 architecture with a concept‑prompting interface. Users can provide free‑form text or example image patches to define arbitrary concepts, removing the need for a fixed label set. This enables segmentation of fine‑grained objects such as “red‑striped umbrella” or “striped cat”.
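Conceptually, a prompt is just a token sequence, whether it comes from free‑form text or from an exemplar image patch. The minimal PyTorch sketch below illustrates this; the encoders, dimensions, and token ids are illustrative assumptions, not SAM 3's actual components.

```python
import torch
import torch.nn as nn

# Minimal sketch: a concept prompt is a token sequence, whether it comes from
# free-form text or from an exemplar image patch. All sizes are assumptions.
d_model = 256
text_encoder = nn.Embedding(30522, d_model)   # stand-in for a tokenizer + text model
patch_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # ViT-style patch embedding

# Text concept, e.g. "red-striped umbrella" -> token ids (fake ids here).
text_ids = torch.tensor([[2417, 12406, 12977]])
text_prompt_tokens = text_encoder(text_ids)   # (1, 3, 256)

# Exemplar concept: a 64x64 crop of one instance -> patch tokens.
crop = torch.randn(1, 3, 64, 64)
patch_prompt_tokens = patch_encoder(crop).flatten(2).transpose(1, 2)  # (1, 16, 256)

# Either sequence can condition the detector; see the fusion sketch below.
```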
The new SA‑Co (Segment Anything with Concepts) benchmark evaluates large‑vocabulary detection and segmentation. SAM 3 achieves 47.0% zero‑shot accuracy on LVIS, surpassing the previous state of the art of 38.5%, and beats baseline methods by at least a factor of two on SA‑Co. It also outperforms SAM 2 on video Promptable Visual Segmentation (PVS) tasks.
Core Architecture
Both SAM 3 and SAM 3D share a Perception Encoder backbone that feeds a detector and a tracker, ensuring consistent and efficient feature extraction. The detector builds on an improved DETR architecture and incorporates prompt tokens for text and image examples. Prompt tokens interact with image features via a cross‑attention fusion encoder before being passed to the decoder as object queries.
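A minimal PyTorch sketch of that fusion step follows: prompt tokens attend to image features via cross‑attention, and the result serves as object queries. The layer sizes and single‑layer structure are assumptions; the real detector stacks additional self‑attention and feed‑forward blocks.

```python
import torch
import torch.nn as nn

class PromptFusionEncoder(nn.Module):
    """Fuses prompt tokens with image features via cross-attention.

    A simplified stand-in for the fusion encoder described above; the real
    detector uses more layers plus self-attention and FFN blocks.
    """
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, prompt_tokens, image_features):
        # prompt_tokens:  (B, P, D) text/exemplar tokens
        # image_features: (B, HW, D) flattened Perception Encoder features
        attended, _ = self.cross_attn(query=prompt_tokens,
                                      key=image_features,
                                      value=image_features)
        # The fused tokens are passed to the DETR-style decoder as object queries.
        return self.norm(prompt_tokens + attended)

fusion = PromptFusionEncoder()
queries = fusion(torch.randn(1, 4, 256), torch.randn(1, 1024, 256))
print(queries.shape)  # torch.Size([1, 4, 256])
```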
A novel Presence Head decouples existence prediction from localization. A learnable global presence token estimates the probability that a concept appears in the image; the final confidence score is the product of this presence probability and the localized matching score, reducing hallucinations when objects are absent.
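The scoring rule is easy to state in code. A sketch under assumed shapes: one global token yields a per‑image existence probability that gates every query's matching score.

```python
import torch
import torch.nn as nn

class PresenceHead(nn.Module):
    """Decouples 'is the concept in the image?' from 'where is it?'."""
    def __init__(self, d_model=256):
        super().__init__()
        self.presence_proj = nn.Linear(d_model, 1)

    def forward(self, presence_token, match_logits):
        # presence_token: (B, D) global presence token after the decoder
        # match_logits:   (B, Q) per-query localization/matching logits
        p_present = torch.sigmoid(self.presence_proj(presence_token))  # (B, 1)
        p_match = torch.sigmoid(match_logits)                          # (B, Q)
        # Final confidence = presence probability x localized match score,
        # so an absent concept suppresses all query scores at once.
        return p_present * p_match
```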
Two‑Stage 3D Generation Pipeline (SAM 3D Objects)
Geometry stage: a 1.2‑billion‑parameter flow‑matching Transformer with a Mixture‑of‑Transformers (MoT) head predicts a coarse voxel shape together with object pose (rotation, translation, and scale) in camera coordinates.
Texture & refinement stage: a sparse latent flow‑matching network extracts active voxels from the coarse shape, refines geometry, and synthesizes high‑fidelity texture. Two VAE decoders then output either a mesh or a 3D Gaussian splat for downstream rendering.
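For intuition, flow matching generates a sample by integrating a learned velocity field from noise (t = 0) toward data (t = 1). The toy sketch below shows the sampling loop; a tiny MLP stands in for the 1.2B‑parameter Transformer, and a flat latent stands in for image‑conditioned voxel latents.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity field standing in for the flow-matching Transformer."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x, t):
        # Concatenate the scalar time step onto each latent vector.
        t = t.expand(x.shape[0], 1)
        return self.net(torch.cat([x, t], dim=-1))

@torch.no_grad()
def sample(model, dim=64, steps=50):
    """Euler integration of dx/dt = v(x, t) from t=0 (noise) to t=1 (data)."""
    x = torch.randn(1, dim)           # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.tensor([[i * dt]])
        x = x + model(x, t) * dt      # one Euler step along the learned flow
    return x                          # shape/pose latent in this sketch

latent = sample(VelocityNet())
```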
A Model‑in‑the‑Loop (MITL) data engine generates large‑scale image‑3D pairs: the model proposes multiple 3D candidates, human annotators select the best match via a Best‑of‑N search, and the chosen geometry is aligned to the scene using a point‑cloud reference, providing low‑cost, high‑quality training data.
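The Best‑of‑N step reduces annotation to ranking. A minimal sketch, where `model` and `score` are placeholders: in the actual engine the candidates come from SAM 3D Objects and the ranking comes from a human annotator.

```python
def best_of_n(model, image, score, n=8):
    """Propose n 3D candidates and keep the best-scoring one.

    `model` and `score` are placeholders: in the MITL engine, candidates come
    from the model and the winner is chosen by a human annotator.
    """
    candidates = [model(image) for _ in range(n)]    # n stochastic reconstructions
    scores = [score(image, c) for c in candidates]   # annotator or metric ranking
    best = max(range(n), key=lambda i: scores[i])
    # The winning geometry is then aligned to the scene via a point-cloud
    # reference and added to the training set as an image-3D pair.
    return candidates[best]
```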
SAM 3D Body: Full‑Body Human Reconstruction
SAM 3D Body replaces the traditional SMPL model with a Momentum Human Rig representation that explicitly decouples skeletal pose from body shape, avoiding skinning distortion. It uses a promptable encoder‑decoder that accepts 2D keypoints or masks as tokens.
The decoder splits into two streams (a sketch follows the list):
Body decoder: predicts global pose, shape, and camera parameters using global features and the Momentum Human Rig token.
Hand decoder: processes cropped hand images with cross‑attention to capture fine hand details.
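A compact PyTorch sketch of the split; all dimensions and output parameterizations are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class TwoStreamDecoder(nn.Module):
    """Sketch of the body/hand decoder split (all sizes are assumptions)."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.rig_token = nn.Parameter(torch.randn(1, 1, d_model))  # Momentum Human Rig token
        self.body_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.hand_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.body_head = nn.Linear(d_model, 72 + 10 + 3)  # pose + shape + camera (illustrative dims)
        self.hand_head = nn.Linear(d_model, 2 * 15 * 3)   # per-hand articulation (illustrative dims)

    def forward(self, global_feats, hand_crop_feats):
        # Body stream: the rig token attends over full-image features.
        rig = self.rig_token.expand(global_feats.shape[0], -1, -1)
        body, _ = self.body_attn(rig, global_feats, global_feats)
        # Hand stream: cross-attention over high-resolution hand-crop features.
        hand, _ = self.hand_attn(rig, hand_crop_feats, hand_crop_feats)
        return self.body_head(body), self.hand_head(hand)

decoder = TwoStreamDecoder()
body_params, hand_params = decoder(torch.randn(1, 1024, 256), torch.randn(1, 64, 256))
```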
Key Technical Details
Both models use a shared Perception Encoder that serves detection and tracking modules, ensuring feature consistency.
Detection is based on an enhanced DETR with prompt tokens; prompt tokens are fused via cross‑attention before decoding.
The Presence Head introduces a global existence token; confidence = presence × local match.
Video handling inherits SAM 2’s memory mechanism: a tracker stores past frame features and propagates masks forward.
For new objects, an IoU‑based matching function links tracker predictions with detector outputs to maintain identity across frames.
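The matcher itself is a few lines. A sketch with (x1, y1, x2, y2) boxes and a greedy threshold‑based linking rule (the 0.5 threshold is an assumption):

```python
import torch

def iou(a, b):
    """Pairwise IoU between two sets of (x1, y1, x2, y2) boxes."""
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    inter = (rb - lt).clamp(min=0).prod(-1)
    area_a = (a[:, 2:] - a[:, :2]).prod(-1)
    area_b = (b[:, 2:] - b[:, :2]).prod(-1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def match(tracked, detected, thresh=0.5):
    """Greedily link tracker predictions to detector outputs by IoU."""
    scores = iou(tracked, detected)
    links = {}
    while scores.numel() and scores.max() > thresh:
        t, d = divmod(scores.argmax().item(), scores.shape[1])
        links[t] = d       # tracked object t keeps its identity as detection d
        scores[t, :] = -1  # remove the matched row and column
        scores[:, d] = -1
    return links           # unmatched detections start new tracks
```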
Resources
Project pages and code repositories (publicly accessible):
SAM 3: https://ai.meta.com/sam3
SAM 3D Objects: https://github.com/facebookresearch/sam-3d-objects
SAM 3D Body: https://github.com/facebookresearch/sam-3d-body
Paper (SAM 3): https://ai.meta.com/research/publications/sam-3-segment-anything-with-concepts/
Paper (SAM 3D Objects): https://ai.meta.com/research/publications/sam-3d-3dfy-anything-in-images/
Paper (SAM 3D Body): https://ai.meta.com/research/publications/sam-3d-body-robust-full-body-human-mesh-recovery/
Code example
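A hedged end‑to‑end sketch of the single‑image‑to‑mesh workflow: `DummySam3dObjects` and its `reconstruct` method are hypothetical placeholders for the entry points in the sam-3d-objects repository, while the trimesh calls are real APIs.

```python
import numpy as np
import trimesh

class DummySam3dObjects:
    """Placeholder for the real model; the actual loading code and method
    names live in the sam-3d-objects repository and may differ."""
    def reconstruct(self, image_path: str):
        # Real pipeline: the flow-matching Transformer predicts coarse voxels
        # and pose, then a sparse refinement stage adds detail and texture.
        box = trimesh.creation.box(extents=(1.0, 1.0, 1.0))  # stand-in geometry
        return np.asarray(box.vertices), np.asarray(box.faces)

model = DummySam3dObjects()
vertices, faces = model.reconstruct("chair.jpg")

# Export for downstream rendering (these trimesh calls are real APIs).
mesh = trimesh.Trimesh(vertices=vertices, faces=faces)
mesh.export("chair.glb")
```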
