How Meta’s SAM 3D Turns a Single Photo into Detailed 3D Models
Meta’s newly released SAM 3 and SAM 3D models deliver promptable segmentation and single‑image 3D reconstruction, outperforming prior methods on benchmarks. The key ingredients are a shared Perception Encoder, a Presence Head that reduces hallucinations, and a two‑stage generation pipeline that produces high‑fidelity geometry and texture.
Overview of SAM 3 and SAM 3D
Meta’s MSL lab released two new 3D models: SAM 3D Objects for general object and scene reconstruction, and SAM 3D Body for full‑body human reconstruction. Both take a single 2D image and output a detailed 3D mesh, handling small objects, unusual viewpoints, and occlusions.
Performance Highlights
SAM 3D Objects outperforms prior 3D reconstruction methods, generalizing across diverse image types and supporting dense scene reconstruction. In head‑to‑head human preference tests, its win rate is at least five times that of competing models. SAM 3D Body remains robust under unusual poses, partial occlusions, and multi‑person scenes, achieving state‑of‑the‑art results on standard benchmarks.
Promptable Segmentation with SAM 3
SAM 3 extends the SAM 2 architecture with a concept‑prompting interface. Users can provide free‑form text or example image patches to define arbitrary concepts, removing the need for a fixed label set. This enables segmentation of fine‑grained objects such as “red‑striped umbrella” or “striped cat”.
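Conceptually, a prompt is just a token sequence, whether it comes from free‑form text or from an exemplar image patch. The minimal PyTorch sketch below illustrates this; the encoders, dimensions, and token ids are illustrative assumptions, not SAM 3's actual components.

```python
import torch
import torch.nn as nn

# Minimal sketch: a concept prompt is a token sequence, whether it comes from
# free-form text or from an exemplar image patch. All sizes are assumptions.
d_model = 256
text_encoder = nn.Embedding(30522, d_model)   # stand-in for a tokenizer + text model
patch_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # ViT-style patch embedding

# Text concept, e.g. "red-striped umbrella" -> token ids (fake ids here).
text_ids = torch.tensor([[2417, 12406, 12977]])
text_prompt_tokens = text_encoder(text_ids)   # (1, 3, 256)

# Exemplar concept: a 64x64 crop of one instance -> patch tokens.
crop = torch.randn(1, 3, 64, 64)
patch_prompt_tokens = patch_encoder(crop).flatten(2).transpose(1, 2)  # (1, 16, 256)

# Either sequence can condition the detector; see the fusion sketch below.
```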
The new SA‑Co (Segment Anything with Concepts) benchmark evaluates large‑vocabulary detection and segmentation. SAM 3 achieves 47.0% zero‑shot accuracy on LVIS, surpassing the previous state of the art of 38.5%, and beats baseline methods by at least a factor of two on SA‑Co. It also outperforms SAM 2 on video Promptable Visual Segmentation (PVS) tasks.
Core Architecture
Both SAM 3 and SAM 3D share a Perception Encoder backbone that feeds a detector and a tracker, ensuring consistent and efficient feature extraction. The detector builds on an improved DETR architecture and incorporates prompt tokens for text and image examples. Prompt tokens interact with image features via a cross‑attention fusion encoder before being passed to the decoder as object queries.
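A minimal PyTorch sketch of that fusion step follows: prompt tokens attend to image features via cross‑attention, and the result serves as object queries. The layer sizes and single‑layer structure are assumptions; the real detector stacks additional self‑attention and feed‑forward blocks.

```python
import torch
import torch.nn as nn

class PromptFusionEncoder(nn.Module):
    """Fuses prompt tokens with image features via cross-attention.

    A simplified stand-in for the fusion encoder described above; the real
    detector uses more layers plus self-attention and FFN blocks.
    """
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, prompt_tokens, image_features):
        # prompt_tokens:  (B, P, D) text/exemplar tokens
        # image_features: (B, HW, D) flattened Perception Encoder features
        attended, _ = self.cross_attn(query=prompt_tokens,
                                      key=image_features,
                                      value=image_features)
        # The fused tokens are passed to the DETR-style decoder as object queries.
        return self.norm(prompt_tokens + attended)

fusion = PromptFusionEncoder()
queries = fusion(torch.randn(1, 4, 256), torch.randn(1, 1024, 256))
print(queries.shape)  # torch.Size([1, 4, 256])
```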
A novel Presence Head decouples existence prediction from localization. A learnable global presence token estimates the probability that a concept appears in the image; the final confidence score is the product of this presence probability and the localized matching score, reducing hallucinations when objects are absent.
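The scoring rule is easy to state in code. A sketch under assumed shapes: one global token yields a per‑image existence probability that gates every query's matching score.

```python
import torch
import torch.nn as nn

class PresenceHead(nn.Module):
    """Decouples 'is the concept in the image?' from 'where is it?'."""
    def __init__(self, d_model=256):
        super().__init__()
        self.presence_proj = nn.Linear(d_model, 1)

    def forward(self, presence_token, match_logits):
        # presence_token: (B, D) global presence token after the decoder
        # match_logits:   (B, Q) per-query localization/matching logits
        p_present = torch.sigmoid(self.presence_proj(presence_token))  # (B, 1)
        p_match = torch.sigmoid(match_logits)                          # (B, Q)
        # Final confidence = presence probability x localized match score,
        # so an absent concept suppresses all query scores at once.
        return p_present * p_match
```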
Two‑Stage 3D Generation Pipeline (SAM 3D Objects)
Geometry stage: a 1.2‑billion‑parameter flow‑matching Transformer with a Mixture‑of‑Transformers (MoT) head predicts a coarse voxel shape together with object pose (rotation, translation, and scale) in camera coordinates.
Texture & refinement stage: a sparse latent flow‑matching network extracts active voxels from the coarse shape, refines geometry, and synthesizes high‑fidelity texture. Two VAE decoders then output either a mesh or a 3D Gaussian splat for downstream rendering.
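For intuition, flow matching generates a sample by integrating a learned velocity field from noise (t = 0) toward data (t = 1). The toy sketch below shows the sampling loop; a tiny MLP stands in for the 1.2B‑parameter Transformer, and a flat latent stands in for image‑conditioned voxel latents.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity field standing in for the flow-matching Transformer."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x, t):
        # Concatenate the scalar time step onto each latent vector.
        t = t.expand(x.shape[0], 1)
        return self.net(torch.cat([x, t], dim=-1))

@torch.no_grad()
def sample(model, dim=64, steps=50):
    """Euler integration of dx/dt = v(x, t) from t=0 (noise) to t=1 (data)."""
    x = torch.randn(1, dim)           # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.tensor([[i * dt]])
        x = x + model(x, t) * dt      # one Euler step along the learned flow
    return x                          # shape/pose latent in this sketch

latent = sample(VelocityNet())
```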
A Model‑in‑the‑Loop (MITL) data engine generates large‑scale image‑3D pairs: the model proposes multiple 3D candidates, human annotators select the best match via a Best‑of‑N search, and the chosen geometry is aligned to the scene using a point‑cloud reference, providing low‑cost, high‑quality training data.
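The Best‑of‑N step reduces annotation to ranking. A minimal sketch, where `model` and `score` are placeholders: in the actual engine the candidates come from SAM 3D Objects and the ranking comes from a human annotator.

```python
def best_of_n(model, image, score, n=8):
    """Propose n 3D candidates and keep the best-scoring one.

    `model` and `score` are placeholders: in the MITL engine, candidates come
    from the model and the winner is chosen by a human annotator.
    """
    candidates = [model(image) for _ in range(n)]    # n stochastic reconstructions
    scores = [score(image, c) for c in candidates]   # annotator or metric ranking
    best = max(range(n), key=lambda i: scores[i])
    # The winning geometry is then aligned to the scene via a point-cloud
    # reference and added to the training set as an image-3D pair.
    return candidates[best]
```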
SAM 3D Body: Full‑Body Human Reconstruction
SAM 3D Body replaces the traditional SMPL model with a Momentum Human Rig representation that explicitly decouples skeletal pose from body shape, avoiding skinning distortion. It uses a promptable encoder‑decoder that accepts 2D keypoints or masks as tokens.
The decoder splits into two streams (a sketch follows the list):
Body decoder: predicts global pose, shape, and camera parameters using global features and the Momentum Human Rig token.
Hand decoder: processes cropped hand images with cross‑attention to capture fine hand details.
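A compact PyTorch sketch of the split; all dimensions and output parameterizations are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class TwoStreamDecoder(nn.Module):
    """Sketch of the body/hand decoder split (all sizes are assumptions)."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.rig_token = nn.Parameter(torch.randn(1, 1, d_model))  # Momentum Human Rig token
        self.body_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.hand_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.body_head = nn.Linear(d_model, 72 + 10 + 3)  # pose + shape + camera (illustrative dims)
        self.hand_head = nn.Linear(d_model, 2 * 15 * 3)   # per-hand articulation (illustrative dims)

    def forward(self, global_feats, hand_crop_feats):
        # Body stream: the rig token attends over full-image features.
        rig = self.rig_token.expand(global_feats.shape[0], -1, -1)
        body, _ = self.body_attn(rig, global_feats, global_feats)
        # Hand stream: cross-attention over high-resolution hand-crop features.
        hand, _ = self.hand_attn(rig, hand_crop_feats, hand_crop_feats)
        return self.body_head(body), self.hand_head(hand)

decoder = TwoStreamDecoder()
body_params, hand_params = decoder(torch.randn(1, 1024, 256), torch.randn(1, 64, 256))
```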
Key Technical Details
Both models use a shared Perception Encoder that serves detection and tracking modules, ensuring feature consistency.
Detection is based on an enhanced DETR with prompt tokens; prompt tokens are fused via cross‑attention before decoding.
The Presence Head introduces a global existence token; confidence = presence × local match.
Video handling inherits SAM 2’s memory mechanism: a tracker stores past frame features and propagates masks forward.
For new objects, an IoU‑based matching function links tracker predictions with detector outputs to maintain identity across frames.
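The matcher itself is a few lines. A sketch with (x1, y1, x2, y2) boxes and a greedy threshold‑based linking rule (the 0.5 threshold is an assumption):

```python
import torch

def iou(a, b):
    """Pairwise IoU between two sets of (x1, y1, x2, y2) boxes."""
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    inter = (rb - lt).clamp(min=0).prod(-1)
    area_a = (a[:, 2:] - a[:, :2]).prod(-1)
    area_b = (b[:, 2:] - b[:, :2]).prod(-1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def match(tracked, detected, thresh=0.5):
    """Greedily link tracker predictions to detector outputs by IoU."""
    scores = iou(tracked, detected)
    links = {}
    while scores.numel() and scores.max() > thresh:
        t, d = divmod(scores.argmax().item(), scores.shape[1])
        links[t] = d       # tracked object t keeps its identity as detection d
        scores[t, :] = -1  # remove the matched row and column
        scores[:, d] = -1
    return links           # unmatched detections start new tracks
```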
Resources
Project pages and code repositories (publicly accessible):
SAM 3: https://ai.meta.com/sam3
SAM 3D Objects: https://github.com/facebookresearch/sam-3d-objects
SAM 3D Body: https://github.com/facebookresearch/sam-3d-body
Paper (SAM 3): https://ai.meta.com/research/publications/sam-3-segment-anything-with-concepts/
Paper (SAM 3D Objects): https://ai.meta.com/research/publications/sam-3d-3dfy-anything-in-images/
Paper (SAM 3D Body): https://ai.meta.com/research/publications/sam-3d-body-robust-full-body-human-mesh-recovery/
Code example
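A hedged end‑to‑end sketch of the single‑image‑to‑mesh workflow: `DummySam3dObjects` and its `reconstruct` method are hypothetical placeholders for the entry points in the sam-3d-objects repository, while the trimesh calls are real APIs.

```python
import numpy as np
import trimesh

class DummySam3dObjects:
    """Placeholder for the real model; the actual loading code and method
    names live in the sam-3d-objects repository and may differ."""
    def reconstruct(self, image_path: str):
        # Real pipeline: the flow-matching Transformer predicts coarse voxels
        # and pose, then a sparse refinement stage adds detail and texture.
        box = trimesh.creation.box(extents=(1.0, 1.0, 1.0))  # stand-in geometry
        return np.asarray(box.vertices), np.asarray(box.faces)

model = DummySam3dObjects()
vertices, faces = model.reconstruct("chair.jpg")

# Export for downstream rendering (these trimesh calls are real APIs).
mesh = trimesh.Trimesh(vertices=vertices, faces=faces)
mesh.export("chair.glb")
```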
