How DP-Recon Uses Diffusion Models to Reconstruct 3D Scenes from Sparse Photos
DP‑Recon combines generative diffusion priors with a visibility‑guided SDS loss to reconstruct high‑fidelity, compositional 3D scenes from extremely sparse images, delivering superior geometry, texture, and text‑driven editing on benchmark datasets and real‑world indoor scenes.
Overview
DP‑Recon is a compositional 3D scene reconstruction framework that integrates a generative diffusion prior via Score‑Distillation Sampling (SDS) to recover high‑quality geometry and texture from a few sparse images.
Problem
Conventional multi‑view reconstruction requires dense camera coverage; with sparse viewpoints geometry collapses, occluded regions remain unreconstructed, and objects cannot be decoupled, limiting downstream tasks such as embodied AI, metaverse content, and visual effects.
Method
Compositional implicit representation
For each detected object and the background, a separate signed‑distance‑function (SDF) field is learned. Multi‑modal supervision (RGB, depth, surface normals, instance segmentation) is applied to each field.
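The compositional idea can be illustrated with a minimal sketch: each entity (objects plus background) owns its own SDF, and the full scene is their union, i.e. the pointwise minimum over per‑entity distances. The analytic `sphere_sdf` below is a hypothetical stand‑in for a learned per‑object SDF network; the paper's actual fields are neural.

```python
import numpy as np

def sphere_sdf(center, radius):
    """Analytic stand-in for a learned per-entity SDF network."""
    center = np.asarray(center, dtype=float)
    return lambda pts: np.linalg.norm(pts - center, axis=-1) - radius

def scene_sdf(entity_sdfs, pts):
    """Compose per-entity SDFs into one scene SDF (union = pointwise min)."""
    values = np.stack([f(pts) for f in entity_sdfs], axis=0)
    return values.min(axis=0)

# Two "objects": a unit sphere at the origin and a small sphere at x = 3.
entities = [sphere_sdf([0, 0, 0], 1.0), sphere_sdf([3, 0, 0], 0.5)]
pts = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
print(scene_sdf(entities, pts))  # negative inside an entity, positive outside
```

Because each entity keeps its own field, any object can be extracted, edited, or re-posed independently of the rest of the scene.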
Two‑stage optimization
Geometry stage: Initialize the geometry from the multi‑modal losses, then apply an SDS loss that distills knowledge from a pretrained Stable Diffusion model into the SDF. The loss is computed on rendered depth/normal maps and encourages plausible shape completion in unobserved regions.
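A minimal sketch of the SDS update on a rendered map (depth or normal): noise the render, ask the diffusion prior to predict the noise, and use the prediction error as a gradient on the render. Here `denoiser` is a stand‑in for Stable Diffusion's noise‑prediction network, and the weighting `w = 1 - alpha_bar` is one common SDS choice, not necessarily the paper's exact schedule.

```python
import numpy as np

def sds_gradient(rendered, denoiser, t, alpha_bar, rng):
    """One SDS step: gradient = w(t) * (predicted noise - true noise),
    to be backpropagated through the renderer into the SDF parameters."""
    eps = rng.standard_normal(rendered.shape)          # true injected noise
    noisy = np.sqrt(alpha_bar) * rendered + np.sqrt(1 - alpha_bar) * eps
    eps_pred = denoiser(noisy, t)                      # prior's noise estimate
    w = 1 - alpha_bar                                  # common SDS weighting
    return w * (eps_pred - eps)
```

In the real pipeline this gradient flows through a differentiable renderer back into the SDF weights, so the prior "hallucinates" plausible geometry only where the image losses provide no signal.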
Appearance stage: Render the optimized meshes with Nvdiffrast, fuse the input image colors with the diffusion prior, and optimize per‑vertex UV texture maps. The result is a high‑resolution UV atlas for each object.
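The quantity being optimized in this stage is a UV texture atlas per object. As a rough sketch of what the differentiable texture lookup does (Nvdiffrast provides its own GPU implementation; this NumPy version is only illustrative), bilinear sampling of an atlas at continuous UV coordinates looks like:

```python
import numpy as np

def sample_texture(atlas, uv):
    """Bilinearly sample an (H, W, 3) atlas at continuous UVs in [0, 1]."""
    h, w, _ = atlas.shape
    x = uv[..., 0] * (w - 1)
    y = uv[..., 1] * (h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    fx, fy = (x - x0)[..., None], (y - y0)[..., None]
    top = atlas[y0, x0] * (1 - fx) + atlas[y0, x1] * fx
    bot = atlas[y1, x0] * (1 - fx) + atlas[y1, x1] * fx
    return top * (1 - fy) + bot * fy
```

Because the lookup is differentiable in the atlas values, gradients from both the input-image loss and the diffusion prior can update the texels directly.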
Visibility‑guided weighting
A per‑pixel visibility weight w_vis is derived from a visibility grid accumulated during rasterization. The final SDS loss is weighted as L_SDS·(1‑w_vis) + L_img·w_vis, ensuring that the diffusion prior dominates only in regions that are invisible or heavily occluded in the input views.
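The per‑pixel blend described above is straightforward to write down; this is a minimal sketch with toy loss maps (the names `l_sds`, `l_img`, `w_vis` mirror the formula, not the paper's code):

```python
import numpy as np

def blended_loss(l_sds, l_img, w_vis):
    """Per-pixel blend: the diffusion prior (SDS) dominates unseen regions
    (w_vis -> 0), the image reconstruction loss dominates observed ones."""
    return l_sds * (1.0 - w_vis) + l_img * w_vis

l_sds = np.full((2, 2), 4.0)   # toy prior-based loss map
l_img = np.full((2, 2), 1.0)   # toy image reconstruction loss map
w_vis = np.array([[1.0, 0.0],
                  [0.5, 0.25]])
print(blended_loss(l_sds, l_img, w_vis).mean())  # prints 2.6875
```

The weighting prevents the prior from "repainting" surfaces that are already well constrained by the input photos, which is what keeps observed regions faithful while occluded ones get plausibly completed.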
Experiments
Evaluations on the Replica and ScanNet++ benchmarks show that DP‑Recon outperforms prior methods in both overall scene reconstruction and object‑wise reconstruction under sparse view settings (e.g., 10–15 images). Quantitative metrics such as Chamfer distance, normal consistency, and PSNR are improved, and qualitative results exhibit smoother backgrounds, fewer artifacts, and accurate recovery of heavily occluded objects.
Applications
Using only 15 frames extracted from a YouTube walkthrough, DP‑Recon produces textured meshes that can be directly imported into Blender or game engines. The diffusion prior also enables text‑driven editing (e.g., “turn the vase into a teddy bear”) and novel‑view synthesis with high fidelity.
Resources
Paper: https://arxiv.org/abs/2503.14830
Project page: https://dp-recon.github.io/
Code repository: https://github.com/DP-Recon/DP-Recon
