How DP-Recon Uses Diffusion Models to Reconstruct 3D Scenes from Sparse Photos
DP‑Recon combines generative diffusion priors with a visibility‑guided SDS loss to reconstruct high‑fidelity, compositional 3D scenes from extremely sparse images, delivering superior geometry, texture, and text‑driven editing on benchmark datasets and real‑world indoor scenes.
Overview
DP‑Recon is a compositional 3D scene reconstruction framework that integrates a generative diffusion prior via Score‑Distillation Sampling (SDS) to recover high‑quality geometry and texture from a few sparse images.
Problem
Conventional multi‑view reconstruction requires dense camera coverage; with sparse viewpoints geometry collapses, occluded regions remain unreconstructed, and objects cannot be decoupled, limiting downstream tasks such as embodied AI, metaverse content, and visual effects.
Method
Compositional implicit representation
For each detected object and the background, a separate signed‑distance‑function (SDF) field is learned. Multi‑modal supervision (RGB, depth, surface normals, instance segmentation) is applied to each field.
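The compositional idea can be illustrated with a minimal sketch: each entity (objects plus background) owns its own SDF, and the full scene is their union, i.e. the pointwise minimum over per‑entity distances. The analytic `sphere_sdf` below is a hypothetical stand‑in for a learned per‑object SDF network; the paper's actual fields are neural.

```python
import numpy as np

def sphere_sdf(center, radius):
    """Analytic stand-in for a learned per-entity SDF network."""
    center = np.asarray(center, dtype=float)
    return lambda pts: np.linalg.norm(pts - center, axis=-1) - radius

def scene_sdf(entity_sdfs, pts):
    """Compose per-entity SDFs into one scene SDF (union = pointwise min)."""
    values = np.stack([f(pts) for f in entity_sdfs], axis=0)
    return values.min(axis=0)

# Two "objects": a unit sphere at the origin and a small sphere at x = 3.
entities = [sphere_sdf([0, 0, 0], 1.0), sphere_sdf([3, 0, 0], 0.5)]
pts = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
print(scene_sdf(entities, pts))  # negative inside an entity, positive outside
```

Because each entity keeps its own field, any object can be extracted, edited, or re-posed independently of the rest of the scene.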
Two‑stage optimization
Geometry stage: Initialize the geometry from the multi‑modal losses, then apply an SDS loss that distills knowledge from a pretrained Stable Diffusion model into the SDF. The loss is computed on rendered depth/normal maps and encourages plausible shape completion in unobserved regions.
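A minimal sketch of the SDS update on a rendered map (depth or normal): noise the render, ask the diffusion prior to predict the noise, and use the prediction error as a gradient on the render. Here `denoiser` is a stand‑in for Stable Diffusion's noise‑prediction network, and the weighting `w = 1 - alpha_bar` is one common SDS choice, not necessarily the paper's exact schedule.

```python
import numpy as np

def sds_gradient(rendered, denoiser, t, alpha_bar, rng):
    """One SDS step: gradient = w(t) * (predicted noise - true noise),
    to be backpropagated through the renderer into the SDF parameters."""
    eps = rng.standard_normal(rendered.shape)          # true injected noise
    noisy = np.sqrt(alpha_bar) * rendered + np.sqrt(1 - alpha_bar) * eps
    eps_pred = denoiser(noisy, t)                      # prior's noise estimate
    w = 1 - alpha_bar                                  # common SDS weighting
    return w * (eps_pred - eps)
```

In the real pipeline this gradient flows through a differentiable renderer back into the SDF weights, so the prior "hallucinates" plausible geometry only where the image losses provide no signal.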
Appearance stage: Render the optimized meshes with Nvdiffrast, fuse the input image colors with the diffusion prior, and optimize per‑vertex UV texture maps. The result is a high‑resolution UV atlas for each object.
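The quantity being optimized in this stage is a UV texture atlas per object. As a rough sketch of what the differentiable texture lookup does (Nvdiffrast provides its own GPU implementation; this NumPy version is only illustrative), bilinear sampling of an atlas at continuous UV coordinates looks like:

```python
import numpy as np

def sample_texture(atlas, uv):
    """Bilinearly sample an (H, W, 3) atlas at continuous UVs in [0, 1]."""
    h, w, _ = atlas.shape
    x = uv[..., 0] * (w - 1)
    y = uv[..., 1] * (h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    fx, fy = (x - x0)[..., None], (y - y0)[..., None]
    top = atlas[y0, x0] * (1 - fx) + atlas[y0, x1] * fx
    bot = atlas[y1, x0] * (1 - fx) + atlas[y1, x1] * fx
    return top * (1 - fy) + bot * fy
```

Because the lookup is differentiable in the atlas values, gradients from both the input-image loss and the diffusion prior can update the texels directly.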
Visibility‑guided weighting
A per‑pixel visibility weight w_vis is derived from a visibility grid accumulated during rasterization. The final SDS loss is weighted as L_SDS·(1‑w_vis) + L_img·w_vis, ensuring that the diffusion prior dominates only in regions that are invisible or heavily occluded in the input views.
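The per‑pixel blend described above is straightforward to write down; this is a minimal sketch with toy loss maps (the names `l_sds`, `l_img`, `w_vis` mirror the formula, not the paper's code):

```python
import numpy as np

def blended_loss(l_sds, l_img, w_vis):
    """Per-pixel blend: the diffusion prior (SDS) dominates unseen regions
    (w_vis -> 0), the image reconstruction loss dominates observed ones."""
    return l_sds * (1.0 - w_vis) + l_img * w_vis

l_sds = np.full((2, 2), 4.0)   # toy prior-based loss map
l_img = np.full((2, 2), 1.0)   # toy image reconstruction loss map
w_vis = np.array([[1.0, 0.0],
                  [0.5, 0.25]])
print(blended_loss(l_sds, l_img, w_vis).mean())  # prints 2.6875
```

The weighting prevents the prior from "repainting" surfaces that are already well constrained by the input photos, which is what keeps observed regions faithful while occluded ones get plausibly completed.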
Experiments
Evaluations on the Replica and ScanNet++ benchmarks show that DP‑Recon outperforms prior methods in both overall scene reconstruction and object‑wise reconstruction under sparse view settings (e.g., 10–15 images). Quantitative metrics such as Chamfer distance, normal consistency, and PSNR are improved, and qualitative results exhibit smoother backgrounds, fewer artifacts, and accurate recovery of heavily occluded objects.
Applications
Using only 15 frames extracted from a YouTube walkthrough, DP‑Recon produces textured meshes that can be directly imported into Blender or game engines. The diffusion prior also enables text‑driven editing (e.g., “turn the vase into a teddy bear”) and novel‑view synthesis with high fidelity.
Resources
Paper: https://arxiv.org/abs/2503.14830
Project page: https://dp-recon.github.io/
Code repository: https://github.com/DP-Recon/DP-Recon
