How MMGDreamer Achieves Precise Geometry Control in 3D Indoor Scene Generation
MMGDreamer introduces a mixed‑modality graph and a dual‑branch diffusion model that combine text, image, and relational cues to generate highly realistic, geometrically controllable 3D indoor scenes, outperforming prior methods on multiple quantitative and qualitative benchmarks.
1. Introduction
Generating high‑fidelity, geometrically controllable 3D indoor scenes is essential for virtual reality, interior design, and game development. Existing methods rely on textual scene graphs, which struggle to describe fine‑grained geometry and lack multimodal flexibility. MMGDreamer addresses these gaps with a mixed‑modality graph (MMG) and a dual‑branch diffusion framework.
2. Background and Motivation
Controllable 3D scene generation requires realistic appearance and precise geometric layout. Text‑only graph representations cannot encode object shapes or accept diverse inputs such as images, motivating a representation that fuses multiple modalities and predicts missing relational information.
3. Method
3.1 Input Representation: Mixed‑Modality Graph (MMG)
Each node in the MMG can carry text, image, or both; edges encode spatial or semantic relations, which may be optionally provided. Nodes are encoded with CLIP (text encoder for textual features, vision encoder for visual features). Category and relation embeddings are added, and missing modalities are zero‑padded to keep tensor shapes consistent.
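The zero‑padding idea can be illustrated with a small data structure. This is only a sketch under stated assumptions: the names `Node`, `MixedModalityGraph`, `pad`, and `node_features` are hypothetical, the 4‑dimensional features stand in for real CLIP embeddings (which are far larger), and the category and relation embeddings mentioned above are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

EMBED_DIM = 4  # toy size chosen for readability, not the real CLIP width


@dataclass
class Node:
    category: str
    text_feat: Optional[List[float]] = None   # set when a text description exists
    image_feat: Optional[List[float]] = None  # set when a reference image exists


@dataclass
class MixedModalityGraph:
    nodes: List[Node] = field(default_factory=list)
    # (source index, target index) -> relation label, or None if not provided
    edges: Dict[Tuple[int, int], Optional[str]] = field(default_factory=dict)


def pad(feat: Optional[List[float]]) -> List[float]:
    """Return the feature unchanged, or a zero vector when the modality is missing."""
    if feat is None:
        return [0.0] * EMBED_DIM
    if len(feat) != EMBED_DIM:
        raise ValueError("feature has unexpected length")
    return list(feat)


def node_features(node: Node) -> List[float]:
    """Concatenate text and image features so every node has the same length."""
    return pad(node.text_feat) + pad(node.image_feat)


# A text-only node, an image-only node, and an edge whose relation is unknown.
bed = Node(category="bed", text_feat=[0.1, 0.2, 0.3, 0.4])
lamp = Node(category="lamp", image_feat=[0.5, 0.6, 0.7, 0.8])
graph = MixedModalityGraph(nodes=[bed, lamp], edges={(0, 1): None})
features = [node_features(n) for n in graph.nodes]
```

Because every node yields a vector of length `2 * EMBED_DIM`, the node features can be stacked into one tensor regardless of which modalities each node carries.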
3.2 Graph Enhancement Module
Visual Enhancement Module: For nodes containing only text, a VQ‑VAE‑style encoder‑quantizer‑decoder pipeline generates complementary visual features, improving geometric control.
Relation Predictor: A graph convolutional network (GCN) followed by an MLP predicts the attributes of absent edges, which enter the network as zero‑padded edge features; training uses a cross‑entropy loss for relation classification.
The output is a visual‑enhanced, relation‑completed graph.
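Two building blocks named above, the quantizer step of a VQ‑VAE‑style pipeline and the cross‑entropy loss for relation classification, can be sketched in a few lines. This is a generic illustration, not the authors' code: the function names, the toy codebook, and the three relation classes are assumptions, and the encoder, decoder, GCN, and MLP are omitted.

```python
import math
from typing import List, Sequence


def quantize(z: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook vector nearest to z (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(z, codebook[k])),
    )


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability assigned to the true relation class."""
    return -math.log(softmax(logits)[target])


# Hypothetical toy codebook and an encoder output to snap onto it.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
code_index = quantize([0.9, 0.1], codebook)

# Hypothetical relation classes; the logits would come from the GCN + MLP.
relations = ["left of", "right of", "close by"]
loss = cross_entropy([2.0, 0.5, -1.0], target=relations.index("left of"))
```

The quantizer replaces a continuous encoder output with its nearest learned code, and the loss is small when the predictor puts most of its probability on the annotated relation.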
3.3 Dual‑Branch Diffusion Model
The diffusion model shares a common graph encoder and consists of two branches:
Layout Branch: Encodes object bounding boxes (position, size, rotation) and guides a 1‑D UNet denoiser to generate coherent scene layouts.
Shape Branch: Represents object geometry with Truncated Signed Distance Fields (TSDF) encoded into latent vectors by a pretrained VQ‑VAE; a 3‑D UNet denoiser produces detailed object shapes conditioned on the graph representation.
Both branches predict noise residuals; the graph encoder incorporates an echo mechanism to facilitate information exchange among nodes.
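The phrase "predict noise residuals" can be made concrete with the standard DDPM noise‑prediction objective. The sketch below assumes that formulation with a linear beta schedule; the schedule values, the function names, and the 7‑value box layout are illustrative assumptions rather than details taken from the paper, and the UNet denoiser is replaced by a placeholder.

```python
import math
import random
from typing import List, Sequence


def alpha_bars(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        bars.append(prod)
    return bars


def add_noise(x0: Sequence[float], eps: Sequence[float], a_bar: float) -> List[float]:
    """Forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    signal, noise = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [signal * x + noise * e for x, e in zip(x0, eps)]


def noise_loss(eps_pred: Sequence[float], eps: Sequence[float]) -> float:
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


rng = random.Random(0)
bars = alpha_bars(1000)

# Hypothetical box encoding: position (3), size (3), rotation angle (1).
box = [0.2, 0.0, -0.5, 1.6, 0.5, 2.0, 0.0]
t = rng.randrange(len(bars))
eps = [rng.gauss(0.0, 1.0) for _ in box]
noisy_box = add_noise(box, eps, bars[t])

# Placeholder for the denoiser; a trained network would take noisy_box, t,
# and the graph condition, and return its estimate of eps.
eps_pred = [0.0] * len(box)
loss = noise_loss(eps_pred, eps)
```

In a setup like the one described, each branch would apply this kind of loss to its own data (box parameters for layout, shape latents for geometry), with the shared graph encoder supplying the conditioning.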
4. Training and Inference
Training proceeds in two stages. First, the visual enhancement module and relation predictor are trained independently to improve node visual quality and relational accuracy. Second, the full MMG is fed into the graph encoder and the dual‑branch diffusion model, jointly optimizing layout and shape generation. During inference, an input MMG passes through the enhancement and prediction modules to produce a visual‑enhanced graph, which is decoded by the diffusion model into a high‑quality 3D indoor scene.
5. Experiments
5.1 Quantitative Results
On the SG‑FRONT dataset, MMGDreamer (with both visual enhancement and relation prediction) reduces FID by 9%, FID‑CLIP by 8%, and KID by 33% for living‑room generation compared with the state‑of‑the‑art EchoScene. Lower values on these metrics mean the generated scenes are closer to the real data distribution, which indicates more realistic scenes.
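Percentages like these are most naturally read as relative reductions against the baseline score, which is an assumption here since the summary does not spell out the formula. A minimal sketch follows; the function name is hypothetical and the scores are made‑up placeholders, not values from the paper.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline` (lower-is-better metrics)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - ours) / baseline * 100.0


# Placeholder scores for illustration only.
reduction = relative_reduction(baseline=100.0, ours=91.0)
```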
5.2 Qualitative Results
Visual comparisons on bedroom, dining‑room, and living‑room scenes show that MMGDreamer accurately reconstructs object geometry (e.g., beds, chairs, cabinets) and preserves spatial coherence, while competing methods exhibit noticeable distortions and missing details.
5.3 Object‑level Generation Quality
Object‑wise evaluation using PointFlow‑based metrics (Minimum Matching Distance, Coverage, and 1‑Nearest Neighbor Accuracy) shows that MMGDreamer achieves the best scores, confirming high‑precision geometry and distribution similarity at the object level.
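For readers unfamiliar with these three metrics, the sketch below computes them on tiny point clouds using a Chamfer distance. It follows the commonly used definitions and is not the evaluation code behind the reported numbers; the function names and toy clouds are assumptions, and real evaluations use many more points and objects.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Cloud = Sequence[Point]


def sq_dist(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer(a: Cloud, b: Cloud) -> float:
    """Symmetric Chamfer distance: mean nearest squared distance in both directions."""
    a_to_b = sum(min(sq_dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(sq_dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


def mmd(generated: List[Cloud], reference: List[Cloud]) -> float:
    """Minimum Matching Distance: mean distance from each reference to its closest sample."""
    return sum(min(chamfer(r, g) for g in generated) for r in reference) / len(reference)


def coverage(generated: List[Cloud], reference: List[Cloud]) -> float:
    """Fraction of reference clouds that are the nearest reference of some sample."""
    matched = {
        min(range(len(reference)), key=lambda j: chamfer(g, reference[j]))
        for g in generated
    }
    return len(matched) / len(reference)


def one_nna(generated: List[Cloud], reference: List[Cloud]) -> float:
    """Leave-one-out 1-NN accuracy on the merged set; 0.5 means indistinguishable sets."""
    clouds = [(c, 0) for c in generated] + [(c, 1) for c in reference]
    correct = 0
    for i, (cloud, label) in enumerate(clouds):
        nearest = min(
            (j for j in range(len(clouds)) if j != i),
            key=lambda j: chamfer(cloud, clouds[j][0]),
        )
        correct += int(clouds[nearest][1] == label)
    return correct / len(clouds)


# Toy data: two generated and two reference clouds of two points each.
generated = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0)]]
reference = [[(0.0, 0.1, 0.0), (1.0, 0.1, 0.0)], [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0)]]
scores = (mmd(generated, reference), coverage(generated, reference), one_nna(generated, reference))
```

Lower MMD and higher coverage are better, while 1‑NNA is best when it is close to 50%, since that means a nearest‑neighbor classifier cannot tell generated objects from real ones.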
6. Conclusion
MMGDreamer introduces a novel framework that fuses multimodal inputs through a mixed‑modality graph and leverages a dual‑branch diffusion model to produce geometrically precise, realistic 3D indoor scenes. The visual enhancement module enriches textual nodes, and the relation predictor fills missing relational cues, leading to superior layout and shape quality. Extensive experiments validate its advantage over existing methods, providing a strong foundation for VR/AR, interior design, and game development applications.
Paper: https://arxiv.org/pdf/2502.05874v2
Project page: https://yangzhifeio.github.io/project/MMGDreamer
Code: https://github.com/yangzhifeio/MMGDreamer