How MMGDreamer Achieves Precise Geometry Control in 3D Indoor Scene Generation

MMGDreamer introduces a mixed‑modality graph and a dual‑branch diffusion model that combine text, image, and relational cues to generate highly realistic, geometrically controllable 3D indoor scenes, outperforming prior methods on multiple quantitative and qualitative benchmarks.

AI Frontier Lectures

1. Introduction

Generating high‑fidelity, geometrically controllable 3D indoor scenes is essential for virtual reality, interior design, and game development. Existing methods rely on textual scene graphs, which struggle to describe fine‑grained geometry and lack multimodal flexibility. MMGDreamer addresses these gaps with a mixed‑modality graph (MMG) and a dual‑branch diffusion framework.

MMGDreamer overview

2. Background and Motivation

Controllable 3D scene generation requires realistic appearance and precise geometric layout. Text‑only graph representations cannot encode object shapes or accept diverse inputs such as images, motivating a representation that fuses multiple modalities and predicts missing relational information.

3. Method

3.1 Input Representation: Mixed‑Modality Graph (MMG)

Each node in the MMG can carry text, an image, or both. Edges encode spatial or semantic relations and are optional, so a user may leave some or all of them unspecified. Nodes are encoded with CLIP: the text encoder produces textual features and the vision encoder produces visual features. Category and relation embeddings are added, and missing modalities are zero‑padded to keep tensor shapes consistent.
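The zero-padding step can be illustrated with a minimal sketch. It assumes 512-dimensional embeddings (the real width depends on the CLIP variant), a 64-dimensional category embedding, random vectors standing in for actual CLIP outputs, and concatenation as the way the parts are combined. `build_node_feature`, `EMBED_DIM`, and `CATEGORY_DIM` are hypothetical names, and the paper's exact fusion may differ.

```python
import numpy as np

EMBED_DIM = 512     # assumed CLIP embedding width
CATEGORY_DIM = 64   # assumed category embedding width


def build_node_feature(text_emb, image_emb, category_emb):
    """Concatenate text, image, and category embeddings for one node.

    A missing modality (None) is replaced by a zero vector so that every
    node feature has the same length.
    """
    text = np.zeros(EMBED_DIM) if text_emb is None else np.asarray(text_emb, dtype=float)
    image = np.zeros(EMBED_DIM) if image_emb is None else np.asarray(image_emb, dtype=float)
    return np.concatenate([text, image, np.asarray(category_emb, dtype=float)])


# Random stand-ins for real CLIP and category embeddings.
rng = np.random.default_rng(0)
category = rng.standard_normal(CATEGORY_DIM)

text_only = build_node_feature(rng.standard_normal(EMBED_DIM), None, category)
image_only = build_node_feature(None, rng.standard_normal(EMBED_DIM), category)
both = build_node_feature(
    rng.standard_normal(EMBED_DIM), rng.standard_normal(EMBED_DIM), category
)

# All three nodes stack into one matrix because their lengths match.
node_matrix = np.stack([text_only, image_only, both])
print(node_matrix.shape)  # (3, 1088)
```

Because every node ends up with the same width, text-only, image-only, and mixed nodes can share one graph encoder without special cases.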

3.2 Graph Enhancement Module

Visual Enhancement Module: For nodes containing only text, a VQ‑VAE‑style encoder‑quantizer‑decoder pipeline generates complementary visual features, improving geometric control.

Relation Predictor: A graph convolutional network (GCN) followed by an MLP predicts absent edge attributes using zero‑padded edge features; training uses cross‑entropy loss for relation classification.

The output is a visual‑enhanced, relation‑completed graph.
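The relation predictor described above can be sketched in NumPy as one graph-convolution step, an MLP over node pairs, and a cross-entropy loss. This is only an illustration of the idea, not the authors' architecture: the layer sizes, the mean-aggregation rule, the number of relation classes, and all function names are assumptions.

```python
import numpy as np

NUM_RELATIONS = 5   # assumed number of relation classes
EDGE_DIM = 8        # assumed width of an edge feature
HIDDEN = 32         # assumed hidden width


def gcn_layer(node_feats, adjacency, weight):
    """Mean-aggregate each node with its neighbours (self-loop included), then linear + ReLU."""
    a_hat = adjacency + np.eye(adjacency.shape[0])
    deg = a_hat.sum(axis=1, keepdims=True)
    return np.maximum((a_hat / deg) @ node_feats @ weight, 0.0)


def relation_logits(node_feats, edges, edge_feats, params):
    """Score every relation class for each (source, target) edge."""
    n = node_feats.shape[0]
    adjacency = np.zeros((n, n))
    adjacency[edges[:, 0], edges[:, 1]] = 1.0
    adjacency[edges[:, 1], edges[:, 0]] = 1.0
    h = gcn_layer(node_feats, adjacency, params["w_gcn"])
    # Edge input = source node, target node, and the (possibly zero-padded) edge feature.
    pair = np.concatenate([h[edges[:, 0]], h[edges[:, 1]], edge_feats], axis=1)
    hidden = np.maximum(pair @ params["w1"], 0.0)
    return hidden @ params["w2"]


def cross_entropy(logits, labels):
    """Mean cross-entropy with a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


rng = np.random.default_rng(0)
node_feats = rng.standard_normal((3, 16))      # three objects, toy features
edges = np.array([[0, 1], [1, 2]])             # two edges with unknown relations
edge_feats = np.zeros((len(edges), EDGE_DIM))  # unknown relations are zero-padded
params = {
    "w_gcn": 0.1 * rng.standard_normal((16, HIDDEN)),
    "w1": 0.1 * rng.standard_normal((2 * HIDDEN + EDGE_DIM, HIDDEN)),
    "w2": 0.1 * rng.standard_normal((HIDDEN, NUM_RELATIONS)),
}

logits = relation_logits(node_feats, edges, edge_feats, params)
labels = np.array([2, 4])                      # toy ground-truth relation ids
loss = cross_entropy(logits, labels)           # quantity minimised during training
predicted = logits.argmax(axis=1)              # relation chosen at inference time
```

At inference the arg-max class would fill in each missing edge, yielding the relation-completed graph.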

3.3 Dual‑Branch Diffusion Model

The diffusion model shares a common graph encoder and consists of two branches:

Layout Branch: Encodes object bounding boxes (position, size, rotation) and guides a 1‑D UNet denoiser to generate coherent scene layouts.

Shape Branch: Represents object geometry with Truncated Signed Distance Fields (TSDF) encoded into latent vectors by a pretrained VQ‑VAE; a 3‑D UNet denoiser produces detailed object shapes conditioned on the graph representation.

Both branches predict noise residuals; the graph encoder incorporates an echo mechanism to facilitate information exchange among nodes.
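Predicting noise residuals follows the standard denoising-diffusion recipe: corrupt clean data with Gaussian noise at a random step, then train the denoiser to recover that noise. The sketch below shows this for layout boxes. The linear noise schedule, the 7-value box encoding (3 position, 3 size, 1 rotation), the conditioning width, and all names are assumptions for illustration; the paper's schedule and parameterisation may differ.

```python
import numpy as np


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps


def noise_prediction_loss(denoiser, x0, cond, t, alpha_bar, rng):
    """Mean squared error between the true noise and the denoiser's prediction."""
    x_t, eps = add_noise(x0, t, alpha_bar, rng)
    return float(np.mean((denoiser(x_t, t, cond) - eps) ** 2))


def zero_denoiser(x_t, t, cond):
    """Placeholder for the UNet: always predicts zero noise."""
    return np.zeros_like(x_t)


rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar(1000)
boxes = rng.standard_normal((4, 7))        # 4 objects, assumed 7-value box encoding
graph_cond = rng.standard_normal((4, 32))  # stand-in for graph-encoder output
loss = noise_prediction_loss(zero_denoiser, boxes, graph_cond, 500, alpha_bar, rng)
```

The shape branch would use the same objective on VQ-VAE latents instead of box parameters, with the graph features passed as `cond` in both cases.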

4. Training and Inference

Training proceeds in two stages. First, the visual enhancement module and relation predictor are trained independently to improve node visual quality and relational accuracy. Second, the full MMG is fed into the graph encoder and the dual‑branch diffusion model, jointly optimizing layout and shape generation. During inference, an input MMG passes through the enhancement and prediction modules to produce a visual‑enhanced graph, which is decoded by the diffusion model into a high‑quality 3D indoor scene.

5. Experiments

5.1 Quantitative Results

On the SG‑FRONT dataset, MMGDreamer (with both visual enhancement and relation prediction) reduces FID by 9%, FID‑CLIP by 8%, and KID by 33% for living‑room generation compared to the state‑of‑the‑art EchoScene, demonstrating superior realism and geometric control.
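For FID, FID-CLIP, and KID, lower is better, so a reduction means the new score is a smaller fraction of the baseline. Assuming the percentages above are relative reductions, they would be computed as below. The numbers in the usage line are placeholders chosen for the arithmetic, not scores from the paper, and `relative_reduction` is a hypothetical helper.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline` (lower-is-better metric)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - ours) / baseline


# Placeholder values only: a baseline of 100.0 and a new score of 91.0.
print(round(relative_reduction(100.0, 91.0), 1))  # 9.0
```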

5.2 Qualitative Results

Visual comparisons on bedroom, dining‑room, and living‑room scenes show that MMGDreamer accurately reconstructs object geometry (e.g., beds, chairs, cabinets) and preserves spatial coherence, while competing methods exhibit noticeable distortions and missing details.

Qualitative comparison

5.3 Object‑level Generation Quality

Object‑wise evaluation using PointFlow‑based metrics (Minimum Matching Distance, Coverage, and 1‑Nearest Neighbor Accuracy) shows that MMGDreamer achieves the best scores, confirming high‑precision geometry and distribution similarity at the object level.
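The three metrics compare a set of generated shapes against a set of reference shapes, each given as a point cloud. The sketch below uses the Chamfer distance as the shape-to-shape distance, which is one common choice; the point counts, toy data, and function names are assumptions, and a real evaluation would use the benchmark's own sampling and distance settings.

```python
import numpy as np


def chamfer(a, b):
    """Symmetric Chamfer distance between two point clouds of shape (N, 3) and (M, 3)."""
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()


def pairwise(set_a, set_b):
    """Matrix of Chamfer distances between every shape in set_a and set_b."""
    return np.array([[chamfer(a, b) for b in set_b] for a in set_a])


def mmd(gen, ref):
    """Minimum Matching Distance: mean distance from each reference to its closest generated shape."""
    return float(pairwise(ref, gen).min(axis=1).mean())


def coverage(gen, ref):
    """Coverage: fraction of reference shapes that are the nearest neighbour of some generated shape."""
    nearest_ref = pairwise(gen, ref).argmin(axis=1)
    return len(set(nearest_ref.tolist())) / len(ref)


def one_nna(gen, ref):
    """1-NN accuracy: leave-one-out nearest-neighbour classification of generated vs. reference."""
    shapes = list(gen) + list(ref)
    labels = np.array([0] * len(gen) + [1] * len(ref))
    d = pairwise(shapes, shapes)
    np.fill_diagonal(d, np.inf)  # a shape may not match itself
    return float((labels[d.argmin(axis=1)] == labels).mean())


# Toy data: six reference clouds and slightly perturbed "generated" copies.
rng = np.random.default_rng(0)
ref = rng.standard_normal((6, 64, 3))
gen = ref + 0.05 * rng.standard_normal(ref.shape)
print(mmd(gen, ref), coverage(gen, ref), one_nna(gen, ref))
```

Lower MMD and higher coverage are better. For 1-NN accuracy the ideal value is 50%, meaning a nearest-neighbour classifier cannot tell generated shapes from reference ones.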

6. Conclusion

MMGDreamer introduces a novel framework that fuses multimodal inputs through a mixed‑modality graph and leverages a dual‑branch diffusion model to produce geometrically precise, realistic 3D indoor scenes. The visual enhancement module enriches textual nodes, and the relation predictor fills missing relational cues, leading to superior layout and shape quality. Extensive experiments validate its advantage over existing methods, providing a strong foundation for VR/AR, interior design, and game development applications.

Paper: https://arxiv.org/pdf/2502.05874v2

Project page: https://yangzhifeio.github.io/project/MMGDreamer

Code: https://github.com/yangzhifeio/MMGDreamer
