Can Mixed‑Modality Graphs Unlock Precise 3D Indoor Scene Generation?

MMGDreamer introduces a mixed‑modality graph and a dual‑branch diffusion model that jointly enhance geometric control and realism in 3D indoor scene synthesis, outperforming state‑of‑the‑art methods across multiple quantitative and qualitative benchmarks.

AI Frontier Lectures

Introduction

MMGDreamer is a framework for generating geometrically accurate and controllable 3D indoor scenes. It introduces a Mixed‑Modality Graph (MMG) that can encode text, images, or both for each object, enabling flexible multimodal input and improving geometric precision over text‑only methods.

Methodology

Mixed‑Modality Graph (MMG)

Each node in the MMG stores a textual description, an image, or both. Edges represent spatial relationships such as "left of" or "near". Missing modalities and missing relations are padded with zeros so that the graph tensor has a fixed shape.
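
The zero-padding idea can be illustrated with a minimal sketch. The embedding width, the relation vocabulary, and the names Node, pad, and build_graph are assumptions made for illustration; they are not taken from the MMGDreamer code.

```python
from dataclasses import dataclass
from typing import List, Optional

EMBED_DIM = 4  # hypothetical embedding width, kept tiny for readability
RELATIONS = ["none", "left of", "near"]  # index 0 marks a missing relation


@dataclass
class Node:
    category: str
    text_emb: Optional[List[float]] = None   # None means "no text given"
    image_emb: Optional[List[float]] = None  # None means "no image given"


def pad(vec: Optional[List[float]]) -> List[float]:
    """Return the embedding, or a zero vector when the modality is missing."""
    return list(vec) if vec is not None else [0.0] * EMBED_DIM


def build_graph(nodes, edges):
    """Pack nodes and edges into fixed-shape lists.

    edges is a list of (source_index, relation_name, target_index).
    node_features[i] is the concatenation [text | image];
    adjacency[i][j] is a relation index, 0 when no relation is given.
    """
    feats = [pad(n.text_emb) + pad(n.image_emb) for n in nodes]
    adj = [[0] * len(nodes) for _ in nodes]
    for s, rel, t in edges:
        adj[s][t] = RELATIONS.index(rel)
    return feats, adj


nodes = [
    Node("bed", text_emb=[0.1, 0.2, 0.3, 0.4]),                # text only
    Node("lamp", image_emb=[0.5, 0.6, 0.7, 0.8]),              # image only
    Node("desk", [0.9, 0.8, 0.7, 0.6], [0.1, 0.1, 0.1, 0.1]),  # both
]
feats, adj = build_graph(nodes, [(1, "left of", 0), (2, "near", 0)])
```

Every node ends up with a feature vector of the same length regardless of which modalities the user supplied, which is what lets a single network consume text-only, image-only, and mixed nodes.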

Visual Enhancement Module

For nodes that lack visual data, a CLIP encoder extracts a textual embedding, which is then quantized by a pretrained VQ‑VAE codebook. The quantized vector is decoded back into a visual feature that augments the node, allowing the graph to retain rich geometric cues even when only text is provided.
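
The quantize-then-decode step can be sketched as a nearest-neighbor codebook lookup followed by a decoder. The toy codebook, the linear decoder, and the function names below are hypothetical stand-ins; the actual module uses learned CLIP and VQ‑VAE weights.

```python
def nearest_code(vec, codebook):
    """Return (index, entry) of the codebook row closest to vec (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idx = min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))
    return idx, codebook[idx]


def decode(code, weight):
    """Hypothetical linear decoder: visual_feature = weight @ code."""
    return [sum(w * c for w, c in zip(row, code)) for row in weight]


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # toy codebook with 3 entries
text_emb = [0.9, 0.2]                             # stand-in for a CLIP text embedding
decoder_w = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # toy decoder weights (3 x 2)

idx, code = nearest_code(text_emb, codebook)
visual_feat = decode(code, decoder_w)
```

Because the output is always built from a codebook entry, the synthesized visual feature stays within the range of features the decoder was trained on, even for unusual text prompts.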

Relation Predictor

A Graph Convolutional Network (GCN) equipped with an echo mechanism processes triples (source‑node, edge, target‑node). It predicts missing edge types and refines existing relations, producing a Mixed‑Enhanced Graph that contains both original and inferred relationships.
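
As a rough structural illustration, the sketch below runs one simplified round of message passing over triples and then classifies the relation between two nodes with a linear scorer. The echo mechanism and all learned layers are omitted, and the function names, feature sizes, and weights are assumptions for illustration only.

```python
def triple_message_pass(node_feats, triples, edge_feats):
    """One simplified round of message passing over (s, e, t) triples.

    Each triple sends the sum of its source, edge, and target features to
    both endpoints; every node then averages what it received.
    Learned MLPs are omitted, so this is only a structural sketch.
    """
    dim = len(node_feats[0])
    inbox = [[] for _ in node_feats]
    for s, e, t in triples:
        msg = [node_feats[s][k] + edge_feats[e][k] + node_feats[t][k]
               for k in range(dim)]
        inbox[s].append(msg)
        inbox[t].append(msg)
    updated = []
    for i, msgs in enumerate(inbox):
        if not msgs:
            updated.append(list(node_feats[i]))  # isolated node: unchanged
        else:
            updated.append([sum(m[k] for m in msgs) / len(msgs)
                            for k in range(dim)])
    return updated


def predict_relation(feat_a, feat_b, class_weights):
    """Pick the relation class with the highest linear score on [a | b]."""
    pair = feat_a + feat_b
    scores = [sum(w * x for w, x in zip(row, pair)) for row in class_weights]
    return max(range(len(scores)), key=scores.__getitem__)


node_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edge_feats = [[0.0, 0.0], [0.5, 0.5]]      # one feature vector per edge type
triples = [(0, 1, 1)]                      # node 0 --type 1--> node 1
updated = triple_message_pass(node_feats, triples, edge_feats)

# Hypothetical classifier weights: 3 relation classes, 4 input features.
class_weights = [[0, 0, 0, 0], [1, 0, -1, 0], [0, 1, 0, 1]]
missing_edge_type = predict_relation(updated[0], updated[2], class_weights)
```

The predicted type for the unconnected pair (node 0, node 2) would then be written back into the graph as an inferred edge.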

Dual‑Branch Diffusion Model

Graph Encoder: A GCN‑based encoder converts the enhanced graph into a latent conditioning vector.

Layout Branch: A 1‑D UNet denoises this latent vector to generate object bounding boxes (position, size, rotation) for the scene layout.

Shape Branch: Object shapes are represented as Truncated Signed Distance Fields (TSDF). A pretrained VQ‑VAE encodes TSDFs into latent codes, which a 3‑D UNet denoises to synthesize detailed geometry.
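
For readers unfamiliar with the shape representation, here is a minimal sketch of a TSDF for a sphere sampled on a small voxel grid. The grid extent, the resolution, the sign convention (negative inside), and the function name sphere_tsdf are assumptions chosen for illustration, not settings from the paper.

```python
import math


def sphere_tsdf(resolution, radius, truncation):
    """Sample a sphere's signed distance on a cubic grid and truncate it.

    The grid spans [-1, 1]^3. Values are clamped to [-truncation, truncation]
    and divided by truncation, so the result lies in [-1, 1]:
    negative inside the surface, positive outside.
    """
    step = 2.0 / (resolution - 1)
    grid = []
    for i in range(resolution):
        plane = []
        for j in range(resolution):
            row = []
            for k in range(resolution):
                x, y, z = -1 + i * step, -1 + j * step, -1 + k * step
                sdf = math.sqrt(x * x + y * y + z * z) - radius
                clamped = max(-truncation, min(truncation, sdf))
                row.append(clamped / truncation)
            plane.append(row)
        grid.append(plane)
    return grid


tsdf = sphere_tsdf(resolution=5, radius=0.5, truncation=0.2)
```

Truncation discards distance information far from the surface, so the voxel grid concentrates its precision near the geometry that matters.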

Training and Inference

Stage 1: Train the visual enhancement module and the relation predictor independently, using cross‑entropy loss for relation classification and reconstruction loss for visual feature synthesis.

Stage 2: Jointly train the graph encoder together with the dual‑branch diffusion model. The loss combines layout regression (L2 loss between predicted and ground‑truth bounding boxes) and shape reconstruction (L2 loss on TSDF voxels).
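
Taking the Stage 2 description at face value, the combined objective can be sketched as a weighted sum of two mean-squared-error terms. The weights w_layout and w_shape, the box parameterization, and the function names are hypothetical; the source does not state how the two terms are balanced.

```python
def mse(pred, target):
    """Mean squared error between two equally sized flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def stage2_loss(box_pred, box_gt, tsdf_pred, tsdf_gt,
                w_layout=1.0, w_shape=1.0):
    """Weighted sum of layout and shape L2 terms (weights are hypothetical)."""
    return w_layout * mse(box_pred, box_gt) + w_shape * mse(tsdf_pred, tsdf_gt)


# Toy box: [x, y, z, width, height, depth, rotation]; only rotation is off.
box_pred = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
box_gt = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.7]
# Toy flattened TSDF with four voxels; only the last voxel is off.
tsdf_pred = [1.0, -1.0, 0.0, 0.5]
tsdf_gt = [1.0, -1.0, 0.0, 0.0]

loss = stage2_loss(box_pred, box_gt, tsdf_pred, tsdf_gt)
```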

Inference pipeline:

Construct the MMG from the user’s text/image inputs.

Apply the visual enhancement module to obtain visual features for text‑only nodes.

Run the relation predictor to fill missing edges.

Encode the resulting Mixed‑Enhanced Graph.

Condition the layout and shape diffusion branches to generate a complete 3D scene.
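
The five steps above can be sketched as a chain of function calls. Every function here is a hypothetical stub that only mimics the data flow; none of them is the real MMGDreamer API, and the trained networks are replaced by placeholders.

```python
def visual_enhance(graph):
    """Stand-in: give every node without an image a placeholder visual feature."""
    for node in graph["nodes"]:
        if node.get("image") is None:
            node["visual"] = f"synth({node.get('text')})"
        else:
            node["visual"] = node["image"]
    return graph


def predict_relations(graph):
    """Stand-in: connect every node pair that has no edge yet."""
    known = {(s, t) for s, _, t in graph["edges"]}
    n = len(graph["nodes"])
    for s in range(n):
        for t in range(s + 1, n):
            if (s, t) not in known and (t, s) not in known:
                graph["edges"].append((s, "predicted", t))
    return graph


def encode(graph):
    """Stand-in: the 'latent' is just a summary of the graph size."""
    return {"num_nodes": len(graph["nodes"]), "num_edges": len(graph["edges"])}


def layout_branch(cond):
    """Stand-in for the layout diffusion branch: one empty box per node."""
    return [{"box": None} for _ in range(cond["num_nodes"])]


def shape_branch(cond):
    """Stand-in for the shape diffusion branch: one empty TSDF per node."""
    return [{"tsdf": None} for _ in range(cond["num_nodes"])]


def generate_scene(nodes, edges):
    graph = {"nodes": [dict(n) for n in nodes], "edges": list(edges)}  # build MMG
    graph = visual_enhance(graph)          # enrich text-only nodes
    graph = predict_relations(graph)       # fill missing edges
    cond = encode(graph)                   # encode Mixed-Enhanced Graph
    boxes, shapes = layout_branch(cond), shape_branch(cond)  # two branches
    return [{**b, **s} for b, s in zip(boxes, shapes)], graph


scene, graph = generate_scene(
    nodes=[{"text": "bed"}, {"image": "lamp.png"},
           {"text": "desk", "image": "desk.png"}],
    edges=[(1, "left of", 0)],
)
```

The point of the sketch is the ordering: enhancement and relation prediction both run before encoding, so the diffusion branches only ever see a complete graph.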

Experiments

Quantitative Evaluation

MMGDreamer is evaluated on the SG‑FRONT dataset using three standard generative metrics:

FID (Fréchet Inception Distance)

FID‑CLIP (FID computed on CLIP features instead of Inception features)

KID (Kernel Inception Distance)
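
For reference, the standard definitions of the two base metrics are given below. The notation is an assumption of this summary: \mu_r, \Sigma_r and \mu_g, \Sigma_g denote the mean and covariance of features from real and generated renderings, x_i and y_j are the corresponding feature vectors, m and n are the sample counts, and d is the feature dimension. FID‑CLIP uses the first formula with CLIP features. Lower is better for all three.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g
  - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)

\mathrm{KID} = \frac{1}{m(m-1)} \sum_{i \neq j} k(x_i, x_j)
  + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j)
  - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j),
\qquad
k(x, y) = \left( \frac{x^{\top} y}{d} + 1 \right)^{3}
```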

Compared with the state‑of‑the‑art method EchoScene, MMGDreamer achieves:

9% lower FID

8% lower FID‑CLIP

33% lower KID

Qualitative Evaluation

Visual results on bedroom, dining, and living‑room scenes show that MMGDreamer preserves accurate object geometry and fine details, while competing methods exhibit distortions and missing parts.

Object‑Level Generation Quality

Following the evaluation protocol of PointFlow, generated object shapes are compared with reference shapes as point clouds using three metrics:

MMD (Minimum Matching Distance)

COV (Coverage)

1‑NNA (1‑Nearest Neighbor Accuracy)

MMGDreamer outperforms EchoScene on all three, indicating superior geometric fidelity and distribution similarity for individual objects.
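
To make the three metrics concrete, here is a minimal pure-Python sketch. It assumes Chamfer distance as the underlying point-cloud distance and uses toy two-point clouds; the function names and data are illustrative, not the evaluation code used in the paper.

```python
def chamfer(a, b):
    """Symmetric Chamfer distance between two point clouds (lists of 3-tuples)."""
    def sq(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))
    a_to_b = sum(min(sq(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(sq(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


def mmd(generated, reference):
    """Mean, over reference shapes, of the distance to the closest generated shape."""
    return sum(min(chamfer(g, r) for g in generated)
               for r in reference) / len(reference)


def coverage(generated, reference):
    """Fraction of reference shapes that are the nearest neighbor of some generated shape."""
    matched = {min(range(len(reference)),
                   key=lambda j: chamfer(g, reference[j]))
               for g in generated}
    return len(matched) / len(reference)


def one_nna(generated, reference):
    """Leave-one-out 1-NN classifier accuracy on the merged set; 0.5 is ideal."""
    shapes = [(s, 0) for s in generated] + [(s, 1) for s in reference]
    correct = 0
    for i, (s, label) in enumerate(shapes):
        j = min((k for k in range(len(shapes)) if k != i),
                key=lambda k: chamfer(s, shapes[k][0]))
        correct += shapes[j][1] == label
    return correct / len(shapes)


# Two reference shapes far apart; both generated shapes resemble the first one.
reference = [[(0, 0, 0), (1, 0, 0)], [(0, 5, 0), (1, 5, 0)]]
generated = [[(0, -0.1, 0), (1, -0.1, 0)], [(0, 0.1, 0), (1, 0.1, 0)]]

print("MMD:", mmd(generated, reference))
print("COV:", coverage(generated, reference))
print("1-NNA:", one_nna(generated, reference))
```

In this toy case the generator only imitates one of the two reference shapes, so coverage drops to one half: MMD rewards fidelity, COV penalizes mode collapse, and 1‑NNA measures how distinguishable the two sets are, with values near 0.5 indicating that they are hard to tell apart.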

Conclusion

MMGDreamer advances 3D indoor scene synthesis by:

Supporting multimodal inputs through the Mixed‑Modality Graph.

Enriching textual nodes with visual features via a CLIP‑VQ‑VAE pipeline.

Automatically inferring missing spatial relations with a GCN‑based predictor.

Generating layout and detailed geometry jointly with a dual‑branch diffusion architecture.

The model achieves state‑of‑the‑art performance on SG‑FRONT and is applicable to VR/AR, interior design, and game development.

Resources

Paper: https://arxiv.org/pdf/2502.05874v2

Project page: https://yangzhifeio.github.io/project/MMGDreamer

Code repository: https://github.com/yangzhifeio/MMGDreamer

Illustrations

MMGDreamer architecture diagram
Qualitative results
Quantitative comparison chart
Object generation quality