Can Mixed‑Modality Graphs Unlock Precise 3D Indoor Scene Generation?
MMGDreamer introduces a mixed‑modality graph and a dual‑branch diffusion model that jointly enhance geometric control and realism in 3D indoor scene synthesis, outperforming state‑of‑the‑art methods across multiple quantitative and qualitative benchmarks.
Introduction
MMGDreamer is a framework for generating geometrically accurate and controllable 3D indoor scenes. It introduces a Mixed‑Modality Graph (MMG) that can encode text, images, or both for each object, enabling flexible multimodal input and improving geometric precision over text‑only methods.
Methodology
Mixed‑Modality Graph (MMG)
Each node in the MMG stores a textual description, an image, or both. Edges represent spatial relationships such as "left of" or "near". Missing modalities or relations are padded with zeros so that the graph tensor has a fixed shape.
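As a rough illustration of the zero-padding idea, the sketch below packs nodes with optional text and image embeddings into fixed-shape arrays. The names Node, build_graph, EMBED_DIM and RELATIONS, the tiny embedding size, and the relation vocabulary are all assumptions made for this example; the paper's actual tensor layout may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

EMBED_DIM = 4  # hypothetical embedding size, kept tiny for readability
# Hypothetical relation vocabulary; id 0 acts as the zero padding for "no relation given".
RELATIONS = ["none", "left of", "near"]


@dataclass
class Node:
    name: str
    text_emb: Optional[List[float]] = None   # None means no text was provided
    image_emb: Optional[List[float]] = None  # None means no image was provided


def pad(emb: Optional[List[float]]) -> List[float]:
    """Return the embedding, or a zero vector when the modality is missing."""
    return list(emb) if emb is not None else [0.0] * EMBED_DIM


def build_graph(nodes: List[Node], edges: List[Tuple[int, str, int]]):
    """Pack nodes and edges into fixed-shape nested lists.

    features has shape [N, 2 * EMBED_DIM] (text part, then image part);
    adjacency has shape [N, N] and stores relation ids, 0 where unknown.
    """
    features = [pad(n.text_emb) + pad(n.image_emb) for n in nodes]
    size = len(nodes)
    adjacency = [[0] * size for _ in range(size)]
    for src, rel, dst in edges:
        adjacency[src][dst] = RELATIONS.index(rel)
    return features, adjacency


nodes = [
    Node("bed", text_emb=[0.1, 0.2, 0.3, 0.4]),                      # text only
    Node("lamp", image_emb=[0.5, 0.6, 0.7, 0.8]),                    # image only
    Node("wardrobe", text_emb=[1.0, 0.0, 0.0, 0.0],
         image_emb=[0.0, 1.0, 0.0, 0.0]),                            # both modalities
]
features, adjacency = build_graph(nodes, [(1, "left of", 0)])
```

Every row of features has the same length regardless of which modalities were supplied, which is what lets a downstream network consume the graph as a single tensor.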
Visual Enhancement Module
For nodes that lack visual data, a CLIP encoder extracts a textual embedding, which is then quantized by a pretrained VQ‑VAE codebook. The quantized vector is decoded back into a visual feature that augments the node, allowing the graph to retain rich geometric cues even when only text is provided.
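The quantization step can be pictured as a nearest-neighbor lookup in a codebook. The sketch below shows only that lookup, under the assumption that the text embedding already lives in the same space as the codebook entries; the function names and the toy two-dimensional codebook are made up for illustration, and the decoder that turns the code into a visual feature is not shown.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vec: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return the index and a copy of the codebook entry closest to vec."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(vec, codebook[i]))
    return index, list(codebook[index])


# Toy codebook with three 2-D entries (a real codebook is learned and much larger).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
text_embedding = [0.9, 0.2]
index, code = quantize(text_embedding, codebook)
```

Here the embedding snaps to the second entry, because [1.0, 0.0] is the closest code vector.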
Relation Predictor
A Graph Convolutional Network (GCN) equipped with an echo mechanism processes triples (source‑node, edge, target‑node). It predicts missing edge types and refines existing relations, producing a Mixed‑Enhanced Graph that contains both original and inferred relationships.
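To make the triple-based processing more concrete, here is a minimal sketch of one round of generic mean-aggregation message passing along (source, edge, target) triples. It is not the echo mechanism or the paper's relation predictor: the relation label is ignored, there are no learned weights, and the function name is hypothetical.

```python
from typing import List, Tuple


def message_passing_step(node_feats: List[List[float]],
                         triples: List[Tuple[int, str, int]]) -> List[List[float]]:
    """Average each node's feature with those of the nodes it shares a triple with."""
    dim = len(node_feats[0])
    sums = [list(f) for f in node_feats]  # every node starts with its own feature
    counts = [1] * len(node_feats)
    for src, _rel, dst in triples:
        # Messages flow in both directions along each triple.
        for k in range(dim):
            sums[dst][k] += node_feats[src][k]
            sums[src][k] += node_feats[dst][k]
        counts[dst] += 1
        counts[src] += 1
    return [[v / c for v in s] for s, c in zip(sums, counts)]


feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
updated = message_passing_step(feats, [(0, "left of", 1)])
```

After one step the two connected nodes share information, while the isolated third node is unchanged. A real predictor would additionally embed the edge type and classify a relation for each unlabeled pair.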
Dual‑Branch Diffusion Model
Graph Encoder: A GCN‑based encoder converts the enhanced graph into a latent conditioning vector.
Layout Branch: A 1‑D UNet, conditioned on this latent vector, denoises object bounding boxes (position, size, rotation) to produce the scene layout.
Shape Branch: Object shapes are represented as Truncated Signed Distance Fields (TSDF). A pretrained VQ‑VAE encodes TSDFs into latent codes, which a 3‑D UNet denoises to synthesize detailed geometry.
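For readers unfamiliar with TSDFs, the sketch below samples a truncated signed distance field for a simple sphere on a small voxel grid. The sign convention (negative inside), the normalization to [-1, 1], the grid extent, and all function names are assumptions chosen for this example, not details taken from the paper.

```python
import math
from typing import List


def sphere_sdf(x: float, y: float, z: float, radius: float) -> float:
    """Signed distance to a sphere centered at the origin (negative inside)."""
    return math.sqrt(x * x + y * y + z * z) - radius


def truncate(sdf: float, trunc: float) -> float:
    """Scale by the truncation distance and clamp to [-1, 1]."""
    return max(-1.0, min(1.0, sdf / trunc))


def tsdf_grid(resolution: int, radius: float, trunc: float) -> List[List[List[float]]]:
    """Sample the sphere's TSDF on a resolution^3 grid spanning [-1, 1]^3."""
    coords = [-1.0 + 2.0 * i / (resolution - 1) for i in range(resolution)]
    return [[[truncate(sphere_sdf(x, y, z, radius), trunc) for z in coords]
             for y in coords]
            for x in coords]


grid = tsdf_grid(resolution=5, radius=0.5, trunc=0.25)
```

Voxels far from the surface saturate at -1 or 1, so only a thin band around the surface carries distinct values. That band is the part a shape autoencoder has to capture.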
Training and Inference
Stage 1: Train the visual enhancement module and the relation predictor independently, using a reconstruction loss for visual feature synthesis and a cross‑entropy loss for relation classification.
Stage 2: Jointly train the graph encoder together with the dual‑branch diffusion model. The loss combines layout regression (L2 loss between predicted and ground‑truth bounding boxes) and shape reconstruction (L2 loss on TSDF voxels).
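The Stage 2 objective described above can be sketched as a sum of two mean-squared-error terms. The shape_weight parameter and the function names are assumptions for illustration; the summary does not state how the two terms are balanced.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def stage2_loss(pred_boxes: Sequence[float], gt_boxes: Sequence[float],
                pred_tsdf: Sequence[float], gt_tsdf: Sequence[float],
                shape_weight: float = 1.0) -> float:
    """Layout L2 term plus a weighted shape L2 term (weight is hypothetical)."""
    layout_term = mse(pred_boxes, gt_boxes)
    shape_term = mse(pred_tsdf, gt_tsdf)
    return layout_term + shape_weight * shape_term


# Flattened toy values: two box parameters and two TSDF voxels.
loss = stage2_loss([0.0, 1.0], [0.0, 3.0], [1.0, -1.0], [1.0, 0.0])
```

With these toy inputs the layout term is 2.0 and the shape term is 0.5, giving a total of 2.5.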
Inference pipeline:
Construct the MMG from the user’s text/image inputs.
Apply the visual enhancement module to obtain visual features for text‑only nodes.
Run the relation predictor to fill missing edges.
Encode the resulting Mixed‑Enhanced Graph.
Condition the layout and shape diffusion branches to generate a complete 3D scene.
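The five inference steps above can be sketched as a simple orchestration function. Every stage is passed in as a callable, and the names generate_scene, stage and the stub stages are hypothetical placeholders rather than the project's real API.

```python
from typing import Any, Callable, Dict, List

Stage = Callable[[Any], Any]


def generate_scene(inputs: Any, build_mmg: Stage, enhance: Stage,
                   predict_relations: Stage, encode: Stage,
                   layout_branch: Stage, shape_branch: Stage) -> Dict[str, Any]:
    """Run the stages in the order listed in the inference pipeline."""
    graph = build_mmg(inputs)            # 1. construct the MMG
    graph = enhance(graph)               # 2. add visual features to text-only nodes
    graph = predict_relations(graph)     # 3. fill in missing edges
    condition = encode(graph)            # 4. encode the Mixed-Enhanced Graph
    return {                             # 5. condition both diffusion branches
        "layout": layout_branch(condition),
        "shapes": shape_branch(condition),
    }


calls: List[str] = []


def stage(name: str) -> Stage:
    """Build a stub stage that records its name and passes the value through."""
    def run(value: Any) -> Any:
        calls.append(name)
        return value
    return run


scene = generate_scene("a bed near a lamp", stage("build"), stage("enhance"),
                       stage("relations"), stage("encode"),
                       stage("layout"), stage("shape"))
```

The stubs only record the call order; in a real system each one would wrap the corresponding trained module.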
Experiments
Quantitative Evaluation
The model is evaluated on the SG‑FRONT dataset using three standard generative metrics:
FID (Fréchet Inception Distance)
FID‑CLIP (CLIP‑based perceptual distance)
KID (Kernel Inception Distance)
Compared with the state‑of‑the‑art method EchoScene, MMGDreamer achieves:
9% lower FID
8% lower FID‑CLIP
33% lower KID
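Percentages like these are usually relative reductions against the baseline score, although the summary does not state the exact formula. Under that assumption, the computation is sketched below. The input scores are placeholders invented for the example, not numbers from the paper.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline` (positive means lower)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - ours) / baseline * 100.0


# Placeholder scores for illustration only; these are NOT results from the paper.
example_baseline = 50.0
example_ours = 45.5
reduction = relative_reduction(example_baseline, example_ours)
```

For all three metrics lower is better, so a positive reduction indicates an improvement over the baseline.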
Qualitative Evaluation
Visual results on bedroom, dining, and living‑room scenes show that MMGDreamer preserves accurate object geometry and fine details, while competing methods exhibit distortions and missing parts.
Object‑Level Generation Quality
Following the PointFlow evaluation protocol, generated shapes are compared with reference shapes as point clouds using the following metrics:
MMD (Minimum Matching Distance)
COV (Coverage)
1‑NNA (1‑Nearest Neighbor Accuracy)
MMGDreamer outperforms EchoScene on all three, indicating superior geometric fidelity and distribution similarity for individual objects.
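As a sketch of how two of these metrics are commonly computed, the code below implements MMD and COV over small point sets using the Chamfer distance. The choice of Chamfer distance, the function names, and the toy point clouds are assumptions for illustration; 1‑NNA is omitted for brevity, and this is not the paper's evaluation code.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Cloud = Sequence[Point]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two 3-D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer(a: Cloud, b: Cloud) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    def one_way(src: Cloud, dst: Cloud) -> float:
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


def mmd(generated: List[Cloud], reference: List[Cloud]) -> float:
    """Mean, over reference shapes, of the distance to the closest generated shape."""
    return sum(min(chamfer(g, r) for g in generated) for r in reference) / len(reference)


def cov(generated: List[Cloud], reference: List[Cloud]) -> float:
    """Fraction of reference shapes that are the nearest match of some generated shape."""
    matched = {min(range(len(reference)), key=lambda j: chamfer(g, reference[j]))
               for g in generated}
    return len(matched) / len(reference)


# Toy single-point "clouds": both generated shapes sit near the first reference.
reference = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
generated = [[(0.0, 0.0, 0.0)], [(0.1, 0.0, 0.0)]]
mmd_value = mmd(generated, reference)
cov_value = cov(generated, reference)
```

In this toy case only one of the two reference shapes is covered, so COV is 0.5. The uncovered shape is also what pushes MMD above zero. Lower MMD and higher COV are better.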
Conclusion
MMGDreamer advances 3D indoor scene synthesis by:
Supporting multimodal inputs through the Mixed‑Modality Graph.
Enriching textual nodes with visual features via a CLIP‑VQ‑VAE pipeline.
Automatically inferring missing spatial relations with a GCN‑based predictor.
Generating layout and detailed geometry jointly with a dual‑branch diffusion architecture.
The model achieves state‑of‑the‑art performance on SG‑FRONT and is applicable to VR/AR, interior design, and game development.
Resources
Paper: https://arxiv.org/pdf/2502.05874v2
Project page: https://yangzhifeio.github.io/project/MMGDreamer
Code repository: https://github.com/yangzhifeio/MMGDreamer
