GaussianDWM: 3D Gaussian Representation for Driving Understanding and Generation
GaussianDWM introduces a unified 3D Gaussian scene model that supports both autonomous‑driving scene understanding and multi‑modal generation. It embeds geometry, appearance, and language semantics into LLM‑compatible tokens, and it reports stronger visual‑grounding results on NuInteract and stronger RGB‑D generation results on nuScenes than prior methods.
Motivation and Problem Statement
Recent driving world‑model research has focused mainly on generating future visual frames, but practical autonomous‑driving systems also need to answer structured queries such as the existence, location, and spatial relationships of objects, and to support downstream planning. Existing unified frameworks often rely on BEV or depth features for feature‑level alignment, which does not provide a truly unified 3D scene representation.
GaussianDWM Overview
GaussianDWM tackles this gap by placing a 3D Gaussian representation at the core of the world model, enabling the same representation to serve both scene understanding and multi‑modal generation tasks. The overall architecture consists of three tightly coupled modules: World Tokenizer, Scene Understanding, and Multi‑modal Generation, all built around the shared 3D Gaussian field.
Language‑Enhanced 3D Gaussian Tokenizer
The tokenizer extends traditional 3D Gaussian primitives (position, opacity, scale, rotation) with language features derived from CLIP and hierarchical semantics from SAM. To keep storage and computation tractable, a scene‑wise language auto‑encoder compresses the 512‑dimensional CLIP vector to 3 dimensions, allowing semantic information to be anchored at specific spatial locations within the Gaussian field.
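As a rough illustration of the tokenizer's data layout, the sketch below attaches a 3‑dimensional language code to each Gaussian primitive and compresses a 512‑dimensional feature with a linear encoder/decoder pair. This is only a minimal sketch under stated assumptions: the `LanguageGaussian` class, the helper names, and the random linear weights are hypothetical stand‑ins, and the actual auto‑encoder architecture and training procedure are not described in this summary.

```python
import random
from dataclasses import dataclass

CLIP_DIM = 512    # CLIP feature size stated in the text
LATENT_DIM = 3    # compressed language feature size stated in the text


@dataclass
class LanguageGaussian:
    """One Gaussian primitive plus a compressed language code (hypothetical layout)."""
    position: tuple   # (x, y, z)
    opacity: float
    scale: tuple      # (sx, sy, sz)
    rotation: tuple   # quaternion (w, x, y, z)
    language: tuple   # 3-dim code produced by the scene-wise auto-encoder


def make_linear(rows, cols, rng):
    """Random weight matrix standing in for trained auto-encoder weights."""
    return [[rng.gauss(0.0, 0.05) for _ in range(cols)] for _ in range(rows)]


def matvec(weights, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


rng = random.Random(0)
encoder = make_linear(LATENT_DIM, CLIP_DIM, rng)   # 512 -> 3
decoder = make_linear(CLIP_DIM, LATENT_DIM, rng)   # 3 -> 512

# A random vector stands in for a real CLIP feature.
clip_feature = [rng.gauss(0.0, 1.0) for _ in range(CLIP_DIM)]
code = matvec(encoder, clip_feature)
reconstruction = matvec(decoder, code)

gaussian = LanguageGaussian(
    position=(1.0, 2.0, 0.5),
    opacity=0.8,
    scale=(0.1, 0.1, 0.1),
    rotation=(1.0, 0.0, 0.0, 0.0),
    language=tuple(code),
)
print(len(gaussian.language), len(reconstruction))  # prints "3 512"
```

Storing 3 values per Gaussian instead of 512 is what keeps the per‑primitive language feature affordable; the decoder side is only needed when the full feature must be recovered for comparison with a text embedding.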
Projection and Task‑Aware Sampling
Because a dense 3D Gaussian field cannot be directly consumed by a large language model (LLM), GaussianDWM introduces a Gaussian Projector that maps geometric and language attributes into the LLM embedding space. Task‑aware sampling then selects the most relevant Gaussian tokens: uniform and top‑k sampling for global understanding, and similarity‑based sampling for 2D/3D visual grounding. In experiments, 4096 Gaussian tokens are sampled and fed to the LLM, balancing richness of representation with the LLM’s capacity.
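The three sampling modes named above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the use of opacity as the top‑k score is an assumption (the summary does not state the ranking criterion), and cosine similarity against a query feature is assumed for the similarity‑based mode.

```python
import math


def uniform_sample(tokens, k):
    """Pick k tokens at evenly spaced indices."""
    if k >= len(tokens):
        return list(tokens)
    step = len(tokens) / k
    return [tokens[int(i * step)] for i in range(k)]


def top_k_sample(tokens, scores, k):
    """Keep the k tokens with the highest scores (scoring criterion is an assumption)."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in order[:k]]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_sample(tokens, features, query, k):
    """Keep the k tokens whose language feature is closest to the query feature."""
    sims = [cosine(f, query) for f in features]
    return top_k_sample(tokens, sims, k)


# Toy scene: 8 Gaussian token ids with 3-dim language features and opacities.
tokens = list(range(8))
features = [
    (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0),
    (0.9, 0.1, 0), (0, 1, 1), (-1, 0, 0), (0.5, 0.5, 0.5),
]
opacities = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6]
query = (1, 0, 0)   # stand-in for the embedded grounding query

print(uniform_sample(tokens, 4))                      # [0, 2, 4, 6]
print(top_k_sample(tokens, opacities, 2))             # [1, 3]
print(similarity_sample(tokens, features, query, 3))  # [0, 4, 3]
```

In the setting described in the text, `k` would be 4096 and the tokens would be projected Gaussian embeddings rather than integer ids.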
Dual‑Condition Generation
The generation module receives both low‑level conditions (sparse RGB/depth) that constrain texture and geometry, and high‑level world knowledge extracted by the LLM that provides semantic and spatial priors. This dual‑condition design preserves visual detail while enhancing scene coherence, enabling spatial, temporal, and RGB‑D generation.
Evaluation on NuInteract (Scene Understanding)
On the NuInteract benchmark, GaussianDWM achieves an average metric of 59.23, surpassing DriveMonkey’s 52.12. For 2D visual grounding, mAP improves from 19.47 to 34.95; for 3D visual grounding, mAP rises from 34.53 to 52.78, demonstrating that the 3D Gaussian representation benefits both perception and language‑guided querying.
Evaluation on nuScenes (Multi‑modal Generation)
In RGB‑D generation experiments, GaussianDWM attains FID/FVD scores (lower is better) of 8.36/44.50 for ±1 m viewpoint shifts and 11.27/68.17 for ±2 m shifts, outperforming PVG, StreetGaussian, and DiST‑S, especially at small‑to‑medium displacements. These results indicate that the model maintains 3D spatial consistency while synthesizing new views.
Ablation Studies
Removing the Gaussian representation drops the average metric to 53.32, while adding similarity‑based sampling raises it to 59.23, confirming the Gaussian field’s central role. In generation, using only low‑level conditions yields an FID of 10.12 at ±1 m; incorporating high‑level world knowledge reduces it to 8.36. The absolute gain is larger at wider viewpoint changes (FID 21.79 → 18.91 at ±4 m, a reduction of 2.88 versus 1.76 at ±1 m).
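To make the size of these effects easier to compare, the short script below recomputes the differences from the figures quoted in the evaluation and ablation sections. The helper names are hypothetical; the inputs are only the numbers reported above, and nothing is measured independently.

```python
def gain(before, after):
    """Absolute improvement for a higher-is-better metric."""
    return round(after - before, 2)


def fid_drop(before, after):
    """Absolute and relative (%) reduction for FID, where lower is better."""
    absolute = round(before - after, 2)
    relative = round(100 * (before - after) / before, 1)
    return absolute, relative


# Scene understanding on NuInteract (DriveMonkey vs. GaussianDWM)
print(gain(52.12, 59.23))   # average metric: 7.11
print(gain(19.47, 34.95))   # 2D grounding mAP: 15.48
print(gain(34.53, 52.78))   # 3D grounding mAP: 18.25

# Ablation: without the Gaussian representation vs. the full model
print(gain(53.32, 59.23))   # 5.91

# Ablation: low-level conditions only vs. with high-level world knowledge
print(fid_drop(10.12, 8.36))    # at ±1 m: (1.76, 17.4)
print(fid_drop(21.79, 18.91))   # at ±4 m: (2.88, 13.2)
```

The FID reduction is larger in absolute terms at ±4 m but smaller as a fraction of the baseline, so the claim that the benefit grows holds for absolute differences only.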
Conclusion
GaussianDWM argues that an autonomous‑driving world model must be both generative and queryable. By embedding geometry, appearance, and language semantics into a unified 3D Gaussian representation and bridging it to LLMs via projection and sampling, the system improves both scene understanding and multi‑modal generation, offering a more practical foundation for perception‑planning pipelines.
