How FantasyWorld Unifies Video Generation and 3D Geometry for Consistent Virtual Worlds
FantasyWorld introduces a geometry‑enhanced framework that augments a frozen video diffusion model with a trainable geometry branch, generating video representations and an implicit 3D field simultaneously. The result is spatially consistent, high‑quality virtual worlds that outperform recent baselines in multi‑view coherence and geometric fidelity.
Building high‑quality 3D world models is a key step toward embodied intelligence and AGI, and drives AR/VR content creation and robot navigation. Existing video foundation models generate imaginative content but lack explicit 3D understanding, leading to spatial inconsistency.
To address this, we propose FantasyWorld, an innovative geometry‑enhanced framework that adds a trainable geometry branch to a frozen video foundation model, enabling joint modeling of video representations and implicit 3D fields in a single forward pass. A novel cross‑branch supervision mechanism uses geometric cues to guide video generation while leveraging video priors to regularize 3D predictions, resulting in spatially consistent and generalizable video representations.
The geometry branch produces implicit 3D fields that can be directly used as plug‑and‑play geometry representations for downstream tasks such as novel‑view synthesis and robot navigation, without additional optimization or fine‑tuning.
Extensive experiments show FantasyWorld bridges “imagination” and “spatial perception”, surpassing recent geometry‑consistent baselines in multi‑view coherence and style consistency. Ablation studies confirm that performance gains stem from the unified backbone and cross‑branch information exchange.
Core Highlights
Unified video‑3D modeling: single forward pass yields video features and implicit 3D fields, avoiding per‑scene optimization.
Bidirectional 2D/3D supervision: geometry branch enforces multi‑view consistency; video priors refine 3D predictions.
Lightweight gain on a frozen backbone: inserting Integrated Reconstruction & Generation (IRG) blocks and bidirectional cross‑attention into the Wan2.1 backbone yields significant benefits at modest training cost.
Reusable 3D features: implicit features decode to depth, point clouds, and camera poses, serving as universal 3D representations for various downstream tasks.
Method Overview
FantasyWorld takes a reference image, an optional text prompt, and a target camera trajectory as input. The image is encoded by CLIP, the text by umT5, and the camera poses by a Plücker‑ray‑based encoder. The video and geometry branches run jointly in a single forward pass at both training and inference time.
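As a concrete reference for the camera conditioning, here is a minimal PyTorch sketch of a Plücker‑ray encoding under a pinhole camera model; the function name and tensor shapes are illustrative assumptions, not the paper's implementation.

```python
import torch

def plucker_ray_embedding(K, c2w, H, W):
    """Encode one camera as a per-pixel Plucker-ray map (6 channels).

    K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose.
    Returns a (6, H, W) tensor of ray directions d and moments m = o x d.
    """
    device = K.device
    # Pixel grid sampled at pixel centers.
    v, u = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32) + 0.5,
        torch.arange(W, device=device, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    # Back-project pixels to camera-space ray directions.
    dirs = torch.stack(
        [(u - K[0, 2]) / K[0, 0], (v - K[1, 2]) / K[1, 1], torch.ones_like(u)],
        dim=-1,
    )
    # Rotate into world space and normalize.
    d = dirs @ c2w[:3, :3].T
    d = d / d.norm(dim=-1, keepdim=True)
    o = c2w[:3, 3].expand_as(d)          # ray origin, shared by all pixels
    m = torch.cross(o, d, dim=-1)        # Plucker moment
    return torch.cat([d, m], dim=-1).permute(2, 0, 1)
```

Because every pixel gets its own 6‑D ray descriptor, the pose signal is dense and resolution‑aligned with the video latents, which is what makes Plücker‑ray encodings a common choice for camera‑controlled video diffusion.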
Preconditioning Blocks
We reuse several layers of the frozen WanDiT denoising network, feeding partially denoised features into the geometry branch. This provides structural cues from the first denoising step onward and reduces training variance.
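A schematic sketch of this preconditioning stage, assuming the frozen denoiser exposes its transformer blocks as an indexable module list; the class name and tap depth are hypothetical.

```python
import torch.nn as nn

class GeometryPreconditioner(nn.Module):
    """Reuse the first few frozen DiT blocks so the geometry branch
    sees partially denoised, structure-bearing features (illustrative)."""

    def __init__(self, frozen_blocks, num_precondition_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(frozen_blocks[:num_precondition_blocks])
        for p in self.blocks.parameters():
            p.requires_grad_(False)  # the backbone stays frozen throughout

    def forward(self, x):
        # Even at early diffusion steps, these activations carry coarse
        # scene layout, giving the geometry branch a stable starting signal.
        for blk in self.blocks:
            x = blk(x)
        return x
```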
IRG Block (Integrated Reconstruction & Generation)
The IRG unit is an asymmetric dual‑branch design: an Imagination Prior Branch reuses the pretrained Wan2.1 backbone to propagate appearance and spatio‑temporal features, while a Geometry‑Consistent Branch maps the shared features into a geometry‑aligned latent space for explicit 3D inference.
Bidirectional cross‑modal attention (MM‑BiCrossAttn) lightly couples the two branches, allowing geometric cues to regularize video features and video priors to refine geometry, leading to synchronized enhancement and convergence toward a “see‑and‑shape” world representation.
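A minimal sketch of such a coupling, built from standard multi‑head attention with residual connections; the class name MMBiCrossAttn follows the text, but the internals shown here are an assumption rather than the paper's exact block.

```python
import torch.nn as nn

class MMBiCrossAttn(nn.Module):
    """Bidirectional cross-attention between the two IRG branches."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.vid_from_geo = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.geo_from_vid = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_g = nn.LayerNorm(dim)

    def forward(self, vid_tokens, geo_tokens):
        # Geometric cues regularize the video stream...
        v_upd, _ = self.vid_from_geo(self.norm_v(vid_tokens), geo_tokens, geo_tokens)
        # ...while video priors refine the geometry stream.
        g_upd, _ = self.geo_from_vid(self.norm_g(geo_tokens), vid_tokens, vid_tokens)
        # Residual updates keep the coupling light, so the frozen
        # imagination prior is perturbed only gently.
        return vid_tokens + v_upd, geo_tokens + g_upd
```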
3D DPT Decoder (Temporal Alignment & Deep Feature Extraction)
The implicit 3D representation from the geometry branch is decoded by a specialized 3D DPT head, tightly aligned with video frames from WanVAE, ensuring each frame has a corresponding geometric output. Features are extracted from later diffusion layers to improve depth accuracy and pose stability.
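To make "frame‑aligned geometric output" concrete, the simplified heads below decode implicit geometry tokens into per‑token depth and point‑map patches plus a pooled per‑frame pose. This is a deliberately reduced stand‑in for the multi‑scale 3D DPT decoder; all layer shapes are assumptions.

```python
import torch.nn as nn

class GeometryHeads(nn.Module):
    """Reduced stand-in for the 3D DPT outputs: depth, points, camera."""

    def __init__(self, dim, patch=16):
        super().__init__()
        self.depth = nn.Linear(dim, patch * patch)       # depth patch per token
        self.points = nn.Linear(dim, patch * patch * 3)  # 3D point per pixel
        self.pose = nn.Linear(dim, 7)                    # quaternion + translation

    def forward(self, tokens):
        # tokens: (B, T, N, dim), one token row per frame so every video
        # frame gets a matching geometric output.
        depth = self.depth(tokens)            # unpatchify to (B, T, H, W) later
        points = self.points(tokens)          # unpatchify to (B, T, H, W, 3)
        pose = self.pose(tokens.mean(dim=2))  # (B, T, 7), one pose per frame
        return depth, points, pose
```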
Two‑Stage Training
Stage 1 – Latent Bridging: Freeze the Wan2.1 backbone and train only the geometry branch, mapping hidden features from block 16 through a lightweight transformer adapter to the geometry‑aligned latent space, then decoding camera, depth, and point cloud with combined supervision.
Stage 2 – Unified Co‑optimization: With the backbone still frozen, introduce bidirectional cross‑attention adapters and camera control adapters to jointly optimize video diffusion loss and geometric supervision, enhancing multi‑view consistency and temporal stability.
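The sketch below summarizes the two‑stage recipe in code, assuming hypothetical module names (geometry_branch, cross_attn_adapters, camera_adapters) and a simple weighted sum of losses; the actual loss weights and optimizer settings are not specified in the text.

```python
import itertools
import torch

def build_optimizer(model, stage, lr=1e-4):
    """Select trainable parameters per stage; the backbone never trains."""
    for p in model.backbone.parameters():
        p.requires_grad_(False)  # Wan2.1 backbone frozen in both stages

    if stage == 1:  # Latent Bridging: geometry branch only
        params = model.geometry_branch.parameters()
    else:           # Unified Co-optimization: add the new adapters
        params = itertools.chain(
            model.geometry_branch.parameters(),
            model.cross_attn_adapters.parameters(),
            model.camera_adapters.parameters(),
        )
    return torch.optim.AdamW(params, lr=lr)

def training_loss(out, stage, lam=1.0):
    # Combined geometric supervision (camera + depth + point cloud)
    # applies in both stages; the video diffusion loss joins in Stage 2.
    geo = out["camera_loss"] + out["depth_loss"] + out["point_loss"]
    return out["diffusion_loss"] + lam * geo if stage == 2 else geo
```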
Experimental Results
World generation is evaluated on the WorldScore benchmark (static‑realistic subset), which measures camera/object control, content alignment, 3D consistency, and subjective quality. FantasyWorld achieves the best scores in 3D consistency, style consistency, and photometric consistency, with lower variance, especially under large (90°) camera motions.
Qualitative comparisons show that competing methods suffer from tearing, misalignment, or style drift under large camera movements, whereas FantasyWorld maintains stable geometry and consistent appearance.
On RealEstate10K, adding the geometry branch improves PSNR/SSIM and reduces LPIPS; even with self‑initialized point clouds, FantasyWorld competes with VGGT.
Qualitative results demonstrate better structural integrity under large motions and occlusion changes, with straight, continuous edges and stable depth boundaries.
Conclusion
FantasyWorld is a unified feed‑forward model that generates virtual worlds with 3D consistency and reusability. By decoupling geometry prediction from appearance generation and introducing a dedicated geometry branch with bidirectional cross‑attention, the model preserves the creative power of pretrained video diffusion while delivering precise geometric fidelity, offering an efficient pathway toward structured embodied‑intelligence world models.