
How PhysForge Generates Interactive 3D Assets from a Single Image

PhysForge, a physics‑grounded 3D asset generation framework accepted at ICML 2026, converts a single input image into a fully interactive 3D object. A vision‑language model first plans a hierarchical physical blueprint; a diffusion model then refines geometry, texture, and precise kinematic parameters. The framework is supported by the large‑scale PhysDB dataset.

Machine Heart

Jun 9, 2026


Effect Demonstration

PhysForge generates physics‑grounded 3D assets from a single input image. The output includes high‑quality geometry, texture, a hierarchical component structure, and physical property labels (material, mass, semantics) for each part. For movable components the system predicts joint type, axis, origin, motion limits, and interaction mode, enabling objects such as kettles, cabinet doors, buttons, or lamps to be opened, pressed, or grasped in interactive virtual worlds.

In a robot‑simulation demo, the assets were imported into the RoboTwin environment, where a robotic arm recognized functional parts and performed actions (e.g., opening a cabinet door, pulling a drawer, grasping a specified component) while respecting the predicted joint constraints.

Why Physics‑Grounded 3D Assets?

Recent 3D generation models excel at visual quality but typically lack the functional logic and hierarchical physical structure that embodied‑AI and interactive simulations require. An interactive asset must answer:

Which functional components compose the object?

What semantics, material, and mass does each component have?

Which components are pushable, graspable, rotatable, or slidable?

What parent‑child hierarchy exists among components?

What are the joint type, axis, origin, and motion limits of movable parts?

These attributes determine whether the asset can be directly used in simulators, game engines, or embodied‑AI systems.
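To make the last question concrete, here is a minimal sketch of what joint type, axis, origin, and motion limits mean for a revolute part such as a cabinet door. The function name `rotate_about_joint` and the example numbers are illustrative assumptions, not part of PhysForge; the rotation itself is the standard Rodrigues formula.

```python
import math


def rotate_about_joint(point, origin, axis, angle, limits):
    """Rotate `point` about a revolute joint; return (new_point, applied_angle).

    The requested angle is clamped to the joint limits first, which is how
    a simulator keeps a door from swinging past its stop.
    """
    lower, upper = limits
    theta = min(max(angle, lower), upper)

    # Normalize the joint axis so the formula below is valid.
    norm = math.sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]

    # Express the point relative to the joint origin.
    p = [pt - o for pt, o in zip(point, origin)]

    # Rodrigues' formula: p*cos + (k x p)*sin + k*(k . p)*(1 - cos).
    c, s = math.cos(theta), math.sin(theta)
    dot = sum(ki * pt for ki, pt in zip(k, p))
    cross = [
        k[1] * p[2] - k[2] * p[1],
        k[2] * p[0] - k[0] * p[2],
        k[0] * p[1] - k[1] * p[0],
    ]
    rotated = [pt * c + cr * s + ki * dot * (1.0 - c)
               for pt, cr, ki in zip(p, cross, k)]

    # Move back into the object's frame.
    return tuple(r + o for r, o in zip(rotated, origin)), theta


# Hypothetical cabinet door: hinge along +z at the origin, limited to 0..90 degrees.
corner, applied = rotate_about_joint(
    point=(1.0, 0.0, 0.0),
    origin=(0.0, 0.0, 0.0),
    axis=(0.0, 0.0, 1.0),
    angle=math.pi,              # request 180 degrees
    limits=(0.0, math.pi / 2),  # the joint only allows 90 degrees
)
```

Here the request for a 180 degree swing is clamped to the 90 degree limit, so the door corner ends up near (0, 1, 0) rather than (-1, 0, 0). A prismatic joint (a drawer) would translate along the axis instead of rotating about it.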

Method Overview: Two‑Stage “Planning‑Generation” Strategy

PhysForge decouples physics‑grounded asset generation into:

Vision‑Language Model (VLM)‑based physical planning.

Diffusion‑based joint generation of geometry, texture, and kinematic parameters.

Stage 1 — VLM‑Based Planning

The VLM is trained as a “physical architect”. It receives a single image, an optional 2D mask, and a 3D voxel representation produced by TRELLIS. It then autoregressively generates a Hierarchical Physical Blueprint that defines for each component:

3D bounding box.

Parent‑child hierarchy.

Joint type.

Material, mass, functional semantics, state machine, and atomic affordances.

The blueprint encodes how the object should be disassembled, used, and moved.
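As a rough illustration of what such a blueprint could look like once parsed, the sketch below models each component as a small record and walks the parent‑child hierarchy. The class name `PartPlan`, its field names, the joint‑type vocabulary, and the cabinet values are all assumptions made for this example; the article does not specify the actual blueprint schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PartPlan:
    """One component of a hypothetical parsed blueprint."""
    name: str
    bbox: Tuple[float, float, float, float, float, float]  # xmin, ymin, zmin, xmax, ymax, zmax
    parent: Optional[str]   # None marks the root component
    joint_type: str         # assumed vocabulary: "fixed", "revolute", "prismatic"
    material: str
    mass_kg: float
    semantics: str
    states: List[str] = field(default_factory=list)       # state machine labels
    affordances: List[str] = field(default_factory=list)  # atomic affordances


def flatten_hierarchy(parts, parent=None, depth=0):
    """Return (depth, name) pairs in depth-first order, starting at the root."""
    rows = []
    for part in parts:
        if part.parent == parent:
            rows.append((depth, part.name))
            rows.extend(flatten_hierarchy(parts, part.name, depth + 1))
    return rows


def movable_parts(parts):
    """Names of parts whose joint type allows motion."""
    return [p.name for p in parts if p.joint_type != "fixed"]


# Made-up cabinet used only to exercise the helpers.
cabinet = [
    PartPlan("body", (0.0, 0.0, 0.0, 0.6, 0.4, 0.8), None, "fixed",
             "wood", 12.0, "cabinet frame"),
    PartPlan("door", (0.0, 0.0, 0.3, 0.6, 0.02, 0.8), "body", "revolute",
             "wood", 2.0, "cabinet door", ["closed", "open"], ["rotatable"]),
    PartPlan("handle", (0.5, -0.03, 0.5, 0.55, 0.0, 0.6), "door", "fixed",
             "metal", 0.1, "door handle", [], ["graspable"]),
    PartPlan("drawer", (0.0, 0.0, 0.0, 0.6, 0.4, 0.3), "body", "prismatic",
             "wood", 3.0, "storage drawer", ["closed", "open"], ["slidable"]),
]

for depth, name in flatten_hierarchy(cabinet):
    print("  " * depth + name)
```

Keeping the hierarchy as a flat list with a `parent` reference mirrors how an autoregressive model would emit one component at a time; the tree is recovered afterwards by the traversal.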

Stage 2 — Diffusion‑Based Generation with KineVoxel Injection

Continuous 3D parameters (joint axis direction, origin, motion limits) call for finer‑grained generation than the Stage 1 blueprint provides. PhysForge introduces KineVoxel Injection (KVI): each movable part’s joint origin, axis, and limits are encoded as a kinematic voxel and merged with the geometric voxels. A diffusion denoising model processes the combined voxel tensor, so it learns visual appearance and motion specifications simultaneously.

The final output comprises high‑quality geometry, texture, component hierarchy, and precise kinematic parameters ready for insertion into interactive environments.
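The article does not describe how a kinematic voxel is laid out, so the following is only a toy sketch of the general idea: pack a joint's continuous parameters into a voxel‑shaped token and append it to the geometry tokens so one model sees both. The grid size, the 8‑value feature layout, and the names `encode_kinevoxel`, `decode_kinevoxel`, and `inject` are assumptions for illustration, not the paper's encoding.

```python
import math

GRID = 64  # hypothetical voxel resolution, not taken from the paper


def encode_kinevoxel(origin, axis, limits, grid=GRID):
    """Pack one joint into a token: (voxel coord, 8 features, is_kinematic flag).

    `origin` is assumed to be normalized to [0, 1) object coordinates.
    Feature layout (assumed): unit axis (3), sub-voxel origin offset (3), limits (2).
    """
    norm = math.sqrt(sum(a * a for a in axis))
    unit_axis = [a / norm for a in axis]
    scaled = [o * grid for o in origin]
    coord = tuple(min(int(s), grid - 1) for s in scaled)
    offset = [s - c for s, c in zip(scaled, coord)]
    return coord, unit_axis + offset + list(limits), 1


def decode_kinevoxel(token, grid=GRID):
    """Recover (origin, axis, limits) from a kinematic token."""
    coord, feat, _ = token
    origin = tuple((c + o) / grid for c, o in zip(coord, feat[3:6]))
    return origin, tuple(feat[0:3]), tuple(feat[6:8])


def inject(geometry_tokens, joints, grid=GRID):
    """Append one kinematic token per joint to the geometry token sequence."""
    tokens = [(coord, list(feat), 0) for coord, feat in geometry_tokens]
    tokens.extend(
        encode_kinevoxel(j["origin"], j["axis"], j["limits"], grid) for j in joints
    )
    return tokens


# Two placeholder geometry voxels plus one made-up door hinge.
geometry = [((16, 32, 48), [0.0] * 8), ((17, 32, 48), [0.0] * 8)]
hinge = {"origin": (0.26, 0.5, 0.75), "axis": (0.0, 0.0, 2.0), "limits": (0.0, 1.57)}
tokens = inject(geometry, [hinge])
origin, axis, limits = decode_kinevoxel(tokens[-1])
```

Placing the kinematic token at the voxel that contains the joint origin keeps it spatially aligned with the geometry it governs, while the sub‑voxel offset preserves the continuous origin that the grid alone would quantize away.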

PhysDB: A 150 000‑Asset Physically Annotated Dataset

PhysDB contains 150 000 3D assets sourced from Objaverse, covering seven categories (household, industrial, weapons, personal, vehicles, tech & electronics, cultural items). Each asset is annotated with a four‑layer hierarchy:

Holistic properties: overall scale, category, and typical usage scene.

Static properties: part‑level semantics, material, and mass.

Functional properties: intrinsic function and state machine (e.g., “to contain”, button pressed/released).

Interactive properties: pushable, graspable, joint type, parent part, axis origin, axis direction, and joint limits.

This labeling enables the model to learn not only part locations but also their physical roles and operability.
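One way to picture the four‑layer annotation is as a nested record with one holistic block per asset and three blocks per part. The sketch below shows such a record together with a small completeness check; every key name, the `missing_fields` helper, and the kettle values are hypothetical, since the article does not give PhysDB's file format.

```python
# Assumed field names for each annotation layer (not the dataset's real schema).
REQUIRED_HOLISTIC = {"scale_m", "category", "usage_scene"}
REQUIRED_PART_LAYERS = {
    "static": {"semantics", "material", "mass_kg"},
    "functional": {"function", "states"},
    "interactive": {
        "pushable", "graspable", "joint_type", "parent",
        "axis_origin", "axis_direction", "joint_limits",
    },
}


def missing_fields(asset):
    """List dotted paths of required annotation fields that are absent."""
    holistic = asset.get("holistic", {})
    missing = [f"holistic.{key}" for key in sorted(REQUIRED_HOLISTIC - set(holistic))]
    for part in asset.get("parts", []):
        for layer, required in REQUIRED_PART_LAYERS.items():
            present = set(part.get(layer, {}))
            missing += [f"{part['name']}.{layer}.{key}"
                        for key in sorted(required - present)]
    return missing


# Made-up kettle with a single annotated part.
kettle = {
    "holistic": {"scale_m": 0.25, "category": "household", "usage_scene": "kitchen"},
    "parts": [
        {
            "name": "lid",
            "static": {"semantics": "kettle lid", "material": "steel", "mass_kg": 0.1},
            "functional": {"function": "to cover", "states": ["closed", "open"]},
            "interactive": {
                "pushable": False,
                "graspable": True,
                "joint_type": "revolute",
                "parent": "body",
                "axis_origin": [0.0, 0.05, 0.2],
                "axis_direction": [1.0, 0.0, 0.0],
                "joint_limits": [0.0, 1.57],
            },
        }
    ],
}

print(missing_fields(kettle))
```

A check like this is useful when filtering training data, because a part with a joint type but no axis or limits cannot supervise the kinematic outputs.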

Downstream Applications

Robot simulation: Generated assets serve as operable objects, reducing manual modeling, joint binding, and physics‑parameter configuration for robot training and evaluation.

Virtual worlds and game engines: In Unity, Unreal Engine, and similar platforms, the assets arrive with material, mass, functional, and joint information already attached, so developers can build complex interaction logic without hand‑crafting each movable object.

Embodied‑AI agents: The textual physical blueprint from Stage 1 can be queried via natural language, enabling agents to locate components, understand handles, and plan manipulation actions such as opening a cabinet door.
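For the robot‑simulation use case, predicted joint parameters eventually have to be written in a format a simulator can load. URDF is one widely used option, so the sketch below serializes a single joint as a URDF `<joint>` element using Python's standard library. Whether PhysForge or RoboTwin uses URDF is not stated in the article; the helper name `joint_to_urdf`, the default effort and velocity values, and the assumption that the predicted origin is expressed in the parent link's frame are all illustrative.

```python
import xml.etree.ElementTree as ET

MOVABLE = ("revolute", "prismatic")


def _fmt(values):
    """Format a sequence of numbers as a space-separated URDF attribute."""
    return " ".join(f"{v:g}" for v in values)


def joint_to_urdf(name, joint_type, parent, child, origin, axis, limits,
                  effort=10.0, velocity=1.0):
    """Build a URDF <joint> element from predicted kinematic parameters.

    `effort` and `velocity` are placeholder defaults; URDF requires them on
    the <limit> element of revolute and prismatic joints.
    """
    joint = ET.Element("joint", name=name, type=joint_type)
    ET.SubElement(joint, "parent", link=parent)
    ET.SubElement(joint, "child", link=child)
    ET.SubElement(joint, "origin", xyz=_fmt(origin), rpy="0 0 0")
    if joint_type in MOVABLE:
        lower, upper = limits
        ET.SubElement(joint, "axis", xyz=_fmt(axis))
        ET.SubElement(joint, "limit", lower=f"{lower:g}", upper=f"{upper:g}",
                      effort=f"{effort:g}", velocity=f"{velocity:g}")
    return joint


# Made-up cabinet door hinge.
door_joint = joint_to_urdf(
    name="door_hinge",
    joint_type="revolute",
    parent="body",
    child="door",
    origin=(0.0, 0.0, 0.4),
    axis=(0.0, 0.0, 1.0),
    limits=(0.0, 1.5708),
)
print(ET.tostring(door_joint, encoding="unicode"))
```

A full asset would also need `<link>` elements carrying the mesh, mass, and inertia of each part; the joint element is shown alone because it is where the predicted axis, origin, and limits end up.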

Conclusion

PhysForge moves 3D generation from static appearance to interactive assets with a two‑stage pipeline: VLM‑based planning followed by diffusion‑based generation augmented with KineVoxel Injection. The large‑scale PhysDB dataset provides the hierarchical physical annotations needed for this task. As interactive virtual worlds, robot simulation, and embodied‑AI systems increasingly require physics‑grounded assets, PhysForge is a step toward generating 3D objects that are both visually realistic and operable.

Paper: https://arxiv.org/abs/2605.05163

Project page: https://hku-mmlab.github.io/PhysForge/
