From a Single Image to a Physically Realistic 4D Video in One Minute
PhysGM, a CVPR 2026 paper from Beijing Institute of Technology and Li Auto, turns a single static image into a high-fidelity 4D video that obeys real-world physics in under one minute. It combines a dual-decoder transformer, Direct Preference Optimization (DPO) alignment, and a newly built 50,000-item dataset, PhysAssets, and it outperforms prior methods in both speed and quality.
Previous approaches to physics-aware 4D generation relied on slow per-scene optimization via Score Distillation Sampling (SDS), which requires hundreds to thousands of iterations and often takes tens of minutes to hours per scene. They also typically ignored the rich physical cues already present in the input image.
The core architecture of PhysGM is a transformer equipped with two parallel decoders: the DPT Head predicts the initial 3D Gaussian Splatting (3DGS) scene parameters (geometry and appearance), while the Physics Head predicts probability distributions over object material properties such as Young's modulus and Poisson's ratio. The predicted parameters are then fed into a Material Point Method (MPM) simulator, which produces the final dynamic video in under three seconds.
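To make the dual-decoder idea concrete, here is a minimal PyTorch sketch of such an architecture. This is an illustration under assumptions, not the authors' code: the class name `DualHeadPhysGM`, the layer sizes, and the choice of a diagonal Gaussian over (log E, nu) are all hypothetical; the paper only specifies that one head regresses 3DGS parameters while the other predicts material-property distributions.

```python
# Minimal sketch of the dual-decoder idea (hypothetical, not the authors' code).
# A shared transformer encoder feeds two heads: one regressing per-Gaussian
# 3DGS parameters, one predicting a distribution over material properties.
import torch.nn as nn

class DualHeadPhysGM(nn.Module):  # hypothetical name
    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # DPT-style head: per-token 3DGS parameters
        # (xyz 3 + scale 3 + rotation quaternion 4 + opacity 1 + RGB 3 = 14)
        self.dpt_head = nn.Linear(dim, 14)
        # Physics head: mean and log-variance of a diagonal Gaussian over
        # (log Young's modulus E, Poisson's ratio nu), pooled over tokens
        self.physics_head = nn.Linear(dim, 4)

    def forward(self, image_tokens):            # (B, N, dim) patch embeddings
        h = self.encoder(image_tokens)
        gaussians = self.dpt_head(h)             # (B, N, 14) 3DGS parameters
        mu, log_var = self.physics_head(h.mean(dim=1)).chunk(2, dim=-1)
        return gaussians, (mu, log_var)          # a distribution, not a point estimate

# The sampled (E, nu) and the predicted Gaussians would then be handed to an
# MPM simulator, which rolls out the dynamics frame by frame.
```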
Training proceeds in two stages. First, a large-scale supervised pre-training phase jointly learns 3DGS reconstruction and physical attributes on large-scale data, eliminating the need for per-scene multi-view pre-reconstruction. Second, a Direct Preference Optimization (DPO) fine-tuning stage aligns the generated videos with human visual intuition: physical parameters are sampled, videos are rendered with MPM, object trajectories are extracted via SAM-2 and CoTracker-3, and videos whose trajectories lie closer to real physics footage are preferred.
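For reference, the standard DPO objective that such a fine-tuning stage would optimize looks like the following. Treat it as a minimal sketch: the function name and beta value are illustrative, and PhysGM's exact parameterization of log-probabilities over physics parameters may differ.

```python
# A standard DPO objective, shown as a minimal sketch; PhysGM's exact
# parameterization of log-probs over physics parameters may differ.
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss (Bradley-Terry form).

    logp_w / logp_l: policy log-probs of the sampled physics parameters
    whose rendered videos were preferred / dispreferred, where preference
    comes from trajectory distance to real footage (SAM-2 + CoTracker-3).
    ref_logp_*: the same log-probs under the frozen pre-trained model.
    beta: inverse temperature controlling deviation from the reference.
    """
    ratio_w = logp_w - ref_logp_w   # implicit reward of the preferred sample
    ratio_l = logp_l - ref_logp_l   # implicit reward of the dispreferred one
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
```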
To support this training, the authors constructed the PhysAssets dataset: 50,000 high-quality object-physics pairs sourced from Objaverse, OmniObject3D, and HSSD, annotated with material categories by the multimodal model Qwen3-VL. A subset also includes reference videos, providing valuable supervision for future research.
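The paper's schema is not reproduced here, but a PhysAssets-style record plausibly pairs an asset with its annotated material category and continuous physical parameters. The following dataclass is purely illustrative; every field name is an assumption.

```python
# Purely illustrative record layout; field names are assumptions,
# not the dataset's actual schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PhysAsset:
    asset_id: str                        # e.g. an Objaverse UID
    source: str                          # "objaverse" | "omniobject3d" | "hssd"
    material: str                        # category annotated by Qwen3-VL, e.g. "rubber"
    youngs_modulus_pa: float             # Young's modulus E, in pascals
    poisson_ratio: float                 # Poisson's ratio nu, typically in [0, 0.5)
    density_kg_m3: float                 # mass density
    reference_video: Optional[str] = None  # only a subset has reference footage
```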
Quantitative and qualitative evaluations show that PhysGM substantially outperforms existing baselines. Where OmniPhysGS needs more than 12 hours and DreamPhysics more than half an hour per generation, PhysGM finishes in under one minute. It also scores markedly higher on CLIPsim (visual-textual physics consistency) and UPR (user preference rate) across five material types. Visual examples show realistic simulations of cake (bouncing), stone (hard impact), sand (collapsing), ceramic, and rubber (deforming).
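The paper's exact CLIPsim protocol is not spelled out here, but a metric in that spirit can be computed as the mean CLIP similarity between rendered frames and a text description of the expected physical behavior. A minimal sketch, assuming the Hugging Face `transformers` CLIP wrapper; the checkpoint and prompt below are assumptions, and the paper's setup may differ.

```python
# Sketch of a CLIPsim-style score: mean CLIP similarity between rendered
# frames and a text prompt describing the expected physical behavior.
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_sim(frames, prompt):
    """frames: list of PIL.Image; prompt: e.g. 'a rubber toy deforming on impact'."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = proc(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()   # averaged over frames
```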
In summary, PhysGM is the first framework to produce physically accurate 4D dynamic scenes from a single image within such a short time, breaking the efficiency bottleneck of physics-driven generation and opening avenues for large-scale applications such as embodied agents, autonomous-driving simulation, and interactive VR.