NS-Diff: Adding a Physics Engine to Diffusion Models for Fluid and Rigid‑Body Dynamics
This CVPR 2026 paper introduces NS‑Diff, a physics‑guided video diffusion framework that combines a noise‑robust dynamics detector, a physical‑condition latent injection module, and reinforcement‑learning optimization. NS‑Diff reduces jerk error by 43 % and fluid divergence error by 33 %, achieving superior physical realism and visual quality across multiple benchmarks.
Background and Motivation
Current video diffusion models such as Sora and Wan generate visually impressive frames but often violate physical laws, producing implausible artifacts such as objects breaking through one another. To bridge the gap from visual to physical realism, the PKU team proposes NS‑Diff, which embeds physical constraints and reinforcement learning directly into the diffusion process.
Technical Solution
1. Noise‑Robust Physical Dynamics Detector
The detector operates in the latent space and consists of four steps:
Global Motion Compensation: Estimate a global homography matrix to remove camera motion, then subtract the induced motion from the optical flow.
Latent‑to‑RGB Decoding: At selected denoising steps, a pretrained VAE decodes latent frames into low‑resolution RGB proxy images, providing sufficient structure for flow estimation while keeping computation cheap.
Noise‑Robust Optical Flow: Fine‑tune ARFlow on noisy samples, compute flow between proxy images, and smooth it with a temporal filter.
Material Region Segmentation: Solve an affine transform for each patch to model planar rigid motion; patches whose velocity‑field divergence plus squared curl exceeds a threshold are marked as fluid, the rest as rigid.
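The segmentation criterion above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the patch size and threshold `tau` are illustrative choices, and both the divergence and curl terms are squared here so the score is sign‑invariant.

```python
import numpy as np

def segment_materials(flow, patch=16, tau=0.5):
    """Classify patches of a dense 2D flow field as fluid or rigid.

    flow: (H, W, 2) array of per-pixel (u, v) velocities.
    Returns a boolean grid where True marks a fluid patch.
    """
    u, v = flow[..., 0], flow[..., 1]
    # Finite-difference spatial derivatives of the velocity field.
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    div = du_dx + dv_dy        # divergence: compressive/expansive motion
    curl = dv_dx - du_dy       # scalar 2D curl: rotational motion
    score = div**2 + curl**2   # near zero for planar rigid motion
    H, W = score.shape
    labels = np.zeros((H // patch, W // patch), dtype=bool)
    for i in range(H // patch):
        for j in range(W // patch):
            block = score[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
            labels[i, j] = block.mean() > tau
    return labels
```

A uniform translation produces zero divergence and curl everywhere, so every patch is labeled rigid; turbulent regions score high on both terms and are labeled fluid.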
2. Physical Condition Latent Injection (PCLI)
For each patch, the method extracts physical descriptors—velocity field (time derivative of flow), deformation gradient (spatial Jacobian), and material embedding—and projects them with a two‑layer MLP into a latent physical vector. This vector is injected into the DiT denoiser via cross‑attention, allowing the diffusion model to receive explicit physical cues.
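The PCLI path can be sketched as a small PyTorch module. The hidden width, head count, and descriptor dimensions below are illustrative assumptions; only the two‑layer MLP projection and the cross‑attention injection into the denoiser's tokens come from the description above.

```python
import torch
import torch.nn as nn

class PCLI(nn.Module):
    """Physical Condition Latent Injection (minimal sketch)."""

    def __init__(self, phys_dim, latent_dim, hidden=256):
        super().__init__()
        # Two-layer MLP: physical descriptors -> latent physical vector.
        self.mlp = nn.Sequential(
            nn.Linear(phys_dim, hidden), nn.GELU(),
            nn.Linear(hidden, latent_dim),
        )
        # Cross-attention: DiT tokens attend to the physical vectors.
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=4,
                                          batch_first=True)

    def forward(self, tokens, phys):
        # tokens: (B, N, latent_dim) DiT latent tokens
        # phys:   (B, P, phys_dim) per-patch descriptors
        #         (velocity field, deformation gradient, material embedding)
        cond = self.mlp(phys)                   # (B, P, latent_dim)
        out, _ = self.attn(tokens, cond, cond)  # queries = DiT tokens
        return tokens + out                     # residual injection
```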
3. Physics‑Guided Reinforcement Learning Optimization
The diffusion policy is treated as a stochastic policy whose action is the predicted noise residual. Three reward components are defined:
Rigid‑Body Jerk Regularization: Enforce the Minimum‑Jerk principle by penalizing the third‑order time derivative of the centroid trajectory.
Fluid Dynamics Penalty: Apply a lightweight Navier‑Stokes constraint that minimizes the spatial gradient of the velocity‑field divergence, encouraging incompressibility without solving a full Poisson equation.
PPO Objective: Combine the two penalties into a negative‑weighted sum and update the DiT parameters with Proximal Policy Optimization.
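The reward terms above can be sketched with finite differences. The weights are placeholders, and the finite‑difference stencils are our assumption; the paper specifies only the quantities being penalized.

```python
import numpy as np

def jerk_penalty(centroids):
    """Minimum-jerk term: mean squared third finite difference
    of a centroid trajectory. centroids: (T, 2) array."""
    jerk = np.diff(centroids, n=3, axis=0)  # (T-3, 2)
    return float((jerk**2).sum(axis=1).mean())

def fluid_penalty(flow):
    """Lightweight incompressibility term: mean squared spatial
    gradient of the flow divergence. flow: (H, W, 2) array."""
    du_dy, du_dx = np.gradient(flow[..., 0])
    dv_dy, dv_dx = np.gradient(flow[..., 1])
    div = du_dx + dv_dy
    gy, gx = np.gradient(div)
    return float((gx**2 + gy**2).mean())

def reward(centroids, flow, w_jerk=1.0, w_fluid=1.0):
    """Negative-weighted sum of the two penalties, used as the
    scalar reward for the PPO update."""
    return -(w_jerk * jerk_penalty(centroids)
             + w_fluid * fluid_penalty(flow))
```

A constant‑velocity trajectory and a uniform flow field incur zero penalty, so the reward is maximized by smooth, incompressible motion.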
4. Adaptive Activation Scheduler
Because early‑stage denoising is dominated by high noise, an adaptive scheduler modulates both the PCLI injection strength and the RL reward weight, gradually increasing them as noise decreases to ensure stable training.
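One plausible form of such a scheduler is a sigmoid ramp over denoising progress. The shape and its parameters (`k`, `midpoint`) are illustrative assumptions; the paper states only that the weights grow as noise decreases.

```python
import math

def activation_weight(sigma, sigma_max, k=8.0, midpoint=0.5):
    """Adaptive activation weight (sketch): ramps the PCLI injection
    strength and RL reward weight from ~0 at maximum noise toward 1
    as denoising progresses."""
    progress = 1.0 - sigma / sigma_max  # 0 at max noise, 1 when clean
    return 1.0 / (1.0 + math.exp(-k * (progress - midpoint)))
```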
Experiments
Setup
Evaluations are performed on PhysVideoBench, UCF‑101 (13,320 human‑action videos), and WebVid‑10M (10 M text‑paired videos). Two groups of metrics are reported:
Physical metrics: Jerk Consistency (third‑order derivative magnitude) and Fluid Divergence Error (norm of divergence computed from ground‑truth flow).
Visual metrics: VBench (combined appearance and motion quality), Fréchet Video Distance (FVD), Frame Consistency (average cosine similarity of CLIP embeddings), and CLIPSIM for text‑to‑video alignment.
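As one concrete example, Frame Consistency can be computed as below. Averaging over consecutive frame pairs is our assumption; the paper says only "average cosine similarity of CLIP embeddings".

```python
import numpy as np

def frame_consistency(embs):
    """Average cosine similarity between consecutive frame embeddings.

    embs: (T, D) array, one embedding (e.g., a CLIP feature) per frame.
    """
    a, b = embs[:-1], embs[1:]
    sims = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(sims.mean())
```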
Results
On PhysVideoBench, NS‑Diff achieves the best scores on all metrics, reducing jerk error by 43 % and fluid divergence by 33 %, while improving FVD by 22.7 % compared with prior methods. On UCF‑101, the NS‑Diff DiT 1B model attains FVD 106 and Frame Consistency 0.94; the larger DiT 11B variant further lowers FVD to 85 and raises consistency to 0.95. On WebVid‑10M, NS‑Diff outperforms VideoFactory on both FVD and CLIPSIM, demonstrating strong generalization to open‑world text‑driven generation. Qualitative comparison (Fig. 2) shows far fewer non‑physical artifacts such as sudden object appearance/disappearance or implausible splits/merges.
Conclusion
NS‑Diff demonstrates that tightly integrating classic physical constraints—through a noise‑robust dynamics detector, a latent physical‑condition injection, and reinforcement‑learning‑based optimization—significantly improves both physical fidelity and visual quality of generated videos, establishing a viable path toward physically realistic AIGC.