HSImul3R: Bridging Perception and Simulation for Physics‑Ready 3D Human‑Scene Interaction

HSImul3R introduces a physics‑in‑the‑loop reconstruction pipeline that closes the perception‑simulation gap by jointly optimizing human motion and scene geometry, leveraging reinforcement learning, direct simulation‑reward optimization, and a new HSIBench dataset to produce simulation‑ready 3D human‑scene interactions.

Machine Heart

Existing human‑scene interaction (HSI) reconstruction pipelines suffer from a "Perception–Simulation Gap": visually realistic reconstructions collapse in physics simulators due to violations such as body penetration or unstable center‑of‑mass, because human motion and environment geometry are modeled separately.

HSImul3R Framework

HSImul3R formulates reconstruction as a bidirectional physical-perception optimization problem. A physics simulator acts as an active supervisor, creating a closed loop between human motion and scene geometry.

Forward Optimization: With the scene geometry fixed, human motion is refined. An initial reconstruction is aligned using a 3D generative model's structural prior, then integrated into the simulator. Reinforcement learning guided by physical signals (key-point tracking consistency and geometric contact constraints) optimizes motion stability.
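The paper does not publish the exact reward, but the two physical signals above can be sketched as a per-step reward; `physical_reward`, the weights, and the exponential shaping are illustrative assumptions, not the authors' formula:

```python
import numpy as np

def physical_reward(sim_joints, ref_joints, contact_points, surface_points,
                    w_track=1.0, w_contact=0.5):
    """Hypothetical per-step RL reward combining the two signals named in the
    text: key-point tracking consistency and a geometric contact constraint."""
    # Tracking consistency: simulated joints should follow the reference motion.
    track_err = np.linalg.norm(sim_joints - ref_joints, axis=-1).mean()
    r_track = np.exp(-track_err)
    # Contact constraint: each contact point (feet/hands) should lie close to
    # the nearest object surface point.
    d = np.linalg.norm(contact_points[:, None, :] - surface_points[None, :, :],
                       axis=-1)
    contact_err = d.min(axis=1).mean()
    r_contact = np.exp(-contact_err)
    return w_track * r_track + w_contact * r_contact
```

With perfect tracking and zero contact distance the reward reaches its maximum, `w_track + w_contact`.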

Reverse Optimization: With physically validated motions, scene geometry is refined. The authors introduce Direct Simulation Reward Optimization (DSRO), which uses simulator-derived rewards to improve the gravity stability and interaction feasibility of generated objects.

Reconstruction Pipeline

Given everyday videos or images, HSImul3R proceeds in three stages:

Stage 1 – HSfM: Reconstruct static scene geometry with DUSt3R and recover dynamic human motion. Individuals are detected and tracked with SAM2, after which SMPL-based motion sequences and 2D keypoints are extracted with 4DHumans and ViTPose, respectively.

Stage 2 – Joint Alignment: (a) human-centric bundle adjustment driven by 2D keypoints; (b) global human-scene alignment that minimizes the re-projection error between ViTPose keypoints and projected 3D SMPL joints, ensuring scale consistency.
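The global alignment objective can be sketched as a standard re-projection error; `reprojection_error`, the scale parameterization, and the confidence weighting are assumptions for illustration, not the paper's exact objective:

```python
import numpy as np

def reprojection_error(joints3d, keypoints2d, K, scale=1.0, conf=None):
    """Hypothetical alignment objective: project (scaled) 3D SMPL joints with a
    pinhole camera K and compare against detected 2D keypoints."""
    pts = scale * joints3d                   # apply the global scale being optimized
    proj = (K @ pts.T).T                     # pinhole projection into the image
    proj2d = proj[:, :2] / proj[:, 2:3]      # perspective divide
    err = np.linalg.norm(proj2d - keypoints2d, axis=-1)
    if conf is not None:                     # optional per-keypoint confidence
        err = err * conf
    return err.mean()
```

Minimizing this error over camera pose and global scale is what ties the human trajectory and the DUSt3R scene into one metric frame.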

Stage 3 – Structural Prior Injection: Pretrained image-to-3D generative models (e.g., MIDI) generate high-fidelity 3D objects for each scene item, using SAM masks to select the most informative view. This corrects structural defects (broken components, missing surfaces) and provides robust interaction constraints.

Physical Optimization Details

Forward optimization minimizes the average Euclidean distance between human contact keypoints (feet and hands) and the nearest surface points of the interacting objects. DSRO's reward combines two stability signals: (1) gravity stability – the object remains upright under gravity; (2) interaction stability – human and object maintain contact rather than separating. A reconstruction is judged stable when three criteria hold: (i) the object stays upright under gravity; (ii) the reconstructed scene reaches a stable final state; (iii) the interaction includes actual contact.
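The three stability criteria can be sketched as simple checks on the final simulator state; `dsro_stability_reward` and all thresholds are hypothetical illustrations, since the paper's exact DSRO reward is not given here:

```python
import numpy as np

def dsro_stability_reward(up_axis_final, lin_vel_final, min_contact_dist,
                          tilt_thresh_deg=15.0, vel_thresh=0.05,
                          contact_thresh=0.02):
    """Hypothetical DSRO-style reward: one point per satisfied criterion,
    evaluated on the final state of a short physics rollout (z is world up)."""
    # (i) gravity stability: tilt of the object's up axis from world up.
    cos_tilt = np.clip(np.dot(up_axis_final, np.array([0.0, 0.0, 1.0])), -1.0, 1.0)
    upright = np.degrees(np.arccos(cos_tilt)) < tilt_thresh_deg
    # (ii) stable final state: residual linear velocity is near zero.
    settled = np.linalg.norm(lin_vel_final) < vel_thresh
    # (iii) actual contact: human-object distance within a contact tolerance.
    in_contact = min_contact_dist < contact_thresh
    return float(upright) + float(settled) + float(in_contact)
```

An upright, settled object in contact with the human scores the maximum of 3, giving the generative model a dense improvement signal across candidate geometries.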

HSIBench Dataset and Experiments

To train and benchmark the framework, the authors built HSIBench, a dedicated HSI dataset containing 19 object categories, over 50 motion sequences, and 300 unique interaction instances captured from 16 viewpoints by three volunteers (two male, one female). The dataset provides multi‑view supervision for both reconstruction and simulation.

Simulation experiments show that HSImul3R achieves higher stability rates and more accurate geometry than state‑of‑the‑art methods. Real‑world validation is performed on a Unitree G1 humanoid robot: human motions are retargeted to the robot using GMR, then refined by diffusion‑guided reinforcement learning in IsaacGym. The resulting control policy, deployed via the Unitree SDK, demonstrates stable robot‑scene interaction, confirming sim‑to‑real transfer capability.
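The retargeting step above can be sketched as a per-joint mapping with hardware limits; `retarget_joint_angles`, the joint-name map, and the limit table are hypothetical illustrations, not GMR's or the Unitree SDK's actual API:

```python
import numpy as np

def retarget_joint_angles(smpl_angles, joint_map, limits):
    """Hypothetical retargeting sketch: copy each mapped SMPL joint angle onto
    the corresponding robot joint and clamp it to the robot's joint limits."""
    robot = {}
    for robot_joint, smpl_joint in joint_map.items():
        lo, hi = limits[robot_joint]
        robot[robot_joint] = float(np.clip(smpl_angles[smpl_joint], lo, hi))
    return robot
```

In practice such a kinematic mapping only initializes the motion; the RL refinement in simulation is what makes it dynamically feasible on the real robot.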

Limitations

Three limitations are acknowledged: (1) success rate drops in complex or multi‑object (>3) scenes; (2) interaction depth is sometimes insufficient, with human and objects standing independently rather than engaging meaningfully; (3) the fine‑tuned generative 3D model inherits biases from the original MIDI dataset and HSIBench, potentially limiting generalization to out‑of‑domain scenes.

References

Paper: https://arxiv.org/abs/2603.15612

Project page: https://yukangcao.github.io/HSImul3R/

GitHub repository: https://github.com/yukangcao/HSImul3R

Tags: reinforcement learning, 3D reconstruction, human-scene interaction, DSRO, HSIBench, physics-in-the-loop, simulation-ready reconstruction