How FantasyHSI Enables Autonomous 3D Human Interaction in Any Scene

FantasyHSI introduces a graph‑based multi‑agent framework that combines vision‑language models (VLMs) with diffusion‑based video generation to let digital humans perceive, plan, and interact autonomously in any 3D scene, producing physically plausible, long‑duration actions for animation creation and embodied‑AI simulation.

Amap Tech

Overview

Building digital humans that can understand, interact with, and act autonomously in a 3D world is a crucial step toward artificial general intelligence. Existing video‑generation methods rely on external driving signals (action, audio, text) and cannot demonstrate self‑directed behavior. Moreover, current Human‑Scene Interaction (HSI) approaches depend on paired 3D scene reconstructions and motion‑capture data, which are costly and limited in diversity.

Key Highlights

Multi‑agent collaboration based on VLMs: Three agents cooperate. The Scene Navigation Agent understands the 3D environment, the Action‑Chain Planner Agent decomposes high‑level tasks into executable action units, and the Critic Agent dynamically corrects generated actions.

Dynamic directed‑graph modeling for interpretable long‑sequence interaction: A directed graph represents the human‑scene interaction process, providing a unified framework for the perception, planning, correction, and execution of long‑term tasks.

DPO‑optimized video generation: Direct Preference Optimization (DPO) refines the diffusion video model, markedly improving physical realism and suppressing implausible motions such as floating, flying, or body distortion.

Method

FantasyHSI takes a 3D scene and a natural‑language task description as input. Multiple agents collaborate to generate a continuous sequence of actions for a virtual human within the scene.

1. Base generation module

The core generation unit is a text‑to‑video diffusion model. Natural‑language “action units” are fed to the model, producing short video clips. These clips are lifted to 3D motion via motion‑capture techniques and mesh registration, allowing seamless concatenation of successive actions.
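A minimal sketch of this generate‑then‑lift loop (hypothetical names throughout: `t2v_model`, `lifter`, `execute_action_units`; conditioning each clip on the previous clip's final frame is our assumption for how continuity is kept):

```python
from dataclasses import dataclass, field

@dataclass
class Motion3D:
    """A 3D motion clip recovered from a generated video (e.g., SMPL-X frames)."""
    frames: list = field(default_factory=list)

def execute_action_units(t2v_model, lifter, action_units, init_frame):
    """Turn natural-language action units into one continuous 3D motion.

    `t2v_model.generate` and `lifter.lift` are hypothetical stand-ins for the
    text-to-video diffusion model and the video-to-3D mocap/registration step.
    """
    all_frames, last_frame = [], init_frame
    for unit in action_units:
        # Generate a short clip conditioned on the action text and on the
        # final frame of the previous clip, so consecutive clips join smoothly.
        clip = t2v_model.generate(prompt=unit, first_frame=last_frame)
        # Lift the 2D clip to 3D motion via mocap and mesh registration.
        all_frames.extend(lifter.lift(clip).frames)
        last_frame = clip[-1]
    # Concatenated frames form one seamless long-horizon motion sequence.
    return Motion3D(frames=all_frames)
```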

When the character needs to move beyond the current camera view, a VLM agent selects an optimal new viewpoint from multi‑view images, enabling the character to explore larger areas.
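One plausible shape for that viewpoint switch, assuming a generic multimodal chat interface (`vlm.query` and the prompt wording are illustrative, not the paper's API):

```python
def select_viewpoint(vlm, multi_view_images, next_action):
    """Ask a VLM to pick the camera view best suited for the next action.

    `vlm.query` is a hypothetical call that accepts images plus text and
    returns a text answer; any multimodal chat API could fill this role.
    """
    prompt = (
        f"The character must next: {next_action}. "
        f"Given {len(multi_view_images)} candidate views indexed from 0, "
        "reply with only the index of the view that best frames this action."
    )
    answer = vlm.query(images=multi_view_images, text=prompt)
    return int(answer.strip())  # index of the chosen viewpoint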

2. DPO optimization

Training data often contain non‑physical effects (e.g., cartoon physics). To enforce physical plausibility, the model undergoes supervised fine‑tuning on SMPL‑X white‑model motion videos, followed by DPO using preference‑annotated samples from models such as VEO, Hunyuan‑Video, and Kling. The DPO objective anchors generation within a physically realistic action space.
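For context, the standard DPO objective over a preferred/dispreferred pair $(y_w, y_l)$ for a prompt $x$ is shown below; diffusion variants such as Diffusion-DPO approximate the log-likelihood ratios with per-step denoising losses:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Here $\pi_\theta$ is the model being optimized, $\pi_{\mathrm{ref}}$ is the supervised fine-tuned reference (in this setting, the model fine-tuned on SMPL‑X white‑model videos), $\sigma$ is the sigmoid, and $\beta$ controls how far generation may drift from the reference, which is what anchors outputs in a physically realistic action space.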

3. Graph‑based interaction modeling

A dynamic directed graph G = (N, E) encodes the interaction: each node in N represents a combined human state H and scene state S, while each edge in E denotes an action A that transitions one node to the next. Only semantically meaningful atomic actions become edges; key milestones along the planned route become "key nodes" K.
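A minimal sketch of this structure (the field names and string-based state summaries are illustrative assumptions, not the paper's representation):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    """Graph node: a combined human state H and scene state S."""
    human_state: str   # e.g., pose and location summary
    scene_state: str   # e.g., object-configuration summary

@dataclass
class InteractionGraph:
    """Dynamic directed graph G = (N, E) grown during task execution."""
    nodes: list = field(default_factory=list)
    edges: dict = field(default_factory=dict)    # (src, dst) -> action text
    key_nodes: set = field(default_factory=set)  # milestone nodes K

    def add_transition(self, src: "Node", action: str, dst: "Node",
                       key: bool = False) -> None:
        # Only semantically meaningful atomic actions are recorded as edges.
        for n in (src, dst):
            if n not in self.nodes:
                self.nodes.append(n)
        self.edges[(src, dst)] = action
        if key:
            self.key_nodes.add(dst)  # mark milestones as key nodes
```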

4. Multi‑VLM dynamic graph construction

Three VLM agents work top‑down: the Scene Navigation Agent designs a route and identifies key sub‑goals, the Action‑Chain Planner Agent decomposes each sub‑goal into atomic actions, and the Critic Agent evaluates each generated action, backtracking erroneous ones and selecting appropriate camera views for subsequent video generation.
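Putting the agents together over the graph from the previous sketch, the control loop might look as follows (`navigator`, `planner`, and `critic` are hypothetical wrappers around the three VLM agents, and `execute_action` stands for the video-generation plus 3D-lift step):

```python
def run_task(navigator, planner, critic, graph, start_node, task):
    """Top-down multi-agent loop: plan sub-goals, expand them into atomic
    actions, and let the critic accept or roll back each transition."""
    current = start_node
    for sub_goal in navigator.plan_route(task, current):      # key sub-goals
        for action in planner.decompose(sub_goal, current):   # atomic actions
            candidate = execute_action(action, current)       # new candidate node
            if critic.accept(current, action, candidate):
                graph.add_transition(current, action, candidate)
                current = candidate
            else:
                # Critic rejects the transition: backtrack (remain at
                # `current`) and let the planner propose a corrected action.
                planner.report_failure(sub_goal, action)
        graph.key_nodes.add(current)  # sub-goal reached: record a key node
    return graph
```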

Experimental Results

We built SceneBench, an HSI benchmark comprising diverse 3D scenes from TRUMANS, Sketchfab, and our own collection. Quantitative comparisons show that FantasyHSI surpasses recent 3D‑motion‑based HSI methods on metrics such as scene penetration (less mesh "clipping") and motion diversity, demonstrating superior generalization to unseen, complex environments.

Qualitative visualizations reveal that competing methods often fail to perceive novel obstacles, leading to implausible actions, whereas FantasyHSI consistently generates context‑aware, physically sound motions.

Conclusion

FantasyHSI presents a novel video‑generation‑centric HSI framework that, guided by natural‑language commands, produces high‑quality, physically plausible, and logically coherent 3D human actions in arbitrary scenes. Its graph‑based multi‑agent architecture not only enriches animation asset creation but also supplies diverse, realistic human behavior data for embodied‑AI simulation platforms.

Tags: video generation, reinforcement learning, visual language model, graph modeling, 3D synthesis, human-scene interaction