Peking University Unveils EvoPhys-World: The First Human‑Centric 5D World Model for Scene‑Level Control

Peking University’s EvoPhys team has introduced EvoPhys-World, a human‑centric 5D world model built on Moer Thread’s domestic GPU platform. It moves world models from visual generation toward controllable, interactive, self‑evolving virtual environments, and features a latent memory pool, a unified token architecture, and two operational modes: World Engine and World Policy.

Machine Heart

Jun 5, 2026

Introduction

Peking University’s EvoPhys team recently released EvoPhys-World, the world’s first human‑centric 5D world model that enables scene‑level controllability. The model runs on Moer Thread’s fully domestic GPU compute stack and pushes world‑model research from "watchable, roamable, shallow interaction" to "manipulable, deep interaction, self‑evolution".

Problem Statement

Existing world models can generate visually realistic scenes and let agents observe or roam through them, but they cannot truly "act" on objects. They lack an understanding of physical properties and of the causal effects of pushing, lifting, or switching, which makes it difficult for humans or robots to interact with the environment in a physically consistent way.

From 3D to 5D

3D world models focus on spatial layout (what the scene looks like and where objects are). 4D models add the temporal dimension, describing how the world evolves over time. EvoPhys‑World argues that a genuine world model must also capture "parallel universes": the choice‑driven futures that branch from different actions, and the way those futures feed back into present decisions. Space and time plus this branching dimension give the 5D representation.

Core Architecture

The model’s backbone consists of three key components:

Latent Memory Pool (4D ST‑Memory): a long‑term spatiotemporal memory that stores implicit scene states across different times and conditions. A spatiotemporal importance mechanism selects and compresses critical states from this pool to provide consistent spatial and causal context during inference.

Unified Token Chunk output paradigm: a novel mixed‑attention mechanism that generates Unified State‑Action Tokens in parallel, enabling simultaneous prediction of the next world state (Next‑State Prediction) and the next action (Next‑Action Prediction).

Dual‑mode spiral reasoning: the latent space continuously rolls forward, supporting hour‑scale, scene‑level future interaction and planning.
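The article does not say how the spatiotemporal importance mechanism scores stored states. As a rough illustration only, the selection step of the Latent Memory Pool might look like the sketch below. The `MemoryEntry` type, the `importance` formula (recency times spatial proximity), and the scale parameters are all hypothetical, and the compression step is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class MemoryEntry:
    time: float          # when the state was stored
    position: Vec3       # where in the scene it was recorded
    latent: List[float]  # implicit scene state (toy vector)

def importance(entry: MemoryEntry, query_time: float, query_pos: Vec3,
               time_scale: float = 10.0, dist_scale: float = 2.0) -> float:
    """Hypothetical score: recent and spatially close states matter more."""
    dt = abs(query_time - entry.time)
    dist = sum((a - b) ** 2 for a, b in zip(entry.position, query_pos)) ** 0.5
    recency = 1.0 / (1.0 + dt / time_scale)
    proximity = 1.0 / (1.0 + dist / dist_scale)
    return recency * proximity

def select_context(pool: List[MemoryEntry], query_time: float,
                   query_pos: Vec3, k: int = 2) -> List[MemoryEntry]:
    """Keep the k highest-scoring entries, returned in chronological order."""
    ranked = sorted(pool, key=lambda e: importance(e, query_time, query_pos),
                    reverse=True)
    return sorted(ranked[:k], key=lambda e: e.time)

pool = [
    MemoryEntry(0.0, (0.0, 0.0, 0.0), [0.1, 0.2]),
    MemoryEntry(5.0, (4.0, 0.0, 0.0), [0.3, 0.1]),
    MemoryEntry(9.0, (0.5, 0.0, 0.0), [0.7, 0.9]),
]
context = select_context(pool, query_time=10.0, query_pos=(0.0, 0.0, 0.0), k=2)
print([e.time for e in context])
```

In this toy pool the state stored far from the query position is dropped even though it is more recent than the oldest entry, which is the kind of trade-off an importance score over both space and time would have to make.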

Two Model Forms

EvoPhys‑World operates in two complementary forms:

Model as World Engine: objects can be duplicated and physics‑based interactions are simulated, allowing arbitrary trajectory roaming and object manipulation within a persistent scene memory.

Model as World Policy: the model not only imagines future worlds but also controls them, mapping human head pose, hand skeleton, and contact information to real‑robot manipulation, thereby reducing reliance on large‑scale robot data.

Together these forms create a "one base model – two forms" self‑evolution loop.
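One way to picture "one base model, two forms" is a single predictor that emits both the next action and the next state, wrapped in two thin interfaces. The sketch below is a toy under that assumption: `ToyWorldModel`, its method names, and its linear dynamics are invented for illustration and are not taken from the EvoPhys‑World release.

```python
from typing import List, Tuple

State = List[float]
Action = List[float]

class ToyWorldModel:
    """Stand-in for a shared backbone; the dynamics here are made up."""

    def predict_chunk(self, state: State, goal: State) -> Tuple[Action, State]:
        # One call yields both the next action and the next state,
        # loosely mirroring joint next-action / next-state prediction.
        action = [0.5 * (g - s) for s, g in zip(state, goal)]
        next_state = [s + a for s, a in zip(state, action)]
        return action, next_state

    def step_engine(self, state: State, action: Action) -> State:
        # World Engine form: the caller supplies the action, the model simulates.
        return [s + a for s, a in zip(state, action)]

    def step_policy(self, state: State, goal: State) -> Action:
        # World Policy form: the model proposes the action itself.
        action, _ = self.predict_chunk(state, goal)
        return action

model = ToyWorldModel()
state, goal = [0.0, 0.0], [1.0, 2.0]
action = model.step_policy(state, goal)        # policy form picks an action
next_state = model.step_engine(state, action)  # engine form simulates it
print(action, next_state)
```

Because both forms sit on the same predictor, the state the policy expects after acting matches what the engine simulates, which is the consistency a shared base model is meant to provide.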

Demonstrations

Three demos illustrate the capabilities:

Demo 1 – Arbitrary Scene Roaming: head‑pose control enables free navigation of any scene.

Demo 2 – Long‑Term Action Interaction: combined head‑pose and hand‑pose control allows realistic object interaction such as picking up, pushing, or flipping items.

Demo 3 – Moving Manipulation: the same control signals are remapped to a dexterous robotic hand, demonstrating transfer from virtual to physical manipulation.

Human‑Centric Action Space

Unlike traditional embodied‑AI systems that define action spaces around specific robot hardware, EvoPhys‑World adopts a "human‑centric" standard action representation. It encodes first‑person observations, head pose, binocular vision, hand‑skeleton points, gestures, and contact relations, aligning directly with how humans perceive and manipulate the physical world. This representation can be learned from massive raw, unlabeled egocentric (EGO) human‑hand datasets.
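To make the representation concrete, a single time step of such a human‑centric action could be held in a record like the one below. This is a minimal sketch: the `HumanActionFrame` name, the field layout, the quaternion convention, and the choice of 21 keypoints per hand are assumptions, and the visual inputs (first‑person and binocular images) are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # assumed (w, x, y, z) order

@dataclass
class HumanActionFrame:
    head_position: Vec3
    head_rotation: Quat
    left_hand: List[Vec3]    # hand-skeleton keypoints
    right_hand: List[Vec3]
    gesture: str = "open"
    contacts: List[str] = field(default_factory=list)  # names of touched objects

    def to_vector(self) -> List[float]:
        """Flatten the numeric fields; gesture and contacts are not encoded here."""
        flat = list(self.head_position) + list(self.head_rotation)
        for hand in (self.left_hand, self.right_hand):
            for point in hand:
                flat.extend(point)
        return flat

KEYPOINTS_PER_HAND = 21  # assumption; the article does not give a number
rest_hand = [(0.0, 0.0, 0.0)] * KEYPOINTS_PER_HAND
frame = HumanActionFrame(
    head_position=(0.0, 1.6, 0.0),
    head_rotation=(1.0, 0.0, 0.0, 0.0),
    left_hand=rest_hand,
    right_hand=rest_hand,
    gesture="pinch",
    contacts=["stamp"],
)
print(len(frame.to_vector()))
```

Nothing in the record refers to robot joints or grippers, so the same frames could in principle be retargeted to different hardware, which is the point of defining the action space around the human body.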

In a Unity office scenario, the model receives the command "stamp the file" and predicts a sequence of human‑action chunks that accomplish the task, illustrating end‑to‑end human‑action generation.

Closed‑Loop Data‑Model‑Interaction

The system forms a closed loop: data trains the model, the model generates an interactive world, and the interaction outcomes are fed back to refine the model. Repeating this "data → model → interaction" cycle is what enables the model to evolve continuously.

Emergent Multi‑World‑Line Reasoning

When the latent memory is held fixed, EvoPhys‑World can pre‑play multiple possible futures under different action conditions (e.g., approaching a cup from various directions, choosing different targets, pushing versus flipping). This causal pre‑simulation of distinct world lines from the same starting state is what the team refers to as 5D reasoning.
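The idea of pre‑playing several world lines from one fixed starting state can be sketched with a toy transition function. Everything below is hypothetical: the two‑number state, the `push` and `flip` actions, and their effects stand in for the learned latent dynamics, which the article does not describe.

```python
from typing import Dict, List, Tuple

State = Tuple[float, float]  # toy state: (cup_x, cup_tilt_degrees)

def toy_transition(state: State, action: str) -> State:
    """Made-up dynamics: 'push' slides the cup, 'flip' tilts it."""
    x, tilt = state
    if action == "push":
        return (x + 0.5, tilt)
    if action == "flip":
        return (x, tilt + 90.0)
    return state  # unknown actions leave the scene unchanged

def rollout(start: State, actions: List[str]) -> List[State]:
    """Roll one world line forward from the shared starting state."""
    states = [start]
    for action in actions:
        states.append(toy_transition(states[-1], action))
    return states

def preplay(start: State, candidates: Dict[str, List[str]]) -> Dict[str, List[State]]:
    """Simulate every candidate action sequence from the same start."""
    return {name: rollout(start, actions) for name, actions in candidates.items()}

start: State = (0.0, 0.0)
futures = preplay(start, {
    "push_twice": ["push", "push"],
    "flip": ["flip"],
})
for name, states in futures.items():
    print(name, states[-1])
```

Each candidate branches from the same `start` and never modifies it, so the resulting futures can be compared side by side before any action is committed.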

GPU Compute Support

EvoPhys‑World was trained on Moer Thread’s fully domestic GPU stack, using 40,000 hours of pure human‑hand EGO data. The platform provides high‑throughput spatiotemporal memory handling, stable long‑sequence training, and tight software‑hardware co‑design, all of which are crucial for the model’s dual‑form evolution.

Conclusion

The next frontier for world models is controllability, interaction, and self‑evolution. EvoPhys‑World offers a human‑centric, scene‑level, 5D solution that moves AI beyond mere visual perception toward genuine understanding, manipulation, and transformation of the physical world.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.