How LLMs Power Real-Time Interactive 3D Worlds in Unreal Engine

This article explains how large language models are integrated with Unreal Engine to enable natural‑language‑driven 3D model search, manipulation, and scene understanding, detailing metadata extraction, vision‑language labeling, RAG‑based retrieval, and function‑call translation for interactive virtual environments.


Background

Large Language Models (LLMs) have transformed natural language processing, but physical reality is three‑dimensional. Integrating LLMs with 3D data can enhance perception, navigation, and interaction for autonomous systems, AR, robotics, and other applications.

Challenges

Combining LLMs with 3D data faces obstacles such as the representation of 3D assets, model scalability, computational efficiency, and the need for semantic descriptions robust enough to hold up in real‑world environments.

Solution Overview

The DataV team built a pipeline inside Unreal Engine that uses the Tongyi Qianwen model family to enable real‑time, natural‑language‑driven interaction with a 3D world. The pipeline provides three core capabilities:

Model search and creation: retrieve a 3D asset from a library based on a natural‑language query and instantiate it in the scene.

3D object manipulation: pick, move, or perform complex multi‑object operations via language commands.

Scene understanding and editing: modify scene layout or object attributes (e.g., color) through textual instructions.

3D Model Metadata

To make 3D assets consumable by LLMs, the system extracts four groups of information:

Asset information (ReferencePath, name).

Semantic description (human‑readable text).

Geometric data (vertices, triangles, bounding box, etc.).

Material parameters.

{
  "ReferencePath": "/Game/ArchVizInteriorVol3/Meshes/SM_Bed.SM_Bed",
  "Name": "SM_Bed",
  "Description": "A modern double bed",
  "Pivot": "center",
  "GeometryInfo": {"Vertices": 59997, "Triangles": 114502},
  "BoundingBox": {"Center": [...], "Extent": [...]},
  "Materials": [{"MI_Bed_Fabric_1": {"BaseColor": ..., "BaseFalloff": ...}}]
}
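To make the extraction step concrete, here is a minimal Python sketch that assembles such a record; the mesh accessors (reference_path, num_vertices, and so on) are hypothetical stand-ins for engine-side queries, e.g., through Unreal's Python scripting API.

import json

def build_metadata(mesh):
    # Assemble an LLM-consumable record for one static mesh. The mesh
    # attributes below are hypothetical stand-ins for engine-side queries.
    record = {
        "ReferencePath": mesh.reference_path,
        "Name": mesh.name,
        "Description": "",                       # filled in later by the VLM
        "Pivot": mesh.pivot,                     # e.g. "center"
        "GeometryInfo": {
            "Vertices": mesh.num_vertices,
            "Triangles": mesh.num_triangles,
        },
        "BoundingBox": {
            "Center": list(mesh.bounds_center),  # [x, y, z] world-space center
            "Extent": list(mesh.bounds_extent),  # half-size along each axis
        },
        "Materials": [{m.name: m.parameters} for m in mesh.materials],
    }
    return json.dumps(record, indent=2)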

Because manual labeling is costly, Vision‑Language Models (VLMs) generate the Description field automatically from rendered thumbnails.

Building a Natural‑Language Model Library

For each asset the pipeline creates a 640×640 thumbnail, sends the image to a VLM, receives a concise description, and stores the result in a CSV that serves as a knowledge base.
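A minimal sketch of that loop, assuming the DashScope Python SDK's multimodal interface for the VLM call and a hypothetical render_thumbnail export step (the model name qwen-vl-plus is likewise an assumption):

import csv
import dashscope  # Alibaba Cloud DashScope SDK: pip install dashscope

PROMPT = "Describe this 3D asset in one concise sentence."

def label_assets(assets, csv_path):
    # Render a thumbnail per asset, ask the VLM for a description, and
    # write (ReferencePath, Description) rows to the CSV knowledge base.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["ReferencePath", "Description"])
        for asset in assets:
            image_path = render_thumbnail(asset, size=640)  # hypothetical export step
            response = dashscope.MultiModalConversation.call(
                model="qwen-vl-plus",  # assumed model name; any VLM endpoint works
                messages=[{
                    "role": "user",
                    "content": [{"image": f"file://{image_path}"},
                                {"text": PROMPT}],
                }],
            )
            description = response.output.choices[0].message.content[0]["text"]
            writer.writerow([asset.reference_path, description])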

RAG‑Enabled Model Search

The CSV is imported into the knowledge‑index service of Alibaba Cloud Bailian (Model Studio). A Retrieval‑Augmented Generation (RAG) application then maps user queries to the ReferencePath of matching assets, enabling fast natural‑language search.
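The managed service hides the retrieval mechanics, but the core idea is embedding-based similarity search over the CSV rows. A self-contained sketch, with embed left as a stub for whichever text-embedding endpoint backs the index:

import numpy as np

def embed(text):
    # Stub for a text-embedding API call (e.g., a DashScope embedding model).
    raise NotImplementedError

def build_index(rows):
    # rows: (ReferencePath, Description) pairs loaded from the CSV knowledge base.
    vectors = np.stack([embed(desc) for _, desc in rows])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def search(query, rows, vectors, top_k=3):
    # Map a natural-language query to the ReferencePaths of the closest assets.
    q = embed(query)
    q /= np.linalg.norm(q)
    best = np.argsort(vectors @ q)[::-1][:top_k]  # cosine-similarity ranking
    return [rows[i][0] for i in best]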

3D Scene Understanding

Scene‑level annotation is achieved by capturing multi‑view screenshots, feeding them to VLMs, and establishing a one‑to‑one mapping between textual tokens and Unreal actors using strategies such as per‑object isolation, outline highlighting, or UUID‑based masking.
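For the UUID-based variant, the tagging and capture loop might look like the sketch below; capture_view and describe_with_vlm are hypothetical helpers standing in for the screenshot and VLM steps:

import uuid

def tag_actors(actors):
    # Give each Unreal actor a short unique token and remember the mapping.
    mapping = {}
    for actor in actors:
        token = uuid.uuid4().hex[:8]
        actor.tags.append(token)  # persisted on the actor for later lookup
        mapping[token] = actor
    return mapping

def annotate_scene(actors, views):
    mapping = tag_actors(actors)
    captions = []
    for view in views:
        image = capture_view(view, overlay_ids=mapping)  # hypothetical screenshot step
        captions.append(describe_with_vlm(image))        # hypothetical VLM call
    # Any token the VLM echoes back now resolves to exactly one actor.
    return mapping, captions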

Conversational 3D Interaction

After scene understanding, user intents are translated into function calls that Unreal Engine can execute. Example functions include GatherSceneInfo, GetObjectReferencePath, SpawnObject, and MoveObject, each defined with its required parameters.

{
  "type": "function",
  "function": {
    "name": "spawn_object",
    "description": "Generate or place an object in the scene",
    "parameters": {
      "type": "object",
      "properties": {
        "ReferencePath": {"type": "string", "description": "Asset path"},
        "description": {"type": "string", "description": "Object description"},
        "pos_x": {"type": "number", "description": "X position"},
        "pos_y": {"type": "number", "description": "Y position"},
        "pos_z": {"type": "number", "description": "Z position"},
        "rot_x": {"type": "number", "description": "Rotation around X"},
        "rot_y": {"type": "number", "description": "Rotation around Y"},
        "rot_z": {"type": "number", "description": "Rotation around Z"},
        "scale_x": {"type": "number", "description": "Scale X"},
        "scale_y": {"type": "number", "description": "Scale Y"},
        "scale_z": {"type": "number", "description": "Scale Z"}
      },
      "required": ["ReferencePath", "description", "pos_x", "pos_y", "pos_z", "rot_x", "rot_y", "rot_z", "scale_x", "scale_y", "scale_z"]
    }
  }
}
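On the engine side, each tool call the model emits only needs routing to a matching handler. A minimal dispatcher sketch, assuming the OpenAI-style tool-call payload and a hypothetical spawn_object binding into Unreal (e.g., via the Remote Control API or a Python bridge):

import json

def spawn_object(ReferencePath, description, pos_x, pos_y, pos_z,
                 rot_x, rot_y, rot_z, scale_x, scale_y, scale_z):
    # Hypothetical binding that instantiates the asset in the Unreal scene.
    ...

HANDLERS = {"spawn_object": spawn_object}  # plus move_object, gather_scene_info, ...

def dispatch(tool_call):
    # Execute one function call emitted by the LLM.
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])  # arguments arrive as a JSON string
    return HANDLERS[name](**args)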

Conclusion

The three core modules—3D model representation, scene understanding, and function‑call‑based interaction—demonstrate how LLMs can perceive, reason about, and manipulate a virtual 3D environment, opening opportunities for autonomous driving simulation, embodied AI, and rapid 3D content creation.

[Figure: LLM-driven 3D interaction preview]