How LLMs Power Real-Time Interactive 3D Worlds in Unreal Engine
This article explains how large language models are integrated with Unreal Engine to enable natural‑language‑driven 3D model search, manipulation, and scene understanding, detailing metadata extraction, vision‑language labeling, RAG‑based retrieval, and function‑call translation for interactive virtual environments.
Background
Large Language Models (LLMs) have transformed natural language processing, but physical reality is three‑dimensional. Integrating LLMs with 3D data can enhance perception, navigation, and interaction for autonomous systems, AR, robotics, and other applications.
Challenges
Combining LLMs with 3D data faces obstacles such as how to represent 3D assets in a form a language model can consume, model scalability, computational efficiency, and the need for semantic descriptions robust enough for real‑world environments.
Solution Overview
The DataV team built a pipeline inside Unreal Engine that uses the Tongyi Qianwen model family to enable real‑time, natural‑language‑driven interaction with a 3D world. The pipeline provides three core capabilities:
Model search and creation: retrieve a 3D asset from a library based on a natural‑language query and instantiate it in the scene.
3D object manipulation: pick, move, or perform complex multi‑object operations via language commands.
Scene understanding and editing: modify scene layout or object attributes (e.g., color) through textual instructions.
3D Model Metadata
To make 3D assets consumable by LLMs, the system extracts four groups of information:
Asset information (ReferencePath, name).
Semantic description (human‑readable text).
Geometric data (vertices, triangles, bounding box, etc.).
Material parameters.
{
  "ReferencePath": "/Game/ArchVizInteriorVol3/Meshes/SM_Bed.SM_Bed",
  "Name": "SM_Bed",
  "Description": "A modern double‑bed",
  "Pivot": "center",
  "GeometryInfo": {"Vertices": 59997, "Triangles": 114502},
  "BoundingBox": {"Center": [...], "Extent": [...]},
  "Materials": [{"MI_Bed_Fabric_1": {"BaseColor": ..., "BaseFallof": ...}}]
}

Because manual labeling is costly, Vision‑Language Models (VLMs) generate the Description field automatically from rendered thumbnails.
Building a Natural‑Language Model Library
For each asset the pipeline creates a 640×640 thumbnail, sends the image to a VLM, receives a concise description, and stores the result in a CSV that serves as a knowledge base.
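The labeling loop can be sketched in a few lines of Python. Here render_thumbnail and describe_image are hypothetical stand-ins for the Unreal-side 640×640 capture and the VLM call; only the CSV layout follows the pipeline described above:

```python
import csv

def render_thumbnail(reference_path: str) -> bytes:
    """Stand-in for the Unreal-side 640x640 thumbnail capture (hypothetical)."""
    return b"<png bytes>"

def describe_image(image: bytes) -> str:
    """Stand-in for the VLM call that returns a concise description (hypothetical)."""
    return "A modern double bed"

def build_knowledge_base(reference_paths, out_csv):
    # One row per asset: the ReferencePath plus its generated description.
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ReferencePath", "Description"])
        for path in reference_paths:
            thumbnail = render_thumbnail(path)
            writer.writerow([path, describe_image(thumbnail)])

build_knowledge_base(
    ["/Game/ArchVizInteriorVol3/Meshes/SM_Bed.SM_Bed"], "model_library.csv"
)
```

The resulting CSV is the knowledge base consumed by the retrieval step.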
RAG‑Enabled Model Search
The CSV is imported into Alibaba Cloud Baichuan’s knowledge‑index service. A Retrieval‑Augmented Generation (RAG) application then maps user queries to the ReferencePath of matching assets, enabling fast natural‑language search.
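To illustrate the retrieval step without the managed service, here is a toy retriever over the CSV rows using token overlap (Jaccard similarity). A real RAG index uses dense embeddings and vector search; the chair entry is a hypothetical second asset added for contrast:

```python
def tokenize(text: str) -> set:
    return set(text.lower().split())

def search(query: str, knowledge_base: list) -> str:
    """Return the ReferencePath whose description best overlaps the query.

    Toy Jaccard-similarity scoring; a production RAG service would embed
    both query and descriptions and do nearest-neighbor search instead.
    """
    query_tokens = tokenize(query)

    def score(entry):
        desc_tokens = tokenize(entry["Description"])
        union = query_tokens | desc_tokens
        return len(query_tokens & desc_tokens) / len(union) if union else 0.0

    return max(knowledge_base, key=score)["ReferencePath"]

kb = [
    {"ReferencePath": "/Game/ArchVizInteriorVol3/Meshes/SM_Bed.SM_Bed",
     "Description": "a modern double bed"},
    {"ReferencePath": "/Game/ArchVizInteriorVol3/Meshes/SM_Chair.SM_Chair",
     "Description": "a wooden dining chair"},
]
print(search("place a bed in the room", kb))  # prints the bed's ReferencePath
```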
3D Scene Understanding
Scene‑level annotation is achieved by capturing multi‑view screenshots and feeding them to VLMs. A one‑to‑one mapping between textual tokens and Unreal actors is then established using strategies such as per‑object isolation, outline highlighting, or UUID‑based masking.
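The UUID-based mapping can be sketched as a small index: each actor is captured in isolation, described by the VLM, and stored under a stable id so that later text can be resolved back to a concrete actor. SceneIndex and its word-overlap resolve are illustrative names, not the team's actual code:

```python
import uuid

class SceneIndex:
    """Toy token-to-actor index illustrating the UUID-based mapping strategy."""

    def __init__(self):
        self.actors = {}  # uuid -> {"name": ..., "description": ...}

    def register(self, actor_name: str, vlm_description: str) -> str:
        # Store the VLM's per-object description under a stable UUID.
        actor_id = str(uuid.uuid4())
        self.actors[actor_id] = {"name": actor_name, "description": vlm_description}
        return actor_id

    def resolve(self, phrase: str):
        """Return ids of actors whose description shares a word with the phrase."""
        words = set(phrase.lower().split())
        return [aid for aid, actor in self.actors.items()
                if words & set(actor["description"].lower().split())]

index = SceneIndex()
bed_id = index.register("SM_Bed", "a modern double bed against the wall")
index.register("SM_Lamp", "a brass floor lamp")
print(index.resolve("move the bed"))  # resolves to the bed actor's id
```

A real system would match via the LLM rather than word overlap, but the actor-id indirection is the key idea: language never references actors directly, only through the mapping.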
Conversational 3D Interaction
After scene understanding, user intents are translated into function calls that Unreal Engine can execute. Example functions include GatherSceneInfo, GetObjectReferencePath, SpawnObject, and MoveObject, each defined with its required parameters.
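Executing such a call is a routing step: parse the tool call the LLM emits (function name plus JSON arguments) and dispatch it to an engine binding. The spawn_object handler below is a hypothetical stand-in; a real implementation would invoke Unreal's spawning API via its Python scripting or a C++ bridge:

```python
import json

def spawn_object(ReferencePath, description, pos_x, pos_y, pos_z,
                 rot_x, rot_y, rot_z, scale_x, scale_y, scale_z):
    # Hypothetical engine binding; would spawn the actor in Unreal.
    return f"spawned {ReferencePath} at ({pos_x}, {pos_y}, {pos_z})"

HANDLERS = {"spawn_object": spawn_object}

def execute_tool_call(tool_call_json: str) -> str:
    """Route an LLM tool call (name + JSON arguments) to its engine handler."""
    call = json.loads(tool_call_json)
    handler = HANDLERS[call["name"]]
    return handler(**call["arguments"])

result = execute_tool_call(json.dumps({
    "name": "spawn_object",
    "arguments": {
        "ReferencePath": "/Game/ArchVizInteriorVol3/Meshes/SM_Bed.SM_Bed",
        "description": "a modern double bed",
        "pos_x": 0, "pos_y": 0, "pos_z": 0,
        "rot_x": 0, "rot_y": 0, "rot_z": 0,
        "scale_x": 1, "scale_y": 1, "scale_z": 1,
    },
}))
print(result)
```

The function's schema is what the LLM sees; the parameter names there must match the handler's signature exactly for this keyword-argument dispatch to work.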
{
  "type": "function",
  "function": {
    "description": "Generate or place an object in the scene",
    "parameters": {
      "type": "object",
      "properties": {
        "ReferencePath": {"type": "string", "description": "Asset path"},
        "description": {"type": "string", "description": "Object description"},
        "pos_x": {"type": "number", "description": "X position"},
        "pos_y": {"type": "number", "description": "Y position"},
        "pos_z": {"type": "number", "description": "Z position"},
        "rot_x": {"type": "number", "description": "Rotation around X"},
        "rot_y": {"type": "number", "description": "Rotation around Y"},
        "rot_z": {"type": "number", "description": "Rotation around Z"},
        "scale_x": {"type": "number", "description": "Scale X"},
        "scale_y": {"type": "number", "description": "Scale Y"},
        "scale_z": {"type": "number", "description": "Scale Z"}
      },
      "required": ["ReferencePath", "description", "pos_x", "pos_y", "pos_z",
                   "rot_x", "rot_y", "rot_z", "scale_x", "scale_y", "scale_z"]
    },
    "name": "spawn_object"
  }
}

Conclusion
The three core modules—3D model representation, scene understanding, and function‑call‑based interaction—demonstrate how LLMs can perceive, reason about, and manipulate a virtual 3D environment, opening opportunities for autonomous driving simulation, embodied AI, and rapid 3D content creation.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.