
Reinforcement Learning in Neural MMO: Background, Environment, Competition Solution, and Insights

The article reviews reinforcement learning applied to Neural MMO—a large‑scale, multi‑agent MMO environment—detailing its competitive IJCAI 2022 track, the winning LastOrder solution with transformer‑CNN‑LSTM architecture, reward shaping, a Fictitious Self‑Play meta‑solver, and Bilibili’s scalable Newton training framework.

Bilibili Tech

1. Background

Since achieving success in board games, reinforcement learning (RL) has also performed well in FPS (e.g., *Quake III*), RTS (e.g., *StarCraft II*), and MOBA (e.g., *DOTA 2*) titles. Compared with these games, MMO titles simulate a much larger world with richer content, making them a new research frontier for RL.

In 2019, OpenAI released Neural MMO, a large‑scale multi‑agent environment designed for RL training. Neural MMO was selected as a competition track for IJCAI 2022, and Bilibili’s RL team (LastOrder) won the championship.

2. Environment

Neural MMO simulates an ecosystem where a variable number of agents survive, grow, and fight. The map is tile‑based and randomly generated (water, forest, grass, rock, etc.). Agents spawn at random edge locations, must locate water and forest resources, explore, level up, and engage in combat with NPCs or other teams. Combat actions include melee, ranged, and magic attacks.

Unlike MOBA environments, Neural MMO does not define fixed opposing sides or bases; many teams coexist in the same world. Each team must account for the presence of multiple opponents when choosing between a conservative (wait‑and‑exploit) strategy and an aggressive (dominate) one, and each team's optimal choice depends on what the other teams do.

In the IJCAI 2022 Neural MMO competition, each team controls eight agents on a single map competing against fifteen other teams. Scoring covers exploration, survival, farming, and PvP behavior.

3. Competition Solution

3.1 Feature Extraction and Model Architecture

Agents receive two main types of observations: entity information (nearby teammates, enemies, NPCs) and terrain information (a 15×15 tile view around the agent). The processing pipeline is:

Entity information: encoded with a Transformer, dimensionality‑reduced while preserving features for downstream target selection.

Map information: integrated via a scatter‑connection technique and encoded with a CNN.

Previous action: embedded with an Embedding layer.

Other information (agent’s own state, team statistics, global game stats): encoded with an MLP.

All encoded vectors are concatenated and further processed. To enable information sharing among teammates, part of the data is pooled across team members. The core of the network is an LSTM; its output drives both action selection and value estimation.
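The scatter‑connection step mentioned above places each entity's feature vector onto its tile in the spatial view, so the CNN sees terrain and entities in one tensor. A minimal NumPy sketch of that idea follows; the function and variable names are illustrative, not taken from LastOrder's code:

```python
import numpy as np

def scatter_connect(tile_map, entities, positions, channels):
    """Scatter per-entity feature vectors onto their map tiles.

    tile_map : (H, W, C_map) terrain features
    entities : (N, C_ent) entity feature vectors
    positions: (N, 2) integer (row, col) tile coordinates
    Returns a (H, W, C_map + C_ent) tensor ready for a CNN encoder.
    """
    H, W, _ = tile_map.shape
    entity_plane = np.zeros((H, W, channels))
    for feat, (r, c) in zip(entities, positions):
        entity_plane[r, c] += feat  # sum if several entities share a tile
    return np.concatenate([tile_map, entity_plane], axis=-1)

# Toy example: a 15x15 view with 4 terrain channels and 3 entities
# carrying 8-dim features; two entities stand on the same tile.
view = np.zeros((15, 15, 4))
ents = np.ones((3, 8))
pos = np.array([[7, 7], [0, 3], [7, 7]])
fused = scatter_connect(view, ents, pos, channels=8)
```

The fused tensor keeps terrain channels untouched and adds entity channels that are zero everywhere except on occupied tiles, which is what lets an ordinary CNN reason about "who is standing where."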

3.2 Reward Design

The environment's raw rewards are sparse. While they are enough to train a baseline policy, a denser synthetic reward was constructed from environment data, which let agents learn the desired strategies more effectively.
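The article does not list the exact reward terms, but a dense synthetic reward of this kind is typically built from per‑step deltas in environment statistics. The sketch below is a plausible illustration only; the terms and weights are made up, not the competition values:

```python
def shaped_reward(prev, curr,
                  w_survive=0.01, w_explore=0.05,
                  w_damage=0.1, w_death=-1.0):
    """Dense synthetic reward from per-step environment deltas.

    `prev` / `curr` are dicts of one agent's stats on consecutive ticks.
    All weights and terms here are illustrative.
    """
    if not curr["alive"]:
        return w_death
    r = w_survive  # small bonus for surviving another tick
    r += w_explore * max(0, curr["tiles_seen"] - prev["tiles_seen"])
    r += w_damage * max(0, curr["damage_dealt"] - prev["damage_dealt"])
    return r

prev = {"alive": True, "tiles_seen": 10, "damage_dealt": 0}
curr = {"alive": True, "tiles_seen": 13, "damage_dealt": 5}
r = shaped_reward(prev, curr)
```

Using deltas (rather than absolute totals) keeps the reward dense but bounded per step, which is the usual reason shaped rewards are written this way.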

3.3 Meta Solver

Recent work frames multi‑agent problems as a Meta‑Solver + Oracle architecture. While the Oracle (a best‑response policy) is comparatively well‑defined, no practical Meta‑Solver exists for general‑sum multiplayer games. After evaluating several approaches, a Fictitious Self‑Play (FSP)‑inspired method was adopted: each training team occasionally uses historical models as opponents. Experiments showed that this class of FSP algorithms performed well in the competition.
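The opponent‑sampling idea behind FSP can be sketched in a few lines. This is an illustrative stand‑in, not the team's actual implementation; the probability split and checkpoint names are assumptions:

```python
import random

class FSPOpponentPool:
    """Fictitious Self-Play style opponent sampling (illustrative sketch).

    With probability `p_latest` a team trains against the current model;
    otherwise an opponent is drawn uniformly from historical snapshots,
    approximating play against the average of past policies.
    """
    def __init__(self, p_latest=0.8, seed=None):
        self.snapshots = []
        self.p_latest = p_latest
        self.rng = random.Random(seed)

    def save_snapshot(self, model):
        self.snapshots.append(model)

    def sample_opponent(self, current_model):
        if not self.snapshots or self.rng.random() < self.p_latest:
            return current_model
        return self.rng.choice(self.snapshots)

pool = FSPOpponentPool(p_latest=0.8, seed=0)
pool.save_snapshot("ckpt_1000")
pool.save_snapshot("ckpt_2000")
opp = pool.sample_opponent("ckpt_latest")
```

Sampling historical opponents uniformly is what distinguishes FSP from naive self‑play, which only ever trains against the latest model and can cycle through strategies.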

3.4 Newton Framework

Newton is Bilibili’s internally developed distributed RL training framework, offering strong scalability for various RL scenarios. It consists of Workers (CPU servers running the game environment and generating data), a Training Server (GPU servers using Horovod for multi‑node, multi‑GPU training), an Inference Server (accelerating inference for Workers), and an Evaluation Server (assessing agent performance and managing training).
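The Worker → Training Server data path can be illustrated with an in‑process stand‑in. Real Newton runs workers on CPU servers, trains with Horovod across GPUs, and serves inference separately; the sketch below (all names illustrative) only shows the producer/consumer flow between rollout generation and training:

```python
import queue
import threading

# Workers push rollouts into a bounded queue; the trainer consumes batches.
rollout_queue = queue.Queue(maxsize=64)

def worker(worker_id, n_episodes):
    for ep in range(n_episodes):
        trajectory = {"worker": worker_id, "episode": ep, "steps": 128}
        rollout_queue.put(trajectory)  # blocks if the trainer falls behind

def trainer(n_batches, batch_size):
    consumed = 0
    for _ in range(n_batches):
        batch = [rollout_queue.get() for _ in range(batch_size)]
        consumed += sum(t["steps"] for t in batch)  # gradient step goes here
    return consumed

threads = [threading.Thread(target=worker, args=(i, 4)) for i in range(2)]
for t in threads:
    t.start()
steps_seen = trainer(n_batches=2, batch_size=4)
for t in threads:
    t.join()
```

The bounded queue gives natural backpressure: when GPU training is the bottleneck, workers block instead of piling up stale trajectories, which is the same concern a distributed framework handles across machines.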

4. Conclusion

Neural MMO provides a game‑like environment that aligns with Bilibili’s extensive game‑related research. The RL team supports Bilibili World and other game AI projects. The platform continues to evolve (e.g., adding a trading system), enriching its content and community. Combining MMO dynamics with RL opens opportunities for Human‑in‑the‑Loop, Lifelong Learning, and other emerging algorithms, potentially making MMO worlds among the first domains where AGI emerges.

Note 1: “Meta Solver + Oracle” refers to converting a multi‑agent RL problem into a meta‑game where policies are treated as actions, enabling the use of meta‑solvers and matching oracles (see PSRO and related literature).
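To make the meta‑game framing in Note 1 concrete, here is a toy sketch: policies become "actions," an empirical win‑rate matrix becomes the payoff, the meta‑solver outputs a distribution over policies (uniform, in FSP's case), and the oracle best‑responds to that mixture. The numbers are made up for illustration:

```python
import numpy as np

# Empirical meta-game: payoff[i, j] = win rate of policy i vs policy j.
payoff = np.array([
    [0.5, 0.6, 0.3],
    [0.4, 0.5, 0.7],
    [0.7, 0.3, 0.5],
])

meta = np.full(3, 1 / 3)          # uniform meta-distribution (FSP)
expected = payoff @ meta          # each policy's payoff vs the mixture
best_response_idx = int(np.argmax(expected))
```

PSRO generalizes this loop: replace the uniform `meta` with a Nash (or other) solution of the meta‑game, then train a new oracle policy against that mixture and add it to the payoff matrix.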
