How We Won OpenAI’s Retro Contest: Joint PPO Mastery on Sonic Games

This article analyzes OpenAI’s Retro Contest on Sonic the Hedgehog, explains why reinforcement learning generalization is crucial for AGI, and details the winning team’s joint PPO pipeline, engineering optimizations, training strategies, and final performance compared to human baselines.

Alibaba Cloud Developer

Alibaba’s research team participated in OpenAI’s first Retro Contest, where the goal was to train AI agents to play the classic 2D platformer Sonic the Hedgehog and its sequels, testing the generalization ability of reinforcement‑learning (RL) algorithms across many similar game levels.

OpenAI’s Hidden Intent

OpenAI organized the contest not for publicity or talent scouting but to demonstrate that strong generalization in RL is a key pathway toward artificial general intelligence.

Contest Overview

The competition provided three Sonic games (Sonic 1, Sonic 2, Sonic 3 & Knuckles) with a total of 58 distinct levels. Each level shares core mechanics, allowing agents to potentially transfer knowledge from seen to unseen levels.

Technical Approach

Following OpenAI’s technical report, the team chose joint Proximal Policy Optimization (PPO) as its primary algorithm. The team preferred it over Rainbow, a DQN-based method, citing lower memory requirements and higher sample efficiency.

The network architecture mirrors that of the DeepMind Nature paper: raw RGB frames → three convolutional layers → one dense layer → action mapping. The team also applied Atari-style tricks (frame skipping, frame stacking, reward scaling) and made minor adjustments to the policy output.

State space: RGB images, because grayscale was insufficient for Sonic’s richer visuals.
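To make the layer stack concrete, the sketch below computes the feature-map sizes for the three convolutions. The kernel sizes, strides, and filter counts are those of the Nature DQN network; whether the team used exactly these values is an assumption, as is the 84x84 input, since the article does not state the resolution the team fed to the network.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1

# (kernel, stride, filters) for the three conv layers of the Nature DQN network.
NATURE_CONVS = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]

def feature_shape(height, width, convs=NATURE_CONVS):
    """Return (height, width, channels) after the convolutional stack."""
    channels = None
    for kernel, stride, filters in convs:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
        channels = filters
    return height, width, channels

# Assumed 84x84 input; the number of input channels (RGB, stacked frames)
# does not change the spatial sizes computed here.
h, w, c = feature_shape(84, 84)
flat_features = h * w * c  # width of the vector fed to the dense layer
```

With an 84x84 input the stack ends in a 7x7x64 feature map, so the dense layer sees 3136 inputs. A larger input only changes these numbers, not the structure.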

Action space: ten discrete actions derived from the original 12‑button controller, including a no‑op action for waiting.

Reward function: normalized score proportional to the agent’s x‑coordinate, scaled to 9000 at level completion, plus a time‑based bonus (0–1000) encouraging faster finishes.
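The reward description can be sketched as two small functions. This is an illustration under stated assumptions: rewarding only new rightward progress and letting the time bonus decay linearly to zero over 4500 steps are choices made for this sketch, and the names `progress_reward`, `time_bonus`, and `level_end_x` are hypothetical.

```python
def progress_reward(prev_max_x, x, level_end_x, total=9000.0):
    """Reward new rightward progress only; returns (reward, new_max_x).

    Summed over a full run, the rewards add up to `total` at the level end.
    """
    new_max_x = max(prev_max_x, x)
    reward = total * (new_max_x - prev_max_x) / level_end_x
    return reward, new_max_x

def time_bonus(steps, max_steps=4500, max_bonus=1000.0):
    """Completion bonus in [0, max_bonus]; faster finishes earn more.

    The linear decay and the 4500-step horizon are assumptions of this sketch.
    """
    remaining = max(0, max_steps - steps)
    return max_bonus * remaining / max_steps
```

For example, moving from x = 0 to the midpoint of a level yields 4500, moving backwards afterwards yields 0, and finishing halfway through the time budget adds a bonus of 500.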

buttons = ["B", "A", "MODE", "START", "UP", "DOWN", "LEFT", "RIGHT", "C", "Y", "X", "Z"]
actions = [["LEFT"], ["RIGHT"], ["LEFT", "DOWN"], ["RIGHT", "DOWN"], ["DOWN"], ["DOWN", "B"], ["B"], [], ["LEFT", "B"], ["RIGHT", "B"]]
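The two lists above define the mapping but not how it is applied. A minimal sketch, assuming the emulator expects one boolean per controller button in the order given by the button list; the helper name `to_button_array` is hypothetical.

```python
BUTTONS = ["B", "A", "MODE", "START", "UP", "DOWN",
           "LEFT", "RIGHT", "C", "Y", "X", "Z"]
ACTIONS = [["LEFT"], ["RIGHT"], ["LEFT", "DOWN"], ["RIGHT", "DOWN"], ["DOWN"],
           ["DOWN", "B"], ["B"], [], ["LEFT", "B"], ["RIGHT", "B"]]

def to_button_array(action_index):
    """Translate a discrete action index into a 12-element button mask."""
    pressed = set(ACTIONS[action_index])
    return [name in pressed for name in BUTTONS]

noop = to_button_array(7)        # empty combination: no button pressed
jump_right = to_button_array(9)  # RIGHT and B held together
```

The policy then only has to choose among ten indices, which is a much smaller search space than all 2^12 raw button combinations.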

Engineering Optimizations

The team used TensorFlow’s distributed training with parameter‑server (ps) nodes and worker nodes that each host a Retro environment, collecting (state, action, reward, next‑state) tuples for gradient computation before sending updates to the ps.

To improve GPU utilization, each worker maintains multiple environment copies so that policy inference runs on a batch of n observations at once (n ≈ 10). The collected data are then split into smaller minibatches to fit GPU memory.
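The batching idea can be illustrated without any deep-learning framework. In this sketch, `CountingPolicy` is a hypothetical stand-in for the network that only records how many forward passes ran, and the sizes (10 environment copies, 4 steps each, minibatches of 16) are assumptions chosen for illustration.

```python
class CountingPolicy:
    """Stand-in for a neural network; counts forward passes."""
    def __init__(self):
        self.calls = 0

    def __call__(self, observations):
        self.calls += 1
        return [0 for _ in observations]  # always pick action 0

def split_minibatches(samples, minibatch_size):
    """Cut a list of samples into consecutive minibatches."""
    return [samples[i:i + minibatch_size]
            for i in range(0, len(samples), minibatch_size)]

policy = CountingPolicy()
num_envs, num_steps = 10, 4
rollout = []
for step in range(num_steps):
    # One observation per environment copy (dummy 4-element vectors here).
    observations = [[0.0] * 4 for _ in range(num_envs)]
    actions = policy(observations)  # a single batched forward pass
    rollout.extend(zip(observations, actions))

minibatches = split_minibatches(rollout, 16)
```

Four steps across ten environments cost four forward passes instead of forty, and the 40 collected samples are split into minibatches of 16, 16, and 8.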

Joint PPO Training

A global model (Model A) was trained on all 58 training levels, converging after ~1.2 billion frames with an average score around 5500. Individual level curves showed many agents still getting stuck early.
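For reference, the core of PPO is its clipped surrogate objective, and the joint variant applies it to experience gathered from many levels in the same update. The sketch below pools samples from several levels into one average; the clipping range of 0.2 and the pooling scheme are assumptions of this example, not details confirmed by the article.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped objective for one sample.

    ratio is pi_new(a|s) / pi_old(a|s); the objective is
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

def joint_objective(per_level_samples, epsilon=0.2):
    """Average the surrogate over (ratio, advantage) pairs from every level."""
    terms = [clipped_surrogate(ratio, advantage, epsilon)
             for samples in per_level_samples.values()
             for ratio, advantage in samples]
    return sum(terms) / len(terms)
```

The clipping removes the incentive to move the policy far from the one that collected the data: a ratio of 1.5 with a positive advantage is credited as if it were 1.2.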

Learning by Parts

Starting from Model A, the team fine-tuned a separate model for each level, adding rewards for collecting rings and reaching new positions, plus a penalty for dying before level completion. This produced 58 specialized models (Model B) with average scores around 7000.
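A minimal sketch of such a shaped reward follows. The article does not give the weights, so `collect_bonus`, `novelty_bonus`, and `death_penalty` are hypothetical values, and the function name `shaped_reward` is invented for this example.

```python
def shaped_reward(base_reward, collected, reached_new_position, died,
                  collect_bonus=1.0, novelty_bonus=5.0, death_penalty=100.0):
    """Add fine-tuning bonuses and penalties to the contest reward.

    collected is the number of collectibles (rings in Sonic) picked up
    on this step. All three weights are assumptions of this sketch.
    """
    reward = base_reward + collect_bonus * collected
    if reached_new_position:
        reward += novelty_bonus
    if died:
        reward -= death_penalty
    return reward
```

The extra terms only steer training. Evaluation would still use the unshaped contest reward, so shaped and unshaped scores are not directly comparable.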

Unified Model

Model A was further refined using DeepMimic imitation learning, with weight decay and entropy regularization added to prevent over-fitting. The result, Model C, matched Model A’s L2 norm and entropy but achieved an average score of ~7300 on the training levels, surpassing the best human baseline by ~45%.
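One way to read this step is as an imitation loss with two regularizers. The sketch below is an interpretation rather than the team's implementation: it assumes the student is pulled toward a teacher's action distribution (for instance, one of the specialized models) with a cross-entropy term, and the coefficients `weight_decay` and `entropy_coef` are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of an action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cross_entropy(teacher_probs, student_probs):
    """Imitation term: low when the student matches the teacher."""
    return -sum(t * math.log(s)
                for t, s in zip(teacher_probs, student_probs) if t > 0)

def imitation_loss(teacher_probs, student_probs, weights,
                   weight_decay=1e-4, entropy_coef=0.01):
    """Cross-entropy plus an L2 penalty, minus an entropy bonus."""
    l2_penalty = sum(w * w for w in weights)
    return (cross_entropy(teacher_probs, student_probs)
            + weight_decay * l2_penalty
            - entropy_coef * entropy(student_probs))
```

The L2 penalty keeps the weight norm from growing, and the entropy bonus keeps the policy from collapsing to near-deterministic actions, which is consistent with the reported goal of matching Model A's L2 norm and entropy.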

Online Exploration

During testing, the team replaced unavailable internal state information with visual cues to detect new positions, effectively implementing a curiosity‑driven exploration bonus.
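The article does not say how frames were compared, so the following is only one possible realization: reduce each frame to a coarse signature and pay a bonus the first time a signature appears. The block size, quantization levels, grayscale input, and the names `frame_signature` and `NoveltyBonus` are all assumptions of this sketch.

```python
def frame_signature(frame, block=8, levels=4):
    """Reduce a grayscale frame (rows of 0-255 ints) to a coarse hashable key.

    Samples one pixel per block and quantizes it to `levels` intensity bins,
    so that small visual changes map to the same signature.
    """
    rows = range(0, len(frame), block)
    cols = range(0, len(frame[0]), block)
    return tuple(frame[r][c] * levels // 256 for r in rows for c in cols)

class NoveltyBonus:
    """Pay a fixed bonus the first time a frame signature is seen."""
    def __init__(self, bonus=1.0):
        self.seen = set()
        self.bonus = bonus

    def __call__(self, frame):
        key = frame_signature(frame)
        if key in self.seen:
            return 0.0
        self.seen.add(key)
        return self.bonus
```

Because the bonus depends only on pixels, it can be computed at test time, when the emulator's internal position variables are not exposed to the agent.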

Final Test

On the private leaderboard, the team’s algorithm ranked first among the top-10 public teams, improving consistently across 100k-frame evaluation windows on 11 unseen Sonic levels.

Reflections and Limitations

Attempts to use YOLO for object detection proved too slow for real‑time control.

Expert replay data were insufficient given the large action space.

Although Model C performed best in training, the team submitted Model A for the final contest due to concerns about C’s stability on unseen environments.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
