How We Won OpenAI’s Retro Contest: Joint PPO and Generalization in Sonic
This article details the technical journey behind Alibaba’s champion solution in OpenAI’s Retro Contest, explaining the reinforcement‑learning challenges of playing Sonic, the joint PPO approach, distributed training optimizations, reward shaping, fine‑tuning with DeepMimic, and the final performance that secured first place.
OpenAI Retro Contest Overview
The first OpenAI Retro Contest challenged teams to train AI agents to play the classic 2‑D platformer Sonic the Hedgehog, testing the generalization ability of reinforcement‑learning (RL) algorithms across multiple levels and game variants.
Why RL Evaluation Matters
Unlike supervised learning, which benefits from fixed train‑test splits and well‑established benchmarks, RL lacks standardized evaluation because agents interact with environments rather than static datasets. This leads to high variance in results and makes reproducibility difficult.
Community‑built libraries such as RL‑Glue, RLPy, Arcade Learning Environment, and especially OpenAI Gym have mitigated this by providing shared environments and leaderboards.
Contest Goals and Setup
The competition required agents to learn from raw RGB frames and output a 12‑button controller mapping, achieving high scores within 1 million frames (≈12 hours) on unseen test levels. Success depended on both rapid learning and strong generalization.
Our Technical Approach
We adopted a joint Proximal Policy Optimization (PPO) strategy, inspired by DeepMind’s Nature paper architecture: three convolutional layers followed by a dense layer, with additional Atari‑style tricks (frame skipping, stacking, reward scaling). Key modifications included:
State space: using full RGB images instead of grayscale to capture Sonic’s richer visual elements.
Action space: defining ten discrete actions (e.g., LEFT, RIGHT, LEFT+DOWN, etc.) rather than the raw 12‑button combination.
Reward shaping: normalizing the x‑position reward to a 0‑9000 scale and adding a time‑based bonus (0‑1000) to encourage faster completion.
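The reward shaping described in the last bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the team's actual code: the function names, `level_length`, and `max_steps` are hypothetical, and the linear decay of the time bonus is an assumption the article does not confirm.

```python
def progress_reward(prev_x, x, level_length, scale=9000.0):
    """Scale the change in x-position so that a full level sums to `scale`.

    `level_length` (the level's horizontal extent) is a hypothetical
    parameter. Moving left produces a negative reward.
    """
    return scale * (x - prev_x) / level_length


def time_bonus(steps_used, max_steps, max_bonus=1000.0):
    """Completion bonus that shrinks as more steps are used.

    Linear decay from `max_bonus` down to 0 over `max_steps` is an assumption.
    """
    remaining = max(0, max_steps - steps_used)
    return max_bonus * remaining / max_steps


# Usage: an agent that moves right, backtracks briefly, then finishes the level.
positions = [0, 250, 200, 1000]
rewards = [progress_reward(a, b, level_length=1000)
           for a, b in zip(positions, positions[1:])]
total = sum(rewards) + time_bonus(steps_used=2250, max_steps=4500)
```

Because the progress term is a scaled difference in position, the per-step rewards sum to the full scale once the level is completed, regardless of any backtracking along the way.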
buttons = ["B", "A", "MODE", "START", "UP", "DOWN", "LEFT", "RIGHT", "C", "Y", "X", "Z"]
actions = [["LEFT"], ["RIGHT"], ["LEFT", "DOWN"], ["RIGHT", "DOWN"], ["DOWN"], ["DOWN", "B"], ["B"], [], ["LEFT", "B"], ["RIGHT", "B"]]
To prevent negative rewards from discouraging backtracking, we introduced a cache mechanism that zeroes out negative signals during exploration.
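# In reset(): clear the buffer of withheld negative reward at the start of each episode.
self.episode_negbuf = 0

# In step(): env.step returns (observation, reward, done, info) under the Gym API.
obs, reward, done, info = env.step(action)
if reward < 0 or self.episode_negbuf < 0:
    # Withhold negative rewards, and let later positive rewards pay the buffer back.
    self.episode_negbuf += reward
    reward = 0
Engineering Optimizations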
We implemented distributed training with TensorFlow: parameter‑server (ps) nodes for weight aggregation and multiple worker nodes each running an independent environment instance. To reduce synchronization overhead, each worker copies the global actor network locally before sampling, and we batch experiences across ten parallel environment copies, achieving effective batch sizes of ~81920 while fitting on a single P100 GPU.
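The "copy the global weights locally, then sample" pattern can be illustrated with a toy parameter server. This is only a sketch in plain Python threads, not the TensorFlow setup described above: `ParameterServer`, `toy_gradient`, `worker`, and `run` are hypothetical names, and the gradient is a stand-in for real rollouts and backpropagation.

```python
import threading


class ParameterServer:
    """Holds the global weights; a lock serializes reads and updates."""

    def __init__(self, weights):
        self._weights = list(weights)
        self._lock = threading.Lock()
        self.version = 0  # number of updates applied so far

    def pull(self):
        """Return a private copy of the current global weights."""
        with self._lock:
            return list(self._weights)

    def push(self, grads, lr=0.1):
        """Apply one gradient step to the global weights."""
        with self._lock:
            self._weights = [w - lr * g for w, g in zip(self._weights, grads)]
            self.version += 1


def toy_gradient(weights):
    """Stand-in for rollout plus backprop: gradient of 0.5 * sum(w**2)."""
    return list(weights)


def worker(ps, steps):
    for _ in range(steps):
        local = ps.pull()            # copy global weights once per iteration
        grads = toy_gradient(local)  # work only on the local copy, no locking
        ps.push(grads)               # send the result back for aggregation


def run(num_workers=4, steps=5):
    ps = ParameterServer([1.0, -2.0, 0.5])
    threads = [threading.Thread(target=worker, args=(ps, steps))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ps
```

The point of the local copy is that the expensive part of each iteration (sampling) never touches shared state, so workers only contend for the lock during the brief pull and push.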
Training Phases
Joint PPO Training: A single global model (Model A) was trained on all 58 training levels, converging after ~1.2 billion frames with an average score around 5500.
Subsequent per‑level fine‑tuning (Model B) added extra rewards for coins, new positions, and death penalties, raising the average score to ~7000.
DeepMimic Fine‑Tuning: Using imitation learning (DeepMimic), we refined Model A into Model C, preserving L2 norm and entropy while improving performance on the training set to ~7300 and achieving a ~45% win rate against top human baselines.
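For readers unfamiliar with PPO, the joint training phase rests on PPO's standard clipped surrogate objective, evaluated on a batch that pools transitions from many levels. The sketch below shows that objective in plain Python. The clip range of 0.2 is a common default rather than a value reported here, and the level names and numbers are made up for illustration.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized) over a batch."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the clipped and unclipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# A "joint" batch pools transitions from several levels before one update.
# Hypothetical data: one transition per level.
batches_by_level = {
    "level_a": {"new": [math.log(0.5)], "old": [math.log(0.25)], "adv": [1.0]},
    "level_b": {"new": [math.log(0.2)], "old": [math.log(0.2)], "adv": [-1.0]},
}
new_logps = [lp for b in batches_by_level.values() for lp in b["new"]]
old_logps = [lp for b in batches_by_level.values() for lp in b["old"]]
advantages = [a for b in batches_by_level.values() for a in b["adv"]]
objective = ppo_clip_objective(new_logps, old_logps, advantages)
```

In the first transition the probability ratio is 2, so the clipped term caps its contribution at 1.2 times the advantage. This cap is what keeps a single update from moving the policy too far on any one level.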
Final Evaluation
In the private leaderboard test (11 unseen Sonic levels, 1 million frames each, three runs per team), our joint PPO‑based submission ranked first, consistently improving beyond other top teams as training progressed.
Reflections and Limitations
We experimented with YOLO‑based object detection and expert replay data, but both proved too slow or insufficient for the large action space. Although Model C performed best during training, we conservatively submitted Model A for the final test because we were concerned about how stably Model C would generalize.
Key takeaways include the importance of reward shaping, distributed sampling efficiency, and the need for better metrics beyond L2 norm and entropy to assess RL generalization.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
