How AI Beats Super Mario with PPO in 5 Minutes

This tutorial demonstrates how to use Huawei Cloud ModelArts and the Proximal Policy Optimization (PPO) reinforcement-learning algorithm to train an AI agent that can clear most Super Mario levels within about 1500 episodes. No coding experience is required.


When AI Plays Super Mario


ModelArts is an all-in-one AI development platform that covers data preprocessing, interactive labeling, distributed training, automated model generation, and end-to-end deployment across device, edge, and cloud.

PPO Algorithm Overview

PPO is an on-policy method that works for both discrete and continuous action spaces. The PPO-Clip variant (the one OpenAI uses by default) constrains the size of each policy update with a clipping parameter ε. Key features include on-policy learning, advantage-function estimation, and stochastic exploration.

PPO is an on-policy algorithm.

It applies to both discrete and continuous action spaces.

Its loss clips a probability-ratio term to [1 − ε, 1 + ε], limiting how far each update can move the policy (see the objective below).
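
For reference, the clipped surrogate objective behind PPO-Clip (Schulman et al., 2017) is shown below in LaTeX; θ denotes the policy parameters and Â_t the advantage estimate:

% PPO-Clip surrogate objective
L^{\mathrm{CLIP}}(\theta) =
    \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\;
    \mathrm{clip}\!\left(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\right) \hat{A}_t \right) \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

Whenever the ratio r_t(θ) drifts outside [1 − ε, 1 + ε], the clip removes the incentive to push it further, which is what keeps each policy step small.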

The main pipeline consists of creating the Mario environment, building the PPO agent, training, running inference, and visualizing the results, as sketched below; the whole workflow can be tried for free on AI Gallery.
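
A minimal sketch of that pipeline, assuming the community gym-super-mario-bros package (which the notebook builds on); the random action below is a stand-in for the PPO agent defined in the prepared code:

import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT

# Create the world 1, stage 1 environment with a small discrete action set.
env = gym_super_mario_bros.make("SuperMarioBros-1-1-v0")
env = JoypadSpace(env, SIMPLE_MOVEMENT)

state = env.reset()
done = False
while not done:
    action = env.action_space.sample()            # stand-in for the PPO policy
    state, reward, done, info = env.step(action)
env.close()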

Quick Start

In ModelArts Jupyter, click the run arrow to execute the prepared code. The example uses PyTorch 1.0.0 with GPU support. You can modify the hyper‑parameters in the following configuration to train on different worlds or stages.

opt = {
    "world": 1,                # selectable worlds: 1-8
    "stage": 1,                # selectable stages: 1-4
    "action_type": "simple",   # "simple", "right_only", or "complex"
    "lr": 1e-4,                # learning rate; suggested values: 1e-3, 1e-4, 7e-5, 1e-5
    "gamma": 0.9,              # reward discount factor
    "tau": 1.0,                # GAE parameter (lambda)
    "beta": 0.01,              # entropy coefficient
    "epsilon": 0.2,            # PPO clip coefficient
    "batch_size": 16,          # batch size per update
    "max_episode": 10,         # maximum training episodes
    "num_epochs": 10,          # optimization epochs per batch of experience
    "num_local_steps": 512,    # maximum steps per episode
    "num_processes": 8,        # parallel training processes (usually the CPU core count)
    "save_interval": 5,        # save the model every N episodes
    "log_path": "./log",       # log output directory
    "saved_path": "./model",   # model checkpoint directory
    "pretrain_model": True,    # load the pretrained model for world 1, stage 1
    "episode": 5               # episodes to run during inference
}
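
To make the epsilon (clip coefficient) and beta (entropy coefficient) entries concrete, here is a sketch of the PPO-Clip loss they control, written for PyTorch; the tensor names (log_probs, old_log_probs, advantages, entropy) are illustrative assumptions rather than the notebook's actual variables:

import torch

def ppo_loss(log_probs, old_log_probs, advantages, entropy,
             epsilon=0.2, beta=0.01):
    # Probability ratio r_t(theta) between the new and old policies.
    ratio = torch.exp(log_probs - old_log_probs)
    # Clipped surrogate objective: take the pessimistic (minimum) term.
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()
    # Entropy bonus (weighted by beta) keeps the policy exploratory.
    return policy_loss - beta * entropy.mean()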

Adjust world and stage to target other levels. After training, the AI agent can navigate the game and achieve high scores.
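
For example (assuming the standard gym-super-mario-bros naming scheme), retargeting world 2, stage 3 only means changing two entries and rebuilding the environment id:

opt["world"], opt["stage"] = 2, 3
env_id = "SuperMarioBros-{}-{}-v0".format(opt["world"], opt["stage"])
# -> "SuperMarioBros-2-3-v0"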

The tutorial is intended for developers of any skill level; even without coding experience, you can follow the steps to build a game‑playing AI.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: AI, reinforcement learning, PPO, Game AI, Super Mario, ModelArts
Written by

Huawei Cloud Developer Alliance

The Huawei Cloud Developer Alliance creates a tech sharing platform for developers and partners, gathering Huawei Cloud product knowledge, event updates, expert talks, and more. Together we continuously innovate to build the cloud foundation of an intelligent world.
