Building a Custom 8×8 GridWorld with Q‑Learning in Gymnasium

This tutorial walks through creating a custom 8×8 GridWorld environment in Gymnasium, implementing a Q‑Learning agent that learns to navigate from the top‑left corner to the bottom‑right goal while avoiding walls, and visualizing training curves, learned policies, and a performance comparison with a random agent.

DeepHub IMBA

Project Overview

Custom 8×8 GridWorld where an agent starts at the top‑left corner, avoids walls, and reaches the bottom‑right goal using Q‑Learning without any hard‑coded path.

Project Structure

gridworld/
├── grid_env.py       # custom Gymnasium environment
├── agent.py          # Q‑Learning agent
├── train.py          # training loop + charts
├── visualize.py      # animation and comparison
└── requirements.txt

Environment

Gymnasium environments inherit from gym.Env. GridWorldEnv defines an 8×8 grid, a set of wall coordinates, start and goal positions, a discrete action space (0: up, 1: right, 2: down, 3: left), and an observation space of 64 states.

import gymnasium as gym
from gymnasium import spaces

class GridWorldEnv(gym.Env):
    def __init__(self, render_mode=None):
        self.grid_size = 8
        self.max_steps = 200
        self.action_space = spaces.Discrete(4)         # up, right, down, left
        self.observation_space = spaces.Discrete(64)   # 8x8 = 64 states
        self.walls = {(1,1),(1,2),(1,3),(2,5),(3,5),(4,5),(5,2),(5,3),(5,4),(6,6)}
        self.start = (0, 0)
        self.goal  = (7, 7)
        self.render_mode = render_mode

The step() method moves the agent, prevents crossing walls or boundaries, applies a step penalty of -0.01 plus a distance‑based penalty -0.001 * dist, and gives a reward of +1.0 only when the goal is reached.

def step(self, action):
    moves = {0: (-1,0), 1: (0,1), 2: (1,0), 3: (0,-1)}
    dr, dc = moves[action]
    r, c = self.agent_pos
    nr, nc = r + dr, c + dc
    # stay in place if the move would leave the grid or enter a wall
    if (0 <= nr < self.grid_size and 0 <= nc < self.grid_size
            and (nr, nc) not in self.walls):
        self.agent_pos = [nr, nc]
    self.steps += 1                      # step counter, zeroed in reset()
    terminated = (tuple(self.agent_pos) == self.goal)
    truncated = self.steps >= self.max_steps
    if terminated:
        reward = 1.0
    else:
        dist = abs(self.agent_pos[0] - 7) + abs(self.agent_pos[1] - 7)
        reward = -0.01 - 0.001 * dist
    # encode (row, col) as a single discrete state in [0, 63]
    obs = self.agent_pos[0] * self.grid_size + self.agent_pos[1]
    return obs, reward, terminated, truncated, {}
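The article never shows reset(), although the training loop calls it. A minimal sketch, assuming the row-major state encoding state = row * 8 + col implied by the Discrete(64) observation space (the `_ResetMixin` wrapper here is only a stand-in so the snippet runs on its own):

```python
class _ResetMixin:
    grid_size = 8
    start = (0, 0)

    def reset(self, seed=None, options=None):
        self.agent_pos = list(self.start)   # agent returns to the start cell
        self.steps = 0                      # counter used for truncation
        # encode (row, col) as a single discrete state in [0, 63]
        obs = self.agent_pos[0] * self.grid_size + self.agent_pos[1]
        return obs, {}

env = _ResetMixin()
obs, info = env.reset()    # start cell (0, 0) encodes to state 0
```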

Agent

The Q‑table is a NumPy array of shape (64, 4) initialized to zeros. The core update rule is

Q(s,a) ← Q(s,a) + α · [r + γ · max_a' Q(s',a') − Q(s,a)]

Implementation:

def update(self, state, action, reward, next_state, done):
    best_next = 0.0 if done else np.max(self.Q[next_state])
    td_target = reward + self.gamma * best_next
    td_error  = td_target - self.Q[state, action]
    self.Q[state, action] += self.alpha * td_error

Action selection uses epsilon‑greedy; epsilon decays from near 1.0 to 0.05.

def select_action(self, state):
    if np.random.rand() < self.epsilon:
        return np.random.randint(self.n_actions)   # explore
    return np.argmax(self.Q[state])                # exploit
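The decay schedule itself isn't shown. A common choice consistent with the description above is multiplicative decay with a floor at 0.05; the decay rate 0.995 here is an illustrative assumption, not taken from the original code:

```python
# Sketch of the epsilon decay described above: multiplicative decay from 1.0
# down to a floor of 0.05. The rate 0.995 is an assumed hyperparameter.
class EpsilonSchedule:
    def __init__(self, epsilon=1.0, decay=0.995, epsilon_min=0.05):
        self.epsilon = epsilon
        self.decay = decay
        self.epsilon_min = epsilon_min

    def decay_epsilon(self):
        self.epsilon = max(self.epsilon_min, self.epsilon * self.decay)

sched = EpsilonSchedule()
for _ in range(2000):          # one decay per episode, as in the training loop
    sched.decay_epsilon()
```

With these numbers epsilon hits the 0.05 floor after roughly 600 episodes, matching the training log above.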

Training Loop

for ep in range(1, n_episodes + 1):
    obs, _ = env.reset()
    done = False
    while not done:
        action = agent.select_action(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated   # end the episode on either signal
        agent.update(obs, action, reward, next_obs, done)
        obs = next_obs
    agent.decay_epsilon()

Running python train.py for 2000 episodes takes about 10 seconds. Sample log:

Episode   Reward   Steps   Epsilon   Success
--------------------------------------------------
    200   -0.113   59.8    0.367      92.5%
    400    0.704   18.3    0.135     100.0%
    600    0.754   15.4    0.050     100.0%
   2000    0.764   14.9    0.050     100.0%

By episode 400 the agent reaches the goal in roughly 15 steps with a 100% success rate.

Visualizations

Episode reward curve – noisy early, then rises and stabilizes.

Steps per episode – drops sharply as the agent discovers shorter paths.

Success rate – quickly reaches and maintains 100%.

Value heatmap shows high values near the goal and low values near walls, illustrating Bellman propagation. Policy plot draws arrows indicating the optimal action for each non‑wall state.

def plot_policy_and_values(agent):
    grid = 8
    V = np.max(agent.Q, axis=1).reshape(grid, grid)        # V(s) = max_a Q(s,a)
    policy = np.argmax(agent.Q, axis=1).reshape(grid, grid)
    arrows = {0: '↑', 1: '→', 2: '↓', 3: '←'}
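The original renders the arrows with matplotlib; as a self-contained stand-in, the same policy can be dumped as text, which is also handy for quick debugging. Walls and the goal are marked with '#' and 'G'; the wall and goal coordinates mirror the environment above:

```python
import numpy as np

# Text-based sketch of the policy plot: one arrow per cell, '#' for walls,
# 'G' for the goal. This is a stand-in for the matplotlib version.
def print_policy(Q, walls, goal, grid=8):
    arrows = {0: '↑', 1: '→', 2: '↓', 3: '←'}
    policy = np.argmax(Q, axis=1).reshape(grid, grid)
    lines = []
    for r in range(grid):
        row = []
        for c in range(grid):
            if (r, c) in walls:
                row.append('#')
            elif (r, c) == goal:
                row.append('G')
            else:
                row.append(arrows[policy[r, c]])
        lines.append(' '.join(row))
    return '\n'.join(lines)

# Demo with an untrained (all-zero) Q-table: argmax ties resolve to action 0.
demo = print_policy(np.zeros((64, 4)), {(1, 1)}, (7, 7))
```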

Random vs. Trained Agent

                        Random    Trained
=============================================
Avg steps                182.6       14.0
Success rate             23.5%     100.0%
Best steps                  31         14
=============================================

The random agent reaches the goal only 23.5% of the time within 200 steps, averaging about 183 steps per episode; the trained agent succeeds in every episode, in 14 steps on average.
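A sketch of how such a comparison can be run: evaluate any policy function for a number of episodes and average steps and success rate. The environment is assumed to follow the Gymnasium 5-tuple step API used throughout this tutorial; `evaluate` itself is an illustrative helper, not from the original code:

```python
import numpy as np

# Run one policy for n_episodes and report (average steps, success rate).
def evaluate(env, policy_fn, n_episodes=100, max_steps=200):
    steps_list, successes = [], 0
    for _ in range(n_episodes):
        obs, _ = env.reset()
        for t in range(1, max_steps + 1):
            obs, reward, terminated, truncated, _ = env.step(policy_fn(obs))
            if terminated:          # goal reached
                successes += 1
                break
            if truncated:           # episode cut off by the step limit
                break
        steps_list.append(t)
    return float(np.mean(steps_list)), successes / n_episodes

# Usage: evaluate(env, lambda s: env.action_space.sample())  # random policy
#        evaluate(env, lambda s: int(np.argmax(agent.Q[s]))) # greedy policy
```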

Key Observations

Q‑Learning is off‑policy. The update uses max_a' Q(s',a'), allowing learning of the optimal policy while exploring with a random policy.

The Q‑table encodes the policy. After training the environment is no longer required; the agent can be saved and later loaded to act directly from the table.
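Persisting the table is a one-liner with NumPy. A minimal sketch (the file path is illustrative, and the zero array stands in for agent.Q after training):

```python
import os
import tempfile
import numpy as np

# Save the learned Q-table to disk, then act greedily from the loaded copy
# without needing the environment or the agent object.
Q = np.zeros((64, 4))                 # stands in for agent.Q after training
path = os.path.join(tempfile.gettempdir(), "q_table.npy")
np.save(path, Q)                      # save once training is done
Q_loaded = np.load(path)              # later: load and act directly

def act(state):
    """Greedy action read straight from the loaded table."""
    return int(np.argmax(Q_loaded[state]))
```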

Reward shaping accelerates learning. Adding a small step penalty and distance‑based penalty provides early directional hints without altering the optimal policy.

The agent‑environment loop is universal. The same loop applies to other RL algorithms; only the policy representation and update rule differ.

Code Repository

https://github.com/ES7/Reinforcement-Learning-Projects/tree/main

