Building a Custom 8×8 GridWorld with Q‑Learning in Gymnasium
This tutorial walks through creating a custom 8×8 GridWorld environment in Gymnasium, implementing a Q‑Learning agent that learns to navigate from the top‑left corner to the bottom‑right goal while avoiding walls, and visualizing training curves, learned policies, and a performance comparison with a random agent.
Project Overview
A custom 8×8 GridWorld where an agent starts at the top‑left corner, avoids walls, and reaches the bottom‑right goal using Q‑Learning, with no hard‑coded path.
Project Structure
gridworld/
├── grid_env.py # custom Gymnasium environment
├── agent.py # Q‑Learning agent
├── train.py # training loop + charts
├── visualize.py # animation and comparison
└── requirements.txt

Environment
Gymnasium environments inherit from gym.Env. GridWorldEnv defines an 8×8 grid, a set of wall coordinates, start and goal positions, a discrete action space (0: up, 1: right, 2: down, 3: left), and an observation space of 64 states.
import gymnasium as gym
from gymnasium import spaces

class GridWorldEnv(gym.Env):
    def __init__(self, render_mode=None):
        super().__init__()
        self.render_mode = render_mode
        self.grid_size = 8
        self.max_steps = 200
        self.action_space = spaces.Discrete(4)        # up, right, down, left
        self.observation_space = spaces.Discrete(64)  # 8x8 = 64 states
        self.walls = {(1,1), (1,2), (1,3), (2,5), (3,5), (4,5), (5,2), (5,3), (5,4), (6,6)}
        self.start = (0, 0)
        self.goal = (7, 7)
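The article does not show reset(); a minimal sketch consistent with the fields above (the Gymnasium two‑value reset API and the row*8 + col state encoding implied by Discrete(64) are assumptions) could be:

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.agent_pos = list(self.start)   # back to the top-left corner
        self.steps = 0
        obs = self.agent_pos[0] * self.grid_size + self.agent_pos[1]  # state index = row*8 + col
        return obs, {}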
The step() method moves the agent, blocks moves that would cross walls or leave the grid, applies a step penalty of -0.01 plus a distance‑based penalty of -0.001 * dist, and gives a reward of +1.0 only when the goal is reached.

    def step(self, action):
        moves = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # up, right, down, left
        r, c = self.agent_pos
        dr, dc = moves[action]
        nr, nc = r + dr, c + dc
        # move only if the target cell is inside the grid and not a wall
        if 0 <= nr < self.grid_size and 0 <= nc < self.grid_size and (nr, nc) not in self.walls:
            self.agent_pos = [nr, nc]
        self.steps += 1
        terminated = tuple(self.agent_pos) == self.goal
        truncated = self.steps >= self.max_steps
        if terminated:
            reward = 1.0
        else:
            dist = abs(self.agent_pos[0] - 7) + abs(self.agent_pos[1] - 7)  # Manhattan distance to goal
            reward = -0.01 - 0.001 * dist
        obs = self.agent_pos[0] * self.grid_size + self.agent_pos[1]
        return obs, reward, terminated, truncated, {}
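As a quick sanity check of the environment (the values in the comments follow from the snippets as reconstructed above, not from output in the original article):

env = GridWorldEnv()
obs, info = env.reset()                                  # obs = 0 (top-left corner)
obs, reward, terminated, truncated, info = env.step(1)   # move right to (0, 1)
print(obs, reward)                                       # state 1, reward ≈ -0.023 (-0.01 - 0.001*13)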
Agent

The Q‑table is a NumPy array of shape (64, 4), initialized to zeros.
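The article shows only the update and action‑selection methods; a minimal constructor consistent with them is sketched below. The 0.05 epsilon floor is stated in the text, and a 0.995 per‑episode decay matches the training log (0.995^200 ≈ 0.367, 0.995^400 ≈ 0.135); alpha and gamma are assumptions.

import numpy as np

class QLearningAgent:
    def __init__(self, n_states=64, n_actions=4, alpha=0.1, gamma=0.99,
                 epsilon=1.0, epsilon_min=0.05, epsilon_decay=0.995):
        self.Q = np.zeros((n_states, n_actions))  # one row per state, one column per action
        self.n_actions = n_actions
        self.alpha = alpha                         # learning rate (assumed value)
        self.gamma = gamma                         # discount factor (assumed value)
        self.epsilon = epsilon                     # exploration rate, decays toward the floor
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay

    def decay_epsilon(self):
        # called once per episode in train.py; floored at 0.05 per the log below
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)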
The core update rule is

    Q(s, a) ← Q(s, a) + α · [r + γ · max_a' Q(s', a') − Q(s, a)]

Implementation:

    def update(self, state, action, reward, next_state, done):
        best_next = 0.0 if done else np.max(self.Q[next_state])  # max_a' Q(s', a')
        td_target = reward + self.gamma * best_next
        td_error = td_target - self.Q[state, action]
        self.Q[state, action] += self.alpha * td_error

Action selection uses epsilon‑greedy; epsilon decays from near 1.0 to 0.05.
    def select_action(self, state):
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)  # explore
        return np.argmax(self.Q[state])               # exploit

Training Loop
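The loop assumes the environment, agent, and episode count are constructed first; a minimal preamble, using the class names from the sketches above:

env = GridWorldEnv()
agent = QLearningAgent()
n_episodes = 2000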
for ep in range(1, n_episodes + 1):
    obs, _ = env.reset()
    done = False
    while not done:
        action = agent.select_action(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # episode ends on success or the step limit
        agent.update(obs, action, reward, next_obs, done)
        obs = next_obs
    agent.decay_epsilon()

Running python train.py for 2000 episodes takes about 10 seconds. Sample log:
Episode   Reward   Steps   Epsilon   Success
--------------------------------------------------
    200   -0.113    59.8     0.367     92.5%
    400    0.704    18.3     0.135    100.0%
    600    0.754    15.4     0.050    100.0%
   2000    0.764    14.9     0.050    100.0%

By episode 400 the agent reaches the goal in roughly 15 steps with 100% success.
Visualizations
Episode reward curve – noisy early, then rises and stabilizes.
Steps per episode – drops sharply as the agent discovers shorter paths.
Success rate – quickly reaches and maintains 100 %.
Value heatmap shows high values near the goal and low values near walls, illustrating Bellman propagation. Policy plot draws arrows indicating the optimal action for each non‑wall state.
def plot_policy_and_values(agent, grid=8):
    V = np.max(agent.Q, axis=1).reshape(grid, grid)        # V(s) = max_a Q(s,a)
    policy = np.argmax(agent.Q, axis=1).reshape(grid, grid)
    arrows = {0: '↑', 1: '→', 2: '↓', 3: '←'}              # action index → arrow glyph
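The drawing code itself is elided in the article; a minimal matplotlib helper that consumes V, policy, and arrows from the snippet above could look like this (draw_policy is a hypothetical name, and the styling choices are assumptions):

import matplotlib.pyplot as plt

def draw_policy(V, policy, arrows, grid=8):
    fig, ax = plt.subplots()
    im = ax.imshow(V, cmap='viridis')                 # value heatmap, (0,0) at top-left
    for r in range(grid):
        for c in range(grid):
            ax.text(c, r, arrows[int(policy[r, c])],  # greedy action per cell
                    ha='center', va='center', color='white')
    fig.colorbar(im, label='V(s)')
    plt.show()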
Random vs. Trained Agent

              Random    Trained
=============================================
Avg steps      182.6       14.0
Success rate   23.5%     100.0%
Best steps        31         14
=============================================

The random agent reaches the goal only 23.5% of the time within the 200‑step limit, averaging 182.6 steps; the trained agent succeeds in every episode, in 14 steps.
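The comparison can be reproduced with a small evaluation helper; a sketch under the assumptions above (run_episode and the two policy functions are illustrative names):

def run_episode(env, policy_fn):
    obs, _ = env.reset()
    steps = 0
    while True:
        obs, reward, terminated, truncated, _ = env.step(policy_fn(obs))
        steps += 1
        if terminated:
            return steps, True    # reached the goal
        if truncated:
            return steps, False   # hit the 200-step limit

random_policy = lambda obs: env.action_space.sample()
trained_policy = lambda obs: int(np.argmax(agent.Q[obs]))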
Key Observations
Q‑Learning is off‑policy. The update bootstraps from max_a' Q(s',a'), the greedy target, so the agent can learn the optimal policy while behaving epsilon‑greedily, including its random exploratory moves.
The Q‑table encodes the policy. After training, the environment is no longer required; the agent can be saved and later loaded to act directly from the table (see the sketch after this list).
Reward shaping accelerates learning. Adding a small step penalty and distance‑based penalty provides early directional hints without altering the optimal policy.
The agent‑environment loop is universal. The same loop applies to other RL algorithms; only the policy representation and update rule differ.
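To make the second observation concrete, a save/load sketch with NumPy (the file name is illustrative):

import numpy as np

np.save("q_table.npy", agent.Q)     # persist the learned table after training

Q = np.load("q_table.npy")          # later: reload and act greedily, no environment needed

def greedy_action(state):
    return int(np.argmax(Q[state])) # pick the highest-valued action for this state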
Code Repository
https://github.com/ES7/Reinforcement-Learning-Projects/tree/main