Training a DQN AI to Master 2048: Step-by-Step Guide
This article walks through training an AI agent to play the 2048 puzzle game using reinforcement learning with a Deep Q‑Network (DQN) implemented in PyTorch, covering environment setup, algorithm implementation, network design, and a short training run that reaches a score of 256.
As an enthusiastic gamer, the author decided to train an AI using reinforcement learning to play the classic 2048 puzzle game.
They used the open‑source gym‑2048 environment and implemented a Deep Q‑Network (DQN) with PyTorch, running the experiments on Huawei Cloud ModelArts.
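Before diving into the algorithm, it helps to see the environment itself in action. Below is a minimal sketch, assuming the pip package is named gym-2048 and that importing gym_2048 registers the '2048-v0' environment id (both assumptions about this particular package):

# pip install gym-2048
# Assumes importing gym_2048 registers the '2048-v0' environment id.
import gym
import gym_2048

env = gym.make('2048-v0')
state = env.reset()
print(env.action_space)  # Discrete(4): the four slide directions
print(state)             # the current 4x4 board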
Three main steps
1. Create the game environment
2. Build the DQN algorithm
3. Define the neural network model
The network is a simple convolutional model: three 2×2 convolutions shrink the 4×4 board down to a single cell (4 → 3 → 2 → 1), and a fully connected layer maps the resulting 16 features to a Q-value for each of the four moves.
import random

import numpy as np
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# learn() is a method of the DQN agent class, which holds the behaviour
# and target networks, the optimizer, the loss criterion, and the
# hyperparameters in self.args.
def learn(self, buffer):
    # Only start learning once the buffer holds a full batch.
    if buffer.size >= self.args.batch_size:
        # Periodically sync the target network with the behaviour network.
        if self.learn_step_counter % self.args.target_update_freq == 0:
            self.target_model.load_state_dict(self.behaviour_model.state_dict())
        self.learn_step_counter += 1

        s1, a, s2, done, r = buffer.get_sample(self.args.batch_size)
        s1 = torch.FloatTensor(s1).to(device)
        s2 = torch.FloatTensor(s2).to(device)
        r = torch.FloatTensor(r).to(device)
        a = torch.LongTensor(a).to(device)

        if self.args.use_nature_dqn:
            # Nature DQN: evaluate the next state with the frozen target network.
            q = self.target_model(s2).detach()
        else:
            q = self.behaviour_model(s2)

        # Bellman target: r + gamma * max_a' Q(s2, a'), masked to r for terminal states.
        target_q = r + torch.FloatTensor(self.args.gamma * (1 - done)).to(device) * q.max(1)[0]
        target_q = target_q.view(self.args.batch_size, 1)
        # Q-values of the actions that were actually taken.
        eval_q = self.behaviour_model(s1).gather(1, torch.reshape(a, (a.size()[0], -1)))

        loss = self.criterion(eval_q, target_q)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
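The snippet above covers only the learning step; the article does not show how actions are chosen. A minimal epsilon-greedy sketch for the same agent class (the get_action name, self.epsilon, and self.args.action_dim are assumptions, not the original code) might look like this:

# Hypothetical epsilon-greedy action selection for the same agent class;
# the method name, self.epsilon, and self.args.action_dim are assumptions.
def get_action(self, state):
    if random.random() < self.epsilon:
        # Explore: pick one of the four moves at random.
        return random.randrange(self.args.action_dim)
    # Exploit: pick the move with the highest predicted Q-value.
    s = torch.FloatTensor(state).unsqueeze(0).to(device)
    with torch.no_grad():
        q_values = self.behaviour_model(s)
    return int(q_values.argmax(dim=1).item())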
class ReplayBuffer:
    def __init__(self, buffer_size, obs_space):
        # obs_space must already include the buffer dimension as its first
        # axis (e.g. (buffer_size, 4, 4, 16)), since transitions are stored
        # by indexing that axis below.
        self.s1 = np.zeros(obs_space, dtype=np.float32)
        self.s2 = np.zeros(obs_space, dtype=np.float32)
        self.a = np.zeros(buffer_size, dtype=np.int32)
        self.r = np.zeros(buffer_size, dtype=np.float32)
        self.done = np.zeros(buffer_size, dtype=np.float32)
        self.buffer_size = buffer_size
        self.size = 0
        self.pos = 0

    def add_transition(self, s1, action, s2, done, reward):
        self.s1[self.pos] = s1
        self.a[self.pos] = action
        # Terminal transitions have no meaningful next state, so s2 is only
        # stored for non-terminal steps; (1 - done) masks it out in learn().
        if not done:
            self.s2[self.pos] = s2
        self.done[self.pos] = done
        self.r[self.pos] = reward
        # Overwrite the oldest transition once the buffer is full.
        self.pos = (self.pos + 1) % self.buffer_size
        self.size = min(self.size + 1, self.buffer_size)

    def get_sample(self, sample_size):
        # Sample indices without replacement from the filled portion.
        i = random.sample(range(0, self.size), sample_size)
        return self.s1[i], self.a[i], self.s2[i], self.done[i], self.r[i]
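As a quick sanity check, the buffer can be exercised on its own. The (4, 4, 16) observation shape below is an assumption, matching a one-hot board encoding and the channels-last permute in the network that follows:

# Hypothetical standalone check; the (4, 4, 16) shape assumes a one-hot
# encoded board, consistent with the channels-last input of Net below.
buffer = ReplayBuffer(buffer_size=10000, obs_space=(10000, 4, 4, 16))
s1 = np.zeros((4, 4, 16), dtype=np.float32)
s2 = np.ones((4, 4, 16), dtype=np.float32)
buffer.add_transition(s1, action=1, s2=s2, done=0.0, reward=4.0)
print(buffer.size)              # 1
batch = buffer.get_sample(1)    # tuple of (s1, a, s2, done, r) arrays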
class Net(nn.Module):
    def __init__(self, obs, available_actions_count):
        super(Net, self).__init__()
        # Three 2x2 convolutions shrink the 4x4 board to 1x1 (4 -> 3 -> 2 -> 1).
        self.conv1 = nn.Conv2d(obs, 128, kernel_size=2, stride=1)
        self.conv2 = nn.Conv2d(128, 64, kernel_size=2, stride=1)
        self.conv3 = nn.Conv2d(64, 16, kernel_size=2, stride=1)
        # 16 channels x 1 x 1 = 16 features, mapped to one Q-value per action.
        self.fc1 = nn.Linear(16, available_actions_count)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Convert from NHWC (channels-last observations) to NCHW for Conv2d.
        x = x.permute(0, 3, 1, 2)
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = self.relu(self.conv3(x))
        x = self.fc1(x.view(x.shape[0], -1))
        return x

The training loop runs for a set number of episodes: each episode resets the environment, selects actions with the DQN, stores transitions in the replay buffer, and invokes the learning step. After about ten minutes of training, the agent consistently reaches a score of 256.
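The article does not reproduce the loop itself. A minimal sketch, assuming a DQN agent class that exposes the learn() method above plus the hypothetical get_action() from earlier, and the hyperparameter names in args (all assumptions for illustration):

# Minimal training-loop sketch; the DQN class, its constructor, and the
# hyperparameter names in `args` are assumptions, not the original code.
env = gym.make('2048-v0')
agent = DQN(args)  # holds behaviour_model, target_model, optimizer, criterion
buffer = ReplayBuffer(args.buffer_size, (args.buffer_size, 4, 4, 16))

for episode in range(args.num_episodes):
    state = env.reset()
    done = False
    while not done:
        action = agent.get_action(state)                   # epsilon-greedy
        next_state, reward, done, info = env.step(action)
        buffer.add_transition(state, action, next_state, float(done), reward)
        agent.learn(buffer)                                # one gradient step
        state = next_state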
The full source code and a ready‑to‑run notebook are available in the Huawei Cloud ModelArts marketplace.
Huawei Cloud Developer Alliance
The Huawei Cloud Developer Alliance provides a tech-sharing platform for developers and partners, gathering Huawei Cloud product knowledge, event updates, expert talks, and more. Together we continuously innovate to build the cloud foundation of an intelligent world.