Understanding Actor‑Critic and A2C: From Policy Gradients to REINFORCE in RL

This article derives the policy‑gradient objective for discrete actions, implements the Monte‑Carlo REINFORCE algorithm in PyTorch, explains the actor‑critic framework, introduces Advantage Actor‑Critic (A2C) versus A3C, and demonstrates their performance on the OpenAI Gym CartPole‑v0 environment.


Derivation of the Policy Gradient

The objective J(θ) is the expected cumulative reward from time t until termination at time T, taken over trajectories generated by the policy π_θ. Gradient ascent on the policy parameters θ requires ∇_θ J(θ); expanding the expectation over trajectories and applying the log‑derivative trick (∇_θ p_θ = p_θ ∇_θ log p_θ) yields the policy‑gradient expression ∇_θ J(θ) = 𝔼_{π_θ}[∇_θ log π_θ(a_t|s_t) G_t], where G_t is the discounted return from time t onward.
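Written out, the standard derivation (with τ denoting a whole trajectory, p_θ(τ) its probability under the policy, and R(τ) its total reward) proceeds as follows:

\begin{aligned}
J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right] = \int p_\theta(\tau)\, R(\tau)\, d\tau \\
\nabla_\theta J(\theta) &= \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau
  = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau \\
&= \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right]
  = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]
\end{aligned}

The environment dynamics inside log p_θ(τ) do not depend on θ, so only the policy terms survive the gradient. Replacing R(τ) with the reward‑to‑go G_t = Σ_{k=0}^{T−t} γ^k r_{t+k} (an action cannot affect rewards received before it was taken) gives the form of the gradient that REINFORCE samples below.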

REINFORCE Implementation

REINFORCE collects a full episode τ using the current stochastic policy, stores the log‑probability of each action together with the received reward, computes discounted returns, normalizes them, and updates the policy parameters with the sampled gradient.

import sys
import torch
import gym
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt

GAMMA = 0.9  # discount factor

class PolicyNetwork(nn.Module):
    def __init__(self, num_inputs, num_actions, hidden_size, learning_rate=3e-4):
        super(PolicyNetwork, self).__init__()
        self.num_actions = num_actions
        self.linear1 = nn.Linear(num_inputs, hidden_size)
        self.linear2 = nn.Linear(hidden_size, num_actions)
        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)

    def forward(self, state):
        # Two-layer MLP mapping a state to a probability distribution over actions.
        x = F.relu(self.linear1(state))
        x = F.softmax(self.linear2(x), dim=1)
        return x

    def get_action(self, state):
        # Sample an action from the softmax distribution and keep the log-probability
        # of the sampled action for the policy-gradient update.
        state = torch.from_numpy(state).float().unsqueeze(0)
        probs = self.forward(state)
        action = np.random.choice(self.num_actions, p=np.squeeze(probs.detach().numpy()))
        log_prob = torch.log(probs.squeeze(0)[action])
        return action, log_prob

The update function computes discounted rewards, normalizes them, forms the loss -log_prob * G_t, and performs a gradient step.

def update_policy(policy_network, rewards, log_probs):
    # Compute the discounted return G_t for every timestep of the episode.
    discounted_rewards = []
    for t in range(len(rewards)):
        Gt = 0
        pw = 0
        for r in rewards[t:]:
            Gt += GAMMA**pw * r
            pw += 1
        discounted_rewards.append(Gt)

    # Normalize the returns to reduce the variance of the gradient estimate.
    discounted_rewards = torch.tensor(discounted_rewards)
    discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-9)

    # REINFORCE loss: -log pi(a_t|s_t) * G_t, summed over the episode.
    policy_gradient = []
    for log_prob, Gt in zip(log_probs, discounted_rewards):
        policy_gradient.append(-log_prob * Gt)

    policy_network.optimizer.zero_grad()
    policy_gradient = torch.stack(policy_gradient).sum()
    policy_gradient.backward()
    policy_network.optimizer.step()

The main training loop runs for 5,000 episodes on the CartPole‑v0 environment, records episode lengths and rewards, and prints progress. Plots of episode length and a moving‑average length show that the agent learns to keep the pole balanced longer over time.
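The excerpt does not include the main loop itself; a minimal sketch consistent with the class and function above (the hidden-layer size, logging interval, and plotting details are assumptions) looks like this:

def main():
    env = gym.make('CartPole-v0')
    # hidden size of 128 is an assumption; the article does not state it
    policy_net = PolicyNetwork(env.observation_space.shape[0], env.action_space.n, 128)

    max_episode_num = 5000
    numsteps, avg_numsteps, all_rewards = [], [], []

    for episode in range(max_episode_num):
        state = env.reset()
        log_probs, rewards = [], []
        while True:
            action, log_prob = policy_net.get_action(state)
            new_state, reward, done, _ = env.step(action)
            log_probs.append(log_prob)
            rewards.append(reward)
            if done:
                update_policy(policy_net, rewards, log_probs)
                numsteps.append(len(rewards))
                avg_numsteps.append(np.mean(numsteps[-10:]))
                all_rewards.append(np.sum(rewards))
                if episode % 100 == 0:
                    sys.stdout.write("episode: {}, total reward: {}, average length: {}\n".format(
                        episode, np.round(np.sum(rewards), 3), np.round(avg_numsteps[-1], 3)))
                break
            state = new_state

    plt.plot(numsteps, label='episode length')
    plt.plot(avg_numsteps, label='moving-average length (last 10)')
    plt.xlabel('Episode')
    plt.legend()
    plt.show()

Calling main() trains the agent and then shows the two plots of episode length described above.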

Actor‑Critic and Advantage Actor‑Critic (A2C)

Starting from the vanilla policy gradient, the Monte‑Carlo return G_t is replaced by a learned estimate of the action value Q(s,a). Learning this estimate with a neural network yields the actor‑critic architecture: the critic estimates a value function (Q or V) and the actor updates the policy in the direction the critic's estimate suggests.

The advantage function is defined as A(s,a) = Q(s,a) - V(s). Using the Bellman relationship Q(s,a) = r + γV(s'), the advantage can be estimated from a single value network V alone, with no separate Q network, which leads to the Advantage Actor‑Critic (A2C) formulation.
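In symbols, with a single learned value network V_w (the standard A2C formulation):

\begin{aligned}
A(s_t, a_t) &= Q(s_t, a_t) - V(s_t) \;\approx\; r_t + \gamma V_w(s_{t+1}) - V_w(s_t) \\
\nabla_\theta J(\theta) &\approx \mathbb{E}\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t) \right], \qquad
L_{\text{critic}} = \tfrac{1}{2}\, A(s_t, a_t)^2
\end{aligned}

The implementation below estimates Q(s_t, a_t) with an n‑step bootstrapped return rather than the one‑step target, but the structure of the actor and critic losses is the same.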

There are two main variants of this idea: the asynchronous A3C (DeepMind's "Asynchronous Methods for Deep Reinforcement Learning", Mnih et al., 2016) and the synchronous A2C. A3C runs multiple parallel workers that each update a shared global network asynchronously, while A2C waits for every worker to finish its segment of experience before performing one synchronized update, achieving comparable performance while using hardware more efficiently.
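To make the synchronous/asynchronous distinction concrete, here is a rough sketch of lockstep data collection across several environments, A2C style. The function name and batch layout are illustrative assumptions, and it presumes the ActorCritic network defined in the next section; the single-environment a2c() below does not parallelize.

def collect_synchronous_segment(actor_critic, envs, states, segment_len=5):
    # Step every environment for segment_len steps in lockstep (A2C).
    # In A3C, by contrast, each worker steps and pushes gradients to the
    # shared network on its own schedule, without waiting for the others.
    batch = []
    for _ in range(segment_len):
        step_transitions = []
        for i, env in enumerate(envs):
            _, policy_dist = actor_critic.forward(states[i])
            probs = np.squeeze(policy_dist.detach().numpy())
            action = np.random.choice(len(probs), p=probs)
            next_state, reward, done, _ = env.step(action)
            step_transitions.append((states[i], action, reward, done))
            states[i] = env.reset() if done else next_state
        batch.append(step_transitions)
    # A single synchronized gradient update on the whole batch would follow here.
    return batch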

Implementation of A2C

The network defines separate linear layers for the critic (value) and the actor (policy). The training loop collects trajectories, computes Q‑values by bootstrapping, forms the advantage, and optimizes the combined actor and critic losses together with an entropy regularization term.

class ActorCritic(nn.Module):
    def __init__(self, num_inputs, num_actions, hidden_size, learning_rate=3e-4):
        super(ActorCritic, self).__init__()
        self.num_actions = num_actions
        # Critic head: maps a state to a scalar value estimate V(s).
        self.critic_linear1 = nn.Linear(num_inputs, hidden_size)
        self.critic_linear2 = nn.Linear(hidden_size, 1)
        # Actor head: maps a state to a probability distribution over actions.
        self.actor_linear1 = nn.Linear(num_inputs, hidden_size)
        self.actor_linear2 = nn.Linear(hidden_size, num_actions)

    def forward(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0)
        value = F.relu(self.critic_linear1(state))
        value = self.critic_linear2(value)
        policy_dist = F.relu(self.actor_linear1(state))
        policy_dist = F.softmax(self.actor_linear2(policy_dist), dim=1)
        return value, policy_dist

# Hyperparameters for the A2C run; the excerpt does not define them, so the values below are assumptions.
hidden_size = 256
learning_rate = 3e-4
num_steps = 300
max_episodes = 3000

def a2c(env):
    num_inputs = env.observation_space.shape[0]
    num_outputs = env.action_space.n
    actor_critic = ActorCritic(num_inputs, num_outputs, hidden_size)
    ac_optimizer = optim.Adam(actor_critic.parameters(), lr=learning_rate)
    all_lengths, average_lengths, all_rewards = [], [], []
    entropy_term = 0
    for episode in range(max_episodes):
        log_probs, values, rewards = [], [], []
        state = env.reset()
        for steps in range(num_steps):
            value, policy_dist = actor_critic.forward(state)
            value = value.detach().numpy()[0,0]
            dist = policy_dist.detach().numpy()
            action = np.random.choice(num_outputs, p=np.squeeze(dist))
            log_prob = torch.log(policy_dist.squeeze(0)[action])
            entropy = -np.sum(dist * np.log(dist + 1e-9))  # policy entropy, used as an exploration bonus
            new_state, reward, done, _ = env.step(action)
            rewards.append(reward)
            values.append(value)
            log_probs.append(log_prob)
            entropy_term += entropy
            state = new_state
            if done or steps == num_steps-1:
                Qval, _ = actor_critic.forward(new_state)
                Qval = Qval.detach().numpy()[0,0]
                all_rewards.append(np.sum(rewards))
                all_lengths.append(steps)
                average_lengths.append(np.mean(all_lengths[-10:]))
                if episode % 10 == 0:
                    sys.stdout.write("episode: {}, reward: {}, total length: {}, average length: {} 
".format(episode, np.sum(rewards), steps, average_lengths[-1]))
                break
        # Compute discounted Q values backward through the episode,
        # bootstrapping from the critic's estimate of the final state's value.
        Qvals = np.zeros_like(values)
        for t in reversed(range(len(rewards))):
            Qval = rewards[t] + GAMMA * Qval
            Qvals[t] = Qval
        # convert to tensors
        values = torch.FloatTensor(values)
        Qvals = torch.FloatTensor(Qvals)
        log_probs = torch.stack(log_probs)
        advantage = Qvals - values
        actor_loss = (-log_probs * advantage).mean()
        critic_loss = 0.5 * advantage.pow(2).mean()
        ac_loss = actor_loss + critic_loss + 0.001 * entropy_term
        ac_optimizer.zero_grad()
        ac_loss.backward()
        ac_optimizer.step()

Running the script on CartPole‑v0 shows a steady increase in episode length and reward, confirming that A2C learns to balance the pole efficiently.
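The excerpt stops at the training function; a minimal entry point for running it (a sketch, assuming the imports and GAMMA defined earlier in the article) is:

if __name__ == "__main__":
    env = gym.make("CartPole-v0")
    a2c(env)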

References

UC Berkeley CS294 Lecture Slides

Carnegie Mellon University CS10703 Lecture Slides

Naver D2 RLCode Lecture Video

OpenAI blog post on A2C and ACKTR

Full source code: https://github.com/thechrisyoon08/Reinforcement-Learning
