Unlocking Large‑Scale Deep Reinforcement Learning: PPO, GAE, and PPG Deep Dive

This comprehensive guide examines large‑scale deep reinforcement learning, detailing policy‑gradient fundamentals, the mathematics of PPO and GAE, practical implementation tricks, reward and observation normalization, network initialization, and the newer Phasic Policy Gradient method, all supported by code snippets and key research references.


Introduction

Large‑scale deep reinforcement learning (DRL) relies heavily on model‑free policy‑gradient methods, especially Proximal Policy Optimization (PPO). This article synthesizes recent papers and open‑source implementations to explain the algorithmic details of policy‑gradient methods, PPO, and related optimizations for large‑scale DRL.

[Figure: Policy Gradient overview]

Policy Gradient Basics

Policy-gradient methods directly optimize the policy network to increase expected reward. The loss is the negative log-likelihood of the taken action, weighted by an estimate f(s,a) of how good that action was. There are several ways to compute f(s,a); the advantage function is a common choice because it evaluates an action relative to the average action at that state.

Estimating the advantage involves a bias-variance trade-off: full Monte-Carlo returns give unbiased but high-variance estimates, while bootstrapped one-step value estimates reduce variance at the cost of bias.
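As a minimal sketch of the basic policy-gradient loss (assuming logp holds the log-likelihoods of the taken actions and f_sa the corresponding f(s,a) estimates; names here are illustrative):

import torch

def pg_loss(logp: torch.Tensor, f_sa: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood weighted by f(s,a); f_sa is detached so the
    # gradient flows only through the policy's log-probabilities
    return -(logp * f_sa.detach()).mean()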

Generalized Advantage Estimation (GAE)

GAE balances bias and variance by taking an exponentially weighted (λ-weighted) average of multi-step advantage estimates. In practice it is computed with a backward recursion over the rollout, using the episode-start flags (first in the code below) to stop bootstrapping across episode boundaries.

# from the OpenAI phasic-policy-gradient (ppg) repository
import torch as th

def compute_gae(*, vpred: '(th.Tensor[1, float]) value predictions',
                reward: '(th.Tensor[1, float]) rewards',
                first: '(th.Tensor[1, bool]) mark beginning of episodes',
                γ: '(float)',
                λ: '(float)'):
    orig_device = vpred.device
    assert orig_device == reward.device == first.device
    vpred, reward, first = (x.cpu() for x in (vpred, reward, first))
    first = first.to(dtype=th.float32)
    assert first.dim() == 2
    nenv, nstep = reward.shape
    assert vpred.shape == first.shape == (nenv, nstep + 1)
    adv = th.zeros(nenv, nstep, dtype=th.float32)
    lastgaelam = 0
    for t in reversed(range(nstep)):
        notlast = 1.0 - first[:, t + 1]
        nextvalue = vpred[:, t + 1]
        delta = reward[:, t] + notlast * γ * nextvalue - vpred[:, t]
        adv[:, t] = lastgaelam = delta + notlast * γ * λ * lastgaelam
    vtarg = vpred[:, :-1] + adv
    return adv.to(device=orig_device), vtarg.to(device=orig_device)
[Figure: GAE diagram]

Log‑Likelihood Computation

The log-likelihood depends on the action distribution: categorical for discrete actions, diagonal Gaussian for continuous actions. PyTorch's built-in distributions simplify this calculation. For categorical sampling, OpenAI Baselines uses the Gumbel-max trick, which is equivalent to multinomial sampling but often more efficient in practice:

# From OpenAI Baselines (TensorFlow 1.x)
def sample(self):
    # Add Gumbel noise -log(-log(u)) to the logits; the argmax is then an
    # exact sample from the categorical distribution defined by the logits
    u = tf.random_uniform(tf.shape(self.logits), dtype=self.logits.dtype)
    return tf.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)
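For reference, a minimal PyTorch sketch of the corresponding log-likelihood computations with torch.distributions (shapes and names here are illustrative):

import torch
from torch.distributions import Categorical, Normal

# Discrete actions: categorical distribution over logits from the policy head
logits = torch.randn(8, 4)                 # batch of 8 states, 4 actions
dist = Categorical(logits=logits)
actions = dist.sample()
logp = dist.log_prob(actions)              # log-likelihood used in the policy loss

# Continuous actions: diagonal Gaussian with state-dependent mean, learned log-std
mean = torch.randn(8, 2)
log_std = torch.zeros(2)
dist = Normal(mean, log_std.exp())
actions = dist.sample()
logp = dist.log_prob(actions).sum(dim=-1)  # sum log-probs over action dimensions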

Proximal Policy Optimization (PPO)

PPO improves upon the basic policy‑gradient by clipping the probability ratio between the new and old policies, preventing large policy updates and enabling multiple gradient steps per batch. PPO inherits the trust‑region idea from TRPO but replaces the complex constrained optimization with a simple clipped surrogate objective.

Sample reuse is critical: PPO typically reuses each trajectory for three epochs (K=3) with a minibatch size of NT/4, allowing stable training at scale. Large-scale applications such as OpenAI Five (Dota 2) use massive batch sizes and keep the sample-reuse ratio close to 1.
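A minimal sketch of the clipped surrogate objective (variable and parameter names are illustrative):

import torch

def ppo_policy_loss(logp_new, logp_old, adv, clip_range=0.2):
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log-space
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * adv
    # Pessimistic bound: take the minimum, then negate because we minimize the loss
    return -torch.min(unclipped, clipped).mean()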

[Figure: PPO clipping illustration]

PPO Loss Components

The total loss consists of the policy loss, the value loss, and an entropy bonus. Typical hyper-parameters: entropy coefficient 0.001-0.01, learning rate in the 5e-6 to 5e-5 range.
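A sketch of how the three terms are typically combined into a single scalar (vf_coef and ent_coef are illustrative coefficients; ppo_policy_loss is the clipped surrogate sketched above):

def ppo_total_loss(policy_loss, value_pred, value_target, entropy,
                   vf_coef=0.5, ent_coef=0.01):
    # Squared-error value loss against the GAE value targets
    value_loss = (value_pred - value_target).pow(2).mean()
    # The entropy bonus is subtracted, so higher entropy lowers the loss
    return policy_loss + vf_coef * value_loss - ent_coef * entropy.mean()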

Reward and Observation Normalization

Rewards are normalized using a running mean‑std and clipped to a fixed range; observations can be normalized similarly. Example implementations are shown below.

# RunningMeanStd and backward_discounted_sum are helper utilities assumed to be
# defined elsewhere (as in the ppg codebase)
class RewardNormalizer:
    def __init__(self, num_envs, cliprew=10.0, gamma=0.99, epsilon=1e-8, per_env=False):
        ret_rms_shape = (num_envs,) if per_env else ()
        self.ret_rms = RunningMeanStd(shape=ret_rms_shape)
        self.cliprew = cliprew
        self.ret = th.zeros(num_envs)
        self.gamma = gamma
        self.epsilon = epsilon
        self.per_env = per_env

    def __call__(self, reward, first):
        rets = backward_discounted_sum(prevret=self.ret, reward=reward.cpu(), first=first.cpu(), gamma=self.gamma)
        self.ret = rets[:, -1]
        self.ret_rms.update(rets if self.per_env else rets.reshape(-1))
        return self.transform(reward)

    def transform(self, reward):
        return th.clamp(reward / th.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)
[Figure: Reward normalizer]
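Observations can be handled the same way; a minimal sketch, assuming the same RunningMeanStd helper used by RewardNormalizer above:

class ObservationNormalizer:
    def __init__(self, obs_shape, clipob=10.0, epsilon=1e-8):
        self.ob_rms = RunningMeanStd(shape=obs_shape)
        self.clipob = clipob
        self.epsilon = epsilon

    def __call__(self, obs):
        # Update running statistics, then standardize and clip the observation
        self.ob_rms.update(obs)
        normed = (obs - self.ob_rms.mean) / th.sqrt(self.ob_rms.var + self.epsilon)
        return th.clamp(normed, -self.clipob, self.clipob)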

Network Initialization

Proper weight initialization (e.g., normalized fan-in) and scaling of the final policy layer are essential. Using a smaller initial weight scale for the policy head, and a softplus parameterization for the action standard deviation, improves stability.

import torch as th
import torch.nn as nn

def NormedLinear(*args, scale=1.0, dtype=th.float32, **kwargs):
    out = nn.Linear(*args, **kwargs)
    out.weight.data *= scale / out.weight.norm(dim=1, p=2, keepdim=True)
    out.bias.data.zero_()
    return out
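As a usage sketch, the policy head can be given a smaller scale than the hidden layers (the 0.1 value and layer sizes here are illustrative):

n_actions = 6                                      # illustrative action count
hidden = NormedLinear(64, 64, scale=1.0)           # hidden layer with default scale
pi_head = NormedLinear(64, n_actions, scale=0.1)   # smaller initial weights keep the initial policy near-uniform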

Phasic Policy Gradient (PPG)

PPG extends PPO by decoupling policy and value networks while sharing a common backbone. An auxiliary KL‑divergence term keeps the policy close to its previous version, allowing the value network to be trained off‑policy more aggressively. Experiments show that increasing value‑network updates improves performance, while keeping policy sample‑reuse near 1 yields the best results.
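A rough sketch of the auxiliary-phase objective described above (function and variable names are illustrative; beta_clone weights the KL term that anchors the policy):

import torch
import torch.nn.functional as F

def ppg_aux_loss(aux_value_pred, value_target, new_logits, old_logits, beta_clone=1.0):
    # Train the value function aggressively during the auxiliary phase
    value_loss = 0.5 * (aux_value_pred - value_target).pow(2).mean()
    # KL(pi_old || pi_new) keeps the policy close to its pre-phase version
    kl = F.kl_div(F.log_softmax(new_logits, dim=-1),
                  F.log_softmax(old_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return value_loss + beta_clone * kl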

[Figure: PPG architecture]

Practical Recommendations

Clip PPO ratio at 0.25 (tune as needed).

Use GAE with λ≈0.9.

Shuffle transitions and recompute advantages once per epoch.

Tune discount factor γ per environment (start at 0.99).

Adam optimizer with β1=0.9, learning rate ≈3e‑4 (decay optional).

Large batch sizes and many parallel environments improve wall-clock speed. A consolidated starting configuration is sketched below.
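The values below simply restate the recommendations above as a starting configuration; treat them as defaults to tune per task, not definitive settings:

ppo_defaults = {
    "clip_range": 0.25,      # PPO probability-ratio clip
    "gae_lambda": 0.9,       # GAE lambda
    "gamma": 0.99,           # discount factor, tune per environment
    "learning_rate": 3e-4,   # Adam, optional decay
    "adam_beta1": 0.9,
    "sample_reuse": 3,       # epochs per batch (K)
    "ent_coef": 0.01,        # entropy coefficient (0.001-0.01)
}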

References and Code

Key papers: PPO (arXiv:1707.06347), PPG (arXiv:2009.04416), and implementation studies (arXiv:2005.12729, arXiv:1811.02553, arXiv:2006.05990). Open-source implementations include OpenAI's phasic-policy-gradient repository and Stable Baselines PPO2.

