Unlocking Large‑Scale Deep Reinforcement Learning: PPO, GAE, and PPG Deep Dive
This comprehensive guide examines large‑scale deep reinforcement learning, detailing policy‑gradient fundamentals, the mathematics of PPO and GAE, practical implementation tricks, reward and observation normalization, network initialization, and the newer Phasic Policy Gradient method, all supported by code snippets and key research references.
Introduction
Large‑scale deep reinforcement learning (DRL) relies heavily on model‑free policy‑gradient methods, especially Proximal Policy Optimization (PPO). This article synthesizes recent papers and open‑source implementations to explain the algorithmic details of policy‑gradient methods, PPO, and related optimizations for large‑scale DRL.
Policy Gradient Basics
Policy‑gradient methods directly optimize the policy network to increase expected reward. The loss is the negative log‑likelihood weighted by an advantage estimate f(s,a). Various ways to compute f(s,a) exist, with the advantage function providing a relative evaluation of actions.
Estimating advantage involves a bias‑variance trade‑off: using full trajectories yields unbiased but high‑variance estimates, while one‑step value estimates reduce variance but introduce bias.
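As a concrete sketch of the basic objective for discrete actions — the function name and shapes here are illustrative assumptions, not from any particular codebase:

```python
import torch as th
from torch.distributions import Categorical

def policy_gradient_loss(logits, actions, advantages):
    """Negative log-likelihood weighted by the advantage estimate f(s, a).

    logits:     (batch, num_actions) raw policy-network outputs
    actions:    (batch,) sampled action indices
    advantages: (batch,) advantage estimates
    """
    logp = Categorical(logits=logits).log_prob(actions)
    # Ascent on E[log pi(a|s) * A(s, a)] == descent on this loss; detach()
    # reflects that the advantage is a constant w.r.t. the policy parameters.
    return -(logp * advantages.detach()).mean()
```

Minimizing this loss increases the probability of actions with positive advantage and decreases it for negative ones.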
Generalized Advantage Estimation (GAE)
GAE balances this trade‑off by forming an exponentially weighted average of n‑step advantage estimates, controlled by λ. In practice it is computed with a single backward recursion over the trajectory, resetting at episode boundaries marked by the done/first flags.
# from ppg
def compute_gae(*, vpred: '(th.Tensor[1, float]) value predictions',
                reward: '(th.Tensor[1, float]) rewards',
                first: '(th.Tensor[1, bool]) mark beginning of episodes',
                γ: '(float)',
                λ: '(float)'):
    orig_device = vpred.device
    assert orig_device == reward.device == first.device
    vpred, reward, first = (x.cpu() for x in (vpred, reward, first))
    first = first.to(dtype=th.float32)
    assert first.dim() == 2
    nenv, nstep = reward.shape
    assert vpred.shape == first.shape == (nenv, nstep + 1)
    adv = th.zeros(nenv, nstep, dtype=th.float32)
    lastgaelam = 0
    for t in reversed(range(nstep)):
        notlast = 1.0 - first[:, t + 1]
        nextvalue = vpred[:, t + 1]
        delta = reward[:, t] + notlast * γ * nextvalue - vpred[:, t]
        adv[:, t] = lastgaelam = delta + notlast * γ * λ * lastgaelam
    vtarg = vpred[:, :-1] + adv
    return adv.to(device=orig_device), vtarg.to(device=orig_device)

Log‑Likelihood Computation
The log‑likelihood depends on the action distribution: categorical for discrete actions, diagonal Gaussian for continuous actions. PyTorch's built‑in torch.distributions classes simplify this calculation. For categorical sampling, OpenAI Baselines sometimes uses the Gumbel‑max trick, which draws exact samples from the same distribution as multinomial sampling but needs only uniform noise and an argmax.
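Before turning to the Gumbel‑max trick, the standard route can be sketched with torch.distributions; the tensor values below are purely illustrative:

```python
import torch as th
from torch.distributions import Categorical, Normal

# Discrete actions: categorical distribution over policy logits.
logits = th.tensor([[2.0, 0.5, -1.0]])
dist = Categorical(logits=logits)
action = dist.sample()
logp_discrete = dist.log_prob(action)            # shape: (1,)

# Continuous actions: diagonal Gaussian; log-probs sum over action dims.
mean, std = th.zeros(1, 2), th.ones(1, 2)
cont = Normal(mean, std)
a = cont.sample()
logp_continuous = cont.log_prob(a).sum(dim=-1)   # shape: (1,)
```

Both `log_prob` calls are differentiable, so the results can be used directly in the policy‑gradient loss.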
def sample(self):
    u = tf.random_uniform(tf.shape(self.logits), dtype=self.logits.dtype)
    return tf.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)

Proximal Policy Optimization (PPO)
PPO improves upon the basic policy‑gradient by clipping the probability ratio between the new and old policies, preventing large policy updates and enabling multiple gradient steps per batch. PPO inherits the trust‑region idea from TRPO but replaces the complex constrained optimization with a simple clipped surrogate objective.
Sample reuse is critical: PPO typically performs K=3 optimization epochs over each batch of NT transitions, with minibatches of size NT/4, allowing stable training at scale. Large‑scale applications such as OpenAI Five (Dota 2) use massive batch sizes and keep the sample‑reuse ratio close to 1.
PPO Loss Components
The total loss is the sum of the clipped policy loss, a value‑function loss, and an entropy bonus. Typical hyper‑parameters: entropy coefficient 0.001–0.01, learning rate in the 5e‑6 to 5e‑5 range.
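Putting the three terms together — a hedged sketch rather than a reference implementation; the function name, the 0.2 clip, and the coefficients are illustrative defaults:

```python
import torch as th

def ppo_loss(logp_new, logp_old, adv, vpred, vtarg, entropy,
             clip=0.2, vf_coef=0.5, ent_coef=0.01):
    """Clipped-surrogate PPO loss: policy + value + entropy terms.

    All tensors are per-transition; coefficients are tuning knobs.
    """
    ratio = th.exp(logp_new - logp_old)
    surr1 = ratio * adv
    surr2 = th.clamp(ratio, 1.0 - clip, 1.0 + clip) * adv
    pg_loss = -th.min(surr1, surr2).mean()          # clipped surrogate
    vf_loss = 0.5 * (vpred - vtarg).pow(2).mean()   # value regression
    return pg_loss + vf_coef * vf_loss - ent_coef * entropy.mean()
```

A handy sanity check: when logp_new equals logp_old the ratio is 1, the clipped and unclipped surrogates coincide, and the policy term reduces to the plain advantage mean.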
Reward and Observation Normalization
Rewards are normalized using a running mean‑std and clipped to a fixed range; observations can be normalized similarly. Example implementations are shown below.
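The normalizer below calls two helpers that this excerpt does not show, RunningMeanStd and backward_discounted_sum. A minimal sketch of both, assuming standard batch running statistics (the scalar-shape case) and a per‑environment discounted‑return recursion:

```python
import torch as th

class RunningMeanStd:
    """Running mean/variance via parallel batch updates (minimal version;
    the per_env vector-shape case would need per-row statistics)."""
    def __init__(self, shape=(), epsilon=1e-4):
        self.mean = th.zeros(shape)
        self.var = th.ones(shape)
        self.count = epsilon

    def update(self, x):
        batch_mean = x.mean(dim=0)
        batch_var = x.var(dim=0, unbiased=False)
        batch_count = x.shape[0]
        delta = batch_mean - self.mean
        tot = self.count + batch_count
        self.mean = self.mean + delta * batch_count / tot
        m2 = (self.var * self.count + batch_var * batch_count
              + delta.pow(2) * self.count * batch_count / tot)
        self.var = m2 / tot
        self.count = tot

def backward_discounted_sum(*, prevret, reward, first, gamma):
    """Running discounted return per env, reset where an episode starts."""
    nenv, nstep = reward.shape
    rets = th.zeros(nenv, nstep)
    for t in range(nstep):
        prevret = reward[:, t] + (1.0 - first[:, t]) * gamma * prevret
        rets[:, t] = prevret
    return rets
```

These are assumed reconstructions matching the call sites in the class below, not the original helper code.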
class RewardNormalizer:
    def __init__(self, num_envs, cliprew=10.0, gamma=0.99, epsilon=1e-8, per_env=False):
        ret_rms_shape = (num_envs,) if per_env else ()
        self.ret_rms = RunningMeanStd(shape=ret_rms_shape)
        self.cliprew = cliprew
        self.ret = th.zeros(num_envs)
        self.gamma = gamma
        self.epsilon = epsilon
        self.per_env = per_env

    def __call__(self, reward, first):
        rets = backward_discounted_sum(prevret=self.ret, reward=reward.cpu(), first=first.cpu(), gamma=self.gamma)
        self.ret = rets[:, -1]
        self.ret_rms.update(rets if self.per_env else rets.reshape(-1))
        return self.transform(reward)

    def transform(self, reward):
        return th.clamp(reward / th.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)

Network Initialization
Proper weight initialization (e.g., normalized fan‑in) and scaling of the final policy layer are essential. Using a smaller initial weight for the policy head and softplus for action‑std improves stability.
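A sketch of the softplus action‑std idea combined with a down‑scaled policy‑mean layer; the class name and the 0.01 scale are assumptions for illustration:

```python
import torch as th
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Continuous-action policy head: mean from a down-scaled linear layer
    (near-uniform policy at the start of training), std from a softplus of
    a learned parameter, keeping it positive without exp's blow-up."""
    def __init__(self, hidden_dim, action_dim, scale=0.01):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, action_dim)
        self.mean.weight.data.mul_(scale)   # small initial policy output
        self.mean.bias.data.zero_()
        self.std_param = nn.Parameter(th.zeros(action_dim))

    def forward(self, h):
        std = F.softplus(self.std_param) + 1e-5
        return th.distributions.Normal(self.mean(h), std)
```

Softplus grows only linearly for large parameter values, which tends to be more stable than the exponential parameterization.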
def NormedLinear(*args, scale=1.0, dtype=th.float32, **kwargs):
    out = nn.Linear(*args, **kwargs)
    out.weight.data *= scale / out.weight.norm(dim=1, p=2, keepdim=True)
    out.bias.data.zero_()
    return out

Phasic Policy Gradient (PPG)
PPG extends PPO by decoupling policy and value networks while sharing a common backbone. An auxiliary KL‑divergence term keeps the policy close to its previous version, allowing the value network to be trained off‑policy more aggressively. Experiments show that increasing value‑network updates improves performance, while keeping policy sample‑reuse near 1 yields the best results.
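A sketch of the auxiliary‑phase objective for discrete actions — the function name and the β_clone weighting follow the paper's description, but this is an assumed simplification, not the repository's code:

```python
import torch as th
import torch.nn.functional as F

def ppg_aux_loss(aux_vpred, vtarg, logits_new, logits_old, beta_clone=1.0):
    """PPG auxiliary phase: train the value function through the shared
    backbone while a KL term clones the policy to its pre-aux snapshot."""
    vf_loss = 0.5 * (aux_vpred - vtarg).pow(2).mean()
    # KL(old || new) between the categorical policies.
    kl = F.kl_div(F.log_softmax(logits_new, dim=-1),
                  F.log_softmax(logits_old, dim=-1),
                  log_target=True, reduction='batchmean')
    return vf_loss + beta_clone * kl
```

Because the KL term anchors the policy, the value targets can be refit aggressively for many epochs without destabilizing the policy.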
Practical Recommendations
Clip PPO ratio at 0.25 (tune as needed).
Use GAE with λ≈0.9.
Shuffle transitions and recompute advantages once per epoch.
Tune discount factor γ per environment (start at 0.99).
Adam optimizer with β1=0.9, learning rate ≈3e‑4 (decay optional).
Large batch sizes and many parallel environments improve wall‑clock speed.
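The recommendations above, collected as a starting‑point configuration; every value is a tuning suggestion from this article, not a universal default:

```python
# Starting-point hyperparameters; tune per environment.
ppo_config = {
    "clip_range": 0.25,     # PPO ratio clip
    "gae_lambda": 0.9,      # GAE lambda
    "gamma": 0.99,          # discount factor, tune per environment
    "learning_rate": 3e-4,  # Adam, optional decay
    "adam_beta1": 0.9,
    "entropy_coef": 0.01,   # typical range 0.001-0.01
    "epochs": 3,            # sample reuse per batch
}
```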
References and Code
Key papers: PPO (arXiv:1707.06347), GAE (arXiv:1506.02438), PPG (arXiv:2009.04416), and implementation studies (arXiv:2005.12729, arXiv:1811.02553, arXiv:2006.05990). Open‑source implementations include OpenAI's phasic‑policy‑gradient repository and Stable‑Baselines PPO2.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.