Stanford CS 224R · Homework 2 · PPO

Proximal Policy Optimization from Absolute Zero

A robot learning to swing a hammer using stochastic gradient descent on its own behavior. Every PyTorch line decoded. Every formula derived. Every trick justified.

No prior RL assumed · PyTorch primer included · Every line you wrote, explained · ~1500 lines
Roadmap

What You'll Master

Chapter 01

The Setup & Big Picture

Same task as Problem 3, completely different algorithm. A 4-DOF Sawyer arm in MuJoCo simulation must learn to pick up a hammer and drive a nail into a board. The observation is ~39 numbers (positions of robot, hammer, nail). Action is a 4-D continuous vector in [-1, 1]. Reward: 0 every step, 1.0 the moment the nail is fully driven, then episode ends.

You'll implement PPO — Proximal Policy Optimization — the most widely-used policy-gradient algorithm in modern RL. The deliverable is a wandb plot of eval/episode_success over ~1 million environment steps.

In one sentence

PPO trains a neural network policy by repeatedly: (1) collect a batch of rollouts under the current policy, (2) compute how much better-than-average each action turned out, (3) take several gradient steps on a clipped objective that pushes good actions up and bad actions down, but only a little — never letting the policy change too far from where it just was.

Why PPO and not vanilla policy gradient

The simplest policy-gradient algorithm (REINFORCE) goes:

  1. Sample a trajectory under πθ.
  2. For each (st, at) in the trajectory, increase log πθ(at|st) in proportion to the return Gt.
  3. Repeat.

This works in theory but is wildly unstable in practice. Two reasons:

  1. High variance in the return Gt, which slows convergence to a crawl.
  2. Single big update can break the policy. The gradient says "move this way." But if you move too far, the next batch of rollouts comes from a totally different policy, and the gradient you just applied was based on the old policy. Now you're flying blind.

PPO addresses both problems. GAE reduces variance via a critic-blended advantage estimate. Clipping caps how far each update can push the policy. Together: stable, fast learning.

"On-policy" with caveats

Strictly on-policy means: collect data, do one gradient step, throw the data away. PPO does multiple epochs of gradient steps on the same batch. As long as the policy hasn't drifted too much (the clip enforces this), the same data is approximately valid for several updates. This is sometimes called "near-on-policy" or "approximately on-policy."

Big-picture comparison

PPO (this problem) vs SAC-style (Problem 3): PPO is on-policy, throws away data after a few epochs, simpler to implement, slow to converge (~1M steps). SAC is off-policy, replays a buffer indefinitely, more components to debug, fast to converge (~100k steps). Both work; both have their use cases.

Chapter 02

PyTorch Primer

If you know Python but PyTorch is fuzzy, read this chapter. You'll need every concept here when reading the code.

1. Tensors are NumPy arrays with two superpowers

A tensor is just a multi-dimensional array of numbers, like NumPy's ndarray. The two extras are:

import torch

x = torch.tensor([1.0, 2.0, 3.0])    # shape [3], on CPU
x = x.to("cuda")                       # now on GPU
y = x * 2.0 + 1.0                       # [3, 5, 7], gradient graph remembered

2. Shapes are what bugs hide in

Most PyTorch bugs are shape bugs. The shape of a tensor is its dimensions: a vector of length 4 is shape [4]; a 32×4 matrix is shape [32, 4]. Operations broadcast across compatible shapes silently, which is convenient but dangerous.

a = torch.tensor([1., 2., 3.])             # shape [3]
b = torch.tensor([[10.], [20.], [30.]])     # shape [3, 1]
c = a + b   # shape [3, 3]! PyTorch broadcasts, NumPy-style. Probably not what you wanted.
Lesson learned the hard way

If a loss is suddenly enormous or stays at zero, the first thing to check is the shapes of every operand. Print x.shape liberally. The PPO code uses .squeeze(-1) to drop a trailing size-1 dimension specifically to avoid silent broadcasts.

3. Common shape operations

Method | What it does | Example
.shape | Tuple of dimensions | torch.zeros(3, 5).shape → [3, 5]
.unsqueeze(d) | Add size-1 dim at position d | [3].unsqueeze(0) → [1, 3]
.squeeze(d) | Drop size-1 dim at d | [3, 1].squeeze(-1) → [3]
.view(...) | Reshape, total size unchanged | [6].view(2, 3) → [2, 3]
torch.cat([...], dim=d) | Concatenate along dim | two [3, 4] → [6, 4] with dim=0
torch.stack([...], dim=d) | Stack into NEW dim | two [3] → [2, 3] with dim=0
.sum(-1) / .mean(-1) | Reduce along the last dim | [3, 4].sum(-1) → [3]
x[idx] | Slice with int or tensor of indices | x[torch.tensor([0, 2, 5])]

4. Autograd in 60 seconds

When you compute a tensor that has trainable parameters in its history, PyTorch tracks which operations made it. Calling .backward() on a scalar (a single number) tells PyTorch to walk backward through that history, computing gradients for every parameter along the way.

# Forward pass
w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([3.0])
loss = (w * x - 5.0) ** 2             # scalar, shape []

# Backward pass
loss.backward()                       # fills in w.grad
print(w.grad)                         # tensor([6.])

# Manual update (in practice the optimizer does this)
with torch.no_grad():
    w -= 0.01 * w.grad
    w.grad.zero_()                    # clear before next backward

Three rules to keep in your head:

  1. You can only call .backward() on a scalar. If your loss is a vector, take .mean() or .sum() first.
  2. Gradients accumulate. Always zero them before the next backward pass — optimizer.zero_grad() handles this.
  3. Operations inside with torch.no_grad(): don't get tracked. Use this when computing values you don't want gradients for (like TD targets or log-probs at rollout time).
Why torch.no_grad() matters

If you compute a "target" value y that you'll regress toward, and you forget no_grad, gradients will flow through y back into the network that produced it. Now your network is being trained against a moving target it produced itself → divergence. Always wrap target computation in with torch.no_grad():.
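To see the failure mode concretely, here is a minimal sketch (toy net and target, not homework code) showing that a target computed without no_grad stays attached to the graph:

```python
import torch

net = torch.nn.Linear(1, 1)
x = torch.tensor([[1.0]])

# WRONG: the target is still attached to the graph,
# so gradients would flow through it back into net
bad_target = net(x) * 0.9
print(bad_target.requires_grad)   # True

# RIGHT: compute the target as a constant
with torch.no_grad():
    target = net(x) * 0.9
print(target.requires_grad)       # False

loss = (net(x) - target) ** 2     # only the prediction path gets gradients
```

Checking `.requires_grad` on a freshly computed target is a quick way to catch this bug before it silently destabilizes training.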

5. nn.Module: the building block

A network is a Python class that inherits from nn.Module. It has learnable parameters (typically the weights of nn.Linear layers it stores) and a forward() method that defines what happens when you call it.

class SimpleNet(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.layer1 = nn.Linear(in_dim, 64)    # weights + bias
        self.layer2 = nn.Linear(64, out_dim)

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        return self.layer2(x)

net = SimpleNet(10, 4)
y = net(torch.randn(32, 10))         # calls .forward, returns shape [32, 4]

Notice net(x), not net.forward(x). The __call__ hook on Module triggers some PyTorch bookkeeping then calls forward.

nn.Sequential is a convenience for stacking layers without writing a forward method:

self.policy = nn.Sequential(
    nn.Linear(obs_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, action_dim),
)

6. Optimizers

An optimizer takes a list of parameters and applies gradient updates to them. The most common: Adam.

optimizer = torch.optim.Adam(net.parameters(), lr=3e-4)

# Standard training loop:
optimizer.zero_grad()    # clear leftover grads
loss.backward()         # compute new grads
optimizer.step()         # take a gradient step

7. Distributions

For policy-gradient algorithms, you need to sample actions from a distribution and compute log-probabilities. PyTorch provides torch.distributions.

from torch.distributions import Normal

mu = torch.tensor([0.0])
sigma = torch.tensor([1.0])
dist = Normal(mu, sigma)

a = dist.sample()              # draws a random number, shape [1]
log_p = dist.log_prob(a)      # log-density at a, shape [1]

The homework uses utils.TruncatedNormal — same idea but the samples are clipped to [-1, 1] to match the action bounds. dist.log_prob(a) gives you log π(a|s), which is what the policy gradient theorem needs.

Why log_prob, not prob

Probabilities for continuous distributions can be tiny in high dimensions (think 1e-30 multiplied across 4 action dims). Log-probabilities stay numerically reasonable. Also, ∇ log p = (1/p) ∇ p — the gradient form is what we actually want for the policy gradient.
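A quick numerical demonstration of the underflow (illustrative numbers: 100 dims rather than the homework's 4, to make the effect visible in float32):

```python
import torch
from torch.distributions import Normal

dist = Normal(torch.zeros(100), torch.ones(100))  # 100 independent dims
a = torch.full((100,), 3.0)                       # a point out in the tails

prob = dist.log_prob(a).exp().prod()  # multiply 100 small densities
log_prob = dist.log_prob(a).sum()     # add 100 log-densities instead

print(prob)      # tensor(0.) -- the running product underflows float32
print(log_prob)  # roughly -541.9 -- perfectly representable
```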

That's enough to read every line of on_policy.py. Onward.

Chapter 03

The Policy Gradient Theorem

The whole reason PPO exists is to maximize one quantity: expected discounted return.

The objective   J(θ) = 𝔼τ~πθ[ Σt=0..T γ^t rt ]

Read it: "the average over all possible trajectories τ that πθ could produce, of the total discounted reward of that trajectory." We want to find the policy parameters θ that maximize J.

To do gradient ascent we need ∇θ J. The trajectory distribution depends on θ through the policy actions, so we can't trivially differentiate. The policy gradient theorem gives us a clever rewriting:

Policy gradient theorem   ∇θ J(θ) = 𝔼τ[ Σt ∇θ log πθ(at|st) · Gt ]

where Gt = Σk=0..∞ γ^k rt+k is the return from time t onward.

In English: to make the policy better, take each action you just sampled and increase its log-probability by an amount proportional to how good the future turned out to be after that action. Good actions: push them up. Bad actions: push them down.

The clever trick (log-derivative)

The derivation hinges on one identity: ∇θ log pθ(x) = (1/pθ(x)) · ∇θ pθ(x). This lets you turn a derivative of a probability density (which depends on samples that depend on θ) into an expectation that you can estimate from samples drawn under θ. The miracle is that environment dynamics — which we don't know — cancel out of the gradient.

The naive estimator

Given a batch of N rollouts, we estimate the expectation by averaging:

∇θ J ≈ (1/N) Σi=1..N Σt ∇θ log πθ(ai,t|si,t) · Gi,t

If you implement just this, you have REINFORCE. It works, but it's slow. Two problems:

  1. Variance: Gt can swing dramatically across rollouts. Same policy, totally different trajectories due to randomness in action sampling and environment.
  2. No baseline: if rewards are always positive, every action's log-prob increases. The signal is "everything is good," which doesn't tell the policy what to focus on.

The fixes: baselines (advantage), GAE (multi-step variance reduction), clipping (stability). Each gets its own chapter.
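Pure REINFORCE fits in a few lines. A runnable toy sketch (1-D made-up reward, nothing from the homework env) where the best action is 2.0:

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
mu = torch.tensor([0.0], requires_grad=True)   # policy mean, the only parameter
opt = torch.optim.Adam([mu], lr=0.05)

for _ in range(500):
    dist = Normal(mu, 0.5)
    a = dist.sample()                # .sample() is detached: no grad through a
    G = -(a - 2.0) ** 2              # toy one-step return, best at a = 2
    loss = -(dist.log_prob(a) * G).sum()   # REINFORCE: ascend log pi(a) * G
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mu.item())   # drifts close to 2.0
```

Even on this one-parameter problem you can watch the variance: with a single sample per update, the estimate of mu wobbles noticeably around 2.0 instead of settling.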

Chapter 04

Why We Need Advantage

Naive REINFORCE weights each action by the return Gt. Two issues:

Issue 1: Variance reduction via baselines

Subtract any state-dependent baseline b(st) from Gt. The expected gradient is unchanged (provable: cross-term has zero mean), but variance can drop dramatically.

∇θ J ≈ 𝔼[ Σt ∇θ log πθ(at|st) · (Gt − b(st)) ]

The natural choice for b: the state-value function V(s) = 𝔼[Gt | st=s], i.e. how good is this state on average?
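The claim — same expected gradient, far lower variance — can be checked numerically with a toy score-function estimator (all numbers illustrative: a 1-D Gaussian policy with mean 0, returns offset by a constant +10 the way always-positive rewards would be):

```python
import torch

torch.manual_seed(0)
a = torch.randn(100_000)   # actions from N(0, 1); the score for the mean is just a
G = a + 10.0               # toy returns: informative part (a) plus a big offset

g_plain = a * G            # per-sample REINFORCE gradient, no baseline
g_base = a * (G - 10.0)    # subtract the baseline b = E[G]

print(g_plain.mean(), g_base.mean())  # both ~1.0: expectation unchanged
print(g_plain.var(), g_base.var())    # ~102 vs ~2: huge variance drop
```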

The advantage function

Definition
Advantage — A(s, a) = Q(s, a) − V(s)

How much better was action a than the policy's average action in state s? Positive: this action was above average; push the policy toward it. Negative: below average; push away.

The cleaned-up policy gradient:

Advantage-weighted policy gradient   ∇θ J = 𝔼[ Σt ∇θ log πθ(at|st) · A(st, at) ]

Now the gradient is "how much should we change θ per unit log-prob, weighted by how surprisingly good this action was."

How do we estimate A(s, a)?

We don't know V exactly. We train a critic Vφ(s) in parallel with the policy, by regressing toward observed returns. With a critic, the simplest advantage estimator is the 1-step TD error:

A(st, at) ≈ rt + γ Vφ(st+1) − Vφ(st)

This is low-variance (one step of randomness) but biased (uses imperfect V). The other extreme is the Monte Carlo advantage:

A(st, at) ≈ Gt − Vφ(st) where Gt is the actual sum of rewards

Unbiased but high variance (sums many random rewards). We want a middle ground that's tunable. That's exactly what GAE provides.

Chapter 05

GAE: The Bias-Variance Trade

Generalized Advantage Estimation (Schulman et al., 2016) blends 1-step TD and Monte Carlo via a parameter λ ∈ [0, 1]. It's the algorithmic content of your compute_gae implementation.

The recursion

GAE
δt = rt + γ (1 − dt) V(st+1) − V(st)   (1-step TD error)
Ât = δt + γ λ (1 − dt) Ât+1            (recursion)

Computed backwards through a trajectory. (1 − dt) is the done-flag mask — same trick from Problem 1 to zero out bootstrap at terminal states.

The two ends of the spectrum

λ | Estimator | Bias | Variance
0 | Ât = δt — pure 1-step TD | High (uses V everywhere) | Low
1 | Ât = Gt − V(st) — Monte Carlo | Zero | High
0.95 | Mostly MC with critic correction | Low | Modest

λ=0.95 is the standard choice and what this homework uses. It says "trust the actual rewards mostly, but blend in the critic's bootstrapped estimate to dampen variance."

Unrolled, GAE is a geometric sum

If you unroll the recursion, you get:

Ât = δt + (γλ) δt+1 + (γλ)² δt+2 + (γλ)³ δt+3 + ...

It's a geometrically-weighted sum of future TD errors. Each future TD error contributes less than the last by factor γλ. The recursion is just an efficient backwards pass to compute this sum.
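You can verify both the λ=0 endpoint and the geometric-sum identity in a few lines. A self-contained sketch of the recursion (mirroring the homework's compute_gae in spirit; the trajectory numbers are random and illustrative):

```python
import torch

def compute_gae(rewards, values, next_values, discounts, lam):
    # Backwards recursion; discounts[t] = gamma * (1 - done[t])
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + discounts[t] * next_values[t] - values[t]
        gae = delta + discounts[t] * lam * gae
        advantages[t] = gae
    return advantages

torch.manual_seed(0)
T, gamma, lam = 5, 0.99, 0.95
rewards, values = torch.randn(T), torch.randn(T)
next_values = torch.cat([values[1:], torch.randn(1)])  # V(s'_t) = V(s_{t+1})
discounts = torch.full((T,), gamma)                    # no terminals in this toy

deltas = rewards + discounts * next_values - values

# lambda = 0 collapses to the pure 1-step TD error
assert torch.allclose(compute_gae(rewards, values, next_values, discounts, 0.0), deltas)

# A_0 equals the unrolled geometric sum of future TD errors
adv = compute_gae(rewards, values, next_values, discounts, lam)
unrolled = sum((gamma * lam) ** k * deltas[k] for k in range(T))
assert torch.allclose(adv[0], unrolled)
print("recursion matches the unrolled geometric sum")
```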

The "returns" output

The function returns both advantages and returns:

returnst = Ât + V(st) used as critic regression target

Why? Recall A = Q − V, so returns = A + V is our estimate of Q(st, at) — equivalently, of the return G from following the current (stochastic) policy after taking at. The critic should regress toward this — it's a low-variance estimate of "what does following the policy from here look like in terms of return."

Chapter 06

Why PPO Clips

Vanilla policy gradient is unstable. One big update can move the policy so far that the rollouts you collected no longer represent it, and learning collapses.

We need a way to limit how much the policy can change per update. PPO does this with a clipped surrogate objective.

The importance ratio

Define the importance ratio between new and old policies for the same action:

Importance ratio ρt = πθ(at|st) / πθold(at|st) = exp( log πθ(at|st) − log πθold(at|st) )

If the policy hasn't changed, ρ = 1. If we doubled the probability of action at, ρ = 2. If we halved it, ρ = 0.5.

Why log-space

Probability densities for high-dimensional actions can be very small (1e-30 or smaller). Dividing two tiny numbers is numerically unstable. Subtracting log-probs and exponentiating is mathematically equivalent and numerically safe.
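A two-element demonstration (contrived log-probs, purely illustrative) of why the log-space route is safer:

```python
import torch

logp_new = torch.tensor([-800.0, -0.5])   # extreme-but-possible log-densities
logp_old = torch.tensor([-800.2, -0.6])

naive = logp_new.exp() / logp_old.exp()   # exp(-800) underflows to 0 -> 0/0
stable = torch.exp(logp_new - logp_old)   # subtract first, then exponentiate

print(naive)    # tensor([nan, 1.1052])
print(stable)   # tensor([1.2214, 1.1052])
```

Both elements have a well-defined ratio near 1; the naive division destroys the first one entirely.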

The clipped objective

PPO-Clip loss LCLIP = − (1/B) Σt min( ρt · Ât, clip(ρt, 1−ε, 1+ε) · Ât )

Two terms inside the min:

  1. ρt · Ât — the unclipped surrogate. This is what vanilla policy gradient would use.
  2. clip(ρt, 1−ε, 1+ε) · Ât — the clipped surrogate. Caps ρ in [0.8, 1.2] for ε=0.2.

Then: take the elementwise min of these two, mean across the batch, negate (because PyTorch minimizes).

Why min — the asymmetric pessimistic cap

Walk through both signs of advantage:

Case 1: Â > 0 (good action, push policy up)

ρ increases as we make the action more likely. surr1 = ρ · Â grows linearly. surr2 grows too but is capped at (1 + ε) · Â. Once ρ > 1+ε, surr2 is the smaller one, so min picks it. Above the cap, the gradient w.r.t. θ is zero — no further reward for moving the policy further.

Case 2: Â < 0 (bad action, push policy down)

ρ decreases as we make the action less likely. With Â < 0, multiplying by smaller ρ gives a less-negative number. surr2 caps the lower side at (1 − ε) · Â. Once ρ < 1−ε, surr2 is more negative than surr1, so min picks it. Below the cap, the gradient w.r.t. θ is zero — no further punishment for moving the policy further.

The min always picks the worse outcome from the optimizer's perspective. It's a pessimistic bound. The optimizer can't game the surrogate by moving too far in either direction. That's what stabilizes PPO.

Visual mental model

Imagine a hill that climbs steadily but flattens to a plateau at ρ = 1+ε (when Â>0). Or descends but bottoms out at ρ = 1−ε (when Â<0). Outside the flat region, no force pulls you. Inside, normal gradient. The clip is the flat region.
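The plateau is easy to see in autograd. A small sketch (ε = 0.2, Â = 1, ratios chosen for illustration) showing the gradient through the clipped objective dying once the ratio leaves the trust region:

```python
import torch

clip_eps, adv = 0.2, 1.0   # positive-advantage case
grads = []

for r in [1.0, 1.1, 1.65]:                      # inside, inside, beyond the clip
    ratio = torch.tensor(r, requires_grad=True)
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    loss = -torch.min(surr1, surr2)
    loss.backward()
    grads.append(ratio.grad.item())

print(grads)   # gradient is -1 inside the clip region, exactly 0 beyond it
```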

Chapter 07

Value, Entropy, Reverse-KL

The full PPO loss has more than just the clipped surrogate. Three terms get added in on_policy.py:260-263:

loss = (policy_loss
        + self.value_coef    * value_loss
        - self.entropy_coef  * entropy
        + self.reverse_kl_coef * reverse_kl)

Value loss

Lvalue = 𝔼[ ( Vφ(s) − Rt )² ]   where Rt = Ât + V(st) is the GAE return

Mean squared error between the critic and the GAE returns target. Trains Vφ alongside the policy. The value_coef hyperparameter weights how much we care about critic accuracy vs policy improvement (typically 0.5).

Entropy bonus

Lentropy = 𝔼[ H[ πθ(·|s) ] ] add bonus to encourage stochasticity

Entropy = how spread out is the action distribution. Without this, the policy can collapse to nearly deterministic during training, killing exploration. We add entropy to the objective (i.e. subtract from the loss, since loss = -objective). The entropy_coef is small (~0.001-0.01).

Reverse KL to a frozen reference

LKL = 𝔼old[ ρ · (log πθ − log πref) ]   an estimate of DKL(πθ || πref)

The reference_actor is a frozen snapshot of the policy taken right after BC pretraining. This term penalizes the RL policy for drifting too far from the BC initialization.

Why we need this: with sparse rewards, BC is the only thing telling the agent how to roughly do the task. Pure RL would forget BC during training. The KL anchor keeps it grounded. Same role as the BC-during-RL alternation in Problem 3 — same disease, different antibiotic.

"Reverse" KL

DKL(πθ || πref) is "reverse" because the trainable policy sits in the first argument — it's the KL from the new policy to the reference. Reverse KL is mode-seeking: πθ is heavily penalized for putting probability mass where πref has little, so it concentrates on high-probability modes of the reference. (Forward KL, DKL(πref || πθ), is mass-covering — it would force πθ to spread over everything the reference might do, which is not what we want here.)
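The loss term above is a Monte Carlo approximation of this KL. A sanity check against the closed form for Gaussians (toy 1-D distributions; when the minibatch comes from the current policy, ρ = 1, which is the case sketched here):

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
pi_theta = Normal(torch.tensor(0.5), torch.tensor(1.0))  # "new" policy
pi_ref = Normal(torch.tensor(0.0), torch.tensor(1.0))    # frozen reference

a = pi_theta.sample((100_000,))  # samples from pi_theta, so ratio = 1 here
estimate = (pi_theta.log_prob(a) - pi_ref.log_prob(a)).mean()

analytic = 0.5 ** 2 / 2          # KL(N(0.5,1) || N(0,1)) = (mu1-mu2)^2 / 2
print(estimate.item(), analytic) # both ~0.125
```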

Chapter 08

The Full Algorithm

PPO with BC Pretraining + Reverse-KL Anchor
  1. Initialize actor πθ, critic Vφ, optimizer over both, replay-style buffer for rollouts.
  2. BC pretrain on 20 expert demos:
    repeat for Nbc steps:
      sample (s, a) from demos
      loss = − mean[ log πθ(a|s) ]
      step optimizer
  3. Snapshot reference actor: self.reference_actor = deepcopy(self.actor), freeze grads.
  4. RL loop: repeat
    a) Collect rollout: run πθ in env for T steps; store (s, a, r, s', d, log πold(a|s)).
    b) Compute advantages:
    • values = Vφ(s_all)
    • next_values = Vφ(s'_all)
    • (advantages, returns) = compute_gae(...) (your Task 1)
    • normalize: advantages ← (advantages − mean) / std
    c) PPO epoch loop: for k = 1..ppo_epochs:
    • shuffle indices, iterate minibatches
    • compute ρ, clipped surrogate, value loss, entropy, reverse-KL (your Task 3)
    • combined loss, accumulate gradients
    • one optimizer step per epoch
    d) Discard rollout; collect a fresh one next.

Note the discard step. PPO collects, updates a few times, throws data away. This is what makes it on-policy and what makes it sample-inefficient compared to off-policy methods.

Chapter 09

Code Tour: on_policy.py

Three classes: Actor, Critic, PPOAgent. Same pattern as Problem 3 but the training loop differs.

The Actor

on_policy.py:13-39

class Actor(nn.Module):
    def __init__(self, obs_shape, action_shape, hidden_dim, log_std_init=-2.0):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(obs_shape[0], hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim),   nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, action_shape[0]))
        self.log_std = nn.Linear(hidden_dim, action_shape[0])

    def forward(self, obs):
        features = self.policy[:-1](obs)             # run all but final layer
        mu = torch.tanh(self.policy[-1](features))   # action mean in [-1, 1]
        std = torch.exp(self.log_std(features)).clamp(1e-5, 1.0)
        return utils.TruncatedNormal(mu, std)

Two heads sharing the same backbone: a mu head (action mean) and a log_std head (action variability). Crucially, std is learned here (unlike Problem 3 where std=0.1 is fixed). The slice self.policy[:-1] runs all layers except the final one, so we get the shared 256-D feature vector that both heads branch from.

The Critic

on_policy.py:42-58

class Critic(nn.Module):
    def __init__(self, obs_shape, hidden_dim):
        super().__init__()
        self.value_net = nn.Sequential(
            nn.Linear(obs_shape[0], hidden_dim), nn.LayerNorm(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim), nn.LayerNorm(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1))                # scalar V(s)

    def forward(self, obs):
        return self.value_net(obs)                  # shape [batch, 1]

V(s) not Q(s, a) — the critic only takes obs as input. PPO doesn't need Q because the policy gradient theorem works with V as a baseline.

The PPOAgent.update method — structure

This is where everything happens. Roughly:

  1. Concatenate the rollout into flat tensors (lines 168-189).
  2. Compute GAE advantages and returns — your Task 2(a) (lines 192-205).
  3. Loop ppo_epochs times, with shuffled minibatches (lines 218-275).
  4. Inside each minibatch: compute clipped surrogate — your Task 2(b) — plus value loss, entropy, reverse-KL. Combine. Accumulate gradients.
  5. Step optimizer once per epoch (lines 274-275).

One subtle point: gradients accumulate across minibatches inside an epoch, with each minibatch loss scaled by (batch_size / total_size). The optimizer steps once per epoch. This is equivalent to averaging the loss across the entire epoch, but avoids OOM from holding the whole rollout in one forward pass.
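The scaling-plus-accumulation trick is easy to verify on a toy regression (made-up net and data; 8 samples split into two minibatches of 4):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
net = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

# One loss over the whole batch
net.zero_grad()
F.mse_loss(net(x), y).backward()
full_grad = net.weight.grad.clone()

# Two minibatches, each scaled by batch_size / total_size, gradients accumulated
net.zero_grad()
for xb, yb in ((x[:4], y[:4]), (x[4:], y[4:])):
    (F.mse_loss(net(xb), yb) * (4 / 8)).backward()  # no zero_grad in between!

assert torch.allclose(full_grad, net.weight.grad, atol=1e-6)
print("accumulated minibatch grads == full-batch grads")
```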

Chapter 10

Your Three Changes, Decoded

Now the centerpiece: every line you wrote, explained at the level of "what is this PyTorch operation actually doing." This chapter assumes you've read the PyTorch primer in Chapter 02.

Change 1 of 3
compute_gae — the recursion body

What you wrote (3 lines, inside the backwards loop):

for t in reversed(range(T)):
    delta = rewards[t] + discounts[t] * next_values[t] - values[t]
    gae = delta + discounts[t] * self.gae_lambda * gae
    advantages[t] = gae

Line 1: delta = rewards[t] + discounts[t] * next_values[t] - values[t]

delta = rewards[t] + discounts[t] * next_values[t] - values[t]

This is the 1-step TD error at time t. Read piece by piece:

rewards[t] — the scalar reward at step t. Tensor shape [] (a single number) since we indexed into a [T]-shaped tensor.

discounts[t] — pre-multiplied γ * (1 − done[t]). If the episode ended at t, this is 0; otherwise it's γ. Same shape as rewards.

next_values[t] — the critic's prediction Vφ(s't). Computed earlier when you called self.critic(next_obs).squeeze(-1).

values[t] — Vφ(st). Same critic, current state.

Math meaning: r + γ V(s') − V(s) = "how surprised was the critic?" Positive: outcome was better than predicted. Negative: worse.

Line 2: gae = delta + discounts[t] * self.gae_lambda * gae

gae = delta + discounts[t] * self.gae_lambda * gae

The recursion step. Important: the gae on the right-hand side is the value from the previous iteration (which corresponds to time t+1, since we're going backwards). After this assignment, gae holds Ât.

delta — the immediate TD error.

discounts[t] * self.gae_lambda * gae — "decayed contribution from the future." With γ=0.99 and λ=0.95, this is 0.94 of the future value, vanishing geometrically.

(1 − dt) is hidden inside discounts[t] — if t is the last step of an episode, the recursion correctly resets (no contribution from the next episode's value).

Why a running variable: computing each Ât from scratch as its own geometric sum would be O(T²) work. The running gae variable carries the tail of the series backwards through time, so the whole pass is O(T).

Line 3: advantages[t] = gae

advantages[t] = gae

Store the just-computed Ât into slot t of the output tensor. Plain tensor indexing, like writing to a NumPy array slot.

You initialized advantages = torch.zeros_like(rewards) outside the loop, so the tensor's shape is [T]. Each loop iteration writes one slot. After the loop, all T slots are filled, going from index T-1 down to 0.

What happens after the loop

returns = advantages + values
return advantages, returns

For each t: returns[t] = Ât + Vφ(st). This is your low-variance estimate of the discounted return from t onward, used as the regression target for the critic later.

Change 2 of 3
Plumbing GAE into the update loop

What you wrote (3 lines, inside with torch.no_grad(): in the update method):

values_all = self.critic(obs_all).squeeze(-1)
next_values_all = self.critic(next_obs_all).squeeze(-1)
advantages_all, returns_all = self.compute_gae(rew_all, values_all, next_values_all, disc_all, done_all)

Line 1: values_all = self.critic(obs_all).squeeze(-1)

values_all = self.critic(obs_all).squeeze(-1)

Three things happen here.

1. self.critic(obs_all) — calls the critic's forward(obs_all) method. obs_all has shape [T, D] (T timesteps, D = obs_dim, ~39). The critic returns shape [T, 1] — one scalar per state, but wrapped in a trailing size-1 dim because the final Linear layer outputs nn.Linear(hidden, 1).

2. .squeeze(-1) — drops the trailing size-1 dim, giving shape [T]. Why? Because compute_gae expects values[t] to be a scalar. If we left the shape as [T, 1], indexing values[t] would give a [1] tensor, and arithmetic with the []-shaped rewards[t] would broadcast to [1]. Probably correct, but cleaner to make shapes match upfront.

3. The whole thing happens inside with torch.no_grad(): — PyTorch doesn't track operations for gradient. We're computing targets, not training anything.

Line 2: next_values_all = self.critic(next_obs_all).squeeze(-1)

next_values_all = self.critic(next_obs_all).squeeze(-1)

Same critic, different input. next_obs_all is shape [T, D], the "next state after each transition." This is what compute_gae uses as next_values[t] = V(s't).

Two forward passes through the same network are slightly wasteful — you could concatenate and do one. But the readability is worth it.

Line 3: advantages_all, returns_all = self.compute_gae(...)

advantages_all, returns_all = self.compute_gae(rew_all, values_all, next_values_all, disc_all, done_all)

Calls your method from Change 1 with all five required arguments. Returns a tuple, which Python unpacks into two named tensors. Both have shape [T].

The variable names matter — the surrounding code references advantages_all (for normalization) and returns_all (as the value-loss target). Using different names would silently break the code.

What happens just after

adv_std = advantages_all.std(unbiased=False)
advantages_all_old = advantages_all
advantages_all = (advantages_all - advantages_all.mean()) / (adv_std + 1e-8)

This was already there. It normalizes the advantages to mean 0 and std 1. Why? Because the policy gradient is sensitive to the scale of advantages — if all advantages are tiny, gradient updates are tiny; if huge, updates are huge and unstable. Normalizing fixes the scale every iteration. The + 1e-8 avoids division by zero. The _old save is kept just for logging.

Change 3 of 3
The PPO clipped surrogate

What you wrote (4 lines, inside the minibatch loop):

ratio = torch.exp(new_log_prob - olp_ep)
surrogate1 = ratio * adv_ep
surrogate2 = torch.clamp(ratio, 1.0 - self.clip_eps, 1.0 + self.clip_eps) * adv_ep
policy_loss = -torch.min(surrogate1, surrogate2).mean()

Line 1: ratio = torch.exp(new_log_prob - olp_ep)

ratio = torch.exp(new_log_prob - olp_ep)

This computes ρt = πθ(a|s) / πold(a|s) for each transition in the minibatch.

Why the subtraction-then-exp: log(a/b) = log(a) − log(b), so a/b = exp(log(a) − log(b)). We work in log space throughout because individual log-probabilities can be very small (or large negative) numbers; subtraction keeps them in a sane range, exponentiation brings us back to ratio space.

new_log_prob — computed two lines earlier as dist.log_prob(act_ep).sum(-1). Shape [batch_size]. The .sum(-1) sums log-probs across the 4 action dimensions (joint log-prob = sum of independent per-dim log-probs).

olp_ep — "old log prob, for this minibatch slice." It's old_lp_all[batch_idx], the log-prob recorded at rollout time when the actor was a different snapshot. Shape [batch_size].

What's in this tensor: each element is the ratio for one transition. If that ratio is 1.0, the policy hasn't changed for that action. If 1.5, the new policy is 1.5× more likely to pick that action than the rollout-time policy.

Line 2: surrogate1 = ratio * adv_ep

surrogate1 = ratio * adv_ep

Elementwise multiply: surrogate1[i] = ratio[i] * adv_ep[i].

This is the unclipped surrogate — what vanilla policy gradient would use. With adv_ep > 0 and increasing ratio, this grows linearly — reward for moving the policy toward this action. With adv_ep < 0 and decreasing ratio, this also grows (less negative) — reward for moving away.

adv_ep — advantages for this minibatch, normalized. Shape [batch_size].

Line 3: surrogate2 = torch.clamp(ratio, 1.0 - self.clip_eps, 1.0 + self.clip_eps) * adv_ep

surrogate2 = torch.clamp(ratio, 1.0 - self.clip_eps, 1.0 + self.clip_eps) * adv_ep

torch.clamp(x, lo, hi) takes any tensor x and returns elementwise min(max(x, lo), hi). So if self.clip_eps = 0.2, this returns ratios capped to [0.8, 1.2].

For ratios already inside [0.8, 1.2], surrogate1 == surrogate2. For ratios outside that range, surrogate2 is the clipped version multiplied by the advantage.

Line 4: policy_loss = -torch.min(surrogate1, surrogate2).mean()

policy_loss = -torch.min(surrogate1, surrogate2).mean()

Three operations stacked.

1. torch.min(surrogate1, surrogate2) — elementwise min of two same-shaped tensors. Shape [batch_size]. This is what creates the asymmetric pessimistic cap — the optimizer sees the worse of the two surrogates at each transition, not the better one.

2. .mean() — reduces shape [batch_size] to [] (a scalar). Gives the average across the minibatch. We need a scalar to call .backward() on.

3. Negate. The PPO objective is to maximize the min-clipped surrogate. PyTorch optimizers minimize losses. So we negate.

The variable name policy_loss matters. The surrounding code (combined-loss line at on_policy.py:260) references it by name. Naming it anything else would cause a NameError later in the same function.

What happens just after (already there)

with torch.no_grad():
    ref_dist = self.reference_actor(obs_ep)
    ref_log_prob = ref_dist.log_prob(act_ep).sum(-1)
reverse_kl = (ratio * (new_log_prob - ref_log_prob)).mean()

This uses your ratio variable. The reverse-KL term penalizes drift from the BC reference. Computing it requires (a) log-prob under the frozen reference (no grad needed), (b) the ratio you just computed, (c) new_log_prob from the current actor.

values_pred = self.critic(obs_ep).squeeze(-1)
value_loss = F.mse_loss(values_pred, ret_ep)

loss = (policy_loss
        + self.value_coef    * value_loss
        - self.entropy_coef  * entropy
        + self.reverse_kl_coef * reverse_kl)

(loss * (batch_size / T)).backward()

The combined loss: your policy_loss, plus the value-loss MSE, minus the entropy bonus (because we're adding entropy to the objective and the loss is the negation), plus the reverse-KL drift penalty. The (batch_size / T) scaling means that summing across all minibatches in an epoch gives the same gradient magnitude as if we'd computed one big loss across the whole rollout. Then .backward() walks backward through the entire computation graph and fills in gradients for every parameter in the actor and the critic.

One step per epoch (after all minibatches' gradients have accumulated):

nn.utils.clip_grad_norm_(params, max_norm=1)
self.opt.step()

clip_grad_norm_ rescales gradients if the total norm exceeds 1.0 — protection against exploding gradients. .step() applies the Adam update.
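What clip_grad_norm_ does is a single global rescale, easy to see with a hand-built gradient (values illustrative):

```python
import torch

p = torch.nn.Parameter(torch.zeros(3))
p.grad = torch.tensor([3.0, 4.0, 0.0])        # total norm = 5, above max_norm

torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
print(p.grad)   # rescaled by ~1/5: tensor([0.6000, 0.8000, 0.0000])
```

Note it rescales the whole gradient vector uniformly (preserving its direction) rather than clamping each element independently.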

Chapter 11

Cheat Sheet & Self-Quiz

Equations to memorize

GAE recursion:
δ_t = r_t + γ (1−d_t) V(s_{t+1}) − V(s_t)
Â_t = δ_t + γ λ (1−d_t) Â_{t+1}
returns_t = Â_t + V(s_t)

PPO clip loss:
ρ = exp(log π_new − log π_old)
L^CLIP = − mean[ min( ρ · Â, clip(ρ, 1−ε, 1+ε) · Â ) ]

Combined loss:
L = L^CLIP + c_v · L_value − c_H · H[π] + c_KL · D_KL(π ‖ π_ref)
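The GAE recursion above can be written as a standalone function. This is a sketch for reference (the function name and argument names are illustrative, not the homework's exact interface):

```python
import torch

def gae(rew, val, next_val, done, gamma=0.99, lam=0.95):
    """Backward GAE recursion over a rollout of length T."""
    T = rew.shape[0]
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - done[t]
        # delta_t = r_t + gamma * (1 - d_t) * V(s_{t+1}) - V(s_t)
        delta = rew[t] + gamma * nonterminal * next_val[t] - val[t]
        # A_t = delta_t + gamma * lam * (1 - d_t) * A_{t+1}
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    returns = adv + val          # critic regression target
    return adv, returns
```

With lam=0 the advantages collapse to the one-step TD errors δ_t; with lam=1 and no intermediate dones they equal the discounted Monte Carlo return minus V(s_t).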

Tensor shape cheat-sheet

| Variable | Shape | Meaning |
| --- | --- | --- |
| obs_all | [T, D] | States across all rollout steps |
| act_all | [T, A] | Actions taken (4 dims) |
| rew_all | [T] | Per-step rewards |
| disc_all | [T] | γ(1−done), pre-multiplied |
| old_lp_all | [T] | log π_old(a∣s) at rollout time |
| values_all | [T] | V_φ(s_t) for all t |
| next_values_all | [T] | V_φ(s'_t) for all t |
| advantages_all | [T] | Â_t from GAE |
| returns_all | [T] | Â_t + V(s_t), critic target |
| ratio | [batch] | Importance ratio for minibatch |
| policy_loss | [] | Scalar PPO clip loss for minibatch |

Self-quiz

  1. Why does the policy gradient theorem need only log-probs of actions taken — not over all possible actions?
  2. What does subtracting the value baseline V(s) accomplish? Why doesn't it bias the gradient?
  3. What does λ control in GAE? What are λ=0 and λ=1 corner cases called?
  4. Why does GAE iterate backwards through time?
  5. Why is the importance ratio computed as exp of a difference, not as a direct division?
  6. What happens to the gradient when ratio > 1+ε and Â > 0?
  7. Why is the policy loss negated before backward?
  8. Why is GAE computation wrapped in with torch.no_grad():?
  9. What's the difference between old_log_prob and new_log_prob in the same minibatch?
  10. Why does the entropy term get subtracted from the loss?
  11. Why does PPO need a frozen reference actor for KL?
  12. Why is the actor's std learned here but fixed at 0.1 in Problem 3?
Answer key

1. Because we sampled those actions ourselves — the policy gradient comes from "differentiate log p(observed sample) under your distribution." There's nothing to differentiate at unobserved actions.

2. It reduces variance of the gradient estimator without biasing it — the cross-term in expectation is zero (proven via the log-derivative trick), but the variance of the estimator drops.

3. λ controls the bias-variance tradeoff between 1-step TD (λ=0, biased, low variance, "TD") and Monte Carlo (λ=1, unbiased, high variance, "MC").

4. The recursion Ât = δt + γλ Ât+1 needs the t+1 value to compute the t value. Iterating backwards gives O(T) computation.

5. Numerical stability. Direct division of two tiny probabilities is unstable; subtraction of log-probs and exponentiation is stable.
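A two-line demonstration of the underflow (the log-prob values are exaggerated for effect, but high-dimensional or long-horizon log-probs really do get this negative):

```python
import math

# Direct probabilities underflow to 0.0 in float64, making the
# division 0/0 or inf; the log-space difference is perfectly stable.
lp_old, lp_new = -800.0, -799.0
print(math.exp(lp_old))                 # 0.0  (underflow)
ratio = math.exp(lp_new - lp_old)       # exp(1.0) ≈ 2.718, computed safely
```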

6. The min picks surrogate2 = (1+ε) · Â, which is constant in θ once ρ passes the cap. Gradient w.r.t. θ goes to zero.
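You can verify the vanishing gradient directly with autograd (scalar toy values, not a real minibatch):

```python
import torch

# When ratio > 1+eps and advantage > 0, the min selects the clipped
# branch, which is constant in the policy parameters — so the gradient
# through log_ratio is exactly zero.
eps, adv = 0.2, torch.tensor(1.0)
log_ratio = torch.tensor(0.5, requires_grad=True)   # ratio ≈ 1.65 > 1.2
ratio = log_ratio.exp()
surrogate = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
(-surrogate).backward()
print(log_ratio.grad)                               # tensor(0.)
```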

7. PyTorch optimizers minimize. The PPO objective is to maximize. So loss = -objective.

8. Because GAE returns are targets for the critic. If they tracked gradients, the critic would be trained against itself, and gradient descent would chase its own tail.
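A toy illustration with a stand-in critic and fabricated targets (not the homework's GAE code):

```python
import torch
import torch.nn.functional as F

# Targets built under no_grad are plain constants: the MSE gradient
# flows only through the prediction, never through the target.
critic = torch.nn.Linear(4, 1)
obs = torch.randn(8, 4)
with torch.no_grad():
    targets = critic(obs).squeeze(-1) + 1.0   # stand-in for GAE returns
pred = critic(obs).squeeze(-1)
F.mse_loss(pred, targets).backward()
assert not targets.requires_grad              # targets carry no graph
```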

9. old_log_prob was recorded by the actor that collected the rollout (a fixed snapshot). new_log_prob is computed by the actor right now in the inner loop, which has been updated by previous minibatches in this epoch. Their ratio measures policy drift since rollout.

10. Because we're adding entropy to the objective (encouraging stochasticity = exploration) and the loss is the negation of the objective.

11. To prevent the RL policy from drifting too far from the BC initialization. Sparse rewards mean BC is the only thing telling the agent how to roughly do the task; without the KL anchor, RL would forget BC and learning would collapse.

12. On-policy PPO benefits from learned-std for adaptive exploration as the policy improves. Off-policy methods (P3) get exploration from the diversity of the replay buffer, so a fixed std works fine.

Running it

# Pre-flight (one-time)
modal secret create wandb-secret WANDB_API_KEY=<your_key>

# Launch training
cd /Users/ozyphus/Documents/GitHub/cs224r/homework/hw2/hw2
modal run --detach modal_on_policy.py

~45 minutes to 1.5 hours on Modal A10. The wandb URL prints at startup. Watch eval/episode_success, policy_loss, and actor_std — if std collapses to 0 in the first 50k steps, the entropy_coef is too low.

Take it back to class

You can now teach this

Three big ideas, in order of importance:

  1. Policy gradient with a baseline. Push log-prob of action up by amount A(s,a). The baseline V(s) reduces variance without changing the gradient's expectation.
  2. GAE. A geometrically-weighted sum of TD errors that interpolates between 1-step TD and Monte Carlo. λ tunes the bias-variance trade. The backwards recursion is just an efficient way to compute the geometric sum.
  3. PPO clipping. Compute the importance ratio ρ = π_new / π_old. Take the min of (ρ·A, clipped(ρ)·A). The min creates an asymmetric pessimistic cap that prevents the policy from drifting too far per update. This is the entire stability mechanism.

If a friend asks: "Why does PPO use min instead of average?" — you say: "Because the optimizer is adversarial. If we used average, the optimizer would happily blow past the clip in directions where the unclipped term is favorable, defeating the whole point. The min is a pessimistic bound: the optimizer always sees the worse of the two surrogates, so it can't game the clip."

You can teach this. Go finish the training.