A robot learning to swing a hammer using stochastic gradient descent on its own behavior. Every PyTorch line decoded. Every formula derived. Every trick justified.
Same task as Problem 3, completely different algorithm. A 4-DOF Sawyer arm in MuJoCo simulation must learn to pick up a hammer and drive a nail into a board. The observation is ~39 numbers (positions of robot, hammer, nail). Action is a 4-D continuous vector in [-1, 1]. Reward: 0 every step, 1.0 the moment the nail is fully driven, then episode ends.
You'll implement PPO — Proximal Policy Optimization — the most widely-used policy-gradient algorithm in modern RL. The deliverable is a wandb plot of eval/episode_success over ~1 million environment steps.
PPO trains a neural network policy by repeatedly: (1) collect a batch of rollouts under the current policy, (2) compute how much better-than-average each action turned out, (3) take several gradient steps on a clipped objective that pushes good actions up and bad actions down, but only a little — never letting the policy change too far from where it just was.
The simplest policy-gradient algorithm (REINFORCE) goes: roll out the current policy, then nudge up the log-probability of every action in proportion to the return that followed it, and repeat.
This works in theory but is wildly unstable in practice. Two reasons: the return-weighted gradient estimate has enormous variance, and nothing limits how far a single update can move the policy away from the data it was computed on.
PPO addresses both problems. GAE reduces variance via a critic-blended advantage estimate. Clipping caps how far each update can push the policy. Together: stable, fast learning.
Strictly on-policy means: collect data, do one gradient step, throw the data away. PPO does multiple epochs of gradient steps on the same batch. As long as the policy hasn't drifted too much (the clip enforces this), the same data is approximately valid for several updates. This is sometimes called "near-on-policy" or "approximately on-policy."
PPO (this problem) vs SAC-style (Problem 3): PPO is on-policy, throws away data after a few epochs, simpler to implement, slow to converge (~1M steps). SAC is off-policy, replays a buffer indefinitely, more components to debug, fast to converge (~100k steps). Both work; both have their use cases.
If you know Python but PyTorch is fuzzy, read this chapter. You'll need every concept here when reading the code.
A tensor is just a multi-dimensional array of numbers, like NumPy's ndarray. The two extras are:
1. It can live on the CPU or on a GPU; .to(device) moves it. Math on tensors that live on a GPU is fast (parallelized across thousands of CUDA cores).
2. It can remember the operations that produced it, so gradients can flow backward later (next section).

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])  # shape [3], on CPU
x = x.to("cuda")                   # now on GPU
y = x * 2.0 + 1.0                  # elementwise math: tensor([3., 5., 7.])
```
Most PyTorch bugs are shape bugs. The shape of a tensor is its dimensions: a vector of length 4 is shape [4]; a 32×4 matrix is shape [32, 4]. Operations broadcast across compatible shapes silently, which is convenient but dangerous.
```python
a = torch.tensor([1., 2., 3.])           # shape [3]
b = torch.tensor([[10.], [20.], [30.]])  # shape [3, 1]
c = a + b                                # shape [3, 3]! NumPy-style broadcasting. Probably not what you wanted.
```
If a loss is suddenly enormous or stays at zero, the first thing to check is the shapes of every operand. Print x.shape liberally. The PPO code uses .squeeze(-1) to drop a trailing size-1 dimension specifically to avoid silent broadcasts.
| Method | What it does | Example |
|---|---|---|
| .shape | Tuple of dimensions | torch.zeros(3, 5).shape → [3, 5] |
| .unsqueeze(d) | Add a size-1 dim at position d | [3].unsqueeze(0) → [1, 3] |
| .squeeze(d) | Drop a size-1 dim at d | [3, 1].squeeze(-1) → [3] |
| .view(...) | Reshape; total size unchanged | [6].view(2, 3) → [2, 3] |
| torch.cat([...], dim=) | Concatenate along an existing dim | two [3, 4] → [6, 4] with dim=0 |
| torch.stack([...], dim=) | Stack into a NEW dim | two [3] → [2, 3] with dim=0 |
| .sum(-1) / .mean(-1) | Reduce along the last dim | [3, 4].sum(-1) → [3] |
| x[idx] | Slice with an int or a tensor of indices | x[torch.tensor([0, 2, 5])] |
When you compute a tensor that has trainable parameters in its history, PyTorch tracks which operations made it. Calling .backward() on a scalar (a single number) tells PyTorch to walk backward through that history, computing gradients for every parameter along the way.
```python
import torch

# Forward pass
w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([3.0])
loss = (w * x - 5.0) ** 2    # one-element tensor (effectively a scalar)

# Backward pass
loss.backward()              # fills in w.grad
print(w.grad)                # tensor([6.])

# Manual update (in practice the optimizer does this)
with torch.no_grad():
    w -= 0.01 * w.grad
    w.grad.zero_()           # clear before next backward
```
Three rules to keep in your head:
1. Call .backward() only on a scalar. If your loss is a vector, take .mean() or .sum() first.
2. Gradients accumulate across backward calls, so clear them before each update; optimizer.zero_grad() handles this.
3. Operations inside with torch.no_grad(): don't get tracked. Use this when computing values you don't want gradients for (like TD targets or log-probs at rollout time).

Why torch.no_grad() matters:
If you compute a "target" value y that you'll regress toward, and you forget no_grad, gradients will flow through y back into the network that produced it. Now your network is being trained against a moving target it produced itself → divergence. Always wrap target computation in with torch.no_grad():.
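For example, here is the target-computation pattern in miniature (toy shapes and a stand-in one-layer critic, not the homework's code):

```python
import torch
import torch.nn as nn

value_net = nn.Linear(4, 1)                 # stand-in critic
obs, next_obs = torch.randn(8, 4), torch.randn(8, 4)
reward = torch.rand(8)

with torch.no_grad():                       # targets: no graph, no gradient flow
    target = reward + 0.99 * value_net(next_obs).squeeze(-1)

pred = value_net(obs).squeeze(-1)           # predictions: tracked as usual
loss = ((pred - target) ** 2).mean()
loss.backward()                             # gradients reach value_net only through pred
```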
A network is a Python class that inherits from nn.Module. It has learnable parameters (typically the weights of nn.Linear layers it stores) and a forward() method that defines what happens when you call it.
```python
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.layer1 = nn.Linear(in_dim, 64)   # weights + bias
        self.layer2 = nn.Linear(64, out_dim)

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        return self.layer2(x)

net = SimpleNet(10, 4)
y = net(torch.randn(32, 10))  # calls .forward, returns shape [32, 4]
```
Notice net(x), not net.forward(x). The __call__ hook on Module triggers some PyTorch bookkeeping then calls forward.
nn.Sequential is a convenience for stacking layers without writing a forward method:
```python
self.policy = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, action_dim),
)
```
An optimizer takes a list of parameters and applies gradient updates to them. The most common: Adam.
```python
optimizer = torch.optim.Adam(net.parameters(), lr=3e-4)

# Standard training loop:
optimizer.zero_grad()   # clear leftover grads
loss.backward()         # compute new grads
optimizer.step()        # take a gradient step
```
For policy-gradient algorithms, you need to sample actions from a distribution and compute log-probabilities. PyTorch provides torch.distributions.
```python
from torch.distributions import Normal

mu = torch.tensor([0.0])
sigma = torch.tensor([1.0])
dist = Normal(mu, sigma)

a = dist.sample()            # draws a random sample, shape [1]
log_p = dist.log_prob(a)     # log-density at a, shape [1]
```
The homework uses utils.TruncatedNormal — same idea but the samples are clipped to [-1, 1] to match the action bounds. dist.log_prob(a) gives you log π(a|s), which is what the policy gradient theorem needs.
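Concretely, for a multi-dimensional action you sum the per-dimension log-probs to get the joint log-prob (the walkthrough's .sum(-1) later does exactly this). A toy sketch with a plain Normal standing in for TruncatedNormal:

```python
import torch
from torch.distributions import Normal

mu = torch.zeros(4)               # 4 action dims, illustrative values
sigma = torch.full((4,), 0.1)
dist = Normal(mu, sigma)

a = dist.sample()                 # shape [4]
per_dim = dist.log_prob(a)        # shape [4]: one log-density per action dim
log_pi = per_dim.sum(-1)          # scalar: log pi(a|s) for the joint 4-D action
```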
Probabilities for continuous distributions can be tiny in high dimensions (think 1e-30 multiplied across 4 action dims). Log-probabilities stay numerically reasonable. Also, ∇ log p = (1/p) ∇ p — the gradient form is what we actually want for the policy gradient.
That's enough to read every line of on_policy.py. Onward.
The whole reason PPO exists is to maximize one quantity: expected discounted return.
$$
J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \gamma^{t}\, r_t\Big]
$$

Read it: "the average over all possible trajectories τ that πθ could produce, of the total discounted reward of that trajectory." We want to find the policy parameters θ that maximize J.
To do gradient ascent we need ∇θ J. The trajectory distribution depends on θ through the policy actions, so we can't trivially differentiate. The policy gradient theorem gives us a clever rewriting:

$$
\nabla_\theta J \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big]
$$
where $G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$ is the return from time t onward.
In English: to make the policy better, take each action you just sampled and increase its log-probability by an amount proportional to how good the future turned out to be after that action. Good actions: push them up. Bad actions: push them down.
The derivation hinges on one identity: ∇θ log pθ(x) = (1/pθ(x)) · ∇θ pθ(x). This lets you turn a derivative of a probability density (which depends on samples that depend on θ) into an expectation that you can estimate from samples drawn under θ. The miracle is that environment dynamics — which we don't know — cancel out of the gradient.
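Spelled out in a few lines, using the same notation:

$$
\nabla_\theta J
= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau
= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
= \mathbb{E}_{\tau \sim p_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big],
$$

and since

$$
\log p_\theta(\tau) = \log p(s_0) + \sum_t \Big[\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\Big],
$$

the initial-state and dynamics terms do not depend on θ and vanish under ∇θ. Dropping the rewards earned before time t (they cannot depend on a_t) turns R(τ) into the reward-to-go G_t and gives the formula above.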
Given a batch of N rollouts, we estimate the expectation by averaging:

$$
\nabla_\theta J \;\approx\; \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, G_t^{(i)}
$$
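In PyTorch that estimator is essentially a one-liner: weight each sampled action's log-probability by its return and let autograd do the rest. A minimal sketch with toy shapes and a stand-in one-parameter policy (not the homework's code):

```python
import torch
from torch.distributions import Normal

policy_mean = torch.zeros(3, requires_grad=True)  # toy "policy": one mean per timestep
dist = Normal(policy_mean, 1.0)

actions = dist.sample()                  # actions sampled under the current policy
log_probs = dist.log_prob(actions)       # log pi(a_t | s_t), shape [3]
returns = torch.tensor([1.0, 0.5, 0.2])  # toy returns-to-go G_t

# REINFORCE: maximize sum_t log pi(a_t|s_t) * G_t  ->  minimize the negation
loss = -(log_probs * returns).mean()
loss.backward()                          # fills policy_mean.grad
```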
If you implement just this, you have REINFORCE. It works, but it's slow. Two problems: (1) the gradient estimate has enormous variance, since every action is weighted by a noisy full return, so one lucky rollout can dominate the batch; (2) nothing limits the step size, so a single large update can move the policy far from the data it was estimated on and learning collapses.
The fixes: baselines (advantage), GAE (multi-step variance reduction), clipping (stability). Each gets its own chapter.
Naive REINFORCE weights each action by the return Gt. Two issues: the returns are high-variance (a lucky episode produces a huge weight), and with rewards that are never negative every sampled action gets pushed up, just by different amounts. We would rather push up only the actions that did better than expected.
Subtract any state-dependent baseline b(st) from Gt. The expected gradient is unchanged (provable: cross-term has zero mean), but variance can drop dramatically.
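The zero-mean argument is one line: for any fixed state s,

$$
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big]
= b(s) \int \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}\, da
= b(s)\, \nabla_\theta \!\int \pi_\theta(a \mid s)\, da
= b(s)\, \nabla_\theta 1 = 0.
$$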
The natural choice for b: the state-value function V(s) = 𝔼[Gt | st=s], i.e. how good is this state on average?
Define the advantage A(s, a) = Q(s, a) − V(s): how much better was action a than the policy's average action in state s? Positive: this action was above average; push the policy toward it. Negative: below average; push away.
The cleaned-up policy gradient:

$$
\nabla_\theta J \;=\; \mathbb{E}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\Big]
$$
Now the gradient is "how much should we change θ per unit log-prob, weighted by how surprising-good this action was."
We don't know V exactly. We train a critic Vφ(s) in parallel with the policy, by regressing toward observed returns. With a critic, the simplest advantage estimator is the 1-step TD error:

$$
\hat{A}_t \;=\; \delta_t \;=\; r_t + \gamma\, V_\phi(s_{t+1}) - V_\phi(s_t)
$$
This is low-variance (one step of randomness) but biased (uses imperfect V). The other extreme is the Monte Carlo advantage:

$$
\hat{A}_t \;=\; G_t - V_\phi(s_t)
$$
Unbiased but high variance (sums many random rewards). We want a middle ground that's tunable. That's exactly what GAE provides.
Generalized Advantage Estimation (Schulman et al., 2016) blends 1-step TD and Monte Carlo via a parameter λ ∈ [0, 1]. It's the algorithmic content of your compute_gae implementation.
$$
\delta_t \;=\; r_t + \gamma\,(1 - d_t)\, V_\phi(s_{t+1}) - V_\phi(s_t),
\qquad
\hat{A}_t \;=\; \delta_t + \gamma \lambda\,(1 - d_t)\, \hat{A}_{t+1}
$$

Computed backwards through a trajectory. (1 − dt) is the done-flag mask — same trick from Problem 1 to zero out bootstrap at terminal states.
| λ | Estimator | Bias | Variance |
|---|---|---|---|
| 0 | Ât = δt — pure 1-step TD | High (uses V everywhere) | Low |
| 1 | Ât = Gt − V(st) — Monte Carlo | Zero | High |
| 0.95 | Mostly MC with critic correction | Low | Modest |
λ=0.95 is the standard choice and what this homework uses. It says "trust the actual rewards mostly, but blend in the critic's bootstrapped estimate to dampen variance."
If you unroll the recursion, you get:

$$
\hat{A}_t \;=\; \sum_{k=0}^{\infty} (\gamma\lambda)^{k}\, \delta_{t+k}
$$
It's a geometrically-weighted sum of future TD errors. Each future TD error contributes less than the last by factor γλ. The recursion is just an efficient backwards pass to compute this sum.
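If you want to convince yourself the recursion really computes that sum, here is a toy check (single episode, no done-masking, made-up TD errors):

```python
import torch

gamma, lam = 0.99, 0.95
deltas = torch.tensor([0.3, -0.1, 0.5, 0.2])   # toy TD errors for a 4-step trajectory
T = len(deltas)

# Backwards recursion (what compute_gae does)
adv_rec = torch.zeros(T)
gae = 0.0
for t in reversed(range(T)):
    gae = deltas[t] + gamma * lam * gae
    adv_rec[t] = gae

# Direct geometric sum: A_t = sum_k (gamma * lam)^k * delta_{t+k}
adv_sum = torch.tensor([
    sum((gamma * lam) ** k * deltas[t + k].item() for k in range(T - t))
    for t in range(T)
])

assert torch.allclose(adv_rec, adv_sum)        # same numbers, O(T) instead of O(T^2)
```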
The function returns both advantages and returns:
Why? Recall A = Q − V, so returns = A + V is our estimate of Q (or G under the current policy, since the actor is ε-stochastic anyway). The critic should regress toward this — it's a low-variance estimate of "what does following the policy from here look like in terms of return."
Vanilla policy gradient is unstable. One big update can move the policy so far that the rollouts you collected no longer represent it, and learning collapses.
We need a way to limit how much the policy can change per update. PPO does this with a clipped surrogate objective.
Define the importance ratio between new and old policies for the same action:

$$
\rho_t(\theta) \;=\; \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
\;=\; \exp\big(\log \pi_\theta(a_t \mid s_t) - \log \pi_{\theta_{\text{old}}}(a_t \mid s_t)\big)
$$
If the policy hasn't changed, ρ = 1. If we doubled the probability of action at, ρ = 2. If we halved it, ρ = 0.5.
Probability densities for high-dimensional actions can be very small (1e-30 or smaller). Dividing two tiny numbers is numerically unstable. Subtracting log-probs and exponentiating is mathematically equivalent and numerically safe.
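A quick numerical illustration of why the log-space route matters (made-up log-probs, chosen small enough to underflow float32):

```python
import torch

log_p_old = torch.tensor(-110.0)   # joint log-prob under the rollout policy (toy value)
log_p_new = torch.tensor(-109.5)   # joint log-prob under the current policy (toy value)

naive = torch.exp(log_p_new) / torch.exp(log_p_old)   # both underflow to 0 in float32 -> 0/0 = nan
stable = torch.exp(log_p_new - log_p_old)             # exp(0.5) ~= 1.65

print(naive, stable)
```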
Two terms inside the min:

$$
\text{surr}_1 = \rho_t\, \hat{A}_t,
\qquad
\text{surr}_2 = \operatorname{clip}\big(\rho_t,\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t
$$
Then: take the elementwise min of these two, mean across the batch, negate (because PyTorch minimizes).
Walk through both signs of advantage:
Case Â > 0 (the action was better than average): ρ increases as we make the action more likely. surr1 = ρ · Â grows linearly. surr2 grows too but is capped at (1 + ε) · Â. Once ρ > 1+ε, surr2 is the smaller one, so min picks it. Above the cap, the gradient w.r.t. θ is zero — no further reward for moving the policy further.
Case Â < 0 (the action was worse than average): ρ decreases as we make the action less likely. With Â < 0, multiplying by smaller ρ gives a less-negative number. surr2 caps the lower side at (1 − ε) · Â. Once ρ < 1−ε, surr2 is more negative than surr1, so min picks it. Below the cap, the gradient w.r.t. θ is zero — no further punishment for moving the policy further.
The min always picks the worse outcome from the optimizer's perspective. It's a pessimistic bound. The optimizer can't game the surrogate by moving too far in either direction. That's what stabilizes PPO.
Imagine a hill that climbs steadily but flattens to a plateau at ρ = 1+ε (when Â>0). Or descends but bottoms out at ρ = 1−ε (when Â<0). Outside the flat region, no force pulls you. Inside, normal gradient. The clip is the flat region.
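You can see the plateau directly from autograd. A tiny sketch (toy numbers, positive-advantage case):

```python
import torch

eps, adv = 0.2, 1.0                                # clip range [0.8, 1.2], positive advantage
ratio = torch.tensor([0.9, 1.1, 1.5], requires_grad=True)

surr1 = ratio * adv
surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
torch.min(surr1, surr2).sum().backward()

print(ratio.grad)   # tensor([1., 1., 0.]) -- no gradient once the ratio passes 1 + eps
```

With a negative advantage the same flattening happens once the ratio drops below 1 − ε.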
The full PPO loss has more than just the clipped surrogate. Three terms get added in on_policy.py:260-263:
```python
loss = (policy_loss
        + self.value_coef * value_loss
        - self.entropy_coef * entropy
        + self.reverse_kl_coef * reverse_kl)
```
Mean squared error between the critic and the GAE returns target. Trains Vφ alongside the policy. The value_coef hyperparameter weights how much we care about critic accuracy vs policy improvement (typically 0.5).
Entropy = how spread out is the action distribution. Without this, the policy can collapse to nearly deterministic during training, killing exploration. We add entropy to the objective (i.e. subtract from the loss, since loss = -objective). The entropy_coef is small (~0.001-0.01).
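What the entropy term typically looks like in code, assuming the action distribution exposes .entropy() (true for torch.distributions.Normal; the homework's utils.TruncatedNormal may compute it differently):

```python
import torch
from torch.distributions import Normal

mu = torch.zeros(32, 4)                    # toy batch of 4-D action means
std = torch.full_like(mu, 0.3)
dist = Normal(mu, std)

entropy = dist.entropy().sum(-1).mean()    # per-dim entropies -> joint entropy -> batch mean
# Enters the combined loss with a minus sign: loss = ... - entropy_coef * entropy
```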
The reference_actor is a frozen snapshot of the policy taken right after BC pretraining. This term penalizes the RL policy for drifting too far from the BC initialization.
Why we need this: with sparse rewards, BC is the only thing telling the agent how to roughly do the task. Pure RL would forget BC during training. The KL anchor keeps it grounded. Same role as the BC-during-RL alternation in Problem 3 — same disease, different antibiotic.
DKL(πθ || πref) is "reverse" because the expectation is taken under the new policy rather than the reference. Reverse KL is mode-seeking: it heavily penalizes πθ for putting probability mass where πref has almost none, so the new policy stays inside the reference's high-probability regions instead of wandering away from them. (Forward KL, the other direction, would instead force πθ to spread out and cover everything the reference does, which is not the constraint we want here.)
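Written out, with the state s suppressed for brevity:

$$
D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)
= \mathbb{E}_{a \sim \pi_\theta}\big[\log \pi_\theta(a) - \log \pi_{\text{ref}}(a)\big]
\qquad\text{vs.}\qquad
D_{\mathrm{KL}}\!\big(\pi_{\text{ref}} \,\|\, \pi_\theta\big)
= \mathbb{E}_{a \sim \pi_{\text{ref}}}\big[\log \pi_{\text{ref}}(a) - \log \pi_\theta(a)\big].
$$

The reverse direction is an expectation under the current policy, so it can be estimated from the very actions already in the minibatch (this is why the walkthrough code later reuses ratio and new_log_prob), and it grows large when πθ puts probability where πref has almost none, which is exactly the drift we want to forbid.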
At initialization: self.reference_actor = deepcopy(self.actor), with its gradients frozen.

Note the discard step: PPO collects, updates a few times, throws data away. This is what makes it on-policy and what makes it sample-inefficient compared to off-policy methods.
Three classes: Actor, Critic, PPOAgent. Same pattern as Problem 3 but the training loop differs.
on_policy.py:13-39

```python
class Actor(nn.Module):
    def __init__(self, obs_shape, action_shape, hidden_dim, log_std_init=-2.0):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(obs_shape[0], hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, action_shape[0]))
        self.log_std = nn.Linear(hidden_dim, action_shape[0])

    def forward(self, obs):
        features = self.policy[:-1](obs)            # run all but the final layer
        mu = torch.tanh(self.policy[-1](features))  # action mean in [-1, 1]
        std = torch.exp(self.log_std(features)).clamp(1e-5, 1.0)
        return utils.TruncatedNormal(mu, std)
```
Two heads sharing the same backbone: a mu head (action mean) and a log_std head (action variability). Crucially, std is learned here (unlike Problem 3 where std=0.1 is fixed). The slice self.policy[:-1] runs all layers except the final one, so we get the shared 256-D feature vector that both heads branch from.
on_policy.py:42-58

```python
class Critic(nn.Module):
    def __init__(self, obs_shape, hidden_dim):
        super().__init__()
        self.value_net = nn.Sequential(
            nn.Linear(obs_shape[0], hidden_dim), nn.LayerNorm(hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1))  # scalar V(s)

    def forward(self, obs):
        return self.value_net(obs)     # shape [batch, 1]
```
V(s) not Q(s, a) — the critic only takes obs as input. PPO doesn't need Q because the policy gradient theorem works with V as a baseline.
This is where everything happens. Roughly: collect a rollout with the current policy, compute values and GAE advantages under torch.no_grad(), then run several epochs over that same data, each epoch shuffling the rollout into minibatches, recomputing log-probs, building the combined clipped loss, accumulating gradients, and stepping the optimizer once; afterwards the rollout is discarded.
One subtle point: gradients accumulate across minibatches inside an epoch, with each minibatch loss scaled by (batch_size / total_size). The optimizer steps once per epoch. This is equivalent to averaging the loss across the entire epoch, but avoids OOM from holding the whole rollout in one forward pass.
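A minimal sketch of that accumulate-then-step pattern, with stand-in data and a toy regression loss (names and shapes are illustrative, not the homework's):

```python
import torch

net = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(net.parameters(), lr=3e-4)
data = torch.randn(1024, 10)        # pretend this is the whole rollout (T = 1024)
target = torch.randn(1024, 1)
T, batch_size = len(data), 256

optimizer.zero_grad()
for idx in torch.randperm(T).split(batch_size):
    pred = net(data[idx])
    minibatch_loss = ((pred - target[idx]) ** 2).mean()
    # Scaling by batch_size / T makes the accumulated gradient equal the
    # gradient of one big mean over all T samples.
    (minibatch_loss * (len(idx) / T)).backward()
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1)
optimizer.step()
```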
Now the centerpiece: every line you wrote, explained at the level of "what is this PyTorch operation actually doing." This chapter assumes you've read the PyTorch primer in Chapter 02.
What you wrote (3 lines, inside the backwards loop):
```python
for t in reversed(range(T)):
    delta = rewards[t] + discounts[t] * next_values[t] - values[t]
    gae = delta + discounts[t] * self.gae_lambda * gae
    advantages[t] = gae
```
`delta = rewards[t] + discounts[t] * next_values[t] - values[t]`

This is the 1-step TD error at time t. Read piece by piece:
• rewards[t] — the scalar reward at step t. Tensor shape [] (a single number) since we indexed into a [T]-shaped tensor.
• discounts[t] — pre-multiplied γ * (1 − done[t]). If the episode ended at t, this is 0; otherwise it's γ. Same shape as rewards.
• next_values[t] — the critic's prediction Vφ(s't). Computed earlier when you called self.critic(next_obs).squeeze(-1).
• values[t] — Vφ(st). Same critic, current state.
Math meaning: r + γ V(s') − V(s) = "how surprised was the critic?" Positive: outcome was better than predicted. Negative: worse.
`gae = delta + discounts[t] * self.gae_lambda * gae`

The recursion step. Important: the gae on the right-hand side is the value from the previous iteration (which corresponds to time t+1, since we're going backwards). After this assignment, gae holds Ât.
• delta — the immediate TD error.
• discounts[t] * self.gae_lambda * gae — "decayed contribution from the future." With γ=0.99 and λ=0.95, this is 0.94 of the future value, vanishing geometrically.
• (1 − dt) is hidden inside discounts[t] — if t is the last step of an episode, the recursion correctly resets (no contribution from the next episode's value).
Why a running variable: computing each Ât directly as its geometrically-weighted sum over all future TD errors would be O(T²) work per rollout. Carrying the running gae value backwards accumulates the same geometric series in a single O(T) pass.
`advantages[t] = gae`

Store the just-computed Ât into slot t of the output tensor. Plain tensor indexing, like writing to a NumPy array slot.
You initialized advantages = torch.zeros_like(rewards) outside the loop, so the tensor's shape is [T]. Each loop iteration writes one slot. After the loop, all T slots are filled, going from index T-1 down to 0.
```python
returns = advantages + values
return advantages, returns
```
For each t: returns[t] = Ât + Vφ(st). This is your low-variance estimate of the discounted return from t onward, used as the regression target for the critic later.
What you wrote (3 lines, inside with torch.no_grad(): in the update method):
```python
values_all = self.critic(obs_all).squeeze(-1)
next_values_all = self.critic(next_obs_all).squeeze(-1)
advantages_all, returns_all = self.compute_gae(rew_all, values_all, next_values_all,
                                               disc_all, done_all)
```
`values_all = self.critic(obs_all).squeeze(-1)`

Three things happen here.
1. self.critic(obs_all) — calls the critic's forward(obs_all) method. obs_all has shape [T, D] (T timesteps, D = obs_dim, ~39). The critic returns shape [T, 1] — one scalar per state, but wrapped in a trailing size-1 dim because the final Linear layer outputs nn.Linear(hidden, 1).
2. .squeeze(-1) — drops the trailing size-1 dim, giving shape [T]. Why? Because compute_gae expects values[t] to be a scalar. If we left the shape as [T, 1], indexing values[t] would give a [1] tensor, and arithmetic with the []-shaped rewards[t] would broadcast to [1]. Probably correct, but cleaner to make shapes match upfront.
3. The whole thing happens inside with torch.no_grad(): — PyTorch doesn't track operations for gradient. We're computing targets, not training anything.
`next_values_all = self.critic(next_obs_all).squeeze(-1)`

Same critic, different input. next_obs_all is shape [T, D], the "next state after each transition." This is what compute_gae uses as next_values[t] = V(s't).
Two forward passes through the same network are slightly wasteful — you could concatenate and do one. But the readability is worth it.
`advantages_all, returns_all = self.compute_gae(...)`

Calls your method from Change 1 with all five required arguments. Returns a tuple, which Python unpacks into two named tensors. Both have shape [T].
The variable names matter — the surrounding code references advantages_all (for normalization) and returns_all (as the value-loss target). Using different names would silently break the code.
```python
adv_std = advantages_all.std(unbiased=False)
advantages_all_old = advantages_all
advantages_all = (advantages_all - advantages_all.mean()) / (adv_std + 1e-8)
```
This was already there. It normalizes the advantages to mean 0 and std 1. Why? Because the policy gradient is sensitive to the scale of advantages — if all advantages are tiny, gradient updates are tiny; if huge, updates are huge and unstable. Normalizing fixes the scale every iteration. The + 1e-8 avoids division by zero. The _old save is kept just for logging.
What you wrote (4 lines, inside the minibatch loop):
```python
ratio = torch.exp(new_log_prob - olp_ep)
surrogate1 = ratio * adv_ep
surrogate2 = torch.clamp(ratio, 1.0 - self.clip_eps, 1.0 + self.clip_eps) * adv_ep
policy_loss = -torch.min(surrogate1, surrogate2).mean()
```
`ratio = torch.exp(new_log_prob - olp_ep)`

This computes ρt = πθ(a|s) / πold(a|s) for each transition in the minibatch.
Why the subtraction-then-exp: log(a/b) = log(a) − log(b), so a/b = exp(log(a) − log(b)). We work in log space throughout because individual log-probabilities can be very small (or large negative) numbers; subtraction keeps them in a sane range, exponentiation brings us back to ratio space.
• new_log_prob — computed two lines earlier as dist.log_prob(act_ep).sum(-1). Shape [batch_size]. The .sum(-1) sums log-probs across the 4 action dimensions (joint log-prob = sum of independent per-dim log-probs).
• olp_ep — "old log prob, for this minibatch slice." It's old_lp_all[batch_idx], the log-prob recorded at rollout time when the actor was a different snapshot. Shape [batch_size].
What's in this tensor: each element is the ratio for one transition. If that ratio is 1.0, the policy hasn't changed for that action. If 1.5, the new policy is 1.5× more likely to pick that action than the rollout-time policy.
`surrogate1 = ratio * adv_ep`

Elementwise multiply: surrogate1[i] = ratio[i] * adv_ep[i].
This is the unclipped surrogate — what vanilla policy gradient would use. With adv_ep > 0 and increasing ratio, this grows linearly — reward for moving the policy toward this action. With adv_ep < 0 and decreasing ratio, this also grows (less negative) — reward for moving away.
• adv_ep — advantages for this minibatch, normalized. Shape [batch_size].
`surrogate2 = torch.clamp(ratio, 1.0 - self.clip_eps, 1.0 + self.clip_eps) * adv_ep`

torch.clamp(x, lo, hi) takes any tensor x and returns elementwise min(max(x, lo), hi). So if self.clip_eps = 0.2, this returns ratios capped to [0.8, 1.2].
For ratios already inside [0.8, 1.2], surrogate1 == surrogate2. For ratios outside that range, surrogate2 is the clipped version multiplied by the advantage.
`policy_loss = -torch.min(surrogate1, surrogate2).mean()`

Three operations stacked.
1. torch.min(surrogate1, surrogate2) — elementwise min of two same-shaped tensors. Shape [batch_size]. This is what creates the asymmetric pessimistic cap — the optimizer sees the worse of the two surrogates at each transition, not the better one.
2. .mean() — reduces shape [batch_size] to [] (a scalar). Gives the average across the minibatch. We need a scalar to call .backward() on.
3. Negate. The PPO objective is to maximize the min-clipped surrogate. PyTorch optimizers minimize losses. So we negate.
The variable name policy_loss matters. The surrounding code (combined-loss line at on_policy.py:260) references it by name. Naming it anything else would cause a NameError later in the same function.
```python
with torch.no_grad():
    ref_dist = self.reference_actor(obs_ep)
    ref_log_prob = ref_dist.log_prob(act_ep).sum(-1)
reverse_kl = (ratio * (new_log_prob - ref_log_prob)).mean()
```
This uses your ratio variable. The reverse-KL term penalizes drift from the BC reference. Computing it requires (a) log-prob under the frozen reference (no grad needed), (b) the ratio you just computed, (c) new_log_prob from the current actor.
```python
values_pred = self.critic(obs_ep).squeeze(-1)
value_loss = F.mse_loss(values_pred, ret_ep)

loss = (policy_loss
        + self.value_coef * value_loss
        - self.entropy_coef * entropy
        + self.reverse_kl_coef * reverse_kl)

(loss * (batch_size / T)).backward()
```
The combined loss: your policy_loss, plus the value-loss MSE, minus the entropy bonus (because we're adding entropy to the objective and the loss is the negation), plus the reverse-KL drift penalty. The (batch_size / T) scaling means that summing across all minibatches in an epoch gives the same gradient magnitude as if we'd computed one big loss across the whole rollout. Then .backward() walks backward through the entire computation graph and fills in gradients for every parameter in the actor and the critic.
One step per epoch (after all minibatches' gradients have accumulated):
```python
nn.utils.clip_grad_norm_(params, max_norm=1)
self.opt.step()
```
clip_grad_norm_ rescales gradients if the total norm exceeds 1.0 — protection against exploding gradients. .step() applies the Adam update.
| Variable | Shape | Meaning |
|---|---|---|
| obs_all | [T, D] | States across all rollout steps |
| act_all | [T, A] | Actions taken (A = 4 dims) |
| rew_all | [T] | Per-step rewards |
| disc_all | [T] | γ·(1−done), pre-multiplied |
| old_lp_all | [T] | log πold(a\|s) at rollout time |
| values_all | [T] | Vφ(st) for all t |
| next_values_all | [T] | Vφ(s't) for all t |
| advantages_all | [T] | Ât from GAE |
| returns_all | [T] | Ât + Vφ(st), critic target |
| ratio | [batch] | Importance ratio for the minibatch |
| policy_loss | [] | Scalar PPO clip loss for the minibatch |
Self-check questions:

1. Why does the policy gradient only involve the actions you actually sampled?
2. Why can you subtract a baseline from the return without biasing the gradient?
3. What does λ control in GAE?
4. Why is GAE computed backwards through the trajectory?
5. Why compute the ratio as exp(new_log_prob − old_log_prob) instead of dividing probabilities?
6. What happens to the gradient once ratio > 1+ε and Â > 0?
7. Why is the loss the negative of the objective?
8. Why is the GAE computation wrapped in with torch.no_grad():?
9. Why do old_log_prob and new_log_prob differ for the same transition in the same minibatch?
10. Why is entropy subtracted from the loss?
11. Why the reverse-KL penalty against the reference actor?
12. Why does this problem learn the policy std while Problem 3 fixes it?

Answers:

1. Because we sampled those actions ourselves — the policy gradient comes from "differentiate log p(observed sample) under your distribution." There's nothing to differentiate at unobserved actions.
2. It reduces variance of the gradient estimator without biasing it — the cross-term in expectation is zero (proven via the log-derivative trick), but the variance of the estimator drops.
3. λ controls the bias-variance tradeoff between 1-step TD (λ=0, biased, low variance, "TD") and Monte Carlo (λ=1, unbiased, high variance, "MC").
4. The recursion Ât = δt + γλ Ât+1 needs the t+1 value to compute the t value. Iterating backwards gives O(T) computation.
5. Numerical stability. Direct division of two tiny probabilities is unstable; subtraction of log-probs and exponentiation is stable.
6. The min picks surrogate2 = (1+ε) · Â, which is constant in θ once ρ passes the cap. Gradient w.r.t. θ goes to zero.
7. PyTorch optimizers minimize. The PPO objective is to maximize. So loss = -objective.
8. Because GAE returns are targets for the critic. If they tracked gradients, the critic would be trained against itself, and gradient descent would chase its own tail.
9. old_log_prob was recorded by the actor that collected the rollout (a fixed snapshot). new_log_prob is computed by the actor right now in the inner loop, which has been updated by previous minibatches in this epoch. Their ratio measures policy drift since rollout.
10. Because we're adding entropy to the objective (encouraging stochasticity = exploration) and the loss is the negation of the objective.
11. To prevent the RL policy from drifting too far from the BC initialization. Sparse rewards mean BC is the only thing telling the agent how to roughly do the task; without the KL anchor, RL would forget BC and learning would collapse.
12. On-policy PPO benefits from learned-std for adaptive exploration as the policy improves. Off-policy methods (P3) get exploration from the diversity of the replay buffer, so a fixed std works fine.
```bash
# Pre-flight (one-time)
modal secret create wandb-secret WANDB_API_KEY=<your_key>

# Launch training
cd /Users/ozyphus/Documents/GitHub/cs224r/homework/hw2/hw2
modal run --detach modal_on_policy.py
```
~45 minutes to 1.5 hours on Modal A10. The wandb URL prints at startup. Watch eval/episode_success, policy_loss, and actor_std — if std collapses to 0 in the first 50k steps, the entropy_coef is too low.
Three big ideas, in order of importance: (1) the policy gradient itself, which pushes up the log-probability of actions that led to better-than-expected outcomes; (2) GAE, a critic-blended advantage estimate that tames the variance of that push; (3) the clip, a pessimistic cap on the importance ratio that keeps each update close enough to the rollout policy for several epochs of data reuse to stay valid.
If a friend asks: "Why does PPO use min instead of average?" — you say: "Because the optimizer is adversarial. If we used average, the optimizer would happily blow past the clip in directions where the unclipped term is favorable, defeating the whole point. The min is a pessimistic bound: the optimizer always sees the worse of the two surrogates, so it can't game the clip."
You can teach this. Go finish the training.