Stanford CS 224R · Homework 2 · Off-Policy Actor-Critic

Problem 3 from Absolute Zero

A robot arm. A hammer. A nail. A reward of 1.0 if you finish, 0.0 every other moment. From those rules, a self-improving system. Every concept, every line of code, every gotcha — explained.

No prior RL assumed · Math derived · Code annotated · Three implementation tasks

Chapter 01

The Setup & Why It's Hard

You have a 4-DOF Sawyer robot arm in simulation. It can pick things up, push, swing. Your task: train it to pick up a hammer and use that hammer to drive a nail. Same task as Problem 2, completely different algorithm.

The observation is a vector of about 39 numbers — positions of robot/hammer/nail at the current step and the previous step. The action is a 4-D vector in [-1, 1] — joint displacements.

The reward function is sparse:

r(s, a) = 0   for every step
r(s, a) = 1   the moment the nail is fully driven, then the episode ends

That sparseness is what makes this hard. Compare it to a "shaped" reward like −|distance to nail| − |hammer angle error|, which would tell the agent "you're getting warmer" continuously. With the sparse reward, the agent gets zero feedback for thousands of timesteps, then a single +1 when it finally succeeds. Most rollouts contain zero learning signal.

Why this is the right choice for the homework

Real-world robotics rewards are sparse. You can't easily hand-code a smooth reward for "drive the nail" — what does "halfway driven" look like? Sparse rewards force you to use exploration + demonstrations + sample-efficient algorithms. Solving sparse-reward tasks is the frontier.

So the homework gives you two things to fight the sparseness: expert demonstrations you can imitate (Chapter 07's behavior cloning) and an off-policy algorithm that can replay every transition it ever collects (Chapter 03).

The deliverable: a wandb plot of eval/episode_success rising from 0 to at least 90% within 100,000 environment steps. For comparison, PPO in Problem 2 needs about 1,000,000 steps to reach a lower success rate. That's a 10× sample-efficiency gap, and understanding where it comes from is the entire lesson.

The mental picture

Think of a chef tasting a dish once and writing down the rating. On-policy: taste, write rating, throw the recipe away, cook a new dish. Off-policy: taste, write rating, store everything in a giant cookbook. Tomorrow, re-read every entry hundreds of times and notice patterns. Same data, way more learning.

Chapter 02

Q-Functions — The Foundation

In Problem 1 you implemented tabular Q-learning. Same idea here, except instead of a 20×4 table, the Q-function is a neural network.

Definition
Q-value — Q(s, a)

The expected total discounted reward you would get if you start in state s, take action a, and then act according to your policy π forever after.

Q-value, formally: Qπ(s, a) = 𝔼[ r₀ + γ r₁ + γ² r₂ + … | s₀ = s, a₀ = a, then follow π ]

The discount factor γ ∈ [0, 1) — future rewards are worth less than immediate ones. Typical: 0.99. A reward 100 steps away is worth 0.99¹⁰⁰ ≈ 0.37 of a reward right now.

Why Q, not V?

You also could imagine a function that just takes state — the state-value V(s):

Vπ(s) = 𝔼[ Qπ(s, a) | a ~ π(·|s) ] — the average over the actions you'd take

PPO (Problem 2) used V. Why does this homework use Q?

Because actions matter. When we update the actor, we want to push the policy toward actions that have high value. With Q(s, a), we can directly ask "is this specific action good?" With V(s), we'd only know whether the state is good on average, and we'd have to use the policy gradient theorem (with all its variance) to extract action-level signal. Q-learning skips the middleman.

The Bellman equation, again

You saw it in P1. Same recursion, same intuition:

Bellman expectation Qπ(s, a) = 𝔼[ r(s, a) + γ Qπ(s', a') | s' ~ env, a' ~ π(·|s') ]

The value of (s, a) decomposes into this step's reward plus the discounted value of where we land next. This recursion is the entire engine of TD learning.

The TD target for a single sample (s, a, r, s') is:

y = r + γ Qπ(s', a') where a' ~ π(·|s')

And the TD error is the difference between this new estimate and our old prediction:

δ = y − Q(s, a) = r + γ Q(s', a') − Q(s, a)

If δ is positive, the action turned out better than we predicted — push Q(s, a) up. If negative, push down. Same as the gridworld update; just with a neural network instead of a table.

From table to network — what changes

In the table case: Q[s, a] += α δ. Direct write. In the network case, we don't have direct slots — we have weights φ. So instead we compute the MSE loss (Qφ(s, a) − y)² and let backprop pull Qφ(s, a) toward y. The TD update becomes a regression problem.
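To make the shift concrete, here is a minimal side-by-side sketch — the dimensions and names are made up for illustration (a 20×4 table as in P1; a stand-in linear critic for the network case):

import torch
import torch.nn.functional as F

# Tabular (P1): write the TD error straight into the table slot.
Q = torch.zeros(20, 4)                         # 20 states x 4 actions
alpha, gamma = 0.1, 0.99
s, a, r, s_next, a_next = 3, 1, 0.0, 4, 2
delta = r + gamma * Q[s_next, a_next] - Q[s, a]
Q[s, a] += alpha * delta                       # direct write

# Network (P3): the same update becomes MSE regression toward the target.
q_net = torch.nn.Linear(39 + 4, 1)             # stand-in critic: (obs, action) -> Q
obs = torch.randn(256, 39)
act = torch.randn(256, 4)
with torch.no_grad():
    y = torch.randn(256, 1)                    # placeholder for r + gamma * Q(s', a')
loss = F.mse_loss(q_net(torch.cat([obs, act], dim=-1)), y)
loss.backward()                                # backprop nudges Q(s, a) toward y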

Terminal handling

If s' is terminal (the episode ended), there is no future:

y = r                       if done
y = r + γ Q(s', a')         otherwise

The starter code provides discount as a pre-multiplied factor: discount = γ · (1 − done). So:

y = r + discount · Q(s', a')

...handles both cases in one expression. When done is 1, discount is 0, the bootstrap term vanishes. Same trick you used in PPO's GAE.
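A quick numeric sanity check of that one-expression form (made-up values):

import torch

gamma = 0.99
reward = torch.tensor([0.0, 1.0])       # second transition ends the episode
done = torch.tensor([0.0, 1.0])
next_q = torch.tensor([0.5, 0.7])       # critic's bootstrap estimates

discount = gamma * (1 - done)           # the pre-multiplied factor
y = reward + discount * next_q
# y[0] = 0 + 0.99 * 0.5 = 0.495   (normal bootstrap)
# y[1] = 1 + 0.00 * 0.7 = 1.0     (terminal: bootstrap vanishes)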

Chapter 03

On-Policy vs Off-Policy

This is the most important conceptual distinction in modern RL.

| Property | On-policy (PPO, Problem 2) | Off-policy (this problem, SAC, DQN) |
|---|---|---|
| Data source | Current policy only | Any past policy |
| Storage | Thrown away after a few epochs | Replay buffer, kept forever |
| Sample efficiency | Low | High (10× or more) |
| Stability | Naturally stable | Fragile — needs target nets, double-Q |
| Algorithm class | Policy gradient | Q-learning / actor-critic |

Why must on-policy be on-policy?

The policy gradient estimator is:

∇θ J(θ) = 𝔼(s,a) ~ πθ [ ∇θ log πθ(a|s) · A(s, a) ]

The expectation is over (s, a) drawn from the current policy. If you use stale data, the expectation is wrong — you're computing a gradient for the wrong distribution. PPO patches this with importance sampling for a few epochs, but still needs fresh data per update.

Why CAN off-policy be off-policy?

Look at the Bellman equation again:

Q(s, a) = r(s, a) + γ 𝔼[ Q(s', a') ]

This equation is a property of the environment (the reward and transition function), not of any particular policy. As long as you have a tuple (s, a, r, s'), you can use it to enforce the Bellman constraint — regardless of which policy generated the tuple.

That's why off-policy works: the critic learns the environment's value structure, and the policy generating the data doesn't have to match. You can mix transitions from a random initial policy, an expert demonstration, and your current actor, all in the same buffer.

The replay buffer is the magic

You collect 1 transition per environment step but do many gradient updates per step. The "update-to-data" ratio (UTD) is how aggressively you exploit your buffer. UTD=1 is conservative; UTD=5 in this homework's ablation is more aggressive. Off-policy lets you do this; on-policy fundamentally cannot.

What's stored in the buffer

Each entry: (s, a, r, s', done). That's it. No policy, no log-probs (unlike PPO). You can sample any minibatch of these uniformly at random, and the Bellman update applies.
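Conceptually the buffer is just a ring of parallel arrays. A minimal sketch (a hypothetical stand-in, not the homework's actual buffer class):

import numpy as np

class ReplayBuffer:
    """Fixed-capacity ring buffer of (s, a, r, s', done) transitions."""

    def __init__(self, capacity, obs_dim, act_dim):
        self.obs      = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.action   = np.zeros((capacity, act_dim), dtype=np.float32)
        self.reward   = np.zeros(capacity, dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.done     = np.zeros(capacity, dtype=np.float32)
        self.capacity, self.idx, self.size = capacity, 0, 0

    def add(self, s, a, r, s_next, done):
        i = self.idx
        self.obs[i], self.action[i], self.reward[i] = s, a, r
        self.next_obs[i], self.done[i] = s_next, done
        self.idx = (self.idx + 1) % self.capacity             # overwrite oldest
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        j = np.random.randint(0, self.size, size=batch_size)  # uniform
        return (self.obs[j], self.action[j], self.reward[j],
                self.next_obs[j], self.done[j])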

A typical training step

batch ← uniform_sample(D, 256)                  # 256 random transitions
y ← r + γ(1−done) · Qtarget(s', a')             # TD target, gradients off
loss ← mean( (Qonline(s, a) − y)² )
φ ← φ − lr · ∇φ loss                            # critic update
Chapter 04

The Deadly Triad

The off-policy story above sounds clean. It isn't. Three properties together cause Q-learning with neural networks to diverge. The triad is famously called "deadly":

  1. Function approximation — we use a neural net Qφ(s, a) instead of a lookup table. Updating one entry leaks into many.
  2. Bootstrapping — the target r + γ Q(s', a') uses the same network we're optimizing. As we update φ, the target moves. We're regressing toward a moving label.
  3. Off-policy data — the (s, a) distribution we sample doesn't match the policy we're evaluating, so updates can amplify errors.

Any one of these is fine. Any two is usually fine. All three together — instability, exploding Q-values, training collapse.

Concrete failure mode 1: Moving target

Imagine vanilla Q-learning: we use the same critic for the target and the online prediction.

loss = ( Qφ(s, a) − [ r + γ Qφ(s', a') ] )²

When we take a gradient step, both Qφ(s, a) and Qφ(s', a') change. We tried to pull the prediction toward the target, but we also moved the target. Repeat thousands of times, gradient descent never converges.

Common-sense analogy

Imagine trying to grab a flag that runs away whenever you reach for it — and runs at the same speed you do. The fix: tie the flag's position to a different copy of you that updates slowly. That's the target network.

Concrete failure mode 2: Maximization bias

Vanilla Q-learning's target uses max_{a'} Q(s', a'). Suppose Q has noise — some actions' Q-values are slightly overestimated, others underestimated, on average correct.

The max operator systematically picks overestimated entries. So your target is biased upward. You train Q toward an upward-biased target. Q grows. Next iteration, even more bias. Q-values explode.

Why max-of-noisy-estimates is biased

True Q-values:    [1.0, 1.0, 1.0]    all equal
Noisy estimates:  [1.1, 0.9, 1.0]    noise added
max(estimates) = 1.1                  biased UPWARD by 0.1
true max       = 1.0

This is called maximization bias. It's the reason vanilla DQN often diverges.
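You can verify this bias in a few lines of standalone simulation (not homework code); it also previews why the min-of-two fix in Chapter 06 errs on the safe side:

import numpy as np

rng = np.random.default_rng(0)
true_q = np.array([1.0, 1.0, 1.0])              # all three actions equally good
noise = rng.normal(0.0, 0.1, size=(100_000, 3))
estimates = true_q + noise                      # unbiased per-entry estimates

print(estimates.mean(axis=0))                   # ≈ [1.0, 1.0, 1.0]: each entry unbiased
print(estimates.max(axis=1).mean())             # ≈ 1.08: max is biased upward
print(np.minimum(estimates[:, 0],
                 estimates[:, 1]).mean())       # ≈ 0.94: min of two is biased DOWN (safe side)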

The fixes (preview)

| Failure | Fix | Where in homework |
|---|---|---|
| Moving target | Target network — slowly-updated copy of the critic | self.critic_target + soft_update_params |
| Maximization bias | Clipped double-Q — min over two critics | min(Q̄i, Q̄j) in target |
| Variance in target | Ensemble of N critics, sample 2 randomly | num_critics hyperparameter |

Each fix surfaces in your update_critic implementation. Read this chapter again after you've written it — you'll see why each line is there.

Chapter 05

The Actor-Critic Fix

Q-learning works in discrete action spaces because you can compute argmax_a Q(s, a) by enumerating actions (e.g., 18 Atari buttons). With continuous actions in [-1, 1]⁴, you can't enumerate. There's no argmax.

Two options:

  1. Random sampling / cross-entropy method — sample 64 random actions, evaluate Q for each, pick the best. Slow, low resolution.
  2. Train an actor network — a neural net that learns to output the argmax. We optimize the actor to find the maximizer for us.

We use option 2. The actor πθ(a|s) is trained so that its sampled action a maximizes Qφ(s, a). That's the entire actor objective:

Actor loss Lactor(θ) = − 𝔼s ~ buffer[ Qφ(s, πθ(s)) ]

Read it: "sample a state s, plug the policy's chosen action into the critic, that scalar is what the actor wants to maximize." The minus sign converts maximize into minimize for PyTorch.

The reparameterization trick

For gradient to flow from Qφ back into θ, the action a must be a differentiable function of θ. With a stochastic policy this seems impossible — you can't backprop through "sample from a distribution." Solution: reparameterize.

Reparameterization for a Gaussian policy

Sample: a = μθ(s) + σ · ε,   ε ~ 𝒩(0, 1)

Now a is a deterministic function of θ, with ε as an independent noise input.

∇θ Q(s, a) = ∇θ Q(s, μθ(s) + σε) = (∂Q/∂a) · (∂a/∂θ)     ← chain rule

The actor in this homework outputs a TruncatedNormal(μ, 0.1). Std is fixed at 0.1, not learned (unlike SAC's adaptive entropy). The 0.1 just adds exploration noise. Calling dist.sample() returns tanh(μθ(s)) + 0.1 · ε, which is differentiable in θ while ε is sampled fresh.
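Here is the trick stripped of everything else — a plain Normal with rsample standing in for the homework's TruncatedNormal:

import torch

mu_net = torch.nn.Linear(39, 4)               # stand-in for the actor's mean head
obs = torch.randn(8, 39)

mu = torch.tanh(mu_net(obs))                  # squashed mean, as in the Actor class
dist = torch.distributions.Normal(mu, 0.1)

a = dist.rsample()                            # reparameterized: mu + 0.1 * eps
fake_q = a.sum()                              # stand-in for Q(s, a)
fake_q.backward()
print(mu_net.weight.grad is not None)         # True: gradients reached theta

# dist.sample() (no "r") cuts the graph at the sampling step, so no
# gradient would reach mu_net. The homework's TruncatedNormal.sample()
# is reparameterized internally, which is why plain sample() works there.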

Two networks, two roles

Actor πθ(a|s): maps state to action distribution. Trained to output high-Q actions. Used to act in the environment.

Critic Qφ(s, a): estimates expected return for (state, action). Trained via TD regression. Used as a training signal for the actor.

The ping-pong

You alternate two updates:

  1. Critic update: pull Qφ(s, a) toward TD target r + γ Qtarget(s', a').
  2. Actor update: push πθ toward actions where Qφ is high.

If you only had the critic, you'd know which actions are good but couldn't act on them. If you only had the actor, you'd have nothing to optimize against. Together, they bootstrap each other up. This is the entire idea of actor-critic.

Chapter 06

Three Stabilization Tricks

Each fixes a specific failure mode from Chapter 04. Each surfaces directly in your update_critic.

Trick 1: Target network with soft (Polyak) update

Maintain a second copy of the critic, φ̄, used only for computing TD targets. Update its parameters slowly:

Polyak / soft update φ̄ ← (1 − τ) φ̄ + τ φ, τ small, e.g. 0.005

Each step, the target moves 0.5% toward the online critic, so its memory of old parameters decays with a time constant of 1/τ ≈ 200 steps. Critically, on the timescale of any single gradient update, the target looks frozen. The regression target is stable.

In the codebase: utils.soft_update_params(net, target_net, tau) does this. You'll call it once per critic update.
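The helper itself is tiny. A sketch of what it plausibly does — the real implementation lives in utils.py, so treat this as illustrative:

import torch

def soft_update_params(net, target_net, tau):
    """Polyak update: target <- (1 - tau) * target + tau * online."""
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)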

Common bug: forgetting to soft-update

If you forget the soft update entirely, the target stays at its random initialization forever. TD targets then never reflect anything the online critic has learned, so value information can't propagate back from the reward — learning crawls. The critic loss will look fine but the actor never improves.

Trick 2: Ensemble of N critics

Maintain N independent critic networks Qφ1, ..., QφN. Each is initialized differently (different orthogonal weights from weight_init) and sees different minibatch orderings. They disagree on out-of-distribution inputs. That disagreement reduces variance when we combine them.

Default in this homework: N=2. Ablation: N=10. With N=10 + UTD=5 you have something close to REDQ ("Randomized Ensembled Double Q-learning"), a state-of-the-art recipe for sample efficiency.

Trick 3: Random pair + min for the target

When computing the TD target, do not use all N critics. Pick 2 randomly, take the elementwise min:

Clipped double-Q with random pair

i, j ~ random.sample(1..N, 2)                       # two distinct indices
y = r + γ(1−done) · min( Q̄i(s', a'), Q̄j(s', a') )

The min counteracts maximization bias from Chapter 04. The random sampling (vs always using critics 1 and 2) is the REDQ improvement — it forces every critic to be reliable, not just the first two. Otherwise critics 3..N could drift since they're never used in targets.

Critical: train ALL N critics, only use 2 for the target

Common bug: students sometimes only update the 2 sampled critics, leaving the others frozen. Wrong. Loss should be computed over all N critic predictions:

critic_loss = sum( F.mse_loss(q, target) for q in q_list )

The target uses 2 samples; the loss uses all N predictions.

Putting them together

The TD target with all three tricks:

Final target formula

with torch.no_grad():
    a' = actor(s').sample(clip=stddev_clip)       # next action from current policy
    target_q_list = critic_target(s', a')         # list of N tensors
    Q̄i, Q̄j = random.sample(target_q_list, 2)      # 2 distinct critics
    y = r + discount · min(Q̄i, Q̄j)                # scalar target

Memorize the shape of this. Three tricks, four lines, all the off-policy stability machinery in modern RL.

Chapter 07

Behavior Cloning

The reward is sparse. Random exploration almost never sees a +1. The agent would never learn from scratch in 100k steps.

So we cheat: start with a warm-started policy by imitating expert demonstrations. This is supervised learning, not RL.

BC loss Lbc(θ) = − 𝔼(s, a) ~ demos[ log πθ(a | s) ]

Maximize the log-probability that the policy assigns to expert actions. Pure maximum likelihood. After ~5000 BC gradient steps, the policy is "kind-of right" — it produces actions that look expert-shaped. Then RL takes over and refines them based on actual reward.

Why BC also runs during RL training

The homework alternates RL gradient steps with BC gradient steps throughout training. Why?

Imagine you've BC-pretrained a decent policy. Now you start RL. The actor's gradient step says "move toward whatever the critic likes." But the critic is itself a randomly-initialized neural net at first — its high-Q regions are essentially random. The actor would walk away from the BC initialization toward the critic's random preferences and forget the demos.

Solution: keep mixing in BC steps during RL. The BC term anchors the actor near the expert distribution. As the critic improves, RL pulls more strongly toward Q-maximization. Net effect: stable improvement over BC, no catastrophic forgetting.

Connecting back to PPO

Problem 2's PPO uses a "reverse KL to a frozen reference policy" for the same reason — preventing drift from the BC warm-start. Both methods solve "don't forget the demos" but with different mechanics: PPO adds a KL penalty in the loss; this homework alternates a BC update. Same disease, different antibiotics.

The BC method signature

Both pretraining and the RL-mixin call the same method:

def bc(self, replay_iter):
    # replay_iter yields (obs, action, reward, discount, next_obs)
    # For BC we only need (obs, action). The rest are unused.
    batch = next(replay_iter)
    obs, action, _, _, _ = utils.to_torch(batch, self.device)

    # YOUR CODE: loss = -E[log pi(a|s)], step actor optimizer

The trick: this same method is called whether the buffer is the demos or the live replay buffer. The training script picks which buffer to feed it. Don't put any "first time only" logic in bc.

Chapter 08

The Full Algorithm

Putting everything together. Read this twice.

Off-Policy Actor-Critic with BC Pretraining + Ensemble Critics
  1. Initialize actor πθ, N critics Qφ1..N, target critics Q̄φ̄1..N ← copies of online, replay buffer D, demonstration buffer Ddemo.
  2. BC pretrain (no RL yet):
    for k = 1 to Nbc:
      sample minibatch (s, a) from Ddemo
      Lbc = − mean[ log πθ(a|s) ]
      θ ← θ − lr · ∇θ Lbc
  3. RL training loop: for step = 1 to Nenv:
    a) Collect: at ~ πθ(·|st), step env, store (st, at, rt, st+1, donet) in D.
    b) Critic updates, UTD times:
    • sample minibatch B from D
    • with no_grad: a' ~ πθ(·|s'), pick 2 random target critics, y = r + γ(1-done) min(Q̄i(s',a'), Q̄j(s',a'))
    • Lcritic = Σk=1..N MSE(Qφk(s, a), y)
    • φ ← φ − lr · ∇φ Lcritic
    • φ̄ ← (1−τ) φ̄ + τ φ
    c) Actor update (one per env step):
    • sample minibatch B from D, pull only s
    • anew ~ πθ(·|s) (reparameterized)
    • Lactor = − (1/N) Σk Qφk(s, anew)
    • θ ← θ − lr · ∇θ Lactor
    d) BC update (every step, on demos):
    • sample (s, a) from Ddemo
    • Lbc = − mean[ log πθ(a|s) ]
    • θ ← θ − lr · ∇θ Lbc
    e) Periodically: evaluate, log to wandb.

You're implementing pieces of step (b), (c), and (d). The orchestration in train_off_policy.py calls your three methods.
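For orientation, here is a schematic of how the script plausibly wires your three methods together. Every name here (env, buffer, replay_iter, demo_iter, utd, and the step counts) is an illustrative stand-in, not the actual variables in train_off_policy.py:

# Phase 1: BC pretrain on demonstrations (step 2 of the box above)
for _ in range(num_bc_steps):
    agent.bc(demo_iter)                        # demos fed through the same bc()

# Phase 2: RL loop (step 3)
obs = env.reset()
for step in range(num_env_steps):
    action = agent.act(obs, eval_mode=False)   # (a) collect with exploration noise
    next_obs, reward, done, info = env.step(action)
    buffer.add(obs, action, reward, next_obs, done)
    obs = env.reset() if done else next_obs

    for _ in range(utd):                       # (b) UTD critic updates per env step
        agent.update_critic(replay_iter)
    agent.update_actor(replay_iter)            # (c) one actor update
    agent.bc(demo_iter)                        # (d) BC anchor, now on demos again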

Chapter 09

Code Tour: off_policy.py

Three classes, mirroring the three concepts: actor, critic ensemble, agent.

Actor

off_policy.py:11-33

class Actor(nn.Module):
    def __init__(self, obs_shape, action_shape, hidden_dim, std=0.1):
        super().__init__()
        self.std = std
        self.policy = nn.Sequential(
            nn.Linear(obs_shape[0], hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim),   nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, action_shape[0]))

    def forward(self, obs):
        mu = torch.tanh(self.policy(obs))         # action mean, squashed to [-1,1]
        std = torch.ones_like(mu) * self.std      # fixed std (NOT learned)
        return utils.TruncatedNormal(mu, std)

Notable contrasts with PPO's actor:

• The std is fixed at 0.1 and never learned — PPO's actor typically carries a learned log-std parameter.
• The mean is squashed with tanh, so raw outputs always land in [-1, 1].
• There is no value head; the critic is a separate network entirely.

Critic ensemble

off_policy.py:36-54

class Critic(nn.Module):
    def __init__(self, obs_shape, action_shape, num_critics, hidden_dim):
        super().__init__()
        self.critics = nn.ModuleList([nn.Sequential(
            nn.Linear(obs_shape[0] + action_shape[0], hidden_dim), nn.LayerNorm(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim), nn.LayerNorm(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1))
            for _ in range(num_critics)])

    def forward(self, obs, action):
        h = torch.cat([obs, action], dim=-1)
        return [critic(h) for critic in self.critics]   # LIST of N tensors

Agent

off_policy.py:57-79

class ACAgent:
    def __init__(self, obs_shape, action_shape, device, lr,
                 hidden_dim, num_critics, critic_target_tau, stddev_clip):
        self.device = device
        self.critic_target_tau = critic_target_tau
        self.stddev_clip = stddev_clip

        self.actor = Actor(obs_shape, action_shape, hidden_dim).to(device)
        self.critic = Critic(obs_shape, action_shape, num_critics, hidden_dim).to(device)
        self.critic_target = Critic(obs_shape, action_shape, num_critics, hidden_dim).to(device)
        self.critic_target.load_state_dict(self.critic.state_dict())   # target == online at init

        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=lr)
        self.critic_opt = torch.optim.Adam(self.critic.parameters(), lr=lr)

Two notable choices:

1. The target critic starts as an exact copy of the online critic (load_state_dict), so TD targets are consistent from the very first step.
2. Actor and critic get separate Adam optimizers — which is why your code must step self.actor_opt or self.critic_opt, never a shared self.opt.

The acting method

off_policy.py:88-97

def act(self, obs, eval_mode):
    obs = torch.as_tensor(obs, device=self.device).float()
    dist = self.actor(obs.unsqueeze(0))
    if eval_mode:
        action = dist.mean                                  # greedy
    else:
        action = dist.sample(clip=None)                   # with noise
    return action.cpu().numpy()[0]

Used by the rollout collector. Eval mode = deterministic mean. Train mode = sampled with exploration noise.

Chapter 10

Your Three Implementation Tasks

For each task: the math, the gotchas, and pseudocode you can hold in your head while writing the actual implementation. Cross-reference back to the corresponding chapter when in doubt.

Task 1
bc — Behavior cloning loss

Goal: maximize the log-probability that the actor assigns to expert actions.

Math

Lbc(θ) = − 𝔼(s, a)[ log πθ(a | s) ]

Already in scope:

batch = next(replay_iter)
obs, action, _, _, _ = utils.to_torch(batch, self.device)

Pseudocode:

  1. Run actor on obs: dist = self.actor(obs)
  2. Compute log-prob of expert action: log_prob = dist.log_prob(action).sum(-1)
  3. Loss is negative mean: loss = -log_prob.mean()
  4. Zero actor optimizer grad: self.actor_opt.zero_grad(set_to_none=True)
  5. Backward: loss.backward()
  6. Step: self.actor_opt.step()
  7. Log: metrics["bc_loss"] = loss.item()
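Assembled into one body, a minimal sketch — it assumes the attributes from the code tour (self.actor, self.actor_opt) and a metrics dict as in the starter code:

dist = self.actor(obs)                          # TruncatedNormal over actions
log_prob = dist.log_prob(action).sum(-1)        # joint log-prob: sum over the 4 dims
loss = -log_prob.mean()                         # maximum likelihood of expert actions

self.actor_opt.zero_grad(set_to_none=True)
loss.backward()
self.actor_opt.step()

metrics["bc_loss"] = loss.item()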
Gotchas

• Use self.actor_opt, not self.opt — this class has separate actor/critic optimizers.

• log_prob returns shape [batch, action_dim]. Sum over the last dim (the action is 4-D, so the joint log-prob is the sum of per-dim log-probs), then take the mean across the batch.

• This same method is also called during RL training. Don't add any one-time logic.

Task 2
update_critic — The full TD update

Goal: one critic gradient step plus one target soft-update. This is the most involved task.

Math, in 5 steps

Step 1 — sample the next action from the current policy:
  a' ~ πθ(·|s')
Step 2 — pick 2 random target critics, take the min:
  i, j ~ random.sample(range(N), 2)
  y = r + γ(1−done) · min( Q̄i(s', a'), Q̄j(s', a') )
Step 3 — loss across ALL N online critics:
  L = Σk=1..N ( Qk(s, a) − stop_grad(y) )²
Step 4 — gradient step on critic params:
  φ ← φ − lr · ∇φ L
Step 5 — soft-update all N target critic params:
  φ̄k ← (1 − τ) φ̄k + τ φk,   k = 1..N

Already in scope:

batch = next(replay_iter)
obs, action, reward, discount, next_obs = utils.to_torch(batch, self.device)

Pseudocode:

with torch.no_grad():
    # next-action from current policy
    next_action = self.actor(next_obs).sample(clip=self.stddev_clip)

    # forward all target critics → list of [B, 1] tensors
    target_q_list = self.critic_target(next_obs, next_action)

    # pick 2 random critics from the list (without replacement)
    sampled = random.sample(target_q_list, 2)
    target_q = torch.min(sampled[0], sampled[1])     # shape [B, 1]

    # TD target. Mind the shapes: reward [B], discount [B], target_q [B, 1]
    target = reward.unsqueeze(-1) + discount.unsqueeze(-1) * target_q

# online critic predictions on (obs, action)
q_list = self.critic(obs, action)

# sum of MSEs across ALL N critics
critic_loss = sum(F.mse_loss(q, target) for q in q_list)

# gradient step
self.critic_opt.zero_grad(set_to_none=True)
critic_loss.backward()
self.critic_opt.step()

# soft-update target critic
utils.soft_update_params(self.critic, self.critic_target, self.critic_target_tau)

# logging
metrics["critic_loss"] = critic_loss.item()
metrics["critic_target_q"] = target.mean().item()
metrics["critic_q1"] = q_list[0].mean().item()
Gotchas, in order of pain

1. Wrap target computation in with torch.no_grad():. Otherwise gradients flow through the target into the critic and you train it against itself → divergence.

2. Shape alignment. reward and discount are [B]; Q outputs are [B, 1]. Add .unsqueeze(-1) to align. Mismatched shapes broadcast to [B, B] — loss looks fine numerically but is gibberish, training never works. Silent killer.

3. Sample 2 critics WITHOUT replacement: random.sample(list, 2). Don't use random.choices — that's with replacement and could pick the same critic twice.

4. Train ALL N critics, not just the 2 used in the target. sum(F.mse_loss(q, target) for q in q_list) hits every critic.

5. Use self.critic_target, not self.critic, for the target. Mixing these up is the most common bug.

6. stddev_clip: when sampling the next action for the target, clip the noise to self.stddev_clip. This is "target policy smoothing" from TD3 — it prevents the critic from being trained on extreme, off-distribution actions.

Debug signals

critic_loss stays at 0.0 → gradients aren't flowing. Forgot .backward()? Wrong optimizer? Loss disconnected from graph?

critic_loss explodes to 1e6+ → shape bug, OR forgot no_grad and target is being trained, OR forgot soft update.

target.mean() stays at 0 forever → agent never sees reward. Either bc pretrain didn't work, or buffer too small, or something upstream broken.

Task 3
update_actor — Maximize mean Q

Goal: one actor gradient step. Push the policy toward actions where Q is high.

Math

a ~ πθ(·|s)
Lactor(θ) = − (1/N) Σk=1..N Qφk(s, a)

Already in scope:

batch = next(replay_iter)
obs, _, _, _, _ = utils.to_torch(batch, self.device)

Only obs. The action will come from sampling the current policy — not from the buffer's stored action. Why: we want gradient to flow from Q through the new sampled action back into θ. The buffer's action was produced by an old policy and is detached.

Pseudocode:

# sample fresh action from current policy (reparameterized → grads flow)
dist = self.actor(obs)
action = dist.sample(clip=self.stddev_clip)

# forward all critics. Critic params are NOT optimized in this step.
q_list = self.critic(obs, action)

# mean Q across ensemble, then mean across batch, then negate
q_mean = torch.stack(q_list, dim=0).mean(dim=0)         # [B, 1]
actor_loss = -q_mean.mean()

# gradient step on actor only
self.actor_opt.zero_grad(set_to_none=True)
actor_loss.backward()
self.actor_opt.step()

# logging
metrics["actor_loss"] = actor_loss.item()
metrics["actor_q"] = q_mean.mean().item()
Gotchas

• Use sample(clip=self.stddev_clip), not sample(). The clipping is the same clipped-noise idea TD3 calls "target policy smoothing"; this codebase applies it on the actor side as well.

• Don't accidentally backprop into the critic. Since you're calling self.actor_opt.step() (not the critic's optimizer), and the critic params aren't in actor_opt, you're safe by construction. But don't get cute — if you call self.critic_opt.step() by mistake here, you corrupt the critic.

• Mean across critics, then mean across batch. The PDF specifies (1/N) Σk Qk. (Some implementations use min instead; staying with the PDF is correct.)

• The TruncatedNormal's clamp uses a straight-through estimator (x − x.detach() + clamped.detach()), so gradients still flow through μθ(s) even though the action is bounded. See utils.py:119.
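The straight-through pattern itself is generic and fits in one function (a sketch of the idea, not a copy of utils.py):

import torch

def clamp_straight_through(x, low=-1.0, high=1.0):
    """Clamp in the forward pass, identity in the backward pass."""
    clamped = torch.clamp(x, low, high)
    # Forward value equals `clamped`; the graph sees `x` plus constants,
    # so gradients flow through x as if no clamp happened.
    return x - x.detach() + clamped.detach()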

Chapter 11

The UTD Ratio Analysis

The homework's final part asks you to run the same algorithm with two configurations and explain the difference:

| Run | num_critics | UTD | Wall-clock target | Env steps |
|---|---|---|---|---|
| Default | 2 | 1 | ~30-45 min | 100k |
| Ablation | 10 | 5 | ~2 hours | 50k |

What is UTD?

Definition
UTD ratio = update-to-data ratio

The number of critic gradient updates performed per environment step. UTD=1 means: collect 1 transition, do 1 critic update. UTD=5 means: collect 1 transition, do 5 critic updates — same data, 5× more gradient passes through it.
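Rough reuse arithmetic, assuming the batch size of 256 from Chapter 03's sketch (exact numbers depend on the config):

env_steps, batch, utd = 100_000, 256, 1
samples_consumed = env_steps * utd * batch   # 25,600,000 sampled transitions
reuse = samples_consumed / env_steps         # average gradient visits per transition
print(reuse)                                 # 256.0 at UTD=1 → 1280.0 at UTD=5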

You can ONLY do this with off-policy methods. PPO can't — once you do too many updates on the same rollout, the policy drift exceeds what the clip surrogate can compensate for.

Expected effects

Pro of UTD>1: each transition is more thoroughly exploited. Sample efficiency improves. The 50k-step run with UTD=5 should match or beat the 100k-step run with UTD=1.

Con of UTD>1: more compute per env step (5× more gradient passes). Wall-clock is longer despite fewer env steps.

Con of UTD>1 alone: with only 2 critics and aggressive replay, the critic overfits to recent buffer entries. Q-values become wildly inaccurate on out-of-distribution actions, the actor exploits those errors, training collapses. This is why pure UTD increase often hurts.

Why N=10 saves UTD=5: a larger ensemble has lower variance in target estimates, so it tolerates more aggressive updates. The ensemble's disagreement on out-of-distribution actions provides an implicit regularization signal — min-of-2-randomly-sampled is more conservative when the underlying ensemble is more diverse.

This pairing (high UTD + large ensemble) is the recipe of REDQ, which achieves model-based-level sample efficiency on continuous control.

One-sentence answer template

For your writeup

"Increasing UTD from 1 to 5 with 10 critics improves sample efficiency because each transition is replayed and propagated through critic updates 5× more often, while the larger ensemble keeps target estimates well-calibrated against the resulting overfitting risk."

Comparison with PPO (Problem 2)

The very last part of HW2 asks you to compare PPO vs SAC-default curves in 3-5 sentences with at least two concrete differences.

| Property | PPO (P2) | SAC-style (P3 default) |
|---|---|---|
| Steps to plateau | ~1M | ~100k |
| Why | Discards rollouts | Replays buffer indefinitely |
| Final success rate | ~50-80% | >90% |
| Run-to-run variance | Lower (clipped) | Higher (off-policy) |
| Conceptual basis | V-baseline + policy gradient | Q-learning with actor |

Two concrete differences to discuss:

  1. Sample efficiency. P3 reaches plateau in ~100k steps; P2 needs ~1M. Reason: replay buffer + off-policy means each transition is reused hundreds of times in P3, while P2 discards rollouts after a few PPO epochs. With sparse rewards, P3 stores rare reward transitions and replays them; P2 must encounter rewards repeatedly in fresh rollouts.
  2. Final success rate. P3 typically hits ~90-100% because it can fully exploit each transition's information. P2 is more conservative because the clipped objective limits per-update policy change.
Chapter 12

Cheat Sheet & Self-Quiz

Equations to memorize

Bellman target y = r + γ(1−done) · min( Q̄i(s', a'), Q̄j(s', a') )
Critic loss Lcritic = Σk=1..N ( Qk(s, a) − y )²
Actor loss Lactor = − (1/N) Σk=1..N Qk(s, πθ(s))
BC loss Lbc = − mean[ log πθ(a | s) ]
Polyak target update φ̄ ← (1 − τ) φ̄ + τ φ

Variable scope reference

| Variable | Where | What |
|---|---|---|
| self.actor | ACAgent | Online policy πθ |
| self.critic | ACAgent | Ensemble of N online critics |
| self.critic_target | ACAgent | Slow-moving copies of the online critics |
| self.actor_opt | ACAgent | Adam, only actor params |
| self.critic_opt | ACAgent | Adam, only critic params |
| self.critic_target_tau | ACAgent | τ for Polyak (typically 0.005) |
| self.stddev_clip | ACAgent | Action noise clip for sampling |
| obs, action, reward, discount, next_obs | From to_torch(batch) | discount = γ(1−done), pre-multiplied |
| q_list | Returned by critic(obs, action) | Python list of N tensors, each [B, 1] |
| utils.soft_update_params | utils.py | Polyak helper, applies the τ-mix to all params |

Self-quiz — if you can answer these without re-reading, you're ready

  1. Why does Q-learning with a single critic and no target network typically diverge?
  2. What does the (1 − done) factor accomplish in the Bellman target?
  3. Why do we sample only 2 critics for the target but train all N?
  4. Why is the actor loss − Q(s, π(s)) and not + Q(s, π(s))?
  5. Why is with torch.no_grad(): essential when computing the TD target?
  6. What's the difference between critic and critic_target? When does each get updated?
  7. If the critic loss stays at exactly 0.0 throughout training, what's the most likely bug?
  8. If the critic loss explodes to 1e6+, what's the most likely bug?
  9. Why does PPO need ~1M steps but this SAC-style algorithm needs ~100k?
  10. What does the BC step during RL accomplish that pretraining alone can't?
  11. With UTD=5, which is the bottleneck on training time — env steps or gradient steps?
  12. Why does the actor update sample a fresh action from π(s) rather than use the buffer's stored action?
Answer key — check after attempting

1. Bootstrapping with the same network creates a moving target; combined with off-policy data and function approximation (the deadly triad), gradient descent chases its own tail. Target networks fix this.

2. Zeros out the bootstrapped future when an episode ended — terminal states have no future to bootstrap from.

3. Min-of-2 reduces maximization bias; randomization keeps every critic in the ensemble accountable; train all N to keep the ensemble diverse and accurate everywhere, not just where it's queried for the target.

4. Optimizers minimize, but we want to maximize Q. Negate.

5. Otherwise gradients flow through the target into the critic, training it against its own moving prediction → divergence.

6. critic is online, updated by Adam every step. critic_target is updated only by Polyak averaging, slowly tracking the online critic. Used only for computing TD targets.

7. Almost certainly: target wasn't no_grad'd / detached, OR wrong optimizer, OR .backward() not called, OR shapes mismatched and broadcast made loss = 0.

8. Shape mismatch broadcasting (e.g. [B] + [B, 1] → [B, B]), OR forgot no_grad on the target, OR forgot the soft update so the target is stuck at its random init.

9. Off-policy reuses each transition many times; on-policy throws data away after a few epochs. With sparse rewards, the rare reward transitions get replayed hundreds of times in off-policy.

10. Pretraining can decay during RL drift; periodic BC during RL keeps the actor anchored to the expert distribution, especially when actor updates push toward critic-overestimated actions.

11. Gradient steps. Each env step costs ~1ms (sim + tiny forward pass). Each gradient step costs ~10ms+ (full forward+backward through actor and ensemble of critics).

12. The buffer's action was sampled from an old policy. We want to train the current actor's parameters, so we need a fresh action where gradients can flow from Q back through μθ. Reparameterization makes that gradient meaningful.

Implementation order

  1. bc — ~5 minutes if you understand log_prob. Easiest. Do this first to verify your dev loop works.
  2. update_critic — ~20-30 minutes. The hardest. Most students hit at least one shape bug.
  3. update_actor — ~10 minutes. Easy after critic.
  4. Launch run 1: modal run --detach modal_off_policy.py
  5. Edit modal_off_policy.py — uncomment num_critics=10 and utd=5.
  6. Launch run 2: modal run --detach modal_off_policy.py
Take it back to class

You can now teach this

If a friend asks: "What's the difference between PPO and SAC?" — you don't recite features. You say: "PPO is a policy gradient method that has to use fresh data because its update is biased on stale data. SAC is a Q-learning method whose update is grounded in the Bellman equation, which is a property of the environment, so any past data works. That asymmetry is why off-policy is 10× more sample-efficient on sparse-reward tasks. Off-policy needs target networks and double-Q to stabilize, but the sample-efficiency win is worth the engineering."

If asked: "Why an ensemble of critics?" — you say: "Maximization bias. The max operator on noisy estimates is biased upward. Two-critic min counteracts that. With an ensemble of 10 and random pair-min, you get a more conservative target that tolerates aggressive UTD without overfitting."

That's the bar. You're there.