Stanford CS 224R · Homework 2 · Off-Policy Actor-Critic

Problem 3 from Absolute Zero

A robot arm. A hammer. A nail. A reward of 1.0 if you finish, 0.0 every other moment. From those rules, a self-improving system. Every concept, every line of code, every gotcha — explained.

No prior RL assumed · Math derived · Code annotated · Three implementation tasks

Chapter 01

The Setup & Why It's Hard

You have a 4-DOF Sawyer robot arm in simulation. It can pick things up, push, swing. Your task: train it to pick up a hammer and use that hammer to drive a nail. Same task as Problem 2, completely different algorithm.

The observation is a vector of about 39 numbers — positions of robot/hammer/nail at the current step and the previous step. The action is a 4-D vector in [-1, 1] — joint displacements.

The reward function is sparse:

r(s, a) = 0   for every step
r(s, a) = 1   the moment the nail is fully driven, then the episode ends

That sparseness is what makes this hard. Compare it to a "shaped" reward like −|distance to nail| − |hammer angle error|, which would tell the agent "you're getting warmer" continuously. With the sparse reward, the agent gets zero feedback for thousands of timesteps, then a single +1 when it finally succeeds. Most rollouts contain zero learning signal.

Why this is the right choice for the homework

Real-world robotics rewards are sparse. You can't easily hand-code a smooth reward for "drive the nail" — what does "halfway driven" look like? Sparse rewards force you to use exploration + demonstrations + sample-efficient algorithms. Solving sparse-reward tasks is the frontier.

So the homework gives you two things to fight the sparseness: expert demonstrations you can imitate (Chapter 07's behavior cloning) and an off-policy algorithm that can replay every transition it ever collects (Chapter 03).

The deliverable: a wandb plot of eval/episode_success rising from 0 to at least 90% within 100,000 environment steps. For comparison, PPO in Problem 2 needs about 1,000,000 steps to reach a lower success rate. That's a 10× sample-efficiency gap, and understanding where it comes from is the entire lesson.

The mental picture

Think of a chef tasting a dish once and writing down the rating. On-policy: taste, write rating, throw the recipe away, cook a new dish. Off-policy: taste, write rating, store everything in a giant cookbook. Tomorrow, re-read every entry hundreds of times and notice patterns. Same data, way more learning.

Chapter 02

Q-Functions — The Foundation

In Problem 1 you implemented tabular Q-learning. Same idea here, except instead of a 20×4 table, the Q-function is a neural network.

Definition
Q-value — Q(s, a)

The expected total discounted reward you would get if you start in state s, take action a, and then act according to your policy π forever after.

Q-value, formally: Qπ(s, a) = 𝔼[ r₀ + γ r₁ + γ² r₂ + … | s₀ = s, a₀ = a, then follow π ]

The discount factor γ ∈ [0, 1) — future rewards are worth less than immediate ones. Typical: 0.99. A reward 100 steps away is worth 0.99¹⁰⁰ ≈ 0.37 of a reward right now.

Why Q, not V?

You also could imagine a function that just takes state — the state-value V(s):

Vπ(s) = 𝔼[ Qπ(s, a) | a ~ π(·|s) ] — the average over the actions you'd take

PPO (Problem 2) used V. Why does this homework use Q?

Because actions matter. When we update the actor, we want to push the policy toward actions that have high value. With Q(s, a), we can directly ask "is this specific action good?" With V(s), we'd only know whether the state is good on average, and we'd have to use the policy gradient theorem (with all its variance) to extract action-level signal. Q-learning skips the middleman.

The Bellman equation, again

You saw it in P1. Same recursion, same intuition:

Bellman expectation Qπ(s, a) = 𝔼[ r(s, a) + γ Qπ(s', a') | s' ~ env, a' ~ π(·|s') ]

The value of (s, a) decomposes into this step's reward plus the discounted value of where we land next. This recursion is the entire engine of TD learning.

The TD target for a single sample (s, a, r, s') is:

y = r + γ Qπ(s', a') where a' ~ π(·|s')

And the TD error is the difference between this new estimate and our old prediction:

δ = y − Q(s, a) = r + γ Q(s', a') − Q(s, a)

If δ is positive, the action turned out better than we predicted — push Q(s, a) up. If negative, push down. Same as the gridworld update; just with a neural network instead of a table.

From table to network — what changes

In the table case: Q[s, a] += α δ. Direct write. In the network case, we don't have direct slots — we have weights φ. So instead we compute the MSE loss (Qφ(s, a) − y)² and let backprop pull Qφ(s, a) toward y. The TD update becomes a regression problem.
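To make the shift concrete, here is a minimal side-by-side sketch — the dimensions and names are made up for illustration (a 20×4 table as in P1; a stand-in linear critic for the network case):

import torch
import torch.nn.functional as F

# Tabular (P1): write the TD error straight into the table slot.
Q = torch.zeros(20, 4)                         # 20 states x 4 actions
alpha, gamma = 0.1, 0.99
s, a, r, s_next, a_next = 3, 1, 0.0, 4, 2
delta = r + gamma * Q[s_next, a_next] - Q[s, a]
Q[s, a] += alpha * delta                       # direct write

# Network (P3): the same update becomes MSE regression toward the target.
q_net = torch.nn.Linear(39 + 4, 1)             # stand-in critic: (obs, action) -> Q
obs = torch.randn(256, 39)
act = torch.randn(256, 4)
with torch.no_grad():
    y = torch.randn(256, 1)                    # placeholder for r + gamma * Q(s', a')
loss = F.mse_loss(q_net(torch.cat([obs, act], dim=-1)), y)
loss.backward()                                # backprop nudges Q(s, a) toward y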

Terminal handling

If s' is terminal (the episode ended), there is no future:

y = r                       if done
y = r + γ Q(s', a')         otherwise

The starter code provides discount as a pre-multiplied factor: discount = γ · (1 − done). So:

y = r + discount · Q(s', a')

...handles both cases in one expression. When done is 1, discount is 0, the bootstrap term vanishes. Same trick you used in PPO's GAE.
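A quick numeric sanity check of that one-expression form (made-up values):

import torch

gamma = 0.99
reward = torch.tensor([0.0, 1.0])       # second transition ends the episode
done = torch.tensor([0.0, 1.0])
next_q = torch.tensor([0.5, 0.7])       # critic's bootstrap estimates

discount = gamma * (1 - done)           # the pre-multiplied factor
y = reward + discount * next_q
# y[0] = 0 + 0.99 * 0.5 = 0.495   (normal bootstrap)
# y[1] = 1 + 0.00 * 0.7 = 1.0     (terminal: bootstrap vanishes)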

Chapter 03

On-Policy vs Off-Policy

This is the most important conceptual distinction in modern RL.

| Property | On-policy (PPO, Problem 2) | Off-policy (this problem, SAC, DQN) |
|---|---|---|
| Data source | Current policy only | Any past policy |
| Storage | Thrown away after a few epochs | Replay buffer, kept forever |
| Sample efficiency | Low | High (10× or more) |
| Stability | Naturally stable | Fragile — needs target nets, double-Q |
| Algorithm class | Policy gradient | Q-learning / actor-critic |

Why must on-policy be on-policy?

The policy gradient estimator is:

∇θ J(θ) = 𝔼(s,a) ~ πθ [ ∇θ log πθ(a|s) · A(s, a) ]

The expectation is over (s, a) drawn from the current policy. If you use stale data, the expectation is wrong — you're computing a gradient for the wrong distribution. PPO patches this with importance sampling for a few epochs, but still needs fresh data per update.

Why CAN off-policy be off-policy?

Look at the Bellman equation again:

Q(s, a) = r(s, a) + γ 𝔼[ Q(s', a') ]

This equation is a property of the environment (the reward and transition function), not of any particular policy. As long as you have a tuple (s, a, r, s'), you can use it to enforce the Bellman constraint — regardless of which policy generated the tuple.

That's why off-policy works: the critic learns the environment's value structure, and the policy generating the data doesn't have to match. You can mix transitions from a random initial policy, an expert demonstration, and your current actor, all in the same buffer.

The replay buffer is the magic

You collect 1 transition per environment step but do many gradient updates per step. The "update-to-data" ratio (UTD) is how aggressively you exploit your buffer. UTD=1 is conservative; UTD=5 in this homework's ablation is more aggressive. Off-policy lets you do this; on-policy fundamentally cannot.

What's stored in the buffer

Each entry: (s, a, r, s', done). That's it. No policy, no log-probs (unlike PPO). You can sample any minibatch of these uniformly at random, and the Bellman update applies.
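Conceptually the buffer is just a ring of parallel arrays. A minimal sketch (a hypothetical stand-in, not the homework's actual buffer class):

import numpy as np

class ReplayBuffer:
    """Fixed-capacity ring buffer of (s, a, r, s', done) transitions."""

    def __init__(self, capacity, obs_dim, act_dim):
        self.obs      = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.action   = np.zeros((capacity, act_dim), dtype=np.float32)
        self.reward   = np.zeros(capacity, dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.done     = np.zeros(capacity, dtype=np.float32)
        self.capacity, self.idx, self.size = capacity, 0, 0

    def add(self, s, a, r, s_next, done):
        i = self.idx
        self.obs[i], self.action[i], self.reward[i] = s, a, r
        self.next_obs[i], self.done[i] = s_next, done
        self.idx = (self.idx + 1) % self.capacity             # overwrite oldest
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        j = np.random.randint(0, self.size, size=batch_size)  # uniform
        return (self.obs[j], self.action[j], self.reward[j],
                self.next_obs[j], self.done[j])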

A typical training step

batch ← uniform_sample(D, 256)                  # 256 random transitions
y ← r + γ(1−done) · Qtarget(s', a')             # TD target, gradients off
loss ← mean( (Qonline(s, a) − y)² )
φ ← φ − lr · ∇φ loss                            # critic update
Chapter 04

The Deadly Triad

The off-policy story above sounds clean. It isn't. Three properties together cause Q-learning with neural networks to diverge. The triad is famously called "deadly":

  1. Function approximation — we use a neural net Qφ(s, a) instead of a lookup table. Updating one entry leaks into many.
  2. Bootstrapping — the target r + γ Q(s', a') uses the same network we're optimizing. As we update φ, the target moves. We're regressing toward a moving label.
  3. Off-policy data — the (s, a) distribution we sample doesn't match the policy we're evaluating, so updates can amplify errors.

Any one of these is fine. Any two is usually fine. All three together — instability, exploding Q-values, training collapse.

Concrete failure mode 1: Moving target

Imagine vanilla Q-learning: we use the same critic for the target and the online prediction.

loss = ( Qφ(s, a) − [ r + γ Qφ(s', a') ] )²

When we take a gradient step, both Qφ(s, a) and Qφ(s', a') change. We tried to pull the prediction toward the target, but we also moved the target. Repeat thousands of times, gradient descent never converges.

Common-sense analogy

Imagine trying to grab a flag that runs away whenever you reach for it — and runs at the same speed you do. The fix: tie the flag's position to a different copy of you that updates slowly. That's the target network.

Concrete failure mode 2: Maximization bias

Vanilla Q-learning's target uses max_{a'} Q(s', a'). Suppose Q has noise — some actions' Q-values are slightly overestimated, others underestimated, on average correct.

The max operator systematically picks overestimated entries. So your target is biased upward. You train Q toward an upward-biased target. Q grows. Next iteration, even more bias. Q-values explode.

Why max-of-noisy-estimates is biased

True Q-values:    [1.0, 1.0, 1.0]    all equal
Noisy estimates:  [1.1, 0.9, 1.0]    noise added
max(estimates) = 1.1                  biased UPWARD by 0.1
true max       = 1.0

This is called maximization bias. It's the reason vanilla DQN often diverges.
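You can verify this bias in a few lines of standalone simulation (not homework code); it also previews why the min-of-two fix in Chapter 06 errs on the safe side:

import numpy as np

rng = np.random.default_rng(0)
true_q = np.array([1.0, 1.0, 1.0])              # all three actions equally good
noise = rng.normal(0.0, 0.1, size=(100_000, 3))
estimates = true_q + noise                      # unbiased per-entry estimates

print(estimates.mean(axis=0))                   # ≈ [1.0, 1.0, 1.0]: each entry unbiased
print(estimates.max(axis=1).mean())             # ≈ 1.08: max is biased upward
print(np.minimum(estimates[:, 0],
                 estimates[:, 1]).mean())       # ≈ 0.94: min of two is biased DOWN (safe side)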

The fixes (preview)

| Failure | Fix | Where in homework |
|---|---|---|
| Moving target | Target network — slowly-updated copy of the critic | self.critic_target + soft_update_params |
| Maximization bias | Clipped double-Q — min over two critics | min(Q̄i, Q̄j) in target |
| Variance in target | Ensemble of N critics, sample 2 randomly | num_critics hyperparameter |

Each fix surfaces in your update_critic implementation. Read this chapter again after you've written it — you'll see why each line is there.

Chapter 05

The Actor-Critic Fix

Q-learning works in discrete action spaces because you can compute argmax_a Q(s, a) by enumerating actions (e.g., 18 Atari buttons). With continuous actions in [-1, 1]⁴, you can't enumerate. There's no argmax.

Two options:

  1. Random sampling / cross-entropy method — sample 64 random actions, evaluate Q for each, pick the best. Slow, low resolution.
  2. Train an actor network — a neural net that learns to output the argmax. We optimize the actor to find the maximizer for us.

We use option 2. The actor πθ(a|s) is trained so that its sampled action a maximizes Qφ(s, a). That's the entire actor objective:

Actor loss Lactor(θ) = − 𝔼s ~ buffer[ Qφ(s, πθ(s)) ]

Read it: "sample a state s, plug the policy's chosen action into the critic, that scalar is what the actor wants to maximize." The minus sign converts maximize into minimize for PyTorch.

The reparameterization trick

For gradient to flow from Qφ back into θ, the action a must be a differentiable function of θ. With a stochastic policy this seems impossible — you can't backprop through "sample from a distribution." Solution: reparameterize.

Reparameterization for a Gaussian policy

Sample: a = μθ(s) + σ · ε,   ε ~ 𝒩(0, 1)

Now a is a deterministic function of θ, with ε as an independent noise input.

∇θ Q(s, a) = ∇θ Q(s, μθ(s) + σε) = (∂Q/∂a) · (∂a/∂θ)     ← chain rule

The actor in this homework outputs a TruncatedNormal(μ, 0.1). Std is fixed at 0.1, not learned (unlike SAC's adaptive entropy). The 0.1 just adds exploration noise. Calling dist.sample() returns tanh(μθ(s)) + 0.1 · ε, which is differentiable in θ while ε is sampled fresh.
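Here is the trick stripped of everything else — a plain Normal with rsample standing in for the homework's TruncatedNormal:

import torch

mu_net = torch.nn.Linear(39, 4)               # stand-in for the actor's mean head
obs = torch.randn(8, 39)

mu = torch.tanh(mu_net(obs))                  # squashed mean, as in the Actor class
dist = torch.distributions.Normal(mu, 0.1)

a = dist.rsample()                            # reparameterized: mu + 0.1 * eps
fake_q = a.sum()                              # stand-in for Q(s, a)
fake_q.backward()
print(mu_net.weight.grad is not None)         # True: gradients reached theta

# dist.sample() (no "r") cuts the graph at the sampling step, so no
# gradient would reach mu_net. The homework's TruncatedNormal.sample()
# is reparameterized internally, which is why plain sample() works there.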

Two networks, two roles

Actor πθ(a|s): maps state to action distribution. Trained to output high-Q actions. Used to act in the environment.

Critic Qφ(s, a): estimates expected return for (state, action). Trained via TD regression. Used as a training signal for the actor.

The ping-pong

You alternate two updates:

  1. Critic update: pull Qφ(s, a) toward TD target r + γ Qtarget(s', a').
  2. Actor update: push πθ toward actions where Qφ is high.

If you only had the critic, you'd know which actions are good but couldn't act on them. If you only had the actor, you'd have nothing to optimize against. Together, they bootstrap each other up. This is the entire idea of actor-critic.

Chapter 06

Three Stabilization Tricks

Each fixes a specific failure mode from Chapter 04. Each surfaces directly in your update_critic.

Trick 1: Target network with soft (Polyak) update

Maintain a second copy of the critic, φ̄, used only for computing TD targets. Update its parameters slowly:

Polyak / soft update φ̄ ← (1 − τ) φ̄ + τ φ, τ small, e.g. 0.005

Each step, the target moves 0.5% toward the online critic, so its memory of old parameters decays with a time constant of 1/τ ≈ 200 steps. Critically, on the timescale of any single gradient update, the target looks frozen. The regression target is stable.

In the codebase: utils.soft_update_params(net, target_net, tau) does this. You'll call it once per critic update.
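The helper itself is tiny. A sketch of what it plausibly does — the real implementation lives in utils.py, so treat this as illustrative:

import torch

def soft_update_params(net, target_net, tau):
    """Polyak update: target <- (1 - tau) * target + tau * online."""
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)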

Common bug: forgetting to soft-update

If you forget the soft update entirely, the target stays at its random initialization forever. TD targets then never reflect anything the online critic has learned, so value information can't propagate back from the reward — learning crawls. The critic loss will look fine but the actor never improves.

Trick 2: Ensemble of N critics

Maintain N independent critic networks Qφ1, ..., QφN. Each is initialized differently (different orthogonal weights from weight_init) and sees different minibatch orderings. They disagree on out-of-distribution inputs. That disagreement reduces variance when we combine them.

Default in this homework: N=2. Ablation: N=10. With N=10 + UTD=5 you have something close to REDQ ("Randomized Ensembled Double Q-learning"), a state-of-the-art recipe for sample efficiency.

Trick 3: Random pair + min for the target

When computing the TD target, do not use all N critics. Pick 2 randomly, take the elementwise min:

Clipped double-Q with random pair

i, j ~ random.sample(1..N, 2)                       # two distinct indices
y = r + γ(1−done) · min( Q̄i(s', a'), Q̄j(s', a') )

The min counteracts maximization bias from Chapter 04. The random sampling (vs always using critics 1 and 2) is the REDQ improvement — it forces every critic to be reliable, not just the first two. Otherwise critics 3..N could drift since they're never used in targets.

Critical: train ALL N critics, only use 2 for the target

Common bug: students sometimes only update the 2 sampled critics, leaving the others frozen. Wrong. Loss should be computed over all N critic predictions:

critic_loss = sum( F.mse_loss(q, target) for q in q_list )

The target uses 2 samples; the loss uses all N predictions.

Putting them together

The TD target with all three tricks:

Final target formula

with torch.no_grad():
    a' = actor(s').sample(clip=stddev_clip)       # next action from current policy
    target_q_list = critic_target(s', a')         # list of N tensors
    Q̄i, Q̄j = random.sample(target_q_list, 2)      # 2 distinct critics
    y = r + discount · min(Q̄i, Q̄j)                # scalar target

Memorize the shape of this. Three tricks, four lines, all the off-policy stability machinery in modern RL.

Chapter 07

Behavior Cloning

The reward is sparse. Random exploration almost never sees a +1. The agent would never learn from scratch in 100k steps.

So we cheat: start with a warm-started policy by imitating expert demonstrations. This is supervised learning, not RL.

BC loss Lbc(θ) = − 𝔼(s, a) ~ demos[ log πθ(a | s) ]

Maximize the log-probability that the policy assigns to expert actions. Pure maximum likelihood. After ~5000 BC gradient steps, the policy is "kind-of right" — it produces actions that look expert-shaped. Then RL takes over and refines them based on actual reward.

Why BC also runs during RL training

The homework alternates RL gradient steps with BC gradient steps throughout training. Why?

Imagine you've BC-pretrained a decent policy. Now you start RL. The actor's gradient step says "move toward whatever the critic likes." But the critic is itself a randomly-initialized neural net at first — its high-Q regions are essentially random. The actor would walk away from the BC initialization toward the critic's random preferences and forget the demos.

Solution: keep mixing in BC steps during RL. The BC term anchors the actor near the expert distribution. As the critic improves, RL pulls more strongly toward Q-maximization. Net effect: stable improvement over BC, no catastrophic forgetting.

Connecting back to PPO

Problem 2's PPO uses a "reverse KL to a frozen reference policy" for the same reason — preventing drift from the BC warm-start. Both methods solve "don't forget the demos" but with different mechanics: PPO adds a KL penalty in the loss; this homework alternates a BC update. Same disease, different antibiotics.

The BC method signature

Both pretraining and the RL-mixin call the same method:

def bc(self, replay_iter):
    # replay_iter yields (obs, action, reward, discount, next_obs)
    # For BC we only need (obs, action). The rest are unused.
    batch = next(replay_iter)
    obs, action, _, _, _ = utils.to_torch(batch, self.device)

    # YOUR CODE: loss = -E[log pi(a|s)], step actor optimizer

The trick: this same method is called whether the buffer is the demos or the live replay buffer. The training script picks which buffer to feed it. Don't put any "first time only" logic in bc.

Chapter 08

The Full Algorithm

Putting everything together. Read this twice.

Off-Policy Actor-Critic with BC Pretraining + Ensemble Critics
  1. Initialize actor πθ, N critics Qφ1..N, target critics Q̄φ̄1..N ← copies of online, replay buffer D, demonstration buffer Ddemo.
  2. BC pretrain (no RL yet):
    for k = 1 to Nbc:
      sample minibatch (s, a) from Ddemo
      Lbc = − mean[ log πθ(a|s) ]
      θ ← θ − lr · ∇θ Lbc
  3. RL training loop: for step = 1 to Nenv:
    a) Collect: at ~ πθ(·|st), step env, store (st, at, rt, st+1, donet) in D.
    b) Critic updates, UTD times:
    • sample minibatch B from D
    • with no_grad: a' ~ πθ(·|s'), pick 2 random target critics, y = r + γ(1-done) min(Q̄i(s',a'), Q̄j(s',a'))
    • Lcritic = Σk=1..N MSE(Qφk(s, a), y)
    • φ ← φ − lr · ∇φ Lcritic
    • φ̄ ← (1−τ) φ̄ + τ φ
    c) Actor update (one per env step):
    • sample minibatch B from D, pull only s
    • anew ~ πθ(·|s) (reparameterized)
    • Lactor = − (1/N) Σk Qφk(s, anew)
    • θ ← θ − lr · ∇θ Lactor
    d) BC update (every step, on demos):
    • sample (s, a) from Ddemo
    • Lbc = − mean[ log πθ(a|s) ]
    • θ ← θ − lr · ∇θ Lbc
    e) Periodically: evaluate, log to wandb.

You're implementing pieces of step (b), (c), and (d). The orchestration in train_off_policy.py calls your three methods.
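For orientation, here is a schematic of how the script plausibly wires your three methods together. Every name here (env, buffer, replay_iter, demo_iter, utd, and the step counts) is an illustrative stand-in, not the actual variables in train_off_policy.py:

# Phase 1: BC pretrain on demonstrations (step 2 of the box above)
for _ in range(num_bc_steps):
    agent.bc(demo_iter)                        # demos fed through the same bc()

# Phase 2: RL loop (step 3)
obs = env.reset()
for step in range(num_env_steps):
    action = agent.act(obs, eval_mode=False)   # (a) collect with exploration noise
    next_obs, reward, done, info = env.step(action)
    buffer.add(obs, action, reward, next_obs, done)
    obs = env.reset() if done else next_obs

    for _ in range(utd):                       # (b) UTD critic updates per env step
        agent.update_critic(replay_iter)
    agent.update_actor(replay_iter)            # (c) one actor update
    agent.bc(demo_iter)                        # (d) BC anchor, now on demos again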

Chapter 09

Code Tour: off_policy.py

Three classes, mirroring the three concepts: actor, critic ensemble, agent.

Actor

off_policy.py:11-33

class Actor(nn.Module):
    def __init__(self, obs_shape, action_shape, hidden_dim, std=0.1):
        super().__init__()
        self.std = std
        self.policy = nn.Sequential(
            nn.Linear(obs_shape[0], hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim),   nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, action_shape[0]))

    def forward(self, obs):
        mu = torch.tanh(self.policy(obs))         # action mean, squashed to [-1,1]
        std = torch.ones_like(mu) * self.std      # fixed std (NOT learned)
        return utils.TruncatedNormal(mu, std)

Notable contrasts with PPO's actor:

• The std is fixed at 0.1 and never learned — PPO's actor typically carries a learned log-std parameter.
• The mean is squashed with tanh, so raw outputs always land in [-1, 1].
• There is no value head; the critic is a separate network entirely.

Critic ensemble

off_policy.py:36-54

class Critic(nn.Module):
    def __init__(self, obs_shape, action_shape, num_critics, hidden_dim):
        super().__init__()
        self.critics = nn.ModuleList([nn.Sequential(
            nn.Linear(obs_shape[0] + action_shape[0], hidden_dim), nn.LayerNorm(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim), nn.LayerNorm(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1))
            for _ in range(num_critics)])

    def forward(self, obs, action):
        h = torch.cat([obs, action], dim=-1)
        return [critic(h) for critic in self.critics]   # LIST of N tensors

Agent

off_policy.py:57-79

class ACAgent:
    def __init__(self, obs_shape, action_shape, device, lr,
                 hidden_dim, num_critics, critic_target_tau, stddev_clip):
        self.device = device
        self.critic_target_tau = critic_target_tau
        self.stddev_clip = stddev_clip

        self.actor = Actor(obs_shape, action_shape, hidden_dim).to(device)
        self.critic = Critic(obs_shape, action_shape, num_critics, hidden_dim).to(device)
        self.critic_target = Critic(obs_shape, action_shape, num_critics, hidden_dim).to(device)
        self.critic_target.load_state_dict(self.critic.state_dict())   # target == online at init

        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=lr)
        self.critic_opt = torch.optim.Adam(self.critic.parameters(), lr=lr)

Two notable choices:

1. The target critic starts as an exact copy of the online critic (load_state_dict), so TD targets are consistent from the very first step.
2. Actor and critic get separate Adam optimizers — which is why your code must step self.actor_opt or self.critic_opt, never a shared self.opt.

The acting method

off_policy.py:88-97

def act(self, obs, eval_mode):
    obs = torch.as_tensor(obs, device=self.device).float()
    dist = self.actor(obs.unsqueeze(0))
    if eval_mode:
        action = dist.mean                                  # greedy
    else:
        action = dist.sample(clip=None)                   # with noise
    return action.cpu().numpy()[0]

Used by the rollout collector. Eval mode = deterministic mean. Train mode = sampled with exploration noise.

Chapter 10

Your Three Implementation Tasks

For each task: the math, the gotchas, and pseudocode you can hold in your head while writing the actual implementation. Cross-reference back to the corresponding chapter when in doubt.

Task 1
bc — Behavior cloning loss

Goal: maximize the log-probability that the actor assigns to expert actions.

Math

Lbc(θ) = − 𝔼(s, a)[ log πθ(a | s) ]

Already in scope:

batch = next(replay_iter)
obs, action, _, _, _ = utils.to_torch(batch, self.device)

Pseudocode:

  1. Run actor on obs: dist = self.actor(obs)
  2. Compute log-prob of expert action: log_prob = dist.log_prob(action).sum(-1)
  3. Loss is negative mean: loss = -log_prob.mean()
  4. Zero actor optimizer grad: self.actor_opt.zero_grad(set_to_none=True)
  5. Backward: loss.backward()
  6. Step: self.actor_opt.step()
  7. Log: metrics["bc_loss"] = loss.item()
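Assembled into one body, a minimal sketch — it assumes the attributes from the code tour (self.actor, self.actor_opt) and a metrics dict as in the starter code:

dist = self.actor(obs)                          # TruncatedNormal over actions
log_prob = dist.log_prob(action).sum(-1)        # joint log-prob: sum over the 4 dims
loss = -log_prob.mean()                         # maximum likelihood of expert actions

self.actor_opt.zero_grad(set_to_none=True)
loss.backward()
self.actor_opt.step()

metrics["bc_loss"] = loss.item()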
Gotchas

• Use self.actor_opt, not self.opt — this class has separate actor/critic optimizers.

• log_prob returns shape [batch, action_dim]. Sum over the last dim (the action is 4-D, so the joint log-prob is the sum of per-dim log-probs), then take the mean across the batch.

• This same method is also called during RL training. Don't add any one-time logic.

Task 2
update_critic — The full TD update

Goal: one critic gradient step plus one target soft-update. This is the most involved task.

Math, in 5 steps

Step 1 — sample the next action from the current policy:
  a' ~ πθ(·|s')
Step 2 — pick 2 random target critics, take the min:
  i, j ~ random.sample(range(N), 2)
  y = r + γ(1−done) · min( Q̄i(s', a'), Q̄j(s', a') )
Step 3 — loss across ALL N online critics:
  L = Σk=1..N ( Qk(s, a) − stop_grad(y) )²
Step 4 — gradient step on critic params:
  φ ← φ − lr · ∇φ L
Step 5 — soft-update all N target critic params:
  φ̄k ← (1 − τ) φ̄k + τ φk,   k = 1..N

Already in scope:

batch = next(replay_iter)
obs, action, reward, discount, next_obs = utils.to_torch(batch, self.device)

Pseudocode:

with torch.no_grad():
    # next-action from current policy
    next_action = self.actor(next_obs).sample(clip=self.stddev_clip)

    # forward all target critics → list of [B, 1] tensors
    target_q_list = self.critic_target(next_obs, next_action)

    # pick 2 random critics from the list (without replacement)
    sampled = random.sample(target_q_list, 2)
    target_q = torch.min(sampled[0], sampled[1])     # shape [B, 1]

    # TD target. Mind the shapes: reward [B], discount [B], target_q [B, 1]
    target = reward.unsqueeze(-1) + discount.unsqueeze(-1) * target_q

# online critic predictions on (obs, action)
q_list = self.critic(obs, action)

# sum of MSEs across ALL N critics
critic_loss = sum(F.mse_loss(q, target) for q in q_list)

# gradient step
self.critic_opt.zero_grad(set_to_none=True)
critic_loss.backward()
self.critic_opt.step()

# soft-update target critic
utils.soft_update_params(self.critic, self.critic_target, self.critic_target_tau)

# logging
metrics["critic_loss"] = critic_loss.item()
metrics["critic_target_q"] = target.mean().item()
metrics["critic_q1"] = q_list[0].mean().item()
Gotchas, in order of pain

1. Wrap target computation in with torch.no_grad():. Otherwise gradients flow through the target into the critic and you train it against itself → divergence.

2. Shape alignment. reward and discount are [B]; Q outputs are [B, 1]. Add .unsqueeze(-1) to align. Mismatched shapes broadcast to [B, B] — loss looks fine numerically but is gibberish, training never works. Silent killer.

3. Sample 2 critics WITHOUT replacement: random.sample(list, 2). Don't use random.choices — that's with replacement and could pick the same critic twice.

4. Train ALL N critics, not just the 2 used in the target. sum(F.mse_loss(q, target) for q in q_list) hits every critic.

5. Use self.critic_target, not self.critic, for the target. Mixing these up is the most common bug.

6. stddev_clip: when sampling the next action for the target, clip the noise to self.stddev_clip. This is "target policy smoothing" from TD3 — it prevents the critic from being trained on extreme, off-distribution actions.

Debug signals

critic_loss stays at 0.0 → gradients aren't flowing. Forgot .backward()? Wrong optimizer? Loss disconnected from graph?

critic_loss explodes to 1e6+ → shape bug, OR forgot no_grad and target is being trained, OR forgot soft update.

target.mean() stays at 0 forever → agent never sees reward. Either bc pretrain didn't work, or buffer too small, or something upstream broken.

Task 3
update_actor — Maximize mean Q

Goal: one actor gradient step. Push the policy toward actions where Q is high.

Math

a ~ πθ(·|s)
Lactor(θ) = − (1/N) Σk=1..N Qφk(s, a)

Already in scope:

batch = next(replay_iter)
obs, _, _, _, _ = utils.to_torch(batch, self.device)

Only obs. The action will come from sampling the current policy — not from the buffer's stored action. Why: we want gradient to flow from Q through the new sampled action back into θ. The buffer's action was produced by an old policy and is detached.

Pseudocode:

# sample fresh action from current policy (reparameterized → grads flow)
dist = self.actor(obs)
action = dist.sample(clip=self.stddev_clip)

# forward all critics. Critic params are NOT optimized in this step.
q_list = self.critic(obs, action)

# mean Q across ensemble, then mean across batch, then negate
q_mean = torch.stack(q_list, dim=0).mean(dim=0)         # [B, 1]
actor_loss = -q_mean.mean()

# gradient step on actor only
self.actor_opt.zero_grad(set_to_none=True)
actor_loss.backward()
self.actor_opt.step()

# logging
metrics["actor_loss"] = actor_loss.item()
metrics["actor_q"] = q_mean.mean().item()
Gotchas

• Use sample(clip=self.stddev_clip), not sample(). The clipping is the same clipped-noise idea TD3 calls "target policy smoothing"; this codebase applies it on the actor side as well.

• Don't accidentally backprop into the critic. Since you're calling self.actor_opt.step() (not the critic's optimizer), and the critic params aren't in actor_opt, you're safe by construction. But don't get cute — if you call self.critic_opt.step() by mistake here, you corrupt the critic.

• Mean across critics, then mean across batch. The PDF specifies (1/N) Σk Qk. (Some implementations use min instead; staying with the PDF is correct.)

• The TruncatedNormal's clamp uses a straight-through estimator (x − x.detach() + clamped.detach()), so gradients still flow through μθ(s) even though the action is bounded. See utils.py:119.
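The straight-through pattern itself is generic and fits in one function (a sketch of the idea, not a copy of utils.py):

import torch

def clamp_straight_through(x, low=-1.0, high=1.0):
    """Clamp in the forward pass, identity in the backward pass."""
    clamped = torch.clamp(x, low, high)
    # Forward value equals `clamped`; the graph sees `x` plus constants,
    # so gradients flow through x as if no clamp happened.
    return x - x.detach() + clamped.detach()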

Chapter 11

The UTD Ratio Analysis

The homework's final part asks you to run the same algorithm with two configurations and explain the difference:

| Run | num_critics | UTD | Wall-clock target | Env steps |
|---|---|---|---|---|
| Default | 2 | 1 | ~30-45 min | 100k |
| Ablation | 10 | 5 | ~2 hours | 50k |

What is UTD?

Definition
UTD ratio = update-to-data ratio

The number of critic gradient updates performed per environment step. UTD=1 means: collect 1 transition, do 1 critic update. UTD=5 means: collect 1 transition, do 5 critic updates — same data, 5× more gradient passes through it.
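Rough reuse arithmetic, assuming the batch size of 256 from Chapter 03's sketch (exact numbers depend on the config):

env_steps, batch, utd = 100_000, 256, 1
samples_consumed = env_steps * utd * batch   # 25,600,000 sampled transitions
reuse = samples_consumed / env_steps         # average gradient visits per transition
print(reuse)                                 # 256.0 at UTD=1 → 1280.0 at UTD=5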

You can ONLY do this with off-policy methods. PPO can't — once you do too many updates on the same rollout, the policy drift exceeds what the clip surrogate can compensate for.

Expected effects

Pro of UTD>1: each transition is more thoroughly exploited. Sample efficiency improves. The 50k-step run with UTD=5 should match or beat the 100k-step run with UTD=1.

Con of UTD>1: more compute per env step (5× more gradient passes). Wall-clock is longer despite fewer env steps.

Con of UTD>1 alone: with only 2 critics and aggressive replay, the critic overfits to recent buffer entries. Q-values become wildly inaccurate on out-of-distribution actions, the actor exploits those errors, training collapses. This is why pure UTD increase often hurts.

Why N=10 saves UTD=5: a larger ensemble has lower variance in target estimates, so it tolerates more aggressive updates. The ensemble's disagreement on out-of-distribution actions provides an implicit regularization signal — min-of-2-randomly-sampled is more conservative when the underlying ensemble is more diverse.

This pairing (high UTD + large ensemble) is the recipe of REDQ, which achieves model-based-level sample efficiency on continuous control.

One-sentence answer template

For your writeup

"Increasing UTD from 1 to 5 with 10 critics improves sample efficiency because each transition is replayed and propagated through critic updates 5× more often, while the larger ensemble keeps target estimates well-calibrated against the resulting overfitting risk."

Comparison with PPO (Problem 2)

The very last part of HW2 asks you to compare PPO vs SAC-default curves in 3-5 sentences with at least two concrete differences.

| Property | PPO (P2) | SAC-style (P3 default) |
|---|---|---|
| Steps to plateau | ~1M | ~100k |
| Why | Discards rollouts | Replays buffer indefinitely |
| Final success rate | ~50-80% | >90% |
| Run-to-run variance | Lower (clipped) | Higher (off-policy) |
| Conceptual basis | V-baseline + policy gradient | Q-learning with actor |

Two concrete differences to discuss:

  1. Sample efficiency. P3 reaches plateau in ~100k steps; P2 needs ~1M. Reason: replay buffer + off-policy means each transition is reused hundreds of times in P3, while P2 discards rollouts after a few PPO epochs. With sparse rewards, P3 stores rare reward transitions and replays them; P2 must encounter rewards repeatedly in fresh rollouts.
  2. Final success rate. P3 typically hits ~90-100% because it can fully exploit each transition's information. P2 is more conservative because the clipped objective limits per-update policy change.
Chapter 12

Cheat Sheet & Self-Quiz

Equations to memorize

Bellman target y = r + γ(1−done) · min( Q̄i(s', a'), Q̄j(s', a') )
Critic loss Lcritic = Σk=1..N ( Qk(s, a) − y )²
Actor loss Lactor = − (1/N) Σk=1..N Qk(s, πθ(s))
BC loss Lbc = − mean[ log πθ(a | s) ]
Polyak target update φ̄ ← (1 − τ) φ̄ + τ φ

Variable scope reference

| Variable | Where | What |
|---|---|---|
| self.actor | ACAgent | Online policy πθ |
| self.critic | ACAgent | Ensemble of N online critics |
| self.critic_target | ACAgent | Slow-moving copies of the online critics |
| self.actor_opt | ACAgent | Adam, only actor params |
| self.critic_opt | ACAgent | Adam, only critic params |
| self.critic_target_tau | ACAgent | τ for Polyak (typically 0.005) |
| self.stddev_clip | ACAgent | Action noise clip for sampling |
| obs, action, reward, discount, next_obs | From to_torch(batch) | discount = γ(1−done), pre-multiplied |
| q_list | Returned by critic(obs, action) | Python list of N tensors, each [B, 1] |
| utils.soft_update_params | utils.py | Polyak helper, applies the τ-mix to all params |

Self-quiz — if you can answer these without re-reading, you're ready

  1. Why does Q-learning with a single critic and no target network typically diverge?
  2. What does the (1 − done) factor accomplish in the Bellman target?
  3. Why do we sample only 2 critics for the target but train all N?
  4. Why is the actor loss − Q(s, π(s)) and not + Q(s, π(s))?
  5. Why is with torch.no_grad(): essential when computing the TD target?
  6. What's the difference between critic and critic_target? When does each get updated?
  7. If the critic loss stays at exactly 0.0 throughout training, what's the most likely bug?
  8. If the critic loss explodes to 1e6+, what's the most likely bug?
  9. Why does PPO need ~1M steps but this SAC-style algorithm needs ~100k?
  10. What does the BC step during RL accomplish that pretraining alone can't?
  11. With UTD=5, which is the bottleneck on training time — env steps or gradient steps?
  12. Why does the actor update sample a fresh action from π(s) rather than use the buffer's stored action?
Answer key — check after attempting

1. Bootstrapping with the same network creates a moving target; combined with off-policy data and function approximation (the deadly triad), gradient descent chases its own tail. Target networks fix this.

2. Zeros out the bootstrapped future when an episode ended — terminal states have no future to bootstrap from.

3. Min-of-2 reduces maximization bias; randomization keeps every critic in the ensemble accountable; train all N to keep the ensemble diverse and accurate everywhere, not just where it's queried for the target.

4. Optimizers minimize, but we want to maximize Q. Negate.

5. Otherwise gradients flow through the target into the critic, training it against its own moving prediction → divergence.

6. critic is online, updated by Adam every step. critic_target is updated only by Polyak averaging, slowly tracking the online critic. Used only for computing TD targets.

7. Almost certainly: target wasn't no_grad'd / detached, OR wrong optimizer, OR .backward() not called, OR shapes mismatched and broadcast made loss = 0.

8. Shape mismatch broadcasting (e.g. [B] + [B, 1] → [B, B]), OR forgot no_grad on the target, OR forgot the soft update so the target is stuck at its random init.

9. Off-policy reuses each transition many times; on-policy throws data away after a few epochs. With sparse rewards, the rare reward transitions get replayed hundreds of times in off-policy.

10. Pretraining can decay during RL drift; periodic BC during RL keeps the actor anchored to the expert distribution, especially when actor updates push toward critic-overestimated actions.

11. Gradient steps. Each env step costs ~1ms (sim + tiny forward pass). Each gradient step costs ~10ms+ (full forward+backward through actor and ensemble of critics).

12. The buffer's action was sampled from an old policy. We want to train the current actor's parameters, so we need a fresh action where gradients can flow from Q back through μθ. Reparameterization makes that gradient meaningful.

Implementation order

  1. bc — ~5 minutes if you understand log_prob. Easiest. Do this first to verify your dev loop works.
  2. update_critic — ~20-30 minutes. The hardest. Most students hit at least one shape bug.
  3. update_actor — ~10 minutes. Easy after critic.
  4. Launch run 1: modal run --detach modal_off_policy.py
  5. Edit modal_off_policy.py — uncomment num_critics=10 and utd=5.
  6. Launch run 2: modal run --detach modal_off_policy.py
Take it back to class

You can now teach this

If a friend asks: "What's the difference between PPO and SAC?" — you don't recite features. You say: "PPO is a policy gradient method that has to use fresh data because its update is biased on stale data. SAC is a Q-learning method whose update is grounded in the Bellman equation, which is a property of the environment, so any past data works. That asymmetry is why off-policy is 10× more sample-efficient on sparse-reward tasks. Off-policy needs target networks and double-Q to stabilize, but the sample-efficiency win is worth the engineering."

If asked: "Why an ensemble of critics?" — you say: "Maximization bias. The max operator on noisy estimates is biased upward. Two-critic min counteracts that. With an ensemble of 10 and random pair-min, you get a more conservative target that tolerates aggressive UTD without overfitting."

That's the bar. You're there.