Stanford CS 224R · Homework 3 · Offline RL

Implicit Q-Learning from Absolute Zero

A subtle but powerful idea: learn the upper expectile of Q-values among the dataset's actions, never query out-of-distribution actions, get state-of-the-art offline RL performance. Every equation derived, every line of code annotated.


Chapter 01

The Setup, Re-Grounded

Same offline RL problem as Problem 1 of HW3: an offline dataset of (s, a, r, s', d) transitions, no environment interaction during training, two AntMaze tasks plus a PointMass stitching evaluation. Same goal: learn a policy that's better than the data, without ever acting in the world.

What's different: the algorithm. Instead of constraining the actor to stay near the data (AWAC's approach), IQL constrains the critic to never extrapolate to out-of-distribution actions in the first place. If the critic never queries OOD actions, OOD-overestimation can't happen.

The deliverables for Problem 2:

  1. Tune the expectile ζ on antmaze-umaze across {0.2, 0.9}, report mean ± std across 3 seeds.
  2. Train on antmaze-medium-diverse with the better ζ, again 3 seeds.
  3. Compare AWAC and IQL on both tasks, explain the difference in 3 sentences.
  4. PointMass stitching: run IQL and Filtered BC on a curated suboptimal dataset, evaluate whether IQL composes better trajectories than any single trajectory in the dataset.

By the end of this guide you'll know why IQL exists, how expectile regression yields an upper estimate of value without ever extrapolating, and exactly which six code changes make it run.

Chapter 02

Why IQL Exists

You just implemented AWAC. It works. So why another algorithm?

AWAC's quiet weakness

AWAC's critic update has one line that should make you uncomfortable:

y = r + γ (1 − d) min(Q̄1(s', a'), Q̄2(s', a')) where a' ~ πψ(·|s')

The next-action a' comes from the current actor. The actor is trained via weighted MLE on the dataset, which keeps it approximately in-distribution. But "approximately" isn't "perfectly."

If the actor drifts even slightly — say, into a region of action space that's adjacent to but not covered by the dataset — the target Q-network has to evaluate Q at those slightly-OOD actions. Predictions there are unreliable. Errors leak in.

The drift can be subtle. The actor is a continuous Gaussian; even with mean tied close to dataset actions, sampling produces tail values that wander outside data support. AWAC's softness is its weakness.

The cycle that almost works in AWAC

Actor → close to data → samples a' ≈ in-distribution → target Q reliable → advantages reliable → actor stays close to data. The cycle holds as long as the BC anchor is strong. With a small dataset, weak BC, or distribution-shifted data, AWAC can still drift.

IQL's clean idea: never query OOD actions in the critic

What if the critic never needed to evaluate Q(s', a') for actions sampled from the policy? What if every Q-value in the entire training pipeline was computed only at actions that exist in the dataset?

Then OOD overestimation is structurally impossible. The Q-network is only trained at in-distribution (s, a) pairs and only queried at in-distribution (s, a) pairs. Distributional shift becomes a non-issue.

That's the entire conceptual move of IQL.

The trick: a separate V-network

Replace the actor-sampled a' with a learned V-function:

AWAC's TD target (uses actor): y = r + γ min(Q̄1(s', a'), Q̄2(s', a')), where a' ~ π(·|s')

IQL's TD target (uses V-network): y = r + γ Vφ(s'). No actor in the target!

If we have a good V(s') — meaning "expected return from s' under the optimal in-distribution policy" — we can compute the Q-target without sampling actions at all. The Q-network is trained purely from (s, a, r, s', d) tuples.

The catch: how do we train V? It needs to capture "value of acting well at state s, given the dataset's coverage." Not the average value (which would be too pessimistic), not the max (which would extrapolate). Something between the average and the max.

That something is the expectile.

Chapter 03

The Expectile Idea

To explain expectile regression, start with what you already know.

Standard regression: mean

Regular MSE regression of V(s) toward Q-values:

LMSE(φ) = 𝔼(s, a)[ ( Q(s, a) − Vφ(s) )² ]

The minimizer is the mean: V(s) ends up equal to Ea~D[Q(s, a)], averaged over the actions in the dataset at state s.

Why this isn't what we want: averaging includes bad actions. If the dataset has a mix of expert demonstrations and random meandering at state s, the mean dilutes the expert's value. We end up with a pessimistic V that doesn't reflect what's achievable from s.

Quantile regression: median, percentiles

You can replace MSE with the pinball loss to learn quantiles instead of means. The 0.5-quantile is the median; 0.9-quantile is the value below which 90% of samples fall.

Pinball loss for τ-quantile Lτ(u) = max( τ u, (τ − 1) u )

Quantiles are great because the 0.9-quantile of Q-values represents "the upper end of what the dataset achieves at this state" — closer to the optimal in-distribution policy. But quantile regression is non-smooth (the loss has a kink), making optimization with stochastic gradient descent harder.

Expectile: the smooth cousin

The expectile is a quantile-like quantity computed with squared rather than absolute error:

Expectile loss: Lζ²(u) = | ζ − 1[u < 0] | · u²

where u = Q(s, a) − V(s), 1[·] is the indicator function (1 if the condition holds, else 0), and ζ ∈ (0, 1) is the expectile parameter.

Read this carefully:

The loss is asymmetric: it penalizes errors on each side differently.

ζ | Regime | Effect
0.5 | MSE | Both sides weighted equally; recovers the mean
0.9 | Upper | Penalizes underestimates 9× more
0.99 | Max-ish | Aggressive upper estimate
0.1 | Lower | Pessimistic, rarely useful

For IQL we want large ζ — typically 0.7 to 0.99. This makes V regress toward the upper expectile of Q over the dataset's actions, which approximates "value of the better dataset actions at state s."

Why this is the perfect fit for offline RL

V should reflect "value of acting well from s, within the dataset's coverage." The mean is too pessimistic (includes bad actions). The max is impossible (extrapolates). The upper expectile threads the needle: "how good is the value of the better-than-average actions in the dataset at this state?"

That value is exactly what we want to bootstrap from in the Q-update. The agent doesn't need to take the mean action or some impossible argmax — it should take an action like the better dataset actions, and V tells us what the return looks like from there.

The intuition in one sentence

Expectile regression with ζ = 0.9 learns a "soft max" of Q over dataset actions: optimistic enough to represent the policy's potential, conservative enough to never extrapolate beyond what was seen.
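You can check that "soft max" behavior numerically. Below is a small pure-Python sketch (toy numbers, not homework code) that finds the empirical ζ-expectile of a set of Q-values by iterating the fixed point obtained from setting the loss gradient to zero:

```python
def expectile(xs, zeta, iters=100):
    """Empirical zeta-expectile of a list of floats (illustrative helper).

    Setting the gradient of E[|zeta - 1[x < m]| * (x - m)^2] to zero gives
    the fixed point m = E[w * x] / E[w], with w = zeta where x >= m and
    w = 1 - zeta where x < m. Iterating converges quickly on these numbers.
    """
    m = sum(xs) / len(xs)  # start from the mean
    for _ in range(iters):
        ws = [zeta if x >= m else 1.0 - zeta for x in xs]
        m = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
    return m

qs = [0.0, 1.0, 2.0, 3.0, 10.0]  # Q-values of the dataset's actions at one state
print(expectile(qs, 0.5))   # 3.2: the mean
print(expectile(qs, 0.9))   # ≈ 7.38: upper expectile, between mean and max
print(expectile(qs, 0.99))  # ≈ 9.67: max-ish, but still below the max
```

Note that even ζ = 0.99 stays strictly inside the range of the samples: the expectile interpolates toward the max but never extrapolates past it.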

Chapter 04

Where Expectile Comes From

The expectile loss isn't a heuristic. It's the unique loss whose minimizer is the ζ-expectile of the target distribution. Let's see why.

Step 1: define the ζ-expectile

Expectile, defined mζ(X) = arg minm 𝔼[ Lζ2(X − m) ] = arg minm 𝔼[ | ζ − 1[X < m] | · (X − m)2 ]

The ζ-expectile of a distribution X is the value m that minimizes the expected expectile loss between X and m. The minimizer exists and is unique for any ζ ∈ (0, 1).

Step 2: why ζ=0.5 gives the mean

When ζ = 0.5, the multiplier |0.5 − 1[X < m]| is always 0.5, regardless of which side X is on. The loss reduces to standard MSE up to a constant:

L0.5²(u) = 0.5 · u²

And we know the MSE-minimizer is the mean. So m0.5(X) = E[X].

Step 3: why ζ > 0.5 gives an upper-shifted value

For ζ > 0.5: positive errors (X > m) get weight ζ, negative errors (X < m) get weight 1−ζ. With ζ = 0.9, positive errors are penalized 9× more than negative errors.

To minimize loss, we want to avoid positive errors — meaning we don't want X > m to happen often. So m gets pushed up, until only ~10% of the X distribution lies above m. That's the upper expectile.

(Technically the expectile isn't exactly the 90th percentile — it's defined via squared rather than absolute deviations — but it's a smoothed cousin of the quantile and serves the same purpose.)

Step 4: this is what we apply to Q

Treat Q(s, a) over a ~ D(·|s) as our distribution X. Train V(s) to be the ζ-expectile of that distribution:

IQL's V-loss: LV(φ) = 𝔼(s, a) ~ D[ Lζ²( Q̄(s, a) − Vφ(s) ) ] = 𝔼(s, a) ~ D[ | ζ − 1[diff < 0] | · diff² ], where diff = Q̄(s, a) − Vφ(s)

The gradient of this with respect to V's parameters:

∇φ LV = 𝔼(s, a) ~ D[ −2 · | ζ − 1[diff < 0] | · diff · ∇φ Vφ(s) ]

At the minimum (gradient = 0), V(s) settles at the value where positive diffs occur with weight ζ and negative diffs with weight 1−ζ. That's the upper expectile.
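That balance condition at the minimum can be sanity-checked numerically: the ζ-weighted positive residuals exactly offset the (1 − ζ)-weighted negative ones. A quick pure-Python check with made-up Q-values, using a grid search as a stand-in for gradient descent:

```python
zeta = 0.9
qs = [0.0, 1.0, 2.0, 3.0, 10.0]  # made-up Q-values at one state

def loss(m):
    # expectile loss of regressing a scalar m toward the samples
    return sum((zeta if q >= m else 1.0 - zeta) * (q - m) ** 2 for q in qs)

# brute-force the minimizer on a fine grid
m_star = min((i * 0.001 for i in range(10001)), key=loss)

pos = sum(q - m_star for q in qs if q > m_star)  # residuals where V underestimates
neg = sum(m_star - q for q in qs if q < m_star)  # residuals where V overestimates
print(m_star)                        # ≈ 7.385
print(zeta * pos, (1 - zeta) * neg)  # the two weighted sums match
```

With ζ = 0.9, only ~10% of the residual mass (weighted) sits above the minimizer, which is exactly the "upper expectile" the chapter describes.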

The bottom line

Expectile regression learns "the value of acting like the better dataset actions" without needing to know which actions are better in advance. The asymmetric loss does the upper-percentile selection automatically through gradient descent.

The expectile loss in the homework's notation

The PDF (Equation 6) writes:

Lζ²(u) = | ζ − 1{u ≤ 0} | · u²

Note: 1{u ≤ 0} with ≤ (not <). Mathematically equivalent for continuous u (probability of exact zero is zero); for code it doesn't matter which you pick. PyTorch's (diff < 0).float() works either way.
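A quick numerical check of that claim, using numpy as a stand-in for the tensor code (toy residuals, not homework data). The two indicator conventions give identical losses everywhere, because at diff = 0 the squared factor zeroes the loss regardless of the weight; and ζ = 0.5 collapses to half-MSE:

```python
import numpy as np

zeta = 0.9
diff = np.array([-2.0, -0.5, 0.0, 0.3, 1.5])  # made-up Q-minus-V residuals

loss_lt = np.abs(zeta - (diff < 0).astype(float)) * diff ** 2   # strict <, as in code
loss_le = np.abs(zeta - (diff <= 0).astype(float)) * diff ** 2  # <=, as in the PDF
print(np.allclose(loss_lt, loss_le))  # True

# zeta = 0.5 reduces to half of the MSE
half = np.abs(0.5 - (diff < 0).astype(float)) * diff ** 2
print(np.allclose(half, 0.5 * diff ** 2))  # True
```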

Chapter 05

The Three-Network Architecture

IQL maintains three network roles. (AWAC had two: actor and critic, with target networks for the critic.) IQL has three because the V-function is its own component.

Network | Symbol | Role | Trained on
Q-networks | Qθ1, Qθ2 | Estimate Q(s, a) for dataset actions | TD: y = r + γ V(s')
Target Q-networks | Q̄1, Q̄2 | Stable target for the V regression | Polyak from online Q
V-network | Vφ | Upper expectile of Q over dataset actions | Expectile regression toward Q̄(s, a)

The data flow

IQL forward graph

              [dataset]
             /         \
         (s, a)    (s, a, r, s', d)
            |             |
            v             v
        Q̄(s, a)        Vφ(s')
   (no_grad, target)  (no_grad)
            |             |
            v             v
       expectile     y = r + γ (1 − d) Vφ(s')
       regression         |
            |             v
            v       MSE(Q(s, a), y)
        train Vφ          |
                          v
                   train Qθ1, Qθ2
                          |
                          v
               Polyak update Q̄1, Q̄2

Three updates per step, one for each network. The V update reads target Q (no grad). The Q update reads V (no grad). The target Q is slowly EMA'd from online Q. No actor anywhere in this graph — the policy is a separate component (more on that below).

Why a separate V (and not just use Q)?

You might ask: why train V at all? Couldn't we just use min(Q1, Q2)(s, a) in the Q-target by some trick?

The whole point of IQL is to never compute Q at OOD actions. min(Q1, Q2)(s', a') requires choosing some a'. Whatever we pick — mean of policy, sample from policy, argmax over discrete actions — that a' isn't guaranteed to be in the dataset. Even if it usually is, sometimes it isn't, and that's enough to leak OOD errors.

By introducing V(s) trained via expectile regression, we never need a' for the Q-target. V(s') is just a function of s'. No action choice required. OOD-free.

Where does the policy come from?

IQL trains a policy as a separate module that's not involved in any of the value updates. The policy is trained the same way as AWAC's actor: weighted maximum likelihood, with weights exp(A(s, a) / λ) where A(s, a) = Q(s, a) − V(s).

The policy doesn't influence Q-updates. The policy doesn't influence V-updates. The policy just consumes the trained Q and V to figure out which dataset actions to imitate more strongly. Decoupled.

The architectural insight

IQL splits offline RL into two phases that don't interfere with each other: (1) value estimation via expectile regression, (2) policy extraction via advantage-weighted regression. Each phase is solved cleanly without distributional-shift concerns.

Chapter 06

IQL vs AWAC Side-by-Side

Both algorithms use advantage-weighted regression for the actor. The difference is entirely in how they compute Q and V.

Component | AWAC | IQL
Actor loss | Same for both: − mean( log π(a|s) · exp(A/λ) )
Q-target | r + γ min(Q̄1, Q̄2)(s', a'), with a' ~ π(·|s') | r + γ V(s') (no actor!)
V-estimate | Q(s, aπ) with a single MC sample | Separate V-net trained via expectile regression
Networks | Actor + 2 Q + 2 target Q = 5 | Actor + 2 Q + 2 target Q + V = 6
OOD risk | Possible if the actor drifts | None: V never queries OOD actions
Tunables | λ only | ζ (expectile) and λ
Typical use | Easier tasks, smaller datasets | Harder tasks, larger datasets, sparse rewards

The asymptotic argument

Why does IQL typically outperform AWAC on harder tasks? The PDF asks you to write 3 sentences about this. Here's the structure of a strong answer.

Three sentences for the writeup

Sentence 1: AWAC's TD target requires sampling a' ~ π(·|s') from the actor, which can drift outside the data distribution as training progresses; this lets OOD evaluation errors enter the value estimates and propagate.

Sentence 2: IQL replaces that sample with a learned V(s') trained via expectile regression on dataset (s, a) pairs, so every Q-value query in the entire pipeline is at an in-distribution action, eliminating OOD overestimation by construction.

Sentence 3: On longer-horizon tasks like antmaze-medium where TD errors compound over many bootstrap steps, IQL's stricter avoidance of OOD queries leads to more reliable value estimates and substantially better final performance.

The expectile parameter ζ as a knob

The homework asks you to sweep ζ on antmaze-umaze across {0.2, 0.9}. What should you expect?

ζ | V learns | Effect on policy
0.2 | Lower expectile of Q at each state (pessimistic) | A = Q − V is often positive; weights are diffuse; policy ends up close to plain BC
0.5 | Mean of Q (standard MSE) | AWAC-like behavior, with the single-sample V replaced by the mean
0.7–0.9 | Upper expectile: "value of the good dataset actions" | A = Q − V is sharp around 0; high-Q actions get heavily upweighted; policy improves over the data
0.99 | Near-max of Q (aggressive) | V can become unstable; if the dataset max has high variance, V tracks noise

For antmaze-umaze, ζ = 0.9 typically wins. You'll confirm this empirically and report the better ζ for the medium-diverse run.

Chapter 07

Stitching Behavior

Part 3 of Problem 2 asks about stitching. This is the gold-standard test of whether an offline RL algorithm is genuinely doing more than imitation.

What is stitching?

Definition
Stitching

The ability of an offline RL algorithm to combine parts of multiple suboptimal trajectories in the dataset into a single trajectory that's better than any individual trajectory in the dataset.

Concrete example: imagine a navigation task where the dataset contains two suboptimal trajectories, A and B, that pass through a common intermediate state but take different routes before and after it.

Neither A nor B alone reaches the goal optimally. But the concatenation of A's first half + B's second half could reach the goal in fewer steps than either (total return -80, say). That's stitching.

Why this matters

Behavior cloning can't stitch. BC just imitates trajectories; the best you can do is reproduce the best trajectory in the dataset. Filtered BC (top 10% by return) can do slightly better, but still bounded by the best single trajectory.

True offline RL can stitch because the Q-function captures state-conditioned value, not trajectory-level identity. If state s appears in both A and B, the algorithm learns "from state s, the best continuation is whatever B does" — even if no full trajectory does what we want.

The PointMass test

The homework's pointmass_stitching_dataset.npz is curated specifically to test this: it contains only suboptimal trajectories whose segments overlap in state space, so beating the best single trajectory requires composing pieces of several of them.

You'll train both IQL and Filtered BC on this dataset. The deliverable: report mean and max return across 3 seeds, plus trajectory visualizations.

What success looks like:

If IQL achieves > -46 mean return, you've demonstrated stitching. The trajectory visualization should show paths to the goal that don't exactly match any single dataset trajectory but are clearly composed of pieces of multiple ones.

Why IQL stitches

The Q-function generalizes across (s, a) pairs. If state s appears in trajectory A and a similar state appears in trajectory B, Q learns the best action at that state based on what worked in either trajectory. The advantage-weighted policy then takes the best in-distribution action at each state, regardless of which original trajectory contained it.
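Stitching falls out of plain dynamic programming over the dataset, which a toy sketch can show. This is dataset-constrained tabular Q-iteration on made-up transitions, not IQL itself, but it illustrates the mechanism: the greedy path recombines trajectory A's prefix with trajectory B's suffix even though no single trajectory contains that path.

```python
from collections import defaultdict

# Two suboptimal trajectories over states 0..5 (goal = 4); labels are illustrative.
# A starts at 0 but wanders back from state 2 and never reaches the goal.
# B reaches the goal, but only starting from state 5.
transitions = [
    (0, 'fwd',  -1.0, 1, False),   # A
    (1, 'fwd',  -1.0, 2, False),   # A
    (2, 'back', -1.0, 1, False),   # A: the bad continuation at state 2
    (5, 'fwd',  -1.0, 2, False),   # B
    (2, 'fwd',  -1.0, 3, False),   # B: the good continuation at state 2
    (3, 'fwd',   0.0, 4, True),    # B: reaches the goal
]

gamma = 0.9
Q = defaultdict(float)
acts = defaultdict(set)   # actions the dataset actually contains at each state
model = {}
for s, a, r, s2, d in transitions:
    acts[s].add(a)
    model[(s, a)] = s2

# Dataset-constrained Q-iteration: bootstrap only over actions seen at s'.
for _ in range(100):
    for s, a, r, s2, d in transitions:
        boot = 0.0 if d else max(Q[(s2, a2)] for a2 in acts[s2])
        Q[(s, a)] = r + gamma * boot

# Greedy rollout from state 0, choosing the best *in-dataset* action.
s, path = 0, [0]
while s != 4:
    a = max(acts[s], key=lambda a2: Q[(s, a2)])
    s = model[(s, a)]
    path.append(s)
print(path)  # [0, 1, 2, 3, 4]: A's prefix stitched to B's suffix
```

At state 2 the backup prefers B's continuation (Q ≈ −1) over A's (Q ≈ −2.71), even though no trajectory in the data goes 0 → 1 → 2 → 3 → 4. That's the state-conditioned value doing the stitching.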

Chapter 08

The Full Algorithm

IQL: Implicit Q-Learning
  1. Initialize:
    • Two Q-networks Qθ1, Qθ2 + their target copies Q̄1, Q̄2.
    • V-network Vφ.
    • Policy πψ (random init).
    • Offline dataset D.
  2. For step = 1 to N (e.g., 1M):
    a) Sample minibatch (s, a, r, s', d) from D.
    b) V update:
    • q_target = min(Q̄1(s, a), Q̄2(s, a)) [no_grad]
    • diff = q_target − Vφ(s)
    • LV = mean( |ζ − 1[diff<0]| · diff² )
    • Step V optimizer.
    c) Actor update:
    • A(s, a) = Q(s, a) − Vφ(s) [no_grad]
    • w = exp(A / λ), clamped at 50.
    • Lπ = − mean(w · log πψ(a|s))
    • Step actor optimizer.
    d) Q update:
    • y = r + γ (1 − d) Vφ(s') [no_grad]
    • LQ = MSE(Qθ1(s, a), y) + MSE(Qθ2(s, a), y)
    • Step Q optimizer.
    e) Soft-update target Qs: Q̄ ← (1 − τ) Q̄ + τ Q.

Three losses, three updates per step, and the order matters subtly: V updates first against the current target Q; the actor then reads the just-updated V (and the current Q); the Q update bootstraps on the just-updated V; finally the target Qs soft-update. By the next iteration, all three have moved together by one step.

Notice the key absence: nowhere does the algorithm sample a' ~ π(·|s'). The policy never enters the value loops. OOD-free.
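Here is the value-learning half of that loop as a tabular, pure-Python sketch (the actor step is omitted; states, actions, rewards, and hyperparameters are made up for illustration). It exhibits the two properties the chapter emphasizes: V settles at the upper expectile of target Q over dataset actions, and the Q-target bootstraps on V(s') without ever sampling an action.

```python
gamma, zeta, lr, tau = 0.9, 0.9, 0.1, 0.05

# (s, a, r, s_next, done): state 0 has a good and a bad dataset action.
data = [
    (2, 'fwd',  0.0, 0, False),
    (0, 'good', 1.0, 1, True),
    (0, 'bad',  0.0, 1, True),
]

Q = {(s, a): 0.0 for s, a, *_ in data}
Qt = dict(Q)                                       # target Q (Polyak copy)
V = {s: 0.0 for t in data for s in (t[0], t[3])}   # state-value table

for _ in range(5000):
    # 1) V update: one gradient step of expectile regression toward target Q
    for s, a, r, s2, d in data:
        diff = Qt[(s, a)] - V[s]
        w = zeta if diff >= 0 else 1.0 - zeta
        V[s] += lr * 2.0 * w * diff
    # 2) Q update: TD target r + gamma * V(s'); no actor sample anywhere
    for s, a, r, s2, d in data:
        target = r + gamma * (0.0 if d else V[s2])
        Q[(s, a)] += lr * (target - Q[(s, a)])
    # 3) Polyak update of the target Q
    for k in Q:
        Qt[k] = (1 - tau) * Qt[k] + tau * Q[k]

print(round(V[0], 3))           # ≈ 0.9: upper expectile of {Q(0,good)=1, Q(0,bad)=0}
print(round(Q[(2, 'fwd')], 3))  # ≈ 0.81 = gamma * V(0), bootstrapped through V only
```

With ζ = 0.5 the same loop would drive V(0) to the mean 0.5 instead of 0.9; flipping that one constant is the whole difference between "average dataset action" and "good dataset action."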

Chapter 09

Code Tour

The IQL implementation spans two files. (The actor uses the same MLPPolicyAWAC class from Problem 1 — you're done with that.)

FileClassResponsibility
critics/iql_critic.pyIQLCriticQ-net, V-net, target Qs, expectile loss, V-update, Q-update
agents/iql_agent.pyIQLAgentOrchestrator — advantage estimation + train loop

The IQLCritic skeleton

Look at iql_critic.py:55-130 for the constructor. It already builds the two Q-networks, their target copies, and the Q optimizer.

It also pre-builds self.v_optimizer and self.v_learning_rate_scheduler, but both reference a self.v_net = ... line that's missing and that you have to add. That's edit #1.

Helpers already provided

_get_q_value(q_net, obs, actions) at iql_critic.py:132: handles discrete vs continuous Q-network signatures, returns shape [B].

get_q(obs, actions) at iql_critic.py:150: returns min(Q1, Q2)(s, a) using ONLINE networks.

get_target_q(obs, actions) at iql_critic.py:165: returns min(Q̄1, Q̄2)(s, a) using TARGET networks.

update_target_network() at iql_critic.py:286: Polyak averages both target Qs from online Qs.

What you'll fill in

  1. self.v_net definition in __init__ (one line).
  2. expectile_loss(diff) body (one line).
  3. value_loss in update_v (one line).
  4. loss, loss2 in update_q (a small block).
  5. adv = q − v in iql_agent.py:estimate_advantage (one line).
  6. actor_loss in iql_agent.py:train (a small block).

Six edits, mostly small. The hard work is conceptual; the code is ~10 lines total.

Chapter 10

Your Six Changes, Decoded

Per-line annotations for every blank you'll fill. This is the centerpiece chapter.

Change 1 of 6
Define self.v_net in IQLCritic.__init__

Where: iql_critic.py:116-119.

What's nearby: the helper v_network_initializer already exists at line 87, pulled from hparams['v_func']. Reading the docstring, it's a callable: v_func(ob_dim) → V-network with output [B, 1].

Look at how q_net is built at line 89:

self.q_net = q_network_initializer(self.ob_dim, self.ac_dim)
self.q_net.to(ptu.device)

Your code:

self.v_net = v_network_initializer(self.ob_dim)
self.v_net.to(ptu.device)

Decoded

self.v_net = v_network_initializer(self.ob_dim)

Build the V-network. v_network_initializer takes only ob_dim (no action dim — V is a state-only function). Returns an MLP whose output is [B, 1].

Compare to Q: q_network_initializer(ob_dim, ac_dim) takes both because Q is a function of state AND action. V skips the action argument.

self.v_net.to(ptu.device)

Move to GPU (if available). Same pattern as the Q-networks. Without this, V lives on CPU and the forward pass would fail when given GPU-resident observations.

That's it. Two lines. The v_optimizer on line 121 references self.v_net.parameters() — if you don't define v_net first, you'd get AttributeError.

Change 2 of 6
Implement expectile_loss

Where: iql_critic.py:193-196.

The math:

Lζ2(diff) = | ζ − 1[diff < 0] | · diff2

What's in scope: diff — tensor of shape [B]; self.iql_expectile — scalar ζ.

Your code:

weight = torch.abs(self.iql_expectile - (diff < 0).float())
return weight * diff.pow(2)

Decoded

weight = torch.abs(self.iql_expectile - (diff < 0).float())

The asymmetric weight, computed per element of diff. Four operations:

1. (diff < 0) — elementwise comparison, returns a bool tensor of same shape.

2. .float() — cast bool to float (True → 1.0, False → 0.0). This is the indicator 1[diff < 0].

3. self.iql_expectile - (...) — subtract. When diff > 0: result is ζ − 0 = ζ. When diff < 0: result is ζ − 1, which is negative for any ζ < 1.

4. torch.abs(...) — absolute value. Now: when diff > 0, weight = ζ; when diff < 0, weight = 1 − ζ (the sign flip becomes positive after abs).

So weight at each sample is exactly |ζ − 1[diff<0]|.

return weight * diff.pow(2)

Elementwise multiplication with diff squared. Returns a per-sample expectile loss, shape [B].

Note: diff.pow(2) is the same as diff ** 2 or diff * diff. Stylistic preference.

Why not mean here? Because update_v calls .mean() later. Returning per-sample lets the caller decide reduction. (Useful for inspecting the loss distribution during debugging.)

A common alternative formulation

You'll see this same loss written equivalently as:

weight = torch.where(diff > 0, expectile, 1 - expectile)
return weight * diff.pow(2)

Mathematically identical. Slightly more readable but slightly slower (torch.where dispatches a kernel). Both work.
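A numpy stand-in (illustrative values, not homework code) confirms the equivalence. The two formulations can only disagree at diff = 0, and there the squared term zeroes the loss anyway:

```python
import numpy as np

zeta = 0.9
diff = np.array([-1.5, -0.2, 0.0, 0.4, 2.0])  # illustrative residuals

w_abs = np.abs(zeta - (diff < 0).astype(float))  # abs formulation
w_where = np.where(diff > 0, zeta, 1.0 - zeta)   # where formulation
print(np.allclose(w_abs * diff ** 2, w_where * diff ** 2))  # True
```

The weights themselves differ at diff = 0 (0.9 vs 0.1 here), but since the loss multiplies by diff², the losses and their gradients coincide.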

Change 3 of 6
value_loss in update_v

Where: iql_critic.py:226-228.

The math:

LV(φ) = mean( Lζ2( Q̄(s, a) − Vφ(s) ) )

What's already in scope:

  • q_t_values: Q̄(s, a) from target nets, shape [B], computed with no_grad at line 218.
  • v_t: Vφ(s) from current online V, shape [B], computed at line 221.

Your code:

value_loss = self.expectile_loss(q_t_values - v_t).mean()

Decoded

value_loss = self.expectile_loss(q_t_values - v_t).mean()

One line, three operations.

1. q_t_values - v_t — elementwise difference, shape [B]. This is "diff" from the previous task. Positive when target Q is above current V (V is too low); negative when V is above target Q.

2. self.expectile_loss(...) — calls the function you wrote in Change 2. Returns per-sample expectile losses, shape [B].

3. .mean() — reduces to a scalar. Standard reduction for SGD on minibatch loss.

Critical detail: q_t_values was computed under with torch.no_grad(): at line 217. So gradients only flow through v_t. The V-network learns; the target Q-network doesn't. Good — we want a stable target for V to regress against, just like target networks for Q in standard TD.

Why we use Q̄ (target) and not Q (online) in the V regression

If we used online Q, the regression target moves whenever Q updates. V chases a moving target. With target Q (slowly EMA'd), the regression target is stable on the timescale of one V update. Same standard target-network argument.

Change 4 of 6
Q-loss in update_q

Where: iql_critic.py:269-273.

The math:

y = r + γ (1 − d) Vφ(s')   (no_grad on V)
LQ1 = MSE( Qθ1(s, a), y )
LQ2 = MSE( Qθ2(s, a), y )

What's in scope: ob_no, ac_na, next_ob_no, reward_n, terminal_n — all tensors. self.v_net, self._get_q_value, self.gamma, self.mse_loss.

Your code:

with torch.no_grad():
    v_next = self.v_net(next_ob_no).squeeze(-1)
    target = reward_n + self.gamma * (1.0 - terminal_n) * v_next

q1 = self._get_q_value(self.q_net,  ob_no, ac_na)
q2 = self._get_q_value(self.q_net2, ob_no, ac_na)

loss  = self.mse_loss(q1, target)
loss2 = self.mse_loss(q2, target)

Decoded

with torch.no_grad():

The block computing the TD target. Same pattern as AWAC's critic: targets must be detached so gradients only flow through the prediction (Q1, Q2), not through V or the targets.

v_next = self.v_net(next_ob_no).squeeze(-1)

V at the next states. self.v_net(next_ob_no) returns shape [B, 1] — the V-network ends with nn.Linear(hidden, 1). .squeeze(-1) drops the trailing dim to get shape [B], matching the shape of reward_n and terminal_n.

This is the IQL secret sauce: no actor sample, no min(Q1, Q2)(s', a'). Just V(s'). The V-network has been trained (in the very same training step, via update_v) to be the upper expectile of Q over dataset actions, so this V represents "the value of acting like the better dataset actions from s'." Bootstrapping on it is principled.

target = reward_n + self.gamma * (1.0 - terminal_n) * v_next

The Bellman target. Three pieces, all shape [B]:

reward_n — immediate reward.

self.gamma * v_next — discounted bootstrap.

(1.0 - terminal_n) — done mask. Where terminal_n = 1, this kills the bootstrap (target = reward only). Standard.

q1 = self._get_q_value(self.q_net, ob_no, ac_na)

OUT of no_grad now. We want gradients on this — it's the prediction we're training. _get_q_value handles the discrete-vs-continuous abstraction; for AntMaze (continuous), it does self.q_net(obs, actions).squeeze(-1) internally.

q2 = self._get_q_value(self.q_net2, ob_no, ac_na)

Same for the second Q-network. We train both; the V-update uses min(Q̄1, Q̄2) of the targets, but here for the Q-update each network has its own MSE loss.

loss = self.mse_loss(q1, target)

MSE between Q1 prediction and target. Returns a scalar. The downstream code does (loss + loss2).backward() at line 276, summing gradients.

loss2 = self.mse_loss(q2, target)

Same for Q2. Each Q-network has its own parameters in the optimizer, so (loss + loss2).backward() correctly distributes gradients: Q1 gets only loss's gradient, Q2 gets only loss2's gradient.

Common bug: forgetting that V is learned simultaneously

You might be tempted to wrap v_next in a stop-gradient. That's already happening via the with torch.no_grad(): block — but make sure you don't accidentally use V outside that block.

Also: V was just updated by update_v at line 198. By the time update_q runs, V has already taken one gradient step this iteration. That's intentional. The order V → Actor → Q in the training loop is deliberate: V is computed first so Q can bootstrap on it.

Change 5 of 6
estimate_advantage in IQLAgent

Where: iql_agent.py:88-91.

The math:

A(s, a) = Q(s, a) − V(s)

What's already in scope:

  • v_pi: Vφ(s) from V-net, shape [B].
  • q_sa: min(Q1, Q2)(s, a) from online Q, shape [B].

Your code:

adv = q_sa - v_pi

Decoded

adv = q_sa - v_pi

The advantage. Shape [B].

Compare to AWAC's estimate_advantage: AWAC sampled aπ ~ π(·|s) and used Q(s, aπ) as a single-sample estimate of V(s). IQL has a learned V-network, no sampling needed. Massively lower variance in the advantage estimate.

This is one of the practical benefits of IQL over AWAC: the advantage signal that drives the actor's update is much cleaner because V is a smooth, well-trained function rather than a single-sample MC estimate.

Why this single line is so important

This advantage is the only "feedback" the actor receives from the Q/V world. If the advantage is noisy, the actor's update is noisy and the policy meanders. AWAC is famous for not always converging cleanly; IQL's clean advantage is a significant practical reason it usually works better.
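The variance claim is easy to see in a toy simulation. Below, a made-up smooth Q over a 1-D action stands in for the critic; AWAC's baseline is Q at a single sampled a' ~ π, while IQL's baseline is a fixed learned scalar V (the value -0.05 here is arbitrary):

```python
import random
random.seed(0)

def q(a):                 # made-up smooth critic at one fixed state
    return -(a - 0.3) ** 2

v = -0.05                 # a fixed learned V(s); arbitrary illustrative value
a_data = 0.3              # the dataset action whose advantage we estimate

# AWAC-style: baseline is Q at one sampled action a' ~ pi(.|s), sigma = 0.5
samples = [q(a_data) - q(random.gauss(0.3, 0.5)) for _ in range(10000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

# IQL-style: baseline is the learned V(s); no sampling, zero variance
adv_iql = q(a_data) - v

print(var)       # noticeably positive (~0.125 for these toy numbers)
print(adv_iql)   # one deterministic number, every time
```

Averaging many actor samples would shrink AWAC's variance, but AWAC takes exactly one sample per transition; IQL gets a zero-variance baseline for free.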

Change 6 of 6
actor_loss in IQLAgent.train

Where: iql_agent.py:120-123.

The plumbing:

adv_n = self.estimate_advantage(ob_no, ac_na)
actor_loss = self.actor.update(ob_no, ac_na, adv_n=adv_n)

Decoded

adv_n = self.estimate_advantage(ob_no, ac_na)

Calls the function you wrote in Change 5. Returns shape [B] tensor of advantages, one per dataset transition.

Note this happens after update_v at line 115 but before update_q at line 125-128. Why? Because the advantage uses V (just updated) and Q (about to be updated). Using the not-yet-updated Q is fine — it's still a valid estimate, just slightly stale.

actor_loss = self.actor.update(ob_no, ac_na, adv_n=adv_n)

Same as AWAC: calls MLPPolicyAWAC.update with advantages, which internally computes exp_weights = exp(adv_n / lambda_awac), clamps, computes weighted MLE loss, backprop, optimizer step. Returns scalar loss for logging.

Same actor class as AWAC. The policy training is shared. The only thing IQL changes is the value learning (V via expectile, Q via V-bootstrap). The actor side is identical.

The full IQL train loop, end to end

1. Sample minibatch (already done by the trainer).

2. update_v: compute target Q (no grad), regress V toward expectile of target Q.

3. estimate_advantage: A = Q − V using the just-updated V.

4. actor.update: weighted MLE on (s, a) pairs with weight exp(A/λ).

5. update_q: compute V(s') (no grad), regress Q toward r + γ V(s').

6. update_target_network: Polyak-update target Qs from online Qs.

One iteration of IQL. Loop a million times. Never touch the environment.

Chapter 11

Running on Modal

Pre-flight (already done from Problem 1)

Same Modal + wandb setup as AWAC. Skip to the launch commands.

Part 1: sweep ζ on antmaze-umaze

Run two experiments, one per ζ value, each with 3 seeds (parallel):

# zeta = 0.2
modal run --detach modal_train_para.py --algo iql \
  --env-name antmaze-umaze-v0 \
  --iql-expectile 0.2 \
  --exp-name iql_zeta_0.2_umaze \
  --use-wandb

# zeta = 0.9
modal run --detach modal_train_para.py --algo iql \
  --env-name antmaze-umaze-v0 \
  --iql-expectile 0.9 \
  --exp-name iql_zeta_0.9_umaze \
  --use-wandb

~1 hour per experiment. Run them simultaneously — different containers don't conflict.

For each, take final-checkpoint Eval_AverageReturn across 3 seeds, fill into Table 3. Identify the better ζ (almost certainly 0.9). Write 2-3 sentences explaining why — the upper expectile better represents what's achievable from each state, leading to sharper advantage signals and faster learning.

Part 2: train on antmaze-medium-diverse with the better ζ

modal run --detach modal_train_para.py --algo iql \
  --env-name antmaze-medium-diverse-v0 \
  --iql-expectile 0.9 \
  --exp-name iql_zeta_0.9_medium_diverse \
  --use-wandb

~1.5 hours. 3 seeds in parallel. Fill Table 4 (IQL on medium-diverse) and Table 5 (AWAC vs IQL on both tasks).

Part 3: PointMass stitching

Two runs, one for IQL and one for Filtered BC:

# IQL with the better zeta from Part 1
modal run --detach modal_train_para.py --algo iql \
  --env-name PointmassMedium-v0 \
  --exp-name iql_zeta_0.9_stitching \
  --iql-expectile 0.9 \
  --offline-dataset offline_datasets/pointmass_stitching_dataset.npz \
  --use-wandb

# Filtered BC (top 10% trajectories)
modal run --detach modal_train_para.py --algo bc \
  --env-name PointmassMedium-v0 \
  --exp-name filtered_bc_stitching \
  --offline-dataset offline_datasets/pointmass_stitching_dataset.npz \
  --filter-top-percent 10 \
  --use-wandb

IQL: ~15 min. BC: ~10 min. Both 3 seeds in parallel. Fill Table 6 with mean and max return for each.

Expected outcomes:

Healthy training signals

Metric | Healthy | Bug
Critic V Loss | Decreases over time | Stays flat or explodes
Critic Q Loss | Decreases, eventually stable | Diverges (usually a shape bug)
Actor Loss | Negative, slowly drifts | Positive (BC broken)
Eval_AverageReturn | Climbs from baseline toward 0 | Plateaus or drops below the dataset's max return

Submission structure

Same as Problem 1, but with the additional layers:

P2/
├── 1/
│   ├── iql_umaze_seed1.csv     # BEST zeta only!
│   ├── iql_umaze_seed2.csv
│   └── iql_umaze_seed3.csv
└── 2/
    ├── iql_medium_maze_seed1.csv
    ├── iql_medium_maze_seed2.csv
    └── iql_medium_maze_seed3.csv

Critical: per the PDF instructions, only upload CSVs from the best-performing ζ. The autograder expects exactly 3 CSVs per folder, all from the same expectile.

Chapter 12

Cheat Sheet & Self-Quiz

Equations to memorize

Expectile loss: Lζ²(u) = | ζ − 1[u < 0] | · u²
V-loss: LV(φ) = mean( Lζ²( Q̄(s, a) − Vφ(s) ) ), with (s, a) ~ D
Q-loss: y = r + γ (1 − d) Vφ(s') (no_grad on V; no actor!), LQ(θ) = MSE(Qθ1(s, a), y) + MSE(Qθ2(s, a), y)
Actor loss (same as AWAC): A(s, a) = Q(s, a) − V(s), Lπ(ψ) = − mean( log πψ(a|s) · exp(A / λ) )

API reference

Call                                   | Returns                               | File
self.v_net(obs)                        | V(s), shape [B, 1] — squeeze for [B]  | iql_critic.py
self.get_q(obs, actions)               | min(Q1, Q2), ONLINE, shape [B]        | iql_critic.py:150
self.get_target_q(obs, actions)        | min(Q̄1, Q̄2), TARGET, shape [B]       | iql_critic.py:165
self._get_q_value(q_net, obs, actions) | Q from a single network, shape [B]    | iql_critic.py:132
self.expectile_loss(diff)              | Per-sample loss, shape [B]            | iql_critic.py:180 (you write)
self.iql_expectile                     | ζ scalar                              | iql_critic.py:129

Self-quiz

  1. What's the fundamental difference between AWAC's and IQL's TD targets?
  2. Why does IQL use a separate V-network instead of min(Q1, Q2)(s', a') with some a'?
  3. Why is ζ = 0.9 typically better than ζ = 0.2 for offline RL?
  4. What does the expectile loss reduce to when ζ = 0.5?
  5. What is "stitching" and why is it the gold standard test of offline RL?
  6. Why does IQL's advantage estimate have lower variance than AWAC's?
  7. Why is V trained against target Q rather than online Q?
  8. What happens if you set ζ = 0.99 versus 0.9?
  9. Why is the actor loss the same in IQL and AWAC?
  10. Where in the algorithm does IQL ever need to sample a' ~ π? (Hint: trick question.)
  11. If V is trained well, what does Q(s, a) − V(s) being negative tell us about action a?
  12. Why does IQL typically outperform AWAC on harder tasks like medium-diverse?
Answer key

1. AWAC's target uses min(Q̄1, Q̄2)(s', a') with a' ~ π(·|s') — requires actor samples. IQL's target uses V(s') directly — no action sampling. This eliminates OOD evaluation in the critic.

2. Because any a' — sampled from policy, mean of policy, argmax over discrete — could be slightly OOD. IQL learns V to summarize "value over the better dataset actions" without requiring a specific action choice.

3. ζ = 0.9 makes V track the upper expectile of Q over dataset actions, which approximates "value of the better dataset actions." ζ = 0.2 is pessimistic (lower expectile), produces weaker advantage signals.

4. Standard MSE up to a constant factor: at ζ = 0.5 the weight is 0.5 regardless of sign, so the loss is half the squared error. The minimizer is the mean of Q over dataset actions at each state.

5. Stitching = combining segments of multiple suboptimal dataset trajectories into a path that's better than any single trajectory. It's the test because BC cannot stitch (it imitates whole trajectories), but real offline RL can.

6. AWAC uses a single-sample MC estimate Q(s, aπ) for V, which is high-variance. IQL uses a learned V-network, smoothed via expectile regression over the entire dataset, which is much lower variance.

7. Same standard target-network argument: regression onto a moving target leads to instability. Target Q changes slowly via Polyak averaging, so it's a stable target for V.

8. ζ = 0.99 is more aggressive — V gets pushed near the max of Q over dataset actions. Can be unstable if the dataset's Q distribution at some states has heavy tails or noise. ζ = 0.9 is the typical sweet spot.

9. Both algorithms use advantage-weighted maximum likelihood: L = -mean(logπ(a|s) * exp(A/λ)). The actor only changes through different advantage estimates from the value functions.

10. Nowhere. IQL's value learning is structurally policy-free. The policy enters only via the actor update, which uses logπ(a|s) on dataset actions only.

11. That action a is below the ζ-expectile — meaning it's worse than the typical "good actions" in the dataset at state s. The AWR weight exp(A/λ) is then small but still positive: the actor barely updates toward that action, though it is never actively pushed away from it.
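To see answer 11 concretely: the AWR weight never goes negative, it only shrinks. A quick check, with λ = 1 chosen purely for illustration:

```python
import numpy as np

lam = 1.0  # temperature; value chosen only for illustration
adv = np.array([1.0, 0.0, -1.0, -3.0])
weights = np.exp(adv / lam)
# All weights are positive; negative advantages just give weights below 1,
# so the actor down-weights those dataset actions rather than avoiding them.
# weights ~ [2.72, 1.00, 0.37, 0.05]
```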

12. On longer-horizon tasks, TD errors compound. AWAC's policy can drift over many bootstraps, causing OOD evaluation in the critic, which feeds back into actor advantages. IQL never queries OOD actions in the value pipeline, so error accumulation is bounded.

Implementation order

  1. self.v_net in __init__ — 30 seconds.
  2. expectile_loss — 1 minute. Test by passing toy diff values mentally.
  3. value_loss in update_v — 1 minute. Connects expectile_loss to the V regression.
  4. Q-loss in update_q — 5 minutes. Most error-prone block. Watch for shape and no_grad.
  5. adv = q - v in estimate_advantage — 30 seconds.
  6. actor_loss in train — 1 minute. Just plumbing.

Total: ~10 minutes of typing if you understand the math. Then launch the ζ sweep on umaze (~1 hour each, parallel) and verify Eval_AverageReturn climbs.
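The six pieces above can be traced at the level of batch arrays. A numpy sketch of one full update pass, with random stand-ins for the network outputs (the real implementation uses PyTorch forward passes, optimizers, and no_grad):

```python
import numpy as np

rng = np.random.default_rng(0)
B, zeta, gamma, lam = 4, 0.9, 0.99, 1.0

# Stand-ins for network outputs on a batch, all shape [B].
q_target = rng.normal(size=B)   # min(target Q1, target Q2)(s, a), no_grad
v_s      = rng.normal(size=B)   # V(s)
v_next   = rng.normal(size=B)   # V(s'), no_grad in the Q-loss
q1, q2   = rng.normal(size=B), rng.normal(size=B)
r, d     = rng.normal(size=B), np.zeros(B)
log_pi   = rng.normal(size=B)   # log pi(a|s) on dataset actions only

# Steps 1-3. V regression: expectile loss on diff = Q_target(s, a) - V(s).
diff = q_target - v_s
v_loss = np.mean(np.where(diff < 0, 1.0 - zeta, zeta) * diff ** 2)

# Step 4. Q regression: the target never samples an action.
y = r + gamma * (1 - d) * v_next
q_loss = np.mean((q1 - y) ** 2) + np.mean((q2 - y) ** 2)

# Steps 5-6. Advantage and the AWR actor loss.
adv = np.minimum(q1, q2) - v_s
actor_loss = -np.mean(log_pi * np.exp(adv / lam))
```

Note what never appears: a sampled action a'. Every quantity is evaluated on dataset (s, a, r, s') only.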

Take it back to class

You can now teach this

Three big ideas, in order of importance:

  1. The expectile is the value-learning loss for offline RL. Train V toward the upper expectile of Q over dataset actions and you get "what's achievable from this state without leaving the data distribution." Asymmetric squared loss with weight |ζ − 1[u<0]|. Tune ζ to control aggressiveness.
  2. The V-network factors out OOD risk from the critic loop. By having V take only state, IQL's TD target r + γ V(s') never queries Q at any out-of-distribution action. The actor never enters the value updates. This decoupling is what makes IQL state-of-the-art.
  3. The actor is unchanged from AWAC. Advantage-weighted regression: imitate the dataset, but lean toward high-advantage actions. The same optimal-policy form falls out of constrained policy improvement (KL-bounded step from the data policy). What changed is just how the advantage is computed — cleanly via Q − V instead of single-sample MC.

If a friend asks: "What's the difference between AWAC and IQL?" — you say: "Both extract a policy from offline data via advantage-weighted regression. AWAC gets the value baseline by sampling the policy — cheap but high-variance, and it risks OOD drift. IQL learns a separate V-network via expectile regression, so V tracks the ζ-expectile (a quantile-like statistic) of Q over dataset actions. Because V is a state-only function, IQL's TD target uses V(s') with no action sampling, so the entire value-learning pipeline avoids querying out-of-distribution actions. That's why IQL typically wins on harder tasks."

You can teach this. Submit the writeup.