A subtle but powerful idea: learn the upper expectile of Q-values among the dataset's actions, never query out-of-distribution actions, get state-of-the-art offline RL performance. Every equation derived, every line of code annotated.
Same offline RL problem as Problem 1 of HW3: an offline dataset of (s, a, r, s', d) transitions, no environment interaction during training, two AntMaze tasks plus a PointMass stitching evaluation. Same goal: learn a policy that's better than the data, without ever acting in the world.
What's different: the algorithm. Instead of constraining the actor to stay near the data (AWAC's approach), IQL constrains the critic to never extrapolate to out-of-distribution actions in the first place. If the critic never queries OOD actions, OOD-overestimation can't happen.
The deliverables for Problem 2: a ζ sweep on antmaze-umaze (Table 3), an IQL run on antmaze-medium-diverse (Tables 4 and 5), an IQL vs. Filtered BC stitching comparison on PointMass (Table 6), and CSVs for the autograder.
By the end of this guide you'll know why IQL exists, how expectile regression works, which six code edits to make, and how to launch every required run.
You just implemented AWAC. It works. So why another algorithm?
AWAC's critic update has one line that should make you uncomfortable:
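That line is the TD target (the same one that appears in the comparison table later in this guide):

$$y = r + \gamma \, \min\big(\bar{Q}_1(s', a'),\ \bar{Q}_2(s', a')\big), \qquad a' \sim \pi(\cdot \mid s')$$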
The next-action a' comes from the current actor. The actor is trained via weighted MLE on the dataset, which keeps it approximately in-distribution. But "approximately" isn't "perfectly."
If the actor drifts even slightly — say, into a region of action space that's adjacent to but not covered by the dataset — the target Q-network has to evaluate Q at those slightly-OOD actions. Predictions there are unreliable. Errors leak in.
The drift can be subtle. The actor is a continuous Gaussian policy; even with its mean tied close to dataset actions, sampling produces tail values that wander outside the data support. AWAC's softness is its weakness.
Actor → close to data → samples a' ≈ in-distribution → target Q reliable → advantages reliable → actor stays close to data. The cycle holds as long as the BC anchor is strong. With a small dataset, weak BC, or distribution-shifted data, AWAC can still drift.
What if the critic never needed to evaluate Q(s', a') for actions sampled from the policy? What if every Q-value in the entire training pipeline was computed only at actions that exist in the dataset?
Then OOD overestimation is structurally impossible. The Q-network is only trained at in-distribution (s, a) pairs and only queried at in-distribution (s, a) pairs. Distributional shift becomes a non-issue.
That's the entire conceptual move of IQL.
Replace the actor-sampled a' with a learned V-function:
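Concretely, the Q-target becomes a function of the next state only:

$$y = r + \gamma\,(1 - d)\,V(s')$$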
If we have a good V(s') — meaning "expected return from s' under the optimal in-distribution policy" — we can compute the Q-target without sampling actions at all. The Q-network is trained purely from (s, a, r, s', d) tuples.
The catch: how do we train V? It needs to capture "value of acting well at state s, given the dataset's coverage." Not the average value (which would be too pessimistic), not the max (which would extrapolate). Something between the average and the max.
That something is the expectile.
To explain expectile regression, start with what you already know.
Regular MSE regression of V(s) toward Q-values:
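That is, fit V by minimizing

$$\mathcal{L}(\phi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[\big(Q(s,a) - V_\phi(s)\big)^2\Big]$$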
The minimizer is the mean: V(s) ends up equal to E_{a∼D}[Q(s, a)], averaged over the actions in the dataset at state s.
Why this isn't what we want: averaging includes bad actions. If the dataset has a mix of expert demonstrations and random meandering at state s, the mean dilutes the expert's value. We end up with a pessimistic V that doesn't reflect what's achievable from s.
You can replace MSE with the pinball loss to learn quantiles instead of means. The 0.5-quantile is the median; 0.9-quantile is the value below which 90% of samples fall.
Quantiles are great because the 0.9-quantile of Q-values represents "the upper end of what the dataset achieves at this state" — closer to the optimal in-distribution policy. But quantile regression is non-smooth (the loss has a kink), making optimization with stochastic gradient descent harder.
The expectile is a quantile-like quantity computed with squared rather than absolute error:
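With u = Q(s, a) − V(s), the expectile loss is

$$\mathcal{L}_2^{\zeta}(u) = \big|\zeta - \mathbb{1}[u < 0]\big|\, u^2$$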
Read this carefully:
- If u > 0 (Q is bigger than V), then 1[u<0] = 0, so the multiplier is ζ.
- If u < 0 (Q is smaller than V), then 1[u<0] = 1, so the multiplier is 1 − ζ.

The loss is asymmetric: it penalizes errors on each side differently.
For IQL we want large ζ — typically 0.7 to 0.99. This makes V regress toward the upper expectile of Q over the dataset's actions, which approximates "value of the better dataset actions at state s."
V should reflect "value of acting well from s, within the dataset's coverage." The mean is too pessimistic (includes bad actions). The max is impossible (extrapolates). The upper expectile threads the needle: "how good is the value of the better-than-average actions in the dataset at this state?"
That value is exactly what we want to bootstrap from in the Q-update. The agent doesn't need to take the mean action or some impossible argmax — it should take an action like the better dataset actions, and V tells us what the return looks like from there.
Expectile regression with ζ = 0.9 learns a "soft max" of Q over dataset actions: optimistic enough to represent the policy's potential, conservative enough to never extrapolate beyond what was seen.
The expectile loss isn't a heuristic. It's the unique loss whose minimizer is the ζ-expectile of the target distribution. Let's see why.
The ζ-expectile of a distribution X is the value m that minimizes the expected expectile loss between X and m. The minimizer exists and is unique for any ζ ∈ (0, 1).
When ζ = 0.5, the multiplier |0.5 − 1[X < m]| is always 0.5, regardless of which side X is on. The loss reduces to standard MSE up to a constant:
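For any X and m:

$$\big|0.5 - \mathbb{1}[X < m]\big|\,(X - m)^2 = \tfrac{1}{2}\,(X - m)^2$$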
And we know the MSE-minimizer is the mean. So m_{0.5}(X) = E[X].
For ζ > 0.5: positive errors (X > m) get weight ζ, negative errors (X < m) get weight 1−ζ. With ζ = 0.9, positive errors are penalized 9× more than negative errors.
To minimize loss, we want to avoid positive errors — meaning we don't want X > m to happen often. So m gets pushed up, until only ~10% of the X distribution lies above m. That's the upper expectile.
(Technically the expectile isn't exactly the 90th percentile — it's defined via squared rather than absolute deviations — but it's a smoothed cousin of the quantile and serves the same purpose.)
Treat Q(s, a) over a ~ D(·|s) as our distribution X. Train V(s) to be the ζ-expectile of that distribution:
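The V objective is

$$\mathcal{L}_V(\phi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[\big|\zeta - \mathbb{1}[u < 0]\big|\,u^2\Big], \qquad u = Q(s,a) - V_\phi(s)$$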
The gradient of this with respect to V's parameters:
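Treating the asymmetric weight as piecewise constant, the chain rule gives

$$\nabla_\phi \mathcal{L}_V = \mathbb{E}\Big[{-2}\,\big|\zeta - \mathbb{1}[u < 0]\big|\;u\;\nabla_\phi V_\phi(s)\Big]$$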
At the minimum (gradient = 0), V(s) settles at the value where positive diffs occur with weight ζ and negative diffs with weight 1−ζ. That's the upper expectile.
Expectile regression learns "the value of acting like the better dataset actions" without needing to know which actions are better in advance. The asymmetric loss does the upper-percentile selection automatically through gradient descent.
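To see the upper-expectile behavior numerically, here is a small standalone sketch (not part of the homework code; the data and names are illustrative) that fits the 0.9-expectile of a sample by gradient descent on the asymmetric loss:

```python
import torch

torch.manual_seed(0)
# A mixed bag of "Q-values": mostly mediocre, a few good ones.
x = torch.cat([torch.randn(900), torch.randn(100) + 5.0])

zeta = 0.9
m = torch.zeros(1, requires_grad=True)  # candidate expectile
opt = torch.optim.Adam([m], lr=0.05)

for _ in range(2000):
    diff = x - m                                   # u = X - m
    weight = torch.abs(zeta - (diff < 0).float())  # |zeta - 1[u<0]|
    loss = (weight * diff.pow(2)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"mean = {x.mean():.2f}, max = {x.max():.2f}, "
      f"0.9-expectile ≈ {m.item():.2f}")
# The expectile lands well above the mean but below the max:
# optimistic without extrapolating past the sample.
```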
The PDF (Equation 6) writes:
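Reconstructing from the note that follows, that is presumably the same asymmetric loss with a non-strict indicator:

$$\mathcal{L}_2^{\zeta}(u) = \big|\zeta - \mathbb{1}\{u \le 0\}\big|\,u^2$$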
Note: 1{u ≤ 0} with ≤ (not <). Mathematically equivalent for continuous u (probability of exact zero is zero); for code it doesn't matter which you pick. PyTorch's (diff < 0).float() works either way.
IQL maintains three network roles. (AWAC had two: actor and critic, with target networks for the critic.) IQL has three because the V-function is its own component.
| Network | Symbol | Role | Trained on |
|---|---|---|---|
| Q-networks | Qθ1, Qθ2 | Estimate Q(s, a) for dataset actions | TD: y = r + γ V(s') |
| Target Q-networks | Q̄1, Q̄2 | Stable target for the V regression | Polyak from Q-online |
| V-network | Vφ | Upper expectile of Q over dataset actions | Expectile regression toward Q̄(s, a) |
Three updates per step, one for each network. The V update reads target Q (no grad). The Q update reads V (no grad). The target Q is slowly EMA'd from online Q. No actor anywhere in this graph — the policy is a separate component (more on that below).
You might ask: why train V at all? Couldn't we just use min(Q1, Q2)(s, a) in the Q-target by some trick?
The whole point of IQL is to never compute Q at OOD actions. min(Q1, Q2)(s', a') requires choosing some a'. Whatever we pick — mean of policy, sample from policy, argmax over discrete actions — that a' isn't guaranteed to be in the dataset. Even if it usually is, sometimes it isn't, and that's enough to leak OOD errors.
By introducing V(s) trained via expectile regression, we never need a' for the Q-target. V(s') is just a function of s'. No action choice required. OOD-free.
IQL trains a policy as a separate module that's not involved in any of the value updates. The policy is trained the same way as AWAC's actor: weighted maximum likelihood, with weights exp(A(s, a) / λ) where A(s, a) = Q(s, a) − V(s).
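In symbols, the actor objective (identical to AWAC's):

$$\mathcal{L}_\pi = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[\log \pi(a \mid s)\;\exp\!\big(A(s,a)/\lambda\big)\Big], \qquad A(s,a) = Q(s,a) - V(s)$$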
The policy doesn't influence Q-updates. The policy doesn't influence V-updates. The policy just consumes the trained Q and V to figure out which dataset actions to imitate more strongly. Decoupled.
IQL splits offline RL into two phases that don't interfere with each other: (1) value estimation via expectile regression, (2) policy extraction via advantage-weighted regression. Each phase is solved cleanly without distributional-shift concerns.
Both algorithms use advantage-weighted regression for the actor. The difference is entirely in how they compute Q and V.
| Component | AWAC | IQL |
|---|---|---|
| Actor loss | − mean(log π(a|s) · exp(A/λ)) | Same as AWAC |
| Q-target | r + γ min(Q̄1, Q̄2)(s', a') where a' ~ π(·|s') | r + γ V(s') (no actor!) |
| V-estimate | Q(s, aπ) with single MC sample | Separate V-net trained via expectile regression |
| Networks | Actor + 2 Q + 2 target Q = 5 | Actor + 2 Q + 2 target Q + V = 6 |
| OOD risk | Possible if actor drifts | None — V never queries OOD actions |
| Tunable | λ only | ζ (expectile) and λ |
| Typical use | Easier tasks, smaller datasets | Harder tasks, larger datasets, sparse rewards |
Why does IQL typically outperform AWAC on harder tasks? The PDF asks you to write 3 sentences about this. Here's the structure of a strong answer.
Sentence 1: AWAC's TD target requires sampling a' ~ π(·|s') from the actor, which can drift outside the data distribution as training progresses; this lets OOD evaluation errors enter the value estimates and propagate.
Sentence 2: IQL replaces that sample with a learned V(s') trained via expectile regression on dataset (s, a) pairs, so every Q-value query in the entire pipeline is at an in-distribution action, eliminating OOD overestimation by construction.
Sentence 3: On longer-horizon tasks like antmaze-medium where TD errors compound over many bootstrap steps, IQL's stricter avoidance of OOD queries leads to more reliable value estimates and substantially better final performance.
The homework asks you to sweep ζ on antmaze-umaze across {0.2, 0.9}. What should you expect?
| ζ | V learns | Effect on policy |
|---|---|---|
| 0.2 | Lower expectile of Q at each state — pessimistic | Advantage A = Q − V is often positive; weights are diffuse; policy ends up close to plain BC |
| 0.5 | Mean of Q (standard MSE) — average | Reduces to AWAC-like behavior with single-sample V replaced by mean V |
| 0.7–0.9 | Upper expectile — "value of the good dataset actions" | Advantage A = Q − V is sharp around 0; high-Q actions get heavily upweighted; policy improves over data |
| 0.99 | Near-max of Q — aggressive | V can become unstable; if max in dataset has high variance, V tracks noise |
For antmaze-umaze, ζ = 0.9 typically wins. You'll confirm this empirically and report the better ζ for the medium-diverse run.
Part 3 of Problem 2 asks about stitching. This is the gold-standard test of whether an offline RL algorithm is genuinely doing more than imitation.
The ability of an offline RL algorithm to combine parts of multiple suboptimal trajectories in the dataset into a single trajectory that's better than any individual trajectory in the dataset.
Concrete example: imagine a navigation task where the dataset contains two suboptimal trajectories, A and B, that pass through a shared intermediate region of the state space.
Neither A nor B alone reaches the goal optimally. But the concatenation of A's first half + B's second half could reach the goal in fewer steps than either — total return -80, say. That's stitching.
Behavior cloning can't stitch. BC just imitates trajectories; the best you can do is reproduce the best trajectory in the dataset. Filtered BC (top 10% by return) can do slightly better, but still bounded by the best single trajectory.
True offline RL can stitch because the Q-function captures state-conditioned value, not trajectory-level identity. If state s appears in both A and B, the algorithm learns "from state s, the best continuation is whatever B does" — even if no full trajectory does what we want.
The homework's pointmass_stitching_dataset.npz is curated specifically to test this: no single trajectory reaches the goal efficiently, but segments of different trajectories overlap in state space, so they can be recombined into a better path.
You'll train both IQL and Filtered BC on this dataset. The deliverable: report mean and max return across 3 seeds, plus trajectory visualizations.
What success looks like:
If IQL achieves > -46 mean return, you've demonstrated stitching. The trajectory visualization should show paths to the goal that don't exactly match any single dataset trajectory but are clearly composed of pieces of multiple ones.
The Q-function generalizes across (s, a) pairs. If state s appears in trajectory A and a similar state appears in trajectory B, Q learns the best action at that state based on what worked in either trajectory. The advantage-weighted policy then takes the best in-distribution action at each state, regardless of which original trajectory contained it.
Three losses, three updates per step. The order matters subtly: the V update runs first, regressing against the current target Q. The actor update then reads the just-updated V (and the current Q). The Q update bootstraps on the new V. Finally the target Qs are soft-updated. By the next iteration, all three have moved together by one step.
Notice the key absence: nowhere does the algorithm sample a' ~ π(·|s'). The policy never enters the value loops. OOD-free.
The IQL implementation spans two files. (The actor uses the same MLPPolicyAWAC class from Problem 1 — you're done with that.)
| File | Class | Responsibility |
|---|---|---|
| critics/iql_critic.py | IQLCritic | Q-net, V-net, target Qs, expectile loss, V-update, Q-update |
| agents/iql_agent.py | IQLAgent | Orchestrator — advantage estimation + train loop |
Look at iql_critic.py:55-130 for the constructor. It already builds:
- self.q_net, self.q_net2, self.q_net_target, self.q_net2_target — same setup as AWAC.
- self.optimizer over both online Q-nets — combined Adam.
- self.mse_loss — helper for Q-update.
- self.iql_expectile = ζ.
- self.tau = soft-update rate.

It also pre-builds self.v_optimizer and self.v_learning_rate_scheduler — but only after a missing self.v_net = ... line that you have to add. That's edit #1.
_get_q_value(q_net, obs, actions) at iql_critic.py:132: handles discrete vs continuous Q-network signatures, returns shape [B].
get_q(obs, actions) at iql_critic.py:150: returns min(Q1, Q2)(s, a) using ONLINE networks.
get_target_q(obs, actions) at iql_critic.py:165: returns min(Q̄1, Q̄2)(s, a) using TARGET networks.
update_target_network() at iql_critic.py:286: Polyak averages both target Qs from online Qs.
1. self.v_net definition in __init__ (one line).
2. expectile_loss(diff) body (one line).
3. value_loss in update_v (one line).
4. loss, loss2 in update_q (a small block).
5. adv = q − v in iql_agent.py:estimate_advantage (one line).
6. actor_loss in iql_agent.py:train (a small block).

Six edits, mostly small. The hard work is conceptual; the code is ~10 lines total.
Per-line annotations for every blank you'll fill. This is the centerpiece chapter.
Where: iql_critic.py:116-119.
What's nearby: the helper v_network_initializer already exists at line 87, pulled from hparams['v_func']. Reading the docstring, it's a callable: v_func(ob_dim) → V-network with output [B, 1].
Look at how q_net is built at line 89:
```python
self.q_net = q_network_initializer(self.ob_dim, self.ac_dim)
self.q_net.to(ptu.device)
```
Your code:
```python
self.v_net = v_network_initializer(self.ob_dim)
self.v_net.to(ptu.device)
```
Build the V-network. v_network_initializer takes only ob_dim (no action dim — V is a state-only function). Returns an MLP whose output is [B, 1].
Compare to Q: q_network_initializer(ob_dim, ac_dim) takes both because Q is a function of state AND action. V skips the action argument.
Move to GPU (if available). Same pattern as the Q-networks. Without this, V lives on CPU and the forward pass would fail when given GPU-resident observations.
That's it. Two lines. The v_optimizer on line 121 references self.v_net.parameters() — if you don't define v_net first, you'd get AttributeError.
Where: iql_critic.py:193-196.
The math:
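Per sample, with diff = Q̄(s, a) − V(s):

$$\mathcal{L}_2^{\zeta}(\mathrm{diff}) = \big|\zeta - \mathbb{1}[\mathrm{diff} < 0]\big|\,\mathrm{diff}^2$$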
What's in scope: diff — tensor of shape [B]; self.iql_expectile — scalar ζ.
Your code:
```python
weight = torch.abs(self.iql_expectile - (diff < 0).float())
return weight * diff.pow(2)
```
The asymmetric weight, computed per element of diff. Four operations:
1. (diff < 0) — elementwise comparison, returns a bool tensor of same shape.
2. .float() — cast bool to float (True → 1.0, False → 0.0). This is the indicator 1[diff < 0].
3. self.iql_expectile - (...) — subtract. When diff > 0: result is ζ − 0 = ζ. When diff < 0: result is ζ − 1 (negative for any ζ < 1).
4. torch.abs(...) — absolute value. Now: when diff > 0, weight = ζ; when diff < 0, weight = 1 − ζ (the sign flip becomes positive after abs).
So weight at each sample is exactly |ζ − 1[diff<0]|.
Elementwise multiplication with diff squared. Returns a per-sample expectile loss, shape [B].
Note: diff.pow(2) is the same as diff ** 2 or diff * diff. Stylistic preference.
Why not mean here? Because update_v calls .mean() later. Returning per-sample lets the caller decide reduction. (Useful for inspecting the loss distribution during debugging.)
You'll see this same loss written equivalently as:
```python
weight = torch.where(diff > 0, self.iql_expectile, 1 - self.iql_expectile)
return weight * diff.pow(2)
```
Mathematically identical. Slightly more readable but slightly slower (torch.where dispatches a kernel). Both work.
Where: iql_critic.py:226-228.
The math:
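The minibatch V loss:

$$\mathcal{L}_V = \frac{1}{B}\sum_{i=1}^{B} \mathcal{L}_2^{\zeta}\big(\bar{Q}(s_i, a_i) - V_\phi(s_i)\big)$$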
What's already in scope:
- q_t_values: Q̄(s, a) from target nets, shape [B], computed with no_grad at line 218.
- v_t: Vφ(s) from current online V, shape [B], computed at line 221.

Your code:
```python
value_loss = self.expectile_loss(q_t_values - v_t).mean()
```
One line, three operations.
1. q_t_values - v_t — elementwise difference, shape [B]. This is "diff" from the previous task. Positive when target Q is above current V (V is too low); negative when V is above target Q.
2. self.expectile_loss(...) — calls the function you wrote in Change 2. Returns per-sample expectile losses, shape [B].
3. .mean() — reduces to a scalar. Standard reduction for SGD on minibatch loss.
Critical detail: q_t_values was computed under with torch.no_grad(): at line 217. So gradients only flow through v_t. The V-network learns; the target Q-network doesn't. Good — we want a stable target for V to regress against, just like target networks for Q in standard TD.
If we used online Q, the regression target moves whenever Q updates. V chases a moving target. With target Q (slowly EMA'd), the regression target is stable on the timescale of one V update. Same standard target-network argument.
Where: iql_critic.py:269-273.
The math:
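The TD target and the two Q losses:

$$y_i = r_i + \gamma\,(1 - d_i)\,V_\phi(s_i'), \qquad \mathcal{L}_{Q_k} = \frac{1}{B}\sum_{i=1}^{B}\big(Q_{\theta_k}(s_i, a_i) - y_i\big)^2, \quad k \in \{1, 2\}$$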
What's in scope: ob_no, ac_na, next_ob_no, reward_n, terminal_n — all tensors. self.v_net, self._get_q_value, self.gamma, self.mse_loss.
Your code:
```python
with torch.no_grad():
    v_next = self.v_net(next_ob_no).squeeze(-1)
    target = reward_n + self.gamma * (1.0 - terminal_n) * v_next

q1 = self._get_q_value(self.q_net, ob_no, ac_na)
q2 = self._get_q_value(self.q_net2, ob_no, ac_na)
loss = self.mse_loss(q1, target)
loss2 = self.mse_loss(q2, target)
```
The block computing the TD target. Same pattern as AWAC's critic: targets must be detached so gradients only flow through the prediction (Q1, Q2), not through V or the targets.
V at the next states. self.v_net(next_ob_no) returns shape [B, 1] — the V-network ends with nn.Linear(hidden, 1). .squeeze(-1) drops the trailing dim to get shape [B], matching the shape of reward_n and terminal_n.
This is the IQL secret sauce: no actor sample, no min(Q1, Q2)(s', a'). Just V(s'). The V-network has been trained (in the very same training step, via update_v) to be the upper expectile of Q over dataset actions, so this V represents "the value of acting like the better dataset actions from s'." Bootstrapping on it is principled.
The Bellman target. Three pieces, all shape [B]:
• reward_n — immediate reward.
• self.gamma * v_next — discounted bootstrap.
• (1.0 - terminal_n) — done mask. Where terminal_n = 1, this kills the bootstrap (target = reward only). Standard.
OUT of no_grad now. We want gradients on this — it's the prediction we're training. _get_q_value handles the discrete-vs-continuous abstraction; for AntMaze (continuous), it does self.q_net(obs, actions).squeeze(-1) internally.
Same for the second Q-network. We train both; the V-update uses min(Q̄1, Q̄2) of the targets, but here for the Q-update each network has its own MSE loss.
MSE between Q1 prediction and target. Returns a scalar. The downstream code does (loss + loss2).backward() at line 276, summing gradients.
Same for Q2. Each Q-network has its own parameters in the optimizer, so (loss + loss2).backward() correctly distributes gradients: Q1 gets only loss's gradient, Q2 gets only loss2's gradient.
You might be tempted to wrap v_next in a stop-gradient. That's already happening via the with torch.no_grad(): block — but make sure you don't accidentally use V outside that block.
Also: V was just updated by update_v at line 198. By the time update_q runs, V has already taken one gradient step this iteration. That's intentional. The order V → Actor → Q in the training loop is deliberate: V is computed first so Q can bootstrap on it.
Where: iql_agent.py:88-91.
The math:
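The advantage at dataset actions:

$$A(s, a) = \min\big(Q_1(s, a),\ Q_2(s, a)\big) - V_\phi(s)$$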
What's already in scope:
- v_pi: Vφ(s) from V-net, shape [B].
- q_sa: min(Q1, Q2)(s, a) from online Q, shape [B].

Your code:
```python
adv = q_sa - v_pi
```
The advantage. Shape [B].
Compare to AWAC's estimate_advantage: AWAC sampled aπ ~ π(·|s) and used Q(s, aπ) as a single-sample estimate of V(s). IQL has a learned V-network, no sampling needed. Massively lower variance in the advantage estimate.
This is one of the practical benefits of IQL over AWAC: the advantage signal that drives the actor's update is much cleaner because V is a smooth, well-trained function rather than a single-sample MC estimate.
This advantage is the only "feedback" the actor receives from the Q/V world. If the advantage is noisy, the actor's update is noisy and the policy meanders. AWAC is famous for not always converging cleanly; IQL's clean advantage is a significant practical reason it usually works better.
Where: iql_agent.py:120-123.
The plumbing:
```python
adv_n = self.estimate_advantage(ob_no, ac_na)
actor_loss = self.actor.update(ob_no, ac_na, adv_n=adv_n)
```
Calls the function you wrote in Change 5. Returns shape [B] tensor of advantages, one per dataset transition.
Note this happens after update_v at line 115 but before update_q at line 125-128. Why? Because the advantage uses V (just updated) and Q (about to be updated). Using the not-yet-updated Q is fine — it's still a valid estimate, just slightly stale.
Same as AWAC: calls MLPPolicyAWAC.update with advantages, which internally computes exp_weights = exp(adv_n / lambda_awac), clamps, computes weighted MLE loss, backprop, optimizer step. Returns scalar loss for logging.
Same actor class as AWAC. The policy training is shared. The only thing IQL changes is the value learning (V via expectile, Q via V-bootstrap). The actor side is identical.
1. Sample minibatch (already done by the trainer).
2. update_v: compute target Q (no grad), regress V toward expectile of target Q.
3. estimate_advantage: A = Q − V using the just-updated V.
4. actor.update: weighted MLE on (s, a) pairs with weight exp(A/λ).
5. update_q: compute V(s') (no grad), regress Q toward r + γ V(s').
6. update_target_network: Polyak-update target Qs from online Qs.
One iteration of IQL. Loop a million times. Never touch the environment.
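Condensed into code, one iteration looks roughly like this — a sketch using the method names from the homework files; exact signatures in your codebase may differ:

```python
def train_step(agent, ob_no, ac_na, re_n, next_ob_no, terminal_n):
    # 1. V-update: regress V toward the zeta-expectile of target Q(s, a).
    v_loss = agent.critic.update_v(ob_no, ac_na)

    # 2. Advantage at dataset actions: A = min(Q1, Q2)(s, a) - V(s).
    adv_n = agent.estimate_advantage(ob_no, ac_na)

    # 3. Actor: advantage-weighted MLE on dataset (s, a) pairs.
    actor_loss = agent.actor.update(ob_no, ac_na, adv_n=adv_n)

    # 4. Q-update: regress both Q-nets toward r + gamma * (1 - d) * V(s').
    q_loss = agent.critic.update_q(ob_no, ac_na, next_ob_no, re_n, terminal_n)

    # 5. Polyak-update the target Q networks from the online ones.
    agent.critic.update_target_network()
    return v_loss, actor_loss, q_loss
```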
Same Modal + wandb setup as AWAC. Skip to the launch commands.
Run two experiments, one per ζ value, each with 3 seeds (parallel):
```bash
# zeta = 0.2
modal run --detach modal_train_para.py --algo iql \
  --env-name antmaze-umaze-v0 \
  --iql-expectile 0.2 \
  --exp-name iql_zeta_0.2_umaze \
  --use-wandb

# zeta = 0.9
modal run --detach modal_train_para.py --algo iql \
  --env-name antmaze-umaze-v0 \
  --iql-expectile 0.9 \
  --exp-name iql_zeta_0.9_umaze \
  --use-wandb
```
~1 hour per experiment. Run them simultaneously — different containers don't conflict.
For each, take final-checkpoint Eval_AverageReturn across 3 seeds, fill into Table 3. Identify the better ζ (almost certainly 0.9). Write 2-3 sentences explaining why — the upper expectile better represents what's achievable from each state, leading to sharper advantage signals and faster learning.
```bash
modal run --detach modal_train_para.py --algo iql \
  --env-name antmaze-medium-diverse-v0 \
  --iql-expectile 0.9 \
  --exp-name iql_zeta_0.9_medium_diverse \
  --use-wandb
```
~1.5 hours. 3 seeds in parallel. Fill Table 4 (IQL on medium-diverse) and Table 5 (AWAC vs IQL on both tasks).
Two runs, one for IQL and one for Filtered BC:
```bash
# IQL with the better zeta from Part 1
modal run --detach modal_train_para.py --algo iql \
  --env-name PointmassMedium-v0 \
  --exp-name iql_zeta_0.9_stitching \
  --iql-expectile 0.9 \
  --offline-dataset offline_datasets/pointmass_stitching_dataset.npz \
  --use-wandb

# Filtered BC (top 10% trajectories)
modal run --detach modal_train_para.py --algo bc \
  --env-name PointmassMedium-v0 \
  --exp-name filtered_bc_stitching \
  --offline-dataset offline_datasets/pointmass_stitching_dataset.npz \
  --filter-top-percent 10 \
  --use-wandb
```
IQL: ~15 min. BC: ~10 min. Both 3 seeds in parallel. Fill Table 6 with mean and max return for each.
Expected outcomes:
| Metric | Healthy | Bug |
|---|---|---|
| Critic V Loss | Decreases over time | Stays flat or explodes |
| Critic Q Loss | Decreases, eventually stable | Diverges — usually shape bug |
| Actor Loss | Negative, slowly drifts | Positive (BC broken) |
| Eval_AverageReturn | Climbs from baseline toward 0 | Plateau or drops below dataset max return |
Same as Problem 1, but with the additional layers:
```
P2/
├── 1/
│   ├── iql_umaze_seed1.csv   # BEST zeta only!
│   ├── iql_umaze_seed2.csv
│   └── iql_umaze_seed3.csv
└── 2/
    ├── iql_medium_maze_seed1.csv
    ├── iql_medium_maze_seed2.csv
    └── iql_medium_maze_seed3.csv
```
Critical: per the PDF instructions, only upload CSVs from the best-performing ζ. The autograder expects exactly 3 CSVs per folder, all from the same expectile.
| Call | Returns | File |
|---|---|---|
| self.v_net(obs) | V(s), shape [B, 1] — squeeze for [B] | iql_critic.py |
| self.get_q(obs, actions) | min(Q1, Q2) ONLINE, shape [B] | iql_critic.py:150 |
| self.get_target_q(obs, actions) | min(Q̄1, Q̄2) TARGET, shape [B] | iql_critic.py:165 |
| self._get_q_value(q_net, obs, actions) | Q from a single network, shape [B] | iql_critic.py:132 |
| self.expectile_loss(diff) | Per-sample loss, shape [B] | iql_critic.py:180 (you write) |
| self.iql_expectile | ζ scalar | iql_critic.py:129 |
A few of the self-check questions, with answers below: Why can't the Q-target use min(Q1, Q2)(s', a') with some a'? Why not a' ~ π? (Hint: trick question.) What does Q(s, a) − V(s) being negative tell us about action a?

1. AWAC's target uses min(Q̄1, Q̄2)(s', a') with a' ~ π(·|s') — requires actor samples. IQL's target uses V(s') directly — no action sampling. This eliminates OOD evaluation in the critic.
2. Because any a' — sampled from policy, mean of policy, argmax over discrete — could be slightly OOD. IQL learns V to summarize "value over the better dataset actions" without requiring a specific action choice.
3. ζ = 0.9 makes V track the upper expectile of Q over dataset actions, which approximates "value of the better dataset actions." ζ = 0.2 is pessimistic (lower expectile), produces weaker advantage signals.
4. Standard MSE up to a constant. The minimizer is the mean of Q over dataset actions at each state.
5. Stitching = combining segments of multiple suboptimal dataset trajectories into a path that's better than any single trajectory. It's the test because BC cannot stitch (it imitates whole trajectories), but real offline RL can.
6. AWAC uses a single-sample MC estimate Q(s, aπ) for V, which is high-variance. IQL uses a learned V-network, smoothed via expectile regression over the entire dataset, which is much lower variance.
7. Same standard target-network argument: regression onto a moving target leads to instability. Target Q changes slowly via Polyak averaging, so it's a stable target for V.
8. ζ = 0.99 is more aggressive — V gets pushed near the max of Q over dataset actions. Can be unstable if the dataset's Q distribution at some states has heavy tails or noise. ζ = 0.9 is the typical sweet spot.
9. Both algorithms use advantage-weighted maximum likelihood: L = -mean(logπ(a|s) * exp(A/λ)). The actor only changes through different advantage estimates from the value functions.
10. Nowhere. IQL's value learning is structurally policy-free. The policy enters only via the actor update, which uses logπ(a|s) on dataset actions only.
11. That action a is below the ζ-expectile — meaning it's worse than typical "good actions" in the dataset at state s. The exp(A/λ) weight is small, so the actor barely updates toward this action. Corresponds to a negative advantage (and hence a small but still positive weight) in the AWR formula.
12. On longer-horizon tasks, TD errors compound. AWAC's policy can drift over many bootstraps, causing OOD evaluation in the critic, which feeds back into actor advantages. IQL never queries OOD actions in the value pipeline, so error accumulation is bounded.
- __init__ — 30 seconds.
- update_v — 1 minute. Connects expectile_loss to the V regression.
- update_q — 5 minutes. Most error-prone block. Watch for shape and no_grad.
- estimate_advantage — 30 seconds.
- train — 1 minute. Just plumbing.

Total: ~10 minutes of typing if you understand the math. Then launch the ζ sweep on umaze (~1 hour each, parallel) and verify Eval_AverageReturn climbs.
Three big ideas, in order of importance:
1. Expectile regression turns MSE into an upper-expectile regression via the asymmetric weight |ζ − 1[u<0]|. Tune ζ to control aggressiveness.
2. The TD target r + γ V(s') never queries Q at any out-of-distribution action.
3. The actor never enters the value updates. This decoupling is what makes IQL state-of-the-art.

If a friend asks: "What's the difference between AWAC and IQL?" — you say: "Both extract a policy from offline data via advantage-weighted regression. AWAC gets the value baseline by sampling the policy — cheap but high variance and risks OOD drift. IQL learns a separate V-network via expectile regression, which is the ζ-quantile-like statistic of Q over dataset actions. Because V is a state-only function, IQL's TD target uses V(s') with no action sampling, so the entire value-learning pipeline avoids querying out-of-distribution actions. That's why IQL typically wins on harder tasks."
You can teach this. Submit the writeup.