Eight self-contained HTML guides covering the full sequence of imitation learning, reinforcement learning, and offline RL taught in CS 224R. Each one assumes no prior knowledge, derives every equation, annotates every line of code you'll write, and includes a self-quiz so you can verify mastery.
The eight guides form a single conceptual sequence. Each builds on the failures and ideas of the previous one. Read in this order and you'll have covered the canonical foundations of modern robot learning.
- HW1 → HW2: from imitating an expert to learning from reward.
- HW2 → HW3: from interactive learning (with environment access) to offline learning (a frozen dataset).
- Within each homework: increasingly sophisticated answers to "what's wrong with the previous approach?" Each guide explicitly motivates itself by the failures of the one before.
A bird, a hammer-and-nail physics task, and an expert that sometimes makes random choices. Three problems explore one truth: imitating a multimodal expert with a unimodal regressor doesn't work, and there are two clean ways to fix it.
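To make the failure concrete, here is a toy sketch (not homework code; every name is made up): an expert whose action distribution has two modes, and the MSE-optimal constant predictor, which outputs the mean and therefore neither mode.

```python
# Toy illustration (not homework code): MSE regression on a bimodal expert.
import numpy as np

rng = np.random.default_rng(0)
# Expert swerves left (-1) or right (+1) with equal probability: two clean modes.
expert_actions = rng.choice([-1.0, 1.0], size=10_000)

# For any constant prediction a_hat, E[(a - a_hat)^2] is minimized at E[a],
# so the best an MSE regressor can do on this data is the mean of the modes.
a_hat = expert_actions.mean()
print(a_hat)  # ~0.0 -- an action the expert never takes
```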
No expert. Just a reward function and the ability to interact with the environment. Three problems walk from tabular Q-learning to two complete neural-network RL algorithms (PPO and a SAC-style off-policy method), motivating each architectural choice by the failures of the previous algorithm. The implementation walkthrough runs from compute_gae's recursion, through the importance ratio in log-space, to the min-of-two-surrogates clip (sketched below).
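As a hedged sketch of those three pieces: the name compute_gae comes from the text above, but the signature, shapes, and the ppo_clip_loss helper here are illustrative assumptions, not the assignment's actual interface.

```python
# Sketch of the three pieces named above (signatures and shapes are
# assumptions, not the starter-code interface).
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward recursion: A_t = delta_t + gamma*lam*(1 - done_t)*A_{t+1},
    with delta_t = r_t + gamma*(1 - done_t)*V(s_{t+1}) - V(s_t).
    `values` has one extra entry for the final bootstrap state."""
    T = len(rewards)
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * nonterminal * values[t + 1] - values[t]
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
    return adv

def ppo_clip_loss(new_logp, old_logp, adv, eps=0.2):
    # Importance ratio computed in log-space for numerical stability.
    ratio = torch.exp(new_logp - old_logp)
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    # Pessimistic min of the two surrogates, negated for gradient descent.
    return -torch.min(surr1, surr2).mean()
```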
A frozen dataset. No env queries during training. Two algorithms attack the same problem from opposite directions: AWAC constrains the policy to stay near the data; IQL constrains the value function so it never queries out-of-distribution actions. Both are essential, and both are widely used.
The AWAC guide derives the exp(A/λ) weight from a constrained policy improvement problem (a KL-bounded step away from the data policy), explains why this weight keeps the actor in-distribution, and contrasts the approach with IQL. The IQL guide replaces the bootstrapped target min(Q1, Q2)(s', a') with a separately learned value function trained by expectile regression, and includes the stitching analysis on PointMass. Minimal sketches of both objectives follow.
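Here is a minimal sketch of the two objectives just described; the function names, tensor shapes, hyperparameters (lam, tau), and the weight clamp are illustrative assumptions rather than the homework's API.

```python
# Hedged sketches of the AWAC actor objective and the IQL expectile loss.
import torch

def awac_actor_loss(logp, q, v, lam=1.0):
    """Advantage-weighted regression: maximize E[log pi(a|s) * exp(A/lam)]
    over dataset actions, so the actor never strays far from the data."""
    adv = q - v
    weights = torch.exp(adv / lam).detach()    # no gradient through the weight
    weights = torch.clamp(weights, max=100.0)  # common guard against blow-up
    return -(weights * logp).mean()

def expectile_loss(q, v, tau=0.7):
    """IQL's value objective: an asymmetric L2 that upweights positive
    residuals (tau > 0.5), pushing V toward an upper expectile of Q
    using only dataset actions -- no OOD action is ever evaluated."""
    diff = q.detach() - v
    weight = torch.abs(tau - (diff < 0).float())  # tau if diff > 0, else 1 - tau
    return (weight * diff.pow(2)).mean()
```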
Each guide is independently readable. Pick a mode based on what you need.

- **Front to back:** Read all eight in order. Each chapter assumes you read the previous guide. Best for first-time learners who want the canonical ML/RL arc.
- **One homework:** Jump to the guide for the homework you're stuck on. Each one stands alone; the implementation-walkthrough chapter has line-by-line annotations of every blank you'll fill in.
- **Reference:** Use the table of contents in each guide to jump to the chapter on a specific topic (expectile loss, action chunking, GAE, clipped double-Q, etc.). The chapter cross-references the rest of the guide for context.
Each guide ends with a 12-question self-quiz with answer key. If you can answer them without re-reading, you've mastered the material.
All eight follow the same structure. Glance at any one and you'll know where to find what.
| Chapter | Purpose |
|---|---|
| Setup | The task in concrete terms; what success looks like; what you'll know by the end |
| Why this exists | The failure mode of the previous algorithm that motivates this one |
| The math | Every equation derived from first principles, not stated |
| PyTorch / NumPy primer | The minimum library knowledge needed for the implementation chapter |
| Code tour | Walk through the starter code so you know what's already done and what's missing |
| Your changes, decoded | The centerpiece. Every line you'll write, with a per-line explanation of what each operation does |
| Running it | Commands to run, expected results, healthy training signals, common bugs |
| Cheat sheet & quiz | Equations, API reference, 12-question self-test with answer key |
Some ideas appear in multiple guides at progressively deeper levels. Spotting them is part of the value of reading the whole sequence.
**The Bellman backup.** First seen in tabular Q-learning (HW2 P1). It returns as the TD target inside PPO's GAE (HW2 P2), as the off-policy critic update (HW2 P3), and as the foundation of every offline RL algorithm in HW3. The same recursion, scaled up to neural networks and stabilized with target nets and ensembles.
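In its simplest form the recursion is the tabular TD update. A minimal sketch, with assumed environment sizes and made-up names:

```python
# Tabular Q-learning update (a sketch; sizes and names are assumptions).
import numpy as np

n_states, n_actions, alpha, gamma = 16, 4, 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, done):
    # TD target: r + gamma * max_a' Q(s', a'), with no bootstrap at terminals.
    target = r + gamma * (0.0 if done else Q[s_next].max())
    Q[s, a] += alpha * (target - Q[s, a])
```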
**Distribution shift.** The disease that motivates DAgger in HW1 (the policy drifts off-distribution at test time) reappears as the central problem of offline RL in HW3 (the Q-network is unreliable at OOD actions). Three families of fixes emerge across the guides: iterative relabeling, policy constraints, and value constraints.
**Multimodality.** BC regression dies on multimodal experts (HW1 P1). Flow matching captures multimodality with a generative model (HW1 P2). DAgger sidesteps it with a deterministic expert (HW1 P3). Modern systems use both flow matching/diffusion and DAgger-like data curation.
**The advantage function.** Introduced as a variance-reduction trick in PPO (HW2 P2: A = Q − V). It returns as the AWAC weight (HW3 P1) and as the IQL action selector (HW3 P2). The same quantity drives policy improvement across vastly different algorithms.
**Target networks and clipped double-Q.** Stabilization tricks introduced in HW2 P3 to handle the deadly triad. Reused identically in HW3's AWAC. IQL extends them with a third network (V) to factor out OOD evaluation entirely. A sketch of both tricks follows.
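A minimal sketch of both stabilizers, assuming the critics are PyTorch modules; all names and shapes here are illustrative assumptions.

```python
# Sketch: slow-moving target copies of each critic, plus a min over two critics.
import copy
import torch

def make_targets(q1, q2):
    # Frozen deep copies of the online critics serve as target networks.
    q1_targ, q2_targ = copy.deepcopy(q1), copy.deepcopy(q2)
    for p in list(q1_targ.parameters()) + list(q2_targ.parameters()):
        p.requires_grad_(False)
    return q1_targ, q2_targ

def td_target(r, s2, a2, done, q1_targ, q2_targ, gamma=0.99):
    with torch.no_grad():
        # Clipped double-Q: pessimistic minimum of the two target critics.
        q_next = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        return r + gamma * (1.0 - done) * q_next

def polyak_update(net, targ, tau=0.005):
    # Slowly track the online network: targ <- tau*net + (1 - tau)*targ.
    with torch.no_grad():
        for p, p_t in zip(net.parameters(), targ.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
```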