Stanford CS 224R Spring 2026 · Eight Companion Guides

From Behavior Cloning to Implicit Q-Learning

Eight self-contained HTML guides covering the full sequence of imitation learning, reinforcement learning, and offline RL taught in CS 224R. Each one assumes no prior knowledge, derives every equation, annotates every line of code you'll write, and includes a self-quiz so you can verify mastery.

8 Guides · ~12k Lines · 100% From Zero · 3 Homeworks
The Conceptual Arc

Read in Order

The eight guides form a single conceptual sequence. Each builds on the failures and ideas of the previous one. Read in this order and you'll have covered the canonical foundations of modern robot learning.

Pedagogical Sequence
BC regression → Flow matching → DAgger → Tabular Q → PPO → SAC-like → AWAC → IQL

Three big shifts

HW1 → HW2: from imitating an expert to learning from reward.

HW2 → HW3: from interactive learning (with environment access) to offline learning (frozen dataset).

Within each homework: increasingly sophisticated answers to "what's wrong with the previous approach?" Each guide is explicitly motivated by the failures of the one before.

HW1 · Imitation Learning Flappy Bird · 4-D obs · 20-step action chunks

Three takes on cloning an expert

A bird, a hammer-and-nail-like physics task, and an expert that sometimes makes random choices. Three problems explore one truth: imitating a multimodal expert with a unimodal regressor doesn't work, and there are two clean ways to fix it.
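To see the failure in miniature, here is a tiny NumPy sketch (illustrative only, not homework code): when the expert randomizes between two actions for the same observation, the MSE-optimal prediction is their average, an action the expert never takes.

```python
import numpy as np

# Illustrative sketch, not homework code: an expert that flips a coin
# between two distinct actions for the same observation.
rng = np.random.default_rng(0)
expert_actions = rng.choice([-1.0, +1.0], size=10_000)  # two expert modes

# For a constant predictor, the MSE-optimal output is the sample mean,
# which lands between the modes rather than on either of them.
mse_optimal = expert_actions.mean()
print(f"MSE-optimal action: {mse_optimal:+.3f}")  # ~0.0: neither mode
print("distance to nearest mode:",
      min(abs(mse_optimal - 1.0), abs(mse_optimal + 1.0)))  # ~1.0
```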

HW2 · Online Reinforcement Learning Gridworld · Sawyer hammer · 4-DOF continuous control

From tables to neural-net actor-critic

No expert. Just a reward function and the ability to interact with the environment. Three problems walk from tabular Q-learning to two complete neural-network RL algorithms (PPO and SAC-style off-policy), motivating each architectural choice from the failures of the previous algorithm.
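For orientation, the tabular starting point fits in a few lines. This is a minimal sketch with assumed state/action counts and hyperparameters, not the assignment's gridworld API:

```python
import numpy as np

# Minimal tabular Q-learning sketch; sizes and hyperparameters are
# placeholders, not the assignment's.
n_states, n_actions = 25, 4
alpha, gamma, eps = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def act(s):
    # Epsilon-greedy over the current table.
    return rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())

def update(s, a, r, s_next, done):
    # TD target r + gamma * max_a' Q(s', a'), truncated at terminal states.
    target = r + (0.0 if done else gamma * Q[s_next].max())
    Q[s, a] += alpha * (target - Q[s, a])
```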

HW3 · Offline Reinforcement Learning AntMaze · PointMass stitching · D4RL datasets

Learning without environment interaction

A frozen dataset. No env queries during training. Two algorithms attack the same problem from opposite directions: AWAC constrains the policy to stay near the data; IQL constrains the value function so it never queries out-of-distribution actions. Both are essential, and both are widely used.
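The two constraints are easiest to contrast in code. A hedged sketch, with illustrative tensor names and constants rather than the assignments' exact interfaces:

```python
import torch

def awac_policy_loss(log_prob, advantage, lam=1.0):
    # AWAC: advantage-weighted behavior cloning. Dataset actions with high
    # advantage are up-weighted, which keeps the policy near the data.
    weights = torch.exp(advantage / lam).clamp(max=100.0)  # clip for stability
    return -(weights.detach() * log_prob).mean()

def iql_value_loss(q_values, v_values, tau=0.7):
    # IQL: expectile regression of V toward Q at dataset actions only, so
    # the value function never has to evaluate out-of-distribution actions.
    diff = q_values.detach() - v_values
    weight = torch.abs(tau - (diff < 0).float())  # tau above, 1 - tau below
    return (weight * diff.pow(2)).mean()
```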

How to Read

Three reading modes

Each guide is independently readable. Pick a mode based on what you need.

Course mode

Sequential, full depth

Read all eight in order. Each chapter assumes you read the previous guide. Best for first-time learners who want the canonical ML/RL arc.

Reference mode

Per-homework deep dive

Jump to the guide for the homework you're stuck on. Each one stands alone; the implementation walkthrough chapter has line-by-line annotations of every blank you'll fill in.

Concept mode

Pick a single idea

Use the table of contents in each guide to jump to the chapter on a specific topic — expectile loss, action chunking, GAE, clipped double-Q, etc. The chapter cross-references the rest of the guide for context.

Quiz mode

Test yourself

Each guide ends with a 12-question self-quiz with answer key. If you can answer them without re-reading, you've mastered the material.

Anatomy

What's in every guide

All eight follow the same structure. Glance at any one and you'll know where to find what.

Chapter · Purpose
Setup · The task in concrete terms; what success looks like; what you'll know by the end
Why this exists · The failure mode of the previous algorithm that motivates this one
The math · Every equation derived from first principles, not stated
PyTorch / NumPy primer · The minimum library knowledge needed for the implementation chapter
Code tour · Walk through the starter code so you know what's already done and what's missing
Your changes, decoded · The centerpiece. Every line you'll write, with a per-line explanation of what each operation does
Running it · Commands to run, expected results, healthy training signals, common bugs
Cheat sheet & quiz · Equations, API reference, 12-question self-test with answer key
Cross-cutting Themes

Threads that run through all eight

Some ideas appear in multiple guides at progressively deeper levels. Spotting them is part of the value of reading the whole sequence.

The Bellman equation

First seen in tabular Q-learning (HW2 P1). Returns as the TD target in PPO's GAE (HW2 P2), as the off-policy critic update (HW2 P3), and as the foundation of every offline RL algorithm in HW3. The same recursion, scaled up to neural networks and stabilized with target nets and ensembles.
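One way to see the thread is to line the targets up. The sketch below uses assumed names and a standard (1 - done) termination mask; the guides' exact code will differ:

```python
import torch

def tabular_target(r, gamma, q_next_max, done):
    # HW2 P1: r + gamma * max_a' Q(s', a')
    return r + gamma * q_next_max * (1.0 - done)

def td_delta_for_gae(r, gamma, v_next, v, done):
    # HW2 P2: the one-step TD error that GAE accumulates,
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    return r + gamma * v_next * (1.0 - done) - v

def clipped_double_q_target(r, gamma, q1_next, q2_next, done):
    # HW2 P3 and HW3: the same recursion, stabilized with target networks
    # and a pessimistic minimum over a two-critic ensemble.
    return r + gamma * torch.min(q1_next, q2_next) * (1.0 - done)
```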

Distributional shift

The disease that motivates DAgger in HW1 (the policy drifts off-distribution at test time) reappears as the central problem of offline RL in HW3 (the Q-network is unreliable at OOD actions). Three families of fixes emerge across the guides: iterative relabeling (DAgger), policy constraints (AWAC), and value constraints (IQL).
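The iterative-relabeling fix is worth seeing in shape, if not in detail. A pseudocode-level sketch; the env, expert, and policy interfaces here are assumptions, not the assignment's API:

```python
def dagger(policy, expert, env, n_iters, n_steps):
    # Sketch of DAgger's loop (interfaces assumed, not the homework's API).
    dataset = []
    for _ in range(n_iters):
        obs = env.reset()
        for _ in range(n_steps):
            # Roll out the LEARNER, so visited states follow its own
            # distribution rather than the expert's...
            action = policy.act(obs)
            # ...but label every visited state with the EXPERT's action.
            dataset.append((obs, expert.act(obs)))
            obs, done = env.step(action)
            if done:
                obs = env.reset()
        policy.fit(dataset)  # retrain on the aggregated, relabeled data
    return policy
```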

Multimodality and how to handle it

BC regression dies on multimodal experts (HW1 P1). Flow matching captures multimodality through a generative model (HW1 P2). DAgger sidesteps it with a deterministic expert (HW1 P3). Modern systems use both flow matching/diffusion AND DAgger-like data curation.
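For concreteness, a minimal conditional flow-matching loss under a straight-line interpolation path (a rectified-flow-style choice; the guide's exact parameterization may differ):

```python
import torch

def flow_matching_loss(model, obs, expert_action):
    # Sketch: learn a velocity field that transports Gaussian noise x0
    # to expert actions x1 along straight lines. `model(obs, x_t, t)` is
    # an assumed signature, not the starter code's.
    x0 = torch.randn_like(expert_action)        # noise endpoint
    t = torch.rand(expert_action.shape[0], 1)   # random time in [0, 1]
    x_t = (1 - t) * x0 + t * expert_action      # point on the straight path
    target_velocity = expert_action - x0        # d x_t / dt along that path
    return (model(obs, x_t, t) - target_velocity).pow(2).mean()
```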

The advantage function

Introduced as a variance-reduction trick in PPO (HW2 P2: A = Q − V). Returns as the AWAC weight (HW3 P1) and the IQL action selector (HW3 P2). The same quantity drives policy improvement across vastly different algorithms.
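The variance-reduction role is visible in a two-line comparison (illustrative sketch; tensor names assumed):

```python
import torch

def pg_loss(log_prob, returns):
    # Vanilla policy gradient weighted by raw returns: unbiased, high variance.
    return -(log_prob * returns.detach()).mean()

def pg_loss_with_advantage(log_prob, returns, values):
    # Same expectation, lower variance: subtract the action-independent
    # baseline V(s), leaving A = Q - V (estimated here as returns - V).
    adv = (returns - values).detach()
    return -(log_prob * adv).mean()

# The same A later reappears exponentiated, e.g. exp(A / lambda) in AWAC's
# behavior-cloning weight and exp(beta * A) in IQL's policy extraction.
```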

Target networks and double-Q

Stabilization tricks introduced in HW2 P3 to handle the deadly triad. Reused identically in HW3's AWAC. IQL extends them with a third network (V) to factor out OOD evaluation entirely.
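A sketch of the reusable pieces, with assumed names: Polyak-averaged target updates (HW2 P3, reused in AWAC) and IQL's V-bootstrapped target that sidesteps OOD evaluation:

```python
import torch

@torch.no_grad()
def polyak_update(target_net, online_net, tau=0.005):
    # Target network: slowly track the online weights so the bootstrapped
    # TD target moves smoothly instead of chasing itself.
    for p_t, p in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(1.0 - tau).add_(p, alpha=tau)

def iql_q_target(r, gamma, done, v_next):
    # IQL's extension: bootstrap Q through the separate V network, so the
    # target never evaluates Q at out-of-distribution actions.
    return r + gamma * (1.0 - done) * v_next
```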