Stanford CS 224R Spring 2026 · Eight Companion Guides

From Behavior Cloning to Implicit Q-Learning

Eight self-contained HTML guides covering the full sequence of imitation learning, reinforcement learning, and offline RL taught in CS 224R. Each one assumes no prior knowledge, derives every equation, annotates every line of code you'll write, and includes a self-quiz so you can verify mastery.

8 Guides · ~12k Lines · 100% From Zero · 3 Homeworks
The Conceptual Arc

Read in Order

The eight guides form a single conceptual sequence. Each builds on the failures and ideas of the previous one. Read in this order and you'll have covered the canonical foundations of modern robot learning.

Pedagogical Sequence
BC regression → Flow matching → DAgger → Tabular Q → PPO → SAC-like → AWAC → IQL

Three big shifts

HW1 → HW2: from imitating an expert to learning from reward.

HW2 → HW3: from interactive learning (with environment access) to offline learning (frozen dataset).

Within each homework: increasingly sophisticated answers to "what's wrong with the previous approach?" Each guide is explicitly motivated by the failures of the one before.

HW1 · Imitation Learning Flappy Bird · 4-D obs · 20-step action chunks

Three takes on cloning an expert

A bird, a hammer-and-nail-like physics task, and an expert that sometimes makes random choices. Three problems explore one truth: imitating a multimodal expert with a unimodal regressor doesn't work, and there are two clean ways to fix it.
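To see the failure in miniature, here is a tiny NumPy sketch (illustrative only, not homework code): when the expert randomizes between two actions for the same observation, the MSE-optimal prediction is their average, an action the expert never takes.

```python
import numpy as np

# Illustrative sketch, not homework code: an expert that flips a coin
# between two distinct actions for the same observation.
rng = np.random.default_rng(0)
expert_actions = rng.choice([-1.0, +1.0], size=10_000)  # two expert modes

# For a constant predictor, the MSE-optimal output is the sample mean,
# which lands between the modes rather than on either of them.
mse_optimal = expert_actions.mean()
print(f"MSE-optimal action: {mse_optimal:+.3f}")  # ~0.0: neither mode
print("distance to nearest mode:",
      min(abs(mse_optimal - 1.0), abs(mse_optimal + 1.0)))  # ~1.0
```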

HW2 · Online Reinforcement Learning Gridworld · Sawyer hammer · 4-DOF continuous control

From tables to neural-net actor-critic

No expert. Just a reward function and the ability to interact with the environment. Three problems walk from tabular Q-learning to two complete neural-network RL algorithms (PPO and SAC-style off-policy), motivating each architectural choice from the failures of the previous algorithm.
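For orientation, the tabular starting point fits in a few lines. This is a minimal sketch with assumed state/action counts and hyperparameters, not the assignment's gridworld API:

```python
import numpy as np

# Minimal tabular Q-learning sketch; sizes and hyperparameters are
# placeholders, not the assignment's.
n_states, n_actions = 25, 4
alpha, gamma, eps = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def act(s):
    # Epsilon-greedy over the current table.
    return rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())

def update(s, a, r, s_next, done):
    # TD target r + gamma * max_a' Q(s', a'), truncated at terminal states.
    target = r + (0.0 if done else gamma * Q[s_next].max())
    Q[s, a] += alpha * (target - Q[s, a])
```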

HW3 · Offline Reinforcement Learning AntMaze · PointMass stitching · D4RL datasets

Learning without environment interaction

A frozen dataset. No env queries during training. Two algorithms attack the same problem from opposite directions: AWAC constrains the policy to stay near the data; IQL constrains the value function so it never queries out-of-distribution actions. Both are essential, and both are widely used.
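The two constraints are easiest to contrast in code. A hedged sketch, with illustrative tensor names and constants rather than the assignments' exact interfaces:

```python
import torch

def awac_policy_loss(log_prob, advantage, lam=1.0):
    # AWAC: advantage-weighted behavior cloning. Dataset actions with high
    # advantage are up-weighted, which keeps the policy near the data.
    weights = torch.exp(advantage / lam).clamp(max=100.0)  # clip for stability
    return -(weights.detach() * log_prob).mean()

def iql_value_loss(q_values, v_values, tau=0.7):
    # IQL: expectile regression of V toward Q at dataset actions only, so
    # the value function never has to evaluate out-of-distribution actions.
    diff = q_values.detach() - v_values
    weight = torch.abs(tau - (diff < 0).float())  # tau above, 1 - tau below
    return (weight * diff.pow(2)).mean()
```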

How to Read

Three reading modes

Each guide is independently readable. Pick a mode based on what you need.

Course mode

Sequential, full depth

Read all eight in order. Each chapter assumes you read the previous guide. Best for first-time learners who want the canonical ML/RL arc.

Reference mode

Per-homework deep dive

Jump to the guide for the homework you're stuck on. Each one stands alone; the implementation walkthrough chapter has line-by-line annotations of every blank you'll fill in.

Concept mode

Pick a single idea

Use the table of contents in each guide to jump to the chapter on a specific topic — expectile loss, action chunking, GAE, clipped double-Q, etc. The chapter cross-references the rest of the guide for context.

Quiz mode

Test yourself

Each guide ends with a 12-question self-quiz with answer key. If you can answer them without re-reading, you've mastered the material.

Anatomy

What's in every guide

All eight follow the same structure. Glance at any one and you'll know where to find what.

Chapter · Purpose
Setup · The task in concrete terms; what success looks like; what you'll know by the end
Why this exists · The failure mode of the previous algorithm that motivates this one
The math · Every equation derived from first principles, not stated
PyTorch / NumPy primer · The minimum library knowledge needed for the implementation chapter
Code tour · Walk through the starter code so you know what's already done and what's missing
Your changes, decoded · The centerpiece. Every line you'll write, with a per-line explanation of what each operation does
Running it · Commands to run, expected results, healthy training signals, common bugs
Cheat sheet & quiz · Equations, API reference, 12-question self-test with answer key
Cross-cutting Themes

Threads that run through all eight

Some ideas appear in multiple guides at progressively deeper levels. Spotting them is part of the value of reading the whole sequence.

The Bellman equation

First seen in tabular Q-learning (HW2 P1). Returns as the TD target in PPO's GAE (HW2 P2), as the off-policy critic update (HW2 P3), and as the foundation of every offline RL algorithm in HW3. The same recursion, scaled up to neural networks and stabilized with target nets and ensembles.
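One way to see the thread is to line the targets up. The sketch below uses assumed names and a standard (1 - done) termination mask; the guides' exact code will differ:

```python
import torch

def tabular_target(r, gamma, q_next_max, done):
    # HW2 P1: r + gamma * max_a' Q(s', a')
    return r + gamma * q_next_max * (1.0 - done)

def td_delta_for_gae(r, gamma, v_next, v, done):
    # HW2 P2: the one-step TD error that GAE accumulates,
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    return r + gamma * v_next * (1.0 - done) - v

def clipped_double_q_target(r, gamma, q1_next, q2_next, done):
    # HW2 P3 and HW3: the same recursion, stabilized with target networks
    # and a pessimistic minimum over a two-critic ensemble.
    return r + gamma * torch.min(q1_next, q2_next) * (1.0 - done)
```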

Distributional shift

The disease that motivates DAgger in HW1 (the policy drifts off-distribution at test time) reappears as the central problem of offline RL in HW3 (the Q-network is unreliable at OOD actions). Three families of fixes emerge across the guides: iterative relabeling (DAgger), policy constraints (AWAC), and value constraints (IQL).
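The iterative-relabeling fix is worth seeing in shape, if not in detail. A pseudocode-level sketch; the env, expert, and policy interfaces here are assumptions, not the assignment's API:

```python
def dagger(policy, expert, env, n_iters, n_steps):
    # Sketch of DAgger's loop (interfaces assumed, not the homework's API).
    dataset = []
    for _ in range(n_iters):
        obs = env.reset()
        for _ in range(n_steps):
            # Roll out the LEARNER, so visited states follow its own
            # distribution rather than the expert's...
            action = policy.act(obs)
            # ...but label every visited state with the EXPERT's action.
            dataset.append((obs, expert.act(obs)))
            obs, done = env.step(action)
            if done:
                obs = env.reset()
        policy.fit(dataset)  # retrain on the aggregated, relabeled data
    return policy
```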

Multimodality and how to handle it

BC regression dies on multimodal experts (HW1 P1). Flow matching captures multimodality through a generative model (HW1 P2). DAgger sidesteps it with a deterministic expert (HW1 P3). Modern systems use both flow matching/diffusion AND DAgger-like data curation.
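For concreteness, a minimal conditional flow-matching loss under a straight-line interpolation path (a rectified-flow-style choice; the guide's exact parameterization may differ):

```python
import torch

def flow_matching_loss(model, obs, expert_action):
    # Sketch: learn a velocity field that transports Gaussian noise x0
    # to expert actions x1 along straight lines. `model(obs, x_t, t)` is
    # an assumed signature, not the starter code's.
    x0 = torch.randn_like(expert_action)        # noise endpoint
    t = torch.rand(expert_action.shape[0], 1)   # random time in [0, 1]
    x_t = (1 - t) * x0 + t * expert_action      # point on the straight path
    target_velocity = expert_action - x0        # d x_t / dt along that path
    return (model(obs, x_t, t) - target_velocity).pow(2).mean()
```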

The advantage function

Introduced as a variance-reduction trick in PPO (HW2 P2: A = Q − V). Returns as the AWAC weight (HW3 P1) and the IQL action selector (HW3 P2). The same quantity drives policy improvement across vastly different algorithms.
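The variance-reduction role is visible in a two-line comparison (illustrative sketch; tensor names assumed):

```python
import torch

def pg_loss(log_prob, returns):
    # Vanilla policy gradient weighted by raw returns: unbiased, high variance.
    return -(log_prob * returns.detach()).mean()

def pg_loss_with_advantage(log_prob, returns, values):
    # Same expectation, lower variance: subtract the action-independent
    # baseline V(s), leaving A = Q - V (estimated here as returns - V).
    adv = (returns - values).detach()
    return -(log_prob * adv).mean()

# The same A later reappears exponentiated, e.g. exp(A / lambda) in AWAC's
# behavior-cloning weight and exp(beta * A) in IQL's policy extraction.
```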

Target networks and double-Q

Stabilization tricks introduced in HW2 P3 to handle the deadly triad. Reused identically in HW3's AWAC. IQL extends them with a third network (V) to factor out OOD evaluation entirely.
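A sketch of the reusable pieces, with assumed names: Polyak-averaged target updates (HW2 P3, reused in AWAC) and IQL's V-bootstrapped target that sidesteps OOD evaluation:

```python
import torch

@torch.no_grad()
def polyak_update(target_net, online_net, tau=0.005):
    # Target network: slowly track the online weights so the bootstrapped
    # TD target moves smoothly instead of chasing itself.
    for p_t, p in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(1.0 - tau).add_(p, alpha=tau)

def iql_q_target(r, gamma, done, v_next):
    # IQL's extension: bootstrap Q through the separate V network, so the
    # target never evaluates Q at out-of-distribution actions.
    return r + gamma * (1.0 - done) * v_next
```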