Stanford CS 224R · Homework 1 · Imitation Learning

Behavior Cloning with Regression from Absolute Zero

A Flappy Bird, a 4-D observation, twenty future actions to predict. The simplest imitation learning algorithm there is. Why it works on easy mode, why it dies on hard mode, and the deep lesson that motivates flow matching.

No prior IL/RL assumed · PyTorch primer included · Three implementation tasks · The multimodality lesson

Chapter 01

The Flappy Bird Setup

You've been given an environment that's a physics-based version of the classic Flappy Bird game. A bird falls under gravity. Pipes scroll left across the screen. Each pipe has one or two gaps. The bird must navigate through the gaps without crashing.

The agent's job: at each timestep, output a target y-position in [0, 1]. A built-in PD controller converts that target into thrust, creating realistic momentum-based motion.
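
To make the control interface concrete, here is a minimal sketch of what a PD controller converting a target y-position into thrust might look like. The function name and gains are illustrative, not the environment's actual values; the real controller is built into the provided environment and you never implement it.

def pd_thrust(target_y, bird_y, bird_vy, kp=10.0, kd=3.0):
    # Proportional term pulls the bird toward the target; derivative term damps velocity.
    # kp and kd are made-up gains for illustration only.
    return kp * (target_y - bird_y) - kd * bird_vy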

The observation

Four numbers, all normalized to [0, 1].

Two difficulty modes

Mode | Description | Why it matters
Easy | One gap per pipe (gap1_y == gap2_y) | Unimodal expert. Plain BC works fine.
Hard | Alternating single- and double-gap pipes | Multimodal expert. Plain BC fails spectacularly.

The "hard" mode is where the lesson lives. With double-gap pipes, the expert can choose either gap. Different demonstrations choose differently. We'll see in Chapter 07 why this breaks MSE regression.

Action chunking

Modern imitation-learning policies don't predict one action at a time. They predict a chunk of T future actions in a single forward pass:

πθ(st) = (at, at+1, at+2, …, at+T-1)

In this homework: ACTION_CHUNK = 20, EXECUTE_STEPS = 10. The policy predicts 20 future targets, but only the first 10 are executed before re-querying. This is called receding horizon control. We'll explain why this helps in Chapter 04.

Evaluation

Average episode length over 50 evaluation episodes. Episodes cap at 1000 steps; surviving all 1000 is "success." Higher is better. Standard deviation tells you about consistency — high std means the policy sometimes succeeds and sometimes immediately crashes.

Chapter 02

The Imitation Learning Paradigm

Imitation learning is fundamentally different from reinforcement learning. If you're coming to HW1 fresh, internalize this distinction before anything else.

Property | Reinforcement Learning | Imitation Learning
Data source | Trial and error in the environment | Expert demonstrations (no env interaction during training)
Signal | Reward function | Expert's actions (treated as labels)
Algorithm family | Q-learning, policy gradient, etc. | Supervised learning
Difficulty | Sparse rewards, exploration, credit assignment | Distribution shift, multimodal experts
Example | Train a chess engine via self-play | Train a chess engine by watching humans

The imitation learning setup:

  1. An expert exists. We have access to their demonstrations: a dataset of (state, action) pairs.
  2. We want to train a policy πθ(s) that produces actions like the expert would.
  3. We don't ever query the expert's reward function or interact with the environment during training. (Problem 3, DAgger, slightly relaxes this last constraint.)
Definition
Behavior Cloning (BC)

The simplest imitation learning algorithm: treat the expert demonstrations as a supervised learning dataset, where states are inputs and expert actions are labels. Train a neural network via standard supervised learning.

That's it. No exploration. No reward function. No Bellman equation. Just supervised learning with state → action pairs. The complexity comes from the subtleties: which loss function, which architecture, and what to do when the expert is multimodal.

Why imitation learning matters

Three reasons it's the right starting point for a course on robot learning:

The conceptual ladder

HW1 is BC variants. HW2 is RL from scratch (no expert). HW3 is offline RL (expert dataset + value learning). The arc: pure imitation → pure RL → combined. Understanding HW1 deeply makes the rest of the course click.

Chapter 03

Behavior Cloning

Strip BC down to its essence and it's literally just supervised learning.

The dataset

Run the expert in the environment N times. Record every (state, action) pair you see. Result: a dataset

D = { (s1, a1*), (s2, a2*), …, (sN, aN*) }

where each ai* is the action the expert took at state si.
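
In code, demo collection is just a rollout loop that logs what the expert does. Below is a minimal sketch, assuming a gymnasium-style env and an expert_policy callable (both hypothetical names); the starter code's collect_expert_data is the authoritative version, and the end-of-episode padding below is one possible choice, not necessarily what the starter code does.

import numpy as np

def collect_demos(env, expert_policy, num_episodes=5, chunk=20):
    states, chunks = [], []
    for _ in range(num_episodes):
        ep_states, ep_actions = [], []
        s, _ = env.reset()
        done = False
        while not done:
            a = expert_policy(s)                      # expert's target for this step
            ep_states.append(s)
            ep_actions.append(a)
            s, _, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
        # Pair each state with the next `chunk` expert actions (pad the tail by repeating).
        ep_actions = np.array(ep_actions)
        for t, st in enumerate(ep_states):
            future = ep_actions[t:t + chunk]
            if len(future) < chunk:
                future = np.concatenate([future, np.full(chunk - len(future), future[-1])])
            states.append(st)
            chunks.append(future)
    return np.array(states), np.array(chunks)       # shapes [N, 4] and [N, 20]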

The objective

Find policy parameters θ that minimize the discrepancy between the policy's predicted action and the expert's action, averaged over the dataset:

BC objective θ* = arg minθ (1/N) Σi=1..N L( πθ(si), ai* )

where L is some loss function. For Problem 1, L is mean squared error. (Problem 2 will use flow matching, a generative loss. Different loss, same overall structure.)

What this means in practice

Three steps:

  1. Collect demos: run the expert, log states and actions.
  2. Define the model: a neural network πθ(s) → a.
  3. Optimize: standard supervised learning loop. Sample minibatches, compute loss, backprop, step Adam, repeat.

If you've trained any classifier or regressor in PyTorch before, you've already done all the mechanics of BC. The only thing that's new is the data — states and actions instead of images and labels.

Why BC isn't trivially perfect

The hardness of BC isn't in the optimization — it's in:

  1. Distribution shift (DAgger fixes this in Problem 3): at test time the policy makes small errors that compound, and pretty soon it's in a state the expert never visited. The policy has no idea what to do.
  2. Multimodal experts (Flow matching fixes this in Problem 2): if the expert sometimes goes through gap 1 and sometimes through gap 2, MSE regression averages the two and produces a midpoint — which crashes into the wall.
  3. Causal confusion: the policy might learn shortcuts that work on the training data but don't generalize.

Problem 1 of HW1 introduces BC. Problem 2 (flow matching) tackles multimodality. Problem 3 (DAgger) tackles distribution shift. Each problem is one BC failure mode and one fix.

Chapter 04

Why Action Chunking

Most BC tutorials predict one action at a time:

Naive BC πθ(st) = at single-step prediction

Modern robot learning (e.g., Diffusion Policy, ALOHA, RT-1) predicts a chunk of future actions:

Action chunking πθ(st) = (at, at+1, at+2, …, at+T-1)

where T = 20 in this homework (the ACTION_CHUNK constant). At rollout time, the policy is queried, then the first EXECUTE_STEPS = 10 actions are executed open-loop (without re-querying), and only after the 10th step does the policy fire again.

This is called receding horizon control — the policy commits to a 20-step plan, executes 10 steps of it, then re-plans.
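
Here is a minimal sketch of a receding-horizon rollout, assuming a gymnasium-style environment whose step takes a single target y-position (names and details are illustrative; main.py implements the real loop).

import torch

def rollout(env, policy, execute_steps=10, max_steps=1000):
    s, _ = env.reset()
    steps = 0
    while steps < max_steps:
        with torch.no_grad():
            state = torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)   # [1, 4]
            chunk = policy(state)[0]                                        # [20] predicted targets
        for a in chunk[:execute_steps]:        # execute only the first 10, discard the rest
            s, _, terminated, truncated, _ = env.step(a.item())
            steps += 1
            if terminated or truncated or steps >= max_steps:
                return steps                   # episode length is the evaluation metric
    return steps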

Why does this help?

Three big reasons:

1. Temporal consistency

If you predict one action at a time, the policy can flip-flop — this step it predicts "go up," the next step "go down." Action chunks force temporal coherence: the network has to commit to a continuous plan, not a shaky sequence of independent decisions.

2. Reduced compounding error

Each policy query is one source of stochasticity. Querying once every 10 steps means you accumulate fewer "rolls of the policy dice" per episode. If errors are roughly Gaussian-distributed per query, fewer queries = lower total error variance.

3. Multi-step reasoning

Predicting 20 steps ahead lets the policy plan around obstacles (e.g., commit to a gap, route through it). A single-step policy has to make this decision fresh at every step, which gets tangled in noise.

The chunk size tradeoff

Large chunks (T ≫ T_execute): more temporal consistency but the policy commits to outdated plans for longer.

Small chunks (T == T_execute): less consistency but always responsive to fresh observations.

The "execute K of T predicted" pattern is the typical compromise: predict more than you execute, then re-query before the buffer runs out.

How chunking shapes the architecture

For BC with action chunking on this homework:

The policy is just an MLP that maps 4 → 256 → 256 → 20. Each output dimension corresponds to one future action. The action_dim parameter in BCPolicy defaults to 20 because that's the chunk size.

Chapter 05

The MLP Architecture

Problem 1 specifies a 3-layer MLP:

BCPolicy architecture state (4-D) ↓ Linear(4 → 256) → ReLU ↓ Linear(256 → 256) → ReLU ↓ Linear(256 → 20) → Sigmoid ↓ action chunk (20-D), each value in [0, 1]

Three things to understand here.

Why an MLP and not something fancier

The problem is small (4-D input, 20-D output, simple physics). Convolutional networks would be overkill (no spatial structure in the state). Transformers would be overkill (no sequential structure in the input). A plain MLP with two hidden layers of 256 is the natural choice.

For comparison, the FlowMatchingPolicy for Problem 2 is much fancier — a 1D U-Net with conditional residual blocks. That's because flow matching needs to learn a vector field, which is a more complex function class.

Why ReLU

ReLU(x) = max(0, x). Three reasons to use it:

  1. It doesn't saturate for positive inputs, so gradients stay healthy through the hidden layers (unlike sigmoid or tanh).
  2. It's cheap: a single elementwise max, no exponentials.
  3. It's the empirical default for small MLPs like this one; there's no reason to reach for anything fancier.

Could you use Mish, GELU, SiLU instead? Sure. ReLU is the textbook default and works fine here.

Why sigmoid on the output

The action is a target y-position, normalized to [0, 1]. The sigmoid activation:

σ(x) = 1 / (1 + e−x)

maps any real input to (0, 1). This is the natural activation for outputs that need to be in [0, 1]. (Compare to HW2 P2's actor, which used tanh for the [-1, 1] action range.)

Why not just clamp?

You could output raw values from the final Linear and clamp to [0, 1] at inference. But:

• If you clamp during training, the gradient is zero whenever the raw output falls outside [0, 1], so those predictions get no learning signal; if you clamp only at inference, the network trains on outputs it will never actually execute.

• Either way, the network has to learn that values outside [0, 1] are forbidden, and that constraint stays implicit and slow to pick up.

Sigmoid bakes the constraint into the architecture. The network outputs unconstrained logits; the sigmoid converts them to the action range. Cleaner gradient signal, no out-of-range predictions ever.
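
A small check of the gradient argument (illustrative, not part of the assignment): when the raw output lands outside [0, 1], a hard clamp passes zero gradient back, while a sigmoid always passes a nonzero one.

import torch

raw = torch.tensor([3.0, -2.0], requires_grad=True)   # pre-activation outputs outside [0, 1]
target = torch.tensor([0.7, 0.3])

# Hard clamp: gradient is zero wherever the clamp is active.
loss_clamp = ((raw.clamp(0.0, 1.0) - target) ** 2).mean()
loss_clamp.backward()
print(raw.grad)           # tensor([0., 0.]) -- no learning signal

raw.grad = None
loss_sig = ((torch.sigmoid(raw) - target) ** 2).mean()
loss_sig.backward()
print(raw.grad)           # small but nonzero gradients -- learning can proceed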

The full architecture in PyTorch

Just nn.Sequential stacking. Five operations, each one line:

nn.Sequential(
    nn.Linear(state_dim, hidden),    # 4 → 256
    nn.ReLU(),
    nn.Linear(hidden, hidden),       # 256 → 256
    nn.ReLU(),
    nn.Linear(hidden, action_dim),   # 256 → 20
    nn.Sigmoid(),
)

That's the entire model. You'll write exactly this in BCPolicy.__init__.

Chapter 06

The MSE Loss

The math

MSE loss LMSE(θ) = (1/N) Σi=1..N || πθ(si) − ai* ||²

Read it: "average over the dataset of the squared distance between the policy's predicted action and the expert's action."

For the action chunk version, the squared norm is over all 20 dimensions of the chunk:

|| πθ(s) − a* ||² = Σk=1..20 ( πθ(s)[k] − a*[k] )²

Each future action's prediction error squared, summed.

Why squared error and not absolute error?

Loss | Form | Pros | Cons
L2 (MSE) | (π − a*)² | Smooth gradient; minimizer is the conditional mean | Sensitive to outliers; averages multimodal targets
L1 (MAE) | abs(π − a*) | Robust to outliers; minimizer is the conditional median | Non-smooth at 0; harder to optimize with SGD

For BC on smooth control tasks, MSE is the standard choice. The minimizer is the conditional expectation of expert actions:

MSE minimizer πθ*(s) = Ea* ~ expert[ a* | s ]

This is a fact you should remember. The MSE-minimizing prediction at state s is the average of all the expert's actions at that state.

A 30-second derivation

Take the gradient of expected squared error w.r.t. the prediction p:

d/dp E[(p − a)²] = 2 · E[p − a] = 2 · (p − E[a])

Setting this to zero gives p = E[a]. The optimal MSE prediction is the conditional mean of the target.

For unimodal experts (always make the same choice), this is exactly what we want. For multimodal experts (sometimes choose A, sometimes B), the mean of A and B is a problem — it's neither A nor B, and it could be a terrible prediction. That's the next chapter.

Chapter 07

The Multimodality Trap

This chapter is the conceptual climax of Problem 1. Spend time here.

The setup

Hard mode has alternating single-gap and double-gap pipes. When the expert sees a double-gap pipe, it commits to either gap 1 or gap 2 randomly — from the same observation. So the dataset contains:

Two valid expert actions from the same state, chosen at random: demonstration A aims at gap 1 (target y ≈ 0.7), demonstration B aims at gap 2 (target y ≈ 0.3).

What MSE does

The MSE-minimizer is the conditional mean. Average of 0.7 and 0.3 is 0.5. So the trained policy outputs 0.5 at this state — which is exactly the wall between the two gaps.

In hard mode, the expert sometimes aims at gap 1, sometimes at gap 2. MSE regression trains the policy toward the mean of the two demos — a path through the wall.

The policy isn't doing anything wrong — it's faithfully reproducing the conditional mean of the expert distribution. But the conditional mean is not a valid action.

The fundamental issue

MSE regression assumes the conditional distribution p(a | s) is unimodal — that there's a single "best" action at each state. When the expert's behavior is multimodal (multiple valid actions per state), the mean of those modes can be far from any actual mode and can be catastrophically bad.
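
You can watch this happen in a few lines (illustrative numbers from the gap example, not homework code): fit a single constant prediction by gradient descent on MSE against targets that are 0.7 half the time and 0.3 the other half. It converges to 0.5, the mean, rather than to either valid action.

import torch

targets = torch.tensor([0.7, 0.3, 0.7, 0.3])  # bimodal "expert" actions at one state
p = torch.tensor(0.9, requires_grad=True)     # the policy's single prediction

opt = torch.optim.SGD([p], lr=0.1)
for _ in range(500):
    loss = ((p - targets) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(p.item())   # ≈ 0.5 -- the wall between the gaps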

Why this is the deep lesson of HW1

Problem 1 is designed to fail on hard mode. You'll observe poor performance — mean episode length much shorter than 1000 — and the homework asks you to explain why in 2-3 sentences. The answer is exactly: multimodality of the expert plus mean-seeking behavior of MSE.

The two follow-up problems are direct fixes:

  1. Problem 2 (Flow Matching): replace the MSE regressor with a generative policy that can represent and sample from a multimodal action distribution, so it commits to one gap or the other instead of their average.
  2. Problem 3 (DAgger): keep MSE but collect data from a deterministic expert that always picks the same gap, so the action distribution at each state becomes unimodal again.

Both are valid solutions to the same problem. They take orthogonal approaches: Problem 2 makes the model richer; Problem 3 changes the data to be unimodal. Real robot learning systems often combine both: rich generative models + curated expert data.

Your one-paragraph writeup template

The PDF asks for 2-3 sentences explaining MSE on hard mode. Here's the structure:

For the writeup

"On hard mode, the expert chooses between two valid actions (gap 1 or gap 2) randomly at the same state, making the action distribution multimodal. The MSE loss converges to the conditional mean of expert actions, so the policy outputs the average of the two gaps — a target between them where the wall is. This causes the bird to crash, leading to short episode lengths."

Chapter 08

PyTorch Primer

If you know Python but PyTorch is new, read this. You'll only need a small slice of PyTorch for Problem 1.

Tensors

Numpy arrays with two extras: GPU residency and autograd. torch.tensor([1., 2., 3.]) creates a tensor; .shape, .mean(), +, *, etc. work like numpy.

nn.Module: a network

Inherit from nn.Module, define layers in __init__, define the forward pass in forward. The framework handles parameters and gradient tracking automatically.

import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(10, 256)
        self.layer2 = nn.Linear(256, 1)

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        return self.layer2(x)

net = SimpleNet()
y = net(torch.randn(32, 10))    # shape [32, 1]

Notice net(x) calls forward(x) indirectly. The framework runs some bookkeeping then forwards. Don't call .forward() directly — you'll skip the bookkeeping and break things.

nn.Sequential: stacking layers

For pure feed-forward networks (no skip connections, no branching), you can skip the boilerplate of __init__+forward and just stack layers:

self.net = nn.Sequential(
    nn.Linear(10, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)

y = self.net(x)        # runs through all layers in order

Same effect as the manual class. Internally, nn.Sequential just calls each layer in order. You'll use this pattern in BCPolicy.

Loss functions

PyTorch's loss functions are usually classes you instantiate, then call:

criterion = nn.MSELoss()
loss = criterion(predicted, target)        # scalar

Or use the functional form, which is one-shot:

import torch.nn.functional as F

loss = F.mse_loss(predicted, target)        # same thing, no instance needed

Both apply .mean() by default — reducing the per-element squared errors to a scalar by averaging. The default is what you want for SGD.

Shapes

For BC on this homework: s_batch (states) has shape [B, 4] and a_batch (expert action chunks) has shape [B, 20]; the policy's prediction comes out as [B, 20], matching a_batch.

The MSE loss reduces both to a scalar by squaring elementwise then averaging across all 20 × B elements.

Activation functions

Function | Where to use | Example
nn.ReLU() | Hidden layers (general) | nn.Sequential(nn.Linear(...), nn.ReLU())
nn.Sigmoid() | Output in [0, 1] | Probabilities, normalized actions
nn.Tanh() | Output in [-1, 1] | Symmetric continuous actions
nn.Softmax(dim=-1) | Output is a probability distribution | Discrete action policies

You'll use nn.ReLU() twice and nn.Sigmoid() once in BCPolicy.

That's all the PyTorch you need for Problem 1. Onward.

Chapter 09

Code Tour

The starter code has six files you need to know about; for Problem 1 you only edit two of them (networks.py and losses.py).

File | Status | What's there
main.py | read-only | Training loop, eval, CLI entry point
flappy_bird_env.py | read-only | Gymnasium environment
expert.py | read-only | Expert policy + demo collection
networks.py | EDIT | BCPolicy class (and FlowMatchingSchedule for P2)
losses.py | EDIT | mse_loss (and flow_matching_loss for P2)
dagger.py | EDIT (P3) | For Problem 3 only

BCPolicy skeleton

Look at networks.py:316-339. The class is empty:

class BCPolicy(nn.Module):
    def __init__(self, state_dim: int = 4, action_dim: int = 20, hidden: int = 256):
        super().__init__()
        # TODO: Implement BCPolicy.__init__
        raise NotImplementedError("TODO: Implement BCPolicy.__init__")

    def forward(self, state):
        # TODO: Implement BCPolicy.forward
        raise NotImplementedError("TODO: Implement BCPolicy.forward")

Default constructor args: state_dim=4, action_dim=20, hidden=256. These match the environment's observation size, the action chunk length, and the standard hidden width.

mse_loss skeleton

Look at losses.py:18-33. Same pattern:

def mse_loss(policy, s_batch, a_batch):
    # TODO: Implement mse_loss
    raise NotImplementedError("TODO: Implement mse_loss")

The function signature receives policy (a callable that takes states and returns predicted actions), s_batch of shape [B, state_dim], and a_batch of shape [B, action_dim]. Returns a scalar.

The training loop (already done)

You don't write this, but it's worth understanding what calls your code. From main.py, paraphrased:

# Once at the start: collect demos, build dataset
states, actions = collect_expert_data(...)
dataset = TensorDataset(states, actions)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Build policy and optimizer
policy = BCPolicy()
optimizer = optim.Adam(policy.parameters(), lr=1e-3)

# Standard supervised learning loop
for epoch in range(num_epochs):
    for s_batch, a_batch in loader:
        loss = mse_loss(policy, s_batch, a_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Plain SGD on a regression task. Your mse_loss goes inside the inner loop. Your BCPolicy is what policy(s_batch) calls.

What's nice about this setup

The training loop is generic. To switch from MSE regression (Problem 1) to flow matching (Problem 2), you just swap the loss function and policy class. Same loop, same dataset, same optimizer. main.py dispatches on the --method CLI flag and picks the right combination. This is the same pattern HW2/HW3 used — the framework is generic, your code is the algorithm.

Chapter 10

Your Three Changes, Decoded

Per-line annotations for every blank you'll fill in. This is the centerpiece chapter.

Change 1 of 3
BCPolicy.__init__

Where: networks.py:327-332.

What you need to build: a 3-layer MLP. The PDF spec is exact:

Linear → ReLU → Linear → ReLU → Linear → Sigmoid

Dimensions: state_dim → hidden → hidden → action_dim.

The code:

def __init__(self, state_dim=4, action_dim=20, hidden=256):
    super().__init__()
    self.net = nn.Sequential(
        nn.Linear(state_dim, hidden),
        nn.ReLU(),
        nn.Linear(hidden, hidden),
        nn.ReLU(),
        nn.Linear(hidden, action_dim),
        nn.Sigmoid(),
    )

Decoded

super().__init__()

Calls nn.Module's constructor, which sets up the bookkeeping PyTorch uses to find self.net's parameters. Without it, recent PyTorch versions raise an error ("cannot assign module before Module.__init__() call") as soon as you assign self.net; and even where nothing errors out, policy.parameters() would come back empty, the optimizer would have nothing to update, and training would do nothing.

Forgetting super().__init__() is the classic beginner bug in PyTorch class definitions. The starter code includes it for you.

self.net = nn.Sequential(...)

Stores the entire network as an attribute named net. PyTorch's nn.Module automatically discovers any nn.Module attributes (like Sequential, Linear, etc.) and registers their parameters as part of self.parameters().

Why use nn.Sequential rather than store layers separately: cleaner. Same parameters, same gradient flow, fewer attributes to track.

nn.Linear(state_dim, hidden)

First fully-connected layer. Maps a 4-dim state vector to a 256-dim hidden representation. Internally: output = W @ input + b where W is a learnable [256, 4] matrix and b is a learnable [256] vector.

For an input of shape [B, 4], the output is [B, 256]. The batch dimension is preserved; the operation is applied per-row.

nn.ReLU()

Applies max(0, x) elementwise. No learnable parameters. Output shape == input shape (no dim change). Without ReLU between Linears, the entire network would collapse to a single Linear (because the composition of two linear maps is a linear map). ReLU is what makes the network nonlinear and able to learn complex functions.
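
A quick check of that claim (illustrative, not part of the homework): two stacked Linear layers with no activation in between compute exactly the same function as one combined affine map.

import torch
import torch.nn as nn

torch.manual_seed(0)
f1, f2 = nn.Linear(4, 256), nn.Linear(256, 20)
x = torch.randn(8, 4)

# Compose the two affine maps by hand: W = W2 @ W1, b = W2 @ b1 + b2
W = f2.weight @ f1.weight
b = f2.weight @ f1.bias + f2.bias

print(torch.allclose(f2(f1(x)), x @ W.T + b, atol=1e-5))   # True: no added expressivity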

nn.Linear(hidden, hidden)

Second hidden layer. Maps 256 → 256. Same shape in, same shape out. The actual function this learns is whatever transformation, when ReLU'd and passed to the final layer, helps minimize the loss.

nn.Linear(hidden, action_dim)

Final Linear layer. Maps 256 → 20. The 20 outputs correspond to the 20 future actions in the chunk. These are logits at this point — raw real numbers, possibly outside [0, 1].

nn.Sigmoid()

Squashes each of the 20 outputs to (0, 1) via 1 / (1 + exp(-x)). After this, every output is a valid action target. Without Sigmoid, the network could output 0.5, 100, −3, 0.7 ...; the environment would have to clamp the out-of-range values, and the policy would be training on targets it can never actually execute.

Sigmoid is applied elementwise — each of the 20 outputs is squashed independently. They're not normalized to sum to 1 or anything like that. Each is a separate target y-position.

A common alternative

Some implementations would output raw values from the final Linear and apply torch.sigmoid() in forward:

# in __init__: build the MLP without the final Sigmoid
self.net = nn.Sequential(...)

# in forward: apply the sigmoid explicitly
def forward(self, x):
    return torch.sigmoid(self.net(x))

Mathematically identical. Slightly more flexible if you want to remove the sigmoid for inspection. Either pattern is fine; baking it into Sequential is more compact.

Change 2 of 3
BCPolicy.forward

Where: networks.py:334-338.

What you need: take a state batch, return the predicted action chunk.

The code:

def forward(self, state):
    return self.net(state)

Decoded

return self.net(state)

One line. self.net is the nn.Sequential from __init__. Calling it with state runs all six operations (3 Linears, 2 ReLUs, 1 Sigmoid) in order. Returns shape [B, 20] if input is shape [B, 4].

Why this is one line: because nn.Sequential defines its own forward internally. We just delegate to it. If we'd kept layers as separate attributes, we'd need to chain them manually.

A common variation

If you'd defined the network with separate attributes:

self.fc1 = nn.Linear(state_dim, hidden)
self.fc2 = nn.Linear(hidden, hidden)
self.fc3 = nn.Linear(hidden, action_dim)

then forward would chain them by hand:

def forward(self, state):
    x = torch.relu(self.fc1(state))
    x = torch.relu(self.fc2(x))
    return torch.sigmoid(self.fc3(x))

Same network, more code. Stick with Sequential for this homework.

Change 3 of 3
mse_loss

Where: losses.py:18-33.

The math:

L = (1/N) Σi=1..N Σk=1..20 ( πθ(si)[k] − ai*[k] )²

"Mean of squared element-wise differences across the batch and the action chunk."

The code:

def mse_loss(policy, s_batch, a_batch):
    predicted = policy(s_batch)
    return nn.functional.mse_loss(predicted, a_batch)

Decoded

predicted = policy(s_batch)

Forward pass through the BC policy. s_batch is shape [B, 4]; predicted comes out as shape [B, 20].

Note this calls policy(s_batch) not policy.forward(s_batch). The first triggers PyTorch's bookkeeping (registering this forward pass with autograd, etc.). The second skips it. Always use the former.

return nn.functional.mse_loss(predicted, a_batch)

PyTorch's built-in MSE function. Computes (predicted - a_batch)2 elementwise, then averages over all elements (both batch dim and action dim). Returns a scalar.

The function is in two namespaces: nn.MSELoss() (class form, instantiate then call) and nn.functional.mse_loss (function form, one-shot). Both default to reduction='mean', which is what we want.

Reductions: pass reduction='sum' if you want the unaveraged sum (rare); reduction='none' if you want per-element losses (useful if you'll weight them later, like AWAC's exp_weights). For plain BC, default mean is correct.
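
A sketch of the reduction='none' pattern in case you later want per-sample weights (the weights here are made up; plain BC does not need any of this):

import torch
import torch.nn.functional as F

predicted = torch.rand(64, 20)
target = torch.rand(64, 20)
weights = torch.rand(64)                                     # hypothetical per-sample weights

per_elem = F.mse_loss(predicted, target, reduction='none')   # [64, 20], no averaging yet
per_sample = per_elem.mean(dim=1)                            # [64], one loss per sample
loss = (weights * per_sample).mean()                         # weighted scalar loss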

An equivalent manual formulation

You could write this from scratch:

predicted = policy(s_batch)
return ((predicted - a_batch) ** 2).mean()

Identical numerically. F.mse_loss is slightly faster (specialized kernel) and self-documenting. Either is fine.

Common bugs in this function

1. Forgetting to call policy(s_batch): passing s_batch directly into F.mse_loss would compute the MSE between states and actions, which is meaningless.

2. Wrong shapes: predicted and a_batch must have the same shape. If a_batch arrives as [B, 20] but the policy outputs [B, 1] (because action_dim=1 got passed), the MSE will broadcast to [B, 20] by tiling and produce a wrong answer (recent PyTorch emits a broadcasting warning, but training carries on). Print shapes if loss numbers look odd, or add the assert sketched below.

3. Using policy.forward: skips PyTorch's module hooks. It usually doesn't break MSE specifically, but it's a bad habit.
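
If you want a cheap guard against bug 2, add an assert before computing the loss (optional; not required by the starter code):

import torch.nn.functional as F

def mse_loss(policy, s_batch, a_batch):
    predicted = policy(s_batch)
    # Catch the action_dim=1 / broadcasting bug before it quietly corrupts the loss.
    assert predicted.shape == a_batch.shape, (
        f"shape mismatch: policy output {tuple(predicted.shape)} "
        f"vs expert actions {tuple(a_batch.shape)}"
    )
    return F.mse_loss(predicted, a_batch)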

Summary

Three blanks. Total < 15 lines of actual code. The conceptual content is the architecture choice (3-layer MLP) and the loss choice (MSE). Both are textbook. The interesting part is in Chapter 07: why MSE fails on hard mode.

Chapter 11

Running It

Setup

From the starter code's installation.md:

conda create -n cs224r-hw1 python=3.10
conda activate cs224r-hw1
cd hw1_starter_code
pip install -e .

This installs the local hw1 package in editable mode plus its dependencies (torch, gymnasium, numpy, matplotlib).

Running BC regression on easy mode

python main.py --method bc_reg --env easy

This:

  1. Collects expert demos on easy mode (~5 episodes, ~5000 transitions).
  2. Trains your BCPolicy via mse_loss for some number of epochs.
  3. Evaluates on 50 episodes, reports mean and std of episode length.
  4. Saves results to bc_reg_easy.txt.

Expected on easy mode: mean episode length close to 1000 (full survival), low standard deviation. Easy mode's expert is unimodal (it always aims at the single gap), so MSE works fine.

Running BC regression on hard mode

python main.py --method bc_reg --env hard

Same training pipeline, hard mode. Expected: mean episode length much shorter than 1000, high standard deviation. The policy predicts the conditional mean of the multimodal expert's actions and crashes into the wall at most double-gap pipes.

Save the result to bc_reg_hard.txt. Report both numbers in the writeup table.

How long does training take?

Probably 1-5 minutes on CPU, much faster on GPU. The model is small, the dataset is small, and there's no env interaction during training. This is one homework where you don't need a cluster.

What healthy training looks like

Metric | Healthy | Bug
Training loss | Decreases over epochs, then plateaus | Stays flat at random-init level (forgot super().__init__() or forgot to call policy(s_batch))
Training loss (final) | Around 0.01-0.05 | Exactly 0.0 (something's off; possibly a shape bug)
Eval episode length (easy) | 900-1000 | < 100 (model not actually learning, or sigmoid not applied)
Eval episode length (hard) | 200-500 typical | 1000 (would mean BC magically solves hard mode; it doesn't)

Common reasons it doesn't work

  1. Forgot super().__init__(): PyTorch usually raises an error the moment you assign self.net; if it doesn't, no parameters are tracked, the optimizer has nothing to update, and training does nothing. Easy give-away: the training loss never changes.
  2. Forgot the Sigmoid: predictions are unbounded, the env clamps them, but training is unstable. Random or poor performance.
  3. Wrong action_dim: BCPolicy(action_dim=1) instead of 20 outputs a single scalar instead of a 20-step chunk. The rollout loop crashes immediately.
  4. Calling policy.forward(s_batch) instead of policy(s_batch): usually works for BC, but it's bad style and can cause issues with hooks and newer PyTorch features.

The deliverable

Per the PDF, you need three things in your writeup:

  1. Table for easy mode: mean ± std of episode length over 50 evaluation episodes.
  2. Table for hard mode: same metric.
  3. 2-3 sentence explanation of MSE's hard-mode performance. Use the template from Chapter 07.
Chapter 12

Cheat Sheet & Self-Quiz

Equations

BC objective θ* = arg minθ E(s, a*) ~ D[ || πθ(s) − a* ||² ]
MSE minimizer πθ*(s) = E[ a* | s ] (conditional mean of the expert's actions)
Architecture state → Linear(4, 256) → ReLU → Linear(256, 256) → ReLU → Linear(256, 20) → Sigmoid → action chunk

API reference

Call | Returns
nn.Linear(in_dim, out_dim) | Fully-connected layer
nn.ReLU() | max(0, x) activation
nn.Sigmoid() | 1 / (1 + exp(-x)) activation, output in (0, 1)
nn.Sequential(*layers) | Composes layers into a single Module
nn.functional.mse_loss(pred, target) | Mean squared error scalar
policy(s_batch) | Forward pass with bookkeeping
super().__init__() | Initializes nn.Module's bookkeeping

Self-quiz

  1. What's the difference between imitation learning and reinforcement learning?
  2. What is behavior cloning (in one sentence)?
  3. Why does the BC policy use sigmoid on its output?
  4. What does the MSE loss converge to in expectation?
  5. Why does action chunking help BC?
  6. What's the difference between predicting 20 actions and executing 20 actions?
  7. Why does easy mode work but hard mode fail with MSE?
  8. What does "the conditional mean of expert actions" mean for the gap-1-or-gap-2 case?
  9. Why is super().__init__() required in BCPolicy.__init__?
  10. What's the difference between policy(x) and policy.forward(x)?
  11. Why use ReLU instead of sigmoid for hidden layers?
  12. What two HW1 problems fix the multimodality issue we observe in P1, and how?
Answer key

1. RL learns from environment interaction guided by a reward function. IL learns from expert demonstrations via supervised learning. RL needs exploration; IL doesn't.

2. Treat (state, expert action) pairs as a supervised dataset and train a policy via standard supervised learning to predict expert actions from states.

3. Because actions are normalized to [0, 1]. Sigmoid maps any real input to (0, 1), so the network's outputs are always valid actions without needing post-hoc clipping.

4. The conditional mean: π*(s) = E[a* | s]. The MSE-minimizing prediction at each state is the average of expert actions at that state.

5. Three reasons: (1) temporal consistency, (2) less compounding error from fewer policy queries per episode, (3) implicit multi-step planning.

6. The policy predicts 20 future actions in a single forward pass. Only the first 10 (EXECUTE_STEPS) are actually applied to the env before re-querying. The other 10 are computed but discarded.

7. Easy mode has unimodal experts (one gap, one valid action). MSE finds the right action. Hard mode has multimodal experts (two gaps, two valid actions). MSE averages them and outputs the wall.

8. If demo A aims at gap 1 (y=0.7) and demo B aims at gap 2 (y=0.3), both at the same state, the conditional mean is 0.5. That's exactly the wall between them.

9. Without it, nn.Module's parameter-tracking bookkeeping isn't initialized: recent PyTorch raises an error as soon as you assign a layer to self, and in any case self.parameters() wouldn't include your layers, so the optimizer would have nothing to update and training would go nowhere.

10. policy(x) calls __call__, which runs PyTorch's module machinery (forward pre-hooks and hooks, among other bookkeeping) before calling forward(x). policy.forward(x) skips that machinery. Use the first; the second is only for very low-level work.

11. ReLU doesn't saturate for positive inputs, so gradients stay healthy across many layers. Sigmoid saturates near 0 and 1, vanishing the gradient and slowing learning. Sigmoid is appropriate at the output (where saturation is the point) but not in hidden layers.

12. Problem 2 (Flow Matching) replaces the deterministic regressor with a generative model that can sample one of multiple modes. Problem 3 (DAgger) keeps MSE but uses a deterministic expert that always picks the same gap, removing multimodality from the data.

Implementation order

  1. BCPolicy.__init__ — ~1 minute. Six lines, all standard nn.Module idioms.
  2. BCPolicy.forward — 30 seconds. One line.
  3. mse_loss — 1 minute. Two lines.

Total: ~3 minutes of typing. Run easy mode first, verify ~1000 step survival. Run hard mode, observe the failure, write the explanation. Move on to Problem 2 (flow matching) which fixes the multimodality issue.

Take it back to class

You can now teach this

Three big ideas, in order of importance:

  1. Behavior cloning is supervised learning on (state, action) pairs. There's nothing fancy about the algorithm. The interesting parts are: which loss function, which architecture, and what to do when supervision is multimodal.
  2. The MSE-minimizer is the conditional mean. This is fine when the expert is deterministic per state. It's catastrophic when the expert is multimodal — the mean of two valid actions can be a forbidden middle ground (the wall in Flappy Bird's hard mode).
  3. Two orthogonal fixes for multimodality: enrich the model (Problem 2's flow matching can capture multimodal distributions) or sanitize the data (Problem 3's DAgger uses a deterministic expert). Both work; both are widely used in real robot learning.

If a friend asks: "Why does behavior cloning fail on hard mode?" — you say: "It's not BC's fault — it's the loss function. MSE regression converges to the conditional mean of the expert's actions. When the expert is bimodal — sometimes go up, sometimes go down — the mean is in the middle, which is a wall. The fix is either a richer model (flow matching captures bimodality) or cleaner data (a deterministic expert that always picks the same gap)."

You can teach this. On to Problem 2.