A Flappy Bird, a 4-D observation, twenty future actions to predict. The simplest imitation learning algorithm there is. Why it works on easy mode, why it dies on hard mode, and the deep lesson that motivates flow matching.
You've been given an environment that's a physics-based version of the classic Flappy Bird game. A bird falls under gravity. Pipes scroll left across the screen. Each pipe has one or two gaps. The bird must navigate through the gaps without crashing.
The agent's job: at each timestep, output a target y-position in [0, 1]. A built-in PD controller converts that target into thrust, creating realistic momentum-based motion.
4 numbers, all normalised to [0, 1]:
- obs[0]: distance to the next pipe
- obs[1]: y-position of gap 1
- obs[2]: y-position of gap 2 (in easy mode, same as gap 1)
- obs[3]: bird's current y-position

| Mode | Description | Why it matters |
|---|---|---|
| Easy | One gap per pipe (gap1_y == gap2_y) | Unimodal expert. Plain BC works fine. |
| Hard | Alternating single and double-gap pipes | Multimodal expert. Plain BC fails spectacularly. |
The "hard" mode is where the lesson lives. With double-gap pipes, the expert can choose either gap. Different demonstrations choose differently. We'll see in Chapter 07 why this breaks MSE regression.
Modern imitation-learning policies don't predict one action at a time. They predict a chunk of T future actions in a single forward pass:

(a_t, a_{t+1}, ..., a_{t+T−1}) = π_θ(s_t)
In this homework: ACTION_CHUNK = 20, EXECUTE_STEPS = 10. The policy predicts 20 future targets, but only the first 10 are executed before re-querying. This is called receding horizon control. We'll explain why this helps in Chapter 04.
Average episode length over 50 evaluation episodes. Episodes cap at 1000 steps; surviving all 1000 is "success." Higher is better. Standard deviation tells you about consistency — high std means the policy sometimes succeeds and sometimes immediately crashes.
Imitation learning is fundamentally different from reinforcement learning. If you're coming to HW1 fresh, internalize this distinction before anything else.
| Property | Reinforcement Learning | Imitation Learning |
|---|---|---|
| Data source | Trial and error in the environment | Expert demonstrations (no env interaction during training) |
| Signal | Reward function | Expert's actions (treated as labels) |
| Algorithm family | Q-learning, policy gradient, etc. | Supervised learning |
| Difficulty | Sparse rewards, exploration, credit assignment | Distribution shift, multimodal experts |
| Example | Train a chess engine via self-play | Train a chess engine by watching humans |
The imitation learning setup:
a dataset of (state, action) pairs collected by running an expert.

The simplest imitation learning algorithm: treat the expert demonstrations as a supervised learning dataset, where states are inputs and expert actions are labels. Train a neural network via standard supervised learning.
That's it. No exploration. No reward function. No Bellman equation. Just supervised learning with state → action pairs. The complexity comes from the subtleties: which loss function, which architecture, and what to do when the expert is multimodal.
Three reasons it's the right starting point for a course on robot learning: it's the simplest imitation-learning algorithm there is, modern robot-learning systems (Diffusion Policy, ALOHA, RT-1) are built on the same predict-the-expert's-actions idea, and its failure modes motivate everything that follows.
HW1 is BC variants. HW2 is RL from scratch (no expert). HW3 is offline RL (expert dataset + value learning). The arc: pure imitation → pure RL → combined. Understanding HW1 deeply makes the rest of the course click.
Strip BC down to its essence and it's literally just supervised learning.
Run the expert in the environment N times. Record every (state, action) pair you see. Result: a dataset

D = {(s_i, a_i*)}

where each a_i* is the action the expert took at state s_i.
Find policy parameters θ that minimize the discrepancy between the policy's predicted action and the expert's action, averaged over the dataset:
θ* = argmin_θ (1/|D|) Σ_i L(π_θ(s_i), a_i*)

where L is some loss function. For Problem 1, L is mean squared error. (Problem 2 will use flow matching, a generative loss. Different loss, same overall structure.)
Three steps: collect expert demonstrations, train the policy with supervised learning on the resulting (state, action) pairs, and run the trained policy in the environment.
If you've trained any classifier or regressor in PyTorch before, you've already done all the mechanics of BC. The only thing that's new is the data — states and actions instead of images and labels.
The hardness of BC isn't in the optimization. It's in the data: distribution shift (the trained policy drifts to states the expert never demonstrated) and multimodal experts (multiple valid actions at the same state).
Problem 1 of HW1 introduces BC. Problem 2 (flow matching) tackles multimodality. Problem 3 (DAgger) tackles distribution shift. Each problem is one BC failure mode and one fix.
Most BC tutorials predict one action at a time:

a_t = π_θ(s_t)
Modern robot learning (e.g., Diffusion Policy, ALOHA, RT-1) predicts a chunk of T future actions:

(a_t, a_{t+1}, ..., a_{t+T−1}) = π_θ(s_t)
where T = 20 in this homework (the ACTION_CHUNK constant). At rollout time, the policy is queried, then the first EXECUTE_STEPS = 10 actions are executed open-loop (without re-querying), and only after the 10th step does the policy fire again.
This is called receding horizon control — the policy commits to a 20-step plan, executes 10 steps of it, then re-plans.
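To make the pattern concrete, here's a minimal rollout sketch, not the actual loop in main.py: it assumes a gymnasium-style env whose step accepts a single float target and a policy that returns a [1, 20] tensor.

```python
import torch

ACTION_CHUNK = 20    # actions predicted per policy query
EXECUTE_STEPS = 10   # actions actually executed before re-querying

def rollout(policy, env, max_steps=1000):
    """Receding-horizon rollout: predict 20 targets, execute 10, re-plan."""
    obs, _ = env.reset()
    steps = 0
    while steps < max_steps:
        with torch.no_grad():
            state = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)  # [1, 4]
            chunk = policy(state).squeeze(0)                                # [20]
        # Open-loop execution of the first EXECUTE_STEPS actions.
        for action in chunk[:EXECUTE_STEPS]:
            obs, _, terminated, truncated, _ = env.step(action.item())
            steps += 1
            if terminated or truncated or steps >= max_steps:
                return steps
    return steps
```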
Three big reasons:
If you predict one action at a time, the policy can flip-flop — this step it predicts "go up," the next step "go down." Action chunks force temporal coherence: the network has to commit to a continuous plan, not a shaky sequence of independent decisions.
Each policy query is one source of stochasticity. Querying once every 10 steps means you accumulate fewer "rolls of the policy dice" per episode. If errors are roughly Gaussian-distributed per query, fewer queries = lower total error variance.
Predicting 20 steps ahead lets the policy plan around obstacles (e.g., commit to a gap, route through it). A single-step policy has to make this decision fresh at every step, which gets tangled in noise.
Large chunks (T ≫ T_execute): more temporal consistency but the policy commits to outdated plans for longer.
Small chunks (T == T_execute): less consistency but always responsive to fresh observations.
The "execute K of T predicted" pattern is the typical compromise: predict more than you execute, then re-query before the buffer runs out.
For BC with action chunking on this homework:
- Input: states of shape [B, 4].
- Output: action chunks of shape [B, 20].

The policy is just an MLP that maps 4 → 256 → 256 → 20. Each output dimension corresponds to one future action. The action_dim parameter in BCPolicy defaults to 20 because that's the chunk size.
Problem 1 specifies a 3-layer MLP: Linear(4 → 256), ReLU, Linear(256 → 256), ReLU, Linear(256 → 20), Sigmoid.
Three things to understand here.
The problem is small (4-D input, 20-D output, simple physics). Convolutional networks would be overkill (no spatial structure in the state). Transformers would be overkill (no sequential structure in the input). A plain MLP with two hidden layers of 256 is the natural choice.
For comparison, the FlowMatchingPolicy for Problem 2 is much fancier — a 1D U-Net with conditional residual blocks. That's because flow matching needs to learn a vector field, which is a more complex function class.
ReLU(x) = max(0, x). Three reasons to use it: it doesn't saturate for positive inputs, so gradients stay healthy through the hidden layers; it has no parameters and is cheap to compute; and it's the well-tested default for MLP hidden layers.
Could you use Mish, GELU, SiLU instead? Sure. ReLU is the textbook default and works fine here.
The action is a target y-position, normalized to [0, 1]. The sigmoid activation

σ(x) = 1 / (1 + exp(−x))

maps any real input to (0, 1). This is the natural activation for outputs that need to be in [0, 1]. (Compare to HW2 P2's actor, which used tanh for the [-1, 1] action range.)
You could output raw values from the final Linear and clamp to [0, 1]. But:
• If you clamp during training, the loss has a kink at the boundaries and the gradient is zero wherever the output is clamped, so out-of-range predictions stop receiving a corrective signal.
• If you clamp only at inference, the network has to learn that values outside [0, 1] are forbidden, which is implicit and hard.
Sigmoid bakes the constraint into the architecture. The network outputs unconstrained logits; the sigmoid converts them to the action range. Cleaner gradient signal, no out-of-range predictions ever.
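A tiny illustration of the gradient difference (a standalone sketch, not part of the starter code): push a raw output above the valid range and compare what clamp and sigmoid feed back to it.

```python
import torch

raw = torch.tensor([3.0], requires_grad=True)  # raw network output, well above the valid range
target = torch.tensor([0.7])

# Clamp: the output is pinned at 1.0 and the gradient w.r.t. raw is zero,
# so the raw value gets no signal pulling it back toward the target.
loss_clamp = (raw.clamp(0.0, 1.0) - target).pow(2).mean()
loss_clamp.backward()
print(raw.grad)  # tensor([0.])

# Sigmoid: the output is squashed to ~0.95 and the gradient is small but nonzero,
# so the raw value is still pulled toward the point that maps to 0.7.
raw.grad = None
loss_sig = (torch.sigmoid(raw) - target).pow(2).mean()
loss_sig.backward()
print(raw.grad)  # roughly tensor([0.023]): nonzero, pointing the right way
```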
Just nn.Sequential stacking. Five operations, each one line:
```python
nn.Sequential(
    nn.Linear(state_dim, hidden),   # 4 → 256
    nn.ReLU(),
    nn.Linear(hidden, hidden),      # 256 → 256
    nn.ReLU(),
    nn.Linear(hidden, action_dim),  # 256 → 20
    nn.Sigmoid(),
)
```
That's the entire model. You'll write this in BCPolicy.__init__.
L(θ) = (1/|D|) Σ_i ||π_θ(s_i) − a_i*||²

Read it: "average over the dataset of the squared distance between the policy's predicted action and the expert's action."
For the action chunk version, the squared norm is over all 20 dimensions of the chunk:

||π_θ(s_i) − a_i*||² = Σ_{t=1..20} (π_θ(s_i)_t − a_{i,t}*)²

Each future action's prediction error squared, summed.
| Loss | Form | Pros | Cons |
|---|---|---|---|
| L2 (MSE) | (π − a*)² | Smooth gradient; minimizer is the conditional mean | Sensitive to outliers; averages multimodal targets |
| L1 (MAE) | |π − a*| | Robust to outliers; minimizer is the conditional median | Non-smooth at 0; harder to optimize with SGD |
For BC on smooth control tasks, MSE is the standard choice. The minimizer is the conditional expectation of expert actions:

π*(s) = E[a* | s]
This is a fact you should remember. The MSE-minimizing prediction at state s is the average of all the expert's actions at that state.
Take the gradient of expected squared error w.r.t. the prediction p:
d/dp E[(p − a)²] = 2 · E[p − a] = 2 · (p − E[a])
Setting this to zero gives p = E[a]. The optimal MSE prediction is the conditional mean of the target.
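You can verify this numerically with a throwaway sketch: fit a single constant prediction to the two targets 0.3 and 0.7 by gradient descent on MSE and watch it land on their mean.

```python
import torch

# Two expert actions recorded at the same state: one demo aimed at 0.7, another at 0.3.
targets = torch.tensor([0.7, 0.3])

# A single learnable prediction p, trained against both targets with MSE.
p = torch.tensor(0.9, requires_grad=True)
opt = torch.optim.SGD([p], lr=0.1)

for _ in range(200):
    loss = ((p - targets) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(round(p.item(), 3))  # ~0.5, the conditional mean of the two targets
```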
For unimodal experts (always make the same choice), this is exactly what we want. For multimodal experts (sometimes choose A, sometimes B), the mean of A and B is a problem — it's neither A nor B, and it could be a terrible prediction. That's the next chapter.
This chapter is the conceptual climax of Problem 1. Spend time here.
Hard mode has alternating single-gap and double-gap pipes. When the expert sees a double-gap pipe, it commits to either gap 1 or gap 2 at random, from the same observation. So the dataset contains pairs like (s, 0.7) from demonstrations that aimed at gap 1 and (s, 0.3) from demonstrations that aimed at gap 2: two valid expert actions, randomly chosen, from the same state.
The MSE-minimizer is the conditional mean. Average of 0.7 and 0.3 is 0.5. So the trained policy outputs 0.5 at this state — which is exactly the wall between the two gaps.
The policy isn't doing anything wrong — it's faithfully reproducing the conditional mean of the expert distribution. But the conditional mean is not a valid action.
MSE regression assumes the conditional distribution p(a | s) is unimodal — that there's a single "best" action at each state. When the expert's behavior is multimodal (multiple valid actions per state), the mean of those modes can be far from any actual mode and can be catastrophically bad.
Problem 1 is designed to fail on hard mode. You'll observe poor performance — mean episode length much shorter than 1000 — and the homework asks you to explain why in 2-3 sentences. The answer is exactly: multimodality of the expert plus mean-seeking behavior of MSE.
The two follow-up problems are direct fixes. Problem 2 (flow matching) replaces the deterministic regressor with a generative model that can represent multiple modes and sample one of them. Problem 3 (DAgger) switches to a deterministic expert that always picks the same gap, so the demonstration data becomes unimodal.
Both are valid solutions to the same problem. They take orthogonal approaches: Problem 2 makes the model richer; Problem 3 changes the data to be unimodal. Real robot learning systems often combine both: rich generative models + curated expert data.
The PDF asks for 2-3 sentences explaining MSE on hard mode. Here's the structure:
"On hard mode, the expert chooses between two valid actions (gap 1 or gap 2) randomly at the same state, making the action distribution multimodal. The MSE loss converges to the conditional mean of expert actions, so the policy outputs the average of the two gaps — a target between them where the wall is. This causes the bird to crash, leading to short episode lengths."
If you know Python but PyTorch is new, read this. You'll only need a small slice of PyTorch for Problem 1.
Numpy arrays with two extras: GPU residency and autograd. torch.tensor([1., 2., 3.]) creates a tensor; .shape, .mean(), +, *, etc. work like numpy.
Inherit from nn.Module, define layers in __init__, define the forward pass in forward. The framework handles parameters and gradient tracking automatically.
```python
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(10, 256)
        self.layer2 = nn.Linear(256, 1)

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        return self.layer2(x)

net = SimpleNet()
y = net(torch.randn(32, 10))  # shape [32, 1]
```
Notice net(x) calls forward(x) indirectly. The framework runs some bookkeeping then forwards. Don't call .forward() directly — you'll skip the bookkeeping and break things.
For pure feed-forward networks (no skip connections, no branching), you can skip the boilerplate of __init__+forward and just stack layers:
```python
self.net = nn.Sequential(
    nn.Linear(10, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)
y = self.net(x)  # runs through all layers in order
```
Same effect as the manual class. Internally, nn.Sequential just calls each layer in order. You'll use this pattern in BCPolicy.
PyTorch's loss functions are usually classes you instantiate, then call:
```python
criterion = nn.MSELoss()
loss = criterion(predicted, target)  # scalar
```
Or use the functional form, which is one-shot:
```python
import torch.nn.functional as F
loss = F.mse_loss(predicted, target)  # same thing, no instance needed
```
Both average the per-element squared errors into a scalar by default (reduction='mean'). The default is what you want for SGD.
For BC on this homework:
- state has shape [B, 4] — B = batch size (e.g., 64), 4 = state dimension.
- predicted_action has shape [B, 20] — B batches, 20-dim action chunks.
- expert_action has shape [B, 20] — same.

The MSE loss reduces both to a scalar by squaring elementwise then averaging across all 20 × B elements.
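A quick shape sanity check you can run once BCPolicy is implemented (illustrative only; assumes the default constructor arguments and that BCPolicy is in scope):

```python
import torch
import torch.nn.functional as F

policy = BCPolicy()                     # defaults: 4 → 256 → 256 → 20
s_batch = torch.randn(64, 4)            # fake state batch, shape [B, 4]
a_batch = torch.rand(64, 20)            # fake expert chunks in [0, 1], shape [B, 20]

predicted = policy(s_batch)
print(predicted.shape)                  # torch.Size([64, 20])
print(predicted.min().item(), predicted.max().item())  # both inside (0, 1), thanks to the sigmoid

loss = F.mse_loss(predicted, a_batch)
print(loss.shape)                       # torch.Size([]): a single scalar
```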
| Function | Where to use | Example |
|---|---|---|
| nn.ReLU() | Hidden layers (general) | nn.Sequential(nn.Linear(...), nn.ReLU()) |
| nn.Sigmoid() | Output in [0, 1] | Probabilities, normalized actions |
| nn.Tanh() | Output in [-1, 1] | Symmetric continuous actions |
| nn.Softmax(dim=-1) | Output is a probability distribution | Discrete action policies |
You'll use nn.ReLU() twice and nn.Sigmoid() once in BCPolicy.
That's all the PyTorch you need for Problem 1. Onward.
Three files for Problem 1.
| File | Status | What's there |
|---|---|---|
| main.py | read-only | Training loop, eval, CLI entry point |
| flappy_bird_env.py | read-only | Gymnasium environment |
| expert.py | read-only | Expert policy + demo collection |
| networks.py | EDIT | BCPolicy class (and FlowMatchingSchedule for P2) |
| losses.py | EDIT | mse_loss (and flow_matching_loss for P2) |
| dagger.py | EDIT (P3) | For Problem 3 only |
Look at networks.py:316-339. The class is empty:
```python
class BCPolicy(nn.Module):
    def __init__(self, state_dim: int = 4, action_dim: int = 20, hidden: int = 256):
        super().__init__()
        # TODO: Implement BCPolicy.__init__
        raise NotImplementedError("TODO: Implement BCPolicy.__init__")

    def forward(self, state):
        # TODO: Implement BCPolicy.forward
        raise NotImplementedError("TODO: Implement BCPolicy.forward")
```
Default constructor args: state_dim=4, action_dim=20, hidden=256. These match the environment's observation size, the action chunk length, and the standard hidden width.
Look at losses.py:18-33. Same pattern:
```python
def mse_loss(policy, s_batch, a_batch):
    # TODO: Implement mse_loss
    raise NotImplementedError("TODO: Implement mse_loss")
```
The function signature receives policy (a callable that takes states and returns predicted actions), s_batch of shape [B, state_dim], and a_batch of shape [B, action_dim]. Returns a scalar.
You don't write this, but it's worth understanding what calls your code. From main.py, paraphrased:
```python
# Once at the start: collect demos, build dataset
states, actions = collect_expert_data(...)
dataset = TensorDataset(states, actions)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Build policy and optimizer
policy = BCPolicy()
optimizer = optim.Adam(policy.parameters(), lr=1e-3)

# Standard supervised learning loop
for epoch in range(num_epochs):
    for s_batch, a_batch in loader:
        loss = mse_loss(policy, s_batch, a_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
Plain supervised learning on a regression task. Your mse_loss goes inside the inner loop. Your BCPolicy is what policy(s_batch) calls.
The training loop is generic. To switch from MSE regression (Problem 1) to flow matching (Problem 2), you just swap the loss function and policy class. Same loop, same dataset, same optimizer. main.py dispatches on the --method CLI flag and picks the right combination. This is the same pattern HW2/HW3 used — the framework is generic, your code is the algorithm.
Per-line annotations for every blank you'll fill in. This is the centerpiece chapter.
Where: networks.py:327-332.
What you need to build: a 3-layer MLP. The PDF spec is exact:
Dimensions: state_dim → hidden → hidden → action_dim.
The code:
```python
def __init__(self, state_dim=4, action_dim=20, hidden=256):
    super().__init__()
    self.net = nn.Sequential(
        nn.Linear(state_dim, hidden),
        nn.ReLU(),
        nn.Linear(hidden, hidden),
        nn.ReLU(),
        nn.Linear(hidden, action_dim),
        nn.Sigmoid(),
    )
```
Calls nn.Module's constructor. Without this, the parameter tracking and bookkeeping system that PyTorch uses to find self.net's parameters won't initialize. Optimizer.parameters() would return an empty list and training would silently do nothing.
Forgetting super().__init__() is the #1 silent bug in PyTorch class definitions. The starter code includes it for you.
Stores the entire network as an attribute named net. PyTorch's nn.Module automatically discovers any nn.Module attributes (like Sequential, Linear, etc.) and registers their parameters as part of self.parameters().
Why use nn.Sequential rather than store layers separately: cleaner. Same parameters, same gradient flow, fewer attributes to track.
First fully-connected layer. Maps a 4-dim state vector to a 256-dim hidden representation. Internally: output = W @ input + b where W is a learnable [256, 4] matrix and b is a learnable [256] vector.
For an input of shape [B, 4], the output is [B, 256]. The batch dimension is preserved; the operation is applied per-row.
Applies max(0, x) elementwise. No learnable parameters. Output shape == input shape (no dim change). Without ReLU between Linears, the entire network would collapse to a single Linear (because the composition of two linear maps is a linear map). ReLU is what makes the network nonlinear and able to learn complex functions.
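A throwaway demonstration of that collapse (illustrative only: random weight matrices, no biases):

```python
import torch

W1 = torch.randn(256, 4)
W2 = torch.randn(20, 256)
x = torch.randn(4)

two_step = W2 @ (W1 @ x)   # two stacked linear maps with no nonlinearity between them...
one_step = (W2 @ W1) @ x   # ...are exactly one linear map with weight W2 @ W1
print(torch.allclose(two_step, one_step, atol=1e-4))  # True
```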
Second hidden layer. Maps 256 → 256. Same shape in, same shape out. The actual function this learns is whatever transformation, when ReLU'd and passed to the final layer, helps minimize the loss.
Final Linear layer. Maps 256 → 20. The 20 outputs correspond to the 20 future actions in the chunk. These are logits at this point — raw real numbers, possibly outside [0, 1].
Squashes each of the 20 outputs to (0, 1) via 1 / (1 + exp(-x)). After this, every output is a valid action target. Without Sigmoid, the network could output 0.5, 100, -3, 0.7 ... and the env would crash on the negative or oversized values.
Sigmoid is applied elementwise — each of the 20 outputs is squashed independently. They're not normalized to sum to 1 or anything like that. Each is a separate target y-position.
Some implementations would output raw values from the final Linear and apply torch.sigmoid() in forward:
```python
self.net = nn.Sequential(...)  # without sigmoid

def forward(self, x):
    return torch.sigmoid(self.net(x))
```
Mathematically identical. Slightly more flexible if you want to remove the sigmoid for inspection. Either pattern is fine; baking it into Sequential is more compact.
Where: networks.py:334-338.
What you need: take a state batch, return the predicted action chunk.
The code:
```python
def forward(self, state):
    return self.net(state)
```
One line. self.net is the nn.Sequential from __init__. Calling it with state runs all six operations (3 Linears, 2 ReLUs, 1 Sigmoid) in order. Returns shape [B, 20] if input is shape [B, 4].
Why this is one line: because nn.Sequential defines its own forward internally. We just delegate to it. If we'd kept layers as separate attributes, we'd need to chain them manually.
If you'd defined the network with separate attributes:
```python
self.fc1 = nn.Linear(state_dim, hidden)
self.fc2 = nn.Linear(hidden, hidden)
self.fc3 = nn.Linear(hidden, action_dim)
```
then forward would chain them by hand:
```python
def forward(self, state):
    x = torch.relu(self.fc1(state))
    x = torch.relu(self.fc2(x))
    return torch.sigmoid(self.fc3(x))
```
Same network, more code. Stick with Sequential for this homework.
Where: losses.py:18-33.
The math:
"Mean of squared element-wise differences across the batch and the action chunk."
The code:
```python
def mse_loss(policy, s_batch, a_batch):
    predicted = policy(s_batch)
    return nn.functional.mse_loss(predicted, a_batch)
```
Forward pass through the BC policy. s_batch is shape [B, 4]; predicted comes out as shape [B, 20].
Note this calls policy(s_batch) not policy.forward(s_batch). The first triggers PyTorch's bookkeeping (registering this forward pass with autograd, etc.). The second skips it. Always use the former.
PyTorch's built-in MSE function. Computes (predicted - a_batch)² elementwise, then averages over all elements (both batch dim and action dim). Returns a scalar.
The function is in two namespaces: nn.MSELoss() (class form, instantiate then call) and nn.functional.mse_loss (function form, one-shot). Both default to reduction='mean', which is what we want.
Reductions: pass reduction='sum' if you want the unaveraged sum (rare); reduction='none' if you want per-element losses (useful if you'll weight them later, like AWAC's exp_weights). For plain BC, default mean is correct.
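For illustration (none of this is needed for Problem 1, where the default mean reduction is correct):

```python
import torch
import torch.nn.functional as F

predicted = torch.rand(64, 20)
target = torch.rand(64, 20)

F.mse_loss(predicted, target)                    # scalar: mean over all 64 * 20 elements
F.mse_loss(predicted, target, reduction='sum')   # scalar: unaveraged sum
F.mse_loss(predicted, target, reduction='none')  # shape [64, 20]: per-element losses you could weight
```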
You could write this from scratch:
```python
predicted = policy(s_batch)
return ((predicted - a_batch) ** 2).mean()
```
Identical numerically. F.mse_loss is slightly faster (specialized kernel) and self-documenting. Either is fine.
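If you want to convince yourself the two are numerically identical, a throwaway check:

```python
import torch
import torch.nn.functional as F

predicted = torch.rand(64, 20)
a_batch = torch.rand(64, 20)

manual = ((predicted - a_batch) ** 2).mean()
builtin = F.mse_loss(predicted, a_batch)
print(torch.allclose(manual, builtin))  # True
```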
1. Forgetting to call policy(s_batch): passing s_batch directly into F.mse_loss would compute the MSE between states and actions, which is meaningless.
2. Wrong shapes: predicted and a_batch must have the same shape. If a_batch arrives as [B, 20] but policy outputs [B, 1] (because action_dim=1 got passed), the MSE will broadcast to [B, 20] by tiling and silently produce a wrong answer. Print shapes if loss numbers look odd.
3. Using policy.forward: skips PyTorch hooks. Usually doesn't break MSE specifically, but it's bad habit.
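One cheap guard against pitfall 2 (optional; not required by the starter code) is to assert the shapes match before computing the loss:

```python
import torch.nn as nn

def mse_loss(policy, s_batch, a_batch):
    predicted = policy(s_batch)
    # Catch silent broadcasting: [B, 1] vs [B, 20] would otherwise tile and "work".
    assert predicted.shape == a_batch.shape, (predicted.shape, a_batch.shape)
    return nn.functional.mse_loss(predicted, a_batch)
```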
Three blanks. Total < 15 lines of actual code. The conceptual content is the architecture choice (3-layer MLP) and the loss choice (MSE). Both are textbook. The interesting part is in Chapter 07: why MSE fails on hard mode.
From the starter code's installation.md:
```bash
conda create -n cs224r-hw1 python=3.10
conda activate cs224r-hw1
cd hw1_starter_code
pip install -e .
```
This installs the local hw1 package in editable mode plus its dependencies (torch, gymnasium, numpy, matplotlib).
```bash
python main.py --method bc_reg --env easy
```
This:
- Collects expert demonstrations on easy mode and builds the dataset.
- Trains BCPolicy via mse_loss for some number of epochs.
- Evaluates the trained policy and saves the result to bc_reg_easy.txt.

Expected on easy mode: mean episode length close to 1000 (full survival), low standard deviation. Easy mode has a unimodal expert (always aim at the single gap), so MSE works fine.
```bash
python main.py --method bc_reg --env hard
```
Same training pipeline, hard mode. Expected: mean episode length much shorter than 1000, high standard deviation. The policy is the conditional mean of the multimodal expert and crashes into the wall on most double-gap pipes.
Save the result to bc_reg_hard.txt. Report both numbers in the writeup table.
Probably 1-5 minutes on CPU, much faster on GPU. The model is small, the dataset is small, and there's no env interaction during training. This is one homework where you don't need a cluster.
| Metric | Healthy | Bug |
|---|---|---|
| Training loss | Decreases monotonically over epochs, plateaus | Stays flat at random-init level (forgot super().__init__() or policy(s_batch)) |
| Training loss (final) | Around 0.01-0.05 typical | 0.0 exactly (something's off; possibly shape bug) |
| Eval episode length (easy) | 900-1000 | < 100 (model not actually learning, or sigmoid not applied) |
| Eval episode length (hard) | 200-500 typical | 1000 (would mean BC magically solves it — doesn't happen) |
- Forgetting super().__init__(): the model has no parameters tracked, the optimizer has nothing to update, and training does nothing. Easy give-away: training loss never changes.
- BCPolicy(action_dim=1) instead of 20: the policy outputs a single scalar instead of a 20-step chunk. The loop crashes immediately.
- policy.forward(s_batch) instead of policy(s_batch): usually works for BC but it's bad style and can cause issues with newer PyTorch features.

Per the PDF, you need three things in your writeup: the easy-mode eval numbers, the hard-mode eval numbers, and the 2-3 sentence explanation of the hard-mode failure.
| Call | Returns |
|---|---|
| nn.Linear(in_dim, out_dim) | Fully-connected layer |
| nn.ReLU() | max(0, x) activation |
| nn.Sigmoid() | 1 / (1 + exp(-x)) activation, output in (0, 1) |
| nn.Sequential(*layers) | Composes layers into a single Module |
| nn.functional.mse_loss(pred, target) | Mean squared error scalar |
| policy(s_batch) | Forward pass with bookkeeping |
| super().__init__() | Initialize nn.Module's bookkeeping |
1. How does imitation learning differ from reinforcement learning?
2. What is behavior cloning?
3. Why does BCPolicy end with a sigmoid?
4. What does the MSE-minimizing prediction converge to?
5. Why predict action chunks instead of single actions?
6. What do ACTION_CHUNK = 20 and EXECUTE_STEPS = 10 mean at rollout time?
7. Why does BC with MSE work on easy mode but fail on hard mode?
8. Give a concrete example of the conditional-mean failure.
9. Why is super().__init__() required in BCPolicy.__init__?
10. What's the difference between policy(x) and policy.forward(x)?
11. Why use ReLU in hidden layers rather than sigmoid?
12. How do Problems 2 and 3 each fix the hard-mode failure?

Answers:

1. RL learns from environment interaction guided by a reward function. IL learns from expert demonstrations via supervised learning. RL needs exploration; IL doesn't.
2. Treat (state, expert action) pairs as a supervised dataset and train a policy via standard supervised learning to predict expert actions from states.
3. Because actions are normalized to [0, 1]. Sigmoid maps any real input to (0, 1), so the network's outputs are always valid actions without needing post-hoc clipping.
4. The conditional mean: π*(s) = E[a* | s]. The MSE-minimizing prediction at each state is the average of expert actions at that state.
5. Three reasons: (1) temporal consistency, (2) less compounding error from fewer policy queries per episode, (3) implicit multi-step planning.
6. The policy predicts 20 future actions in a single forward pass. Only the first 10 (EXECUTE_STEPS) are actually applied to the env before re-querying. The other 10 are computed but discarded.
7. Easy mode has unimodal experts (one gap, one valid action). MSE finds the right action. Hard mode has multimodal experts (two gaps, two valid actions). MSE averages them and outputs the wall.
8. If demo A aims at gap 1 (y=0.7) and demo B aims at gap 2 (y=0.3), both at the same state, the conditional mean is 0.5. That's exactly the wall between them.
9. Without it, nn.Module's parameter-tracking bookkeeping isn't initialized. Layers added to self after that won't be discoverable by self.parameters(), the optimizer will have nothing to update, and training will silently do nothing.
10. policy(x) calls __call__, which runs PyTorch's hooks (autograd registration, eval/train mode, etc.) before forwarding. policy.forward(x) skips those hooks. Use the first; the second is only for very low-level work.
11. ReLU doesn't saturate for positive inputs, so gradients stay healthy across many layers. Sigmoid saturates near 0 and 1, vanishing the gradient and slowing learning. Sigmoid is appropriate at the output (where saturation is the point) but not in hidden layers.
12. Problem 2 (Flow Matching) replaces the deterministic regressor with a generative model that can sample one of multiple modes. Problem 3 (DAgger) keeps MSE but uses a deterministic expert that always picks the same gap consistently, removing multimodality from the data.
Total: ~3 minutes of typing. Run easy mode first, verify ~1000 step survival. Run hard mode, observe the failure, write the explanation. Move on to Problem 2 (flow matching) which fixes the multimodality issue.
Three big ideas, in order of importance: (1) MSE regression converges to the conditional mean of expert actions, which fails when the expert is multimodal; (2) behavior cloning is nothing more than supervised learning on expert (state, action) pairs; (3) modern policies predict action chunks and execute them receding-horizon style.
If a friend asks: "Why does behavior cloning fail on hard mode?" — you say: "It's not BC's fault — it's the loss function. MSE regression converges to the conditional mean of the expert's actions. When the expert is bimodal — sometimes go up, sometimes go down — the mean is in the middle, which is a wall. The fix is either a richer model (flow matching captures bimodality) or cleaner data (a deterministic expert that always picks the same gap)."
You can teach this. On to Problem 2.