Roll out the policy. Ask the expert what to do at every state visited. Aggregate. Retrain. Repeat. The simplest fix to behavior cloning's deepest weakness, plus the clever deterministic-expert trick that resolves the multimodality problem you saw in P1.
Same Flappy Bird environment. Same hard-mode multimodality challenge. Different fix.
Problem 1 introduced behavior cloning with MSE regression and showed how it fails on hard mode (the multimodal expert — gap 1 or gap 2 — gets averaged into the wall). Problem 2 fixed this by replacing MSE with flow matching (a generative model that preserves modes). Problem 3 takes the orthogonal approach: keep MSE regression but make the data unimodal.
Plain BC has two distinct issues, and they call for different fixes:
| Failure mode | What goes wrong | Where it shows up | Fix |
|---|---|---|---|
| Multimodality | MSE averages multiple valid actions into an invalid mean | Hard mode with bimodal expert | P2: flow matching, OR P3: deterministic expert |
| Distribution shift | Policy errors compound; agent ends up in states the expert never visited | Long episodes, anywhere | P3: DAgger |
Problem 3 is mostly aimed at distribution shift, but in this homework it also resolves multimodality through the deterministic-expert trick. Both fixes happen at once. We'll see why in Chapter 05.
You implement three small functions in dagger.py: DeterministicExpert.act, rollout_episode, and rollout_and_relabel.
The DAgger orchestration loop (run_dagger) is provided. You write the data-collection helpers; the framework handles the BC retraining and aggregation.
This chapter is the conceptual foundation of DAgger. Without understanding distribution shift, the algorithm is just an arbitrary procedure.
BC trains a policy to mimic the expert at states the expert visits. Call this distribution of states p_expert(s). The training loss is the MSE between the policy's prediction and the expert's action, averaged over expert-visited states: L(θ) = E_{s ∼ p_expert} ‖π_θ(s) − a_expert(s)‖².
The policy is good at states the expert visits. Training cannot tell us anything about states the expert doesn't visit, because we have no examples there.
At test time, the policy makes a small error at some state s0. It produces an action that's almost right but slightly off. The bird is now in a slightly weird state at the next timestep — one the expert never quite visited (because the expert didn't make the same small error).
From this slightly-out-of-distribution state, the policy makes a slightly larger error. Now the bird is in a more out-of-distribution state. From there, an even larger error. From there, total chaos. By 50 timesteps in, the policy is in a state space the expert never saw, and the policy has no idea what to do.
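To make the compounding concrete, here is a toy model (purely illustrative, not part of the homework; the constants are assumptions) in which the per-step error is amplified by how far the state has already drifted off-distribution:

```python
# Toy model of compounding error (illustrative only).
eps = 0.01                    # per-step error at in-distribution states
drift = 0.0                   # how far we've strayed from training states
total_error = 0.0
for t in range(50):
    step_error = eps * (1.0 + drift)  # error grows with the drift so far
    drift += step_error               # each error pushes us further off-distribution
    total_error += step_error
print(f"after 50 steps: drift={drift:.3f}, total error={total_error:.3f}")
```

Under this assumed model the drift grows like (1 + ε)^t − 1: geometric, not linear. The feedback loop (error causes drift, drift causes more error) is the point.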
You might think: "just train the BC policy harder, get smaller errors per step, problem solved." But training error can never be exactly zero, and even a tiny per-step error drifts the state distribution. Once the policy is off-distribution, nothing in the training data constrains its behavior, so the per-step error grows instead of staying at ε.
Distribution shift: when the distribution of states encountered during deployment differs from the distribution of states encountered during training. In BC, this happens automatically because the deployment-time state distribution is determined by the policy (not the expert), and the policy's small errors drift the state distribution away from the training data.
| Approach | Idea | Used in |
|---|---|---|
| DAgger | Iteratively collect states from policy rollouts, get expert labels at those states, retrain. The training distribution gradually expands to cover the deployment distribution. | This homework, robot teleop, autonomous driving |
| Better representations | Use convolutional/transformer networks that generalize better to nearby out-of-distribution states. | Modern foundation-model robot policies |
DAgger is the cleanest theoretical fix. The distribution shift literature largely starts with DAgger.
DAgger stands for Dataset Aggregation. Ross, Gordon, and Bagnell, AISTATS 2011.
The problem with BC: the policy gets bad at states it visits but the expert didn't. Solution: collect data at the states the policy actually visits, get expert labels there, add to the training set, retrain. Repeat.
Algorithmically:

1. Train an initial policy with BC on the expert demos D.
2. Roll out the current policy and collect the states it visits.
3. Query the expert for the correct action at each of those states.
4. Aggregate: D ← D ∪ {new (state, expert action) pairs}.
5. Retrain the policy on all of D. Go to step 2.
That's the entire algorithm. The dataset grows each round. The policy improves because it now has supervision at exactly the states it tends to visit.
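As a compact sketch of those steps (illustrative only; the provided run_dagger in dagger.py is the real orchestrator, and train_bc, rollout, and expert_label here are assumed stand-in callables):

```python
def dagger(initial_demos, train_bc, rollout, expert_label, num_rounds):
    """Minimal DAgger loop sketch, NOT the provided run_dagger.

    Assumed interfaces: train_bc(dataset) -> policy,
    rollout(policy) -> list of visited states,
    expert_label(state) -> expert action at that state.
    """
    dataset = list(initial_demos)                   # round 0: plain BC data
    policy = train_bc(dataset)
    for _ in range(num_rounds):
        states = rollout(policy)                    # states come from the POLICY
        labels = [expert_label(s) for s in states]  # actions come from the EXPERT
        dataset += list(zip(states, labels))        # aggregate: dataset only grows
        policy = train_bc(dataset)                  # retrain on everything
    return policy
```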
BC trains the policy on the expert's state distribution. DAgger trains the policy on the policy's own state distribution — with expert labels there. Over rounds, the training distribution converges to the deployment distribution.
The policy provides the states (where to collect data). The expert provides the actions (what to do at those states). Mixing this up is the most common conceptual error.
If you used the policy's actions as labels, you'd just be training the policy to do whatever it already does, which gives no learning signal. You need the expert's different action at the policy's bad states to teach the policy to recover.
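One quick way to see the "no learning signal" claim: the MSE of a policy against its own (detached) predictions is identically zero, so every gradient vanishes. A minimal PyTorch check:

```python
import torch

# Self-labeling gives zero loss and zero gradient: no learning signal.
policy = torch.nn.Linear(4, 20)          # stand-in for a BC policy
s = torch.randn(8, 4)                    # a batch of states
labels = policy(s).detach()              # "labels" copied from the policy itself
loss = torch.nn.functional.mse_loss(policy(s), labels)
loss.backward()
print(loss.item())                            # 0.0
print(policy.weight.grad.abs().max().item())  # 0.0 -- nothing to learn
```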
DAgger requires you to query the expert at any state, on demand. This is more demanding than plain BC, which only needs a fixed offline dataset of expert demonstrations.
In this homework, the expert is a Python class (DeterministicExpert) that accepts an observation and returns an action. We can call it whenever we want. In real robot teleop, the human operator has to actually be available to provide labels — which is expensive but doable.
BC trains under distribution mismatch: the loss is minimized over states s ∼ p_expert(s), but at deployment the policy encounters states s ∼ p_π(s), the distribution induced by its own behavior.
These are different. The policy's behavior depends on its training data, but the policy's training data is determined by the expert — not by the policy. As soon as the policy is deployed, it visits states the training never sampled.
DAgger explicitly closes this gap by feeding the policy's own state distribution back into the training set.
After a few rounds, the training distribution covers the deployment distribution. The policy's per-step error stops compounding because every state it visits is now labeled.
Ross & Bagnell (2011) proved a regret bound for DAgger. With N DAgger iterations, the gap between the policy's performance and the expert's scales as:
where ε is the per-step training error and T is the episode horizon. Compare to vanilla BC:
For T = 1000 (this homework), the difference is dramatic: T · ε = 1000ε vs. T2 · ε = 1,000,000ε. Even a 1% per-step error compounds to 100% in BC but stays at ~10% with DAgger.
BC is quadratic because errors compound: a per-step error of ε means the deployment distribution drifts by O(T·ε), and at each state the policy's error is amplified by the deviation from the training distribution — giving O(T·ε · T) = O(T²·ε).
DAgger is linear because the deployment distribution converges to the training distribution: errors don't compound, they stay constant. Total cost is just T · ε.
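Plugging in this homework's horizon makes the gap concrete (back-of-envelope arithmetic; the bounds hide constant factors):

```python
# Back-of-envelope comparison of the two bounds (constants ignored).
T, eps = 1000, 0.01
print("BC     ~ T^2 * eps =", T**2 * eps)             # 10000.0 -- compounding
print("DAgger ~ T   * eps =", T * eps)                # 10.0    -- linear
print("ratio  = T         =", (T**2 * eps) / (T * eps))  # 1000.0
```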
The first round of DAgger trains BC on the original expert demos — same as Problem 1. Performance: poor on hard mode.
The second round adds states visited by this poor policy, labeled by the expert. The retrained policy is better — it now knows what to do in some of the off-distribution states.
Subsequent rounds keep adding policy-visited states. After 5 rounds (this homework's setting), the dataset has good coverage of the policy's actual deployment distribution, and the policy is much closer to expert performance.
The plot you'll generate (Figure for Problem 3) should show this monotonic improvement — the first round near baseline BC, climbing toward expert level by round 5.
This is the cleverest part of HW1's DAgger setup. Read it carefully.
From Problem 1: hard mode has alternating single- and double-gap pipes. The expert in expert.py sees a double-gap pipe and randomly picks one of the two gaps:
```python
# From expert.py
if dist < self.commit_dist:
    self.target_gap_idx = np.random.choice([0, 1])
    self._committed = True
```
This randomness is what made the expert's actions multimodal: at the same state, sometimes the expert picks gap 1 (y=0.7), sometimes gap 2 (y=0.3). MSE regression averages these into y=0.5 — the wall.
For DAgger relabeling, we don't have to use the same multimodal expert. We can build a deterministic version that always makes the same choice:
```python
# DeterministicExpert (the version you'll fill in)
if dist < self.commit_dist:
    self._committed = True
    raw_target = float(gap1_y)  # ALWAYS gap 1
```
The deterministic expert always commits to gap 1 (the upper gap) when close to the pipe. No randomness. Same state → same action, every time.
Now think about what DAgger does. It collects states by rolling out the current policy, then labels them with the deterministic expert's actions. Every label says "go to gap 1 here, go to gap 1 there, go to gap 1 everywhere." The training data is now unimodal.
MSE regression on unimodal data works great — it converges to the conditional mean, but the conditional mean of "gap 1, gap 1, gap 1, …" is just gap 1. The policy learns to consistently pick gap 1.
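A tiny numpy check of this claim (toy labels, with 0.7 and 0.3 standing in for the two gap heights):

```python
import numpy as np

# MSE's best constant prediction is the mean of the labels.
bimodal = np.array([0.7, 0.3] * 50)   # original expert: gap 1 or gap 2 at random
unimodal = np.full(100, 0.7)          # deterministic expert: always gap 1
print(bimodal.mean())    # 0.5 -> the wall between the gaps
print(unimodal.mean())   # 0.7 -> gap 1, a valid target
```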
DAgger fixes distribution shift by aggregating policy-visited states with expert labels. The deterministic expert also fixes multimodality by removing the randomness. Together, they let plain MSE regression succeed where it failed in P1 — same model, same loss, just different data.
Standard DAgger usually uses the same expert that generated the initial demos. Here we use a different (deterministic) expert. Why?
Because the original expert is multimodal — if we used it for relabeling, we'd just keep adding multimodal labels to the training set. DAgger would fix distribution shift but not multimodality. The combination of MSE + multimodal labels still produces averaged-into-the-wall predictions.
The deterministic expert is a trick specific to this homework's setup. In real robotics, the expert is usually a human operator or a known-good controller, which is naturally deterministic for any single state. Multimodality from random expert behavior is a synthetic artifact of this homework.
The DeterministicExpert has three behaviors, mostly already coded for you: track the current pipe (via the gap signature), pick a raw target (the gap midpoint when far, a committed gap when close), and EMA-smooth that target.
Your edit is one line of Python: raw_target = float(gap1_y). The cleverness is conceptual; the code is trivial.
One implementation wrinkle: the policy predicts 20-step action chunks, but only executes the first 10 before re-querying. How does DAgger handle this?
During rollout_episode:
- Maintain chunk_buf (the most recent 20-step action chunk from the policy) and a step_in_chunk counter.
- The data we collect is per-step, not per-chunk. Each row of the dataset is (s_t, expert_action_at_s_t) — not a 20-step chunk.
During rollout_and_relabel:
- Call rollout_episode, getting per-step lists.
- Window those lists into 20-step chunks and append to the new_states and new_actions lists.
- The expert's actions at each step are stored sequentially during rollout, then windowed into chunks of length 20 afterwards.
The expert is queried at every state, so we have a per-step expert action sequence. Concatenating any 20 consecutive expert actions gives a valid expert action chunk — the expert could have produced that chunk if asked. So windowing creates new (state, chunk) training pairs without needing extra expert queries.
Look at the original collect_expert_data in expert.py:
```python
for i in range(len(ep_states) - action_chunk + 1):
    all_states.append(ep_states[i])
    all_actions.append(ep_actions[i:i + action_chunk])
```
Same windowing pattern. rollout_and_relabel mirrors this exactly — collect per-step lists, window into chunks, append.
The only difference: collect_expert_data rolls out the expert; rollout_and_relabel rolls out the policy but labels with the expert. Different state distribution, same windowing logic.
Each round, run_dagger calls rollout_and_relabel for episodes_per_round episodes and aggregates the returned (state, chunk) pairs into the training set. Three things to internalize: the states come from the policy's rollouts, the labels come from the deterministic expert, and the dataset only ever grows.
One file, three blanks. Plus the orchestrator run_dagger which is read-only.
| Function | Status | What it does |
|---|---|---|
| DeterministicExpert.act | EDIT (one line) | Set raw_target = gap1_y when committed |
| rollout_episode | EDIT | Roll out policy for one episode, return states & expert actions |
| rollout_and_relabel | EDIT | Loop over episodes, window into chunks |
| run_dagger | read-only | The full DAgger orchestrator (eval, collect, retrain) |
Almost everything is provided in dagger.py:30-96. The class has:

- _last_gap_sig, _committed, _smooth_target — for tracking pipe transitions and EMA smoothing.

dagger.py:99-129:
```python
@torch.no_grad()
def rollout_episode(env, policy, seed, action_chunk, device):
    # YOUR CODE HERE
    return ep_states, ep_expert_actions
```
Inputs:
- env: a FlappyBirdEnv instance (already constructed).
- policy: trained BC policy. Callable as policy(state_tensor); outputs a chunk of shape (1, action_chunk).
- seed: env reset seed for reproducibility.
- action_chunk: 20 (the chunk length).
- device: torch device for tensor placement.

Outputs: per-step list of states, per-step list of expert actions (scalars, not chunks — we'll window into chunks in rollout_and_relabel).
dagger.py:132-171:
```python
@torch.no_grad()
def rollout_and_relabel(policy, difficulty, num_episodes, pipe_speed,
                        seed, action_chunk, device):
    policy.eval()
    env = FlappyBirdEnv(difficulty=difficulty, pipe_speed=pipe_speed)
    det_expert = DeterministicExpert()
    new_states, new_actions = [], []
    # YOUR CODE HERE
    return new_states, new_actions
```
The orchestrator calls this once per round to get fresh training data. num_episodes defaults to a small number (~10 in the default config). Returns numpy arrays of shape (N, 4) states and (N, 20) action chunks.
Note: the deterministic expert is constructed inside this function, not passed in. Each call gets a fresh expert that resets between episodes.
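For shape intuition, a hypothetical call might look like this (the argument values are illustrative assumptions, not the actual config defaults):

```python
# Hypothetical usage; pipe_speed and the other values are illustrative.
new_states, new_actions = rollout_and_relabel(
    policy, difficulty="hard", num_episodes=10, pipe_speed=1.0,
    seed=0, action_chunk=20, device="cpu")
print(new_states.shape)   # (N, 4)  -- one row per windowed state
print(new_actions.shape)  # (N, 20) -- one expert action chunk per state
```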
Per-line annotations. This is the centerpiece chapter.
Where: dagger.py:81-86.
What you need: when the bird is close enough to commit to a gap, deterministically pick gap 1 instead of randomly choosing.
The code:
```python
if dist < self.commit_dist:       # very close to the pipe
    self._committed = True
    raw_target = float(gap1_y)    # <-- YOUR CODE
else:
    raw_target = float(midpoint)
```
Set the raw target to gap1_y. Always gap 1, regardless of which would be closer to the bird's current position.
Why float(gap1_y) and not just gap1_y: gap1_y is unpacked from a numpy array (obs[1]), so it might be a numpy.float32. Wrapping in Python's float() converts to a plain Python float, which plays more nicely with downstream EMA smoothing arithmetic.
Why gap 1 specifically: arbitrary choice. Gap 2 would work equally well as long as we're consistent. The key is determinism, not which gap.
One line of code. That's the entire trick. The original expert (in expert.py) had:
```python
self.target_gap_idx = np.random.choice([0, 1])  # multimodal!
self._committed = True
```
Your version has:
```python
self._committed = True
raw_target = float(gap1_y)  # deterministic!
```
Two snippets: one calls np.random.choice, the other always picks gap 1. This single deterministic choice is what turns multimodal expert data into unimodal data, which is what makes MSE regression work.
Where: dagger.py:99-129.
What you need to build: roll out the current policy for one episode, with action chunking and re-querying every EXECUTE_STEPS = 10 steps. At each timestep, store the current state and the deterministic expert's action for that state.
The code:
```python
@torch.no_grad()
def rollout_episode(env, policy, seed, action_chunk, device):
    obs, _ = env.reset(seed=seed)
    det_expert = DeterministicExpert()
    det_expert.reset()
    ep_states, ep_expert_actions = [], []
    chunk_buf = None
    step_in_chunk = 0
    done = False
    while not done:
        # Re-query the policy when the buffer is empty or exhausted.
        if chunk_buf is None or step_in_chunk >= EXECUTE_STEPS:
            state_tensor = torch.as_tensor(
                obs, dtype=torch.float32, device=device).unsqueeze(0)
            chunk_buf = policy(state_tensor).cpu().numpy()[0]
            step_in_chunk = 0
        action = float(chunk_buf[step_in_chunk])
        expert_action = det_expert.act(obs)   # relabel with the expert
        ep_states.append(obs.copy())
        ep_expert_actions.append(expert_action)
        obs, _, terminated, truncated, _ = env.step(np.array([action]))
        done = terminated or truncated
        step_in_chunk += 1
    return ep_states, ep_expert_actions
```
Reset the environment with the given seed. Returns (observation, info_dict); we discard the info dict. obs is a numpy array of shape (4,).
Seeding ensures reproducibility — running with the same seed produces the same initial state.
Create a fresh deterministic expert and reset its internal state (gap signature, commitment, EMA buffer). Each episode gets a clean expert — no state carried over from previous episodes.
Empty lists to collect the per-step data. We'll append to these inside the loop, then return them at the end.
Initial state of the chunk buffer. None means we haven't queried the policy yet, so we'll query on the first iteration. step_in_chunk tracks how many actions we've used from the current chunk — when it hits EXECUTE_STEPS=10, we re-query.
Re-query the policy when (a) we haven't queried yet, or (b) we've used up the executable portion of the previous chunk. This is the receding-horizon re-querying logic.
Note: we re-query at EXECUTE_STEPS = 10, not at action_chunk = 20. The remaining 10 actions in the buffer are discarded — that's the point of "execute K of T predicted."
Convert the numpy obs to a torch tensor on the right device, then add a batch dimension. The policy expects shape (B, 4), not (4,). .unsqueeze(0) adds the batch dim, giving shape (1, 4).
Forward pass through the BC policy. For BCPolicy from P1, this is a single forward through the MLP, returning (1, 20). For FlowMatchingPolicy from P2, this is the full Euler integration. The output shape is the same: (1, action_chunk).
.cpu() moves the result back to CPU (in case it was on GPU). .numpy() converts to numpy. [0] drops the batch dim, giving shape (20,).
Why we don't keep gradients: the function is decorated @torch.no_grad(). Inference, no need for autograd. Saves memory and time.
Take the action at the current position in the chunk. chunk_buf[step_in_chunk] is a numpy scalar; float() converts to a plain Python float for the env.
Query the deterministic expert at the current state. This is the relabeling step: we don't keep the policy's action as the label; we use the expert's action.
Note we pass the current obs, not the state we'd get after taking the policy's action. Labels go with the state we encountered, not the state we're about to encounter.
Store the current state and the expert's label. obs.copy() creates a copy — without it, all entries in ep_states would point to the same array (which gets overwritten by the env step).
Take the step. env.step in Gymnasium returns (obs, reward, terminated, truncated, info). We don't need reward or info. terminated is True if the bird crashed; truncated is True if we hit the 1000-step max.
The action goes in as np.array([action]) — the env expects an array, not a scalar.
Either condition ends the episode.
Advance the position in the chunk. When this hits EXECUTE_STEPS, the next iteration will re-query the policy.
Return the per-step lists. Note these are scalar expert actions, not chunks. rollout_and_relabel handles windowing.
1. Forgetting to reset the deterministic expert: state from previous episodes carries over (commit to gap 1 even when in a new pipe context). Manifests as poor relabeling on the first few episodes.
2. Not copying obs: ep_states.append(obs) without .copy() stores the same object many times; after the loop, all entries point to the final state (see the standalone demo after this list).
3. Storing policy actions instead of expert actions: defeats the entire point of DAgger. Make sure the action you append is from det_expert.act(obs), not from chunk_buf.
4. Forgetting the chunk_buf is None condition: if you only check step_in_chunk >= EXECUTE_STEPS, the buffer is never initialized, and the first iteration crashes (a NameError or TypeError, depending on how you wrote it).
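Pitfall 2 is worth seeing in isolation. A standalone demonstration, simulating an env that reuses its observation buffer (the assumption behind the pitfall; here we mutate a numpy array in place):

```python
import numpy as np

# Simulate an env that mutates its observation buffer in place.
obs = np.zeros(4)
buggy, correct = [], []
for t in range(3):
    obs[0] = t                  # "env.step" overwrites the buffer
    buggy.append(obs)           # stores a reference to the same array
    correct.append(obs.copy())  # stores an independent snapshot
print([s[0] for s in buggy])    # [2.0, 2.0, 2.0] -- every entry is the final state
print([s[0] for s in correct])  # [0.0, 1.0, 2.0]
```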
Where: dagger.py:132-171.
What you need to build: loop over num_episodes, call rollout_episode for each, window the per-step lists into 20-step (state, action_chunk) pairs, return numpy arrays.
The code:
```python
@torch.no_grad()
def rollout_and_relabel(policy, difficulty, num_episodes, pipe_speed,
                        seed, action_chunk, device):
    policy.eval()
    env = FlappyBirdEnv(difficulty=difficulty, pipe_speed=pipe_speed)
    new_states, new_actions = [], []
    for ep in range(num_episodes):
        # Roll out the current policy; labels come from the deterministic expert.
        ep_states, ep_expert_actions = rollout_episode(
            env, policy, seed=seed + ep,
            action_chunk=action_chunk, device=device)
        # Window per-step lists into (state, 20-step chunk) pairs.
        for i in range(len(ep_states) - action_chunk + 1):
            new_states.append(ep_states[i])
            new_actions.append(ep_expert_actions[i:i + action_chunk])
    env.close()
    return (np.array(new_states, dtype=np.float32),
            np.array(new_actions, dtype=np.float32))
```
Put the policy in eval mode. For most architectures this disables dropout and switches BatchNorm to use running statistics. The provided BCPolicy doesn't use dropout or BatchNorm, so this is mostly defensive coding — doesn't change behavior here but is correct practice.
Construct one env instance for all episodes. We'll reset it inside rollout_episode for each new episode.
Accumulators for the windowed (state, action chunk) pairs across all episodes.
Loop over episodes. The provided default is something like 10 episodes per round.
Run one episode. seed + ep gives a different seed per episode for diversity. Returns per-step lists.
Windowing loop. We can build a chunk starting at index i if and only if at least action_chunk actions remain from index i onward (indices i through i + action_chunk − 1 all exist). The +1 is because range(stop) goes up to (but not including) stop.
Example: if the episode has 100 steps and action_chunk = 20, this iterates i from 0 to 80 inclusive. We get 81 (state, chunk) pairs from a 100-step episode.
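A one-liner to verify that count (pure arithmetic, independent of the homework code):

```python
# Verify the window count for a 100-step episode with 20-step chunks.
ep_len, action_chunk = 100, 20
starts = list(range(ep_len - action_chunk + 1))
print(len(starts), starts[0], starts[-1])   # 81 0 80
```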
The state for this windowed pair is the state at index i.
The action chunk is the next 20 expert actions starting at index i. ep_expert_actions[i:i+20] takes a slice of length 20.
Even though the policy's actual rollout used policy actions (with re-querying), the labels here are expert actions for 20 consecutive steps starting at this state. This is the relabeling that gives BC consistent unimodal supervision.
Release env resources. Good housekeeping.
Convert lists to numpy arrays with the right dtype. Shape: states (N, 4), actions (N, 20).
dtype=np.float32 matches what the BC training pipeline expects. Without explicit dtype, numpy might infer float64, which is wasteful and may cause issues with the torch model (which uses float32).
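A quick check of numpy's default dtype inference, which motivates the explicit dtype:

```python
import numpy as np

# numpy infers float64 by default; the training pipeline expects float32.
chunks = [[0.7] * 20]
print(np.array(chunks).dtype)                    # float64
print(np.array(chunks, dtype=np.float32).dtype)  # float32
```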
Episode 1 of round 1 starts. rollout_episode rolls out the policy for ~50 steps before it crashes, returning 50 states and 50 expert actions.
rollout_and_relabel windows these into 50 - 20 + 1 = 31 (state, chunk) pairs.
Repeat for 10 episodes → ~310 new training pairs.
Aggregate with the original ~5000 expert demo pairs → ~5310 training pairs for round 2.
By round 5, the dataset has ~6500 pairs, with progressively more coverage of the policy's actual trajectories.
python main.py --method dagger --env hard
This:
- Runs run_dagger for 5 rounds.
- Each round: evaluate → collect (rollout_and_relabel) → aggregate → retrain.
- Writes per-round results to dagger_hard.txt.

| Round | Expected mean episode length | What's happening |
|---|---|---|
| 1 | ~200-400 | Plain BC on expert demos — same as P1 |
| 2 | ~400-600 | Some unimodal data added |
| 3 | ~600-800 | More coverage |
| 4-5 | ~800-1000 | Near-expert performance |
The exact numbers depend on hyperparameters, but you should see monotonic improvement across rounds. By round 5, performance should be comparable to (or exceed) flow matching from P2.
Per round: a quick rollout phase (~1 minute) plus a BC retraining phase (a few minutes for 5000-6500 transitions). Five rounds total: roughly 20-30 minutes on CPU.
| Metric | Healthy | Bug |
|---|---|---|
| Round 1 eval | 200-400 (BC baseline) | 0 or 1000 (something off) |
| Per-round improvement | Each round > previous (mostly) | Stays flat or decreases |
| New transitions per round | 200-2000 (depends on episode length) | 0 (rollout_episode returning empty) |
| Final round eval | 800-1000 | Stuck below 500 (multimodality not resolved — check DeterministicExpert.act) |
Per the PDF:
"DAgger improves over rounds because each round adds expert-labeled states from the policy's own deployment distribution to the training set, so the policy receives supervision exactly at the states where it tends to make errors and gradually closes the distribution-shift gap. The deterministic expert plays two roles: it provides a clean, consistent labeling signal at every visited state, and it removes the multimodality that broke MSE regression in Problem 1 by always picking gap 1 instead of randomly choosing. Together, the iterative dataset aggregation and the deterministic relabeling let plain MSE behavior cloning recover near-expert performance — same model and loss as P1, but trained on better data."
The PDF asks for a comparison plot of all three methods on hard mode. Run:
python main.py --plot
This reads bc_reg_hard.txt, bc_flow_hard.txt, and dagger_hard.txt and produces a comparison.
| Method | Hard mode mean (typical) | How it solves the problem |
|---|---|---|
| BC regression (P1) | 200-400 | Doesn't — this is the baseline failure |
| Flow matching (P2) | 700-1000 | Generative model preserves modes → consistent gap selection per rollout |
| DAgger (P3, final round) | 800-1000 | Deterministic expert + iterative relabeling → unimodal training distribution |
BC, flow matching, and DAgger represent three different attacks on the same fundamental problem: multimodal experts on long-horizon control.
Real robot learning systems often combine both generative models and iterative relabeling. Diffusion Policy + DAgger is a known combination. The lessons from these three problems are foundational; you'll see variations of all three in research papers and production systems.
| Property | BC reg | Flow matching | DAgger |
|---|---|---|---|
| Compute (training) | Cheap | Moderate (U-Net is bigger) | Cheap per round, but multiple rounds |
| Compute (inference) | 1 forward pass | 20 forward passes (Euler integration) | 1 forward pass |
| Expert assumption | Offline demos only | Offline demos only | Need interactive expert at any state |
| Theoretical bound | O(T²ε) | Same family as BC | O(Tε) |
DAgger's main practical limitation: needing an interactive expert. In settings where the expert is a human operator, this is expensive. In simulation or where the expert is itself a controller (as in this homework), it's cheap.
| Call | Returns |
|---|---|
| env.reset(seed=seed) | (obs, info_dict); obs is shape (4,) |
| env.step(np.array([action])) | (obs, reward, terminated, truncated, info) |
| det_expert.act(obs) | scalar float in [0, 1]; expert's target y |
| det_expert.reset() | None; clears commitment, gap signature, EMA |
| policy(state_tensor) | predicted action chunk, shape (B, action_chunk) |
| torch.as_tensor(np_arr).unsqueeze(0) | tensor with batch dim added |
| x.cpu().numpy() | numpy array on CPU |
| policy.eval() | set policy to eval mode (no dropout, etc.) |
Self-check questions; try them before reading the answers below:

1. Why does distribution shift happen in BC?
2. Why is BC's bound O(T²·ε) while DAgger's is O(T·ε)?
3. In DAgger, who provides the states and who provides the action labels?
4. Why relabel with a deterministic expert rather than the original expert?
5. How does the deterministic expert differ from the original expert?
6. During a DAgger rollout, which actions are executed and which are stored as labels?
7. Why re-query the policy every EXECUTE_STEPS rather than every step?
8. Why copy obs when appending to ep_states?
9. How does rollout_and_relabel turn a per-step list of expert actions into action chunks?
10. Why keep all old data when aggregating, instead of training on only the newest round?
11. Would DAgger with the original (multimodal) expert fix distribution shift? Would it fix multimodality?
12. How does Problem 3's fix relate to Problem 1's failure?

Answers:

1. The states encountered at deployment time are determined by the policy's own behavior, but the policy was only trained on expert states. Small policy errors cause drift to states never seen in training, where the policy has no knowledge.
2. BC's per-step error amplifies with deviation from the training distribution; deviation grows linearly with time, and per-step error scales with deviation, so total cost is O(T2 ε). DAgger keeps the training distribution covering deployment, so per-step error stays bounded; total cost is O(T ε).
3. The policy provides states (rollout). The expert provides action labels (relabeling). The point of DAgger is supervising the policy at the states it actually visits.
4. Because the original expert is multimodal — relabeling with it would just keep adding multimodal labels, leaving the BC averaging-into-the-wall problem unsolved. The deterministic expert removes multimodality from the data.
5. When close to a pipe, the original expert randomly picks gap 1 or gap 2 with equal probability. The deterministic expert always picks gap 1. Same midpoint hovering and EMA smoothing otherwise.
6. Executed actions come from the policy (so we can collect realistic deployment-distribution states). Stored labels come from the deterministic expert (so the BC loss has a unimodal supervision signal). Mixing these up defeats DAgger's purpose.
7. Action chunking (re-querying every K steps) gives temporal consistency, reduces compounding error from policy queries, and improves stability. We re-query at EXECUTE_STEPS, not at the full chunk length, because the last 10 of 20 predicted actions are stale by then.
8. Without .copy(), all entries point to the same numpy array, which gets overwritten in-place by the env. By the end of the loop, every state in ep_states is identical (the final state). With .copy(), each state is a snapshot.
9. By windowing: the chunk for state at index i is the next 20 expert actions ep_expert_actions[i:i+20]. Each state in the rollout (except the last 19) becomes one (state, chunk) training pair.
10. Because we want the training distribution to cover all of the policy's deployment distribution — including states from earlier rounds when the policy was worse. Forgetting old data would lose coverage and might cause the policy to forget edge cases.
11. Yes to distribution shift — the data still gets aggregated from policy rollouts, so coverage improves. No to multimodality — the labels would still be sometimes-gap-1, sometimes-gap-2, and MSE would still average them. You'd need a generative model (like in P2) for this case.
12. In P1, the multimodal expert's randomness made MSE regression collapse to an invalid average (the wall). The deterministic expert in P3 removes that randomness from the labels, eliminating the failure mode at its source. With unimodal labels, MSE works fine — same model and loss as P1, just better data.
Total: ~15 minutes of typing. Run on hard mode, observe ~200 round-1 performance climb to ~900 by round 5. Generate the learning curve plot. Run --plot to compare with P1 and P2.
Three big ideas, in order of importance:

1. Distribution shift: the deployment state distribution is induced by the policy, not the expert, so small errors compound (O(T²·ε) for BC).
2. DAgger: collect states from the policy's own rollouts, label them with the expert, aggregate, retrain; the training distribution converges to the deployment distribution, giving O(T·ε).
3. The deterministic expert: removing the expert's randomness makes the labels unimodal, so plain MSE regression succeeds where it failed in P1.
If a friend asks: "Why does DAgger improve over rounds?" — you say: "Each round adds expert-labeled states from the policy's actual deployment distribution to the training set. So the policy gets supervision at exactly the states where it currently makes errors. Over rounds, the training distribution converges to the deployment distribution, and the policy stops drifting off-distribution. The deterministic expert ensures the labels are consistent rather than random, which lets plain MSE regression succeed where it failed in Problem 1."
You can teach this. Submit the writeup.