Stanford CS 224R · Homework 1 · Imitation Learning

DAgger from Absolute Zero

Roll out the policy. Ask the expert what to do at every state visited. Aggregate. Retrain. Repeat. The simplest fix to behavior cloning's deepest weakness, plus the clever deterministic-expert trick that resolves the multimodality problem you saw in P1.

Builds on P1 and P2 · Solves distribution shift · Three implementation tasks · Returns BC to glory

Chapter 01

Where We Are

Same Flappy Bird environment. Same hard-mode multimodality challenge. Different fix.

Problem 1 introduced behavior cloning with MSE regression and showed how it fails on hard mode (the multimodal expert — gap 1 or gap 2 — gets averaged into the wall). Problem 2 fixed this by replacing MSE with flow matching (a generative model that preserves modes). Problem 3 takes the orthogonal approach: keep MSE regression but make the data unimodal.

The two failure modes of BC

Plain BC has two distinct issues, and they call for different fixes:

Failure mode | What goes wrong | Where it shows up | Fix
Multimodality | MSE averages multiple valid actions into an invalid mean | Hard mode with bimodal expert | P2: flow matching, OR P3: deterministic expert
Distribution shift | Policy errors compound; agent ends up in states the expert never visited | Long episodes, anywhere | P3: DAgger

Problem 3 is mostly aimed at distribution shift, but in this homework it also resolves multimodality through the deterministic-expert trick. Both fixes happen at once. We'll see why in Chapter 05.
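The averaging failure is easy to reproduce in a few lines of numpy. This is a toy sketch with made-up gap targets (0.7 and 0.3), not the homework's data: the MSE-optimal constant prediction for a 50/50 mix of the two gaps is their mean of 0.5, which is the wall.

```python
import numpy as np

# Toy labels: at the same state the multimodal expert says gap 1 (y=0.7)
# half the time and gap 2 (y=0.3) the other half.
labels = np.array([0.7, 0.3] * 50)

# The constant prediction minimizing MSE is the sample mean.
candidates = np.linspace(0.0, 1.0, 101)
mse = [np.mean((labels - c) ** 2) for c in candidates]
best = candidates[int(np.argmin(mse))]
# best is 0.5: directly between the gaps, i.e. into the wall
```

Swap in unimodal labels (all 0.7) and the same procedure recovers 0.7 exactly, which is the whole point of the deterministic-expert trick in Chapter 05.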

What you're implementing

Three small functions in dagger.py:

  1. DeterministicExpert.act: a deterministic version of the multimodal expert that always picks gap 1.
  2. rollout_episode: roll out the current policy in the environment for one episode, return the (state, expert action) pairs.
  3. rollout_and_relabel: loop the above over multiple episodes and window into action chunks.

The DAgger orchestration loop (run_dagger) is provided. You write the data-collection helpers; the framework handles the BC retraining and aggregation.

The deliverable

Chapter 02

The Distribution Shift Problem

This chapter is the conceptual foundation of DAgger. Without understanding distribution shift, the algorithm is just an arbitrary procedure.

The setup

BC trains a policy to mimic the expert at states the expert visits. Call this distribution of states pexpert(s). The training loss is:

LBC(θ) = Es ~ pexpert(s) [ ||πθ(s) − π*(s)||² ]

The policy is good at states the expert visits. Training cannot tell us anything about states the expert doesn't visit, because we have no examples there.

What happens at test time

At test time, the policy makes a small error at some state s0. It produces an action that's almost right but slightly off. The bird is now in a slightly weird state at the next timestep — one the expert never quite visited (because the expert didn't make the same small error).

From this slightly-out-of-distribution state, the policy makes a slightly larger error. Now the bird is in a more out-of-distribution state. From there, an even larger error. From there, total chaos. By 50 timesteps in, the policy is in a state space the expert never saw, and the policy has no idea what to do.

[Figure: a trajectory starting on the expert state distribution and drifting off it over time — small error → larger error → far OOD → total chaos.]
The agent starts on the expert's state distribution. Each small policy error nudges it slightly off-distribution. Errors compound: the further off-distribution, the worse the policy's actions, the further off-distribution it goes. This is "compounding error" or "covariate shift."

Why this is fundamental, not a tuning issue

You might think: "just train the BC policy harder, get smaller errors per step, problem solved." But smaller per-step errors only shrink ε, not the T² factor: the gap still grows quadratically with horizon, and no amount of training on expert states provides a single example of how to recover from an off-distribution state. The problem is structural:

Definition
Distribution shift / Covariate shift

When the distribution of states encountered during deployment differs from the distribution of states encountered during training. In BC, this happens automatically because the deployment-time state distribution is determined by the policy (not the expert), and the policy makes small errors that drift the state distribution away from training data.

Two famous fixes

Approach | Idea | Used in
DAgger | Iteratively collect states from policy rollouts, get expert labels at those states, retrain. The training distribution gradually expands to cover the deployment distribution. | This homework, robot teleop, autonomous driving
Better representations | Use convolutional/transformer networks that generalize better to nearby out-of-distribution states. | Modern foundation-model robot policies

DAgger is the cleanest theoretical fix. The distribution shift literature largely starts with DAgger.

Chapter 03

DAgger: Iterative Relabeling

DAgger stands for Dataset Aggregation. Ross, Gordon, and Bagnell, AISTATS 2011.

The core idea

The problem with BC: the policy gets bad at states it visits but the expert didn't. Solution: collect data at the states the policy actually visits, get expert labels there, add to the training set, retrain. Repeat.

Algorithmically:

DAgger
  1. Collect initial expert demos: D0 = {(s, π*(s))}.
  2. Train initial policy: π1 = BC(D0).
  3. For round k = 1, 2, …:
    a) Roll out πk in the environment, collecting visited states {s1, s2, …}.
    b) Query the expert at each visited state: ai* = π*(si).
    c) Add to dataset: Dk = Dk-1 ∪ {(si, π*(si))}.
    d) Retrain: πk+1 = BC(Dk).

That's the entire algorithm. The dataset grows each round. The policy improves because it now has supervision at exactly the states it tends to visit.
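The loop is short enough to sketch end to end. Below is a toy, fully self-contained version on a 1-D stand-in environment; every name here (expert, fit_policy, rollout) is invented for illustration and is not the homework's API. The structural point is steps (a)-(d): states come from the policy's rollout, labels come from the expert, and the dataset only ever grows.

```python
import numpy as np

def expert(s):                      # the queryable expert: pi*(s) = 0.5 * s
    return 0.5 * s

def fit_policy(states, actions):    # "BC": least-squares fit of a = w * s
    s, a = np.array(states), np.array(actions)
    w = (s @ a) / (s @ s)
    return lambda x: w * x

def rollout(policy, s0=1.0, T=10):  # toy dynamics: the action IS the next state
    states, s = [], s0
    for _ in range(T):
        states.append(s)
        s = policy(s)
    return states

# Steps 1-2: initial expert demos, D0 = {(s, pi*(s))}
states = rollout(expert)
actions = [expert(s) for s in states]

# Step 3: DAgger rounds
for k in range(3):
    policy = fit_policy(states, actions)        # d) retrain on aggregate
    visited = rollout(policy)                   # a) states from the POLICY
    labels = [expert(s) for s in visited]       # b) labels from the EXPERT
    states, actions = states + visited, actions + labels   # c) aggregate
```

In this toy the first fit already matches the expert (the data is exactly linear), so the interesting part is the bookkeeping, not the learning curve.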

The key insight in one sentence

BC trains the policy on the expert's state distribution. DAgger trains the policy on the policy's own state distribution — with expert labels there. Over rounds, the training distribution converges to the deployment distribution.

Critical: we use expert actions, not policy actions

The policy provides the states (where to collect data). The expert provides the actions (what to do at those states). Mixing this up is the most common conceptual error.

If you used the policy's own actions as labels, you'd just be training the policy to do whatever it already does — no learning signal. You need the expert's (different) action at the policy's bad states to teach the policy to recover.

The role of "interactive" expert access

DAgger requires you to query the expert at any state, on demand. This is more demanding than plain BC, which only needs a fixed offline dataset of expert demonstrations.

In this homework, the expert is a Python class (DeterministicExpert) that accepts an observation and returns an action. We can call it whenever we want. In real robot teleop, the human operator has to actually be available to provide labels — which is expensive but doable.

Chapter 04

Why DAgger Works

The fundamental fix

BC trains under distribution mismatch:

training distribution = pexpert(s)
deployment distribution = pπ(s)

These are different. The policy's behavior depends on its training data, but the policy's training data is determined by the expert — not by the policy. As soon as the policy is deployed, it visits states the training never sampled.

DAgger explicitly closes this gap by feeding back the policy's own state distribution into the training set:

training distribution = pexpert(s) ∪ pπ1(s) ∪ pπ2(s) ∪ …

After a few rounds, the training distribution covers the deployment distribution. The policy's per-step error stops compounding because every state it visits is now labeled.

The theoretical guarantee

Ross, Gordon, and Bagnell (2011) proved a regret bound for DAgger. With N DAgger iterations, the gap between the policy's performance and the expert's scales as:

J(πDAgger) − J(π*) = O(T · ε)    (linear in T)

where ε is the per-step training error and T is the episode horizon. Compare to vanilla BC:

J(πBC) − J(π*) = O(T² · ε)    (quadratic in T)

For T = 1000 (this homework), the difference is dramatic: T · ε = 1,000ε vs. T² · ε = 1,000,000ε. For the same per-step error ε, the BC bound is a full factor of T = 1000 larger.

The intuition behind the bound

BC is quadratic because errors compound: a per-step error of ε means the deployment distribution drifts by O(Tε), and at each state the policy's error is amplified by the deviation from training distribution — giving O(Tε · T) = O(T²ε).

DAgger is linear because the deployment distribution converges to the training distribution: errors don't compound, they stay constant. Total cost is just T · ε.
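A back-of-envelope check of the two growth rates (not a proof): assume the policy's effective error at step k grows like k·ε under BC-style drift but stays at ε under DAgger.

```python
T, eps = 1000, 0.01

# BC-style drift: per-step cost grows with how far off-distribution we are.
bc_total = sum(k * eps for k in range(1, T + 1))   # eps * T * (T + 1) / 2

# DAgger: every visited state is labeled, so per-step cost stays at eps.
dagger_total = T * eps
```

The totals differ by a factor of about T/2, which is exactly the quadratic-vs-linear gap in the bounds.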

Why the rounds matter

The first round of DAgger trains BC on the original expert demos — same as Problem 1. Performance: poor on hard mode.

The second round adds states visited by this poor policy, labeled by the expert. The retrained policy is better — it now knows what to do in some of the off-distribution states.

Subsequent rounds keep adding policy-visited states. After 5 rounds (this homework's setting), the dataset has good coverage of the policy's actual deployment distribution, and the policy is much closer to expert performance.

The plot you'll generate (Figure for Problem 3) should show this monotonic improvement — the first round near baseline BC, climbing toward expert level by round 5.

Chapter 05

The Deterministic Expert Trick

This is the cleverest part of HW1's DAgger setup. Read it carefully.

Recap: the multimodality problem

From Problem 1: hard mode has alternating single and double-gap pipes. The expert in expert.py sees a double-gap pipe and randomly picks one of the two gaps:

# From expert.py
if dist < self.commit_dist:
    self.target_gap_idx = np.random.choice([0, 1])
    self._committed = True

This randomness is what made the expert's actions multimodal: at the same state, sometimes the expert picks gap 1 (y=0.7), sometimes gap 2 (y=0.3). MSE regression averages these into y=0.5 — the wall.

The trick: a deterministic expert

For DAgger relabeling, we don't have to use the same multimodal expert. We can build a deterministic version that always makes the same choice:

# DeterministicExpert (the version you'll fill in)
if dist < self.commit_dist:
    self._committed = True
    raw_target = float(gap1_y)        # ALWAYS gap 1

The deterministic expert always commits to gap 1 (the upper gap) when close to the pipe. No randomness. Same state → same action, every time.

Why this resolves the multimodality problem

Now think about what DAgger does. It collects states by rolling out the current policy, then labels them with the deterministic expert's actions. Every label says "go to gap 1 here, go to gap 1 there, go to gap 1 everywhere." The training data is now unimodal.

MSE regression on unimodal data works great — it converges to the conditional mean, but the conditional mean of "gap 1, gap 1, gap 1, …" is just gap 1. The policy learns to consistently pick gap 1.

Two birds with one stone

DAgger fixes distribution shift by aggregating policy-visited states with expert labels. The deterministic expert also fixes multimodality by removing the randomness. Together, they let plain MSE regression succeed where it failed in P1 — same model, same loss, just different data.

Why use the original expert as the labeler in normal DAgger?

Standard DAgger usually uses the same expert that generated the initial demos. Here we use a different (deterministic) expert. Why?

Because the original expert is multimodal — if we used it for relabeling, we'd just keep adding multimodal labels to the training set. DAgger would fix distribution shift but not multimodality. The combination of MSE + multimodal labels still produces averaged-into-the-wall predictions.

The deterministic expert is a trick specific to this homework's setup. In real robotics, the expert is usually a human operator or a known-good controller, which is naturally deterministic for any single state. Multimodality from random expert behavior is a synthetic artifact of this homework.

The full DeterministicExpert behavior

Four behaviors, mostly already coded for you:

  1. Far from pipe: target the midpoint of (gap1_y, gap2_y). This is the same hovering behavior the original expert uses.
  2. Close to pipe (dist < commit_dist): commit to gap 1. This is the line you fill in.
  3. EMA smoothing: temporal smoothing on the target. Same as original.
  4. New-pipe detection: when gap positions change, reset commitment. Same as original.

Your edit is one line of Python: raw_target = float(gap1_y). The cleverness is conceptual; the code is trivial.
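The decision rule is small enough to mirror in a standalone toy. The function name, the commit_dist value, and the dropped EMA/new-pipe bookkeeping are all simplifications for illustration, not the homework's class:

```python
def det_target(dist, gap1_y, gap2_y, commit_dist=0.3):
    """Toy mirror of the deterministic expert's target choice."""
    if dist < commit_dist:
        return float(gap1_y)                  # close: ALWAYS gap 1
    return float((gap1_y + gap2_y) / 2.0)     # far: hover at the midpoint
```

Same state in, same target out, every time; that determinism is the whole trick.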

Chapter 06

DAgger with Action Chunking

One implementation wrinkle: the policy predicts 20-step action chunks, but only executes the first 10 before re-querying. How does DAgger handle this?

The rollout side

During rollout_episode:

  1. Reset the env and the deterministic expert.
  2. Maintain a chunk_buf (the most recent 20-step action chunk from the policy) and a step_in_chunk counter.
  3. At each step, if the buffer is empty or step_in_chunk ≥ EXECUTE_STEPS, query the policy for a fresh chunk.
  4. Take the action from the buffer at index step_in_chunk.
  5. Step the env. Store (current state, expert's action at the current state) in the data lists.
  6. Increment step_in_chunk and check for episode termination.

The data we collect is per-step, not per-chunk. Each row of the dataset is (st, expert_action_at_st) — not a 20-step chunk.

The relabel side

During rollout_and_relabel:

  1. Roll out an episode with rollout_episode, getting per-step lists.
  2. Window the per-step lists into action chunks: state st gets paired with the chunk [a*t, a*t+1, …, a*t+19].
  3. Append to the global new_states and new_actions lists.

The expert's actions at each step are stored sequentially during rollout, then windowed into chunks of length 20 afterwards.

Why this windowing works

The expert is queried at every state, so we have a per-step expert action sequence. Concatenating any 20 consecutive expert actions gives a valid expert action chunk — the expert could have produced that chunk if asked. So windowing creates new (state, chunk) training pairs without needing extra expert queries.
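The windowing itself can be done in one numpy call, sketched here on toy data (sliding_window_view needs numpy ≥ 1.20; the homework code does the same thing with a Python loop):

```python
import numpy as np

per_step = np.arange(100, dtype=np.float32)   # stand-in per-step expert actions
action_chunk = 20

# Every run of 20 consecutive expert actions is a valid expert chunk.
chunks = np.lib.stride_tricks.sliding_window_view(per_step, action_chunk)
# chunks has shape (81, 20): one chunk per valid start index 0..80
```

A 100-step episode yields 100 − 20 + 1 = 81 (state, chunk) pairs, matching the loop bound in collect_expert_data.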

Comparison to expert.py's collect_expert_data

Look at the original collect_expert_data in expert.py:

for i in range(len(ep_states) - action_chunk + 1):
    all_states.append(ep_states[i])
    all_actions.append(ep_actions[i:i + action_chunk])

Same windowing pattern. rollout_and_relabel mirrors this exactly — collect per-step lists, window into chunks, append.

The only difference: collect_expert_data rolls out the expert; rollout_and_relabel rolls out the policy but labels with the expert. Different state distribution, same windowing logic.

Chapter 07

The Full Algorithm

DAgger with Deterministic Relabeling and Action Chunking
  1. Initialize:
    • Collect initial expert demos D0 via the original (multimodal) expert.
    • Train initial BC policy π1 on D0.
  2. For round k = 1, 2, …, 5:
    a) Evaluate: roll out πk for eval_episodes (e.g., 100), record mean and std episode length.
    b) Collect rollouts: run rollout_and_relabel for episodes_per_round episodes. For each episode:
    • Reset env and deterministic expert.
    • At each step: get policy action chunk, take action[step_in_chunk], query det expert at the current state, store (state, expert action).
    • After episode: window per-step lists into 20-step (state, expert action chunk) pairs.
    c) Aggregate: Dk = Dk-1 ∪ new (state, chunk) pairs.
    d) Retrain: πk+1 = BC(Dk) using MSE regression.
  3. Restore best policy: across all rounds, keep the policy that performed best in evaluation.

Three things to internalize:

Chapter 08

Code Tour

One file, three blanks. Plus the orchestrator run_dagger which is read-only.

Function | Status | What it does
DeterministicExpert.act | EDIT (one line) | Set raw_target = gap1_y when committed
rollout_episode | EDIT | Roll out policy for one episode, return states & expert actions
rollout_and_relabel | EDIT | Loop over episodes, window into chunks
run_dagger | read-only | The full DAgger orchestrator (eval, collect, retrain)

The DeterministicExpert skeleton

Almost everything is provided in dagger.py:30-96; the class's hovering, EMA smoothing, and new-pipe logic are already written, and only the gap-commit line is yours.

The rollout_episode signature

dagger.py:99-129:

@torch.no_grad()
def rollout_episode(env, policy, seed, action_chunk, device):
    # YOUR CODE HERE
    return ep_states, ep_expert_actions

Inputs: env, policy, seed, action_chunk, and device, exactly as in the signature above.

Outputs: per-step list of states, per-step list of expert actions (scalars, not chunks — we'll window into chunks in rollout_and_relabel).

The rollout_and_relabel signature

dagger.py:132-171:

@torch.no_grad()
def rollout_and_relabel(policy, difficulty, num_episodes, pipe_speed,
                        seed, action_chunk, device):
    policy.eval()
    env = FlappyBirdEnv(difficulty=difficulty, pipe_speed=pipe_speed)
    det_expert = DeterministicExpert()
    new_states, new_actions = [], []

    # YOUR CODE HERE
    return new_states, new_actions

The orchestrator calls this once per round to get fresh training data. num_episodes defaults to a small number (~10 in the default config). Returns numpy arrays of shape (N, 4) states and (N, 20) action chunks.

Note: the deterministic expert is constructed inside this function, not passed in. Each call gets a fresh expert that resets between episodes.

Chapter 09

Your Three Changes, Decoded

Per-line annotations. This is the centerpiece chapter.

Change 1 of 3
DeterministicExpert.act — pick gap 1

Where: dagger.py:81-86.

What you need: when the bird is close enough to commit to a gap, deterministically pick gap 1 instead of randomly choosing.

The code:

if dist < self.commit_dist:    # very close to the pipe
    self._committed = True
    raw_target = float(gap1_y)    # <-- YOUR CODE
else:
    raw_target = float(midpoint)

Decoded

raw_target = float(gap1_y)

Set the raw target to gap1_y. Always gap 1, regardless of which would be closer to the bird's current position.

Why float(gap1_y) and not just gap1_y: gap1_y is unpacked from a numpy array (obs[1]), so it is a numpy.float32. Wrapping in Python's float() converts it to a plain Python float, which plays more nicely with the downstream EMA smoothing arithmetic.
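A quick check of the type difference, using toy observation values:

```python
import numpy as np

obs = np.array([0.5, 0.7, 0.3, 0.0], dtype=np.float32)
gap1_y = obs[1]                 # numpy.float32 scalar, not a Python float
raw_target = float(gap1_y)      # plain Python float
```

np.float32 is not a subclass of Python's float, so isinstance(gap1_y, float) is False while isinstance(raw_target, float) is True.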

Why gap 1 specifically: arbitrary choice. Gap 2 would work equally well as long as we're consistent. The key is determinism, not which gap.

Sanity check: this is the entire fix

One line of code. That's the entire trick. The original expert (in expert.py) had:

self.target_gap_idx = np.random.choice([0, 1])    # multimodal!
self._committed = True

Your version has:

self._committed = True
raw_target = float(gap1_y)                        # deterministic!

Two pieces of code, one is "random.choice" and the other is "always gap 1." This single deterministic choice is what turns multimodal expert data into unimodal data, which is what makes MSE regression work.

Change 2 of 3
rollout_episode — one policy rollout

Where: dagger.py:99-129.

What you need to build: roll out the current policy for one episode, with action chunking and re-querying every EXECUTE_STEPS = 10 steps. At each timestep, store the current state and the deterministic expert's action for that state.

The code:

@torch.no_grad()
def rollout_episode(env, policy, seed, action_chunk, device):
    obs, _ = env.reset(seed=seed)
    det_expert = DeterministicExpert()
    det_expert.reset()

    ep_states, ep_expert_actions = [], []
    chunk_buf = None
    step_in_chunk = 0
    done = False

    while not done:
        if chunk_buf is None or step_in_chunk >= EXECUTE_STEPS:
            state_tensor = torch.as_tensor(obs, dtype=torch.float32, device=device).unsqueeze(0)
            chunk_buf = policy(state_tensor).cpu().numpy()[0]
            step_in_chunk = 0

        action = float(chunk_buf[step_in_chunk])
        expert_action = det_expert.act(obs)

        ep_states.append(obs.copy())
        ep_expert_actions.append(expert_action)

        obs, _, terminated, truncated, _ = env.step(np.array([action]))
        done = terminated or truncated
        step_in_chunk += 1

    return ep_states, ep_expert_actions

Decoded

obs, _ = env.reset(seed=seed)

Reset the environment with the given seed. Returns (observation, info_dict); we discard the info dict. obs is a numpy array of shape (4,).

Seeding ensures reproducibility — running with the same seed produces the same initial state.

det_expert = DeterministicExpert(); det_expert.reset()

Create a fresh deterministic expert and reset its internal state (gap signature, commitment, EMA buffer). Each episode gets a clean expert — no state carried over from previous episodes.

ep_states, ep_expert_actions = [], []

Empty lists to collect the per-step data. We'll append to these inside the loop, then return them at the end.

chunk_buf = None; step_in_chunk = 0

Initial state of the chunk buffer. None means we haven't queried the policy yet, so we'll query on the first iteration. step_in_chunk tracks how many actions we've used from the current chunk — when it hits EXECUTE_STEPS=10, we re-query.

if chunk_buf is None or step_in_chunk >= EXECUTE_STEPS:

Re-query the policy when (a) we haven't queried yet, or (b) we've used up the executable portion of the previous chunk. This is the receding-horizon re-querying logic.

Note: we re-query at EXECUTE_STEPS = 10, not at action_chunk = 20. The remaining 10 actions in the buffer are discarded — that's the point of "execute K of T predicted."

state_tensor = torch.as_tensor(obs, dtype=torch.float32, device=device).unsqueeze(0)

Convert the numpy obs to a torch tensor on the right device, then add a batch dimension. The policy expects shape (B, 4), not (4,). .unsqueeze(0) adds the batch dim, giving shape (1, 4).

chunk_buf = policy(state_tensor).cpu().numpy()[0]

Forward pass through the BC policy. For BCPolicy from P1, this is a single forward through the MLP, returning (1, 20). For FlowMatchingPolicy from P2, this is the full Euler integration. The output shape is the same: (1, action_chunk).

.cpu() moves the result back to CPU (in case it was on GPU). .numpy() converts to numpy. [0] drops the batch dim, giving shape (20,).

Why we don't keep gradients: the function is decorated @torch.no_grad(). Inference, no need for autograd. Saves memory and time.

action = float(chunk_buf[step_in_chunk])

Take the action at the current position in the chunk. chunk_buf[step_in_chunk] is a numpy scalar; float() converts to a plain Python float for the env.

expert_action = det_expert.act(obs)

Query the deterministic expert at the current state. This is the relabeling step: we don't keep the policy's action as the label; we use the expert's action.

Note we pass the current obs, not the state we'd get after taking the policy's action. Labels go with the state we encountered, not the state we're about to encounter.

ep_states.append(obs.copy()); ep_expert_actions.append(expert_action)

Store the current state and the expert's label. obs.copy() creates a copy — without it, all entries in ep_states would point to the same array (which gets overwritten by the env step).

obs, _, terminated, truncated, _ = env.step(np.array([action]))

Take the step. env.step in Gymnasium returns (obs, reward, terminated, truncated, info). We don't need reward or info. terminated is True if the bird crashed; truncated is True if we hit the 1000-step max.

The action goes in as np.array([action]) — the env expects an array, not a scalar.

done = terminated or truncated

Either condition ends the episode.

step_in_chunk += 1

Advance the position in the chunk. When this hits EXECUTE_STEPS, the next iteration will re-query the policy.

return ep_states, ep_expert_actions

Return the per-step lists. Note these are scalar expert actions, not chunks. rollout_and_relabel handles windowing.

Common bugs in rollout_episode

1. Forgetting to reset the deterministic expert: state from previous episodes carries over (commit to gap 1 even when in a new pipe context). Manifests as poor relabeling on the first few episodes.

2. Not copying obs: ep_states.append(obs) without .copy() stores the same object many times; after the loop, all entries point to the final state.

3. Storing policy actions instead of expert actions: defeats the entire point of DAgger. Make sure the action you append is from det_expert.act(obs), not from chunk_buf.

4. Forgetting the chunk_buf is None condition in the re-query check: on the first loop iteration no chunk has been fetched yet, so indexing chunk_buf crashes immediately (a TypeError if chunk_buf was initialized to None, a NameError if it was never assigned).
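Bug 2 is easy to reproduce in isolation. Mutating the same array after appending it (as an in-place env would) changes every stored "snapshot":

```python
import numpy as np

obs = np.zeros(4)
buggy, correct = [], []
for t in range(3):
    obs[0] = t                  # simulate the env overwriting obs in place
    buggy.append(obs)           # stores a reference to the SAME array
    correct.append(obs.copy())  # stores an independent snapshot

# Every buggy entry now shows the final state; the copies kept history.
```

Whether the real env reuses its observation buffer is environment-specific, but copying is cheap insurance either way.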

Change 3 of 3
rollout_and_relabel — loop and window

Where: dagger.py:132-171.

What you need to build: loop over num_episodes, call rollout_episode for each, window the per-step lists into 20-step (state, action_chunk) pairs, return numpy arrays.

The code:

@torch.no_grad()
def rollout_and_relabel(policy, difficulty, num_episodes, pipe_speed,
                        seed, action_chunk, device):
    policy.eval()
    env = FlappyBirdEnv(difficulty=difficulty, pipe_speed=pipe_speed)
    new_states, new_actions = [], []

    for ep in range(num_episodes):
        ep_states, ep_expert_actions = rollout_episode(
            env, policy, seed=seed + ep, action_chunk=action_chunk, device=device)

        for i in range(len(ep_states) - action_chunk + 1):
            new_states.append(ep_states[i])
            new_actions.append(ep_expert_actions[i:i + action_chunk])

    env.close()
    return (np.array(new_states, dtype=np.float32),
            np.array(new_actions, dtype=np.float32))

Decoded

policy.eval()

Put the policy in eval mode. For most architectures this disables dropout and switches BatchNorm to use running statistics. The provided BCPolicy doesn't use dropout or BatchNorm, so this is mostly defensive coding — doesn't change behavior here but is correct practice.

env = FlappyBirdEnv(difficulty=difficulty, pipe_speed=pipe_speed)

Construct one env instance for all episodes. We'll reset it inside rollout_episode for each new episode.

new_states, new_actions = [], []

Accumulators for the windowed (state, action chunk) pairs across all episodes.

for ep in range(num_episodes):

Loop over episodes. The provided default is something like 10 episodes per round.

ep_states, ep_expert_actions = rollout_episode(env, policy, seed=seed + ep, ...)

Run one episode. seed + ep gives a different seed per episode for diversity. Returns per-step lists.

for i in range(len(ep_states) - action_chunk + 1):

Windowing loop. We can build a chunk starting at index i if and only if there are at least action_chunk more states/actions left after i. The +1 is because range(stop) goes up to (but not including) stop.

Example: if the episode has 100 steps and action_chunk = 20, this iterates i from 0 to 80 inclusive. We get 81 (state, chunk) pairs from a 100-step episode.

new_states.append(ep_states[i])

The state for this windowed pair is the state at index i.

new_actions.append(ep_expert_actions[i:i + action_chunk])

The action chunk is the next 20 expert actions starting at index i. ep_expert_actions[i:i+20] takes a slice of length 20.

Even though the policy's actual rollout used policy actions (with re-querying), the labels here are expert actions for 20 consecutive steps starting at this state. This is the relabeling that gives BC consistent unimodal supervision.

env.close()

Release env resources. Good housekeeping.

return (np.array(new_states, dtype=np.float32), np.array(new_actions, dtype=np.float32))

Convert lists to numpy arrays with the right dtype. Shape: states (N, 4), actions (N, 20).

dtype=np.float32 matches what the BC training pipeline expects. Without explicit dtype, numpy might infer float64, which is wasteful and may cause issues with the torch model (which uses float32).

A nice exercise: trace the data flow

Episode 1 of round 1 starts. rollout_episode rolls the policy for ~50 steps before crashing. It returns 50 states and 50 expert actions.

rollout_and_relabel windows these into 50 - 20 + 1 = 31 (state, chunk) pairs.

Repeat for 10 episodes → ~310 new training pairs.

Aggregate with the original ~5000 expert demo pairs → ~5310 training pairs for round 2.

By round 5, the dataset has ~6500 pairs, with progressively more coverage of the policy's actual trajectories.
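The same trace as a few lines of arithmetic, assuming (unrealistically) that every episode lasts exactly 50 steps:

```python
action_chunk, episodes_per_round = 20, 10
initial_pairs, episode_len = 5000, 50

pairs_per_episode = episode_len - action_chunk + 1        # 31 windowed pairs
pairs_per_round = pairs_per_episode * episodes_per_round  # ~310 new pairs

sizes = [initial_pairs]
for _ in range(4):            # rounds 2..5 each add one batch
    sizes.append(sizes[-1] + pairs_per_round)
# sizes: [5000, 5310, 5620, 5930, 6240]
```

In practice episodes get longer as the policy improves, so later rounds contribute more pairs per episode; that is why the text says roughly 6500 rather than exactly 6240.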

Chapter 10

Running It

The command

python main.py --method dagger --env hard

This:

  1. Collects initial expert demos on hard mode (multimodal expert).
  2. Trains an initial BCPolicy on those demos.
  3. Runs run_dagger for 5 rounds:
    • Evaluate → collect rollouts (your rollout_and_relabel) → aggregate → retrain
  4. Saves results (per-round means and stds) to dagger_hard.txt.

Expected results

Round | Expected mean episode length | What's happening
1 | ~200-400 | Plain BC on expert demos — same as P1
2 | ~400-600 | Some unimodal data added
3 | ~600-800 | More coverage
4-5 | ~800-1000 | Near-expert performance

The exact numbers depend on hyperparameters, but you should see monotonic improvement across rounds. By round 5, performance should be comparable to (or exceed) flow matching from P2.

How long does training take?

Per round: a quick rollout phase (about a minute) plus a BC retraining phase (a few minutes for 5000-6500 transitions). Five rounds total: roughly 20-30 minutes on CPU.

What healthy training looks like

Metric | Healthy | Bug
Round 1 eval | 200-400 (BC baseline) | 0 or 1000 (something off)
Per-round improvement | Each round > previous (mostly) | Stays flat or decreases
New transitions per round | 200-2000 (depends on episode length) | 0 (rollout_episode returning empty)
Final round eval | 800-1000 | Stuck below 500 (multimodality not resolved — check DeterministicExpert.act)

The deliverables

Per the PDF:

  1. Learning curve plot: round number on x-axis, mean episode length on y-axis with error bars (std). Include the original BC regression performance as a horizontal line.
  2. Three-method comparison plot/table (next chapter).
  3. 3-4 sentence explanation: why DAgger improves over rounds, what role the deterministic expert plays, how this approach solves the BC failure from P1.

Your one-paragraph writeup template

For the writeup

"DAgger improves over rounds because each round adds expert-labeled states from the policy's own deployment distribution to the training set, so the policy receives supervision exactly at the states where it tends to make errors and gradually closes the distribution-shift gap. The deterministic expert plays two roles: it provides a clean, consistent labeling signal at every visited state, and it removes the multimodality that broke MSE regression in Problem 1 by always picking gap 1 instead of randomly choosing. Together, the iterative dataset aggregation and the deterministic relabeling let plain MSE behavior cloning recover near-expert performance — same model and loss as P1, but trained on better data."

Chapter 11

Three-Method Comparison

The PDF asks for a comparison plot of all three methods on hard mode. Run:

python main.py --plot

This reads bc_reg_hard.txt, bc_flow_hard.txt, and dagger_hard.txt and produces a comparison.

Expected pattern

Method | Hard mode mean (typical) | How it solves the problem
BC regression (P1) | 200-400 | Doesn't — this is the baseline failure
Flow matching (P2) | 700-1000 | Generative model preserves modes → consistent gap selection per rollout
DAgger (P3, final round) | 800-1000 | Deterministic expert + iterative relabeling → unimodal training distribution

What the comparison teaches

BC, flow matching, and DAgger represent three different attacks on the same fundamental problem: multimodal experts on long-horizon control.

Real robot learning systems often combine both generative models and iterative relabeling. Diffusion Policy + DAgger is a known combination. The lessons from these three problems are foundational; you'll see variations of all three in research papers and production systems.

Caveats and tradeoffs

Property | BC reg | Flow matching | DAgger
Compute (training) | Cheap | Moderate (U-Net is bigger) | Cheap per round, but multiple rounds
Compute (inference) | 1 forward pass | 20 forward passes (Euler integration) | 1 forward pass
Expert assumption | Offline demos only | Offline demos only | Need interactive expert at any state
Theoretical bound | O(T²ε) | Same family as BC | O(Tε)

DAgger's main practical limitation: needing an interactive expert. In settings where the expert is a human operator, this is expensive. In simulation or where the expert is itself a controller (as in this homework), it's cheap.

Chapter 12

Cheat Sheet & Self-Quiz

Equations & bounds

DAgger dataset growth (round k): D_{k+1} = D_k ∪ { (s, π_det(s)) : s ~ p_{π_k}(·) }, starting from D_0, the initial expert demos.

Performance bounds (Ross et al., 2011):
- BC: J(π) − J(π*) = O(T²ε) (quadratic compounding)
- DAgger: J(π) − J(π*) = O(Tε) (linear; much better)
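The dataset-growth equation translates directly into the orchestration loop the scaffold provides. A schematic sketch, where `train_bc` and `rollout_and_relabel` stand in for the homework's trainer and your data-collection helper (their exact signatures here are assumptions):

```python
# Schematic DAgger loop: train on the aggregated set, roll out the current
# policy, relabel with the deterministic expert, aggregate, repeat.
def run_dagger_sketch(policy, env, det_expert, train_bc, rollout_and_relabel,
                      initial_data, n_rounds=5):
    dataset = list(initial_data)               # D_0: expert demonstrations
    for k in range(n_rounds):
        train_bc(policy, dataset)              # fit pi_k on the aggregated set
        # states come from pi_k's rollouts; labels from the deterministic expert
        new_pairs = rollout_and_relabel(policy, env, det_expert)
        dataset.extend(new_pairs)              # D_{k+1} = D_k ∪ new pairs
    return policy, dataset
```

Note the dataset only ever grows, matching self-quiz question 10.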

API reference

| Call | Returns |
|---|---|
| env.reset(seed=seed) | (obs, info_dict); obs has shape (4,) |
| env.step(np.array([action])) | (obs, reward, terminated, truncated, info) |
| det_expert.act(obs) | scalar float in [0, 1]; the expert's target y |
| det_expert.reset() | None; clears commitment, gap signature, and EMA |
| policy(state_tensor) | predicted action chunk, shape (B, action_chunk) |
| torch.as_tensor(np_arr).unsqueeze(0) | tensor with a batch dimension added |
| x.cpu().numpy() | numpy array on the CPU |
| policy.eval() | sets the policy to eval mode (no dropout, etc.) |
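The API calls above chain together whenever you query the policy inside a rollout. A minimal sketch, with the device handling and chunk shape assumed to match the homework scaffold:

```python
# Querying the chunked policy for one observation: eval mode, no gradients,
# numpy -> batched tensor -> forward pass -> numpy chunk.
import numpy as np
import torch

def query_policy(policy, obs, device="cpu"):
    """obs: array of shape (4,) -> predicted chunk as array of shape (action_chunk,)."""
    policy.eval()                                  # eval mode (no dropout, etc.)
    with torch.no_grad():                          # inference only, no autograd graph
        x = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0).to(device)
        chunk = policy(x)                          # shape (1, action_chunk)
    return chunk.squeeze(0).cpu().numpy()
```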

Self-quiz

  1. What is the distribution shift / covariate shift problem in BC?
  2. Why does the BC error compound quadratically while DAgger's compounds linearly?
  3. In DAgger, who provides the states? Who provides the action labels?
  4. Why use the deterministic expert in this homework rather than the original (multimodal) expert for relabeling?
  5. What does the deterministic expert do differently from the original?
  6. What's the difference between the actions the policy executes in rollout and the actions we store as labels?
  7. Why do we re-query the policy every EXECUTE_STEPS rather than every step?
  8. Why do we copy obs when appending to ep_states?
  9. How does rollout_and_relabel turn a per-step list of expert actions into action chunks?
  10. Why does the dataset grow each round but never shrink?
  11. If you used the original (multimodal) expert in DAgger, would the algorithm still resolve distribution shift? Would it resolve multimodality?
  12. What's the connection between the failure mode in P1 and the deterministic expert in P3?
Answer key

1. The states encountered at deployment time are determined by the policy's own behavior, but the policy was only trained on expert states. Small policy errors cause drift to states never seen in training, where the policy has no knowledge.

2. BC's per-step error amplifies with deviation from the training distribution; deviation grows linearly with time, and per-step cost scales with that deviation, so total cost is O(T²ε). DAgger keeps the training distribution covering deployment, so per-step error stays bounded; total cost is O(Tε).

3. The policy provides states (rollout). The expert provides action labels (relabeling). The point of DAgger is supervising the policy at the states it actually visits.

4. Because the original expert is multimodal — relabeling with it would just keep adding multimodal labels, leaving the BC averaging-into-the-wall problem unsolved. The deterministic expert removes multimodality from the data.

5. When close to a pipe, the original expert randomly picks gap 1 or gap 2 with equal probability. The deterministic expert always picks gap 1. Same midpoint hovering and EMA smoothing otherwise.
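The contrast in answer 5 fits in two tiny classes. This is an illustrative sketch only; the real experts' class names, attributes, and gap encoding are assumptions, not the scaffold's actual API:

```python
import random

class MultimodalExpertSketch:
    """Original expert's gap choice near a pipe (illustrative)."""
    def _choose_gap(self):
        return random.choice([0, 1])   # gap 1 or gap 2, 50/50 -> bimodal labels

class DeterministicExpertSketch(MultimodalExpertSketch):
    """The one-line change: always commit to gap 1."""
    def _choose_gap(self):
        return 0                       # always gap 1 -> unimodal labels
```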

6. Executed actions come from the policy (so we can collect realistic deployment-distribution states). Stored labels come from the deterministic expert (so the BC loss has a unimodal supervision signal). Mixing these up defeats DAgger's purpose.

7. Action chunking (executing several steps per policy query) gives temporal consistency, cuts the number of policy queries where errors could compound, and improves stability. We re-query every EXECUTE_STEPS (10) rather than after the full 20-step chunk because the later predicted actions go stale as the state evolves; executing only the first half keeps the actions fresh.

8. Without .copy(), all entries point to the same numpy array, which gets overwritten in-place by the env. By the end of the loop, every state in ep_states is identical (the final state). With .copy(), each state is a snapshot.
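The aliasing bug in answer 8 is easy to demonstrate in isolation. Here the in-place write stands in for an environment that reuses its observation buffer:

```python
# Demonstration of the aliasing bug: appending the same mutable array that
# gets overwritten in place leaves every stored "state" equal to the last one.
import numpy as np

obs = np.zeros(4)
aliased, snapshots = [], []
for t in range(3):
    obs[:] = t                     # env.step overwrites obs in place
    aliased.append(obs)            # bug: every entry is the same array object
    snapshots.append(obs.copy())   # fix: independent snapshot per step

# aliased now holds three references to one array (all equal to the final state);
# snapshots correctly records the states 0, 1, 2.
```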

9. By windowing: the chunk for state at index i is the next 20 expert actions ep_expert_actions[i:i+20]. Each state in the rollout (except the last 19) becomes one (state, chunk) training pair.
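The windowing in answer 9 can be sketched directly; variable names mirror the answer's, and the helper itself is illustrative rather than the scaffold's exact code:

```python
# Window per-step expert labels into fixed-length action chunks:
# state i pairs with expert actions [i, i+CHUNK); the last CHUNK-1 states
# have no full chunk and are dropped.
import numpy as np

CHUNK = 20

def window_chunks(ep_states, ep_expert_actions, chunk=CHUNK):
    """Return a list of (state, action-chunk) training pairs."""
    pairs = []
    for i in range(len(ep_states) - chunk + 1):
        pairs.append((ep_states[i], np.array(ep_expert_actions[i:i + chunk])))
    return pairs
```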

10. Because we want the training distribution to cover all of the policy's deployment distribution — including states from earlier rounds when the policy was worse. Forgetting old data would lose coverage and might cause the policy to forget edge cases.

11. Yes to distribution shift — the data still gets aggregated from policy rollouts, so coverage improves. No to multimodality — the labels would still be sometimes-gap-1, sometimes-gap-2, and MSE would still average them. You'd need a generative model (like in P2) for this case.

12. In P1, the multimodal expert's randomness made MSE regression collapse to an invalid average (the wall). The deterministic expert in P3 removes that randomness from the labels, eliminating the failure mode at its source. With unimodal labels, MSE works fine — same model and loss as P1, just better data.

Implementation order

  1. DeterministicExpert.act — 30 seconds. One line.
  2. rollout_episode — 10 minutes. The hardest of the three; mind the chunk buffering and the no_grad / device handling.
  3. rollout_and_relabel — 5 minutes. Clean loop and windowing.

Total: ~15 minutes of typing. Run on hard mode and watch performance climb from ~200 in round 1 to ~900 by round 5. Generate the learning-curve plot, then run --plot to compare with P1 and P2.

Take it back to class

You can now teach this

Three big ideas, in order of importance:

  1. Distribution shift is the central failure mode of plain BC. Train on expert states, deploy on policy states. The deployment distribution drifts away from training, errors compound quadratically with horizon, and even small per-step errors become catastrophic over long episodes.
  2. DAgger's fix is iterative relabeling: collect rollouts from the current policy, get expert labels at those states, aggregate, retrain. The training distribution converges to the deployment distribution; per-step errors stay bounded; total cost is linear instead of quadratic.
  3. The deterministic expert trick handles multimodality. By replacing the random gap choice with a fixed always-gap-1 choice, we make the labels unimodal. MSE regression now has a unimodal target and converges cleanly. Same algorithm as P1, just better data.

If a friend asks: "Why does DAgger improve over rounds?" — you say: "Each round adds expert-labeled states from the policy's actual deployment distribution to the training set. So the policy gets supervision at exactly the states where it currently makes errors. Over rounds, the training distribution converges to the deployment distribution, and the policy stops drifting off-distribution. The deterministic expert ensures the labels are consistent rather than random, which lets plain MSE regression succeed where it failed in Problem 1."

You can teach this. Submit the writeup.