Roll out the policy. Ask the expert what to do at every state visited. Aggregate. Retrain. Repeat. The simplest fix to behavior cloning's deepest weakness, plus the clever deterministic-expert trick that resolves the multimodality problem you saw in P1.
Same Flappy Bird environment. Same hard-mode multimodality challenge. Different fix.
Problem 1 introduced behavior cloning with MSE regression and showed how it fails on hard mode (the multimodal expert — gap 1 or gap 2 — gets averaged into the wall). Problem 2 fixed this by replacing MSE with flow matching (a generative model that preserves modes). Problem 3 takes the orthogonal approach: keep MSE regression but make the data unimodal.
Plain BC has two distinct issues, and they call for different fixes:
| Failure mode | What goes wrong | Where it shows up | Fix |
|---|---|---|---|
| Multimodality | MSE averages multiple valid actions into an invalid mean | Hard mode with bimodal expert | P2: flow matching, OR P3: deterministic expert |
| Distribution shift | Policy errors compound; agent ends up in states the expert never visited | Long episodes, anywhere | P3: DAgger |
Problem 3 is mostly aimed at distribution shift, but in this homework it also resolves multimodality through the deterministic-expert trick. Both fixes happen at once. We'll see why in Chapter 05.
You implement three small functions in dagger.py: DeterministicExpert.act, rollout_episode, and rollout_and_relabel.
The DAgger orchestration loop (run_dagger) is provided. You write the data-collection helpers; the framework handles the BC retraining and aggregation.
This chapter is the conceptual foundation of DAgger. Without understanding distribution shift, the algorithm is just an arbitrary procedure.
BC trains a policy to mimic the expert at states the expert visits. Call this distribution of states p_expert(s). The training loss is the MSE between the policy's prediction and the expert's action, averaged over expert-visited states: L(θ) = E_{s ∼ p_expert} ‖π_θ(s) − a_expert(s)‖².
The policy is good at states the expert visits. Training cannot tell us anything about states the expert doesn't visit, because we have no examples there.
At test time, the policy makes a small error at some state s0. It produces an action that's almost right but slightly off. The bird is now in a slightly weird state at the next timestep — one the expert never quite visited (because the expert didn't make the same small error).
From this slightly-out-of-distribution state, the policy makes a slightly larger error. Now the bird is in a more out-of-distribution state. From there, an even larger error. From there, total chaos. By 50 timesteps in, the policy is in a state space the expert never saw, and the policy has no idea what to do.
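To make the compounding concrete, here is a toy model (purely illustrative, not part of the homework; the constants are assumptions) in which the per-step error is amplified by how far the state has already drifted off-distribution:

```python
# Toy model of compounding error (illustrative only).
eps = 0.01                    # per-step error at in-distribution states
drift = 0.0                   # how far we've strayed from training states
total_error = 0.0
for t in range(50):
    step_error = eps * (1.0 + drift)  # error grows with the drift so far
    drift += step_error               # each error pushes us further off-distribution
    total_error += step_error
print(f"after 50 steps: drift={drift:.3f}, total error={total_error:.3f}")
```

Under this assumed model the drift grows like (1 + ε)^t − 1: geometric, not linear. The feedback loop (error causes drift, drift causes more error) is the point.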
You might think: "just train the BC policy harder, get smaller errors per step, problem solved." But training error can never be exactly zero, and even a tiny per-step error drifts the state distribution. Once the policy is off-distribution, nothing in the training data constrains its behavior, so the per-step error grows instead of staying at ε.
Distribution shift: when the distribution of states encountered during deployment differs from the distribution of states encountered during training. In BC, this happens automatically because the deployment-time state distribution is determined by the policy (not the expert), and the policy's small errors drift the state distribution away from the training data.
| Approach | Idea | Used in |
|---|---|---|
| DAgger | Iteratively collect states from policy rollouts, get expert labels at those states, retrain. The training distribution gradually expands to cover the deployment distribution. | This homework, robot teleop, autonomous driving |
| Better representations | Use convolutional/transformer networks that generalize better to nearby out-of-distribution states. | Modern foundation-model robot policies |
DAgger is the cleanest theoretical fix. The distribution shift literature largely starts with DAgger.
DAgger stands for Dataset Aggregation. Ross, Gordon, and Bagnell, AISTATS 2011.
The problem with BC: the policy gets bad at states it visits but the expert didn't. Solution: collect data at the states the policy actually visits, get expert labels there, add to the training set, retrain. Repeat.
Algorithmically:

1. Train an initial policy with BC on the expert demos D.
2. Roll out the current policy and collect the states it visits.
3. Query the expert for the correct action at each of those states.
4. Aggregate: D ← D ∪ {new (state, expert action) pairs}.
5. Retrain the policy on all of D. Go to step 2.
That's the entire algorithm. The dataset grows each round. The policy improves because it now has supervision at exactly the states it tends to visit.
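As a compact sketch of those steps (illustrative only; the provided run_dagger in dagger.py is the real orchestrator, and train_bc, rollout, and expert_label here are assumed stand-in callables):

```python
def dagger(initial_demos, train_bc, rollout, expert_label, num_rounds):
    """Minimal DAgger loop sketch, NOT the provided run_dagger.

    Assumed interfaces: train_bc(dataset) -> policy,
    rollout(policy) -> list of visited states,
    expert_label(state) -> expert action at that state.
    """
    dataset = list(initial_demos)                   # round 0: plain BC data
    policy = train_bc(dataset)
    for _ in range(num_rounds):
        states = rollout(policy)                    # states come from the POLICY
        labels = [expert_label(s) for s in states]  # actions come from the EXPERT
        dataset += list(zip(states, labels))        # aggregate: dataset only grows
        policy = train_bc(dataset)                  # retrain on everything
    return policy
```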
BC trains the policy on the expert's state distribution. DAgger trains the policy on the policy's own state distribution — with expert labels there. Over rounds, the training distribution converges to the deployment distribution.
The policy provides the states (where to collect data). The expert provides the actions (what to do at those states). Mixing this up is the most common conceptual error.
If you used the policy's actions as labels, you'd just be training the policy to do whatever it already does, which gives no learning signal. You need the expert's different action at the policy's bad states to teach the policy to recover.
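One quick way to see the "no learning signal" claim: the MSE of a policy against its own (detached) predictions is identically zero, so every gradient vanishes. A minimal PyTorch check:

```python
import torch

# Self-labeling gives zero loss and zero gradient: no learning signal.
policy = torch.nn.Linear(4, 20)          # stand-in for a BC policy
s = torch.randn(8, 4)                    # a batch of states
labels = policy(s).detach()              # "labels" copied from the policy itself
loss = torch.nn.functional.mse_loss(policy(s), labels)
loss.backward()
print(loss.item())                            # 0.0
print(policy.weight.grad.abs().max().item())  # 0.0 -- nothing to learn
```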
DAgger requires you to query the expert at any state, on demand. This is more demanding than plain BC, which only needs a fixed offline dataset of expert demonstrations.
In this homework, the expert is a Python class (DeterministicExpert) that accepts an observation and returns an action. We can call it whenever we want. In real robot teleop, the human operator has to actually be available to provide labels — which is expensive but doable.
BC trains under distribution mismatch: the loss is minimized over states s ∼ p_expert(s), but at deployment the policy encounters states s ∼ p_π(s), the distribution induced by its own behavior.
These are different. The policy's behavior depends on its training data, but the policy's training data is determined by the expert — not by the policy. As soon as the policy is deployed, it visits states the training never sampled.
DAgger explicitly closes this gap by feeding the policy's own state distribution back into the training set.
After a few rounds, the training distribution covers the deployment distribution. The policy's per-step error stops compounding because every state it visits is now labeled.
Ross & Bagnell (2011) proved a regret bound for DAgger. With N DAgger iterations, the gap between the policy's performance and the expert's scales as:
where ε is the per-step training error and T is the episode horizon. Compare to vanilla BC:
For T = 1000 (this homework), the difference is dramatic: T · ε = 1000ε vs. T2 · ε = 1,000,000ε. Even a 1% per-step error compounds to 100% in BC but stays at ~10% with DAgger.
BC is quadratic because errors compound: a per-step error of ε means the deployment distribution drifts by O(T·ε), and at each state the policy's error is amplified by the deviation from the training distribution — giving O(T·ε · T) = O(T²·ε).
DAgger is linear because the deployment distribution converges to the training distribution: errors don't compound, they stay constant. Total cost is just T · ε.
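Plugging in this homework's horizon makes the gap concrete (back-of-envelope arithmetic; the bounds hide constant factors):

```python
# Back-of-envelope comparison of the two bounds (constants ignored).
T, eps = 1000, 0.01
print("BC     ~ T^2 * eps =", T**2 * eps)             # 10000.0 -- compounding
print("DAgger ~ T   * eps =", T * eps)                # 10.0    -- linear
print("ratio  = T         =", (T**2 * eps) / (T * eps))  # 1000.0
```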
The first round of DAgger trains BC on the original expert demos — same as Problem 1. Performance: poor on hard mode.
The second round adds states visited by this poor policy, labeled by the expert. The retrained policy is better — it now knows what to do in some of the off-distribution states.
Subsequent rounds keep adding policy-visited states. After 5 rounds (this homework's setting), the dataset has good coverage of the policy's actual deployment distribution, and the policy is much closer to expert performance.
The plot you'll generate (Figure for Problem 3) should show this monotonic improvement — the first round near baseline BC, climbing toward expert level by round 5.
This is the cleverest part of HW1's DAgger setup. Read it carefully.
From Problem 1: hard mode has alternating single- and double-gap pipes. The expert in expert.py sees a double-gap pipe and randomly picks one of the two gaps:
```python
# From expert.py
if dist < self.commit_dist:
    self.target_gap_idx = np.random.choice([0, 1])
    self._committed = True
```
This randomness is what made the expert's actions multimodal: at the same state, sometimes the expert picks gap 1 (y=0.7), sometimes gap 2 (y=0.3). MSE regression averages these into y=0.5 — the wall.
For DAgger relabeling, we don't have to use the same multimodal expert. We can build a deterministic version that always makes the same choice:
```python
# DeterministicExpert (the version you'll fill in)
if dist < self.commit_dist:
    self._committed = True
    raw_target = float(gap1_y)  # ALWAYS gap 1
```
The deterministic expert always commits to gap 1 (the upper gap) when close to the pipe. No randomness. Same state → same action, every time.
Now think about what DAgger does. It collects states by rolling out the current policy, then labels them with the deterministic expert's actions. Every label says "go to gap 1 here, go to gap 1 there, go to gap 1 everywhere." The training data is now unimodal.
MSE regression on unimodal data works great — it converges to the conditional mean, but the conditional mean of "gap 1, gap 1, gap 1, …" is just gap 1. The policy learns to consistently pick gap 1.
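A tiny numpy check of this claim (toy labels, with 0.7 and 0.3 standing in for the two gap heights):

```python
import numpy as np

# MSE's best constant prediction is the mean of the labels.
bimodal = np.array([0.7, 0.3] * 50)   # original expert: gap 1 or gap 2 at random
unimodal = np.full(100, 0.7)          # deterministic expert: always gap 1
print(bimodal.mean())    # 0.5 -> the wall between the gaps
print(unimodal.mean())   # 0.7 -> gap 1, a valid target
```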
DAgger fixes distribution shift by aggregating policy-visited states with expert labels. The deterministic expert also fixes multimodality by removing the randomness. Together, they let plain MSE regression succeed where it failed in P1 — same model, same loss, just different data.
Standard DAgger usually uses the same expert that generated the initial demos. Here we use a different (deterministic) expert. Why?
Because the original expert is multimodal — if we used it for relabeling, we'd just keep adding multimodal labels to the training set. DAgger would fix distribution shift but not multimodality. The combination of MSE + multimodal labels still produces averaged-into-the-wall predictions.
The deterministic expert is a trick specific to this homework's setup. In real robotics, the expert is usually a human operator or a known-good controller, which is naturally deterministic for any single state. Multimodality from random expert behavior is a synthetic artifact of this homework.
The DeterministicExpert has three behaviors, mostly already coded for you: track the current pipe (via the gap signature), pick a raw target (the gap midpoint when far, a committed gap when close), and EMA-smooth that target.
Your edit is one line of Python: raw_target = float(gap1_y). The cleverness is conceptual; the code is trivial.
One implementation wrinkle: the policy predicts 20-step action chunks, but only executes the first 10 before re-querying. How does DAgger handle this?
During rollout_episode:
- Maintain chunk_buf (the most recent 20-step action chunk from the policy) and a step_in_chunk counter.
- The data we collect is per-step, not per-chunk. Each row of the dataset is (s_t, expert_action_at_s_t) — not a 20-step chunk.
During rollout_and_relabel:
- Call rollout_episode, getting per-step lists.
- Window those lists into 20-step chunks and append to the new_states and new_actions lists.
- The expert's actions at each step are stored sequentially during rollout, then windowed into chunks of length 20 afterwards.
The expert is queried at every state, so we have a per-step expert action sequence. Concatenating any 20 consecutive expert actions gives a valid expert action chunk — the expert could have produced that chunk if asked. So windowing creates new (state, chunk) training pairs without needing extra expert queries.
Look at the original collect_expert_data in expert.py:
```python
for i in range(len(ep_states) - action_chunk + 1):
    all_states.append(ep_states[i])
    all_actions.append(ep_actions[i:i + action_chunk])
```
Same windowing pattern. rollout_and_relabel mirrors this exactly — collect per-step lists, window into chunks, append.
The only difference: collect_expert_data rolls out the expert; rollout_and_relabel rolls out the policy but labels with the expert. Different state distribution, same windowing logic.
Each round, run_dagger calls rollout_and_relabel for episodes_per_round episodes and aggregates the returned (state, chunk) pairs into the training set. Three things to internalize: the states come from the policy's rollouts, the labels come from the deterministic expert, and the dataset only ever grows.
One file, three blanks. Plus the orchestrator run_dagger which is read-only.
| Function | Status | What it does |
|---|---|---|
| DeterministicExpert.act | EDIT (one line) | Set raw_target = gap1_y when committed |
| rollout_episode | EDIT | Roll out policy for one episode, return states & expert actions |
| rollout_and_relabel | EDIT | Loop over episodes, window into chunks |
| run_dagger | read-only | The full DAgger orchestrator (eval, collect, retrain) |
Almost everything is provided in dagger.py:30-96. The class has:

- _last_gap_sig, _committed, _smooth_target — for tracking pipe transitions and EMA smoothing.

dagger.py:99-129:
```python
@torch.no_grad()
def rollout_episode(env, policy, seed, action_chunk, device):
    # YOUR CODE HERE
    return ep_states, ep_expert_actions
```
Inputs:
- env: a FlappyBirdEnv instance (already constructed).
- policy: trained BC policy. Callable as policy(state_tensor); outputs a chunk of shape (1, action_chunk).
- seed: env reset seed for reproducibility.
- action_chunk: 20 (the chunk length).
- device: torch device for tensor placement.

Outputs: per-step list of states, per-step list of expert actions (scalars, not chunks — we'll window into chunks in rollout_and_relabel).
dagger.py:132-171:
```python
@torch.no_grad()
def rollout_and_relabel(policy, difficulty, num_episodes, pipe_speed,
                        seed, action_chunk, device):
    policy.eval()
    env = FlappyBirdEnv(difficulty=difficulty, pipe_speed=pipe_speed)
    det_expert = DeterministicExpert()
    new_states, new_actions = [], []
    # YOUR CODE HERE
    return new_states, new_actions
```
The orchestrator calls this once per round to get fresh training data. num_episodes defaults to a small number (~10 in the default config). Returns numpy arrays of shape (N, 4) states and (N, 20) action chunks.
Note: the deterministic expert is constructed inside this function, not passed in. Each call gets a fresh expert that resets between episodes.
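For shape intuition, a hypothetical call might look like this (the argument values are illustrative assumptions, not the actual config defaults):

```python
# Hypothetical usage; pipe_speed and the other values are illustrative.
new_states, new_actions = rollout_and_relabel(
    policy, difficulty="hard", num_episodes=10, pipe_speed=1.0,
    seed=0, action_chunk=20, device="cpu")
print(new_states.shape)   # (N, 4)  -- one row per windowed state
print(new_actions.shape)  # (N, 20) -- one expert action chunk per state
```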
Per-line annotations. This is the centerpiece chapter.
Where: dagger.py:81-86.
What you need: when the bird is close enough to commit to a gap, deterministically pick gap 1 instead of randomly choosing.
The code:
```python
if dist < self.commit_dist:       # very close to the pipe
    self._committed = True
    raw_target = float(gap1_y)    # <-- YOUR CODE
else:
    raw_target = float(midpoint)
```
Set the raw target to gap1_y. Always gap 1, regardless of which would be closer to the bird's current position.
Why float(gap1_y) and not just gap1_y: gap1_y is unpacked from a numpy array (obs[1]), so it might be a numpy.float32. Wrapping in Python's float() converts to a plain Python float, which plays more nicely with downstream EMA smoothing arithmetic.
Why gap 1 specifically: arbitrary choice. Gap 2 would work equally well as long as we're consistent. The key is determinism, not which gap.
One line of code. That's the entire trick. The original expert (in expert.py) had:
```python
self.target_gap_idx = np.random.choice([0, 1])  # multimodal!
self._committed = True
```
Your version has:
```python
self._committed = True
raw_target = float(gap1_y)  # deterministic!
```
Two snippets: one calls np.random.choice, the other always picks gap 1. This single deterministic choice is what turns multimodal expert data into unimodal data, which is what makes MSE regression work.
Where: dagger.py:99-129.
What you need to build: roll out the current policy for one episode, with action chunking and re-querying every EXECUTE_STEPS = 10 steps. At each timestep, store the current state and the deterministic expert's action for that state.
The code:
```python
@torch.no_grad()
def rollout_episode(env, policy, seed, action_chunk, device):
    obs, _ = env.reset(seed=seed)
    det_expert = DeterministicExpert()
    det_expert.reset()
    ep_states, ep_expert_actions = [], []
    chunk_buf = None
    step_in_chunk = 0
    done = False
    while not done:
        # Re-query the policy when the buffer is empty or exhausted.
        if chunk_buf is None or step_in_chunk >= EXECUTE_STEPS:
            state_tensor = torch.as_tensor(
                obs, dtype=torch.float32, device=device).unsqueeze(0)
            chunk_buf = policy(state_tensor).cpu().numpy()[0]
            step_in_chunk = 0
        action = float(chunk_buf[step_in_chunk])
        expert_action = det_expert.act(obs)   # relabel with the expert
        ep_states.append(obs.copy())
        ep_expert_actions.append(expert_action)
        obs, _, terminated, truncated, _ = env.step(np.array([action]))
        done = terminated or truncated
        step_in_chunk += 1
    return ep_states, ep_expert_actions
```
Reset the environment with the given seed. Returns (observation, info_dict); we discard the info dict. obs is a numpy array of shape (4,).
Seeding ensures reproducibility — running with the same seed produces the same initial state.
Create a fresh deterministic expert and reset its internal state (gap signature, commitment, EMA buffer). Each episode gets a clean expert — no state carried over from previous episodes.
Empty lists to collect the per-step data. We'll append to these inside the loop, then return them at the end.
Initial state of the chunk buffer. None means we haven't queried the policy yet, so we'll query on the first iteration. step_in_chunk tracks how many actions we've used from the current chunk — when it hits EXECUTE_STEPS=10, we re-query.
Re-query the policy when (a) we haven't queried yet, or (b) we've used up the executable portion of the previous chunk. This is the receding-horizon re-querying logic.
Note: we re-query at EXECUTE_STEPS = 10, not at action_chunk = 20. The remaining 10 actions in the buffer are discarded — that's the point of "execute K of T predicted."
Convert the numpy obs to a torch tensor on the right device, then add a batch dimension. The policy expects shape (B, 4), not (4,). .unsqueeze(0) adds the batch dim, giving shape (1, 4).
Forward pass through the BC policy. For BCPolicy from P1, this is a single forward through the MLP, returning (1, 20). For FlowMatchingPolicy from P2, this is the full Euler integration. The output shape is the same: (1, action_chunk).
.cpu() moves the result back to CPU (in case it was on GPU). .numpy() converts to numpy. [0] drops the batch dim, giving shape (20,).
Why we don't keep gradients: the function is decorated @torch.no_grad(). Inference, no need for autograd. Saves memory and time.
Take the action at the current position in the chunk. chunk_buf[step_in_chunk] is a numpy scalar; float() converts to a plain Python float for the env.
Query the deterministic expert at the current state. This is the relabeling step: we don't keep the policy's action as the label; we use the expert's action.
Note we pass the current obs, not the state we'd get after taking the policy's action. Labels go with the state we encountered, not the state we're about to encounter.
Store the current state and the expert's label. obs.copy() creates a copy — without it, all entries in ep_states would point to the same array (which gets overwritten by the env step).
Take the step. env.step in Gymnasium returns (obs, reward, terminated, truncated, info). We don't need reward or info. terminated is True if the bird crashed; truncated is True if we hit the 1000-step max.
The action goes in as np.array([action]) — the env expects an array, not a scalar.
Either condition ends the episode.
Advance the position in the chunk. When this hits EXECUTE_STEPS, the next iteration will re-query the policy.
Return the per-step lists. Note these are scalar expert actions, not chunks. rollout_and_relabel handles windowing.
1. Forgetting to reset the deterministic expert: state from previous episodes carries over (commit to gap 1 even when in a new pipe context). Manifests as poor relabeling on the first few episodes.
2. Not copying obs: ep_states.append(obs) without .copy() stores the same object many times; after the loop, all entries point to the final state (see the standalone demo after this list).
3. Storing policy actions instead of expert actions: defeats the entire point of DAgger. Make sure the action you append is from det_expert.act(obs), not from chunk_buf.
4. Forgetting the chunk_buf is None condition: if you only check step_in_chunk >= EXECUTE_STEPS, the buffer is never initialized, and the first iteration crashes (a NameError or TypeError, depending on how you wrote it).
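Pitfall 2 is worth seeing in isolation. A standalone demonstration, simulating an env that reuses its observation buffer (the assumption behind the pitfall; here we mutate a numpy array in place):

```python
import numpy as np

# Simulate an env that mutates its observation buffer in place.
obs = np.zeros(4)
buggy, correct = [], []
for t in range(3):
    obs[0] = t                  # "env.step" overwrites the buffer
    buggy.append(obs)           # stores a reference to the same array
    correct.append(obs.copy())  # stores an independent snapshot
print([s[0] for s in buggy])    # [2.0, 2.0, 2.0] -- every entry is the final state
print([s[0] for s in correct])  # [0.0, 1.0, 2.0]
```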
Where: dagger.py:132-171.
What you need to build: loop over num_episodes, call rollout_episode for each, window the per-step lists into 20-step (state, action_chunk) pairs, return numpy arrays.
The code:
```python
@torch.no_grad()
def rollout_and_relabel(policy, difficulty, num_episodes, pipe_speed,
                        seed, action_chunk, device):
    policy.eval()
    env = FlappyBirdEnv(difficulty=difficulty, pipe_speed=pipe_speed)
    new_states, new_actions = [], []
    for ep in range(num_episodes):
        # Roll out the current policy; labels come from the deterministic expert.
        ep_states, ep_expert_actions = rollout_episode(
            env, policy, seed=seed + ep,
            action_chunk=action_chunk, device=device)
        # Window per-step lists into (state, 20-step chunk) pairs.
        for i in range(len(ep_states) - action_chunk + 1):
            new_states.append(ep_states[i])
            new_actions.append(ep_expert_actions[i:i + action_chunk])
    env.close()
    return (np.array(new_states, dtype=np.float32),
            np.array(new_actions, dtype=np.float32))
```
Put the policy in eval mode. For most architectures this disables dropout and switches BatchNorm to use running statistics. The provided BCPolicy doesn't use dropout or BatchNorm, so this is mostly defensive coding — doesn't change behavior here but is correct practice.
Construct one env instance for all episodes. We'll reset it inside rollout_episode for each new episode.
Accumulators for the windowed (state, action chunk) pairs across all episodes.
Loop over episodes. The provided default is something like 10 episodes per round.
Run one episode. seed + ep gives a different seed per episode for diversity. Returns per-step lists.
Windowing loop. We can build a chunk starting at index i if and only if at least action_chunk actions remain from index i onward (indices i through i + action_chunk − 1 all exist). The +1 is because range(stop) goes up to (but not including) stop.
Example: if the episode has 100 steps and action_chunk = 20, this iterates i from 0 to 80 inclusive. We get 81 (state, chunk) pairs from a 100-step episode.
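A one-liner to verify that count (pure arithmetic, independent of the homework code):

```python
# Verify the window count for a 100-step episode with 20-step chunks.
ep_len, action_chunk = 100, 20
starts = list(range(ep_len - action_chunk + 1))
print(len(starts), starts[0], starts[-1])   # 81 0 80
```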
The state for this windowed pair is the state at index i.
The action chunk is the next 20 expert actions starting at index i. ep_expert_actions[i:i+20] takes a slice of length 20.
Even though the policy's actual rollout used policy actions (with re-querying), the labels here are expert actions for 20 consecutive steps starting at this state. This is the relabeling that gives BC consistent unimodal supervision.
Release env resources. Good housekeeping.
Convert lists to numpy arrays with the right dtype. Shape: states (N, 4), actions (N, 20).
dtype=np.float32 matches what the BC training pipeline expects. Without explicit dtype, numpy might infer float64, which is wasteful and may cause issues with the torch model (which uses float32).
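A quick check of numpy's default dtype inference, which motivates the explicit dtype:

```python
import numpy as np

# numpy infers float64 by default; the training pipeline expects float32.
chunks = [[0.7] * 20]
print(np.array(chunks).dtype)                    # float64
print(np.array(chunks, dtype=np.float32).dtype)  # float32
```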
Episode 1 of round 1 starts. rollout_episode rolls out the policy for ~50 steps before it crashes, returning 50 states and 50 expert actions.
rollout_and_relabel windows these into 50 - 20 + 1 = 31 (state, chunk) pairs.
Repeat for 10 episodes → ~310 new training pairs.
Aggregate with the original ~5000 expert demo pairs → ~5310 training pairs for round 2.
By round 5, the dataset has ~6500 pairs, with progressively more coverage of the policy's actual trajectories.
python main.py --method dagger --env hard
This:
- Runs run_dagger for 5 rounds.
- Each round: evaluate → collect (rollout_and_relabel) → aggregate → retrain.
- Writes per-round results to dagger_hard.txt.

| Round | Expected mean episode length | What's happening |
|---|---|---|
| 1 | ~200-400 | Plain BC on expert demos — same as P1 |
| 2 | ~400-600 | Some unimodal data added |
| 3 | ~600-800 | More coverage |
| 4-5 | ~800-1000 | Near-expert performance |
The exact numbers depend on hyperparameters, but you should see monotonic improvement across rounds. By round 5, performance should be comparable to (or exceed) flow matching from P2.
Per round: a quick rollout phase (~1 minute) plus a BC retraining phase (a few minutes for 5000-6500 transitions). Five rounds total: roughly 20-30 minutes on CPU.
| Metric | Healthy | Bug |
|---|---|---|
| Round 1 eval | 200-400 (BC baseline) | 0 or 1000 (something off) |
| Per-round improvement | Each round > previous (mostly) | Stays flat or decreases |
| New transitions per round | 200-2000 (depends on episode length) | 0 (rollout_episode returning empty) |
| Final round eval | 800-1000 | Stuck below 500 (multimodality not resolved — check DeterministicExpert.act) |
Per the PDF:
"DAgger improves over rounds because each round adds expert-labeled states from the policy's own deployment distribution to the training set, so the policy receives supervision exactly at the states where it tends to make errors and gradually closes the distribution-shift gap. The deterministic expert plays two roles: it provides a clean, consistent labeling signal at every visited state, and it removes the multimodality that broke MSE regression in Problem 1 by always picking gap 1 instead of randomly choosing. Together, the iterative dataset aggregation and the deterministic relabeling let plain MSE behavior cloning recover near-expert performance — same model and loss as P1, but trained on better data."
The PDF asks for a comparison plot of all three methods on hard mode. Run:
python main.py --plot
This reads bc_reg_hard.txt, bc_flow_hard.txt, and dagger_hard.txt and produces a comparison.
| Method | Hard mode mean (typical) | How it solves the problem |
|---|---|---|
| BC regression (P1) | 200-400 | Doesn't — this is the baseline failure |
| Flow matching (P2) | 700-1000 | Generative model preserves modes → consistent gap selection per rollout |
| DAgger (P3, final round) | 800-1000 | Deterministic expert + iterative relabeling → unimodal training distribution |
BC, flow matching, and DAgger represent three different attacks on the same fundamental problem: multimodal experts on long-horizon control.
Real robot learning systems often combine both generative models and iterative relabeling. Diffusion Policy + DAgger is a known combination. The lessons from these three problems are foundational; you'll see variations of all three in research papers and production systems.
| Property | BC reg | Flow matching | DAgger |
|---|---|---|---|
| Compute (training) | Cheap | Moderate (U-Net is bigger) | Cheap per round, but multiple rounds |
| Compute (inference) | 1 forward pass | 20 forward passes (Euler integration) | 1 forward pass |
| Expert assumption | Offline demos only | Offline demos only | Need interactive expert at any state |
| Theoretical bound | O(T²ε) | Same family as BC | O(Tε) |
DAgger's main practical limitation: needing an interactive expert. In settings where the expert is a human operator, this is expensive. In simulation or where the expert is itself a controller (as in this homework), it's cheap.
| Call | Returns |
|---|---|
| env.reset(seed=seed) | (obs, info_dict); obs is shape (4,) |
| env.step(np.array([action])) | (obs, reward, terminated, truncated, info) |
| det_expert.act(obs) | scalar float in [0, 1]; expert's target y |
| det_expert.reset() | None; clears commitment, gap signature, EMA |
| policy(state_tensor) | predicted action chunk, shape (B, action_chunk) |
| torch.as_tensor(np_arr).unsqueeze(0) | tensor with batch dim added |
| x.cpu().numpy() | numpy array on CPU |
| policy.eval() | set policy to eval mode (no dropout, etc.) |
Self-check questions; try them before reading the answers below:

1. Why does distribution shift happen in BC?
2. Why is BC's bound O(T²·ε) while DAgger's is O(T·ε)?
3. In DAgger, who provides the states and who provides the action labels?
4. Why relabel with a deterministic expert rather than the original expert?
5. How does the deterministic expert differ from the original expert?
6. During a DAgger rollout, which actions are executed and which are stored as labels?
7. Why re-query the policy every EXECUTE_STEPS rather than every step?
8. Why copy obs when appending to ep_states?
9. How does rollout_and_relabel turn a per-step list of expert actions into action chunks?
10. Why keep all old data when aggregating, instead of training on only the newest round?
11. Would DAgger with the original (multimodal) expert fix distribution shift? Would it fix multimodality?
12. How does Problem 3's fix relate to Problem 1's failure?

Answers:

1. The states encountered at deployment time are determined by the policy's own behavior, but the policy was only trained on expert states. Small policy errors cause drift to states never seen in training, where the policy has no knowledge.
2. BC's per-step error amplifies with deviation from the training distribution; deviation grows linearly with time, and per-step error scales with deviation, so total cost is O(T2 ε). DAgger keeps the training distribution covering deployment, so per-step error stays bounded; total cost is O(T ε).
3. The policy provides states (rollout). The expert provides action labels (relabeling). The point of DAgger is supervising the policy at the states it actually visits.
4. Because the original expert is multimodal — relabeling with it would just keep adding multimodal labels, leaving the BC averaging-into-the-wall problem unsolved. The deterministic expert removes multimodality from the data.
5. When close to a pipe, the original expert randomly picks gap 1 or gap 2 with equal probability. The deterministic expert always picks gap 1. Same midpoint hovering and EMA smoothing otherwise.
6. Executed actions come from the policy (so we can collect realistic deployment-distribution states). Stored labels come from the deterministic expert (so the BC loss has a unimodal supervision signal). Mixing these up defeats DAgger's purpose.
7. Action chunking (re-querying every K steps) gives temporal consistency, reduces compounding error from policy queries, and improves stability. We re-query at EXECUTE_STEPS, not at the full chunk length, because the last 10 of 20 predicted actions are stale by then.
8. Without .copy(), all entries point to the same numpy array, which gets overwritten in-place by the env. By the end of the loop, every state in ep_states is identical (the final state). With .copy(), each state is a snapshot.
9. By windowing: the chunk for state at index i is the next 20 expert actions ep_expert_actions[i:i+20]. Each state in the rollout (except the last 19) becomes one (state, chunk) training pair.
10. Because we want the training distribution to cover all of the policy's deployment distribution — including states from earlier rounds when the policy was worse. Forgetting old data would lose coverage and might cause the policy to forget edge cases.
11. Yes to distribution shift — the data still gets aggregated from policy rollouts, so coverage improves. No to multimodality — the labels would still be sometimes-gap-1, sometimes-gap-2, and MSE would still average them. You'd need a generative model (like in P2) for this case.
12. In P1, the multimodal expert's randomness made MSE regression collapse to an invalid average (the wall). The deterministic expert in P3 removes that randomness from the labels, eliminating the failure mode at its source. With unimodal labels, MSE works fine — same model and loss as P1, just better data.
Total: ~15 minutes of typing. Run on hard mode, observe ~200 round-1 performance climb to ~900 by round 5. Generate the learning curve plot. Run --plot to compare with P1 and P2.
Three big ideas, in order of importance:

1. Distribution shift: the deployment state distribution is induced by the policy, not the expert, so small errors compound (O(T²·ε) for BC).
2. DAgger: collect states from the policy's own rollouts, label them with the expert, aggregate, retrain; the training distribution converges to the deployment distribution, giving O(T·ε).
3. The deterministic expert: removing the expert's randomness makes the labels unimodal, so plain MSE regression succeeds where it failed in P1.
If a friend asks: "Why does DAgger improve over rounds?" — you say: "Each round adds expert-labeled states from the policy's actual deployment distribution to the training set. So the policy gets supervision at exactly the states where it currently makes errors. Over rounds, the training distribution converges to the deployment distribution, and the policy stops drifting off-distribution. The deterministic expert ensures the labels are consistent rather than random, which lets plain MSE regression succeed where it failed in Problem 1."
You can teach this. Submit the writeup.