A vector field that pushes Gaussian noise into expert action distributions. Three lines of math, three blanks of code, and the tool that solves the multimodality problem you just witnessed in Problem 1.
Same Flappy Bird environment as Problem 1. Same 4-D observation, same 20-step action chunk, same hard mode with alternating single and double-gap pipes.
What's different: the algorithm. Instead of training an MLP to regress the expert's action via MSE (Problem 1's approach), you'll train a generative model that learns the entire distribution of expert actions at each state. At inference time, you sample from that distribution to get an action chunk.
The deliverables for Problem 2:

- FlowMatchingSchedule.interpolate — sample noise and build the noisy training input.
- FlowMatchingSchedule.sample — integrate the learned ODE from noise to a clean action.
- flow_matching_loss — MSE on velocity prediction.

By the end of this guide you'll know how these three pieces fit together.
Recall the climax of Problem 1: hard mode has multimodal experts. Two valid expert actions per state — gap 1 (top) or gap 2 (bottom). MSE regression converges to the conditional mean of expert actions, which is the midpoint between the gaps — the wall.
The fundamental issue was that MSE assumes a unimodal target distribution. We want a method that can represent multiple modes simultaneously: at this state, the expert sometimes does X, sometimes Y, and the policy should be able to sample one or the other rather than averaging.
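This collapse is easy to verify numerically. A minimal sketch, with hypothetical gap positions 0.2 and 0.8 standing in for gap 1 and gap 2:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical bimodal expert: half the demos aim at gap 1 (0.2),
# half at gap 2 (0.8) -- both valid actions for the same state.
targets = rng.choice([0.2, 0.8], size=10_000)

# The constant prediction minimizing MSE is the sample mean:
best_mse_prediction = targets.mean()
print(best_mse_prediction)  # ~0.5: the wall between the gaps, not a valid action
```

No matter how expressive the regressor, minimizing MSE against bimodal targets drives it toward this midpoint.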
Two orthogonal approaches:

1. Fix the data: curate demonstrations so each state has a single consistent expert action.
2. Fix the model: use a generative policy that can represent multiple modes.

Both work. Modern robot learning systems often use both: a generative policy (flow matching, diffusion) trained on the cleanest data possible.
A regressor outputs one answer per input. A generative model outputs a distribution per input from which you can sample. For unimodal targets, the two are equivalent. For multimodal targets, only the generative model preserves the modes.
Step back from flow matching for a moment. What does it mean to "model the conditional distribution p(a | s)" rather than just its mean?
The MSE-trained policy from Problem 1:
Pass in a state, get back one action. The same state always gives the same action.
A generative policy is sampled, not evaluated:
Pass in a state, get back one sample from the conditional distribution. Same state, different samples each time. If the distribution has two modes (gap 1 and gap 2), then sometimes the sample lands at gap 1, sometimes at gap 2.
For the Flappy Bird problem, this is exactly what you want. Each individual rollout commits to a coherent path (sample gap 1, target gap 1 for all 20 future steps). Different rollouts may sample different gaps. None of them takes the middle.
For the Gaussian case, easy: a network outputs mean and standard deviation. Sample from the resulting Gaussian. But that only gives unimodal distributions.
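A sketch of that Gaussian case — GaussianHead is a hypothetical illustration, not part of the starter code:

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Hypothetical unimodal generative policy: outputs a mean and log-std per action dim."""
    def __init__(self, state_dim=4, action_dim=20, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def sample(self, s):
        h = self.trunk(s)
        std = self.log_std(h).exp()
        # Reparameterized sample: mean + std * standard normal noise
        return self.mean(h) + std * torch.randn_like(std)

head = GaussianHead()
a = head.sample(torch.randn(3, 4))  # shape (3, 20); different on every call
```

The sampling is easy, but the distribution is a single Gaussian bump per state — it cannot put mass on gap 1 and gap 2 at the same time.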
For arbitrary distributions, harder. There are several common approaches:
| Approach | Idea | Examples |
|---|---|---|
| Mixture of Gaussians | Output K means, K stds, K mixture weights | Old school; requires choosing K |
| Categorical (discretization) | Bin the action space, output a probability per bin | Robotic Transformer 1, Decision Transformer |
| Normalizing flow | Invertible neural net + Jacobian trick | Real NVP; rarely used in robotics |
| Diffusion | Learn to denoise; run reverse diffusion at sampling | Diffusion Policy (Chi et al.) |
| Flow matching | Learn a vector field; integrate ODE at sampling | Recent SOTA; this homework |
Diffusion and flow matching are the dominant choices in modern robot learning. They share the same overall structure (noise → data via iterative refinement) but differ in implementation details.
Flow matching is mathematically simpler. The training objective is just MSE on velocity prediction. Sampling is straight-line Euler integration. Diffusion has more moving parts (noise schedules, beta schedules, score matching, DDPM/DDIM samplers). Both methods give comparable performance in practice; flow matching is easier to implement.
Flow matching's central insight: generate samples from a complex distribution by continuously transporting samples from a simple distribution along a learned vector field.
Two distributions of interest: the source, Gaussian noise N(0, I), and the target, the expert action distribution p(a | s).
We want a way to take a sample from the source and transform it into a sample from the target.
Imagine the action space (a 20-D box, but think of it as a 2-D plane for visualization). At each point in this space, draw an arrow pointing toward where samples should "flow" if we want to move from noise to data.
If we sample x0 ~ N(0, I) at time τ = 0 and follow the velocity arrows from τ = 0 to τ = 1, we should end up at x1 ~ p(a | s). Mathematically:

dx/dτ = vθ(x, s, τ),  x(0) = x0 ~ N(0, I)
This is an ordinary differential equation. The velocity vθ(x, s, τ) is a neural network parameterized by θ. Different network outputs give different vector fields, which produce different transport behaviors.
We have data — samples from p(a | s) — and noise — samples from N(0, I). We want to train vθ so that integrating from noise to data along the field gives the right answer.
The key trick: we don't need to integrate during training. We can derive the target velocity at any (x, s, τ) directly. That's what makes flow matching simple to train.
The next chapter shows how.
The flow matching training objective is breathtakingly simple. Three steps.
Take a1 from the expert dataset (a clean action chunk, shape [20]). Sample a0 ~ N(0, I) independently of a1 (also shape [20]).
Sample τ uniformly from [0, 1]. Compute the linear interpolation:

aτ = τ · a1 + (1 − τ) · a0
At τ = 0, this is pure noise a0. At τ = 1, this is the clean action a1. At τ = 0.5, it's a 50/50 mix. As τ varies, aτ traces a straight line from noise to data.
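A quick sanity check of the endpoints, using toy tensors rather than the homework data:

```python
import torch

torch.manual_seed(0)
a1 = torch.rand(4, 20)   # stand-in for clean expert action chunks
a0 = torch.randn(4, 20)  # Gaussian noise, same shape

def interp(tau):
    # Straight-line interpolation between noise (tau=0) and data (tau=1)
    return tau * a1 + (1 - tau) * a0

assert torch.allclose(interp(0.0), a0)              # pure noise
assert torch.allclose(interp(1.0), a1)              # pure data
assert torch.allclose(interp(0.5), 0.5 * (a0 + a1)) # midpoint of the line
```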
Differentiate aτ with respect to τ:

daτ/dτ = a1 − a0
This is the velocity that, if the network produced it everywhere along the line from a0 to a1, would correctly transport a0 to a1 over time τ ∈ [0, 1].
The training loss: regress the network's output vθ(aτ, s, τ) toward this target velocity:

L(θ) = E[ ‖vθ(aτ, s, τ) − (a1 − a0)‖² ]
You might worry: during training the network only ever sees specific (a0, a1) pairs, but at inference time we start from a fresh noise sample it has never seen. How does it know what to do for arbitrary noise inputs?
Answer: averaged across many noise samples and many data samples, the network learns the conditional expectation of the velocity at each (aτ, s, τ) point.
Across all (a0, a1) pairs that pass through point x at time τ, the average target velocity is the value the network learns. This expected velocity, integrated over time, transforms the source distribution N(0, I) into the target distribution p(a | s). That's the theorem of Conditional Optimal Transport Flow Matching (Lipman et al., 2023).
Pick any (x, s, τ): the optimal velocity field at that point is the expected difference (data − noise) over all data-noise pairs that interpolate through x at time τ. The MSE training objective converges to this expected velocity automatically.
The sampling and interpolation steps happen in your FlowMatchingSchedule.interpolate. The forward pass and MSE happen in your flow_matching_loss.
Training is done. Now we have a learned velocity field vθ(x, s, τ). To generate an action chunk for state s, we integrate the ODE from τ = 0 to τ = 1.
We want x(1), starting from x(0). There's no closed form — we have to integrate numerically.
The simplest numerical ODE solver. Discretize τ into n steps of size h = 1/n. At each step, take a small move in the direction of the velocity:

x_{k+1} = x_k + h · vθ(x_k, s, τ_k),  where τ_k = k · h
"From the current position, take a step of size h in the direction of the velocity at the current position and time." Repeat n times to get from τ = 0 to τ = 1.
For this homework, num_steps = 20, so h = 1/20 = 0.05.
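Euler's first-order behavior is easy to see on a test equation with a known solution (dx/dτ = −x here, not the learned field):

```python
import math

def euler(x0, n):
    """Integrate dx/dtau = -x from tau = 0 to tau = 1 with n Euler steps."""
    x, h = x0, 1.0 / n
    for _ in range(n):
        x = x + h * (-x)  # Euler update: step h in the current velocity direction
    return x

exact = math.exp(-1.0)  # true x(1) for x0 = 1
err_20 = abs(euler(1.0, 20) - exact)
err_100 = abs(euler(1.0, 100) - exact)
assert err_100 < err_20  # more steps, smaller error (first-order convergence)
```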
The trained velocity field, at any point along an interpolation path, points toward the data distribution. Following the field's arrows traces out a curve that, over time, transports the noise sample toward a data sample.
For a multimodal target like the gap-1-or-gap-2 distribution, samples from N(0, I) are partitioned by the field: those near the "gap 1 attractor" flow toward gap 1; those near the "gap 2 attractor" flow toward gap 2. The exact path depends on the initial noise sample, so different runs of the sampler give different modes — that's how flow matching captures multimodality.
Actions in this homework are normalized to [0, 1]. Even with sigmoid-style architectures, the integrated x1 can drift slightly outside this range. A final clamp ensures actions are in-bounds for the env. (Compare to BCPolicy in Problem 1, where Sigmoid was inside the network. Here the network outputs raw velocities, so the clamp happens after integration.)
| num_steps | Quality | Speed |
|---|---|---|
| 1 | Single-step Euler — equivalent to plain MSE | 1 forward pass |
| 20 (default) | Sweet spot for most tasks | 20 forward passes |
| 100 | Marginal quality gain | 5× slower than 20 |
Flow matching with conditional optimal transport (the variant in this homework) tends to need fewer integration steps than diffusion, because the paths between noise and data are straight lines rather than curved. That's one of the practical advantages of FM over diffusion.
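The limiting case behind that claim: along a perfectly straight path the velocity is constant, and Euler integration of a constant velocity is exact for any number of steps. A minimal check:

```python
def euler_const(x0, v, n):
    """n Euler steps of size 1/n under a constant velocity v."""
    x, h = x0, 1.0 / n
    for _ in range(n):
        x = x + h * v
    return x

# Regardless of step count, we land (up to float error) exactly at x0 + v:
for n in (1, 5, 20, 100):
    assert abs(euler_const(0.0, 0.7, n) - 0.7) < 1e-9
```

In practice the learned marginal field is only approximately straight (it bends where interpolation paths from different pairs overlap), which is why the homework still uses 20 steps rather than 1.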
Flow matching and diffusion are the two dominant generative paradigms in modern robot learning (Diffusion Policy, π0, …). They share the same core idea but differ in implementation.
| Aspect | Diffusion (DDPM) | Flow Matching |
|---|---|---|
| Forward (training) | Add Gaussian noise per time step (Markov chain) | Linear interpolation between data and noise |
| Reverse (sampling) | Iteratively denoise via learned score | Integrate learned ODE |
| What the network predicts | Noise ε (or score ∇ log p) | Velocity v = a1 − a0 |
| Path between noise and data | Curved (Brownian) | Straight (Conditional OT) |
| Sampling steps | 50-1000 typical | 5-50 typical |
| Loss | MSE on noise prediction | MSE on velocity prediction |
| Hyperparameters | Beta schedule, variance schedule, loss weighting | Just num_steps |
In diffusion, you add small Gaussian increments many times. The cumulative trajectory from data to noise is a random walk, which is curved. To generate data, you reverse this curved process — needing many small steps to follow the curve accurately.
In flow matching, you define the path as a straight line from data to noise. The velocity along this straight line is constant (it's just a1 − a0). The network only has to learn average velocities along straight paths — an easier function class. Sampling traces the straight paths back, which can be done with fewer Euler steps.
Diffusion is curved. Flow matching is straight. Straight is easier to learn, faster to sample, and almost always good enough for robot control.
Mathematically, no — they're both ways of learning a transport map between two distributions, and there are formal connections between them. Practically, yes — the implementation details differ. Flow matching's simplicity is what makes it the default choice for new code in 2024+ robot learning papers.
For Problem 1, the policy was a 3-layer MLP — you wrote it. For Problem 2, the velocity network is much fancier: a 1D temporal U-Net with conditional residual blocks. The starter code provides this entire architecture; you don't write any of it. But understanding what it does makes the rest of the homework click.
It implements vθ(noisy_action, state, timestep):
| Input | Shape | Meaning |
|---|---|---|
| noisy_action | [B, 20] | Current point aτ on the interpolation path |
| state | [B, 4] | Conditioning observation (which Flappy Bird state) |
| timestep | [B] | Current τ in [0, 1] |

| Output | Shape | Meaning |
|---|---|---|
| velocity | [B, 20] | Predicted velocity at this (aτ, s, τ) point |
For the BC regressor in P1, an MLP was fine. For flow matching, the function we're learning is much richer: it depends on three inputs (noisy action, state, time) and has to handle the entire dynamics of pushing noise toward data. An MLP can do this in principle, but a 1D U-Net is the standard architecture for diffusion/FM over action sequences: its temporal convolutions capture correlations across the 20-step chunk, and its conditional residual blocks inject the state and timestep at every scale.
For the purposes of Problem 2, you can treat the model as a function:
```python
model = policy.model        # the U-Net
schedule = policy.schedule  # the FlowMatchingSchedule (you implement)

# During training: predict velocity at one (a_tau, s, tau) point
v_pred = model(a_tau, state, tau)  # shape [B, 20]

# During inference: integrate the ODE
sampled_action = schedule.sample(model, state)
```
The FlowMatchingPolicy wrapper class bundles both together so you can pass policy around as a single object.
Unlike the BCPolicy in P1, the U-Net does not have a sigmoid output. The U-Net outputs raw velocities, which can be any real number. The actions only need to be in [0, 1] at the end of integration. The FlowMatchingSchedule.sample method clamps the integrated x1 to [0, 1] before returning it.
Same files as Problem 1. Different blanks.
Provided by the starter code (no blanks):

- SinusoidalPosEmb, Conv1dBlock, ConditionalResidualBlock1D, ConditionalUnet1D — the U-Net building blocks.
- TemporalNoisePredictor — wraps the U-Net to match the (noisy_action, state, timestep) signature.
- FlowMatchingPolicy — bundles the model and schedule together.

Your blanks:

| Function | Where | What it does |
|---|---|---|
| FlowMatchingSchedule.interpolate | networks.py:258 | Build noisy training input + target velocity |
| FlowMatchingSchedule.sample | networks.py:274 | Euler integration from noise to data |
| flow_matching_loss | losses.py:36 | Run interpolate + forward + MSE |
Same generic loop as Problem 1, just with different policy and loss:
```python
policy = FlowMatchingPolicy(...)
optimizer = optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    for s_batch, a_batch in loader:
        loss = flow_matching_loss(policy, s_batch, a_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
The framework is identical to P1. Your flow_matching_loss swaps in for mse_loss; everything else stays the same.
From visualization.py, paraphrased:
```python
def act(state):
    # add a batch dim: sample expects states of shape (B, state_dim)
    state_tensor = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    actions = policy.schedule.sample(policy.model, state_tensor)
    return actions[0]  # single action chunk from the batch of one
```
Your schedule.sample drives the policy at inference time.
Per-line annotations. This is the centerpiece chapter.
Where: networks.py:258-271.
The math: x_t = t · x1 + (1 − t) · eps, with target velocity = x1 − eps.
What's in scope:
- x1: clean action data, shape (B, action_dim).
- t: timesteps in [0, 1], shape (B,).

The code:
```python
def interpolate(self, x1, t):
    eps = torch.randn_like(x1)   # shape (B, action_dim)
    t_expanded = t.view(-1, 1)   # shape (B, 1) for broadcasting
    x_t = t_expanded * x1 + (1.0 - t_expanded) * eps
    velocity = x1 - eps
    return x_t, velocity
```
Sample standard normal noise with the same shape as x1. torch.randn_like(x) is shorthand for torch.randn(x.shape, dtype=x.dtype, device=x.device) — matches not just shape but dtype and device too.
Why match shape exactly: the noise has to be the same dimensionality as the data so the linear interpolation makes sense. Each of the 20 action dimensions gets its own independent N(0, 1) sample.
Why standard normal (not other distributions): the math of flow matching assumes the source distribution is N(0, I). The "(0, 1)" gives mean 0, variance 1 per dimension, with all dimensions independent. This is the conventional choice for flow matching and diffusion alike.
Reshape t from (B,) to (B, 1). Why? Because we're about to multiply by x1, which has shape (B, action_dim). Broadcasting (B,) * (B, action_dim) in PyTorch will fail; broadcasting (B, 1) * (B, action_dim) works correctly — the 1 dimension expands to match action_dim.
view(-1, 1) means "make this 2D with the second dim being 1, infer the first dim from total size." For input of length B, this gives shape (B, 1).
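The broadcasting behavior in question, checked directly:

```python
import torch

t = torch.rand(8)       # shape (8,)
x1 = torch.rand(8, 20)  # shape (8, 20)

# (8,) * (8, 20): trailing dims 8 vs 20 don't match -> broadcasting error
raised = False
try:
    _ = t * x1
except RuntimeError:
    raised = True
assert raised

# (8, 1) * (8, 20): the size-1 dim expands across the 20 action dims
good = t.view(-1, 1) * x1
assert good.shape == (8, 20)
```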
The linear interpolation. At t = 0: x_t = 0 * x1 + 1 * eps = eps (pure noise). At t = 1: x_t = 1 * x1 + 0 * eps = x1 (pure data). Shape: (B, action_dim).
Each sample in the batch has its own t_expanded[i], so each row of x_t is at a different point along its noise-to-data path. This randomization across the batch is what gives the network exposure to all timesteps in [0, 1] during training.
The target velocity. x1 - eps is the displacement from noise to data. Constant in τ (doesn't depend on the timestep) — the straight-line path has constant velocity. Shape: (B, action_dim).
Note this is just a difference of tensors. No autograd magic needed — x1 comes from data (no_grad implicitly), eps from randn_like (no_grad), so the result is just a fixed target.
Tuple of two tensors. The caller (your flow_matching_loss) will use x_t as the input to the velocity network and velocity as the regression target.
You'll see code like return tau * x1 + (1 - tau) * eps, x1 - eps sometimes. Same thing — Python returns a tuple. Naming the intermediates is just clearer for debugging.
Where: networks.py:274-287.
The math: starting from x ~ N(0, I), repeat x ← x + h · vθ(x, s, τ) for τ = 0, h, 2h, …, 1 − h, with step size h = 1/num_steps.
What's in scope:
- model: the velocity network. Callable as model(x, state, t).
- state: conditioning, shape (B, state_dim).
- self.action_dim: action chunk length (20).
- self.num_steps: number of Euler steps (20).
- self.device: torch device string.

The code:
```python
@torch.no_grad()
def sample(self, model, state):
    B = state.shape[0]
    x = torch.randn(B, self.action_dim, device=self.device)
    dt = 1.0 / self.num_steps
    for i in range(self.num_steps):
        t = torch.full((B,), i * dt, device=self.device)
        v = model(x, state, t)
        x = x + dt * v
    return x.clamp(0.0, 1.0)
```
This is a decorator. It wraps the entire function so PyTorch doesn't track gradients during sampling. We're at inference time; we don't need gradients. Saves memory and time.
The decorator is already in the starter code (line 273 of networks.py) — you don't add it. But know it's there. If you need a fresh sample mid-training (which we don't here), you'd want this for efficiency.
Batch size. Convention: states have shape (B, state_dim), so the first dim is the batch.
Initialize the integration with pure Gaussian noise. Shape (B, action_dim) = (B, 20). Same shape as the actions we want to generate.
Why self.device matters: the model is on GPU. If you create the noise on CPU and pass to the model, you'll get a runtime error. Always create tensors on the same device as the model.
Step size. With num_steps = 20, this is 0.05. Each Euler step advances τ by 0.05, so 20 steps span τ = 0 to τ = 1.
The integration loop. Iterates from i = 0 to i = 19 (inclusive). At each iteration, we're at τ = i · dt.
Build the timestep tensor for this iteration. torch.full(shape, value) creates a tensor of given shape, filled with the scalar value. Here we want a tensor of shape (B,) with every element equal to i * dt (the current τ).
Why (B,) not just a scalar? Because the model expects a batched timestep — each sample in the batch could in principle be at a different time, even though here they're all at the same τ.
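What torch.full produces at one iteration of the loop, in isolation:

```python
import torch

B, dt = 4, 1.0 / 20
i = 3  # current Euler iteration
t = torch.full((B,), i * dt)  # shape (4,), every element equal to 0.15
assert t.shape == (4,)
assert torch.allclose(t, torch.full((4,), 0.15))
```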
Forward pass through the velocity network. Inputs: current point x shape (B, 20), state state shape (B, 4), timestep t shape (B,). Output: predicted velocity, shape (B, 20).
Euler step. Move x by dt · v. Same shape in, same shape out. Note this overwrites x — we don't need to keep the intermediate x's. Each iteration just advances the current point.
Why we use the same x for all steps: because we're integrating an ODE; the next state depends only on the current state and the current velocity. We don't need to remember the history.
Clamp to action range. After 20 Euler steps, x is approximately at the data distribution but may have drifted slightly outside [0, 1] due to integration error and the network's imperfections. Clamping is a safety net.
x.clamp(0, 1) returns a new tensor where any element above 1 becomes 1 and any below 0 becomes 0. Other elements unchanged.
1. Forgetting device when creating noise or timestep tensors. Causes "Expected GPU but got CPU" errors.
2. Off-by-one in the loop: using range(self.num_steps + 1) would do 21 steps and overshoot. range(self.num_steps) is correct.
3. Wrong timestep: t = torch.full((B,), (i + 1) * dt, ...) would query the model at time τ + dt instead of τ. Subtle bug; performance often still OK but mathematically wrong.
4. Forgetting the clamp: actions outside [0, 1] fall outside the env's expected range. Usually still works, but unprincipled.
Where: losses.py:36-53.
The math: L(θ) = E over (s, a1) ~ data, a0 ~ N(0, I), τ ~ U(0, 1) of ‖vθ(aτ, s, τ) − (a1 − a0)‖².
What's in scope:
- policy: a FlowMatchingPolicy with policy.model (network) and policy.schedule (FlowMatchingSchedule).
- s_batch: states, shape (B, state_dim).
- a_batch: clean expert actions, shape (B, action_dim).

The code:
```python
def flow_matching_loss(policy, s_batch, a_batch):
    B = a_batch.shape[0]
    t = torch.rand(B, device=a_batch.device)
    x_t, target_velocity = policy.schedule.interpolate(a_batch, t)
    pred_velocity = policy.model(x_t, s_batch, t)
    return nn.functional.mse_loss(pred_velocity, target_velocity)
```
Batch size. We need this to know how many independent τ samples to draw.
Sample B timesteps, each uniform in [0, 1). Note: torch.rand samples from U(0, 1) (uniform); torch.randn samples from N(0, 1) (normal). Different functions for different distributions.
Each sample in the batch gets a different τ. Across many minibatches, the network sees all values of τ in [0, 1) and learns to predict velocity at every timestep.
Why per-sample timesteps: it's much more efficient than running 20 separate forward passes per sample (one for each τ on a fixed grid). Random τ per sample gives the same coverage with fewer forward passes.
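The two samplers used here, checked empirically:

```python
import torch

torch.manual_seed(0)
u = torch.rand(100_000)   # U(0, 1): bounded, used for timesteps
g = torch.randn(100_000)  # N(0, 1): unbounded, used for noise

assert u.min() >= 0.0 and u.max() < 1.0   # uniform stays in [0, 1)
assert abs(g.mean().item()) < 0.02        # normal: mean ~0
assert abs(g.std().item() - 1.0) < 0.02   # normal: std ~1
```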
Calls your interpolate function from Change 1. Returns the noisy interpolation point and the target velocity. Shapes: (B, 20) for both.
Forward pass through the U-Net velocity model. Inputs: x_t shape (B, 20), s_batch shape (B, 4), t shape (B,). Output: predicted velocity shape (B, 20).
Note the U-Net is wrapped: policy.model is the TemporalNoisePredictor, which internally reshapes inputs and calls ConditionalUnet1D. We don't worry about the reshaping — it's all handled.
Standard MSE: average squared error across all elements (batch × action dim). Returns a scalar.
This is the same MSE function from Problem 1, applied to a different prediction-target pair. There the prediction was an action and the target was an expert action; here the prediction is a velocity and the target is a velocity.
Flow matching is "MSE on the right thing." Plain BC does MSE on actions, which (we showed) collapses to the conditional mean. Flow matching does MSE on velocities at random interpolation points, which (theorem from FM literature) converges to a vector field that transports noise into the data distribution — preserving multimodality. Same loss class, fundamentally different inductive bias.
Three blanks, totaling ~15 lines of code. The training loop calls flow_matching_loss per minibatch. Inside, your loss function calls schedule.interpolate. At inference, the policy wrapper calls schedule.sample. Your three pieces interlock to form the full algorithm.
python main.py --method bc_flow --env hard
This run:

- builds a FlowMatchingPolicy (the U-Net + your schedule),
- trains it with flow_matching_loss,
- rolls out using schedule.sample for action selection,
- writes results to bc_flow_hard.txt.

| Method | Mean episode length on hard mode | Why |
|---|---|---|
| BC regression (P1) | ~200-500 | MSE averages multimodal expert → wall |
| Flow matching (P2) | ~700-1000 | Generative model preserves modes → consistent gap selection |
You should see a substantial improvement over Problem 1 on hard mode. If episode length stays low, something is off — check the gotchas in Chapter 10.
Longer than P1's MLP. The U-Net has more parameters, and each training step does an interpolation + forward pass. Expect roughly 5-15 minutes on CPU, much faster on GPU.
| Metric | Healthy | Bug |
|---|---|---|
| Training loss | Starts ~1.0, decreases to ~0.05-0.2 | Stays at ~1.0 (model not learning velocities) |
| Final eval episode length | 700-1000 on hard mode | < 300 (probably bug in interpolate or sample) |
| Std of eval episode length | Moderate (some variance from sampling) | Very high (model is unstable) or 0 (collapsed to constant) |
Per the PDF:
"Flow matching trains a generative model that learns the full conditional distribution p(a | s) rather than just its conditional mean. On hard mode where the expert is bimodal (gap 1 or gap 2), the learned vector field transports samples from Gaussian noise to either mode based on the initial noise sample, so each rollout commits to one gap rather than aiming at the wall in between. This recovers near-expert performance because the policy now samples valid actions, not their average."
| Call | Returns |
|---|---|
torch.randn(*shape, device=) | Tensor of N(0, 1) samples, given shape |
torch.randn_like(x) | Same shape/dtype/device as x, N(0, 1) samples |
torch.rand(*shape) | Uniform [0, 1) samples |
torch.full(shape, value) | Tensor of given shape, all elements = value |
x.view(-1, 1) | Reshape to 2D with second dim = 1 |
x.clamp(lo, hi) | Elementwise min(max(x, lo), hi) |
nn.functional.mse_loss(pred, target) | Mean squared error scalar |
policy.schedule.interpolate(x1, t) | (x_t, velocity), shape (B, action_dim) |
policy.schedule.sample(model, state) | Sampled actions, shape (B, action_dim), clamped |
policy.model(x, state, t) | Predicted velocity, shape (B, action_dim) |
- What's the difference between torch.randn and torch.rand?
- If you set num_steps = 1 in the sampler, what does flow matching reduce to?

1. A regressor outputs one answer per input. A generative model outputs a distribution from which you sample. For unimodal targets they're equivalent; for multimodal targets, only generative models preserve multiple modes.
2. Because the trained vector field maps different noise samples to different modes. At a state with two valid actions (gap 1, gap 2), starting from different noise samples sends you to different gaps. No averaging.
3. A vector field assigns a velocity (arrow) to every (x, s, τ). Following the arrows over time transports a starting point to an ending point. Useful here because it lets us continuously transform Gaussian noise into expert actions.
4. The network predicts the velocity (rate of change) at the current point along the noise-to-data path, conditioned on the state and the current time.
5. Because the training-time path is a straight line: aτ = τ a1 + (1 − τ) a0. Differentiating gives daτ/dτ = a1 − a0, independent of τ.
6. So the network learns velocity at every value of τ in [0, 1]. At inference, we'll integrate through all those timesteps. If we only trained at one τ, the network would only know what to do at that single time.
7. A first-order numerical approximation to the ODE dx/dτ = v. We assume velocity is roughly constant over a small interval 1/n, so the position changes by roughly v/n.
8. Because clamping mid-integration would distort the trajectory. The network was trained on smooth interpolation paths; clamping at intermediate steps would push us off-distribution from the training paths. Final clamping is just a safety net for the very small drift that may occur.
9. Both transport noise to data via iterative refinement guided by a learned network. Diffusion uses a curved (random-walk) path and predicts noise/score. Flow matching uses a straight path and predicts velocity. Mathematically related; FM is simpler.
10. Because FM paths are straight, integration error is smaller per step. Diffusion paths are curved, so you need more small steps to follow the curve accurately.
11. torch.randn samples from N(0, 1) (Gaussian). torch.rand samples from U(0, 1) (uniform). Different distributions; different uses (we use randn for noise, rand for timestep).
12. One Euler step from x0 in direction v(x0, s, 0). With straight-line paths and learned velocity equal to the conditional mean of (a1 − a0), this approximately equals the conditional mean of a1. So num_steps = 1 collapses flow matching back to MSE regression with extra steps. Multimodality is lost.
Total: ~6 minutes of typing. Run on hard mode, see episode length jump from ~300 (BC regression) to ~800+ (flow matching). Write the explanation.
Three big ideas, in order of importance: generative policies sample from p(a | s) instead of regressing its mean; flow matching trains that sampler with nothing but MSE on velocities along straight noise-to-data paths; sampling is Euler integration of the learned ODE.
If a friend asks: "Why does flow matching work where MSE regression failed?" — you say: "MSE regression learns the conditional mean of expert actions, which collapses bimodal data into the wrong middle ground. Flow matching learns a vector field that maps noise to data; different noise samples flow to different modes, so the policy can sample either mode rather than averaging them. Mathematically, both are MSE losses, but flow matching applies MSE to velocity predictions along learned paths, not to raw actions. The richer prediction target preserves multimodality."
You can teach this. On to Problem 3.