A vector field that pushes Gaussian noise into expert action distributions. Three lines of math, three blanks of code, and the tool that solves the multimodality problem you just witnessed in Problem 1.
Same Flappy Bird environment as Problem 1. Same 4-D observation, same 20-step action chunk, same hard mode with alternating single and double-gap pipes.
What's different: the algorithm. Instead of training an MLP to regress the expert's action via MSE (Problem 1's approach), you'll train a generative model that learns the entire distribution of expert actions at each state. At inference time, you sample from that distribution to get an action chunk.
The deliverables for Problem 2:

- FlowMatchingSchedule.interpolate — sample noise and build the noisy training input.
- FlowMatchingSchedule.sample — integrate the learned ODE from noise to a clean action.
- flow_matching_loss — MSE on velocity prediction.

By the end of this guide you'll know how these three pieces fit together.
Recall the climax of Problem 1: hard mode has multimodal experts. Two valid expert actions per state — gap 1 (top) or gap 2 (bottom). MSE regression converges to the conditional mean of expert actions, which is the midpoint between the gaps — the wall.
The fundamental issue was that MSE assumes a unimodal target distribution. We want a method that can represent multiple modes simultaneously: at this state, the expert sometimes does X, sometimes Y, and the policy should be able to sample one or the other rather than averaging.
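This collapse is easy to verify numerically. A minimal sketch, with hypothetical gap positions 0.2 and 0.8 standing in for gap 1 and gap 2:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical bimodal expert: half the demos aim at gap 1 (0.2),
# half at gap 2 (0.8) -- both valid actions for the same state.
targets = rng.choice([0.2, 0.8], size=10_000)

# The constant prediction minimizing MSE is the sample mean:
best_mse_prediction = targets.mean()
print(best_mse_prediction)  # ~0.5: the wall between the gaps, not a valid action
```

No matter how expressive the regressor, minimizing MSE against bimodal targets drives it toward this midpoint.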
Two orthogonal approaches:

1. Fix the data: curate demonstrations so each state has a single consistent expert action.
2. Fix the model: use a generative policy that can represent multiple modes.

Both work. Modern robot learning systems often use both: a generative policy (flow matching, diffusion) trained on the cleanest data possible.
A regressor outputs one answer per input. A generative model outputs a distribution per input from which you can sample. For unimodal targets, the two are equivalent. For multimodal targets, only the generative model preserves the modes.
Step back from flow matching for a moment. What does it mean to "model the conditional distribution p(a | s)" rather than just its mean?
The MSE-trained policy from Problem 1:
Pass in a state, get back one action. The same state always gives the same action.
A generative policy is sampled, not evaluated:
Pass in a state, get back one sample from the conditional distribution. Same state, different samples each time. If the distribution has two modes (gap 1 and gap 2), then sometimes the sample lands at gap 1, sometimes at gap 2.
For the Flappy Bird problem, this is exactly what you want. Each individual rollout commits to a coherent path (sample gap 1, target gap 1 for all 20 future steps). Different rollouts may sample different gaps. None of them takes the middle.
For the Gaussian case, easy: a network outputs mean and standard deviation. Sample from the resulting Gaussian. But that only gives unimodal distributions.
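A sketch of that Gaussian case — GaussianHead is a hypothetical illustration, not part of the starter code:

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Hypothetical unimodal generative policy: outputs a mean and log-std per action dim."""
    def __init__(self, state_dim=4, action_dim=20, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def sample(self, s):
        h = self.trunk(s)
        std = self.log_std(h).exp()
        # Reparameterized sample: mean + std * standard normal noise
        return self.mean(h) + std * torch.randn_like(std)

head = GaussianHead()
a = head.sample(torch.randn(3, 4))  # shape (3, 20); different on every call
```

The sampling is easy, but the distribution is a single Gaussian bump per state — it cannot put mass on gap 1 and gap 2 at the same time.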
For arbitrary distributions, harder. There are several common approaches:
| Approach | Idea | Examples |
|---|---|---|
| Mixture of Gaussians | Output K means, K stds, K mixture weights | Old school; requires choosing K |
| Categorical (discretization) | Bin the action space, output a probability per bin | Robotic Transformer 1, Decision Transformer |
| Normalizing flow | Invertible neural net + Jacobian trick | Real NVP; rarely used in robotics |
| Diffusion | Learn to denoise; run reverse diffusion at sampling | Diffusion Policy (Chi et al.) |
| Flow matching | Learn a vector field; integrate ODE at sampling | Recent SOTA; this homework |
Diffusion and flow matching are the dominant choices in modern robot learning. They share the same overall structure (noise → data via iterative refinement) but differ in implementation details.
Flow matching is mathematically simpler. The training objective is just MSE on velocity prediction. Sampling is straight-line Euler integration. Diffusion has more moving parts (noise schedules, beta schedules, score matching, DDPM/DDIM samplers). Both methods give comparable performance in practice; flow matching is easier to implement.
Flow matching's central insight: generate samples from a complex distribution by continuously transporting samples from a simple distribution along a learned vector field.
Two distributions of interest: the source, Gaussian noise N(0, I), and the target, the expert action distribution p(a | s).
We want a way to take a sample from the source and transform it into a sample from the target.
Imagine the action space (a 20-D box, but think of it as a 2-D plane for visualization). At each point in this space, draw an arrow pointing toward where samples should "flow" if we want to move from noise to data.
If we sample x0 ~ N(0, I) at time τ = 0 and follow the velocity arrows from τ = 0 to τ = 1, we should end up at x1 ~ p(a | s). Mathematically:

dx/dτ = vθ(x, s, τ),  x(0) = x0 ~ N(0, I)
This is an ordinary differential equation. The velocity vθ(x, s, τ) is a neural network parameterized by θ. Different network outputs give different vector fields, which produce different transport behaviors.
We have data — samples from p(a | s) — and noise — samples from N(0, I). We want to train vθ so that integrating from noise to data along the field gives the right answer.
The key trick: we don't need to integrate during training. We can derive the target velocity at any (x, s, τ) directly. That's what makes flow matching simple to train.
The next chapter shows how.
The flow matching training objective is breathtakingly simple. Three steps.
Take a1 from the expert dataset (a clean action chunk, shape [20]). Sample a0 ~ N(0, I) independently of a1 (also shape [20]).
Sample τ uniformly from [0, 1]. Compute the linear interpolation:

aτ = τ · a1 + (1 − τ) · a0
At τ = 0, this is pure noise a0. At τ = 1, this is the clean action a1. At τ = 0.5, it's a 50/50 mix. As τ varies, aτ traces a straight line from noise to data.
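A quick sanity check of the endpoints, using toy tensors rather than the homework data:

```python
import torch

torch.manual_seed(0)
a1 = torch.rand(4, 20)   # stand-in for clean expert action chunks
a0 = torch.randn(4, 20)  # Gaussian noise, same shape

def interp(tau):
    # Straight-line interpolation between noise (tau=0) and data (tau=1)
    return tau * a1 + (1 - tau) * a0

assert torch.allclose(interp(0.0), a0)              # pure noise
assert torch.allclose(interp(1.0), a1)              # pure data
assert torch.allclose(interp(0.5), 0.5 * (a0 + a1)) # midpoint of the line
```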
Differentiate aτ with respect to τ:

daτ/dτ = a1 − a0
This is the velocity that, if the network produced it everywhere along the line from a0 to a1, would correctly transport a0 to a1 over time τ ∈ [0, 1].
The training loss: regress the network's output vθ(aτ, s, τ) toward this target velocity:

L(θ) = E[ ‖vθ(aτ, s, τ) − (a1 − a0)‖² ]
You might worry: during training the network only ever sees specific (a0, a1) pairs, but at inference time we start from a fresh noise sample it has never seen. How does it know what to do for arbitrary noise inputs?
Answer: averaged across many noise samples and many data samples, the network learns the conditional expectation of the velocity at each (aτ, s, τ) point.
Across all (a0, a1) pairs that pass through point x at time τ, the average target velocity is the value the network learns. This expected velocity, integrated over time, transforms the source distribution N(0, I) into the target distribution p(a | s). That's the theorem of Conditional Optimal Transport Flow Matching (Lipman et al., 2023).
Pick any (x, s, τ): the optimal velocity field at that point is the expected difference (data − noise) over all data-noise pairs that interpolate through x at time τ. The MSE training objective converges to this expected velocity automatically.
The sampling and interpolation steps happen in your FlowMatchingSchedule.interpolate. The forward pass and MSE happen in your flow_matching_loss.
Training is done. Now we have a learned velocity field vθ(x, s, τ). To generate an action chunk for state s, we integrate the ODE from τ = 0 to τ = 1.
We want x(1), starting from x(0). There's no closed form — we have to integrate numerically.
The simplest numerical ODE solver. Discretize τ into n steps of size h = 1/n. At each step, take a small move in the direction of the velocity:

x_{k+1} = x_k + h · vθ(x_k, s, τ_k),  where τ_k = k · h
"From the current position, take a step of size h in the direction of the velocity at the current position and time." Repeat n times to get from τ = 0 to τ = 1.
For this homework, num_steps = 20, so h = 1/20 = 0.05.
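Euler's first-order behavior is easy to see on a test equation with a known solution (dx/dτ = −x here, not the learned field):

```python
import math

def euler(x0, n):
    """Integrate dx/dtau = -x from tau = 0 to tau = 1 with n Euler steps."""
    x, h = x0, 1.0 / n
    for _ in range(n):
        x = x + h * (-x)  # Euler update: step h in the current velocity direction
    return x

exact = math.exp(-1.0)  # true x(1) for x0 = 1
err_20 = abs(euler(1.0, 20) - exact)
err_100 = abs(euler(1.0, 100) - exact)
assert err_100 < err_20  # more steps, smaller error (first-order convergence)
```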
The trained velocity field, at any point along an interpolation path, points toward the data distribution. Following the field's arrows traces out a curve that, over time, transports the noise sample toward a data sample.
For a multimodal target like the gap-1-or-gap-2 distribution, samples from N(0, I) are partitioned by the field: those near the "gap 1 attractor" flow toward gap 1; those near the "gap 2 attractor" flow toward gap 2. The exact path depends on the initial noise sample, so different runs of the sampler give different modes — that's how flow matching captures multimodality.
Actions in this homework are normalized to [0, 1]. Even with sigmoid-style architectures, the integrated x1 can drift slightly outside this range. A final clamp ensures actions are in-bounds for the env. (Compare to BCPolicy in Problem 1, where Sigmoid was inside the network. Here the network outputs raw velocities, so the clamp happens after integration.)
| num_steps | Quality | Speed |
|---|---|---|
| 1 | Single-step Euler — equivalent to plain MSE | 1 forward pass |
| 20 (default) | Sweet spot for most tasks | 20 forward passes |
| 100 | Marginal quality gain | 5× slower than 20 |
Flow matching with conditional optimal transport (the variant in this homework) tends to need fewer integration steps than diffusion, because the paths between noise and data are straight lines rather than curved. That's one of the practical advantages of FM over diffusion.
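The limiting case behind that claim: along a perfectly straight path the velocity is constant, and Euler integration of a constant velocity is exact for any number of steps. A minimal check:

```python
def euler_const(x0, v, n):
    """n Euler steps of size 1/n under a constant velocity v."""
    x, h = x0, 1.0 / n
    for _ in range(n):
        x = x + h * v
    return x

# Regardless of step count, we land (up to float error) exactly at x0 + v:
for n in (1, 5, 20, 100):
    assert abs(euler_const(0.0, 0.7, n) - 0.7) < 1e-9
```

In practice the learned marginal field is only approximately straight (it bends where interpolation paths from different pairs overlap), which is why the homework still uses 20 steps rather than 1.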
Flow matching and diffusion are the two dominant generative paradigms in modern robot learning (Diffusion Policy, π0, …). They share the same core idea but differ in implementation.
| Aspect | Diffusion (DDPM) | Flow Matching |
|---|---|---|
| Forward (training) | Add Gaussian noise per time step (Markov chain) | Linear interpolation between data and noise |
| Reverse (sampling) | Iteratively denoise via learned score | Integrate learned ODE |
| What the network predicts | Noise ε (or score ∇ log p) | Velocity v = a1 − a0 |
| Path between noise and data | Curved (Brownian) | Straight (Conditional OT) |
| Sampling steps | 50-1000 typical | 5-50 typical |
| Loss | MSE on noise prediction | MSE on velocity prediction |
| Hyperparameters | Beta schedule, variance schedule, loss weighting | Just num_steps |
In diffusion, you add small Gaussian increments many times. The cumulative trajectory from data to noise is a random walk, which is curved. To generate data, you reverse this curved process — needing many small steps to follow the curve accurately.
In flow matching, you define the path as a straight line from data to noise. The velocity along this straight line is constant (it's just a1 − a0). The network only has to learn average velocities along straight paths — an easier function class. Sampling traces the straight paths back, which can be done with fewer Euler steps.
Diffusion is curved. Flow matching is straight. Straight is easier to learn, faster to sample, and almost always good enough for robot control.
Mathematically, no — they're both ways of learning a transport map between two distributions, and there are formal connections between them. Practically, yes — the implementation details differ. Flow matching's simplicity is what makes it the default choice for new code in 2024+ robot learning papers.
For Problem 1, the policy was a 3-layer MLP — you wrote it. For Problem 2, the velocity network is much fancier: a 1D temporal U-Net with conditional residual blocks. The starter code provides this entire architecture; you don't write any of it. But understanding what it does makes the rest of the homework click.
It implements vθ(noisy_action, state, timestep):
| Input | Shape | Meaning |
|---|---|---|
| noisy_action | [B, 20] | Current point aτ on the interpolation path |
| state | [B, 4] | Conditioning observation (which Flappy Bird state) |
| timestep | [B] | Current τ in [0, 1] |

| Output | Shape | Meaning |
|---|---|---|
| velocity | [B, 20] | Predicted velocity at this (aτ, s, τ) point |
For the BC regressor in P1, an MLP was fine. For flow matching, the function we're learning is much richer: it depends on three inputs (noisy action, state, time) and has to handle the entire dynamics of pushing noise toward data. An MLP can do this in principle, but a 1D U-Net is the standard architecture for diffusion/FM over action sequences: its temporal convolutions capture correlations across the 20-step chunk, and its conditional residual blocks inject the state and timestep at every scale.
For the purposes of Problem 2, you can treat the model as a function:
```python
model = policy.model        # the U-Net
schedule = policy.schedule  # the FlowMatchingSchedule (you implement)

# During training: predict velocity at one (a_tau, s, tau) point
v_pred = model(a_tau, state, tau)  # shape [B, 20]

# During inference: integrate the ODE
sampled_action = schedule.sample(model, state)
```
The FlowMatchingPolicy wrapper class bundles both together so you can pass policy around as a single object.
Unlike the BCPolicy in P1, the U-Net does not have a sigmoid output. The U-Net outputs raw velocities, which can be any real number. The actions only need to be in [0, 1] at the end of integration. The FlowMatchingSchedule.sample method clamps the integrated x1 to [0, 1] before returning it.
Same files as Problem 1. Different blanks.
Provided by the starter code (no blanks):

- SinusoidalPosEmb, Conv1dBlock, ConditionalResidualBlock1D, ConditionalUnet1D — the U-Net building blocks.
- TemporalNoisePredictor — wraps the U-Net to match the (noisy_action, state, timestep) signature.
- FlowMatchingPolicy — bundles the model and schedule together.

Your blanks:

| Function | Where | What it does |
|---|---|---|
| FlowMatchingSchedule.interpolate | networks.py:258 | Build noisy training input + target velocity |
| FlowMatchingSchedule.sample | networks.py:274 | Euler integration from noise to data |
| flow_matching_loss | losses.py:36 | Run interpolate + forward + MSE |
Same generic loop as Problem 1, just with different policy and loss:
```python
policy = FlowMatchingPolicy(...)
optimizer = optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(num_epochs):
    for s_batch, a_batch in loader:
        loss = flow_matching_loss(policy, s_batch, a_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
The framework is identical to P1. Your flow_matching_loss swaps in for mse_loss; everything else stays the same.
From visualization.py, paraphrased:
```python
def act(state):
    # add a batch dim: sample expects states of shape (B, state_dim)
    state_tensor = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    actions = policy.schedule.sample(policy.model, state_tensor)
    return actions[0]  # single action chunk from the batch of one
```
Your schedule.sample drives the policy at inference time.
Per-line annotations. This is the centerpiece chapter.
Where: networks.py:258-271.
The math: x_t = t · x1 + (1 − t) · eps, with target velocity = x1 − eps.
What's in scope:
- x1: clean action data, shape (B, action_dim).
- t: timesteps in [0, 1], shape (B,).

The code:
```python
def interpolate(self, x1, t):
    eps = torch.randn_like(x1)   # shape (B, action_dim)
    t_expanded = t.view(-1, 1)   # shape (B, 1) for broadcasting
    x_t = t_expanded * x1 + (1.0 - t_expanded) * eps
    velocity = x1 - eps
    return x_t, velocity
```
Sample standard normal noise with the same shape as x1. torch.randn_like(x) is shorthand for torch.randn(x.shape, dtype=x.dtype, device=x.device) — matches not just shape but dtype and device too.
Why match shape exactly: the noise has to be the same dimensionality as the data so the linear interpolation makes sense. Each of the 20 action dimensions gets its own independent N(0, 1) sample.
Why standard normal (not other distributions): the math of flow matching assumes the source distribution is N(0, I). The "(0, 1)" gives mean 0, variance 1 per dimension, with all dimensions independent. This is the conventional choice for flow matching and diffusion alike.
Reshape t from (B,) to (B, 1). Why? Because we're about to multiply by x1, which has shape (B, action_dim). Broadcasting (B,) * (B, action_dim) in PyTorch will fail; broadcasting (B, 1) * (B, action_dim) works correctly — the 1 dimension expands to match action_dim.
view(-1, 1) means "make this 2D with the second dim being 1, infer the first dim from total size." For input of length B, this gives shape (B, 1).
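The broadcasting behavior in question, checked directly:

```python
import torch

t = torch.rand(8)       # shape (8,)
x1 = torch.rand(8, 20)  # shape (8, 20)

# (8,) * (8, 20): trailing dims 8 vs 20 don't match -> broadcasting error
raised = False
try:
    _ = t * x1
except RuntimeError:
    raised = True
assert raised

# (8, 1) * (8, 20): the size-1 dim expands across the 20 action dims
good = t.view(-1, 1) * x1
assert good.shape == (8, 20)
```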
The linear interpolation. At t = 0: x_t = 0 * x1 + 1 * eps = eps (pure noise). At t = 1: x_t = 1 * x1 + 0 * eps = x1 (pure data). Shape: (B, action_dim).
Each sample in the batch has its own t_expanded[i], so each row of x_t is at a different point along its noise-to-data path. This randomization across the batch is what gives the network exposure to all timesteps in [0, 1] during training.
The target velocity. x1 - eps is the displacement from noise to data. Constant in τ (doesn't depend on the timestep) — the straight-line path has constant velocity. Shape: (B, action_dim).
Note this is just a difference of tensors. No autograd magic needed — x1 comes from data (no_grad implicitly), eps from randn_like (no_grad), so the result is just a fixed target.
Tuple of two tensors. The caller (your flow_matching_loss) will use x_t as the input to the velocity network and velocity as the regression target.
You'll see code like return tau * x1 + (1 - tau) * eps, x1 - eps sometimes. Same thing — Python returns a tuple. Naming the intermediates is just clearer for debugging.
Where: networks.py:274-287.
The math: starting from x ~ N(0, I), repeat x ← x + h · vθ(x, s, τ) for τ = 0, h, 2h, …, 1 − h, with step size h = 1/num_steps.
What's in scope:
- model: the velocity network. Callable as model(x, state, t).
- state: conditioning, shape (B, state_dim).
- self.action_dim: action chunk length (20).
- self.num_steps: number of Euler steps (20).
- self.device: torch device string.

The code:
```python
@torch.no_grad()
def sample(self, model, state):
    B = state.shape[0]
    x = torch.randn(B, self.action_dim, device=self.device)
    dt = 1.0 / self.num_steps
    for i in range(self.num_steps):
        t = torch.full((B,), i * dt, device=self.device)
        v = model(x, state, t)
        x = x + dt * v
    return x.clamp(0.0, 1.0)
```
This is a decorator. It wraps the entire function so PyTorch doesn't track gradients during sampling. We're at inference time; we don't need gradients. Saves memory and time.
The decorator is already in the starter code (line 273 of networks.py) — you don't add it. But know it's there. If you need a fresh sample mid-training (which we don't here), you'd want this for efficiency.
Batch size. Convention: states have shape (B, state_dim), so the first dim is the batch.
Initialize the integration with pure Gaussian noise. Shape (B, action_dim) = (B, 20). Same shape as the actions we want to generate.
Why self.device matters: the model is on GPU. If you create the noise on CPU and pass to the model, you'll get a runtime error. Always create tensors on the same device as the model.
Step size. With num_steps = 20, this is 0.05. Each Euler step advances τ by 0.05, so 20 steps span τ = 0 to τ = 1.
The integration loop. Iterates from i = 0 to i = 19 (inclusive). At each iteration, we're at τ = i · dt.
Build the timestep tensor for this iteration. torch.full(shape, value) creates a tensor of given shape, filled with the scalar value. Here we want a tensor of shape (B,) with every element equal to i * dt (the current τ).
Why (B,) not just a scalar? Because the model expects a batched timestep — each sample in the batch could in principle be at a different time, even though here they're all at the same τ.
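What torch.full produces at one iteration of the loop, in isolation:

```python
import torch

B, dt = 4, 1.0 / 20
i = 3  # current Euler iteration
t = torch.full((B,), i * dt)  # shape (4,), every element equal to 0.15
assert t.shape == (4,)
assert torch.allclose(t, torch.full((4,), 0.15))
```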
Forward pass through the velocity network. Inputs: current point x shape (B, 20), state state shape (B, 4), timestep t shape (B,). Output: predicted velocity, shape (B, 20).
Euler step. Move x by dt · v. Same shape in, same shape out. Note this overwrites x — we don't need to keep the intermediate x's. Each iteration just advances the current point.
Why we use the same x for all steps: because we're integrating an ODE; the next state depends only on the current state and the current velocity. We don't need to remember the history.
Clamp to action range. After 20 Euler steps, x is approximately at the data distribution but may have drifted slightly outside [0, 1] due to integration error and the network's imperfections. Clamping is a safety net.
x.clamp(0, 1) returns a new tensor where any element above 1 becomes 1 and any below 0 becomes 0. Other elements unchanged.
1. Forgetting device when creating noise or timestep tensors. Causes "Expected GPU but got CPU" errors.
2. Off-by-one in the loop: using range(self.num_steps + 1) would do 21 steps and overshoot. range(self.num_steps) is correct.
3. Wrong timestep: t = torch.full((B,), (i + 1) * dt, ...) would query the model at time τ + dt instead of τ. Subtle bug; performance often still OK but mathematically wrong.
4. Forgetting the clamp: actions outside [0, 1] fall outside the env's expected range. Usually still works, but unprincipled.
Where: losses.py:36-53.
The math: L(θ) = E over (s, a1) ~ data, a0 ~ N(0, I), τ ~ U(0, 1) of ‖vθ(aτ, s, τ) − (a1 − a0)‖².
What's in scope:
- policy: a FlowMatchingPolicy with policy.model (network) and policy.schedule (FlowMatchingSchedule).
- s_batch: states, shape (B, state_dim).
- a_batch: clean expert actions, shape (B, action_dim).

The code:
```python
def flow_matching_loss(policy, s_batch, a_batch):
    B = a_batch.shape[0]
    t = torch.rand(B, device=a_batch.device)
    x_t, target_velocity = policy.schedule.interpolate(a_batch, t)
    pred_velocity = policy.model(x_t, s_batch, t)
    return nn.functional.mse_loss(pred_velocity, target_velocity)
```
Batch size. We need this to know how many independent τ samples to draw.
Sample B timesteps, each uniform in [0, 1). Note: torch.rand samples from U(0, 1) (uniform); torch.randn samples from N(0, 1) (normal). Different functions for different distributions.
Each sample in the batch gets a different τ. Across many minibatches, the network sees all values of τ in [0, 1) and learns to predict velocity at every timestep.
Why per-sample timesteps: it's much more efficient than running 20 separate forward passes per sample (one for each τ on a fixed grid). Random τ per sample gives the same coverage with fewer forward passes.
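The two samplers used here, checked empirically:

```python
import torch

torch.manual_seed(0)
u = torch.rand(100_000)   # U(0, 1): bounded, used for timesteps
g = torch.randn(100_000)  # N(0, 1): unbounded, used for noise

assert u.min() >= 0.0 and u.max() < 1.0   # uniform stays in [0, 1)
assert abs(g.mean().item()) < 0.02        # normal: mean ~0
assert abs(g.std().item() - 1.0) < 0.02   # normal: std ~1
```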
Calls your interpolate function from Change 1. Returns the noisy interpolation point and the target velocity. Shapes: (B, 20) for both.
Forward pass through the U-Net velocity model. Inputs: x_t shape (B, 20), s_batch shape (B, 4), t shape (B,). Output: predicted velocity shape (B, 20).
Note the U-Net is wrapped: policy.model is the TemporalNoisePredictor, which internally reshapes inputs and calls ConditionalUnet1D. We don't worry about the reshaping — it's all handled.
Standard MSE: average squared error across all elements (batch × action dim). Returns a scalar.
This is the same MSE function from Problem 1, applied to a different prediction-target pair. There the prediction was an action and the target was an expert action; here the prediction is a velocity and the target is a velocity.
Flow matching is "MSE on the right thing." Plain BC does MSE on actions, which (we showed) collapses to the conditional mean. Flow matching does MSE on velocities at random interpolation points, which (theorem from FM literature) converges to a vector field that transports noise into the data distribution — preserving multimodality. Same loss class, fundamentally different inductive bias.
Three blanks, totaling ~15 lines of code. The training loop calls flow_matching_loss per minibatch. Inside, your loss function calls schedule.interpolate. At inference, the policy wrapper calls schedule.sample. Your three pieces interlock to form the full algorithm.
python main.py --method bc_flow --env hard
This run:

- builds a FlowMatchingPolicy (the U-Net + your schedule),
- trains it with flow_matching_loss,
- rolls out using schedule.sample for action selection,
- writes results to bc_flow_hard.txt.

| Method | Mean episode length on hard mode | Why |
|---|---|---|
| BC regression (P1) | ~200-500 | MSE averages multimodal expert → wall |
| Flow matching (P2) | ~700-1000 | Generative model preserves modes → consistent gap selection |
You should see a substantial improvement over Problem 1 on hard mode. If episode length stays low, something is off — check the gotchas in Chapter 10.
Longer than P1's MLP. The U-Net has more parameters, and each training step does an interpolation + forward pass. Expect roughly 5-15 minutes on CPU, much faster on GPU.
| Metric | Healthy | Bug |
|---|---|---|
| Training loss | Starts ~1.0, decreases to ~0.05-0.2 | Stays at ~1.0 (model not learning velocities) |
| Final eval episode length | 700-1000 on hard mode | < 300 (probably bug in interpolate or sample) |
| Std of eval episode length | Moderate (some variance from sampling) | Very high (model is unstable) or 0 (collapsed to constant) |
Per the PDF:
"Flow matching trains a generative model that learns the full conditional distribution p(a | s) rather than just its conditional mean. On hard mode where the expert is bimodal (gap 1 or gap 2), the learned vector field transports samples from Gaussian noise to either mode based on the initial noise sample, so each rollout commits to one gap rather than aiming at the wall in between. This recovers near-expert performance because the policy now samples valid actions, not their average."
| Call | Returns |
|---|---|
torch.randn(*shape, device=) | Tensor of N(0, 1) samples, given shape |
torch.randn_like(x) | Same shape/dtype/device as x, N(0, 1) samples |
torch.rand(*shape) | Uniform [0, 1) samples |
torch.full(shape, value) | Tensor of given shape, all elements = value |
x.view(-1, 1) | Reshape to 2D with second dim = 1 |
x.clamp(lo, hi) | Elementwise min(max(x, lo), hi) |
nn.functional.mse_loss(pred, target) | Mean squared error scalar |
policy.schedule.interpolate(x1, t) | (x_t, velocity), shape (B, action_dim) |
policy.schedule.sample(model, state) | Sampled actions, shape (B, action_dim), clamped |
policy.model(x, state, t) | Predicted velocity, shape (B, action_dim) |
- What's the difference between torch.randn and torch.rand?
- If you set num_steps = 1 in the sampler, what does flow matching reduce to?

1. A regressor outputs one answer per input. A generative model outputs a distribution from which you sample. For unimodal targets they're equivalent; for multimodal targets, only generative models preserve multiple modes.
2. Because the trained vector field maps different noise samples to different modes. At a state with two valid actions (gap 1, gap 2), starting from different noise samples sends you to different gaps. No averaging.
3. A vector field assigns a velocity (arrow) to every (x, s, τ). Following the arrows over time transports a starting point to an ending point. Useful here because it lets us continuously transform Gaussian noise into expert actions.
4. The network predicts the velocity (rate of change) at the current point along the noise-to-data path, conditioned on the state and the current time.
5. Because the training-time path is a straight line: aτ = τ a1 + (1 − τ) a0. Differentiating gives daτ/dτ = a1 − a0, independent of τ.
6. So the network learns velocity at every value of τ in [0, 1]. At inference, we'll integrate through all those timesteps. If we only trained at one τ, the network would only know what to do at that single time.
7. A first-order numerical approximation to the ODE dx/dτ = v. We assume velocity is roughly constant over a small interval 1/n, so the position changes by roughly v/n.
8. Because clamping mid-integration would distort the trajectory. The network was trained on smooth interpolation paths; clamping at intermediate steps would push us off-distribution from the training paths. Final clamping is just a safety net for the very small drift that may occur.
9. Both transport noise to data via iterative refinement guided by a learned network. Diffusion uses a curved (random-walk) path and predicts noise/score. Flow matching uses a straight path and predicts velocity. Mathematically related; FM is simpler.
10. Because FM paths are straight, integration error is smaller per step. Diffusion paths are curved, so you need more small steps to follow the curve accurately.
11. torch.randn samples from N(0, 1) (Gaussian). torch.rand samples from U(0, 1) (uniform). Different distributions; different uses (we use randn for noise, rand for timestep).
12. One Euler step from x0 in direction v(x0, s, 0). With straight-line paths and learned velocity equal to the conditional mean of (a1 − a0), this approximately equals the conditional mean of a1. So num_steps = 1 collapses flow matching back to MSE regression with extra steps. Multimodality is lost.
Total: ~6 minutes of typing. Run on hard mode, see episode length jump from ~300 (BC regression) to ~800+ (flow matching). Write the explanation.
Three big ideas, in order of importance: generative policies sample from p(a | s) instead of regressing its mean; flow matching trains that sampler with nothing but MSE on velocities along straight noise-to-data paths; sampling is Euler integration of the learned ODE.
If a friend asks: "Why does flow matching work where MSE regression failed?" — you say: "MSE regression learns the conditional mean of expert actions, which collapses bimodal data into the wrong middle ground. Flow matching learns a vector field that maps noise to data; different noise samples flow to different modes, so the policy can sample either mode rather than averaging them. Mathematically, both are MSE losses, but flow matching applies MSE to velocity predictions along learned paths, not to raw actions. The richer prediction target preserves multimodality."
You can teach this. On to Problem 3.