A robot arm. A hammer. A nail. A reward of 1.0 if you finish, 0.0 every other moment. From those rules, a self-improving system. Every concept, every line of code, every gotcha — explained.
You have a 4-DOF Sawyer robot arm in simulation. It can pick things up, push, swing. Your task: train it to pick up a hammer and use that hammer to drive a nail. Same task as Problem 2, completely different algorithm.
The observation is a vector of about 39 numbers — positions of robot/hammer/nail at the current step and the previous step. The action is a 4-D vector in [-1, 1] — joint displacements.
The reward function is sparse: 1.0 at the timestep where the nail is fully driven (task success), 0.0 at every other timestep.
That sparseness is what makes this hard. Compare it to a "shaped" reward like −|distance to nail| − |hammer angle error|, which would tell the agent "you're getting warmer" continuously. With the sparse reward, the agent gets zero feedback for thousands of timesteps, then a single +1 when it finally succeeds. Most rollouts contain zero learning signal.
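To make the contrast concrete, here is a minimal sketch of the two reward styles (hypothetical helper names; the real reward is computed inside the environment):

```python
import numpy as np

def sparse_reward(nail_depth, nail_length):
    # +1.0 only when the nail is fully driven; 0.0 at every other timestep.
    return 1.0 if nail_depth >= nail_length else 0.0

def shaped_reward(hammer_pos, nail_pos, nail_depth, nail_length):
    # A hand-designed "you're getting warmer" signal at every step. Easier to
    # learn from, but you had to invent it, and the agent can exploit its quirks.
    return (-float(np.linalg.norm(np.asarray(hammer_pos) - np.asarray(nail_pos)))
            + nail_depth / nail_length)
```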
Real-world robotics rewards are sparse. You can't easily hand-code a smooth reward for "drive the nail" — what does "halfway driven" look like? Sparse rewards force you to use exploration + demonstrations + sample-efficient algorithms. Solving sparse-reward tasks is the frontier.
So the homework gives you two things to fight the sparseness:

1. A set of expert demonstrations, used to warm-start the policy (and keep anchoring it) via behavior cloning.
2. An off-policy actor-critic algorithm with a replay buffer, so every rare successful transition gets reused many times.
The deliverable: a wandb plot of eval/episode_success rising from 0 to at least 90% within 100,000 environment steps. For comparison, PPO in Problem 2 needs about 1,000,000 steps and still plateaus at a lower success rate. That's a 10× sample-efficiency gap. Understanding where it comes from is the entire lesson.
Think of a chef tasting a dish once and writing down the rating. On-policy: taste, write rating, throw the recipe away, cook a new dish. Off-policy: taste, write rating, store everything in a giant cookbook. Tomorrow, re-read every entry hundreds of times and notice patterns. Same data, way more learning.
In Problem 1 you implemented tabular Q-learning. Same idea here, except instead of a 20×4 table, the Q-function is a neural network.
The expected total discounted reward you would get if you start in state s, take action a, and then act according to your policy π forever after.
The discount factor γ ∈ [0, 1) — future rewards are worth less than immediate ones. Typical: 0.99. A reward 100 steps away is worth 0.99¹⁰⁰ ≈ 0.37 of a reward right now.
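A two-line sanity check of that arithmetic:

```python
gamma = 0.99
print(gamma ** 100)     # ~0.366: a reward 100 steps away is worth about a third
print(1 / (1 - gamma))  # 100.0: the usual "effective horizon" rule of thumb
```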
You could also imagine a function that takes just the state — the state-value V(s) = Ea∼π[ Q(s, a) ]: the expected return from s when all subsequent actions come from π.
PPO (Problem 2) used V. Why does this homework use Q?
Because actions matter. When we update the actor, we want to push the policy toward actions that have high value. With Q(s, a), we can directly ask "is this specific action good?" With V(s), we'd only know whether the state is good on average, and we'd have to use the policy gradient theorem (with all its variance) to extract action-level signal. Q-learning skips the middleman.
You saw it in P1. Same recursion, same intuition:

Q(s, a) = E[ r + γ Q(s', a') ],  with s' the next state and a' ∼ π(·|s')
The value of (s, a) decomposes into this step's reward plus the discounted value of where we land next. This recursion is the entire engine of TD learning.
The TD target for a single sample (s, a, r, s') is:

y = r + γ Q(s', a')
And the TD error is the difference between this new estimate and our old prediction:

δ = y − Q(s, a) = r + γ Q(s', a') − Q(s, a)
If δ is positive, the action turned out better than we predicted — push Q(s, a) up. If negative, push down. Same as the gridworld update; just with a neural network instead of a table.
In the table case: Q[s, a] += α δ. Direct write. In the network case, we don't have direct slots — we have weights φ. So instead we minimize the MSE loss (Qφ(s, a) − y)² and let backprop pull Qφ(s, a) toward y. The TD update becomes a regression problem.
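A minimal sketch of that shift, assuming a single critic that maps a (state, action) batch to a [B, 1] tensor (the homework's critic is an ensemble, covered later):

```python
import torch
import torch.nn.functional as F

# Tabular:  Q[s, a] += alpha * (y - Q[s, a])      (direct write into a slot)
# Network:  minimize (Q_phi(s, a) - y)^2 by gradient descent on phi

def td_regression_step(critic, optimizer, state, action, target):
    pred = critic(state, action)          # Q_phi(s, a), shape [B, 1]
    loss = F.mse_loss(pred, target)       # regression toward the TD target y
    optimizer.zero_grad(set_to_none=True)
    loss.backward()                       # backprop pulls Q_phi(s, a) toward y
    optimizer.step()
    return loss.item()
```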
If s' is terminal (the episode ended), there is no future:

y = r
The starter code provides discount as a pre-multiplied factor: discount = γ · (1 − done). So the single expression

target = reward + discount · Q(s', a')

handles both cases. When done is 1, discount is 0 and the bootstrap term vanishes. Same trick you used in PPO's GAE.
This is the most important conceptual distinction in modern RL.
| Property | On-policy (PPO, Problem 2) | Off-policy (this problem, SAC, DQN) |
|---|---|---|
| Data source | Current policy only | Any past policy |
| Storage | Throw away after few epochs | Replay buffer, kept forever |
| Sample efficiency | Low | High (10× or more) |
| Stability | Naturally stable | Fragile — needs target nets, double-Q |
| Algorithm class | Policy gradient | Q-learning / actor-critic |
The policy gradient estimator is:

∇θ J(θ) = E(s, a)∼πθ[ ∇θ log πθ(a|s) · Â(s, a) ]
The expectation is over (s, a) drawn from the current policy. If you use stale data, the expectation is wrong — you're computing a gradient for the wrong distribution. PPO patches this with importance sampling for a few epochs, but still needs fresh data per update.
Look at the Bellman equation again:

Q(s, a) = E[ r + γ Q(s', a') ]
This equation is a property of the environment (the reward and transition function), not of any particular policy. As long as you have a tuple (s, a, r, s'), you can use it to enforce the Bellman constraint — regardless of which policy generated the tuple.
That's why off-policy works: the critic learns the environment's value structure, and the policy generating the data doesn't have to match. You can mix transitions from a random initial policy, an expert demonstration, and your current actor, all in the same buffer.
You collect 1 transition per environment step but do many gradient updates per step. The "update-to-data" ratio (UTD) is how aggressively you exploit your buffer. UTD=1 is conservative; UTD=5 in this homework's ablation is more aggressive. Off-policy lets you do this; on-policy fundamentally cannot.
Each entry: (s, a, r, s', done). That's it. No policy, no log-probs (unlike PPO). You can sample any minibatch of these uniformly at random, and the Bellman update applies.
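A minimal buffer sketch, just to show how little state it needs (the homework ships its own buffer, which also pre-multiplies discount = γ(1 − done)):

```python
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity, obs_dim, act_dim):
        self.obs      = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.action   = np.zeros((capacity, act_dim), dtype=np.float32)
        self.reward   = np.zeros(capacity, dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.done     = np.zeros(capacity, dtype=np.float32)
        self.capacity, self.idx, self.full = capacity, 0, False

    def add(self, obs, action, reward, next_obs, done):
        i = self.idx
        self.obs[i], self.action[i], self.reward[i] = obs, action, reward
        self.next_obs[i], self.done[i] = next_obs, done
        self.idx = (self.idx + 1) % self.capacity   # overwrite oldest when full
        self.full = self.full or self.idx == 0

    def sample(self, batch_size):
        high = self.capacity if self.full else self.idx
        i = np.random.randint(0, high, size=batch_size)  # uniform over stored transitions
        return (self.obs[i], self.action[i], self.reward[i],
                self.next_obs[i], self.done[i])
```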
The off-policy story above sounds clean. It isn't. Three properties together cause Q-learning with neural networks to diverge. The triad is famously called "deadly":

1. Function approximation. Q is a neural network, so a gradient step at one (s, a) changes predictions everywhere, not just in one table slot.
2. Bootstrapping. The target r + γ Q(s', a') uses the same network we're optimizing. As we update φ, the target moves. We're regressing toward a moving label.
3. Off-policy data. The transitions in the buffer come from old policies, so the states we train on don't match the distribution the current policy visits.

Any one of these is fine. Any two is usually fine. All three together — instability, exploding Q-values, training collapse.
Imagine vanilla Q-learning: we use the same critic for the target and the online prediction.
When we take a gradient step, both Qφ(s, a) and Qφ(s', a') change. We tried to pull the prediction toward the target, but we also moved the target. Repeat thousands of times, gradient descent never converges.
Imagine trying to grab a flag that runs away whenever you reach for it — and runs at the same speed you do. The fix: tie the flag's position to a different copy of you that updates slowly. That's the target network.
Vanilla Q-learning's target uses max_a' Q(s', a'). Suppose Q has noise — some actions' Q-values are slightly overestimated, others underestimated, on average correct.
The max operator systematically picks overestimated entries. So your target is biased upward. You train Q toward an upward-biased target. Q grows. Next iteration, even more bias. Q-values explode.
This is called maximization bias. It's the reason vanilla DQN often diverges.
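You can see the bias in a few lines of numpy: every action's true value is 0 and the estimates are unbiased noise, yet the max is far above 0 while a min over two estimates is conservative:

```python
import numpy as np

rng = np.random.default_rng(0)
noisy_q = rng.normal(0.0, 1.0, size=(100_000, 10))  # 10 actions, true Q = 0, unbiased noise

print(noisy_q.max(axis=1).mean())                      # ~ +1.5 : max is biased upward
print(np.minimum(noisy_q[:, 0], noisy_q[:, 1]).mean()) # ~ -0.56: min of two is conservative
```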
| Failure | Fix | Where in homework |
|---|---|---|
| Moving target | Target network — slowly-updated copy of the critic | self.critic_target + soft_update_params |
| Maximization bias | Clipped double-Q — min over two critics | min(Q̄i, Q̄j) in target |
| Variance in target | Ensemble of N critics, sample 2 randomly | num_critics hyperparameter |
Each fix surfaces in your update_critic implementation. Read this chapter again after you've written it — you'll see why each line is there.
Q-learning works in discrete action spaces because you can compute argmax_a Q(s, a) by enumerating actions (e.g., 18 Atari buttons). With continuous actions in [-1, 1]⁴, you can't enumerate. There's no argmax.
Two options:

1. Approximate the argmax by search: discretize the action space or sample many candidate actions and keep the best. In [-1, 1]⁴ even a coarse grid per dimension blows up combinatorially.
2. Train a second network, an actor, that learns to output the (approximately) Q-maximizing action directly.
We use option 2. The actor πθ(a|s) is trained so that its sampled action a maximizes Qφ(s, a). That's the entire actor objective:

actor_loss = −E[ Qφ(s, a) ],  s ∼ replay buffer, a ∼ πθ(·|s)
Read it: "sample a state s, plug the policy's chosen action into the critic, that scalar is what the actor wants to maximize." The minus sign converts maximize into minimize for PyTorch.
For the gradient to flow from Qφ back into θ, the action a must be a differentiable function of θ. With a stochastic policy this seems impossible — you can't backprop through "sample from a distribution." Solution: reparameterize.
The actor in this homework outputs a TruncatedNormal(μ, 0.1). Std is fixed at 0.1, not learned (unlike SAC's adaptive entropy). The 0.1 just adds exploration noise. Calling dist.sample() returns tanh(μθ(s)) + 0.1 · ε, which is differentiable in θ while ε is sampled fresh.
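A stripped-down sketch of that reparameterized sample (ignoring the truncation to [-1, 1] that the real TruncatedNormal also applies):

```python
import torch

def reparameterized_action(policy_net, obs, std=0.1):
    mu = torch.tanh(policy_net(obs))  # differentiable function of theta
    eps = torch.randn_like(mu)        # noise drawn outside the computation graph
    return mu + std * eps             # gradients flow through mu back into theta
```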
Actor πθ(a|s): maps state to action distribution. Trained to output high-Q actions. Used to act in the environment.
Critic Qφ(s, a): estimates expected return for (state, action). Trained via TD regression. Used as a training signal for the actor.
You alternate two updates:

1. Critic update: regress Qφ(s, a) toward the TD target, enforcing Bellman consistency on replayed transitions.
2. Actor update: adjust θ so the actions the actor samples score higher under the critic.
If you only had the critic, you'd know which actions are good but couldn't act on them. If you only had the actor, you'd have nothing to optimize against. Together, they bootstrap each other up. This is the entire idea of actor-critic.
Each fixes a specific failure mode from Chapter 04. Each surfaces directly in your update_critic.
Maintain a second copy of the critic, Q̄φ̄, used only for computing TD targets. Update its parameters slowly, once per critic update:

φ̄ ← τ · φ + (1 − τ) · φ̄,  with τ ≈ 0.005
Each step, the target moves 0.5% toward the online critic. After about 200 steps the target has caught up. Critically, on the timescale of any single gradient update, the target looks frozen. The regression target is stable.
In the codebase: utils.soft_update_params(net, target_net, tau) does this. You'll call it once per critic update.
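The helper is provided, but it is worth knowing that it amounts to a few lines of Polyak averaging, roughly:

```python
import torch

def soft_update_params(net, target_net, tau):
    # target <- tau * online + (1 - tau) * target, parameter by parameter
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.data.mul_(1.0 - tau).add_(tau * p.data)
```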
If you forget the soft update entirely, the target stays at its initialization forever. The critic only ever regresses toward r + γ·(frozen random values), so reward information never propagates down the trajectory and learning crawls. The critic loss will look fine but the actor never improves.
Maintain N independent critic networks Qφ1, ..., QφN. Each is initialized differently (different orthogonal weights from weight_init) and sees different minibatch orderings. They disagree on out-of-distribution inputs. That disagreement reduces variance when we combine them.
Default in this homework: N=2. Ablation: N=10. With N=10 + UTD=5 you have something close to REDQ ("Randomized Ensembled Double Q-learning"), a state-of-the-art recipe for sample efficiency.
When computing the TD target, do not use all N critics. Pick 2 randomly, take the elementwise min:

target_q = min( Q̄i(s', a'), Q̄j(s', a') ),  i, j drawn without replacement from {1, …, N}
The min counteracts maximization bias from Chapter 04. The random sampling (vs always using critics 1 and 2) is the REDQ improvement — it forces every critic to be reliable, not just the first two. Otherwise critics 3..N could drift since they're never used in targets.
Common bug: students sometimes only update the 2 sampled critics, leaving the others frozen. Wrong. Loss should be computed over all N critic predictions:
critic_loss = sum( F.mse_loss(q, target) for q in q_list )
The target uses 2 samples; the loss uses all N predictions.
The TD target with all three tricks:

a' ∼ πθ(·|s')   (noise clipped to stddev_clip)
i, j ∼ Uniform{1, …, N}   (without replacement)
target_q = min( Q̄i(s', a'), Q̄j(s', a') )
y = r + γ(1 − done) · target_q
Memorize the shape of this. Three tricks, four lines, all the off-policy stability machinery in modern RL.
The reward is sparse. Random exploration almost never sees a +1. The agent would never learn from scratch in 100k steps.
So we cheat: warm-start the policy by imitating expert demonstrations. This is supervised learning, not RL.
Maximize the log-probability that the policy assigns to expert actions. Pure maximum likelihood. After ~5000 BC gradient steps, the policy is "kind-of right" — it produces actions that look expert-shaped. Then RL takes over and refines them based on actual reward.
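In symbols, the BC objective is the plain negative log-likelihood of the expert actions (matching the Task 1 pseudocode later on):

```latex
\mathcal{L}_{\mathrm{BC}}(\theta) \;=\; -\,\mathbb{E}_{(s,\,a)\sim \mathcal{D}_{\mathrm{demo}}}\!\left[\log \pi_\theta(a \mid s)\right]
```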
The homework alternates RL gradient steps with BC gradient steps throughout training. Why?
Imagine you've BC-pretrained a decent policy. Now you start RL. The actor's gradient step says "move toward whatever the critic likes." But the critic is itself a randomly-initialized neural net at first — its high-Q regions are essentially random. The actor would walk away from the BC initialization toward the critic's random preferences and forget the demos.
Solution: keep mixing in BC steps during RL. The BC term anchors the actor near the expert distribution. As the critic improves, RL pulls more strongly toward Q-maximization. Net effect: stable improvement over BC, no catastrophic forgetting.
Problem 2's PPO uses a "reverse KL to a frozen reference policy" for the same reason — preventing drift from the BC warm-start. Both methods solve "don't forget the demos" but with different mechanics: PPO adds a KL penalty in the loss; this homework alternates a BC update. Same disease, different antibiotics.
Both pretraining and the RL-mixin call the same method:
```python
def bc(self, replay_iter):
    # replay_iter yields (obs, action, reward, discount, next_obs)
    # For BC we only need (obs, action). The rest are unused.
    batch = next(replay_iter)
    obs, action, _, _, _ = utils.to_torch(batch, self.device)
    # YOUR CODE: loss = -E[log pi(a|s)], step actor optimizer
```
The trick: this same method is called whether the buffer is the demos or the live replay buffer. The training script picks which buffer to feed it. Don't put any "first time only" logic in bc.
Putting everything together. Read this twice.

(a) Collect: act in the environment with the current actor (plus exploration noise) and store each transition in the replay buffer.
(b) Critic: sample a minibatch, regress the critic ensemble toward the TD target, soft-update the target critic.
(c) Actor: update the actor to maximize the ensemble's Q on its own freshly sampled actions.
(d) BC mix-in: take a behavior-cloning step on the demonstration buffer so the actor stays anchored to the expert.

Before this loop starts, the same bc method runs the ~5000-step pretraining phase on the demos.
You're implementing pieces of step (b), (c), and (d). The orchestration in train_off_policy.py calls your three methods.
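As a mental model, the orchestration loop looks roughly like this (illustrative names; the real script also handles BC pretraining, evaluation, and logging):

```python
obs = env.reset()
for step in range(num_train_steps):
    # (a) act in the environment and store the transition
    action = agent.act(obs, eval_mode=False)
    next_obs, reward, done, info = env.step(action)
    replay_buffer.add(obs, action, reward, next_obs, done)
    obs = env.reset() if done else next_obs

    # (b) critic update(s): repeated `utd` times per environment step
    for _ in range(utd):
        agent.update_critic(replay_iter)

    # (c) actor update, then (d) a BC step on the demo buffer
    agent.update_actor(replay_iter)
    agent.bc(demo_iter)
```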
Three classes, mirroring the three concepts: actor, critic ensemble, agent.
off_policy.py:11-33

```python
class Actor(nn.Module):
    def __init__(self, obs_shape, action_shape, hidden_dim, std=0.1):
        super().__init__()
        self.std = std
        self.policy = nn.Sequential(
            nn.Linear(obs_shape[0], hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, action_shape[0]))

    def forward(self, obs):
        mu = torch.tanh(self.policy(obs))      # action mean, squashed to [-1,1]
        std = torch.ones_like(mu) * self.std   # fixed std (NOT learned)
        return utils.TruncatedNormal(mu, std)
```
Notable contrasts with PPO's actor:

- The action mean is squashed with tanh, so it always lies in [-1, 1].
- The std is a fixed constant (0.1), not a learned parameter; it exists only to add exploration noise.
- The distribution is a TruncatedNormal (defined in utils.py) rather than an ordinary Normal.

off_policy.py:36-54

```python
class Critic(nn.Module):
    def __init__(self, obs_shape, action_shape, num_critics, hidden_dim):
        super().__init__()
        self.critics = nn.ModuleList([nn.Sequential(
            nn.Linear(obs_shape[0] + action_shape[0], hidden_dim),
            nn.LayerNorm(hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1)) for _ in range(num_critics)])

    def forward(self, obs, action):
        h = torch.cat([obs, action], dim=-1)
        return [critic(h) for critic in self.critics]  # LIST of N tensors
```
Note the return type: forward gives a Python list of N tensors, each of shape [batch, 1]. Iterate over it for ensemble operations.

off_policy.py:57-79

```python
class ACAgent:
    def __init__(self, obs_shape, action_shape, device, lr, hidden_dim,
                 num_critics, critic_target_tau, stddev_clip):
        self.device = device
        self.critic_target_tau = critic_target_tau
        self.stddev_clip = stddev_clip

        self.actor = Actor(obs_shape, action_shape, hidden_dim).to(device)
        self.critic = Critic(obs_shape, action_shape, num_critics, hidden_dim).to(device)
        self.critic_target = Critic(obs_shape, action_shape, num_critics, hidden_dim).to(device)
        self.critic_target.load_state_dict(self.critic.state_dict())  # target == online at init

        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=lr)
        self.critic_opt = torch.optim.Adam(self.critic.parameters(), lr=lr)
```
Two notable choices:

1. The target critic is initialized as an exact copy of the online critic (load_state_dict), so target and online start in agreement.
2. Actor and critic get separate Adam optimizers: bc and update_actor step actor_opt, update_critic steps critic_opt.

off_policy.py:88-97

```python
def act(self, obs, eval_mode):
    obs = torch.as_tensor(obs, device=self.device).float()
    dist = self.actor(obs.unsqueeze(0))
    if eval_mode:
        action = dist.mean               # greedy
    else:
        action = dist.sample(clip=None)  # with noise
    return action.cpu().numpy()[0]
```
Used by the rollout collector. Eval mode = deterministic mean. Train mode = sampled with exploration noise.
For each task: the math, the gotchas, and pseudocode you can hold in your head while writing the actual implementation. Cross-reference back to the corresponding chapter when in doubt.
Goal: maximize the log-probability that the actor assigns to expert actions.
Already in scope:
```python
batch = next(replay_iter)
obs, action, _, _, _ = utils.to_torch(batch, self.device)
```
Pseudocode:
```python
dist = self.actor(obs)
log_prob = dist.log_prob(action).sum(-1)
loss = -log_prob.mean()

self.actor_opt.zero_grad(set_to_none=True)
loss.backward()
self.actor_opt.step()

metrics["bc_loss"] = loss.item()
```
• Use self.actor_opt, not self.opt — this class has separate actor/critic optimizers.
• log_prob returns shape [batch, action_dim]. Sum over the last dim (since action is 4-D, joint log-prob = sum of per-dim log-probs). Mean across the batch.
• This same method is also called during RL training. Don't add any one-time logic.
Goal: one critic gradient step plus one target soft-update. This is the most involved task.
Already in scope:
```python
batch = next(replay_iter)
obs, action, reward, discount, next_obs = utils.to_torch(batch, self.device)
```
Pseudocode:
```python
with torch.no_grad():
    # next-action from current policy
    next_action = self.actor(next_obs).sample(clip=self.stddev_clip)

    # forward all target critics → list of [B, 1] tensors
    target_q_list = self.critic_target(next_obs, next_action)

    # pick 2 random critics from the list (without replacement)
    sampled = random.sample(target_q_list, 2)
    target_q = torch.min(sampled[0], sampled[1])  # shape [B, 1]

    # TD target. Mind the shapes: reward [B], discount [B], target_q [B, 1]
    target = reward.unsqueeze(-1) + discount.unsqueeze(-1) * target_q

# online critic predictions on (obs, action)
q_list = self.critic(obs, action)

# sum of MSEs across ALL N critics
critic_loss = sum(F.mse_loss(q, target) for q in q_list)

# gradient step
self.critic_opt.zero_grad(set_to_none=True)
critic_loss.backward()
self.critic_opt.step()

# soft-update target critic
utils.soft_update_params(self.critic, self.critic_target, self.critic_target_tau)

# logging
metrics["critic_loss"] = critic_loss.item()
metrics["critic_target_q"] = target.mean().item()
metrics["critic_q1"] = q_list[0].mean().item()
```
1. Wrap target computation in with torch.no_grad():. Otherwise gradients flow through the target into the critic and you train it against itself → divergence.
2. Shape alignment. reward and discount are [B]; Q outputs are [B, 1]. Add .unsqueeze(-1) to align. Mismatched shapes broadcast to [B, B] — loss looks fine numerically but is gibberish, training never works. Silent killer (see the short demo after this list).
3. Sample 2 critics WITHOUT replacement: random.sample(list, 2). Don't use random.choices — that's with replacement and could pick the same critic twice.
4. Train ALL N critics, not just the 2 used in the target. sum(F.mse_loss(q, target) for q in q_list) hits every critic.
5. Use self.critic_target, not self.critic, for the target. Mixing these up is the most common bug.
6. stddev_clip: when sampling for the target, clip the noise to self.stddev_clip. This is "target policy smoothing" from TD3 — it prevents the critic from being trained on extremely off-distribution actions.
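Gotcha 2 is worth seeing once in a REPL; the broadcast happens silently:

```python
import torch

reward   = torch.zeros(256)     # [B]
target_q = torch.zeros(256, 1)  # [B, 1]

bad  = reward + target_q                 # broadcasts to [B, B]: a gibberish target
good = reward.unsqueeze(-1) + target_q   # [B, 1]: what you actually want

print(bad.shape, good.shape)             # torch.Size([256, 256]) torch.Size([256, 1])
```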
• critic_loss stays at 0.0 → gradients aren't flowing. Forgot .backward()? Wrong optimizer? Loss disconnected from graph?
• critic_loss explodes to 1e6+ → shape bug, OR forgot no_grad and target is being trained, OR forgot soft update.
• target.mean() stays at 0 forever → agent never sees reward. Either bc pretrain didn't work, or buffer too small, or something upstream broken.
Goal: one actor gradient step. Push the policy toward actions where Q is high.
Already in scope:
```python
batch = next(replay_iter)
obs, _, _, _, _ = utils.to_torch(batch, self.device)
```
Only obs. The action will come from sampling the current policy — not from the buffer's stored action. Why: we want gradient to flow from Q through the new sampled action back into θ. The buffer's action was produced by an old policy and is detached.
Pseudocode:
```python
# sample fresh action from current policy (reparameterized → grads flow)
dist = self.actor(obs)
action = dist.sample(clip=self.stddev_clip)

# forward all critics. Critic params are NOT optimized in this step.
q_list = self.critic(obs, action)

# mean Q across ensemble, then mean across batch, then negate
q_mean = torch.stack(q_list, dim=0).mean(dim=0)  # [B, 1]
actor_loss = -q_mean.mean()

# gradient step on actor only
self.actor_opt.zero_grad(set_to_none=True)
actor_loss.backward()
self.actor_opt.step()

# logging
metrics["actor_loss"] = actor_loss.item()
metrics["actor_q"] = q_mean.mean().item()
```
• Use sample(clip=self.stddev_clip), not sample(). The clipping is what TD3 calls "target policy smoothing" — even on the actor side it's standard.
• Don't accidentally backprop into the critic. Since you're calling self.actor_opt.step() (not the critic's optimizer), and the critic params aren't in actor_opt, you're safe by construction. But don't get cute — if you call self.critic_opt.step() by mistake here, you corrupt the critic.
• Mean across critics, then mean across batch. The PDF specifies (1/N) Σ_k Q_k. (Some implementations use min instead; staying with the PDF is correct.)
• The TruncatedNormal's clamp uses a straight-through estimator (x − x.detach() + clamped.detach()), so gradients still flow through μθ(s) even though the action is bounded. See utils.py:119.
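That trick is small enough to write out; a sketch of a straight-through clamp in the same spirit as the one in utils.py:

```python
import torch

def clamp_straight_through(x, low=-1.0, high=1.0):
    # Forward value is the clamped action; the gradient w.r.t. x is the identity,
    # so backprop still reaches mu_theta(s) even when x sits outside [low, high].
    clamped = x.clamp(low, high)
    return x - x.detach() + clamped.detach()
```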
The homework's final part asks you to run the same algorithm with two configurations and explain the difference:
| Run | num_critics | UTD | Wall-clock target | Env steps |
|---|---|---|---|---|
| Default | 2 | 1 | ~30-45 min | 100k |
| Ablation | 10 | 5 | ~2 hours | 50k |
The number of critic gradient updates performed per environment step. UTD=1 means: collect 1 transition, do 1 critic update. UTD=5 means: collect 1 transition, do 5 critic updates — same data, 5× more gradient passes through it.
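A back-of-the-envelope comparison of the two runs in the table above:

```python
default_updates  = 100_000 * 1  # 100k env steps, UTD=1 -> 100,000 critic updates
ablation_updates =  50_000 * 5  #  50k env steps, UTD=5 -> 250,000 critic updates

print(default_updates, ablation_updates)  # 2.5x the gradient passes from half the data
```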
You can ONLY do this with off-policy methods. PPO can't — once you do too many updates on the same rollout, the policy drift exceeds what the clip surrogate can compensate for.
Pro of UTD>1: each transition is more thoroughly exploited. Sample efficiency improves. The 50k-step run with UTD=5 should match or beat the 100k-step run with UTD=1.
Con of UTD>1: more compute per env step (5× more gradient passes). Wall-clock is longer despite fewer env steps.
Con of UTD>1 alone: with only 2 critics and aggressive replay, the critic overfits to recent buffer entries. Q-values become wildly inaccurate on out-of-distribution actions, the actor exploits those errors, training collapses. This is why pure UTD increase often hurts.
Why N=10 saves UTD=5: a larger ensemble has lower variance in target estimates, so it tolerates more aggressive updates. The ensemble's disagreement on out-of-distribution actions provides an implicit regularization signal — min-of-2-randomly-sampled is more conservative when the underlying ensemble is more diverse.
This pairing (high UTD + large ensemble) is the recipe of REDQ, which achieves model-based-level sample efficiency on continuous control.
"Increasing UTD from 1 to 5 with 10 critics improves sample efficiency because each transition is replayed and propagated through critic updates 5× more often, while the larger ensemble keeps target estimates well-calibrated against the resulting overfitting risk."
The very last part of HW2 asks you to compare PPO vs SAC-default curves in 3-5 sentences with at least two concrete differences.
| Property | PPO (P2) | SAC-style (P3 default) |
|---|---|---|
| Steps to plateau | ~1M | ~100k |
| Why | Discards rollouts | Replays buffer indefinitely |
| Final success rate | ~50-80% | >90% |
| Run-to-run variance | Lower (clipped) | Higher (off-policy) |
| Conceptual basis | V-baseline + policy gradient | Q-learning with actor |
Two concrete differences to discuss:

1. Sample efficiency: the off-policy run reaches >90% success within ~100k steps while PPO needs ~1M and plateaus lower, because the replay buffer reuses every transition — especially the rare successful ones — many times, whereas PPO discards each rollout after a few epochs.
2. Stability and variance: PPO's clipped on-policy updates give smoother, more reproducible curves; the off-policy run is jumpier across seeds because bootstrapped targets and replayed data make the critic (and therefore the actor) more fragile.
| Variable | Where | What |
|---|---|---|
self.actor | ACAgent | Online policy πθ |
self.critic | ACAgent | Ensemble of N online critics |
self.critic_target | ACAgent | Slow-moving copies of online critics |
self.actor_opt | ACAgent | Adam, only actor params |
self.critic_opt | ACAgent | Adam, only critic params |
self.critic_target_tau | ACAgent | τ for Polyak (typically 0.005) |
self.stddev_clip | ACAgent | Action noise clip for sampling |
obs, action, reward, discount, next_obs | From to_torch(batch) | discount = γ(1−done) pre-multiplied |
q_list | Returned by critic(obs, action) | Python list of N tensors, each [B, 1] |
utils.soft_update_params | utils.py | Polyak helper, applies τ-mix to all params |
1. Why can't the online critic compute its own TD targets without any stabilization?
2. What does the (1 − done) factor accomplish in the Bellman target?
3. In the TD target, why take the min over 2 randomly sampled critics, and why still train all N?
4. Why is the actor loss − Q(s, π(s)) and not + Q(s, π(s))?
5. Why is with torch.no_grad(): essential when computing the TD target?
6. What's the difference between critic and critic_target? When does each get updated?
7. Symptom: critic_loss sits at 0.0 and never moves. Likely causes?
8. Symptom: critic_loss or the Q-values blow up. Likely causes?
9. Why is off-policy so much more sample-efficient than on-policy on sparse-reward tasks?
10. Why keep mixing BC updates into RL training instead of only pretraining?
11. In the UTD=5 ablation, what dominates wall-clock time: environment steps or gradient steps?
12. In update_actor, why sample a fresh action from the current policy instead of using the buffer's stored action?

Answers:

1. Bootstrapping with the same network creates a moving target; combined with off-policy data and function approximation (the deadly triad), gradient descent chases its own tail. Target networks fix this.
2. Zeros out the bootstrapped future when an episode ended — terminal states have no future to bootstrap from.
3. Min-of-2 reduces maximization bias; randomization keeps every critic in the ensemble accountable; train all N to keep the ensemble diverse and accurate everywhere, not just where it's queried for the target.
4. Optimizers minimize, but we want to maximize Q. Negate.
5. Otherwise gradients flow through the target into the critic, training it against its own moving prediction → divergence.
6. critic is online, updated by Adam every step. critic_target is updated only by Polyak averaging, slowly tracking the online critic. Used only for computing TD targets.
7. Almost certainly: target wasn't no_grad'd / detached, OR wrong optimizer, OR .backward() not called, OR shapes mismatched and broadcast made loss = 0.
8. Shape mismatch broadcasting (e.g. [B] + [B, 1] → [B, B]), OR forgot no_grad on target, OR forgot soft update so target stuck at random init.
9. Off-policy reuses each transition many times; on-policy throws data away after a few epochs. With sparse rewards, the rare reward transitions get replayed dozens of times in off-policy.
10. BC pretraining can be forgotten as RL drifts the policy; periodic BC during RL keeps the actor anchored to the expert distribution, especially when actor updates push toward critic-overestimated actions.
11. Gradient steps. Each env step costs ~1ms (sim + tiny forward pass). Each gradient step costs ~10ms+ (full forward+backward through actor and ensemble of critics).
12. The buffer's action was sampled from an old policy. We want to train the current actor's parameters, so we need a fresh action where gradients can flow from Q back through μθ. Reparameterization makes that gradient meaningful.
A sensible order of attack:

1. Implement bc first — it's just the negative mean log_prob. Easiest. Do this first to verify your dev loop works.
2. Implement update_critic and update_actor, then launch the default run: modal run --detach modal_off_policy.py
3. For the ablation, switch the config to num_critics=10 and utd=5.
4. Launch the ablation the same way: modal run --detach modal_off_policy.py

If a friend asks: "What's the difference between PPO and SAC?" — you don't recite features. You say: "PPO is a policy gradient method that has to use fresh data because its update is biased on stale data. SAC is a Q-learning method whose update is grounded in the Bellman equation, which is a property of the environment, so any past data works. That asymmetry is why off-policy is 10× more sample-efficient on sparse-reward tasks. Off-policy needs target networks and double-Q to stabilize, but the sample-efficiency win is worth the engineering."
If asked: "Why an ensemble of critics?" — you say: "Maximization bias. The max operator on noisy estimates is biased upward. Two-critic min counteracts that. With an ensemble of 10 and random pair-min, you get a more conservative target that tolerates aggressive UTD without overfitting."
That's the bar. You're there.