How we taught a UR5 arm to pick a Lego brick

PART I · Setup

01 · the problem

01The problem, in one sentence.

There is a robot arm sitting in front of a table. On the table are twelve Lego bricks of different colours. You — the operator — would like to walk up to a screen, click the brick you want, and have the arm pick that brick.

That sentence packs a lot. "The brick you want" presumes the arm can map your click to a specific physical object. "Pick it up" presumes the arm can decide where to move, how to close its fingers, and how not to knock the eleven other bricks onto the floor. Behind all of that there's a stack of work — modelling the robot, simulating the physics, collecting demonstrations, training a neural network, serving the trained network on a cloud GPU, and closing the loop in real time. Roughly half of this book is that stack.

The other half is what went wrong. Spoiler: across six trained checkpoints and four different data / training tricks, our policies never managed to actually close on a brick. They reliably descend toward the table, they reliably move in roughly the right direction — but they plateau ≈ 30 cm short of the target. Chapter 32 is the autopsy. Chapter 35 is the fix we wish we had time to build.

A different version of this project works.

Our sister project (cs-224r-final-project, the boxbros repo) does exactly this same trick on a real SO-101 arm — the same bbox-conditioned ACT policy, on real data — and it does close on the brick. The reasons the simulated UR5 version doesn't are not "the idea is wrong"; they're "the bottleneck is somewhere subtle." Stick around for chapter 32.

This book assumes no robotics background. Each chapter builds the next.

02 · why simulation

02Why we work in simulation first.

Picture the same problem on a physical arm. Every demonstration we'd need a human to teleoperate, or a scripted-IK rig in the lab, plus a USB camera, plus a calibrated Lego workspace, plus a person who has to hand-reset twelve bricks 300 times. At 30 s per demo that's 2.5 hours of attention from a human. If we want 1,000 demos, that's a day. If we want a thousand demos with randomised brick positions to teach the policy to lateralise, that's a week of nothing but data collection — before we've trained anything.

Simulation collapses that. In MuJoCo, a single CPU container can drive 10 episodes through a scripted picker in about 10 minutes. Fan out across 30 containers on Modal and the whole 300-episode dataset is done in 10 wall-clock minutes for ≈ $5. The same loop on real hardware would take days and cost engineer-hours.

What simulation is bad at

Three things, mostly:

Visual realism. Our wrist-cam frames look like clean computer graphics — flat lighting, sharp polygon edges, zero noise. A real GoPro frame has motion blur, JPEG artefacts, rolling shutter, dust on the lens. A policy trained only in sim often fails on real frames it has never seen the noise floor of. This is called the sim-to-real gap.
Contact dynamics. MuJoCo's contact solver is good but not perfect. A real gripper closing on a brick has friction, deformation, slippage. The sim solver approximates all of these.
Sensor latency. Real cameras have 30–60 ms of capture-to-arrival lag, real servos have 5–20 ms of command-to- actuation lag. Simulation collapses all of these to zero. Policies that work in sim sometimes oscillate when wrapped in a real control loop because they were never trained to anticipate latency.

We accept these gaps for now because the question we're trying to answer in this project — "does bbox conditioning let us route a pick to a specific brick?" — is upstream of all of them. It's about whether the model architecturally can learn to point and pick, given clean signals. If yes, the next project ports it to a real robot and burns the sim-to-real fuel. If no, no amount of real-data engineering would have fixed it.

PART II · Robotics 101

03 · the arm

03The anatomy of a robot arm.

A robot arm is a chain of links (rigid metal tubes) connected by joints (motors). At one end of the chain is the base, bolted to the floor or a table. At the other end is the end-effector — the "hand", which in our case is a gripper.

Each joint has one or two degrees of freedom (DOF) — one number that describes its configuration. A rotating joint has 1 DOF: the rotation angle. A sliding joint has 1 DOF: the slide distance. Stack up six rotating joints in series and you've got a 6-DOF arm — enough to put the end-effector at any reachable position and rotate it to any reachable orientation. This is the standard configuration for industrial manipulators, and it's what the UR5e is.

Why six joints?

A pose in 3D space has six independent numbers — three for position (x, y, z) and three for orientation (roll, pitch, yaw). To freely choose all six, you need at least six independent rotation axes. Fewer joints means some poses you can think of but can't physically reach. More joints (7-DOF, 8-DOF) gives redundancy: multiple joint configurations that all yield the same end-effector pose, handy for avoiding obstacles or staying inside joint limits.

The UR5e specifically

Universal Robots' UR5e: six revolute joints, ≈ 1 m of reach, ≈ 5 kg payload, 6-axis force/torque sensor built into the flange. We picked it because it's well-supported in MuJoCo's menagerie (a curated set of physically-accurate robot models), and because it's the same arm a lot of real lab work uses. The kinematics convention is:

shoulder_pan — rotates the whole arm about the vertical world axis
shoulder_lift — pitches the shoulder up/down
elbow — bends at the elbow
wrist_1, wrist_2, wrist_3 — three rotations that orient the flange

Plus the gripper has one more DOF — the finger separation. So our full state vector is 7 numbers: [q1, q2, q3, q4, q5, q6, gripper].

04 · forward kinematics

04Forward kinematics — joints to pose.

Given the seven joint numbers, where is the gripper tip in the world? That's forward kinematics (FK). It's pure geometry — no physics, no learning. You compose six rigid-body transformations (one per joint, each parameterised by that joint's angle), apply them in series starting from the base, and out pops the homogeneous transformation matrix T_ee ∈ ℝ^4×4 describing the end-effector's pose:

T_ee(q) = T_0→1(q₁) · T_1→2(q₂) · … · T_5→6(q₆) · T_6→ee

Each T_i→i+1(q_i) is a 4×4 matrix encoding "rotate by q_i about this joint's axis, then translate by the rigid link offset to the next joint." MuJoCo does this calculation for you every physics step — it's available in data.xpos[body_id] (world position) and data.xmat[body_id] (orientation matrix).

A 2D toy you can play with

Below is a simplified 2-link planar arm — just two joints, each a single angle. You drag the sliders; the dots trace where the tip goes. Six-DOF UR5 has the same idea, just with more rotations composed.

interactive · forward kinematics demo

θ₁ (rad)0.40 θ₂ (rad)-1.00

drag both sliders — the green dot is the end-effector position computed by FK; the orange trail is the last 100 positions.

# 2D forward kinematics — toy version of what UR5 FK does in 3D.
def fk_2link(theta1, theta2, L1=0.4, L2=0.35):
    # joint 1 at origin, rotates by theta1.
    x1 = L1 * cos(theta1)
    y1 = L1 * sin(theta1)
    # joint 2 at (x1, y1), rotates by theta1+theta2 in world frame.
    x2 = x1 + L2 * cos(theta1 + theta2)
    y2 = y1 + L2 * sin(theta1 + theta2)
    return (x2, y2)

Three things to notice from playing with the demo: (1) for any pair of joint angles there's exactly one tip position — FK is a function, never ambiguous. (2) the set of reachable tip positions is an annulus (donut) of inner radius |L1−L2| and outer radius L1+L2. (3) two very different joint configurations can put the tip at the same place — once we go backwards from "tip here" to "joints how?" we'll have to pick one. That's what chapter 5 is about.

05 · inverse kinematics

05Inverse kinematics — pose to joints.

Now imagine you have a target — "put the tip here, oriented like so." What joint angles do you set? That's inverse kinematics (IK), and it's much harder than FK:

Multiple solutions. A 6-DOF arm reaching some pose typically has up to 8 valid joint configurations (elbow up / down, shoulder in / out, wrist flipped, etc).
No solutions. If the target is outside the arm's reach, or behind it, or blocked by a joint limit, no configuration works at all.
Continuity. Even if a target is reachable, the path from the current joint configuration to a target configuration may pass through a singular pose where small joint motions cause huge tip motions. You'd like to avoid those.

The Jacobian-based approach

The classical way to solve IK is iterative: start from the current joint configuration q, compute the FK to find the current tip pose, compute the error between current and target, and take a small step in joint space that reduces the error. The step direction comes from the Jacobian — a 6×n matrix that linearises the FK function locally:

Δpose ≈ J(q) · Δq

Invert that relationship and you get the joint step that would reduce the pose error by the right amount:

Δq = J⁺(q) · Δpose

where J⁺ is the pseudo-inverse (Moore–Penrose) of the Jacobian. Repeat that until error is small. In practice you regularise the inverse to stay stable near singularities, you add posture priors so the arm prefers "natural" configurations, and you add joint limits as soft penalties.

Just use a library

We don't write this ourselves. Google DeepMind's mink is a clean Python IK library for MuJoCo: differentiable, supports multiple tasks (frame, posture, joint-limit), uses a quadratic program solver (daqp) under the hood. Our scripted demonstrator uses it like this:

import mink

# Set up once.
config = mink.Configuration(model)
ee_task = mink.FrameTask(
    frame_name="flange", frame_type="body",
    position_cost=1.0, orientation_cost=0.5,
)
posture_task = mink.PostureTask(model, cost=0.01)
solver = "daqp"

# Every tick.
ee_task.set_target(target_SE3)             # desired flange pose
posture_task.set_target_from_configuration(config)
for _ in range(20):
    dq = mink.solve_ik(
        config, [ee_task, posture_task],
        dt=0.01, solver=solver,
    )
    config.integrate_inplace(dq, dt=0.01)
q_target = config.q                       # the desired joint config

Drag-the-target demo

Same 2-link arm, now upside down — the slider is gone. Click and drag anywhere inside the reachable annulus and the arm solves IK to put its tip on your cursor. You'll see how the joints move continuously toward the goal, and how they snap discontinuously when you cross between the two solution branches.

interactive · inverse kinematics by gradient descent

elbow L1, L20.40, 0.35 m

drag inside the donut. When unreachable, the arm stretches toward the click but doesn't quite get there — which is exactly what real IK solvers do on out-of-reach targets.

06 · the gripper

06The gripper, in mechanical detail.

Bolted to the UR5e's flange is the UMI gripper — a parallel-jaw design used by Columbia Robotics as the standard portable data-collection rig for imitation learning. Two wedge-shaped fingers slide together (or apart) along a single linear axis, driven by an internal screw. The wedge shape (rather than a flat surface) is deliberate: it lets you pinch oddly-shaped objects on a single axis without needing dexterous finger articulation.

The gripper exposes one DOF — the distance between the two fingertips, in metres. Fully open is ≈ 4.5 cm, fully closed is 0. We send the target separation as the seventh number in our action vector; the sim physics takes care of contacts and slippage.

The wrist-mounted GoPro

Above the gripper, looking down and forward, is a small GoPro Hero 10 camera. The real UMI rigs put a GoPro on the gripper to capture demonstrations from the wrist's perspective — the camera moves with the hand, so the policy gets a first-person view of the workspace. In simulation we replicate this by adding a MuJoCo camera attached to the wrist body, with a 155° vertical field of view (matching the real GoPro Linear setting).

A wide-FOV camera at the wrist means the brick is visible from a great distance away from the home pose. The downside is heavy fisheye distortion near the image edges. We apply an equidistant-fisheye warp in software to mimic the GoPro's optical distortion, so the policy sees the same kind of warped image it would see on a real rig.

Gripper polarity gotcha

Watch the sign. The Menagerie UMI XML has the gripper actuator wired such that ctrl = 0 means fully OPEN, and ctrl = 0.05 means fully CLOSED — the opposite of what one would intuit. The internal tendon's joint axis is inverted relative to the typical convention. We discovered this at 3 AM when the scripted picker was opening the gripper at grasp time and closing it on the lift. Now it's a comment block at the top of collect_ik_demos.py.

PART III · MuJoCo & the scene

07 · mujoco

07MuJoCo in ten minutes.

MuJoCo (Multi-Joint dynamics with Contact) is an open-source rigid-body physics engine. Imagine Bullet or PhysX, but optimised for articulated robots, with a soft-contact model that's stable at large timesteps and a Python API that hands you raw access to every state variable.

The mental model:

You write an XML file describing the world: a tree of bodies, each with a joint connecting it to its parent, geometry for visualisation and collision, and inertial properties.
MuJoCo parses that into a MjModel (the static description) and creates a MjData (the live state — positions, velocities, contact forces).
Calling mj_step(model, data) advances time by one integration step (we use 4 ms, with 20 inner steps per 80 ms "control tick" = 12.5 Hz control).
You can write to data.ctrl (actuator commands), read from data.qpos (joint positions), data.xpos (body world positions), and so on.

A minimal XML scene

<mujoco>
  <option timestep="0.004" gravity="0 0 -9.81"/>
  <worldbody>
    <!-- a fixed table -->
    <geom type="box" size="0.6 0.4 0.02"
          pos="0 -0.55 0.43" rgba="0.6 0.4 0.2 1"/>
    <!-- one free-body brick -->
    <body name="brick" pos="0 -0.55 0.46">
      <freejoint/>
      <geom type="box" size="0.016 0.032 0.012"
            rgba="0.8 0.1 0.1 1"/>
    </body>
  </worldbody>
</mujoco>

Load it from Python:

import mujoco

model = mujoco.MjModel.from_xml_path("scene.xml")
data  = mujoco.MjData(model)
mujoco.mj_resetData(model, data)

for step in range(100):
    mujoco.mj_step(model, data)
    print("brick z:", data.qpos[2])    # falls onto the table

That's the entire loop. Add a camera and a renderer and you get images; add an arm and actuators and you get a controllable robot.

Why MuJoCo not Bullet, Gazebo, IsaacSim?

Speed. Single-CPU simulation of a 6-DOF arm runs at > 1000× real time on a laptop.
Stable contacts. The soft-constraint solver rarely explodes even at large timesteps; the alternative engines tend to need much smaller dt to stay stable, blowing the speed advantage.
Ecosystem. The menagerie has well-calibrated XML for most robots. No need to model the UR5 from URDF kinematics scratch.
License. Apache 2.0 since DeepMind open-sourced it in 2021.

The downside vs IsaacSim: no native GPU parallelisation (until very recently with MJX), and no built-in photorealistic renderer. For our project that's fine — we just need the wrist GoPro camera, which we render with MuJoCo's built-in rasteriser at 224×224 in a fraction of a millisecond.

08 · the scene

08Building our specific scene.

Our scene XML is generated by a Python script (scripts/build_menagerie_scene.py) rather than written by hand. The script uses MuJoCo's MjSpec API — a programmatic scene builder — so we can adjust dimensions and layouts without editing XML by hand. The skeleton:

import mujoco

spec = mujoco.MjSpec()
spec.from_file("third_party/mujoco_menagerie/universal_robots_ur5e/ur5e.xml")
wb   = spec.worldbody

# Table
table = wb.add_body()
table.add_geom(type="box", size=[0.6, 0.4, 0.02],
                pos=[0, -0.55, 0.43],
                rgba=[0.6, 0.4, 0.2, 1])

# UMI gripper, mounted to the UR5e flange in a specific orientation
umi_spec = mujoco.MjSpec().from_file(".../umi_gripper.xml")
spec.attach(umi_spec, parent_name="flange", ...)

# Twelve coloured Lego bricks on a 3×4 grid
for name, color, (x, y) in LEGO_LAYOUT:
    body = wb.add_body(name=name, pos=[x, y, 0.452])
    body.add_freejoint()
    body.add_geom(type="mesh", meshname="lego_brick", rgba=color)

# Cameras
wb.add_camera(name="overhead_cam", pos=[0.6, -0.1, 0.95], ...)
# Wrist GoPro camera is added INSIDE the UMI gripper body, so it
# moves with the gripper.

spec.compile()
spec.to_file("assets/mjcf/tabletop_pick_menagerie.xml")

The gotchas we hit (so you don't)

UMI gripper home orientation. The Menagerie UMI XML defines the gripper with the fingers pointing along its own +Z. To make the fingers hang straight down when bolted on the UR5e flange, we had to rotate the gripper body by quaternion (-1, 1, 0, 0) (= 180° about flange's local Y axis). This took an evening to figure out from looking at side-on screenshots.
Wrist→fingertip offset. The wrist body and the gripper fingertips are not at the same point — the gripper sticks out ≈ 26 cm below the flange. Our scripted picker has to subtract this offset when translating "I want the fingertip here" into "I want the wrist here." Wrong value → 0 / 8 pick rate. Correct value (−0.26 m along world −Z) → 30 / 30 pick rate.
Spawn z = 0.457 m (5 mm above the 0.452 m table top). Bricks fall a few mm onto the table and settle. Spawn higher and they bounce; spawn at exactly 0.452 and they clip into the table at t = 0 and the solver oscillates.

The final scene has the arm, the gripper, twelve Lego bricks at settled positions, a table, an overhead third-person camera (for visualisation), and a wrist-mounted GoPro (for the policy's observations).

09 · cameras & projection

09How cameras "see" in a sim.

A MuJoCo camera has a position, an orientation, and a vertical field of view (fovy). To render an image, MuJoCo:

Builds a 4×4 view matrix from the camera's world pose.
Builds a 4×4 projection matrix from fovy and the aspect ratio.
Rasterises every visible geom: for each pixel, finds which geom is in front, looks up the geom's material colour and lighting, writes the pixel.

For us the key question is the reverse: given a 3D point in the world (a Lego brick), what pixel does it land on? Without that we can't compute the "bbox of the target brick" that we'll use as the policy's prompt.

The pinhole projection formula

MuJoCo's cameras are pinhole cameras — they treat the lens as an infinitesimal point, with rays going through it in straight lines. To project a world point P:

Translate so the camera is at the origin: P' = P − C where C is the camera's world position.
Rotate into camera frame: p_cam = R^T · P' where R is the camera's rotation matrix (columns are the camera's local axes expressed in world coords). The convention is: camera looks along its local −Z, with local +Y up.
Project: a point at camera-frame coords (x_c, y_c, z_c) with z_c < 0 (in front of camera) lands at normalised image coords (−x_c/z_c, −y_c/z_c).
Convert to pixels: with image height H and fovy α, the focal length in pixels is f = (H/2) / tan(α/2), and pixel coords are (u, v) = (f · x_norm + W/2, f · y_norm + H/2).

p_cam = R^T · (P − C) ; u = (H/2) · (−x_c / z_c) / tan(α/2) + W/2

Code in our project

# tools/collect_umi_data.py — exact code we use to bbox bricks.
def _project_brick_bbox(model, data, brick_bid, cam_id, image_hw):
    body_pos    = data.xpos[brick_bid]
    body_mat    = data.xmat[brick_bid].reshape(3, 3)
    # 8 corners of the brick AABB, in world frame
    corners_local = _AABB_CORNER_SIGNS * _BRICK_HALF      # (8, 3)
    world_corners = body_pos + (body_mat @ corners_local.T).T

    cam_pos  = data.cam_xpos[cam_id]
    cam_mat  = data.cam_xmat[cam_id].reshape(3, 3)
    rel      = world_corners - cam_pos
    cam_coords = rel @ cam_mat                  # = (R.T @ rel.T).T

    cz = cam_coords[:, 2]
    in_front = cz < -1e-6
    if not np.any(in_front): return np.zeros(4, dtype=np.float32)

    fovy = float(model.cam_fovy[cam_id])
    f    = (image_hw / 2.0) / np.tan(np.deg2rad(fovy) / 2.0)
    x_img = -cam_coords[in_front, 0] / cam_coords[in_front, 2]
    y_img = -cam_coords[in_front, 1] / cam_coords[in_front, 2]
    u = f * x_img + (image_hw - 1) / 2.0
    v = f * y_img + (image_hw - 1) / 2.0
    # Clip to image bounds + normalise.
    u_min, u_max = max(0., float(u.min())), min(image_hw-1, float(u.max()))
    v_min, v_max = max(0., float(v.min())), min(image_hw-1, float(v.max()))
    return np.array([u_min, v_min, u_max, v_max]) / (image_hw - 1)

Output is a 4-vector [x_min, y_min, x_max, y_max] in [0, 1]. We compute this every tick during data collection AND every tick during inference; that 4-D vector goes into the policy as observation.environment_state.

Fisheye on top of pinhole

A real GoPro is not a pure pinhole — it has heavy barrel distortion at the edges, especially in the 155° "Linear" mode we mimic. We apply an equidistant fisheye warp to the rendered pinhole image to fake this distortion. The warp is a fixed pixel remap that bends straight lines outward at the edges.

Asymmetry we'd later regret. The image gets warped, but the bbox is computed on the un-warped pinhole projection. So the bbox numbers we send to the policy don't pixel-align with the fisheye image. Both train and inference do it the same way, so the policy learns the consistent (but slightly off) mapping. For an actual visual overlay (chapter 10), this matters.

10 · the prompt

10Pointing at a brick, formally.

How do you tell a policy "this one?" Three options:

option A · text

"pick the orange brick on the right". Needs a vision-language model to ground the text against the image — expensive, slow, sometimes wrong. SmolVLA does this but at the cost of 450 M params and a tokenizer step in the inference loop.

option B · click coordinates

"the brick at pixel (165, 220)". A single point. Spatially precise but the policy can't tell how large the object is or how it's oriented.

option C · bounding box

"the brick inside this rectangle, [x_min, y_min, x_max, y_max]." A single 4-D vector. Spatially dense, gives implicit size info. Same shape a real detector would output. This is what we use.

option D · object id

"brick #4." Needs the policy to internally track which object is which, which it can't because the input is just an image. Brittle.

We use option C, the bbox. In sim we have ground-truth — we know each brick's 3D position so the projection of chapter 9 gives us perfect bboxes. In a real-robot deployment the bbox would come from a detector like GroundingDINO: you'd type "orange brick" into a textbox, the detector outputs a bbox, and that bbox feeds the policy. Same interface.

Click-to-pick demo

The widget below is a top-down view of our actual scene layout — twelve coloured bricks in a 3×4 grid. Click any of them: the right- hand panel shows the wrist GoPro's view (a top-down projection approximation) with the bbox of the brick you picked drawn in green. These are the literal four numbers we'd send to the policy.

interactive · scene + wrist-cam bbox

overhead view — click a brick

wrist cam + bbox

the four numbers under "bbox" are exactly the payload we POST to the policy as observation.environment_state every tick.

PART IV · Demos & data

11 · behaviour cloning

11Behaviour cloning — copy what works.

There are three big ways to teach a robot to do a task: behaviour cloning (BC), reinforcement learning (RL), and model-based planning. We do BC. It's the simplest, the most data-efficient, and the easiest to debug. The recipe:

Get some expert trajectories — sequences of (observation, action) pairs from someone who already knows how to do the task.
Train a neural net to predict the expert's action given the expert's observation. This is plain supervised learning.
At deployment, run the trained net on live observations and execute its predictions.

Mathematically:

minimise_π 𝔼_{(o, a*)~𝒟} [ ‖π(o) − a*‖² ]

Where 𝒟 is the demo dataset, o is an observation, a* is the expert action, and π is our policy. We pick the policy parameters that minimise the mean-squared error between predicted and expert actions, averaged over the dataset.

Why BC fails in obvious ways

The Achilles' heel of BC is covariate shift. The policy is trained on observations drawn from the expert's distribution. At deployment, the policy itself decides what observations it sees next — and the moment it makes a small mistake, the next observation lands slightly off the training distribution. The policy has never seen this kind of obs, so its prediction is slightly worse, the next obs drifts further, and so on. Errors compound. Ross & Bagnell (2010) showed this scales like O(ε · T²) with episode length — quadratic in horizon.

Our episodes are 400 ticks. Even with a per-step error rate of 1% that gives an expected total error budget of 400² × 0.01 = 1600 — obviously trajectories don't blow up by 1600× because the policy sometimes recovers, but the math explains why "train on 99% accurate BC" still gives "deploy fails 100% of the time" if the task is long.

Tricks people use

DAgger — Dataset Aggregation. Run the trained policy in the env, ask the expert to relabel its actual trajectories ("what would you have done from here?"), add those to the dataset, retrain. Repeat. Directly attacks covariate shift. We didn't do this; chapter 36 says we should have.
Action chunking. Predict the next 50 actions in one shot instead of one. Each chunk is internally coherent — the policy commits to a multi-step trajectory before re-observing. We do this; chapter 15 has the details.
Stronger generative heads. If the demo distribution is multi-modal (multiple valid ways to do the task), MSE loss collapses to the mean of the modes, which is often a bad action. Diffusion / flow matching / VQ heads model the full distribution. Chapter 18 covers diffusion.

12 · scripted demos

12Writing a scripted demonstrator.

Where do the expert trajectories come from? Three options: human teleop (slow, expensive), scripted IK (us), or a learned expert (chicken-and-egg). We picked the second because in simulation we know each brick's exact 3D pose, so we can hand-write a four-phase plan:

phase 1

hover · 15 cm above the brick, fingers open

→

phase 2

descend · move down to grasp height

→

phase 3

close · drive gripper to 0.022 m

→

phase 4

lift · raise the brick 25 cm

Each phase converts to a TCP-pose target for the wrist. mink IK turns that into joint angles. A small per-tick joint-velocity cap keeps motion smooth. The whole thing is ≈ 280 lines of Python (scripts/collect_ik_demos.py). The key loop:

def run_one_episode(env, *, seed, on_step):
    obs, info   = env.reset(seed=seed)
    target_name = info["target_object"]
    target_pos  = env.target_object_pos.copy()

    # Phase targets, fingertip-relative
    FT_OFFSET = np.array(UR5UMIRobot.WRIST_TO_FINGERTIP_OFFSET)  # (0,0,-0.260)
    HOVER_FT  = np.array([0, 0, 0.15])
    GRASP_FT  = np.array([0, 0, 0.020])
    LIFT_FT   = np.array([0, 0, 0.25])
    pick_quat = np.array([0.7071, -0.7071, 0, 0])  # wrist hangs DOWN

    q_arm = env._data.qpos[:6].copy()
    ik    = _build_mink_solver(env._model)

    # PHASE 1 — hover
    target_wrist = target_pos + HOVER_FT - FT_OFFSET
    for tick in range(60):
        q_arm = _step_to(env, ik, q_arm, target_wrist, pick_quat,
                          grip=GRIP_OPEN, max_dq_deg=5)
        on_step(env._get_obs(), _action_vec(env, q_arm))

    # PHASE 2 — descend
    target_wrist = target_pos + GRASP_FT - FT_OFFSET
    for tick in range(80):
        q_arm = _step_to(env, ik, q_arm, target_wrist, pick_quat,
                          grip=GRIP_OPEN, max_dq_deg=5)
        on_step(env._get_obs(), _action_vec(env, q_arm))

    # PHASE 3 — close
    for tick in range(40):
        q_arm = _step_to(env, ik, q_arm, target_wrist, pick_quat,
                          grip=GRIP_CLOSED, max_dq_deg=5)
        on_step(env._get_obs(), _action_vec(env, q_arm))

    # PHASE 4 — lift
    target_wrist = target_pos + LIFT_FT - FT_OFFSET
    for tick in range(60):
        q_arm = _step_to(env, ik, q_arm, target_wrist, pick_quat,
                          grip=GRIP_CLOSED, max_dq_deg=5)
        on_step(env._get_obs(), _action_vec(env, q_arm))

    success = env.target_object_pos[2] > target_pos[2] + 0.10
    return {"success": success, "target": target_name}

Each tick the on_step callback records the observation and the action. That callback writes to our recording buffer; after the episode it flushes everything to disk in the UMI zarr format. Total: 240 frames per episode, ≈ 9.6 s of motion at 25 Hz.

A subtle but important data choice.

The action we log is action[t] = state[t] — the current TCP pose. We do not log "the target the IK was trying to reach" or "the joint deltas." This means the trained policy will be predicting future poses of the wrist (because the action chunk extends 50 ticks ahead in the dataset). In chapter 30 we'll see why this absolute-pose choice is what bottlenecks our results.

13 · scale

13Collecting 300 episodes on Modal CPU.

One episode takes ≈ 25 seconds wall-clock (most of it MuJoCo simulation). 300 episodes serial = 2 hours. We don't have 2 hours of patience, so we fan out across 30 CPU containers on Modal:

@app.cls(
    image=image, cpu=4.0, memory=8192,
    volumes={RUNS_DIR: runs_vol},
    timeout=3600,
    scaledown_window=120,
    max_containers=64,
)
class UMIDemoWorker:
    @modal.enter()
    def _setup(self):
        runs_vol.reload()

    @modal.method()
    def collect_chunk(self, run_id, worker_id, n_episodes, base_seed, ...):
        env    = TabletopPickEnv(EnvConfig())
        out    = EpisodeRecorder(env, image_hw=224)
        worker_dir = pathlib.Path(RUNS_DIR) / run_id / f"worker-{worker_id:03d}"
        worker_dir.mkdir(parents=True, exist_ok=True)

        for ep in range(n_episodes):
            seed = base_seed + ep
            result = run_one_episode(env, seed=seed, on_step=out)
            if result["success"]:
                _write_demo_folder(worker_dir / f"demo_sim{ep:04d}_seed{seed}", out)
                out.flush_episode()

        # Pack all this worker's frames into one zarr.zip
        write_umi_zarr(out, worker_dir / "dataset.zarr.zip")
        runs_vol.commit()
        return {"zarr_path": str(worker_dir / "dataset.zarr.zip"), ...}

From my laptop:

$ modal run --detach infra/collect_umi_app.py::main \
      --n-workers 30 --episodes-per-worker 10

Modal spins up 30 containers in parallel (it'll wait if the workspace is at quota), each runs 10 episodes, each writes its own zarr to a persistent volume. After all workers finish, an orchestrator function calls _merge_zarrs(...) to concatenate them into one big dataset.zarr.zip.

What went wrong (twice)

Worker timeouts. Initial worker timeout was 30 min. Once we randomised brick spawns, the scripted IK sometimes took longer to settle, and a couple of workers blew past 30 min and were killed. map(return_exceptions=False) then cancels all sibling workers — even ones that already finished. Painful. Mitigation: bump worker timeout to 60 min, reduce eps-per-worker so individual workers finish well within budget.

Orchestrator timeout. 30 workers × 10 eps × ≈ 30 s per ep = 2½ hr if serial; with 30-way parallelism it's ≈ 12 min. But the orchestrator's timeout=7200 (2 hr) wasn't long enough once we randomised the layout because individual pick reliability dropped (more variation = more failed picks = longer episodes). Fix: bump orchestrator timeout to 14400 s, and add a merge_only() function that can re-merge from already-collected worker zarrs without re-running episodes.

Final yield, after fixing both: 300 episodes, 102,000 frames, 100% scripted-IK pick rate. Cost: ≈ $5 of Modal CPU.

14 · the data format

14From UMI zarr to LeRobot v3 parquet.

Our scripted picker writes the UMI on-disk format: a dataset.zarr.zip archive (one multi-array zarr, JpegXL-compressed image stream, flat numeric arrays for state / bbox) plus a folder of per-demo MP4 videos for quick visualisation.

LeRobot, the framework we use for training, wants the LeRobot v3 format instead:

build/lerobot_ds_300/
├── data/chunk-000/file-000.parquet         # numeric columns per frame
├── videos/observation.images.wrist_cam/chunk-000/file-000.mp4
├── meta/
│   ├── info.json                           # feature schema
│   ├── stats.json                          # per-feature global stats
│   ├── episodes/chunk-000/file-000.parquet # per-episode index
│   └── tasks.parquet                       # task-text lookup

The columns in data/.../file-000.parquet (one row per frame):

observation.state — fixed_size_list<float, 7> (joint angles + gripper)
observation.ee_pos — fixed_size_list<float, 3>
observation.environment_state — fixed_size_list<float, 4> (the bbox)
action — fixed_size_list<float, 7>
frame_index, episode_index, timestamp, task_index

Images live in mp4 video files, one per camera per chunk. LeRobot's dataloader uses PyAV to seek into the video by frame_index. Much smaller than 102k individual PNGs.

The converter

tools/umi_zarr_to_lerobot.py walks the zarr, reads each frame's image and numerics, and feeds them to a LeRobotDatasetWriter we wrote (data/lerobot_writer.py). The writer keeps an in-memory buffer of frames and at finalize():

Writes data/chunk-000/file-000.parquet.
Encodes the image stream into file-000.mp4 using imageio-ffmpeg.
Writes per-episode metadata to meta/episodes/chunk-000/file-000.parquet.
Writes meta/info.json (the schema) and meta/stats.json (per-feature min/max/mean/std).

Push to Hub

$ python tools/push_lerobot_to_hub.py build/lerobot_ds_300 \
      --repo-id ozyphus/cs224-lego-umi-rand-300 --branch v3.0

The push uses huggingface_hub.HfApi.upload_folder to create a dataset repo under ozyphus/, then uploads everything to both the main branch and a v3.0 branch (LeRobot reads from v3.0 by convention). The Modal training jobs can then stream the dataset directly from the Hub.

Three subtle writer bugs we hit (and you might). (1) Older LeRobot expected paths like data/episode_chunk_000/episode_000.parquet; v3 uses data/chunk-000/file-000.parquet. Easy to miss when porting code. (2) meta/episodes/.../file-000.parquet wants a data/chunk_index and data/file_index column pointing back into the data parquets — easy to forget. (3) meta/tasks.parquet wants a pandas DataFrame whose index name is the task string, not a column. SmolVLA's tokenizer hard-fails with "task cannot be None" if you get this wrong.

15 · action chunking

15Action chunks — why predict 50 at a time.

A naive BC policy predicts one action per observation. At inference you call π(obs_t) → a_t, execute a_t, get obs_t+1, repeat. This works in theory, but it gives the policy no commitment — each step it picks the best action right now, ignoring the trajectory it's on. If the demo data has small high-frequency jitters (it does — physics integration plus IK noise), the policy learns to predict those jitters too and the resulting motion looks like a drunk man trying to thread a needle.

Action chunking, popularised by ACT (Aloha), fixes this. Instead of one action per call, the policy predicts a chunk of k future actions in one shot:

[a_t, a_t+1, …, a_t+k−1] = π(o_t)

The trained policy returns a coherent multi-step plan from a single observation. The runtime keeps a queue of the predicted chunk and pops one action per env tick. After n_{action_steps} pops (a parameter, ≤ k), it discards the rest of the chunk, re- observes, and predicts a new chunk. We use k = 50 for ACT, k = 16 for Diffusion.

Why this helps

Smoother motion. Within a chunk the policy commits to a low-noise plan instead of swinging on every tick.
Implicit smoothing across modes. If the demo has two valid sub-policies (e.g. left vs right approach), a one- step policy averages them (and goes through the middle, hitting nothing). A chunk policy can pick one mode at chunk-prediction time and stay committed.
Lower call rate. If you re-predict every 8 ticks instead of every 1, you've cut the inference HTTP load by 8×.

The chunk visualiser

Below is a synthetic 50-step action chunk — exactly the shape an ACT model would emit. Each line is one of the 7 action dimensions. Click play; the dashed vertical line is the "head" — what's currently being executed. After tick 49 the queue empties and a new chunk is fetched (starts over here for the demo).

interactive · action chunk visualiser

tick t = 0

one observation in, 50 actions out. The bold red curve is the gripper-width target — note that it stays open until ≈ t = 30, then closes. The descending green curve is the z-coordinate of the target wrist pose.

PART V · Models

16 · building blocks

16The four neural-net building blocks we need.

Before diving into ACT and Diffusion Policy, let's pin down the four primitives both use. If you've trained a CNN or a transformer this chapter is a refresher; if not, this is the bare minimum to make sense of the rest.

1 · Multi-layer perceptron (MLP)

A pile of linear layers with a non-linearity (ReLU, GELU) between them. Maps a vector in to a vector out. We use MLPs everywhere small things need to be embedded — the 7-D state, the 4-D bbox, the time-step embedding in diffusion.

import torch.nn as nn
mlp = nn.Sequential(
    nn.Linear(4, 64), nn.GELU(),
    nn.Linear(64, 64), nn.GELU(),
    nn.Linear(64, 128),
)
out = mlp(bbox_tensor)             # shape (B, 128)

2 · ResNet image encoder

A convolutional network that turns a 224×224 RGB image into a small feature map (e.g. 7×7×512). The convolutions slide small filters across the image, building up local-to-global features through successive downsampling. We use the pre-trained ResNet18 (≈ 11 M params, trained on ImageNet) because it's small, fast, and the early ImageNet features generalise surprisingly well to robot frames.

import torchvision.models as tv
backbone = tv.resnet18(weights="DEFAULT")
# strip the final classification head, keep up to conv5 features
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])
img_feat = feature_extractor(image)   # shape (B, 512, 7, 7)

3 · Transformer encoder / decoder

A stack of self-attention blocks. Each block lets every token in a sequence look at every other token, weighted by a learned similarity score. The transformer's strength is modelling long-range dependencies — exactly what we need to relate a bbox in the image to the gripper's joint state.

For our purposes the key parts are:

Take a list of tokens (vectors of dim 512).
Add positional embeddings so the model knows order.
For L layers, run token = token + Attention(token); token = token + MLP(token).
Output: the same number of tokens, each refined.

4 · U-Net (for Diffusion)

A specific CNN architecture that maps an input to an output of the same shape, with a "U" of downsampling-then-upsampling layers and skip connections between symmetric levels. Diffusion Policy's U-Net takes a noisy action chunk and predicts the noise. We'll meet it in chapter 18.

Everything else the policies do — normalisation, layer norm, positional encodings, dropout — is plumbing on top of these four. If you grasp these, you'll be able to read the LeRobot source code.

17 · act

17ACT — the Action Chunking Transformer.

ACT, from Tony Zhao's ALOHA paper, is the simplest end-to-end imitation-learning model that "just works" on a 6-DOF manipulator. Three components glued together:

encode

image → ResNet18 → 49 patch tokens. state → MLP → 1 token. bbox → MLP → 1 token. transformer encoder mixes them all.

→

VAE prior

(at train time only) encode the demo action chunk into a 32-D latent z; KL-regularise. At inference, z is sampled as zero.

→

decode

transformer decoder, queried by 50 learned position tokens (one per future step), cross-attends to encoder tokens, emits 50 × 7-D actions.

Why a VAE?

The demo dataset can have multiple valid trajectories from the same starting state — pick from the left, or from the right; close fast, or close slow. If we just regressed actions on observations, MSE would average those modes, producing a "between-modes" action that might not be valid (the gripper closing on empty air between two bricks). The VAE encoder bakes the demo trajectory into a latent z, so the decoder is conditioned on "this kind of trajectory". At training time z comes from encoding the actual demo; at inference time z is just zeros and the model uses the mode that fell out of training.

The forward pass, in code

# Pseudocode based on lerobot/policies/act/modeling_act.py
class ACT(nn.Module):
    def __init__(self, config):
        self.image_encoder    = ResNet18Embed(out_dim=512)
        self.state_mlp       = nn.Linear(7, 512)
        self.env_state_mlp   = nn.Linear(4, 512)         # bbox
        self.transformer_enc = TransformerEnc(d_model=512, n_layers=4)
        self.transformer_dec = TransformerDec(d_model=512, n_layers=7)
        self.query_embed     = nn.Parameter(torch.randn(50, 512))   # 50 query slots
        self.action_head     = nn.Linear(512, 7)
        if config.use_vae:
            self.action_encoder = TransformerEnc(...)
            self.z_head         = nn.Linear(512, 64)            # μ, log σ
            self.z_proj         = nn.Linear(32, 512)

    def forward(self, batch):
        img_tokens = self.image_encoder(batch["image"])             # (B, 49, 512)
        s_tok      = self.state_mlp(batch["state"]).unsqueeze(1)     # (B, 1, 512)
        b_tok      = self.env_state_mlp(batch["bbox"]).unsqueeze(1) # (B, 1, 512)
        z_tok      = self._sample_z(batch).unsqueeze(1)            # (B, 1, 512)
        tokens     = torch.cat([s_tok, b_tok, z_tok, img_tokens], dim=1)
        memory     = self.transformer_enc(tokens)                  # (B, 52, 512)
        queries    = self.query_embed.unsqueeze(0).expand(B, -1, -1)  # (B, 50, 512)
        chunk_feat = self.transformer_dec(queries, memory)
        return self.action_head(chunk_feat)                       # (B, 50, 7)

What it does well

Fast. ~80 M params, runs at 50 ms / forward pass on an A10G — easily 20 Hz.
Stable. The chunk-based output gives smooth motion without complex post-filters.
Cheap to train. ≈ $1 of A10G time for our 32k steps; finished at loss 0.035 on the 300-ep dataset.

What it does badly (for us)

The bbox goes in as one of 52 tokens, alongside 49 image tokens that have 512 dimensions each. Self-attention can in theory let the bbox token pull selectively from image features, but in practice the image tokens dominate. Our sensitivity probe showed a full bbox sweep moves the predicted wrist position by ≤ 5 cm — about a third of a brick spacing. Chapter 32 has the full analysis.

18 · diffusion policy

18Diffusion Policy — DDPM for actions.

Diffusion Policy, from Chi et al. 2023, swaps ACT's transformer decoder for a Denoising Diffusion Probabilistic Model (DDPM) over the action chunk. It's heavier and slower than ACT, but better at multi-modal demos and tends to generalise more cleanly.

The denoising idea

Start with a clean target — a 16-step action chunk a₀ ∈ ℝ^16×7. Apply T = 100 increasingly noisy versions by adding Gaussian noise:

a_k = √ᾱ_k · a₀ + √(1 − ᾱ_k) · ε, ε ∼ 𝒩(0, I)

Where ᾱ_k is a noise schedule that decays from 1 (no noise) to 0 (pure noise) as k goes from 0 to T. Now we train a network ε_θ(a_k, k, cond) to predict the noise ε given the noisy chunk, the noise level k, and a conditioning vector (image+state+bbox embedding). The loss is just MSE on the predicted noise:

ℒ = 𝔼_{(a₀, cond, k, ε)} ‖ ε − ε_θ(a_k, k, cond) ‖²

Sampling at inference

Start from pure Gaussian noise a_T. Run T steps of denoising: at each step, predict the noise, subtract a scaled version, optionally re-add a smaller amount of noise (for DDPM) or none (for DDIM). After all steps you've got a clean action chunk a₀. That's the policy output.

interactive · DDPM denoising on a 1-D toy trajectory

step k = 100 noise → trajectory

Start: pure Gaussian noise (k = 100). Each "step" the network predicts the noise, subtracts a scaled version, and we re-add a small amount of fresh noise. After 100 iterations we land on a smooth descending curve — the policy's predicted z-trajectory.

The architecture in detail

# lerobot/policies/diffusion/modeling_diffusion.py — abridged
class DiffusionModel(nn.Module):
    def __init__(self, config):
        self.image_encoder = ResNet18Embed(out_dim=512)
        self.state_mlp     = nn.Linear(7, 64)
        self.env_state_mlp = nn.Linear(4, 64)              # bbox
        cond_dim = 512 + 64 + 64 + 128                # img + state + bbox + time
        self.unet = ConditionalUNet1D(
            in_dim=7, cond_dim=cond_dim,
            down_dims=[512, 1024, 2048],
        )
        self.scheduler = DDPMScheduler(num_train_timesteps=100)

    def predict_action_chunk(self, batch):
        cond = self._build_cond(batch)               # (B, cond_dim)
        a = torch.randn(B, 16, 7, device=device)     # pure noise
        for k in self.scheduler.timesteps:             # 100, 99, …, 1
            eps = self.unet(a, k, cond)                # predicted noise
            a = self.scheduler.step(eps, k, a).prev_sample
        return a                                  # clean chunk (B, 16, 7)

Why diffusion sometimes wins

Compared to ACT's regression head, diffusion captures multi-modal distributions natively — the noise sampling injects randomness that can land in any mode. Practically, this means smoother trajectories and better generalisation in our benchmarks (309 mm vs 542 mm reach on the same data). The downside is 100× more compute per inference call (≈ 150 ms vs 50 ms for ACT), but at our 25 Hz control rate that's still fast enough.

19 · smolvla & vlas

19SmolVLA — when you want language too.

Vision-language-action (VLA) models extend ACT/Diffusion with a pre-trained vision-language backbone that can understand text as part of the conditioning. The idea: the policy can be steered by natural-language commands ("pick the orange one"), not just bboxes.

SmolVLA (450 M params) is HuggingFace's compact VLA: it stacks an action flow-matching head on top of SmolVLM, a small (160 M) VLM that handles 224×224 images and short text prompts. The flow-matching head is similar to diffusion but one-step (faster to sample).

Why we tried it, and why we de-prioritized it

Conceptually, a VLA is the "right" interface for our task — the operator could just type "pick the orange Lego". In practice, on our 285-episode dataset:

SmolVLA-base is too big to fit comfortably on A10G; needs A100-40GB, ≈ 3× the cost per training hour.
Tokenisation of language at inference adds another 30 ms of latency.
The "task" string is identical across all 285 episodes ("pick the lego brick"), so the language input wasn't doing useful work — we'd need to label demos with brick-specific text to teach the model word-to-bbox grounding.
It still uses observation.environment_state the same way ACT does, so the bbox bottleneck would apply.

SmolVLA trained fine (final loss 0.013, 5 h on A100), but we never got a clean closed-loop number — the serve container kept missing the language tokenisation step (it's done in the preprocessor pipeline, which we initially didn't load — see chapter 25). By the time we patched it, results suggested it wasn't worth more iteration over Diffusion. The checkpoint is on the Modal volume if the next-iteration team wants to revive it.

PART VI · Training

20 · gradient descent

20Gradient descent, for the unfamiliar.

All three policies (ACT, Diffusion, SmolVLA) train the same way: stochastic gradient descent with Adam. If you've trained any neural net before you can skip this chapter.

The recipe

Define a loss function ℒ that takes the model's predictions and the dataset's ground truth and returns one number (lower = better). For BC, ℒ is MSE between predicted and expert actions. For diffusion, ℒ is MSE between predicted noise and added noise.
Sample a small batch (say 32 frames) from the dataset. Run the model forward to get predictions. Compute ℒ for the batch.
Backpropagate — compute ∂ℒ/∂θ for every model parameter θ. PyTorch handles this automatically.
Update: θ ← θ − η · ∇ℒ, scaled by Adam's per-parameter learning rate.
Repeat steps 2-4 for N iterations (we use ≈ 30,000).

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn   = nn.MSELoss()

for step in range(30_000):
    batch       = sample_batch(dataset, batch_size=32)
    pred_action = model(batch["obs"])
    loss        = loss_fn(pred_action, batch["action"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step}: loss={loss.item():.4f}")

What "training is done" means

Loss is monotonically going down (with noise). At some point the rate of decrease flattens — the model has roughly memorised the expert data and is now polishing. We hit:

arch	final loss	steps	time	cost
ACT	0.035	32 000	26 min	$0.17
Diffusion	0.001	30 000	1h 25m	$1.84
SmolVLA	0.013	20 000	5h 10m	$16.10

Diffusion's loss is much lower because it's predicting noise (which is well-bounded), whereas ACT is predicting raw action values which have wider variance. The absolute numbers aren't comparable across architectures.

21 · modal

21Modal — cloud GPUs from a Python decorator.

Renting GPUs is annoying. AWS gives you a raw EC2 instance to configure, RunPod is per-minute but you maintain the container, Lambda Labs has the cheapest H100s but it's a queue. Modal is different: you decorate a Python function with @app.function(gpu="L40S") and call it from your laptop. Modal handles container build, GPU allocation, file persistence, and tear-down. Per-second billing.

The mental model

App — a Python module containing functions and classes decorated with @app.function or @app.cls. You deploy the app (modal deploy) or run it (modal run); both build a container image, push it to Modal's registry, and register the functions in Modal's scheduler.
Image — a container image, defined declaratively: base image, apt installs, pip installs, env vars.
Volume — persistent disk storage, mounted at a path in your container. Reads need vol.reload(), writes need vol.commit().
Secret — environment variables (HF tokens, API keys), stored encrypted in Modal's KMS.
Function vs. class — a function spins up a container for one call. A class (@app.cls) holds state across calls (e.g. a loaded model), with @modal.enter() setup methods and @modal.method() RPC entry points.

Our training image

image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.4.1-devel-ubuntu22.04",    # devel so nvcc is present
        add_python="3.12",                          # LeRobot requires 3.12
    )
    .apt_install("git", "ffmpeg", "libgl1", "libglib2.0-0")
    .run_commands("pip install --upgrade pip setuptools wheel")
    .pip_install(
        "lerobot[pi,dataset,training,diffusion,smolvla] @ "
        "git+https://github.com/huggingface/lerobot.git@main",
        "huggingface_hub>=1.0.0",
        "wandb>=0.16",
        "num2words",                              # SmolVLM tokenizer dep
    )
    .env({
        "HF_HOME": f"{WEIGHTS_DIR}/hf",
        "TRANSFORMERS_CACHE": f"{WEIGHTS_DIR}/hf",
        "TOKENIZERS_PARALLELISM": "false",
    })
)

First build of this image takes ≈ 12 minutes (CUDA download + LeRobot's deep dep tree). Subsequent deploys use the cached layers and take ≈ 8 seconds.

Cost calculator

interactive · what does training cost?

        GPU
        
        hours / run
        
        1.25
      
        parallel runs
        
        6
      
        ≈ $9.75

Our actual mix across the project: 6 Diffusion runs (L40S, 1h 15m each = ~$10) + 6 ACT runs (A10G, 26 min each = ~$1) + 2 SmolVLA runs (A100, 5h each = ~$31). Total training spend: ≈ $80.

Detach for long jobs

A training run takes 1+ hours. We don't want the local terminal babysitting it. modal run --detach launches the container and exits locally; the run continues on Modal until completion. Logs are tailed via modal app logs <app-id>.

22 · lerobot trainer

22The LeRobot trainer in detail.

Inside our Modal function, the training is all LeRobot. The driver is 25 lines of glue:

def _run_train(policy_cfg, dataset_repo_id, name, num_steps, batch_size, lr, ...):
    from lerobot.configs.default import DatasetConfig
    from lerobot.configs.train import TrainPipelineConfig
    from lerobot.scripts.lerobot_train import train as run_train

    output_dir = Path(WEIGHTS_DIR) / "ft" / name
    policy_cfg.device       = "cuda"
    policy_cfg.optimizer_lr = lr
    policy_cfg.push_to_hub  = False

    train_cfg = TrainPipelineConfig(
        dataset        = DatasetConfig(repo_id=dataset_repo_id),
        policy         = policy_cfg,
        output_dir     = output_dir,
        job_name       = name,
        batch_size     = batch_size,
        steps          = num_steps,
        save_checkpoint= True,
        save_freq      = save_freq,
        log_freq       = log_freq,
        num_workers    = num_workers,
        tolerance_s    = 1e-3,                          # float32 timestamp slack
    )
    run_train(train_cfg)
    weights_vol.commit()

Inside run_train(train_cfg) LeRobot does:

Build the dataset. Stream the parquet+mp4 from HF Hub if not cached locally, else load from disk. Construct a PyTorch Dataset that yields windowed frames with action chunks.
Compute / load normalisation stats. Per-feature mean / std are read from meta/stats.json. These feed a NormalizerProcessor step that whitens inputs and outputs at train time.
Build the policy. Construct the chosen class (ACT / Diffusion / SmolVLA) with the chosen config. Wrap it in the pre/post-processor pipeline.
Train loop. Standard PyTorch DDP-ready loop with mixed-precision, EMA shadow weights for Diffusion, periodic checkpoint saves to output_dir / "checkpoints" / "{step}", plus a symlink "last" for convenience.
Save artifacts. Each checkpoint folder contains model.safetensors, config.json, and the preprocessor/postprocessor pipeline as safetensors + JSON. The full pipeline is saved next to weights so inference can recreate it.

The bookend-timestamp bug

tolerance_s = 1e-3 — easy to miss, but critical. LeRobot's dataset builder validates frame timestamps against the sampling rate (40 ms / frame at 25 Hz). Episode-boundary timestamps are float32, while per-frame are float64. The float32 timestamps lose precision once they cross ≈ 1380 s of accumulated time, causing spurious "timestamp drift" errors that fail dataset construction at epoch boundaries. Default tolerance (1e-4) is too tight; we bumped it to 1e-3 which is still well inside the 40 ms frame spacing.

23 · hyperparameters

23Hyperparameters that actually mattered.

Most LeRobot defaults are fine; you don't tune ResNet18 width or transformer depth. The knobs we touched, ranked by effect on our closed-loop numbers:

chunk_size / n_action_steps (ACT)

50 / 50: predict 50 actions, execute all 50 before re-predicting. Smaller (25 / 8) is more reactive but more wobbly. Larger (100 / 50) is wasteful given our 250-tick rollouts. 50 / 50 won.

horizon / n_action_steps (Diffusion)

16 / 8: predict a 16-step chunk, execute 8 before re-predicting. This is a 2× re-prediction rate vs ACT — diffusion needs it because its longer denoising path makes single-chunk plans noisier. We didn't tune past the LeRobot defaults.

batch_size + lr

ACT: bs=8, lr=1e-5. Diffusion: bs=32, lr=1e-4. SmolVLA: bs=4, lr=5e-5. We took these from LeRobot's default configs — they're well-tuned for these architectures and we never had reason to deviate.

save_freq

Save every 4000 (ACT) or 5000 (Diffusion) steps to the Modal volume. Each checkpoint is ≈ 300 MB; we keep them all so we can eval mid- training if something looks off. Volume holds ≈ 30 GB total across six checkpoints with histories.

n_obs_steps (Diffusion)

2: stack the last two observations into a 2-frame history. Empirically helps Diffusion smooth its predictions; ACT is 1 by default.

What we wished we'd tuned

Cond encoder dim for env_state. The bbox-to- feature MLP in both ACT and Diffusion projects 4 → 64. Bumping it to 4 → 512 (matching image features) was the obvious cheap ablation we didn't try. Chapter 35 details what that would look like.
Train-time data augmentation. LeRobot supports random affine + colour-jitter on input images. We left these off because the sim images are already deterministic; for real- robot deployment they'd be essential.
Steps. 30k steps is where ACT plateaus on training loss; we never tested 60k or 100k. Probably wouldn't have helped (loss is already low), but we never ruled it out.

PART VII · Inference

24 · serving

24Serving a trained policy on Modal.

Training produces a checkpoint sitting on the umi-sim-weights volume. To use it we deploy a second Modal app — a serving app whose job is to keep the policy warm on a GPU and answer HTTP POSTs. The shape we used:

@app.cls(
    image=image,
    gpu="A10G",                          # cheaper inference GPU
    volumes={WEIGHTS_DIR: weights_vol},
    secrets=[hf_secret],
    timeout=600,                          # 10 min per request max
    scaledown_window=300,                  # keep alive 5 min after last call
    min_containers=0,                      # spin up on first request
)
class PolicyServer:
    @modal.enter()
    def _setup(self):
        weights_vol.reload()
        self.device  = torch.device("cuda")
        self._policy = None
        self._current_ckpt = None

    def _ensure_loaded(self, checkpoint_path):
        if self._current_ckpt == checkpoint_path and self._policy is not None:
            return
        self.arch    = checkpoint_path.split("/")[1].split("-", 1)[0]
        self._policy = _load_policy(self.arch, checkpoint_path)
        self._policy.to(self.device).eval().reset()
        self._pre, self._post = make_pre_post_processors(
            policy_cfg=self._policy.config,
            pretrained_path=str(full_path),
            preprocessor_overrides={"device_processor": {"device": str(self.device)}},
        )
        self._current_ckpt = checkpoint_path

    @modal.fastapi_endpoint(method="POST")
    def live(self, payload: dict) -> dict:
        self._ensure_loaded(payload["checkpoint"])
        if payload.get("reset"):
            self._policy.reset()
        batch  = self._build_batch(payload)
        batch  = self._pre(batch)
        with torch.inference_mode():
            action = self._policy.select_action(batch)
        action = self._post(action)
        return {"action": action.squeeze(0).cpu().numpy().tolist(),
                "arch": self.arch, "checkpoint": self._current_ckpt}

We have three policy architectures; each one is deployed as its own Modal app with the same code, but a different value of an ARCH environment variable that gates which class gets loaded. The APP_NAME = f"umi-sim-serve-{ARCH}" line at the top of the file makes each deploy land on a unique Modal app.

The ARCH env var gotcha. ARCH is read at Python module-import time, on the local laptop, when modal deploy runs. Modal then bakes the APP_NAME into the deploy. But inside the container at runtime, os.environ["ARCH"] is NOT set unless we explicitly put it in image.env(...). Without that fix every container's self.arch defaulted to "act" — and the diffusion endpoint was loading Diffusion checkpoints with cls = ACTPolicy, failing in confusing schema-mismatch ways. Real fix: derive arch from the checkpoint path (ft/<arch>-…) instead of relying on the env var.

25 · pre/post processors

25The pre/post-processor pipeline — and the bug that cost us a day.

Neural nets are trained on normalised data. The dataset's raw action vectors live in a frame where typical TCP positions are around (0.1, −0.5, 0.7) — but we train the policy on (value − mean) / std so each dimension has unit variance. Same for the images: pixel values get divided by 255 then standardised against per-channel ImageNet means.

This normalisation is captured in a processor pipeline — a list of small transforms that LeRobot saves alongside the weights:

checkpoints/032000/pretrained_model/
├── model.safetensors                                   # the actual weights
├── config.json                                         # architecture config
├── policy_preprocessor.json                            # preprocessor pipeline schema
├── policy_preprocessor_step_3_normalizer_processor.safetensors
├── policy_postprocessor.json
└── policy_postprocessor_step_0_unnormalizer_processor.safetensors

At inference we MUST apply these:

from lerobot.policies import make_pre_post_processors

pre, post = make_pre_post_processors(
    policy_cfg=policy.config,
    pretrained_path=str(checkpoint_dir),
)
batch  = build_batch(payload)
batch  = pre(batch)                            # normalises image + state + bbox
action = policy.select_action(batch)
action = post(action)                          # un-normalises to world units

The bug. We forgot to load and apply post initially. The policy was returning raw normalised outputs — vectors with mean ≈ 0 and std ≈ 1 — and the eval driver treated them as TCP poses in metres. A "z" of 2.5 would have the IK solver hunting for the wrist 2.5 m above the table; the joint targets came back as the closest reachable configuration, which was basically random. 0 / 14 picks in our first eval. After adding post(action) the same checkpoints dropped from ≈ 1 m wrist error to ≈ 300 mm — a 3× jump just from getting normalisation right.

26 · closed loop

26The closed-loop driver.

The serving endpoint just answers "given this obs, what action?" The closed loop — rendering the env, calling the policy, converting the action to joint targets, stepping the env — runs on the laptop. tools/eval_in_sim.py does it. The core loop:

for t in range(max_ticks):
    # 1. Render wrist GoPro, project bbox
    renderer.update_scene(env._data, camera="umi/gripper_gopro")
    warped = apply_fisheye(renderer.render(), fisheye_map)
    bbox   = _project_brick_bbox(env._model, env._data, target_bid, cam_id, 224)

    # 2. Build payload + POST to Modal
    payload = {
        "checkpoint": checkpoint,
        "image_b64": _encode_png(warped),
        "state"     : _state_vec(env, wrist_bid),
        "bbox"      : bbox.tolist(),
        "reset"     : not sent_reset,
    }
    sent_reset = True
    action = np.asarray(requests.post(url, json=payload).json()["action"])

    # 3. TCP-pose target → joint target via mink IK
    target_xyz  = action[:3]
    target_quat = rotvec_to_wxyz(action[3:6])
    q_target = cid._solve_ik_once(env, ik_state, target_xyz, target_quat, n_iters=20)
    grip_target = float(np.clip(action[6], 0.0, 0.045))

    # 4. Cap per-tick joint motion at 15° to avoid wild snaps
    dq = q_target - q_arm
    n  = float(np.linalg.norm(dq))
    if n > MAX_DQ: dq = dq * (MAX_DQ / n)
    q_arm = q_arm + dq

    # 5. Step env
    obs, *_ = env.step(np.concatenate([q_arm, [grip_target]]).astype(np.float32))

    if env.target_object_pos[2] > target_pos[2] + 0.10:
        return {"success": True, "ticks": t + 1}

Latency budget

render wrist cam: ≈ 8 ms (MuJoCo GL renderer at 224×224)
PNG encode + base64: ≈ 12 ms
HTTP POST round trip: ≈ 250 ms (laptop → Modal → laptop)
Modal-side policy inference: ≈ 150 ms (Diffusion, 70 DDPM steps)
IK solve: ≈ 4 ms (mink + daqp at 20 iters)
env.step: ≈ 2 ms (20 inner physics substeps at 4 ms each)

Total ≈ 250-350 ms per control tick. Our env runs at 25 Hz (40 ms / step), so we're heavily bottlenecked by HTTP latency. In production you'd run the policy on the same machine (or at least the same data centre); here we accept the lag because the demo doesn't need real-time.

PART VIII · Viz & eval

27 · rerun

27Rerun — the rosetta stone of robotics viz.

Rerun is a multi-modal time-series viewer for robotics. Think Tensorboard + Foxglove + a 3D viewer fused together. It's the single best tool I've used for debugging a sim policy. You write to it from Python with one-line log calls; you view in a web browser or a native window.

Logging from the loop

import rerun as rr
rr.init("umi_sim_dashboard")
rr.serve_grpc(grpc_port=9876)
rr.serve_web_viewer(web_port=9090, open_browser=False)

for step in range(num_steps):
    rr.set_time("step", sequence=step)
    log_kinematic_step(env._model, env._data)              # 3D scene
    rr.log("sim/wrist_cam", rr.Image(wrist_img))
    rr.log("sim/third_person", rr.Image(overhead_img))
    for i, v in enumerate(obs["observation.state"]):
        rr.log(f"state/joint_{i}", rr.Scalars(float(v)))
    for i, v in enumerate(action):
        rr.log(f"action/dim_{i}", rr.Scalars(float(v)))

The killer feature is the timeline scrub. When an eval rollout fails at tick 287, scrub the timeline back to 287, see exactly where the wrist was, what the predicted action was, what the wrist cam looked like. We caught the missing-postprocessor bug from chapter 25 in Rerun by scrubbing through a failed rollout and noticing the action plots sitting at unit variance instead of in metres.

28 · dashboard

28The click-to-pick dashboard.

The dashboard (dashboard/main.py) is a local FastAPI server that wraps everything we've built behind a browser UI. It runs on the laptop, talks to the Modal serving endpoints, streams MJPEG to the browser, and embeds Rerun in an iframe.

browser

HTML + JS, dropdowns for model + target brick

→

/api/run

spawns a background thread running the rollout loop

→

modal

policy endpoint chosen from MODEL_ENDPOINTS

→

SSE / MJPEG

step counter + live preview streams back to UI

What the UI shows

Model dropdown. Six options: act-{vanilla,bbox,bbox-overlay} and diffusion-{vanilla,bbox,bbox-overlay}. Plus random for a baseline sanity check.
Target-brick dropdown. 12 named bricks plus (random). Selecting one overrides the env's random choice at reset() time.
Run / Stop buttons. Run kicks off a rollout (default 400 ticks). Stop politely breaks the loop without killing the thread.
Status pill. Live updates with step | target | dist=Xmm | r=0.00, where dist is the live wrist-to-target distance — your single best signal for whether the policy is actually making progress.
Wrist GoPro preview. Live 224×224 stream with the projected bbox drawn as a green rectangle around the target brick.
Overhead view. Third-person camera with a cyan ring + label highlighting which brick is the target.

29 · eval methodology

29How we measure "did it work?"

Three metrics, ranked by usefulness:

1 · Closed-loop pick rate

The headline. Out of N rollouts, how many actually picked up the target brick. Defined as brick z > initial brick z + 10 cm, sustained for at least one tick. Binary per seed.

2 · Final wrist-to-brick error (mm)

Even when the pick fails, we want to know how close the arm got. We log ‖wrist_pos − brick_pos‖ at the end of each rollout. A successful pick obviously has very small error (gripper actually touching brick). An unsuccessful one tells us whether the policy got within 5 cm (and just failed to grasp), 30 cm (got close but stopped), or 80 cm (didn't even leave the home pose). This number is the most informative even when we never get a successful pick.

3 · Wrist-to-brick distance over time

Per-tick distance, plotted. Lets us see whether the policy is actively descending, oscillating, or stuck. The dashboard's status pill exposes this live.

Sweeping seeds

We use a fixed range of seeds (10000–10004 for 5-seed evals) deliberately outside the seed range used for training (0–1379). This guarantees the eval bricks have positions the model never saw.

One seed isn't enough. Diffusion has noise in its sampling. ACT's chunk decoder is deterministic but the env state varies across seeds. Empirically we see 50-100 mm of variance in final wrist error across 5 seeds with the same checkpoint. Reporting a single number is misleading; the mean across 5 is the right summary, plus the range in parentheses.

PART IX · Results

30 · the plateau

30The plateau, in one chart.

We did four iterations of training, each with a different fix targeting a different hypothesised bottleneck. Here's all of them side-by-side. Lower bars are better; the dashed red line at 100 mm is the closest a wrist can be without being considered a pick.

eval · 4 iterations × 2 architectures

      Diffusion bbox
      ACT bbox
      pick threshold (100 mm)
    

Diffusion bbox lands at 309-333 mm across every intervention we tried — except delta-TCP, which broke the absolute-trajectory representation it likes. It's an architectural plateau, not a data one.

Reading the chart

baseline — first run after we fixed the postprocessor bug from chapter 25. Diffusion bbox = 316 mm, ACT bbox = 542 mm.
delta-TCP — re-collected the dataset with action[t] = state[t+1] − state[t] (instead of absolute pose). Hypothesis: predicting deltas avoids the absolute-pose plateau. Result: Diffusion went UP to 493 mm; ACT went DOWN to 465 mm. Mixed signal.
overlay — re-rendered the wrist mp4 with the bbox drawn directly on each image, so the policy can't ignore it visually. Diffusion: 309 mm. ACT: 476 mm. Essentially same as baseline for Diffusion.
randomized — re-collected with per-episode random spawn jitter (±20 mm xy + ±15° yaw). Diffusion: 333 mm, ACT: 529 mm. Same plateau.

Notice the Diffusion bar barely moves across the four interventions: 316, 493, 309, 333. The 493 outlier is delta-TCP, which broke the absolute-trajectory representation Diffusion liked. The other three sit within 24 mm of each other across completely different data layouts, action targets, and visual encodings.

This is the signature of an architectural bottleneck, not a data bottleneck. If 285 → 300 episodes with randomization and overlay couldn't move the plateau, no amount of more data would. The next chapter is the autopsy.

31 · sensitivity probe

31Does the bbox actually steer the policy?

The simplest test of "is the bbox doing anything?" is to feed three different bboxes with the same image and state, and see if the predicted action changes. If the policy is ignoring the bbox, the actions will be identical. If it's using it, they'll vary.

# tools/probe_bbox_sensitivity.py
def probe(url, ckpt, label):
    img_b64 = _dummy_img()
    state   = [0.13, -0.10, 1.20, 0, 0, 0, 0]
    bboxes = {
        "top-left":  [0.10, 0.10, 0.30, 0.30],
        "center":    [0.40, 0.40, 0.60, 0.60],
        "bot-right": [0.70, 0.70, 0.90, 0.90],
    }
    for name, bbox in bboxes.items():
        body = {"checkpoint": ckpt, "image_b64": img_b64, "state": state,
                "bbox": bbox, "reset": True}
        action = requests.post(url, json=body, timeout=60).json()["action"]
        print(f"  {name:9s}  pos=({action[0]:+.3f},{action[1]:+.3f},{action[2]:+.3f})")

Output:

=== act-bbox ===
  topleft    pos=(+0.135, -0.185, +1.198)
  center     pos=(+0.126, -0.232, +1.201)   |Δ_pos|= 4.8 cm
  botright   pos=(+0.117, -0.255, +1.198)   |Δ_pos|= 7.2 cm

=== diffusion-bbox ===
  topleft    pos=(+0.142, -0.100, +1.202)
  center     pos=(+0.136, -0.100, +1.202)   |Δ_pos|= 0.7 cm
  botright   pos=(+0.119, -0.100, +1.185)   |Δ_pos|= 2.9 cm

Two takeaways:

Both policies do respond to the bbox. The predicted y-coord (depth into the table) decreases from −0.185 → −0.232 → −0.255 as the bbox moves top-to-bottom. ACT's |Δ| is 7.2 cm across the full bbox sweep; Diffusion's is 2.9 cm.
Neither response is large enough to switch bricks. Adjacent bricks in our grid are 11 cm apart in x and 14 cm apart in y. The policy moves 1/3 of a brick spacing when we move the bbox by 60% of the image width. To pick a specific brick the policy needs to lateralise by a full brick spacing.

So: bbox conditioning is alive but architecturally weak. The next chapter explains why.

32 · diagnosis

32The architectural diagnosis.

Both ACT and Diffusion Policy follow the same pattern for combining conditioning signals: they project each signal through a small MLP to a fixed-dim feature, then concatenate the features into one big "global condition" vector that feeds the action decoder. The relevant math:

cond = [ embed_img(image) ; embed_state(state) ; embed_envstate(bbox) ]
= [ 512 ; 64 ; 64 ]

The image gets 512 channels — 8× the channel budget of the bbox. Even before the policy's decoder reads cond, the bbox is architecturally outvoted 8:1 by image features. The decoder is a transformer with self-attention, which could in principle let the bbox token gate image features — but it has no prior to do so, and during BC training the gradient signal flowing through the bbox token is small (because the bbox often is redundant with the image — the brick IS visible — so the model can ignore the bbox and still get low loss).

Why MORE data didn't fix it

More data would help if the bbox were doing nothing — the model would eventually be forced to use it. But the bbox is doing something (chapter 31's probe shows it). The model has converged on a small bbox→action sensitivity, and adding more data in the same regime reinforces the same convergence. Cross-arch the same plateau argues for an architectural floor, not a data floor.

Why overlay didn't fix it either

Drawing the bbox on the image moves it from environment_state (channel budget 64) to image pixels (channel budget 49 × 512 = 25 088). On paper that should help — the bbox is now a high-channel visual signal the policy can't ignore. In practice the green pixels are a tiny subset (≈ 0.5% of the image area), so ResNet18's feature aggregation drowns the rectangle in the much-larger ambient image content. Plus the policy has no architectural reason to attend to "green-stuff" specifically — that's a hand-engineered convention the model only learns to use weakly.

What WOULD fix it

Three architectural changes (any one would help):

Cross-attention from bbox to image patches. Replace the simple env_state→concat path with a transformer block where the bbox vector is the query and the 49 ResNet image patches are keys / values. Now the model is structurally forced to look at image features through the bbox's lens. This is how RT-1, RoboPoint, and π0 inject spatial prompts.
Position-encoded image patches gated by bbox. Add a per-patch mask derived from the bbox (1.0 inside, slowly decaying outside) and multiply patch features by it. The irrelevant patches get suppressed before they reach the cond vector.
Re-render with the target brick highlighted in the scene itself. i.e., make the target brick literally glow red in the rendered image. This is a sim-only cheat but makes the visual signal completely unambiguous; any policy that can read pixels can solve the task. Useful as a sanity check that the rest of the stack is sound.

Chapter 35 has the code for option 1.

PART X · From scratch

33 · in the box

33What LeRobot actually does for us.

If you're going to write this yourself, it helps to know exactly what LeRobot is doing on your behalf. Roughly four things:

1 · The dataset machinery

Loading parquet + mp4 + meta from disk or HF Hub, building a PyTorch Dataset, computing normalisation stats (or reading them from stats.json), constructing windowed action chunks across episode boundaries, validating timestamps, handling the v3.0 path conventions. The class to know is LeRobotDataset, defined in lerobot/datasets/lerobot_dataset.py.

2 · The policy classes

Concrete nn.Modules for each architecture: ACTPolicy, DiffusionPolicy, SmolVLAPolicy, PI0Policy (π0.5), VQBeTPolicy, etc. Each has a forward() (for training), a select_action() (for inference, handles the action queue), and a reset() (clears the queue).

3 · The processor pipeline

Pre/post-processing as a chain of ProcessorStep instances: RenameObservationsProcessorStep, NewLineTaskProcessorStep, AddBatchDimensionProcessorStep, TokenizerProcessorStep (for SmolVLA), NormalizerProcessorStep, DeviceProcessorStep. Built by make_pre_post_processors from the saved schema.

4 · The training loop

The function train(cfg) in lerobot/scripts/lerobot_train.py — instantiates the dataset + policy + optimizer + LR scheduler, runs the gradient loop with mixed precision, periodically saves checkpoints (model + config + processor pipeline + optimizer state), logs to wandb if configured.

The lines of code in our project

file	LOC	what it does
`experiments/train/app.py`	244	Modal train driver — 3 functions, one per arch
`experiments/serve/app.py`	220	Modal serve — load policy, apply pre/post, return action
`tools/eval_in_sim.py`	240	local closed-loop driver
`tools/umi_zarr_to_lerobot.py`	100	convert collection zarr → LeRobot v3 parquet
`data/lerobot_writer.py`	450	write LeRobot v3 dataset to disk
`dashboard/main.py`	370	FastAPI dashboard with MJPEG + Rerun
`envs/tabletop_pick.py`	200	MuJoCo env class
`scripts/collect_ik_demos.py`	340	scripted IK demonstrator
`scripts/build_menagerie_scene.py`	370	scene XML builder

Total ≈ 2 500 lines of our code. LeRobot is doing everything we'd otherwise have to write — and that's ≈ 30 000 lines on the LeRobot side, of which we lean on maybe 8 000.

34 · a minimal act

34What "from scratch" looks like.

Let's say you decided to write ACT yourself. Here's the realistic LOC budget, drawn from the actual upstream ACT reference impl:

component	LOC	where the time goes
`act_policy.py` — ResNet18 + transformer VAE + chunk decoder + EMA	500-700	translating from Tony Zhao's reference impl, getting the VAE KL right
`dataset.py` — parquet+mp4 loader, episode/chunk indexing, normalisation stats, action-chunk windowing across episode boundaries	600-1000	this is the bug-magnet — off-by-one in chunk indexing silently produces a model that "trains" but doesn't track
`train.py` — dataloader, optimizer, LR sched, EMA hook, ckpt save/load	200	mostly orchestration
`inference.py` — obs queue, action queue, reset semantics, preprocessor	200	same bugs as LeRobot's serve path — you'd rediscover them yourself
tests for chunking, normalisation round-trip, episode boundaries	300	catches the dataset bugs

So ≈ 2k LOC and 3-5 days to "matches LeRobot on the same data."

A 300-line skeleton (the absolute minimum)

# act_minimal.py — strip ACT to its skeleton; no VAE, no chunked decoder,
# just enough to get the geometry right.
import torch, torch.nn as nn, torchvision.models as tv

class MinimalACT(nn.Module):
    def __init__(self, chunk_size=50, action_dim=7, state_dim=7, bbox_dim=4, d=512):
        super().__init__()
        backbone = tv.resnet18(weights="DEFAULT")
        self.img_enc = nn.Sequential(*list(backbone.children())[:-2])
        self.img_proj = nn.Conv2d(512, d, kernel_size=1)
        self.pos_embed = nn.Parameter(torch.randn(49, d))         # 7×7 patches
        self.state_mlp = nn.Linear(state_dim, d)
        self.bbox_mlp  = nn.Linear(bbox_dim, d)
        encoder_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.query_embed = nn.Parameter(torch.randn(chunk_size, d))
        decoder_layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.decoder    = nn.TransformerDecoder(decoder_layer, num_layers=4)
        self.head       = nn.Linear(d, action_dim)

    def forward(self, image, state, bbox):
        # image: (B, 3, 224, 224); state: (B, 7); bbox: (B, 4)
        feat = self.img_proj(self.img_enc(image))           # (B, d, 7, 7)
        feat = feat.flatten(2).transpose(1, 2)              # (B, 49, d)
        feat = feat + self.pos_embed.unsqueeze(0)
        s_tok = self.state_mlp(state).unsqueeze(1)              # (B, 1, d)
        b_tok = self.bbox_mlp(bbox).unsqueeze(1)                # (B, 1, d)
        memory = self.transformer(torch.cat([feat, s_tok, b_tok], 1))
        Q = self.query_embed.unsqueeze(0).expand(image.size(0), -1, -1)
        out = self.decoder(Q, memory)                            # (B, 50, d)
        return self.head(out)                                # (B, 50, 7)

# Training loop
model = MinimalACT().cuda()
opt   = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.MSELoss()

for step in range(32_000):
    batch = next_batch()                # (B, 3, 224, 224), (B, 7), (B, 4), (B, 50, 7)
    pred  = model(batch["image"], batch["state"], batch["bbox"])
    loss  = loss_fn(pred, batch["action_chunk"])
    opt.zero_grad(); loss.backward(); opt.step()

That's 35 lines of Python and trains. You'd add a VAE encoder, a chunk queue at inference, normalisation, optimizer schedule, EMA, and per-chunk loss masking — and you'd end up with ≈ LeRobot's ACT in 700 lines. It's an instructive exercise; we didn't do it for this project because LeRobot saved us a week.

35 · cross-attention bbox

35The cross-attention fix.

The architectural diagnosis from chapter 32 said: the bbox is one of many features concatenated into a global cond vector, and the image features dominate. The cleanest fix is to make the bbox directly attend to image features so it can pull out the spatial region it cares about.

The forked-ACT change

In the minimal ACT above, replace the bbox-as-extra-token with a cross-attention block:

class BboxCrossAttention(nn.Module):
    def __init__(self, d=512, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.mlp  = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, bbox_tok, image_tokens):
        # bbox_tok: (B, 1, d) - query
        # image_tokens: (B, 49, d) - keys/values
        attended, weights = self.attn(bbox_tok, image_tokens, image_tokens)
        return self.norm(bbox_tok + self.mlp(attended)), weights

Now the bbox doesn't just sit in the cond vector — it pulls features from the specific image region around the targeted brick. The attention weights become an interpretable "what is the policy looking at" mask, which is useful for debugging too.

Where to slot it in

Right after image encoding, before the global cond pooling:

class ACTwithBboxCA(MinimalACT):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bbox_ca = BboxCrossAttention(d=512)

    def forward(self, image, state, bbox):
        feat = self.img_proj(self.img_enc(image))
        feat = feat.flatten(2).transpose(1, 2) + self.pos_embed
        b_tok, attn = self.bbox_ca(self.bbox_mlp(bbox).unsqueeze(1), feat)
        s_tok = self.state_mlp(state).unsqueeze(1)
        memory = self.transformer(torch.cat([feat, s_tok, b_tok], 1))
        Q = self.query_embed.unsqueeze(0).expand(image.size(0), -1, -1)
        return self.head(self.decoder(Q, memory))

That's the architectural change. ≈ 30 net new lines of code.

Why it should help

The cross-attention's weights are forced to peak at image patches that match the bbox region. Even without explicit supervision, the softmax over 49 patches makes the bbox token "see" a small subset of the image. The downstream action prediction is then conditioned on those specific features, not on the global image average. Empirically, RT-1, RoboPoint, and π0 all use variations of this pattern and all show much stronger spatial-prompt steering than vanilla ACT.

The estimated cost to ship

Fork the ACT policy file from LeRobot: 1 day.
Add the cross-attention block and re-train: ~$1 on A10G, 30 min.
Run the sensitivity probe again: 1 hour.
Iterate on hyperparameters (attention heads, MLP dim): 1 day.

Total ≈ 2-3 focused days plus ≈ $5 of compute. We didn't do this because we ran out of project time. It's the highest-leverage thing on the followup list.

PART XI · Next

36 · dagger

36DAgger — the curriculum we should have used.

Chapter 11 introduced the BC covariate-shift problem: when the policy deploys, it drifts off the expert's distribution and never saw the new states it lands in. The classical fix is DAgger (Dataset Aggregation):

Train the policy on the original demos.
Deploy it. Record the trajectories it actually generates (states it actually sees, not the expert's).
For each recorded state, ask the expert "what would you have done here?" — relabel.
Add the (state, expert-action) pairs to the dataset.
Re-train. Repeat from 2.

For us the expert is the scripted IK demonstrator — and the scripted picker can in fact answer "what would I do from here?" for any state the policy reaches. So DAgger is essentially free in sim:

def dagger_iter(policy, env, scripted_expert, k_rollouts=50):
    new_data = []
    for seed in range(k_rollouts):
        obs, info = env.reset(seed=seed)
        for tick in range(300):
            # Get the policy's action (drives the env)
            policy_a = policy.select_action(obs)
            obs, *_ = env.step(policy_a)
            # Ask the expert what it WOULD have done from the policy's state
            expert_a = scripted_expert.target_action(env.state(), info)
            new_data.append((obs, expert_a))
    return new_data

We didn't run DAgger for time reasons but it's the obvious next step. With a working cross-attention bbox + 2-3 DAgger iterations, our pick rate would likely jump to > 50%. (See the SO101 sister project for a real-data version where this combination does work.)

37 · real robot path

37From sim to a real arm.

Everything in this codebase ports to a real UR5 with the following swaps:

Replace TabletopPickEnv with a thin wrapper around UR's ROS2 driver. Same observation dict shape; the wrapper reads joint states from the driver, reads images from the real GoPro, returns the dict.
Replace _project_brick_bbox with a real-world detector — GroundingDINO works well for point-and-click. The detector takes the wrist image and a text prompt ("the orange Lego"), returns a bbox. The bbox feeds the policy unchanged.
Replace mink IK with the UR driver's built-in inverse-kinematics (or use mink against a real URDF — works fine, similar API).
Add domain randomisation at training time — random lighting, random table textures, JPEG noise, small camera jitter — to close the sim-to-real visual gap.
Add safety bounds — joint velocity / force limits — that throw the arm into a soft-stop if the policy outputs something crazy.

The other half of this codebase (the SO-101 / boxbros project) already does the real-robot version with the same bbox-conditioned ACT, on real data — and there the pick rate is > 80%. The architecture differences are minor; the difference is that real sensors have noise that forces the policy to lean on the bbox conditioning, partially mitigating the architectural weakness we saw in sim.

38 · scorecard

38Final scorecard.

300

demos collected

102 k

frames

trained checkpoints

309 mm

best reach

successful picks

~$195

cloud spend

What worked

The whole stack runs end-to-end. From a click on the dashboard to MuJoCo physics step to Modal-hosted policy and back, every wire is connected.
Diffusion bbox consistently descended to the table from the home pose, lateralised slightly toward the targeted brick, plateaued in roughly the right neighbourhood — a far cry from baseline noise.
The pre/post-processor fix moved the baseline from "0 picks, ~1 m wrist error" to "0 picks, ~300 mm wrist error." A 3× improvement just from getting normalisation right.
The build pipeline (MuJoCo scene generation, Modal fan-out collection, parquet conversion, HF Hub push, Modal training and serving) is reproducible and reusable for any future task.

What didn't

Closed-loop pick success. We never crossed the 100 mm threshold.
Delta-TCP retrain. Did not move the plateau and made Diffusion worse.
Bbox image overlay. Did not move the plateau either. Marginal improvement on ACT (12%), essentially nothing on Diffusion.
Randomized brick spawning. Plateau stayed at 333 mm.

The one-sentence summary

Bbox conditioning works on real data because real noise forces the policy to lean on it; in clean sim, vanilla LeRobot architectures pass the bbox through too narrow a bottleneck for it to win against the image features. The fix is one layer of cross-attention; we ran out of time to ship it.

code: github.com/ozyphus/cs224-2026-project
data: ozyphus/cs224-lego-umi-rand-300 · ...-overlay · ...-delta
checkpoints: Modal volume umi-sim-weights
sister project (real robot): cs-224r-final-project (boxbros)