01The problem, in one sentence.
There is a robot arm sitting in front of a table. On the table are twelve Lego bricks of different colours. You — the operator — would like to walk up to a screen, click the brick you want, and have the arm pick that brick.
That sentence packs a lot. "The brick you want" presumes the arm can map your click to a specific physical object. "Pick it up" presumes the arm can decide where to move, how to close its fingers, and how not to knock the eleven other bricks onto the floor. Behind all of that there's a stack of work — modelling the robot, simulating the physics, collecting demonstrations, training a neural network, serving the trained network on a cloud GPU, and closing the loop in real time. Roughly half of this book is that stack.
The other half is what went wrong. Spoiler: across six trained checkpoints and four different data / training tricks, our policies never managed to actually close on a brick. They reliably descend toward the table, they reliably move in roughly the right direction — but they plateau ≈ 30 cm short of the target. Chapter 32 is the autopsy. Chapter 35 is the fix we wish we had time to build.
cs-224r-final-project, the
boxbros repo) does exactly this same trick on a real SO-101
arm — the same bbox-conditioned ACT policy, on real data — and it
does close on the brick. The reasons the simulated UR5
version doesn't are not "the idea is wrong"; they're "the
bottleneck is somewhere subtle." Stick around for chapter 32.
This book assumes no robotics background. Each chapter builds the next.
02Why we work in simulation first.
Picture the same problem on a physical arm. Every demonstration we'd need a human to teleoperate, or a scripted-IK rig in the lab, plus a USB camera, plus a calibrated Lego workspace, plus a person who has to hand-reset twelve bricks 300 times. At 30 s per demo that's 2.5 hours of attention from a human. If we want 1,000 demos, that's a day. If we want a thousand demos with randomised brick positions to teach the policy to lateralise, that's a week of nothing but data collection — before we've trained anything.
Simulation collapses that. In MuJoCo, a single CPU container can drive 10 episodes through a scripted picker in about 10 minutes. Fan out across 30 containers on Modal and the whole 300-episode dataset is done in 10 wall-clock minutes for ≈ $5. The same loop on real hardware would take days and cost engineer-hours.
What simulation is bad at
Three things, mostly:
- Visual realism. Our wrist-cam frames look like clean computer graphics — flat lighting, sharp polygon edges, zero noise. A real GoPro frame has motion blur, JPEG artefacts, rolling shutter, dust on the lens. A policy trained only in sim often fails on real frames it has never seen the noise floor of. This is called the sim-to-real gap.
- Contact dynamics. MuJoCo's contact solver is good but not perfect. A real gripper closing on a brick has friction, deformation, slippage. The sim solver approximates all of these.
- Sensor latency. Real cameras have 30–60 ms of capture-to-arrival lag, real servos have 5–20 ms of command-to- actuation lag. Simulation collapses all of these to zero. Policies that work in sim sometimes oscillate when wrapped in a real control loop because they were never trained to anticipate latency.
We accept these gaps for now because the question we're trying to answer in this project — "does bbox conditioning let us route a pick to a specific brick?" — is upstream of all of them. It's about whether the model architecturally can learn to point and pick, given clean signals. If yes, the next project ports it to a real robot and burns the sim-to-real fuel. If no, no amount of real-data engineering would have fixed it.
03The anatomy of a robot arm.
A robot arm is a chain of links (rigid metal tubes) connected by joints (motors). At one end of the chain is the base, bolted to the floor or a table. At the other end is the end-effector — the "hand", which in our case is a gripper.
Each joint has one or two degrees of freedom (DOF) — one number that describes its configuration. A rotating joint has 1 DOF: the rotation angle. A sliding joint has 1 DOF: the slide distance. Stack up six rotating joints in series and you've got a 6-DOF arm — enough to put the end-effector at any reachable position and rotate it to any reachable orientation. This is the standard configuration for industrial manipulators, and it's what the UR5e is.
The UR5e specifically
Universal Robots' UR5e: six revolute joints, ≈ 1 m of reach, ≈ 5 kg payload, 6-axis force/torque sensor built into the flange. We picked it because it's well-supported in MuJoCo's menagerie (a curated set of physically-accurate robot models), and because it's the same arm a lot of real lab work uses. The kinematics convention is:
shoulder_pan— rotates the whole arm about the vertical world axisshoulder_lift— pitches the shoulder up/downelbow— bends at the elbowwrist_1,wrist_2,wrist_3— three rotations that orient the flange
Plus the gripper has one more DOF — the finger separation. So our
full state vector is 7 numbers: [q1, q2, q3,
q4, q5, q6, gripper].
04Forward kinematics — joints to pose.
Given the seven joint numbers, where is the gripper tip in the world? That's forward kinematics (FK). It's pure geometry — no physics, no learning. You compose six rigid-body transformations (one per joint, each parameterised by that joint's angle), apply them in series starting from the base, and out pops the homogeneous transformation matrix Tee ∈ ℝ4×4 describing the end-effector's pose:
Each Ti→i+1(qi) is a 4×4 matrix
encoding "rotate by qi about this joint's axis,
then translate by the rigid link offset to the next joint." MuJoCo
does this calculation for you every physics step — it's available in
data.xpos[body_id] (world position) and
data.xmat[body_id] (orientation matrix).
A 2D toy you can play with
Below is a simplified 2-link planar arm — just two joints, each a single angle. You drag the sliders; the dots trace where the tip goes. Six-DOF UR5 has the same idea, just with more rotations composed.
# 2D forward kinematics — toy version of what UR5 FK does in 3D.
def fk_2link(theta1, theta2, L1=0.4, L2=0.35):
# joint 1 at origin, rotates by theta1.
x1 = L1 * cos(theta1)
y1 = L1 * sin(theta1)
# joint 2 at (x1, y1), rotates by theta1+theta2 in world frame.
x2 = x1 + L2 * cos(theta1 + theta2)
y2 = y1 + L2 * sin(theta1 + theta2)
return (x2, y2)
Three things to notice from playing with the demo: (1) for any pair of joint angles there's exactly one tip position — FK is a function, never ambiguous. (2) the set of reachable tip positions is an annulus (donut) of inner radius |L1−L2| and outer radius L1+L2. (3) two very different joint configurations can put the tip at the same place — once we go backwards from "tip here" to "joints how?" we'll have to pick one. That's what chapter 5 is about.
05Inverse kinematics — pose to joints.
Now imagine you have a target — "put the tip here, oriented like so." What joint angles do you set? That's inverse kinematics (IK), and it's much harder than FK:
- Multiple solutions. A 6-DOF arm reaching some pose typically has up to 8 valid joint configurations (elbow up / down, shoulder in / out, wrist flipped, etc).
- No solutions. If the target is outside the arm's reach, or behind it, or blocked by a joint limit, no configuration works at all.
- Continuity. Even if a target is reachable, the path from the current joint configuration to a target configuration may pass through a singular pose where small joint motions cause huge tip motions. You'd like to avoid those.
The Jacobian-based approach
The classical way to solve IK is iterative: start from the current joint configuration q, compute the FK to find the current tip pose, compute the error between current and target, and take a small step in joint space that reduces the error. The step direction comes from the Jacobian — a 6×n matrix that linearises the FK function locally:
Invert that relationship and you get the joint step that would reduce the pose error by the right amount:
where J+ is the pseudo-inverse (Moore–Penrose) of the Jacobian. Repeat that until error is small. In practice you regularise the inverse to stay stable near singularities, you add posture priors so the arm prefers "natural" configurations, and you add joint limits as soft penalties.
Just use a library
We don't write this ourselves. Google DeepMind's mink is a clean Python IK library for MuJoCo: differentiable, supports multiple tasks (frame, posture, joint-limit), uses a quadratic program solver (daqp) under the hood. Our scripted demonstrator uses it like this:
import mink
# Set up once.
config = mink.Configuration(model)
ee_task = mink.FrameTask(
frame_name="flange", frame_type="body",
position_cost=1.0, orientation_cost=0.5,
)
posture_task = mink.PostureTask(model, cost=0.01)
solver = "daqp"
# Every tick.
ee_task.set_target(target_SE3) # desired flange pose
posture_task.set_target_from_configuration(config)
for _ in range(20):
dq = mink.solve_ik(
config, [ee_task, posture_task],
dt=0.01, solver=solver,
)
config.integrate_inplace(dq, dt=0.01)
q_target = config.q # the desired joint config
Drag-the-target demo
Same 2-link arm, now upside down — the slider is gone. Click and drag anywhere inside the reachable annulus and the arm solves IK to put its tip on your cursor. You'll see how the joints move continuously toward the goal, and how they snap discontinuously when you cross between the two solution branches.
06The gripper, in mechanical detail.
Bolted to the UR5e's flange is the UMI gripper — a parallel-jaw design used by Columbia Robotics as the standard portable data-collection rig for imitation learning. Two wedge-shaped fingers slide together (or apart) along a single linear axis, driven by an internal screw. The wedge shape (rather than a flat surface) is deliberate: it lets you pinch oddly-shaped objects on a single axis without needing dexterous finger articulation.
The gripper exposes one DOF — the distance between the two fingertips, in metres. Fully open is ≈ 4.5 cm, fully closed is 0. We send the target separation as the seventh number in our action vector; the sim physics takes care of contacts and slippage.
The wrist-mounted GoPro
Above the gripper, looking down and forward, is a small GoPro Hero 10 camera. The real UMI rigs put a GoPro on the gripper to capture demonstrations from the wrist's perspective — the camera moves with the hand, so the policy gets a first-person view of the workspace. In simulation we replicate this by adding a MuJoCo camera attached to the wrist body, with a 155° vertical field of view (matching the real GoPro Linear setting).
A wide-FOV camera at the wrist means the brick is visible from a great distance away from the home pose. The downside is heavy fisheye distortion near the image edges. We apply an equidistant-fisheye warp in software to mimic the GoPro's optical distortion, so the policy sees the same kind of warped image it would see on a real rig.
Gripper polarity gotcha
ctrl = 0 means
fully OPEN, and ctrl = 0.05 means fully
CLOSED — the opposite of what one would intuit. The internal
tendon's joint axis is inverted relative to the typical convention.
We discovered this at 3 AM when the scripted picker was opening the
gripper at grasp time and closing it on the lift. Now it's a
comment block at the top of collect_ik_demos.py.
07MuJoCo in ten minutes.
MuJoCo (Multi-Joint dynamics with Contact) is an open-source rigid-body physics engine. Imagine Bullet or PhysX, but optimised for articulated robots, with a soft-contact model that's stable at large timesteps and a Python API that hands you raw access to every state variable.
The mental model:
- You write an XML file describing the world: a tree of bodies, each with a joint connecting it to its parent, geometry for visualisation and collision, and inertial properties.
- MuJoCo parses that into a
MjModel(the static description) and creates aMjData(the live state — positions, velocities, contact forces). - Calling
mj_step(model, data)advances time by one integration step (we use 4 ms, with 20 inner steps per 80 ms "control tick" = 12.5 Hz control). - You can write to
data.ctrl(actuator commands), read fromdata.qpos(joint positions),data.xpos(body world positions), and so on.
A minimal XML scene
<mujoco>
<option timestep="0.004" gravity="0 0 -9.81"/>
<worldbody>
<!-- a fixed table -->
<geom type="box" size="0.6 0.4 0.02"
pos="0 -0.55 0.43" rgba="0.6 0.4 0.2 1"/>
<!-- one free-body brick -->
<body name="brick" pos="0 -0.55 0.46">
<freejoint/>
<geom type="box" size="0.016 0.032 0.012"
rgba="0.8 0.1 0.1 1"/>
</body>
</worldbody>
</mujoco>
Load it from Python:
import mujoco
model = mujoco.MjModel.from_xml_path("scene.xml")
data = mujoco.MjData(model)
mujoco.mj_resetData(model, data)
for step in range(100):
mujoco.mj_step(model, data)
print("brick z:", data.qpos[2]) # falls onto the table
That's the entire loop. Add a camera and a renderer and you get images; add an arm and actuators and you get a controllable robot.
Why MuJoCo not Bullet, Gazebo, IsaacSim?
- Speed. Single-CPU simulation of a 6-DOF arm runs at > 1000× real time on a laptop.
- Stable contacts. The soft-constraint solver rarely explodes even at large timesteps; the alternative engines tend to need much smaller dt to stay stable, blowing the speed advantage.
- Ecosystem. The menagerie has well-calibrated XML for most robots. No need to model the UR5 from URDF kinematics scratch.
- License. Apache 2.0 since DeepMind open-sourced it in 2021.
The downside vs IsaacSim: no native GPU parallelisation (until very recently with MJX), and no built-in photorealistic renderer. For our project that's fine — we just need the wrist GoPro camera, which we render with MuJoCo's built-in rasteriser at 224×224 in a fraction of a millisecond.
08Building our specific scene.
Our scene XML is generated by a Python script
(scripts/build_menagerie_scene.py) rather than written
by hand. The script uses MuJoCo's
MjSpec
API — a programmatic scene builder — so we can adjust dimensions and
layouts without editing XML by hand. The skeleton:
import mujoco
spec = mujoco.MjSpec()
spec.from_file("third_party/mujoco_menagerie/universal_robots_ur5e/ur5e.xml")
wb = spec.worldbody
# Table
table = wb.add_body()
table.add_geom(type="box", size=[0.6, 0.4, 0.02],
pos=[0, -0.55, 0.43],
rgba=[0.6, 0.4, 0.2, 1])
# UMI gripper, mounted to the UR5e flange in a specific orientation
umi_spec = mujoco.MjSpec().from_file(".../umi_gripper.xml")
spec.attach(umi_spec, parent_name="flange", ...)
# Twelve coloured Lego bricks on a 3×4 grid
for name, color, (x, y) in LEGO_LAYOUT:
body = wb.add_body(name=name, pos=[x, y, 0.452])
body.add_freejoint()
body.add_geom(type="mesh", meshname="lego_brick", rgba=color)
# Cameras
wb.add_camera(name="overhead_cam", pos=[0.6, -0.1, 0.95], ...)
# Wrist GoPro camera is added INSIDE the UMI gripper body, so it
# moves with the gripper.
spec.compile()
spec.to_file("assets/mjcf/tabletop_pick_menagerie.xml")
The gotchas we hit (so you don't)
- UMI gripper home orientation. The Menagerie
UMI XML defines the gripper with the fingers pointing along its
own +Z. To make the fingers hang straight down when
bolted on the UR5e flange, we had to rotate the gripper body by
quaternion
(-1, 1, 0, 0)(= 180° about flange's local Y axis). This took an evening to figure out from looking at side-on screenshots. - Wrist→fingertip offset. The wrist body and the gripper fingertips are not at the same point — the gripper sticks out ≈ 26 cm below the flange. Our scripted picker has to subtract this offset when translating "I want the fingertip here" into "I want the wrist here." Wrong value → 0 / 8 pick rate. Correct value (−0.26 m along world −Z) → 30 / 30 pick rate.
- Spawn z = 0.457 m (5 mm above the 0.452 m table top). Bricks fall a few mm onto the table and settle. Spawn higher and they bounce; spawn at exactly 0.452 and they clip into the table at t = 0 and the solver oscillates.
The final scene has the arm, the gripper, twelve Lego bricks at settled positions, a table, an overhead third-person camera (for visualisation), and a wrist-mounted GoPro (for the policy's observations).
09How cameras "see" in a sim.
A MuJoCo camera has a position, an orientation, and a vertical field of view (fovy). To render an image, MuJoCo:
- Builds a 4×4 view matrix from the camera's world pose.
- Builds a 4×4 projection matrix from fovy and the aspect ratio.
- Rasterises every visible geom: for each pixel, finds which geom is in front, looks up the geom's material colour and lighting, writes the pixel.
For us the key question is the reverse: given a 3D point in the world (a Lego brick), what pixel does it land on? Without that we can't compute the "bbox of the target brick" that we'll use as the policy's prompt.
The pinhole projection formula
MuJoCo's cameras are pinhole cameras — they treat the lens as an infinitesimal point, with rays going through it in straight lines. To project a world point P:
- Translate so the camera is at the origin: P' = P − C where C is the camera's world position.
- Rotate into camera frame: pcam = RT · P' where R is the camera's rotation matrix (columns are the camera's local axes expressed in world coords). The convention is: camera looks along its local −Z, with local +Y up.
- Project: a point at camera-frame coords (xc, yc, zc) with zc < 0 (in front of camera) lands at normalised image coords (−xc/zc, −yc/zc).
- Convert to pixels: with image height H and fovy α, the focal length in pixels is f = (H/2) / tan(α/2), and pixel coords are (u, v) = (f · x_norm + W/2, f · y_norm + H/2).
Code in our project
# tools/collect_umi_data.py — exact code we use to bbox bricks.
def _project_brick_bbox(model, data, brick_bid, cam_id, image_hw):
body_pos = data.xpos[brick_bid]
body_mat = data.xmat[brick_bid].reshape(3, 3)
# 8 corners of the brick AABB, in world frame
corners_local = _AABB_CORNER_SIGNS * _BRICK_HALF # (8, 3)
world_corners = body_pos + (body_mat @ corners_local.T).T
cam_pos = data.cam_xpos[cam_id]
cam_mat = data.cam_xmat[cam_id].reshape(3, 3)
rel = world_corners - cam_pos
cam_coords = rel @ cam_mat # = (R.T @ rel.T).T
cz = cam_coords[:, 2]
in_front = cz < -1e-6
if not np.any(in_front): return np.zeros(4, dtype=np.float32)
fovy = float(model.cam_fovy[cam_id])
f = (image_hw / 2.0) / np.tan(np.deg2rad(fovy) / 2.0)
x_img = -cam_coords[in_front, 0] / cam_coords[in_front, 2]
y_img = -cam_coords[in_front, 1] / cam_coords[in_front, 2]
u = f * x_img + (image_hw - 1) / 2.0
v = f * y_img + (image_hw - 1) / 2.0
# Clip to image bounds + normalise.
u_min, u_max = max(0., float(u.min())), min(image_hw-1, float(u.max()))
v_min, v_max = max(0., float(v.min())), min(image_hw-1, float(v.max()))
return np.array([u_min, v_min, u_max, v_max]) / (image_hw - 1)
Output is a 4-vector [x_min, y_min, x_max, y_max] in
[0, 1]. We compute this every tick during data
collection AND every tick during inference; that 4-D vector goes
into the policy as observation.environment_state.
Fisheye on top of pinhole
A real GoPro is not a pure pinhole — it has heavy barrel distortion at the edges, especially in the 155° "Linear" mode we mimic. We apply an equidistant fisheye warp to the rendered pinhole image to fake this distortion. The warp is a fixed pixel remap that bends straight lines outward at the edges.
10Pointing at a brick, formally.
How do you tell a policy "this one?" Three options:
option A · text
"pick the orange brick on the right". Needs a vision-language model to ground the text against the image — expensive, slow, sometimes wrong. SmolVLA does this but at the cost of 450 M params and a tokenizer step in the inference loop.option B · click coordinates
"the brick at pixel (165, 220)". A single point. Spatially precise but the policy can't tell how large the object is or how it's oriented.option C · bounding box
"the brick inside this rectangle, [x_min, y_min, x_max, y_max]." A single 4-D vector. Spatially dense, gives implicit size info. Same shape a real detector would output. This is what we use.option D · object id
"brick #4." Needs the policy to internally track which object is which, which it can't because the input is just an image. Brittle.We use option C, the bbox. In sim we have ground-truth — we know each brick's 3D position so the projection of chapter 9 gives us perfect bboxes. In a real-robot deployment the bbox would come from a detector like GroundingDINO: you'd type "orange brick" into a textbox, the detector outputs a bbox, and that bbox feeds the policy. Same interface.
Click-to-pick demo
The widget below is a top-down view of our actual scene layout — twelve coloured bricks in a 3×4 grid. Click any of them: the right- hand panel shows the wrist GoPro's view (a top-down projection approximation) with the bbox of the brick you picked drawn in green. These are the literal four numbers we'd send to the policy.
observation.environment_state every
tick.
11Behaviour cloning — copy what works.
There are three big ways to teach a robot to do a task: behaviour cloning (BC), reinforcement learning (RL), and model-based planning. We do BC. It's the simplest, the most data-efficient, and the easiest to debug. The recipe:
- Get some expert trajectories — sequences of (observation, action) pairs from someone who already knows how to do the task.
- Train a neural net to predict the expert's action given the expert's observation. This is plain supervised learning.
- At deployment, run the trained net on live observations and execute its predictions.
Mathematically:
Where 𝒟 is the demo dataset, o is an observation, a* is the expert action, and π is our policy. We pick the policy parameters that minimise the mean-squared error between predicted and expert actions, averaged over the dataset.
Why BC fails in obvious ways
The Achilles' heel of BC is covariate shift. The policy is trained on observations drawn from the expert's distribution. At deployment, the policy itself decides what observations it sees next — and the moment it makes a small mistake, the next observation lands slightly off the training distribution. The policy has never seen this kind of obs, so its prediction is slightly worse, the next obs drifts further, and so on. Errors compound. Ross & Bagnell (2010) showed this scales like O(ε · T²) with episode length — quadratic in horizon.
Our episodes are 400 ticks. Even with a per-step error rate of 1% that gives an expected total error budget of 400² × 0.01 = 1600 — obviously trajectories don't blow up by 1600× because the policy sometimes recovers, but the math explains why "train on 99% accurate BC" still gives "deploy fails 100% of the time" if the task is long.
Tricks people use
- DAgger — Dataset Aggregation. Run the trained policy in the env, ask the expert to relabel its actual trajectories ("what would you have done from here?"), add those to the dataset, retrain. Repeat. Directly attacks covariate shift. We didn't do this; chapter 36 says we should have.
- Action chunking. Predict the next 50 actions in one shot instead of one. Each chunk is internally coherent — the policy commits to a multi-step trajectory before re-observing. We do this; chapter 15 has the details.
- Stronger generative heads. If the demo distribution is multi-modal (multiple valid ways to do the task), MSE loss collapses to the mean of the modes, which is often a bad action. Diffusion / flow matching / VQ heads model the full distribution. Chapter 18 covers diffusion.
12Writing a scripted demonstrator.
Where do the expert trajectories come from? Three options: human teleop (slow, expensive), scripted IK (us), or a learned expert (chicken-and-egg). We picked the second because in simulation we know each brick's exact 3D pose, so we can hand-write a four-phase plan:
Each phase converts to a TCP-pose target for the wrist. mink IK
turns that into joint angles. A small per-tick joint-velocity cap
keeps motion smooth. The whole thing is ≈ 280 lines of Python
(scripts/collect_ik_demos.py). The key loop:
def run_one_episode(env, *, seed, on_step):
obs, info = env.reset(seed=seed)
target_name = info["target_object"]
target_pos = env.target_object_pos.copy()
# Phase targets, fingertip-relative
FT_OFFSET = np.array(UR5UMIRobot.WRIST_TO_FINGERTIP_OFFSET) # (0,0,-0.260)
HOVER_FT = np.array([0, 0, 0.15])
GRASP_FT = np.array([0, 0, 0.020])
LIFT_FT = np.array([0, 0, 0.25])
pick_quat = np.array([0.7071, -0.7071, 0, 0]) # wrist hangs DOWN
q_arm = env._data.qpos[:6].copy()
ik = _build_mink_solver(env._model)
# PHASE 1 — hover
target_wrist = target_pos + HOVER_FT - FT_OFFSET
for tick in range(60):
q_arm = _step_to(env, ik, q_arm, target_wrist, pick_quat,
grip=GRIP_OPEN, max_dq_deg=5)
on_step(env._get_obs(), _action_vec(env, q_arm))
# PHASE 2 — descend
target_wrist = target_pos + GRASP_FT - FT_OFFSET
for tick in range(80):
q_arm = _step_to(env, ik, q_arm, target_wrist, pick_quat,
grip=GRIP_OPEN, max_dq_deg=5)
on_step(env._get_obs(), _action_vec(env, q_arm))
# PHASE 3 — close
for tick in range(40):
q_arm = _step_to(env, ik, q_arm, target_wrist, pick_quat,
grip=GRIP_CLOSED, max_dq_deg=5)
on_step(env._get_obs(), _action_vec(env, q_arm))
# PHASE 4 — lift
target_wrist = target_pos + LIFT_FT - FT_OFFSET
for tick in range(60):
q_arm = _step_to(env, ik, q_arm, target_wrist, pick_quat,
grip=GRIP_CLOSED, max_dq_deg=5)
on_step(env._get_obs(), _action_vec(env, q_arm))
success = env.target_object_pos[2] > target_pos[2] + 0.10
return {"success": success, "target": target_name}
Each tick the on_step callback records the observation
and the action. That callback writes to our recording buffer; after
the episode it flushes everything to disk in the UMI zarr format.
Total: 240 frames per episode, ≈ 9.6 s of motion at 25 Hz.
action we log is action[t] = state[t] —
the current TCP pose. We do not log "the target the IK was
trying to reach" or "the joint deltas." This means the trained
policy will be predicting future poses of the wrist
(because the action chunk extends 50 ticks ahead in the dataset).
In chapter 30 we'll see why this absolute-pose choice is what
bottlenecks our results.
13Collecting 300 episodes on Modal CPU.
One episode takes ≈ 25 seconds wall-clock (most of it MuJoCo simulation). 300 episodes serial = 2 hours. We don't have 2 hours of patience, so we fan out across 30 CPU containers on Modal:
@app.cls(
image=image, cpu=4.0, memory=8192,
volumes={RUNS_DIR: runs_vol},
timeout=3600,
scaledown_window=120,
max_containers=64,
)
class UMIDemoWorker:
@modal.enter()
def _setup(self):
runs_vol.reload()
@modal.method()
def collect_chunk(self, run_id, worker_id, n_episodes, base_seed, ...):
env = TabletopPickEnv(EnvConfig())
out = EpisodeRecorder(env, image_hw=224)
worker_dir = pathlib.Path(RUNS_DIR) / run_id / f"worker-{worker_id:03d}"
worker_dir.mkdir(parents=True, exist_ok=True)
for ep in range(n_episodes):
seed = base_seed + ep
result = run_one_episode(env, seed=seed, on_step=out)
if result["success"]:
_write_demo_folder(worker_dir / f"demo_sim{ep:04d}_seed{seed}", out)
out.flush_episode()
# Pack all this worker's frames into one zarr.zip
write_umi_zarr(out, worker_dir / "dataset.zarr.zip")
runs_vol.commit()
return {"zarr_path": str(worker_dir / "dataset.zarr.zip"), ...}
From my laptop:
$ modal run --detach infra/collect_umi_app.py::main \
--n-workers 30 --episodes-per-worker 10
Modal spins up 30 containers in parallel (it'll wait if the workspace
is at quota), each runs 10 episodes, each writes its own zarr to a
persistent volume. After all workers finish, an orchestrator
function calls _merge_zarrs(...) to concatenate them
into one big dataset.zarr.zip.
What went wrong (twice)
map(return_exceptions=False) then cancels all
sibling workers — even ones that already finished. Painful.
Mitigation: bump worker timeout to 60 min, reduce eps-per-worker so
individual workers finish well within budget.
timeout=7200 (2 hr) wasn't long
enough once we randomised the layout because individual
pick reliability dropped (more variation = more failed picks =
longer episodes). Fix: bump orchestrator timeout to 14400 s, and
add a merge_only() function that can re-merge from
already-collected worker zarrs without re-running episodes.
Final yield, after fixing both: 300 episodes, 102,000 frames, 100% scripted-IK pick rate. Cost: ≈ $5 of Modal CPU.
14From UMI zarr to LeRobot v3 parquet.
Our scripted picker writes the
UMI
on-disk format: a dataset.zarr.zip archive (one
multi-array zarr, JpegXL-compressed image stream, flat numeric
arrays for state / bbox) plus a folder of per-demo MP4 videos for
quick visualisation.
LeRobot, the framework we use for training, wants the LeRobot v3 format instead:
build/lerobot_ds_300/
├── data/chunk-000/file-000.parquet # numeric columns per frame
├── videos/observation.images.wrist_cam/chunk-000/file-000.mp4
├── meta/
│ ├── info.json # feature schema
│ ├── stats.json # per-feature global stats
│ ├── episodes/chunk-000/file-000.parquet # per-episode index
│ └── tasks.parquet # task-text lookup
The columns in data/.../file-000.parquet (one row per frame):
observation.state—fixed_size_list<float, 7>(joint angles + gripper)observation.ee_pos—fixed_size_list<float, 3>observation.environment_state—fixed_size_list<float, 4>(the bbox)action—fixed_size_list<float, 7>frame_index,episode_index,timestamp,task_index
Images live in mp4 video files, one per camera per
chunk. LeRobot's dataloader uses PyAV to seek into the video by
frame_index. Much smaller than 102k individual PNGs.
The converter
tools/umi_zarr_to_lerobot.py walks the zarr, reads each
frame's image and numerics, and feeds them to a
LeRobotDatasetWriter we wrote
(data/lerobot_writer.py). The writer keeps an in-memory
buffer of frames and at finalize():
- Writes
data/chunk-000/file-000.parquet. - Encodes the image stream into
file-000.mp4usingimageio-ffmpeg. - Writes per-episode metadata to
meta/episodes/chunk-000/file-000.parquet. - Writes
meta/info.json(the schema) andmeta/stats.json(per-feature min/max/mean/std).
Push to Hub
$ python tools/push_lerobot_to_hub.py build/lerobot_ds_300 \
--repo-id ozyphus/cs224-lego-umi-rand-300 --branch v3.0
The push uses huggingface_hub.HfApi.upload_folder to
create a dataset repo under ozyphus/, then uploads
everything to both the main branch and a
v3.0 branch (LeRobot reads from v3.0 by
convention). The Modal training jobs can then stream the dataset
directly from the Hub.
data/episode_chunk_000/episode_000.parquet; v3 uses
data/chunk-000/file-000.parquet. Easy to miss when
porting code.
(2) meta/episodes/.../file-000.parquet wants a
data/chunk_index and data/file_index column
pointing back into the data parquets — easy to forget.
(3) meta/tasks.parquet wants a pandas DataFrame
whose index name is the task string, not a column. SmolVLA's
tokenizer hard-fails with "task cannot be None" if you get this
wrong.
15Action chunks — why predict 50 at a time.
A naive BC policy predicts one action per observation. At inference
you call π(obst) → at, execute
at, get obst+1, repeat.
This works in theory, but it gives the policy no commitment — each
step it picks the best action right now, ignoring the
trajectory it's on. If the demo data has small high-frequency jitters
(it does — physics integration plus IK noise), the policy learns to
predict those jitters too and the resulting motion looks like a
drunk man trying to thread a needle.
Action chunking, popularised by ACT (Aloha), fixes this. Instead of one action per call, the policy predicts a chunk of k future actions in one shot:
The trained policy returns a coherent multi-step plan from a single observation. The runtime keeps a queue of the predicted chunk and pops one action per env tick. After naction_steps pops (a parameter, ≤ k), it discards the rest of the chunk, re- observes, and predicts a new chunk. We use k = 50 for ACT, k = 16 for Diffusion.
Why this helps
- Smoother motion. Within a chunk the policy commits to a low-noise plan instead of swinging on every tick.
- Implicit smoothing across modes. If the demo has two valid sub-policies (e.g. left vs right approach), a one- step policy averages them (and goes through the middle, hitting nothing). A chunk policy can pick one mode at chunk-prediction time and stay committed.
- Lower call rate. If you re-predict every 8 ticks instead of every 1, you've cut the inference HTTP load by 8×.
The chunk visualiser
Below is a synthetic 50-step action chunk — exactly the shape an ACT model would emit. Each line is one of the 7 action dimensions. Click play; the dashed vertical line is the "head" — what's currently being executed. After tick 49 the queue empties and a new chunk is fetched (starts over here for the demo).
16The four neural-net building blocks we need.
Before diving into ACT and Diffusion Policy, let's pin down the four primitives both use. If you've trained a CNN or a transformer this chapter is a refresher; if not, this is the bare minimum to make sense of the rest.
1 · Multi-layer perceptron (MLP)
A pile of linear layers with a non-linearity (ReLU, GELU) between them. Maps a vector in to a vector out. We use MLPs everywhere small things need to be embedded — the 7-D state, the 4-D bbox, the time-step embedding in diffusion.
import torch.nn as nn
mlp = nn.Sequential(
nn.Linear(4, 64), nn.GELU(),
nn.Linear(64, 64), nn.GELU(),
nn.Linear(64, 128),
)
out = mlp(bbox_tensor) # shape (B, 128)
2 · ResNet image encoder
A convolutional network that turns a 224×224 RGB image into a small feature map (e.g. 7×7×512). The convolutions slide small filters across the image, building up local-to-global features through successive downsampling. We use the pre-trained ResNet18 (≈ 11 M params, trained on ImageNet) because it's small, fast, and the early ImageNet features generalise surprisingly well to robot frames.
import torchvision.models as tv
backbone = tv.resnet18(weights="DEFAULT")
# strip the final classification head, keep up to conv5 features
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])
img_feat = feature_extractor(image) # shape (B, 512, 7, 7)
3 · Transformer encoder / decoder
A stack of self-attention blocks. Each block lets every token in a sequence look at every other token, weighted by a learned similarity score. The transformer's strength is modelling long-range dependencies — exactly what we need to relate a bbox in the image to the gripper's joint state.
For our purposes the key parts are:
- Take a list of tokens (vectors of dim 512).
- Add positional embeddings so the model knows order.
- For L layers, run
token = token + Attention(token); token = token + MLP(token). - Output: the same number of tokens, each refined.
4 · U-Net (for Diffusion)
A specific CNN architecture that maps an input to an output of the same shape, with a "U" of downsampling-then-upsampling layers and skip connections between symmetric levels. Diffusion Policy's U-Net takes a noisy action chunk and predicts the noise. We'll meet it in chapter 18.
17ACT — the Action Chunking Transformer.
ACT, from Tony Zhao's ALOHA paper, is the simplest end-to-end imitation-learning model that "just works" on a 6-DOF manipulator. Three components glued together:
Why a VAE?
The demo dataset can have multiple valid trajectories from the same starting state — pick from the left, or from the right; close fast, or close slow. If we just regressed actions on observations, MSE would average those modes, producing a "between-modes" action that might not be valid (the gripper closing on empty air between two bricks). The VAE encoder bakes the demo trajectory into a latent z, so the decoder is conditioned on "this kind of trajectory". At training time z comes from encoding the actual demo; at inference time z is just zeros and the model uses the mode that fell out of training.
The forward pass, in code
# Pseudocode based on lerobot/policies/act/modeling_act.py
class ACT(nn.Module):
def __init__(self, config):
self.image_encoder = ResNet18Embed(out_dim=512)
self.state_mlp = nn.Linear(7, 512)
self.env_state_mlp = nn.Linear(4, 512) # bbox
self.transformer_enc = TransformerEnc(d_model=512, n_layers=4)
self.transformer_dec = TransformerDec(d_model=512, n_layers=7)
self.query_embed = nn.Parameter(torch.randn(50, 512)) # 50 query slots
self.action_head = nn.Linear(512, 7)
if config.use_vae:
self.action_encoder = TransformerEnc(...)
self.z_head = nn.Linear(512, 64) # μ, log σ
self.z_proj = nn.Linear(32, 512)
def forward(self, batch):
img_tokens = self.image_encoder(batch["image"]) # (B, 49, 512)
s_tok = self.state_mlp(batch["state"]).unsqueeze(1) # (B, 1, 512)
b_tok = self.env_state_mlp(batch["bbox"]).unsqueeze(1) # (B, 1, 512)
z_tok = self._sample_z(batch).unsqueeze(1) # (B, 1, 512)
tokens = torch.cat([s_tok, b_tok, z_tok, img_tokens], dim=1)
memory = self.transformer_enc(tokens) # (B, 52, 512)
queries = self.query_embed.unsqueeze(0).expand(B, -1, -1) # (B, 50, 512)
chunk_feat = self.transformer_dec(queries, memory)
return self.action_head(chunk_feat) # (B, 50, 7)
What it does well
- Fast. ~80 M params, runs at 50 ms / forward pass on an A10G — easily 20 Hz.
- Stable. The chunk-based output gives smooth motion without complex post-filters.
- Cheap to train. ≈ $1 of A10G time for our 32k steps; finished at loss 0.035 on the 300-ep dataset.
What it does badly (for us)
The bbox goes in as one of 52 tokens, alongside 49 image tokens that have 512 dimensions each. Self-attention can in theory let the bbox token pull selectively from image features, but in practice the image tokens dominate. Our sensitivity probe showed a full bbox sweep moves the predicted wrist position by ≤ 5 cm — about a third of a brick spacing. Chapter 32 has the full analysis.
18Diffusion Policy — DDPM for actions.
Diffusion Policy, from Chi et al. 2023, swaps ACT's transformer decoder for a Denoising Diffusion Probabilistic Model (DDPM) over the action chunk. It's heavier and slower than ACT, but better at multi-modal demos and tends to generalise more cleanly.
The denoising idea
Start with a clean target — a 16-step action chunk a0 ∈ ℝ16×7. Apply T = 100 increasingly noisy versions by adding Gaussian noise:
Where ᾱk is a noise schedule that decays from 1 (no noise) to 0 (pure noise) as k goes from 0 to T. Now we train a network εθ(ak, k, cond) to predict the noise ε given the noisy chunk, the noise level k, and a conditioning vector (image+state+bbox embedding). The loss is just MSE on the predicted noise:
Sampling at inference
Start from pure Gaussian noise aT. Run T steps of denoising: at each step, predict the noise, subtract a scaled version, optionally re-add a smaller amount of noise (for DDPM) or none (for DDIM). After all steps you've got a clean action chunk a0. That's the policy output.
The architecture in detail
# lerobot/policies/diffusion/modeling_diffusion.py — abridged
class DiffusionModel(nn.Module):
def __init__(self, config):
self.image_encoder = ResNet18Embed(out_dim=512)
self.state_mlp = nn.Linear(7, 64)
self.env_state_mlp = nn.Linear(4, 64) # bbox
cond_dim = 512 + 64 + 64 + 128 # img + state + bbox + time
self.unet = ConditionalUNet1D(
in_dim=7, cond_dim=cond_dim,
down_dims=[512, 1024, 2048],
)
self.scheduler = DDPMScheduler(num_train_timesteps=100)
def predict_action_chunk(self, batch):
cond = self._build_cond(batch) # (B, cond_dim)
a = torch.randn(B, 16, 7, device=device) # pure noise
for k in self.scheduler.timesteps: # 100, 99, …, 1
eps = self.unet(a, k, cond) # predicted noise
a = self.scheduler.step(eps, k, a).prev_sample
return a # clean chunk (B, 16, 7)
Why diffusion sometimes wins
Compared to ACT's regression head, diffusion captures multi-modal distributions natively — the noise sampling injects randomness that can land in any mode. Practically, this means smoother trajectories and better generalisation in our benchmarks (309 mm vs 542 mm reach on the same data). The downside is 100× more compute per inference call (≈ 150 ms vs 50 ms for ACT), but at our 25 Hz control rate that's still fast enough.
19SmolVLA — when you want language too.
Vision-language-action (VLA) models extend ACT/Diffusion with a pre-trained vision-language backbone that can understand text as part of the conditioning. The idea: the policy can be steered by natural-language commands ("pick the orange one"), not just bboxes.
SmolVLA (450 M params) is HuggingFace's compact VLA: it stacks an action flow-matching head on top of SmolVLM, a small (160 M) VLM that handles 224×224 images and short text prompts. The flow-matching head is similar to diffusion but one-step (faster to sample).
Why we tried it, and why we de-prioritized it
Conceptually, a VLA is the "right" interface for our task — the operator could just type "pick the orange Lego". In practice, on our 285-episode dataset:
- SmolVLA-base is too big to fit comfortably on A10G; needs A100-40GB, ≈ 3× the cost per training hour.
- Tokenisation of language at inference adds another 30 ms of latency.
- The "task" string is identical across all 285 episodes ("pick the lego brick"), so the language input wasn't doing useful work — we'd need to label demos with brick-specific text to teach the model word-to-bbox grounding.
- It still uses
observation.environment_statethe same way ACT does, so the bbox bottleneck would apply.
SmolVLA trained fine (final loss 0.013, 5 h on A100), but we never got a clean closed-loop number — the serve container kept missing the language tokenisation step (it's done in the preprocessor pipeline, which we initially didn't load — see chapter 25). By the time we patched it, results suggested it wasn't worth more iteration over Diffusion. The checkpoint is on the Modal volume if the next-iteration team wants to revive it.
20Gradient descent, for the unfamiliar.
All three policies (ACT, Diffusion, SmolVLA) train the same way: stochastic gradient descent with Adam. If you've trained any neural net before you can skip this chapter.
The recipe
- Define a loss function ℒ that takes the model's predictions and the dataset's ground truth and returns one number (lower = better). For BC, ℒ is MSE between predicted and expert actions. For diffusion, ℒ is MSE between predicted noise and added noise.
- Sample a small batch (say 32 frames) from the dataset. Run the model forward to get predictions. Compute ℒ for the batch.
- Backpropagate — compute ∂ℒ/∂θ for every model parameter θ. PyTorch handles this automatically.
- Update: θ ← θ − η · ∇ℒ, scaled by Adam's per-parameter learning rate.
- Repeat steps 2-4 for N iterations (we use ≈ 30,000).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
for step in range(30_000):
batch = sample_batch(dataset, batch_size=32)
pred_action = model(batch["obs"])
loss = loss_fn(pred_action, batch["action"])
optimizer.zero_grad()
loss.backward()
optimizer.step()
if step % 100 == 0:
print(f"step {step}: loss={loss.item():.4f}")
What "training is done" means
Loss is monotonically going down (with noise). At some point the rate of decrease flattens — the model has roughly memorised the expert data and is now polishing. We hit:
| arch | final loss | steps | time | cost |
|---|---|---|---|---|
| ACT | 0.035 | 32 000 | 26 min | $0.17 |
| Diffusion | 0.001 | 30 000 | 1h 25m | $1.84 |
| SmolVLA | 0.013 | 20 000 | 5h 10m | $16.10 |
Diffusion's loss is much lower because it's predicting noise (which is well-bounded), whereas ACT is predicting raw action values which have wider variance. The absolute numbers aren't comparable across architectures.
21Modal — cloud GPUs from a Python decorator.
Renting GPUs is annoying. AWS gives you a raw EC2 instance to
configure, RunPod is per-minute but you maintain the container,
Lambda Labs has the cheapest H100s but it's a queue. Modal
is different: you decorate a Python function with
@app.function(gpu="L40S") and call it from your laptop.
Modal handles container build, GPU allocation, file persistence,
and tear-down. Per-second billing.
The mental model
- App — a Python module containing functions
and classes decorated with
@app.functionor@app.cls. You deploy the app (modal deploy) or run it (modal run); both build a container image, push it to Modal's registry, and register the functions in Modal's scheduler. - Image — a container image, defined declaratively: base image, apt installs, pip installs, env vars.
- Volume — persistent disk storage, mounted at a
path in your container. Reads need
vol.reload(), writes needvol.commit(). - Secret — environment variables (HF tokens, API keys), stored encrypted in Modal's KMS.
- Function vs. class — a function spins up a
container for one call. A class (
@app.cls) holds state across calls (e.g. a loaded model), with@modal.enter()setup methods and@modal.method()RPC entry points.
Our training image
image = (
modal.Image.from_registry(
"nvidia/cuda:12.4.1-devel-ubuntu22.04", # devel so nvcc is present
add_python="3.12", # LeRobot requires 3.12
)
.apt_install("git", "ffmpeg", "libgl1", "libglib2.0-0")
.run_commands("pip install --upgrade pip setuptools wheel")
.pip_install(
"lerobot[pi,dataset,training,diffusion,smolvla] @ "
"git+https://github.com/huggingface/lerobot.git@main",
"huggingface_hub>=1.0.0",
"wandb>=0.16",
"num2words", # SmolVLM tokenizer dep
)
.env({
"HF_HOME": f"{WEIGHTS_DIR}/hf",
"TRANSFORMERS_CACHE": f"{WEIGHTS_DIR}/hf",
"TOKENIZERS_PARALLELISM": "false",
})
)
First build of this image takes ≈ 12 minutes (CUDA download + LeRobot's deep dep tree). Subsequent deploys use the cached layers and take ≈ 8 seconds.
Cost calculator
Detach for long jobs
A training run takes 1+ hours. We don't want the local terminal
babysitting it. modal run --detach launches the
container and exits locally; the run continues on Modal until
completion. Logs are tailed via modal app logs <app-id>.
22The LeRobot trainer in detail.
Inside our Modal function, the training is all LeRobot. The driver is 25 lines of glue:
def _run_train(policy_cfg, dataset_repo_id, name, num_steps, batch_size, lr, ...):
from lerobot.configs.default import DatasetConfig
from lerobot.configs.train import TrainPipelineConfig
from lerobot.scripts.lerobot_train import train as run_train
output_dir = Path(WEIGHTS_DIR) / "ft" / name
policy_cfg.device = "cuda"
policy_cfg.optimizer_lr = lr
policy_cfg.push_to_hub = False
train_cfg = TrainPipelineConfig(
dataset = DatasetConfig(repo_id=dataset_repo_id),
policy = policy_cfg,
output_dir = output_dir,
job_name = name,
batch_size = batch_size,
steps = num_steps,
save_checkpoint= True,
save_freq = save_freq,
log_freq = log_freq,
num_workers = num_workers,
tolerance_s = 1e-3, # float32 timestamp slack
)
run_train(train_cfg)
weights_vol.commit()
Inside run_train(train_cfg) LeRobot does:
- Build the dataset. Stream the parquet+mp4 from
HF Hub if not cached locally, else load from disk. Construct a
PyTorch
Datasetthat yields windowed frames with action chunks. - Compute / load normalisation stats. Per-feature
mean / std are read from
meta/stats.json. These feed aNormalizerProcessorstep that whitens inputs and outputs at train time. - Build the policy. Construct the chosen class (ACT / Diffusion / SmolVLA) with the chosen config. Wrap it in the pre/post-processor pipeline.
- Train loop. Standard PyTorch DDP-ready loop
with mixed-precision, EMA shadow weights for Diffusion, periodic
checkpoint saves to
output_dir / "checkpoints" / "{step}", plus a symlink"last"for convenience. - Save artifacts. Each checkpoint folder
contains
model.safetensors,config.json, and the preprocessor/postprocessor pipeline as safetensors + JSON. The full pipeline is saved next to weights so inference can recreate it.
The bookend-timestamp bug
1e-4) is too tight;
we bumped it to 1e-3 which is still well inside the
40 ms frame spacing.
23Hyperparameters that actually mattered.
Most LeRobot defaults are fine; you don't tune ResNet18 width or transformer depth. The knobs we touched, ranked by effect on our closed-loop numbers:
chunk_size / n_action_steps (ACT)
50 / 50: predict 50 actions, execute all 50 before re-predicting. Smaller (25 / 8) is more reactive but more wobbly. Larger (100 / 50) is wasteful given our 250-tick rollouts. 50 / 50 won.
horizon / n_action_steps (Diffusion)
16 / 8: predict a 16-step chunk, execute 8 before re-predicting. This is a 2× re-prediction rate vs ACT — diffusion needs it because its longer denoising path makes single-chunk plans noisier. We didn't tune past the LeRobot defaults.
batch_size + lr
ACT: bs=8, lr=1e-5. Diffusion: bs=32, lr=1e-4. SmolVLA: bs=4, lr=5e-5. We took these from LeRobot's default configs — they're well-tuned for these architectures and we never had reason to deviate.
save_freq
Save every 4000 (ACT) or 5000 (Diffusion) steps to the Modal volume. Each checkpoint is ≈ 300 MB; we keep them all so we can eval mid- training if something looks off. Volume holds ≈ 30 GB total across six checkpoints with histories.
n_obs_steps (Diffusion)
2: stack the last two observations into a 2-frame history. Empirically helps Diffusion smooth its predictions; ACT is 1 by default.
What we wished we'd tuned
- Cond encoder dim for env_state. The bbox-to- feature MLP in both ACT and Diffusion projects 4 → 64. Bumping it to 4 → 512 (matching image features) was the obvious cheap ablation we didn't try. Chapter 35 details what that would look like.
- Train-time data augmentation. LeRobot supports random affine + colour-jitter on input images. We left these off because the sim images are already deterministic; for real- robot deployment they'd be essential.
- Steps. 30k steps is where ACT plateaus on training loss; we never tested 60k or 100k. Probably wouldn't have helped (loss is already low), but we never ruled it out.
24Serving a trained policy on Modal.
Training produces a checkpoint sitting on the
umi-sim-weights volume. To use it we deploy a second
Modal app — a serving app whose job is to keep the policy
warm on a GPU and answer HTTP POSTs. The shape we used:
@app.cls(
image=image,
gpu="A10G", # cheaper inference GPU
volumes={WEIGHTS_DIR: weights_vol},
secrets=[hf_secret],
timeout=600, # 10 min per request max
scaledown_window=300, # keep alive 5 min after last call
min_containers=0, # spin up on first request
)
class PolicyServer:
@modal.enter()
def _setup(self):
weights_vol.reload()
self.device = torch.device("cuda")
self._policy = None
self._current_ckpt = None
def _ensure_loaded(self, checkpoint_path):
if self._current_ckpt == checkpoint_path and self._policy is not None:
return
self.arch = checkpoint_path.split("/")[1].split("-", 1)[0]
self._policy = _load_policy(self.arch, checkpoint_path)
self._policy.to(self.device).eval().reset()
self._pre, self._post = make_pre_post_processors(
policy_cfg=self._policy.config,
pretrained_path=str(full_path),
preprocessor_overrides={"device_processor": {"device": str(self.device)}},
)
self._current_ckpt = checkpoint_path
@modal.fastapi_endpoint(method="POST")
def live(self, payload: dict) -> dict:
self._ensure_loaded(payload["checkpoint"])
if payload.get("reset"):
self._policy.reset()
batch = self._build_batch(payload)
batch = self._pre(batch)
with torch.inference_mode():
action = self._policy.select_action(batch)
action = self._post(action)
return {"action": action.squeeze(0).cpu().numpy().tolist(),
"arch": self.arch, "checkpoint": self._current_ckpt}
We have three policy architectures; each one is deployed as its own
Modal app with the same code, but a different value of an
ARCH environment variable that gates which class gets
loaded. The APP_NAME = f"umi-sim-serve-{ARCH}" line at
the top of the file makes each deploy land on a unique Modal app.
ARCH is read
at Python module-import time, on the local laptop, when
modal deploy runs. Modal then bakes the
APP_NAME into the deploy. But inside the container at
runtime, os.environ["ARCH"] is NOT set unless we
explicitly put it in image.env(...). Without that fix
every container's self.arch defaulted to
"act" — and the diffusion endpoint was loading
Diffusion checkpoints with cls = ACTPolicy, failing
in confusing schema-mismatch ways. Real fix: derive arch from the
checkpoint path (ft/<arch>-…) instead of relying
on the env var.
25The pre/post-processor pipeline — and the bug that cost us a day.
Neural nets are trained on normalised data. The dataset's
raw action vectors live in a frame where typical TCP positions are
around (0.1, −0.5, 0.7) — but we train the policy on
(value − mean) / std so each dimension has unit variance.
Same for the images: pixel values get divided by 255 then
standardised against per-channel ImageNet means.
This normalisation is captured in a processor pipeline — a list of small transforms that LeRobot saves alongside the weights:
checkpoints/032000/pretrained_model/
├── model.safetensors # the actual weights
├── config.json # architecture config
├── policy_preprocessor.json # preprocessor pipeline schema
├── policy_preprocessor_step_3_normalizer_processor.safetensors
├── policy_postprocessor.json
└── policy_postprocessor_step_0_unnormalizer_processor.safetensors
At inference we MUST apply these:
from lerobot.policies import make_pre_post_processors
pre, post = make_pre_post_processors(
policy_cfg=policy.config,
pretrained_path=str(checkpoint_dir),
)
batch = build_batch(payload)
batch = pre(batch) # normalises image + state + bbox
action = policy.select_action(batch)
action = post(action) # un-normalises to world units
post initially. The policy was returning raw
normalised outputs — vectors with mean ≈ 0 and std ≈ 1 — and
the eval driver treated them as TCP poses in metres. A "z" of 2.5
would have the IK solver hunting for the wrist 2.5 m above the
table; the joint targets came back as the closest reachable
configuration, which was basically random. 0 / 14 picks
in our first eval. After adding post(action) the same
checkpoints dropped from ≈ 1 m wrist error to ≈ 300 mm — a 3× jump
just from getting normalisation right.
26The closed-loop driver.
The serving endpoint just answers "given this obs, what action?" The
closed loop — rendering the env, calling the policy,
converting the action to joint targets, stepping the env — runs on
the laptop. tools/eval_in_sim.py does it. The core
loop:
for t in range(max_ticks):
# 1. Render wrist GoPro, project bbox
renderer.update_scene(env._data, camera="umi/gripper_gopro")
warped = apply_fisheye(renderer.render(), fisheye_map)
bbox = _project_brick_bbox(env._model, env._data, target_bid, cam_id, 224)
# 2. Build payload + POST to Modal
payload = {
"checkpoint": checkpoint,
"image_b64": _encode_png(warped),
"state" : _state_vec(env, wrist_bid),
"bbox" : bbox.tolist(),
"reset" : not sent_reset,
}
sent_reset = True
action = np.asarray(requests.post(url, json=payload).json()["action"])
# 3. TCP-pose target → joint target via mink IK
target_xyz = action[:3]
target_quat = rotvec_to_wxyz(action[3:6])
q_target = cid._solve_ik_once(env, ik_state, target_xyz, target_quat, n_iters=20)
grip_target = float(np.clip(action[6], 0.0, 0.045))
# 4. Cap per-tick joint motion at 15° to avoid wild snaps
dq = q_target - q_arm
n = float(np.linalg.norm(dq))
if n > MAX_DQ: dq = dq * (MAX_DQ / n)
q_arm = q_arm + dq
# 5. Step env
obs, *_ = env.step(np.concatenate([q_arm, [grip_target]]).astype(np.float32))
if env.target_object_pos[2] > target_pos[2] + 0.10:
return {"success": True, "ticks": t + 1}
Latency budget
- render wrist cam: ≈ 8 ms (MuJoCo GL renderer at 224×224)
- PNG encode + base64: ≈ 12 ms
- HTTP POST round trip: ≈ 250 ms (laptop → Modal → laptop)
- Modal-side policy inference: ≈ 150 ms (Diffusion, 70 DDPM steps)
- IK solve: ≈ 4 ms (mink + daqp at 20 iters)
- env.step: ≈ 2 ms (20 inner physics substeps at 4 ms each)
Total ≈ 250-350 ms per control tick. Our env runs at 25 Hz (40 ms / step), so we're heavily bottlenecked by HTTP latency. In production you'd run the policy on the same machine (or at least the same data centre); here we accept the lag because the demo doesn't need real-time.
27Rerun — the rosetta stone of robotics viz.
Rerun is a multi-modal time-series viewer for robotics. Think Tensorboard + Foxglove + a 3D viewer fused together. It's the single best tool I've used for debugging a sim policy. You write to it from Python with one-line log calls; you view in a web browser or a native window.
Logging from the loop
import rerun as rr
rr.init("umi_sim_dashboard")
rr.serve_grpc(grpc_port=9876)
rr.serve_web_viewer(web_port=9090, open_browser=False)
for step in range(num_steps):
rr.set_time("step", sequence=step)
log_kinematic_step(env._model, env._data) # 3D scene
rr.log("sim/wrist_cam", rr.Image(wrist_img))
rr.log("sim/third_person", rr.Image(overhead_img))
for i, v in enumerate(obs["observation.state"]):
rr.log(f"state/joint_{i}", rr.Scalars(float(v)))
for i, v in enumerate(action):
rr.log(f"action/dim_{i}", rr.Scalars(float(v)))
The killer feature is the timeline scrub. When an eval rollout fails at tick 287, scrub the timeline back to 287, see exactly where the wrist was, what the predicted action was, what the wrist cam looked like. We caught the missing-postprocessor bug from chapter 25 in Rerun by scrubbing through a failed rollout and noticing the action plots sitting at unit variance instead of in metres.
28The click-to-pick dashboard.
The dashboard (dashboard/main.py) is a local FastAPI
server that wraps everything we've built behind a browser UI. It
runs on the laptop, talks to the Modal serving endpoints, streams
MJPEG to the browser, and embeds Rerun in an iframe.
What the UI shows
- Model dropdown. Six options:
act-{vanilla,bbox,bbox-overlay}anddiffusion-{vanilla,bbox,bbox-overlay}. Plusrandomfor a baseline sanity check. - Target-brick dropdown. 12 named bricks plus
(random). Selecting one overrides the env's random choice atreset()time. - Run / Stop buttons. Run kicks off a rollout (default 400 ticks). Stop politely breaks the loop without killing the thread.
- Status pill. Live updates with
step | target | dist=Xmm | r=0.00, wheredistis the live wrist-to-target distance — your single best signal for whether the policy is actually making progress. - Wrist GoPro preview. Live 224×224 stream with the projected bbox drawn as a green rectangle around the target brick.
- Overhead view. Third-person camera with a cyan ring + label highlighting which brick is the target.
29How we measure "did it work?"
Three metrics, ranked by usefulness:
1 · Closed-loop pick rate
The headline. Out of N rollouts, how many actually picked up the target brick. Defined as brick z > initial brick z + 10 cm, sustained for at least one tick. Binary per seed.
2 · Final wrist-to-brick error (mm)
Even when the pick fails, we want to know how close the
arm got. We log ‖wrist_pos − brick_pos‖ at the end of
each rollout. A successful pick obviously has very small error
(gripper actually touching brick). An unsuccessful one tells us
whether the policy got within 5 cm (and just failed to grasp), 30 cm
(got close but stopped), or 80 cm (didn't even leave the home pose).
This number is the most informative even when we never get a
successful pick.
3 · Wrist-to-brick distance over time
Per-tick distance, plotted. Lets us see whether the policy is actively descending, oscillating, or stuck. The dashboard's status pill exposes this live.
Sweeping seeds
We use a fixed range of seeds (10000–10004 for 5-seed evals) deliberately outside the seed range used for training (0–1379). This guarantees the eval bricks have positions the model never saw.
30The plateau, in one chart.
We did four iterations of training, each with a different fix targeting a different hypothesised bottleneck. Here's all of them side-by-side. Lower bars are better; the dashed red line at 100 mm is the closest a wrist can be without being considered a pick.
Reading the chart
- baseline — first run after we fixed the postprocessor bug from chapter 25. Diffusion bbox = 316 mm, ACT bbox = 542 mm.
- delta-TCP — re-collected the dataset with
action[t] = state[t+1] − state[t](instead of absolute pose). Hypothesis: predicting deltas avoids the absolute-pose plateau. Result: Diffusion went UP to 493 mm; ACT went DOWN to 465 mm. Mixed signal. - overlay — re-rendered the wrist mp4 with the bbox drawn directly on each image, so the policy can't ignore it visually. Diffusion: 309 mm. ACT: 476 mm. Essentially same as baseline for Diffusion.
- randomized — re-collected with per-episode random spawn jitter (±20 mm xy + ±15° yaw). Diffusion: 333 mm, ACT: 529 mm. Same plateau.
Notice the Diffusion bar barely moves across the four interventions: 316, 493, 309, 333. The 493 outlier is delta-TCP, which broke the absolute-trajectory representation Diffusion liked. The other three sit within 24 mm of each other across completely different data layouts, action targets, and visual encodings.
This is the signature of an architectural bottleneck, not a data bottleneck. If 285 → 300 episodes with randomization and overlay couldn't move the plateau, no amount of more data would. The next chapter is the autopsy.
31Does the bbox actually steer the policy?
The simplest test of "is the bbox doing anything?" is to feed three different bboxes with the same image and state, and see if the predicted action changes. If the policy is ignoring the bbox, the actions will be identical. If it's using it, they'll vary.
# tools/probe_bbox_sensitivity.py
def probe(url, ckpt, label):
img_b64 = _dummy_img()
state = [0.13, -0.10, 1.20, 0, 0, 0, 0]
bboxes = {
"top-left": [0.10, 0.10, 0.30, 0.30],
"center": [0.40, 0.40, 0.60, 0.60],
"bot-right": [0.70, 0.70, 0.90, 0.90],
}
for name, bbox in bboxes.items():
body = {"checkpoint": ckpt, "image_b64": img_b64, "state": state,
"bbox": bbox, "reset": True}
action = requests.post(url, json=body, timeout=60).json()["action"]
print(f" {name:9s} pos=({action[0]:+.3f},{action[1]:+.3f},{action[2]:+.3f})")
Output:
=== act-bbox ===
topleft pos=(+0.135, -0.185, +1.198)
center pos=(+0.126, -0.232, +1.201) |Δ_pos|= 4.8 cm
botright pos=(+0.117, -0.255, +1.198) |Δ_pos|= 7.2 cm
=== diffusion-bbox ===
topleft pos=(+0.142, -0.100, +1.202)
center pos=(+0.136, -0.100, +1.202) |Δ_pos|= 0.7 cm
botright pos=(+0.119, -0.100, +1.185) |Δ_pos|= 2.9 cm
Two takeaways:
- Both policies do respond to the bbox. The predicted y-coord (depth into the table) decreases from −0.185 → −0.232 → −0.255 as the bbox moves top-to-bottom. ACT's |Δ| is 7.2 cm across the full bbox sweep; Diffusion's is 2.9 cm.
- Neither response is large enough to switch bricks. Adjacent bricks in our grid are 11 cm apart in x and 14 cm apart in y. The policy moves 1/3 of a brick spacing when we move the bbox by 60% of the image width. To pick a specific brick the policy needs to lateralise by a full brick spacing.
So: bbox conditioning is alive but architecturally weak. The next chapter explains why.
32The architectural diagnosis.
Both ACT and Diffusion Policy follow the same pattern for combining conditioning signals: they project each signal through a small MLP to a fixed-dim feature, then concatenate the features into one big "global condition" vector that feeds the action decoder. The relevant math:
= [ 512 ; 64 ; 64 ]
The image gets 512 channels — 8× the channel budget of the bbox. Even before the policy's decoder reads cond, the bbox is architecturally outvoted 8:1 by image features. The decoder is a transformer with self-attention, which could in principle let the bbox token gate image features — but it has no prior to do so, and during BC training the gradient signal flowing through the bbox token is small (because the bbox often is redundant with the image — the brick IS visible — so the model can ignore the bbox and still get low loss).
Why MORE data didn't fix it
More data would help if the bbox were doing nothing — the model would eventually be forced to use it. But the bbox is doing something (chapter 31's probe shows it). The model has converged on a small bbox→action sensitivity, and adding more data in the same regime reinforces the same convergence. Cross-arch the same plateau argues for an architectural floor, not a data floor.
Why overlay didn't fix it either
Drawing the bbox on the image moves it from environment_state
(channel budget 64) to image pixels (channel budget 49 × 512 =
25 088). On paper that should help — the bbox is now a high-channel
visual signal the policy can't ignore. In practice the green pixels
are a tiny subset (≈ 0.5% of the image area), so ResNet18's feature
aggregation drowns the rectangle in the much-larger ambient image
content. Plus the policy has no architectural reason to attend to
"green-stuff" specifically — that's a hand-engineered convention
the model only learns to use weakly.
What WOULD fix it
Three architectural changes (any one would help):
- Cross-attention from bbox to image patches. Replace the simple env_state→concat path with a transformer block where the bbox vector is the query and the 49 ResNet image patches are keys / values. Now the model is structurally forced to look at image features through the bbox's lens. This is how RT-1, RoboPoint, and π0 inject spatial prompts.
- Position-encoded image patches gated by bbox. Add a per-patch mask derived from the bbox (1.0 inside, slowly decaying outside) and multiply patch features by it. The irrelevant patches get suppressed before they reach the cond vector.
- Re-render with the target brick highlighted in the scene itself. i.e., make the target brick literally glow red in the rendered image. This is a sim-only cheat but makes the visual signal completely unambiguous; any policy that can read pixels can solve the task. Useful as a sanity check that the rest of the stack is sound.
Chapter 35 has the code for option 1.
33What LeRobot actually does for us.
If you're going to write this yourself, it helps to know exactly what LeRobot is doing on your behalf. Roughly four things:
1 · The dataset machinery
Loading parquet + mp4 + meta from disk or HF Hub, building a
PyTorch Dataset, computing normalisation stats (or
reading them from stats.json), constructing windowed
action chunks across episode boundaries, validating timestamps,
handling the v3.0 path conventions. The class to know is
LeRobotDataset, defined in
lerobot/datasets/lerobot_dataset.py.
2 · The policy classes
Concrete nn.Modules for each architecture:
ACTPolicy, DiffusionPolicy,
SmolVLAPolicy, PI0Policy (π0.5),
VQBeTPolicy, etc. Each has a forward()
(for training), a select_action() (for inference,
handles the action queue), and a reset() (clears the
queue).
3 · The processor pipeline
Pre/post-processing as a chain of ProcessorStep
instances: RenameObservationsProcessorStep,
NewLineTaskProcessorStep,
AddBatchDimensionProcessorStep,
TokenizerProcessorStep (for SmolVLA),
NormalizerProcessorStep,
DeviceProcessorStep. Built by
make_pre_post_processors from the saved schema.
4 · The training loop
The function train(cfg) in
lerobot/scripts/lerobot_train.py — instantiates the
dataset + policy + optimizer + LR scheduler, runs the gradient
loop with mixed precision, periodically saves checkpoints (model +
config + processor pipeline + optimizer state), logs to wandb if
configured.
The lines of code in our project
| file | LOC | what it does |
|---|---|---|
experiments/train/app.py | 244 | Modal train driver — 3 functions, one per arch |
experiments/serve/app.py | 220 | Modal serve — load policy, apply pre/post, return action |
tools/eval_in_sim.py | 240 | local closed-loop driver |
tools/umi_zarr_to_lerobot.py | 100 | convert collection zarr → LeRobot v3 parquet |
data/lerobot_writer.py | 450 | write LeRobot v3 dataset to disk |
dashboard/main.py | 370 | FastAPI dashboard with MJPEG + Rerun |
envs/tabletop_pick.py | 200 | MuJoCo env class |
scripts/collect_ik_demos.py | 340 | scripted IK demonstrator |
scripts/build_menagerie_scene.py | 370 | scene XML builder |
Total ≈ 2 500 lines of our code. LeRobot is doing everything we'd otherwise have to write — and that's ≈ 30 000 lines on the LeRobot side, of which we lean on maybe 8 000.
34What "from scratch" looks like.
Let's say you decided to write ACT yourself. Here's the realistic LOC budget, drawn from the actual upstream ACT reference impl:
| component | LOC | where the time goes |
|---|---|---|
act_policy.py — ResNet18 + transformer VAE + chunk decoder + EMA | 500-700 | translating from Tony Zhao's reference impl, getting the VAE KL right |
dataset.py — parquet+mp4 loader, episode/chunk indexing, normalisation stats, action-chunk windowing across episode boundaries | 600-1000 | this is the bug-magnet — off-by-one in chunk indexing silently produces a model that "trains" but doesn't track |
train.py — dataloader, optimizer, LR sched, EMA hook, ckpt save/load | 200 | mostly orchestration |
inference.py — obs queue, action queue, reset semantics, preprocessor | 200 | same bugs as LeRobot's serve path — you'd rediscover them yourself |
| tests for chunking, normalisation round-trip, episode boundaries | 300 | catches the dataset bugs |
So ≈ 2k LOC and 3-5 days to "matches LeRobot on the same data."
A 300-line skeleton (the absolute minimum)
# act_minimal.py — strip ACT to its skeleton; no VAE, no chunked decoder,
# just enough to get the geometry right.
import torch, torch.nn as nn, torchvision.models as tv
class MinimalACT(nn.Module):
def __init__(self, chunk_size=50, action_dim=7, state_dim=7, bbox_dim=4, d=512):
super().__init__()
backbone = tv.resnet18(weights="DEFAULT")
self.img_enc = nn.Sequential(*list(backbone.children())[:-2])
self.img_proj = nn.Conv2d(512, d, kernel_size=1)
self.pos_embed = nn.Parameter(torch.randn(49, d)) # 7×7 patches
self.state_mlp = nn.Linear(state_dim, d)
self.bbox_mlp = nn.Linear(bbox_dim, d)
encoder_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=4)
self.query_embed = nn.Parameter(torch.randn(chunk_size, d))
decoder_layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)
self.head = nn.Linear(d, action_dim)
def forward(self, image, state, bbox):
# image: (B, 3, 224, 224); state: (B, 7); bbox: (B, 4)
feat = self.img_proj(self.img_enc(image)) # (B, d, 7, 7)
feat = feat.flatten(2).transpose(1, 2) # (B, 49, d)
feat = feat + self.pos_embed.unsqueeze(0)
s_tok = self.state_mlp(state).unsqueeze(1) # (B, 1, d)
b_tok = self.bbox_mlp(bbox).unsqueeze(1) # (B, 1, d)
memory = self.transformer(torch.cat([feat, s_tok, b_tok], 1))
Q = self.query_embed.unsqueeze(0).expand(image.size(0), -1, -1)
out = self.decoder(Q, memory) # (B, 50, d)
return self.head(out) # (B, 50, 7)
# Training loop
model = MinimalACT().cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.MSELoss()
for step in range(32_000):
batch = next_batch() # (B, 3, 224, 224), (B, 7), (B, 4), (B, 50, 7)
pred = model(batch["image"], batch["state"], batch["bbox"])
loss = loss_fn(pred, batch["action_chunk"])
opt.zero_grad(); loss.backward(); opt.step()
That's 35 lines of Python and trains. You'd add a VAE encoder, a chunk queue at inference, normalisation, optimizer schedule, EMA, and per-chunk loss masking — and you'd end up with ≈ LeRobot's ACT in 700 lines. It's an instructive exercise; we didn't do it for this project because LeRobot saved us a week.
35The cross-attention fix.
The architectural diagnosis from chapter 32 said: the bbox is one of many features concatenated into a global cond vector, and the image features dominate. The cleanest fix is to make the bbox directly attend to image features so it can pull out the spatial region it cares about.
The forked-ACT change
In the minimal ACT above, replace the bbox-as-extra-token with a cross-attention block:
class BboxCrossAttention(nn.Module):
def __init__(self, d=512, nhead=8):
super().__init__()
self.attn = nn.MultiheadAttention(d, nhead, batch_first=True)
self.norm = nn.LayerNorm(d)
self.mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
def forward(self, bbox_tok, image_tokens):
# bbox_tok: (B, 1, d) - query
# image_tokens: (B, 49, d) - keys/values
attended, weights = self.attn(bbox_tok, image_tokens, image_tokens)
return self.norm(bbox_tok + self.mlp(attended)), weights
Now the bbox doesn't just sit in the cond vector — it pulls features from the specific image region around the targeted brick. The attention weights become an interpretable "what is the policy looking at" mask, which is useful for debugging too.
Where to slot it in
Right after image encoding, before the global cond pooling:
class ACTwithBboxCA(MinimalACT):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.bbox_ca = BboxCrossAttention(d=512)
def forward(self, image, state, bbox):
feat = self.img_proj(self.img_enc(image))
feat = feat.flatten(2).transpose(1, 2) + self.pos_embed
b_tok, attn = self.bbox_ca(self.bbox_mlp(bbox).unsqueeze(1), feat)
s_tok = self.state_mlp(state).unsqueeze(1)
memory = self.transformer(torch.cat([feat, s_tok, b_tok], 1))
Q = self.query_embed.unsqueeze(0).expand(image.size(0), -1, -1)
return self.head(self.decoder(Q, memory))
That's the architectural change. ≈ 30 net new lines of code.
Why it should help
The cross-attention's weights are forced to peak at image patches that match the bbox region. Even without explicit supervision, the softmax over 49 patches makes the bbox token "see" a small subset of the image. The downstream action prediction is then conditioned on those specific features, not on the global image average. Empirically, RT-1, RoboPoint, and π0 all use variations of this pattern and all show much stronger spatial-prompt steering than vanilla ACT.
The estimated cost to ship
- Fork the ACT policy file from LeRobot: 1 day.
- Add the cross-attention block and re-train: ~$1 on A10G, 30 min.
- Run the sensitivity probe again: 1 hour.
- Iterate on hyperparameters (attention heads, MLP dim): 1 day.
Total ≈ 2-3 focused days plus ≈ $5 of compute. We didn't do this because we ran out of project time. It's the highest-leverage thing on the followup list.
36DAgger — the curriculum we should have used.
Chapter 11 introduced the BC covariate-shift problem: when the policy deploys, it drifts off the expert's distribution and never saw the new states it lands in. The classical fix is DAgger (Dataset Aggregation):
- Train the policy on the original demos.
- Deploy it. Record the trajectories it actually generates (states it actually sees, not the expert's).
- For each recorded state, ask the expert "what would you have done here?" — relabel.
- Add the (state, expert-action) pairs to the dataset.
- Re-train. Repeat from 2.
For us the expert is the scripted IK demonstrator — and the scripted picker can in fact answer "what would I do from here?" for any state the policy reaches. So DAgger is essentially free in sim:
def dagger_iter(policy, env, scripted_expert, k_rollouts=50):
new_data = []
for seed in range(k_rollouts):
obs, info = env.reset(seed=seed)
for tick in range(300):
# Get the policy's action (drives the env)
policy_a = policy.select_action(obs)
obs, *_ = env.step(policy_a)
# Ask the expert what it WOULD have done from the policy's state
expert_a = scripted_expert.target_action(env.state(), info)
new_data.append((obs, expert_a))
return new_data
We didn't run DAgger for time reasons but it's the obvious next step. With a working cross-attention bbox + 2-3 DAgger iterations, our pick rate would likely jump to > 50%. (See the SO101 sister project for a real-data version where this combination does work.)
37From sim to a real arm.
Everything in this codebase ports to a real UR5 with the following swaps:
- Replace
TabletopPickEnvwith a thin wrapper around UR's ROS2 driver. Same observation dict shape; the wrapper reads joint states from the driver, reads images from the real GoPro, returns the dict. - Replace
_project_brick_bboxwith a real-world detector — GroundingDINO works well for point-and-click. The detector takes the wrist image and a text prompt ("the orange Lego"), returns a bbox. The bbox feeds the policy unchanged. - Replace mink IK with the UR driver's built-in inverse-kinematics (or use mink against a real URDF — works fine, similar API).
- Add domain randomisation at training time — random lighting, random table textures, JPEG noise, small camera jitter — to close the sim-to-real visual gap.
- Add safety bounds — joint velocity / force limits — that throw the arm into a soft-stop if the policy outputs something crazy.
The other half of this codebase (the SO-101 / boxbros project) already does the real-robot version with the same bbox-conditioned ACT, on real data — and there the pick rate is > 80%. The architecture differences are minor; the difference is that real sensors have noise that forces the policy to lean on the bbox conditioning, partially mitigating the architectural weakness we saw in sim.
38Final scorecard.
What worked
- The whole stack runs end-to-end. From a click on the dashboard to MuJoCo physics step to Modal-hosted policy and back, every wire is connected.
- Diffusion bbox consistently descended to the table from the home pose, lateralised slightly toward the targeted brick, plateaued in roughly the right neighbourhood — a far cry from baseline noise.
- The pre/post-processor fix moved the baseline from "0 picks, ~1 m wrist error" to "0 picks, ~300 mm wrist error." A 3× improvement just from getting normalisation right.
- The build pipeline (MuJoCo scene generation, Modal fan-out collection, parquet conversion, HF Hub push, Modal training and serving) is reproducible and reusable for any future task.
What didn't
- Closed-loop pick success. We never crossed the 100 mm threshold.
- Delta-TCP retrain. Did not move the plateau and made Diffusion worse.
- Bbox image overlay. Did not move the plateau either. Marginal improvement on ACT (12%), essentially nothing on Diffusion.
- Randomized brick spawning. Plateau stayed at 333 mm.
The one-sentence summary
Bbox conditioning works on real data because real noise forces the policy to lean on it; in clean sim, vanilla LeRobot architectures pass the bbox through too narrow a bottleneck for it to win against the image features. The fix is one layer of cross-attention; we ran out of time to ship it.
code: github.com/ozyphus/cs224-2026-project
data: ozyphus/cs224-lego-umi-rand-300 · ...-overlay · ...-delta
checkpoints: Modal volume umi-sim-weights
sister project (real robot): cs-224r-final-project (boxbros)