TL;DR
We took an SO-101 follower arm + a USB wrist cam + a base cam, recorded teleop demos of "clean up the desk" (cube → tray), and trained two policies on the same data: π0.5 (Physical Intelligence's 3B-param vision-language-action foundation model — the published SOTA) and ACT (Action Chunking Transformer, ~80M params, ALOHA-paper architecture from 2023). Both ran inference behind a single Modal HTTP endpoint serving a Mac-side hardware loop.
π0.5 needed every input/output shape it expected to be hand-overridden, then produced actions outside the SO-101 servo's tracking envelope and failed to move the arm in any coherent way. ACT trained in 12 minutes from scratch on 8 demos and reached for the cube on the second live run — almost grasping it before timing out.
For broad zero-shot generalization across embodiments and tasks, more parameters and more pretraining wins. For narrow specialization on a single task with a tiny dataset, the architecture whose output distribution matches your servo's tracking envelope wins. ACT, designed in 2023 to be trained from scratch on 50-demo ALOHA datasets, is purpose-built for the second regime. π0.5 is purpose-built for the first. We were squarely in the second.
Assumed background: you're comfortable with PyTorch, Transformer attention basics, and robot joint-space concepts, and have at least heard of LeRobot / Modal. We re-derive the ACT objective, the camera-pinhole unprojection, and the reasoning behind π0.5's fp32-everywhere choice. We do not re-derive attention or VAEs from scratch — we treat them as primitives.
Goal & constraints
One sentence: press a button, type a task in plain English, watch the arm attempt the task. That's the user-facing surface.
The constraint that drives most of the architecture is: no local GPU. The control machine is a Mac with an M-series chip. PyTorch on MPS works, but a 3B-param VLA is well past its memory limit, and even the 80M ACT shares bandwidth with the teleop loop in unhappy ways. Inference needs to happen on a remote GPU. The arm needs to be physically attached to the local machine via USB. Bridging the two cleanly is the work.
Other constraints, less negotiable than they sound:
- Servo speed. SO-101 uses Feetech STS3215 servos. Their effective tracking speed under arm load is roughly 30 °/s — meaning a 3 °/tick command at 10 Hz is the upper limit before the position lags behind the goal. We will hit this wall.
- Calibration drift. Each session, the cv2 camera index assignment can shift depending on USB enumeration order. Index 1 today is index 2 tomorrow.
- Data scarcity. Recording demos is a manual, real-time, error-prone process. We collected 30 then 8 demos. Both numbers are well below what most published imitation-learning results assume.
- Cost discipline. Modal bills per second. The smaller and faster the model, the cheaper the iteration.
The hardware
SO-101 is a 3D-printed open-source 5-axis arm + parallel gripper from The Robot Studio, driven by Feetech STS3215 serial servos. Two arms come in the kit: a "follower" (does work) and a "leader" (you grab and drag — the follower mirrors). The leader has the same kinematics, just used as a 6-DOF input device.
| Joint | Bus ID | Range (deg) | Role |
|---|---|---|---|
| shoulder_pan | 1 | ≈ ±90° | base rotation |
| shoulder_lift | 2 | ≈ -110° to +60° | base tilt |
| elbow_flex | 3 | ≈ -100° to +100° | elbow bend |
| wrist_flex | 4 | ≈ +20° to +100° | wrist tilt |
| wrist_roll | 5 | ≈ -180° to +180° | wrist twist |
| gripper | 6 | 0° (open) to ~60° (closed) | parallel gripper |
Two USB cables (one per arm) into the Mac. One USB camera mounted on the follower's gripper —
the "wrist cam," ~12 cm from the gripper tip. One USB webcam on a tripod looking at the
workspace from a 30° angle — the "base cam." Calibration is per-arm, persisted under
~/.cache/huggingface/lerobot/calibration/, derived from a one-time manual sweep
to the joint extremes.
We didn't appreciate this until the second live run. The wrist cam's USB cable hangs in a loop near the gripper. Across 8 episodes, the loop's exact shape was different each time — and the model learned to attend to it as if it were part of the scene. Tape your cables flat before recording, or you teach your policy that "cable in this position = grasp now," which is a delightfully bad failure mode.
System architecture
Three Modal apps and one Mac client, all sharing a single HuggingFace dataset and a single Modal Volume of weights:
- PI0Policy.live (ASGI) — the GPU inference endpoint
- train() · train_act() — the π0.5 and ACT training functions
- the dataset — 30 → 8 episodes, parquet + AV1 video, on HuggingFace
- the weights Volume — checkpoints under /weights/ft/<run-name>/checkpoints/<step>/pretrained_model
The split mirrors what the GAS/HuggingBros experiments learned: tight loops on local hardware, slow loops on cloud GPUs. Here, "tight" is the servo-control loop (~10 Hz, USB-bound, no network). "Slow" is the policy forward pass (~10–300 ms inference + ~1 s network round trip). The Mac client batches actions from each policy call into a 20-tick chunk, which smooths the slow path back into the tight loop.
Same image, same Volume of weights. train_act() writes a checkpoint to
/weights/ft/..., then PI0Policy._load() reads it on the next live
invocation. Co-locating the two functions in one app means there's no shipping intermediate
artifacts between projects, no separate base image, no second Volume. The cost is that
modal deploy rebuilds image + ASGI surface + train function together — but
since the image is cached, redeploys are seconds.
Imitation learning, formally
Both policies in this article are trained by behavioral cloning (BC). The formal setup: we have a dataset 𝒟 = {(o_t^(i), a_t^(i))} of expert observations o and actions a across episodes i. We learn a policy π_θ(a | o) by maximum likelihood:

θ̂ = argmax_θ Σ_{(o, a) ∈ 𝒟} log π_θ(a | o)
For continuous-action robotics, the conditional distribution is usually parameterized as either Gaussian (giving an L2 / MSE objective) or Laplacian (L1) over an action chunk:

ℒ(θ) = E_{(o_t, a_{t:t+H}) ∼ 𝒟} ‖a_{t:t+H} − â_{t:t+H}(o_t; θ)‖_p,  with p = 2 (Gaussian) or p = 1 (Laplacian)
where H is the prediction horizon (the chunk size). Both ACT and π0.5 use this kind of chunked-action prediction, which dramatically reduces the well-known compounding error problem of single-step BC: if you only predict one step ahead, every error pushes the next observation out of distribution; if you predict a whole chunk and execute it open-loop, errors accumulate inside the chunk but the policy's state resets at the next chunk boundary.
Compounding-error bound for single-step BC: under standard assumptions, expected return suffers cost growing like O(T²·ε) in horizon T and per-step BC error ε. With chunk size H, the policy commits an open-loop trajectory of length H, then re-observes — so the bound becomes O((T/H)·H²·ε) = O(T·H·ε). Larger chunks → fewer policy queries → less compounding, until H approaches a regime where the open-loop assumption itself breaks. ACT's default H = 100; π0.5's = 50. We use 50 for both.
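To make the bound concrete for our setup (35 s episodes at 10 Hz → T = 350 ticks), a back-of-envelope comparison:

T, H = 350, 50                 # horizon in ticks, chunk size
eps = 1.0                      # per-step BC error, arbitrary units
single_step = T**2 * eps       # O(T²·ε)  → 122,500
chunked = T * H * eps          # O(T·H·ε) →  17,500
print(single_step / chunked)   # → 7.0, i.e. exactly T/H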
π0.5 from first principles
π0 and π0.5 are vision-language-action foundation models from Physical Intelligence. The architecture fuses a frozen-ish PaliGemma 3B vision-language backbone with a small "action expert" transformer that produces continuous-action chunks via flow matching / diffusion.
5.1 — PaliGemma + action expert
The backbone is PaliGemma:
a SigLIP image encoder + Gemma 2B language model, jointly trained. For π0.5, three input
images (base camera, left wrist camera, right wrist camera, all 224×224 RGB) are tokenized
via SigLIP into a sequence of vision tokens. The text input — built as
"Task: {nl_task}, State: {discretized_state};\nAction: " where the state is
discretized into 256 bins per joint — is tokenized via the Gemma tokenizer.
These visual + language tokens feed into the Gemma 2B transformer (the "VLM expert"), which produces a context tensor. A separate ~300M-parameter "action expert" transformer (Gemma-300M variant) cross-attends to that context to denoise an action chunk.
5.2 — Action-chunk denoising
Action prediction in π0.5 is flow matching, a continuous-time variant of diffusion. Given a clean action chunk a_0, sample a timestep τ ∈ [0, 1] and a noise ε ∼ 𝒩(0, I). Construct the noisy chunk:

a_τ = (1 − τ) · a_0 + τ · ε
Train the network v_θ (the action expert) to predict the velocity field da_τ/dτ = ε − a_0:

ℒ(θ) = E_{τ, ε} ‖v_θ(a_τ, τ, context) − (ε − a_0)‖²
At inference, integrate the ODE backwards from τ = 1 (pure noise) to τ = 0 (clean
action) using a few-step Euler scheme — π0.5 defaults to 10 inference steps. The output is a
chunk of shape [chunk_size=50, max_action_dim=32]. The action_dim is padded to 32
so the model can serve multiple embodiments (a 14-DOF bimanual ALOHA, a 7-DOF Franka, our 6-DOF
SO-101) with a single architecture; the leftover dims are zeroed out at training and ignored
at inference.
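A minimal sketch of that sampler (v_theta and context are stand-ins for the action expert and the VLM context tensor, not π0.5's actual interfaces):

import torch

@torch.no_grad()
def sample_action_chunk(v_theta, context, chunk_size=50, action_dim=32, n_steps=10):
    """Integrate da/dτ = v_θ(a, τ, context) from τ = 1 (noise) to τ = 0 (clean)."""
    a = torch.randn(1, chunk_size, action_dim)   # a at τ = 1: pure noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        tau = torch.full((1,), 1.0 - k * dt)
        v = v_theta(a, tau, context)             # predicted velocity ε − a_0
        a = a - dt * v                           # Euler step toward τ = 0
    return a                                      # ≈ clean action chunk a_0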
Most BC papers feed the robot's joint state into a small MLP, then concatenate the resulting vector with image features. π0.5 instead discretizes each state value into 256 bins and inlines them as text tokens in the prompt. This is unusual but elegant: the same tokenizer handles both the language task description and the proprioception. The cost is quantization noise (a 0.7° change won't change the discrete bin); the benefit is unified cross-attention with the language tokens.
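To make the tokenization concrete, a minimal sketch of the binning (our own helper, not π0.5's internal pipeline; LO/HI are stand-in per-joint limits taken from the hardware table above, in place of the real normalization stats):

import numpy as np

# Stand-in per-joint limits (degrees), from the joint-range table above.
LO = np.array([-90, -110, -100,   20, -180,  0], dtype=np.float32)
HI = np.array([ 90,   60,  100,  100,  180, 60], dtype=np.float32)

def discretize_state(state_deg, n_bins=256):
    """Map each joint value to an integer bin in [0, n_bins - 1]."""
    norm = (np.asarray(state_deg, dtype=np.float32) - LO) / (HI - LO)  # → [0, 1]
    return np.clip((norm * n_bins).astype(int), 0, n_bins - 1)

state = [-7.6, -102.5, 94.1, 62.7, -9.3, 0.6]
bins = discretize_state(state)
prompt = f"Task: clean up the desk, State: {' '.join(map(str, bins))};\nAction: "

At 256 bins, shoulder_pan's 180° range quantizes to ~0.7° per bin — which is where the "a 0.7° change won't change the bin" caveat comes from.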
ACT from first principles
ACT (Zhao et al., 2023, Stanford ALOHA paper) is a much simpler architecture, designed to be trained from scratch on small teleop datasets (~50 episodes) for a single task on a single robot. No pretraining, no language input — just images, state, and actions. The key idea is to wrap a transformer encoder-decoder in a conditional VAE.
6.1 — The conditional VAE wrapping
The motivation: human teleop is multi-modal. The same observation can correspond to different valid demonstrations (left-handed vs right-handed grasp, slow vs fast approach, etc). Maximum-likelihood Gaussian regression averages these modes and produces blurry "mean trajectories" that satisfy nobody. A VAE with a learned latent z lets the network commit to a single mode at inference (sample z, get a coherent trajectory).
The setup: an "encoder" reads the ground-truth action chunk + state and produces a latent posterior q(z | a1:H, s). A "decoder" reads z, the camera images, and the state, and produces a reconstructed action chunk â1:H. At inference, we drop the encoder and sample z from the prior p(z) = 𝒩(0, I).
Training: encoder(a_{1:H}, s) → z; then z + images + state → decoder → â; loss = L1(a, â) + KL[q ‖ p].
Inference: z ∼ 𝒩(0, I); then z + images + state → decoder → â; execute first n_action_steps.
Decoder internals: ResNet18 image features + proprioceptive state token + latent z token → tx encoder (4 layers, dim 512, 8 heads) → tx decoder cross-attends from learnable action queries → MLP head → action chunk [H, action_dim].
6.2 — The loss
ℒ_ACT = ‖a_{1:H} − â_{1:H}‖_1 + β · KL[ q_ϕ(z | a_{1:H}, s) ‖ 𝒩(0, I) ]

where:
- L1 reconstruction on actions (Laplace likelihood — smoother gradients than L2 near zero, more robust to outliers from noisy demos).
- KL divergence between the posterior qϕ(z | a, s) and the standard-Normal prior p(z). The posterior is a small transformer that reads the ground-truth action chunk and outputs μ, σ for a diagonal Gaussian.
- β = kl_weight, defaulting to 10 in lerobot. Higher β collapses the latent toward the prior (less expressivity, more determinism).
Training runs the encoder and decoder end to end; at inference, the encoder is discarded — sampling z ∼ 𝒩(0, I) gives a sample from the action distribution. For deterministic eval one can fix z = 0 (the prior mean), which is what lerobot does by default.
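In code, the objective is a few lines (a sketch with our own names, not lerobot's exact internals):

import torch
import torch.nn.functional as F

def act_loss(a_hat, a, mu, log_var, beta=10.0):
    """L1 reconstruction + β · KL(q(z|a,s) ‖ N(0, I)) for a diagonal Gaussian posterior."""
    recon = F.l1_loss(a_hat, a)  # Laplace likelihood over the whole chunk
    # Closed-form KL between N(mu, diag(exp(log_var))) and N(0, I), averaged over the batch.
    kl = (-0.5 * (1 + log_var - mu.pow(2) - log_var.exp())).sum(-1).mean()
    return recon + beta * kl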
Both ACT and π0.5 wrap a transformer in a generative model over action chunks. The CVAE requires a single forward pass at inference (sample z once, decode). Diffusion requires K forward passes (Euler steps over τ). At 80M params and 11 ms per forward, ACT can comfortably re-plan at ≥30 Hz; π0.5 at 3B and ~280 ms (10 steps × ~28 ms each on L40S) re-plans at ~3 Hz. For our 10 Hz control loop with 20-action chunks (i.e. re-plan every 2 s), the gap doesn't matter much, but for closed-loop visual servoing it does.
6.3 — ACT at the parameter level
The lerobot defaults we used:
| Hyperparameter | Value | Notes |
|---|---|---|
| vision_backbone | resnet18 | ImageNet pretrained, fully fine-tuned during training |
| dim_model | 512 | transformer width |
| n_heads | 8 | attention heads |
| n_encoder_layers | 4 | transformer encoder depth |
| n_decoder_layers | 1 | transformer decoder depth (yes, only 1) |
| chunk_size | 50 | actions predicted per forward pass |
| n_action_steps | 50 | actions executed before re-querying |
| use_vae | True | flip to off for deterministic regression |
| latent_dim | 32 | VAE latent z dim |
| kl_weight (β) | 10.0 | standard ACT setting |
| optimizer_lr | 1e-5 | Adam (yes, just Adam) |
| dropout | 0.1 | standard tx dropout |
Dataset collection
Dataset collection is the part of robot-learning research that feels least like ML and most like a craft. You sit at a desk. You hold the leader arm. You drag the leader to the cube. You squeeze the leader's gripper to close the follower's gripper. You lift. You move the object to the tray. You open the gripper. You return to a roughly-consistent home pose. You press a key to mark the episode complete. You move the cube to a new spot. Repeat thirty times.
The tool is lerobot-record, the LeRobot CLI for teleop demo collection. It
handles the leader/follower pairing, camera capture, dataset serialization (parquet for
state/action, AV1-encoded MP4 for video), and pushes the result to a HuggingFace dataset
repo:
lerobot-record \
--robot.type=so101_follower \
--robot.port=/dev/tty.usbmodem5A7A0546771 \
--robot.id=so101_follower_main \
--robot.cameras='{ wrist: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 15, backend: AVFOUNDATION}, base: {type: opencv, index_or_path: 1, width: 640, height: 480, fps: 15, backend: AVFOUNDATION} }' \
--teleop.type=so101_leader \
--teleop.port=/dev/tty.usbmodem5A7A0545661 \
--teleop.id=so101_leader_main \
--dataset.repo_id=ozyphus/so101-cleanup-v3 \
--dataset.single_task="clean up the desk" \
--dataset.num_episodes=30 \
--dataset.fps=15 \
--dataset.episode_time_s=35 \
--dataset.reset_time_s=15 \
--dataset.push_to_hub=true \
--dataset.private=true
Resulting parquet schema, per row (one row per timestep at 15 Hz):
| Column | Dtype | Shape | Meaning |
|---|---|---|---|
| observation.state | float32 | (6,) | follower's current joint positions |
| action | float32 | (6,) | leader's commanded positions (= follower target) |
| observation.images.wrist | video ref | (480, 640, 3) | AV1-encoded MP4 frame |
| observation.images.base | video ref | (480, 640, 3) | AV1-encoded MP4 frame |
| timestamp | float64 | scalar | seconds since episode start |
| episode_index | int64 | scalar | episode id |
| frame_index | int64 | scalar | frame within episode |
| task | str | scalar | "clean up the desk" |
Important subtlety: action ≠ state in teleop datasets. The action is where the leader arm was at tick t; the state is what the follower achieved. The follower lags the leader by a few ticks under load. This is intentional — the dataset captures the operator's intent (action) plus the actual achieved configuration (state). At inference, the policy outputs an action (target leader pose ≈ target follower pose); the servo PID closes the loop.
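To see the lag in a dataset pulled locally, a quick check (hypothetical file path; lerobot stores one parquet per episode with the array-valued columns from the schema above):

import numpy as np
import pandas as pd

# Hypothetical path — adjust to your local lerobot cache layout.
df = pd.read_parquet("data/chunk-000/episode_000000.parquet")
action = np.stack(df["action"].to_numpy())            # [T, 6] leader targets
state = np.stack(df["observation.state"].to_numpy())  # [T, 6] follower positions

# For each joint, find the shift (in ticks) that best aligns follower onto leader.
for j in range(6):
    err = [np.mean(np.abs(action[:-k, j] - state[k:, j])) for k in range(1, 10)]
    print(f"joint {j}: follower lags leader by ~{1 + int(np.argmin(err))} ticks")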
If you abort lerobot-record before any episode is saved, it leaves an empty
~/.cache/huggingface/lerobot/<repo>/meta/info.json stub on disk. Re-running
crashes with FileExistsError. The fix is one line: rm -rf ~/.cache/huggingface/lerobot/<repo>
and re-run. Don't delete the parent. Yes, this took us 5 minutes to figure out.
7.1 — What we actually collected
Three datasets. Lessons learned with each.
| Repo | Episodes | Frames | Notes |
|---|---|---|---|
| so101-cleanup-v1 | — | — | aborted (timer too short; 20 s episodes cut grasps mid-motion) |
| so101-cleanup-v2 | 30 | 15,402 | full set, ~34 s/episode, used for π0.5 fine-tune |
| so101-cleanup-v3 | 8 | 4,149 | recording crashed; we kept what we had. Different physical setup (cam indices changed) |
First attempt: training π0.5
π0.5's input_features dict, as published, expects three image keys
(base_0_rgb, left_wrist_0_rgb, right_wrist_0_rgb all at
3×224×224), a 32-dim padded state, and outputs a 32-dim padded action. We had two cams (base
and a single wrist), 6-dim state, 6-dim action. Every shape was wrong. Every key was wrong.
The fix that made the trainer happy was a combination of a feature override and a column rename:
# Imports as used in this app (the DatasetConfig / TrainPipelineConfig / run_train paths match
# train_act() below; the PreTrainedConfig and PolicyFeature paths are from the lerobot version
# we pinned and may differ across releases).
from pathlib import Path

from lerobot.configs.default import DatasetConfig
from lerobot.configs.policies import PreTrainedConfig
from lerobot.configs.train import TrainPipelineConfig
from lerobot.configs.types import FeatureType, PolicyFeature
from lerobot.scripts.lerobot_train import train as run_train

policy_cfg = PreTrainedConfig.from_pretrained("lerobot/pi05_base")
policy_cfg.pretrained_path = "lerobot/pi05_base"
policy_cfg.device = "cuda"
policy_cfg.freeze_vision_encoder = True # frozen PaliGemma SigLIP — fits L40S
policy_cfg.train_expert_only = True # only train the 300M action expert
policy_cfg.optimizer_lr = lr
policy_cfg.push_to_hub = False
# Override 32→6 for state and action; pi05's pad_vector pads internally at forward time.
policy_cfg.input_features["observation.state"] = PolicyFeature(
type=FeatureType.STATE, shape=(6,)
)
policy_cfg.output_features["action"] = PolicyFeature(
type=FeatureType.ACTION, shape=(6,)
)
# Map our two dataset image keys onto pi05's first two expected keys; the missing third
# key (right_wrist_0_rgb) gets filled with a -1-padded "empty camera" by pi05's forward.
rename_map = {
"observation.images.base": "observation.images.base_0_rgb",
"observation.images.wrist": "observation.images.left_wrist_0_rgb",
}
train_cfg = TrainPipelineConfig(
dataset=DatasetConfig(repo_id="ozyphus/so101-cleanup-v2"),
policy=policy_cfg,
output_dir=Path("/weights/ft") / job_name,
batch_size=4,
steps=4000,
save_freq=1000,
rename_map=rename_map,
)
run_train(train_cfg)
weights_vol.commit()
Each line is a deliberate concession to π0.5's published architecture:
- freeze_vision_encoder=True + train_expert_only=True — Of π0.5's ~3B params, this trains only the ~300M action expert. The PaliGemma backbone (SigLIP + Gemma 2B) stays frozen at the upstream weights. This was a VRAM and compute necessity (full FT would not fit on an L40S without offloading), but it's also a regularization choice: we have 15K frames; we can't responsibly fine-tune 3B params on that.
- Shape override — Without overriding input_features["observation.state"] to shape (6,), the dataset normalizer fails because dataset stats are 6-dim and the policy declares 32-dim. Setting it to (6,) lets the normalizer match; inside the model, pad_vector(state, 32) still fires at forward time to satisfy the action expert's fixed input shape.
- Image rename_map — Maps dataset columns to policy-expected feature names at load time. The third image key (right_wrist_0_rgb) is not in the dataset; π0.5's get_image_features() handles missing keys by inserting a fully-saturated -1 image (SigLIP's mean-zero) and zeroing the corresponding mask bit.
We also had to discover the lerobot extras hierarchy by failing through it.
And π0.5's default push_to_hub=True config requires a repo_id — if
neither is set explicitly, training fails with a
ValueError: 'policy.repo_id' argument missing at the end-of-training save step.
Setting push_to_hub=False persists the checkpoint to the Volume only.
8.1 — Training results
4000 steps at batch 4, lr 1e-5, on an L40S took 41 minutes. Loss dropped 0.27 → 0.067 (a smooth 4× reduction over the run). Every metric of the training process said "this is working".
Serving on Modal
Inference is a single FastAPI endpoint behind @modal.fastapi_endpoint(method="POST").
The class lifecycle is:
@app.cls(
image=image,
gpu="L40S",
volumes={WEIGHTS_DIR: weights_vol},
secrets=SECRETS,
timeout=1800,
scaledown_window=300,
max_containers=2,
)
class PI0Policy:
@modal.enter()
def _load(self) -> None:
weights_vol.reload()
for repo_id, policy_type in CHECKPOINT_CANDIDATES:
try:
policy_cls = get_policy_class(policy_type)
config = PreTrainedConfig.from_pretrained(repo_id)
config.device = "cuda"
policy = policy_cls.from_pretrained(repo_id, config=config)
policy.to("cuda").eval()
preprocessor, postprocessor = make_pre_post_processors(
policy.config, pretrained_path=repo_id,
preprocessor_overrides={"device_processor": {"device": "cuda"}},
postprocessor_overrides={"device_processor": {"device": "cuda"}},
)
self.policy, self.preprocessor, self.postprocessor = policy, preprocessor, postprocessor
self.checkpoint, self.policy_type = repo_id, policy_type
self.expected_image_keys = [k for k in config.input_features if "image" in k]
state_feat = config.input_features.get("observation.state")
self.expected_state_dim = int(state_feat.shape[0]) if state_feat else 32
weights_vol.commit()
return
except Exception as e:
print(f"[pi0] failed {repo_id}: {e}")
raise RuntimeError("no checkpoint loaded")
@modal.fastapi_endpoint(method="POST", docs=True)
def live(self, payload: dict) -> dict:
images = {k: _decode(v) for k, v in (payload.get("images") or {}).items()}
        state = payload.get("state") or [0.0] * 6  # 6-dim SO-101 joint state
task = payload.get("task") or "pick up the object"
raw = self._build_raw_obs(images, state, task)
chunk, latency_ms = self._predict(raw)
return {
"action_chunk": chunk[: self.n_action_steps].tolist(),
"horizon": self.n_action_steps,
"latency_ms": round(latency_ms, 2),
"checkpoint": self.checkpoint,
"expected_image_keys": self.expected_image_keys,
"received_image_keys": sorted(images.keys()),
}
Three notable design choices in this class:
- CHECKPOINT_CANDIDATES as a fallback list. The first loadable entry wins: local FT checkpoints first, then upstream pretrained as fallback. This means a corrupted local checkpoint never takes the endpoint down — it gracefully falls through to lerobot/pi05_base. We re-deploy by editing this list.
- scaledown_window=300 keeps the container warm for 5 minutes after the last request. Cold start is ~90 s for π0.5 (loading the 3B-param backbone from the Volume) and ~2 s for ACT. Once warm, π0.5 inference is ~280 ms, ACT is ~12 ms.
- _build_raw_obs handles cam-key heterogeneity. The same endpoint can serve π0.5 (which expects 3 image keys, replicated/padded) and ACT (which expects whatever the dataset has — 2 keys, named wrist and base). The mapper substring-matches semantic names ("base" in the key → use the client's "base" cam; "wrist" → the client's "wrist" cam; otherwise zeros). One client, multiple policy types, no client-side change — sketched below.
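A sketch of the substring matching (our own minimal version; the real mapper also resizes and normalizes before building the policy batch):

import numpy as np

def map_images(expected_keys, client_images, hw=(224, 224)):
    """Match policy image keys to client cam names by substring; zeros if absent."""
    out = {}
    for key in expected_keys:
        if "base" in key and "base" in client_images:
            out[key] = client_images["base"]
        elif "wrist" in key and "wrist" in client_images:
            out[key] = client_images["wrist"]
        else:
            out[key] = np.zeros((3, *hw), dtype=np.float32)  # missing cam → zeros
    return out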
9.1 — Why fp32 throughout
The most surprising serving choice is keeping π0.5 in fp32, not fp16. We discovered the same fp16 trap that GroundingDINO has: π0.5 fuses two architectures (PaliGemma's SigLIP + Gemma vision-language tower, plus the action expert with its own attention paths). Loading the whole policy in fp16 produces dtype-mismatch errors at the cross-attention boundary between the VLM and the action expert. Loading in fp32 takes about 9 GB — comfortably below L40S's 48 GB — and inference is 280 ms warm vs ~180 ms in fp16 (an imperceptible difference at 10 Hz control rate).
The Mac client
One Python file, ~400 lines: client/so101_run.py. It does five things, in this
order, every tick:
- Read follower joint positions and gripper state from USB.
- Read base + wrist camera frames (lerobot does this as part of get_observation()).
- Either pop a queued action from the previous chunk, or POST to the Modal endpoint to refill the queue with a new 50-action chunk.
- Compute the goal as either scale × action (absolute mode) or state + scale × action (delta mode), then clip per-joint to ±max_joint_step_deg.
send_action().
step = 0
action_queue: list[np.ndarray] = []
while not stop["v"] and (args.steps < 0 or step < args.steps):
    loop_t0 = time.perf_counter()  # tick start, for the fixed-rate sleep at the bottom
    obs = robot.get_observation()
state = obs_to_state_vec(obs) # → list[float] of 6 joint positions
frames = {name: obs.get(name) for name in cameras}
if not action_queue:
# Refill via Modal POST; payload format is policy-agnostic.
images_payload = {name: encode_to_b64(f) for name, f in frames.items() if f is not None}
resp = post_observation(
args.endpoint_url,
{"images": images_payload, "state": state, "task": args.task},
timeout=180.0,
)
chunk = np.asarray(resp["action_chunk"], dtype=np.float32) # shape [50, 6]
take = min(args.actions_per_chunk, chunk.shape[0])
action_queue = [chunk[i] for i in range(take)]
action_vec = action_queue.pop(0)
goal = action_to_motor_dict(action_vec, obs, mode=args.action_mode, scale=args.action_scale)
safe_goal = clip_toward(goal, {k: obs[k] for k in goal}, args.max_joint_step_deg)
if not dry_run:
robot.send_action(safe_goal)
step += 1
time.sleep(max(0, 1 / args.hz - (time.perf_counter() - loop_t0)))
The --action-mode flag is the single most important client-side knob. It controls
how the policy's output is interpreted:
| Mode | Goal | When to use |
|---|---|---|
| delta | state + scale·action | policy output is a velocity-like signal (relative) |
| absolute | scale·action | policy output is the target joint position (what teleop datasets actually contain) |
During recording, lerobot stores the leader's joint positions as the action — so for
any policy trained on a teleop dataset, --action-mode=absolute is correct. This
matters: with absolute mode, the policy's output of "elbow_flex = 14°" means
"drive elbow to 14°", regardless of where the elbow currently is. With delta mode it would
mean "add 14° to the current elbow position," which is wildly wrong if the policy outputs
target poses near zero. We started with delta and saw nonsensical behavior. Absolute fixed it.
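A minimal sketch of the two interpretations (our own version, not the repo's exact helper; JOINT_NAMES is the client's joint-name list):

JOINT_NAMES = ["shoulder_pan", "shoulder_lift", "elbow_flex",
               "wrist_flex", "wrist_roll", "gripper"]

def action_to_motor_dict(action_vec, obs, mode="absolute", scale=1.0):
    """Interpret a 6-dim policy output as per-joint goal positions (degrees)."""
    goal = {}
    for i, joint in enumerate(JOINT_NAMES):
        if mode == "absolute":
            goal[f"{joint}.pos"] = scale * float(action_vec[i])
        else:  # "delta": output is an offset from the current position
            goal[f"{joint}.pos"] = float(obs[f"{joint}.pos"]) + scale * float(action_vec[i])
    return goal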
Even with absolute mode, when the live policy first comes up, the predicted target may be
tens of degrees away from the current pose. Sending that goal directly to the servos would
cause a snap-to-target motion at full speed — bad for the gear train, worse for whatever's
in the workspace. clip_toward(goal, present, max_delta=3.0) caps each step to
±3° per joint per tick, so the servos can only ever take small steps. The policy keeps
re-emitting the same large goal each tick; the arm crawls toward it.
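clip_toward itself is a few lines — a sketch matching the behavior described:

def clip_toward(goal, present, max_delta=3.0):
    """Move each joint at most ±max_delta degrees from its present position."""
    safe = {}
    for key, target in goal.items():
        step = max(-max_delta, min(max_delta, target - present[key]))
        safe[key] = present[key] + step
    return safe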
Camera math
Both policies consume images directly — no calibrated 3D required. But for completeness (and because the next iteration of this project will use it), the camera unprojection is worth stating cleanly. The pinhole model:
u = fx · X/Z + cx
v = fy · Y/Z + cy
(u, v) are pixel coordinates, (X, Y, Z) are camera-frame 3D coordinates, fx, fy are focal lengths in pixels, and (cx, cy) is the principal point (≈ image center). Inverting, given a pixel and its depth Z:
X = (u − cx) · Z / fx
Y = (v − cy) · Z / fy
For a generic USB webcam without a calibrated K, the rule-of-thumb default fx = fy = max(H, W) with principal point at (W/2, H/2) is good enough for visualization. It assumes ~60° horizontal FOV, which most laptop and gripper-mount cameras are within ±10° of. For metric reconstruction (path planning, collision checking) you'd calibrate with a ChArUco board. For pose-conditioned imitation learning, the policy doesn't see K at all — it sees raw images.
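As a sketch under that rule-of-thumb assumption:

import numpy as np

def unproject(u, v, z, h=480, w=640):
    """Pixel (u, v) + depth z → camera-frame (X, Y, Z) with rule-of-thumb intrinsics."""
    fx = fy = float(max(h, w))        # ≈ 60° horizontal FOV assumption
    cx, cy = w / 2.0, h / 2.0
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])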
First live run
With the π0.5 fine-tune deployed and the live endpoint smoke-tested, we ran:
python client/so101_run.py \
--wrist-cam-index 1 --base-cam-index 2 \
--task "clean up the desk" \
--steps 50 \
--action-mode absolute --action-scale 1.0 \
--max-joint-step-deg 3.0 \
--live
The arm moved a few degrees, then froze. The log told a precise story.
Two things stuck out:
- Elbow stuck at 92.6° from tick 10 onwards. Despite a continuous -3.00° command (the safety clip's saturation value), the present elbow position did not decrease.
- Other joints moved very slowly. Shoulder_pan moved 0.1° over 50 ticks. Shoulder_lift moved 3.4°. Wrist_flex 1.6°. Tiny absolute travel.
Before drawing any conclusions, we instrumented the client to print absolute joint state
alongside the per-tick delta — that's the pres[...] column in the log. This
revealed the real story.
Servo torque vs gravity
Here is the elbow trajectory across two runs, plotted as commanded vs achieved per-tick motion:
| Run | Joint | Δ over N ticks | Commanded /tick | Achieved /tick |
|---|---|---|---|---|
| π0.5 first live | elbow_flex | 97.1° → 92.6° in 10 ticks | -3.0° | -0.45° |
| π0.5 first live | elbow_flex | 92.6° → 92.6° in 40 ticks | -3.0° | 0° |
| home-only test | elbow_flex | 99.4° → 96.8° in 5 ticks | -2.0° | -0.5° |
| home-only test | elbow_flex | 96.8° → 96.8° in 187 ticks | -2.0° | 0° |
| home-only test | shoulder_pan | 8.2° → 2.3° in 10 ticks | -2.0° | -0.6° |
| home-only test | wrist_flex | 82.9° → 73.5° in 10 ticks | -2.0° | -0.94° |
The interpretation:
- The servos can move at ~0.5°/tick at 10 Hz (= 5°/s) under nominal load.
- The elbow specifically stops moving when its position hits ~92–97°. Other joints continue their slow creep.
- This isn't a calibration limit (the dataset's recorded actions go down to -96°, so the servo definitely can travel there under teleop).
The mechanism is gravity load. With shoulder_lift at -102° (arm tilted back near
vertical) and elbow_flex at +97° (forearm pointing almost straight forward), the
forearm + gripper + wrist cam hangs out as a moment arm of about 12 cm. To straighten
the elbow further (drive elbow toward 0°), the servo must lift this moment arm against
gravity. The Feetech STS3215 is rated for ~30 kg·cm stall torque; under a smooth motion
profile with PID gains tuned for teleop loads it delivers ≈ 1/3 of that continuously. The arm
+ payload exceeds this in this specific posture.
During teleop recording, the leader arm guided the follower smoothly — small, continuous changes that the PID absorbed. At inference, we sent a step-target 80° away from the current state every tick. The PID asserted full effort, the servo couldn't make progress, and the position got pinned at the postural-equilibrium point.
We had been treating "the model outputs in-distribution actions" as a sufficiency condition. It isn't. The output must also be inside the servo's tracking envelope given the current arm configuration. A 60° absolute target step is in-distribution for π0.5's training data (other robots, other postures) but outside this particular SO-101's envelope when shoulder_lift = -102°. ACT, trained from scratch on our actual demos, naturally outputs within-envelope actions because that's the only thing it ever saw.
The homing trick
One robust mitigation: before the policy loop starts, drive the arm to the median start-of-episode pose from the training dataset, slowly. This guarantees the policy's first action is close to a configuration it was trained from.
# Median first-frame joint positions across so101-cleanup-v3 (8 episodes).
TRAINING_HOME_POSE_DEG = {
"shoulder_pan": -7.6,
"shoulder_lift": -102.5,
"elbow_flex": 94.1,
"wrist_flex": 62.7,
"wrist_roll": -9.3,
"gripper": 0.6,
}
def _drive_to_home(step_deg=2.0, timeout_s=15.0):
t0 = time.perf_counter()
while time.perf_counter() - t0 < timeout_s:
present = {f"{j}.pos": float(robot.get_observation()[f"{j}.pos"])
for j in JOINT_NAMES}
target = {f"{j}.pos": float(TRAINING_HOME_POSE_DEG[j])
for j in JOINT_NAMES}
diffs = {k: target[k] - present[k] for k in target}
if all(abs(v) < step_deg for v in diffs.values()):
return # within tolerance — done
safe = clip_toward(target, present, step_deg)
robot.send_action(safe)
time.sleep(0.1)
We compute the home pose by reading the parquet files of episode-start frames and taking the median per joint. This is more robust than the mean because a single outlier episode (e.g. recorded with the leader arm in a weird pose) can drag the mean significantly. With 8 episodes, the median is well-defined and the per-joint stdev is small (≤ 0.85° for shoulder_lift, ≤ 6.8° for wrist_flex — the latter being where most of the natural reset variance lives).
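A sketch of that computation (hypothetical paths and our own JOINT_NAMES list; the episode-parquet layout varies by lerobot version):

import glob

import numpy as np
import pandas as pd

JOINT_NAMES = ["shoulder_pan", "shoulder_lift", "elbow_flex",
               "wrist_flex", "wrist_roll", "gripper"]

first_frames = []
for path in sorted(glob.glob("data/chunk-000/episode_*.parquet")):
    df = pd.read_parquet(path)
    first_frames.append(np.asarray(df.iloc[0]["observation.state"]))

home = np.median(np.stack(first_frames), axis=0)  # per-joint median start pose
print(dict(zip(JOINT_NAMES, home.round(1))))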
Homing brings the arm to the dataset's start distribution. From that pose, the policy's first action is in-distribution and the arm moves. But once the policy commands elbow extension to do the actual reach, we re-enter the gravity-loaded posture and the servo can stall again. The full fix is either: (a) record demos from multiple initial postures so the policy learns to operate from low-load configurations, or (b) tune the servo PID gains for higher continuous current. We didn't do either; homing was enough to get past the first barrier.
Pivot: switching to ACT
After multiple π0.5 live attempts producing stalled or near-zero motion, we made the call: drop the SOTA model, train ACT from scratch on the same dataset. Reasoning:
- Distribution match — ACT trains on our exact data. Its action distribution is, by construction, in our servo's envelope.
- No shape gymnastics — ACT auto-populates input_features from the dataset. Two image keys, 6-dim state, 6-dim action: as is.
- Smaller GPU — an A10G (24 GB, ~$1.10/hr on Modal) is plenty for an 80M model; π0.5 needed an L40S.
- Faster iteration — fewer minutes per training run means more chances to tune the recording protocol.
This was also a calibrated risk against advice in much of the recent literature — "use the biggest VLA you can fit, fine-tune from a checkpoint, you'll get better generalization." That's true for cross-task generalization. We're squarely in the single-task, narrow-data regime where smaller specialized architectures often win.
Training ACT
The ACT training function is dramatically simpler than the π0.5 one. No feature overrides, no rename map, no freeze-strategy juggling:
@app.function(
image=image,
gpu="A10G",
volumes={WEIGHTS_DIR: weights_vol},
secrets=SECRETS,
timeout=2 * 3600,
)
def train_act(
dataset_repo_id: str = "ozyphus/so101-cleanup-v3",
num_steps: int = 4000,
batch_size: int = 8,
lr: float = 1e-5,
save_freq: int = 1000,
log_freq: int = 100,
) -> dict:
from lerobot.policies.act.configuration_act import ACTConfig
from lerobot.configs.default import DatasetConfig
from lerobot.configs.train import TrainPipelineConfig
from lerobot.scripts.lerobot_train import train as run_train
weights_vol.reload()
name = f"act-so101-cleanup-{time.strftime('%Y%m%d_%H%M%S')}"
output_dir = Path(WEIGHTS_DIR) / "ft" / name
policy_cfg = ACTConfig(
device="cuda",
optimizer_lr=lr,
push_to_hub=False,
chunk_size=50, # match our control horizon
n_action_steps=50,
)
train_cfg = TrainPipelineConfig(
dataset=DatasetConfig(repo_id=dataset_repo_id),
policy=policy_cfg,
output_dir=output_dir,
batch_size=batch_size,
steps=num_steps,
save_checkpoint=True,
save_freq=save_freq,
log_freq=log_freq,
)
run_train(train_cfg)
weights_vol.commit()
return {"output_dir": str(output_dir), "steps": num_steps}
We plotted the loss trajectories from two ACT runs.
A few observations on the loss curve:
- Initial loss is huge. This is the KL term dominating — at init, the posterior q(z | a, s) is uncorrelated with anything, so the encoder produces diffuse posteriors with high KL to 𝒩(0, I). Once a few hundred steps reduce the KL, the recon term takes over.
- Loss is in arbitrary units. Don't compare ACT's 0.245 to π0.5's 0.067. Different normalization, different objectives (L1+KL vs L2 flow-matching). The only thing that matters is whether the curve is monotonically decreasing and the policy works.
- ~8 epochs over 8 episodes is fine. 4000 steps × bs 8 = 32K samples ≈ 7.7 epochs over 4,149 frames. With dropout and the VAE bottleneck regularizing, no obvious overfitting on the held-out (rare-cube-position) test poses.
16.1 — Modal preemption surprise
One run crashed at step 5848 of 8000 with a FileNotFoundError in
_pin_memory_loop. Looking up: Modal's spot/preemptible compute had reclaimed the
container. Modal helpfully restarted the function from scratch — but our train()
had no resume handling, so it crashed immediately on
FileExistsError: output dir already exists. The restart worked once we deleted
the stale local dir, but losing 25 minutes of compute is annoying. The fix on subsequent runs
was to (a) lower step count for fast iteration and (b) add shutil.rmtree(output_dir)
guarded by "smoke" in name so iterative debugging doesn't hit the same wall.
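The guard itself is tiny (name and output_dir are the variables from train_act above):

import shutil

# Only auto-clean output dirs from throwaway runs; never delete a real checkpoint dir.
if "smoke" in name and output_dir.exists():
    shutil.rmtree(output_dir)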
ACT live
The ACT smoke test after training completed told the story in two numbers.
A 1.9 s cold start (vs π0.5's 90 s) and 15 ms warm inference (vs π0.5's 280 ms). That's a 47× cold-start improvement and an 18× inference improvement, on a smaller GPU.
We ran the live client with --steps 100 --hz 10 --actions-per-chunk 20 --action-mode
absolute --home. The arm executed a coherent reach sequence.
Reading the trajectory:
- shoulder_lift: -102.5° → -7.5° (95° of motion). The arm tilted up from the rest pose toward vertical.
- elbow_flex: +94.1° → +48.2° (-46°). The forearm extended forward, well past the postural equilibrium where π0.5 had been pinned.
- The gripper opened from 0.6° to 9.6° — the policy entered the "approach phase" preparing to grasp.
- The arm exited because --steps 100 ran out, not because the policy gave up.
On the second run with --steps 400, the arm reached the cube and closed the
gripper around it. It didn't yet correct for cube position changes during the approach (a
true closed-loop visual servo behavior), but for 8 demos and zero pretraining, this was
well past the bar of "does it move at all coherently."
Why ACT beat the SOTA
A fair question: π0.5 is the published state of the art for cross-embodiment manipulation. ACT is from 2023. ACT has 38× fewer parameters. Why did it win on this task?
18.1 — Data distribution match dominates
ACT was trained from scratch on our SO-101, our camera angles, our lighting, our cube. Its output distribution exactly equals the empirical distribution of teleoperator actions in those conditions. Whatever the SO-101 servos can execute under gravity and friction in those poses, that's what ACT outputs. Distribution shift is zero by construction.
π0.5 was pretrained on a heterogeneous mixture of robot embodiments — bimanual ALOHA, Franka arms, mobile manipulators. Its action distribution is the average of all of those, projected through the un-normalization stats of our dataset. The result is an action distribution that's nearly right but systematically biased: target positions tend to be too far from current state (because most robots in the pretraining mix have lighter payload-to-torque ratios than ours), and the safety clip has to soak up the difference at every tick.
18.2 — Capacity vs data ratio
With 4,149 frames (the v3 dataset), ACT's 80M params see ~52 frames per million params.
That's a healthy specialization regime; the model can absorb the data signal without
collapsing onto its initialization. π0.5's 3B params have ~1.4 frames per million params.
With train_expert_only=True, the trainable subset is 300M, giving ~14
frames/M-params — still well below ACT's specialization density.
When data ≪ capacity, the model relies on its priors (pretrained features, attention patterns, normalization stats) to fill the gap. If the priors match the target task, this is a feature; if they don't, it's a liability. ACT has no priors to fight.
18.3 — Architecture-task fit
The ACT paper's whole motivation was: "small teleop dataset, single task, single robot, learn from scratch." Every architectural choice — VAE for multi-modal action distributions, fully-trainable ResNet18 vision (no frozen backbone), L1 reconstruction (robust to demo noise), small transformer (4 layers, 1 decoder layer) — is tuned for that regime.
π0.5's whole motivation was: "scale a VLA across embodiments and tasks via cross-modal pretraining." Its language conditioning, its 3B-param backbone, its discretized-state-as-tokens, its flow-matching action head — these all serve cross-task generalization. We don't need cross-task generalization. We have one task. The hardware is ours and known.
18.4 — Inference cost
The closed-loop story matters too. ACT at 11 ms warm gives us headroom to re-plan often; π0.5 at 280 ms forces longer action chunks (less feedback). For a small reactive task like "close the gripper when the cube is centered in the wrist cam," the inference rate matters more than the model capacity.
Foundation models earn their parameter count when the task domain is broad (many objects, many tasks, many lighting conditions, many embodiments). Small specialized models earn their cost when the task domain is narrow (one task, one robot, ~50 demos). Picking which regime you're in is the most important design decision in a robot-learning project — and it's almost always misjudged toward "use the biggest model" because that's the default advice in mainstream ML papers.
What's next
The shortest backlog, ranked by leverage:
- More demos with deliberate cube-position perturbation. 30+ demos, vary cube position across a 20×20 cm grid, occasionally push the cube during the demo to teach the policy "if cube moved, adjust." This is the single biggest ROI move on the policy side.
- Temporal action ensembling. ACT supports temporal_ensemble_coeff — average overlapping action chunks for smoother corrections. We had it disabled. Should be a free win for closed-loop performance; see the config sketch after this list.
- Servo PID tuning. Raise profile_velocity and torque current on the elbow servo. The Feetech docs publish per-servo limits; we're well below them. Tuned servos = the model's actions actually execute = effective control rate goes up.
- Multi-pose home recording. Record 10 demos from each of 3 different start poses. Teaches the policy to operate from low-gravity-load configurations as well as the current high-load one.
- Domain-randomized fine-tune of π0.5. Now that we know π0.5 fails on distribution shift in this regime, we could try collecting much more data and actually fine-tuning the action expert end-to-end (not just frozen-backbone) so the VLA's advantages can manifest. This is a real ML research question; ACT is the pragmatic answer in the meantime.
- Forward kinematics + world-frame point clouds. Layer the SO-101 URDF on top of the policy outputs. Unproject the wrist cam's depth (we don't have a depth model wired up yet, but Depth Anything v2 is one Modal endpoint away) into world-frame points. This is the same architecture as Hugging Face's GAS v2; we'd inherit it almost for free.
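For reference, enabling the ensembling is a config change (a sketch; to our reading, lerobot requires n_action_steps=1 when ensembling, since each tick executes one ensembled action):

from lerobot.policies.act.configuration_act import ACTConfig

policy_cfg = ACTConfig(
    chunk_size=50,
    n_action_steps=1,               # execute one (ensembled) action per tick
    temporal_ensemble_coeff=0.01,   # the ACT paper's default exponential weight
)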
Full recipe
Everything compressed to a runbook. One repo, two arms, one cube.
# ── 0. Prerequisites ────────────────────────────────────────
# - SO-101 follower + leader, both connected via USB
# - Two USB cameras (wrist + base)
# - Modal account, HuggingFace write token
# - Mac with Python 3.12
# ── 1. Calibration (once per arm) ───────────────────────────
lerobot-calibrate --robot.type=so101_follower \
--robot.port=/dev/tty.usbmodem... --robot.id=so101_follower_main
lerobot-calibrate --teleop.type=so101_leader \
--teleop.port=/dev/tty.usbmodem... --teleop.id=so101_leader_main
# ── 2. Dataset collection ──────────────────────────────────
hf auth login # write-scoped HF token
./record_v3.sh # see Chapter 7 — 30 episodes recommended
# ── 3. Train ACT on Modal ──────────────────────────────────
modal deploy experiments/pi0/app.py
modal run experiments/pi0/app.py::train_act \
--dataset-repo-id ozyphus/so101-cleanup-v3 \
--num-steps 4000 --batch-size 8 --lr 1e-5
# Output: /weights/ft/act-so101-cleanup-<timestamp>/checkpoints/004000/pretrained_model
# ── 4. Wire the checkpoint into the live endpoint ──────────
# Edit CHECKPOINT_CANDIDATES in experiments/pi0/app.py to prepend the new path:
# ("/weights/ft/act-so101-cleanup-<ts>/checkpoints/004000/pretrained_model", "act"),
modal deploy experiments/pi0/app.py
modal run experiments/pi0/app.py::PI0Policy.smoke # sanity-check load + predict
# ── 5. Live run ────────────────────────────────────────────
python client/so101_run.py \
--wrist-cam-index 0 --base-cam-index 1 \
--task "clean up the desk" \
--steps 400 --hz 10 --actions-per-chunk 5 \
--action-mode absolute --action-scale 1.0 \
--max-joint-step-deg 3.0 \
--home --home-step-deg 3.0 --home-timeout-s 10 \
--live
Train the smallest model that can fit your task. Make the training distribution match the servo's tracking envelope. Run the policy on a remote GPU but the control loop on local USB. Teleop is the dataset; the model is the compression of the teleop into something that can be re-played with edits.
Source: experiments/pi0/app.py for both Modal apps,
client/so101_run.py for the Mac control loop. Datasets on HuggingFace at
ozyphus/so101-cleanup-v2 and ozyphus/so101-cleanup-v3.