Chapter 0

TL;DR

We took an SO-101 follower arm + a USB wrist cam + a base cam, recorded teleop demos of "clean up the desk" (cube → tray), and trained two policies on the same data: π0.5 (Physical Intelligence's 3B-param vision-language-action foundation model — the published SOTA) and ACT (Action Chunking Transformer, ~80M params, ALOHA-paper architecture from 2023). Both ran inference behind a single Modal HTTP endpoint serving a Mac-side hardware loop.

π0.5 needed every input/output shape it expected to be hand-overridden, then produced actions outside the SO-101 servo's tracking envelope and failed to move the arm in any coherent way. ACT trained in 12 minutes from scratch on 8 demos and reached for the cube on the second live run — almost grasping it before timing out.

| Stat | Value | Notes |
|---|---|---|
| DOF | 6 | 5 joints + gripper |
| Cameras | 2 | wrist + base |
| π0.5 size | 3B | PaliGemma + 300M expert |
| ACT size | 80M | ResNet18 + tx encoder/decoder |
| π0.5 train | ~40 min | L40S, frozen backbone |
| ACT train | ~12 min | A10G, from scratch |
| π0.5 result | ~0 cm | arm did not move usefully |
| ACT result | reach | almost grasped at 8 demos |

💡 The meta-lesson
💡 The meta-lesson

For broad zero-shot generalization across embodiments and tasks, more parameters and more pretraining wins. For narrow specialization on a single task with a tiny dataset, the architecture whose output distribution matches your servo's tracking envelope wins. ACT, designed in 2023 to be trained from scratch on 50-demo ALOHA datasets, is purpose-built for the second regime. π0.5 is purpose-built for the first. We were squarely in the second.

ℹ What this article assumes

You're comfortable with: PyTorch, Transformer attention basics, robot joint-space concepts, and have at least heard of LeRobot / Modal. We re-derive the ACT objective, the camera-pinhole unprojection, and the reasoning behind π0.5's fp32-everywhere choice. We do not re-derive attention or VAEs from scratch — we treat them as primitives.

Chapter 1

Goal & constraints

One sentence: press a button, type a task in plain English, watch the arm attempt the task. That's the user-facing surface.

The constraint that drives most of the architecture is: no local GPU. The control machine is a Mac with an M-series chip. PyTorch on MPS works, but a 3B-param VLA is well past its memory limit, and even the 80M ACT shares bandwidth with the teleop loop in unhappy ways. Inference needs to happen on a remote GPU. The arm needs to be physically attached to the local machine via USB. Bridging the two cleanly is the work.

Other constraints, less negotiable than they sound:

  • Servo speed. SO-101 uses Feetech STS3215 servos. Their effective tracking speed under arm load is roughly 30 °/s — meaning a 3 °/tick command at 10 Hz is the upper limit before the position lags behind the goal. We will hit this wall.
  • Calibration drift. Each session, the cv2 camera index assignment can shift depending on USB enumeration order. Index 1 today is index 2 tomorrow.
  • Data scarcity. Recording demos is a manual, real-time, error-prone process. We collected 30 then 8 demos. Both numbers are well below what most published imitation-learning results assume.
  • Cost discipline. Modal bills per second. The smaller and faster the model, the cheaper the iteration.
Chapter 2

The hardware

SO-101 is a 3D-printed open-source 5-axis arm + parallel gripper from The Robot Studio, driven by Feetech STS3215 serial servos. Two arms come in the kit: a "follower" (does work) and a "leader" (you grab and drag — the follower mirrors). The leader has the same kinematics, just used as a 6-DOF input device.

| Joint | Bus ID | Range (deg) | Role |
|---|---|---|---|
| shoulder_pan | 1 | ≈ ±90° | base rotation |
| shoulder_lift | 2 | ≈ -110° to +60° | base tilt |
| elbow_flex | 3 | ≈ -100° to +100° | elbow bend |
| wrist_flex | 4 | ≈ +20° to +100° | wrist tilt |
| wrist_roll | 5 | ≈ -180° to +180° | wrist twist |
| gripper | 6 | 0° (open) to ~60° (closed) | parallel gripper |

Two USB cables (one per arm) into the Mac. One USB camera mounted on the follower's gripper — the "wrist cam," ~12 cm from the gripper tip. One USB webcam on a tripod looking at the workspace from a 30° angle — the "base cam." Calibration is per-arm, persisted under ~/.cache/huggingface/lerobot/calibration/, derived from a one-time manual sweep to the joint extremes.

⚠ Cable management is a model input

We didn't appreciate this until the second live run. The wrist cam's USB cable hangs in a loop near the gripper. Across 8 episodes, the loop's exact shape was different each time — and the model learned to attend to it as if it were part of the scene. Tape your cables flat before recording, or you teach your policy that "cable in this position = grasp now," which is a delightfully bad failure mode.

Chapter 3

System architecture

Three Modal apps and one Mac client, all sharing a single HuggingFace dataset and a single Modal Volume of weights:

yerobot — full topology Architecture
Mac (USB host)
SO-101 follower + leader · wrist cam · base cam · client/so101_run.py
↓ HTTP POST {images, state, task} ↓ lerobot-record · push_to_hub
Modal · so101-pi0
L40S/A10G GPU
PI0Policy.live (ASGI)
train() · train_act()
HuggingFace Hub
ozyphus/so101-cleanup-v3
30 → 8 episodes
parquet + AV1 video
Modal Volume · pi0-weights
/weights/hf (HF cache, pi05_base ≈ 10 GB)
/weights/ft/<run-name>/checkpoints/<step>/pretrained_model
↓ optional
Modal · so101-pi0-dashboard
CPU FastAPI · WebSocket · run history viewer (Phase 5)

The split mirrors what the GAS/HuggingBros experiments learned: tight loops on local hardware, slow loops on cloud GPUs. Here, "tight" is the servo-control loop (~10 Hz, USB-bound, no network). "Slow" is the policy forward pass (~10–300 ms inference + ~1 s network round trip). The Mac client batches actions from each policy call into a 20-tick chunk, which smooths the slow path back into the tight loop.

ℹ Why one Modal app for both inference and training?

Same image, same Volume of weights. train_act() writes a checkpoint to /weights/ft/..., then PI0Policy._load() reads it on the next live invocation. Co-locating the two functions in one app means there's no shipping intermediate artifacts between projects, no separate base image, no second Volume. The cost is that modal deploy rebuilds image + ASGI surface + train function together — but since the image is cached, redeploys are seconds.

Chapter 4

Imitation learning, formally

Both policies in this article are trained by behavioral cloning (BC). The formal setup: we have a dataset 𝒟 = {(ot(i), at(i))} of expert observations o and actions a across episodes i. We learn a policy πθ(a | o) by maximum likelihood:

θ* = argmaxθ   𝔼(o,a) ∼ 𝒟   log πθ(a | o)

For continuous-action robotics, the conditional distribution is usually parameterized as either Gaussian (giving an L2 / MSE objective) or Laplacian (L1) over an action chunk:

ℒBC(θ) = 𝔼(o,a)∼𝒟   ‖at:t+H − πθ(ot)‖₁

where H is the prediction horizon (the chunk size). Both ACT and π0.5 use this kind of chunked-action prediction, which dramatically reduces the well-known compounding error problem of single-step BC: if you only predict one step ahead, every error pushes the next observation out of distribution; if you predict a whole chunk and execute it open-loop, errors accumulate inside the chunk but the policy's state resets at the next chunk boundary.

∴ Why action chunking helps in one inequality

Compounding-error bound for single-step BC: under standard assumptions, expected return suffers cost growing like O(T²·ε) in horizon T and per-step BC error ε. With chunk size H, the policy commits an open-loop trajectory of length H, then re-observes — so the bound becomes O((T/H)·H²·ε) = O(T·H·ε). Larger chunks → fewer policy queries → less compounding, until H approaches a regime where the open-loop assumption itself breaks. ACT's default H = 100; π0.5's = 50. We use 50 for both.
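Plugging numbers in makes the gap concrete (a throwaway sketch; T and ε here are illustrative values, not measurements, and constants are dropped):

```python
def bc_bound(T, eps, H=1):
    """Single-step BC compounds like O(T*T*eps); chunked BC like O(T*H*eps)."""
    return T * T * eps if H == 1 else T * H * eps

# A 50 s episode at 10 Hz is T = 500 ticks; take a nominal per-step error eps = 0.01.
single = bc_bound(500, 0.01)          # T^2 * eps
chunked = bc_bound(500, 0.01, H=50)   # T * H * eps: chunking buys a factor of T/H = 10
```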

Chapter 5

π0.5 from first principles

π0 and π0.5 are vision-language-action foundation models from Physical Intelligence. The architecture fuses a frozen-ish PaliGemma 3B vision-language backbone with a small "action expert" transformer that produces continuous-action chunks via flow matching / diffusion.

5.1 — PaliGemma + action expert

The backbone is PaliGemma: a SigLIP image encoder + Gemma 2B language model, jointly trained. For π0.5, three input images (base camera, left wrist camera, right wrist camera, all 224×224 RGB) are tokenized via SigLIP into a sequence of vision tokens. The text input — built as "Task: {nl_task}, State: {discretized_state};\nAction: " where the state is discretized into 256 bins per joint — is tokenized via the Gemma tokenizer.

These visual + language tokens feed into the Gemma 2B transformer (the "VLM expert"), which produces a context tensor. A separate ~300M-parameter "action expert" transformer (Gemma-300M variant) cross-attends to that context to denoise an action chunk.

5.2 — Action-chunk denoising

Action prediction in π0.5 is flow matching, a continuous-time variant of diffusion. Given a clean action chunk a0, sample a timestep τ ∈ [0, 1] and a noise ε ∼ 𝒩(0, I). Construct the noisy chunk:

aτ = (1 − τ) · a0 + τ · ε

Train the network vθ (action expert) to predict the velocity field:

ℒFM(θ) = 𝔼τ, ε, (o,a0)   ‖vθ(aτ, τ, o) − (ε − a0)‖₂²

At inference, integrate the ODE backwards from τ = 1 (pure noise) to τ = 0 (clean action) using a few-step Euler scheme — π0.5 defaults to 10 inference steps. The output is a chunk of shape [chunk_size=50, max_action_dim=32]. The action_dim is padded to 32 so the model can serve multiple embodiments (a 14-DOF bimanual ALOHA, a 7-DOF Franka, our 6-DOF SO-101) with a single architecture; the leftover dims are zeroed out at training and ignored at inference.
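The training pair and the sampler fit in a few lines of numpy. This is a toy sketch of the objective above, not Physical Intelligence's implementation; the oracle check exploits the fact that the interpolation path is linear in τ:

```python
import numpy as np

def fm_pair(a0, rng):
    """One flow-matching training example: noisy chunk a_tau and regression target."""
    tau = rng.uniform()                       # timestep in [0, 1]
    eps = rng.standard_normal(a0.shape)       # Gaussian noise
    a_tau = (1 - tau) * a0 + tau * eps        # straight-line path: action -> noise
    return a_tau, tau, eps - a0               # target velocity field

def euler_sample(v_theta, shape, n_steps=10, seed=0):
    """Integrate from tau=1 (pure noise) back to tau=0 (clean chunk) with Euler steps."""
    a = np.random.default_rng(seed).standard_normal(shape)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        tau = 1.0 - i * dt
        a = a - dt * v_theta(a, tau)          # step backwards along the flow
    return a

# Sanity check: along the path, (a_tau - a0)/tau == eps - a0, so an oracle velocity
# field recovers the clean chunk exactly (the path is linear in tau).
a0 = np.zeros((50, 6))                        # chunk_size=50, action_dim=6 (unpadded)
recovered = euler_sample(lambda a, tau: (a - a0) / tau, a0.shape)
assert np.allclose(recovered, a0)
```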

ℹ State as tokens, not vectors

Most BC papers feed the robot's joint state into a small MLP, then concatenate the resulting vector with image features. π0.5 instead discretizes each state value into 256 bins and inlines them as text tokens in the prompt. This is unusual but elegant: the same tokenizer handles both the language task description and the proprioception. The cost is quantization noise (a 0.7° change won't change the discrete bin); the benefit is unified cross-attention with the language tokens.
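A minimal sketch of the binning, assuming uniform bins over the joint limits from the Chapter 2 table (the real tokenization lives inside lerobot and may differ in its exact bin mapping):

```python
import numpy as np

# Assumed per-joint limits (deg), lifted from the joint table in Chapter 2.
LOW  = np.array([-90.0, -110.0, -100.0,  20.0, -180.0,  0.0])
HIGH = np.array([ 90.0,   60.0,  100.0, 100.0,  180.0, 60.0])

def state_to_bins(state_deg, n_bins=256):
    """Uniformly bin each joint value into [0, n_bins)."""
    x = np.clip((np.asarray(state_deg) - LOW) / (HIGH - LOW), 0.0, 1.0)
    return np.minimum((x * n_bins).astype(int), n_bins - 1)

bins = state_to_bins([8.2, -103.5, 97.1, 84.7, 1.2, 1.1])
prompt = f"Task: clean up the desk, State: {' '.join(map(str, bins))};\nAction: "
```

Note the quantization cost stated above: any two joint values that land in the same 1/256-of-range bin produce identical state tokens.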

Chapter 6

ACT from first principles

ACT (Zhao et al., 2023, Stanford ALOHA paper) is a much simpler architecture, designed to be trained from scratch on small teleop datasets (~50 episodes) for a single task on a single robot. No pretraining, no language input — just images, state, and actions. The key idea is to wrap a transformer encoder-decoder in a conditional VAE.

6.1 — The conditional VAE wrapping

The motivation: human teleop is multi-modal. The same observation can correspond to different valid demonstrations (left-handed vs right-handed grasp, slow vs fast approach, etc). Maximum-likelihood Gaussian regression averages these modes and produces blurry "mean trajectories" that satisfy nobody. A VAE with a learned latent z lets the network commit to a single mode at inference (sample z, get a coherent trajectory).

The setup: an "encoder" reads the ground-truth action chunk + state and produces a latent posterior q(z | a1:H, s). A "decoder" reads z, the camera images, and the state, and produces a reconstructed action chunk â1:H. At inference, we drop the encoder and sample z from the prior p(z) = 𝒩(0, I).

ACT — train vs inference Architecture
Train pass
action chunk + state → encoder → q(z|a,s)
z + images + state → decoder → â
loss = L1(a, â) + KL[q || p]
Inference pass
z ~ p(z) = 𝒩(0, I)
z + images + state → decoder → â
execute first n_action_steps
↓ inside the decoder
Decoder = transformer encoder-decoder
ResNet18 backbones (one per camera, ImageNet pre-init) → CNN features
+ proprioceptive state token + latent z token → tx encoder (4 layers, dim 512, 8 heads)
→ tx decoder cross-attends from learnable action queries → MLP head → action chunk [H, action_dim]

6.2 — The loss

ℒACT(θ, ϕ) = 𝔼(o,a)   [ ‖a1:H − â1:H‖₁   +   β · KL( qϕ(z | a, s) ‖ p(z) ) ]

where:

  • L1 reconstruction on actions (Laplace likelihood — smoother gradients than L2 near zero, more robust to outliers from noisy demos).
  • KL divergence between the posterior qϕ(z | a, s) and the standard-Normal prior p(z). The posterior is a small transformer that reads the ground-truth action chunk and outputs μ, σ for a diagonal Gaussian.
  • β = kl_weight, defaulting to 10 in lerobot. Higher β collapses the latent toward the prior (less expressivity, more determinism).

Training runs the encoder + decoder end to end. At inference, the encoder is discarded — z ∼ 𝒩(0, I) gives a sample from the action distribution. For deterministic eval one can fix z = 0 (the prior mean), which is what lerobot does by default.
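Numerically the objective looks like this (a sketch with a diagonal-Gaussian posterior; lerobot's exact reduction over batch and latent dims may differ):

```python
import numpy as np

def act_loss(a, a_hat, mu, log_var, beta=10.0):
    """L1 reconstruction + beta * KL(q(z|a,s) || N(0, I)), diagonal-Gaussian posterior."""
    recon = np.abs(a - a_hat).mean()                        # Laplace log-likelihood term
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return recon + beta * kl

# Posterior collapsed onto the prior (mu=0, log sigma^2=0): KL vanishes, loss is pure L1.
a = np.ones((50, 6))                                        # one chunk of H=50 actions
assert act_loss(a, a, np.zeros(32), np.zeros(32)) == 0.0
```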

∴ Why a CVAE rather than diffusion

Both ACT and π0.5 wrap a transformer in a generative model over action chunks. The CVAE requires a single forward pass at inference (sample z once, decode). Diffusion requires K forward passes (Euler steps over τ). At 80M params and 11 ms per forward, ACT can comfortably re-plan at ≥30 Hz; π0.5 at 3B and ~280 ms (10 steps × ~28 ms each on L40S) re-plans at ~3 Hz. For our 10 Hz control loop with 20-action chunks (i.e. re-plan every 2 s), the gap doesn't matter much, but for closed-loop visual servoing it does.

6.3 — ACT at the parameter level

The lerobot defaults we used:

| Hyperparameter | Value | Notes |
|---|---|---|
| vision_backbone | resnet18 | ImageNet pretrained, fully fine-tuned during training |
| dim_model | 512 | transformer width |
| n_heads | 8 | attention heads |
| n_encoder_layers | 4 | transformer encoder depth |
| n_decoder_layers | 1 | transformer decoder depth (yes, only 1) |
| chunk_size | 50 | actions predicted per forward pass |
| n_action_steps | 50 | actions executed before re-querying |
| use_vae | True | flip to off for deterministic regression |
| latent_dim | 32 | VAE latent z dim |
| kl_weight (β) | 10.0 | standard ACT setting |
| optimizer_lr | 1e-5 | Adam (yes, just Adam) |
| dropout | 0.1 | standard tx dropout |
Chapter 7

Dataset collection

Dataset collection is the part of robot-learning research that feels least like ML and most like a craft. You sit at a desk. You hold the leader arm. You drag the leader to the cube. You squeeze the leader's gripper to close the follower's gripper. You lift. You move the object to the tray. You open the gripper. You return to a roughly-consistent home pose. You press a key to mark the episode complete. You move the cube to a new spot. Repeat thirty times.

The tool is lerobot-record, the LeRobot CLI for teleop demo collection. It handles the leader/follower pairing, camera capture, dataset serialization (parquet for state/action, AV1-encoded MP4 for video), and pushes the result to a HuggingFace dataset repo:

bash · record_v3.sh
lerobot-record \
  --robot.type=so101_follower \
  --robot.port=/dev/tty.usbmodem5A7A0546771 \
  --robot.id=so101_follower_main \
  --robot.cameras='{ wrist: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 15, backend: AVFOUNDATION}, base: {type: opencv, index_or_path: 1, width: 640, height: 480, fps: 15, backend: AVFOUNDATION} }' \
  --teleop.type=so101_leader \
  --teleop.port=/dev/tty.usbmodem5A7A0545661 \
  --teleop.id=so101_leader_main \
  --dataset.repo_id=ozyphus/so101-cleanup-v3 \
  --dataset.single_task="clean up the desk" \
  --dataset.num_episodes=30 \
  --dataset.fps=15 \
  --dataset.episode_time_s=35 \
  --dataset.reset_time_s=15 \
  --dataset.push_to_hub=true \
  --dataset.private=true

Resulting parquet schema, per row (one row per timestep at 15 Hz):

| Column | Dtype | Shape | Meaning |
|---|---|---|---|
| observation.state | float32 | (6,) | follower's current joint positions |
| action | float32 | (6,) | leader's commanded positions (= follower target) |
| observation.images.wrist | video ref | (480, 640, 3) | AV1-encoded MP4 frame |
| observation.images.base | video ref | (480, 640, 3) | AV1-encoded MP4 frame |
| timestamp | float64 | scalar | seconds since episode start |
| episode_index | int64 | scalar | episode id |
| frame_index | int64 | scalar | frame within episode |
| task | str | scalar | "clean up the desk" |

Important subtlety: action ≠ state in teleop datasets. The action is the leader arm's joint configuration at tick t; the state is what the follower actually achieved. The follower lags the leader by a few ticks under load. This is intentional — the dataset captures the operator's intent (action) plus the achieved configuration (state). At inference, the policy outputs an action (target leader pose ≈ target follower pose); the servo PID closes the loop.

⚠ Empty-stub directory bug

If you abort lerobot-record before any episode is saved, it leaves an empty ~/.cache/huggingface/lerobot/<repo>/meta/info.json stub on disk. Re-running crashes with FileExistsError. The fix is one line: rm -rf ~/.cache/huggingface/lerobot/<repo> and re-run. Don't delete the parent. Yes, this took us 5 minutes to figure out.

7.1 — What we actually collected

Three datasets. Lessons learned with each.

| Repo | Episodes | Frames | Notes |
|---|---|---|---|
| so101-cleanup-v1 | — | — | aborted; 20 s episode timer cut grasps mid-motion |
| so101-cleanup-v2 | 30 | 15,402 | full set, ~34 s/episode, used for π0.5 fine-tune |
| so101-cleanup-v3 | 8 | 4,149 | recording crashed; we kept what we had. Different physical setup (cam indices changed) |
Chapter 8

First attempt: training π0.5

π0.5's input_features dict, as published, expects three image keys (base_0_rgb, left_wrist_0_rgb, right_wrist_0_rgb all at 3×224×224), a 32-dim padded state, and outputs a 32-dim padded action. We had two cams (base and a single wrist), 6-dim state, 6-dim action. Every shape was wrong. Every key was wrong.

The fix that made the trainer happy was a combination of a feature override and a column rename:

python · experiments/pi0/app.py::train()
policy_cfg = PreTrainedConfig.from_pretrained("lerobot/pi05_base")
policy_cfg.pretrained_path = "lerobot/pi05_base"
policy_cfg.device = "cuda"
policy_cfg.freeze_vision_encoder = True   # frozen PaliGemma SigLIP — fits L40S
policy_cfg.train_expert_only = True       # only train the 300M action expert
policy_cfg.optimizer_lr = lr
policy_cfg.push_to_hub = False

# Override 32→6 for state and action; pi05's pad_vector pads internally at forward time.
policy_cfg.input_features["observation.state"] = PolicyFeature(
    type=FeatureType.STATE, shape=(6,)
)
policy_cfg.output_features["action"] = PolicyFeature(
    type=FeatureType.ACTION, shape=(6,)
)

# Map our two dataset image keys onto pi05's first two expected keys; the missing third
# key (right_wrist_0_rgb) gets filled with a -1-padded "empty camera" by pi05's forward.
rename_map = {
    "observation.images.base":  "observation.images.base_0_rgb",
    "observation.images.wrist": "observation.images.left_wrist_0_rgb",
}

train_cfg = TrainPipelineConfig(
    dataset=DatasetConfig(repo_id="ozyphus/so101-cleanup-v2"),
    policy=policy_cfg,
    output_dir=Path("/weights/ft") / job_name,
    batch_size=4,
    steps=4000,
    save_freq=1000,
    rename_map=rename_map,
)
run_train(train_cfg)
weights_vol.commit()

Each line is a deliberate concession to π0.5's published architecture:

  • freeze_vision_encoder=True + train_expert_only=True — Of π0.5's ~3B params, this trains only the ~300M action expert. The PaliGemma backbone (SigLIP + Gemma 2B) stays frozen at the upstream weights. This was a VRAM and compute necessity (full FT would not fit on L40S without offloading), but it's also a regularization choice: we have 15K frames; we can't responsibly fine-tune 3B params on that.
  • Shape override — Without overriding input_features.observation.state.shape to (6,), the dataset normalizer fails because dataset stats are 6-dim and the policy declares 32-dim. Setting it to (6,) lets the normalizer match; inside the model, pad_vector(state, 32) still fires at forward time to satisfy the action-expert input layer's fixed shape.
  • Image rename_map — Maps dataset columns to policy-expected feature names at load time. The third image key (right_wrist_0_rgb) is not in the dataset; π0.5's get_image_features() handles missing keys by inserting a fully-saturated -1 image (SigLIP's mean-zero) and zeroing the corresponding mask bit.

We also had to discover the lerobot extras hierarchy by failing through it:

# experiments/pi0/app.py — image build_step
- "lerobot[pi] @ git+https://github.com/huggingface/lerobot.git@main",
+ "lerobot[pi,dataset,training] @ git+https://github.com/huggingface/lerobot.git@main",
# pi       → policy code (pi05 module)
# dataset  → datasets, parquet, video decoding
# training → accelerate, optimizers, training loop

And π0.5's default push_to_hub=True config requires a repo_id — if neither is set explicitly, training fails with a ValueError: 'policy.repo_id' argument missing at the end-of-training save step. Setting push_to_hub=False persists the checkpoint to the Volume only.

8.1 — Training results

4000 steps at batch 4, lr 1e-5, on an L40S took 41 minutes. Loss dropped 0.27 → 0.067 (a smooth 4× reduction over the run). Every metric of the training process said "this is working":

[train] starting: pi05-so101-cleanup-20260424_202558
step:50   loss:0.258 grdn:1.59 lr:2.0e-06 updt_s:0.597
step:1000 loss:0.118 grdn:1.18 lr:9.0e-06 updt_s:0.583
step:2000 loss:0.115 grdn:1.24 lr:6.3e-06
step:3000 loss:0.088 grdn:1.02 lr:3.7e-06
step:4000 loss:0.067 grdn:0.89 lr:2.5e-06
[train] done. committed Volume. FT checkpoint → /weights/ft/pi05-so101-cleanup-20260424_202558
Chapter 9

Serving on Modal

Inference is a single FastAPI endpoint behind @modal.fastapi_endpoint(method="POST"). The class lifecycle is:

python · experiments/pi0/app.py::PI0Policy
@app.cls(
    image=image,
    gpu="L40S",
    volumes={WEIGHTS_DIR: weights_vol},
    secrets=SECRETS,
    timeout=1800,
    scaledown_window=300,
    max_containers=2,
)
class PI0Policy:
    @modal.enter()
    def _load(self) -> None:
        weights_vol.reload()
        for repo_id, policy_type in CHECKPOINT_CANDIDATES:
            try:
                policy_cls = get_policy_class(policy_type)
                config = PreTrainedConfig.from_pretrained(repo_id)
                config.device = "cuda"
                policy = policy_cls.from_pretrained(repo_id, config=config)
                policy.to("cuda").eval()
                preprocessor, postprocessor = make_pre_post_processors(
                    policy.config, pretrained_path=repo_id,
                    preprocessor_overrides={"device_processor": {"device": "cuda"}},
                    postprocessor_overrides={"device_processor": {"device": "cuda"}},
                )
                self.policy, self.preprocessor, self.postprocessor = policy, preprocessor, postprocessor
                self.checkpoint, self.policy_type = repo_id, policy_type
                self.expected_image_keys = [k for k in config.input_features if "image" in k]
                state_feat = config.input_features.get("observation.state")
                self.expected_state_dim = int(state_feat.shape[0]) if state_feat else 32
                weights_vol.commit()
                return
            except Exception as e:
                print(f"[pi0] failed {repo_id}: {e}")
        raise RuntimeError("no checkpoint loaded")

    @modal.fastapi_endpoint(method="POST", docs=True)
    def live(self, payload: dict) -> dict:
        images = {k: _decode(v) for k, v in (payload.get("images") or {}).items()}
        state  = payload.get("state") or [0.0]*6   # 6-dim joint state fallback
        task   = payload.get("task")  or "pick up the object"
        raw    = self._build_raw_obs(images, state, task)
        chunk, latency_ms = self._predict(raw)
        return {
            "action_chunk": chunk[: self.n_action_steps].tolist(),
            "horizon": self.n_action_steps,
            "latency_ms": round(latency_ms, 2),
            "checkpoint": self.checkpoint,
            "expected_image_keys": self.expected_image_keys,
            "received_image_keys": sorted(images.keys()),
        }

Three notable design choices in this class:

  • CHECKPOINT_CANDIDATES as a fallback list. The first loadable entry wins. Local FT checkpoints first, then upstream pretrained as fallback. This means a corrupted local checkpoint never takes the endpoint down — it gracefully falls through to lerobot/pi05_base. We re-deploy by editing this list.
  • scaledown_window=300 keeps the container warm for 5 minutes after the last request. Cold start is ~90 s for π0.5 (loading the 3B-param backbone from the Volume) and ~2 s for ACT. Once warm, π0.5 inference is ~280 ms, ACT is ~12 ms.
  • _build_raw_obs handles cam-key heterogeneity. The same endpoint can serve π0.5 (which expects 3 image keys, replicated/padded) and ACT (which expects whatever the dataset has — 2 keys, named wrist and base). The mapper substring-matches semantic names ("base" in the key → use the client's "base" cam; "wrist" → use the client's "wrist" cam; otherwise zeros). One client, multiple policy types, no client-side change.
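The core of the mapper, sketched with a hypothetical map_images helper (the real logic lives in _build_raw_obs in experiments/pi0/app.py):

```python
import numpy as np

def map_images(expected_keys, client_images, hw=(480, 640)):
    """Substring-match policy-expected image keys to the client's camera names."""
    out = {}
    for key in expected_keys:
        if "base" in key and "base" in client_images:
            out[key] = client_images["base"]
        elif "wrist" in key and "wrist" in client_images:
            out[key] = client_images["wrist"]              # wrist keys replicate the one wrist cam
        else:
            out[key] = np.zeros((*hw, 3), dtype=np.uint8)  # no match -> zeros
    return out

pi05_keys = ["observation.images.base_0_rgb",
             "observation.images.left_wrist_0_rgb",
             "observation.images.right_wrist_0_rgb"]
cams = {"base": np.zeros((480, 640, 3), np.uint8),
        "wrist": np.ones((480, 640, 3), np.uint8)}
obs = map_images(pi05_keys, cams)   # right_wrist_0_rgb reuses the single wrist frame
```

ACT's two expected keys ("wrist", "base") hit the same two branches, which is what lets one endpoint serve both policy types.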

9.1 — Why fp32 throughout

The most surprising serving choice is keeping π0.5 in fp32, not fp16. We discovered the same fp16 trap that GroundingDINO has: π0.5 fuses two architectures (PaliGemma's SigLIP + Gemma vision-language tower, plus the action expert with its own attention paths). Loading the whole policy in fp16 produces dtype-mismatch errors at the cross-attention boundary between the VLM and the action expert. Loading in fp32 takes about 9 GB — comfortably below L40S's 48 GB — and inference is 280 ms warm vs ~180 ms in fp16 (an imperceptible difference at 10 Hz control rate).

Chapter 10

The Mac client

One Python file, ~400 lines: client/so101_run.py. It does five things, in this order, every tick:

  1. Read follower joint positions and gripper state from USB.
  2. Read base + wrist camera frames (lerobot does this as part of get_observation()).
  3. Either pop a queued action from the previous chunk, or POST to the Modal endpoint to refill the queue with a new 50-action chunk.
  4. Compute the goal as either scale × action (absolute mode) or state + scale × action (delta mode), then clip per-joint to ±max_joint_step_deg.
  5. Send the safe goal to the follower via send_action().
python · client/so101_run.py — main loop
step = 0
action_queue: list[np.ndarray] = []
while not stop["v"] and (args.steps < 0 or step < args.steps):
    loop_t0 = time.perf_counter()            # tick start, used by the rate limiter below
    obs = robot.get_observation()
    state = obs_to_state_vec(obs)            # → list[float] of 6 joint positions
    frames = {name: obs.get(name) for name in cameras}

    if not action_queue:
        # Refill via Modal POST; payload format is policy-agnostic.
        images_payload = {name: encode_to_b64(f) for name, f in frames.items() if f is not None}
        resp = post_observation(
            args.endpoint_url,
            {"images": images_payload, "state": state, "task": args.task},
            timeout=180.0,
        )
        chunk = np.asarray(resp["action_chunk"], dtype=np.float32)  # shape [50, 6]
        take = min(args.actions_per_chunk, chunk.shape[0])
        action_queue = [chunk[i] for i in range(take)]

    action_vec = action_queue.pop(0)
    goal = action_to_motor_dict(action_vec, obs, mode=args.action_mode, scale=args.action_scale)
    safe_goal = clip_toward(goal, {k: obs[k] for k in goal}, args.max_joint_step_deg)
    if not dry_run:
        robot.send_action(safe_goal)

    step += 1
    time.sleep(max(0, 1 / args.hz - (time.perf_counter() - loop_t0)))

The --action-mode flag is the single most important client-side knob. It controls how the policy's output is interpreted:

| Mode | Goal | When to use |
|---|---|---|
| delta | state + scale·action | policy output is a velocity-like signal (relative) |
| absolute | scale·action | policy output is the target joint position (what teleop datasets actually contain) |

During recording, lerobot stores the leader's joint positions as the action — so for any policy trained on a teleop dataset, --action-mode=absolute is correct. This matters: with absolute mode, the policy's output of "elbow_flex = 14°" means "drive elbow to 14°", regardless of where the elbow currently is. With delta mode it would mean "add 14° to the current elbow position," which is wildly wrong if the policy outputs target poses near zero. We started with delta and saw nonsensical behavior. Absolute fixed it.

⚠ The safety clip is not optional

Even with absolute mode, when the live policy first comes up, the predicted target may be tens of degrees away from the current pose. Sending that goal directly to the servos would cause a snap-to-target motion at full speed — bad for the gear train, worse for whatever's in the workspace. clip_toward(goal, present, max_delta=3.0) caps each step to ±3° per joint per tick, so the servos can only ever take small steps. The policy keeps re-emitting the same large goal each tick; the arm crawls toward it.
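Both steps fit in a few lines. compute_goal below is a hypothetical stand-in for the client's action_to_motor_dict; clip_toward mirrors the helper described above:

```python
def compute_goal(state, action, mode="absolute", scale=1.0):
    """Interpret the policy output per --action-mode."""
    if mode == "absolute":
        return {j: scale * a for j, a in action.items()}          # target position
    return {j: state[j] + scale * a for j, a in action.items()}   # offset from current

def clip_toward(goal, present, max_delta=3.0):
    """Cap each joint's per-tick travel to +/- max_delta degrees toward its goal."""
    return {j: present[j] + max(-max_delta, min(max_delta, g - present[j]))
            for j, g in goal.items()}

present = {"elbow_flex": 92.0}
goal = compute_goal(present, {"elbow_flex": 14.0})  # policy says: drive elbow to 14 deg
safe = clip_toward(goal, present)                   # one clipped step: {'elbow_flex': 89.0}
```

Re-running this every tick produces exactly the crawl described above: the goal stays at 14°, the arm advances 3° per tick.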

Chapter 11

Camera math

Both policies consume images directly — no calibrated 3D required. But for completeness (and because the next iteration of this project will use it), the camera unprojection is worth stating cleanly. The pinhole model:

u = fx · X/Z + cx
v = fy · Y/Z + cy

(u, v) are pixel coordinates, (X, Y, Z) are camera-frame 3D coordinates, fx, fy are focal lengths in pixels, and (cx, cy) is the principal point (≈ image center). Inverting, given a pixel and its depth Z:

X = (u − cx) · Z / fx
Y = (v − cy) · Z / fy

For a generic USB webcam without a calibrated K, the rule-of-thumb default fx = fy = max(H, W) with principal point at (W/2, H/2) is good enough for visualization. It assumes ~60° horizontal FOV, which most laptop and gripper-mount cameras are within ±10° of. For metric reconstruction (path planning, collision checking) you'd calibrate with a ChArUco board. For pose-conditioned imitation learning, the policy doesn't see K at all — it sees raw images.
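As a sanity check, the unprojection and its inverse round-trip exactly under the default-K assumption (a throwaway sketch, not project code):

```python
def unproject(u, v, Z, W=640, H=480):
    """Pixel + depth -> camera-frame 3D using the rule-of-thumb intrinsics above."""
    fx = fy = max(H, W)           # ~60 deg horizontal FOV assumption
    cx, cy = W / 2, H / 2
    return ((u - cx) * Z / fx, (v - cy) * Z / fy, Z)

def project(X, Y, Z, W=640, H=480):
    """Inverse check: camera-frame 3D back to pixel coordinates."""
    fx = fy = max(H, W)
    cx, cy = W / 2, H / 2
    return (fx * X / Z + cx, fy * Y / Z + cy)

# The principal point maps onto the optical axis; the round trip is exact.
assert unproject(320, 240, 1.0) == (0.0, 0.0, 1.0)
assert project(*unproject(400, 300, 0.5)) == (400.0, 300.0)
```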

Chapter 12

First live run

With the π0.5 fine-tune deployed and the live endpoint smoke-tested, we ran:

bash
python client/so101_run.py \
    --wrist-cam-index 1 --base-cam-index 2 \
    --task "clean up the desk" \
    --steps 50 \
    --action-mode absolute --action-scale 1.0 \
    --max-joint-step-deg 3.0 \
    --live

The arm moved a few degrees, then froze. The log told a precise story:

[LIVE 0]  pres[sho= +8.2 sho=-103.5 elb= +97.1 wri= +84.7 wri= +1.2 gri= +1.1] d[sho-0.21 sho+0.63 elb-3.00 wri+0.75 wri+0.08 gri-0.23]
[LIVE 3]  pres[sho= +8.2 sho=-103.4 elb= +95.6 wri= +84.9 wri= +1.2 gri= +1.1] d[sho-0.09 sho+0.33 elb-3.00 wri+0.01 wri+0.06 gri-0.08]
[LIVE 8]  pres[sho= +8.1 sho=-102.9 elb= +93.1 wri= +85.3 wri= +1.2 gri= +1.1] d[sho-0.12 sho-0.14 elb-2.98 wri-0.25 wri-0.06 gri-0.09]
[LIVE 10] pres[sho= +8.1 sho=-102.9 elb= +92.6 wri= +85.3 wri= +1.2 gri= +1.1] d[sho-0.08 sho+2.84 elb-3.00 wri-1.72 wri-0.75 gri-0.15]
[LIVE 20] pres[sho= +8.1 sho=-100.6 elb= +92.6 wri= +84.0 wri= +0.4 gri= +1.1] d[sho-0.03 sho-0.01 elb-3.00 wri-0.99 wri-0.05 gri-0.06]
[LIVE 49] pres[sho= +8.1 sho=-100.1 elb= +94.2 wri= +83.1 wri= +0.4 gri= +1.1] d[sho-0.43 sho+0.33 elb-3.00 wri+0.17 wri-0.25 gri+0.16]

Two things stuck out:

  1. Elbow stuck at 92.6° from tick 10 onwards. Despite a continuous -3.00° command (the safety clip's saturation value), the present elbow position did not decrease.
  2. Other joints moved very slowly. Shoulder_pan moved 0.1° over 50 ticks. Shoulder_lift moved 3.4°. Wrist_flex 1.6°. Tiny absolute travel.

Before drawing any conclusions, we instrumented the client to print absolute joint state alongside the per-tick delta — that's the pres[...] column in the log above. This revealed the real story.

Chapter 13

Servo torque vs gravity

Here is the elbow trajectory across two runs, plotted as commanded vs achieved per-tick motion:

| Run | Joint | Δ over N ticks | Commanded /tick | Achieved /tick |
|---|---|---|---|---|
| π0.5 first live | elbow_flex | 97.1° → 92.6° in 10 ticks | -3.0° | -0.45° |
| π0.5 first live | elbow_flex | 92.6° → 92.6° in 40 ticks | -3.0° | 0° |
| home-only test | elbow_flex | 99.4° → 96.8° in 5 ticks | -2.0° | -0.5° |
| home-only test | elbow_flex | 96.8° → 96.8° in 187 ticks | -2.0° | 0° |
| home-only test | shoulder_pan | 8.2° → 2.3° in 10 ticks | -2.0° | -0.6° |
| home-only test | wrist_flex | 82.9° → 73.5° in 10 ticks | -2.0° | -0.94° |

The interpretation:

  • The servos can move at ~0.5°/tick at 10 Hz (= 5°/s) under nominal load.
  • The elbow specifically stops moving when its position hits ~92–97°. Other joints continue their slow creep.
  • This isn't a calibration limit (the dataset's recorded actions go down to -96°, so the servo definitely can travel there under teleop).

The mechanism is gravity load. With shoulder_lift at -102° (arm tilted back near vertical) and elbow_flex at +97° (forearm pointing almost straight forward), the forearm + gripper + wrist cam hangs out as a moment arm of about 12 cm. To straighten the elbow further (drive elbow toward 0°), the servo must lift this moment arm against gravity. The Feetech STS3215 is rated for ~30 kg·cm stall torque; under a smooth motion profile with PID gains tuned for teleop loads it delivers ≈ 1/3 of that continuously. The arm + payload exceeds this in this specific posture.
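To make the load claim concrete, here is the static-torque arithmetic. The effective mass is a labeled assumption (we never weighed the forearm assembly); only the ~12 cm lever and the ~10 kg·cm continuous budget come from the text above.

```python
import math

def holding_torque_kgcm(mass_kg: float, lever_cm: float, angle_deg: float) -> float:
    """Static torque (kg·cm) to hold a point mass on a lever.

    angle_deg is the lever's angle from horizontal: the gravity moment is
    maximal at 0° (lever horizontal) and zero at 90° (lever vertical).
    In kgf·cm units the g factor cancels: torque = m * L * cos(angle).
    """
    return mass_kg * lever_cm * math.cos(math.radians(angle_deg))

# Assumed numbers: ~0.8 kg effective mass (forearm link + wrist servos +
# gripper + USB cam) on the ~12 cm moment arm, forearm near horizontal.
tau = holding_torque_kgcm(0.8, 12.0, 10.0)
print(f"{tau:.1f} kg·cm")  # 9.5 kg·cm: right at the ~1/3-of-stall continuous budget
```

Holding torque is the floor; accelerating the link against friction on top of it is what pins the servo.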

During teleop recording, the leader arm guided the follower smoothly — small, continuous changes that the PID absorbed. At inference, we sent a step-target 80° away from the current state every tick. The PID asserted full effort, the servo couldn't make progress, and the position got pinned at the postural-equilibrium point.

💡 The hardware envelope is part of the policy spec

We had been treating "the model outputs in-distribution actions" as a sufficiency condition. It isn't. The output must also be inside the servo's tracking envelope given the current arm configuration. A 60° absolute target step is in-distribution for π0.5's training data (other robots, other postures) but outside this particular SO-101's envelope when shoulder_lift = -102°. ACT, trained from scratch on our actual demos, naturally outputs within-envelope actions because that's the only thing it ever saw.

Chapter 14

The homing trick

One robust mitigation: before the policy loop starts, drive the arm to the median start-of-episode pose from the training dataset, slowly. This guarantees the policy's first action is close to a configuration it was trained from.

python · client/so101_run.py — TRAINING_HOME_POSE_DEG
# Median first-frame joint positions across so101-cleanup-v3 (8 episodes).
TRAINING_HOME_POSE_DEG = {
    "shoulder_pan":  -7.6,
    "shoulder_lift": -102.5,
    "elbow_flex":     94.1,
    "wrist_flex":     62.7,
    "wrist_roll":     -9.3,
    "gripper":         0.6,
}

def _drive_to_home(step_deg=2.0, timeout_s=15.0):
    t0 = time.perf_counter()
    while time.perf_counter() - t0 < timeout_s:
        present = {f"{j}.pos": float(robot.get_observation()[f"{j}.pos"])
                   for j in JOINT_NAMES}
        target  = {f"{j}.pos": float(TRAINING_HOME_POSE_DEG[j])
                   for j in JOINT_NAMES}
        diffs = {k: target[k] - present[k] for k in target}
        if all(abs(v) < step_deg for v in diffs.values()):
            return  # within tolerance — done
        safe = clip_toward(target, present, step_deg)
        robot.send_action(safe)
        time.sleep(0.1)  # ~10 Hz homing loop
    # On timeout the loop falls through silently; the policy starts from wherever the arm got.
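The `clip_toward` helper isn't shown in this excerpt. A minimal reconstruction, consistent with how the snippet uses it (clamp each joint's target to within `step_deg` of the present position per tick), would be:

```python
def clip_toward(target: dict, present: dict, step_deg: float) -> dict:
    """Clamp each '<joint>.pos' target to within ±step_deg of the present position."""
    safe = {}
    for key, tgt in target.items():
        delta = tgt - present[key]
        # move at most step_deg per tick, in the direction of the target
        delta = max(-step_deg, min(step_deg, delta))
        safe[key] = present[key] + delta
    return safe

present = {"elbow_flex.pos": 97.1}
target  = {"elbow_flex.pos": 17.0}   # 80° away: the kind of step π0.5 emitted
print(clip_toward(target, present, 3.0))  # {'elbow_flex.pos': 94.1}
```

The same clamp is what the live client's safety clip applies every tick, which is why a saturated -3.00° delta shows up in the logs whenever the policy asks for a large jump.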

We compute the home pose by reading the parquet files of episode-start frames and taking the median per joint. This is more robust than the mean because a single outlier episode (e.g. recorded with the leader arm in a weird pose) can drag the mean significantly. With 8 episodes, the median is well-defined and the per-joint stdev is small (≤ 0.85° for shoulder_lift, ≤ 6.8° for wrist_flex — the latter being where most of the natural reset variance lives).
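The median-over-mean choice is easy to demonstrate with a toy version of the per-joint reduction (synthetic numbers below, not the real episode frames):

```python
import numpy as np

def home_pose(first_frames: np.ndarray) -> np.ndarray:
    """first_frames: (num_episodes, num_joints) start-of-episode joint angles (deg)."""
    return np.median(first_frames, axis=0)

# 7 well-behaved episodes plus one outlier recorded with the leader in a weird pose
shoulder_lift = np.array([-102.5, -102.3, -102.8, -102.4, -102.6, -102.5, -102.7, -80.0])
frames = shoulder_lift[:, None]

print(round(float(np.mean(frames)), 3))   # -99.725 (dragged ~2.8° by one bad episode)
print(float(home_pose(frames)[0]))        # -102.5  (unaffected)
```

With 8 episodes a single outlier is 12.5% of the sample, so the mean shifts by outlier_error / 8 while the median doesn't move at all.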

ℹ Why this isn't fixing the underlying servo issue

Homing brings the arm to the dataset's start distribution. From that pose, the policy's first action is in-distribution and the arm moves. But once the policy commands elbow extension to do the actual reach, we re-enter the gravity-loaded posture and the servo can stall again. The full fix is either: (a) record demos from multiple initial postures so the policy learns to operate from low-load configurations, or (b) tune the servo PID gains for higher continuous current. We didn't do either; homing was enough to get past the first barrier.

Chapter 15

Pivot: switching to ACT

After multiple π0.5 live attempts producing stalled or near-zero motion, we made the call: drop the SOTA model, train ACT from scratch on the same dataset. Reasoning:

  • Distribution match — ACT trains on our exact data. Its action distribution is, by construction, in our servo's envelope.
  • No shape gymnastics — ACT auto-populates input_features from the dataset. Two image keys, 6-dim state, 6-dim action, used as-is.
  • Smaller GPU — A10G (24 GB, ~$1.10/hr on Modal) is plenty for an 80M model; π0.5 needed L40S.
  • Faster iteration — fewer minutes per training run means more chances to tune the recording protocol.

This was also a calibrated risk against advice in much of the recent literature — "use the biggest VLA you can fit, fine-tune from a checkpoint, you'll get better generalization." That's true for cross-task generalization. We're squarely in the single-task, narrow-data regime where smaller specialized architectures often win.

Chapter 16

Training ACT

The ACT training function is dramatically simpler than the π0.5 one. No feature overrides, no rename map, no freeze-strategy juggling:

python · experiments/pi0/app.py::train_act()
@app.function(
    image=image,
    gpu="A10G",
    volumes={WEIGHTS_DIR: weights_vol},
    secrets=SECRETS,
    timeout=2 * 3600,
)
def train_act(
    dataset_repo_id: str = "ozyphus/so101-cleanup-v3",
    num_steps: int = 4000,
    batch_size: int = 8,
    lr: float = 1e-5,
    save_freq: int = 1000,
    log_freq: int = 100,
) -> dict:
    from lerobot.policies.act.configuration_act import ACTConfig
    from lerobot.configs.default import DatasetConfig
    from lerobot.configs.train import TrainPipelineConfig
    from lerobot.scripts.lerobot_train import train as run_train

    weights_vol.reload()
    name = f"act-so101-cleanup-{time.strftime('%Y%m%d_%H%M%S')}"
    output_dir = Path(WEIGHTS_DIR) / "ft" / name

    policy_cfg = ACTConfig(
        device="cuda",
        optimizer_lr=lr,
        push_to_hub=False,
        chunk_size=50,         # match our control horizon
        n_action_steps=50,
    )
    train_cfg = TrainPipelineConfig(
        dataset=DatasetConfig(repo_id=dataset_repo_id),
        policy=policy_cfg,
        output_dir=output_dir,
        batch_size=batch_size,
        steps=num_steps,
        save_checkpoint=True,
        save_freq=save_freq,
        log_freq=log_freq,
    )
    run_train(train_cfg)
    weights_vol.commit()
    return {"output_dir": str(output_dir), "steps": num_steps}

Loss trajectories from two ACT runs:

v2 dataset (30 episodes, 8000 steps, A10G):
  step:50    loss:84.560  grdn:1351   (huge initial — VAE KL term dominates)
  step:100   loss:9.589   grdn:204
  step:1000  loss:1.973   grdn:65
  step:4000  loss:0.521   grdn:30
  step:8000  loss:0.245   grdn:18
  done in 21:14, ~$0.40

v3 dataset (8 episodes, 4000 steps, A10G):
  step:1000  → checkpoint
  step:4000  → checkpoint, loss converged
  done in ~12 min, ~$0.22

A few observations on the loss curve:

  • Initial loss is huge. This is the KL term dominating — at init, the posterior q(z | a, s) is uncorrelated with anything, so the encoder produces diffuse posteriors with high KL to 𝒩(0, I). Once a few hundred steps reduce the KL, the recon term takes over.
  • Loss is in arbitrary units. Don't compare ACT's 0.245 to π0.5's 0.067. Different normalization, different objectives (L1+KL vs L2 flow-matching). The only thing that matters is whether the curve is monotonically decreasing and the policy works.
  • ~8 epochs over 8 episodes is fine. 4000 steps × bs 8 = 32K samples ≈ 7.7 epochs over the 4,149-frame v3 dataset. With dropout and the VAE bottleneck regularizing, no obvious overfitting on the held-out (rare-cube-position) test poses.
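The epoch arithmetic in the last bullet, spelled out (frame count from the v3 dataset):

```python
steps, batch_size, frames = 4000, 8, 4149  # v3: 8 episodes, 4,149 frames
samples = steps * batch_size
epochs = samples / frames
print(samples, round(epochs, 1))  # 32000 7.7
```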

16.1 — Modal preemption surprise

One run crashed at step 5848 of 8000 with a FileNotFoundError in _pin_memory_loop. The cause: Modal's preemptible compute had reclaimed the container. Modal helpfully restarted the function from scratch — but our train() had no resume handling, so it crashed immediately on FileExistsError: output dir already exists. The restart worked once we deleted the stale dir, but losing 25 minutes of compute is annoying. The fix on subsequent runs was to (a) lower the step count for fast iteration and (b) add a shutil.rmtree(output_dir) call guarded by "smoke" in name, so iterative debugging doesn't hit the same wall.
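A sketch of that guard (function name and exact condition are ours, hypothetical outside this repo): throwaway smoke runs may clobber their output dir, real runs must not, because an existing dir can hold checkpoints from a preempted-and-restarted container.

```python
import shutil
from pathlib import Path

def prepare_output_dir(output_dir: Path, run_name: str) -> Path:
    """Remove a stale output dir only for throwaway smoke runs."""
    if output_dir.exists():
        if "smoke" in run_name:
            shutil.rmtree(output_dir)  # iterative debugging: safe to start over
        else:
            # real run: an existing dir may hold checkpoints worth keeping
            raise FileExistsError(f"{output_dir} exists; refusing to delete a real run")
    return output_dir
```

The sturdier fix would be actual resume-from-checkpoint handling; this guard just keeps restarts from failing on the first mkdir.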

Chapter 17

ACT live

ACT smoke after training completed:

[pi0] loaded /weights/ft/act-so101-cleanup-20260430_102342/.../pretrained_model on cuda in 1.9s | image_keys=['observation.images.wrist', 'observation.images.base'] | chunk_size=50 | state_dim=6
[pi0.smoke] action_chunk_shape=[50, 6] latency_ms=14.88

A 1.9 s cold start (vs π0.5's 90 s) and 15 ms warm inference (vs π0.5's 280 ms). That's a 47× cold-start improvement and an 18× inference improvement, on a smaller GPU.

We ran the live client with --steps 100 --hz 10 --actions-per-chunk 20 --action-mode absolute --home. The arm executed:

[LIVE 97] pres[sho= -4.1 sho= -8.7 elb= +49.8 wri= +54.5 wri= -48.7 gri= +9.3] d[sho+0.35 sho-1.04 elb-3.00 wri+0.97 wri+0.53 gri+0.51]
[LIVE 98] pres[sho= -4.1 sho= -8.6 elb= +49.0 wri= +55.1 wri= -48.5 gri= +9.5] d[sho+0.72 sho+1.34 elb-3.00 wri-1.13 wri-0.20 gri+0.20]
[LIVE 99] pres[sho= -3.8 sho= -7.5 elb= +48.2 wri= +54.5 wri= -48.5 gri= +9.6] d[sho-0.17 sho-1.23 elb-3.00 wri+0.20 wri+0.60 gri-0.35]

Reading the trajectory:

  • shoulder_lift: -102.5° → -7.5° (95° of motion). The arm tilted up from the rest pose toward vertical.
  • elbow_flex: +94.1° → +48.2° (-46°). The forearm extended forward, well past the postural equilibrium where π0.5 had been pinned.
  • The gripper opened from 0.6° to 9.6° — the policy entered the "approach phase" preparing to grasp.
  • The arm exited because --steps 100 ran out, not because the policy gave up.

On the second run with --steps 400, the arm reached the cube and closed the gripper around it. It didn't yet correct for cube position changes during the approach (a true closed-loop visual servo behavior), but for 8 demos and zero pretraining, this was well past the bar of "does it move at all coherently."

The first time the arm reached for a cube it had never seen, in a position no demo had shown, because the wrist cam was pointed at the cube — that's the moment the architecture starts to feel real. ACT didn't learn the trajectory. It learned a policy. The difference is everything.
Chapter 18

Why ACT beat the SOTA

A fair question: π0.5 is the published state of the art for cross-embodiment manipulation. ACT is from 2023. ACT has 38× fewer parameters. Why did it win on this task?

18.1 — Data distribution match dominates

ACT was trained from scratch on our SO-101, our camera angles, our lighting, our cube. Its output distribution exactly equals the empirical distribution of teleoperator actions in those conditions. Whatever the SO-101 servos can execute under gravity and friction in those poses, that's what ACT outputs. Distribution shift is zero by construction.

π0.5 was pretrained on a heterogeneous mixture of robot embodiments — bimanual ALOHA, Franka arms, mobile manipulators. Its action distribution is the average of all of those, projected through the un-normalization stats of our dataset. The result is an action distribution that's nearly right but systematically biased: target positions tend to be too far from current state (because most robots in the pretraining mix have lighter payload-to-torque ratios than ours), and the safety clip has to soak up the difference at every tick.

18.2 — Capacity vs data ratio

With 4,149 frames (the v3 dataset), ACT's 80M params see ~52 frames per million params. That's a healthy specialization regime; the model can absorb the data signal without collapsing onto its initialization. π0.5's 3B params get ~1.4 frames per million params. With train_expert_only=True, the trainable subset is 300M, giving ~14 frames/M-params — still well below ACT's specialization density.
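The specialization-density numbers, as computed:

```python
frames = 4149  # v3 dataset
models_m_params = {"ACT": 80, "pi0.5 full": 3000, "pi0.5 expert-only": 300}
density = {name: frames / m for name, m in models_m_params.items()}
for name, d in density.items():
    print(f"{name}: {d:.1f} frames per million params")
# ACT: 51.9, pi0.5 full: 1.4, pi0.5 expert-only: 13.8
```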

When data ≪ capacity, the model relies on its priors (pretrained features, attention patterns, normalization stats) to fill the gap. If the priors match the target task, this is a feature; if they don't, it's a liability. ACT has no priors to fight.

18.3 — Architecture-task fit

The ACT paper's whole motivation was: "small teleop dataset, single task, single robot, learn from scratch." Every architectural choice — VAE for multi-modal action distributions, fully-trainable ResNet18 vision (no frozen backbone), L1 reconstruction (robust to demo noise), small transformer (4 layers, 1 decoder layer) — is tuned for that regime.

π0.5's whole motivation was: "scale a VLA across embodiments and tasks via cross-modal pretraining." Its language conditioning, its 3B-param backbone, its discretized-state-as-tokens, its flow-matching action head — these all serve cross-task generalization. We don't need cross-task generalization. We have one task. The hardware is ours and known.

18.4 — Inference cost

The closed-loop story matters too. ACT at ~15 ms warm gives us headroom to re-plan often; π0.5 at 280 ms forces longer action chunks (less feedback). For a small reactive task like "close the gripper when the cube is centered in the wrist cam," the inference rate matters more than the model capacity.
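The feedback-rate arithmetic behind that claim, as a sketch (assuming serial predict-then-execute with no pipelining): at a fixed control rate, inference latency sets a floor on how many buffered actions must execute before the next prediction is ready.

```python
import math

def min_actions_per_chunk(latency_ms: float, control_hz: float) -> int:
    """Ticks that must elapse, executing buffered actions, while one inference runs."""
    tick_ms = 1000.0 / control_hz
    return max(1, math.ceil(latency_ms / tick_ms))

print(min_actions_per_chunk(15, 10))    # 1: ACT can re-plan every tick at 10 Hz
print(min_actions_per_chunk(280, 10))   # 3: π0.5 forces >= 3 open-loop ticks per chunk
```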

💡 The general pattern

Foundation models earn their parameter count when the task domain is broad (many objects, many tasks, many lighting conditions, many embodiments). Small specialized models earn their cost when the task domain is narrow (one task, one robot, ~50 demos). Picking which regime you're in is the most important design decision in a robot-learning project — and it's almost always misjudged toward "use the biggest model" because that's the default advice in mainstream ML papers.

Chapter 19

What's next

The shortest backlog, ranked by leverage:

  1. More demos with deliberate cube-position perturbation. 30+ demos, vary cube position across a 20×20 cm grid, occasionally push the cube during the demo to teach the policy "if cube moved, adjust." This is the single biggest ROI move on the policy side.
  2. Temporal action ensembling. ACT supports temporal_ensemble_coeff — average overlapping action chunks for smoother corrections. We had it disabled. Should be a free win for closed-loop performance.
  3. Servo PID tuning. Raise profile_velocity and torque current on the elbow servo. The Feetech docs publish per-servo limits; we're well below them. Tuned servos = the model's actions actually execute = effective control rate goes up.
  4. Multi-pose home recording. Record 10 demos from each of 3 different start poses. Teaches the policy to operate from low-gravity-load configurations as well as the current high-load one.
  5. Domain-randomized fine-tune of π0.5. Now that we know π0.5 fails on distribution shift in this regime, we could try collecting much more data and actually fine-tuning the action expert end-to-end (not just frozen-backbone) so the VLA's advantages can manifest. This is a real ML research question; ACT is the pragmatic answer in the meantime.
  6. Forward kinematics + world-frame point clouds. Layer the SO-101 URDF on top of the policy outputs. Unproject the wrist cam's depth (we don't have a depth model wired up yet, but Depth Anything v2 is one Modal endpoint away) into world-frame points. This is the same architecture as Hugging Face's GAS v2; we'd inherit it almost for free.
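Item 2's temporal ensembling can be sketched as an exponentially weighted blend of every chunk's prediction for the current tick. This is a toy version, not LeRobot's implementation (which lives behind temporal_ensemble_coeff); which end of the list gets index 0 (oldest vs. newest prediction) is a convention choice, and here index 0 carries the largest weight.

```python
import numpy as np

def ensemble_action(predictions: list, coeff: float = 0.01) -> np.ndarray:
    """Blend all available chunk predictions for the current tick.

    predictions[i] is this tick's action as predicted by chunk i;
    weights w_i = exp(-coeff * i), normalized to sum to 1.
    """
    w = np.exp(-coeff * np.arange(len(predictions)))
    w = w / w.sum()
    return np.sum(w[:, None] * np.stack(predictions), axis=0)

# three overlapping chunks predicted slightly different elbow targets for this tick
preds = [np.array([93.9]), np.array([94.1]), np.array([94.4])]
print(ensemble_action(preds, coeff=0.1))  # a smoothed blend, weighted toward index 0
```

With coeff = 0 this degenerates to a plain average; larger coeff trusts index 0 more, which is the knob for how aggressively corrections get smoothed.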
Chapter 20

Full recipe

Everything compressed to a runbook. One repo, two arms, one cube.

bash
# ── 0. Prerequisites ────────────────────────────────────────
# - SO-101 follower + leader, both connected via USB
# - Two USB cameras (wrist + base)
# - Modal account, HuggingFace write token
# - Mac with Python 3.12

# ── 1. Calibration (once per arm) ───────────────────────────
lerobot-calibrate --robot.type=so101_follower \
    --robot.port=/dev/tty.usbmodem... --robot.id=so101_follower_main
lerobot-calibrate --teleop.type=so101_leader \
    --teleop.port=/dev/tty.usbmodem... --teleop.id=so101_leader_main

# ── 2. Dataset collection ──────────────────────────────────
hf auth login          # write-scoped HF token
./record_v3.sh         # see Chapter 7 — 30 episodes recommended

# ── 3. Train ACT on Modal ──────────────────────────────────
modal deploy experiments/pi0/app.py
modal run experiments/pi0/app.py::train_act \
    --dataset-repo-id ozyphus/so101-cleanup-v3 \
    --num-steps 4000 --batch-size 8 --lr 1e-5
# Output: /weights/ft/act-so101-cleanup-<timestamp>/checkpoints/004000/pretrained_model

# ── 4. Wire the checkpoint into the live endpoint ──────────
# Edit CHECKPOINT_CANDIDATES in experiments/pi0/app.py to prepend the new path:
#   ("/weights/ft/act-so101-cleanup-<ts>/checkpoints/004000/pretrained_model", "act"),
modal deploy experiments/pi0/app.py
modal run experiments/pi0/app.py::PI0Policy.smoke    # sanity-check load + predict

# ── 5. Live run ────────────────────────────────────────────
python client/so101_run.py \
    --wrist-cam-index 0 --base-cam-index 1 \
    --task "clean up the desk" \
    --steps 400 --hz 10 --actions-per-chunk 5 \
    --action-mode absolute --action-scale 1.0 \
    --max-joint-step-deg 3.0 \
    --home --home-step-deg 3.0 --home-timeout-s 10 \
    --live
💡 The one-line summary of everything above

Train the smallest model that can fit your task. Make the training distribution match the servo's tracking envelope. Run the policy on a remote GPU but the control loop on local USB. Teleop is the dataset; the model is the compression of the teleop into something that can be re-played with edits.

Source: experiments/pi0/app.py for both Modal apps, client/so101_run.py for the Mac control loop. Datasets on HuggingFace at ozyphus/so101-cleanup-v2 and ozyphus/so101-cleanup-v3.