The robot learning stack.
A teardown of the architectures, losses, and training recipes that move modern manipulators — from behavior cloning's first sin to flow-matched VLAs and the pixel-RL renaissance. Written for the new grad, the staff engineer, and the director who needs to know which abstraction is load-bearing.
01 · The problem
A robot policy is a function from sensors to motor commands. The interesting part is everything that word "function" hides.
Formally, a robot operates inside a partially-observed Markov decision process. The world has a true state $s_t$; the robot sees an observation $o_t$ that is a noisy, lossy projection of it. At each tick the robot picks an action $a_t$, the world transitions to $s_{t+1}$ via dynamics $p(s_{t+1} \mid s_t, a_t)$, and (sometimes) emits a reward $r_t$. A policy $\pi(a_t \mid o_t, h_t)$ — possibly stateful via history $h_t$ — maps observations to actions. The goal is a policy that achieves the task, defined either by a reward function the policy maximizes (RL) or by a dataset of demonstrations the policy imitates (BC), or both.
That's the textbook frame. The reasons robot learning is hard are not in the textbook:
- Compounding error. A small mistake at step $t$ moves the robot to a state $s_{t+1}$ slightly outside the training distribution, so the next action is worse, and so on. The error grows in the worst case as $O(T^2)$ in horizon $T$ for naive behavior cloning — a result that is the original sin of the field.
- Multimodality. Humans demonstrating the same task move differently. Averaging two valid trajectories produces an invalid one (think: two ways around an obstacle, averaged, hits the obstacle). A policy that regresses to the mean is broken.
- Non-stationary distributions. The robot's actions shape its future observations. This is the difference between supervised learning, where the data is fixed, and any learning paradigm where the policy is a participant in data generation.
- The reality gap. Simulators are fast, free, and wrong. Real robots are slow, expensive, and right. Bridging the two is the central engineering problem of the field.
- Tight latency budgets. A 200ms control loop is luxurious; many tasks need 50ms or less. Architectures compete on inference latency, not just success rate.
Stated bluntly: a 99%-accurate per-step naive BC policy on a 200-step task fails reliably, because the second-order growth term dominates. Action chunking, DAgger-style on-policy correction, and receding-horizon replanning are all attacks on the same root cause — they break the feedback loop between policy errors and state distribution shift.
02 · Spaces of action and observation
Choose the action space carelessly and no architecture will save you.
Action representations
The choice of action space is a structural commitment that propagates through the entire stack. Five common options:
| Representation | What the policy outputs | Use when |
|---|---|---|
| Joint positions | Target $q_t \in \mathbb{R}^n$ for a position controller running at 500–1000Hz underneath | Bimanual tabletop, precise contact (ALOHA / ACT) |
| Joint velocities | Target $\dot q_t$ | Compliant control, when integration drift is acceptable |
| EE pose, abs. | $T \in SE(3)$ for the end-effector, solved by IK | Cross-embodiment, when the body shouldn't matter |
| EE pose, rel. | $\Delta T$ relative to current pose | UMI, Diffusion Policy — can recover from drift |
| Torques | $\tau_t$ | Locomotion, rich contact, sim-trained policies |
The relative-EE space is quietly the most important shift of the last three years. A relative-pose policy that drifts can recover by issuing a corrective $\Delta T$; an absolute-pose policy that drifts is permanently confused. Relative actions also factor out the absolute pose of the demonstration, which means the same demo collected at any table works.
Rotation parameterization
Never regress to Euler angles or raw quaternions. Both have discontinuities or double-cover problems that confuse gradients. The accepted choice is the 6D continuous representation from Zhou et al. — predict the first two columns of the rotation matrix and Gram–Schmidt them. It has no discontinuities and trains cleanly.
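A minimal sketch of that recipe in PyTorch — `rotmat_from_6d` is an illustrative helper name, not a library function; the network just emits six numbers per rotation:

```python
import torch

def rotmat_from_6d(x6: torch.Tensor) -> torch.Tensor:
    """Map a 6D network output to a valid rotation matrix (Zhou et al.).

    x6: (..., 6) — the first two (unnormalized) columns of R.
    Returns: (..., 3, 3) rotation matrices built by Gram-Schmidt.
    """
    a1, a2 = x6[..., 0:3], x6[..., 3:6]
    b1 = torch.nn.functional.normalize(a1, dim=-1)            # first column
    a2_proj = (b1 * a2).sum(-1, keepdim=True) * b1            # remove b1 component
    b2 = torch.nn.functional.normalize(a2 - a2_proj, dim=-1)  # second column
    b3 = torch.cross(b1, b2, dim=-1)                          # third column, right-handed
    return torch.stack([b1, b2, b3], dim=-1)
```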
Observations
The modern observation tuple is some subset of:
- RGB images from one or more cameras — fixed scene cams, wrist cams, fisheye on a handheld stick. Wrist cams are extraordinarily helpful for fine manipulation; one wrist camera often beats two scene cameras.
- Proprioception — joint positions, velocities, gripper width, end-effector pose. Cheap and informative.
- Force / torque — six-axis F/T sensors at the wrist. Crucial for contact-rich tasks; usually ignored because data is hard to collect.
- Tactile — DIGIT, GelSight, pressure arrays. Promising; not yet load-bearing in flagship policies.
- Language — instruction strings, encoded by a frozen language encoder (T5, CLIP text, an LLM).
- Goal images — used by Octo and others to specify task without language.
03 · Three paradigms
Imitation, reinforcement, and the rapidly growing hybrid in between.
Every modern robot policy lives somewhere on a triangle whose vertices are imitation learning, reinforcement learning, and model-based / world-model methods. The dominant practical paradigm in 2026 is imitation, often pre-trained at scale, sometimes fine-tuned with RL. The pure-RL vertex remains alive in locomotion, dexterous in-hand manipulation, and any setting where a simulator is faithful enough to train in.
Imitation
Match demonstration actions, conditioned on observations. Cheap to start, expensive to scale (data is the bottleneck). The default for manipulation.
Reinforcement
Maximize expected return. Powerful when the simulator is good or the real-world reward is dense. Brittle when reward shaping is wrong, sample-hungry when it isn't.
Model-based
Learn a dynamics model, then plan or learn inside it. Dreamer, MuZero. Sample-efficient; brittle under distribution shift in the model.
The triangle has interesting interior points. Offline RL trains a value-aware policy on demonstration data without further interaction — useful when you have demos but no simulator. Residual RL trains an RL correction on top of a frozen BC base. HIL-SERL blends online RL with human interventions for sample-efficient real-world learning. The big new entrant — VLA fine-tuning with RL — uses a pre-trained vision-language-action backbone and a small amount of task RL to specialize.
04 · Why naive BC fails
A regression model that loses to physics.
The simplest behavior cloning recipe: collect demonstrations $\{(o_i, a_i)\}$, train a network $\pi_\theta(o)$ to minimize $\sum_i \| \pi_\theta(o_i) - a_i \|^2$, deploy. It works on toy problems and fails on real ones for three reasons that compound.
Compounding error
Train-time observations come from the expert. Test-time observations come from the policy. Even if the policy is $\epsilon$-accurate per step, after $T$ steps it has wandered $O(\epsilon T)$ from the demonstration distribution; the loss bound on the expected number of mistakes is $O(\epsilon T^2)$. This is Ross & Bagnell, 2010. It means a 99%-accurate per-step BC policy fails reliably on a 200-step task.
Multimodality
The expert distribution $p(a \mid o)$ is often multimodal. Mean-squared regression converges to the conditional mean, which can be a bad action: imagine demonstrations split between going around an obstacle on the left and on the right; the mean goes through the obstacle.
Two valid trajectories average to one invalid trajectory. The squared-error minimizer is the conditional mean, and the mean of two perfectly-good modes can sit exactly where neither one ever went. The remedy is not better optimization — it is an output distribution that can represent a bimodal answer.
Causal confusion
A policy with too much information can learn shortcuts that don't generalize. The classic example: a self-driving model with access to a brake-light indicator learns to brake when the brake light is on — which works perfectly on demonstration data and catastrophically when deployed, because it's predicting its own past action rather than reading the road.
The fixes
Each failure has a class of solutions:
- For compounding error: action chunking, receding-horizon control, history conditioning, DAgger-style on-policy correction, RL fine-tuning.
- For multimodality: expressive output distributions — Gaussian mixtures, categorical heads, energy-based models, diffusion, flow matching, VQ.
- For causal confusion: careful observation design, dropout on proprioception during training, and refusing to feed the policy the previous action.
05 · The multimodality problem
Five answers to "how do you parameterize a multimodal action distribution?", in roughly chronological order.
Gaussian mixture heads
Predict a mixture: $\pi(a \mid o) = \sum_{k=1}^{K} w_k(o) \, \mathcal{N}(a; \mu_k(o), \Sigma_k(o))$. Train with negative log-likelihood. Robust, simple, interpretable. Limited by the number of mixture components and a tendency toward mode collapse during training. Works well as a baseline; was the basis of BC-RNN in early Robomimic experiments.
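For concreteness, a minimal mixture-head NLL in PyTorch using `torch.distributions`; the head shapes (`w_logits`, `mu`, `log_std`) are assumed conventions, not a specific library's API:

```python
import torch
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

def gmm_nll(w_logits, mu, log_std, a):
    """Negative log-likelihood of actions under a K-component Gaussian mixture.

    w_logits: (B, K)    mixture logits from the policy head
    mu:       (B, K, D) component means
    log_std:  (B, K, D) component log-stds (diagonal covariance)
    a:        (B, D)    demonstrated actions
    """
    mix = Categorical(logits=w_logits)
    comp = Independent(Normal(mu, log_std.exp()), 1)  # diagonal Gaussians over D dims
    return -MixtureSameFamily(mix, comp).log_prob(a).mean()
```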
Discretized / categorical actions
Bin each action dimension into $B$ bins (typically 256), predict a categorical distribution per dimension, sample at test time. Expressive — a 256-bin categorical can represent any 1D distribution to that resolution — and the training objective is just cross-entropy. This is the recipe behind RT-1, RT-2, OpenVLA, and most VLAs. The loss reads cleanly:

$$\mathcal{L} = -\sum_{d=1}^{D} \log p_\theta\big(b_d \mid o\big), \qquad b_d = \text{the bin index of action dimension } d$$
Per-dimension factorization throws away cross-dimension correlation in a single timestep, which is mostly fine because action chunking gives you temporal structure to pick up the slack. Autoregressive variants restore correlation at the cost of slower inference.
Implicit / energy-based
Train an energy function $E_\theta(o, a)$ and define $\pi(a \mid o) \propto e^{-E_\theta(o, a)}$. Minimize an InfoNCE-style contrastive loss with negatives sampled from a proposal distribution. Implicit BC (Florence et al., 2021) showed this beats MSE on multimodal tasks. The downside is sampling: at inference you have to do gradient descent or rejection sampling on the energy, which is slow and brittle. Largely superseded by diffusion.
Diffusion
The current default for high-fidelity manipulation. Train a denoiser $\epsilon_\theta(a^{(k)}, k, o)$ to remove Gaussian noise from a noised action sequence; sample by iterative denoising. Naturally multimodal, expressive, stable to train. The next section is dedicated to it.
Flow matching
A close cousin of diffusion that learns a velocity field instead of a noise prediction. Cleaner objective, often fewer sampling steps, and the basis of Physical Intelligence's $\pi_0$. Covered in section 09.
Vector-quantized
Train a VQ-VAE over short action chunks, then learn an autoregressive transformer that predicts the discrete codes. VQ-BeT (Lee et al., 2024) is the canonical example. You get the multimodality benefits of categorical actions without per-dimension factorization, at the cost of a two-stage training pipeline.
06 · Action chunking
The single most underrated idea in modern robot learning.
The original BC recipe predicts one action per observation. Action chunking predicts a sequence of $H$ future actions per observation. The change is small in code and large in consequence.
Why it works
Three reasons, all important:
- It captures non-Markovian behavior. Real demonstrations have temporal structure — pre-grasps, follow-throughs — that a single-step policy must reproduce from scratch each tick. Chunks let the model commit to a plan.
- It reduces the frequency of compounding-error opportunities. If you re-observe and re-decide every $K$ steps instead of every step, the policy has $T/K$ chances to go wrong instead of $T$.
- It is a regularizer against pathological idle modes. Single-step policies trained on human demonstrations full of pauses learn to predict "stay still" because most actions are small; chunked policies see the whole motion and stop pausing.
Receding horizon control
The standard inference recipe: predict $H$ actions, execute the first $K \leq H$, replan. This is classical model-predictive control with a learned policy as the model. Diffusion Policy popularized $H = 16, K = 8$. ACT pushed harder with $H = 100, K = 1$ plus temporal ensembling (next).
Temporal ensembling
If you re-predict every step but each prediction is a chunk, you can average overlapping predictions for the same future timestep. ACT's recipe: at inference time $t$, average all predictions of action $a_t$ made at recent timesteps, weighted exponentially by recency. This drops control-signal jitter and is essentially free.

$$\hat a_t = \frac{\sum_{i \ge 0} e^{-\alpha i}\, \hat a_t^{(t-i)}}{\sum_{i \ge 0} e^{-\alpha i}}$$

where $\hat{a}_t^{(t-i)}$ is the prediction of action $a_t$ that was made when the observation was $o_{t-i}$. Larger $\alpha$ trusts recent predictions more.
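A minimal sketch of the ensembling buffer, following the exponential weighting above (the class name and buffer layout are illustrative, not ACT's actual code):

```python
import numpy as np

class TemporalEnsembler:
    """Average overlapping chunk predictions for the current timestep.

    Each call to `step` takes the newest H-step chunk prediction and returns
    one action: the exponentially weighted average of every stored chunk's
    prediction for *this* timestep. Weight exp(-alpha * age) trusts recent
    predictions more as alpha grows, matching the equation above.
    """
    def __init__(self, horizon: int, alpha: float = 0.01):
        self.h, self.alpha, self.chunks = horizon, alpha, []  # chunks: (age, pred)

    def step(self, new_chunk: np.ndarray) -> np.ndarray:      # new_chunk: (H, D)
        # Age stored chunks; drop any that no longer cover the current step.
        self.chunks = [(age + 1, c) for age, c in self.chunks if age + 1 < self.h]
        self.chunks.append((0, new_chunk))
        preds = np.stack([c[age] for age, c in self.chunks])  # each chunk's a_t
        w = np.exp(-self.alpha * np.array([age for age, _ in self.chunks]))
        return (w[:, None] * preds).sum(0) / w.sum()
```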
07 · ACT, in full
A conditional VAE wrapped around a transformer encoder–decoder, predicting one hundred actions at a time.
Action Chunking with Transformers (Zhao et al., 2023) is the policy that ships with ALOHA — a low-cost bimanual teleop platform whose data made fine bimanual tasks tractable for amateurs. The policy itself is small (≈80M parameters) and trains from scratch in hours on a single GPU. It is a useful object lesson in modern BC because every architectural choice answers a specific failure mode of section 04.
The CVAE wrapping
ACT is a conditional variational autoencoder over action chunks. Two encoders, one decoder.
- Style encoder $q_\phi(z \mid a_{1:H}, q)$ — a transformer that takes the ground-truth action chunk plus current joint positions and emits parameters of a Gaussian over a latent $z \in \mathbb{R}^{32}$. Used only at training time.
- Observation encoder — ResNet-18 image backbones run on each camera (typically four: top, front, two wrist cams). The features get flattened into tokens, joined by a proprioception token and the latent token, and fed to a transformer encoder that fuses them.
- Decoder — a transformer decoder cross-attends to the encoder output and emits an action chunk of length $H = 100$ in parallel (not autoregressive — the queries are fixed positional embeddings for the $H$ output slots).
The loss
$$\mathcal{L} = \big\| a_{1:H} - \hat a_{1:H} \big\|_1 + \beta\, \mathrm{KL}\!\big( q_\phi(z \mid a_{1:H}, q) \,\big\|\, \mathcal{N}(0, I) \big)$$

Two pieces deserve scrutiny:
- L1 not L2. L1 is robust to the small label noise that comes from imperfect teleoperation; it discourages over-smoothing of fine motions. The original paper ablated this — L2 gave noticeably worse fine-control success.
- $\beta = 10$ in the original code. A relatively strong KL pulls the posterior toward the prior so that test-time inference (with $z$ set to the prior mean of zero) produces sensible actions.
Inference
At test time the style encoder is discarded; $z$ is fixed to $\mathbf{0}$. The model is now a deterministic regressor: observations in, $H$ actions out. Receding horizon with $K = 1$ and temporal ensembling. The CVAE is therefore not used to sample diverse actions — it's used as a training-time regularizer that lets the model represent action multimodality during training without forcing the deployed model to be stochastic.
What ACT gets right
- Action chunking $H = 100$ at 50Hz means one prediction covers two seconds of motion.
- Wrist cameras and DETR-style positional queries for the action slots — fast, parallel, no autoregressive bottleneck.
- Temporal ensembling with $\alpha \approx 0.01$ — a one-line change with outsized impact.
What ACT doesn't do
- It does not generalize across embodiments — it is trained per-platform.
- It does not condition on language out of the box.
- The 80M parameter size is not enough to absorb a large multi-task dataset; ACT is a strong single-task or few-task policy, not a foundation model.
Hyperparameters that matter
| Knob | Default | If you change it |
|---|---|---|
| Chunk H | 100 | Smaller = more reactive, less smooth. Below 20, multimodality issues return. |
| Latent dim | 32 | Bigger latents over-fit; KL regularizer barely scales. |
| KL weight β | 10 | Higher β → latent ignored (posterior collapse); lower β → posterior drifts from the prior, degrading test-time inference at $z = 0$. |
| Ensemble α | 0.01 | Larger α trusts recent predictions more; smaller α smooths harder. |
| Image size | 480×640 | Wrist cams justify the compute; lower is fine for scene cams. |
08 · Diffusion Policy, in full
The action distribution as a denoising process. The architecture that ate manipulation.
Diffusion Policy (Chi et al., 2023) is a behavior-cloning architecture that models $p(a_{1:H} \mid o)$ as the reverse of a Gaussian diffusion process. It is the strongest single-task BC architecture in published benchmarks, and its variants underpin most of the post-2024 generalist policies.
The forward and reverse processes
Pick a sequence of noise levels $\{\beta_k\}_{k=1}^{K}$ and define $\alpha_k = 1 - \beta_k$, $\bar\alpha_k = \prod_{i=1}^k \alpha_i$. The forward process gradually corrupts an action chunk:

$$q\big(a^{(k)}_{1:H} \mid a^{(0)}_{1:H}\big) = \mathcal{N}\big(\sqrt{\bar\alpha_k}\, a^{(0)}_{1:H},\; (1-\bar\alpha_k)\, I\big)$$

The model learns to undo this. Specifically, it learns a noise predictor $\epsilon_\theta\big(a^{(k)}_{1:H}, k, o\big)$ trained with the simple DDPM objective:

$$\mathcal{L} = \mathbb{E}_{k,\; \epsilon \sim \mathcal{N}(0, I)} \Big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_k}\, a^{(0)}_{1:H} + \sqrt{1-\bar\alpha_k}\,\epsilon,\; k,\; o\big) \Big\|^2$$
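A sketch of one training step implementing the objective above — `eps_model` stands in for either backbone variant described below:

```python
import torch

def ddpm_loss(eps_model, a0, obs_emb, alpha_bar):
    """One DDPM training step for an action-chunk denoiser (sketch).

    eps_model: predicts noise from (noised chunk, noise level k, obs embedding)
    a0:        (B, H, D) clean action chunks
    obs_emb:   (B, E)    encoded observations (conditioning)
    alpha_bar: (K,)      cumulative products of (1 - beta_k)
    """
    B, K = a0.shape[0], alpha_bar.shape[0]
    k = torch.randint(0, K, (B,), device=a0.device)      # random noise level
    ab = alpha_bar[k].view(B, 1, 1)
    eps = torch.randn_like(a0)
    a_k = ab.sqrt() * a0 + (1 - ab).sqrt() * eps         # forward-process sample
    return torch.nn.functional.mse_loss(eps_model(a_k, k, obs_emb), eps)
```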
Three tiny details that matter more than the equation:
- $\epsilon$-prediction beats $a_0$-prediction in practice; the gradient signal is better-conditioned across noise levels.
- Cosine noise schedule from Improved-DDPM works better than linear for action spaces.
- The observation $o$ is a short history — typically two timesteps. Longer history hurts: the policy starts inferring its own past actions and gets causally confused.
Two backbone variants
The denoiser $\epsilon_\theta$ has two canonical implementations.
CNN-based: 1D temporal U-Net
A 1D U-Net over the time axis of the action chunk. Observations are encoded once, broadcast as a conditioning vector, and injected via FiLM layers (feature-wise affine modulation: $h \leftarrow \gamma(o) \odot h + \beta(o)$). The CNN exploits the locality of action sequences and is fast.
Transformer-based
Action tokens cross-attend to observation tokens. More expressive, slower, the right pick when the task has long-range structure or the observation is multimodal. Octo and most generalist policies use this variant.
Inference: DDIM
Naive DDPM sampling needs $K = 100$ denoising steps, which is too slow for a 10Hz control loop. The fix is DDIM sampling: a deterministic (or low-noise) reverse process with the same training objective that works at $K = 10$–$16$ steps with negligible quality loss.

$$a^{(k-1)} = \sqrt{\bar\alpha_{k-1}}\; \hat a_0 + \sqrt{1 - \bar\alpha_{k-1}}\; \epsilon_\theta\big(a^{(k)}, k, o\big)$$

where $\hat a_0$ is the implied clean prediction, $\hat a_0 = (a^{(k)} - \sqrt{1-\bar\alpha_k}\,\epsilon_\theta) / \sqrt{\bar\alpha_k}$. With $K = 16$ DDIM steps and a small CNN denoiser, inference is comfortably under 50ms on a single GPU.
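A minimal DDIM sampling loop following the equations above ($\eta = 0$, i.e. fully deterministic; the strided schedule is one common choice, not the only one):

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, obs_emb, alpha_bar, steps, shape):
    """Deterministic DDIM sampler for an action chunk (sketch).

    alpha_bar: (K,) training-time cumulative alphas; we walk a strided
    subset of noise levels from most to least noisy.
    """
    a = torch.randn(shape)                                   # start from pure noise
    ks = torch.linspace(len(alpha_bar) - 1, 0, steps).long()
    for i, k in enumerate(ks):
        ab_k = alpha_bar[k]
        eps = eps_model(a, k.expand(shape[0]), obs_emb)
        a0_hat = (a - (1 - ab_k).sqrt() * eps) / ab_k.sqrt() # implied clean chunk
        ab_prev = alpha_bar[ks[i + 1]] if i + 1 < steps else torch.tensor(1.0)
        a = ab_prev.sqrt() * a0_hat + (1 - ab_prev).sqrt() * eps  # eta = 0 update
    return a
```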
EMA — the unsung hero
The single most underrated detail of training a diffusion policy is the exponential moving average of the model weights. Maintain a shadow copy $\theta_{\text{EMA}} \leftarrow \tau \theta_{\text{EMA}} + (1-\tau)\theta$ with $\tau \approx 0.9999$. Use the EMA copy for inference. Without it, the model is twitchy and unstable; with it, the same training run produces reliable behavior. The reason is mostly empirical — diffusion losses are noisy across noise levels and the EMA averages over the noise.
Why diffusion beat the alternatives
- Multimodal by construction. Different draws of the initial Gaussian sample produce different action chunks; the model never has to commit to a single mode.
- Stable training. The DDPM loss is well-conditioned and converges reliably; no GAN instabilities, no mixture-component collapse, no contrastive sampling.
- Receding-horizon natural fit. Predicting a chunk is what diffusion does anyway; chunking is free.
- Composes with vision encoders. The conditioning interface is a flat vector or token sequence — drop in any encoder.
09 · Flow matching policies
A simpler, faster cousin of diffusion. The basis of $\pi_0$.
Flow matching trains a velocity field that transports a simple base distribution (Gaussian) to the data distribution along straight-ish paths in time. It is mathematically simpler than diffusion and empirically faster to sample. Lipman et al. (2023) introduced conditional flow matching; Physical Intelligence's $\pi_0$ (2024) is the most prominent robotics application.
The objective
Define a continuous time $t \in [0, 1]$. Pair a noise sample $a_0 \sim \mathcal{N}(0, I)$ with a data sample $a_1$ and define the linear interpolant $a_t = (1-t) a_0 + t \, a_1$. The "true" velocity along this path is $a_1 - a_0$. Train a network $v_\theta(a_t, t, o)$ to predict it:

$$\mathcal{L} = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; a_0 \sim \mathcal{N}(0, I),\; a_1 \sim \text{data}} \big\| v_\theta(a_t, t, o) - (a_1 - a_0) \big\|^2$$
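The objective is a few lines of PyTorch. This is a sketch of the conditional flow matching loss as stated, not $\pi_0$'s actual training code:

```python
import torch

def flow_matching_loss(v_model, a1, obs_emb):
    """Conditional flow matching loss with the linear interpolant (sketch).

    v_model: predicts velocity from (a_t, t, obs); a1: (B, H, D) data chunks.
    """
    a0 = torch.randn_like(a1)                      # noise endpoint
    t = torch.rand(a1.shape[0], device=a1.device)  # uniform time in [0, 1]
    a_t = (1 - t.view(-1, 1, 1)) * a0 + t.view(-1, 1, 1) * a1
    target = a1 - a0                               # true velocity of the path
    return torch.nn.functional.mse_loss(v_model(a_t, t, obs_emb), target)
```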
Inference
Sample $a_0 \sim \mathcal{N}(0, I)$, then integrate the ODE $\frac{d a_t}{dt} = v_\theta(a_t, t, o)$ from $t=0$ to $t=1$ using Euler with 5–10 steps. The ODE is not stochastic — there's no noise at inference, just integration of a learned vector field. This is part of why fewer steps suffice.
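And inference is a plain Euler loop — a sketch assuming the same `v_model` signature as above:

```python
import torch

@torch.no_grad()
def sample_actions(v_model, obs_emb, shape, steps: int = 10):
    """Integrate da/dt = v_theta(a, t, obs) from t=0 to t=1 with Euler steps."""
    a = torch.randn(shape)                      # a_0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        a = a + dt * v_model(a, t, obs_emb)     # deterministic update, no noise
    return a                                    # a_1: the action chunk
```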
Why it matters
Flow matching has three advantages over diffusion at the level of practical robot policies:
- Fewer sampling steps — 5–10 vs 16+ — at comparable quality. Lower latency.
- Cleaner conditioning — the model is a single network that takes time as a continuous input, no noise-schedule gymnastics.
- Simpler loss — no $\sqrt{\bar\alpha_k}$ algebra, no $\epsilon$-vs-$x_0$ choice, no schedule tuning beyond uniform $t$ sampling.
$\pi_0$ specifics
$\pi_0$ couples a pre-trained vision-language backbone (PaliGemma, ~3B parameters) with an "action expert" — a small transformer that generates actions via flow matching, conditioned on the VLM's hidden states via cross-attention. The action expert is on the order of 300M parameters. The full model produces 50Hz control on bimanual platforms with action chunks of ~50 steps integrated over 10 Euler steps. The follow-up $\pi_{0.5}$ added open-vocabulary transfer and broader cross-embodiment generalization.
The architectural commitment is worth naming explicitly: the VLM does perception and high-level reasoning; the small action expert does motor control. This split is becoming standard — see also Helix's "system 1 / system 2" framing.
10 · Tokenized actions
When the action head is just another decoder of a transformer that already exists.
Two pressures push toward representing actions as tokens. First, you want to share weights with a pretrained language or vision-language model; the simplest way is to put actions in the model's own vocabulary. Second, transformers handle categorical sequences brilliantly, and decades of NLP optimization apply for free.
Per-dimension binning (RT-1 family)
Each action dimension is binned into 256 buckets uniformly across its observed range. A 7-dof end-effector action becomes seven categorical predictions per timestep. Training is cross-entropy; inference is argmax (or sampled, for diversity).
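A minimal sketch of the encode/decode pair — `low`/`high` are the per-dimension ranges observed in the training data:

```python
import numpy as np

def discretize(actions, low, high, bins=256):
    """Map continuous actions to per-dimension bin indices (RT-1-style)."""
    norm = (actions - low) / (high - low)                     # -> [0, 1] per dim
    return np.clip((norm * bins).astype(np.int64), 0, bins - 1)

def undiscretize(idx, low, high, bins=256):
    """Decode bin indices back to bin-center continuous actions."""
    return low + (idx + 0.5) / bins * (high - low)
```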
RT-1
Image tokens come from EfficientNet + TokenLearner (a small attention module that distills $H \times W$ patch tokens into $\sim$8 informative tokens). Language is encoded by Universal Sentence Encoder. A FiLM layer fuses language with image features. A transformer decoder predicts action token sequences. 35M parameters; trained on 130k Google demonstrations across 700+ tasks.
RT-2
The shift was simple and important: take a pre-trained vision-language model (PaLI-X, PaLM-E), overload some of its existing vocab tokens to mean action bins, and co-finetune on web data + robot data. The model can now answer "what should the robot do" the same way it answers "what's in the image" — by emitting tokens. This is the genealogical root of every modern VLA.
OpenVLA
An open re-implementation of the RT-2 idea. Llama-2 7B + DINOv2 + SigLIP vision, action tokens in the vocabulary, trained on a 970k-trajectory subset of Open X-Embodiment. It works because the recipe was always more important than the secret sauce.
VQ-BeT
The other path is a learned action codebook. VQ-BeT trains a VQ-VAE over short action chunks (say, 5 steps), producing a codebook of ~16 codes per chunk. A transformer is then trained to predict code indices autoregressively conditioned on observations.
The advantages over per-dimension binning: the codebook captures cross-dimension structure (an entire pre-grasp posture is one code) and each "action" the policy commits to is a coherent multi-step motion, not seven independent bins.
FAST — frequency-space tokenization
The 2025 advance that made autoregressive VLAs competitive with diffusion. The observation behind FAST (Pertsch et al., 2025) is that per-dimension binning fails on high-frequency dexterous tasks because adjacent timesteps in an action chunk are highly correlated — binning them independently produces enormous, redundant token sequences that the autoregressive model can't predict accurately. The fix borrows from JPEG.
Four steps (the middle two are sketched in code after the list):
- Normalize the action chunk (subtract mean, divide by 99th-percentile range).
- DCT each action dimension along the time axis. The discrete cosine transform concentrates signal energy in the low-frequency coefficients — same reason it's the heart of JPEG.
- Quantize via scale-and-round, with a hyperparameter trading lossiness for compression. Most of the high-frequency coefficients round to zero and disappear.
- BPE the resulting integer sequence using byte-pair encoding to losslessly compress repeated patterns. The output token IDs slot into the least-used positions in the LLM vocabulary.
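A sketch of the DCT and scale-and-round steps using SciPy; the BPE stage is omitted, and `scale` is a stand-in for the paper's lossiness hyperparameter:

```python
import numpy as np
from scipy.fft import dct, idct

def fast_style_quantize(chunk, scale=50.0):
    """DCT + quantize steps of a FAST-style tokenizer (sketch; BPE omitted).

    chunk: (H, D) normalized action chunk. Returns (H, D) integer coefficients,
    most of which round to zero — the part BPE then compresses losslessly.
    """
    coeffs = dct(chunk, axis=0, norm="ortho")          # energy piles into low freqs
    return np.round(coeffs * scale).astype(np.int64)   # scale-and-round quantizer

def fast_style_decode(q, scale=50.0):
    """Inverse pipeline: de-quantize, then inverse DCT along the time axis."""
    return idct(q.astype(np.float64) / scale, axis=0, norm="ortho")
```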
The training loss is plain next-token cross-entropy. At inference, generate tokens autoregressively, then run the inverse pipeline (un-BPE → de-quantize → inverse DCT) to recover the continuous action chunk. The pipeline is invertible, the LLM machinery is unchanged, and the released FAST+ tokenizer is universal across embodiments — trained on 1M trajectories, it works zero-shot on new robots.
The empirical headline: π₀-FAST matches diffusion-π₀ on quality while training 5× faster. The result reframes the diffusion-vs-tokens debate. With FAST, autoregressive VLAs are no longer the speed-vs-fidelity compromise — they are competitive on both axes.
The trade-off, restated
Token-based heads are cheap at inference (one transformer forward, no iterative sampling) and they reuse pretrained weights for free. They are less expressive than diffusion or flow matching for fine continuous control — 256-bin discretization caps fidelity, and per-dimension factorization throws away within-step correlation. Empirically: tokens win when scale and multitask transfer dominate; diffusion / flow win when single-task fine motor control is the bottleneck. The 2026 generalist policy stack often combines them: a VLM backbone with tokens for routing, a flow-matching head for execution.
11 · UMI and the data shift
The most important paper of 2024 is not about a model. It is about a stick.
Universal Manipulation Interface (Chi et al., 2024) is a handheld parallel-jaw gripper with a GoPro camera, two side mirrors, and a fingertip-mounted IMU. A human picks it up and performs the task. Software extracts the 6-DoF gripper trajectory from visual SLAM and the gripper width from a fiducial; the resulting (image, EE-pose, gripper-width) trajectory is then used to train a Diffusion Policy. That policy is then transferred to a real robot with the same parallel-jaw end-effector.
UMI is not a new architecture. The policy on top is vanilla Diffusion Policy. The contribution is the data layer — and the contribution is large enough to reshape the field.
Why this works
- Embodiment is the gripper, not the arm. If the policy outputs relative EE poses and gripper width, the body that holds the gripper does not need to match between collection and deployment. A human's wrist is a perfectly good "robot arm" for data purposes.
- Mirrors give multi-view from one camera. The fisheye GoPro plus side mirrors yields three pseudo-views in a single frame. The policy gets multi-camera robustness from a single sensor.
- SLAM gives proprioception. No motion-capture rig, no instrumented environment. The trajectory is recovered from the camera's own motion.
- Latency-matched action representation. UMI shifts the predicted action sequence forward in time to compensate for robot actuation delay, so a policy trained on instantaneous-human-motion data still works on a robot with $\sim$200ms latency.
The policy stack
- Two-step observation history of (RGB, EE pose, gripper width).
- CLIP-pretrained ViT vision encoder; the EE-pose history is an MLP-encoded vector token.
- Transformer Diffusion Policy denoiser predicting 16 future relative-EE-pose + gripper-width steps.
- Receding horizon: execute first 8, replan.
The deeper lesson
For two decades the bottleneck on imitation learning was data — specifically, synchronized expert action data, which is expensive because it requires a robot. UMI shows that for parallel-jaw manipulation, much of that data can be collected without a robot at all, by humans acting through a handheld proxy. The implications cascade: cross-embodiment datasets, in-the-wild collection by non-experts, scaled-out pretraining corpora.
The same idea has been generalized: DexCap for dexterous hands, HumanPlus for whole-body humanoids, and a long tail of "make a thing a human can wear or hold to record actions" projects. The common thread is that the action space of the gripper or hand is shared between human and robot; everything else can vary.
12 · Vision–Language–Action models
When the policy is a frozen LLM with a different output head — and increasingly, with two heads running at different speeds.
A VLA is a single network that ingests images and natural-language instructions and emits robot actions. The bet behind every VLA is that the abstractions a model learns from internet-scale vision-language data — objects, affordances, spatial relations, intent — transfer to robotics, and that they transfer better than anything you could pretrain on robot data alone. By 2026 the bet has paid off, the architectures have converged, and the open question is no longer "do VLAs work" but "what fraction of the stack should be the VLM versus the action expert, and at what frequencies."
The lineage, in one table
| Model | Year | Backbone | Action head | Notable |
|---|---|---|---|---|
| RT-1 | 2022 | EfficientNet + USE + FiLM | Discrete tokens (256 bins) | First scaled VLA recipe; 35M params; 130k demos. |
| RT-2 | 2023 | PaLI-X / PaLM-E (12B–55B) | Tokens overloaded into LLM vocab | First true VLA; web + robot co-finetuning. |
| Octo | 2024 | Custom transformer (27M / 93M) | Diffusion (continuous) | Open. Goal-image or language; 800k demos. |
| OpenVLA | 2024 | Llama-2 7B + DINOv2 + SigLIP | Discrete tokens | Open RT-2 recipe; 970k demos. |
| RDT-1B | 2024 | DiT (1B) | Diffusion | Bimanual specialist; 1M+ episodes. |
| π₀ | 2024 | PaliGemma 3B + 300M expert | Flow matching | 50Hz bimanual; cross-embodiment training. |
| π₀-FAST | 2025 | Same backbone | Autoregressive on FAST tokens | 5× faster training; matches diffusion quality. |
| π₀.₅ | 2025 | PaliGemma + action expert | Flow matching | Open-world generalization; new kitchens/bedrooms. |
| π₀.₇ | 2026 | + MEM, RL Token | Flow + RL fine-tuning | Steerable; multi-scale memory; >10-min tasks. |
| GR00T N1 | 2025 | Eagle-2 VLM (1.34B) + DiT | Diffusion / flow matching | Humanoid; 2.2B; 63.9ms / 16-action chunk. |
| GR00T N1.5 | 2025 | 3B; frozen VLM in fine-tune | DiT | Layer-norm on adapter; promptable. |
| Helix | 2025 | 7B VLM at 7–9Hz | Visuomotor at 200Hz | 35-DOF upper body; runs on Jetson Orin; <100ms. |
| Helix 02 | 2026 | + System-0 motion prior | Hierarchical S0/S1/S2 | Whole-body; 61-step dishwasher demo. |
| Gemini Robotics 1.5 | 2025 | Gemini 2.5 + ER 1.5 | VLA + agentic orchestrator | Motion Transfer; "think before act" reasoning. |
| SmolVLA | 2025 | SmolVLM (450M) | Flow matching expert | Compact; matches 10× larger models on benchmarks. |
The two-system split, explicitly
The convergent architecture of 2026 has two unequal halves. A large vision-language model — the slow brain — observes the scene at 5–10Hz and emits either a latent plan, a chain-of-thought string, or a sequence of FAST tokens. A small action expert — the fast brain — runs at 50–200Hz, reads the latest observation plus the slow brain's output, and produces continuous joint or end-effector commands. The split is what makes language-conditioned humanoid control viable: a 7B forward pass per control tick is not feasible; a 7B forward pass per plan with a 100M expert per tick is.
The five families of bridges
Different VLAs disagree about what the slow brain sends to the fast brain:
- Hidden states. π₀ and GR00T pass the VLM's last-layer hidden states through cross-attention into the action expert. Highest bandwidth; tightest coupling; requires joint training.
- Discrete tokens. RT-2 / OpenVLA / π₀-FAST emit action tokens from the LLM's own vocabulary, decoded back into actions. Lowest latency for the VLM; throws away cross-dimension structure unless paired with FAST.
- Latent plan vectors. Helix-style designs emit a small "plan vector" updated at System-2 frequency that conditions System 1. Loose coupling; allows the two halves to be trained separately.
- Natural-language reasoning. Gemini Robotics 1.5 interleaves language reasoning steps with action chunks — "first I'll pick up the cup, then place it in the sink" — making behavior interpretable and improving long-horizon decomposition.
- Tool calls. Gemini Robotics-ER 1.5 acts as an orchestrator, calling a separate VLA (or grasping model, or web search) as a tool. The reasoning model never sees the actuators directly.
Motion Transfer and embodiment soup
A VLA trained on Open X-Embodiment sees seven different arms doing similar tasks with different action spaces. Naively, the model has to memorize a separate output head per embodiment. Motion Transfer (Gemini Robotics 1.5) and π₀'s zero-padding to the largest action vector are two answers to the same question: how do you make a single policy reuse motor knowledge across robots? The recipe that works is a shared semantic representation in the VLM, plus an action expert whose output is masked to the active embodiment's true degrees of freedom. ALOHA, Apollo humanoid, and Franka arm share weights through every layer except the final projection.
Embodied thinking
Gemini Robotics 1.5 added an explicit reasoning trace before action emission — the model writes natural language describing what it is about to do, then emits the action tokens. The trace is conditioned on by the action head, so the reasoning is causally upstream of motion. The empirical result: long-horizon tasks decompose more cleanly, mid-task corrections become tractable, and a human can read what the robot is thinking. The cost is latency. The benefit is that "pour the milk before the cereal" requires reasoning the model could not previously do at all.
Co-training on web data
RT-2 introduced and every successor confirmed: continue training on web vision-language data while fine-tuning on robot data. Otherwise the model loses its world knowledge — it can grab the green block, but ask it to "grab the dinosaur" and it doesn't know what a dinosaur looks like anymore. Mix ratios run 1:1 to 4:1 web:robot. π₀ + Knowledge Insulating (2025) takes this further: freeze most VLM weights through fine-tuning so internet knowledge is preserved structurally, not just statistically.
Synthetic data is the new Open X-Embodiment
GR00T N1's training mix is real-robot trajectories, human videos, and entire neural-generated trajectories from video diffusion models. The shift is significant: when image and video generation are themselves at foundation-model scale, the cheapest source of robot training data may be a generative model rather than a teleoperator. DexMimicGen, MimicGen, and similar tools synthesize trajectories from a small seed of real demos. The 2026 question is no longer "can synthetic data train policies" — it can — but "what is the right ratio of synthetic to real?"
Where VLAs are weak
- Raw latency. A 7B forward pass dominates the control budget. Two-system splits, FAST tokenization, INT4 quantization, and speculative decoding are the four levers.
- Fine motor control. A generalist policy still underperforms a specialist on its specialty by 5–15 points. RL fine-tuning closes most of the gap.
- Out-of-distribution physics. A VLA that never saw deformable cloth does not learn cloth physics from a few demos.
- Spatial precision. Pointing accuracy and 3D pose estimation are still fragile — see Gemini Robotics-ER 1.5's pointing benchmarks for the current state of the art.
Where VLAs are strong
- Language-conditioned task selection. "Pick up the red mug" works because the model knows what red mugs are.
- Cross-embodiment transfer. Trained on Open X-Embodiment, a VLA can be fine-tuned to a new arm with hundreds rather than thousands of demos.
- Instruction-following on novel objects. The pretraining corpus is the moat.
12·5 · 3D representations and equivariance
When the input is a point cloud, the symmetries of physics start paying for themselves.
2D image policies are the dominant paradigm for one reason: 2D images are easy to collect, easy to encode, and have ImageNet-scale priors available. They are also geometrically lossy. A policy trained on RGB images alone has no built-in notion of where things are in 3D space; it has to learn that from data, every time. A small but rapidly growing corner of the field argues that the right move is to give the policy 3D structure directly — and, while you're there, to bake the physical symmetries of 3D space into the architecture.
Why 3D helps
Three concrete wins:
- Spatial generalization for free. A policy that sees raw RGB has to learn that an object 30cm to the left looks similar to one straight ahead. A policy that operates on 3D points has the translation built into the input geometry — moving the object is a literal addition to point coordinates.
- Camera invariance. 3D point clouds aggregated from RGB-D or stereo cameras are indifferent to camera placement. Move the camera; the points don't move (much).
- Sample efficiency. The same task with the same object at different positions becomes one effective demonstration, not many. Empirically, 3D Diffusion Policy (Ze et al., 2024) needs ~10× fewer demos than 2D Diffusion Policy on contact-rich tasks.
The 3D policy zoo
| Model | Input | Architecture | Notable |
|---|---|---|---|
| 3D Diffusion Policy | Sparse point cloud (~512 pts) | 1D embedding + diffusion | Cheap; strong on data-scarce tasks. |
| 3D Diffuser Actor | Multi-view RGB-D → 3D scene tokens | Relative-position 3D attention | Translation equivariant; SOTA on RLBench. |
| EquiBot | Point cloud | Sim(3)-equivariant network | Scale-equivariant; data efficient. |
| Spherical Diffusion Policy | Point cloud | SE(3)-equivariant in spherical Fourier space | Full 3D rotational equivariance. |
| Canonical Policy (2025) | Point cloud | Canonicalize → diffusion / flow | Pre-rotates input to canonical frame. |
| ISP — Image-to-Sphere | Single eye-in-hand RGB | SO(3)-equivariant via spherical projection | Equivariance from a single 2D camera. |
The symmetry argument
If you rotate the entire scene by some $R \in SO(3)$, the correct robot action rotates by the same $R$. A policy that doesn't know this has to learn it from data — separately for every angle. A policy that has it baked in is, by construction, correct for every angle the moment it works for one. This is the same argument that made convolutional networks beat MLPs on images: a network that respects translation symmetry sees the same image once, regardless of where the object is. 3D policies extend the argument from $\mathbb{R}^2$ translations to $SE(3)$ rigid motions.
The downside is engineering. Equivariant networks are harder to write, harder to debug, and harder to compose with foundation-model priors. Spherical-Fourier and steerable-CNN libraries exist but are far less mature than PyTorch's standard transformer. Most of the field is still betting on data + 2D + flexible architectures over symmetry-baked 3D — but the 3D camp's sample efficiency numbers keep getting harder to ignore.
Hybrid strategies
Three pragmatic compromises that are starting to converge:
- Lift, don't replace. Keep the ViT backbone. Use it to extract per-pixel features, then lift those features into 3D via the camera intrinsics + depth. The downstream policy operates on 3D feature points. You get 3D structure without losing the 2D pretraining.
- Canonicalize the input. Before feeding a point cloud into a policy, rotate it to a canonical orientation determined by, e.g., principal axes or the object's bounding box. The policy itself is not equivariant; the preprocessor handles symmetry.
- Use 3D only at the contact phase. Run a 2D VLA for high-level reasoning and reaching, switch to a 3D contact-aware policy for the final approach. The slow-fast split, in 3D form.
13 · Vision encoders
The eyes of the robot. Where most policies still leave performance on the table.
The vision encoder converts pixels into tokens or feature vectors that the policy consumes. The choice of encoder is a major lever — both for sample efficiency (a good prior cuts demonstrations needed by 3–10×) and for generalization (the encoder is what determines whether "red mug" and "blue mug" share a representation).
Three eras
- ImageNet-pretrained ResNet (until ~2022). Standard ResNet-18 or ResNet-50, frozen or fine-tuned. Cheap, good enough, the backbone of ACT and most pre-VLA work.
- Self-supervised on robot or egocentric video (2022–2023). R3M, VC-1, MVP. Trained on Ego4D and similar; the priors are closer to manipulation distributions than ImageNet's.
- Frontier vision foundation models (2023–present). DINOv2, SigLIP, CLIP. Either used directly or distilled.
The encoders worth knowing
| Encoder | Training | Why it's used |
|---|---|---|
| ResNet-18 | ImageNet supervised | Cheap, fast, enough for single-task BC. The ACT default. |
| CLIP (ViT-B/16) | Image–text contrastive on 400M pairs | Language-aligned features. Standard for VLAs and UMI. |
| DINOv2 (ViT-L/14) | Self-supervised distillation, 142M images | Best raw visual features. Used in OpenVLA alongside SigLIP. |
| SigLIP | Sigmoid contrastive image–text | Stronger language alignment than CLIP at scale. |
| R3M | Time-contrastive + language alignment on Ego4D | Manipulation-aligned. Strong with little data. |
| VC-1 | MAE on Ego4D + ImageNet | Majumdar et al., Meta. Robust low-shot performance. |
| Theia | Distillation of CLIP+DINOv2+ViT+SAM | Multi-teacher distillation; competitive at smaller size. |
Frozen or fine-tuned?
The dominant practice in 2026 is frozen encoder + small adapter for foundation models, and full fine-tune for ResNet-scale encoders. The reasons:
- Fine-tuning a 300M+ parameter ViT on a few thousand robot demonstrations destroys the pretraining priors. The robot data is too narrow to support the fine-tune.
- A frozen encoder + a learnable linear probe or small transformer adapter preserves the priors and trains in hours.
- For ResNet-18-scale encoders, the prior is weak enough that fine-tuning helps — and the data is abundant enough to support it.
Multi-camera fusion
Two strategies. Late fusion: encode each camera independently, concatenate or attention-fuse the resulting tokens before the policy. This is the standard. Early fusion: stitch images side-by-side or stack channels. Cheap but throws away camera identity.
Cross-attention works better than concatenation when one camera dominates (e.g., the wrist cam during contact). The policy can route attention to the camera that matters at each timestep.
Augmentation
Three augmentations earn their seat:
- Random shifts ($\pm$4 pixels) — simulates camera calibration error. Drops sim-to-real gap.
- Color jitter — mild brightness, contrast, saturation. Critical for any policy that will see different lighting at deploy time.
- Random crops at test time — DrQ-v2's trick: sample multiple crops at inference, average the Q-values. Doesn't apply directly to BC but the sibling idea (test-time ensembling) does.
Augmentations that don't earn their seat: heavy cutout, MixUp, anything that changes the geometry between the wrist camera and the gripper. The policy is not invariant to these — it depends on them.
14 · PPO
The locomotion workhorse. The reason simulator-trained quadrupeds walk.
Proximal Policy Optimization (Schulman et al., 2017) is an on-policy actor-critic algorithm that became the dominant RL method for robotics-in-simulation. The reasons are mostly practical: it's stable, it parallelizes embarrassingly across thousands of simulator instances, and the gradient signal is well-behaved.
The objective
PPO maximizes a clipped surrogate of the policy gradient, which prevents the new policy from drifting too far from the old in a single update.

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[ \min\big( r_t(\theta)\, \hat A_t,\; \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat A_t \big) \Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
Three pieces:
- Importance ratio $r_t$ — corrects for the fact that the data was collected by the old policy.
- Advantage $\hat A_t$ — typically generalized advantage estimation, a $\lambda$-weighted blend of $n$-step TD errors that trades bias for variance. $\lambda \approx 0.95$ is standard.
- Clipping — when $r_t$ exceeds $1 \pm \epsilon$ (typically $\epsilon = 0.2$), the surrogate flattens, removing incentive to push further.
The asymmetry of the clipping is the trick: when an action was good (positive advantage) and the policy is already moving toward it, the gradient stops at $1+\epsilon$ — no overconfident leaps. When an action was bad (negative advantage), the gradient is not clipped on the corrective side — the policy is free to back away as far as it wants. The result is a trust region that punishes overshoot but never blocks recovery.
The full PPO loss adds a value function regression term and an entropy bonus:

$$L(\theta) = \mathbb{E}_t\Big[ L^{\text{CLIP}}_t(\theta) - c_1 \big( V_\theta(s_t) - V^{\text{targ}}_t \big)^2 + c_2\, \mathcal{H}\big[\pi_\theta\big](s_t) \Big]$$
$c_1 \approx 0.5$, $c_2 \approx 0.01$. The entropy bonus prevents premature collapse of the policy.
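A compact sketch of GAE plus the full clipped loss, written to minimize (so the signs flip relative to the maximization form above); `values` carries one extra bootstrap entry:

```python
import torch

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over a rollout (values has length T+1)."""
    T = len(rewards)
    adv, last = torch.zeros(T), 0.0
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]                   # zero out the bootstrap at episode ends
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        last = delta + gamma * lam * mask * last
        adv[t] = last
    return adv

def ppo_loss(logp_new, logp_old, adv, v_pred, v_targ, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Clipped surrogate + value regression - entropy bonus (to minimize)."""
    ratio = (logp_new - logp_old).exp()          # importance ratio r_t
    unclipped = ratio * adv
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv
    policy_loss = -torch.min(unclipped, clipped).mean()
    value_loss = (v_pred - v_targ).pow(2).mean()
    return policy_loss + c1 * value_loss - c2 * entropy.mean()
```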
Why PPO and not policy gradient
Vanilla policy gradient (REINFORCE) is high-variance. TRPO (the predecessor) is correct but uses second-order optimization that's painful to implement at scale. PPO replaces the trust region with a clipped first-order objective that recovers most of the stability benefit at a fraction of the engineering cost.
Where PPO shines
- Locomotion in simulation. Isaac Gym, MuJoCo MJX, Brax. With 4096+ parallel environments, billions of timesteps cost an afternoon.
- Sim-to-real with domain randomization. The on-policy appetite for fresh data means you need a fast simulator; GPU-parallel sims have made that feasible.
- Discrete action spaces via categorical policies — game-playing, Atari.
Where PPO loses
- Real-world robots. On-policy means every gradient step throws away old data. Real-world data is too expensive to throw away. SAC or HIL-SERL are better here.
- Sparse rewards. PPO needs reward signal; without shaping, it doesn't explore well.
15 · SAC and the off-policy family
Maximum-entropy reinforcement learning. The default for sample-efficient RL.
Soft Actor-Critic (Haarnoja et al., 2018) is an off-policy actor-critic that adds an entropy bonus to the reward, encouraging exploration. It is the default RL algorithm for continuous control problems where you have a fixed amount of interaction budget — which is to say, almost all real-world RL.
The maximum-entropy objective
Standard RL maximizes $\mathbb{E}\big[\sum_t \gamma^t r_t\big]$. SAC instead maximizes:

$$J(\pi) = \mathbb{E}\Big[ \sum_t \gamma^t \big( r_t + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big) \Big]$$
The entropy bonus $\alpha \mathcal{H}$ encourages the policy to be as random as possible while still solving the task. At convergence, the policy is the Boltzmann distribution over actions weighted by their soft Q-value.
The losses
SAC trains three networks: a stochastic policy $\pi_\theta(a \mid s)$ (typically a squashed Gaussian), and two Q-networks $Q_{\phi_1}, Q_{\phi_2}$ (twin Q for stability), with target networks $\bar Q$ for bootstrapping. The critics regress onto the soft Bellman target:

$$y = r + \gamma\, (1 - d) \Big( \min_{j \in \{1,2\}} \bar Q_{\phi_j}(s', a') - \alpha \log \pi_\theta(a' \mid s') \Big), \qquad a' \sim \pi_\theta(\cdot \mid s')$$

The actor loss is the negative expected Q plus the entropy penalty:

$$\mathcal{L}_{\text{actor}} = \mathbb{E}_{s \sim \mathcal{D},\, \epsilon \sim \mathcal{N}}\big[ \alpha \log \pi_\theta(a \mid s) - \min_j Q_{\phi_j}(s, a) \big], \qquad a = f_\theta(s, \epsilon)$$

The reparameterization trick (sample $\epsilon \sim \mathcal{N}$, apply $a = f_\theta(s, \epsilon)$) makes this differentiable through sampling.
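A sketch of both losses for one update, assuming `pi(s)` returns a reparameterized action and its log-probability:

```python
import torch

def sac_losses(pi, q1, q2, q1_targ, q2_targ, batch, alpha, gamma=0.99):
    """Critic and actor losses for one SAC update (sketch).

    pi(s) must return (action, log_prob) via the reparameterization trick.
    batch: dict of tensors s, a, r, s2, done.
    """
    s, a, r, s2, done = (batch[k] for k in ("s", "a", "r", "s2", "done"))
    with torch.no_grad():                                 # soft Bellman target
        a2, logp2 = pi(s2)
        q_targ = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        y = r + gamma * (1 - done) * (q_targ - alpha * logp2)
    critic_loss = (q1(s, a) - y).pow(2).mean() + (q2(s, a) - y).pow(2).mean()

    a_new, logp = pi(s)                                   # reparameterized sample
    q_new = torch.min(q1(s, a_new), q2(s, a_new))
    actor_loss = (alpha * logp - q_new).mean()            # maximize Q + entropy
    return critic_loss, actor_loss
```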
Auto-tuned $\alpha$
The entropy coefficient $\alpha$ is tuned automatically to hit a target entropy $\bar{\mathcal{H}}$:

$$\mathcal{L}(\alpha) = \mathbb{E}_{a \sim \pi_\theta}\big[ -\alpha \big( \log \pi_\theta(a \mid s) + \bar{\mathcal{H}} \big) \big]$$

This single change is the difference between "SAC works out of the box" and "SAC requires careful per-task tuning."
Twin Q and target networks
Two tricks borrowed from TD3:
- Twin Q-networks with the $\min$ operator combat overestimation bias in the Q-learning target.
- Target networks updated as a slow EMA of the online networks ($\tau \approx 0.005$) stabilize the bootstrap target.
The off-policy family
SAC sits alongside its cousins:
- DDPG — deterministic policy gradient. Predecessor; less stable than SAC because the policy isn't stochastic.
- TD3 — DDPG with twin Q, delayed actor updates, target policy smoothing. Strong baseline.
- REDQ — large Q-ensemble (10 critics), high update-to-data ratio (UTD = 20). Vastly more sample-efficient than SAC at the cost of compute.
- DroQ — REDQ with dropout instead of an ensemble. Comparable performance with one critic.
16 · Sim-to-real
The reality gap is the central engineering problem of pure-RL robotics.
Training in simulation is fast, free, and produces policies that fail spectacularly when deployed on a real robot — unless you do specific things to close the gap. The interventions cluster into four categories: domain randomization, system identification, real-to-sim, and online adaptation.
Domain randomization
The dominant technique. At each simulator reset, sample physical and visual parameters from a wide distribution: friction, mass, motor gains, latency, observation noise, lighting, textures, camera pose, gravity. The policy is forced to learn a control law robust across the distribution; the real world is treated as one more sample from it.
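In code, static domain randomization is just a dictionary of ranges sampled at reset — the ranges below are illustrative placeholders, not tuned values:

```python
import numpy as np

# Hypothetical ranges — the real ones are tuned per robot and simulator.
RANDOMIZATION = {
    "friction":   (0.4, 1.2),    # ground / contact friction coefficient
    "mass_scale": (0.8, 1.2),    # multiplier on link masses
    "kp_scale":   (0.7, 1.3),    # PD gain multiplier
    "latency_ms": (0.0, 40.0),   # actuation delay
    "obs_noise":  (0.0, 0.02),   # proprioception noise std
}

def sample_domain(rng: np.random.Generator) -> dict:
    """Draw one set of physics parameters at episode reset (static scheme)."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in RANDOMIZATION.items()}
```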
Three regimes:
- Static randomization. Fixed ranges, sampled once per episode. Simple, works for many tasks.
- Adversarial randomization. Sample parameters that the policy currently fails on. Faster to converge, requires more infrastructure.
- Automatic Domain Randomization (ADR). Start narrow, widen the range when success rate exceeds a threshold. OpenAI's Rubik's cube paper. Gives a curriculum for free.
The randomization that matters
Not all parameters are equal. Empirical priorities, in rough order:
- Motor / actuator dynamics. Latency, PD gains, torque limits, deadbands. The biggest sim-to-real failure mode for legged robots is incorrect actuator modeling.
- Mass and inertia. Especially for objects being manipulated.
- Friction. Both ground and contact.
- Observation noise and latency. A policy trained on perfect proprioception fails on a real robot with 5ms IMU latency and quantization.
- Visuals. For pixel-based policies, lighting and texture randomization are mandatory.
System identification
Estimate physical parameters from a small amount of real data and condition the policy on the estimate. RMA (Rapid Motor Adaptation) trains a privileged policy on ground-truth dynamics in sim, then trains an "adaptation module" that infers dynamics from a window of recent proprioception. The adaptation module replaces the privileged input at deployment. This is now the standard recipe for legged locomotion; it's why robust quadrupeds exist.
Real-to-sim and digital twins
Build a simulator that matches your specific real environment — calibrated geometry, measured friction, characterized actuators. Useful when one specific deployment matters more than generality. Less helpful when the goal is broad generalization, because the calibrated sim is a single point rather than a distribution.
Online adaptation
Fine-tune in the real world after sim training. Sometimes via RL (slow, dangerous), sometimes via fast supervised correction signals (preferred). The unifying lesson is that sim training gets you to 80%, and the last 20% is real data.
17 · Pixel-based RL
Learning from images is harder than learning from state, in ways that are now well-understood.
State-based RL — where the agent observes a low-dimensional state vector — has been a solved problem in many simulated benchmarks since 2018. Pixel-based RL — observing only RGB frames — was a much harder problem until DrQ (Kostrikov et al., 2020) demonstrated that aggressive data augmentation closes most of the gap.
The DrQ family
DrQ
Augment image observations with random shifts ($\pm 4$ pixels), then run SAC. Average $K$ Q-values per state-action pair, computed on $K$ different augmentations. Shockingly, this single change closed the gap to state-based RL.
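A minimal version of the shift augmentation (pad-then-crop; real implementations vectorize the per-image offsets with `grid_sample`, but the effect is the same):

```python
import torch
import torch.nn.functional as F

def random_shift(imgs: torch.Tensor, pad: int = 4) -> torch.Tensor:
    """DrQ-style augmentation: replicate-pad by `pad`, then random crop back.

    imgs: (B, C, H, W) in [0, 1]. Each image gets an independent +/-pad shift.
    """
    B, C, H, W = imgs.shape
    padded = F.pad(imgs, (pad,) * 4, mode="replicate")
    out = torch.empty_like(imgs)
    for b in range(B):  # per-image random offset
        x = torch.randint(0, 2 * pad + 1, (1,)).item()
        y = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[b] = padded[b, :, y:y + H, x:x + W]
    return out
```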
DrQ-v2
Replaces SAC with DDPG (deterministic policy + exploration noise schedule), drops the ensembling. Faster, simpler, better on DeepMind Control. The standard pixel-RL baseline since 2021.
DrM
Adds a dormant-neuron reset and a layer-norm tweak. Marginal gains but the right diagnostic frame for why pixel-RL is unstable: large fractions of the network become inactive during training and stop contributing.
The augmentation insight
Augmentations work in pixel-RL for the same reason they work in supervised vision: they enforce a useful invariance and act as a regularizer. But the deeper reason is that RL targets are noisy; without augmentation, the network overfits to whatever artifacts the noise produces. Random shifts force the encoder to be translation-equivariant and starve the network of the specific pixel-coordinate features it would otherwise memorize.
Why pixel-RL is hard, structurally
- Sample efficiency. The network has to learn perception, value estimation, and control jointly from a single scalar reward. Any of these tasks alone is hard.
- Representation collapse. The encoder can converge to features that are temporally smooth but task-irrelevant.
- Exploration. Random actions in a high-dimensional control space rarely produce useful images; you need either a curiosity bonus, a strong prior, or both.
The pretraining shortcut
Replace the encoder with a frozen visual foundation model (CLIP, DINOv2, R3M). The RL problem becomes "learn a policy on a 768-dim feature vector," which is much closer to state-based RL. This is the dominant pattern in 2026 — pure pixel-RL from scratch is rare; pixel-RL on top of a frozen foundation model is common.
18 · World models
Imagine the future, plan inside the imagination, hope the imagination is right.
A world model is a learned dynamics model — a network that predicts $p(s_{t+1} \mid s_t, a_t)$ — plus the apparatus to use it. Pure RL is model-free: the policy learns from real (or simulated) interactions only. Model-based RL learns a dynamics model and uses it to either plan (MuZero, MPC) or generate imagined rollouts to train a model-free policy on (Dreamer).
The Dreamer family
Hafner et al., 2019–2024. Three iterations: Dreamer, DreamerV2, DreamerV3. The architecture has stabilized; DreamerV3 in particular is notable for solving a wide range of tasks with the same hyperparameters out of the box.
The recurrent state-space model (RSSM)
The world model factorizes the state into a deterministic component $h_t$ (a GRU's hidden state) and a stochastic component $z_t$ (a categorical or Gaussian latent):

$$h_t = f_\phi(h_{t-1}, z_{t-1}, a_{t-1}), \qquad \text{prior: } \hat z_t \sim p_\phi(\cdot \mid h_t), \qquad \text{posterior: } z_t \sim q_\phi(\cdot \mid h_t, o_t)$$

Train it with an ELBO that combines image reconstruction, reward prediction, and a KL between posterior and prior — the KL is what forces the prior to stay predictive without seeing the image.
Imagination training
With the model trained, drop into latent space. Sample a starting $h_0, z_0$, roll out 15–20 steps using the prior dynamics and the actor. Train the actor to maximize predicted return; train the critic to estimate value. No real environment interaction during the imagination phase.
DreamerV3 details that matter
- Symlog transformations on rewards and values: $\text{symlog}(x) = \text{sign}(x)\log(|x|+1)$. Compresses the dynamic range so the same loss works across tasks with rewards in [-1,1] or [0, 1000].
- Two-hot encoding of returns: predict a categorical over a discretized return range and decode with a soft target. Stabilizes value learning. (Both transforms are sketched in code after this list.)
- Categorical latents with straight-through gradients: 32-dim categorical with 32 classes per dim, instead of Gaussian latents. Empirically more stable.
- KL balancing: separate scaling for the "make posterior close to prior" and "make prior close to posterior" terms of the KL. Prevents posterior collapse.
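A sketch of the first two tricks; `bin_edges` is a monotone grid of symlog-space bin centers (an assumed layout, not DreamerV3's exact code):

```python
import torch

def symlog(x):
    """DreamerV3 range compressor: sign(x) * log(|x| + 1)."""
    return torch.sign(x) * torch.log1p(torch.abs(x))

def two_hot(y, bin_edges):
    """Encode scalar targets as weight split across the two nearest bins.

    y: (B,) symlog-transformed returns; bin_edges: (N,) sorted bin centers.
    """
    y = y.clamp(bin_edges[0].item(), bin_edges[-1].item())
    idx = torch.searchsorted(bin_edges, y).clamp(1, len(bin_edges) - 1)
    lo, hi = bin_edges[idx - 1], bin_edges[idx]
    w_hi = (y - lo) / (hi - lo)                    # linear interpolation weight
    onehot = torch.zeros(y.shape[0], len(bin_edges))
    onehot.scatter_(1, (idx - 1).unsqueeze(1), (1 - w_hi).unsqueeze(1))
    onehot.scatter_(1, idx.unsqueeze(1), w_hi.unsqueeze(1))
    return onehot
```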
DayDreamer
Wu et al., 2022. Dreamer applied to four real robots. The headline result was not the algorithm — it was the framing: an A1 quadruped learned to walk in 1 hour from scratch, on real hardware, with no simulator. Dreamer's sample efficiency made real-world RL plausible.
Where world models help
- Sample-efficient real-world RL when the dynamics model is easier to learn than the policy.
- Transfer: a world model trained on one task can be reused for a related task.
- Long-horizon credit assignment: imagined rollouts can be 50+ steps without reset costs.
Where world models struggle
- Contact-rich manipulation, where prediction errors compound fast and the model can't track sliding contacts.
- Open-ended environments where reconstruction loss spends capacity on irrelevant background pixels.
19Offline RL
Reinforcement learning when you cannot interact. The bridge from imitation back to value-aware policies.
Offline RL learns a policy from a fixed dataset of transitions $\{(s, a, r, s')\}$, with no further interaction. It is what you do when you have demonstrations and rewards but no robot to run on. The central failure mode is distributional shift in the value target: the Bellman backup queries Q at out-of-distribution actions, and the network's extrapolation there is unreliable.
The three responses
CQL — Conservative Q-Learning
Penalize Q-values for OOD actions. Add a regularizer to the standard Bellman loss:
$\mathcal{L}_{\text{CQL}} = \mathcal{L}_{\text{Bellman}} + \alpha \left( \mathbb{E}_{s \sim \mathcal{D}}\!\left[\log \sum_a e^{Q(s,a)}\right] - \mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[Q(s,a)\right] \right)$
The first term of the regularizer pushes Q down across all actions (it is dominated by the highest, likely-OOD Q-values); the second pulls it up on the data distribution. Net effect: Q is suppressed on OOD actions. Works; sometimes over-conservative.
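As a sketch, with hypothetical shapes: q_all holds Q at N sampled candidate actions per state, q_data holds Q at the dataset actions.

```python
import torch

def cql_loss(q_all, q_data, bellman_loss, alpha=1.0):
    """q_all: [B, N] Q-values at candidate actions; q_data: [B] Q-values at
    dataset actions. Penalty = logsumexp over actions minus data Q."""
    penalty = torch.logsumexp(q_all, dim=-1).mean() - q_data.mean()
    return bellman_loss + alpha * penalty
```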
IQL — Implicit Q-Learning
Avoid evaluating Q at OOD actions entirely. Fit the value function with expectile regression:
$\mathcal{L}_V = \mathbb{E}\!\left[\, |\tau - \mathbb{1}(Q(s,a) - V(s) < 0)| \,(Q(s,a) - V(s))^2 \,\right]$
with expectile $\tau \approx 0.7$, so $V$ tracks an upper expectile of $Q$ using dataset actions only. The policy is extracted by advantage-weighted behavior cloning of dataset actions — it never queries Q on OOD actions, never has to be conservative. Strong, simple, the modern default.
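Both pieces in a few lines; beta and the clip value are typical but task-dependent.

```python
import torch

def expectile_loss(q, v, tau=0.7):
    """Asymmetric L2: underestimates of Q by V are penalized more, so V tracks
    an upper expectile of Q without ever querying OOD actions."""
    u = q - v
    w = torch.where(u > 0, tau, 1 - tau)
    return (w * u.pow(2)).mean()

def awr_weights(q, v, beta=3.0, clip=100.0):
    """Advantage-weighted BC weights for policy extraction."""
    return torch.exp(beta * (q - v)).clamp(max=clip)
```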
AWAC, AWR — Advantage-weighted regression
The general family: estimate advantage from data, then do BC weighted by $\exp(\beta A)$. AWAC adds an explicit Q-function update; AWR skips it and is the simplest member of the family.
When offline RL helps over BC
Two situations:
- Mixed-quality data. If your demonstrations include some failures or sub-optimal trajectories, BC trains on the average. Offline RL trains toward the best.
- Reward-labeled play data. If you have task-agnostic interaction with reward labels, BC has nothing to imitate. Offline RL extracts a task-specific policy.
When offline RL doesn't help
If your dataset is uniformly expert demonstrations, BC matches offline RL and is simpler. If your dataset is small and narrow, offline RL is hard to tune and unreliable. The big breakthroughs in robot learning over the last three years were data, not offline RL.
20Hybrid: BC + RL
The polishing step that makes specialists out of generalists.
BC gives you a policy that does roughly the right thing. RL gives you a policy that does the right thing reliably. The combinations of the two are where most production-grade robot policies actually live.
RL fine-tuning of BC
The simplest recipe: train a BC model, initialize an RL run with its weights, train with PPO or SAC. The challenge is preventing the policy from drifting too far from the BC prior in early training, which destroys the prior's value. Two tricks:
- KL constraint against the BC prior: add a $\mathrm{KL}(\pi_\theta \| \pi_{\text{BC}})$ regularizer to the policy loss, with its coefficient annealed down over training (sketched after this list).
- Entropy clipping: bound the policy's stochasticity below the BC's so the policy doesn't immediately become uniform random when the entropy bonus is too high.
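A sketch of the first trick, assuming pi and pi_bc return torch distributions whose log_prob reduces over action dimensions (e.g. Independent(Normal)); the BC prior is frozen.

```python
import torch

def kl_regularized_policy_loss(pi, pi_bc, states, actions, advantages, kl_coef):
    """Policy-gradient term plus a KL leash to the BC prior; anneal kl_coef
    toward zero over training so RL eventually takes over."""
    dist, dist_bc = pi(states), pi_bc(states)
    pg = -(dist.log_prob(actions) * advantages).mean()            # PG term
    kl = torch.distributions.kl_divergence(dist, dist_bc).mean()  # stay near prior
    return pg + kl_coef * kl
```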
Residual RL
Freeze a BC base policy $\pi_{\text{BC}}$. Train a small RL "correction" policy $\pi_\Delta$ that outputs an action delta. The deployed action is $a = \pi_{\text{BC}}(o) + \pi_\Delta(o)$. The RL problem is much easier — the BC prior already does most of the task, and the correction lives in a small action-magnitude box. Johannink et al. (2019) demonstrated this for industrial assembly; the recipe still works.
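A sketch of the composition, with a hypothetical delta_scale bounding the correction box.

```python
import torch

def residual_action(obs, pi_bc, pi_delta, delta_scale=0.05):
    """Deployed action = frozen BC base + small learned correction."""
    with torch.no_grad():
        base = pi_bc(obs)                            # the prior does most of the task
    delta = delta_scale * torch.tanh(pi_delta(obs))  # RL explores a small box
    return base + delta
```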
HIL-SERL
Luo et al., 2024. The current state of the art for sample-efficient real-world RL on manipulation. The recipe combines:
- Pre-trained vision encoder (frozen) feeding a small policy and Q-network.
- Initial offline pretraining of the Q on a small demo dataset.
- Online RL with a human-in-the-loop intervention button: when the robot is about to fail, the human takes over via teleop and the trajectory becomes positive training data.
- Q-ensemble + high update-to-data ratio for sample efficiency.
The result: 100% success on contact-rich tasks (PCB insertion, Jenga manipulation) in under two hours of real-world training. This is the only RL recipe in 2026 that is competitive with BC + lots of data on real robots.
RLHF for robots
Human preference labels over pairs of trajectories train a reward model; the reward model trains a policy with RL. RT-2-X and several follow-ups have shown this works for VLA fine-tuning, much as it did for language models. The bottleneck is preference-label collection at scale; the technique is mature, the data isn't.
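The reward-model objective is the Bradley-Terry loss familiar from LLM RLHF. A sketch, assuming a hypothetical r_model that maps a trajectory tensor to per-step rewards.

```python
import torch
import torch.nn.functional as F

def preference_loss(r_model, traj_a, traj_b, prefer_a):
    """The reward model should assign higher total return to the preferred
    trajectory. prefer_a: float tensor, 1.0 if A was preferred, else 0.0."""
    ra, rb = r_model(traj_a).sum(-1), r_model(traj_b).sum(-1)
    return F.binary_cross_entropy_with_logits(ra - rb, prefer_a)
```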
21Loss compendium
Every loss in the modern stack, named, derived, and placed in its bestiary.
The losses you'll see in robot-learning code are a small set with overlapping cousins. Knowing which one to reach for is half the battle; the other half is knowing why each exists.
Regression losses
MSE (L2)
The default. Smooth gradients, well-conditioned. The minimizer is $\mathbb{E}[y \mid x]$ — which is exactly the failure mode for multimodal $y$. Use when the conditional distribution is unimodal or when you've already factored out multimodality with another mechanism (e.g., the noise input to a diffusion model).
L1 (MAE)
The minimizer is the conditional median, which is more robust to label noise. Used in ACT and other policies where teleoperation produces small jittery labels. Convergence near the optimum is slower (the gradient is constant in magnitude), but the resulting model is less prone to over-smoothing fine motions.
Huber (smooth L1)
L2 inside a band, L1 outside. Robust to outliers without sacrificing convergence. Standard for Q-function regression in DQN and its descendants.
Likelihood losses
Gaussian / mixture NLL
Max likelihood for a parametric distribution. For a Gaussian head this is L2 plus a $\log \sigma$ term that lets the model express uncertainty. For a mixture head this was the natural way to capture multimodality before diffusion took over.
Cross-entropy
The right loss whenever your output head is a softmax over discrete bins or a vocabulary — VLAs, RT-1, RT-2, OpenVLA, VQ-BeT.
Variational losses
ELBO (VAE / CVAE)
Reconstruct $x$ from a latent $z$ sampled from a variational posterior; regularize the posterior toward a prior. ACT's loss is exactly this, with $x$ replaced by the action chunk and the reconstruction term using L1 instead of Gaussian likelihood.
Score / denoising losses
Denoising score matching (the DDPM $\epsilon$-loss)
The trained network predicts the noise that was added; subtracting the prediction recovers a denoised sample. Conceptually it's score matching ($\nabla_x \log p$ is proportional to $-\epsilon$ at the right scaling); pragmatically it's just MSE between known noise and predicted noise.
Flow matching
The same mechanical structure as DSM, but training a velocity field instead of a noise predictor, along straight interpolants between data and prior. Fewer integration steps at inference (both losses are sketched below).
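Side by side, as a sketch; denoiser and vfield are hypothetical networks taking (x, t).

```python
import torch

def ddpm_loss(denoiser, x0, alphas_cumprod):
    """Epsilon-prediction: corrupt x0 with known noise, regress the noise."""
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))
    a = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    return (denoiser(x_t, t) - eps).pow(2).mean()

def flow_matching_loss(vfield, x0):
    """Velocity along the straight line from noise to data; target is x0 - noise."""
    t = torch.rand(x0.shape[0]).view(-1, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = (1 - t) * noise + t * x0
    return (vfield(x_t, t.flatten()) - (x0 - noise)).pow(2).mean()
```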
Contrastive / energy losses
InfoNCE
Minimize energy on positive pairs and maximize it on a sampled set of negatives. Powers Implicit BC, CLIP, R3M, and many self-supervised vision objectives.
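The batch-contrastive form used by CLIP-style objectives, as a sketch: matching rows of the two embedding batches are positives, all other rows are negatives.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings [B, D]."""
    logits = F.normalize(z_a, dim=-1) @ F.normalize(z_b, dim=-1).T / temperature
    labels = torch.arange(len(z_a))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```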
RL losses, gathered
| Loss | Used by | Shape |
|---|---|---|
| PG / REINFORCE | vanilla PG | $-\mathbb{E}[\log \pi(a \mid s) \cdot A]$ |
| PPO clip | PPO | $-\mathbb{E}[\min(rA, \mathrm{clip}(r, 1\!\pm\!\epsilon)A)]$ |
| DQN / Bellman | DQN, SAC critic | $\mathbb{E}[(Q - (r + \gamma \bar Q'))^2]$ |
| SAC actor | SAC | $\mathbb{E}[\alpha \log \pi - Q]$ |
| CQL extra | offline CQL | $\log \sum_a e^{Q} - \mathbb{E}_{\pi_\beta}[Q]$ |
| IQL expectile | offline IQL | $\mathbb{E}[L^\tau_2(Q - V)]$ |
| AWR / AWAC | offline / hybrid | $-\mathbb{E}[e^{\beta A} \log \pi]$ |
22Training recipes
The unwritten parts of the README that decide whether your run converges.
Optimizer
AdamW with $\beta_1 = 0.9, \beta_2 = 0.95$ for transformers, $\beta_2 = 0.999$ for everything else. Weight decay $\sim 0.05$ on linear-layer weights, zero on biases and norms. Gradient clipping at global norm $1.0$ — non-negotiable for transformers and a cheap insurance policy elsewhere.
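The param-group split in code; the name-based norm check is a heuristic, not a guarantee, for arbitrary architectures.

```python
import torch

def make_optimizer(model, lr=1e-4):
    """AdamW with decay on matmul weights only; biases and norms exempt."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (no_decay if p.dim() < 2 or "norm" in name.lower() else decay).append(p)
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": 0.05},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, betas=(0.9, 0.95))

# In the train step, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```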
Schedule
Linear warmup over the first 1000–5000 steps, then cosine decay to 10% of peak LR over the rest of training. Peak LR depends on architecture: $1\!\times\!10^{-4}$ for from-scratch transformers, $3\!\times\!10^{-5}$ for VLA fine-tuning, $5\!\times\!10^{-4}$ for ResNet-scale BC. Skip warmup and you eat a loss spike in the first hundred steps that the model never fully recovers from.
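The schedule as a LambdaLR multiplier on the peak LR; step counts below are placeholders.

```python
import math
import torch

def warmup_cosine(optimizer, warmup_steps=2000, total_steps=200_000, floor=0.1):
    """Linear warmup to peak LR, then cosine decay to floor x peak."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return floor + (1 - floor) * 0.5 * (1 + math.cos(math.pi * t))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```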
EMA
Maintain a shadow copy of model weights, updated as $\theta_{\text{EMA}} \leftarrow \tau \theta_{\text{EMA}} + (1-\tau) \theta$ at every step. Use the EMA copy at evaluation. Critical for diffusion ($\tau = 0.9999$) and flow matching policies; helpful for everything else. The intuition is that the loss surface has high-frequency noise that the EMA averages out, and the resulting weights generalize better than any single training step's.
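A minimal EMA wrapper; call update() after every optimizer step and evaluate with self.shadow.

```python
import copy
import torch

class EMA:
    """Shadow copy of the model, updated every step; evaluate with it."""
    def __init__(self, model, tau=0.9999):
        self.tau = tau
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.lerp_(p, 1.0 - self.tau)   # s <- tau*s + (1-tau)*p
```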
Mixed precision
BF16 weights and activations on Hopper / Ada hardware; FP32 for the optimizer state, the loss, and any normalization statistics. The roughly 2× memory and ~2× speed savings are too large to leave on the table. Watch for numerical issues in attention softmax and in any explicit $\log$ — keep those in FP32.
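A minimal bf16 step under those rules, assuming CUDA hardware with bf16 support; the loss itself is computed in FP32.

```python
import torch

def train_step(model, batch, optimizer):
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        pred = model(batch["obs"])                        # bf16 forward
    loss = (pred.float() - batch["target"].float()).pow(2).mean()  # FP32 loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()
```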
Batch construction
Three details that disproportionately matter:
- Episodes must not be split arbitrarily. If a chunk crosses an episode boundary, the model learns to model the dataset's stitching, not the task. Sample chunks within episodes only.
- Rebalance multi-task data. Naive concatenation gives long-tail tasks no signal. Square-root or temperature-weighted sampling per task is standard (a sampler covering this and the previous rule is sketched after this list).
- Within a batch, mix camera views and embodiments. Each batch should be a microcosm of the dataset, not a single-task slug.
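A sketch of a sampler respecting the first two rules; temperature 0.5 reproduces square-root rebalancing.

```python
import numpy as np

def sample_chunk(episode_lengths, task_ids, chunk_len, temperature=0.5):
    """Temperature-weighted task choice, then a chunk start strictly inside one
    episode, so chunks never cross episode boundaries.
    episode_lengths: list of ints; task_ids: parallel list of task labels."""
    tasks, counts = np.unique(task_ids, return_counts=True)
    probs = counts.astype(float) ** temperature        # sqrt-style rebalancing
    task = np.random.choice(tasks, p=probs / probs.sum())
    eligible = [i for i, (t, n) in enumerate(zip(task_ids, episode_lengths))
                if t == task and n >= chunk_len]
    ep = np.random.choice(eligible)
    start = np.random.randint(0, episode_lengths[ep] - chunk_len + 1)
    return ep, start                                   # slice [start, start + chunk_len)
```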
Augmentation, repeated
For pixel inputs, random shift + color jitter + a small random rotation. For proprioception, no augmentation other than dropout (25% on the proprio token, applied during training only — this prevents causal confusion with the action history). For action labels, never augment — those are your targets.
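A sketch of the pixel shift and proprio dropout; color jitter is omitted for brevity (torchvision's ColorJitter is the usual choice), and action labels are deliberately untouched.

```python
import torch
import torch.nn.functional as F

def augment(images, proprio, pad=4, p_drop=0.25):
    """Random shift (pad + crop) on pixels; dropout on the whole proprio vector.
    images: [B, C, H, W] in [0, 1]; proprio: [B, D]. Training only."""
    B, C, H, W = images.shape
    padded = F.pad(images, [pad] * 4, mode="replicate")
    dx, dy = torch.randint(0, 2 * pad + 1, (2,)).tolist()
    images = padded[:, :, dy:dy + H, dx:dx + W]         # one shift per batch
    keep = (torch.rand(B, 1) > p_drop).float()          # drop the proprio token
    return images, proprio * keep
```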
Regularization that's overrated
Dropout in transformers (other than attention dropout for very small datasets) usually hurts. L2 on activations is rarely needed. The only regularizers that consistently help are weight decay on linears, gradient clipping, and EMA.
Compute budget
An ACT-scale single-task policy fits on one GPU for a day. A Diffusion Policy with a transformer backbone fits on one GPU in a couple of days. A 7B-parameter VLA fine-tune wants 8×H100 for a week. A from-scratch VLA pretraining run is a small-cluster operation — on the order of $10^5$ GPU-hours. Plan accordingly.
23Inference and deployment
The system around the model is the system you ship.
Receding-horizon control, decoded
The deployment loop for a chunked policy:
- Read the latest observation $o_t$ from cameras and proprioception.
- Run the policy forward to predict $a_{t:t+H}$.
- Push the predicted chunk into a control buffer.
- Send actions from the buffer to the robot at the control rate (50–200Hz).
- After $K$ control ticks, return to step 1.
The trick is decoupling policy frequency ($1/K$ of the control rate) from control frequency. The policy can be slow; the controller is fast. A 200ms diffusion policy that produces 16 actions executed at 50Hz controls the robot for 320ms — well within budget.
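The loop as a sketch, against a hypothetical policy(obs) returning an action chunk and a hypothetical robot interface; time.sleep stands in for a real-time clock.

```python
import collections
import time

def control_loop(policy, robot, replan_every=8, control_hz=50):
    """Receding-horizon deployment: the policy refills a buffer of future
    actions; the controller drains it at the control rate."""
    buffer, tick = collections.deque(), 0
    while True:
        if tick % replan_every == 0:
            chunk = policy(robot.observe())       # slow: tens of ms
            buffer.clear()                        # drop stale predictions
            buffer.extend(chunk)
        robot.send(buffer.popleft())              # fast: one control tick
        tick += 1
        time.sleep(1.0 / control_hz)              # placeholder for a RT scheduler
```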
Latency budgets
Approximate end-to-end inference cost on contemporary hardware (H100 unless noted):
| Architecture | Inference | Notes |
|---|---|---|
| ACT | 5–10 ms | Single forward, 80M params, batch-1 |
| Diffusion Policy | 30–80 ms | 16 DDIM steps × small CNN denoiser |
| 3D Diffusion Policy | 20–50 ms | Sparse point cloud, smaller backbone |
| π₀ flow | 40–60 ms | 10 Euler steps × 300M action expert |
| π₀-FAST | ~20–40 ms | Autoregressive on ~30 FAST tokens |
| π₀-FAST + DCT early-stop | ~10–20 ms | Decode 3–4 freq coeffs only |
| GR00T N1 (2.2B) | 63.9 ms / chunk-16 | L40 GPU, bf16, official number |
| OpenVLA 7B | 200–400 ms | One forward pass; INT4 cuts ~2× |
| SmolVLA 450M | ~30–50 ms | Designed for Jetson-class hardware |
| Helix two-system | <100 ms loop | VLM 7–9Hz, controller 200Hz, on Jetson Orin INT4 |
| One-step diffusion | 5–15 ms | Distilled diffusion; 1 sampling step |
The receding-horizon timeline
The reason high-latency policies can still control fast robots: the policy and the controller run on different clocks.
The control loop runs at 50–200Hz. The policy runs at 5–25Hz. The buffer between them is filled by predicted action chunks. As long as the policy returns the next chunk before the previous one runs out, the robot moves smoothly. Latency above the chunk duration causes a stall; latency below it is invisible. This is why the chunk duration (chunk length ÷ control frequency) is the budget that matters, not raw policy latency.
DDIM and other accelerators
For diffusion policies, sampling steps drop from $K = 100$ training to $K = 16$ inference via DDIM with negligible loss. Consistency models, distilled samplers, and rectified flow further compress this to single-step sampling at small accuracy cost. The 2026 production diffusion policy almost always samples in 4–16 steps, never 100.
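A deterministic DDIM sampler ($\eta = 0$) over a strided subset of the training timesteps, as a sketch; denoiser(x, t) is assumed to predict the added noise.

```python
import torch

@torch.no_grad()
def ddim_sample(denoiser, shape, alphas_cumprod, n_steps=16):
    """Strided DDIM: predict x0 from the noise estimate, then re-noise to the
    previous (coarser) timestep. alphas_cumprod: [T] training schedule."""
    T = len(alphas_cumprod)
    ts = torch.linspace(T - 1, 0, n_steps).long()
    x = torch.randn(shape)
    for i, t in enumerate(ts):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[ts[i + 1]] if i + 1 < n_steps else torch.tensor(1.0)
        eps = denoiser(x, t.expand(shape[0]))
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # predicted clean sample
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return x
```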
Action smoothing and safety filters
Even a good policy outputs occasional spikes. The deployed system always includes:
- Velocity / acceleration limits on commanded actions — clip if the policy exceeds them.
- Workspace bounding boxes — clip EE poses to the safe operating volume.
- Force / torque limits — abort if measured forces exceed safe thresholds.
- Watchdog timer on policy inference — if the next chunk arrives late, fall back to coasting on the previous chunk's last command, then stop.
None of these are part of the model. All of them are part of the policy system. Skip them and your first deployment is your last.
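A sketch of the first three filters for an EE-position action space; every limit below is a placeholder to tune per robot, and the force argument is a hypothetical wrench reading.

```python
import numpy as np

def safety_filter(cmd, prev_cmd, force, dt=0.02, v_max=0.25, f_max=30.0,
                  box=((-0.4, 0.4), (-0.4, 0.4), (0.0, 0.5))):
    """Clamp a commanded EE position to velocity and workspace limits; abort on
    excess measured force. cmd, prev_cmd: [3] targets (m); force: [3] (N)."""
    if np.linalg.norm(force) > f_max:
        raise RuntimeError("force limit exceeded: e-stop")   # abort, don't clip
    v = np.clip((cmd - prev_cmd) / dt, -v_max, v_max)        # velocity limit
    cmd = prev_cmd + v * dt
    lo, hi = np.array(box).T
    return np.clip(cmd, lo, hi)                              # workspace box
```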
Quantization
For VLA-scale policies, INT8 or FP8 weight-only quantization gives ~2× speedup with minimal degradation. Activation quantization is dicier — attention can be sensitive. AWQ and GPTQ work; SmoothQuant works; full INT4 is fragile but viable for the largest models when latency is critical.
24Evaluation
The hard part is not making the policy work. The hard part is knowing whether it works.
Success rate, with footnotes
The headline metric is task success rate: percentage of trials that reach the goal. The headline can lie. Specifics that change the meaning:
- How many trials? 10 trials gives a 95% confidence interval of about $\pm 30$ points for a 50% success rate. 50 is the bare minimum for a single task; 100+ for any claim that compares two methods (an interval helper is sketched after this list).
- Reset distribution. Identical resets across methods, ideally videoed. "We tested in similar conditions" is not a reset distribution.
- Time limit. A success that takes 10 minutes is not the same as one that takes 10 seconds.
- Recovery. Did the policy recover from disturbances, or did it succeed on the easy trials and fail systematically on the hard ones?
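For the trial-count point above, a sketch of the Wilson score interval, which is tighter than the normal approximation at small n.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score interval for a binomial success rate.
    wilson_interval(5, 10) ~= (0.24, 0.76): 10 trials tell you almost nothing."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half
```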
Generalization axes
A useful eval reports separate numbers along distinct generalization axes:
- Object instance. Same object class, different instance.
- Object pose. Same instance, different starting pose.
- Background and lighting. Same task, new scene.
- Distractors. Add unrelated objects to the scene.
- Language paraphrase. "Pick up the apple" vs "grab the red fruit."
- Out-of-distribution objects. Objects not in the training set.
Average success rate across all six is misleading. Per-axis numbers tell you what the policy actually learned.
Long-horizon evaluation
For multi-step tasks, success rate alone is too crude. Useful instead: per-stage success rate, average completion fraction, and median time to first failure. A policy that consistently fails at stage 3 is more diagnostic than one that succeeds 40% of the time without telling you where it falls over.
Sim vs real
Sim eval is fast, free, deterministic, and only loosely correlated with real-world success. The standard discipline: track both, report sim-eval as a development signal and real-eval as the metric. A 10-point gap between the two on the same task is normal; a 30-point gap is a sign your sim is mis-specified.
The eval that catches problems early
- Closed-loop validation on a held-out subset of trajectories: does the policy reach the same states the demos did?
- Action distribution diagnostics: histogram of predicted actions vs demo actions. A skewed histogram is an early warning of mode collapse.
- Latency and jitter measurement under deployment conditions. A policy that's fast on the dev box and slow on the cell controller is a deployment-day surprise.
25The road ahead
A field manual is a snapshot. The map will be different next year.
The picture this manual draws is a rough consensus that did not exist three years ago: imitate at scale with a foundation-model backbone, polish with RL when it pays for itself, evaluate honestly, ship the policy as one piece of a controlled system. The architecture wars between ACT and Diffusion Policy and VLAs are mostly cooling — the field has internalized that they are points on a spectrum, not rivals. The interesting open problems are elsewhere.
Where the field actually stands, May 2026
Five things that were true a year ago and are still true; five that were not.
Still true: the two-system VLA is the dominant generalist architecture; flow matching has overtaken plain DDPM as the default action head; cross-embodiment training pays off when the action space is shared; co-training with web data is mandatory; HIL-SERL is the only RL recipe that's competitive with BC + lots of data on real hardware.
New since 2025: autoregressive VLAs caught up with diffusion via FAST tokenization (5× faster training, comparable quality); π₀.₅ → π₀.₇ added open-world generalization, multi-scale embodied memory for >10-minute tasks, and an RL Token mechanism for fast online polishing; GR00T N1/N1.5 demonstrated synthetic data from video diffusion at scale; Gemini Robotics 1.5 split high-level reasoning from low-level control via tool calls and "embodied thinking"; Helix 02 added a System-0 whole-body motion prior that constrains the controller to physically feasible behavior; SmolVLA and the small-VLA wave showed 450M-parameter models can match 7B baselines; 3D and equivariant policies started to make sample-efficiency arguments that the data-rich camp can no longer dismiss.
Data, still
Every plot of policy success rate versus dataset size is a line that has not yet bent. The cheapest way to a better policy in 2026 is more demonstrations. UMI-style handheld collection, mobile teleoperation rigs, humanoid teleop suits, and synthetic trajectories from video diffusion models are the engineering frontier. The model architecture is downstream of the data pipeline.
Synthetic data and the new bitter lesson
The most interesting development of the past year is that generative models are themselves becoming a data source for robot policies. GR00T N1 trains on neural-generated trajectories from video diffusion; DexMimicGen and MimicGen synthesize new demos from a small seed of real ones; Genesis and Newton (the open physics engine from NVIDIA, DeepMind, and Disney) push the upper bound on what physics simulators can model. The "bitter lesson 2.0" version of the field's debate is no longer "does scaling work" — it works — but "what is the cheapest source of marginal data?" Increasingly, the answer is a generative model.
The simulator question
Real-world data is rich and expensive. Simulation is fast and lossy. Closing the gap with better physics simulators (Genesis, MuJoCo MJX, NVIDIA Newton), neural simulators (world models trained on real video), digital twins, and synthetic-data pipelines is an open contest. The eventual answer is probably all of the above, layered, not one of them dominating.
Tactile, force, and the contact-rich plateau
Vision-only policies are reaching their plateau on contact-rich tasks. Force-torque sensing helps when wired in correctly. Tactile arrays (DIGIT, GelSight, ReSkin, AnySkin) help even more when there is data. ViTacFormer and the visuo-tactile diffusion-policy variants of 2025 are early evidence that the modality is real; the bottleneck remains large, diverse, well-labeled tactile datasets.
Whole-body humanoid control
Humanoids force the field to confront a problem manipulation policies have ignored: the policy and the locomotion controller are not independent. Helix 02's System 0/1/2 hierarchy, GR00T's diffusion transformer, and the bimanual VLAs that now control 35-DOF upper bodies are early answers. The architecture has not converged. The unresolved questions are: who owns balance — the controller or the policy? — and how do you train a policy that has to walk, reach, and stay up at the same time without specialized priors for each?
Continual and lifelong learning
A robot that ships, runs in a customer's facility, and never improves from the data it generates there is leaving most of its potential on the table. The infrastructure to safely fine-tune deployed policies on deployed data — without catastrophic forgetting, without privacy violations, without dangerous regressions — does not yet exist as a productized standard. The π₀.₇ "RL Token" and the inference-time online RL machinery from Physical Intelligence are the closest thing in the open literature. The full version of this is what 2027 looks like.
The director's takeaway
If you fund robot learning today, fund three things in roughly equal measure:
- Data infrastructure. Collection rigs, teleop, in-the-wild capture, dataset versioning, evaluation harness, synthetic-data pipelines. The unsexy stuff. The stuff that compounds.
- A foundation backbone. Either build one (expensive) or fine-tune one (cheaper but locked-in). The gap between teams with one and teams without keeps growing. Open weights now exist for π₀, GR00T N1, OpenVLA, RDT-1B, SmolVLA — there is no longer a reason to pretrain from scratch.
- Evaluation rigor. Real-robot evals, generalization-axis splits, statistical significance, sim-to-real correlation tracking. The only thing more expensive than slow eval is a policy you thought worked.
The new grad's takeaway
Read the loss compendium until you can sketch every loss from memory. Pick one architecture (Diffusion Policy is a fine choice; π₀-FAST is a more modern one) and rebuild it from scratch. Run it on a real robot with a real evaluation harness. Read three papers a week, half of them old. Don't chase the latest VLA — train your eye to recognize which parts of the latest VLA are new and which parts are the same DDPM you already know. The field is moving fast, but most of the motion is along directions that were visible from 2023.
The staff engineer's takeaway
The hard problems are still where they were five years ago: data quality, eval rigor, deployment safety, the gap between a notebook result and a customer-deployed policy. Architecture is not the bottleneck for any team you join. Be the person who makes the dataloader fast, the eval suite honest, the deployment robust, the safety filter trustworthy. That role is undersupplied, undervalued, and load-bearing.