Fine-Tuning pi0.5 with UMI: The Promise and Limits of Robot-Free Data Collection
Vision-Language-Action (VLA) models have shown an impressive ability to learn precise manipulation tasks, but state-of-the-art systems are data-hungry, requiring tens of thousands of hours of demonstrations for pre-training and additional hours of fine-tuning on each target task. Collecting this data means deploying full robotic systems and teleoperation rigs to diverse environments, a process that is expensive and extremely difficult to scale.
At Armstrong Robotics, we’re building learning-based systems for commercial dishwashing. These are wet, dirty, unpredictable environments where learned policies can dramatically improve reliability while reducing system complexity and cost, yet they are also some of the hardest places to run a teleoperation setup.
Universal Manipulation Interface (UMI) offers a promising alternative. UMI devices are hand-held tools that simulate robotic hardware, allowing demonstrations to be collected without deploying the robot itself. Both academic and industry teams have developed UMI variants [1][2] and shown that this method can scale real-world data collection.
In this post, we share our experience building a custom UMI device and using it to fine-tune pi0.5 [3] for dishwashing tasks, specifically picking dirty dishes from bus tubs and racking them for the dishwasher. We find that policies trained only on UMI data can successfully complete our task, but their accuracy and reliability are limited by two factors: the visual domain shift between human demonstrations and robotic execution, and the policy’s inability to account for imperfect hardware dynamics.
ArmstrongUMI Design and Data Collection
To collect demonstrations, we need a handheld device that does not restrict the range of feasible demonstrations, reliably tracks the operator's pose, and records continuously without interruption. The original UMI design falls short on all three counts, so we design a new device around the YAM Ultra arms from i2rt [4].
The original UMI gripper’s bulky handle and wide top plate block approach angles in constrained spaces like bus tubs and sink basins even when the robot arm itself could reach. We redesign the gripper to match the fingers, footprint, and form factor of our i2rt end-effector, ensuring any grasp the robot can perform is also feasible during data collection. We refer readers to our hardware-focused writeup for a more in-depth discussion.
Our most critical change is to pose tracking. The original implementation uses visual-inertial SLAM, which we find unacceptably fragile: roughly a quarter of our demonstrations contained significant tracking errors or failed to localize entirely, despite careful environment mapping and slow demonstration speeds. We replace this with a Meta Quest 3S VR controller mounted to the device, streaming poses in real time via a custom Unity application. We calibrate the fixed transform between the Quest controller frame and the end-effector frame implied by our device, mapping demonstrations directly to target robot trajectories. The full calibration procedure is described in the Appendix.
We also swap the original GoPro, which is prone to overheating during sustained streaming, for an off-the-shelf USB webcam. Finally, we find that a wrist camera alone provides insufficient context for our workspace, so we add a statically mounted scene camera. During each demonstration, we periodically record the latest image from all connected cameras alongside the current controller pose.
Action and Observation Format
Using UMI as a data collection platform dictates a number of downstream model design decisions. Since the Quest returns the controller pose in an arbitrary world frame liable to change across recording sessions, we must use relative end-effector poses for both our action and observation formats, making the dataset independent of the global frame used during data capture or inference. Actions are represented as poses relative to the current state, while observations are the relative transformation between the last state and the current state, effectively the translational and rotational velocity in the end-effector frame.
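As a rough illustration of why relative poses remove the dependence on the world frame, the sketch below builds one observation and one action chunk from a track of 4x4 homogeneous poses. The function names (`relative_pose`, `make_sample`), the chunk horizon, and the example poses are assumptions made for illustration, not our training code.

```python
import numpy as np

def rot_z(theta: float) -> np.ndarray:
    """3x3 rotation about the z axis, used only to build example poses."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def make_pose(R: np.ndarray, p) -> np.ndarray:
    """Assemble a 4x4 homogeneous transform from a rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p
    return T

def relative_pose(T_ref: np.ndarray, T_target: np.ndarray) -> np.ndarray:
    """Pose of T_target expressed in the frame of T_ref (both given in one world frame)."""
    return np.linalg.inv(T_ref) @ T_target

def make_sample(poses: np.ndarray, t: int, horizon: int):
    """Hypothetical sample builder for a (N, 4, 4) pose track at index t >= 1."""
    # Observation: motion from the previous state to the current one.
    obs = relative_pose(poses[t - 1], poses[t])
    # Actions: future poses expressed relative to the current state.
    actions = np.stack(
        [relative_pose(poses[t], poses[t + k]) for k in range(1, horizon + 1)]
    )
    return obs, actions

# The same motion recorded in two different world frames yields identical samples.
poses = np.stack(
    [make_pose(rot_z(0.1 * i), [0.05 * i, 0.0, 0.02 * i]) for i in range(6)]
)
world_shift = make_pose(rot_z(1.3), [2.0, -1.0, 0.5])
obs_a, act_a = make_sample(poses, t=1, horizon=3)
obs_b, act_b = make_sample(world_shift @ poses, t=1, horizon=3)
print(np.allclose(obs_a, obs_b), np.allclose(act_a, act_b))  # prints: True True
```

Because the arbitrary world transform cancels inside `relative_pose`, a session recorded after the Quest re-localizes produces the same training targets as one recorded before.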
We represent the rotation component of each relative transformation as the flattened first two columns of the rotation matrix, a continuous six-dimensional parameterization [5] from which the full matrix can be recovered via Gram-Schmidt orthogonalization. This ensures that any six values produced by the model, barring degenerate cases such as collinear columns, map to a valid, unique rotation while avoiding the singularities of more compact representations.
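A minimal sketch of this parameterization follows. The column-by-column flattening order and the function names are assumptions; the Gram-Schmidt recovery follows the construction in [5].

```python
import numpy as np

def rotation_to_6d(R: np.ndarray) -> np.ndarray:
    """Flatten the first two columns of a 3x3 rotation matrix (assumed order: col 0, then col 1)."""
    return np.concatenate([R[:, 0], R[:, 1]])

def rotation_from_6d(d6: np.ndarray) -> np.ndarray:
    """Recover a rotation matrix from six values via Gram-Schmidt orthogonalization."""
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)              # normalize the first column
    a2_orth = a2 - np.dot(b1, a2) * b1        # remove the component along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)    # normalize the second column
    b3 = np.cross(b1, b2)                     # third column completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)

# Raw, unnormalized network outputs still map to a proper rotation.
raw_output = np.array([1.0, 0.2, -0.3, 0.1, 0.9, 0.4])
R = rotation_from_6d(raw_output)
print(np.allclose(R.T @ R, np.eye(3)), np.isclose(np.linalg.det(R), 1.0))  # prints: True True
```

The recovery only breaks down when the two predicted columns are zero or collinear, which is the degenerate case where the normalization divides by zero.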
Although these decisions are required by UMI, we find that the same action and observation formats also give superior results on our teleoperation datasets when compared to absolute and relative joint representations and to absolute pose observations. This is likely due to two factors: the model is not forced to learn forward kinematics, which simplifies the learning problem, and relative poses introduce translational and rotational equivariance.
pi0.5 Fine-Tuning
We select pi0.5 as our base model for its open source JAX implementation, support for mixed precision training and LoRA fine-tuning, and easy integration with LeRobotDataset. While adapting it to our setting, we identify several issues and improvements.
pi0.5 predicts actions in chunks, and when a chunk extends past the end of an episode, the official openpi implementation pads it with the final state but does not mask the loss on these padded actions. This teaches the policy to stop at the mean trajectory end position, causing it to freeze at the end of each pick and place cycle on episodic datasets like ours. Masking the loss on padded actions resolves this entirely.
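The idea can be sketched with a plain mean-squared-error loss. This is an illustrative NumPy version, not the openpi API: `pad_mask`, `masked_chunk_loss`, and the array shapes are assumptions, and the actual flow-matching loss would be masked the same way per action step.

```python
import numpy as np

def pad_mask(episode_len: int, start: int, horizon: int) -> np.ndarray:
    """True for chunk steps that fall past the end of the episode (i.e. padding)."""
    return (start + np.arange(horizon)) >= episode_len

def masked_chunk_loss(pred: np.ndarray, target: np.ndarray, is_pad: np.ndarray) -> float:
    """Mean squared error over real action steps only.

    pred, target: (batch, horizon, action_dim); is_pad: (batch, horizon) booleans.
    """
    per_step = np.mean((pred - target) ** 2, axis=-1)   # (batch, horizon)
    valid = (~is_pad).astype(per_step.dtype)            # 1 for real steps, 0 for padding
    return float(np.sum(per_step * valid) / np.maximum(np.sum(valid), 1.0))

# A chunk of 4 steps starting at step 8 of a 10-step episode has 2 padded steps.
print(pad_mask(episode_len=10, start=8, horizon=4))  # prints: [False False  True  True]
```

Normalizing by the number of valid steps, rather than by the full chunk length, keeps the loss scale comparable between chunks in the middle of an episode and chunks near its end.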
We implement real-time chunking (RTC) [6] to overlap inference and action execution in a principled manner. Synchronous inference causes the arm to exhibit a distinctive jerking motion resulting from the sluggish motor controller on the i2rt arms. Since the arm’s state lags significantly behind commanded actions, when we collect an observation for the next inference pass, the arm is still mid-chunk. While running inference, the arm catches up to its final commanded position, which now differs from what the policy conditioned on, and the resulting actions jerk it back toward where it was when inference began. RTC resolves this by conditioning on the actions that will execute during inference, producing smooth trajectories. We note, however, that performance with RTC remains sensitive to inference delay for precise actions such as sliding fingers under the top plate of a stack.
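The bookkeeping around inference delay can be sketched as follows. This covers only the scheduling side (which already-committed actions will run while inference is in flight), not the RTC conditioning inside the model. The function names, the example latency, and the control rate are hypothetical, and the sketch assumes both chunks are expressed in a common frame; with relative-pose actions the previous chunk would first need to be re-expressed relative to the new observation.

```python
import math
import numpy as np

def inference_delay_steps(latency_s: float, control_hz: float) -> int:
    """Number of control steps that elapse while one inference pass runs."""
    return math.ceil(latency_s * control_hz)

def committed_prefix(prev_chunk: np.ndarray, exec_index: int, delay_steps: int) -> np.ndarray:
    """Actions from the previous chunk that will execute before new actions arrive."""
    return prev_chunk[exec_index:exec_index + delay_steps]

def splice_chunk(prefix: np.ndarray, new_chunk: np.ndarray) -> np.ndarray:
    """Keep the committed prefix and take the new prediction for every later step.

    new_chunk[0] is assumed to line up in time with prefix[0].
    """
    return np.concatenate([prefix, new_chunk[len(prefix):]], axis=0)

# Illustrative numbers only: 120 ms of latency at a 30 Hz control rate.
delay = inference_delay_steps(latency_s=0.12, control_hz=30.0)
prev_chunk = np.arange(10, dtype=float)[:, None]          # stand-in for 10 one-dimensional actions
new_chunk = 100.0 + np.arange(10, dtype=float)[:, None]   # stand-in for the next prediction
prefix = committed_prefix(prev_chunk, exec_index=3, delay_steps=delay)
plan = splice_chunk(prefix, new_chunk)
```

If the true latency exceeds the assumed `delay_steps`, the robot runs past the committed prefix before the new chunk is ready, which is one way to read the sensitivity to inference delay noted above.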
Finally, we find it essential to compute the full relative transformation between adjacent waypoints for delta action prediction rather than simply subtracting their representations. We also find that sampling multiple random noise values and timesteps for flow matching per forward pass of the VLM backbone, as suggested by [7], produces more stable gradients with minimal memory and computational overhead.
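The multi-sample trick can be sketched as below, assuming the expensive backbone features are computed once and only a lightweight action head is re-run per draw. The interpolation convention (t = 1 is pure noise), the uniform time sampling, and the `velocity_fn` signature are assumptions for illustration rather than the exact training code.

```python
import numpy as np

def multi_sample_flow_loss(prefix_features, actions, velocity_fn, num_samples, rng):
    """Average a flow-matching loss over several (noise, time) draws per backbone pass.

    prefix_features: output of the backbone, computed once and reused for every draw.
    actions:         (batch, horizon, action_dim) ground-truth action chunk.
    velocity_fn:     hypothetical action head, (features, x_t, t) -> predicted velocity.
    """
    losses = []
    for _ in range(num_samples):
        noise = rng.standard_normal(actions.shape)
        t = rng.uniform(0.0, 1.0, size=(actions.shape[0], 1, 1))
        x_t = t * noise + (1.0 - t) * actions     # noisy actions at time t
        target = noise - actions                  # velocity of the straight-line path
        pred = velocity_fn(prefix_features, x_t, t)
        losses.append(np.mean((pred - target) ** 2))
    return float(np.mean(losses))

# Toy usage with a head that always predicts zero velocity.
rng = np.random.default_rng(0)
toy_actions = rng.standard_normal((4, 5, 3))
toy_features = np.zeros((4, 8))
toy_loss = multi_sample_flow_loss(
    toy_features, toy_actions, lambda f, x, t: np.zeros_like(x), num_samples=8, rng=rng
)
```

Averaging over several draws reduces the variance of the loss estimate for each batch, while the backbone cost, which dominates, is paid only once.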
Teleoperation vs. UMI Performance
We collect 400 demonstrations of picking stacked plates from a bus tub and placing them into a dish rack for each collection modality. Each demonstration lasts 10-20 seconds and, including environment resets, this process takes around three hours per modality. We aggressively filter out demonstrations showing suboptimal behavior, and include recoveries after grasping multiple plates or missing the top plate entirely. The same model architecture and training configuration are applied to both datasets, isolating the collection modality as the only differentiating factor.
The teleoperation policy validates our modeling decisions, reliably picking from stacks of varying heights across the tub and clearly learning the intended ordering of placements into the rack. We separately experiment with adding recovery demonstrations, filtered behavior cloning, and offline reinforcement learning to further improve the teleoperation policy performance as detailed in the Appendix.
The UMI policy picks plates reliably but struggles significantly with placements. The most common failure modes are double placements and skipped slots in the rack. We trace this gap to two distinct issues: the policy’s inability to account for imperfect hardware execution of the predicted trajectory and the visual domain shift between training and deployment.
Hardware Dynamics
UMI demonstrations capture ideal trajectories with no mechanism to account for imperfect execution on real hardware. The i2rt's PD controller lacks an integral term and cannot account for changes in payload weight, resulting in 5-10 degree deviations between desired and actual joint values. The UMI policy simply predicts the next target pose and has no way to correct for these tracking errors, resulting in collisions with the environment and taking the model out of distribution.
Simply adding recovery demonstrations cannot address this. Since the arm fails to track any trajectory faithfully, the only recourse is to overcorrect, sending more extreme commands that force the arm toward its original target. The amount to overcorrect depends on the dynamics of the arm and its low-level controller, information that is entirely absent from UMI demonstrations. In contrast, the teleoperation policy implicitly learns these corrections from the deviation between leader and follower arm joints and the human teleoperator’s real-time compensation.
As a temporary fix, we significantly increase the proportional gain. This reduces end-effector deviation to roughly a centimeter in the worst case but produces a much stiffer, more jittery controller.
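The effect of the proportional gain follows from a standard result: a PD position loop holding a constant load torque tau settles where Kp * e = tau, so the static error is e = tau / Kp. The sketch below checks this on a single simulated joint. The inertia, gains, and load torque are illustrative values, not measurements from the i2rt arm.

```python
import math

def steady_state_error_rad(load_torque_nm: float, kp_nm_per_rad: float) -> float:
    """Static position error of a PD loop (no integral term) under a constant load."""
    return load_torque_nm / kp_nm_per_rad

def simulate_pd(kp, kd, load_torque, inertia=0.05, dt=0.001, steps=20000):
    """Simulate one joint tracking a target of 0 rad under a constant load torque."""
    q, qd = 0.0, 0.0
    for _ in range(steps):
        torque = -kp * q - kd * qd + load_torque   # PD command plus disturbance
        qd += (torque / inertia) * dt              # semi-implicit Euler integration
        q += qd * dt
    return q

# Illustrative comparison of a soft and a stiff proportional gain.
for kp in (25.0, 200.0):
    predicted = steady_state_error_rad(3.0, kp)
    simulated = simulate_pd(kp=kp, kd=2.0, load_torque=3.0)
    print(f"kp={kp}: predicted {math.degrees(predicted):.2f} deg, "
          f"simulated {math.degrees(simulated):.2f} deg")
```

Raising Kp shrinks the offset in proportion, but the error never reaches zero without an integral term or a feedforward model of the payload, which is why we treat the gain increase as a temporary fix.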
Visual Domain Shift
The second limitation is the visual mismatch between training and deployment: UMI demonstrations show a human arm in the scene camera, while the deployed policy sees a robot.
To isolate the impact of this shift, we design a replay experiment. We take 30 training trajectories, compute the relative transformations between adjacent waypoints, and replay them on the robot starting from the first gripper close to isolate the placement phase where the UMI policy struggles. This produces rerecorded versions of training episodes with a robotic arm in place of a human one, while preserving the original trajectories.
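The replay step can be sketched as chaining the recorded adjacent-waypoint deltas onto the robot's own starting pose. The function names and example poses below are assumptions for illustration; in the experiment the start index would be the waypoint of the first gripper close.

```python
import numpy as np

def adjacent_deltas(poses: np.ndarray) -> np.ndarray:
    """Relative transforms inv(T_i) @ T_{i+1} for a (N, 4, 4) pose track."""
    return np.linalg.inv(poses[:-1]) @ poses[1:]

def replay_targets(start_pose: np.ndarray, deltas: np.ndarray) -> np.ndarray:
    """Chain recorded deltas onto a new starting pose to get absolute targets."""
    targets = []
    T = start_pose
    for delta in deltas:
        T = T @ delta          # apply each recorded motion in the end-effector frame
        targets.append(T)
    return np.stack(targets)

def example_pose(theta: float, p) -> np.ndarray:
    """Helper for the example: rotation about z by theta plus a translation."""
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = p
    return T

# Recorded demonstration (Quest frame) and a robot start pose in a different frame.
demo = np.stack([example_pose(0.2 * i, [0.03 * i, 0.01 * i, 0.0]) for i in range(5)])
robot_start = example_pose(-0.8, [0.4, 0.1, 0.3])
targets = replay_targets(robot_start, adjacent_deltas(demo))
```

Replaying from the robot's current pose preserves the shape of the demonstrated motion without requiring any alignment between the Quest world frame and the robot base frame.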
A policy trained only on these rerecorded segments shows markedly more reliable placement behavior than the policy trained on the original UMI dataset, confirming that the placement failures are largely attributable to the visual domain shift.
This finding is compounded by our task design. During placement, the rack pegs are occluded from the wide-angle wrist camera, and with only relative pose state observations the policy is forced to rely almost entirely on the scene camera for this precise maneuver. We believe the teleoperation policy leverages the visual appearance of the follower arm itself as a cue for placement, a signal that human demonstrations cannot provide. Addressing this gap through masking, supplementary teleoperation data, or on-policy human corrections [8][9][10] is a priority for our next round of experiments.
Conclusion
This experience fine-tuning pi0.5 with data from our UMI device has raised a few primary issues we must address before continuing our scaling of UMI-based data collection.
From our experiments, it is essential that UMI implementations using scene cameras properly address the human-robot train-test domain shift, most likely using some form of augmentation or masking to replace the human arm with a rendered robotic one. An alternative is to construct the sensor suite and environment in a way that reduces exclusive reliance on the scene camera for more precise aspects of the selected task, but this potentially limits the generality of the system and does not resolve the fundamental issue.
A significant weakness of UMI is its inability to account for the imperfections of the hardware it is running on. The implicit assumption that robotic arms will perfectly execute the trajectories they are given is extremely fragile and limits the use of UMI-only policies to slow trajectories executed with light payloads on precise and expensive arms.
A primary advantage of learned policies over traditional motion planning is their ability to compensate for the imperfections of the hardware in a closed-loop fashion, allowing the use of much cheaper arms. Simply trading this benefit for more scalable data collection is not a clear improvement. It seems likely that this tradeoff could be addressed with supplementary teleoperation data, real-world reinforcement learning, or a learned mid-level controller that compensates for the hardware and the weight of grasped objects, possibly along the lines of rapid motor adaptation [11], but we leave these explorations for future work.
References
[1] Chi, Cheng, et al. Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. arXiv:2402.10329, https://arxiv.org/abs/2402.10329
[2] “Sunday Robotics | The Helpful Robotics Company.” Sunday.ai, https://www.sunday.ai/
[3] Black, Kevin, et al. “π0.5: a VLA with Open-World Generalization.” Physical Intelligence Blog, https://www.pi.website/blog/pi05
[4] “YAM Ultra – 6-DOF Arm.” I2RT Robotics, https://i2rt.com/products/yam-ultra-6-dof-arm
[5] Zhou, Yi, et al. On the Continuity of Rotation Representations in Neural Networks. arXiv:1812.07035, https://arxiv.org/pdf/1812.07035.pdf
[6] Black, Kevin, et al. Training-Time Action Conditioning for Efficient Real-Time Chunking. arXiv:2512.05964, https://arxiv.org/pdf/2512.05964.pdf
[7] Larchenko, Ilia, Gleb Zarin, and Akash Karnatak. Task Adaptation of Vision-Language-Action Model: 1st Place Solution for the 2025 BEHAVIOR Challenge. arXiv:2512.06951, https://arxiv.org/abs/2512.06951
[8] Xu, Xiaomeng, Jisang Park, Han Zhang, Eric Cousineau, Aditya Bhat, Jose Barreiros, Dian Wang, and Shuran Song. HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations. arXiv:2603.03243, https://arxiv.org/pdf/2603.03243.pdf
[9] Yu, Justin, Yide Shentu, Di Wu, Pieter Abbeel, Ken Goldberg, and Philipp Wu. EgoMI: Learning Active Vision and Whole-Body Manipulation from Egocentric Human Demonstrations. arXiv:2511.00153, https://arxiv.org/abs/2511.00153
[10] Chen, Sirui, Chen Wang, Kaden Nguyen, Li Fei-Fei, and C. Karen Liu. ARCap: Collecting High-Quality Human Demonstrations for Robot Learning with Augmented Reality Feedback. arXiv:2410.08464, https://arxiv.org/pdf/2410.08464.pdf
[11] Kumar, Ashish, Zipeng Fu, Deepak Pathak, and Jitendra Malik. RMA: Rapid Motor Adaptation for Legged Robots. arXiv:2107.04034, https://arxiv.org/abs/2107.04034
[12] Chen, Qianzhong, Justin Yu, Mac Schwager, Pieter Abbeel, Yide Shentu, and Philipp Wu. SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation. arXiv:2509.25358, https://arxiv.org/abs/2509.25358
[13] Peng, Xue Bin, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning. arXiv:1910.00177, https://arxiv.org/abs/1910.00177
[14] Physical Intelligence, et al. π*0.6: a VLA That Learns From Experience. arXiv:2511.14759, https://arxiv.org/abs/2511.14759
Appendix
Quest Calibration
The location of the Quest controller frame relative to the physical controller housing is undocumented, so we calibrate the rigid transform between the controller and end-effector frames. We mount the controller to the end-effector, command the arm through a wide range of translations and rotations, and record the Quest controller pose alongside the forward kinematics pose at each waypoint.
We then compute the relative transform between each pair of adjacent waypoints in both the FK and Quest streams independently. We optimize for the 6-DOF transform that, when applied to the raw Quest relative transforms, minimizes the sum of geodesic rotation distances and translation errors against the FK relative transforms. Operating on relative rather than absolute poses makes the calibration independent of the Quest's arbitrary world frame. The resulting transform allows us to convert relative demonstration poses from the Quest controllers directly into the end-effector frame used during policy deployment.
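With A_i the FK relative transforms and B_i the Quest relative transforms, the fixed transform X satisfies A_i X = X B_i, the classic hand-eye calibration relationship. The sketch below swaps our iterative optimization for a closed-form least-squares solution of that same relationship: rotation via Kabsch alignment of the axis-angle vectors, then translation via linear least squares. It is an illustrative alternative rather than our procedure, and all function names are hypothetical.

```python
import numpy as np

def rotation_vector(R: np.ndarray) -> np.ndarray:
    """Axis-angle vector of a 3x3 rotation (valid away from 180 degree turns)."""
    cos_angle = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    angle = np.arccos(cos_angle)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    norm = np.linalg.norm(axis)
    return np.zeros(3) if norm < 1e-9 else axis / norm * angle

def calibrate_controller_to_ee(fk_rel, quest_rel) -> np.ndarray:
    """Solve A_i X = X B_i for X, the controller frame expressed in the end-effector frame.

    fk_rel, quest_rel: sequences of 4x4 relative transforms between the same waypoint pairs.
    Needs rotations about at least two non-parallel axes to be well conditioned.
    """
    alphas = np.array([rotation_vector(A[:3, :3]) for A in fk_rel])
    betas = np.array([rotation_vector(B[:3, :3]) for B in quest_rel])
    # Rotation: alpha_i = R_X @ beta_i, solved with the Kabsch algorithm.
    U, _, Vt = np.linalg.svd(betas.T @ alphas)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R_X = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Translation: (R_A - I) t_X = R_X t_B - t_A, stacked over all waypoint pairs.
    M = np.concatenate([A[:3, :3] - np.eye(3) for A in fk_rel])
    b = np.concatenate(
        [R_X @ B[:3, 3] - A[:3, 3] for A, B in zip(fk_rel, quest_rel)]
    )
    t_X, *_ = np.linalg.lstsq(M, b, rcond=None)
    X = np.eye(4)
    X[:3, :3] = R_X
    X[:3, 3] = t_X
    return X

def quest_to_ee_relative(X: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Convert one Quest relative transform into the end-effector frame."""
    return X @ B @ np.linalg.inv(X)
```

A closed-form estimate like this can also serve as the initial guess for an iterative refinement that minimizes geodesic rotation distance and translation error directly.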
Automated Data Filtering
We experiment with using SARM [12] to filter out low-quality sections of suboptimal demonstrations containing pauses and jitters. We find that SARM can detect pauses and missed grasps, as shown below with decreasing progress at t=9s.
However, given several issues below, we choose to filter episodes manually.
Recovery data gets penalized.
SARM learns that approaching a plate correlates with increasing progress and pulling back correlates with decreasing progress, without distinguishing a failed grasp from a successful one. It therefore rates bad slide-ins favorably and penalizes the recovery pull-back, effectively down-weighting exactly the data that teaches robustness, as in the example below starting at t=14s. We believe this is a data imbalance issue caused by having very few recovery demonstrations.
Out-of-Distribution
We explore SARM weighting on our policy’s own rollouts to compare Advantage-Weighted Regression [13] with advantage-conditioned RL. We notice that SARM can predict incorrect progress on out-of-distribution states encountered by the policy. Below, SARM predicts increasing progress from t=4s to t=6s even as the arm misses the plate entirely and moves past it. The human intervention at t=7s brings SARM back to the correct progress. This highlights that SARM cannot be the sole filter for handling truly out-of-distribution demonstrations.
Comparing Recovery Demonstrations vs RL
After solving clean plate stacks, we move to picking from messy tubs with debris to get closer to real kitchen conditions. A key finding from clean stacks is that recovery demonstrations are critical for fully autonomous operation. Rather than collecting these manually, reinforcement learning (RL) offers an alternative: let the policy discover its own failure states, then intervene to demonstrate recovery. On messy tubs, we compare policies trained purely on teleoperation plus recovery data against policies trained with RL plus human feedback (RLHF), using pi*0.6 [14] as our RL method given its natural continuity from our pi0.5 baseline.
Starting from 400 clean stack demonstrations, we perform two rounds of 20 teleop demonstrations and RL rollouts with human corrections. We follow pi*0.6’s iterative cycle: train a value model on the current dataset, label the dataset with binary advantage indicators based on this value model, train a policy conditioned on these text-converted indicators, then add the policy’s rollouts to the dataset and repeat. Notably, we use far fewer rollouts than pi*0.6’s 600 per iteration, since 20 teleop episodes alone are nearly sufficient for this extra task.
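The labeling step of this cycle can be sketched in simplified form. Here the advantage is taken as the observed return minus the value model's prediction and thresholded into a binary indicator that is appended to the task prompt. The threshold, the prompt wording, and the function names are hypothetical, and the exact advantage estimator used in [14] may differ.

```python
import numpy as np

def binary_advantage_labels(returns: np.ndarray, values: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """True where the outcome beat the value model's prediction by more than threshold."""
    advantages = returns - values
    return advantages > threshold

def conditioned_prompt(task: str, is_positive: bool) -> str:
    """Convert a binary indicator into text appended to the task prompt (hypothetical wording)."""
    tag = "positive" if is_positive else "negative"
    return f"{task} Advantage: {tag}"

# Illustrative values for three dataset segments.
returns = np.array([1.0, 0.2, 0.5])
values = np.array([0.5, 0.6, 0.5])
labels = binary_advantage_labels(returns, values)
prompts = [conditioned_prompt("pick the top plate and rack it.", bool(flag)) for flag in labels]
```

At inference time the policy would be prompted with the positive tag only, so the scheme depends on the dataset containing enough clearly negative segments for the two prompts to produce different behavior.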
For intuitive interventions, our leader arm shadows the follower arm at all times. Pressing a button seamlessly switches control from policy to leader. Following pi*0.6, we reserve interventions for genuine failures rather than preemptive corrections, specifically when the robot is stuck or will cause damage. We let the policy fully reach strange states before correcting; for example, the policy below explores an infeasible grasp direction before we correct it.
We also let the policy carry the plate in a precarious way over the edge of the table, and only intervene after it has reached over the table:
Evaluation
We set up 5 tubs in nearly identical configurations for each policy, measuring total completion time and intervention count across 3 repeated runs. Interventions were made only for unsafe behavior or when the policy was stuck for 20 seconds.
Results
Overall, given this new task’s similarity to the clean stack task, just 20 demonstrations were enough to achieve solid performance, and the most effective strategy turned out to be collecting diverse expert recovery demonstrations starting directly from failure states. As mentioned above, we intervene late, letting the policy initially explore some suboptimal behavior. With so little data and no failure episodes at all, our value model overfits to this data and predicts higher return even during the bad motions. The RL policy subsequently imitates all these rollouts with the “good” conditioning signal, even the bad behavior, causing it to fail more during evaluation:
We believe that with increasing task complexity and duration, it will be more difficult to farm out all the necessary expert recoveries, making RL more effective at improving robustness. In hindsight, we should have set up proper environment resets to enable many autonomous RL rollouts. We also did not collect any “failure” episodes, which are an important factor in the ground-truth rewards.
Advantage Conditioning Experiments
Despite several attempts, the policy consistently ignores good/bad conditioning prompts and behaves identically regardless. We conducted a synthetic augmentation experiment, randomly assigning “bad” conditioning with some probability and inverting the entire ground-truth action chunk. The policy did properly learn to use conditioning in this extreme setting, confirming the logic was sound. This points instead to insufficient dataset scale and too little behavioral contrast between good and bad examples across enough states in the trajectory.










