There is a pattern that everyone in industrial robotics recognizes but almost nobody talks about publicly. A robot system performs well in controlled testing. The demo is impressive. The success rate looks strong. Then it gets deployed into a real facility and performance drops sharply. Objects are not quite where they should be. Lighting is different. Product comes in with more variation than the test set contained. The robot starts failing in ways that were never observed during evaluation. This is called distribution shift. It is the most common reason robot deployments underperform, and it is almost entirely a data problem.
What distribution shift actually means
A robot policy is a function that maps observations to actions. It learns this mapping from training data. When it encounters situations during deployment that resemble what it saw during training, it performs well. When it encounters situations that differ from the training distribution, performance degrades, sometimes catastrophically.

The problem is that lab environments and production environments are fundamentally different distributions. In a lab, objects are placed precisely. Lighting is controlled. The same objects appear repeatedly. Surface textures are consistent. Clutter is minimal. The researcher can design the evaluation to match the training data.

In a warehouse or factory, objects arrive in slightly different orientations every time. Lighting shifts across shifts and seasons. Product SKUs change. Surfaces accumulate dust and residue. Unexpected items appear in the workspace. The environment continuously generates variation that no lab could fully anticipate. A model trained in a lab has learned to handle the lab distribution. When you put it in production, you are asking it to generalize to a different distribution without telling it the distribution changed.
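The effect is easy to see in miniature. The toy sketch below (everything here is illustrative, not a real robot policy) fits a simple linear "policy" on observations drawn from a narrow lab-like range, then evaluates it on a shifted production-like range. The same model, unchanged, degrades sharply the moment the input distribution moves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth mapping from observation to correct action,
# e.g. a grasp offset as a nonlinear function of object position.
def true_action(obs):
    return np.sin(obs) + 0.5 * obs

# "Lab" training distribution: observations tightly clustered in [0, 1].
train_obs = rng.uniform(0.0, 1.0, size=500)
train_act = true_action(train_obs)

# Fit a simple linear policy. Within the lab range it tracks the
# true mapping closely, because sin(x) is nearly linear there.
policy = np.poly1d(np.polyfit(train_obs, train_act, deg=1))

def mse(obs):
    return float(np.mean((policy(obs) - true_action(obs)) ** 2))

# In-distribution evaluation: same range as training.
lab_error = mse(rng.uniform(0.0, 1.0, size=500))

# "Production" distribution: observations the lab never covered.
prod_error = mse(rng.uniform(2.0, 4.0, size=500))

print(f"lab MSE:        {lab_error:.4f}")
print(f"production MSE: {prod_error:.4f}")
```

Nothing about the policy changed between the two evaluations; only the distribution of inputs did. That is the entire phenomenon in two lines of arithmetic.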
The long tail is where failures live
The Yamane et al. paper we referenced in our previous post makes a related point about sensing fidelity. When researchers removed force feedback from the teleoperation system, performance collapsed on tasks that required physical judgment: grasping objects of unusual width, recovering from misalignment, completing contact-dependent placements. These failure modes are exactly what the long tail looks like. The most common cases might work fine. It is the edge cases that accumulate in production and drag down operational reliability.

The long tail contains objects near the edge of a bin rather than the center; packaging that has been compressed or deformed in transit; items stacked at angles that were not in the training set; conveyors running slightly faster or slower than expected; ambient conditions that affect sensing; human workers entering the workspace in ways that were not anticipated. A model trained on lab data has seen almost none of this. A model trained on production data has seen all of it, because production is where it happens.
Why simulation does not solve the problem
The obvious response to lab limitations is simulation. Build a high-fidelity simulation of the target environment, collect millions of trajectories, train a robust policy. This approach has produced real progress, particularly for locomotion tasks where physics simulation is relatively accurate. For manipulation, especially contact-rich manipulation, the sim-to-real gap remains a serious constraint. Contact dynamics are hard to simulate faithfully. The way a plastic bottle deforms under gripper pressure, the friction coefficient of a cardboard surface after humidity exposure, the behavior of a pile of irregularly shaped components: these are genuinely difficult to capture in simulation with the fidelity that reliable manipulation requires. Sim-to-real transfer works best when the gap between simulated and real physics is small. For the kinds of tasks that are most valuable in industrial settings, that gap is not small.
Production data is structurally different, not just larger
This is the point that matters most for anyone thinking about physical AI data strategy. The value of production data is not simply that there is more of it. It is that it contains a fundamentally different distribution of situations than lab or simulated data.

Production data contains recovery behaviors. Operators handling real objects in real environments naturally encounter and resolve edge cases as part of normal operation. Those recovery behaviors are encoded in the demonstration data if the data collection infrastructure captures them. Lab data almost never contains genuine recovery behaviors because labs are designed to prevent the conditions that require recovery.

Production data contains temporal variation. The same task looks different at 6am, 2pm, and midnight. It looks different in winter and summer. It looks different when the conveyor is running at 80% speed versus 100%. All of this variation is invisible to a lab-trained policy and visible to a production-trained one.

Production data contains real object variability. The objects that appear in industrial facilities come with natural variation in dimensions, weight, surface finish, and condition. This variation is the norm, not the exception, and a policy trained on it learns to handle it as such.
The implication for how deployments should be designed
If distribution shift is the core problem, then the architecture of a deployment matters as much as the quality of the underlying model. A deployment that generates ongoing production data, feeds it back into model improvement, and progressively expands the robot's competence across the real distribution it encounters is a fundamentally different proposition than a one-time deployment of a fixed policy. This is the logic behind Telepath's operating model. Our robots deploy into real facilities. The data generated by those deployments reflects the actual production distribution of those environments. That data feeds into autonomy development that progressively closes the gap between what the robot can handle independently and what the full range of production conditions requires. The distribution shift problem does not go away. But you can build a system that continuously narrows it, if the deployment architecture is designed to do that from the start. Most are not. The industry will get there. The companies that designed for this from day one will have compounding advantages by the time everyone else catches up.
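The deploy-collect-retrain loop described above can be sketched in a few lines. This is a deliberately simplified simulation, not Telepath's actual pipeline: production traffic is modeled as draws from a fixed set of situation types, anything outside the policy's competence triggers an operator intervention, and interventions are folded back into the training set each cycle. All names and numbers are illustrative.

```python
import random

random.seed(7)

# Stand-in for the production distribution: 100 distinct situation types.
production_situations = list(range(100))

# The initial policy covers only a small slice of that distribution.
known = set(range(10))

def autonomy_rate():
    """Fraction of situation types the policy can handle on its own."""
    return len(known) / len(production_situations)

history = [autonomy_rate()]
for week in range(10):
    # This week's traffic: a sample of real situations hits the robot.
    batch = random.sample(production_situations, 20)
    # Situations outside the policy's competence are resolved by an
    # operator; those episodes become new training data.
    interventions = [s for s in batch if s not in known]
    known.update(interventions)  # retrain on the operator demonstrations
    history.append(autonomy_rate())

print("autonomy rate over time:", [f"{r:.0%}" for r in history])
```

The key property is that coverage only ever grows, and it grows fastest exactly where the deployment is weakest, because interventions are concentrated in the situations the policy cannot yet handle. A one-time deployment of a fixed policy is the same loop with the update step deleted: its autonomy rate is flat forever.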