Insight Post 4

The Perception Problem Nobody Is Talking About

Key Takeaways

  • Many manipulation failures are perception failures, not only control or policy failures.
  • Narrow camera field-of-view creates blind spots in real industrial environments.
  • Sensor architecture choices shape the quality and bias of the training data you collect.
  • Perception strategy should be designed as core infrastructure, not added late.

The robotics industry obsesses over manipulation. Which gripper. Which arm. Which force controller. Which policy architecture. Almost nobody is talking about what the robot can see. This is a significant blind spot, and a paper published in March 2026 makes the scale of it concrete in a way that is hard to ignore.

Researchers at HKUST built a visuomotor policy for humanoid robots using panoramic LiDAR instead of conventional RGB-D cameras. They then tested it against three state-of-the-art vision-based baselines across a set of manipulation tasks, including tasks where the target object was outside the camera's field of view. The vision-based systems, Diffusion Policy, DP3, and iDP3, are all serious methods with strong results in the literature. Their scores on out-of-view tasks: 0 out of 20. Across all three systems. Every trial failed. The LiDAR-based system: 12 out of 20.

The failure mode is not a model quality issue. The models are fine. The problem is that the sensors cannot see what they need to see, and no amount of model sophistication compensates for missing observational data.

Why conventional robot vision has a structural problem

The standard perception setup for robot manipulation is one or more RGB-D cameras, typically head-mounted or wrist-mounted, providing a depth-enhanced image of the robot's forward view. This works well when the robot and the task are aligned: the object is in front of the robot, within the camera's field of view, at a reasonable distance, under consistent lighting conditions.

Industrial environments are not always like this. Objects may be distributed across a wide area around the robot. In a warehouse pick scenario, relevant items may be behind the robot, to the side, or in a bin that requires the arm to reach at an angle that takes the camera off-target. In a multi-step assembly task, the robot may need to track several objects across a workspace that exceeds the camera's coverage. In environments where the robot cannot reposition its base freely, the fixed camera creates a permanent blind spot for anything outside its narrow cone.

The standard response to this problem is to move the robot. Walk to a better position. Rotate the base. Adjust the head. But repositioning takes time, introduces motion uncertainty, and in physically constrained environments may simply not be possible.

The HKUST paper takes a different approach: fix the perception problem at the sensor level rather than compensating for it with locomotion. A LiDAR with 360-degree horizontal coverage sees the full environment without the robot moving. Objects anywhere around the robot's body are visible in the point cloud at all times.
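The geometry of the blind spot is easy to make concrete. A minimal sketch, under assumed parameters (the 69-degree horizontal FOV is illustrative, roughly in the range of common RGB-D cameras, not a figure from the paper): a forward camera sees only bearings inside its cone, while a panoramic LiDAR covers every horizontal bearing, occlusion aside.

```python
def in_camera_fov(robot_yaw_deg, object_bearing_deg, hfov_deg=69.0):
    """True if the object's bearing falls inside a forward camera's
    horizontal field of view. 69 degrees is an illustrative HFOV,
    not a value taken from the HKUST paper."""
    # Signed angular offset between the camera axis and the object,
    # wrapped into [-180, 180) so "behind the robot" reads as +/-180.
    offset = (object_bearing_deg - robot_yaw_deg + 180.0) % 360.0 - 180.0
    return abs(offset) <= hfov_deg / 2.0

def lidar_visible(robot_yaw_deg, object_bearing_deg):
    """A 360-degree horizontal LiDAR covers every bearing around the body."""
    return True

# An object 90 degrees off-axis: outside the camera cone, inside LiDAR coverage.
print(in_camera_fov(0.0, 90.0))   # False
print(lidar_visible(0.0, 90.0))   # True
```

The point of the sketch is that visibility is a hard binary at the sensor level: for the camera, any bearing outside the cone returns no observation at all, which is exactly the condition the out-of-view benchmark tasks probe.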

The broader point: sensor architecture is infrastructure

The manipulation research community has made substantial progress on policy learning. The papers coming out of major labs on diffusion policies, flow matching, and transformer-based action models are genuinely impressive. Real capability is being built. But there is an implicit assumption baked into most of this work: that the robot can observe what it needs to observe to complete the task. The perception problem is treated as solved, or at least not the interesting problem. In production environments, this assumption frequently fails. And unlike model quality, which can be improved through retraining, sensor architecture is infrastructure. Once you have built a robot platform, the sensor configuration is largely fixed. A camera-based platform cannot easily be converted to panoramic perception. The perception decisions made at system design time propagate through every deployment that system ever runs. This makes sensor architecture a consequential decision that deserves much more deliberate attention than it typically receives. The question of what a robot can observe should be treated with the same seriousness as the question of what it can do.

What this means for physical AI training data

The perception point connects directly to the data quality argument. A robot with narrow-field-of-view cameras will, by construction, generate training data that reflects a narrow-field-of-view world. The demonstrations will be biased toward objects that happen to be in front of the camera. Recovery behaviors from out-of-view states will be underrepresented. The model trained on that data will inherit those biases. A robot with panoramic perception generates data that reflects the full spatial context of the task. The relationship between what is happening at the manipulation point and what is happening in the broader environment is visible in the training signal. The model can learn spatial awareness, not just local dexterity.

For physical AI teams building foundation models that need to generalize across environments and tasks, the question of what is in the training data matters enormously. A dataset built from narrow-FOV demonstrations contains different information than a dataset built from full-context demonstrations, even if the manipulation tasks are identical.
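This kind of bias is auditable before any model is trained. A minimal sketch of such an audit, under assumed conventions (a hypothetical frame format of robot position, heading, and target position, and the same illustrative 69-degree HFOV): measure what fraction of demonstration frames place the target outside a forward camera cone.

```python
import math

def relative_bearing_deg(robot_xy, robot_yaw_deg, obj_xy):
    """Bearing of the target relative to the robot's heading, in degrees,
    wrapped into [-180, 180)."""
    dx, dy = obj_xy[0] - robot_xy[0], obj_xy[1] - robot_xy[1]
    world_bearing = math.degrees(math.atan2(dy, dx))
    return (world_bearing - robot_yaw_deg + 180.0) % 360.0 - 180.0

def out_of_view_fraction(frames, hfov_deg=69.0):
    """Fraction of frames whose target lies outside a forward camera cone.
    Frame format (robot_xy, yaw_deg, obj_xy) is a hypothetical convention."""
    half = hfov_deg / 2.0
    out = sum(1 for robot_xy, yaw, obj_xy in frames
              if abs(relative_bearing_deg(robot_xy, yaw, obj_xy)) > half)
    return out / len(frames)

# Toy demonstration set: one target ahead, one to the side, one behind.
frames = [((0, 0), 0.0, (1, 0)),
          ((0, 0), 0.0, (0, 1)),
          ((0, 0), 0.0, (-1, 0))]
print(out_of_view_fraction(frames))  # 2/3: most of this toy set is invisible to the camera
```

A high out-of-view fraction means a camera-only capture rig would simply never record those states, so the resulting dataset underrepresents them by construction, not by sampling accident.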

The signal for where the industry is heading

The HKUST paper is a research result, not a product. But research results lead deployment trends by two to four years, and the signal here is clear: panoramic perception for manipulation is coming, and the performance advantages over camera-based systems in realistic environments are significant enough to drive adoption. The teams that are thinking about perception architecture now, not after their hardware is locked in and their data pipelines are built, will have an easier transition. At Telepath, we think about the full sensor stack as part of deployment design, not as an afterthought. The data our systems generate needs to be useful not just for the tasks we are deploying today but for the foundation models that will run on future hardware with better perception capabilities. That means being deliberate about what the training signal actually contains, which means being deliberate about what the sensors can actually see. The manipulation problem is getting solved. The perception problem is less solved than most people think. That gap is worth paying attention to.