Multi-Camera View Scaling for Data-Efficient Robot Imitation Learning

Published 1 Apr 2026 in cs.RO, cs.CV, and cs.LG | (2604.00557v1)

Abstract: The generalization ability of imitation learning policies for robotic manipulation is fundamentally constrained by the diversity of expert demonstrations, while collecting demonstrations across varied environments is costly and difficult in practice. In this paper, we propose a practical framework that exploits inherent scene diversity without additional human effort by scaling camera views during demonstration collection. Instead of acquiring more trajectories, multiple synchronized camera perspectives are used to generate pseudo-demonstrations from each expert trajectory, which enriches the training distribution and improves viewpoint invariance in visual representations. We analyze how different action spaces interact with view scaling and show that camera-space representations further enhance diversity. In addition, we introduce a multiview action aggregation method that allows single-view policies to benefit from multiple cameras during deployment. Extensive experiments in simulation and real-world manipulation tasks demonstrate significant gains in data efficiency and generalization compared to single-view baselines. Our results suggest that scaling camera views provides a practical and scalable solution for imitation learning, which requires minimal additional hardware setup and integrates seamlessly with existing imitation learning algorithms. The website of our project is https://yichen928.github.io/robot_multiview.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper demonstrates that multi-camera view scaling transforms single demonstrations into rich pseudo-demonstrations, boosting task success rates by 12–38%.
It systematically compares action labeling in base, end-effector, and camera spaces, showing that camera space labeling yields superior performance.
The approach achieves robust generalization in both simulated and real-world tasks, effectively doubling success rates in data-scarce regimes with minimal extra cost.

Multi-Camera View Scaling for Data-Efficient Robot Imitation Learning

Motivation and Background

Scaling the generalization performance of imitation learning (IL) policies for vision-based robot manipulation is fundamentally dependent on exposure to diverse training observations. Traditional approaches, which increase the size of expert demonstration data, incur substantial human effort and cost, especially when varied environments and scene layouts are required. Recent IL advances have focused on architectural improvements and leveraging large-scale cross-domain datasets, but efficient strategies for maximizing data diversity with constrained supervision remain crucial.

The paper "Multi-Camera View Scaling for Data-Efficient Robot Imitation Learning" (2604.00557) postulates that much of the visual diversity inherently available in most IL setups is underused. Whereas demonstrations typically contain merely a single, fixed-view camera stream, robot manipulations occur within 3D environments and thus admit significant appearance change under varying viewpoints. The authors propose to exploit this by systematically scaling camera views during demonstration collection, creating a scalable pipeline for generating pseudo-demonstrations from multiple synchronized perspectives.

Methodology

The proposed approach attaches multiple fixed cameras at distinct extrinsic configurations around the workspace during expert demonstration collection. From each physical demonstration, the system synthesizes multiple pseudo-demonstrations, each corresponding to a distinct camera perspective while preserving temporally-aligned robot actions.

Figure 1: Camera view scaling transforms a single expert trajectory into multiple pseudo-demonstrations using multiview camera configurations.

Action Space Transformation

A key technical component investigated is the interaction between visual input diversity and action representation. For each image-action pair in the pseudo-demonstrations, the action label can be defined in several reference frames (Figure 2):

Figure 2: Comparison among robot base, camera, and end-effector (EEF) action spaces for labeling pseudo-demonstrations.

Base Space: Actions expressed in the robot’s base frame; the same action label is applied to all views, ensuring consistency but not exploiting the full view-pair diversity.
End-Effector (EEF) Space: Actions parameterized with respect to the EEF's current pose; introduces increased learning complexity due to the moving local frame.
Camera Space: Actions are projected into each camera's coordinate frame using known extrinsics, maximizing diversity by assigning distinct action labels for each view.

The framework supports all these spaces, but experiments demonstrate particular generalization gains for camera space.

Pseudo-Demonstration Generation and Training

During training, the effective demonstration set is expanded to $N \times V$ , where $N$ is the number of original demonstrations and $V$ is the number of camera views. The authors integrate this with a diffusion policy framework, employing a vision encoder and denoising process, where pseudo-demonstrations are randomly sampled per batch to maintain viewpoint diversity.

Multiview Aggregation at Inference

In addition to training-time augmentation, the model supports a multiview aggregation mechanism at deployment (Algorithm 1 in the paper). When multiple camera views are available during inference, the single-view-trained policy independently predicts actions per view; these predictions are then fused via probabilistic composition (by combining distributions or means in the action space). This approach is agnostic to architecture and preserves high deployment efficiency.

Empirical Evaluation

The framework is validated via extensive experiments in both simulated environments (robomimic suite) and real-world settings (pouring manipulation task with a FANUC CRX-10iA arm). The following key empirical results stand out:

Training with three or five camera perspectives per demonstration consistently yields 12–38% absolute improvements in task success rates over single-camera baselines, given equal numbers of physical demonstrations.
The effect is amplified in data-scarce regimes: with only 10 demonstrations, data scaling via camera views achieves more than double the success rate on several tasks.
Camera space action labeling further improves performance relative to base and EEF space on most setups.
Attention visualizations indicate that policies trained with view scaling focus more robustly on manipulation-relevant regions in visual input.

Figure 3: Training with five camera views promotes more consistent and task-relevant visual attention compared to single-view training.

Policies trained with view scaling generalize more robustly to unseen camera views during inference, enabling flexible deployment.
In real-robot experiments, the view-scaled framework achieves up to 75% success with only 25 two-view demonstrations, compared to 45% with single-view data.
The multiview inference aggregation algorithm yields additive performance gain when multiple camera feeds are available, without retraining.
Figure 4: The real-robot setup utilizes a FANUC CRX-10iA arm with dual fixed cameras capturing the workspace.

Theoretical and Practical Implications

This work provides a systematic analysis of how leveraging underexploited observation diversity can close the generalization gap in IL for robot manipulation. It demonstrates that scene diversity via camera view scaling is an effective substitute for scene layout diversity, aligning with the scaling law observations made in concurrent studies on IL dataset composition. The results challenge the assumption that further generalization requires exponentially increasing human demonstrations or complex domain randomization.

From a practical perspective, the approach suggests that careful sensor configuration and geometric calibration during data collection can yield substantial efficiency gains. The method integrates seamlessly with modern visual policy learning pipelines, requiring no architecture modifications and minimal runtime cost.

Future Directions

Further research directions include combining view scaling with other scene diversification approaches (such as environment randomization or cross-embodiment data), exploiting more scalable calibration-free or self-supervised extrinsics estimation, and extending the aggregation technique to closed-loop reactive control. Additionally, the impact of enlarged camera sets and strategic view placement remains to be systematically explored, particularly in more complex real-world scenes.

Conclusion

Camera view scaling presents a principled approach for data-efficient imitation learning in robot manipulation, enabling substantial gains in policy generalization with minimal human supervision. By fully leveraging workspace visual diversity and appropriately transforming action labels, IL pipelines can approach the performance typically associated with much larger and more costly datasets. The framework sets the stage for more systematic exploitation of inherent multimodal observation structures in robotics learning setups, with direct relevance to scaling up generalist robot agents and robust deployment in the wild.

Markdown Report Issue