Harness heterogeneous foundation models to improve robotic policies

Determine effective mechanisms for incorporating heterogeneous perception foundation models—including 3D reconstruction and geometry-aware models, articulation and kinematic reasoning models, human motion estimation models, and object tracking systems—into robotic manipulation policies to improve performance.

Background

The paper notes that while Vision-LLMs are increasingly used for task specification and generalization in robot manipulation, a broad set of other foundation models with strong physical and geometric priors remains underutilized. These include 3D reconstruction, articulation and kinematic reasoning, human motion estimation, and object tracking models.

The authors identify a gap in how these heterogeneous models can be systematically exploited to improve robotic control policies. OmniGuide is proposed as a unified inference-time guidance framework to address this challenge: it expresses diverse guidance sources as attractive and repulsive energy fields acting in 3D space. Even so, the general problem of how best to harness such models remains a stated open question in the field.
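To make the energy-field formulation concrete, here is a minimal sketch of inference-time guidance in 3D: goal-directed cues contribute attractive quadratic wells, while avoidance cues contribute repulsive inverse-distance barriers, and a point (e.g., an end-effector position) descends the combined energy. All names, functional forms, and gains below are illustrative assumptions, not OmniGuide's actual guidance fields.

```python
import numpy as np

def guidance_energy(x, attractors, repulsors, k_att=1.0, k_rep=0.5, eps=1e-6):
    """Scalar guidance energy at a 3D point x.

    Attractors contribute quadratic wells (pulling x toward goals);
    repulsors contribute inverse-distance barriers (pushing x away).
    """
    e = 0.0
    for a in attractors:                # attractive: 0.5 * k * ||x - a||^2
        e += 0.5 * k_att * float(np.sum((x - a) ** 2))
    for r in repulsors:                 # repulsive: k / (||x - r|| + eps)
        e += k_rep / (float(np.linalg.norm(x - r)) + eps)
    return e

def guidance_step(x, attractors, repulsors, lr=0.1, h=1e-4):
    """One gradient-descent step on the energy (central-difference gradient)."""
    grad = np.zeros(3)
    for i in range(3):
        d = np.zeros(3)
        d[i] = h
        grad[i] = (guidance_energy(x + d, attractors, repulsors)
                   - guidance_energy(x - d, attractors, repulsors)) / (2 * h)
    return x - lr * grad

# Toy rollout: steer a point toward a goal while skirting an obstacle.
goal = [np.array([1.0, 0.0, 0.0])]
obstacle = [np.array([0.5, 0.2, 0.0])]
x = np.array([0.0, 0.0, 0.0])
for _ in range(200):
    x = guidance_step(x, goal, obstacle)
```

In a framework like OmniGuide, the attractor and repulsor sets would be populated from heterogeneous model outputs (e.g., tracked object poses or reconstructed surfaces) rather than hand-placed points, and the descent direction would bias the policy's sampled actions at inference time.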

References

"Yet, it remains unclear how to harness these heterogeneous models to improve robotic policies."

OmniGuide: Universal Guidance Fields for Enhancing Generalist Robot Policies (2603.10052 - Song et al., 9 Mar 2026) in Section 2.3 (Harnessing Foundation Priors for Robot Manipulation)