Sequential and compositional generalization in model-free imitation learning from pixels

Determine whether model-free imitation learning from raw image observations can achieve sequential and compositional generalization to novel tasks beyond the training distribution, particularly in settings with long-horizon tasks and small demonstration datasets.

Background

The paper motivates a model-based approach by noting limitations of model-free imitation learning from pixels: while such methods can solve complex tasks, they often fail to generalize compositionally and sequentially to tasks outside the training distribution, especially when horizons are long and data are scarce.

This open challenge frames the authors’ contribution: they propose pix2pred, which invents and grounds symbolic predicates from images via pretrained vision-language models and plans over learned operators, aiming to improve generalization. Nonetheless, the broader question of achieving robust sequential and compositional generalization with model-free imitation learning itself remains explicitly stated as open.

References

However, sequential and compositional generalization to novel tasks beyond the training distribution remain open challenges, especially when task horizons are long and demonstration datasets are small.

From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models (2501.00296 - Athalye et al., 2024) in Section 1 (Introduction)