Exploration-Driven Generative Interactive Environments

Published 3 Apr 2025 in cs.CV (arXiv:2504.02515v1)

Abstract: Modern world models require the costly and time-consuming collection of large video datasets with action demonstrations, either by people or by environment-specific agents. To simplify training, we focus on using many virtual environments as a source of inexpensive, automatically collected interaction data. Genie, a recent multi-environment world model, demonstrates the ability to simulate many environments with shared behavior. Unfortunately, training its model requires expensive demonstrations. We therefore propose a training framework that uses only a random agent in virtual environments. While a model trained this way exhibits good control, it is limited by what random exploration can reach. To address this limitation, we propose the AutoExplore Agent - an exploration agent that relies entirely on the uncertainty of the world model, delivering diverse data from which the model can learn best. Our agent is fully independent of environment-specific rewards and thus adapts easily to new environments. With this approach, the pretrained multi-environment model can quickly adapt to new environments, improving video fidelity and controllability. To automatically obtain large-scale interaction datasets for pretraining, we group environments with similar behavior and controls. To this end, we annotate the behavior and controls of 974 virtual environments - a dataset that we name RetroAct. To build our model, we first create an open implementation of Genie - GenieRedux - and apply enhancements and adaptations in our version, GenieRedux-G. Our code and data are available at https://github.com/insait-institute/GenieRedux.

Summary

Insightful Overview of "Exploration-Driven Generative Interactive Environments"

The paper "Exploration-Driven Generative Interactive Environments" by Savov et al. presents an innovative framework to address challenges inherent in training world models for interactive environments, utilizing exploration-driven data collection instead of traditional demonstrations. This research contributes significantly to the automation of data gathering in virtual environments, reducing reliance on labor-intensive and costly human intervention.

The study introduces the "AutoExplore Agent," which strategically maximizes the uncertainty of the world model's predictions to gather diverse and pertinent interaction data. The agent operates independently of environment-specific rewards, enabling seamless adaptation across virtual environments - a step towards generalized and efficient world modeling.
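The exploration loop described here, scoring candidate actions by the world model's predictive uncertainty and taking the most uncertain one, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `toy_world_model` is a deterministic stand-in for the learned dynamics model, and mean token entropy is one plausible uncertainty proxy among several.

```python
import numpy as np

def token_entropy(probs):
    """Mean entropy over per-token next-frame distributions
    (a simple proxy for world-model uncertainty)."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

def toy_world_model(state, action, n_tokens=16, vocab=8):
    """Stand-in for a learned dynamics model: predicts a distribution
    over the token vocabulary for each next-frame token. Seeded from
    (state, action) so the sketch is reproducible."""
    local = np.random.default_rng(hash((state, action)) % (2**32))
    logits = local.normal(size=(n_tokens, vocab))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def autoexplore_step(state, actions):
    """Reward-free exploration: pick the action whose predicted
    outcome the world model is least certain about."""
    scores = {a: token_entropy(toy_world_model(state, a)) for a in actions}
    return max(scores, key=scores.get)

chosen = autoexplore_step(state=0, actions=range(4))
```

Because the objective depends only on the model's own uncertainty, no environment reward ever enters the loop, which is what allows the same agent to move between environments unchanged.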

This work hinges on interaction data drawn from many virtual environments. The researchers curated a substantial dataset named RetroAct, comprising 974 virtual environments with detailed annotations for behavior classification and control mapping. This extensive dataset underscores the model's capacity to generalize across diverse settings and provides a benchmark for future work on interactive AI environments.

The architectural composition of the proposed model, GenieRedux, and its enhanced version, GenieRedux-G, demonstrates significant improvements over predecessors like Genie. These models circumvent the limitations of the Latent Action Model (LAM) by using ground-truth actions directly, which enhances both visual fidelity and controllability. The Token Distance Cross-Entropy Loss introduced in GenieRedux-G is particularly noteworthy: by accounting for inter-token distances during token classification, it yields better token prediction and less visual degradation in generated frames.
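A loss of this flavor can be illustrated with distance-aware soft targets: tokens whose codebook embeddings lie near the ground-truth token receive some target mass, so a near miss in token space is penalized less than a distant one. This is one reading of the idea rather than the paper's exact formulation; the exponential weighting and the temperature `tau` are illustrative choices.

```python
import numpy as np

def token_distance_ce(logits, target, codebook, tau=1.0):
    """Cross-entropy against a distance-aware soft target.

    logits:   (vocab,) unnormalized scores for one predicted token
    target:   index of the ground-truth token
    codebook: (vocab, d) token embeddings used to measure distance
    """
    # Distance of every codebook entry to the ground-truth entry.
    d = np.linalg.norm(codebook - codebook[target], axis=-1)
    # Nearby tokens get more target mass; exp(-d/tau) decays with distance.
    soft = np.exp(-d / tau)
    soft /= soft.sum()
    # Numerically stable log-softmax of the predictions.
    m = logits.max()
    logp = logits - (m + np.log(np.exp(logits - m).sum()))
    return float(-(soft * logp).sum())
```

With `tau -> 0` the soft target collapses to a one-hot vector and the loss reduces to ordinary cross-entropy, so the distance term can be seen as a smoothing that respects codebook geometry.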

Quantitative measures further validate the model's capabilities. For GenieRedux-G trained on the Platformers-50 dataset, metrics reveal strong video fidelity (PSNR of 26.36) and controllability (∆PSNR of 0.450). Importantly, fine-tuning the model with exploration-derived data yielded improvements of up to 7.4 PSNR in visual fidelity and up to 1.4 ∆PSNR in controllability, clearly demonstrating the advantages of exploration-driven data collection.
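For readers unfamiliar with these metrics: PSNR measures reconstruction fidelity, while ∆PSNR, as used in Genie-style evaluations, gauges controllability as the fidelity drop when the conditioning actions are randomized (a genuinely controllable model degrades noticeably; an action-ignoring one does not). A minimal sketch, with the ∆PSNR formulation reflecting our understanding of the metric:

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio between two frames with values in [0, max_val]."""
    mse = np.mean((x - y) ** 2)
    return float(10.0 * np.log10(max_val**2 / mse))

def delta_psnr(gt, pred_true_actions, pred_random_actions):
    """Controllability proxy: how much fidelity drops when the model is
    conditioned on random instead of ground-truth actions.
    Larger means the model actually responds to its action inputs."""
    return psnr(gt, pred_true_actions) - psnr(gt, pred_random_actions)
```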

The implications of this work are multifaceted:
1. Practical: The methodology serves as a more cost-effective and scalable solution for collecting diverse datasets without human intervention, poised to benefit domains where large, annotated datasets are infeasible.
2. Theoretical: This approach of maximizing world model uncertainty during data collection could inspire novel curiosity-driven exploration strategies in reinforcement learning and simulation tasks.
3. Future Directions: The approach could evolve to enhance generalization, permitting world models trained in virtual setups to transfer learned behaviors and rules to the physical world or unobserved environments, broadening the scope and applicability of AI simulations.

Advancing this line of research will likely converge on optimizing model training dynamics for increased data diversity and fidelity, leading to robust, flexible world models that push the boundaries of autonomous AI interaction in complex environments.
