Genie: Generative Interactive Environments

Published 23 Feb 2024 in cs.LG, cs.AI, and cs.CV | (2402.15391v1)

Abstract: We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It is comprised of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.


Summary

  • The paper presents a novel framework combining spatiotemporal video tokenization, latent action inference, and autoregressive dynamics to achieve controllable video generation.
  • The methodology utilizes ST-transformers to efficiently handle spatial and temporal self-attention, addressing the quadratic scaling limitations of traditional models.
  • Experimental results demonstrate that the 11-billion-parameter Genie model outperforms conventional approaches in video fidelity and controllability across diverse datasets.

Genie: Generative Interactive Environments

The paper "Genie: Generative Interactive Environments" introduces a novel generative model framework for creating interactive and action-controllable virtual environments through unsupervised learning from unlabelled internet videos. This approach uses a substantial model architecture and explores the potential of autonomous creation of diverse and dynamic environments.

Methodology

The Genie model employs a multifaceted architecture composed of three primary components: a spatiotemporal video tokenizer, a latent action model, and an autoregressive dynamics model. The framework adapts ST-transformers for efficient video generation. The video tokenizer encodes video frames into discrete tokens, which serve as inputs for the dynamics model. A unique latent action model infers actions in an unsupervised manner, allowing the system to generate controllable video sequences without explicit action labels (see Figure 1).

Figure 1: Genie model training. Genie takes in T frames of video as input, tokenizes them into discrete tokens z via the video tokenizer, and infers the latent actions ã between each frame.
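
To make the interplay of these components concrete, the following Python sketch shows a hypothetical frame-by-frame generation loop. The object names and method signatures (tokenizer.encode, tokenizer.decode, dynamics_model.predict) are illustrative placeholders, not the authors' API.

```python
# Hypothetical sketch of Genie-style frame-by-frame interactive generation.
# `tokenizer` and `dynamics_model` are assumed placeholder objects, not the
# authors' actual implementation or API.

def play(prompt_frame, actions, tokenizer, dynamics_model):
    """Roll out a playable sequence from a single prompt image.

    actions: a list of user-chosen latent action indices (e.g. in [0, 7]),
             one per frame to generate.
    """
    tokens = [tokenizer.encode(prompt_frame)]   # discrete tokens for frame 1
    frames = [prompt_frame]
    for a in actions:
        # Predict the next frame's tokens from all past tokens and the chosen
        # latent action (the paper uses MaskGIT-style iterative decoding here).
        next_tokens = dynamics_model.predict(tokens, a)
        frames.append(tokenizer.decode(next_tokens))  # tokens -> pixels
        tokens.append(next_tokens)
    return frames
```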

The spatiotemporal transformer network is a key innovation, optimizing computational efficiency by leveraging both spatial and temporal self-attention mechanisms. This method addresses the quadratic scaling problem inherent to traditional transformers and significantly improves the model's capacity to handle complex video dynamics (see Figure 2).

Figure 2: ST-transformer architecture. The architecture is composed of L spatiotemporal blocks, each containing a spatial layer, a temporal layer, and a feed-forward layer.
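
A minimal PyTorch sketch of one such block is shown below, assuming inputs of shape (batch, frames, tokens per frame, embedding dim). It is a simplification for illustration: the actual model is trained in JAX, conditions on latent actions, and applies a causal mask over time, none of which is shown here.

```python
import torch
import torch.nn as nn

class STBlock(nn.Module):
    """Sketch of a spatiotemporal block: attention over tokens within each
    frame (spatial), attention over time at each token position (temporal),
    then a feed-forward layer. Illustrative only."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x):                          # x: (B, T, N, D)
        B, T, N, D = x.shape

        # Spatial attention: attend over the N tokens inside each frame.
        s = self.norm1(x).reshape(B * T, N, D)
        x = x + self.spatial_attn(s, s, s)[0].reshape(B, T, N, D)

        # Temporal attention: attend over the T frames at each token position
        # (the paper uses a causal mask here; omitted for brevity).
        t = self.norm2(x).permute(0, 2, 1, 3).reshape(B * N, T, D)
        t = self.temporal_attn(t, t, t)[0].reshape(B, N, T, D)
        x = x + t.permute(0, 2, 1, 3)

        # Position-wise feed-forward.
        return x + self.mlp(self.norm3(x))

# Example: a batch of 2 videos, 16 frames, 20x20 = 400 tokens of width 512.
out = STBlock(dim=512)(torch.randn(2, 16, 400, 512))
```

Because attention is computed separately over space and over time, the cost grows with N² + T² per block rather than (N·T)², which is where the efficiency gain over full spatiotemporal attention comes from.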

Experimental Results

The Genie model demonstrates robust performance across various datasets, outperforming baseline models in both video fidelity, measured by FVD, and controllability, measured by a PSNR-based metric (ΔPSNR) that compares frames generated from inferred latent actions with frames generated from randomly sampled ones. Training is performed on large, diverse datasets, including a curated corpus of 2D platformer videos and a robotics video dataset, enabling the model to generalize across domains.
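
As a hedged sketch of what such a PSNR-based controllability score looks like, assuming it is the gap in PSNR between frames generated from the inferred latent actions and frames generated from random ones:

```python
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two images of the same shape."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def controllability(ground_truth, gen_inferred, gen_random):
    """Sketch of a PSNR-gap controllability score: how much worse generation
    gets when latent actions are random rather than inferred. Larger values
    mean the latent actions have more influence on the output."""
    return psnr(ground_truth, gen_inferred) - psnr(ground_truth, gen_random)
```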

A significant finding is the model's ability to scale gracefully: experiments across a range of model sizes and batch sizes show consistent improvements in training loss as parameter count and batch size increase. The final model, with 11 billion parameters, exhibits the strongest video generation quality, reaffirming the scalability and robustness of the architecture.

Applications and Implications

The Genie model extends beyond conventional video generation, paving the way for applications in simulation, gaming, and agent training. The unsupervised latent action space offers potential for developing generalist agents capable of learning from diverse video data. Additionally, its applicability to robotics and RL environments highlights its versatility in learning dynamic physical interactions without explicit action labels (see Figure 3).

Figure 3: Controllable, consistent latent actions in Robotics: trajectories beginning from three different starting frames from our Robotics dataset.

Conclusion

Genie represents a significant contribution to generative AI, offering scalable and controllable video generation with minimal supervision. Future work may focus on improving efficiency and extending the model's capacity for creating complex, interactive environments at real-time speeds. Given the current trajectory of advancements in AI, Genie illustrates a promising direction for autonomous content creation, simulation, and interactive experience design in both virtual and real-world applications.

Explain it Like I'm 14

Overview

This paper introduces Genie, a powerful AI that can turn a single prompt—like a sentence, a picture, a photo, or even a quick sketch—into a playable, interactive world. Think of it like drawing a scene and then being able to jump into it and control what happens, frame by frame. Genie learns how to do this by watching lots of videos from the internet, without needing any labels or instructions about what actions people took in those videos.

Goals and Questions

The researchers wanted to answer three simple questions:

  • Can we build an AI that creates playable worlds from just videos, without knowing the actual controller inputs used in those videos?
  • Can users control these worlds with a small set of simple “buttons” (actions) that the AI learns on its own?
  • Can the same learned actions help train future game-playing robots or agents by imitating what they see in new videos?

How Genie Works

At a high level, Genie has three main parts that work together. You can imagine a video as a flipbook: each page is a frame. Genie looks at frames and predicts the next one based on what “button” you press.

Here’s the setup, explained with everyday analogies:

  • Video Tokenizer: This turns each video frame into compact “tokens,” a bit like breaking a picture into LEGO pieces so it’s easier to handle. Genie uses a special transformer (a type of AI model) that looks both within a single frame (spatial) and across time (temporal). That’s like paying attention to what’s happening in one picture and how it changes across the flipbook.
  • Latent Action Model (LAM): Genie learns a small set of actions—like controller buttons—only from watching videos. No one tells it “this was a jump” or “this was move right.” Instead, it figures out the most useful changes between frames and organizes them into a tiny action dictionary (in the paper, there are 8 actions). This is like learning what each button does by watching gameplay and noticing how the screen changes when certain movements happen (see the code sketch after this list).
  • Dynamics Model: This predicts the next frame given the previous frames and the chosen action. Imagine a smart storyteller who looks at what’s already happened and what button you pressed, then draws the next picture in the flipbook. Genie uses a method called MaskGIT, which is like filling in a puzzle piece by piece until the frame looks right.
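
As promised in the latent action model bullet, here is a tiny NumPy sketch of the “action dictionary” idea: a VQ-style nearest-neighbour lookup into a small learned codebook. The vector sizes and the way the latent is produced are assumptions for illustration; only the codebook size of 8 matches the paper.

```python
import numpy as np

def quantize_action(latent: np.ndarray, codebook: np.ndarray) -> int:
    """Map a continuous 'change between frames' embedding to the index of the
    nearest code in a tiny action codebook (VQ-VAE-style lookup).

    latent:   (d,) vector from a latent action encoder (hypothetical here).
    codebook: (k, d) array of learned codes, e.g. k = 8 discrete actions.
    """
    distances = np.sum((codebook - latent) ** 2, axis=1)  # squared L2 to each code
    return int(np.argmin(distances))

# Toy usage: 8 codes of dimension 32 and one random latent.
rng = np.random.default_rng(0)
print(quantize_action(rng.normal(size=32), rng.normal(size=(8, 32))))  # index in [0, 7]
```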

Technically, all three parts use a “spatiotemporal transformer,” which keeps memory use low by focusing attention inside each frame (space) and across frames (time) separately. This makes it faster and more scalable for long videos.

Training data and scale:

  • Genie was trained on about 30,000 hours of 2D platformer game videos from the internet.
  • The team also trained a smaller version on robot videos (no action labels used).
  • The final Genie model has around 11 billion parameters, which makes it a “foundation world model”—a big, general system that can be adapted to many situations.

Main Findings

The researchers highlight several exciting results. Here is a short list to make them easy to follow:

  • Playable worlds from diverse prompts: Genie can take text-generated images, hand-drawn sketches, or even real photos, and turn them into controllable, game-like scenes. Users can press one of the learned action buttons to move characters or objects, frame by frame.
  • Consistent actions: Even though the actions were learned without labels, each action tends to mean something similar across different inputs—for example, “move right” or “jump” shows up consistently.
  • Generalization: Genie works even when the input images look very different from the training videos (this is called “out-of-distribution”). It still produces believable gameplay-like motions.
  • Understanding scenes: Genie can mimic parallax—the effect where foreground moves faster than background when you pan across a scene—showing it learned some 3D-like understanding from 2D videos.
  • Robotics: The robot-trained Genie version learns consistent manipulator actions and even simulates object properties (like a bag of chips deforming when moved). This is impressive because it learned only from video, not from action labels.
  • Scaling helps: As the model gets bigger and the training batches get larger, the results improve smoothly. This is a sign that Genie benefits from more data and compute.
  • Training agents: Genie’s learned actions can be used to imitate behaviors from new, unlabeled videos in unseen environments. With a small amount of extra data to map Genie’s “buttons” to real controls, an agent can reach performance close to an expert that had full labeled data (a simple version of this mapping is sketched below).
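
One simple way to picture that last step, as referenced above: collect a handful of frames where both the real control and Genie’s inferred latent action are known, then build a lookup from each latent “button” to the real control it most often co-occurs with. This is a hypothetical illustration of the idea, not the paper’s exact procedure.

```python
from collections import Counter, defaultdict

def fit_action_mapping(latent_actions, real_actions):
    """Learn a lookup from latent action indices to real environment controls
    from a small set of paired examples (hypothetical majority-vote version)."""
    buckets = defaultdict(Counter)
    for la, ra in zip(latent_actions, real_actions):
        buckets[la][ra] += 1
    # For each latent action, pick the real control it most often co-occurs with.
    return {la: counts.most_common(1)[0][0] for la, counts in buckets.items()}

# Toy usage with made-up data: latent indices 0-7, real controls as strings.
print(fit_action_mapping([0, 0, 3, 3, 3, 7],
                         ["left", "left", "jump", "jump", "jump", "right"]))
# -> {0: 'left', 3: 'jump', 7: 'right'}
```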

Why It Matters

Genie shows a new way to create interactive experiences:

  • Creativity: Kids, artists, and game designers can sketch or imagine worlds and instantly make them playable.
  • Data efficiency: It learns actions from unlabeled videos, which are everywhere online, making it easier and cheaper than collecting special action-labeled datasets.
  • Building smarter agents: It opens a path to train general-purpose game-playing agents or robots by “watching” videos and learning what to do, rather than needing detailed instructions.
  • Foundation for future systems: Because Genie is large and general, it can be a base for many applications in simulation, training, and entertainment.

Limitations and Future Impact

The paper is honest about current limits:

  • Speed: Genie currently runs around 1 frame per second, which is too slow for smooth, real-time play.
  • Memory for long stories: It keeps a short history (about 16 frames), so it can sometimes lose consistency over long sequences.
  • Realism: Like other generative models, it can sometimes “hallucinate” odd or unrealistic frames.

Despite these limits, Genie could:

  • Help democratize game creation—more people can make interactive worlds quickly.
  • Provide vast, varied training environments for AI agents, potentially leading to more capable, general AI.
  • Inspire new research combining video learning, control, and simulation to bridge the gap between watching and doing.

In short, Genie is a big step toward AI that can watch videos, learn how actions change the world, and then let us play inside the worlds it imagines.

