NitroGen: Open Foundation Model for Gaming Agents
- NitroGen is an open foundation model for generalist gaming agents that integrates vision and action using a unified Transformer architecture.
- The dataset comprises 40,000 hours of curated gameplay from over 1,000 games, extracted via template matching and segmentation-based parsing.
- The model achieves notable zero-shot performance and rapid transfer improvements by optimizing control policies with a conditional flow-matching loss.
NitroGen is an open foundation model designed for generalist gaming agents, trained on a large-scale, internet-derived vision–action dataset comprising 40,000 hours of manually curated gameplay recordings spanning over 1,000 unique games. NitroGen utilizes a unified Transformer architecture to couple vision and action modeling, leveraging behavior cloning from video-derived action data to achieve robust, diverse cross-game performance and effective transfer to novel titles. The complete dataset, evaluation environments, and model weights are released to advance research on generalist embodied control (Magne et al., 4 Jan 2026).
1. Dataset Construction
NitroGen’s foundation is an internet-scale “video–action” dataset curated and processed using a three-stage pipeline: data curation, automatic action extraction, and quality filtering.
Stage 1: Data curation
The initial collection comprises 71,000 hours of raw gameplay video from 38,739 recordings, sourced from 818 different creators. Selection criteria require the presence of real-time on-screen “gamepad overlays” (e.g., Xbox, PlayStation) to facilitate action extraction. Final filtering and quality assurance yield exactly 40,000 hours of gameplay from more than 1,000 distinct games:
- 846 games with at least 1 hour of data
- 91 games with at least 100 hours
- 15 games with over 1,000 hours

Playtime is distributed across genres: Action-RPG (34.9%), Platformer (18.4%), Action-Adventure (9.2%), with the remaining hours spread over seven additional genres.
Stage 2: Action extraction
- Template matching: Localization of gamepad overlays is achieved by matching 25 sampled frames per clip against a library of approximately 300 hand-curated controller templates using SIFT [Lowe04] and XFeat [Potje24], retaining matches with at least 20 keypoint inliers under an affine transform.
- Segmentation-based parsing: A fine-tuned SegFormer [Xie21] network ingests pairs of spatially concatenated cropped frames, providing 11×11 segmentation masks for each analog stick and binary on/off states for 16 controller buttons. Training utilizes 8 million synthetic frames simulating variations in overlay appearance.
- Quality filtering: To address the over-representation of “no-op” action labels, segments (16–32 frames) with under 50% nonzero actions are discarded, retaining 55% of raw footage. The on-screen overlay is masked to prevent the model from visual “cheating.”
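The no-op filter above can be sketched in a few lines. This is a minimal illustration, not the release pipeline; the function name, segment stride, and the convention that a frame is "active" when any of its 20 action dimensions is nonzero are assumptions.

```python
import numpy as np

def filter_noop_segments(actions, segment_len=32, min_active_frac=0.5):
    """Drop fixed-length segments whose fraction of nonzero-action frames
    falls below `min_active_frac` (the 50% threshold described above).

    actions: (T, 20) array of per-frame action vectors.
    Returns a list of (start, end) index pairs for retained segments.
    """
    kept = []
    for start in range(0, len(actions) - segment_len + 1, segment_len):
        seg = actions[start:start + segment_len]
        # A frame is "active" if any button bit or stick axis is nonzero.
        active = np.any(seg != 0, axis=1)
        if active.mean() >= min_active_frac:
            kept.append((start, start + segment_len))
    return kept
```

Applied to a clip whose first half is idle, only the second half survives, mirroring the reported retention of 55% of raw footage.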
Stage 3: Dataset statistics
The processed dataset contains exactly 40,000 hours of 256×256 RGB video at 30 Hz, each frame temporally synchronized with a 20-dimensional action vector (16 button bits + 4 continuous sticks), covering more than 1,000 games across 9 genres.
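A possible encoding of the 20-dimensional action vector (16 button bits followed by 4 continuous stick axes) is sketched below; the exact field ordering and axis convention are assumptions, since the source only specifies the dimensionality split.

```python
import numpy as np

# Assumed layout: 16 binary button states, then 4 stick axes
# (left X/Y, right X/Y) clipped to [-1, 1].
N_BUTTONS = 16
N_AXES = 4

def encode_action(buttons, sticks):
    """Pack button bits and stick axes into one 20-d float vector."""
    assert len(buttons) == N_BUTTONS and len(sticks) == N_AXES
    return np.concatenate([
        np.asarray(buttons, dtype=np.float32),
        np.clip(np.asarray(sticks, dtype=np.float32), -1.0, 1.0),
    ])

def decode_action(vec):
    """Split a 20-d vector back into (buttons, sticks)."""
    return vec[:N_BUTTONS] > 0.5, vec[N_BUTTONS:]
```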
2. Model Architecture
NitroGen employs an end-to-end, unified vision–action Transformer with two principal components: a vision encoder and an Action Diffusion Transformer (DiT).
Vision Encoder
Observation frames are encoded by a SigLIP-2 ViT [Tschannen25] into a sequence of fixed-dimension visual tokens that condition the action decoder.
Action Diffusion Transformer (DiT)
The DiT predicts a chunk of 16 future actions per forward pass via an iterative denoising process. The noisy action chunk is embedded into a sequence of action tokens.
These tokens are concatenated and processed by DiT blocks with:
- multi-head self-attention over action tokens,
- cross-attention from action to vision tokens,
- feed-forward layers.
The per-time-step MLP decoder outputs continuous-valued controls.
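A shape-level sketch of one such block follows: single-head attention, random weights, and no layer normalization, so it illustrates only the token flow (self-attention over actions, cross-attention into vision tokens, feed-forward), not the actual 500M-parameter network.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def dit_block(action_tokens, vision_tokens, rng):
    """One simplified DiT block: self-attention over action tokens,
    cross-attention from action to vision tokens, then a feed-forward
    layer, each with a residual connection."""
    d = action_tokens.shape[-1]
    W = lambda: rng.standard_normal((d, d)) / np.sqrt(d)
    # Self-attention over the action-token sequence.
    x = action_tokens + attention(
        action_tokens @ W(), action_tokens @ W(), action_tokens @ W())
    # Cross-attention: queries from actions, keys/values from vision.
    x = x + attention(x @ W(), vision_tokens @ W(), vision_tokens @ W())
    # Position-wise feed-forward layer.
    return x + np.tanh(x @ W()) @ W()
```

The action sequence length (16) matches the chunk size used in training; the vision-token count here is arbitrary.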
Unified Forward Pass
In a single forward pass, the vision encoder produces conditioning tokens and the DiT denoises the action chunk against them. The total parameter count is approximately 500 million.
3. Training Objective and Hyperparameters
NitroGen is optimized using the conditional flow-matching (CFM) loss on 16-action chunks, conditioning on one visual context frame.
Training Loss
Given a ground-truth action chunk $a_1$, Gaussian noise $a_0 \sim \mathcal{N}(0, I)$, and an interpolation parameter $t \sim \mathcal{U}[0, 1]$, the noisy chunk is

$$a_t = (1 - t)\, a_0 + t\, a_1,$$

with the model $v_\theta$ trained to predict the conditional velocity $v = a_1 - a_0$:

$$\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t,\, a_0,\, a_1}\big[\lVert v_\theta(a_t, t, o) - (a_1 - a_0) \rVert^2\big],$$

where $o$ denotes the visual context.
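Constructing one conditional flow-matching training example is straightforward; the sketch below uses the 16-action, 20-dimensional chunk shape from the dataset, with the function name being an assumption.

```python
import numpy as np

def cfm_training_pair(actions, rng):
    """Build one conditional flow-matching training example.

    actions: (16, 20) ground-truth action chunk a_1.
    Returns the interpolated noisy chunk a_t, the timestep t, and the
    regression target v = a_1 - a_0 the model must predict.
    """
    noise = rng.standard_normal(actions.shape)   # a_0 ~ N(0, I)
    t = rng.uniform()                            # t ~ U[0, 1]
    a_t = (1.0 - t) * noise + t * actions        # linear interpolation
    target_v = actions - noise                   # conditional velocity
    return a_t, t, target_v
```

A useful sanity check: following the constant velocity from $a_t$ for the remaining time $1 - t$ recovers the clean chunk exactly.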
Inference (Denoising)
Inference begins from $a_0 \sim \mathcal{N}(0, I)$ and performs fixed-step Euler updates:

$$a_{t + \Delta t} = a_t + \Delta t\, v_\theta(a_t, t, o).$$
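The Euler integration itself is a short loop. Below is a minimal sketch with an assumed `velocity_fn` standing in for the trained DiT; the step count is illustrative, not the value used in the paper.

```python
import numpy as np

def euler_sample(velocity_fn, shape, n_steps=10, rng=None):
    """Integrate a learned velocity field from t=0 (pure noise)
    to t=1 (clean action chunk) with fixed-step Euler updates."""
    rng = rng or np.random.default_rng()
    a = rng.standard_normal(shape)       # a_0 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        a = a + dt * velocity_fn(a, t)   # a_{t+dt} = a_t + dt * v(a_t, t)
    return a
```

With an oracle velocity field pointing at a known target, the sampler recovers that target, which is a convenient correctness check.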
Optimization
- Optimizer: AdamW with weight decay
- Learning rate: cosine decay schedule
- Exponential moving average of weights: decay 0.9999
- Data augmentation: random brightness/contrast/saturation/hue jitter, rotation, random crops
- Compute: 32×A100 GPUs, trained to validation convergence
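The exponential moving average of the weights, kept alongside the raw parameters, follows the standard update rule (shown here with plain floats for clarity; the decay of 0.9999 is from the training setup above).

```python
def ema_update(ema_params, params, decay=0.9999):
    """One exponential-moving-average step over model parameters:
    ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, params)]
```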
4. Benchmark Suite and Evaluation Procedures
The NitroGen multi-game evaluation environment wraps 10 commercial games under a unified Gymnasium [Towers24]-style API, with 30 tasks distributed as follows:
- 2D side-scrollers (3 games)
- 2D top-down roguelikes (2 games)
- 3D open-world exploration (2 games)
- 3D action-RPG (2 games)
- 3D sports (1 game)
Each benchmark task defines explicit start and goal states, with episodic success checked via human or automated assessments.
Metrics
Per-task success rate is the fraction of successful episodes over $N$ rollouts:

$$\mathrm{SR}_{\text{task}} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\text{episode } i \text{ succeeds}].$$

Average per-game success is the unweighted mean over that game's $T$ tasks:

$$\mathrm{SR}_{\text{game}} = \frac{1}{T} \sum_{k=1}^{T} \mathrm{SR}_{\text{task},\,k}.$$
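These two metrics reduce to a few lines; outcomes are assumed to be encoded as 0/1 per episode.

```python
def task_success_rate(outcomes):
    """Fraction of successful episodes over N rollouts (0/1 outcomes)."""
    return sum(outcomes) / len(outcomes)

def game_success_rate(per_task_outcomes):
    """Unweighted mean of per-task success rates within one game."""
    rates = [task_success_rate(o) for o in per_task_outcomes]
    return sum(rates) / len(rates)
```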
Without any fine-tuning, NitroGen attains nonzero average success rates across all four categories (2D platformers, top-down roguelikes, 3D action-RPG combat, and 3D open-world exploration), in each case significantly exceeding the near-zero random baselines.
5. Generalization and Transfer Capabilities
Transfer and data efficiency are evaluated on two held-out games:
(a) Fine-tuning on Isometric Roguelike (0–100h supervision):
Fine-tuning from the NitroGen checkpoint provides a 10% relative improvement in final success rates compared to training from scratch.
(b) 3D Action-RPG, 30h supervision:
Task-specific relative improvements when initializing from NitroGen:
- Combat: +52%
- Navigation: +25%
- Game-specific tasks: +5%
The relative improvement metric is formalized as

$$\Delta_{\text{rel}} = \frac{\mathrm{SR}_{\text{fine-tuned}} - \mathrm{SR}_{\text{scratch}}}{\mathrm{SR}_{\text{scratch}}}.$$
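As a one-line sanity check on the metric (function name assumed):

```python
def relative_improvement(sr_finetuned, sr_scratch):
    """Relative gain of fine-tuning from the NitroGen checkpoint over
    training from scratch, as a fraction (e.g. 0.52 for +52%)."""
    return (sr_finetuned - sr_scratch) / sr_scratch
```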
These results indicate that NitroGen acquires transferable control primitives, facilitating rapid adaptation to new environments and tasks with limited data.
6. Open-Source Release and Research Impact
All components of NitroGen are openly released:
- Dataset: 40,000 hours of video–action pairs from 1,000+ games: https://github.com/nitrogen-dataset
- Benchmark suite: Universal Gymnasium simulator (30 tasks, 10 games): https://github.com/nitrogen-eval
- Model weights: 500M parameter PyTorch checkpoint: https://github.com/nitrogen-models
By providing internet-scale video–action data, a multi-environment evaluation gym, and a scalable vision–action Transformer baseline, NitroGen lowers barriers for large-scale behavior-cloning pretraining. This establishes a practical testbed for progress in multi-game control, hierarchical planning, and potentially language-conditioned policies for generalist agents (Magne et al., 4 Jan 2026).