
NitroGen: Open Foundation Model for Gaming Agents

Updated 4 February 2026
  • NitroGen is an open foundation model for generalist gaming agents that integrates vision and action using a unified Transformer architecture.
  • The dataset comprises 40,000 hours of curated gameplay from over 1,000 games, extracted via template matching and segmentation-based parsing.
  • The model achieves notable zero-shot performance and rapid transfer improvements by optimizing control policies with a conditional flow-matching loss.

NitroGen is an open foundation model designed for generalist gaming agents, trained on a large-scale, internet-derived vision–action dataset comprising 40,000 hours of manually curated gameplay recordings spanning over 1,000 unique games. NitroGen utilizes a unified Transformer architecture to couple vision and action modeling, leveraging behavior cloning from video-derived action data to achieve robust, diverse cross-game performance and effective transfer to novel titles. The complete dataset, evaluation environments, and model weights are released to advance research on generalist embodied control (Magne et al., 4 Jan 2026).

1. Dataset Construction

NitroGen’s foundation is an internet-scale “video–action” dataset curated and processed using a three-stage pipeline: data curation, automatic action extraction, and quality filtering.

Stage 1: Data curation

The initial collection comprises 71,000 hours of raw gameplay video from 38,739 recordings, sourced from 818 different creators. Selection criteria require the presence of real-time on-screen “gamepad overlays” (e.g., Xbox, PlayStation) to facilitate action extraction. Final filtering and quality assurance yield exactly 40,000 hours of gameplay from more than 1,000 distinct games:

  • 846 games with at least 1 hour of data
  • 91 games with at least 100 hours
  • 15 games with over 1,000 hours

The dataset distributes playtime across genres: Action-RPG (34.9%), Platformer (18.4%), Action-Adventure (9.2%), with the remaining hours spread over seven additional genres.

Stage 2: Action extraction

  • Template matching: Localization of gamepad overlays is achieved by matching 25 sampled frames per clip against a library of approximately 300 hand-curated controller templates using SIFT [Lowe04] and XFeat [Potje24], retaining matches with at least 20 keypoint inliers under an affine transform.
  • Segmentation-based parsing: A fine-tuned SegFormer [Xie21] network ingests pairs of spatially concatenated cropped frames, providing 11×11 segmentation masks for each analog stick and binary on/off states for 16 controller buttons. Training utilizes 8 million synthetic frames simulating variations in overlay appearance.
  • Quality filtering: To address the over-representation of “no-op” action labels, segments (16–32 frames) with under 50% nonzero actions are discarded, retaining 55% of raw footage. The on-screen overlay is masked to prevent the model from visual “cheating.”
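The quality-filtering rule above (discarding 16–32-frame segments with under 50% nonzero actions) can be sketched as follows. The function name and the fixed-stride segmentation are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def filter_segments(actions, seg_len=16, min_nonzero=0.5):
    """Split an action stream into fixed-length segments and keep only
    those where at least `min_nonzero` of the frames carry a nonzero
    (non-"no-op") action label.

    actions: (T, 20) array of per-frame action vectors.
    Returns a list of kept (start, end) frame-index pairs.
    """
    kept = []
    T = len(actions)
    for start in range(0, T - seg_len + 1, seg_len):
        seg = actions[start:start + seg_len]
        # A frame is "nonzero" if any button or stick channel is active.
        nonzero_frac = np.mean(np.any(seg != 0, axis=1))
        if nonzero_frac >= min_nonzero:
            kept.append((start, start + seg_len))
    return kept
```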

Stage 3: Dataset statistics

The processed dataset contains exactly 40,000 hours of 256×256 RGB video at 30 Hz, each frame temporally synchronized with a 20-dimensional action vector (16 button bits + 4 continuous sticks), covering more than 1,000 games across 9 genres.
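A minimal sketch of the 20-dimensional action layout described above, assuming the conventional ordering of 16 button bits followed by the 4 continuous stick axes (the exact channel order is an assumption, not stated in the text):

```python
import numpy as np

# Hypothetical layout: 16 binary button bits, then 4 stick axes
# (e.g. left x/y, right x/y), matching the 20-dim vector in the text.
N_BUTTONS, N_STICKS = 16, 4

def encode_action(buttons, sticks):
    """Pack button states and stick positions into one 20-dim vector."""
    a = np.zeros(N_BUTTONS + N_STICKS, dtype=np.float32)
    a[:N_BUTTONS] = np.asarray(buttons, dtype=np.float32)  # 0/1 bits
    a[N_BUTTONS:] = np.clip(sticks, -1.0, 1.0)             # in [-1, 1]
    return a
```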

2. Model Architecture

NitroGen employs an end-to-end, unified vision–action Transformer with two principal components: a vision encoder and an Action Diffusion Transformer (DiT).

Vision Encoder

Observation frames $V \in \mathbb{R}^{256\times256\times3}$ are encoded by a SigLIP-2 ViT [Tschannen25] into $n_v = 256$ tokens, each of dimension $d = 1024$:

$$x_{\mathrm{vision}} = E_v(V) \in \mathbb{R}^{n_v \times d}.$$

Action Diffusion Transformer (DiT)

Control output predicts $K = 16$ future actions per forward pass via a diffusion process. The (noisy) action chunk $a_t \in \mathbb{R}^{K\times 20}$ is embedded as

$$x_{\mathrm{action}} = E_a(a_t) \in \mathbb{R}^{K\times d}.$$

These tokens are concatenated with the vision tokens and processed by $L = 32$ DiT blocks.

A per-time-step MLP decoder outputs the continuous-valued controls $\hat a \in \mathbb{R}^{K\times 20}$.

Unified Forward Pass

$$h = \mathrm{DiT}\bigl([\,x_{\mathrm{vision}},\,x_{\mathrm{action}}\,]\bigr), \qquad \hat a = \mathrm{MLP}_{\mathrm{out}}(h).$$

The total parameter count is approximately 500 million.
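At the shape level, the unified forward pass above can be sketched as follows. The choice to decode only the $K$ action-token positions is one common DiT design and is an assumption here, not a detail confirmed by the text:

```python
import numpy as np

# Dimensions stated in the text: 256 vision tokens, K = 16 action tokens,
# width d = 1024, decoded to a (16, 20) action chunk.
n_v, K, d, a_dim = 256, 16, 1024, 20

def forward(x_vision, x_action, dit_blocks, mlp_out):
    # Concatenate vision and action tokens along the sequence axis.
    h = np.concatenate([x_vision, x_action], axis=0)  # (n_v + K, d)
    for block in dit_blocks:                          # L = 32 DiT blocks
        h = block(h)
    # Assumption: only the K action-token positions are decoded.
    return mlp_out(h[n_v:])                           # (K, a_dim)
```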

3. Training Objective and Hyperparameters

NitroGen is optimized using the conditional flow-matching (CFM) loss on 16-action chunks, conditioning on one visual context frame.

Training Loss

Given ground-truth actions $a \in \mathbb{R}^{K\times 20}$, Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$, and interpolation parameter $t \in [0, 1]$:

$$a_t = (1 - t)\,\epsilon + t\,a,$$

with the model $\pi_\theta(a_t, \psi_\phi(V), t)$ trained to predict the conditional velocity $a - \epsilon$:

$$\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,a,\epsilon}\,\bigl\|\pi_\theta(a_t, \psi_\phi(V), t) - (a - \epsilon)\bigr\|^2.$$
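A single-sample version of the CFM objective can be sketched as below; the `policy` callable and its signature are placeholders for $\pi_\theta$ and the vision features $\psi_\phi(V)$:

```python
import numpy as np

def cfm_loss(policy, a, V_features, rng):
    """One-sample conditional flow-matching loss, following the text:
    interpolate a_t = (1 - t) * eps + t * a and regress the velocity
    target (a - eps).  `policy(a_t, V_features, t)` is assumed to
    return an array shaped like `a`.
    """
    eps = rng.standard_normal(a.shape)  # Gaussian noise
    t = rng.uniform(0.0, 1.0)           # interpolation parameter
    a_t = (1.0 - t) * eps + t * a       # noisy action chunk
    v_pred = policy(a_t, V_features, t)
    return np.mean((v_pred - (a - eps)) ** 2)
```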

Inference (Denoising)

Inference begins from $a_{t=0} \sim \mathcal{N}(0, I)$ and performs $k = 16$ Euler steps:

$$a_{t + \frac{1}{k}} = a_t + \frac{1}{k}\,\pi_\theta\bigl(a_t, \psi_\phi(V), t\bigr).$$
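The Euler denoising loop follows directly from the update rule above; this is a sketch with a placeholder `policy` standing in for the trained $\pi_\theta$:

```python
import numpy as np

def sample_actions(policy, V_features, k=16, shape=(16, 20), rng=None):
    """Draw an action chunk by integrating the learned velocity field
    from pure noise with k Euler steps, as in the text."""
    rng = rng if rng is not None else np.random.default_rng()
    a = rng.standard_normal(shape)  # a_{t=0} ~ N(0, I)
    for i in range(k):
        t = i / k
        a = a + (1.0 / k) * policy(a, V_features, t)
    return a
```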

Optimization

  • Optimizer: AdamW with weight decay $1\times10^{-3}$
  • Learning rate: $1\times10^{-4}$ with cosine decay
  • Exponential moving average decay: 0.9999
  • Data augmentation: random brightness/contrast/saturation/hue, $\pm 5^\circ$ rotation, random crops
  • Compute: $\sim 2\times10^{4}$ GPU-hours ($32\times$ A100), trained to validation convergence

4. Benchmark Suite and Evaluation Procedures

The NitroGen multi-game evaluation environment wraps 10 commercial games under a unified Gymnasium [Towers24]-style API, with 3 tasks per game (30 tasks total). The games are distributed as follows:

  • 2D side-scrollers (3 games)
  • 2D top-down roguelikes (2 games)
  • 3D open-world exploration (2 games)
  • 3D action-RPG (2 games)
  • 3D sports (1 game)

Each benchmark task defines explicit start and goal states, with episodic success checked via human or automated assessments.

Metrics

Per-task success rate $\mathrm{SR}_i$ is defined as the fraction of successful episodes over $N$ rollouts:

$$\mathrm{SR}_i = \frac{1}{N}\sum_{j=1}^{N} \mathbf{1}[\text{success}_{ij}].$$

Average per-game success is then

$$\overline{\mathrm{SR}}_{\mathrm{game}} = \frac{1}{15}\sum_{i=1}^{3}\sum_{j=1}^{5} \mathbf{1}[\text{success}_{ij}].$$
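Both success-rate definitions can be computed from raw episode outcomes as follows; the dict-based bookkeeping is illustrative, not part of the released benchmark code:

```python
def success_rates(results):
    """Per-task and per-game success rates from raw episode outcomes.

    results: dict mapping task id -> list of booleans (one per rollout).
    Returns (per_task, overall), where `overall` averages over all
    episodes, matching the text's 3 tasks x 5 rollouts per game.
    """
    per_task = {t: sum(r) / len(r) for t, r in results.items()}
    episodes = [b for r in results.values() for b in r]
    overall = sum(episodes) / len(episodes)
    return per_task, overall
```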

Zero-shot Performance

Without any fine-tuning, NitroGen exhibits the following average success rates across categories:

  • 2D platformers: $\approx 60\%$
  • Top-down roguelikes: $\approx 35\%$
  • 3D action-RPGs (combat): $\approx 25\%$
  • 3D open-world exploration: $\approx 18\%$

All rates significantly exceed random baselines ($0$–$5\%$).

5. Generalization and Transfer Capabilities

Transfer and data efficiency are evaluated on two held-out games:

(a) Fine-tuning on Isometric Roguelike (0–100h supervision):

Fine-tuning from the NitroGen checkpoint provides a 10% relative improvement in final success rates compared to training from scratch.

(b) 3D Action-RPG, 30h supervision:

Task-specific relative improvements when initializing from NitroGen:

  • Combat: +52%
  • Navigation: +25%
  • Game-specific tasks: +5%

The relative improvement metric is formalized as

$$\Delta_{\mathrm{rel}} = \frac{\mathrm{SR}_{\mathrm{pre}} - \mathrm{SR}_{\mathrm{scratch}}}{\mathrm{SR}_{\mathrm{scratch}}} \times 100\%.$$
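The relative-improvement metric is a one-liner. The example values below are hypothetical success rates chosen only to reproduce the +52% combat figure reported above, not numbers from the paper:

```python
def rel_improvement(sr_pre, sr_scratch):
    """Relative improvement (in %) of a pretrained-then-fine-tuned
    policy over a from-scratch baseline."""
    return (sr_pre - sr_scratch) / sr_scratch * 100.0

# Hypothetical pair consistent with the reported +52% combat delta:
# from-scratch 0.50 -> fine-tuned 0.76.
```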

These results indicate that NitroGen acquires transferable control primitives, facilitating rapid adaptation to new environments and tasks with limited data.

6. Open-Source Release and Research Impact

All components of NitroGen, including the curated dataset, the multi-game evaluation environments, and the pretrained model weights, are openly released.

By providing internet-scale video–action data, a multi-environment evaluation gym, and a scalable vision–action Transformer baseline, NitroGen lowers barriers for large-scale behavior-cloning pretraining. This establishes a practical testbed for progress in multi-game control, hierarchical planning, and potentially language-conditioned policies for generalist agents (Magne et al., 4 Jan 2026).
