
NitroGen: Open Foundation Model for Gaming Agents

Updated 4 February 2026
  • NitroGen is an open foundation model for generalist gaming agents that integrates vision and action using a unified Transformer architecture.
  • The dataset comprises 40,000 hours of curated gameplay from over 1,000 games, extracted via template matching and segmentation-based parsing.
  • The model achieves notable zero-shot performance and rapid transfer improvements by optimizing control policies with a conditional flow-matching loss.

NitroGen is an open foundation model designed for generalist gaming agents, trained on a large-scale, internet-derived vision–action dataset comprising 40,000 hours of manually curated gameplay recordings spanning over 1,000 unique games. NitroGen utilizes a unified Transformer architecture to couple vision and action modeling, leveraging behavior cloning from video-derived action data to achieve robust, diverse cross-game performance and effective transfer to novel titles. The complete dataset, evaluation environments, and model weights are released to advance research on generalist embodied control (Magne et al., 4 Jan 2026).

1. Dataset Construction

NitroGen’s foundation is an internet-scale “video–action” dataset curated and processed using a three-stage pipeline: data curation, automatic action extraction, and quality filtering.

Stage 1: Data curation

The initial collection comprises 71,000 hours of raw gameplay video from 38,739 recordings, sourced from 818 different creators. Selection criteria require the presence of real-time on-screen “gamepad overlays” (e.g., Xbox, PlayStation) to facilitate action extraction. Final filtering and quality assurance yield exactly 40,000 hours of gameplay from more than 1,000 distinct games:

  • 846 games with at least 1 hour of data
  • 91 games with at least 100 hours
  • 15 games with over 1,000 hours

The dataset distributes playtime across genres: Action-RPG (34.9%), Platformer (18.4%), Action-Adventure (9.2%), with the remaining hours spread over seven additional genres.

Stage 2: Action extraction

  • Template matching: Localization of gamepad overlays is achieved by matching 25 sampled frames per clip against a library of approximately 300 hand-curated controller templates using SIFT [Lowe04] and XFeat [Potje24], retaining matches with at least 20 keypoint inliers under an affine transform.
  • Segmentation-based parsing: A fine-tuned SegFormer [Xie21] network ingests pairs of spatially concatenated cropped frames, providing 11×11 segmentation masks for each analog stick and binary on/off states for 16 controller buttons. Training utilizes 8 million synthetic frames simulating variations in overlay appearance.
  • Quality filtering: To address the over-representation of “no-op” action labels, segments (16–32 frames) with under 50% nonzero actions are discarded, retaining 55% of raw footage. The on-screen overlay is masked to prevent the model from visual “cheating.”
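The quality-filtering rule above (discarding 16–32-frame segments with under 50% nonzero actions) can be sketched as follows. The function name and the fixed-stride segmentation are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def filter_segments(actions, seg_len=16, min_nonzero=0.5):
    """Split an action stream into fixed-length segments and keep only
    those where at least `min_nonzero` of the frames carry a nonzero
    (non-"no-op") action label.

    actions: (T, 20) array of per-frame action vectors.
    Returns a list of kept (start, end) frame-index pairs.
    """
    kept = []
    T = len(actions)
    for start in range(0, T - seg_len + 1, seg_len):
        seg = actions[start:start + seg_len]
        # A frame is "nonzero" if any button or stick channel is active.
        nonzero_frac = np.mean(np.any(seg != 0, axis=1))
        if nonzero_frac >= min_nonzero:
            kept.append((start, start + seg_len))
    return kept
```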

Stage 3: Dataset statistics

The processed dataset contains exactly 40,000 hours of 256×256 RGB video at 30 Hz, each frame temporally synchronized with a 20-dimensional action vector (16 button bits + 4 continuous sticks), covering more than 1,000 games across 9 genres.
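A minimal sketch of the 20-dimensional action layout described above, assuming the conventional ordering of 16 button bits followed by the 4 continuous stick axes (the exact channel order is an assumption, not stated in the text):

```python
import numpy as np

# Hypothetical layout: 16 binary button bits, then 4 stick axes
# (e.g. left x/y, right x/y), matching the 20-dim vector in the text.
N_BUTTONS, N_STICKS = 16, 4

def encode_action(buttons, sticks):
    """Pack button states and stick positions into one 20-dim vector."""
    a = np.zeros(N_BUTTONS + N_STICKS, dtype=np.float32)
    a[:N_BUTTONS] = np.asarray(buttons, dtype=np.float32)  # 0/1 bits
    a[N_BUTTONS:] = np.clip(sticks, -1.0, 1.0)             # in [-1, 1]
    return a
```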

2. Model Architecture

NitroGen employs an end-to-end, unified vision–action Transformer with two principal components: a vision encoder and an Action Diffusion Transformer (DiT).

Vision Encoder

Observation frames $V \in \mathbb{R}^{256\times256\times3}$ are encoded by a SigLIP-2 ViT [Tschannen25] into $n_v = 256$ tokens, each of dimension $d = 1024$:

$$x_{\mathrm{vision}} = E_v(V) \in \mathbb{R}^{n_v \times d}.$$

Action Diffusion Transformer (DiT)

Control output predicts $K = 16$ future actions per forward pass via a diffusion process. The (noisy) action chunk $a_t \in \mathbb{R}^{K\times 20}$ is embedded as

$$x_{\mathrm{action}} = E_a(a_t) \in \mathbb{R}^{K\times d}.$$

These tokens are concatenated with the vision tokens and processed by $L = 32$ DiT blocks.

A per-time-step MLP decoder outputs the continuous-valued controls $\hat a \in \mathbb{R}^{K\times 20}$.

Unified Forward Pass

$$h = \mathrm{DiT}\bigl([\,x_{\mathrm{vision}},\,x_{\mathrm{action}}\,]\bigr), \qquad \hat a = \mathrm{MLP}_{\mathrm{out}}(h).$$

The total parameter count is approximately 500 million.
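At the shape level, the unified forward pass above can be sketched as follows. The choice to decode only the $K$ action-token positions is one common DiT design and is an assumption here, not a detail confirmed by the text:

```python
import numpy as np

# Dimensions stated in the text: 256 vision tokens, K = 16 action tokens,
# width d = 1024, decoded to a (16, 20) action chunk.
n_v, K, d, a_dim = 256, 16, 1024, 20

def forward(x_vision, x_action, dit_blocks, mlp_out):
    # Concatenate vision and action tokens along the sequence axis.
    h = np.concatenate([x_vision, x_action], axis=0)  # (n_v + K, d)
    for block in dit_blocks:                          # L = 32 DiT blocks
        h = block(h)
    # Assumption: only the K action-token positions are decoded.
    return mlp_out(h[n_v:])                           # (K, a_dim)
```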

3. Training Objective and Hyperparameters

NitroGen is optimized using the conditional flow-matching (CFM) loss on 16-action chunks, conditioning on one visual context frame.

Training Loss

Given ground-truth actions $a \in \mathbb{R}^{K\times 20}$, Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$, and interpolation parameter $t \in [0, 1]$:

$$a_t = (1 - t)\,\epsilon + t\,a,$$

with the model $\pi_\theta(a_t, \psi_\phi(V), t)$ trained to predict the conditional velocity $a - \epsilon$:

$$\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,a,\epsilon}\,\bigl\|\pi_\theta(a_t, \psi_\phi(V), t) - (a - \epsilon)\bigr\|^2.$$
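A single-sample version of the CFM objective can be sketched as below; the `policy` callable and its signature are placeholders for $\pi_\theta$ and the vision features $\psi_\phi(V)$:

```python
import numpy as np

def cfm_loss(policy, a, V_features, rng):
    """One-sample conditional flow-matching loss, following the text:
    interpolate a_t = (1 - t) * eps + t * a and regress the velocity
    target (a - eps).  `policy(a_t, V_features, t)` is assumed to
    return an array shaped like `a`.
    """
    eps = rng.standard_normal(a.shape)  # Gaussian noise
    t = rng.uniform(0.0, 1.0)           # interpolation parameter
    a_t = (1.0 - t) * eps + t * a       # noisy action chunk
    v_pred = policy(a_t, V_features, t)
    return np.mean((v_pred - (a - eps)) ** 2)
```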

Inference (Denoising)

Inference begins from $a_{t=0} \sim \mathcal{N}(0, I)$ and performs $k = 16$ Euler steps:

$$a_{t + \frac{1}{k}} = a_t + \frac{1}{k}\,\pi_\theta\bigl(a_t, \psi_\phi(V), t\bigr).$$
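The Euler denoising loop follows directly from the update rule above; this is a sketch with a placeholder `policy` standing in for the trained $\pi_\theta$:

```python
import numpy as np

def sample_actions(policy, V_features, k=16, shape=(16, 20), rng=None):
    """Draw an action chunk by integrating the learned velocity field
    from pure noise with k Euler steps, as in the text."""
    rng = rng if rng is not None else np.random.default_rng()
    a = rng.standard_normal(shape)  # a_{t=0} ~ N(0, I)
    for i in range(k):
        t = i / k
        a = a + (1.0 / k) * policy(a, V_features, t)
    return a
```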

Optimization

  • Optimizer: AdamW with weight decay $1\times10^{-3}$
  • Learning rate: $1\times10^{-4}$ with cosine decay
  • Exponential moving average decay: 0.9999
  • Data augmentation: random brightness/contrast/saturation/hue, $\pm 5^\circ$ rotation, random crops
  • Compute: $\sim 2\times10^{4}$ GPU-hours ($32\times$ A100), trained to validation convergence

4. Benchmark Suite and Evaluation Procedures

The NitroGen multi-game evaluation environment wraps 10 commercial games under a unified Gymnasium [Towers24]-style API, with 3 tasks per game (30 tasks total). The games are distributed as follows:

  • 2D side-scrollers (3 games)
  • 2D top-down roguelikes (2 games)
  • 3D open-world exploration (2 games)
  • 3D action-RPG (2 games)
  • 3D sports (1 game)

Each benchmark task defines explicit start and goal states, with episodic success checked via human or automated assessments.

Metrics

Per-task success rate $\mathrm{SR}_i$ is defined as the fraction of successful episodes over $N$ rollouts:

$$\mathrm{SR}_i = \frac{1}{N}\sum_{j=1}^{N} \mathbf{1}[\text{success}_{ij}].$$

Average per-game success is then

$$\overline{\mathrm{SR}}_{\mathrm{game}} = \frac{1}{15}\sum_{i=1}^{3}\sum_{j=1}^{5} \mathbf{1}[\text{success}_{ij}].$$
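Both success-rate definitions can be computed from raw episode outcomes as follows; the dict-based bookkeeping is illustrative, not part of the released benchmark code:

```python
def success_rates(results):
    """Per-task and per-game success rates from raw episode outcomes.

    results: dict mapping task id -> list of booleans (one per rollout).
    Returns (per_task, overall), where `overall` averages over all
    episodes, matching the text's 3 tasks x 5 rollouts per game.
    """
    per_task = {t: sum(r) / len(r) for t, r in results.items()}
    episodes = [b for r in results.values() for b in r]
    overall = sum(episodes) / len(episodes)
    return per_task, overall
```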

Zero-shot Performance

Without any fine-tuning, NitroGen exhibits the following average success rates across categories:

  • 2D platformers: $\approx 60\%$
  • Top-down roguelikes: $\approx 35\%$
  • 3D action-RPGs (combat): $\approx 25\%$
  • 3D open-world exploration: $\approx 18\%$

All rates significantly exceed random baselines ($0$–$5\%$).

5. Generalization and Transfer Capabilities

Transfer and data efficiency are evaluated on two held-out games:

(a) Fine-tuning on Isometric Roguelike (0–100h supervision):

Fine-tuning from the NitroGen checkpoint provides a 10% relative improvement in final success rates compared to training from scratch.

(b) 3D Action-RPG, 30h supervision:

Task-specific relative improvements when initializing from NitroGen:

  • Combat: +52%
  • Navigation: +25%
  • Game-specific tasks: +5%

The relative improvement metric is formalized as

$$\Delta_{\mathrm{rel}} = \frac{\mathrm{SR}_{\mathrm{pre}} - \mathrm{SR}_{\mathrm{scratch}}}{\mathrm{SR}_{\mathrm{scratch}}} \times 100\%.$$
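The relative-improvement metric is a one-liner. The example values below are hypothetical success rates chosen only to reproduce the +52% combat figure reported above, not numbers from the paper:

```python
def rel_improvement(sr_pre, sr_scratch):
    """Relative improvement (in %) of a pretrained-then-fine-tuned
    policy over a from-scratch baseline."""
    return (sr_pre - sr_scratch) / sr_scratch * 100.0

# Hypothetical pair consistent with the reported +52% combat delta:
# from-scratch 0.50 -> fine-tuned 0.76.
```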

These results indicate that NitroGen acquires transferable control primitives, facilitating rapid adaptation to new environments and tasks with limited data.

6. Open-Source Release and Research Impact

All components of NitroGen, including the curated dataset, the multi-game evaluation environments, and the pretrained model weights, are openly released.

By providing internet-scale video–action data, a multi-environment evaluation gym, and a scalable vision–action Transformer baseline, NitroGen lowers barriers for large-scale behavior-cloning pretraining. This establishes a practical testbed for progress in multi-game control, hierarchical planning, and potentially language-conditioned policies for generalist agents (Magne et al., 4 Jan 2026).
