
Gaia2 Benchmark

Updated 31 January 2026
  • Gaia2 Benchmark is a dynamic, large-scale evaluation suite designed to rigorously assess agentic AI through 1,120 asynchronous scenarios modeled as MDPs.
  • It features diverse tasks including search, execution, adaptability, ambiguity, and temporal challenges to simulate real-world, noisy environments.
  • Its modular ARE-based design enables extension and cross-benchmark comparisons, highlighting limitations in compute allocation and concurrent planning.

The Gaia2 Benchmark denotes a new generation of large-scale, dynamic evaluation suites for general agentic AI systems, grounded in the ARE (Agents Research Environments) platform. Gaia2 scenarios are engineered to test not only search and procedural execution but also agent capabilities crucial for real-world deployment: handling ambiguous or noisy context, dynamic environment adaptation, collaborative and multi-agent orchestration, and strict temporal constraints. Unlike prior benchmarks, Gaia2 executes asynchronously, introducing concurrency-induced failure modes and explicit cost- and time-based evaluation metrics, thereby surfacing the architectural and compute-allocation limitations of current AI agents (Andrews et al., 21 Sep 2025).

1. Formal Structure and Evaluation Framework

Gaia2 is defined as a suite of $N = 1{,}120$ discrete “scenarios” $\mathcal{S} = \{s_1, \dots, s_N\}$, each specified as a Markov Decision Process (MDP) over a shared environment $\mathcal{E} = (S, A, P, s_0)$:

  • $S$: Joint state comprising all app states, time manager, and notification queue.
  • $A$: Tool-type actions exposed by apps (partitioned into read/write).
  • $P(s' \mid s, a)$: Deterministic transition under seeded universe initialization ($s_0$).

For each scenario $s \in \mathcal{S}$:

  • An event DAG $\{e_i\}$ schedules environment events, governing both deterministic (oracle) and user-driven actions.
  • The “oracle” trajectory $\tau^{\mathrm{oracle}}$ provides a reference sequence of write actions.
  • Verification is performed by $\mathcal{V}$, which compares the agent’s trajectory $\tau^{\mathrm{agent}}$ to the oracle, checking for tool-action parity, argument validity, DAG causality, and compliance with timing constraints.
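The oracle comparison can be sketched as follows. This is a minimal illustration, not ARE's actual verifier: the dataclass, the check order, and the reduction of DAG causality to simple oracle ordering are all simplifying assumptions.

```python
from dataclasses import dataclass

@dataclass
class WriteAction:
    tool: str    # e.g. "Calendar.add_event"
    args: dict   # tool-call arguments
    t: float     # environment time of the call, in seconds

def verify(agent: list, oracle: list, tolerance: float = 25.0) -> bool:
    """Toy verifier: check tool/argument parity against the oracle
    trajectory, ordering (a stand-in for DAG causality), and timing."""
    if len(agent) != len(oracle):
        return False  # missing or extra write actions
    for a, o in zip(agent, oracle):
        if a.tool != o.tool or a.args != o.args:
            return False  # tool-action or argument mismatch
        if abs(a.t - o.t) > tolerance:
            return False  # outside the allowed temporal window
    return True

oracle = [WriteAction("Email.send", {"to": "bob"}, 10.0)]
ok     = [WriteAction("Email.send", {"to": "bob"}, 12.0)]
late   = [WriteAction("Email.send", {"to": "bob"}, 60.0)]
```

A real verifier must additionally validate arguments semantically and allow any linearization consistent with the event DAG, rather than one fixed order.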

The principal metric is Pass@1, with cost and time budgets tracked per scenario. The budget-scaling curve $N(b)$ measures the number of solved scenarios for an agent under a total budget $b$ (e.g., in USD or compute tokens).
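A toy computation of the budget-scaling curve might look like this; the cheapest-successes-first accounting is an assumption made for illustration, not necessarily the paper's exact protocol:

```python
def budget_curve(results, budgets):
    """For each total budget b, count how many scenarios are solved when
    spending on the cheapest successes first.
    `results` maps scenario id -> (passed, cost_in_usd)."""
    costs = sorted(c for passed, c in results.values() if passed)
    curve = {}
    for b in budgets:
        spent, solved = 0.0, 0
        for c in costs:
            if spent + c > b:
                break
            spent += c
            solved += 1
        curve[b] = solved  # this is N(b) under the greedy assumption
    return curve

# Hypothetical per-scenario outcomes, not Gaia2 data.
results = {"s1": (True, 0.10), "s2": (True, 0.50), "s3": (False, 0.20)}
curve = budget_curve(results, budgets=[0.05, 0.25, 1.00])
# → {0.05: 0, 0.25: 1, 1.00: 2}
```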

2. Task Types, Scenario Generation, and Dynamic Universes

All Gaia2 scenarios reside within a “Mobile” environment that simulates a smartphone populated with 12 apps (Email, Calendar, Filesystem, etc.). Contents are generated by Llama 3.3 Instruct, structured via a synthetic persona graph—spanning demographic diversity and instantiating 10 distinct “universe” configurations, each with internal consistency across app data.

Scenario taxonomy reflects six primary capability splits:

  • Search: Multi-app read-only information retrieval, concluding with a single final answer submission.
  • Execution: Multi-step sequences requiring interleaved read/write operations.
  • Adaptability: Agent must adjust its plan in response to changing environment events (e.g., bookings altered in the Calendar or Cabs apps).
  • Time: Tool actions required to occur within specified temporal windows ($[5\,\mathrm{s}, 25\,\mathrm{s}]$ tolerance).
  • Ambiguity: Prompts are contradictory or under-specified; agents must proactively request clarification.
  • Agent2Agent: Some apps are replaced by autonomous sub-agents; the main agent must coordinate via inter-agent messaging.

In addition, Gaia2-mini (160 scenarios) can be automatically augmented with Agent2Agent and noise-induced perturbations—e.g., random tool-call failures (probability $p = 0.1$) and injected extraneous environment events (rate $\lambda = 10\,\mathrm{min}^{-1}$).
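The two stated perturbations can be sketched directly from their parameters. The function names are illustrative, not part of the ARE API:

```python
import random

def noisy_tool_call(tool, *args, p_fail=0.1, rng=random):
    """Wrap a tool call so it fails with probability p_fail, mimicking
    the random tool-call failures (p = 0.1) described above."""
    if rng.random() < p_fail:
        raise RuntimeError("transient tool failure (injected noise)")
    return tool(*args)

def distractor_times(horizon_min, lam=10.0, rng=random):
    """Sample injection times for extraneous environment events as a
    Poisson process with rate lam events/minute over horizon_min minutes."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(lam)  # exponential inter-arrival gaps
        if t >= horizon_min:
            return times
        times.append(t)
```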

3. Asynchronous Execution: Event Loops and Concurrency

Gaia2’s execution model is fundamentally asynchronous. The simulation environment advances continuously in wall-clock (environmental) time, decoupled from agent reasoning speed:

  • Each scenario’s event DAG is processed by a background EventLoop, scheduling environment events (user, oracle, validation) independently of agent action generation.
  • Agents interact through a notification queue, receiving all events $N_k$ accumulated since the last step.
  • Special “System” tools enable agents to manipulate simulation time—e.g., wait(δ) and wait_for_next_notification()—allowing for strategic time advancement or pausing.

This design introduces nontrivial concurrency: critical events or environmental changes can occur while the agent is “thinking,” exposing limitations in static or sequential planning architectures.
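A minimal asyncio sketch of this decoupling, with an environment that keeps emitting events while the agent "reasons" (the timings, event names, and queue shape are illustrative, not ARE's implementation):

```python
import asyncio

async def environment(queue: asyncio.Queue):
    """Background event loop: events fire on environment time,
    independently of how long the agent spends reasoning."""
    for i in range(5):
        await asyncio.sleep(0.01)      # environment time advances
        await queue.put(f"event-{i}")

async def agent(queue: asyncio.Queue, steps: int = 2):
    """At each step the agent receives *all* notifications accumulated
    since its last step -- events never wait for the agent."""
    batches = []
    for _ in range(steps):
        await asyncio.sleep(0.03)      # slow "reasoning" phase
        batch = []
        while not queue.empty():       # drain the notification queue
            batch.append(queue.get_nowait())
        batches.append(batch)
    return batches

async def main():
    q = asyncio.Queue()
    env = asyncio.create_task(environment(q))
    batches = await agent(q)
    await env
    return batches

batches = asyncio.run(main())
```

Because the environment task never blocks on the agent, several events typically pile up in the queue during each reasoning phase; this is exactly the concurrency that sequential planners mishandle.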

4. Protocols for Scoring and Performance Analysis

Evaluation is per-scenario, with hard caps on simulation time (e.g., 5 minutes for time-critical splits), maximum steps (200 tool calls), and maximum dialogue turns. At each agent “send” event, the verifier assesses the full trajectory for correctness, budget adherence, and temporal compliance.

Reported statistics include:

  • Pass@1 overall and by capability split.
  • Average cost per successful scenario ($\bar{c}$), average runtime ($\bar{t}$), and tool-use distributions.
  • Token-generation parameters (for LLMs).
  • Budget-scaling curves, capturing the Pareto frontier of cost-versus-competence for agent models.
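The cost-versus-competence Pareto frontier over a set of models can be computed as follows; the model names and numbers below are made up for illustration, not Gaia2 results:

```python
def pareto_frontier(models):
    """Given {name: (cost_per_success_usd, pass_rate)}, keep models
    not dominated by any other (cheaper AND at least as accurate)."""
    front = []
    for name, (cost, acc) in models.items():
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for other, (c, a) in models.items() if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)

models = {
    "big":   (4.00, 0.62),  # hypothetical numbers for illustration
    "mid":   (1.50, 0.48),
    "small": (0.40, 0.21),
    "weak":  (2.00, 0.20),  # dominated: pricier and weaker than "mid"
}
front = pareto_frontier(models)
# → ["big", "mid", "small"]
```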

5. Empirical Findings and Diagnostic Patterns

Experimental analysis of multiple agentic systems on Gaia2 reveals:

  • Execution and Search scenarios are the least challenging, with model pass rates of 50–80%.
  • Ambiguity and Adaptability splits are extremely challenging; only high-reasoning models (Claude-4 Sonnet, GPT-5) achieve >30% success.
  • Temporal Constraints induce an inverse-scaling law: more capable (deeper-reasoning) models are often computationally too slow to satisfy real-time window requirements, reducing pass rates drastically unless “generation-time” is artificially collapsed.
  • Introduction of Noise causes severe performance degradation: default noise parameters drop Claude-4 mini performance from 31.2% to 8.1%.
  • Hierarchical multi-agent decomposition (Agent2Agent) benefits lighter models but yields little improvement, or even regression, for top-tier planners.
  • Pass@1 success plateaus well before full coverage, regardless of budget increases; simply scaling up compute provides diminishing returns for current agent scaffolding architectures.
  • Raw performance, cost, and efficiency (success per dollar/time) vary widely between models; some open models (e.g., Kimi-K2) solve a competitive fraction of tasks at lower cost points.

6. Extensibility, Portability, and ARE Platform Integration

All scenario materials—event DAGs, oracle trajectories, and verifiers—are modularized, enabling straightforward extension:

  • Researchers can add or modify environment applications, populate alternative universes, reuse DAGs under new configurations (e.g., swap agent/app assignments), or inject systematic augmentations via the ARE API.
  • ARE supports rapid adaptation of existing benchmarks (e.g., τ-bench, BFCLv3) for direct, environment-consistent comparison.
  • Benchmarking new agentic paradigms or task types (e.g., real-world app integration, API-based workflows) is directly facilitated by these abstractions.

This modularization enables cross-benchmark “apples-to-apples” evaluation and supports rapid evolution as agent capabilities—and target environments—change.

7. Implications for Agent Architectures and the Future of Evaluation

Gaia2’s design and results expose structural deficiencies and research frontiers:

  • Asynchronous and Multi-stream Agent Control: Sequential ReAct-style planning loops become untenable in real-time, event-driven environments. Agents must support overlapping perception, interruptible inference, and proactive subtask scheduling—akin to real-time OS kernels.
  • Adaptive Compute Allocation: The inverse scaling observed in Time scenarios demonstrates that raw reasoning depth does not translate to practical competence under cost/time constraints. Future agentic systems must include meta-control for allocating computation adaptively—utilizing shallow pipelines for trivial actions and deep planning when uncertainty/risk justifies the expense.
  • Evaluation Methodology: Compute-normalized metrics (success per dollar/second) will become primary, supplanting pure accuracy as compute and efficiency trade-offs dominate performance at scale.
  • Realistic Deployment Readiness: By embedding ambiguity, non-determinism, and systematic challenges (multi-agent, noise, temporal precision), Gaia2 benchmarks approach the operational constraints of real-world AI assistant deployment.
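As a minimal illustration of such meta-control, a policy can gate the expensive deep planner on both estimated uncertainty and the remaining time window; the thresholds and latencies here are hypothetical:

```python
def choose_pipeline(uncertainty: float, time_left_s: float,
                    deep_latency_s: float = 20.0,
                    threshold: float = 0.5) -> str:
    """Toy meta-controller: invoke the deep planner only when
    uncertainty justifies the expense AND its latency fits the
    remaining temporal window; otherwise use the fast shallow path."""
    if uncertainty > threshold and time_left_s > deep_latency_s:
        return "deep"
    return "shallow"
```

Under this rule a highly uncertain decision still gets the shallow path when the deadline is near, mirroring the inverse-scaling failure mode observed in the Time split.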

Agent evaluation, as epitomized by Gaia2, is moving beyond static, sequential tool use towards rigorously stress-testing robustness, adaptability, and efficiency under dynamic, resource-constrained, and adversarial conditions (Andrews et al., 21 Sep 2025).
