
ARC-AGI Benchmark Series

Updated 21 January 2026
  • ARC-AGI Benchmark Series is a suite of benchmarks designed to assess AGI via fluid, systematic, and few-shot generalization across diverse tasks.
  • It evolved from hand-designed grid challenges (ARC-AGI-1) to interactive and multimodal assessments (ARC-AGI-3 and MMMU), emphasizing abstraction, compositional reasoning, and safety protocols.
  • The benchmarks incorporate rigorous evaluation metrics, human performance baselines, and innovative algorithmic methods to drive progress in robust and efficient AGI systems.

The ARC-AGI Benchmark Series is a suite of benchmarks conceived to operationalize and measure progress in artificial general intelligence (AGI), with particular emphasis on fluid, systematic, and few-shot generalization. Drawing its original philosophical and psychometric motivation from François Chollet’s definition of intelligence as "the ability to achieve diverse goals efficiently and with minimal prior knowledge across a wide range of novel worlds and tasks," the series has evolved over several years to integrate grid-based cognitive challenges, expert-level multimodal reasoning, interactive environments, and safety/stewardship evaluation protocols. It is recognized industry-wide both as a litmus test for general reasoning in foundation models and as a driver of algorithmic innovation in the AGI research community.

1. Historical Origins and Conceptual Foundations

The ARC-AGI series originated with the release of the Abstraction & Reasoning Corpus (ARC) in 2019, introduced as a benchmark for fluid intelligence—specifically, the ability to solve novel, small-data tasks through abstraction and composition rather than rote pattern recognition or exploitation of pretrained skills. ARC tasks are hand-authored and solvable by untrained humans, requiring generalization from few demonstration pairs without reliance on large parametric priors or specialized domain knowledge. The founding principles stressed the measurement of intelligence via efficiency, goal/world diversity, and minimal knowledge dependence (Pfister et al., 13 Jan 2025). The series has been progressively expanded and recalibrated to address both the accelerating progress of large models and critiques regarding brute-force searchability and task memorization.

2. Dataset Structure and Evolution

ARC-AGI-1

The first generation, ARC-AGI-1, consists of 1,000 hand-designed tasks split into public training (400), public evaluation (400), semi-private evaluation (100; used for commercial API scoring), and private evaluation (100; withheld for official leaderboard statements) (Chollet et al., 2024). Each task provides 2–4 input/output grid pairs (demonstrations) and one or more holdout test inputs. Grids are ≤30×30, each cell taking a value in {0,…,9}, with solution correctness requiring every output pixel to exactly match ground truth. All tasks are designed to be tractable with only "core knowledge" (objectness, counting, spatial relations).
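The grid constraints and exact-match criterion above can be made concrete with a short sketch; the function names here are illustrative, not part of any official ARC tooling:

```python
from typing import List

Grid = List[List[int]]  # an ARC grid: rows of integer colour values

def is_valid_grid(grid: Grid) -> bool:
    """Check ARC grid constraints: rectangular, at most 30x30,
    every cell an integer colour in {0, ..., 9}."""
    if not (1 <= len(grid) <= 30):
        return False
    width = len(grid[0])
    return (
        1 <= width <= 30
        and all(len(row) == width for row in grid)
        and all(0 <= cell <= 9 for row in grid for cell in row)
    )

def exact_match(predicted: Grid, target: Grid) -> bool:
    """Solution correctness requires every output pixel to match ground truth."""
    return predicted == target
```

Note that there is no partial credit at the pixel level in ARC-AGI-1: a single wrong cell means the prediction fails.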

ARC-AGI-2

To address the narrowing gap between human and machine performance, ARC-AGI-2 was introduced (Chollet et al., 17 May 2025, Chollet et al., 15 Jan 2026). It preserves the input–output task structure but increases complexity through multi-step, contextually gated, and compositional problems calibrated by extensive human studies. ARC-AGI-2 comprises: 400 public training (with a wide difficulty range), 80–100 public evaluation, 120 semi-private, and 120 private tasks; human accuracy is baselined at ≈75%. Task set construction mandates both broad coverage of abstraction types and uniform human difficulty across splits to avoid overfitting and enable robust progress measurement.

ARC-AGI-3 (Forthcoming)

ARC-AGI-3 transitions from static grid-based reasoning to interactive environments, requiring exploration, planning, persistent memory, goal inference, and value alignment (Chollet et al., 15 Jan 2026, Rudakov et al., 30 Dec 2025). Game-like tasks are procedurally generated, defining state spaces S, action spaces A, and transition dynamics T. Scores will reflect both accuracy and action efficiency, with human benchmarks included for the first time.
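To illustrate the (S, A, T) formulation, here is a toy stateful environment in that spirit; the class, its dynamics, and the action names are assumptions for exposition, not the actual ARC-AGI-3 interface:

```python
class GridWorld:
    """Toy interactive environment: navigate from (0, 0) to the goal cell.
    States S are cell coordinates, A is the four moves, and step() plays
    the role of the transition dynamics T."""

    def __init__(self, size: int = 5):
        self.size = size
        self.state = (0, 0)                  # current element of S
        self.goal = (size - 1, size - 1)
        self.steps = 0                       # tracked for action efficiency

    def actions(self):
        """The action space A."""
        return ["up", "down", "left", "right"]

    def step(self, action: str):
        """Apply transition dynamics T; returns (new_state, done)."""
        dx, dy = {"up": (0, -1), "down": (0, 1),
                  "left": (-1, 0), "right": (1, 0)}[action]
        x, y = self.state
        self.state = (min(max(x + dx, 0), self.size - 1),
                      min(max(y + dy, 0), self.size - 1))
        self.steps += 1
        return self.state, self.state == self.goal
```

An agent evaluated on both accuracy and action efficiency would be rewarded for reaching the goal in as few `step()` calls as possible.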

Multimodal Expansion: MMMU

The MMMU benchmark extends ARC-AGI’s scope to the "expert multimodal" axis (Yue et al., 2023). It comprises 11,500 college-level, multimodal questions (charts, diagrams, maps, music sheets, etc.) spanning six academic disciplines and 30 subjects. Problems demand advanced perception, domain-specific knowledge, and deliberate multi-step reasoning, filling a critical gap unaddressed by prior image-focused or K–12 benchmarks.

3. Evaluation Protocols and Scoring Metrics

Task Scoring

All static ARC-AGI variants employ per-task exact match: a task is solved if all test grid outputs are correct, with up to two guesses per input. The overall score is the percentage of tasks solved on an unseen private evaluation set:

$$\text{Score} = \frac{S}{N_\text{tasks}} \times 100\%$$

ARC-AGI-2 augments this with partial credit metrics and human-difficulty indices (Chollet et al., 17 May 2025).
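The exact-match rule with up to two guesses per test input can be sketched as follows (helper names are illustrative):

```python
def task_solved(guesses_per_test, targets) -> bool:
    """A task counts as solved only if, for every test input, one of the
    (up to two) submitted guesses exactly matches the target grid."""
    return all(
        any(guess == target for guess in guesses[:2])
        for guesses, target in zip(guesses_per_test, targets)
    )

def benchmark_score(solved_flags) -> float:
    """Overall score: percentage of tasks solved on the evaluation set."""
    return 100.0 * sum(solved_flags) / len(solved_flags)
```

This makes explicit that scoring is all-or-nothing per task in the static variants; partial-credit metrics are an ARC-AGI-2 addition layered on top.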

Human Baselines

Human difficulty is rigorously calibrated: only tasks solved by multiple naïve participants in ≤2 trials are retained, and subset means differ by ≤0.01. Aggregate statistics include mean per-task accuracy (≈75%) and median solution time (~2.7 min/test) (Chollet et al., 15 Jan 2026, Chollet et al., 17 May 2025).
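The retention rule can be expressed as a simple filter; the parameter names and the handling of non-solvers here are assumptions, since the papers report only the thresholds themselves:

```python
def retain_task(trials_to_solve, min_solvers: int = 2, max_trials: int = 2) -> bool:
    """Hypothetical calibration filter: keep a task only if at least
    `min_solvers` naive participants solved it within `max_trials`
    attempts. `None` marks a participant who never solved the task."""
    solvers = sum(
        1 for trials in trials_to_solve
        if trials is not None and trials <= max_trials
    )
    return solvers >= min_solvers
```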

Interactive/Multimodal Metrics

ARC-AGI-3 introduces level solution counts, step/action efficiency, and path efficiency relative to the shortest path to the goal (Rudakov et al., 30 Dec 2025). MMMU relies on micro-averaged accuracy over all questions, using exact match for open-ended answers and option IDs for multiple-choice questions (Yue et al., 2023).
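As a concrete (illustrative) reading of these metrics, assuming path efficiency is the ratio of the optimal path length to the agent's actual step count:

```python
def path_efficiency(steps_taken: int, shortest_len: int) -> float:
    """Optimal-path length over actual steps; 1.0 means optimal play.
    The zero-step convention here is an assumption."""
    return shortest_len / steps_taken if steps_taken > 0 else 0.0

def micro_accuracy(correct: int, total: int) -> float:
    """MMMU-style micro-averaged accuracy: pool all questions,
    then divide correct answers by the total count."""
    return correct / total
```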

4. Algorithmic Paradigms and Key Methods

Program Synthesis and Refinement Loop

SOTA ARC-AGI solvers utilize per-task "refinement loops": iterative optimization cycles driven by task-specific feedback (Chollet et al., 15 Jan 2026). The refinement-loop abstraction encompasses program synthesis, test-time training, and application-layer retries.
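A minimal sketch of such a per-task loop, assuming a caller-supplied `propose` function (e.g. a program synthesizer or LLM wrapper) that can condition on feedback from failed attempts:

```python
def refinement_loop(task, propose, max_iters: int = 10):
    """Generic refinement loop: propose a candidate program, verify it
    against the demonstration pairs, and feed the failures back into
    the next proposal. `task["train"]` holds input/output examples."""
    feedback = None
    best = None
    for _ in range(max_iters):
        program = propose(task["train"], feedback)
        failures = [
            (ex["input"], ex["output"], program(ex["input"]))
            for ex in task["train"]
            if program(ex["input"]) != ex["output"]
        ]
        if not failures:
            return program        # all demonstrations satisfied
        feedback = failures       # task-specific feedback drives the retry
        best = program
    return best                   # fall back to the last candidate
```

Test-time training fits the same skeleton if `propose` updates model weights on the demonstrations instead of emitting a new symbolic program.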

Neural Cellular Automata and Developmental Models

Cellular automata and developmental computation paradigms have been applied to ARC-AGI, with neural cellular automata (NCA) and memory-augmented variants (EngramNCA) trained per task using gradient descent (Xu et al., 18 Jun 2025, Guichard et al., 13 May 2025). These models demonstrate competitive few-shot efficiency, local-to-global pattern formation, and cost advantages compared to language-model-based methods, though their solve rates remain moderate.

Multimodal and Vision-Language Methods

Recent work has shown that vision and language exhibit complementary strengths in ARC-AGI pipelines. Vision-Language Synergy Reasoning (VLSR) decomposes tasks into visual pattern abstraction and linguistic rule specification/execution, often with cross-modal self-correction loops (MSSC) yielding significant empirical gains (Zhang et al., 19 Nov 2025). Explicit error-propagation studies confirm that combining multiple textual serializations and image modalities mitigates perceptual bottlenecks and leads to more reliable execution (Wen et al., 11 Nov 2025).

Knowledge Priors and Ontological Reasoning

Best methods increasingly structure solution search and code generation via staged, dependency-aware augmentation of "core knowledge priors" (objectness, geometry/topology, goal-directedness), formalized in frameworks such as KAAR (Lei et al., 23 May 2025). These ontologies mediate the injection of abstractions and action schemas, minimizing prompt interference and improving systematic generalization.
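The dependency-aware staging idea can be sketched as follows; the stage names, their ordering, and the prompt format are assumptions in the spirit of KAAR, not its actual implementation:

```python
# Hypothetical prior stages, each listing the stages it depends on.
PRIOR_STAGES = [
    ("objectness", []),
    ("geometry_topology", ["objectness"]),
    ("goal_directedness", ["objectness", "geometry_topology"]),
]

def staged_prompts(base_prompt: str):
    """Yield prompts augmented one prior stage at a time, so later
    abstractions build on earlier ones instead of being injected all
    at once (which would risk prompt interference)."""
    injected = []
    for name, deps in PRIOR_STAGES:
        assert all(d in injected for d in deps)  # dependency order holds
        injected.append(name)
        yield base_prompt + "\nPriors: " + ", ".join(injected)
```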

Safety, Alignment, and Inherently Safer AGI Protocols

A growing trend is the integration of explicit safety, corrigibility, and bounded-rationality objectives. The language-mediated active inference framework proposes hierarchical multi-agent architectures with transparent natural-language belief/preference separation, resource-aware free energy minimization, and compositional safety checks (Wen, 7 Aug 2025). Benchmarks now include evaluation of safety constraint violations, preference drift, and human oversight efficacy.

5. Model Performance and Empirical Results

Across benchmark versions and modalities, scores range from random baseline (<1%) through large-scale LLMs (up to 87.5% on ARC-AGI-1 with unlimited compute (Pfister et al., 13 Jan 2025)), but drop precipitously on ARC-AGI-2 (≈3%) and more challenging multimodal tasks (≈55–59% for best proprietary LMMs on MMMU; ≈34% for best open-weight LMMs) (Yue et al., 2023, Chollet et al., 17 May 2025, Chollet et al., 15 Jan 2026). Key findings include:

  • Substantial human–AI performance gaps persist on ARC-AGI-2 and MMMU.
  • Top ARC-AGI-2 methods achieve scores of ≈24% on private tasks (2025 ARC Prize).
  • On ARC-AGI-3, training-free graph-based exploration solves median 30/52 interactive levels, outperforming LLM-based agents (Rudakov et al., 30 Dec 2025).
  • Open methods (e.g., SOAR) approach ≈52% on ARC-AGI-1 with self-improved evolutionary LLM loops (Pourcel et al., 10 Jul 2025).

Model performance is capped by knowledge coverage (pretraining overlap) and architectural alignment with task demands; brute-force search or heuristic enumeration suffices for ARC-AGI-1 but fails on ARC-AGI-2 and MMMU.

6. Analysis, Limitations, and Future Trajectory

Knowledge-Dependent Overfitting and Contamination

With growing public familiarity, high-capacity LLMs have memorized ARC-AGI-1 tasks, contaminating evaluation (Chollet et al., 15 Jan 2026). Application-layer prompts can now elicit correct color mappings or grid formats without those details being supplied explicitly. As a consequence, future benchmarks are focusing on novel, procedural, and interactive tasks to preserve the integrity of generalization measurements.

Ongoing Limitations

Key challenges include disentangling domain knowledge from reasoning, avoiding over-compute brute forcing, quantifying sample and compute efficiency, and minimizing risk of leaderboard overfitting. Many methods trade increasing compute for higher scores, prompting calls for reporting resource usage alongside accuracy.

Future Directions

Emphasis is shifting to:

  • Interactive exploration and planning across stateful environments (ARC-AGI-3).
  • Efficiency metrics beyond accuracy (action efficiency, information gain).
  • Safety/robustness validation protocols.
  • Systematic integration of vision-text reasoning and ontological knowledge priors.
  • Expansion to cover multimodal and expert-level domains (as in MMMU).
  • Community-driven competition cycles (ARC Prize) and continuous refinement.

The ARC-AGI Benchmark Series thus remains a dynamic, evolving set of intelligence tests that aim to keep pace with, and ideally drive, progress toward robust, general, efficient, and safe AGI systems.
