
AgentGym: Unified Platform for LLM Agents

Updated 3 February 2026
  • AgentGym is a unified and extensible research platform that standardizes training, evaluation, and evolution of LLM-based agents through a modular design and a consistent HTTP API.
  • It integrates reinforcement learning, imitation learning, and hybrid protocols like AgentEvol to achieve state-of-the-art performance across diverse benchmarks.
  • Its plug-and-play modules and curriculum-based horizon scaling enable scalable multi-environment generalization for tasks like web navigation, code editing, and gaming.

AgentGym is a unified, extensible research platform for training, evaluating, and evolving LLM-based agents across diverse interactive environments. It provides a standardized interface, a modular system for easy integration of new environments and tasks, and a scalable evolution-based methodology for driving policy improvement across domains. AgentGym has been adopted in both domain-general and task-specific settings, supports both reinforcement learning (RL) and imitation learning, and ships with multiple open-source implementations and large benchmarks (Xi et al., 2024, Xi et al., 10 Sep 2025, Jain et al., 9 Apr 2025, Amrouni et al., 2021).

1. Architecture and Platform Design

AgentGym's architecture centers on three decoupled, “plug-and-play” modules designed to maximize flexibility and horizontal scalability (Xi et al., 10 Sep 2025, Xi et al., 2024):

  • Environment Module: Each environment exposes a standardized HTTP API (“/createEnv”, “/reset”, “/observation”, “/available_actions”, “/step”) for stateless, concurrent interaction. Environments span web navigation, embodied control (BabyAI, ALFWorld), text-based games (Wordle, MAZE), digital games (TextCraft), programming (BIRD), tools (Weather, Movie, Academia), and more (Xi et al., 2024, Xi et al., 10 Sep 2025).
  • Agent Module: Encapsulates an LLM-based policy π_θ. At each decision step k, the agent receives the current observation o_k and generates a natural language “thought,” which is mapped to a discrete action a_k. The module supports various prompting strategies (ReAct, chain-of-thought) and maintains a trajectory buffer for training (Xi et al., 10 Sep 2025).
  • Training Module: Operates RL or behavioral cloning pipelines; manages rollouts, advantage estimation, policy updates, curriculum schedules, and logging. This module is compatible with mainstream RL algorithms (PPO, REINFORCE++, GRPO, RLOO) and offline/preference-based algorithms (SFT, DPO, rejection sampling/AgentEvol). Distributed processing is supported via multi-process and multi-node batch execution (Xi et al., 10 Sep 2025, Xi et al., 2024).

The AgentGym platform provides a real-time, uniform-format, concurrent interaction API and supports seamless integration of new environments, requiring only the HTTP endpoint interface and a Python client stub (Xi et al., 2024). Trajectory and instruction data are stored in uniform JSON schemas.
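
The five endpoints above are simple enough that a client fits on one page. The sketch below is a hypothetical minimal client: the endpoint names come from the text, but the JSON payload fields (`id`, `action`) are assumptions, since the text specifies only the routes. The injectable `transport` parameter exists purely so the stub can be exercised without a running server.

```python
# Hypothetical minimal client for an AgentGym-style environment server.
# Endpoint names follow the documented API; payload shapes are assumptions.
import json
from urllib import request as _rq


class EnvClient:
    def __init__(self, base_url: str, transport=None):
        self.base = base_url.rstrip("/")
        # transport(url, payload) -> dict; injectable for offline testing.
        self.transport = transport or self._http_post

    def _http_post(self, url, payload):
        # Plain stdlib JSON POST; no third-party dependencies.
        req = _rq.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
        with _rq.urlopen(req) as resp:
            return json.loads(resp.read())

    def create_env(self):
        return self.transport(f"{self.base}/createEnv", {})

    def reset(self, env_id):
        return self.transport(f"{self.base}/reset", {"id": env_id})

    def observation(self, env_id):
        return self.transport(f"{self.base}/observation", {"id": env_id})

    def available_actions(self, env_id):
        return self.transport(f"{self.base}/available_actions", {"id": env_id})

    def step(self, env_id, action):
        return self.transport(f"{self.base}/step", {"id": env_id, "action": action})
```

Because each call is a stateless HTTP POST keyed by an environment id, many environment instances can be driven concurrently from one trainer process, which is what the platform's batch rollout machinery relies on.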

2. Environment Diversity, Task Suite, and Benchmarking

AgentGym distinguishes itself by curating diverse, high-fidelity environments and a large instruction/trajectory corpus for developing generalist agents (Xi et al., 2024, Xi et al., 10 Sep 2025):

  • Scope: The AgentGym suite includes 14+ environments and 89+ task types (web navigation, embodied reasoning, code editing, search/retrieval, games). For SWE (software engineering), the R2E-Gym instantiation offers 8,135 Gym-style code-editing tasks procedurally curated from GitHub (Jain et al., 9 Apr 2025).
  • Instruction and Trajectory Pools: ∼20,000 instructions (expanded via GPT-4, self-instruct, crowdsourcing), ∼14,000 expert or semi-expert trajectories (AgentTraj, AgentTraj-L). Benchmarks such as AgentEval (1,160 held-out instructions) provide standardized test sets for evaluation consistency (Xi et al., 2024).
  • Evaluation Metrics: Average success rate or normalized reward per environment. For code-editing, formal metrics include pass@k, test distinguishability, and toxicity rates (Jain et al., 9 Apr 2025). AgentGym experiments involve comparison with state-of-the-art proprietary (Claude-3, GPT-4) and open-source models (Llama-2, Qwen-2.5, DeepSeek) (Xi et al., 2024, Xi et al., 10 Sep 2025).
Environment Class | Example Domains            | Benchmark Tasks
Web Navigation    | WebShop, WebArena          | Shopping, CMS
Embodied          | ALFWorld, BabyAI, SciWorld | GoTo, Measure
Code-Editing      | R2E-Gym/SWE-Bench          | Bug Fixing
Games             | TextCraft, MAZE, Wordle    | Depth 1–4, Trivia
Tool Use          | Weather, Movie, Sheet      | Task Execution
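
The pass@k metric mentioned above is usually computed with the standard unbiased estimator from the code-generation literature; the text does not say whether the cited papers use exactly this variant, so treat the following as a reference implementation of the common convention rather than the papers' own code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    samples drawn without replacement from n candidates, c of which pass,
    is correct. Standard estimator in the code-generation literature;
    assumed, not confirmed, to match the cited papers' computation."""
    if n - c < k:  # cannot even draw k all-failing samples
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 candidates of which 5 pass, pass@1 is 0.5 while pass@5 is well above 0.9, which is why sampling more candidates plus a strong reranker (Section 5) pays off.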

AgentGym's diversity and scale underpin its ability to train and robustly evaluate generalist agents and to analyze transfer across domains (Xi et al., 2024, Xi et al., 10 Sep 2025).

3. Training Algorithms and Evolutionary Methods

AgentGym supports a spectrum of RL, behavioral cloning, and hybrid methods suitable for long-horizon, multi-turn decision-making (Xi et al., 10 Sep 2025, Xi et al., 2024, Jain et al., 9 Apr 2025):

  • Reinforcement Learning: Formalized as a POMDP (U, S, A, O, T, r). Standard policy gradients optimize J(θ) = E_{τ∼π_θ}[r(τ)], with practical implementation via PPO (clipped surrogate), GRPO, REINFORCE++, and RLOO (Xi et al., 10 Sep 2025).
  • Behavioral Cloning (BC): Supervised fine-tuning to maximize log-likelihood over expert trajectory datasets, J_BC(θ) = E_{(e,u,τ)∼D_s}[log π_θ(τ|e,u)] (Xi et al., 2024).
  • AgentEvol Algorithm: AgentEvol is an iterative self-evolution protocol for multi-environment agent generalization. It alternates between (1) exploration: sampling new trajectories under the current policy for all (environment, instruction) pairs, scoring each by reward, and (2) learning: optimizing a reward-weighted log-likelihood objective that merges these samples with the original BC data (Xi et al., 2024). This inference-as-learning approach avoids instability seen in classic RL, promoting stable and scalable policy improvement in the multi-environment regime.

for m in 1 ... M:
    # Exploration step: sample trajectories under the current policy
    D_m = {(e, u^j, τ^j) : e ∈ E, u^j ∈ Q_e, τ^j ∼ π_θ^m(·|e, u^j)}
    # Reward labeling
    score each trajectory with r(e, u, τ)
    D_m ← D_m ∪ original BC data
    # Learning step: reward-weighted behavioral cloning
    θ^{m+1} = argmax_θ E_{(e,u,τ)∼D_m}[r(e,u,τ) · log π_θ(τ|e,u)]
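
The learning step's reward-weighted log-likelihood objective reduces to a very small loss function once per-trajectory log-probabilities are available. The batch-mean estimator and signature below are illustrative assumptions, a sketch of the objective rather than the released training code.

```python
def reward_weighted_bc_loss(logps, rewards):
    """Toy estimate of the AgentEvol learning objective
    J(θ) = E_{(e,u,τ)∼D_m}[r(e,u,τ) · log π_θ(τ|e,u)],
    negated so it can be minimized. `logps` are per-trajectory
    log-likelihoods under the current policy, `rewards` the scalar
    reward labels. Batch-mean weighting is an assumption."""
    assert len(logps) == len(rewards)
    return -sum(r * lp for lp, r in zip(logps, rewards)) / len(logps)
```

Note that trajectories with zero reward contribute nothing to the gradient, which gives the procedure its rejection-sampling flavor: only rewarded rollouts (plus the merged BC data) shape the update.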

AgentGym also supports preference optimization (DPO), test-time hybrid verification (see Section 5), and curriculum/intervention via interaction horizon schedules (see ScalingInter-RL below).
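
For the preference-optimization path, the standard DPO loss on a single (chosen, rejected) pair is shown below. This is the generic formulation from the DPO literature (Rafailov et al.), not necessarily the exact variant wired into AgentGym's training module; batching and reference-model handling are left out.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair:
    -log σ(β · [(log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l))]).
    Generic DPO form; AgentGym's exact batching/weighting is not
    specified in the text above."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed stably with log1p.
    return math.log1p(math.exp(-margin))
```

Increasing the policy's margin for the chosen trajectory over the rejected one monotonically lowers the loss, which is the whole mechanism.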

4. Curriculum Scheduling and Stability: ScalingInter-RL

Training long-horizon LLM agents is highly susceptible to optimization instability and exploration collapse. The ScalingInter-RL protocol addresses these issues through curriculum-based horizon adjustment (Xi et al., 10 Sep 2025):

  • Progressive Horizon Scaling: The allowed number of agent-environment interaction turns h_t is increased according to a fixed schedule h_{t+1} = h_t + δ_h every Δ training steps. Early phases use short horizons for exploitation and rapid reward acquisition; later phases expand to longer horizons, enabling the development of planning, backtracking, and complex behaviors.
  • Empirical Results: Fixed long horizons can cause training collapse; fixed short horizons induce premature plateauing. ScalingInter-RL produces smoother reward curves and achieves higher long-term returns with reduced compute (Xi et al., 10 Sep 2025).
  • Curriculum in Multi-Turn Settings: By adapting the maximum rollout length dynamically, agents more reliably progress from short, direct problem-solving to handling open-ended, multi-stage scenarios. This is especially critical for LLM-based agents subject to high variance in RL signals.
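
A ScalingInter-RL-style schedule reduces to a few lines. All numeric defaults below (initial horizon, increment δ_h, interval Δ, cap) are illustrative assumptions, not the paper's settings.

```python
def horizon_at(step: int, h0: int = 4, delta_h: int = 2,
               interval: int = 100, h_max: int = 30) -> int:
    """Interaction-turn cap under a ScalingInter-RL-style curriculum:
    the horizon grows by delta_h every `interval` training steps,
    starting at h0 and saturating at h_max. Defaults are illustrative
    assumptions, not the published hyperparameters."""
    return min(h0 + delta_h * (step // interval), h_max)
```

The trainer would simply truncate rollouts at `horizon_at(step)` turns; the cap on h_max is an assumption to keep late-stage rollouts bounded.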

5. Test-Time Verification and Hybrid Evaluation

In the code-editing and SWE domains, AgentGym incorporates advanced verification and reranking strategies to maximize test-time success (Jain et al., 9 Apr 2025):

  • Execution-Based (EB) Verification: Generator agents create new unit tests, run candidate code patches, and score each candidate by the number of passing tests, subject to regression-test filtering to preserve functionality.
  • Execution-Free (EF) Verification: Trained reward models score trajectories based on textual signals alone, returning a continuous score s^EF ∈ [0, 1].
  • Hybrid Reranking: Candidates are reranked using a hybrid score s^H = Top_n(s^EF) + s^EB, restricting costly EB evaluation to only the top-n EF-rated candidates. This leverages complementary strengths: EB provides ground-truth verification, EF yields fine-grained signal. The full hybrid method achieves up to 51% pass@1 on SWEBench-Verified, substantially exceeding either axis alone and closing much of the gap with proprietary agents (Jain et al., 9 Apr 2025).
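
The hybrid rule can be sketched as: score every candidate with the cheap execution-free verifier, keep the top-n, then spend execution-based verification only on those survivors and pick the combined best. Function names and signatures below are illustrative, not the paper's API.

```python
def hybrid_rerank(candidates, ef_score, eb_score, n=3):
    """Hybrid reranking sketch. `ef_score` is the cheap execution-free
    verifier, `eb_score` the expensive execution-based one (number of
    passing generated tests, say). EB is evaluated only on the top-n
    EF-ranked candidates; the winner maximizes the summed score.
    Names/signatures are assumptions for illustration."""
    ranked = sorted(candidates, key=ef_score, reverse=True)
    shortlist = ranked[:n]  # only these incur execution cost
    return max(shortlist, key=lambda c: ef_score(c) + eb_score(c))
```

The design point is cost control: EB verification requires building and running each patch, so restricting it to n candidates bounds compute while keeping its ground-truth signal for the final pick.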

6. Empirical Performance, Insights, and Limitations

AgentGym-trained agents achieve competitive or state-of-the-art results across a broad spectrum of benchmarks (Xi et al., 2024, Xi et al., 10 Sep 2025, Jain et al., 9 Apr 2025):

  • Generalization: The AgentEvol method reliably improves both open-source (Llama-2-13B, DeepSeek) and proprietary (Qwen, Sonnet) agents when scaling across environments. Merging new interactive data with base BC data prevents catastrophic forgetting and facilitates learning on unseen instructions.
  • SOTA Results: On SWEBench-Verified (code editing), AgentGym’s hybrid reranking yields 51% pass@1 (Qwen-2.5-Coder-32B), outperforming most open baseline models and matching top proprietary agents such as Agentless-1.5/O1 (Jain et al., 9 Apr 2025). In long-horizon RL, ScalingInter-RL agents close 30+ point performance gaps on multiple tasks and often match or best closed-source LLMs on web, scientific, and game environments (Xi et al., 10 Sep 2025).
  • Key Insights: Staged curricula (e.g., ScalingInter-RL) are crucial for long-horizon stability; reward-weighted BC and inference-as-learning offer stable evolution on open-ended, multi-environment problems; transfer to novel instructions is enhanced by pooling interaction data across worlds.
  • Limitations: Current sampling strategies (e.g., K = 1 in AgentEvol) are compute-constrained. Execution-free verifiers can show bias towards trajectory text rather than action efficacy, and hybrid pipelines depend on test quality for signal. Integration with RL critics, better test generation, and scaling to mixture-of-experts remain open directions (Jain et al., 9 Apr 2025, Xi et al., 2024).

7. Influence, Extensibility, and Outlook

AgentGym's modular architecture and standardized API have facilitated widespread adoption and adaptation across research domains including general agentic intelligence, code reasoning, web navigation, and interactive RL. The design enables:

  • Plug-and-Play Extension: Adding new environments requires implementing five standardized HTTP endpoints and a Python stub, promoting rapid prototyping (Xi et al., 2024).
  • Collaborative Research: Open release of environments, trajectory corpora, baseline agents, benchmarks, and evaluation tools via https://github.com/WooooDyy/AgentGym encourages reproducible research and method development (Xi et al., 2024).
  • Current Community Focus: AgentGym is used as the empirical substrate for research into multi-agent collaboration, curriculum RL, preference learning, code-centric reasoning, and real-world tool/use interaction. It underpins recent advances in sim-to-real transfer, hybrid RL/inference pipelines, and large-scale agent benchmarking (Xi et al., 2024, Xi et al., 10 Sep 2025, Jain et al., 9 Apr 2025).
  • Avenues for Advancement: Proposed directions include scaling evolution to larger (∼70B) models, automated LLM-driven environment curation, more robust preference integration, and the design of critic-augmented RL for sparse reward scenarios.

AgentGym and its algorithmic extensions (e.g., AgentEvol, ScalingInter-RL) provide the methodological basis for advancing the continual, open-ended evolution of LLM-based agents and generalist AI systems (Xi et al., 2024, Xi et al., 10 Sep 2025).
