
$Agent^2$: An Agent-Generates-Agent Framework for Reinforcement Learning Automation

Published 16 Sep 2025 in cs.AI and cs.LG | (2509.13368v2)

Abstract: Reinforcement learning (RL) agent development traditionally requires substantial expertise and iterative effort, often leading to high failure rates and limited accessibility. This paper introduces Agent$^2$, an LLM-driven agent-generates-agent framework for fully automated RL agent design. Agent$^2$ autonomously translates natural language task descriptions and environment code into executable RL solutions without human intervention. The framework adopts a dual-agent architecture: a Generator Agent that analyzes tasks and designs agents, and a Target Agent that is automatically generated and executed. To better support automation, RL development is decomposed into two stages, MDP modeling and algorithmic optimization, facilitating targeted and effective agent generation. Built on the Model Context Protocol, Agent$^2$ provides a unified framework for standardized agent creation across diverse environments and algorithms, incorporating adaptive training management and intelligent feedback analysis for continuous refinement. Extensive experiments on benchmarks including MuJoCo, MetaDrive, MPE, and SMAC show that Agent$^2$ outperforms manually designed baselines across all tasks, achieving up to 55\% performance improvement with consistent average gains. By enabling a closed-loop, end-to-end automation pipeline, this work advances a new paradigm in which agents can design and optimize other agents, underscoring the potential of agent-generates-agent systems for automated AI development.

Summary

  • The paper introduces Agent², an innovative framework that uses LLMs to automate the creation and optimization of RL agents.
  • It employs a dual-agent approach to automatically generate MDP models and select effective RL algorithms for enhanced performance.
  • Experimental results on benchmarks like MuJoCo and SMAC demonstrate up to 55% improvement over manually designed agents.

Agent$^2$: An Overview of Automated RL Agent Design

The paper "Agent$^2$: An Agent-Generates-Agent Framework for Reinforcement Learning Automation" (2509.13368) introduces a framework that uses LLMs to automate the design of reinforcement learning (RL) agents. Agent$^2$ aims to reduce the complexity and domain expertise that RL agent development traditionally requires by providing an end-to-end automated pipeline. It uses a dual-agent architecture: a Generator Agent analyzes the task and designs the agent, and the resulting Target Agent is generated automatically and interacts with the environment for training and evaluation.

Framework Architecture and Methodology

The architecture of Agent$^2$ begins with task analysis and then follows the two stages into which the paper decomposes RL development: MDP modeling and algorithmic optimization. First, the framework uses LLMs to analyze the task description and environment code, interpreting and formalizing the target problem. It then constructs a Markov Decision Process (MDP) tailored to that problem, automatically designing the state space, the action space, and the reward function.

Figure 1: The framework of Agent$^2$, illustrating the stages from task analysis to agent training and evaluation.
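
To make the MDP modeling step more concrete, the following is a minimal sketch of what generated MDP components and a pre-training check might look like. It is an illustration under stated assumptions, not the paper's implementation: `MDPSpec`, `observe`, `reward`, `verify`, and the toy feature names (`x`, `v`, `goal`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


# Hypothetical container for the MDP components a generator might emit.
# None of these names come from the paper.
@dataclass
class MDPSpec:
    observe: Callable[[Dict[str, float]], List[float]]  # raw env info -> state vector
    actions: Sequence[str]                               # discrete action set
    reward: Callable[[Dict[str, float], str], float]     # (raw info, action) -> reward
    gamma: float = 0.99                                  # discount factor


def observe(raw: Dict[str, float]) -> List[float]:
    """Example state function: keep two raw features and add a derived one."""
    return [raw["x"], raw["v"], raw["goal"] - raw["x"]]


def reward(raw: Dict[str, float], action: str) -> float:
    """Example reward: negative distance to the goal minus a small action cost."""
    distance = abs(raw["goal"] - raw["x"])
    cost = 0.0 if action == "stay" else 0.01
    return -distance - cost


def verify(spec: MDPSpec, sample: Dict[str, float]) -> List[str]:
    """Cheap checks run before training; returns a list of error messages."""
    errors = []
    state = spec.observe(sample)
    if not state or not all(isinstance(s, float) for s in state):
        errors.append("state must be a non-empty list of floats")
    if not spec.actions:
        errors.append("action set is empty")
    for a in spec.actions:
        if not isinstance(spec.reward(sample, a), float):
            errors.append(f"reward for action {a!r} is not a float")
    if not 0.0 <= spec.gamma <= 1.0:
        errors.append("gamma must lie in [0, 1]")
    return errors


spec = MDPSpec(observe=observe, actions=("left", "stay", "right"), reward=reward)
sample = {"x": 0.5, "v": 0.0, "goal": 2.0}
print(verify(spec, sample))  # an empty list means every check passed
```

In a setup like this, any error messages returned by the check could be fed back to the generator as input for the next revision of the components.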

In the algorithmic optimization stage, Agent$^2$ autonomously selects suitable RL algorithms, designs network architectures, and optimizes hyperparameters. The framework is built on the Model Context Protocol, which standardizes the integration of services and supports iterative refinement of the agent's performance.
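
The selection-and-refinement idea can be sketched as a small closed loop. This is only a toy illustration: the selection rule is a rule of thumb based on the action space, the "training run" is a stand-in scoring function, and every name here (`choose_algorithm`, `train_and_evaluate`, `refine`, `optimize`) is hypothetical rather than taken from the paper.

```python
import random
from typing import Dict, Tuple


def choose_algorithm(action_space: str, multi_agent: bool) -> str:
    """Rule-of-thumb selection; a real generator would weigh far more context."""
    if multi_agent:
        return "MAPPO"
    return "DQN" if action_space == "discrete" else "SAC"


def train_and_evaluate(config: Dict[str, float], rng: random.Random) -> float:
    """Stand-in for a training run: a noisy toy score that peaks near lr = 3e-4."""
    lr = config["learning_rate"]
    return -abs(lr - 3e-4) * 1e3 + rng.gauss(0.0, 0.01)


def refine(config: Dict[str, float], rng: random.Random) -> Dict[str, float]:
    """Stand-in for feedback analysis: rescale the learning rate."""
    new = dict(config)
    new["learning_rate"] = config["learning_rate"] * rng.choice([0.5, 0.8, 1.25, 2.0])
    return new


def optimize(rounds: int = 20, seed: int = 0) -> Tuple[Dict[str, float], float]:
    """Propose a config, score it, and keep it only if the score improves."""
    rng = random.Random(seed)
    best_cfg = {"learning_rate": 1e-2, "batch_size": 256}
    best_score = train_and_evaluate(best_cfg, rng)
    for _ in range(rounds):
        candidate = refine(best_cfg, rng)
        score = train_and_evaluate(candidate, rng)
        if score > best_score:
            best_cfg, best_score = candidate, score
    return best_cfg, best_score


print(choose_algorithm("continuous", multi_agent=False))
print(optimize())
```

The loop keeps a candidate only when its score improves, so the best score never decreases across rounds. A real pipeline would replace the stand-in scorer with actual training runs and the random perturbation with LLM-driven analysis of training feedback.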

Experimental Results and Evaluation

Agent$^2$ outperforms manually designed RL agents on benchmarks including MuJoCo, MetaDrive, MPE, and SMAC, with improvements of up to 55%. These results underscore the efficacy of the automated design and optimization processes.

Figure 2: Performance improvement across Task-to-MDP Mapping and Algorithmic Optimization.

The evaluation covers both single-agent and multi-agent environments, highlighting the generality and adaptability of Agent$^2$. For instance, on MuJoCo tasks such as Ant and Humanoid, the framework surpassed existing RL libraries such as Xuance and Tianshou by optimizing agent configurations that default settings had left suboptimal. Even in a challenging environment such as SMAC's 1c3s5z scenario, Agent$^2$ raised the win rate from 0.17 to 0.23, reflecting its ability to handle complex cooperative tasks.
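
As a quick check on the size of the SMAC gain, the relative improvement can be worked out directly. This assumes improvement is measured relative to the baseline win rate; the summary does not state how the paper computes its percentages.

```latex
\[
\frac{0.23 - 0.17}{0.17} = \frac{0.06}{0.17} \approx 0.353
\quad \text{(about a 35\% relative improvement)}
\]
```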

Implications and Future Directions

The introduction of Agent$^2$ marks a significant advancement in automating the RL pipeline, enabling wider accessibility and reducing the need for extensive manual intervention and domain expertise. The framework's adaptability suggests potential applications across various domains where RL is applicable, including robotics and autonomous systems.

Future research could explore extending the Agent$^2$ framework to handle increasingly complex and dynamic environments, potentially integrating additional capabilities for real-time learning and adaptation. Another promising direction is leveraging the framework's automation capabilities to explore new RL paradigms or integrate multi-modal environments.

Conclusion

Agent$^2$ provides a comprehensive, automated approach to RL agent design, using LLM capabilities to streamline the process from task specification to agent evaluation. Its dual-agent architecture enables a high degree of automation in both MDP modeling and algorithmic optimization, achieving significant performance improvements across diverse tasks. The success of Agent$^2$ in experimental evaluations points to its potential to make AI development more scalable and accessible across application areas.

Glossary

  • Ablation studies: Controlled experiments that remove or isolate components of a system to quantify their impact on performance. "Finally, we perform ablation studies to quantify the respective impact of the two stages: Task-to-MDP mapping and algorithmic optimization."
  • Adaptive verification and refinement: An automated process that verifies generated components and iteratively improves them using error and performance feedback. "To address this, we introduce an adaptive verification and refinement framework that integrates generated components into the RL pipeline and iteratively improves them via automated validation and feedback."
  • Action masking: A technique that restricts an agent’s available actions at a given state to guide learning and enforce constraints. "recent methods use LLM-generated action masking or suboptimal policies to dynamically constrain and guide RL agents"
  • Agent$^2$: The proposed LLM-driven framework that automates the design and optimization of reinforcement learning agents. "Agent$^2$ outperforms manually designed baselines across all tasks, achieving up to 55\% performance improvement with consistent average gains."
  • Agent-generates-agent: A paradigm where one agent designs and produces another agent to perform a target task. "an agent-generates-agent framework for fully automated RL agent design."
  • AutoML: Automated machine learning techniques for algorithmic and configuration optimization, reducing manual tuning. "Research on its algorithmic optimization mainly follows the AutoML, which has seen rapid development and become relatively mature~\citep{he2021automl}."
  • AutoRL: Automation of the reinforcement learning pipeline, including environment modeling and algorithm selection/tuning. "most existing AutoRL approaches automate only a single stage of the RL pipeline"
  • Credit assignment problem: The challenge of determining which actions or agents should receive credit for observed outcomes in multi-agent RL. "research mainly focuses on solving the credit assignment problem for effective reward distribution~\citep{nagpal2025leveraging, lin2025speaking, he2025enhancing}."
  • DQN: Deep Q-Network, a value-based RL algorithm for discrete action spaces. "value-based methods like DQN are appropriate for discrete action spaces"
  • Discount factor: A scalar $\gamma$ that weights future rewards relative to immediate rewards in RL. "and $\gamma$ the discount factor."
  • EvoPrompting: A method that uses LLMs as adaptive operators within evolutionary neural architecture search. "EvoPrompting~\citep{chen2023evoprompting} uses LMs as adaptive operators in evolutionary NAS."
  • Feature selection: Choosing a subset of relevant state variables or features to improve learning efficiency. "possibly applying feature selection, dimensionality reduction, or combining observations to improve learning efficiency."
  • Generator Agent: The autonomous agent in Agent$^2$ that analyzes tasks and generates all components needed for the Target Agent. "The Generator Agent serves as an autonomous AI designer, capable of analyzing and producing all necessary components for an RL agent."
  • Hyperparameter optimization: The process of selecting and tuning training parameters (e.g., learning rate, batch size) to improve performance. "For hyperparameter optimization, some studies use LLMs to suggest and iteratively refine hyperparameter configurations based on dataset and model descriptions"
  • MAPPO: Multi-Agent Proximal Policy Optimization, an actor-critic algorithm tailored for multi-agent cooperation. "we adopt MAPPO, a classic algorithm that has demonstrated strong performance in many benchmarks such as MPE and SMAC."
  • MDP: Markov Decision Process, a formal model of decision making defined by states, actions, transitions, rewards, and a discount factor. "Formally, an MDP is defined as $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$"
  • MetaDrive: A large-scale autonomous driving simulator environment for RL research. "MetaDrive~\citep{li2022metadrive}, a large-scale autonomous driving simulator where agents must safely navigate diverse and dynamic traffic scenarios"
  • Model Context Protocol (MCP): A protocol ensuring standardized integration of tools and services in the agent framework. "operates in compliance with the Model Context Protocol (MCP), ensuring standardized integration of services."
  • MuJoCo: A physics engine and suite of continuous control tasks widely used for RL benchmarking. "benchmarks including MuJoCo, MetaDrive, MPE, and SMAC"
  • NAS: Neural architecture search, automated discovery of neural network designs. "evolutionary NAS."
  • Partial observability: When agents cannot fully observe the environment state, complicating decision making. "helping to address limitations from partial observability."
  • Policy gradient methods: RL algorithms that directly optimize the policy via gradient ascent, typically used for continuous control. "policy gradient methods such as PPO and SAC are better for continuous control."
  • PPO: Proximal Policy Optimization, a policy gradient algorithm emphasizing stable updates. "we employ PPO, SAC, and TD3 on the MuJoCo environments"
  • Quality diversity optimization: Evolutionary optimization that seeks a diverse set of high-performing solutions. "combines LLM code generation with quality diversity optimization to discover diverse and effective architectures"
  • Reward shaping: Modifying or augmenting the reward function to guide learning and improve performance, especially with sparse rewards. "developing automated reward shaping methods"
  • SAC: Soft Actor-Critic, an off-policy RL algorithm optimizing a stochastic policy with an entropy term. "we employ PPO, SAC, and TD3 on the MuJoCo environments"
  • Sample efficiency: The effectiveness of an RL method in learning good policies with fewer environment interactions. "improving sample efficiency and policy adaptability"
  • SMAC: StarCraft Multi-Agent Challenge, a benchmark of cooperative micromanagement tasks for multi-agent RL. "SMAC~\citep{samvelyan2019starcraft}, which provides cooperative StarCraft II micromanagement tasks of varying scales and difficulties"
  • Target Agent: The executable RL agent generated by the framework to interact with environments for training and evaluation. "Target Agent that is automatically generated and executed."
  • TD3: Twin Delayed Deep Deterministic policy gradient, an off-policy algorithm for continuous control using twin critics and delayed updates. "we employ PPO, SAC, and TD3 on the MuJoCo environments"
  • TensorBoard: A visualization toolkit for tracking and analyzing training metrics. "summarizes key performance indicators from TensorBoard data into a concise report"
  • Transition probability: The dynamics $\mathcal{P}$ describing the probability of moving from one state to another given an action. "the transition probability"
  • Value-based methods: RL approaches that learn value functions and derive policies from them, typically suited to discrete actions. "value-based methods like DQN are appropriate for discrete action spaces"
  • Verification operator: A programmatic checker used to validate generated components before training. "checked by a verification operator $V$"
  • YAML: A human-readable data serialization format used for exporting configurations. "exported in standardized YAML format for compatibility and reproducibility."
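
To illustrate the discount factor entry above, here is a short sketch that computes a discounted return for a finite reward sequence. The function name `discounted_return` is illustrative, and the formula is the standard textbook definition rather than anything specific to the paper.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite episode."""
    g = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With `gamma` close to 1 the agent weights distant rewards almost as much as immediate ones; with `gamma` close to 0 it becomes short-sighted.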
