
Multi-Agent Self-Evolution (MASE) Frameworks

Updated 20 January 2026
  • Multi-Agent Self-Evolution is a framework where autonomous agents iteratively improve through mutual feedback, structured interactions, and intrinsic reward signals.
  • It employs decentralized reward mechanisms, dynamic role adaptation, and hierarchical coordination, instantiated in frameworks such as CoMAS, ANN, and Mobile-Agent-E, to enhance scalability and performance.
  • Empirical evaluations demonstrate up to 19.8% improvement on benchmarks, validating the robustness and efficiency of this self-evolution paradigm.

Multi-Agent Self-Evolution (MASE) refers to the class of frameworks and algorithms in which a population of autonomous agents—frequently LLM-based—continuously bootstraps its own collective capabilities through structured interactions, internal feedback, and iterative adaptation, with minimal or no reliance on external supervision. The defining feature of MASE is that improvement is achieved by agents dynamically learning from each other's outputs, critiques, or collaborative problem decomposition, often leveraging intrinsic or interaction-derived reward signals. This paradigm enables scalable, decentralized, and often human-analogous co-evolution, distinguishing it from static multi-agent configurations or isolated self-improvement.

1. Fundamental Principles and Motivations

The central motivation for MASE arises from the limitations of both RL-free self-evolution (e.g., knowledge extrapolation, static workflows) and RL-based schemes that depend on externally defined reward models or heuristics. RL-free methods are bounded by the static capabilities of pre-trained LLMs and cannot transcend the informational horizon of their backbone parameters. RL-based approaches with explicit reward supervision scale poorly due to their reliance on human-curated signals or narrow environmental feedback.

MASE is inspired by human social intelligence, where individuals advance by debate, mutual critique, and iterative adjustment of understanding—without reliance on any single external verifier. MASE operationalizes this analog by enabling agents to co-evolve through direct peer interaction, joint reasoning, reciprocal feedback loops, and adaptive restructuring of roles, protocols, and workflows (Xue et al., 9 Oct 2025, Ma et al., 10 Jun 2025).

2. Canonical Architectures and Interaction Protocols

MASE encompasses a broad family of architectures, unified by the principle that agents improve via closed-loop interaction, critique, and adaptation. Notable instantiations include:

  • CoMAS: Each agent proposes solutions, critiques peer outputs, and scores peer responses. An LLM-based judge interprets these scores as normalized, zero-sum intrinsic rewards, which are then used for decentralized token-level policy updates via REINFORCE++ with KL regularization. The process is strictly peer-driven, with no external ground-truth supervision (Xue et al., 9 Oct 2025).
  • Agentic Neural Networks (ANN): Agents are organized as graph nodes (layers or teams), with task decomposition handled dynamically in forward passes. After team outputs, global and local critiques (textual gradients) feed back into prompt templates, agent roles, and topology. New teams are instantiated post hoc via a neuro-symbolic routine, supporting ongoing self-evolution even after main training (Ma et al., 10 Jun 2025).
  • Mobile-Agent-E: Leverages hierarchical agent decomposition—separating high-level planning (Manager) from low-level operations (Perceptor, Operator, Action Reflector, Notetaker). Self-evolution occurs via persistent long-term memory comprising “Tips” (natural-language distilled guidance) and “Shortcuts” (structured reusable macros), updated at the end of each task by reflector modules (Wang et al., 20 Jan 2025).

MASE architectures frequently employ role specialization, decentralized (or in some cases, dynamically central) reward adjudication, and collective credit assignment across agent-generated artifacts.
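
The closed-loop peer interaction common to these designs can be illustrated with a minimal sketch. All class and function names here are hypothetical, and the propose/critique bodies are placeholders standing in for LLM invocations:

```python
# Illustrative sketch (not any framework's actual code): one peer-driven round
# in which every agent proposes a solution and then critiques and scores
# every peer's output.
from dataclasses import dataclass

@dataclass
class Agent:
    name: str

    def propose(self, task: str) -> str:
        # Placeholder for an LLM call that drafts a solution.
        return f"{self.name}'s solution to: {task}"

    def critique(self, solution: str) -> tuple[str, int]:
        # Placeholder: return a textual critique and an integer score in {1, 2, 3}.
        return f"{self.name}'s critique of '{solution}'", 2

def interaction_round(agents: list[Agent], task: str) -> list[dict]:
    """Collect (solver, evaluator, critique, score) records for one round."""
    solutions = {a.name: a.propose(task) for a in agents}
    records = []
    for solver in agents:
        for evaluator in agents:
            if evaluator is solver:
                continue  # agents never score their own outputs
            critique, score = evaluator.critique(solutions[solver.name])
            records.append({"solver": solver.name, "evaluator": evaluator.name,
                            "critique": critique, "score": score})
    return records

agents = [Agent("A"), Agent("B"), Agent("C")]
records = interaction_round(agents, "solve x + 2 = 5")
print(len(records))  # 3 agents x 2 peer evaluations each = 6 records
```

In a full system, the collected records would then drive reward computation and policy updates, as discussed in the next section.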

3. Reward Formulation and Policy Optimization

A distinguishing aspect is the reliance on intrinsic rewards sourced from inter-agent interaction dynamics, rather than dense external labels. Representative mechanisms include:

  • Interaction-based zero-sum rewards: Peer scoring and critique are normalized (e.g., mapping an integer score $\hat\tau_{i,j} \in \{1,2,3\}$ to [0,1]) and allocated complementarily to solvers and evaluators, $r(s_i) = \frac{\hat\tau_{i,j}-1}{2}$ and $r(e_{i,j}) = \frac{3-\hat\tau_{i,j}}{2}$, so each paired reward sums to 1; the split is adversarially shaped to mitigate collusion or reward hacking (Xue et al., 9 Oct 2025).
  • Textual gradient feedback: In ANN and EvoMAC, end-to-end or layer-wise performance is critiqued in natural language, parsed as “textual gradients” for adjusting agent prompts, aggregation functions, or graph topology (Ma et al., 10 Jun 2025, Hu et al., 2024).
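
The complementary score split in the first bullet can be sketched directly; assuming, per the formulas, an integer peer score in {1, 2, 3}:

```python
# Sketch of the constant-sum reward split: a peer score tau in {1, 2, 3}
# maps to (tau - 1) / 2 for the solver and (3 - tau) / 2 for the evaluator,
# so the two rewards always sum to 1.
def split_reward(tau: int) -> tuple[float, float]:
    if tau not in (1, 2, 3):
        raise ValueError("peer score must be in {1, 2, 3}")
    r_solver = (tau - 1) / 2      # high score -> solver is rewarded
    r_evaluator = (3 - tau) / 2   # low score -> evaluator is rewarded for strictness
    return r_solver, r_evaluator

for tau in (1, 2, 3):
    rs, re = split_reward(tau)
    print(tau, rs, re, rs + re)   # the pair always sums to 1.0
```

The constant-sum structure is what makes the reward adversarial: an evaluator cannot inflate everyone's scores without forfeiting its own reward.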

Policy optimization leverages reinforcement learning with careful token-level credit assignment, typically employing surrogate PPO-style objectives and a KL-divergence penalty that anchors the policy to a reference or prior model (e.g., the objective $J(\theta_k)$ in CoMAS). In EvoMAC, gradient descent is replaced with discrete “textual backpropagation” via agent pairs that generate and apply update signals to the collaboration graph (Hu et al., 2024).
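
As a rough illustration of such an objective, the numpy sketch below combines a REINFORCE term with a per-token KL penalty toward a reference policy. The hyperparameter values and the shared-trajectory-reward credit assignment are simplifying assumptions, not CoMAS's actual implementation:

```python
# Minimal sketch of a REINFORCE-style loss with a KL anchor to a reference
# policy. Inputs are per-token log-probabilities of one sampled response.
import numpy as np

def policy_loss(logp_tokens, logp_ref_tokens, reward, beta=0.1):
    """logp_tokens / logp_ref_tokens: per-token log-probs under the current
    and reference policies; reward: scalar intrinsic reward for the response;
    beta: KL penalty weight (illustrative value)."""
    logp_tokens = np.asarray(logp_tokens)
    logp_ref_tokens = np.asarray(logp_ref_tokens)
    # Simplest token-level credit assignment: every token shares the
    # trajectory-level reward.
    pg_term = -reward * logp_tokens.sum()
    # Per-token KL estimate anchoring the policy to the reference model.
    kl_term = beta * (logp_tokens - logp_ref_tokens).sum()
    return pg_term + kl_term

loss = policy_loss([-1.2, -0.8, -2.0], [-1.0, -1.0, -1.5], reward=1.0)
print(round(float(loss), 3))
```

Minimizing this loss raises the probability of highly rewarded responses while the KL term keeps the updated policy close to its reference, which is the usual guard against degenerate drift in self-evolving loops.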

4. Scalability, Diversity, and Adaptation Mechanisms

Scaling MASE is addressed through adaptive interaction protocols and by injecting agent heterogeneity:

  • Population scaling: Empirical evidence demonstrates monotonic performance improvements as the agent count increases—from one to four agents in CoMAS, yielding up to a 2% accuracy lift over single-agent variants (Xue et al., 9 Oct 2025).
  • Diversity injection: Heterogeneous agent pools (mixed model backbones or prompt configurations) lead to further improvements (1–3%) by fostering richer cross-learning and mitigating reward hacking or local consensus traps.
  • Post-training architecture evolution: Teams, aggregation modules, or workflow topologies can be continually expanded, pruned, or reconfigured in response to ongoing feedback, as in ANN’s LocalGradientUpdate and Mobile-Agent-E’s evolution of Shortcuts and Tips.
  • Decentralization: Notably, MorphAgent dispenses with any central coordinator, relying instead on self-evolving agent profiles, with each agent maintaining and updating a role description optimized for role clarity, differentiation, and task-role alignment via explicit metrics (RCS, RDS, TRAS) (Lu et al., 2024).
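
A minimal sketch of such decentralized profile evolution might look as follows. The clarity, differentiation, and alignment scorers below are crude stand-ins for MorphAgent's RCS, RDS, and TRAS metrics, which are defined formally in the paper; only the accept-if-better update rule is faithful to the description above:

```python
# Hedged sketch: each agent keeps a textual role profile and adopts a
# rewritten profile only if a weighted combination of clarity,
# differentiation, and task-alignment scores improves.
def profile_score(profile: str, peers: list[str], task: str,
                  w=(0.3, 0.3, 0.4)) -> float:
    words = set(profile.split())
    clarity = min(len(profile.split()) / 20.0, 1.0)  # stand-in for RCS
    overlap = max(
        len(words & set(p.split())) / max(len(set(p.split())), 1)
        for p in peers
    ) if peers else 0.0
    differentiation = 1.0 - overlap                   # stand-in for RDS
    task_words = set(task.split())
    alignment = len(words & task_words) / max(len(task_words), 1)  # stand-in for TRAS
    return w[0] * clarity + w[1] * differentiation + w[2] * alignment

def maybe_update(current: str, candidate: str, peers, task) -> str:
    # Greedy local acceptance: adopt the candidate profile only if it scores higher.
    if profile_score(candidate, peers, task) > profile_score(current, peers, task):
        return candidate
    return current

updated = maybe_update("coder", "python coder who writes unit tests",
                       peers=["planner who decomposes tasks"],
                       task="write python unit tests")
print(updated)
```

Because every agent applies this rule locally, no central coordinator is needed; each profile drifts toward clarity, distinctness from peers, and relevance to the current task.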

5. Quantitative Evaluation and Empirical Findings

MASE frameworks have demonstrated state-of-the-art or near-optimal performance across a spectrum of benchmarks:

| Framework | Notable Benchmarks | Key Results |
| --- | --- | --- |
| CoMAS (Xue et al., 9 Oct 2025) | GSM8K, HumanEval, MBPP, MATH-500 | Up to +19.8% over baseline (AutoGen), +6.1% (Debate), +3.66% (Consistency); monotonic gains as agent count and heterogeneity increase |
| ANN (Ma et al., 10 Jun 2025) | HumanEval, MATH, DABench, MMLU-ML | Up to +10.5 pp (HumanEval), +5.2 pp (MATH) vs. leading baselines |
| EvoMAC (Hu et al., 2024) | HumanEval, rSDE-Bench (Web, Game) | +6.1% (HumanEval), +20–34% (rSDE-Bench) vs. GPT-4o-Mini |
| Mobile-Agent-E (Wang et al., 20 Jan 2025) | Mobile-Eval-E | +22% absolute Satisfaction Score, +17.2% Action Accuracy |
| MorphAgent (Lu et al., 2024) | BigCodeBench, MATH | Outperforms GPTSwarm by up to +2.7 pp; robust to domain shift and node failure |

Ablation studies underscore the necessity of adversarial evaluation: removing the scoring or evaluation stage causes CoMAS training to collapse into hyper-strict critiques or reward hacking, while diversity injection and iterative profile updating are shown to enhance robustness and adaptability.

6. Theoretical Guarantees, Limitations, and Open Problems

MASE frameworks posit the following theoretical and practical insights:

  • Adversarial, interaction-based reward shaping ensures agents do not converge to trivial agreements or reward collapses (Xue et al., 9 Oct 2025).
  • Monotonic improvement under local acceptance criteria: If meta-feedback loops (e.g., in MAS-ZERO) accept only improving architectures, the system hill-climbs in expected accuracy or completeness (Ke et al., 21 May 2025).
  • Decentralized convergence in profile evolution: MorphAgent demonstrates that profile updating guided by RCS/RDS/TRAS can yield stability and resilience without a central protocol (Lu et al., 2024).
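
The monotonic-improvement claim can be made concrete with a toy hill-climbing loop; the `propose` and `evaluate` placeholders abstract the framework's architecture-generation and verification steps:

```python
# Sketch of the monotonic-improvement argument: if the meta-feedback loop
# accepts a proposed architecture only when its evaluated score improves,
# the sequence of accepted scores is non-decreasing by construction.
import random

def self_evolve(evaluate, propose, initial, steps=50, seed=0):
    rng = random.Random(seed)
    current, best = initial, evaluate(initial)
    trace = [best]
    for _ in range(steps):
        candidate = propose(current, rng)
        score = evaluate(candidate)
        if score > best:              # local acceptance criterion
            current, best = candidate, score
        trace.append(best)            # accepted score never decreases
    return current, trace

# Toy instance: evolve a scalar toward a "quality" peak at 10.
evaluate = lambda x: -abs(x - 10)
propose = lambda x, rng: x + rng.uniform(-1, 1)
final, trace = self_evolve(evaluate, propose, initial=0.0)
print(all(b >= a for a, b in zip(trace, trace[1:])))  # True: monotonic
```

The guarantee, of course, is only hill climbing: the loop cannot escape local optima, which is one reason richer acceptance and exploration criteria remain an open problem.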

Identified limitations include the current focus on reasoning/coding benchmarks, fixed or modestly dynamic judge modules, simplistic interaction structures, and notable compute overhead for profile metric evaluation. Open challenges comprise generalization to open-ended, real-world tasks; richer and multi-layered interaction protocols; fully decentralized and theoretically analyzable convergence; and integration with symbolic or retrieval-augmented submodules.

7. Relation to Adjacent Paradigms and Future Directions

MASE diverges from both centrally-organized RL schemes and static multi-agent scripts by embedding learning and adaptation within the agent population, leveraging social learning analogies and closed feedback loops. This enables continuous self-evolution—analogous to societal knowledge accumulation—without ongoing reliance on human supervision or rigid environment feedback.

Planned extensions highlighted in current literature include:

  • Nuanced, multi-stage protocols (e.g., group brainstorming, hierarchical meta-judging) (Xue et al., 9 Oct 2025);
  • Deeply recursive architecture generation and rectification (as in MAS²) (Wang et al., 29 Sep 2025);
  • Real-world or embodied settings where agents can evolve both internal objectives and inter-agent coalitions (Li et al., 5 Feb 2025).

MASE thus constitutes a new paradigm for scalable, decentralized, and intrinsically motivated agent self-improvement, continually redefining the boundaries of what multi-agent LLM systems can autonomously achieve.
