STELLA: Self-Evolving LLM Agent Architecture
- STELLA is a self-evolving LLM agent that leverages multi-agent co-evolution, autonomous tool creation, and iterative feedback loops to continuously improve reasoning and planning capabilities.
- It employs a modular framework with specialized agents (Manager, Dev, Critic, Tool Creation) operating in a closed-loop to optimize task workflows and resource management.
- Its self-improvement mechanisms combine reinforcement learning, supervised fine-tuning, and evolutionary operators to achieve significant performance gains and cost efficiency.
STELLA is a self-evolving LLM agent architecture designed for the continuous, autonomous improvement of reasoning, planning, and tool-use capabilities across open-ended domains. Drawing on recent advances in multi-agent co-evolution, self-orchestrated skill acquisition, iterative feedback optimization, and lifelong memory, STELLA implements a modular, data-driven framework that integrates reinforcement learning, supervised fine-tuning, meta-cognitive monitoring, and dynamic resource control. It is purpose-built to maintain high performance in rapidly evolving task spaces, efficiently extend its own skillset, and reduce dependence on static, human-curated knowledge or tool repositories.
1. Core Architectural Principles and Agent Loop
STELLA’s control and learning architecture is based on a multi-agent system motif, where distinct functional agents (Manager, Developer, Critic, Tool Creation) operate in a closed feedback loop and share information via dynamically evolving structured stores. The design draws directly from the architecture outlined in "STELLA: Self-Evolving LLM Agent for Biomedical Research" (Jin et al., 1 Jul 2025), which instantiated four main roles:
- Manager Agent: Orchestrates task decomposition and selects or adapts workflow templates from an evolving Template Library, delegating subtasks based on problem context.
- Dev Agent: Executes code and tool invocations in isolated environments, generating structured outputs for downstream critique.
- Critic Agent: Evaluates intermediate results, signaling missing reasoning steps or tool requirements and ensuring quality control.
- Tool Creation Agent: Detects capability gaps, autonomously discovers or wraps new tools, and integrates valid resources into the Tool Ocean—a dynamically growing registry of accessible tools and models.
The system maintains two central evolving stores:
- Template Library: Curates successful multi-step reasoning plans for task decomposition, continually augmented with templates distilled from solved tasks.
- Tool Ocean: Dynamically accumulates validated software tools, APIs, scripts, and accessors discovered or created on demand during operation.
The agentic loop proceeds as follows: The Manager selects a template and decomposes a task; the Dev Agent executes steps; the Critic identifies gaps; the Tool Creation Agent remediates deficiencies, and on success, the pathway is archived as a new template. This structure enables the system to bootstrap new skill trajectories with minimal external supervision, systematically curating reasoning and tool patterns directly from experiential data (Jin et al., 1 Jul 2025).
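The closed loop above can be sketched in Python. This is a hedged, minimal illustration: `run_task`, `Stores`, and the agent callables are stand-in names for exposition, not APIs from the paper.

```python
# Illustrative sketch of STELLA's Manager/Dev/Critic/Tool-Creation loop.
from dataclasses import dataclass, field

@dataclass
class Stores:
    template_library: list = field(default_factory=list)  # archived reasoning plans
    tool_ocean: dict = field(default_factory=dict)        # tool name -> callable

def run_task(task, stores, manager, dev, critic, tool_creator, max_rounds=3):
    """One pass of the closed loop: plan, execute, critique, remediate, archive."""
    plan = manager(task, stores.template_library)          # select/adapt a template
    for _ in range(max_rounds):
        results = [dev(step, stores.tool_ocean) for step in plan]
        gaps = critic(task, results)                       # missing steps or tools
        if not gaps:
            stores.template_library.append(plan)           # archive success as template
            return results
        for gap in gaps:                                   # remediate capability gaps
            name, tool = tool_creator(gap)
            stores.tool_ocean[name] = tool                 # grow the Tool Ocean
        plan = manager(task, stores.template_library)      # re-plan with new tools
    return None
```

With toy callables, a first round that lacks a tool triggers tool creation, and the second round succeeds and archives the plan.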
2. Autonomous Self-Evolution Mechanisms
STELLA’s self-improvement leverages multiple complementary algorithmic paradigms:
Multi-Agent Co-evolution and Co-Training
Borrowing from the Multi-Agent Evolve (MAE) framework (Chen et al., 27 Oct 2025), self-improvement is formulated as a triadic co-evolutionary game, with Proposer, Solver, and Judge instantiated from a shared LLM backbone. Each agent is reinforced independently on role-specific reward signals using synchronized policy gradient updates:
- Proposer: rewarded for question clarity, challenge (difficulty), and output-format compliance.
- Solver: rewarded for answer correctness and format compliance.
- Judge: rewarded for rubric-based scoring rigor and format compliance.
Each iteration cycles through these phases, with generated questions/answers filtered by quality and difficulty, and the backbone updated via task-relative normalized REINFORCE++ objectives, efficiently balancing exploration and exploitation (Chen et al., 27 Oct 2025).
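The task-relative normalization behind the REINFORCE++-style updates can be sketched as follows; the exact objective is in the MAE paper, and this toy version only shows the within-task reward standardization that balances gradients across questions of differing difficulty.

```python
# Sketch of task-relative reward normalization for REINFORCE++-style updates:
# each rollout's reward is standardized against other rollouts of the same task.
import statistics

def task_relative_advantages(rewards_by_task, eps=1e-6):
    """rewards_by_task: {task_id: [reward per rollout]} -> normalized advantages."""
    advantages = {}
    for task, rewards in rewards_by_task.items():
        mean = statistics.fmean(rewards)
        std = statistics.pstdev(rewards)           # population std over the group
        advantages[task] = [(r - mean) / (std + eps) for r in rewards]
    return advantages
```

Rollouts above their task's mean get positive advantage, those below get negative, and a uniformly-scored task contributes no gradient signal.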
Iterative Self-Evolving RL-SFT Loops
STELLA extends the EvolveSearch approach (Zhang et al., 28 May 2025), interleaving reinforcement learning exploration and supervised fine-tuning on high-reward trajectories. A pool of self-generated rollouts is curated using multi-signal reward models and diversity filters, ensuring the training data progressively reflects both robustness and skill breadth. This model alternation mitigates RL collapse and SFT stagnation, consistently raising both in-domain and out-of-domain accuracy over multiple self-evolution cycles.
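The RL/SFT alternation can be sketched as below. The rollout generator, reward model, and trainer are stand-in callables, and the diversity filter is a crude dedup, standing in for the multi-signal curation described above.

```python
# Illustrative RL -> curate -> SFT alternation; all callables are stand-ins.
def evolve(policy, rl_explore, score, sft_train, cycles=3, keep_top=0.2, seen=None):
    """Each cycle: RL exploration, reward-based filtering with a crude
    diversity filter, then SFT on the retained high-reward trajectories."""
    seen = set() if seen is None else seen
    for _ in range(cycles):
        rollouts = rl_explore(policy)                        # self-generated pool
        ranked = sorted(rollouts, key=score, reverse=True)   # multi-signal reward
        top = ranked[: max(1, int(len(ranked) * keep_top))]
        fresh = [t for t in top if t not in seen]            # drop duplicates
        seen.update(fresh)
        policy = sft_train(policy, fresh)                    # fine-tune on winners
    return policy
```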
Trajectory Optimization via Evolutionary Operators
Adopting mechanisms from SE-Agent (Lin et al., 4 Aug 2025), STELLA maintains and evolves populations of reasoning trajectories, refining them via revision (self–critique), recombination (cross-trajectory transfer/crossover), and refinement (reward-guided selection and diversity maintenance). This evolutionary trajectory optimization allows for search beyond local optima and cross-pollination of solution strategies, with each evolutionary phase evaluated on task completion, reasoning quality, and efficiency.
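A minimal sketch of the three operators, assuming trajectories are represented as lists of steps; `revise`, `recombine`, and `fitness` are toy stand-ins for self-critique, cross-trajectory transfer, and reward-guided selection.

```python
# Toy population-based trajectory evolution: revise, recombine, then select.
import random

def evolve_trajectories(population, revise, recombine, fitness,
                        pop_size=8, generations=3, rng=random):
    for _ in range(generations):
        revised = [revise(t) for t in population]             # self-critique pass
        crossed = [recombine(rng.choice(population), rng.choice(population))
                   for _ in range(len(population))]           # crossover/transfer
        candidates = population + revised + crossed
        # Refinement: dedup for diversity, keep the fittest pop_size trajectories.
        population = [list(t) for t in
                      sorted(set(map(tuple, candidates)), key=fitness,
                             reverse=True)[:pop_size]]
    return population[0]
```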
3. Orchestration, Skill Acquisition, and Resource Optimization
A hallmark of STELLA’s scalability is dynamic orchestration, inspired by the "Adaptive Orchestration" paradigm (Sampath et al., 10 Jan 2026). The system implements a Dynamic Mixture of Experts (DMoE) scheduler, where specialized micro-skills (expert agents, tool plug-ins, or sub-LLMs) are loaded from storage on demand. Key features include:
- Expert Selection: Semantic compatibility scoring (e.g., cosine similarity or learned compatibility) routes queries to relevant experts via softmax distributions.
- Meta-Cognition Engine: An asynchronous background process continuously analyzes execution logs to detect capability gaps (e.g., refusals) or excessive generic-tool invocation, then triggers skill instantiation/hiring.
- Resource Management: Active experts are pruned via Least Recently Used (LRU) policies or generalized eviction scoring combining idleness and utility, maintaining tight efficiency under constrained compute/memory footprints.
- Surgical History Pruning: Automated context memory editing removes obsolete refusal patterns or few-shot exemplars, mitigating context pollution and refusal bias without compromising calibration.
This orchestrator enables STELLA to achieve the reliability and precision of specialized swarms while preserving the token efficiency and low-latency profile of a lean, modular agent (Sampath et al., 10 Jan 2026).
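The routing and eviction pieces above can be sketched with cosine-similarity scoring and an LRU pool; embeddings, capacities, and the loader are illustrative assumptions.

```python
# Sketch of DMoE routing (softmax over compatibility) plus LRU expert eviction.
import math
from collections import OrderedDict

def route(query_vec, experts, temperature=1.0):
    """experts: {name: embedding}; returns a softmax routing distribution."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a)) or 1.0
        nb = math.sqrt(sum(x * x for x in b)) or 1.0
        return dot / (na * nb)
    scores = {n: cos(query_vec, e) / temperature for n, e in experts.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {n: math.exp(s) / z for n, s in scores.items()}

class ExpertPool:
    """At most `capacity` experts stay resident; least-recently-used is evicted."""
    def __init__(self, capacity):
        self.capacity, self.active = capacity, OrderedDict()
    def touch(self, name, loader):
        if name in self.active:
            self.active.move_to_end(name)         # mark as recently used
        else:
            if len(self.active) >= self.capacity:
                self.active.popitem(last=False)   # evict the LRU expert
            self.active[name] = loader(name)      # load from storage on demand
        return self.active[name]
```

A generalized eviction score (idleness plus utility, as described above) would replace the pure-LRU `popitem` with an argmin over scored experts.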
4. Lifelong Memory and Experience Reuse
Stateful, self-evolving memory is central to persistent agentic learning. STELLA’s memory framework is guided by Evo-Memory (Wei et al., 25 Nov 2025):
- Memory Representation: At each step t, the system augments its episodic memory with structured experience tuples, supporting both key–value storage and hierarchical, compressive summarization.
- Retrieval and Update: The retrieval module scores stored experiences against the current context via cosine similarity or learned metrics, supporting task-level RAG (ExpRAG), self-critique (SelfRAG), and workflow-graph retrieval (AWM).
- Refinement Pipeline: STELLA’s ReMem agentic loop interleaves action, chain-of-thought synthesis, memory refinement (pruning, clustering), and feedback-driven write-back for continuous learning and context compression.
- Scaling Mechanisms: Memory is periodically compressed, aged out, or sharded by domain/type. Dynamic cheatsheet modules and temporal eviction maintain high relevance under streaming, open-ended workloads.
Empirical benchmarks show that embedding-based retrieval with k=4 and periodic cheatsheet condensation yields substantive gains in multi-turn tasks, reducing action steps and raising emergent success rates relative to non-evolving memory agents (Wei et al., 25 Nov 2025).
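A minimal sketch of such a memory, combining cosine top-k retrieval (k=4) with a periodic condensation step in the spirit of the cheatsheet/eviction mechanisms above; the class name, the embedding representation, and the summarizer are stand-ins.

```python
# Toy episodic memory: cosine top-k retrieval plus periodic condensation.
import math

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

class EvoMemory:
    def __init__(self, k=4, max_items=64, summarize=None):
        self.k, self.max_items = k, max_items
        # summarize maps a list of (embedding, experience) tuples to one tuple;
        # the default just keeps the newest of the condensed batch.
        self.summarize = summarize or (lambda items: items[-1])
        self.items = []                            # (embedding, experience)
    def write(self, emb, experience):
        self.items.append((emb, experience))
        if len(self.items) > self.max_items:       # condense the oldest half
            old, recent = (self.items[: self.max_items // 2],
                           self.items[self.max_items // 2 :])
            self.items = [self.summarize(old)] + recent
    def retrieve(self, query_emb):
        ranked = sorted(self.items, key=lambda m: _cos(query_emb, m[0]),
                        reverse=True)
        return [exp for _, exp in ranked[: self.k]]
```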
5. Dynamic Model Routing and Tri-Optimal Execution
To address the performance–cost–efficiency trilemma, STELLA incorporates self-evolving routing strategies from EvoRoute (Zhang et al., 6 Jan 2026). At each step, STELLA dynamically selects Pareto-optimal model and tool configurations using an experience database and Bayesian bandit mechanisms:
- Pareto-Optimal Pruning: Step-level logs are queried across agent role, semantic instruction similarity, and tool-use congruence, with candidate backbones and tools filtered to ensure optimality across accuracy, cost, and latency axes.
- Thompson Sampling Utility: Posterior distributions over empirical task metrics guide stochastic selection of policies maximizing a weighted utility over accuracy, cost, and latency, balancing task success against resource constraints.
- Log-and-Learn Feedback: Each invocation is logged and the routing posterior updated, ensuring continuous adaptation as workloads, skills, and backbone models evolve.
Experiments demonstrate that such dynamic routing, as opposed to static model assignment, can yield up to 80% cost reduction and 70% speedup without sacrificing task accuracy (Zhang et al., 6 Jan 2026).
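The select-and-update cycle can be sketched with Beta posteriors over per-backbone success and a utility that penalizes known cost and latency; the class, weights, and candidate format are illustrative assumptions, not the EvoRoute implementation.

```python
# Sketch of Thompson-sampling routing over candidate backbones/tools.
import random

class Router:
    def __init__(self, candidates, w_cost=0.1, w_latency=0.05, rng=random):
        # candidates: {name: (cost, latency)}; Beta(1, 1) prior on success.
        self.candidates, self.rng = candidates, rng
        self.w_cost, self.w_latency = w_cost, w_latency
        self.posterior = {n: [1, 1] for n in candidates}
    def select(self):
        def utility(name):
            wins, losses = self.posterior[name]
            p = self.rng.betavariate(wins, losses)       # sample success prob
            cost, latency = self.candidates[name]
            return p - self.w_cost * cost - self.w_latency * latency
        return max(self.candidates, key=utility)
    def update(self, name, success):
        self.posterior[name][0 if success else 1] += 1   # log-and-learn feedback
```

With equal observed success rates, the cheaper, faster candidate dominates the sampled utility and is routed to.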
6. Iterative Workflow Refinement and Autonomous Optimization
STELLA can be configured to autonomously optimize its own workflows and modular organization via LLM-driven feedback loops. The iterative process follows a formal maximize-score framework (Yuksel et al., 2024):
- Hypothesis Generation: Low-scoring aspects identified by Critic or Evaluation agents prompt the system to synthesize and test role/task modifications or tool rewiring.
- Modification and Execution: New prototypes are instantiated and benchmarked on qualitative (clarity, relevance, depth) and quantitative (latency, throughput) metrics.
- Selection and Memory: Successive system variants are ranked by a scalar utility score; only top performers are retained, with search terminating once improvement falls below a threshold or the maximum number of iterations is reached.
Case studies confirm that such closed-loop optimization can yield substantial improvements in solution quality, consistency, and domain adaptability (Yuksel et al., 2024).
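The hypothesize-modify-select loop above reduces to a greedy hill-climb on a scalar score; `propose` and `score` stand in for LLM-driven modification proposals and benchmark evaluation.

```python
# Greedy maximize-score refinement loop; propose/score are stand-in callables.
def refine(system, propose, score, max_iters=10, min_gain=1e-3):
    """Keep a variant only if it improves the score by at least min_gain;
    stop at stagnation or after max_iters proposals."""
    best, best_score = system, score(system)
    for _ in range(max_iters):
        variant = propose(best)               # hypothesis: modify roles/tools
        s = score(variant)                    # benchmark the prototype
        if s - best_score < min_gain:         # improvement below threshold
            break
        best, best_score = variant, s
    return best, best_score
```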
7. Empirical Results, Limitations, and Future Directions
Empirical results from the instantiation in (Jin et al., 1 Jul 2025) validate STELLA’s core hypothesis: Performance systematically improves with use, e.g., Humanity's Last Exam accuracy increases from 14% to 26% over nine self-evolution trials, outperforming static baselines by 6–8 percentage points. In multi-hop QA and agentic search, iterative self-evolution yields absolute accuracy gains up to 4.7% across diverse datasets (Zhang et al., 28 May 2025).
Limitations observed include unbounded growth of reasoning templates and tool databases, absence of rigorous template or tool deprecation criteria, reliance on internal judge reliability, and computational cost for large-scale self-play or trajectory evolution (Jin et al., 1 Jul 2025, Chen et al., 27 Oct 2025). Open challenges include integrating empirical (real-world) feedback, automating curriculum and reward shaping, tuning trilemma weights, and ensuring memory indexing scalability (Zhang et al., 6 Jan 2026, Wei et al., 25 Nov 2025).
Future directions for STELLA include meta-learning controllers to optimize skill acquisition policies, tighter integration with programmatic/grounded verification, utility-based pruning layers on template and tool stores, and extension to multi-modal and embodied agent settings.
References
- "STELLA: Self-Evolving LLM Agent for Biomedical Research" (Jin et al., 1 Jul 2025)
- "Multi-Agent Evolve: LLM Self-Improve through Co-evolution" (Chen et al., 27 Oct 2025)
- "Adaptive Orchestration: Scalable Self-Evolving Multi-Agent Systems" (Sampath et al., 10 Jan 2026)
- "EvolveSearch: An Iterative Self-Evolving Search Agent" (Zhang et al., 28 May 2025)
- "Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory" (Wei et al., 25 Nov 2025)
- "EvoRoute: Experience-Driven Self-Routing LLM Agent Systems" (Zhang et al., 6 Jan 2026)
- "A Multi-AI Agent System for Autonomous Optimization of Agentic AI Solutions via Iterative Refinement and LLM-Driven Feedback Loops" (Yuksel et al., 2024)
- "SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents" (Lin et al., 4 Aug 2025)