
LLaMEA-SAGE: Evolutionary AAD Framework

Updated 5 February 2026
  • LLaMEA-SAGE is an integrated evolutionary framework for automated algorithm design that leverages LLM-generated code, AST-derived features, and SHAP insights.
  • It enhances the original LLaMEA process by incorporating surrogate modeling to relate code structure to performance, resulting in faster convergence and improved sample efficiency.
  • Empirical evaluations on black-box benchmarks show that LLaMEA-SAGE outperforms traditional methods by effectively guiding LLM mutations with actionable, explainable AI feedback.

LLaMEA-SAGE is an integrated evolutionary framework for automated algorithm design (AAD) in which LLMs generate, mutate, and refine Python-encoded optimization algorithms, with search guided not only by runtime fitness but also by structural feedback derived from explainable AI analysis of program code. Developed as an extension and enhancement of the LLaMEA (“LLM Evolutionary Algorithm”) approach, LLaMEA-SAGE exploits features extracted from the abstract syntax tree (AST) of each candidate program, relates these code features to performance via a learned surrogate model, and feeds back feature-based instructions—derived through SHAP (SHapley Additive exPlanations) attributions—into subsequent LLM mutation prompts. Empirical evaluation on black-box optimization benchmarks demonstrates accelerated convergence and improved sample efficiency compared to both vanilla LLaMEA and previous AAD frameworks (Stein et al., 29 Jan 2026).

1. Context and Rationale

Automated Algorithm Design (AAD) commonly frames the synthesis of new optimization algorithms as a black-box search, using performance on benchmark suites as objective feedback. In LLM-driven evolutionary search, LLaMEA represents optimizer candidates in Python, leveraging LLMs (e.g., GPT-5-mini) to perform code mutation and refinement within a (μ+λ) or (1+1) evolutionary strategy. This methodology allows continuous exploration across vast algorithmic spaces, but suffers from two principal limitations:

  • The fitness/judgment signal is a noisy, aggregated metric that does not indicate which code-level patterns yield success.
  • The LLM mutates code without guidance about which structural elements correlate with performance, leading to inefficiency and possible stalling in suboptimal algorithm “basins.”

LLaMEA-SAGE addresses these challenges by introducing a feedback loop that captures code structure, explains its influence, and steers the LLM toward more promising solutions (Stein et al., 29 Jan 2026).

2. Code Feature Extraction and Surrogate Modeling

Each generated candidate algorithm is parsed into an AST, which is represented as a directed graph G_c = (V, E) comprising nodes V (corresponding to functions, loops, or expressions) and edges E (parent–child relations). From G_c, a static feature vector x_s = cf(s) ∈ ℝ^d is computed for each candidate s. Feature categories include:

  • Graph-theoretic statistics: node and edge counts, degree statistics (mean, variance, entropy), tree depth (min/mean/max), clustering coefficient, assortativity, diameter, average shortest path.
  • Static complexity metrics: total cyclomatic complexity, token counts per function and globally, and function parameter counts.
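To illustrate the static feature map cf(s), the sketch below computes a handful of the listed features from a program's AST using only Python's standard library. The function name `ast_features`, the branch-count proxy for cyclomatic complexity, and the exact feature set are illustrative assumptions, not the paper's implementation:

```python
import ast
import io
import statistics
import tokenize

def ast_features(source: str) -> dict:
    """Compute a small static feature vector from a program's AST.

    Minimal sketch of cf(s): only a few of the graph-theoretic and
    complexity features named above are included, and a branch-node
    count + 1 stands in for a full cyclomatic-complexity computation.
    """
    tree = ast.parse(source)

    # Nodes V and parent-child edges E of the code graph G_c = (V, E).
    nodes = list(ast.walk(tree))
    edges = [(parent, child)
             for parent in ast.walk(tree)
             for child in ast.iter_child_nodes(parent)]

    # Out-degree statistics over the directed AST graph.
    degrees = [len(list(ast.iter_child_nodes(n))) for n in nodes]

    def depth(node):  # maximum root-to-leaf depth of the tree
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(c) for c in children), default=0)

    branch_nodes = (ast.If, ast.For, ast.While, ast.Try,
                    ast.BoolOp, ast.ExceptHandler)
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

    return {
        "num_nodes": len(nodes),
        "num_edges": len(edges),
        "mean_degree": statistics.mean(degrees),
        "max_depth": depth(tree),
        "cyclomatic_proxy": 1 + sum(isinstance(n, branch_nodes) for n in nodes),
        "num_tokens": len(tokens),
        "num_params": sum(len(n.args.args) for n in nodes
                          if isinstance(n, ast.FunctionDef)),
    }
```

Applied to a candidate such as `def f(x, y): ...`, the resulting dictionary plays the role of x_s in the archive described below.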

An archive 𝒜 is maintained:

𝒜 = {(s_i, f_i, x_i)}_{i=1}^{N}

where f_i is the observed benchmark fitness for candidate s_i with features x_i. A gradient-boosted regression tree (XGBoost) surrogate f̂(x; θ) is trained to predict f from x by minimizing the squared-error loss over the archive. Surrogate retraining occurs once the archive size exceeds a threshold (the population size), ensuring sufficient data for reliable modeling (Stein et al., 29 Jan 2026).
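A minimal sketch of the archive and its size-threshold retraining trigger follows. The class name `SurrogateArchive` is hypothetical, and a 1-nearest-neighbour lookup in feature space stands in for the XGBoost regressor so the example stays dependency-free:

```python
import math

class SurrogateArchive:
    """Archive A = {(s_i, f_i, x_i)} plus a toy surrogate.

    Sketch only: a 1-nearest-neighbour prediction stands in for the
    gradient-boosted (XGBoost) regressor of the actual framework, and
    `min_size` plays the role of the population-size retraining threshold.
    """

    def __init__(self, min_size: int = 8):
        self.records = []      # (code, fitness, feature_vector) triples
        self.min_size = min_size
        self.ready = False     # surrogate only trusted past the threshold

    def add(self, code: str, fitness: float, x: list) -> None:
        self.records.append((code, fitness, x))
        if len(self.records) > self.min_size:
            self.ready = True  # in the real system: retrain XGBoost here

    def predict(self, x: list) -> float:
        """Predict f̂(x) as the fitness of the nearest archived point."""
        if not self.ready:
            raise RuntimeError("archive below retraining threshold")
        _, fitness, _ = min(self.records,
                            key=lambda rec: math.dist(rec[2], x))
        return fitness
```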

3. Explainable AI Feedback for LLM-Guided Mutation

After fitting the surrogate model, explainable AI—in the form of SHAP values—decomposes each predicted fitness into individual feature attributions:

f̂(x) = φ₀ + Σ_{j=1}^{d} φ_j(x)

The feature k with the largest absolute attribution |φ_k| is selected as the most influential. If φ_k > 0, the guidance is to increase feature k; if φ_k < 0, the guidance is to decrease it. This directive is expressed in natural language and prepended to the mutation prompt provided to the LLM (e.g., “Based on archive analysis, try to increase the total cyclomatic complexity of the solution.”).

This feature-level feedback is injected at each LLM mutation, influencing offspring synthesis without restricting overall code expressivity. A single most-relevant feature is targeted per mutation in the approach as presented (Stein et al., 29 Jan 2026).
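The attribution-to-directive rule can be sketched in a few lines; the function names and exact prompt wording here are illustrative assumptions:

```python
def guidance_from_shap(phi: dict) -> str:
    """Turn per-feature SHAP attributions into a mutation directive.

    Sketch of the single-feature guidance rule: pick the feature k with
    the largest |phi_k| and ask the LLM to push it in the direction of
    its sign. Feature names and wording are illustrative.
    """
    k = max(phi, key=lambda name: abs(phi[name]))
    direction = "increase" if phi[k] > 0 else "decrease"
    return (f"Based on archive analysis, try to {direction} "
            f"the {k} of the solution.")

def guided_prompt(base_prompt: str, phi: dict) -> str:
    """Prepend the guidance directive to the base mutation prompt."""
    return guidance_from_shap(phi) + "\n\n" + base_prompt
```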

4. Integration with LLaMEA Evolutionary Procedure

LLaMEA-SAGE modifies the standard LLaMEA evolutionary loop to incorporate structured guidance:

  1. Initialization: Generate population P of μ algorithms via LLM.
  2. Evaluation: For each s ∈ P, run on benchmark, extract fitness f(s) and feature set cf(s); store in archive 𝒜.
  3. Loop (until evaluation budget B):
    • Retrain surrogate f̂ on (cf, f) pairs.
    • For λ offspring:
      • Parent selection from P.
      • Sample base mutation prompt.
      • Extract SHAP attributions from f̂ at parent’s features.
      • Append guidance (“increase” or “decrease” the feature k) to prompt.
      • Generate offspring via LLM with new prompt, execute, extract features, and update 𝒜.
    • Update population by elitist (μ+λ) selection.
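The steps above can be sketched as a single loop in which every external dependency (LLM call, benchmark evaluation, feature extraction, SHAP guidance) is a caller-supplied placeholder. This is a structural sketch under those assumptions, not the reference implementation:

```python
import random

def llamea_sage_loop(llm_generate, evaluate, features, guidance,
                     mu=4, lam=16, budget=200):
    """Skeleton of the guided (mu+lambda) loop.

    All callables are hypothetical stand-ins: llm_generate(prompt) -> code,
    evaluate(code) -> fitness (higher is better here), features(code) ->
    feature vector, guidance(archive, x) -> SHAP-derived prompt prefix
    ("" while the surrogate is untrained).
    """
    # Steps 1-2: initialize and evaluate mu parents, seed the archive.
    pop = []
    for _ in range(mu):
        code = llm_generate("Design a novel optimizer.")
        pop.append((code, evaluate(code), features(code)))
    archive = list(pop)
    used = mu

    # Step 3: guided generational loop under the evaluation budget.
    while used + lam <= budget:
        offspring = []
        for _ in range(lam):
            parent_code, _, x_parent = random.choice(pop)
            prompt = (guidance(archive, x_parent)
                      + "Refine this optimizer:\n" + parent_code)
            child = llm_generate(prompt)
            rec = (child, evaluate(child), features(child))
            archive.append(rec)
            offspring.append(rec)
            used += 1
        # Elitist (mu+lambda) survivor selection on fitness.
        pop = sorted(pop + offspring, key=lambda r: r[1],
                     reverse=True)[:mu]
    return max(pop, key=lambda r: r[1])
```

With a real backend, `llm_generate` would wrap the LLM API call and `evaluate` would run the candidate on the benchmark suite; here any toy callables suffice to exercise the control flow.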

The core innovation is that each offspring is biased, via prompt engineering, toward structural modifications empirically correlated with better performance in the current archive (Stein et al., 29 Jan 2026).

5. Experimental Benchmarks and Quantitative Results

LLaMEA-SAGE has been evaluated on two primary experimental regimes:

Experiment 1: SBOX-COST (Proof of Concept)

  • Benchmarks: Five separable SBOX-COST functions, 10D
  • (μ = λ = 8), 200 evaluations, 5 random seeds
  • Metric: AOCC (anytime overall convergence coefficient)
  • Results: LLaMEA-SAGE converges more rapidly than vanilla LLaMEA, especially in the early phase (first 50–100 evaluations); average AUC (area under curve) improved by 11.1 units, with Cliff’s δ = 0.60.

Experiment 2: MA-BBOB (GECCO Competition Benchmark)

  • Benchmarks: MA-BBOB, 10 instances, d = 10, 5000·d evaluations per algorithm
  • Methods: LLaMEA-SAGE, vanilla LLaMEA, MCTS-AHD, LHNS
  • LLaMEA-SAGE and vanilla LLaMEA: (μ = 4, λ = 16)
  • Results: LLaMEA-SAGE achieves higher AOCC faster than all baselines, with early sample efficiency—at 50 evaluations, AOCC exceeds vanilla LLaMEA’s final value at 200. Ablation with Gemini-2.0-flash-lite confirmed robustness to model backend.

Qualitative Analysis:

  • Cyclomatic complexity and parameter count are the most frequently targeted features by guidance prompts, almost always with the “increase” directive.
  • The LLM responds to “refine” prompts with ~70% compliance.
  • Both LLaMEA and LLaMEA-SAGE have similar LLM token usage per run, with lower computational overhead compared to MCTS-AHD (Stein et al., 29 Jan 2026).

6. Interpretation, Limitations, and Opportunities

Structural feedback provides an inductive bias enabling the search process to exploit associations between code patterns (e.g., higher cyclomatic complexity, structural depth) and performance, resulting in accelerated convergence and decreased outcome variance. Even imperfect surrogate modeling suffices, as the dominant impact is to bias the LLM away from fruitless regions of code space.

Limitations acknowledged include:

  • Applicability tested only on moderate-dimensional (d = 10) continuous problems.
  • Dynamic or behavioral code features (e.g., runtime traces, memory access statistics) are not yet incorporated.
  • Only the most relevant single feature is exploited for feedback per mutation; multi-objective guidance remains unexplored.
  • Surrogate model accuracy and SHAP decomposition quality inherently affect the value of the feedback.

Potential future directions proposed:

  • Multi-feature feedback and learnable weighting in prompt augmentation.
  • Online adaptation of guidance strategy (e.g., switching between increase/decrease based on uncertainty or observed progress).
  • Application to broader AAD frameworks and inclusion of dynamic code characterization (Stein et al., 29 Jan 2026).

LLaMEA-SAGE can be contrasted with prior AAD search frameworks:

  • Fitness-only LLM-based AAD (vanilla LLaMEA): No code structure feedback; LLM mutation is unbiased.
  • Non-LLM AAD (e.g., MCTS-AHD, LHNS): Explore via tree-search or neighborhood moves, not leveraging code structure or LLM-based mutation in this manner.
  • LLaMEA-SAGE: Multi-level integration—uses LLMs for code generation, a learned surrogate for feature–performance mapping, explainable AI for actionable feedback, and prompt engineering for targeted search acceleration, yielding improved sample efficiency and convergence rates in empirical evaluations.

Summary of algorithmic components:

| Component | Technique | Purpose in LLaMEA-SAGE |
|---|---|---|
| Solution encoding | Python AST + code graph | Enables feature extraction |
| Surrogate modeling | XGBoost regressor | Relates structure to performance |
| Feature attribution | SHAP (tree-SHAP) | Identifies most impactful feature |
| Guidance injection | Natural-language prompt augmentation | Directs LLM mutation strategy |
| Evaluation | Black-box optimization benchmarks | Measures AOCC, sample efficiency |

LLaMEA-SAGE operationalizes a feedback loop where knowledge about code structure, learned in an archive and distilled via explainable AI, can be used to inform and bias subsequent LLM-driven synthesis, demonstrating statistically significant improvements on AAD benchmarks (Stein et al., 29 Jan 2026).
