LLaMEA-SAGE: Evolutionary AAD Framework
- LLaMEA-SAGE is an integrated evolutionary framework for automated algorithm design that leverages LLM-generated code, AST-derived features, and SHAP insights.
- It enhances the original LLaMEA process by incorporating surrogate modeling to relate code structure to performance, resulting in faster convergence and improved sample efficiency.
- Empirical evaluations on black-box benchmarks show that LLaMEA-SAGE outperforms traditional methods by effectively guiding LLM mutations with actionable, explainable AI feedback.
LLaMEA-SAGE is an integrated evolutionary framework for automated algorithm design (AAD) in which LLMs generate, mutate, and refine Python-encoded optimization algorithms, with search guided not only by runtime fitness but also by structural feedback derived from explainable AI analysis of program code. Developed as an extension and enhancement of the LLaMEA (“LLM Evolutionary Algorithm”) approach, LLaMEA-SAGE exploits features extracted from the abstract syntax tree (AST) of each candidate program, relates these code features to performance via a learned surrogate model, and feeds back feature-based instructions—derived through SHAP (SHapley Additive exPlanations) attributions—into subsequent LLM mutation prompts. Empirical evaluation on black-box optimization benchmarks demonstrates accelerated convergence and improved sample efficiency compared to both vanilla LLaMEA and previous AAD frameworks (Stein et al., 29 Jan 2026).
1. Context and Rationale
Automated Algorithm Design (AAD) commonly frames the synthesis of new optimization algorithms as a black-box search, using performance on benchmark suites as objective feedback. In LLM-driven evolutionary search, LLaMEA represents optimizer candidates in Python, leveraging LLMs (e.g., GPT-5-mini) to perform code mutation and refinement within an evolutionary strategy. This methodology allows continuous exploration across vast algorithmic spaces, but suffers from two principal limitations:
- The fitness signal is a noisy, aggregated metric that does not indicate which code-level patterns drive success.
- The LLM mutates code without guidance about which structural elements correlate with performance, leading to inefficiency and possible stalling in suboptimal algorithm “basins.”
LLaMEA-SAGE addresses these challenges by introducing a feedback loop that captures code structure, explains its influence, and steers the LLM toward more promising solutions (Stein et al., 29 Jan 2026).
2. Code Feature Extraction and Surrogate Modeling
Each generated candidate algorithm is parsed into an AST, which is represented as a directed graph comprising nodes (corresponding to functions, loops, or expressions) and edges (parent–child relations). From this graph, a static feature vector $x(c)$ is computed for each candidate $c$. Feature categories include:
- Graph-theoretic statistics: node and edge counts, degree statistics (mean, variance, entropy), tree depth (min/mean/max), clustering coefficient, assortativity, diameter, average shortest path.
- Static complexity metrics: total cyclomatic complexity, token counts per function and globally, and function parameter counts.
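A minimal sketch of the static feature extraction, using only Python's standard library. The full feature set described above also includes graph statistics (clustering coefficient, assortativity, diameter) that would require a dedicated graph library, and the branching constructs counted here are an illustrative approximation of cyclomatic complexity, not the paper's exact definition:

```python
import ast
from collections import deque

def ast_features(source: str) -> dict:
    """Compute a small static feature vector from a program's AST (sketch)."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))

    # Edge count: one parent-child edge per child of every node.
    edge_count = sum(len(list(ast.iter_child_nodes(n))) for n in nodes)

    # Tree depth via breadth-first traversal from the Module root.
    depth, frontier = 0, deque([(tree, 0)])
    while frontier:
        node, d = frontier.popleft()
        depth = max(depth, d)
        for child in ast.iter_child_nodes(node):
            frontier.append((child, d + 1))

    # Rough cyclomatic complexity: 1 + number of branching constructs.
    branch_types = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)
    cyclomatic = 1 + sum(isinstance(n, branch_types) for n in nodes)

    return {
        "node_count": len(nodes),
        "edge_count": edge_count,
        "max_depth": depth,
        "cyclomatic_complexity": cyclomatic,
        "num_functions": sum(isinstance(n, ast.FunctionDef) for n in nodes),
    }
```

Because the candidates are plain Python source, the same extractor can be applied uniformly to every LLM-generated program before evaluation.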
An archive $\mathcal{A} = \{(x_i, f_i)\}$ is maintained, where $f_i$ is the observed benchmark fitness for the candidate with feature vector $x_i$. A gradient-boosted regression tree (XGBoost) surrogate is trained to predict $f$ from $x$ by minimizing the squared-error loss over the archive. Surrogate retraining occurs once the archive size exceeds a threshold (the population size), ensuring sufficient data for reliable modeling (Stein et al., 29 Jan 2026).
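The archive-plus-retraining logic can be sketched as follows. The XGBoost regressor is replaced here by a stdlib 1-nearest-neighbour stand-in so the example stays self-contained; the class and method names are illustrative, not the paper's API:

```python
import math

class SurrogateArchive:
    """Archive of (feature-vector, fitness) pairs with threshold-gated refits."""

    def __init__(self, min_size: int):
        self.min_size = min_size   # retraining threshold, e.g. the population size
        self.pairs: list[tuple[list[float], float]] = []
        self.ready = False         # has the surrogate been fit yet?

    def add(self, features: list[float], fitness: float) -> None:
        self.pairs.append((features, fitness))
        if len(self.pairs) >= self.min_size:
            self._refit()

    def _refit(self) -> None:
        # The real system fits an XGBoost regressor by minimizing squared
        # error; this 1-NN stand-in simply memorizes the archive.
        self.ready = True

    def predict(self, features: list[float]) -> float:
        if not self.ready:
            raise RuntimeError("archive smaller than the retraining threshold")
        # Predict the fitness of the nearest archived feature vector.
        _, fit = min(self.pairs, key=lambda p: math.dist(p[0], features))
        return fit
```

The gating on `min_size` mirrors the paper's rule that the surrogate is only trusted once enough (feature, fitness) pairs have accumulated.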
3. Explainable AI Feedback for LLM-Guided Mutation
After fitting the surrogate model, explainable AI—in the form of SHAP values—decomposes each predicted fitness into additive per-feature attributions,
$$\hat{f}(x) = \phi_0 + \sum_{j} \phi_j,$$
where $\phi_j$ is the attribution of feature $j$. The feature $j^*$ with the largest absolute attribution is selected as the most influential. If $\phi_{j^*} > 0$, the guidance is to increase feature $j^*$; if $\phi_{j^*} < 0$, the guidance is to decrease it. This directive is expressed in natural language and prepended to the mutation prompt provided to the LLM (e.g., “Based on archive analysis, try to increase the total cyclomatic complexity of the solution.”).
This feature-level feedback is injected at each LLM mutation, influencing offspring synthesis without restricting overall code expressivity. A single most-relevant feature is targeted per mutation in the approach as presented (Stein et al., 29 Jan 2026).
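The sign-based rule above translates directly into a prompt fragment. This sketch assumes the SHAP attributions are already available as a plain dict; the feature names are illustrative:

```python
def guidance_sentence(shap_values: dict[str, float]) -> str:
    """Turn per-feature SHAP attributions into a natural-language directive.

    Mirrors the rule described in the text: pick the feature with the
    largest absolute attribution, and tell the LLM to increase it if the
    attribution is positive, decrease it otherwise.
    """
    feature, phi = max(shap_values.items(), key=lambda kv: abs(kv[1]))
    direction = "increase" if phi > 0 else "decrease"
    return (f"Based on archive analysis, try to {direction} "
            f"the {feature.replace('_', ' ')} of the solution.")
```

The resulting sentence is simply prepended to the base mutation prompt, so the LLM remains free to make any edit it likes.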
4. Integration with LLaMEA Evolutionary Procedure
LLaMEA-SAGE modifies the standard LLaMEA evolutionary loop to incorporate structured guidance:
- Initialization: Generate an initial population of candidate algorithms via the LLM.
- Evaluation: Run each candidate on the benchmark, extract its fitness and feature vector, and store the pair in the archive.
- Loop (until the evaluation budget is exhausted):
  - Retrain the surrogate on the archived (feature, fitness) pairs.
  - For each offspring:
    - Select a parent from the current population.
    - Sample a base mutation prompt.
    - Compute SHAP attributions of the surrogate at the parent’s feature vector.
    - Append the guidance (“increase” or “decrease” the most influential feature) to the prompt.
    - Generate the offspring via the LLM with the augmented prompt, execute it, extract its features, and update the archive.
  - Update the population by elitist selection.
The core innovation is that each offspring is biased, via prompt engineering, toward structural modifications empirically correlated with better performance in the current archive (Stein et al., 29 Jan 2026).
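A simplified skeleton of the guided loop, with the LLM, evaluator, feature extractor, and surrogate-plus-SHAP step abstracted as caller-supplied callables. All names, defaults, and the single-offspring elitist selection shown here are illustrative simplifications, not the paper's exact procedure:

```python
import random

def llamea_sage_loop(llm_mutate, evaluate, extract_features, explain,
                     mu=4, budget=20, seed=0):
    """Simplified skeleton of the SHAP-guided evolutionary loop.

    The four callables are caller-supplied stand-ins:
      llm_mutate(parent_code, guidance) -> offspring code
      evaluate(code)                    -> fitness (higher is better here)
      extract_features(code)            -> feature vector / dict
      explain(archive, features)        -> guidance string (SHAP-based in the paper)
    """
    rng = random.Random(seed)
    # Initialization: mu candidates from the LLM, evaluated and archived.
    population = [llm_mutate("", "") for _ in range(mu)]
    archive = [(extract_features(c), evaluate(c)) for c in population]
    evals = mu
    while evals < budget:
        parent = rng.choice(population)
        # Surrogate fitting + SHAP attribution, abstracted into `explain`.
        guidance = explain(archive, extract_features(parent))
        child = llm_mutate(parent, guidance)
        archive.append((extract_features(child), evaluate(child)))
        evals += 1
        # Elitist survivor selection (re-evaluates for brevity; cache in practice).
        population = sorted(population + [child], key=evaluate, reverse=True)[:mu]
    return max(population, key=evaluate)
```

In the real system the candidates are Python programs and `evaluate` runs them on a benchmark suite; the skeleton only fixes the control flow.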
5. Experimental Benchmarks and Quantitative Results
LLaMEA-SAGE has been evaluated on two primary experimental regimes:
Experiment 1: SBOX-COST (Proof of Concept)
- Benchmarks: Five separable SBOX-COST functions, 10D
- Budget: $200$ evaluations, 5 random seeds
- Metric: AOCC (anytime overall convergence coefficient)
- Results: LLaMEA-SAGE converges more rapidly than vanilla LLaMEA, especially in the early phase (first 50–100 evaluations); average AUC (area under curve) improved by 11.1 units, with the difference supported by a Cliff’s $\delta$ effect-size analysis.
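As a rough illustration of area-based anytime metrics such as AOCC, the following computes the trapezoidal area under a best-so-far convergence curve over a normalized unit evaluation axis. The paper's exact AOCC normalization (target bounds, scaling) is not reproduced here; this is only a generic sketch of the idea that earlier progress earns more area:

```python
def convergence_auc(best_so_far: list[float]) -> float:
    """Trapezoidal area under a best-so-far curve on a [0, 1] x-axis.

    Assumes `best_so_far[i]` is the (already normalized, higher-is-better)
    best fitness after evaluation i. Illustrative stand-in for AOCC.
    """
    n = len(best_so_far)
    if n < 2:
        return 0.0
    h = 1.0 / (n - 1)  # uniform spacing on the unit evaluation axis
    return sum((best_so_far[i] + best_so_far[i + 1]) * h / 2
               for i in range(n - 1))
```

Under this kind of metric, a method that reaches high fitness within the first 50–100 evaluations accrues most of its area early, which is exactly the regime where the guided variant separates from the baseline.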
Experiment 2: MA-BBOB (GECCO Competition Benchmark)
- Benchmarks: MA-BBOB, 10 instances, $5000d$ evaluations per algorithm
- Methods: LLaMEA-SAGE, vanilla LLaMEA, MCTS-AHD, LHNS
- Results: LLaMEA-SAGE achieves higher AOCC faster than all baselines, with early sample efficiency—at 50 evaluations, AOCC exceeds vanilla LLaMEA’s final value at 200. Ablation with Gemini-2.0-flash-lite confirmed robustness to model backend.
Qualitative Analysis:
- Cyclomatic complexity and parameter count are the most frequently targeted features by guidance prompts, almost always with the “increase” directive.
- The LLM complies with the “refine” guidance prompts in roughly 70% of cases.
- Both LLaMEA and LLaMEA-SAGE have similar LLM token usage per run, with lower computational overhead compared to MCTS-AHD (Stein et al., 29 Jan 2026).
6. Interpretation, Limitations, and Opportunities
Structural feedback provides an inductive bias enabling the search process to exploit associations between code patterns (e.g., higher cyclomatic complexity, structural depth) and performance, resulting in accelerated convergence and decreased outcome variance. Even imperfect surrogate modeling suffices, as the dominant impact is to bias the LLM away from fruitless regions of code space.
Limitations acknowledged include:
- Applicability tested only on moderate-dimensional continuous problems (10D in the reported experiments).
- Dynamic or behavioral code features (e.g., runtime traces, memory access statistics) are not yet incorporated.
- Only the most relevant single feature is exploited for feedback per mutation; multi-objective guidance remains unexplored.
- Surrogate model accuracy and SHAP decomposition quality inherently affect the value of the feedback.
Potential future directions proposed:
- Multi-feature feedback and learnable weighting in prompt augmentation.
- Online adaptation of guidance strategy (e.g., switching between increase/decrease based on uncertainty or observed progress).
- Application to broader AAD frameworks and inclusion of dynamic code characterization (Stein et al., 29 Jan 2026).
7. Comparative Summary with Related Methodologies
LLaMEA-SAGE can be contrasted with prior AAD search frameworks:
- Fitness-only LLM-based AAD (vanilla LLaMEA): No code structure feedback; LLM mutation is unbiased.
- Non-LLM AAD (e.g., MCTS-AHD, LHNS): Explore via tree-search or neighborhood moves, not leveraging code structure or LLM-based mutation in this manner.
- LLaMEA-SAGE: Multi-level integration—uses LLMs for code generation, a learned surrogate for feature–performance mapping, explainable AI for actionable feedback, and prompt engineering for targeted search acceleration, yielding improved sample efficiency and convergence rates in empirical evaluations.
Summary of algorithmic components:
| Component | Technique | Purpose in LLaMEA-SAGE |
|---|---|---|
| Solution encoding | Python AST + code graph | Enables feature extraction |
| Surrogate modeling | XGBoost regressor | Relates structure to performance |
| Feature attribution | SHAP (tree-SHAP) | Identifies most impactful feature |
| Guidance injection | Natural-language prompt augmentation | Directs LLM mutation strategy |
| Evaluation | Black-box optimization benchmarks | Measures AOCC, sample efficiency |
LLaMEA-SAGE operationalizes a feedback loop where knowledge about code structure, learned in an archive and distilled via explainable AI, can be used to inform and bias subsequent LLM-driven synthesis, demonstrating statistically significant improvements on AAD benchmarks (Stein et al., 29 Jan 2026).