AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive

Published 12 May 2026 in cs.AI, cs.CL, and cs.LG | (2605.11518v1)

Abstract: Effectively configuring scalable LLM experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.


Authors (4)

Summary

The paper presents a framework that automates high-cost LLM experiment configuration with an RL-trained agent that reasons across fidelities.
It introduces LLMConfig-Gym, an offline RL environment supporting diverse, multi-fidelity LLM tasks to efficiently explore and optimize configuration spaces.
Results show near-zero regret and significant scalability improvements, outperforming classical baselines and meta-learning methods.

Automating LLM Experiment Configuration via Agentic Cross-Fidelity Reasoning: AutoLLMResearch

Motivation and Problem Statement

AutoLLMResearch (2605.11518) addresses a previously neglected problem: automating configuration for large-scale LLM experiments whose computational cost is high. Configuring such experiments well, across architecture search, hyperparameter tuning, RL post-training, and data mixture selection, matters because poor choices waste compute and keep models from reaching their full potential. Classical HPO tools, Bayesian optimization, and recent prompt-based LLM agents assume low-cost settings where repeated trial and error is feasible, an assumption that fails at LLM scale, where each trial consumes a large number of GPU hours.

Three domain-specific challenges are identified:

Challenge 1: Absence of verifiable multi-fidelity environments enabling cumulative experience transfer.
Challenge 2: Configuration space shifts between fidelities, i.e., disjoint variable domains across training and test settings.
Challenge 3: Optimization landscape shifts, where the search surface itself evolves with data/model scale, invalidating naive transfer.
Figure 1: Overview of the limitations of current methods, our motivation, and the three challenges addressed by our framework.

Framework Architecture: LLMConfig-Gym and Agentic Training Pipeline

The proposed solution consists of two principal components:

LLMConfig-Gym

LLMConfig-Gym is an offline, lookup-based RL environment exposing four representative LLM configuration tasks:

Model architecture search (HW-GPT-Bench).
Pretraining hyperparameter tuning (Step Law).
Hyperparameter tuning for GRPO-based RL post-training (extensive grid search on multiple datasets and backbone sizes).
Data mixture selection (ADMIRE IFT Runs with Tülu-3/Qwen2.5).

The Gym supports multi-fidelity task splits and rapid queries, facilitating agent learning from verified historical outcomes across large discrete configuration spaces.

Figure 2: LLMConfig-Gym overview. A unified, lookup-table-based Gym organized by Task $\rightarrow$ Fidelity $\rightarrow$ Experiment.
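To make the Task $\rightarrow$ Fidelity $\rightarrow$ Experiment organization concrete, here is a minimal sketch of a lookup-based environment. The class name `LookupGym`, the `query` method, the task and fidelity labels, and all numeric values are hypothetical and chosen for illustration; they are not the paper's API or data.

```python
from typing import Tuple

# A configuration is stored as a sorted tuple of (name, value) pairs so it is hashable.
ConfigKey = Tuple[Tuple[str, object], ...]


def as_key(config: dict) -> ConfigKey:
    """Convert a config dict into a hashable, order-independent key."""
    return tuple(sorted(config.items()))


class LookupGym:
    """Hypothetical offline environment: task -> fidelity -> experiment -> {config: score}."""

    def __init__(self, table: dict):
        self.table = table
        self.num_queries = 0  # budget accounting: one unit per successful lookup

    def query(self, task: str, fidelity: str, experiment: str, config: dict) -> float:
        """Return the recorded outcome for `config`; no training run is launched."""
        outcomes = self.table[task][fidelity][experiment]
        key = as_key(config)
        if key not in outcomes:
            raise KeyError(f"configuration not in table: {config}")
        self.num_queries += 1
        return outcomes[key]


# Toy values for illustration only, not results from the paper.
toy_table = {
    "pretrain_hpo": {
        "low": {"exp0": {as_key({"lr": 1e-3, "batch_size": 64}): 3.2,
                         as_key({"lr": 3e-4, "batch_size": 64}): 3.0}},
        "high": {"exp0": {as_key({"lr": 1e-3, "batch_size": 256}): 2.7,
                          as_key({"lr": 3e-4, "batch_size": 256}): 2.5}},
    }
}

gym = LookupGym(toy_table)
loss = gym.query("pretrain_hpo", "high", "exp0", {"lr": 3e-4, "batch_size": 256})
```

Because each query is a dictionary lookup over previously recorded outcomes, an agent can be trained against many episodes without rerunning any experiment, which is the property the Gym relies on.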

Structured Agent Training Pipeline

Configuration research is cast as a long-horizon MDP, where the agent, implemented as a fine-tuned LLM, iteratively "thinks" (proposes configurations via text reasoning) and "executes" (queries the Gym to observe performance).
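The think/execute loop can be sketched as a budgeted episode. In this sketch, `propose` is a stand-in for the fine-tuned LLM policy and `evaluate` for a Gym query; both names, the toy policy, and the toy outcomes are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

History = List[Tuple[dict, float]]


def run_episode(propose: Callable[[History], dict],
                evaluate: Callable[[dict], float],
                budget: int) -> History:
    """Alternate between proposing a configuration and observing its outcome."""
    history: History = []
    for _ in range(budget):
        config = propose(history)        # "think": choose the next configuration
        score = evaluate(config)         # "execute": observe its recorded performance
        history.append((config, score))  # the growing history is the MDP state
    return history


# Made-up outcomes (lower is better), for illustration only.
toy_outcomes = {1e-3: 3.2, 3e-4: 3.0, 1e-4: 3.1}


def toy_propose(history: History) -> dict:
    """Trivial stand-in policy: try each untested learning rate once."""
    tried = {config["lr"] for config, _ in history}
    for lr in toy_outcomes:
        if lr not in tried:
            return {"lr": lr}
    return {"lr": min(toy_outcomes, key=toy_outcomes.get)}


def toy_evaluate(config: dict) -> float:
    return toy_outcomes[config["lr"]]


history = run_episode(toy_propose, toy_evaluate, budget=3)
best_config, best_loss = min(history, key=lambda item: item[1])
```

In the actual framework the policy conditions on text (prior observations plus low-fidelity demonstrations) rather than on a Python list, but the control flow of the episode has the same shape.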

The training pipeline comprises:

Trajectory simulation and policy distillation: Successful reasoning chains are distilled to teach structured optimization strategies.
Multi-turn RL (GRPO algorithm): Direct optimization over long-form, sequential agent outputs, with reward shaped by cumulative regret (rather than best-found value) to mitigate overfitting.
Cross-fidelity transfer via in-context demonstrations: Low-fidelity results are injected as context, incentivizing extrapolative reasoning over shifted spaces/landscapes.
Figure 3: Overview of our framework. Step 1 curates multi-fidelity training and testing experiments with in-context demonstrations. Step 2 collects successful reasoning trajectories via high-temperature sampling for policy distillation. Step 3 further optimizes the policy with multi-turn RL. Step 4 deploys the trained policy on various unseen high-fidelity experiments.
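The difference between rewarding cumulative regret and rewarding the best-found value can be illustrated with a small sketch. The definitions below are the standard ones for a minimization objective; the exact reward shaping used in the paper is not given here, so treating the reward as negative cumulative regret is an assumption, and the scores are made up.

```python
from typing import Sequence


def simple_regret(scores: Sequence[float], optimum: float) -> float:
    """Gap between the best score found in the episode and the true optimum."""
    return min(scores) - optimum


def cumulative_regret(scores: Sequence[float], optimum: float) -> float:
    """Sum of per-query gaps, so every wasted query is penalized."""
    return sum(score - optimum for score in scores)


optimum = 2.5                    # made-up best achievable loss
focused = [3.0, 2.5, 2.5]        # reaches the optimum quickly and stays there
wasteful = [4.0, 3.5, 2.5]       # reaches the optimum only on the last query

# Both episodes look identical under best-found value...
same_simple = simple_regret(focused, optimum) == simple_regret(wasteful, optimum)

# ...but cumulative regret separates them (assumed reward: its negative).
reward_focused = -cumulative_regret(focused, optimum)
reward_wasteful = -cumulative_regret(wasteful, optimum)
```

Under this reading, a policy is pushed to propose good configurations from the first query onward instead of getting full credit for a single lucky hit late in the episode.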

Quantitative Evaluation

Extensive experiments on all Gym tasks show that the RL-trained agent achieves consistently lower regret under strict budget constraints than classical baselines (random search, Top-K warm start), meta-training approaches (NAP, MetaBO), and prompt-based LLM agents (OpenAI o4-mini, Gemini, GPT-5).

Figure 4: Overall performance comparison across all tasks and budget constraints. Our method achieves the lowest regret across different settings, demonstrating its effectiveness.

Notably:

Configuration space shift (Tasks 1, 4): The trained agent achieves near-zero regret (e.g., $\sim 0.01$) even when the test space is disjoint from training, a scenario where meta-learning collapses.
Optimization landscape shift (Tasks 2, 3): RL-trained agents avoid over-extrapolation and adapt to shifted optima, outperforming both prompt-based and distributional meta-learners.

Interpretability: Reasoning and Transfer Mechanisms

Detailed case studies trace the evolution of agent reasoning:

Challenge 2 (Configuration space shift): The agent learns qualitative balancing rules (e.g., moderate embed_dim selection) rather than memorizing fixed configs, and applies them in unseen test domains.
Challenge 3 (Landscape shift): The agent internalizes fidelity-dependent scaling trends (e.g., "params $\uparrow \Rightarrow$ lr $\downarrow$; tokens $\uparrow \Rightarrow$ lr $\uparrow$"), producing calibrated configs that correct naive extrapolation mistakes.
Figure 5: Reasoning trajectories on the train and test set for pretraining hyperparameter configuration, demonstrating agent's learned fidelity-dependent scaling rules.

Optimization trajectory analysis shows the RL-trained agent rapidly prunes the search region, leveraging text-based reasoning to concentrate exploration near global optima from the outset—contrasting with erratic or misdirected search by baselines.

Figure 6: Optimization trajectories and reward densification via most-similar matching. RL-trained agent search concentrates around global optima; matching converts format errors into valid queries, stabilizing RL training.
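Most-similar matching can be sketched as a nearest-neighbor lookup that maps a proposal missing from the table onto the closest recorded configuration. How similarity is measured is not specified in this summary; the range-normalized absolute distance over numeric fields below, the function name `most_similar`, and the example values are all assumptions.

```python
from typing import Dict, List

Config = Dict[str, float]


def most_similar(proposed: Config, valid_configs: List[Config]) -> Config:
    """Return the recorded configuration closest to `proposed`."""
    keys = list(valid_configs[0].keys())
    # Normalize each field by its range so no single field dominates the distance.
    spans = {
        k: (max(c[k] for c in valid_configs) - min(c[k] for c in valid_configs)) or 1.0
        for k in keys
    }

    def distance(candidate: Config) -> float:
        # A field missing from the proposal contributes zero distance.
        return sum(abs(candidate[k] - proposed.get(k, candidate[k])) / spans[k]
                   for k in keys)

    return min(valid_configs, key=distance)


# Made-up architecture grid, for illustration only.
valid = [
    {"embed_dim": 256, "n_layer": 6},
    {"embed_dim": 512, "n_layer": 12},
    {"embed_dim": 1024, "n_layer": 24},
]

# The proposal is not in the table, so it is snapped to its nearest neighbor.
matched = most_similar({"embed_dim": 600, "n_layer": 10}, valid)
```

Snapping an off-grid proposal to a valid entry means the episode still returns an informative score instead of a flat penalty, which is one plausible reading of how matching densifies the reward.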

Robustness, Scalability, and Practical Implications

The cost analysis demonstrates substantial amortized gains: as the number of high-fidelity tasks increases, the upfront meta-training cost is quickly offset, yielding a $3.6\times$ reduction in cumulative GPU hours at $K=30$ tasks.
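One simple way to read the amortization argument is as a fixed cost spread over $K$ tasks. The symbols here are assumptions introduced for illustration, not the paper's notation: $C_{\text{meta}}$ is the one-time training cost, $c_{\text{agent}}$ is the per-task cost when the trained agent is used, and $c_{\text{base}}$ is the per-task cost of a baseline search.

$$
\text{reduction}(K) = \frac{K \, c_{\text{base}}}{C_{\text{meta}} + K \, c_{\text{agent}}},
\qquad
K^{*} = \frac{C_{\text{meta}}}{c_{\text{base}} - c_{\text{agent}}} \quad (c_{\text{base}} > c_{\text{agent}})
$$

Under this model the agent breaks even at $K = K^{*}$, and the reduction grows with $K$ toward the limit $c_{\text{base}} / c_{\text{agent}}$. A figure reported at a specific $K$ would then be one point on this curve.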

In stress-tested settings with sparse training coverage or reversed optimal regions, RL-based cross-fidelity agents degrade gracefully and recover faster than baseline meta-learners, confirming robustness to adversarial regimes.

Figure 7: Sparse coverage experiment, showing the agent's regret declines steeply as coverage grows, outperforming NAP across regimes.

Training Dynamics and RL Stabilization

Quantitative monitoring shows regret (mean@3) dropping consistently on held-out test sets, critic score rising, and invalid outputs decreasing over RL training steps. Most-similar configuration matching reduces the rate of format violations from 32% to a negligible level, which densifies the reward signal.

Figure 8: Training dynamics across all four tasks, showing improvement in regret, critic scores, and response length over training steps.

Optimization Landscapes

Multi-fidelity task landscapes are visualized, revealing rugged search spaces with multiple local minima and shifting optima—a context where classical and prompt methods fail to generalize, but RL-trained agents adaptively locate high-performing regions.

Figure 9: Optimization landscape for model architecture configuration, illustrating ruggedness and local minima.

Theoretical Implications and Future Directions

AutoLLMResearch provides a formal framework for cumulative, cross-fidelity experiential learning in scientific agentic optimization, and is a concrete advancement towards recursive self-improvement paradigms. The decoupling of knowledge accumulation from expensive experiment execution opens avenues for scalable agent-driven research in other domains (materials science, biology, etc.), where cheap proxy experiments can inform and accelerate high-fidelity exploration.

Potential directions include expanding Gym tasks, enabling multi-objective optimization, and integrating deeper recursive design strategies to automate future LLM training and configuration workflows.

Conclusion

AutoLLMResearch establishes a robust methodology for automating high-cost LLM configuration tasks by training agents that accumulate transferable principles from low-fidelity experiments. The resulting RL-trained LLM agents exhibit strong cross-fidelity generalization, interpretability, and practical scalability, outperforming conventional baselines on multiple axes and informing future trajectories in agent-driven scientific discovery.
