
Expanding LLM Agent Boundaries with Strategy-Guided Exploration

Published 2 Mar 2026 in cs.LG | (2603.02045v1)

Abstract: Reinforcement learning (RL) has demonstrated notable success in post-training LLMs as agents for tasks such as computer use, tool calling, and coding. However, exploration remains a central challenge in RL for LLM agents, especially as they operate in language-action spaces with complex observations and sparse outcome rewards. In this work, we address exploration for LLM agents by leveraging the ability of LLMs to plan and reason in language about the environment to shift exploration from low-level actions to higher-level language strategies. We thus propose Strategy-Guided Exploration (SGE), which first generates a concise natural-language strategy that describes what to do to make progress toward the goal, and then generates environment actions conditioned on that strategy. By exploring in the space of strategies rather than the space of actions, SGE induces structured and diverse exploration that targets different environment outcomes. To increase strategy diversity during RL, SGE introduces mixed-temperature sampling, which explores diverse strategies in parallel, along with a strategy reflection process that grounds strategy generation on the outcomes of previous strategies in the environment. Across UI interaction, tool-calling, coding, and embodied agent environments, SGE consistently outperforms exploration-focused RL baselines, improving both learning efficiency and final performance. We show that SGE enables the agent to learn to solve tasks too difficult for the base model.

Summary

  • The paper introduces a novel framework that decouples high-level strategy planning from action execution, achieving a 27% relative pass@1 improvement over baselines.
  • It employs mixed-temperature sampling to generate diverse strategies while ensuring coherent, low-temperature actions, thus enhancing exploration efficiency.
  • The approach leverages strategy reflection, yielding improved generalization and sample efficiency across domains like Coding, AndroidWorld, and more.

Expanding Agentic Capabilities with Strategy-Guided Exploration in LLM RL

Introduction

Training LLM agents via reinforcement learning (RL) to perform complex agentic tasks in environments with compositional language-action spaces (e.g., software control, code generation, tool use, robotics) is critically bottlenecked by exploration under sparse outcome rewards. Standard RL protocols, even with advanced policy gradient methods and entropy maximization, often fail to move beyond behaviors already accessible to the base LLM, and therefore generate no learning signal for tasks the pretrained model cannot at least partly solve. "Expanding LLM Agent Boundaries with Strategy-Guided Exploration" (2603.02045) addresses this issue, proposing Strategy-Guided Exploration (SGE): an RL training framework that leverages LLMs' language reasoning to shift exploration from the low-level action space to the space of high-level, natural-language "strategies," generated and reflected upon during training. SGE introduces mixed-temperature sampling and strategy reflection to induce both diversity and effectiveness in exploring new trajectories.

Figure 1: Architecture of Strategy-Guided Exploration (SGE), illustrating strategy output and downstream action conditioning, with diverse exploration trajectories enabled by mixed-temperature sampling and reflection.

Methodology: Strategy-Guided Exploration

SGE modifies the standard RL training process for LLM agents at three levels:

  1. Strategy Prompting: At each decision step, before action generation, the LLM is explicitly prompted to output a concise, high-level strategy describing its intended approach toward the task goal. The action output is then conditioned on this strategy. This abstraction explicitly decouples strategic intent from fine-grained execution, exploiting LLMs' capacity for abstraction and planning.
  2. Mixed-Temperature Sampling: SGE controls the sampling temperature separately for strategy tokens and action tokens. Strategies are sampled at a higher temperature, promoting divergent high-level intents, while subsequent reasoning traces and actions are produced at a lower temperature, preserving coherence and minimizing stochasticity in execution. Empirical ablation confirms that this mixed regime induces more meaningful exploration than globally high- or low-temperature sampling.

Figure 2: Training ablation contrasting mixed-temperature with uniform sampling temperatures, showing the distinct advantage of SGE's token-level temperature control.

  3. Strategy Reflection: SGE injects feedback from both failed and successful strategies. Failed rollouts trigger negative reflection, prompting the LLM to critique and explicitly avoid prior failed strategies and to sample new alternatives. Successful episodes occasionally trigger positive reflection, encouraging the generation of alternative strategies inspired by known successes, further increasing outcome diversity and entropy.
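As a rough sketch of the reflection step, one might assemble the strategy-generation prompt from prior rollout outcomes like this (the function name and prompt wording are illustrative assumptions, not the paper's actual templates):

```python
def build_reflection_prompt(goal, failed_strategies, successful_strategies=()):
    """Assemble a strategy-generation prompt conditioned on past outcomes.

    Failed strategies are listed so the model can critique and avoid them
    (negative reflection); successful ones, when present, are offered as
    inspiration for alternatives (positive reflection).
    """
    lines = [f"Task goal: {goal}", ""]
    if failed_strategies:
        lines.append("The following strategies were tried and FAILED:")
        lines += [f"- {s}" for s in failed_strategies]
        lines.append("Briefly explain why they may have failed, then propose "
                     "a genuinely different strategy.")
    if successful_strategies:
        lines.append("The following strategies SUCCEEDED:")
        lines += [f"- {s}" for s in successful_strategies]
        lines.append("Propose an alternative strategy that could also succeed.")
    if not failed_strategies and not successful_strategies:
        lines.append("Propose a concise high-level strategy to reach the goal.")
    return "\n".join(lines)
```

The resulting prompt would precede strategy sampling at each reflection-triggering step, grounding new strategies in observed environment outcomes.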

This schema is fully compatible with standard online RL algorithms (e.g., GRPO); SGE imposes no constraint on policy class or environment as long as high-level language strategies and outcome rewards can be defined.
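The two-temperature decision step can be illustrated with a toy sampler over categorical logits (a sketch only: in SGE proper the distributions come from the LLM conditioned on the sampled strategy text, and the specific temperature values below are assumptions, not the paper's settings):

```python
import math, random

def sample(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    z = sum(probs)
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p / z
        if r <= acc:
            return i
    return len(probs) - 1

def sge_step(strategy_logits, action_logits, rng,
             t_strategy=1.2, t_action=0.3):
    """One SGE decision step: explore in strategy space at high temperature,
    then act at low temperature conditioned on the chosen strategy.
    Toy logit tables stand in for the LLM's per-token distributions."""
    strategy = sample(strategy_logits, t_strategy, rng)
    # In the real method, action logits are produced by the LLM conditioned
    # on the sampled strategy text; here we just index a lookup table.
    action = sample(action_logits[strategy], t_action, rng)
    return strategy, action
```

The low action temperature keeps execution near-deterministic, so trajectory diversity comes almost entirely from the high-temperature strategy choice.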

Empirical Results

SGE is evaluated across four agentic domains: AndroidWorld (UI control from pixels), LangR (embodied household rearrangement), Coding (iterative code generation), and AppWorld (API tool calls). In all domains, SGE achieves higher final pass rates (pass@1)—on average a 27% relative improvement over the strongest baseline.
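For reference, pass@1 and the pass@k comparisons below are typically computed with the standard unbiased estimator popularized by the Codex paper; a minimal implementation (the paper's exact evaluation protocol may differ):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n attempts, c of which are
    correct, solves the task. pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect attempts: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 1 correct solution out of 10 attempts, pass@1 is 0.1, while pass@5 rises to 0.5.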

Surpassing Base Model and RL Baseline Ceilings

Unlike prior approaches, SGE-trained agents can execute successful solutions for task instances unsolvable even with thousands of base model attempts (i.e., SGE-trained policies surpass pass@k ceilings for large k where base and RL-only models plateau). In Coding and LangR, SGE exceeds the highest observed base model pass@k by 11% relative, with exploration enabling new behaviors that RL or diversity bonuses alone cannot reach.

Figure 3: SGE-trained policy achieves higher pass@k on unseen tasks in Coding, overtaking both the base LLM and the non-SGE RL-trained model.

Further, SGE demonstrates substantial sample efficiency improvements, achieving stronger performance and faster learning than entropy-based (EntropyAdv), policy-diversity (pass@k reward), or intrinsic-motivation (RND) baselines.

Figure 4: SGE rapidly increases the diversity of unique program outcomes per task, demonstrating accelerated environment exploration compared to standard RL.

Generalization

SGE not only improves within-training performance but yields more generalizable policies. On held-out test tasks, SGE outperforms both zero-shot and RL baselines in all environments—e.g., in Coding, moving pass@1 from 13.5 (zero-shot) and 22.0 (GRPO) to 29.2.

Component Ablations

  • Mixed-temperature sampling: Critical for allowing the exploration of semantically new trajectories; uniform temperature sampling (whether high or default) is suboptimal.
  • Strategy reflection: Negative reflection (on failed strategies) is particularly important; positive reflection also helps by encouraging entropy over successful modes.
  • Model scaling: SGE's benefit scales with base LLM capacity. Small models (e.g., 600M parameters) do not benefit, reflecting the method's dependence on language-based reasoning and planning.

Figure 5: Mixed strategy/action sampling grid search showing highest pass@k exploration with high-temperature strategies and lower-temperature actions.
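A temperature grid search of this kind can be sketched generically; `evaluate` here is a hypothetical callable that runs rollouts at the given temperatures and returns a validation metric such as pass@k:

```python
from itertools import product

def grid_search(evaluate, strategy_temps, action_temps):
    """Return the (t_strategy, t_action) pair maximizing an evaluation
    metric, along with its score. `evaluate` is user-supplied; in the
    SGE setting it would run rollouts and score exploration quality."""
    best, best_score = None, float("-inf")
    for ts, ta in product(strategy_temps, action_temps):
        score = evaluate(ts, ta)
        if score > best_score:
            best, best_score = (ts, ta), score
    return best, best_score
```

Per the ablation above, such a search would be expected to favor a high strategy temperature paired with a low action temperature.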

Qualitative Analysis

SGE fosters exploration that yields genuinely distinct environment outcomes, rather than mere token-level variability over equivalent actions. For instance, in AndroidWorld UI tasks, SGE generates strategies that reflect on past failed button taps, guiding exploration toward semantically novel attempts rather than merely spatially perturbed ones.

Implications and Future Directions

On a theoretical level, SGE provides evidence that RL for LLM agents can expand model capabilities beyond polishing or re-ordering base model solutions in agentic multi-step settings. By leveraging structured high-level linguistic abstraction and reflection, SGE operationalizes a tractable form of exploration that is both efficient and practical.

Practically, SGE's approach offers a blueprint for RL-driven capability expansion in LLM agents destined for complex, open-ended environments (UI, code, tools/APIs, robotics), provided the base LLM is strong enough for reasoning and strategy formation. The approach is operationally lightweight (pure prompting and sampling control), broadly compatible, and requires no external teachers or labeled strategies.

Limitations include increased inference cost from per-step strategy generation and dependence on model scale for effective benefit. Additionally, extensions to non-agentic RL settings (e.g., math reasoning, single-shot QA) require further investigation, given SGE's tight coupling with sequential, partially observable environments (POMDPs).

Conclusion

"Expanding LLM Agent Boundaries with Strategy-Guided Exploration" (2603.02045) offers a methodologically robust and empirically validated framework for solving the exploration bottleneck in RL fine-tuning of LLM agents for complex, sparse-reward tasks. By structuring exploration around high-level, diverse, and reflective strategy generation, SGE demonstrates marked advances in realized agentic capability, learning efficiency, and generalization, evidencing that strategic reasoning is a viable substrate for RL-driven improvement in LLM policy learning. Future work should target dynamic strategy scheduling, scaling to even longer-horizon tasks, and formalization of strategy abstraction for broader domains.
