- The paper introduces a novel framework that decouples high-level strategy planning from action execution, achieving a 27% relative pass@1 improvement over baselines.
- It employs mixed-temperature sampling to generate diverse strategies while ensuring coherent, low-temperature actions, thus enhancing exploration efficiency.
- The approach leverages strategy reflection, yielding improved generalization and sample efficiency across domains like Coding, AndroidWorld, and more.
Expanding Agentic Capabilities with Strategy-Guided Exploration in LLM RL
Introduction
Training LLM agents via reinforcement learning (RL) to perform complex agentic tasks in environments with compositional language-action spaces (e.g., software control, code generation, tool use, robotics) is critically bottlenecked by exploration under sparse outcome rewards. Standard RL protocols, even with advanced policy gradient methods and entropy maximization, often fail to move beyond behaviors already accessible to the base LLM, and therefore generate no learning signal for tasks the pretrained model cannot already partly solve. "Expanding LLM Agent Boundaries with Strategy-Guided Exploration" (2603.02045) addresses this issue by proposing Strategy-Guided Exploration (SGE): an RL training framework that leverages LLMs' language reasoning to shift exploration from the low-level action space to the space of high-level, natural-language "strategies," generated and reflected upon during training. SGE introduces mixed-temperature sampling and strategy reflection to induce both diversity and effectiveness in exploring new trajectories.
Figure 1: Architecture of Strategy-Guided Exploration (SGE), illustrating strategy output and downstream action conditioning, with diverse exploration trajectories enabled by mixed-temperature sampling and reflection.
Methodology: Strategy-Guided Exploration
SGE modifies the standard RL training process for LLM agents at three levels:
- Strategy Prompting: At each decision step, before action generation, the LLM is explicitly prompted to output a concise, high-level strategy describing its intended approach toward the task goal. The action output is then conditioned on this strategy. This abstraction explicitly decouples strategic intent from fine-grained execution, exploiting LLMs' capacity for abstraction and planning.
- Mixed-Temperature Sampling: SGE individually controls the sampling temperature for strategy versus action tokens. Strategies are sampled at higher temperature, promoting divergent high-level intents, while subsequent reasoning traces and actions are produced at lower temperature, preserving coherence and minimizing stochasticity in execution. Empirical ablation confirms that this mixed sampling regime induces more meaningful exploration than globally high- or low-temperature sampling.


Figure 2: Training ablation contrasting mixed-temperature with uniform sampling temperatures, showing the distinct advantage of SGE’s token-level temperature control.
- Strategy Reflection: SGE injects feedback from both failed and successful strategies. Failed rollouts trigger negative reflection, prompting the LLM to critique prior failed strategies, explicitly avoid them, and sample new alternatives. Successful episodes occasionally trigger positive reflection, encouraging the generation of alternative strategies inspired by known successes, which further increases outcome diversity and entropy.
This schema is fully compatible with standard online RL algorithms (e.g., GRPO); SGE imposes no constraint on policy class or environment as long as high-level language strategies and outcome rewards can be defined.
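The decision step described above can be sketched in code. The prompt templates, temperature values, and the toy `sample_tokens`-style decoding below are illustrative assumptions, not the paper's exact implementation; the sketch shows how temperature scaling flattens the strategy distribution while sharpening the action distribution.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities with temperature scaling.
    High temperature flattens the distribution (more diverse samples);
    low temperature sharpens it (more deterministic samples)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate_step(llm_logits_fn, prompt, temperature, rng):
    """One toy decoding step: sample a token index from the
    temperature-scaled distribution over a small vocabulary."""
    probs = softmax_with_temperature(llm_logits_fn(prompt), temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical SGE decision step: hot strategy, cold action.
STRATEGY_TEMP = 1.2   # high temperature -> divergent high-level intents
ACTION_TEMP = 0.3     # low temperature  -> coherent execution

def sge_decision_step(llm_logits_fn, observation, rng):
    strategy_tok = generate_step(
        llm_logits_fn, f"Observation: {observation}\nStrategy:", STRATEGY_TEMP, rng)
    action_tok = generate_step(
        llm_logits_fn, f"Strategy: {strategy_tok}\nAction:", ACTION_TEMP, rng)
    return strategy_tok, action_tok

# Temperature effect on a fixed logit vector:
logits = [2.0, 1.0, 0.5, 0.0]
hot = softmax_with_temperature(logits, STRATEGY_TEMP)
cold = softmax_with_temperature(logits, ACTION_TEMP)
# The cold distribution concentrates more mass on the argmax token.
assert max(cold) > max(hot)
```

In a real system the two `generate_step` calls would be two decoding passes of the same LLM with different `temperature` settings; the key design choice is that only the strategy tokens are sampled hot, so execution stays coherent.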
Empirical Results
SGE is evaluated across four agentic domains: AndroidWorld (UI control from pixels), LangR (embodied household rearrangement), Coding (iterative code generation), and AppWorld (API toolcalls). In all domains, SGE achieves higher final pass rates (pass@1)—on average a 27% relative improvement over the strongest baseline.
Surpassing Base Model and RL Baseline Ceilings
Unlike prior approaches, SGE-trained agents solve task instances that remain unsolved even after thousands of base-model attempts (i.e., SGE-trained policies surpass the pass@k ceilings at which base and RL-only models plateau for large k). In Coding and LangR, SGE exceeds the highest observed base-model pass@k by 11% relative, with exploration enabling behaviors that RL or diversity bonuses alone cannot reach.
Figure 3: SGE-trained policy achieves higher pass@k on unseen tasks in Coding, overtaking both the base LLM and non-SGE RL-trained model.
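pass@k figures of this kind are commonly computed with the standard unbiased estimator from the code-generation literature: given n sampled attempts per task of which c pass, pass@k = 1 - C(n-c, k)/C(n, k). A minimal sketch (the estimator itself is standard; its use in this paper's exact evaluation pipeline is an assumption):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    attempts, drawn without replacement from n samples of which c pass,
    is a passing attempt."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Examples: 3 of 10 samples pass.
print(pass_at_k(10, 3, 1))   # pass@1 equals c/n = 0.3
print(pass_at_k(10, 3, 10))  # k = n with c > 0 gives 1.0
```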
Further, SGE demonstrates substantial sample-efficiency improvements, achieving stronger performance and faster learning than entropy-based (EntropyAdv), policy-diversity (pass@k reward), and intrinsic-motivation (RND) baselines.
Figure 4: SGE rapidly increases the diversity of unique program outcomes per task, demonstrating accelerated environment exploration compared to standard RL.
Generalization
SGE not only improves within-training performance but yields more generalizable policies. On held-out test tasks, SGE outperforms both zero-shot and RL baselines in all environments—e.g., in Coding, moving pass@1 from 13.5 (zero-shot) and 22.0 (GRPO) to 29.2.
Component Ablations
- Mixed-temperature sampling: Critical for allowing the exploration of semantically new trajectories; uniform temperature sampling (whether high or default) is suboptimal.
- Strategy reflection: Negative reflection (on failed strategies) is particularly important; positive reflection also helps by encouraging entropy over successful modes.
- Model scaling: SGE's benefit scales with base-model capacity. Small models (e.g., 600M parameters) do not benefit, reflecting the method's dependence on language-based reasoning and planning.

Figure 5: Mixed-strategy/action sampling grid search showing highest pass@k exploration with high-temperature strategies and lower-temperature actions.
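The reflection mechanism can be illustrated with prompt templates. The wording and function names below are hypothetical, not the paper's actual prompts; they show the shape of negative reflection (critique and avoid failures) versus positive reflection (diversify around a success).

```python
def negative_reflection_prompt(task: str, failed_strategies: list[str]) -> str:
    """Hypothetical prompt asking the model to critique failed strategies
    and propose a new one that avoids them (negative reflection)."""
    failures = "\n".join(f"- {s}" for s in failed_strategies)
    return (
        f"Task: {task}\n"
        f"The following strategies were tried and failed:\n{failures}\n"
        "Briefly explain why they likely failed, then propose a new strategy "
        "that avoids these failure modes."
    )

def positive_reflection_prompt(task: str, successful_strategy: str) -> str:
    """Hypothetical prompt asking for an alternative strategy inspired by
    a known success (positive reflection, for outcome diversity)."""
    return (
        f"Task: {task}\n"
        f"This strategy succeeded:\n- {successful_strategy}\n"
        "Propose a meaningfully different strategy that could also succeed."
    )

prompt = negative_reflection_prompt(
    "Open the settings app",
    ["Tap the same greyed-out button repeatedly"],
)
print(prompt)
```

In a full training loop, the returned prompt would be prepended to the next rollout's strategy-generation step, so the new (high-temperature) strategy is conditioned on the critique.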
Qualitative Analysis
SGE fosters exploration that yields genuinely distinct environment outcomes, rather than mere token-level variability over equivalent actions. For instance, in AndroidWorld UI tasks, SGE generates strategies that reflect on past failed button taps, guiding exploration toward effective actions: semantically novel attempts rather than merely spatially perturbed ones.
Implications and Future Directions
On a theoretical level, SGE provides evidence that RL for LLM agents can genuinely expand model capabilities in agentic multi-step settings, rather than merely polishing or re-ordering solutions the base model can already produce. By leveraging structured, high-level linguistic abstraction and reflection, SGE operationalizes a tractable form of exploration that is both efficient and practical.
Practically, SGE's approach offers a blueprint for RL-driven capability expansion in LLM agents destined for complex, open-ended environments (UI control, code, tools/APIs, robotics), provided the base LLM is strong enough for reasoning and strategy formation. The approach is lightweight to implement (pure prompting and sampling control), broadly compatible, and requires no external teachers or labeled strategies.
Limitations include increased inference cost from per-step strategy output and a dependence on model scale for the method to help. Extensions to non-agentic RL settings (e.g., math reasoning, single-shot QA) also require further investigation, given SGE's tight coupling with sequential, partially observable decision processes (POMDPs).
Conclusion
"Expanding LLM Agent Boundaries with Strategy-Guided Exploration" (2603.02045) offers a methodologically robust and empirically validated framework for solving the exploration bottleneck in RL fine-tuning of LLM agents for complex, sparse-reward tasks. By structuring exploration around high-level, diverse, and reflective strategy generation, SGE demonstrates marked advances in realized agentic capability, learning efficiency, and generalization, evidencing that strategic reasoning is a viable substrate for RL-driven improvement in LLM policy learning. Future work should target dynamic strategy scheduling, scaling to even longer-horizon tasks, and formalization of strategy abstraction for broader domains.