- The paper identifies key failure modes in LLMs, showing that RL fine-tuning effectively mitigates greediness, frequency bias, and the knowing-doing gap.
- The authors employ self-generated chain-of-thought rationales with RL fine-tuning to boost exploration and reduce cumulative regret in various decision-making tasks.
- Experiments demonstrate that enhanced generation budgets and explicit reward shaping further improve performance, underlining the importance of CoT in RLFT.
This paper investigates why LLMs often perform suboptimally in decision-making tasks despite the potential afforded by their pre-trained knowledge and reasoning capabilities such as Chain-of-Thought (CoT) prompting (Wei et al., 2022). The authors identify and systematically study three prevalent failure modes in small-to-medium-scale LLMs (Gemma2 2B, 9B, 27B):
- Greediness: LLMs tend to prematurely commit to the best-performing action seen so far, even when only a small portion of the action space has been explored. This leads to stagnating action coverage (up to 55% of actions left unexplored in multi-armed bandits (MABs)) and suboptimal cumulative regret (a coverage/regret measurement sketch follows this list). Larger models and CoT reasoning help but do not fully resolve this.
- Frequency Bias: Smaller LLMs (e.g., 2B) often copy the most frequent action present in their input context history, regardless of its associated reward. Larger models (e.g., 27B) largely overcome this bias but remain prone to greediness. This bias is suspected to be an artifact of supervised pre-training.
- Knowing-Doing Gap: LLMs can often correctly reason about or describe the optimal strategy (e.g., generate a correct CoT rationale for the UCB algorithm, 87% correct in experiments) but fail to translate this knowledge into corresponding actions (e.g., selecting a greedy action 58% of the time even with a correct rationale).
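The greediness and regret findings above reduce to tracking how many distinct arms an agent has tried and how much reward it forgoes relative to the best arm. Below is a minimal sketch of those two metrics on a Gaussian MAB; the `select_action` callable is a stand-in for the LLM agent and the greedy baseline is purely illustrative, not the paper's evaluation code.

```python
# Illustrative coverage/regret measurement on a Gaussian multi-armed bandit.
# `select_action` stands in for the LLM agent; a purely greedy policy is shown
# to illustrate stagnating action coverage.
import numpy as np

def run_bandit(select_action, n_arms=10, horizon=100, seed=0):
    rng = np.random.default_rng(seed)
    means = rng.normal(0.0, 1.0, size=n_arms)        # latent arm means
    history, regret = [], 0.0
    for _ in range(horizon):
        arm = select_action(history, n_arms)
        history.append((arm, rng.normal(means[arm], 1.0)))
        regret += means.max() - means[arm]            # cumulative (pseudo-)regret
    coverage = len({a for a, _ in history}) / n_arms  # fraction of arms ever tried
    return coverage, regret

def greedy(history, n_arms):
    # Always pick the empirically best arm so far; unseen arms default to 0.
    counts, est = np.zeros(n_arms), np.zeros(n_arms)
    for a, r in history:
        counts[a] += 1
        est[a] += (r - est[a]) / counts[a]            # incremental mean
    return int(np.argmax(est))

print(run_bandit(greedy))  # the greedy baseline typically stops exploring after a few arms
```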
To mitigate these shortcomings, the paper proposes Reinforcement Learning Fine-Tuning (RLFT) using self-generated CoT rationales. The core idea is to fine-tune the LLM policy $\pi_\theta$ using environment rewards obtained from interaction.
RLFT Implementation:
- Context: The model input $c_t$ concatenates the task-specific instructions ($c^{\text{in}}$), output-format instructions ($c^{\text{out}}$), and the recent interaction history ($c^{\tau}_{t-C:t}$, i.e., the states, actions, and rewards of the last $C$ steps).
- Action Generation: The model generates a sequence $z_t$ containing both the CoT rationale ($z^{\text{CoT}}_t$) and the actual action $a_t$. A parsing function $g(z_t)$ (based on regular expressions) extracts $a_t$ from a permissive output template (e.g., `ACTION=X`). A generation budget $G$ (default 256 tokens) limits the length of $z_t$.
- Reward Shaping: Besides the environment reward $r^{\text{env}}_t$, a penalty $r^{\text{valid}}_t$ (e.g., $-5$) is applied whenever $g(z_t)$ fails to extract a valid action from the generated sequence. Environment rewards are normalized. (A minimal sketch of this parsing and reward shaping, followed by the loss, appears after this list.)
- Objective: Fine-tuning uses a PPO-style clipping objective (Schulman et al., 2017) with a KL-divergence penalty against a reference policy $\pi_{\text{ref}}$ (the frozen pre-trained model) to maintain stability:

  $$\mathcal{L}(\theta) = \mathbb{E}\left[\min\left(\frac{\pi_\theta(z \mid c)}{\pi_{\theta_{\text{old}}}(z \mid c)}\, A^{\text{adv}},\ \operatorname{clip}_{\epsilon}\!\left(\frac{\pi_\theta(z \mid c)}{\pi_{\theta_{\text{old}}}(z \mid c)}\right) A^{\text{adv}}\right)\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot \mid c)\,\|\,\pi_{\text{ref}}(\cdot \mid c)\right)$$
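A minimal sketch of the context/parsing/reward-shaping plumbing described above. The function names (`build_context`, `parse_action`, `shaped_reward`), the history window default, and the normalization statistics are illustrative assumptions rather than the paper's implementation; the `ACTION=X` template and the $-5$ penalty follow the text.

```python
# Illustrative sketch (not the paper's code) of the RLFT input/output plumbing:
# context assembly, regex-based action parsing g(z_t), and reward shaping with
# the validity penalty.
import re

INVALID_ACTION_PENALTY = -5.0  # r^valid in the text

def build_context(task_instructions, format_instructions, history, window=50):
    """c_t: concatenate instructions with the last `window` (state, action, reward) steps."""
    lines = [task_instructions, format_instructions]
    for state, action, reward in history[-window:]:
        lines.append(f"state: {state} | action: {action} | reward: {reward}")
    return "\n".join(lines)

def parse_action(generation, valid_actions):
    """g(z_t): extract the action from the permissive 'ACTION=X' template."""
    match = re.search(r"ACTION\s*=\s*(\w+)", generation)
    if match and match.group(1) in valid_actions:
        return match.group(1)
    return None  # no valid action found

def shaped_reward(env_reward, action, reward_mean=0.0, reward_std=1.0):
    """Normalized environment reward, or the penalty if no valid action was parsed."""
    if action is None:
        return INVALID_ACTION_PENALTY
    return (env_reward - reward_mean) / max(reward_std, 1e-8)
```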
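And a sketch of the training objective: the PPO-style clipped surrogate plus the KL penalty toward the frozen reference model, written against sequence-level log-probabilities. Tensor shapes, the default $\epsilon$ and $\beta$ values, and the sample-based KL estimate are illustrative assumptions.

```python
# Illustrative clipped-surrogate loss with KL regularization toward pi_ref.
import torch

def rlft_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, kl_beta=0.05):
    """All inputs have shape [batch]; logp_* are log pi(z|c) summed over generated tokens."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.mean(torch.min(unclipped, clipped))  # maximize the clipped surrogate
    kl_estimate = torch.mean(logp_new - logp_ref)             # sample-based KL(pi_theta || pi_ref)
    return policy_loss + kl_beta * kl_estimate
```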
Experiments & Findings:
- Environments: Gaussian/Bernoulli MABs (5, 10, 20 arms), contextual bandits (MovieLens), and text-based Tic-tac-toe (Ruoss et al., 2024).
- RLFT Effectiveness: RLFT significantly improves decision-making performance across environments and model sizes (Gemma2 2B, 9B).
- It lowers cumulative regret compared to in-context learning (ICL) baselines.
- It mitigates greediness by increasing action coverage (+12% for 2B on 10-arm MABs).
- It counteracts frequency bias, reducing the selection of frequent suboptimal actions, although the bias isn't entirely eliminated at high repetition counts.
- Exploration Mechanisms: While RLFT improves exploration, it remains suboptimal compared to specialized algorithms like UCB. Various mechanisms were tested:
- Try-all: Initial exploration of all arms (like UCB) yielded significant gains, suggesting LLMs perform well if given sufficient information but struggle with exploration itself.
- Exploration Bonus: Simple reward shaping (+1 reward for untried actions during RLFT) significantly improved exploration and reduced regret, highlighting the importance of explicit rewards for desired behaviors (a minimal reward-shaping sketch follows the ablations below).
- Other methods (ϵ-greedy, self-consistency (Wang et al., 2022), self-correction (Kumar et al., 2024)) showed varied effects.
- Ablations:
- Tic-tac-toe: RLFT substantially increased win rates against random and MCTS opponents, demonstrating effectiveness in stateful environments. Providing legal actions in the prompt was crucial for high performance.
- Importance of CoT: RLFT without CoT performed poorly, barely matching ICL with CoT, confirming CoT's role as a vital mechanism for exploration and rationalization during RLFT.
- Supervised Fine-Tuning (SFT): SFT on expert UCB trajectories (Behavior Cloning: actions only; Thought Cloning: actions + CoT) achieved near-expert performance, showing the effectiveness of expert data when available (a UCB expert-policy sketch follows this list).
- "Thinking" Time: Increasing the generation budget G (e.g., from 256 to 512 tokens) improved performance, allowing the model more "time" to rationalize, but significantly increased computational cost due to longer rollouts in multi-step decision tasks.
Conclusion: The paper demonstrates that LLMs exhibit systematic failures (greediness, frequency bias, knowing-doing gap) in decision-making. RLFT on self-generated CoT rationales effectively mitigates these issues, enhancing exploration and overall performance. However, LLM exploration remains a challenge, often requiring explicit mechanisms or reward shaping for near-optimal behavior. The work underscores the importance of CoT and sufficient generation budget ("thinking time") for RLFT in decision-making contexts.