- The paper identifies key failure modes in LLMs, showing that RL fine-tuning effectively mitigates greediness, frequency bias, and the knowing-doing gap.
- The authors employ self-generated chain-of-thought rationales with RL fine-tuning to boost exploration and reduce cumulative regret in various decision-making tasks.
- Experiments demonstrate that enhanced generation budgets and explicit reward shaping further improve performance, underlining the importance of CoT in RLFT.
This paper investigates why LLMs often perform suboptimally in decision-making tasks despite the potential afforded by their pre-trained knowledge and reasoning capabilities such as Chain-of-Thought (CoT) prompting (Wei et al., 2022). The authors identify and systematically study three prevalent failure modes in small-to-medium-scale LLMs (Gemma2 2B, 9B, 27B):
- Greediness: LLMs tend to prematurely commit to the best-performing action seen so far, even when only a small portion of the action space has been explored. This leads to stagnating action coverage (up to 55% of actions left unexplored in multi-armed bandits (MABs)) and suboptimal cumulative regret (a coverage/regret measurement sketch follows this list). Larger models and CoT reasoning help but do not fully resolve this.
- Frequency Bias: Smaller LLMs (e.g., 2B) often copy the most frequent action present in their input context history, regardless of its associated reward. Larger models (e.g., 27B) largely overcome this bias but remain prone to greediness. This bias is suspected to be an artifact of supervised pre-training.
- Knowing-Doing Gap: LLMs can often correctly reason about or describe the optimal strategy (e.g., generate a correct CoT rationale for the UCB algorithm, 87% correct in experiments) but fail to translate this knowledge into corresponding actions (e.g., selecting a greedy action 58% of the time even with a correct rationale).
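The greediness and regret findings above reduce to tracking how many distinct arms an agent has tried and how much reward it forgoes relative to the best arm. Below is a minimal sketch of those two metrics on a Gaussian MAB; the `select_action` callable is a stand-in for the LLM agent and the greedy baseline is purely illustrative, not the paper's evaluation code.

```python
# Illustrative coverage/regret measurement on a Gaussian multi-armed bandit.
# `select_action` stands in for the LLM agent; a purely greedy policy is shown
# to illustrate stagnating action coverage.
import numpy as np

def run_bandit(select_action, n_arms=10, horizon=100, seed=0):
    rng = np.random.default_rng(seed)
    means = rng.normal(0.0, 1.0, size=n_arms)        # latent arm means
    history, regret = [], 0.0
    for _ in range(horizon):
        arm = select_action(history, n_arms)
        history.append((arm, rng.normal(means[arm], 1.0)))
        regret += means.max() - means[arm]            # cumulative (pseudo-)regret
    coverage = len({a for a, _ in history}) / n_arms  # fraction of arms ever tried
    return coverage, regret

def greedy(history, n_arms):
    # Always pick the empirically best arm so far; unseen arms default to 0.
    counts, est = np.zeros(n_arms), np.zeros(n_arms)
    for a, r in history:
        counts[a] += 1
        est[a] += (r - est[a]) / counts[a]            # incremental mean
    return int(np.argmax(est))

print(run_bandit(greedy))  # the greedy baseline typically stops exploring after a few arms
```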
To mitigate these shortcomings, the paper proposes Reinforcement Learning Fine-Tuning (RLFT) using self-generated CoT rationales. The core idea is to fine-tune the LLM policy $\pi_\theta$ using environment rewards obtained from interaction.
RLFT Implementation:
- Context: The model input $c_t$ concatenates the task-specific instructions ($c^{\text{in}}$), output-format instructions ($c^{\text{out}}$), and the recent interaction history ($c^{\tau}_{t-C:t}$, i.e., the states, actions, and rewards of the last $C$ steps).
- Action Generation: The model generates a sequence $z_t$ containing both the CoT rationale ($z^{\text{CoT}}_t$) and the actual action $a_t$. A parsing function $g(z_t)$ (based on regular expressions) extracts $a_t$ from a permissive output template (e.g., `ACTION=X`). A generation budget $G$ (default 256 tokens) limits the length of $z_t$.
- Reward Shaping: Besides the environment reward $r^{\text{env}}_t$, a penalty $r^{\text{valid}}_t$ (e.g., $-5$) is applied whenever $g(z_t)$ fails to extract a valid action from the generated sequence. Environment rewards are normalized. (A minimal sketch of this parsing and reward shaping, followed by the loss, appears after this list.)
- Objective: Fine-tuning uses a PPO-style clipping objective (Schulman et al., 2017) with a KL-divergence penalty against a reference policy $\pi_{\text{ref}}$ (the frozen pre-trained model) to maintain stability:

  $$\mathcal{L}(\theta) = \mathbb{E}\left[\min\left(\frac{\pi_\theta(z \mid c)}{\pi_{\theta_{\text{old}}}(z \mid c)}\, A^{\text{adv}},\ \operatorname{clip}_{\epsilon}\!\left(\frac{\pi_\theta(z \mid c)}{\pi_{\theta_{\text{old}}}(z \mid c)}\right) A^{\text{adv}}\right)\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot \mid c)\,\|\,\pi_{\text{ref}}(\cdot \mid c)\right)$$
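A minimal sketch of the context/parsing/reward-shaping plumbing described above. The function names (`build_context`, `parse_action`, `shaped_reward`), the history window default, and the normalization statistics are illustrative assumptions rather than the paper's implementation; the `ACTION=X` template and the $-5$ penalty follow the text.

```python
# Illustrative sketch (not the paper's code) of the RLFT input/output plumbing:
# context assembly, regex-based action parsing g(z_t), and reward shaping with
# the validity penalty.
import re

INVALID_ACTION_PENALTY = -5.0  # r^valid in the text

def build_context(task_instructions, format_instructions, history, window=50):
    """c_t: concatenate instructions with the last `window` (state, action, reward) steps."""
    lines = [task_instructions, format_instructions]
    for state, action, reward in history[-window:]:
        lines.append(f"state: {state} | action: {action} | reward: {reward}")
    return "\n".join(lines)

def parse_action(generation, valid_actions):
    """g(z_t): extract the action from the permissive 'ACTION=X' template."""
    match = re.search(r"ACTION\s*=\s*(\w+)", generation)
    if match and match.group(1) in valid_actions:
        return match.group(1)
    return None  # no valid action found

def shaped_reward(env_reward, action, reward_mean=0.0, reward_std=1.0):
    """Normalized environment reward, or the penalty if no valid action was parsed."""
    if action is None:
        return INVALID_ACTION_PENALTY
    return (env_reward - reward_mean) / max(reward_std, 1e-8)
```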
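And a sketch of the training objective: the PPO-style clipped surrogate plus the KL penalty toward the frozen reference model, written against sequence-level log-probabilities. Tensor shapes, the default $\epsilon$ and $\beta$ values, and the sample-based KL estimate are illustrative assumptions.

```python
# Illustrative clipped-surrogate loss with KL regularization toward pi_ref.
import torch

def rlft_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, kl_beta=0.05):
    """All inputs have shape [batch]; logp_* are log pi(z|c) summed over generated tokens."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.mean(torch.min(unclipped, clipped))  # maximize the clipped surrogate
    kl_estimate = torch.mean(logp_new - logp_ref)             # sample-based KL(pi_theta || pi_ref)
    return policy_loss + kl_beta * kl_estimate
```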
Experiments & Findings:
- Environments: Gaussian/Bernoulli MABs (5, 10, 20 arms), contextual bandits (MovieLens), and text-based Tic-tac-toe (Ruoss et al., 2024).
- RLFT Effectiveness: RLFT significantly improves decision-making performance across environments and model sizes (Gemma2 2B, 9B).
- It lowers cumulative regret compared to in-context learning (ICL) baselines.
- It mitigates greediness by increasing action coverage (+12% for 2B on 10-arm MABs).
- It counteracts frequency bias, reducing the selection of frequent suboptimal actions, although the bias isn't entirely eliminated at high repetition counts.
- Exploration Mechanisms: While RLFT improves exploration, it remains suboptimal compared to specialized algorithms like UCB. Various mechanisms were tested:
- Try-all: Initial exploration of all arms (like UCB) yielded significant gains, suggesting LLMs perform well if given sufficient information but struggle with exploration itself.
- Exploration Bonus: Simple reward shaping (+1 reward for untried actions during RLFT) significantly improved exploration and reduced regret, highlighting the importance of explicit rewards for desired behaviors (a minimal reward-shaping sketch follows the ablations below).
- Other methods (ϵ-greedy, self-consistency (Wang et al., 2022), self-correction (Kumar et al., 2024)) showed varied effects.
- Ablations:
- Tic-tac-toe: RLFT substantially increased win rates against random and MCTS opponents, demonstrating effectiveness in stateful environments. Providing legal actions in the prompt was crucial for high performance.
- Importance of CoT: RLFT without CoT performed poorly, barely matching ICL with CoT, confirming CoT's role as a vital mechanism for exploration and rationalization during RLFT.
- Supervised Fine-Tuning (SFT): SFT on expert UCB trajectories (Behavior Cloning: actions only; Thought Cloning: actions + CoT) achieved near-expert performance, showing the effectiveness of expert data when available (a UCB expert-policy sketch follows this list).
- "Thinking" Time: Increasing the generation budget G (e.g., from 256 to 512 tokens) improved performance, allowing the model more "time" to rationalize, but significantly increased computational cost due to longer rollouts in multi-step decision tasks.
Conclusion: The paper demonstrates that LLMs exhibit systematic failures (greediness, frequency bias, knowing-doing gap) in decision-making. RLFT on self-generated CoT rationales effectively mitigates these issues, enhancing exploration and overall performance. However, LLM exploration remains a challenge, often requiring explicit mechanisms or reward shaping for near-optimal behavior. The work underscores the importance of CoT and sufficient generation budget ("thinking time") for RLFT in decision-making contexts.