- The paper presents a novel mutual reasoning approach (rStar) that significantly boosts small LLM performance without requiring fine-tuning.
- It employs a two-phase generation-discrimination method that uses MCTS-based reasoning and unsupervised feedback to validate solution trajectories.
- Empirical evaluations on GSM8K, MATH, and other benchmarks demonstrate large accuracy gains (for example, GSM8K accuracy rises from 12.51% to 63.91% for LLaMA2-7B, and reaches 91.13% for LLaMA3-8B-Instruct), highlighting its scalability and effectiveness.
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
Introduction
The paper "Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers" presents a methodological advance aimed at enhancing the reasoning capabilities of small language models (SLMs) without fine-tuning or assistance from stronger models. The proposed solution, termed rStar, is a self-play mutual reasoning framework: a generation-discrimination process in which SLMs collaboratively generate and verify reasoning trajectories. Remarkably, this method yields significant improvements across various datasets, achieving accuracy comparable to specialized fine-tuned models.
Methodology
The core innovation of rStar lies in its two-phase approach:
- Generation: Utilizing a Monte Carlo Tree Search (MCTS) algorithm augmented with a rich set of human-like reasoning actions.
- Discrimination: Incorporating another SLM to act as a discriminator, providing unsupervised feedback on the generated solutions.
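The two-phase loop can be sketched as follows. The callables `generate_trajectories` and `verify` are hypothetical stand-ins for the generator's MCTS rollouts and the discriminator SLM, not the paper's actual API:

```python
# Toy sketch of rStar's generate-then-discriminate loop. The two callables
# are stand-ins for the generator's MCTS rollouts and the discriminator SLM;
# the function names are illustrative, not the paper's API.
from typing import Callable, List

def solve(question: str,
          generate_trajectories: Callable[[str], List[List[str]]],
          verify: Callable[[str, List[str]], bool]) -> List[str]:
    """Return the first candidate trajectory the discriminator agrees with."""
    candidates = generate_trajectories(question)   # phase 1: MCTS-style generation
    for traj in candidates:                        # phase 2: unsupervised verification
        if verify(question, traj):
            return traj
    return candidates[0] if candidates else []     # fall back to the top candidate

# Usage with stub models
gen = lambda q: [["step: 2+2=5", "answer: 5"], ["step: 2+2=4", "answer: 4"]]
ver = lambda q, t: t[-1] == "answer: 4"
print(solve("What is 2+2?", gen, ver))  # → ['step: 2+2=4', 'answer: 4']
```

In the real system both callables would be backed by SLM inference; the control flow above only illustrates how verification filters generated candidates.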
Generation Process
In the generation phase, the paper proposes augmenting the MCTS with diverse human-like reasoning actions, which include:
- Proposing a one-step thought: Prompting the model to generate only the next reasoning step.
- Proposing the remaining thought steps: Generating all subsequent steps at once, akin to Chain-of-Thought (CoT) prompting.
- Proposing sub-questions and their answers: Breaking down problems into simpler sub-questions, inspired by the least-to-most prompting technique.
- Re-answering sub-questions: Revisiting and verifying sub-questions using few-shot CoT prompting.
- Rephrasing questions/sub-questions: Revisiting the problem statements to highlight conditions and context clearly.
These actions enable a broad, human-like exploration of the solution space, increasing the likelihood that at least one rollout follows a correct reasoning trajectory.
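A minimal sketch of how these five actions could form an MCTS action space, using standard UCT selection to choose which action to expand next. The node statistics and scoring machinery here are illustrative assumptions, not the released implementation:

```python
# Sketch of the five reasoning actions as an MCTS action space, with UCT
# selection. Action names mirror the paper's descriptions; the statistics
# and scoring are illustrative assumptions, not the actual implementation.
import math
from enum import Enum, auto

class Action(Enum):
    PROPOSE_ONE_STEP = auto()        # generate only the next step
    PROPOSE_REMAINING = auto()       # finish the solution CoT-style
    PROPOSE_SUBQUESTION = auto()     # least-to-most decomposition
    REANSWER_SUBQUESTION = auto()    # re-answer a sub-question with few-shot CoT
    REPHRASE_QUESTION = auto()       # restate the question's conditions clearly

def uct(value_sum: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT score used to pick which child action to expand."""
    if visits == 0:
        return float("inf")  # always try unvisited actions first
    return value_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Pick the most promising action given toy statistics: (value_sum, visits)
stats = {Action.PROPOSE_ONE_STEP: (3.0, 4),
         Action.PROPOSE_SUBQUESTION: (2.5, 3),
         Action.REPHRASE_QUESTION: (0.0, 0)}
best = max(stats, key=lambda a: uct(*stats[a], parent_visits=7))
print(best)  # → Action.REPHRASE_QUESTION (unvisited actions win first)
```

The exploration constant `c` and the toy visit counts are placeholders; the point is that each tree node chooses among the five action types rather than a single fixed prompting strategy.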
Discrimination Process
In the discrimination phase, the emphasis is on mutual reasoning consistency. Another SLM of similar capabilities verifies each trajectory by attempting to derive the solution based on partially known steps provided by the generator SLM. This step mirrors peer review mechanisms where consistency in derived outcomes suggests correctness.
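This consistency check can be sketched as follows, with a hypothetical `complete` callable standing in for the discriminator SLM:

```python
# Sketch of mutual-consistency verification: hide the tail of a candidate
# trajectory, let a second model complete it from the visible prefix, and
# accept only if both reach the same final answer. `complete` is a toy
# stand-in for the discriminator SLM, not the paper's API.
from typing import Callable, List, Optional

def mutually_consistent(trajectory: List[str],
                        complete: Callable[[List[str]], str],
                        mask_from: Optional[int] = None) -> bool:
    """True if a peer model, shown only a prefix, derives the same final answer."""
    cut = mask_from if mask_from is not None else len(trajectory) // 2
    prefix = trajectory[:cut]          # partial steps shown to the peer
    peer_answer = complete(prefix)     # peer finishes the reasoning independently
    return peer_answer == trajectory[-1]  # agreement suggests correctness

# Toy discriminator that actually does the arithmetic for the stated question
traj = ["q: 3*4", "step: 3*4 = 12", "12"]
peer = lambda prefix: str(eval(prefix[0].split("q: ")[1]))
print(mutually_consistent(traj, peer))  # → True
```

Masking at the halfway point is an arbitrary choice for this sketch; the key idea is that agreement between two independently completed trajectories serves as an unsupervised correctness signal.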
Empirical Evaluation
Extensive experiments were conducted across various datasets such as GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA using multiple SLMs including LLaMA2-7B, Mistral-7B, and Phi3-mini-4k. Notable improvements were observed:
- GSM8K accuracy improved from 12.51% to 63.91% for LLaMA2-7B, and up to 91.13% for LLaMA3-8B-Instruct.
- GSM-Hard and MATH benchmarks also showed strong gains, with accuracy improvements of up to 37.91% and 42.94%, respectively.
Implications
The practical and theoretical implications of this work are significant. Practically, it offers a path to effective reasoning in SLMs, which are more accessible and less resource-intensive than large LLMs. Theoretically, it challenges the assumption that large models are a prerequisite for advanced reasoning tasks, opening the door to more scalable approaches that achieve comparable performance.
Future Directions
Future research could explore optimizing the mutual reasoning approach, possibly by exploring different model architectures or further refining the action space and discrimination feedback mechanisms. Integrating more sophisticated feedback loops or peer review mechanisms inspired by human collaboration may also offer new insights.
In conclusion, the rStar approach showcased in this paper proves to be a significant stride in AI research, enhancing the reasoning prowess of SLMs without extensive resource dependencies. It sets a promising foundation for future advancements in model scalability and efficiency.