- The paper presents a novel mutual reasoning approach (rStar) that significantly boosts small LLM performance without requiring fine-tuning.
- It employs a two-phase generation-discrimination method that uses MCTS-based reasoning and unsupervised feedback to validate solution trajectories.
- Empirical evaluations on GSM8K, MATH, and other benchmarks demonstrate large accuracy gains (for example, GSM8K accuracy rises from 12.51% to 63.91% for LLaMA2-7B, and reaches 91.13% for LLaMA3-8B-Instruct), highlighting its scalability and effectiveness.
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
Introduction
The paper "Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers" presents a methodological advance aimed at enhancing the reasoning capabilities of small language models (SLMs) without fine-tuning or assistance from stronger models. The proposed solution, termed rStar, is a self-play mutual reasoning framework: a generation-discrimination process in which SLMs collaboratively generate and verify reasoning trajectories. Remarkably, this method yields significant improvements across various datasets, achieving accuracy comparable to specialized fine-tuned models.
Methodology
The core innovation of rStar lies in its two-phase approach:
- Generation: Utilizing a Monte Carlo Tree Search (MCTS) algorithm augmented with a rich set of human-like reasoning actions.
- Discrimination: Incorporating another SLM to act as a discriminator, providing unsupervised feedback on the generated solutions.
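The two-phase loop can be sketched as follows. The callables `generate_trajectories` and `verify` are hypothetical stand-ins for the generator's MCTS rollouts and the discriminator SLM, not the paper's actual API:

```python
# Toy sketch of rStar's generate-then-discriminate loop. The two callables
# are stand-ins for the generator's MCTS rollouts and the discriminator SLM;
# the function names are illustrative, not the paper's API.
from typing import Callable, List

def solve(question: str,
          generate_trajectories: Callable[[str], List[List[str]]],
          verify: Callable[[str, List[str]], bool]) -> List[str]:
    """Return the first candidate trajectory the discriminator agrees with."""
    candidates = generate_trajectories(question)   # phase 1: MCTS-style generation
    for traj in candidates:                        # phase 2: unsupervised verification
        if verify(question, traj):
            return traj
    return candidates[0] if candidates else []     # fall back to the top candidate

# Usage with stub models
gen = lambda q: [["step: 2+2=5", "answer: 5"], ["step: 2+2=4", "answer: 4"]]
ver = lambda q, t: t[-1] == "answer: 4"
print(solve("What is 2+2?", gen, ver))  # → ['step: 2+2=4', 'answer: 4']
```

In the real system both callables would be backed by SLM inference; the control flow above only illustrates how verification filters generated candidates.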
Generation Process
In the generation phase, the paper proposes augmenting the MCTS with diverse human-like reasoning actions, which include:
- Proposing a one-step thought: Prompting the model to generate only the next reasoning step.
- Proposing the remaining thought steps: Generating all subsequent steps at once, akin to Chain-of-Thought (CoT) prompting.
- Proposing sub-questions and their answers: Breaking down problems into simpler sub-questions, inspired by the least-to-most prompting technique.
- Re-answering sub-questions: Revisiting and verifying sub-questions using few-shot CoT prompting.
- Rephrasing questions/sub-questions: Revisiting the problem statements to highlight conditions and context clearly.
These actions enable a broad, human-like exploration of the solution space, increasing the likelihood that at least one rollout follows a correct reasoning trajectory.
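A minimal sketch of how these five actions could form an MCTS action space, using standard UCT selection to choose which action to expand next. The node statistics and scoring machinery here are illustrative assumptions, not the released implementation:

```python
# Sketch of the five reasoning actions as an MCTS action space, with UCT
# selection. Action names mirror the paper's descriptions; the statistics
# and scoring are illustrative assumptions, not the actual implementation.
import math
from enum import Enum, auto

class Action(Enum):
    PROPOSE_ONE_STEP = auto()        # generate only the next step
    PROPOSE_REMAINING = auto()       # finish the solution CoT-style
    PROPOSE_SUBQUESTION = auto()     # least-to-most decomposition
    REANSWER_SUBQUESTION = auto()    # re-answer a sub-question with few-shot CoT
    REPHRASE_QUESTION = auto()       # restate the question's conditions clearly

def uct(value_sum: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT score used to pick which child action to expand."""
    if visits == 0:
        return float("inf")  # always try unvisited actions first
    return value_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Pick the most promising action given toy statistics: (value_sum, visits)
stats = {Action.PROPOSE_ONE_STEP: (3.0, 4),
         Action.PROPOSE_SUBQUESTION: (2.5, 3),
         Action.REPHRASE_QUESTION: (0.0, 0)}
best = max(stats, key=lambda a: uct(*stats[a], parent_visits=7))
print(best)  # → Action.REPHRASE_QUESTION (unvisited actions win first)
```

The exploration constant `c` and the toy visit counts are placeholders; the point is that each tree node chooses among the five action types rather than a single fixed prompting strategy.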
Discrimination Process
In the discrimination phase, the emphasis is on mutual reasoning consistency. Another SLM of similar capabilities verifies each trajectory by attempting to derive the solution based on partially known steps provided by the generator SLM. This step mirrors peer review mechanisms where consistency in derived outcomes suggests correctness.
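This consistency check can be sketched as follows, with a hypothetical `complete` callable standing in for the discriminator SLM:

```python
# Sketch of mutual-consistency verification: hide the tail of a candidate
# trajectory, let a second model complete it from the visible prefix, and
# accept only if both reach the same final answer. `complete` is a toy
# stand-in for the discriminator SLM, not the paper's API.
from typing import Callable, List, Optional

def mutually_consistent(trajectory: List[str],
                        complete: Callable[[List[str]], str],
                        mask_from: Optional[int] = None) -> bool:
    """True if a peer model, shown only a prefix, derives the same final answer."""
    cut = mask_from if mask_from is not None else len(trajectory) // 2
    prefix = trajectory[:cut]          # partial steps shown to the peer
    peer_answer = complete(prefix)     # peer finishes the reasoning independently
    return peer_answer == trajectory[-1]  # agreement suggests correctness

# Toy discriminator that actually does the arithmetic for the stated question
traj = ["q: 3*4", "step: 3*4 = 12", "12"]
peer = lambda prefix: str(eval(prefix[0].split("q: ")[1]))
print(mutually_consistent(traj, peer))  # → True
```

Masking at the halfway point is an arbitrary choice for this sketch; the key idea is that agreement between two independently completed trajectories serves as an unsupervised correctness signal.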
Empirical Evaluation
Extensive experiments were conducted across various datasets such as GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA using multiple SLMs including LLaMA2-7B, Mistral-7B, and Phi3-mini-4k. Notable improvements were observed:
- GSM8K accuracy improved from 12.51% to 63.91% for LLaMA2-7B, and up to 91.13% for LLaMA3-8B-Instruct.
- GSM-Hard and MATH benchmarks also showed strong gains, with accuracy improvements of up to 37.91% and 42.94%, respectively.
Implications
The practical and theoretical implications of this work are significant. Practically, it offers a path to effective reasoning in SLMs, which are more accessible and less resource-intensive than large LLMs. Theoretically, it challenges the assumption that large models are a prerequisite for advanced reasoning tasks, opening the door to more scalable approaches that achieve comparable performance.
Future Directions
Future research could explore optimizing the mutual reasoning approach, possibly by exploring different model architectures or further refining the action space and discrimination feedback mechanisms. Integrating more sophisticated feedback loops or peer review mechanisms inspired by human collaboration may also offer new insights.
In conclusion, the rStar approach showcased in this paper proves to be a significant stride in AI research, enhancing the reasoning prowess of SLMs without extensive resource dependencies. It sets a promising foundation for future advancements in model scalability and efficiency.