MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs

Published 20 Jun 2024 in cs.CL and cs.AI | (2406.13975v3)

Abstract: LLMs have shown increasing capability in problem-solving and decision-making, largely based on the step-by-step chain-of-thought reasoning processes. However, evaluating these reasoning abilities has become increasingly challenging. Existing outcome-based benchmarks are beginning to saturate, becoming less effective in tracking meaningful progress. To address this, we present a process-based benchmark MR-Ben that demands a meta-reasoning skill, where LMs are asked to locate and analyse potential errors in automatically generated reasoning steps. Our meta-reasoning paradigm is especially suited for system-2 slow thinking, mirroring the human cognitive process of carefully examining assumptions, conditions, calculations, and logic to identify mistakes. MR-Ben comprises 5,975 questions curated by human experts across a wide range of subjects, including physics, chemistry, logic, coding, and more. Through our designed metrics for assessing meta-reasoning on this benchmark, we identify interesting limitations and weaknesses of current LLMs (open-source and closed-source models). For example, with models like the o1 series from OpenAI demonstrating strong performance by effectively scrutinizing the solution space, many other state-of-the-art models fall significantly behind on MR-Ben, exposing potential shortcomings in their training strategies and inference methodologies.

Summary

The paper presents MR-Ben, a benchmark that evaluates System-2 thinking by asking LLMs to detect and explain errors in generated reasoning steps.
It draws on expert-annotated solutions for 5,975 questions across subjects and scores models with metrics such as step accuracy and reason accuracy.
Results show that models that do well on outcome-based benchmarks can still struggle to locate and explain reasoning errors, which points to where future models need to improve.

Introduction to MR-Ben Benchmark

The paper presents MR-Ben, a meta-reasoning benchmark designed for evaluating System-2 thinking capabilities in LLMs. Traditional benchmarks have primarily focused on outcome-based evaluations, which often overlook the quality of the reasoning processes that lead to those outcomes. MR-Ben aims to address this gap by providing a more comprehensive framework that evaluates the meta-reasoning skills of LLMs. Specifically, MR-Ben challenges LLMs to locate and analyze potential errors in automatically generated reasoning steps across various subjects.

Figure 1: Overview of the evaluation paradigm and representative examples in MR-Ben, illustrating the Chain-of-Thought and error analysis process.

Dataset Construction and Annotation

MR-Ben comprises 5,975 questions across diverse domains such as physics, chemistry, logic, coding, and more. The dataset includes manually annotated solutions from experts, detailing solution correctness, the first error step, and reasons for errors. The dataset is designed to cover arithmetic, logical, and algorithmic reasoning types, offering a rich source for evaluating LLMs' capabilities in identifying and explaining reasoning errors.

The dataset construction involves a rigorous annotation process where domain experts assess the solution correctness and provide detailed analyses of the reasoning errors and their corrections. This ensures the high quality and reliability of the benchmark.
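As a rough illustration of what one annotated item might hold, the sketch below models the three annotation fields named above (solution correctness, first error step, error reason) plus a correction. The class name, field names, and the 1-based step index are assumptions made for this example; they are not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnnotatedSolution:
    """Hypothetical record for one expert-annotated solution (not the official schema)."""

    question: str
    subject: str
    steps: List[str]                          # generated reasoning steps, in order
    is_correct: bool                          # expert verdict on the whole solution
    first_error_step: Optional[int] = None    # 1-based index (assumed); None if correct
    error_reason: Optional[str] = None        # expert explanation of the first error
    correction: Optional[str] = None          # expert-provided fix for that step

    def __post_init__(self) -> None:
        # A correct solution must not carry an error location.
        if self.is_correct and self.first_error_step is not None:
            raise ValueError("correct solutions have no first error step")
        # An incorrect solution must point at one of its own steps.
        if not self.is_correct:
            step = self.first_error_step
            if step is None or not 1 <= step <= len(self.steps):
                raise ValueError("first_error_step must index an existing step")


example = AnnotatedSolution(
    question="What is 12 * 11?",
    subject="arithmetic",
    steps=["12 * 11 = 12 * 10 + 12", "12 * 10 = 120", "120 + 12 = 142"],
    is_correct=False,
    first_error_step=3,
    error_reason="120 + 12 is 132, not 142.",
    correction="120 + 12 = 132",
)
```

Validating the record on construction keeps inconsistent annotations (for example, a solution marked correct that still has an error step) from reaching the scoring code.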

Figure 2: Dataset creation pipeline of MR-Ben, highlighting question compilation and solution annotation by domain experts.

Evaluation Methodology

MR-Ben's evaluation framework goes beyond final-answer accuracy. It combines three metrics into a single MR-Score: the Matthews Correlation Coefficient for the model's correct/incorrect verdict on each solution, step accuracy for locating the first error step, and reason accuracy for explaining the error. Together these metrics assess how well an LLM can detect, locate, and explain reasoning errors.
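The summary does not spell out how the three metrics are combined, so the following is only a minimal sketch that assumes a weighted sum. The weights `(0.2, 0.3, 0.5)` and the clipping of negative MCC values to zero are illustrative assumptions, and the function names are hypothetical; the Matthews Correlation Coefficient itself follows its standard binary definition.

```python
import math
from typing import Sequence, Tuple


def matthews_corrcoef(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Standard binary MCC; returns 0.0 when the denominator is zero."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def mr_score(
    mcc: float,
    step_acc: float,
    reason_acc: float,
    weights: Tuple[float, float, float] = (0.2, 0.3, 0.5),  # assumed, not from the summary
) -> float:
    """Weighted combination of the three metrics (assumed form)."""
    w_mcc, w_step, w_reason = weights
    # Assumption: a negative MCC (worse than chance) contributes nothing.
    return w_mcc * max(0.0, mcc) + w_step * step_acc + w_reason * reason_acc


# Gold verdicts (True = solution is correct) versus a model's verdicts.
gold = [True, True, False, False]
pred = [True, False, False, False]
mcc = matthews_corrcoef(gold, pred)
score = mr_score(mcc, step_acc=0.4, reason_acc=0.2)
```

MCC is a natural fit for the verdict task because it stays near zero for a model that labels every solution the same way, whereas plain accuracy would reward that behaviour on an imbalanced set.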

The evaluation framework is designed to challenge LLMs to act as a teacher, evaluating the reasoning process by assessing correctness, analyzing errors, and providing corrections.
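One way to picture the "teacher" setup is to grade each model verdict against the expert annotation. The sketch below assumes that step accuracy and reason accuracy are computed only over solutions the experts marked incorrect, and that an external judge (human or model) has already decided whether each explanation is acceptable. The type names, the `reason_accepted` flag, and the rule that a reason only counts when the step was located correctly are all assumptions for illustration.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Gold(NamedTuple):
    is_correct: bool
    first_error_step: Optional[int]   # None when the solution is correct


class Verdict(NamedTuple):
    is_correct: bool                  # the model's judgment of the solution
    first_error_step: Optional[int]   # where the model says the first error is
    reason_accepted: bool = False     # set by an external judge (assumed)


def step_and_reason_accuracy(
    golds: Sequence[Gold], verdicts: Sequence[Verdict]
) -> Tuple[float, float]:
    """Return (step accuracy, reason accuracy) over the incorrect solutions."""
    faulty = [(g, v) for g, v in zip(golds, verdicts) if not g.is_correct]
    if not faulty:
        return 0.0, 0.0
    # A step hit needs both an "incorrect" verdict and the right first error step.
    step_hits = [
        (not v.is_correct) and v.first_error_step == g.first_error_step
        for g, v in faulty
    ]
    # A reason hit additionally needs the explanation to be accepted by the judge.
    reason_hits = [hit and v.reason_accepted for hit, (_, v) in zip(step_hits, faulty)]
    return sum(step_hits) / len(faulty), sum(reason_hits) / len(faulty)


golds = [Gold(False, 3), Gold(False, 2), Gold(True, None), Gold(False, 1)]
verdicts = [
    Verdict(False, 3, True),    # right step, accepted reason
    Verdict(False, 4, True),    # wrong step, so the reason does not count
    Verdict(True, None),        # correct solution, excluded from both metrics
    Verdict(False, 1, False),   # right step, rejected reason
]
step_acc, reason_acc = step_and_reason_accuracy(golds, verdicts)
```

Under these assumptions reason accuracy can never exceed step accuracy, which matches the intuition that explaining an error presupposes finding it.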

Figure 3: Model performance across subjects, showcasing the comprehensive evaluation of reasoning abilities.

Experimental Results

The paper evaluates the performance of various LLMs on MR-Ben, revealing distinct limitations in their reasoning abilities. Open-source models, despite performing comparably to GPT-4 on outcome-based benchmarks, show significant gaps in reasoning processes when evaluated with MR-Ben.

Strengths and weaknesses across different domains are highlighted, indicating areas where LLMs excel and where they struggle. Techniques such as leveraging synthetic data are discussed as potential pathways to improve reasoning capabilities.

Figure 4: MR-Scores of different models on different levels of difficulty, illustrating the variability in performance.

Implications and Future Directions

MR-Ben provides valuable insights into the reasoning capabilities of current LLMs and opens up several avenues for future research. The benchmark is expected to guide researchers in developing models with better reasoning abilities that can understand and correct errors more effectively.

The findings suggest that enhancing LLMs' reasoning capabilities requires a focus on improving meta-reasoning skills. This can be achieved through advanced training methods and data synthesis techniques that enhance understanding of the reasoning process.

Conclusion

MR-Ben represents a significant advancement in the evaluation of reasoning capabilities in LLMs. By focusing on meta-reasoning, it provides a more nuanced understanding of LLMs' cognitive processes, highlighting areas for improvement and offering pathways for the development of more sophisticated AI reasoning frameworks.

The benchmark gives researchers and developers a tool for improving the decision-making and problem-solving abilities of LLMs, and in turn for building AI systems that handle more complex reasoning tasks.
