Learning from Peers in Reasoning Models

Published 12 May 2025 in cs.CL | (2505.07787v1)

Abstract: Large Reasoning Models (LRMs) have the ability to self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but poor beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose Learning from Peers (LeaP) to address this phenomenon. Specifically, every tokens, each reasoning path summarizes its intermediate reasoning and shares it with others through a routing mechanism, enabling paths to incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them into our LeaP-T model series. Experiments on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond show that LeaP provides substantial improvements. For instance, QwQ-32B with LeaP achieves nearly 5 absolute points higher than the baseline on average, and surpasses DeepSeek-R1-671B on three math benchmarks with an average gain of 3.3 points. Notably, our fine-tuned LeaP-T-7B matches the performance of DeepSeek-R1-Distill-Qwen-14B on AIME 2024. In-depth analysis reveals LeaP's robust error correction by timely peer insights, showing strong error tolerance and handling varied task difficulty. LeaP marks a milestone by enabling LRMs to collaborate during reasoning. Our code, datasets, and models are available at https://learning-from-peers.github.io/ .

Abstract PDF Upgrade to Chat

Summary

The paper addresses the Prefix Dominance Trap in LRMs by proposing a method for self-correction through peer interactions.
It introduces the LeaP framework, which integrates periodic cross-path summarization using clustered, dispersed, and hybrid routing strategies.
Experimental results on benchmarks like AIME and GPQA demonstrate that LeaP significantly improves reasoning performance and token efficiency.

Learning from Peers in Reasoning Models

Introduction and Motivation

The paper "Learning from Peers in Reasoning Models" addresses a notable limitation in Large Reasoning Models (LRMs): their inadequate recovery ability from initial reasoning errors, termed the "Prefix Dominance Trap." This phenomenon occurs when LRMs start with a poor beginning, significantly degrading their subsequent reasoning paths. The authors identify that peer interaction, inspired by psychological findings, could effectively enhance self-correction capabilities in reasoning models without disrupting correct outputs. The proposed method, Learning from Peers (LeaP), introduces a novel mechanism allowing LRMs to share and incorporate peer insights during inference, thus targeting improvements in the model's reasoning proficiency.

Methodology: LeaP Framework

The LeaP framework facilitates cross-path interaction among LRMs through a two-stage process integrated into reasoning paths. At regular intervals, each reasoning path summarizes its conclusions and shares them via a routing mechanism, enabling other paths to leverage diverse insights. The routing mechanism is designed with multiple strategies—clustered, dispersed, and hybrid—to select the most relevant peer summaries for refining the reasoning paths.

Figure 1: The illustration of (a) Independent Reasoning and (b) the proposed method Learning from Peers (LeaP). In independent reasoning, multiple paths are generated independently in parallel. In contrast, LeaP inserts a LeaP block into reasoning path, encouraging the model to learn from peers.

Results and Analysis

Experiments conducted across various benchmarks—AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond—demonstrate substantial performance improvements with LeaP. Notably, QwQ-32B with LeaP shows a remarkable gain, surpassing advanced models like DeepSeek-R1-Distill-Qwen-14B in certain tasks. The fine-tuned LeaP-T model series further reinforces these improvements, indicating refined self-verification initiated through peer insights.

Figure 2: An example of how LeaP enables communication between path $i$ and $j$ .

By systematically evaluating various aspects of LeaP's implementation—such as communication granularity, traffic, and position—the paper identifies optimal configurations that maximize reasoning accuracy while maintaining token efficiency. Additionally, LeaP's robustness remains evident across different difficulty levels, demonstrating its capability to aid the models in solving complex tasks.

Implications and Future Directions

The introduction of peer learning in LRMs through LeaP has practical implications for enhancing model accuracy and efficiency during inference. Not only does this approach alleviate the Prefix Dominance Trap, but it also lays foundational work for broader applications in collaborative problem-solving within AI domains. Future directions include extending peer learning frameworks to reinforcement learning settings and leveraging specialized expertise across diverse domains to further amplify the reasoning capabilities of LRMs.

Conclusion

The "Learning from Peers in Reasoning Models" paper effectively addresses key limitations in LRMs by incorporating peer-based interaction, thereby enhancing their self-correction capabilities significantly. Through extensive analyses and experiments, the research validates the effectiveness of LeaP and illustrates promising future directions for improving reasoning models across various AI applications. This work emphasizes the potential of collaborative reasoning in advancing AI towards more reliable and human-aligned systems.