Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning

Published 23 May 2025 in cs.CL and cs.AI | (2505.17813v1)

Abstract: Reasoning LLMs heavily rely on scaling test-time compute to perform complex reasoning tasks by generating extensive "thinking" chains. While demonstrating impressive results, this approach incurs significant computational costs and inference time. In this work, we challenge the assumption that long thinking chains result in better reasoning capabilities. We first demonstrate that shorter reasoning chains within individual questions are significantly more likely to yield correct answers - up to 34.5% more accurate than the longest chain sampled for the same question. Based on these results, we suggest short-m@k, a novel reasoning LLM inference method. Our method executes k independent generations in parallel and halts computation once the first m thinking processes are done. The final answer is chosen using majority voting among these m chains. Basic short-1@k demonstrates similar or even superior performance over standard majority voting in low-compute settings - using up to 40% fewer thinking tokens. short-3@k, while slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets, while still being substantially faster (up to 33% wall time reduction). Inspired by our results, we finetune an LLM using short, long, and randomly selected reasoning chains. We then observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current methods of test-time compute in reasoning LLMs, emphasizing that longer "thinking" does not necessarily translate to improved performance and can, counter-intuitively, lead to degraded results.

Summary

  • The paper introduces the short-m@k inference method, showing that the shortest reasoning chains can be up to 34.5% more accurate than the longest chains sampled for the same question.
  • It demonstrates a reduction in token usage by approximately 50% and decreases inference wall time by up to 33% compared to longer chains.
  • Fine-tuning LLMs on shorter reasoning data (S1-short) yields enhanced performance, underscoring the benefits of efficient chain design.

Preferring Shorter Thinking Chains for Improved LLM Reasoning

Introduction

LLMs are often employed in complex reasoning tasks, where the prevailing methodology leverages extensive chains of thought (CoT) to achieve higher accuracy in problem-solving. This practice, although effective, is computationally expensive due to the increased inference time required for generating these long thinking chains. The paper "Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning" challenges the conventional wisdom that longer thinking chains lead to improved reasoning capabilities. Through empirical evaluation across multiple reasoning LLMs and mathematical benchmarks, the study uncovers that shorter reasoning chains can surpass longer ones in accuracy, suggesting a paradigm shift in the approach to LLM reasoning.

Methodology

The research introduces a novel inference method named short-m@k. This method executes k parallel generations and halts computation once the first m thinking trajectories complete. The final answer is determined by majority voting among these m chains, which effectively reduces computational cost and inference time. The performance of LLMs on several competitive benchmarks is analyzed by comparing the accuracy of different inference strategies, including standard majority voting and the proposed short-m@k (Figure 1).

Figure 1: Visual comparison between majority voting and our proposed method short-m@k with m=1.
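
The selection-and-vote step of short-m@k can be sketched in a few lines. The snippet below simulates parallel decoding with precomputed finish times, so the function name, data format, and numbers are illustrative assumptions rather than the paper's implementation.

```python
from collections import Counter

def short_m_at_k(generations, m):
    """Sketch of short-m@k: given k generations that finish at different
    times, keep only the first m to complete and majority-vote their answers.

    `generations` is a list of (finish_time, answer) pairs standing in for
    k parallel sampling runs; in a real system the remaining k - m runs
    would be cancelled as soon as the m-th one finishes.
    """
    # Take the m generations whose thinking chains finished first.
    first_m = sorted(generations, key=lambda g: g[0])[:m]
    answers = [answer for _, answer in first_m]
    # Majority vote among the m early finishers; on a tie, Counter's
    # insertion order makes the shortest chain's answer win.
    return Counter(answers).most_common(1)[0][0]

# Hypothetical outputs of k = 5 parallel runs: (thinking tokens, answer).
runs = [(1200, "42"), (800, "42"), (3100, "17"), (950, "42"), (2700, "17")]
print(short_m_at_k(runs, m=1))  # shortest chain only -> "42"
print(short_m_at_k(runs, m=3))  # vote among the three shortest -> "42"
```

In a real serving stack the k generations would run concurrently, and cancelling the remaining k - m requests once the m-th chain finishes is where the token and wall-time savings come from.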

Experiments and Results

Shorter Chains vs. Longer Chains

The study evaluates three top-tier reasoning LLMs across three complex mathematical benchmarks: AIME 2024, AIME 2025, and HMMT February 2025. It is observed that selecting the shortest reasoning chain yields a significant improvement in accuracy, with increases reaching up to 34.5% compared to the longest chain for the same question. Additionally, this approach naturally reduces the number of tokens by about 50%, demonstrating both performance and efficiency benefits.
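
The per-question comparison behind this result can be illustrated with a small helper that scores a "pick the shortest chain" rule against a "pick the longest chain" rule. The function name and the toy numbers below are hypothetical, not the paper's data.

```python
def accuracy_by_length_pick(samples_per_question, pick):
    """Sketch of the per-question analysis: for each question, several
    chains are sampled; one is selected by length and checked for
    correctness.

    `samples_per_question` maps a question id to a list of
    (thinking_tokens, is_correct) pairs; `pick` is min (shortest chain)
    or max (longest chain). Returns the fraction of questions answered
    correctly under that selection rule.
    """
    correct = 0
    for samples in samples_per_question.values():
        _, is_correct = pick(samples, key=lambda s: s[0])
        correct += is_correct
    return correct / len(samples_per_question)

# Toy illustration (made-up numbers, not the paper's results).
data = {
    "q1": [(700, True), (2400, False)],
    "q2": [(900, True), (3000, True)],
    "q3": [(1100, True), (2800, False)],
}
print(accuracy_by_length_pick(data, min))  # shortest-chain accuracy -> 1.0
print(accuracy_by_length_pick(data, max))  # longest-chain accuracy
```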

Implementation of short-m@k

The proposed short-m@k method achieves superior performance across various compute budgets compared to standard majority voting. In low-compute scenarios, short-1@k outperforms other methods while using up to 40% fewer tokens. In high-compute regimes, short-3@k consistently surpasses majority voting while reducing wall time by up to 33% (Figure 2).

Figure 2: m values ablation of short-m@k.

Fine-tuning with Shorter Reasoning Chains

To further investigate the impact of shorter reasoning chains, the researchers fine-tuned an LLM (Qwen-2.5-32B) on three variations of the S1 dataset: S1-short, S1-long, and S1-random. Results indicate that models fine-tuned on shorter reasoning chains (S1-short) not only produced shorter thinking trajectories but also improved accuracy compared to models trained on longer or randomly sampled chains (Figure 3). This highlights the potential efficacy of supervised fine-tuning on shorter CoT data.
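
One way the three fine-tuning sets could be assembled is sketched below: for each question with several sampled reasoning traces, keep the shortest, the longest, or a random one. The helper name, trace format, and selection details are assumptions, not the paper's released code.

```python
import random

def build_sft_variant(traces_per_question, variant, seed=0):
    """Hypothetical sketch of preparing the compared fine-tuning sets:
    S1-short keeps the shortest trace per question, S1-long the longest,
    and S1-random a uniformly sampled one.
    """
    rng = random.Random(seed)  # fixed seed so "random" is reproducible
    dataset = []
    for question, traces in traces_per_question.items():
        if variant == "short":
            chosen = min(traces, key=len)
        elif variant == "long":
            chosen = max(traces, key=len)
        else:  # "random"
            chosen = rng.choice(traces)
        dataset.append({"question": question, "reasoning": chosen})
    return dataset

# Illustrative traces for one question.
traces = {"q1": ["step A. done.", "step A... step B... step C... done."]}
print(build_sft_variant(traces, "short")[0]["reasoning"])  # -> "step A. done."
```

Each variant is then used as ordinary supervised fine-tuning data; the paper's observation is that the short variant yields both shorter generations and higher accuracy.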

Figure 3: S1-short dataset variant leading to enhanced performance upon fine-tuning.

Implications and Future Directions

The findings suggest reevaluating current practices in test-time compute for reasoning LLMs, with an emphasis on balancing reasoning efficiency and performance. Shorter thinking chains not only mitigate computational demands but can also enhance the reasoning capabilities of LLMs. Future research may explore the integration of these insights into more extensive LLM training regimes, potentially bolstering the development of models that are both computationally efficient and adept at complex reasoning tasks.

Conclusion

The paper presents compelling evidence that challenges the assumption that longer thinking chains inherently enhance LLM reasoning. By leveraging shorter reasoning chains, the proposed methods demonstrate that improved performance can be attained with reduced computational overhead, paving the way for more efficient reasoning strategies in LLM applications. This research provides significant insights into the optimization of LLM reasoning processes, with potential implications for broader applications requiring efficient computational resource management.
