
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Published 20 Mar 2025 in cs.CL (arXiv:2503.16419v3)

Abstract: LLMs have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains like mathematics and programming by harnessing supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance the Chain-of-Thought (CoT) reasoning. However, while longer CoT reasoning sequences improve performance, they also introduce significant computational overhead due to verbose and redundant outputs, known as the "overthinking phenomenon". In this paper, we provide the first structured survey to systematically investigate and explore the current progress toward achieving efficient reasoning in LLMs. Overall, relying on the inherent mechanism of LLMs, we categorize existing works into several key directions: (1) model-based efficient reasoning, which considers optimizing full-length reasoning models into more concise reasoning models or directly training efficient reasoning models; (2) reasoning output-based efficient reasoning, which aims to dynamically reduce reasoning steps and length during inference; (3) input prompts-based efficient reasoning, which seeks to enhance reasoning efficiency based on input prompt properties such as difficulty or length control. Additionally, we introduce the use of efficient data for training reasoning models, explore the reasoning capabilities of small LLMs, and discuss evaluation methods and benchmarking.

Summary

  • The paper introduces a categorization of efficient reasoning techniques in LLMs to mitigate overthinking and lower inference costs.
  • It evaluates model-based, reasoning output-based, and input prompt-based methods, emphasizing RL reward designs and CoT compression.
  • The survey highlights practical applications and benchmarks that balance computational efficiency with reasoning accuracy in real-world deployments.

This paper, "Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models" (2503.16419), provides a comprehensive overview of techniques aimed at making the reasoning processes of LLMs more computationally efficient without sacrificing accuracy. It addresses the "overthinking phenomenon", in which models like OpenAI o1 and DeepSeek-R1, while capable of complex reasoning using Chain-of-Thought (CoT), generate excessively long and redundant reasoning steps, leading to high inference costs and latency.

The survey categorizes efficient reasoning methods into three main areas:

  1. Model-based Efficient Reasoning: Modifies the model itself, either by optimizing full-length reasoning models into more concise ones or by directly training efficient reasoning models.
  2. Reasoning Output-based Efficient Reasoning: Dynamically reduces the number and length of reasoning steps during inference.
  3. Input Prompts-based Efficient Reasoning: Leverages properties of the input prompt, such as difficulty or explicit length control.
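For the RL-based methods in the first category, the summary highlights reward designs that trade accuracy against chain-of-thought length. A minimal illustrative sketch of such a length-penalized reward is below; the function name, token budget, and penalty weight are assumptions for illustration, not details taken from the paper:

```python
def length_penalized_reward(is_correct: bool, num_tokens: int,
                            budget: int = 1024, penalty: float = 0.5) -> float:
    """Reward a correct answer, minus a cost proportional to the fraction
    of the token budget the chain-of-thought consumed (clipped at 1.0).

    A shorter correct CoT therefore earns a strictly higher reward than a
    longer correct one, discouraging the overthinking phenomenon.
    """
    correctness = 1.0 if is_correct else 0.0
    length_cost = penalty * min(num_tokens / budget, 1.0)
    return correctness - length_cost
```

Under this kind of design, a correct answer with a 256-token CoT scores higher than an equally correct answer that uses the full 1024-token budget, so the policy is pushed toward concise reasoning.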

The survey also covers related topics:

  • The use of efficient data for training reasoning models.
  • The reasoning capabilities of small LLMs.
  • Evaluation methods and benchmarking for efficient reasoning.

Finally, the paper touches upon applications in areas like autonomous driving, embodied AI, and healthcare, and discusses broader challenges such as the trade-off between safety and efficiency, and the relative merits of RL versus SFT for achieving efficient reasoning. It concludes by emphasizing the practical importance and economic value of developing efficient reasoning capabilities in LLMs.
