Efficient Long CoT Reasoning in Small Language Models

Published 24 May 2025 in cs.CL and cs.AI | (2505.18440v2)

Abstract: Recent large reasoning models such as DeepSeek-R1 exhibit strong complex problem-solving abilities by generating long chain-of-thought (CoT) reasoning steps. It is challenging to directly train small LLMs (SLMs) to produce long CoT. Thus, distillation becomes a practical method to give SLMs such reasoning ability. However, long CoT often contains a lot of redundant content (e.g., overthinking steps), which may make it hard for SLMs to learn given their relatively limited capacity and generalization. To address this issue, we propose a simple yet effective method to prune unnecessary steps in long CoT, and then employ an on-policy method in which the SLM itself curates valid and useful long CoT training data. In this way, SLMs can effectively learn efficient long CoT reasoning while preserving competitive performance. Experimental results across a series of mathematical reasoning benchmarks demonstrate the effectiveness of the proposed method in distilling long CoT reasoning ability into SLMs, maintaining competitive performance while significantly reducing the generation of redundant reasoning steps.


Summary

The paper introduces a method that distills efficient chain-of-thought reasoning into small language models by pruning redundant steps from long reasoning traces.
The paper employs a three-stage pipeline—response sampling, binary cutting search, and on-policy validation—to streamline reasoning.
The paper demonstrates a 50-70% token reduction while maintaining competitive accuracy on benchmarks such as GSM8K, MATH, and AIME.

Efficient Long CoT Reasoning in Small LLMs

Recent developments in LLMs have advanced the ability to solve complex reasoning tasks using Chain-of-Thought (CoT) prompting techniques. However, large models that excel in reasoning often generate lengthy CoT traces that include redundant steps, posing challenges for small LLMs (SLMs) given their limited capacity. This paper presents a method for distilling efficient CoT reasoning into SLMs by strategically pruning these unnecessary steps and enhancing learning through on-policy validation.

Introduction to CoT Reasoning and its Challenges

Chain-of-Thought (CoT) prompting enhances reasoning capabilities by encouraging LLMs to think step by step. Large models like DeepSeek-R1 demonstrate that generating longer CoT traces can improve problem-solving on complex tasks. However, as the redundant reasoning produced by DeepSeek-R1 for a simple question shows (Figure 1), long CoT traces also introduce a computational burden. This is a particular problem for SLMs, whose limited capacity makes it hard to learn from traces full of overthinking and to generalize effectively.

Figure 1: Illustration of redundant reasoning to a simple question by DeepSeek-R1, showing unnecessary steps in the reasoning process.

Methodology: Streamlining CoT in Small LLMs

The paper proposes a three-stage method to streamline CoT reasoning for SLMs, consisting of Response Sampling, Binary Cutting Search, and On-policy Validation.

Response Sampling

First, the original long CoT responses are sampled from large reasoning models, capturing both the intermediate reasoning (the "thinking" part) and the corresponding conclusion.
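As a rough illustration of what a sampled response looks like once it is split into its two parts, the sketch below separates the reasoning from the conclusion. It assumes the reasoning is wrapped in `<think>...</think>` tags and that steps are separated by blank lines; both conventions, along with the names `SampledResponse`, `split_response`, and `split_steps`, are assumptions made for this example and are not taken from the paper.

```python
import re
from typing import NamedTuple


class SampledResponse(NamedTuple):
    thinking: str    # intermediate reasoning trace
    conclusion: str  # final answer text that follows the reasoning


def split_response(raw: str) -> SampledResponse:
    """Separate the reasoning trace from the conclusion.

    Assumes reasoning is wrapped in <think>...</think> (an assumed convention).
    """
    match = re.search(r"<think>(.*?)</think>(.*)", raw, flags=re.DOTALL)
    if match is None:
        # No reasoning block found: treat the whole text as the conclusion.
        return SampledResponse(thinking="", conclusion=raw.strip())
    return SampledResponse(match.group(1).strip(), match.group(2).strip())


def split_steps(thinking: str) -> list[str]:
    """Split a trace into steps on blank lines (an assumed step delimiter)."""
    return [s.strip() for s in re.split(r"\n\s*\n", thinking) if s.strip()]
```

Splitting into discrete steps matters for the later stages, since pruning operates on step boundaries rather than on arbitrary character positions.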

Binary Cutting Search

Binary cutting reduces the cost of searching for valid CoT prefixes. Rather than testing every possible cut point, it repeatedly cuts the reasoning trace at a midpoint, looking for the shortest prefix that still leads to a correct response. For a trace of n steps, this takes on the order of log n validation checks instead of n, and because the retained reasoning is always a contiguous prefix, its logical coherence is preserved.

Figure 2: Overview of the proposed streamlining method, highlighting the stages of sampling, validation, and binary cutting.
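The midpoint search can be sketched as an ordinary binary search over prefix lengths. This is a minimal illustration and not necessarily the paper's exact procedure: the function name `binary_cut` and the `is_valid` callback (which stands in for whatever check decides that a prefix still yields a correct answer) are hypothetical, and the full trace is assumed to be valid.

```python
from typing import Callable, Sequence


def binary_cut(
    steps: Sequence[str],
    is_valid: Callable[[Sequence[str]], bool],
) -> int:
    """Return a prefix length k such that steps[:k] passes is_valid.

    Assumes the full trace passes is_valid. If validity is monotone in the
    prefix length, k is the shortest valid prefix; otherwise it is simply
    a valid prefix found with O(log n) checks.
    """
    lo, hi = 0, len(steps)  # invariant: steps[:hi] is valid
    while lo < hi:
        mid = (lo + hi) // 2
        if is_valid(steps[:mid]):
            hi = mid        # the shorter prefix suffices, keep cutting
        else:
            lo = mid + 1    # cut too deep, more steps are needed
    return hi
```

Note that validity is not guaranteed to be monotone in practice (a longer prefix can fail where a shorter one succeeds), which is one reason a recovery step after cutting can be useful.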

On-Policy Validation

Instead of relying on external validation, the target SLM itself evaluates whether truncated CoT segments are sufficient for a correct final answer, thereby tailoring the distilled reasoning steps to the SLM's inherent capabilities.
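One way to picture on-policy validation is as a check that feeds the truncated reasoning to the target SLM and compares its final answer with the reference. The sketch below assumes a `generate` callable that wraps the SLM, a `<think>` prompt layout, and answers reported in `\boxed{...}`; all three, and the function names, are assumptions for illustration rather than details from the paper.

```python
import re
from typing import Callable, Optional, Sequence


def extract_answer(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text, if any (assumed format)."""
    found = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return found[-1].strip() if found else None


def make_on_policy_validator(
    question: str,
    gold_answer: str,
    generate: Callable[[str], str],  # hypothetical wrapper around the target SLM
) -> Callable[[Sequence[str]], bool]:
    """Build a check that asks the target SLM to finish from a truncated trace."""

    def is_valid(prefix_steps: Sequence[str]) -> bool:
        # Close the reasoning block early so the model must answer from the prefix.
        prompt = (
            f"{question}\n<think>\n" + "\n\n".join(prefix_steps) + "\n</think>\n"
        )
        completion = generate(prompt)
        return extract_answer(completion) == gold_answer.strip()

    return is_valid
```

Because the check is run by the model that will later be trained, a prefix counts as sufficient only if that particular SLM can reach the right answer from it, which is the point of making the validation on-policy.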

Application of Streamlined CoT to Fine-tuning SLMs

Once the distilled reasoning data has been obtained, the SLM is fine-tuned with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). SFT trains the SLM directly on the concise reasoning traces, while DPO further refines its ability to distinguish efficient reasoning paths from redundant ones.
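To make the two training signals concrete, the sketch below builds one SFT record and one DPO preference pair from a single pruned trace. Treating the streamlined trace as the preferred response and the original long trace as the dispreferred one is an assumed reading of the description above; the function name and the `prompt`/`completion`/`chosen`/`rejected` field names are likewise illustrative.

```python
from typing import Sequence


def build_training_records(
    question: str,
    original_steps: Sequence[str],
    kept: int,            # number of leading steps retained after pruning
    conclusion: str,
) -> tuple[dict, dict]:
    """Build one SFT record and one DPO preference pair (assumed layout)."""
    short_trace = "\n\n".join(original_steps[:kept])
    full_trace = "\n\n".join(original_steps)

    concise = f"{short_trace}\n\n{conclusion}"
    verbose = f"{full_trace}\n\n{conclusion}"

    # SFT: imitate the concise response directly.
    sft_record = {"prompt": question, "completion": concise}
    # DPO: prefer the concise response over the redundant one.
    dpo_record = {"prompt": question, "chosen": concise, "rejected": verbose}
    return sft_record, dpo_record
```

Both responses end in the same conclusion, so the preference pair differs only in how much reasoning precedes it.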

Experimental Results

Testing across mathematical reasoning benchmarks such as GSM8K, MATH, and AIME demonstrates that the streamlined approach consistently preserves reasoning efficacy while significantly reducing the average number of reasoning tokens generated. For example, the method decreased token usage by 50-70%, depending on the task, while maintaining accuracy competitive with training on full-length CoT traces.

Figure 3: Distribution of remaining token ratios indicating successful reduction of redundant reasoning steps.
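The relationship between a remaining token ratio and a token reduction figure is simple arithmetic, sketched below. The function names are hypothetical and the numbers in the usage note are made up for illustration; they are not results from the paper.

```python
def remaining_token_ratio(original_tokens: int, pruned_tokens: int) -> float:
    """Fraction of the original reasoning tokens that survive pruning."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return pruned_tokens / original_tokens


def token_reduction(original_tokens: int, pruned_tokens: int) -> float:
    """Fraction of reasoning tokens removed, i.e. 1 minus the remaining ratio."""
    return 1.0 - remaining_token_ratio(original_tokens, pruned_tokens)
```

For instance, a hypothetical trace cut from 1000 to 400 tokens has a remaining ratio of 0.4, which corresponds to a 60% reduction.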

Case Study and Analysis

Case studies illustrate how binary cutting effectively removes redundancy while maintaining critical reasoning logic. When a trace has been pruned too aggressively, the backtracking mechanism restores the necessary steps, providing a robust framework for adapting CoT traces to SLM constraints.

Figure 4: A case study showing how step restoration during backtracking enhances reasoning coherence.
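The summary does not spell out how backtracking is implemented, so the following is only one plausible sketch: starting from a proposed cut, previously removed steps are restored a few at a time until the prefix passes a validity check. The names `backtrack`, `is_valid`, and `stride` are hypothetical.

```python
from typing import Callable, Sequence


def backtrack(
    steps: Sequence[str],
    cut: int,
    is_valid: Callable[[Sequence[str]], bool],
    stride: int = 1,
) -> int:
    """Restore removed steps after an over-aggressive cut.

    Starting from steps[:cut], add back `stride` steps at a time until the
    prefix passes is_valid or the full trace has been restored.
    """
    k = cut
    while k < len(steps) and not is_valid(steps[:k]):
        k = min(len(steps), k + stride)
    return k
```

A larger `stride` trades some extra retained steps for fewer validation calls.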

Conclusion

This method enhances the reasoning abilities of small models by distilling efficient long CoT data from large models, eliminating redundancies, and leveraging the target model's own inductive preferences. Such approaches offer a promising path toward integrating complex reasoning capabilities into SLMs efficiently, allowing them to remain competitive in accuracy without requiring extensive computational resources. Future work may explore optimizing the search algorithm and extending the approach to other model sizes and tasks.
