Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT?

Published 16 Apr 2025 in cs.AI, cs.CL, and cs.LG | arXiv:2504.11741v1

Abstract: Recent supervised fine-tuning (SFT) approaches have significantly improved LLMs' performance on mathematical reasoning tasks, even when models are trained at a small scale. However, the specific capabilities enhanced through such fine-tuning remain poorly understood. In this paper, we conduct a detailed analysis of model performance on the AIME24 dataset to understand how reasoning capabilities evolve. We discover a ladder-like structure in problem difficulty, categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard (Exh)), and identify the specific requirements for advancing between tiers. We find that progression from the Easy to the Medium tier requires adopting an R1 reasoning style with minimal SFT (500-1K instances), while Hard-level questions suffer from frequent model errors at each step of the reasoning chain, with accuracy plateauing at around 65% despite logarithmic scaling. Exh-level questions present a fundamentally different challenge; they require unconventional problem-solving skills that current models uniformly struggle with. Additional findings reveal that carefully curated small-scale datasets offer limited advantage; scaling dataset size proves far more effective. Our analysis provides a clearer roadmap for advancing LLM capabilities in mathematical reasoning.

Summary

  • The paper shows that minimal SFT with 500–1K chain-of-thought examples transitions LLMs from basic to intermediate mathematical reasoning.
  • It finds that performance on Hard-level problems plateaus at around 65% accuracy due to instability in extended reasoning chains.
  • The study suggests that scaling SFT datasets is more effective than curation and recommends exploring RL and tool-augmented methods to enhance higher-order reasoning.

Analyzing the Potentials and Limitations of LLMs in Mathematical Reasoning Post-SFT

The paper "Climbing the Ladder of Reasoning: What LLMs Can—and Still Can't—Solve after SFT" examines how Supervised Fine-Tuning (SFT) shapes LLMs' ability to solve mathematical reasoning tasks. By meticulously examining the AIME24 dataset, the authors delineate the stages of reasoning skill development in LLMs, categorizing problems into four distinct tiers of difficulty: Easy, Medium, Hard, and Extremely Hard (Exh). The result is a structured account of how reasoning abilities evolve as SFT is applied.

The research reveals a stepwise advancement in the model's capability to solve increasingly complex problems. A key finding is that minimal SFT, involving 500-1K instances of long chain-of-thought data, is sufficient for models to transition from Easy-level to Medium-level problems. This transition hinges on adopting the R1 reasoning style, which emphasizes extended reasoning with explicit verification steps.

For Hard-level problems, while initial SFT does yield some capacity improvements, performance plateaus at approximately 65% accuracy due to intrinsic instability in long reasoning chains. Accuracy scales only logarithmically with SFT data, so returns diminish as dataset size grows. Exh-level problems, meanwhile, remain unsolved even by the most refined models: they demand unconventional problem-solving strategies that current LLM architectures uniformly lack.
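Both trends can be made concrete with a toy model (the numbers below are illustrative assumptions, not figures from the paper): if each step of a reasoning chain succeeds independently with some probability, whole-chain accuracy decays geometrically with chain length, which is one way to read the instability on Hard-level problems; and a logarithmic scaling curve saturating at a ceiling captures the diminishing returns from more SFT data.

```python
import math

def chain_accuracy(p_step: float, n_steps: int) -> float:
    """Probability that a reasoning chain is fully correct when each of
    n_steps independent steps succeeds with probability p_step."""
    return p_step ** n_steps

def scaled_accuracy(n_examples: int, base: float = 0.30, slope: float = 0.05,
                    ceiling: float = 0.65) -> float:
    """Toy scaling curve: accuracy grows with log10 of the SFT dataset
    size but saturates at a ceiling (65% here, echoing the plateau)."""
    return min(ceiling, base + slope * math.log10(n_examples))

# Even a 98%-per-step model rarely completes long chains error-free:
for steps in (5, 20, 50):
    print(f"{steps:>2} steps -> {chain_accuracy(0.98, steps):.3f}")

# Each 10x increase in data buys the same small accuracy increment:
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9} examples -> {scaled_accuracy(n):.3f}")
```

Under these assumed parameters a 50-step chain succeeds barely a third of the time despite 98% per-step reliability, which illustrates why stabilizing individual steps matters more than adding data once the logarithmic curve flattens.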

The authors further compare curated and non-curated small-scale SFT datasets, finding that careful curation confers only a marginal advantage. They argue that scaling the dataset is more effective than curation at addressing problem complexity, challenging earlier claims that highly selective small datasets yield superior model performance.

The paper underscores several profound implications for advancing LLM capabilities:

  1. Stability and Scaling: Ensuring stability in reasoning chains holds the potential to unlock further performance improvements. Larger datasets alleviate some instability but also necessitate exploring reinforcement learning (RL) approaches and tool-augmented reasoning to transcend inherent limitations of SFT.
  2. SFT and Generalization: Small-scale SFT yields substantial initial improvements, but how the resulting reasoning trajectories generalize remains poorly understood, underscoring how early this line of research still is.
  3. Higher-Order Reasoning: The study raises critical questions regarding whether existing SFT methodologies can independently foster higher-order reasoning, especially across unconventional problem-solving paradigms.
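The tool-augmented direction in point 1 can be sketched minimally: rather than carrying out arithmetic inside its chain of thought, where per-step errors accumulate, a model can delegate computation to an exact external tool. The calculator below is a hypothetical illustration of that idea, not an implementation from the paper; it safely evaluates arithmetic expressions via Python's `ast` module.

```python
import ast
import operator

# Operators the "calculator tool" supports; anything else is rejected.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow, ast.USub: operator.neg}

def calc(expr: str) -> float:
    """Exactly evaluate an arithmetic expression a model might emit,
    walking the parsed AST instead of using eval()."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp):
            return OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))

# A step the model would otherwise have to compute in-text:
print(calc("(3**4 + 19) / 25"))  # prints 4.0
```

Offloading such steps removes one source of the per-step instability the paper identifies, though the model must still decide when and how to invoke the tool.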

These implications point to future research directions, including the integration of RL strategies and external computational tools to widen the problem-solving horizons of LLMs. The study thus offers a clearer roadmap for refining LLM reasoning capabilities while naming the persistent weaknesses in complex reasoning that remain to be addressed. As the landscape of AI continues to expand, understanding these limitations and potential pathways becomes crucial for meaningful advances in automated reasoning systems.
