
Adaptive Proxy Prefix CoTs Distillation

Updated 25 January 2026
  • Proxy Prefix CoTs adaptively truncate teacher chain-of-thought reasoning into minimal, effective prefixes for improved student-model training.
  • P-ALIGN employs a binary search with a self-judging function to identify the minimal sufficient reasoning prefix, significantly enhancing Pass@1 and Pass@3 metrics.
  • This approach reduces redundant information while preserving critical reasoning steps, leading to more efficient distillation and improved performance on complex tasks.

Proxy Prefix CoTs refer to the adaptive truncation and utilization of teacher-generated chain-of-thought (CoT) reasoning prefixes for distillation into smaller student models, addressing the challenge of lengthy and structurally complex teacher trajectories in reasoning-intensive tasks such as mathematical problem solving. The central procedure—Prefix-ALIGNment distillation (P-ALIGN)—algorithmically locates the minimally sufficient prefix of reasoning steps necessary for effective student supervision, avoiding the negative impact of irrelevant or uncertain suffix information. This approach systematically surpasses prior distillation strategies in mathematical reasoning tasks, yielding improved downstream accuracy and more effective alignment between teacher signals and student learning capacity (Liu et al., 15 Jan 2026).

1. Formal Definitions and Preliminaries

For each query $q$, a reasoning trajectory generated by a teacher LLM is defined as $R = \{z_1, z_2, \dots, z_T\}$, where each $z_t$ is an atomic reasoning statement and $T$ denotes trajectory length. Segmenting $R$ into sentences as $R = \{r_1, \ldots, r_m\}$, a prefix is given by $P_t = \{r_1, \ldots, r_t\}$ and its complementary suffix by $S_t = \{r_{t+1}, \ldots, r_m\}$. The guiding models are denoted $M_{\mathrm{teacher}}$ (e.g., DeepSeek-R1) and $M_{\mathrm{student}}$ (e.g., Qwen2.5-7B, Qwen3-8B, Llama3.2-3B).
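The prefix/suffix notation amounts to a simple split of the sentence-segmented trajectory. A minimal sketch (the function name is illustrative, not from the paper's code):

```python
# P_t is the first t sentence-level steps of a teacher trajectory R;
# S_t is the complementary suffix. This mirrors the definitions above.

def split_trajectory(steps, t):
    """Return (P_t, S_t) for a sentence-segmented trajectory of m steps."""
    if not 0 <= t <= len(steps):
        raise ValueError("t must lie in [0, m]")
    return steps[:t], steps[t:]

R = ["Let x denote ...", "Then x^2 = ...", "So the answer is 42."]
prefix, suffix = split_trajectory(R, 2)
# prefix holds r_1..r_2, suffix holds r_3
```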

2. Adaptive Prefix Alignment Algorithm

The methodological core of P-ALIGN is to locate the minimal prefix $P_{t^*}$ that is sufficient for the student to solve $q$. Sufficiency is operationalized via a "self-judging function": $$S(P_t; q) = P(\text{ENOUGH} \mid q, P_t; M_{\mathrm{student}})$$ where $M_{\mathrm{student}}$, prompted with $q$ and $P_t$, indicates sufficiency (ENOUGH) or insufficiency (NOT_ENOUGH). A sufficiency threshold $\tau$ (typically $\tau = 0.5$) is applied such that a prefix is deemed sufficient iff $S(P_t; q) \ge \tau$.

A binary search procedure efficiently identifies $t^*$, requiring only $O(\log m)$ self-judging calls rather than the $O(m)$ of a naive linear scan. At each step, $M_{\mathrm{student}}$ is invoked as a proxy judge on the candidate prefix length $t$, and the search converges to the minimal sufficient prefix $P_{t^*}$.
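Assuming sufficiency is monotone in prefix length (once a prefix is sufficient, longer prefixes remain so), the search can be sketched in plain Python; `judge` stands in for the student's self-judging call and is an assumption of this sketch:

```python
# Binary search for the minimal sufficient prefix length t*, using
# O(log m) judge calls instead of an O(m) linear scan.

def minimal_sufficient_prefix(m, judge, tau=0.5):
    """Return the smallest t in [1, m] with judge(t) >= tau, or None.

    judge(t) stands in for S(P_t; q), the student's probability of ENOUGH.
    """
    lo, hi, best = 1, m, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if judge(mid) >= tau:
            best, hi = mid, mid - 1   # sufficient: try a shorter prefix
        else:
            lo = mid + 1              # insufficient: need a longer prefix
    return best

# Toy judge: prefixes of length >= 12 are sufficient.
t_star = minimal_sufficient_prefix(48, lambda t: 1.0 if t >= 12 else 0.0)
# t_star == 12
```

On the 48-sentence example discussed later, this needs about six judge calls instead of up to 48, which is the source of the paper's reported efficiency gain.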

3. Training Objective and Dataset Construction

Upon determination of each $P_{i, t^*}$, a distillation dataset is constructed as follows:

  1. The minimal prefix $P_{i, t^*}$ for each sample $i$ is passed under an alignment prompt to $M_{\mathrm{student}}$ to generate the full CoT continuation $y_i$.
  2. Only those tuples $(q_i, P_{i, t^*}{:}y_i)$ whose final answer matches the ground truth are retained: $$D_{\mathrm{align}} = \{(q_i, P_{i, t^*}{:}y_i) \mid \mathrm{Ans}(y_i) = \mathrm{gold}_i\}$$
  3. The main objective is the cross-entropy alignment loss: $$\mathcal{L}_{\mathrm{align}} = -\sum_{(q,\, P{:}y)\in D_{\mathrm{align}}}\ \sum_{t=1}^{|P{:}y|} \log P_\theta \left((P{:}y)_t \mid (P{:}y)_{<t}, q\right)$$ No explicit auxiliary regularizers are introduced; truncation alone is relied upon for coverage and redundancy control.
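The answer-matching filter of step 2 can be sketched as follows; `extract_answer` is a stand-in for the paper's $\mathrm{Ans}(\cdot)$, and the field names are illustrative assumptions:

```python
# Build D_align: keep only (q, P_{t*}:y) pairs whose generated
# continuation reaches the gold answer. Training then applies standard
# next-token cross-entropy on the concatenation P:y conditioned on q.

def build_align_dataset(samples, extract_answer):
    """samples: iterable of dicts with keys 'q', 'prefix', 'y', 'gold'."""
    return [
        (s["q"], s["prefix"] + s["y"])          # train on the concatenation P:y
        for s in samples
        if extract_answer(s["y"]) == s["gold"]  # keep only verified answers
    ]
```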

4. Models, Data Processing, and Training Details

  • Teacher Model: DeepSeek-R1 (175B parameters) in zero-shot with InstructQA templates.
  • Student Models: Qwen2.5-7B-Instruct, Qwen3-8B, Llama3.2-3B, Qwen3-14B.
  • Training Data: 1,000 problem–CoT pairs from s1K-1.1; each trajectory sentence segmented.
  • Fine-tuning: 3 epochs, LoRA rank 4, learning rate $5 \times 10^{-5}$, batch size 16, using the TRL and LLaMA-Factory frameworks.

5. Experimental Evaluation and Results

P-ALIGN is evaluated on mathematical reasoning tasks using AIME25, AIME24, AMC12, and MATH500 datasets. Metrics are Pass@1 and Pass@3 (exact-match answer accuracy) compared against baselines: zero-shot InstructQA, supervised fine-tuning (SFT) on labels, SFT on full CoTs, and UPFT (32-token fixed truncation).
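Assuming the conventional reading of Pass@k (a problem counts as solved if any of its first $k$ sampled answers matches the gold answer exactly), the metric can be sketched as:

```python
# Pass@k over a benchmark: each problem contributes a list of booleans,
# one per sampled generation, marking exact-match correctness.

def pass_at_k(results, k):
    """results: list of per-problem correctness lists; returns the fraction
    of problems solved by at least one of the first k samples."""
    solved = sum(any(per_problem[:k]) for per_problem in results)
    return solved / len(results)

# Two problems, three samples each: the first is solved on sample 2.
rate = pass_at_k([[False, True, False], [False, False, False]], 3)
# rate == 0.5
```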

Pass rates for the Qwen2.5-7B student (Pass@1 / Pass@3; averages as reported, no zero-shot average given):

| Condition | AIME25 | AIME24 | AMC12 | MATH500 | Avg |
|---|---|---|---|---|---|
| Zero-shot | 6.7 / 10.0 | 10.0 / 20.0 | 40.8 / 49.7 | 69.8 / 84.8 | n/a |
| SFT (Long-CoT) | 10.0 / 20.0 | 10.0 / 26.7 | 48.2 / 56.6 | 75.6 / 86.6 | 35.9 / 47.7 |
| UPFT | 13.3 / 20.0 | 13.3 / 23.3 | 47.6 / 59.0 | 74.8 / 84.6 | 37.3 / 46.7 |
| P-ALIGN | 16.7 / 26.7 | 16.7 / 26.7 | 49.4 / 63.9 | 75.8 / 85.2 | 39.6 / 50.6 |

P-ALIGN outperforms all established baselines on every benchmark, with average gains of roughly 2-4 percentage points in Pass@1 and Pass@3. Similar improvements are obtained for Qwen3-8B. Ablation studies show significant performance drops when the self-judging sufficiency criterion is removed, and that binary search matches the accuracy of a linear scan at much lower cost, using approximately 20× fewer self-judging calls.

6. Analysis and Insights

Fixed-Ratio vs. Adaptive Prefixes

On complex tasks (AIME24/25), longer fixed prefixes enhance performance up to approximately 70–80% of the original CoT length. On easier tasks (MATH500), excess prefix length diminishes accuracy ("overthinking"). Adaptive per-example truncation via P-ALIGN consistently exceeds any fixed truncation.

Token Sequence Properties

Mean CoT lengths for SFT(Long-CoT), UPFT, and P-ALIGN are approximately 9,300, 4,500, and 5,400 tokens respectively. Human and GLM-4.5 judges prefer the P-ALIGN generations over 75% of the time, identifying superior conciseness and relevance.

Position-wise Uncertainty

Token entropy for the student rises monotonically throughout the teacher CoT. Training on early chunks (prefixes) produces optimal downstream accuracy, while mid/suffix supervision is less effective.
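The uncertainty measure behind this analysis can be illustrated with a toy Shannon-entropy computation; the probability vectors below are illustrative stand-ins for the student's next-token distributions, not model outputs:

```python
import math

# Shannon entropy (in nats) of a next-token distribution. Low entropy
# early in the teacher CoT and rising entropy toward the suffix is the
# position-wise pattern described above.

def token_entropy(probs):
    """Entropy of a categorical distribution; zero-probability terms skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)

early = token_entropy([0.9, 0.05, 0.05])   # confident prediction: low entropy
late = token_entropy([0.4, 0.3, 0.3])      # diffuse prediction: higher entropy
# early < late
```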

Qualitative Illustration

In a problem requiring the remainder of an iterated power of nines modulo 1000, the teacher's CoT (48 sentences) contained the critical step at sentence 12 ($P \equiv 109 \pmod{125}$). P-ALIGN truncated the CoT at exactly this juncture, enabling the student to solve the problem accurately, whereas SFT(Long-CoT) induced reflective excess and UPFT truncated before the decisive step.

7. Limitations and Prospective Extensions

  • The self-judging sufficiency criterion requires a relatively capable student; diminished model capacity may lead to misjudgment.
  • The self-judge mechanism operates as a binary classifier. Potential extensions include continuous sufficiency scoring or the incorporation of external verification agents.
  • All experiments employ closed-source teacher models; adaptation to open-source alternatives such as LLaMA-class teachers is a direction for future work.

A plausible implication is that Proxy Prefix CoT distillation techniques such as P-ALIGN provide a principled route for maximizing the utility of teacher reasoning while minimizing transfer of irrelevant or ambiguous intermediate steps, with efficiency and robustness scaling to diverse reasoning tasks.
