Explain the chain-of-thought length dip/collapse under log-probability rewards

Determine the mechanism that causes the observed initial shortening of chain-of-thought (CoT) length during reinforcement-learning fine-tuning with log-probability–based rewards (Log-prob, AvgLogprob, and JEPO), and, in non‑verifiable long‑form settings, its persistent collapse. Explain why this behavior differs both from verifiable domains, where CoT length subsequently recovers, and from probability-based rewards (VeriFree, AvgProb) and Base RL, which do not exhibit the same shortening. The goal is to account for the phenomenon across the datasets (MATH, DeepScaleR, Alpaca, NuminaProof) and model families (Llama‑3.2‑3B‑Instruct, Qwen‑2.5‑3B‑Instruct) used in the experiments.

Background

The paper reports that when training with log-probability rewards, chains of thought (CoTs) initially shorten across tasks. In verifiable math datasets (MATH, DeepScaleR), CoT length recovers later in training, whereas in non‑verifiable long‑form datasets (NuminaProof, Alpaca), CoTs collapse to very short sequences and do not recover.
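The contrast between log-probability and probability-based rewards can be made concrete. Below is a minimal sketch (the paper's exact reward definitions may differ): each reward is assumed to score the final answer's tokens conditioned on the sampled CoT, given their per-token log-probabilities. The key structural difference is that summed log-probabilities are unbounded below, so a single unlikely answer token can dominate the reward, whereas averaged probabilities are bounded in [0, 1] and damp such outliers.

```python
import math

def logprob_reward(token_logps):
    # Sum of answer-token log-probabilities: unbounded below, so a few
    # low-probability tokens can dominate the reward signal.
    return sum(token_logps)

def avg_logprob_reward(token_logps):
    # Length-normalized variant: mean log-probability per answer token.
    return sum(token_logps) / len(token_logps)

def avg_prob_reward(token_logps):
    # Probability-based variant: mean per-token probability, bounded in
    # [0, 1], so a single outlier token has limited influence.
    return sum(math.exp(lp) for lp in token_logps) / len(token_logps)

# Hypothetical per-token log-probs for a four-token answer, with one
# very unlikely token (-6.0).
answer_logps = [-0.1, -0.3, -6.0, -0.2]
print(logprob_reward(answer_logps))      # -6.6: the outlier dominates
print(avg_logprob_reward(answer_logps))  # -1.65
print(avg_prob_reward(answer_logps))     # ~0.62: the outlier barely moves it
```

Under this reading, the heavier tail of log-probability rewards plausibly interacts with credit assignment over long CoTs, though the paper leaves the actual mechanism open.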

This behavior does not occur for Base RL or probability-based rewards (e.g., VeriFree), and attempts to prevent CoT shortening via KL regularization or explicit length penalties stabilize length but degrade task performance. A warm-start approach stabilizes CoT length but still fails to outperform supervised fine-tuning under reasonable compute budgets.
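The trade-off behind these interventions can be sketched with a toy shaped reward that combines a KL penalty toward the reference policy with an explicit length penalty. The functional form and coefficients here are hypothetical illustrations, not the paper's actual objective: both terms constrain length drift, but both also subtract from the achievable reward, consistent with the reported performance degradation.

```python
def shaped_reward(task_reward, kl_to_ref, cot_len, target_len,
                  beta=0.01, lam=0.001):
    # Hypothetical shaping: penalize divergence from the reference policy
    # (KL term) and deviation from a target CoT length (length term).
    # Larger beta/lam stabilize length more aggressively but cap the
    # task reward the policy can collect.
    return task_reward - beta * kl_to_ref - lam * abs(cot_len - target_len)

# A rollout with task reward 1.0, KL of 5.0 nats, and a 200-token CoT
# against a 500-token target loses 0.35 of its reward to the penalties.
print(shaped_reward(1.0, 5.0, 200, 500))  # 0.65
```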

The authors hypothesize potential causes (signal-to-noise and credit-assignment difficulties over long CoTs, or hidden internal reasoning in long answers) but offer no definitive explanation, motivating an explicit open problem to understand the underlying mechanism.

References

We now report some intriguing observations on the behavior of the CoT during training, for which we have no complete explanation.

Likelihood-Based Reward Designs for General LLM Reasoning (2602.03979 - Kwiatkowski et al., 3 Feb 2026), Section "Length of the Chain-of-Thought During Training" (label: sec:length)