Explain the chain-of-thought length dip/collapse under log-probability rewards
Determine the mechanism behind the observed initial shortening of chain-of-thought (CoT) length during reinforcement learning fine-tuning with log-probability-based rewards (Log-prob, AvgLogprob, and JEPO), and behind its persistent collapse in non-verifiable long-form settings. Explain why this behavior differs both from verifiable domains, where CoT length subsequently recovers, and from probability-based rewards (VeriFree, AvgProb) and Base RL, which do not exhibit the same shortening. The account should hold across the datasets (MATH, DeepScaleR, Alpaca, NuminaProof) and model families (Llama-3.2-3B-Instruct, Qwen-2.5-3B-Instruct) used in the experiments.
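As a concrete reference point, the sketch below shows one plausible way such rewards could be computed from the policy's token-level log-probabilities of a reference answer, conditioned on the question and the generated CoT. The tensor layout and the dictionary keys are assumptions for illustration: the keys map only loosely onto Log-prob, AvgLogprob, VeriFree, and AvgProb, JEPO is omitted, and the exact definitions follow the respective methods. One structural difference the sketch makes visible is that log-probability rewards are unbounded below, whereas probability-based rewards are bounded in [0, 1], which may bear on the observed length dynamics.

```python
import torch
import torch.nn.functional as F


def answer_rewards(logits: torch.Tensor, answer_ids: torch.Tensor) -> dict:
    """Illustrative reward variants computed from the policy's token-level
    log-probabilities of a reference answer.

    logits: (T, V) policy logits at the T answer positions (after the CoT).
    answer_ids: (T,) reference answer token ids.
    """
    log_probs = F.log_softmax(logits, dim=-1)                              # (T, V)
    tok_logp = log_probs.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1)  # (T,)

    return {
        # Log-probability-based rewards: unbounded below.
        "logprob": tok_logp.sum(),         # joint log-prob of the answer
        "avg_logprob": tok_logp.mean(),    # length-normalized log-prob
        # Probability-based rewards: bounded in [0, 1].
        "seq_prob": tok_logp.sum().exp(),  # joint probability of the answer
        "avg_prob": tok_logp.exp().mean()  # mean per-token probability
    }


# Toy usage: random logits over a 10-token vocabulary, 4-token answer.
if __name__ == "__main__":
    torch.manual_seed(0)
    rewards = answer_rewards(torch.randn(4, 10), torch.tensor([1, 3, 2, 7]))
    print({k: round(v.item(), 4) for k, v in rewards.items()})
```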
We now report some intriguing observations on the behavior of the CoT during training, for which we have no complete explanation.