Conditions under which unguided rollouts reach guidance-like computations
Determine whether, and under what conditions, unguided rollouts from a base large language model produce traces that perform computations similar to those induced by an oracle-provided guidance prefix, particularly when the guidance needed for a successful completion is long.
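To make the role of guidance length concrete, the sketch below works through a deliberately simplified toy model. It is not taken from the cited paper. It assumes that a guidance-like computation consists of `n_steps` steps, that an unguided rollout reproduces each step independently with the same probability `p_step`, and that `k_rollouts` independent rollouts are sampled. All three names and the independence assumption are hypothetical and chosen only for illustration; real rollouts need not behave this way, which is part of why the question is open.

```python
import math


def prob_any_rollout_matches(p_step: float, n_steps: int, k_rollouts: int) -> float:
    """Toy estimate: chance that at least one of k_rollouts unguided samples
    reproduces all n_steps guidance-like steps.

    ASSUMPTION (illustrative only): steps are independent and each is
    reproduced with the same probability p_step.
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must lie in [0, 1]")
    if n_steps < 0 or k_rollouts < 0:
        raise ValueError("n_steps and k_rollouts must be non-negative")

    # Probability that a single rollout reproduces every step.
    q = p_step ** n_steps
    if q >= 1.0:
        # Every rollout matches, so any non-empty batch contains a match.
        return 1.0 if k_rollouts > 0 else 0.0

    # 1 - (1 - q)^k, computed via log1p/expm1 to stay accurate for tiny q.
    return -math.expm1(k_rollouts * math.log1p(-q))


if __name__ == "__main__":
    # Same per-step probability and sampling budget, increasing guidance length.
    for n in (5, 20, 80):
        print(n, prob_any_rollout_matches(p_step=0.9, n_steps=n, k_rollouts=64))
```

Under these assumptions the single-rollout match probability is `p_step ** n_steps`, so it shrinks geometrically as the guidance gets longer, and a fixed sampling budget eventually stops compensating. Whether actual base models follow anything like this pattern, or instead reach equivalent computations through different traces, is exactly what the problem statement leaves open.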
References
In general, it is unclear whether the base model would ever sample traces that perform computations similar to the provided guidance, especially when the guidance required to obtain a successful completion is long.
— POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration
(2601.18779 - Qu et al., 26 Jan 2026) in Section 5.1 (A Mental Model)