Can outcome-reward RL learn operator-like reasoning procedures from a finite, fixed prompt set?
Determine whether outcome-reward reinforcement learning, applied to large language models under a finite, fixed prompt distribution and bounded training rollout lengths, can learn an operator-like reasoning procedure that composes across iterations to solve problems, thereby avoiding harmful distribution shift when test-time horizons exceed those seen during training.
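The setup can be made concrete with a toy sketch. The Python below is not from the paper: the prompt set, the horizon constants, and the `policy_step`, `outcome_reward`, and `update_policy` stubs are all hypothetical placeholders, meant only to illustrate the mismatch between short-horizon outcome-reward training and long-horizon test-time composition described above.

```python
import random

# Hypothetical stand-ins: none of these names come from the paper; they mark
# where an LLM forward pass, a verifier, and an RL update would go.

PROMPTS = ["p1", "p2", "p3"]   # finite, fixed prompt set (toy placeholder)
TRAIN_HORIZON = 4              # bounded rollout length seen during training
TEST_HORIZON = 16              # test-time horizon exceeds the training horizon

def policy_step(state: str) -> str:
    """One reasoning iteration: maps the current state to the next state."""
    return state + "|step"     # placeholder for an LLM-generated step

def outcome_reward(state: str) -> float:
    """Sparse outcome reward: nonzero only if the final state solves the task."""
    return 1.0 if random.random() < 0.1 else 0.0   # placeholder verifier

def update_policy(trajectory: list[str], reward: float) -> None:
    """Placeholder for the RL update (e.g. a policy-gradient step)."""
    pass

# Training: short rollouts over the fixed prompt distribution, rewarded only
# on the outcome of the final state.
for _ in range(1000):
    state = random.choice(PROMPTS)
    trajectory = [state]
    for _ in range(TRAIN_HORIZON):
        state = policy_step(state)
        trajectory.append(state)
    update_policy(trajectory, outcome_reward(state))

# Test: the same per-iteration procedure is chained far beyond the training
# horizon. The open question is whether the learned step acts as a true
# "operator" that composes safely, or whether its intermediate inputs drift
# off the training distribution as the horizon grows.
state = random.choice(PROMPTS)
for _ in range(TEST_HORIZON):
    state = policy_step(state)
```

The sketch makes the failure mode visible: after `TRAIN_HORIZON` iterations at test time, `policy_step` starts receiving states it never encountered during training, which is harmless only if the learned step behaves like a genuine operator.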
References
While this form of distribution shift is not problematic if the model has learned a true "operator" that enables the chaining of behaviors to solve problems, it is unclear whether RL can learn such operators from a finite, fixed prompt set.
— Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL
(arXiv:2602.03773, Wu et al., 3 Feb 2026), Section 4 (Problem Statement)