Can outcome-reward RL learn operator-like reasoning procedures from a finite, fixed prompt set?

Determine whether outcome-reward reinforcement learning, applied to large language models under a finite, fixed prompt distribution and bounded training rollout lengths, can learn an operator-like reasoning procedure that composes across iterations to solve problems, thereby avoiding harmful distribution shift when test-time horizons exceed those seen during training.

Background

The paper analyzes why standard outcome-reward RL for LLM reasoning, typically trained on a fixed set of prompts with a bounded rollout length, struggles to extrapolate to much longer test-time horizons. When models reason beyond training budgets, distribution shift arises because generation proceeds from conditional distributions not encountered during training.
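This shift can be made concrete with a toy sketch (an illustrative setup, not the paper's experiments): a tabular model is "trained" only on rollouts truncated at a fixed length, so every generation step beyond that length conditions on a context that received zero training data.

```python
# Toy illustration of horizon distribution shift (hypothetical setup):
# training rollouts are capped at TRAIN_LEN, so test-time steps past that
# position condition on contexts never observed during training.
from collections import Counter

TRAIN_LEN, TEST_LEN = 5, 12
train_rollouts = [list(range(TRAIN_LEN)) for _ in range(100)]

seen = Counter()
for rollout in train_rollouts:
    for pos in range(len(rollout)):
        seen[pos] += 1  # contexts (indexed by position) seen in training

# Steps whose conditioning context has zero training coverage:
ood_steps = [pos for pos in range(TEST_LEN) if seen[pos] == 0]
print(ood_steps)  # positions 5..11 were never conditioned on in training
```

Here "position" stands in for the full conditional context; the point is only that more than half of the longer test rollout is spent off the training distribution.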

The authors note that if a model had learned a true "operator"—a reasoning procedure that composes across iterations—this distribution shift would be less problematic. However, they explicitly state uncertainty about whether RL, under common training constraints, can learn such operators from a finite, fixed prompt set, motivating their alternative decoding and training approach (Reasoning Cache).
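The "operator" intuition can be sketched with a classical fixed-point iteration (my own illustrative analogy, not the paper's construction): if each short step implements a procedure whose repeated application keeps improving the current state, then running more iterations than any training rollout still helps rather than drifting off-distribution.

```python
# Hypothetical sketch of an operator that composes across iterations:
# one Babylonian-method step toward sqrt(target). Each step is short-horizon,
# but chaining extra steps at "test time" continues to improve the answer.

def sqrt_operator(x, target):
    """One refinement step; repeated application converges to sqrt(target)."""
    return 0.5 * (x + target / x)

x = 1.0
TRAIN_ITERS = 3   # bounded rollout length seen during "training"
TEST_ITERS = 10   # longer test-time horizon

for _ in range(TEST_ITERS):
    x = sqrt_operator(x, 2.0)

print(abs(x - 2.0 ** 0.5) < 1e-9)  # True: extra iterations compose cleanly
```

The open question the paper raises is precisely whether outcome-reward RL on a finite prompt set induces a procedure with this compositional property, or merely a policy tuned to the bounded training horizon.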

References

While this form of distribution shift is not problematic if the model has learned a true "operator" that enables the chaining of behaviors to solve problems, it is unclear whether RL can learn such operators from a finite, fixed prompt set.

Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL (2602.03773, Wu et al., 3 Feb 2026), Section 4 (Problem Statement)