Reward-Based Pretraining (RPT)
- Reward-based Pretraining (RPT) is a method that reframes pretraining as a reinforcement learning task, replacing static objectives with dynamic, reward-driven interactions.
- It employs on-policy gradients, intrinsic and extrinsic rewards, and synthetic curricula to foster enhanced reasoning, compositional generalization, and performance.
- Empirical evaluations show that RPT improves factuality, sample efficiency, and convergence speed, offering a robust alternative to traditional supervised and preference-based pretraining.
Reward-based Pretraining (RPT) is a paradigm that replaces or augments standard supervised pretraining objectives (such as next-token prediction) with reinforcement learning (RL) objectives. RPT frameworks train models to maximize rewards derived from interaction with synthetic environments, intrinsic objectives, self-supervised signals, or learned reward models. The central claim is that RPT better equips models with generalizable reasoning skills and robust adaptability, outpacing models trained solely by maximizing log-likelihood on passive corpora (Han et al., 26 Feb 2025, Dong et al., 9 Jun 2025, Xing et al., 3 Dec 2025, Hatamizadeh et al., 26 Sep 2025, Li et al., 23 Sep 2025).
1. Formal Objectives and Theoretical Foundations
Reward-based Pretraining reframes the learning problem as RL in which the agent (model) interacts with an environment (synthetic, natural language, or multimodal), generating trajectories τ = (s_0, a_0, r_0, s_1, …) with associated rewards r_t. The goal is to maximize expected cumulative reward:
J(θ) = E_{τ ∼ π_θ} [ Σ_{t ≥ 0} γ^t r_t ],
where γ ∈ [0, 1] is a discount factor (Han et al., 26 Feb 2025). Typical implementations use on-policy gradients (REINFORCE/PPO) with advantage estimation and trust-region/KL regularization:
∇_θ J(θ) = E_{τ ∼ π_θ} [ Σ_t ∇_θ log π_θ(a_t | s_t) Â_t ] − β ∇_θ D_KL(π_θ ‖ π_ref)
(Han et al., 26 Feb 2025, Dong et al., 9 Jun 2025).
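As a concrete toy illustration of a KL-regularized policy-gradient update, the sketch below runs REINFORCE with an advantage baseline and a KL penalty toward a frozen reference policy on a synthetic four-action bandit. All names, rewards, and hyperparameters are illustrative, not taken from the cited papers.

```python
import numpy as np

# Minimal REINFORCE-with-KL sketch on a one-step, four-action bandit.
# Everything here (rewards, beta, lr) is illustrative, not from any paper.
rng = np.random.default_rng(0)
n_actions = 4
theta = np.zeros(n_actions)                # policy logits
theta_ref = np.zeros(n_actions)            # frozen reference policy (KL anchor)
rewards = np.array([0.0, 0.2, 1.0, 0.1])   # synthetic per-action reward
beta, lr = 0.05, 0.5                       # KL weight, learning rate

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(200):
    pi = softmax(theta)
    a = rng.choice(n_actions, p=pi)
    # Advantage = sampled reward minus a value baseline (mean reward under pi).
    adv = rewards[a] - pi @ rewards
    # Gradient of log pi(a) for a softmax policy: one-hot(a) - pi.
    grad_logp = -pi
    grad_logp[a] += 1.0
    # Gradient of KL(pi || pi_ref) w.r.t. the logits.
    pi_ref = softmax(theta_ref)
    log_ratio = np.log(pi / pi_ref)
    grad_kl = pi * (log_ratio - pi @ log_ratio)
    # Ascend the advantage-weighted objective, penalized by the KL term.
    theta += lr * (adv * grad_logp - beta * grad_kl)

print(int(np.argmax(softmax(theta))))  # converges to the high-reward arm
```

The KL term plays the trust-region role described above: it keeps the updated policy close to the reference while the advantage-weighted term pushes probability toward high-reward actions.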
Specific RPT instantiations vary by domain and reward construction:
- Next-token or next-segment correctness: Reward is given for exact matching of generated tokens or text segments to ground-truth from data (Dong et al., 9 Jun 2025, Li et al., 23 Sep 2025).
- Information-gain rewards: Reward equals the increase in log-likelihood of the target given an explicit chain-of-thought, relative to a no-think baseline (Hatamizadeh et al., 26 Sep 2025).
- Intrinsic rewards from discriminators or verifiers: Rewards are obtained from skill discriminators, preference models, or learned reward functions (Adeniji et al., 2022, Tan et al., 29 Jan 2026).
- Multi-reward conditioning: Conditioning generation on multiple reward signals to satisfy various criteria (Dufour et al., 29 Oct 2025).
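Two of the reward constructions above can be sketched as plain functions. The signatures and the zero-clipping in the information-gain variant are illustrative assumptions, not any paper's exact formulation.

```python
# Hedged sketches of two reward constructions from the list above.
# The log-probability inputs stand in for a language model's scores;
# no real model API is assumed.

def exact_match_reward(generated: str, reference: str) -> float:
    """Next-token/segment correctness: 1 if generation matches ground truth."""
    return 1.0 if generated.strip() == reference.strip() else 0.0

def info_gain_reward(logp_with_cot: float, logp_no_think: float) -> float:
    """Information gain: increase in target log-likelihood given an explicit
    chain of thought, relative to a no-think baseline (clipped at zero here,
    one common variant)."""
    return max(0.0, logp_with_cot - logp_no_think)

print(exact_match_reward("42", "42 "))   # 1.0 (whitespace-insensitive match)
print(info_gain_reward(-1.2, -2.5))      # ~1.3: thinking raised the likelihood
```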
A canonical divergence from supervised pretraining (SPT) is that RPT operates on policies that actively generate and explore, rather than passively model co-occurrence statistics of token sequences (Han et al., 26 Feb 2025).
2. Synthetic Task Curricula and Exploration Mechanisms
A key challenge in RPT, particularly when bootstrapping from scratch, is the "needle-in-a-haystack" exploration space of high-dimensional tasks such as natural language. To address this, curricula of synthetic environments are adopted:
- Symbolic arithmetic and logic: Tasks with minimal vocabulary and well-defined compositional rules (e.g., commutativity, increment-by-addition) (Han et al., 26 Feb 2025).
- Grid-world or programmatic logic tasks: Introducing control flows, simple algorithms, or code-like reasoning to build priors for stepwise inference (Han et al., 26 Feb 2025).
- Active mask selection: Policies jointly learn which segments or spans to mask/predict, focusing training on semantically rich, yet challenging, regions of text ("Reinforcement Active Pretraining", PretrainZero) (Xing et al., 3 Dec 2025).
This staged progression allows the model’s reasoning capabilities to advance from basic to higher-complexity regimes, and learned representations can be transferred to full-scale natural language tasks via architectural and embedding adaptation.
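The staged-curriculum idea can be sketched as a toy environment whose difficulty knob controls expression length, with reward for an exact answer. The interface is a hypothetical illustration, not the environment used in the cited work.

```python
import random

# Illustrative synthetic-curriculum environment: staged arithmetic tasks
# with reward for an exact answer. Stage k yields expressions of k+1 operands.
class ArithmeticCurriculum:
    def __init__(self, stage: int = 1, seed: int = 0):
        self.stage = stage
        self.rng = random.Random(seed)

    def sample_task(self):
        terms = [self.rng.randint(0, 9) for _ in range(self.stage + 1)]
        prompt = " + ".join(map(str, terms)) + " ="
        return prompt, sum(terms)

    def reward(self, answer: int, target: int) -> float:
        return 1.0 if answer == target else 0.0

env = ArithmeticCurriculum(stage=2)     # stage 2: three-operand sums
prompt, target = env.sample_task()
print(prompt, target)                    # e.g. a prompt like "6 + 0 + 4 ="
```

Advancing `stage` as the policy's reward saturates implements the basic-to-higher-complexity progression described above.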
3. Disentangling Reasoning and Knowledge: Memory-Augmented Architectures
RPT frameworks emphasize disentanglement of reasoning from knowledge storage, aiming to mitigate entanglement artifacts seen in SPT-trained models (e.g., spurious context-driven correlations). The recommended approach deploys:
- Restricted context windows: Limiting the "working memory" input to the reasoning network to K tokens (K = 8–16), reducing direct exploitation of global co-occurrence (Han et al., 26 Feb 2025).
- External semantic memory: A dedicated, differentiable, or RL-trained memory bank that stores key-value pairs of facts or computation traces. Information retrieval is performed by explicit read/write operations controlled by the policy (Han et al., 26 Feb 2025).
- Reasoner–retriever separation: The core reasoning module operates only on its local context and what is actively fetched; there is no implicit access to the entire history or corpus, enforcing a separation between reasoning and knowledge recall.
These constraints are posited to enforce generalizable, explicit reasoning strategies rather than context-dependent pattern matching.
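A minimal sketch of this reasoner–retriever separation, assuming a dictionary-backed memory bank and a hypothetical `reasoner_step` interface (both are illustrations, not the cited architecture):

```python
# Sketch: the reasoning module sees only a K-token working memory plus values
# it explicitly fetches from an external key-value store. No implicit access
# to global history is possible. Interfaces are illustrative assumptions.

K = 8  # restricted context window, per the K = 8-16 range above

class SemanticMemory:
    """External memory bank accessed only via explicit read/write ops."""
    def __init__(self):
        self.store: dict[str, str] = {}
    def write(self, key: str, value: str) -> None:
        self.store[key] = value
    def read(self, key: str):
        return self.store.get(key)

def reasoner_step(tokens: list, memory: SemanticMemory) -> list:
    window = tokens[-K:]                  # input capped at K tokens
    fetched = memory.read(window[-1])     # policy-controlled retrieval
    return window + ([fetched] if fetched else [])

mem = SemanticMemory()
mem.write("capital_of_france", "Paris")
ctx = ["...", "lookup:", "capital_of_france"]
print(reasoner_step(ctx, mem))  # local window plus the explicitly fetched fact
```

Because the reasoner never sees more than the window plus explicit reads, any fact it uses must pass through the retrieval interface, which is the separation the text posits.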
4. Reward Signal Construction and Multi-Modal Pretraining
Reward construction in RPT may combine multiple sources:
- Intrinsic and extrinsic rewards: Combining exploration novelty (e.g., Plan2Explore) and task-aligned rewards (e.g., language–image alignment via pretrained vision–language models) to bias exploration towards both novelty and semantic skills (Adeniji et al., 2023).
- Multi-reward conditioning: MIRO implements direct conditioning on a vector of reward scores (e.g., aesthetics, factuality, preference alignment) as an input alongside context, with corresponding embeddings injected into the transformer backbone. Binned or discretized reward values are embedded using sinusoidal representations and concatenated as tokens before the model’s attention stack (Dufour et al., 29 Oct 2025).
This approach lets the model learn user-aligned generation patterns during pretraining rather than deferring alignment to costly post-hoc RLHF or filtering.
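The binned, sinusoidally embedded reward tokens described above can be sketched as follows; the embedding dimension, bin count, and reward-axis names are assumptions, not MIRO's published values.

```python
import numpy as np

# Sketch of multi-reward conditioning: each reward score is discretized into
# a bin, given a sinusoidal embedding, and prepended as an extra token before
# the attention stack. Dimensions and binning here are illustrative.

D = 16       # embedding dimension (assumed)
N_BINS = 10  # discretization of reward values in [0, 1] (assumed)

def sinusoidal_embedding(idx: int, dim: int = D) -> np.ndarray:
    """Position-style sinusoidal embedding of a reward-bin index."""
    i = np.arange(dim // 2)
    freqs = idx / (10000 ** (2 * i / dim))
    return np.concatenate([np.sin(freqs), np.cos(freqs)])

def reward_tokens(scores: dict) -> np.ndarray:
    """One embedded token per reward axis (e.g. aesthetics, preference)."""
    bins = [min(int(s * N_BINS), N_BINS - 1) for s in scores.values()]
    return np.stack([sinusoidal_embedding(b) for b in bins])

toks = reward_tokens({"aesthetics": 0.9, "preference": 0.4})
print(toks.shape)  # (2, 16): two reward tokens to concatenate to the sequence
```

At inference time, conditioning on high-valued bins steers generation toward the corresponding criteria without any post-hoc RL step.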
5. Empirical Evaluation and Scaling Laws
Extensive benchmarks validate the advantages of RPT over SPT or post-hoc RL:
- Algorithmic and synthetic task generalization: Pure RPT-trained agents consistently outperform SPT or SFT→RFT baselines in seen and unseen environments (e.g., Go, vector-orthogonality, esoteric programming) (Han et al., 26 Feb 2025).
- Language modeling and reasoning: RPT-trained LLMs (Qwen-14B, Llama3-3.2B) show higher next-token or segment prediction accuracy (easy/medium/hard splits), with scaling curves following power-laws in compute and data volume, matching or exceeding known supervised pretraining scaling (Dong et al., 9 Jun 2025, Li et al., 23 Sep 2025, Hatamizadeh et al., 26 Sep 2025).
- Sample and reward efficiency: Methods such as Intrinsic Reward Matching in skill-sequenced RL yield 2–5× faster finetuning and solve long-horizon settings with zero-shot skill selection (Adeniji et al., 2022).
- Multimodal gains: In text-to-image generation, MIRO achieves dominant GenEval compositionality and user-preference metrics, with up to 19× faster convergence relative to single-objective baselines (Dufour et al., 29 Oct 2025).
- Practicality: RPT brings substantial improvements in factuality, safety, and generation quality of LLMs—e.g., a 36.2% relative gain in factuality and >86% win rate in overall text quality (Tan et al., 29 Jan 2026).
- Transfer and initialization: RPT establishes a superior initialization for downstream RL fine-tuning (RLHF, RLVR), compounding performance improvements and reducing the reliance on human annotation (Li et al., 23 Sep 2025).
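The power-law scaling claims above are typically checked by a linear fit in log-log space; the sketch below uses synthetic data points (not results from the cited papers) to show the procedure.

```python
import numpy as np

# A power law L = a * C^b is a straight line in log-log coordinates, so its
# exponent and scale can be recovered by linear regression on the logs.
compute = np.array([1e18, 1e19, 1e20, 1e21])  # training FLOPs (made up)
loss = 3.0 * compute ** -0.05                  # synthetic power-law data

b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
print(round(b, 3), round(np.exp(log_a), 3))    # -0.05 3.0 (exponent, scale)
```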
6. Limitations, Open Problems, and Future Directions
While RPT demonstrates empirical and methodological advantages, several key open questions and challenges remain:
- Exploration and bootstrap: Cold-starting RPT from scratch is impeded by sparsity in feasible reasoning behaviors; synthetic curricula and active learning mechanisms are engineered to address this, but scalability to open-domain language remains an active area (Han et al., 26 Feb 2025, Xing et al., 3 Dec 2025).
- Reward specification: The construction and reliability of reward models are critical; reward misspecification may lead to suboptimal behaviors or reward hacking. Multi-reward conditioning and classifier-free guidance can mitigate this, but more principled reward learning is needed (Dufour et al., 29 Oct 2025).
- Compute and efficiency: RPT to date generally increases computational cost owing to multiple on-policy rollouts, group inference, and dense reward estimation. Strategies for sparsifying rollouts, efficient off-policy learning, or hybridization with SPT are subjects of ongoing study (Hatamizadeh et al., 26 Sep 2025).
- Generalization and compositionality: While RPT encourages reasoning, measuring the full extent of compositional generalization remains challenging. Understanding transfer to tasks outside pretraining distribution requires more systematic benchmarks (Han et al., 26 Feb 2025).
Future research directions highlighted include development of richer, environment-invariant reward metrics, scalable unsupervised reward learning, reward-based pretraining across continuous task families (meta-RPT), and multi-modal extension (e.g., to code, image, or robotics domains) (Adeniji et al., 2022, Dufour et al., 29 Oct 2025, Adeniji et al., 2023).
7. Relationship to Supervised and Preference-Based Pretraining
RPT is conceptually distinct from both classical SPT and preference-based objectives:
- SPT (Supervised Pretraining): Maximizes the log-likelihood of observed tokens in a static corpus. Known to overfit shallow co-occurrence without supporting robust, adaptive reasoning (Han et al., 26 Feb 2025).
- Preference-based Pretraining (PHF/Conditional Training): Augments pretraining corpora with control tokens or auxiliary inputs corresponding to reward model outputs (e.g., human preference scores), training models to conditionally generate more desirable content. Such approaches reduce undesirable outputs by orders of magnitude but lack the on-policy exploration and trajectory optimization of RL-based pretraining (Korbak et al., 2023).
- RPT: Optimizes directly for cumulative return over adaptive, potentially long-horizon trajectories, incentivizing stepwise construction of solution traces and explicit deployment of reasoning modules.
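Conditional training of the PHF kind can be sketched as a tagging step applied to pretraining documents; the control-token names and threshold below are illustrative assumptions, not the cited paper's exact tokens.

```python
# Sketch of preference-based conditional training: each document is tagged
# with a control token derived from a reward-model score, so desirability
# becomes a conditioning signal the model learns during pretraining.
# Token names and the cutoff are illustrative assumptions.

GOOD, BAD = "<|good|>", "<|bad|>"
THRESHOLD = 0.5  # assumed reward-model cutoff

def tag_document(text: str, reward_score: float) -> str:
    """Prepend a control token reflecting the document's reward score."""
    token = GOOD if reward_score >= THRESHOLD else BAD
    return f"{token} {text}"

print(tag_document("The capital of France is Paris.", 0.9))
# At inference, prompting with the GOOD token steers generation toward
# preferred text; no on-policy rollouts are involved, which is the contrast
# with RPT drawn above.
```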
A plausible implication is that hybridization between these paradigms—e.g., multi-reward RPT, preference-and-reinforcement–driven pretraining, or self-supervised reward bootstrapping—may yield further gains in generalization, controllability, and alignment.
Key References:
- "General Intelligence Requires Reward-based Pretraining" (Han et al., 26 Feb 2025)
- "Reinforcement Pre-Training" (Dong et al., 9 Jun 2025)
- "Reinforcement Learning on Pre-Training Data" (Li et al., 23 Sep 2025)
- "PretrainZero: Reinforcement Active Pretraining" (Xing et al., 3 Dec 2025)
- "RLP: Reinforcement as a Pretraining Objective" (Hatamizadeh et al., 26 Sep 2025)
- "Self-Improving Pretraining: using post-trained models to pretrain better models" (Tan et al., 29 Jan 2026)
- "Skill-Based Reinforcement Learning with Intrinsic Reward Matching" (Adeniji et al., 2022)
- "MIRO: MultI-Reward cOnditioned pretraining improves T2I quality and efficiency" (Dufour et al., 29 Oct 2025)
- "Pretraining LLMs with Human Preferences" (Korbak et al., 2023)
- "Language Reward Modulation for Pretraining Reinforcement Learning" (Adeniji et al., 2023)