
General Intelligence Requires Reward-based Pretraining

Published 26 Feb 2025 in cs.LG | (2502.19402v3)

Abstract: LLMs have demonstrated impressive real-world utility, exemplifying artificial useful intelligence (AUI). However, their ability to reason adaptively and robustly -- the hallmarks of artificial general intelligence (AGI) -- remains fragile. While LLMs seemingly succeed in commonsense reasoning, programming, and mathematics, they struggle to generalize algorithmic understanding across novel contexts. Our experiments with algorithmic tasks in esoteric programming languages reveal that LLMs' reasoning overfits to the training data and is limited in its transferability. We hypothesize that the core issue underlying such limited transferability is the coupling of reasoning and knowledge in LLMs. To transition from AUI to AGI, we propose disentangling knowledge and reasoning through three key directions: (1) pretraining to reason using RL from scratch as an alternative to the widely used next-token prediction pretraining, (2) using a curriculum of synthetic tasks to ease the learning of a reasoning prior for RL that can then be transferred to natural language tasks, and (3) learning more generalizable reasoning functions using a small context window to reduce exploiting spurious correlations between tokens. Such a reasoning system coupled with a trained retrieval system and a large external memory bank as a knowledge store can overcome several limitations of existing architectures at learning to reason in novel scenarios.

Summary

  • The paper demonstrates that integrating reward-based pretraining can decouple reasoning from knowledge, leading to improved task generalization in LLMs.
  • Empirical results on esoteric programming tasks reveal that conventional supervised pretraining limits compositional reasoning, even with RL fine-tuning.
  • The proposed method employs synthetic task curricula and an architectural separation of reasoning and memory to foster robust, transferable AI reasoning.

Reward-based Pretraining as a Prerequisite for General Intelligence

Introduction and Motivation

The paper "General Intelligence Requires Reward-based Pretraining" (2502.19402) presents a critical analysis of the limitations of current LLM pretraining paradigms and posits that the path to AGI necessitates a fundamental shift: from passive, next-token prediction-based pretraining to reward-based pretraining (RPT) that explicitly disentangles reasoning from knowledge. The authors argue that the prevailing supervised pretraining (SPT) paradigm, which relies on large-scale next-token prediction over Internet corpora, inherently entangles knowledge acquisition and reasoning, leading to models that overfit to surface-level correlations and fail to generalize reasoning to novel domains. This is empirically demonstrated via algorithmic tasks in esoteric programming languages, where state-of-the-art LLMs, including those with RL-based post-training, exhibit poor transfer of reasoning.

Empirical Evidence: Reasoning vs. Knowledge Entanglement

The authors construct a benchmark using algorithmic tasks in esoteric languages (Brainf**k, Befunge) to isolate reasoning from memorization. Despite the simplicity of the tasks and the provision of full language specifications and in-context examples, leading LLMs (Llama 3.1, Qwen2.5, GPT-4o) achieve low accuracy, with only marginal improvements from increased in-context examples. Notably, the o1 model, which incorporates RL-based post-training, outperforms others but still fails to generalize robustly, especially on tasks requiring compositional reasoning (e.g., sorting, copying, Fibonacci generation). This result underscores the fragility of reasoning capabilities acquired via SPT and the inability of RL-based post-training to escape the local minima imposed by initial supervised objectives.
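The esoteric-language setup works because program semantics are fully specified, so model answers can be checked mechanically rather than by surface matching. As an illustration (not code from the paper), a minimal Brainf**k interpreter of the kind such a benchmark could use to verify a model's predicted program output:

```python
def run_bf(program: str, input_bytes: bytes = b"", max_steps: int = 100_000) -> bytes:
    """Minimal Brainf**k interpreter: 8 commands over a byte tape."""
    # Precompute matching-bracket positions for loop jumps.
    jumps, stack = {}, []
    for i, c in enumerate(program):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i

    tape, ptr, pc, in_pos, out = [0] * 30_000, 0, 0, 0, bytearray()
    steps = 0
    while pc < len(program) and steps < max_steps:
        c = program[pc]
        if c == ">":
            ptr += 1
        elif c == "<":
            ptr -= 1
        elif c == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".":
            out.append(tape[ptr])
        elif c == ",":
            tape[ptr] = input_bytes[in_pos] if in_pos < len(input_bytes) else 0
            in_pos += 1
        elif c == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # jump past the loop body
        elif c == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # jump back to the loop start
        pc += 1
        steps += 1
    return bytes(out)
```

With such a verifier, "did the model simulate the program correctly?" becomes an exact-match check on bytes, which is precisely what makes these tasks a clean probe of reasoning rather than recall.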

Theoretical Argument: Local Minima and Exploration Constraints

The central hypothesis is that SPT on passive data induces a local minimum in the reasoning space, constraining subsequent RL-based finetuning. The analogy to AlphaGo (SPT+RL) vs. AlphaZero (pure RL) is invoked: AlphaZero's RL-from-scratch approach enabled superior exploration and the discovery of novel strategies, unencumbered by the biases of human demonstration data. In the LLM context, next-token prediction allows models to exploit spurious correlations in the context window, learning brittle heuristics rather than generalizable reasoning algorithms. This is further evidenced by experiments in Go 9x9, where RPT agents consistently outperform SPT+RFT agents, and by synthetic mathematical reasoning tasks, where pure RFT yields better generalization than SFT-then-RFT, which overfits to training distributions.

Proposed Paradigm Shift: Three Pillars

1. Reward-based Pretraining for Reasoning

The authors advocate for integrating RL into the pretraining phase, using reward signals to directly incentivize the discovery of robust, step-by-step reasoning strategies. Unlike SPT, which is agnostic to intermediate reasoning traces, RPT can generate and reinforce reasoning trajectories that maximize task rewards. Empirical results in both Go and synthetic reasoning tasks demonstrate that RPT avoids the overfitting and exploration constraints of SPT, leading to superior generalization.
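To make the contrast concrete, here is a toy sketch (my illustration, not the paper's training code) of the RPT idea: a tabular REINFORCE learner that receives only a terminal reward for a fully correct trajectory, with no per-token supervised targets at all:

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def rpt_sketch(target=(2, 0, 1), vocab=3, episodes=5000, lr=0.5, seed=0):
    """REINFORCE on a toy task: reward 1 only if the whole sampled
    trajectory matches `target`, so the policy must discover a full
    solution rather than imitate token-level demonstrations."""
    rng = random.Random(seed)
    # One softmax distribution per output position (tabular policy).
    logits = [[0.0] * vocab for _ in target]
    baseline = 0.0
    for _ in range(episodes):
        traj = [rng.choices(range(vocab), softmax(logits[pos]))[0]
                for pos in range(len(target))]
        reward = 1.0 if tuple(traj) == target else 0.0
        baseline += 0.05 * (reward - baseline)  # running-mean baseline
        adv = reward - baseline
        for pos, a in enumerate(traj):          # policy-gradient step
            probs = softmax(logits[pos])
            for v in range(vocab):
                grad = (1.0 if v == a else 0.0) - probs[v]
                logits[pos][v] += lr * adv * grad
    # Greedy decode of the learned policy.
    return [max(range(vocab), key=lambda v: logits[pos][v])
            for pos in range(len(target))]
```

The target sequence, vocabulary size, and hyperparameters here are arbitrary; the point is only the training signal: trajectories are generated online and reinforced by task reward, not fit to a passive corpus.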

2. Synthetic Task Curricula for Efficient Exploration

A major challenge for RPT in language is the combinatorial explosion of the token space. The authors propose pretraining on synthetic tasks with reduced token spaces and controlled structural properties (e.g., logic games, algorithmic primitives) to efficiently acquire reasoning priors. These priors can then be transferred to natural language domains via curriculum learning and architectural adaptation. The analogy to self-supervised learning in vision is apt: the community iteratively refined proxy tasks until they surpassed supervised learning. The transferability of reasoning priors is supported by evidence from code pretraining and cognitive studies on the benefits of structured environments.
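A minimal sketch of what such a curriculum generator might look like (illustrative; the sorting task and stage schedule are my assumptions, not the paper's): each stage grows both the sequence length and the token space, so the agent first explores a small action space and then transfers to larger ones:

```python
import random

def make_curriculum(stages, examples_per_stage=4, seed=0):
    """Yield (stage_spec, batch) pairs for a toy 'sort the sequence'
    task, with sequence length and token-space size growing per stage."""
    rng = random.Random(seed)
    for length, vocab in stages:
        batch = []
        for _ in range(examples_per_stage):
            seq = [rng.randrange(vocab) for _ in range(length)]
            batch.append((seq, sorted(seq)))  # target is verifiable
        yield (length, vocab), batch

# Stages of increasing difficulty: (sequence length, token-space size).
stages = [(2, 3), (4, 5), (8, 10)]
```

Because each target is computable, every stage supplies an automatic reward signal, which is exactly the property RPT needs to scale exploration before moving to natural language.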

3. Architectural Decoupling of Knowledge and Reasoning

To further promote generalization, the authors propose an explicit architectural separation between a reasoning module (operating over a small context window) and an external memory bank (storing knowledge). This design is motivated by cognitive science (limited working memory, chunking) and addresses the tendency of long-context models to learn spurious correlations. The reasoning module interacts with memory via learned retrieval and write operations, trained with RL to optimize dynamic, multi-round memory access. This approach is contrasted with RAG and differentiable memory architectures, which either rely on static retrieval or suffer from optimization instabilities.
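The proposed separation can be caricatured in a few lines (a toy illustration, not the paper's architecture): a reasoner whose "context" holds only the current query chains multiple rounds of retrieval over an external key-value store, instead of attending over all knowledge at once:

```python
class ExternalMemory:
    """Key-value knowledge store, queried by a small-context reasoner."""

    def __init__(self, facts):
        self.store = dict(facts)

    def read(self, key):
        return self.store.get(key)  # None if the fact is absent

    def write(self, key, value):
        self.store[key] = value

def multi_hop(memory, start_key, hops):
    """Reasoner with a tiny working context: it holds only the current
    key, chaining reads instead of seeing every fact simultaneously."""
    key = start_key
    for _ in range(hops):
        key = memory.read(key)
        if key is None:
            return None
    return key
```

In the paper's proposal the read/write policy would itself be learned with RL; here the hard-coded chaining just shows why a small context plus dynamic multi-round access can answer compositional queries without memorizing correlations across a long window.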

Implementation Considerations

  • RPT Training: Requires online data collection and reward design. For language, this is non-trivial due to the vast token space and the need for automated reward signals (e.g., verifiers, programmatic correctness).
  • Synthetic Task Design: Tasks should be selected to cover core reasoning primitives (e.g., compositionality, abstraction, causal inference) and gradually increase in complexity and token space.
  • Transfer Mechanisms: Reasoning priors learned in synthetic domains can be transferred by preserving intermediate network layers and adapting input/output layers to new token spaces, leveraging techniques from transfer learning in vision and code-language transfer.
  • Memory-Reasoning Architecture: Requires the development of efficient, RL-trainable memory access mechanisms, possibly using discrete actions and curriculum learning to stabilize optimization.
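The transfer-mechanism bullet above can be sketched schematically (illustrative only; the layer names and dict layout are my assumptions): the intermediate layers carry the reasoning prior across domains, while the token-space-specific input and output layers are re-initialized for the new vocabulary:

```python
import random

def transfer_reasoning_prior(pretrained, new_vocab, dim, seed=0):
    """Keep the intermediate 'reasoning' layers from the synthetic-domain
    model; re-initialize only the layers tied to the old token space."""
    rng = random.Random(seed)

    def fresh(rows, cols):  # small random init for new layers
        return [[rng.gauss(0.0, 0.02) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "embed": fresh(new_vocab, dim),  # new input layer for new tokens
        "core": pretrained["core"],      # transferred reasoning prior
        "head": fresh(dim, new_vocab),   # new output layer
    }
```

This mirrors standard practice in vision transfer learning, where a backbone is kept and task-specific heads are swapped; the open question the paper raises is whether reasoning priors transfer as readily as perceptual ones.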

Implications and Future Directions

The paper's thesis has significant implications for the development of AGI:

  • Generalization: RPT and architectural decoupling are posited to yield models that can flexibly adapt reasoning to new knowledge domains, overcoming the brittleness of current LLMs.
  • Scalability: The computational demands of RPT are substantial, especially for large models and natural language. Efficient synthetic curricula and scalable RL algorithms are critical research directions.
  • Evaluation: New benchmarks that disentangle reasoning from knowledge are necessary to measure progress toward AGI.
  • Theoretical Foundations: The work challenges the assumption that scaling SPT alone suffices for general intelligence, instead emphasizing the need for explicit reasoning incentives and modular architectures.

Conclusion

This paper provides a rigorous critique of the SPT paradigm and presents a compelling case for reward-based pretraining as a prerequisite for general intelligence. By empirically demonstrating the limitations of current LLMs in reasoning transfer, and by proposing a concrete research agenda centered on RPT, synthetic curricula, and architectural decoupling, the authors chart a path toward more robust, generalizable AI systems. The practical realization of these ideas will require advances in RL for language, synthetic environment design, and modular neural architectures, but the theoretical and empirical arguments presented establish a strong foundation for future work in this direction.
