Golden Goose: Data Synthesis for RLVR

Updated 2 February 2026
  • Golden Goose is a data synthesis methodology that transforms reasoning-rich texts into verifiable RLVR tasks by masking key reasoning spans and generating distractors.
  • It enables the creation of the GooseReason-0.7M dataset, delivering improved RLVR performance across math, coding, STEM, and cybersecurity domains.
  • Empirical benchmarks show Golden Goose sustains accuracy gains in RL training by counteracting task saturation with tasks whose rewards can be verified automatically.

Golden Goose is a data synthesis methodology designed to address the data bottleneck in Reinforcement Learning with Verifiable Rewards (RLVR) for training LLMs in complex multi-step reasoning tasks. The method enables the automated creation of unlimited RLVR tasks from reasoning-dense but unverifiable internet text by leveraging a structured pipeline that masks key reasoning spans and generates diverse multiple-choice distractors, resulting in synthetically verifiable tasks suitable for RL training. Its most prominent application is the generation and validation of the GooseReason-0.7M dataset, which has shown sustained improvements in empirical benchmarks and broad applicability to specialized domains such as cybersecurity (Lu et al., 30 Jan 2026).

1. RLVR Context and Motivation

Reinforcement Learning with Verifiable Rewards (RLVR) forms the basis for fine-tuning LLMs to perform sophisticated reasoning, operationalized via the objective function

J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]

where \tau is a model output and R(\tau)\in\{0,1\} signifies correctness, automatically verified via simple checkers (e.g., symbolic formula parsers, code unit tests). The RLVR paradigm is constrained by the scarcity of tasks that admit fully automated verification. As model competence increases, finite collections of such tasks quickly saturate, producing diminishing learning signal and curtailing effective RL progression.
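The objective and its binary reward can be made concrete with a minimal sketch. The function names and the naive answer parsing below are illustrative assumptions, not the paper's checkers:

```python
# Sketch of the RLVR objective J(theta) = E_{tau ~ pi_theta}[R(tau)] with a
# binary reward from a simple automated checker. check_numeric_answer and
# estimate_J are hypothetical names for illustration only.

def check_numeric_answer(model_output: str, ground_truth: float, tol: float = 1e-6) -> int:
    """Binary reward R(tau) in {0, 1}: parse a final numeric answer and compare."""
    try:
        answer = float(model_output.strip().split()[-1])  # naive last-token parse
    except ValueError:
        return 0
    return int(abs(answer - ground_truth) <= tol)

def estimate_J(outputs: list[str], ground_truth: float) -> float:
    """Monte Carlo estimate of J(theta): mean binary reward over sampled outputs."""
    rewards = [check_numeric_answer(tau, ground_truth) for tau in outputs]
    return sum(rewards) / len(rewards)

samples = ["The answer is 42", "I think it is 41", "Final answer: 42.0"]
print(estimate_J(samples, 42.0))  # 2 of 3 correct -> 0.666...
```

Saturation, in these terms, is the regime where estimate_J approaches 1.0 on every available task, so the reward carries no gradient signal.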

2. Golden Goose Pipeline for RLVR Task Synthesis

Golden Goose introduces a four-step pipeline whereby arbitrary reasoning-rich text is transformed into verifiable, multiple-choice RLVR tasks, overcoming the saturation limitation.

  1. Initial Passage Extraction (Optional): For noisy or unstructured input (e.g., raw web scrapes), GPT-5 is prompted to extract or summarize a coherent, “educationally valuable” passage S'.
  2. Key Reasoning Span Identification and Masking: GPT-5 selects a single, contiguous multi-sentence span t \subset S' representing the crucial reasoning step. The passage is masked according to

S_{\rm mask} = M(S', t) = S'[\text{replace span } t \text{ by } [\text{MASK}]]

with A^* = t as the ground-truth answer.

  3. Distractor Generation: GPT-5 is prompted to generate k plausible yet incorrect distractors \{d_1,\ldots,d_k\} sampled from

\mathcal{D} \sim P_{\rm distractor}(\cdot \mid S_{\rm mask}, A^*)

  4. MCQ Assembly and Verification: The final multiple-choice question \mathcal{Q} = (S_{\rm mask}, \{A^*\} \cup \mathcal{D}) is presented in randomized order. Verification reduces to matching the selected candidate with A^*.
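The verification step at the end of the pipeline can be sketched as a binary reward over the assembled options. The function and variable names below are illustrative, not the paper's implementation:

```python
# Hedged sketch of MCQ verification: the reward is 1 iff the option the model
# selects is the ground-truth span A*. mcq_reward is a hypothetical name.

def mcq_reward(options: list[str], label_index: int, selected_index: int) -> int:
    """Binary RLVR reward for a synthesized MCQ: index match against A*."""
    assert 0 <= label_index < len(options)
    return int(selected_index == label_index)

options = ["the ratio tends to 1",            # distractor d_1
           "the terms are bounded by 1/n^2",  # ground truth A*
           "the terms alternate in sign"]     # distractor d_2
print(mcq_reward(options, label_index=1, selected_index=1))  # -> 1
print(mcq_reward(options, label_index=1, selected_index=0))  # -> 0
```

Because the check is an exact match on the chosen option, no symbolic parser or unit-test harness is needed, which is what makes arbitrary reasoning text verifiable.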

Algorithmic Pseudocode

Algorithm GoldenGooseSynthesizer(S, k):
  if is_noisy(S):
    S ← GPT5_summarize(S)
    if S=="": return []
  t ← GPT5_select_span(S)
  S_mask ← M(S, t)
  D ← {}
  for i in 1..k do
    d_i ← GPT5_generate_distractor(S_mask, t)
    D ← D ∪ {d_i}
  candidates ← shuffle([t] ∪ D)
  return (S_mask, candidates, label_index=argwhere(candidates==t))
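The pseudocode above can be rendered runnable by stubbing each GPT-5 call as a caller-supplied function. This is a sketch under that assumption, not the authors' implementation:

```python
# Runnable rendering of the GoldenGooseSynthesizer pseudocode, with every
# GPT-5 call replaced by a caller-supplied callable (an assumption made here
# purely so the sketch executes).

import random

def golden_goose_synthesize(s, k, select_span, generate_distractor,
                            summarize=None, is_noisy=lambda text: False, seed=0):
    """Convert reasoning-rich text s into a verifiable (k+1)-way MCQ, or None."""
    if is_noisy(s):
        s = summarize(s)                      # optional passage extraction
        if not s:
            return None
    t = select_span(s)                        # key reasoning span to mask
    s_mask = s.replace(t, "[MASK]", 1)
    distractors = []
    for _ in range(k):                        # k plausible but wrong candidates
        d = generate_distractor(s_mask, t)
        if d != t and d not in distractors:
            distractors.append(d)
    candidates = [t] + distractors
    random.Random(seed).shuffle(candidates)   # randomized option order
    return s_mask, candidates, candidates.index(t)

# Toy usage with deterministic stubs in place of GPT-5:
s_mask, candidates, label = golden_goose_synthesize(
    "Since 2 + 2 = 4, the total is even.", k=2,
    select_span=lambda text: "2 + 2 = 4",
    generate_distractor=lambda masked, span: "2 + 2 = 5")
print(candidates[label])  # -> 2 + 2 = 4
```

Tracking the label index through the shuffle is what keeps verification trivial downstream: the reward checker only needs an index comparison.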

3. Construction and Structure of GooseReason-0.7M

GooseReason-0.7M constitutes a 0.7 million-example MCQ corpus synthesized using Golden Goose, spanning mathematics, programming, and general STEM fields. Datasets are sourced from AoPS-Instruct, Olympiad math problems, rStar-Coder programming problems lacking unit tests, and MegaScience textbook extracts (physics, chemistry, biology, medicine, economics). Each example is a nine-way multiple choice instance (one ground truth, eight GPT-5-crafted distractors).

No post-filtering is applied to clean corpora; for noisy sources such as cybersecurity web scrapes, trivial tasks—where baseline models always succeed—are omitted. Empirically, over 70% of GooseReason examples remain RL-informative (i.e., exhibit both success and failure for current strong RL-trained models), contrasted with approximately 25% in previous ProRL datasets.

| Dataset | Size | Domain | Informative Rate |
|---|---|---|---|
| GooseReason-0.7M | 700,000 | Math, Code, STEM | >70% |
| ProRL (baseline) | -- | Math, Code, STEM | ~25% |
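The RL-informative criterion described above (an example is useful only if a current strong model both succeeds and fails on it across rollouts) can be sketched as follows; the per-example pass counts are hypothetical data:

```python
# Hedged sketch of the "RL-informative" rate: an example carries learning
# signal only if rollouts are neither all correct nor all wrong.

def informative_rate(pass_counts: list[int], n_rollouts: int) -> float:
    """Fraction of examples with at least one success and at least one failure."""
    informative = [c for c in pass_counts if 0 < c < n_rollouts]
    return len(informative) / len(pass_counts)

# 8 rollouts per task: tasks with 0/8 or 8/8 passes contribute no gradient.
print(informative_rate([0, 3, 8, 5, 8, 1], n_rollouts=8))  # -> 0.5
```

Under this metric, the >70% figure for GooseReason versus ~25% for ProRL is a direct measure of how much of each corpus still produces reward variance for training.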

4. Empirical Evaluation across Benchmarks

Golden Goose and the GooseReason-0.7M dataset have been extensively benchmarked for RLVR efficacy across 15 challenge sets. Notable results include:

  • ProRL-1.5B-v2 Saturation: Traditional RLVR training delivers limited gains (<1%). Augmentation with GooseReason-0.7M yields sustained improvements: math (+2.71%), code (+2.12%), STEM (+3.48%). GooseReason surpasses RLVE on STEM tasks (RLVE only +0.62%).
  • Qwen-4B-Instruct Saturation: Baseline stagnates or regresses after ~300 RL steps; incorporating GooseReason reverses this, with math (+2.18%), code (+2.24%), STEM (+2.40%), establishing state-of-the-art for 4B instruct models and narrowing the gap to 30B models.
  • Compute-Budgeted RL: Joint ProRL and GooseReason training on Qwen-4B-Instruct achieves higher accuracy at each RL step relative to ProRL alone.

Benchmarks include six mathematics sets (AIME24/25, AMC, MATH, Minerva, Olympiad), coding suites (APPS, CodeContests, CodeForces, TACO, HumanEvalPlus, LiveCodeBench), and STEM/instructional tests (GPQA Diamond, IFEval, Reasoning Gym splits).

5. Application in Cybersecurity: GooseReason-Cyber

A salient deployment of Golden Goose is the synthesis of RLVR tasks for cybersecurity, where suitable curated data previously did not exist. The pipeline operates on 180,000 raw FineWeb passages filtered using a “Primus-Seed.” Post-synthesis, Qwen-4B-Instruct is fine-tuned with GooseReason-Cyber for 100 RL steps and evaluated on CTI-MCQ (threat intelligence), CyberMetric (compliance, pen-testing), and SecEval (software/network security).

Significant results:

| Model | Pass@1 Accuracy | Δ vs Base Model |
|---|---|---|
| Qwen-4B-Instruct (base) | 74.55% | +0.00 pts |
| +GooseReason-Cyber | 78.99% | +4.44 pts |
| Llama-Primus-Instr (7B) | -- | +1.28 pts |
| Llama-Primus-Merged (7B) | -- | +1.44 pts |

Qwen-4B-Instruct trained on GooseReason-Cyber outperforms a 7B-domain specialized model with extensive pre/post-training, despite only 4B parameters and comparatively minimal RL steps.

6. Implications and Future Directions

Golden Goose demonstrates that automated synthesis of RLVR tasks from otherwise unverifiable corpora enables scalable RLVR, addressing data bottlenecks impeding LLMs’ reasoning capabilities. The systematic masking and distractor generation pipeline yields robust RL-informative tasks at scale, with empirical evidence for sustained learning beyond traditional task saturation. Deployment in specialized domains (e.g., cybersecurity) validates the extensibility of the approach.

A plausible implication is that broader adoption of similar pipelines could generalize RLVR training to any reasoning-rich domain for which curated, verifiable data is sparse or impractical to construct manually. The continued availability of RL-informative tasks suggests ongoing improvements to LLMs are possible, conditional on further advances in automated passage summarization and distractor crafting.

In summary, Golden Goose constitutes a scalable, automated methodology for converting arbitrary reasoning-rich text into verifiable RL tasks, circumventing saturation issues and facilitating sustained progress in RLVR across both general scientific and domain-specific contexts (Lu et al., 30 Jan 2026).
