GooseReason-0.7M: RLVR Task Dataset
- GooseReason-0.7M is a large-scale dataset that transforms unverifiable texts into verifiable RLVR multiple-choice tasks.
- It uses formal fill-in-the-middle masking and beam search to generate plausible distractors across math, coding, and STEM.
- The dataset enables continual RL training for LLMs, showing measurable performance gains on key reasoning benchmarks.
GooseReason-0.7M is a large-scale dataset of Reinforcement Learning with Verifiable Rewards (RLVR) tasks, synthesized from unverifiable internet text using the fill-in-the-middle transformation and multiple-choice question construction. Developed via the Golden Goose methodology, GooseReason-0.7M comprises over 700,000 MCQs spanning mathematics, programming, and general science, and provides a scalable mechanism for continual RL-based training of LLMs (Lu et al., 30 Jan 2026).
1. Formal Task Construction and Transformation
The foundation of GooseReason-0.7M is a formalized fill-in-the-middle multiple-choice transformation. Given a source text $x$ (e.g., a textbook paragraph, solution writeup, or code answer), a contiguous span $s$ containing the "key reasoning steps" is identified. The masking function replaces $s$ in $x$ with a placeholder "__?", producing the masked text $\tilde{x}$; $s$ serves as the ground-truth answer.
A set of plausible but incorrect distractors $\{d_1, \dots, d_9\}$ is then generated, resulting in the MCQ $(\tilde{x}, \{s, d_1, \dots, d_9\})$. Model policies select a candidate $\hat{s}$ from these options, receiving a binary reward signal

$$r = \mathbb{1}[\hat{s} = s].$$
This formalism enables the conversion of unverifiable free-text into RLVR tasks with clear verifiability via exact-match, circumventing the traditional bottleneck of limited supervised data.
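The transformation can be sketched in Python. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the function names (`mask_span`, `build_mcq`, `reward`) and the toy example are invented here; only the shape of the task (masked text, ten shuffled options, exact-match binary reward) follows the source.

```python
import random

def mask_span(text: str, span: str, placeholder: str = "__?") -> str:
    """Replace the key-reasoning span s in source text x with a placeholder."""
    assert span in text, "span must occur contiguously in the source text"
    return text.replace(span, placeholder, 1)

def build_mcq(text: str, span: str, distractors: list[str], seed: int = 0) -> dict:
    """Construct a 10-way MCQ: masked text plus shuffled answer options."""
    options = [span] + distractors[:9]  # 1 ground-truth span + 9 distractors
    rng = random.Random(seed)
    rng.shuffle(options)
    return {"question": mask_span(text, span), "options": options, "answer": span}

def reward(choice: str, answer: str) -> int:
    """Binary RLVR reward: exact match against the ground-truth span."""
    return int(choice == answer)

# Toy example (not from the dataset): mask the key derivation step.
task = build_mcq(
    "The derivative of x^2 is 2x, so the slope at x=3 is 6.",
    "2x",
    [f"distractor_{i}" for i in range(9)],
)
```

Exact-match grading is what makes the task verifiable: the reward needs no judge model, only string equality against the masked span.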
2. Dataset Generation via GPT-5 Prompting Pipeline
Synthesis of GooseReason-0.7M employs GPT-5 as a data-generating oracle, utilizing distinct prompt templates for reasoning, coding, and cybersecurity domains. For clean reasoning or programming tasks, templates instruct GPT-5 to (i) identify and mask key steps/lines, (ii) treat the removed content as ground-truth, and (iii) generate at least ten stylistically plausible but incorrect distractors. Cybersecurity and noisy domains incorporate an additional passage extraction phase to ensure coherence.
Distractor generation leverages beam search: the top 20 candidates from GPT-5 are sampled, then uniformly subsampled to nine, yielding ten options (one ground truth plus nine distractors) per MCQ. Filtering then excludes tasks that a frozen student model either solves on all 16 rollout seeds or fails on all 16, restricting the dataset to a "medium-difficulty" regime.
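A minimal sketch of the subsampling and difficulty-filter logic described above. The function names and the per-rollout binary-reward representation are assumptions made for illustration:

```python
import random

def subsample_distractors(candidates: list[str], k: int = 9, seed: int = 0) -> list[str]:
    """Uniformly subsample k distractors from the top-20 beam-search candidates."""
    rng = random.Random(seed)
    return rng.sample(candidates, k)

def keep_task(rollout_rewards: list[int]) -> bool:
    """Medium-difficulty filter: drop tasks the frozen student model solves on
    all 16 rollout seeds (too easy) or fails on all 16 (too hard)."""
    assert len(rollout_rewards) == 16
    return 0 < sum(rollout_rewards) < 16
```

The all-pass/all-fail filter is what keeps the reward signal informative: tasks with zero reward variance across rollouts contribute no learning signal under group-relative RL.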
| Template | Domain | Core Instruction |
|---|---|---|
| Template A | Reasoning/QA | Mask key steps, generate text distractors |
| Template B | Programming | Mask lines, generate code distractors |
| Template C | Cybersecurity | Extract passage, mask, generate distractors |
3. Task Corpus Composition and Domain Structure
GooseReason-0.7M's corpus is assembled by applying the MCQ transformation across three reasoning-rich sources:
- AoPS-Instruct (approx. 600K olympiad-level math posts),
- rStar-Coder synthetic_sft (approx. 380K coding prompts absent conventional testcases),
- MegaScience textbook QA (approx. 650K general STEM Q&A entries).
Post-masking and filtering, the final task split is:
- Mathematics: 320,000 tasks
- Programming: 200,000 tasks
- General Science (STEM): 180,000 tasks
Example tasks include mathematics solutions with masked key derivations, programming functions with masked loop implementations, and scientific passages with core reasoning omitted and converted into MCQs with distractors.
4. RL Protocol and Training Methodology
Continual RL training adopts the ProRLv2 recipe, utilizing clipped GRPO (Group Relative Policy Optimization) with decoupled advantage normalization. The LLM policy $\pi_\theta$ acts over MCQ tasks, with the RL objective

$$J(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(\rho_i \hat{A}_i,\; \operatorname{clip}(\rho_i,\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}})\, \hat{A}_i\right)\right],$$

where $\rho_i = \pi_\theta(o_i \mid q) / \pi_{\theta_{\text{old}}}(o_i \mid q)$ is the importance ratio and the advantage $\hat{A}_i$ is normalized batch-wise after group-wise mean subtraction (REINFORCE++).
Key parameters include:
- 16 rollouts per task
- Asymmetric clip range $(\epsilon_{\text{low}},\, \epsilon_{\text{high}})$
- Batch size: 256 tasks per update
- RL steps: 759 for both ProRL-1.5B and Qwen-4B (333 base + 156 + 270)
- Compute: ~1,100 H100 GPU-hours for continued training
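The advantage computation and clipped objective above can be sketched as follows. This is a schematic under stated assumptions: the exact clip values $\epsilon_{\text{low}}$, $\epsilon_{\text{high}}$ are not reproduced here and are left as parameters, and the functions operate on toy arrays rather than real policy log-probabilities:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """REINFORCE++-style advantages: subtract each task's group mean over its
    rollouts, then normalize across the whole batch.

    rewards: shape (batch_tasks, rollouts_per_task), binary MCQ rewards."""
    centered = rewards - rewards.mean(axis=1, keepdims=True)  # group-wise mean subtraction
    return (centered - centered.mean()) / (centered.std() + eps)  # batch-wise normalization

def clipped_objective(logp_new: np.ndarray, logp_old: np.ndarray,
                      adv: np.ndarray, eps_low: float, eps_high: float) -> float:
    """Clipped surrogate with decoupled (asymmetric) clip range."""
    ratio = np.exp(logp_new - logp_old)               # importance ratio rho_i
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return float(np.minimum(ratio * adv, clipped * adv).mean())
```

Because binary rewards are centered within each 16-rollout group before batch normalization, a task the model always (or never) solves contributes zero advantage, which is why the medium-difficulty filter matters.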
5. Empirical Performance on Reasoning Benchmarks
Interleaving GooseReason-0.7M tasks with existing RLVR datasets overcomes data saturation, yielding sustained performance improvements across fifteen benchmarks. Metrics from prolonged model training include:
- ProRL-1.5B-v2 (1,100 GPU h):
- Mathematics: +2.71% (baseline +0.63%)
- Programming: +2.12% (baseline +0.95%)
- STEM: +3.48% (baseline +0.13%)
- Qwen-4B-Instruct:
- Math: +2.18% (baseline −1.29%)
- Code: +2.24% (baseline +0.43%)
- STEM: +2.40% (baseline −1.52%)
On aggregate, GooseReason-4B-Instruct attains new state-of-the-art performance for 4B-scale models and surpasses Qwen3-30B in selected domains, demonstrating transferability from MCQ training to open-ended reasoning tasks. With a constrained 200-step budget, blends including GooseReason reach notable milestones earlier (e.g., 80% math accuracy at 150 steps, compared to 200 steps for ProRL-only data).
6. Specialized Domain Deployment: GooseReason-Cyber
GooseReason methodology enables domain extension into fields lacking RLVR resources. For cybersecurity, a live deployment involved scraping approximately 5 million FineWeb pages flagged by Primus-Seed, applying Template C (passage extraction, span masking, distractor generation), and filtering for difficulty to produce 180,000 GooseReason-Cyber tasks.
Training Qwen-4B-Instruct for 100 RL steps on GooseReason-Cyber resulted in performance increases across several benchmarks:
| Benchmark | Qwen-4B-Instruct | GooseReason-Cyber | Δ |
|---|---|---|---|
| CTI-MCQ | 63.44% | 73.79% | +10.35 |
| CyberMetric | 89.78% | 92.05% | +2.27 |
| SecEval | 70.44% | 71.14% | +0.70 |
| Avg | 74.55% | 78.99% | +4.44 |
This approach outperforms Llama-Primus-Instruct (8B, extensively pre-/post-trained) by 11.5 percentage points, affirming the generalizability and scalability of RLVR task synthesis via GooseReason in zero-prior specialized domains.
7. Significance and Implications
GooseReason-0.7M constitutes a major advance in RLVR data generation, leveraging abundant unverifiable reasoning-rich text to synthesize structured, high-verifiability MCQ tasks at scale. The simplicity of the fill-in-the-middle mask transform, the clarity of the prompting templates, and the automated distractor regime facilitate robust, continual RL for LLMs beyond conventional saturation, with demonstrable transfer to open-ended reasoning and domain generalization, including real-world application in cybersecurity where labeled RLVR data was previously unavailable.
A plausible implication is that such automated RLVR synthesis pipelines may be extensible to additional domains by iteratively refining passage extraction and masking strategies, further enhancing LLM reasoning capabilities across specialized fields (Lu et al., 30 Jan 2026).