
GooseReason-0.7M: RLVR Task Dataset

Updated 2 February 2026
  • GooseReason-0.7M is a large-scale dataset that transforms unverifiable texts into verifiable RLVR multiple-choice tasks.
  • It uses formal fill-in-the-middle masking and beam search to generate plausible distractors across math, coding, and STEM.
  • The dataset enables continual RL training for LLMs, showing measurable performance gains on key reasoning benchmarks.

GooseReason-0.7M is a large-scale dataset of Reinforcement Learning with Verifiable Rewards (RLVR) tasks, synthesized from unverifiable internet text using the fill-in-the-middle transformation and multiple-choice question construction. Developed via the Golden Goose methodology, GooseReason-0.7M comprises over 700,000 MCQs spanning mathematics, programming, and general science, and provides a scalable mechanism for continual RL-based training of LLMs (Lu et al., 30 Jan 2026).

1. Formal Task Construction and Transformation

The foundation of GooseReason-0.7M is a formalized fill-in-the-middle multiple-choice transformation. Given a source text $T = \{t_1, t_2, \dots, t_n\}$ (e.g., a textbook paragraph, solution writeup, or code answer), a contiguous span $S = \{t_i, t_{i+1}, \dots, t_j\} \subseteq T$ containing the "key reasoning steps" is identified. The masking function $f_{\mathrm{mask}}(T, S)$ replaces $S$ in $T$ with the placeholder "__?", producing $T_{\mathrm{mask}}$; $S$ serves as the ground-truth answer.

A set of $k$ plausible but incorrect distractors $\mathcal{D} = \{d_1, \dots, d_k\}$ is then generated, resulting in the MCQ $\mathcal{Q} = (T_{\mathrm{mask}}, \{S\} \cup \mathcal{D})$. The model policy $\pi_\theta$ selects from these candidates, receiving the binary reward

$$r(\mathcal{Q}, \hat{S}) = \begin{cases} 1 & \text{if } \hat{S} = S \\ 0 & \text{otherwise} \end{cases}$$

This formalism enables the conversion of unverifiable free-text into RLVR tasks with clear verifiability via exact-match, circumventing the traditional bottleneck of limited supervised data.
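The transformation and reward above can be sketched in a few lines of Python. This is an illustrative sketch of the formalism, not the paper's code; token-level indexing and the "__?" placeholder follow the description above:

```python
import random

def f_mask(tokens, i, j, placeholder="__?"):
    """Replace the span tokens[i:j] with a placeholder; return the masked
    text T_mask and the ground-truth span S (a sketch of f_mask)."""
    span = " ".join(tokens[i:j])
    masked = " ".join(tokens[:i] + [placeholder] + tokens[j:])
    return masked, span

def build_mcq(tokens, i, j, distractors, seed=0):
    """Assemble the MCQ Q = (T_mask, {S} ∪ D) with shuffled options."""
    masked, answer = f_mask(tokens, i, j)
    options = [answer] + list(distractors)
    random.Random(seed).shuffle(options)
    return {"question": masked, "options": options, "answer": answer}

def reward(mcq, chosen):
    """Binary exact-match reward r(Q, S_hat)."""
    return 1 if chosen == mcq["answer"] else 0
```

Because verification reduces to an exact-match comparison against the removed span, any free text with identifiable key steps becomes a reward-bearing RL task.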

2. Dataset Generation via GPT-5 Prompting Pipeline

Synthesis of GooseReason-0.7M employs GPT-5 as a data-generating oracle, utilizing distinct prompt templates for reasoning, coding, and cybersecurity domains. For clean reasoning or programming tasks, templates instruct GPT-5 to (i) identify and mask key steps/lines, (ii) treat the removed content as ground-truth, and (iii) generate at least ten stylistically plausible but incorrect distractors. Cybersecurity and noisy domains incorporate an additional passage extraction phase to ensure coherence.

Distractor generation leverages beam search: the top 20 candidates from GPT-5 are sampled, then uniformly subsampled to nine, yielding ten options in total for MCQ construction. Filtering then excludes tasks that a frozen student model solves on all 16 rollout seeds (too easy) or on none of them (too hard), restricting the dataset to a "medium-difficulty" regime.
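The subsampling and difficulty-filtering steps can be sketched as follows. This is a minimal sketch under the stated counts (top-20 candidates, nine distractors, 16 rollout seeds); function names are illustrative, not from the paper:

```python
import random

def subsample_distractors(candidates, k=9, seed=0):
    """Uniformly subsample k distractors from the top beam-search
    candidates (top 20 from GPT-5 in the described pipeline)."""
    return random.Random(seed).sample(candidates, k)

def keep_task(rollout_rewards, n_seeds=16):
    """Difficulty filter: drop tasks the frozen student solves on all
    16 rollout seeds (too easy) or on none of them (too hard), keeping
    only the medium-difficulty regime."""
    assert len(rollout_rewards) == n_seeds
    solved = sum(rollout_rewards)
    return 0 < solved < n_seeds
```

The nine surviving distractors plus the ground-truth span give the ten options per MCQ.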

| Template | Domain | Core Instruction |
|---|---|---|
| Template A | Reasoning/QA | Mask key steps, generate text distractors |
| Template B | Programming | Mask lines, generate code distractors |
| Template C | Cybersecurity | Extract passage, mask, generate distractors |
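A Template-A style prompt might be rendered as below. The exact GPT-5 prompt wording is not reproduced in this summary; this is a hypothetical paraphrase of the three instructions (mask, treat as ground truth, generate ≥10 distractors):

```python
# Hypothetical paraphrase of Template A; the exact prompt text used with
# GPT-5 is not given in this summary.
TEMPLATE_A = (
    "You are given a reasoning text.\n"
    "1. Identify the key reasoning steps and replace them with \"__?\".\n"
    "2. Treat the removed span as the ground-truth answer.\n"
    "3. Write at least 10 stylistically plausible but incorrect distractors.\n"
    "\n"
    "Text:\n{text}\n"
)

def render_template_a(text: str) -> str:
    """Fill the source text into the (hypothetical) Template A prompt."""
    return TEMPLATE_A.format(text=text)
```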

3. Task Corpus Composition and Domain Structure

GooseReason-0.7M's corpus is assembled by applying the MCQ transformation across three reasoning-rich sources:

  • AoPS-Instruct (approx. 600K olympiad-level math posts),
  • rStar-Coder synthetic_sft (approx. 380K coding prompts lacking conventional test cases),
  • MegaScience textbook QA (approx. 650K general STEM Q&A entries).

Post-masking and filtering, the final task split is:

  • Mathematics: 320,000 tasks
  • Programming: 200,000 tasks
  • General Science (STEM): 180,000 tasks

Example tasks include mathematics solutions with masked key derivations, programming functions with masked loop implementations, and scientific passages with core reasoning omitted and converted into MCQs with distractors.

4. RL Protocol and Training Methodology

Continual RL training adopts the ProRLv2 recipe, utilizing clipped GRPO (Group Relative Policy Optimization) with decoupled advantage normalization. The LLM policy $\pi_\theta$ acts over MCQ tasks, with the RL objective:

$$\mathcal{L}(\theta) = \mathbb{E}_{\tau}\left[\min\left(r_\theta(\tau)\,A(\tau),\ \operatorname{clip}\left(r_\theta(\tau),\, 1-\varepsilon_{\mathrm{low}},\, 1+\varepsilon_{\mathrm{high}}\right)A(\tau)\right)\right]$$

where $r_\theta(\tau) = \pi_\theta(\tau) / \pi_{\mathrm{old}}(\tau)$ is the importance ratio and $A(\tau)$ is the advantage, normalized batch-wise after group-wise mean subtraction (REINFORCE++).
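The advantage computation and clipped surrogate can be sketched with the stated hyperparameters. This is a minimal, framework-free sketch: each task's 16 rollouts form one group, and the decoupled clip range $(\varepsilon_{\mathrm{low}}, \varepsilon_{\mathrm{high}}) = (0.1, 0.2)$ is used:

```python
from statistics import pstdev

def grpo_advantages(groups):
    """Group-wise mean subtraction (the rollouts of one task form a
    group), followed by batch-wise normalization (REINFORCE++ style)."""
    centered = []
    for rewards in groups:
        mean = sum(rewards) / len(rewards)
        centered.extend(r - mean for r in rewards)
    std = pstdev(centered) or 1e-8  # guard against an all-equal batch
    return [a / std for a in centered]

def clipped_grpo_loss(ratios, advantages, eps_low=0.1, eps_high=0.2):
    """Clipped surrogate with the decoupled clip range from the ProRLv2
    recipe; returns the negative objective (a loss to minimize)."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1 - eps_low), 1 + eps_high)
        terms.append(min(r * a, clipped * a))
    return -sum(terms) / len(terms)
```

With binary rewards, group-wise mean subtraction gives zero advantage to tasks the policy always or never solves, which is why the medium-difficulty filter above matters for the gradient signal.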

Key parameters include:

  • 16 rollouts per task
  • Clip range: $\varepsilon_{\mathrm{low}} = 0.1$, $\varepsilon_{\mathrm{high}} = 0.2$
  • Batch size: 256 tasks per update
  • RL steps: 759 for both ProRL-1.5B and Qwen-4B (333 base + 156 + 270)
  • Compute: ~1,100 H100 GPU-hours for continued training

5. Empirical Performance on Reasoning Benchmarks

Interleaving GooseReason-0.7M tasks with existing RLVR datasets overcomes data saturation, yielding sustained performance improvements across fifteen benchmarks. Metrics from prolonged model training include:

  • ProRL-1.5B-v2 (1,100 GPU h):
    • Mathematics: +2.71% (baseline +0.63%)
    • Programming: +2.12% (baseline +0.95%)
    • STEM: +3.48% (baseline +0.13%)
  • Qwen-4B-Instruct:
    • Math: +2.18% (baseline −1.29%)
    • Code: +2.24% (baseline +0.43%)
    • STEM: +2.40% (baseline −1.52%)

On aggregate, GooseReason-4B-Instruct attains new state-of-the-art performance for 4B-scale models and surpasses Qwen3-30B in selected domains, demonstrating transferability from MCQ training to open-ended reasoning tasks. With a constrained 200-step budget, blends including GooseReason reach notable milestones earlier (e.g., 80% math accuracy at 150 steps, compared to 200 steps for ProRL-only data).

6. Specialized Domain Deployment: GooseReason-Cyber

GooseReason methodology enables domain extension into fields lacking RLVR resources. For cybersecurity, a live deployment involved scraping approximately 5 million FineWeb pages flagged by Primus-Seed, applying Template C (passage extraction, span masking, distractor generation), and filtering for difficulty to produce 180,000 GooseReason-Cyber tasks.

Training Qwen-4B-Instruct for 100 RL steps on GooseReason-Cyber resulted in performance increases across several benchmarks:

| Benchmark | Qwen-4B-Instruct | GooseReason-Cyber | Δ |
|---|---|---|---|
| CTI-MCQ | 63.44% | 73.79% | +10.35 |
| CyberMetric | 89.78% | 92.05% | +2.27 |
| SecEval | 70.44% | 71.14% | +0.70 |
| Avg | 74.55% | 78.99% | +4.44 |

This approach outperforms Llama-Primus-Instruct (8B, extensively pre-/post-trained) by 11.5 percentage points, affirming the generalizability and scalability of RLVR task synthesis via GooseReason in zero-prior specialized domains.

7. Significance and Implications

GooseReason-0.7M constitutes a major advance in RLVR data generation, leveraging abundant unverifiable reasoning-rich text to synthesize structured, high-verifiability MCQ tasks at scale. The simplicity of the fill-in-the-middle mask transform $f_{\mathrm{mask}}$, clarity of prompting templates, and automated distractor regime facilitate robust, continual RL for LLMs beyond conventional saturation, with demonstrable transfer to open-ended reasoning and domain generalization, including real-world application in cybersecurity where labeled RLVR data was previously unavailable.

A plausible implication is that such automated RLVR synthesis pipelines may be extensible to additional domains by iteratively refining passage extraction and masking strategies, further enhancing LLM reasoning capabilities across specialized fields (Lu et al., 30 Jan 2026).
