
Improving Parametric Knowledge Access in Reasoning Language Models

Published 25 Feb 2026 in cs.CL | (2602.22193v1)

Abstract: We study reasoning for accessing world knowledge stored in a LLM's parameters. For example, recalling that Canberra is Australia's capital may benefit from thinking through major cities and the concept of purpose-built capitals. While reasoning LLMs are trained via reinforcement learning to produce reasoning traces on tasks such as mathematics, they may not reason well for accessing their own world knowledge. We first find that models do not generate their best world knowledge reasoning by default: adding a simple "think step-by-step" cue demonstrates statistically significant improvement in knowledge recall but not math. Motivated by this, we propose training models to reason over their parametric knowledge using world-knowledge question answering as a verifiable reward. After reinforcement learning on TriviaQA (+9.9%), performance also improves on Natural Questions, HotpotQA, SimpleQA, and StrategyQA by 4.2%, 2.1%, 0.6%, and 3.0%, respectively. Reasoning models are under-optimized for parametric knowledge access, but can be easily trained to reason better.

Summary

  • The paper demonstrates that RLVR significantly enhances factual recall in language models, with marked improvements in EM and Ex-Recall scores on benchmarks like TriviaQA.
  • It compares step-by-step prompting with reinforcement learning, revealing that RL-based optimization yields substantially superior outcomes in accessing stored parametric knowledge.
  • The study finds that the step-by-step cue boosts knowledge recall but not mathematical reasoning, highlighting the value of domain-specific training strategies.

Reinforcement Learning for Enhanced Parametric Knowledge Recall in LLMs

Problem Scope and Motivation

LLMs have demonstrated strong capabilities on multi-step reasoning benchmarks in mathematics and code as a result of reinforcement learning from verifiable rewards (RLVR). It remains unclear, however, whether these models effectively use their parametric (internal) world knowledge on closed-book factual recall tasks. The paper "Improving Parametric Knowledge Access in Reasoning Language Models" (2602.22193) systematically investigates this gap, evaluating standard prompting, explicit reasoning cues, and RL-based training as ways to improve models' ability to reason over and access their stored knowledge.

Empirical Assessment of Reasoning Prompting on Knowledge Recall

Initial experiments evaluate whether prompting models to "think step-by-step" improves factual recall compared to baseline completions. Four models—DeepSeek-R1-Distill-Qwen-1.5B, Olmo-3-7B-Think, GPT-OSS-20B, and GPT-5.2—are assessed on two closed-book QA datasets (TriviaQA, Natural Questions) and a math benchmark (MATH). The results reveal that the step-by-step cue produces measurable but modest gains in Extracted-Recall for knowledge recall tasks (e.g., +1.1% on TriviaQA and +1.3% on Natural Questions for GPT-OSS-20B), while not benefiting—and in some cases slightly reducing—math task performance where RLVR has already instilled robust reasoning routines.

These observations underscore that, in contrast to their behavior on mathematics benchmarks, current reasoning-tuned models do not default to their best reasoning on knowledge recall tasks, and that an explicit reasoning cue can elicit part of the latent capacity in their parameters.

Figure 1: Left: GPT-OSS-20B performance gains across prompt configurations and post-RLVR; Right: Post-RLVR model's sample reasoning trace demonstrates improved factual reasoning for Natural Questions.
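The glossary notes that the cue's gains are checked for significance with McNemar's test on paired predictions. The following is a minimal sketch of the exact (binomial) form of that test, assuming per-question correctness flags for the same questions with and without the cue. The function name and the toy data are illustrative and are not taken from the paper.

```python
from math import comb


def mcnemar_exact(base_correct, cue_correct):
    """Two-sided exact McNemar test on paired correctness flags.

    Returns (b, c, p_value), where b counts questions only the baseline
    got right and c counts questions only the cued run got right.
    """
    if len(base_correct) != len(cue_correct):
        raise ValueError("paired lists must have equal length")
    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(base_correct, cue_correct) if x and not y)
    c = sum(1 for x, y in zip(base_correct, cue_correct) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis, discordant pairs split 50/50.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (hypothetical): 5 both right, 1 baseline only, 9 cue only, 5 both wrong.
base = [True] * 5 + [True] * 1 + [False] * 9 + [False] * 5
cue = [True] * 5 + [False] * 1 + [True] * 9 + [False] * 5
b, c, p = mcnemar_exact(base, cue)
```

On this toy data the test sees 1 versus 9 discordant pairs and returns p ≈ 0.0215, which would count as significant at the 95% level.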

Reinforcement Learning from Verifiable Rewards (RLVR) on Factual Recall

To address the suboptimal baseline, the study applies RLVR where correctness on closed-book factual QA (using TriviaQA) serves as the reward. Following policy-gradient optimization with a GRPO-style method, the GPT-OSS-20B checkpoint shows a marked jump in knowledge recall capability:

  • On TriviaQA, Exact Match (EM) increases from 36.5% to 63.6% and Ex-Recall from 60.1% to 70.0%
  • For Natural Questions, EM rises from 6.0% to 18.2% and Ex-Recall from 30.7% to 34.9%
  • HotpotQA EM more than doubles (about +9.5 points), with Ex-Recall up 2.1 points; SimpleQA gains are smaller (EM about +1.5 points, Ex-Recall +0.6)
  • StrategyQA sees a smaller increase of about 3.0 points

Importantly, the improvement obtained via RLVR training exceeds that of a supervised fine-tuning (SFT) baseline on correct reasoning traces, emphasizing the benefit of explicit reward-driven optimization beyond static data likelihood.
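The glossary quotes the paper as computing advantages "relative to the group-average reward". A minimal sketch of that group-relative baseline follows, assuming K sampled answers per question, each with a scalar reward. The optional standard-deviation scaling is a common GRPO variant included here as an assumption; this summary does not say whether the paper uses it.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize=False, eps=1e-8):
    """Center each sample's reward on the mean reward of its group.

    rewards: rewards for K completions sampled for the SAME question.
    normalize: if True, also divide by the group's standard deviation
               (an assumed variant, not confirmed for this paper).
    """
    baseline = mean(rewards)
    advantages = [r - baseline for r in rewards]
    if normalize:
        scale = pstdev(rewards) + eps  # eps avoids division by zero
        advantages = [a / scale for a in advantages]
    return advantages


# Four samples for one question: two correct (reward 1), two wrong (reward 0).
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct samples end up with positive advantages and incorrect ones with negative advantages, so the policy gradient pushes probability toward the answers that beat the group average. If every sample in a group earns the same reward, all advantages are zero and that question contributes no gradient.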

Analysis of Reasoning Traces

Post-RLVR, the reasoning traces generated by the model modestly increase in length (e.g., average tokens per trace for TriviaQA grew from 94 to 107). Qualitative assessment indicates that, while traces are sometimes longer and occasionally more calibrated, their structure varies and is not consistently more deductive or interpretable. In many instances, the correct answer is surfaced earlier in the reasoning trace, suggesting enhanced memory recall or answer confidence calibration rather than deeper chaining. This supports the view that RL optimization may increase retrieval effectiveness without always producing overtly richer, interpretable reasoning.

Contrasting Transfer: Factual Recall vs. Mathematical Reasoning

Unlike on the knowledge recall tasks, the think step-by-step cue did not yield improvements on MATH; mathematical performance remained steady or decreased slightly, suggesting that reasoning behaviors for mathematical problem solving are already well tuned by prior RLVR, leaving little headroom for improvement via prompting. Interestingly, some transfer from RLVR on TriviaQA to mathematics was observed in the no-cue setting, which warrants further investigation of cross-task RL-induced generalization.

Implications and Future Research Directions

The consistent improvements from RLVR in factual recall tasks demonstrate that model capacity for internal knowledge access is underutilized in standard training protocols and can be significantly enhanced through targeted RL objectives. The findings raise several theoretical and practical implications:

  • Separation of Reasoning Competencies: Reasoning strategies optimized for mathematics/code do not transfer optimally to factual recall; domain-specific RL objectives are beneficial.
  • Optimizing Factual Calibration: RLVR can drive more accurate answer retrieval/calibration from parametric memory, though not necessarily via more human-like or interpretable reasoning traces.
  • Prompt Engineering Boundaries: While step-by-step cues produce measurable gains, RL-based training yields distinctly larger improvements, arguing for a stronger focus on RL frameworks for knowledge-centric competences.
  • Trace Quality as a Training Signal: Future work should investigate RL objectives that not only reward answer correctness but incentivize structured and interpretable reasoning, for example via "spreading activation"-inspired trace modeling.

The results motivate further research into RL signals for knowledge access, with the potential to narrow the gap between opaque parametric storage and robust, interpretable retrieval. This has downstream value for knowledge-intensive applications, reliability, and transparency in deployed LLM systems.

Conclusion

The paper provides compelling evidence that LLMs' ability to access parametric knowledge for factual recall is not saturated by current training regimes and can be substantially improved via RLVR. Explicit reasoning prompts can yield modest gains, but RL-based training on answer correctness significantly elevates recall performance. However, these advances raise questions regarding the qualitative nature of model-generated reasoning and encourage further inquiry into training objectives that explicitly shape both output accuracy and reasoning tractability (2602.22193).

Explain it Like I'm 14

Helping AI Remember: What This Paper Is About

This paper looks at how LLMs (AIs that read and write text) can better “think” to remember facts they already know. Imagine the AI has a big internal memory. The authors ask: can we teach it to reason in steps so it can pull the right facts out of that memory more often—like recalling that Canberra is the capital of Australia?

What Questions Did the Researchers Ask?

The paper focuses on three simple questions:

  • Do LLMs automatically use their best kind of reasoning when trying to remember facts, or do they need a nudge?
  • Does a simple prompt like “think step-by-step” help models recall facts more reliably?
  • Can reinforcement learning (a training method that gives rewards for good answers) teach models to access their internal knowledge better?

How Did They Test This?

The researchers used “closed-book” question answering, which means the model must answer without looking things up—no searching the web or external documents. Think of it like a quiz where you can’t use notes.

They did two main things:

  1. Prompting experiment:
    • They tried adding a short cue: “think step-by-step.”
    • They tested this across several models on trivia-style datasets:
      • TriviaQA: lots of general-knowledge trivia
      • Natural Questions: real questions people ask (answers come from Wikipedia)
    • They also tested on MATH problems to see if the cue helped there, too.
  2. Reinforcement Learning (RL) training:
    • They trained one model (GPT-OSS-20B) using RL on TriviaQA.
    • In RL, the model gets a reward (like “points”) when it gives a correct answer, and small penalties if it breaks formatting rules (e.g., not using required answer tags).
    • Over time, the model learns patterns that lead to more correct answers.
    • They checked whether this training improved performance not just on TriviaQA, but also on:
      • Natural Questions
      • HotpotQA (questions that usually need reasoning across multiple pieces of info)
      • SimpleQA (short, factual questions)
      • StrategyQA (questions requiring simple reasoning)
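The reward idea described above (points for a correct answer, a small penalty for breaking the answer-tag format) can be sketched as a short function. This is only an illustration: the reward values (1.0 for correct, a 0.1 format penalty), the normalization rules, and the function names are assumptions, and the partial-credit recall term mentioned in the Knowledge Gaps section is left out.

```python
import re

# Matches the first <answer>...</answer> span in a completion.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def reward(completion, gold_answers, format_penalty=0.1):
    """Verifiable reward for one completion (assumed values, see lead-in)."""
    match = ANSWER_RE.search(completion)
    if match is None:
        # Required tags are missing: apply the format penalty.
        return -format_penalty
    predicted = normalize(match.group(1))
    golds = {normalize(g) for g in gold_answers}
    return 1.0 if predicted in golds else 0.0


good = reward("Purpose-built capital... <answer>Canberra</answer>", ["Canberra"])
wrong = reward("<answer>Sydney</answer>", ["Canberra"])
untagged = reward("Canberra", ["Canberra"])
```

Because the check is automatic, the same function can score thousands of sampled answers during training without a human in the loop, which is what makes the reward "verifiable".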

To measure performance, they used:

  • Exact Match (EM): the answer is exactly right (like spelling “Canberra” exactly).
  • Extracted Recall (Ex-Recall): a gentler check that looks for the correct answer span inside the model’s output, even if the full sentence isn’t an exact match.

Think of EM as “did you say the exact right word?” and Ex-Recall as “did the correct answer appear in what you said?”
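The difference between the two checks can be shown with a small sketch. One caveat: according to the Knowledge Gaps section, the paper's Ex-Recall uses a separate LLM to extract the answer before matching. The `contains_answer` function below is a simpler stand-in that only checks whether a gold answer appears as a run of whole words in the output, and the normalization rules are assumptions.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """Strict check: the whole prediction equals a gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(g) for g in gold_answers)


def contains_answer(output, gold_answers):
    """Relaxed stand-in: a gold answer appears as whole words in the output."""
    out_tokens = normalize(output).split()
    for gold in gold_answers:
        gold_tokens = normalize(gold).split()
        n = len(gold_tokens)
        if n == 0:
            continue
        for i in range(len(out_tokens) - n + 1):
            if out_tokens[i:i + n] == gold_tokens:
                return True
    return False


verbose = "The capital is Canberra."
strict = exact_match(verbose, ["Canberra"])       # full sentence != "canberra"
relaxed = contains_answer(verbose, ["Canberra"])  # "canberra" is in the output
```

A verbose but correct answer fails the strict check and passes the relaxed one, which is why EM alone can understate how often a chatty model actually knows the fact.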

What Did They Find, and Why Does It Matter?

Here are the main takeaways:

  • The simple “think step-by-step” cue helps with factual recall.
    • Across different models and datasets, adding this cue gave small but consistent improvements in recalling facts.
    • However, this cue did not help with math problems—in some cases it slightly hurt math accuracy. That suggests these models already know how to reason in math without extra prompting, likely because they’ve been trained heavily on math reasoning.
  • Reinforcement Learning (RL) made a bigger difference.
    • Training the model with rewards on TriviaQA improved results across several datasets:
      • TriviaQA: EM up by about 27 percentage points; Ex-Recall up by about 9.9 points
      • Natural Questions: EM up by about 12.2; Ex-Recall up by about 4.2
      • HotpotQA: EM up by about 9.5; Ex-Recall up by about 2.1
      • SimpleQA: EM up by about 1.5; Ex-Recall up by about 0.6
      • StrategyQA: EM up by about 3.0
    • These gains beat a simpler training approach that just copied correct, existing reasoning traces (called “Reasoning-SFT”), showing RL’s on-the-fly feedback is more powerful.
  • The model’s “thinking” got a bit longer after RL (more tokens), but not always more human-like.
    • Sometimes the trained model just arrived at the right answer faster without showing detailed, step-by-step logic.
    • That’s okay: for remembering facts, the “right” reasoning is whatever reliably pulls out the correct memory—even if it’s not the kind of explanation a person would write.
  • Unexpected bonus: the RL-trained model also did slightly better on math in one setting, even though RL focused on trivia. That suggests some general benefits to the training.

Why it matters:

  • Many AI assistants answer questions using the knowledge stored inside them. Helping them “think to remember” can make them more reliable—even without internet access.
  • The results show current models don’t automatically use their best strategy to recall facts, but simple cues and RL training can help a lot.

What Does This Mean Going Forward?

  • Better everyday helpers: AI systems could answer more factual questions correctly without needing to search online, which is useful in offline or privacy-sensitive settings.
  • Training for memory access: RL with verifiable rewards (like “is the answer correct?”) is a practical way to improve how models tap into their internal knowledge.
  • Next challenge: make the reasoning traces not just effective, but clearer and more “human-understandable.” The authors suggest exploring “spreading activation” style thinking (like mentally jumping between related ideas to jog memory) and designing rewards that encourage that kind of reasoning.
  • Overall: LLMs can be taught to reason better for remembering facts, and doing so improves performance across many kinds of questions.

Knowledge Gaps

Unresolved Gaps, Limitations, and Open Questions

Below is a single, concrete list of open problems, missing analyses, and limitations that future work can directly act on:

  • Reward design and ablations are underexplored: How do different verifiable rewards (e.g., EM-only, token-level F1, contrastive rewards against near-miss distractors, penalizing verbosity) affect recall gains, verbosity, and reasoning-trace quality? Systematic ablations of the format penalty and partial-credit “recall” term are missing.
  • Mechanism of improvement is unclear: The model’s post-RL “reasoning” is longer but not reliably more coherent or interpretable. What internal processes (e.g., spreading-activation-like retrieval vs. calibrated guessing) are responsible for the observed gains? Mechanistic probes, activation patching, or knowledge-neuron analyses are needed.
  • Trace validity and causality are untested: Do visible reasoning tokens actually contribute to better recall, or are they epiphenomenal? Interventions that hide, shuffle, or truncate chains (and tests of “latent CoT” vs. visible CoT) could establish causal necessity.
  • Limited baseline strength: The SFT and Reasoning-SFT baselines are narrow (single dataset, LoRA ranks, limited hyperparameter sweeps). A stronger comparison should include high-capacity SFT with chain-of-thought on QA, instruction-tuned QA models, and multi-epoch/QLoRA SFT to clarify whether RL is strictly necessary.
  • RL algorithm choices are not compared: Only a GRPO-style objective is used. How do PPO, DPO/IPO-style preference objectives, token-regulated GRPO, or KL schedules influence stability, sample efficiency, and trace quality on knowledge recall?
  • Generalization breadth is narrow: RL is done only on TriviaQA and evaluated on a small set of English benchmarks. Assess transfer to long-tail and adversarial factual QA (e.g., AmbigQA, PopQA, EntityQuestions), specialized domains (science, medicine, law), and multilingual settings (e.g., X-FACTR, mLAMA).
  • Temporal robustness is unmeasured: Does RL overfit to stale facts or improve/impair handling of time-sensitive knowledge? Evaluate on temporally split datasets (e.g., TimeQA, TempLAMA) and measure recency calibration.
  • Safety and privacy risks are unassessed: Does optimizing closed-book recall increase memorization/redisclosure of PII or copyrighted content? Test on PII leakage audits and memorization probes.
  • Calibration and abstention are not evaluated: Do models become more or less overconfident after RL? Measure ECE/Brier scores, answerability/IDK behavior, and selective prediction metrics.
  • Error taxonomy is missing: Which factual error types improve (entity disambiguation, dates, numeric facts) and which persist? A fine-grained error analysis would target where reasoning-for-recall helps most.
  • Multi-hop recall is not rigorously tested: HotpotQA EM gains are shown, but there is no evaluation of intermediate hop correctness, supporting-fact prediction, or explicit multi-step factual chains without retrieval.
  • Cue engineering space is unexplored: Only “think step-by-step” is tested. Do alternative prompts (“list candidates then decide,” “recall related entities,” “justify then answer”) elicit better recall or shorter traces?
  • Cost–performance trade-offs are unquantified: RL increases thinking-token length. What is the marginal gain per additional token and the effect of budget forcing across budgets? Provide latency, throughput, and cost curves.
  • Interactions with other skills are not characterized: A small MATH check suggests mixed effects; impacts on coding, tool-use, and other RLVR-trained domains (e.g., SWE-bench) are unknown. Evaluate for positive/negative transfer and interference.
  • Closed-book vs. retrieval-augmented interplay is open: Does RL for parametric recall help or hurt search-augmented QA? Joint training or sequential curricula with retrieval remains unexplored.
  • Extracted-Recall metric dependency is fragile: Ex-Recall relies on a separate LLM (GPT-5-mini) for answer extraction, introducing potential bias and reproducibility issues. Validate with multiple extractors, deterministic extractors, and report extractor error rates.
  • Reward gaming risk not measured: Partial credit for “recall” may incentivize verbose or list-like outputs. Quantify gaming by checking answer length distributions and the relation between verbosity and reward under controlled prompts.
  • Data leakage checks are limited: Training uses a random split within TriviaQA’s train set; broader contamination audits (pretraining overlap, test leakage across datasets) are not provided.
  • Scaling laws are unstudied: Only one model (GPT-OSS-20B) is trained with RL. How do gains scale with parameter count and compute? Train across sizes to chart recall-reasoning scaling trends.
  • Sample efficiency and stability are unclear: How do gains evolve over steps, group size K, KL penalties, and LoRA ranks? Learning curves and stability diagnostics are missing.
  • Robustness to paraphrase/adversarial rewording is not tested: Evaluate on paraphrased/controlled-perturbation benchmarks and adversarially crafted questions to assess brittleness.
  • Mixture-of-tasks RL is untested: Does combining multiple QA datasets or interleaving math/code with knowledge recall during RL produce better generalization and reduce forgetting?
  • Human evaluation of reasoning quality is absent: Collect human judgments on trace plausibility, specificity, and factual correctness to complement token-length proxies.
  • Interpretability-guided objectives are not realized: The paper proposes spreading-activation–style reasoning but does not instantiate or evaluate objectives that reward semantically-linked intermediate mentions or structured recall plans.
  • Reproducibility dependencies exist: The pipeline uses Tinker infrastructure and a proprietary extractor model; provide open-source alternatives and seeds/checkpoints to ensure replicability.
  • Catastrophic forgetting and knowledge drift are unmeasured: Does RL for recall harm unrelated knowledge or introduce spurious biases? Audit with broad fact probes (e.g., LAMA), MMLU subsets, and pre/post comparisons.
  • Uncertainty handling in outputs is not explored: Train/evaluate with abstention-aware rewards, calibrated thresholds, or selective answering to reduce confident errors on hard questions.
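The calibration item in the list above names ECE and Brier scores. For reference, a minimal sketch of both follows, assuming each answer comes with a confidence in [0, 1] and a correctness flag. How such confidences would be obtained from a reasoning model is not covered by the paper, and the equal-width binning is one common convention rather than the only one.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    pairs = list(zip(confidences, correct))
    return sum((p - float(y)) ** 2 for p, y in pairs) / len(pairs)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, float(y)))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(p for p, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


# Toy example (hypothetical): confident answers are right, unsure ones are wrong.
conf = [0.9, 0.9, 0.1, 0.1]
hits = [True, True, False, False]
bs = brier_score(conf, hits)
ece = expected_calibration_error(conf, hits)
```

Comparing these numbers before and after RL training would show whether the recall gains come with more or less overconfidence.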

Practical Applications

Immediate Applications

Below are actionable, sector-linked use cases that can be deployed now, leveraging the paper’s findings that simple “think step-by-step” prompting modestly improves, and RL from verifiable rewards (RLVR) substantially improves, closed-book factual recall in reasoning LLMs.

  • Enterprise “parametric knowledge recall” fine-tuning pipeline (software, customer support, knowledge management)
    • Description: Use the paper’s RLVR setup (LoRA + on-policy GRPO-style optimization) to fine-tune an organization’s open-weight model (e.g., GPT-OSS-20B) on curated, verifiable Q&A pairs about products, policies, and procedures. Add “think step-by-step” prompting in production to further boost recall.
    • Tools/workflows: Curate QA datasets; enforce answer formatting (e.g., <answer></answer> tags); RL training via Tinker-like platforms; deploy a closed-book assistant for internal help desks or customer chat.
    • Assumptions/dependencies: Adequate Q&A coverage and quality; compute budget for RL; permissions/licensing for the base model; monitoring for drift and hallucinations; limited explainability of traces.
  • Prompting policy update for closed-book agents (software, education, consumer assistants)
    • Description: Incorporate an explicit “think step-by-step” instruction into prompts for any closed-book factual query mode, as it improves recall across multiple QA benchmarks.
    • Tools/workflows: Prompt templates from the paper; small A/B tests to validate impact; “budget forcing” or token caps for verbose models.
    • Assumptions/dependencies: The model contains sufficient parametric knowledge; gains are modest but consistent; prompt sensitivity varies by model/version.
  • Factual recall evaluation using Extracted-Recall (academia, compliance, product QA)
    • Description: Adopt the paper’s Ex-Recall metric by running an auxiliary LLM extractor to collapse verbose outputs into a single answer span for match checking—useful where EM alone underestimates performance.
    • Tools/workflows: Build an evaluation harness with an extraction prompt; compare EM vs. Ex-Recall; use McNemar’s test for paired significance testing.
    • Assumptions/dependencies: Availability of a small, reliable extractor LLM; careful prompt design to avoid gaming; domain-specific normalization rules.
  • Cost-aware search reduction in retrieval-augmented agents (software, finance, customer support)
    • Description: Prioritize closed-book recall first (with step-by-step prompting and RLVR-tuned parametric knowledge) before issuing search calls; reduce latency and API costs.
    • Tools/workflows: Decision policy in the agent: recall → fallback to search; RLVR fine-tuning for recall robustness; logging for fallback rates.
    • Assumptions/dependencies: Sufficient recall coverage; robust fallback to search for edge cases; continuous monitoring of accuracy.
  • Domain tutoring and study aids with improved factual recall (education)
    • Description: Apply the cue and RLVR fine-tuning on subject-specific trivia/history/geography QAs to build closed-book study aids that recall facts more consistently.
    • Tools/workflows: Curriculum Q&A sets; LoRA-based RLVR; student-facing assistant with constrained outputs and verification.
    • Assumptions/dependencies: Non-safety-critical content; periodic refresh to handle outdated facts; careful messaging about reliability limits.
  • Internal API and codebase Q&A assistants (software engineering)
    • Description: Fine-tune models via RLVR on verifiable Q&A about APIs, release notes, semantic conventions, and internal coding standards to improve recall without full retrieval.
    • Tools/workflows: QA dataset generation from docs; answer formatting rewards; CI/CD gate for assistant updates; use “think step-by-step”.
    • Assumptions/dependencies: Fast-moving codebases require frequent re-training; closed-book recall can still miss rare edge cases; access controls for internal data.
  • Policy and public information portals with verifiable Q&A rewards (public sector, policy communication)
    • Description: Train closed-book assistants on official Q&A (FAQs, statute summaries) with correctness-based rewards, improving consistent recall of public information.
    • Tools/workflows: Government-curated QAs; answer formatting checks; Ex-Recall evaluation; disclaimers and escalation to human agents for ambiguous queries.
    • Assumptions/dependencies: High-quality, up-to-date QAs; transparent governance; audit requirements due to limited interpretability of traces.
  • On-device/offline assistants with better closed-book recall (consumer, mobile)
    • Description: Use step-by-step prompting and RLVR-tuned compact models to improve offline factual responses (e.g., device settings, local trivia, travel facts).
    • Tools/workflows: LoRA on smaller open-weight models; compression/distillation; prompt templates; token budget controls.
    • Assumptions/dependencies: Model capacity constraints; local evaluation; periodic updates to avoid stale facts.
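The "recall → fallback to search" decision policy from the cost-aware search reduction item above can be sketched as a thin wrapper around two callables. Everything here is hypothetical scaffolding: the `recall` and `search` callables, the confidence score returned by `recall`, and the 0.7 threshold are assumptions for illustration, not components described in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class RecallFirstAgent:
    # recall(question) -> (answer or None, confidence in [0, 1])  [hypothetical]
    recall: Callable[[str], Tuple[Optional[str], float]]
    # search(question) -> answer from a retrieval pipeline         [hypothetical]
    search: Callable[[str], str]
    min_confidence: float = 0.7  # assumed threshold; tune on held-out data
    total: int = 0
    fallbacks: int = 0

    def answer(self, question: str) -> str:
        """Try closed-book recall first; call search only when unsure."""
        self.total += 1
        candidate, confidence = self.recall(question)
        if candidate is not None and confidence >= self.min_confidence:
            return candidate
        self.fallbacks += 1  # logged so the fallback rate can be monitored
        return self.search(question)

    @property
    def fallback_rate(self) -> float:
        return self.fallbacks / self.total if self.total else 0.0


# Stub backends standing in for a closed-book model and a search tool.
def fake_recall(question):
    known = {"capital of australia?": ("Canberra", 0.95)}
    return known.get(question.lower(), (None, 0.0))


def fake_search(question):
    return "search result for: " + question


agent = RecallFirstAgent(recall=fake_recall, search=fake_search)
first = agent.answer("Capital of Australia?")
second = agent.answer("Who won yesterday's match?")
```

Tracking `fallback_rate` over time gives a direct view of how much search traffic the closed-book path is saving, and a rising rate can signal stale parametric knowledge.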

Long-Term Applications

The following use cases require further research, scaling, or development—particularly around explainability, safety, and coverage—building on the paper’s insight that RLVR enhances parametric knowledge access but current traces are not always human-interpretable.

  • Explainable “spreading activation”-style recall (healthcare, legal, compliance)
    • Description: Design RL reward functions that explicitly encourage interpretable, concept-linking reasoning traces (e.g., human-auditable chains for drug facts or legal precedents).
    • Tools/workflows: New RL objectives incorporating graph-based reasoning constraints; verifier models; trace quality metrics beyond EM/Ex-Recall.
    • Assumptions/dependencies: Research breakthroughs in reasoning trace supervision; trade-offs between accuracy and interpretability; domain expert validation.
  • Safety-critical decision support with verified recall (healthcare, aviation, energy)
    • Description: Deploy closed-book assistants for non-diagnostic but high-stakes recall (procedures, checklists) only after achieving stringent accuracy, calibration, and trace audit standards.
    • Tools/workflows: Formal evaluation protocols; uncertainty estimation; human-in-the-loop; post-deployment monitoring.
    • Assumptions/dependencies: Regulatory approvals; robust fail-safes; extensive domain coverage and periodic re-certification.
  • Corporate knowledge memory systems with continuous RL updates (enterprise KM, HR, operations)
    • Description: A “knowledge OS” that regularly ingests curated Q&A from evolving policies and documents, re-optimizing the parametric memory via RLVR to maintain accurate recall.
    • Tools/workflows: Data pipelines to generate/validate QAs; scheduled RL runs; drift detection; change logs linked to model versions.
    • Assumptions/dependencies: Stable training infrastructure; careful data governance; measurable ROI versus retrieval-first architectures.
  • Standardization of closed-book recall metrics and audits (policy, AI governance, academia)
    • Description: Establish Ex-Recall-like metrics and audit practices as industry standards for evaluating closed-book assistants, enabling consistent benchmarking and procurement decisions.
    • Tools/workflows: Community benchmarks; public test suites; statistical significance testing (e.g., McNemar’s); third-party audits.
    • Assumptions/dependencies: Multi-stakeholder alignment; transparency; incentives for vendors to report standardized metrics.
  • Hybrid self-play generation of recall QAs (academia, software)
    • Description: Use LLMs to propose candidate QAs from internal corpora and filter for verifiability, scaling the coverage of rewardable training examples for RLVR.
    • Tools/workflows: Synthetic QA generation; verifier models; contamination checks; progressive curriculum learning.
    • Assumptions/dependencies: High-quality filters; contamination avoidance (not training on exact evaluation items); compute budget.
  • Offline high-recall edge models (consumer devices, robotics, remote operations)
    • Description: Develop compact, energy-efficient models with strong closed-book recall for constrained environments (robots, remote sites) where connectivity is limited.
    • Tools/workflows: Distillation from RLVR-trained larger models; memory-optimized architectures; on-device evaluation pipelines.
    • Assumptions/dependencies: Hardware constraints; periodic update channels; domain-specific recall content.
  • Finance and regulatory compliance assistants (finance, legal)
    • Description: RLVR-tuned assistants that recall internal compliance rules, reporting calendars, and regulatory thresholds; evolve toward interpretable traces for auditability.
    • Tools/workflows: Curated compliance QAs; trace logging & audit trails; integration with GRC (governance, risk, compliance) tooling.
    • Assumptions/dependencies: Legal liability considerations; ongoing regulatory changes; guarantee of timeliness and correctness.
  • Cross-task transfer research programs (academia, AI labs)
    • Description: Systematic investigation of how RLVR on knowledge recall transfers to other reasoning tasks (e.g., observed math improvement in no-cue setting), informing better multi-domain training curricula.
    • Tools/workflows: Multi-benchmark suites; ablation studies; shared training artifacts; open-weight checkpoints.
    • Assumptions/dependencies: Availability of compute and open datasets; reproducibility; community engagement for shared baselines.

Glossary

  • Advantage (in policy gradients): A baseline-adjusted reward signal used to reduce variance in gradient estimates by comparing each trajectory’s reward to a reference level. "Advantages are computed relative to the group-average reward"
  • AIME: The American Invitational Mathematics Examination, a challenging high school math competition used as a benchmark for reasoning capabilities. "frontier models achieve near-perfect accuracy on the AIME mathematics competition"
  • Budget forcing: A decoding constraint that limits or regularizes the length or verbosity of generated reasoning traces. "For verbose models, we apply budget forcing"
  • Closed-book QA: Question answering where the model must answer from its internal parameters without retrieving external documents. "two closed-book QA datasets for testing knowledge recall"
  • Exact Match (EM): A strict metric that checks whether the predicted answer exactly matches a reference after normalization. "we evaluate using Exact Match (EM) and Extracted-Recall (Ex-Recall)."
  • Ex-Recall (Extracted-Recall): A relaxed exact-match metric computed after extracting a single answer span from the model’s output. "Ex-Recall is a slightly relaxed Exact Match."
  • GPQA: A “Google-proof” graduate-level question answering benchmark used to assess advanced reasoning. "transfer these abilities to other reasoning-heavy domains, such as GPQA"
  • GRPO: Group Relative Policy Optimization; a reinforcement learning method that uses group-wise relative rewards to compute policy gradients. "We optimize the objective using a GRPO-style"
  • HotpotQA: A dataset requiring multi-hop reasoning over multiple documents to answer questions. "HotpotQA is a multi-hop QA dataset;"
  • Humanity's Last Exam: A difficult, broad-coverage evaluation benchmark for advanced reasoning in LLMs. "such as GPQA and Humanity's Last Exam"
  • Importance sampling: A technique for reweighting samples from one distribution to estimate expectations under another, used here in policy gradient estimation. "importance-sampling policy gradient method."
  • KL penalty: A Kullback–Leibler divergence regularization term that penalizes deviation from a reference policy to stabilize RL training. "KL penalty coefficient of 0.01"
  • LoRA: Low-Rank Adaptation; a parameter-efficient fine-tuning method that adapts large models by training low-rank updates. "We train with LoRA (rank=32)"
  • MATH: A benchmark of competition-level math problems to evaluate mathematical reasoning in LLMs. "MATH contains competition-level mathematical reasoning problems."
  • McNemar's test: A statistical test for paired nominal data used to assess significance of performance differences between two models. "statistically significant at the 95% level by McNemar's test"
  • Multi-hop QA: Question answering that requires reasoning across multiple pieces of evidence or steps. "HotpotQA is a multi-hop QA dataset;"
  • Natural Questions (NQ): A benchmark of real user questions with answers from Wikipedia, used in both open- and closed-book settings. "NQ = Natural Questions."
  • Offline RL: Reinforcement learning using fixed datasets of trajectories without online environment interaction. "This also outperforms an offline RL baseline in which we finetune a model on correct reasoning traces generated by the initial model."
  • On-policy: An RL training regime where data are sampled from the current policy being optimized. "in our on-policy setting yields importance weights close to one."
  • Online RL: Reinforcement learning where the agent continually collects new data from the environment or model during training. "We conduct online RL training using Tinker"
  • Open-book question answering: QA where systems can retrieve or consult external sources during answering. "RL has been applied to open-book question answering"
  • Parametric knowledge: Factual information stored implicitly in a model’s parameters, accessible via prompting or reasoning. "Reasoning to access parametric knowledge is qualitatively different from reasoning used in common RLVR training"
  • Policy gradient: A class of RL algorithms that maximize expected reward by gradient ascent on the policy's parameters. "policy gradient method."
  • RLVR: Reinforcement Learning from Verifiable Rewards; training LMs with rewards derived from automatically checkable outcomes. "Reasoning LLMs trained with Reinforcement Learning from Verifiable Rewards (RLVR)"
  • Reasoning tokens: Special reasoning segments or modes in model outputs (e.g., think/analysis tokens) that encourage step-by-step reasoning before answering. "with and without the think step-by-step cue and reasoning tokens."
  • Reasoning-SFT: A supervised fine-tuning baseline that trains on model-generated reasoning traces filtered for correctness. "We first evaluate the model trained on Reasoning-SFT"
  • Reinforcement Learning (RL): An optimization framework where policies are trained to maximize expected reward signals. "We train GPT-OSS-20B on TriviaQA with online RL"
  • SFT (Supervised fine-tuning): Training a model directly on labeled input–output pairs to improve task performance. "SFT Baseline Details"
  • Spreading activation: A cognitive theory where activating one concept triggers related concepts in a semantic network, inspiring knowledge retrieval strategies. "spreading activation—where activating one concept in a semantic network causes activation to spread to related concepts."
  • StrategyQA: A dataset assessing implicit reasoning strategies for yes/no questions. "StrategyQA (+3.0% EM)"
  • SWE-bench: A benchmark of real-world software engineering tasks for evaluating code reasoning and problem solving. "SWE-bench, a benchmark of real-world software engineering tasks"
  • Temperature: A decoding parameter controlling randomness by scaling logits before sampling; higher values yield more diverse outputs. "For GPT-5.2, temperature and top-p could not be explicitly set"
  • Think step-by-step cue: A prompt instruction that elicits chain-of-thought reasoning to improve knowledge recall. "adding a simple think step-by-step cue demonstrates statistically significant improvement in knowledge recall"
  • Top-p: Also called nucleus sampling; a decoding method that samples tokens from the smallest set whose cumulative probability exceeds p. "For GPT-5.2, temperature and top-p could not be explicitly set"
  • Verifiable reward: A reward signal that can be automatically checked (e.g., answer correctness), enabling scalable RL training for reasoning. "using answer correctness as the verifiable reward"
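To make the Exact Match and McNemar's test entries above concrete, here is a minimal sketch of how such an evaluation could be computed. It is an illustration under stated assumptions, not the paper's evaluation code: the normalization rules follow the common SQuAD-style convention (lowercasing, stripping punctuation and English articles), which may differ from what the authors use, and the function names `normalize`, `exact_match`, and `mcnemar_exact` are hypothetical. The significance test is the exact (binomial) form of McNemar's test on the discordant pairs.

```python
import re
import string
from math import comb


def normalize(text: str) -> str:
    """SQuAD-style answer normalization (assumed; the paper's rules may differ)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
    return " ".join(text.split())  # collapse whitespace


def exact_match(prediction: str, references: list[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)


def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Two-sided exact McNemar p-value for two systems scored on the same items."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same items")
    # Only discordant pairs (exactly one system correct) carry information.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    n = b + c
    if n == 0:
        return 1.0  # the systems never disagree
    # Under the null hypothesis, each discordant pair favors either system
    # with probability 1/2, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

In use, each question would be scored once per condition with `exact_match` (for example, with and without the think step-by-step cue), and the two resulting lists of booleans passed to `mcnemar_exact`; a p-value below 0.05 would correspond to significance at the 95% level.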

Authors (2)
