In-Context Reinforcement Learning for Tool Use in Large Language Models
Abstract: While LLMs exhibit strong reasoning abilities, their performance on complex tasks is often constrained by the limitations of their internal knowledge. A compelling approach to overcome this challenge is to augment these models with external tools -- such as Python interpreters for mathematical computations or search engines for retrieving factual information. However, enabling models to use these tools effectively remains a significant challenge. Existing methods typically rely on cold-start pipelines that begin with supervised fine-tuning (SFT), followed by reinforcement learning (RL). These approaches often require substantial amounts of labeled data for SFT, which is expensive to annotate or synthesize. In this work, we propose In-Context Reinforcement Learning (ICRL), an RL-only framework that eliminates the need for SFT by leveraging few-shot prompting during the rollout stage of RL. Specifically, ICRL introduces in-context examples within the rollout prompts to teach the model how to invoke external tools. Furthermore, as training progresses, the number of in-context examples is gradually reduced, eventually reaching a zero-shot setting where the model learns to call tools independently. We conduct extensive experiments across a range of reasoning and tool-use benchmarks. Results show that ICRL achieves state-of-the-art performance, demonstrating its effectiveness as a scalable, data-efficient alternative to traditional SFT-based pipelines.
Explain it Like I'm 14
In-Context Reinforcement Learning for Tool Use in LLMs — A Simple Explanation
What this paper is about
This paper explains a new way to teach large language models (LLMs, the AI behind chatbots) how to use external tools, such as web search or a calculator, so they can answer tougher questions. The method is called In-Context Reinforcement Learning (ICRL). It helps models learn these skills without needing lots of expensive, hand-made training data.
What questions the researchers asked
The authors focused on two main questions:
- How can we teach an AI model to use tools (like search engines or Python code) effectively without first “tutoring” it with many labeled examples?
- Can we make the model start by following a few examples and then gradually learn to do everything on its own?
How the method works (in everyday terms)
Think of teaching a friend to ride a bike with training wheels, then slowly taking the wheels off as they gain balance. ICRL follows the same idea:
- At first, the model is shown a few example Q&As that demonstrate how to think step-by-step, when to search, and how to format the final answer. These are placed directly in the prompt the model sees (this is called “in-context” learning).
- The model then tries to answer new questions. When it tries, it can:
- “Think” internally,
- Decide to use a tool (like a web search),
- And finally give an answer.
- These steps are written in a structured way using tags like <think>, <search>, <information>, and <answer>—like neatly labeling each part of the process.
- The model gets a score (a “reward”) based on two things:
- Is the final answer correct?
- Did it follow the required format (used the tags correctly)?
- Over time, the number of example solutions in the prompt is reduced (from 3 examples, to 2, to 0). This is like removing the training wheels, so the model learns to use tools independently.
- The learning uses a kind of “trial-and-feedback” process called reinforcement learning. The model makes several attempts, gets feedback, and updates itself to do better next time.
- There’s a special trick called “loss masking”: when the tool (like the search engine) returns text, the model isn’t punished or rewarded for those tool-generated words—only for its own writing. This keeps the training focused on the model’s decisions.

Here is a short, friendly summary of the training process:
- Start with a few examples in the prompt that show how to think, search, and answer.
- Let the model try many times and score its attempts by accuracy and good formatting.
- Gradually remove the examples so the model relies on its own learned skills.
- Update the model step-by-step so it gets better at choosing when and how to use tools.

What they found and why it matters
The team tested ICRL on question-answering tasks that require looking things up and piecing together facts (multi-hop reasoning). They trained on one dataset of real questions but tested on several different benchmarks. They also tried math problems where the model writes and runs Python code to compute answers.

Key takeaways:
- ICRL beat other strong methods on multiple hard question-answering datasets, especially on questions that require several steps of reasoning and multiple searches.
- It worked well across different model sizes (3 billion, 7 billion, and 14 billion parameters), and performance improved with larger models.
- Even without the usual “supervised fine-tuning” stage (which needs lots of expensive labeled examples), ICRL matched or outperformed methods that do use such data.
- For math tasks that use code execution as a tool, ICRL performed competitively with leading methods that rely on heavy supervision.
- A simple “curriculum” (3 examples → 2 → 0) worked better than a more gradual one. Removing training wheels too slowly encouraged the model to stop early, hurting quality.

Why this is important:
- It shows we can teach models to use tools effectively without huge labeled datasets—saving time and cost.
- It helps models handle questions beyond their built-in knowledge by searching the web or doing calculations.
- It’s a general approach that works across different tool types (search engines, code runners) and tasks.

What this could change in the future
ICRL could make AI systems more reliable and flexible in the real world. Instead of being stuck with what they learned during pretraining, they can:
- Look up fresh information,
- Break down complicated problems into steps,
- Use the right tool at the right time,
- And do all this without needing massive amounts of hand-crafted training data.

In short, ICRL is like teaching AI to be a smart researcher: learn from a few examples, practice with feedback, and eventually solve problems independently by using the right tools.
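The scoring-plus-training-wheels idea above can be sketched in a few lines of Python. This is a toy illustration, not the paper's code: the reward weight `alpha`, the `-0.5` format penalty, and the one-third/two-thirds curriculum thresholds are all made-up assumptions.

```python
import re

def format_reward(response: str) -> float:
    """Toy format check: require <think>...</think> and <answer>...</answer> tags.
    A small penalty (assumed value) mimics the paper's format-violation deductions."""
    has_answer = bool(re.search(r"<answer>.*?</answer>", response, re.DOTALL))
    has_think = "<think>" in response and "</think>" in response
    return 0.0 if (has_answer and has_think) else -0.5

def accuracy_reward(response: str, gold: str) -> float:
    """Toy exact-match check on the text inside the <answer> tags."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    pred = m.group(1).strip().lower() if m else ""
    return 1.0 if pred == gold.strip().lower() else 0.0

def total_reward(response: str, gold: str, alpha: float = 0.5) -> float:
    # Composite reward: correctness plus a weighted format term (alpha is assumed).
    return accuracy_reward(response, gold) + alpha * format_reward(response)

def num_demos(step: int, total_steps: int) -> int:
    """Training-wheels curriculum: 3 in-context examples early, then 2, then 0."""
    frac = step / total_steps
    return 3 if frac < 1 / 3 else (2 if frac < 2 / 3 else 0)

good = "<think>Paris is the capital.</think><answer>Paris</answer>"
print(total_reward(good, "Paris"))  # 1.0: correct answer, well-formatted
print(num_demos(0, 300), num_demos(150, 300), num_demos(299, 300))  # 3 2 0
```

A response that answers wrongly or skips the tags scores lower, so the model is pushed toward both accuracy and the structured format at once.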
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of unresolved issues that are concrete and actionable for future research:
- Dependence on demonstration quality and selection: No ablation on how the number, content, or source (human- vs model-generated) of in-context examples affects learning stability, tool-use correctness, and final performance across domains.
- Curriculum transition criteria: The paper reduces demonstrations after “sufficient learning progress,” but provides no automatic criterion or triggering policy; study adaptive schedules (e.g., performance- or entropy-based) and their impact on sample efficiency and stability.
- Reward design constraints: Reliance on exact match (EM) for accuracy yields sparse and brittle feedback (synonyms, multiple valid answers, partial correctness). Investigate graded rewards (F1, n-gram overlap, answer normalization), step-level/verifiable subgoal rewards, and retrieval-grounded rewards.
- Format-driven bias and reward hacking: Penalizing “No <search> usage” may force unnecessary tool calls; dummy or low-quality <search> calls could game format rewards. Develop conditional format rewards (only require search when needed), query-quality validators, and anti-gaming checks.
- Exploration quality for tool queries: The approach lacks mechanisms to assess or improve the semantic relevance of search queries; evaluate query reformulation strategies, dense retriever integration, and query-level intrinsic rewards to guide exploration.
- Ambiguity in the retrieval/search stack: The paper alternately cites Google Search via Serper and a BM25 retriever, with unclear roles and switching conditions; disambiguate the pipeline and quantify sensitivity to engine/retriever choices and top-k settings.
- Non-stationary environment and reproducibility: Training/evaluation rely on live web results, introducing time variance and non-determinism; quantify run-to-run variability, provide caching/snapshots, and evaluate robustness across time slices.
- Generalization to diverse tool schemas: Tool use is framed via XML tagging; it is unclear whether ICRL transfers to JSON/function-calling APIs (e.g., OpenAI tool schema), multi-argument tools, or multi-tool orchestration (selection, routing, parallelization).
- Safety and robustness to tool-side risks: No defenses or evaluations for prompt injection from retrieved pages, adversarial content, or content-policy violations; specify sanitization, filtering, or robust decoding strategies and assess their effectiveness.
- Code execution safety: For Python tools, sandboxing, resource limits, and side-effect prevention are not described; specify the execution environment and evaluate the impact of security constraints on performance.
- Backbone dependency: Results rely on instruction-tuned models; the effectiveness on base models (and required adjustments) is unknown. Evaluate ICRL on non-instruct backbones and quantify reliance on instruction-following priors.
- RL algorithmic choices and loss masking: Only GRPO with loss masking is used; ablate against PPO/other policy optimizers, importance sampling strategies, and masking policies to understand credit assignment and stability trade-offs.
- Hyperparameter sensitivity: No ablation on critical knobs (reward weight α, individual format penalties, KL coefficient, temperature, max search turns). Provide sensitivity analyses and principled defaults.
- Stage-wise dataset partitioning: The definition and construction of D(N), D(N−1), … are unclear (reuse vs disjoint splits). Assess how partitioning strategies affect catastrophic forgetting, coverage, and overfitting.
- Stopping criteria and search budget: The maximum of six search turns is fixed; study adaptive halting policies, learned stopping, and the accuracy–latency trade-off, especially under different curricula.
- Efficiency accounting: Although “data-efficient,” the compute, memory, and wall-clock costs of training with few-shot rollouts (long prompts) are not reported; measure training/inference latency, token throughput, and cost per EM gain versus SFT+RL baselines.
- Evaluation scope and rigor: Results are reported on up to 500-sample subsets per dataset without significance tests or full-benchmark replication; evaluate on full test sets, report confidence intervals, and analyze per-category performance.
- Error and failure-mode analysis: The paper lacks qualitative or quantitative breakdowns (e.g., retrieval failures, reasoning errors, answer normalization mismatches). Provide taxonomy of errors to guide targeted improvements.
- Domain, language, and temporal generalization: Experiments focus on English open-domain QA and limited math; assess transfer to non-English, specialized domains (biomedical/legal), multi-modal tools, and time-sensitive benchmarks.
- Applicability to non-verifiable tasks: ICRL requires ground-truth answers for EM rewards; explore extensions to tasks without verifiable end answers (e.g., open-ended writing) using preference models, learned reward models, or debate-style verification.
- Demonstration sourcing and reproducibility: Few-shot examples are produced by a proprietary model (GPT-5.2); quantify dependence on demonstration source, release demonstration sets, and evaluate with open-source generators for replicability.
- Retrieval depth and reranking: Only top-3 documents are used; ablate top-k, add passage reranking/retrieval fusion, and measure sensitivity to retrieval quality on multi-hop datasets.
- Stability under environment drift: Web content evolves during RL (non-stationary MDP); study off-policy corrections, replay buffers with cached observations, and training curricula robust to changing tool outputs.
- Post-training behavior and deployment: After zero-shot training, clarify whether special tagging and an external tool runner are still required at inference; evaluate end-to-end integration overheads and reliability in real applications.
- Math/code domain coverage: Beyond AIME2024/2025, assess on GSM8K, MATH, and long-form program synthesis; analyze code correctness rates, execution errors, and the interplay between reasoning and tool calls.
- Catastrophic forgetting and alignment: Investigate whether RL fine-tuning degrades general instruction-following or safety; run broad capability and safety evals pre/post-ICRL.
- Definition of “valid search” metric: Training dynamics report “number of valid search” without a precise definition; formalize the metric (e.g., parseable calls, non-empty results, used evidence) and align it with rewards.
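One concrete direction for the reward-design gap above is replacing brittle exact match with a graded token-level F1 score, which grants partial credit for overlapping answers. A minimal sketch (illustrative only, not from the paper; real QA evaluation also strips articles and punctuation):

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a simplified normalization)."""
    return text.lower().split()

def exact_match(pred: str, gold: str) -> float:
    """Binary reward: 1.0 only if the normalized strings match exactly."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """SQuAD-style token F1: harmonic mean of token precision and recall."""
    p, g = normalize(pred), normalize(gold)
    common = Counter(p) & Counter(g)  # multiset intersection of shared tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# EM gives zero for a near-miss; F1 gives partial credit.
print(exact_match("Barack Obama", "Obama"))          # 0.0
print(round(token_f1("Barack Obama", "Obama"), 3))   # 0.667
```

Denser signals like this could reduce reward sparsity during RL, at the cost of sometimes rewarding answers that overlap the gold string without being correct.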
Practical Applications
Summary
Below are actionable, real‑world applications derived from the paper’s findings and methods (ICRL: In‑Context Reinforcement Learning for tool use in LLMs). Each item notes sectors, what it enables, how it leverages the paper’s contributions, potential tools/products/workflows, and key assumptions or dependencies affecting feasibility.
Immediate Applications
- Search‑augmented enterprise assistants — sectors: software, enterprise IT, knowledge management
- What: Deploy assistants that answer employee questions using internal/external search, handling multi‑hop queries and up‑to‑date info.
- How (from ICRL): Few-shot-to-zero-shot RL rollouts teach structured <think>/<search>/<information>/<answer> behavior without SFT; GRPO with loss masking; composite reward for answer EM and format.
- Potential tools/products/workflows: Enterprise “Ask-IT” agents; intranet/Confluence/Search adapters; XML-tagged tool-use logs; inference prompts without heavy scaffolding.
- Assumptions/dependencies: Verifiable QA tasks for reward; reliable search API (e.g., internal BM25, commercial search); instruct-tuned base model; guardrails against prompt injection; access controls for corporate data.

- Customer support knowledge-base bots — sectors: customer service, e-commerce, telecom
- What: Resolve tickets by searching FAQs/manuals and synthesizing exact answers; better on multi-step queries.
- How: ICRL trains the agent to interleave retrieval and reasoning; format reward enforces structured, auditable outputs.
- Potential tools/products/workflows: Support copilot integrated with Zendesk/Freshdesk; search adapters for KBs; auto-evaluation using exact match on known intents.
- Assumptions/dependencies: Availability of labeled or programmatically verifiable outcomes for RL; stable retrieval quality; privacy policies for customer data.

- RAG agent upgrades with structured tool use — sectors: software, ML platforms
- What: Improve Retrieval-Augmented Generation by teaching models when/how to retrieve vs. reason, reducing prompt scaffolding at inference.
- How: Replace cold-start SFT with the ICRL curriculum; loss masking excludes retrieved spans from gradients, focusing learning on decisions.
- Potential tools/products/workflows: “ICRL-RAG” training recipe; adapters for LangChain/LlamaIndex; telemetry on valid tool calls and stop conditions.
- Assumptions/dependencies: Standardized tool schemas (e.g., XML tags); KL-regularized RL setup; accessible RL infrastructure (e.g., VeRL or equivalent).

- Developer and data-science code execution helpers — sectors: software engineering, analytics, education
- What: Assistants that run Python in a sandbox to compute, verify, and explain results (math, data checks).
- How: ICRL generalizes to code execution tools; few-shot examples guide code call format; rewards based on correctness of final outputs.
- Potential tools/products/workflows: Notebook copilot; BI Q&A that computes aggregations; classroom math solvers that show work and verify via execution.
- Assumptions/dependencies: Secure sandboxing; deterministic/verifiable tasks; tested execution APIs; safeguards against malicious code.

- Research QA assistants for literature and fact-checking — sectors: academia, media, legal
- What: Multi-turn agents that decompose questions, search sources, and extract precise facts with citations.
- How: ICRL’s multi-hop search behavior (learned without SFT) supports query decomposition and iterative retrieval.
- Potential tools/products/workflows: Literature scout for systematic reviews; newsroom fact-checker; legal research assistant searching statutes.
- Assumptions/dependencies: High-quality search indices (PubMed, case law, news); provenance tracking; reward proxies (e.g., EM on held-out QA sets) during training.

- Low-cost tool-use training for small labs/SMBs — sectors: ML/AI teams, startups
- What: Train capable tool-using models with 3B–7B backbones without expensive annotation campaigns.
- How: ICRL eliminates SFT; small few-shot sets plus RL rollouts; curriculum reduces few-shot overhead to zero-shot.
- Potential tools/products/workflows: Open-source training kits using Qwen-Instruct; reproducible pipelines using GRPO and loss masking.
- Assumptions/dependencies: Access to moderate GPUs; curated few-shot examples; rewardable tasks; stable training frameworks.

- Auditable agent operations via structured logs — sectors: governance, compliance, risk
- What: Produce transparent, structured traces (<think>, <search>, <information>, <answer>) for review and monitoring.
- How: ICRL’s format reward enforces tag correctness; logs can be analyzed for policy compliance and incident forensics.
- Potential tools/products/workflows: Observability dashboards; offline replay/analysis; automated format-violation alerts.
- Assumptions/dependencies: Agreement on tool-call schema; storage and PII governance; human review procedures.

- Multi-hop search query optimization — sectors: search/retrieval, marketing intelligence
- What: Train agents that plan and execute sequences of focused queries to reach answers faster and more accurately.
- How: Curriculum shows query decomposition patterns; RL advantages reward successful multi-turn strategies.
- Potential tools/products/workflows: Query planner modules for enterprise search; web-research copilots for analysts.
- Assumptions/dependencies: Search latency constraints; query rate limits; protection against adversarial content.

- Education: step-by-step math tutors — sectors: education technology
- What: Tutors that reason in steps, check with code, and present final answers clearly.
- How: ICRL rewards accuracy and structure, with optional code execution as a tool.
- Potential tools/products/workflows: Homework assistants; exam practice systems using AIME-style problems; auto-grading with EM.
- Assumptions/dependencies: Curriculum-aligned datasets; safe execution; accommodations for partial credit beyond EM.

- Cost-efficient model improvement in production — sectors: MLOps, SaaS
- What: Continual improvement of tool-using agents via RL on live/collected tasks without synthesizing labeled traces.
- How: Apply ICRL rollouts with a small seed of examples, gradually removed; use programmatic rewards (accuracy/format).
- Potential tools/products/workflows: Feedback loops that re-train on difficult queries; A/B testing with KL-regularized updates.
- Assumptions/dependencies: Reliable reward signals; safe deployment of RL updates; drift monitoring.

Long-Term Applications

- General multi-tool orchestration across enterprise systems — sectors: enterprise software (CRM, BI, ITSM)
- What: Agents that learn to call heterogeneous tools (ticketing, dashboards, search, calculators) with schema compliance and minimal supervision.
- How: Extend ICRL to multiple tool schemas; design composite rewards (task success, format, latency).
- Potential tools/products/workflows: “Tool-hub” adapters; policy LLMs that autonomously chain tools; orchestration planners.
- Assumptions/dependencies: Clear tool APIs and schemas; sandboxing and RBAC; robust reward proxies for non-QA tasks.

- Clinical information and guideline assistants — sectors: healthcare
- What: Evidence-seeking agents that retrieve the latest guidelines, compute risk scores, and present auditable recommendations.
- How: ICRL’s structured reasoning and tool use (search, calculators) with format-compliance rewards.
- Potential tools/products/workflows: Point-of-care reference copilots; literature triage; audit logs for clinical governance.
- Assumptions/dependencies: Medical validation, bias/safety reviews, regulatory approval; high-precision retrieval; PHI security.

- Regulated finance research and compliance copilots — sectors: finance
- What: Agents that gather filings/news, compute metrics, and produce compliant, logged analyses.
- How: Structured tool calls and format rewards enable traceability; RL with verifiable tasks (e.g., numeric checks).
- Potential tools/products/workflows: 10-K summarizers with metric calculations; compliance evidence collectors.
- Assumptions/dependencies: Strict guardrails, audit trails, sign-off workflows; non-public data access controls; risk limits.

- Autonomous research agents for deep web investigations — sectors: academia, R&D, government
- What: Multi-session agents that plan, search, verify, and synthesize across sources over long horizons.
- How: Scale ICRL curricula to longer horizons and richer reward shaping (task completion, citation coverage).
- Potential tools/products/workflows: Long-context planners; memory modules; citation-aware rewards.
- Assumptions/dependencies: Robust non-sparse rewards; memory management; resistance to web adversarial content.

- Software maintenance and CI/CD agents — sectors: software engineering
- What: Agents that search issues/docs, run tests, and propose fixes with structured, verifiable steps.
- How: Code-execution tool use extended to build/test tools; rewards tied to test pass rates and diff policies.
- Potential tools/products/workflows: PR copilots; automated triage/resolution flows.
- Assumptions/dependencies: Secure repo access; hermetic build environments; strong safety constraints.

- Robotics and embodied agents (skill invocation as “tools”) — sectors: robotics, manufacturing, logistics
- What: Language-to-action agents that call robot skills with structured plans and logs.
- How: Treat skills as tools; ICRL trains schema-compliant calls; rewards from task success in simulation.
- Potential tools/products/workflows: Skill catalogs; plan validators; sim-to-real curriculum.
- Assumptions/dependencies: High-fidelity simulators; safe policy deployment; reliable sensor/tool feedback.

- Education at scale: STEM problem solvers across disciplines — sectors: education technology
- What: Generalized solvers that combine search (concepts) and computation (code) for physics/chemistry/econ problems.
- How: Multi-tool ICRL; composite rewards (numerical accuracy, unit checks, structure).
- Potential tools/products/workflows: Course-aligned agents; lab-report assistants.
- Assumptions/dependencies: Domain datasets with verifiable answers; unit-safe computation; academic integrity policies.

- Continuous knowledge refresh pipelines — sectors: knowledge management, media
- What: Periodic ICRL fine-tuning to maintain temporal awareness as knowledge evolves.
- How: Train on fresh QA or weakly-labeled tasks; format and accuracy rewards; curriculum resets few-shots.
- Potential tools/products/workflows: Scheduled retraining jobs; temporal benchmark integration.
- Assumptions/dependencies: Stream of high-quality, weakly verifiable tasks; drift detection; cost control.

- Standardization of tool-call schemas and audit protocols — sectors: policy, standards bodies, AI governance
- What: Industry standards for structured tool use (e.g., XML/JSON schemas) and minimum audit requirements.
- How: Leverage the paper’s tag-based pattern and format-violation penalties as enforceable specs.
- Potential tools/products/workflows: Conformance test suites; certification for tool-using agents.
- Assumptions/dependencies: Multi-stakeholder agreement; interoperability with existing SDKs/APIs.

- Safety and alignment via structured rewards — sectors: cross-sector AI safety
- What: Enforce safe tool usage (allowed tools, argument types, quotas) with reward penalties and policy constraints.
- How: Extend format-correctness to safety-constraint checks in ICRL; KL-regularized updates to preserve baseline safety.
- Potential tools/products/workflows: Safety validators; red-team feedback loops integrated into RL.
- Assumptions/dependencies: Coverage of safety rules; robust detectors; monitoring for reward hacking.

Notes across applications:
- Data and compute: ICRL reduces supervision needs but still requires RL infrastructure, rollout sampling, and moderate GPUs.
- Rewards: Works best on tasks with programmatic/verifiable success (exact match, numeric checks). Open-ended tasks need proxy rewards or human-in-the-loop evaluation.
- Security: Tool calling introduces risks (prompt injection, code execution). Secure sandboxes, content filters, and permissioning are essential.
- Tool quality: Performance depends on retrieval and tool reliability; latency and rate limits affect UX and RL throughput.
Glossary
- Advantage: A baseline-adjusted measure of how much better a sampled trajectory is than average in a batch for RL updates. "Compute normalized advantage Ai"
- AIME2024: A benchmark (American Invitational Mathematics Examination 2024) used to evaluate math reasoning with tool use. "On the AIME2024 and AIME2025 benchmarks"
- AIME2025: A benchmark (American Invitational Mathematics Examination 2025) used to evaluate math reasoning with tool use. "On the AIME2024 and AIME2025 benchmarks"
- bfloat16 precision: A 16‑bit floating-point format that balances range and precision to speed training with reduced memory. "all models are loaded using bfloat16 precision."
- BM25 retriever: A classic term‑matching retrieval method used to fetch top documents for search. "we use a BM25 retriever that returns the top-3 documents for each search query."
- Chain-of-Thought (CoT): A prompting strategy that elicits step-by-step reasoning in LLMs. "strategies such as Chain-of-Thought (CoT) reasoning"
- CLIP (in GRPO): A clipping function applied to importance weights to stabilize policy optimization. "CLIP(r_{i,t}(θ), Â_i, ε)"
- Cold-start: An initialization approach that begins training with supervised fine‑tuning before RL. "cold-start strategy that begins with supervised fine-tuning (SFT)"
- Compositional query: A question requiring decomposition into multiple sub-questions or reasoning steps. "answering a compositional query"
- Curriculum: A staged training schedule that adjusts difficulty (e.g., reducing demonstrations) over time. "multi-stage curriculum that gradually reduces the number of in-context examples"
- Data leakage: Unintended overlap between training and evaluation data that can inflate performance. "we exclude NQ from the evaluation to avoid data leakage."
- Exact Match (EM): An accuracy metric that counts predictions matching the gold answer exactly. "Exact Match (EM) Accuracy (%)"
- Few-shot prompting: Supplying a small number of exemplars in the prompt to guide model behavior. "few-shot prompting during the rollout stage of RL."
- FlashRAG: A modular toolkit used here to load and preprocess retrieval-augmented datasets. "The dataset is loaded via FlashRAG (Jin et al., 2025b)"
- Format violation penalties: Predefined deductions applied when generated outputs violate required structure. "Table 2. Format violation penalties for computing rewardformat."
- Fully Sharded Data Parallel (FSDP): A parallelization strategy that shards model parameters across devices to reduce memory. "We adopt Fully Sharded Data Parallel (FSDP) training"
- Gradient checkpointing: A memory-saving technique that recomputes activations during backpropagation. "with gradient checkpointing"
- GRPO: Group Relative Policy Optimization, an RL algorithm variant that leverages group-normalized advantages. "we adopt GRPO (Shao et al., 2024) with loss masking"
- Group-relative advantage: Advantage computed by normalizing returns across a sampled group of trajectories. "to compute the group-relative advantage."
- Importance weights: Ratios between current and old policies used to reweight samples in off-policy updates. "Compute importance weights r_{i,t}(θ)"
- In-Context Reinforcement Learning (ICRL): An RL framework that teaches tool use via in-context examples during rollouts, then phases them out. "In-Context Reinforcement Learning (ICRL)"
- In-context learning: The model’s ability to learn patterns from examples embedded directly in the prompt. "via in-context learning, akin to few-shot prompting."
- Instruction-tuned models: LLMs fine‑tuned to follow natural language instructions. "All these instruction-tuned models are widely adopted"
- Interleaving Retrieval Chain-of-Thought (IRCoT): A method that intermixes retrieval steps with step-by-step reasoning. "Interleaving Retrieval Chain-of-Thought (IRCoT)"
- KL-divergence: A measure of divergence between two probability distributions used to regularize policy updates. "DKL is KL-divergence measure."
- KL penalty: A regularization term penalizing deviation from a reference policy. "We apply a KL penalty with a coefficient of 0.001."
- Loss masking: Excluding non-trainable or externally provided tokens from loss computation during training. "we adopt a loss masking strategy tailored for RL with tool use"
- Markov Decision Process (MDP): A formalism for sequential decision-making with states, actions, transitions, and rewards. "as a Markov Decision Process (MDP)"
- Multi-hop reasoning: Reasoning that requires chaining multiple facts or steps to answer a question. "multi-hop reasoning tasks"
- O2-Searcher: A baseline method that applies SFT before RL for search-augmented QA. "such as O2-Searcher (Mei et al., 2025)"
- Policy gradient: A family of RL methods that optimize the policy directly via gradient ascent on expected returns. "only tokens generated by the LLM contribute to the policy gradient"
- Policy LLM: The LLM treated as the decision policy in an RL setup. "where π_θ is the policy LLM"
- Pretraining: Large-scale unsupervised training that confers broad knowledge before fine-tuning. "during pretraining"
- Reference LLM: A fixed model used to compute KL regularization against the current policy. "π_ref is the reference LLM"
- Reinforcement learning (RL): A learning paradigm where agents learn to act via rewards from interactions. "reinforcement learning (RL)"
- Rejection Sampling: A method that filters generated samples based on certain criteria, used as a baseline. "Rejection Sampling (Ahn et al., 2024)"
- Retrieval-Augmented Generation (RAG): A technique that conditions generation on retrieved external documents. "Retrieval-Augmented Generation (RAG)"
- Reward function: A mapping from outputs to scalar signals that guide RL optimization. "We design a composite reward function"
- Rollout: The generated sequence (including tool calls) sampled from the current policy during training. "rollout prompts"
- Serper API: An API that interfaces with Google Search to retrieve live results for queries. "we integrate the Serper API3 across all models"
- Supervised fine-tuning (SFT): Training on labeled examples to adapt a pretrained model to specific tasks. "supervised fine-tuning (SFT)"
- Top-k documents: The k highest-scoring retrieved items returned by a search system. "the top-k documents retrieved from a corpus."
- Tool-augmented reasoning: Reasoning that actively invokes external tools (e.g., search or code execution) within the process. "tool-augmented reasoning"
- Trajectories: Complete sampled sequences of actions and outputs used to compute RL objectives. "we sample 8 rollout trajectories"
- VeRL framework: The Volcano Engine RL framework used to implement training. "using the Volcano Engine Reinforcement Learning (VeRL) framework"
- Zero-shot: Performing a task without seeing any in-context examples. "zero-shot setting"
- XML tags: Markup used to structure different components of the model’s reasoning, tool use, and answers. "such as XML tags"
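Several glossary entries (Advantage, GRPO, Group-relative advantage, Loss masking) fit together in one computation. The toy sketch below shows group-normalized advantages and a masked per-token loss; it is an illustrative assumption-laden simplification, not the paper's actual GRPO implementation (no clipping or KL term shown).

```python
import math

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantage: normalize each rollout's reward by the
    mean and standard deviation of its sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def masked_token_loss(token_logprobs: list[float], advantage: float,
                      tool_mask: list[int]) -> float:
    """Loss masking: tokens returned by the tool (mask=0) contribute nothing;
    only model-generated tokens (mask=1) are averaged into the loss."""
    terms = [-lp * advantage * m for lp, m in zip(token_logprobs, tool_mask)]
    kept = sum(tool_mask)
    return sum(terms) / max(kept, 1)

# Four rollouts of one question: two correct (reward 1.0), two wrong (0.0).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 2) for a in advs])  # [1.0, -1.0, -1.0, 1.0]
```

Correct rollouts get positive advantage and wrong ones negative, without training a separate value network; the mask keeps retrieved `<information>` tokens out of the gradient, matching the loss-masking idea described above.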