KARL: Knowledge Agents via Reinforcement Learning

Published 5 Mar 2026 in cs.AI and cs.LG | (2603.05218v1)

Abstract: We present a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. Our work makes four core contributions. First, we introduce KARLBench, a multi-capability evaluation suite spanning six distinct search regimes, including constraint-driven entity search, cross-document report synthesis, tabular numerical reasoning, exhaustive entity retrieval, procedural reasoning over technical documentation, and fact aggregation over internal enterprise notes. Second, we show that models trained across heterogeneous search behaviors generalize substantially better than those optimized for any single benchmark. Third, we develop an agentic synthesis pipeline that employs long-horizon reasoning and tool use to generate diverse, grounded, and high-quality training data, with iterative bootstrapping from increasingly capable models. Fourth, we propose a new post-training paradigm based on iterative large-batch off-policy RL that is sample efficient, robust to train-inference engine discrepancies, and naturally extends to multi-task training with out-of-distribution generalization. Compared to Claude 4.6 and GPT 5.2, KARL is Pareto-optimal on KARLBench across cost-quality and latency-quality trade-offs, including tasks that were out-of-distribution during training. With sufficient test-time compute, it surpasses the strongest closed models. These results show that tailored synthetic data in combination with multi-task reinforcement learning enables cost-efficient and high-performing knowledge agents for grounded reasoning.


Summary

The paper introduces a novel RL framework, OAPL, that trains knowledge agents to perform multi-step, evidence-grounded reasoning across structurally diverse tasks.
The paper shows that iterative synthetic data generation and RL post-training improve retrieval accuracy, context compression, and out-of-distribution generalization.
The paper demonstrates that KARL achieves competitive cost, latency, and quality tradeoffs, matching closed-source models like Claude 4.6 on benchmark performance.


Motivation and Problem Formulation

This paper presents KARL, a knowledge agent for complex grounded reasoning, trained via large-batch off-policy RL and evaluated on a new multi-capability benchmark, KARLBench. Unlike prior LLM agentic research limited to open or narrow benchmarks, KARLBench comprises six structurally distinct tasks: constraint-driven entity search, cross-document report synthesis, tabular numerical reasoning, exhaustive entity retrieval, procedural reasoning over technical documentation, and fact aggregation over unstructured enterprise notes. The resulting suite exercises long-horizon information-seeking, synthesis, verification, and aggregation behaviors that paradigmatic open-domain QA, pure retrieval augmentation, or “deep research” tasks do not adequately stress.

KARL is instantiated by post-training an agent with a single tool (vector search) over heterogeneous corpora, emphasizing robust retrieval and context management. The central challenge addressed is not only achieving high-quality retrieval but reasoning and answering in complex, high-entropy closed corpora where multi-step, evidence-grounded search is required.
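As a rough illustration of the single tool described above, the following sketch ranks text chunks by cosine similarity to a query vector. The names `cosine`, `vector_search`, and `toy_index`, and the hand-written 3-dimensional vectors, are assumptions made for illustration; the summary does not specify the embedding model or index used.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def vector_search(query_vec, index, k=3):
    """Return the ids of the k chunks most similar to the query.

    `index` maps chunk id -> embedding vector (toy, hand-written embeddings).
    """
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]

# Toy 3-dimensional "embeddings"; real systems use a learned embedding model.
toy_index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
top = vector_search([1.0, 0.05, 0.0], toy_index, k=2)
```

An agent restricted to this one call has to do all multi-step work through its choice of queries, which is the behavior the training is meant to improve.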

Benchmarks and Task Suite: KARLBench

KARLBench covers a diverse set of evaluation regimes (BrowseComp-Plus, TREC-Biogen, FinanceBench, QAMPARI, FreshStack, and PMBench), each focusing on a distinct aspect of grounded agentic reasoning. Evaluation is unified via nugget-based answer completeness metrics, providing a consistent reward signal and fine-grained supervision for RL optimization.
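A nugget-based completeness score can be sketched as the fraction of reference nuggets that an answer covers. The function name `nugget_completeness` and the `covers` callback are hypothetical, and the default substring check is only a crude stand-in for whatever judge the benchmark actually uses to decide coverage.

```python
def nugget_completeness(answer, nuggets, covers=None):
    """Fraction of reference nuggets that the answer covers.

    `covers(answer, nugget)` decides whether a nugget is present. The default
    is a case-insensitive substring check, a crude stand-in for a judge model.
    """
    if covers is None:
        covers = lambda ans, nug: nug.lower() in ans.lower()
    if not nuggets:
        return 0.0
    hits = sum(1 for nug in nuggets if covers(answer, nug))
    return hits / len(nuggets)

# Illustrative answer and nuggets (not taken from any benchmark).
score = nugget_completeness(
    "Revenue grew 12% in 2023, driven by cloud services.",
    ["12%", "cloud services", "operating margin"],
)
```

Because the score is a fraction rather than a binary outcome, it can serve both as an evaluation metric and as a graded reward.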

Data curation deliberately preserves the structural heterogeneity of sources. All benchmarks are closed-corpus, mitigating contamination and constraining tool-based solutions to vector search.

Data Synthesis Pipeline

Training data is synthesized via a multi-stage agentic pipeline. The pipeline consists of a Question-Answer Generator Agent that dynamically explores the corpus—synthesizing question-answer pairs grounded in retrieved evidence—and a Deduplication Agent that eliminates test set leakage or redundancy.

Figure 1: The agentic pipeline for synthetic training data generation, ensuring grounding and decontamination via retrieval and deduplication.
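The deduplication step can be approximated with a surface-similarity filter. The sketch below is a simplification and not the paper's Deduplication Agent: it assumes token-level Jaccard similarity and a hypothetical threshold of 0.8, and it drops synthetic questions that nearly match a test question or an already kept question.

```python
def tokens(text):
    """Lowercased word set; punctuation is stripped from word edges."""
    return {w.strip(".,?!;:\"'()").lower() for w in text.split()} - {""}

def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def deduplicate(candidates, test_questions, threshold=0.8):
    """Keep candidates that are not near-duplicates of test items or of
    candidates already kept."""
    kept = []
    for cand in candidates:
        seen = list(test_questions) + kept
        if all(jaccard(cand, other) < threshold for other in seen):
            kept.append(cand)
    return kept

# Illustrative questions only.
test_set = ["Which company acquired Acme Corp in 2019?"]
synthetic = [
    "Which company acquired Acme Corp in 2019",          # leaks a test question
    "What was the operating margin reported for 2021?",
    "What was the operating margin reported for 2021?",  # exact repeat
]
clean = deduplicate(synthetic, test_set)
```

A surface filter like this misses paraphrased leakage, which is one reason an agent-based check may be preferred.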

Solutions to these synthetic queries are produced by instantiations of the Solver Agent, with empirical pass rates used to filter easy or unsolvable prompts. Only medium-difficulty questions (those solved in some, but not all, rollouts) are retained; these then pass through an additional Quality Filter Agent that screens for syntactic ambiguity and reference-answer correctness.

Figure 2: RL uses only those synthesized tasks that offer maximal learning signal, as measured by pass rates and filtered for ambiguity and correctness.
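The pass-rate filter described above can be sketched in a few lines. The function names and the boolean rollout outcomes are illustrative assumptions; the Quality Filter Agent that follows this step is not modeled.

```python
def pass_rate(rollout_outcomes):
    """Fraction of solver rollouts that solved the question."""
    return sum(rollout_outcomes) / len(rollout_outcomes)

def keep_medium_difficulty(questions):
    """Retain questions solved by some, but not all, rollouts.

    `questions` maps a question id to a list of booleans, one per rollout.
    """
    return [
        qid for qid, outcomes in questions.items()
        if outcomes and 0.0 < pass_rate(outcomes) < 1.0
    ]

# Hypothetical outcomes for three synthetic questions, four rollouts each.
rollouts = {
    "q_easy": [True, True, True, True],
    "q_medium": [True, False, False, True],
    "q_unsolved": [False, False, False, False],
}
retained = keep_medium_difficulty(rollouts)
```

Questions with a pass rate of exactly 0 or 1 give every rollout the same outcome, so they carry no contrast for the policy to learn from.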

This pipeline is iterated with increasingly capable models to bootstrap data and agent quality.

Reinforcement Learning Framework

The paper proposes OAPL (Optimal Advantage-based Policy Optimization with Lagged Inference policy), an iterative, large-batch off-policy RL paradigm. The framework optimizes a KL-regularized RL objective, amortizing data collection and supporting group rollouts for test-time compute strategies. Crucially, the off-policy procedure is robust to discrepancies between the trainer and the inference engine (e.g., vLLM) and does not require stabilizing heuristics such as importance weighting or router replay, which are common in online RL for Mixture-of-Experts LLMs.
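For a KL-regularized objective with temperature beta and reference policy pi_ref, the optimal policy is known to satisfy beta * log(pi*/pi_ref) = r - V*, where V* = beta * log E[exp(r / beta)] under pi_ref. The sketch below turns that identity into a squared-error loss, with V* estimated from a group of rollouts for one prompt. Whether OAPL uses exactly this loss is an assumption here; the function names, rewards, and log-probabilities are hypothetical.

```python
import math

def soft_value(rewards, beta):
    """Group estimate of V* = beta * log E[exp(r / beta)] under the sampling policy."""
    m = max(rewards)  # subtract the max for numerical stability
    mean_exp = sum(math.exp((r - m) / beta) for r in rewards) / len(rewards)
    return m + beta * math.log(mean_exp)

def regression_loss(logp_new, logp_ref, rewards, beta):
    """Mean squared gap between beta * log(pi / pi_ref) and the advantage r - V*."""
    v_star = soft_value(rewards, beta)
    total = 0.0
    for lp_new, lp_ref, r in zip(logp_new, logp_ref, rewards):
        total += (beta * (lp_new - lp_ref) - (r - v_star)) ** 2
    return total / len(rewards)

# One prompt, a group of four rollouts with hypothetical rewards and log-probs.
rewards = [1.0, 0.0, 0.0, 1.0]
logp_ref = [-12.0, -10.0, -11.0, -13.0]
loss_at_ref = regression_loss(logp_ref, logp_ref, rewards, beta=1.0)
```

A loss of this form contains no importance ratio between the trainer and the sampler, which is consistent with the claim that such heuristics are unnecessary.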

Compression of reasoning context is handled end-to-end within the RL loop: the agent is trained to compress its search history by itself, learning, via outcome rewards, how much information to retain to maximize answer correctness and sample efficiency.
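The control flow around compression can be sketched as a loop that compresses the running context whenever it exceeds a budget. Everything here is a hypothetical stand-in: `run_agent`, the whitespace token counter, the 50-token budget, and the `search` and `compress` callables, which in the described system would both be the trained model.

```python
def count_tokens(text):
    """Whitespace token count; a stand-in for a real tokenizer."""
    return len(text.split())

def run_agent(queries, search, compress, budget=50):
    """Append search results to the context and compress when over budget.

    `search` and `compress` are caller-supplied callables. In the system the
    summary describes, both roles are played by the trained model itself.
    """
    context, compressions = "", 0
    for query in queries:
        context = (context + " " + search(query)).strip()
        if count_tokens(context) > budget:
            context = compress(context)
            compressions += 1
    return context, compressions

# Hypothetical stand-ins: search returns 30 filler tokens, compress keeps the last 10.
fake_search = lambda q: " ".join(f"{q}_{i}" for i in range(30))
keep_tail = lambda ctx: " ".join(ctx.split()[-10:])
final_context, n_compressions = run_agent(["q1", "q2", "q3"], fake_search, keep_tail)
```

In this toy run the budget is exceeded once, after the second search, so the final context holds 10 compressed tokens plus 30 new ones.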

Multi-task Training and Generalization

Unlike SFT-based multi-expert distillation, which yields in-distribution performance gains but limited out-of-distribution scaling, direct multi-task RL across BrowseComp-Plus and TREC-Biogen improves in-domain performance and generalizes to four structurally distinct held-out regimes. Out-of-distribution generalization is evidenced by consistent score improvements on all tasks, surpassing SFT-based models, especially when combined with test-time compute scaling.

Figure 3: RL post-training enables scaling on generalization, while SFT benefits in-distribution only and saturates out-of-distribution.

Test-time Compute Scaling

The paper systematically explores test-time compute via Parallel Thinking (generating N independent rollouts, then aggregating them) and Value-Guided Search (BFS tree search in which a value model scores partial rollouts). On complex tasks with open-ended aggregation such as PMBench, generative aggregation can compose answers that are better than the best single candidate; on factoid-style tasks, value-guided search with weighted majority voting yields the best test-time scaling.
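Weighted majority voting, mentioned above for factoid-style tasks, sums a weight per distinct answer and returns the answer with the largest total. In this sketch the weights are assumed to be value-model scores; the answers and scores are invented for illustration.

```python
from collections import defaultdict

def weighted_majority_vote(answers, weights):
    """Return the answer whose rollouts have the largest total weight."""
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)

# Hypothetical final answers from N=5 rollouts with value-model scores.
answers = ["Paris", "Lyon", "Paris", "Lyon", "Lyon"]
values = [0.9, 0.2, 0.8, 0.3, 0.1]
winner = weighted_majority_vote(answers, values)
```

With uniform weights the same rollouts would elect "Lyon" (three votes to two), so the value scores change the outcome in this example.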

Figure 4: Parallel Thinking framework for generating and aggregating multiple rollouts in parallel.

Figure 5: Value-Guided Search leverages value function predictions for branch selection and aggregation.
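A breadth-first search that keeps only the highest-value partial rollouts at each depth can be sketched as follows. This is a generic beam-style search under stated assumptions, not the paper's implementation: `expand`, `value`, `is_final`, `beam_width`, and `max_depth` are hypothetical parameters, and the toy value function simply counts characters.

```python
def value_guided_search(root, expand, value, is_final, beam_width=2, max_depth=4):
    """Breadth-first expansion that keeps the `beam_width` highest-value states.

    `expand(state)` returns candidate next states, `value(state)` scores a
    state, and `is_final(state)` marks completed rollouts.
    """
    frontier, finished = [root], []
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in expand(state)]
        if not candidates:
            break
        candidates.sort(key=value, reverse=True)
        kept = candidates[:beam_width]
        finished.extend(s for s in kept if is_final(s))
        frontier = [s for s in kept if not is_final(s)]
        if not frontier:
            break
    return finished

# Toy problem: states are digit strings, a rollout ends at length 3, and the
# hypothetical value function prefers states with more "1" characters.
expand = lambda s: [s + "0", s + "1"]
value = lambda s: s.count("1")
is_final = lambda s: len(s) == 3
results = value_guided_search("", expand, value, is_final)
```

The completed rollouts returned here could then be combined with a voting or aggregation step.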

Given a sufficiently large test-time compute budget, KARL matches or exceeds the strongest closed models (Claude Opus 4.6), and does so at strictly lower cost and latency.

Figure 6: KARL achieves the best cost–quality and latency–quality Pareto frontiers, dominating both open- and closed-source agentic models.

Search Behavior, Efficiency, and Compression

RL-trained KARL demonstrates both sharpened and genuinely expanded solution capability: max@k metrics improve at all k, and per-question outcome flows show that many previously unsolved questions are solved after training. RL shortens trajectories for efficiently solvable queries and reduces redundant searches post-retrieval, improving cost and latency. Search diversity and document recall scale with RL iterations, while answer accuracy improves even when only partial retrieval is achieved.
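The summary does not define max@k. One plausible reading, assumed here, is the best score among k rollouts for each question, averaged over questions. The sketch below uses that reading with invented scores.

```python
def max_at_k(scores_per_question, k):
    """Mean over questions of the best score among the first k rollouts."""
    best = [max(scores[:k]) for scores in scores_per_question]
    return sum(best) / len(best)

# Hypothetical per-rollout scores for three questions, four rollouts each.
scores = [
    [0.0, 0.5, 1.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 1.0],
]
curve = {k: max_at_k(scores, k) for k in (1, 2, 4)}
```

Under this reading, gains at small k indicate sharpening (good answers are sampled more reliably), while gains at large k indicate that questions no rollout could solve before are now within reach.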

Figure 7: RL training shortens the median and overall trajectory length, especially for solved questions.

Figure 8: RL preferentially shifts longer, previously unsolved trajectories into partial/solved categories.

Compression is not only essential for long-horizon reasoning but is improved by RL: cross-eval studies show that swapping in the RL-trained compression module alone (on top of the base search model) consistently raises answer accuracy.

Figure 9: The aroll harness decouples agent, environment, and context management plugins for maximal RL data collection and evaluation throughput.

Ablations and Analysis

Ablation studies confirm robustness: performance is stable to embedding model choice in retrieval, with sharp drops only if context compression is removed or the retrieval count is excessively large (flooding context and eliminating multi-step reasoning).

Empirical search traces and LLM-based classification reveal that after RL, KARL is less likely to become trapped in exhaustive, unconverged search modes and more frequently balances exploration with early commitment (an "explore-then-commit" behavioral mode). Notably, premature commitment occurs primarily when reasoning across multi-hop constraints involving difficult arithmetic or numeric reasoning, suggesting an avenue for further reward engineering and capability extension.

Numerical Results

KARL matches or surpasses closed-source Claude Opus 4.6 in aggregate on KARLBench at a fraction of the cost and latency. For example, with 10 parallel traces, KARL achieves an overall score of 67.5, equaling Opus 4.6, at 33% lower cost per query and 47% lower latency. Out-of-distribution performance scales with RL and test-time compute, unlike SFT-based agents, which scale only in-distribution (see Table 1 and the main results in the original paper for the full breakdown).
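The figures above can be checked against the definition of Pareto dominance. The sketch normalizes the closed model's cost and latency to 1.0, an assumption made because absolute values are not given here, and encodes the reported 33% and 47% reductions; the tuple layout and function name are illustrative.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every axis and better on one.

    Each system is (quality, cost, latency): higher quality is better, lower
    cost and latency are better.
    """
    at_least = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return at_least and strictly

# Cost and latency are normalized so that the closed model equals 1.0.
opus = (67.5, 1.00, 1.00)
karl_10_traces = (67.5, 1.00 - 0.33, 1.00 - 0.47)
karl_dominates = dominates(karl_10_traces, opus)
```

Equal quality at lower cost and lower latency satisfies the definition, so this configuration dominates the closed model on both trade-off curves.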

Theoretical and Practical Implications

The results imply that highly general, efficient, and scalable knowledge agents can be derived via synthetic, task-diverse data generation and large-batch off-policy RL, without architectural changes or brittle reward heuristics. The action space restriction to vector search proves sufficient for a large span of grounded reasoning problems. Compression, when solved via outcome-reward-driven RL, supports very long-horizon tasks without manual context management.

Future work should extend this paradigm to tool chaining, code execution, and hierarchical memory, further expanding domain coverage and efficiency. The increasing marginal gains from additional RL iterations, apparent in continual pass@k improvements and improved search commitment calibration, suggest scalability to even more challenging information-seeking workloads.

Conclusion

KARL demonstrates that a systematic combination of iterative synthetic data, robust off-policy RL, and careful multi-task design yields knowledge agents with Pareto-optimal cost–quality and latency–quality tradeoffs, generalizing far beyond the training distribution. These findings set a new baseline for agentic, closed-corpus, evidence-grounded reasoning, motivating further work on compositional action spaces and dynamic memory for knowledge agents.

Explain it Like I'm 14

A simple guide to “KARL: Knowledge Agents via Reinforcement Learning”

1) What is this paper about?

This paper introduces KARL, a computer system that acts like a smart research assistant for companies. Instead of guessing answers, it searches through real documents, gathers evidence, and reasons step by step to give grounded, trustworthy answers. The authors also build a new test suite, called KARLBench, to fairly measure how good such assistants are across many kinds of tough, real-world search-and-reason tasks.

2) What were the researchers trying to find out?

In plain terms, they asked:

Can we train an AI assistant that’s good at many different “research” jobs, not just one?
Can it find and use information from large collections of company documents, not just the public internet?
Can we teach it efficiently (without wasting lots of time and money) and still get top-quality results?
Will a system trained on a mix of search styles handle new, unseen tasks better than a system trained on just one?

3) How did they do it?

Think of teaching KARL like training a well-rounded athlete:

You set up varied drills (different tasks).
You create realistic practice materials (hard questions with real evidence).
You coach it using feedback about what worked and what didn’t.
During a game, you let it think in parallel and pick the best move.

Here’s what that looks like in AI terms:

A new benchmark: KARLBench
- Find the one item that fits many rules (constraint-driven search)
- Write a careful report using pieces from many documents (cross-document synthesis)
- Do math with numbers in long tables (tabular reasoning)
- Find all items that match a rule (exhaustive retrieval)
- Follow instructions hidden across tech docs (procedural reasoning)
- Pull facts from messy internal notes (enterprise fact aggregation)
One simple tool, used cleverly: vector search. Imagine an ultra-fast “index” that finds the most relevant chunks of text when you type a query. KARL has only this tool. It asks the search engine for pieces of documents, reads them, asks again, and keeps reasoning. This isolates what’s being tested: smart searching and grounded reasoning.
Keeping its “memory” tidy: compression. Long research can overflow the model’s context window (its short-term memory). KARL learns to summarize its own past steps when things get too long, like a student making notes to keep track of what matters.
Making high-quality training data (agentic synthesis). The system creates its own practice questions and answers in two steps:
- Exploring the document collection and proposing new questions plus answers grounded in retrieved evidence.
- Trying to solve those questions multiple times. If a question is too easy or nearly impossible, it’s thrown out. A separate check filters out poor-quality or ambiguous ones.
This is like a coach inventing drills from the team’s playbook, then keeping only the exercises that challenge, but don’t overwhelm, the player.
How KARL learns: off-policy reinforcement learning (RL)
- Stable for large models
- Efficient with big batches of past experiences
- Easy to extend to multiple tasks at once
Multi-task training. Instead of training on one type of search, KARL trains on different ones together. This is like cross-training: improving fitness in a way that helps in new, unseen sports. The authors show this helps the model generalize better to tasks it wasn’t explicitly trained on.
“Thinking harder” at test time
- Parallel Thinking: It generates multiple possible answers in parallel and then uses what it found to build a better final answer. Unlike simple voting, the final step can create an even better, evidence-based response.
- Value-Guided Search: It trains a small helper model to score how promising each partial idea is and uses that to pick better next steps, like a coach on the sidelines steering the play.
Strong, fast infrastructure. The team built a fast, consistent system to run huge numbers of long searches, and a framework that keeps the behavior identical across training, evaluation, and deployment. That avoids surprises when moving from the lab to real use.

4) What did they find, and why is it important?

KARLBench shows wide coverage of “grounded reasoning”. It tests skills that matter for companies (finance, tech docs, biomedical summaries, internal notes, etc.). This gives a balanced picture of an assistant’s real abilities.
Training across different search styles helps generalization. Models trained on varied tasks performed better on new, held-out tasks than models trained on just one. This matters because real jobs rarely look exactly like the practice set.
Synthetic, grounded data works. The agent-made training data, carefully filtered and tied to retrieved evidence, helped the system learn reliably without hand-labeling tons of examples.
Off-policy RL was efficient and stable. The training method handled large models without the usual tricks, saving time and engineering complexity.
Competitive performance at lower cost and latency. Compared to top closed-source systems (like Claude 4.6 and GPT 5.2), KARL hit a “Pareto-optimal” balance, meaning it offered better or equal quality for the same or lower cost and latency. With more test-time compute (like more parallel rollouts), it matched or even beat the strongest models.

Why it matters: This shows you can build powerful, trustworthy research assistants that work well across many tasks, use real evidence, and don’t break the bank.

5) What’s the big picture impact?

For companies: KARL-style systems can help teams find facts, do calculations from long reports, follow procedural steps, and summarize findings using their private data. That saves time, improves accuracy, and keeps sensitive information in-house.
For AI research: The work suggests a recipe that scales: use grounded synthetic data, train with stable off-policy RL across multiple tasks, and add smart test-time strategies when you need extra quality. It also highlights the importance of evaluating across many skills—not just one benchmark.
For the future: The approach could extend to more tools (like structured databases or code execution), richer document traversal, and stronger privacy controls. It points toward AI assistants that are not only clever, but also careful, explainable, and practical in real workplaces.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, framed to guide future research.

Evaluation constrained to a single tool: The agent is restricted to vector search only; it is unclear how KARL performs with richer toolsets (e.g., web browsing, document traversal, table extraction/SQL, code execution, OCR, or multi-tool orchestration).
Closed-corpus focus: By avoiding live web search, the work sidesteps issues of dynamic content and search engine variability; generalization of KARL’s capabilities to open-web or evolving corpora is not evaluated.
Retrieval coverage limits on BrowseComp-Plus: Indexing only the first 512 tokens per document imposes a hard upper bound; the impact of enabling document traversal or more complete indexing on both baseline and KARL performance is not measured.
PMBench availability and reproducibility: PMBench is proprietary; lack of public release (data, prompts, nuggets, and indexing) prevents independent validation of reported gains and hinders community benchmarking.
Potential evaluation bias on QAMPARI: Indexing only documents that contain at least one gold entity may alter the search prior and difficulty; how this design choice affects absolute and comparative performance is not studied.
Embedding/model heterogeneity across tasks: Different embedding models and retrieval hyperparameters are used per dataset; no ablation isolates how embedding choice and k-values affect results or fairness of cross-model comparisons.
Fixed retrieved-token budget and k caps: Using a uniform retrieved-token budget (e.g., k ≤ 20) may constrain recall in tasks requiring exhaustive aggregation (e.g., QAMPARI, PMBench); sensitivity to k and budget scaling is not explored.
Minimal corpus preprocessing by design: While promoting generalization, the choice to avoid corpus-specific chunking, linking, or metadata enrichment obscures how much of KARL’s gains would persist or improve with stronger retrieval pipelines.
End-to-end learned compression: Compression quality, failure modes, and its contribution to final performance are not quantified; no comparison is provided against specialist summarizers or alternative compression strategies.
Reward design and credit assignment: Assigning the terminal rollout reward to all segments (including compression) may dilute or misplace learning signal; step-level credit assignment or reward shaping is not investigated.
Nugget-based evaluation alignment: Nugget completeness is used uniformly, but its correlation with human judgments (especially for multi-paragraph synthesis in TREC-Biogen or procedural answers in FreshStack) is not validated via human studies.
Synthetic data quality and decontamination: The synthesis and quality filter agents are themselves LMs; false positives/negatives in ambiguity and factuality detection are unquantified. Using evaluation examples as few-shot seeds (TREC-Biogen) heightens leakage risk despite de-duplication; deeper decontamination beyond surface similarity is not assessed.
Self-bootstrapping risks: Iteratively training on model-generated questions/solutions may induce distributional drift or amplify model biases; mechanisms to detect and mitigate error propagation in the synthesis loop are not presented.
Limited multi-task scope: Multi-task RL is demonstrated on only two tasks (BrowseComp-Plus and TREC-Biogen); it is unknown how scalability to more diverse tasks affects stability, interference, and OOD generalization.
Task balancing strategy: The simple “equal training tokens” heuristic may be suboptimal; alternatives (e.g., difficulty-aware sampling, temperature-based task mixing, per-task curriculum) are not explored.
Multi-expert distillation vs. multi-task RL: Claims of better OOD generalization for multi-task RL rely on two-task experiments; whether the advantage persists with more experts/tasks or alternative distillation recipes is untested.
Off-policy RL stability and limits: OAPL’s robustness claims lack detailed ablations versus online GRPO across architectures and scales (e.g., actual MoE vs. dense); boundaries where off-policy learning degrades are not mapped.
Trainer–inference discrepancy: While robustness to trainer–inference engine differences is claimed, the magnitude and conditions of such discrepancies and their impact on quality/latency are not quantified.
Test-time compute (TTC) methodology gaps: Parallel Thinking and VGS lack comprehensive ablations on N, k, value-model size, and aggregation prompts; cost/latency/quality trade-offs and saturation points are not systematically reported.
Aggregation strategy analysis: The paper notes the aggregator can synthesize better answers than any single rollout, but does not compare against strong baselines (e.g., weighted majority, consensus debate, cross-evidence validation).
Value model calibration and generalization: The value model is trained on rollouts from the same policy; its calibration, susceptibility to overfitting, and transfer across tasks or policies are not assessed.
Fairness of cross-model comparisons: Competitors may use different tool stacks or web access; without standardizing tools/retrieval and reporting compute budgets, the comparability of Pareto frontiers is uncertain.
Latency and operational constraints: Although cost–quality and latency–quality Pareto claims are made, per-task latency distributions under different TTC budgets and real-world SLAs are not detailed.
Failure mode analysis: The paper lacks qualitative error analysis across tasks (e.g., common retrieval misses, spurious compression, numeric reasoning slips, or synthesis inconsistencies), limiting diagnostic insight.
Security, privacy, and governance: Training on internal notes (PMBench) and deploying agentic retrieval in enterprise settings pose data governance and privacy questions that are not addressed (e.g., access control, auditability, PII handling).
Release of artifacts: The “aroll” harness, prompts, reward code, compression prompts, aggregation prompts, and TTC configs are not indicated as being released; reproducibility and community adoption are therefore constrained.
Domain transfer to non-text modalities: KARL is evaluated on text corpora; performance on documents with rich tables, figures, or code repositories requiring execution/runtime validation is not assessed.
Long-horizon navigation tools: For tasks with lengthy documents (e.g., FinanceBench), the absence of structured navigation tools (e.g., table readers, section jumpers) leaves open whether KARL can scale to even longer, more complex corpora.
Continual learning and corpus drift: How KARL adapts to frequent corpus updates in enterprise settings (without catastrophic forgetting or costly re-training) is left unexplored.


Practical Applications

Overview

Based on the paper “KARL: Knowledge Agents via Reinforcement Learning,” the following items translate the system’s findings, methods, and innovations into practical applications across industry, academia, policy, and daily life. Each item notes sectors and key dependencies that may affect feasibility.

Immediate Applications

Enterprise knowledge agent for internal corpora
- Sectors: software, finance, manufacturing, legal, government
- What: Deploy a vector-search–only, grounded reasoning assistant over intranet docs (wikis, PM notes, policies, SOPs). Handles constraint-driven entity search, fact aggregation, and cross-document synthesis with end-to-end context compression.
- Tools/products/workflows: “Enterprise KARL” chatbot; secure RAG service with compression plugin; aroll harness for identical behavior across data collection, training, and serving.
- Dependencies/assumptions: High-quality embeddings and indexing; closed corpora access and data governance; nugget-based QA evaluation setup.
Financial report analysis and KPI extraction
- Sector: finance
- What: Navigate 10-K/10-Qs and investor PDFs to compute year-over-year metrics, extract tabular values, and generate audit-ready calculations.
- Tools/products/workflows: Finance copilot with vector search; TTC “Parallel Thinking” mode for reliability; nugget-based rubric to assess completeness.
- Dependencies/assumptions: Page-level indexing quality; coverage of tables; guardrails for accuracy; compliance sign-off.
Biomedical literature review synthesis (non-clinical decision support)
- Sectors: healthcare, pharma/biotech
- What: Cross-document evidence aggregation (e.g., vaccine effectiveness summaries) for medical affairs and research reviews.
- Tools/products/workflows: Evidence-synthesis agent on PubMed-like corpora; nugget conversion for multi-reference answers; QA dashboards with completeness scoring.
- Dependencies/assumptions: Curated corpora; human-in-the-loop review; not a substitute for clinical decision-making.
Regulatory, risk, and compliance audit search
- Sectors: finance, healthcare, energy, defense
- What: Exhaustive retrieval of entities/events across policies, logs, and controls (e.g., sanctions, governance controls, audit trails).
- Tools/products/workflows: QAMPARI-style exhaustive search mode; deduplicated evidence bundles; report synthesis with traceable nuggets.
- Dependencies/assumptions: Structured access to logs/docs; precision/recall thresholds set with compliance teams.
Developer support and customer support assistants
- Sectors: software, IT services, SaaS
- What: Procedural reasoning over technical documentation to produce step-by-step fixes and runbooks.
- Tools/products/workflows: FreshStack-like doc agent; value-guided search (VGS) to prefer solutions that empirically lead to correct outcomes; integrated into ticketing systems.
- Dependencies/assumptions: Up-to-date docs; safe tool-call policies; clarity on when to escalate to humans.
Internal planning and notes synthesis (PMBench-style)
- Sectors: software, product orgs across industries
- What: Aggregate dispersed facts from meeting notes and planning documents to answer “who/what/when/risks” queries with source citations.
- Tools/products/workflows: Lightweight PM knowledge agent; parallel rollouts with aggregator to boost correctness when inputs are noisy.
- Dependencies/assumptions: Proper access controls; redaction where needed; minimal preprocessing sufficient for retrieval coverage.
Upgrade existing RAG systems with context compression
- Sectors: software, knowledge management platforms
- What: Introduce RL-tuned compression within agent loops to maintain salient context and reduce prompt costs without sacrificing accuracy.
- Tools/products/workflows: Compression lifecycle plugin integrated into existing RAG pipelines; reward-driven end-to-end optimization.
- Dependencies/assumptions: Stable prompt templates; compatibility with current LLM serving stack.
Agentic synthetic data generation for enterprise fine-tuning
- Sectors: software/ML platform teams, model vendors
- What: Use the agentic synthesis pipeline to produce grounded, decontaminated question–answer pairs and solution trajectories from company corpora; filter by pass-rate to target “learnable” difficulty.
- Tools/products/workflows: “Agentic Synthesis Studio” with dedup and quality filters; bootstrapping with progressively stronger checkpoints.
- Dependencies/assumptions: Few-shot seed examples; compute for rollouts; decontamination to avoid evaluation leakage.
Off-policy RL post-training (OAPL) for domain adaptation
- Sectors: ML platforms, enterprise AI teams
- What: Post-train a base LM to enterprise tasks using large-batch, iterative off-policy RL without complex online RL infra.
- Tools/products/workflows: OAPL training pipeline with group rollouts and KL-regularized regression objective; masked multi-step trajectories (search queries, compressions).
- Dependencies/assumptions: Rewardable tasks (e.g., nugget metrics or binary success); capacity for dataset regeneration across iterations.
Nugget-based evaluation framework for grounded QA
- Sectors: enterprise AI QA, procurement, governance
- What: Standardize evaluation with nugget completeness scoring for varied task types (single-fact, exhaustive sets, multi-para reports).
- Tools/products/workflows: Internal “KARLBench-like” eval suite; task-specific nugget prompts; leaderboards tracking cost/latency vs quality.
- Dependencies/assumptions: Gold references or curated nugget sets; agreement on reward prompts.
High-throughput offline retrieval for data generation and testing
- Sectors: ML platform, infra
- What: Embedded, columnar vector DB per worker to achieve 500+ QPS per host for rollout collection and TTC.
- Tools/products/workflows: In-process vector search library; cached indexes; uniform tool interface for training and serving.
- Dependencies/assumptions: Memory footprint and storage budgets; consistent embedding models across environments.
Parallel Thinking inference mode
- Sectors: all deploying LLM agents
- What: Launch an “N-parallel rollouts + aggregation” inference mode that often beats best-of-N or simple majority vote, while keeping latency manageable via parallelism.
- Tools/products/workflows: Aggregation prompt that can use tools for verification/refinement; budget-aware N selection.
- Dependencies/assumptions: Headroom for parallel compute; aggregation tuned to task; monitoring for diminishing returns.
Value-Guided Search (VGS) with small value model
- Sectors: support, finance, biomedical synthesis
- What: Train a small value model to predict future success probability token-by-token and steer step selection during inference.
- Tools/products/workflows: BFS-style step expansion with value scoring; final rollout aggregation via weighted voting or best-of-N.
- Dependencies/assumptions: Binary or nugget-derived rewards for value training; validation of value calibration.

Long-Term Applications

Multi-tool, enterprise-grade research agents
- Sectors: finance, legal, healthcare, manufacturing
- What: Extend from vector search to multi-tool orchestration (browsers, table parsers, spreadsheets, code execution, doc traversal) for broader grounded reasoning (e.g., deep table extraction, live market data).
- Tools/products/workflows: Tool-augmented agent harness with safety gates; RL that includes tool selection and outcomes.
- Dependencies/assumptions: Tool reliability; expanded evaluation to multi-tool settings; stronger safety/observability.
Real-time enterprise “deep research” with compliance
- Sectors: government, regulated industries
- What: Generalize the deep research paradigm to secure/certified corpora, combining live intranet sources with archived records for evidence-backed policy or executive briefs.
- Tools/products/workflows: Hybrid closed–open indexing; differential privacy where required; governance dashboards.
- Dependencies/assumptions: Content freshness and versioning; legal review for web content use; robust provenance tracking.
Safety-critical decision support (with rigorous validation)
- Sectors: healthcare, energy, aerospace
- What: Pilot evidence synthesis for clinical guidelines, safety notices, and engineering incident analysis with strict guardrails and post-market monitoring.
- Tools/products/workflows: Human-in-the-loop workflows; red teams and external validation; documented uncertainty estimates.
- Dependencies/assumptions: Regulatory approval; extensive domain-specific evaluation; liability and audit readiness.
Budget-aware test-time compute scheduling
- Sectors: platform, cloud AI, cost-sensitive deployments
- What: Dynamic controllers to allocate rollouts (N) per query based on difficulty estimates or value scores to optimize cost–latency–quality per request.
- Tools/products/workflows: Difficulty heuristics from value models; SLAs and budgets encoded in serving policies.
- Dependencies/assumptions: Accurate difficulty prediction; service mesh coordination; customer-configurable QoS.
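One simple controller of this kind maps a difficulty estimate to a rollout count and caps it by a per-request budget. The linear mapping, the default bounds, and the function name below are assumptions for illustration; the source does not specify a policy.

```python
def choose_num_rollouts(
    difficulty: float,
    budget: int,
    n_min: int = 1,
    n_max: int = 16,
) -> int:
    """Map a difficulty estimate in [0, 1] to a rollout count, capped by budget."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    if budget < n_min:
        raise ValueError("budget cannot cover the minimum number of rollouts")
    # Linear interpolation between n_min and n_max (an illustrative choice).
    n = n_min + round(difficulty * (n_max - n_min))
    return min(n, budget)


easy_n = choose_num_rollouts(0.0, budget=16)
hard_n = choose_num_rollouts(1.0, budget=16)
capped_n = choose_num_rollouts(1.0, budget=8)
```

A difficulty estimate could, for instance, be derived as one minus a value model's predicted success probability, provided that model is well calibrated.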
Continual learning via pass-rate–targeted active learning
- Sectors: enterprise AI platforms
- What: Use pass-rate bands to continuously mine “learnable but hard” examples from production queries/doc updates for ongoing improvement.
- Tools/products/workflows: Feedback loops in the aroll harness; automated synthesis and filtering; periodic OAPL iterations.
- Dependencies/assumptions: Data consent; drift monitoring; robust decontamination.
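Pass-rate band filtering can be sketched as below: sample several rollouts per prompt, compute the empirical pass rate, and keep prompts that are neither always solved nor never solved. The band limits of 0.1 and 0.9 are illustrative assumptions, not values from the paper.

```python
from typing import Dict, List, Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of rollouts that succeeded for one prompt."""
    return sum(outcomes) / len(outcomes)


def select_learnable(
    outcomes_by_prompt: Dict[str, Sequence[bool]],
    low: float = 0.1,
    high: float = 0.9,
) -> List[str]:
    """Keep prompts whose empirical pass rate lies in the band [low, high]."""
    return [
        prompt
        for prompt, outcomes in outcomes_by_prompt.items()
        if outcomes and low <= pass_rate(outcomes) <= high
    ]


outcomes = {
    "easy": [True, True, True, True],      # pass rate 1.0, too easy
    "hard": [False, False, False, False],  # pass rate 0.0, no learning signal
    "mid": [True, False, False, True],     # pass rate 0.5, kept
}
selected = select_learnable(outcomes)
```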
Cross-modal grounded reasoning
- Sectors: manufacturing, education, energy
- What: Incorporate figures, schematics, and images in manuals (e.g., wiring diagrams) into procedural reasoning and troubleshooting.
- Tools/products/workflows: Multimodal embeddings and retrieval; toolchain for OCR and diagram parsing; extended nugget metrics.
- Dependencies/assumptions: High-quality multimodal encoders; scalable indexing and chunking for images/figures.
Sector-wide benchmark standardization and procurement criteria
- Sectors: policy, government, standards bodies
- What: Adopt KARLBench-like suites (nugget-based, multi-capability) as procurement standards for AI assistants in regulated contexts.
- Tools/products/workflows: Public/sector-specific corpora and tests; cost/latency/quality Pareto reporting; open evaluation protocols.
- Dependencies/assumptions: Consensus on tasks and rubrics; versioned datasets; transparent reporting.
Marketplace of specialized knowledge agents
- Sectors: software, consulting, vertical AI vendors
- What: Offer pre-trained, OAPL-tuned agents per vertical (e.g., IFRS finance agent, SOC2/ISO compliance agent, automotive service agent) with plug-in TTC profiles.
- Tools/products/workflows: Deployment blueprints; customer-specific fine-tuning with agentic synthesis; policy packs and guardrails.
- Dependencies/assumptions: IP/licensing for training data; clear customization boundaries; support for varied vector DBs.
Autonomous knowledge graph construction from grounded QA
- Sectors: enterprise search, analytics
- What: Convert stable nuggets and citations into edge assertions for continuously updated, auditable knowledge graphs feeding search and analytics.
- Tools/products/workflows: Nugget-to-graph pipelines; confidence scoring from value models; change detection and rollback.
- Dependencies/assumptions: Schema design; provenance capture; reconciliation and deduplication.
Scalable MoE RL post-training as a managed service
- Sectors: ML infrastructure, cloud providers
- What: Offer OAPL-based post-training for customers’ base models, with multi-task setups and stability on large MoEs without GRPO heuristics.
- Tools/products/workflows: Data generation at scale with embedded vector search; offline large-batch RL; hyperparameter sweeps amortized over generated datasets.
- Dependencies/assumptions: GPU capacity; privacy/security controls; robust failure handling and monitoring.

Glossary

Agentic retrieval: Agent-driven iterative search and evidence gathering over a corpus using tools. "Our objective is to evaluate agentic retrieval under heterogeneous and realistic corpus conditions"
aroll: The paper’s high-throughput internal agentic rollout framework for collecting, training on, and evaluating long-horizon agent trajectories. "We built our internal agentic rollout framework (called “aroll” for convenience)"
Best-of-N: A test-time compute strategy that samples N candidate answers and selects the best one according to some criterion. "more expressive than simple TTC strategies such as Best-of-$N$ or Majority Vote."
Breadth-First-Search (BFS): A tree search strategy that explores all nodes at a given depth before moving deeper; used here to navigate candidate continuations guided by a value model. "This is the simple Breadth-First-Search (BFS) implementation proposed in VGS"
Closed-corpus benchmarks: Evaluation settings where retrieval happens over a fixed, indexed corpus rather than the live web. "we deliberately focus on closed-corpus benchmarks rather than web-search–based evaluation settings"
Clipped importance weighting: An off-policy correction heuristic that limits (clips) importance weights to stabilize training. "without requiring heuristics such as clipped importance weighting, data deletion, or router replay"
Columnar vector database: A column-oriented data store optimized here for fast, in-process vector similarity search. "we utilized an embedded, columnar vector database for vector search."
Compression (context management): Summarizing interaction history to stay within token limits while preserving salient information for subsequent reasoning. "For long rollouts, we design a compression mechanism for context management."
Decontamination pipeline: A process that removes exact or near-duplicate data to avoid test contamination. "based on the LMSys-recommended decontamination pipeline"
Deduplication Agent: A component that filters out exact or near-duplicate synthesized examples before training. "the Deduplication Agent filters out any exact or near-duplicates."
GRPO (Group Relative Policy Optimization): An online reinforcement learning algorithm used for post-training LLMs; known to require stability heuristics in MoE settings. "stabilizing online GRPO training for large-scale MoE models"
Group rollouts: Multiple responses generated for the same prompt under a reference policy, used to estimate values and advantages. "is the group rollouts generated by $\pi_{\text{ref}}$ given $x$"
KL-regularized RL objective: An RL objective that maximizes reward while penalizing divergence from a reference policy via the KL divergence. "the KL-regularized RL objective:"
LATS: A lookahead tree-search method that uses the LLM itself to evaluate step-level candidates during search. "compared to LATS~\citep{zhou2024lats}, which uses the LLM itself as a step-level evaluator"
Lifecycle Plugins: Modular hooks that modify the agent–environment loop for behaviors like compression, step limits, or tool access control. "Lifecycle Plugins that intercept fixed points in the environment's interaction loop"
Long-horizon reasoning: Multi-step reasoning that spans many tool calls and decisions before producing a final answer. "employs long-horizon reasoning and tool use"
Majority Vote: An aggregation method that outputs the answer most frequently proposed among multiple sampled candidates. "more expressive than simple TTC strategies such as Best-of-$N$ or Majority Vote."
Mixture-of-Experts (MoE): A model architecture where a router selects among specialized expert subnetworks for each token or input. "large-scale MoE models"
Multi-expert distillation: Training a single student model by supervising it on outputs from multiple specialized teacher models. "Multi-expert distillation has been used in the literature to train some of the best open-source models"
Nugget-based completion: An evaluation method that scores answers by checking for predefined atomic facts (“nuggets”) they contain. "We unify answer evaluation across all tasks using nugget-based completion"
Nugget-based evaluation framework: A TREC-style evaluation approach that decomposes references into granular facts and measures coverage. "consistent with the nugget-based evaluation framework spearheaded by \cite{voorhees2003trecqa}"
OAPL (Optimal Advantage-based Policy Optimization with Lagged Inference policy): The paper’s off-policy RL post-training recipe that regresses to an advantage-implied target under KL regularization. "We use Optimal Advantage-based Policy Optimization with Lagged Inference policy -- OAPL"
Off-policy RL: Reinforcement learning that optimizes a policy using data generated by a different (reference) policy. "a new post-training recipe based on the concept of Large Batch Iterative Off-policy RL."
Off-policyness: The degree to which training data are generated by a behavior different from the current policy, potentially causing distribution mismatch. "By embracing the off-policyness in the design of the objective"
Out-of-distribution generalization: Maintaining strong performance on tasks or distributions not seen during training. "with out-of-distribution generalization"
Pareto frontiers: The set of models that are not strictly dominated across multiple objectives (e.g., cost–quality, latency–quality). "The cost–quality and latency–quality Pareto frontiers show that KARL achieves favorable trade-offs over existing models."
Pareto-optimal: A point on the Pareto frontier that cannot improve one objective without degrading another. "KARL is Pareto-optimal on KARLBench"
Parallel Thinking: A test-time compute strategy where multiple rollouts are generated in parallel and then aggregated into a final answer. "We apply Parallel Thinking ~\citep{zhao2025samplescrutinizescaleeffective, ...} as a task-independent TTC strategy"
Reference model: A baseline policy used for sampling and KL regularization during off-policy optimization. "We denote $\pi_{\text{ref}}$ as the reference model"
Router replay: A MoE training heuristic that replays router (expert-selection) decisions to stabilize learning. "without requiring heuristics such as clipped importance weighting, data deletion, or router replay"
Step budgeting: Imposing a limit on the number of tool/assistant steps in a rollout to control cost and latency. "Cross-cutting concerns such as context compression, step budgeting, and tool gating"
Supervised fine-tuning (SFT): Training a model on labeled examples with standard supervised learning objectives. "via SFT (supervised fine-tuning)"
Test-time compute (TTC): Additional computation at inference time (e.g., multiple rollouts or search) to improve answer quality. "We investigate test-time compute (TTC) as a powerful augmentative method to boost performance"
Tool gating: Controlling whether and how an agent can invoke external tools during its reasoning process. "Cross-cutting concerns such as context compression, step budgeting, and tool gating"
Value model: A model that predicts the probability of eventual success given a partial rollout, used to steer search. "a value model which predicts the future probability of the success given any partial rollout."
Value-Guided Search (VGS): A search method that uses a value model to score candidate continuations and select high-value branches. "We also apply Value-Guided Search (VGS) — a method that performs parallel tree search using a value model"
Vector search: Retrieval by nearest-neighbor search in embedding space over chunked documents. "we equip the agent with a single tool: vector search."
vLLM: A high-throughput LLM inference engine used in serving/evaluation pipelines. "discrepancies between the trainer and the inference engine (e.g., vLLM)"
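
For readers who want the math behind the "KL-regularized RL objective", "Group rollouts", and "OAPL" entries, the following is the standard textbook form of the objective and its closed-form solution. The symbols $\beta$ (regularization strength), $r$ (reward), $V^{*}$ (optimal value), and $G$ (group size) are notation assumed here; the paper's exact notation and loss may differ.

```latex
% KL-regularized objective for a prompt x and response y:
\max_{\pi}\;
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big)

% Its maximizer and the associated optimal value:
\pi^{*}(y \mid x)
  = \pi_{\text{ref}}(y \mid x)\,
    \exp\!\Big( \tfrac{1}{\beta}\big( r(x, y) - V^{*}(x) \big) \Big),
\qquad
V^{*}(x)
  = \beta \log \mathbb{E}_{y \sim \pi_{\text{ref}}(\cdot \mid x)}
    \Big[ \exp\!\big( r(x, y) / \beta \big) \Big]

% Rearranged, the log-ratio equals the optimal advantage:
\beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\text{ref}}(y \mid x)}
  = r(x, y) - V^{*}(x)

% V^* can be estimated from G group rollouts y_1, ..., y_G ~ pi_ref(. | x):
\hat{V}(x)
  = \beta \log \frac{1}{G} \sum_{i=1}^{G} \exp\!\big( r(x, y_i) / \beta \big)
```

The third line is what makes a regression-style off-policy loss possible: the policy's log-ratio against $\pi_{\text{ref}}$ can be fit to $r(x, y) - \hat{V}(x)$ using only samples drawn from the reference policy.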

Open Problems

Generalization of deep research agent results to broader grounded reasoning tasks

Authors (26)
