Knowledge Graphs are Implicit Reward Models: Path-Derived Signals Enable Compositional Reasoning

Published 21 Jan 2026 in cs.AI and cs.CL | (2601.15160v1)

Abstract: LLMs have achieved near-expert performance in structured reasoning domains like mathematics and programming, yet their ability to perform compositional multi-hop reasoning in specialized scientific fields remains limited. We propose a bottom-up learning paradigm in which models are grounded in axiomatic domain facts and compose them to solve complex, unseen tasks. To this end, we present a post-training pipeline, based on a combination of supervised fine-tuning and reinforcement learning (RL), in which knowledge graphs act as implicit reward models. By deriving novel reward signals from knowledge graph paths, we provide verifiable, scalable, and grounded supervision that encourages models to compose intermediate axioms rather than optimize only final answers during RL. We validate this approach in the medical domain, training a 14B model on short-hop reasoning paths (1-3 hops) and evaluating its zero-shot generalization to complex multi-hop queries (4-5 hops). Our experiments show that path-derived rewards act as a "compositional bridge", enabling our model to significantly outperform much larger models and frontier systems like GPT-5.2 and Gemini 3 Pro, on the most difficult reasoning tasks. Furthermore, we demonstrate the robustness of our approach to adversarial perturbations against option-shuffling stress tests. This work suggests that grounding the reasoning process in structured knowledge is a scalable and efficient path toward intelligent reasoning.


Summary

The paper introduces a novel KG-based implicit reward model that guides compositional multi-hop reasoning in LLMs.
The methodology combines SFT and RL with path-derived signals, achieving an 11.1% accuracy improvement on unseen 5-hop tasks.
The approach outperforms larger models and maintains robustness across diverse ICD-10 subdomains using axiomatic, verifiable KG paths.

Knowledge Graphs as Implicit Reward Models for Compositional Reasoning

Introduction

This paper proposes a paradigm in which knowledge graphs (KGs) serve not merely as passive knowledge scaffolds but as implicit reward models for compositional multi-hop reasoning in LLMs. By grounding the learning process in axiomatic, domain-specific facts and using path-derived signals for reinforcement learning (RL), the authors target systematic, out-of-distribution compositional generalization in high-stakes scientific domains, demonstrated here on medical diagnostics and clinical reasoning.

Methodology: KG-Grounded SFT and RL Pipeline

The authors employ a Base Model $\rightarrow$ SFT (LoRA) $\rightarrow$ RL (GRPO) post-training pipeline. The KG, derived from UMLS and instantiated as entity-relation-entity triples, plays three roles: it generates the training data, supplies the process-aligned reward, and defines the evaluation framework. The SFT stage exposes the model to 1–3-hop reasoning tasks, instantiating atomic domain knowledge and grounding the initial policy on verifiable traces synthesized from axiomatic paths.
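To make the notion of triples and hop-limited paths concrete, here is a minimal sketch of a KG stored as (head, relation, tail) triples with a simple path enumerator. The toy triples, the function names `build_adjacency` and `enumerate_paths`, and the restriction to simple forward paths are all illustrative assumptions; the paper's actual UMLS-based sampling procedure is not described on this page.

```python
from collections import defaultdict

# Toy triples for illustration only; these are NOT drawn from UMLS.
TRIPLES = [
    ("fever", "is_symptom_of", "influenza"),
    ("influenza", "is_treated_by", "oseltamivir"),
    ("oseltamivir", "has_side_effect", "nausea"),
    ("fever", "is_symptom_of", "malaria"),
]


def build_adjacency(triples):
    """Index triples by head entity so paths can be walked forward."""
    adj = defaultdict(list)
    for head, relation, tail in triples:
        adj[head].append((relation, tail))
    return adj


def enumerate_paths(adj, start, hops):
    """Return every simple path of exactly `hops` triples starting at `start`."""
    paths = []

    def walk(node, path, visited):
        if len(path) == hops:
            paths.append(list(path))
            return
        for relation, tail in adj.get(node, []):
            if tail in visited:
                continue  # keep paths simple: never revisit an entity
            path.append((node, relation, tail))
            walk(tail, path, visited | {tail})
            path.pop()

    walk(start, [], {start})
    return paths


adj = build_adjacency(TRIPLES)
one_hop = enumerate_paths(adj, "fever", 1)    # short paths, as used for training
three_hop = enumerate_paths(adj, "fever", 3)  # longer chain built from the same facts
```

In this toy graph, the single 3-hop path from "fever" is composed entirely of triples that also appear in shorter paths, which is the sense in which longer questions reuse axiomatic facts seen at lower hop counts.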

The RL phase then employs Group Relative Policy Optimization (GRPO). The primary innovation is the reward design: a composite signal balances correctness (a binary outcome) with explicit path alignment, awarding partial credit for responses that traverse the correct intermediate states as encoded in the KG. This design yields process supervision at scale, which the authors report outperforms RL based solely on answer correctness or on similarity to expert traces.
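The group-level scoring that GRPO relies on can be sketched as follows: several completions are sampled for the same prompt, and each completion's reward is normalized against the group's mean and standard deviation, so no learned critic is needed. This is the commonly described form of the GRPO advantage; the function name, the use of population standard deviation, the `eps` constant, and the example rewards are assumptions, not details confirmed by this page.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its sampling group (mean 0, roughly unit scale).

    `rewards` holds the scalar rewards of all completions sampled for ONE prompt.
    `eps` avoids division by zero when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of the same question.
rewards = [1.6, 0.1, -1.0, 0.5]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group average receive positive advantages and are reinforced; those below are pushed down. When all completions receive the same reward, every advantage is zero and the group contributes no gradient signal, which is one reason a graded path-alignment term can be more informative than a binary reward alone.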

Figure 1: Schematic illustration of the SFT+RL pipeline. SFT provides KG-grounded initialization, while RL with path-derived rewards sculpts compositional reasoning capabilities.

Reward Signal Engineering

Four competing reward signals are systematically ablated:

Binary Correctness: Supervises only final answer accuracy.
Similarity: A Jaccard-based measure of overlap between the model's reasoning trace and an expert trace.
Thinking Quality: Evaluates logical and syntactic structure of reasoning.
Path Alignment: Quantifies coverage and correctness of ground-truth KG path entities within the model’s internal reasoning.

Path alignment, especially when used with asymmetric negative sampling (a heavier penalty for incorrect answers), provides the strongest compositional gains. The reward formulation is designed so that the model cannot score well through superficial pattern-matching or fluent but ungrounded text; improvement is meant to reflect substantive, process-centric reasoning.
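A minimal sketch of how a binary term and a path-alignment term might combine is given below. The constants come from the values quoted in the Knowledge Gaps section of this page ($\alpha=0.1$, $\beta=1$, $R_{max}=1.5$). The additive combination, the substring-based entity matching, and all function names are assumptions made for illustration, not the paper's confirmed formulation.

```python
ALPHA = 0.1  # reward for a correct final answer (value quoted on this page)
BETA = 1.0   # penalty magnitude for an incorrect final answer (quoted on this page)
R_MAX = 1.5  # cap on the path-alignment term (quoted on this page)


def binary_reward(predicted, gold):
    """Asymmetric outcome reward: small bonus if right, larger penalty if wrong."""
    return ALPHA if predicted == gold else -BETA


def path_alignment_reward(reasoning, path_entities):
    """Fraction of ground-truth path entities mentioned in the trace, scaled to [0, R_MAX].

    ASSUMPTION: case-insensitive substring matching stands in for whatever
    entity/token matching the paper actually uses.
    """
    if not path_entities:
        return 0.0
    text = reasoning.lower()
    hits = sum(1 for entity in path_entities if entity.lower() in text)
    return R_MAX * hits / len(path_entities)


def total_reward(reasoning, predicted, gold, path_entities):
    """ASSUMPTION: the two terms are simply added."""
    return binary_reward(predicted, gold) + path_alignment_reward(reasoning, path_entities)


path = ["fever", "influenza", "oseltamivir"]
trace = "Fever suggests influenza, which is treated with oseltamivir."

right = total_reward(trace, "B", "B", path)    # full coverage, correct answer
wrong = total_reward(trace, "C", "B", path)    # full coverage, wrong answer
bare = total_reward("The answer is B.", "B", "B", path)  # correct, no path entities
```

Under these assumptions, the fully aligned but incorrect completion scores $-1 + 1.5 = 0.5$, which is higher than the correct completion that mentions no path entities ($0.1$). This reproduces the "reward scaling anomaly" raised in the Knowledge Gaps section and shows why the relative weights of the two terms matter.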

Experimental Results

Compositional Generalization and Robustness

Despite only being trained on short (1–3-hop) reasoning paths, the SFT+RL model reliably generalizes to 4- and 5-hop chains—tasks not encountered during either SFT or RL. Notably, there is a +11.1% accuracy improvement on unseen 5-hop tasks relative to SFT-only training, establishing that path-aligned rewards facilitate a "compositional bridge" across complexity gradients.

Figure 2: The SFT+RL model displays monotonic gains on 4- and 5-hop tasks, outpacing SFT-only and base models—this signals genuine zero-shot compositional generalization.

The pipeline further exhibits robustness to format and adversarial perturbations. In option-shuffling stress tests, the model suffers only a negligible performance drop ($\sim 1\%$), substantially lower than the degradation observed in much larger generalist models (e.g., GPT-5.2, Gemini 3 Pro), indicating reliance on internal reasoning over superficial cues.
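An option-shuffling stress test of this kind can be sketched in a few lines. The glossary on this page describes the test as randomizing the order of distractors, so the default below keeps the correct option in place; the `move_answer` flag, the function names, and the toy items are assumptions added for illustration, and the paper's exact protocol is not specified here.

```python
import random


def shuffle_options(options, answer_index, rng, move_answer=False):
    """Return (shuffled_options, new_answer_index).

    By default only the distractors are permuted and the correct option keeps
    its position. With move_answer=True, all options are permuted.
    """
    if move_answer:
        order = list(range(len(options)))
        rng.shuffle(order)
    else:
        distractors = [i for i in range(len(options)) if i != answer_index]
        rng.shuffle(distractors)
        order = distractors[:answer_index] + [answer_index] + distractors[answer_index:]
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_index)


def accuracy_under_shuffling(predict, items, seed=0, move_answer=False):
    """Accuracy of `predict(question, options) -> index` on shuffled items."""
    rng = random.Random(seed)
    correct = 0
    for question, options, answer_index in items:
        shuffled, new_index = shuffle_options(options, answer_index, rng, move_answer)
        if predict(question, shuffled) == new_index:
            correct += 1
    return correct / len(items)


# Toy items and a toy predictor, purely for demonstration.
ITEMS = [
    ("Which antiviral is used for influenza?",
     ["insulin", "oseltamivir", "warfarin", "metformin"], 1),
    ("Which condition can present with fever?",
     ["malaria", "myopia", "alopecia", "tinnitus"], 0),
]
GOLD_TEXT = {q: opts[i] for q, opts, i in ITEMS}


def content_predict(question, options):
    """Picks the option by its text, so it is indifferent to option order."""
    return options.index(GOLD_TEXT[question])


shuffled_accuracy = accuracy_under_shuffling(content_predict, ITEMS, move_answer=True)
```

A predictor that depends on option content, like `content_predict`, is unaffected by either variant. Comparing accuracy on the original and shuffled items is what reveals a model that has learned positional shortcuts.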

Figure 3: Unlike the Base Model, whose performance collapses with increasing difficulty, the SFT+RL pipeline maintains a decisive lead and stable accuracy as problem complexity rises.

Domain Robustness and Category-Level Gains

Improvements are not restricted to a single category but are distributed across all 15 ICD-10 medical sub-domains, including demanding areas such as immunological and circulatory disorders.

Figure 4: Across medical specialties, path-aligned rewards enable consistent gains compared to SFT-only, highlighting the framework’s generality and robustness.

Outperforming Scale: Comparison with Frontier Models

A key empirical finding is that, given sufficient process-based reward shaping, a 14B SFT+RL model can surpass much larger (32B+) expert-distilled models and even generalist frontier models on the most complex compositional tasks. Performance does not degrade with hop length—in fact, it peaks on the hardest (5-hop) queries, while generalist models stagnate or regress with complexity.

Figure 5: On multi-hop chains, the KG-grounded 14B model maintains or grows its accuracy, unlike generalist models whose performance decays with depth.

Theoretical and Practical Implications

This work evidences a decoupling of compositional reasoning depth from sheer parameter count. The implicit reward model (formally, the combination of the KG path-alignment term and the correctness term) serves as a process supervisor, democratizing compositional reasoning by enabling smaller, data-efficient LLMs to out-reason their brute-force, scale-centric counterparts.

Practically, KGs emerge as scalable, automatable, and domain-agnostic supervisors. Any domain with a structured ontology is a candidate for this pipeline. The method eschews the burdens of human-in-the-loop annotation or handcrafted rubric rewards, favoring axiomatic, verifiable structure-process alignment.

Theoretically, this work gestures toward hybrid architectures: neural-symbolic integration, automated process verifiers, and dynamic reward shaping via KGs. The potential for self-critique, in-context path-derivation, and continuous edge augmentation opens avenues for meta-reasoning and "self-improving" scientific LLMs.

Limitations and Future Directions

Although this framework is validated only on medical KGs, the authors present its core approach as domain-agnostic. Future work should consider:

Adapting the implicit reward framework to other symbolic domains, including chemistry, law, engineering, and mathematics.
Fusing neural and symbolic representations, potentially yielding architectures that not only use KGs as external supervisors but internalize symbolic reasoning primitives within their weights.
Designing dynamic KGs, where novel entity or relation discovery can in turn inform reward modeling and data generation online.

Conclusion

By treating KGs as implicit reward models, this work outlines a clear and efficient strategy for eliciting compositional reasoning in LLMs—substantiated by strong empirical gains, out-of-distribution robustness, and algorithmic efficiency competitive with large-scale models. The method demonstrates that verifiable, path-derived rewards are a scalable alternative to both static datasets and human feedback, with far-reaching implications for domain-specific and generalist AI.


Explain it Like I'm 14

What this paper is about

This paper is about teaching AI models to “think in steps” using solid, checkable facts. The authors show how a model can learn to solve complex problems by combining small, true facts—especially in medicine—rather than just guessing a final answer. They do this by using a knowledge graph (a big map of facts and how they connect) to score the model’s reasoning, not only its final choice. Their key idea: knowledge graphs can act like “reward coaches,” giving points when the model uses the right facts in the right order.

The big questions the researchers asked

How can we help AI models do multi-step, “connect-the-dots” reasoning in scientific fields, like medicine, where small mistakes can be dangerous?
Can we train a model using short, simple fact chains so it generalizes to longer, more complicated ones it hasn’t seen before?
Is there a way to give the model useful feedback at scale without needing humans to label every step?
Will this approach be robust (hard to trick) and competitive with much larger models?

How they approached the problem

Think of this like teaching a student to solve a mystery using a map of clues:

A knowledge graph is the map. It stores facts as simple triples: head → relation → tail, like “Fever → is_symptom_of → Flu.” A chain of these triples is a “path” (like following connected clues).
“Hops” are the steps in a path. A 3-hop question needs three connected facts to reach an answer.
They trained a 14-billion-parameter model in two steps:
- Supervised Fine-Tuning (SFT): They first taught the model lots of short, correct reasoning examples, using a lightweight method called LoRA (think: adding small “adapters” instead of changing everything).
- Reinforcement Learning (RL): Then they coached the model to improve by giving it rewards (points) based on how well its reasoning used the right facts. They used GRPO, a training method that nudges the model towards better behavior based on group-level scores.

To make the rewards useful and fair, they tried different signals and kept the ones that really encouraged step-by-step thinking:

Binary correctness: a small positive point for the right final answer, a bigger negative point for a wrong one (this pushes the model to avoid mistakes).
Path alignment: extra points when the model’s explanation mentions the correct entities from the graph path, like correctly naming “symptom → disease → treatment” in order. This checks the process, not just the final pick.

They used a medical knowledge graph (UMLS) and tested on ICD-Bench, which has multiple-choice questions across medical categories with difficulty levels and different hop lengths.

What they found and why it matters

Here are the main takeaways in simple terms:

Training on short paths teaches “how to think”: The model was trained on 1–3-hop paths but performed better than baselines on unseen 4–5-hop questions. In other words, it learned the skill of building longer reasoning chains from shorter ones.
Rewards that check the process matter: The best results came from combining “final answer correctness” with “path alignment” rewards. Rewarding long explanations alone didn’t help and could be gamed; copying expert style (similarity) wasn’t as effective as using the right facts.
RL alone wasn’t enough: Starting RL from a base model didn’t consistently beat SFT. Warming up with SFT gave the model the basic facts; RL then taught it to connect those facts well.
Strong on hard problems: The model’s biggest gains were on the toughest questions (longer hop chains and higher difficulty). It kept high accuracy even when options were shuffled (so it wasn’t relying on option position or tricks).
Competitive with big models: Despite being smaller, the model often beat much larger, well-known systems on complex, multi-hop medical reasoning. That suggests smart training can sometimes beat raw size.

Why this is important

Safer, more reliable reasoning: In fields like medicine, it’s not enough to sound confident; models must show the right chain of evidence. This approach rewards that—step by step.
Scalable supervision without tons of human labels: Using a knowledge graph as a “reward coach” means you can score millions of reasoning chains automatically and consistently.
Small models can out-reason big ones: With good data and smart rewards, smaller models can solve harder problems more reliably, which is cheaper and more practical.
Works beyond medicine: Any area with a structured knowledge graph—like chemistry, law, or biology—could use this method to build models that reason from first principles, not just patterns.

In short, grounding AI in true, checkable facts and rewarding the path it takes to reach an answer is a powerful way to build models that think more like careful problem-solvers, not just good guessers.


Knowledge Gaps

Below is a concise list of the paper’s unresolved knowledge gaps, limitations, and open questions to guide future research:

Benchmarking transparency: The comparisons to GPT-5.2, Gemini 3 Pro, and QwQ-Med-3 lack full reproducibility details (prompt templates, sampling settings, system prompts, API versions, context windows, n-samples, majority vote specifics), preventing rigorous third-party verification.
Statistical rigor: Reported gains do not include confidence intervals, significance tests, multiple seeds, or variance across runs; robustness to random initialization and sampling stochasticity remains unquantified.
Reward scaling anomaly: With α=0.1, β=1, and R_max=1.5, an incorrect answer can still yield a positive total reward (up to +0.5), potentially incentivizing plausible but wrong reasoning. A principled calibration of reward weights to prevent net-positive rewards on incorrect completions is needed.
Entity-only alignment: The path alignment reward uses entity token coverage and ignores relations, directionality, and causal structure. It remains unclear whether the model composes the correct relations or merely name-drops entities.
Order and causality: The reward does not enforce step order or causal dependencies along the KG path. Sequence-aware and relation-aware alignment (e.g., ordered path matching, alignment cost, or graph-edit distance) is not explored.
Multi-path validity: Many medical queries can be solved by multiple legitimate KG paths. The reward assumes a single “ground-truth” path, risking penalization of alternative correct chains. Methods to support multi-path rewards are not investigated.
Synonyms and concept normalization: Token-based matching may miss valid synonyms or concept variants (e.g., abbreviations, lexical differences). The paper does not evaluate entity linking or ontology-normalized matching to reduce false negatives/positives.
KG quality and coverage: The approach presumes UMLS completeness and correctness. The effects of missing, conflicting, or outdated KG facts on training and evaluation are not quantified; mechanisms for handling KG noise and updates are unspecified.
Distribution shift and out-of-KG reasoning: The pipeline is untested on tasks requiring entities or relations absent from the KG, leaving unknown how the model behaves when necessary facts are missing or uncertain.
Upper limits of compositional length: Training covers 1–3 hops and evaluation 2–5 hops; there is no analysis beyond 5 hops or of failure modes as path length increases (e.g., compounding errors, memory limits).
Process-level evaluation: Beyond token coverage, there is no independent assessment of reasoning trace faithfulness (human or expert review, causal consistency checks, step verification) to ensure explanations reflect the model’s actual reasoning.
Reward hacking diagnostics: While repetition penalties are introduced, systematic detection and analysis of reward exploitation (e.g., gratuitous entity insertion, template boilerplate) are absent.
RL hyperparameter sensitivity: GRPO settings, group sizes, sampling budgets, learning rates, and reward normalization choices are not thoroughly ablated; stability regions and failure modes are unclear.
RL-only capacity and scaling: The conclusion that “RL alone is insufficient” is based on limited budgets and vanilla configurations; whether stronger RL (e.g., actor-critic, off-policy, curriculum RL, longer budgets) could match or exceed SFT+RL is not resolved.
Compute and efficiency: Training time, GPU hours, memory footprints, and cost vs. performance trade-offs are not reported; sample efficiency of the path-derived reward remains unquantified.
Test-time compute policy: Apart from majority voting in one comparison, the paper does not specify test-time sampling strategies (n, temperature) across all experiments; effects of test-time compute on robustness and accuracy are not analyzed.
MCQ format limitations: The evaluation relies on multiple-choice questions, which may enable elimination heuristics and overestimate reasoning quality; performance on open-ended, free-text clinical cases is unknown.
Robustness breadth: Option shuffling is the only format perturbation studied. Harder stressors (shuffling correct option position, paraphrased prompts, adversarial distractors, contradictory evidence, long-context noise) are not evaluated.
Pretraining contamination: Despite train-test separation at path/entity levels, potential overlap of UMLS facts in model pretraining could inflate zero-shot performance; contamination audits are not presented.
Error analysis: The paper lacks qualitative failure analysis (e.g., typical error types by hop length, category, relation type), limiting insight into current weaknesses and prioritization for future fixes.
ICD-10 coverage metrics: Node/relation coverage statistics and their correlation with performance across categories are not reported; it remains unclear which coverage thresholds are necessary for generalization.
Generality to other domains: The claim of domain-agnostic applicability is not empirically validated; portability to KGs in chemistry, law, finance, or engineering (with different schemas and noise profiles) is untested.
Handling uncertainty: Medical reasoning often involves ambiguity. The pipeline does not model or reward calibrated uncertainty, abstention, or confidence expression; links to uncertainty-aware rewards remain unexplored.
Relation-type weighting: All relations are implicitly treated equally in path alignment. Exploring relation-specific weights (e.g., stronger signals for causal/mechanistic edges) could improve reasoning fidelity; this is not investigated.
Multi-lingual and cross-lingual robustness: The method’s reliance on English token matching and UMLS concepts leaves open questions about applicability in non-English settings and multilingual KGs.
Tool use and structured traversal: The approach does not integrate explicit graph traversal or planning tools (e.g., MCTS, symbolic executors) at inference; whether such integration improves faithfulness and long-horizon reasoning is unknown.
Safety and clinical validity: There is no assessment of potential harms from incorrect reasoning, misdiagnosis, or outdated KG facts; safety audits, calibration, and human-in-the-loop safeguards are not discussed.
Dataset provenance and LLM-generated artifacts: The training/test questions and reasoning traces are LLM-generated; the impact of stylistic artifacts and distribution biases on generalization to human-authored clinical materials is not measured.
Release and replicability: Code, data, prompts, and trained checkpoints are not stated as publicly available; without these, community replication and extension are hindered.
Transfer protocol clarity: The statement that insights “transfer effectively from 8B to 14B” lacks detail on the transfer mechanism (retraining vs. weight initialization, hyperparameter changes), leaving open questions about scalability and porting best practices.
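The "Entity-only alignment" and "Order and causality" gaps above can be illustrated with a small sketch of a sequence-aware alternative: score a trace by the longest common subsequence (LCS) between the ground-truth path order and the order in which the path entities are first mentioned. This is one possible design, not something the paper proposes; the function names and the substring matching are assumptions.

```python
def mention_order(reasoning, path_entities):
    """Path entities found in the trace, ordered by first mention."""
    text = reasoning.lower()
    found = [(text.find(entity.lower()), entity) for entity in path_entities]
    return [entity for pos, entity in sorted(found) if pos >= 0]


def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def ordered_alignment(reasoning, path_entities):
    """Share of the path that the trace covers IN PATH ORDER, in [0, 1]."""
    if not path_entities:
        return 0.0
    mentioned = mention_order(reasoning, path_entities)
    return lcs_length(path_entities, mentioned) / len(path_entities)


path = ["fever", "influenza", "oseltamivir"]
in_order = "Fever suggests influenza, which is treated with oseltamivir."
reversed_trace = "Oseltamivir is given for influenza, which presents with fever."

score_in_order = ordered_alignment(in_order, path)
score_reversed = ordered_alignment(reversed_trace, path)
```

Both traces mention all three entities, so a pure coverage score would rate them equally, while the ordered score separates them (1.0 versus 1/3). Whether that is desirable is itself an open question, since a trace that reasons backward from the answer may still be valid.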


Practical Applications

Immediate Applications

Below are concrete use cases that can be deployed with today’s tools, given a suitable domain knowledge graph (KG), entity linking, and modest SFT+RL compute.

Healthcare — KG-grounded clinical decision support
- What: Suggest differentials, tests, drug choices, and contraindications with a step-by-step rationale aligned to UMLS paths.
- Product/workflow: “Path-backed CDS” EHR sidebar that shows the triples/hops supporting each recommendation.
- Assumptions/dependencies: High-quality and current medical KG (e.g., UMLS + drug–disease–interaction extensions), reliable clinical NER/entity linking, human-in-the-loop review; not for autonomous use without regulatory clearance.
Healthcare — ICD-10 coding and audit
- What: Map notes to ICD-10 codes with KG-aligned reasoning chains that justify code selection and support audits.
- Product/workflow: “Coder cockpit” that highlights the specific path (e.g., symptom → disease → code) used; batch audit reports with path coverage metrics.
- Assumptions/dependencies: Coverage and correctness of ICD-10 mappings in KG; PHI-safe data pipelines; institutional acceptance of machine-generated rationales.
Healthcare — Pharmacovigilance and safety signal triage
- What: Prioritize adverse event reports by composing drug–mechanism–effect relations; reduce hallucinated causal links.
- Product/workflow: Safety triage console ranking cases by path evidence score (R_path) and showing supporting triples.
- Assumptions/dependencies: Drug safety KG (e.g., SIDER/FAERS-integrated), good synonym/alias handling, clinician oversight.
Healthcare and Education — Medical tutoring and assessments
- What: Generate multi-hop MCQs and explanations with controllable difficulty (1–5 hops), and grade student rationales by path coverage.
- Product/workflow: “Compositional curriculum generator” and “path-aligned grader” for medical schools and CME.
- Assumptions/dependencies: KG coverage across target specialties; guardrails to avoid leakage of test content; bias/coverage checks.
Software/ML — RL post-training without human preferences
- What: Replace/augment RLHF with KG-derived “verifiable rewards” (R_bin + R_path) for domain LLMs.
- Product/workflow: KG-RLVR training recipe (SFT via LoRA + GRPO with path-alignment reward), CI pipeline to re-train as KG updates.
- Assumptions/dependencies: Domain KG, chain-of-thought enabled during training, stable GRPO configuration; monitoring for repetition/reward hacking.
Enterprise Knowledge — Multi-hop QA over internal KGs
- What: Answer policy/procedure/product questions with evidence chains over an enterprise KG (e.g., org → system → control → requirement).
- Product/workflow: “Compositional QA agent” that returns the answer plus the traversed graph path and node snippets.
- Assumptions/dependencies: Well-curated enterprise KG; entity linking from unstructured docs to KG nodes; access controls.
Finance — AML/fraud analyst assist
- What: Explain suspicious activity by composing transaction–entity–jurisdiction relations and providing auditable reasoning chains.
- Product/workflow: Investigator assistant that ranks cases by path evidence and highlights multi-hop money flows.
- Assumptions/dependencies: High-fidelity transaction KG; strict privacy/sovereignty controls; human adjudication.
Legal/Policy — Regulation and compliance assistants
- What: Grounded answers that connect obligations → controls → exceptions via a regulation/caselaw KG with cited hops.
- Product/workflow: Compliance navigator that shows the governing path (statute → rule → guidance → precedent).
- Assumptions/dependencies: Up-to-date legal KG; rigorous citation canonicalization; jurisdictional scoping.
Evaluation and QA — Robustness and audit tooling
- What: Test suites for format robustness (e.g., option-shuffling), hop-length stratified reporting, and path-coverage dashboards.
- Product/workflow: “PathScore” verifier and robustness harness integrated into model QA/MLOps.
- Assumptions/dependencies: Access to evaluation KGs; standardized entity canonicalization; governance for model logging and trace storage.
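The hop-length stratified reporting mentioned under "Evaluation and QA" amounts to grouping evaluation records by hop count before computing accuracy. A minimal sketch follows; the record schema (`hops`, `correct`) and the example records are made-up placeholders, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_hops(records):
    """Map each hop count to the accuracy over records with that hop count."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record["hops"]] += 1
        hits[record["hops"]] += int(record["correct"])
    return {hops: hits[hops] / totals[hops] for hops in sorted(totals)}


# Placeholder records for demonstration only; NOT data from the paper.
records = [
    {"hops": 2, "correct": True},
    {"hops": 2, "correct": True},
    {"hops": 4, "correct": True},
    {"hops": 4, "correct": False},
    {"hops": 5, "correct": True},
]
report = accuracy_by_hops(records)
```

Reporting per-hop accuracy rather than a single aggregate makes it visible whether a model degrades as chains get longer, which is the comparison the paper's figures are built around.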

Long-Term Applications

These use cases need further research, scaling, integration, or regulatory work before broad deployment.

Healthcare — Real-time, regulation-grade clinical decision support
- What: On-the-fly treatment planning and risk prediction with verified KG-backed chains, integrated into clinician workflow.
- Product/workflow: FDA/CE-marked CDS with uncertainty calibration, counterfactual paths, and automatic provenance logging.
- Assumptions/dependencies: Prospective clinical validation, post-market surveillance, robust coverage of guidelines in KG, safety cases.
Science and R&D — Autonomous hypothesis generation and experiment planning
- What: Compose biochemical/chemical pathways (e.g., reaction rules, targets, phenotypes) to propose testable hypotheses and experiment sequences.
- Product/workflow: “KG-guided discovery engine” combining lab ontologies with LLM reasoning and lab automation.
- Assumptions/dependencies: High-fidelity KGs (e.g., ChEBI/Reactome), integration with ELN/LIMS, closed-loop data-to-KG updates.
Law — Grounded drafting and argumentation assistants
- What: Compose multi-hop legal arguments that are verifiably tied to statutes, regulations, and precedents; generate and check citations.
- Product/workflow: Litigation and regulatory drafting copilots with path-aligned argument graphs and counter-argument exploration.
- Assumptions/dependencies: Comprehensive legal KGs, jurisdiction/version tracking, judicial acceptance of AI-produced rationale trails.
Finance and Risk — Compositional risk engines
- What: Multi-hop reasoning over counterparty, supply-chain, ESG, and macro links to explain portfolio risks and scenario propagation.
- Product/workflow: “Explainable risk copilot” that surfaces KG paths driving risk scores and suggests mitigation actions.
- Assumptions/dependencies: Multi-source KG fusion, timeliness of data, governance for model risk management.
Public Policy — Benefit eligibility and policy impact reasoning
- What: Compose eligibility rules and inter-program dependencies to give citizens explainable determinations and simulate policy changes.
- Product/workflow: Government digital assistants with path-backed determinations and appeal-ready evidence chains.
- Assumptions/dependencies: Machine-readable policy KGs, legal review, transparency mandates.
Robotics and Planning — KG-rewarded task planning
- What: Use environment/object KGs as verifiable reward models to train planners that compose sub-tasks and constraints.
- Product/workflow: “KG-MPC” hybrid where text plans are verified against object-relationship KGs; training-time path rewards.
- Assumptions/dependencies: Reliable mapping from physical states to KG nodes, multi-modal grounding, sim-to-real transfer.
Energy/Industrial — Root-cause analysis over asset KGs
- What: Diagnose grid or plant faults via multi-hop composition (sensor → component → subsystem → failure mode) with verifiable paths.
- Product/workflow: Operator assistant that proposes causes/fixes and shows KG paths and historic evidence.
- Assumptions/dependencies: Accurate asset and failure-mode KGs, streaming-to-KG entity linking, safety certification.
Education (STEM at scale) — Adaptive compositional curricula
- What: Personalized pathways that gradually increase hop length and grade student reasoning by path alignment in domains like organic chemistry or physics.
- Product/workflow: LMS plugins that generate tasks and give path-specific feedback; instructor analytics by hop/difficulty.
- Assumptions/dependencies: Domain KGs with pedagogy-aware structure, fairness and bias controls.
National/Enterprise KG infrastructure and governance
- What: Sustained investment to maintain, version, and expand KGs (coverage, canonicalization, provenance) as critical AI infrastructure.
- Product/workflow: KGOps (schema governance, continuous curation), KG–LLM co-training loops, standards for path-based evaluation.
- Assumptions/dependencies: Funding, inter-agency/industry collaboration, open standards, privacy-preserving entity linking.
Continual learning and on-device specialists
- What: Lightweight LoRA/RL updates as KGs evolve; domain specialists running on edge with verifiable reasoning.
- Product/workflow: Incremental KG-RLVR updates, telemetry-driven reward shaping, red-team stress suites.
- Assumptions/dependencies: Efficient training pipelines, on-device security, robust mitigation of reward hacking.

Notes on feasibility and transferability across all applications:

Core dependencies: a well-curated, up-to-date domain KG; accurate entity linking/canonicalization; access to chain-of-thought for training; stable GRPO/SFT stacks.
Risk considerations: KG incompleteness or bias will steer rewards; reasoning-trace exposure may create privacy/IP risks; regulatory constraints (especially in healthcare/finance/law) demand human oversight and auditability.
Generalization limits: Demonstrated gains to 4–5 hops; performance beyond that, under distribution shift, or against sophisticated adversaries, requires further validation and potentially richer path-reward designs (e.g., relation-level matching, paraphrase-robust entity normalization).


Glossary

Axiomatic triples: Structured facts in a knowledge graph represented as (head, relation, tail) that serve as building blocks for reasoning. "axiomatic triples $(head, relation, tail)$"
Chain-of-thought: The intermediate reasoning trace generated by a model before the final answer. "chain-of-thought"
Compositional reasoning: The ability to combine multiple axiomatic facts across steps to solve complex problems. "it requires compositional reasoning: the ability to reliably combine axiomatic facts for complex multi-hop problem solving"
Direct Preference Optimization (DPO): A post-training method that optimizes models to match preferences directly without explicit reward modeling. "direct preference optimization (Rafailov et al., 2023)"
Distillation-based reward: A training signal that scores outputs by similarity to expert-produced reasoning traces. "A distillation-based reward that measures the Jaccard similarity between the model output and an expert reasoning trace"
Group Relative Policy Optimization (GRPO): A PPO-like RL algorithm that estimates advantages at the group level and omits the critic. "Group Relative Policy Optimization (GRPO)"
ICD-10: The World Health Organization’s standardized coding system for diseases and health conditions. "ICD-10 category breakdowns"
ICD-Bench: A held-out benchmark of medical multi-hop reasoning questions used for evaluation. "we use ICD-Bench, a non-overlapping test set of 3,675 questions"
Jaccard similarity: A set-based metric measuring the overlap between two sets, used to compare reasoning traces. "Jaccard similarity"
Knowledge graphs (KGs): Structured representations of entities and relations that encode domain knowledge for grounding and verification. "Knowledge graphs (KGs)"
Link prediction: A KG task that infers missing triples or entities by reasoning over graph structure. "(link prediction)"
Low-Rank Adaptation (LoRA): A parameter-efficient fine-tuning method that adapts models via low-rank updates. "Low-Rank Adaptation (LoRA)"
Majority voting: An aggregation method that selects the most frequent answer across multiple samples. "majority-voting ($n=16$) metric"
Multi-hop reasoning: Reasoning that requires traversing multiple linked steps (hops) in a knowledge graph to reach a conclusion. "compositional multi-hop reasoning"
Negative sampling reinforcement: A training strategy that penalizes incorrect generations more heavily to encourage exploration of correct trajectories. "negative sampling reinforcement"
Ontology: A formal representation of concepts and their relationships in a domain, often encoded in KGs. "medical ontology"
Option shuffling: A robustness stress test that randomizes the order of distractor choices to detect positional biases. "option shuffling"
Path alignment: Rewarding model reasoning that matches entities along a ground-truth KG path. "Path Alignment"
Post-training: Additional training stages (e.g., SFT and RL) applied after pretraining to refine capabilities. "post-training pipeline"
Proximal policy optimization: An on-policy RL method that stabilizes updates via a clipped objective. "a popular proximal policy optimization-like optimizer"
Process supervision: Training that rewards intermediate reasoning steps rather than only final answers. "Whereas process supervision (rewarding intermediate steps) has shown promise in mathematics and logic"
Reinforcement Learning (RL): A learning paradigm where models optimize behavior via reward signals. "reinforcement learning (RL)"
Reinforcement Learning with Verifiable Rewards (RLVR): An RL pipeline that uses grounded, automatically verifiable reward signals. "Reinforcement Learning with Verifiable Rewards (RLVR) Pipeline"
Repetition penalty: A penalty in the reward function that discourages repetitive text to prevent reward exploitation. "repetition penalty"
Retrieval-augmented generation: A method where models retrieve external knowledge to inform generation. "retrieval-augmented generation systems"
Reward hacking: Exploiting imperfections in the reward function to achieve high scores without genuine reasoning. "reward hacking"
Reward model: A learned or implicit model that scores outputs to provide feedback for RL training. "reward models"
Reward shaping: Modifying or augmenting reward signals to improve learning dynamics and guide behavior. "reward shaping"
Stochastic policy: A probabilistic mapping from inputs to a distribution over outputs (actions). "stochastic policy $\pi_\theta$"
Trajectory: The full sequence of generated tokens treated as a single unit for reward assignment in RL. "single trajectory"
Unified Medical Language System (UMLS): A comprehensive biomedical knowledge base used to construct the medical KG. "Unified Medical Language System (UMLS)"
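Three of the glossary terms above (Jaccard similarity, majority voting, and GRPO's group-level advantages) are simple enough to sketch in a few lines. The code below is an illustrative sketch, not the paper's implementation: the function names are hypothetical, and the advantage formula uses the commonly described mean/standard-deviation normalization within a group of sampled responses, which this summary does not spell out.

```python
from collections import Counter
from statistics import mean, pstdev


def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap |A ∩ B| / |A ∪ B|; defined here as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def majority_vote(answers: list[str]) -> str:
    """Most frequent answer across samples; ties go to the first one seen."""
    return Counter(answers).most_common(1)[0][0]


def group_relative_advantages(
    rewards: list[float], eps: float = 1e-6
) -> list[float]:
    """Normalize each reward against its own group, with no learned critic.

    Assumed form: (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))
    print(majority_vote(["B", "A", "B", "C"]))
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Because the advantages are centered on the group mean, a group in which every sample earns the same reward yields zero advantage for all of them, so such groups contribute no learning signal.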

