
KnowRL: Enhancing Factuality in LLMs

Updated 4 January 2026
  • KnowRL is a reinforcement learning framework that enhances LLM factuality by integrating external knowledge verification and introspective consensus-based rewards.
  • It employs a composite reward strategy combining format, correctness, and factuality rewards to align each reasoning step with verified evidence.
  • Integration of cold-start SFT and consensus-driven post-training loops significantly boosts both intrinsic self-knowledge and extrinsic benchmarking metrics.

KnowRL is a family of reinforcement learning (RL) frameworks designed to enhance the factual reliability and self-knowledge boundaries of LLMs, particularly in open-domain reasoning settings. Two distinct implementations, each with different technical emphases, have been introduced under the "KnowRL" umbrella. The first KnowRL variant integrates external knowledge verification into RL training to reduce hallucinations in slow-thinking LLMs (Ren et al., 24 Jun 2025), while the second KnowRL framework focuses on post-training self-improvement, enabling LLMs to refine their internal feasibility boundaries using introspection and consensus-based rewards without external supervision (Kale et al., 13 Oct 2025).

1. Motivation and Conceptual Foundations

Large, slow-thinking LLMs—especially those producing extended chains of thought—are prone to factual hallucinations. In standard RL fine-tuning, models are rewarded for correct final outputs, but without intermediate factual supervision, errors early in the reasoning process compound, leading to unreliable reasoning chains. This outcome-centric approach fails to enforce alignment with external knowledge at each reasoning step, motivating training paradigms that help models "know what they know"—the recognition and enforcement of knowledge boundaries (Ren et al., 24 Jun 2025).

The second class of KnowRL methods is motivated by the observation that LLMs often misjudge their own competence. Improving the model's ability to discriminate between feasible (answerable) and infeasible (unanswerable) tasks is essential for safer and more reliable deployment. To this end, KnowRL post-training loops leverage the LLM’s own introspective capability, reinforcing internal self-knowledge boundaries through internal consensus, all while avoiding expensive external supervision (Kale et al., 13 Oct 2025).

2. Knowledge-Enhanced RL Architecture for Factuality

The first KnowRL architecture (Ren et al., 24 Jun 2025) augments standard slow-thinking LLMs with:

  • Structured Output Format: Outputs are explicitly split into <think> ... </think> (chain-of-thought, CoT) and <answer> ... </answer> (final answer) segments.
  • External Knowledge Verification: At training time, a knowledge base $K$ (a Wikipedia subset) is interrogated; each atomic fact produced in the <think> block is checked by an NLI model against candidate evidence passages from $K$, using title/entity matching and entailment scoring.

During RL training, the model generates a CoT and answer and is guided by a composite reward signal detailed below. During inference, the output continues to strictly follow the <think>...</think><answer>...</answer> schema.
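The verification step can be sketched as follows. This is a minimal illustration, not the paper's implementation: `retrieve_evidence` and `nli_entails` are hypothetical stand-ins for the actual retrieval component and NLI model.

```python
# Sketch of per-fact verification against a knowledge base.
# retrieve_evidence and nli_entails are toy stand-ins (assumptions),
# not the retrieval/NLI components used in the paper.

def retrieve_evidence(fact, knowledge_base):
    """Toy title/entity matching: return passages whose title tokens overlap the fact."""
    fact_tokens = set(fact.lower().split())
    return [passage for title, passage in knowledge_base.items()
            if fact_tokens & set(title.lower().split())]

def nli_entails(premise, hypothesis):
    """Placeholder for an NLI entailment model; here, naive substring matching."""
    return hypothesis.lower() in premise.lower()

def verify_facts(atomic_facts, knowledge_base):
    """Return the subset of atomic facts entailed by at least one evidence passage."""
    supported = []
    for fact in atomic_facts:
        evidence = retrieve_evidence(fact, knowledge_base)
        if any(nli_entails(passage, fact) for passage in evidence):
            supported.append(fact)
    return supported
```

In the full system, the substring check would be replaced by an entailment score from a trained NLI model, and retrieval would use proper entity linking over the Wikipedia subset.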

3. RL Formulation and Reward Design

The KnowRL RL process is formalized as follows (Ren et al., 24 Jun 2025):

  • State ($s_t$): Current prompt plus all previously generated tokens in the <think> and <answer> segments.
  • Action ($a_t$): Next-token selection or segment-boundary transition.
  • Policy ($\pi_\theta$): Parameterized by the base model and LoRA adapters.

The policy is optimized by Group Relative Policy Optimization (GRPO), a variant of PPO suitable for long outputs and grouped reward structures.
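The "group relative" aspect of GRPO can be illustrated with a minimal advantage computation: rewards for a group of outputs sampled from the same prompt are normalized against the group's own statistics, rather than against a learned value baseline as in PPO. This is a simplified sketch, not the exact update used in the paper.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each output's reward by the
    mean and standard deviation of its own sampling group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Outputs scoring above their group's mean receive positive advantages and are reinforced; those below are suppressed, which suits long outputs where a per-token value function is hard to learn.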

Reward Components

The total reward $R_{total}(o)$ combines the following components:

  1. Format Reward: $R_{format}(o) = +1$ if the output matches the strict schema; $-1$ otherwise.
  2. Correctness Reward: $R_{correct}(o) = +2$ if the final answer is correct (as judged by GPT-4o-mini); $-1$ otherwise.
  3. Factuality Reward: The CoT is decomposed into $N$ atomic facts (via GPT-4o-mini), with each fact cross-validated by the NLI model on evidence from $K$. Let $N_{supported}$ denote the count of facts entailed by evidence; then

$$R_{fact}(o) = \min\left(\frac{N_{supported}}{15},\, 1.0\right)$$

  4. Combined Reward:

$$R_{combined}(o) = \begin{cases} +2, & \text{if } o_{\text{answer}} \text{ is correct} \\ -1 + R_{fact}(o), & \text{otherwise} \end{cases}$$

  5. Total Reward: $R_{total}(o) = R_{format}(o) + R_{combined}(o)$

This composite structure enforces output discipline, factual alignment through the reasoning chain, and correct final answers.
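These definitions translate directly into code. In this sketch the schema check is a simplified regex and the correctness judgment (a boolean) stands in for the GPT-4o-mini judge; both are assumptions for illustration.

```python
import re

def format_reward(output):
    """+1 if output matches <think>...</think><answer>...</answer>, else -1."""
    pattern = r"^<think>.*</think>\s*<answer>.*</answer>$"
    return 1.0 if re.match(pattern, output, re.DOTALL) else -1.0

def fact_reward(n_supported):
    """R_fact = min(N_supported / 15, 1.0)."""
    return min(n_supported / 15, 1.0)

def combined_reward(answer_correct, n_supported):
    """+2 for a correct answer; otherwise -1 plus the factuality reward."""
    return 2.0 if answer_correct else -1.0 + fact_reward(n_supported)

def total_reward(output, answer_correct, n_supported):
    """R_total = R_format + R_combined."""
    return format_reward(output) + combined_reward(answer_correct, n_supported)
```

For example, a well-formatted but incorrect output with all 15 atomic facts supported scores $1 + (-1 + 1.0) = 1.0$, so factual reasoning is still rewarded even when the final answer misses.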

4. Training Procedure and Implementation

The training procedure consists of two distinct stages (Ren et al., 24 Jun 2025):

Stage 1: Cold-Start Supervised Fine-Tuning (SFT)

  • Models are fine-tuned on a curated set of ∼3K factual question–CoT pairs to ensure format and baseline reasoning competence.
  • This stage is essential; ablations show that omitting cold-start SFT renders factual rewards ineffective ("KnowRL-Zero" fails to improve factuality).

Stage 2: Knowledge-Enhanced RL

  • The SFT-initialized model undergoes RL updates:
    • For each prompt, generate outputs at temperature 0.
    • Compute RtotalR_{total} for each output using the three reward components.
    • Update policy parameters with GRPO.
    • Only LoRA adapters are updated; base model weights are kept frozen.

Pseudocode (High-Level)

initialize θ via SFT
for step in 1..T_RL:
    rewards = []
    for each prompt P in batch:
        o = sample_output(π_θ, P)
        r_format = format_reward(o)
        r_correct = correctness_reward(o)
        r_fact = fact_reward(o)
        rewards.append(r_format + combined(r_correct, r_fact))
    θ = GRPO_update(θ; rewards)

5. Consensus-Driven Self-Knowledge Refinement

A second, independent KnowRL framework (Kale et al., 13 Oct 2025) operates as a post-training loop to sharpen the model’s self-knowledge boundaries:

  • Introspection: The LLM generates a batch of candidate tasks (instructions or queries), each self-labeled as "Feasible" or "Infeasible" using a seed set of 100 manually verified examples (50 Feasible, 50 Infeasible).
  • Consensus-Based Reward: For each candidate, $k = 8$ self-judgments are sampled. The consensus score is the fraction agreeing with the majority.

The RL update maximizes expected consensus reward, using reinforcement learning (Reinforce++ with a KL penalty) over trajectories composed of the original candidate and the $k$ judgments.

Mathematical Formulation

  • For a batch $\mathcal{X} = \{(x_i, \hat{\ell}_i)\}$ of candidates with self-labels $\hat{\ell}_i$:
    • Consensus reward: $r(x_i) = \frac{1}{k} \sum_{j=1}^{k} \mathbf{1}[y_{i,j} = m_i]$, where $y_{i,j}$ is the $j$-th sampled judgment and $m_i$ is the majority label.

The loop iterates, using high-consensus new tasks to expand the seed set and reinforce self-consistency, until improvements in both self-consistency and external feasibility benchmarks plateau.
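The consensus reward reduces to a majority-agreement fraction over the $k$ sampled judgments, which can be sketched in a few lines:

```python
from collections import Counter

def consensus_reward(judgments):
    """Fraction of sampled feasibility labels agreeing with the majority label.
    judgments: list of k self-judgment labels, e.g. "Feasible"/"Infeasible"."""
    counts = Counter(judgments)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(judgments)
```

With $k = 8$, a 6–2 split among the self-judgments yields $r = 6/8 = 0.75$, while unanimous agreement yields the maximum reward of 1.0.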

6. Empirical Results and Ablation Studies

Experimental results on hallucination (TruthfulQA, SimpleQA, ChineseSimpleQA) and reasoning (GPQA, AIME 2025) benchmarks demonstrate:

  • KnowRL-trained DeepSeek-Qwen-7B attains the highest accuracy on all hallucination datasets, outperforming Direct Generation, Self-Refine, FactTune-FS, DPO, and standard SFT baselines.
  • Reasoning ability is preserved or improved on GPQA and AIME benchmarks.
  • Ablation: Cold-start SFT is essential; models without it do not benefit from factual rewards. Removing any one reward component degrades either factuality or reasoning metrics.
  • Distilled models show greater improvement ceilings than models already RL-trained; overtraining (over ~150 RL steps) harms factuality, revealing a factual-vs-reasoning trade-off under prolonged RL.
  • Intrinsic accuracy (fraction of self-generated tasks with label-reproducibility) improved by +28% (LLaMA-8B, from 33.56% to 42.99%) and +23% (Qwen-7B, from 39.22% to 48.29%) after 30 iterations.
  • Extrinsic F1 on the SelfAware benchmark: Increases of +12% (LLaMA-8B, from 56.12% to 63.10%) and +10% (Qwen-7B, from 62.17% to 68.29%).
  • Ablations: Removing consensus reward or semantic/perplexity-based task filtering eliminates gains and leads to trivial task "gaming".

7. Insights, Limitations, and Future Research

Directly integrating factual supervision (via external verification or introspective consensus) at each reasoning step enables LLMs to internalize fact-based reasoning strategies, mitigating the factual supervision gap inherent in outcome-only RL schemes (Ren et al., 24 Jun 2025). Post-training consensus-driven KnowRL methods further indicate that LLMs can bootstrap their own knowledge boundaries without external annotation effort, improving intrinsic and extrinsic reliability metrics in under 30 iterations (Kale et al., 13 Oct 2025).

Key limitations include:

  • Resource Demands: External verification via knowledge base retrieval and multiple API calls is computationally intensive (Ren et al., 24 Jun 2025).
  • Theoretical Characterization: Formal understanding of when fact-based or consensus-based rewards most benefit reasoning reliability is undeveloped.
  • Scalability: Extension to multi-lingual models, larger architectures, or broader knowledge domains (beyond Wikipedia) remains unresolved.

A plausible implication is that hybridizing fact-based and self-consensus reward paradigms, or judiciously selecting pretraining and SFT data to maximize knowledge coverage, may further enhance reliability without impairing reasoning capabilities.

