
Faithfulness-Aware Advantage Modulation

Updated 10 February 2026
  • The paper introduces faithfulness-aware advantage modulation to align policy-gradient credit with the minimal evidence required, reducing hallucinated answers.
  • It employs geometric reward shaping to continuously balance correctness and penalize hallucinations, resulting in significant THS improvements.
  • Empirical results on models like Qwen-2.5B demonstrate enhanced reasoning fidelity and performance by reinforcing faithful intermediate steps.

Faithfulness-aware advantage modulation is a reinforcement learning (RL) strategy for LLMs and small reasoning models (SRMs) that aims to improve the fidelity of step-by-step reasoning by selectively assigning policy-gradient credit based on the faithfulness of intermediate steps with respect to the required evidence set. By incorporating explicit verification and fine-grained reward shaping, the approach addresses weaknesses of standard RL pipelines—most notably the reinforcement of overconfident or spurious intermediate reasoning that may lead to hallucinated answers. It has been instantiated in frameworks such as FaithRL, which combines geometric reward shaping with faithfulness-aware modulation of the policy advantage signal, and is supported by both theoretical guarantees and empirical advances in reducing hallucination rates while maintaining or improving correctness across reasoning benchmarks (Gui et al., 3 Feb 2026, Nie et al., 5 Feb 2026).

1. Faithfulness-Maximization and RL Objective Formulation

The design of faithfulness-aware advantage modulation centers on explicitly maximizing the probability that the LLM's reasoning trajectory uses exactly the minimal evidence required for a given query. Let $\theta$ denote the policy parameters, $\pi_\theta$ the trajectory distribution, $q$ the query, and $E(q) \subset S$ its minimal evidence subset. The set of faithful trajectories is defined as

$$S_f = \{\, T : \mathrm{usedKN}(T) = E(q) \,\}$$

where $\mathrm{usedKN}(T)$ is the set of knowledge items cited in trajectory $T$. The faithfulness-maximization objective is

$$J_F(\theta) \triangleq P_\theta(T \in S_f) = \mathbb{E}_{T \sim \pi_\theta}\big[\mathbf{1}_{T \in S_f}\big]$$

In practice, RL surrogates optimize a shaped reward $r_{\text{faith}}(T) = \mathbf{1}_{T \in S_f}$ via a policy-gradient algorithm (e.g., GRPO or PPO), grounding the learning objective directly in minimizing step-level hallucinations and maximizing the truthfulness-helpfulness score (THS) (Gui et al., 3 Feb 2026).
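As a concrete sketch of the indicator reward, assuming a hypothetical evidence-tagging scheme in which steps cite items like `[E1]` (the `used_knowledge` helper stands in for the paper's $\mathrm{usedKN}$ extractor):

```python
# Sketch of the faithfulness reward r_faith(T) = 1[usedKN(T) = E(q)].
# The evidence-tag format ("[E1]", "[E2]", ...) and the extractor below
# are illustrative assumptions, not the paper's implementation.

def used_knowledge(trajectory: list[str]) -> set[str]:
    """Hypothetical extractor: collect evidence IDs cited across all steps."""
    cited = set()
    for step in trajectory:
        for token in step.split():
            if token.startswith("[E") and token.endswith("]"):  # e.g. "[E3]"
                cited.add(token)
    return cited

def faithfulness_reward(trajectory: list[str], minimal_evidence: set[str]) -> float:
    """r_faith(T) = 1 iff the cited evidence equals the minimal set E(q)."""
    return 1.0 if used_knowledge(trajectory) == minimal_evidence else 0.0

traj = ["Premise from [E1]", "Combine with [E2]", "Answer follows"]
print(faithfulness_reward(traj, {"[E1]", "[E2]"}))  # 1.0
print(faithfulness_reward(traj, {"[E1]"}))          # 0.0 (cites extra evidence)
```

The strict equality check is what distinguishes this from a recall-style reward: citing superfluous evidence is penalized just like missing evidence.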

2. Geometric Reward Shaping and THS Alignment

Rather than relying solely on binary correctness rewards, geometric reward design introduces continuous rewards based on the model's baseline rates for correctness ($x_0 = P(C)$) and hallucination ($y_0 = P(H)$):

$$R_{\text{geo}}(T) = \begin{cases} +\,y_0, & T \in S_c \\ 0, & T \in S_m \\ -\,x_0, & T \in S_h \end{cases}$$

where $S_c$, $S_m$, and $S_h$ denote the sets of correct, miss, and hallucinated trajectories, respectively. The gradient of the expected geometric reward aligns with the gradient of the THS metric, ensuring policy updates directly target truthful and helpful reasoning:

$$\nabla_\theta \mathbb{E}[R_{\text{geo}}] \propto \nabla_\theta \mathrm{THS}(\pi_\theta)$$

This alignment is formally proven (Theorem 4.2) and ensures that reward optimization is tightly coupled to desirable model behavior (Gui et al., 3 Feb 2026).
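A minimal sketch of the piecewise reward, assuming an external judge supplies the outcome label and the baseline rates $x_0$, $y_0$ are estimated elsewhere:

```python
# Sketch of geometric reward shaping: payoffs are scaled by the baseline
# correctness rate x0 = P(C) and hallucination rate y0 = P(H), so the
# expected-reward gradient tracks the THS gradient. The string outcome
# labels are an illustrative convention.

def geometric_reward(outcome: str, x0: float, y0: float) -> float:
    if outcome == "correct":        # T in S_c
        return +y0
    if outcome == "miss":           # T in S_m (abstained / no answer)
        return 0.0
    if outcome == "hallucinated":   # T in S_h
        return -x0
    raise ValueError(f"unknown outcome: {outcome!r}")

# With baseline rates P(C) = 0.6, P(H) = 0.25:
print(geometric_reward("correct", 0.6, 0.25))       # 0.25
print(geometric_reward("hallucinated", 0.6, 0.25))  # -0.6
```

Note the asymmetry: when the model already answers correctly most of the time (large $x_0$), hallucinations are punished harder, while the reward for yet another correct answer shrinks as hallucinations become rare (small $y_0$).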

3. Faithfulness-Aware Advantage Modulation Mechanism

The central innovation of faithfulness-aware advantage modulation is its assignment of policy-gradient credit at the granularity of trajectory steps and tokens, conditioned on their faithfulness status. For each sampled trajectory $T_i$, step $s_j$, and token $t$, the modulation scalar is

$$M_{i,t} = \begin{cases} (1-\alpha)\, V(s_j) + \alpha, & A_i > 0 \\ (1-\alpha)\,(1 - V(s_j)) + \alpha, & A_i \leq 0 \end{cases}$$

with $V(s_j) \in \{0, 1\}$ indicating step faithfulness, $A_i$ the normalized group-relative advantage, and $\alpha \in [0,1)$ the faithfulness coefficient (strict filtering at $\alpha = 0$). The final per-token advantage is then

$$A^f_{i,t} = M_{i,t}\, A_i$$

This mechanism ensures that positive RL updates accrue only to faithful reasoning steps, while negative updates concentrate on unfaithful steps. The full objective combines these token-wise advantages in a clipped importance-sampling objective:

$$J_{\mathrm{FA}}(\theta) = \mathbb{E}_{q,\, T_1 \ldots T_N \sim \pi_{\theta_{\text{old}}}} \left[ \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^{|T_i|} M_{i,t} \cdot \mathrm{CLIP}\!\left(\frac{\pi_\theta}{\pi_{\theta_{\text{old}}}},\, A_i\right) \right]$$

As a result, the RL signal reinforces only those partial derivations grounded in the minimal required evidence, aligning learning directly with faithfulness maximization (Gui et al., 3 Feb 2026).
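The modulation rule and the clipped surrogate it multiplies can be sketched as follows (NumPy; `modulation` and `clipped_term` are illustrative helpers, with a standard PPO-style clip standing in for CLIP):

```python
import numpy as np

# Sketch of the modulation scalar M_{i,t} and the modulated advantage
# A^f_{i,t} = M_{i,t} * A_i. `step_faithful` holds the verifier output
# V(s_j), broadcast to every token of the corresponding step.

def modulation(advantage: float, step_faithful, alpha: float = 0.0) -> np.ndarray:
    v = np.asarray(step_faithful, dtype=float)
    if advantage > 0:
        return (1 - alpha) * v + alpha          # credit only faithful steps
    return (1 - alpha) * (1 - v) + alpha        # penalize only unfaithful steps

def clipped_term(ratio: np.ndarray, advantage: float, eps: float = 0.2) -> np.ndarray:
    # PPO/GRPO-style CLIP(pi_theta / pi_old, A) as an assumed stand-in
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - eps, 1 + eps) * advantage)

# Strict filtering (alpha = 0): with A_i = 0.8, the third token, whose
# step is unfaithful (V = 0), receives zero positive credit.
m = modulation(0.8, [1, 1, 0, 1])
a_f = m * 0.8                                       # A^f_{i,t}
surrogate = m * clipped_term(np.array([1.1, 0.9, 1.5, 1.0]), 0.8)
```

At $\alpha > 0$, the unfaithful tokens keep a small fraction $\alpha$ of the credit rather than being zeroed out, softening the filter.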

4. Step-Level Credit Assignment: Algorithmic Realization

FaithRL's training kernel can be summarized as follows:

  1. Trajectory Sampling: For each query, sample $N$ trajectories from the current policy $\pi_{\text{old}}$ and decompose each into steps and tokens.
  2. Outcome Evaluation: Compute outcome rewards $r_i = R_{\text{geo}}(T_i)$ for all trajectories using the current baseline rates.
  3. Advantage Calculation: Normalize to obtain group-relative advantages $A_i$.
  4. Faithfulness Verification: For each step $s_{i,j}$, set $V(s_{i,j}) = 1$ if and only if only evidence from $E(q)$ is used.
  5. Modulation Scalar Assignment: Compute $M_{i,t}$ for each token $t$ within $s_{i,j}$ according to the sign of $A_i$.
  6. Policy Update: Aggregate per-token advantages $A^f_{i,t} = M_{i,t} A_i$ for policy-gradient computation and update $\theta$.
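The six steps above can be condensed into a self-contained toy sketch; the dataclasses, outcome judge, and per-step verifier are stand-ins, not the paper's implementation:

```python
import numpy as np
from dataclasses import dataclass

# Toy sketch of one FaithRL-style update for a single query. Steps 1-2
# (sampling, judging) are assumed to have produced the Traj objects.

@dataclass
class Step:
    tokens: list     # token ids in this step
    evidence: set    # evidence items this step cites

@dataclass
class Traj:
    steps: list
    outcome: str     # "correct" | "miss" | "hallucinated"

def geo_reward(t: Traj, x0: float = 0.5, y0: float = 0.3) -> float:
    # Toy geometric reward with assumed baseline rates x0 = P(C), y0 = P(H)
    return {"correct": y0, "miss": 0.0, "hallucinated": -x0}[t.outcome]

def faithrl_token_advantages(trajs, minimal_evidence, alpha: float = 0.0):
    # 2-3. Outcome rewards -> group-relative (normalized) advantages A_i
    r = np.array([geo_reward(t) for t in trajs])
    adv = (r - r.mean()) / (r.std() + 1e-8)
    out = []
    for t, a in zip(trajs, adv):
        # 4. Step-level verification: V(s) = 1 iff only evidence from E(q) is used
        v = np.array([float(s.evidence <= minimal_evidence) for s in t.steps])
        v_tok = np.repeat(v, [len(s.tokens) for s in t.steps])
        # 5. Modulation scalar M_{i,t}, conditioned on the sign of A_i
        m = (1 - alpha) * v_tok + alpha if a > 0 else (1 - alpha) * (1 - v_tok) + alpha
        # 6. Per-token modulated advantages A^f = M * A for the clipped PG update
        out.append(m * a)
    return out

E = {"e1", "e2"}
trajs = [
    Traj([Step([1, 2], {"e1"}), Step([3], {"e2"})], "correct"),
    Traj([Step([4], {"e9"}), Step([5, 6], {"e1"})], "hallucinated"),
]
advs = faithrl_token_advantages(trajs, E)
# Positive credit lands only on the faithful trajectory's tokens; the
# negative update concentrates on the step that cited e9 (not in E(q)).
```

The group-mean baseline in step 3 is the GRPO convention; in the full objective these per-token advantages multiply the clipped importance ratio before the gradient step on $\theta$.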

Empirical ablation confirms that the combination of geometric rewards and faithfulness-aware advantage modulation yields the best trade-off between answer correctness and hallucination suppression—e.g., on MuSiQue-Full with Qwen-2.5B, the THS increases from 22.5 (baseline) to 51.8 (full FaithRL), with simultaneous increases in correctness and reductions in hallucination (Gui et al., 3 Feb 2026).

5. Theoretical Guarantees: Avoiding Overconfidence Bias

Theoretical analysis establishes that, under strict filtering ($\alpha = 0$), the FaithRL policy gradient collects positive updates only from trajectories that are both faithful and correct, and negative updates only from those that are both unfaithful and incorrect. This property eliminates two critical sources of standard RL bias:

  • Reinforcement of spurious correct guesses (by withholding positive reward from unfaithful but correct chains)
  • Penalization of valid but failed reasoning chains (by withholding negative reward from faithful but incorrect chains)

Consequently, policy updates are exactly aligned with maximizing the probability of faithful trajectories, thereby provably avoiding the overconfidence collapse (where $P(M) \to 0$) that afflicts outcome-only RL (Gui et al., 3 Feb 2026).

6. Empirical Performance and Generalization

Extensive evaluation demonstrates that faithfulness-aware advantage modulation robustly outperforms alternatives across a range of architectures and tasks. Representative results include:

| Model | Baseline THS | +Geometric Reward | +FAAM Only | FaithRL (Full) |
|---|---|---|---|---|
| Qwen-2.5B | 22.5 | 43.0 | 27.5 | 51.8 |
  • In-domain, Llama-3.1-8B: FaithRL reduces hallucinations by ~4.7 points, raises correctness by ~1.6 points, for an average THS gain of +9.0 over baseline.
  • Out-of-domain generalization (GSM8k): Correctness improves from 72.3% to 86.9%, hallucination drops from 26.5% to 12.2%, and the faithful-step ratio among correct samples rises from 72.7% to 79.5%.
  • FaithRL is the only method whose per-step faithfulness ratio steadily increases during training (from ~0.51 to ~0.83 over 15k updates); RL baselines such as GRPO and TruthRL do not show this trend.

This comprehensive improvement is achieved without sacrificing correctness, indicating that optimizing explicit faithfulness at the intermediate step level substantially raises both the average fidelity and the floor of model performance (Gui et al., 3 Feb 2026).

Faithfulness-aware advantage modulation extends the paradigm established in correctness-aware low-entropy segment–based advantage shaping (e.g., LESS (Chen et al., 30 Nov 2025)), which amplifies updates in stable reasoning segments by correctness overlap, and complements step-level reinforcement frameworks that combine explicit faithfulness rewards with implicit contrastive resampling strategies (e.g., FaithRL for small reasoning models (Nie et al., 5 Feb 2026)). While LESS uses segment entropy and correctness overlap, FaithRL’s faithfulness-aware modulation focuses on binary verification of evidence usage and geometric reward alignment with THS. Both strands share the principle of modulating the RL signal using fine-grained, correctness- or faithfulness-oriented criteria, but they differ in their operationalization of reasoning fidelity.

References

  • (Gui et al., 3 Feb 2026) "Learning to Reason Faithfully through Step-Level Faithfulness Maximization"
  • (Chen et al., 30 Nov 2025) "Beyond High-Entropy Exploration: Correctness-Aware Low-Entropy Segment-Based Advantage Shaping for Reasoning LLMs"
  • (Nie et al., 5 Feb 2026) "Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models"
