Faithfulness-Aware Advantage Modulation
- The paper introduces faithfulness-aware advantage modulation to align policy-gradient credit with the minimal evidence required, reducing hallucinated answers.
- It employs geometric reward shaping to continuously trade off correctness rewards against hallucination penalties, yielding significant gains in the truthfulness-helpfulness score (THS).
- Empirical results on models like Qwen-2.5B demonstrate enhanced reasoning fidelity and performance by reinforcing faithful intermediate steps.
Faithfulness-aware advantage modulation is a reinforcement learning (RL) strategy for LLMs and small reasoning models (SRMs) that aims to improve the fidelity of step-by-step reasoning by selectively assigning policy-gradient credit based on the faithfulness of intermediate steps with respect to the required evidence set. By incorporating explicit verification and fine-grained reward shaping, these mechanisms address weaknesses of standard RL pipelines—most notably the reinforcement of overconfident or spurious intermediate reasoning that may lead to hallucinated answers. This approach has been instantiated in frameworks such as FaithRL, which combines geometric reward shaping with faithfulness-aware modulation of the policy advantage signal, and is supported by both theoretical guarantees and empirical advances in reducing hallucination rates while maintaining or improving correctness across reasoning benchmarks (Gui et al., 3 Feb 2026, Nie et al., 5 Feb 2026).
1. Faithfulness-Maximization and RL Objective Formulation
The design of faithfulness-aware advantage modulation centers on explicitly maximizing the probability that the LLM's reasoning trajectory uses exactly the minimal evidence required for a given query. Let $\theta$ denote the policy parameters, $\pi_\theta(\cdot \mid q)$ the trajectory distribution, $q$ the query, and $E^*(q)$ its minimal evidence subset. The set of faithful trajectories is defined as

$$\mathcal{T}_{\text{faith}}(q) = \{\tau : E(\tau) = E^*(q)\},$$

where $E(\tau)$ is the set of knowledge items cited in trajectory $\tau$. The faithfulness maximization objective is

$$\max_\theta \ \mathbb{P}_{\tau \sim \pi_\theta(\cdot \mid q)}\bigl[\tau \in \mathcal{T}_{\text{faith}}(q)\bigr].$$
In practice, RL surrogates optimize a shaped reward via a policy-gradient algorithm (e.g., GRPO or PPO), grounding the learning objective directly in minimizing step-level hallucinations and maximizing the truthfulness-helpfulness score (THS) (Gui et al., 3 Feb 2026).
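The faithfulness predicate above reduces to a set comparison between the evidence a trajectory cites and the minimal evidence set, and the objective to the fraction of sampled trajectories passing that check. A minimal sketch in Python (the `Trajectory` container and its field name are illustrative assumptions, not the paper's API):

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    # Knowledge items cited across the reasoning steps (hypothetical layout).
    cited_evidence: set = field(default_factory=set)


def is_faithful(traj: Trajectory, minimal_evidence: set) -> bool:
    """A trajectory is faithful iff it cites exactly the minimal evidence E*(q)."""
    return traj.cited_evidence == minimal_evidence


def faithfulness_rate(trajectories, minimal_evidence) -> float:
    """Monte-Carlo estimate of P[tau in T_faith(q)] over sampled trajectories."""
    hits = sum(is_faithful(t, minimal_evidence) for t in trajectories)
    return hits / len(trajectories)
```

In an RL loop this estimate would be driven up indirectly through the shaped reward rather than optimized in closed form.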
2. Geometric Reward Shaping and THS Alignment
Rather than relying solely on binary correctness rewards, geometric reward design introduces continuous rewards based on the model's baseline rates for correctness ($\bar{p}_c$) and hallucination ($\bar{p}_h$):

$$R(\tau) = \begin{cases} 1 - \bar{p}_h, & \tau \in \mathcal{C} \\ 0, & \tau \in \mathcal{M} \\ -\bar{p}_c, & \tau \in \mathcal{H}, \end{cases}$$

where $\mathcal{C}$, $\mathcal{M}$, and $\mathcal{H}$ denote the sets of correct, miss, and hallucinated trajectories, respectively. The gradient of the expected geometric reward aligns with the gradient of the THS metric, $\mathrm{THS}(\theta) = p_c(\theta)\bigl(1 - p_h(\theta)\bigr)$, ensuring policy updates directly target truthful and helpful reasoning:

$$\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\bigl[R(\tau)\bigr] = \nabla_\theta\, \mathrm{THS}(\theta)\Big|_{p_c = \bar{p}_c,\ p_h = \bar{p}_h}.$$
This alignment is formally proven (Theorem 4.2) and ensures that reward optimization is tightly coupled to desirable model behavior (Gui et al., 3 Feb 2026).
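One concrete way to realize such a case-based shaped reward is sketched below; the specific case values, and the treatment of the baseline rates as constants per policy snapshot, are assumptions consistent with the description above rather than the paper's exact constants:

```python
def geometric_reward(outcome: str, p_c: float, p_h: float) -> float:
    """Shaped reward from baseline correctness rate p_c and hallucination
    rate p_h, treated as fixed for the current policy snapshot.
    Case values are illustrative assumptions."""
    if outcome == "correct":
        return 1.0 - p_h   # correct answers are worth more when hallucination is common
    if outcome == "miss":
        return 0.0         # abstention / miss is neither rewarded nor punished
    if outcome == "hallucinated":
        return -p_c        # penalty scales with how correct the model already is
    raise ValueError(f"unknown outcome: {outcome}")
```

With these values, the expected reward is $(1 - \bar{p}_h)\,p_c - \bar{p}_c\,p_h$, so its gradient in the policy parameters matches the product-form THS gradient evaluated at the baseline rates.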
3. Faithfulness-Aware Advantage Modulation Mechanism
The central innovation of faithfulness-aware advantage modulation is its assignment of policy-gradient credit at the granularity of trajectory steps and tokens, conditioned on their faithfulness status. For each sampled trajectory $\tau_i$, step $s$, and token $t$, the modulation scalar is

$$m_{i,s} = \begin{cases} 1, & (\hat{A}_i > 0 \wedge f_{i,s} = 1) \ \text{or}\ (\hat{A}_i < 0 \wedge f_{i,s} = 0) \\ 1 - \lambda, & \text{otherwise}, \end{cases}$$

with $f_{i,s} \in \{0, 1\}$ indicating step faithfulness, $\hat{A}_i$ the normalized group-relative advantage, and $\lambda \in [0, 1]$ the faithfulness coefficient (strict filtering at $\lambda = 1$). The final per-token advantage is then

$$\hat{A}_{i,s,t} = m_{i,s}\, \hat{A}_i.$$
This mechanism ensures that positive RL updates accrue only to faithful reasoning steps, while negative updates concentrate on unfaithful steps. The full objective combines these token-wise advantages in a clipped importance-sampling objective:

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|\tau_i|} \sum_{s}\sum_{t \in s} \min\Bigl(r_{i,t}(\theta)\, \hat{A}_{i,s,t},\ \mathrm{clip}\bigl(r_{i,t}(\theta),\, 1 - \epsilon,\, 1 + \epsilon\bigr)\, \hat{A}_{i,s,t}\Bigr)\right],$$

where $r_{i,t}(\theta)$ is the token-level importance ratio between the current and behavior policies and $G$ is the group size.
As a result, the RL signal reinforces only those partial derivations grounded in the minimal required evidence, aligning learning directly with faithfulness maximization (Gui et al., 3 Feb 2026).
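The modulation itself is a small amount of logic. A sketch under the assumption that credit is kept when its sign matches step faithfulness and damped by (1 − λ) otherwise:

```python
def modulation_scalar(advantage: float, faithful: bool, lam: float = 1.0) -> float:
    """Keep full credit when its sign matches step faithfulness
    (positive credit for faithful steps, negative credit for unfaithful
    ones); otherwise damp it by (1 - lam). lam = 1.0 is strict filtering.
    The exact gating rule is an assumption consistent with the text."""
    sign_matches = (advantage > 0 and faithful) or (advantage < 0 and not faithful)
    return 1.0 if sign_matches else 1.0 - lam


def per_token_advantage(advantage: float, faithful: bool, lam: float = 1.0) -> float:
    """Per-token advantage: group-relative advantage scaled by the step's scalar."""
    return modulation_scalar(advantage, faithful, lam) * advantage
```

Intermediate values of λ interpolate between standard group-relative credit (λ = 0) and strict faithfulness filtering (λ = 1).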
4. Step-Level Credit Assignment: Algorithmic Realization
FaithRL's training kernel can be summarized as follows:
- Trajectory Sampling: For each query $q$, sample $G$ trajectories $\{\tau_i\}_{i=1}^{G}$ from the current policy and decompose each into steps and tokens.
- Outcome Evaluation: Compute outcome rewards $R(\tau_i)$ for all trajectories using the current baseline rates.
- Advantage Calculation: Normalize $\{R(\tau_i)\}$ to obtain group-relative advantages $\hat{A}_i$.
- Faithfulness Verification: For each step $s$, evaluate $f_{i,s} = 1$ if and only if only evidence from $E^*(q)$ is used.
- Modulation Scalar Assignment: Compute $m_{i,s}$ for each token within step $s$ according to the sign of $\hat{A}_i$.
- Policy Update: Aggregate per-token advantages for the policy-gradient computation and update $\theta$.
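The steps above can be sketched as a single credit-assignment pass. The trajectory layout (an outcome reward plus per-step cited-evidence sets) and the subset-based step check are illustrative assumptions:

```python
import statistics


def group_relative_advantages(rewards):
    """GRPO-style normalization: (R_i - mean) / std over the sampled group."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sd for r in rewards]


def faithrl_step(trajectories, minimal_evidence, lam=1.0):
    """One simplified FaithRL credit-assignment pass.

    Each trajectory is (outcome_reward, [cited_evidence_per_step]);
    this layout and the subset-based faithfulness check are assumptions.
    Returns per-step modulated advantages (shared by the step's tokens).
    """
    advantages = group_relative_advantages([r for r, _ in trajectories])
    step_advantages = []
    for (_, steps), adv in zip(trajectories, advantages):
        creds = []
        for cited in steps:
            faithful = cited <= minimal_evidence  # step uses only required evidence
            match = (adv > 0 and faithful) or (adv < 0 and not faithful)
            m = 1.0 if match else 1.0 - lam
            creds.append(m * adv)
        step_advantages.append(creds)
    return step_advantages
```

The returned per-step advantages would then feed the clipped importance-sampling objective in place of a uniform trajectory-level advantage.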
Empirical ablation confirms that the combination of geometric rewards and faithfulness-aware advantage modulation yields the best trade-off between answer correctness and hallucination suppression—e.g., on MuSiQue-Full with Qwen-2.5B, the THS increases from 22.5 (baseline) to 51.8 (full FaithRL), with simultaneous increases in correctness and reductions in hallucination (Gui et al., 3 Feb 2026).
5. Theoretical Guarantees: Avoiding Overconfidence Bias
Theoretical analysis establishes that, under strict filtering ($\lambda = 1$), the FaithRL policy gradient collects positive updates only from trajectories that are both faithful and correct, and negative updates only from those that are both unfaithful and incorrect. This property eliminates two critical sources of standard RL bias:
- Reinforcement of spurious correct guesses (by withholding positive reward from unfaithful but correct chains)
- Penalization of valid but failed reasoning chains (by withholding negative reward from faithful but incorrect chains)
Consequently, policy updates are exactly aligned with maximizing the probability of faithful trajectories, thereby provably avoiding the overconfidence collapse that afflicts outcome-only RL (Gui et al., 3 Feb 2026).
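Assuming a sign-matching gating rule of the kind described above, the four faithfulness/correctness cases at λ = 1 can be enumerated directly, with a positive advantage standing in for a correct (above-group-mean) trajectory:

```python
def modulated_credit(advantage: float, faithful: bool, lam: float = 1.0) -> float:
    """Hypothetical strict-filtering rule: keep credit only when its sign
    matches the step's faithfulness; damp mismatches by (1 - lam)."""
    match = (advantage > 0 and faithful) or (advantage < 0 and not faithful)
    return (1.0 if match else 1.0 - lam) * advantage


# Four cases at lam = 1; +1/-1 stand in for correct/incorrect advantages.
cases = {
    ("faithful", "correct"): modulated_credit(+1.0, True),       # positive update
    ("unfaithful", "correct"): modulated_credit(+1.0, False),    # withheld: no spurious reinforcement
    ("faithful", "incorrect"): modulated_credit(-1.0, True),     # withheld: valid reasoning not punished
    ("unfaithful", "incorrect"): modulated_credit(-1.0, False),  # negative update
}
```

Only the two diagonal cases carry gradient signal, matching the bias-elimination property stated above.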
6. Empirical Performance and Generalization
Extensive evaluation demonstrates that faithfulness-aware advantage modulation robustly outperforms alternatives across a range of architectures and tasks. Representative THS results on MuSiQue-Full include:
| Model | Baseline THS | +Geometric Reward | +FAAM Only | FaithRL (Full) |
|---|---|---|---|---|
| Qwen-2.5B | 22.5 | 43.0 | 27.5 | 51.8 |
- In-domain, Llama-3.1-8B: FaithRL reduces hallucinations by ~4.7 points, raises correctness by ~1.6 points, for an average THS gain of +9.0 over baseline.
- Out-of-domain generalization (GSM8k): Correctness improves from 72.3% to 86.9%, hallucination drops from 26.5% to 12.2%, and the faithful-step ratio among correct samples rises from 72.7% to 79.5%.
- Only FaithRL steadily increases the per-step faithfulness ratio during training (from ~0.51 to ~0.83 over 15k updates), whereas RL baselines such as GRPO and TruthRL show no comparable trend.
This comprehensive improvement is achieved without sacrificing correctness, indicating that optimizing explicit faithfulness at the intermediate step level substantially raises both the average fidelity and the floor of model performance (Gui et al., 3 Feb 2026).
7. Connections to Related Approaches
Faithfulness-aware advantage modulation extends the paradigm established in correctness-aware low-entropy segment–based advantage shaping (e.g., LESS (Chen et al., 30 Nov 2025)), which amplifies updates in stable reasoning segments by correctness overlap, and complements step-level reinforcement frameworks that combine explicit faithfulness rewards with implicit contrastive resampling strategies (e.g., FaithRL for small reasoning models (Nie et al., 5 Feb 2026)). While LESS uses segment entropy and correctness overlap, FaithRL’s faithfulness-aware modulation focuses on binary verification of evidence usage and geometric reward alignment with THS. Both strands share the principle of modulating the RL signal using fine-grained, correctness- or faithfulness-oriented criteria, but they differ in their operationalization of reasoning fidelity.
References
- (Gui et al., 3 Feb 2026) "Learning to Reason Faithfully through Step-Level Faithfulness Maximization"
- (Chen et al., 30 Nov 2025) "Beyond High-Entropy Exploration: Correctness-Aware Low-Entropy Segment-Based Advantage Shaping for Reasoning LLMs"
- (Nie et al., 5 Feb 2026) "Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models"