Textual Answer-Matching Reward
- Textual answer-matching reward is a formal evaluation signal that assesses LM outputs by comparing generated answers with ground-truth references for both surface-level and semantic accuracy.
- It employs dual criteria—factual correctness and faithfulness to supporting documents—enabling granular, sentence-level reward decomposition.
- Recent architectures like TRM and CompassVerifier integrate generative reasoning and discriminative classification within reinforcement learning frameworks to enhance answer quality.
A textual answer-matching reward is a formalized signal for training or evaluating LMs by comparing generated answers to ground-truth or reference answers. In open-domain question answering (QA) and related settings, automated answer matching must address both surface-level and semantic correctness, detect subtle forms of invalidity, and scale across domains, reasoning types, and answer formats. Recent advances have introduced structured generative reward models and robust verifier architectures as core components for alignment and reinforcement learning with verifiable rewards (RLVR), enabling fine-grained assessment and optimization of model outputs (Ma et al., 29 Sep 2025, Liu et al., 5 Aug 2025).
1. Formal Definitions of Textual Answer-Matching Reward
Let a QA instance be indexed by $i$, and its output answer decomposed into $K_i$ sentences, each denoted $s_{i,k}$. Two orthogonal but complementary criteria are defined for each sentence:
- Faithfulness score $f_{i,k} \in \{0,1\}$: indicates whether $s_{i,k}$ is semantically aligned with the supporting documents.
- Correctness score $c_{i,k} \in \{0,1\}$: indicates whether $s_{i,k}$ is factually correct, considering both external sources and the model's internal consistency.
These are combined as a scalar reward
$$r_{i,k} = \alpha\, f_{i,k} + (1 - \alpha)\, c_{i,k},$$
where $\alpha \in [0,1]$ balances the importance of faithfulness against correctness. Sentence segmentation and per-sentence annotation are essential: they allow errors and strengths to be localized and reasoned about at a finer granularity than whole-answer scoring.
In alternative architectures, such as the three-way classification used in CompassVerifier, the reward is formulated as $r = R(\hat{y}, y)$, where $\hat{y}$ is the model prediction and $y$ is the reference label. The reward mapping $R$ is typically binary ($r = 1$ only for correct matches) or ternary (with an explicit penalty for invalid answers) (Liu et al., 5 Aug 2025).
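Both formulations can be sketched in a few lines. This is a minimal illustration, not either paper's implementation; the blend weight `alpha` and the invalid-answer penalty value are assumptions:

```python
def sentence_reward(f: int, c: int, alpha: float = 0.5) -> float:
    """Blend per-sentence faithfulness f and correctness c (both in {0, 1})
    into the scalar reward r = alpha * f + (1 - alpha) * c."""
    return alpha * f + (1 - alpha) * c

def ternary_reward(label: str, invalid_penalty: float = -1.0) -> float:
    """Map a CompassVerifier-style label to a scalar reward: +1 for a
    correct match, 0 for incorrect, explicit penalty for invalid answers."""
    return {"correct": 1.0, "incorrect": 0.0, "invalid": invalid_penalty}[label]

# Per-sentence rewards for an answer whose three sentences are, in order,
# (faithful, correct), (faithful, incorrect), (unfaithful, incorrect):
rewards = [sentence_reward(f, c, alpha=0.6) for f, c in [(1, 1), (1, 0), (0, 0)]]
```

Raising `alpha` shifts credit toward document alignment; lowering it emphasizes standalone factual correctness.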
2. Generative and Discriminative Reward Architectures
Two principal reward model architectures have emerged for textual answer matching:
- Thinking-supervised Reward Model (TRM): A generative transformer that, given a query, supporting documents, and an answer decomposed into sentences, produces for each sentence:
- A binary faithfulness prediction.
- A natural language reasoning chain linking faithfulness with factuality.
- A binary correctness judgment.
Input linearization encodes all context and sentence markers in a single sequence, and the model emits explicit spans for each output criterion (Ma et al., 29 Sep 2025).
- CompassVerifier: A sequence classifier operating over triplet inputs $(q, a, r)$: question, reference answer, and model response. It projects the last token's hidden state to three logits corresponding to Correct, Incorrect, or Invalid via a softmax layer. The architecture is domain-agnostic and supports multi-subproblem, formulaic, and sequence-type answers through segment and global attention encodings. Robustness is achieved via adversarial augmentation, formula-equivalence handling, and explicit invalid-answer detection (Liu et al., 5 Aug 2025).
| Architecture | Input Format | Output Labels / Scores |
|---|---|---|
| TRM (Generative RM) | ⟨Q⟩Query ⟨D⟩Docs ⟨A⟩[s₁]...[s_K] | Per-sentence: Faithfulness, Reasoning, Correctness |
| CompassVerifier | [Question] ‖ [Reference] ‖ [Model Response] | {Correct, Incorrect, Invalid} |
Explicit decomposition, as in TRM, enforces sequential consideration of external evidence before internal correctness reflection; discriminative verifiers like CompassVerifier focus on direct label assignment but leverage rich augmentation and robust input features to cover diverse failure modes.
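The discriminative head can be sketched schematically: project the final token's hidden state to three logits and take a softmax. The dimensions and random weights below are illustrative stand-ins, not the model's actual parameters:

```python
import numpy as np

LABELS = ["Correct", "Incorrect", "Invalid"]

def classify_last_token(hidden_states: np.ndarray, W: np.ndarray, b: np.ndarray):
    """hidden_states: (seq_len, d_model) encoder output over the
    [Question] || [Reference] || [Model Response] sequence; W: (d_model, 3)
    projection. Returns (predicted label, softmax probabilities)."""
    logits = hidden_states[-1] @ W + b       # project the last token's state
    probs = np.exp(logits - logits.max())    # numerically stable softmax
    probs /= probs.sum()
    return LABELS[int(probs.argmax())], probs

rng = np.random.default_rng(0)
h = rng.normal(size=(16, 32))                # toy sequence of hidden states
W, b = rng.normal(size=(32, 3)), np.zeros(3)
label, probs = classify_last_token(h, W, b)
```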
3. Training Data, Supervision, and Losses
Reward model training requires high-quality, well-annotated data pools:
- TRM/Sentence-level supervision: Data is constructed by document retrieval, followed by sentence segmentation of answers and stepwise annotation: faithfulness (to source), correctness (with justification). Training objectives are:
- Supervised Fine-Tuning (SFT): Token-level negative log-likelihood over sequences of faithfulness, reasoning, and correctness spans.
- Reinforcement Learning (RL): Policy gradients using Group Relative Policy Optimization (GRPO), optimizing expected reward plus KL divergence penalty between current and reference policies. Class imbalance is addressed via bonus terms for correctly identified negative sentences (Ma et al., 29 Sep 2025).
- CompassVerifier: Trained via cross-entropy loss on (question, reference, response, gold label) tuples. The dataset, VerifierBench, comprises over 1.32M raw triples sampled from 53 models × 16 datasets, followed by multi-expert voting, multi-prompt voting, and human adjudication over disputed and edge cases. Robustness is further enhanced by:
- Complex-Formula Augmentation: 18k formulaic equivalence examples.
- Error-driven Augmentation: 24k meta-judge exemplars covering partial or malformed answers.
- Generalizability Augmentation: Paraphrasing prompts and truncating reasoning to adapt to prompt/context variability.
- Invalid and flawed sample identification and explicit handling in loss and reward mapping are central to accurate penalty assignment (Liu et al., 5 Aug 2025).
4. Integration with Reinforcement Learning Pipelines
Reward models are deployed as oracles to evaluate and optimize the outputs of a target QA policy within an RLVR framework:
- TRM-based Policy Optimization: Each rollout's answer is decomposed and passed through TRM, which produces per-sentence faithfulness and correctness scores used as the accuracy reward. To promote overall usefulness (fluency, coverage, style), a preference model (Prefer) is introduced: each candidate is compared against an anchor with a perfect TRM score, and Prefer contributes an additional reward component. The final reward for each sentence becomes
$$r_{i,k} = \alpha\, f_{i,k} + (1 - \alpha)\, c_{i,k} + \lambda\, r^{\text{pref}}_{i},$$
where $r^{\text{pref}}_{i}$ is the Prefer score for the candidate and $\lambda$ weights the usefulness component; GRPO is then used for the policy update (Ma et al., 29 Sep 2025).
- CompassVerifier as RL Critic: Given a group of rollouts, CompassVerifier assigns a label $\hat{y}_g$ and an associated scalar reward $r_g$ to each rollout $g$. The group's mean reward normalizes the individual rewards, providing the advantage signal for the clipped-ratio policy-gradient objective. Invalid answers are detected and penalized via the reward mapping, and the process includes explicit pseudocode for batched sampling, mean-reward computation, and policy parameter updates (Liu et al., 5 Aug 2025).
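The group-relative advantage computation can be sketched as follows; scaling by the group standard deviation is a common GRPO choice and an assumption here, as are the example reward values:

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: center each rollout's reward on the group
    mean and scale by the group standard deviation, yielding the signal
    fed to the clipped-ratio policy-gradient objective."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts of one question: correct, correct, incorrect, invalid
# (the invalid answer carries an explicit penalty reward).
rewards = [1.0, 1.0, 0.0, -1.0]
adv = group_advantages(rewards)
```

Correct rollouts receive positive advantage, the invalid rollout a strongly negative one, and the advantages sum to zero within the group.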
5. Evaluation Benchmarks, Quantitative Results, and Error Analysis
- TRM Evaluation: 2,133 queries, 86,482 sentences, with 13.1% negative. Baselines are outcome-supervised (ORM) and process-supervised (PRM) reward models. F1 scores for incorrect sentence detection and worst-answer identification rates serve as key metrics.
| Model | F1 (Incorrect) | Detection Rate |
|---|---|---|
| ORM | 0.2564 | 0.3222 |
| PRM | 0.3194 | 0.3479 |
| TRM (SFT) | 0.3384 | 0.3643 |
| TRM+ (RL) | 0.3447 | 0.3690 |
In policy optimization, combining TRM with Prefer yields correctness gains up to +23% on domain-specific tasks and +30% out-of-domain, and enhances sentence-level usefulness (Ma et al., 29 Sep 2025).
- CompassVerifier and VerifierBench: The benchmark contains 2,817 curated quadruples (q, a, r, y) post-multi-stage filtering. Evaluation uses accuracy, micro/macro F1 (binary and ternary), and domain/answer-type breakdowns for coverage. Explicit invalid answer handling (e.g., truncation, repetition, refusal) leads to improved hygiene and calibration. CompassVerifier demonstrates high-accuracy detection of both mathematical equivalence and sequence-level errors, supporting robust RL reward signals (Liu et al., 5 Aug 2025).
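The micro and macro F1 aggregates over the ternary labels can be computed in plain Python; the example labels are made up for illustration:

```python
def f1_scores(y_true, y_pred, labels=("Correct", "Incorrect", "Invalid")):
    """Per-class F1 plus micro (pooled counts) and macro (unweighted mean
    over classes) aggregates for a multi-class verifier."""
    per_class, tp_all, fp_all, fn_all = {}, 0, 0, 0
    for lab in labels:
        tp = sum(t == p == lab for t, p in zip(y_true, y_pred))
        fp = sum(p == lab != t for t, p in zip(y_true, y_pred))
        fn = sum(t == lab != p for t, p in zip(y_true, y_pred))
        per_class[lab] = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    micro = 2 * tp_all / (2 * tp_all + fp_all + fn_all)
    macro = sum(per_class.values()) / len(labels)
    return per_class, micro, macro

y_true = ["Correct", "Correct", "Incorrect", "Invalid", "Correct"]
y_pred = ["Correct", "Incorrect", "Incorrect", "Invalid", "Correct"]
per_class, micro, macro = f1_scores(y_true, y_pred)
```

For single-label multi-class prediction, micro F1 reduces to accuracy, while macro F1 surfaces weak performance on rare classes such as Invalid.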
6. Generalization and Best Practices
Best practices highlighted across both architectures include:
- Sentence-level segmentation with explicit faithfulness and correctness annotation, including brief free-form reasoning, is critical for fine-grained reward supervision.
- Explicit decomposition of the reward criteria—whether through generative reasoning chains or discriminative multi-class labeling—prevents conflation of document alignment with factual accuracy, countering overreliance on external texts.
- Combining intermediate faithfulness and correctness signals yields more robust and informative rewards than endpoint-only matching, especially for knowledge-intensive tasks.
- Addressing class-imbalance (e.g., predominance of correct sentences) is essential for stable training, accomplished via bonus/penalty shaping and robust augmentation schemes.
- For practical use, reward aggregation strategies (e.g., blending accuracy- and usefulness-based signals) and careful reward scaling/allocation (via the $\alpha$, $\lambda$ coefficients) enable domain adaptation and task-specific trade-offs.
- The reward modeling frameworks generalize to related settings—summarization, table QA, retrieval-augmented generation—provided appropriately annotated, sentence-level data is available.
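The aggregation-and-scaling practice can be illustrated with a simple weighted blend; the coefficient names `alpha` and `lam` and their default values are illustrative assumptions, not prescribed settings:

```python
def blended_reward(faithfulness: float, correctness: float,
                   usefulness: float, alpha: float = 0.5,
                   lam: float = 0.2) -> float:
    """Blend an accuracy-based signal (alpha-weighted faithfulness and
    correctness) with a lambda-weighted usefulness/preference signal.
    Tuning alpha and lam realizes domain- and task-specific trade-offs."""
    accuracy = alpha * faithfulness + (1 - alpha) * correctness
    return accuracy + lam * usefulness

# A faithful, correct sentence with a moderately preferred style:
r = blended_reward(1.0, 1.0, usefulness=0.7)
```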
A plausible implication is that ongoing advances in textual answer-matching reward models will further raise the bar for faithful, factually correct, and practically useful outputs in complex QA tasks, as models learn to both align with evidence and critically appraise their own generations.