ReFInE: Fine-Grained Interpretability & Evidence
- The paper introduces a framework that requires models to emit structured provenance triples (Quotation, Compression, Inference) alongside fluent answers.
- It employs a rigorous human-in-the-loop annotation process to ensure precise evidence extraction and accurate relation mapping.
- Empirical results reveal significant gains in provenance fidelity and inference accuracy compared to traditional LLM citation methods.
ReFInE (Relation-aware Fine-grained Interpretability & Evidence) is a supervised corpus and evaluation framework for generation-time, sentence-level provenance in multi-document question answering. It addresses limitations in traditional LLM-generated citations by requiring models to emit not only fluent answers but also structured, sentence-wise provenance triples, capturing nuanced relationships between generated content and underlying evidence. ReFInE formalizes provenance as a collection of (doc_id, sent_id, relation) tags per answer sentence, distinguishing Quotation, Compression, and Inference to enable precise attribution and interpretability in model outputs (Wei et al., 8 Jan 2026).
1. Dataset Specification and Provenance Schema
ReFInE is constructed for generation-time provenance, pairing every answer sentence $a_i$ in a reference answer with a provenance triple set

$P_i = \{(\text{doc\_id}_k, \text{sent\_id}_k, \text{rel}_k)\}_{k=1}^{K_i},$

with $\text{rel}_k$ drawn from {Quotation, Compression, Inference}. After each factual answer sentence, a [PROVE:] tag lists its supporting evidence, capturing both the source and the semantic relationship. The three relation types are distinguished as follows:
- Quotation: copies or closely paraphrases a single source sentence, with at least 80% lexical overlap. For example, "The dam released water because of heavy rainfall. [PROVE:(0,3,'Quotation')]"
- Compression: summarizes content spanning one or more contiguous source sentences, retaining all key facts but with condensed wording. Example: "The bird migrates annually in spring and returns in autumn. [PROVE:(1,2,'Compression'), (1,3,'Compression')]"
- Inference: synthesizes information not explicit in the sources but logically deducible (possibly multi-hop), e.g., "Annual sales quadrupled over the year. [PROVE:(2,5,'Inference')]"
Each answer sentence can be grounded in multiple source sentences and relations, allowing fine-grained mapping from output claims to evidence (Wei et al., 8 Jan 2026).
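The [PROVE:] tag format described above can be parsed mechanically. The following is a minimal sketch; the regex patterns and function names are our own, not from the paper:

```python
import re

# Matches a whole [PROVE:...] tag, then the (doc_id, sent_id, 'Relation') triples inside it.
TAG_RE = re.compile(r"\[PROVE:(.*?)\]")
TRIPLE_RE = re.compile(r"\((\d+)\s*,\s*(\d+)\s*,\s*'(Quotation|Compression|Inference)'\)")

def parse_provenance(answer: str):
    """Split an answer into (sentence_text, [(doc_id, sent_id, relation), ...]) pairs."""
    results = []
    pos = 0
    for tag in TAG_RE.finditer(answer):
        sentence = answer[pos:tag.start()].strip()  # text preceding this tag
        triples = [(int(d), int(s), r) for d, s, r in TRIPLE_RE.findall(tag.group(1))]
        results.append((sentence, triples))
        pos = tag.end()
    return results

example = ("The bird migrates annually in spring and returns in autumn. "
           "[PROVE:(1,2,'Compression'), (1,3,'Compression')]")
print(parse_provenance(example))
```

A parser like this is also what a format-adherence score presupposes: answers whose tags fail to match the grammar can be detected and penalized.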
2. Annotation Pipeline and Corpus Statistics
ReFInE's annotation is realized via a three-stage human-in-the-loop process:
- Preprocessing: Answers and retrieved source documents are segmented into sentences, each assigned a unique (DocID, SentID).
- LLM-Assisted Annotation and Filtering: GPT-4o proposes initial provenance triples; three annotators filter outputs for instruction compliance, fluency, and strict format adherence.
- Expert Validation and Reconstruction: Annotated answer sentences with [PROVE] tags are aggregated, and three specialists vet both sufficiency of cited evidence and relation-type labeling. Incorrect or invalid samples are revised or eliminated.
Corpus-level key statistics:
| Split | Instances | Proportion (%) |
|---|---|---|
| SFT (warm-up) | 12,540 | 55.4 |
| GRPO (RL align) | 5,256 | 23.2 |
| EVAL (test) | 4,838 | 21.4 |
Relation-type distribution across triples:
| Relation | Proportion (%) |
|---|---|
| Quotation | ≃ 70 |
| Compression | ≃ 20 |
| Inference | ≃ 10 |
Provenance density metrics:
- Average [PROVE] tags per answer: 3.96 (min 1, max 14)
- Average triples per answer: 7.64 (min 1, max 46)
- Average triples per tag: 1.98 (min 1, max 18)
This schema provides both high coverage and granularity, enabling robust provenance evaluation (Wei et al., 8 Jan 2026).
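The density statistics above can be recomputed directly from a parsed corpus. A minimal sketch over toy data (the answers below are illustrative, not drawn from ReFInE):

```python
# Each answer is a list of [PROVE] tags; each tag is a list of
# (doc_id, sent_id, relation) triples. Toy data, not ReFInE samples.
answers = [
    [[(0, 3, "Quotation")], [(1, 2, "Compression"), (1, 3, "Compression")]],
    [[(2, 5, "Inference")]],
]

tags_per_answer = [len(a) for a in answers]
triples_per_answer = [sum(len(tag) for tag in a) for a in answers]
triples_per_tag = [len(tag) for a in answers for tag in a]

print(sum(tags_per_answer) / len(answers))          # average [PROVE] tags per answer
print(sum(triples_per_answer) / len(answers))       # average triples per answer
print(sum(triples_per_tag) / len(triples_per_tag))  # average triples per tag
```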
3. GenProve Model and Optimization Framework
The GenProve framework operates in two main phases:
3.1 Supervised Fine-Tuning (SFT)
An LLM is fine-tuned on ReFInE to maximize the conditional log-likelihood

$\mathcal{L}_{\text{SFT}} = \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid y_{<t}, x),$

where $y$ is a reference answer interleaved with [PROVE] tags and $x$ is the input question with its retrieved documents. This drives the model to emit both fluent answers and well-formed structured provenance.
3.2 Group Relative Policy Optimization (GRPO)
Starting from the SFT policy $\pi_\theta$, groups of candidate answers are sampled per prompt and $\pi_\theta$ is updated via policy gradients to maximize a composite reward:
The reward combines:
- Content fidelity $R_{\text{content}}$: matches generated sentences to reference sentences by embedding cosine similarity (gated at a threshold $\tau$); pairs above the threshold are rewarded with their ROUGE-L score, and the result is averaged across all sentences.
- Provenance correctness $R_{\text{prov}}$: matches sentences via the same similarity gate (threshold $\tau$), computes F1 over the set intersections of supporting provenance triples, then averages the gated F1 scores.
Formally,

$R = \alpha \, R_{\text{content}} + \beta \, R_{\text{prov}},$

with the weights $\alpha$ and $\beta$ set empirically.
This dual-objective optimization aligns both textual fidelity and evidence-grounding under fine-grained provenance (Wei et al., 8 Jan 2026).
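The gated provenance reward can be sketched as follows. We substitute Jaccard token overlap for the paper's embedding cosine similarity, and the function names and the threshold value (0.5) are our own assumptions, not values reported in the paper:

```python
def similarity(a: str, b: str) -> float:
    """Jaccard token overlap; a stand-in for embedding cosine similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def triple_f1(pred: set, gold: set) -> float:
    """F1 over the intersection of two provenance triple sets."""
    tp = len(pred & gold)
    if not pred or not gold or tp == 0:
        return 0.0
    prec, rec = tp / len(pred), tp / len(gold)
    return 2 * prec * rec / (prec + rec)

def provenance_reward(pred_sents, gold_sents, tau: float = 0.5) -> float:
    """pred_sents/gold_sents: lists of (sentence_text, triple_set) pairs."""
    scores = []
    for p_text, p_triples in pred_sents:
        # Match each generated sentence to its most similar reference sentence.
        best = max(gold_sents, key=lambda g: similarity(p_text, g[0]), default=None)
        if best and similarity(p_text, best[0]) >= tau:
            scores.append(triple_f1(p_triples, best[1]))
        else:
            scores.append(0.0)  # gate: unmatched sentences earn no provenance reward
    return sum(scores) / len(scores) if scores else 0.0
```

The content-fidelity term would follow the same gating pattern, scoring matched pairs with ROUGE-L instead of triple F1, and the two terms would then be combined with the weights $\alpha$ and $\beta$.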
4. Empirical Results and Comparative Evaluation
GenProve, built on Qwen3-8B, is evaluated against 14 contemporary LLMs using ReFInE EVAL. Representative metrics include ROUGE-L, BLEU, METEOR, MoverScore, provenance precision/recall/F1, and format/judge scores.
| Model | ROUGE-L | BLEU | METEOR | MoverScore | Prec. | Rec. | F1 | Format | Judge |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-14B | 52.85 | 35.70 | 55.34 | 47.06 | 45.80 | 40.33 | 41.16 | 99.70 | 2.59 |
| GLM-4.5-355B | 49.81 | 35.05 | 57.69 | 44.92 | 48.84 | 44.03 | 44.55 | 98.63 | 2.63 |
| Gemini 2.5 Pro | 48.75 | 31.77 | 53.09 | 44.79 | 46.68 | 42.86 | 42.92 | 100.0 | 2.57 |
| GenProve | 57.25 | 42.22 | 59.39 | 51.04 | 54.96 | 51.26 | 51.21 | 99.85 | 3.14 |
GenProve achieves leading results across all measures, notably outperforming the next best in answer quality (+4.4 ROUGE-L), provenance F1 (+6.7), and LLM-judge (+0.51).
F1 for relation types:
| Model | Quotation F1 | Compression F1 | Inference F1 |
|---|---|---|---|
| Qwen3-14B | 69.2 | 43.1 | 18.4 |
| GLM-4.5 | 72.8 | 51.0 | 24.6 |
| GenProve | 84.5 | 61.3 | 41.7 |
GenProve's largest gains are in the Compression (+10–12 pts) and Inference (+17–23 pts) relations, indicating improved grounding beyond verbatim citation (Wei et al., 8 Jan 2026).
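The per-relation breakdown above amounts to restricting predicted and gold triple sets to a single relation type before scoring. A minimal sketch with toy triples (not ReFInE data; function names are our own):

```python
def f1(pred: set, gold: set) -> float:
    """F1 over two sets of (doc_id, sent_id, relation) triples."""
    tp = len(pred & gold)
    if not pred or not gold or tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def per_relation_f1(pred: set, gold: set, relation: str) -> float:
    # Keep only triples whose relation tag matches, then score as usual.
    keep = lambda triples: {t for t in triples if t[2] == relation}
    return f1(keep(pred), keep(gold))

pred = {(0, 3, "Quotation"), (1, 2, "Compression"), (2, 5, "Inference")}
gold = {(0, 3, "Quotation"), (1, 2, "Compression"), (1, 3, "Compression")}
print(per_relation_f1(pred, gold, "Quotation"))    # 1.0
print(per_relation_f1(pred, gold, "Compression"))  # precision 1.0, recall 0.5 -> F1 2/3
print(per_relation_f1(pred, gold, "Inference"))    # gold has no Inference triples -> 0.0
```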
5. Observations, Limitations, and Future Work
A clear ordering emerges in model difficulty: Quotation > Compression > Inference (e.g., GenProve: 84.5 > 61.3 > 41.7 F1), reflecting that surface-level attribution is tractable but verifiable inference remains challenging. GRPO's dual-reward design shows that improvements in answer quality and provenance accuracy are correlated, not antagonistic.
Observed failure modes:
- Unsynchronized provenance tags
- Incomplete [PROVE] coverage
- Incorrect localization (wrong doc or sent indices)
- Relation-type mislabeling
Known limitations include increased output length from structured triples, monolingual focus requiring taxonomy extension for multilingual settings, and inherent dependency on retrieval quality—absent evidence in retrieved documents precludes correct provenance.
Anticipated future directions comprise adding retrieval optimization to the pipeline, broadening provenance taxonomies (e.g., to include argumentative or discourse-aware relations), and applying direct supervision for multi-step inference (Wei et al., 8 Jan 2026).
6. Significance and Contributions
ReFInE establishes the first expert-validated, large-scale dataset for generation-time, fine-grained provenance in LLMs. Its explicit separation of Quotation, Compression, and Inference relational types enables analysis of model interpretability and verifiability at a level not captured by coarse citation schemes. GenProve, its associated modeling framework, demonstrates that LLMs can make substantial advances in provenance fidelity, most notably on inference-based evidence alignment. Nevertheless, the persistent reasoning gap highlights that verifiable logical inference in generation remains an open research frontier (Wei et al., 8 Jan 2026).