ReFInE: Fine-Grained Interpretability & Evidence
- The paper introduces a framework that requires models to emit structured provenance triples (Quotation, Compression, Inference) alongside fluent answers.
- It employs a rigorous human-in-the-loop annotation process to ensure precise evidence extraction and accurate relation mapping.
- Empirical results reveal significant gains in provenance fidelity and inference accuracy compared to traditional LLM citation methods.
ReFInE (Relation-aware Fine-grained Interpretability & Evidence) is a supervised corpus and evaluation framework for generation-time, sentence-level provenance in multi-document question answering. It addresses limitations in traditional LLM-generated citations by requiring models to emit not only fluent answers but also structured, sentence-wise provenance triples, capturing nuanced relationships between generated content and underlying evidence. ReFInE formalizes provenance as a collection of (doc_id, sent_id, relation) tags per answer sentence, distinguishing Quotation, Compression, and Inference to enable precise attribution and interpretability in model outputs (Wei et al., 8 Jan 2026).
1. Dataset Specification and Provenance Schema
ReFInE is constructed for generation-time provenance, pairing every answer sentence $a_i$ in a reference answer with a provenance triple set

$P_i = \{(\text{doc\_id}_k, \text{sent\_id}_k, \text{rel}_k)\}_{k=1}^{K_i},$

with $\text{rel}_k$ drawn from {Quotation, Compression, Inference}. After each factual answer sentence, a [PROVE:] tag lists its supporting evidence, capturing both the source and the semantic relationship. The three relation types are distinguished as follows:
- Quotation: copies or closely paraphrases a single source sentence, with at least 80% lexical overlap. For example, "The dam released water because of heavy rainfall. [PROVE:(0,3,'Quotation')]"
- Compression: summarizes content spanning one or more contiguous source sentences, retaining all key facts but with condensed wording. Example: "The bird migrates annually in spring and returns in autumn. [PROVE:(1,2,'Compression'), (1,3,'Compression')]"
- Inference: synthesizes information not explicit in the sources but logically deducible (possibly multi-hop), e.g., "Annual sales quadrupled over the year. [PROVE:(2,5,'Inference')]"
Each answer sentence can be grounded in multiple source sentences and relations, allowing fine-grained mapping from output claims to evidence (Wei et al., 8 Jan 2026).
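The [PROVE:] tag format described above can be parsed mechanically. The following is a minimal sketch; the regex patterns and function names are our own, not from the paper:

```python
import re

# Matches a whole [PROVE:...] tag, then the (doc_id, sent_id, 'Relation') triples inside it.
TAG_RE = re.compile(r"\[PROVE:(.*?)\]")
TRIPLE_RE = re.compile(r"\((\d+)\s*,\s*(\d+)\s*,\s*'(Quotation|Compression|Inference)'\)")

def parse_provenance(answer: str):
    """Split an answer into (sentence_text, [(doc_id, sent_id, relation), ...]) pairs."""
    results = []
    pos = 0
    for tag in TAG_RE.finditer(answer):
        sentence = answer[pos:tag.start()].strip()  # text preceding this tag
        triples = [(int(d), int(s), r) for d, s, r in TRIPLE_RE.findall(tag.group(1))]
        results.append((sentence, triples))
        pos = tag.end()
    return results

example = ("The bird migrates annually in spring and returns in autumn. "
           "[PROVE:(1,2,'Compression'), (1,3,'Compression')]")
print(parse_provenance(example))
```

A parser like this is also what a format-adherence score presupposes: answers whose tags fail to match the grammar can be detected and penalized.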
2. Annotation Pipeline and Corpus Statistics
ReFInE's annotation is realized via a three-stage human-in-the-loop process:
- Preprocessing: Answers and retrieved source documents are segmented into sentences, each assigned a unique (DocID, SentID).
- LLM-Assisted Annotation and Filtering: GPT-4o proposes initial provenance triples; three annotators filter outputs for instruction compliance, fluency, and strict format adherence.
- Expert Validation and Reconstruction: Annotated answer sentences with [PROVE] tags are aggregated, and three specialists vet both sufficiency of cited evidence and relation-type labeling. Incorrect or invalid samples are revised or eliminated.
Corpus-level key statistics:
| Split | Instances | Proportion (%) |
|---|---|---|
| SFT (warm-up) | 12,540 | 55.4 |
| GRPO (RL align) | 5,256 | 23.2 |
| EVAL (test) | 4,838 | 21.4 |
Relation-type distribution across triples:
| Relation | Proportion (%) |
|---|---|
| Quotation | ≃ 70 |
| Compression | ≃ 20 |
| Inference | ≃ 10 |
Provenance density metrics:
- Average [PROVE] tags per answer: 3.96 (min 1, max 14)
- Average triples per answer: 7.64 (min 1, max 46)
- Average triples per tag: 1.98 (min 1, max 18)
This schema provides both high coverage and granularity, enabling robust provenance evaluation (Wei et al., 8 Jan 2026).
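The density statistics above can be recomputed directly from a parsed corpus. A minimal sketch over toy data (the answers below are illustrative, not drawn from ReFInE):

```python
# Each answer is a list of [PROVE] tags; each tag is a list of
# (doc_id, sent_id, relation) triples. Toy data, not ReFInE samples.
answers = [
    [[(0, 3, "Quotation")], [(1, 2, "Compression"), (1, 3, "Compression")]],
    [[(2, 5, "Inference")]],
]

tags_per_answer = [len(a) for a in answers]
triples_per_answer = [sum(len(tag) for tag in a) for a in answers]
triples_per_tag = [len(tag) for a in answers for tag in a]

print(sum(tags_per_answer) / len(answers))          # average [PROVE] tags per answer
print(sum(triples_per_answer) / len(answers))       # average triples per answer
print(sum(triples_per_tag) / len(triples_per_tag))  # average triples per tag
```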
3. GenProve Model and Optimization Framework
The GenProve framework operates in two main phases:
3.1 Supervised Fine-Tuning (SFT)
An LLM is fine-tuned on ReFInE to maximize the conditional log-likelihood

$\mathcal{L}_{\text{SFT}} = \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid y_{<t}, x),$

where $y$ is a reference answer interleaved with [PROVE] tags and $x$ is the input question with its retrieved documents. This drives the model to emit both fluent answers and well-formed structured provenance.
3.2 Group Relative Policy Optimization (GRPO)
Starting from the SFT policy $\pi_\theta$, groups of candidate answers are sampled per prompt and $\pi_\theta$ is updated via policy gradients to maximize a composite reward:
The reward combines:
- Content fidelity $R_{\text{content}}$: matches generated sentences to reference sentences by embedding cosine similarity (gated at a threshold $\tau$); pairs above the threshold are rewarded with their ROUGE-L score, and the result is averaged across all sentences.
- Provenance correctness $R_{\text{prov}}$: matches sentences via the same similarity gate (threshold $\tau$), computes F1 over the set intersections of supporting provenance triples, then averages the gated F1 scores.
Formally,

$R = \alpha \, R_{\text{content}} + \beta \, R_{\text{prov}},$

with the weights $\alpha$ and $\beta$ set empirically.
This dual-objective optimization aligns both textual fidelity and evidence-grounding under fine-grained provenance (Wei et al., 8 Jan 2026).
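The gated provenance reward can be sketched as follows. We substitute Jaccard token overlap for the paper's embedding cosine similarity, and the function names and the threshold value (0.5) are our own assumptions, not values reported in the paper:

```python
def similarity(a: str, b: str) -> float:
    """Jaccard token overlap; a stand-in for embedding cosine similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def triple_f1(pred: set, gold: set) -> float:
    """F1 over the intersection of two provenance triple sets."""
    tp = len(pred & gold)
    if not pred or not gold or tp == 0:
        return 0.0
    prec, rec = tp / len(pred), tp / len(gold)
    return 2 * prec * rec / (prec + rec)

def provenance_reward(pred_sents, gold_sents, tau: float = 0.5) -> float:
    """pred_sents/gold_sents: lists of (sentence_text, triple_set) pairs."""
    scores = []
    for p_text, p_triples in pred_sents:
        # Match each generated sentence to its most similar reference sentence.
        best = max(gold_sents, key=lambda g: similarity(p_text, g[0]), default=None)
        if best and similarity(p_text, best[0]) >= tau:
            scores.append(triple_f1(p_triples, best[1]))
        else:
            scores.append(0.0)  # gate: unmatched sentences earn no provenance reward
    return sum(scores) / len(scores) if scores else 0.0
```

The content-fidelity term would follow the same gating pattern, scoring matched pairs with ROUGE-L instead of triple F1, and the two terms would then be combined with the weights $\alpha$ and $\beta$.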
4. Empirical Results and Comparative Evaluation
GenProve, built on Qwen3-8B, is evaluated against 14 contemporary LLMs using ReFInE EVAL. Representative metrics include ROUGE-L, BLEU, METEOR, MoverScore, provenance precision/recall/F1, and format/judge scores.
| Model | ROUGE-L | BLEU | METEOR | MoverScore | Prec. | Rec. | F1 | Format | Judge |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-14B | 52.85 | 35.70 | 55.34 | 47.06 | 45.80 | 40.33 | 41.16 | 99.70 | 2.59 |
| GLM-4.5-355B | 49.81 | 35.05 | 57.69 | 44.92 | 48.84 | 44.03 | 44.55 | 98.63 | 2.63 |
| Gemini 2.5 Pro | 48.75 | 31.77 | 53.09 | 44.79 | 46.68 | 42.86 | 42.92 | 100.0 | 2.57 |
| GenProve | 57.25 | 42.22 | 59.39 | 51.04 | 54.96 | 51.26 | 51.21 | 99.85 | 3.14 |
GenProve achieves leading results across all measures, notably outperforming the next best in answer quality (+4.4 ROUGE-L), provenance F1 (+6.7), and LLM-judge (+0.51).
F1 for relation types:
| Model | Quotation F1 | Compression F1 | Inference F1 |
|---|---|---|---|
| Qwen3-14B | 69.2 | 43.1 | 18.4 |
| GLM-4.5 | 72.8 | 51.0 | 24.6 |
| GenProve | 84.5 | 61.3 | 41.7 |
GenProve's largest gains are in the Compression (+10–12 pts) and Inference (+17–23 pts) relations, indicating improved grounding beyond verbatim citation (Wei et al., 8 Jan 2026).
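The per-relation breakdown above amounts to restricting predicted and gold triple sets to a single relation type before scoring. A minimal sketch with toy triples (not ReFInE data; function names are our own):

```python
def f1(pred: set, gold: set) -> float:
    """F1 over two sets of (doc_id, sent_id, relation) triples."""
    tp = len(pred & gold)
    if not pred or not gold or tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def per_relation_f1(pred: set, gold: set, relation: str) -> float:
    # Keep only triples whose relation tag matches, then score as usual.
    keep = lambda triples: {t for t in triples if t[2] == relation}
    return f1(keep(pred), keep(gold))

pred = {(0, 3, "Quotation"), (1, 2, "Compression"), (2, 5, "Inference")}
gold = {(0, 3, "Quotation"), (1, 2, "Compression"), (1, 3, "Compression")}
print(per_relation_f1(pred, gold, "Quotation"))    # 1.0
print(per_relation_f1(pred, gold, "Compression"))  # precision 1.0, recall 0.5 -> F1 2/3
print(per_relation_f1(pred, gold, "Inference"))    # gold has no Inference triples -> 0.0
```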
5. Observations, Limitations, and Future Work
A clear ordering emerges in model difficulty: Quotation > Compression > Inference (e.g., GenProve: 84.5 > 61.3 > 41.7 F1), reflecting that surface-level attribution is tractable but verifiable inference remains challenging. GRPO's dual-reward design shows that improvements in answer quality and provenance accuracy are correlated, not antagonistic.
Observed failure modes:
- Unsynchronized provenance tags
- Incomplete [PROVE] coverage
- Incorrect localization (wrong doc or sent indices)
- Relation-type mislabeling
Known limitations include increased output length from structured triples, monolingual focus requiring taxonomy extension for multilingual settings, and inherent dependency on retrieval quality—absent evidence in retrieved documents precludes correct provenance.
Anticipated future directions comprise adding retrieval optimization to the pipeline, broadening provenance taxonomies (e.g., to include argumentative or discourse-aware relations), and applying direct supervision for multi-step inference (Wei et al., 8 Jan 2026).
6. Significance and Contributions
ReFInE establishes the first expert-validated, large-scale dataset for generation-time, fine-grained provenance in LLMs. Its explicit separation of Quotation, Compression, and Inference relational types enables analysis of model interpretability and verifiability at a level not captured by coarse citation schemes. GenProve, its associated modeling framework, demonstrates that LLMs can make substantial advances in provenance fidelity, most notably on inference-based evidence alignment. Nevertheless, the persistent reasoning gap highlights that verifiable logical inference in generation remains an open research frontier (Wei et al., 8 Jan 2026).