Agentic Discriminative Verifier (DiVA)

Updated 14 January 2026
  • DiVA is a hybrid verification framework that combines LLM agentic search with continuous scoring to achieve fine-grained factuality assessments.
  • It actively retrieves evidence using static and real-time tools, compressing context to support precise discriminative scoring.
  • DiVA’s design integrates agentic search with pairwise ranking training, significantly enhancing performance on single-hop and multi-hop QA tasks.

The Agentic Discriminative Verifier (DiVA) is a hybrid factuality verification framework that integrates the agentic search capability of LLM agents with the continuous scoring power of discriminative models. DiVA is specifically designed to address a deficit in existing verification paradigms, which predominantly yield only binary (correct/incorrect) judgments and lack granularity with respect to factual error severity. The framework is validated on the FGVeriBench benchmark, which emphasizes fine-grained factuality across single-hop and multi-hop question answering tasks. Empirical results demonstrate DiVA’s superiority over prevailing generative and discriminative factuality verification models in both precision and ranking-correlation metrics (Huang et al., 7 Jan 2026).

1. Motivation and Core Contributions

Traditional LLM outputs are susceptible to hallucinations, generating text that is superficially coherent yet factually incorrect. The dominant approach in factuality verification enforces a binary decision criterion, an abstraction too coarse for applications such as nuanced evaluation of summaries or preference optimization via RLHF. DiVA introduces a paradigm shift by:

  • Employing agentic search, whereby an LLM agent actively retrieves and reasons with evidence using both static (Wikipedia) and real-time (Google search) tools.
  • Combining retrieval and reasoning traces into compressed representations suitable for downstream scoring.
  • Utilizing a discriminatively trained model to output continuous-valued factuality scores rather than discrete or categorical labels.

The integration of these components enables fine-grained discrimination of error severity and positions DiVA as a robust foundation for both evaluation and preference-optimization workflows.

2. Technical Architecture

DiVA’s workflow consists of three sequential modules: agentic search, context compression, and score prediction.

2.1. Agentic Search

An LLM agent (e.g., Qwen-2.5-7B-Instruct) orchestrates a multi-step search loop with two dedicated retrieval tools:

  • search_local: retrieves evidence from a Wikipedia snapshot.
  • search_web: executes real-time queries via Google’s API.

At each iteration, the agent examines the question-answer context, identifies gaps in supporting evidence, formulates targeted retrieval queries, chooses the search modality (local or web), integrates returned information, and repeats the cycle until sufficient support for verification is amassed. This process results in an explicit trajectory encoding the sequence of queries, observations, and reasoning steps.
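
The loop described above can be sketched as follows. The tool names (search_local, search_web) come from the paper; the stop criterion, query formulation, and stub retrieval bodies are illustrative placeholders, not the actual implementation.

```python
# Schematic sketch of DiVA's agentic search loop. Tool names follow the
# paper; everything else (query generation, tool alternation, stopping
# rule) is a simplifying assumption for illustration.

def search_local(query: str) -> str:
    """Stub: retrieve evidence from a Wikipedia snapshot."""
    return f"[wiki passage for: {query}]"

def search_web(query: str) -> str:
    """Stub: real-time retrieval via a web search API."""
    return f"[web result for: {query}]"

def agentic_search(question: str, answer: str, max_steps: int = 4) -> list[dict]:
    """Iteratively gather evidence, recording an explicit trajectory of
    queries, tool choices, and observations."""
    trajectory = []
    for step in range(max_steps):
        # In DiVA, an LLM agent inspects the context, spots evidence gaps,
        # and emits the next query plus tool choice; here we simply alternate.
        query = f"{question} (evidence gap {step})"
        tool = search_local if step % 2 == 0 else search_web
        observation = tool(query)
        trajectory.append({"query": query, "tool": tool.__name__,
                           "observation": observation})
        if "SUFFICIENT" in observation:  # placeholder stop criterion
            break
    return trajectory

traj = agentic_search("Who wrote 'Dubliners'?", "James Joyce")
```

The key property is that the returned trajectory is an explicit, inspectable record of the search, which the next stage compresses.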

2.2. Context Compression

The raw agentic trajectory, typically verbose and redundant, is passed through a secondary LLM compression process. The outcome is a two-part distilled artifact for each candidate answer: (1) a bullet list comprising precise supporting facts extracted from retrieved evidence, and (2) a concise, structured reasoning chain elucidating how the evidence relates to the answer’s veracity. This compression ensures that only high-signal, contextually relevant information populates the discriminative verifier’s context window.
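
A minimal container for the two-part distilled artifact might look like the sketch below; the field names and rendering format are assumptions, not the paper's schema.

```python
from dataclasses import dataclass, field

# Illustrative container for the two-part compressed artifact described
# above: (1) supporting facts, (2) a concise reasoning chain. Field
# names and the render() format are assumptions for illustration.

@dataclass
class CompressedContext:
    supporting_facts: list[str] = field(default_factory=list)
    reasoning_chain: str = ""

    def render(self) -> str:
        """Format the artifact for the verifier's context window."""
        bullets = "\n".join(f"- {fact}" for fact in self.supporting_facts)
        return f"Facts:\n{bullets}\nReasoning:\n{self.reasoning_chain}"

ctx = CompressedContext(
    supporting_facts=["'Dubliners' was published in 1914.",
                      "Its author is James Joyce."],
    reasoning_chain="Both retrieved passages attribute the work to Joyce.",
)
```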

2.3. Score Prediction (Discriminative Verifier)

The discriminative verification module augments a base generative LLM with a compact regression head, which is randomly initialized and fine-tuned solely via Low-Rank Adaptation (LoRA). Given a question x, a candidate answer y, and its compressed context t, the verifier computes a real-valued factuality score f(x, y, t) ∈ ℝ. Continuous scoring circumvents the quantization artifacts of token-based generative verifiers, providing nuanced gradations of factuality.
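
Structurally, the head is a single linear map from a pooled hidden state to one scalar. The sketch below shows that shape; the toy encoder is a deterministic stand-in for the LoRA-adapted LLM backbone, which is not reproduced here.

```python
import math

# Schematic of the discriminative scoring head: a linear layer over a
# pooled hidden state yields one real-valued factuality score.
# toy_encode() is a placeholder for the LLM backbone's pooled features.

def toy_encode(text: str, dim: int = 8) -> list[float]:
    """Placeholder encoder: deterministic pseudo-features from the text."""
    return [math.sin(len(text) * (i + 1)) for i in range(dim)]

def score(question: str, answer: str, context: str,
          w: list[float], b: float = 0.0) -> float:
    """f(x, y, t): regression head (w·h + b) over pooled features h."""
    h = toy_encode(question + answer + context)
    return sum(wi * hi for wi, hi in zip(w, h)) + b

weights = [0.1] * 8  # in DiVA these are learned; fixed here for illustration
s = score("Who wrote 'Dubliners'?", "James Joyce", "Joyce wrote it.", weights)
```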

2.4. Pairwise Ranking-Based Training

Ground-truth absolute scores are impractical to curate at scale; instead, DiVA employs a pairwise ranking loss. Training data consist of triplets (x, y₊, t₊; y₋, t₋), with y₊ adjudged more factual than y₋ by an LLM judge and human verification. The objective is minimizing a margin-based loss:

L = max[0, m − (f(x, y₊, t₊) − f(x, y₋, t₋))]

where m is a predefined margin (e.g., 0.1). Candidate answers are sampled and labeled as Correct, Intermediate, or Incorrect, and all requisite agentic search and compression steps are performed for each.

LoRA ensures that only a fraction of model parameters is fine-tuned, resulting in storage and compute efficiency.
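
The margin loss above is a one-liner; it vanishes once the preferred answer outscores the other by at least m, so gradient pressure applies only to pairs the verifier has not yet separated.

```python
# Minimal sketch of the margin-based pairwise ranking loss defined above.
def ranking_loss(f_pos: float, f_neg: float, margin: float = 0.1) -> float:
    """L = max(0, m - (f(x, y+, t+) - f(x, y-, t-)))."""
    return max(0.0, margin - (f_pos - f_neg))
```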

3. FGVeriBench Benchmark

FGVeriBench is constructed to facilitate rigorous, fine-grained evaluation of verifiers on both single-hop and multi-hop settings using the following datasets: Natural Questions (NQ), TriviaQA, PopQA (single-hop); HotpotQA, MuSiQue, 2Wiki (multi-hop). Each dataset is characterized by moderate answer lengths (approximately 14–21 tokens). For every question, three candidate answers (representing correct, intermediate, and incorrect factuality) are curated and ranked using a two-stage process involving an LLM-as-judge and human verification. Inter-annotator agreement on relative rankings is 64.8%, highlighting intrinsic annotation challenges.

Evaluation is measured via:

  • Precision@1: the frequency with which the model selects the top-ranked, most factual answer.
  • Kendall’s τ: rank correlation between model ordering and human relative rankings.

Dataset    # Questions   Avg. Answer Length
NQ         203           ≈14–21 tokens
TriviaQA   372           ≈14–21 tokens
PopQA      365           ≈14–21 tokens
HotpotQA   287           ≈14–21 tokens
MuSiQue    154           ≈14–21 tokens
2Wiki      298           ≈14–21 tokens
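
Both metrics above admit short reference implementations; the sketch below assumes higher scores mean "more factual" and that ranks are compared pairwise (Kendall's τ over all answer pairs).

```python
from itertools import combinations

# Minimal implementations of the two evaluation metrics described above.

def precision_at_1(scores: list[float], gold_best: int) -> float:
    """1.0 if the model's top-scored answer is the gold most-factual one."""
    return float(max(range(len(scores)), key=scores.__getitem__) == gold_best)

def kendall_tau(model_ranks: list[int], gold_ranks: list[int]) -> float:
    """(concordant - discordant) / total pairs, over all item pairs."""
    n = len(model_ranks)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        m = model_ranks[i] - model_ranks[j]
        g = gold_ranks[i] - gold_ranks[j]
        if m * g > 0:
            concordant += 1
        elif m * g < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```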

4. Empirical Results and Comparative Evaluation

DiVA demonstrates leading performance across diverse tasks and baseline configurations:

  • On general QA and multi-hop QA datasets, DiVA (Qwen-2.5-7B-Instruct) records an average Precision@1 of approximately 88.4% and Kendall’s τ of 85.6%. By comparison, AG-Verifier achieves 79.9% and 74.1%, and FactScore 72.6% and 66.3%, on these metrics respectively.
  • Ablation studies indicate that omitting context compression degrades Kendall’s τ by 7–12 points, particularly for multi-hop queries. Excluding agentic search yields further drops (Kendall’s τ ≈ 55–60%).
  • In binary correctness adaptation (reshaping the dataset accordingly), DiVA attains accuracy ≈ 88–89% and F₁ ≈ 88% on TriviaQA and HotpotQA, outperforming generative-only approaches by 10–20 percentage points.
  • On FactScore’s long-form set, DiVA achieves Precision@1 ≈ 65% and Kendall’s τ ≈ 48%, compared to GPT-4’s 60% and 45%, respectively; in best-of-N selection for Meta-Llama-3-8B and Llama-3.1-8B, DiVA delivers up to +8 token-level F₁ improvement over FactScore on NQ_Test, MuSiQue, and Bamboogle.
  • Knowledge-source ablation shows verification accuracy peaks (Kendall’s τ ≈ 80.1%) when both WebSearch (79.5%) and LocalSearch (74.3%) are available. Samples with highly relevant retrieval evidence yield Kendall’s τ exceeding 90%.
  • Scaling studies reveal DiVA remains robust and effective even with small discriminative models, outperforming larger agentic generators up to 14B parameters.

5. Analysis and Qualitative Findings

Analysis demonstrates that generative verifiers relying solely on parametric memory often hallucinate, especially when tasked with multi-hop reasoning or fact disambiguation (e.g., confusing mythological names, misattributing geographic details). DiVA’s explicit decomposition of questions into sequential retrieval and reasoning steps—grounded in external evidence—systematically mitigates these issues. The continuous factuality score predicted by DiVA’s discriminative verifier enhances downstream selection tasks (e.g., best-of-N reranking) and improves reliability in binary decisions.
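
Best-of-N reranking with a continuous score reduces to an argmax over candidates; the sketch below uses placeholder scores standing in for the verifier's outputs.

```python
# Sketch of best-of-N selection: sample N candidate answers, score each
# with the discriminative verifier, and keep the highest-scoring one.
# The scores here are placeholders, not actual DiVA outputs.

def best_of_n(candidates: list[str], scores: list[float]) -> str:
    """Return the candidate with the highest factuality score."""
    best_idx = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_idx]

picked = best_of_n(["Answer A", "Answer B", "Answer C"], [0.41, 0.87, 0.12])
```

Because the scores are continuous rather than binary, ties are rare and the ranking among near-miss candidates is informative rather than arbitrary.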

6. Limitations and Future Work

Several limitations and prospective research directions are identified:

  • Agentic search components within DiVA are not specifically fine-tuned for factual reasoning; future research may employ reinforcement learning to optimize search policies for factuality objectives.
  • At present, generative and discriminative modules are decoupled, potentially incurring inference inefficiency. A unified, fully end-to-end architecture could streamline performance.
  • While pairwise ranking training yields strong results, deeper integration of DiVA scores into LLM preference-optimization frameworks, such as RLHF, may further elevate base-model factuality.

A plausible implication is that future extensions encompassing end-to-end optimization and search policy learning could establish new standards for fine-grained and preference-aware factuality verification in LLM pipelines (Huang et al., 7 Jan 2026).
