Shrinking the Generation-Verification Gap with Weak Verifiers

Published 22 Jun 2025 in cs.CL | (2506.18203v1)

Abstract: Verifiers can improve LLM capabilities by scoring and ranking responses from generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers (verifiers with perfect accuracy). To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. We find weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in verifier accuracies. To reduce dependency on labeled data, Weaver leverages weak supervision to estimate each verifier's accuracy and combines outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these using dataset statistics to normalize outputs and filter specific verifiers. We study Weaver's effectiveness in test-time repeated sampling, where a model generates multiple candidate responses and selects one. Our evaluations show Weaver significantly improves over Pass@1 (performance when selecting the first candidate) across reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct as generator, and an ensemble of 70B or smaller judge and reward models as verifiers (87.7% average). This gain mirrors the jump between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training. To reduce computational costs of verifier ensembles, we train a 400M cross-encoder using Weaver's combined output scores.


Summary

The paper introduces Weaver, a framework that aggregates multiple weak verifiers to shrink the generation-verification gap: the difference between how often a correct response is generated and how often it is actually selected.
It employs adaptive weighting, binarization, and weak supervision to estimate verifier accuracies and improve candidate selection.
Empirical results show Weaver comes within 4.2% of the Pass@100 oracle, outperforms majority voting by 15.5% on average, and can be distilled into a 400M-parameter cross-encoder for efficient deployment.

Shrinking the Generation-Verification Gap with Weak Verifiers: The Weaver Framework

Introduction and Motivation

Verification is a central challenge in the deployment of LLMs, particularly for tasks requiring high precision such as mathematics, code generation, and scientific reasoning. While repeated sampling from LLMs can increase the likelihood of generating a correct response, the ability to reliably select the correct response from a pool of candidates is fundamentally limited by the quality of the verifier. Perfect (oracle) verifiers are either unscalable (e.g., human evaluation) or only available in narrow domains (e.g., formal proof checkers like Lean). In practice, most available verifiers—reward models (RMs) and LMs prompted as judges—are weak: they are noisy, poorly calibrated, and often exhibit high false positive rates.

This work introduces Weaver, a framework for aggregating multiple weak verifiers to close the generation-verification gap without requiring large amounts of labeled data. Weaver leverages weak supervision (WS) techniques to estimate verifier accuracies and adaptively combine their outputs, outperforming naive ensembling and majority voting. The framework is further extended via distillation, enabling efficient deployment with minimal compute overhead.

Problem Formulation

Given a query $q$ and a set of $K$ candidate responses $\{r_j\}_{j=1}^K$ generated by an LLM, the goal is to select a correct response $r_{j^*}$ such that $y(q, r_{j^*}) = 1$, where $y$ is the (unknown) correctness label. Each verifier $v_k$ provides a score $s_{jk} = v_k(q, r_j)$ for each candidate. The challenge is to aggregate these scores into a selection rule $f(q, r_j)$ that maximizes the probability of selecting a correct response, ideally approaching the upper bound set by $Pass@K$ (the probability that at least one correct response exists among $K$ samples).

The generation-verification gap is defined as $Pass@K - \text{Success Rate}$, where the success rate is the empirical accuracy of the selection rule. The objective is to minimize this gap.
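The definitions above can be made concrete with a small sketch. It assumes a matrix of aggregated scores $f(q, r_j)$ and a matrix of ground-truth labels $y(q, r_j)$, both of shape (queries, $K$); the function name and the toy numbers are illustrative and do not come from the paper.

```python
import numpy as np

def generation_verification_gap(scores, labels):
    """Return (Pass@K, success rate, gap) for arrays of shape (num_queries, K)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Pass@K: fraction of queries with at least one correct candidate.
    pass_at_k = labels.any(axis=1).mean()
    # Selection rule: pick the highest-scoring candidate for each query.
    chosen = scores.argmax(axis=1)
    success_rate = labels[np.arange(len(labels)), chosen].mean()
    return pass_at_k, success_rate, pass_at_k - success_rate

# Toy data (hypothetical): 4 queries, K = 3 candidates each.
labels = np.array([[0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 0],
                   [0, 0, 1]])
scores = np.array([[0.2, 0.9, 0.1],   # correct candidate selected
                   [0.3, 0.8, 0.1],   # a correct candidate exists but is not selected
                   [0.5, 0.4, 0.6],   # no correct candidate to select
                   [0.1, 0.2, 0.7]])  # correct candidate selected
pass_at_k, success_rate, gap = generation_verification_gap(scores, labels)
print(pass_at_k, success_rate, gap)
```

Only the second query contributes to the gap here: the generator produced a correct response, but the scores ranked a wrong one higher.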

Weighted vs. Naive Verifier Ensembling

Naive ensembling—averaging verifier scores—implicitly assumes uniform verifier quality. However, empirical analysis reveals substantial heterogeneity in verifier accuracies, with ranges up to 37.5% across tasks. Weighted ensembles, where each verifier is assigned a learned weight, can exploit this heterogeneity for improved selection accuracy. Supervised approaches (e.g., logistic regression, Naive Bayes) require substantial labeled data, which is often unavailable.

Figure 1: Weighted verifier ensembles, using oracle data or learned aggregation weights, outperform naive combinations by 3.6% and 7.8% on average, respectively.
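To illustrate why weighting matters, the sketch below contrasts naive averaging with a supervised, accuracy-weighted vote. It assumes binary verifier votes and uses log-odds weights derived from accuracies measured on a small labeled set, which is one simple Naive-Bayes-style choice and not necessarily the weighting used in the paper. All data and function names are hypothetical.

```python
import numpy as np

def fit_log_odds_weights(dev_votes, dev_labels, smoothing=1.0):
    """Estimate per-verifier accuracy on labeled data and return log-odds weights."""
    dev_votes = np.asarray(dev_votes)
    dev_labels = np.asarray(dev_labels)
    agree = (dev_votes == dev_labels[:, None]).sum(axis=0)
    # Laplace smoothing keeps accuracies away from exactly 0 or 1.
    acc = (agree + smoothing) / (len(dev_labels) + 2 * smoothing)
    return np.log(acc / (1 - acc))

def naive_scores(votes):
    """Unweighted ensemble: mean vote per candidate."""
    return np.asarray(votes, dtype=float).mean(axis=1)

def weighted_scores(votes, weights):
    """Weighted ensemble: votes mapped to {-1, +1} and combined with the weights."""
    signed = 2 * np.asarray(votes, dtype=float) - 1
    return signed @ weights

# Hypothetical labeled set: verifier 0 is right 9/10 times, verifiers 1 and 2 only 6/10.
dev_labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
dev_votes = np.column_stack([
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 1],
    [0, 0, 1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 1, 1, 0],
])
weights = fit_log_odds_weights(dev_votes, dev_labels)

# Two candidates: the strong verifier backs candidate 0, the two weak ones back candidate 1.
candidate_votes = np.array([[1, 0, 0],
                            [0, 1, 1]])
naive_pick = int(naive_scores(candidate_votes).argmax())
weighted_pick = int(weighted_scores(candidate_votes, weights).argmax())
print(naive_pick, weighted_pick)
```

The naive average follows the two weak verifiers, while the weighted vote follows the single strong one. Note that this sketch needs labels to fit the weights, which is the dependency Weaver aims to reduce.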

The Weaver Framework: Weak Supervision for Verifier Aggregation

Weaver adapts weak supervision to the verification setting, addressing unique challenges:

Inconsistent Output Formats: Verifiers emit scores in diverse formats (continuous, binary, Likert).
Low-Quality/Adversarial Verifiers: Some verifiers perform near chance or worse, necessitating filtering.
Limited Labeled Data: Only a small development set is available for estimating global statistics.

Algorithmic Details

Binarization and Normalization: Verifier scores are normalized and binarized using thresholds estimated from a small labeled development set, ensuring comparability and robustness to calibration differences.
Filtering: Verifiers with extreme marginals or low accuracy are pruned to ensure identifiability and stability in the WS model.
Latent Variable Model: Weaver models the joint distribution of binary verifier outputs and the latent correctness label, assuming conditional independence of verifiers given the label. The posterior probability of correctness is computed as:

$\Pr(Y = 1 | S_1, \dots, S_m) = \frac{\prod_{k=1}^m \Pr(S_k | Y=1) \Pr(Y=1)}{\Pr(S_1, \dots, S_m)}$
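The three steps above can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: the thresholds, the marginal cut-offs used for filtering, the class-conditional rates `tpr` and `tnr`, and the prior are all hypothetical values supplied by hand, whereas Weaver estimates them from data.

```python
import numpy as np

def binarize(scores, thresholds):
    """Map raw verifier scores to {0, 1} votes with one threshold per verifier."""
    return (np.asarray(scores, dtype=float) >= np.asarray(thresholds)).astype(int)

def keep_mask(votes, low=0.05, high=0.95):
    """Keep verifiers whose positive rate is not extreme (cut-offs are assumptions)."""
    rate = votes.mean(axis=0)
    return (rate > low) & (rate < high)

def posterior_correct(votes, tpr, tnr, prior):
    """Pr(Y=1 | S_1..S_m), assuming verifiers are conditionally independent given Y.

    tpr[k] = Pr(S_k=1 | Y=1), tnr[k] = Pr(S_k=0 | Y=0).
    """
    votes, tpr, tnr = np.asarray(votes), np.asarray(tpr), np.asarray(tnr)
    like_pos = np.where(votes == 1, tpr, 1 - tpr).prod(axis=1) * prior
    like_neg = np.where(votes == 1, 1 - tnr, tnr).prod(axis=1) * (1 - prior)
    return like_pos / (like_pos + like_neg)

# Hypothetical raw scores: two verifiers on a [0, 1] scale, one on a 1-5 Likert scale.
raw = np.array([[0.9, 0.2, 5],
                [0.1, 0.7, 4],
                [0.8, 0.9, 5],
                [0.3, 0.1, 4]])
votes = binarize(raw, thresholds=[0.5, 0.5, 3.0])
mask = keep_mask(votes)            # the third verifier always votes 1, so it is dropped
post = posterior_correct(votes[:, mask], tpr=[0.9, 0.6], tnr=[0.8, 0.6], prior=0.5)
best = int(post.argmax())
print(mask, post.round(3), best)
```

For a large verifier pool, summing log-likelihoods would be numerically safer than multiplying probabilities; the product form is kept here to mirror the equation.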

Verifier accuracies $\Pr(S_k = 1 | Y=1)$ are estimated via moment matching on observed pairwise statistics, following the Snorkel paradigm.
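One well-known moment-matching estimator from the weak supervision literature is the triplet method, sketched below. It is shown under a simplifying assumption that each verifier is equally accurate on correct and incorrect responses and better than chance; the paper's estimator may differ. With votes and labels in $\{-1, +1\}$ and conditional independence, $E[S_i S_j] = E[S_i Y]\,E[S_j Y]$, so three pairwise moments suffice to solve for each $E[S_k Y]$ without labels. The simulated data is purely illustrative.

```python
import numpy as np

def triplet_accuracies(signed_votes):
    """Estimate Pr(S_k agrees with Y) for three verifiers from votes in {-1, +1}.

    Uses only pairwise moments E[S_i S_j]; assumes conditional independence,
    symmetric accuracies, and better-than-chance verifiers.
    """
    s = np.asarray(signed_votes, dtype=float)
    m = (s.T @ s) / len(s)                       # empirical E[S_i S_j]
    a0 = np.sqrt(abs(m[0, 1] * m[0, 2] / m[1, 2]))
    a1 = np.sqrt(abs(m[0, 1] * m[1, 2] / m[0, 2]))
    a2 = np.sqrt(abs(m[0, 2] * m[1, 2] / m[0, 1]))
    # a_k estimates E[S_k Y] = 2 * accuracy - 1.
    return (1 + np.array([a0, a1, a2])) / 2

# Simulate hidden labels and three noisy verifiers (hypothetical accuracies).
rng = np.random.default_rng(0)
n = 200_000
true_acc = np.array([0.80, 0.70, 0.65])
y = np.where(rng.random(n) < 0.3, 1, -1)         # labels are never passed to the estimator
agree = rng.random((n, 3)) < true_acc
votes = np.where(agree, y[:, None], -y[:, None])

est_acc = triplet_accuracies(votes)
print(est_acc.round(3))
```

The estimates come from verifier agreement alone, which is why independence between verifiers matters: correlated errors would inflate the pairwise moments and overstate accuracy.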

Empirical Results

Closing the Generation-Verification Gap

Weaver is evaluated on MATH500, GPQA Diamond, MMLU College, and MMLU Pro, using Llama 3.3 70B Instruct as the generator and a pool of 33 open-source RMs and LM judges as verifiers. Weaver achieves an average accuracy of 87.7%, outperforming majority voting by 15.5% and coming within 4.2% of the Pass@100 oracle. Notably, Weaver matches or exceeds the performance of much larger or more heavily tuned models (e.g., OpenAI o3-mini) using only repeated sampling and weak verifier aggregation.
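For reference, the majority-voting baseline uses no verifier at all: it returns the most frequent final answer among the sampled responses. A minimal sketch, assuming final answers have already been extracted and normalized to comparable strings (the answers shown are made up):

```python
from collections import Counter

def majority_vote(final_answers):
    """Return the most frequent final answer among the sampled responses."""
    counts = Counter(final_answers)
    return counts.most_common(1)[0][0]

# Hypothetical extracted answers from six sampled responses.
answers = ["42", "41", "42", "7", "42", "41"]
picked = majority_vote(answers)
print(picked)
```

Majority voting fails whenever the correct answer is generated but is not the most common one, which is the case a verifier-based selection rule can still recover.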

Scaling with Generations and Verifiers

Increasing the number of candidate generations $K$ amplifies the benefits of strong verification. While majority voting and naive ensembling quickly plateau, Weaver continues to improve, closely tracking the oracle upper bound.

Figure 2: As $K$ increases, Weaver narrows the generation-verification gap, outperforming alternative verification methods by an average of 18.3%.
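The oracle upper bound itself grows with $K$. A common way to estimate it from $n$ samples per query, of which $c$ are correct, is the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ popularized by code-generation evaluations. Whether the paper computes Pass@K this way is an assumption; the sketch only shows how quickly the bound rises for a query that is rarely solved.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of Pass@k given n samples of which c are correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate is exactly 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical query: 5 correct responses out of 100 samples.
curve = {k: pass_at_k(100, 5, k) for k in (1, 10, 100)}
print(curve)
```

Even at a 5% per-sample success rate, the bound rises steeply with $K$, so the achievable accuracy is increasingly determined by how well the selection rule picks among candidates.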

Scaling the number of verifiers also yields substantial gains: Weaver outperforms naive ensemble averaging both when the ensemble is restricted to the top five verifiers (selected with oracle knowledge) and when the full verifier pool is used.

Figure 3: Weaver consistently outperforms naive ensemble averaging, with improvements ranging from +2.4% to +10.1% across datasets.

Compute-Accuracy Trade-Offs

Verification can dominate inference-time compute costs. Weaver achieves the highest accuracy but at increased compute, as each candidate must be scored by multiple verifiers. To address this, Weaver is distilled into a 400M-parameter cross-encoder, retaining 98.7% of the full ensemble's accuracy while reducing verification compute by 99.97%.

Figure 4: Weaver improves the accuracy-compute trade-off, and distillation enables high accuracy with three orders of magnitude less compute.
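As a quick consistency check on the figures quoted above, a 99.97% reduction in verification compute leaves 0.03% of the original cost, which corresponds to a reduction factor of a little over three orders of magnitude:

$1 - 0.9997 = 0.0003 = 0.03\%, \qquad \frac{1}{0.0003} \approx 3{,}333 \approx 10^{3.5}$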

Distillation: Efficient Deployment

Weaver's distilled cross-encoder is trained to predict the ensemble's output, enabling efficient deployment. On GPQA Diamond, the distilled model achieves 98.2% of Weaver's accuracy at 0.03% of the compute cost, requiring only a single A100 GPU for inference.
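The idea of training a small model on the ensemble's output can be sketched with soft-target distillation. In the sketch below a logistic-regression student stands in for the 400M cross-encoder, random vectors stand in for encoded (query, response) pairs, and a synthetic function stands in for Weaver's combined scores; all three are assumptions made only to keep the example self-contained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def distill(features, teacher_scores, lr=0.5, steps=2000):
    """Fit a logistic-regression student to soft teacher targets via cross-entropy."""
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        pred = sigmoid(features @ w + b)
        grad = pred - teacher_scores          # gradient of the cross-entropy w.r.t. the logits
        w -= lr * features.T @ grad / n
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 5))          # hypothetical (query, response) features
# Synthetic stand-in for the ensemble's probability that each response is correct.
teacher = sigmoid(features @ np.array([1.5, -2.0, 0.5, 1.0, -1.0]))

w, b = distill(features, teacher)
student = sigmoid(features @ w + b)
mean_abs_err = np.abs(student - teacher).mean()
print(round(float(mean_abs_err), 4))
```

Training on the teacher's soft scores instead of hard labels lets the student inherit the ensemble's ranking of candidates without needing any ground-truth annotations of its own.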

Analysis and Ablations

Verifier Filtering and Binarization: Pruning low-quality verifiers and adaptive binarization are critical for stability and performance.
Difficulty-Aware Clustering: Partitioning queries by empirical difficulty and fitting separate WS models per cluster yields further gains, especially for smaller models or highly imbalanced datasets.
Prompt Optimization: Optimizing LM judge prompts via discrete search (e.g., DSPy) can further reduce false positive rates and improve precision, complementing ensemble aggregation.

Implications and Future Directions

Weaver demonstrates that reliable, scalable verification is achievable without large labeled datasets or specialized verifiers. This has several implications:

Data Filtering and Model Alignment: Weaver can be used to improve data curation and RLHF pipelines by providing higher-quality labels than individual verifiers.
Test-Time Compute Scaling: Verification becomes a new axis for scaling LLM performance, orthogonal to model size and number of generations.
Efficient Deployment: Distillation enables practical deployment of strong verification strategies in resource-constrained settings.
Extension to Multimodal and Domain-Specific Tasks: Adapting Weaver to multimodal verification or specialized domains (e.g., code, math) is a promising direction.

Conclusion

Weaver provides a principled, label-efficient framework for aggregating weak verifiers, substantially reducing the generation-verification gap in repeated sampling regimes. By leveraging weak supervision, adaptive filtering, and distillation, Weaver achieves strong empirical performance, robust scaling, and efficient deployment. These results suggest that strategic aggregation and distillation of weak verifiers can enable scalable, high-precision verification for LLMs without the need for extensive labeled data or retraining of the base generator.