Taming Variability: Randomized and Bootstrapped Conformal Risk Control for LLMs

Published 27 Sep 2025 in stat.ME | (2509.23007v1)

Abstract: We transform the randomness of LLMs into precise assurances using an actuator at the API interface that applies a user-defined risk constraint in finite samples via Conformal Risk Control (CRC). This label-free and model-agnostic actuator manages ship/abstain/escalate actions based solely on a scalar score from opaque outputs. We enhance CRC's computational efficiency and robustness through Batched Bootstrap CRC (BB-CRC) and Randomized Batched Weighted-Average CRC (RBWA-CRC), reducing calibration calls and stabilizing thresholds while maintaining statistical validity. Additionally, we present a semantic quantification method grounded in gram matrix geometry, resulting in interpretable signal and metric design. Together these pieces deliver principled randomness control for LLM hallucination mitigation and LLM-as-judge reliability. Our framework is assessed using four datasets, demonstrating its efficacy in enhancing factual accuracy and measuring LLM-as-judge performance, yielding a simplified and computationally efficient control layer that converts variability into statistical validity.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a conformal risk control mechanism that provides probabilistic assurances for LLM outputs to mitigate issues like hallucinations.
It employs Batched Bootstrap CRC and Randomized Weighted-Average CRC to improve calibration stability and reduce computational costs.
Semantic quantification via Gram matrix geometry enables label-free, interpretable assessments of model outputs, enhancing factual accuracy in QA tasks.

"Taming Variability: Randomized and Bootstrapped Conformal Risk Control for LLMs"

Introduction to CRC for LLMs

The paper introduces a method to manage the stochastic nature of LLMs by implementing a Conformal Risk Control (CRC) mechanism. This approach provides probabilistic assurances on output reliability, particularly in mitigating issues like hallucinations and evaluation inconsistencies. By employing a model-agnostic actuator at the API interface, CRC transforms the inherent randomness of LLMs into finite-sample guarantees without requiring explicit labels, thus ensuring vendor-independent control of variability.

Conformal Actuator Framework

The proposed Conformal Actuator (CA) is a monotonic gating mechanism that utilizes outputs' scalar scores to direct actions such as shipping, abstaining, or escalating responses. This mechanism grounded on CRC applies Batched Bootstrap CRC (BB-CRC) and Randomized Batched Weighted-Average CRC (RBWA-CRC) to enhance computational efficiency and threshold stability. With these methods, CRC reduces unnecessary recalibration calls and maintains statistical validity in finite samples.

Figure 1: Calibration comparison. Left: Empirical risk at the calibrated threshold versus $\alpha$ . Right: Stability versus $\alpha$ . RBWA shows superior stability.

Gram Space Geometry for Output Evaluation

The paper introduces a semantic quantification approach based on Gram matrix geometry. Utilizing unit-norm embeddings of model outputs, the method derives interpretable metrics for assessing decision sufficiency and per-item uncertainty. Two core capabilities are highlighted: (1) semantic sufficiency via spectral-overlap decisions in leading Gram subspaces, and (2) per-item quantification using intrinsic energy scores. This theoretical approach provides an efficient and label-free mechanism for batch evaluations.

Implementation: BB-CRC and RBWA-CRC

BB-CRC and RBWA-CRC implement advanced calibration strategies that utilize either batch bootstrapping or randomized weights across batches. These methods uphold finite-sample bounds while reducing computational costs. Noteworthy, the RBWA-CRC increases calibration stability through unbiased smoothing and variance control, providing more consistent thresholding compared to standard CRC methods.

Experimental Evaluation

Experiments across multiple question-answering datasets, including ASQA and HotpotQA, exhibit the proposed framework's efficacy in improving factual accuracy. The GramQ policy, for instance, consistently reduced Factuality Severity (FS) across different datasets, highlighting its robustness against variation in model and provider settings.

Figure 2: Policy $Q_E$ : FS reduction (\%) per benchmark demonstrates high reductions and uniform outcomes across tasks.

Conclusion

The study demonstrates how CRC can significantly affect the reliability of LLM outputs by managing risks associated with their randomness. The integration of CRC with semantic quantification offers a systematic approach for deploying LLMs in safety-critical environments. The results advocate for further exploration of CRC's potential in broader contexts beyond language modeling, emphasizing its scalability and robustness.

The findings suggest that future work could relax the exchangeability assumptions to account for covariate shifts or extend the LLM-as-judge applications on a larger scale, addressing challenges in pairwise ranking and multi-judge systems.

Markdown Report Issue