- The paper introduces a novel four-dimensional framework for classifying uncertainty in LLMs, covering input, reasoning, parameter, and prediction uncertainty.
- It evaluates methods such as Monte Carlo dropout and Bayesian approaches, highlighting trade-offs between computational efficiency and calibration quality.
- The survey outlines real-world applications and future research directions, emphasizing reliability and interpretability in high-stakes domains.
Uncertainty Quantification and Confidence Calibration in LLMs: A Survey
This survey provides a systematic review of uncertainty quantification (UQ) methods tailored to the different dimensions of uncertainty in LLMs. It categorizes UQ methods by computational efficiency and by uncertainty dimension (input, reasoning, parameter, and prediction), and it addresses the limitations of traditional UQ methods when applied to LLMs.
Sources of Uncertainty in LLMs
The paper categorizes uncertainty in LLMs into aleatoric and epistemic types, acknowledging the insufficiency of these classical categories alone for capturing the complexities of LLM uncertainty. Aleatoric uncertainty arises from inconsistencies, biases, and contradictions in training data, as well as ambiguity in natural language. Epistemic uncertainty stems from the model's lack of knowledge, manifesting as hallucinations or incorrect statements, and can be mitigated through fine-tuning or retrieval-augmented generation. The paper introduces a novel four-dimensional framework:
- Input Uncertainty: Arises from ambiguous or underspecified prompts, inherently aleatoric as no model can resolve the ambiguity.
- Reasoning Uncertainty: Arises during multi-step reasoning or retrieval; it is aleatoric when the problem itself is ambiguous and epistemic when the model's reasoning is flawed.
- Parameter Uncertainty: Stems from gaps in the training data, representing the model's lack of knowledge, and is therefore epistemic.
- Prediction Uncertainty: Reflects variability in generated outputs across different sampling runs, influenced by both aleatoric and epistemic sources.
Uncertainty and Confidence
The survey distinguishes uncertainty quantification from confidence estimation, defining uncertainty as a property of the model's predictive distribution and confidence as the model's belief in the correctness of a specific prediction. In natural language generation, confidence is typically measured by the joint probability of the generated sequence. The paper highlights two dimensions along which confidence estimates can be improved: ranking performance, which measures the discriminative power of the confidence measure, and calibration, which aims to align confidence scores with expected correctness.
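As an illustration of the sequence-level confidence measure described above, the sketch below computes the joint probability of a generation from its per-token log-probabilities, with optional length normalization. The function name and the numeric log-probabilities are illustrative, not from the paper.

```python
import math

def sequence_confidence(token_logprobs, length_normalize=True):
    """Confidence of a generated sequence as the joint probability of its
    tokens (the product of conditional probabilities, computed as a sum of
    log-probabilities for numerical stability).

    length_normalize: if True, return the per-token geometric mean, so
    longer sequences are not penalized merely for their length.
    """
    joint_logprob = sum(token_logprobs)
    if length_normalize:
        joint_logprob /= len(token_logprobs)
    return math.exp(joint_logprob)

# Hypothetical per-token log-probabilities for a 3-token generation.
logps = [math.log(0.9), math.log(0.8), math.log(0.7)]
print(sequence_confidence(logps, length_normalize=False))  # joint prob: 0.504
print(sequence_confidence(logps))  # per-token geometric mean, ~0.796
```

Length normalization is a common design choice here: without it, a long but fluent answer can receive a lower confidence than a short, less certain one.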
UQ Methods for Different Dimensions
The paper provides an overview of UQ methods, categorized by the four uncertainty dimensions:
- Input Uncertainty: Methods focus on perturbing input prompts and ensembling generations to capture disagreement.
- Reasoning Uncertainty: Techniques elicit and analyze the internal reasoning process, using methods like Monte Carlo Dropout and graph representations to quantify uncertainty.
- Parameter Uncertainty: Approaches range from Bayesian methods incorporated into LoRA adapters to finetuning-based techniques that train auxiliary models to predict confidence.
- Prediction Uncertainty: Methods include single-round generation techniques using logits or hidden states, and multiple-round generation approaches analyzing consistency and variability. Semantic-based methods incorporate external models to assess consistency beyond lexical similarity.
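A minimal sketch of the multiple-round generation idea above: sample several answers to the same prompt and take the Shannon entropy of the empirical answer distribution as a prediction-uncertainty score. Exact string matching stands in for the semantic clustering (e.g. via an NLI model) that the semantic-based methods would use; all names and values are illustrative.

```python
from collections import Counter
import math

def answer_entropy(answers):
    """Prediction uncertainty from repeated sampling: the Shannon entropy
    of the empirical distribution over distinct answers. Zero means the
    model answered identically every time; larger values mean more
    disagreement across sampling runs."""
    counts = Counter(answers)
    n = len(answers)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())

# Hypothetical sets of five sampled answers to the same prompt.
consistent = ["Paris", "Paris", "Paris", "Paris", "Paris"]
mixed = ["Paris", "Paris", "Lyon", "Paris", "Lyon"]
print(answer_entropy(consistent))  # 0.0: no disagreement
print(answer_entropy(mixed))       # ~0.673: noticeable disagreement
```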
Evaluation of Uncertainty in LLMs
The paper discusses benchmark datasets categorized by their focus, including reading comprehension, reasoning, factuality, general knowledge, and consistency. Evaluation metrics include AUROC, AUPRC, and AUARC, with the choice of the correctness function significantly impacting experimental conclusions. The survey also notes the increasing use of LLMs as judges for evaluation, alongside traditional human annotations and lexical similarity metrics.
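AUROC, as used in these evaluations, measures how well a confidence score separates correct from incorrect answers: it equals the probability that a randomly chosen correct answer receives a higher confidence than a randomly chosen incorrect one. A self-contained sketch of this rank-based computation follows; the confidence scores and correctness labels are made up for illustration.

```python
def auroc(confidences, correct):
    """AUROC of a confidence measure as a correctness predictor: the
    fraction of (correct, incorrect) answer pairs in which the correct
    answer gets the higher confidence, with ties counted as half."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical confidences and correctness labels for six answers.
conf = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
corr = [1, 1, 0, 1, 0, 0]
print(auroc(conf, corr))  # 8/9: one incorrect answer outranks a correct one
```

Note that AUROC depends on the correctness labels, which is why the survey stresses that the choice of correctness function can significantly change experimental conclusions.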
Applications in LLMs
The survey highlights applications of UQ in domains including robotics, transportation, healthcare, and education. In robotics, UQ is used in closed-loop planners and failure detectors to assess and adjust plans in real time. In transportation, LLMs augmented with UQ can reduce the risk of hallucination. In healthcare, UQ can support diagnosis and treatment selection. In education, UQ can guide students' thinking processes more reliably.
Challenges and Future Directions
The survey identifies several challenges and potential research directions:
- Efficiency-Performance Trade-offs: Addressing the high computational costs of multi-sample uncertainty methods.
- Interpretability Deficits: Clarifying the origins of uncertainty for users in high-stakes domains.
- Cross-Modality Uncertainty: Addressing misaligned confidence estimates between modalities in multi-modal LLMs.
- Interventions for Uncertainty: Developing real-time uncertainty monitoring and self-correction mechanisms.
- UQ Evaluation: Improving evaluation methods to capture nuanced uncertainty and creating suitable datasets for UQ.
Conclusion
The paper concludes by emphasizing the importance of integrating UQ techniques into LLM development to enhance reliability and trustworthiness. It provides a comprehensive taxonomy, evaluates existing methods, and identifies persistent challenges, offering insightful directions for future research in this rapidly evolving field.