- The paper introduces a novel four-dimensional framework for classifying uncertainty in LLMs, covering input, reasoning, parameter, and prediction uncertainty.
- It evaluates methods such as Monte Carlo dropout and Bayesian approaches, highlighting trade-offs between computational efficiency and calibration quality.
- The survey outlines real-world applications and future research directions, emphasizing reliability and interpretability in high-stakes domains.
Uncertainty Quantification and Confidence Calibration in LLMs: A Survey
This survey provides a systematic review of uncertainty quantification (UQ) methods tailored to the different dimensions of uncertainty in LLMs. It categorizes UQ methods by computational efficiency and by uncertainty dimension (input, reasoning, parameter, and prediction), and it addresses the limitations of traditional UQ methods when applied to LLMs.
Sources of Uncertainty in LLMs
The paper categorizes uncertainty in LLMs into aleatoric and epistemic types, acknowledging the insufficiency of these classical categories alone for capturing the complexities of LLM uncertainty. Aleatoric uncertainty arises from inconsistencies, biases, and contradictions in training data, as well as ambiguity in natural language. Epistemic uncertainty stems from the model's lack of knowledge, manifesting as hallucinations or incorrect statements, and can be mitigated through fine-tuning or retrieval-augmented generation. The paper introduces a novel four-dimensional framework:
- Input Uncertainty: Arises from ambiguous or underspecified prompts, inherently aleatoric as no model can resolve the ambiguity.
- Reasoning Uncertainty: Arises during multi-step reasoning or retrieval; it is aleatoric when the problem itself is ambiguous and epistemic when the model's reasoning is flawed.
- Parameter Uncertainty: Stems from gaps in the training data, representing the model's lack of knowledge, and is therefore epistemic.
- Prediction Uncertainty: Reflects variability in generated outputs across different sampling runs, influenced by both aleatoric and epistemic sources.
Uncertainty and Confidence
The survey distinguishes uncertainty quantification from confidence estimation, defining uncertainty as a property of the model's predictive distribution and confidence as the model's belief in the correctness of a specific prediction. In natural language generation, confidence is typically measured by the joint probability of the generated sequence. The paper highlights two dimensions along which confidence estimates can be improved: ranking performance, which measures the discriminative power of the confidence measure, and calibration, which aims to align confidence scores with expected correctness.
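As an illustration of the sequence-level confidence measure described above, the sketch below computes the joint probability of a generation from its per-token log-probabilities, with optional length normalization. The function name and the numeric log-probabilities are illustrative, not from the paper.

```python
import math

def sequence_confidence(token_logprobs, length_normalize=True):
    """Confidence of a generated sequence as the joint probability of its
    tokens (the product of conditional probabilities, computed as a sum of
    log-probabilities for numerical stability).

    length_normalize: if True, return the per-token geometric mean, so
    longer sequences are not penalized merely for their length.
    """
    joint_logprob = sum(token_logprobs)
    if length_normalize:
        joint_logprob /= len(token_logprobs)
    return math.exp(joint_logprob)

# Hypothetical per-token log-probabilities for a 3-token generation.
logps = [math.log(0.9), math.log(0.8), math.log(0.7)]
print(sequence_confidence(logps, length_normalize=False))  # joint prob: 0.504
print(sequence_confidence(logps))  # per-token geometric mean, ~0.796
```

Length normalization is a common design choice here: without it, a long but fluent answer can receive a lower confidence than a short, less certain one.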
UQ Methods for Different Dimensions
The paper provides an overview of UQ methods, categorized by the four uncertainty dimensions:
- Input Uncertainty: Methods focus on perturbing input prompts and ensembling generations to capture disagreement.
- Reasoning Uncertainty: Techniques elicit and analyze the internal reasoning process, using methods like Monte Carlo Dropout and graph representations to quantify uncertainty.
- Parameter Uncertainty: Approaches range from Bayesian methods incorporated into LoRA adapters to finetuning-based techniques that train auxiliary models to predict confidence.
- Prediction Uncertainty: Methods include single-round generation techniques using logits or hidden states, and multiple-round generation approaches analyzing consistency and variability. Semantic-based methods incorporate external models to assess consistency beyond lexical similarity.
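A minimal sketch of the multiple-round generation idea above: sample several answers to the same prompt and take the Shannon entropy of the empirical answer distribution as a prediction-uncertainty score. Exact string matching stands in for the semantic clustering (e.g. via an NLI model) that the semantic-based methods would use; all names and values are illustrative.

```python
from collections import Counter
import math

def answer_entropy(answers):
    """Prediction uncertainty from repeated sampling: the Shannon entropy
    of the empirical distribution over distinct answers. Zero means the
    model answered identically every time; larger values mean more
    disagreement across sampling runs."""
    counts = Counter(answers)
    n = len(answers)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())

# Hypothetical sets of five sampled answers to the same prompt.
consistent = ["Paris", "Paris", "Paris", "Paris", "Paris"]
mixed = ["Paris", "Paris", "Lyon", "Paris", "Lyon"]
print(answer_entropy(consistent))  # 0.0: no disagreement
print(answer_entropy(mixed))       # ~0.673: noticeable disagreement
```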
Evaluation of Uncertainty in LLMs
The paper discusses benchmark datasets categorized by their focus, including reading comprehension, reasoning, factuality, general knowledge, and consistency. Evaluation metrics include AUROC, AUPRC, and AUARC, with the choice of the correctness function significantly impacting experimental conclusions. The survey also notes the increasing use of LLMs as judges for evaluation, alongside traditional human annotations and lexical similarity metrics.
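AUROC, as used in these evaluations, measures how well a confidence score separates correct from incorrect answers: it equals the probability that a randomly chosen correct answer receives a higher confidence than a randomly chosen incorrect one. A self-contained sketch of this rank-based computation follows; the confidence scores and correctness labels are made up for illustration.

```python
def auroc(confidences, correct):
    """AUROC of a confidence measure as a correctness predictor: the
    fraction of (correct, incorrect) answer pairs in which the correct
    answer gets the higher confidence, with ties counted as half."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical confidences and correctness labels for six answers.
conf = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
corr = [1, 1, 0, 1, 0, 0]
print(auroc(conf, corr))  # 8/9: one incorrect answer outranks a correct one
```

Note that AUROC depends on the correctness labels, which is why the survey stresses that the choice of correctness function can significantly change experimental conclusions.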
Applications in LLMs
The survey highlights applications of UQ in domains including robotics, transportation, healthcare, and education. In robotics, UQ is used in closed-loop planners and failure detectors to assess and adjust plans in real time. In transportation, LLMs augmented with UQ can reduce the risk of hallucination. In healthcare, UQ can support diagnosis and treatment selection. In education, UQ can guide students' thinking processes more reliably.
Challenges and Future Directions
The survey identifies several challenges and potential research directions:
- Efficiency-Performance Trade-offs: Addressing the high computational costs of multi-sample uncertainty methods.
- Interpretability Deficits: Clarifying the origins of uncertainty for users in high-stakes domains.
- Cross-Modality Uncertainty: Addressing misaligned confidence estimates between modalities in multi-modal LLMs.
- Interventions for Uncertainty: Developing real-time uncertainty monitoring and self-correction mechanisms.
- UQ Evaluation: Improving evaluation methods to capture nuanced uncertainty and creating suitable datasets for UQ.
Conclusion
The paper concludes by emphasizing the importance of integrating UQ techniques into LLM development to enhance reliability and trustworthiness. It provides a comprehensive taxonomy, evaluates existing methods, and identifies persistent challenges, offering insightful directions for future research in this rapidly evolving field.