MAQA: Evaluating Uncertainty Quantification in LLMs Regarding Data Uncertainty

Published 13 Aug 2024 in cs.AI and cs.CL | (2408.06816v2)

Abstract: Despite the massive advancements in LLMs, they still suffer from producing plausible but incorrect responses. To improve the reliability of LLMs, recent research has focused on uncertainty quantification to predict whether a response is correct or not. However, most uncertainty quantification methods have been evaluated on single-labeled questions, which removes data uncertainty: the irreducible randomness often present in user queries, which can arise from factors like multiple possible answers. This limitation may cause uncertainty quantification results to be unreliable in practical settings. In this paper, we investigate previous uncertainty quantification methods under the presence of data uncertainty. Our contributions are two-fold: 1) proposing a new Multi-Answer Question Answering dataset, MAQA, consisting of world knowledge, mathematical reasoning, and commonsense reasoning tasks to evaluate uncertainty quantification regarding data uncertainty, and 2) assessing 5 uncertainty quantification methods of diverse white- and black-box LLMs. Our findings show that previous methods relatively struggle compared to single-answer settings, though this varies depending on the task. Moreover, we observe that entropy- and consistency-based methods effectively estimate model uncertainty, even in the presence of data uncertainty. We believe these observations will guide future work on uncertainty quantification in more realistic settings.

Summary

- The paper introduces MAQA, a new dataset of over 2,000 multi-answer questions for probing how LLMs behave under data uncertainty.
- It evaluates five uncertainty quantification methods and finds that entropy and response consistency still estimate model uncertainty effectively.
- Experiments reveal LLM overconfidence in reasoning tasks, which makes it harder to separate model uncertainty from data uncertainty.

Evaluating Uncertainty Quantification in LLMs Regarding Data Uncertainty

Introduction to Uncertainty Quantification in LLMs

The paper "MAQA: Evaluating Uncertainty Quantification in LLMs Regarding Data Uncertainty" investigates how well uncertainty quantification methods for LLMs work when data uncertainty is present. Because LLMs can produce plausible yet incorrect responses, uncertainty quantification is needed to make them more reliable. The paper distinguishes model (epistemic) uncertainty, which reflects what the model does not know, from data (aleatoric) uncertainty, the irreducible randomness in a query, such as a question that has several valid answers.

Uncertainty quantification assigns a confidence value to each model output, which can then be used to decide whether to accept or reject that output. Prior work has mostly studied model uncertainty on single-answer questions; this paper extends the analysis to questions with multiple correct answers, which inherently carry data uncertainty (Figure 1).

Figure 1: Evaluation settings with and without data uncertainty. When asking for a single label set, the probability distribution can be used to estimate the model uncertainty. On the other hand, when evaluating a question that has multiple answers, it may become difficult to distinguish between model uncertainty and data uncertainty, due to the existence of multiple possible answers.

Multi-Answer Question Answering Dataset

The paper introduces a novel dataset designed to probe LLM performance under data uncertainty. This dataset, named Multi-Answer Question Answering (MAQA), comprises questions that necessitate multiple correct answers across diverse domains such as world knowledge, mathematical reasoning, and commonsense reasoning, totaling over 2,000 questions. This diversity enables comprehensive analysis of LLM behavior in tasks susceptible to both model and data uncertainty (Table 1).

Methodological Approach

The study evaluates five uncertainty quantification methods, considering both white-box and black-box LLMs:

White-box Methods:
- Max Softmax Logit: Uses the largest softmax-normalized logit as the confidence score; a lower maximum indicates higher uncertainty.
- Entropy: Measures the randomness of the token distribution; higher entropy indicates higher uncertainty.
- Margin: Uses the difference between the top two logits; a smaller margin indicates higher uncertainty.

Black-box Methods:
- Verbalized Confidence: The model is prompted to state its confidence in its own response.
- Response Consistency: Estimates confidence from the agreement among multiple responses sampled for the same input.
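
The three white-box measures above can be sketched for a single next-token logit vector. This is a minimal illustration, not the paper's implementation: the function names and the example logit vectors are hypothetical, and the paper may aggregate these scores over the tokens of a response in ways not shown here.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_softmax(logits):
    """Largest softmax probability; lower means more uncertain."""
    return max(softmax(logits))


def entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution; higher means more uncertain."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def margin(logits):
    """Gap between the two largest logits; smaller means more uncertain."""
    top_two = sorted(logits, reverse=True)[:2]
    return top_two[0] - top_two[1]


# Hypothetical logit vectors over a 4-token vocabulary (assumed values).
peaked = [8.0, 1.0, 0.5, 0.0]   # one token dominates
flat = [2.0, 1.9, 1.8, 1.7]     # probability mass spread across tokens

for name, logits in [("peaked", peaked), ("flat", flat)]:
    print(name, max_softmax(logits), entropy(logits), margin(logits))
```

On these inputs the peaked vector has the higher maximum probability, the lower entropy, and the larger margin, so all three scores agree that it is the more confident prediction.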

Experimental Results

The experiments address three questions: how uncertainty quantification methods perform in the presence of data uncertainty, how their performance varies across tasks, and how the uncertainty estimates correlate with recall scores.

Key Observations

Impact of Data Uncertainty:
- Data uncertainty affects logit distributions; however, entropy remains a reliable predictor of model uncertainty in the presence of data uncertainty. This stems from LLMs' tendency to concentrate predictions on a few probable tokens.
Overconfidence in Reasoning Tasks:
- LLMs are overconfident on reasoning tasks, particularly after they have generated their first answer, which complicates uncertainty quantification (Figure 2).
Response Consistency vs. Verbalized Confidence:
- Response consistency effectively predicts model correctness, even under data uncertainty, outperforming verbalized confidence, which is hampered by inherent LLM overconfidence (Figure 3).
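
One way to picture response consistency in a multi-answer setting is to sample several responses, treat each as a set of answers, and average the pairwise overlap. The sketch below uses mean pairwise Jaccard similarity, which is an assumption chosen for illustration and not necessarily the agreement measure used in the paper; the helper names and sample responses are hypothetical.

```python
from itertools import combinations


def normalize(answer):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def jaccard(a, b):
    """Overlap of two answer sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def response_consistency(responses):
    """Mean pairwise Jaccard similarity across sampled responses (higher = more consistent)."""
    sets = [frozenset(normalize(x) for x in r) for r in responses]
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two sampled responses")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical samples for one multi-answer question (assumed data).
stable = [["Red", "green", "blue"], ["blue", "red", "green"], ["green", "blue", "red"]]
unstable = [["red", "green"], ["cyan"], ["blue", "black"]]

print(response_consistency(stable))    # same answer set every time
print(response_consistency(unstable))  # no overlap between samples
```

Because the comparison is over sets, a model that lists the same correct answers in a different order still counts as fully consistent, which is the property that matters when several answers are valid.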

Figure 2: Results with different prompting strategies.

Figure 3: AUROC scores by precision for two uncertainty quantification methods: Verbalized Confidence and Response Consistency.
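
AUROC, the metric named in Figure 3, can be read as the probability that a randomly chosen correct response receives a higher confidence score than a randomly chosen incorrect one. The sketch below computes it directly from that definition; the confidence values and correctness labels are made up for illustration, and how the paper decides whether a multi-answer response counts as correct is not reproduced here.

```python
def auroc(confidences, is_correct):
    """Fraction of (correct, incorrect) pairs ranked correctly; ties count as half."""
    pos = [c for c, y in zip(confidences, is_correct) if y]
    neg = [c for c, y in zip(confidences, is_correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect response")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical confidence scores and correctness labels (assumed data).
conf = [0.9, 0.8, 0.4, 0.3]
correct = [True, False, True, False]

score = auroc(conf, correct)
print(score)  # 0.75: three of the four correct/incorrect pairs are ordered correctly
```

A score of 0.5 means the confidence values carry no information about correctness, and 1.0 means every correct response is ranked above every incorrect one.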

Implications and Future Directions

The findings underscore the robustness of entropy and consistency-based methods in uncertainty quantification, even with data uncertainty. This work provides essential insights for developing more reliable LLMs capable of navigating complex, real-world scenarios where multiple correct answers coexist. Future research may focus on refining these methods, integrating them with LLM training to mitigate hallucinations, and enhancing the deployment of LLMs in domains requiring high accuracy and interpretability.

Conclusion

The study is a step toward understanding and improving uncertainty quantification in LLMs under realistic, multi-answer conditions. By evaluating existing methods where both model and data uncertainty are present, it shows which of them remain reliable across different tasks and provides a basis for future work on uncertainty quantification.
