Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
Abstract: Empowering LLMs to accurately express confidence in their answers is essential for trustworthy decision-making. Previous confidence elicitation methods, which primarily rely on white-box access to internal model information or on model fine-tuning, have become less suitable for LLMs, especially closed-source commercial APIs. This creates a growing need to explore the untapped area of black-box approaches to LLM uncertainty estimation. To break the problem down, we define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency. We then benchmark these methods on two key tasks, confidence calibration and failure prediction, across five types of datasets (e.g., commonsense and arithmetic reasoning) and five widely used LLMs, including GPT-4 and LLaMA 2 Chat. Our analysis uncovers several key insights: 1) LLMs, when verbalizing their confidence, tend to be overconfident, potentially imitating human patterns of expressing confidence. 2) As model capability scales up, both calibration and failure prediction performance improve. 3) Employing our proposed strategies, such as human-inspired prompts, consistency among multiple responses, and better aggregation strategies, can help mitigate this overconfidence from various perspectives. 4) Comparisons with white-box methods indicate that while white-box methods perform better, the gap is narrow, e.g., 0.522 vs. 0.605 in AUROC. Despite these advancements, no single technique consistently outperforms the others, and all investigated methods struggle on challenging tasks, such as those requiring professional knowledge, indicating significant scope for improvement. We believe this study can serve as a strong baseline and provide insights for eliciting confidence in black-box LLMs.
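The three-part framework described above (prompt for verbalized confidence, sample multiple responses, aggregate for consistency) can be sketched in a few lines. The snippet below is a minimal illustration of the sample-then-aggregate idea under assumed inputs, not the paper's exact aggregation strategy: `answers` and `verbalized_confs` are hypothetical placeholders for the parsed answer strings and self-reported confidences (on a 0-1 scale) obtained from K independent samples of the same prompt.

```python
from collections import Counter

def consistency_confidence(answers, verbalized_confs):
    """Aggregate a black-box confidence estimate from K sampled responses.

    A sketch of one plausible aggregation: combine agreement among samples
    (consistency) with the mean verbalized confidence of the majority answer.
    """
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    consistency = top_count / len(answers)  # fraction of samples that agree
    avg_verbalized = sum(
        c for a, c in zip(answers, verbalized_confs) if a == top_answer
    ) / top_count  # mean self-reported confidence on the majority answer
    # Simple hybrid score: equal-weight average of the two signals.
    return top_answer, 0.5 * (consistency + avg_verbalized)

# Example: 3 samples of an arithmetic question, two of which agree.
answer, conf = consistency_confidence(["12", "12", "13"], [0.9, 0.8, 0.9])
```

Here the majority answer "12" gets consistency 2/3 and mean verbalized confidence 0.85, yielding a combined score of about 0.76; other weightings or aggregation rules (e.g., consistency alone) drop into the same structure.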