Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models' Uncertainty?

Published 30 May 2025 in cs.CL (arXiv:2505.24778v2)

Abstract: As LLMs are increasingly used in high-stakes domains, accurately assessing their confidence is crucial. Humans typically express confidence through epistemic markers (e.g., "fairly confident") instead of numerical values. However, it remains unclear whether LLMs consistently use these markers to reflect their intrinsic confidence due to the difficulty of quantifying uncertainty associated with various markers. To address this gap, we first define marker confidence as the observed accuracy when a model employs an epistemic marker. We evaluate its stability across multiple question-answering datasets in both in-distribution and out-of-distribution settings for open-source and proprietary LLMs. Our results show that while markers generalize well within the same distribution, their confidence is inconsistent in out-of-distribution scenarios. These findings raise significant concerns about the reliability of epistemic markers for confidence estimation, underscoring the need for improved alignment between marker-based confidence and actual model uncertainty. Our code is available at https://github.com/HKUST-KnowComp/MarCon.

Summary

Revisiting Epistemic Markers in Confidence Estimation

The paper "Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models' Uncertainty?" delves into the challenge of assessing the reliability of epistemic markers as tools for confidence estimation in LLMs. The increasing application of LLMs in critical domains necessitates robust mechanisms for uncertainty quantification, traditionally approached via numerical values or response consistency. However, the linguistic interface that humans utilize—epistemic markers—offers an alternative avenue that resonates more naturally in human-LLM interactions.

Study Objectives and Methodology

The core aim of the study is to investigate whether LLM-generated epistemic markers can reliably communicate the models' intrinsic confidence. The authors define "marker confidence" as the observed accuracy of responses that contain a specific epistemic marker. This definition diverges from traditional semantic interpretations, providing a quantitative scope for analysis. Their methodology encompasses assessments across a variety of question-answering datasets, tackling both in-distribution and out-of-distribution contexts using multiple LLMs to ensure comprehensive coverage.
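The definition of marker confidence as observed accuracy can be made concrete with a short sketch. The data format below, a list of (marker, is_correct) pairs from a QA evaluation run, is an assumption for illustration, not the paper's exact pipeline:

```python
from collections import defaultdict

def marker_confidence(records):
    """Estimate per-marker confidence as observed accuracy: of all
    answers tagged with a given epistemic marker, what fraction
    were correct?"""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for marker, is_correct in records:
        totals[marker] += 1
        correct[marker] += int(is_correct)
    return {m: correct[m] / totals[m] for m in totals}

# Hypothetical evaluation records (marker, answer-was-correct).
records = [
    ("fairly confident", True),
    ("fairly confident", True),
    ("fairly confident", False),
    ("not sure", False),
    ("not sure", True),
]
conf = marker_confidence(records)
```

Under this definition, "fairly confident" earns a confidence of 2/3 here because two of its three tagged answers were correct; no semantic interpretation of the marker's wording is involved.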

Seven distinct evaluation metrics are proposed to scrutinize the stability and consistency of these epistemic markers. Among these metrics, the study emphasizes the Expected Calibration Error (ECE) for marker confidence across different settings and the Pearson and Spearman correlation coefficients to illustrate consistency issues.
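Expected Calibration Error underlies several of these metrics. A generic binned ECE can be sketched as follows; the paper's I-AvgECE, C-AvgECE, and NumECE variants adapt this basic idea, and the example data below is illustrative only:

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    """Standard binned ECE: the weighted absolute gap between mean
    confidence and observed accuracy within each confidence bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bin membership on (lo, hi], with 0.0 folded into the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(corrects[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(avg_conf - acc)
    return ece

# Hypothetical per-answer confidences (e.g., marker confidences
# estimated in-distribution) paired with correctness on a new set.
confs = [0.9, 0.9, 0.1, 0.1]
hits = [1, 0, 0, 0]
ece = expected_calibration_error(confs, hits)
```

A perfectly calibrated system would place 90%-confidence answers in a bin with 90% accuracy, giving zero contribution from that bin; the gaps reported in the paper grow precisely when marker confidence learned on one distribution is carried to another.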

Key Findings

The results reveal notable inconsistencies in epistemic markers' reliability. Specifically, while marker confidence exhibits stability within similar datasets (in-distribution), its reliability deteriorates significantly in out-of-distribution contexts. This instability is quantified using the metrics I-AvgECE, C-AvgECE, and NumECE, highlighting a discernible calibration disparity when transitioning across different data distributions.

Additionally, markers do not maintain a consistent ordering of confidence across datasets, as underscored by low Marker Ranking Correlation (MRC) values across the evaluated models. Marker confidence values are also insufficiently dispersed, so markers fail to clearly differentiate between confidence levels, a property essential for deployment in high-stakes scenarios.
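The ranking-consistency check can be illustrated with a plain Spearman correlation over per-marker confidences measured on two datasets. This is a hedged sketch, and the marker names and numbers are invented; the paper's MRC metric may differ in detail:

```python
def spearman(xs, ys):
    """Spearman rank correlation (no tie handling; a sketch that
    assumes distinct values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-marker confidences on two datasets; a value near
# 1.0 would mean the markers keep the same confidence ordering.
markers = ["almost certain", "fairly confident", "not sure"]
in_dist = [0.92, 0.78, 0.55]   # observed accuracy per marker, dataset A
ood     = [0.60, 0.81, 0.70]   # same markers, out-of-distribution set
mrc = spearman(in_dist, ood)
```

In this toy example the ordering flips ("almost certain" becomes the least accurate marker out of distribution), producing a negative correlation, which is the kind of inconsistency the low MRC values in the paper capture.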

Implications

The findings suggest that while epistemic markers are intuitive, they fail to provide an accurate reflection of LLMs' uncertainty, especially when datasets vary significantly in domain or complexity. There is thus a compelling need for strategies that better align verbal confidence with actual model performance.

Future efforts might include refining LLM architectures to improve their understanding of linguistic expressions of uncertainty, possibly incorporating epistemic markers into model training protocols to enhance alignment. Furthermore, a hybrid approach integrating both numerical and linguistic confidence measures could enhance robustness, facilitating more reliable decision-making processes in applications demanding high confidence.

Conclusion

The study serves as a pivotal step towards decoding the complex interaction between LLMs and natural language interfaces. While highlighting shortcomings in current confidence communication methodologies, it prompts further investigation into combining human-like uncertainty expressions with empirical accuracy data, thereby paving the way for more reflective and trustworthy LLM applications in critical domains.
