Improving LLM “oracle” aleatoric uncertainty estimates via prompting

Develop prompting strategies for the Mistral Medium (version 25.05) large language model used as a confidence-scoring “oracle”, so that its responses yield confidence scores that more accurately estimate aleatoric uncertainty when answering PubMedQA-style biomedical questions through the specified JSON-format interface.
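
For concreteness, the sketch below illustrates one way such a JSON-format oracle interface could be prompted and parsed; the prompt wording and the field names ("answer", "confidence") are illustrative assumptions rather than the exact interface used in the study.

```python
import json

# Hypothetical prompt template for the confidence-scoring oracle; the exact
# wording and JSON schema used in the study are not reproduced here.
ORACLE_PROMPT = (
    "You answer PubMedQA-style biomedical questions.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    'Reply with JSON only, e.g. {{"answer": "yes", "confidence": 0.85}}, '
    'where "confidence" is the probability that your answer is correct.'
)

def parse_oracle_reply(raw: str) -> tuple[str, float]:
    """Parse a single oracle reply into (answer, confidence); assumes valid JSON."""
    payload = json.loads(raw)
    return str(payload["answer"]).strip().lower(), float(payload["confidence"])
```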

Background

The study evaluates an LLM-based “oracle”, implemented with Mistral Medium 25.05, that returns both an answer and a self-reported confidence score for PubMedQA-style questions. The authors averaged ten such responses per sample to derive a per-sample confidence score intended to capture aleatoric uncertainty.
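
A minimal sketch of that averaging step is given below, assuming a hypothetical client wrapper call_mistral_medium that returns one raw JSON reply per call; it is not the authors' actual code.

```python
import json
from statistics import mean

def call_mistral_medium(prompt: str) -> str:
    """Hypothetical wrapper around the Mistral Medium 25.05 API; returns raw JSON text."""
    raise NotImplementedError  # replace with a real chat-completion call

def per_sample_confidence(prompt: str, n_replies: int = 10) -> float:
    """Average the self-reported confidence over n_replies independent oracle calls."""
    scores = []
    for _ in range(n_replies):
        payload = json.loads(call_mistral_medium(prompt))
        scores.append(float(payload["confidence"]))
    return mean(scores)
```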

Despite experimenting with more complex prompts aimed at eliciting better aleatoric uncertainty estimates, the authors report that they could not improve the quality of this LLM-based uncertainty estimation approach, leaving open how to elicit higher-quality aleatoric uncertainty estimates from such models.

References

Although more complex prompts were tested to generate responses that better estimate aleatoric uncertainty, we were unable to improve the quality of this approach.

Enhancing the Reliability of Medical AI through Expert-guided Uncertainty Modeling (2604.01898 - Khalin et al., 2 Apr 2026) in Subsection "Generating confidence scores from an LLM 'oracle'", Supplementary Methods