
Uncertainty-aware Language Modeling for Selective Question Answering

Published 26 Nov 2023 in cs.CL and cs.LG | arXiv:2311.15451v1

Abstract: We present an automatic LLM conversion approach that produces uncertainty-aware LLMs capable of estimating uncertainty with every prediction. Our approach is model- and data-agnostic, is computationally-efficient, and does not rely on external models or systems. We evaluate converted models on the selective question answering setting -- to answer as many questions as possible while maintaining a given accuracy, forgoing providing predictions when necessary. As part of our results, we test BERT and Llama 2 model variants on the SQuAD extractive QA task and the TruthfulQA generative QA task. We show that using the uncertainty estimates provided by our approach to selectively answer questions leads to significantly higher accuracy over directly using model probabilities.


Summary

  • The paper introduces a method to convert LLMs to uncertainty-aware models by incorporating both aleatoric and epistemic uncertainty estimates.
  • It employs the Capsa framework to modify models like BERT on SQuAD and Llama 2 on TruthfulQA, ensuring computationally efficient uncertainty integration.
  • The study shows that uncertainty-informed predictions significantly improve selective QA accuracy compared to traditional logit-based confidence measures.

Uncertainty-aware Language Modeling for Selective Question Answering

This essay examines the methodology and results of the paper "Uncertainty-aware Language Modeling for Selective Question Answering" (arXiv:2311.15451). The objective is to explore the integration of uncertainty estimates into LLMs to enhance the accuracy and reliability of selective question answering tasks.

Introduction

The paper presents a mechanism to enable LLMs to produce uncertainty estimates alongside predictions, thereby allowing them to selectively answer questions based on confidence levels. The need for such a mechanism arises due to the inherent limitations of LLMs in handling ambiguous information, out-of-domain data, and inconsistent training inputs. Traditional approaches have relied on predicted softmax probabilities to gauge prediction confidence, but this paper argues for a more robust uncertainty quantification (UQ) approach that aligns better with model prediction accuracy.

Figure 1: Robust, uncertainty-aware language modeling. Our methodology converts LLMs — agnostic of architecture — into uncertainty-aware variants and applies to generative (i.e., next-token prediction, left) and extractive (i.e., sub-context answering, right) models.

Methodology

Selective Question Answering

The core strategy outlined in the study involves converting existing LLMs into uncertainty-aware models capable of providing both aleatoric and epistemic uncertainty estimates. Aleatoric uncertainty captures data-related noise, while epistemic uncertainty addresses uncertainties in the model's knowledge.
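As a simplified illustration of the two uncertainty types (a sketch, not the paper's implementation): aleatoric uncertainty can be proxied by the entropy of a single predictive distribution, while epistemic uncertainty can be proxied by disagreement across stochastic forward passes, as in MC dropout. The `toy_model` below is hypothetical.

```python
import math
import random

def entropy(probs):
    """Aleatoric proxy: entropy of one predictive distribution.
    entropy([0.5, 0.5]) is ln 2 (about 0.693)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def epistemic_variance(stochastic_forward, x, n_passes=20):
    """Epistemic proxy: variance of the top-class probability
    across stochastic forward passes (MC-dropout style)."""
    tops = [max(stochastic_forward(x)) for _ in range(n_passes)]
    mean = sum(tops) / n_passes
    return sum((t - mean) ** 2 for t in tops) / n_passes

def toy_model(x):
    """Hypothetical stochastic classifier: softmax over noisy logits."""
    logits = [2.0 + random.gauss(0, 0.3), 0.5, 0.1]
    exps = [math.exp(l) for l in logits]
    return [e / sum(exps) for e in exps]

random.seed(0)
print(entropy([0.5, 0.5]))                  # prints ~0.693 (ln 2)
print(epistemic_variance(toy_model, None))  # small non-negative variance
```

Low entropy with high cross-pass disagreement signals a confident-looking model that is nonetheless unsure, exactly the case where abstention is valuable.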

Models and Datasets

Two types of models and datasets are examined: extractive models, using BERT on the SQuAD dataset, and generative models, using Llama 2 on the TruthfulQA dataset. Both models are adapted to output uncertainty measures for their predictions, which are then employed to determine whether a model should answer or abstain from a given question.
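The answer-or-abstain decision itself reduces to thresholding the uncertainty score. A minimal pure-Python sketch (all names hypothetical) that also reports the resulting coverage and selective accuracy:

```python
def selective_answer(predictions, uncertainties, labels, threshold):
    """Answer only when uncertainty is below the threshold; abstain otherwise.
    Returns (coverage, accuracy over answered questions)."""
    answered = [(p, y) for p, u, y in zip(predictions, uncertainties, labels)
                if u < threshold]
    coverage = len(answered) / len(predictions)
    if not answered:
        return coverage, None
    accuracy = sum(p == y for p, y in answered) / len(answered)
    return coverage, accuracy

preds = ["a", "b", "c", "d"]
uncs  = [0.1, 0.9, 0.2, 0.8]
gold  = ["a", "x", "c", "d"]
cov, acc = selective_answer(preds, uncs, gold, threshold=0.5)
# Two low-uncertainty answers ("a", "c"), both correct: coverage 0.5, accuracy 1.0
```

Lowering the threshold trades coverage for accuracy, which is precisely the trade-off the selective QA setting evaluates.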

Conversion Process

The conversion process utilizes the Capsa framework to integrate UQ metrics into existing LLMs, transforming them into models that can manage and report uncertainty alongside their outputs. This model modification process is computationally efficient and architecture-agnostic, ensuring broad applicability.
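Conceptually, such a conversion wraps a model's forward pass so that every call returns an uncertainty estimate alongside the prediction. The sketch below illustrates that wrapper pattern generically; it is not the Capsa API, and `base_predict` is a hypothetical stand-in for a stochastic (e.g., dropout-enabled) model.

```python
import random
import statistics

def make_uncertainty_aware(base_predict, n_samples=10):
    """Wrap a stochastic predict function so each call returns
    (mean prediction, sample std as an epistemic-uncertainty proxy)."""
    def wrapped(x):
        scores = [base_predict(x) for _ in range(n_samples)]
        return statistics.mean(scores), statistics.stdev(scores)
    return wrapped

def base_predict(x):
    """Hypothetical noisy scorer standing in for a dropout-enabled model."""
    return x + random.gauss(0, 0.05)

random.seed(0)
aware = make_uncertainty_aware(base_predict)
mean, uncertainty = aware(1.0)  # mean near 1.0, small positive uncertainty
```

The appeal of this pattern is that the caller's interface barely changes: downstream code receives an uncertainty value with every prediction, without retraining or an external verifier model.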

Results

Performance on Test Datasets

The study’s findings show that uncertainty-aware models achieve a marked improvement in both coverage and accuracy on selective QA tasks compared to baseline models that use logit-based confidence measures. Specifically, the uncertainty estimates derived from these converted models provide a more reliable means of determining the likelihood of a correct prediction.

Figure 2: Selective answering accuracy by confidence level. Increasing values of logit probability do not correspond to increased question answering ability — despite being often misconceived as a measure of confidence. Our methods report a reliable measure of confidence — with increased confidence corresponding to increased accuracy.

In both generative and extractive QA scenarios, the study observes that predictions assigned high confidence by traditional logit probabilities can actually have lower accuracy, demonstrating the efficacy of UQ methods over conventional confidence estimates.
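This comparison can be operationalized as a risk-coverage sweep: rank examples by the chosen confidence score, then measure accuracy over the most-confident fraction at each coverage level. A reliable score yields accuracy that rises as coverage shrinks; a miscalibrated one does not. A minimal sketch with hypothetical data:

```python
def accuracy_at_coverage(confidences, correct, coverage):
    """Accuracy over the most-confident `coverage` fraction of examples."""
    ranked = sorted(zip(confidences, correct), key=lambda t: -t[0])
    k = max(1, int(round(coverage * len(ranked))))
    return sum(c for _, c in ranked[:k]) / k

conf = [0.9, 0.8, 0.6, 0.4]
corr = [1, 1, 0, 0]
# At 50% coverage we keep the two most confident examples, both correct.
print(accuracy_at_coverage(conf, corr, 0.5))  # prints 1.0
```

Sweeping `coverage` from 1.0 down to 0 traces the accuracy-coverage curve used to compare uncertainty estimates against raw logit probabilities.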

Implications and Future Directions

This research advances the field of AI by integrating UQ into LLM predictions for more reliable question answering, especially under uncertainty. Such techniques are critical for models deployed in applications where incorrect predictions could be costly, such as legal, medical, or customer service domains.

The implications extend beyond QA tasks to any domain requiring robust, adaptable LLMs capable of self-assessing reliability. Future work could explore extending this framework to integrate with external knowledge bases dynamically or apply these methods to broader NLP tasks that necessitate robustness and reliability.

Conclusion

The paper effectively illustrates the need for uncertainty-aware language modeling to enhance LLM performance on QA tasks. By adopting uncertainty measures, these models achieve higher accuracy and reliability, offering significant benefits for applications requiring precise, contextually appropriate responses. This shift towards uncertainty-informed predictions represents a crucial step in the evolution of AI systems, helping to balance the trade-off between coverage and accuracy more effectively.
