Towards Better Understanding of In-Context Learning Ability from In-Context Uncertainty Quantification
Abstract: Predicting simple function classes has been widely used as a testbed for developing theory and understanding of the trained Transformer's in-context learning (ICL) ability. In this paper, we revisit the training of Transformers on linear regression tasks and, unlike the existing literature, consider a bi-objective prediction task of predicting both the conditional expectation $\mathbb{E}[Y|X]$ and the conditional variance $\mathrm{Var}(Y|X)$. This additional uncertainty quantification objective provides a handle to (i) better design out-of-distribution experiments that distinguish ICL from in-weight learning (IWL) and (ii) better separate algorithms that use the prior information of the training distribution from those that do not. Theoretically, we show that the trained Transformer attains near-Bayes-optimal performance, suggesting that it uses the information of the training distribution; our analysis also extends to more general settings. Specifically, with the Transformer's context window $S$, we prove a generalization bound of $\tilde{\mathcal{O}}(\sqrt{\min\{S, T\}/(n T)})$ on $n$ tasks with sequences of length $T$, a sharper analysis than previous results of $\tilde{\mathcal{O}}(\sqrt{1/n})$. Empirically, we illustrate that while the trained Transformer behaves like the Bayes-optimal solution as a natural consequence of supervised training in distribution, it does not necessarily perform Bayesian inference when facing task shifts, in contrast to the \textit{equivalence} between the two proposed in much of the existing literature. We also demonstrate the trained Transformer's ICL ability under covariate shift and prompt-length shift and interpret both as generalization over a meta distribution.
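The bi-objective target described in the abstract, predicting both $\mathbb{E}[Y|X]$ and $\mathrm{Var}(Y|X)$ from an in-context prompt, has a closed-form Bayes-optimal baseline under a Gaussian task prior, which is the kind of reference solution the trained Transformer is compared against. A minimal sketch of that baseline, assuming a standard Gaussian prior over task weights $w \sim \mathcal{N}(0, \tau^2 I)$ and noise $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ (the parameter names and values here are illustrative, not taken from the paper):

```python
import numpy as np

def bayes_posterior_predict(X_ctx, y_ctx, x_query, tau2=1.0, sigma2=0.25):
    """Bayes-optimal predictive mean and variance at x_query given context.

    Assumes y_i = x_i^T w + eps_i with w ~ N(0, tau2 * I), eps ~ N(0, sigma2).
    """
    d = X_ctx.shape[1]
    # Posterior over w: Sigma = (X^T X / sigma2 + I / tau2)^{-1}, mu = Sigma X^T y / sigma2
    Sigma = np.linalg.inv(X_ctx.T @ X_ctx / sigma2 + np.eye(d) / tau2)
    mu = Sigma @ X_ctx.T @ y_ctx / sigma2
    mean = x_query @ mu                       # E[Y | X = x_query, context]
    var = x_query @ Sigma @ x_query + sigma2  # Var(Y | X = x_query, context)
    return mean, var

# One sampled linear-regression task with an in-context prompt.
rng = np.random.default_rng(0)
d, n_ctx = 5, 20
w = rng.normal(size=d)
X_ctx = rng.normal(size=(n_ctx, d))
y_ctx = X_ctx @ w + 0.5 * rng.normal(size=n_ctx)
x_query = rng.normal(size=d)
mean, var = bayes_posterior_predict(X_ctx, y_ctx, x_query)
```

The predictive variance decomposes into an epistemic term $x^\top \Sigma x$, which shrinks as the context lengthens, and the irreducible noise $\sigma^2$; a near-Bayes-optimal Transformer would have to track both.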