
Towards Better Understanding of In-Context Learning Ability from In-Context Uncertainty Quantification

Published 24 May 2024 in cs.LG, cs.CL, and stat.ML (arXiv:2405.15115v1)

Abstract: Predicting simple function classes has been widely used as a testbed for developing theory and understanding of the trained Transformer's in-context learning (ICL) ability. In this paper, we revisit the training of Transformers on linear regression tasks, and unlike all the existing literature, we consider a bi-objective prediction task of predicting both the conditional expectation $\mathbb{E}[Y|X]$ and the conditional variance $\mathrm{Var}(Y|X)$. This additional uncertainty quantification objective provides a handle to (i) better design out-of-distribution experiments that distinguish ICL from in-weight learning (IWL) and (ii) better separate the algorithms that use the prior information of the training distribution from those that do not. Theoretically, we show that the trained Transformer reaches near Bayes-optimality, suggesting that it uses the information of the training distribution; our analysis extends to other cases. Specifically, with the Transformer's context window $S$, we prove a generalization bound of $\tilde{\mathcal{O}}(\sqrt{\min\{S, T\}/(nT)})$ over $n$ tasks with sequences of length $T$, a sharper analysis than previous results of $\tilde{\mathcal{O}}(\sqrt{1/n})$. Empirically, we illustrate that while the trained Transformer behaves as the Bayes-optimal solution in distribution, as a natural consequence of supervised training, it does not necessarily perform Bayesian inference under task shifts, in contrast to the equivalence between the two proposed in much of the existing literature. We also demonstrate the trained Transformer's ICL ability under covariate shift and prompt-length shift and interpret it as generalization over a meta-distribution.
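
The abstract compares the trained Transformer's in-context predictions of $\mathbb{E}[Y|X]$ and $\mathrm{Var}(Y|X)$ against the Bayes-optimal solution. As a minimal sketch of what that reference predictor computes, the snippet below works out the posterior-predictive mean and variance for a Gaussian linear-regression task. The generative model (Gaussian prior on the task weights with variance tau2, Gaussian noise with variance sigma2), the parameter values, and the function name bayes_predict are illustrative assumptions; the paper's exact data-generating process, loss, and architecture are not specified in the abstract.

```python
# Minimal sketch (assumed setup, not taken from the paper): the Bayes-optimal
# in-context predictor for a Gaussian linear-regression task, returning both
# E[Y|X] and Var(Y|X) given the in-context examples.
# Assumed generative model: w ~ N(0, tau2 * I), y = w^T x + eps, eps ~ N(0, sigma2).
import numpy as np

def bayes_predict(X_ctx, y_ctx, x_query, tau2=1.0, sigma2=0.25):
    """Posterior-predictive mean and variance given in-context examples (X_ctx, y_ctx)."""
    d = X_ctx.shape[1]
    # Posterior over w: Sigma_post = (X^T X / sigma2 + I / tau2)^{-1}, mu_post = Sigma_post X^T y / sigma2
    Sigma_post = np.linalg.inv(X_ctx.T @ X_ctx / sigma2 + np.eye(d) / tau2)
    mu_post = Sigma_post @ X_ctx.T @ y_ctx / sigma2
    mean = x_query @ mu_post                        # E[Y | x_query, context]
    var = x_query @ Sigma_post @ x_query + sigma2   # Var(Y | x_query, context)
    return mean, var

# Toy usage: one sampled task with S in-context examples.
rng = np.random.default_rng(0)
d, S, tau2, sigma2 = 5, 20, 1.0, 0.25
w = rng.normal(0.0, np.sqrt(tau2), d)
X_ctx = rng.normal(size=(S, d))
y_ctx = X_ctx @ w + rng.normal(0.0, np.sqrt(sigma2), S)
x_query = rng.normal(size=d)
mean, var = bayes_predict(X_ctx, y_ctx, x_query, tau2, sigma2)
print(f"predictive mean = {mean:.3f}, predictive variance = {var:.3f}")
```

Under this kind of setup, the paper's in-distribution finding would correspond to the trained Transformer's mean and variance outputs tracking these posterior-predictive quantities, while its out-of-distribution experiments probe whether that agreement persists under task, covariate, and prompt-length shifts.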
