Large Language Models
Published 11 Jul 2023 in cs.CL, hep-th, math.HO, and physics.comp-ph (arXiv:2307.05782v2)
Abstract: Artificial intelligence is making spectacular progress, and one of the best examples is the development of LLMs such as OpenAI's GPT series. In these lectures, written for readers with a background in mathematics or physics, we give a brief history and survey of the state of the art, and describe the underlying transformer architecture in detail. We then explore some current ideas on how LLMs work and how models trained to predict the next word in a text are able to perform other tasks displaying intelligence.
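The abstract's central point is that a decoder-only transformer is trained purely to predict the next token. As a rough sketch (not taken from the paper; all dimensions, weights, and the single-layer setup are illustrative assumptions), the core loop is: embed tokens, contextualize them with causally masked scaled dot-product attention, and read out a score over the vocabulary for the next token.

```python
import numpy as np

# Toy sketch of one step of a decoder-only language model:
# a single causal self-attention layer plus a next-token readout.
# All sizes and weights here are arbitrary illustrative choices.
rng = np.random.default_rng(0)

vocab_size, d_model = 10, 8
E = rng.normal(size=(vocab_size, d_model))        # token embedding table
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X):
    """Scaled dot-product attention with a causal mask:
    position i may only attend to positions j <= i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d_model)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf                        # hide future positions
    return softmax(scores) @ V

tokens = np.array([3, 1, 4, 1, 5])                # an arbitrary token sequence
X = E[tokens]                                     # embed
H = causal_self_attention(X)                      # contextualize
logits = H @ E.T                                  # tied readout: scores over vocab
next_token = int(np.argmax(logits[-1]))           # greedy next-token choice
```

A real model stacks many such layers, adds position information, feed-forward sublayers, and normalization, and samples from `softmax(logits[-1])` rather than taking the argmax; the lectures cover those details.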
Reproduced under the CC BY 4.0 license: https://creativecommons.org/licenses/by/4.0/