Large Language Models

Published 11 Jul 2023 in cs.CL, hep-th, math.HO, and physics.comp-ph | (arXiv:2307.05782v2)

Abstract: Artificial intelligence is making spectacular progress, and one of the best examples is the development of LLMs such as OpenAI's GPT series. In these lectures, written for readers with a background in mathematics or physics, we give a brief history and survey of the state of the art, and describe the underlying transformer architecture in detail. We then explore some current ideas on how LLMs work and how models trained to predict the next word in a text are able to perform other tasks displaying intelligence.


Summary

  • The paper demonstrates how the transformer architecture, through attention mechanisms and parallel processing, makes language modeling efficient and scalable.
  • It notes that LLMs exhibit emergent abilities while facing challenges such as limited interpretability and an occasional tendency to hallucinate facts.
  • The study highlights scaling laws and current research directions that drive improvements in the performance and applicability of LLMs.

LLMs: An Expert Essay

Introduction

The development of LLMs represents a significant advance in artificial intelligence, exemplified by OpenAI's GPT series and other prominent models. Drawing on recent lectures written for readers with a background in mathematics or physics, this essay surveys the state of the art in LLMs, describes the underlying transformer architecture, and examines the broader implications of these models for AI research and applications.

Transformer Architecture and Its Role in LLMs

The transformer model, originally proposed by Vaswani et al. in 2017, revolutionized the approach to natural language processing tasks. Characterized by its use of attention mechanisms and positional encodings, the transformer architecture captures long-range dependencies between tokens, enabling LLMs to generate coherent and contextually relevant text. Unlike recurrent neural networks, which process tokens one at a time, transformers process all positions in a sequence in parallel during training, significantly improving computational efficiency and scalability.
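To make the mechanism concrete, the sketch below implements scaled dot-product attention, the operation softmax(QK^T / sqrt(d_k)) V at the heart of the transformer. It is written in plain NumPy with toy dimensions; the shapes, the self-attention setup (Q = K = V), and the causal mask are illustrative assumptions rather than details of any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # pairwise query-key similarities
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed (e.g. future) positions
    weights = softmax(scores, axis=-1)         # each query's distribution over keys
    return weights @ V                         # weighted mixture of value vectors

# Toy example: 4 token embeddings of dimension 8, with a causal mask so each
# position attends only to itself and earlier positions (as in autoregressive LMs).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
causal_mask = np.tril(np.ones((4, 4), dtype=bool))
out = scaled_dot_product_attention(x, x, x, mask=causal_mask)
print(out.shape)  # (4, 8)
```

In a full transformer, this operation runs across many attention heads and is interleaved with feed-forward layers, residual connections, and layer normalization.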

Emergent Abilities and Interpretability Challenges

LLMs display capabilities that appear to extend beyond their training objective, such as solving complex problems and exhibiting other forms of intelligent behavior. These emergent abilities have sparked considerable interest and debate about how to interpret LLM outputs and the reasoning processes underlying them. At the same time, the lack of persistent long-term memory, unreliable logical reasoning, and a tendency to hallucinate facts pose significant challenges that researchers continue to address.

Phenomenology of LLMs and Scaling Laws

One of the key insights into the functioning of LLMs involves their scaling laws. Studies have shown that model performance improves predictably as compute, dataset size, and parameter count increase, with test loss typically falling as a power law in each of these quantities. This scaling behavior has informed the development of increasingly large models, contributing to breakthroughs in capabilities while also raising questions about the practical limits and potential saturation of performance gains.
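As a worked illustration of what such a law looks like, the snippet below evaluates a parametric ansatz of the form L(N, D) = E + A / N^alpha + B / D^beta, where N is the parameter count and D the number of training tokens. The functional form follows the compute-optimal-training literature; the constants used here are illustrative placeholders, not values reported in this paper.

```python
def predicted_loss(N, D, E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    """Power-law scaling ansatz: L(N, D) = E + A / N**alpha + B / D**beta.

    E is an irreducible loss floor; the two power-law terms shrink as the
    model (N parameters) and the dataset (D tokens) grow.
    """
    return E + A / N**alpha + B / D**beta

# Loss falls predictably as model and data are scaled up together.
for N, D in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    print(f"N={N:.0e}, D={D:.0e} -> predicted loss {predicted_loss(N, D):.3f}")
```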

Current Research Directions and Open Questions

Research into LLMs increasingly focuses on understanding their inner workings, including whether and how world models and algorithms are represented within their networks. Approaches range from probing model activations to hypothesizing computational mechanisms that these systems might use to perform tasks. The controversy over whether LLMs genuinely understand language or are merely sophisticated statistical systems continues to fuel vigorous academic discourse. Furthermore, capabilities such as in-context learning and zero-shot task performance have positioned LLMs as versatile tools in AI applications.
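As one concrete instance of the probing approach mentioned above, the minimal sketch below trains a linear classifier on hidden activations to test whether a target property is linearly decodable from them. The activations and labels here are synthetic stand-ins; in an actual probing study they would come from a real LLM layer and an annotated dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_examples, hidden_dim = 2000, 64

# Synthetic stand-in for hidden states and for a binary property we probe for.
H = rng.normal(size=(n_examples, hidden_dim))
w_true = rng.normal(size=hidden_dim)
y = (H @ w_true > 0).astype(int)

H_train, H_test, y_train, y_test = train_test_split(H, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(H_train, y_train)
print(f"probe accuracy: {probe.score(H_test, y_test):.3f}")
# Accuracy well above chance suggests the property is linearly represented in
# the activations; chance-level accuracy suggests it is not (at least not linearly).
```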

Conclusion

The advancement of LLMs has undeniably transformed the field of AI, opening up new horizons for research and application. While significant progress has been made in understanding and leveraging these models, many questions remain unanswered, particularly concerning their interpretability and long-term potential for achieving artificial general intelligence. As the field evolves, continued exploration of these models will likely yield further insights into their capabilities and limitations, shaping the future of AI technologies.
