The Expressive Power of Transformers with Chain of Thought
Abstract: Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad", i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps, assuming projected pre-norm (a slight generalization of standard pre-norm), adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps with generalized pre-norm make them recognize exactly the class of polynomial-time solvable problems -- the first exact characterization of a type of transformers in terms of standard complexity classes. Together, this provides a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.
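To make the headline result concrete, here is a minimal sketch (not the paper's actual transformer construction) of why a linear number of intermediate decoding steps suffices for regular languages: a decoder that emits one "state" token per input symbol is effectively running a DFA, with the scratchpad recording the state sequence. The DFA below, accepting binary strings with an even number of 1s, is an illustrative assumption, not an example from the paper.

```python
# Hedged illustration: recognizing a regular language with a linear-length
# chain of thought. Each scratchpad token records the current DFA state,
# so the final token determines the answer.

# Toy DFA: accepts binary strings containing an even number of 1s.
DFA = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
ACCEPTING = {"even"}

def decode_with_scratchpad(input_str):
    """Return (chain, answer): one intermediate token per input symbol."""
    state = "even"  # start state
    chain = []
    for symbol in input_str:
        state = DFA[(state, symbol)]
        chain.append(state)  # intermediate generation: emit the state
    return chain, state in ACCEPTING

chain, accepted = decode_with_scratchpad("1101")
# len(chain) == len(input): a linear number of decoding steps.
```

A transformer answering immediately (zero scratchpad tokens) cannot, under the conjectures discussed in the abstract, simulate this state-tracking in general; conditioning on its own intermediate tokens is what restores the sequential computation.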