What Formal Languages Can Transformers Express? A Survey

Published 1 Nov 2023 in cs.LG, cs.CL, cs.FL, and cs.LO | (2311.00208v3)

Abstract: As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal languages. Exploring such questions can help clarify the power of transformers relative to other models of computation, their fundamental capabilities and limits, and the impact of architectural choices. Work in this subarea has made considerable progress in recent years. Here, we undertake a comprehensive survey of this work, documenting the diverse assumptions that underlie different results and providing a unified framework for harmonizing seemingly contradictory findings.

Summary

The paper surveys how different transformer architectures express formal languages, mapping their capabilities to established computational complexity classes.
It contrasts encoder-only, decoder-only, and encoder-decoder models, showing how results range from basic counting to Turing-completeness depending on the setup.
The study identifies that specific attention mechanisms constrain expressivity, with hard attention limiting transformers to $AC^0$ and softmax attention raising this bound to $TC^0$.

Introduction to Transformers in NLP

Transformers have reshaped NLP by providing versatile architectures for tasks such as machine translation and language modeling, with pretrained models like BERT and GPT as prominent examples. Researchers have since turned to the theory of transformers to characterize their formal capabilities, focusing on expressivity: how transformers can be treated as recognizers or generators of formal languages, and how much computational power they have.

Framework and Variants of Transformers

A transformer comprises an input layer, a stack of hidden layers, and an output layer. The input layer uses word and position embeddings to map a sequence of symbols to a sequence of vectors, and the hidden layers use attention to determine how positions in the sequence interact. The paper carefully defines the transformer variants it covers, distinguishing encoder-only, decoder-only, and encoder-decoder architectures. Importantly, the choice of position embedding and of attention mechanism (standard softmax, hard argmax, or average-hard) determines how much expressive power a transformer has.
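
The three attention variants named above differ only in how attention scores are turned into weights. The following is a minimal sketch of that difference, not the paper's formal definitions: the function names are hypothetical, and breaking ties toward the leftmost position in the hard variant is an assumption made for illustration.

```python
import math

def softmax_weights(scores):
    """Standard softmax attention: every position receives nonzero weight."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def leftmost_hard_weights(scores):
    """Hard (argmax) attention: all weight on one maximal score.
    Choosing the leftmost maximum is an assumption of this sketch."""
    best = scores.index(max(scores))
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

def average_hard_weights(scores):
    """Average-hard attention: weight shared equally by all maximal scores."""
    m = max(scores)
    n_ties = sum(1 for s in scores if s == m)
    return [1.0 / n_ties if s == m else 0.0 for s in scores]

scores = [2.0, 5.0, 1.0, 5.0]
print(softmax_weights(scores))
print(leftmost_hard_weights(scores))
print(average_hard_weights(scores))
```

With tied maxima, as in `scores` here, the hard variant attends to a single position while the average-hard variant averages over both, which is what lets the latter count.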

Theoretical Characterization of Expressivity

Numerous formal models serve as benchmarks against which the expressivity of transformers is measured. Transformers have been related to automata, Turing machines, and complexity classes such as $NL$ and $P$. Roughly, the languages involved range from finite languages to more complex languages higher in the Chomsky hierarchy.

Figure 1: Relationship of some languages and language classes discussed in this paper (right) to the Chomsky hierarchy (left), assuming that $TC^0 \subsetneq NC^1$ and $L \subsetneq NL$.
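
For orientation, the standard inclusions among the circuit and machine classes that appear in this summary can be written as a single chain. Only the first inclusion is known to be strict; the strictness of the others is open, which is why the figure caption states its assumptions explicitly.

```latex
\[
AC^0 \subsetneq TC^0 \subseteq NC^1 \subseteq L \subseteq NL \subseteq P
\]
```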

Lower Bounds and Capabilities

Transformers exhibit varied expressivity, contingent on the architectural choice and configuration:

Simple Languages such as Majority and Dyck: Basic transformer configurations, even without extensive intermediate steps or particular precision constraints, can handle simple counting problems such as Majority and balanced-bracket languages such as Dyck.
Recursively Enumerable Languages: With theoretical augmentations such as specific position embeddings and unbounded computation steps, transformers have been proven to recognize recursively enumerable languages, matching the expressivity of Turing machines.
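
To make the counting examples concrete, the sketch below shows why uniform (averaging) attention is enough for them. It is an illustration under stated assumptions rather than a construction from the paper: Majority is taken to be the binary strings with strictly more 1s than 0s, Dyck is restricted to a single bracket pair, and the helper names are hypothetical.

```python
def uniform_attention_mean(values):
    """What one uniform-attention head computes: the mean of the values."""
    return sum(values) / len(values)

def in_majority(w):
    """Strictly more 1s than 0s (assumed definition of Majority)."""
    if not w:
        return False
    # Encode 1 as +1 and any other symbol as -1; the sign of the mean decides.
    return uniform_attention_mean([1 if c == "1" else -1 for c in w]) > 0

def in_dyck1(w):
    """Balanced strings over '(' and ')' (input assumed to use only these)."""
    values = [1 if c == "(" else -1 for c in w]
    # Causal uniform attention: position i averages over positions 0..i.
    prefix_means = [uniform_attention_mean(values[: i + 1])
                    for i in range(len(values))]
    # No prefix may close more brackets than it opens, and the total must be 0.
    never_negative = all(m >= 0 for m in prefix_means)
    balanced = not values or prefix_means[-1] == 0
    return never_negative and balanced

print(in_majority("1101"), in_majority("1100"))
print(in_dyck1("(())()"), in_dyck1("())("))
```

Both recognizers reduce to comparing averages against zero, so no step needs more than a mean over the positions seen so far.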

Upper Bounds and Limitations

Despite these capabilities, the choice of attention mechanism places upper bounds on what transformers can recognize:

$AC^0$ Limitations: Hard attention transformers can only recognize languages within the low circuit complexity class $AC^0$.
$TC^0$ Bounds: Softmax and average-hard attention raise this bound to $TC^0$, which includes counting operations. Assuming $TC^0 \subsetneq NC^1$, however, such transformers still cannot solve problems such as evaluating arbitrary Boolean formulas or permutation group word problems.
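
As a concrete instance of a permutation group word problem, the sketch below decides whether a sequence of permutations multiplies to the identity. The choice of five elements and the function names are assumptions for illustration. The code is an ordinary sequential program, not a transformer: it carries a running product across the whole input, and the upper bounds above suggest that a fixed-depth transformer cannot do the same.

```python
def compose(p, q):
    """Return the permutation 'apply p, then q' on {0, ..., n-1}."""
    return tuple(q[p[i]] for i in range(len(p)))

def word_is_identity(word, n=5):
    """True if the permutations in `word`, applied in order, give the identity."""
    identity = tuple(range(n))
    acc = identity
    for perm in word:
        acc = compose(acc, perm)  # running product, updated symbol by symbol
    return acc == identity

swap01 = (1, 0, 2, 3, 4)  # exchanges elements 0 and 1
cycle = (1, 2, 3, 4, 0)   # sends i to i + 1 modulo 5

print(word_is_identity([swap01, swap01]))
print(word_is_identity([cycle] * 5))
print(word_is_identity([swap01, cycle]))
```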

Conclusions and Future Directions

The survey lays out a framework for comparing the widely varying expressivity results for transformers, and notes that encoder-decoder variants with unbounded steps attain Turing-completeness. Ongoing work aims to examine the configuration details that influence expressivity, including the role of embeddings and numeric precision. By isolating these assumptions, the theoretical foundation invites new inquiries into using transformers efficiently for tasks historically deemed computationally intensive.

In closing, further work is warranted on how the details of embeddings and other configuration choices affect practice, including real-time learning applications, in order to narrow the gap between theoretical models and deployed NLP systems.