Pushdown Layers: Encoding Recursive Structure in Transformer Language Models

Published 29 Oct 2023 in cs.CL | (2310.19089v1)

Abstract: Recursion is a prominent feature of human language, and fundamentally challenging for self-attention due to the lack of an explicit recursive-state tracking mechanism. Consequently, Transformer language models poorly capture long-tail recursive structure and exhibit sample-inefficient syntactic generalization. This work introduces Pushdown Layers, a new self-attention layer that models recursive state via a stack tape that tracks estimated depths of every token in an incremental parse of the observed prefix. Transformer LMs with Pushdown Layers are syntactic language models that autoregressively and synchronously update this stack tape as they predict new tokens, in turn using the stack tape to softly modulate attention over tokens -- for instance, learning to "skip" over closed constituents. When trained on a corpus of strings annotated with silver constituency parses, Transformers equipped with Pushdown Layers achieve dramatically better and 3-5x more sample-efficient syntactic generalization, while maintaining similar perplexities. Pushdown Layers are a drop-in replacement for standard self-attention. We illustrate this by finetuning GPT2-medium with Pushdown Layers on an automatically parsed WikiText-103, leading to improvements on several GLUE text classification tasks.

Summary

The paper introduces Pushdown Layers, an architectural innovation inspired by pushdown automata, to enable recursion handling in Transformer models.
It integrates a stack-tape memory and attachment head to dynamically track token depths and build parse trees during language processing.
Experimental results show that Pushdown Layers significantly improve syntactic generalization and sample efficiency on both synthetic and natural language datasets.

Pushdown Layers for Transformer LMs

The paper "Pushdown Layers: Encoding Recursive Structure in Transformer Language Models" introduces a new self-attention layer for handling recursive structure in Transformer LMs. Inspired by pushdown automata, Pushdown Layers augment the model with a stack-tape memory that records the structural state of an incremental parse, which enables more sample-efficient syntactic generalization.

Recursive State Modeling

Human languages exhibit recursive structure, which is fundamental to syntax. Recurrent models such as RNNs carry an explicit state that can track recursion, whereas Transformers rely on self-attention and have no explicit mechanism for tracking recursive state. Pushdown Layers add such a mechanism: a stack tape that records the estimated depth of every token in an incremental parse of the observed prefix, which the layer then uses to modulate attention.

Figure 1: Pushdown Layers model recursive states using a stack-tape for token depth tracking, biasing attention towards recursive syntactic computations.

Implementation of Pushdown Layers

Architecture

Pushdown Layers act as a drop-in replacement for standard self-attention layers. They comprise:

Stack Tape: This memory mechanism stores and updates token depths during parsing, simulating a pushdown automaton's operations.
Attachment Head: Predicts an attachment operation for each newly predicted token. It uses attention to select the constituent of the incremental parse that the new token should be combined with, and this decision determines how the stack tape is updated.

Figure 2: Demonstration of parse building via stack-tape updates in Pushdown LMs, where attention mechanisms drive attachment decisions.
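The interaction between the two components can be illustrated with a toy stack-tape update. This is a minimal sketch and not the paper's exact update rule: it assumes that attaching a new token to a constituent starting at index j creates a larger constituent and pushes every token inside it one level deeper, and that a token with no attachment enters at depth 0. The function name `update_stack_tape` and its arguments are hypothetical.

```python
def update_stack_tape(depths, attach_start=None):
    """Return the stack tape after one new token arrives.

    depths: depths of tokens 0..k-1 in the parse of the prefix.
    attach_start: index j of the first token of the constituent the new
        token attaches to, or None if the token is only shifted.
    ASSUMPTION: attachment deepens every token in the merged span by one.
    """
    k = len(depths)
    new_depths = list(depths) + [0]  # the new token starts at depth 0
    if attach_start is not None:
        if not 0 <= attach_start < k:
            raise ValueError("attach_start must index an existing token")
        for i in range(attach_start, k + 1):
            new_depths[i] += 1  # tokens j..k now sit under a new node
    return new_depths


# Incrementally build ((the dog) barks).
tape = []
tape = update_stack_tape(tape)                   # "the": shift
tape = update_stack_tape(tape, attach_start=0)   # "dog" joins "the"
tape = update_stack_tape(tape, attach_start=0)   # "barks" joins (the dog)
print(tape)
```

Under this assumed rule the final tape matches the depth of each word in the bracketing ((the dog) barks): the two words of the inner constituent are one level deeper than the verb.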

Computational Considerations

Integrating Pushdown Layers into a Transformer changes the attention computation from 2D matrix operations to 3D tensor operations. This leaves the FLOP count unchanged but slightly increases memory use, because the layer has to store depth-augmented key tensors for the attention computation.
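To see why the keys become a 3D object, consider one plausible reading of "depth-augmented keys": the key for token j, as seen from query position i, is shifted by an embedding of token j's depth at step i. The sketch below uses plain Python lists for clarity. The additive form, the `depth_emb` table, and the function name are assumptions for illustration, not the paper's implementation.

```python
import math


def depth_augmented_attention(queries, keys, depth_emb, tape):
    """Causal attention weights with depth-dependent keys.

    queries, keys: n vectors of d floats each.
    depth_emb: depth_emb[m] is a d-vector for depth m (hypothetical table).
    tape: tape[i][j] is the depth of token j as seen at step i (j <= i).
    The key used for the pair (i, j) depends on both indices, so the keys
    conceptually form an n x n x d tensor instead of an n x d matrix.
    """
    n, d = len(queries), len(queries[0])
    weights = []
    for i in range(n):
        logits = []
        for j in range(i + 1):  # causal: only attend to the prefix
            emb = depth_emb[tape[i][j]]
            key_ij = [keys[j][c] + emb[c] for c in range(d)]
            dot = sum(q * k for q, k in zip(queries[i], key_ij))
            logits.append(dot / math.sqrt(d))
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


queries = [[1.0, 0.0]] * 3
keys = [[1.0, 0.0]] * 3
tape = [[0, 0, 0], [1, 1, 0], [2, 2, 1]]
# A strongly negative embedding at depth 2 lets the last query "skip"
# the two tokens that sit inside a closed constituent.
depth_emb = [[0.0, 0.0], [0.0, 0.0], [-10.0, 0.0]]
attn = depth_augmented_attention(queries, keys, depth_emb, tape)
```

The per-pair dot products are the same size as in standard attention, which is consistent with the unchanged FLOP count, while the n x n x d key tensor accounts for the extra memory.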

Training and Inference

During training, models jointly learn next-word prediction and attachment operations from parse-annotated text, and otherwise process data in the same way as standard LMs. During inference, the parse structure is updated dynamically as tokens are generated, and beam search is used to explore the combinatorial space of attachment decisions and recover parse trees.
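The inference procedure can be sketched as a beam search over attachment decisions. This is an illustrative sketch only: the scoring callback `attach_logprob`, the convention that choosing the token's own index means "no attachment", and the toy scorer are all assumptions, and a real model would also score the words themselves.

```python
import math


def beam_search_attachments(num_tokens, attach_logprob, beam_size=2):
    """Keep the beam_size best sequences of attachment decisions.

    attach_logprob(step, history, choice) returns the log-probability of
    `choice` for token `step`, given earlier decisions `history`.
    ASSUMPTION: choice == step means the token is shifted without attaching;
    choice j < step means it attaches to the constituent starting at j.
    """
    beams = [((), 0.0)]  # (decisions so far, total log-probability)
    for step in range(num_tokens):
        candidates = []
        for history, score in beams:
            for choice in range(step + 1):
                new_score = score + attach_logprob(step, history, choice)
                candidates.append((history + (choice,), new_score))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


def toy_scorer(step, history, choice):
    """Hypothetical scorer that prefers attaching at index 0."""
    num_choices = step + 1
    if num_choices == 1:
        return 0.0  # only one option, probability 1
    if choice == 0:
        return math.log(0.7)
    return math.log(0.3 / (num_choices - 1))


best = beam_search_attachments(3, toy_scorer, beam_size=2)
```

With the toy scorer, the top hypothesis attaches every token at index 0. A larger beam trades computation for a lower chance of discarding the best parse early.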

Experimentation and Results

Synthetic and Natural Language Evaluation

Experiments covered formal languages such as Dyck and natural language datasets, including BLLIP-lg and a new "WikiTrees" corpus. Pushdown Layers generalized better on recursive structures than baseline LMs while requiring significantly less training data.
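For readers unfamiliar with Dyck languages, they consist of well-nested bracket strings, which makes them a clean test of recursive-state tracking. The sketch below checks membership and reports the nesting depth at each bracket. The two bracket types and the function name are choices made for this illustration and say nothing about the exact Dyck variant used in the paper.

```python
CLOSE_TO_OPEN = {")": "(", "]": "["}
OPENERS = set(CLOSE_TO_OPEN.values())


def dyck_depths(tokens):
    """Return the nesting depth at each bracket, or None if not well nested."""
    stack, depths = [], []
    for tok in tokens:
        if tok in OPENERS:
            stack.append(tok)
            depths.append(len(stack))
        elif tok in CLOSE_TO_OPEN:
            # A closing bracket must match the most recent open bracket.
            if not stack or stack[-1] != CLOSE_TO_OPEN[tok]:
                return None
            depths.append(len(stack))
            stack.pop()
        else:
            raise ValueError(f"unexpected token: {tok!r}")
    return depths if not stack else None


print(dyck_depths("([])[]"))
print(dyck_depths("(]"))
```

Recognizing such strings requires remembering every unclosed bracket, which is the kind of state a stack provides and plain self-attention lacks an explicit mechanism for.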

Sample Efficiency

Pushdown LMs needed far less data to reach a given level of syntactic generalization. The abstract reports 3-5x better sample efficiency: Pushdown LMs matched the syntactic abilities of standard LMs with considerably fewer training examples.

Figure 3: Pushdown Layers improve sample efficiency in syntactic generalization over multiple datasets.

Syntactic Language Modeling

The paper measured syntactic generalization through controlled experiments on tasks such as subject-verb agreement. Pushdown LMs significantly outperformed standard models, particularly in contexts that contain distractor nouns between the subject and the verb.

Figure 4: Attention maps show Pushdown-LM's focus shift away from distractor nouns to maintain subject-verb agreement.
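Agreement tests of this kind are commonly scored with minimal pairs: a model is counted as correct when it assigns higher probability to the grammatical sentence than to its ungrammatical twin. The sketch below shows that scoring scheme with a hand-written stand-in for a language model. The sentences, the word lists, and `toy_logprob` are all hypothetical and are not taken from the paper's evaluation data.

```python
NOUNS = {"key", "keys", "cabinet", "cabinets"}
PLURAL_NOUNS = {"keys", "cabinets"}
VERBS = {"is", "are"}
PLURAL_VERBS = {"are"}


def toy_logprob(sentence, use_head_noun=True):
    """Stand-in scorer: penalize verbs that disagree with one chosen noun.

    use_head_noun=True agrees with the first noun (the true subject);
    use_head_noun=False agrees with the nearest noun (the distractor).
    """
    words = sentence.split()
    nouns = [w for w in words if w in NOUNS]
    verb = next(w for w in words if w in VERBS)
    noun = nouns[0] if use_head_noun else nouns[-1]
    agrees = (noun in PLURAL_NOUNS) == (verb in PLURAL_VERBS)
    return 0.0 if agrees else -5.0


def agreement_accuracy(pairs, logprob):
    """Fraction of (grammatical, ungrammatical) pairs ranked correctly."""
    correct = sum(1 for good, bad in pairs if logprob(good) > logprob(bad))
    return correct / len(pairs)


pairs = [
    ("the keys to the cabinet are here", "the keys to the cabinet is here"),
    ("the key to the cabinets is here", "the key to the cabinets are here"),
]
head_acc = agreement_accuracy(pairs, toy_logprob)
distracted_acc = agreement_accuracy(
    pairs, lambda s: toy_logprob(s, use_head_noun=False)
)
```

The scorer that tracks the head noun ranks both pairs correctly, while the one that agrees with the nearest noun fails on both, which mirrors the distractor effect described in Figure 4.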

Implications and Future Work

Pushdown Layers represent a step toward embedding explicit structural biases within neural LMs, targeting recursive phenomena. This development has the potential to expand syntactic learning in low-resource scenarios, augment semantic parsing in NLP applications, and provide a template for extending formal language understanding capacities in neural architectures.

Future directions include learning structural biases without supervision, extending syntactic scaffolds to domains beyond natural language, and supporting non-constituency formalisms such as dependency structures within the framework.

Conclusion

Pushdown Layers give Transformer LMs an explicit mechanism for processing recursive structure, yielding better and more sample-efficient syntactic generalization while maintaining similar perplexities. The approach makes Transformers more robust when modeling linguistically intricate data.
