Algorithmic Capabilities of Random Transformers

Published 6 Oct 2024 in cs.LG, cs.AI, and cs.CL | (2410.04368v1)

Abstract: Trained transformer models have been found to implement interpretable procedures for tasks like arithmetic and associative recall, but little is understood about how the circuits that implement these procedures originate during training. To what extent do they depend on the supervisory signal provided to models, and to what extent are they attributable to behavior already present in models at the beginning of training? To investigate these questions, we investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized, so that the only input--output mappings learnable from data are those already implemented (up to a choice of encoding scheme) by the randomly initialized model. We find that these random transformers can perform a wide range of meaningful algorithmic tasks, including modular arithmetic, in-weights and in-context associative recall, decimal addition, parenthesis balancing, and even some aspects of natural language text generation. Our results indicate that some algorithmic capabilities are present in transformers (and accessible via appropriately structured inputs) even before these models are trained. Code is available at https://github.com/fjzzq2002/random_transformers.

Abstract PDF HTML Upgrade to Chat

Summary

The paper demonstrates that random transformers achieve 100% accuracy in modular arithmetic tasks via embedding-only training, highlighting innate algorithmic encoding.
Random transformers effectively perform associative recall, decimal addition, and parenthesis balancing by leveraging low-dimensional subspace manipulation.
Language modeling experiments reveal that while fully trained models excel, random transformers generate coherent outputs with appropriately large hidden sizes.

Algorithmic Capabilities of Random Transformers

Introduction

The paper explores the innate capabilities of randomly initialized transformer models, analyzing their potential to encode specific algorithmic tasks without training the complete network. By focusing primarily on the embedding layers and assessing their algorithmic performance, the work investigates an alternative approach to understanding how transformers learn and generalize specific functions present even before training starts.

Model Setup and Initialization

The transformer models under consideration are decoder-only architectures truncated to manipulate only their embedding and unembedding layers. The model parameters are randomly initialized primarily based on Gaussian distributions with standard deviations conditioned by the number of layers and dimensions. The training methodology contrasts full training of all parameters against optimizing solely the embedding layers, a process termed as "embedding-only training."

Algorithmic Tasks Assessment

Modular Addition: Random transformers manage modular arithmetic, provable by attaining 100% accuracy in tests, hence indicating successful internalization of addition within a constrained modulus.
Needle-in-a-Haystack: These transformers accurately performed associative recall tasks, reaffirming their ability to recognize and map inputs.
Decimal Addition and Parenthesis Balancing: Both tasks observed close-to-perfect accuracy when random transformers were employed, elucidating their capacity to manage sequence-oriented operations and context-free grammars necessary for tasks like arithmetic and syntax validation.

Figure 1: Illustration of the setup and performances.

In each scenario, the experiment demonstrated a significant capacity for random transformers to internalize operations through low-dimensional subspace manipulation, evident even in preliminary post-initialization states.

Language Modeling and Memorization

Language modeling reveals a stark contrast between fully trained models and random transformers. While fully trained transformers manage a loss lower than embedding-optimized counterparts, the random models still deliver coherent, grammatically correct outputs if large hidden sizes are provided.

Figure 2: Language modeling performances (measured in cross-entropy loss or equivalently log perplexity, lower is better) for fully trained and random transformers. Comparatively large hidden sizes are needed for random models to match the performance of fully trained models.

Such experimental setups testify to the potential these models have in understanding syntax, although their efficiency and accuracy shrink compared to fully trained versions.

Dimensional Analysis

Across several experiments, random transformers consistently operated within constrained and notably lower-dimensional subspaces, efficiently embedding calculations necessary for algorithmic task resolution. This phenomenon, termed "subspace selection," witnesses low variance explained by the primary components of these models’ internal representations.

Figure 3: During training, the accuracy curve from fully trained and random transformers in the memorization task (\cref{sec:memorization}.

Conclusion

Random transformers exhibit a surprising proficiency in learning and solving complex tasks using minimal training through selective subspace manipulation. Their innate ability to encode features and contextual rules without comprehensive training suggests profound implications for understanding the capabilities of neural circuits in transformers. Future work should continue exploring this avenue to refine insights into how these models can be optimized for task-specific implementations with reduced computational resources. The study provides a rich landscape for further inquiry into how initialization impacts learning trajectories in AI systems.

Markdown