Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Published 28 Jan 2025 in cs.CL and cs.LG | (2501.16975v2)

Abstract: Tokenization is a fundamental component of LLMs, yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.

Abstract PDF Upgrade to Chat

Summary

The paper introduces Over-Encoding to expand input vocabularies, revealing a log-linear relationship between vocabulary size and training loss.
It decouples input and output vocabularies using n-gram embeddings, improving efficiency without incurring extra computational costs.
The methodology employs tensor parallelism to handle large embedding tables, achieving faster convergence with minimal overhead.

Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

The paper "Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling" (2501.16975) investigates the impact of tokenization on the performance and scalability of LLMs. It introduces the Over-Tokenized Transformer framework, proposing a novel approach to decouple input and output vocabularies to enhance LLM efficiency and efficacy.

Introduction

The research highlights that while tokenization is crucial in LLMs, its effects on scaling laws and performance are underexplored. The study reveals that expanding the input vocabulary size through multi-gram token embeddings can significantly improve model scalability. It establishes a log-linear relationship between input vocabulary size and training loss, suggesting that larger vocabularies enhance model performance without additional computational costs. Decoupling input and output vocabularies allows the model to leverage larger vocabularies more effectively for input while avoiding the computational penalties of expanding output vocabularies for smaller models.

Figure 1: Scaling trend for Over-Encoded models and baselines on OLMo2 during 400B tokens' training.

Tokenization Design and Scaling Vocabulary

The research examines various tokenization strategies, such as Byte-Pair Encoding (BPE) and n-gram modeling, concluding that larger tokenizer vocabularies benefit LLMs by reducing sequence length and enhancing training efficiency. The paper also emphasizes separate considerations for embedding and unembedding (input and output vocabularies), noting that they exhibit distinct scaling behaviors.

Figure 2: Performance comparison for models trained on CFG data, showing the impact of 1-gram and 3-gram tokenizers.

Over-Tokenized Transformer (OT)

Over-Encoding (OE)

The study introduces Over-Encoding (OE) to scale input vocabularies using large hierarchical n-gram embeddings. OE is shown to bring significant performance improvements, enabling smaller models to match larger baseline models without additional costs. The technique involves creating a configurable vocabulary size extendable via matrix decompositions, allowing efficient embedding lookups.

Figure 3: Illustration of 2-gram encoding/decoding GPT, preserving next-token prediction while maintaining inference cost.

Over-Decoding (OD) and OT Integration

Over-Decoding (OD) employs larger output vocabularies for detailed supervision, typically effective for larger models. Combining OE and OD results in the Over-Tokenized Transformer, leveraging advanced multi-token prediction methods to enhance model learning. The integration amplifies the advantages of wide token embeddings and improved supervision for future token predictions, achieving better performance with minimal additional costs.

Engineering Challenges and Solutions

Practical implementation of OE presents memory and computational challenges, particularly with large embedding tables. The paper suggests using tensor parallelism to mitigate overheads and proposing novel training frameworks to optimize embedding lookups and offload management. The implementation demonstrates less than 5% overhead when scaling to very large models.

Figure 4: Training curves for OE-12.8M and baseline model on OLMo2-1B, displaying significant convergence acceleration.

Conclusion

The paper emphasizes the importance of tokenization in LLM scaling, proposing strategies to optimize tokenizer design for enhanced performance and efficiency. The introduction of the Over-Tokenized Transformer demonstrates the potential of decoupling vocabularies and leveraging larger input vocabularies to drive the advancements of LLMs. The findings set the stage for further exploration into tokenization strategies and their integration into future models, highlighting independent scaling of vocabularies as a pivotal factor in LLM evolution.