
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models

Published 28 Oct 2024 in cs.CL, cs.AI, and cs.LG | arXiv:2410.20771v3

Abstract: Models that rely on subword tokenization have significant drawbacks, such as sensitivity to character-level noise like spelling errors and inconsistent compression rates across different languages and scripts. While character- or byte-level models like ByT5 attempt to address these concerns, they have not gained widespread adoption -- processing raw byte streams without tokenization results in significantly longer sequence lengths, making training and inference inefficient. This work introduces MrT5 (MergeT5), a more efficient variant of ByT5 that integrates a token deletion mechanism in its encoder to dynamically shorten the input sequence length. After processing through a fixed number of encoder layers, a learned delete gate determines which tokens are to be removed and which are to be retained for subsequent layers. MrT5 effectively "merges" critical information from deleted tokens into a more compact sequence, leveraging contextual information from the remaining tokens. In continued pre-training experiments, we find that MrT5 can achieve significant gains in inference runtime with minimal effect on performance, as measured by bits-per-byte. Additionally, with multilingual training, MrT5 adapts to the orthographic characteristics of each language, learning language-specific compression rates. Furthermore, MrT5 shows comparable accuracy to ByT5 on downstream evaluations such as XNLI, TyDi QA, and character-level tasks while reducing sequence lengths by up to 75%. Our approach presents a solution to the practical limitations of existing byte-level models.

Summary

  • The paper introduces a dynamic token deletion gating mechanism that compresses sequence lengths by up to 80% while preserving contextual integrity.
  • The paper reports a 39.92% reduction in inference runtime compared to ByT5, highlighting substantial improvements in efficiency across languages.
  • The paper demonstrates robust cross-lingual transfer abilities, with effective zero-shot and multilingual performance, broadening its deployment potential.

Overview of MrT5: Dynamic Token Merging for Efficient Byte-level LLMs

The paper presents MrT5 (MergeT5), an advancement in byte-level LLMs that addresses the challenges faced by existing models such as ByT5. Traditional subword tokenizers, such as byte-pair encoding or SentencePiece, are sensitive to character-level noise and compress different languages and scripts at inconsistent rates. Byte-level models sidestep tokenization by processing raw byte streams directly, but this produces much longer sequences, which impedes training and inference efficiency. MrT5 offers a solution: a dynamic token deletion mechanism that shortens byte-level sequences while requiring only minimal architectural changes, so it can be adapted from existing pre-trained models like ByT5.

Key Contributions

MrT5 integrates a token deletion gating mechanism within its encoder to shorten the sequence dynamically. After a fixed number of initial encoder layers, a learned delete gate determines which tokens to retain, merging salient contextual information from deleted tokens into the reduced sequence. This strategy condenses the necessary context without significant performance loss, achieving up to an 80% reduction in sequence length.
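A minimal sketch of how such a delete gate could operate, assuming a per-token sigmoid score with hard thresholding at inference; the function and parameter names below are illustrative, not the paper's implementation:

```python
import numpy as np

def delete_gate(hidden, w, b, threshold=0.5):
    """Score each token with a sigmoid gate; tokens scoring below the
    threshold are dropped from the sequence (hard deletion at inference).
    `hidden` is (seq_len, d_model); `w` (d_model,) and `b` are the gate's
    learned parameters."""
    scores = 1.0 / (1.0 + np.exp(-(hidden @ w + b)))  # sigmoid per token
    keep_mask = scores >= threshold
    return hidden[keep_mask], keep_mask

# Toy example: 8 byte-level token states of width 4.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 4))
w = rng.normal(size=4)
compact, keep_mask = delete_gate(hidden, w, b=0.0)
print(compact.shape[0], "of", hidden.shape[0], "tokens retained")
```

During training, a hard drop like this is not differentiable, so a soft version of the gate (e.g. adding the log-gate score as a bias to the attention logits) would be used instead, with hard deletion applied only at inference time.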

Numerical Findings

MrT5 delivers substantial improvements in inference runtime without compromising accuracy. Models that delete over 50% of tokens reduce inference runtime by up to 39.92% relative to ByT5. These efficiency gains hold consistently across languages, and the model remains robust on cross-lingual tasks such as the XNLI benchmark.
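The quadratic cost of self-attention helps explain why even partial deletion yields large savings. As a rough, illustrative estimate (not a measurement from the paper), suppose the gate sits after layer 3 of a 12-layer encoder and keeps half the tokens:

```python
# Back-of-envelope estimate: self-attention cost in a layer scales with
# seq_len**2, so deleting tokens after the gate layer shrinks the cost of
# every remaining encoder layer quadratically.
def relative_attention_cost(n_layers, gate_layer, keep_fraction):
    """Attention cost of the gated encoder relative to the ungated one."""
    full = gate_layer                                  # pre-gate layers: full length
    reduced = (n_layers - gate_layer) * keep_fraction ** 2  # post-gate layers
    return (full + reduced) / n_layers

# 12-layer encoder, gate after layer 3, 50% of tokens kept:
print(relative_attention_cost(12, 3, 0.5))
```

Under these assumptions the post-gate layers run at a quarter of their original attention cost, so the encoder's total attention cost falls to well under half, consistent in spirit with the runtime reductions reported above.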

Cross-lingual Transfer and Multilingual Training

The model shows strong cross-lingual transfer. In zero-shot evaluations, MrT5 trained solely on English applied its token-merging strategy effectively to other languages. With multilingual training, it achieved higher deletion rates at lower loss, adapting its compression rate to the orthographic characteristics of each script.

Implications and Future Directions

MrT5's advancements in reducing sequence length and computational load present practical implications for deploying LLMs in resource-constrained environments. The ability to maintain performance while lowering inference times expands potential deployment scenarios, particularly in latency-sensitive applications. Theoretical implications include further exploration of dynamic token merging mechanisms and possibly extending the model's capabilities to encompass broader, more generalized language tasks.

Future research may refine the gating mechanism, offer more granular control over deletion rates, or integrate additional contextual cues to improve performance on complex semantic tasks. The model also provides a promising framework for designing efficient LLMs that minimize reliance on traditional subword tokenization.

Conclusion

MrT5 stands as a compelling contribution to LLM efficiency, particularly in its approach to byte-level modeling. It elegantly combines the robustness of existing architectures with innovative mechanisms for token merging, setting a precedent for future advancements in efficient NLP frameworks. By integrating a delete gate, MrT5 reduces the drawbacks of prior models, making strides towards more adaptable and computation-efficient language processing tools. This paper posits MrT5 as a feasible path forward in circumventing the limitations of traditional tokenization, paving the way for broader implementation of byte-level models in diverse linguistic contexts.
