Adapting Language Models via Token Translation
Abstract: Modern LLMs use a fixed tokenizer to effectively compress text drawn from a source domain. However, applying the same tokenizer to a new target domain often leads to inferior compression, more costly inference, and reduced semantic alignment. To address this deficiency, we introduce Sparse Sinkhorn Token Translation (S2T2). S2T2 trains a tailored tokenizer for the target domain and learns to translate between target and source tokens, enabling more effective reuse of the pre-trained next-source-token predictor. In our experiments with finetuned English LLMs, S2T2 improves both the perplexity and the compression of out-of-domain protein sequences, outperforming direct finetuning with either the source or target tokenizer. In addition, we find that token translations learned for smaller, less expensive models can be directly transferred to larger, more powerful models to reap the benefits of S2T2 at lower cost.
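The abstract does not spell out the optimization, but the method's name points to a Sinkhorn-normalized translation matrix between the target and source vocabularies, which can then be combined with the frozen source embeddings so the pre-trained next-source-token predictor is reused. Below is a minimal, hypothetical PyTorch sketch of that idea; all names and sizes (`sinkhorn`, `translation_scores`, `V_tgt`, `V_src`, `d_model`) are illustrative assumptions rather than the paper's implementation, and the sparsity component of S2T2 is omitted.

```python
import torch

def sinkhorn(scores: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Alternating row/column normalization in log space, driving exp(scores)
    toward a transport plan with uniform row and column marginals."""
    m, n = scores.shape
    log_row_marginal = -torch.log(torch.tensor(float(m)))  # each row should sum to 1/m
    log_col_marginal = -torch.log(torch.tensor(float(n)))  # each column should sum to 1/n
    log_p = scores
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True) + log_row_marginal
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True) + log_col_marginal
    return log_p.exp()

# Illustrative sizes only: a small target-domain (e.g., protein) vocabulary
# and a typical English LLM vocabulary.
V_tgt, V_src, d_model = 1024, 32000, 768
translation_scores = torch.nn.Parameter(0.01 * torch.randn(V_tgt, V_src))  # learnable scores

source_embeddings = torch.randn(V_src, d_model)  # stand-in for the frozen source input embeddings

plan = sinkhorn(translation_scores)                    # (V_tgt, V_src) soft translation plan
translation = plan / plan.sum(dim=1, keepdim=True)     # each target token -> distribution over source tokens
target_embeddings = translation @ source_embeddings    # (V_tgt, d_model) embeddings for the new tokenizer
```

In this sketch the translation matrix is the only new parameter tied to the vocabulary mapping, which is consistent with the abstract's observation that translations learned with a smaller model could be transferred to a larger one.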