Adapting Language Models via Token Translation
Abstract: Modern LLMs use a fixed tokenizer to effectively compress text drawn from a source domain. However, applying the same tokenizer to a new target domain often leads to inferior compression, more costly inference, and reduced semantic alignment. To address this deficiency, we introduce Sparse Sinkhorn Token Translation (S2T2). S2T2 trains a tailored tokenizer for the target domain and learns to translate between target and source tokens, enabling more effective reuse of the pre-trained next-source-token predictor. In our experiments with finetuned English LLMs, S2T2 improves both the perplexity and the compression of out-of-domain protein sequences, outperforming direct finetuning with either the source or target tokenizer. In addition, we find that token translations learned for smaller, less expensive models can be directly transferred to larger, more powerful models to reap the benefits of S2T2 at lower cost.
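The abstract does not spell out the optimization, but the method's name points to a Sinkhorn-normalized translation matrix between the target and source vocabularies, which can then be combined with the frozen source embeddings so the pre-trained next-source-token predictor is reused. Below is a minimal, hypothetical PyTorch sketch of that idea; all names and sizes (`sinkhorn`, `translation_scores`, `V_tgt`, `V_src`, `d_model`) are illustrative assumptions rather than the paper's implementation, and the sparsity component of S2T2 is omitted.

```python
import torch

def sinkhorn(scores: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Alternating row/column normalization in log space, driving exp(scores)
    toward a transport plan with uniform row and column marginals."""
    m, n = scores.shape
    log_row_marginal = -torch.log(torch.tensor(float(m)))  # each row should sum to 1/m
    log_col_marginal = -torch.log(torch.tensor(float(n)))  # each column should sum to 1/n
    log_p = scores
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True) + log_row_marginal
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True) + log_col_marginal
    return log_p.exp()

# Illustrative sizes only: a small target-domain (e.g., protein) vocabulary
# and a typical English LLM vocabulary.
V_tgt, V_src, d_model = 1024, 32000, 768
translation_scores = torch.nn.Parameter(0.01 * torch.randn(V_tgt, V_src))  # learnable scores

source_embeddings = torch.randn(V_src, d_model)  # stand-in for the frozen source input embeddings

plan = sinkhorn(translation_scores)                    # (V_tgt, V_src) soft translation plan
translation = plan / plan.sum(dim=1, keepdim=True)     # each target token -> distribution over source tokens
target_embeddings = translation @ source_embeddings    # (V_tgt, d_model) embeddings for the new tokenizer
```

In this sketch the translation matrix is the only new parameter tied to the vocabulary mapping, which is consistent with the abstract's observation that translations learned with a smaller model could be transferred to a larger one.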