- The paper introduces a novel Mixture of Experts approach with Hybrid-k routing that combines always-activated language-specific experts with dynamically routed cross-lingual experts.
- It curates a diverse medical dataset across 12 languages and employs rigorous ablation studies to validate enhancements in reasoning capabilities.
- The Apollo-MoE models outperform comparable medical LLMs, with the 10B model achieving over 69% accuracy on major and 58% on minor languages.
Medical LLMs have the potential to improve global healthcare access, but deploying them in local languages, especially low-resource ones, is hindered by data scarcity. This paper, "Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts" (arXiv:2410.10626), addresses these challenges by proposing efficient methods for scaling medical LLMs to a large number of languages.
The authors first construct a high-quality medical dataset covering 12 major languages, drawing from diverse sources such as books, papers, encyclopedias, dialogues, exams, websites, and practical guidelines. They use ChatGPT for data processing (e.g., converting raw texts into QA pairs) and apply quality checks, including monolingual training runs and ablation studies (e.g., confirming the value of including math and code data for reasoning). This dataset significantly improves the performance of fine-tuned dense models on medical benchmarks across the 12 languages.
To improve efficiency and scalability for multilingual models, the paper explores the use of Mixture of Experts (MoE) architectures. They propose Hybrid-k routing, a novel routing strategy for MoE layers that combines language-specific experts with cross-lingual routing. This approach aims to leverage language-dependent knowledge while also enabling the transfer of general medical knowledge across languages. Hybrid-k routing ensures that the expert corresponding to the input token's language is activated, while also allowing dynamic routing to other experts based on the router's scoring, potentially replacing lower-scoring vanilla Top-k experts. Experiments show that MoE models with Hybrid-k routing achieve better performance and generalization to minor languages compared to dense models and MoE models with vanilla Top-k or strict Language-Specific routing.
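The Hybrid-k idea described above can be illustrated with a minimal sketch: the token's language expert is always selected, and the remaining k-1 slots go to the highest-scoring other experts. This is a simplified, hypothetical reconstruction (function name, softmax weighting, and tie-breaking are assumptions, not the paper's exact implementation):

```python
import numpy as np

def hybrid_k_route(router_logits, lang_expert_idx, k=2):
    """Sketch of Hybrid-k routing: always activate the expert for the
    token's language, then fill the remaining k-1 slots with the
    highest-scoring other experts (simplified illustration)."""
    order = np.argsort(router_logits)[::-1]  # experts sorted by score, descending
    selected = [lang_expert_idx]             # language expert is guaranteed a slot
    for e in order:
        if len(selected) == k:
            break
        if e != lang_expert_idx:
            selected.append(int(e))
    # Combine the chosen experts with softmax weights over their logits
    logits = router_logits[selected]
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return selected, weights

# Example: expert 2 is the token's language expert; expert 1 scores highest overall
sel, w = hybrid_k_route(np.array([0.1, 2.0, 0.5, 1.5]), lang_expert_idx=2, k=2)
# sel contains the language expert plus the best cross-lingual expert: [2, 1]
```

Compared with vanilla Top-k, the only change is the guaranteed slot for the language expert; the router's scores still decide the remaining experts and the mixing weights.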
The paper explores interpreting the multilingual information flow within the MoE using a circuit-based paradigm. By analyzing how tokens from different languages are routed to experts across layers, they observe a phenomenon called "Spread Out in the End." This refers to the observation that earlier layers exhibit shared routing patterns across languages, indicating cross-lingual integration, while later layers show language-specific divergence, with tokens primarily routed to experts specializing in their respective languages.
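One simple way to quantify this layer-wise pattern is to compare, per layer, the sets of experts that tokens from two languages are routed to. The Jaccard-overlap metric below is an illustrative stand-in for the paper's circuit-based analysis, not their exact method:

```python
def expert_overlap(routes_a, routes_b):
    """Jaccard overlap between the sets of experts used by two languages
    in one layer. High overlap suggests shared (cross-lingual) routing;
    low overlap suggests language-specific divergence."""
    a, b = set(routes_a), set(routes_b)
    return len(a & b) / len(a | b)

# Synthetic routing traces consistent with "Spread Out in the End":
# an early layer shares most experts, a late layer shares none.
early = expert_overlap([0, 1, 2], [1, 2, 3])  # 2 shared of 4 total -> 0.5
late = expert_overlap([0, 1], [5, 6])         # disjoint expert sets -> 0.0
```

Under this metric, "Spread Out in the End" corresponds to overlap staying high in early layers and dropping sharply in the final layers.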
Inspired by this "Spread Out in the End" phenomenon, the authors propose the Post-MoE architecture. This architecture applies sparse MoE layers only in the final layers of the model, while keeping earlier layers dense. This design choice leverages the observed specialization in later layers while maintaining efficient processing in earlier, more cross-lingual layers. Experiments with different base models (Qwen2-0.5B and Qwen2-1.5B) show that applying MoE in the last few layers (specifically, the last two layers yielded the best balance in their experiments) significantly improves performance, particularly multilingual generalization.
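The Post-MoE layout can be sketched as a layer-type schedule: dense FFN layers everywhere except the last few, which become sparse MoE layers. The helper below is an assumed configuration sketch (the function name and representation are ours), using the paper's best-performing choice of two final MoE layers as the default:

```python
def build_layer_types(num_layers, num_moe_layers=2):
    """Post-MoE layout sketch: keep early layers dense and replace only
    the last `num_moe_layers` layers with sparse MoE layers. The paper
    reports the last two layers as the best balance in its experiments."""
    if not 0 <= num_moe_layers <= num_layers:
        raise ValueError("num_moe_layers must be between 0 and num_layers")
    return ["dense"] * (num_layers - num_moe_layers) + ["moe"] * num_moe_layers

# e.g. a 24-layer backbone: layers 0-21 stay dense, layers 22-23 become MoE
layout = build_layer_types(24)
```

Because only the final layers carry experts, the parameter and routing overhead of the MoE grows with the number of MoE layers rather than with model depth.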
Building upon the Post-MoE architecture, the paper introduces an efficient method for scaling to 50 languages without a proportional increase in model parameters. They group the 50 languages into 7 language families based on linguistic priors and propose using Mixture of Language Family Experts. Instead of having an expert per language, the MoE layers feature experts dedicated to language families. Tokens from languages within a family are routed to the corresponding language family expert, still utilizing the Hybrid-k routing mechanism within this family context. For low-resource minor languages, the training data is synthesized by translating English medical data.
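The family-level routing can be sketched as a lookup from language code to family expert. The mapping below is a hypothetical partial grouping for illustration only; the paper groups all 50 languages into 7 families using linguistic priors, and the exact assignment is not reproduced here:

```python
# Hypothetical, partial language-to-family mapping (illustrative only).
LANG_TO_FAMILY = {
    "en": "Germanic", "de": "Germanic",
    "fr": "Romance", "es": "Romance", "pt": "Romance",
    "hi": "Indo-Aryan", "ar": "Semitic", "zh": "Sino-Tibetan",
}

def family_expert(lang_code, family_to_idx):
    """Map a token's language to its language-family expert index, so
    languages in the same family share one expert instead of each
    having its own."""
    return family_to_idx[LANG_TO_FAMILY[lang_code]]

# Assign each family a stable expert slot
family_to_idx = {f: i for i, f in enumerate(sorted(set(LANG_TO_FAMILY.values())))}
# German and English tokens route to the same (Germanic) expert
```

Within this family context, Hybrid-k routing works as before: the family expert of the token's language gets a guaranteed slot, and the remaining slots are filled by the router's scores.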
The resulting models, named Apollo-MoE (based on Qwen2-0.5B, 1.5B, and 7B base models), are evaluated on a benchmark covering 12 major and 38 minor languages (medical-clinical subset of MMLU translated using Google Translate for minor languages). The results demonstrate that Apollo-MoE models outperform other open-source medical LLMs of similar sizes on both major and minor languages. The 10B Apollo-MoE model achieved particularly strong results, exceeding 69% accuracy on major languages and 58% on minor languages, surpassing larger 8B open-source models. The method is shown to be relatively data-efficient for minor languages, achieving saturation with around 2,000 translated samples per language.
Practical Implementation Considerations:
The paper demonstrates a practical path towards building medical LLMs that can serve a wide range of languages efficiently, leveraging MoE architectures and insights into multilingual information flow. While achieving parity with the largest closed-source models remains a goal, the proposed techniques provide a strong foundation for democratizing access to medical AI.