- The paper introduces a novel Mixture of Experts approach with Hybrid-k routing that combines always-activated language-specific experts with dynamically routed cross-lingual experts.
- It curates a diverse medical dataset across 12 languages and employs rigorous ablation studies to validate enhancements in reasoning capabilities.
- The Apollo-MoE models outperform comparable medical LLMs, with the 10B model achieving over 69% accuracy on major and 58% on minor languages.
Medical LLMs have the potential to improve global healthcare access, but deploying them in local languages, especially low-resource ones, is hindered by data scarcity. This paper, "Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts" (arXiv:2410.10626), addresses these challenges by proposing efficient methods for scaling medical LLMs to a large number of languages.
The authors first construct a high-quality medical dataset covering 12 major languages, drawing from diverse sources such as books, papers, encyclopedias, dialogues, exams, websites, and practical guidelines. They use ChatGPT for data processing (e.g., converting raw texts into QA pairs) and apply quality checks, including monolingual training runs and ablation studies (e.g., confirming the value of including math and code data for reasoning). This dataset significantly improves the performance of fine-tuned dense models on medical benchmarks across the 12 languages.
To improve efficiency and scalability for multilingual models, the paper explores the use of Mixture of Experts (MoE) architectures. They propose Hybrid-k routing, a novel routing strategy for MoE layers that combines language-specific experts with cross-lingual routing. This approach aims to leverage language-dependent knowledge while also enabling the transfer of general medical knowledge across languages. Hybrid-k routing ensures that the expert corresponding to the input token's language is activated, while also allowing dynamic routing to other experts based on the router's scoring, potentially replacing lower-scoring vanilla Top-k experts. Experiments show that MoE models with Hybrid-k routing achieve better performance and generalization to minor languages compared to dense models and MoE models with vanilla Top-k or strict Language-Specific routing.
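The Hybrid-k idea described above can be illustrated with a minimal sketch: the token's language expert is always selected, and the remaining k-1 slots go to the highest-scoring other experts. This is a simplified, hypothetical reconstruction (function name, softmax weighting, and tie-breaking are assumptions, not the paper's exact implementation):

```python
import numpy as np

def hybrid_k_route(router_logits, lang_expert_idx, k=2):
    """Sketch of Hybrid-k routing: always activate the expert for the
    token's language, then fill the remaining k-1 slots with the
    highest-scoring other experts (simplified illustration)."""
    order = np.argsort(router_logits)[::-1]  # experts sorted by score, descending
    selected = [lang_expert_idx]             # language expert is guaranteed a slot
    for e in order:
        if len(selected) == k:
            break
        if e != lang_expert_idx:
            selected.append(int(e))
    # Combine the chosen experts with softmax weights over their logits
    logits = router_logits[selected]
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return selected, weights

# Example: expert 2 is the token's language expert; expert 1 scores highest overall
sel, w = hybrid_k_route(np.array([0.1, 2.0, 0.5, 1.5]), lang_expert_idx=2, k=2)
# sel contains the language expert plus the best cross-lingual expert: [2, 1]
```

Compared with vanilla Top-k, the only change is the guaranteed slot for the language expert; the router's scores still decide the remaining experts and the mixing weights.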
The paper explores interpreting the multilingual information flow within the MoE using a circuit-based paradigm. By analyzing how tokens from different languages are routed to experts across layers, they observe a phenomenon called "Spread Out in the End." This refers to the observation that earlier layers exhibit shared routing patterns across languages, indicating cross-lingual integration, while later layers show language-specific divergence, with tokens primarily routed to experts specializing in their respective languages.
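One simple way to quantify this layer-wise pattern is to compare, per layer, the sets of experts that tokens from two languages are routed to. The Jaccard-overlap metric below is an illustrative stand-in for the paper's circuit-based analysis, not their exact method:

```python
def expert_overlap(routes_a, routes_b):
    """Jaccard overlap between the sets of experts used by two languages
    in one layer. High overlap suggests shared (cross-lingual) routing;
    low overlap suggests language-specific divergence."""
    a, b = set(routes_a), set(routes_b)
    return len(a & b) / len(a | b)

# Synthetic routing traces consistent with "Spread Out in the End":
# an early layer shares most experts, a late layer shares none.
early = expert_overlap([0, 1, 2], [1, 2, 3])  # 2 shared of 4 total -> 0.5
late = expert_overlap([0, 1], [5, 6])         # disjoint expert sets -> 0.0
```

Under this metric, "Spread Out in the End" corresponds to overlap staying high in early layers and dropping sharply in the final layers.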
Inspired by this "Spread Out in the End" phenomenon, the authors propose the Post-MoE architecture. This architecture applies sparse MoE layers only in the final layers of the model, while keeping earlier layers dense. This design choice leverages the observed specialization in later layers while maintaining efficient processing in earlier, more cross-lingual layers. Experiments with different base models (Qwen2-0.5B and Qwen2-1.5B) show that applying MoE in the last few layers (specifically, the last two layers yielded the best balance in their experiments) significantly improves performance, particularly multilingual generalization.
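The Post-MoE layout can be sketched as a layer-type schedule: dense FFN layers everywhere except the last few, which become sparse MoE layers. The helper below is an assumed configuration sketch (the function name and representation are ours), using the paper's best-performing choice of two final MoE layers as the default:

```python
def build_layer_types(num_layers, num_moe_layers=2):
    """Post-MoE layout sketch: keep early layers dense and replace only
    the last `num_moe_layers` layers with sparse MoE layers. The paper
    reports the last two layers as the best balance in its experiments."""
    if not 0 <= num_moe_layers <= num_layers:
        raise ValueError("num_moe_layers must be between 0 and num_layers")
    return ["dense"] * (num_layers - num_moe_layers) + ["moe"] * num_moe_layers

# e.g. a 24-layer backbone: layers 0-21 stay dense, layers 22-23 become MoE
layout = build_layer_types(24)
```

Because only the final layers carry experts, the parameter and routing overhead of the MoE grows with the number of MoE layers rather than with model depth.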
Building upon the Post-MoE architecture, the paper introduces an efficient method for scaling to 50 languages without a proportional increase in model parameters. They group the 50 languages into 7 language families based on linguistic priors and propose using Mixture of Language Family Experts. Instead of having an expert per language, the MoE layers feature experts dedicated to language families. Tokens from languages within a family are routed to the corresponding language family expert, still utilizing the Hybrid-k routing mechanism within this family context. For low-resource minor languages, the training data is synthesized by translating English medical data.
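The family-level routing can be sketched as a lookup from language code to family expert. The mapping below is a hypothetical partial grouping for illustration only; the paper groups all 50 languages into 7 families using linguistic priors, and the exact assignment is not reproduced here:

```python
# Hypothetical, partial language-to-family mapping (illustrative only).
LANG_TO_FAMILY = {
    "en": "Germanic", "de": "Germanic",
    "fr": "Romance", "es": "Romance", "pt": "Romance",
    "hi": "Indo-Aryan", "ar": "Semitic", "zh": "Sino-Tibetan",
}

def family_expert(lang_code, family_to_idx):
    """Map a token's language to its language-family expert index, so
    languages in the same family share one expert instead of each
    having its own."""
    return family_to_idx[LANG_TO_FAMILY[lang_code]]

# Assign each family a stable expert slot
family_to_idx = {f: i for i, f in enumerate(sorted(set(LANG_TO_FAMILY.values())))}
# German and English tokens route to the same (Germanic) expert
```

Within this family context, Hybrid-k routing works as before: the family expert of the token's language gets a guaranteed slot, and the remaining slots are filled by the router's scores.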
The resulting models, named Apollo-MoE (based on Qwen2-0.5B, 1.5B, and 7B base models), are evaluated on a benchmark covering 12 major and 38 minor languages (medical-clinical subset of MMLU translated using Google Translate for minor languages). The results demonstrate that Apollo-MoE models outperform other open-source medical LLMs of similar sizes on both major and minor languages. The 10B Apollo-MoE model achieved particularly strong results, exceeding 69% accuracy on major languages and 58% on minor languages, surpassing larger 8B open-source models. The method is shown to be relatively data-efficient for minor languages, achieving saturation with around 2,000 translated samples per language.
Practical Implementation Considerations:
The paper demonstrates a practical path towards building medical LLMs that can serve a wide range of languages efficiently, leveraging MoE architectures and insights into multilingual information flow. While achieving parity with the largest closed-source models remains a goal, the proposed techniques provide a strong foundation for democratizing access to medical AI.