ApolloCorpora: Multilingual Medical Dataset
- ApolloCorpora is a multilingual medical dataset curated for developing and benchmarking medical LLMs in six major languages.
- It employs a multi-stage preprocessing pipeline with semantic segmentation, QA conversion, and stringent deduplication to ensure high data quality.
- The dataset underpins both direct model training and proxy-tuning approaches, facilitating robust evaluation via the XMedBench suite.
ApolloCorpora is a multilingual medical dataset engineered to facilitate large-scale development of medical LLMs tailored to the six most widely spoken languages: English (EN), Chinese (ZH), French (FR), Spanish (ES), Arabic (AR), and Hindi (HI). Designed to democratize access to high-performance medical AI for a global population of 6.1 billion, ApolloCorpora underpins both LLM pre-training and benchmarking. It includes rigorously curated high-quality medical content alongside “out of profession” data that safeguards general model versatility. The corpus supports both continuing pre-training and instruction-tuning workflows, and underlies the XMedBench suite for multilingual medical evaluation. ApolloCorpora and its associated resources are open-sourced under permissive licenses, accompanied by explicit usage guidelines (Wang et al., 2024).
1. Scope and Taxonomy of ApolloCorpora
ApolloCorpora is architected to serve diverse NLP tasks in medical settings across six languages. The dataset is structured in two principal categories:
A. High-Quality Medical Data (for continuing pre-training & instruction tuning):
- Books: EN (296.7M tokens), ZH (117.1M); Total ≃ 413.8M
- Papers: EN (252.9M), ZH (45.6M), ES (46.0M), FR (4.5M); Total ≃ 349.0M
- Encyclopedias: EN (221.1M), FR (4.6M), HI (0.5M); Total ≃ 226.2M
- Doctor–Patient Dialogues: EN (92.1M), ZH (46.6M), AR (10.4M); Total ≃ 149.1M
- Medical Exams (instruction tuning): EN (42.1M), ZH (35.3M), FR (0.1M), ES (0.5M); Total ≃ 78.0M
- Clinical Guidelines: EN (29.6M)
B. “Out of Profession” Data (to preserve general model capabilities):
- Web Crawl: EN (499.9M), ZH (329.3M), ES (57.5M); Total ≃ 886.7M
- General Instruction Data: EN (194.5M), ZH (69.4M), HI (43.9M), FR (20.0M), AR (18.7M), ES (18.4M); Total ≃ 364.9M
- Math: EN (18.9M), ZH (3.7M); Total ≃ 22.6M
- Code: EN (9.2M), ZH (7.2M); Total ≃ 16.4M
The aggregate token count across all categories can be computed by summing the per-category totals: high-quality medical data contributes ≃1,245.7M tokens and “out of profession” data contributes ≃1,290.6M, for a grand total of ≃2,536.3M, i.e., roughly 2.5B tokens.
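The per-category totals listed above can be summed with a short script (figures in millions of tokens, copied from the lists):

```python
# Token counts (in millions) per category, as listed in the taxonomy above.
medical = {
    "Books": 413.8,
    "Papers": 349.0,
    "Encyclopedias": 226.2,
    "Doctor-Patient Dialogues": 149.1,
    "Medical Exams": 78.0,
    "Clinical Guidelines": 29.6,
}
out_of_profession = {
    "Web Crawl": 886.7,
    "General Instruction Data": 364.9,
    "Math": 22.6,
    "Code": 16.4,
}

medical_total = sum(medical.values())            # 1245.7M
general_total = sum(out_of_profession.values())  # 1290.6M
grand_total = medical_total + general_total
print(f"{grand_total:.1f}M tokens (~{grand_total / 1000:.1f}B)")
```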
2. Cleaning, Segmentation, and QA Rewrite Pipeline
ApolloCorpora incorporates a multi-stage preprocessing pipeline to standardize semantic units, transform content, and assure high annotation quality:
- Semantic Segmentation: Books and guidelines are segmented by section; papers by abstract; web pages, dialogues, and encyclopedias by paragraph or wiki-entry.
- QA Conversion: Every segment is reformulated into question–answer (QA) pairs via ChatGPT (gpt-3.5-turbo-16k) with language-specific block size limits (EN/ES/FR/HI ≤ 2048 tokens; ZH ≤ 256; AR ≤ 128). A two-stage prompt skeleton is used: first a question is generated from the text, then it is answered with reference to the source segment.
- Deduplication & Leakage Filtering: A stringent overlap criterion (following Med-PaLM 2) is applied: any QA item overlapping ≥64 consecutive characters with test/held-out data is removed. For exam data, 3,041 of 580,645 items were filtered (a removal rate of 0.52%); non-exam data was unaffected.
- Language Filtering & Normalization: Language identification is enforced with confidence thresholds (≥0.95). Punctuation normalization and Unicode form unification are systematically applied.
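The leakage filter described above (drop any QA item sharing ≥64 consecutive characters with held-out data) can be sketched with character shingles; the function names here are illustrative, not taken from the released pipeline:

```python
# Sketch of the >=64-consecutive-character overlap filter for leakage
# removal. Names (`shingles`, `leaks`) are illustrative, not official.
WINDOW = 64

def shingles(text: str, n: int = WINDOW) -> set:
    """All n-character substrings of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def leaks(candidate: str, blocked: set, n: int = WINDOW) -> bool:
    """True if any n-char run of `candidate` also appears in `blocked`."""
    return any(candidate[i:i + n] in blocked
               for i in range(len(candidate) - n + 1))

# Build the blocked set from held-out benchmark text, then drop any
# training QA item for which leaks(item, blocked) is True.
test_corpus = ["held-out benchmark questions go here"]
blocked = set()
for doc in test_corpus:
    blocked |= shingles(doc)
```

Hashing the shingles (e.g., with `hash()` or a rolling hash) keeps memory manageable at corpus scale; the sketch above favors clarity.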
3. Benchmarking and Evaluation: The XMedBench Construction
XMedBench, intimately connected to ApolloCorpora, operationalizes multilingual evaluation via machine-friendly multiple-choice question answering (MCQA):
- Task Selection: Only MCQA (four options) is included for robustly automated scoring.
- Dataset Sourcing:
- EN: MedQA-USMLE (≃12,723 Qs), MedMCQA (≃194,636), MMLU-Medical subset (≃1,200)
- ZH: CMB/CMExam (≃44,000), CMMLU-Medical (≃1,000)
- FR: FrenchMedMCQA (≃8,000)
- ES: HEAD-QA (≃3,000)
- AR & HI: Translated MMLU subcategories (Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine; ≃1,200 each)
Prompting: Evaluation uses 3-shot in-context examples with the template: “You are a medical doctor answering real-world exam questions. Select one correct answer A–D. Question: {Q} Options: (A)…(D)… Assistant: The correct answer is {X}.<special_token>”
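The prompt template above can be assembled roughly as follows (a sketch: the shot-selection strategy, exact whitespace, and special-token handling are assumptions):

```python
# Illustrative assembly of the 3-shot MCQA evaluation prompt.
SYSTEM = ("You are a medical doctor answering real-world exam questions. "
          "Select one correct answer A-D.")

def format_item(question, options, answer=None):
    """Render one MCQA item; solved shots include the answer line."""
    opts = " ".join(f"({k}) {v}" for k, v in zip("ABCD", options))
    block = f"Question: {question} Options: {opts} Assistant:"
    if answer is not None:
        block += f" The correct answer is {answer}."
    return block

def build_prompt(shots, query_q, query_opts):
    """SYSTEM + three solved examples + the unanswered query."""
    parts = [SYSTEM]
    parts += [format_item(q, o, a) for q, o, a in shots]
    parts.append(format_item(query_q, query_opts))
    return "\n\n".join(parts)
```

The model's continuation after the final “Assistant:” is then parsed for the predicted option letter.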
Metrics: Accuracy is measured as the fraction of questions for which the predicted option matches the gold answer: Accuracy = N_correct / N_total. F1 is referenced but not applied.
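A minimal accuracy computation over predicted option letters:

```python
def accuracy(preds, golds):
    """Fraction of MCQA items where the predicted letter matches the key."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

print(accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"]))  # 0.75
```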
4. Licensing, Open Access, and Compliance
All data sources selected for ApolloCorpora are accompanied by fully open-source or permissive licenses (e.g., CC-BY, MIT, public domain). ApolloCorpora itself is released under the Apache 2.0 license on GitHub (https://github.com/FreedomIntelligence/Apollo), with attribution to each original source.
Usage Guidelines:
- No medical advice should be automated without human oversight.
- Local medical practices and taboos must be respected; local-language corpora should not be repurposed out of context.
- Regional regulations on patient privacy and data handling must be followed.
- Researchers should acknowledge that cultural differences may introduce biases in medical knowledge.
Practitioners may clone, review, and adapt the repository, including adding new modules or subdomains, but must retain the corpus's full filtering (≥64-character overlap), rewriting, and segmentation pipeline for data quality and legal compliance.
5. Significance and Applications in Medical AI
ApolloCorpora underpins multilingual medical LLMs, specifically the lightweight Apollo models (0.5B, 1.8B, 2B, 6B, 7B parameters), and enables high performance in diverse medical QA tasks. Apollo-7B is reported as state-of-the-art among multilingual medical LLMs up to 70B parameters.
A notable application is “proxy-tuning,” where smaller Apollo-derived models can be leveraged to enhance multilingual medical capabilities of larger, foundational models without necessitating full fine-tuning on the larger model—a strategy facilitating resource-efficient integration.
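The logit arithmetic commonly used for proxy-tuning can be sketched as follows (values and function name are illustrative; in practice the shift is applied to full next-token distributions at each decoding step):

```python
# Minimal sketch of proxy-tuning logit arithmetic: the large base
# model's next-token logits are shifted by the difference between a
# small tuned "expert" and its untuned counterpart ("anti-expert").
def proxy_tuned_logits(base, expert, anti_expert, alpha=1.0):
    """Combine per-token logits: base + alpha * (expert - anti_expert)."""
    return [b + alpha * (e - a) for b, e, a in zip(base, expert, anti_expert)]

base        = [2.0, 1.0, 0.5]   # large general-purpose model
expert      = [1.0, 3.0, 0.2]   # small model tuned on ApolloCorpora
anti_expert = [1.0, 1.0, 0.2]   # the same small model before tuning
print(proxy_tuned_logits(base, expert, anti_expert))  # [2.0, 3.0, 0.5]
```

The large model's weights are never updated; only its output distribution is steered, which is what makes the approach resource-efficient.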
This suggests that ApolloCorpora is integral not only for direct model training and benchmarking, but also for iterative improvement of multilingual medical AI research pipelines.
6. Contextual Considerations and Future Directions
The construction of ApolloCorpora highlights the need for coverage of major world languages in equitable healthcare AI. Its pipeline and strict filtering criteria aim to mitigate data leakage and bolster the trustworthiness of medical LLM outputs. Direct public access and extensibility create opportunities for rapid adaptation to new clinical disciplines or languages under robust licensing and compliance regimes.
A plausible implication is that future extensions may incorporate additional languages, subdomains, or evaluation modalities, as long as adherence to the established QA rewriting, segmentation, and filtering protocol is maintained to preserve data integrity and legal compatibility.
No controversies or misuse are noted in the data; ethical caution and regulatory alignment are underscored throughout corpus deployment and downstream usage (Wang et al., 2024).