TICO-19: Multilingual COVID-19 Translation Dataset
- TICO-19 is a multilingual benchmark providing high-quality, sentence-aligned COVID-19 translations across 35 languages.
- It includes human-validated Dev and Test splits, TMX translation memories, and terminology lists for precise MT evaluation.
- The dataset supports pivot-based and domain adaptation research, especially for under-resourced languages in crisis communication.
The TICO-19 dataset is a multilingual benchmark developed by the Translation Initiative for COVID-19 (TICO-19) to facilitate the dissemination of high-quality, human-validated translations of COVID-19–related guidance, safety measures, and technical content. Explicitly designed for both machine translation (MT) research and urgent humanitarian communication, it comprises parallel Dev and Test splits in 35 languages—including 26 under-resourced languages from Africa, South Asia, and Southeast Asia—aligned at the sentence level across 3,071 English source segments. This resource includes professionally aligned parallel data, translation memories in TMX v1.4 format, and terminology lists, and is released under the CC BY 4.0 license to support broad reuse and domain adaptation research (Anastasopoulos et al., 2020).
1. Scope, Language Coverage, and Motivation
The primary motivation behind TICO-19 is twofold: accelerating access to accurate, validated COVID-19 information for vulnerable, linguistically diverse communities, and enabling substantive progress in the evaluation and improvement of low-resource and biomedical domain-specific MT systems. The assembled benchmark specifically targets languages that are poorly represented in existing datasets but essential for effective global public health communication.
The 35 languages are stratified into three categories:
- Pivot (9 high-resource lingua-franca): Arabic (ar), Chinese–Simplified (zh), French (fr), Portuguese–Brazilian (pt-BR), Spanish–Latin-American (es-419), Hindi (hi), Russian (ru), Swahili (sw), Indonesian (id)
- Priority (18 under-resourced, TWB-identified):
- South Asia: Dari (prs), Central Khmer (km), Kurdish Kurmanji (Latn) (ku), Kurdish Sorani (ckb), Nepali (ne), Pashto (ps)
- Africa: Amharic (am), Dinka (din), Fulfulde (Nigeria, Latn) (fuv-Latn-NG), Hausa (ha), Kanuri (kr), Kinyarwanda (rw), Lingala (ln), Luganda (lg), Oromo (om), Somali (so), Tigrinya (et-tir), Zulu (zu)
- Important (8 South & South-East Asia): Bengali (bn), Burmese (my), Farsi (fa), Malay (ms), Marathi (mr), Tagalog (tl), Tamil (ta), Urdu (ur)
2. Data Composition and Structure
Each language corpus provides direct, sentence-level parallel alignments for all 3,071 English source segments, split into a held-out Dev set (971 segments) and a Test set (2,100 segments). There is no large, explicit training set. Because every segment is aligned identically across languages, any source–target pair can be extracted for supervised or zero-shot MT evaluation (up to 1,190 distinct directions among the 35 translated languages). This compositional uniformity spares researchers any realignment or re-filtering.
| Language | Dev segments | Test segments | Total segments |
|---|---|---|---|
| ar | 971 | 2,100 | 3,071 |
| zh | 971 | 2,100 | 3,071 |
| fr | 971 | 2,100 | 3,071 |
| ... | ... | ... | ... |
| ur | 971 | 2,100 | 3,071 |
This structure facilitates the development, tuning, and robust evaluation of translation models under consistent, reproducible conditions.
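Because the splits are line-aligned across all languages, extracting any direct pair reduces to a positional zip of two files. A minimal sketch (`read_split` is an illustrative helper; the commented paths follow the repository layout):

```python
def read_split(path):
    """Read one benchmark split: one UTF-8 segment per line."""
    with open(path, encoding='utf-8') as f:
        return [line.rstrip('\n') for line in f]

# Any source-target direction is a positional zip of two splits,
# e.g. a French->Swahili dev set inside a cloned checkout:
# pairs = list(zip(read_split('bench/dev/TICO19.dev.fr'),
#                  read_split('bench/dev/TICO19.dev.sw')))
```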
3. Translation Directions, Pivoting, and TMX Resources
Owing to the one-to-one alignment with English, any direct language pair can be extracted for supervised or zero-shot MT evaluation. The nine Pivot languages serve as intermediate stages for cascade approaches (pivoting) to extremely low-resource targets (e.g., English→French→Lingala, English→Arabic→Somali) where direct parallel data are unavailable. The repository provides:
- TMX translation memories for all English–X directions.
- Selected secondary pairings (e.g., French–Lingala, Farsi–Dari, Kurmanji–Sorani), supporting research on language transfer and pivot-based evaluation.
Pivoting is operationalized by composing two TMX files (e.g., en→pivot and pivot→X) and filtering based on Translation Unit (TU) identifiers.
Example TMX parallel unit:
```xml
<tu tuid="1017">
  <tuv xml:lang="en">
    <seg>Are you having any shortness of breath?</seg>
  </tuv>
  <tuv xml:lang="sw">
    <seg>Je, una matatizo ya kupumua?</seg>
  </tuv>
</tu>
```
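The TU-identifier join described above can be sketched in Python; `tmx_units` and `compose` are illustrative helper names, and the standard library's XML parser stands in for whatever TMX tooling one prefers:

```python
import xml.etree.ElementTree as ET

XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'  # the xml:lang attribute

def tmx_units(path, src, trg):
    """Map each TU identifier to its (src, trg) segment pair."""
    units = {}
    for tu in ET.parse(path).iter('tu'):
        segs = {tuv.get(XML_LANG): tuv.findtext('seg')
                for tuv in tu.findall('tuv')}
        units[tu.get('tuid')] = (segs[src], segs[trg])
    return units

def compose(first, second):
    """Join two memories on shared TU identifiers: (a->b) + (b->c) = a->c."""
    return {tid: (first[tid][0], second[tid][1])
            for tid in first.keys() & second.keys()}

# Inside a cloned checkout, e.g. an en-ln memory via the French pivot:
# en_ln = compose(tmx_units('tmx/en-fr.tmx', 'en', 'fr'),
#                 tmx_units('tmx/fr-ln.tmx', 'fr', 'ln'))
```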
4. File Formats, Directory Layout, and Data Access
The benchmark is distributed as UTF-8 encoded plain text files, with parallel Dev and Test splits for each language. TMX v1.4 translation memories are provided for broad compatibility with localization and Computer-Assisted Translation (CAT) tools. The directory structure is modular, containing benchmark splits, translation memories, and terminology lists:
```
tico19/
  bench/
    dev/
      TICO19.dev.en
      TICO19.dev.fr
      ...
    test/
      TICO19.test.en
      TICO19.test.ru
      ...
  tmx/
    en-ar.tmx
    en-zh.tmx
    ...
    fa-prs.tmx
    ku-ckb.tmx
  terms/
    facebook_terms.csv
    google_terms.csv
```
The benchmark is obtained via git:

```shell
git clone https://github.com/tico-19/tico-19-benchmark.git
cd tico-19-benchmark
```
A minimal Python example for loading a translation memory; since no dedicated `tmx` package is assumed, this sketch parses the TMX file directly with the standard library:

```python
import xml.etree.ElementTree as ET

XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'  # the xml:lang attribute

for tu in ET.parse('tmx/en-sw.tmx').iter('tu'):
    segs = {tuv.get(XML_LANG): tuv.findtext('seg') for tuv in tu.findall('tuv')}
    src, trg = segs['en'], segs['sw']
```
5. Licensing, Attribution, and Terminology Lists
All TICO-19 Dev/Test splits, TMX resources, and terminology lists are released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. This permits reuse, adaptation, and redistribution in research and humanitarian applications. Proper attribution to “TICO-19: Translation Initiative for COVID-19 (https://tico-19.github.io/)” is required in derivative works or literature.
Terminology lists from Facebook and Google are included to support domain adaptation and terminology-guided MT experiments.
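As one illustration, a terminology list can score how often a system realizes the expected target term. The CSV column layout assumed here (source term, then target term) and both helper names are assumptions; inspect the actual headers in `terms/` first:

```python
import csv

def load_terms(path, src_col=0, trg_col=1):
    """Read (source term, target term) pairs from a terminology CSV.
    The column positions are assumptions; check the file's header first."""
    with open(path, encoding='utf-8') as f:
        rows = list(csv.reader(f))
    return [(row[src_col], row[trg_col]) for row in rows[1:]]  # skip header

def term_coverage(terms, source, hypothesis):
    """Fraction of terms triggered by the source that appear in the output."""
    hits = applicable = 0
    for src_term, trg_term in terms:
        if src_term.lower() in source.lower():
            applicable += 1
            hits += trg_term.lower() in hypothesis.lower()
    return hits / applicable if applicable else 1.0
```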
6. MT Baselines and Evaluation Metrics
TICO-19 provides standardized BLEU baseline scores with sacreBLEU for a variety of English–X translation systems, highlighting the quality gap between high-resource and low-resource pairs:
Selected English→X BLEU scores (from Table 6 of Anastasopoulos et al., 2020):
| Language (System) | BLEU |
|---|---|
| es-419 (OPUS-MT) | 48.73 |
| pt-BR (OPUS-MT) | 47.26 |
| fr (OPUS-MT) | 37.59 |
| id (OPUS-MT) | 41.27 |
| ru (OPUS-MT) | 25.49 |
| sw (OPUS-MT) | 22.62 |
| ln (OPUS-MT) | 7.85 |
| mr (Our OPUS) | 0.21 |
| fa (Our OPUS) | 8.48 |
| prs (Our OPUS/fa→prs) | 9.49 |
| om (Our OPUS) | 0.57 |
| zu (Our OPUS) | 11.73 |
X→English examples include es-419→en (46.82), fr→en (39.40), hi→en (18.91), rw→en (8.29), and ln→en (6.71) BLEU, illustrating the sharp quality degradation on under-resourced pairs.
7. Applications, Benchmarks, and Research Implications
TICO-19 is actively used as:
- A multi-domain, multi-lingual Dev/Test suite for supervised, unsupervised, and pivot-based MT evaluation and fine-tuning.
- A resource for domain adaptation research in biomedical/scientific/technical text.
- A low-resource MT benchmark for error-analysis and empirical studies.
- A source of TMX translation memories for CAT tools and humanitarian localizers.
The rigorously quality-assured, human-validated nature of the benchmark—combined with its comprehensive, sentence-aligned parallelism—enables controlled evaluation of domain adaptation strategies, terminology-guided translation, pivot cascades, and multilingual transfer learning. The absence of direct training material focuses experimentation on true domain- and resource-constrained settings. These characteristics distinguish TICO-19 as an immediate foundation for improving both the deployment of crisis communications and the sophistication of MT research for under-resourced, high-impact domains (Anastasopoulos et al., 2020).