English-Turkish Clinical RE Dataset
- The paper introduces the first curated bilingual clinical relation extraction dataset with 2,000 expert-validated, aligned sentence pairs.
- It employs a rigorous methodology using DeepL machine translation, medical student post-editing, and expert validation to ensure high semantic fidelity.
- The dataset enables direct cross-lingual evaluation of NLP models, demonstrating high micro-F1 scores with techniques like contrastive retrieval and structured prompting.
The English-Turkish Parallel Clinical Relation Extraction (RE) Dataset constitutes the first systematically curated and expert-validated bilingual resource to enable parallel evaluation of relation extraction models on English and Turkish clinical text. Derived as a parallel corpus from the 2010 i2b2/VA Relation Classification challenge, it provides 2,000 aligned sentence–relation pairs annotated with clinical entities and inter-entity relations, supporting direct comparison of NLP methods and prompting strategies on high-fidelity, semantically matched clinical data in both languages (Aidynkyzy et al., 14 Jan 2026).
1. Source Corpus and Construction Methodology
The dataset originates from the 2010 i2b2/VA Relation Classification challenge corpus (Uzuner et al. 2011), which comprises 394 de-identified discharge summaries in the training split and 477 in the test split, all annotated for medical Problems, Tests, and Treatments, along with eight predefined relation types connecting these entities. For manageable bilingual annotation, a random subset of 1,500 sentences from the i2b2 train split and 500 from the test split was sampled, preserving the i2b2 relation-type distribution. Each sampled sentence contains exactly one annotated relation instance between two entities.
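The distribution-preserving sampling step can be sketched as stratified sampling over relation types. The snippet below is a minimal illustration, not the authors' code; the `"relation"` field name and the record structure are assumptions:

```python
import random
from collections import defaultdict

def stratified_sample(records, n_total, seed=42):
    """Sample n_total records while preserving the relation-type distribution.

    `records` is a list of dicts with a "relation" key (hypothetical schema).
    Proportional rounding may leave the total off by a record or two; the
    real pipeline would need to reconcile that remainder.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for rec in records:
        by_type[rec["relation"]].append(rec)

    sample = []
    for rel_type, group in by_type.items():
        # Allocate slots proportionally to this type's share of the corpus.
        k = round(n_total * len(group) / len(records))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample

# Toy corpus: 10 TeRP + 10 PIP records; a 10-record sample keeps the 50/50 split.
corpus = [{"relation": "TeRP"}] * 10 + [{"relation": "PIP"}] * 10
subset = stratified_sample(corpus, 10)
```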
Preprocessing adhered to i2b2 guidelines: all protected health information (PHI) tokens were replaced by identifiers, and punctuation-based sentence segmentation together with the original i2b2 tokenization was applied to preserve entity span offsets.
Turkish translations were generated using the DeepL machine translation engine for initial drafts. Each translation underwent post-editing by third- and fourth-year medical students at Ege University, supervised by clinical faculty, to ensure terminological precision. A subset of 100 randomly chosen sentences was independently double-checked by two senior medical experts (from Ege and Dokuz Eylül), leading to a compiled error taxonomy and confirmation that over 95% of translations were semantically faithful. All 2,000 sentences then underwent a final post-edit by a retired physician and professional translator, focusing on flagged error types such as Turkish case-marking and negation scope.
2. Dataset Statistics and Distribution
Each language version (English or Turkish) includes precisely 2,000 sentences (1,500 train, 500 test) with exactly one annotated relation instance per sentence and two marked entities per instance. No development split was defined. The dataset reflects the i2b2 relations schema, ensuring that the eight original relation categories are preserved with their natural distribution from the source corpus.
Summary statistics are as follows:
| Language | Sentences (train/test) | Total relation instances | Total entity mentions |
|---|---|---|---|
| English | 1,500 / 500 | 2,000 | 4,000 |
| Turkish | 1,500 / 500 | 2,000 | 4,000 |
Relation-type distributions match the i2b2 challenge corpus proportions:
| Relation | Count / Proportion |
|---|---|
| TrIP | 380 / 19.0 % |
| TrWP | 60 / 3.0 % |
| TrCP | 40 / 2.0 % |
| TrAP | 340 / 17.0 % |
| TrNAP | 60 / 3.0 % |
| TeRP | 440 / 22.0 % |
| TeCP | 120 / 6.0 % |
| PIP | 560 / 28.0 % |
Each sentence-relation pair is unique, and parallel alignment ensures that every Turkish sentence corresponds precisely to one English source, facilitating rigorous cross-lingual experiments.
3. Relation Taxonomy and Annotation Guidelines
The dataset employs the exact relation taxonomy of the i2b2/VA 2010 challenge, comprising eight clinically relevant pairwise relation labels connecting the entity types Medical_Problem, Test, and Treatment. Definitions and canonical English/Turkish examples are as follows:
| Relation | Description | Example (EN) | Example (TR) |
|---|---|---|---|
| TrIP | Treatment improves Problem | "Beta-blockers improved his hypertension." | "Beta-blokerler hipertansiyonunu düzeltti." |
| TrWP | Treatment worsens Problem | "Steroids aggravated his diabetes." | "Steroidler diyabetini kötüleştirdi." |
| TrCP | Treatment causes Problem | "He developed ulcers from NSAIDs." | "NSAID’lerden ülserler oluştu." |
| TrAP | Treatment administered for Problem | "He was started on insulin for hyperglycemia." | "Hiperglisemi nedeniyle insülin başlandı." |
| TrNAP | Treatment not administered due to Problem | "Medication was withheld because of hypotension." | "Hipotansiyon nedeniyle tedavi iptal edildi." |
| TeRP | Test reveals Problem | "MRI revealed a subdural hematoma." | "MRI subdural hematomayı ortaya çıkardı." |
| TeCP | Test conducted to investigate Problem | "A chest X-ray was done for suspected pneumonia." | "Şüpheli pnömoni için grafi çekildi." |
| PIP | Problem indicates Problem | "Chest pain may indicate myocardial ischemia." | "Göğüs ağrısı miyokard iskemisini gösterebilir." |
Annotation rules, following i2b2, stipulate that only intra-sentence (not cross-sentence) relations are marked, entities must be Medical_Problem, Test, or Treatment, and in cases of multiple candidate relations only the strongest (highest-priority per i2b2 hierarchy) is annotated. Nested or discontinuous entities are excluded.
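The "keep only the strongest relation" rule amounts to selecting by a fixed priority ordering. The ordering below is an assumption for illustration only; the actual hierarchy is defined in the i2b2 challenge guidelines:

```python
# Hypothetical priority ordering (earlier = higher priority); the real
# i2b2 hierarchy is specified in the challenge annotation guidelines.
PRIORITY = ["TrIP", "TrWP", "TrCP", "TrNAP", "TrAP", "TeRP", "TeCP", "PIP"]

def strongest_relation(candidates):
    """Keep only the highest-priority label among candidate relations."""
    return min(candidates, key=PRIORITY.index)
```

Under this assumed ordering, `strongest_relation(["TeRP", "PIP"])` returns `"TeRP"`.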
Inter-annotator agreement (IAA) on a 100-sentence Turkish sample was Cohen’s κ = 0.88 for entity spans and κ = 0.85 for relation labels. The original English i2b2 corpus reported κ ≈ 0.90 for relations.
4. Parallel Alignment and Quality Control
Each English sentence in the 2,000-sentence subset is mapped one-to-one with its Turkish translation. The translation pipeline consists of (1) DeepL machine translation; (2) medical student post-editing (with clinical supervision); (3) independent double-expert validation of a 100-sentence subset, during which an error taxonomy was established (covering phenomena such as mistranslation of drug names and suffix ambiguity); and (4) professional translation post-editing focused on systematic errors.
Alignment quality was further verified using automatic back-translation. Sentence pairs for which EN→TR→EN BLEU dropped below 0.8 (~7% of pairs) were reviewed manually; roughly half that subset underwent further correction, while the remainder were replaced. All final 2,000 sentence pairs exceeded alignment confidence of 0.85 (evaluated with BLEU and TER) and passed bilingual review.
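The back-translation check can be sketched with a sentence-level BLEU filter. The implementation below is a self-contained, smoothed approximation standing in for a real scorer such as sacreBLEU; the 0.8 threshold is the one reported in the source:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, reference, max_n=4):
    """Minimal smoothed sentence-level BLEU on a 0-1 scale (illustrative)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())
        total = max(sum(h.values()), 1)
        # Add-one smoothing keeps the log defined when no n-grams match.
        log_prec += math.log((overlap + 1) / (total + 1))
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)

def flag_low_fidelity(pairs, threshold=0.8):
    """Indices of (original EN, back-translated EN) pairs scoring below threshold."""
    return [i for i, (orig, back) in enumerate(pairs)
            if sentence_bleu(back, orig) < threshold]
```

An identical round trip scores 1.0 and passes; a pair with no token overlap falls well below 0.8 and is flagged for manual review.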
A plausible implication is that this degree of validation and correction establishes a new standard for translation quality in parallel clinical NLP datasets, particularly for low-resource target languages.
5. Data Schema and Format
The dataset is released in JSON Lines (".jsonl") format, with one record per sentence per language. Each record follows the schema:
```json
{
  "id": "TRex_0001",
  "text": "...clinical sentence...",
  "entities": [
    {"eid": "e1", "type": "Test", "start": 12, "end": 15, "text": "MRI"},
    {"eid": "e2", "type": "Problem", "start": 30, "end": 47, "text": "subdural hematoma"}
  ],
  "relation": {"type": "TeRP", "head": "e1", "tail": "e2"}
}
```
Example (English):
```json
{
  "id": "TRex_EN_0001",
  "text": "MRI revealed a small subdural hematoma.",
  "entities": [
    {"eid": "e1", "type": "Test", "start": 0, "end": 3, "text": "MRI"},
    {"eid": "e2", "type": "Problem", "start": 21, "end": 38, "text": "subdural hematoma"}
  ],
  "relation": {"type": "TeRP", "head": "e1", "tail": "e2"}
}
```
Example (Turkish):
```json
{
  "id": "TRex_TR_0001",
  "text": "MRI küçük subdural hematomayı ortaya çıkardı.",
  "entities": [
    {"eid": "e1", "type": "Test", "start": 0, "end": 3, "text": "MRI"},
    {"eid": "e2", "type": "Problem", "start": 10, "end": 29, "text": "subdural hematomayı"}
  ],
  "relation": {"type": "TeRP", "head": "e1", "tail": "e2"}
}
```
The dataset is licensed under CC-BY-NC 4.0, following i2b2 data-use requirements. It is available at https://github.com/ClinicalRE-ENG-TR/parallel_i2b2_2010_RE, with usage instructions detailing data use agreement requirements, installation, and data file structure.
6. Significance and Applications
The English-Turkish Parallel Clinical RE Dataset constitutes the first resource enabling direct cross-lingual evaluation of clinical relation extraction models and prompting techniques for both high- and low-resource languages in a controlled, semantically aligned setting. Its construction enables systematic assessment of methods including in-context learning, Chain-of-Thought, and contrastive retrieval approaches. In the benchmark described in the source paper, prompting-based LLMs consistently outperformed fine-tuned models such as PURE, with Relation-Aware Retrieval (RAR)—a contrastive demonstration selection method—yielding the top micro-F1 scores (e.g., Gemini 1.5 Flash at 0.906 F1 for English and 0.888 for Turkish; DeepSeek-V3 achieves 0.918 F1 in English when RAR is combined with structured reasoning prompts).
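Since each sentence carries exactly one gold relation and each system emits exactly one label per instance, micro-F1 over the eight classes reduces to accuracy: every misclassification counts as one false positive and one false negative. A minimal sketch of the metric:

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over single-label predictions. With exactly one
    predicted label per instance, micro precision = micro recall = accuracy."""
    tp = sum(g == p for g, p in zip(gold, pred))
    fp = fn = len(gold) - tp  # each wrong label is one FP and one FN
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

micro_f1(["TeRP", "PIP", "TrAP", "TrAP"],
         ["TeRP", "PIP", "TrAP", "TeRP"])  # 3/4 correct -> 0.75
```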
These results underscore the value of high-quality parallel datasets not only as benchmarks but as resources for developing new few-shot learning and retrieval-based methods in clinical NLP, particularly for languages with limited annotation resources (Aidynkyzy et al., 14 Jan 2026).