MIMIC-IV-to-FHIR Reference Mappings
- MIMIC-IV-on-FHIR is a framework that systematically transforms structured clinical data into HL7 FHIR resources, ensuring semantic interoperability.
- It employs advanced retrieval-augmented generation and context-aware prompt engineering to achieve high accuracy in resource mapping.
- Evaluation metrics demonstrate high precision and recall under baseline conditions, underscoring the method's reliability and potential for further fine-tuning.
MIMIC-IV-on-FHIR reference mappings define the systematic transformation of structured clinical data from the MIMIC-IV database into HL7 FHIR-compliant resources. This mapping framework enables semantic interoperability and supports automation via LLMs. It incorporates rigorous attribute-level and terminology normalization protocols, leverages context-aware prompt engineering, and is validated against formal evaluation metrics (Riquelme et al., 3 Jul 2025, Brens et al., 9 Jan 2026).
1. Mapping Principles and Pipeline Architecture
The MIMIC-IV-on-FHIR mapping pipeline is a semi-automated process executed in sequential stages: data processing, context building, and targeted LLM prompting. Both baseline (schema-aware) and real-world (minimal context) scenarios are supported.
- Data Processing: In the baseline, 17 MIMIC-IV tables (183 attributes) are filtered to 119 candidate attributes. In the real-world configuration, a single table contains 68 unconstrained attributes with only basic metadata.
- Context Building:
- Retrieval-Augmented Generation (RAG) combines embeddings from TF-IDF, BM25, Universal Sentence Encoder, and Word2Vec for semantic similarity between MIMIC-IV and 45 official FHIR resources.
- Cosine similarity and Reciprocal Rank Fusion prioritize resource selection: top-1 FHIR resource assignment achieves 100% accuracy in the baseline.
- Unsupervised clustering (KMeans, Silhouette/Calinski-Harabasz/Davies-Bouldin) with biomedical embeddings (PubMedBERT, MedEmbed-v0.1, ClinicalBERT, BioBERT) is applied for real-world attribute grouping; top-5 resource recall is 94%.
- LLM Interaction:
- Self-reflexive, mixture-of-prompts, and 5-step serial prompting strategies are engineered for resource-element mapping.
- GPT-4o and Llama 3.2 models are configured for deterministic output (temperature=0, top_p=0), utilizing "functions" and "structured_output" interfaces to enforce schema-compliance and to invoke resource-specific tools (Riquelme et al., 3 Jul 2025).
2. Reference Mapping Tables: Source-to-FHIR Alignment
Explicit tabular mappings provide direct references for implementers. Attribute-level transformations specify the target FHIR resource, element path, data type, and normalization mechanism:
| MIMIC-IV Table.Field | FHIR Resource | FHIR Element Path |
|---|---|---|
| PATIENTS.subject_id | Patient | Patient.identifier[0].value |
| PATIENTS.gender | Patient | Patient.gender |
| PATIENTS.dob | Patient | Patient.birthDate |
| ADMISSIONS.hadm_id | Encounter | Encounter.identifier[0].value |
| ADMISSIONS.subject_id | Encounter | Encounter.subject.reference |
| ADMISSIONS.admittime | Encounter | Encounter.period.start |
| ADMISSIONS.dischtime | Encounter | Encounter.period.end |
| ADMISSIONS.admission_type | Encounter | Encounter.class.code |
| DIAGNOSES_ICD.icd_code | Condition | Condition.code.coding.code |
| DIAGNOSES_ICD.icd_code | Condition | Condition.code.coding.system |
| DIAGNOSES_ICD.long_title | Condition | Condition.code.coding.display |
| DIAGNOSES_ICD.hadm_id | Condition | Condition.encounter.reference |
| DIAGNOSES_ICD.subject_id | Condition | Condition.subject.reference |
| LABEVENTS.itemid | Observation | Observation.code.coding.code |
| LABEVENTS.itemid | Observation | Observation.code.coding.system |
| LABEVENTS.valuenum | Observation | Observation.valueQuantity.value |
| LABEVENTS.valueuom | Observation | Observation.valueQuantity.unit |
| LABEVENTS.charttime | Observation | Observation.effectiveDateTime |
| PRESCRIPTIONS.drug_code_rxnorm | MedicationRequest | MedicationRequest.medicationCodeableConcept.code |
| PRESCRIPTIONS.drug_code_rxnorm | MedicationRequest | MedicationRequest.medicationCodeableConcept.system |
| PRESCRIPTIONS.drug | MedicationRequest | MedicationRequest.dosageInstruction.text |
Each mapping encodes normalization rules: for example, DIAGNOSES_ICD.icd_code is mapped to SNOMED-CT via UMLS CUI lookup and contextual embedding similarity (SapBERT); LABEVENTS.itemid utilizes an official LOINC crosswalk; PRESCRIPTIONS.drug_code_rxnorm is injected directly (Brens et al., 9 Jan 2026).
3. Terminology Normalization and Transformation Functions
Terminology harmonization is critical for semantic interoperability across standards:
- Diagnosis:
- ICD-9/10 codes from DIAGNOSES_ICD are mapped to SNOMED-CT using UMLS CUI lookup supplemented with SapBERT embedding similarity:
where is the set of SNOMED codes sharing a UMLS CUI with ICD code .
Lab Observations: ITEMID is mapped to LOINC code via the LOINC-MIMIC crosswalk:
Medications: RxNorm is used directly with all required URIs.
4. Prompt Engineering and Automated Mapping Strategies
LLM-based mapping employs several prompt paradigms, each contributing distinct accuracy-effect profiles.
Self-Reflexive Prompt: The model performs initial mapping using FHIR JSON schemas, then internally revises output for consistency.
Mixture-of-Prompts (MoP): Alternates between direct column-to-element mapping, value-driven resource alignment, and FHIR-URL-based definitions.
5-Step Serial Prompt: Guides the model through staged identification, table intent summarization, schema provisioning, mapping, and output validation.
OpenAI "functions" and "structured_output" modes inject FHIR JSON schemas, enabling strict conformance. The parameter function_call="auto" invokes appropriate mapping tools.
Determinism is enforced using temperature=0; real-world tests vary temperature between 0, 0.5, and 1 to stress mapping resilience (Riquelme et al., 3 Jul 2025).
5. Evaluation Metrics and Empirical Validation
Performance assessment utilizes both resource-level and attribute-level metrics:
Definitions:
- Precision:
- Recall:
- F1-score:
- Accuracy:
- Results:
- Baseline resource identification: Perfect F1=1.00.
- Attribute-level mapping:
- GPT-4o, Self-Reflexive: 67.02%, 73.88%
- GPT-4o, MoP: [64.50%, 70.89%]
- Llama 3.2, MoP: [43.79%, 52.98%]
- Llama 3.2, Serial Schema: [28.49%, 35.62%]
- Real-world attribute-level mapping (N=10 runs, temperature=0,0.5,1):
- GPT-4o: 68.2–68.8%, Llama 3.2: 51.6–56.1% (Riquelme et al., 3 Jul 2025).
This suggests substantial gains for schema-aware instruction and prompt diversification. A plausible implication is that further fine-tuning on FHIR-specific sources could increase LLM mapping fidelity.
6. Error Analysis, Mitigation, and Recommendations
Error analysis reveals recurring LLM failure modes:
- Hallucinated Attributes: Models invent plausible but non-existent fields.
- Granularity Mismatch: Source columns are mapped to overly general or overly specific FHIR elements.
- Insufficient Context: Abbreviated or ambiguous column names yield incorrect mappings.
Mitigation strategies demonstrate efficacy:
- Structured JSON schemas (in-prompt) and "functions" reduce hallucinations.
- Self-reflexive prompts support automatic internal correction.
- Mixed and sample-value-enhanced prompts clarify ambiguous cases.
- Out-of-the-box open-source models underperform; fine-tuning and interactive expert interfaces are recommended for validation and iterative improvement (Riquelme et al., 3 Jul 2025).
7. Implementation Guidelines and Future Directions
Adherence to reference mappings enables reproducibility and extension to additional standards:
- Expand to HL7 CDA, OMOP CDM, and openEHR support via the same semi-automated workflow.
- Fine-tune open-source LLMs on FHIR implementation guides and US-Core profiles.
- Integrate centralized terminology servers (SNOMED CT, LOINC) for dynamic code alignment.
- Develop interactive GUIs for manual mapping validation and feedback-driven prompt refinement.
- Benchmark lightweight, privacy-preserving models for on-premise deployments.
- Implement RAG-based prompt generation for real-time FHIR fragment retrieval.
With the canonical mapping tables, prompt templates, normalization functions, and evaluation strategies, clinical informaticists and standards engineers can reliably effect MIMIC-IV-to-FHIR transformation, ensuring high-confidence semantic interoperability (Riquelme et al., 3 Jul 2025, Brens et al., 9 Jan 2026).