MIMIC-CXR: Large-Scale Chest Radiograph Dataset
- The MIMIC-CXR dataset is a large-scale repository comprising over 377,000 chest radiographs with free-text clinical reports, designed to support AI diagnostic and imaging research.
- It incorporates robust de-identification, image processing, and NLP-driven clinical label extraction, offering standardized splits and extensive metadata for benchmarking.
- The resource supports diverse applications such as disease classification, anatomical segmentation, and multimodal report generation while highlighting challenges like class imbalance and label noise.
The MIMIC-CXR dataset is a large-scale, publicly available repository of chest radiographs paired with clinical annotations and free-text reports, originating from the Beth Israel Deaconess Medical Center. Comprising over 377,000 images from more than 65,000 unique patients, the dataset is designed to support computer vision research in medical imaging, with robust annotation pipelines, standardized splits, and extensive metadata for benchmarking diagnostic models (Johnson et al., 2019). Subsequent derivative datasets and large-scale multimodal studies have made extensive use of this resource, significantly advancing automated disease recognition, interpretability, segmentation, and multimodal report generation.
1. Composition and Scale
MIMIC-CXR consists of 377,110 chest radiographs associated with 227,827 imaging studies from approximately 65,379 unique patients, acquired between 2011 and 2016 in the emergency department of Beth Israel Deaconess Medical Center (Johnson et al., 2019). The images are distributed as follows:
| View position | Percentage |
|---|---|
| Frontal | 67.2% |
| Lateral | 32.7% |
| Other | 0.1% |
Each imaging study typically includes one free-text radiology report. Derived corpora such as MIMIC-CXR-JPG and CheXmask-JPG restrict themselves to high-quality frontal AP/PA views, with subsets selected for anatomical segmentation and multimodal pairing (Gaggion et al., 2023, Agostini et al., 2024).
2. Image Processing, De-identification, and Format
Raw radiographs in DICOM format undergo an extensive preprocessing pipeline prior to public release (Johnson et al., 2019, Rubin et al., 2018):
- De-identification: Automated optical character recognition (OCR) and masking of “burned-in” annotations eliminate identifying text, with >6,900 manually reviewed images confirming removal of PHI.
- Format conversion: Images are intensity-normalized to 8-bit JPEG via pydicom, contrast-enhanced with OpenCV histogram equalization, and written out at quality factor 95. No resizing, denoising, or spatial filtering is applied, preserving native clinical dimensions and field of view.
- Variant pipelines: Some studies (e.g., CXR-LLAVA, CXR-LanIC) further crop and resize images to 224×224 or 512×512 pixels depending on neural architecture (Lee et al., 2023, Tang et al., 24 Oct 2025).
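The intensity normalization and histogram equalization steps above can be sketched in plain NumPy. This is a minimal stand-in for the pydicom/OpenCV pipeline, not the released conversion script; the raw pixel values below are illustrative:

```python
import numpy as np

def dicom_pixels_to_8bit(pixels: np.ndarray) -> np.ndarray:
    """Min-max normalize raw DICOM pixel data to an 8-bit image."""
    lo, hi = float(pixels.min()), float(pixels.max())
    scaled = (pixels.astype(np.float64) - lo) / max(hi - lo, 1e-8)
    return (scaled * 255.0).round().astype(np.uint8)

def equalize_hist(img: np.ndarray) -> np.ndarray:
    """Global histogram equalization on an 8-bit image (same idea as cv2.equalizeHist)."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    lut = np.round((cdf - cdf_min) / max(cdf[-1] - cdf_min, 1) * 255.0).astype(np.uint8)
    return lut[img]  # remap every pixel through the lookup table

# Illustrative 16-bit "raw" pixel data, as would come from pydicom's pixel_array
raw = np.array([[0, 1000], [2000, 4000]], dtype=np.uint16)
img8 = dicom_pixels_to_8bit(raw)
out = equalize_hist(img8)
```

Writing `out` to JPEG at quality factor 95 (e.g. via OpenCV or Pillow) completes the pipeline; note that no resizing or filtering is applied, matching the released images.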
3. Clinical Label Generation and Annotation Protocols
Annotation of radiographic findings is performed by automated NLP applied to free-text reports, primarily using NegBio and CheXpert (Johnson et al., 2019, Rubin et al., 2018, Agostini et al., 2024):
- NegBio: Rule-based dependency pattern engine detecting affirmed, negated, or uncertain disease mentions.
- CheXpert: Three-stage extraction using lexical variants, local context classification, and aggregation for each finding.
- Label Set: Fourteen disease categories, including Atelectasis, Cardiomegaly, Consolidation, Edema, Enlarged Cardiomediastinum, Fracture, Lung Lesion, Lung Opacity, No Finding, Pleural Effusion, Pleural Other, Pneumonia, Pneumothorax, and Support Devices.
Some downstream studies limit analysis to five key findings (e.g., Cardiomegaly, Pleural Effusion, Edema, Consolidation, Atelectasis) due to label reliability (Tang et al., 24 Oct 2025).
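To illustrate the rule-based idea behind NegBio-style labeling, a deliberately toy extractor might look like the sketch below. The lexicon and negation cues here are hypothetical and far smaller than those in the real three-stage pipelines, which additionally use dependency parsing and uncertainty handling:

```python
import re

# Toy lexicon mapping surface terms to canonical finding names (hypothetical subset)
FINDINGS = {
    "cardiomegaly": "Cardiomegaly",
    "pleural effusion": "Pleural Effusion",
    "edema": "Edema",
    "consolidation": "Consolidation",
    "atelectasis": "Atelectasis",
}
# Toy negation cues; real systems use dependency patterns, not substrings
NEGATION_CUES = ("no ", "without ", "no evidence of ", "negative for ")

def extract_labels(report: str) -> dict:
    """Assign 1 (affirmed) or 0 (negated) to each finding mentioned in the report."""
    labels = {}
    for sentence in re.split(r"[.\n]", report.lower()):
        for term, name in FINDINGS.items():
            if term in sentence:
                negated = any(cue in sentence for cue in NEGATION_CUES)
                labels[name] = 0 if negated else 1
    return labels

print(extract_labels("No evidence of pleural effusion. Mild cardiomegaly is present."))
# → {'Pleural Effusion': 0, 'Cardiomegaly': 1}
```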
Label statistics (training split) demonstrate marked class imbalance; for example, “No Finding” is present in 33.0% of studies, while rare pathologies (“Pleural Other”) occur in <1% (Johnson et al., 2019). Disagreement between NegBio and CheXpert labelers is typically <3%, except for rare findings.
Prevalence for a finding f is computed as p_f = n_f / N, where n_f is the positive count for finding f and N is the total number of studies in the split.
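This prevalence computation can be sketched directly from per-study label lists; the labels below are illustrative, not actual MIMIC-CXR counts:

```python
from collections import Counter

# Hypothetical per-study positive-finding lists (tiny illustrative sample)
study_labels = [
    ["No Finding"],
    ["Cardiomegaly", "Pleural Effusion"],
    ["No Finding"],
    ["Pleural Effusion"],
]

N = len(study_labels)  # total number of studies
counts = Counter(f for labels in study_labels for f in labels)  # n_f per finding
prevalence = {f: n / N for f, n in counts.items()}  # p_f = n_f / N
print(prevalence)
```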
4. Data Partitioning and Benchmarking
All data splits are performed at the patient level to prevent study-level leakage and overfitting to patient anatomy (Johnson et al., 2019, Tang et al., 24 Oct 2025, Agostini et al., 2024):
| Split | Patients | Studies | Images | % ≥1 finding |
|---|---|---|---|---|
| Train | 64,586 | 222,758 | 368,960 | 76.5% |
| Validation | 500 | 1,808 | 2,991 | 77.1% |
| Test (held) | 293 | 3,269 | 5,159 | 89.1% (enriched) |
Similar proportions are used in multimodal and interpretability studies—typically 80% train, 10% validation, 10% test (Agostini et al., 2024). Studies with insufficient view types or incomplete reports are excluded from downstream tasks.
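One simple way to realize patient-level splitting is to hash the patient identifier so that every study of a patient lands in the same split. This is a generic sketch, not the official MIMIC-CXR split procedure (which ships fixed split files); the IDs below are illustrative:

```python
import hashlib

def patient_split(patient_id: str, train: float = 0.8, val: float = 0.1) -> str:
    """Deterministically assign a patient (and thus all their studies) to one split."""
    h = int(hashlib.md5(patient_id.encode()).hexdigest(), 16) % 10_000
    frac = h / 10_000
    if frac < train:
        return "train"
    if frac < train + val:
        return "validation"
    return "test"

# Illustrative (patient_id, study_id) pairs; a patient's studies never cross splits
studies = [("p10000032", "s50414267"), ("p10000032", "s53189527"), ("p10000764", "s57375967")]
splits = {sid: patient_split(pid) for pid, sid in studies}
```

Hashing makes the assignment reproducible across machines without storing a split file, at the cost of only approximate split proportions.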
5. Extensions: Multimodal, Segmentation, and Interpretability Benchmarks
MIMIC-CXR serves as a foundational dataset for numerous research directions:
A. Anatomical Segmentation (Gaggion et al., 2023)
CheXmask provides pixel-level organ masks (left lung, right lung, heart) for >243,000 MIMIC-CXR images, generated using HybridGNet (landmark graph model). Quantitative expert validation yields Dice similarity coefficients (DSC) of 0.962 (lungs) and 0.919 (heart), with Reverse Classification Accuracy (RCA) used for quality control and individualized mask scoring.
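The Dice similarity coefficient used in the expert validation above is straightforward to compute for binary masks; the toy masks below are illustrative:

```python
import numpy as np

def dice(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom

# Illustrative 4x4 masks: a has 4 foreground pixels, b has 6, overlap is 4
a = np.zeros((4, 4), dtype=bool); a[1:3, 1:3] = True
b = np.zeros((4, 4), dtype=bool); b[1:3, 1:4] = True
print(round(dice(a, b), 3))  # 2*4 / (4+6) = 0.8
```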
B. Multimodal Learning (Agostini et al., 2024, Lee et al., 2023)
Combinatorial pairing of frontal and lateral images with associated free-text reports supports bimodal image representation learning. For weakly supervised, multimodal VAEs, all study-level tuples inherit 14 binary CheXpert labels.
C. Interpretable Model Development (Tang et al., 24 Oct 2025)
Recent frameworks (CXR-LanIC) utilize MIMIC-CXR multimodal embeddings to train sparse autoencoders (“transcoders”), discovering thousands of interpretable radiological patterns aligned with clinical findings using large ensembles and pattern sparsification strategies.
D. LLM Fine-tuning (Lee et al., 2023)
CXR-LLAVA leverages MIMIC-CXR (n ≈ 217,699 frontal CXRs with reports) for multimodal LLM fine-tuning, using GPT-4 guided report cleaning and ViT-based image encoding. Held-out MIMIC test sets (n = 3,000) are used to benchmark autonomous report generation and diagnostic performance by F1 and radiologist acceptability metrics.
6. Limitations, Known Issues, and Usage Recommendations
Principal limitations include (Johnson et al., 2019, Rubin et al., 2018, Gaggion et al., 2023):
- Label Noise: Automatic NLP annotation may generate errors, especially for uncertain or negated mentions. Disagreement rates between labelers are typically 2–3%, and higher for rare findings.
- De-identification Artifacts: Masking may remove small areas of diagnostic content, though manual review shows minimal impact.
- Class Imbalance: Many findings are rare; downstream models must address imbalance via loss weighting or targeted sampling.
- Institutional Bias: Data from a single academic emergency department cohort may not generalize; AP (portable) images may carry confounding device signatures (e.g., support devices correlated with disease severity).
- Segmentation Quality: CheXmask guideline recommends discarding images with individualized mask RCA < 0.70 to avoid poor-quality cases, especially in AP views.
Recommended practices include patient-level splitting, use of standardized preprocessing scripts, and exclusion of low-quality or ambiguous cases based on quality indices. Data access requires PhysioNet credentialing and adherence to privacy protocols (Rubin et al., 2018).
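For the class-imbalance issue noted above, a common mitigation is per-finding positive-class weighting. The sketch below computes weights of the form used by, e.g., PyTorch's `BCEWithLogitsLoss(pos_weight=...)`; the label matrix is a hypothetical miniature (3 findings rather than 14):

```python
import numpy as np

# Hypothetical binary label matrix: rows = studies, columns = findings
labels = np.array([
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
], dtype=float)

pos = labels.sum(axis=0)                 # positive count per finding
neg = labels.shape[0] - pos              # negative count per finding
pos_weight = neg / np.maximum(pos, 1)    # rare findings get larger weights
print(pos_weight)
```

Weighting the positive term by neg/pos makes each finding contribute roughly equally to the loss regardless of its prevalence.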
7. Impact and Research Applications
MIMIC-CXR has catalyzed advances in medical computer vision by:
- Enabling benchmark classification, localization, and multimodal learning tasks,
- Supporting development of interpretable and clinically-relevant AI systems,
- Providing high-fidelity segmentation masks for self-supervised or auxiliary training,
- Facilitating standardized benchmarking and reproducible research through public data, processing scripts, and annotation protocols.
Continued community adoption and derivative datasets (CheXmask, CXR-LLAVA, CXR-LanIC) highlight its centrality in both supervised and unsupervised radiograph analysis research, extending to multimodal report generation and clinical workflow integration (Johnson et al., 2019, Gaggion et al., 2023, Tang et al., 24 Oct 2025, Lee et al., 2023, Agostini et al., 2024, Rubin et al., 2018).