PTB-XL: ECG Benchmark Dataset
- PTB-XL is a large-scale, annotated clinical ECG dataset featuring 21,837 12-lead recordings from nearly 19,000 patients for machine learning research.
- It provides comprehensive metadata with hierarchical diagnostic labels, detailed preprocessing workflows, and patient-stratified cross-validation splits to ensure reproducibility.
- The dataset supports practical benchmarks, transfer learning, and interpretability studies, making it a pivotal resource for automated ECG interpretation.
The PTB-XL dataset is a large-scale, open-access corpus of clinical 12-lead electrocardiogram (ECG) recordings designed for benchmarking and research in automated ECG interpretation. It provides structured ground-truth labels, comprehensive metadata, and a well-defined experimental framework facilitating rigorous development and evaluation of machine learning algorithms in electrophysiological signal analysis (Strodthoff et al., 2020, Kang et al., 14 Apr 2025, Sharma et al., 2022).
1. Data Composition and Annotation Structure
PTB-XL comprises 21,837 12-lead ECG records, each 10 seconds in duration, acquired from 18,885 unique patients (52% male, 48% female). The recordings are provided at both 500 Hz (raw) and 100 Hz (recommended for modeling) sample rates. The standard clinical lead configuration is used: I, II, III, aVR, aVL, aVF, V1–V6. Diagnostic labeling follows the SCP-ECG standard with 71 non-mutually-exclusive statement codes, divided into three major categories:
- Diagnosis: 44 statements (including myocardial infarction [MI], hypertrophy [HYP], conduction disturbance [CD])
- Form: 19 statements (waveform morphology)
- Rhythm: 12 statements (e.g., atrial fibrillation, pacemaker rhythm)
Diagnosis labels are further structured hierarchically into 5 superclasses and 24 subclasses. The five superclasses—Normal (NORM), Conduction Disturbance (CD), Hypertrophy (HYP), Myocardial Infarction (MI), and ST/T-Change (STTC)—cover the most common high-level diagnoses used in multi-label tasks (Kang et al., 14 Apr 2025, Strodthoff et al., 2020, Sharma et al., 2022). Each record may carry multiple concurrent labels, reflecting clinical poly-diagnosis. Class imbalance is pronounced: for example, in (Kang et al., 14 Apr 2025), NORM: 7,185; CD: 3,232; HYP: 815; MI: 2,936; STTC: 3,064.
Expert cardiologists annotated each record using the SCP taxonomy, assigning diagnosis likelihoods (15–100) via keyword mapping. An additional technical expert review flags signal-quality issues (noise, baseline drift), supporting downstream quality filtering or artifact-aware training strategies (Strodthoff et al., 2020).
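The scp_codes annotations are stored per record as Python-dict literals mapping SCP statement codes to likelihoods. A minimal sketch of mapping them to the five superclasses; the code-to-superclass table below is an abbreviated, illustrative subset (the full mapping ships with the dataset in scp_statements.csv):

```python
import ast

# scp_codes cells in ptbxl_database.csv are Python-dict literals mapping
# SCP statement codes to annotator likelihoods (0-100). This code->superclass
# table is an abbreviated, illustrative subset of the full mapping.
CODE_TO_SUPERCLASS = {
    "NORM": "NORM", "IMI": "MI", "ASMI": "MI",
    "LVH": "HYP", "LAFB": "CD", "NDT": "STTC",
}

def superclasses(scp_codes_cell, min_likelihood=0.0):
    """Parse one scp_codes cell and return its diagnostic superclasses."""
    codes = ast.literal_eval(scp_codes_cell)
    return {CODE_TO_SUPERCLASS[c] for c, lik in codes.items()
            if c in CODE_TO_SUPERCLASS and lik >= min_likelihood}
```

Because records are multi-labeled, the result is a set, e.g. `superclasses("{'IMI': 100.0, 'LVH': 80.0}")` yields both MI and HYP.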
2. Preprocessing and Time–Frequency Representation
PTB-XL's ECG tracings are subject to several preprocessing workflows, depending on the experimental design. For deep learning pipelines, the recommended 100 Hz downsampled signals (1,000 samples/lead, per record) are frequently used directly with minimal filtering; optional baseline-wander removal and band-pass filtering address artifactual contamination as required (Strodthoff et al., 2020). In (Kang et al., 14 Apr 2025), explicit band-pass filtering is omitted, with baseline wander and high-frequency noise handled in the frequency domain.
Advanced feature extraction is achieved using the Short-Time Fourier Transform (STFT), as exemplified in (Kang et al., 14 Apr 2025). The input $x \in \mathbb{R}^{12 \times 1000}$ (12 leads, 10 s at 100 Hz) is mapped using the STFT with window length 64, hop length 16, and 480 frequency bins:

$$X[m, k] = \sum_{n=0}^{w-1} x[n + mh]\, g[n]\, e^{-j 2\pi k n / N}$$

where $m$ indexes time frames and $k$ frequency bins, with $w$ = window length, $h$ = hop length, $N$ = number of FFT bins, and $g[n]$ the analysis window.
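A minimal NumPy sketch of this STFT for a single lead (the Hann window and 64-point FFT are assumptions; the 480 frequency bins reported in (Kang et al., 14 Apr 2025) would correspond to a larger, zero-padded FFT):

```python
import numpy as np

def stft(x, w=64, h=16, n_fft=64):
    """Naive one-sided STFT: length-w frames at hop h, Hann window g[n],
    n_fft-point FFT giving n_fft//2 + 1 frequency bins per frame."""
    g = np.hanning(w)
    n_frames = 1 + (len(x) - w) // h
    frames = np.stack([x[m*h : m*h + w] * g for m in range(n_frames)])
    return np.fft.rfft(frames, n=n_fft, axis=1)

x = np.random.randn(1000)   # one lead: 10 s at 100 Hz
X = stft(x)                 # shape (59, 33): 59 frames, 33 frequency bins
```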
Classic denoising approaches such as 1D low-pass FIR filtering (cutoff 45 Hz) and wavelet-based baseline correction were used in (Sharma et al., 2022) for robust binary classification, particularly targeting deployment on embedded hardware.
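A sketch of such a low-pass FIR stage with SciPy (the 101-tap filter length and the synthetic test signal are assumptions, not parameters from the cited work):

```python
import numpy as np
from scipy.signal import firwin, filtfilt

fs = 500                                      # raw PTB-XL sampling rate
taps = firwin(numtaps=101, cutoff=45, fs=fs)  # linear-phase low-pass FIR
                                              # (101 taps is an assumption)
t = np.arange(fs * 10) / fs                   # 10 s synthetic test signal:
x = np.sin(2*np.pi*5*t) + 0.5*np.sin(2*np.pi*120*t)  # ECG-band + HF noise
y = filtfilt(taps, 1.0, x)                    # zero-phase filtering keeps
                                              # the 5 Hz component
```

`filtfilt` applies the filter forward and backward, preserving waveform timing, which matters for morphology-sensitive ECG features.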
3. Splitting Protocols and Experimental Workflows
Recommended experimental design relies on the predefined 10-fold cross-validation partitions with patient-level stratification to preclude overfitting and ensure generalizability (Strodthoff et al., 2020, Kang et al., 14 Apr 2025, Sharma et al., 2022). The standard split allocates folds 1–8 for training (~80%), fold 9 for validation (~10%), and fold 10 as an independent test set (~10%). Each fold maintains the original distribution of diagnoses, permitting consistent benchmarking. Patient IDs are used to avoid patient overlap between splits (Strodthoff et al., 2020).
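Because fold membership is precomputed in the dataset's `strat_fold` column, the standard split reduces to simple filtering; the DataFrame below is a synthetic stand-in for ptbxl_database.csv:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for ptbxl_database.csv: the real file assigns each
# record a patient-stratified fold id (1-10) in the `strat_fold` column.
rng = np.random.default_rng(0)
df = pd.DataFrame({"ecg_id": np.arange(1000),
                   "strat_fold": rng.integers(1, 11, size=1000)})

train = df[df.strat_fold <= 8]    # folds 1-8, ~80%
val   = df[df.strat_fold == 9]    # fold 9,   ~10%
test  = df[df.strat_fold == 10]   # fold 10,  ~10%
```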
For multi-label modeling, class imbalance is mitigated via per-class weighting in loss functions, robust data augmentation (random masking of contiguous time–frequency segments, with a mask ratio of 0.2 and application probability of 0.8 per batch in (Kang et al., 14 Apr 2025)), and frequency-domain artifact suppression.
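One plausible reading of the masking augmentation, zeroing a contiguous time–frequency patch; the patch shape and per-example (rather than per-batch) application are assumptions:

```python
import numpy as np

def tf_mask(spec, mask_ratio=0.2, p=0.8, rng=np.random.default_rng(0)):
    """Zero one contiguous time-frequency patch with probability p.
    Illustrative sketch; patch geometry is an assumption."""
    if rng.random() > p:
        return spec
    out = spec.copy()
    f_bins, t_bins = spec.shape
    f_len = max(1, int(f_bins * mask_ratio))
    t_len = max(1, int(t_bins * mask_ratio))
    f0 = rng.integers(0, f_bins - f_len + 1)
    t0 = rng.integers(0, t_bins - t_len + 1)
    out[f0:f0 + f_len, t0:t0 + t_len] = 0.0
    return out

spec = np.ones((480, 59))     # e.g. an STFT magnitude map
masked = tf_mask(spec)        # input left untouched; a copy is masked
```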
4. Model Benchmarks and Performance Metrics
PTB-XL supports both multi-label and single-label classification tasks. Multi-label metrics include:
- Term-centric macro-AUC:

$$\mathrm{AUC}_{\text{macro}} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AUC}_c$$

where $C$ is the number of classes and $\mathrm{AUC}_c$ the per-class area under the ROC curve.
- Sample-centric $F_{\max}$ score: maximizes the threshold-dependent F1 over all thresholds, $F_{\max} = \max_t F_1(t)$.
- Sample-centric precision and recall at threshold $t$:

$$\mathrm{Prec}(t) = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap \hat{Y}_i(t)|}{|\hat{Y}_i(t)|}, \qquad \mathrm{Rec}(t) = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap \hat{Y}_i(t)|}{|Y_i|}$$

where $\hat{Y}_i(t)$ denotes the predicted labels at threshold $t$ for sample $i$, and $Y_i$ the true labels.
Single-label tasks (e.g., binary classification, age/gender) use accuracy, precision, recall, F1, and mean absolute error for regression endpoints.
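These multi-label metrics can be computed with scikit-learn; the labels and scores below are synthetic, and the threshold grid for F_max is an assumption:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

# Synthetic multi-label ground truth and scores (5 superclass-style labels).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(200, 5))
y_score = np.clip(y_true * 0.6 + rng.random((200, 5)) * 0.5, 0, 1)

# Term-centric macro-AUC: average of per-class ROC-AUCs.
macro_auc = roc_auc_score(y_true, y_score, average="macro")

# Sample-centric F_max: best sample-averaged F1 over a threshold grid
# (grid resolution is an assumption).
f_max = max(f1_score(y_true, (y_score >= t).astype(int),
                     average="samples", zero_division=0)
            for t in np.linspace(0.05, 0.95, 19))
```

Note that `average="samples"` computes F1 per record before averaging, which matches the sample-centric definition; `zero_division=0` handles records with no positive labels.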
A range of modeling architectures have been evaluated:
| Task/Labels | Top Models | Term-AUC | $F_{\max}$ |
|---|---|---|---|
| All (71) | XResNet101/IncTime | 0.925 | 0.764 |
| Diagn. (44) | ResNet1d/XResNet101 | 0.936 | 0.741 |
| Super (5) | ResNet1d/XResNet101 | 0.930 | 0.823 |
| Rhythm (12) | XResNet101/IncTime | 0.957 | 0.917 |
Convolutional and recurrent models (e.g., ResNet1d, InceptionTime, LSTM, xLSTM) consistently outperform feature-based or naive predictors (Strodthoff et al., 2020, Kang et al., 14 Apr 2025).
5. Specialized Applications and Deployment Scenarios
PTB-XL's scale, diversity, and annotation granularity permit experimentation across a range of clinical and technical tasks. In (Sharma et al., 2022), a 1D-CNN trained for binary “Normal vs. Abnormal” classification was quantized and deployed on Raspberry Pi, achieving accuracy of 81.2% (12 leads) with floating-point TFLite models, and model size reductions of 86–94% (float32 to float16 conversion). Class weights (NORM 1.192, Abnormal 0.861) counteracted label imbalance. Performance scaled positively with lead count, with 12-lead input providing maximal accuracy.
The dataset also supports transfer learning workflows, with pretraining on PTB-XL followed by adaptation to other ECG datasets (e.g., ICBEB2018), though substantial benefits arise only in limited-target-data regimes (Strodthoff et al., 2020).
6. Advanced Analytic Methods and Interpretability
PTB-XL's rich annotation facilitates interpretability analyses, model calibration, and stratified performance assessment:
- Hidden stratification: Fine-grained evaluation reveals subgroup performance deficits potentially masked by aggregate metrics (e.g., IVCD with low AUC within NORM co-diagnosis group (Strodthoff et al., 2020)).
- Model uncertainty: Deep ensembles quantify predictive variance; ensemble uncertainty aligns with annotator likelihoods and can flag possible human overconfidence in labels annotated with maximal certainty.
- Interpretability: Layer-wise relevance propagation and gradient-based saliency maps elucidate the model's decision basis for specific pathologies.
These capabilities enable the development of clinically robust and explainable ECG classifiers.
7. Practical Guidelines and Research Impact
The PTB-XL dataset constitutes a primary resource for reproducible ECG machine learning research. Key recommendations for researchers include:
- Adopt patient-stratified cross-validation exclusively.
- Report both term-centric macro-AUC and sample-centric $F_{\max}$ with confidence intervals (e.g., via bootstrapping).
- Prefer 100 Hz signals for direct input; apply only necessary preprocessing.
- Deploy data augmentation (windowing, time–frequency masking) for regularization.
- Address label imbalance with weighted losses and systematic augmentation strategies.
- Leverage metadata and artifact labels for robust model development.
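As a sketch of the weighted-loss recommendation, class weights of the kind reported in Section 5 (NORM 1.192, Abnormal 0.861) follow the standard inverse-frequency heuristic; the 42%/58% split below is a hypothetical illustration, not the dataset's actual class fractions:

```python
import numpy as np

def balanced_weights(y):
    """Inverse-frequency ('balanced') class weights:
    n_samples / (n_classes * count_c) for each class c."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

# Illustrative 42%/58% NORM-vs-Abnormal split (hypothetical fractions):
y = np.array([0] * 42 + [1] * 58)
w = balanced_weights(y)   # ~[1.19, 0.86], the same pattern as the
                          # reported NORM 1.192 / Abnormal 0.861 weights
```

These weights are then passed to a weighted cross-entropy so that rare classes contribute proportionally more to the loss.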
By providing comprehensive, structured ground-truth for ECG signal interpretation, PTB-XL has established itself as a central benchmark for the development, comparison, and deployment of deep learning methods in clinical cardiology and digital health (Strodthoff et al., 2020, Kang et al., 14 Apr 2025, Sharma et al., 2022).