BREATH Datasets: A Multi-domain Overview
- "BREATH dataset" is a collective term for several multimodal corpora capturing respiratory signals, with applications in biometric authentication, deepfake detection, and medical imaging.
- These datasets employ rigorous experimental protocols with high-frequency sensors, feature engineering such as multifractal analysis, learned models such as CNN-LSTMs, and detailed annotations to support robust analysis.
- They provide actionable insights with benchmark results and defined evaluation metrics that advance research in areas like stress detection, airway morphometry, and artifact segmentation.
The term "BREATH dataset" designates several distinct, well-recognized research corpora across biomedical imaging, speech analysis, biometric authentication, and human–computer interaction, each addressing key questions through rigorous experimental and annotation protocols. The following entries detail the most prominent BREATH datasets, organizing their content, acquisition methodologies, data structures, and research utility, as evidenced by recent literature.
1. Physically-Grounded Exhaled Breath Biometrics ("User authentication system based on human exhaled breath physics") (Karunanethy et al., 2024)
Data Collection and Purpose
The BREATH dataset introduced in (Karunanethy et al., 2024) is a large-scale, fluid-mechanics-based human biometric corpus designed to test the hypothesis that turbulent airflow during exhalation encodes uniquely identifying morphometric information of the extrathoracic airway. It comprises raw and processed time-series data supporting both user confirmation and identification tasks, as well as morphometric respiratory investigations. The cohort consists of 94 healthy adults (IIT Madras; ages 21–27), each contributing ten forced expiration trials recorded at 10 kHz via a Dantec Dynamics 55P11 hot-wire anemometer in a controlled laboratory setting.
Signal Processing and Feature Engineering
Each 1.5-second exhalation yields 15,000 samples, with Z-score normalization per trial. Segmentation divides each trial into 19 overlapping windows of 1,500 samples. The principal feature class is multifractal detrended fluctuation analysis (MFDFA), extracting window-specific β (peak of the singularity spectrum), ω (multifractal width), and ε (spectrum asymmetry). Supplementary time-series features include the sum of absolute differences, autoregressive (AR) model coefficients, local-maxima counts, continuous wavelet transform (CWT) peak counts, the partial autocorrelation (PACF) at lag 3, and signal kurtosis. Segments failing convexity or width constraints (ω < 0.05) are discarded.
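As an illustration of this windowing scheme, here is a minimal Python sketch (not the authors' code; the 750-sample hop is inferred from 19 windows of 1,500 samples over a 15,000-sample trial):

```python
import numpy as np

def segment_trial(signal, win=1500, hop=750):
    """Z-score-normalize one exhalation trial and split it into
    overlapping windows: a 15,000-sample trial (1.5 s at 10 kHz)
    yields 19 windows of 1,500 samples at the assumed 750-sample hop."""
    x = (signal - signal.mean()) / signal.std()
    n_windows = (len(x) - win) // hop + 1
    return np.stack([x[i * hop : i * hop + win] for i in range(n_windows)])

trial = np.random.randn(15000)   # stand-in for one hot-wire voltage trace
segments = segment_trial(trial)
print(segments.shape)            # (19, 1500)
```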
Data Organization
The corpus is arranged as one directory per subject, containing raw voltage CSV files and per-segment feature CSVs with explicit segment, trial, and subject indexing. Confirmation tasks model the subject ID as a binary label (genuine/impostor); identification requires predicting the subject ID from [1…94]. Modeling employs random forests and Hotelling’s T² for pairwise statistical tests, with recommended train/test splits, feature selection, and round-robin voting described in the source.
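A minimal sketch of a confirmation experiment against such a layout, assuming hypothetical file and column names (the source's recommended feature selection and round-robin voting are omitted):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical layout: one feature CSV per subject; rows are segments,
# columns are MFDFA/time-series features plus trial and segment indices.
features = pd.concat(
    pd.read_csv(f"subject_{s:02d}/features.csv").assign(subject=s)
    for s in range(1, 95)
)
X = features.drop(columns=["subject", "trial", "segment"])
y = (features["subject"] == 1).astype(int)  # genuine vs. impostor for subject 1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(f"confirmation accuracy: {clf.score(X_te, y_te):.3f}")
```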
Benchmark Results and Utility
True confirmation rates exceed 97% (±2.5%). In identification, top-candidate precision is ∼20–40% and top-3 candidate precision 50–66%, with clear intersubject spectral separability. The dataset's structure, acquisition protocols, multifractal and statistical feature suite, and modeling recommendations provide a complete foundation for research in airway biometrics and morphometric fingerprinting (Karunanethy et al., 2024).
2. Multimodal Wearable Breathing Biometric Dataset ("Personalized breath based biometric authentication with wearable multimodality") (Bui et al., 2021)
Hardware and Data Modalities
This publicly available corpus captures intentional nasal breathing gestures recorded with synchronized acoustics (a microphone below the nostrils) and inertial measurements (a chest-mounted tri-axial accelerometer and gyroscope at 50 Hz), coordinated by a Raspberry Pi Zero platform. It spans 20 adults (16 M/4 F), each performing 20–61 breathing gesture cycles (about 2,445 instances in total), labeled normal, deep, or strong.
Acquisition and Annotation
Each instance is timestamped and manually annotated (start/end, gesture type) in Audacity. Acoustic signals are sampled at 44.1 kHz (downsampled to 16 kHz for feature extraction), and gesture lengths are normalized to 2.5–4.5 seconds depending on type.
Preprocessing and Features
Audio is parameterized by 20 MFCCs per frame (32 ms window, 20 ms hop); motion signals are provided as raw time series. Gesture labels, session and subject IDs, and timing metadata are directly accessible.
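This parameterization maps directly onto standard audio tooling; a minimal sketch with librosa, assuming a hypothetical file name:

```python
import librosa

# Load one gesture recording at the 16 kHz analysis rate.
y, sr = librosa.load("gesture.wav", sr=16000)   # hypothetical file name

# 20 MFCCs per frame with a 32 ms window (512 samples at 16 kHz)
# and a 20 ms hop (320 samples), matching the parameters above.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=512, hop_length=320)
print(mfcc.shape)   # (20, n_frames)
```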
Modeling and Benchmarks
Two architectures are benchmarked:
- CNN-LSTM: Simultaneous processing of MFCC and IMU branches, followed by concatenation and LSTM-based sequence modeling.
- Temporal Convolutional Network (TCN): Causal, dilated convolution-based temporal modeling, with late multimodal fusion.
Both identification and verification paradigms are supported. The multimodal CNN-LSTM achieves 97.1% identification accuracy and an equal error rate below 2% on normal breathing gestures. The dataset demonstrates that fused multimodal breathing signals enable robust biometric authentication, with normal breathing gestures providing maximum discriminability (Bui et al., 2021).
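As a rough illustration of the fused CNN-LSTM layout described above (layer sizes, sequence lengths, and the 20-way softmax head are illustrative assumptions, not the authors' exact architecture):

```python
from tensorflow.keras import layers, Model

# Two input branches: MFCC frames (20 coefficients) and IMU frames
# (6 channels: tri-axial accelerometer + gyroscope), processed by
# per-branch convolutions, concatenated, then modeled with an LSTM.
mfcc_in = layers.Input(shape=(200, 20))
imu_in = layers.Input(shape=(200, 6))

a = layers.Conv1D(32, 3, padding="same", activation="relu")(mfcc_in)
b = layers.Conv1D(32, 3, padding="same", activation="relu")(imu_in)
fused = layers.Concatenate()([a, b])             # frame-wise fusion
h = layers.LSTM(64)(fused)                       # sequence modeling
out = layers.Dense(20, activation="softmax")(h)  # 20-way identification

model = Model([mfcc_in, imu_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```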
3. Deepfake Speech Discrimination via Breath Patterns ("Every Breath You Don't Take") (Layton et al., 2024)
Dataset Design
This corpus consists of 333 audio files totaling 52.48 hours (277 TTS files, 26.94 h; 56 real files, 25.54 h), sourced from the “listen to this article” tracks of four major news sites, which yield paired human-read and TTS renderings. Manual breath annotations are provided for a podcast-derived, 5-hour training set, enabling supervised breath detection. The main news-article set is annotated by an automated CNN–BiLSTM breath detector.
Feature Suite and Classification
Three features summarize detected breath events: average breaths per minute, breath duration, and inter-breath interval. Classifiers include thresholding, an SVM with a polynomial kernel, and decision trees. Evaluation metrics are AUPRC and EER, with the absence of detected breaths serving as a strong TTS/deepfake discriminator.
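A minimal sketch of this feature-and-classifier pipeline, using synthetic stand-in breath events (mean-based summaries are an assumption; the paper's exact feature definitions may differ):

```python
import numpy as np
from sklearn.svm import SVC

def breath_features(onsets, durations, clip_minutes):
    """Per-clip summary of detected breath events: breaths per minute,
    mean breath duration, and mean inter-breath interval."""
    bpm = len(onsets) / clip_minutes
    dur = float(np.mean(durations)) if len(durations) else 0.0
    ibi = float(np.mean(np.diff(onsets))) if len(onsets) > 1 else 0.0
    return [bpm, dur, ibi]

# Synthetic example: real clips contain breaths; TTS clips contain none.
real = [breath_features(np.sort(np.random.uniform(0, 600, 80)),
                        np.random.uniform(0.2, 0.6, 80), 10.0)
        for _ in range(20)]
tts = [[0.0, 0.0, 0.0]] * 20
X = np.array(real + tts)
y = np.array([1] * 20 + [0] * 20)   # 1 = human-read, 0 = TTS

clf = SVC(kernel="poly").fit(X, y)
```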
Performance and Relevance
On the 33.6-hour test set, the SVM achieves perfect discrimination (AUPRC = 1.0, EER = 0.00), compared to 0.72 AUPRC for SOTA wav2vec 2.0 SSL models, underscoring the utility of prosodic breath cues over low-level features. Data access is via public URLs (code, models, and file lists available online); the dataset provides a reproducible, human-verifiable benchmark for breath-based deepfake detection pipelines (Layton et al., 2024).
4. Cardiac MRI Motion Artifact Dataset (BREATH subset, CMRxMotion Challenge) (Wang et al., 2022)
Cohort and Imaging Protocol
Comprising 320 short-axis cine MRI volumes from 40 healthy adults, this dataset captures the impact of respiratory motion artifacts on CMR segmentation and quality grading. Scans are acquired on a single 3T Siemens Vida scanner with a bSSFP sequence during four scripted breath-hold behaviors (full compliance, half-length hold, free breathing, intensive breathing) at two cardiac phases (ED, ES), totaling 8 volumes per subject.
Annotation and Structure
Volumes are graded on a five-point Likert scale (collapsed to mild, intermediate, and severe artifact classes) by two radiologists. Diagnostic (mild/intermediate) images are additionally segmented into the left ventricular blood pool, myocardium, and right ventricle. Data are split into train (160 volumes), validation (40), and test (120); only the training set includes full annotations, supporting both artifact classification (Cohen’s kappa) and segmentation (Dice, HD95).
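For reference, the Dice metric of the segmentation track is straightforward to compute per structure; a minimal sketch (HD95 additionally requires surface-distance computation, e.g., via MedPy, and is omitted here):

```python
import numpy as np

def dice(pred, gt, label):
    """Dice overlap for one structure label (e.g., LV blood pool,
    myocardium, RV) between predicted and ground-truth label maps."""
    p, g = pred == label, gt == label
    denom = p.sum() + g.sum()
    return 2.0 * np.logical_and(p, g).sum() / denom if denom else 1.0

# Example: per-structure scores for labels 1-3 on random toy volumes.
pred = np.random.randint(0, 4, (12, 128, 128))
gt = np.random.randint(0, 4, (12, 128, 128))
print([round(dice(pred, gt, l), 3) for l in (1, 2, 3)])
```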
Research Utility and Limitations
Potential uses include algorithmic benchmarking for image quality classification, segmentation robustness under artifact, and pipeline stress testing. Key limitations: healthy young adults only, a single vendor and field strength, scripted breath-hold behaviors, and consensus grading without formal inter-observer statistics. Access is managed via Synapse under a DUA (Wang et al., 2022).
5. Thermal-based Breathing and Stress Dataset ("DeepBreath") (Cho et al., 2017)
Thermal Sensing Workflow
The DeepBreath BREATH dataset provides ≈3,936 respiration variability spectrogram (RVS) images (120×120) derived from low-cost FLIR One mobile thermal imagery of 8 adults (3 female, ages 18–53) under cognitive stress tasks (Stroop and math, each at easy/difficult levels). Each image encodes the spectrotemporal variability of a tracked nostril region (mean temperature sampled at 8 Hz) during 5-minute sessions, sliding-windowed (20 s window, 1 s hop) and bandpass filtered (0.1–0.85 Hz).
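A hedged sketch of this preprocessing chain (the filter order, FFT-based windowed spectra, and any resizing to 120×120 are assumptions, not the authors' exact pipeline):

```python
import numpy as np
from scipy import signal

FS = 8.0                                 # nostril-ROI sampling rate (Hz)
WIN, HOP = int(20 * FS), int(1 * FS)     # 20 s window, 1 s hop

def rvs_like(temps):
    """Bandpass the thermal respiration trace to 0.1-0.85 Hz, then
    compute magnitude spectra over 20 s sliding windows."""
    sos = signal.butter(4, [0.1, 0.85], btype="bandpass", fs=FS, output="sos")
    x = signal.sosfiltfilt(sos, temps)
    starts = range(0, len(x) - WIN + 1, HOP)
    return np.stack([np.abs(np.fft.rfft(x[s:s + WIN])) for s in starts])

spec = rvs_like(np.random.randn(int(5 * 60 * FS)))  # one 5-minute session
print(spec.shape)                                   # (n_windows, WIN // 2 + 1)
```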
Ground-Truth Stress Labeling
Stress levels are annotated using normalized Visual Analog Scale (VAS, 0–10 cm) self-reports collected after each stressor trial and clustered (k = 3) into “No,” “Low,” and “High” stress. Each RVS image inherits its label directly from its session label.
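A minimal sketch of this labeling step (the VAS values are synthetic, and mapping clusters to labels by ascending cluster center is an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-session VAS self-reports (0-10 cm), normalized to [0, 1].
vas = np.array([0.5, 1.0, 2.0, 4.5, 5.0, 7.5, 8.0, 9.0]).reshape(-1, 1)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vas / 10.0)

# Rank clusters by center so the lowest-scoring cluster becomes "No" stress.
rank = np.argsort(np.argsort(km.cluster_centers_.ravel()))
labels = np.array(["No", "Low", "High"])[rank[km.labels_]]
print(labels)
```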
Data Access and Use Cases
Images (PNG) and per-instance metadata (CSV: subject/session/label/timing/score) are packaged in a structured downloadable directory. This dataset supports deep and shallow learning for non-contact stress detection, respiration variability analysis, and ROI-tracking method development (Cho et al., 2017).
6. Other Prominent BREATH-Related Datasets
Vision-Language Bronchoscopy Localization ("BREATH-VL") (Tian et al., 7 Jan 2026)
A large-scale bronchoscopy dataset with 148,926 bronchoscopic video frames from 66 in-vivo procedures, each annotated with 6-DoF physical camera poses in the CT frame, branch-level labels, and semantic airway structure information. Data are acquired using Olympus bronchoscopes and calibrated via checkerboard imaging and virtual-to-real mesh registration, enabling benchmarking of 6-DoF endoscopy localization and VQA-style anatomy-aware tasks (Tian et al., 7 Jan 2026).
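As a peripheral illustration, checkerboard intrinsic calibration of this kind is commonly done with standard OpenCV routines; a minimal sketch under hypothetical frame paths and board size (not the authors' registration pipeline):

```python
import cv2
import numpy as np

pattern = (9, 6)                                  # inner corners (hypothetical)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_pts, img_pts = [], []
for path in ["calib_00.png", "calib_01.png"]:     # hypothetical frames
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    ok, corners = cv2.findChessboardCorners(gray, pattern)
    if ok:
        obj_pts.append(objp)
        img_pts.append(corners)

# Intrinsics K and distortion coefficients for the bronchoscope camera.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)
```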
Non-Contact Breathing Multimodal Dataset (OMuSense-23) (Cañellas et al., 2024)
OMuSense-23 provides 600 RGB-D and mmWave radar sequences (50 subjects × 3 poses × 4 breathing activities), each with 30 s blocks and physiological features for non-contact vital sign analysis, biometric modeling, and activity recognition. It enables baseline research into pose/activity classification and physiological regression from non-contact signals (public Zenodo release) (Cañellas et al., 2024).
Summary Table: Representative BREATH Datasets
| Dataset (Paper/Year) | Domain/Modality | Subjects/Samples | Key Tasks/Labels |
|---|---|---|---|
| (Karunanethy et al., 2024) | Fluid mech., HWA | 94 subjects, 940 trials | Biometrics, morphometrics |
| (Bui et al., 2021) | Acoustics+IMU | 20 subjects, 2445 inst. | Multimodal biometrics |
| (Layton et al., 2024) | Audio (speech) | 333 files, 52.5 hr | Deepfake/classification |
| (Wang et al., 2022) | Cardiac MRI | 40 subjects, 320 vols | Quality, segmentation |
| (Cho et al., 2017) | Thermal imaging | 8 subjects, 3936 RVS | Stress classification |
| (Tian et al., 7 Jan 2026) | Bronchoscopy video | 66 patients, 148k fr. | 6-DoF localization, VQA |
| (Cañellas et al., 2024) | Radar+RGB-D | 50 subjects, 600 seq. | Activity, pose, biometrics |
Nomenclature and Contextual Usage
The term "BREATH dataset" is not unified but refers generically to datasets in which breath or respiratory phenomena are central to the sensing modality, label definition, or downstream analysis, spanning physical (fluid/multifractal), acoustic (audio/IMU), image-based (thermal, MR, endoscopic, RGB-D), and even speech prosody. Each dataset defines precise acquisition, annotation, and evaluation protocols, supporting reproducible research and benchmark comparison in its subdomain. Researchers are advised to specify context or citing paper when referencing “the BREATH dataset” due to the existence of multiple, independently-developed resources.