UltraTwin: Cardiac & Behavioral Datasets
- UltraTwin comprises two distinct, high-quality datasets that enable 3D cardiac reconstruction from sparse 2D ultrasound and CT imaging as well as digital persona simulation through multi-wave behavioral surveys.
- The cardiac dataset employs strict and pseudo-pairing techniques with comprehensive annotations and multi-view imaging to facilitate conditional generative modeling for accurate anatomical twin reconstruction.
- The behavioral dataset provides large-scale, multi-wave data with robust test–retest and convergent validity metrics, supporting reliable digital twin simulation in psychological, cognitive, and economic domains.
UltraTwin is a term referring to two distinct, high-quality datasets released independently in 2025, each supporting digital twin research: (1) a multimodal real-world dataset for 2D-to-3D cardiac anatomical twin reconstruction from clinical ultrasound and computed tomography (CT) imaging (Yu et al., 30 Jun 2025); and (2) a large-scale, four-wave behavioral dataset for digital persona simulation and social science (Toubia et al., 23 May 2025). In both instances, the datasets are designed to enable modeling and evaluation of “twins”—either anatomical or behavioral—under rich, empirically anchored settings.
1. Cardiac Anatomical UltraTwin Dataset: Acquisition and Structure
The UltraTwin cardiac dataset was established through a multi-center prospective study across nine hospitals in China, yielding 891 paired ultrasound–CT patient cases. Its core objective is to enable conditional generative reconstruction of 3D cardiac anatomy (“anatomical twins”) from sparse, clinical-grade multi-view 2D echocardiography (Yu et al., 30 Jun 2025).
Key acquisition protocols and dataset composition:
- Subjects and Pairing:
- 96 ECG-gated, strictly paired US–CT cases for high-fidelity, temporally aligned ground truth.
- 795 non–ECG-gated, non-contrast CT scans supporting pseudo-pair generation.
- All US and CT were acquired within 10 days. The dataset itself does not report distributions for age, sex, body mass index, or pathology.
- Ultrasound:
- Multi-view 2D videos, covering 12 standard echocardiographic views.
- Three views used in reconstruction: Apical Two Chamber (A2C), Apical Four Chamber (A4C), Parasternal Short Axis at papillary level (PSAX_PAP).
- Each key cardiac cycle phase (end-diastole, end-systole) per view manually annotated, resulting in 780 key frames for strictly paired cases.
- All frames resampled to a fixed pixel resolution; field-of-view in physical units not detailed.
- CT:
- ECG-gated or non-gated, non-contrast acquisitions.
- Post-segmentation, all cardiac volumes resampled to isotropic voxel spacing and standardized to a fixed grid size.
- Data Splitting:
- Strictly paired set (130 samples) partitioned: 96 train / 10 validation / 24 test; the broader pseudo-paired data is reserved for pre-training.
- 3D Cardiac Structure Annotation:
- Segmentation for myocardium, left and right ventricles, left and right atria utilizing TotalSegmentator, with manual correction.
- Rigid alignment applied to ensure spatial standardization.
2. Cardiac Dataset: Pairing Techniques and Quality Control
UltraTwin adopts dual approaches to maximize both ground truth reliability and data scale:
- Strictly Paired Data:
- Manual selection and temporal alignment of US and ECG-gated CT at identical cardiac phases, validated through ECG trace and manual review.
- Pseudo-Pair Generation:
- For non-gated cases, an automated frame-matching pipeline aligns US and CT images by projecting CT anatomy into canonical 2D (A4C) space (cf. Stojanovski et al. 2022) and selecting the US frame that minimizes the L2 error between the geometric parameters of the paired modalities.
- Selected US frames for pseudo-pairs are not used in primary evaluation, only for pre-training.
- Noise Modeling:
- Implicit autoencoder training incorporates additive voxel noise, random volumetric masking, and voxel-value swaps (mathematical forms not specified) to induce model robustness.
- Quality Control:
- Extensive manual review and correction during segmentation.
- No pixel-level segmentation of US; input is full-frame intensity.
- No reported SNR, contrast-to-noise, or inter-rater metrics.
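The frame-matching step above can be sketched as a simple L2 search over candidate US frames. The geometric parameter vectors below are hypothetical stand-ins for whatever quantities the actual pipeline compares (the release does not enumerate them):

```python
import numpy as np

def match_frame(us_params, ct_params):
    """Return the index of the US frame whose geometric parameters
    (illustrative vectors, e.g. chamber dimensions in the A4C plane)
    minimize the L2 error against the CT-derived parameters."""
    us = np.asarray(us_params, dtype=float)   # shape (n_frames, n_params)
    ct = np.asarray(ct_params, dtype=float)   # shape (n_params,)
    errors = np.linalg.norm(us - ct, axis=1)  # L2 distance per candidate frame
    return int(np.argmin(errors)), float(errors.min())

# Example: three candidate US frames described by two geometric parameters
idx, err = match_frame([[4.0, 2.0], [3.1, 1.9], [5.0, 2.5]], [3.0, 2.0])
```

Frames selected this way form pseudo-pairs used only for pre-training, consistent with the protocol above.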
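The three corruption types used for implicit autoencoder training can be illustrated as follows. Since the release does not specify their mathematical forms, the Gaussian noise, masking rate, and swap rate below are illustrative assumptions only:

```python
import numpy as np

def corrupt_volume(vol, rng, noise_std=0.05, mask_frac=0.1, swap_frac=0.01):
    """Apply the three corruption types named for implicit autoencoder
    training: additive voxel noise, random volumetric masking, and
    voxel-value swaps. Distributions and rates are assumptions."""
    v = vol.astype(float).copy()
    # 1) additive voxel noise (Gaussian assumed)
    v += rng.normal(0.0, noise_std, size=v.shape)
    # 2) random volumetric masking: zero out a random fraction of voxels
    v[rng.random(v.shape) < mask_frac] = 0.0
    # 3) voxel-value swaps: exchange values between random voxel pairs
    n_swap = max(1, int(swap_frac * v.size))
    flat = v.ravel()  # view into v, so swaps modify v in place
    i = rng.integers(0, flat.size, n_swap)
    j = rng.integers(0, flat.size, n_swap)
    flat[i], flat[j] = flat[j].copy(), flat[i].copy()
    return v

rng = np.random.default_rng(0)
noisy = corrupt_volume(np.ones((8, 8, 8)), rng)
```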
3. Cardiac Dataset: Data Statistics, Coverage, and Limitations
A summary of the main dataset composition and limitations is provided in the table below:
| Aspect | Value/Description | Notes |
|---|---|---|
| Total cases | 891 US–CT pairs (96 strict, 795 pseudo-pair) | 9 hospitals; all scanned within 10 days |
| Strictly paired dataset | 130 unique 2D–3D pairs (train=96, val=10, test=24) | Each with 6 frames (A2C/A4C/PSAX, ED/ES) |
| CT resolution | Isotropic voxels on a fixed grid | Field-of-view matches volume grid |
| 2D ultrasound frames | Fixed pixel resolution | All frames resampled to a common size |
| Chambers/structures | Myocardium, LV, RV, LA, RA | Segmented and validated |
| Evaluation limitations | Small size of strict pairs, lack of detailed US quality metrics | No SNR/CNR/inter-rater stats |
Limitations acknowledged in the design:
- The strictly paired dataset remains relatively small (130 samples).
- 2D US does not sample all possible geometries; 3D CT ground truth is limited by its spatial resolution.
- No direct reporting of SNR, contrast or annotation reliability in US data.
4. Cardiac Dataset: Intended Use and Applicability
The reproducible, multi-view cardiac pairing design of UltraTwin targets several core applications:
- Training and validation of conditional generative models for 3D cardiac anatomy (anatomical twin) from sparse 2D US.
- Enabling personalized chamber volume estimation, surgical planning, and computational modeling of cardiac function.
- Providing a public benchmark for topology-aware, coarse-to-fine reconstruction algorithms under real-world data scarcity and noise (Yu et al., 30 Jun 2025).
A plausible implication is that this dataset will be central for evaluating generalization and robustness for deep learning models aiming for clinical translation from 2D to 3D cardiac modeling.
5. Behavioral UltraTwin Dataset (Twin-2K-500): Scope and Organization
Twin-2K-500, also termed UltraTwin in its arXiv submission, consists of multi-wave, deep-survey data from US adults, supporting digital twin simulation in behavioral, psychological, and economic domains (Toubia et al., 23 May 2025). Its structure enables both individual-level persona modeling and aggregate-level experimental replication.
Key features include:
- Survey Waves and Contents:
- Four waves; 500 unique questions across demographics, 26 personality constructs, 11 cognitive measures, 10 economic preference measures, and multiple heuristics/bias experiments.
- Wave 4 (retest) repeats 88 pivotal tasks/questions for robust test–retest benchmarking.
- Data organization:
- Raw data in JSONL, CSV, and per-participant JSON files for responses and personas.
- Canonical schemas and code snippets provided for immediate integration and analysis.
- Metadata:
- Each subject includes demographic fields (age, sex, education, region, income, political indicators), fully traceable question and domain assignments, and block-level answer files for holdout/evaluation.
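Given the stated JSONL and per-participant JSON layout, loading might look like the sketch below. The field names and file naming are assumptions for illustration; consult the release's canonical schemas:

```python
import json
from pathlib import Path

def load_responses(jsonl_path):
    """Load one-record-per-line JSONL survey responses into a list of dicts.
    Field names such as 'pid' or 'answer' are illustrative, not canonical."""
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                records.append(json.loads(line))
    return records

def load_persona(pid, persona_dir):
    """Read a per-participant persona JSON file (hypothetical layout)."""
    return json.loads(Path(persona_dir, f"{pid}.json").read_text(encoding="utf-8"))
```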
6. Behavioral Dataset: Validation, Quality, and Benchmarking
Empirical quality and benchmark design are extensively detailed:
- Sample Composition:
- Age: 18–29 (18.9%), 30–49 (35.7%), 50–64 (32.0%), 65+ (13.5%).
- Sex: Female (50.7%), Male (49.3%).
- Diverse distributions across education, income, and race.
- Reliability and Consistency:
- Test–retest accuracy on 17 tasks: 81.72%.
- Cronbach’s α for the Big Five: extraversion (0.84), agreeableness (0.81), conscientiousness (0.87), neuroticism (0.88), openness (0.83); Need for Cognition (0.89); Anxiety (Beck) (0.91); Financial Literacy (0.75).
- Convergent/face validity:
- Strong expected correlations replicated, e.g., between Need for Cognition and Openness, and between Neuroticism and Depression.
- Behavioral Economics Replication:
- 10/11 classic effects and 4/5 within-subject effects replicated across both main and retest waves.
For LLM-based persona simulation benchmarks, human test–retest accuracy serves as the ground-truth ceiling (81.72%); the best models (GPT-4.1 variants) achieve ~70–72%, against a random baseline of 59.17%.
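Reliabilities of this kind can be recomputed directly from raw item responses. A minimal sketch of Cronbach's α from a (subjects × items) response matrix, using the standard formula α = k/(k−1)·(1 − Σ item variances / total-score variance):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_subjects, n_items) response matrix."""
    x = np.asarray(items, dtype=float)
    k = x.shape[1]
    item_var = x.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = x.sum(axis=1).var(ddof=1)    # variance of the summed score
    return (k / (k - 1)) * (1.0 - item_var / total_var)
```

Applied to the released Big Five item blocks, this should approximately reproduce the coefficients reported above.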
7. Best Practices, Extensions, and Integration
- Preprocessing:
- Convert all Likert/binary/ordinal responses to numeric codes; address missingness by imputation or flagging.
- Compute aggregate scale scores as specified in the technical appendix (e.g., Big 5 averaging, MPL discount rates).
- Standardize numeric fields (z-scores) if using for LLM persona input.
- Data Splitting:
- Train: waves 1–3 non-holdout; validation: subset of waves 1–3 holdout; test: wave 4 retest responses.
- Cross-validation can be realized via rotating holdout questions among subjects/tasks.
- Evaluation and Benchmarking:
- Accuracy/normalized error for individual items; ATE difference for aggregate replication.
- Confidence intervals computed by participant-level bootstrapping.
- Prompting Protocols:
- A recommended prompt template for persona simulation, with persona Q–A pairs followed by a new question, is included in the release.
- Scaling:
- Determinism via JSON persona + low-temperature decoding.
- Structured “Predicted Output” mode reduces inference cost in OpenAI APIs.
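The Likert-to-numeric conversion and z-scoring steps above can be sketched as follows. The 5-point label mapping is an illustrative assumption, not the release's official coding, and missingness is handled here by flagging with NaN:

```python
import numpy as np

# Example 5-point mapping; the release's own coding may differ.
LIKERT = {"Strongly disagree": 1, "Disagree": 2, "Neutral": 3,
          "Agree": 4, "Strongly agree": 5}

def encode_likert(responses, mapping=LIKERT):
    """Map Likert labels to numeric codes; missing/unknown labels -> NaN."""
    return np.array([mapping.get(r, np.nan) for r in responses], dtype=float)

def zscore(x):
    """Standardize while ignoring NaNs, so flagged missingness survives."""
    x = np.asarray(x, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x)

codes = encode_likert(["Agree", "Neutral", None, "Strongly agree"])
z = zscore(codes)
```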
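Participant-level bootstrapping for confidence intervals, as recommended above, can be sketched as a percentile bootstrap over per-participant accuracies (the percentile scheme is an assumption; the release may use a different variant):

```python
import numpy as np

def bootstrap_ci(per_participant_acc, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean accuracy, resampling
    participants (not items) with replacement."""
    acc = np.asarray(per_participant_acc, dtype=float)
    rng = np.random.default_rng(seed)
    means = np.array([rng.choice(acc, size=acc.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return acc.mean(), (float(lo), float(hi))
```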
Potential extensions include cross-construct correlational analyses, latent-trait modeling, and evaluation of heterogeneous treatment effects using the rich demographic/cognitive metadata.
UltraTwin, in both cardiac anatomical and behavioral domains, provides rigorously designed, large-scale empirical foundations for digital twin modeling, calibration, and benchmarking in clinical and social science contexts (Yu et al., 30 Jun 2025, Toubia et al., 23 May 2025).