Kaggle/CDC 100K Dataset Overview
- The Kaggle/CDC 100K Dataset is a curated subset of CDC’s COVID-19 case data, offering de-identified, person-level records for epidemiological analysis.
- It employs rigorous privacy techniques like k-anonymity and l-diversity to balance research utility with confidentiality.
- Automated pipelines and strict quality controls ensure FAIR compliance, enhancing data transparency and public health insights.
The Kaggle/CDC 100K Dataset is an extract of the first 100,000 rows from the CDC’s COVID-19 Case Surveillance Public Use Data, a de-identified, person-level dataset representing laboratory-confirmed or probable COVID-19 cases as reported by state, tribal, local, and territorial (STLT) jurisdictions across the United States. Its primary purpose is to balance transparency, research utility, and individual confidentiality in the urgent context of COVID-19 epidemiological surveillance. The dataset schema, de-identification methodology, and automated release mechanisms reflect an application of data privacy principles to public health data dissemination at national scale (Lee et al., 2021).
1. Dataset Schema and Provenance
The Kaggle/CDC 100K is a direct subset of the monthly-updated full public-use file published on Data.CDC.gov. At its initial release on May 18, 2020, this file included 339,301 de-duplicated case records; by December 4, 2020, its size exceeded 8.4 million records. Each case record originates from one of three submission modalities (batch upload, direct entry, or NNDSS feed) into CDC’s DCIPHER platform (a Palantir Foundry-based system). Inclusion criteria are: person-level, laboratory-confirmed or probable reports using CDC surveillance definitions, with all records lagged by at least 14 days for quality review and de-duplication.
Records are excluded if: they are less than 14 days old at extraction, contain any free text or direct identifier (e.g., names, addresses), or fail basic logic checks (e.g., implausible dates such as symptom onset after testing).
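As a rough illustration, the exclusion rules above can be expressed as a single predicate. The following Python sketch is hypothetical (the production logic runs inside DCIPHER and is not public in this form); the field names `free_text`, `name`, and `address` are assumptions standing in for whatever identifier fields the intake feeds carry.

```python
from datetime import date, timedelta

def is_excluded(record: dict, extraction_date: date, lag_days: int = 14) -> bool:
    """Return True if a case record fails the public-use inclusion criteria."""
    # Rule 1: records inside the 14-day quality-review lag window are excluded.
    if record["cdc_report_dt"] > extraction_date - timedelta(days=lag_days):
        return True
    # Rule 2: any free text or direct identifier disqualifies the record.
    if record.get("free_text") or record.get("name") or record.get("address"):
        return True
    # Rule 3: basic logic check — symptom onset cannot follow specimen collection.
    onset, spec = record.get("onset_dt"), record.get("pos_spec_dt")
    if onset and spec and onset > spec:
        return True
    return False

# A record reported yesterday is still inside the lag window, so it is excluded.
today = date(2020, 12, 4)
fresh = {"cdc_report_dt": today - timedelta(days=1)}
print(is_excluded(fresh, today))  # True
```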
2. Variable Inventory
The dataset follows a rigid schema of 11 de-identified fields inherited from the full public-use file. Each row encodes variables as follows:
| Variable | Type | Notes / Allowable Values |
|---|---|---|
| cdc_report_dt | Date | YYYY-MM-DD (date received in CDC system) |
| pos_spec_dt | Date/NA | YYYY-MM-DD or NA, suppressed as per privacy rules |
| onset_dt | Date/‘Null’ | YYYY-MM-DD or ‘Null’ if illogical/missing |
| current_status | Categorical | “Laboratory-confirmed case”, “Probable Case” |
| sex | Categorical | “Male” … “Unknown”/“Missing”/NA (quasi-identifier) |
| age_group | Categorical | 10-year bins from “0–9” to “80+”, “Unknown”/NA (QI) |
| race_ethnicity_combined | Categorical | Collapsed to 7 categories + “Unknown”/“Missing”/NA (QI) |
| hosp_yn | Categorical | “Yes”/“No”/“Unknown”/“Missing” |
| icu_yn | Categorical | “Yes”/“No”/“Unknown”/“Missing” |
| death_yn | Categorical | “Yes”/“No”/“Unknown”/“Missing” |
| medcond_yn | Categorical | “Yes”/“No”/“Unknown”/“Missing” |
Derivations include age_group set by $\text{floor}((\text{onset_dt} - \text{dob})/365.25)$ (if date of birth is present) and race_ethnicity_combined via collapsing of race and ethnicity fields. Quasi-identifiers (QIs) are {sex, age_group, race_ethnicity_combined}.
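The age_group derivation can be sketched as follows. This is an illustrative Python rendering of the floor rule above, not the pipeline's actual R code; the function name is hypothetical.

```python
from datetime import date

def derive_age_group(onset_dt: date, dob: date) -> str:
    """Map a date of birth to the public-use 10-year age bin."""
    age = int((onset_dt - dob).days / 365.25)  # floor((onset_dt - dob) / 365.25)
    if age >= 80:
        return "80+"          # top-coded per the generalization rules
    lower = (age // 10) * 10
    return f"{lower}\u2013{lower + 9}"  # en dash, matching "0–9" … "70–79"

print(derive_age_group(date(2020, 4, 1), date(1985, 6, 15)))  # "30–39"
```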
3. De-identification and Privacy Guarantees
Privacy risk is addressed by a multi-step approach: k-anonymity, l-diversity, generalization, and minimal cell suppression.
k-Anonymity
For each record $r$ with quasi-identifiers $Q = \{\text{sex},\,\text{age_group},\,\text{race_ethnicity_combined}\}$, define the equivalence class $E(r) = \{\, s \mid s.Q = r.Q \,\}$. The dataset enforces a minimum class size $k$, i.e. $|E(r)| \geq k$, suppressing QI values minimally to NA if this fails:
- For any class $E$ with $|E| < k$, fields in $Q$ are set to NA until the affected records fall into a new equivalence class $E'$ for which $|E'| \geq k$.
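The suppression step can be sketched in Python as below. This is a deliberately simplified pass that suppresses all QI cells of a small class at once, whereas the actual pipeline (Palantir Contour plus sdcMicro) searches for a minimal set of cells to suppress; the function name and threshold handling are assumptions.

```python
from collections import Counter

QIS = ("sex", "age_group", "race_ethnicity_combined")

def enforce_k_anonymity(records: list[dict], k: int) -> list[dict]:
    """Suppress QI cells (set to None) for records in equivalence classes
    smaller than k. Simplified: suppresses all QIs at once rather than the
    minimal-cell suppression that sdcMicro performs."""
    counts = Counter(tuple(r[q] for q in QIS) for r in records)
    out = []
    for r in records:
        r = dict(r)
        if counts[tuple(r[q] for q in QIS)] < k:
            for q in QIS:
                r[q] = None  # appears as NA in the published CSV
        out.append(r)
    return out

# A unique (Male, 0–9, Hispanic/Latino) record is suppressed; a class of
# size 5 survives intact.
rows = [{"sex": "Male", "age_group": "0\u20139",
         "race_ethnicity_combined": "Hispanic/Latino"}]
rows += [{"sex": "Female", "age_group": "0\u20139",
          "race_ethnicity_combined": "Asian"}] * 5
result = enforce_k_anonymity(rows, k=5)
print(result[0])  # all QI cells suppressed to None
```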
l-Diversity
To prevent inference attacks against the confidential pos_spec_dt, each equivalence class must have at least two distinct values of pos_spec_dt:
- $|\{ s.\text{pos_spec_dt} \mid s \in E \}| \geq 2$
- If violated, pos_spec_dt is set to NA for all records in $E$.
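The l-diversity check can be sketched analogously; again a hypothetical Python version of logic the pipeline implements via sdcMicro, reproducing the worked example from the transformations below (five identical records sharing one specimen date).

```python
from collections import defaultdict

QIS = ("sex", "age_group", "race_ethnicity_combined")

def enforce_l_diversity(records: list[dict], l: int = 2) -> list[dict]:
    """Set pos_spec_dt to None for every record whose equivalence class
    has fewer than l distinct pos_spec_dt values."""
    dates = defaultdict(set)
    for r in records:
        dates[tuple(r[q] for q in QIS)].add(r["pos_spec_dt"])
    out = []
    for r in records:
        r = dict(r)
        if len(dates[tuple(r[q] for q in QIS)]) < l:
            r["pos_spec_dt"] = None  # appears as NA in the published CSV
        out.append(r)
    return out

# All five (Female, 0–9, Asian) records share one specimen date: suppressed.
rows = [{"sex": "Female", "age_group": "0\u20139",
         "race_ethnicity_combined": "Asian",
         "pos_spec_dt": "2020-03-31"}] * 5
print(enforce_l_diversity(rows)[0]["pos_spec_dt"])  # None
```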
Generalization
- age_group: 10-year bins, top-coded at 80+
- race_ethnicity_combined: Collapsed from distinct race and ethnicity codes to seven aggregate categories, with multiple selections grouped as “Multiple/Other”
Implementation
Initial detection uses Palantir Contour. The R package sdcMicro is used for full k-anonymity and l-diversity testing. Suppression is applied at the cell level; records remain but suppressed quasi-identifier and/or confidential attribute cells appear as NA.
Example transformations:
- Pre-suppression: (Male, 0–9, Hispanic/Latino), frequency = 1
- Post-suppression: (NA, 0–9, NA), frequency = 5
- Pre-l-diversity: {(Female, 0–9, Asian, 2020-03-31) × 5}, only one specimen date
- Post-l-diversity: Set pos_spec_dt=NA for entire class
4. Data Transformation Pipeline
The data pipeline comprises:
a. Ingestion and de-duplication of all incoming feeds (CSV, direct entry, NNDSS) into DCIPHER
b. Cleaning (blank/missing → “Missing”; implausible dates → “Null”; onset_dt imputed from cdc_report_dt if absent)
c. Derivation (DOB → age, aggregation of race/ethnicity)
d. 14-day lag (exclude records newer than extraction date minus 14 days)
No imputation or interpolation is performed beyond the DOB-to-age mapping and the onset-date fallback above. All processes are automated within the Palantir Foundry environment, with pipeline configuration and scripts (R 4.0.3) version-controlled and public (GitHub: cdcgov/covid_case_privacy_review). Each build applies rigorous privacy re-verification using sdcMicro; failures halt the pipeline and trigger notification.
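The cleaning rules in step (b) can be sketched as below. This is a hypothetical Python rendering (the production pipeline is R 4.0.3 inside Foundry); dates are ISO-8601 strings, which compare correctly lexicographically.

```python
def clean_record(rec: dict) -> dict:
    """Apply the public-use cleaning rules to one raw record."""
    rec = dict(rec)
    # Blank or missing categorical values become the explicit "Missing" level.
    for field in ("hosp_yn", "icu_yn", "death_yn", "medcond_yn"):
        if not rec.get(field):
            rec[field] = "Missing"
    # Implausible onset dates (onset after specimen collection) become 'Null'.
    onset, spec = rec.get("onset_dt"), rec.get("pos_spec_dt")
    if onset and spec and onset > spec:
        rec["onset_dt"] = "Null"
    # Absent onset dates fall back to the CDC report date.
    if not rec.get("onset_dt"):
        rec["onset_dt"] = rec["cdc_report_dt"]
    return rec

raw = {"cdc_report_dt": "2020-04-10", "hosp_yn": "", "pos_spec_dt": "2020-04-05"}
print(clean_record(raw))  # hosp_yn -> "Missing", onset_dt -> "2020-04-10"
```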
5. Access, Use Modalities, and Distribution
Public-use data, including the Kaggle/CDC 100K, require no data-use agreement and are accessible through:
- CSV download: https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf
- REST API with the same resource ID (vbim-akqf)
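Programmatic access typically goes through the Socrata Open Data API, which Data.CDC.gov uses, via the same resource ID. The sketch below only builds a SoQL query URL; `$limit` and `$where` are standard SoQL parameters, but verify against the live endpoint before relying on it.

```python
from typing import Optional
from urllib.parse import urlencode

BASE = "https://data.cdc.gov/resource/vbim-akqf.json"

def build_query(limit: int = 100_000, where: Optional[str] = None) -> str:
    """Build a SoQL query URL for the public-use case surveillance data."""
    params = {"$limit": limit}
    if where:
        params["$where"] = where
    return f"{BASE}?{urlencode(params)}"

# First 100,000 rows, restricted to laboratory-confirmed cases.
url = build_query(limit=100_000,
                  where="current_status='Laboratory-confirmed case'")
print(url)
```

Fetching would then be, e.g., `rows = json.load(urllib.request.urlopen(url))`.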
Restricted scientific-use data (31 variables, geographic detail) require prior approval under a CDC Registration Information and Data Use Restriction Agreement (RIDURA) and are distributed through a private repository (URL and access instructions as indicated).
6. Limitations and Analytical Considerations
- All records are at least 14 days old at extraction; recent cases are excluded.
- Suppressed QIs (NA cells) may induce analytic bias, especially for intersectional subgroups; aggregation to higher-level groups and careful treatment of “Unknown” categories are recommended.
- No free-text fields or detailed comorbidity breakdowns are included, limiting some clinical uses.
- Residual re-identification risk from external linkage ("mosaic effect") is minimized but not eliminated.
- Sensitivity analyses, especially for handling NA and "Unknown," are advised for robust statistical inference.
7. Automation, Version Control, and FAIR Compliance
The entire extraction, transformation, review, and publication cycle for the Kaggle/CDC 100K is fully automated under monthly scheduling. Version control is implemented for both code (public GitHub repository) and pipeline configurations. Re-verification using sdcMicro ensures privacy criteria are enforced on every build iteration. The release process is designed for Findability, Accessibility, Interoperability, and Reusability (FAIR), with all stages of the data lineage—raw STLT intake through to public CSV artifact—fully documented and programmatically reproducible (Lee et al., 2021).