Kaggle/CDC 100K Dataset Overview
- The Kaggle/CDC 100K Dataset is a curated subset of CDC’s COVID-19 case data, offering de-identified, person-level records for epidemiological analysis.
- It employs rigorous privacy techniques like k-anonymity and l-diversity to balance research utility with confidentiality.
- Automated pipelines and strict quality controls ensure FAIR compliance, enhancing data transparency and public health insights.
The Kaggle/CDC 100K Dataset is an extract of the first 100,000 rows from the CDC’s COVID-19 Case Surveillance Public Use Data, a de-identified, person-level dataset representing laboratory-confirmed or probable COVID-19 cases as reported by state, tribal, local, and territorial (STLT) jurisdictions across the United States. Its primary purpose is to balance transparency, research utility, and individual confidentiality in the urgent context of COVID-19 epidemiological surveillance. The dataset schema, de-identification methodology, and automated release mechanisms reflect an application of data privacy principles to public health data dissemination at national scale (Lee et al., 2021).
1. Dataset Schema and Provenance
The Kaggle/CDC 100K is a direct subset of the monthly-updated full public-use file published on Data.CDC.gov. At its initial release on May 18, 2020, this file included 339,301 de-duplicated case records; by December 4, 2020, its size exceeded 8.4 million records. Each case record originates from one of three submission modalities (batch upload, direct entry, or NNDSS feed) into CDC’s DCIPHER platform (a Palantir Foundry-based system). Inclusion criteria are: person-level, laboratory-confirmed or probable reports using CDC surveillance definitions, with all records lagged by at least 14 days for quality review and de-duplication.
Records are excluded if: they are less than 14 days old at extraction, contain any free text or direct identifier (e.g., names, addresses), or fail basic logic checks (e.g., implausible dates such as symptom onset after testing).
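As a rough illustration, the exclusion rules above can be expressed as a single predicate. The following Python sketch is hypothetical (the production logic runs inside DCIPHER and is not public in this form); the field names `free_text`, `name`, and `address` are assumptions standing in for whatever identifier fields the intake feeds carry.

```python
from datetime import date, timedelta

def is_excluded(record: dict, extraction_date: date, lag_days: int = 14) -> bool:
    """Return True if a case record fails the public-use inclusion criteria."""
    # Rule 1: records inside the 14-day quality-review lag window are excluded.
    if record["cdc_report_dt"] > extraction_date - timedelta(days=lag_days):
        return True
    # Rule 2: any free text or direct identifier disqualifies the record.
    if record.get("free_text") or record.get("name") or record.get("address"):
        return True
    # Rule 3: basic logic check — symptom onset cannot follow specimen collection.
    onset, spec = record.get("onset_dt"), record.get("pos_spec_dt")
    if onset and spec and onset > spec:
        return True
    return False

# A record reported yesterday is still inside the lag window, so it is excluded.
today = date(2020, 12, 4)
fresh = {"cdc_report_dt": today - timedelta(days=1)}
print(is_excluded(fresh, today))  # True
```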
2. Variable Inventory
The dataset follows a rigid schema of 11 de-identified fields inherited from the full public-use file. Each row encodes variables as follows:
| Variable | Type | Notes / Allowable Values |
|---|---|---|
| cdc_report_dt | Date | YYYY-MM-DD (date received in CDC system) |
| pos_spec_dt | Date/NA | YYYY-MM-DD or NA, suppressed as per privacy rules |
| onset_dt | Date/‘Null’ | YYYY-MM-DD or ‘Null’ if illogical/missing |
| current_status | Categorical | “Laboratory-confirmed case”, “Probable Case” |
| sex | Categorical | “Male” … “Unknown”/“Missing”/NA (quasi-identifier) |
| age_group | Categorical | 10-year bins from “0–9” to “80+”, “Unknown”/NA (QI) |
| race_ethnicity_combined | Categorical | Collapsed to 7 categories + “Unknown”/“Missing”/NA (QI) |
| hosp_yn | Categorical | “Yes”/“No”/“Unknown”/“Missing” |
| icu_yn | Categorical | “Yes”/“No”/“Unknown”/“Missing” |
| death_yn | Categorical | “Yes”/“No”/“Unknown”/“Missing” |
| medcond_yn | Categorical | “Yes”/“No”/“Unknown”/“Missing” |
Derivations include age_group set by $\text{floor}((\text{onset_dt} - \text{dob})/365.25)$ (if date of birth is present) and race_ethnicity_combined via collapsing of race and ethnicity fields. Quasi-identifiers (QIs) are {sex, age_group, race_ethnicity_combined}.
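The age_group derivation can be sketched as follows. This is an illustrative Python rendering of the floor rule above, not the pipeline's actual R code; the function name is hypothetical.

```python
from datetime import date

def derive_age_group(onset_dt: date, dob: date) -> str:
    """Map a date of birth to the public-use 10-year age bin."""
    age = int((onset_dt - dob).days / 365.25)  # floor((onset_dt - dob) / 365.25)
    if age >= 80:
        return "80+"          # top-coded per the generalization rules
    lower = (age // 10) * 10
    return f"{lower}\u2013{lower + 9}"  # en dash, matching "0–9" … "70–79"

print(derive_age_group(date(2020, 4, 1), date(1985, 6, 15)))  # "30–39"
```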
3. De-identification and Privacy Guarantees
Privacy risk is addressed by a multi-step approach: k-anonymity, l-diversity, generalization, and minimal cell suppression.
k-Anonymity
For each record $r$ with quasi-identifiers $Q = \{\text{sex},\,\text{age_group},\,\text{race_ethnicity_combined}\}$, define the equivalence class $E(r) = \{\, s \mid s.Q = r.Q \,\}$. The dataset enforces a minimum class size $k$, i.e. $|E(r)| \geq k$, suppressing QI values minimally to NA if this fails:
- For any class $E$ with $|E| < k$, fields in $Q$ are set to NA until the affected records fall into a new equivalence class $E'$ for which $|E'| \geq k$.
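The suppression step can be sketched in Python as below. This is a deliberately simplified pass that suppresses all QI cells of a small class at once, whereas the actual pipeline (Palantir Contour plus sdcMicro) searches for a minimal set of cells to suppress; the function name and threshold handling are assumptions.

```python
from collections import Counter

QIS = ("sex", "age_group", "race_ethnicity_combined")

def enforce_k_anonymity(records: list[dict], k: int) -> list[dict]:
    """Suppress QI cells (set to None) for records in equivalence classes
    smaller than k. Simplified: suppresses all QIs at once rather than the
    minimal-cell suppression that sdcMicro performs."""
    counts = Counter(tuple(r[q] for q in QIS) for r in records)
    out = []
    for r in records:
        r = dict(r)
        if counts[tuple(r[q] for q in QIS)] < k:
            for q in QIS:
                r[q] = None  # appears as NA in the published CSV
        out.append(r)
    return out

# A unique (Male, 0–9, Hispanic/Latino) record is suppressed; a class of
# size 5 survives intact.
rows = [{"sex": "Male", "age_group": "0\u20139",
         "race_ethnicity_combined": "Hispanic/Latino"}]
rows += [{"sex": "Female", "age_group": "0\u20139",
          "race_ethnicity_combined": "Asian"}] * 5
result = enforce_k_anonymity(rows, k=5)
print(result[0])  # all QI cells suppressed to None
```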
l-Diversity
To prevent inference attacks against the confidential pos_spec_dt, each equivalence class must have at least two distinct values of pos_spec_dt:
- $|\{ s.\text{pos_spec_dt} \mid s \in E \}| \geq 2$
- If violated, pos_spec_dt is set to NA for all records in $E$.
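The l-diversity check can be sketched analogously; again a hypothetical Python version of logic the pipeline implements via sdcMicro, reproducing the worked example from the transformations below (five identical records sharing one specimen date).

```python
from collections import defaultdict

QIS = ("sex", "age_group", "race_ethnicity_combined")

def enforce_l_diversity(records: list[dict], l: int = 2) -> list[dict]:
    """Set pos_spec_dt to None for every record whose equivalence class
    has fewer than l distinct pos_spec_dt values."""
    dates = defaultdict(set)
    for r in records:
        dates[tuple(r[q] for q in QIS)].add(r["pos_spec_dt"])
    out = []
    for r in records:
        r = dict(r)
        if len(dates[tuple(r[q] for q in QIS)]) < l:
            r["pos_spec_dt"] = None  # appears as NA in the published CSV
        out.append(r)
    return out

# All five (Female, 0–9, Asian) records share one specimen date: suppressed.
rows = [{"sex": "Female", "age_group": "0\u20139",
         "race_ethnicity_combined": "Asian",
         "pos_spec_dt": "2020-03-31"}] * 5
print(enforce_l_diversity(rows)[0]["pos_spec_dt"])  # None
```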
Generalization
- age_group: 10-year bins, top-coded at 80+
- race_ethnicity_combined: Collapsed from distinct race and ethnicity codes to seven aggregate categories, with multiple selections grouped as “Multiple/Other”
Implementation
Initial detection uses Palantir Contour. The R package sdcMicro is used for full k-anonymity and l-diversity testing. Suppression is applied at the cell level; records remain but suppressed quasi-identifier and/or confidential attribute cells appear as NA.
Example transformations:
- Pre-suppression: (Male, 0–9, Hispanic/Latino), frequency = 1
- Post-suppression: (NA, 0–9, NA), frequency = 5
- Pre-l-diversity: {(Female, 0–9, Asian, 2020-03-31) × 5}, only one specimen date
- Post-l-diversity: Set pos_spec_dt=NA for entire class
4. Data Transformation Pipeline
The data pipeline comprises:
a. Ingestion and de-duplication of all incoming feeds (CSV, direct entry, NNDSS) into DCIPHER
b. Cleaning (blank/missing → “Missing”; implausible dates → “Null”; onset_dt imputed from cdc_report_dt if absent)
c. Derivation (DOB → age, aggregation of race/ethnicity)
d. 14-day lag (exclude records newer than extraction date minus 14 days)
No imputation or interpolation is performed beyond the DOB-to-age mapping and the onset-date fallback above. All processes are automated within the Palantir Foundry environment, with pipeline configuration and scripts (R 4.0.3) version-controlled and public (GitHub: cdcgov/covid_case_privacy_review). Each build applies rigorous privacy re-verification using sdcMicro; failures halt the pipeline and trigger notification.
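The cleaning rules in step (b) can be sketched as below. This is a hypothetical Python rendering (the production pipeline is R 4.0.3 inside Foundry); dates are ISO-8601 strings, which compare correctly lexicographically.

```python
def clean_record(rec: dict) -> dict:
    """Apply the public-use cleaning rules to one raw record."""
    rec = dict(rec)
    # Blank or missing categorical values become the explicit "Missing" level.
    for field in ("hosp_yn", "icu_yn", "death_yn", "medcond_yn"):
        if not rec.get(field):
            rec[field] = "Missing"
    # Implausible onset dates (onset after specimen collection) become 'Null'.
    onset, spec = rec.get("onset_dt"), rec.get("pos_spec_dt")
    if onset and spec and onset > spec:
        rec["onset_dt"] = "Null"
    # Absent onset dates fall back to the CDC report date.
    if not rec.get("onset_dt"):
        rec["onset_dt"] = rec["cdc_report_dt"]
    return rec

raw = {"cdc_report_dt": "2020-04-10", "hosp_yn": "", "pos_spec_dt": "2020-04-05"}
print(clean_record(raw))  # hosp_yn -> "Missing", onset_dt -> "2020-04-10"
```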
5. Access, Use Modalities, and Distribution
Public-use data, including the Kaggle/CDC 100K, require no data-use agreement and are accessible through:
- CSV download: https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf
- REST API with the same resource ID (vbim-akqf)
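Programmatic access typically goes through the Socrata Open Data API, which Data.CDC.gov uses, via the same resource ID. The sketch below only builds a SoQL query URL; `$limit` and `$where` are standard SoQL parameters, but verify against the live endpoint before relying on it.

```python
from typing import Optional
from urllib.parse import urlencode

BASE = "https://data.cdc.gov/resource/vbim-akqf.json"

def build_query(limit: int = 100_000, where: Optional[str] = None) -> str:
    """Build a SoQL query URL for the public-use case surveillance data."""
    params = {"$limit": limit}
    if where:
        params["$where"] = where
    return f"{BASE}?{urlencode(params)}"

# First 100,000 rows, restricted to laboratory-confirmed cases.
url = build_query(limit=100_000,
                  where="current_status='Laboratory-confirmed case'")
print(url)
```

Fetching would then be, e.g., `rows = json.load(urllib.request.urlopen(url))`.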
Restricted scientific-use data (31 variables, geographic detail) require prior approval under a CDC Registration Information and Data Use Restriction Agreement (RIDURA) and are distributed through a private repository (URL and access instructions as indicated).
6. Limitations and Analytical Considerations
- All records are at least 14 days old at extraction; recent cases are excluded.
- Suppressed QIs (NA cells) may induce analytic bias, especially for intersectional subgroups; aggregation to higher-level groups and careful treatment of “Unknown” categories are recommended.
- No free-text fields or detailed comorbidity breakdowns are included, limiting some clinical uses.
- Residual re-identification risk from external linkage ("mosaic effect") is minimized but not eliminated.
- Sensitivity analyses, especially for handling NA and "Unknown," are advised for robust statistical inference.
7. Automation, Version Control, and FAIR Compliance
The entire extraction, transformation, review, and publication cycle for the Kaggle/CDC 100K is fully automated under monthly scheduling. Version control is implemented for both code (public GitHub repository) and pipeline configurations. Re-verification using sdcMicro ensures privacy criteria are enforced on every build iteration. The release process is designed for Findability, Accessibility, Interoperability, and Reusability (FAIR), with all stages of the data lineage—raw STLT intake through to public CSV artifact—fully documented and programmatically reproducible (Lee et al., 2021).