PubHealth Dataset Overview
- The PubHealth Dataset is a comprehensive collection of de-identified, multi-source public health data tailored for epidemiological analysis, case surveillance, and health informatics.
- It employs rigorous privacy filtering, deduplication, and standardized processing to ensure data integrity across case reports, mobility trends, and behavioral metrics.
- The datasets support high-impact applications including COVID-19 trend analysis, misinformation detection, and LLM-based evaluations of health guidance.
The PubHealth Dataset refers to a diverse suite of openly available public health data resources and benchmarks, each tailored for high-impact epidemiological analysis, health informatics, surveillance, and AI-driven methodologies. These datasets support core research activities in case surveillance, population mobility, epidemiological modeling, guidance evaluation, pharmacovigilance, entity recognition, misinformation detection, and multimedia content analysis. The term appears in various settings—most notably, as the CDC’s COVID-19 Case Surveillance Public Use Data, the PubHealthBench LLM benchmark, and curated COVID-19-related resources—each described below in technical detail and contextualized within the current literature.
1. CDC COVID-19 Case Surveillance Public Use Data
The CDC’s PubHealth Dataset comprises person-level, de-identified case reports from U.S. state, tribal, local, and territorial jurisdictions, processed to ensure privacy and facilitate trend analyses (Lee et al., 2021). The December 2020 release included 8,405,079 deduplicated records from Jan–Nov 2020. Two principal modes of access are provided:
- Unrestricted Public: 11 fields, refreshed monthly, accessible at Data.CDC.gov.
- Restricted Scientific: 31 fields, available under a Data Use Restriction Agreement (RIDURA) via a private GitHub repository.
The 11 variables include reporting and specimen collection dates, onset, case status, age group (top-coded at 80+ years), sex, combined race/ethnicity, plus hospitalization, ICU, death, and medical condition flags. No free-text or direct identifiers are present. All quasi-identifying and confidential attributes are subject to rigorous k-anonymity (k=5) and l-diversity (l=2) suppression algorithms.
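The suppression logic described above can be illustrated with a minimal sketch. This is a toy Python version showing only the core idea of k-anonymity-style suppression of rare quasi-identifier combinations; the CDC's production pipeline runs in R on Palantir Foundry and is considerably more elaborate.

```python
from collections import Counter


def suppress_rare_combinations(records, quasi_ids, k=5, suppressed="Missing"):
    """Suppress quasi-identifier combinations appearing fewer than k times.

    Toy illustration of k-anonymity-style suppression: any record whose
    quasi-identifier combination occurs in fewer than k records has those
    fields replaced by a suppression marker.
    """
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    out = []
    for r in records:
        key = tuple(r[q] for q in quasi_ids)
        if counts[key] < k:
            # Replace all quasi-identifiers, leaving other fields intact.
            r = {**r, **{q: suppressed for q in quasi_ids}}
        out.append(r)
    return out
```

With k=5, a record whose (age group, sex) combination appears only once would have both fields re-coded to "Missing", while combinations shared by five or more records pass through unchanged.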
Privacy filtering follows a multi-step workflow that includes classifying attributes, re-coding missingness, iteratively suppressing values that fall below privacy thresholds, and reviewing for mosaic effects. Automated data pipelines in R and the Palantir Foundry platform provide reproducible triggers and metadata stamping. Data completeness is ensured by deduplication and logical validation, though suppressed or generalized values may affect subgroup analyses.
Primary research applications include national trend analyses, subgroup risk modeling, equity assessments, and resource planning. All research use requires citation of the CDC COVID-19 Case Surveillance Public Use Data.
2. County-Level COVID-19 Dataset for the U.S.
A parallel PubHealth Dataset organizes county-level time-series COVID-19 and cross-sectional public health variables for the United States (Killeen et al., 2020). It aggregates more than 300 fields from sources including JHU CSSE, SafeGraph, Google Mobility, IHME, government orders, Census, USDA ERS, NOAA, AAMC, CNT, and DOJ BJS.
- Files provided (CSV):
- cases_deaths.csv (date, FIPS, cases, deaths)
- county_descriptors.csv (demographics, income, healthcare metrics)
- safegraph_activity.csv (POI foot traffic)
- google_mobility.csv (mobility changes)
- interventions.csv (NPI dates, recorded as ordinal integers)
This dataset supports per-capita and rate calculations, normalization for downstream analysis, and imputation of static data. Researchers are equipped for time-series regression, survival, geo-spatial clustering, and policy effect estimation. Examples in Python and R are provided for data joins and rate computations.
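The FIPS-keyed join and per-capita rate computation can be sketched in pandas as follows. The file schemas mirror the list above, but the `population` column name in the descriptors table is an assumption for illustration.

```python
import pandas as pd

# Toy stand-ins for cases_deaths.csv and county_descriptors.csv,
# joined on the county FIPS code.
cases = pd.DataFrame({
    "date": ["2020-04-01", "2020-04-01"],
    "FIPS": [36061, 6037],
    "cases": [500, 300],
    "deaths": [25, 9],
})
descriptors = pd.DataFrame({
    "FIPS": [36061, 6037],
    "population": [1_600_000, 10_000_000],  # assumed column name
})

merged = cases.merge(descriptors, on="FIPS", how="left")

# Per-100k rates, the normalization used for cross-county comparison.
merged["cases_per_100k"] = merged["cases"] / merged["population"] * 1e5
merged["deaths_per_100k"] = merged["deaths"] / merged["population"] * 1e5
```

In practice the same pattern applies to the SafeGraph and Google Mobility files, with static county descriptors broadcast across the time series after the join.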
3. NYC COVID-19 Health Facility Egress Behavior
The NYC Egress Behavior resource captures individual routes, behavioral events, and PPE usage for 5,163 egress traces at 19 medical facilities, with links to 61 zip-level socio-economic indicators and 7 weather variables (Laefer et al., 2021). Each record anonymizes spatio-temporal trajectory, object touches, transportation choices, and destination, consolidated in CSV and ESRI shapefiles. Complete codebooks detail all 112 variables.
Temporal gaps are imputed with the MICE (multivariate imputation by chained equations) algorithm. Rates of contact and object touching are formalized as per-trace event rates, providing direct input for spatially explicit SIR models with behavioral vector components.
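A MICE-style imputation can be sketched with scikit-learn's IterativeImputer, a chained-equations implementation used here as a stand-in; the paper's exact MICE configuration and variable set are not specified in this summary, so the columns below are illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Illustrative trace matrix: e.g. trace index, egress duration (s),
# touch-event count, with missing temporal fields marked as NaN.
X = np.array([
    [1.0, 60.0, 3.0],
    [2.0, np.nan, 4.0],
    [3.0, 80.0, np.nan],
    [4.0, 90.0, 6.0],
])

# Chained-equations imputation: each column with missing values is
# modeled as a function of the others, iterating to convergence.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
```

The imputed matrix can then feed the per-trace rate computations used by the behavioral SIR models.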
4. PubHealthBench: UK Public Health LLM Benchmark
PubHealthBench is a benchmark for evaluating LLMs on UK government health guidance (Harris et al., 9 May 2025). It contains 8,090 multiple-choice question answering (MCQA) samples generated from 687 UKHSA documents, distributed across 10 technical areas and 352 subcategories. Each sample includes 1 correct option and 6 distractors, validated both manually and automatically for ambiguity (invalid rate: 5.5%).
The sample generation pipeline employs markdown chunking, LLM-based recommendation filtering, and Chain-of-Thought prompting for MCQA generation. Free-form tasks use the question stem only and are evaluated by LLM judges. SOTA LLMs (GPT-4.5, o1) exceed the human baseline in MCQA mode (≈92.5% vs. 88% accuracy), but free-form accuracy remains below 75%. Accuracy metrics follow standard binary and interval scoring protocols.
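The binary and interval scoring mentioned above can be sketched as follows. The Wilson score interval is used here as one common choice of confidence interval and is an assumption for illustration, not necessarily the benchmark's exact protocol.

```python
import math


def mcqa_accuracy(predictions, answers):
    """Binary (exact-match) accuracy over multiple-choice answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def wilson_interval(k, n, z=1.96):
    """95% Wilson score interval for k successes in n trials.

    One common interval-scoring choice (an assumption here).
    """
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

For example, 3 correct answers out of 4 gives 75% accuracy with a wide Wilson interval, which tightens as the sample count grows toward the benchmark's 8,090 questions.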
Licensing follows the UK Open Government Licence v3.0.
5. PUBHEALTH: Explainable Fact-Checking Resource
PUBHEALTH is an 11,832-instance fact-checking dataset for public health claims, annotated with four-way veracity labels and gold explanations (Kotonya et al., 2020, Zhu et al., 26 Aug 2025). Claims originate from 27,578 fact-checking, 9,023 news, and 2,700 review site entries (Oct 1995–May 2020), filtered by a health lexicon. Each instance comprises claim, explanation, veracity, article, and metadata fields.
Label mapping standardizes to {True, False, Mixture, Unproven}. Journalist-crafted explanations are assessed for strong global coherence (SGC), weak global coherence (WGC), and local coherence (LC) via NLI models; gold explanations score SGC 76.8%, WGC 98.4%, and LC 65.6%.
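The four-way label standardization can be sketched as a lookup table. The raw source-site labels shown are illustrative assumptions, not the dataset's documented mapping.

```python
# Illustrative mapping from heterogeneous source-site verdicts to the
# standardized four-way PUBHEALTH scheme (raw labels are assumptions).
LABEL_MAP = {
    "true": "True",
    "mostly true": "Mixture",
    "half true": "Mixture",
    "mostly false": "Mixture",
    "false": "False",
    "pants on fire": "False",
    "unproven": "Unproven",
}


def normalize_label(raw):
    """Return the standardized label, or None if the verdict is unmapped."""
    return LABEL_MAP.get(raw.strip().lower())
```

Unmapped verdicts return None so that out-of-scheme claims can be filtered rather than silently mislabeled.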
Baseline veracity classifiers use BERT, SciBERT, and BioBERT; models fine-tuned in-domain outperform general models (e.g., SciBERT + top 5 sentences: F₁ 70.52%, accuracy 69.73%). Explanation generation is evaluated with ROUGE-n F₁ metrics.
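ROUGE-n F₁ can be sketched as clipped n-gram overlap between a generated explanation and the gold explanation; whitespace tokenization is a simplifying assumption here, and published scores use standard ROUGE tooling.

```python
from collections import Counter


def rouge_n_f1(candidate, reference, n=1):
    """ROUGE-n F1 via clipped n-gram overlap (whitespace tokenization)."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped overlap count
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

An identical candidate scores 1.0; "the cat" against "the dog" shares one unigram of two, giving precision = recall = F₁ = 0.5.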
ArgRAG applies Quantitative Bipolar Argumentation, using retrieved Wikipedia evidence and deterministic inference over constructed argumentation graphs, yielding accuracy up to 89.8% on binary fact verification (Zhu et al., 26 Aug 2025).
6. Supporting Resources: HealthE, PHAD, and PHEE
- HealthE: 6,756 health advice articles, >42,000 annotated entities across 10 types. Provides benchmarks for health advice entity recognition; EP S-BERT sets new F₁ standards (0.72), outperforming prior medical NER (Gatto et al., 2022).
- PHAD: 5,730 social media tobacco usage videos, 4.3 million frames; annotated across eight fine-grained product classes and enriched with engagement metrics and descriptors. A two-stage Vision-Language Encoder achieves 73.2% accuracy (F₁ 71.8%), with fine-grained ablations for multi-modal tobacco content analysis (Chappa et al., 2024).
- PHEE: 5,019 pharmacovigilance events from MEDLINE, annotated with a hierarchical schema over ADE/PTE triggers, subject/treatment/effect arguments, and rare attributes. Supports QA and sequence-labeling event extraction with benchmarks across argument and sub-argument tasks (Sun et al., 2022).
7. COVID-19 Online Datasets for Misinformation and Policy Tracking
The “PubHealth Dataset” also references a curated collection of COVID-19 Twitter data, official guidance, and WHO situation reports, supporting infodemic research (Inuwa-Dutse et al., 2020). Twitter subcomponents include account-based, random, and myth/conspiracy-related tweets, all with user, text, entity, and sentiment metadata. Official guidance documents and situation-report (sitrep) tables contain standardized health recommendations and case/death reporting, available as normalized CSV/JSON.
Pipeline integration is enabled via hydrating tweet IDs (Tweepy), parsing guidance sources (BeautifulSoup, PDFMiner), and time-series analytic storage. Misinformation detection, sentiment analysis, policy impact modeling, and community clustering require harmonization of missingness, selection, language, and geographic biases.
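The Tweepy hydration step can be sketched as follows. This is a minimal sketch: it assumes a valid bearer token, endpoint access depends on current X/Twitter API terms, and the `batched` helper and field list are illustrative choices.

```python
def batched(ids, size=100):
    """Split an ID list into chunks of at most `size` (v2 lookup limit: 100)."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def hydrate_tweet_ids(ids, bearer_token):
    """Hydrate tweet IDs into full tweet objects via Tweepy's v2 lookup.

    Sketch only: requires a tweepy install and a caller-supplied
    bearer token with lookup access.
    """
    import tweepy  # lazy import so the batching helper works without it

    client = tweepy.Client(bearer_token=bearer_token)
    tweets = []
    for batch in batched(ids):
        resp = client.get_tweets(
            ids=batch,
            tweet_fields=["created_at", "lang", "entities", "public_metrics"],
        )
        tweets.extend(resp.data or [])
    return tweets
```

Hydrated tweets can then be parsed alongside the guidance sources (BeautifulSoup, PDFMiner) into the shared time-series store.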
8. Data Access, Licensing, and Noted Limitations
All datasets described are available under open-source or academic-research licenses, with explicit field schemas and code repositories. Privacy filters, missing-data handling, temporal scope, and representativeness are all documented as constraints. Suppression and aggregation approaches are tailored to minimize re-identification risk but may reduce analytical granularity.
By offering de-identified, multi-source, and multi-modal public health datasets, the PubHealth family supports reproducible, high-impact research across epidemiology, health policy, machine learning, and informatics. Continued updates, harmonization, and methodological transparency remain necessary to maintain utility for the evolving demands of global public health research.