
Actionable Warning Dataset Overview

Updated 22 November 2025
  • Actionable Warning Dataset is a curated collection of alerts annotated for actionability to support precise filtering and prioritization.
  • It is constructed using rigorous methods including source selection, formal labeling protocols, deduplication strategies, and statistical validation.
  • The dataset underpins ML applications in static code analysis, safety systems, content filtering, and crisis communication to reduce false alarms.

An Actionable Warning Dataset is a rigorously constructed corpus in which each instance represents a warning or alert—for example, from program analysis, environment monitoring, or content filtering—explicitly annotated for actionability, i.e., whether the warning should or could elicit concrete, corrective, or preventive action by a human or automated agent. Such datasets are critical for developing, benchmarking, and deploying machine-learning (ML) methods that must prioritize, filter, or generate warnings with actual importance to end-users, thereby reducing false alarms, alert fatigue, and missed critical events. The design, construction, and evaluation of actionable warning datasets involve careful source selection, formal actionability criteria, annotation protocol, schema definition, and validation. The following sections provide a comprehensive overview spanning the creation, structure, use, and limitations across contemporary benchmarks and domains.

1. Foundations: Definition and Motivations

Actionable warning datasets systematically capture alerts produced by systems that detect (potentially) undesirable states—such as static code analyzers (SCAs), safety systems, recommender filters, or financial monitors—paired with explicit actionability labels. Actionability, in this context, indicates whether the warning triggered a verifiable change (e.g., a code fix, user behavior modification), corresponds to a real issue, or satisfies downstream utility (e.g., makes a recommendation system safer or more user-aligned).

The motivation for such datasets stems from observed phenomena including high static analysis false alarm rates (often >80%) and the resulting alert fatigue, where prevalence of non-actionable warnings leads to critical signals being disregarded. Machine learning methods for actionable warning identification (AWI) and triage require large, representative, and accurately labeled corpora to achieve generalizable performance and robust prioritization (Kószó et al., 13 Nov 2025, Ge et al., 2023).

2. Dataset Construction Methodologies

Dataset construction processes vary by domain but usually involve the following methodological components:

a. Source Selection and Extraction:

Data are typically sourced by operationally running SCA tools, safety monitors, or content filters on large-scale real-world data—such as major open-source project repositories (e.g., top 500 GitHub C or Java repositories (Kószó et al., 13 Nov 2025, Xue et al., 15 Nov 2025, Ge et al., 2024)), application logs, or curated content collections.

b. Labeling Protocols:

  • Code Analysis Datasets: Labeling uses commit diffs, warning disappearance tracking, and sometimes human validation. For instance, in NASCAR, a warning is “actionable” if it appeared in the parent commit and was removed by a source code change in the child (next) commit after the warning context changed, with non-actionable warnings defined as those persisting across revisions (Kószó et al., 13 Nov 2025).
  • Content/Recommendation Systems: Actionability maps to the presence of strong, community-verified content warnings (e.g., “Clear Yes” with ≥75% upvotes on Does the Dog Die? for a given sensitive topic) (Kovacs et al., 8 Sep 2025).
  • LLM-Assisted and Robotic Environments: Scene annotations via graph-based extraction and expert LLM (GPT-4) labeling of object relationships into “normal,” “dangerous,” “unsanitary,” or “dangerous for children” (Jr et al., 2024).
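As an illustration of the commit-diff labeling used in NASCAR-style code-analysis datasets, the sketch below marks a parent-revision warning actionable when no matching warning survives in the child revision. The helper and field names are hypothetical; the actual protocol also normalizes warning positions and checks that the warning's code context changed.

```python
# Sketch of disappearance-based actionability labeling (illustrative
# field names; real matching is more elaborate).
def warning_key(w):
    """Match warnings across revisions by tool, rule, and file."""
    return (w["tool"], w["warning_type"], w["filename"])

def label_warnings(parent_warnings, child_warnings):
    """Label each parent-revision warning: 1 = actionable (the warning
    disappeared in the child commit), 0 = non-actionable (it persists)."""
    child_keys = {warning_key(w) for w in child_warnings}
    return [
        {**w, "label": 0 if warning_key(w) in child_keys else 1}
        for w in parent_warnings
    ]
```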

c. Deduplication, Filtering, and Imbalance Handling:

Deduplication (e.g., MinHash+LSH in NASCAR) eliminates near-duplicates, typically retaining only the last occurrence of a persistent warning. Class imbalance is intrinsic: actionable warnings are often 2–6 times less frequent than non-actionable ones (Ge et al., 2023).
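The deduplication step can be illustrated with exact Jaccard similarity over character shingles. NASCAR uses MinHash+LSH to approximate this at scale, so the following is a simplified analog, and the threshold value here is only illustrative:

```python
# Exact-Jaccard deduplication sketch; MinHash+LSH approximates the
# same pairwise comparison efficiently on millions of warnings.
def shingles(text, k=5):
    """Set of overlapping k-character shingles of a warning message."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def dedup(messages, threshold=0.95):
    """Keep a message only if no already-kept message is a near-duplicate."""
    kept, kept_shingles = [], []
    for msg in messages:
        s = shingles(msg)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(msg)
            kept_shingles.append(s)
    return kept
```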

d. Statistical Validation:

Manual adjudication or statistical sampling is often used to validate labeling accuracy (e.g., 100% agreement in a validation sample of n=69 in NASCAR) (Kószó et al., 13 Nov 2025). Weak supervision may supplement or replace fully manual methods (Xue et al., 15 Nov 2025, Xue et al., 2023).
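When a validation sample shows zero disagreements, as in the NASCAR sample of n=69, the "rule of three" gives an approximate 95% upper bound of 3/n on the true labeling-error rate. A minimal sketch of such a validation summary (function name is hypothetical):

```python
import math

# Summarize manual adjudication of a random label sample: observed
# agreement plus an approximate 95% upper bound on the error rate.
def agreement_stats(n_checked, n_disagreements):
    rate = 1 - n_disagreements / n_checked
    if n_disagreements == 0:
        upper_error = 3 / n_checked  # rule of three for zero events
    else:
        p = n_disagreements / n_checked
        upper_error = p + 1.96 * math.sqrt(p * (1 - p) / n_checked)
    return rate, upper_error

rate, err = agreement_stats(69, 0)
# 100% observed agreement; true error rate plausibly below ~4.3%
```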

3. Data Schema, Scale, and Label Semantics

Actionable warning datasets are structured to facilitate ML consumption, cross-study reproducibility, and careful downstream analysis. The core schema components are as follows:

| Field | Examples/Values | Purpose |
|---|---|---|
| `tool` | "PMD", "SpotBugs", "Infer" | Origin of warning |
| `warning_type` | "UnnecessaryLocalBeforeReturn", "NullDereference" | Rule/granularity |
| `warning_msg` | Full diagnostic, template | Textual description |
| commit metadata | `parent_sha`, `commit_sha`, `parent_date`, `commit_date` | Temporal context |
| repo/filename | "github.com/org/repo", "src/Foo.java" | Source linkage |
| `positions` | start/end line and column (JSON) | Code region pinpointing |
| `code_context` | ±N lines, AST context, embeddings | Enriched feature context |
| `label` | 0/1 ("actionable"), or granular (VTB/LTB/UTB/FalseAlarm) | Actionability class |
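A single record following this schema might look as follows; all values are illustrative, not drawn from any actual dataset:

```python
import json

# One hypothetical warning record matching the schema fields above.
record = {
    "tool": "SpotBugs",
    "warning_type": "NP_NULL_ON_SOME_PATH",
    "warning_msg": "Possible null pointer dereference of 'conn'",
    "parent_sha": "a1b2c3d",
    "commit_sha": "d4e5f6a",
    "repo": "github.com/org/repo",
    "filename": "src/Foo.java",
    "positions": {"start_line": 42, "end_line": 42,
                  "start_col": 8, "end_col": 12},
    "code_context": "surrounding lines, AST context, or embeddings",
    "label": 1,  # 1 = actionable
}
print(json.dumps(record, indent=2))
```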

Dataset scales in prominent benchmarks include up to 1.2M warnings (“NASCAR”), with actionable rates ranging from 1.7% (Infer on C: 538 actionable vs 30,590 false alarms (Xue et al., 2023)) to 16% (NASCAR/Java (Kószó et al., 13 Nov 2025)).

Labeling semantics may be binary (actionable/non-actionable), ordinal (e.g., VTB/LTB/UTB—Very Likely/Likely/Unlikely To Be Bugs), or vectorial (e.g., explicit warning category presence) depending on the domain and downstream application.
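The ordinal VTB/LTB/UTB scheme can be encoded so that it collapses cleanly to a binary label when a downstream task needs one; the collapse threshold below is an assumption for illustration:

```python
from enum import IntEnum

# Ordinal label scheme (Very likely / Likely / Unlikely To be Bugs),
# ordered so that thresholding yields a binary actionability label.
class WarningLabel(IntEnum):
    UTB = 0  # Unlikely To be a Bug
    LTB = 1  # Likely To be a Bug
    VTB = 2  # Very likely To be a Bug

def to_binary(label: WarningLabel) -> int:
    """Collapse to actionable (1) vs non-actionable (0); the cut at
    LTB is an illustrative choice, not prescribed by the datasets."""
    return int(label >= WarningLabel.LTB)
```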

4. Benchmark Datasets and Domains

a. Static Code Analysis (SCA):

  • NASCAR (“(Non-)Actionable Static Code Analysis Reports”):
    • 1,227,763 Java SCA warnings, 16% actionable, deduplicated subset of 1,083,073 (Kószó et al., 13 Nov 2025).
    • Actionability labels inferred by differential SCA report comparison across GitHub commit history.
  • ACWRecommender and Variants:
    • 31,128 Infer warnings (C), with deterministic weak-labeling into VTB/LTB/UTB (Xue et al., 2023, Xue et al., 15 Nov 2025).
    • Precomputed embeddings (e.g., UniXcoder) supplied for ML pipeline integration.
  • SpotBugs Java Corpus:
    • 10,140 warnings across 10 Apache projects; actionable via manual and lifetime-based rules (Ge et al., 2024).
  • Historical Datasets Surveyed: See (Ge et al., 2023) for coverage of FindBugs and other SCA tools, detailing thousands to tens of thousands of warnings per dataset.

b. User-Facing Content and Safety Systems:

  • Sensitive Content in Recommendation Systems:
    • ML-DDD: 22.8M MovieLens user–movie–rating interactions; 137 warning categories (e.g., “blood,” “gun violence”); warning flags reflect “Clear Yes” consensus in community (Kovacs et al., 8 Sep 2025).
    • AO3/Webis: 45.9M fanfiction user–work–interaction tuples; 36 trigger warnings (e.g., “pornography,” “abuse”).
  • Robotic Home Safety—SafetyDetect:
    • 1,000 home episodes, 967 placed anomalies across 13 hazard types (safety, sanitation, child). Scene-graph and LLM (GPT-4)–based labeling into (normal, dangerous, unsanitary, dangerous_for_children) (Jr et al., 2024).

c. Early Warning Systems (EWS) in Finance:

  • Household-level panel (2,250 records): financial distress flags, engineered predictors, and intervention trigger mapping (e.g., “Volatility_Index > mean+2σ” triggers alert) (Pant et al., 25 Oct 2025).
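The quoted mean+2σ trigger rule can be sketched directly; the function and field names here are illustrative, not taken from the cited work:

```python
import statistics

# Flag records whose volatility index exceeds the panel mean by
# k standard deviations (k=2 mirrors the "mean+2σ" rule above).
def alert_triggers(volatility_values, k=2.0):
    mean = statistics.fmean(volatility_values)
    sigma = statistics.pstdev(volatility_values)
    threshold = mean + k * sigma
    return [v > threshold for v in volatility_values]
```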

d. Crisis Communication and NLG:

  • CrisiText: Over 400,000 warning messages generated for 17,771 distinct crisis situations, annotated for structure and expert compliance, supporting LLM training/evaluation (Gonella et al., 10 Oct 2025).

5. Use Cases and Evaluation Metrics

Actionable warning datasets enable a diversity of downstream tasks and benchmarking protocols:

  • AWI Model Training: Training classifiers or ranking functions for warning triage, with metrics such as precision, recall, F1-score, AUC, nDCG@K, and MRR (Kószó et al., 13 Nov 2025, Xue et al., 15 Nov 2025, Xue et al., 2023).
  • Alert Prioritization: Applied within IDEs, CI/CD pipelines, or user dashboards for context-aware or personalized alerting.
  • Recommendation and Filtering: User-preference–sensitive filtering (e.g., penalizing content with unwanted warnings, minimizing “warning amplification”) (Kovacs et al., 8 Sep 2025).
  • Safety and Compliance: Robotics and safety agents (e.g., TurtleBot scenario) generate context-dependent, user-preference–filtered warnings (Jr et al., 2024).
  • Crisis/Emergency NLG Systems: Training LLMs to produce concise, actionable, and guideline-compliant public warnings (Gonella et al., 10 Oct 2025).

Evaluation metrics are tailored to the task: classification-style AWI uses precision, recall, F1-score, and AUC; ranking-oriented triage relies on nDCG@K and MRR; and NLG and safety applications additionally apply guideline-compliance and user-preference measures.
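The ranking metrics used for warning triage (MRR and nDCG@K) follow their standard definitions; a minimal sketch with binary relevance (1 = actionable):

```python
import math

def mrr(ranked_labels_per_query):
    """Mean reciprocal rank of the first actionable warning per list."""
    total = 0.0
    for labels in ranked_labels_per_query:
        total += next((1 / (i + 1) for i, l in enumerate(labels) if l), 0.0)
    return total / len(ranked_labels_per_query)

def ndcg_at_k(labels, k):
    """nDCG@k with binary gains and log2 positional discounting."""
    dcg = sum(l / math.log2(i + 2) for i, l in enumerate(labels[:k]))
    ideal = sorted(labels, reverse=True)
    idcg = sum(l / math.log2(i + 2) for i, l in enumerate(ideal[:k]))
    return dcg / idcg if idcg else 0.0
```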

6. Limitations and Open Challenges

Prominent limitations are evident across the surveyed datasets:

  • Labeling Uncertainty: Many labeling schemes infer actionability via warning disappearance, which may not always align with ground-truth resolution (e.g., developers may ignore a warning for other reasons or never run the tool again) (Kószó et al., 13 Nov 2025, Ge et al., 2023, Xue et al., 15 Nov 2025).
  • Domain Coverage: Most large-scale datasets focus on limited languages (notably Java, C) and a narrow SCA tool set (FindBugs, PMD, SpotBugs, Infer), with limited inclusion of other languages or platforms (Ge et al., 2023).
  • Class Imbalance and Coverage Gaps: Ratios of actionable to non-actionable warnings are commonly highly skewed; rare rule types and security-critical alerts are often underrepresented (Ge et al., 2023).
  • Deduplication and Heuristics: Parameters in deduplication (e.g., Jaccard similarity threshold τ=0.95) or warning matching (lifetime, content normalization) can affect coverage and introduce noise (Kószó et al., 13 Nov 2025).
  • Real-World Generalization: Most datasets derive from large, actively maintained OSS projects; data from other software ecosystems or organizational environments may differ in warning semantics and actionability.

7. Access, Reproducibility, and Future Directions

Major datasets offer open access (e.g., NASCAR at Zenodo (Kószó et al., 13 Nov 2025), ACW on Zenodo and 4open.science (Xue et al., 15 Nov 2025, Xue et al., 2023), SafetyDetect repository release (Jr et al., 2024)). Documentation and code are typically provided to support reproducibility, and best practices recommend public release of both raw and processed data, code for mining and deduplication, and clear annotation schemas (Kószó et al., 13 Nov 2025, Ge et al., 2023). Current research directions include expansion to underrepresented domains, integration of LLMs for actionability inference, hybrid human-machine labeling workflows, and robust validation on field deployments (Ge et al., 2023). There is also growing attention to domain adaptation and rare-event detection in safety-critical systems (Khan et al., 2024, Gonella et al., 10 Oct 2025).

Actionable warning datasets are thus foundational resources for the empirical study, evaluation, and improvement of alerting and decision-support systems spanning software engineering, safety, crisis response, and user-centered AI. Their continued development underpins advances in the actionable, reliable, and context-sensitive delivery of critical warnings across technical domains.
