Unraveling Pedestrian Fatality Patterns: A Comparative Study with Explainable AI

Published 22 Mar 2025 in cs.LG and cs.AI | (2503.17623v1)

Abstract: Road fatalities pose significant public safety and health challenges worldwide, with pedestrians being particularly vulnerable in vehicle-pedestrian crashes due to disparities in physical and performance characteristics. This study employs explainable artificial intelligence (XAI) to identify key factors contributing to pedestrian fatalities across the five U.S. states with the highest crash rates (2018-2022). It compares them to the five states with the lowest fatality rates. Using data from the Fatality Analysis Reporting System (FARS), the study applies machine learning techniques, including Decision Trees, Gradient Boosting Trees, Random Forests, and XGBoost, to predict contributing factors to pedestrian fatalities. To address data imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) is utilized, while SHapley Additive Explanations (SHAP) values enhance model interpretability. The results indicate that age, alcohol and drug use, location, and environmental conditions are significant predictors of pedestrian fatalities. The XGBoost model outperformed others, achieving a balanced accuracy of 98%, accuracy of 90%, precision of 92%, recall of 90%, and an F1 score of 91%. Findings reveal that pedestrian fatalities are more common in mid-block locations and areas with poor visibility, with older adults and substance-impaired individuals at higher risk. These insights can inform policymakers and urban planners in implementing targeted safety measures, such as improved lighting, enhanced pedestrian infrastructure, and stricter traffic law enforcement, to reduce fatalities and improve public safety.

Summary

The paper demonstrates a robust ensemble approach using XGBoost with SMOTENC and SHAP to achieve high predictive accuracy in identifying pedestrian fatality determinants.
It employs advanced explainable AI techniques to quantify critical risk factors including age, substance use, and environmental attributes across varying state contexts.
Findings support targeted public safety interventions and data-driven policy decisions by highlighting key infrastructure and behavior-based risk factors.

Comparative Analysis of Pedestrian Fatality Patterns via Ensemble Explainable AI

Introduction

This paper presents a technical and comparative examination of pedestrian fatality determinants in the United States using state-of-the-art ensemble ML models augmented with explainable AI (XAI). The research investigates contributing factors in the top five and bottom five U.S. states by pedestrian fatality rate (2018–2022), leveraging the Fatality Analysis Reporting System (FARS) for heterogeneous and high-dimensional crash and contextual data. The primary methodological innovations include the integration of SMOTENC for class imbalance correction and SHAP for post-hoc feature attribution. The study offers both a rigorous model performance assessment and detailed interpretable insights on the principal variables linked to pedestrian mortality.

Data Exploration and Identification of High-Risk Zones

The study follows a two-stage methodology: exploratory data analysis to identify high-fatality zones, followed by ML-based pattern extraction. The clustering and spatial analyses reveal recurrent high-risk zones, typically at mid-block locations, on arterial roadways, and in environments with compromised visibility.

Figure 1: Identification of high-risk pedestrian fatality zones across the U.S.

The identification phase highlights infrastructure deficiencies, speed zones, lighting, and intersection control as recurrent spatial correlates of fatal incidents, providing granularity often absent from national or regional trend analyses.

Feature Selection and Modeling Pipeline

The predictive modeling pipeline incorporates Decision Tree (DT), Random Forest (RF), Gradient Boosting Tree (GBT), and Extreme Gradient Boosting (XGBoost), benchmarked on a harmonized feature subset. To address severe outcome-class imbalance, SMOTENC is applied. SMOTENC is the SMOTE variant for datasets that mix nominal and continuous features: it interpolates continuous features between minority-class neighbors but assigns categorical features an existing category, so synthetic records never contain meaningless in-between category codes.
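To make the resampling step concrete, the following is a minimal, simplified sketch of SMOTENC-style oversampling in plain Python. It is not the paper's implementation: the function name `smotenc_sketch`, the toy rows, and the column layout (age, hour, light condition) are all hypothetical, and a maintained implementation such as the `SMOTENC` class in imbalanced-learn would be the usual choice in practice.

```python
import random
from collections import Counter
from statistics import median, pstdev


def smotenc_sketch(minority, cat_idx, n_new, k=3, seed=0):
    """Generate n_new synthetic minority rows (simplified SMOTE-NC).

    minority: list of rows mixing floats and category labels.
    cat_idx:  set of column indices that hold categorical values.
    """
    rng = random.Random(seed)
    num_idx = [j for j in range(len(minority[0])) if j not in cat_idx]
    # SMOTE-NC penalises each categorical mismatch by the median of the
    # standard deviations of the continuous features.
    penalty = median([pstdev([row[j] for row in minority]) for j in num_idx])

    def dist(a, b):
        d = sum((a[j] - b[j]) ** 2 for j in num_idx)
        d += sum(penalty ** 2 for j in cat_idx if a[j] != b[j])
        return d ** 0.5

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (r for r in minority if r is not base),
            key=lambda r: dist(base, r),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()
        row = []
        for j in range(len(base)):
            if j in cat_idx:
                # Categorical: most frequent label among the k neighbours.
                labels = Counter(n[j] for n in neighbours)
                row.append(labels.most_common(1)[0][0])
            else:
                # Continuous: interpolate between base and a random neighbour.
                row.append(base[j] + gap * (mate[j] - base[j]))
        synthetic.append(row)
    return synthetic


# Hypothetical minority-class rows: [age, hour of day, light condition].
fatal = [
    [72.0, 21.0, "dark"],
    [65.0, 22.0, "dark"],
    [34.0, 23.0, "dark"],
    [58.0, 6.0, "dawn"],
    [80.0, 20.0, "dark"],
]
new_rows = smotenc_sketch(fatal, cat_idx={2}, n_new=4)
```

Every synthetic row keeps a light condition that actually occurs in the data, while age and hour fall between those of two real minority rows. That is the property that distinguishes SMOTENC from applying plain SMOTE to integer-encoded categories.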

Model interpretability is prioritized via SHAP, allowing precise, post-hoc quantification of global and local feature contributions. Figure 2 illustrates the integrated methodological framework.

Figure 2: Feature analysis and model interpretability workflow leveraging ensemble methods and SHAP XAI.

This pipeline enables robust variable importance estimation, model transparency, and differential analysis across state cohorts.
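The attribution idea behind SHAP can be illustrated with exact Shapley values on a tiny model. The sketch below enumerates every feature coalition, which is only feasible for a handful of features; SHAP libraries rely on faster model-specific algorithms for tree ensembles. The scoring function `toy_risk`, its coefficients, and the feature names are invented for illustration and are not fitted to FARS.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; 'absent' features take baseline values."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                with_i = value(set(s) | {i})
                without_i = value(set(s))
                total += weight * (with_i - without_i)
        phi.append(total)
    return phi


# Hypothetical risk score over [age_over_65, dark, alcohol].
def toy_risk(z):
    age, dark, alcohol = z
    return 0.1 + 0.3 * age + 0.2 * dark + 0.1 * alcohol + 0.2 * dark * alcohol


x = [1, 1, 1]
baseline = [0, 0, 0]
phi = shapley_values(toy_risk, x, baseline)
```

In this toy case the dark-by-alcohol interaction term is split equally between the two features involved, and the attributions sum to `toy_risk(x) - toy_risk(baseline)`. This additivity is what allows SHAP values to be read as a per-prediction decomposition.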

Model Performance and Resultant Confusion Patterns

XGBoost achieves the best results, with a balanced accuracy of 98%, accuracy of 90%, precision of 92%, recall of 90%, and an F1 score of 91%, outperforming the single Decision Tree and Gradient Boosting Tree models. The Random Forest model trails closely, but XGBoost offers the best trade-off between positive and negative class discrimination. The confusion matrix (Figure 3) shows low false negative and false positive rates, which is critical for public safety applications.

Figure 3: Confusion matrix for pedestrian crash severity prediction, post-SMOTENC balancing.

The model detects fatal/non-fatal events with high reliability, underlining the efficacy of ensemble approaches and synthetic minority resampling for rare event learning.
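The metrics reported in this section all derive from the four cells of a binary confusion matrix. As a reminder of how they relate, here is a small sketch using the standard definitions; the counts are hypothetical and are not the paper's confusion matrix.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)          # sensitivity: share of fatal cases caught
    specificity = tn / (tn + fp)     # share of non-fatal cases correctly rejected
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts chosen only to show how the metrics can diverge.
m = binary_metrics(tp=90, fp=10, fn=10, tn=890)
```

Because balanced accuracy averages recall and specificity, it weights both classes equally regardless of their size, whereas plain accuracy is dominated by the majority class. The two can therefore differ noticeably on imbalanced data.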

Feature Attribution and Statewise Pattern Heterogeneity

SHAP analysis identifies the key predictors of pedestrian fatality: AGE, DRINKING, DRUG USE, LOCATION (intersection/midblock, rural/urban, light condition), STATE, and both driver and non-motorist distraction/impairment. Figure 4 details SHAP importance stratified by state fatality rates.

Figure 4: Top 5 states' SHAP feature importance rankings for fatality prediction.

AGE emerges as universally dominant; older adults and youth face greater risk due to physiological and behavioral factors. DRINKING and DRUG USE are pronounced in bottom-fatality states, while environmental and locational features are primary in high-fatality states. The variable STATE captures legislative, infrastructural, and behavioral heterogeneity. Distraction variables (driver/non-motorist) show consistent but regionally variable influence.

Model explanations also uncover the diminished impact of work zone presence and rural/urban designation, suggesting that contemporary risk is increasingly decoupled from crude traffic-engineering boundaries and more directly tied to micro-environmental and individual-level risk factors.
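Cohort-level rankings of this kind are commonly obtained by averaging the absolute SHAP value of each feature over all records in a cohort. The sketch below assumes that convention; the summary does not state the exact aggregation used in the paper. The helper name and the per-crash attribution values are made up for illustration.

```python
from collections import defaultdict


def mean_abs_shap_by_cohort(records):
    """records: iterable of (cohort, {feature: shap_value}) pairs."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for cohort, attributions in records:
        counts[cohort] += 1
        for feature, value in attributions.items():
            sums[cohort][feature] += abs(value)
    # Rank features within each cohort by mean absolute attribution.
    return {
        cohort: sorted(
            ((feat, total / counts[cohort]) for feat, total in feats.items()),
            key=lambda item: item[1],
            reverse=True,
        )
        for cohort, feats in sums.items()
    }


# Made-up per-crash attributions, for illustration only.
records = [
    ("high", {"AGE": 0.40, "LIGHT": -0.25, "DRINKING": 0.05}),
    ("high", {"AGE": -0.30, "LIGHT": 0.35, "DRINKING": 0.10}),
    ("low", {"AGE": 0.25, "LIGHT": 0.05, "DRINKING": -0.20}),
    ("low", {"AGE": 0.35, "LIGHT": -0.05, "DRINKING": 0.30}),
]
ranking = mean_abs_shap_by_cohort(records)
```

Taking absolute values before averaging keeps features whose effects point in opposite directions across records from cancelling out. This is why a feature can rank highly even when its signed mean is near zero.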

Discussion: Implications for Practice and Theory

From a practical safety engineering perspective, the findings support prioritizing midblock interventions (e.g., raised crossings, median refuges), improved nighttime illumination, and targeted behavioral countermeasures (addressing alcohol- and drug-impaired walking, education for older adults). State-level SHAP analysis supports allocating interventions differently across states: for example, states where substance-use effects are strong call for stronger enforcement and detection strategies, while states dominated by environmental and locational features call for infrastructure measures.

Theoretically, the results support integrating ensemble ML and XAI in injury epidemiology, where rare-event imbalance, variable collinearity, and multi-scale confounding are endemic. XGBoost's performance advantage, particularly after SMOTENC balancing, strengthens the case for such models in safety-critical, high-imbalance domains.

Limitations and Future Directions

Notwithstanding model robustness, the scope is constrained by the available feature set in FARS and the limited sample of both high and low fatality states. Future work should incorporate extended geospatial-temporal variables (e.g., built environment metrics, real-time traffic flow), fine-grained weather/darkness data, and socioeconomic factors. The application of GIS-integrated models could elucidate spatiotemporal stability and the interaction of exposure and risk. Prospective research should also address prediction transferability and domain adaptation in low-fatality locales, where the class imbalance challenge is more severe.

Conclusion

This comparative analysis validates XGBoost with SMOTENC and SHAP as a high-fidelity, interpretable framework for elucidating pedestrian fatality determinants across U.S. state contexts. The empirical findings confirm the centrality of age, substance impairment, and locational-exposure factors while uncovering nuanced statewise differences. The study’s modeling and analytic protocol should inform both transport policy and the broader injury surveillance community in developing proactive, data-driven countermeasures for pedestrian risk mitigation.
