SoccerMon Dataset for Injury Risk Modeling
- SoccerMon Dataset is a publicly available, longitudinal resource that collects daily well-being, training, GPS, and injury data from elite female footballers.
- Its comprehensive structure, featuring rolling workload metrics, subjective wellness, session details, and GPS data, enables detailed time-series analysis of injury risk.
- By employing DeepHit-based survival modeling and SHAP for interpretability, the dataset has advanced personalized, time-varying injury forecasting in sports science.
The SoccerMon dataset is a publicly available, longitudinal resource for monitoring elite female footballers, emphasizing the collection of daily athlete well-being, training, GPS, and injury data. It has been leveraged to advance injury forecasting methodologies, notably within deep learning-based survival modelling, providing a rigorously curated platform for empirical evaluation of individualized, time-varying risk predictions in sports science research (Catterall et al., 27 Jan 2026).
1. Dataset Composition and Structure
SoccerMon captures two seasons (2020–2021) of elite Norwegian female football (Team B cohort), with the following attributes:
- Participants: 37 players
- Observation period: 2 seasons, 322 unique days
- Player-date observations: 4,449
- Injuries recorded: 43 (24 acute, 19 overuse)
It aggregates five primary feature blocks, totaling 39 variables:
- Rolling workload metrics: Daily load, acute/chronic training load (ATL, CTL28, CTL42), monotony, strain, acute-to-chronic workload ratio (ACWR), and other windowed sums/ratios.
- Subjective wellness: Six self-reported metrics (Fatigue, Mood, Readiness, Sleep Duration, Soreness, Stress) on a 1–10 scale, except sleep duration (hours).
- Session characteristics: Rate of perceived exertion (RPE), session-RPE (sRPE), subjective and objective session durations.
- GPS-derived running metrics: Mean/max/std speeds, proportional/time/distance running at intensity strata, total/speed-specific distances, distance per minute.
- Engineered features: Subjective_missingness_7d (proportion missing wellness entries in prior week), past_injury_count (cumulative, per-player).
This structure permits granular, daily-level modeling of both objective exposures and subjective states across the two monitored seasons.
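The rolling workload block can be illustrated with a short pandas sketch. The source does not give the exact formulas, so the definitions below are assumptions based on common sports-science conventions (7-day mean for ATL, 28-day mean for CTL28, monotony as weekly mean over weekly standard deviation, strain as weekly load times monotony). The column names `player_id`, `date`, and `daily_load` are hypothetical.

```python
import numpy as np
import pandas as pd


def add_workload_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add rolling workload columns to a player-by-day table.

    Expects hypothetical columns: player_id, date, daily_load.
    All window definitions are assumptions, not the dataset's documented ones.
    """
    df = df.sort_values(["player_id", "date"]).copy()
    load = df.groupby("player_id")["daily_load"]

    def roll(window: int, stat: str) -> pd.Series:
        # Rolling statistic computed separately within each player's series.
        return load.transform(
            lambda s: getattr(s.rolling(window, min_periods=window), stat)()
        )

    df["atl"] = roll(7, "mean")      # acute training load
    df["ctl28"] = roll(28, "mean")   # chronic training load, 28-day window
    weekly_sd = roll(7, "std")
    # Zero denominators become NaN instead of producing infinities.
    df["monotony"] = df["atl"] / weekly_sd.replace(0, np.nan)
    df["strain"] = roll(7, "sum") * df["monotony"]
    df["acwr"] = df["atl"] / df["ctl28"].replace(0, np.nan)
    return df
```

With `min_periods` equal to the window length, the first 27 days of each player's series have no ACWR value, which is one source of the missingness that the imputation step has to handle.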
2. Data Preprocessing and Imputation Strategies
Preprocessing consisted of:
- File consolidation: Merging session records (GPS, sRPE, injury logs) into a complete daily player-by-date matrix.
- Outlier removal: Speeds >32 km/h, session durations >200 min, and daily running distances >16 km were treated as implausible and removed.
- Feature derivation: Calculation of 7-day missingness and past injury counts.
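The outlier thresholds and the two engineered features can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: implausible values are masked rather than whole rows dropped, the 7-day window excludes the current day, `past_injury_count` counts only injuries before the current day, and all column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column names for the six wellness items.
WELLNESS = ["fatigue", "mood", "readiness", "sleep_duration", "soreness", "stress"]


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Mask implausible GPS/session values and derive the two engineered features."""
    df = df.sort_values(["player_id", "date"]).copy()

    # Thresholds from the text: >32 km/h, >200 min, >16 km per day.
    df.loc[df["max_speed_kmh"] > 32, "max_speed_kmh"] = np.nan
    df.loc[df["duration_min"] > 200, "duration_min"] = np.nan
    df.loc[df["distance_km"] > 16, "distance_km"] = np.nan

    # Share of wellness items missing on each day, then the mean of that share
    # over the previous 7 days (shift(1) excludes the current day).
    missing_today = df[WELLNESS].isna().mean(axis=1)
    df["subjective_missingness_7d"] = missing_today.groupby(df["player_id"]).transform(
        lambda s: s.shift(1).rolling(7, min_periods=1).mean()
    )

    # Cumulative injuries recorded strictly before the current day.
    df["past_injury_count"] = df.groupby("player_id")["injury"].transform(
        lambda s: s.shift(1, fill_value=0).cumsum()
    )
    return df
```

Shifting by one day before aggregating keeps both features free of same-day information, which matters when the target is an injury on or after that day.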
Missing value imputation followed three strategies, applied per feature and per player:
- Median Imputation: Replacement with that player’s median value.
- Bespoke formula: Preserves a player’s empirical feature rank (relative to teammates over prior 14 days) for any missing day.
- Linear interpolation: Temporal smoothing of missing entries within each player’s series.
Post-imputation, features with over 30% residual missingness were omitted. Distributional checks (via KDEs, univariate injury correlations) ensured imputations preserved essential statistical properties. The bespoke formula imputation best maintained signal for risk modeling, whereas linear interpolation best preserved marginal feature distributions.
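Two of the three imputation strategies map directly onto pandas operations, sketched below. The bespoke rank-preserving formula is not specified in enough detail here to reproduce, so it is omitted. The sketch assumes a hypothetical `player_id` column and rows already sorted by date within each player.

```python
import pandas as pd


def impute_median(df: pd.DataFrame, col: str) -> pd.Series:
    """Fill gaps in `col` with the same player's median of observed values."""
    player_median = df.groupby("player_id")[col].transform("median")
    return df[col].fillna(player_median)


def impute_linear(df: pd.DataFrame, col: str) -> pd.Series:
    """Linearly interpolate gaps in `col` within each player's own series.

    Assumes rows are in chronological order within each player.
    """
    return df.groupby("player_id")[col].transform(
        lambda s: s.interpolate(method="linear", limit_direction="both")
    )
```

Median imputation ignores temporal context, while interpolation follows the local trend; the difference shows up whenever a player's recent values sit far from their long-run median.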
3. Survival Modeling with DeepHit
Time-to-injury forecasting was formulated as a discrete-time survival prediction problem:
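- Survival function: $S(t \mid \mathbf{x}) = P(T > t \mid \mathbf{x})$
- Hazard function: $h(t \mid \mathbf{x}) = P(T = t \mid T \ge t, \mathbf{x})$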
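For a discrete-time model that outputs a probability for each future day, both quantities follow from the predicted probability mass function. The sketch below uses the standard definitions $S(t) = P(T > t)$ and $h(t) = P(T = t \mid T \ge t)$; the function name and the example values are illustrative only.

```python
import numpy as np


def survival_and_hazard(pmf):
    """Convert per-day event probabilities P(T = t), t = 1..K, into S(t) and h(t)."""
    pmf = np.asarray(pmf, dtype=float)
    survival = 1.0 - np.cumsum(pmf)                    # S(t) = 1 - sum_{s<=t} P(T = s)
    at_risk = np.concatenate(([1.0], survival[:-1]))   # S(t-1), with S(0) = 1
    hazard = np.divide(pmf, at_risk, out=np.zeros_like(pmf), where=at_risk > 0)
    return survival, hazard


# Illustrative 3-day forecast; the remaining 0.4 is mass beyond the horizon.
S, h = survival_and_hazard([0.1, 0.2, 0.3])
```

The two views are equivalent: multiplying the complements of the hazards, $\prod_{s \le t} (1 - h(s))$, recovers $S(t)$.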
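The DeepHit neural network [Lee, Yoon & van der Schaar, 2020] was employed, minimizing a convex combination of a log-likelihood loss and a pairwise ranking loss:
$$\mathcal{L} = \alpha\,\mathcal{L}_{\text{NLL}} + (1-\alpha)\,\mathcal{L}_{\text{rank}}$$
where $\mathcal{L}_{\text{NLL}}$ is the cross-entropy for event/censoring at each day, and $\mathcal{L}_{\text{rank}}$ penalizes incorrect ordering among comparable event times. The weight $\alpha$ was set to $0.7$.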
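A simplified NumPy sketch of a single-risk, DeepHit-style loss is given below. It is not the authors' implementation. It assumes that the weight of 0.7 applies to the likelihood term, that the ranking term is averaged over comparable pairs, and that the ranking temperature `sigma` is 0.1; none of these details are stated in the text.

```python
import numpy as np


def deephit_style_loss(pmf, time_idx, event, alpha=0.7, sigma=0.1, eps=1e-8):
    """Convex combination of a likelihood term and a pairwise ranking term.

    pmf      : (n, K) predicted P(T = t) per subject and day
    time_idx : (n,) zero-based index of the event or censoring day
    event    : (n,) 1 if injured at time_idx, 0 if censored there
    """
    pmf = np.asarray(pmf, dtype=float)
    time_idx = np.asarray(time_idx)
    event = np.asarray(event).astype(bool)
    rows = np.arange(len(time_idx))
    cdf = np.cumsum(pmf, axis=1)

    # Likelihood: log P(T = t_i) for events, log S(t_i) for censored subjects.
    log_event = np.log(pmf[rows, time_idx] + eps)
    log_surv = np.log(1.0 - cdf[rows, time_idx] + eps)
    nll = -np.mean(np.where(event, log_event, log_surv))

    # Ranking: subject i (event at t_i) should have higher cumulative incidence
    # at t_i than any subject j still event-free at that time.
    own = cdf[rows, time_idx][:, None]      # F_i(t_i)
    other = cdf[:, time_idx].T              # other[i, j] = F_j(t_i)
    comparable = event[:, None] & (time_idx[:, None] < time_idx[None, :])
    penalties = np.exp(-(own - other) / sigma) * comparable
    rank = penalties.sum() / max(int(comparable.sum()), 1)

    return alpha * nll + (1.0 - alpha) * rank
```

The ranking term is what ties the loss to the concordance index: it is small exactly when earlier-injured players receive higher cumulative risk than players who remain uninjured.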
Time discretization: Injury events were mapped to daily intervals (1–7 days ahead). Each training instance used 21 days of rolling features, flattened into an 819-dimensional vector (21 days × 39 features).
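The look-back windowing can be sketched in NumPy. The function name and the day-major flattening order are assumptions for illustration; the text only fixes the window length (21 days) and the resulting dimensionality (819).

```python
import numpy as np


def make_windows(features, lookback=21):
    """Turn one player's (n_days, n_features) matrix into flattened look-back windows.

    Row i of the result covers days i .. i + lookback - 1, flattened day by day.
    """
    features = np.asarray(features, dtype=float)
    n_days, n_feat = features.shape
    if n_days < lookback:
        return np.empty((0, lookback * n_feat))
    windows = np.stack(
        [features[i:i + lookback] for i in range(n_days - lookback + 1)]
    )
    return windows.reshape(len(windows), lookback * n_feat)


# 30 days of 39 features yields 10 windows of length 21 * 39 = 819.
X = make_windows(np.zeros((30, 39)))
```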
Architecture details:
- MLP with Dense(256)→Dense(128)→Dense(64) (ReLU activations, dropout 0.3/0.3/0.2)
- Output: Dense(7), softmax for daily risk
- Optimizer: Adam with weight decay, batch size 64, early stopping on validation C-index, up to 200 epochs (convergence after roughly 45 epochs on average)
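To make the layer shapes concrete, the following framework-free NumPy sketch runs an inference-time forward pass through the listed layer sizes. It is only a shape-level illustration: dropout is omitted (it is inactive at inference), the He-style initialization is an assumption, and actual training would use a deep learning framework.

```python
import numpy as np

LAYER_SIZES = [819, 256, 128, 64, 7]  # input, three hidden layers, 7-day output


def init_params(rng):
    """Random weights with He-style scaling (an assumption, for illustration only)."""
    params = []
    for fan_in, fan_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:]):
        w = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
        params.append((w, np.zeros(fan_out)))
    return params


def forward(params, x):
    """ReLU hidden layers followed by a softmax over the 7 forecast days."""
    h = np.asarray(x, dtype=float)
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = params[-1]
    logits = h @ w + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


params = init_params(np.random.default_rng(0))
n_params = sum(w.size + b.size for w, b in params)  # 251,527 trainable parameters
```

With roughly a quarter of a million parameters and only 43 recorded injuries, regularization choices such as dropout, weight decay, and early stopping carry much of the burden of preventing overfitting.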
4. Validation, Baseline Comparisons, and Model Performance
Two principal validation regimes were implemented:
- Chronological Split: 80% of days for training, 20% (future events) for test, reflecting practical deployment.
- Leave-One-Player-Out (LOPO): Each player’s sequence withheld for testing, with models trained on the remainder.
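Both validation regimes reduce to boolean masks over the player-date observations. The sketch below is one plausible reading: it splits on unique calendar days for the chronological regime, and the function names are hypothetical.

```python
import numpy as np


def chronological_split(dates, train_frac=0.8):
    """Train on the earliest `train_frac` of unique days, test on the rest."""
    dates = np.asarray(dates)
    unique_days = np.unique(dates)  # np.unique returns sorted values
    n_train = max(1, int(len(unique_days) * train_frac))
    cutoff = unique_days[n_train - 1]
    train_mask = dates <= cutoff
    return train_mask, ~train_mask


def lopo_folds(player_ids):
    """Yield (held_out_player, train_mask, test_mask) for each player in turn."""
    player_ids = np.asarray(player_ids)
    for player in np.unique(player_ids):
        test_mask = player_ids == player
        yield player, ~test_mask, test_mask
```

The chronological split tests generalization to future days for known players, while LOPO tests generalization to a player the model has never seen, which helps explain the wider spread in LOPO scores.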
Performance Metrics
- Primary: Concordance index (C-index), measuring correct pairwise event ordering.
- Baselines: Next-day injury prediction via Random Forest, XGBoost, and Logistic Regression with grid-searched hyperparameters and oversampled minority class.
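The concordance index can be computed directly from its pairwise definition. The sketch below implements Harrell's form with a scalar risk score per observation; whether the study used this form or a time-dependent variant is not stated, so treat it as an illustration of the metric rather than the exact evaluation code.

```python
def concordance_index(times, events, risk):
    """Harrell's C: share of comparable pairs ordered correctly by predicted risk.

    A pair (i, j) is comparable if i had an observed event and times[i] < times[j].
    It is concordant if risk[i] > risk[j]; tied risks count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so a C-index is read on the same scale as an AUC even though the two measure different things.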
Summary of Model Results
| Model | F1 | Precision | Recall | AUC | Features | Look-back | Forecast |
|---|---|---|---|---|---|---|---|
| Random Forest | 0.533 | 1.000 | 0.364 | 0.779 | 19 | 1 day | 1 day |
| XGBoost | 0.429 | 1.000 | 0.273 | 0.876 | 18 | 1 day | 1 day |
| Logistic Regression | 0.071 | 0.037 | 0.833 | 0.758 | 8 | 1 day | 1 day |
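As a consistency check on the baseline table above, F1 is the harmonic mean of precision and recall, so each row's F1 can be recomputed from its other two columns. The small tolerance allows for the three-decimal rounding of the reported inputs.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the baseline table.
rows = {
    "Random Forest": (1.000, 0.364, 0.533),
    "XGBoost": (1.000, 0.273, 0.429),
    "Logistic Regression": (0.037, 0.833, 0.071),
}
max_gap = max(abs(f1_score(p, r) - f1) for p, r, f1 in rows.values())
```

The rows also show why F1 and AUC diverge here: the tree models reach perfect precision at low recall, while logistic regression recovers most injuries at the cost of very low precision.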

DeepHit C-index by validation split and imputation strategy:

| Split | Imputation | C-index |
|---|---|---|
| Chronological | Linear interp. | 0.660 |
| Chronological | Bespoke formula | 0.762 |
| LOPO (median) | Bespoke formula | 0.72 ± 0.192 |
Additional metrics (e.g., Brier score) were not reported.
5. Model Interpretability and Risk Factor Insights
Interpretation of survival outputs was achieved using the SHapley Additive exPlanations (SHAP) framework, adapted for time-to-event predictions (Wang et al., 2024). For each feature $j$, the SHAP value $\phi_j$ quantified that feature's marginal effect on predicted injury risk.
Key SHAP-derived risk factors included:
- Elevated Stress (positive SHAP value)
- Greater high-intensity running volume (sp_hir_d)
- Lower Mood and Sleep Duration
- Increased Fatigue (especially on flagged days)
- High Subjective_missingness_7d, the 7-day wellness missingness (suggests disengagement from recovery routines)
These patterns corroborate established roles of acute workload spikes, psychosocial stress, poor recovery, and cumulative injuries in elevating injury propensity.
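The idea behind SHAP values can be shown with an exact Shapley computation on a toy risk function. This is not the `shap` library and not the study's pipeline: the function `toy_risk`, its coefficients, and the baseline values are hypothetical, and exhaustive enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attributions of predict(x) relative to predict(baseline).

    Features outside a coalition are set to their baseline value.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk(z):
    """Hypothetical linear risk: rises with stress, falls with sleep hours."""
    stress, sleep = z
    return 0.1 + 0.03 * stress - 0.02 * sleep


# High stress (8 vs. 4) and short sleep (5 h vs. 7 h) both push risk upward.
phi = shapley_values(toy_risk, x=[8, 5], baseline=[4, 7])
```

The attributions sum to the difference between the prediction and the baseline prediction, which is the property that lets per-feature SHAP values be read as contributions to an individual player's predicted risk.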
6. Contributions and Empirical Significance
Application of the SoccerMon dataset provided a novel proof of concept for deep survival modeling of player injury risk (Catterall et al., 27 Jan 2026). The DeepHit architecture, leveraging rolling multivariate time series, delivered:
- Superior event discrimination (C-index = 0.762 with bespoke imputation) over conventional classifiers
- Individualized, time-varying risk estimation for practical decision support
- Actionable interpretability through feature-level SHAP values
The findings underline the critical importance of high-quality imputation preserving risk-relevant variance. The dataset’s multidimensionality and the analytic framework collectively enable practitioners to target stress, loading, and recovery in a data-driven, personalized manner.
This demonstrates the viability of the SoccerMon dataset as a benchmark for further advancement in sports injury analytics and individualized, explainable risk stratification.