SoccerMon Dataset for Injury Risk Modeling

Updated 3 February 2026

The SoccerMon dataset is a publicly available, longitudinal resource that collects daily well-being, training, GPS, and injury data from elite female footballers.
Its structure, which combines rolling workload metrics, subjective wellness, session details, and GPS data, enables detailed time-series analysis of injury risk.
Work that applies DeepHit-based survival modeling to the dataset, with SHAP for interpretability, has advanced personalized, time-varying injury forecasting in sports science.

The SoccerMon dataset is a publicly available, longitudinal resource for monitoring elite female footballers, emphasizing the collection of daily athlete well-being, training, GPS, and injury data. It has been leveraged to advance injury forecasting methodologies, notably within deep learning-based survival modelling, providing a rigorously curated platform for empirical evaluation of individualized, time-varying risk predictions in sports science research (Catterall et al., 27 Jan 2026).

1. Dataset Composition and Structure

SoccerMon captures two seasons (2020–2021) of elite Norwegian female football (Team B cohort), with the following attributes:

Participants: 37 players
Observation period: 2 seasons, 322 unique days
Player-date observations: 4,449
Injuries recorded: 43 (24 acute, 19 overuse)

It aggregates five primary feature blocks, totaling 39 variables:

Rolling workload metrics: Daily load, acute/chronic training load (ATL, CTL28, CTL42), monotony, strain, acute-to-chronic workload ratio (ACWR), and other windowed sums/ratios.
Subjective wellness: Six self-reported metrics (Fatigue, Mood, Readiness, Sleep Duration, Soreness, Stress) on a 1–10 scale, except sleep duration (hours).
Session characteristics: Rate of perceived exertion (RPE), session-RPE (sRPE), subjective and objective session durations.
GPS-derived running metrics: Mean/max/std speeds, proportional/time/distance running at intensity strata, total/speed-specific distances, distance per minute.
Engineered features: Subjective_missingness_7d (proportion missing wellness entries in prior week), past_injury_count (cumulative, per-player).

This structure permits granular, daily-level modelling of both objective exposures and subjective states over extended athletic careers.
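
The rolling workload block can be illustrated with a short pandas sketch. The source names the metrics but not their window lengths or formulas, so the definitions below are assumptions based on common practice: a 7-day mean for ATL, 28- and 42-day means for CTL28 and CTL42, monotony as weekly mean over weekly standard deviation, strain as weekly load times monotony, and ACWR as ATL over CTL28. The function name and the toy load values are hypothetical.

```python
import numpy as np
import pandas as pd

def rolling_workload(daily_load: pd.Series) -> pd.DataFrame:
    """Windowed workload metrics for one player's daily load (one row per day)."""
    week = daily_load.rolling(7, min_periods=7)
    atl = week.mean()                                   # assumed 7-day acute window
    ctl28 = daily_load.rolling(28, min_periods=28).mean()
    ctl42 = daily_load.rolling(42, min_periods=42).mean()
    monotony = atl / week.std().replace(0, np.nan)      # weekly mean / weekly SD
    strain = week.sum() * monotony                      # weekly load x monotony
    acwr = atl / ctl28.replace(0, np.nan)               # acute-to-chronic ratio
    return pd.DataFrame({"atl": atl, "ctl28": ctl28, "ctl42": ctl42,
                         "monotony": monotony, "strain": strain, "acwr": acwr})

# Hypothetical sRPE loads: the same training week repeated for 8 weeks.
load = pd.Series(np.tile([300.0, 400.0, 0.0, 500.0, 200.0, 0.0, 600.0], 8))
metrics = rolling_workload(load)
print(metrics.tail(1).round(3))
```

Because the toy week repeats exactly, the acute and chronic means coincide and ACWR settles at 1 once 28 days of history are available; earlier days are left as NaN rather than computed on partial windows.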

2. Data Preprocessing and Imputation Strategies

Preprocessing consisted of:

File consolidation: Merging session records (GPS, sRPE, injury logs) into a complete daily player-by-date matrix.
Outlier removal: Values flagged as implausible (speeds >32 km/h, session durations >200 min, daily running >16 km) were removed.
Feature derivation: Calculation of 7-day missingness and past injury counts.
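
A minimal sketch of the outlier and feature-derivation steps, under stated assumptions: the column names are hypothetical, implausible values are set to missing (the source does not say whether values or whole rows were dropped), and both derived features look only at days before the current one to avoid leaking same-day information.

```python
import numpy as np
import pandas as pd

# Plausibility ceilings from the text; column names are hypothetical.
LIMITS = {"max_speed_kmh": 32.0, "session_min": 200.0, "distance_km": 16.0}
WELLNESS = ["fatigue", "stress"]  # subset of the six wellness items, for brevity

def preprocess(daily: pd.DataFrame) -> pd.DataFrame:
    out = daily.sort_values(["player", "date"]).reset_index(drop=True)
    for col, limit in LIMITS.items():
        out.loc[out[col] > limit, col] = np.nan          # implausible -> missing
    # Share of wellness items missing per day, averaged over the prior 7 days.
    row_missing = out[WELLNESS].isna().mean(axis=1)
    out["subjective_missingness_7d"] = row_missing.groupby(out["player"]).transform(
        lambda s: s.shift(1).rolling(7, min_periods=1).mean())
    # Injuries recorded strictly before the current day, per player.
    out["past_injury_count"] = out.groupby("player")["injury"].transform(
        lambda s: s.shift(1, fill_value=0).cumsum())
    return out

daily = pd.DataFrame({
    "player": ["p1"] * 4,
    "date": pd.date_range("2020-06-01", periods=4),
    "max_speed_kmh": [28.0, 35.0, 30.0, 29.0],
    "session_min": [90.0, 90.0, 210.0, 80.0],
    "distance_km": [9.0, 10.0, 17.0, 8.0],
    "fatigue": [3.0, np.nan, 4.0, np.nan],
    "stress": [2.0, np.nan, 3.0, 5.0],
    "injury": [0, 1, 0, 0],
})
clean = preprocess(daily)
print(clean[["max_speed_kmh", "subjective_missingness_7d", "past_injury_count"]])
```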

Missing value imputation followed three strategies, applied per feature and per player:

Median Imputation: Replacement with that player’s median value.
Bespoke formula: Preserves a player’s empirical feature rank (relative to teammates over prior 14 days) for any missing day.
Linear interpolation: Temporal smoothing of missing entries within each player’s series.

Post-imputation, features with over 30% residual missingness were omitted. Distributional checks (via KDEs, univariate injury correlations) ensured imputations preserved essential statistical properties. The bespoke formula imputation best maintained signal for risk modeling, whereas linear interpolation best preserved marginal feature distributions.
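
Two of the three strategies can be sketched directly in pandas. The bespoke rank-preserving formula is not reproduced here because the source does not give its definition. Column names, helper names, and toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_median(df: pd.DataFrame, col: str) -> pd.Series:
    """Fill gaps with the same player's median for that feature."""
    return df[col].fillna(df.groupby("player")[col].transform("median"))

def impute_linear(df: pd.DataFrame, col: str) -> pd.Series:
    """Interpolate linearly within each player's own time-ordered series."""
    return df.groupby("player")[col].transform(
        lambda s: s.interpolate(method="linear"))

def keep_dense(df: pd.DataFrame, cols, max_missing: float = 0.30):
    """Keep features whose residual missingness is at most 30%."""
    return [c for c in cols if df[c].isna().mean() <= max_missing]

toy = pd.DataFrame({
    "player": ["A"] * 4 + ["B"] * 4,
    "soreness": [2.0, np.nan, 8.0, 3.0, 5.0, np.nan, np.nan, 8.0],
    "mood": [np.nan, np.nan, np.nan, np.nan, 7.0, 6.0, 7.0, 8.0],
})
by_median = impute_median(toy, "soreness")
by_linear = impute_linear(toy, "soreness")
imputed = toy.assign(soreness=by_linear, mood=impute_linear(toy, "mood"))
kept = keep_dense(imputed, ["soreness", "mood"])
print(by_median.tolist(), by_linear.tolist(), kept)
```

Player A never reported mood, so per-player imputation cannot fill that column for A; its residual missingness stays at 50% and the feature is dropped by the 30% rule.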

3. Survival Modeling with DeepHit

Time-to-injury forecasting was formulated as a discrete-time survival prediction problem:

Survival function: $S(t|X) = P(T > t|X)$
Hazard function: $h(t|X) = P(T = t|T \geq t, X) = f(t|X)/S(t-1|X)$, where $f(t|X) = P(T = t|X)$ and, in discrete time, $P(T \geq t|X) = S(t-1|X)$ with $S(0|X) = 1$
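
As a small numerical illustration of these definitions, the sketch below starts from a hypothetical 7-day event distribution $f(t|X)$ and derives the survival and hazard curves. In discrete time the at-risk probability $P(T \geq t|X)$ is the survival at the end of the previous day, $S(t-1|X)$, with $S(0|X) = 1$. The probabilities are invented for illustration.

```python
import numpy as np

# Hypothetical P(T = t | X) for t = 1..7; the remaining mass (0.80) is
# the probability of no injury within the 7-day horizon.
pmf = np.array([0.02, 0.03, 0.05, 0.04, 0.03, 0.02, 0.01])

survival = 1.0 - np.cumsum(pmf)                    # S(t) = P(T > t)
at_risk = np.concatenate(([1.0], survival[:-1]))   # P(T >= t) = S(t-1)
hazard = pmf / at_risk                             # h(t) = f(t) / S(t-1)

# Consistency check: S(t) equals the product of (1 - h(s)) for s <= t.
reconstructed = np.cumprod(1.0 - hazard)
print(np.round(hazard, 4))
print(np.allclose(reconstructed, survival))
```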

The DeepHit neural network [Lee, Yoon & van der Schaar, 2020] was employed, minimizing a convex combination of log-likelihood and pairwise ranking loss:

$L = \alpha L_{CE} + (1-\alpha) L_{rank}$

where $L_{CE}$ is cross-entropy for event/censoring at each day, and $L_{rank}$ penalizes incorrect ordering among comparable event times. $\alpha$ was set to $0.7$.
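
The combined objective can be sketched in NumPy for a single injury outcome. This is an illustration under assumptions, not the authors' implementation: uncensored players contribute $-\log f(t_i|X_i)$, censored players contribute $-\log S(t_i|X_i)$, and the ranking term uses the exponential penalty from the DeepHit paper averaged over comparable pairs. The scale `sigma`, the clipping constant `eps`, and the toy predictions are hypothetical.

```python
import numpy as np

def deephit_loss(pmf, times, events, alpha=0.7, sigma=0.1, eps=1e-8):
    """pmf: (n, K) predicted P(T = k + 1 | X); times: 1-based day of event
    or censoring; events: 1 = injury observed, 0 = censored."""
    n = pmf.shape[0]
    rows, idx = np.arange(n), times - 1
    cif = np.cumsum(pmf, axis=1)                 # F(t|X) = P(T <= t | X)
    p_event = pmf[rows, idx]                     # f(t_i | X_i)
    surv = 1.0 - cif[rows, idx]                  # S(t_i | X_i)
    l_ce = -np.mean(events * np.log(p_event + eps)
                    + (1 - events) * np.log(surv + eps))

    # Pair (i, j) is comparable if i had an event strictly before t_j.
    f_i = cif[rows, idx][:, None]                # F_i(t_i)
    f_j = cif[:, idx].T                          # entry [i, j] = F_j(t_i)
    comparable = (events[:, None] == 1) & (times[:, None] < times[None, :])
    penalty = np.exp(-(f_i - f_j) / sigma)
    l_rank = np.sum(comparable * penalty) / max(int(comparable.sum()), 1)
    return alpha * l_ce + (1 - alpha) * l_rank

times = np.array([2, 5])     # player 0 injured on day 2, player 1 censored on day 5
events = np.array([1, 0])
early = [0.10, 0.60, 0.10, 0.05, 0.05, 0.05, 0.05]
late = [0.02, 0.02, 0.02, 0.02, 0.02, 0.45, 0.45]
loss_good = deephit_loss(np.array([early, late]), times, events)
loss_bad = deephit_loss(np.array([late, early]), times, events)
print(round(loss_good, 3), loss_good < loss_bad)
```

Predictions that place early mass on the player who was injured early are rewarded by both terms, so the first loss is much smaller than the second.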

Time discretization: Injury events were mapped to daily intervals (1–7 days ahead). Each training instance used 21 days of rolling features, flattened into an 819-dimensional vector.
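
The dimensionality follows from 21 days times 39 features, which gives 819 inputs. A minimal sketch of how one training instance might be assembled is shown below. The function name and array layout are hypothetical, and treating a player with no injury in the next 7 days as censored at day 7 is an assumption, since the source does not describe its censoring rule.

```python
import numpy as np

LOOKBACK, HORIZON, N_FEATURES = 21, 7, 39

def make_instance(features, injury, t):
    """features: (n_days, 39) array; injury: (n_days,) 0/1 flags;
    t: index of the last observed day. Returns (x, time, event)."""
    window = features[t - LOOKBACK + 1 : t + 1]      # 21 most recent days
    x = window.reshape(-1)                           # 21 * 39 = 819 values
    future = injury[t + 1 : t + 1 + HORIZON]
    hits = np.flatnonzero(future)
    if hits.size:                                    # first injury in horizon
        return x, int(hits[0]) + 1, 1
    return x, HORIZON, 0                             # censored at day 7

features = np.arange(60 * N_FEATURES, dtype=float).reshape(60, N_FEATURES)
injury = np.zeros(60, dtype=int)
injury[33] = 1                                       # hypothetical injury day

x, time, event = make_instance(features, injury, t=30)
x2, time2, event2 = make_instance(features, injury, t=40)
print(x.shape, time, event, time2, event2)
```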

Architecture details:

MLP with Dense(256)→Dense(128)→Dense(64) (ReLU activations, dropout 0.3/0.3/0.2)
Output: Dense(7), softmax for daily risk
Optimizer: Adam (learning rate $1\times10^{-3}$), batch size 64, weight decay $1\times10^{-5}$, early stopping on validation C-index, up to 200 epochs (convergence after ~45 epochs on average)
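
To make the layer shapes concrete, the sketch below runs a forward pass through an MLP of the stated sizes using plain NumPy with random, untrained weights. It only illustrates dimensions, dropout placement, and the softmax output; the initialization scheme and function names are assumptions, and a real model would be trained in a deep learning framework.

```python
import numpy as np

LAYER_SIZES = [819, 256, 128, 64, 7]
DROPOUT = [0.3, 0.3, 0.2]          # one rate per hidden layer

def init_params(rng):
    """He-style random weights (an assumption) and zero biases."""
    return [(rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out)),
             np.zeros(fan_out))
            for fan_in, fan_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])]

def forward(x, params, rng=None):
    """Pass rng to enable dropout (training mode); omit it for inference."""
    h = x
    for k, (w, b) in enumerate(params[:-1]):
        h = np.maximum(h @ w + b, 0.0)               # Dense + ReLU
        if rng is not None:                          # inverted dropout
            keep = 1.0 - DROPOUT[k]
            h = h * (rng.random(h.shape) < keep) / keep
    w, b = params[-1]
    logits = h @ w + b
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)          # softmax over 7 days

rng = np.random.default_rng(0)
params = init_params(rng)
batch = rng.normal(size=(64, 819))                   # one batch of 64 windows
probs = forward(batch, params)
n_params = sum(w.size + b.size for w, b in params)
print(probs.shape, n_params)
```

With these sizes the network has 251,527 trainable parameters, most of them in the first layer.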

4. Validation, Baseline Comparisons, and Model Performance

Two principal validation regimes were implemented:

Chronological Split: 80% of days for training, 20% (future events) for test, reflecting practical deployment.
Leave-One-Player-Out (LOPO): Each player’s sequence withheld for testing, with models trained on the remainder.
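
Both regimes reduce to boolean masks over the player-date rows. The sketch below is a generic illustration with hypothetical helper names; it splits on unique calendar days so that every test day is later than every training day, and it yields one LOPO fold per player.

```python
import numpy as np

def chronological_split(dates, train_frac=0.8):
    """First train_frac of unique days for training, the rest for testing."""
    unique_days = np.sort(np.unique(dates))
    cutoff = unique_days[int(len(unique_days) * train_frac)]
    return dates < cutoff, dates >= cutoff

def lopo_folds(players):
    """Yield (held_out_player, train_mask, test_mask) for each player."""
    for p in np.unique(players):
        test = players == p
        yield p, ~test, test

# Toy panel: 3 players observed on each of 10 days (day indices 0..9).
dates = np.repeat(np.arange(10), 3)
players = np.tile(np.array(["a", "b", "c"]), 10)

train_mask, test_mask = chronological_split(dates)
folds = list(lopo_folds(players))
print(train_mask.sum(), test_mask.sum(), len(folds))
```

When instances are built from 21-day look-back windows, windows near the cutoff can overlap both sides of a chronological split, so a gap between training and test days may be worth considering.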

Performance Metrics

Primary: Concordance index ($C$-index), measuring correct pairwise event ordering.
Baselines: Next-day injury prediction via Random Forest, XGBoost, and Logistic Regression with grid-searched hyperparameters and oversampled minority class.

Summary of Model Results

Model	F1	Precision	Recall	AUC	Features	Look-back	Forecast
Random Forest	0.533	1.000	0.364	0.779	19	1 day	1 day
XGBoost	0.429	1.000	0.273	0.876	18	1 day	1 day
Logistic Regression	0.071	0.037	0.833	0.758	8	1 day	1 day

Split	Imputation	C-index
Chronological	Linear interp.	0.660
Chronological	Bespoke formula	0.762
LOPO (median)	Bespoke formula	0.72 ± 0.192

Additional metrics (e.g., Brier score) were not reported.

5. Model Interpretability and Risk Factor Insights

Interpretation of survival outputs was achieved using the SHapley Additive exPlanations (SHAP) framework, adapted for time-to-event predictions (Wang et al., 2024). For each predicted probability $p(T=t|X)$, the SHAP value $\phi_j$ quantified the marginal contribution of feature $j$ to the predicted injury risk.
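
The quantity that SHAP approximates can be computed exactly for a tiny model. The sketch below enumerates all feature subsets to obtain Shapley values for a hypothetical linear risk function of three features, replacing "absent" features with a single baseline value. The risk function, its coefficients, the baseline, and the feature values are all invented for illustration; the SHAP library uses faster approximations and background samples rather than this brute-force form.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x relative to a baseline input."""
    n = len(x)
    phi = [0.0] * n
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_j = [x[k] if (k in subset or k == j) else baseline[k]
                          for k in range(n)]
                without_j = [x[k] if k in subset else baseline[k]
                             for k in range(n)]
                phi[j] += weight * (f(with_j) - f(without_j))
    return phi

def toy_risk(z):
    """Hypothetical day-t injury probability from stress, HIR distance, sleep."""
    stress, hir_km, sleep_h = z
    return 0.01 + 0.004 * stress + 0.02 * hir_km - 0.003 * sleep_h

x = [8.0, 1.2, 5.5]            # stressed, high running load, short sleep
baseline = [4.0, 0.8, 7.5]     # hypothetical squad-average day
phi = shapley_values(toy_risk, x, baseline)
print([round(v, 4) for v in phi])
```

The values sum to the difference between the prediction at `x` and at the baseline, and short sleep receives a positive $\phi$ because it raises predicted risk relative to the baseline.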

Key SHAP-derived risk factors included:

Elevated Stress (positive $\phi$)
Greater high-intensity running volume (sp_hir_d)
Lower Mood and Sleep Duration
Increased Fatigue (especially on flagged days)
High 7-day subjective_missingness (suggests disengagement from recovery routines)

These patterns corroborate established roles of acute workload spikes, psychosocial stress, poor recovery, and cumulative injuries in elevating injury propensity.

6. Contributions and Empirical Significance

Application of the SoccerMon dataset provided a novel proof of concept for deep survival modeling of player injury risk (Catterall et al., 27 Jan 2026). The DeepHit architecture, leveraging rolling multivariate time series, delivered:

Superior event discrimination (C-index = 0.762 with bespoke imputation) over conventional classifiers
Individualized, time-varying risk estimation for practical decision support
Actionable interpretability through feature-level SHAP values

The findings underline the critical importance of high-quality imputation preserving risk-relevant variance. The dataset’s multidimensionality and the analytic framework collectively enable practitioners to target stress, loading, and recovery in a data-driven, personalized manner.

This demonstrates the viability of the SoccerMon dataset as a benchmark for further advancement in sports injury analytics and individualized, explainable risk stratification.


References

Time-to-Injury Forecasting in Elite Female Football: A DeepHit Survival Approach (2026)
