MentalRiskES 2025 Challenge
- MentalRiskES 2025 Challenge is a benchmark focused on detecting gambling disorder by analyzing temporally ordered social media posts using binary user-level classification.
- It employs a modular CPI+DMC framework that decomposes early risk detection into incremental evidence accumulation and adaptive stopping rules for timely decisions.
- Evaluation metrics such as Macro F₁, ERDE, and computational efficiency reveal the challenge of subtle lexical discrimination between high-risk and low-risk profiles.
MentalRiskES 2025 Challenge is a benchmark and evaluation campaign focused on Early Risk Detection (ERD) of behavioral and mental health disorders in Spanish-language social media streams, with a special emphasis on the detection of gambling disorder. The competition evaluates models’ capacity to classify individuals as at high or low risk of developing gambling-related pathology based on their sequence of online posts, with stringent requirements on both predictive effectiveness and timeliness, and includes detailed metrics encompassing decision latency and computational efficiency (Thompson et al., 28 Nov 2025).
1. Task Definition, Data, and Annotation
MentalRiskES 2025 Task 1 centers on binary user-level classification: given a user’s full, temporally ordered post history from either Telegram or Twitch, assign a label of “high risk” (positive) or “low risk” (negative) for gambling disorder. The PRECOM-SM corpus supports this with three splits:
| Split | Users (Pos/Neg) | Avg posts/user | Platform composition |
|---|---|---|---|
| Train | 350 (172/178) | ~64 | Telegram 105/115, Twitch 67/63 |
| Trial | 7 (3/4) | ~63 | Telegram 2/2, Twitch 1/2 |
| Test | 160 (83/77) | ~59 | Telegram 50/40, Twitch 33/37 |
Each post averages 5–7 words, with per-user history lengths ranging from 8 to 146 entries. Annotation is based on posting history: every user exhibits some gambling language, but high-risk users show evidence (per the annotation guidelines) of progression toward pathological use. This makes discrimination subtle, as lexical overlap between the positive and negative classes is extremely high (e.g., cosine TF-IDF similarity of 0.854; Jaccard index of the top 1,000 words of 0.581).
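The reported overlap statistics can be reproduced in spirit with a minimal sketch (toy data and a simplified TF-IDF, not the PRECOM-SM corpus or the organizers' exact preprocessing):

```python
from collections import Counter
import math

def tfidf_vectors(docs):
    """Build simple TF-IDF vectors: term frequency times a smoothed IDF."""
    n = len(docs)
    tfs = [Counter(doc.lower().split()) for doc in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: c * idf[t] for t, c in tf.items()} for tf in tfs]

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def jaccard_top_k(tokens_a, tokens_b, k=1000):
    """Jaccard index of the k most frequent words in each token list."""
    top_a = {w for w, _ in Counter(tokens_a).most_common(k)}
    top_b = {w for w, _ in Counter(tokens_b).most_common(k)}
    return len(top_a & top_b) / len(top_a | top_b)
```

Applied to the pooled positive- and negative-class vocabularies, values near the reported 0.85/0.58 would indicate how little surface lexicon separates the two classes.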
Official leaderboard ranking is based on Macro F₁, with supplementary metrics including accuracy, macro/micro precision and recall, early detection error (ERDE₅, ERDE₃₀), F_latency (combining speed/effectiveness), and environmental impact (energy, CO₂ emissions, inference time) (Thompson et al., 28 Nov 2025).
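The ERDE metric referenced above is commonly defined, in the standard early-risk-detection formulation, as a per-user cost that penalizes wrong decisions outright and correct positive decisions by their delay. The sketch below follows that standard formulation; the cost constants are illustrative defaults, not the challenge's official values:

```python
import math

def erde(predicted_positive, truly_positive, delay_k, o,
         c_fp=0.1, c_fn=1.0, c_tp=1.0):
    """Per-user ERDE_o cost: false positives cost c_fp, false negatives
    c_fn, true negatives 0, and true positives pay a latency cost that
    grows sigmoidally with the decision delay k around the deadline o."""
    if predicted_positive and not truly_positive:
        return c_fp
    if not predicted_positive and truly_positive:
        return c_fn
    if predicted_positive and truly_positive:
        latency_cost = 1.0 - 1.0 / (1.0 + math.exp(delay_k - o))
        return latency_cost * c_tp
    return 0.0  # true negative
```

With o = 5 versus o = 30, the same late true positive is penalized very differently, which is why ERDE₅ and ERDE₃₀ rank systems differently in the leaderboard above.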
2. Modeling Framework: CPI+DMC Decomposition
All top-performing approaches decompose ERD into two subproblems, formalized in the CPI+DMC framework:
- Classification with Partial Information (CPI): At each round t, after the arrival of the user's t-th post, the model generates a risk estimate, incrementally accumulating evidence.
- Deciding the Moment of Classification (DMC): A stopping procedure determines, as early as possible, when to commit to a high- or low-risk label, trading off between accuracy and timeliness.
Key CPI scoring mechanisms include:
- For interpretable models (SS3): maintain cumulative confidence scores for the positive and negative classes at round t, updated with each incoming post and normalized via softmax.
- For deep models (BETO/BERT, SBERT): predict a risk probability at each round using a sliding window of recent posts.
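A minimal sketch of the SS3-style cumulative scoring with softmax normalization; the per-post confidence pairs here are placeholders for the model's actual term-level scores:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cumulative_risk(post_scores):
    """post_scores: per-post (positive, negative) confidence pairs.
    Returns P(high risk) after each round, computed by softmax over
    the running class sums."""
    cum_pos = cum_neg = 0.0
    risks = []
    for pos, neg in post_scores:
        cum_pos += pos
        cum_neg += neg
        risks.append(softmax([cum_pos, cum_neg])[0])
    return risks
```

Each additional positive-leaning post pushes the normalized risk monotonically toward 1, which is the evidence-accumulation behavior the DMC rules then threshold.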
DMC rules:
- Global thresholding (SS3): Trigger a positive decision when the cumulative positive score exceeds a global threshold τ (median + MAD over pooled user scores).
- History-based (BERT/SBERT): Declare positive once n rounds have produced a probability above a confidence threshold γ.
This modular separation allows independent optimization of effectiveness (CPI) and early-detection latency (DMC) (Thompson et al., 28 Nov 2025).
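This modular separation can be expressed as a small generic loop, with the CPI scorer and DMC rule supplied as pluggable functions; the toy scorer and rule below are purely illustrative, not any team's implementation:

```python
def early_risk_detect(posts, cpi_score, dmc_stop, default_label=0):
    """Generic CPI+DMC loop: score each incoming post (CPI), then ask
    the stopping rule (DMC) whether to commit to a label now.
    Returns (label, round_at_decision)."""
    state = None
    for t, post in enumerate(posts, start=1):
        state, risk = cpi_score(state, post)   # CPI: incremental evidence
        label = dmc_stop(risk, t)              # DMC: stop (label) or continue (None)
        if label is not None:
            return label, t
    return default_label, len(posts)

# Toy CPI scorer and DMC rule, for illustration only.
def count_scorer(state, post):
    score = (state or 0) + post.count("bet")
    return score, score

def stop_at_two(risk, t):
    return 1 if risk >= 2 else None
```

For example, `early_risk_detect(["hi", "bet now", "bet again"], count_scorer, stop_at_two)` commits to a positive label at round 3, once two evidence hits have accumulated.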
3. Model Architectures and Training Strategies
Three primary approaches were evaluated:
UNSL#0 (SS3 + Global Policy):
- Interpretable classifier based on character trigram statistics.
- Calibrated via grid search over its hyperparameters.
- Decision threshold adapts dynamically to the user batch via the median + MAD rule.
- Yields fully transparent, term-level explainability.
UNSL#1 (BERT + Extended Vocabulary + History Rule):
- Utilizes BETO (bert-base-spanish-wwm-uncased), fine-tuned with AdamW and linear warmup (batch size 32, 10 epochs).
- Domain tokens (25, e.g., “rebote”, “BingX”) identified by SS3’s top scoring features are introduced into the tokenizer/embedding layer.
- History-based DMC stopping rule (count of confident rounds).
UNSL#2 (SBERT/SetFit + History Rule):
- Encoder: sentence_similarity_spanish_es.
- Fine-tuned using SetFit’s contrastive objective, optimizing cosine similarity over positive/negative pairs.
- Logistic regression atop frozen embeddings.
- Single-epoch training (batch size 16, 20 iterations).
- DMC: history-based stopping rule, as for UNSL#1.
All variants operate under a common pipeline: for each new post, compute the cumulative or windowed risk, apply the DMC stopping rule, and, if not triggered, continue processing subsequent posts.
4. Decision Policies: Thresholds and Dynamic Stopping
The two principal stopping strategies are:
- Global Median + MAD (used by SS3):
- Scores across all users are pooled to compute a threshold τ = median + MAD.
- If a user's cumulative score exceeds τ at any round, declare high risk.
- History-Based Rule (for transformer variants):
- At each prediction round, maintain a counter of rounds whose predicted probability meets or exceeds the confidence threshold γ.
- If the counter reaches n, issue a high-risk decision; otherwise, continue.
- If the stream ends before the counter reaches n, designate the user low risk.
This duality enables flexible calibration according to the distinct precision-recall, delay, and interpretability profiles of each model family (Thompson et al., 28 Nov 2025).
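Both stopping policies can be sketched directly; the γ, n, and MAD scale values below are illustrative placeholders, not the calibrated values from the paper:

```python
import statistics

def global_threshold(pooled_scores, scale=1.0):
    """SS3-style global threshold: median + scale * MAD over the scores
    pooled across all users in the batch."""
    med = statistics.median(pooled_scores)
    mad = statistics.median(abs(s - med) for s in pooled_scores)
    return med + scale * mad

def history_rule(probs, gamma=0.5, n=3):
    """Transformer-style rule: declare positive once n rounds reach a
    probability >= gamma; declare negative if the stream ends first.
    Returns (label, decision_round)."""
    hits = 0
    for t, p in enumerate(probs, start=1):
        if p >= gamma:
            hits += 1
            if hits >= n:
                return 1, t
    return 0, len(probs)
```

The median + MAD construction makes the SS3 threshold robust to a few extreme scorers in the batch, while the history rule trades a small delay (waiting for n confident rounds) for resistance to one-off spikes in the per-round probability.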
5. Evaluation Results, Analysis, and Error Drivers
Results from the official leaderboard (38 teams) are summarized:
| Model | Acc | Macro P | Macro R | Macro F₁ | ERDE₅ | ERDE₃₀ | F_latency |
|---|---|---|---|---|---|---|---|
| UNSL#2 | 0.569 | 0.568 | 0.567 | 0.567 | 0.639 | 0.389 | 0.506 |
| UNSL#0 | 0.581 | 0.586 | 0.574 | 0.563 | 0.515 | 0.284 | 0.628 |
- UNSL#2 (SBERT) led with Macro F₁ 0.567, best balance of precision/recall, and competitive F_latency.
- UNSL#0 (SS3) was second (Macro F₁ 0.563), highest accuracy, and lowest ERDE₃₀ (early-detection penalty).
- UNSL#1 (BERT+EXT) placed 16th (Macro F₁ 0.444); performance exceeded the dataset mean (0.426).
All models achieved low computational and carbon footprints per test case: mean inference time of 2.3 s, with small energy consumption and CO₂ emissions (reported in kWh and kgCO₂eq, respectively).
Error analysis:
- SS3 favored recall (high true positive rate at the cost of more false positives).
- SBERT was more conservative, yielding fewer false positives and balanced sensitivity/specificity.
- Venn analysis: 57 users labeled positive by all three models (35 correct), with 22 common false alarms, mostly associated with ambiguous content types (e.g., crypto trading, sports betting), reinforcing the challenge of class boundary granularity in the corpus (Thompson et al., 28 Nov 2025).
6. Challenge Dynamics, Limitations, and Future Directions
Task difficulty derives from strong lexical/semantic overlap between the two risk classes; negatives are not “risk free,” but rather sub-threshold for pathological gambling. This reframes early detection from “onset prediction” to “critical threshold crossing.” This ambiguity may disadvantage methods under latency-sensitive metrics (ERDEθ), suggesting that user-specific or adaptive thresholding would be more equitable.
The critical advance over prior MentalRiskES editions is the formal enforcement of both effectiveness and latency via modular CPI+DMC decomposition, with comparable frameworks used in prior years for other mental disorders (Thompson et al., 2023). Transparent models (e.g., SS3) allow extraction of term-based explanations. SBERT, via clustering, exposes interpretable semantic groupings with cosine similarity above 0.72 in positive-class sentences.
Proposed avenues for progress include integration of expert-driven reasoning in LLMs for explainability, user-dynamic ERDE thresholds, joint optimization for latency and effectiveness, and further enrichment—and perhaps relabeling—of the corpus to better circumscribe class boundaries. Expansion to multimodal features (images, networks), adaptive stopping criteria (Bayesian or reinforcement learning–based), and more granular or continuous risk scales (as explored in MentalRiskES depression subchallenges (Viloria et al., 2023)) are expected to yield further advances.
In summary, MentalRiskES 2025 demonstrates the feasibility and ongoing challenges of ERD for gambling disorder in Spanish social media, with modeling insights and evaluation metrics likely to influence future studies across digital mental health detection domains (Thompson et al., 28 Nov 2025, Thompson et al., 2023, Viloria et al., 2023).