
PolyHope-M 2025 Benchmark

Updated 3 January 2026
  • PolyHope-M 2025 benchmark is a dual-domain standard that targets both multilingual hope speech detection and M dwarf abundance calibration.
  • It employs advanced transformer architectures with language-specific encoders to address challenges in low-resource and high-resource settings.
  • The benchmark establishes robust evaluation protocols and precise stellar standards to support reproducible research in NLP and astrophysics.

The PolyHope-M 2025 Benchmark is a multilingual, multi-level standard for hope speech detection and M-dwarf abundance calibration. The benchmark simultaneously addresses two distinct domains: the development and evaluation of algorithms for fine-grained detection of hopeful discourse in low- and high-resource languages, and the establishment of precise physical and abundance references for M-class dwarf calibration. Its construction integrates advanced transformer architectures, carefully stratified datasets, and meticulously characterized stellar standards—providing a robust testbed for research in both NLP and stellar spectroscopy.

1. Multilingual Hope Speech Detection: Corpus, Tasks, and Motivation

The PolyHope-M 2025 corpus is designed to overcome the scarcity of resources for hope speech, especially in low-resource languages such as Urdu. The dataset comprises social-media text in four languages: Urdu (Indo-Aryan, Perso-Arabic script), English (Germanic, Latin script), German (Germanic, Latin script), and Spanish (Romance, Latin script) (Abdullah et al., 27 Dec 2025, Abiola et al., 30 Sep 2025). Each example in the corpus is manually annotated using clear guidelines distinguishing explicit or implicit future-oriented positivity, resulting in two main labeling regimes:

  • Binary Task: Classifies input as “Hope” (1) or “Not Hope” (0).
  • Multi-class Task: Four mutually exclusive labels—Generalized Hope (GH), Realistic Hope (RH), Unrealistic Hope (UH), Not Hope (NH).
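
The relation between the two labeling regimes can be sketched in a few lines; the label names follow the taxonomy above, while the integer encoding is illustrative:

```python
# Four-way taxonomy from the multi-class task.
MULTICLASS_LABELS = ["GH", "RH", "UH", "NH"]

def to_binary(multiclass_label: str) -> int:
    """Collapse the multi-class scheme into the binary task:
    any hope subtype (GH, RH, UH) maps to "Hope" (1), NH to "Not Hope" (0)."""
    return 0 if multiclass_label == "NH" else 1
```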

The class distribution is highly skewed, with “Not Hope” comprising 50–60% of examples; “Generalized Hope” (25–30%), “Realistic Hope” (8–12%), and “Unrealistic Hope” (5–8%) are consistently underrepresented across all languages, posing a challenge for both model training and evaluation.
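
A common mitigation for this skew, and the inverse-frequency weighting used later in the evaluation protocol, can be sketched as follows (an illustrative stdlib implementation, not the benchmark's exact scheme):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weights proportional to 1 / class frequency, normalized
    so the mean weight is 1. Rare classes (e.g., UH) receive larger
    weights than the dominant "Not Hope" class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    raw = {c: n / count for c, count in counts.items()}
    scale = k / sum(raw.values())  # normalize mean weight to 1
    return {c: w * scale for c, w in raw.items()}
```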

Dataset splits are stratified as 70% train, 15% development (validation), and 15% test (Abdullah et al., 27 Dec 2025), with the test set exceeding 12,000 posts per language in all four languages (Abiola et al., 30 Sep 2025). Preprocessing and tokenization are language-specific, emphasizing Unicode normalization, accent and diacritic handling, and transformer-compatible tokenizers (XLM-RoBERTa-base), with a maximum sequence length of 128 tokens.
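
The stratified 70/15/15 split described above can be sketched with the standard library; this is an illustrative implementation of the protocol, not the benchmark's released code:

```python
import random
from collections import defaultdict

def stratified_split(examples, label_fn, ratios=(0.70, 0.15, 0.15), seed=42):
    """Shuffle within each label group, then cut each group 70/15/15 so
    every split preserves the (skewed) class distribution."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[label_fn(ex)].append(ex)
    train, dev, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n = len(group)
        a = int(n * ratios[0])
        b = a + int(n * ratios[1])
        train += group[:a]
        dev += group[a:b]
        test += group[b:]
    return train, dev, test
```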

2. Model Architectures and Low-Resource Adaptation

State-of-the-art transformer backbones form the core of the PolyHope-M 2025 benchmark’s modeling pipeline. The principal model is XLM-RoBERTa-base (270M parameters), which provides multilingual contextual embeddings. For additional morphological and script sensitivity, language-specific encoders are employed: UrduBERT (110M), RoBERTa-base (125M) for English, and EuroBERT (125M) for German and Spanish. mBERT (110M) serves a comparative role (Abdullah et al., 27 Dec 2025).

Monolingual models excel only in high-resource languages (e.g., RoBERTa-base on English), while cross-lingual transfer in XLM-RoBERTa drives substantial performance gains in lower-resource settings. Hyperparameter tuning (Optuna, 30 trials) explores learning rates (5×10⁻⁶ to 5×10⁻⁵), batch sizes (4–16), warm-up ratios (0.0–0.3), weight decay (0.0–0.1), and dropout (0.1–0.3), with early stopping on dev F₁ and classification thresholds adjusted within [0.3, 0.8].
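
The reported search space can be reproduced as a plain random sampler (a stand-in sketch; the benchmark uses Optuna's trial API, and the log-scale bounds below only approximate the 5×10⁻⁶ to 5×10⁻⁵ learning-rate range):

```python
import random

def sample_trial(rng):
    """Draw one hyperparameter configuration from the ranges reported above."""
    return {
        "learning_rate": 10 ** rng.uniform(-5.3, -4.3),  # ~5e-6 .. 5e-5, log scale
        "batch_size": rng.choice([4, 8, 16]),
        "warmup_ratio": rng.uniform(0.0, 0.3),
        "weight_decay": rng.uniform(0.0, 0.1),
        "dropout": rng.uniform(0.1, 0.3),
        "threshold": rng.uniform(0.3, 0.8),  # decision threshold, tuned on dev F1
    }

trials = [sample_trial(random.Random(seed)) for seed in range(30)]  # 30 trials
```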

To address class imbalance, loss re-weighting multiplies the hope class cross-entropy by 1.5. For low-resource Urdu, preprocessing includes de-diacritization and subword granularity tuning for Perso-Arabic script; for European languages, accent-aware tokenization prevents splitting on diacritics (Abdullah et al., 27 Dec 2025).
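
The 1.5× re-weighting of the hope class can be illustrated with a minimal binary cross-entropy (a sketch of the re-weighting idea, not the benchmark's training code):

```python
import math

def weighted_binary_ce(p_hope, y, hope_weight=1.5):
    """Binary cross-entropy with the hope-class term scaled by 1.5.
    p_hope is the predicted P(Hope); y is 1 for Hope, 0 for Not Hope."""
    if y == 1:
        return -hope_weight * math.log(p_hope)
    return -math.log(1.0 - p_hope)
```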

3. Evaluation Metrics and Protocol

Performance in PolyHope-M 2025 is primarily assessed via the macro-averaged F₁-score across all classes. For class $i$, given true positives ($TP_i$), false positives ($FP_i$), and false negatives ($FN_i$), the metric suite is as follows:

  • Precision: $P_i = \frac{TP_i}{TP_i + FP_i}$
  • Recall: $R_i = \frac{TP_i}{TP_i + FN_i}$
  • F₁-score: $F_{1,i} = 2\,\frac{P_i \times R_i}{P_i + R_i}$

Macro-F₁ is the unweighted mean of the per-class F₁; weighted-F₁ uses class support as weights (for training monitoring) (Abdullah et al., 27 Dec 2025, Abiola et al., 30 Sep 2025). Secondary metrics include overall accuracy, per-class precision and recall, and micro-F₁. The evaluation protocol handles class imbalance by applying inverse-frequency loss weighting and, in some experiments, selection of ambiguous examples using uncertainty sampling (entropy of the model output) to enrich the training set (Abiola et al., 30 Sep 2025).
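The per-class definitions and both averaging schemes can be computed directly from confusion counts (an illustrative stdlib implementation):

```python
def f1_scores(counts):
    """counts: {class: (TP, FP, FN)}. Returns (macro_f1, weighted_f1).
    Macro-F1 is the unweighted mean of per-class F1; weighted-F1 uses
    class support (TP + FN) as weights, matching the definitions above."""
    per_class, supports = {}, {}
    for c, (tp, fp, fn) in counts.items():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * p * r / (p + r) if p + r else 0.0
        supports[c] = tp + fn
    macro = sum(per_class.values()) / len(per_class)
    total = sum(supports.values())
    weighted = sum(per_class[c] * supports[c] for c in per_class) / total
    return macro, weighted
```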

4. Experimental Results

Experimental results on PolyHope-M 2025 establish XLM-RoBERTa with language-specific encoders as the leading approach for both binary and multi-class tasks (Abdullah et al., 27 Dec 2025, Abiola et al., 30 Sep 2025). Key results:

| Language | Binary F₁ (Macro) | Multi-class F₁ (Macro, XLM-R)* |
|----------|-------------------|--------------------------------|
| Urdu | 95.0% (XLM-R + UrduBERT) | 65.2% |
| English | 86.3% (XLM-R) | 71.0% |
| German | 87.4% (XLM-R + EuroBERT) | 70.1% |
| Spanish | 85.0% (XLM-R + EuroBERT) | 68.5% |

*Best-reported multi-class macro-F₁.

Binary-classification performance is near parity between low- and high-resource languages when multilingual architectures are used, while multi-class performance is substantially lower, particularly for the rare classes (UH, RH). Monolingual encoders underperform in non-English languages, confirming the necessity of multilingual pretraining. Error analysis reveals persistent confusion between “Realistic” and “Unrealistic” hope, especially in low-resource contexts. No formal statistical-significance analysis is reported, but threshold variation produced <0.5% change in F₁ and stable model rankings (Abdullah et al., 27 Dec 2025).

When compared against earlier approaches, single-model XLM-RoBERTa achieves or exceeds ensemble-level macro-F₁ (≈0.78 vs. ≈0.78, García-Baena et al. 2024), and substantially surpasses traditional feature-based and BERT-transfer baselines by over 10 percentage points (Abiola et al., 30 Sep 2025).

5. M Dwarf Stellar Abundance Benchmarking

Parallel to the hope-speech NLP focus, PolyHope-M 2025 incorporates a benchmark sample of nine nearby M dwarfs characterized by high-precision interferometric parameters, spanning effective temperature ($3224 \leq T_{\rm eff} \leq 3692$ K), surface gravity ($4.72 \leq \log g \leq 5.06$), and metallicity ($-0.58 \leq \mathrm{[Fe/H]} \leq +0.22$) (Olander et al., 11 May 2025).

Spectra are acquired with GIANO-B ($R = 50{,}000$; $0.9$–$2.45\ \mu$m), processed with REDUCE, and modeled using Turbospectrum with MARCS 1D LTE model atmospheres. Abundances of Fe, Ti, and Ca are determined via a line-by-line differential analysis relative to the solar FTS atlas, with the final abundance of element $X$ defined as $\mathrm{[X/H]} = \mathrm{median}_i(\Delta A_i)$, where $\Delta A_i = A_{\star,i} - A_{\odot,i}$ for each line $i$. Per-star precision is typically 0.10–0.25 dex.
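
The median-of-differences estimator defined above can be written directly (a minimal sketch; line selection and per-line fitting are of course the hard part in practice):

```python
from statistics import median

def differential_abundance(star_lines, solar_lines):
    """Line-by-line differential abundance: [X/H] = median_i(A_star,i - A_sun,i)
    over lines i measured in both the stellar and solar (FTS atlas) spectra.
    The median makes the estimate robust to outlier lines."""
    deltas = [a_star - a_sun for a_star, a_sun in zip(star_lines, solar_lines)]
    return median(deltas)
```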

The nine-star sample and their measured parameters cover early- to mid-M dwarfs for calibration, though limitations include fringing residuals, some unmodeled non-LTE effects (notably for Ca I and Ti I), and the small sample size. The comparative analysis places these benchmark stars at the $\sim 0.1$–$0.2$ dex precision level for $T_{\rm eff}$, $\log g$, [Fe/H], and [$\alpha$/H], facilitating their use as reference standards for future large-scale abundance surveys (Olander et al., 11 May 2025).

6. Insights, Limitations, and Recommendations for Future Directions

The PolyHope-M 2025 benchmark demonstrates that large-scale multilingual transformer models, when coupled with appropriate language-specific adaptations, can achieve F₁-scores for hope speech detection in low-resource languages that approach high-resource settings (Abdullah et al., 27 Dec 2025). Cross-lingual sharing in XLM-RoBERTa markedly boosts the recall of rare classes in underrepresented languages, whilst language-specific encoders further improve morphological handling.

Remaining challenges include the persistent class imbalance, difficult distinctions between hope subtypes (“Realistic” vs. “Unrealistic”), and region-specific idiomatic usage, especially in morphologically rich or code-switched contexts. In the stellar calibration domain, M-dwarf benchmarks offer solid reference points for abundance work but are limited by potential non-LTE effects and sample size.

Recommendations for extension include: expanding hope corpora to additional low-resource languages (e.g., Punjabi, Seraiki, Sindhi), domain-specific pretraining (e.g., social-media text), parameter-efficient adaptation (LoRA, adapters), and statistical rigor in model comparisons (bootstrap resampling). For M-dwarfs, further work should address non-LTE corrections and increased sample diversity (Abdullah et al., 27 Dec 2025, Olander et al., 11 May 2025).

7. Significance and Research Applications

By providing well-structured datasets, precise labeling, and high-quality reference standards, PolyHope-M 2025 anchors reproducible research in two disparate but critically relevant areas of NLP and stellar astrophysics. It provides a rigorous evaluation platform for multilingual hope-speech models and sets a calibration baseline for stellar elemental abundances. Its methodology underscores the importance of combining transformer-based cross-lingual LLMs with domain adaptation and high-fidelity benchmarks.
