
Stress-SMHD Corpus for Stress Detection

Updated 5 January 2026
  • Stress-SMHD is a large-scale dataset defined by explicit self-reports of depression, anxiety, and PTSD in dedicated clinical subreddits.
  • It employs automated regex extraction and continual masked language model pretraining (via RoBERTa) to enhance stress detection performance.
  • The corpus offers significant gains in stress-detection F1 scores over broader mental health datasets, despite limitations like absent demographic metadata.

The Stress-SMHD corpus is a large-scale, high-precision dataset of Reddit posts authored by users with self-acknowledged clinical diagnoses of depression, anxiety, and post-traumatic stress disorder (PTSD). It serves as a domain-adaptation resource for NLP models tasked with detection of stress-related language in social media, leveraging the extensive linguistic overlap and clinical comorbidity between these mental health conditions and chronic stress. Stress-SMHD differs from other mental-health datasets by focusing exclusively on explicit diagnostic self-reports within dedicated clinical subreddits, yielding a corpus that is both thematically focused and linguistically representative of chronic stress contexts in online discourse (Alqahtani et al., 29 Dec 2025).

1. Origin, Construction, and Scope

Stress-SMHD is a subset of the SMHD (Self-Reported Mental Health Diagnoses) corpus first developed by Cohan et al. (2018). The selection procedure mines r/depression, r/Anxiety, and r/ptsd for first-person diagnostic statements such as “I was diagnosed with depression.” Only users who produce such explicit self-reports are included, and their posts across these three subreddits form the corpus. Posts in all other subreddits and posts by non-diagnosed users, which serve as controls at the SMHD level, are omitted so that Stress-SMHD contains only text associated with the three target conditions.

Corpus statistics:

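| Condition  | # Posts   | # Tokens (M) | Proportion (%) |
|------------|-----------|--------------|----------------|
| Depression | 1,270,000 | 57.4         | 53             |
| Anxiety    | 800,000   | 36.9         | 34             |
| PTSD       | 260,000   | 13.7         | 13             |
| Total      | 2,325,000 | 108.0        | 100            |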
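As a quick consistency check on the statistics above, the following sketch recomputes the shares from the reported figures. It suggests that the proportion column corresponds to token counts rather than post counts, and that the per-condition post counts sum to 2,330,000, slightly above the reported total of 2,325,000, presumably because the per-condition figures are rounded. Both readings are inferences from the numbers, not statements made by the source.

```python
# Figures copied from the corpus statistics; nothing here is new data.
posts = {"Depression": 1_270_000, "Anxiety": 800_000, "PTSD": 260_000}
tokens_m = {"Depression": 57.4, "Anxiety": 36.9, "PTSD": 13.7}

# Share of each condition by token count, rounded to whole percent.
total_tokens_m = sum(tokens_m.values())
token_share = {
    name: round(100 * count / total_tokens_m) for name, count in tokens_m.items()
}

# Sum of the per-condition post counts, for comparison with the reported total.
post_sum = sum(posts.values())

print(token_share)
print(round(total_tokens_m, 1))
print(post_sum)
```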

No demographic metadata (age, gender, socioeconomic status) is included, and the corpus is strictly limited to English-language Reddit posts (Alqahtani et al., 29 Dec 2025). Diagnostic statements are automatically identified using regular-expression patterns over both the titles and bodies of posts. When such a sentence is detected, it is removed or replaced with a placeholder to guard against label leakage. The resulting unlabeled corpus is used in continual pretraining for downstream stress-detection tasks.
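The extraction and masking step described above can be illustrated with a minimal sketch. The source does not publish its regular expressions, so the pattern, the sentence splitter, and the `[DIAGNOSIS]` placeholder below are all hypothetical stand-ins chosen for illustration.

```python
import re
from typing import Tuple

# Hypothetical pattern: the actual expressions used for Stress-SMHD are not given.
CONDITIONS = r"(?:depression|anxiety|ptsd|post[- ]traumatic stress disorder)"
DIAGNOSIS_RE = re.compile(
    r"\bI\s+(?:was|am|have\s+been|got)\s+(?:recently\s+|officially\s+)?"
    r"diagnosed\s+with\s+" + CONDITIONS,
    re.IGNORECASE,
)

# Naive sentence splitter (assumption): break after ., ! or ? followed by whitespace.
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")

PLACEHOLDER = "[DIAGNOSIS]"  # assumed placeholder token


def mask_diagnostic_sentences(text: str) -> Tuple[str, bool]:
    """Replace sentences containing a self-reported diagnosis with a placeholder.

    Returns the masked text and a flag saying whether any diagnosis was found.
    """
    found = False
    kept = []
    for sentence in SENTENCE_SPLIT_RE.split(text.strip()):
        if DIAGNOSIS_RE.search(sentence):
            found = True
            kept.append(PLACEHOLDER)
        else:
            kept.append(sentence)
    return " ".join(kept), found


masked, found = mask_diagnostic_sentences(
    "I was diagnosed with depression last year. Work has been overwhelming lately."
)
print(masked, found)
```

Masking at the sentence level, rather than only the matched phrase, removes surrounding cues (dates, clinicians, medications) that could otherwise leak the label, at the cost of discarding some context.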

2. Text Processing, Annotation, and Data Properties

Stress-SMHD follows the Byte-Level BPE tokenization and sentence-splitting pipeline used in RoBERTa pretraining; documents exceeding 512 tokens are truncated, and shorter sequences are padded. No additional manual annotation of stress content or linguistic characteristics is performed. The corpus retains all content from included users (minus explicit diagnostic sentences), maximizing the volume and contextual diversity of stress-related discourse.
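The truncation and padding rule can be sketched on lists of token ids. This is a simplified illustration, not the authors' pipeline: it assumes the Hugging Face RoBERTa vocabulary, in which the padding token has id 1, and it ignores the handling of the `<s>` and `</s>` boundary tokens that a real tokenizer preserves when truncating.

```python
from typing import List, Tuple

MAX_LEN = 512  # sequence length stated in the source
PAD_ID = 1     # <pad> id in the Hugging Face RoBERTa vocabulary (assumed tokenizer)


def truncate_and_pad(
    token_ids: List[int], max_len: int = MAX_LEN, pad_id: int = PAD_ID
) -> Tuple[List[int], List[int]]:
    """Cut a sequence to max_len tokens or right-pad it up to max_len.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens, 0 for padding.
    """
    ids = token_ids[:max_len]
    n_pad = max_len - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask


short_ids, short_mask = truncate_and_pad([0, 100, 200, 2], max_len=6)
long_ids, long_mask = truncate_and_pad(list(range(600)))
print(short_ids, short_mask)
print(len(long_ids), sum(long_mask))
```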

Crucially, Stress-SMHD does not contain explicit stress labels for individual documents—labels are implicit via the diagnostic status of the author. This distinguishes it from benchmark datasets such as SMM4H 2022 Task 8 or Dreaddit, where stress disclosures are hand-annotated (Alqahtani et al., 29 Dec 2025).

3. Applications in Representation Learning and Stress Detection

The principal function of Stress-SMHD to date has been as an adaptation resource in transfer learning for stress detection. In StressRoBERTa, the corpus is used for continual masked language model (MLM) pretraining of RoBERTa-base, preserving the original MLM objective:

$$\mathcal{L}_{MLM}(\theta) = -\,\frac{1}{|M|}\sum_{i \in M} \log P_\theta\bigl(x_i \mid x_{\smallsetminus M}\bigr)$$

where $x$ is the input sequence, $M$ denotes the set of masked positions, and $P_\theta$ is the model’s predicted token probability.
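A small worked example may make the objective concrete. The sketch below evaluates the loss for two masked positions with made-up probabilities; the function name and the numbers are illustrative assumptions, not values from the source.

```python
import math
from typing import Dict


def mlm_loss(masked_token_probs: Dict[int, float]) -> float:
    """Mean negative log-likelihood over masked positions.

    masked_token_probs maps each masked position i in M to the probability the
    model assigns to the true token x_i given the unmasked context.
    """
    if not masked_token_probs:
        raise ValueError("at least one masked position is required")
    total_log_prob = sum(math.log(p) for p in masked_token_probs.values())
    return -total_log_prob / len(masked_token_probs)


# Hypothetical example: positions 2 and 7 are masked, and the model assigns
# probability 0.5 and 0.25 to the correct tokens.
loss = mlm_loss({2: 0.5, 7: 0.25})
print(loss)
```

With these values the loss is $-(\ln 0.5 + \ln 0.25)/2 = 1.5 \ln 2 \approx 1.04$ nats; a model that predicted both tokens with certainty would reach a loss of 0.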

Hyperparameters for adaptation include 5 epochs over all 108 million tokens, dynamic masking, AdamW optimization (weight decay 0.01), learning rate $2 \times 10^{-5}$, batch size 16, and sequence length 512. The final domain-adapted model achieves a perplexity of 5.22 on Stress-SMHD, indicating effective adaptation (Alqahtani et al., 29 Dec 2025).
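To give a sense of scale, the following sketch collects the reported hyperparameters and derives two rough estimates of the number of optimizer steps, plus the mean loss implied by the reported perplexity. The estimates rest on assumptions the source does not confirm: a single device with no gradient accumulation, either full packing of tokens into 512-token sequences or one post per sequence, and perplexity computed as the exponential of the mean masked-token loss in nats.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptationConfig:
    # Values reported for the continual pretraining run and the corpus.
    epochs: int = 5
    total_tokens: int = 108_000_000
    total_posts: int = 2_325_000
    learning_rate: float = 2e-5
    weight_decay: float = 0.01
    batch_size: int = 16
    max_seq_len: int = 512


def steps_if_packed(cfg: AdaptationConfig) -> int:
    """Steps if tokens are packed into full-length sequences (assumption)."""
    sequences = math.ceil(cfg.total_tokens / cfg.max_seq_len)
    return math.ceil(sequences / cfg.batch_size) * cfg.epochs


def steps_if_one_post_per_sequence(cfg: AdaptationConfig) -> int:
    """Steps if every post is its own padded or truncated sequence (assumption)."""
    return math.ceil(cfg.total_posts / cfg.batch_size) * cfg.epochs


def loss_from_perplexity(perplexity: float) -> float:
    """Mean loss in nats, assuming perplexity = exp(mean loss)."""
    return math.log(perplexity)


cfg = AdaptationConfig()
packed_steps = steps_if_packed(cfg)
per_post_steps = steps_if_one_post_per_sequence(cfg)
implied_loss = loss_from_perplexity(5.22)
print(packed_steps, per_post_steps, round(implied_loss, 2))
```

Under these assumptions the run would take between roughly 66 thousand and 727 thousand steps, and a perplexity of 5.22 corresponds to a mean masked-token loss of about 1.65 nats.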

When fine-tuned on downstream stress-detection benchmarks (SMM4H 2022 Task 8 and Dreaddit), StressRoBERTa consistently outperforms both vanilla RoBERTa and MentalRoBERTa—models adapted on broader, less diagnosis-specific Reddit corpora—by approximately 1 F1 point, and surpasses the best shared-task system on SMM4H by 3 F1 points (82% vs. 79%) (Alqahtani et al., 29 Dec 2025).
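For readers less familiar with the metric, the sketch below computes F1 from confusion counts. The counts are hypothetical and chosen only so that the result equals 0.82; the source does not report confusion matrices, nor does this summary state whether the reported scores are positive-class or macro-averaged F1.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive (stress) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 82 true positives, 18 false positives, 18 false negatives.
example_f1 = f1_score(tp=82, fp=18, fn=18)
print(round(example_f1, 2))
```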

4. Theoretical Rationale and Comparative Analysis

The selection of depression, anxiety, and PTSD as inclusion criteria is motivated by well-documented clinical comorbidity: 60–80% of individuals with depression and 50–70% with anxiety meet criteria for chronic stress, with substantial overlap present for PTSD as well (Kessler, 2013; Mazure, 1998). Linguistic studies (De Choudhury et al., 2013; Guntuku et al., 2017) confirm that characteristic stress language—e.g., elevated use of first-person singular pronouns, negative-emotion lexicon, and present-tense verbs—pervades these diagnosis-specific subreddits.
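One of the markers mentioned above, first-person singular pronoun use, is simple to measure. The following sketch is an illustrative feature extractor only: the word list and tokenizer are assumptions, it handles only straight apostrophes, and it is not the method used in the cited linguistic studies.

```python
import re

# Assumed word list; contractions are included as single tokens.
FIRST_PERSON_SINGULAR = {
    "i", "me", "my", "mine", "myself", "i'm", "i've", "i'd", "i'll",
}
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def first_person_singular_rate(text: str) -> float:
    """Fraction of word tokens that are first-person singular pronouns."""
    words = WORD_RE.findall(text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in FIRST_PERSON_SINGULAR)
    return hits / len(words)


rate = first_person_singular_rate("I can't sleep and my chest hurts")
print(rate)
```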

In contrast, adaptation on a broad “mental health” corpus (as in MentalRoBERTa) may introduce lexical noise from advice-giving, off-topic posts, or non-clinical discussions, diluting stress-specific signals. Stress-SMHD, by focusing on self-reported diagnoses within r/depression, r/Anxiety, and r/ptsd, maximizes linguistic and conceptual coherence for transfer-learning purposes (Alqahtani et al., 29 Dec 2025).

5. Limitations and Recommended Extensions

Notable limitations of Stress-SMHD include:

  • Data source bias: Exclusively English Reddit posts, omitting other platforms and non-users.
  • Absence of demographic metadata: Prevents subgroup analysis or evaluation of representational fairness.
  • Diagnostic coverage: Exclusion of other stress-related or comorbid conditions (e.g., bipolar disorder, eating disorders), and situational (non-diagnostic) stress.
  • Self-report heuristic: Relies on explicit statements (“I was diagnosed…”), missing users with different disclosure styles; the precision and recall of this extraction are not quantified.
  • Annotation regime: Absence of document-level stress annotation precludes supervised stress modeling within the corpus itself.
  • Computational requirements: Continual pretraining over 108M tokens is non-negligible in terms of compute costs, though relatively modest compared to general-domain pretraining.

Recommended extensions include expanding coverage to additional mental-health diagnoses, incorporating multilingual and cross-platform data, integrating user-level demographic inference, augmenting with semi-supervised or pseudo-labeled stress annotations, and applying rigorous statistical significance testing for downstream task performance. Incorporation of techniques such as adapters or prompt-tuning could reduce compute costs for adaptation (Alqahtani et al., 29 Dec 2025).
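As one possible form of the significance testing recommended above, the sketch below implements a simple paired bootstrap over test examples. It is a generic illustration, not a procedure described in the source; the function names, the number of resamples, and the toy labels are assumptions.

```python
import random
from typing import Sequence


def f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Positive-class F1 for binary labels (1 = stress, 0 = no stress)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap(
    y_true: Sequence[int],
    pred_a: Sequence[int],
    pred_b: Sequence[int],
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resampled test sets on which system A does not beat system B.

    Small values suggest that A's F1 advantage is unlikely to be a sampling artifact.
    """
    rng = random.Random(seed)
    n = len(y_true)
    not_better = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        truth = [y_true[i] for i in idx]
        f1_a = f1(truth, [pred_a[i] for i in idx])
        f1_b = f1(truth, [pred_b[i] for i in idx])
        if f1_a <= f1_b:
            not_better += 1
    return not_better / n_resamples


# Toy data: system A is always right, system B never predicts stress.
labels = [1, 0] * 50
p_clear = paired_bootstrap(labels, labels, [0] * 100)
p_tied = paired_bootstrap(labels, labels, labels)
print(p_clear, p_tied)
```

With a difference of about 1 F1 point between adapted models, a test of this kind would help establish whether the gain exceeds what resampling noise alone could produce.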

6. Relation to Other Stress Datasets and Public Availability

While Stress-SMHD is a high-volume, diagnosis-focused resource, it is distinct from the "Understanding and Measuring Psychological Stress using Social Media" corpus, which links Perceived Stress Scale (PSS) scores with Facebook and Twitter posts from surveyed users (Guntuku et al., 2018). Stress-SMHD is also distinct from stress datasets characterized by manual stress annotation or situational stress coverage, such as SMM4H 2022 Task 8 and Dreaddit.

Stress-SMHD itself is not released with individual posts or user metadata; its use to date is as an adaptation resource for pretraining and transfer learning. The approach it enables—focused cross-condition continual pretraining—has yielded measurable improvements in downstream stress-detection F1 and recall (Alqahtani et al., 29 Dec 2025). A plausible implication is that other clinical NLP domains may benefit from targeted adaptation using corpora defined by comorbidity and linguistic overlap, rather than general “mental health” discourse.

