Chinese MentalBERT
- Chinese MentalBERT is a domain-adaptive pre-trained language model tailored for analyzing Chinese mental-health texts on social media with enhanced sensitivity to psychological nuances.
- It leverages a large-scale, psychology-focused corpus and whole-word masking to preserve semantic integrity, achieving superior results in suicide risk, sentiment, and cognitive distortion tasks.
- Empirical evaluations indicate that lexicon-guided masking yields a 1–2 F1 point improvement over random masking, with data augmentation providing further gains over general-domain models.
Chinese MentalBERT is a domain-adaptive pre-trained language model specifically optimized for Chinese mental-health text analysis on social media corpora. It leverages a large-scale, psychology-focused dataset and a lexicon-guided masking regime to improve sensitivity to nuanced psychological expressions, outperforming general-domain Chinese pre-trained models on tasks such as suicide risk classification, sentiment analysis, and cognitive distortion detection (Zhai et al., 2024; Qi et al., 2024).
1. Model Architecture and Pre-training Foundation
Chinese MentalBERT is based on a standard BERT-base Transformer encoder, composed of 12 Transformer layers, each with a hidden size of 768 and 12 attention heads, totaling approximately 110 million parameters. The architecture mirrors the original bidirectional masked language model (MLM) introduced in Devlin et al. (2019), while also adopting Whole-Word Masking (WWM) to ensure that entire Chinese words are masked together rather than individual characters. During fine-tuning, the architecture is extended with a task-specific classification head comprising a dense layer and either a sigmoid (for binary) or softmax (for multi-class) activation (Zhai et al., 2024; Qi et al., 2024).
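The ~110M figure follows directly from the stated dimensions. The back-of-the-envelope count below is a sketch using the original BERT vocabulary size of 30,522 as an assumption; Chinese BERT checkpoints use a smaller (~21k) vocabulary, which lowers the embedding-table contribution and the total accordingly.

```python
# Parameter count for a BERT-base encoder (12 layers, hidden 768,
# 12 heads, FFN 3072). The vocabulary size is an assumption: 30,522
# is the original BERT vocab; Chinese checkpoints use ~21,128.
V, P, T, H, F, L = 30_522, 512, 2, 768, 3_072, 12

embeddings = (V + P + T) * H + 2 * H          # token/position/type tables + LayerNorm
attention  = 4 * (H * H + H)                  # Q, K, V, and output projections
ffn        = (H * F + F) + (F * H + H)        # two dense layers
layer      = attention + ffn + 2 * (2 * H)    # plus two LayerNorms per layer
pooler     = H * H + H

total = embeddings + L * layer + pooler
print(f"{total:,}")  # 109,482,240 — conventionally rounded to "110M parameters"
```

Most of the budget sits in the 12 encoder layers (~85M); the embedding table accounts for the rest, which is why vocabulary size shifts the headline number.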
Initial weights are inherited from the Chinese-BERT-wwm-ext checkpoint. Pre-training proceeds without Next Sentence Prediction, focusing exclusively on MLM with an augmented, domain-adaptive objective.
2. Domain-Specific Pre-training Corpus
The pre-training corpus consists of 3,360,273 Chinese-language posts gathered from multiple mental-health–centered sources:
| Dataset | Users | Posts |
|---|---|---|
| “Zoufan” Weibo tree-hole | 351,069 | 2,346,879 |
| Depression “Chaohua” super-topic | 69,102 | 504,072 |
| Sina Weibo Depression Dataset (SWDD) | 3,711 | 785,689 |
| Weibo User Depression Detection Dataset | 10,325 | 408,797 |
| Total (filtered) | – | 3,360,273 |
These sources reflect predominantly first-person accounts relating to distress, pain, depression, anxiety, and suicidal ideation. The data exhibits dense usage of internet slang, colloquial language, emojis, and social media–specific expressions, directly informing the model's domain coverage (Zhai et al., 2024; Qi et al., 2024).
After data cleansing (i.e., removal of URLs, user tags, emojis, and special symbols) and filtering for minimum length, the posts are segmented into fixed-length 128-token sequences, with whole-word segmentation applied to preserve multi-character semantic units.
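A minimal sketch of this preprocessing step, assuming character-level segmentation and illustrative regex patterns (the exact cleaning rules are not published):

```python
import re

def clean_post(text: str) -> str:
    """Strip URLs, user tags, and non-CJK symbols, per the cleansing steps above.
    The regex patterns are illustrative approximations, not the published rules."""
    text = re.sub(r"https?://\S+", "", text)   # URLs
    text = re.sub(r"@\S+", "", text)           # user tags
    # Keep CJK characters, basic punctuation, and alphanumerics; drop emojis/symbols.
    text = re.sub(r"[^\u4e00-\u9fff\u3000-\u303f，！？a-zA-Z0-9\s]", "", text)
    return text.strip()

def chunk(tokens: list[str], seq_len: int = 128) -> list[list[str]]:
    """Segment a token stream into fixed-length sequences for pre-training."""
    return [tokens[i:i + seq_len] for i in range(0, len(tokens), seq_len)]

post = clean_post("好痛苦 @friend 想死了 https://t.cn/xyz 😭")
print(post)  # URL, user tag, and emoji removed; CJK text retained
```

In the real pipeline, whole-word segmentation would be applied before chunking so that multi-character words are never split across sequence boundaries.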
3. Lexicon-Guided Masking and MLM Objective
A principal innovation is the use of a psychological lexicon—constructed from micro-blog corpora using seed-word propagation and TF-IDF metrics, then refined via expert annotation. This lexicon encompasses key terms associated with mental distress (e.g., “想死” [want to die], “痛苦” [pain], “抑郁” [depression]).
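The lexicon-construction idea can be sketched as follows: score candidate words by TF-IDF within posts that contain a seed term, then rank them as expansion candidates. The seed set and mini-corpus are illustrative stand-ins, and this omits the expert-annotation refinement stage.

```python
import math
from collections import Counter

# Toy sketch of seed-word propagation with TF-IDF scoring.
# Seeds and corpus are illustrative, not the released lexicon.
seeds = {"想死", "痛苦", "抑郁"}
posts = [
    ["很", "痛苦", "失眠", "绝望"],
    ["抑郁", "失眠", "想", "哭"],
    ["今天", "天气", "很", "好"],
]

seed_posts = [p for p in posts if seeds & set(p)]        # posts with a seed term
df = Counter(w for p in posts for w in set(p))           # document frequency
tf = Counter(w for p in seed_posts for w in p)           # term freq near seeds

scores = {
    w: tf[w] * math.log(len(posts) / df[w])
    for w in tf if w not in seeds
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # distress-adjacent terms (e.g. 绝望 "despair") rank high
```

Words that co-occur with seeds but are rare overall (e.g. 绝望 "despair") score highest, while generic words that also appear in neutral posts (e.g. 很 "very") are suppressed, which is the intended filtering effect.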
Pre-training uses a lexicon-biased MLM objective:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P\left(x_i \mid x_{\setminus \mathcal{M}}; \theta\right)$$

where $\mathcal{M}$ denotes the set of masked positions, sampled to ensure that tokens matching the mental-health lexicon constitute a significant fraction (≥20%) of all masked tokens in each sequence. If the proportion of lexicon tokens is inadequate, random tokens are added to the masked set to meet the quota.
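The quota mechanism can be sketched in plain Python. The 15% overall masking rate follows standard BERT practice and, like the helper names, is an assumption rather than a published detail; only the ≥20% lexicon floor comes from the description above.

```python
import random

def select_masked_words(words, lexicon, mask_rate=0.15, lexicon_floor=0.20, rng=random):
    """Pick word indices to mask so lexicon words fill at least `lexicon_floor`
    of the masked slots (whole-word masking: a word's characters are masked
    together). Rates are assumptions in the spirit of the paper."""
    n_mask = max(1, round(len(words) * mask_rate))
    lex_idx = [i for i, w in enumerate(words) if w in lexicon]
    n_lex = min(len(lex_idx), max(1, round(n_mask * lexicon_floor))) if lex_idx else 0
    chosen = set(rng.sample(lex_idx, n_lex))
    rest = [i for i in range(len(words)) if i not in chosen]
    # Top up with random words when lexicon coverage falls short of n_mask.
    chosen.update(rng.sample(rest, n_mask - n_lex))
    return sorted(chosen)

words = ["我", "真的", "好", "痛苦", "想死", "了", "最近", "一直", "失眠", "很", "绝望", "啊"]
lexicon = {"痛苦", "想死", "绝望", "失眠"}
masked = select_masked_words(words, lexicon, rng=random.Random(0))
print(masked)  # indices to mask; at least one falls on a lexicon word
```

The effect is that every training sequence containing distress vocabulary is forced to practice recovering it, rather than leaving exposure to chance as uniform random masking does.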
This lexicon-guided masking increases the exposure of the model to salient psychological language, resulting in improved ability to recover semantically loaded words during training. Empirical ablation shows that guided masking improves F1 by 1–2 points compared to random masking across downstream tasks (Zhai et al., 2024).
4. Fine-tuning Protocols and Data Augmentation
For downstream suicide-risk and mental-health classification tasks, fine-tuning employs the following regime:
- Data split: 4:1 train/test, 5-fold cross-validation on the training portion
- Input formatting: Chinese WordPiece tokenization; sequences padded/truncated to 128–150 tokens
- Optimization: Adam optimizer with learning rates tuned per task, batch size 16 or 32, up to 30 epochs with early stopping based on validation loss
- Classification head: single dense layer with sigmoid (for binary) or softmax (for multi-class)
- Data augmentation (fine-grained classification): synonym replacement (SR) with TF-IDF keyword protection, round-trip translation (RT) via Baidu Translate API across five languages, and LLM-based generation (LLM-G) using GPT-4 few-shot prompting (Qi et al., 2024)
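The split protocol above can be sketched in plain Python; the interleaved fold assignment is illustrative, since the exact fold construction is not specified.

```python
import random

def split_and_folds(n_examples: int, seed: int = 42):
    """4:1 train/test split, then 5 cross-validation folds over the training
    indices, mirroring the protocol above (fold assignment is a sketch)."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    cut = int(0.8 * n_examples)          # 4:1 split point
    train, test = idx[:cut], idx[cut:]
    folds = [train[k::5] for k in range(5)]  # 5 interleaved CV folds
    return train, test, folds

train, test, folds = split_and_folds(1000)
print(len(train), len(test), [len(f) for f in folds])  # 800 200 [160, 160, 160, 160, 160]
```

Each fold serves once as the validation set for early stopping while the remaining four folds are used for gradient updates, with the held-out 20% reserved for final evaluation.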
Data augmentation yields measurable gains, with RT providing up to +4.65% absolute improvement in weighted F1 for fine-grained suicide risk classification, indicating enhanced robustness to synonymy and slang variation.
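The keyword-protection idea behind the SR augmenter can be sketched as below: the highest-TF-IDF words in a sentence are exempt from replacement so augmentation cannot destroy the signal terms. The synonym table is a toy stand-in for a real thesaurus, and the round-trip translation path (an external API) is not shown.

```python
import math
from collections import Counter

def sr_with_protection(words, corpus, synonyms, protect_top_k=2):
    """Synonym replacement that leaves the sentence's top-k TF-IDF keywords
    untouched. `synonyms` is a toy stand-in for a real thesaurus."""
    df = Counter(w for doc in corpus for w in set(doc))       # document frequency
    scores = {w: words.count(w) * math.log(len(corpus) / df.get(w, 1))
              for w in set(words)}
    protected = set(sorted(scores, key=scores.get, reverse=True)[:protect_top_k])
    return [synonyms.get(w, w) if w not in protected else w for w in words]

corpus = [["我", "想死"], ["我", "难受"], ["天气", "好"]]
out = sr_with_protection(["我", "想死", "难受"], corpus, {"我": "自己", "难受": "痛苦"})
print(out)  # ['自己', '想死', '难受'] — keywords protected, only 我 replaced
```

Here 想死 and 难受 score highest (rare in the corpus, hence distinctive), so they survive augmentation intact while the generic pronoun is swapped.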
5. Empirical Performance and Comparative Analysis
Chinese MentalBERT demonstrates consistently superior performance relative to both general-domain and other domain-adapted pretrained models. The following tables summarize evaluation results on suicide risk classification as reported by Qi et al. (2024):
Fine-grained Suicide Risk (11-way):
| Model | Precision | Recall | F1 |
|---|---|---|---|
| MentalBERT | 52.81% | 54.80% | 50.89% |
| MacBERT | 50.17% | 53.60% | 50.59% |
| BERT | 48.95% | 50.80% | 46.16% |
| NeZha | 47.61% | 50.40% | 45.72% |
| ERNIE 3.0 | 47.90% | 51.20% | 47.36% |
| RoBERTa | 38.03% | 42.80% | 37.11% |
| ELECTRA | 30.35% | 42.40% | 34.91% |
With round-trip translation augmentation, MentalBERT achieves a weighted F1 of 55.54%.
High-Low Suicide Risk (binary):
| Model | Precision | Recall | F1 |
|---|---|---|---|
| MentalBERT | 88.41% | 88.40% | 88.39% |
| ERNIE 3.0 | 86.42% | 86.40% | 86.39% |
| BERT | 87.68% | 87.60% | 87.61% |
| NeZha | 84.86% | 84.80% | 84.81% |
| MacBERT | 83.60% | 83.60% | 83.59% |
| ELECTRA | 81.21% | 81.20% | 81.20% |
| RoBERTa | 80.02% | 80.00% | 79.97% |
MentalBERT leads all models on this task.
Additional evaluation on sentiment and cognitive distortion datasets yields 2–4 F1 point improvements for MentalBERT compared to general pretrained models. Guided masking confers an additional 1–2 point margin beyond random masking (Zhai et al., 2024).
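The weighted F1 reported in the tables above is the support-weighted average of per-class F1 scores; a minimal sketch of the metric:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1, the metric used for the
    fine-grained suicide-risk results above."""
    support = Counter(y_true)
    total = 0.0
    for c, n_c in support.items():
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += (n_c / len(y_true)) * f1    # weight each class by its support
    return total

print(weighted_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2]))
```

Weighting by support matters for the 11-way task, where risk levels are imbalanced: a model cannot inflate its score by excelling only on rare classes.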
6. Qualitative Behavior and Sensitivity to Psychological Semantics
Masked token prediction analysis reveals that Chinese MentalBERT, especially with guided masking, more reliably infers emotionally relevant terms in diagnostic or first-person contexts. For example, on SCL-90 sentences with masked keywords, the domain-adapted model predicts “折磨” (“torture”), “困难” (“trouble”), and “悲伤” (“sadness”) where general BERTs tend to predict neutral or generic words. Annotation confirms a >20% increase in affect-laden prediction rate for the guided model (Zhai et al., 2024).
This sensitivity extends to subtle expressions of suicidal ideation or depressive affect, which contributes to its effectiveness in suicide risk screening and fine-grained psychological analysis.
7. Limitations and Prospects
Several limitations are noted. The pre-training corpus, while large and diverse within the Weibo/social-media domain, is not publicly available due to privacy compliance. The domain scope, tailored to social media text, may not generalize to settings such as clinical interviews, long-form patient narratives, or other dialects. Label distribution and user demographics may introduce biases (Zhai et al., 2024; Qi et al., 2024).
Future research directions include:
- Extension to clinical record analysis and summarization of long psychological content
- Continual/lifelong pre-training on evolving social media language
- Integration of richer knowledge graphs or multi-task objectives to capture structured psychological knowledge
The model, code, and downstream task scripts are openly accessible via https://github.com/zwzzzQAQ/Chinese-MentalBERT, supporting community-driven adaptation across diverse Chinese mental-health NLP pipelines.
References:
- Zhai et al. (2024). Chinese MentalBERT: Domain-Adaptive Pre-training on Social Media for Chinese Mental Health Text Analysis.
- Qi et al. (2024). SOS-1K: A Fine-grained Suicide Risk Classification Dataset for Chinese Social Media Analysis.