EmoPsy: Chinese Affective & Counseling Dialogue Data

Updated 22 February 2026
  • EmoPsy is a pair of large-scale datasets for Chinese affective computing, combining social media posts and synthetic counseling dialogues with fine-grained emotion annotations.
  • CMACD uses a BERT-based Extended Quantification Network (EQN) to assign multi-label, continuous emotion scores to Weibo posts from users whose MBTI types are inferred from their profiles and posts.
  • The AI PsyRoom corpus employs a multi-agent pipeline to generate and refine realistic counseling dialogues annotated with 35 sub-emotions across 9 primary categories.

EmoPsy refers to two distinct, large-scale datasets for Chinese affective computing and counseling dialogue research. Both are designed to support fine-grained emotion analysis, but they differ in corpus structure, annotation targets, and primary applications. The first, EmoPsy (CMACD), is a Weibo-based multi-label affective computing corpus that pairs user-level MBTI personality types with post-level, continuous emotion intensities (Zhou et al., 2024). The second, EmoPsy from the AI PsyRoom project, is a counseling dialogue dataset generated by a multi-agent pipeline and annotated by fine-grained sub-emotion and scenario type (Feng et al., 7 Jun 2025). Both provide foundational resources for the development and evaluation of emotion-aware AI systems.

1. Construction Methodologies

EmoPsy (CMACD)

Data originate from Sina Weibo, with an initial crawl of ≈51,000 users explicitly self-reporting one of sixteen MBTI personality types. After filtering and manual review, the final dataset contains 11,338 users and 566,900 posts. Each user's MBTI is inferred from explicit posts or nicknames, with privacy protection (removal of identifiers) and post-length constraints (30–150 characters). Annotation leverages a two-stage Extended Quantification Network (EQN): a BERT-based model is fine-tuned on single-label, manually annotated Weibo posts, followed by regression adjustment and retraining, before application to all CMACD posts. This generates multi-label, continuous emotion intensity scores in the range [0,1] for six emotions per post.
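The post-length constraint described above can be illustrated with a short filter. This is a minimal sketch, not the authors' preprocessing code: the function names are hypothetical, and treating the 30 and 150 bounds as inclusive and counting characters after trimming whitespace are assumptions.

```python
def keep_post(text: str, min_len: int = 30, max_len: int = 150) -> bool:
    """Return True if the trimmed post length is inside the band.

    Assumption: both bounds are inclusive and length is counted in
    characters (Python code points), not bytes or tokens.
    """
    return min_len <= len(text.strip()) <= max_len


def filter_posts(posts):
    """Keep only the posts that satisfy the length constraint."""
    return [p for p in posts if keep_post(p)]


# Illustrative input: one post that is too short, one in range, one too long.
demo_posts = ["短" * 10, "文" * 60, "长" * 200]
kept = filter_posts(demo_posts)
```

Counting code points rather than bytes matters for Chinese text, where a single character occupies several bytes in UTF-8.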

EmoPsy (AI PsyRoom)

Dialogues are synthesized via the PsyRoom A multi-agent pipeline. Foundation theories include Plutchik’s, Izard’s, and Russell’s emotion models, DSM-5 symptomatology, and appraisal theory. Thirty-five sub-emotions span nine primary categories, and each sub-emotion is paired with ~12 carefully designed real-world scenarios, vetted by clinical psychologists for ecological validity. Three agents coordinate per dialogue: a Qwen2.5-72B client agent generates self-disclosure within the scenario; a GPT-4o counselor agent applies integrative therapeutic strategies (CBT/EFT/person-centered); and a DeepSeek-R1 professor agent scores each turn for problem orientation, compassion, empathy, and interactive communication quality. The agents iterate until the dialogue reaches the quality threshold (average turn score ≥95). The initial 432 seed dialogues undergo parametric augmentation (variation in demographics and phrasing) and then hierarchical filtering, resulting in 12,350 finalized dialogues.
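The generate, score, and iterate loop described above can be sketched as follows. This is an illustrative outline only and not the AI PsyRoom implementation: the agent callables, their signatures, the feedback string, the number of exchanges, and the `max_rounds` cap are all hypothetical, and the stub agents stand in for the LLM calls. Only the three roles and the ≥95 threshold come from the description.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (speaker, text)


@dataclass
class RefinementResult:
    dialogue: List[Turn]
    score: float
    rounds: int


def refine_dialogue(
    scenario: str,
    client: Callable[[str, List[Turn]], str],
    counselor: Callable[[str, List[Turn], str], str],
    professor: Callable[[List[Turn]], Tuple[float, str]],
    n_exchanges: int = 3,      # assumption: fixed number of client/counselor exchanges
    threshold: float = 95.0,   # quality threshold quoted in the text
    max_rounds: int = 5,       # assumption: safety cap on regeneration rounds
) -> RefinementResult:
    """Regenerate a dialogue until the professor's average score meets the threshold."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback = ""
    best: Optional[RefinementResult] = None
    for round_idx in range(1, max_rounds + 1):
        dialogue: List[Turn] = []
        for _ in range(n_exchanges):
            dialogue.append(("client", client(scenario, dialogue)))
            dialogue.append(("counselor", counselor(scenario, dialogue, feedback)))
        score, feedback = professor(dialogue)
        if best is None or score > best.score:
            best = RefinementResult(dialogue, score, round_idx)
        if score >= threshold:
            break
    return best


# Stub agents (placeholders for the LLM-backed client, counselor, and professor).
def demo_client(scenario, history):
    return f"[client self-disclosure about {scenario}]"


def demo_counselor(scenario, history, feedback):
    return "[counselor reply]" + (" (revised)" if feedback else "")


def make_demo_professor(scores):
    it = iter(scores)
    def professor(dialogue):
        return next(it), "show more empathy"
    return professor


result = refine_dialogue(
    "exam anxiety", demo_client, demo_counselor,
    make_demo_professor([80.0, 96.0, 99.0]),
)
```

Keeping the best-scoring attempt means the loop still returns something usable if the cap is reached before the threshold is met.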

2. Emotion and Annotation Taxonomies

Dataset | Emotion Taxonomy | Annotation Structure
CMACD | 6 primary: Angry, Fear, Happy, Sad, Neutral, Surprise | Multi-label, [0,1] intensities, per post
AI PsyRoom | 9 primary + 35 sub-emotions, e.g. anticipatory anxiety, betrayal guilt, frustration | Scenario-based, per-turn sub-emotion and function tags

In CMACD, each post may express multiple emotions, with intensity scores computed by EQN and filtered by a threshold t = 0.05. MBTI personality type is assigned at the user level. In AI PsyRoom, dialogues are labeled at both scenario and utterance granularity: each utterance is annotated with the client's sub-emotion or the counselor's response strategy (e.g., empathy, reframing). The fine-grained taxonomy supports modeling of the subtle affective phenomena encountered in psychological counseling.
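As a small illustration of the thresholding step for CMACD, the sketch below keeps only the emotions whose intensity reaches t. Whether the comparison is inclusive and whether sub-threshold scores are dropped or set to zero are assumptions; the function name is hypothetical.

```python
EMOTIONS = ("Angry", "Fear", "Happy", "Sad", "Neutral", "Surprise")


def active_emotions(scores, t=0.05):
    """Return the emotions whose intensity is at least t.

    Assumption: scores below t are dropped rather than zeroed.
    """
    return {emo: s for emo, s in scores.items() if s >= t}


# Illustrative scores for a single post (made-up values).
post_scores = {"Happy": 0.62, "Sad": 0.04, "Surprise": 0.05}
labels = active_emotions(post_scores)
```

With these made-up values, Sad falls below the threshold and is dropped, so the post carries two labels.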

3. Dataset Statistics and Format

EmoPsy (CMACD)

  • 11,338 users, 566,900 posts (50 per user), post length avg. 63.7 characters
  • MBTI-balanced (I/E, N/S, T/F, J/P axes roughly equal; weak mutual correlations)
  • Directory structure: by MBTI type, CSV files per user with columns for post text and six emotion scores
  • Emotion distribution by global average energy (GAE): Happy (≈0.35) > Angry (≈0.28) > Sad (≈0.22) > Neutral (≈0.20) > Surprise (≈0.15) > Fear (≈0.12)
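The per-user CSV layout and the global average energy figures above suggest a simple aggregation, sketched here. The column names are hypothetical, and GAE is assumed to be the mean intensity of each emotion over all posts; the source does not give the exact definition. That reading is at least consistent with the listed values summing to about 1.32 rather than 1, as expected for multi-label scores.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names for the six emotion scores.
EMOTIONS = ["Angry", "Fear", "Happy", "Sad", "Neutral", "Surprise"]


def global_average_energy(csv_files):
    """Mean intensity per emotion over every post in the given CSV file objects."""
    totals = defaultdict(float)
    n_posts = 0
    for f in csv_files:
        for row in csv.DictReader(f):
            n_posts += 1
            for emo in EMOTIONS:
                totals[emo] += float(row[emo])
    if n_posts == 0:
        raise ValueError("no posts found")
    return {emo: totals[emo] / n_posts for emo in EMOTIONS}


# In-memory stand-in for one user's file (made-up values).
sample = io.StringIO(
    "text,Angry,Fear,Happy,Sad,Neutral,Surprise\n"
    "post a,0.2,0.0,0.6,0.0,0.1,0.1\n"
    "post b,0.4,0.2,0.0,0.4,0.1,0.1\n"
)
gae = global_average_energy([sample])
```

For real data, the file objects would come from opening each user's CSV under the MBTI-type directories with `newline=""` and an explicit encoding.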

EmoPsy (AI PsyRoom)

  • 12,350 dialogues, estimated 100,000 turns and 1.2M tokens
  • 35 sub-emotions distributed over 9 primary categories, 423 vetted scenarios, ~29 dialogues/scenario
  • Format: JSONL (one dialogue per line), including fields for dialogue_id, scenario_id, primary_emotion/sub_emotion, turns (with per-turn tags and optional timestamps), overall_score
  • Train/val/test splits not fixed; common practice is 80/10/10 stratified by sub-emotion
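Since no official split is fixed, the 80/10/10 split stratified by sub-emotion can be sketched in plain Python. The loader assumes one JSON object per line with a `sub_emotion` field as listed above; the function names, the seed, and the rounding rule (the test split takes the remainder) are assumptions.

```python
import json
import random
from collections import defaultdict


def load_jsonl(path):
    """Read one dialogue record per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def stratified_split(records, key="sub_emotion", ratios=(0.8, 0.1, 0.1), seed=0):
    """Split records into train/val/test, preserving the per-label proportions."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    train, val, test = [], [], []
    for label in sorted(groups):  # sorted for reproducibility
        items = groups[label]
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Illustrative records with two made-up sub-emotion labels.
records = [
    {"dialogue_id": i, "sub_emotion": "label_a" if i % 2 else "label_b"}
    for i in range(20)
]
train, val, test = stratified_split(records)
```

Because augmented dialogues derive from shared seed scenarios, splitting by `scenario_id` instead of by dialogue may be worth considering to limit leakage between splits; that is a design suggestion, not something the source prescribes.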

Example annotated dialogue records include explicit turn-level emotion/strategy tags and scenario references.

4. Evaluation Protocols and Baselines

CMACD

  • Inter-annotator agreement: manual spot-check, top-1 hit rate 83.1%, top-2 92.3% for emotion labels
  • Baseline tasks: Personality (four binary axes) and emotion (multi-label) classification
  • Best model (BERT), accuracy per axis: IE 0.7674, NS 0.7851, TF 0.7963, JP 0.7388; multi-label emotion (E₁₋Acc) 0.9284
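The personality baseline treats MBTI as four binary axes. The sketch below shows how a four-letter type can be decomposed and how per-axis accuracy can be computed; it is an illustration with hypothetical function names and made-up predictions, not the evaluation script behind the figures above.

```python
AXES = ("IE", "NS", "TF", "JP")


def mbti_to_axes(mbti):
    """Map a type such as 'INTJ' to one letter per axis."""
    mbti = mbti.strip().upper()
    if len(mbti) != 4 or any(ch not in pair for ch, pair in zip(mbti, AXES)):
        raise ValueError(f"not a valid MBTI type: {mbti!r}")
    return dict(zip(AXES, mbti))


def axis_accuracy(true_types, pred_types):
    """Fraction of users whose predicted letter matches on each axis."""
    if not true_types or len(true_types) != len(pred_types):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = {axis: 0 for axis in AXES}
    for t, p in zip(true_types, pred_types):
        t_axes, p_axes = mbti_to_axes(t), mbti_to_axes(p)
        for axis in AXES:
            hits[axis] += t_axes[axis] == p_axes[axis]
    return {axis: hits[axis] / len(true_types) for axis in AXES}


# Made-up example: the first prediction is wrong only on the J/P axis.
acc = axis_accuracy(["INTJ", "ENFP"], ["INTP", "ENFP"])
```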

AI PsyRoom

  • Dialogue quality metrics (max per turn): Problem Orientation (2), Compassion (3), Empathy (3), Interactive Communication (2)
  • Improvement over role-play baseline: +18% Problem Orientation, +23% Compassion, +24% Empathy, +16% Interactive Communication
  • Treatment-plan evaluation: Six criteria (comprehensiveness, professionalism, personalization, safety, operability, sustainability), mean human ratings 4.33–4.48/5
  • Emotion classification: standard metrics (precision, recall, F1), with Cohen's κ as a possible agreement measure; explicit figures are not provided
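For reference, the standard definitions of the classification and agreement metrics named in the last item are given below. The per-class notation and the macro average over C classes are conventional choices rather than something the source specifies; for sub-emotion recognition, C would be 35.

```latex
% Per-class metrics for class c, from true positives (TP), false positives (FP),
% and false negatives (FN):
P_c = \frac{TP_c}{TP_c + FP_c}, \qquad
R_c = \frac{TP_c}{TP_c + FN_c}, \qquad
F1_c = \frac{2\, P_c\, R_c}{P_c + R_c}

% Macro-averaged F1 over C classes:
F1_{\text{macro}} = \frac{1}{C} \sum_{c=1}^{C} F1_c

% Cohen's kappa, with observed agreement p_o and chance agreement p_e:
\kappa = \frac{p_o - p_e}{1 - p_e}
```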

5. Applications and Use Cases

Both datasets facilitate development and benchmarking of emotion-aware AI for multiple domains:

  • CMACD supports research into personality-emotion interaction, adaptive educational systems, marketing, sentiment-aware financial models, and political opinion analysis, emphasizing the synergy of stable MBTI traits and dynamic emotional expressions (Zhou et al., 2024).
  • AI PsyRoom EmoPsy enables multi-agent simulations for therapist training, benchmarking of fine-grained emotion recognition, adaptive intervention planning, large-scale modeling of counseling strategies, and evaluation of generative dialogue systems within ecologically valid, scenario-driven contexts (Feng et al., 7 Jun 2025).

Both are positioned as public resources for affective computing and mental health research, supporting multilingual and multiclass modeling initiatives.

6. Limitations, Biases, and Future Directions

Notable limitations of EmoPsy (CMACD) include reliance on self-reported MBTI (potentially noisy), lack of session- or context-level emotional dynamics, and restriction to written language without multimodal cues. For AI PsyRoom EmoPsy, challenges include taxonomy coverage (inability to accommodate mixed or rare emotions), limitations of synthetic augmentation (possible phrasing or cultural artifacts), absence of prosodic/facial modalities, and initial focus on Chinese counseling.

Mitigations proposed by Feng et al. (7 Jun 2025) include expanding emotion taxonomies via clinical interviews, integrating audio/video signals, and running human-in-the-loop audits for synthetic data drift. For both datasets, caution is recommended in cross-cultural applications, and augmenting the unimodal resources with multimodal labeling is encouraged for future releases.

7. Context and Comparative Significance

EmoPsy (CMACD) is the first known Chinese dataset to combine MBTI personality typing with post-level, fine-grained multi-label emotion intensities at this scale. By contrast, the AI PsyRoom EmoPsy corpus establishes a novel paradigm for scenario-based, sub-emotion annotated, therapist-client counseling dialogues, generated and scored in a multi-agent pipeline for high topical and affective fidelity. Each addresses longstanding data scarcity in Chinese affective computing and mental health AI, providing benchmark resources for academia and industry.

For further implementation and analysis details, refer to the cited papers: "A Chinese Multi-label Affective Computing Dataset Based on Social Media Network Users" (Zhou et al., 2024) and "AI PsyRoom: Artificial Intelligence Platform for Segmented Yearning and Reactive Outcome Optimization Method" (Feng et al., 7 Jun 2025).
