
Ch-PatientSim: Chinese Patient Simulation Dataset

Updated 23 January 2026
  • Ch-PatientSim is a comprehensive, persona-annotated dataset capturing realistic clinical dialogues in Chinese, focused on gastroenterology outpatient care.
  • It integrates multi-dimensional patient persona labels, structured medical histories, and synthetic dialogue augmentations using a five-dimensional annotation schema.
  • The dataset enables robust benchmarking and fine-tuning of LLMs for patient role-play through a multi-stage response-regulation framework (MSPRP) and extensive evaluation metrics.

The Chinese Patient Simulation Dataset (Ch-PatientSim) is the first persona-annotated, publicly released corpus of realistic clinical interactions in Chinese, designed explicitly to benchmark and advance patient-facing LLMs in the healthcare domain. Focused on gastroenterology outpatient care, Ch-PatientSim integrates multi-dimensional patient persona labels, structured medical histories, and manually curated, persona-diverse synthetic dialogue augmentations, thereby establishing a new baseline for both clinical language modeling and automated patient role-play (Jiang et al., 16 Jan 2026).

1. Dataset Composition and Clinical Coverage

Ch-PatientSim encompasses 591 unique patient cases, each represented by a patient-doctor conversation reflecting a gastroenterology clinic encounter. The total volume includes 5,935 patient utterances, with an average of 10.5 dialogue turns per patient. All conversations are mapped to eight prevalent gastrointestinal (GI) conditions, including gastritis, peptic ulcer disease, inflammatory bowel disease, gallstone disease, functional dyspepsia, various liver disorders, and reflux esophagitis. After post-hoc augmentation, the dataset is balanced so that each clinical subcategory makes up 10–15% of the dataset.

Each dialogue is underpinned by structured, case-level metadata: demographics (age, gender), chief complaint, historical notes, lab values, imaging, and endoscopy summaries.

2. Persona Annotation and Representation

A distinctive attribute of Ch-PatientSim is its systematic five-dimensional persona schema:

$$P = \left[p^{\rm Personality},\ p^{\rm Emotion},\ p^{\rm Recall},\ p^{\rm Comprehension},\ p^{\rm Fluency}\right]$$

Each dimension is categorically annotated at one of three levels (low/medium/high, or specific labels), enabling fine-grained simulation of patient diversity:

  • Personality: Encodes behavioral style (e.g., "anxious", "paranoid", "detached").
  • Emotion: Expresses affective state ("stable", "irritable", "teary").
  • Medical-History Recall: Indicates detail retention regarding medical history.
  • Comprehension: Captures the ability to understand clinical queries.
  • Language Fluency: Assesses spoken language facility.

Manual annotation is performed on every case by two medical annotators, with conflicting assignments arbitrated by senior clinicians. Estimated inter-annotator agreement is high ($\kappa \approx 0.82$), ensuring label consistency.
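As a rough illustration of how an agreement score of this kind can be computed for two annotators, the sketch below implements Cohen's kappa. It assumes that the reported $\kappa$ is Cohen's kappa, which the text does not state explicitly, and the label lists are hypothetical rather than taken from the dataset.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical recall-level labels for six cases (not from the dataset).
annotator_1 = ["low", "high", "medium", "high", "low", "medium"]
annotator_2 = ["low", "high", "medium", "medium", "low", "medium"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy example the annotators agree on five of six cases; kappa discounts the share of that agreement that would be expected by chance.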

3. Data Curation, Class Balancing, and Augmentation

Raw dialogic data is sourced from 150 real-world gastroenterology outpatient records. Annotators segment conversational turns, redact protected health information, and transcribe clinical variables.

In order to address inherent class imbalances in persona-state representation, minority classes are over-sampled using weighted sampling:

$$w_i = \frac{1}{n_i^{\alpha}}, \quad \tilde{w}_i = \frac{w_i}{\sum_j w_j}, \quad \alpha \in (0, 1]$$

with $\alpha$ empirically set to 0.5.
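The weighting formula can be checked with a few lines of Python. This is a minimal sketch that assumes $n_i$ denotes the number of cases in persona class $i$ (the text does not define it) and uses hypothetical class counts.

```python
def sampling_weights(class_counts, alpha=0.5):
    """Normalised inverse-frequency weights: w_i = 1 / n_i**alpha, then w_i / sum_j w_j."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    raw = {label: 1.0 / (count ** alpha) for label, count in class_counts.items()}
    total = sum(raw.values())
    return {label: w / total for label, w in raw.items()}

# Hypothetical emotion-class counts (not from the dataset).
counts = {"stable": 400, "irritable": 100, "teary": 25}
weights = sampling_weights(counts)  # alpha = 0.5, as reported in the text
```

With $\alpha = 0.5$ the raw weights are 1/20, 1/10, and 1/5, so the rarest class is drawn four times as often as the most common one even though it is sixteen times smaller. Choosing $\alpha = 1$ would equalise the classes fully, while values closer to 0 leave the original imbalance largely intact.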

Few-shot augmentation is applied to rare persona-state combinations. For each combination, five seed cases provide prompts to a 72B-parameter LLM (Qwen2.5-72B), which is instructed to generate two additional eight-turn dialogues per seed. Of ~1,200 generated synthetic cases, approximately 500 are retained after screening for medical plausibility and persona match (88% passing dual-annotator review, $\kappa_{\rm aug} \approx 0.79$).

4. Data Schema, Access, and Availability

The dataset is distributed in JSONL format, with each line an object containing:

  • "patient_id"
  • "demographics": {age, gender}
  • "medical_info": chief complaint, history, labs, imaging
  • "persona": all five persona fields
  • "dialogue": list of {speaker: "Doctor"/"Patient", utterance: ...}

A representative entry includes hierarchical medical data, persona vector, and turn-level dialogue. Ch-PatientSim is publicly available under the MIT license at https://github.com/SerajJon/MSPRP, and is partitioned into train, validation, and test sets (Jiang et al., 16 Jan 2026).
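A record in this layout could be read along the following lines. The sketch is illustrative only: the five top-level keys and the `speaker`/`utterance` fields follow the schema listed above, while the nested key names inside `medical_info` and `persona`, the sample values, and the helper names `read_cases` and `patient_utterances` are assumptions. An in-memory buffer stands in for a real split file, since the file names are not given here.

```python
import io
import json

REQUIRED_KEYS = {"patient_id", "demographics", "medical_info", "persona", "dialogue"}

# Hypothetical record: nested key names and all values are invented for illustration.
SAMPLE_RECORD = {
    "patient_id": "demo-0001",
    "demographics": {"age": 45, "gender": "female"},
    "medical_info": {"chief_complaint": "upper abdominal pain for three days"},
    "persona": {"personality": "anxious", "emotion": "irritable",
                "recall": "low", "comprehension": "medium", "fluency": "high"},
    "dialogue": [
        {"speaker": "Doctor", "utterance": "What brings you in today?"},
        {"speaker": "Patient", "utterance": "My stomach has hurt for days."},
    ],
}

def read_cases(handle):
    """Yield one dict per non-empty JSONL line, checking the top-level keys."""
    for line_no, line in enumerate(handle, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        yield record

def patient_utterances(record):
    """Return the patient's side of the conversation, in order."""
    return [t["utterance"] for t in record["dialogue"] if t["speaker"] == "Patient"]

# In-memory stand-in for a JSONL file with a single case.
jsonl_file = io.StringIO(json.dumps(SAMPLE_RECORD) + "\n")
cases = list(read_cases(jsonl_file))
replies = patient_utterances(cases[0])
```

Validating the top-level keys while streaming keeps memory use flat and surfaces malformed lines early, which is convenient when the same loader is reused across the train, validation, and test splits.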

5. Benchmarking Protocols and Evaluation Metrics

Evaluation leverages both automated and LLM-based human-aligned metrics:

  • Automated metrics: BLEU-n, ROUGE-L, METEOR, BERTScore, perplexity ($\mathrm{PPL}$), and diversity (distinct-n). These measure n-gram overlap, semantic similarity, and fluency.
  • LLM-graded pragmatic metrics (1–5 scale): Persona Consistency, Factual Consistency, Naturalness, Contextual Relevance.
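Of the automated metrics, distinct-n is the simplest to reproduce: it is the ratio of unique n-grams to total n-grams across the generated replies. The sketch below assumes character-level tokenisation for Chinese text, which may differ from the tokenisation used in the paper, and the two sample replies are invented.

```python
def distinct_n(utterances, n=2):
    """Unique n-grams divided by total n-grams over a list of token lists."""
    total = 0
    unique = set()
    for tokens in utterances:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

# Two invented patient replies, split into characters (an assumption):
# "我胃疼" (my stomach hurts) and "我胃胀" (my stomach feels bloated).
replies = [list("我胃疼"), list("我胃胀")]
d1 = distinct_n(replies, n=1)
d2 = distinct_n(replies, n=2)
```

Here the two replies share the bigram ("我", "胃"), so three of the four bigrams are unique. Lower values indicate more repetitive output, which is one symptom of generic patient responses.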

The baseline Qwen2.5-72B model scores BLEU-1 = 0.2006, Persona = 3.870, and Factual = 3.896. Augmentation via Ch-PatientSim and the MSPRP framework yields modest but measurable improvements (e.g., Persona = 3.939), suggesting that persona-driven augmentation is effective.

Ablation experiments confirm each MSPRP stage’s independent contribution to dialogue realism, particularly for persona consistency (Stage 2: +0.073), and cumulative gains when combined (BLEU-1 +0.0203, Persona +0.157).

6. Multi-Stage Patient Role-Playing (MSPRP) Framework

Ch-PatientSim is designed in parallel with the Multi-Stage Patient Role-Playing (MSPRP) inference framework, which regulates LLM response generation by decomposing patient simulation into three ordered stages:

  1. Basic Information Generation: Enforces symptom coverage, temporal sequencing, factual integrity of medical variables, and aligns to medical-history recall.
  2. Communication Style Injection: Deploys scenario-specific interaction matrices to blend Personality and Emotion into response content and style.
  3. Expression Consistency Regulation: Modulates linguistic detail, fluency, and recall according to persona vector levels.

The staged regulation mechanism ensures outputs are both clinically reasonable and aligned to the specified persona, mitigating the generic or overly formal patient responses observed in baseline LLMs. Outputs are further validated through metric-based checks and clinical review.
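The three ordered stages can be pictured as a chain in which each stage refines the previous stage's draft. The sketch below is a loose illustration under several assumptions: the prompt wording, the function names, and the choice of one LLM call per stage are all hypothetical and are not taken from the MSPRP implementation. The `generate` argument is any callable that maps a prompt string to a text completion.

```python
from typing import Callable, Dict

Generate = Callable[[str], str]  # hypothetical LLM interface: prompt in, text out

def stage1_basic_information(generate: Generate, case: Dict, question: str) -> str:
    # Stage 1: a factually grounded draft built from the structured case data.
    prompt = (
        "Stage 1. Answer as the patient using only these facts.\n"
        f"Facts: {case}\nDoctor: {question}"
    )
    return generate(prompt)

def stage2_communication_style(generate: Generate, persona: Dict, draft: str) -> str:
    # Stage 2: blend Personality and Emotion into the draft.
    prompt = (
        "Stage 2. Rewrite the draft in this communication style.\n"
        f"Personality: {persona['personality']}; Emotion: {persona['emotion']}\n"
        f"Draft: {draft}"
    )
    return generate(prompt)

def stage3_expression_consistency(generate: Generate, persona: Dict, draft: str) -> str:
    # Stage 3: adjust detail and fluency to the persona's levels.
    prompt = (
        "Stage 3. Adjust detail and fluency to match these levels.\n"
        f"Recall: {persona['recall']}; Comprehension: {persona['comprehension']}; "
        f"Fluency: {persona['fluency']}\nDraft: {draft}"
    )
    return generate(prompt)

def msprp_reply(generate: Generate, case: Dict, persona: Dict, question: str) -> str:
    """Run the three stages in order, feeding each draft into the next stage."""
    draft = stage1_basic_information(generate, case, question)
    styled = stage2_communication_style(generate, persona, draft)
    return stage3_expression_consistency(generate, persona, styled)

# Stub model that records prompts, so the control flow can be inspected offline.
calls = []
def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"draft-{len(calls)}"

case = {"chief_complaint": "upper abdominal pain for three days"}
persona = {"personality": "anxious", "emotion": "irritable",
           "recall": "low", "comprehension": "medium", "fluency": "high"}
reply = msprp_reply(fake_llm, case, persona, "Where does it hurt?")
```

Keeping the stages as separate functions makes it straightforward to disable one at a time, which mirrors the kind of per-stage ablation reported in the evaluation.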

7. Applications and Recommendations

Ch-PatientSim furnishes a robust benchmark for fine-tuning or prompt-based evaluation of LLMs under constrained patient personas in a Chinese clinical context. Applications include:

  • Fine-tuning clinical dialogue models to improve persona adaptivity.
  • Benchmarking retrieval-augmented or memory-grounded LLM agents for medical role-play.
  • Extending methodology and annotation schema to other medical specialties by adapting the five-dimensional persona framework.

The dataset and framework serve as foundational infrastructure for the development of next-generation, persona-aware AI clinical simulators and robust downstream evaluation of model generalization and adaptability (Jiang et al., 16 Jan 2026).

References (1)
