MedDialog: Two Large-scale Medical Dialogue Datasets

Published 7 Apr 2020 in cs.LG, cs.AI, cs.CL, and stat.ML | (2004.03329v2)

Abstract: Medical dialogue systems are promising in assisting in telemedicine to increase access to healthcare services, improve the quality of patient care, and reduce medical costs. To facilitate the research and development of medical dialogue systems, we build two large-scale medical dialogue datasets: MedDialog-EN and MedDialog-CN. MedDialog-EN is an English dataset containing 0.3 million conversations between patients and doctors and 0.5 million utterances. MedDialog-CN is a Chinese dataset containing 1.1 million conversations and 4 million utterances. To the best of our knowledge, MedDialog-(EN,CN) are the largest medical dialogue datasets to date. The dataset is available at https://github.com/UCSD-AI4H/Medical-Dialogue-System


Citations (153)


Summary

The paper introduces two large-scale datasets, MedDialog-EN (English) and MedDialog-CN (Chinese), that are far larger than previous medical dialogue resources.
It details the composition of each dataset: more than 257K English and 1.14M Chinese consultations covering a broad range of medical specialties.
The datasets support training dialogue models that simulate doctor-patient interactions, with the aim of reducing physician workloads and improving telemedicine.

Introduction

"MedDialog: Two Large-scale Medical Dialogue Datasets" presents two substantial datasets, MedDialog-EN and MedDialog-CN, to advance the development of medical dialogue systems in telemedicine. With rapid growth in telemedicine, these datasets aim to alleviate challenges like limited physician time and patient monitoring. By offering extensive English and Chinese medical dialogue data, this work supports AI researchers in training models capable of simulating doctor-patient interactions, thereby enhancing telemedicine's reach and effectiveness.

Dataset Composition and Characteristics

MedDialog-EN Dataset

MedDialog-EN is an English-language dataset of 257,454 patient-doctor conversations comprising 514,908 utterances in total. It covers 96 medical specialties, including cardiology, nephrology, and pharmacology. The conversations were collected from the online healthcare platforms iclinic.com and healthcaremagic.com and span patient inquiries from 2008 to 2020.

Figure 1: An exemplar consultation, which includes (1) description of medical conditions of the patient, (2) dialogue between doctor and patient.

MedDialog-CN Dataset

MedDialog-CN, the Chinese-language counterpart, is considerably larger, consisting of 1,145,231 consultations and 3,959,333 utterances. It covers 172 fine-grained specialties within 29 broader categories. The consultations were extracted from haodf.com and date from 2010 to 2020. The dataset includes consultations with detailed patient medical history, dialogue, and optional diagnosis and treatment recommendations from doctors.

Figure 2: An exemplar consultation, which includes (1) description of medical conditions and history of the patient, (2) dialogue between doctor and patient, and (3) diagnosis and treatment suggestions given by the doctor.
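
As a quick sanity check on the reported sizes, the average number of utterances per consultation can be computed directly from the counts quoted above. The following is a minimal sketch; the variable and function names are illustrative and do not come from the paper or its repository.

```python
# Counts quoted in this section for each dataset: (consultations, utterances).
DATASET_SIZES = {
    "MedDialog-EN": (257_454, 514_908),
    "MedDialog-CN": (1_145_231, 3_959_333),
}


def utterances_per_consultation(consultations: int, utterances: int) -> float:
    """Return the mean number of utterances in one consultation."""
    return utterances / consultations


averages = {
    name: utterances_per_consultation(*counts)
    for name, counts in DATASET_SIZES.items()
}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f} utterances per consultation")
```

With these counts, MedDialog-EN averages exactly 2 utterances per consultation, which would be consistent with one patient message and one doctor reply, while MedDialog-CN averages about 3.46.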

Advantages and Use Cases

The datasets stand out for their scale, their breadth of medical specialties, and the diversity of their patient populations, which helps reduce data bias. This broad coverage enables the training of dialogue systems that aim for doctor-level intelligence across varied medical domains. Such systems can serve as virtual doctors, interacting through natural language to provide clinical advice and proactive patient engagement.

Practical Applications

The datasets enable the development of sophisticated dialogue models that can be incorporated into telemedicine platforms to enhance patient care. These systems can reduce physician workload by managing routine consultations and monitoring chronic conditions remotely. Additionally, the datasets facilitate training models for multilingual environments, addressing language-specific healthcare delivery needs.

Comparative Analysis

MedDialog-EN and MedDialog-CN are far larger than existing medical dialogue datasets in both size and scope. Other datasets, such as Muzhi and COVID-EN, are limited to a few hundred dialogues, whereas MedDialog-EN contains about 257 thousand consultations and MedDialog-CN more than 1.1 million, providing a richer resource for training robust machine learning models.

Conclusion

MedDialog-EN and MedDialog-CN represent critical resources in AI for healthcare, especially in enhancing telemedicine capabilities. By leveraging the scale and diversity of these datasets, researchers can develop advanced dialogue systems that improve accessibility and quality of care. The availability of these datasets encourages further research into multilingual medical dialogue systems and supports innovations in remote medical assistance. If the datasets continue to grow, they could support increasingly sophisticated applications in the healthcare technology domain.

Explain it Like I'm 14

Easy Explanation: MedDialog — Two Large-Scale Medical Dialogue Datasets (for a 14-year-old)

1) What is this paper about?

This paper introduces two big collections of real conversations between patients and doctors. One collection is in English (MedDialog-EN) and the other is in Chinese (MedDialog-CN). These collections are meant to help build and improve computer programs (AI) that can chat with patients about health—like a “virtual doctor” that can ask questions and give general advice.

2) What questions are the researchers trying to answer?

The researchers want to solve a simple problem: AI needs lots of examples to learn how to talk like a doctor. But getting real medical conversations is hard because of privacy and limited access. So they asked:

Can we create very large, diverse, real-world datasets of patient–doctor chats in different languages?
Can these datasets cover many types of diseases and medical specialties (like heart, lungs, kids’ health, etc.) so the AI can learn broadly?
Can we make them available to the research community to speed up progress in telemedicine?

3) How did they build these datasets?

They gathered patient–doctor conversations from trusted health websites where people can consult doctors online.

To make this easy to picture:

“Web crawling” is like sending a smart robot to read lots of web pages and copy the relevant conversations.
An “utterance” is one turn in a chat—like when the patient types a message or the doctor replies.
“Specialties” are specific areas of medicine, like cardiology (heart) or pediatrics (children’s health).

Here’s what they did:

For MedDialog-EN (English), they collected 257,454 consultations (about 0.3 million) with 514,908 utterances (about 0.5 million) from two health platforms. Each consultation has a short description of the patient’s problem plus the back-and-forth chat.
For MedDialog-CN (Chinese), they collected 1,145,231 consultations (about 1.1 million) with 3,959,333 utterances (about 4 million) from a major Chinese health platform. Each consultation includes the patient’s condition and history, the chat, and sometimes the doctor’s diagnosis and treatment suggestions.
They organized the data by medical specialty and included conversations from many years (roughly 2008–2020 for English, 2010–2020 for Chinese).
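
To picture how one consultation might be stored, here is a minimal sketch in Python. The class and field names (`Utterance`, `Consultation`, `speaker`, and so on) and the example messages are made up for illustration; they are not the actual file format or content of the released datasets.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Utterance:
    """One turn in the chat."""
    speaker: str  # "patient" or "doctor" (hypothetical labels)
    text: str


@dataclass
class Consultation:
    """One consultation: a description plus the back-and-forth chat."""
    description: str                    # the patient's description of the problem
    dialogue: List[Utterance] = field(default_factory=list)
    specialty: Optional[str] = None     # e.g. "cardiology"
    diagnosis: Optional[str] = None     # only sometimes present (MedDialog-CN)

    def num_utterances(self) -> int:
        """Count the turns in this consultation's chat."""
        return len(self.dialogue)


# An invented example record, not taken from the datasets.
example = Consultation(
    description="Cough for three days.",
    dialogue=[
        Utterance("patient", "I have had a cough for three days."),
        Utterance("doctor", "Do you also have a fever?"),
    ],
    specialty="pediatrics",
)
print(example.num_utterances())
```

Here the invented record has two utterances, and its `diagnosis` stays empty because that part is optional.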

4) What did they find, and why is it important?

Main results:

These are the largest medical dialogue datasets ever released (at the time of writing).
MedDialog-EN covers 96 specialties (like cardiology, nephrology, pharmacology).
MedDialog-CN covers 29 broad categories and 172 fine-grained specialties.
The data includes patients from many different places and backgrounds (worldwide for English, 31 provinces for Chinese), which helps reduce bias and makes AI training more fair and realistic.

Why it matters:

AI systems learn from examples. More and better data helps them understand how doctors ask the right questions, explain clearly, and offer reasonable suggestions.
These datasets can speed up the creation of safe, helpful medical chatbots that support telemedicine—especially useful for people who live far from hospitals or have trouble seeing a doctor quickly.
Broad coverage means the AI won’t just learn about one or two diseases; it can practice across many medical areas.

5) What’s the bigger impact?

If researchers use these datasets well, we could get better “virtual doctor” assistants that:

Help answer common health questions,
Guide patients on when to seek in-person care,
Support doctors by handling basic follow-ups, which may reduce burnout.

Important note: These AI tools are meant to assist, not replace, real doctors. Safety, privacy, and careful testing are essential. But with large, diverse, multilingual data like MedDialog, the path toward helpful, trustworthy telemedicine tools becomes much clearer.

The datasets are publicly available, so more researchers can build and improve medical dialogue systems faster.


Practical Applications

Practical Applications Derived from the MedDialog Datasets

The paper introduces MedDialog-EN and MedDialog-CN—large-scale, publicly available datasets of patient–doctor dialogues in English and Chinese, spanning many specialties and years. Below are concrete, real-world applications that leverage these datasets for industry, academia, policy, and daily life.

Immediate Applications

Triage and specialty routing models
- Sectors: Healthcare, Software
- What to build: Text classifiers that route incoming patient messages to the right specialty and urgency level (e.g., urgent vs. routine).
- Tools/workflows: PyTorch/Hugging Face Transformers; fine-tune BERT/ClinicalBERT/Chinese-BERT; deploy behind telemedicine “digital front door.”
- Dependencies/assumptions: Human-in-the-loop for final routing; ongoing calibration to local care pathways; dataset license permits commercial use; de-identification and HIPAA/GDPR compliance.
Draft-response assistants for clinicians
- Sectors: Healthcare, Software
- What to build: LLM-based assistants that generate draft replies to patient queries, with clinicians reviewing/editing before sending.
- Tools/workflows: RAG over institutional guidelines + fine-tuned dialogue models; EHR messaging integration.
- Dependencies/assumptions: Explicit human oversight; safety guardrails; institution-specific guideline retrieval; legal review for SaMD implications.
Automated intake and EMR pre-population
- Sectors: Healthcare, Software
- What to build: NER and slot-filling to extract conditions, duration, meds, allergies, and past history from free text and pre-fill structured fields.
- Tools/workflows: spaCy/Stanza, clinical NER models; HL7 FHIR integration.
- Dependencies/assumptions: Local terminology adaptation; physician verification step; privacy and logging controls.
Conversation summarization for “after-visit” notes
- Sectors: Healthcare, Software
- What to build: Summarizers that produce patient-friendly summaries (e.g., SOAP notes + instructions) from chat transcripts.
- Tools/workflows: Fine-tuned summarization LLMs; templated discharge instructions.
- Dependencies/assumptions: Clinician review; alignment with literacy standards; multilingual support as needed.
Retrieval-augmented clinical Q&A support for providers
- Sectors: Healthcare, Software, Academia
- What to build: Internal Q&A copilots that surface similar past cases and guideline snippets while drafting chat responses.
- Tools/workflows: FAISS/Elasticsearch retrieval over MedDialog + local KB; RAG with citation display.
- Dependencies/assumptions: Data governance; deduplication and quality filtering; explicit “for reference only” labeling.
Contact center and telemedicine operations analytics
- Sectors: Healthcare Operations, Finance
- What to build: Topic modeling and intent clustering to forecast demand, inform staff scheduling, and identify operational pain points.
- Tools/workflows: Unsupervised topic models (LDA/BERTopic); BI dashboards.
- Dependencies/assumptions: Anonymized transcripts; drift monitoring as services change.
Conversation quality and compliance auditing
- Sectors: Healthcare, Policy/Quality
- What to build: Automated scoring for clarity, empathy, and adherence to communication standards; QA alerts for coaching.
- Tools/workflows: Fine-tuned classifiers; rubric-based scoring; dashboard for supervisors.
- Dependencies/assumptions: Institution-defined rubrics; avoidance of demographic bias; legal review for workforce monitoring.
Safety and moderation filters for medical forums and chat
- Sectors: Software, Community Platforms
- What to build: Classifiers to detect unsafe advice, misinformation, and need for escalation/emergency referral.
- Tools/workflows: Safety-tuned models that trigger standardized disclaimers and escalation workflows.
- Dependencies/assumptions: Alignment with WHO/CDC/AMA guidance; regular red-team testing; clear user messaging.
Cross-lingual model baselines and transfer learning
- Sectors: Academia, Software
- What to build: Comparative studies and baselines for English–Chinese medical dialogue modeling; domain-adapted MT.
- Tools/workflows: mBERT/XLM-R; bilingual alignment; WWM Chinese BERT; PaddleNLP/jieba for Chinese preprocessing.
- Dependencies/assumptions: Domain-specific MT post-editing; differences in clinical practice across locales.
Education and skills training (OSCE-style simulators)
- Sectors: Education, Healthcare
- What to build: Standardized patient simulators for communication training, questioning strategy, and empathy practice.
- Tools/workflows: Fine-tuned conversational agents; scenario libraries with varied specialties.
- Dependencies/assumptions: Clear labeling as educational only; no diagnosis; supervised debriefing.
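
As one illustration of the specialty routing idea listed above, the sketch below trains a tiny text classifier that maps a patient message to a specialty. It deliberately swaps the suggested fine-tuned BERT-style models for a much simpler bag-of-words multinomial naive Bayes classifier so that it runs with only the Python standard library. The class name `NaiveBayesRouter`, the training messages, and the labels are all invented for illustration and are not drawn from MedDialog.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesRouter:
    """Multinomial naive Bayes with add-one smoothing over word counts."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.label_counts: Counter = Counter()
        self.vocab: Set[str] = set()

    def fit(self, examples: List[Tuple[str, str]]) -> None:
        """Accumulate per-label word counts from (message, label) pairs."""
        for text, label in examples:
            tokens = tokenize(text)
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, text: str) -> str:
        """Return the label with the highest log posterior score."""
        # Words never seen in training are ignored in this sketch.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.label_counts.values())
        best_label, best_score = "", -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training set; a real system would train on labeled dialogues.
TRAIN = [
    ("chest pain and palpitations when climbing stairs", "cardiology"),
    ("high blood pressure and irregular heartbeat", "cardiology"),
    ("itchy red rash on my arms", "dermatology"),
    ("acne and dry skin on my face", "dermatology"),
]

router = NaiveBayesRouter()
router.fit(TRAIN)
prediction = router.predict("I have a rash and itchy skin")
print(prediction)
```

On this toy data the message is routed to `dermatology`. A baseline like this is mainly useful as a reference point before moving to the fine-tuned transformer models named above, and the human-in-the-loop requirement for final routing still applies.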

Long-Term Applications

Autonomous virtual triage and primary care assistants
- Sectors: Healthcare, Consumer Health
- What to build: End-to-end triage assistants conducting structured history-taking, offering next-step guidance, and integrating with care pathways.
- Tools/workflows: Dialogue policies + medical knowledge graphs; dynamic RAG; escalation to clinicians.
- Dependencies/assumptions: Regulatory approval (FDA/CE), clinical trials for safety/effectiveness, robust crisis detection.
EHR-integrated ambient scribe for telemedicine
- Sectors: Healthcare IT
- What to build: Real-time generation of structured notes, orders, and coding from live chat sessions.
- Tools/workflows: ASR+LLM pipeline for multimodal (text/voice) input; FHIR integration; ICD/CPT code suggestion.
- Dependencies/assumptions: Very high accuracy; auditability; provider liability considerations.
Automated clinical coding and billing from chat transcripts
- Sectors: Healthcare, Finance/Revenue Cycle
- What to build: Systems that infer ICD/CPT codes from conversation context and generate compliant documentation.
- Tools/workflows: Sequence labeling + classifier ensembles; coder-in-the-loop verification.
- Dependencies/assumptions: Additional labeled data for coding; payer-specific rules; extensive validation.
Red-flag detection and escalation systems
- Sectors: Healthcare, Emergency Services
- What to build: Real-time detection of alarming symptoms (e.g., stroke, sepsis indicators) in chats with immediate escalation.
- Tools/workflows: Thresholded risk scores; integrated paging/escalation orchestration.
- Dependencies/assumptions: Clinical guardrails; low false-negative rates; continuous post-market surveillance.
Public health surveillance from conversational signals
- Sectors: Public Health, Policy
- What to build: Early-warning systems that mine aggregated chat patterns to detect outbreaks and trends (e.g., flu, RSV).
- Tools/workflows: Privacy-preserving aggregation; anomaly detection; secure data-sharing frameworks.
- Dependencies/assumptions: Strong de-identification; IRB/ethics approval; bias correction for platform demographics.
Global multilingual telemedicine copilots
- Sectors: Global Health, NGOs, Telehealth
- What to build: Multilingual assistants adapted to local guidelines and cultural communication styles.
- Tools/workflows: Cross-lingual transfer; locale-specific RAG; human translator fallback.
- Dependencies/assumptions: Local clinical standard alignment; cultural competence; jurisdictional regulatory differences.
Synthetic data generation and privacy-preserving training
- Sectors: Academia, Software, Policy
- What to build: Generative models producing high-fidelity synthetic dialogues for low-resource specialties and privacy research.
- Tools/workflows: Diffusion/LLM-based data synthesis; differential privacy; utility–privacy evaluation suites.
- Dependencies/assumptions: Proven reduction of re-identification risk; representativeness without leakage.
Benchmarking and certification frameworks for clinical dialogue AI
- Sectors: Policy/Regulation, Standards Bodies
- What to build: Standardized test suites and leaderboards for safety, factuality, empathy, and bias in clinical chat.
- Tools/workflows: Curated challenge sets; external knowledge grounding checks; scenario-based evaluation.
- Dependencies/assumptions: Consensus metrics; participation from regulators, clinicians, and vendors.
Personalized adherence and chronic disease management chatbots
- Sectors: Healthcare, Consumer Health
- What to build: Longitudinal conversation agents that monitor symptoms, nudge adherence, and coordinate care teams.
- Tools/workflows: Long-term user state tracking; integration with wearables and pharmacy systems.
- Dependencies/assumptions: Consent and data linkage; safety nets for deterioration; clinical oversight.
Clinical trials prescreening via conversational intake
- Sectors: Pharma, Research
- What to build: Agents that screen patients against protocol criteria during chat and flag potential matches.
- Tools/workflows: Criteria extraction; eligibility inference; site referral workflows.
- Dependencies/assumptions: Accurate criteria parsing; handling of protected populations; sponsor SOP alignment.
Cross-cultural communication and bias audits
- Sectors: Policy, Academia, Healthcare
- What to build: Studies and tools to evaluate differential performance across language/culture; bias mitigation training.
- Tools/workflows: Counterfactual evaluation; fairness dashboards; targeted data augmentation.
- Dependencies/assumptions: Need for demographic/cultural annotations (not included by default); partnership with ethics boards.
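
The anomaly detection step mentioned under public health surveillance can be sketched with a simple trailing-window z-score over aggregated daily counts. This is only one of many possible detectors and is not described in the paper; the function name `flag_anomalies`, the window length, the threshold, and the daily counts are all hypothetical.

```python
from statistics import mean, stdev
from typing import List


def flag_anomalies(counts: List[int], window: int = 7, threshold: float = 3.0) -> List[int]:
    """Return indices whose count exceeds the mean of the preceding
    `window` values by more than `threshold` sample standard deviations."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no usable z-score; skip it here.
            continue
        z_score = (counts[i] - mu) / sigma
        if z_score > threshold:
            flagged.append(i)
    return flagged


# Invented daily counts of chats mentioning a symptom, with one spike.
daily_mentions = [12, 15, 11, 14, 13, 12, 16, 14, 13, 41, 15]
print(flag_anomalies(daily_mentions))
```

With these made-up counts, only the spike at index 9 is flagged. In practice the counts would come from privacy-preserving aggregation, and the detector would need correction for platform demographics, as noted above.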

Notes on feasibility and assumptions across applications:

Licensing and terms of use: Although the dataset is publicly released on GitHub, commercial deployment requires confirming the legality of using web-crawled dialogues from source platforms.
Privacy and de-identification: Ensure removal of PHI and compliance with HIPAA/GDPR/CCPA; adopt differential privacy where applicable.
Safety and regulation: Many clinician-facing tools may be considered medical devices; plan for regulatory pathways (FDA, EU MDR) and clinical validation.
Data shift and guideline updates: Dialogues span 2008–2020; models must be updated to current clinical guidelines and medication practices.
Human-in-the-loop: For near-term safety, keep clinicians in the loop for diagnosis, triage decisions, and documentation approval.
Localization: Clinical norms and standards differ by country; adapt models to local practice, language, and culture.
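
To make the de-identification note more concrete, the sketch below masks two simple kinds of identifiers with regular expressions. It is an assumption-laden illustration rather than a compliant de-identification pipeline: the patterns and the `scrub` function are hypothetical, they cover only basic email addresses and one US-style phone number format, and they do not handle names, addresses, dates, or other PHI.

```python
import re

# Hypothetical patterns covering only simple emails and
# phone numbers written as three groups of 3, 3, and 4 digits.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def scrub(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


# Invented sample message, not taken from the datasets.
sample = "Reach me at jane.doe@example.com or 555-123-4567 after 5pm."
print(scrub(sample))
```

For the sample above this prints `Reach me at [EMAIL] or [PHONE] after 5pm.` Rule-based masking like this is usually only a first pass; clinical NER models and manual review would typically be layered on top.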



Authors (13)


GitHub

GitHub - UCSD-AI4H/Medical-Dialogue-System (521 stars)
