Conversational Diagnosis System
- Conversational Diagnosis System is a modular AI architecture that employs multi-agent LLM workflows to mimic expert diagnostic interviews across various domains.
- Key components include an LLM core, specialized sub-agents for inquiry, topic management, and knowledge retrieval, all coordinated via a dynamic topic stack.
- Evaluation metrics indicate enhanced dialogue efficiency, transparency, and reliability in handling complex diagnostic inquiries in high-stakes environments.
A Conversational Diagnosis System is a task-oriented, modular AI architecture designed to emulate expert-driven diagnostic interviews in domains such as medicine, law, and technical support. These systems orchestrate multi-agent LLM workflows to proactively elicit information, manage dialogue topics, and deliver domain-grounded recommendations in a transparent, auditable, and efficient manner. Core innovations include dynamic topic management, explicit state tracking, multi-agent subrole allocation, and domain-adaptive reward learning for optimal information acquisition and user interaction fidelity (Cao, 2023).
1. Modular System Architecture and Data Flow
A modern conversational diagnosis system such as DiagGPT (Cao, 2023) is built around a prompt-driven LLM core and a multi-agent orchestration layer. The key system components are:
- LLM Core: A large pre-trained LLM (e.g., GPT-4) exposed via a role-adaptive dialogue prompt template, responsible for open-ended understanding, generation, and flexible response construction.
- Multi-Agent Coordination Module: Orchestrates sub-agents with specialized responsibilities:
- Topic Manager (explicit topic stack manipulation via finite action sets)
- Topic Enricher (contextualizes topic labels into full prompt slots)
- Context Manager (aggregates history for windowed retrieval or summarization)
- Knowledge Retrieval Agent (external KB or ontology lookups)
- Recommendation Agent (final task/diagnosis outputs)
- Automatic Topic Management: Dialogue topics are managed as a stack $s \in T^*$, with stack transitions governed by a transition function $\delta: T^* \times A \to T^*$, where the action set $A$ includes operations such as *create*, *finish*, *jump*, and *stay*.
- Data Flow Pipeline: Each user utterance flows through the Context Manager, to the Topic Manager (which selects a stack action), through the stack update and Topic Enricher, then optionally to the Knowledge Retrieval Agent, and finally to response generation and history update.
- Explicit memory/stack states: Each topic is tracked via a binary state vector indicating activity, completion, and re-entry status.
This modular design promotes transparency, enables flexible adaptation to new domains, and supports advanced conversation management mechanisms that are robust to dialogue complexity and real-world unpredictability (Cao, 2023).
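The explicit topic stack with per-topic state bits described above can be sketched as a minimal data structure. This is an illustrative reconstruction, not the paper's implementation; the names `Topic`, `TopicStack`, and the field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    label: str
    completed: bool = False   # "completion" bit of the state vector
    reopened: bool = False    # "re-entry" bit of the state vector

@dataclass
class TopicStack:
    """Minimal sketch of the explicit topic stack; the active topic is the top."""
    stack: list = field(default_factory=list)
    finished: dict = field(default_factory=dict)  # label -> Topic, kept for re-entry

    def create(self, label: str) -> None:
        # Push a brand-new topic onto the stack.
        self.stack.append(Topic(label))

    def finish(self) -> Topic:
        # Close the active topic and archive it so it can be reopened later.
        topic = self.stack.pop()
        topic.completed = True
        self.finished[topic.label] = topic
        return topic

    def jump(self, label: str) -> None:
        # Reopen a finished topic, or lift an in-stack topic to the top.
        if label in self.finished:
            topic = self.finished.pop(label)
            topic.completed, topic.reopened = False, True
        else:
            topic = next(t for t in self.stack if t.label == label)
            self.stack.remove(topic)
        self.stack.append(topic)

    def top(self):
        return self.stack[-1] if self.stack else None
```

Under this sketch, finishing a topic exposes the one beneath it, and a jump re-activates an archived topic with its re-entry bit set.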
2. Multi-Agent Subrole Design and Communication
A conversational diagnosis system leverages LLM subroles as specialized prompt-templated agents:
| Subagent | Core Function | Invocation/Output |
|---|---|---|
| Inquiry Agent | Proactive, topic-driven question generation | "Action" |
| Knowledge Agent | External fact retrieval for evidence/knowledge grounding | "Response" |
| Recommendation Agent | Synthesis of conclusions and recommendations | "Response" |
| Topic Manager | Topic stack state transitions (e.g., create/jump) | "Action" |
| Context/Memory Agent | Accumulation and windowed context maintenance | Conversation history |
All agents interact via a shared memory buffer that exposes the topicStack and dialogueHistory. System actions—topic transitions or user-facing replies—are materialized as explicit function calls or LLM prompt completions. If multiple agents contribute to the same turn, their outputs are deterministically concatenated in a fixed template for downstream consumption.
The communication protocol is intentionally stateless for the LLM core—the stack and memory serve as the persistent, interpretable state representation (Cao, 2023).
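The stateless-core protocol can be sketched as a shared memory buffer plus a prompt renderer: every sub-agent call is just a template filled from the same persistent state. The names `SharedMemory` and `render_prompt` and the prompt layout are illustrative assumptions, not the paper's format.

```python
from dataclasses import dataclass, field

@dataclass
class SharedMemory:
    """Shared buffer exposing the persistent state (topicStack, dialogueHistory)
    to all sub-agents; the LLM core itself stays stateless between calls."""
    topic_stack: list = field(default_factory=list)
    dialogue_history: list = field(default_factory=list)  # (speaker, utterance) pairs

def render_prompt(memory: SharedMemory, subrole: str, user_utt: str) -> str:
    # Each agent is a prompt template filled from the shared memory;
    # only a recent window of the history is included.
    history = "\n".join(f"{speaker}: {utt}" for speaker, utt in memory.dialogue_history[-6:])
    return (
        f"[role: {subrole}]\n"
        f"[topic stack: {' > '.join(memory.topic_stack)}]\n"
        f"[history]\n{history}\n"
        f"[user] {user_utt}"
    )
```

Because the full interpretable state travels inside every prompt, any agent's turn can be audited or replayed from the buffer alone.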
3. Formal Topic Management and Scheduling
Automatic topic management is formalized as a stack-based finite-state process:
- Action space: $A = \{\textit{create}, \textit{finish}, \textit{jump}, \textit{stay}\}$, with actions optionally taking a topic argument.
- Stack transitions: $\delta: T^* \times A \times T \to T^*$ precisely governs stack evolution for each action and argument (see Section 3.2 in (Cao, 2023)).
- Topic state vector: For each topic $t \in T$, a binary state vector $\sigma(t) = (\sigma_{\text{act}}(t), \sigma_{\text{done}}(t), \sigma_{\text{reopen}}(t)) \in \{0,1\}^3$ records:
  - $\sigma_{\text{act}}(t) = 1$ if $t$ is active (top of stack), else $0$
  - $\sigma_{\text{done}}(t) = 1$ if $t$ is completed, else $0$
  - $\sigma_{\text{reopen}}(t) = 1$ if $t$ has been reopened, else $0$
- Priority scheduling: Each topic $t$ is assigned a priority score $p(t)$, and the Topic Manager acts optimally as $a^* = \arg\max_{a \in A} \, p(\mathrm{top}(\delta(s, a)))$, selecting the action whose resulting top-of-stack topic has the highest priority.
This approach systematically prioritizes under-explored or high-uncertainty topics, promotes conversation efficiency, and supports strategic topic revisiting (Cao, 2023).
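The priority-driven action choice can be sketched as a one-step lookahead over candidate stack transitions. The function name `choose_action` and the enumeration of candidates are illustrative assumptions, not the paper's implementation.

```python
def choose_action(priorities: dict, candidates: dict) -> str:
    """Pick the stack action whose resulting top-of-stack topic has the
    highest priority score p(t).

    priorities: topic label -> priority score p(t)
    candidates: action name -> the topic stack that action would produce
                (a hypothetical one-step enumeration of the transition function)
    """
    def top_priority(next_stack):
        # An empty stack (e.g. after finishing the last topic) is least preferred.
        return priorities.get(next_stack[-1], 0.0) if next_stack else float("-inf")

    return max(candidates.items(), key=lambda kv: top_priority(kv[1]))[0]
```

For example, if an under-explored topic carries a higher priority than the current one, the jump that surfaces it wins the arg-max.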
4. Learning Objectives, Data, and Inference Dynamics
- No direct LLM fine-tuning: All adaptation is mediated through prompt engineering and reward-learned subcomponents.
- Supervised and RL objectives:
- Topic-action prediction by cross-entropy: $\mathcal{L}_{\text{CE}} = -\sum_t \log \pi_\theta(a_t \mid s_t, h_t)$, where $a_t$ is the annotated action, $s_t$ the stack state, and $h_t$ the dialogue history at turn $t$.
- Dialogue rollout reward: $R = \sum_t \gamma^t r_t$, accumulating per-turn rewards $r_t$ under discount factor $\gamma$.
- Policy-gradient for action selection: $\nabla_\theta J(\theta) = \mathbb{E}\big[ R \, \nabla_\theta \log \pi_\theta(a_t \mid s_t, h_t) \big]$.
- Wizard-of-Oz and simulation: Data is annotated at turn-level for topic, action, and final goal status. Simulated user agents (e.g., "UserGPT") provide consistent, policy-constrained user behavior, essential for robust system tuning.
- Inference pseudocode (Section 5.1, (Cao, 2023)):

```
function handle_user_utterance(user_utt):
    context_mgr.update(user_utt)
    action, action_arg = TopicManager(topic_stack, dialogue_history, user_utt, action_list)
    topic_stack = apply_action(topic_stack, action, action_arg)
    current_topic = topic_stack.top()
    enriched_topic = TopicEnricher(current_topic, dialogue_history)
    if enriched_topic.mode == "answer":
        knowledge = KnowledgeAgent.fetch(enriched_topic)
    else:
        knowledge = None
    reply = ChatAgent(enriched_topic, dialogue_history, knowledge)
    dialogue_history.append((user_utt, reply))
    return reply
```

- Practical scenario (medical case): Checklist topics guide the stack transitions through demographics, presenting complaints, symptom duration, past history, and culminating in closed recommendations (Cao, 2023).
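The policy-gradient objective for topic-action selection can be illustrated with a tabular softmax policy. This is a toy REINFORCE-style sketch under stated assumptions (one action per episode, tabular logits `theta`), not the paper's training procedure.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of action logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(theta, action, reward, lr=0.1):
    """One REINFORCE update on tabular action logits:
    theta[a] += lr * reward * (1[a == action] - pi(a)),
    i.e. the gradient of reward * log pi(action) w.r.t. each logit."""
    probs = softmax(theta)
    return [
        t + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (t, p) in enumerate(zip(theta, probs))
    ]
```

A positive rollout reward raises the logit of the chosen action and lowers the others, which is the update direction the policy-gradient objective above prescribes.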
5. Evaluation Framework and Quantitative Metrics
Performance is evaluated along multiple axes:
| Metric | Definition / Outcome |
|---|---|
| Round Count (RC) | Mean dialogue turns to goal; lower is better (DiagGPT: 7.0 vs GPT-4: 7.7) |
| Completion Rate (CR) | Fraction of checklist items closed (1.0 for both systems) |
| Success Rate (SR) | Binary success of reaching the final goal (1.0 for both systems) |
| Response Quality (RQ) | Human/LLM-rated 1–10 (DiagGPT: 9.0, GPT-4: 9.0) |
| Comparison Score (CS) | Win rate vs. GPT-4 baseline (DiagGPT: +1.5, GPT-4: 0.0) |
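The aggregate metrics RC, CR, and SR can be computed directly from per-dialogue records, as in the following sketch (the record field names are illustrative, not from the paper).

```python
def evaluate_dialogues(dialogues):
    """Compute Round Count (mean turns to goal), Completion Rate (mean
    fraction of checklist items closed), and Success Rate (fraction of
    dialogues reaching the final goal)."""
    n = len(dialogues)
    rc = sum(d["turns"] for d in dialogues) / n
    cr = sum(d["items_closed"] / d["items_total"] for d in dialogues) / n
    sr = sum(1 for d in dialogues if d["goal_reached"]) / n
    return {"RC": rc, "CR": cr, "SR": sr}
```

Response Quality and Comparison Score, by contrast, require human or LLM judges and cannot be derived from dialogue logs alone.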
Observed error modes and limitations:
- Topic Manager hallucination (spurious topics)
- Computational cost and latency from multiple LLM invocations per turn
- Prompt sensitivity, requiring post-processing gatekeepers (Cao, 2023)
6. Guidelines for Adaptation to New Domains
For deployment in novel diagnostic environments, the following best practices are specified:
- Domain-specific topic checklists: Derive exhaustive, structured topic schemas for the target specialty.
- Plug-in knowledge agent: Integrate specialty ontologies (e.g., UMLS, SNOMED CT) with the Knowledge Retrieval Agent.
- Prompt adaptation: Tailor system prompts for domain-specific professional language ("You are a board-certified cardiologist...").
- Safety and fact-checking: Insert a hallucination filter against trusted databases or append disclaimers to generated outputs.
- Evaluation and calibration: Obtain expert-annotated dialogues for turn-level supervision, with potential fine-tuning of topic-action policies.
- Simulated user procedures: Configure user agents to represent real-world refusals, clarifications, or ambiguity, tightly coupled to the domain (Cao, 2023).
This scheme ensures rapid adaptation and safety when porting the conversational diagnosis system to high-stakes, specialty-specific, or regulated domains.
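The adaptation guidelines above can be collected into a single domain configuration that seeds the topic stack and wires in the specialty knowledge backend. The config keys, the `CARDIOLOGY_CONFIG` example, and `initial_topic_stack` are hypothetical illustrations, not an interface defined by the paper.

```python
# Hypothetical domain configuration for porting the system to cardiology.
CARDIOLOGY_CONFIG = {
    "system_prompt": "You are a board-certified cardiologist conducting an intake interview.",
    "topic_checklist": [
        "demographics",
        "presenting_complaint",
        "symptom_duration",
        "past_cardiac_history",
        "medications",
        "recommendation",
    ],
    "knowledge_backend": "SNOMED CT",   # plug-in ontology for the Knowledge Retrieval Agent
    "safety": {"hallucination_filter": True, "append_disclaimer": True},
}

def initial_topic_stack(config):
    # The first checklist item becomes the initial active topic; subsequent
    # items are pushed by the Topic Manager as the interview progresses.
    return [config["topic_checklist"][0]]
```

Keeping the checklist, prompt, knowledge backend, and safety switches in one declarative object is what lets the same orchestration code serve medicine, law, or technical support unchanged.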
7. Significance and Generalizability
The DiagGPT conversational diagnosis system establishes a general blueprint for LLM-driven, multi-agent, topic-stack-based reasoning in diagnostic dialogues. By decoupling task-directed dialogue control from raw LLM output, it supports interpretable, robust, and efficient workflows capable of integrating domain-specific knowledge, dynamic memory, and strategic user interaction. This paradigm is especially suited to complex, lightly-structured consultation and triage tasks across health, law, and technical domains, and provides a rigorous, mathematical foundation for ongoing research and real-world deployment (Cao, 2023).