Conversational Assistants to support Heart Failure Patients: comparing a Neurosymbolic Architecture with ChatGPT

Published 24 Apr 2025 in cs.CL | (2504.17753v1)

Abstract: Conversational assistants are becoming more and more popular, including in healthcare, partly because of the availability and capabilities of LLMs. There is a need for controlled, probing evaluations with real stakeholders which can highlight advantages and disadvantages of more traditional architectures and those based on generative AI. We present a within-group user study to compare two versions of a conversational assistant that allows heart failure patients to ask about salt content in food. One version of the system was developed in-house with a neurosymbolic architecture, and one is based on ChatGPT. The evaluation shows that the in-house system is more accurate, completes more tasks and is less verbose than the one based on ChatGPT; on the other hand, the one based on ChatGPT makes fewer speech errors and requires fewer clarifications to complete the task. Patients show no preference for one over the other.


Summary

Comparative Analysis of Conversational Assistants for Heart Failure Patients

The paper "Conversational Assistants to Support Heart Failure Patients: Comparing a Neurosymbolic Architecture with ChatGPT" examines how conversational assistants can support patient care in heart failure management. The study compares two dialogue systems that let heart failure patients ask about the salt content of food: an in-house neurosymbolic system (HFFood-NS) and a ChatGPT-based system (HFFood-GPT). It analyzes their performance through intrinsic and extrinsic evaluations conducted in a user study with hospitalized African American heart failure patients.

Intrinsic Evaluation

The intrinsic evaluation used metrics such as task completion, accuracy, slot accuracy, and communication efficiency. HFFood-NS achieved higher task completion and accuracy rates (84% and 37%, respectively) than HFFood-GPT (62% and 24%, respectively). HFFood-NS also gave more concise responses and identified slots more accurately across the food datasets. Here a slot is a piece of task information that the dialogue must fill, such as the food item being asked about.

Conversely, HFFood-GPT, based on OpenAI’s ChatGPT, made fewer speech errors and needed fewer clarification questions to complete a task, although its responses were more verbose. Its performance was nevertheless inconsistent, owing to unexpected slot errors and a reliance on assumed inputs.
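As a rough illustration of how metrics such as task completion, accuracy, and slot accuracy might be computed from annotated dialogue logs, consider the minimal sketch below. The `Task` record, the metric definitions, and the sample data are all assumptions made for illustration; the paper's exact definitions and annotation scheme may differ.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One annotated task from a dialogue log (hypothetical schema)."""
    completed: bool        # did the dialogue reach a final answer?
    correct: bool          # did that answer match the reference answer?
    gold_slots: dict       # slot values annotated by a human
    predicted_slots: dict  # slot values extracted by the system


def task_completion_rate(tasks):
    """Fraction of tasks in which the system produced a final answer."""
    return sum(t.completed for t in tasks) / len(tasks)


def accuracy(tasks):
    """Fraction of all tasks that were completed with a correct answer
    (assumed definition: an uncompleted task counts as incorrect)."""
    return sum(t.completed and t.correct for t in tasks) / len(tasks)


def slot_accuracy(tasks):
    """Fraction of annotated slots whose value the system got exactly right."""
    total = sum(len(t.gold_slots) for t in tasks)
    hits = sum(
        1
        for t in tasks
        for name, value in t.gold_slots.items()
        if t.predicted_slots.get(name) == value
    )
    return hits / total if total else 0.0


# Made-up sample data, not taken from the study.
tasks = [
    Task(True, True, {"food": "bread", "quantity": "1 slice"},
         {"food": "bread", "quantity": "1 slice"}),
    Task(True, False, {"food": "soup", "quantity": "1 cup"},
         {"food": "soup", "quantity": "1 bowl"}),
    Task(False, False, {"food": "ham"}, {}),
    Task(True, True, {"food": "egg"}, {"food": "egg"}),
]

print(task_completion_rate(tasks))  # 0.75
print(accuracy(tasks))              # 0.5
print(round(slot_accuracy(tasks), 3))
```

Keeping completion and accuracy as separate functions mirrors the distinction drawn above: a system can finish many tasks while still answering a large share of them incorrectly.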

Extrinsic Evaluation

The extrinsic evaluation surveyed patients' perceptions of their interactions with each system. Patients rated both systems on criteria such as the clarity and usefulness of the answers and questions. Participants expressed no overt preference for either system, with a slight tilt towards HFFood-NS for its concise and direct responses.

Preference analysis indicated that users with higher health literacy favored HFFood-GPT for its comprehensible explanations, despite its verbosity. HFFood-NS was praised for its fast-paced interactions and for aligning more closely with a task-oriented flow.

Theoretical and Practical Implications

From a theoretical standpoint, the paper suggests that neurosymbolic architectures hold promise for healthcare applications because of their high accuracy and a structured design that facilitates thorough error analysis. These architectures afford explainability and reliability, both vital for patient-centric systems, especially in critical interventions like dietary management for heart failure patients.

On the practical side, the paper highlights the need for adaptive systems that can handle complex queries and diverse patient inputs. Patients need flexible systems and intelligible feedback to manage their health and make informed choices, which points to hybrid models that combine the robustness of neurosymbolic systems with the conversational fluency of LLMs.
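One possible reading of such a hybrid design is sketched below: a deterministic, symbolic step decides what to say (ask for a missing slot, or look up a value), and a swappable verbalizer decides how to say it. This is only a sketch under stated assumptions, not the architecture of either system in the paper. All names (`SODIUM_MG`, `symbolic_answer`, `template_verbalizer`, `respond`) are hypothetical, and the table values are placeholders rather than nutritional data.

```python
from typing import Callable, Optional

# Placeholder lookup table: values are illustrative only, NOT real nutrition facts.
SODIUM_MG = {
    ("bread", "slice"): 150,
    ("soup", "cup"): 800,
}


def symbolic_answer(food: Optional[str], unit: Optional[str]) -> dict:
    """Deterministic step: request a missing slot or look up the stored value."""
    if food is None:
        return {"act": "clarify", "slot": "food"}
    if unit is None:
        return {"act": "clarify", "slot": "unit"}
    value = SODIUM_MG.get((food, unit))
    if value is None:
        return {"act": "unknown", "food": food}
    return {"act": "inform", "food": food, "unit": unit, "sodium_mg": value}


def template_verbalizer(result: dict) -> str:
    """Fixed templates; an LLM-based rephraser could be swapped in here,
    restricted to rewording the structured result it is given."""
    if result["act"] == "clarify":
        return f"Which {result['slot']} do you mean?"
    if result["act"] == "unknown":
        return f"I do not have salt information for {result['food']}."
    return (
        f"One {result['unit']} of {result['food']} has about "
        f"{result['sodium_mg']} mg of sodium."
    )


def respond(
    food: Optional[str],
    unit: Optional[str],
    verbalize: Callable[[dict], str] = template_verbalizer,
) -> str:
    """Facts come from the symbolic step; wording comes from the verbalizer."""
    return verbalize(symbolic_answer(food, unit))


print(respond("bread", "slice"))
print(respond("bread", None))
print(respond("ham", "slice"))
```

Because the numeric value never passes through free-form generation, errors in this sketch can be traced to either the slot values or the lookup table, which is the kind of error analysis the structured design is credited with above.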

Future Directions

Looking forward, the paper proposes further exploration of hybrid systems that combine the strengths of both architectural styles. Developing better evaluation frameworks for dialogue systems, incorporating both automated metrics and human assessments, remains a focus area. The study also advocates larger-scale testing and an expanded dataset covering more comprehensive dietary data, promoting inclusivity in healthcare applications.

In conclusion, the paper emphasizes that while large language models are pivotal in conversational interfaces, their deployment in sensitive domains like healthcare necessitates control and transparency. Continued refinement of these systems could herald improved patient education and self-care support tools, ultimately contributing to better health outcomes for heart failure patients.