
A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?

Published 23 Sep 2024 in cs.CL and cs.AI | (2409.15277v1)

Abstract: LLMs have exhibited remarkable capabilities across various domains and tasks, pushing the boundaries of our knowledge in learning and cognition. The latest model, OpenAI's o1, stands out as the first LLM with an internalized chain-of-thought technique using reinforcement learning strategies. While it has demonstrated surprisingly strong capabilities on various general language tasks, its performance in specialized fields such as medicine remains unknown. To this end, this report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality. Specifically, our evaluation encompasses 6 tasks using data from 37 medical datasets, including two newly constructed and more challenging question-answering (QA) tasks based on professional medical quizzes from the New England Journal of Medicine (NEJM) and The Lancet. These datasets offer greater clinical relevance compared to standard medical QA benchmarks such as MedQA, translating more effectively into real-world clinical utility. Our analysis of o1 suggests that the enhanced reasoning ability of LLMs may (significantly) benefit their capability to understand various medical instructions and reason through complex clinical scenarios. Notably, o1 surpasses the previous GPT-4 in accuracy by an average of 6.2% and 6.6% across 19 datasets and two newly created complex QA scenarios. But meanwhile, we identify several weaknesses in both the model capability and the existing evaluation protocols, including hallucination, inconsistent multilingual ability, and discrepant metrics for evaluation. We release our raw data and model outputs at https://ucsc-vlaa.github.io/o1_medicine/ for future research.


Summary

  • The paper demonstrates o1’s enhanced understanding with a 72.6% F1 score in concept recognition, outperforming GPT-4 and GPT-3.5.
  • The paper shows that o1’s diagnostic reasoning achieved up to an 8.9% improvement over GPT-4, aided by effective CoT prompting strategies.
  • The paper identifies practical challenges, such as risks of hallucination and inconsistent multilingual performance, highlighting the need for refined evaluation metrics.


The paper "A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?" provides a comprehensive exploration of OpenAI's latest LLM, o1, focusing on its application in medical scenarios. This study examines three critical dimensions of the model's capabilities: understanding, reasoning, and multilinguality, by evaluating its performance on an extensive array of medical datasets.

Background and Motivation

LLMs have advanced significantly in recent years, demonstrating strong problem-solving abilities across various domains. The introduction of successive models such as GPT-3.5, GPT-4, and now o1 has propelled this progress further. The o1 model distinguishes itself with an internalized chain-of-thought (CoT) reasoning technique, honed using reinforcement learning strategies. Although previous LLMs have shown considerable prowess on general tasks, their utility in specialized fields like medicine remains an open question. This paper addresses that gap by evaluating o1's capabilities on medical tasks, thereby exploring the potential of LLMs to support clinical decision-making.

Evaluation Methodology

The evaluation framework involves three primary aspects:

  1. Understanding: The ability to comprehend medical concepts from texts.
  2. Reasoning: The capability to perform logical reasoning to arrive at medical conclusions.
  3. Multilinguality: The proficiency in handling medical tasks across different languages.

The authors curated a thorough evaluation suite comprising 37 datasets across six tasks, including newly developed challenging QA datasets from professional medical quizzes. The evaluation protocol involved various prompting strategies such as direct prompting, CoT prompting, and few-shot learning, implemented across different models including o1, GPT-4, GPT-3.5, MEDITRON-70B, and Llama3-8B.
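The three prompting strategies differ only in how the input is assembled before it is sent to the model. As a minimal illustration (the template wording and the `build_prompt` helper are assumptions, not the paper's exact prompts), the strategies can be sketched as:

```python
def build_prompt(question: str, strategy: str = "direct", examples=None) -> str:
    """Assemble a QA prompt under one of the three evaluated strategies.

    - direct: ask for the answer immediately.
    - cot: ask the model to reason step by step before answering.
    - few_shot: prepend worked (question, answer) examples.
    """
    if strategy == "direct":
        return f"Question: {question}\nAnswer:"
    if strategy == "cot":
        return (f"Question: {question}\n"
                "Let's think step by step, then give the final answer.")
    if strategy == "few_shot":
        shots = "\n\n".join(f"Question: {q}\nAnswer: {a}"
                            for q, a in (examples or []))
        return f"{shots}\n\nQuestion: {question}\nAnswer:"
    raise ValueError(f"unknown strategy: {strategy}")
```

The same question is then run through each strategy and each model, so that any accuracy difference can be attributed to the prompting scheme rather than the input content.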

Key Findings

One of the primary revelations from the study is the notable enhancement in o1's understanding capabilities. On tasks like concept recognition and text summarization, o1 outperformed other models significantly. For example, it achieved a 72.6% F1 score on concept recognition datasets, surpassing GPT-4 by 7.6% and GPT-3.5 by a substantial margin of 26.6%.
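For concept recognition, F1 is the harmonic mean of precision and recall over the sets of extracted concepts. A minimal sketch of this computation (the exact matching granularity used in the paper, e.g. span-level vs. entity-level, is an assumption here):

```python
def f1_score(predicted: set, gold: set) -> float:
    """Set-level F1 between predicted and gold concept mentions."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)        # true positives: concepts in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)   # fraction of predictions that are correct
    recall = tp / len(gold)           # fraction of gold concepts recovered
    return 2 * precision * recall / (precision + recall)
```

Per-example scores are then averaged over the dataset to produce figures like the 72.6% reported above.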

When it comes to reasoning, especially in diagnostic scenarios, o1 demonstrated superior performance. On the newly constructed NEJMQA and LancetQA tasks, o1 improved over GPT-4 by 8.9% and over GPT-3.5 by 27.1%. The model also showed strength on mathematical reasoning tasks such as MedCalc-Bench, with a significant 9.4% improvement over GPT-4.

Furthermore, the o1 model's ability to generate more concise and accurate responses highlights its practical utility in real-world clinical settings. However, despite its advancements, o1 remains prone to hallucinations, as indicated by the AlignScore metrics. This limitation underscores the persistent challenge of hallucination in modern LLMs.

Advanced Prompting Techniques

Interestingly, the study reveals that despite being trained with CoT data, o1 still benefits from CoT prompting on medical QA tasks, showing an average accuracy boost of 3.18%. However, more elaborate prompting strategies such as self-consistency and Reflexion did not yield comparable gains, suggesting that the effectiveness of these techniques varies by model and task.
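Self-consistency works by sampling several independent chain-of-thought completions for the same question and taking a majority vote over the final answers. A minimal sketch under the assumption that answers are short option labels (the sampling and answer-extraction details are simplifications of the actual protocol):

```python
from collections import Counter

def self_consistency_vote(answers: list[str]) -> str:
    """Majority vote over final answers extracted from sampled CoT completions.

    Answers are normalized (whitespace stripped, upper-cased) so that
    superficial variants of the same option label are counted together.
    """
    normalized = [a.strip().upper() for a in answers]
    counts = Counter(normalized)
    return counts.most_common(1)[0][0]
```

The finding above suggests that for o1, aggregating multiple sampled chains in this way adds cost without a reliable accuracy gain, presumably because a single internalized reasoning chain is already fairly stable.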

Multilingual and Metric Challenges

While o1 excelled in multilingual QA tasks, it struggled with complex multilingual scenarios, particularly in the Chinese dataset, AI Hospital. This performance discrepancy suggests that o1’s training may lack sufficient multilingual CoT data, which is critical for complex reasoning.

A notable discussion point is the inconsistency of evaluation metrics. Different metrics yielded varied performance results for the same tasks, highlighting the need for more reliable and consistent evaluation criteria for future LLMs.

Implications and Future Directions

The findings from this study suggest that models like o1 represent a step closer to realizing an AI doctor capable of assisting in clinical decision-making. The model's strong performance in understanding and reasoning tasks enhances its potential as a reliable clinical tool. However, the persistent issues of hallucination and inconsistent multilingual performance necessitate further research.

The future of AI in medicine will likely involve addressing these limitations, improving prompting strategies, and developing more robust evaluation metrics. By overcoming these challenges, LLMs can further evolve to provide safe, reliable, and efficient medical support, pushing the boundaries of AI-assisted healthcare.

Conclusion

This preliminary study highlights the promising capabilities of OpenAI's o1 model in the medical domain. While it provides an affirmative step towards the vision of an AI doctor, the identified limitations and challenges offer valuable insights for future research. By continuing to refine these models and their evaluation, we can look forward to more advanced and reliable AI applications in healthcare.
