
MTCMB: A Multi-Task Benchmark Framework for Evaluating LLMs on Knowledge, Reasoning, and Safety in Traditional Chinese Medicine

Published 2 Jun 2025 in cs.CL and cs.AI | (2506.01252v1)

Abstract: Traditional Chinese Medicine (TCM) is a holistic medical system with millennia of accumulated clinical experience, playing a vital role in global healthcare, particularly across East Asia. However, the implicit reasoning, diverse textual forms, and lack of standardization in TCM pose major challenges for computational modeling and evaluation. LLMs have demonstrated remarkable potential in processing natural language across diverse domains, including general medicine. Yet, their systematic evaluation in the TCM domain remains underdeveloped. Existing benchmarks either focus narrowly on factual question answering or lack domain-specific tasks and clinical realism. To fill this gap, we introduce MTCMB, a Multi-Task Benchmark for Evaluating LLMs on TCM Knowledge, Reasoning, and Safety. Developed in collaboration with certified TCM experts, MTCMB comprises 12 sub-datasets spanning five major categories: knowledge QA, language understanding, diagnostic reasoning, prescription generation, and safety evaluation. The benchmark integrates real-world case records, national licensing exams, and classical texts, providing an authentic and comprehensive testbed for TCM-capable models. Preliminary results indicate that current LLMs perform well on foundational knowledge but fall short in clinical reasoning, prescription planning, and safety compliance. These findings highlight the urgent need for domain-aligned benchmarks like MTCMB to guide the development of more competent and trustworthy medical AI systems. All datasets, code, and evaluation tools are publicly available at: https://github.com/Wayyuanyuan/MTCMB.

Summary

  • The paper presents a multi-task benchmark (MTCMB) that evaluates LLMs using TCM-specific tasks and curated clinical data.
  • It employs zero-shot, few-shot, and chain-of-thought techniques to expose gaps in diagnostic reasoning and prescription planning.
  • Findings emphasize the need for domain-aligned training and hybrid learning frameworks to improve safety and reliability in TCM applications.

MTCMB: Evaluating LLMs in Traditional Chinese Medicine

The paper introduces MTCMB, a multi-task benchmark framework designed to evaluate LLMs in the domain of Traditional Chinese Medicine (TCM). TCM presents unique computational challenges due to its reliance on implicit reasoning, diverse textual forms, and a lack of standardization, which distinguish it from Western medical paradigms. The paper outlines the limitations of existing benchmarks, which are either narrowly focused on factual question answering or lack domain-specific tasks and clinical realism.

Overview of MTCMB

MTCMB evaluates LLMs across five major categories: knowledge question answering (QA), language understanding, diagnostic reasoning, prescription generation, and safety evaluation. It comprises 12 sub-datasets curated in collaboration with certified TCM practitioners, including real-world case records, national licensing exams, and classical texts. The framework integrates domain-specific challenges and safety considerations that are inherent to TCM practices.
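The five-category, 12-sub-dataset layout described above can be sketched as a small data model. This is an illustrative representation only: the category names come from the paper, but the class, field, and sub-dataset names are hypothetical, not the benchmark's actual API (the real datasets live in the linked repository).

```python
from dataclasses import dataclass

@dataclass
class SubDataset:
    name: str      # illustrative sub-dataset identifier
    category: str  # one of the five MTCMB categories
    source: str    # e.g. "licensing exam", "case record", "classical text"

# The five major categories named in the paper.
CATEGORIES = [
    "knowledge_qa",
    "language_understanding",
    "diagnostic_reasoning",
    "prescription_generation",
    "safety_evaluation",
]

def group_by_category(datasets):
    """Bucket sub-datasets under the five top-level categories."""
    groups = {c: [] for c in CATEGORIES}
    for d in datasets:
        groups[d.category].append(d)
    return groups

# Two invented sub-datasets, just to exercise the grouping.
demo = [
    SubDataset("exam_qa", "knowledge_qa", "licensing exam"),
    SubDataset("case_dx", "diagnostic_reasoning", "case record"),
]
buckets = group_by_category(demo)
```

Grouping by category like this mirrors how per-category scores would be aggregated when reporting results across the 12 sub-datasets.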

Evaluation and Results

The paper evaluates 14 state-of-the-art LLMs across three categories: general LLMs, medical-specialized LLMs, and reasoning-focused LLMs. The evaluation utilizes zero-shot, few-shot, and chain-of-thought prompting techniques. Results indicate that while LLMs excel in factual knowledge retrieval and entity extraction, they face substantial gaps in clinical reasoning, prescription planning, and safety compliance. Models such as GPT-4.1 and Qwen-Max perform well in factual QA but struggle with TCM-specific reasoning tasks, highlighting the need for domain-aligned training paradigms.
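The three prompting regimes used in the evaluation differ only in how the input is framed. A minimal sketch of that framing is below; the `build_prompt` function and its template strings are assumptions for illustration, not the prompts actually used in MTCMB.

```python
def build_prompt(question, mode="zero_shot", exemplars=None):
    """Frame a question under one of three prompting regimes.

    zero_shot: the bare question.
    few_shot: worked (question, answer) exemplars prepended.
    chain_of_thought: an instruction to reason step by step.
    """
    if mode == "zero_shot":
        return f"Question: {question}\nAnswer:"
    if mode == "few_shot":
        shots = "\n\n".join(
            f"Question: {q}\nAnswer: {a}" for q, a in (exemplars or [])
        )
        return f"{shots}\n\nQuestion: {question}\nAnswer:"
    if mode == "chain_of_thought":
        return f"Question: {question}\nLet's reason step by step before answering."
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical usage with one invented exemplar pair.
p = build_prompt("Which formula treats wind-cold?", "few_shot", [("Q1", "A1")])
```

Comparing the same model across these three framings is what exposes the gap the paper reports: extra reasoning scaffolding helps less on TCM-specific diagnostic tasks than on factual QA.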

Implications

The findings underscore the critical need for benchmarks like MTCMB to guide the development of more competent and trustworthy medical AI systems. The paper advocates for domain-aligned training, hybrid architectures combining deep learning with symbolic reasoning frameworks, and safety-enhanced learning paradigms. These recommendations aim to address the holistic and context-dependent nature of TCM, ensuring safer and more reliable model outputs.

Future Directions

The paper suggests pursuing knowledge modeling frameworks that integrate curated datasets, symbolic reasoning grounded in TCM ontologies, and implementations that enhance safety through rule injection and toxicity filtering. By advancing these areas, researchers could develop LLMs capable of understanding and applying TCM principles effectively in clinical contexts.
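Rule injection of the kind suggested here can be sketched as a post-hoc filter over generated prescriptions. The incompatible-pair list below is a tiny illustrative stand-in for real TCM contraindication rules (in the spirit of the classical "eighteen incompatibilities"), and the herb names and thresholds are assumptions for the sketch, not clinical guidance.

```python
# Illustrative contraindication rules: unordered pairs that must not co-occur.
INCOMPATIBLE_PAIRS = {
    frozenset({"gancao", "gansui"}),  # example pair, for illustration only
    frozenset({"wutou", "banxia"}),
}

def violated_rules(herbs):
    """Return the set of incompatible pairs present in a prescription."""
    herbset = set(herbs)
    return {pair for pair in INCOMPATIBLE_PAIRS if pair <= herbset}

def filter_prescription(herbs):
    """Accept a generated prescription only if no rule fires."""
    hits = violated_rules(herbs)
    return ("rejected", hits) if hits else ("accepted", set())
```

In a safety-enhanced pipeline of this shape, a rejected output would be regenerated or flagged for human review rather than returned to the user.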

In conclusion, MTCMB provides a comprehensive testbed for assessing TCM-specific capabilities of LLMs, offering valuable insights and guidance for developing reliable AI systems in the TCM domain. The framework may play a pivotal role in enhancing the safety and cultural alignment of medical AI systems, although careful oversight is essential to prevent misuse and harmful recommendations.
