- The paper introduces a multimodal LLM that integrates ECG, CXR, and LAB data for transparent medical reasoning and diagnosis.
- The paper employs a Chain of Evidence and adaptive cross-modal attention to achieve state-of-the-art multi-disease prediction (F1=0.5190, AUC=0.6554).
- The paper demonstrates scalable, interpretable diagnostic report generation and sets a blueprint for future multimodal healthcare systems.
MedTVT-R1: A Multimodal LLM for Medical Reasoning and Diagnosis
MedTVT-R1 introduces a comprehensive multimodal LLM (MLLM) framework for interpretable medical reasoning and multi-disease diagnosis, integrating heterogeneous clinical data sources—electrocardiograms (ECG, time series), chest X-rays (CXR, images), and laboratory blood tests (LAB, tabular data). The framework is underpinned by the MedTVT-QA dataset, which provides instruction-based question-answer pairs at both physiological and disease levels, and leverages a Chain of Evidence (CoE) approach to ensure robust, evidence-based diagnostic reasoning.
Dataset Construction and Instruction Design
MedTVT-QA is constructed from the MIMIC-IV family of datasets, aligning ECG, CXR, and LAB data within clinically relevant temporal windows. The dataset comprises 8,706 multimodal patient samples, with 8,331 for training and 375 for testing. Each sample is annotated with both physiological-level labels (e.g., ECG rhythm, CXR findings, LAB panels) and disease-level diagnoses (e.g., coronary artery disease, sepsis, diabetes), mapped to ICD-10 codes.
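The temporal-windowing step can be sketched as a nearest-neighbor pairing over timestamps. This is an illustrative reconstruction, not the paper's pipeline: the `align_modalities` function and the 24-hour window size are assumptions for the example.

```python
from datetime import datetime, timedelta

def align_modalities(ecg_times, cxr_times, lab_times, window_hours=24):
    """For each ECG timestamp, pick the nearest CXR and LAB records and
    keep the triple only if both fall within the clinical window.
    (Hypothetical sketch; the window size is an assumption.)"""
    window = timedelta(hours=window_hours)
    samples = []
    for t in ecg_times:
        cxr = min(cxr_times, key=lambda c: abs(c - t), default=None)
        lab = min(lab_times, key=lambda l: abs(l - t), default=None)
        if cxr is not None and lab is not None \
                and abs(cxr - t) <= window and abs(lab - t) <= window:
            samples.append((t, cxr, lab))
    return samples

base = datetime(2020, 1, 1, 12, 0)
ecgs = [base]
cxrs = [base + timedelta(hours=3), base + timedelta(days=5)]
labs = [base - timedelta(hours=6)]
pairs = align_modalities(ecgs, cxrs, labs)  # the 5-day-old CXR is ignored
```

Records outside the window are simply dropped, which is why the aligned cohort (8,706 samples) is far smaller than any single-modality MIMIC-IV subset.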
The QA pairs are generated from role-specific prompts for each modality and for integrated disease-level reasoning. For physiological-level tasks, prompts guide the model to produce detailed, clinically relevant interpretations of each modality. For disease-level tasks, the model is instructed to synthesize evidence across modalities, explicitly justifying each diagnosis with supporting findings from ECG, CXR, and LAB data. All generated content is reviewed by medical professionals to ensure clinical validity.
Model Architecture
MedTVT-R1’s architecture consists of:
- Modality-Specific Encoders and Projectors: Pretrained encoders for each modality (ECGFM-KED for ECG, ViT-B/16 for CXR, Symile for LAB) extract features, which are projected into a shared embedding space compatible with the LLM.
- Modality Perception Layer (MPL): This layer comprises a Cyclic Multi-Head Attention (CMHA) mechanism and a Contribution-Aware Operator (CAO). CMHA enables cyclic cross-modal attention, allowing each modality to serve as query, key, and value, thus capturing inter-modal dependencies. CAO adaptively weights each modality’s contribution based on the diagnostic context, reflecting the clinical relevance of each data type for specific diseases.
- LLM: The backbone is LLaMA 3.2-1B, with LoRA modules for efficient fine-tuning.
The final multimodal representation is injected into the LLM’s input sequence, replacing modality placeholders in the prompt, enabling seamless integration of structured and unstructured data.
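The two MPL components above can be sketched in miniature. This is a simplified illustration, not the paper's implementation: the cyclic role rotation stands in for CMHA, and `contribution_aware_fusion` with its softmax over per-modality scores is a hypothetical rendering of the CAO.

```python
import math

MODALITIES = ["ecg", "cxr", "lab"]

# CMHA sketch: each cyclic rotation assigns the query/key/value roles so
# that every modality attends over the other two in turn.
cyclic_roles = [
    (MODALITIES[i], MODALITIES[(i + 1) % 3], MODALITIES[(i + 2) % 3])
    for i in range(3)
]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def contribution_aware_fusion(features, scores):
    """CAO sketch: weight each modality's feature vector by a softmax over
    (hypothetical) learned contribution scores, then sum the results."""
    weights = softmax([scores[m] for m in features])
    fused = [0.0] * len(next(iter(features.values())))
    for w, feat in zip(weights, features.values()):
        for i, v in enumerate(feat):
            fused[i] += w * v
    return fused, dict(zip(features, weights))

feats = {"ecg": [1.0, 0.0], "cxr": [0.0, 1.0], "lab": [0.5, 0.5]}
fused, weights = contribution_aware_fusion(
    feats, {"ecg": 2.0, "cxr": 1.0, "lab": 0.5})
```

For a cardiac diagnosis the ECG score would dominate, so the fused vector leans toward the ECG features, mirroring the clinical-relevance weighting described above.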
Training Strategy
MedTVT-R1 employs a three-stage training pipeline:
- Pre-training (PT): The model is trained on physiological-level QA pairs to build foundational understanding of each modality. Only the projectors and LoRA modules are updated.
- Supervised Fine-Tuning (SFT): Disease-level QA pairs with CoE logic are used to train the MPL and LoRA modules, enabling the model to perform integrated, evidence-based diagnostic reasoning.
- Reinforcement Fine-Tuning (RFT): Inspired by DeepSeek-R1, Group Relative Policy Optimization (GRPO) is used for post-training, with a custom Jaccard Reward function that directly optimizes multi-label disease prediction accuracy. The reward combines format compliance and Jaccard similarity between predicted and ground-truth disease sets, encouraging both structural correctness and diagnostic precision.
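The Jaccard Reward combining format compliance with set overlap can be sketched as follows. The `<answer>` tag convention and the 0.5/0.5 weighting are illustrative assumptions, not the paper's exact specification.

```python
import re

def jaccard(pred, gold):
    """Jaccard similarity between two disease sets."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)

def grpo_reward(output, gold_diseases, fmt_weight=0.5):
    """Hypothetical combined reward: a format term (did the model emit a
    parseable answer block?) plus a Jaccard term over disease sets."""
    match = re.search(r"<answer>(.*?)</answer>", output, re.S)
    if match is None:
        return 0.0  # no format compliance, no accuracy credit
    preds = [d.strip().lower() for d in match.group(1).split(",") if d.strip()]
    return fmt_weight + (1 - fmt_weight) * jaccard(preds, gold_diseases)

gold = {"sepsis", "diabetes", "coronary artery disease"}
r = grpo_reward("<answer>sepsis, diabetes</answer>", gold)
```

Because the Jaccard term penalizes both missed diagnoses and spurious ones symmetrically, it suits multi-label prediction better than per-label accuracy, which a model could game by predicting nothing.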
Experimental Results
Quantitative Evaluation
MedTVT-R1 is benchmarked against eight state-of-the-art MLLMs (e.g., InternVL3-1B, LLaVA-1.5-7B, Qwen2.5-VL-3B-Instruct, Deepseek-VL-1.3B-Chat) on both natural language generation (NLG) and clinical efficacy (CE) metrics. The evaluation includes BLEU, METEOR, ROUGE, and BERTScore for NLG, and precision, recall, F1, and AUC for multi-label disease classification.
Key results:
- MedTVT-R1 achieves the highest scores across all metrics, with an F1 of 0.5190 and AUC of 0.6554 for disease-level diagnosis, substantially outperforming all baselines.
- Ablation studies confirm that both physiological-level pre-training and RFT with GRPO are critical for optimal performance. Removing either component leads to significant drops in diagnostic accuracy.
- On physiological-level understanding tasks, MedTVT-R1 also outperforms all competitors, particularly excelling in long-form, detailed report generation.
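For readers unfamiliar with multi-label scoring, a micro-averaged F1 over predicted disease sets can be computed as below. The paper does not state its exact averaging scheme here, so this is one common convention offered as an illustration.

```python
def micro_f1(pred_sets, gold_sets):
    """Micro-averaged F1 over multi-label predictions: pool true positives,
    false positives, and false negatives across all samples, then score."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_sets, gold_sets):
        pred, gold = set(pred), set(gold)
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

preds = [{"cad", "sepsis"}, {"diabetes"}]
golds = [{"cad"}, {"diabetes", "sepsis"}]
score = micro_f1(preds, golds)
```

Micro-averaging weights frequent diseases more heavily than macro-averaging, which matters when the label distribution is as skewed as it typically is in ICU cohorts.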
Qualitative Analysis
MedTVT-R1 demonstrates robust evidence-based reasoning, explicitly linking findings across modalities to support each diagnosis. The model’s outputs are characterized by clear, structured justifications, mirroring clinical reasoning processes. For example, the model correlates ECG evidence of left ventricular hypertrophy with CXR findings of cardiac enlargement and LAB indicators of hypertension, providing a comprehensive, multi-faceted rationale for its conclusions.
Ablation and Modality Importance
Ablation of the MPL’s CMHA or CAO components degrades performance, underscoring the importance of both cross-modal attention and adaptive weighting. Experiments with missing modalities reveal that the absence of ECG data has the most pronounced negative impact, reflecting the centrality of cardiac information in the target disease set.
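A missing-modality experiment of this kind can be simulated by masking one modality's features and renormalizing the remaining fusion weights. The function below is a hypothetical sketch of that procedure, not the paper's protocol.

```python
def drop_modality(features, weights, missing):
    """Simulate a missing-modality ablation: zero out the absent
    modality's features and renormalize the surviving weights."""
    masked = {
        m: ([0.0] * len(f) if m == missing else list(f))
        for m, f in features.items()
    }
    kept = {m: w for m, w in weights.items() if m != missing}
    total = sum(kept.values())
    return masked, {m: w / total for m, w in kept.items()}

feats = {"ecg": [1.0, 0.0], "cxr": [0.0, 1.0], "lab": [0.5, 0.5]}
wts = {"ecg": 0.5, "cxr": 0.3, "lab": 0.2}
masked, renorm = drop_modality(feats, wts, "ecg")
```

Renormalizing forces the model to lean entirely on the remaining modalities, which is exactly the condition under which the loss of ECG proves most damaging for this cardiac-heavy disease set.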
Implementation Considerations
- Computational Requirements: Training was conducted on eight NVIDIA A800 80GB GPUs. The use of LoRA modules and pretrained encoders enables efficient adaptation without full model retraining.
- Data Alignment: Temporal alignment of multimodal data is essential for clinical validity. The pipeline ensures that all modalities are sampled within clinically relevant windows.
- Scalability: The modular architecture allows for extension to additional modalities (e.g., genomics, clinical notes) as data becomes available.
- Deployment: The model’s interpretability and evidence-based outputs are well-suited for integration into clinical decision support systems, where transparency and traceability are critical.
Implications and Future Directions
MedTVT-R1 establishes a new paradigm for multimodal medical AI, demonstrating that LLMs can be effectively adapted for complex, evidence-based clinical reasoning across heterogeneous data types. The explicit Chain of Evidence approach enhances interpretability, a key requirement for clinical adoption.
Practical implications include:
- Automated, interpretable diagnostic report generation for complex comorbidities.
- Enhanced decision support in settings where multiple data modalities are available.
- A foundation for future models incorporating additional modalities (e.g., patient history, genomics).
Theoretical implications:
- The success of CMHA and CAO in the MPL suggests that adaptive, context-aware modality fusion is critical for high-stakes, multi-label reasoning tasks.
- The use of verifiable, task-specific reward functions (e.g., Jaccard Reward) in RL-based fine-tuning offers a promising direction for optimizing LLMs in structured prediction settings.
Future work should address the limitations of current open-source datasets, particularly the need for larger, temporally aligned multimodal cohorts and the inclusion of additional clinically relevant data types. Further research into robust handling of missing modalities and real-world deployment in clinical workflows is warranted.
Strong numerical results:
- MedTVT-R1 achieves an F1 score of 0.5190 and AUC of 0.6554 for multi-disease diagnosis, outperforming all baselines by a substantial margin.
- On physiological-level understanding, MedTVT-R1 consistently leads across BLEU, METEOR, ROUGE, and BERTScore metrics for all modalities.
Bold claims:
- MedTVT-R1 is the first MLLM to jointly integrate ECG, CXR, and LAB data for interpretable, evidence-based multi-disease diagnosis.
- The Chain of Evidence approach and modality perception layer are essential for achieving state-of-the-art performance in complex clinical reasoning tasks.
MedTVT-R1 represents a significant step toward clinically viable, multimodal AI systems capable of transparent, evidence-based medical reasoning. Its architecture and training methodology provide a blueprint for future research at the intersection of LLMs and multimodal healthcare data.