MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

Published 6 Aug 2024 in cs.CV | (2408.02900v2)

Abstract: This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine, covering over 25 million images across 10 modalities with multigranular annotations for more than 65 diseases. These multigranular annotations encompass both global information, such as modality and organ detection, and local information like ROI analysis, lesion texture, and region-wise correlations. Unlike the existing multimodal datasets, which are limited by the availability of image-text pairs, we have developed the first automated pipeline that scales up multimodal data by generating multigranular visual and textual annotations in the form of image-ROI-description triplets without the need for any paired text descriptions. Specifically, data from over 30 different sources have been collected, preprocessed, and grounded using domain-specific expert models to identify ROIs related to abnormal regions. We then build a comprehensive knowledge base and prompt multimodal LLMs to perform retrieval-augmented generation with the identified ROIs as guidance, resulting in multigranular textual descriptions. Compared to existing datasets, MedTrinity-25M provides the most enriched annotations, supporting a comprehensive range of multimodal tasks such as captioning and report generation, as well as vision-centric tasks like classification and segmentation. We propose LLaVA-Tri by pretraining LLaVA on MedTrinity-25M, achieving state-of-the-art performance on VQA-RAD, SLAKE, and PathVQA, surpassing representative SOTA multimodal LLMs. Furthermore, MedTrinity-25M can also be utilized to support large-scale pre-training of multimodal medical AI models, contributing to the development of future foundation models in the medical domain. We will make our dataset available.

Summary

The paper’s main contribution is an automated pipeline that generates detailed, multigranular annotations from unpaired medical images.
It leverages expert models and external medical knowledge to accurately identify ROIs and annotate over 65 diseases across 10 imaging modalities.
Pretraining on MedTrinity-25M yields state-of-the-art performance on the medical visual question answering benchmarks VQA-RAD, SLAKE, and PathVQA.

MedTrinity-25M: A Comprehensive Multimodal Dataset for Medical AI

Overview

The introduction of MedTrinity-25M marks a significant advancement in the availability and richness of medical datasets for AI research. This dataset comprises over 25 million images spanning 10 modalities and covering more than 65 diseases. Each image is paired with detailed multigranular annotations, including descriptions of disease types, regions of interest (ROIs), modality information, region-specific descriptions, and inter-regional relationships. Unlike traditional datasets, which depend on existing image-text pairs, MedTrinity-25M employs an automated pipeline that generates annotations from unpaired images, which allows the data to be scaled up well beyond what paired sources provide.
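As a rough illustration, an image-ROI-description triplet could be represented as a small record like the one below. This is a minimal sketch only: the class names, field names, box convention, and the sample values are assumptions made for the example and do not reflect the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ROI:
    # Assumed convention: (x_min, y_min, x_max, y_max) in pixels, max exclusive.
    box: Tuple[int, int, int, int]
    label: str


@dataclass
class Triplet:
    """Hypothetical container for one image, its ROIs, and its text description."""
    image_path: str
    modality: str
    rois: List[ROI] = field(default_factory=list)
    description: str = ""

    def roi_area_fraction(self, width: int, height: int) -> float:
        """Fraction of the image covered by ROI boxes (overlaps are counted twice)."""
        total = sum((r.box[2] - r.box[0]) * (r.box[3] - r.box[1]) for r in self.rois)
        return total / (width * height)


# Placeholder values, not taken from the dataset.
sample = Triplet(
    image_path="example/ct_0001.png",
    modality="CT",
    rois=[ROI(box=(10, 20, 60, 70), label="abnormal region")],
    description="placeholder multigranular description",
)
fraction = sample.roi_area_fraction(100, 100)
```

Keeping the ROI geometry next to the text makes it straightforward to reuse the same record for captioning, classification, or segmentation-style tasks.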

Dataset Construction

Data Collection

MedTrinity-25M aggregates data from over 30 sources, including well-known repositories such as TCIA, Kaggle, Zenodo, and Synapse. This extensive collection encompasses various imaging modalities, including X-ray, MRI, CT, Ultrasound, and Histopathology, ensuring comprehensive coverage of medical imaging techniques. The data sources include images annotated with different levels of detail, from broad disease types to precisely marked segmentation masks and bounding boxes.

Annotation Strategy

Metadata Integration: Basic image attributes, such as modality and disease types, are derived from existing dataset metadata. This metadata is used to generate "coarse captions," which provide essential contextual information for each image.
ROI Locating: Various expert models (e.g., SAT, Chexmask, HoverNet) are leveraged to identify ROIs within the images. These models either use text prompts or segmentation techniques to locate regions indicative of abnormalities.
Medical Knowledge Retrieval: To enhance the quality of textual descriptions, external medical knowledge is integrated. This knowledge is retrieved from databases such as PubMed and StatPearls, ensuring that the annotations are infused with domain-specific expertise.
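The three steps above can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's implementation: the metadata keys, the caption template, the mask-to-box conversion, and the word-overlap retrieval are all assumptions, and the passages stand in for text that would really come from sources such as PubMed or StatPearls.

```python
from typing import Dict, List, Optional, Tuple


def coarse_caption(meta: Dict[str, str]) -> str:
    """Build a short caption from whichever (assumed) metadata fields are present."""
    parts = []
    if meta.get("modality"):
        parts.append(f"{meta['modality']} image")
    if meta.get("organ"):
        parts.append(f"of the {meta['organ']}")
    if meta.get("disease"):
        parts.append(f"showing {meta['disease']}")
    return " ".join(parts) if parts else "medical image"


def mask_to_box(mask: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Tight inclusive box (x_min, y_min, x_max, y_max) around nonzero pixels."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, v in enumerate(row) if v]
    if not rows:
        return None  # empty mask: no ROI found
    return (min(cols), rows[0], max(cols), rows[-1])


def retrieve(query: str, corpus: List[str], k: int = 1) -> List[str]:
    """Rank passages by the number of lowercase words shared with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: len(q & set(p.lower().split())), reverse=True)
    return ranked[:k]


# Placeholder inputs for illustration only.
caption = coarse_caption({"modality": "CT", "organ": "lung", "disease": "nodule"})
box = mask_to_box([[0, 0, 0, 0],
                   [0, 1, 1, 0],
                   [0, 0, 1, 0]])
passages = ["placeholder passage about lung nodule", "placeholder passage about liver"]
top = retrieve(caption, passages)
```

In practice the segmentation mask would come from an expert model and the retrieval step from a proper index; the toy versions only show how the three outputs fit together.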

Automated Annotation Pipeline

The automated annotation pipeline bypasses the need for paired image-text data, relying instead on domain-specific expert models and multimodal large language models (MLLMs). The pipeline consists of two major stages:

Data Processing: This stage involves preprocessing the data to extract coarse captions, locate ROIs, and retrieve relevant medical knowledge. These elements provide a foundation upon which detailed annotations can be built.
Generation of Multigranular Text Descriptions: Using the processed data, MLLMs (such as GPT-4V and LLaVA-Med Captioner) are prompted to generate structured, multigranular text descriptions. These descriptions offer a layered understanding of the image, integrating global and local information.
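To make the second stage more concrete, the sketch below assembles a text prompt from the outputs of the first stage. The prompt wording, the function name, and the argument layout are hypothetical; the attribute list is taken from the five attributes this article names in its evaluation section, and no actual model call is shown.

```python
from typing import List, Tuple

# The five attributes named in this article's evaluation section.
ATTRIBUTES = [
    "modality",
    "structure detection",
    "ROI analysis",
    "lesion texture",
    "local-global relationships",
]


def build_prompt(
    caption: str,
    boxes: List[Tuple[int, int, int, int]],
    knowledge: List[str],
) -> str:
    """Combine the coarse caption, ROI boxes, and retrieved passages into one prompt."""
    lines = [f"Coarse caption: {caption}"]
    for i, b in enumerate(boxes, start=1):
        lines.append(f"ROI {i}: box (x_min, y_min, x_max, y_max) = {b}")
    for passage in knowledge:
        lines.append(f"Background: {passage}")
    lines.append("Describe the image, covering: " + ", ".join(ATTRIBUTES) + ".")
    return "\n".join(lines)


# Placeholder inputs for illustration only.
prompt = build_prompt(
    caption="CT image of the lung",
    boxes=[(1, 1, 2, 2)],
    knowledge=["placeholder passage retrieved from a medical knowledge base"],
)
```

The resulting string would be sent to an MLLM together with the image; passing the ROI coordinates in the prompt is one simple way to use the identified regions as guidance.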

Evaluation and Quality

To ensure the generated annotations are of high quality and align well with human-generated annotations, the dataset was evaluated using GPT-4V. This evaluation focused on five key attributes: modality, structure detection, ROI analysis, lesion texture, and local-global relationships. The alignment scores indicate a high degree of agreement with human annotations, validating the dataset's reliability.
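One simple way to summarize such an evaluation is to average the per-attribute scores over the evaluated samples. The sketch below assumes each sample receives one numeric score per attribute; the scoring scale, the aggregation by mean, and the dummy inputs are assumptions for illustration and are not results from the paper.

```python
from statistics import mean
from typing import Dict, List

ATTRIBUTES = [
    "modality",
    "structure detection",
    "ROI analysis",
    "lesion texture",
    "local-global relationships",
]


def aggregate(scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean score per attribute across samples, plus the sum of those means."""
    summary = {a: mean(s[a] for s in scores) for a in ATTRIBUTES}
    summary["total"] = sum(summary[a] for a in ATTRIBUTES)
    return summary


# Dummy scores for two hypothetical samples (not real evaluation data).
dummy = [
    {a: 1.0 for a in ATTRIBUTES},
    {a: 2.0 for a in ATTRIBUTES},
]
summary = aggregate(dummy)
```

Reporting the per-attribute means alongside the total makes it easier to see which aspect of the annotations, global or local, agrees least with the reference.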

Benchmarking with MedTrinity-25M

The efficacy of MedTrinity-25M was demonstrated with LLaVA-Tri, a model obtained by pretraining LLaVA on the dataset and evaluated on medical visual question answering (VQA). LLaVA-Tri achieves state-of-the-art performance on three VQA benchmarks (VQA-RAD, SLAKE, and PathVQA), surpassing representative multimodal LLMs. These results underscore the dataset's potential to enhance the capabilities of multimodal medical AI models.

Practical Implications and Future Directions

By providing a large-scale, richly annotated dataset, MedTrinity-25M significantly lowers the barrier to training advanced AI models in medicine. Its comprehensive coverage across modalities and diseases makes it an invaluable resource for developing AI models that can perform a multitude of tasks, from diagnostic imaging to automated report generation. Future developments could include expanding the dataset with additional modalities and diseases and further refining the annotation pipeline to incorporate evolving AI technologies and medical knowledge bases.

In summary, MedTrinity-25M addresses the critical need for large, detailed multimodal datasets in medical AI. Its automated pipeline for annotation, combined with the dataset's breadth and depth, positions it as a cornerstone resource for the next generation of medical AI research and applications.