MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine
Abstract: This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine, covering over 25 million images across 10 modalities with multigranular annotations for more than 65 diseases. These multigranular annotations encompass both global information, such as modality and organ detection, and local information like ROI analysis, lesion texture, and region-wise correlations. Unlike the existing multimodal datasets, which are limited by the availability of image-text pairs, we have developed the first automated pipeline that scales up multimodal data by generating multigranular visual and textual annotations in the form of image-ROI-description triplets without the need for any paired text descriptions. Specifically, data from over 30 different sources have been collected, preprocessed, and grounded using domain-specific expert models to identify ROIs related to abnormal regions. We then build a comprehensive knowledge base and prompt multimodal LLMs to perform retrieval-augmented generation with the identified ROIs as guidance, resulting in multigranular textual descriptions. Compared to existing datasets, MedTrinity-25M provides the most enriched annotations, supporting a comprehensive range of multimodal tasks such as captioning and report generation, as well as vision-centric tasks like classification and segmentation. We propose LLaVA-Tri by pretraining LLaVA on MedTrinity-25M, achieving state-of-the-art performance on VQA-RAD, SLAKE, and PathVQA, surpassing representative SOTA multimodal LLMs. Furthermore, MedTrinity-25M can also be utilized to support large-scale pre-training of multimodal medical AI models, contributing to the development of future foundation models in the medical domain. We will make our dataset available.
- Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
- Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Towards generalist biomedical ai. NEJM AI, 1(3):AIoa2300138, 2024.
- Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- A generalist learner for multifaceted medical image interpretation. arXiv preprint arXiv:2405.07988, 2024.
- Padchest: A large chest x-ray image dataset with multi-label annotated reports. Medical image analysis, 66:101797, 2020.
- Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 590–597, 2019.
- Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042, 2019.
- Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36, 2024.
- Quilt-1m: One million image-text pairs for histopathology. Advances in Neural Information Processing Systems, 36, 2024.
- Med-flamingo: a multimodal medical few-shot learner. In Machine Learning for Health (ML4H), pages 353–367. PMLR, 2023.
- Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
- Palm-e: An embodied multimodal language model. In International Conference on Machine Learning, pages 8469–8488. PMLR, 2023.
- Capabilities of gemini models in medicine. arXiv preprint arXiv:2404.18416, 2024.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Radiology objects in context (roco): a multimodal image dataset. In Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings 3, pages 180–189. Springer, 2018.
- Radgenome-chest ct: A grounded vision-language dataset for chest ct analysis. arXiv preprint arXiv:2404.16754, 2024.
- Pmc-clip: Contrastive language-image pre-training using biomedical documents. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 525–536. Springer, 2023.
- Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1):317, 2019.
- Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1650–1654. IEEE, 2021.
- Deeplesion: Automated deep mining, categorization and detection of significant radiology image findings using large-scale clinical lesion annotations. arXiv preprint arXiv:1710.01766, 2017.
- axiong/pmc_oa datasets at hugging face. https://huggingface.co/datasets/axiong/pmc_oa.
- Federated benchmarking of medical artificial intelligence with medperf. Nature Machine Intelligence, 5(7):799–810, 2023.
- Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic). In 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018), pages 168–172. IEEE, 2018.
- A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Scientific data, 8(1):34, 2021.
- Pmc-vqa: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415, 2023.
- One model to rule them all: Towards universal segmentation for medical images with text prompts. arXiv preprint arXiv:2312.17183, 2023.
- Boundary-aware transformers for skin lesion segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pages 206–216. Springer, 2021.
- Improving anatomical plausibility in medical image segmentation via hybrid graph neural networks: applications to chest x-ray analysis. IEEE Transactions on Medical Imaging, 2022.
- Benchmarking retrieval-augmented generation for medicine. arXiv preprint arXiv:2402.13178, 2024.
- What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.
- Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical information retrieval. Bioinformatics, 39(11):btad651, 2023.
- Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7(3):535–547, 2019.
- Meta LLaMA Team. Introducing meta llama 3: The most capable openly available llm to date. https://ai.meta.com/blog/meta-llama-3/, 2024.
- When do we not need larger vision models? arXiv preprint arXiv:2403.13043, 2024.
- Sa-med2d-20m dataset: Segment anything in 2d medical imaging with 20 million masks. arXiv preprint arXiv:2311.11969, 2023.
- Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification. Scientific Data, 10(1):41, 2023.
- The 2024 brain tumor segmentation (brats) challenge: Glioma segmentation on post-treatment mri. arXiv preprint arXiv:2405.18368, 2024.
- A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):1–10, 2018.
- 100,000 histological images of human colorectal cancer and healthy tissue. https://doi.org/10.5281/zenodo.1214456.
- Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286, 2020.
- Vision–language model for visual question answering in medical imagery. Bioengineering, 10(3):380, 2023.
- Q2atransformer: Improving medical vqa via an answer querying decoder. In International Conference on Information Processing in Medical Imaging, pages 445–456. Springer, 2023.
- Open-ended medical visual question answering through prefix tuning of language models. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 726–736. Springer, 2023.
- Pubmedclip: How much does clip benefit visual question answering in the medical domain? In Findings of the Association for Computational Linguistics: EACL 2023, pages 1181–1193, 2023.
- Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2023.
- Self-supervised vision-language pretraining for medial visual question answering. In 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), pages 1–5. IEEE, 2023.
- Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.
- Brain hemorrhage extended (bhx): Bounding box extrapolation from thick to thin slice ct images. PhysioNet, 101(23):e215–20, 2020.
- Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. Journal of pathology informatics, 7(1):29, 2016.
- Automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks. In Medical Imaging 2014: Digital Pathology. SPIE, March 2014.
- A large-scale synthetic pathological dataset for deep learning-enabled segmentation of breast cancer. Scientific Data, 10(1):231, 2023.
- Pannuke dataset extension, insights and baselines. arXiv preprint arXiv:2003.10778, 2020.
- Semantic modeling of cell damage prediction: a machine learning approach at human-level performance in dermatology. Scientific Reports, 13(1):8336, 2023.
- A foundation model utilizing chest ct volumes and radiology reports for supervised-level zero-shot detection of abnormalities. arXiv preprint arXiv:2403.17834, 2024.
- Jun Ma and Bo Wang. Miccai flare23: Fast, low-resource, and accurate organ and pan-cancer segmentation in abdomen ct, Apr 2023.
- Predicting ki67, er, pr, and her2 statuses from h&e-stained breast cancer images. arXiv preprint arXiv:2308.01982, 2023.
- Precise segmental renal artery clamping under the guidance of dual-source computed tomography angiography during laparoscopic partial nephrectomy. Eur. Urol., 62(6):1001–1008, December 2012.
- Laparoscopic partial nephrectomy with segmental renal artery clamping: technique and clinical outcomes. Eur. Urol., 59(5):849–855, May 2011.
- Dense biased networks with deep priori anatomy and hard region adaptation: Semisupervised learning for fine renal artery segmentation. Medical Image Analysis, 63, 2020.
- Meta grayscale adaptive network for 3D integrated renal structures segmentation. Med. Image Anal., 71:102055, July 2021.
- Sdr-former: A siamese dual-resolution transformer for liver lesion classification using 3d multi-phase imaging. arXiv preprint arXiv:2402.17246, 2024.
- Mama-mia: A large-scale multi-center breast cancer dce-mri benchmark dataset with expert segmentations. arXiv preprint arXiv:2406.13844, 2024.
- ChestX-Ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, July 2017.
- ChestX-ray: Hospital-scale chest x-ray database and benchmarks on weakly supervised classification and localization of common thorax diseases. In Deep Learning and Convolutional Neural Networks for Medical Imaging and Clinical Informatics, Advances in computer vision and pattern recognition, pages 369–392. Springer International Publishing, Cham, 2019.
- Inference of captions from histopathological patches. In International Conference on Medical Imaging with Deep Learning, pages 1235–1250. PMLR, 2022.
- Large-scale pretraining on pathological images for fine-tuning of small pathological benchmarks. In Workshop on Medical Image Learning with Limited and Noisy Data, pages 257–267. Springer, 2023.
- Artificial intelligence for tumour tissue detection and histological regression grading in oesophageal adenocarcinomas: a retrospective algorithm development and validation study. The Lancet Digital Health, 5(5):e265–e275, 2023.
- Cellpose3: one-click image restoration for improved cellular segmentation. bioRxiv, pages 2024–02, 2024.
- Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
- CheXmask Database: a large-scale dataset of anatomical segmentation masks for chest x-ray images (version 0.1). https://doi.org/10.13026/dx54-8351, 2023.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.