
LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding

Published 3 Aug 2025 in cs.CV | (2508.01617v1)

Abstract: Autoregressive models (ARMs) have long dominated the landscape of biomedical vision-language models (VLMs). Recently, masked diffusion models such as LLaDA have emerged as promising alternatives, yet their application in the biomedical domain remains largely underexplored. To bridge this gap, we introduce LLaDA-MedV, the first large language diffusion model tailored for biomedical image understanding through vision instruction tuning. LLaDA-MedV achieves relative performance gains of 7.855% over LLaVA-Med and 1.867% over LLaDA-V in the open-ended biomedical visual conversation task, and sets new state-of-the-art accuracy on the closed-form subset of three VQA benchmarks: 84.93% on VQA-RAD, 92.31% on SLAKE, and 95.15% on PathVQA. Furthermore, a detailed comparison with LLaVA-Med suggests that LLaDA-MedV is capable of generating reasonably longer responses by explicitly controlling response length, which can lead to more informative outputs. We also conduct an in-depth analysis of both the training and inference stages, highlighting the critical roles of initialization weight selection, fine-tuning strategies, and the interplay between sampling steps and response repetition. The code and model weights are released at https://github.com/LLM-VLM-GSL/LLaDA-MedV.

Summary

  • The paper introduces LLaDA-MedV, a diffusion-based VLM that outperforms traditional ARMs in closed-form biomedical VQA tasks.
  • It employs an iterative forward-reverse masked generation process, enabling explicit control over response length and detail.
  • Robust training via semantic alignment and vision instruction tuning highlights its potential for generating detailed, visually grounded responses.

Introduction

The paper introduces LLaDA-MedV, a diffusion-based vision-language model (VLM) designed for biomedical image understanding. Autoregressive models (ARMs) have been dominant in this domain, but masked diffusion models (MDMs) such as LLaDA offer a promising alternative. LLaDA-MedV achieves consistent performance improvements over existing models such as LLaVA-Med and sets new accuracy benchmarks in closed-form Visual Question Answering (VQA) tasks across several medical datasets (Figure 1).

Figure 1: Illustration of biomedical VLMs evaluated in the open-ended biomedical conversation benchmark. Among the 6 Medical VLMs, LLaDA-MedV achieves the highest overall score and demonstrates the best performance on Chest X-ray (CXR) and CT modalities.

Methods

LLaDA-MedV leverages masked diffusion models to operate over discrete tokens. It uses an iterative forward-reverse process where input tokens are masked progressively and then predicted in the reverse generation stage. The approach allows explicit control over response length, enabling longer and more detailed outputs compared to ARMs.
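The iterative reverse process can be sketched as follows. This is a toy illustration, not the paper's exact sampler: `predict_fn` stands in for the model, and the confidence-based commit rule and step counts are assumptions.

```python
MASK = "<mask>"

def toy_reverse_generation(predict_fn, gen_length=32, steps=8):
    """Toy sketch of masked-diffusion decoding: start from an all-mask
    response of a fixed, user-chosen length, then fill tokens in over
    several reverse steps. `predict_fn` stands in for the model; it maps
    (sequence, masked positions) to {pos: (token, confidence)}."""
    seq = [MASK] * gen_length             # explicit response-length control
    per_step = max(1, gen_length // steps)
    for _ in range(steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        if not masked:
            break
        guesses = predict_fn(seq, masked)
        # commit the most confident predictions; the rest stay masked
        keep = sorted(masked, key=lambda i: -guesses[i][1])[:per_step]
        for i in keep:
            seq[i] = guesses[i][0]
    return seq
```

Because the response starts as a fixed-length block of mask tokens, the output length is set up front rather than emerging token by token as in an ARM.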

The training involves multiple stages:

  1. Semantic Alignment: Fine-tuning the projector module to align biomedical language with visual content.
  2. Vision Instruction Tuning: End-to-end fine-tuning for generating coherent, visually grounded responses.
  3. Dataset-Specific Fine-Tuning: Enhancement via three biomedical VQA datasets for improved precision.
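The staged schedule above can be sketched as a simple trainability switch. The parameter-group names and the exact trainable set per stage are illustrative assumptions, not the paper's configuration:

```python
def configure_stage(model, stage):
    """Set which parameter groups train in each stage. `model` is a
    dict of {group_name: {"requires_grad": bool}} standing in for a
    real framework's parameter groups. Stage 3 mirrors stage 2 here;
    the precise trainable set in each stage is an assumption."""
    trainable = {
        "semantic_alignment": {"projector"},                    # stage 1
        "instruction_tuning": {"projector", "language_model"},  # stage 2
        "vqa_finetuning": {"projector", "language_model"},      # stage 3
    }[stage]
    for name, group in model.items():
        group["requires_grad"] = name in trainable
    return model
```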

Experiments and Results

LLaDA-MedV demonstrates superior performance in open-ended biomedical conversations, generating more informative responses with explicit control over output length compared to ARM counterparts such as LLaVA-Med (Figure 2).

Figure 2: Illustration of open-ended conversation evaluation. All questions, images, and corresponding captions are sourced from~\cite{li2023llava}.

Notably, in downstream VQA tasks, LLaDA-MedV achieves the highest accuracy on closed-form queries but faces challenges with open-form questions due to less-optimized post-training. This suggests that masked diffusion models offer substantial benefits, particularly in controlled response-generation scenarios requiring detailed analysis.

Analysis of Training and Inference

The study identifies proper initialization and fine-tuning strategy as pivotal for performance. In particular, the selection of domain-specific initialization weights significantly affects model outputs and token-repetition behavior during inference (Figure 3).

Figure 3: Illustration of LLaVA-Med and LLaDA-MedV responses to biomedical queries 1 and 2. The images, queries, and corresponding captions are adapted from~\cite{li2023llava}.

During inference, the trade-off between computational efficiency and response quality is highlighted, with sampling steps being crucial for maintaining response richness while managing computational costs.
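The trade-off is easy to see arithmetically: for a response of L tokens decoded in S sampling steps, roughly ceil(L / S) masked positions must be committed per step, so fewer steps reduce the number of model forward passes but force coarser parallel predictions. A minimal sketch (not the paper's exact scheduler):

```python
def tokens_committed_per_step(gen_length, steps):
    """Ceiling of gen_length / steps: how many masked positions each
    reverse step must fill. Compute scales with `steps`, so halving the
    step count halves decoding cost but doubles the per-step load."""
    return -(-gen_length // steps)  # ceiling division via floor-div idiom
```

For example, a 128-token response needs 4 commits per step at 32 steps, but 16 commits per step at 8 steps.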

Future Work

Token repetition presents a key limitation, especially when the desired output length is large (Figure 4). Future work should focus on optimizing sampling strategies and remasking schedules to balance efficiency and quality, particularly for applications needing detailed responses.

Figure 4: Illustration of token repetition during generation (marked in red) across different settings. Questions 4 and 5 show the answers from LLaDA-MedV$_{V_1}$.
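As a toy diagnostic for this failure mode, one can measure the fraction of duplicated n-grams in a generated token sequence. This is an illustrative metric, not the one used in the paper:

```python
def repeated_ngram_fraction(tokens, n=3):
    """Fraction of n-grams that repeat an earlier n-gram: 0.0 means no
    repetition; values near 1.0 indicate heavy looping, the failure mode
    that grows with requested output length."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

A healthy response scores near 0.0, while a degenerate loop like "a b a b a b ..." scores well above 0.5 at n=2.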

Conclusion

LLaDA-MedV represents an innovative application of diffusion models for biomedical image understanding. Through strategic training and inference methodologies, it presents a compelling alternative to traditional ARMs, promising improved response control and quality in biomedical AI applications. Further research should address the optimization challenges to fully leverage diffusion models in the field of medical image analysis.
