SLaVA-CXR: Efficient CXR Report Automation
- SLaVA-CXR is an open-source multimodal system for automating chest X-ray reports under resource constraints using a two-tower architecture.
- The system employs a radiologist-inspired multi-phase Re³Training paradigm that improves recognition, reasoning, and report generation performance.
- A novel synthetic data engine, RADEX, generates compliant and diverse clinical training data, enabling efficient on-premise deployment.
SLaVA-CXR is an open-source small language and vision assistant for chest X-ray (CXR) report automation, designed to operate under resource constraints while maintaining strong clinical performance, robustness, and privacy compliance. The system leverages a two-tower multimodal design, a radiologist-inspired multi-phase training procedure, and a novel synthetic data engine, enabling deployment in low-resource healthcare environments and facilitating efficient on-premise CXR report generation and summarization (Wu et al., 2024).
1. Model Architecture
SLaVA-CXR utilizes a two-tower multimodal architecture consisting of a visual encoder, a lightweight projector module, and a compact LLM. The vision encoder $f_V$ is CLIP ViT-L/14-336px, which transforms an input frontal chest X-ray image $X_v$ into visual tokens $Z_v = f_V(X_v)$. A projection head $f_P$, implemented as a single linear layer with layer normalization, maps these tokens into the token embedding space of the LLM $f_L$, Phi-2 (2.7B parameters), yielding $H_v = f_P(Z_v)$. Only $f_P$ is trainable during the earliest training phase; $f_V$ and $f_L$ remain frozen there.
At each generation step $t$, the next token is sampled autoregressively as $y_t \sim f_L(\,\cdot \mid H_v, y_{<t})$, where $f_L$ denotes the Phi-2 model and $H_v$ the projected visual tokens. By pairing a compact LLM with a high-capacity visual backbone, the model delivers both strong accuracy and fast inference, significantly outperforming contemporary models of comparable or larger size (Wu et al., 2024).
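The two-tower data flow described above (encoder, then a linear projection with layer normalization into the LLM embedding space) can be sketched as follows. This is a minimal illustrative sketch: the dimensions, random weights, and the `vision_encoder` stand-in are placeholders, not the actual SLaVA-CXR components or sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; CLIP ViT-L/14-336px actually emits
# 1024-d patch tokens and Phi-2 uses a larger embedding dimension).
n_patches, d_vis, d_emb = 4, 8, 6

def vision_encoder(image):
    """Stand-in for f_V (CLIP ViT-L/14): image -> visual tokens Z_v."""
    return rng.standard_normal((n_patches, d_vis))

# Projection head f_P: a single linear layer followed by layer norm.
W_p = rng.standard_normal((d_vis, d_emb)) * 0.1

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def projector(z_v):
    """f_P: align visual tokens to the LLM token-embedding space."""
    return layer_norm(z_v @ W_p)

z_v = vision_encoder(image=None)   # Z_v = f_V(X_v)
h_v = projector(z_v)               # H_v = f_P(Z_v)
# H_v is prepended to the text embeddings and consumed autoregressively
# by the LLM: y_t ~ f_L(. | H_v, y_<t).
```

The projected tokens `h_v` have the LLM's embedding width, which is the only requirement for concatenating them with text-token embeddings at generation time.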
2. Re³Training Paradigm
SLaVA-CXR introduces a multi-stage optimization protocol, termed Re³Training, to simulate the developmental trajectory of radiologists. This comprises three sequential stages:
- Recognition: The model learns to describe basic radiological patterns via captioning tasks. Only the projector $f_P$ is updated; the vision encoder $f_V$ and LLM $f_L$ remain frozen. The objective is the standard autoregressive captioning loss $\mathcal{L}_{rec} = -\sum_{t} \log p_\theta(y_t \mid H_v, y_{<t})$.
- Reasoning: The model reasons out diagnoses and explanations under explicit instructions. All modules ($f_V$, $f_P$, $f_L$) are updated using $\mathcal{L}_{rea} = -\sum_{t} \log p_\theta(y_t \mid H_v, X_q, y_{<t})$, where $X_q$ denotes the tokens of the instruction $q$.
- Reporting: The model generates full radiology reports and follows diverse radiological writing instructions via a multi-task loss $\mathcal{L}_{rep} = \mathcal{L}_{gen} + \lambda_1 \mathcal{L}_{ins} + \lambda_2 \lVert\theta\rVert_2^2$, with $\mathcal{L}_{gen}$ penalizing deviation from the reference report, $\mathcal{L}_{ins}$ enforcing instruction adherence, and $\lambda_2 \lVert\theta\rVert_2^2$ acting as weight decay.
Ablation studies show additive performance improvements with each training stage.
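The staged schedule above amounts to selecting a different set of trainable modules per stage. The sketch below mirrors that description with a simple per-stage lookup; the dict-based "module" representation is an illustrative stand-in (in a real framework this would toggle `requires_grad` on parameter groups), and the Reporting row assumes full fine-tuning continues from the Reasoning stage.

```python
# Which modules receive gradient updates in each Re3Training stage.
# f_V: vision encoder, f_P: projector, f_L: LLM (Phi-2).
STAGE_TRAINABLE = {
    "recognition": {"f_P"},                # captioning; encoder & LLM frozen
    "reasoning":   {"f_V", "f_P", "f_L"},  # all modules updated
    "reporting":   {"f_V", "f_P", "f_L"},  # assumed fully trainable
}

def set_trainable(modules, stage):
    """Freeze/unfreeze each module according to the current stage."""
    trainable = STAGE_TRAINABLE[stage]
    for name, module in modules.items():
        module["requires_grad"] = name in trainable

modules = {name: {"requires_grad": False} for name in ("f_V", "f_P", "f_L")}
set_trainable(modules, "recognition")
# Only the projector is now trainable.
```

Running the stages in sequence (`recognition`, then `reasoning`, then `reporting`) reproduces the curriculum: projector alignment first, then full multimodal fine-tuning.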
3. Synthetic Data Generation with RADEX
The RADEX data synthesis pipeline creates a regulatory-compliant, high-diversity training corpus for CXR report automation. RADEX sources public, peer-reviewed CXR cases from Radiopaedia.org, parsing each case $c_i$ into three components: a case description $d_i$, a representation $r_i$ (imaging findings), and a discussion $s_i$ (diagnostic reasoning). Using GPT-4 as a generative function $G$, structured clinical notes are produced as $n_i = G(d_i, r_i, s_i)$.
Strict prompt templates ensure content is derived solely from each case $c_i$, preserving privacy and preventing hallucinated or exogenous content. Each note is further enriched with synthetic dialogue and instruction-following samples from related corpora, enhancing the model's instruction-following competence. All source data are public, devoid of patient identifiers, and all synthesis is performed in-house.
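The note-assembly step can be sketched as below, assuming a case has already been parsed into its three components. The prompt wording is illustrative (the paper's exact templates are not reproduced here), and `generate` is a placeholder for the GPT-4 call $G$.

```python
def build_prompt(case):
    """Assemble a strict, case-grounded prompt from the three parsed parts."""
    return (
        "Write a structured clinical note using ONLY the information below.\n"
        f"Case description: {case['description']}\n"
        f"Imaging findings: {case['representation']}\n"
        f"Diagnostic reasoning: {case['discussion']}\n"
        "Do not add any facts not present above."
    )

def generate(prompt):
    # Placeholder for the GPT-4 call G(.); returns a stub note here.
    return "FINDINGS: ...\nIMPRESSION: ..."

case = {
    "description": "65-year-old with productive cough.",
    "representation": "Right lower lobe airspace opacity.",
    "discussion": "Findings consistent with lobar pneumonia.",
}
note = generate(build_prompt(case))
```

Grounding the prompt exclusively in the parsed case text is what lets the pipeline claim freedom from exogenous or hallucinated content.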
4. Training Regimen and Optimization
SLaVA-CXR employs diverse data sources at each Re³Training stage, summarized as follows:
| Stage | Data Sources & Sample Sizes |
|---|---|
| Recognition | Blip_Laion_CC_SBU (558K), CXR-Alignment (1.4K) |
| Reasoning | LLaVA-665K subset (624K), CXR-Instruction (7.3K) |
| Reporting | CXR Clinical Note (3.3K), RADEX synthetic corpus (~3K) |
Training and evaluation utilize MIMIC-CXR (1,732 studies) and IU-Xray (3,301 studies). Key optimization settings include a maximum sequence length of 2,048 tokens, a cosine annealing scheduler, a 0.03 warmup ratio, no weight decay, and gradient checkpointing. Training was conducted on 8 × A6000 GPUs (289 TFLOPS), with inference on a single A5000 (27.8 TFLOPS).
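The reported schedule (cosine annealing with a 0.03 warmup ratio) corresponds to a learning-rate function like the one below. The peak learning rate and step counts are illustrative placeholders, not values from the paper.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_ratio=0.03):
    """Linear warmup for the first warmup_ratio of training, then
    cosine decay from peak_lr down to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

total = 1000  # illustrative total step count
# lr_at(0, ...) == 0, peaks at the end of warmup, decays to ~0 at the end.
```

This matches the common Hugging Face / PyTorch "cosine schedule with warmup" pattern; the exact variant used by SLaVA-CXR may differ in minor details.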
5. Performance Evaluation
Multiple quantitative and qualitative benchmarks demonstrate the effectiveness of SLaVA-CXR, particularly in comparison to LLaVA-v1.5 (7B) and LLaVA-Med (7B):
5.1 Report Generation and Summarization
On both MIMIC-CXR and IU-Xray, SLaVA-CXR outperforms larger models on key metrics such as Rouge-L, METEOR, BLEU-2, BERTScore, and RadGraph. For instance, on MIMIC-CXR generation:
- SLaVA-CXR (2.7B): R-L 13.77, METEOR 16.79, BLEU-2 8.48, BERTScore 23.93
- LLaVA-v1.5 (7B): R-L 13.27, METEOR 16.46, BLEU-2 7.99, BERTScore 13.84
- LLaVA-Med (7B): R-L 8.60, METEOR 16.09, BLEU-2 3.86, BERTScore -15.05
RadCliQ scores (lower is better) also favor SLaVA-CXR (1.79) over LLaVA-v1.5 (2.04) and LLaVA-Med (2.55).
5.2 Classification
SLaVA-CXR improves CheXpert AUC for pathology recognition, e.g., "No Finding" (58.9% vs 51.7% for LLaVA-v0) and "Edema" (59.7% vs 51.0%).
5.3 Inference Efficiency
SLaVA-CXR achieves an average inference time of 2.53 seconds per case, approximately 6× faster than LLaVA-Med and 2.3× faster than LLaVA-v1.5.
5.4 Human Evaluation
Radiologist assessment (0–5 scale) found SLaVA-CXR superior for summarization coherence (by >1.0 point) and generation completeness (by >0.7 points) relative to LLaVA-Med. Qualitative examples indicate accurate anatomical localization, style adherence, and low hallucination rates.
5.5 Ablation
Sequentially adding the Reasoning and Reporting stages yields cumulative metric gains (MIMIC Gen R-L: 8.04 → 11.74 → 13.77).
6. Operational Considerations and Implications
SLaVA-CXR delivers clinical performance comparable to or exceeding 7B-parameter models at only 2.7B parameters, demonstrating the compensatory efficacy of targeted training and high-quality synthetic data. Its inference speed (2.53s/case) supports real-time or near-real-time deployment.
All components, namely the visual encoder (CLIP ViT-L/14-336px), the projection head, and the LLM (Phi-2), are open source and require no cloud or external API calls, enabling on-premise deployment. Training and inference are feasible on moderately powered GPUs (A5000/A6000), enhancing adoption prospects in resource-limited settings.
The model architecture and RADEX synthesis pipeline are designed for rigorous privacy compliance: no patient-identifiable data enters training, supporting privacy requirements such as HIPAA and institutional research standards (e.g., CITI-governed protocols).
7. Context and Significance in Automated Medical Reporting
SLaVA-CXR exemplifies a scalable, efficient pathway for multimodal clinical report automation, especially suited for low-resource environments where closed-source or large-scale LLMs are infeasible due to privacy, computational requirements, or cost. The framework demonstrates that deliberate architectural parsimony, curriculum-inspired training, and compliant synthetic corpora can yield superior performance in specialized medical domains despite limited model scale. This suggests a shift in paradigm from brute-force scaling toward task-aligned architectural and data-centric strategies in AI-driven medical diagnostics (Wu et al., 2024).