SLaVA-CXR: Efficient CXR Report Automation

Updated 8 February 2026
  • SLaVA-CXR is an open-source multimodal system for automating chest X-ray reports under resource constraints using a two-tower architecture.
  • The system employs a radiologist-inspired multi-phase Re³Training paradigm that improves recognition, reasoning, and report generation performance.
  • A novel synthetic data engine, RADEX, generates compliant and diverse clinical training data, enabling efficient on-premise deployment.

SLaVA-CXR is an open-source small language and vision assistant for chest X-ray (CXR) report automation, designed to operate under resource constraints while maintaining strong clinical performance, robustness, and privacy compliance. The system leverages a two-tower multimodal design, a radiologist-inspired multi-phase training procedure, and a novel synthetic data engine, enabling deployment in low-resource healthcare environments and facilitating efficient on-premise CXR report generation and summarization (Wu et al., 2024).

1. Model Architecture

SLaVA-CXR utilizes a two-tower multimodal architecture consisting of a visual encoder, a lightweight projector module, and a small LLM (frozen during the first training phase). The vision encoder is CLIP ViT-L/14-336px, which transforms an input frontal chest X-ray image $x$ into visual tokens $E(x)$. A projection head $P_\theta$, implemented as a single linear layer with layer normalization, aligns these tokens to the token embedding space of the LLM, Phi-2 (2.7B parameters). Only $P_\theta$ is trainable during the earliest training phase.

At each generation step $t$:

$$v = P_\theta(E(x))$$

$$y_t \sim L(y_{1:t-1} \oplus v)$$

where $L$ denotes the autoregressive Phi-2 model. By employing a small, frozen LLM and a high-capacity, trainable visual backbone, the model delivers both high accuracy and fast inference, significantly outperforming contemporary models of comparable or larger size (Wu et al., 2024).
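The projector step above can be sketched in PyTorch. This is a minimal illustration, not the released implementation; the dimensions are assumptions based on the named components (CLIP ViT-L/14 at 336px emits 576 patch tokens of width 1024, and Phi-2 uses a 2560-dimensional embedding space):

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Single linear layer + layer normalization, mapping visual tokens
    into the LLM token-embedding space (the P_theta of the text)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        return self.norm(self.proj(vision_tokens))

# Assumed dimensions: 1024-d CLIP tokens, 2560-d Phi-2 embeddings.
vision_dim, llm_dim = 1024, 2560
projector = Projector(vision_dim, llm_dim)

# Dummy visual tokens E(x): batch of 1, 576 patch tokens (24x24 grid at 336px / 14px patches).
E_x = torch.randn(1, 576, vision_dim)
v = projector(E_x)                                 # v = P_theta(E(x))

# The text embeddings y_{1:t-1} are concatenated after v before the LLM forward pass.
text_emb = torch.randn(1, 10, llm_dim)
llm_input = torch.cat([v, text_emb], dim=1)        # shape: (1, 586, 2560)
```

The concatenation realizes the $y_{1:t-1} \oplus v$ operator: the LLM simply attends over visual and textual tokens in one sequence.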

2. Re³Training Paradigm

SLaVA-CXR introduces a multi-stage optimization protocol, termed Re³Training, to simulate the developmental trajectory of radiologists. This comprises three sequential stages:

  1. Recognition: The model learns to describe basic radiological patterns via captioning tasks. Only $P_\theta$ is updated; $E$ and $L$ remain frozen. The objective is:

$$L_{\mathrm{recog}} = -\sum_{(x,y)\in\mathcal{D}_{\mathrm{R1}}} \log p\bigl(y \mid E(x),\, P_\theta(E(x))\bigr)$$

  2. Reasoning: The model reasons out diagnoses and explanations under explicit instructions. All modules ($E$, $P_\theta$, $L$) are updated using:

$$L_{\mathrm{reason}} = L_{\mathrm{ce}}(y_{\mathrm{pred}}, y_{\mathrm{true}}) + \lambda\, L_{\mathrm{reg}}(\theta_E, \theta_P, \theta_L)$$

where $y_{\mathrm{pred}} = L(P_\theta(E(x)), i)$ for instruction $i$.

  3. Reporting: The model generates full radiology reports and follows diverse radiological writing instructions via a multi-task loss:

$$L_{\mathrm{report}} = \alpha_1 L_{\mathrm{rep}} + \alpha_2 L_{\mathrm{instr}} + \alpha_3 L_{\mathrm{reg}}$$

with $L_{\mathrm{rep}}$ penalizing deviation from the reference report, $L_{\mathrm{instr}}$ enforcing instruction adherence, and $L_{\mathrm{reg}}$ acting as weight decay.

The total loss combines these components:

$$L_{\mathrm{total}} = L_{\mathrm{recog}} + L_{\mathrm{reason}} + L_{\mathrm{report}}$$

Ablation studies show additive performance improvements with each training stage.
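The stage-wise freezing schedule described above (projector only in Recognition, all modules in Reasoning and Reporting) can be sketched with `requires_grad` flags. The tiny `nn.Linear` modules here are hypothetical stand-ins for the real CLIP encoder, projector, and Phi-2:

```python
import torch.nn as nn

# Hypothetical stand-ins for E (CLIP ViT-L/14), P_theta (projector), and L (Phi-2).
encoder   = nn.Linear(8, 8)
projector = nn.Linear(8, 8)
llm       = nn.Linear(8, 8)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 1 (Recognition): only P_theta is updated; E and L stay frozen.
set_trainable(encoder, False)
set_trainable(projector, True)
set_trainable(llm, False)
stage1_trainable = [m for m in (encoder, projector, llm)
                    if next(m.parameters()).requires_grad]

# Stages 2-3 (Reasoning, Reporting): all three modules are updated.
for m in (encoder, projector, llm):
    set_trainable(m, True)
```

An optimizer built per stage would then receive only the parameters with `requires_grad=True`, so the gradient flow matches each stage's objective.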

3. Synthetic Data Generation with RADEX

The RADEX data synthesis pipeline creates a regulatory-compliant, high-diversity training corpus for CXR report automation. RADEX sources public, peer-reviewed CXR cases from Radiopaedia.org, parsing each into three components: case description, representation (imaging findings), and discussion (diagnostic reasoning). Using GPT-4 as a generative function $G$, a structured clinical note $n$ is produced as:

$$n = G_{\mathrm{GPT4}}(\text{Description},\,\text{Representation},\,\text{Discussion})$$

Strict prompt templates ensure the generated content is derived solely from each case $c$, preserving privacy and preventing hallucinated or exogenous content. Each note is further enriched with synthetic dialogue and instruction-following samples from related corpora, strengthening the model's instruction-following competence. All source data are public and devoid of patient identifiers, and all synthesis is performed in-house.
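The prompt-assembly step can be sketched as follows. The template text is hypothetical (the exact RADEX prompts are not reproduced here); it only illustrates the constraint that the note must be derived strictly from the three parsed case components:

```python
# Hypothetical prompt template in the spirit of RADEX; not the published wording.
RADEX_TEMPLATE = (
    "Write a structured clinical note using ONLY the information below. "
    "Do not introduce findings, identifiers, or facts not stated.\n\n"
    "Case description:\n{description}\n\n"
    "Imaging findings (representation):\n{representation}\n\n"
    "Diagnostic reasoning (discussion):\n{discussion}\n"
)

def build_radex_prompt(description: str, representation: str, discussion: str) -> str:
    """Assemble the input to the generative function G for one case."""
    return RADEX_TEMPLATE.format(
        description=description,
        representation=representation,
        discussion=discussion,
    )

prompt = build_radex_prompt(
    "65-year-old with productive cough and fever.",
    "Right lower lobe airspace opacity.",
    "Appearances are consistent with lobar pneumonia.",
)
```

The returned string would be sent to the generator (GPT-4 in the paper); keeping the synthesis server-side and the sources public is what preserves the pipeline's privacy properties.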

4. Training Regimen and Optimization

SLaVA-CXR employs diverse data sources at each Re³Training stage, summarized as follows:

| Stage       | Data Sources & Sample Sizes                            |
|-------------|--------------------------------------------------------|
| Recognition | Blip_Laion_CC_SBU (558K), CXR-Alignment (1.4K)         |
| Reasoning   | LLaVA-665K subset (624K), CXR-Instruction (7.3K)       |
| Reporting   | CXR Clinical Note (3.3K), RADEX synthetic corpus (~3K) |

Training and evaluation utilize MIMIC-CXR (1,732 studies) and IU-Xray (3,301 studies). Key optimization settings include a maximum sequence length of 2,048 tokens, a cosine annealing learning-rate schedule with a 0.03 warmup ratio, no weight decay, and gradient checkpointing. Training was conducted on 8 Ɨ A6000 GPUs (289 TFLOPS), with inference on a single A5000 (27.8 TFLOPS).
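The warmup-plus-cosine schedule can be sketched in PyTorch. The learning rate and step count below are illustrative placeholders, not values from the paper; only the warmup ratio (0.03), zero weight decay, and cosine shape come from the text:

```python
import math
import torch

model = torch.nn.Linear(4, 4)  # placeholder model
# No weight decay, per the reported settings; lr is a placeholder value.
opt = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)

total_steps, warmup_ratio = 1000, 0.03
warmup_steps = int(total_steps * warmup_ratio)

def lr_lambda(step: int) -> float:
    """Linear warmup over the first 3% of steps, then cosine annealing to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
```

Each training step would call `opt.step()` followed by `sched.step()`; the multiplier rises from 0 to 1 over the warmup window and decays to 0 by the final step.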

5. Performance Evaluation

Multiple quantitative and qualitative benchmarks demonstrate the effectiveness of SLaVA-CXR, particularly in comparison to LLaVA-v1.5 (7B) and LLaVA-Med (7B):

5.1 Report Generation and Summarization

On both MIMIC-CXR and IU-Xray, SLaVA-CXR outperforms larger models on key metrics such as Rouge-L, METEOR, BLEU-2, BERTScore, and RadGraph. For instance, on MIMIC-CXR report generation:

  • SLaVA-CXR (2.7B): R-L 13.77, METEOR 16.79, BLEU-2 8.48, BERTScore 23.93
  • LLaVA-v1.5 (7B): R-L 13.27, METEOR 16.46, BLEU-2 7.99, BERTScore 13.84
  • LLaVA-Med (7B): R-L 8.60, METEOR 16.09, BLEU-2 3.86, BERTScore -15.05

RadCliQ scores (lower is better) also favor SLaVA-CXR (1.79) over LLaVA-v1.5 (2.04) and LLaVA-Med (2.55).
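For readers unfamiliar with the headline metric, ROUGE-L scores the longest common subsequence between candidate and reference text. A minimal pure-Python sketch (not the evaluation script used in the paper) looks like this:

```python
def rouge_l_f(candidate: str, reference: str, beta: float = 1.2) -> float:
    """ROUGE-L F-score from the longest common subsequence of whitespace tokens."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

An identical candidate and reference score 1.0; the scores in the table above are reported as percentages of this quantity computed against radiologist-written reports.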

5.2 Classification

SLaVA-CXR improves CheXpert AUC for pathology recognition, e.g., "No Finding" (58.9% vs 51.7% for LLaVA-v0) and "Edema" (59.7% vs 51.0%).

5.3 Inference Efficiency

SLaVA-CXR achieves an average inference time of 2.53 seconds per case, approximately 6Ɨ faster than LLaVA-Med and 2.3Ɨ faster than LLaVA-v1.5.

5.4 Human Evaluation

Radiologist assessment (0–5 scale) found SLaVA-CXR superior for summarization coherence (by >1.0 point) and generation completeness (by >0.7 points) relative to LLaVA-Med. Qualitative examples indicate accurate anatomical localization, style adherence, and low hallucination rates.

5.5 Ablation

Sequentially adding the Reasoning and Reporting stages yields cumulative metric gains (MIMIC Gen R-L: 8.04 → 11.74 → 13.77).

6. Operational Considerations and Implications

SLaVA-CXR delivers clinical performance comparable to or exceeding 7B-parameter models at only 2.7B parameters, demonstrating the compensatory efficacy of targeted training and high-quality synthetic data. Its inference speed (2.53s/case) supports real-time or near-real-time deployment.

All components—visual encoder EE, projection head PP, and LLM LL—are open-source and do not require cloud or external API calls, enabling on-premise deployment. Training and inference are feasible on moderately powered GPUs (A5000/A6000), enhancing adoption prospects in resource-limited settings.

The model architecture and RADEX synthesis pipeline are designed for rigorous privacy compliance: no identifiable patient data enters training, and the workflow is structured to satisfy privacy regulations and research-ethics requirements such as HIPAA and CITI-certified data handling.

7. Context and Significance in Automated Medical Reporting

SLaVA-CXR exemplifies a scalable, efficient pathway for multimodal clinical report automation, especially suited for low-resource environments where closed-source or large-scale LLMs are infeasible due to privacy, computational requirements, or cost. The framework demonstrates that deliberate architectural parsimony, curriculum-inspired training, and compliant synthetic corpora can yield superior performance in specialized medical domains despite limited model scale. This suggests a shift in paradigm from brute-force scaling toward task-aligned architectural and data-centric strategies in AI-driven medical diagnostics (Wu et al., 2024).

References

  1. Wu, J., et al. (2024). SLaVA-CXR: Small Language and Vision Assistant for Chest X-Ray Report Automation.
