SLaVA-CXR: Efficient CXR Report Automation
- SLaVA-CXR is an open-source multimodal system for automating chest X-ray reports under resource constraints using a two-tower architecture.
- The system employs a radiologist-inspired multi-phase Re³Training paradigm that improves recognition, reasoning, and report generation performance.
- A novel synthetic data engine, RADEX, generates compliant and diverse clinical training data, enabling efficient on-premise deployment.
SLaVA-CXR is an open-source small language and vision assistant for chest X-ray (CXR) report automation, designed to operate under resource constraints while maintaining strong clinical performance, robustness, and privacy compliance. The system leverages a two-tower multimodal design, a radiologist-inspired multi-phase training procedure, and a novel synthetic data engine, enabling deployment in low-resource healthcare environments and facilitating efficient on-premise CXR report generation and summarization (Wu et al., 2024).
1. Model Architecture
SLaVA-CXR utilizes a two-tower multimodal architecture consisting of a visual encoder, a lightweight projector module, and a compact LLM. The vision encoder $f_V$ is CLIP ViT-L/14-336px, which transforms an input frontal chest X-ray image $X_v$ into visual tokens $Z_v = f_V(X_v)$. A projection head $f_P$, implemented as a single linear layer with layer normalization, maps these tokens into the token embedding space of the LLM $f_L$, Phi-2 (2.7B parameters), yielding $H_v = f_P(Z_v)$. Only $f_P$ is trainable during the earliest training phase; $f_V$ and $f_L$ remain frozen there.
At each generation step $t$, the next token is sampled autoregressively as $y_t \sim f_L(\,\cdot \mid H_v, y_{<t})$, where $f_L$ denotes the Phi-2 model and $H_v$ the projected visual tokens. By pairing a compact LLM with a high-capacity visual backbone, the model delivers both strong accuracy and fast inference, significantly outperforming contemporary models of comparable or larger size (Wu et al., 2024).
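The two-tower data flow described above (encoder, then a linear projection with layer normalization into the LLM embedding space) can be sketched as follows. This is a minimal illustrative sketch: the dimensions, random weights, and the `vision_encoder` stand-in are placeholders, not the actual SLaVA-CXR components or sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; CLIP ViT-L/14-336px actually emits
# 1024-d patch tokens and Phi-2 uses a larger embedding dimension).
n_patches, d_vis, d_emb = 4, 8, 6

def vision_encoder(image):
    """Stand-in for f_V (CLIP ViT-L/14): image -> visual tokens Z_v."""
    return rng.standard_normal((n_patches, d_vis))

# Projection head f_P: a single linear layer followed by layer norm.
W_p = rng.standard_normal((d_vis, d_emb)) * 0.1

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def projector(z_v):
    """f_P: align visual tokens to the LLM token-embedding space."""
    return layer_norm(z_v @ W_p)

z_v = vision_encoder(image=None)   # Z_v = f_V(X_v)
h_v = projector(z_v)               # H_v = f_P(Z_v)
# H_v is prepended to the text embeddings and consumed autoregressively
# by the LLM: y_t ~ f_L(. | H_v, y_<t).
```

The projected tokens `h_v` have the LLM's embedding width, which is the only requirement for concatenating them with text-token embeddings at generation time.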
2. Re³Training Paradigm
SLaVA-CXR introduces a multi-stage optimization protocol, termed Re³Training, to simulate the developmental trajectory of radiologists. This comprises three sequential stages:
- Recognition: The model learns to describe basic radiological patterns via captioning tasks. Only the projector $f_P$ is updated; the vision encoder $f_V$ and LLM $f_L$ remain frozen. The objective is the standard autoregressive captioning loss $\mathcal{L}_{rec} = -\sum_{t} \log p_\theta(y_t \mid H_v, y_{<t})$.
- Reasoning: The model reasons out diagnoses and explanations under explicit instructions. All modules ($f_V$, $f_P$, $f_L$) are updated using $\mathcal{L}_{rea} = -\sum_{t} \log p_\theta(y_t \mid H_v, X_q, y_{<t})$, where $X_q$ denotes the tokens of the instruction $q$.
- Reporting: The model generates full radiology reports and follows diverse radiological writing instructions via a multi-task loss $\mathcal{L}_{rep} = \mathcal{L}_{gen} + \lambda_1 \mathcal{L}_{ins} + \lambda_2 \lVert\theta\rVert_2^2$, with $\mathcal{L}_{gen}$ penalizing deviation from the reference report, $\mathcal{L}_{ins}$ enforcing instruction adherence, and $\lambda_2 \lVert\theta\rVert_2^2$ acting as weight decay.
Ablation studies show additive performance improvements with each training stage.
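The staged schedule above amounts to selecting a different set of trainable modules per stage. The sketch below mirrors that description with a simple per-stage lookup; the dict-based "module" representation is an illustrative stand-in (in a real framework this would toggle `requires_grad` on parameter groups), and the Reporting row assumes full fine-tuning continues from the Reasoning stage.

```python
# Which modules receive gradient updates in each Re3Training stage.
# f_V: vision encoder, f_P: projector, f_L: LLM (Phi-2).
STAGE_TRAINABLE = {
    "recognition": {"f_P"},                # captioning; encoder & LLM frozen
    "reasoning":   {"f_V", "f_P", "f_L"},  # all modules updated
    "reporting":   {"f_V", "f_P", "f_L"},  # assumed fully trainable
}

def set_trainable(modules, stage):
    """Freeze/unfreeze each module according to the current stage."""
    trainable = STAGE_TRAINABLE[stage]
    for name, module in modules.items():
        module["requires_grad"] = name in trainable

modules = {name: {"requires_grad": False} for name in ("f_V", "f_P", "f_L")}
set_trainable(modules, "recognition")
# Only the projector is now trainable.
```

Running the stages in sequence (`recognition`, then `reasoning`, then `reporting`) reproduces the curriculum: projector alignment first, then full multimodal fine-tuning.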
3. Synthetic Data Generation with RADEX
The RADEX data synthesis pipeline creates a regulatory-compliant, high-diversity training corpus for CXR report automation. RADEX sources public, peer-reviewed CXR cases from Radiopaedia.org, parsing each case $c_i$ into three components: a case description $d_i$, a representation $r_i$ (imaging findings), and a discussion $s_i$ (diagnostic reasoning). Using GPT-4 as a generative function $G$, structured clinical notes are produced as $n_i = G(d_i, r_i, s_i)$.
Strict prompt templates ensure content is derived solely from each case $c_i$, preserving privacy and preventing hallucinated or exogenous content. Each note is further enriched with synthetic dialogue and instruction-following samples from related corpora, enhancing the model's instruction-following competence. All source data are public, devoid of patient identifiers, and all synthesis is performed in-house.
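The note-assembly step can be sketched as below, assuming a case has already been parsed into its three components. The prompt wording is illustrative (the paper's exact templates are not reproduced here), and `generate` is a placeholder for the GPT-4 call $G$.

```python
def build_prompt(case):
    """Assemble a strict, case-grounded prompt from the three parsed parts."""
    return (
        "Write a structured clinical note using ONLY the information below.\n"
        f"Case description: {case['description']}\n"
        f"Imaging findings: {case['representation']}\n"
        f"Diagnostic reasoning: {case['discussion']}\n"
        "Do not add any facts not present above."
    )

def generate(prompt):
    # Placeholder for the GPT-4 call G(.); returns a stub note here.
    return "FINDINGS: ...\nIMPRESSION: ..."

case = {
    "description": "65-year-old with productive cough.",
    "representation": "Right lower lobe airspace opacity.",
    "discussion": "Findings consistent with lobar pneumonia.",
}
note = generate(build_prompt(case))
```

Grounding the prompt exclusively in the parsed case text is what lets the pipeline claim freedom from exogenous or hallucinated content.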
4. Training Regimen and Optimization
SLaVA-CXR employs diverse data sources at each Re³Training stage, summarized as follows:
| Stage | Data Sources & Sample Sizes |
|---|---|
| Recognition | Blip_Laion_CC_SBU (558K), CXR-Alignment (1.4K) |
| Reasoning | LLaVA-665K subset (624K), CXR-Instruction (7.3K) |
| Reporting | CXR Clinical Note (3.3K), RADEX synthetic corpus (~3K) |
Training and evaluation utilize MIMIC-CXR (1,732 studies) and IU-Xray (3,301 studies). Key optimization settings include a maximum sequence length of 2,048 tokens, a cosine annealing scheduler, a 0.03 warmup ratio, no weight decay, and gradient checkpointing. Training was conducted on 8 × A6000 GPUs (289 TFLOPS), with inference on a single A5000 (27.8 TFLOPS).
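The reported schedule (cosine annealing with a 0.03 warmup ratio) corresponds to a learning-rate function like the one below. The peak learning rate and step counts are illustrative placeholders, not values from the paper.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_ratio=0.03):
    """Linear warmup for the first warmup_ratio of training, then
    cosine decay from peak_lr down to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

total = 1000  # illustrative total step count
# lr_at(0, ...) == 0, peaks at the end of warmup, decays to ~0 at the end.
```

This matches the common Hugging Face / PyTorch "cosine schedule with warmup" pattern; the exact variant used by SLaVA-CXR may differ in minor details.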
5. Performance Evaluation
Multiple quantitative and qualitative benchmarks demonstrate the effectiveness of SLaVA-CXR, particularly in comparison to LLaVA-v1.5 (7B) and LLaVA-Med (7B):
5.1 Report Generation and Summarization
On both MIMIC-CXR and IU-Xray, SLaVA-CXR outperforms larger models on key metrics such as Rouge-L, METEOR, BLEU-2, BERTScore, and RadGraph. For instance, on MIMIC-CXR generation:
- SLaVA-CXR (2.7B): R-L 13.77, METEOR 16.79, BLEU-2 8.48, BERTScore 23.93
- LLaVA-v1.5 (7B): R-L 13.27, METEOR 16.46, BLEU-2 7.99, BERTScore 13.84
- LLaVA-Med (7B): R-L 8.60, METEOR 16.09, BLEU-2 3.86, BERTScore -15.05
RadCliQ scores (lower is better) also favor SLaVA-CXR (1.79) over LLaVA-v1.5 (2.04) and LLaVA-Med (2.55).
5.2 Classification
SLaVA-CXR improves CheXpert AUC for pathology recognition, e.g., "No Finding" (58.9% vs 51.7% for LLaVA-v0) and "Edema" (59.7% vs 51.0%).
5.3 Inference Efficiency
SLaVA-CXR achieves an average inference time of 2.53 seconds per case, approximately 6× faster than LLaVA-Med and 2.3× faster than LLaVA-v1.5.
5.4 Human Evaluation
Radiologist assessment (0–5 scale) found SLaVA-CXR superior for summarization coherence (by >1.0 point) and generation completeness (by >0.7 points) relative to LLaVA-Med. Qualitative examples indicate accurate anatomical localization, style adherence, and low hallucination rates.
5.5 Ablation
Sequentially adding the Reasoning and Reporting stages yields cumulative metric gains (MIMIC Gen R-L: 8.04 → 11.74 → 13.77).
6. Operational Considerations and Implications
SLaVA-CXR delivers clinical performance comparable to or exceeding 7B-parameter models at only 2.7B parameters, demonstrating the compensatory efficacy of targeted training and high-quality synthetic data. Its inference speed (2.53s/case) supports real-time or near-real-time deployment.
All components, namely the visual encoder (CLIP ViT-L/14-336px), the projection head, and the LLM (Phi-2), are open source and require no cloud or external API calls, enabling on-premise deployment. Training and inference are feasible on moderately powered GPUs (A5000/A6000), enhancing adoption prospects in resource-limited settings.
The model architecture and RADEX synthesis pipeline are designed for rigorous privacy compliance: no patient-identifiable data enters training, supporting privacy requirements such as HIPAA and institutional research standards (e.g., CITI-governed protocols).
7. Context and Significance in Automated Medical Reporting
SLaVA-CXR exemplifies a scalable, efficient pathway for multimodal clinical report automation, especially suited for low-resource environments where closed-source or large-scale LLMs are infeasible due to privacy, computational requirements, or cost. The framework demonstrates that deliberate architectural parsimony, curriculum-inspired training, and compliant synthetic corpora can yield superior performance in specialized medical domains despite limited model scale. This suggests a shift in paradigm from brute-force scaling toward task-aligned architectural and data-centric strategies in AI-driven medical diagnostics (Wu et al., 2024).