RXL-RADSet Synthetic Radiology Benchmark
- RXL-RADSet is a synthetic benchmark dataset, verified by radiologists, that covers ten RADS frameworks using structured scenario templates.
- The dataset incorporates multi-modality reports and diverse radiologist profiles to reflect real-world reporting nuances and clinical contexts.
- Comparative evaluations reveal that larger, guided language models achieve higher accuracy and validity, underscoring the importance of model size and prompt engineering.
RXL-RADSet is a radiologist-verified synthetic benchmark dataset specifically designed for the evaluation of automated Reporting and Data System (RADS) category assignment from narrative radiology reports. Targeting ten RADS frameworks, RXL-RADSet provides balanced, multi-modality coverage and supports comparative benchmarking of both open-weight small LLMs (SLMs) and proprietary LLMs under rigorously controlled prompting conditions (Bose et al., 6 Jan 2026).
1. Dataset Construction and Verification
RXL-RADSet was generated using a structured methodology. For each of ten RADS frameworks—BI-RADS, CAD-RADS, GB-RADS, LI-RADS, Lung-RADS, NI-RADS, O-RADS, PI-RADS, TI-RADS, VI-RADS—16–20 scenario templates were defined. These templates specified:
- Imaging modality (CT, MRI, ultrasound, mammography)
- Anatomic and pathologic context (e.g., lesion phenotype)
- Clinical scenario (age, risk factors, clinical indication)
- Target RADS category and subcategories
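The template structure above can be sketched as a small data class. This is a minimal illustration, not the dataset's actual schema; all field names are assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of one scenario template; field names are
# illustrative assumptions, not the dataset's actual schema.
@dataclass
class ScenarioTemplate:
    rads_system: str           # e.g. "LI-RADS"
    modality: str              # "CT", "MRI", "US", or "Mammography"
    lesion_context: str        # anatomic/pathologic description
    clinical_scenario: str     # age, risk factors, clinical indication
    target_category: str       # e.g. "LR-4"
    subcategories: list = field(default_factory=list)

# Example instance (contents are invented for illustration)
template = ScenarioTemplate(
    rads_system="LI-RADS",
    modality="MRI",
    lesion_context="18 mm observation with arterial phase hyperenhancement",
    clinical_scenario="62-year-old with cirrhosis, HCC surveillance",
    target_category="LR-4",
)
```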
To capture the spectrum of report styles, five “radiologist profiles” were defined, reflecting real-world variation in hedging, detail, and lexical choices:
- Early-career generalist
- Mid-career generalist
- Early-career subspecialist
- Senior subspecialist expert
- Senior resident
Report generation was implemented via chat-based prompting of proprietary LLMs (OpenAI GPT-5.2/5.1/4.1, Google Gemini, Anthropic Claude), encoding both scenario structure and radiologist profile.
Verification employed a two-stage process:
- Level-1 screening by a senior general radiologist focused on realism, section completeness, and internal consistency, with a revision rate of 5–13.5%.
- Level-2 subspecialty review by organ-system experts to confirm precise adherence to RADS lexicon and correct category assignment; revision rate was 3–8%. Edits comprised terminology updates, category corrections, and harmonization with guideline lexicon.
2. Dataset Composition
The dataset consists of a total of 1,600 synthetic reports. The distribution by RADS framework and modality is summarized below.
| RAD System | Modality | # Reports |
|---|---|---|
| BI-RADS | MRI/US/Mammo (100 each) | 300 |
| CAD-RADS | CT | 100 |
| GB-RADS | US | 100 |
| LI-RADS | CT/CT-MRI/MRI/US (100 each) | 400 |
| Lung-RADS | CT | 100 |
| NI-RADS | CT | 100 |
| O-RADS | MRI/US (100 each) | 200 |
| PI-RADS | MRI | 100 |
| TI-RADS | US | 100 |
| VI-RADS | MRI/CT | 100 |
Modality coverage spans CT, MRI, ultrasound, and mammography, aligning with real multi-RADS workflows. Representation is balanced across frameworks, except that BI-RADS (n=300) and LI-RADS (n=400) are intentionally enriched to reflect their clinical prevalence.
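The per-framework counts in the table above can be checked against the stated total of 1,600 reports with a short sketch (counts are copied directly from the table):

```python
# Report counts per RADS framework, copied from the composition table.
reports_per_system = {
    "BI-RADS": 300, "CAD-RADS": 100, "GB-RADS": 100, "LI-RADS": 400,
    "Lung-RADS": 100, "NI-RADS": 100, "O-RADS": 200, "PI-RADS": 100,
    "TI-RADS": 100, "VI-RADS": 100,
}

total = sum(reports_per_system.values())
print(total)  # 1600, matching the stated dataset size
```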
3. Evaluation Protocol
Prompting Strategies
Two primary prompting strategies were implemented during model evaluation:
- Guided prompting: The system prompt encoded the RADS-specific scoring algorithm, tie-break rules, and enforced a “single token” output, followed by a user prompt: “Read the report and generate final RADS category based on it.”
- Zero-shot prompting: Only the user prompt was provided; no RADS-specific system prompt was used. Zero-shot prompting was tested on selected models (GPT-OSS 20B, Qwen3 30B, GPT-5.2) in high-complexity RADS assignments.
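The two prompting modes can be sketched as message builders. Only the user prompt is quoted from the protocol; the guided system-prompt wording is an assumption for illustration.

```python
# Illustrative sketch of the two prompting modes. The guided system
# prompt text is an assumption; only the user prompt is quoted from
# the evaluation protocol.
def build_messages(report: str, guided: bool, scoring_rules: str = ""):
    user = {
        "role": "user",
        "content": ("Read the report and generate final RADS category "
                    f"based on it.\n\n{report}"),
    }
    if not guided:
        return [user]  # zero-shot: user prompt only, no system prompt
    system = {
        "role": "system",
        "content": (scoring_rules +
                    "\nApply the tie-break rules above. "
                    "Answer with a single token: the final RADS category."),
    }
    return [system, user]
```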
Model Suite
The evaluation benchmarked 41 quantized open-weight SLMs spanning 12 model families (among them Qwen3, Deepseek R1, GPT-OSS, Ministral 3, Gemma 3, Nemotron 3 Nano, Olmo 3, Llama4 MoE, SmolLM2, Granite 4, and Phi 4), ranging from 0.135 B to 32 B parameters, with full-precision GPT-5.2 as the proprietary reference.
Performance Metrics
Let $N$ denote the number of predictions per model, $N_{\text{valid}}$ the number of schema-conformant outputs, and $N_{\text{correct}}$ the number of valid outputs matching the reference category. Evaluation utilized the following metrics:
- Validity (schema conformance): $\text{Validity} = N_{\text{valid}} / N$
- Effective Accuracy (counting invalid as incorrect): $\text{EffAcc} = N_{\text{correct}} / N$
- Conditional Accuracy (correct among valid outputs): $\text{CondAcc} = N_{\text{correct}} / N_{\text{valid}}$
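The three metrics can be implemented in a few lines. This is a minimal sketch assuming each prediction is reduced to a `(is_valid, is_correct)` pair; it is not the benchmark's actual scoring code.

```python
# Minimal implementation of the three metrics. Each prediction is a
# (is_valid, is_correct) pair; an invalid output is never counted correct.
def score(preds):
    n = len(preds)
    n_valid = sum(1 for valid, _ in preds if valid)
    n_correct = sum(1 for valid, correct in preds if valid and correct)
    validity = n_valid / n
    effective_accuracy = n_correct / n  # invalid counted as incorrect
    conditional_accuracy = n_correct / n_valid if n_valid else 0.0
    return validity, effective_accuracy, conditional_accuracy

# Example: 4 predictions -- 3 valid, 2 of those correct
v, ea, ca = score([(True, True), (True, True), (True, False), (False, False)])
# v = 0.75, ea = 0.5, ca = 2/3
```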
4. Quantitative Results
Overall Performance
| Group | N preds | Validity % | Eff. Accuracy % |
|---|---|---|---|
| GPT-5.2 | 1,600 | 99.8 | 81.1 |
| Open SLMs (pooled) | 65,600 | 96.8 | 61.1 |
Proprietary GPT-5.2 demonstrated higher validity and accuracy than the aggregated open SLMs pool.
Performance Scaling with Model Size
| Size Bin (Parameters) | Validity % | Eff. Accuracy % |
|---|---|---|
| ≤ 1 B | 82.9 | 27.0 |
| 1–10 B | 98.1 | 57.5 |
| 10–29 B | 99.2 | 73.5 |
| 30–100 B | 99.2 | 73.0 |
| GPT-5.2 | 99.8 | 81.1 |
A sharp inflection was observed between sub-1 B and ≥10 B sizes. “Thinking” prompts (chain-of-thought) further improved validity and accuracy in open SLMs by approximately 12–13 percentage points.
Influence of RADS Complexity
| Complexity Bin (TCS) | Group | Validity % | Eff. Accuracy % |
|---|---|---|---|
| Minimal (<5) | Open SLMs | 93.9 | 73.5 |
| Minimal (<5) | GPT-5.2 | 99.5 | 91.0 |
| Moderate (5–8) | Open SLMs | 97.0 | 62.1 |
| Moderate (5–8) | GPT-5.2 | 99.8 | 76.9 |
| High (>8) | Open SLMs | 98.0 | 49.4 |
| High (>8) | GPT-5.2 | 100.0 | 90.0 |
Open SLM accuracy for high-complexity RADS remained consistently below that of GPT-5.2, with errors driven primarily by clinical-reasoning failures rather than output-format violations.
Guided vs Zero-Shot Prompting
| Mode | N preds | Validity % | Eff. Accuracy % |
|---|---|---|---|
| Guided | 1,500 | 99.2 | 78.5 |
| Zero-shot | 1,500 | 96.7 | 69.6 |
Guided prompting conferred +8.9 percentage points in effective accuracy and increased validity for complex RADS tasks.
5. Key Findings and Implications
The dataset reveals a marked size–performance relationship: sub-1 B models are ineffective for RADS assignment, 1–10 B models achieve moderate accuracy, and 20–32 B SLMs, when carefully prompted and provided with reasoning scaffolds, reach effective accuracies in the mid-to-high 70s (percent), versus 81.1% for the proprietary GPT-5.2 reference. A consistent validity advantage (≈+3 percentage points) and accuracy gap (≈+20 percentage points) favor the proprietary model over open SLMs (p < 0.001).
High-complexity RADS frameworks (TCS > 8) expose substantial residual limitations for all models, with open SLMs exhibiting ~49% effective accuracy compared to 90% for GPT-5.2. Most errors in this regime are attributable to clinical reasoning failures, not schema non-conformance.
Guided prompting, incorporating prompt engineering of scoring logic and output constraints, significantly boosts model accuracy compared to generic zero-shot approaches—particularly for complex assignments. Chain-of-thought style “thinking” prompts notably increase both validity and accuracy in open SLMs by approximately 12–13 percentage points over non-thinking prompts.
A plausible implication is that carefully prompted, large open-weight SLMs (20–32 B) could be used in privacy-preserving, on-premise clinical decision support (CDS) systems, provided robust schema validation and human-in-the-loop oversight are maintained.
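The schema validation mentioned above could take the form of a per-framework allow-list check on the model's single-token output. This is a hedged sketch; the category patterns below cover only three frameworks and are illustrative assumptions, not the benchmark's validation code.

```python
import re

# Illustrative per-framework category patterns for schema validation.
# These are assumptions covering only a few frameworks, not an
# exhaustive or authoritative encoding of the RADS lexicons.
CATEGORY_PATTERNS = {
    "BI-RADS": re.compile(r"[0-6]"),
    "LI-RADS": re.compile(r"LR-(1|2|3|4|5|M|TIV)"),
    "PI-RADS": re.compile(r"[1-5]"),
}

def is_schema_valid(system: str, output: str) -> bool:
    """Return True if the model output is a recognized category token."""
    pattern = CATEGORY_PATTERNS.get(system)
    return bool(pattern and pattern.fullmatch(output.strip()))
```

In a human-in-the-loop pipeline, outputs failing this check would be routed to a radiologist rather than counted as assignments.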
6. Limitations and Prospective Directions
Several limitations are identified. The synthetic reports, while expert-reviewed, may not capture the full range of real-world reporting nuances or omission patterns. Open models were evaluated in quantized form, whereas GPT-5.2 was accessed at full precision, which limits strict comparability. Some rare RADS categories remain underrepresented.
Future directions include instruction-tuning leading open SLMs on radiology-specific corpora, developing hybrid systems that combine deterministic rule-based logic with LLM outputs for critical cases, and validating on multi-institutional authentic radiology corpora spanning diverse linguistic and stylistic traditions.
Potential clinical impact includes enabling public, radiologist-verified benchmarking of LLMs for RADS assignment. RXL-RADSet may accelerate the development and vetting of LLM-based CDS tools and highlights the vital roles of tailored prompting, reasoning scaffolds, schema-conforming output validation, and oversight for responsible clinical deployment (Bose et al., 6 Jan 2026).