RXL-RADSet Synthetic Radiology Benchmark
- RXL-RADSet is a synthetic benchmark dataset, verified by radiologists, that covers ten RADS frameworks using structured scenario templates.
- The dataset incorporates multi-modality reports and diverse radiologist profiles to reflect real-world reporting nuances and clinical contexts.
- Comparative evaluations reveal that larger, guided language models achieve higher accuracy and validity, underscoring the importance of model size and prompt engineering.
RXL-RADSet is a radiologist-verified synthetic benchmark dataset specifically designed for the evaluation of automated Reporting and Data System (RADS) category assignment from narrative radiology reports. Targeting ten RADS frameworks, RXL-RADSet provides balanced, multi-modality coverage and supports comparative benchmarking of both open-weight small LLMs (SLMs) and proprietary LLMs under rigorously controlled prompting conditions (Bose et al., 6 Jan 2026).
1. Dataset Construction and Verification
RXL-RADSet was generated using a structured methodology. For each of ten RADS frameworks—BI-RADS, CAD-RADS, GB-RADS, LI-RADS, Lung-RADS, NI-RADS, O-RADS, PI-RADS, TI-RADS, VI-RADS—16–20 scenario templates were defined. These templates specified:
- Imaging modality (CT, MRI, ultrasound, mammography)
- Anatomic and pathologic context (e.g., lesion phenotype)
- Clinical scenario (age, risk factors, clinical indication)
- Target RADS category and subcategories
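The template structure above can be sketched as a small data class. This is a minimal illustration, not the dataset's actual schema; all field names are assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of one scenario template; field names are
# illustrative assumptions, not the dataset's actual schema.
@dataclass
class ScenarioTemplate:
    rads_system: str           # e.g. "LI-RADS"
    modality: str              # "CT", "MRI", "US", or "Mammography"
    lesion_context: str        # anatomic/pathologic description
    clinical_scenario: str     # age, risk factors, clinical indication
    target_category: str       # e.g. "LR-4"
    subcategories: list = field(default_factory=list)

# Example instance (contents are invented for illustration)
template = ScenarioTemplate(
    rads_system="LI-RADS",
    modality="MRI",
    lesion_context="18 mm observation with arterial phase hyperenhancement",
    clinical_scenario="62-year-old with cirrhosis, HCC surveillance",
    target_category="LR-4",
)
```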
To capture the spectrum of report styles, five “radiologist profiles” were defined, reflecting real-world variation in hedging, detail, and lexical choices:
- Early-career generalist
- Mid-career generalist
- Early-career subspecialist
- Senior subspecialist expert
- Senior resident
Report generation was implemented via chat-based prompting of proprietary LLMs (OpenAI GPT-5.2/5.1/4.1, Google Gemini, Anthropic Claude), encoding both scenario structure and radiologist profile.
Verification employed a two-stage process:
- Level-1 screening by a senior general radiologist focused on realism, section completeness, and internal consistency, with a revision rate of 5–13.5%.
- Level-2 subspecialty review by organ-system experts to confirm precise adherence to RADS lexicon and correct category assignment; revision rate was 3–8%. Edits comprised terminology updates, category corrections, and harmonization with guideline lexicon.
2. Dataset Composition
The dataset consists of a total of 1,600 synthetic reports. The distribution by RADS framework and modality is summarized below.
| RAD System | Modality | # Reports |
|---|---|---|
| BI-RADS | MRI/US/Mammo (100 each) | 300 |
| CAD-RADS | CT | 100 |
| GB-RADS | US | 100 |
| LI-RADS | CT/CT-MRI/MRI/US (100 each) | 400 |
| Lung-RADS | CT | 100 |
| NI-RADS | CT | 100 |
| O-RADS | MRI/US (100 each) | 200 |
| PI-RADS | MRI | 100 |
| TI-RADS | US | 100 |
| VI-RADS | MRI/CT | 100 |
Modality coverage spans CT, MRI, ultrasound, and mammography, aligning with real multi-RADS workflows. Representation is balanced across frameworks, except that BI-RADS (n=300) and LI-RADS (n=400) are intentionally enriched to reflect their clinical prevalence.
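The per-framework counts in the table above can be checked against the stated total of 1,600 reports with a short sketch (counts are copied directly from the table):

```python
# Report counts per RADS framework, copied from the composition table.
reports_per_system = {
    "BI-RADS": 300, "CAD-RADS": 100, "GB-RADS": 100, "LI-RADS": 400,
    "Lung-RADS": 100, "NI-RADS": 100, "O-RADS": 200, "PI-RADS": 100,
    "TI-RADS": 100, "VI-RADS": 100,
}

total = sum(reports_per_system.values())
print(total)  # 1600, matching the stated dataset size
```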
3. Evaluation Protocol
Prompting Strategies
Two primary prompting strategies were implemented during model evaluation:
- Guided prompting: The system prompt encoded the RADS-specific scoring algorithm, tie-break rules, and enforced a “single token” output, followed by a user prompt: “Read the report and generate final RADS category based on it.”
- Zero-shot prompting: Only the user prompt was provided; no RADS-specific system prompt was used. Zero-shot prompting was tested on selected models (GPT-OSS 20B, Qwen3 30B, GPT-5.2) in high-complexity RADS assignments.
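The two prompting modes can be sketched as message builders. Only the user prompt is quoted from the protocol; the guided system-prompt wording is an assumption for illustration.

```python
# Illustrative sketch of the two prompting modes. The guided system
# prompt text is an assumption; only the user prompt is quoted from
# the evaluation protocol.
def build_messages(report: str, guided: bool, scoring_rules: str = ""):
    user = {
        "role": "user",
        "content": ("Read the report and generate final RADS category "
                    f"based on it.\n\n{report}"),
    }
    if not guided:
        return [user]  # zero-shot: user prompt only, no system prompt
    system = {
        "role": "system",
        "content": (scoring_rules +
                    "\nApply the tie-break rules above. "
                    "Answer with a single token: the final RADS category."),
    }
    return [system, user]
```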
Model Suite
The evaluation benchmarked 41 quantized open-weight SLMs spanning 12 model families (among them Qwen3, Deepseek R1, GPT-OSS, Ministral 3, Gemma 3, Nemotron 3 Nano, Olmo 3, Llama4 MoE, SmolLM2, Granite 4, and Phi 4), ranging from 0.135 B to 32 B parameters, with full-precision GPT-5.2 as the proprietary reference.
Performance Metrics
Let $N$ denote the number of predictions per model, $N_{\text{valid}}$ the number of schema-conformant outputs, and $N_{\text{correct}}$ the number of valid outputs matching the reference category. Evaluation utilized the following metrics:
- Validity (schema conformance): $\text{Validity} = N_{\text{valid}} / N$
- Effective Accuracy (counting invalid as incorrect): $\text{EffAcc} = N_{\text{correct}} / N$
- Conditional Accuracy (correct among valid outputs): $\text{CondAcc} = N_{\text{correct}} / N_{\text{valid}}$
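The three metrics can be implemented in a few lines. This is a minimal sketch assuming each prediction is reduced to a `(is_valid, is_correct)` pair; it is not the benchmark's actual scoring code.

```python
# Minimal implementation of the three metrics. Each prediction is a
# (is_valid, is_correct) pair; an invalid output is never counted correct.
def score(preds):
    n = len(preds)
    n_valid = sum(1 for valid, _ in preds if valid)
    n_correct = sum(1 for valid, correct in preds if valid and correct)
    validity = n_valid / n
    effective_accuracy = n_correct / n  # invalid counted as incorrect
    conditional_accuracy = n_correct / n_valid if n_valid else 0.0
    return validity, effective_accuracy, conditional_accuracy

# Example: 4 predictions -- 3 valid, 2 of those correct
v, ea, ca = score([(True, True), (True, True), (True, False), (False, False)])
# v = 0.75, ea = 0.5, ca = 2/3
```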
4. Quantitative Results
Overall Performance
| Group | N preds | Validity % | Eff. Accuracy % |
|---|---|---|---|
| GPT-5.2 | 1,600 | 99.8 | 81.1 |
| Open SLMs (pooled) | 65,600 | 96.8 | 61.1 |
Proprietary GPT-5.2 demonstrated higher validity and accuracy than the aggregated open SLMs pool.
Performance Scaling with Model Size
| Size Bin (Parameters) | Validity % | Eff. Accuracy % |
|---|---|---|
| ≤ 1 B | 82.9 | 27.0 |
| 1–10 B | 98.1 | 57.5 |
| 10–29 B | 99.2 | 73.5 |
| 30–100 B | 99.2 | 73.0 |
| GPT-5.2 | 99.8 | 81.1 |
A sharp inflection was observed between sub-1 B and ≥10 B sizes. “Thinking” prompts (chain-of-thought) further improved validity and accuracy in open SLMs by approximately 12–13 percentage points.
Influence of RADS Complexity
| Complexity Bin (TCS) | Group | Validity % | Eff. Accuracy % |
|---|---|---|---|
| Minimal (<5) | Open SLMs | 93.9 | 73.5 |
| Minimal (<5) | GPT-5.2 | 99.5 | 91.0 |
| Moderate (5–8) | Open SLMs | 97.0 | 62.1 |
| Moderate (5–8) | GPT-5.2 | 99.8 | 76.9 |
| High (>8) | Open SLMs | 98.0 | 49.4 |
| High (>8) | GPT-5.2 | 100.0 | 90.0 |
Open SLM accuracy for high-complexity RADS remained consistently below that of GPT-5.2, with errors driven primarily by clinical-reasoning failures rather than output-format violations.
Guided vs Zero-Shot Prompting
| Mode | N preds | Validity % | Eff. Accuracy % |
|---|---|---|---|
| Guided | 1,500 | 99.2 | 78.5 |
| Zero-shot | 1,500 | 96.7 | 69.6 |
Guided prompting conferred +8.9 percentage points in effective accuracy and increased validity for complex RADS tasks.
5. Key Findings and Implications
The dataset reveals a marked size–performance relationship: sub-1 B models are ineffective for RADS assignment, 1–10 B models achieve moderate accuracy, and 20–32 B SLMs, when carefully prompted and provided with reasoning scaffolds, reach effective accuracies in the mid-to-high 70s (percent), versus 81.1% for the proprietary GPT-5.2 reference. A consistent validity advantage (≈+3 percentage points) and accuracy gap (≈+20 percentage points) favor the proprietary model over open SLMs (p < 0.001).
High-complexity RADS frameworks (TCS > 8) expose substantial residual limitations for all models, with open SLMs exhibiting ~49% effective accuracy compared to 90% for GPT-5.2. Most errors in this regime are attributable to clinical reasoning failures, not schema non-conformance.
Guided prompting, incorporating prompt engineering of scoring logic and output constraints, significantly boosts model accuracy compared to generic zero-shot approaches—particularly for complex assignments. Chain-of-thought style “thinking” prompts notably increase both validity and accuracy in open SLMs by approximately 12–13 percentage points over non-thinking prompts.
A plausible implication is that carefully prompted, large open-weight SLMs (20–32 B) could be used in privacy-preserving, on-premise clinical decision support (CDS) systems, provided robust schema validation and human-in-the-loop oversight are maintained.
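The schema validation mentioned above could take the form of a per-framework allow-list check on the model's single-token output. This is a hedged sketch; the category patterns below cover only three frameworks and are illustrative assumptions, not the benchmark's validation code.

```python
import re

# Illustrative per-framework category patterns for schema validation.
# These are assumptions covering only a few frameworks, not an
# exhaustive or authoritative encoding of the RADS lexicons.
CATEGORY_PATTERNS = {
    "BI-RADS": re.compile(r"[0-6]"),
    "LI-RADS": re.compile(r"LR-(1|2|3|4|5|M|TIV)"),
    "PI-RADS": re.compile(r"[1-5]"),
}

def is_schema_valid(system: str, output: str) -> bool:
    """Return True if the model output is a recognized category token."""
    pattern = CATEGORY_PATTERNS.get(system)
    return bool(pattern and pattern.fullmatch(output.strip()))
```

In a human-in-the-loop pipeline, outputs failing this check would be routed to a radiologist rather than counted as assignments.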
6. Limitations and Prospective Directions
Several limitations are identified. The synthetic reports, while expert-reviewed, may not capture the full range of real-world reporting nuances or omission patterns. Open models were evaluated in quantized form, whereas GPT-5.2 was accessed at full precision, which limits strict comparability. Some rare RADS categories remain underrepresented.
Future directions include instruction-tuning leading open SLMs on radiology-specific corpora, developing hybrid systems that combine deterministic rule-based logic with LLM outputs for critical cases, and validating on multi-institutional authentic radiology corpora spanning diverse linguistic and stylistic traditions.
Potential clinical impact includes enabling public, radiologist-verified benchmarking of LLMs for RADS assignment. RXL-RADSet may accelerate the development and vetting of LLM-based CDS tools and highlights the vital roles of tailored prompting, reasoning scaffolds, schema-conforming output validation, and oversight for responsible clinical deployment (Bose et al., 6 Jan 2026).