
RXL-RADSet Synthetic Radiology Benchmark

Updated 9 January 2026
  • RXL-RADSet is a synthetic benchmark dataset, verified by radiologists, that covers ten RADS frameworks using structured scenario templates.
  • The dataset incorporates multi-modality reports and diverse radiologist profiles to reflect real-world reporting nuances and clinical contexts.
  • Comparative evaluations reveal that larger, guided language models achieve higher accuracy and validity, underscoring the importance of model size and prompt engineering.

RXL-RADSet is a radiologist-verified synthetic benchmark dataset specifically designed for the evaluation of automated Reporting and Data System (RADS) category assignment from narrative radiology reports. Targeting ten RADS frameworks, RXL-RADSet provides balanced, multi-modality coverage and supports comparative benchmarking of both open-weight small language models (SLMs) and proprietary LLMs under rigorously controlled prompting conditions (Bose et al., 6 Jan 2026).

1. Dataset Construction and Verification

RXL-RADSet was generated using a structured methodology. For each of ten RADS frameworks—BI-RADS, CAD-RADS, GB-RADS, LI-RADS, Lung-RADS, NI-RADS, O-RADS, PI-RADS, TI-RADS, VI-RADS—16–20 scenario templates were defined. These templates specified:

  • Imaging modality (CT, MRI, ultrasound, mammography)
  • Anatomic and pathologic context (e.g., lesion phenotype)
  • Clinical scenario (age, risk factors, clinical indication)
  • Target RADS category and subcategories
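The template structure described above can be sketched as a simple data class. The field names below are illustrative assumptions; the paper does not publish a concrete schema:

```python
from dataclasses import dataclass, field

@dataclass
class ScenarioTemplate:
    """One of the 16-20 structured scenario templates per RADS framework.

    Field names are illustrative, not the authors' actual schema.
    """
    framework: str          # e.g. "BI-RADS"
    modality: str           # "CT", "MRI", "US", or "Mammo"
    lesion_context: str     # anatomic/pathologic description
    clinical_scenario: str  # age, risk factors, clinical indication
    target_category: str    # ground-truth RADS category
    subcategories: list[str] = field(default_factory=list)

# Example instantiation (content invented for illustration):
template = ScenarioTemplate(
    framework="BI-RADS",
    modality="Mammo",
    lesion_context="spiculated mass, upper outer quadrant",
    clinical_scenario="58-year-old, screening mammogram",
    target_category="BI-RADS 5",
)
print(template.framework, template.target_category)
```

Encoding scenarios this way makes the template → report-generation → verification pipeline easy to audit, since every synthetic report traces back to one structured record.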

To capture the spectrum of report styles, five “radiologist profiles” were defined, reflecting real-world variation in hedging, detail, and lexical choices:

  1. Early-career generalist
  2. Mid-career generalist
  3. Early-career subspecialist
  4. Senior subspecialist expert
  5. Senior resident

Report generation was implemented via chat-based prompting of proprietary LLMs (OpenAI GPT-5.2/5.1/4.1, Google Gemini, Anthropic Claude), encoding both scenario structure and radiologist profile.

Verification employed a two-stage process:

  • Level-1 screening by a senior general radiologist focused on realism, section completeness, and internal consistency, with a revision rate of 5–13.5%.
  • Level-2 subspecialty review by organ-system experts to confirm precise adherence to RADS lexicon and correct category assignment; revision rate was 3–8%. Edits comprised terminology updates, category corrections, and harmonization with guideline lexicon.

2. Dataset Composition

The dataset consists of 1,600 synthetic reports. The distribution by RADS framework and modality is summarized below.

| RADS System | Modality | # Reports |
|---|---|---|
| BI-RADS | MRI / US / Mammo (100 each) | 300 |
| CAD-RADS | CT | 100 |
| GB-RADS | US | 100 |
| LI-RADS | CT / CT-MRI / MRI / US (100 each) | 400 |
| Lung-RADS | CT | 100 |
| NI-RADS | CT | 100 |
| O-RADS | MRI / US (100 each) | 200 |
| PI-RADS | MRI | 100 |
| TI-RADS | US | 100 |
| VI-RADS | MRI / CT | 100 |

Modality coverage is comprehensive, ensuring alignment with real multi-RADS workflows. There is balanced representation across frameworks, except for BI-RADS (n=300) and LI-RADS (n=400), which are intentionally enriched due to their clinical prevalence.
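The composition table can be encoded as a dictionary and sanity-checked against the stated total of 1,600 reports. This sketch simply mirrors the counts above:

```python
# Per-framework report counts from the RXL-RADSet composition table.
reports_per_framework = {
    "BI-RADS": 300,    # MRI / US / Mammo, 100 each
    "CAD-RADS": 100,   # CT
    "GB-RADS": 100,    # US
    "LI-RADS": 400,    # CT / CT-MRI / MRI / US, 100 each
    "Lung-RADS": 100,  # CT
    "NI-RADS": 100,    # CT
    "O-RADS": 200,     # MRI / US, 100 each
    "PI-RADS": 100,    # MRI
    "TI-RADS": 100,    # US
    "VI-RADS": 100,    # MRI / CT
}

total = sum(reports_per_framework.values())
assert total == 1600, f"expected 1,600 reports, got {total}"
print(total)  # → 1600
```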

3. Evaluation Protocol

Prompting Strategies

Two primary prompting strategies were implemented during model evaluation:

  • Guided prompting: The system prompt encoded the RADS-specific scoring algorithm, tie-break rules, and enforced a “single token” output, followed by a user prompt: “Read the report and generate final RADS category based on it.”
  • Zero-shot prompting: Only the user prompt was provided; no RADS-specific system prompt was used. Zero-shot prompting was tested on selected models (GPT-OSS 20B, Qwen3 30B, GPT-5.2) in high-complexity RADS assignments.
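The two prompting modes might be assembled along these lines. The guided system-prompt content is paraphrased; the exact wording and the `rads_rules` text are assumptions, not the published prompts:

```python
def build_messages(report_text: str, guided: bool, rads_rules: str = "") -> list[dict]:
    """Build a chat message list for guided or zero-shot RADS assignment.

    `rads_rules` would hold the framework-specific scoring algorithm and
    tie-break rules; its exact content is an assumption here.
    """
    user = {
        "role": "user",
        "content": ("Read the report and generate final RADS category "
                    f"based on it.\n\n{report_text}"),
    }
    if guided:
        system = {
            "role": "system",
            "content": (f"{rads_rules}\n"
                        "Output a single token: the final RADS category only."),
        }
        return [system, user]
    return [user]  # zero-shot: user prompt only, no RADS-specific system prompt

msgs = build_messages("Mammogram: spiculated mass ...", guided=True,
                      rads_rules="BI-RADS scoring algorithm ...")
print(len(msgs))  # → 2
```

The single-token output constraint in the guided mode is what makes the validity metric below well-defined: any output that is not a lone category token can be flagged as schema-non-conformant.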

Model Suite

The evaluation benchmarked 41 quantized open-weight SLMs covering 12 model families (Qwen3, Deepseek R1, GPT-OSS, Ministral 3, Gemma 3, Nemotron 3 Nano, Olmo 3, Llama4 MoE, SmolLM2, Granite 4, Phi 4) across sizes from 0.135 B to 32 B parameters, along with GPT-5.2 (full-precision) as proprietary reference.

Performance Metrics

Let $N$ denote the number of predictions per model; evaluation utilized the following metrics:

  • Validity (schema conformance):

$$\text{Validity} = \frac{\#\{\text{outputs adhering to RADS format}\}}{N}$$

  • Effective Accuracy (counting invalid as incorrect):

$$\text{Accuracy}_{\rm eff} = \frac{\#\{\text{correct RADS categories}\}}{N}$$

  • Conditional Accuracy (correct among valid outputs):

$$\text{Accuracy}_{\rm cond} = \frac{\#\{\text{correct} \wedge \text{valid outputs}\}}{\#\{\text{valid outputs}\}}$$
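The three metrics can be computed directly from per-prediction flags, as in this minimal sketch (the dict-of-flags representation is an assumption, not the authors' evaluation code):

```python
def rads_metrics(predictions: list[dict]) -> dict:
    """Compute Validity, Effective Accuracy, and Conditional Accuracy.

    Each prediction carries two boolean flags:
      'valid'   - output conforms to the RADS output schema
      'correct' - output matches the ground-truth RADS category
    Invalid outputs count as incorrect for effective accuracy.
    """
    n = len(predictions)
    n_valid = sum(p["valid"] for p in predictions)
    n_correct_valid = sum(p["correct"] and p["valid"] for p in predictions)
    return {
        "validity": n_valid / n,
        "effective_accuracy": n_correct_valid / n,
        "conditional_accuracy": n_correct_valid / n_valid if n_valid else 0.0,
    }

preds = [{"valid": True, "correct": True},
         {"valid": True, "correct": False},
         {"valid": False, "correct": False},
         {"valid": True, "correct": True}]
m = rads_metrics(preds)
print(m)  # validity 0.75, effective 0.5, conditional ≈ 0.667
```

Note that conditional accuracy is always at least as high as effective accuracy, since the denominator shrinks from $N$ to the valid subset while the numerator is unchanged.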

4. Quantitative Results

Overall Performance

| Group | N preds | Validity % | Eff. Accuracy % |
|---|---|---|---|
| GPT-5.2 | 1,600 | 99.8 | 81.1 |
| Open SLMs (pooled) | 65,600 | 96.8 | 61.1 |

Proprietary GPT-5.2 demonstrated higher validity and accuracy than the aggregated open SLMs pool.

Performance Scaling with Model Size

| Size Bin (Parameters) | Validity % | Eff. Accuracy % |
|---|---|---|
| ≤ 1 B | 82.9 | 27.0 |
| 1–10 B | 98.1 | 57.5 |
| 10–29 B | 99.2 | 73.5 |
| 30–100 B | 99.2 | 73.0 |
| GPT-5.2 | 99.8 | 81.1 |

A sharp inflection was observed between the sub-1 B and ≥ 10 B size bins. "Thinking" (chain-of-thought) prompts further improved validity and accuracy in open SLMs by approximately 12–13 percentage points.

Influence of RADS Complexity

| Complexity Bin (TCS) | Group | Validity % | Eff. Accuracy % |
|---|---|---|---|
| Minimal (< 5) | Open SLMs | 93.9 | 73.5 |
| Minimal (< 5) | GPT-5.2 | 99.5 | 91.0 |
| Moderate (5–8) | Open SLMs | 97.0 | 62.1 |
| Moderate (5–8) | GPT-5.2 | 99.8 | 76.9 |
| High (> 8) | Open SLMs | 98.0 | 49.4 |
| High (> 8) | GPT-5.2 | 100.0 | 90.0 |

Open SLM accuracy for high-complexity RADS remained consistently below GPT-5.2, due primarily to clinical reasoning failures rather than output-format errors.

Guided vs Zero-Shot Prompting

| Mode | N preds | Validity % | Eff. Accuracy % |
|---|---|---|---|
| Guided | 1,500 | 99.2 | 78.5 |
| Zero-shot | 1,500 | 96.7 | 69.6 |

Guided prompting conferred +8.9 percentage points in effective accuracy and increased validity for complex RADS tasks.

5. Key Findings and Implications

The dataset reveals a marked size–performance relationship: sub-1 B models are ineffective for RADS assignment, 1–10 B models achieve moderate accuracy, and 20–32 B SLMs—when carefully prompted and provided with reasoning scaffolds—approach mid-to-high 70% accuracy, with the proprietary GPT-5.2 reference at 81.1% accuracy. There remains a consistent validity advantage (≈+3 percentage points) and an accuracy gap (≈+20 percentage points) favoring proprietary over open SLMs (p < 0.001).

High-complexity RADS frameworks (TCS > 8) expose substantial residual limitations for all models, with open SLMs exhibiting ~49% effective accuracy compared to 90% for GPT-5.2. Most errors in this regime are attributable to clinical reasoning failures, not schema non-conformance.

Guided prompting, incorporating prompt engineering of scoring logic and output constraints, significantly boosts model accuracy compared to generic zero-shot approaches—particularly for complex assignments. Chain-of-thought style “thinking” prompts notably increase both validity and accuracy in open SLMs by approximately 12–13 percentage points over non-thinking prompts.

A plausible implication is that carefully prompted, large open-weight SLMs (20–32 B) could be used in privacy-preserving, on-premise clinical decision support (CDS) systems, provided robust schema validation and human-in-the-loop oversight are maintained.
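A minimal schema-validation guard of the kind implied above could, for instance, check a model's output against a per-framework whitelist of category tokens. The category sets and the regex below are illustrative assumptions, not a published validator, and the whitelists are deliberately incomplete:

```python
import re

# Illustrative (incomplete) category whitelists; a deployed system would
# enumerate every valid category and subcategory per RADS framework.
VALID_CATEGORIES = {
    "BI-RADS": {"0", "1", "2", "3", "4", "4A", "4B", "4C", "5", "6"},
    "Lung-RADS": {"0", "1", "2", "3", "4A", "4B", "4X"},
}

def validate_output(framework: str, model_output: str) -> bool:
    """Return True if the model output is a single valid category token."""
    token = model_output.strip()
    # Accept either "BI-RADS 4A" or a bare "4A"; reject anything else.
    m = re.fullmatch(rf"(?:{re.escape(framework)}\s*)?([0-9][A-CX]?)", token)
    return bool(m) and m.group(1) in VALID_CATEGORIES.get(framework, set())

print(validate_output("BI-RADS", "BI-RADS 4A"))   # → True
print(validate_output("BI-RADS", "probably a 4"))  # → False
```

In an on-premise CDS pipeline, outputs failing such a check would be routed to human review rather than silently coerced, preserving the human-in-the-loop oversight the authors call for.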

6. Limitations and Prospective Directions

Several limitations are identified. The synthetic reports, while expert-reviewed, may not capture the full range of real-world reporting nuances or omission patterns. Open models were evaluated in quantized form, while GPT-5.2 was accessed at full precision, potentially confounding direct comparability. Some rare RADS categories remain underrepresented.

Future directions include instruction-tuning leading open SLMs on radiology-specific corpora, developing hybrid systems that combine deterministic rule-based logic with LLM outputs for critical cases, and validating on multi-institutional authentic radiology corpora spanning diverse linguistic and stylistic traditions.

Potential clinical impact includes enabling public, radiologist-verified benchmarking of LLMs for RADS assignment. RXL-RADSet may accelerate the development and vetting of LLM-based CDS tools and highlights the vital roles of tailored prompting, reasoning scaffolds, schema-conforming output validation, and oversight for responsible clinical deployment (Bose et al., 6 Jan 2026).
