Best LLM judge models and prompting strategies for radiology evaluation

Identify which specific closed-source or open-source large language models and which prompting strategies yield the most reliable evaluations when used as LLM judges for radiology report assessment.

Background

Although LLMs show promise as evaluators for radiology reports, the literature lacks consensus on which model families and prompt designs perform best as general-purpose judges in clinical reporting contexts.

The paper positions this uncertainty as a core motivation for comparing multiple LLM families (closed- and open-source, reasoning and non-reasoning) and prompt formulations, including the proposed VERT metric, against expert-annotated datasets spanning multiple modalities and anatomies.
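As a rough illustration of what comparing candidate judges against expert annotations can look like, the sketch below ranks judges by how well their scores agree with expert scores. It assumes Kendall's tau-b as the agreement measure, which the text above does not specify. The judge names, scores, and function names are hypothetical placeholders, not data or code from the paper.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length score lists."""
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: contributes to neither term
            elif dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else 0.0


def rank_judges(expert_scores, judge_scores):
    """Sort judges by agreement with expert scores, best first."""
    results = [
        (name, kendall_tau_b(expert_scores, scores))
        for name, scores in judge_scores.items()
    ]
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical example: expert quality scores for five reports, and the
# scores two made-up judge configurations assigned to the same reports.
expert_scores = [1, 2, 3, 4, 5]
judge_scores = {
    "judge_a": [1, 2, 3, 5, 4],
    "judge_b": [5, 4, 3, 2, 1],
}
ranking = rank_judges(expert_scores, judge_scores)
```

In practice each entry in `judge_scores` would correspond to one model and prompt combination, so the same ranking covers both the choice of model and the choice of prompting strategy.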

References

In particular, it is unclear which closed- or open-source models are best suited to act as LLM judges, and which prompting strategies yield the most reliable evaluations.

VERT: Reliable LLM Judges for Radiology Report Evaluation (2604.03376 - Bologna et al., 3 Apr 2026) in Section 1 (Introduction)