Best LLM judge models and prompting strategies for radiology report evaluation
Identify which specific closed-source or open-source large language models are best suited to serve as LLM judges for radiology report assessment, and which prompting strategies yield the most reliable evaluations.
References
In particular, it is unclear which closed- or open-source models are best suited to act as LLM judges, and which prompting strategies yield the most reliable evaluations.
— VERT: Reliable LLM Judges for Radiology Report Evaluation
(2604.03376 - Bologna et al., 3 Apr 2026) in Section 1 (Introduction)