CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback

Published 9 Oct 2024 in cs.CV and cs.CL (arXiv:2410.07025v2)

Abstract: Radiologists play a crucial role in translating medical images into actionable reports. However, the field faces staffing shortages and increasing workloads. While automated approaches using vision-language models (VLMs) show promise as assistants, they require exceptionally high accuracy. Most current VLMs in radiology rely solely on supervised fine-tuning. Meanwhile, additional preference fine-tuning in the post-training pipeline has become standard practice in the general domain. The challenge in radiology lies in the prohibitive cost of obtaining radiologist feedback at scale. To address this challenge, we propose an automated pipeline for preference feedback, focusing on chest X-ray radiology report generation (RRG). Specifically, our method leverages publicly available datasets containing pairs of images and radiologist-written reference reports with reference-based metrics, or Judges, eliminating the need for additional radiologist feedback. We investigate reward overoptimization via length exploitation in this setting and introduce a length-controlled version of the GREEN score. Our best-performing setup achieves state-of-the-art CheXbert scores on the MIMIC-CXR dataset for the RRG task while on average maintaining robust performance across six additional image perception and reasoning tasks.

Citations (1)

Summary

  • The paper introduces an automated preference fine-tuning method using an LLM-as-a-Judge mechanism to enhance factuality in chest X-ray report generation.
  • The paper evaluates five direct alignment algorithms, achieving up to 57.4% improvement in the GREEN metric over supervised fine-tuning baselines on major datasets.
  • The paper addresses reward overoptimization by identifying verbosity bias and underscoring the need for regularization to maintain clinically practical output lengths.

Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback

The paper "Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback" presents a methodical approach to enhancing vision-language models (VLMs) for chest X-ray (CXR) report generation through preference fine-tuning that requires no direct radiologist feedback. This work addresses a pertinent challenge in radiology: the tension between the high demand for accurate automated interpretation and the limited availability of experts to provide model feedback.

Context and Approach

Radiology has rapidly integrated automated approaches due to the frequency and complexity of imaging studies. Chest X-rays, a fundamental diagnostic tool, contribute heavily to radiologist workload because of their sheer volume and the critical need for timely, accurate interpretations. Existing VLMs, trained primarily with supervised fine-tuning (SFT), show promise but remain limited in addressing hallucinations: erroneous content not grounded in the image data. Drawing on methods emerging in general vision-language research, preference fine-tuning offers a solution by aligning model outputs to predefined standards without extensive human input.

The authors propose using publicly available CXR datasets with an innovative LLM-as-a-Judge mechanism to automate preference alignment. This circumvents the typical need for costly radiologist feedback by leveraging a scalable, automated evaluation process using pretrained LLMs designed for this task.
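The judge-based pipeline can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: `toy_judge` is a stand-in for the actual GREEN scorer, and the candidate reports are invented examples. The idea is to sample several candidate reports per image from the SFT model, score each against the radiologist-written reference, and label the highest- and lowest-scoring candidates as the chosen/rejected pair for alignment training.

```python
def build_preference_pair(candidates, reference, judge):
    """Score sampled candidate reports against the radiologist-written
    reference and return a (chosen, rejected) pair for alignment training.

    candidates : list[str]  -- reports sampled from the SFT model
    reference  : str        -- radiologist-written reference report
    judge      : callable   -- reference-based metric (e.g. a GREEN scorer)
    """
    scored = sorted(candidates, key=lambda r: judge(r, reference))
    # Highest-scoring candidate is "chosen", lowest is "rejected".
    return scored[-1], scored[0]


def toy_judge(report, reference):
    """Toy stand-in for GREEN: fraction of reference tokens recovered."""
    ref_tokens = set(reference.lower().split())
    rep_tokens = set(report.lower().split())
    return len(ref_tokens & rep_tokens) / max(len(ref_tokens), 1)


chosen, rejected = build_preference_pair(
    ["No acute findings.",
     "Clear lungs, no acute cardiopulmonary findings."],
    "Lungs are clear. No acute cardiopulmonary findings.",
    toy_judge,
)
```

Because the judge is reference-based, the whole loop runs over existing public datasets with no new annotation, which is precisely what makes the pipeline scalable.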

Key Contributions and Results

The paper advances several pivotal areas within medical imaging AI:

  1. Automated Preference Data Collection: Publicly available datasets with reference reports are used to implement an LLM-as-a-Judge mechanism, specifically the GREEN metric, to assess the factuality of generated CXR reports. This method maintains high-quality preference datasets in a scalable manner.
  2. Evaluation of Direct Alignment Algorithms (DAAs): Five representative DAAs—DPO, KTO, IPO, SimPO, and ORPO—are systematically evaluated. The findings show substantial improvements over SFT baselines, with gains of up to 57.4% in GREEN scores on the MIMIC-CXR and CheXpert Plus datasets.
  3. Addressing Reward Overoptimization: The authors examine report length exploitation, observing verbosity bias associated with reward overoptimization. DPO, in particular, lengthens reports considerably, underscoring the need for explicit regularization to maintain practical usability.
  4. Assessment of Alignment Tax: The study finds no significant degradations across six additional diverse tasks, demonstrating robustness and mitigating concerns about the alignment tax potentially degrading performance on unrelated tasks.
  5. Clinical Reader Study: A reader study involving radiologists indicates a preference for less verbose outputs, favoring models that align closely with clinical utility; ORPO achieves a win rate of 0.62 over the SFT baseline.

Implications and Future Directions

The paper's methodology holds substantial promise for the development of AI in high-stakes, low-data medical domains. Automated preference fine-tuning aligns VLMs closer to the accuracy demanded in clinical settings without the logistical dependency on extensive expert annotations.

The implications are twofold: practically, this method can enhance clinical reporting efficiency, potentially alleviating workforce constraints; theoretically, it pushes the boundary of feasible automation in complex, minimally supervised domains like healthcare. Future work might explore further optimization of DAAs to manage verbosity and assess this framework's adaptability to other imaging modalities and clinical tasks.

In summary, this research underscores the vital intersection of AI and healthcare, offering actionable insights for enhancing factual accuracy in automated medical interpretation systems, which is paramount for advancing AI-driven radiology.
