- The paper presents a novel backdoor attack strategy on LLM-as-a-Judge systems that manipulates evaluation scores through poisoned training data.
- Experiments show even 1% data poisoning can drastically alter model outputs, with 10% poisoning leading to misclassification up to 89% of toxic prompts.
- The proposed model merging technique effectively neutralizes malicious triggers, reducing attack success rates near zero without high computational costs.
Overview of "BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge"
The paper "BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge" investigates the potential threats posed by backdoor attacks on LLMs used as automated evaluation systems, referred to as LLM-as-a-Judge. The authors propose a novel attack strategy where adversaries can control both candidate and evaluator models to manipulate evaluation scores unfairly in their favor. The study categorizes the levels of data access that an adversary may possess, ranging from minimal access through web poisoning to full access via weight poisoning, each correlating with different severities of attack. Importantly, the paper also introduces a mitigation strategy based on model merging, aiming to neutralize the backdoor effects while maintaining state-of-the-art performance.
Attack Framework and Methodology
The proposed backdoor attack targets the LLM-as-a-Judge paradigm by implanting malicious triggers in the training data of evaluator models. The adversaries can utilize these triggers to boost scores undeservedly. The paper explores three primary scenarios of adversary data access:
- Web Poisoning: The adversary introduces poisoned data into publicly available internet resources, anticipating that this data will be scraped and included in training datasets.
- Malicious Annotator: Open-sourced and community-driven data acquisition processes are exploited by inserting backdoor triggers into the training data via annotations.
- Weight Poisoning: Full access is assumed where the adversary can manipulate model weights directly, either through collaboration missteps or internal threats.
The research demonstrates that even with minimal data poisoning (as low as 1%), adversaries can substantially increase their evaluation scores, illustrating the vulnerability of LLM evaluators under these circumstances.
Figure 1: Overview of our attack framework and mitigation strategy. Both point-wise and pair-wise evaluation is at risk of backdoor.
Experimental Results
The experiments validate that backdoor threats are severe and pervasive across various model architectures, trigger designs, and evaluation scenarios. Notably, poisoning 10% of the evaluator training data can lead to a dramatic increase in the model's misclassification rates, such as having toxicity judges misclassify toxic prompts as non-toxic 89% of the time. Furthermore, the paper reports that using rare word triggers can consistently manipulate evaluator scores significantly, even under minimal assumption conditions.
Figure 2: Results for attacking Mistral-7B-InstructV2 fine-tuned on feedback-collection poisoned with rare words under full assumptions.
Defense Mechanisms
The paper proposes a novel defense mechanism known as model merging to mitigate backdoor attacks. This technique involves interpolating the weights of a backdoored model with a clean baseline model to dilute the effect of malicious triggers. The model merging strategy effectively reduces the attack success rate (ASR) to near zero, demonstrating its potential as a viable countermeasure against backdoor vulnerabilities. Additionally, model merging integrates seamlessly into current LLM judge training pipelines without incurring high computational costs.
Implications and Future Directions
The findings of this research underscore the critical need for robust evaluation systems in AI applications. The vulnerabilities identified in LLM-as-a-Judge paradigms suggest that similar systems could be susceptible to backdoor attacks, posing risks to the integrity of automated decision-making processes. The proposed solutions, particularly model merging, offer a promising route to fortify these systems against adversarial threats. Future work could explore more sophisticated backdoor detection techniques and extend these findings to other domains where LLMs are employed as evaluators.
Figure 3: Results for poisoning pair-wise evaluators across different poison rates.
Conclusion
This study exposes significant backdoor vulnerabilities within LLM-as-a-Judge systems, emphasizing the ease with which adversaries can manipulate evaluation outcomes. By categorizing attack scenarios by data access levels and proposing model merging as an effective defense, the research provides a comprehensive framework to address these challenges. The implications extend beyond theoretical analysis, affecting practical deployments of LLM-based evaluators in various sectors. As LLMs continue to proliferate, ensuring the security and reliability of evaluation mechanisms remains a paramount concern for the AI community.