BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge

Published 1 Mar 2025 in cs.CL, cs.AI, and cs.CR | (2503.00596v1)

Abstract: This paper proposes a novel backdoor threat attacking the LLM-as-a-Judge evaluation regime, where the adversary controls both the candidate and evaluator model. The backdoored evaluator victimizes benign users by unfairly assigning inflated scores to adversary. A trivial single token backdoor poisoning 1% of the evaluator training data triples the adversary's score with respect to their legitimate score. We systematically categorize levels of data access corresponding to three real-world settings, (1) web poisoning, (2) malicious annotator, and (3) weight poisoning. These regimes reflect a weak to strong escalation of data access that highly correlates with attack severity. Under the weakest assumptions - web poisoning (1), the adversary still induces a 20% score inflation. Likewise, in the (3) weight poisoning regime, the stronger assumptions enable the adversary to inflate their scores from 1.5/5 to 4.9/5. The backdoor threat generalizes across different evaluator architectures, trigger designs, evaluation tasks, and poisoning rates. By poisoning 10% of the evaluator training data, we control toxicity judges (Guardrails) to misclassify toxic prompts as non-toxic 89% of the time, and document reranker judges in RAG to rank the poisoned document first 97% of the time. LLM-as-a-Judge is uniquely positioned at the intersection of ethics and technology, where social implications of mislead model selection and evaluation constrain the available defensive tools. Amidst these challenges, model merging emerges as a principled tool to offset the backdoor, reducing ASR to near 0% whilst maintaining SOTA performance. Model merging's low computational cost and convenient integration into the current LLM Judge training pipeline position it as a promising avenue for backdoor mitigation in the LLM-as-a-Judge setting.

Abstract PDF Upgrade to Chat

Summary

The paper presents a novel backdoor attack strategy on LLM-as-a-Judge systems that manipulates evaluation scores through poisoned training data.
Experiments show even 1% data poisoning can drastically alter model outputs, with 10% poisoning leading to misclassification up to 89% of toxic prompts.
The proposed model merging technique effectively neutralizes malicious triggers, reducing attack success rates near zero without high computational costs.

Overview of "BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge"

The paper "BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge" investigates the potential threats posed by backdoor attacks on LLMs used as automated evaluation systems, referred to as LLM-as-a-Judge. The authors propose a novel attack strategy where adversaries can control both candidate and evaluator models to manipulate evaluation scores unfairly in their favor. The study categorizes the levels of data access that an adversary may possess, ranging from minimal access through web poisoning to full access via weight poisoning, each correlating with different severities of attack. Importantly, the paper also introduces a mitigation strategy based on model merging, aiming to neutralize the backdoor effects while maintaining state-of-the-art performance.

Attack Framework and Methodology

The proposed backdoor attack targets the LLM-as-a-Judge paradigm by implanting malicious triggers in the training data of evaluator models. The adversaries can utilize these triggers to boost scores undeservedly. The paper explores three primary scenarios of adversary data access:

Web Poisoning: The adversary introduces poisoned data into publicly available internet resources, anticipating that this data will be scraped and included in training datasets.
Malicious Annotator: Open-sourced and community-driven data acquisition processes are exploited by inserting backdoor triggers into the training data via annotations.
Weight Poisoning: Full access is assumed where the adversary can manipulate model weights directly, either through collaboration missteps or internal threats.

The research demonstrates that even with minimal data poisoning (as low as 1%), adversaries can substantially increase their evaluation scores, illustrating the vulnerability of LLM evaluators under these circumstances.

Figure 1: Overview of our attack framework and mitigation strategy. Both point-wise and pair-wise evaluation is at risk of backdoor.

Experimental Results

The experiments validate that backdoor threats are severe and pervasive across various model architectures, trigger designs, and evaluation scenarios. Notably, poisoning 10% of the evaluator training data can lead to a dramatic increase in the model's misclassification rates, such as having toxicity judges misclassify toxic prompts as non-toxic 89% of the time. Furthermore, the paper reports that using rare word triggers can consistently manipulate evaluator scores significantly, even under minimal assumption conditions.

Figure 2: Results for attacking Mistral-7B-InstructV2 fine-tuned on feedback-collection poisoned with rare words under full assumptions.

Defense Mechanisms

The paper proposes a novel defense mechanism known as model merging to mitigate backdoor attacks. This technique involves interpolating the weights of a backdoored model with a clean baseline model to dilute the effect of malicious triggers. The model merging strategy effectively reduces the attack success rate (ASR) to near zero, demonstrating its potential as a viable countermeasure against backdoor vulnerabilities. Additionally, model merging integrates seamlessly into current LLM judge training pipelines without incurring high computational costs.

Implications and Future Directions

The findings of this research underscore the critical need for robust evaluation systems in AI applications. The vulnerabilities identified in LLM-as-a-Judge paradigms suggest that similar systems could be susceptible to backdoor attacks, posing risks to the integrity of automated decision-making processes. The proposed solutions, particularly model merging, offer a promising route to fortify these systems against adversarial threats. Future work could explore more sophisticated backdoor detection techniques and extend these findings to other domains where LLMs are employed as evaluators.

Figure 3: Results for poisoning pair-wise evaluators across different poison rates.

Conclusion

This study exposes significant backdoor vulnerabilities within LLM-as-a-Judge systems, emphasizing the ease with which adversaries can manipulate evaluation outcomes. By categorizing attack scenarios by data access levels and proposing model merging as an effective defense, the research provides a comprehensive framework to address these challenges. The implications extend beyond theoretical analysis, affecting practical deployments of LLM-based evaluators in various sectors. As LLMs continue to proliferate, ensuring the security and reliability of evaluation mechanisms remains a paramount concern for the AI community.

Markdown Report Issue