Adversarially Robust LLM-as-a-Judge Evaluation Systems

Develop LLM-as-a-Judge-based evaluation systems that prevent adversarial attacks, ensuring that judgments are not manipulated by prompt injection or maliciously crafted responses and that automated assessment pipelines remain secure.

Background

This problem arises from the susceptibility of LLM-as-a-Judge systems to adversarial prompt manipulation, where attackers embed hidden instructions or persuasive language in candidate responses to bias evaluations. Because judges rely heavily on prompt context, such adversarial inputs can distort outcomes and grant high scores to malicious outputs.

The paper highlights that securing automated evaluation requires judge models and pipelines that are resistant to adversarial manipulation, motivating the need for evaluation systems explicitly designed to prevent these attacks.

References

The open research problems in this context are: Create evaluation systems based on LLM to prevent adversarial attacks.

Security in LLM-as-a-Judge: A Comprehensive SoK  (2603.29403 - Masoud et al., 31 Mar 2026) in Section 7.1, Vulnerability to Adversarial Prompt Manipulation (Challenges and Open Problems)