Detecting Prompt Injection and Evaluation Manipulation in LLM-as-a-Judge

Develop methods to detect prompt injection and other evaluation manipulation attacks that bias the decisions of LLM-as-a-Judge systems.

Background

Adversaries can embed hidden instructions or adversarial patterns within candidate responses, steering judge models away from intended criteria. Such attacks undermine the integrity of assessment pipelines by covertly influencing scoring or selection.

The paper calls for dedicated detection approaches to identify and flag prompt injections and related manipulation tactics before they alter evaluation results.
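As a toy illustration of the kind of pre-screening the paper calls for (not a method from the paper), a first line of defense can be a heuristic filter that scans candidate responses for common injection phrasings and hidden-text channels before they reach the judge; the patterns below are illustrative assumptions, not an exhaustive taxonomy:

```python
import re
import unicodedata

# Hypothetical injection phrasings frequently seen in judge-manipulation attempts.
INJECTION_PATTERNS = [
    r"ignore (all |the )?(previous|prior|above) instructions",
    r"you are (now )?the (grader|judge|evaluator)",
    r"(give|assign|award) (this|the) (response|answer) (a )?(score of )?(10|full marks)",
    r"system prompt",
]

def detect_injection(candidate: str) -> list[str]:
    """Return a list of reasons the candidate looks manipulated (empty = clean)."""
    flags = []
    lowered = candidate.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            flags.append(f"matched injection pattern: {pattern!r}")
    # Zero-width and other Unicode format characters (category "Cf") are a
    # common channel for hiding instructions from human reviewers.
    hidden = [ch for ch in candidate if unicodedata.category(ch) == "Cf"]
    if hidden:
        flags.append(f"contains {len(hidden)} invisible format characters")
    return flags
```

A flagged response could then be excluded from scoring or routed to manual review. Such pattern lists are easy to evade, which is precisely why the open problem asks for more robust, learned detection approaches.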

Open Problem

Design methods for detecting prompt injection and evaluation manipulation attacks against LLM-as-a-Judge systems.

References

Security in LLM-as-a-Judge: A Comprehensive SoK (2603.29403 - Masoud et al., 31 Mar 2026), Section 7.1, Vulnerability to Adversarial Prompt Manipulation (Challenges and Open Problems)