Defenses against adversarial attacks on FLIP

Develop effective defenses against adversarial prompt attacks targeting FLIP (FLipped Inference for Prompt reconstruction), a reward modeling approach that infers an instruction from a response and assigns a reward based on the F1 similarity between the inferred and the original instructions. The goal is to prevent or mitigate manipulations that inflate reward scores.
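To make the attack surface concrete, the scoring mechanism can be sketched as a standard token-level F1 between the inferred and original instructions. The helper names below (infer_instruction, flip_reward) are illustrative assumptions, not the paper's implementation; the backward-inference step itself would be performed by an LLM.

```python
from collections import Counter

def token_f1(pred: str, ref: str) -> float:
    """Token-level F1, as used in SQuAD-style evaluation (illustrative
    stand-in for the paper's instruction-similarity metric)."""
    pred_toks, ref_toks = pred.lower().split(), ref.lower().split()
    common = Counter(pred_toks) & Counter(ref_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(ref_toks)
    return 2 * precision * recall / (precision + recall)

def flip_reward(infer_instruction, response: str, original_instruction: str) -> float:
    """Backward-inference reward: reconstruct the instruction from the
    response, then score it against the original instruction.
    `infer_instruction` is a hypothetical callable wrapping an LLM."""
    inferred = infer_instruction(response)
    return token_f1(inferred, original_instruction)
```

An attacker who can embed text in the response that leaks or paraphrases the original instruction can push token_f1 toward 1.0 without actually answering well, which is the reward-hacking behavior the defenses must prevent.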

Background

The paper introduces FLIP, a reward modeling method that performs backward inference to reconstruct the most plausible instruction from a given response and uses the similarity between the inferred and original instructions as the reward signal. While FLIP outperforms LLM-as-a-Judge baselines and remains robust under the adversarial prompts tested, the authors acknowledge that defenses specific to attacks targeting FLIP have not yet been developed.

In the adversarial prompts analysis, the authors evaluate simple adversarial injections and observe that FLIP remains more effective than the baselines; however, they explicitly defer the development of systematic defenses tailored to FLIP to future work.
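One possible direction, sketched here purely as an assumption (the paper proposes no defense), is to sanitize responses before backward inference by stripping text that looks like an injected instruction leak. The patterns and function below are hypothetical heuristics, not from the paper.

```python
import re

# Illustrative patterns an attacker might inject to steer backward
# inference toward the original instruction (assumed, not from the paper).
INJECTION_PATTERNS = [
    r"(?i)\bthe (original )?instruction (was|is)\b",
    r"(?i)\bignore (all )?(previous|prior) (instructions|text)\b",
    r"(?i)\bwhen inferring the instruction\b",
]

def sanitize_response(response: str) -> str:
    """Drop lines matching known injection patterns so that FLIP's
    backward inference only sees the substantive answer text."""
    kept = [
        line
        for line in response.splitlines()
        if not any(re.search(p, line) for p in INJECTION_PATTERNS)
    ]
    return "\n".join(kept)
```

A pattern-based filter like this is easy to evade with paraphrases, so a systematic defense would likely need to go further, e.g. detecting instruction-like content semantically or penalizing responses whose inferred instruction depends on a small injected span.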

References

We leave the development of effective defenses against adversarial attacks to FLIP to future work.

Small Reward Models via Backward Inference  (2602.13551 - Wang et al., 14 Feb 2026) in Section: Analysis, Adversarial Prompts / Reward Hacking