Defenses against adversarial attacks on FLIP
Develop effective defenses against adversarial prompt attacks that target FLIP (FLipped Inference for Prompt reconstruction). FLIP is a reward modeling approach that infers an instruction from a response and scores the response by the F1 similarity between the inferred instruction and the original one. The goal is to prevent or mitigate manipulations that artificially inflate this reward score.
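To make the attack surface concrete, here is a minimal sketch of a token-level F1 reward of the kind the description implies. The tokenization (lowercased whitespace split) and the example strings are assumptions for illustration, not details from the paper; the point is that a response which leaks its instruction verbatim lets the backward-inference model reconstruct it exactly, maximizing the reward regardless of answer quality.

```python
from collections import Counter


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between an inferred and the original instruction.

    Assumed to mirror FLIP's F1 similarity; the paper's exact
    tokenization may differ (this sketch lowercases and splits on
    whitespace).
    """
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reward-hacking scenario: the "leaked" inference arises
# when the response parrots its instruction, so backward inference
# recovers it verbatim and the reward saturates at 1.0.
original = "Summarize the article in one sentence"
honest_inference = "Give a one sentence summary of the article"
leaked_inference = "Summarize the article in one sentence"

print(token_f1(honest_inference, original))
print(token_f1(leaked_inference, original))  # → 1.0
```

The gap between the two scores is the incentive an adversary exploits: the reward measures instruction recoverability, not response quality.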
References
We leave the development of effective defenses against adversarial attacks to FLIP to future work.
— Small Reward Models via Backward Inference
(2602.13551 - Wang et al., 14 Feb 2026) in Section: Analysis, Adversarial Prompts / Reward Hacking