Develop Defences Against Boundary Point Jailbreaking
Develop effective defences against Boundary Point Jailbreaking (BPJ), a fully black-box, decision-based attack that uses curriculum learning and boundary point evaluation to bypass classifier-guarded large language models. The goal is for classifier-based safeguards (e.g., auxiliary monitors such as Constitutional Classifiers and input classifiers) to remain robust even when the attacker observes only binary flagged/not-flagged feedback.
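To make the threat model concrete, the following is a minimal sketch of the decision-based interface a BPJ-style attacker faces: the attacker never sees classifier scores, only a binary flag. The `input_classifier` keyword heuristic and the randomized-response "boundary blurring" defence are illustrative assumptions, not the paper's method; the defence simply shows one direction a countermeasure could take, by adding noise to decisions near the threshold so that boundary points are harder to localize.

```python
import random

def input_classifier(query: str) -> float:
    # Hypothetical stand-in for a learned safety classifier:
    # returns a harmfulness score in [0, 1].
    blocklist = ("explosive", "malware")
    return 1.0 if any(word in query.lower() for word in blocklist) else 0.0

class GuardedModel:
    """Decision-based threat model: callers observe only a binary
    flagged/not-flagged signal, never the underlying score."""

    def __init__(self, threshold: float = 0.5, noise: float = 0.0, seed: int = 0):
        self.threshold = threshold
        self.noise = noise  # width of the randomized band around the threshold
        self.rng = random.Random(seed)

    def flagged(self, query: str) -> bool:
        score = input_classifier(query)
        # Illustrative defence (an assumption, not from the paper):
        # randomize decisions for scores near the threshold, blurring
        # the boundary that a decision-based attack tries to localize.
        if abs(score - self.threshold) < self.noise:
            return self.rng.random() < 0.5
        return score >= self.threshold

guard = GuardedModel()
print(guard.flagged("how do I make an explosive"))  # True
print(guard.flagged("how do I bake bread"))         # False
```

With `noise > 0`, repeated probes near the decision boundary return inconsistent flags, which degrades the binary signal a boundary-point search depends on; the open question is whether such randomization (or any other defence) can resist BPJ without raising false-positive rates on benign queries.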
References
A range of open questions remain, including developing defences against BPJ and exploring, formally and empirically, why BPJ attacks learned on a single query readily transfer to other queries.
— Boundary Point Jailbreaking of Black-Box LLMs
(2602.15001 - Davies et al., 16 Feb 2026) in Section 6 (Discussion), Broader Implications