Develop Defences Against Boundary Point Jailbreaking

Develop effective defensive methods against Boundary Point Jailbreaking (BPJ), a fully black-box, decision-based attack algorithm that uses curriculum learning and boundary point evaluation to bypass classifier-guarded large language models. The goal is for classifier-based safeguards (e.g., auxiliary monitors such as Constitutional Classifiers and input classifiers) to remain robust even when the attacker receives only binary flagged/not-flagged feedback.

Background

The paper (Davies et al., 2026) introduces Boundary Point Jailbreaking (BPJ), a fully automated, black-box attack that successfully evades state-of-the-art classifier-based safeguards, including Anthropic’s Constitutional Classifiers and OpenAI’s GPT-5 input classifier, using only binary flag feedback. BPJ combines curriculum learning with active selection of high-signal evaluation points (boundary points) to improve adversarial prefixes and achieve universal jailbreaks.

The authors argue that BPJ is difficult to counter with single-interaction defences and note that effective protection likely requires batch-level monitoring. Given BPJ’s demonstrated success against industry systems, establishing practical defences that remain robust under BPJ’s optimization process is a key open challenge.
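
To make "batch-level monitoring" concrete, the following is a minimal sketch of one possible form it could take: tracking the classifier's binary decisions per account over a sliding window and escalating accounts whose recent flag rate is unusually high. This is an illustrative assumption, not a defence proposed or evaluated in the paper. The class name `BatchMonitor`, the flag-rate heuristic, and all thresholds are hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class BatchMonitor:
    """Hypothetical per-account monitor over a sliding window of classifier decisions."""

    window: int = 200           # assumed: number of recent decisions kept per account
    min_queries: int = 50       # assumed: minimum history before an account is judged
    max_flag_rate: float = 0.2  # assumed: flag rate above which the account is escalated
    _history: dict = field(default_factory=lambda: defaultdict(deque))

    def record(self, account_id: str, flagged: bool) -> bool:
        """Store one binary classifier decision; return True if the account should be escalated."""
        history = self._history[account_id]
        history.append(flagged)
        # Keep only the most recent `window` decisions.
        if len(history) > self.window:
            history.popleft()
        # Avoid judging accounts with too little history.
        if len(history) < self.min_queries:
            return False
        # Escalate when the fraction of flagged queries in the window exceeds the threshold.
        return sum(history) / len(history) > self.max_flag_rate


# Usage: feed each per-query classifier decision into the monitor.
monitor = BatchMonitor(window=10, min_queries=5, max_flag_rate=0.3)
decisions = [True, False, True, False, True]
escalations = [monitor.record("account-1", d) for d in decisions]
```

The design choice here is that the signal comes from the pattern across many queries rather than from any single interaction, which is the property the authors suggest a defence likely needs. Whether a simple flag-rate threshold would actually withstand BPJ's optimization is exactly the open question.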

References

A range of open questions remain, including developing defences to BPJ and exploring formally and empirically why BPJ attacks learned on a single attack readily transfer to other queries.

Boundary Point Jailbreaking of Black-Box LLMs (2602.15001 - Davies et al., 16 Feb 2026) in Section 6 (Discussion), Broader Implications