- The paper introduces SEMA, a two-stage pipeline combining prefilling self-tuning with open-loop generation to address multi-turn jailbreak challenges.
- It leverages reinforcement learning with intent-drift-aware rewards to achieve state-of-the-art attack success rates on AdvBench and HarmBench datasets.
- Empirical evaluations and ablation studies confirm SEMA's scalability, transferability, and potential as a tool for automated red-teaming in LLM safety.
SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks
Problem Context and Motivation
Multi-turn jailbreaks represent an authentic and challenging threat model for safety-aligned LLMs, surpassing the traditional focus on isolated, single-turn attacks. In real-world scenarios, adversarial actors often leverage staged, context-rich interactions to stealthily elicit harmful responses, circumventing safety protocols via incremental and intent-obfuscating dialogue. Previous approaches to multi-turn jailbreaks are typically restricted by high exploration complexity and pronounced intent drift. These methods either utilize scripted paradigms—hard-coded, expert-crafted splitting of malicious queries into dialogue fragments—or template-driven strategies wherein API calls generate context-conditioned adversarial turns. Both suffer from coverage limitations, rigidity, and dependency on victim feedback, with template-based pipelines especially susceptible to intent drift, resulting in benign, irrelevant, or incoherent responses.
SEMA Framework
Prefilling Self-tuning and Open-loop Planning
SEMA introduces a two-stage pipeline that resolves key bottlenecks in multi-turn attack training: exploration complexity and intent drift.
Reinforcement Learning with Intent-Drift-Aware Reward
The attacker is subsequently trained using Group Relative Policy Optimization (GRPO) with an intent-drift-aware reward. The reward integrates three metrics for each simulated attack:
- Intent Alignment: Evaluates whether the victim's final response matches the original malicious intent, penalizing drift and benign pivots.
- Compliance Risk: Assesses the risk profile of the response, with higher scores for dangerous, actionable content.
- Level of Detail: Favors responses that are concrete, operationalizable, and contain explicit procedural guidance.
Adversarial prompts are preferred if they preserve the harmful objective throughout turns and elicit high-quality, intent-aligned responses from victims. Format rewards enforce parseable, well-structured outputs. The GRPO objective leverages group sampling and normalized advantage computation, promoting broad, stable policy updates.
Empirical Evaluation
Baseline Comparison and Main Results
SEMA is assessed on AdvBench and HarmBench datasets using both open-source and closed-source victim models, alongside multiple automated classifiers (LLM classifier, HarmBench classifier, No Refusal phrase indicator). SEMA consistently achieves state-of-the-art attack success rates (ASR@1). Notable numerical results include:
Scalability and Transferability
SEMA's scalability is evidenced by its rapid conversion of increased attempt budget into nearly saturated success rates. On HarmBench, SEMA achieves 96.8% ASR@5 and 99.7% ASR@20, outperforming baselines by a wide margin. Transfer experiments demonstrate robust success across source–target victim pairs, with TASR@1 as high as 92.6% for Qwen2.5-3B-Instruct → GPT-4.1-mini, confirming generalization and practical robustness.
Figure 3: Attack Success Rate with N attempts (ASR@N) on HarmBench and AdvBench for different victim models, illustrating SEMA’s superior scaling with attempt budget.
Case Analysis and Attack Diversity
Qualitative analysis demonstrates SEMA's capability to craft semantically diverse, strategically varied multi-turn adversarial schemes. Successful trajectories include:
Ablation and Methodological Insights
Ablation studies confirm key findings:
- Removing intent alignment from reward leads to lower ASR@1 and increased intent drift.
- Prefilling self-tuning is essential for bootstrapping usable rollouts; omission results in consistent refusals and failed convergence.
- Increasing prompt turns enhances success up to a threshold, after which model capacity bottlenecks degrade rollout quality.
Practical and Theoretical Implications
SEMA's design—simple, reproducible, template-free, and response-agnostic—substantially advances automated red-teaming for LLM safety. Its ability to uncover previously unexposed vulnerabilities in frontier models (e.g., GPT-oss-20B, GPT-4o) and generate diverse, transferable multi-turn attack plans sets a new bar for stress testing and adversarial robustness evaluation. The framework's avoidance of closed-source APIs and hand-crafted strategies addresses scalability, diversity, and deployment efficiency. Critically, singling out intent drift as a core axis for reward shaping demonstrates the importance of maintaining adversarial semantic continuity in realistic attack scenarios.
SEMA facilitates co-evolutionary training of defenders and attackers, enables systematic localization of safety failures, and motivates future research in multimodal jailbreaks, turn-efficiency optimization, and diversity-enhanced attack exploration.
Conclusion
SEMA offers a compact, efficient, and powerful framework for learning multi-turn jailbreak attackers against safety-aligned LLMs. Its two-stage pipeline—prefilling self-tuning and GRPO-based RL with intent-drift-aware rewards—enables response-agnostic, intent-persistent adversarial planning, yielding state-of-the-art performance in ASR, scalability, and transferability. SEMA is poised to become an indispensable tool for automated red-teaming and safety evaluation, catalyzing both practical defense improvements and theoretical advances in adversarial RL for LLMs.