SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Published 6 Feb 2026 in cs.CL | (2602.06854v1)

Abstract: Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with intent-drift-aware reward trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, as well as our SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA performs an average $80.1\%$ ASR@1 across three closed-source and open-source victim models on AdvBench, 33.9% over SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for LLM safety and enabling automatic redteaming to expose and localize failure modes. Our code is available at: https://github.com/fmmarkmq/SEMA.

Abstract PDF Upgrade to Chat

Summary

The paper introduces SEMA, a two-stage pipeline combining prefilling self-tuning with open-loop generation to address multi-turn jailbreak challenges.
It leverages reinforcement learning with intent-drift-aware rewards to achieve state-of-the-art attack success rates on AdvBench and HarmBench datasets.
Empirical evaluations and ablation studies confirm SEMA's scalability, transferability, and potential as a tool for automated red-teaming in LLM safety.

SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Problem Context and Motivation

Multi-turn jailbreaks represent an authentic and challenging threat model for safety-aligned LLMs, surpassing the traditional focus on isolated, single-turn attacks. In real-world scenarios, adversarial actors often leverage staged, context-rich interactions to stealthily elicit harmful responses, circumventing safety protocols via incremental and intent-obfuscating dialogue. Previous approaches to multi-turn jailbreaks are typically restricted by high exploration complexity and pronounced intent drift. These methods either utilize scripted paradigms—hard-coded, expert-crafted splitting of malicious queries into dialogue fragments—or template-driven strategies wherein API calls generate context-conditioned adversarial turns. Both suffer from coverage limitations, rigidity, and dependency on victim feedback, with template-based pipelines especially susceptible to intent drift, resulting in benign, irrelevant, or incoherent responses.

SEMA Framework

Prefilling Self-tuning and Open-loop Planning

SEMA introduces a two-stage pipeline that resolves key bottlenecks in multi-turn attack training: exploration complexity and intent drift.

Prefilling Self-tuning: The attacker is initialized by fine-tuning on self-generated rollouts with lightweight structural cues (e.g., "1." as a list index), ensuring the output comprises non-refusal and well-structured multi-turn adversarial prompts. This guarantees usable rollouts for downstream RL and maintains high formatting consistency, fostering improved sample efficiency.
Open-loop Generation: Rather than conditioning prompt generation on victim responses (closed-loop), SEMA emits entire multi-turn adversarial plans in a response-agnostic, single-shot manner. This reduces branching factors in the prompt-response space, unifies single- and multi-turn information flows, cuts interaction costs, and enables batched sampling with enhanced exploration stability.
Figure 1: Overview of SEMA framework, illustrating the prefilling self-tuning stage and subsequent reinforcement learning using intent-drift-aware rewards.

Reinforcement Learning with Intent-Drift-Aware Reward

The attacker is subsequently trained using Group Relative Policy Optimization (GRPO) with an intent-drift-aware reward. The reward integrates three metrics for each simulated attack:

Intent Alignment: Evaluates whether the victim's final response matches the original malicious intent, penalizing drift and benign pivots.
Compliance Risk: Assesses the risk profile of the response, with higher scores for dangerous, actionable content.
Level of Detail: Favors responses that are concrete, operationalizable, and contain explicit procedural guidance.

Adversarial prompts are preferred if they preserve the harmful objective throughout turns and elicit high-quality, intent-aligned responses from victims. Format rewards enforce parseable, well-structured outputs. The GRPO objective leverages group sampling and normalized advantage computation, promoting broad, stable policy updates.

Empirical Evaluation

Baseline Comparison and Main Results

SEMA is assessed on AdvBench and HarmBench datasets using both open-source and closed-source victim models, alongside multiple automated classifiers (LLM classifier, HarmBench classifier, No Refusal phrase indicator). SEMA consistently achieves state-of-the-art attack success rates (ASR@1). Notable numerical results include:

AdvBench: SEMA posts $79.9\%$ , $77.2\%$ , and $83.3\%$ ASR@1 against Qwen2.5-3B-Instruct, Llama-3.1-8B-Instruct, and GPT-4.1-mini, respectively—substantially higher than all single-turn and template-based baselines (FlipAttack, ADV-LLM, Crescendo, etc.).
HarmBench: ASR@1 achieves $74.5\%$ , $70.6\%$ , and $79.8\%$ , again outperforming prior multi-turn strategies (Jigsaw Puzzle, Crescendo, FITD, ActorAttack).
Figure 2: Ablation studies demonstrating the impact of intent-drift-aware reward and prefilling self-tuning on ASR@1, and the effect of training-time maximum turns.

Scalability and Transferability

SEMA's scalability is evidenced by its rapid conversion of increased attempt budget into nearly saturated success rates. On HarmBench, SEMA achieves $96.8\%$ ASR@5 and $99.7\%$ ASR@20, outperforming baselines by a wide margin. Transfer experiments demonstrate robust success across source–target victim pairs, with TASR@1 as high as $92.6\%$ for Qwen2.5-3B-Instruct $\rightarrow$ GPT-4.1-mini, confirming generalization and practical robustness.

Figure 3: Attack Success Rate with $N$ attempts (ASR@N) on HarmBench and AdvBench for different victim models, illustrating SEMA’s superior scaling with attempt budget.

Case Analysis and Attack Diversity

Qualitative analysis demonstrates SEMA's capability to craft semantically diverse, strategically varied multi-turn adversarial schemes. Successful trajectories include:

Staged Knowledge Probing: Early turns request historical context and domain information, while late turns unobtrusively elicit the harmful target, bypassing early-stage filters.
Fictitious Framing: Use of "novel-writing" or "academic research" themes to disguise malicious objectives, with the attack converging on explicit harmful content in final turns.
Figure 4: Real success cases of SEMA from AdvBench on GPT-oss-20B and HarmBench on Llama-3.1-8B-Instruct, highlighting key adversarial features in multi-turn plans.

Ablation and Methodological Insights

Ablation studies confirm key findings:

Removing intent alignment from reward leads to lower ASR@1 and increased intent drift.
Prefilling self-tuning is essential for bootstrapping usable rollouts; omission results in consistent refusals and failed convergence.
Increasing prompt turns enhances success up to a threshold, after which model capacity bottlenecks degrade rollout quality.

Practical and Theoretical Implications

SEMA's design—simple, reproducible, template-free, and response-agnostic—substantially advances automated red-teaming for LLM safety. Its ability to uncover previously unexposed vulnerabilities in frontier models (e.g., GPT-oss-20B, GPT-4o) and generate diverse, transferable multi-turn attack plans sets a new bar for stress testing and adversarial robustness evaluation. The framework's avoidance of closed-source APIs and hand-crafted strategies addresses scalability, diversity, and deployment efficiency. Critically, singling out intent drift as a core axis for reward shaping demonstrates the importance of maintaining adversarial semantic continuity in realistic attack scenarios.

SEMA facilitates co-evolutionary training of defenders and attackers, enables systematic localization of safety failures, and motivates future research in multimodal jailbreaks, turn-efficiency optimization, and diversity-enhanced attack exploration.

Conclusion

SEMA offers a compact, efficient, and powerful framework for learning multi-turn jailbreak attackers against safety-aligned LLMs. Its two-stage pipeline—prefilling self-tuning and GRPO-based RL with intent-drift-aware rewards—enables response-agnostic, intent-persistent adversarial planning, yielding state-of-the-art performance in ASR, scalability, and transferability. SEMA is poised to become an indispensable tool for automated red-teaming and safety evaluation, catalyzing both practical defense improvements and theoretical advances in adversarial RL for LLMs.

Markdown Report Issue