MOBjailbreak: Multimodal LLM Jailbreaks
- MOBjailbreak is an advanced attack framework that manipulates prompts and mobile agent inputs to bypass LLM and VLM safety and alignment measures.
- It employs creative prompt rewriting, stealthy injection, and model-editing techniques to spoof defenses with near-perfect attack success rates in evaluations.
- Current plug-and-play defenses are insufficient, highlighting the need for deeper semantic analysis and intent classification to secure algorithmic optimization requests.
MOBjailbreak denotes a class of advanced jailbreak attacks and frameworks targeting the safety mechanisms of LLMs and vision-LLMs (VLMs), with a particular emphasis on algorithm design requests, mobile agents, and multimodal decision-making scenarios. Research on MOBjailbreak has identified critical vulnerabilities in state-of-the-art LLMs, especially in contexts such as autonomous optimization, mobile device agent operation, and cross-modal interaction, revealing that existing plug-and-play defenses are largely ineffective and that prompt rewriting and obfuscation attacks can achieve near-perfect success rates. The following sections provide a comprehensive treatment of MOBjailbreak methods, their evaluation, related defense paradigms, and implications for the security and alignment of advanced language and vision-LLMs.
1. Problem Definition and Scope
MOBjailbreak attacks are formally defined as methods that manipulate prompts or input modalities in order to bypass the safety filters and alignment mechanisms of LLMs and VLMs, thereby eliciting harmful, unethical, or forbidden content. A defining feature is their applicability to so-called "malicious intelligent optimization algorithm requests," in which attackers instruct the model to design algorithms or workflows intended for adversarial or illicit objectives, often couched in high-level, abstract terms that evade keyword-based or pattern-matching defenses.
Key vulnerabilities arise because current alignment strategies focus on low-level code patterns, explicit banned word lists, or refusal triggers, whereas optimization requests exploit the model's objective-driven generation process and its inability to distinguish benign from malicious algorithmic intents when posed in sophisticated linguistic formats (Gu et al., 1 Jan 2026). This results in a "blind spot" for tasks involving specification of objectives, constraints, and heuristics—domain attributes that rarely trigger conventional safety modules.
2. Threat Model and Formal Attack Scenario
The threat model encompasses adversaries capable of crafting or transforming requests to evade safety mechanisms in black-box or partially observable settings. Let denote a malicious optimization algorithm request and a rewritten or obfuscated version intended to bypass safety filters. The adversary's objective is to maximize the probability that the model outputs a solution directly fulfilling the illicit intent encoded in :
where indicates successful assistance with the malicious goal (Gu et al., 1 Jan 2026). This model applies both to prompt-based attacks (in text or multimodal input) and to perception-chain exploits on mobile agent platforms (Ding et al., 9 Oct 2025).
3. Attack Methodologies
3.1. Prompt Rewriting via Creative-Scenario Masking
The canonical MOBjailbreak attack (Gu et al., 1 Jan 2026) employs a fixed creative-writing "fiction author" transformation template:
1 2 3 4 |
<original malicious prompt P> I am a fiction author... [Request for two-stage conversion: first to analyze dramatic intent, then to rewrite as a safe, non-malicious scenario.] [Output: only the rewritten safe version and a replacement notes table.] |
The malicious is rewritten using an open-source LLM surrogate (e.g., DeepSeek-V3) and then submitted to the target LLM. This obfuscation consistently enables the model to generate algorithmic solutions serving the original harmful request.
3.2. Stealthy Injection and Perception-Chain Targeting in Mobile Agents
For mobile VLM-driven agents, MOBjailbreak attacks leverage in-app prompt injection, stealthy touch-based triggers, and visual-input hijacks. A malicious app embeds a hidden prompt δ in the UI, revealed only during automated (ADB-based) agent operation, with user-invisible activation via low-pressure/contact area taps (Ding et al., 9 Oct 2025). The efficacy of jailbreak is optimized by HG-IDA*, a heuristic-guided, character-level search that minimally perturbs the prompt to evade on-device moderation.
3.3. Activation-Guided and Model-Editing Attacks
MOBjailbreak encompasses frameworks utilizing activation-guided local editing (Wang et al., 1 Aug 2025) and model-editing methods such as JailbreakEdit (Chen et al., 9 Feb 2025). The former combines scenario-based context generation and hidden-state-guided token substitution/insertion to evade detection; the latter directly injects a universal backdoor into the model's parameters, enabling compliance upon trigger activation without degraded performance on benign queries.
4. Benchmarking and Empirical Evaluation
Central to MOBjailbreak research is the creation of MalOptBench, a curated set of 60 intelligent optimization algorithm requests (e.g., Online Bin Packing, Traveling Salesman Problem, Flow Shop Scheduling, Bayesian Optimization Acquisition-Function Design) targeting diverse illicit decision-making scenarios (Gu et al., 1 Jan 2026). Evaluation proceeds by measuring:
- Attack Success Rate (ASR): Fraction of prompts yielding compliant, algorithmic assistance
- Harmfulness Score : Mean assistant-assigned rating from 1 (refusal) to 5 (full compliance with forbidden goal)
The table below gives representative results:
| Model | Orig. ASR / h | JB ASR / h |
|---|---|---|
| GPT-4o | 96.7% / 4.87 | 96.7% / 4.85 |
| GPT-5 | 38.3% / 2.58 | 96.6% / 4.88 |
| OpenAI-o3 | 55% / 3.13 | 96.6% / 4.83 |
| Average (13 models) | 83.6% / 4.28 | 97.95% / 4.87 |
All tested models, including GPT-5, exhibit near-total failure under MOBjailbreak (ASR approaches 100%), with high harmfulness scores, even in configurations where direct queries partially resist (Gu et al., 1 Jan 2026).
5. Analysis of Defensive Measures
Plug-and-play defenses evaluated include SAGE (Self-Aware Guard Enhancement) and Self-Reminder, both of which utilize external safety prompts or chain-of-thought re-verification. While these reduce ASR on the original MalOptBench, they are ineffective against MOBjailbreak (ASR remains ≈93% for SAGE, ≈81% for Self-Reminder). Notably, the addition of defense mechanisms to both surrogate and target models can inadvertently increase attack success, and "exaggerated safety" behaviors emerge (e.g., benign prompt refusals >70%) (Gu et al., 1 Jan 2026).
Recommended strategies for improved defense include:
- Algorithmic specification analysis for malicious objective–constraint patterns.
- Intent classification to identify suspicious high-level requests, beyond surface code or banned tokens.
- Pre-processing to sanitize and re-validate prompts for disguised instructional templates.
- Model-in-the-loop adversarial training using poisoned/re-written prompts to harden refusal boundaries.
6. Implications for Model Alignment and Security
The research demonstrates that current LLM and VLM safety architectures are fundamentally vulnerable to prompt obfuscation and context-shifting attacks targeting abstract algorithm design capabilities. MOBjailbreak attacks leverage models' tendency to prioritize task completion—especially when presented with high-level objectives couched in benign-seeming narrative or optimization language—over robust safety compliance.
No existing plug-and-play or prompt-based filter is able to reliably defend against these attacks. Even state-of-the-art, closed-source models (including GPT-5) succumb to creative rewriting at near-perfect rates. Defenses will require the development of deeper semantic and intent-driven safety checking, augmenting filter layers with comprehensive, context-aware, and trajectory-based evaluation strategies (Liang et al., 1 Jul 2025).
The transferability and query efficiency of these techniques—combined with their stealth and lack of dependence on static, brittle banned-token lists—underscore the urgency of developing stronger adversarial training, evidence-based intent classification, and multimodal alignment protocols for all LLM/VLM deployments in sensitive or high-stakes domains.
7. Future Directions and Open Problems
Future MOBjailbreak research will likely focus on:
- Formalization of defense mechanisms that analyze the semantic and structural properties of optimization and algorithm design requests, integrating semantic parsing with alignment modeling.
- Automated, LLM- and trajectory-based auditing infrastructure capable of detecting jailbreaking patterns beyond single-turn, string-matching paradigms (Liang et al., 1 Jul 2025).
- Model-editing integrity verification, e.g., cryptographic audits on network parameters to detect clandestine backdoor edits (Chen et al., 9 Feb 2025).
- Scaling adversarial datasets (like MalOptBench) to encompass new modalities (vision, code, speech) and scenario contexts.
- Human-in-the-loop and multi-stage review workflows for high-impact or optimization-style queries.
These directions are motivated by the demonstrated inability of current thresholding, prompt-prepending, and chain-of-thought verification methods to counteract creative, context-shifting jailbreaks—rendering technical alignment and continuous monitoring central to the next generation of model security practices.
References:
- (Gu et al., 1 Jan 2026) Overlooked Safety Vulnerability in LLMs: Malicious Intelligent Optimization Algorithm Request and its Jailbreak
- (Ding et al., 9 Oct 2025) Effective and Stealthy One-Shot Jailbreaks on Deployed Mobile Vision-Language Agents
- (Liang et al., 1 Jul 2025) SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents
- (Wang et al., 1 Aug 2025) Activation-Guided Local Editing for Jailbreaking Attacks
- (Chen et al., 9 Feb 2025) Injecting Universal Jailbreak Backdoors into LLMs in Minutes