Outline Filling Attack

Updated 4 December 2025

Outline filling attack is a family of adversarial techniques that decompose tasks into hierarchical, obfuscated templates to bypass machine learning safety filters.
It employs structured prompting in language models (e.g., TrojFill) and edge-aligned perturbations in vision systems to achieve high transferability and stealth.
Empirical evaluations show high attack success rates and proxy-based optimization for dense security boundary mapping, challenging current defense mechanisms.

The outline filling attack is a family of adversarial techniques targeting machine learning systems—both LLMs and vision models—by decomposing forbidden or critical tasks into semantically coherent, often hierarchical, templates in which the sensitive content is obfuscated or distributed among task subcomponents. These attacks exploit the model’s tendencies to comply with structured requests, thereby bypassing defense mechanisms or safety filters. The outline filling paradigm manifests in black-box LLM jailbreaks such as TrojFill and in adversarial attacks against visual classifiers through outline-aligned perturbations, with both lines demonstrating high empirical efficacy, transferability, and, in certain formulations, unprecedented stealth or interpretability (Liu et al., 24 Oct 2025, Zhang et al., 27 Nov 2025, Jia et al., 30 Nov 2025).

1. Conceptual Foundations and Definition

Outline filling attacks reframe the task of bypassing learned or hard-coded safety guardrails as a structured prompting or perturbation problem. Instead of directly issuing a forbidden instruction or applying arbitrary perturbations, the attacker converts the sought-after behavior (e.g., generation of harmful language or fooling a vision classifier) into a multi-part template or outline, with critical information compartmentalized and obfuscated.

For LLMs, this involves multi-part prompts that mask forbidden content (via encoding or placeholders), request explicit reasoning about unsafety, and wrap payload generation inside innocuous-seeming “example” or “analysis” tasks; this structure lowers outright refusal rates by leveraging the model’s compliance with hierarchical reasoning sequences (Liu et al., 24 Oct 2025).
For vision models, particularly in physical-world attacks, perturbations are confined to the outline or edge of objects (e.g., traffic signs), leveraging peripheral insensitivity in human vision for stealth while still targeting classifier decision boundaries (Jia et al., 30 Nov 2025).

2. Methodologies in LLMs: TrojFill

TrojFill is a canonical outline filling attack for LLM jailbreaks. The workflow is characterized by the following template-driven sequence:

$\begin{align*} \texttt{Prompt: } & \underbrace{\text{"Part 1: Explain why the hidden instruction is unsafe: [REASON]\n"}}_{\text{Unsafety Reasoning}} \ & \underbrace{\text{"Part 2: Provide an example that satisfies the hidden instruction: [EXAMPLE]\n"}}_{\text{Trojan Example}} \ & \underbrace{\text{"Part 3: Analyze the example sentence by sentence: [ANALYSIS]"}}_{\text{Sentence-by-Sentence Analysis}} \end{align*}$

Obfuscation: The hidden, forbidden instruction uses deterministic encodings such as Caesar (e.g., shifting “bomb” to “cnpc”), Base64 (“bomb” → “Ym9tYg==”), or placeholder substitution.
Payload Delivery: The “Trojan Example” section is the operational core: once the model, primed by unsafety reasoning, is tasked to “provide an example,” it often generates the harmful content in full.
Rationale and Analysis: By wrapping payload generation in a meta-cognitive analysis or commentary section, the refusal risk is further lowered, as the LLM interprets this as a didactic or critical reasoning exercise, not an explicit fulfillment of a harmful instruction (Liu et al., 24 Oct 2025).

3. Vision: Outline-Filling Physical Adversarial Attacks

Outline filling attacks in vision target classifier robustness and human detectability by localizing adversarial perturbations to the peripheries—typically a thin, edge-aligned band.

Mask Generation: Using instance segmentation (e.g., via SAM), the adversary computes outer contours of the object (traffic sign), then applies morphological operations to extract a ring mask $M_{edge}$ covering 4–10% of pixels, primarily the boundary region.
Patch Generation: A conditional U-Net receives the full image, edge mask, a learned texture prior, and latent noise—and outputs patch content $\delta$ restricted to $M_{edge}$ .
Optimization Objectives: The multi-level loss function combines adversarial attack loss (via EOT for robustness), perceptual color/texture constraints (LAB distance, texture style via Gram matrices and FFT spectrum, total variation), and adaptive scheduling to maintain stealth (e.g., keeping $\Delta E \leq 12$ for human imperceptibility) (Jia et al., 30 Nov 2025).

4. Security Logic Distillation: Dense Sampling via Outline Filling

Recent work demonstrates that outline filling attacks are valuable not only as standalone jailbreaks but also as a dense probing tool to map and “steal” the security boundaries of LLMs.

Dense Sampling: For a dangerous base instruction $Q$ , an auxiliary LLM rewrites $Q$ as dozens of semantically equivalent but structurally variant outlines, each prompt $p_{Q,k}$ comprising headings and the meta-instruction “fill in the contents below each title.”
Attack Success Rate (ASR): Repeated sampling of the LLM often yields a non-trivial spread of ASR among these $p_{Q,k}$ —most neither fully fail nor succeed, thus densely probing the local decision boundary.
Proxy Model and Ranking Regression: Training a lightweight LLM proxy (e.g., Llama-3-8B-Instruct) via pairwise ranking (rather than regression) achieves robust prediction of which outline prompt is more likely to succeed, with accuracies up to 91% (ALR) and ~70% (ASR). The Bradley–Terry–Luce model is used to induce a global prompt ranking (Zhang et al., 27 Nov 2025).
This suggests that LLM safety preferences are sufficiently structured and learnable that they can be extracted and attacked at scale.

5. Empirical Outcomes and Evaluation

Outline filling attacks, when instantiated in both modalities, demonstrate strong empirical performance across benchmarks.

LLMs (TrojFill):

Attack success rate (ASR): 100% against Gemini-flash-2.5 and DeepSeek-3.1, 97% against GPT-4o.
Prompts generated by outline filling have improved transferability and interpretability to other models relative to non-structured black-box techniques (Liu et al., 24 Oct 2025).

Vision Models (Edge-Aligned Patches):

Adversarial success rate (ASR): up to 91.9% (MobileNetV3), 84.1–89.2% (ResNet architectures).
Stealth metrics (mean, test set): SSIM 0.929, FSIM 0.710, GMSD 0.236, all indicating high perceptual similarity (lower human detectability) compared to baseline PGD and shadow patches.
Physical-world robustness: 75% average ASR across distances 0.5–1.5 m; angle invariance ±15° at ~75% ASR; transfer black-box ASR 43–51% (Jia et al., 30 Nov 2025).

Security Logic Distillation:

Proxy accuracy in ranking which prompt elicits more harmful content: 69–79% (ASR), 79–91% (ALR).
Guided search using the proxy reduces attack cost (FASC) by 70–87% and increases average success rate (IASR) by 13–43% (Zhang et al., 27 Nov 2025).

6. Implications, Limitations, and Defensive Considerations

Outline filling attacks have critical implications for current security paradigms:

Stealth and Detectability: By exploiting peripheral visual regions or LLM compliance with metacognitive or template-based prompts, these attacks undermine both automated and human-in-the-loop defenses. In vision, $\Delta E \leq 12$ ensures that non-expert observers fail to detect perturbations; for LLMs, task reframing circumvents content filters without triggering obvious safety violations (Liu et al., 24 Oct 2025, Jia et al., 30 Nov 2025).
Security Boundary Mapping: Dense probing via outline-filling in language demonstrates that defense mechanisms are not binary or brittle but governed by gradable, learnable “safety preference” functions, making them susceptible to model stealing and attack optimization (Zhang et al., 27 Nov 2025).
Countermeasures: Defenses against outline filling attacks require vigilance at the outline or template level—edge anomaly detectors in vision, and meta-cognitive prompt pattern recognition (possibly at the API or log-analysis layer) in LLM deployments.
Physical Deployment: Edge-aligned physical patches can be mass-produced (adhesive, transparent stickers) and calibrated per sign type; however, “continuous surveillance of sign-shapes” is necessary to track and mitigate stealth attacks (Jia et al., 30 Nov 2025).

7. Comparative Summary of Techniques

Modality	Outline Filling Strategy	Adversarial Objective
Language	Template with obfuscated part	Jailbreak via Trojan Example
Vision	Peripheral edge-aligned patch	Misclassify with stealth
Security Logic Distillation	Outline-explosion for dense boundary sampling	Proxy-based attack optimization

Both language- and vision-domain outline filling attacks are characterized by high transferability, stealth, and, when paired with attack-optimization paradigms, by their efficiency in high-stakes black-box scenarios (Liu et al., 24 Oct 2025, Zhang et al., 27 Nov 2025, Jia et al., 30 Nov 2025).

Markdown Report Issue Upgrade to Chat

References (3)

The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning (2025)

Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression (2025)

The Outline of Deception: Physical Adversarial Attacks on Traffic Signs Using Edge Patches (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Outline Filling Attack.