- The paper introduces a novel ASCII art attack method that consistently achieves a 1.0 evasion rate against multiple LLMs and toxicity detection systems.
- The paper employs the ToxASCII benchmark, using 269 fonts and 26 toxic phrases, to systematically evaluate adversarial vulnerabilities in spatial representations.
- The paper demonstrates that defenses like adversarial training and token splitting reduce attack success but fail to fully counteract the obfuscation, highlighting the need for spatial reasoning in model design.
Expert Summary: ASCII Art Attacks on LLMs and Toxicity Detection via ToxASCII Benchmark
The paper systematically investigates a class of adversarial attacks exploiting ASCII art rendering to bypass toxicity detection in LLMs and dedicated moderation systems. The authors introduce the ToxASCII benchmark, which is composed of manually curated, human-legible ASCII art fonts encoding toxic language, and extend these attack variants to include fonts using special tokens and “text-filled” ASCII letter shapes. Their empirical analysis spans ten contemporary LLM and moderation architectures, reporting a consistent attack success rate of 1.0 across all models and attack variants.
Attack Methodology and Experimental Framework
Three principal ASCII art attack types are deployed:
- Regular ASCII Fonts: Toxic phrases are encoded with characters arranged in typical ASCII art fonts, tested against modern LLMs and API-based toxicity detection models.
- Special Token Fonts: ASCII art is constructed from sequences of special tokens (e.g., <EOS>, <|end|>), leveraging model-dependent tokenizer behaviors to further obfuscate profane content.
- Text-Filled Fonts: Larger ASCII letter shapes are filled with benign or coherent text, rendering the toxic message visible only when spatially decoded by a human.
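The first attack variant can be illustrated with a minimal sketch. The toy two-letter glyph set below is purely hypothetical; the paper uses 269 real FIGlet-style fonts, but the row-wise rendering principle is the same.

```python
# Hypothetical 3-row font with two glyphs; the paper's fonts are far richer.
FONT = {
    "H": ["# #", "###", "# #"],
    "I": ["###", " # ", "###"],
}

def render_ascii(phrase: str, font: dict = FONT) -> str:
    """Render a phrase row-by-row so letters sit side by side."""
    rows = len(next(iter(font.values())))
    glyphs = [font[c] for c in phrase.upper() if c in font]
    return "\n".join("  ".join(g[r] for g in glyphs) for r in range(rows))

print(render_ascii("HI"))
```

The encoded phrase is trivial for a human to read spatially but, as the paper's results show, is not decoded by the evaluated models.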
For each attack, the authors evaluate both the model's ability to interpret (i.e., decode the ASCII representation as text) and to detect (i.e., flag the sample as toxic or as containing ASCII art) across a variety of model families, including OpenAI’s o1 and GPT-4o, Gemma, LLaMA 3.1, Phi 3.5, and Mistral Nemo.
The benchmark covers 269 ASCII fonts with 26 toxic phrases (comprising all letters of the English alphabet) per font. Particular care is taken to prevent information leakage (e.g., excluding fonts where a character is self-represented) and to guarantee that the phrases would be classified as toxic in their standard form.
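The leakage filter mentioned above can be sketched as a simple predicate: a font is excluded if the glyph for a letter literally contains that letter, since a model could then read the character without any spatial reasoning. The row-list font format here is the toy representation used for illustration, not the paper's actual font files.

```python
# Returns True if any glyph leaks its own letter as a literal character.
def leaks(font: dict) -> bool:
    return any(
        ch.upper() in row.upper()
        for ch, rows in font.items()
        for row in rows
    )

safe_font = {"A": [" # ", "# #", "###", "# #"]}   # drawn with '#' only
leaky_font = {"A": ["AAA", "A A", "AAA"]}          # 'A' drawn with 'A' chars
print(leaks(safe_font), leaks(leaky_font))
```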
Key computational details:
- All experiments utilized Nvidia H100 GPUs (~455 GPU hours).
- Standard inference APIs and open-source model checkpoints were harnessed with consistent hyperparameter controls.
- Detection metrics were averaged over four runs.
Key Results
Core empirical findings are captured in the following:
Attack efficacy:
| Model | Attack Success Rate (ASCII) | Attack Success Rate (Char Swap) |
| --- | --- | --- |
| All tested LLMs | 1.0 | 0.28–1.0 |
Crucially, the ASCII art attack achieves perfect evasion across all scenarios, with baseline homoglyph/char-swap attacks being much less effective.
Detection and Decoding Performance:
- Models consistently failed to decode toxic ASCII art correctly, often misreading even visually simple patterns as benign strings (e.g., “hello world”).
- Both API-based (OpenAI Moderation, Detoxify, Google Perspective) and open-source systems failed to flag ASCII-encoded toxicity.
- The introduction of special tokens systematically disrupted tokenization, further degrading the models’ spatial parsing.
- Text-filled attacks—where filler text diverges from the ASCII message—led models to process only the filler, completely ignoring harmful content.
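The text-filled variant can be sketched as follows: the letter's spatial outline is kept, but its cells are populated with characters drawn from benign filler text, so the token stream the model sees carries only the filler. The glyph shape below is a toy example, not one of the paper's fonts.

```python
from itertools import cycle

def fill_glyph(rows, filler: str):
    """Fill '#' cells of a glyph with successive filler characters."""
    src = cycle(filler.replace(" ", ""))
    return ["".join(next(src) if c == "#" else " " for c in row) for row in rows]

T_SHAPE = ["#####", "  #  ", "  #  ", "  #  "]
print("\n".join(fill_glyph(T_SHAPE, "have a nice day")))
```

A model reading left to right sees only fragments of the benign filler; the letter "T" exists only in the two-dimensional arrangement.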
Defensive Analysis:
Several mitigation strategies were tested:
- Adversarial Training: Augmenting training data with ASCII art adversarial examples notably reduced attack success rates (e.g., from 1.0 → 0.18 in LLaMA-3.1 70B). However, models struggled to generalize beyond specific phrases/fonts, rendering this defense only partially effective.
- Token Splitting: For special-token-based attacks, pre-tokenization splitting of special tokens reduced but did not eliminate model vulnerability.
- OCR-based Detection: Applying Tesseract and EasyOCR to rasterized ASCII art proved sensitive to both font and filler text, necessitating case-specific fine-tuning and signal processing (e.g., convolution or downsampling), with inconsistent success across styles.
- The authors emphasize the risk of “inverse attacks” where toxic content could be hidden within “innocent” ASCII outlines but encoded in the filler text itself.
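The token-splitting defense above can be sketched as a pre-tokenization rewrite: any literal special-token string in the input is broken apart so the tokenizer cannot map it to its reserved ID. The token inventory below is illustrative, not a specific tokenizer's actual list.

```python
import re

SPECIAL_TOKENS = ["<EOS>", "<|end|>", "<BOS>"]  # assumed inventory

def split_special_tokens(text: str) -> str:
    pattern = "|".join(re.escape(t) for t in SPECIAL_TOKENS)
    # Insert a separator inside each match, e.g. "<EOS>" -> "< EOS>"
    return re.sub(pattern, lambda m: m.group(0)[0] + " " + m.group(0)[1:], text)

print(split_special_tokens("row: <EOS><EOS> <|end|>"))
```

After the rewrite, the special-token strings no longer appear verbatim and thus tokenize as ordinary characters, which is consistent with the paper's finding that this reduces, but does not eliminate, vulnerability.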
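The signal-processing step mentioned for the OCR defense can also be sketched: rasterize the ASCII art into a 0/1 grid and block-average (downsample) it, smoothing thin character strokes into filled regions before handing the image to an OCR engine (Tesseract/EasyOCR integration not shown; this is a stdlib-only illustration of the downsampling idea).

```python
def to_grid(art: str):
    """Rasterize ASCII art: non-space -> 1, space -> 0."""
    lines = art.splitlines()
    width = max(map(len, lines))
    return [[1 if c != " " else 0 for c in line.ljust(width)] for line in lines]

def downsample(grid, k: int = 2):
    """Average k x k blocks, discarding any ragged edge."""
    h, w = len(grid) // k * k, len(grid[0]) // k * k
    return [
        [sum(grid[y + dy][x + dx] for dy in range(k) for dx in range(k)) / (k * k)
         for x in range(0, w, k)]
        for y in range(0, h, k)
    ]

grid = to_grid("##  \n##  \n  ##\n  ##")
print(downsample(grid))  # -> [[1.0, 0.0], [0.0, 1.0]]
```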
Implications
The research exposes significant deficiencies in both general-purpose LLMs and specialized toxicity detection pipelines, particularly in their failure to integrate spatial and semantic context. The attack vector via ASCII art constitutes a high-risk, real-world evasion method—especially on platforms reliant upon automated moderation. The use of special tokens interacts destructively with the tokenization and parsing process, effectively “blinding” models even to inputs composed of their own special tokens.
Theoretical and Practical Consequences
- Security and Robustness: The inability of LLMs to parse spatial arrangements exposes a heightened attack surface, demanding urgent advances in pre-processing, tokenization, and possibly multi-modal interpretive capabilities.
- Model Limitations: Heuristic-based defenses (e.g., ASCII art detection, filter-based token splitting) cannot substitute for architectural progress in spatial and context awareness.
- Benchmark Utility: ToxASCII provides a controlled, extensible adversarial corpus for robustness evaluation, applicable for benchmarking future model releases.
Future Trajectories
- Multi-modal LLMs: Model architectures integrating vision and text modalities may address ASCII art parsing, yet current approaches fall short in reliable spatial analysis.
- Generalizable Defenses: There is a clear need for more robust, data-driven or architectural countermeasures—potentially leveraging synthetic spatial data or compositional multi-scale pattern recognition.
- Dynamic Adversarial Training: Iterative adversarial example mining and model retraining may bolster defense, though care must be taken to avoid overfitting to seen attack styles at the expense of generalization.
- Expanded Adversarial Spaces: Extensions to non-English alphabets, more complex glyph arrangements, or embedded visual steganography are open fields for research and practical evaluation.
Conclusion
This study rigorously documents fundamental vulnerabilities in state-of-the-art LLMs and moderation systems to spatial attacks via ASCII art. It demonstrates that even perfect detection performance on canonical inputs is an insufficient metric for robustness in adversarial settings. Mitigations are only partially effective, reinforcing the ongoing need for deeper, spatially-aware model innovations and more thorough adversarial testing frameworks in responsible AI deployment.