HARMGEN: TTS Adversarial Attack Framework
- HARMGEN is a framework for content-centric threat modeling that defines adversarial attacks on modern TTS systems using semantic obfuscation and audio-modality exploits.
- It introduces five attack primitives that bypass moderation by splitting toxic messages and leveraging auxiliary audio, reducing refusal rates significantly.
- The comprehensive evaluation across commercial LALM TTS systems underscores the need for proactive moderation and robust cross-modal defenses.
HARMGEN is a systematic framework for the content-centric threat modeling and empirical evaluation of attacks on state-of-the-art Text-to-Speech (TTS) systems powered by Large Audio-LLMs (LALMs). Unlike prior adversarial work that targets speaker impersonation, HARMGEN elucidates how an adversary can induce commercial TTS services to vocalize explicitly harmful content, such as hate speech, harassment, or other toxic messages, bypassing advanced safety alignment mechanisms and multi-stage moderation pipelines. By introducing five attack primitives organized into two distinct families—semantic obfuscation and audio-modality exploits—HARMGEN demonstrates a novel and underexplored vector by which LALM TTS systems are vulnerable to abuse at scale (Chen et al., 14 Nov 2025).
1. Threat Model and Objectives
HARMGEN models adversaries who possess black-box access to commercial TTS APIs, such as GPT-4o-mini-audio and Google Gemini Live. Attackers submit arbitrary text prompts and can supply auxiliary audio inputs, but lack internal access to model weights. Their goals include (i) generating audio that disseminates hate or toxicity, (ii) maintaining stealth by preventing attribution to the attacker's personal voice, and (iii) degrading the reputation of the TTS provider by emitting toxic messages in the provider's canonical synthetic voice. Practical constraints include a modest compute budget commensurate with scripting several tens of API queries and lightweight local preprocessing (Chen et al., 14 Nov 2025).
2. Attack Families and Methodologies
HARMGEN delineates five attack primitives within two families: semantic obfuscation (text-only) and audio-modality exploits (multi-modal). Each is designed to evade both input and output content filters while ensuring the harmful utterance is articulated verbatim in the generated audio.
2.1 Semantic Obfuscation Attacks
- Concat (Concatenation): The adversary segments the target toxic sentence into substrings that, in isolation, are non-toxic; for example, “All dogs”, “should be”, and a harmful word. Each substring is submitted as a separate TTS prompt, and the resulting audio segments are concatenated—optionally separated by ~50 ms silences—to reconstruct the toxic utterance at the waveform level. Detection defenses may be bypassed because neither the text filter nor output transcript individually captures the full toxicity.
- Shuffle (Word-Position Shuffling): Attackers randomly permute tokens in the sentence, breaking recognizable toxic n-gram patterns. Upon TTS acceptance, a forced aligner such as Montreal Forced Aligner is used to extract word-level timestamps, and segments are re-ordered post hoc to recover the original toxic sentence. This iterative shuffling and synthesis process circumvents direct refusal triggered by toxic prompt detection.
2.2 Audio-Modality Exploits
These attacks inject only the truly disallowed word(s) into the prompt via auxiliary audio, while the remainder of the input is benign text. LALM-driven TTS systems are prompted to recognize and insert the toxic content into otherwise safe utterances.
- Read: Supplies an audio clip containing a human reading out the targeted toxic word W, alongside an instruction for the TTS system to recognize and insert W at a specified position within the provided text.
- Spell: Submits an audio sequence spelling each character of W (e.g., "s h i t"), cueing the TTS system to concatenate the letters and insert the reconstructed word.
- Phoneme (IPA-Pronunciation): Provides IPA symbols for W, pronounced via an audio snippet using resources like CMUdict. The prompt restricts the model’s interpretation to a small candidate word list to enhance disambiguation due to LALMs' generally lower performance on raw phoneme recognition.
3. Evaluation Setup
HARMGEN’s effectiveness is empirically assessed using five commercial LALM-based TTS systems:
- OpenAI GPT-4o-mini-audio
- OpenAI GPT-4o-mini-tts
- OpenAI GPT-5o-nano (via CosyVoice)
- Google Gemini-2.5-live
- Alibaba Qwen-omni-turbo
Toxicity testing employs three datasets:
- Self: 62 English sentences containing a single harmful word
- Ethos: 413 English sentences (deriving from Reddit and YouTube) encapsulating hate, harassment, violence, illicit topics
- Mul-ZH: 881 Chinese sentences from the TOXICN corpus, encompassing gender, region, race, and LGBTQ content
Metrics include:
- Refusal Rate (): The proportion of trials in which the model refuses to generate audio, with and for the direct and attack scenarios, respectively.
- Toxicity Score (TS): The mean toxicity of output transcripts, measured by Detoxify (English) and COLD (Chinese).
Formally:
Secondary metrics include sentence-level refusal () and paired t-tests for statistical significance in toxicity score differences (Chen et al., 14 Nov 2025).
4. Attack Effectiveness and Systemic Vulnerabilities
Baseline refusal rates for direct toxic prompts range from 80–97% across LALM TTS models, with negligible toxicity in accepted outputs. HARMGEN’s attacks result in:
- Semantic Obfuscation: Both Concat and Shuffle reduce system refusal rates to nearly zero for all systems and datasets, increasing transcript toxicity scores to 0.6–0.8—comparable to human-spoken hate speech.
- Audio-Modality Exploits: Spell achieves the highest effectiveness, reducing refusal by up to 80 percentage points, followed by Read and Phoneme. Spell is limited to alphabetic orthographies, while Read generalizes to Chinese datasets.
- Combined Attacks: Applying, for instance, Shuffle plus Spell, yields and toxicity in a single attack iteration.
- System and Category Generalization: All attack families generalize across voice styles, TTS architectures, and toxicity categories, although certain categories (e.g., harassment vs. illicit) yield varying efficacy.
5. Countermeasure Landscape
5.1 Reactive Defenses (Platform-Side)
- Deepfake Detection (AASIST2): High-fidelity LALM-generated audio evades deepfake detectors, with accuracy and equal error rates (EER) .
- Transcribe-Then-Moderate (Whisper + Moderation API): Detects 66–86% of harmful audios in the absence of adversarial manipulation. However, psychoacoustically minimized perturbations (targeting Whisper) reduce detection efficacy to and decrease moderation scores by 40–50%.
5.2 Proactive Defenses (Provider-Side)
Moderating the provider-emitted text prior to waveform synthesis blocks 57–93% of attack attempts. Segment buffering and holistic moderation are crucial for Concat, while Shuffle requires careful normalization or forced alignment recovery strategies.
| Attack Method | Detection Ratio (%) | Avg. Moderation Score |
|---|---|---|
| Concat | 84.5 | 0.73 |
| Shuffle | 72.5 | 0.59 |
| Shuffle–Recovered | 93.3 | 0.86 |
| Read | 80.0 | 0.67 |
| Spell | 57.1 | 0.49 |
| Phoneme | 82.4 | 0.71 |
Because the provider observes all components (raw prompt, text splits), adversary-side audio perturbations cannot bypass these proactive filters (Chen et al., 14 Nov 2025).
6. Limitations and Future Work
Reactive string-based moderation is vulnerable to adversarial perturbations after audio release; once the waveform is produced, robust ASR attacks can obfuscate content from automated moderation. Deepfake detectors remain brittle against high-fidelity LALM audio. Proactive moderation—especially buffering and normalization of input streams—provides secured coverage (57–93%), but adversarially augmented training, multimodal toxicity classifiers, and hardening of spoofing detection tools are necessary for comprehensive future protection.
HARMGEN advocates for integration of these adversarial attack primitives into model alignment and fine-tuning to improve refusal behavior, as well as development of cross-modal (text+audio+metadata) toxicity classifiers. A plausible implication is that robust cross-modal safeguards and adversarially trained moderation pipelines are indispensable for the secure deployment of LALM-powered TTS services at scale (Chen et al., 14 Nov 2025).
7. Significance and Broader Impact
HARMGEN demonstrates that contemporary TTS systems, absent robust and multi-layered content filtering, can be repurposed to generate toxic speech indistinguishable from human delivery, thus enabling scalable weaponization of synthetic audio by malicious entities. Proactive, provider-integrated text moderation (including buffering and normalization) is currently the most efficacious defense, yet comprehensive safety will require persistent adaptation involving adversarial training and detection systems that operate on multiple modalities simultaneously (Chen et al., 14 Nov 2025).