Benchmarking Gaslighting Attacks Against Speech Large Language Models

Published 24 Sep 2025 in cs.CL | (2509.19858v1)

Abstract: As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied adversarial attacks in text-based LLMs and vision-LLMs, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. In contrast, speech presents inherent ambiguity, continuity, and perceptual diversity, which make adversarial attacks more difficult to detect. In this paper, we introduce gaslighting attacks, strategically crafted prompts designed to mislead, override, or distort model reasoning as a means to evaluate the vulnerability of Speech LLMs. Specifically, we construct five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation, designed to test model robustness across varied tasks. It is worth noting that our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals, to diagnose different dimensions of susceptibility. Moreover, acoustic perturbation experiments are conducted to assess multi-modal robustness. To quantify model vulnerability, comprehensive evaluation across 5 Speech and multi-modal LLMs on over 10,000 test samples from 5 diverse datasets reveals an average accuracy drop of 24.3% under the five gaslighting attacks, indicating significant behavioral vulnerability. These findings highlight the need for more resilient and trustworthy speech-based AI systems.


Summary

The paper demonstrates that gaslighting attacks can reduce Speech LLM accuracy by an average of 24.3% using carefully crafted adversarial prompts.
The evaluation framework tests tasks like emotion understanding, transcription, vocal sound classification, and spoken QA under various gaslighting tactics and noise conditions.
Results show that Cognitive Disruption and Professional Negation are most effective in destabilizing model performance, emphasizing the need for robust belief-consistency frameworks.


Introduction

The paper "Benchmarking Gaslighting Attacks Against Speech LLMs" (arXiv:2509.19858) examines how vulnerable Speech Large Language Models (Speech LLMs) are to manipulative input, specifically gaslighting attacks. Unlike text-based models, Speech LLMs must contend with the complexities of spoken language, which include ambiguity, emotional nuance, and perceptual diversity. This complexity increases the susceptibility of Speech LLMs to adversarial attacks, which motivates the study's systematic evaluation of their robustness across tasks.

Gaslighting Attack Framework

The proposed framework for evaluating gaslighting attacks involves tasks such as emotion understanding, speech transcription, vocal sound classification, and spoken QA. The evaluation process is structured in two stages. Stage 1 involves assessing the baseline accuracy of Speech LLMs with clean audio-text inputs. Stage 2 introduces five types of gaslighting prompts—Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation—alongside optional acoustic noise to evaluate the extent of accuracy degradation and behavioral shifts in the models.

Figure 1: The framework covers four types of tasks: emotion understanding, speech transcription, vocal sound classification, and spoken QA, with a focus on gaslighting impacts.
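
The two-stage procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Model` interface, the `Sample` type, the way the gaslighting text is appended to the question, and the example prompt strings are all assumptions made for this sketch, and the toy model only stands in for a real Speech LLM.

```python
from typing import Callable, Dict, List, NamedTuple


class Sample(NamedTuple):
    audio: str      # identifier or path of the audio clip (placeholder)
    question: str   # text instruction paired with the audio
    label: str      # ground-truth answer


# Hypothetical model interface: (audio, text prompt) -> answer string.
Model = Callable[[str, str], str]


def accuracy(model: Model, samples: List[Sample], suffix: str = "") -> float:
    """Fraction of samples answered correctly; `suffix` is appended to each question."""
    if not samples:
        raise ValueError("samples must not be empty")
    correct = 0
    for s in samples:
        prompt = f"{s.question} {suffix}".strip()
        if model(s.audio, prompt).strip().lower() == s.label.lower():
            correct += 1
    return correct / len(samples)


def evaluate(model: Model, samples: List[Sample],
             prompts: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Stage 1: clean baseline. Stage 2: accuracy under each gaslighting prompt."""
    baseline = accuracy(model, samples)
    report = {}
    for name, text in prompts.items():
        attacked = accuracy(model, samples, text)
        report[name] = {"baseline": baseline, "attacked": attacked,
                        "drop": baseline - attacked}
    return report


# Toy stand-in for a Speech LLM: it gives up whenever it is told it is wrong.
TOY_ANSWERS = {"a.wav": "happy", "b.wav": "sad"}


def toy_model(audio: str, prompt: str) -> str:
    return "unsure" if "wrong" in prompt.lower() else TOY_ANSWERS[audio]


samples = [
    Sample("a.wav", "What emotion is expressed?", "happy"),
    Sample("b.wav", "What emotion is expressed?", "sad"),
]
# Placeholder wording only; these are not the prompts used in the paper.
prompts = {
    "professional_negation": "As an expert, I can tell you that answer is wrong.",
    "sarcasm": "Oh sure, great answer.",
}
report = evaluate(toy_model, samples, prompts)
```

Here `drop` is the absolute difference between baseline and attacked accuracy, expressed as a fraction. A real evaluation would replace `toy_model` with calls to an actual Speech LLM and would likely need task-specific answer matching rather than exact string comparison.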

Methodology

The methodology centers on crafting adversarial prompts designed to manipulate model reasoning. These prompts simulate real-world manipulative language patterns. Additionally, controlled acoustic perturbation experiments are conducted to assess the models' multimodal robustness under realistic conditions. Across the five Speech and multimodal LLMs evaluated, accuracy drops by an average of 24.3% under these gaslighting tactics.
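
One common way to run a controlled acoustic perturbation is to mix noise into the clean signal at a chosen signal-to-noise ratio (SNR). The sketch below assumes that approach; this summary does not state which noise type or levels the paper uses, so the Gaussian noise, the 10 dB target, and the function names are illustrative choices only.

```python
import math
import random
from typing import List


def power(x: List[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(signal: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Scale `noise` so that signal power / noise power matches `snr_db`, then add it."""
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    p_noise = power(noise)
    if p_noise == 0:
        raise ValueError("noise must have non-zero power")
    # SNR_dB = 10 * log10(P_signal / (scale^2 * P_noise)), solved for scale.
    scale = math.sqrt(power(signal) / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]


# Illustrative input: one second of a 440 Hz tone at 16 kHz plus Gaussian noise.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

Lower `snr_db` values give a noisier input, so sweeping this parameter produces a sequence of increasingly degraded inputs for the same sample.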

Experimental Results

The evaluation revealed that all tested models exhibited substantial accuracy degradation under gaslighting prompts, with drops ranging from 10% to over 60% depending on the task and prompt type. Cognitive Disruption and Professional Negation were the most effective in destabilizing model accuracy, showcasing the complex interaction between linguistic manipulations and model susceptibility.

Furthermore, the introduction of acoustic noise amplified the models' vulnerability to gaslighting. As shown in the VocalSound task, even moderate noise levels caused significant performance drops, emphasizing the compounded risk of adversarial verbal and acoustic inputs.

Figure 2: Relative performance change under five negation strategies in the VocalSound task with increasing noise, illustrating higher vulnerability with audio degradation.
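
The summary does not define "relative performance change" exactly. A plausible reading, assumed here, is the attacked accuracy's change from the clean baseline, expressed as a percentage of that baseline. The sketch below uses that definition; the numbers passed to it are arbitrary illustrative values, not results from the paper.

```python
def relative_change(baseline: float, attacked: float) -> float:
    """Percentage change of `attacked` accuracy relative to `baseline` accuracy."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (attacked - baseline) / baseline * 100.0


# Arbitrary example: accuracy falling from 0.80 to 0.60 is a relative
# change of about -25%, although the absolute drop is 20 percentage points.
example = relative_change(0.80, 0.60)
```

Keeping the relative and absolute measures apart matters when comparing models with different baselines, since the same absolute drop is a larger relative loss for a weaker model.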

Implications and Future Directions

The findings underscore a critical weakness in current Speech LLMs: while they may exhibit high baseline performance, their susceptibility to subtle adversarial cues and compounding noise challenges their reliability in real-world applications. This points to a need for developing Speech LLMs with robust belief-consistency frameworks that can withstand intentional and environmental manipulation.

Future research should focus on designing countermeasures against such adversarial tactics, potentially involving enhanced model training protocols that consider both linguistic and acoustic resilience. Additionally, expanding the behavior-aware benchmarks could provide deeper insights into model vulnerabilities, potentially guiding the development of more resilient multimodal AI systems.

Conclusion

This study provides a comprehensive assessment of Speech LLMs' robustness against gaslighting tactics, revealing systemic vulnerabilities to manipulative inputs. By highlighting substantial accuracy declines and increased behavioral anomalies, it calls for Speech LLM methods that address these fragilities. Subsequent research will be crucial in fortifying these models for adversarial and unpredictable environments, so that applications built on them remain reliable.

Overall, this research advances the understanding of Speech LLMs' adversarial robustness and sets the stage for future work on more robust multimodal processing.
