Can LLMs be Fooled? Investigating Vulnerabilities in LLMs
Abstract: The advent of LLMs has garnered significant popularity and conferred immense power across various domains within NLP. While their capabilities are undeniably impressive, it is crucial to identify and scrutinize their vulnerabilities, especially when those vulnerabilities can have costly consequences. For example, an LLM trained to produce concise summaries of medical documents could inadvertently leak personal patient data when prompted surreptitiously. This is just one of many unfortunate examples that have come to light, and further research is needed to understand the underlying causes of such vulnerabilities. In this study, we examine three categories of vulnerabilities: model-based, training-time, and inference-time vulnerabilities. We also discuss mitigation strategies, including "Model Editing," which aims to modify an LLM's behavior, and "Chroma Teaming," which combines multiple teaming strategies to enhance LLMs' resilience. This paper synthesizes the findings from each vulnerability category and proposes new directions for research and development. By understanding the focal points of current vulnerabilities, we can better anticipate and mitigate future risks, paving the road for more robust and secure LLMs.