Generative AI Security: Challenges and Countermeasures
Published 20 Feb 2024 in cs.CR, cs.AI, cs.CL, cs.CY, and cs.LG | (2402.12617v2)
Abstract: Generative AI's expanding footprint across numerous industries has generated both excitement and increased scrutiny. This paper examines the unique security challenges posed by Generative AI and outlines potential research directions for managing these risks.