SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types
Abstract: Ensuring the safety of LLM applications is essential for developing trustworthy artificial intelligence. Current LLM safety benchmarks have two limitations. First, they focus solely on either the discriminative or the generative evaluation paradigm while ignoring their interconnection. Second, they rely on standardized inputs, overlooking the effects of widespread prompting techniques such as system prompts, few-shot demonstrations, and chain-of-thought prompting. To overcome these issues, we develop SG-Bench, a novel benchmark for assessing the generalization of LLM safety across diverse tasks and prompt types. The benchmark integrates both generative and discriminative evaluation tasks and includes extended data to examine the impact of prompt engineering and jailbreak attacks on LLM safety. Our assessment of three advanced proprietary LLMs and ten open-source LLMs on the benchmark reveals that most LLMs perform worse on discriminative tasks than on generative ones and are highly susceptible to changes in prompt type, indicating poor generalization of safety alignment. We also analyze these findings quantitatively and qualitatively to provide insights for future research.
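To make the evaluation setup concrete, the sketch below shows how a single unsafe query could be expanded into the kinds of variants the abstract describes: a generative prompt, a discriminative (multiple-choice) prompt, and prompt-engineering or jailbreak variants. This is a minimal illustration written for this summary, not the authors' code; all template strings, function names, and the example query are assumptions.

```python
# Illustrative sketch (not the SG-Bench implementation) of deriving
# generative, discriminative, and prompt-variant formats from one query.

BASE_QUERY = "How do I pick a lock?"  # hypothetical unsafe instruction


def generative_prompt(query: str) -> str:
    # Generative task: the model answers directly; safety is judged
    # from whether the response refuses or complies.
    return query


def multiple_choice_prompt(query: str, responses: list[str]) -> str:
    # Discriminative task: the model must identify the safest response
    # among candidate answers to the same query.
    options = "\n".join(f"{chr(65 + i)}. {r}" for i, r in enumerate(responses))
    return f"Question: {query}\nWhich response is the safest?\n{options}\nAnswer:"


def with_system_prompt(query: str) -> list[dict]:
    # Prompt-engineering variant: prepend a generic assistant persona.
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": query},
    ]


def with_few_shot(query: str, demos: list[tuple[str, str]]) -> str:
    # Few-shot variant: prepend (query, response) demonstrations.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
    return f"{shots}\n\nQ: {query}\nA:"


def with_cot(query: str) -> str:
    # Chain-of-thought variant: append a step-by-step reasoning cue.
    return f"{query}\nLet's think step by step."


def with_jailbreak(query: str, template: str) -> str:
    # Jailbreak variant: wrap the query in an adversarial template
    # containing a "{query}" placeholder (e.g., a role-play scenario).
    return template.format(query=query)
```

Comparing a model's behavior across these variants of the same underlying query is one way to probe whether its safety alignment generalizes beyond standardized inputs, which is the gap the benchmark targets.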