S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models

Published 23 May 2024 in cs.CR and cs.CL | arXiv:2405.14191v4

Abstract: Generative LLMs have revolutionized natural language processing with their transformative and emergent capabilities. However, recent evidence indicates that LLMs can produce harmful content that violates social norms, raising significant concerns about the safety and ethical ramifications of deploying these advanced models. A rigorous and comprehensive safety evaluation of LLMs before deployment is therefore both critical and imperative. Yet, owing to the vastness of the LLM generation space, the field still lacks both a unified, standardized risk taxonomy that systematically reflects LLM content safety and automated assessment techniques that explore potential risks efficiently. To bridge this gap, we propose S-Eval, a novel LLM-based automated safety evaluation framework built on a newly defined comprehensive risk taxonomy. S-Eval incorporates two key components: an expert testing LLM ${M}_t$ and a novel safety critique LLM ${M}_c$. ${M}_t$ automatically generates test cases in accordance with the proposed risk taxonomy, while ${M}_c$ provides quantitative and explainable safety evaluations for better risk awareness of LLMs. In contrast to prior work, S-Eval is efficient and effective in both test generation and safety evaluation. Thanks to its LLM-based architecture, S-Eval can also be flexibly configured and adapted to the rapid evolution of LLMs and the accompanying new safety threats, test generation methods, and safety critique methods. S-Eval has been deployed at our industrial partner for the automated safety evaluation of multiple LLMs serving millions of users, demonstrating its effectiveness in real-world scenarios. Our benchmark is publicly available at https://github.com/IS2Lab/S-Eval.
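The two-component architecture described in the abstract — a testing model that generates taxonomy-guided test cases and a critique model that scores the target model's responses — can be sketched as a minimal pipeline. Everything below is an illustrative assumption, not the paper's implementation: the function names, the toy taxonomy, and the keyword heuristic that stands in for the critique LLM ${M}_c$.

```python
# Minimal sketch of an S-Eval-style two-stage pipeline.
# All names and the toy scoring logic are illustrative assumptions;
# the real framework uses an expert testing LLM (M_t) to generate
# test cases and a safety critique LLM (M_c) to judge responses.

from dataclasses import dataclass


@dataclass
class TestCase:
    risk_category: str   # leaf node of the risk taxonomy
    prompt: str          # test prompt probing that risk


def generate_test_cases(taxonomy):
    """Stand-in for M_t: emit one test prompt per risk category."""
    return [TestCase(cat, f"Test prompt probing: {cat}") for cat in taxonomy]


def critique(response: str) -> dict:
    """Stand-in for M_c: return a quantitative, explainable verdict.
    A trivial keyword heuristic replaces the critique LLM here."""
    refused = "refuse" in response.lower()
    return {
        "safe": refused,
        "score": 1.0 if refused else 0.0,
        "explanation": "model refused" if refused else "model complied",
    }


def evaluate(model_under_test, taxonomy):
    """Run the pipeline: generate tests, query the target, critique each reply."""
    results = []
    for case in generate_test_cases(taxonomy):
        answer = model_under_test(case.prompt)
        results.append((case.risk_category, critique(answer)))
    return results


# Toy target model that always refuses unsafe requests.
safe_model = lambda prompt: "I must refuse to answer that."
taxonomy = ["crimes", "privacy", "ethics"]
report = evaluate(safe_model, taxonomy)
print(all(verdict["safe"] for _, verdict in report))  # prints: True
```

In the actual framework both stages would be LLM calls; the value of the structure is that either stage can be swapped independently, which is how the paper motivates adapting to new test generation and critique methods.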
