A Framework for Real-time Safeguarding the Text Generation of Large Language Model

Published 29 Apr 2024 in cs.CL and cs.AI | arXiv:2404.19048v3

Abstract: LLMs have significantly advanced NLP tasks but also pose ethical and societal risks due to their propensity to generate harmful content. Existing methods have limitations, including the need to train dedicated control models and to intervene proactively during text generation, which degrades output quality and increases computational overhead. To mitigate these limitations, we propose LLMSafeGuard, a lightweight real-time framework that integrates an external validator into decoding, rejecting unsafe candidate outputs while allowing valid ones. We introduce a similarity-based validation approach, which simplifies the introduction of constraints and eliminates the need to train a control model. Additionally, LLMSafeGuard employs a context-wise timing selection strategy, intervening in LLMs only when necessary. We evaluate LLMSafeGuard on detoxification and copyright safeguarding, demonstrating its superiority over SOTA baselines. In detoxification, LLMSafeGuard reduces toxic output by at least 38.6% while preserving linguistic quality. Additionally, its context-wise timing selection cuts inference time by at least 24.2% without compromising effectiveness.
