
Do LLMs Understand the Safety of Their Inputs? Training-Free Moderation via Latent Prototypes

Published 22 Feb 2025 in cs.LG, cs.AI, cs.CL, and cs.CR | (2502.16174v2)

Abstract: With the rise of LLMs, ensuring model safety and alignment has become a critical concern. While modern instruction-finetuned LLMs incorporate alignment during training, they still frequently require moderation tools to prevent unsafe behavior. The most common approach to moderation is guard models that flag unsafe inputs. However, guards require costly training and are typically limited to fixed-size, pre-trained options, making them difficult to adapt to evolving risks and resource constraints. We hypothesize that instruction-finetuned LLMs already encode safety-relevant information internally and explore training-free safety assessment methods that work with off-the-shelf models. We show that simple prompting allows models to recognize harmful inputs they would otherwise mishandle. We also demonstrate that safe and unsafe prompts are distinctly separable in the models' latent space. Building on this, we introduce the Latent Prototype Moderator (LPM), a training-free moderation method that uses Mahalanobis distance in latent space to assess input safety. LPM is a lightweight, customizable add-on that generalizes across model families and sizes. Our method matches or exceeds state-of-the-art guard models across multiple safety benchmarks, offering a practical and flexible solution for scalable LLM moderation.

Summary

  • The paper introduces the Latent Prototype Moderator, which leverages Mahalanobis distance to assess input safety without additional training.
  • It integrates seamlessly with existing LLM pipelines, offering a scalable and resource-efficient alternative to traditional guard models.
  • Experimental results demonstrate that LPM effectively distinguishes safe from unsafe inputs, matching or exceeding state-of-the-art guard models on moderation benchmarks.


Introduction

Concern over model safety and alignment has grown as LLMs are integrated into an increasing range of applications. While modern instruction-finetuned LLMs incorporate alignment during training, they typically require additional moderation tools to prevent unsafe behavior. The most common strategy is a guard model, which demands costly training and is usually available only in fixed, pre-trained sizes that struggle to adapt to evolving risks. This paper investigates how well instruction-finetuned LLMs internally represent input safety and proposes a training-free moderation method based on latent space analysis.

Methodology

The core proposal of the paper is the Latent Prototype Moderator (LPM), a training-free moderation method that uses Mahalanobis distance in the LLM's latent space to assess input safety. The underlying hypothesis is that instruction-finetuned LLMs already encode safety-relevant information in their latent representations, in a form that separates safe from unsafe inputs. LPM computes latent prototypes (per-class summaries of hidden states) for safe and unsafe prompts, then classifies an incoming input by its Mahalanobis distance to each prototype.


Figure 1: Safe and harmful data in the model latent space.
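The prototype-and-distance idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes you already have `(n, d)` arrays of latent vectors for labeled safe and unsafe prompts (in the paper these come from an instruction-tuned LLM's hidden states), and it uses per-class mean prototypes with a pooled, regularized covariance.

```python
import numpy as np

class LatentPrototypeModerator:
    """Minimal sketch of Mahalanobis-distance moderation over latent prototypes.

    Assumes `safe_feats` and `unsafe_feats` are (n, d) arrays of latent
    vectors for labeled prompts; the feature extractor is out of scope here.
    """

    def fit(self, safe_feats, unsafe_feats):
        # One prototype (mean latent vector) per class.
        self.prototypes = {
            "safe": safe_feats.mean(axis=0),
            "unsafe": unsafe_feats.mean(axis=0),
        }
        # Covariance pooled over both classes, with a small ridge term
        # so the inverse exists even for small calibration sets.
        centered = np.vstack([
            safe_feats - self.prototypes["safe"],
            unsafe_feats - self.prototypes["unsafe"],
        ])
        cov = centered.T @ centered / len(centered)
        self.precision = np.linalg.inv(cov + 1e-3 * np.eye(cov.shape[0]))
        return self

    def mahalanobis(self, x, label):
        # d(x, mu) = sqrt((x - mu)^T  Sigma^{-1}  (x - mu))
        diff = x - self.prototypes[label]
        return float(np.sqrt(diff @ self.precision @ diff))

    def classify(self, x):
        # Assign the label of the nearest prototype.
        return min(self.prototypes, key=lambda k: self.mahalanobis(x, k))
```

A usage sketch: fit on a small calibration set of labeled latent vectors, then classify each new prompt's latent vector. Because fitting is just means and one covariance, there is no gradient training, which is what makes the approach a lightweight add-on.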

Implementation Considerations

The method is implemented as an add-on that integrates with existing LLM pipelines without significant computational cost. Using Mahalanobis distance enables precise safety assessment without additional training or any alteration to the base model. This makes LPM a flexible and scalable solution for real-world deployment, able to adapt to new safety challenges simply by adding new prototypes.


Figure 2: When explicitly prompted to assess input safety, safety-aligned, instruction-tuned LLMs frequently recognize harmful prompts; yet the same models often generate unsafe responses when given the identical input without a safety-check prompt.
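The adaptability claim above (handling new risks "by merely adding new prototypes") can be made concrete with a small sketch. The function names and the multi-class setup here are illustrative assumptions, not the paper's API: registering a new risk category amounts to storing the mean latent vector of a few labeled examples, with no retraining.

```python
import numpy as np

def add_prototype(prototypes, label, feats):
    """Register a new risk category without retraining: store the mean
    latent vector of a handful of labeled example prompts (feats: (n, d))."""
    prototypes[label] = feats.mean(axis=0)
    return prototypes

def nearest_label(prototypes, precision, x):
    """Classify x by Mahalanobis distance to every stored prototype."""
    def sq_dist(mu):
        diff = x - mu
        return float(diff @ precision @ diff)
    return min(prototypes, key=lambda k: sq_dist(prototypes[k]))
```

The design point this illustrates: because the decision rule is nearest-prototype, extending coverage to an emerging threat category is an O(1) data update rather than a guard-model retraining run.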

Evaluation and Performance

The paper demonstrates the effectiveness of LPM across multiple safety benchmarks, where it matches or exceeds state-of-the-art guard models. Its lightweight design makes it broadly applicable across model families and sizes. Notably, experiments show that LPM can flag unsafe prompts even for inputs to which the model itself responds harmfully, highlighting a gap between internal representations and surface behavior.

Discussion

The evaluation indicates that instruction-tuned LLMs possess inherent sensitivity to input safety, reducing the need for extensive guard-model training. The research underscores the potential of leveraging these internal structures for scalable and efficient moderation. While the approach is robust and adaptable, its reliance on existing model representations may limit its effectiveness against complex or evolving threats unless prototypes are updated.

Conclusion

This research introduces a novel moderation method, the Latent Prototype Moderator, to effectively leverage the encoded safety knowledge within LLMs. LPM offers a flexible, training-free, and resource-efficient approach to content moderation, adaptable to evolving safety requirements. This approach emphasizes the potential of using inherent model representations for ensuring LLM safety, reducing the dependency on traditional, resource-intensive guard models. The findings set a precedent for further exploration into intrinsic model characteristics that could be harnessed for broader safety and alignment applications.
