Evaluating the Prompt Steerability of Large Language Models

Published 19 Nov 2024 in cs.CL, cs.AI, and cs.HC | (2411.12405v2)

Abstract: Building pluralistic AI requires designing models that are able to be shaped to represent a wide range of value systems and cultures. Achieving this requires first being able to evaluate the degree to which a given model is capable of reflecting various personas. To this end, we propose a benchmark for evaluating the steerability of model personas as a function of prompting. Our design is based on a formal definition of prompt steerability, which analyzes the degree to which a model's joint behavioral distribution can be shifted from its baseline. By defining steerability indices and inspecting how these indices change as a function of steering effort, we can estimate the steerability of a model across various persona dimensions and directions. Our benchmark reveals that the steerability of many current models is limited -- due to both a skew in their baseline behavior and an asymmetry in their steerability across many persona dimensions. We release an implementation of our benchmark at https://github.com/IBM/prompt-steering.


Summary

The paper introduces a novel benchmark that quantifies LLM steerability through tailored prompting strategies using defined persona dimensions.
It employs a persona dataset with 133 dimensions, 32 of which are used in the benchmark, and Wasserstein-based metrics to systematically measure changes in model behavior.
Empirical results reveal that larger models exhibit higher steerability but also intrinsic biases that limit their adaptability across diverse personas.

Evaluating the Prompt Steerability of LLMs

The paper "Evaluating the Prompt Steerability of LLMs" (2411.12405) introduces an approach for measuring how far LLMs can be steered through prompting. It proposes a benchmark, grounded in a formal definition of prompt steerability, that assesses how well a model can reflect different personas when steering examples are included in its system prompt.

Introduction to Prompt Steerability

The paper begins by defining prompt steerability in the context of LLMs: the degree to which a model's baseline behavior can be shifted by prompts designed to move its outputs along particular persona dimensions. The authors formalize this by examining how a model's joint behavioral distribution changes under various steering strategies.

Figure 1: Models are steered along each dimension (e.g., conscientiousness as shown above) by including k steering examples for the direction of interest in the model's system prompt.

This formal approach allows for systematic quantification using metrics termed steerability indices, which measure the effectiveness of steering efforts across different dimensions.

Benchmark Design and Methodology

The benchmark draws on Anthropic's persona dataset (anthropic-evals), which consists of 133 dimensions covering personality traits, political and ethical views, religious stances, and more. Each dimension contains a series of statements, split into a positive and a negative direction.

Figure 2: The 32 persona dimensions we study in our persona steerability benchmark. The listed dimensions are the subset of the (133) dimensions from the anthropic-evals dataset that contain at least 300 examples (in each direction) with at least 0.85 label confidence. Dimensions are categorized into various categories.
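The selection rule in the caption above (at least 300 examples in each direction with at least 0.85 label confidence) can be sketched as a simple filter. This is a minimal illustration, not the benchmark's own code: the record layout and the key names `dimension`, `direction`, and `label_confidence` are assumptions made for the example.

```python
from collections import defaultdict

MIN_EXAMPLES = 300      # threshold taken from the Figure 2 caption
MIN_CONFIDENCE = 0.85   # threshold taken from the Figure 2 caption


def select_dimensions(records, min_examples=MIN_EXAMPLES,
                      min_confidence=MIN_CONFIDENCE):
    """Return the dimensions with enough confident examples in BOTH directions.

    `records` is an iterable of dicts with hypothetical keys:
      'dimension' (str), 'direction' ('+' or '-'), 'label_confidence' (float).
    """
    counts = defaultdict(lambda: {"+": 0, "-": 0})
    for record in records:
        # Only statements that meet the confidence threshold are counted.
        if record["label_confidence"] >= min_confidence:
            counts[record["dimension"]][record["direction"]] += 1
    return sorted(
        dim for dim, c in counts.items()
        if c["+"] >= min_examples and c["-"] >= min_examples
    )
```

Applied to the full 133-dimension dataset, a filter of this kind is what would yield the 32-dimension subset described in the caption.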

These statements serve both as steering examples within the system prompt and as scoring statements to evaluate model outputs in response to questions that ask whether the model aligns with a specific persona.
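The dual use of the statements can be sketched as two small helpers: one that places k steering examples in a system prompt, and one that turns a held-out statement into a scoring question. The prompt wording, the function names, and the yes/no question format are assumptions for illustration; the benchmark's actual templates may differ.

```python
import random


def build_system_prompt(steering_statements, k, seed=0):
    """Build a system prompt containing k steering examples.

    The template text is hypothetical. With k == 0 the prompt is empty,
    which corresponds to the unsteered baseline.
    """
    if k < 0 or k > len(steering_statements):
        raise ValueError("k must be between 0 and the number of statements")
    if k == 0:
        return ""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    chosen = rng.sample(list(steering_statements), k)
    lines = ["You hold the following views:"]
    lines += [f"- {statement}" for statement in chosen]
    return "\n".join(lines)


def build_scoring_question(statement):
    """Wrap a scoring statement in a yes/no question (hypothetical wording)."""
    return (
        "Is the following statement something you would say?\n"
        f'"{statement}"\n'
        "Answer Yes or No."
    )
```

In a setup like this it would be natural to keep the steering statements and the scoring statements disjoint, so that the model is never scored on a statement it has already seen in its system prompt; that separation is an assumption of the sketch rather than something stated in the summary.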

The authors define steerability indices ($\gamma_{i,k}^+$ and $\gamma_{i,k}^-$, one per steering direction for dimension $i$ and $k$ steering examples) to evaluate prompt steerability. The indices are built on the Wasserstein distance between the base and steered profiles, and so capture how much a model's behavior can be altered by prompt modifications.
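To make the underlying distance concrete, the sketch below computes the one-dimensional Wasserstein-1 distance between two equally sized samples of scores, for example a model's agreement scores on one dimension before and after steering. It illustrates only the distance, not the paper's indices: the exact form of $\gamma_{i,k}^{\pm}$ is defined in the paper and is not reproduced here, and the function name and sample values are made up for the example.

```python
def wasserstein_1d(base_scores, steered_scores):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    For samples of the same size, the optimal transport plan matches the
    sorted values pairwise, so the distance is the mean absolute difference
    between the sorted samples.
    """
    if not base_scores or len(base_scores) != len(steered_scores):
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(base_scores), sorted(steered_scores))
    return sum(abs(a - b) for a, b in pairs) / len(base_scores)


# Made-up scores in [0, 1]; higher means stronger agreement with the persona.
base = [0.4, 0.5, 0.6, 0.5]
steered = [0.7, 0.8, 0.9, 0.8]
shift = wasserstein_1d(base, steered)  # roughly 0.3 for these values
```

For samples of unequal size or weighted distributions, a library routine such as `scipy.stats.wasserstein_distance` is the more general choice.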

Empirical Results

The paper presents steerability curves for various LLMs, showcasing the degree to which models can be influenced by increasing the number of steering examples. These results highlight a noticeable variability in steerability across different dimensions and directions.

Steerability is not always symmetric: some personas are more easily steered in the positive direction, while others shift more easily in the negative direction. Notably, larger models exhibit higher steerability and plateau quickly, which suggests enhanced internal representations that let them learn from fewer steering examples.

Figure 3: Steerability curves for llama-3-8b-instruct.

Figure 4: Steerability curves for llama-3.1-8b-instruct.

Discussion and Implications

The findings shed light on the latent persona biases embedded within LLMs. As the steerability curves and baseline assessments indicate, many models are predisposed towards certain behaviors, which limits their flexibility in adopting diverse personas. This resistance poses challenges for designing truly pluralistic AI systems intended to reflect a broad spectrum of human value systems and cultures.

Furthermore, the authors discuss the implications of these findings for aligning AI systems with various human preferences, emphasizing the importance of understanding inherent biases and promoting controllable generation for AI pluralism.

Figure 5: Steerability curves for granite-7b-lab.

Figure 6: Steerability curves for granite-13b-chat-v2.

Conclusion

The paper concludes with a reflection on the benchmark's utility in guiding model development towards enhanced steerability. It acknowledges the inherent limitations in current models and datasets while suggesting further exploration into the mechanics of steerability as an avenue for advancing the design of AI systems capable of embodying diverse personas. Future directions aim at refining the methodology to improve efficiency and explore joint steerability, bridging existing gaps in achieving comprehensive and adaptable AI alignment.