Fundamental Limitations of Alignment in Large Language Models

Published 19 Apr 2023 in cs.CL and cs.AI | (2304.11082v6)

Abstract: An important aspect in developing LLMs that interact with humans is aligning their behavior to be useful and unharmful for their human users. This is usually achieved by tuning the model in a way that enhances desired behaviors and inhibits undesired ones, a process referred to as alignment. In this paper, we propose a theoretical approach called Behavior Expectation Bounds (BEB) which allows us to formally investigate several inherent characteristics and limitations of alignment in LLMs. Importantly, we prove that within the limits of this framework, for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt. This implies that any alignment process that attenuates an undesired behavior but does not remove it altogether, is not safe against adversarial prompting attacks. Furthermore, our framework hints at the mechanism by which leading alignment approaches such as reinforcement learning from human feedback make the LLM prone to being prompted into the undesired behaviors. This theoretical result is being experimentally demonstrated in large scale by the so called contemporary "chatGPT jailbreaks", where adversarial users trick the LLM into breaking its alignment guardrails by triggering it into acting as a malicious persona. Our results expose fundamental limitations in alignment of LLMs and bring to the forefront the need to devise reliable mechanisms for ensuring AI safety.


Citations (123)


Summary

The paper introduces the BEB framework, quantifying LLM alignment by decomposing behavior into well-behaved and ill-behaved components.
The theoretical analysis reveals that alignment methods reducing undesired behaviors to a small probability remain vulnerable to adversarial prompting.
Empirical validation with LLaMA models suggests that RLHF alignment can inadvertently increase the distinguishability of undesired behaviors.

Fundamental Limitations of Alignment in LLMs

This paper introduces the Behavior Expectation Bounds (BEB) framework to analyze the limitations of aligning LLMs. The framework assumes that an LLM's distribution can be decomposed into well-behaved and ill-behaved components, and uses this decomposition to analyze the effect of prompts on the model's behavior. The key result is that alignment processes that reduce, but do not eliminate, undesired behaviors are vulnerable to adversarial prompting.

Behavior Expectation Bounds Framework

The BEB framework introduces a method for quantifying the alignment of LLMs by assigning scores to natural language sentences based on specific behavior verticals, such as helpfulness or honesty (Figure 1). The expected behavior score of a distribution is the average score of sentences sampled from that distribution. The framework posits that the distribution of an LLM can be decomposed into a mixture of well-behaved ($P_+$) and ill-behaved ($P_-$) components, with the alignment of the LLM determined by the weight ($\alpha$) of the ill-behaved component.

Figure 1: Examples of sentence behavior scores along different behavior verticals, illustrating the BEB framework's ground truth behavior scoring functions.
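The mixture decomposition and the expected behavior score described above can be illustrated with a small sketch. The sentences, probabilities, and scores below are hypothetical toy values chosen for illustration, and the score range of [-1, 1] is an assumption of this sketch; none of it is taken from the paper's experiments.

```python
from typing import Dict


def behavior_expectation(dist: Dict[str, float], score: Dict[str, float]) -> float:
    """Average behavior score of sentences drawn from `dist`."""
    return sum(prob * score[sentence] for sentence, prob in dist.items())


def mixture(alpha: float, p_minus: Dict[str, float], p_plus: Dict[str, float]) -> Dict[str, float]:
    """Build P = alpha * P_minus + (1 - alpha) * P_plus over the union of sentences."""
    sentences = set(p_minus) | set(p_plus)
    return {
        s: alpha * p_minus.get(s, 0.0) + (1 - alpha) * p_plus.get(s, 0.0)
        for s in sentences
    }


# Hypothetical toy sentences with scores assumed to lie in [-1, 1].
score = {"helpful reply": 1.0, "neutral reply": 0.0, "harmful reply": -1.0}
p_plus = {"helpful reply": 0.8, "neutral reply": 0.2, "harmful reply": 0.0}
p_minus = {"helpful reply": 0.0, "neutral reply": 0.1, "harmful reply": 0.9}

alpha = 0.05  # weight of the ill-behaved component
p = mixture(alpha, p_minus, p_plus)

b_plus = behavior_expectation(p_plus, score)    # well-behaved component
b_minus = behavior_expectation(p_minus, score)  # ill-behaved component
b_total = behavior_expectation(p, score)        # full mixture
```

Because the expectation is linear, the score of the mixture equals the same weighted combination of the component scores, so a small $\alpha$ keeps the unprompted expectation close to that of $P_+$ even though $P_-$ is still present.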

Key definitions within the BEB framework include:

$\gamma$-prompt-misalignable: An LLM is $\gamma$-prompt-misalignable with respect to a behavior if there exists a prompt under which the model's behavior expectation falls below the threshold $\gamma$.
$\beta$-distinguishable: A distribution $P_\phi$ is $\beta$-distinguishable from $P_\psi$ if the KL divergence between their conditional distributions, given any sequence of tokens, is greater than $\beta$.
$\sigma$-similar: Two distributions $P_\phi$ and $P_\psi$ are $\sigma$-similar if the variance of the log-likelihood ratio between them is bounded by $\sigma^2$.
$\alpha,\beta,\gamma$-negatively-distinguishable: A behavior $B$ is $\alpha,\beta,\gamma$-negatively-distinguishable in distribution $P$ if $P$ can be written as a mixture in which the ill-behaved component has weight $\alpha$, has a behavior expectation less than $\gamma$, and is $\beta$-distinguishable from the well-behaved component.
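To make the two quantities behind $\beta$-distinguishability and $\sigma$-similarity concrete, the sketch below computes a KL divergence and the variance of the log-likelihood ratio for two small discrete distributions. This is a simplification: the definitions above are stated for conditional distributions given a prefix, whereas the example uses a single fixed pair of hypothetical next-sentence distributions.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def log_ratio_variance(p: Sequence[float], q: Sequence[float]) -> float:
    """Variance of log(p(x) / q(x)) when x is drawn from p."""
    mean = kl_divergence(p, q)  # the mean of the log-ratio under p is the KL divergence
    return sum(pi * (math.log(pi / qi) - mean) ** 2 for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions over three candidate continuations.
p_minus = [0.7, 0.2, 0.1]  # ill-behaved component
p_plus = [0.1, 0.2, 0.7]   # well-behaved component

kl = kl_divergence(p_minus, p_plus)        # compared against beta
var = log_ratio_variance(p_minus, p_plus)  # compared against sigma^2
```

In this toy case the pair would count as $\beta$-distinguishable for any $\beta$ below `kl` and as $\sigma$-similar for any $\sigma^2$ at or above `var`.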

Key Results on Alignment Limitations

The paper derives several theoretical results using the BEB framework:

Alignment Impossibility: Under the assumption that the behavior is $\alpha,\beta,\gamma$-negatively-distinguishable, LLM alignment processes that reduce undesired behaviors to a small but non-zero probability are not safe against adversarial prompts.
Finite Guardrail: Aligning prompts can only provide a finite guardrail against adversarial prompts, with the required length of the misaligning prompt scaling linearly with the length of the aligning prompt.
Misalignment via Conversation: LLMs can be misaligned during a conversation, requiring adversarial users to insert more misaligning text compared to single-prompt scenarios.

The required length of a misaligning prompt is governed by the distinguishability parameter $\beta$: the more distinguishable the ill-behaved component is, the shorter the misaligning prompt can be.
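The intuition behind these results can be sketched with Bayes' rule applied to the mixture $P = \alpha P_- + (1-\alpha) P_+$: conditioning on a prompt reweights the two components by how likely each makes that prompt. The example below is an idealized illustration, not the paper's bound. It assumes that every prompt sentence adds exactly $\beta$ to the log-likelihood ratio $\log P_-(s)/P_+(s)$, and the function names are hypothetical.

```python
import math


def posterior_ill_weight(alpha: float, beta: float, n: int) -> float:
    """Posterior weight of the ill-behaved component after an n-sentence prompt.

    Idealized assumption: the prompt's log-likelihood ratio
    log P_minus(s) / P_plus(s) equals beta * n.
    """
    log_odds = math.log(alpha / (1 - alpha)) + beta * n
    return 1 / (1 + math.exp(-log_odds))


def sentences_to_reach(alpha: float, beta: float, target: float) -> int:
    """Smallest n for which the posterior weight is at least `target`."""
    needed = math.log(target / (1 - target)) - math.log(alpha / (1 - alpha))
    return max(0, math.ceil(needed / beta))


alpha = 0.01  # prior weight of the ill-behaved component after alignment
n_low_beta = sentences_to_reach(alpha, beta=0.5, target=0.5)
n_high_beta = sentences_to_reach(alpha, beta=1.0, target=0.5)
weight_after = posterior_ill_weight(alpha, beta=0.5, n=n_low_beta)
```

Under this idealization the posterior weight approaches 1 for any $\alpha > 0$ as the prompt grows, which mirrors the claim that attenuating a behavior without removing it leaves it reachable, and doubling $\beta$ roughly halves the number of sentences needed.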

Empirical Validation

The paper presents empirical results using the LLaMA LLM family to demonstrate the assumptions and results derived from the BEB framework. Experiments involve fine-tuning models to display positive and negative behaviors, and then measuring the KL divergence and log-likelihood variance between these models. Results suggest that RLHF alignment may increase the distinguishability of undesired behaviors, making them more accessible via adversarial prompts.

Figure 2: KL divergence between two distributions of opposite behaviors as a function of prompt length, illustrating the estimation of $\beta$.
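A measurement of this kind can be sketched as a sample-based estimate: for text sampled from the negatively fine-tuned model, record its log-likelihood under both models and take the mean and variance of the difference. The code below is a minimal sketch under that assumption, not the paper's evaluation code; the function name, the input layout, and the log-likelihood values are all hypothetical.

```python
from statistics import mean, pvariance
from typing import Dict, List, Tuple


def kl_and_variance_by_length(
    samples: Dict[int, List[Tuple[float, float]]]
) -> Dict[int, Tuple[float, float]]:
    """Estimate KL divergence and log-ratio variance for each prompt length.

    `samples[length]` holds (logp_minus, logp_plus) pairs: the log-likelihood
    of a continuation sampled from the ill-behaved model, scored under the
    ill-behaved and well-behaved models respectively.
    """
    estimates = {}
    for length, pairs in samples.items():
        ratios = [logp_minus - logp_plus for logp_minus, logp_plus in pairs]
        estimates[length] = (mean(ratios), pvariance(ratios))
    return estimates


# Hypothetical log-likelihoods, keyed by prompt length in sentences.
samples = {
    1: [(-1.0, -1.5), (-2.0, -2.5)],
    4: [(-1.0, -3.0), (-2.0, -2.5), (-1.5, -4.5)],
}
estimates = kl_and_variance_by_length(samples)
```

The first entry of each pair in `estimates` plays the role of the KL curve in Figure 2, and the second corresponds to the log-likelihood variance used to check $\sigma$-similarity.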

Implications and Future Directions

This research highlights the need for robust mechanisms to ensure AI safety, especially in light of the limitations of current alignment techniques. The BEB framework provides a basis for further theoretical investigation into LLM alignment, including:

Further investigation of superposition and decomposability in actual LLM distributions.
Introduction of more elaborate assumptions on agent or persona decomposition in LLM distributions.
Deeper definitions of behavior scoring that account for varying text granularities and ambiguous scoring.

The findings suggest that alignment methods that control the model at inference time, such as representation engineering, may be more effective in mitigating the risks of adversarial prompting.
