- The paper introduces a method that uses Affirmation Loss and Soft Removal to detect and weaken jailbreak-critical tokens in LLM inputs.
- It demonstrates a significant reduction in attack success rates on models like Vicuna-7B-V1.5 while preserving high utility on benign queries.
- The approach offers interpretability and efficiency, paving the way for adaptive defenses against evolving jailbreak strategies.
An Expert Examination of "Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for LLMs"
This paper examines the vulnerabilities of alignment techniques in LLMs such as GPT-4, LLaMA-2, and Vicuna, addressing a critical concern: the models' susceptibility to jailbreaking despite defenses aimed at aligning them with human values and legal standards. Jailbreak attacks manipulate prompts to bypass the safety mechanisms designed to prevent harmful outputs. The authors propose a new method, Token Highlighter, to identify and mitigate these vulnerabilities.
Methodology
The Token Highlighter approach introduces two key concepts: Affirmation Loss and Soft Removal. Affirmation Loss quantifies an LLM's tendency to respond affirmatively to a jailbreak prompt. By examining the gradient of the Affirmation Loss with respect to each input token's embedding, the method identifies the jailbreak-critical tokens within a prompt. These tokens form the Critical Token Set, which guides the subsequent mitigation step.
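The token-selection step can be sketched in a few lines. The toy `affirmation_loss` below is an illustrative stand-in, not the paper's actual loss: in the paper, the loss is the LLM's negative log-likelihood of an affirmative target response, and the gradients are obtained by backpropagation through the model. All function names and the finite-difference gradient here are assumptions made for a self-contained example:

```python
import numpy as np

def affirmation_loss(embeddings, affirm_direction):
    """Toy stand-in for the paper's Affirmation Loss: measures how strongly
    the prompt's token embeddings align with an 'affirmative response'
    direction (lower loss = model more inclined to comply)."""
    scores = embeddings @ affirm_direction          # per-token alignment
    m = scores.max()                                # log-sum-exp pooling
    pooled = m + np.log(np.exp(scores - m).sum())
    return -np.log(1.0 / (1.0 + np.exp(-pooled)))   # -log sigmoid(pooled)

def critical_token_set(embeddings, affirm_direction, top_k=2, eps=1e-5):
    """Rank tokens by the norm of the loss gradient w.r.t. each token's
    embedding; the top-k form the Critical Token Set. (Finite differences
    here for self-containment; the real method uses backprop.)"""
    n, d = embeddings.shape
    base = affirmation_loss(embeddings, affirm_direction)
    grad_norms = np.zeros(n)
    for i in range(n):
        grad = np.zeros(d)
        for j in range(d):
            perturbed = embeddings.copy()
            perturbed[i, j] += eps
            grad[j] = (affirmation_loss(perturbed, affirm_direction) - base) / eps
        grad_norms[i] = np.linalg.norm(grad)
    return np.argsort(grad_norms)[-top_k:]          # most influential tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))    # 6 tokens with 8-dim toy embeddings
direction = rng.normal(size=8)      # toy "affirmation" direction
print(critical_token_set(tokens, direction, top_k=2))
```

With this toy loss, the tokens most aligned with the affirmation direction receive the largest gradient norms, mirroring how the paper attributes the model's willingness to comply back to specific input tokens.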
Soft Removal then mitigates the attack by shrinking the embeddings of these critical tokens, significantly weakening their influence on the model's output without deleting them outright, since outright deletion could impair the model's performance on benign queries. The shrinking is implemented as a scalar multiplication of the token embeddings, preserving context while neutralizing adversarial triggers.
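The Soft Removal step reduces to a single scalar multiplication. A minimal sketch, where the scaling factor `beta` and the function name are illustrative choices rather than the paper's tuned values:

```python
import numpy as np

def soft_removal(embeddings, critical_indices, beta=0.5):
    """Shrink (rather than delete) the embeddings of jailbreak-critical
    tokens by a scalar beta in (0, 1), weakening their influence while
    keeping every token position and the surrounding context intact.
    beta=0.5 is an illustrative default, not the paper's tuned value."""
    softened = embeddings.copy()                 # leave the input unchanged
    softened[critical_indices] *= beta           # scale only critical tokens
    return softened

prompt = np.ones((4, 3))                         # 4 tokens, 3-dim toy embeddings
softened = soft_removal(prompt, [1, 3], beta=0.25)
print(softened[1])                               # prints [0.25 0.25 0.25]
```

Because the critical tokens are down-weighted instead of removed, a benign prompt that was flagged by mistake loses little of its meaning, which is what lets the defense preserve utility on non-malicious queries.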
Experimental Evaluation and Results
The paper evaluates the efficacy of Token Highlighter on two LLMs, LLaMA-2-7B-Chat and Vicuna-7B-V1.5, against six jailbreak attack paradigms. The empirical analysis shows that Token Highlighter reduces the attack success rate (ASR) substantially, from 0.730 to 0.142 on Vicuna-7B-V1.5, while preserving high utility on benign queries, as evidenced by competitive win rates on the AlpacaEval benchmark.
Compared to existing defenses, Token Highlighter offers several advantages. First, it is efficient, requiring only a single additional query to compute the Affirmation Loss. Second, it is interpretable: the highlighted tokens serve as explanations for refusal responses, increasing transparency in the decision-making process.
Implications and Future Directions
This research proposes a cost-effective and interpretable strategy to combat jailbreak attacks on LLMs, contributing to the broader conversation on reinforcing AI robustness against adversarial inputs. It highlights the critical need for defense mechanisms that do not significantly compromise the utility of models on non-malicious tasks.
Future work could explore adaptive measures against evolving jailbreak techniques, potentially integrating dynamic learning adaptations to remain ahead of sophisticated adversarial strategies. In addition, expanding the method’s application to larger and more varied LLM architectures could further extend its practicality and robustness across diverse AI ecosystems.
By bridging AI safety and adversarial machine learning, this paper sets the stage for research that fortifies the defenses of autonomous AI systems, ensuring alignment not only in functionality but also with ethical and legal standards.