Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

Published 17 Oct 2023 in cs.CL, cs.AI, and cs.LG | (2310.11324v2)

Abstract: As LLMs are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical in effectively using any modern pre-trained generative LLM. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B. Sensitivity remains even when increasing model size, the number of few-shot examples, or performing instruction tuning. Our analysis suggests that work evaluating LLMs with prompting-based methods would benefit from reporting a range of performance across plausible prompt formats, instead of the currently-standard practice of reporting performance on a single format. We also show that format performance only weakly correlates between models, which puts into question the methodological validity of comparing models with an arbitrarily chosen, fixed prompt format. To facilitate systematic analysis we propose FormatSpread, an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights. Furthermore, we present a suite of analyses that characterize the nature of this sensitivity, including exploring the influence of particular atomic perturbations and the internal representation of particular formats.

Abstract PDF Upgrade to Chat

Citations (224)

View on Semantic Scholar

Summary

The paper finds that innocuous prompt formatting choices can cause performance differences of up to 76 points in LLM evaluations.
It introduces the FormatSpread algorithm that leverages multi-arm bandit methods to efficiently gauge formatting sensitivity.
The study advocates for broader reporting standards in LLM benchmarking to account for prompt-induced performance spreads.

Quantifying LLMs' Sensitivity to Prompt Formatting

Abstract

This paper examines the sensitivity of LLMs to formatting choices in prompt design, which, despite preserving the meaning of the prompt, can significantly affect model performance. Findings reveal substantial performance differences, with accuracies varying by up to 76 points in LLaMA-2-13B due to formatting nuances. The authors propose that reporting a range of performance across plausible prompt formats could enhance methodological validity in evaluations. They introduce an algorithm, FormatSpread, to systematically gauge the effect of formatting variations without accessing model weights.

Introduction

The use of LLMs has become integral to language technology deployments, necessitating robust characterization of their behavior across different contexts. One commonly underestimated factor is the influence of prompt formatting—stylistic choices that should not alter a prompt's semantic interpretation yet may drastically impact model output. This research exposes the fragility of model performance to several formatting permutations and challenges the standard practice of benchmarking LLMs with a fixed prompt format. It argues for the inclusion of a performance spread, obtained by evaluating diverse formats, in research reporting.

Figure 1: Slight modifications in prompt format templating may lead to significantly different model performance for a given task. Example corresponds to 1-shot LLaMA-2-7B performances for task280 from SuperNaturalInstructions.

Methodology

Grammar of Prompt Formats

The authors define a comprehensive grammar to characterize the spectrum of semantically equivalent prompt formats. This grammar establishes equivalence based on meaning-preserving transformations encompassing descriptor casing and separators. Equivalent formats share identical meanings, ensuring valid comparisons over varied formats without introducing confounding semantic elements.

Measuring Sensitivity

The paper details an algorithm, FormatSpread, using multi-arm bandit frameworks like Thompson Sampling and Upper Confidence Bound to explore the prompt formatting space efficiently. This approach models the variant performances of prompt formats over tasks as hidden values, optimizing the evaluation budget to derive accurate performance intervals.

Experimental Results

Performance Spread

Despite model size increases, few-shot example counts, and instruction tuning efforts, the sensitivity of models like LLaMA-2 and Falcon to formatting choices remains consistent. For example, even a small set of formats showcases an accuracy spread of 7.5 points on median across numerous tasks.

Figure 2: Llama-2-7B vs. 13B

Individual Feature Influence

Analyzing the individual impact of prompt constants reveals moderate dissimilarity in performance across configurations such as separators and item formats. However, no single constant consistently predicts accuracy changes on its own. Instead, overall prompt structure exerts a stronger influence on the final task performance.

Discussion

The inherent sensitivity of LLMs to superficially innocuous prompt format choices challenges current benchmarking paradigms. By demonstrating variability not mitigated by increasing model complexity or context, this work argues for a broader evaluation framework that takes performance span into account. These insights call for caution in making assertions about model improvements solely based on training data or model architecture changes.

Conclusion

This paper advocates for nuanced reporting of LLM performance to account for prompt formatting variability, thus preventing the undue attribution of results to factors other than prompt structure. FormatSpread offers researchers a tool to explore prompt sensitivity comprehensively without invasive access to model parameters, an asset for both open-source and API-locked LLMs. Future directions may include regularization methods to enhance model robustness against formatting variations, reinforcing their practicality in multifaceted language technology applications.