
Predicting the Performance of Black-box LLMs through Self-Queries

Published 2 Jan 2025 in cs.LG and cs.CL | arXiv:2501.01558v2

Abstract: As LLMs are increasingly relied on in AI systems, predicting when they make mistakes is crucial. While a great deal of work in the field uses internal representations to interpret model behavior, these representations are inaccessible when given solely black-box access through an API. In this paper, we extract features of LLMs in a black-box manner by using follow-up prompts and taking the probabilities of different responses as representations to train reliable predictors of model behavior. We demonstrate that training a linear model on these low-dimensional representations produces reliable and generalizable predictors of model performance at the instance level (e.g., if a particular generation correctly answers a question). Remarkably, these can often outperform white-box linear predictors that operate over a model's hidden state or the full distribution over its vocabulary. In addition, we demonstrate that these extracted features can be used to evaluate more nuanced aspects of an LLM's state. For instance, they can be used to distinguish between a clean version of GPT-4o-mini and a version that has been influenced via an adversarial system prompt that answers question-answering tasks incorrectly or introduces bugs into generated code. Furthermore, they can reliably distinguish between different model architectures and sizes, enabling the detection of misrepresented models provided through an API (e.g., identifying if GPT-3.5 is supplied instead of GPT-4o-mini).

Summary

  • The paper introduces QueRE, a self-query method that predicts black-box LLM performance by using response probabilities.
  • It shows that simple linear models trained on elicited features often outperform white-box approaches in predicting model correctness.
  • Experimental validation across diverse QA tasks confirms QueRE's robustness and potential for detecting adversarial manipulations.

Analysis of "Predicting the Performance of Black-box LLMs through Self-Queries"

The paper "Predicting the Performance of Black-box LLMs through Self-Queries" by Dylan Sam, Marc Finzi, and J. Zico Kolter addresses the critical issue of predicting the performance of LLMs in black-box settings. This research is grounded in the necessity to understand the behavior of LLMs, especially when deployed through interfaces that limit access to the model's internal representations—such as when accessed via an API. This work proposes a methodological innovation that leverages the LLMs' own self-querying capabilities to extract useful performance-related features in a black-box context.

Methodological Contribution

The paper introduces a novel approach to model performance prediction called QueRE (Question Representation Elicitation). The authors query an LLM with tailored follow-up questions that prompt it to reflect on the correctness of its past responses. The probabilities of the model's answers to these queries are collected into a feature vector, which is used to train linear models that predict whether the LLM's original prediction was correct. Crucially, the research demonstrates that these simple linear models, trained on probabilistic responses to elicitation queries, often outperform "white-box" methods that rely on access to internal model states or full-logit distributions.
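The pipeline described above can be sketched in a few lines. This is an illustrative toy, not the authors' code: the follow-up questions, the `get_yes_probability` interface, and the `toy_api` stand-in for a black-box model are all hypothetical; the only faithful elements are the overall shape (elicitation-query probabilities as a low-dimensional feature vector, fed to a linear classifier).

```python
# Illustrative QueRE-style sketch: build features from follow-up-query
# probabilities, then fit a linear probe of correctness. Toy data throughout.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical elicitation queries (the paper uses its own set).
FOLLOW_UP_QUERIES = [
    "Are you confident in your previous answer? Answer yes or no.",
    "Do you believe your previous answer is correct? Answer yes or no.",
    "Would you give the same answer again? Answer yes or no.",
]

def extract_features(answer_context, get_yes_probability):
    """One low-dimensional feature vector: P('yes') for each follow-up query."""
    return np.array([get_yes_probability(answer_context, q) for q in FOLLOW_UP_QUERIES])

# Toy stand-in for a black-box API whose 'yes' probabilities happen to
# correlate with whether the original answer was correct.
rng = np.random.default_rng(0)
def toy_api(context, query):
    base = 0.8 if context["correct"] else 0.3
    return float(np.clip(base + rng.normal(0, 0.05), 0.0, 1.0))

contexts = [{"correct": bool(rng.random() < 0.5)} for _ in range(200)]
X = np.stack([extract_features(c, toy_api) for c in contexts])
y = np.array([c["correct"] for c in contexts])

clf = LogisticRegression().fit(X, y)  # the simple linear predictor
print(clf.score(X, y))
```

In a real deployment the stub would be replaced by API calls that return token probabilities (or sampled approximations of them) for each follow-up prompt.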

The innovation lies in reducing reliance on direct access to a model’s internal parameters, which is often impractical or impossible with proprietary or closed-source APIs. Instead, the model's own self-assessment capabilities are harnessed to extract reflective features, which are shown to be informative for evaluating performance, thus creating a generalizable and scalable method applicable to various LLMs, including closed-source black-box models.

Experimental Validation

The experimental framework encompasses several QA tasks, both open-ended (e.g., SQuAD, NQ) and closed-ended (e.g., BoolQ, WinoGrande), showing that QueRE is robust across multiple settings. QueRE's performance is measured against baseline methods, such as standard confidence elicitation and semantic uncertainty estimation, and achieves superior Area Under the Receiver Operating Characteristic curve (AUROC) in predicting model correctness across tasks.
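For readers unfamiliar with the metric: AUROC measures how well a score ranks correct instances above incorrect ones (1.0 is perfect ranking, 0.5 is chance). A minimal example with made-up labels and scores:

```python
# AUROC on toy data: y_true marks whether the LLM answered each instance
# correctly; y_score is a predictor's estimated probability of correctness.
from sklearn.metrics import roc_auc_score

y_true  = [1,   0,   1,   1,   0,    0,   1,   0]
y_score = [0.9, 0.2, 0.8, 0.6, 0.65, 0.3, 0.7, 0.5]

# 15 of the 16 (correct, incorrect) pairs are ranked correctly -> 15/16.
auroc = roc_auc_score(y_true, y_score)
print(auroc)
```

The numbers here are invented for illustration; the paper reports AUROC of its linear predictors against the baselines on the actual benchmark tasks.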

A particular point of interest is the application of QueRE to detect adversarially influenced LLMs. In these scenarios, adversarial prompts are introduced to bias LLMs toward incorrect outputs. QueRE effectively distinguishes between adversarially modified and unaltered models, reinforcing its usefulness in practical deployment scenarios where such influences may be present.

Theoretical Insights and Implications

The paper includes a detailed theoretical analysis regarding the approximation of top-k probabilities via sampling techniques when direct access is unavailable. This analysis is crucial for understanding how sampled approximations impact model training and performance, extending the applicability of QueRE to strictly black-box models where such probabilities must be estimated rather than directly accessed.
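The sampling idea can be sketched as follows. This is a minimal illustration, not the paper's analysis: `sample_response` is a hypothetical stand-in for one text-only API call, and the Hoeffding bound shown is the generic concentration argument for estimating a Bernoulli probability from samples.

```python
# Estimating a response probability by repeated sampling, for APIs that
# return only text (no log-probabilities). Toy sampler stands in for the API.
import math
import random

random.seed(0)
TRUE_P_YES = 0.72  # unknown to the estimator; used only by the toy sampler

def sample_response():
    """Hypothetical single API call at nonzero temperature."""
    return "yes" if random.random() < TRUE_P_YES else "no"

def estimate_p_yes(n_samples):
    """Empirical frequency of 'yes' over n_samples independent draws."""
    hits = sum(sample_response() == "yes" for _ in range(n_samples))
    return hits / n_samples

# Hoeffding: P(|estimate - p| > eps) <= 2 * exp(-2 * n * eps**2)
n, eps = 1000, 0.05
failure_bound = 2 * math.exp(-2 * n * eps**2)
estimate = estimate_p_yes(n)
print(estimate, failure_bound)
```

With 1000 samples the estimate is within 0.05 of the true probability except with probability at most about 1.3%, which conveys why sampled approximations can substitute for direct probability access at a manageable query cost.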

The methodological implications suggest a shift toward reliance on elicited responses rather than internal transparency, offering pathways to improved model auditability and reliability in real-world applications. By demonstrating that diverse, sometimes randomly assembled question sets can serve as viable feature vectors, the paper argues that the functional state of LLMs can be effectively probed and utilized for performance evaluation.

Future Prospects

The promising results of QueRE open future research directions beyond traditional performance prediction, extending into domains such as model explainability and trust assessment. Researchers might investigate further refinement of query designs to enhance feature diversity or specificity for various task domains. Additionally, integration of adversarial robustness into regular LLM evaluation workflows could provide comprehensive solutions for ensuring the reliability of AI deployment in sensitive applications.

In conclusion, this paper advances the understanding of LLM behaviors under black-box constraints and articulates an innovative, practical approach for performance prediction, laying the groundwork for more transparent AI systems despite operational opacity.

