
Question-Only Linear Probes for LLM Diagnostics

Updated 1 February 2026
  • The paper introduces a diagnostic tool that uses a linear probe to forecast answer correctness by analyzing activations after the input question.
  • The method computes class centroids in activation space to differentiate between correct and incorrect responses without relying on output tokens.
  • Experimental results across models reveal its efficiency and limitations, highlighting strong performance on factual tasks but challenges in compositional reasoning.

A question-only linear probe is a linear diagnostic tool for LLMs that predicts the forthcoming answer accuracy by inspecting model activations after reading the input question and before any output tokens are generated. Unlike traditional output-based or answer-token probes, the question-only probe requires no access to the actual answer or downstream decoding trajectory, making it a uniquely “in-advance” predictor of model success or failure. This method has become a key approach to investigating internal representations of correctness and confidence within transformer LLMs, and provides direct, cross-domain metrics of factual self-assessment (Cencerrado et al., 12 Sep 2025).

1. Mathematical Foundation

Question-only linear probes are constructed as simple, closed-form separations of activation-space centroids associated with correct and incorrect responses. For a transformer-based LLM $M$ with $L$ layers, let $h^{(l)}(x) \in \mathbb{R}^d$ denote the residual-stream activation at layer $l$ following the final token of a question $x$. The probe seeks a direction $w \in \mathbb{R}^d$ such that the projection of $h$ onto $w$ discriminates forthcoming answer correctness.

Given a training set of $(x_i, \text{label}_i)$ pairs, with $\text{label}_i = 1$ if the model answers correctly and $0$ otherwise:

  • Class centroids are computed as:

$$\mu_{\text{true}} = \frac{1}{N_1} \sum_{i:\,\text{label}_i=1} h^{(l)}(x_i), \quad \mu_{\text{false}} = \frac{1}{N_0} \sum_{i:\,\text{label}_i=0} h^{(l)}(x_i).$$

  • The probe direction is $w = \mu_{\text{true}} - \mu_{\text{false}}$.
  • The probe is centered at $\mu = (\mu_{\text{true}} + \mu_{\text{false}}) / 2$.
  • The scalar correctness score for any hh is:

$$s(h) = \frac{(h - \mu)^\top w}{\|w\|}.$$

No bias or nonlinearity is applied; AUROC is used to evaluate discriminative performance (Cencerrado et al., 12 Sep 2025).
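The closed-form construction above can be sketched in a few lines of NumPy. This is an illustrative reconstruction under the definitions in this section, not the authors' code; the function and array names are assumptions:

```python
# Minimal sketch of the centroid probe: mean-difference direction, midpoint
# centering, and the normalized projection score s(h). Names are illustrative.
import numpy as np

def fit_probe(H, labels):
    """H: (n, d) activations at one layer; labels: (n,) with 1 = answered correctly."""
    H, labels = np.asarray(H, float), np.asarray(labels)
    mu_true = H[labels == 1].mean(axis=0)   # centroid of "will answer correctly"
    mu_false = H[labels == 0].mean(axis=0)  # centroid of "will answer incorrectly"
    w = mu_true - mu_false                  # probe direction (no gradient step)
    mu = (mu_true + mu_false) / 2           # midpoint used for centering
    return w, mu

def score(h, w, mu):
    """Scalar correctness score s(h) = (h - mu)^T w / ||w||."""
    return (np.asarray(h, float) - mu) @ w / np.linalg.norm(w)

# Toy check: two well-separated Gaussian clusters in an 8-d activation space.
rng = np.random.default_rng(0)
H = np.vstack([rng.normal(+1.0, 0.1, (50, 8)), rng.normal(-1.0, 0.1, (50, 8))])
y = np.array([1] * 50 + [0] * 50)
w, mu = fit_probe(H, y)
print(score(H[0], w, mu) > 0, score(H[-1], w, mu) < 0)  # → True True
```

Note that, as in the paper's formulation, there is no bias term, no nonlinearity, and no iterative training: fitting the probe is two means and a subtraction.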

2. Data Pipeline and Probe Construction

The question-only pipeline consists of discrete stages:

  • Activation extraction: Each model is prompted with a few-shot question template. After the final question token is read, activations are captured at every layer: for each $l = 1, \ldots, L$, the $d$-dimensional residual-stream vector $h^{(l)}(x)$ is recorded.
  • Answer generation and labeling: Answer generation resumes using deterministic decoding (temperature 0), and comparison against the ground truth yields $\text{label} = 1$ if correct, $0$ otherwise.
  • Probe construction: For every candidate layer, activations are partitioned by correctness label, class centroids and probe direction are computed (no gradient step is required), and probe performance is estimated via cross-validation.
  • Final probe: The best-performing layer is chosen and the probe retrained with the majority of available data.

The process is compute-efficient: no iterative optimization or regularization is needed, and all layers can be probed in parallel. Typical runs require up to 60 GPU-hours for multi-billion-parameter models.
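The layer-selection stage of the pipeline can be sketched as follows: fit a centroid probe per candidate layer, estimate its AUROC by cross-validation, and keep the best layer. This is a hedged reconstruction of the procedure described above; the AUROC here is computed rank-wise (Mann–Whitney style) rather than with a specific library, and all helper names are assumptions:

```python
# Sketch of cross-validated, per-layer probe evaluation. Illustrative only.
import numpy as np

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties = 1/2)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

def probe_scores(H_train, y_train, H_eval):
    """Fit the centroid probe on the training split, score the eval split."""
    mu_true = H_train[y_train == 1].mean(0)
    mu_false = H_train[y_train == 0].mean(0)
    w, mu = mu_true - mu_false, (mu_true + mu_false) / 2
    return (H_eval - mu) @ w / np.linalg.norm(w)

def best_layer(acts_by_layer, y, k=5):
    """acts_by_layer: {layer: (n, d) activations}. Returns (best layer, per-layer CV AUROC)."""
    folds = np.arange(len(y)) % k
    results = {}
    for layer, H in acts_by_layer.items():
        aucs = [auroc(probe_scores(H[folds != f], y[folds != f], H[folds == f]),
                      y[folds == f])
                for f in range(k)]
        results[layer] = float(np.mean(aucs))
    return max(results, key=results.get), results
```

On synthetic data with one informative layer and one pure-noise layer, `best_layer` picks the informative layer with near-perfect AUROC while the noise layer stays near chance (0.5), mirroring the layerwise behavior reported below.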

3. Experimental Regimes and Baseline Comparisons

This technique has been tested on six open LLMs, ranging from 7B–8B parameters (Qwen 2.5, Mistral 7B Instruct, Ministral 8B) to 70B (Llama 3.3 70B Instruct), across factual, geographical, entity, and mathematical knowledge domains.

Models and dataset sampling:

  • Probes are validated using 10,000 TriviaQA samples, with activations extracted every 2 layers (models <10B parameters) or 4 layers (models >10B).
  • The layer with the highest average AUROC is selected for final probe construction.

Baseline comparisons include:

  • Verbalized confidence: LLMs are instructed to report their subjective answer confidence.
  • Black-box non-linear assessors: Logistic regression and XGBoost trained on OpenAI 3,072-d question embeddings.

Summary of in-domain and out-of-domain results (Llama 3.1 8B example):

Dataset          Direction AUROC   Assessor AUROC   Verbal AUROC
---------------  ----------------  ---------------  -------------
TriviaQA         0.804             0.852            0.502
Notable People   0.722             0.630            0.499
Cities           0.732             0.663            0.500
Math Ops         0.858             0.528            0.623
Medals           0.680             0.623            0.500
GSM8K            0.534             0.558            0.540

Probes generalize substantially better to factual OOD tasks than either verbalized or embedding-based models, but degrade sharply for compositional mathematics (GSM8K: AUROC close to chance, $\sim 0.5$–$0.6$) (Cencerrado et al., 12 Sep 2025).

4. Layerwise Emergence and Sample Efficiency

  • Layerwise analysis: Predictive performance is at chance in early layers, rises rapidly in the middle, and saturates near $L/2$ (e.g., AUROC $\sim 0.80$ from $l = 14$ onward in Llama 3.1 8B).
  • Sample efficiency: As few as 160 training samples can yield AUROC $> 0.70$; performance plateaus after $\sim 2{,}560$ examples. Larger models achieve saturation more quickly.
  • Emergent self-assessment: Accurate self-diagnosis emerges in intermediate layers before answer generation, indicating internal “knowing when you know” computations prior to token emission.

5. Limitations and Failure Modes

Significant limitations persist:

  • Arithmetic/chain-of-thought reasoning: On GSM8K and multi-step tasks, question-only linear probes do not exceed chance accuracy, implying that deep compositional reasoning correctness is not linearly encoded at the question stage.
  • “I don’t know” abstention: For abstention behaviors, probe scores are maximally negative, forming a distinct cluster. This demonstrates that ww also serves as a latent confidence or refusal axis.
  • Domain dependency: Probes generalize across factual entity domains but not into high-cognitive or procedural reasoning settings.
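The abstention finding suggests a natural application: using the probe score as a confidence gate for selective answering. The following is a speculative sketch, not a procedure from the paper; the threshold value and function names are assumptions:

```python
# Illustrative confidence gate built on the probe score s(h): emit the model's
# answer only when the score clears a threshold, otherwise abstain.
# The zero threshold is an arbitrary choice for illustration.
import numpy as np

def selective_answer(h, w, mu, answer, threshold=0.0):
    """Return `answer` only if the probe predicts the answer will be correct."""
    s = (np.asarray(h, float) - mu) @ w / np.linalg.norm(w)
    return answer if s >= threshold else "I don't know"

# Toy usage with a hand-built probe direction.
w, mu = np.ones(4), np.zeros(4)
print(selective_answer(np.ones(4), w, mu, "Paris"))    # → Paris
print(selective_answer(-np.ones(4), w, mu, "Paris"))   # → I don't know
```

In practice the threshold would be calibrated on held-out data to trade answer coverage against accuracy.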

6. Interpretations and Relation to Prior Work

Connections to prior interpretability research:

  • Question-only probes reveal latent correctness/confidence axes, paralleling linear "truthfulness directions" identified by Burger et al. (2024) and knowledge-awareness axes found in sparse autoencoders (Cencerrado et al., 12 Sep 2025).
  • Universal linear signals across model scales (7B–70B) support the hypothesis that correctness self-assessment is a general property of large transformers, accessible via linear operations in residual space.
  • A plausible implication is that more sophisticated or non-linear probes could recover deeper signals, especially for compositional or algorithmic tasks.

7. Implications and Future Directions

Question-only linear probes constitute evidence that transformer LLMs internally encode a robust, mid-computation latent variable for factual answerability. These probes are computationally lightweight, require minimal data, and generalize across models and factual OOD data—features desirable for transparency and diagnostic auditing. A plausible direction for future research is the extension to non-linear probes, model families beyond transformers, and prediction of richer correctness notions (e.g., probabilistic or calibrated multi-sample outputs) (Cencerrado et al., 12 Sep 2025).
