
Question-Only Linear Probes for LLM Diagnostics

Updated 1 February 2026
  • The paper introduces a diagnostic tool that uses a linear probe to forecast answer correctness by analyzing activations after the input question.
  • The method computes class centroids in activation space to differentiate between correct and incorrect responses without relying on output tokens.
  • Experimental results across models reveal its efficiency and limitations, highlighting strong performance on factual tasks but challenges in compositional reasoning.

A question-only linear probe is a linear diagnostic tool for LLMs that predicts the forthcoming answer accuracy by inspecting model activations after reading the input question and before any output tokens are generated. Unlike traditional output-based or answer-token probes, the question-only probe requires no access to the actual answer or downstream decoding trajectory, making it a uniquely “in-advance” predictor of model success or failure. This method has become a key approach to investigating internal representations of correctness and confidence within transformer LLMs, and provides direct, cross-domain metrics of factual self-assessment (Cencerrado et al., 12 Sep 2025).

1. Mathematical Foundation

Question-only linear probes are constructed as simple, closed-form separations of activation-space centroids associated with correct and incorrect responses. For a transformer-based LLM $M$ with $L$ layers, let $h^{(l)}(x) \in \mathbb{R}^d$ denote the residual-stream activation at layer $l$ following the final token of a question $x$. The probe seeks a direction $w \in \mathbb{R}^d$ such that the projection of $h$ onto $w$ discriminates forthcoming answer correctness.

Given a training set of $(x_i, \text{label}_i)$ pairs, with $\text{label}_i = 1$ if the model answers correctly and $0$ otherwise:

  • Class centroids are computed as:

$$\mu_{\text{true}} = \frac{1}{N_1} \sum_{i:\,\text{label}_i=1} h^{(l)}(x_i), \quad \mu_{\text{false}} = \frac{1}{N_0} \sum_{i:\,\text{label}_i=0} h^{(l)}(x_i).$$

  • The probe direction is $w = \mu_{\text{true}} - \mu_{\text{false}}$.
  • The probe is centered at $\mu = (\mu_{\text{true}} + \mu_{\text{false}}) / 2$.
  • The scalar correctness score for any hh is:

$$s(h) = \frac{(h - \mu)^\top w}{\|w\|}.$$

No bias or nonlinearity is applied; AUROC is used to evaluate discriminative performance (Cencerrado et al., 12 Sep 2025).
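The closed-form construction above can be sketched in a few lines of NumPy. This is an illustrative reconstruction under the definitions in this section, not the authors' code; the function and array names are assumptions:

```python
# Minimal sketch of the centroid probe: mean-difference direction, midpoint
# centering, and the normalized projection score s(h). Names are illustrative.
import numpy as np

def fit_probe(H, labels):
    """H: (n, d) activations at one layer; labels: (n,) with 1 = answered correctly."""
    H, labels = np.asarray(H, float), np.asarray(labels)
    mu_true = H[labels == 1].mean(axis=0)   # centroid of "will answer correctly"
    mu_false = H[labels == 0].mean(axis=0)  # centroid of "will answer incorrectly"
    w = mu_true - mu_false                  # probe direction (no gradient step)
    mu = (mu_true + mu_false) / 2           # midpoint used for centering
    return w, mu

def score(h, w, mu):
    """Scalar correctness score s(h) = (h - mu)^T w / ||w||."""
    return (np.asarray(h, float) - mu) @ w / np.linalg.norm(w)

# Toy check: two well-separated Gaussian clusters in an 8-d activation space.
rng = np.random.default_rng(0)
H = np.vstack([rng.normal(+1.0, 0.1, (50, 8)), rng.normal(-1.0, 0.1, (50, 8))])
y = np.array([1] * 50 + [0] * 50)
w, mu = fit_probe(H, y)
print(score(H[0], w, mu) > 0, score(H[-1], w, mu) < 0)  # → True True
```

Note that, as in the paper's formulation, there is no bias term, no nonlinearity, and no iterative training: fitting the probe is two means and a subtraction.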

2. Data Pipeline and Probe Construction

The question-only pipeline consists of discrete stages:

  • Activation extraction: Each model is prompted with a few-shot question template. After the final question token is read, activations are captured at every layer: for each $l = 1, \ldots, L$, the $d$-dimensional residual-stream vector $h^{(l)}(x)$ is recorded.
  • Answer generation and labeling: Answer generation resumes using deterministic decoding (temperature 0), and comparison against the ground truth yields $\text{label} = 1$ if correct, $0$ otherwise.
  • Probe construction: For every candidate layer, activations are partitioned by correctness label, class centroids and probe direction are computed (no gradient step is required), and probe performance is estimated via cross-validation.
  • Final probe: The best-performing layer is chosen and the probe retrained with the majority of available data.

The process is compute-efficient: no iterative optimization or regularization is needed, and all layers can be probed in parallel. Typical runs require up to 60 GPU-hours for multi-billion-parameter models.
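The layer-selection stage of the pipeline can be sketched as follows: fit a centroid probe per candidate layer, estimate its AUROC by cross-validation, and keep the best layer. This is a hedged reconstruction of the procedure described above; the AUROC here is computed rank-wise (Mann–Whitney style) rather than with a specific library, and all helper names are assumptions:

```python
# Sketch of cross-validated, per-layer probe evaluation. Illustrative only.
import numpy as np

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties = 1/2)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

def probe_scores(H_train, y_train, H_eval):
    """Fit the centroid probe on the training split, score the eval split."""
    mu_true = H_train[y_train == 1].mean(0)
    mu_false = H_train[y_train == 0].mean(0)
    w, mu = mu_true - mu_false, (mu_true + mu_false) / 2
    return (H_eval - mu) @ w / np.linalg.norm(w)

def best_layer(acts_by_layer, y, k=5):
    """acts_by_layer: {layer: (n, d) activations}. Returns (best layer, per-layer CV AUROC)."""
    folds = np.arange(len(y)) % k
    results = {}
    for layer, H in acts_by_layer.items():
        aucs = [auroc(probe_scores(H[folds != f], y[folds != f], H[folds == f]),
                      y[folds == f])
                for f in range(k)]
        results[layer] = float(np.mean(aucs))
    return max(results, key=results.get), results
```

On synthetic data with one informative layer and one pure-noise layer, `best_layer` picks the informative layer with near-perfect AUROC while the noise layer stays near chance (0.5), mirroring the layerwise behavior reported below.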

3. Experimental Regimes and Baseline Comparisons

This technique has been tested on six open LLMs, ranging from 7B–8B parameters (Qwen 2.5, Mistral 7B Instruct, Ministral 8B) to 70B (Llama 3.3 70B Instruct), across factual, geographical, entity, and mathematical knowledge domains.

Models and dataset sampling:

  • Probes are validated using 10,000 TriviaQA samples, with activations extracted every 2 layers (models <10B parameters) or 4 layers (models >10B).
  • The layer with the highest average AUROC is selected for final probe construction.

Baseline comparisons include:

  • Verbalized confidence: LLMs are instructed to report their subjective answer confidence.
  • Black-box non-linear assessors: Logistic regression and XGBoost trained on OpenAI 3,072-d question embeddings.

Summary of in-domain and out-of-domain results (Llama 3.1 8B example):

Dataset          Direction AUROC   Assessor AUROC   Verbal AUROC
---------------  ----------------  ---------------  -------------
TriviaQA         0.804             0.852            0.502
Notable People   0.722             0.630            0.499
Cities           0.732             0.663            0.500
Math Ops         0.858             0.528            0.623
Medals           0.680             0.623            0.500
GSM8K            0.534             0.558            0.540

Probes generalize substantially better to factual OOD tasks than either verbalized or embedding-based models, but degrade sharply for compositional mathematics (GSM8K: AUROC close to chance, $\sim 0.5$–$0.6$) (Cencerrado et al., 12 Sep 2025).

4. Layerwise Emergence and Sample Efficiency

  • Layerwise analysis: Predictive performance is at chance in early layers, rises rapidly in the middle, and saturates near $L/2$ (e.g., AUROC $\sim 0.80$ from $l = 14$ onward in Llama 3.1 8B).
  • Sample efficiency: As few as 160 training samples can yield AUROC $> 0.70$; performance plateaus after $\sim 2{,}560$ examples. Larger models achieve saturation more quickly.
  • Emergent self-assessment: Accurate self-diagnosis emerges in intermediate layers before answer generation, indicating internal “knowing when you know” computations prior to token emission.

5. Limitations and Failure Modes

Significant limitations persist:

  • Arithmetic/chain-of-thought reasoning: On GSM8K and multi-step tasks, question-only linear probes do not exceed chance accuracy, implying that deep compositional reasoning correctness is not linearly encoded at the question stage.
  • “I don’t know” abstention: For abstention behaviors, probe scores are maximally negative, forming a distinct cluster. This demonstrates that ww also serves as a latent confidence or refusal axis.
  • Domain dependency: Probes generalize across factual entity domains but not into high-cognitive or procedural reasoning settings.
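The abstention finding suggests a natural application: using the probe score as a confidence gate for selective answering. The following is a speculative sketch, not a procedure from the paper; the threshold value and function names are assumptions:

```python
# Illustrative confidence gate built on the probe score s(h): emit the model's
# answer only when the score clears a threshold, otherwise abstain.
# The zero threshold is an arbitrary choice for illustration.
import numpy as np

def selective_answer(h, w, mu, answer, threshold=0.0):
    """Return `answer` only if the probe predicts the answer will be correct."""
    s = (np.asarray(h, float) - mu) @ w / np.linalg.norm(w)
    return answer if s >= threshold else "I don't know"

# Toy usage with a hand-built probe direction.
w, mu = np.ones(4), np.zeros(4)
print(selective_answer(np.ones(4), w, mu, "Paris"))    # → Paris
print(selective_answer(-np.ones(4), w, mu, "Paris"))   # → I don't know
```

In practice the threshold would be calibrated on held-out data to trade answer coverage against accuracy.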

6. Interpretations and Relation to Prior Work

Connections to prior interpretability research:

  • Question-only probes reveal latent correctness/confidence axes, paralleling linear "truthfulness directions" identified by Burger et al. (2024) and knowledge-awareness axes found in sparse autoencoders (Cencerrado et al., 12 Sep 2025).
  • Universal linear signals across model scales (7B–70B) support the hypothesis that correctness self-assessment is a general property of large transformers, accessible via linear operations in residual space.
  • A plausible implication is that more sophisticated or non-linear probes could recover deeper signals, especially for compositional or algorithmic tasks.

7. Implications and Future Directions

Question-only linear probes constitute evidence that transformer LLMs internally encode a robust, mid-computation latent variable for factual answerability. These probes are computationally lightweight, require minimal data, and generalize across models and factual OOD data—features desirable for transparency and diagnostic auditing. A plausible direction for future research is the extension to non-linear probes, model families beyond transformers, and prediction of richer correctness notions (e.g., probabilistic or calibrated multi-sample outputs) (Cencerrado et al., 12 Sep 2025).
