Sparse Linear Probing: Hashing and LLM Insights
- Sparse Linear Probing refers both to linear probing in hash tables under low load and to the use of $k$-sparse linear classifiers for neuron analysis in LLMs.
- In hashing, it rigorously characterizes displacement through block models, yielding Gaussian moderate deviations and Weibull-like large deviation tails.
- In large language models, it enables interpretable feature selection by isolating minimal neuron subsets, revealing scaling behaviors and representation localization.
Sparse Linear Probing denotes both an algorithmic regime in classic hashing with linear probing under low load (the "sparse table" setting) and a methodology for interpretable neuron analysis in LLMs via $k$-sparse linear classifiers. Both areas exhibit rigorous statistical and optimization properties with implications for performance tail bounds and representational structure.
1. Sparse Linear Probing in Hashing: Model and Deviation Framework
Consider a hash table of $n$ slots, into which $m$ keys are inserted sequentially. Each key $i$ is hashed independently and uniformly to a home slot $h_i \in \{0, \dots, n-1\}$. Under linear probing, a key first attempts its home slot, advancing cyclically until an empty slot is found. The displacement $d_i$ of key $i$ is the number of extra probes needed. The primary object of study is the total displacement,
$$D_{n,m} = \sum_{i=1}^{m} d_i.$$
The sparse regime corresponds to a load factor $\alpha = m/n$ bounded strictly away from 1: $\alpha \le \alpha_0 < 1$. An equivalent block/urn model partitions the table at its empty slots into blocks, each consisting of a contiguous run of occupied slots ending in an empty slot. Each block has a (random) size $B_j$ and contributes displacement $d(B_j)$. Block sizes are i.i.d.\ with exponentially decaying tails. Conditionally on its size, a block's displacement is distributed as the "full-table" displacement for the items occupying its slots. Thus,
$$D_{n,m} = \sum_{j} d(B_j).$$
This formulation enables sharp large- and moderate-deviation analysis in the sparse regime (Klein et al., 2016).
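The insertion dynamics above can be sketched in a few lines of Python; `linear_probe_insert` and `total_displacement` are illustrative names, not from the source.

```python
import numpy as np

def linear_probe_insert(homes, n):
    """Insert keys with the given home slots into a table of n slots
    via linear probing; return each key's displacement d_i."""
    occupied = [False] * n
    displacements = []
    for h in homes:
        d = 0
        while occupied[(h + d) % n]:  # probe cyclically until empty
            d += 1
        occupied[(h + d) % n] = True
        displacements.append(d)
    return displacements

def total_displacement(n, m, rng):
    """Total displacement D_{n,m} for m uniform random keys in n slots."""
    homes = rng.integers(0, n, size=m)
    return sum(linear_probe_insert(homes, n))

# Deterministic trace: homes 0, 0, 1, 5 in an 8-slot table.
# key 0 lands at slot 0 (d=0); key 1 probes 0 then 1 (d=1);
# key 2 probes 1 then 2 (d=1); key 3 lands at 5 (d=0).
disps = linear_probe_insert([0, 0, 1, 5], 8)
D = total_displacement(1024, 512, np.random.default_rng(0))
```

The block structure of the model is visible here: each insertion only interacts with the contiguous occupied run containing its home slot.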
2. Statistical Deviation Results in Sparse Linear Probing
The central results are rigorous moderate and large deviation bounds for $D_{n,m}$.
- Moderate Deviations (Gaussian Scale): for deviations $t$ up to the moderate-deviation scale,
$$\mathbb{P}\bigl(|D_{n,m} - \mathbb{E}D_{n,m}| \ge t\bigr) = \exp\bigl(-\Theta(t^2/n)\bigr),$$
so fluctuations of $D_{n,m}$ on the scale $\sqrt{n}$ exhibit Gaussian-type decay.
- Intermediate Deviations (Crossover): for $t$ in an intermediate range, the tail takes the form $\exp(-I_n(t))$, where the rate $I_n(t)$ interpolates between the quadratic Gaussian form (for smaller $t$) and a non-quadratic root-finding form (for larger $t$).
- Large Deviations (Weibull-Tailed): for $t$ beyond the crossover scale,
$$\mathbb{P}\bigl(D_{n,m} - \mathbb{E}D_{n,m} \ge t\bigr) = \exp\bigl(-\Theta(\sqrt{t})\bigr).$$
Such Weibull-like decay (tail exponent $1/2$) is driven by rare, single large clusters dominating $D_{n,m}$ (Klein et al., 2016).
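As a sanity check on the typical (non-deviation) behavior around which these tails are measured, a seeded Monte Carlo sketch can compare the empirical mean displacement per key at $\alpha = 0.5$ against the classical average-case value $\alpha/(2(1-\alpha))$ from Knuth's analysis of linear probing. All names here are illustrative.

```python
import numpy as np

ALPHA = 0.5  # load factor m/n, bounded strictly away from 1 (sparse regime)

def total_displacement(n, m, rng):
    """Total displacement of m uniform keys inserted into n slots."""
    occupied = np.zeros(n, dtype=bool)
    total = 0
    for h in rng.integers(0, n, size=m):
        d = 0
        while occupied[(h + d) % n]:
            d += 1
        occupied[(h + d) % n] = True
        total += d
    return total

rng = np.random.default_rng(0)
n, m, trials = 2048, 1024, 20
mean_disp = np.mean([total_displacement(n, m, rng) / m for _ in range(trials)])
# Knuth's classical average: E[displacement per key] ~ alpha/(2(1-alpha)) = 0.5
```

Typical runs concentrate near this mean with $\Theta(\sqrt{n})$ fluctuations; the deviation bounds above describe the exponentially rarer excursions beyond that scale.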
3. Methods: Conditioned Sums and Heavy-Tailed Deviations
The analysis exploits Janson's conditioned-sum representation: $D_{n,m}$ can be studied as a sum of i.i.d.\ block contributions conditioned on the total number of slots the blocks cover, reducing the problem to deviations of sums of i.i.d.\ heavy-tailed random variables [Nagaev, Wu]. Key points:
- The block sizes $B_j$ have exponentially decaying tails, with rate determined by the load factor $\alpha$.
- Given its size, a block's displacement inherits a large-deviation rate from the "full-table" case with the corresponding number of items.
- The joint tail yields optimal exponents via a minimization problem connecting block size and displacement.
- For large deviations, the cost concentrates on a single block: a block of length $\Theta(\sqrt{t})$ suffices to carry displacement $\Theta(t)$, and its probability cost is $\exp(-\Theta(\sqrt{t}))$, yielding the tail exponent $1/2$.
- For moderate deviations (up to the Gaussian scale), the collective effect of many blocks dominates, leading to Gaussian exponents via a CLT argument.
- The crossover regime blends the two mechanisms (collective fluctuations and single-block spikes).
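The single-block mechanism can be made concrete with a worst-case block: if $b$ keys all hash to the same home slot, the $i$-th key is displaced $i$ slots, so the block carries $0 + 1 + \cdots + (b-1) = b(b-1)/2 = \Theta(b^2)$ displacement. A minimal sketch (the function name is illustrative):

```python
def block_displacement(b):
    """Total displacement when b keys all hash to the same home slot:
    the i-th inserted key is displaced i slots, giving b(b-1)/2 in total."""
    return sum(range(b))

# A single block of length b carries Theta(b^2) displacement, so reaching a
# total displacement t needs a block of length only ~ sqrt(2t); with block
# sizes having exponential tails, the probability cost is exp(-Theta(sqrt(t))),
# which is the source of the Weibull tail exponent 1/2.
cost = block_displacement(10)
```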
4. Regime Comparison: Sparse vs Full-Load Linear Probing
Sparse and full-table regimes exhibit qualitatively distinct deviation behaviors:
| Regime | Fluctuation Scale | Typical Tail | Dominant Mechanism |
|---|---|---|---|
| Sparse ($\alpha < 1$) | $\Theta(\sqrt{n})$ | Gaussian: $\exp(-\Theta(t^2/n))$ | Many small blocks (CLT) |
| Sparse ($\alpha < 1$), large $t$ | beyond crossover | Weibull: $\exp(-\Theta(\sqrt{t}))$ | Single large block |
| Full ($\alpha = 1$) | $D_{n,n} = \Theta(n^{3/2})$, Airy-distributed | Airy/excursion-type | Empirical process LDP |
In the sparse load regime, clustering (long runs of occupied slots) is exponentially suppressed up to fluctuations, but for rare, very large deviations, "monster" clusters dominate. In contrast, the full-table case is governed by Airy process excursions and derived via Sanov or Legendre calculus (Klein et al., 2016).
5. Sparse Linear Probing in Representation Analysis of LLMs
Sparse linear probes are employed to interrogate the internal structure of LLMs. For MLP activations $x \in \mathbb{R}^d$, a $k$-sparse linear probe is a logistic-regression classifier
$$\hat{p}(y = 1 \mid x) = \sigma(w^\top x + b),$$
where $\sigma$ is the logistic sigmoid. The probe is trained subject to a cardinality constraint $\|w\|_0 \le k$ on the weight vector $w$ (Gurnee et al., 2023).
The selection process involves:
- Filtering top neurons by absolute class-mean difference.
- Sparse selection via Mean-Difference (MMD), ANOVA F-statistic (FS), Mutual Information (MI), L1-regularized logistic regression (LR), Optimal Sparse Probing (OSP, via cutting planes), or Adaptive Thresholding (AT).
- Retraining logistic regression restricted to selected neurons.
- Sweeping $k$ from large (e.g., $256$) down to $1$ to chart the sparsity–performance trade-off.
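The selection-then-retrain steps above can be sketched in a minimal numpy-only pipeline, assuming mean-difference ranking and plain gradient-descent logistic regression; all function names and the synthetic "activations" are hypothetical, not from Gurnee et al.

```python
import numpy as np

def mean_difference_scores(X, y):
    """Rank neurons by absolute difference of class-conditional means."""
    return np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))

def fit_logistic(X, y, lr=0.1, steps=500):
    """Plain logistic regression via full-batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def k_sparse_probe(X, y, k):
    """Select top-k neurons by mean difference, then retrain on them."""
    idx = np.argsort(mean_difference_scores(X, y))[::-1][:k]
    w, b = fit_logistic(X[:, idx], y)
    return idx, w, b

# Synthetic activations: 2 of 100 "neurons" carry the binary feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 100))
y = rng.integers(0, 2, size=400)
X[:, 3] += 2.0 * y   # hypothetical signal neurons
X[:, 7] -= 2.0 * y
idx, w, b = k_sparse_probe(X, y, k=2)
accuracy = ((X[:, idx] @ w + b > 0).astype(int) == y).mean()
```

The two-stage structure (cheap ranking, then retraining on the selected subset) is what keeps the sweep over $k$ tractable at LLM scale.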
6. Empirical Findings: Layer-Wise Sparsity and Scaling in LLMs
Experiments analyze over 100 binary features across 10 categories and 7 Pythia models (70M–6.9B parameters), probing the minimal $k$ required to reach a target F1 score.
| Feature Type | Early Layers ($k$) | Middle Layers ($k$) | Late Layers ($k$) |
|---|---|---|---|
| POS | 7 | 1 | 3 |
| Dependencies | 8 | 2 | 4 |
| Morphology | 6 | 1 | 3 |
| Code Language | 10 | 1 | 5 |
| Nat. Language | 12 | 1 | 6 |
| Data Subset | 9 | 1 | 4 |
| Text Features | 5 | 1 | 2 |
| Compound Words | 15 | 3 | 7 |
| LaTeX | 8 | 2 | 5 |
| Wikidata Facts | 12 | 1 | 6 |
| All Features | 9 | 1 | 4 |
- Early layers require higher $k$ (feature superposition).
- Middle layers often admit $k = 1$ (dedicated neurons).
- Late layers show intermediate sparsity.
- Larger models increasingly localize features in fewer neurons, but scaling behavior varies by feature class.
- Three scaling patterns are observed: emergence of new features, neuron splitting, and features that remain stable with scale (Gurnee et al., 2023).
7. Algorithmic Implications and Interpretability Best Practices
Sparse linear probing quantifies localization and superposition in LLM representations. Key practices:
- Searching for both monosemantic ($k = 1$) and polysemantic ($k > 1$) neurons by sweeping $k$ (typically $1$–$16$) with MMD or AT.
- Ablation analysis, activation distributions, and output-logit influence validate neuron function.
- Early-layer neurons with large weight norms and negative biases are characteristic of $n$-gram superposition.
- Middle layers' dedicated neurons enable mechanistic interpretability.
- Sparse linear probes at small $k$ achieve high mean F1 on held-out data across multiple tasks, with MMD matching more complex selection methods to within a small margin of F1.
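The $k$-sweep practice above can be sketched on synthetic data, using mean-difference selection with a nearest-class-mean classifier as a lightweight stand-in for a retrained probe; all names and data are illustrative, and the F1 values are not those of the paper.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Binary F1 computed from scratch."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def probe_f1_at_k(X, y, k):
    """Mean-difference selection of k neurons, then a nearest-class-mean
    classifier on the selected subspace (stand-in for retrained logistic
    regression)."""
    scores = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    idx = np.argsort(scores)[::-1][:k]
    mu1 = X[y == 1][:, idx].mean(axis=0)
    mu0 = X[y == 0][:, idx].mean(axis=0)
    d1 = ((X[:, idx] - mu1) ** 2).sum(axis=1)
    d0 = ((X[:, idx] - mu0) ** 2).sum(axis=1)
    return f1_score(y, (d1 < d0).astype(int))

# One strongly "monosemantic" neuron: k = 1 should already suffice.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 64))
y = rng.integers(0, 2, size=500)
X[:, 10] += 3.0 * y  # hypothetical dedicated neuron
sweep = {k: probe_f1_at_k(X, y, k) for k in (1, 2, 4, 8, 16)}
```

When the feature is carried by a single dedicated neuron, F1 is already high at $k = 1$; a slow climb with $k$ instead signals superposition.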
A plausible implication is that sparsity in representation can be both a marker of model scale and of feature uniqueness, offering a quantitative tool for interpretability and diagnostic analysis of high-dimensional neural systems (Gurnee et al., 2023).
8. Conclusion: The Significance of Sparsity in Linear Probing Paradigms
In both algorithmic and neural contexts, sparse linear probing provides rigorous statistical and practical insights. In hashing, the transition from CLT to Weibull deviations delineates operational risk and informs load factor selection. In LLMs, $k$-sparse probes expose the mechanics of distributed, superposed, and monosemantic representations, scaling patterns, and the function of individual neurons. Both domains exemplify the interplay of sparsity constraints, tail exponent behavior, and the concentration of functional or computational cost.