
Sparse Linear Probing: Hashing and LLM Insights

Updated 26 January 2026
  • Sparse Linear Probing is a method that applies linear probing under low load in hash tables and uses k-sparse linear classifiers for neuron analysis in LLMs.
  • In hashing, it rigorously characterizes displacement through block models, yielding Gaussian moderate deviations and Weibull-like large deviation tails.
  • In large language models, it enables interpretable feature selection by isolating minimal neuron subsets, revealing scaling behaviors and representation localization.

Sparse Linear Probing denotes both an algorithmic regime in classic hashing with linear probing under low load (the "sparse table" setting) and a methodology for interpretable neuron analysis in LLMs via $k$-sparse linear classifiers. Both areas exhibit rigorous statistical and optimization properties with implications for performance tail bounds and representational structure.

1. Sparse Linear Probing in Hashing: Model and Deviation Framework

Consider a hash table of $m$ slots, into which $n$ keys are inserted sequentially. Each key $i$ is hashed independently and uniformly to a home slot $h(i)\in\{1,\ldots,m\}$. Under linear probing, a key first tries its home slot and then advances cyclically, one slot at a time, until it finds an empty slot. The displacement $d_i$ of key $i$ is the number of slots it advances beyond its home slot. The primary object of study is the total displacement,

$$S_n = \sum_{i=1}^n d_i.$$
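The insertion procedure and the total displacement can be sketched in a few lines of Python. This is a minimal simulation, not code from the cited work: the function name `total_displacement` and the 0-based slot indexing are assumptions made for illustration, and the displacement is counted as the number of steps a key moves past its home slot.

```python
import random

def total_displacement(m, n, rng):
    """Insert n keys into m slots with linear probing.

    Returns (S_n, occupied), where S_n is the sum of displacements and
    occupied is the final occupancy list. Slots are 0-based here.
    """
    if not 0 <= n <= m:
        raise ValueError("need 0 <= n <= m")
    occupied = [False] * m
    total = 0
    for _ in range(n):
        slot = rng.randrange(m)        # uniform home slot h(i)
        d = 0                          # displacement d_i of this key
        while occupied[slot]:
            slot = (slot + 1) % m      # advance cyclically
            d += 1
        occupied[slot] = True
        total += d
    return total, occupied

rng = random.Random(0)
s_n, table = total_displacement(m=1000, n=500, rng=rng)
print("S_n =", s_n, "occupied slots =", sum(table))
```

Passing the random generator in explicitly keeps runs reproducible and makes it easy to substitute a deterministic source when checking the logic by hand.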

The sparse regime corresponds to a load factor $\alpha = n/m < 1$ bounded strictly away from 1: $0 < \alpha \leq \alpha_0 < 1$. An equivalent block/urn model uses the $m-n$ empty slots to partition the table into $m-n$ blocks, one per empty slot, each consisting of a contiguous run of occupied slots ending in an empty slot. Each block mm1 has a (random) size mm2 and contributes displacement mm3. Block sizes are i.i.d. mm4 with mm5. Conditionally, the displacement given mm6 is the "full-table" displacement for mm7 items in mm8 slots. Thus,

mm9

This formulation enables sharp large- and moderate-deviation analysis in the sparse regime (Klein et al., 2016).
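To make the block decomposition concrete, the sketch below splits a cyclic occupancy list into blocks, each a run of occupied slots closed by one empty slot. The function name `block_lengths` is hypothetical, and counting the closing empty slot as part of the block length is a convention assumed here for illustration.

```python
def block_lengths(occupied):
    """Return the lengths of the blocks of a cyclic table.

    A block is a (possibly empty) run of occupied slots followed by the
    empty slot that closes it, so there is exactly one block per empty slot.
    """
    m = len(occupied)
    empties = [i for i, occ in enumerate(occupied) if not occ]
    if not empties:
        raise ValueError("a full table has no empty slot to close a block")
    lengths = []
    prev = empties[-1] - m             # last empty slot, viewed cyclically
    for e in empties:
        lengths.append(e - prev)       # occupied run plus closing empty slot
        prev = e
    return lengths

table = [True, False, False, True, True, False, True]
print(block_lengths(table))            # one entry per empty slot
```

With this convention the number of blocks equals the number of empty slots and the block lengths sum to the table size, which is the bookkeeping the conditioned-sum representation relies on.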

2. Statistical Deviation Results in Sparse Linear Probing

The central results are rigorous moderate and large deviation bounds for $S_n$.

  1. Moderate Deviations (Gaussian Scale): For nn1 and nn2,

nn3

Fluctuations of nn4 up to scale nn5 exhibit Gaussian-type decay.

  2. Intermediate Deviations (Crossover at nn6): For nn7,

nn8

where nn9 interpolates between quadratic (for small h(i)∈{1,…,m}h(i)\in\{1,\ldots,m\}0) and non-quadratic root-finding forms (for large h(i)∈{1,…,m}h(i)\in\{1,\ldots,m\}1).

  3. Large Deviations (Weibull-Tailed): For h(i)∈{1,…,m}h(i)\in\{1,\ldots,m\}2 and h(i)∈{1,…,m}h(i)\in\{1,\ldots,m\}3,

h(i)∈{1,…,m}h(i)\in\{1,\ldots,m\}4

In particular, for h(i)∈{1,…,m}h(i)\in\{1,\ldots,m\}5,

h(i)∈{1,…,m}h(i)\in\{1,\ldots,m\}6

Such Weibull-like decay (tail exponent h(i)∈{1,…,m}h(i)\in\{1,\ldots,m\}7) is driven by rare, single large clusters dominating h(i)∈{1,…,m}h(i)\in\{1,\ldots,m\}8 (Klein et al., 2016).

3. Methods: Conditioned Sums and Heavy-Tailed Deviations

The analysis exploits Janson's conditioned sums: $S_n$ can be studied as did_i0 conditioned on did_i1, reducing to deviations of sums of i.i.d. heavy-tailed random variables [Nagaev, Wu]. Key points:

  • The block sizes did_i2 have exponentially decaying tails determined by did_i3.
  • Given did_i4, the block displacement did_i5 inherits a large-deviation rate did_i6 from the "full-table" case with did_i7 items.
  • The joint tail did_i8 yields optimal exponents via a minimization problem connecting block size and displacement.
  • For large deviations, the cost concentrates on a single block: did_i9, ii0, with the exponent ii1.
  • For moderate deviations (ii2), the collective effect of many blocks dominates, leading to Gaussian exponents via a CLT argument.
  • The crossover regime (ii3) blends the two mechanisms (collective fluctuations and single-block spikes).
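The single-block mechanism in the list above can be examined empirically by attributing each key's displacement to the block in which the key ends up. The sketch below is only a diagnostic under stated assumptions, not part of the proof technique: the name `displacement_by_block` is hypothetical, and it requires at least one empty slot ($n < m$).

```python
import random

def displacement_by_block(m, n, rng):
    """Simulate linear probing and return the displacement summed per block."""
    if not 0 <= n < m:
        raise ValueError("need 0 <= n < m so that at least one slot is empty")
    disp = [None] * m                  # disp[s]: displacement of the key in slot s
    for _ in range(n):
        slot, d = rng.randrange(m), 0
        while disp[slot] is not None:
            slot, d = (slot + 1) % m, d + 1
        disp[slot] = d
    # Start scanning right after an empty slot so no block is cut in two.
    start = next(i for i in range(m) if disp[i] is None) + 1
    per_block, current = [], 0
    for j in range(m):
        v = disp[(start + j) % m]
        if v is None:                  # an empty slot closes the current block
            per_block.append(current)
            current = 0
        else:
            current += v
    return per_block

per_block = displacement_by_block(m=2000, n=1000, rng=random.Random(0))
total = sum(per_block)
share = max(per_block) / total if total else 0.0
print("blocks:", len(per_block), "S_n:", total, "largest-block share:", round(share, 3))
```

A key's probe path only crosses occupied slots, so its home slot and final slot always lie in the same final block; this is what makes the per-block attribution well defined.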

4. Regime Comparison: Sparse vs Full-Load Linear Probing

Sparse and full-table regimes exhibit qualitatively distinct deviation behaviors:

| Regime | Fluctuation Scale | Typical Tail | Dominant Mechanism |
| --- | --- | --- | --- |
| Sparse (ii4) | ii5–ii6 | Gaussian: ii7 | Many small blocks (CLT) |
| Sparse, large ii8 | ii9 | Weibull: Sn=∑i=1ndi.S_n = \sum_{i=1}^n d_i.0 | Single large block |
| Full (Sn=∑i=1ndi.S_n = \sum_{i=1}^n d_i.1) | Sn=∑i=1ndi.S_n = \sum_{i=1}^n d_i.2, Sn=∑i=1ndi.S_n = \sum_{i=1}^n d_i.3 | Airy/Excursion; Sn=∑i=1ndi.S_n = \sum_{i=1}^n d_i.4 | Empirical process LDP |

In the sparse load regime, clustering (long runs of occupied slots) is exponentially suppressed up to Sn=∑i=1ndi.S_n = \sum_{i=1}^n d_i.5 fluctuations, but for rare, very large deviations, "monster" clusters dominate. In contrast, the full-table case is governed by Airy process excursions and derived via Sanov or Legendre calculus (Klein et al., 2016).
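How strongly clustering depends on the load factor can be seen in a small Monte Carlo experiment. The sketch below estimates the mean displacement per key at several loads and prints it next to the classical large-table approximation $\alpha / (2(1-\alpha))$, which follows from Knuth's estimate $\tfrac12\bigl(1 + 1/(1-\alpha)\bigr)$ for the expected number of probes in a successful search. That approximation is not taken from the cited deviation results; the function name `mean_displacement`, the table size, and the trial count are assumptions chosen for illustration.

```python
import random

def mean_displacement(m, alpha, trials, seed=0):
    """Average displacement per key over repeated fills at load factor alpha."""
    rng = random.Random(seed)
    n = int(alpha * m)
    if not 0 < n <= m:
        raise ValueError("alpha must give 0 < n <= m")
    steps = 0
    for _ in range(trials):
        occupied = [False] * m
        for _ in range(n):
            slot = rng.randrange(m)
            while occupied[slot]:
                slot = (slot + 1) % m
                steps += 1             # one more unit of displacement
            occupied[slot] = True
    return steps / (trials * n)

for alpha in (0.25, 0.5, 0.75, 0.9):
    approx = alpha / (2 * (1 - alpha))
    estimate = mean_displacement(2000, alpha, trials=20)
    print(f"alpha={alpha:.2f}  simulated={estimate:.3f}  approx={approx:.3f}")
```

The simulated mean stays small while the load is bounded away from 1 and grows quickly as the table fills, which is the practical reason the sparse and full regimes are analyzed separately.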

5. Sparse Linear Probing in Representation Analysis of LLMs

Sparse linear probes are employed to interrogate the internal structure of LLMs. For a vector of MLP activations $a \in \mathbb{R}^d$, a $k$-sparse linear probe is a logistic regression classifier:

$$\hat{y} = \sigma(w^\top a + b),$$

where $\sigma$ is the logistic sigmoid. The probe is trained subject to a cardinality constraint on the weight vector, $\|w\|_0 \leq k$ (Gurnee et al., 2023).

The selection process involves:

  • Filtering top neurons by absolute class-mean difference.
  • Sparse selection via Mean-Difference (MMD), ANOVA F-statistic (FS), Mutual Information (MI), L1-regularized logistic regression (LR), Optimal Sparse Probing (OSP, via cutting planes), or Adaptive Thresholding (AT).
  • Retraining logistic regression restricted to the $k$ selected neurons.
  • Sweeping α=n/m<1\alpha = n/m < 12 from large (e.g., α=n/m<1\alpha = n/m < 13) to α=n/m<1\alpha = n/m < 14 to chart sparsity–performance trade-off.
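The select-then-retrain loop above can be sketched in pure Python. This is an illustrative toy under stated assumptions, not the authors' implementation: it covers only the mean-difference ranking route, uses plain gradient descent in place of a library solver, and runs on synthetic activations in which a single hypothetical "feature neuron" (index 3) carries the label. The function names, the data, and the swept values of `k` are all made up for the example.

```python
import math
import random

def mean_difference_topk(X, y, k):
    """Rank neurons by |mean activation on class 1 - mean on class 0|; keep the top k."""
    d = len(X[0])
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    score = [abs(sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg))
             for j in range(d)]
    return sorted(range(d), key=lambda j: -score[j])[:k]

def fit_logistic(X, y, idx, epochs=200, lr=0.1):
    """Logistic regression restricted to the neurons in idx (batch gradient descent)."""
    w, b, n = [0.0] * len(idx), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(idx), 0.0
        for x, t in zip(X, y):
            z = b + sum(wi * x[j] for wi, j in zip(w, idx))
            p = 1.0 / (1.0 + math.exp(-z))          # logistic sigmoid
            for i, j in enumerate(idx):
                gw[i] += (p - t) * x[j]
            gb += p - t
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def accuracy(X, y, idx, w, b):
    """Fraction of examples whose predicted class (sign of the logit) matches y."""
    hits = 0
    for x, t in zip(X, y):
        z = b + sum(wi * x[j] for wi, j in zip(w, idx))
        hits += int((z > 0) == (t == 1))
    return hits / len(X)

# Synthetic activations: 20 neurons, 200 examples, neuron 3 encodes the label.
rng = random.Random(0)
d, n = 20, 200
y = [i % 2 for i in range(n)]
X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
for x, t in zip(X, y):
    x[3] = (2 * t - 1) + rng.gauss(0, 0.5)

for k in (8, 4, 2, 1):                               # sweep from larger to smaller k
    idx = mean_difference_topk(X, y, k)
    w, b = fit_logistic(X, y, idx)
    print(k, sorted(idx), round(accuracy(X, y, idx, w, b), 3))
```

For brevity the sweep reports training accuracy; a real probing study would score each `k` on held-out data and could swap in any of the other selection methods listed above.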

6. Empirical Findings: Layer-Wise Sparsity and Scaling in LLMs

Experiments analyze over 100 binary features across 10 categories and 7 Pythia models (70M–6.9B parameters), probing the minimal α=n/m<1\alpha = n/m < 15 required to achieve F1 α=n/m<1\alpha = n/m < 16.

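| Feature Type | Early | Middle | Late |
| --- | --- | --- | --- |
| POS | 7 | 1 | 3 |
| Dependencies | 8 | 2 | 4 |
| Morphology | 6 | 1 | 3 |
| Code Language | 10 | 1 | 5 |
| Nat. Language | 12 | 1 | 6 |
| Data Subset | 9 | 1 | 4 |
| Text Features | 5 | 1 | 2 |
| Compound Words | 15 | 3 | 7 |
| LaTeX | 8 | 2 | 5 |
| Wikidata Facts | 12 | 1 | 6 |
| All Features | 9 | 1 | 4 |
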
  • Early layers require higher $k$ (feature superposition).
  • Middle layers admit $k = 1$ (dedicated neurons).
  • Late layers show intermediate sparsity.
  • Larger models increasingly localize features in fewer neurons, but scaling behavior varies by feature class.
  • Three textures of scaling: emergence of α=n/m<1\alpha = n/m < 19 features, neuron splitting, and stable features with scale (Gurnee et al., 2023).
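The "minimal $k$" reported per feature and layer can be computed from a sweep by scoring each probe with F1 and taking the smallest $k$ that clears a chosen threshold. The sketch below shows that bookkeeping only. The helper names `f1_score` and `minimal_k`, the threshold value, and the F1 numbers in the usage example are hypothetical and are not results from the study.

```python
def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def minimal_k(f1_by_k, threshold):
    """Smallest k whose F1 meets the threshold, or None if no k does."""
    passing = [k for k, f1 in f1_by_k.items() if f1 >= threshold]
    return min(passing) if passing else None

# Made-up sweep results for one feature at one layer (illustrative only).
sweep = {1: 0.60, 2: 0.80, 4: 0.95, 8: 0.97}
print(minimal_k(sweep, threshold=0.9))
print(f1_score([1, 1, 0, 0], [1, 0, 0, 1]))
```

Returning `None` when no probe reaches the threshold keeps "feature not sparsely decodable at this layer" distinct from "decodable with many neurons".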

7. Algorithmic Implications and Interpretability Best Practices

Sparse linear probing quantifies localization and superposition in LLM representations. Key practices:

  • Searching for both monosemantic (0<α≤α0<10<\alpha \leq \alpha_0 < 10) and polysemantic (0<α≤α0<10<\alpha \leq \alpha_0 < 11) neurons by sweeping 0<α≤α0<10<\alpha \leq \alpha_0 < 12 (typically 0<α≤α0<10<\alpha \leq \alpha_0 < 13–0<α≤α0<10<\alpha \leq \alpha_0 < 14) with MMD or AT.
  • Ablation analysis, activation distributions, and output-logit influence validate neuron function.
  • In early layers, large weight norms and negative biases characterize $n$-gram superposition.
  • Middle layers' dedicated neurons enable mechanistic interpretability.
  • Sparse linear probes at 0<α≤α0<10<\alpha \leq \alpha_0 < 16 achieve mean F1 0<α≤α0<10<\alpha \leq \alpha_0 < 17 on held-out data for multiple tasks, with MMD matching complex selection methods within 0<α≤α0<10<\alpha \leq \alpha_0 < 18 F1.

A plausible implication is that sparsity in representation can be both a marker of model scale and of feature uniqueness, offering a quantitative tool for interpretability and diagnostic analysis of high-dimensional neural systems (Gurnee et al., 2023).

8. Conclusion: The Significance of Sparsity in Linear Probing Paradigms

In both algorithmic and neural contexts, sparse linear probing provides rigorous statistical and practical insights. In hashing, the transition from CLT to Weibull deviations delineates operational risk and informs load factor selection. In LLMs, $k$-sparse probes expose the mechanics of distributed, superposed, and monosemantic representations, scaling patterns, and the function of individual neurons. Both domains exemplify the interplay of sparsity constraints, tail exponent behavior, and the concentration of functional or computational cost.
