
Sparse Linear Probing: Hashing and LLM Insights

Updated 26 January 2026
  • Sparse Linear Probing refers both to linear probing in hash tables under low load and to $k$-sparse linear classifiers used for neuron analysis in LLMs.
  • In hashing, it rigorously characterizes displacement through block models, yielding Gaussian moderate deviations and Weibull-like large deviation tails.
  • In large language models, it enables interpretable feature selection by isolating minimal neuron subsets, revealing scaling behaviors and representation localization.

Sparse Linear Probing denotes both an algorithmic regime in classic hashing with linear probing under low load (the "sparse table" setting) and a methodology for interpretable neuron analysis in LLMs via $k$-sparse linear classifiers. Both areas exhibit rigorous statistical and optimization properties, with implications for performance tail bounds and representational structure.

1. Sparse Linear Probing in Hashing: Model and Deviation Framework

Consider a hash table of $m$ slots, into which $n$ keys are inserted sequentially. Each key $i$ is hashed independently and uniformly to a slot $h(i)\in\{1,\ldots,m\}$. Under linear probing, a key attempts its home slot, advancing cyclically until an empty slot is found. The displacement $d_i$ of key $i$ is the number of probes needed. The primary object of study is the total displacement,

$$S_n = \sum_{i=1}^n d_i.$$

The sparse regime corresponds to a load factor $\alpha = n/m < 1$, bounded strictly away from $1$: $0<\alpha \leq \alpha_0 < 1$. An equivalent block/urn model partitions the table into $N := m-n$ blocks, one per empty slot: each block is a contiguous run of occupied slots ending in an empty slot. Each block $i$ has a (random) size $X_i$ and contributes displacement $Y_i$. Block sizes are i.i.d. $X\sim \mathrm{Borel}(\mu)$ with $\mu = n/m$. Conditionally, the displacement given $X=l$ is the "full-table" displacement for $l$ items in $l$ slots. Thus,

$$S_n \overset{d}{=} T_N \mid \left\{\sum_{i=1}^N X_i = m\right\}, \qquad T_N = \sum_{i=1}^N Y_i.$$

This formulation enables sharp large- and moderate-deviation analysis in the sparse regime (Klein et al., 2016).
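
The insertion model above is easy to reproduce directly. The following minimal simulation (slot count, key count, and seed are illustrative choices, not values from the source) builds a table by uniform hashing with cyclic linear probing and accumulates the total displacement $S_n$:

```python
import random

def linear_probe_insert(m, n, seed=0):
    """Insert n keys into a table of m slots using uniform hashing and
    linear probing; return the occupancy table and per-key displacements
    d_i (number of probes past the home slot)."""
    rng = random.Random(seed)
    table = [False] * m
    displacements = []
    for _ in range(n):
        home = rng.randrange(m)          # uniform home slot h(i)
        d = 0
        while table[(home + d) % m]:     # advance cyclically to an empty slot
            d += 1
        table[(home + d) % m] = True
        displacements.append(d)
    return table, displacements

# Sparse regime: load factor alpha = n/m = 0.5, with N = m - n empty slots
m, n = 1000, 500
table, disp = linear_probe_insert(m, n)
S_n = sum(disp)                          # total displacement
print("S_n =", S_n, " N =", m - n)
```

Because $\alpha$ is bounded away from $1$, occupied runs stay short and individual displacements remain small, which is the setting the deviation bounds below quantify.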

2. Statistical Deviation Results in Sparse Linear Probing

The central results are rigorous moderate and large deviation bounds for $S_n$.

  1. Moderate Deviations (Gaussian Scale): For $1/2 < \alpha < 2/3$ and $x_N = o(N^{\alpha-1/2})$,

$$\mathbb{P}\left(S_n-E[S_n] > N^\alpha y\right) = \exp\left(-\frac{y^2}{2\sigma^2(\mu)}N^{2\alpha-1} + o(N^{2\alpha-1})\right).$$

Fluctuations of $S_n$ up to scale $N^{2/3}$ exhibit Gaussian-type decay.

  2. Intermediate Deviations (Crossover at $\alpha=2/3$): For $y>0$,

$$\frac{1}{N^{1/3}}\log\mathbb{P}\left(S_n-E[S_n] \geq N^{2/3}y\right) \to -I(y)$$

where $I(y)$ interpolates between quadratic (for small $y$) and non-quadratic root-finding forms (for large $y$).

  3. Large Deviations (Weibull-Tailed): For $2/3<\alpha<2$ and $y>0$,

$$\mathbb{P}\left(S_n-E[S_n]>N^\alpha y\right) = \exp\left(-q(\mu)y^{1/2}N^{\alpha/2}+o(N^{\alpha/2})\right).$$

In particular, for $x \gg N^{2/3}$,

$$\mathbb{P}\left(S_n-E[S_n]>x\right) \asymp \exp\left(-q(\mu)x^{1/2}+o(x^{1/2})\right).$$

Such Weibull-like decay (tail exponent $1/2$) is driven by rare, single large clusters dominating $S_n$ (Klein et al., 2016).
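
These tail regimes can be explored by Monte Carlo. The sketch below (table size, sample count, and thresholds are arbitrary illustrative choices) estimates the empirical upper tail of $S_n - E[S_n]$ at load factor $1/2$; the counts at increasing thresholds decay in the manner the bounds above describe:

```python
import random

def total_displacement(m, n, rng):
    """One realization of the total displacement S_n for n uniform keys
    inserted by linear probing into m slots."""
    table = [False] * m
    s = 0
    for _ in range(n):
        home = rng.randrange(m)
        d = 0
        while table[(home + d) % m]:
            d += 1
        table[(home + d) % m] = True
        s += d
    return s

rng = random.Random(1)
m, n = 400, 200                      # sparse regime, alpha = 0.5
samples = [total_displacement(m, n, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
# Empirical upper-tail probabilities at increasing thresholds x
for x in (20, 40, 60):
    p = sum(s - mean > x for s in samples) / len(samples)
    print(f"P(S_n - E[S_n] > {x}) ~ {p:.3f}")
```

Distinguishing the Gaussian regime from the Weibull $x^{1/2}$ regime numerically requires far larger $N$ and sample counts than this sketch uses; it only illustrates the quantities being bounded.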

3. Methods: Conditioned Sums and Heavy-Tailed Deviations

The analysis exploits Janson's conditioned sums: $S_n$ can be studied as $T_N$ conditioned on $\sum X_i = m$, reducing to deviations of sums of i.i.d. heavy-tailed random variables [Nagaev, Wu]. Key points:

  • The block sizes $X$ have exponentially decaying tails determined by $\kappa(\mu)=\mu-\log\mu-1$.
  • Given $X=l$, the block displacement $Y$ inherits a large-deviation rate $J(\delta)$ from the "full-table" case with $l$ items.
  • The joint tail $\mathbb{P}(X=l,\,Y\geq p)$ yields optimal exponents via a minimization problem connecting block size and displacement.
  • For large deviations, the cost concentrates on a single block: $l\approx(x/\delta)^{1/2}$, $p\approx \delta l^2 \approx x$, with the exponent $q(\mu) = \inf_{0<\delta<1/2} \frac{\kappa(\mu)+J(\delta)}{\sqrt{\delta}}$.
  • For moderate deviations ($x = o(N^{2/3})$), the collective effect of many blocks dominates, leading to Gaussian exponents via a CLT argument.
  • The crossover regime ($\alpha=2/3$) blends the two mechanisms (collective fluctuations and single-block spikes).
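
The Borel block-size law can be sampled through its standard branching-process representation: a $\mathrm{Borel}(\mu)$ variable is the total progeny of a Galton–Watson process with Poisson($\mu$) offspring. The sketch below (sampler design, $\mu$, and sample count are illustrative choices, not from the source) draws block sizes and checks the exact mean $1/(1-\mu)$ alongside the tail rate $\kappa(\mu)$:

```python
import math
import random

def poisson(mu, rng):
    """Poisson(mu) sample via Knuth's multiplication method (fine for small mu)."""
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def borel_sample(mu, rng):
    """Borel(mu) sample = total progeny of a Galton-Watson process with
    Poisson(mu) offspring; subcritical (mu < 1), so it terminates a.s."""
    total = frontier = 1
    while frontier > 0:
        children = sum(poisson(mu, rng) for _ in range(frontier))
        total += children
        frontier = children
    return total

mu = 0.5
kappa = mu - math.log(mu) - 1        # exponential tail rate kappa(mu)
rng = random.Random(2)
xs = [borel_sample(mu, rng) for _ in range(20000)]
print("empirical mean:", sum(xs) / len(xs), " exact:", 1 / (1 - mu))
print("kappa(0.5) =", kappa)
```

Here the exponential suppression of large blocks, governed by $\kappa(\mu)$, is exactly what makes the single-block "monster cluster" events rare but tail-dominant.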

4. Regime Comparison: Sparse vs Full-Load Linear Probing

Sparse and full-table regimes exhibit qualitatively distinct deviation behaviors:

| Regime | Fluctuation scale | Typical tail | Dominant mechanism |
| --- | --- | --- | --- |
| Sparse ($n<m$) | $N^{1/2}$ to $N^{2/3}$ | Gaussian: $\exp(-cx^2/N)$ | Many small blocks (CLT) |
| Sparse, large $x$ | $x \gg N^{2/3}$ | Weibull: $\exp(-qx^{1/2})$ | Single large block |
| Full ($n=m$) | $m^{3/2}$, $m^2$ | Airy/excursion; $J(\delta)m$ | Empirical process LDP |

In the sparse load regime, clustering (long runs of occupied slots) is exponentially suppressed up to $N^{2/3}$ fluctuations, but for rare, very large deviations, "monster" clusters dominate. In contrast, the full-table case is governed by Airy process excursions and derived via Sanov or Legendre calculus (Klein et al., 2016).

5. Sparse Linear Probing in Representation Analysis of LLMs

Sparse linear probes are employed to interrogate the internal structure of LLMs. For MLP activations $a_t^{(\ell)}\in\mathbb{R}^d$, a $k$-sparse linear probe is a logistic regression classifier:

$$\hat z_t = \sigma\left(w^\top a_t^{(\ell)} + b\right), \quad \|w\|_0 \leq k,$$

where $\sigma(\cdot)$ is the logistic sigmoid. The probe is trained subject to a cardinality constraint on $w$ (Gurnee et al., 2023).

The selection process involves:

  • Filtering top neurons by absolute class-mean difference.
  • Sparse selection via Mean-Difference (MMD), ANOVA F-statistic (FS), Mutual Information (MI), L1-regularized logistic regression (LR), Optimal Sparse Probing (OSP, via cutting planes), or Adaptive Thresholding (AT).
  • Retraining logistic regression restricted to selected kk neurons.
  • Sweeping $k$ from large (e.g., $256$) to $1$ to chart the sparsity–performance trade-off.
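
A minimal version of this pipeline, mean-difference filtering followed by logistic retraining on the selected $k$ neurons, can be sketched on synthetic activations. All data, dimensions, the planted "dedicated" neurons, and the gradient-descent trainer are illustrative assumptions, not details from Gurnee et al.:

```python
import numpy as np

def mean_difference_select(A, z, k):
    """Rank neurons by absolute class-mean difference; keep the top k."""
    diff = A[z == 1].mean(axis=0) - A[z == 0].mean(axis=0)
    return np.argsort(-np.abs(diff))[:k]

def train_logistic(X, z, lr=0.5, steps=1000):
    """Plain gradient-descent logistic regression on the selected neurons."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid
        g = p - z                                 # gradient of log-loss
        w -= lr * X.T @ g / len(z)
        b -= lr * g.mean()
    return w, b

# Synthetic "activations": two informative neurons out of d = 50
rng = np.random.default_rng(0)
n, d = 400, 50
z = rng.integers(0, 2, n)                 # binary feature labels
A = rng.normal(size=(n, d))
A[:, 3] += 2.0 * z                        # hypothetical dedicated neurons
A[:, 17] -= 1.5 * z
idx = mean_difference_select(A, z, k=2)   # filter, then retrain on k neurons
w, b = train_logistic(A[:, idx], z)
acc = ((A[:, idx] @ w + b > 0).astype(int) == z).mean()
print("selected:", sorted(idx.tolist()), " train accuracy:", round(acc, 3))
```

The two planted neurons are recovered because their class-mean differences dwarf the noise; on real activations, the same filter is a cheap first pass before the costlier selection methods listed above.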

6. Empirical Findings: Layer-Wise Sparsity and Scaling in LLMs

Experiments analyze over 100 binary features across 10 categories and 7 Pythia models (70M–6.9B parameters), probing the minimal $k$ required to achieve F1 $\geq 0.8$.

| Feature Type | Early | Middle | Late |
| --- | --- | --- | --- |
| POS | 7 | 1 | 3 |
| Dependencies | 8 | 2 | 4 |
| Morphology | 6 | 1 | 3 |
| Code Language | 10 | 1 | 5 |
| Nat. Language | 12 | 1 | 6 |
| Data Subset | 9 | 1 | 4 |
| Text Features | 5 | 1 | 2 |
| Compound Words | 15 | 3 | 7 |
| LaTeX | 8 | 2 | 5 |
| Wikidata Facts | 12 | 1 | 6 |
| All Features | 9 | 1 | 4 |
  • Early layers require higher $k$ (feature superposition).
  • Middle layers admit $k=1$ (dedicated neurons).
  • Late layers show intermediate sparsity.
  • Larger models increasingly localize features in fewer neurons, but scaling behavior varies by feature class.
  • Three textures of scaling: emergence of $k=1$ features, neuron splitting, and features stable with scale (Gurnee et al., 2023).
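
The sparsity sweep behind these findings can be illustrated on synthetic data: select the top-$k$ neurons by mean difference, fit a simple linear probe, and record held-out F1 as $k$ shrinks. The least-squares probe and the planted single dedicated neuron below are hypothetical stand-ins for the paper's setup:

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Binary F1 from true/predicted 0-1 labels."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def probe_f1(A_tr, z_tr, A_te, z_te, k):
    """Mean-difference top-k selection, then a least-squares linear probe."""
    diff = A_tr[z_tr == 1].mean(axis=0) - A_tr[z_tr == 0].mean(axis=0)
    idx = np.argsort(-np.abs(diff))[:k]
    X = np.c_[A_tr[:, idx], np.ones(len(z_tr))]         # add intercept column
    w = np.linalg.lstsq(X, 2.0 * z_tr - 1.0, rcond=None)[0]  # targets +/-1
    X_te = np.c_[A_te[:, idx], np.ones(len(z_te))]
    return f1_score(z_te, (X_te @ w > 0).astype(int))

rng = np.random.default_rng(1)
n, d = 600, 64
z = rng.integers(0, 2, n)
A = rng.normal(size=(n, d))
A[:, 5] += 3.0 * z                      # one strongly dedicated neuron
A_tr, z_tr, A_te, z_te = A[:400], z[:400], A[400:], z[400:]
for k in (16, 4, 1):                    # sweep k downward
    print(f"k={k:2d}  F1={probe_f1(A_tr, z_tr, A_te, z_te, k):.3f}")
```

When a single dedicated neuron carries the feature, F1 stays high all the way down to $k=1$, mirroring the middle-layer behavior in the table above; superposed features would instead degrade as $k$ shrinks.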

7. Algorithmic Implications and Interpretability Best Practices

Sparse linear probing quantifies localization and superposition in LLM representations. Key practices:

  • Searching for both monosemantic ($k=1$) and polysemantic ($k>1$) neurons by sweeping $k$ (typically $1$–$16$) with MMD or AT.
  • Ablation analysis, activation distributions, and output-logit influence validate neuron function.
  • Early layers with large weight norms and negative biases characterize $n$-gram superposition.
  • Middle layers' dedicated neurons enable mechanistic interpretability.
  • Sparse linear probes at $k=1$ achieve mean F1 $\approx 0.83$ on held-out data for multiple tasks, with MMD matching complex selection methods within $1\%$ F1.

A plausible implication is that sparsity in representation can be both a marker of model scale and of feature uniqueness, offering a quantitative tool for interpretability and diagnostic analysis of high-dimensional neural systems (Gurnee et al., 2023).

8. Conclusion: The Significance of Sparsity in Linear Probing Paradigms

In both algorithmic and neural contexts, sparse linear probing provides rigorous statistical and practical insights. In hashing, the transition from CLT to Weibull deviations delineates operational risk and informs load-factor selection. In LLMs, $k$-sparse probes expose the mechanics of distributed, superposed, and monosemantic representations, scaling patterns, and the function of individual neurons. Both domains exemplify the interplay of sparsity constraints, tail-exponent behavior, and the concentration of functional or computational cost.
