Sparse Linear Probing: Hashing and LLM Insights
- Sparse Linear Probing refers both to linear probing in hash tables under low load and to the use of $k$-sparse linear classifiers for neuron analysis in LLMs.
- In hashing, it rigorously characterizes displacement through block models, yielding Gaussian moderate deviations and Weibull-like large deviation tails.
- In large language models, it enables interpretable feature selection by isolating minimal neuron subsets, revealing scaling behaviors and representation localization.
Sparse Linear Probing denotes both an algorithmic regime in classic hashing with linear probing under low load (the "sparse table" setting) and a methodology for interpretable neuron analysis in LLMs via $k$-sparse linear classifiers. Both areas exhibit rigorous statistical and optimization properties with implications for performance tail bounds and representational structure.
1. Sparse Linear Probing in Hashing: Model and Deviation Framework
Consider a hash table of $n$ slots, into which $m$ keys are inserted sequentially. Each key $i$ is hashed independently and uniformly to a home slot $h_i \in \{0, \dots, n-1\}$. Under linear probing, a key first attempts its home slot, advancing cyclically until an empty slot is found. The displacement $d_i$ of key $i$ is the number of extra probes needed. The primary object of study is the total displacement,
$$D_{n,m} = \sum_{i=1}^{m} d_i.$$
The sparse regime corresponds to a load factor $\alpha = m/n$ bounded strictly away from 1: $\alpha \le \alpha_0 < 1$. An equivalent block/urn model partitions the table at its empty slots into blocks, each consisting of a contiguous run of occupied slots ending in an empty slot. Each block has a (random) size $B_j$ and contributes displacement $d(B_j)$. Block sizes are i.i.d.\ with exponentially decaying tails. Conditionally on its size, a block's displacement is distributed as the "full-table" displacement for the items occupying its slots. Thus,
$$D_{n,m} = \sum_{j} d(B_j).$$
This formulation enables sharp large- and moderate-deviation analysis in the sparse regime (Klein et al., 2016).
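The insertion dynamics above can be sketched in a few lines of Python; `linear_probe_insert` and `total_displacement` are illustrative names, not from the source.

```python
import numpy as np

def linear_probe_insert(homes, n):
    """Insert keys with the given home slots into a table of n slots
    via linear probing; return each key's displacement d_i."""
    occupied = [False] * n
    displacements = []
    for h in homes:
        d = 0
        while occupied[(h + d) % n]:  # probe cyclically until empty
            d += 1
        occupied[(h + d) % n] = True
        displacements.append(d)
    return displacements

def total_displacement(n, m, rng):
    """Total displacement D_{n,m} for m uniform random keys in n slots."""
    homes = rng.integers(0, n, size=m)
    return sum(linear_probe_insert(homes, n))

# Deterministic trace: homes 0, 0, 1, 5 in an 8-slot table.
# key 0 lands at slot 0 (d=0); key 1 probes 0 then 1 (d=1);
# key 2 probes 1 then 2 (d=1); key 3 lands at 5 (d=0).
disps = linear_probe_insert([0, 0, 1, 5], 8)
D = total_displacement(1024, 512, np.random.default_rng(0))
```

The block structure of the model is visible here: each insertion only interacts with the contiguous occupied run containing its home slot.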
2. Statistical Deviation Results in Sparse Linear Probing
The central results are rigorous moderate and large deviation bounds for $D_{n,m}$.
- Moderate Deviations (Gaussian Scale): for deviations $t$ up to the moderate-deviation scale,
$$\mathbb{P}\bigl(|D_{n,m} - \mathbb{E}D_{n,m}| \ge t\bigr) = \exp\bigl(-\Theta(t^2/n)\bigr),$$
so fluctuations of $D_{n,m}$ on the scale $\sqrt{n}$ exhibit Gaussian-type decay.
- Intermediate Deviations (Crossover): for $t$ in an intermediate range, the tail takes the form $\exp(-I_n(t))$, where the rate $I_n(t)$ interpolates between the quadratic Gaussian form (for smaller $t$) and a non-quadratic root-finding form (for larger $t$).
- Large Deviations (Weibull-Tailed): for $t$ beyond the crossover scale,
$$\mathbb{P}\bigl(D_{n,m} - \mathbb{E}D_{n,m} \ge t\bigr) = \exp\bigl(-\Theta(\sqrt{t})\bigr).$$
Such Weibull-like decay (tail exponent $1/2$) is driven by rare, single large clusters dominating $D_{n,m}$ (Klein et al., 2016).
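As a sanity check on the typical (non-deviation) behavior around which these tails are measured, a seeded Monte Carlo sketch can compare the empirical mean displacement per key at $\alpha = 0.5$ against the classical average-case value $\alpha/(2(1-\alpha))$ from Knuth's analysis of linear probing. All names here are illustrative.

```python
import numpy as np

ALPHA = 0.5  # load factor m/n, bounded strictly away from 1 (sparse regime)

def total_displacement(n, m, rng):
    """Total displacement of m uniform keys inserted into n slots."""
    occupied = np.zeros(n, dtype=bool)
    total = 0
    for h in rng.integers(0, n, size=m):
        d = 0
        while occupied[(h + d) % n]:
            d += 1
        occupied[(h + d) % n] = True
        total += d
    return total

rng = np.random.default_rng(0)
n, m, trials = 2048, 1024, 20
mean_disp = np.mean([total_displacement(n, m, rng) / m for _ in range(trials)])
# Knuth's classical average: E[displacement per key] ~ alpha/(2(1-alpha)) = 0.5
```

Typical runs concentrate near this mean with $\Theta(\sqrt{n})$ fluctuations; the deviation bounds above describe the exponentially rarer excursions beyond that scale.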
3. Methods: Conditioned Sums and Heavy-Tailed Deviations
The analysis exploits Janson's conditioned-sum representation: $D_{n,m}$ can be studied as a sum of i.i.d.\ block contributions conditioned on the total number of slots the blocks cover, reducing the problem to deviations of sums of i.i.d.\ heavy-tailed random variables [Nagaev, Wu]. Key points:
- The block sizes $B_j$ have exponentially decaying tails, with rate determined by the load factor $\alpha$.
- Given its size, a block's displacement inherits a large-deviation rate from the "full-table" case with the corresponding number of items.
- The joint tail yields optimal exponents via a minimization problem connecting block size and displacement.
- For large deviations, the cost concentrates on a single block: a block of length $\Theta(\sqrt{t})$ suffices to carry displacement $\Theta(t)$, and its probability cost is $\exp(-\Theta(\sqrt{t}))$, yielding the tail exponent $1/2$.
- For moderate deviations (up to the Gaussian scale), the collective effect of many blocks dominates, leading to Gaussian exponents via a CLT argument.
- The crossover regime blends the two mechanisms (collective fluctuations and single-block spikes).
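The single-block mechanism can be made concrete with a worst-case block: if $b$ keys all hash to the same home slot, the $i$-th key is displaced $i$ slots, so the block carries $0 + 1 + \cdots + (b-1) = b(b-1)/2 = \Theta(b^2)$ displacement. A minimal sketch (the function name is illustrative):

```python
def block_displacement(b):
    """Total displacement when b keys all hash to the same home slot:
    the i-th inserted key is displaced i slots, giving b(b-1)/2 in total."""
    return sum(range(b))

# A single block of length b carries Theta(b^2) displacement, so reaching a
# total displacement t needs a block of length only ~ sqrt(2t); with block
# sizes having exponential tails, the probability cost is exp(-Theta(sqrt(t))),
# which is the source of the Weibull tail exponent 1/2.
cost = block_displacement(10)
```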
4. Regime Comparison: Sparse vs Full-Load Linear Probing
Sparse and full-table regimes exhibit qualitatively distinct deviation behaviors:
| Regime | Fluctuation Scale | Typical Tail | Dominant Mechanism |
|---|---|---|---|
| Sparse ($\alpha < 1$) | $\Theta(\sqrt{n})$ | Gaussian: $\exp(-\Theta(t^2/n))$ | Many small blocks (CLT) |
| Sparse ($\alpha < 1$), large $t$ | beyond crossover | Weibull: $\exp(-\Theta(\sqrt{t}))$ | Single large block |
| Full ($\alpha = 1$) | $D_{n,n} = \Theta(n^{3/2})$, Airy-distributed | Airy/excursion-type | Empirical process LDP |
In the sparse load regime, clustering (long runs of occupied slots) is exponentially suppressed up to fluctuations, but for rare, very large deviations, "monster" clusters dominate. In contrast, the full-table case is governed by Airy process excursions and derived via Sanov or Legendre calculus (Klein et al., 2016).
5. Sparse Linear Probing in Representation Analysis of LLMs
Sparse linear probes are employed to interrogate the internal structure of LLMs. For MLP activations $x \in \mathbb{R}^d$, a $k$-sparse linear probe is a logistic-regression classifier
$$\hat{p}(y = 1 \mid x) = \sigma(w^\top x + b),$$
where $\sigma$ is the logistic sigmoid. The probe is trained subject to a cardinality constraint $\|w\|_0 \le k$ on the weight vector $w$ (Gurnee et al., 2023).
The selection process involves:
- Filtering top neurons by absolute class-mean difference.
- Sparse selection via Mean-Difference (MMD), ANOVA F-statistic (FS), Mutual Information (MI), L1-regularized logistic regression (LR), Optimal Sparse Probing (OSP, via cutting planes), or Adaptive Thresholding (AT).
- Retraining logistic regression restricted to selected neurons.
- Sweeping $k$ from large (e.g., $256$) down to $1$ to chart the sparsity–performance trade-off.
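The selection-then-retrain steps above can be sketched in a minimal numpy-only pipeline, assuming mean-difference ranking and plain gradient-descent logistic regression; all function names and the synthetic "activations" are hypothetical, not from Gurnee et al.

```python
import numpy as np

def mean_difference_scores(X, y):
    """Rank neurons by absolute difference of class-conditional means."""
    return np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))

def fit_logistic(X, y, lr=0.1, steps=500):
    """Plain logistic regression via full-batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def k_sparse_probe(X, y, k):
    """Select top-k neurons by mean difference, then retrain on them."""
    idx = np.argsort(mean_difference_scores(X, y))[::-1][:k]
    w, b = fit_logistic(X[:, idx], y)
    return idx, w, b

# Synthetic activations: 2 of 100 "neurons" carry the binary feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 100))
y = rng.integers(0, 2, size=400)
X[:, 3] += 2.0 * y   # hypothetical signal neurons
X[:, 7] -= 2.0 * y
idx, w, b = k_sparse_probe(X, y, k=2)
accuracy = ((X[:, idx] @ w + b > 0).astype(int) == y).mean()
```

The two-stage structure (cheap ranking, then retraining on the selected subset) is what keeps the sweep over $k$ tractable at LLM scale.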
6. Empirical Findings: Layer-Wise Sparsity and Scaling in LLMs
Experiments analyze over 100 binary features across 10 categories and 7 Pythia models (70M–6.9B parameters), probing the minimal $k$ required to reach a target F1 score.
| Feature Type | Early Layers ($k$) | Middle Layers ($k$) | Late Layers ($k$) |
|---|---|---|---|
| POS | 7 | 1 | 3 |
| Dependencies | 8 | 2 | 4 |
| Morphology | 6 | 1 | 3 |
| Code Language | 10 | 1 | 5 |
| Nat. Language | 12 | 1 | 6 |
| Data Subset | 9 | 1 | 4 |
| Text Features | 5 | 1 | 2 |
| Compound Words | 15 | 3 | 7 |
| LaTeX | 8 | 2 | 5 |
| Wikidata Facts | 12 | 1 | 6 |
| All Features | 9 | 1 | 4 |
- Early layers require higher $k$ (feature superposition).
- Middle layers often admit $k = 1$ (dedicated neurons).
- Late layers show intermediate sparsity.
- Larger models increasingly localize features in fewer neurons, but scaling behavior varies by feature class.
- Three scaling patterns are observed: emergence of new features, neuron splitting, and features that remain stable with scale (Gurnee et al., 2023).
7. Algorithmic Implications and Interpretability Best Practices
Sparse linear probing quantifies localization and superposition in LLM representations. Key practices:
- Searching for both monosemantic ($k = 1$) and polysemantic ($k > 1$) neurons by sweeping $k$ (typically $1$–$16$) with MMD or AT.
- Ablation analysis, activation distributions, and output-logit influence validate neuron function.
- Early-layer neurons with large weight norms and negative biases are characteristic of $n$-gram superposition.
- Middle layers' dedicated neurons enable mechanistic interpretability.
- Sparse linear probes at small $k$ achieve high mean F1 on held-out data across multiple tasks, with MMD matching more complex selection methods to within a small margin of F1.
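The $k$-sweep practice above can be sketched on synthetic data, using mean-difference selection with a nearest-class-mean classifier as a lightweight stand-in for a retrained probe; all names and data are illustrative, and the F1 values are not those of the paper.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Binary F1 computed from scratch."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def probe_f1_at_k(X, y, k):
    """Mean-difference selection of k neurons, then a nearest-class-mean
    classifier on the selected subspace (stand-in for retrained logistic
    regression)."""
    scores = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    idx = np.argsort(scores)[::-1][:k]
    mu1 = X[y == 1][:, idx].mean(axis=0)
    mu0 = X[y == 0][:, idx].mean(axis=0)
    d1 = ((X[:, idx] - mu1) ** 2).sum(axis=1)
    d0 = ((X[:, idx] - mu0) ** 2).sum(axis=1)
    return f1_score(y, (d1 < d0).astype(int))

# One strongly "monosemantic" neuron: k = 1 should already suffice.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 64))
y = rng.integers(0, 2, size=500)
X[:, 10] += 3.0 * y  # hypothetical dedicated neuron
sweep = {k: probe_f1_at_k(X, y, k) for k in (1, 2, 4, 8, 16)}
```

When the feature is carried by a single dedicated neuron, F1 is already high at $k = 1$; a slow climb with $k$ instead signals superposition.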
A plausible implication is that sparsity in representation can be both a marker of model scale and of feature uniqueness, offering a quantitative tool for interpretability and diagnostic analysis of high-dimensional neural systems (Gurnee et al., 2023).
8. Conclusion: The Significance of Sparsity in Linear Probing Paradigms
In both algorithmic and neural contexts, sparse linear probing provides rigorous statistical and practical insights. In hashing, the transition from CLT to Weibull deviations delineates operational risk and informs load factor selection. In LLMs, $k$-sparse probes expose the mechanics of distributed, superposed, and monosemantic representations, scaling patterns, and the function of individual neurons. Both domains exemplify the interplay of sparsity constraints, tail exponent behavior, and the concentration of functional or computational cost.