Fisher-Aligned Subspace Compression
- Fisher-Aligned Subspace Compression (FASC) is a method for reducing high-dimensional data by projecting onto subspaces that retain key Fisher information for inference and learning.
- It employs both linear and non-linear techniques, including eigen-decomposition and score-based methods, to identify parameter-sensitive directions while minimizing information loss.
- FASC has broad applications from cosmic shear analysis to large language model optimization, achieving significant dimensionality reduction with minimal compromise on performance.
Fisher-Aligned Subspace Compression (FASC) is a family of methodologies for compressing high-dimensional data or model activations by projecting onto subspaces that optimally preserve Fisher information with respect to a target statistical or learning task. FASC leverages the Fisher Information Matrix (FIM) to identify parameter- or loss-sensitive directions, yielding compressions that retain sufficiency for inference or knowledge-critical computations even at aggressive dimensionality reduction rates. This concept unifies approaches across statistical data analysis, machine learning, and recently, LLM architecture compression, enabling efficient and information-theoretically justified dimension reduction.
1. Foundational Definition and Theoretical Properties
FASC constructs a linear or nonlinear map from the full observation space $\mathbb{R}^N$ to a lower-dimensional representation $\mathbb{R}^n$ (typically $n \ll N$), such that the reduced statistics retain the Fisher information relevant for parameter inference or loss minimization at a specified operating point. Formally, given an observation $d$, model parameters $\theta$, and a likelihood $\mathcal{L}(d \mid \theta)$, the Fisher-aligned summary is defined via the score function:

$$t = \nabla_\theta \ln \mathcal{L}(d \mid \theta)\big|_{\theta_*}.$$

At a fiducial $\theta_*$, the Fisher information is $F_{\alpha\beta} = -\big\langle \partial_\alpha \partial_\beta \ln \mathcal{L} \big\rangle$. Compression to the score function is optimal in the information-theoretic sense: no other $n$-dimensional function of the data achieves lower variance in estimating $\theta$, and all Fisher information available in the data is preserved under regularity conditions (interchangeability of derivative and expectation, differentiability of $\ln \mathcal{L}$) (Alsing et al., 2017). The same underlying principle governs FASC for likelihood-based tasks, empirical loss minimization, and discriminant subspace construction in unsupervised or supervised learning.
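As a concrete illustration, the following sketch computes score-function summaries and the Fisher matrix for a toy Gaussian linear model (a minimal sketch, not the cited authors' code; the function and variable names are illustrative):

```python
import numpy as np

def score_summaries(d, mu_fid, jac, cov):
    """Score-function compression for a Gaussian likelihood with
    parameter-dependent mean: t = J^T C^{-1} (d - mu_*), which is
    n-dimensional (one summary per parameter) and preserves the
    full Fisher information F = J^T C^{-1} J."""
    cinv = np.linalg.inv(cov)
    t = jac.T @ cinv @ (d - mu_fid)   # (n,) Fisher-aligned summaries
    fisher = jac.T @ cinv @ jac       # (n, n) Fisher matrix
    return t, fisher

# Toy example: N = 100 data points, n = 2 parameters.
rng = np.random.default_rng(0)
N, n = 100, 2
jac = rng.normal(size=(N, n))    # Jacobian d mu / d theta at the fiducial point
cov = np.eye(N)                  # data covariance
mu_fid = np.zeros(N)             # fiducial mean
d = rng.multivariate_normal(mu_fid, cov)
t, F = score_summaries(d, mu_fid, jac, cov)
print(t.shape, F.shape)          # (2,) (2, 2)
```

In this linear-Gaussian setting, the $n$ summaries carry the same Fisher information about $\theta$ as the full $N$-dimensional data vector.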
2. Mathematical Formalism and Algorithmic Derivation
FASC for continuous data vectors under local-linear (first-order) approximations has well-developed theory:
- Linearization: For observable mean $\mu(\theta)$ near a fiducial $\theta_*$, $\mu(\theta) \approx \mu_* + J\,(\theta - \theta_*)$, with Jacobian $J_{i\alpha} = \partial \mu_i / \partial \theta_\alpha$.
- Fisher matrix: $F = J^\top C^{-1} J$, using data covariance $C$.
- Eigen-decomposition: Find eigenvectors $v_k$ of $F$, and compute compression vectors $b_k = C^{-1} J v_k$.
- Compressed statistics: $y_k = b_k^\top (d - \mu_*)$. In the eigenbasis of $F$, the Fisher matrix is diagonal, with entries given by the eigenvalues $\lambda_k$. Retaining the top $p$ modes (those with the largest eigenvalues) maximizes information retention (Asgari et al., 2014); a minimal implementation sketch follows this list.
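A minimal sketch of this linear pipeline, assuming the same Gaussian linear model as above (function names and the `n_modes` parameter are illustrative, not taken from the cited papers):

```python
import numpy as np

def fisher_aligned_linear_compression(jac, cov, n_modes):
    """Linear FASC: eigendecompose F = J^T C^{-1} J and build
    compression vectors b_k = C^{-1} J v_k for the top n_modes."""
    cinv = np.linalg.inv(cov)
    fisher = jac.T @ cinv @ jac
    evals, evecs = np.linalg.eigh(fisher)        # ascending eigenvalues
    order = np.argsort(evals)[::-1][:n_modes]    # largest eigenvalues first
    v = evecs[:, order]                          # (n_params, n_modes)
    b = cinv @ jac @ v                           # (N, n_modes) compression vectors
    return b, evals[order]

def compress(d, mu_fid, b):
    """Project mean-subtracted data onto the retained Fisher modes."""
    return b.T @ (d - mu_fid)                    # (n_modes,) compressed statistics
```

A trace-based measure of retained information is the ratio of the kept eigenvalues to their total sum, which quantifies how much Fisher information survives the truncation.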
In non-Gaussian or nonlinear settings (e.g., LLM activations), FASC relies on higher-order Taylor expansions of the loss or log-likelihood: $\mathcal{L}(\theta) \approx \mathcal{L}(\theta_*) + g^\top \Delta\theta + \tfrac{1}{2}\,\Delta\theta^\top H\,\Delta\theta$, with the Hessian $H$ replaced by the empirical Fisher $\hat{F} = \frac{1}{n}\sum_{i=1}^{n} g_i g_i^\top$, where $g_i = \nabla \ell_i(\theta_*)$ are per-example gradients. The optimal $r$-dimensional projection $U$ then solves the generalized eigenproblem:

$$\big(\Sigma_{xg}\,\Sigma_{gg}^{-1}\,\Sigma_{xg}^\top\big)\, u = \lambda\,\Sigma_{xx}\,u,$$

where $\Sigma_{xx}$, $\Sigma_{gg}$, $\Sigma_{xg}$ are activation, gradient, and cross-covariance matrices, respectively (Shihab et al., 12 Jan 2026).
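One consistent reading of this eigenproblem can be sketched with `scipy` as follows; the covariance estimators and the ridge term `eps` are assumptions for numerical stability, not details from the paper:

```python
import numpy as np
from scipy.linalg import eigh

def fisher_aligned_subspace(X, G, rank, eps=1e-6):
    """Top-`rank` solutions of (Sxg Sgg^{-1} Sxg^T) u = lam Sxx u.
    X: (n_samples, d) calibration activations; G: (n_samples, d)
    per-example gradients w.r.t. those activations."""
    Xc = X - X.mean(axis=0)
    Gc = G - G.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / n + eps * np.eye(X.shape[1])   # activation covariance
    Sgg = Gc.T @ Gc / n + eps * np.eye(G.shape[1])   # gradient covariance
    Sxg = Xc.T @ Gc / n                              # cross-covariance
    A = Sxg @ np.linalg.solve(Sgg, Sxg.T)            # Sxg Sgg^{-1} Sgx
    evals, evecs = eigh(A, Sxx)                      # generalized symmetric problem
    return evecs[:, ::-1][:, :rank]                  # (d, rank), top modes first
```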
3. Connections to Information Geometry and Optimal Compression
FASC generalizes and unifies well-known data compression paradigms:
- Linear case (parameter-dependent mean $\mu(\theta)$, fixed covariance $C$): Heavens–Tegmark "MOPED" linear compression; optimal for signals with parameter-dependent mean.
- Quadratic case (fixed $\mu$, parameter-dependent $C(\theta)$): Karhunen–Loève transform/power spectrum estimation; optimal for parameter-sensitive variance.
- Nonlinear expansions: Sufficient statistics include derivatives of arbitrary order, recoverable by score-based regression or likelihood-ratio estimation (Alsing et al., 2017).
These constructions frame FASC as a practical realization of the information geometry principle—projecting onto the subspace in which the FIM is maximally preserved.
4. Applications: From Astronomy to LLMs
Early deployments of FASC targeted astronomical and cosmological data reduction:
- Cosmic shear analysis: Compress shear two-point COSEBI statistics with over 360 elements down to 7–35 sufficient compressed summaries (matching the number of model parameters), recovering nearly all of the full data vector's Fisher figure-of-merit with only negligible degradation of parameter errors (Asgari et al., 2014).
- Generalized optimal compression: Alsing & Wandelt demonstrated that, in principle, any $N$-dimensional data vector can be compressed to $n$ summaries (for $n$ parameters of interest) without Fisher information loss (Alsing et al., 2017).
More recently, FASC has been adapted to LLM post-training compression, addressing knowledge retention in resource-constrained deployments:
- Activation compression: Standard SVD retains high-variance activation modes, which often miss knowledge-bearing subspaces aligned with the gradient of the model loss. FASC instead selects subspaces by minimizing a second-order surrogate of the loss, aligning subspace selection with directions of high loss curvature (the empirical Fisher metric) and retaining factual-knowledge performance at aggressive rank reductions (Shihab et al., 12 Jan 2026); the sketch below illustrates the contrast.
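To make the contrast concrete, the following sketch compares variance-based SVD with a curvature-weighted variant using a diagonal empirical Fisher approximation; the diagonal weighting is a deliberate simplification of the generalized eigenproblem in Section 2, not the paper's exact procedure:

```python
import numpy as np

def svd_subspace(X, rank):
    """Variance-based baseline: top right-singular subspace of activations."""
    _, _, vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return vt[:rank].T                            # (d, rank)

def fisher_weighted_subspace(X, G, rank, eps=1e-8):
    """Curvature-aware variant: rescale each coordinate by the square root
    of its diagonal empirical Fisher value (mean squared gradient) so the
    SVD favors high-loss-curvature directions, not just high-variance ones."""
    w = np.sqrt((G ** 2).mean(axis=0) + eps)      # sqrt of diagonal Fisher
    _, _, vt = np.linalg.svd((X - X.mean(axis=0)) * w, full_matrices=False)
    return (vt[:rank] * w).T                      # (d, rank) projection directions
```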
5. Diagnostic Metrics and Layer-Wise Subspace Selection
FASC has prompted the development of new diagnostic metrics:
- Dependence Violation Score (DVS): Quantifies the cross-covariance between layer activations and gradients. Layers with high DVS display strong activation–gradient coupling and benefit most from Fisher-aligned compression; in practice, a fixed DVS threshold serves as an effective empirical criterion for applying FASC instead of SVD in transformer models (Shihab et al., 12 Jan 2026).
- Subspace divergence: Principal-angle analysis demonstrates that the subspaces selected by FASC and SVD diverge only in high-DVS layers, substantiating the loss-sensitivity criterion.
The algorithmic workflow includes calibration-data collection, computation of empirical covariances, gating by DVS, and random-projection-based acceleration for scalability in wide layers; both diagnostics are sketched below.
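A sketch of these diagnostics, assuming a Frobenius-norm normalization for DVS (the paper's exact normalization may differ) and using `scipy.linalg.subspace_angles` for the principal-angle analysis:

```python
import numpy as np
from scipy.linalg import subspace_angles

def dependence_violation_score(X, G):
    """DVS as normalized activation-gradient cross-covariance:
    ||Sxg||_F / sqrt(||Sxx||_F * ||Sgg||_F). An assumed normalization;
    the paper's exact definition may differ."""
    Xc = X - X.mean(axis=0)
    Gc = G - G.mean(axis=0)
    n = X.shape[0]
    sxg = np.linalg.norm(Xc.T @ Gc / n)           # Frobenius norms
    sxx = np.linalg.norm(Xc.T @ Xc / n)
    sgg = np.linalg.norm(Gc.T @ Gc / n)
    return sxg / np.sqrt(sxx * sgg)

def subspace_divergence(U_fasc, U_svd):
    """Largest principal angle (radians) between two column subspaces;
    values near zero mean FASC and SVD select essentially the same space."""
    return subspace_angles(U_fasc, U_svd).max()
```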
6. Limitations, Failure Modes, and Practical Recommendations
Limitations of FASC are rooted in its reliance on local approximations and regularity:
- Taylor-expansion validity: If the statistical or loss landscape is strongly non-Gaussian or parameter covariances vary sharply, first-order FASC can be suboptimal and higher-order expansions may be necessary (Asgari et al., 2014, Alsing et al., 2017).
- Covariance estimation: In high-dimensional inference, inverting an ill-conditioned $N \times N$ data covariance requires a number of mock realizations well in excess of $N$ for stability. FASC mitigates this by compressing to $n \ll N$ summaries, but inherits errors if the covariance is poorly estimated.
- Domain shift and layer selection: In LLMs, early or late layers with noise-like activations may yield misleading DVS values; DVS gating and hybrid pipelines with standard SVD are recommended (Shihab et al., 12 Jan 2026).
Robustness checks should perturb the fiducial point and covariance, and iterative FASC may be deployed to refine around the MAP or MLE.
7. Empirical Performance and Outlook
Empirical evidence substantiates FASC's efficacy:
- In cosmic shear, FASC achieves a reduction in data dimension by an order of magnitude, with negligible loss in parameter constraints (Asgari et al., 2014).
- For LLMs, at matched rank reduction FASC yields 6–8 percentage points higher accuracy than SVD-based compression on knowledge-heavy tasks (MMLU, LAMA), matching the factual recall of uncompressed models nearly twice the size, with minimal added computational overhead (Shihab et al., 12 Jan 2026).
- Cross-architecture robustness is demonstrated, and the DVS metric provides a fundamental diagnostic of knowledge storage and compression utility.
Future research directions include hybridization with quantization and pruning, domain-adaptive calibration, nonlinear extension to address highly non-Gaussian targets, and expert-specific Fisher compression in mixture-of-expert architectures (Shihab et al., 12 Jan 2026).
| Domain/Application | Data Dim. | Compression Factor | Fisher/Performance Loss | Citation |
|---|---|---|---|---|
| Cosmic shear (COSEBIs) | >360 | ~10–50× (to 7–35 summaries) | Negligible | (Asgari et al., 2014) |
| LLM activations (Mistral-7B) | ~4K per layer | ~2× (aggressive rank reduction) | Minimal; 6–8 pp above SVD | (Shihab et al., 12 Jan 2026) |
FASC thus provides a principled framework for subspace selection that maximally retains the information content pertinent to the problem of interest, grounding dimensionality reduction in the geometry of parameter sensitivity across scientific and machine learning applications.