Fisher-Aligned Subspace Compression

Updated 19 January 2026
  • Fisher-Aligned Subspace Compression (FASC) is a method for reducing high-dimensional data by projecting onto subspaces that retain key Fisher information for inference and learning.
  • It employs both linear and non-linear techniques, including eigen-decomposition and score-based methods, to identify parameter-sensitive directions while minimizing information loss.
  • FASC has broad applications from cosmic shear analysis to large language model optimization, achieving significant dimensionality reduction with minimal compromise on performance.

Fisher-Aligned Subspace Compression (FASC) is a family of methodologies for compressing high-dimensional data or model activations by projecting onto subspaces that optimally preserve Fisher information with respect to a target statistical or learning task. FASC leverages the Fisher Information Matrix (FIM) to identify parameter- or loss-sensitive directions, yielding compressions that retain sufficiency for inference or knowledge-critical computations even at aggressive dimensionality reduction rates. This concept unifies approaches across statistical data analysis, machine learning, and recently, LLM architecture compression, enabling efficient and information-theoretically justified dimension reduction.

1. Foundational Definition and Theoretical Properties

FASC constructs a linear or nonlinear map from the full observation space $\mathbb{R}^N$ to a lower-dimensional representation $\mathbb{R}^P$ (typically $P \ll N$), such that the reduced statistics retain the Fisher information relevant for parameter inference or loss minimization at a specified operating point. Formally, given an observation $x \in \mathbb{R}^N$, model parameters $\theta \in \mathbb{R}^P$, and a likelihood $p(x|\theta)$, the Fisher-aligned summary is defined via the score function

$$S(x,\theta) \equiv \nabla_\theta \ln p(x|\theta).$$

At a fiducial $\theta_*$, the Fisher information is $F(\theta_*) = \mathrm{Cov}_{x|\theta_*}[S(x,\theta_*)]$. Compression to the score function is optimal in the information-theoretic sense: no other $P$-dimensional statistic $t(x)$ achieves lower variance in estimating $\theta$, and all Fisher information available in the data is preserved under regularity conditions (interchangeability of derivative and expectation, differentiability of $\ln p(x|\theta)$) (Alsing et al., 2017). The same underlying principle governs FASC for likelihood-based tasks, empirical loss minimization, and discriminant subspace construction in unsupervised or supervised learning.
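For the canonical Gaussian case with a parameter-dependent mean, the score compression above reduces to a single matrix expression. The following is a minimal NumPy sketch, assuming a precomputed fiducial mean `mu_fid`, mean Jacobian `M`, and data covariance `C` (names are illustrative, not taken from the cited papers):

```python
import numpy as np

def score_summary(x, mu_fid, M, C):
    """Score compression t(x) = grad_theta ln p(x|theta) at the fiducial point.

    For a Gaussian likelihood with parameter-dependent mean mu(theta) and
    fixed covariance C, the score is M^T C^{-1} (x - mu_*): an N -> P map
    that preserves all Fisher information about theta.
    """
    return M.T @ np.linalg.solve(C, x - mu_fid)
```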

2. Mathematical Formalism and Algorithmic Derivation

FASC for continuous data vectors under local-linear (first-order) approximations has well-developed theory:

  • Linearization: for an observable mean $\mu(\theta)$ near the fiducial $\theta_*$, $\mu(\theta) \approx \mu(\theta_*) + M\,(\theta - \theta_*)$, with Jacobian $M = \partial \mu / \partial \theta\,\big|_{\theta_*}$.
  • Fisher matrix: $F = M^\top C^{-1} M$, using the data covariance $C$.
  • Eigen-decomposition: find the eigenvectors $v_i$ (eigenvalues $\lambda_i$) of $F$, and compute compression vectors $b_i = C^{-1} M v_i$.
  • Compressed statistics: $t_i = b_i^\top x$. The Fisher matrix of the compressed statistics is diagonal in this eigenbasis, with entries $\lambda_i$; retaining the top $P$ modes maximizes information retention (Asgari et al., 2014). A code sketch follows this list.
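A minimal NumPy sketch of this four-step pipeline, assuming the Jacobian `M` and covariance `C` have already been estimated at the fiducial point (function and variable names are illustrative):

```python
import numpy as np

def fisher_compress(X, M, C, P=None):
    """Linear FASC: eigen-decompose the Fisher matrix and project the data.

    X : (n_samples, N) data vectors
    M : (N, p) Jacobian of the mean w.r.t. the parameters at the fiducial point
    C : (N, N) data covariance
    P : number of Fisher eigenmodes to retain (default: all p)
    """
    Cinv_M = np.linalg.solve(C, M)        # C^{-1} M, avoids an explicit inverse
    F = M.T @ Cinv_M                      # Fisher matrix F = M^T C^{-1} M
    lam, V = np.linalg.eigh(F)            # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]        # sort by retained information, descending
    B = Cinv_M @ V                        # compression vectors b_i = C^{-1} M v_i
    P = M.shape[1] if P is None else P
    return X @ B[:, :P], lam[:P]          # compressed summaries + retained info
```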

In non-Gaussian or nonlinear settings (e.g., LLM activations), FASC relies on a second-order (or higher) Taylor expansion of the loss or log-likelihood, $\mathcal{L}(\theta) \approx \mathcal{L}(\theta_*) + \tfrac{1}{2}\,\Delta\theta^\top H\,\Delta\theta$, with the Hessian $H$ replaced by the empirical Fisher $\hat{F} = \tfrac{1}{n}\sum_i g_i g_i^\top$ built from per-example gradients $g_i$. The optimal $P$-dimensional projection $U$ then solves a generalized eigenproblem of the form $\Sigma_g\, u = \lambda\, \Sigma_a\, u$, with correction terms in $\Sigma_{ag}$ when activations and gradients are coupled, where $\Sigma_a$, $\Sigma_g$, and $\Sigma_{ag}$ are the activation, gradient, and cross-covariance matrices, respectively (Shihab et al., 12 Jan 2026).
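A sketch of this curvature-aligned subspace selection, using the dominant $\Sigma_g\,u = \lambda\,\Sigma_a\,u$ form and ignoring the cross-covariance correction; the calibration arrays and names are assumptions, not the reference implementation:

```python
import numpy as np
from scipy.linalg import eigh

def fisher_aligned_subspace(A, G, P):
    """Directions with maximal loss curvature per unit activation energy.

    A : (n, d) calibration activations for one layer
    G : (n, d) per-example gradients w.r.t. those activations
    P : target subspace dimension
    """
    d = A.shape[1]
    Sigma_a = np.cov(A, rowvar=False) + 1e-6 * np.eye(d)  # regularize for stability
    Sigma_g = np.cov(G, rowvar=False)                     # empirical Fisher proxy
    lam, U = eigh(Sigma_g, Sigma_a)          # generalized symmetric eigenproblem
    return U[:, np.argsort(lam)[::-1][:P]]   # top-P Fisher-aligned directions
```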

3. Connections to Information Geometry and Optimal Compression

FASC generalizes and unifies well-known data compression paradigms:

  • Linear case (parameter-dependent mean $\mu(\theta)$, fixed covariance $C$): the MOPED linear compression of Heavens, Jaffe & Lahav; optimal for signals with a parameter-dependent mean. A sketch follows this list.
  • Quadratic case (fixed mean $\mu$, parameter-dependent covariance $C(\theta)$): Karhunen–Loève transform / power-spectrum estimation; optimal for parameter-sensitive variance.
  • Nonlinear expansions: sufficient statistics include derivatives of arbitrary order, recoverable by score-based regression or likelihood-ratio estimation (Alsing et al., 2017).
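For the linear case, the MOPED construction admits a compact implementation: one compression vector per parameter, Gram–Schmidt orthogonalized so that the compressed Fisher matrix is the identity. A sketch under the same assumed `M` and `C` as above:

```python
import numpy as np

def moped_vectors(M, C):
    """MOPED compression vectors for a parameter-dependent mean.

    M : (N, p) Jacobian of the mean; C : (N, N) data covariance.
    Returns B of shape (N, p), one orthogonalized vector per parameter,
    so that y = B^T x carries the full Fisher information of x.
    """
    Cinv_M = np.linalg.solve(C, M)
    B = np.zeros_like(M, dtype=float)
    for i in range(M.shape[1]):
        b = Cinv_M[:, i].copy()
        for j in range(i):                    # Gram-Schmidt against earlier vectors
            b -= (M[:, i] @ B[:, j]) * B[:, j]
        B[:, i] = b / np.sqrt(M[:, i] @ b)    # MOPED normalization
    return B
```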

These constructions frame FASC as a practical realization of the information geometry principle—projecting onto the subspace in which the FIM is maximally preserved.

4. Applications: From Astronomy to LLMs

Early deployments of FASC targeted astronomical and cosmological data reduction:

  • Cosmic shear analysis: compressing shear two-point COSEBI statistics with over 360 elements down to 7–35 sufficient compressed summaries (matching the number of model parameters), recovering essentially the full-data Fisher figure-of-merit and degrading parameter errors only negligibly (Asgari et al., 2014).
  • Generalized optimal compression: Alsing & Wandelt demonstrated that, in principle, any $N$-dimensional data vector can be compressed to $P$ summaries, one per parameter of interest, without loss of Fisher information (Alsing et al., 2017).

More recently, FASC has been adapted to LLM post-training compression, addressing knowledge retention in resource-constrained deployments:

  • Activation compression: Standard SVD retains high-variance activation modes, which often miss the knowledge-bearing subspaces aligned with the gradient of the model loss. FASC instead selects subspaces by minimizing a second-order surrogate of the loss, aligning subspace selection with directions of high loss curvature (the empirical Fisher metric) and retaining factual-knowledge performance at aggressive rank reductions (Shihab et al., 12 Jan 2026). A sketch contrasting the two selection rules follows.
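The contrast between the two selection rules can be made concrete: SVD keeps the top principal directions of the activations, while FASC keeps loss-curvature directions such as those returned by the `fisher_aligned_subspace` sketch of Section 2. The projection step below is a generic rank-$P$ reconstruction, not the paper's exact pipeline:

```python
import numpy as np

def project_activations(A, U_sub):
    """Rank-P reconstruction of activations A in the subspace spanned by U_sub."""
    Q, _ = np.linalg.qr(U_sub)        # orthonormalize the basis
    return A @ Q @ Q.T

# SVD baseline: top-P right singular vectors = highest-variance directions.
# _, _, Vt = np.linalg.svd(A - A.mean(0), full_matrices=False)
# A_svd = project_activations(A, Vt[:P].T)
#
# FASC: loss-curvature directions from the Section 2 sketch.
# A_fasc = project_activations(A, fisher_aligned_subspace(A, G, P))
```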

5. Diagnostic Metrics and Layer-Wise Subspace Selection

FASC has prompted the development of new diagnostic metrics:

  • Dependence Violation Score (DVS): quantifies the cross-covariance between layer activations and gradients. Layers with high DVS display strong activation–gradient coupling and benefit most from Fisher-aligned compression; in practice, an empirically calibrated DVS threshold governs where FASC is applied instead of SVD in transformer models (Shihab et al., 12 Jan 2026).
  • Subspace divergence: principal-angle analysis demonstrates that the subspaces selected by FASC and SVD diverge only in high-DVS layers, substantiating the loss-sensitivity criterion.

The algorithmic workflow includes calibration-data collection, computation of empirical covariances, gating by DVS (sketched below), and random-projection-based acceleration for scalability in wide layers.
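A sketch of the gating step, using one plausible normalization of DVS as a Frobenius-normalized cross-covariance; the exact definition and threshold in the cited paper may differ:

```python
import numpy as np

def dependence_violation_score(A, G):
    """Normalized activation-gradient cross-covariance (assumed DVS form)."""
    Ac, Gc = A - A.mean(0), G - G.mean(0)
    S_ag = Ac.T @ Gc / len(A)             # cross-covariance
    S_a = Ac.T @ Ac / len(A)              # activation covariance
    S_g = Gc.T @ Gc / len(A)              # gradient covariance
    return np.linalg.norm(S_ag) / np.sqrt(np.linalg.norm(S_a) * np.linalg.norm(S_g))

def choose_method(A, G, tau):
    """Gate each layer: FASC where coupling is strong, plain SVD elsewhere.

    tau is an empirically calibrated threshold, not a value from the paper.
    """
    return "FASC" if dependence_violation_score(A, G) > tau else "SVD"
```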

6. Limitations, Failure Modes, and Practical Recommendations

Limitations of FASC are rooted in its reliance on local approximations and regularity:

  • Taylor-expansion validity: if the statistical or loss landscape is strongly non-Gaussian, or parameter covariances vary sharply, first-order FASC can be suboptimal and higher-order expansions may be necessary (Asgari et al., 2014, Alsing et al., 2017).
  • Covariance estimation: in high-dimensional inference, inverting an ill-conditioned $N \times N$ covariance matrix requires a number of mock realizations well in excess of $N$ for stability. FASC mitigates this by compressing to $P \ll N$ summaries, but inherits errors if the covariance is poorly estimated.
  • Domain shift and layer selection: in LLMs, early or late layers with noise-like activations may yield misleading DVS estimates; DVS gating and hybrid pipelines combining FASC with standard SVD are recommended (Shihab et al., 12 Jan 2026).

Robustness checks should perturb the fiducial point and covariance, and iterative FASC may be deployed to refine around the MAP or MLE.
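One way to realize such an iterative loop is to recompute the compression at each new best-fit point so the local expansion stays valid. Below is a minimal sketch for a Gaussian likelihood with parameter-dependent mean; the finite-difference Jacobian and Gauss–Newton update are illustrative choices, not a prescription from the cited papers:

```python
import numpy as np

def iterative_fasc(X, C, mu_fn, theta0, n_iter=3, eps=1e-4):
    """Refine the fiducial point by re-compressing at each parameter estimate.

    X     : (n_samples, N) data vectors
    C     : (N, N) data covariance
    mu_fn : callable theta -> model mean in R^N (assumed differentiable)
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        # Finite-difference Jacobian of the mean at the current fiducial point
        M = np.stack([(mu_fn(theta + eps * e) - mu_fn(theta - eps * e)) / (2 * eps)
                      for e in np.eye(len(theta))], axis=1)
        B = np.linalg.solve(C, M)                 # score-compression vectors C^{-1} M
        F = M.T @ B                               # Fisher matrix at this fiducial
        score = B.T @ (X.mean(0) - mu_fn(theta))  # compressed mean discrepancy
        theta = theta + np.linalg.solve(F, score) # Gauss-Newton step toward the MLE
    return theta
```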

7. Empirical Performance and Outlook

Empirical evidence substantiates FASC's efficacy:

  • In cosmic shear, FASC achieves a reduction in data dimension by an order of magnitude, with negligible loss in parameter constraints (Asgari et al., 2014).
  • For LLMs, under aggressive rank reduction FASC yields 6–8 percentage points higher accuracy on knowledge-heavy tasks (MMLU, LAMA), matching the factual recall of uncompressed models nearly twice the size, with minimal computational overhead (Shihab et al., 12 Jan 2026).
  • Cross-architecture robustness has been demonstrated, and the DVS metric provides a fundamental diagnostic of knowledge storage and compression utility.

Future research directions include hybridization with quantization and pruning, domain-adaptive calibration, nonlinear extension to address highly non-Gaussian targets, and expert-specific Fisher compression in mixture-of-expert architectures (Shihab et al., 12 Jan 2026).


| Domain/Application | Data Dim. | Compression Factor | Fisher Loss | Citation |
|---|---|---|---|---|
| Cosmic shear (COSEBIs) | >360 | 10–51× (to 7–35 summaries) | negligible | (Asgari et al., 2014) |
| LLM activation (Mistral-7B) | ~4K per layer | aggressive rank reduction | minimal; +6–8 pp acc. vs. SVD | (Shihab et al., 12 Jan 2026) |

FASC thus provides a principled framework for subspace selection that maximally retains the information content pertinent to the problem of interest, grounding dimensionality reduction in the geometry of parameter sensitivity across scientific and machine learning applications.
