
Lifelong Learning via Sketched Regularization

Updated 24 January 2026
  • The paper introduces a sketched structural regularization method that compresses second-order information, reducing memory costs while preserving performance across tasks.
  • It employs linear sketching techniques like CountSketch and FJLT to approximate Hessian-based penalties, offering theoretical guarantees close to joint training performance.
  • Empirical results on benchmarks such as Permuted-MNIST and CIFAR-100 demonstrate improved accuracy and reduced forgetting compared to traditional diagonal regularization methods.

Lifelong learning with sketched structural regularization refers to a class of algorithms and architectures designed to mitigate catastrophic forgetting during continual learning by leveraging linear sketching techniques to regularize model parameter updates. These methods modify standard structural regularization—in which a quadratic penalty anchors parameters near previous task optima weighted by an importance or curvature matrix—by compressing the associated structural (often Hessian- or Fisher-based) information via sketching. The use of sketching enables nontrivial preservation of second-order structure at a fraction of the memory cost of explicit matrices, with both theoretical and empirical justification across linear models, neural networks, and modular lifelong systems.

1. Catastrophic Forgetting and Structural Regularization

Catastrophic forgetting is the phenomenon where neural networks and other parameterized models, when trained on new tasks, substantially degrade their performance on previously learned tasks. Structural regularization (SR) addresses this by introducing a quadratic penalty that anchors current parameters $\theta$ near previous solutions $\theta^*$, weighted by an importance matrix $\Omega$ that encodes parameter criticality for prior tasks. The general SR objective is

$$\min_\theta\; \mathbb{E}_{(x,y) \sim \mathcal{D}_t}[\ell_t(x,y;\theta)] + \frac{\lambda}{2} (\theta - \theta^*)^T \Omega (\theta - \theta^*),$$

where $\ell_t$ is the per-example loss for task $t$ and $\lambda$ controls regularization strength. Classical methods such as Elastic Weight Consolidation (EWC) instantiate $\Omega$ as the diagonal of the (empirical) Fisher information matrix; Memory-Aware Synapses (MAS) uses gradients of output norms (Li et al., 2021). However, diagonal or low-rank approximations can be too crude, causing suboptimal retention or still allowing forgetting in many scenarios (Heckel, 2021).
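As a concrete baseline, the diagonal instantiation of this penalty (as in EWC) is a few lines of NumPy. This is an illustrative sketch, not code from the cited papers; the function name and argument layout are our own.

```python
import numpy as np

def sr_penalty(theta, theta_star, omega_diag, lam=1.0):
    """EWC-style structural regularization with a diagonal importance
    matrix: (lam/2) * (theta - theta_star)^T diag(omega) (theta - theta_star).

    omega_diag holds the diagonal of Omega (e.g., an empirical Fisher
    diagonal), so storage is O(m) instead of O(m^2)."""
    d = theta - theta_star
    return 0.5 * lam * np.sum(omega_diag * d * d)
```

The diagonal keeps memory linear in the parameter count; sketching (next section) recovers off-diagonal structure at a still-modest cost.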

2. Linear Sketching in Structural Regularization

Linear sketching compresses second-order or gradient information while preserving its essential geometric properties. For a gradient matrix $W \in \mathbb{R}^{n \times m}$ (with rows being gradients over previous task data), a sparse sketching matrix $S \in \mathbb{R}^{t \times n}$ (e.g., CountSketch) reduces storage from $\mathcal{O}(m^2)$ (full $\Omega$) to $\mathcal{O}(tm)$, with $t \ll n$. The sketched SR penalty takes the form

$$\widetilde{R}(\theta) = \frac{1}{2n}\|SW(\theta - \theta^*)\|_2^2,$$

with error bounds ensuring that for $t = \mathcal{O}(r^2/\epsilon^2)$ (where $r$ is the stable rank of $W$), the quadratic form is preserved within a multiplicative factor $1 \pm \epsilon$ (Li et al., 2021). Randomized transforms such as the Subsampled Randomized Hadamard Transform or Fast Johnson–Lindenstrauss Transform (FJLT) can also be used to sketch full Jacobians in deep or wide networks (Heckel, 2021, Simpson et al., 4 Nov 2025).
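A minimal CountSketch of the gradient matrix, together with the sketched penalty above, can be written as follows. This is an illustrative sketch under the notation of this section (it is not the cited authors' implementation):

```python
import numpy as np

def countsketch(W, t, rng):
    """Apply a CountSketch S in R^{t x n} to W in R^{n x m}: each row of W
    is added, with a random sign, to one of t buckets.  Only the t x m
    result is stored, reducing memory from O(m^2) (full Omega) to O(t*m)."""
    n, m = W.shape
    h = rng.integers(0, t, size=n)       # random bucket per row
    s = rng.choice([-1.0, 1.0], size=n)  # random sign per row
    SW = np.zeros((t, m))
    np.add.at(SW, h, s[:, None] * W)     # unbuffered accumulation into buckets
    return SW

def sketched_penalty(SW, theta, theta_star, n):
    """Sketched SR penalty (1/2n) * ||S W (theta - theta_star)||^2."""
    d = theta - theta_star
    return 0.5 / n * np.sum((SW @ d) ** 2)
```

In expectation, $\|SW d\|_2^2$ equals $\|W d\|_2^2$, so the sketched penalty is an unbiased estimate of the full quadratic form.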

3. Algorithmic Procedures for Sketched Structural Regularization

The typical sketched SR pipeline consists of:

  • For each task, accumulate gradient/Jacobian information on the previous task data.
  • Compress this matrix using a sketching operator to yield a "sketched" importance matrix or Jacobian.
  • When learning a new task, regularize parameter deviation from prior solutions using the sketched quadratic penalty. In practice, the regularizer can be seamlessly incorporated into classical SR methods (such as EWC or MAS) by substituting the sketched Ω\Omega or gradient matrix, requiring minimal changes to the overall training protocol (Li et al., 2021).
  • For continual neural compression regimes, maintain buffers containing both full and sketched snapshots, and train using a loss with terms corresponding to both exact and sketched reconstruction error, weighted appropriately (Simpson et al., 4 Nov 2025).

This approach applies to both online/sequential settings and in situ learning scenarios, and sketches can be updated incrementally or stored per task, providing flexibility in the memory-performance trade-off (Heckel, 2021, Simpson et al., 4 Nov 2025).
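The pipeline above can be sketched end to end on a toy least-squares instance. This is a schematic illustration (hypothetical function and hyperparameters, not the published training protocol):

```python
import numpy as np

def train_task(X, y, theta0, SW_prev, theta_prev, lam=1.0, lr=0.01, steps=500):
    """One stage of the sketched-SR pipeline on a least-squares task:
    minimize the task loss plus the sketched quadratic penalty anchored
    at the previous task's solution.  SW_prev is the sketch of gradients
    from earlier tasks; pass None for the first task."""
    theta = theta0.copy()
    n = len(y)
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / n                 # task-loss gradient
        if SW_prev is not None:                          # sketched SR gradient
            d = theta - theta_prev
            grad += lam * SW_prev.T @ (SW_prev @ d) / SW_prev.shape[0]
        theta -= lr * grad
    return theta
```

After each task, the new gradients would be sketched (e.g., with CountSketch) and either stored per task or folded into a running sketch, realizing the memory-performance trade-off described above.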

4. Theoretical Guarantees and Trade-offs

Theoretical analysis establishes that sketched SR can match the joint-training optimum (training with all tasks accessible simultaneously) up to a controlled approximation error, with guarantees depending on sketch size, the rank structure of the gradient, Jacobian, or Hessian matrices, and the spectral decay of task Hessians:

  • In linear models, if the full Jacobian or Hessian is fully sketched ($s = n$), the joint optimum is exactly recovered (Heckel, 2021, Li et al., 5 Apr 2025).
  • For sketched Jacobians, trajectories under the sketched regularized loss remain $\mathcal{O}(\|J\|_F/\sqrt{s})$-close to the original full-regularizer path (Heckel, 2021).
  • In wide neural networks in the Neural Tangent Kernel (NTK) regime, risk after sketch-regularized training differs from joint training by at most $\mathcal{O}(1/(\sqrt{s}\alpha^2))$, where $\alpha$ is the minimum singular value of the NTK matrix for the older task.
  • Memory vs. statistics trade-off: to suppress forgetting to negligible levels, the memory budget (number of sketch rows) must scale with the effective rank or eigenvalue decay of the importance or Hessian matrix. For a power-law decay $\mu_i \sim i^{-\alpha}$, setting $m \gtrsim n^{1/\alpha}$ suffices to achieve near-joint risk (Li et al., 5 Apr 2025).
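The power-law memory rule is easy to evaluate numerically. The helper below is purely illustrative (constants in the bound are omitted, and the function name is our own):

```python
import math

def sketch_rows_needed(n, alpha):
    """Illustrative memory rule: with eigenvalue decay mu_i ~ i^(-alpha),
    roughly n^(1/alpha) sketch rows suffice for near-joint risk
    (absolute constants from the analysis are dropped)."""
    return math.ceil(n ** (1.0 / alpha))
```

For example, with $n = 10{,}000$ samples and decay exponent $\alpha = 2$, on the order of $100$ sketch rows already suffice, a 100-fold memory reduction relative to storing the full matrix.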

5. Empirical Results

Empirical evaluations consistently demonstrate that sketched SR matches or outperforms diagonalized SR (e.g., Vanilla EWC, MAS) in various regimes:

  • On Permuted-MNIST, CIFAR-100, and distribution-shift tasks, sketched SR or Jacobian-based regularization maintains higher accuracy and exhibits less forgetting than diagonal approximations (Li et al., 2021, Heckel, 2021).
  • In situ neural compression for scientific data: buffers of sketched snapshots—using either FJLT or subsampling—preserve reconstruction error and PSNR nearly at the level of offline training, even under aggressive memory constraints. Subsampled sketches are less stable than FJLT in unstructured geometries; increasing the sketch ratio improves accuracy smoothly (Simpson et al., 4 Nov 2025).
  • Modular architectures using hashes of sketches for routing exploit the structure to solve hierarchical or multi-component tasks not feasible for end-to-end learning, as shown in image intersection and multi-digit MNIST tasks (Deng et al., 2021).
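The FJLT-style sketches referenced above can be illustrated with a subsampled randomized Hadamard transform (SRHT), one common FJLT variant. This is a minimal sketch assuming the row count is a power of two, not the implementation from the cited work:

```python
import numpy as np

def srht_sketch(W, t, rng):
    """SRHT sketch of W in R^{n x m} (n must be a power of two):
    randomly flip row signs, apply a fast Walsh-Hadamard transform,
    then subsample t rows with rescaling so norms are preserved
    in expectation."""
    n, m = W.shape
    X = rng.choice([-1.0, 1.0], size=n)[:, None] * W  # random sign diagonal D
    h = 1
    while h < n:                                      # in-place fast WHT
        for i in range(0, n, 2 * h):
            a = X[i:i + h].copy()
            b = X[i + h:i + 2 * h]
            X[i:i + h] = a + b
            X[i + h:i + 2 * h] = a - b
        h *= 2
    X /= np.sqrt(n)                                   # make (1/sqrt(n))H orthonormal
    rows = rng.choice(n, size=t, replace=False)       # uniform row subsample
    return X[rows] * np.sqrt(n / t)
```

Because the transform mixes every input row into every output row before subsampling, SRHT tends to be more stable than plain subsampling on unstructured data, consistent with the empirical comparison above.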

A table summarizing representative results:

| Method | Task Setting | Average Accuracy / Error | Relative Forgetting |
|---|---|---|---|
| Diagonal EWC/MAS | Permuted-MNIST | 86.7–88.3% | Substantial on early tasks |
| Sketched EWC/MAS | Permuted-MNIST | 89.8–90.4% | Lower, more stable |
| RSJ-100/RSJ-400 | Permuted-MNIST | ~98% | Near-joint; minimal |
| In situ + FJLT | Scientific simulation | 0.57–5.22% RFE, up to 60 dB PSNR | Matches offline regime |

6. Extensions, Limitations, and Open Challenges

While sketched SR is broadly effective, several open issues remain:

  • The extension from least-squares or quadratic loss regimes to non-quadratic objectives (e.g., cross-entropy) is nontrivial and active research, as the Taylor expansion underpinning the quadratic penalty breaks down (Heckel, 2021).
  • In deep, nonlinear, or finite-width networks, controlling the drift of the relevant Jacobian (or feature matrix) across tasks remains unresolved.
  • Sketch management for long task sequences or non-stationary distributions requires dynamic buffer allocation and may benefit from learned or structure-aware sketching operators (Simpson et al., 4 Nov 2025).
  • For modular or hierarchical task settings, sketch-driven routing defines an implicit structural regularization by constraining parameter sharing and reuse at the granularity of contexts/buckets but lacks a unified regularization functional for analysis (Deng et al., 2021).
  • The reduction of memory overhead beyond the order of the effective rank may require further compression techniques such as incremental sketches, quantization, or sketch sharing.
  • Identification of "critical directions" in the parameter space (appropriate rank for sketches, eigenvector retention) is vital for optimal bias-variance trade-off (Li et al., 5 Apr 2025).

7. Broader Architectures and Domains

Sketched structural regularization extends naturally to:

  • Continual multitask settings, including shared and task-specific heads.
  • Modular lifelong learning, where sketching guides parameter modularization and routing, permitting provable learning of hierarchically composed functions, reinforcement learning environments, and knowledge-graph queries (Deng et al., 2021).
  • Distributed and federated settings where compact sketches, rather than raw examples or full parameter dumps, enable efficient exchange of critical memory summaries (Simpson et al., 4 Nov 2025).

Overall, sketched structural regularization provides a theoretically-backed, computationally practical, and empirically robust solution for memory-efficient lifelong learning, sitting between crude diagonal SR and high-cost replay/rehearsal—from classical sequential learning scenarios to contemporary modular and in situ systems.
