
Bilevel Continual Learning

Updated 5 February 2026
  • Bilevel Continual Learning is a framework using hierarchical (inner-outer) optimization to improve knowledge retention and mitigate catastrophic forgetting across sequential tasks.
  • It integrates dual memory systems—episodic and generalization buffers—to manage experience replay and optimize generalization effectively.
  • Algorithmic strategies like bilevel coreset selection and low-rank transfer drive superior empirical performance and provide theoretical guarantees for stability and convergence.

Bilevel Continual Learning (BCL) is a framework for continual learning that leverages bilevel optimization principles to enhance knowledge retention, transfer, and generalization across sequential tasks. It replaces traditional monolithic training regimes with hierarchical inner–outer objectives, incorporates novel memory management strategies, and provides theoretical and empirical advances for mitigating catastrophic forgetting.

1. Bilevel Optimization Formulation in Continual Learning

At the core of BCL is its hierarchical optimization structure, where optimization of model parameters for current and past tasks is split into an inner (lower-level) and an outer (upper-level) loop. The classic formulation specifies base parameters θ, adapted for long-term generalization, and fast weights ϕ, responsible for near-term adaptation. The general BCL objective is:

\min_{\theta}\; L^{\mathrm{outer}}\bigl(\phi^*(\theta),\, M^{gm};\, \theta\bigr) \quad \text{s.t.} \quad \phi^*(\theta) = \arg\min_{\phi}\, L^{\mathrm{inner}}\bigl(\phi,\, B \cup M^{em};\, \theta\bigr)

where B is a minibatch from the current task, M^{em} is the episodic replay memory, and M^{gm} is a held-out generalization buffer (Pham et al., 2020, Shaker et al., 2020).

The inner loop updates ϕ with SGD to rapidly absorb new and replayed data and to enforce knowledge distillation (KD) regularization. The outer loop then adjusts θ toward improved generalization on M^{gm}. Extensions to generative models use VAE losses at both levels with feature replay (Shaker et al., 2020).
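The alternating update above can be sketched on a toy linear model. This is a minimal illustration under assumptions not taken from the papers: a first-order (Reptile-style) approximation is used for the outer step, treating dϕ/dθ ≈ I, and the buffers, model, and learning rates are purely illustrative.

```python
import numpy as np

def inner_outer_step(theta, batch, em, gm, inner_lr=0.1, outer_lr=0.1, inner_steps=5):
    """One first-order BCL update on a linear model y = X @ w.

    theta : base (slow) weights
    batch : (X, y) minibatch from the current task
    em    : episodic replay memory (X, y)
    gm    : generalization buffer (X, y), used only in the outer loss
    """
    # Inner loop: adapt fast weights phi on current batch + episodic memory.
    X_in = np.vstack([batch[0], em[0]])
    y_in = np.concatenate([batch[1], em[1]])
    phi = theta.copy()
    for _ in range(inner_steps):
        grad = 2 * X_in.T @ (X_in @ phi - y_in) / len(y_in)
        phi -= inner_lr * grad

    # Outer step: the first-order approximation treats dphi/dtheta as identity,
    # so the hypergradient reduces to the generalization-buffer gradient at phi.
    X_g, y_g = gm
    grad_outer = 2 * X_g.T @ (X_g @ phi - y_g) / len(y_g)
    return theta - outer_lr * grad_outer, phi
```

Repeating this step drives θ toward weights whose inner-loop adaptation generalizes well on M^{gm}, which is the qualitative behavior the bilevel objective targets.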

Recent developments extend BCL to joint architecture and weight optimization, expressing the bilevel system as:

\begin{cases} \psi^*(t) = \arg\min_{\psi\in\Psi} J\bigl(\psi,\, w^*(t-1)\bigr) \\ w^*(t) = \arg\min_{w\in\mathcal{W}(\psi^*(t))} \sum_{i=1}^{t} \ell\bigl(\hat{f}(x_i;\, w,\, \psi^*(t)),\, y_i\bigr) \end{cases}

where ψ encodes the network architecture (Hahn et al., 27 Jan 2026).

2. Memory Management and Coreset Construction

Memory management is a critical aspect of BCL, designed to optimize both storage efficiency and generalization. Dual-memory strategies maintain:

  • Episodic memory M^{em}: for experience replay, updated online via ring-buffer or reservoir sampling.
  • Generalization memory M^{gm}: a held-out buffer, curated from prior tasks, used solely for evaluating generalization in the outer objective.
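The online update for M^{em} can be sketched with standard reservoir sampling, which keeps a uniform random subset of the stream in bounded memory (this is the textbook algorithm, not code from any of the cited papers):

```python
import random

def reservoir_update(memory, item, n_seen, capacity):
    """Reservoir sampling: after n_seen stream items, `memory` holds a
    uniform random sample of size min(n_seen, capacity).

    n_seen counts all items observed so far, including `item`.
    """
    if len(memory) < capacity:
        memory.append(item)          # buffer not yet full: always keep
    else:
        j = random.randrange(n_seen)  # uniform index in [0, n_seen)
        if j < capacity:
            memory[j] = item          # replace a random slot with prob capacity/n_seen
```

Each arriving example is kept with probability capacity/n_seen, which is exactly what makes the retained sample uniform over the stream regardless of its length.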

A prominent variant is bilevel coreset construction: selecting optimal data subsets by solving a cardinality-constrained bilevel program. Given weights w over the data, the bilevel formulation is

\min_{w \in \mathbb{R}_+^n,\, \|w\|_0 \le m} L\bigl(\theta^*(w), \mathbf{1}\bigr) \quad \text{s.t.} \quad \theta^*(w) \in \arg\min_{\theta} L(\theta, w)

where L(θ, w) is the w-weighted training loss and \mathbf{1} denotes uniform weights over the full dataset. A greedy matching-pursuit solver is used, and, for deep networks, a Neural Tangent Kernel proxy is applied to reduce the computational burden (Borsos et al., 2020).
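A minimal version of the greedy selection can be sketched as plain refit-and-score forward selection on a ridge model; the matching-pursuit machinery and NTK proxy of the actual method are omitted, so this is only a conceptual stand-in:

```python
import numpy as np

def greedy_coreset(X, y, m, lam=1e-3):
    """Greedy cardinality-constrained coreset selection (sketch).

    At each step, add the point whose inclusion most reduces the
    full-data loss of a ridge model refit on the coreset -- a simple
    stand-in for the bilevel program's matching-pursuit solver.
    """
    d = X.shape[1]

    def fit(idx):
        Xs, ys = X[idx], y[idx]
        return np.linalg.solve(Xs.T @ Xs + lam * np.eye(d), Xs.T @ ys)

    def full_loss(w):
        return np.mean((X @ w - y) ** 2)

    selected = []
    for _ in range(m):
        best, best_loss = None, np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            loss = full_loss(fit(selected + [i]))
            if loss < best_loss:
                best, best_loss = i, loss
        selected.append(best)
    return selected
```

Even this naive variant shows why the NTK proxy matters at scale: each candidate requires a full refit, which is exactly the cost the proxy step amortizes.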

3. Algorithmic Strategies and Extensions

Practical BCL solvers leverage first-order approximations, alternating SGD steps between inner and outer loops. For the lower level, dynamic programming can be employed, with replay buffers and warm-starts for scalability. Upper-level (architecture) optimization often requires derivative-free directional search in a discrete or mixed configuration space. Knowledge transfer between architectures is facilitated by a low-rank factorization V = AWB^\top, bridging mismatched parameter dimensions without full reinitialization (Hahn et al., 27 Jan 2026).
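The dimension-bridging role of the factorization V = AWB^⊤ can be sketched as follows. In the paper the factors A and B are learned; the random Gaussian projections here are a hypothetical placeholder that only demonstrates the shape bookkeeping when a layer's input and output widths change:

```python
import numpy as np

def low_rank_transfer(W_old, d_out_new, d_in_new, seed=0):
    """Map old-layer weights W_old of shape (d_out, d_in) into a new
    layer of shape (d_out_new, d_in_new) via V = A @ W_old @ B.T.

    A, B would be learned in practice; random projections here are an
    assumption made purely to show the mechanics.
    """
    rng = np.random.default_rng(seed)
    d_out, d_in = W_old.shape
    A = rng.normal(size=(d_out_new, d_out)) / np.sqrt(d_out)  # lifts output dim
    B = rng.normal(size=(d_in_new, d_in)) / np.sqrt(d_in)     # lifts input dim
    return A @ W_old @ B.T
```

Because V inherits the (at most rank-min(d_out, d_in)) structure of W_old, the new layer starts from the old layer's learned subspace rather than from scratch.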

Efficient bilevel solvers such as BSG-N-FD and BSG-1 eschew explicit Hessian computation, using conjugate gradient or rank-1 approximations. These allow scalable computation of hypergradients even under constraints and streaming settings (Giovannelli et al., 2021).
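The idea of computing hypergradients without explicit second derivatives can be illustrated with central finite differences pushed through a toy inner problem that has a closed-form solution. This is cruder than BSG-N-FD's actual scheme (which finite-differences Hessian-vector products), but it shows the same principle of avoiding Hessian assembly; the quadratic losses below are illustrative assumptions:

```python
import numpy as np

def inner_solution(theta, c):
    # Closed-form argmin over phi of ||phi - theta||^2 + ||phi - c||^2.
    return 0.5 * (theta + c)

def outer_loss(phi, target):
    return np.sum((phi - target) ** 2)

def fd_hypergradient(theta, c, target, eps=1e-5):
    """Hessian-free hypergradient of the outer loss w.r.t. theta,
    via central finite differences through the inner solve."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        f_plus = outer_loss(inner_solution(theta + e, c), target)
        f_minus = outer_loss(inner_solution(theta - e, c), target)
        g[i] = (f_plus - f_minus) / (2 * eps)
    return g
```

For this toy problem the implicit-function hypergradient is available analytically (ϕ*(θ) − target), so the finite-difference estimate can be checked against it directly.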

4. Theoretical Guarantees and Limiting Factors

BCL frameworks provide several theoretical guarantees under regularity conditions:

  • Existence of bounded CL cost: Joint adaptation of weights and architecture preserves cost boundedness, offsetting task distribution shifts via architectural updates (Hahn et al., 27 Jan 2026).
  • Impossibility of weight-only CL: Pure weight adaptation accumulates noncancelable distribution gaps, resulting in guaranteed forgetting under nontrivial shifts (Hahn et al., 27 Jan 2026).
  • Stationarity and convergence rates: Under smoothness, convexity, and inexactness-decay assumptions, stochastic bilevel methods achieve O(1/\sqrt{K}) rates in nonconvex cases, and O(1/k) for strongly convex problems (Giovannelli et al., 2021).
  • Gradient alignment: Bilevel updates promote alignment between gradients on new and buffered data, formally reducing interference (Wang et al., 2024, Shaker et al., 2020).
  • Submodular approximation for coresets: Greedy bilevel coreset selection achieves provable approximations in the regularized least-squares regime (Borsos et al., 2020).
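The gradient-alignment criterion in the list above can be measured directly as the cosine between gradients on new and buffered data; positive alignment means a step on the new task also reduces (to first order) the replay loss. The helper below is a generic illustration, not code from the cited papers:

```python
import numpy as np

def grad_alignment(g_new, g_buf):
    """Cosine of the angle between the new-task gradient and the
    buffered-data gradient; values near 1 mean no interference,
    negative values signal destructive interference."""
    denom = np.linalg.norm(g_new) * np.linalg.norm(g_buf)
    return float(g_new @ g_buf / denom)
```

Bilevel updates can be understood as implicitly pushing this quantity upward, whereas naive sequential training leaves it uncontrolled.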

Limitations include NP-hardness of cardinality-constrained selection, scalability bottlenecks in kernel-based proxy steps, and practical challenges in dynamically tuning memory buffers and algorithmic tolerances.

5. Architectural Adaptation and Stability Mechanisms

Recent frameworks extend BCL to enable architectural adaptation at each task boundary. The upper level selects an optimal architecture ψ^* via derivative-free local search, while the lower level computes weights, ensuring stability and continued performance as the data distribution shifts. Low-rank transfer mechanisms efficiently transplant learned features to new architectures, preventing knowledge loss at transition points (Hahn et al., 27 Jan 2026).

Stability during online CL is further promoted by bilevel hypernetworks (e.g., Dual-CBA), splitting bias correction into class-specific and agnostic adaptors, stabilizing output probabilities across abrupt distribution changes. Incremental Batch Normalization (IBN) addresses feature shift by freezing moving averages during adaptation and recomputing them only over balanced buffer batches (Wang et al., 2024).
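The IBN recomputation step can be sketched as follows: running statistics stay frozen during adaptation, and a fresh per-feature mean and variance are computed once over class-balanced buffer batches. The function below is a simplified illustration of that refresh, not the paper's implementation:

```python
import numpy as np

def recompute_bn_stats(balanced_batches):
    """IBN-style refresh (sketch): given a list of class-balanced
    feature batches drawn from the buffer, recompute the BN running
    mean and variance per feature in one pass."""
    feats = np.vstack(balanced_batches)      # (total_samples, n_features)
    return feats.mean(axis=0), feats.var(axis=0)
```

Recomputing over balanced batches rather than the raw stream prevents the statistics from being skewed toward the most recent task's class distribution.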

6. Empirical Evidence and Comparative Results

Empirical studies of BCL demonstrate substantial gains over classical CL, memory replay, and meta-learning approaches:

| Method | PermMNIST ACC (%) | CIFAR-100 ACC (%) | Forgetting FM (%) |
| --- | --- | --- | --- |
| ER (baseline) | 78.5 | 62.2 | 7.7 |
| BCL-Dual | 92.8 | 67.8 | 2.8 |
| Dual-CBA (ER) | – | 29.1 | 8.9 |
| BiCL | – | – | – |
| Bilevel coreset | 79.3 | 37.6 | – |

BCL formulations yield higher average test accuracy and lower forgetting across a wide spectrum of datasets (MNIST, CIFAR, miniImageNet, Split CUB). Architectural search plus low-rank transfer smooths performance over task boundaries, and bilevel coreset selection improves buffer efficiency relative to uniform or heuristic sampling (Hahn et al., 27 Jan 2026, Pham et al., 2020, Borsos et al., 2020, Shaker et al., 2020, Wang et al., 2024).

7. Future Directions and Limitations

Open challenges in BCL include:

  • Second-order optimization schemes: Improving fidelity of outer loop updates may further boost generalization but incurs computational costs (Pham et al., 2020).
  • Adaptive memory allocation: Dynamic tuning of replay vs. generalization buffers and informed shrinking of coresets can optimize performance under stricter resource constraints (Borsos et al., 2020).
  • Task-agnostic class-incremental learning: Extending the dual-memory and bilevel optimization paradigm to settings lacking explicit task boundaries or labels (Pham et al., 2020).
  • Biologically-inspired consolidation: Enhancing synaptic stability mechanisms and integrating more sophisticated distillation or elastic weight consolidation in the hierarchical loop (Pham et al., 2020).
  • Scalable kernel-proxy models: Accelerating coreset selection for deep networks with larger buffer sizes (Borsos et al., 2020).

BCL unifies multiple strands of continual, meta-, and architecture learning through rigorous hierarchical optimization, principled memory management, and theoretically justified stability mechanisms. Its empirical efficacy and extensibility position it as a central paradigm in continual learning research.
