Shared LoRA Subspaces for almost Strict Continual Learning

Published 5 Feb 2026 in cs.LG, cs.AI, and cs.CV | (2602.06043v1)

Abstract: Adapting large pretrained models to new tasks efficiently and continually is crucial for real-world deployment but remains challenging due to catastrophic forgetting and the high cost of retraining. While parameter-efficient tuning methods like low rank adaptation (LoRA) reduce computational demands, they lack mechanisms for strict continual learning and knowledge integration, without relying on data replay, or multiple adapters. We propose Share, a novel approach to parameter efficient continual finetuning that learns and dynamically updates a single, shared low-rank subspace, enabling seamless adaptation across multiple tasks and modalities. Share constructs a foundational subspace that extracts core knowledge from past tasks and incrementally integrates new information by identifying essential subspace directions. Knowledge from each new task is incorporated into this evolving subspace, facilitating forward knowledge transfer, while minimizing catastrophic interference. This approach achieves up to 100x parameter reduction and 281x memory savings over traditional LoRA methods, maintaining performance comparable to jointly trained models. A single Share model can replace hundreds of task-specific LoRA adapters, supporting scalable, asynchronous continual learning. Experiments across image classification, natural language understanding, 3D pose estimation, and text-to-image generation validate its effectiveness, making Share a practical and scalable solution for lifelong learning in large-scale AI systems.

Abstract PDF Upgrade to Chat

Summary

The paper introduces Share, a method that leverages shared LoRA subspaces to enable replay-free continual learning with minimal parameter overhead.
It utilizes SVD-based initialization and closed-form projection to update task-specific coefficients while preserving past knowledge.
Empirical results demonstrate that Share achieves up to 100x parameter efficiency and robust performance across language, vision, and generative tasks.

Shared LoRA Subspaces for Almost Strict Continual Learning

Introduction and Context

The paper "Shared LoRA Subspaces for almost Strict Continual Learning" (2602.06043) addresses the problem of continual adaptation of large pretrained models (e.g., LLMs, VLMs, diffusion models) with minimal parameter overhead while maintaining knowledge from previous tasks. The work explicitly targets the limitations of parameter-efficient tuning protocols such as LoRA, which, despite their computational benefits, do not innately provide mechanisms for strict continual learning (CL) or scalable adapter integration. The presented method, Share, proposes a novel shared subspace adaptation framework designed to enable replay-free, nearly strict CL with extreme efficiency, cross-domain applicability, and robust knowledge transfer.

Theoretical Motivation and Methodological Framework

Foundational Subspace Hypothesis

The central observation of the paper is that LoRA adapters trained across correlated tasks converge toward a low-dimensional shared subspace in weight space. Empirical evidence (via linear CKA) reveals that models initialized with different task permutations rapidly converge to such a foundational subspace during sequential CL, achieving >0.95 similarity after a handful of tasks.

Figure 1: CKA analysis demonstrates monotonic convergence of LoRA subspaces to a shared foundation during continual adaptation.

This supports the universal weight subspace hypothesis, and motivates the design of Share, which aims to construct and continually evolve a single shared subspace capturing essential directions for downstream adaptation, thus minimizing catastrophic forgetting and maximizing resource efficiency.

Share maintains a set of principal basis vectors (frozen during most adaptation) and learns task-specific coefficients for new tasks or incoming adapters. The method comprises:

Initialization: From available LoRA adapters or data, principal factors are extracted using SVD to form the foundational subspace.
Continual Adaptation: For new tasks or adapters, only the coefficients associated with a small subset of subspace directions (or newly added temporary basis) are updated.
Merging/Update: Newly learned directions and coefficients are merged (via analytical SVD updates), and all prior knowledge is reprojected onto the updated subspace; coefficients are recalculated via closed-form projection using the Moore-Penrose pseudo-inverse.

This approach ensures strictly bounded parameter and memory complexity as task count grows, with empirical ablation showing that pseudo-rank $p=1$ and moderate $k$ achieve optimality.

Figure 2: Schematic presentation of the continual adaptation and merging steps in Share, highlighting the minimal adaptation space.

Theoretical Analysis

The paper formalizes the error bounds for the incremental construction of the shared subspace and the resulting approximation error for newly arriving tasks. The incremental SVD approach guarantees that the loss in cross-task knowledge is dominated by the eigen-spectrum truncation beyond the top- $k$ directions, with expected approximation error growing sub-linearly with task count under reasonable similarity assumptions. The generalization gap for new tasks is tightly controlled relative to unconstrained low-rank adaptation as long as new task solutions reside within the constructed subspace, otherwise expanding flexibly.

Empirical Evaluation

Share is evaluated across language (GLUE benchmark), vision (CIFAR-100, Food-101, Caltech-101, Flowers-102), structured geometry (3D pose estimation on Pascal3D+), and generative domains (text-to-image with Flux, large-scale LoRA merging in LLMs).

Continual Task Sequences and Subspace Tracking

Share equipped with $O(10^4)$ parameters matches or exceeds fully fine-tuned or non-CL LoRA models requiring $O(10^6-10^7)$ parameters. Parameter efficiency approaches 100x in typical settings, and empirical memory savings exceed 250x, all while maintaining high average accuracy and dramatically reducing forgetting compared to rehearsal-based and prompt-based methods.

Notably, the reconstructed error and CKA alignment of the foundational subspace both stabilize and converge through sequential learning, showing the robustness and dynamic refinement of the principal directions.

Figure 3: Share achieves low reconstruction error over all tasks' LoRA adapters, confirming subspace sharing.

Figure 4: CKA similarity of Share's evolving factor space indicates effective incorporation of new information over time.

Large-Scale Adapter Merging

Share supports continual, asynchronous merging of hundreds of LoRA adapters (e.g., Mistral-7B scenario), compressing them into a single set of principal factors. The method maintains 90–99% of adapter performance in both in-distribution and OOD settings, vastly outperforming task-specific or batch-merging competitors such as TIES in zero-shot or cross-domain scenarios.

Figure 5: Share integrates hundreds of LoRA adapters continually, maintaining low reconstruction error and high downstream performance.

Generative and Multimodal Tasks

For diffusion/text-to-image models, Share achieves competitive sample quality against task-specific LoRA finetuning with a single set of factors—even when handling multiple stylistically diverse tasks—demonstrating broad applicability and generalization capacity for multimodal adaptation.

Figure 6: Comparison of text-to-image results: Share (continual, shared subspace) vs independently trained LoRA adapters.

Practical and Theoretical Implications

Share redefines the lower-bound of parameter and memory requirements for continual adaptation in large foundation models, providing a scalable, robust tool for research and deployment in continuously evolving environments. Its ability to perform replay-free, domain-agnostic, and highly compressive CL enables rapid AI personalization, multi-user deployment, federated or decentralized learning, and environmentally sustainable training regimes.

Theoretically, Share substantiates the centrality of shared weight structures and supports the development of new architectures and learning protocols that leverage universal subspaces for efficient multi-task and lifelong learning in deep networks.

Future Directions

Outstanding directions include extending Share to enable knowledge integration across heterogeneous architectures, handling cross-modal transfer learning, relaxing strict backbone homogeneity, and integrating direct metric learning for more expressive subspace control. Further work may focus on improving OOD generalization in absence of task-aligned adapters, as well as low-resource initialization, and exploring federated settings with privacy guarantees.

Conclusion

Share establishes a systematic and theoretically justified protocol for parameter-efficient continual finetuning by leveraging the emergent shared subspace across LoRA adapters. The method achieves nearly strict continual learning—avoiding catastrophic forgetting, data replay, and adapter proliferation—while matching or surpassing strong task-specific adaptation baselines. The presented empirical evidence and analytical framework mark Share as a pivotal advance for robust, scalable, and sustainable adaptation of modern foundation models (2602.06043).

Markdown Report Issue

Paper to Video (Beta)

All Videos Subscribe on YouTube

Whiteboard

Shared LoRA Subspaces for almost Strict Continual Learning

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Practical Applications

off on

Glossary

off on

Conceptual Simplification

off on

Explain it Like I'm 14

Overview

This paper introduces a new way to help large AI models (like LLMs and image models) learn new tasks one after another without forgetting what they already know. The method is called “Share.” It focuses on making learning efficient so you don’t have to retrain millions of model parameters every time, and it avoids storing old data or keeping many separate “add-on” pieces. Share finds and updates a single shared “learning space” that different tasks can use, so the model keeps getting better while staying small and fast.

Key Questions

The paper asks:

Can we make big AI models learn continuously (task after task) without forgetting old tasks, while using very few extra parameters?
Is there a common “subspace” (think of it like a shared set of directions the model can adjust) that works across many different tasks?
Can we update that shared subspace over time so new tasks benefit from old knowledge, and old tasks don’t get worse?

How the Method Works

Think of a large AI model as a huge machine with lots of knobs. Traditional fine-tuning turns many knobs, which is slow and memory-heavy. LoRA (Low-Rank Adaptation) is a popular trick that adds small, trainable “adapters” to each layer, so you only turn a few important knobs. However, standard LoRA usually needs one adapter per task, which piles up and doesn’t share knowledge well.

Share improves on this by learning a single, shared “toolbox” of key directions (basis vectors) the model can use for many tasks. For each new task, you only learn tiny “coefficients” that say how much to use each tool, instead of adding a whole new adapter.

Step 1 — Build a shared toolbox (Initialization)

Start with one or more LoRA adapters or some initial task data.
Use a technique called SVD (Singular Value Decomposition). In everyday terms, SVD finds the most important directions of change across tasks—like discovering the main “axes” that explain how adapters adjust the model.
Freeze these directions (basis vectors). They are your shared toolbox.
Only learn small coefficients for each task that say “use tool 1 a bit, tool 2 more, tool 3 not at all,” etc.
This already cuts the number of trainable parameters by a lot because you’re reusing the toolbox instead of building a new adapter each time.

Step 2 — Learn new tasks efficiently (Continual Adaptation)

When a new task arrives, you temporarily add a few extra directions to the toolbox, just enough to capture what’s new.
You update the small coefficients for the new task.
This uses far fewer parameters than training a full LoRA adapter, but lets the model adapt to fresh information.

Step 3 — Merge and polish (Merging and Fine-tuning)

Fold the newly learned directions back into the main toolbox using SVD again.
Recompute all task coefficients mathematically (no gradients needed), so older tasks can benefit from the improved toolbox.
Optionally, do a tiny fine-tune of the coefficients if you’re allowed a bit of extra training data.
End result: one shared toolbox and a tiny set of coefficients per task—no stack of separate adapters.

A simple analogy

Basis vectors = shared tools in a toolbox.
Coefficients = instructions for how much to use each tool on a specific task.
SVD = a smart way to discover which tools matter most across many tasks.
Share = a system that keeps refining the toolbox and reuses it, instead of storing a new toolbox for every job.

Main Findings

Across many experiments (language understanding, image classification, 3D pose estimation, text-to-image generation), Share:

Dramatically reduces cost:
- Up to 100× fewer trainable parameters than standard LoRA.
- Up to 281× memory savings compared to keeping many LoRA adapters.
Maintains strong performance:
- On the GLUE language benchmark, Share reached about the same average score as separate LoRA adapters while using a tiny fraction of the parameters.
- In image classification (CIFAR‑100, Food‑101, Caltech‑101, Flowers‑102), Share matched or closely approached top performance with fewer parameters and very low forgetting.
- In 3D pose estimation with occlusion, Share outperformed several methods that store old examples, using about 96% fewer parameters.
- In text-to-image generation, Share produced results comparable to individual LoRAs, but with one shared model instead of many.
Shares knowledge forward and backward:
- As the toolbox improves, earlier tasks sometimes get better too (this is called backward transfer). This is rare and valuable in continual learning.
Scales to many adapters:
- A single Share model can compress and replace hundreds of task-specific LoRA adapters, and still keep strong performance—even on out-of-distribution tasks.

Here are a few highlights in a compact form:

Similar accuracy to separate, non‑continual LoRAs in language tasks, with ~100× fewer parameters.
Near upper‑bound accuracy in several image classification benchmarks, with fewer parameters and minimal forgetting.
Better occlusion robustness in 3D pose estimation, without storing old data.
One shared model for many styles in text‑to‑image generation, saving lots of memory and maintenance effort.

Why It Matters

Lifelong learning: Share helps big models learn continuously, like students gaining new skills over time without forgetting old ones.
Efficiency and access: Because it needs far fewer parameters and memory, Share makes advanced models more practical for researchers and teams without huge computers. It also reduces environmental impact by cutting training resources.
Simplicity at scale: Instead of juggling many task-specific adapters, you maintain one shared model with tiny per-task coefficients. This makes deployment and personalization much easier.
Strong generalization: The shared toolbox captures core knowledge across tasks and even across different types of models (language, vision, generation), helping new tasks start from a strong base.

In short, Share shows that large AI systems can keep learning new things efficiently, reuse what they’ve learned, avoid forgetting, and stay compact—bringing us closer to practical, lifelong learning in real-world AI.

View Paper Prompt View All Prompts

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of the paper’s unresolved issues, uncertainties, and areas needing further investigation. Each point is framed to be concrete and actionable for future research.

Task-ID-free inference: The method retains per-task coefficient sets ε^i, but does not specify how to select the appropriate coefficients at inference when task identity is unknown. Develop task routing or unsupervised gating mechanisms that work without explicit task labels.
Storage growth in “almost strict” continual learning: Although basis vectors are shared, storage still grows linearly with tasks due to accumulating ε^i. Quantify and bound memory growth over long horizons; explore coefficient compression (e.g., clustering, product quantization, low-bit storage) or consolidation strategies.
Automatic and adaptive hyperparameter selection: The choices of k (basis size), p (pseudo-rank), and φ (temporary expansion) rely on heuristics. Design adaptive criteria (e.g., streaming explained variance thresholds, Bayesian model selection, or MDL-based rules) and validate them under task drift.
Incremental SVD scalability and stability: The merging step requires SVD over a growing stacked matrix. Analyze computational complexity for large layers and many adapters; implement and evaluate streaming/incremental SVD, randomized SVD, or distributed SVD with numerical stability guarantees.
Robustness to outlier or adversarial adapters: Merging community LoRAs can introduce noise/poisoning. Investigate robust PCA/SVD variants, outlier detection, and adapter vetting pipelines; quantify subspace contamination and its effect on performance and forgetting.
Theoretical guarantees for non-convex deep models: Current analysis assumes strong convexity and Lipschitz continuity—conditions that generally do not hold for deep networks. Provide non-convex optimization guarantees (e.g., convergence, generalization, forgetting/transfer bounds) aligned with modern deep-learning theory.
Conditions for backward transfer: The paper observes backward transfer empirically but does not formalize when and why it occurs. Characterize the task similarity and subspace overlap conditions under which backward transfer is likely; provide diagnostic metrics and predictive criteria.
Subspace existence and universality across modalities: Evidence for a “foundational subspace” is shown primarily within single modalities/architectures. Test whether shared subspaces exist across architectures and modalities (e.g., text vs. vision vs. diffusion), and analyze layerwise vs. global subspace consistency.
Cross-layer dependencies: Share treats layers independently. Investigate whether cross-layer coupled subspaces or joint low-rank structures improve retention and transfer; study interactions between layers during merging.
Freezing W0 limitations: The base weights W0 remain fixed. Explore selective unfreezing policies or hybrid updates that maintain continual constraints but allow limited base-weight adaptation when subspace capacity is insufficient.
Handling tasks needing higher-rank updates: Fixed k may underfit novel tasks. Develop dynamic rank expansion/pruning policies with clear criteria (e.g., reconstruction error rises beyond a threshold) and analyze their impact on memory and forgetting.
Long-horizon continual sequences: Most experiments involve ≤10 tasks (except large-scale LoRA experiments). Conduct studies with 100–1000 tasks to quantify stability, drift, memory growth, and retention over lengthy sequences and diverse domains.
Evaluation breadth for NLU/LLM tasks: GLUE excludes MNLI and QQP; broader LLM capabilities (e.g., multi-step reasoning, tool use, dialog) are not evaluated. Extend benchmarks to complex, multi-turn tasks and safety-critical domains.
Generative-model evaluation rigor: Text-to-image experiments primarily use qualitative comparisons and limited CLIP-based metrics. Add comprehensive quantitative metrics (e.g., FID, IS, TIFA, human preference studies, safety/toxicity assessments) and report statistical significance.
Comprehensive baseline coverage: Comparisons omit several relevant PeFT/continual methods for LLMs and diffusion (e.g., prefix/prompt tuning variants, adapter routing/gating, O-LoRA-style orthogonal subspaces). Include broader baselines with consistent settings.
Inference-time latency and hardware compatibility: The runtime impact of using shared factors vs. standard LoRA is not analyzed. Measure inference latency, memory footprint, and compatibility with quantization and accelerators across tasks/models.
Asynchronous/distributed updates: The large-scale merging scenario lacks analysis of concurrency, staleness, and order sensitivity. Study merge-order effects, conflict resolution, and consistency in distributed/asynchronous settings; propose protocols for safe concurrent merges.
Concept drift and continuous streams: The framework assumes discrete tasks. Evaluate performance with continuous, unlabeled streams and non-stationary data; develop online detection of novel directions and automatic subspace adaptation.
Criteria for adding new directions (φ expansion): Define robust, data-free triggers to add/retain/prune temporary basis vectors (e.g., reconstruction error thresholds, subspace angle metrics, CKA drift); quantify trade-offs.
Coefficient consolidation across tasks: Investigate whether multiple εⁱ can be merged or factored (e.g., shared components plus sparse task residuals) to reduce storage and improve generalization.
Safety, privacy, and licensing: Merging third-party adapters may violate licenses or leak private/biased information. Establish provenance tracking, privacy-preserving merging (e.g., DP mechanisms), and safety filters; assess ethical implications.
Security against poisoning: Adapters can be maliciously crafted. Develop adversarial defenses and verification methods (e.g., certification tests, anomaly detection) before integrating into the shared subspace.
OOD performance characterization: OOD results are limited and partly reported as relative performance. Systematically evaluate OOD generalization across diverse tasks, quantify degradation, and identify subspace properties that predict robustness.
Memory savings accounting: The reported 281× savings lack detailed accounting (e.g., optimizer states, activation/gradient memory, temporary expansion φ). Provide comprehensive deployment-level memory profiles and standardized measurement protocols.
Reproducibility and ablation completeness: Several critical ablations (hyperparameters, merging order, subspace size, per-layer vs. global) are deferred to appendices. Present thorough, main-text ablations with open-source code and seeds to ensure reproducibility.

View Paper Prompt View All Prompts

Practical Applications

Practical Applications of “Shared LoRA Subspaces for almost Strict Continual Learning”

Below are applications derived from the paper’s findings, methods, and innovations. Each item notes sector links, potential tools/workflows, and assumptions/dependencies.

Immediate Applications

Unified multi-tenant model serving with per-tenant personalization
- Sectors: software/SaaS, customer support, e-commerce, finance
- Tools/products/workflows: one “Share-enabled” model; a tenant registry storing only lightweight coefficients; adapter ingestion and merging pipeline; dashboards tracking forgetting/backward transfer; API to attach/detach tenant-specific coefficients at inference
- Assumptions/dependencies: tasks share modality/domain structure; legal rights to use community LoRA adapters; SVD-based merging fits deployment latency budget; secure isolation of tenant coefficients
On-device continual adaptation (edge/mobile)
- Sectors: robotics, IoT, AR/VR, automotive infotainment, wearables
- Tools/products/workflows: Share runtime library for embedded platforms; local coefficient fine-tuning without replay; periodic “adapter updates” via OTA; fallback to base model if subspace mismatch
- Assumptions/dependencies: sufficient compute/memory (benefits from 96–281× savings); tasks are sufficiently similar to maintain a usable shared subspace; robust quantization support
Enterprise knowledge integration without data replay (compliance-friendly)
- Sectors: finance, legal, healthcare, government archives
- Tools/products/workflows: scheduled ingestion of department-specific LoRA adapters; analytical merging (no gradient, no historical data retention); audit logs of coefficient updates; “Share-full” optional fine-tuning for selected previous tasks within policy bounds
- Assumptions/dependencies: availability of either adapters or limited current-task data; validated performance under compliance constraints; appropriate access control and provenance tracking
Healthcare model customization across departments and evolving guidelines
- Sectors: healthcare
- Tools/products/workflows: hospital-level shared subspace with department-specific coefficients; guideline-update adapters merged analytically; evaluation pipelines to guard against catastrophic bias shifts
- Assumptions/dependencies: regulatory validation; strong monitoring to detect out-of-subspace tasks; privacy-preserving handling of coefficients (risk of indirect leakage)
Education: personalized tutoring and curriculum adaptation
- Sectors: education
- Tools/products/workflows: “student profiles” as lightweight coefficients; incremental ingestion of curriculum adapters; teacher dashboards tracking performance and forgetting
- Assumptions/dependencies: similarities across curricula/tasks to sustain a common foundational subspace; fairness controls to avoid reinforcing biases
Robotics perception with replay-free continual learning
- Sectors: robotics, logistics, manufacturing, retail
- Tools/products/workflows: sequential class expansion and 3D pose estimation with occlusions; small coefficient updates per new environment; periodic analytical merges to retain old skills
- Assumptions/dependencies: perception tasks share subspace; real-time inference constraints are met; robustness under domain shifts
Creative agencies: single diffusion model serving many styles/themes
- Sectors: media/entertainment, marketing
- Tools/products/workflows: style libraries as coefficients; “Share-enabled” diffusion backbones replacing dozens of LoRAs; non-destructive style addition; batch style merges via SVD
- Assumptions/dependencies: style adapters are licensed; acceptable qualitative parity with per-style LoRA; content safety checks remain valid after merges
MLOps: adapter lifecycle management and compression
- Sectors: software tooling/MLOps
- Tools/products/workflows: adapter registry; Share-based analytical merging (data- and gradient-free); automated selection of k, p, φ; monitoring for forgetting/backward transfer; rollbacks if subspace drift detected
- Assumptions/dependencies: robust SVD implementations; compatibility with popular frameworks (PyTorch/Transformers/Diffusers); governance around adapter provenance
Community adapter curation and consolidation
- Sectors: open-source ecosystems, academia
- Tools/products/workflows: “LoRA compressor” that merges hundreds of community adapters into one set of factors + per-task coefficients; reproducible model-merging recipes; benchmarks using Share as a continual baseline
- Assumptions/dependencies: adapter metadata and licensing clarity; standardized adapter formats; similarity across adapters to ensure meaningful compression
Cloud cost reduction in large-scale inference
- Sectors: cloud platforms, enterprise IT
- Tools/products/workflows: deploy one Share model instead of many per-task LoRAs; scale to many tenants by loading only coefficients; GPU memory savings (up to ~281×) for increased concurrency
- Assumptions/dependencies: accurate subspace for mixed workloads; amortized cost of occasional merges; robust VRAM management

Long-Term Applications

Federated, privacy-preserving continual learning
- Sectors: healthcare networks, financial consortia, public-sector data collaboratives
- Tools/products/workflows: institutions share coefficients/adapters (no raw data); federated subspace refinement; governance for adapter provenance and evaluation
- Assumptions/dependencies: secure communication; cross-site task similarity; standards for subspace interoperability
Universal subspace marketplaces
- Sectors: AI tooling, model hubs
- Tools/products/workflows: standardized “foundational subspaces” per modality/domain; third parties publish coefficients; versioned subspace catalogs; compatibility badges
- Assumptions/dependencies: community adoption of formats/APIs; licensing/IP frameworks; validation kits to certify subspace utility
Continual autonomy for vehicles and embodied AI
- Sectors: automotive, drones, smart factories
- Tools/products/workflows: OTA updates as coefficients/adapters for perception/NLU copilots; safety cases using replay-free updates; compliance logs and rollback
- Assumptions/dependencies: rigorous verification; resilience to out-of-subspace conditions; real-time constraints
Lifelong, multimodal healthcare personalization
- Sectors: healthcare
- Tools/products/workflows: shared subspaces across text (EHR), imaging, signals; per-patient coefficient trajectories; clinical drift monitoring; interpretable audit trails
- Assumptions/dependencies: validated multimodal subspaces; strict privacy controls; ongoing bias/robustness assessments
Industrial predictive maintenance under limited memory/compute
- Sectors: energy, manufacturing, utilities
- Tools/products/workflows: edge models that absorb streaming sensor patterns as coefficients; centralized analytical merges; facility-level subspaces for equipment families
- Assumptions/dependencies: modality/task similarity across machines; fault coverage; reliable edge deployment pipelines
Regulatory-compliant risk and compliance assistants
- Sectors: finance, insurance, legal
- Tools/products/workflows: replay-free continual updates to regulatory knowledge; subspace-based auditability; coefficient-level diffs as change logs
- Assumptions/dependencies: codified standards for continual-learning compliance; external validation of updates; robust robustness testing
Cross-domain foundation models that grow skills without growing size
- Sectors: general AI, software platforms
- Tools/products/workflows: single Share backbone spanning NLU, vision, generation; evolving subspace with modular coefficients; unified eval across tasks
- Assumptions/dependencies: scalable cross-modality subspace discovery; scheduling strategies to avoid interference; continual hyperparameter tuning (k, p, φ)
Share-aware inference engines and hardware/software co-design
- Sectors: semiconductors, cloud, AI tooling
- Tools/products/workflows: kernels optimized for basis-coefficient multiplication; fast SVD/low-rank updates; memory layout optimizations; inference-time coefficient switching
- Assumptions/dependencies: vendor support; standardization; measurable latency wins
Policy frameworks for continual learning
- Sectors: public policy, standards bodies, corporate governance
- Tools/products/workflows: guidelines for replay-free training, adapter licensing, provenance, safety monitoring, and energy reporting; certification processes
- Assumptions/dependencies: multi-stakeholder consensus; alignment with existing AI risk frameworks; enforceable auditing mechanisms
Distributed asynchronous learning in swarms or field operations
- Sectors: defense, disaster response, environmental monitoring
- Tools/products/workflows: local adapters from varied conditions merged centrally into subspaces; lightweight coefficients transmitted under low bandwidth; resilience planning via rollback strategies
- Assumptions/dependencies: communications reliability; robustness to heavy distribution shifts; careful subspace governance to avoid brittle merges

View Paper Prompt View All Prompts

Glossary

Asynchronous continual learning: A learning paradigm that integrates new tasks or adapters as they arrive, without synchronized training phases. "supporting scalable, asynchronous continual learning."
Backward knowledge transfer: Improvement on previously learned tasks due to learning new tasks. "we also observe instances of backward knowledge transfer due to presence of this subspace."
Catastrophic forgetting: The degradation of performance on earlier tasks when a model is trained sequentially on new tasks. "remains challenging due to catastrophic forgetting"
Catastrophic interference: Negative interaction between tasks causing a model to overwrite or degrade prior knowledge. "while minimizing catastrophic interference."
CLIP score: A metric that measures alignment between text and images based on CLIP embeddings. "CLIP scores across sequential tasks (T1–T4) for continual text-to-image generation."
Continual model merging: The process of incrementally combining multiple task-specific models/adapters into a single model over time. "Continual Model Merging and Learning at Scale"
Data replay: Reusing stored examples from previous tasks during training to mitigate forgetting. "without relying on data replay, or multiple adapters."
Diffusion models: Generative models that learn to produce data by reversing a diffusion (noise) process. "like LLMs, VLMs and Diffusion models"
Empirical Risk Minimizer (ERM): A model that minimizes the empirical loss over a dataset. "be an independent empirical risk minimizer (ERM) of task τ_t"
Explained variance: The proportion of total variance captured by selected principal components or factors. "determined by a threshold based on the explained variance of the factor data matrix D."
Forward knowledge transfer: The improvement on future tasks due to knowledge learned from previous tasks. "facilitating forward knowledge transfer"
Foundational subspace: A shared low-rank subspace that captures core, reusable knowledge across tasks. "Share constructs a foundational subspace that extracts core knowledge from past tasks"
Frobenius norm: A matrix norm equal to the square root of the sum of the squares of all entries, used for measuring reconstruction error. "gives an upper bound on the Frobenius-norm error"
Geodesic distance: The shortest path distance on a manifold; for rotations, it measures angular difference between rotation matrices. "the percentage of predictions where the geodesic distance between predicted and ground truth rotation matrices is below π/6 radians."
Linear CKA (Centered Kernel Alignment): A similarity measure for comparing representations or weight spaces across models or tasks. "Linear CKA similarity analysis reveals a universal weight subspace"
Lipschitz continuous loss: A loss function whose changes are bounded proportionally by changes in inputs, characterized by a Lipschitz constant. "Given a Lipschitz continuous loss (...) that is strong convex over the weight spaces spanned by D and V^T_k"
LoRA (Low-Rank Adaptation): A parameter-efficient method that inserts trainable low-rank matrices into layers to adapt large models. "While parameter-efficient tuning methods like low rank adaptation (LoRA) reduce computational demands"
LoRA adapter: The task-specific low-rank parameter module used to finetune a pretrained model. "A single Share model can replace hundreds of task-specific LoRA adapters"
LoRA rank: The rank parameter r controlling the dimensionality of the low-rank adaptation. "Hyperparameters include LoRA rank $r=32$ "
Low-rank matrix: A matrix whose rank is small relative to its dimensions, enabling compact parameterization. "LoRA introduces two low-rank trainable matrices"
Low-rank subspace: A subspace spanned by a limited number of basis vectors that capture most of the variation. "learns and dynamically updates a single, shared low-rank subspace"
Matrix projection: The operation of projecting a matrix onto a subspace, typically to minimize reconstruction error. "Using matrix projection and Moore-Penrose pseudoinverse of β^{t+1}, we analytically calculate task coefficients"
Mixture-of-experts systems: Architectures that route inputs to specialized expert models, often increasing parameter count. "primarily operate as mixture-of-experts systems"
Moore-Penrose pseudoinverse: A generalized inverse of a matrix used for least-squares solutions and projections. "Using matrix projection and Moore-Penrose pseudoinverse of β^{t+1}, we analytically calculate task coefficients"
Orthogonal subspaces: Subspaces whose basis vectors are mutually orthogonal, used to separate task representations. "propose learning new tasks in orthogonal subspaces to mitigate forgetting."
Orthonormal columns: Matrix columns that are both orthogonal and unit-length, simplifying projections and decompositions. "When β^{t+1} has orthonormal columns, this simplifies to ε^i_β = (β^{{t+1})^T} \hat{B}_i"
Parameter overhead: The additional parameters introduced during adaptation beyond the base pretrained model. "adapts to a sequence of tasks $\{\tau_1, \tau_2, \dots, \tau_t\}$ with minimal parameter overhead."
Parameter-efficient finetuning (PaCT): Finetuning that minimizes additional trainable parameters, here done continually. "Parameter-Efficient Continual Finetuning (PaCT)"
Principal basis vectors: The dominant basis vectors obtained (e.g., via SVD/PCA) that capture the main directions of variation. "principal basis vectors, which remain frozen during finetuning"
Principal coefficients: The learned coefficients that weight principal basis vectors to reconstruct adapters or weights. "where only principal coefficients $\epsilon^t$ are trained."
Pseudo-rank: An effective small rank (p) used to parameterize coefficients within the shared subspace. "where $p$ (pseudo-rank) can be as small as 1."
Right singular vectors: The V components of SVD associated with columns, defining directions in input space. "V^T_k are fixed right singular vectors from \text{SVD}(\mathcal{D}^t)_{[:k]}"
Rouge-L: A text evaluation metric based on longest common subsequence overlap. "We report the absolute and relative Rouge-L scores at each time-step."
Shared subspace: A common subspace across tasks where weights or adapters align. "neural network weights often converge to layerwise, shared subspace across tasks and datasets"
Singular Value Decomposition (SVD): A matrix factorization into UΣV^T used to extract principal bases and ranks. "perform SVD on the mean-centered matrices"
Singular values: The diagonal entries of Σ in SVD, representing the strength of each singular direction. "where $\sigma_i$ 's are the singular values of $\mathcal{D}^{t}$ for the non-principal basis vectors."
Strong convexity: A curvature property of a function ensuring a unique minimizer and improved optimization bounds. "that is strong convex over the weight spaces spanned by D and V^T_k"
Strict Continual Learning: A setting that forbids data replay, additional models, and parameter growth during sequential learning. "All such methods fall short of Strict Continual Learning~\cite{kaushik2021understandingcatastrophicforgettingremembering}, which requires models to learn continually, without data replay, additional models, or increase in model size, much like humans."
Universal Weight Subspace Hypothesis: The hypothesis that neural network weights converge to shared subspaces across tasks. "Universal Weight Subspace Hypothesis~\cite{kaushik2025universalweightsubspacehypothesis} has proven that neural network weights often converge to layerwise, shared subspace across tasks and datasets"
π/6 metric: An accuracy measure for pose estimation based on a rotation error threshold of π/6 radians. "measure accuracy using the π/6 metric—the percentage of predictions where the geodesic distance between predicted and ground truth rotation matrices is below π/6 radians."

Shared LoRA Subspaces for almost Strict Continual Learning

Summary

Shared LoRA Subspaces for Almost Strict Continual Learning

Introduction and Context

Theoretical Motivation and Methodological Framework

Foundational Subspace Hypothesis

Share: Continual Subspace Adaptation

Theoretical Analysis

Empirical Evaluation

Cross-Modal Applicability

Continual Task Sequences and Subspace Tracking

Large-Scale Adapter Merging

Generative and Multimodal Tasks

Practical and Theoretical Implications

Future Directions

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

Overview

Key Questions

How the Method Works

Step 1 — Build a shared toolbox (Initialization)

Step 2 — Learn new tasks efficiently (Continual Adaptation)

Step 3 — Merge and polish (Merging and Fine-tuning)

A simple analogy

Main Findings

Why It Matters

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Practical Applications

Practical Applications of “Shared LoRA Subspaces for almost Strict Continual Learning”

Immediate Applications

Long-Term Applications

Glossary

Open Problems

Continue Learning

Related Papers

Authors (5)

Collections

Tweets

YouTube

HackerNews