Trans-LoRA: Efficient Adapter Transfer
- Trans-LoRA is a parameter-efficient fine-tuning method that uses synthetic data and distillation to transfer LoRA adapters across different base models.
- It constructs a filtered synthetic dataset that mimics the original training distribution, enabling effective adapter migration and preserving task accuracy.
- Empirical evaluations reveal that Trans-LoRA achieves lossless or improved performance across diverse tasks and architectures even in restricted data settings.
Trans-LoRA is a parameter-efficient fine-tuning (PEFT) transfer method enabling lossless or positive transfer of low-rank adapters (LoRA) across distinct base models without requiring access to proprietary client data. The framework circumvents the central limitation of classical LoRA—its strict coupling to the pre-trained base weights—by leveraging synthetic data generation and distillation. Trans-LoRA enables adapters trained on one base model to be migrated to new base models, or even across different PEFT classes, preserving accuracy on downstream tasks even in cloud environments where the original training data is inaccessible (Wang et al., 2024).
1. Motivation and Problem Setting
PEFT methods such as LoRA attach a set of low-rank adapter weights to a fixed pre-trained model. On model deprecation or replacement (for instance, upgrading from Llama-2-7B to Llama-2-13B or changing to a different architecture like Gemma), all client-specific adapters must be re-trained on the original data—a process often infeasible for privacy, scalability, or legal reasons. Because LoRA's weight updates are strongly bound to the exact pre-trained anchor weights $W_0$, naive transplantation into a new base model degrades performance or fails to capture the intended downstream behavior (Wang et al., 2024).
Trans-LoRA addresses this by constructing a filtered synthetic dataset, closely mimicking the data distribution that the original adapters experienced, thus sidestepping the need for access to actual user data.
2. LoRA Adapter Recap
LoRA (Low-Rank Adaptation) replaces a full-rank parameter update with a low-rank factorization, $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ for $r \ll \min(d, k)$. During traditional fine-tuning, the adapter weights $\theta = (A, B)$ are optimized on the actual task dataset $D$ using a loss function such as $\mathcal{L}(\theta) = \mathbb{E}_{(x, y) \sim D}\,\ell\big(f_{W_0 + BA}(x),\, y\big)$. At inference, only the compact low-rank update is applied to the frozen base model parameters $W_0$, allowing rapid deployment and storage efficiency (Wang et al., 2024).
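The low-rank factorization above can be sketched in a few lines of numpy; the dimensions here are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 32, 4           # layer dims and adapter rank, r << min(d, k)

W0 = rng.normal(size=(d, k))  # frozen base weight (the "anchor")
B = np.zeros((d, r))          # LoRA factor B, conventionally zero-initialized
A = rng.normal(size=(r, k)) * 0.01  # LoRA factor A, small random init

delta_W = B @ A               # low-rank update: rank(delta_W) <= r
W = W0 + delta_W              # effective weight applied at inference

# With B initialized to zero, the adapter is a no-op before any training:
assert np.allclose(W, W0)

# Storage savings: only r*(d+k) adapter params vs d*k for a full update
print(r * (d + k), "adapter params vs", d * k, "full-rank params")
```

The zero initialization of $B$ means training starts exactly at the base model's behavior, which is also why the update is inseparable from the specific $W_0$ it was trained against.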
3. Synthetic Data Generation and Filtering
The core challenge is approximating the marginal data distribution of the (unavailable) original training set using only limited accessible information. Trans-LoRA addresses this with a two-stage synthetic data pipeline:
3.1 Synthetic Data Generation via In-Context LLM Synthesis
An instruction-tuned LLM (commonly the target model $M_t$, or any suitably aligned open-source model) is selected as the generator. A small set of public or permissible seed examples demonstrates the I/O format and task style, serving an illustrative purpose only. The generator is queried with a prompt patterned as:
```text
Here are 5 examples of the task (prompts and correct completions).
Now generate 1 new example following the same format:
```
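A minimal sketch of assembling such an in-context prompt from seed examples; the `build_prompt` helper and the JSON record format are illustrative assumptions, not the paper's exact implementation:

```python
import json
import random

def build_prompt(seed_pool, n_shots=5):
    """Format a few seed examples in-context, then request one new example.

    `seed_pool` is a list of {"prompt": ..., "completion": ...} records
    (a hypothetical seed format; the paper only fixes the general pattern).
    """
    shots = random.sample(seed_pool, min(n_shots, len(seed_pool)))
    lines = [f"Here are {len(shots)} examples of the task "
             "(prompts and correct completions)."]
    for ex in shots:
        lines.append(json.dumps(ex))
    lines.append("Now generate 1 new example following the same format:")
    return "\n".join(lines)

seeds = [{"prompt": f"Q{i}", "completion": f"A{i}"} for i in range(8)]
prompt = build_prompt(seeds)
# `prompt` would be sent to the generator LLM (e.g., the target model M_t);
# each completion is parsed back into a synthetic (prompt, completion) pair.
```

Repeating this call many times, with shots resampled each time, yields the large raw synthetic pool that the next stage filters.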
3.2 Discriminative Filtering
To ensure that the synthetic distribution matches the relevant subspace of the original real dataset, a lightweight PEFT discriminator $\phi$ is trained concurrently with the original LoRA adapters. The discriminator distinguishes between real examples $x \sim D_{\text{real}}$ and synthetic examples $x \sim D_{\text{syn}}$ by optimizing a binary cross-entropy objective, $\mathcal{L}_{\phi} = -\,\mathbb{E}_{x \sim D_{\text{real}}}\big[\log \phi(x)\big] - \mathbb{E}_{x \sim D_{\text{syn}}}\big[\log\big(1 - \phi(x)\big)\big]$. At transfer, the discriminator filters $D_{\text{syn}}$ to obtain $D_{\text{filt}}$, comprising only synthetic examples judged sufficiently similar to real training data by exceeding a confidence threshold (Wang et al., 2024).
| Stage | Input → Output | Purpose |
|---|---|---|
| LLM In-Context Generation | Seeds → $D_{\text{syn}}$ | Create a large synthetic dataset in the required task format |
| Discriminative Filtering | $D_{\text{syn}}$ → $D_{\text{filt}}$ | Select synthetic samples resembling true training data |
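The filtering stage reduces to thresholding a score. The sketch below assumes a generic scoring callable standing in for the trained PEFT discriminator $\phi$; the toy length-based score exists only to make the example runnable:

```python
def filter_synthetic(samples, disc_score, threshold=0.5):
    """Keep synthetic samples the discriminator judges sufficiently real-like.

    `disc_score(x)` returns an estimate of P(real | x) from the discriminator;
    here it is a stand-in callable, since the real phi is a trained PEFT model.
    """
    return [x for x in samples if disc_score(x) >= threshold]

# Toy stand-in score: similarity of sample length to the seed examples'.
seed_len = 12.0
score = lambda x: 1.0 / (1.0 + abs(len(x) - seed_len) / seed_len)

synthetic = ["short", "about twelve!",
             "a much longer synthetic example than seeds"]
kept = filter_synthetic(synthetic, score, threshold=0.8)
print(kept)  # only the sample whose score clears the confidence threshold
```

The threshold trades off dataset size against fidelity: a stricter cut keeps fewer, more in-distribution samples for the distillation step.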
4. Distillation-Based Transfer Algorithm
Given the source model $M_s$ with trained adapters $\theta_s$ and a target base model $M_t$, Trans-LoRA learns new adapters $\theta_t$ for $M_t$ by distilling knowledge via $\mathcal{L}_{\text{distill}}(\theta_t) = \mathbb{E}_{x \sim D_{\text{filt}}}\,\mathrm{CE}\big(M_t(x; \theta_t),\, M_s(x; \theta_s)\big)$. Here, $M_s(\cdot\,; \theta_s)$ acts as teacher and $M_t(\cdot\,; \theta_t)$ as student. The process is standard iterative gradient descent over $D_{\text{filt}}$, with $\theta_t = (A_t, B_t)$ initialized at random and no extra regularization beyond lightweight weight decay (usually set to zero).
The complete transfer loop pseudocode is as follows:
```text
Input: M_s, θ_s, M_t, φ, seeds, N_syn
1. D_filt = SYNTH_FILTER(M_t, seeds, φ, N_syn)
2. initialize θ_t = (A_t, B_t) at random
3. while not converged:
       sample batch B ⊂ D_filt
       L ← CE(M_t(x; θ_t), M_s(x; θ_s))   # ∀ x ∈ B
       θ_t ← θ_t − η ∇_{θ_t} L
4. return θ_t
```
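The teacher–student loop can be exercised end to end on a toy problem. Everything below is a stand-in under stated assumptions: the "teacher" is a fixed linear softmax head playing the role of $M_s$ with its adapters, and the "student" parameters play the role of the target model's new adapter weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy stand-ins (not the paper's models): feature dim, classes, |D_filt|
d, v, n = 8, 5, 256
W_teacher = rng.normal(size=(d, v))     # fixed teacher head (role of M_s, θ_s)
theta_t = np.zeros((d, v))              # student params, step 2 of the loop

X = rng.normal(size=(n, d))             # filtered synthetic inputs D_filt
P = softmax(X @ W_teacher)              # teacher soft labels M_s(x; θ_s)

eta = 0.5
for _ in range(500):                    # step 3: minimize CE(student, teacher)
    Q = softmax(X @ theta_t)
    grad = X.T @ (Q - P) / n            # gradient of mean cross-entropy
    theta_t -= eta * grad               # plain SGD step, no regularization

Q = softmax(X @ theta_t)
kl = float(np.mean(np.sum(P * (np.log(P) - np.log(Q)), axis=1)))
print(f"final mean KL(teacher || student) = {kl:.4f}")
```

Because the cross-entropy against the teacher's soft labels is minimized exactly when the student reproduces the teacher's output distribution, the KL divergence shrinks toward zero over the loop, which is the sense in which the new adapters inherit the source adapters' behavior.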
5. Empirical Results and Ablations
Trans-LoRA evaluations consider Llama and Gemma model families, including cross-family and cross-size transfers, and multiple PEFT variants. Benchmarks include BBH (27 tasks), MMLU (57 subjects), MBPP/MBPP+ (code tasks), and GSM8K (math).
5.1 Main Transfer Results
Representative BBH transfer results (average task accuracy):
| Source → Target → Disc | Source LoRA | Target no LoRA | Trans-LoRA |
|---|---|---|---|
| Llama2-7B → Llama2-13B → Llama2-7B | 43.32% | 37.85% | 43.41% |
| Gemma2B → Gemma7B → Gemma2B | 31.84% | 37.75% | 43.61% |
| Llama2-7B → Gemma7B → Gemma2B | 43.32% | 37.75% | 45.41% |
MMLU (57 tasks):
| Source → Target → Disc | Source LoRA | Target no LoRA | Trans-LoRA |
|---|---|---|---|
| Llama2-7B → Llama2-13B → Llama2-7B | 45.89% | 53.72% | 55.09% |
| Gemma2B → Gemma7B → Gemma2B | 42.34% | 60.45% | 61.23% |
Comparable or improved transfer holds for MBPP/MBPP+ and GSM8K. Across nearly 90 diverse tasks, Trans-LoRA enables lossless or enhanced transfer, even when jumping across pre-training regimes or PEFT methods (Wang et al., 2024).
5.2 Ablation Studies
- Distillation Data Choice: Filtering synthetic samples with the discriminator gives superior transfer (BBH: 43.41%) compared to Wikipedia text (37.3%), unfiltered synthetic (41.95%), or seed-only (39.82%).
- PEFT Method Transfer: Transfers between LoRA, DoRA, and Prompt-Tuning on Gemma 2B→7B remain effective (40–44% BBH).
- Multi-Hop Transfer: Chaining transfers (e.g., 7B→13B→Gemma-7B) yields no material degradation (ending at 45.04% vs. 43.32% source on BBH).
- Synthetic Dataset Size: Performance increases smoothly with $N_{\text{syn}}$ for a fixed number of updates.
6. Limitations and Future Prospects
Trans-LoRA incurs additional, but modest, compute for synthetic data generation and filtering. Direct (dataless) adapter mapping remains an open target for future research. In specific high-complexity or ambiguous domains (e.g., Disambiguation-QA), synthetic generation may yield invalid samples, mitigated by increasing the seed count or by tailored prompt engineering. Reliance on base LLM alignment can propagate generator hallucination or distributional drift; adversarial filtering or more advanced synthetic data generation strategies are plausible future directions (Wang et al., 2024).
7. Relation to Alternative Data-Free Transfer Methods
Alternative frameworks such as Cross-LoRA (Xia et al., 7 Aug 2025) provide entirely data-free, training-free adapter transfer by analytical subspace alignment via truncated SVD and Frobenius-optimal projections. In contrast, Trans-LoRA's reliance on synthetic data and distillation enables transfer across broader architectural and methodological boundaries, including cross-family and cross-PEFT settings. Empirical results indicate that in scenarios where direct subspace alignment is impractical, Trans-LoRA delivers a robust, scalable solution for adapter migration in proprietary or privacy-critical environments.