Drop-Upcycling in MoE Models
- Drop-upcycling is a framework for model parameter transfer that blends pre-trained dense initialization with partial re-initialization to enhance expert diversity in mixture-of-experts architectures.
- It mitigates convergence limitations by ensuring rapid knowledge transfer while promoting expert specialization through controlled re-initialization of subnetworks.
- Experimental results demonstrate that drop-upcycling achieves competitive early performance and efficient long-term convergence, improving compute efficiency over naive upcycling methods.
Drop-upcycling is a framework for model parameter transfer that strategically combines pre-trained dense model initialization with partial statistical re-initialization of expert subnetworks. Originally developed in the context of mixture-of-experts (MoE) transformers, drop-upcycling addresses the convergence limitations observed in prior MoE construction methods—specifically naïve upcycling—by promoting expert specialization without sacrificing early-stage performance. Drop-upcycling and related upcycling techniques can also be applied to non-expert submodule transfers, such as the post-training adaptation of key-value (KV) compression modules in attention architectures.
1. Background and Motivation
Traditional MoE architectures augment transformer networks by replacing each dense feed-forward (FFN) sublayer with a router and multiple expert FFNs. Only a sparse subset of experts (e.g., top-k per token) is activated, resulting in increased model capacity relative to the number of active parameters and compute required per token. The upcycling approach introduced by Komatsuzaki et al. (2023) begins from a pre-trained dense model and replicates its FFN weights across all MoE experts, ensuring rapid early-stage transfer and a nontrivial performance "jump start." However, because all experts start identically, there is little incentive to move beyond the initial local optimum, and expert specialization occurs slowly. The result is a marked slowdown in the learning-curve slope compared to scratch-trained MoE models: upcycled models "coast" rather than specialize and eventually plateau at suboptimal performance. This phenomenon motivates hybrid schemes that combine knowledge transfer with statistical diversity (Nakamura et al., 26 Feb 2025).
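The top-k routing described above can be sketched in a few lines. The following is an illustrative NumPy toy, not the paper's configuration: the tiny dimensions, ReLU expert FFNs, and softmax-over-selected-logits normalization (Mixtral-style) are all assumptions for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 16, 32, 8, 2

# Per-expert FFN weights plus a router (toy sizes; real models use gated FFNs).
W_in = rng.normal(0.0, 0.02, (n_experts, d_model, d_ff))
W_out = rng.normal(0.0, 0.02, (n_experts, d_ff, d_model))
W_router = rng.normal(0.0, 0.02, (d_model, n_experts))

def moe_ffn(x):
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ W_router                              # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]      # top-k expert ids per token
    sel = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(sel - sel.max(-1, keepdims=True))   # softmax over selected only
    gates /= gates.sum(-1, keepdims=True)
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j, e in enumerate(top[t]):
            h = np.maximum(x[t] @ W_in[e], 0.0)        # ReLU expert FFN
            y[t] += gates[t, j] * (h @ W_out[e])
    return y

out = moe_ffn(rng.normal(size=(4, d_model)))
```

Each token pays compute for only `top_k` of the `n_experts` FFNs, which is the capacity/compute decoupling the section describes.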
2. Drop-Upcycling Algorithm
Drop-upcycling produces an MoE by initializing experts as partial mixtures of pre-trained and newly sampled weights. The procedure is as follows:
A. Expert Replication
All non-FFN parameters are copied into the MoE directly. Each of the E experts in a layer receives a copy of the original FFN weight matrices (W_gate, W_up, W_down); the router weights are randomly initialized.
B. Diversity via Partial Re-initialization
For each expert, a fraction r of the weight columns (for W_gate, W_up) or rows (for W_down) is randomly selected along the intermediate dimension. For each weight type:
- A subset S of size r · d_f is drawn uniformly at random from the intermediate-dimension indices.
- The empirical mean μ and standard deviation σ of the weights at those coordinates are computed.
- New weights are sampled as W_new ~ N(μ, σ²).
- The updated matrix is
$$ W' = (1 - M) \odot W + M \odot W_{\text{new}}, $$
where M is the binary mask for the indices in S.
Each expert thereby retains a fraction 1 − r of the original weights while receiving independent random initialization in the remaining fraction r. For r = 0.5, every expert is "half old, half new."
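Steps A and B can be sketched as follows. This is a minimal NumPy illustration with toy dimensions; the random matrices stand in for a real pretrained checkpoint, and `drop_upcycle_expert` is a hypothetical helper name, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, r = 16, 64, 8, 0.5

# Stand-ins for pretrained dense FFN weights; columns of W_gate/W_up and
# rows of W_down are indexed by the intermediate dimension d_ff.
W_gate = rng.normal(0.0, 0.02, (d_model, d_ff))
W_up = rng.normal(0.0, 0.02, (d_model, d_ff))
W_down = rng.normal(0.0, 0.02, (d_ff, d_model))

def drop_upcycle_expert(W_gate, W_up, W_down, r, rng):
    """Copy the dense FFN, then re-initialize a random r-fraction of the
    intermediate dimension, sampling from N(mu, sigma^2) fit on that slice."""
    idx = rng.choice(d_ff, size=int(r * d_ff), replace=False)  # fresh mask per expert
    expert = []
    for W, axis in ((W_gate, 1), (W_up, 1), (W_down, 0)):
        W = W.copy()
        sl = (slice(None), idx) if axis == 1 else (idx, slice(None))
        mu, sigma = W[sl].mean(), W[sl].std()
        W[sl] = rng.normal(mu, sigma, W[sl].shape)  # partial re-initialization
        expert.append(W)
    return expert  # [W_gate_e, W_up_e, W_down_e]

experts = [drop_upcycle_expert(W_gate, W_up, W_down, r, rng)
           for _ in range(n_experts)]
```

Because each expert draws its own index set `idx`, the retained (1 − r)-fractions overlap only partially across experts, which is what creates the diversity discussed in Section 3.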
C. MoE Training
Standard MoE training is applied to the initialized model, using the cross-entropy loss for next-token prediction:
$$ \mathcal{L}_{\text{CE}} = -\sum_{t} \log p_\theta\left(x_{t+1} \mid x_{\le t}\right), $$
with an auxiliary load-balancing loss on the router if desired (Nakamura et al., 26 Feb 2025).
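The source does not specify the exact auxiliary term; one common formulation (Switch-Transformer style, n_experts · Σ_e f_e · P_e) can be sketched as:

```python
import numpy as np

def load_balancing_loss(logits, top_k):
    """Switch-style auxiliary loss: n_experts * sum_e f_e * P_e, where f_e is
    the fraction of token->expert assignments hitting expert e and P_e is the
    mean router probability of expert e. Equals 1.0 for a uniform router."""
    n_tokens, n_experts = logits.shape
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)                  # softmax per token
    top = np.argsort(logits, axis=-1)[:, -top_k:]          # dispatched experts
    f = np.bincount(top.ravel(), minlength=n_experts) / (n_tokens * top_k)
    P = probs.mean(axis=0)
    return n_experts * float(f @ P)

# With all-zero logits the router distribution is uniform and the loss is 1.0.
uniform = load_balancing_loss(np.zeros((10, 8)), top_k=2)
```

The total training objective would then be the cross-entropy loss plus a small multiple of this term.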
3. Theoretical and Empirical Rationale
Drop-upcycling’s success is attributed to the retention of a robust "common subspace" and the promotion of expert diversity:
- Common Subspace: Each expert retains a (1 − r)-fraction of the original coordinates, so under top-k selection the retained coordinates shared by all k active experts have expected size (1 − r)^k · d_f, ensuring knowledge transfer from the dense precursor.
- New Subspaces: The re-initialized r-fraction of each expert is unique to that expert, enabling rapid specialization and functional diversity unavailable to naïve upcycling.
- Error Bounds: The overlap between experts' retained dimensions fluctuates around its expectation on the order of O(1/√d_f) in relative terms, and thus becomes negligible at scale.
- Convergence Profile: Learning curves for drop-upcycled MoEs match the slopes of scratch-trained MoE curves (i.e., maintain rapid improvement throughout training) but start from a lower initial loss due to transferred knowledge, implying that drop-upcycling is never overtaken in long-run performance (Nakamura et al., 26 Feb 2025).
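The expected-overlap claim above can be checked numerically. The following Monte-Carlo sketch uses illustrative sizes; it verifies that k independent masks share roughly (1 − r)^k · d_f retained coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_retained_overlap(d_ff, r, k, trials=200):
    """Average intersection size of the retained (non-re-initialized)
    coordinate sets of k experts with independent random masks."""
    n_keep = d_ff - int(r * d_ff)
    sizes = []
    for _ in range(trials):
        masks = [set(rng.choice(d_ff, size=n_keep, replace=False))
                 for _ in range(k)]
        sizes.append(len(set.intersection(*masks)))
    return float(np.mean(sizes))

d_ff, r, k = 1024, 0.5, 2
expected = (1 - r) ** k * d_ff          # 256 shared retained coordinates
observed = mean_retained_overlap(d_ff, r, k)
```

For r = 0.5 and top-2 routing, about a quarter of the intermediate dimension remains common to both active experts, while the per-trial fluctuation around this mean shrinks as d_f grows.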
4. Experimental Evidence
Experiments on LLaMA-style transformers at multiple scales (152M, 1.5B, 3.7B, 13B) demonstrate the advantage of drop-upcycling. In every FFN, eight experts (Top-2 routing, Mixtral style) are employed, resulting in models such as:
| Backbone | Active Params | Total Params |
|---|---|---|
| 8×152M | 190M | 417M |
| 8×1.5B | 2.6B | 8.9B |
| 8×3.7B | 5.9B | 18B |
Training uses the LLM-JP v3 corpus (2.1T tokens in English/Japanese/code) for 500B tokens (MoE) and 1T–2T tokens (dense models), with AdamW and cosine decay.
- Compute Efficiency: An 8×3.7B MoE achieves accuracy comparable to a 13B dense model with only 1/4 the FLOPs (1.98×10²² vs 7.4×10²²).
- Task Scores: For 8×3.7B, average scores (final 12 tasks) are 44.4 for drop-upcycling versus 44.5 for dense 13B, both far exceeding scratch or naïve upcycling (Nakamura et al., 26 Feb 2025).
- Ablations: Performance is sensitive to the re-initialization ratio r; r = 0.5 yields optimal long-term results. Naïve upcycling (r = 0) causes expert collapse, while full re-initialization (r = 1) sacrifices initial knowledge transfer.
Routing specialization emerges only with drop-upcycling: experts self-organize by domain (e.g., language or code), which is not observed in naïve upcycling.
5. Extensions Beyond MoE: KV-Compression Upcycling
The principles of drop-upcycling and statistical re-initialization transfer to other model modules. X-EcoMLA demonstrates this by upcycling pre-trained multi-head attention (MHA) blocks to multi-head latent attention (MLA) with extreme KV-cache compression (Li et al., 14 Mar 2025). X-EcoMLA reframes upcycling as a lightweight post-training adaptation:
- SVD-based Initialization: A low-rank MLA block is initialized from MHA weights via singular value decomposition, retaining dominant subspaces.
- Knowledge Distillation: KL-divergence-based supervised fine-tuning uses a larger teacher to transfer dark knowledge.
- Direct Preference Optimization (DPO): Objective tuning with human feedback refines generation preferences.
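The SVD-based initialization step can be illustrated generically. The sketch below factors a pretrained projection into low-rank down/up factors via truncated SVD; it mirrors the idea of retaining dominant subspaces, not X-EcoMLA's exact parameterization, and `svd_lowrank_init` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

def svd_lowrank_init(W, rank):
    """Factor a pretrained projection W (d_out x d_in) as W_up @ W_down,
    keeping only the top-`rank` singular directions (dominant subspace)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W_up = U[:, :rank] * np.sqrt(S[:rank])               # (d_out, rank)
    W_down = np.sqrt(S[:rank])[:, None] * Vt[:rank]      # (rank, d_in)
    return W_up, W_down

# Stand-in for a pretrained key/value projection.
W_kv = rng.normal(0.0, 0.02, (64, 64))
W_up, W_down = svd_lowrank_init(W_kv, rank=8)  # 8-dim latent path for the KV cache
```

Only the `rank`-dimensional intermediate activations need to be cached at inference, which is the source of the KV-cache compression; distillation then recovers the accuracy lost to truncation.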
Empirically, X-EcoMLA achieves up to 6.4× cache size reduction without degradation in average score, requiring only 70 GPU-hours and 3.6B tokens for Llama-3.2-1B-Inst, in contrast to the prohibitive compute required for full MLA pre-training (Li et al., 14 Mar 2025).
6. Future Directions and Open Questions
Future research directions, as indicated by the original works, include:
- Iterative or dynamic re-initialization schedules to inject diversity during training rather than exclusively at initialization.
- Application to fine-grained or shared-expert architectures, as in DeepSeekMoE.
- Optimal selection or learning of the re-init ratio (potentially per-layer or per-expert).
- Theoretical characterization of specialization and convergence dynamics, especially for very high-expert-count regimes.
- Interaction of partial re-initialization with advanced router balancing or expert-choice algorithms.
A plausible implication is that further variants could generalize drop-upcycling to any submodule where knowledge transfer and specialization tension exist.
7. Summary Table: Drop-Upcycling vs Naïve Upcycling and Scratch
| Method | Initial Performance | Specialization | Long-Term Convergence | Compute Efficiency |
|---|---|---|---|---|
| Scratch MoE | Low | High | High | High |
| Naïve Upcycling | High | Low | Poor | Medium |
| Drop-Upcycling | High | High | Highest | Highest |
Drop-upcycling constitutes the first method to reconcile the tension between immediate knowledge transfer and expert specialization in MoE construction, producing models that combine the benefits of both dense and sparse training regimens—and generalizes as an efficient transfer protocol for other neural architectural changes involving submodule “upcycling” (Nakamura et al., 26 Feb 2025, Li et al., 14 Mar 2025).