
Drop-Upcycling in MoE Models

Updated 20 November 2025
  • Drop-upcycling is a framework for model parameter transfer that blends pre-trained dense initialization with partial re-initialization to enhance expert diversity in mixture-of-experts architectures.
  • It mitigates convergence limitations by ensuring rapid knowledge transfer while promoting expert specialization through controlled re-initialization of subnetworks.
  • Experimental results demonstrate that drop-upcycling achieves competitive early performance and efficient long-term convergence, improving compute efficiency over naive upcycling methods.

Drop-upcycling is a framework for model parameter transfer that strategically combines pre-trained dense model initialization with partial statistical re-initialization of expert subnetworks. Originally developed in the context of mixture-of-experts (MoE) transformers, drop-upcycling addresses the convergence limitations observed in prior MoE construction methods—specifically naïve upcycling—by promoting expert specialization without sacrificing early-stage performance. Drop-upcycling and related upcycling techniques can also be applied to non-expert submodule transfers, such as the post-training adaptation of key-value (KV) compression modules in attention architectures.

1. Background and Motivation

Traditional MoE architectures augment transformer networks by replacing each dense feed-forward (FFN) sublayer with a router and multiple expert FFNs. Only a sparse subset of experts (e.g., the top-$k$ per token) is activated, resulting in increased model capacity relative to the number of active parameters and compute required per token. The upcycling approach introduced by Komatsuzaki et al. (2023) begins from a pre-trained dense model and replicates its FFN weights across all MoE experts, ensuring rapid early-stage transfer and a nontrivial performance “jump start.” However, because all experts start identically, there is little incentive to progress beyond the initial local optimum, and expert specialization occurs slowly. The result is a marked slowdown in the learning-curve slope compared to scratch-trained MoE models: upcycled models “coast” rather than specialize and eventually plateau at suboptimal performance. This phenomenon motivates hybrid schemes that combine knowledge transfer with statistical diversity (Nakamura et al., 26 Feb 2025).

2. Drop-Upcycling Algorithm

Drop-upcycling produces an MoE by initializing experts as partial mixtures of pre-trained and newly sampled weights. The procedure is as follows:

A. Expert Replication

All non-FFN parameters are copied into the MoE directly. Each expert $i$ of the $n$ experts in a layer receives a copy of the original FFN matrices:

$$W^{(i)}_{\text{gate}} = W_{\text{gate}}, \quad W^{(i)}_{\text{up}} = W_{\text{up}}, \quad W^{(i)}_{\text{down}} = W_{\text{down}}$$

The router weights $W_{\text{router}}$ are randomly initialized.

B. Diversity via Partial Re-initialization

For each expert, a fraction $r$ of the weight columns (for $W_{\text{gate}}$, $W_{\text{up}}$) or rows (for $W_{\text{down}}$) is randomly selected. For each weight type:

  1. A subset $\mathcal S$ of size $\lfloor r d_f \rfloor$ is drawn from the intermediate dimension.
  2. The empirical mean $\mu_{\text{type}}$ and standard deviation $\sigma_{\text{type}}$ of the weights at those coordinates are computed.
  3. New weights are sampled as $R_{\text{type}} \sim \mathcal N(\mu_{\text{type}}, \sigma_{\text{type}}^2)$.
  4. The updated matrix is

$$\widetilde W_{\text{type}} = I_{\mathcal S} \odot R_{\text{type}} + (1 - I_{\mathcal S}) \odot W_{\text{type}},$$

where $I_{\mathcal S}$ is the binary mask for $\mathcal S$.

Each expert thereby retains a $(1-r)$ fraction of the original weights while receiving independent random initialization in the remaining $r$ fraction. For $r=0.5$, every expert is “half old, half new.”
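The per-expert re-initialization above can be sketched in NumPy (an illustrative sketch; function and variable names are our own, not from the paper, and weights are laid out as $(d_{\text{model}}, d_f)$ so that the intermediate dimension is the column axis):

```python
import numpy as np

def drop_upcycle_matrix(W, r, rng, axis=1):
    """Partially re-initialize one expert's weight matrix.

    A fraction r of the slices along the intermediate dimension
    (columns for W_gate / W_up with axis=1, rows for W_down with
    axis=0) is replaced by samples from a normal distribution whose
    mean and standard deviation match the selected original entries.
    """
    d_f = W.shape[axis]
    n_new = int(np.floor(r * d_f))
    idx = rng.choice(d_f, size=n_new, replace=False)  # the subset S

    # Measure statistics of the selected slices, then resample them.
    sel = np.take(W, idx, axis=axis)
    mu, sigma = sel.mean(), sel.std()
    resampled = rng.normal(mu, sigma, size=sel.shape)

    W_new = W.copy()
    if axis == 1:
        W_new[:, idx] = resampled
    else:
        W_new[idx, :] = resampled
    return W_new, idx

rng = np.random.default_rng(0)
W_gate = rng.normal(0.0, 0.02, size=(512, 2048))        # (d_model, d_f)
expert_gate, S = drop_upcycle_matrix(W_gate, r=0.5, rng=rng, axis=1)

# The (1-r) fraction of un-selected columns is retained verbatim.
kept = np.setdiff1d(np.arange(2048), S)
```

Calling this once per expert with an independent random subset yields experts that share the retained coordinates but differ in the resampled ones.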

C. MoE Training

Standard MoE training is applied to the initialized model, using the cross-entropy loss for next-token prediction:

$$\mathcal L_{\text{NLL}} = -\sum_t \log p(x_t \mid x_{<t})$$

with an auxiliary load-balancing loss on the router if desired (Nakamura et al., 26 Feb 2025).
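For concreteness, the combined objective can be sketched as follows (a minimal NumPy sketch; the auxiliary term follows the standard Switch-Transformer-style load-balancing formulation, which may differ in detail from the variant used in the paper):

```python
import numpy as np

def nll_loss(logits, targets):
    """Cross-entropy next-token loss: mean of -log p(x_t | x_<t)."""
    # Log-softmax over the vocabulary axis.
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def load_balance_loss(router_probs, expert_ids, n_experts):
    """Switch-style auxiliary loss: n * sum_i f_i * P_i, where f_i is
    the fraction of tokens routed to expert i and P_i is the mean
    router probability assigned to expert i."""
    f = np.bincount(expert_ids, minlength=n_experts) / len(expert_ids)
    P = router_probs.mean(axis=0)
    return n_experts * float(f @ P)

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 100))          # (tokens, vocab)
targets = rng.integers(0, 100, size=8)
probs = rng.dirichlet(np.ones(4), size=8)   # (tokens, n_experts)
top1 = probs.argmax(axis=1)                 # tokens' chosen experts

# Total training loss with a small balancing coefficient.
total = nll_loss(logits, targets) + 0.01 * load_balance_loss(probs, top1, 4)
```

Perfectly uniform routing drives the auxiliary term to its minimum of 1, while collapsed routing (all tokens to one expert) inflates it.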

3. Theoretical and Empirical Rationale

Drop-upcycling’s success is attributed to the retention of a robust "common subspace" and the promotion of expert diversity:

  • Common Subspace: Each expert's retained $(1-r)$ fraction of coordinates ensures that, under top-$k$ selection, the active experts share an intersection of un-reinitialized coordinates of expected size $(1-r)^k d_f$, preserving knowledge transfer from the dense precursor.
  • New Subspaces: The $r$ fraction re-initialized in each expert is unique to that expert, enabling rapid specialization and functional diversity unavailable to naïve upcycling.
  • Error Bounds: The degree of overlap between experts' retained dimensions fluctuates around its expectation as $O(1/\sqrt{d_f})$, and thus becomes negligible at scale.
  • Convergence Profile: Learning curves for drop-upcycled MoEs match the slope of scratch-trained MoE curves (i.e., they maintain rapid improvement throughout training) while starting from a lower initial loss due to transferred knowledge, implying that drop-upcycling is never overtaken in long-run performance (Nakamura et al., 26 Feb 2025).
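The expected common-subspace size can be checked numerically (a small Monte-Carlo sketch under the assumption, made in the rationale above, that each expert's re-initialized subset is drawn independently):

```python
import numpy as np

def expected_overlap(r, k, d_f):
    """Expected number of intermediate coordinates kept original
    in all k active experts simultaneously: (1-r)^k * d_f."""
    return (1 - r) ** k * d_f

rng = np.random.default_rng(0)
r, k, d_f, trials = 0.5, 2, 2048, 200
n_reinit = int(np.floor(r * d_f))

counts = []
for _ in range(trials):
    # Each expert independently re-initializes its own random subset;
    # count the coordinates kept original by all k experts at once.
    masks = []
    for _ in range(k):
        kept = np.ones(d_f, dtype=bool)
        kept[rng.choice(d_f, size=n_reinit, replace=False)] = False
        masks.append(kept)
    counts.append(np.logical_and.reduce(masks).sum())

mean_overlap = float(np.mean(counts))  # close to expected_overlap(r, k, d_f)
```

With $r=0.5$, $k=2$, and $d_f=2048$ the expected overlap is $512$, and the empirical mean fluctuates around that value on the $O(\sqrt{d_f})$ scale, consistent with the error bound above.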

4. Experimental Evidence

Experiments on LLaMA-style transformers at multiple scales (152M, 1.5B, 3.7B, 13B) demonstrate the advantage of drop-upcycling. In every FFN, eight experts (Top-2 routing, Mixtral style) are employed, resulting in models such as:

| Backbone | Active Params | Total Params |
|----------|---------------|--------------|
| 8×152M   | 190M          | 417M         |
| 8×1.5B   | 2.6B          | 8.9B         |
| 8×3.7B   | 5.9B          | 18B          |

Training uses the LLM-JP v3 corpus (2.1T tokens in English/Japanese/code) for 500B tokens (MoE) and 1T–2T tokens (dense models), with AdamW and cosine decay.

  • Compute Efficiency: An 8×3.7B MoE achieves accuracy comparable to a 13B dense model with only 1/4 the FLOPs (1.98×10²² vs 7.4×10²²).
  • Task Scores: For 8×3.7B, average scores (final 12 tasks) are 44.4 for drop-upcycling versus 44.5 for dense 13B, both far exceeding scratch or naïve upcycling (Nakamura et al., 26 Feb 2025).
  • Ablations: Performance is sensitive to $r$; $r=0.5$ yields optimal long-term results. Naïve upcycling ($r=0$) causes expert collapse, while full re-initialization ($r=1$) sacrifices initial knowledge transfer.

Routing specialization emerges only with drop-upcycling: experts self-organize by domain (e.g., language or code), which is not observed in naïve upcycling.

5. Extensions Beyond MoE: KV-Compression Upcycling

The principles of drop-upcycling and statistical re-initialization transfer to other model modules. X-EcoMLA demonstrates this by upcycling pre-trained multi-head attention (MHA) blocks to multi-head latent attention (MLA) with extreme KV-cache compression (Li et al., 14 Mar 2025). X-EcoMLA reframes upcycling as a lightweight post-training adaptation:

  • SVD-based Initialization: A low-rank MLA block is initialized from MHA weights via singular value decomposition, retaining dominant subspaces.
  • Knowledge Distillation: KL-divergence-based supervised fine-tuning uses a larger teacher to transfer dark knowledge.
  • Direct Preference Optimization (DPO): Objective tuning with human feedback refines generation preferences.

Empirically, X-EcoMLA achieves up to 6.4× cache size reduction without degradation in average score, requiring only 70 GPU-hours and 3.6B tokens for Llama-3.2-1B-Inst, in contrast to the prohibitive compute required for full MLA pre-training (Li et al., 14 Mar 2025).
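The SVD-based initialization idea can be illustrated generically (a schematic NumPy sketch of truncated-SVD factorization, not X-EcoMLA's exact procedure; the shapes, rank, and names are illustrative assumptions):

```python
import numpy as np

def svd_low_rank_init(W, rank):
    """Initialize a rank-limited factorization A @ B ≈ W from a
    pre-trained projection W, keeping the dominant singular subspace
    (the best rank-`rank` approximation in Frobenius norm)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
W_k = rng.normal(size=(256, 512))         # stand-in for a pre-trained key projection
A, B = svd_low_rank_init(W_k, rank=64)    # route through a 64-dim latent space

# Relative reconstruction error of the low-rank initialization.
rel_err = np.linalg.norm(W_k - A @ B) / np.linalg.norm(W_k)
```

The latent dimension (here 64) plays the role of the compressed KV dimension: keys and values are reconstructed through the low-rank bottleneck, and the retained dominant subspace gives the adapted module a warm start before distillation.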

6. Future Directions and Open Questions

Future research directions, as indicated by the original works, include:

  • Iterative or dynamic re-initialization schedules to inject diversity during training rather than exclusively at initialization.
  • Application to fine-grained or shared-expert architectures, as in DeepSeekMoE.
  • Optimal selection or learning of the re-initialization ratio $r$ (potentially per-layer or per-expert).
  • Theoretical characterization of specialization and convergence dynamics, especially for very high-expert-count regimes.
  • Interaction of partial re-initialization with advanced router balancing or expert-choice algorithms.

A plausible implication is that further variants could generalize drop-upcycling to any submodule where knowledge transfer and specialization tension exist.

7. Summary Table: Drop-Upcycling vs Naïve Upcycling and Scratch

| Method          | Initial Performance | Specialization | Long-Term Convergence | Compute Efficiency |
|-----------------|---------------------|----------------|-----------------------|--------------------|
| Scratch MoE     | Low                 | High           | High                  | High               |
| Naïve Upcycling | High                | Low            | Poor                  | Medium             |
| Drop-Upcycling  | High                | High           | Highest               | Highest            |

Drop-upcycling constitutes the first method to reconcile the tension between immediate knowledge transfer and expert specialization in MoE construction, producing models that combine the benefits of both dense and sparse training regimens. It also generalizes as an efficient transfer protocol for other neural architectural changes involving submodule “upcycling” (Nakamura et al., 26 Feb 2025, Li et al., 14 Mar 2025).
