ButterflyMoE: Memory-Efficient Mixture-of-Experts

Updated 24 January 2026
  • ButterflyMoE is a Mixture-of-Experts paradigm that employs a shared ternary-quantized substrate and butterfly rotations to achieve sub-linear memory scaling.
  • It compresses expert memory up to 150× compared to standard MoE architectures, enabling deployment on edge devices under tight resource constraints.
  • By leveraging geometric rotations and quantization-aware training, ButterflyMoE maintains expert diversity and accuracy while significantly reducing computational overhead.

ButterflyMoE is a Mixture-of-Experts (MoE) paradigm that replaces the traditional storage of $N$ independent expert weight matrices with a geometric parameterization based on structured rotations of a shared ternary-quantized substrate. This approach achieves sub-linear memory scaling in the number of experts and allows large-scale expert architectures to be deployed within the tight resource constraints of edge devices. ButterflyMoE achieves up to 150-fold memory compression with negligible degradation in accuracy, fundamentally altering the cost structure of sparse expert models for on-device and resource-constrained inference (Karmore, 20 Jan 2026).

1. Linear Bottleneck in Standard MoE Architectures

Conventional MoE layers consist of $N$ experts, each requiring a distinct weight matrix of size $d_{\text{ff}} \times d_{\text{model}}$, with $d_{\text{model}} \equiv d$ and $d_{\text{ff}} \approx 4d$ in practice. The total expert memory is thus $M_{\mathrm{MoE}} = N\, d_{\text{ff}}\, d_{\text{model}} \times b = O(Nd^2)$ bytes, where $b$ is the number of bytes per weight (e.g., 4 B for FP32). For $N = 64$ and $d = 512$, the parameter memory footprint alone is approximately 256 MB, exceeding the available budget of many edge devices (e.g., a Jetson Nano with 4 GB RAM). Furthermore, loading these parameter matrices from DRAM on each forward pass incurs significant energy cost (e.g., 13 mJ per pass at 256 MB). Even with aggressive quantization (e.g., 2-bit schemes as in QMoE or MoQE), the $O(Nd^2)$ scaling remains unchanged, limiting the number of deployable experts (Karmore, 20 Jan 2026).
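The 256 MB figure above follows directly from the memory formula; a quick sanity check, using the dimensions quoted in the text:

```python
# Back-of-envelope check of the O(N d^2) expert-memory figure quoted above.
# Values follow the text: N = 64 experts, d_model = 512, d_ff ≈ 4 * d_model,
# and b = 4 bytes per FP32 weight.
N, d_model = 64, 512
d_ff = 4 * d_model                   # 2048
bytes_per_weight = 4                 # FP32

expert_memory = N * d_ff * d_model * bytes_per_weight
print(expert_memory / 2**20, "MiB")  # 256.0 MiB, matching the text
```

Doubling either $N$ or $d$ makes the linear-vs-quadratic dependence obvious: twice the experts doubles the footprint, while twice the width quadruples it.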

2. Ternary-Quantized Shared Substrate

ButterflyMoE introduces a single shared prototype weight matrix

$$W_{\mathrm{base}} \in \{-1, 0, +1\}^{d_{\text{ff}} \times d_{\text{model}}}$$

shared across all experts and quantized to three levels (1.58 bits per weight). During training, a full-precision copy $W^{(\mathrm{fp})}$ of this matrix is maintained, and quantization is performed via a straight-through estimator (STE) with a scaling parameter

$$\gamma = \frac{1}{d_{\text{ff}}\, d_{\text{model}}} \sum_{i,j} \left|W^{(\mathrm{fp})}_{ij}\right|$$

and $Q(W^{(\mathrm{fp})}) = \gamma\, \mathrm{round}\!\big(W^{(\mathrm{fp})}/\gamma\big)$. At inference, all experts use the same quantized grid $\{-1, 0, +1\}$, thereby amortizing the storage cost of the substrate across $N$ experts and eliminating redundant storage (Karmore, 20 Jan 2026).
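A minimal sketch of the absmean ternary quantizer described above, in numpy. The clip to $\{-1, 0, +1\}$ after rounding is our assumption (it matches common ternary/BitNet-style practice; the formula as written would otherwise permit values outside the grid), and the STE backward pass is omitted since numpy has no autograd:

```python
import numpy as np

def ternary_quantize(W_fp):
    """Absmean ternary quantization: gamma * round(W / gamma), clipped to {-1, 0, +1}.
    (Clipping is an assumption here; in training the STE would pass gradients
    through this operation unchanged.)"""
    gamma = np.mean(np.abs(W_fp))                  # scaling parameter γ
    W_q = np.clip(np.round(W_fp / gamma), -1, 1)   # ternary grid {-1, 0, +1}
    return gamma * W_q, gamma

rng = np.random.default_rng(0)
W = rng.normal(size=(2048, 512)).astype(np.float32)  # full-precision copy
W_hat, gamma = ternary_quantize(W)

# Every dequantized entry sits on the three-level grid scaled by γ.
assert set(np.unique(W_hat / gamma)) <= {-1.0, 0.0, 1.0}
```

Only the sign pattern (2 bits, 1.58 bits information-theoretically) plus the single scalar $\gamma$ need to be stored, which is what makes sharing one substrate across all experts so cheap.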

3. Structured Butterfly Orbits and Expert Generation

Instead of storing each expert weight matrix directly, ButterflyMoE instantiates expert $i$ as

$$W_i = R_i^{\mathrm{out}}\, \big[Q(W_{\mathrm{base}})\big]\, (R_i^{\mathrm{in}})^{T}$$

where $R_i^{\mathrm{in}} = \mathcal{B}(\theta_i)$ and $R_i^{\mathrm{out}} = \mathcal{B}(\phi_i)$ are orthogonal "butterfly" matrices parameterized by $O(d \log d)$ Givens angles. For $d = 2^m$, a butterfly factorization is

$$\mathcal{B}(\theta) = \prod_{\ell=1}^{m} \big[D_\ell(\theta)\, P_\ell\big]$$

where $D_\ell(\theta)$ are block-diagonal matrices of $2 \times 2$ rotations and $P_\ell$ are fixed perfect-shuffle permutations. Each expert occupies a geometrically distinct "orbit" in weight space, generated by reorienting the shared ternary substrate. This organization provides expert diversity not by storing separate matrices but through learned rotations, dramatically reducing the total parameter storage (Karmore, 20 Jan 2026).
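The factorization above can be sketched concretely. The code below builds $\mathcal{B}(\theta)$ for $d = 2^m$ from $m$ stages of $2 \times 2$ Givens rotations, pairing indices FFT-style with stride $2^{\ell-1}$ at stage $\ell$; this is one standard way to realize the $D_\ell P_\ell$ factors (the paper's exact permutation convention may differ), and each stage consumes $d/2$ angles, giving the stated $(d/2)\log_2 d$ total:

```python
import numpy as np

def butterfly_matrix(theta, d):
    """Dense orthogonal butterfly matrix from an (m, d/2) array of Givens angles,
    for d = 2^m. Each stage rotates disjoint index pairs, so each stage --
    and hence the product -- is orthogonal."""
    m = d.bit_length() - 1
    assert 1 << m == d and theta.shape == (m, d // 2)
    B = np.eye(d)
    for l in range(m):
        stride = 1 << l
        stage = np.eye(d)
        k = 0
        for start in range(0, d, 2 * stride):      # blocks of size 2 * stride
            for i in range(start, start + stride): # pair i with i + stride
                j = i + stride
                c, s = np.cos(theta[l, k]), np.sin(theta[l, k])
                stage[i, i], stage[i, j] = c, -s
                stage[j, i], stage[j, j] = s, c
                k += 1
        B = stage @ B
    return B

d = 8
rng = np.random.default_rng(0)
theta = rng.uniform(-np.pi, np.pi, size=(3, d // 2))  # (d/2) * log2(d) = 12 angles
B = butterfly_matrix(theta, d)
assert np.allclose(B @ B.T, np.eye(d))  # orthogonality holds by construction
```

A dense $8 \times 8$ orthogonal matrix has 28 free parameters; the butterfly parameterization reaches a structured subset of them with only 12 angles, which is exactly the trade that makes per-expert rotations affordable.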

4. Sub-Linear Memory Scaling and Comparative Analysis

The total memory for ButterflyMoE with NN experts is

$$M_{\mathrm{ButterflyMoE}} = O\!\left(d^2 + N\, d \log d\right)$$

where

  • The shared ternary-quantized substrate: $O(d^2)$
  • Per-expert butterfly rotations: $2 \times (d/2) \times \log_2 d$ angles per expert, i.e. $O(N d \log d)$ in total

At large $N$, the per-expert storage is dominated by the $O(N d \log d)$ rotation term, a major efficiency gain over $O(N d^2)$. For example, with $d_{\text{model}} = 512$ and $d_{\text{ff}} = 2048$, ButterflyMoE achieves a compression ratio of approximately 154× relative to standard MoE (Karmore, 20 Jan 2026).

| Configuration | Standard MoE | QMoE (sub-1-bit) | MoQE (2-bit) | PuzzleMoE/MC | ButterflyMoE |
|---|---|---|---|---|---|
| Memory (64 experts) | 256 MB | 13–26 MB | 51 MB | 64–128 MB | 1.9 MB |
| Compression ratio | 1× (baseline) | 10–20× | — | 2–4× | 150× |

This sub-linear scaling permits a far greater number of experts (e.g., 64 on 4 GB devices vs. 8 for standard MoE) and supports the deployment of more granular expert-specialization models on memory-constrained platforms (Karmore, 20 Jan 2026).
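The scaling claim above can be checked with a raw parameter count, which factors out quantization and storage format (the exact 154× byte-level ratio in the text additionally depends on ternary packing and angle precision). We assume here that the input rotation acts on $d_{\text{model}}$ and the output rotation on $d_{\text{ff}}$:

```python
import math

d_model, d_ff = 512, 2048

def standard_moe_params(N):
    # N independent expert matrices: O(N d^2)
    return N * d_ff * d_model

def butterfly_moe_params(N):
    # One shared substrate plus per-expert Givens angles: O(d^2 + N d log d).
    substrate = d_ff * d_model
    angles_per_expert = (d_model // 2) * int(math.log2(d_model)) \
                      + (d_ff // 2) * int(math.log2(d_ff))
    return substrate + N * angles_per_expert

for N in (8, 64, 256):
    print(N, standard_moe_params(N), butterfly_moe_params(N))
```

At $N = 64$ this gives about 67.1 M parameters for standard MoE versus about 1.9 M for ButterflyMoE, and quadrupling $N$ to 256 barely moves the butterfly total (the shared substrate dominates less, but the per-expert increment is only $O(d \log d)$).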

5. Quantization-Aware Training and Outlier Suppression

The rotation parameters $\{\theta_i, \phi_i\}$ for all experts are trained end-to-end with the substrate, using a combined loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{balance}} \sum_{i=1}^{N} \left(\frac{n_i}{N_{\mathrm{total}}} - \frac{1}{N}\right)^2$$

where $n_i$ is the number of tokens routed to expert $i$, $N_{\mathrm{total}}$ is the total token count, and the balance term enforces load balancing as in Switch Transformers, with a default coefficient $\lambda_{\mathrm{balance}} = 0.01$. The STE passes gradients through the quantization operation. Learned input rotations dynamically disrupt the alignment of activation "outliers" with the quantized grid, redistributing large-magnitude activation components and suppressing quantization-induced mean squared error (MSE). Empirical evaluation shows a reduction in MSE from ∼51.3% (untrained rotations) to ∼1.4% (trained), a 97.2% improvement. This enables stable training at extremely low bitwidths, a regime where static quantization approaches fail (Karmore, 20 Jan 2026).
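The auxiliary balance term above is simple enough to state directly in code; this sketch evaluates the penalty for a given routing histogram (names are ours):

```python
import numpy as np

def balance_penalty(counts, lam=0.01):
    """Load-balancing penalty: lam * sum_i (n_i / N_total - 1/N)^2,
    where counts[i] = n_i is the number of tokens routed to expert i."""
    counts = np.asarray(counts, dtype=float)
    N = counts.size
    frac = counts / counts.sum()             # n_i / N_total
    return lam * np.sum((frac - 1.0 / N) ** 2)

# Perfectly uniform routing incurs zero penalty; collapsed routing is penalized.
assert balance_penalty([100, 100, 100, 100]) == 0.0
assert balance_penalty([400, 0, 0, 0]) > balance_penalty([150, 150, 50, 50])
```

In training, `counts` would come from the router's top-k assignments per batch (typically via a differentiable surrogate, as in Switch Transformers), and the penalty is added to the cross-entropy loss with the small default weight $\lambda = 0.01$.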

6. Empirical Evaluation on Language Modeling

Experiments were conducted on multi-domain sequence prediction and WikiText-103. Key results include:

  • At $N = 256$ experts, ButterflyMoE achieves 4.7 MB expert memory (vs. 1024 MB for standard MoE), a 150× reduction with negligible perplexity loss.
  • On a 4 GB Jetson Nano, standard MoE fits only 8 experts, QMoE fits ∼16, and ButterflyMoE fits 64 experts in just 1.9 MB.
  • Cosine similarity between different experts’ outputs remains low (mean off-diagonal ≈0.52 vs. 0.55 for standard MoE), indicating that geometric parameterization preserves expert diversity.
  • Compression performance across schemes:
| Method | Expert Memory ($N = 64$, $d = 512$) | Relative Compression |
|---|---|---|
| Standard MoE | 256 MB | 1× (baseline) |
| QMoE (sub-1-bit) | ∼13–26 MB | 10–20× |
| MoQE (2-bit) | 51 MB | — |
| PuzzleMoE / MC | 64–128 MB | 2–4× |
| ButterflyMoE | 1.9 MB | 150× |

These results demonstrate that ButterflyMoE enables an order-of-magnitude increase in feasible expert count on resource-constrained devices without significant accuracy penalties (Karmore, 20 Jan 2026).

7. Implementation, Limitations, and Deployment Considerations

In the ButterflyMoE forward pass (see Algorithm 1 in (Karmore, 20 Jan 2026)), expert matrices $W_i$ are never explicitly materialized. Instead, token embeddings are first rotated by the transposed per-expert input rotation $(R_i^{\mathrm{in}})^{T}$, then multiplied by the shared ternary substrate, and finally passed through the output rotation $R_i^{\mathrm{out}}$. Each expert invocation requires $O(d \log d)$ floating-point operations for the butterfly rotations, in addition to a single $O(d^2)$ ternary matrix multiply.
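The factorized pass can be sketched and checked against the (never-built) materialized expert. For brevity we stand in generic orthogonal matrices from a QR decomposition for the butterfly factors $\mathcal{B}(\theta_i)$ and $\mathcal{B}(\phi_i)$; the algebra of the forward pass is identical:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64

# Shared ternary substrate and stand-in orthogonal rotations (QR here, where
# the actual method uses butterfly factorizations with O(d log d) angles).
W_base = rng.choice([-1.0, 0.0, 1.0], size=(d_ff, d_model))
R_in, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))  # stand-in for B(θ_i)
R_out, _ = np.linalg.qr(rng.normal(size=(d_ff, d_ff)))       # stand-in for B(φ_i)

x = rng.normal(size=d_model)  # one token embedding

# Factorized forward pass, as executed: rotate in, multiply substrate, rotate out.
y = R_out @ (W_base @ (R_in.T @ x))

# The materialized expert W_i = R_out Q(W_base) R_in^T agrees exactly.
W_i = R_out @ W_base @ R_in.T
assert np.allclose(y, W_i @ x)
```

The point of the factorized ordering is that only the shared substrate and the compact rotation parameters ever reside in memory; the $d_{\text{ff}} \times d_{\text{model}}$ product $W_i$ exists only implicitly.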

Current limitations include:

  • Inference latency: On GPUs lacking fused butterfly kernels, inference can be slower than dense feedforward networks (up to 48.9× on Nvidia T4). Using only two butterfly layers (vs. $\log_2 d$) recovers a 2.5× speedup with negligible accuracy loss.
  • Regime of evaluation: Experiments thus far are limited to models with $<10^8$ parameters; scaling to billion-parameter MoEs and further analysis of the learned orbit geometry remain open research directions.
  • Substrate dimension constraint: The butterfly structure presumes $d$ is a power of 2; generalization to arbitrary $d$ or to alternative group-orbit parameterizations is untested.
  • Hardware specialization: The method incurs a per-token butterfly overhead; custom hardware support for $O(d \log d)$ structured rotations would further reduce inference cost.

ButterflyMoE’s core contribution is to break the traditional linear dependence of MoE parameter storage on expert count. This enables the deployment of domain-specialized, large-expert-count models on memory-constrained devices such as mobile deployments and resource-limited inference accelerators (Karmore, 20 Jan 2026).
