ButterflyMoE: Memory-Efficient Mixture-of-Experts

Updated 24 January 2026
  • ButterflyMoE is a Mixture-of-Experts paradigm that employs a shared ternary-quantized substrate and butterfly rotations to achieve sub-linear memory scaling.
  • It compresses expert memory up to 150× compared to standard MoE architectures, enabling deployment on edge devices under tight resource constraints.
  • By leveraging geometric rotations and quantization-aware training, ButterflyMoE maintains expert diversity and accuracy while significantly reducing computational overhead.

ButterflyMoE is a Mixture-of-Experts (MoE) paradigm that replaces the traditional storage of $N$ independent expert weight matrices with a geometric parameterization based on structured rotations of a shared ternary-quantized substrate. This approach achieves sub-linear memory scaling in the number of experts and allows large-scale expert architectures to be deployed within the tight resource constraints of edge devices. ButterflyMoE achieves up to 150-fold memory compression with negligible degradation in accuracy, fundamentally altering the cost structure of sparse expert models for on-device and resource-constrained inference (Karmore, 20 Jan 2026).

1. Linear Bottleneck in Standard MoE Architectures

Conventional MoE layers consist of $N$ experts, each requiring a distinct weight matrix of size $d_{\text{ff}} \times d_{\text{model}}$, with $d_{\text{model}} \equiv d$ and $d_{\text{ff}} \approx 4d$ in practice. The total expert memory is thus $M_{\mathrm{MoE}} = N\, d_{\text{ff}}\, d_{\text{model}} \times b = O(Nd^2)$ bytes, where $b$ is the number of bytes per weight (e.g., 4 B for FP32). For $N = 64$ and $d = 512$, the parameter memory footprint alone is approximately 256 MB, exceeding the available budget of many edge devices (e.g., a Jetson Nano with 4 GB RAM). Furthermore, loading these parameter matrices from DRAM on each forward pass incurs significant energy cost (e.g., 13 mJ per pass at 256 MB). Even with aggressive quantization (e.g., 2-bit schemes as in QMoE or MoQE), the $O(Nd^2)$ scaling remains unchanged, limiting the number of deployable experts (Karmore, 20 Jan 2026).
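The 256 MB figure above follows directly from the memory formula; a quick sanity check, using the dimensions quoted in the text:

```python
# Back-of-envelope check of the O(N d^2) expert-memory figure quoted above.
# Values follow the text: N = 64 experts, d_model = 512, d_ff ≈ 4 * d_model,
# and b = 4 bytes per FP32 weight.
N, d_model = 64, 512
d_ff = 4 * d_model                   # 2048
bytes_per_weight = 4                 # FP32

expert_memory = N * d_ff * d_model * bytes_per_weight
print(expert_memory / 2**20, "MiB")  # 256.0 MiB, matching the text
```

Doubling either $N$ or $d$ makes the linear-vs-quadratic dependence obvious: twice the experts doubles the footprint, while twice the width quadruples it.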

2. Ternary-Quantized Shared Substrate

ButterflyMoE introduces a single shared prototype weight matrix

$$W_{\mathrm{base}} \in \{-1, 0, +1\}^{d_{\text{ff}} \times d_{\text{model}}}$$

shared across all experts and quantized to three levels (1.58 bits per weight). During training, a full-precision copy $W^{(\mathrm{fp})}$ of this matrix is maintained, and quantization is performed via a straight-through estimator (STE) with a scaling parameter

$$\gamma = \frac{1}{d_{\text{ff}}\, d_{\text{model}}} \sum_{i,j} \left|W^{(\mathrm{fp})}_{ij}\right|$$

and $Q(W^{(\mathrm{fp})}) = \gamma\, \mathrm{round}\!\big(W^{(\mathrm{fp})}/\gamma\big)$. At inference, all experts use the same quantized grid $\{-1, 0, +1\}$, thereby amortizing the storage cost of the substrate across $N$ experts and eliminating redundant storage (Karmore, 20 Jan 2026).
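A minimal sketch of the absmean ternary quantizer described above, in numpy. The clip to $\{-1, 0, +1\}$ after rounding is our assumption (it matches common ternary/BitNet-style practice; the formula as written would otherwise permit values outside the grid), and the STE backward pass is omitted since numpy has no autograd:

```python
import numpy as np

def ternary_quantize(W_fp):
    """Absmean ternary quantization: gamma * round(W / gamma), clipped to {-1, 0, +1}.
    (Clipping is an assumption here; in training the STE would pass gradients
    through this operation unchanged.)"""
    gamma = np.mean(np.abs(W_fp))                  # scaling parameter γ
    W_q = np.clip(np.round(W_fp / gamma), -1, 1)   # ternary grid {-1, 0, +1}
    return gamma * W_q, gamma

rng = np.random.default_rng(0)
W = rng.normal(size=(2048, 512)).astype(np.float32)  # full-precision copy
W_hat, gamma = ternary_quantize(W)

# Every dequantized entry sits on the three-level grid scaled by γ.
assert set(np.unique(W_hat / gamma)) <= {-1.0, 0.0, 1.0}
```

Only the sign pattern (2 bits, 1.58 bits information-theoretically) plus the single scalar $\gamma$ need to be stored, which is what makes sharing one substrate across all experts so cheap.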

3. Structured Butterfly Orbits and Expert Generation

Instead of storing each expert weight matrix directly, ButterflyMoE instantiates expert $i$ as

$$W_i = R_i^{\mathrm{out}}\, \big[Q(W_{\mathrm{base}})\big]\, (R_i^{\mathrm{in}})^{T}$$

where $R_i^{\mathrm{in}} = \mathcal{B}(\theta_i)$ and $R_i^{\mathrm{out}} = \mathcal{B}(\phi_i)$ are orthogonal "butterfly" matrices parameterized by $O(d \log d)$ Givens angles. For $d = 2^m$, a butterfly factorization is

$$\mathcal{B}(\theta) = \prod_{\ell=1}^{m} \big[D_\ell(\theta)\, P_\ell\big]$$

where $D_\ell(\theta)$ are block-diagonal matrices of $2 \times 2$ rotations and $P_\ell$ are fixed perfect-shuffle permutations. Each expert occupies a geometrically distinct "orbit" in weight space, generated by reorienting the shared ternary substrate. This organization provides expert diversity not by storing separate matrices but through learned rotations, dramatically reducing the total parameter storage (Karmore, 20 Jan 2026).
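The factorization above can be sketched concretely. The code below builds $\mathcal{B}(\theta)$ for $d = 2^m$ from $m$ stages of $2 \times 2$ Givens rotations, pairing indices FFT-style with stride $2^{\ell-1}$ at stage $\ell$; this is one standard way to realize the $D_\ell P_\ell$ factors (the paper's exact permutation convention may differ), and each stage consumes $d/2$ angles, giving the stated $(d/2)\log_2 d$ total:

```python
import numpy as np

def butterfly_matrix(theta, d):
    """Dense orthogonal butterfly matrix from an (m, d/2) array of Givens angles,
    for d = 2^m. Each stage rotates disjoint index pairs, so each stage --
    and hence the product -- is orthogonal."""
    m = d.bit_length() - 1
    assert 1 << m == d and theta.shape == (m, d // 2)
    B = np.eye(d)
    for l in range(m):
        stride = 1 << l
        stage = np.eye(d)
        k = 0
        for start in range(0, d, 2 * stride):      # blocks of size 2 * stride
            for i in range(start, start + stride): # pair i with i + stride
                j = i + stride
                c, s = np.cos(theta[l, k]), np.sin(theta[l, k])
                stage[i, i], stage[i, j] = c, -s
                stage[j, i], stage[j, j] = s, c
                k += 1
        B = stage @ B
    return B

d = 8
rng = np.random.default_rng(0)
theta = rng.uniform(-np.pi, np.pi, size=(3, d // 2))  # (d/2) * log2(d) = 12 angles
B = butterfly_matrix(theta, d)
assert np.allclose(B @ B.T, np.eye(d))  # orthogonality holds by construction
```

A dense $8 \times 8$ orthogonal matrix has 28 free parameters; the butterfly parameterization reaches a structured subset of them with only 12 angles, which is exactly the trade that makes per-expert rotations affordable.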

4. Sub-Linear Memory Scaling and Comparative Analysis

The total memory for ButterflyMoE with NN experts is

$$M_{\mathrm{ButterflyMoE}} = O\!\left(d^2 + N\, d \log d\right)$$

where

  • The shared ternary-quantized substrate: $O(d^2)$
  • Per-expert butterfly rotations: $2 \times (d/2) \times \log_2 d$ angles per expert, i.e. $O(N d \log d)$ in total

At large $N$, the per-expert storage is dominated by the $O(N d \log d)$ rotation term, a major efficiency gain over $O(N d^2)$. For example, with $d_{\text{model}} = 512$ and $d_{\text{ff}} = 2048$, ButterflyMoE achieves a compression ratio of approximately 154× relative to standard MoE (Karmore, 20 Jan 2026).

| Configuration | Standard MoE | QMoE (sub-1-bit) | MoQE (2-bit) | PuzzleMoE/MC | ButterflyMoE |
|---|---|---|---|---|---|
| Memory (64 experts) | 256 MB | 13–26 MB | 51 MB | 64–128 MB | 1.9 MB |
| Compression ratio | 1× (baseline) | 10–20× | — | 2–4× | 150× |

This sub-linear scaling permits a far greater number of experts (e.g., 64 on 4 GB devices vs. 8 for standard MoE) and supports the deployment of more granular expert-specialization models on memory-constrained platforms (Karmore, 20 Jan 2026).
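The scaling claim above can be checked with a raw parameter count, which factors out quantization and storage format (the exact 154× byte-level ratio in the text additionally depends on ternary packing and angle precision). We assume here that the input rotation acts on $d_{\text{model}}$ and the output rotation on $d_{\text{ff}}$:

```python
import math

d_model, d_ff = 512, 2048

def standard_moe_params(N):
    # N independent expert matrices: O(N d^2)
    return N * d_ff * d_model

def butterfly_moe_params(N):
    # One shared substrate plus per-expert Givens angles: O(d^2 + N d log d).
    substrate = d_ff * d_model
    angles_per_expert = (d_model // 2) * int(math.log2(d_model)) \
                      + (d_ff // 2) * int(math.log2(d_ff))
    return substrate + N * angles_per_expert

for N in (8, 64, 256):
    print(N, standard_moe_params(N), butterfly_moe_params(N))
```

At $N = 64$ this gives about 67.1 M parameters for standard MoE versus about 1.9 M for ButterflyMoE, and quadrupling $N$ to 256 barely moves the butterfly total (the shared substrate dominates less, but the per-expert increment is only $O(d \log d)$).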

5. Quantization-Aware Training and Outlier Suppression

The rotation parameters $\{\theta_i, \phi_i\}$ for all experts are trained end-to-end with the substrate, using a combined loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{balance}} \sum_{i=1}^{N} \left(\frac{n_i}{N_{\mathrm{total}}} - \frac{1}{N}\right)^2$$

where $n_i$ is the number of tokens routed to expert $i$, $N_{\mathrm{total}}$ is the total token count, and the balance term enforces load balancing as in Switch Transformers, with a default coefficient $\lambda_{\mathrm{balance}} = 0.01$. The STE passes gradients through the quantization operation. Learned input rotations dynamically disrupt the alignment of activation "outliers" with the quantized grid, redistributing large-magnitude activation components and suppressing quantization-induced mean squared error (MSE). Empirical evaluation shows a reduction in MSE from ∼51.3% (untrained rotations) to ∼1.4% (trained), a 97.2% improvement. This enables stable training at extremely low bitwidths, a regime where static quantization approaches fail (Karmore, 20 Jan 2026).
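The auxiliary balance term above is simple enough to state directly in code; this sketch evaluates the penalty for a given routing histogram (names are ours):

```python
import numpy as np

def balance_penalty(counts, lam=0.01):
    """Load-balancing penalty: lam * sum_i (n_i / N_total - 1/N)^2,
    where counts[i] = n_i is the number of tokens routed to expert i."""
    counts = np.asarray(counts, dtype=float)
    N = counts.size
    frac = counts / counts.sum()             # n_i / N_total
    return lam * np.sum((frac - 1.0 / N) ** 2)

# Perfectly uniform routing incurs zero penalty; collapsed routing is penalized.
assert balance_penalty([100, 100, 100, 100]) == 0.0
assert balance_penalty([400, 0, 0, 0]) > balance_penalty([150, 150, 50, 50])
```

In training, `counts` would come from the router's top-k assignments per batch (typically via a differentiable surrogate, as in Switch Transformers), and the penalty is added to the cross-entropy loss with the small default weight $\lambda = 0.01$.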

6. Empirical Evaluation on Language Modeling

Experiments were conducted on multi-domain sequence prediction and WikiText-103. Key results include:

  • At $N = 256$ experts, ButterflyMoE achieves 4.7 MB expert memory (vs. 1024 MB for standard MoE), a 150× reduction with negligible perplexity loss.
  • On a 4 GB Jetson Nano, standard MoE fits only 8 experts, QMoE fits ∼16, and ButterflyMoE fits 64 experts in just 1.9 MB.
  • Cosine similarity between different experts’ outputs remains low (mean off-diagonal ≈0.52 vs. 0.55 for standard MoE), indicating that geometric parameterization preserves expert diversity.
  • Compression performance across schemes:
| Method | Expert Memory ($N = 64$, $d = 512$) | Relative Compression |
|---|---|---|
| Standard MoE | 256 MB | 1× (baseline) |
| QMoE (sub-1-bit) | ∼13–26 MB | 10–20× |
| MoQE (2-bit) | 51 MB | — |
| PuzzleMoE / MC | 64–128 MB | 2–4× |
| ButterflyMoE | 1.9 MB | 150× |

These results demonstrate that ButterflyMoE enables an order-of-magnitude increase in feasible expert count on resource-constrained devices without significant accuracy penalties (Karmore, 20 Jan 2026).

7. Implementation, Limitations, and Deployment Considerations

In the ButterflyMoE forward pass (see Algorithm 1 in (Karmore, 20 Jan 2026)), expert matrices $W_i$ are never explicitly materialized. Instead, token embeddings are first rotated by the transposed per-expert input rotation $(R_i^{\mathrm{in}})^{T}$, then multiplied by the shared ternary substrate, and finally passed through the output rotation $R_i^{\mathrm{out}}$. Each expert invocation requires $O(d \log d)$ floating-point operations for the butterfly rotations, in addition to a single $O(d^2)$ ternary matrix multiply.
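The factorized pass can be sketched and checked against the (never-built) materialized expert. For brevity we stand in generic orthogonal matrices from a QR decomposition for the butterfly factors $\mathcal{B}(\theta_i)$ and $\mathcal{B}(\phi_i)$; the algebra of the forward pass is identical:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64

# Shared ternary substrate and stand-in orthogonal rotations (QR here, where
# the actual method uses butterfly factorizations with O(d log d) angles).
W_base = rng.choice([-1.0, 0.0, 1.0], size=(d_ff, d_model))
R_in, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))  # stand-in for B(θ_i)
R_out, _ = np.linalg.qr(rng.normal(size=(d_ff, d_ff)))       # stand-in for B(φ_i)

x = rng.normal(size=d_model)  # one token embedding

# Factorized forward pass, as executed: rotate in, multiply substrate, rotate out.
y = R_out @ (W_base @ (R_in.T @ x))

# The materialized expert W_i = R_out Q(W_base) R_in^T agrees exactly.
W_i = R_out @ W_base @ R_in.T
assert np.allclose(y, W_i @ x)
```

The point of the factorized ordering is that only the shared substrate and the compact rotation parameters ever reside in memory; the $d_{\text{ff}} \times d_{\text{model}}$ product $W_i$ exists only implicitly.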

Current limitations include:

  • Inference latency: On GPUs lacking fused butterfly kernels, inference can be slower than dense feedforward networks (up to 48.9× on Nvidia T4). Using only two butterfly layers (vs. $\log_2 d$) recovers a 2.5× speedup with negligible accuracy loss.
  • Regime of evaluation: Experiments thus far are limited to models with $<10^8$ parameters; scaling to billion-parameter MoEs and further analysis of the learned orbit geometry remain open research directions.
  • Substrate dimension constraint: The butterfly structure presumes $d$ is a power of 2; generalization to arbitrary $d$ or to alternative group-orbit parameterizations is untested.
  • Hardware specialization: The method incurs a per-token butterfly overhead; custom hardware support for $O(d \log d)$ structured rotations would further reduce inference cost.

ButterflyMoE’s core contribution is to break the traditional linear dependence of MoE parameter storage on expert count. This enables the deployment of domain-specialized, large-expert-count models on memory-constrained devices such as mobile deployments and resource-limited inference accelerators (Karmore, 20 Jan 2026).
