ButterflyMoE: Memory-Efficient Mixture-of-Experts
- ButterflyMoE is a Mixture-of-Experts paradigm that employs a shared ternary-quantized substrate and butterfly rotations to achieve sub-linear memory scaling.
- It compresses expert memory by up to 150× compared to standard MoE architectures, enabling deployment on edge devices under tight resource constraints.
- By leveraging geometric rotations and quantization-aware training, ButterflyMoE maintains expert diversity and accuracy while significantly reducing computational overhead.
ButterflyMoE is a Mixture-of-Experts (MoE) paradigm that replaces the traditional storage of independent expert weight matrices with a geometric parameterization based on structured rotations of a shared ternary-quantized substrate. This approach achieves sub-linear memory scaling in the number of experts and allows large-scale expert architectures to be deployed within the tight resource constraints of edge devices. ButterflyMoE achieves up to 150-fold memory compression with negligible degradation in accuracy, fundamentally altering the cost structure of sparse expert models for on-device and resource-constrained inference (Karmore, 20 Jan 2026).
1. Linear Bottleneck in Standard MoE Architectures
Conventional MoE layers consist of E experts, each requiring a distinct weight matrix W_e ∈ ℝ^{d×d}, with d ≈ 1024 in practice. The total expert memory is thus M_std = E · d² · b bytes, where b is the number of bytes per weight (e.g., 4 B for FP32). For E = 64 and d = 1024, the parameter memory footprint alone is approximately 256 MB, exceeding the available budget for many edge devices (e.g., Jetson Nano with 4 GB RAM). Furthermore, loading these parameter matrices from DRAM for each forward pass incurs significant energy costs (e.g., 13 mJ per pass at 256 MB). Even with aggressive quantization (e.g., 2-bit schemes as in QMoE or MoQE), the linear O(E · d²) scaling remains unchanged, limiting the number of deployable experts (Karmore, 20 Jan 2026).
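The scaling argument above can be made concrete with a short calculation (a sketch only; the symbols E, d, and b follow the text, and the function name is hypothetical):

```python
def standard_moe_memory_bytes(num_experts: int, d: int, bytes_per_weight: int = 4) -> int:
    """Total expert parameter memory for standard MoE: E * d^2 * b bytes
    (one dense d x d matrix per expert, b bytes per weight)."""
    return num_experts * d * d * bytes_per_weight

# E = 64 experts, d = 1024, FP32 weights (4 bytes each)
mem = standard_moe_memory_bytes(64, 1024)
print(mem / 2**20, "MiB")  # 256.0 MiB
```

Doubling the expert count doubles this footprint, which is exactly the linear bottleneck the next sections address.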
2. Ternary-Quantized Shared Substrate
ButterflyMoE introduces a single shared prototype weight matrix W_s ∈ ℝ^{d×d}, shared across all experts and quantized to three levels {−α, 0, +α} (≈1.58 bits per weight, since log₂ 3 ≈ 1.585). During training, a full-precision copy of this matrix is maintained, and quantization is performed via a straight-through estimator (STE) with a learned scaling parameter α > 0. At inference, all experts use the same quantized grid Ŵ_s, thereby amortizing the storage cost of the substrate across experts and eliminating redundant storage (Karmore, 20 Jan 2026).
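A minimal sketch of the ternary quantization step. The threshold-and-scale rule below follows the common ternary-weight recipe (threshold at 0.7 of the mean absolute weight); the paper's exact rule for α may differ, and the function names are hypothetical:

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Quantize a full-precision matrix to {-alpha, 0, +alpha} (three levels,
    ~1.58 bits/weight). Threshold/scale rule is an assumption (TWN-style)."""
    delta = 0.7 * np.abs(w).mean()        # pruning threshold (assumption)
    mask = np.abs(w) > delta              # weights that remain non-zero
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask, alpha

def ste_backward(grad_output: np.ndarray) -> np.ndarray:
    """Straight-through estimator: the backward pass treats quantization
    as the identity, so gradients flow to the full-precision copy."""
    return grad_output

rng = np.random.default_rng(0)
w_full = rng.normal(size=(8, 8))          # full-precision training copy
w_q, alpha = ternary_quantize(w_full)
assert set(np.unique(np.round(w_q / alpha))) <= {-1.0, 0.0, 1.0}
```

At inference only w_q (packed at 1.58 bits/weight) and the scalar α need to be stored.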
3. Structured Butterfly Orbits and Expert Generation
Instead of storing each expert weight matrix directly, ButterflyMoE instantiates expert e as

W_e = U_e Ŵ_s V_eᵀ,

where U_e and V_e are orthogonal “butterfly” matrices parameterized by Givens rotation angles. For d = 2^k, a butterfly factorization takes the form

B = B_k P_k ⋯ B_2 P_2 B_1 P_1,

where the B_i are block-diagonal matrices of 2×2 rotations and the P_i are fixed perfect-shuffle permutations. Each expert occupies a geometrically distinct “orbit” in the weight space, generated as reorientations of the shared ternary substrate. This organization provides expert diversity not by storing separate matrices but by learned rotations, dramatically reducing the total parameter storage (Karmore, 20 Jan 2026).
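The butterfly construction can be sketched as follows. For clarity this version pairs indices by stride rather than applying explicit perfect-shuffle permutations, which is equivalent up to a fixed reindexing; function names and the dense-matrix representation are illustrative only:

```python
import numpy as np

def butterfly_layer(d: int, stride: int, angles: np.ndarray) -> np.ndarray:
    """One butterfly layer on R^d: d/2 independent Givens rotations acting
    on index pairs (i, i + stride). `angles` has length d // 2."""
    B = np.eye(d)
    a = 0
    for start in range(0, d, 2 * stride):
        for i in range(start, start + stride):
            j = i + stride
            c, s = np.cos(angles[a]), np.sin(angles[a])
            B[i, i], B[i, j] = c, -s
            B[j, i], B[j, j] = s, c
            a += 1
    return B

def butterfly_matrix(d: int, angles_per_layer) -> np.ndarray:
    """Compose log2(d) layers with strides d/2, d/4, ..., 1 into one
    orthogonal d x d matrix with (d/2) * log2(d) Givens angles in total."""
    U = np.eye(d)
    stride = d // 2
    for angles in angles_per_layer:
        U = butterfly_layer(d, stride, angles) @ U
        stride //= 2
    return U

d = 8
rng = np.random.default_rng(1)
layers = [rng.uniform(0, 2 * np.pi, d // 2) for _ in range(3)]  # log2(8) = 3
U = butterfly_matrix(d, layers)
assert np.allclose(U @ U.T, np.eye(d))   # orthogonal by construction
```

Each layer is orthogonal because its Givens rotations act on disjoint index pairs, so the product is orthogonal regardless of the learned angles, which is what lets the angles be trained freely.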
4. Sub-Linear Memory Scaling and Comparative Analysis
The total memory for ButterflyMoE with E experts is

M_BF = M_substrate + E · M_rot,

where
- The shared substrate (ternary quantized): M_substrate = d² · 1.58 / 8 bytes
- Per-expert butterfly rotations: two factorizations (U_e and V_e) of (d/2) · log₂ d Givens angles each, i.e., M_rot = d · log₂ d · b_angle bytes

At large E, the per-expert storage is dominated by the O(d log d) rotation angles, a major efficiency gain over the O(d²) per expert of standard MoE. For example, with E = 64 and d = 1024, ButterflyMoE achieves a compression ratio of approximately 154× relative to standard MoE (Karmore, 20 Jan 2026).
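The memory model above can be evaluated numerically. This is a sketch: the angle precision (FP16 here) and the factor of two for the U_e/V_e pair are assumptions, so the exact constants may differ slightly from the paper's reported figures, though the ratio lands in the same ~150× regime:

```python
import math

def standard_memory(E: int, d: int, b: int = 4) -> float:
    """Standard MoE: E full-precision d x d expert matrices."""
    return E * d * d * b

def butterfly_memory(E: int, d: int, angle_bytes: int = 2) -> float:
    """ButterflyMoE: one shared ternary substrate (1.58 bits/weight) plus,
    per expert, two butterfly factorizations of (d/2)*log2(d) angles each.
    FP16 angle storage is an assumption."""
    substrate = d * d * 1.58 / 8                          # bits -> bytes
    rot = E * 2 * (d // 2) * int(math.log2(d)) * angle_bytes
    return substrate + rot

E, d = 64, 1024
print(f"standard:  {standard_memory(E, d) / 2**20:6.1f} MiB")
print(f"butterfly: {butterfly_memory(E, d) / 2**20:6.2f} MiB")
print(f"ratio:     {standard_memory(E, d) / butterfly_memory(E, d):.0f}x")
```

Because the d² substrate term is paid once, growing E only adds the O(d log d) rotation angles per expert, which is the sub-linear scaling claimed above.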
| Configuration | Standard MoE | QMoE (sub-1bit) | MoQE (2-bit) | PuzzleMoE/MC | ButterflyMoE |
|---|---|---|---|---|---|
| Memory (64 experts) | 256 MB | 13–26 MB | 51 MB | 64–128 MB | 1.9 MB |
| Compression Ratio | 1× | 10–20× | 5× | 2–4× | 150× |
This sub-linear scaling permits a far greater number of experts (e.g., 64 on 4 GB devices vs. 8 for standard MoE) and supports the deployment of more granular expert-specialization models on memory-constrained platforms (Karmore, 20 Jan 2026).
5. Quantization-Aware Training and Outlier Suppression
The rotation parameters for all experts are trained end-to-end with the substrate, using a combined loss:
The rotation parameters for all experts are trained end-to-end with the substrate, using a combined loss

L = L_task + λ · L_balance,

where L_balance enforces load balancing as in Switch Transformers, with a small default coefficient λ. The STE passes gradients through the quantization operation. Learned input rotations dynamically disrupt the alignment of activation “outliers” with the quantized grid, redistributing large-magnitude activation components and suppressing quantization-induced mean squared error (MSE). Empirical evaluation shows a reduction in MSE from ∼51.3% (untrained) to ∼1.4% (trained), a 97.2% improvement. This enables stable training at extremely low bitwidths, a regime where static quantization approaches fail (Karmore, 20 Jan 2026).
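A sketch of the combined loss, using the Switch Transformer auxiliary load-balancing term E · Σ_e f_e · P_e (the coefficient λ, top-1 routing, and all function names are assumptions for illustration):

```python
import numpy as np

def load_balance_loss(router_probs: np.ndarray, assignment: np.ndarray,
                      num_experts: int) -> float:
    """Switch-Transformer-style auxiliary loss: E * sum_e f_e * P_e, where
    f_e is the fraction of tokens routed to expert e and P_e is the mean
    router probability for expert e. Minimized (= 1.0) at perfect balance."""
    f = np.bincount(assignment, minlength=num_experts) / len(assignment)
    P = router_probs.mean(axis=0)
    return num_experts * float(np.dot(f, P))

def total_loss(task_loss: float, router_probs: np.ndarray,
               assignment: np.ndarray, num_experts: int, lam: float = 0.01) -> float:
    """L = L_task + lambda * L_balance (lambda value is an assumption)."""
    return task_loss + lam * load_balance_loss(router_probs, assignment, num_experts)

rng = np.random.default_rng(2)
logits = rng.normal(size=(128, 8))                       # 128 tokens, 8 experts
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
assignment = probs.argmax(axis=1)                        # top-1 routing
print(total_loss(2.31, probs, assignment, num_experts=8))
```

Gradients from L flow through the router and, via the STE, through the quantized substrate into the butterfly angles.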
6. Empirical Evaluation on Language Modeling
Experiments were conducted on multi-domain sequence prediction and WikiText-103. Key results include:
- At E = 256 experts, ButterflyMoE achieves 4.7 MB expert memory (vs. 1024 MB for standard MoE), a 150× reduction with negligible perplexity loss.
- On a 4 GB Jetson Nano, standard MoE fits only 8 experts, QMoE fits ∼16, and ButterflyMoE fits 64 experts in just 1.9 MB.
- Cosine similarity between different experts’ outputs remains low (mean off-diagonal ≈0.52 vs. 0.55 for standard MoE), indicating that geometric parameterization preserves expert diversity.
- Compression performance across schemes:
| Method | Expert Memory (E = 64, d = 1024) | Relative Compression |
|---|---|---|
| Standard MoE | 256 MB | 1× |
| QMoE (sub-1bit) | ∼13–26 MB | 10–20× |
| MoQE (2-bit) | 51 MB | 5× |
| PuzzleMoE / MC | 64–128 MB | 2–4× |
| ButterflyMoE | 1.9 MB | 150× |
These results demonstrate that ButterflyMoE enables an order-of-magnitude increase in feasible expert count on resource-constrained devices without significant accuracy penalties (Karmore, 20 Jan 2026).
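The expert-diversity metric reported above, mean off-diagonal cosine similarity between expert outputs, can be computed as follows (a sketch with random stand-in outputs; the function name is hypothetical):

```python
import numpy as np

def mean_offdiag_cosine(outputs: np.ndarray) -> float:
    """Mean pairwise cosine similarity between expert output vectors (rows),
    excluding the diagonal; lower values indicate more diverse experts."""
    X = outputs / np.linalg.norm(outputs, axis=1, keepdims=True)
    S = X @ X.T                                  # E x E cosine similarity matrix
    E = len(S)
    return float((S.sum() - np.trace(S)) / (E * (E - 1)))

rng = np.random.default_rng(3)
outs = rng.normal(size=(64, 1024))   # one output vector per expert (illustrative)
print(round(mean_offdiag_cosine(outs), 3))
```

Values near the standard-MoE baseline (≈0.55 here) indicate that the rotation-only parameterization has not collapsed the experts onto one another.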
7. Implementation, Limitations, and Deployment Considerations
In the ButterflyMoE forward pass (see Algorithm 1 in (Karmore, 20 Jan 2026)), expert matrices are never explicitly materialized. Instead, token embeddings are first acted upon by the per-expert input rotation V_eᵀ, then multiplied by the shared ternary substrate Ŵ_s, and finally passed through the output rotation U_e. Each expert invocation requires O(d log d) floating-point operations for the butterfly rotations, in addition to a single ternary matrix multiply.
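The materialization-free forward pass can be sketched as below. Dense random orthogonal matrices stand in for fused butterfly factors, and the rotation order (input rotation, ternary substrate, output rotation) is as described in the paragraph above; all names are illustrative:

```python
import numpy as np

def butterfly_expert_forward(x, U_e, W_s_ternary, V_e):
    """Expert e applied as y = U_e @ (W_s_ternary @ (V_e^T @ x)).
    The dense d x d expert matrix U_e W_s V_e^T is never built; only the
    shared ternary substrate and the two rotations are stored."""
    x = V_e.T @ x            # per-expert input rotation (O(d log d) if fused)
    x = W_s_ternary @ x      # shared ternary matmul (values in {-a, 0, +a})
    return U_e @ x           # per-expert output rotation

d = 16
rng = np.random.default_rng(4)
# Random orthogonal stand-ins for the butterfly factors of this sketch:
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
alpha = 0.05
W_s = alpha * rng.integers(-1, 2, size=(d, d)).astype(float)  # ternary substrate
x_in = rng.normal(size=d)
y = butterfly_expert_forward(x_in, U, W_s, V)
assert np.allclose(y, U @ W_s @ V.T @ x_in)   # same result, never materialized
```

In a real kernel each rotation would be applied as log₂ d sparse butterfly layers rather than a dense matmul, which is where the O(d log d) cost comes from.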
Current limitations include:
- Inference latency: On GPUs lacking fused butterfly kernels, inference can be slower than dense feedforward networks (up to 48.9× on an Nvidia T4). Using only two butterfly layers (vs. the full log₂ d) recovers a 2.5× speedup with negligible accuracy loss.
- Regime of evaluation: Experiments thus far are limited to comparatively small models; scaling to billion-parameter MoEs and further analysis of the learned orbit geometry remain open research directions.
- Substrate dimension constraint: The butterfly structure presumes that d is a power of 2; generalization to arbitrary d, or to alternative group-orbit parameterizations, is untested.
- Hardware specialization: The method incurs a per-token butterfly overhead; custom hardware support for structured rotations would further reduce inference cost.
ButterflyMoE’s core contribution is to break the traditional linear dependence of MoE parameter storage on expert count. This enables the deployment of domain-specialized, large-expert-count models on memory-constrained devices such as mobile deployments and resource-limited inference accelerators (Karmore, 20 Jan 2026).