
ZipMoE: Efficient On-Device MoE Inference

Updated 6 February 2026
  • ZipMoE is an on-device Mixture-of-Experts system that losslessly decomposes BF16 weights into high-entropy SM-bits and low-entropy exponent bits for efficient storage.
  • It employs a two-phase pipeline with a compression-aware cache manager and cache-affinity scheduler to shift from I/O-bound to compute-centric inference on resource-constrained hardware.
  • Empirical evaluations demonstrate significant improvements, including up to 97.9% latency reduction and up to 42.5× throughput gains, ensuring full-precision behavioral fidelity.

ZipMoE is a semantically lossless on-device Mixture-of-Experts (MoE) serving system designed to address the prohibitive memory and I/O bottlenecks of deploying large-scale MoE models on resource-constrained edge devices. It exploits statistical redundancy in MoE parameters together with a caching–scheduling co-design tailored to edge hardware, shifting inference from I/O-bound to compute-centric, enabling provably efficient parallelization, and preserving full-precision behavior exactly (Yang et al., 29 Jan 2026).

1. Mixture-of-Experts Models and Edge Deployment Challenges

Mixture-of-Experts (MoE) architectures decompose large models into E experts (sub-networks and their tensors), routing each input token to only k ≪ E dynamically selected experts. This design offers overall capacity O(E·d) while restricting active per-token computation to O(k·d). On traditional server-class infrastructure, the sparse activation yields a compute–memory-bandwidth tradeoff, but on edge devices, such as NVIDIA Jetson modules or integrated mobile SoCs with shared CPU/GPU memory, limited RAM and storage subsystem speeds create a severe I/O bottleneck. Fetching inactive experts from SSD or DRAM can account for up to 80.1% of inference latency, leaving compute throughput underutilized on both the CPU and GPU [(Yang et al., 29 Jan 2026), Fig. 1(b)].
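A minimal sketch of top-k routing illustrates the capacity/compute split: all E experts exist in memory, but only k per token are applied. The softmax gating and tanh "experts" below are toy stand-ins, not the paper's implementation.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route each token to its top-k experts (toy sketch with softmax gating)."""
    logits = x @ gate_w                                 # (tokens, E) router scores
    topk = np.argsort(logits, axis=1)[:, -k:]           # k expert ids per token
    scores = np.take_along_axis(logits, topk, axis=1)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for w, e in zip(weights[t], topk[t]):           # only k experts run per token
            out[t] += w * np.tanh(x[t] @ experts[e])    # toy expert: tanh(x W_e)
    return out

rng = np.random.default_rng(0)
d, E, k = 8, 16, 2
x = rng.normal(size=(4, d))                 # 4 tokens
gate_w = rng.normal(size=(d, E))
experts = rng.normal(size=(E, d, d))        # capacity grows with E...
y = moe_forward(x, gate_w, experts, k)      # ...but compute only with k
```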

Previous approaches to on-device MoE inference commonly employ lossy quantization (e.g., mixed precision, tensor-wise bit-width allocation). However, even minimal quantization noise can alter model behavior in ways that conventional NLP accuracy metrics such as perplexity or Z-score fail to detect, exposing models to backdoors or adversarial vulnerabilities. Consequently, preserving the original BF16 (bfloat16) representation is critical to guarantee that device-local outputs are identical to those produced by the uncompressed server-side model.

2. ZipMoE Architecture and Workflow

ZipMoE implements a two-phase architecture comprising offline initialization and real-time inference. The design explicitly exploits hardware properties and data redundancies to shift the dominant bottleneck from storage I/O to compute, making efficient parallel execution possible.

2.1 Offline Initialization

  • Bit-Field Decomposition: Each BF16-encoded weight is split into two streams: high-entropy "Sign+Mantissa" bits (SM-bits) and low-entropy "Exponent" bits.
  • Exponent Sharding and Compression: Exponent bits are partitioned into K shards, each losslessly compressed into an "E-chunk" using standard compressors (LZ4HC/ZSTD) suited to the skewed entropy distribution. SM-bits are stored as byte-packed "SM-chunks". All tensor metadata plus the compressed and packed chunks are persisted on SSD.

2.2 Real-Time Inference

ZipMoE's runtime couples two co-designed components, a cache manager and a scheduler, with a vectorized GPU reconstruction kernel:

  • Compression-Aware Cache Manager: Allocates the available RAM across four pools: (1) S_E for E-chunks only, (2) S_S for SM-chunks only, (3) S_C for both chunk types, and (4) S_F for fully reconstructed tensors. Allocation is informed by a rank-based probabilistic model of expert activation.
  • Cache-Affinity Scheduler: At each sparse layer, the scheduler receives k·N tensor-reconstruction directed acyclic graphs (DAGs), where each reconstruction involves reading E-chunks, decompressing them, reading SM-chunks, and reassembling the BF16 tensor. The scheduler assigns I/O to a single thread and parallelizes decompression over L CPU worker threads, overlapping both with vectorized GPU tensor reassembly to hide I/O latency.
  • Vectorized GPU Kernel: Rebuilds BF16 tensors from SM- and E-bits in a memory-coalesced pattern, saturating downstream GEMM pipeline utilization.
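The paper does not spell out the cache manager's allocation algorithm; a toy greedy planner under a Zipf-like rank-based activation model conveys the idea of tiered pools. The pool names come from the text above, while all sizes and the Zipf exponent are illustrative assumptions.

```python
# Toy allocation: rank experts by a Zipf-like activation model and fill
# pools S_F (full tensors) -> S_C (both chunks) -> S_S / S_E under a RAM budget.
def plan_cache(n_experts, ram_mb, full_mb, sm_mb, e_mb, s=1.2):
    probs = [(r + 1) ** -s for r in range(n_experts)]
    z = sum(probs)
    probs = [p / z for p in probs]           # expected activation by rank
    plan, left = {}, ram_mb
    for rank in range(n_experts):            # hottest experts first
        for tier, cost in (("S_F", full_mb), ("S_C", sm_mb + e_mb),
                           ("S_S", sm_mb), ("S_E", e_mb)):
            if cost <= left:                 # place in richest tier that fits
                plan[rank] = tier
                left -= cost
                break
        else:
            plan[rank] = "SSD"               # not cached at all
    return plan

# Hypothetical per-expert sizes (MB): full tensor, SM-chunk, compressed E-chunks.
plan = plan_cache(n_experts=64, ram_mb=512, full_mb=28, sm_mb=14, e_mb=10)
```

The hottest experts land fully reconstructed in RAM; colder ones keep only cheap-to-fetch chunks cached, mirroring the tiered-pool idea.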

3. Lossless Compression of MoE Parameters

3.1 Statistical Redundancy in BF16

The BF16 format allocates 1 sign bit, 8 exponent bits, and 7 mantissa bits. The exponent distribution across MoE weights is highly skewed, with a Shannon entropy of roughly 2.55 bits, far below the 8-bit maximum. The Shannon lower bound therefore implies that, with the 8 SM-bits stored raw and the exponent bits compressed to their entropy, only about (8 + 2.55)/16 ≈ 66% of the original BF16 footprint is needed for lossless storage. In practice, off-the-shelf compressors come close to this bound: LZ4HC reduces typical MoE tensors to about 74% of their original size and ZSTD to about 68%.
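A short calculation shows how a skewed exponent histogram translates into the footprint bound. The histogram below is illustrative, not measured data; the paper reports ≈2.55 bits of entropy on real MoE weights.

```python
import numpy as np

def shannon_entropy(counts):
    """Entropy in bits of an empirical symbol histogram."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Illustrative exponent histogram: BF16 weights cluster in a narrow magnitude
# band, so only a handful of the 256 exponent codes carry most of the mass.
counts = np.zeros(256)
counts[120:128] = [1, 4, 16, 64, 256, 64, 16, 4]
H = shannon_entropy(counts)        # well below the 8-bit maximum

# Ideal footprint: 8 SM-bits stored raw + exponent compressed to its entropy.
ratio = (8 + H) / 16               # fraction of the original BF16 size
```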

3.2 Compression-Algorithm Pipeline

The ZipMoE pipeline proceeds as follows:

  1. Each weight w is split into SM and exponent arrays.
  2. The exponent array is split into K shards, and each shard is compressed independently.
  3. The SM-chunk (byte-packed, uncompressed) and the E-chunks (compressed) are stored on SSD, indexed with tensor and chunk-offset metadata.
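The three steps can be sketched as follows. The stdlib zlib stands in for the LZ4HC/ZSTD compressors actually used; the bit layout follows the BF16 format from Section 3.1 (1 sign, 8 exponent, 7 mantissa bits), and the SSD persistence step is omitted.

```python
import zlib
import numpy as np

def decompose(bf16_u16, K=4):
    """Split BF16 words into SM-bytes and exponent bytes; compress exponents."""
    w = bf16_u16.astype(np.uint16)
    sm = (((w >> 8) & 0x80) | (w & 0x7F)).astype(np.uint8)   # sign + 7 mantissa bits
    exp = ((w >> 7) & 0xFF).astype(np.uint8)                 # 8 exponent bits
    shards = np.array_split(exp, K)                          # K independent shards
    e_chunks = [zlib.compress(s.tobytes(), 9) for s in shards]
    return sm.tobytes(), e_chunks

def reconstruct(sm_chunk, e_chunks):
    """Bit-exact inverse: decompress E-chunks and reassemble BF16 words."""
    sm = np.frombuffer(sm_chunk, dtype=np.uint8)
    exp = np.frombuffer(b"".join(zlib.decompress(c) for c in e_chunks),
                        dtype=np.uint8)
    return (((sm.astype(np.uint16) & 0x80) << 8)     # sign back to bit 15
            | (exp.astype(np.uint16) << 7)           # exponent to bits 14..7
            | (sm.astype(np.uint16) & 0x7F))         # mantissa in bits 6..0

rng = np.random.default_rng(0)
# Simulated BF16 weights: small-magnitude normals, truncated from float32.
weights = (rng.normal(scale=0.02, size=4096).astype(np.float32)
           .view(np.uint32) >> 16).astype(np.uint16)
sm, e_chunks = decompose(weights)
assert np.array_equal(reconstruct(sm, e_chunks), weights)    # lossless round trip
```

Because the simulated weights occupy a narrow magnitude band, the exponent shards compress well below their raw size while the round trip stays bit-exact.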

3.3 Decompression and Losslessness

At inference, decompression of E-chunks is bit-exact, yielding unaltered exponent sequences; recombination with the corresponding SM-chunks reconstructs the precise original BF16 values. The overall storage is |SM| + ρ·|E|, where ρ ≤ 1 is the compression ratio achieved on the exponent chunks.

3.4 Edge Hardware Optimization

On unified-memory-architecture (UMA) SoCs, such as those found in Jetson modules, host-pinned memory supports direct (zero-copy) GPU access, obviating redundant memory copies. The decompression workload is parallelized across L ≥ 3 CPU cores, allowing its cost to be effectively masked by the latency of SSD reads.

4. Cache-Affinity Scheduling and Performance Guarantees

4.1 Problem Setup

ZipMoE models tensor reconstructions as tasks Q = {j}, each associated with:

  • I/O cost: u per SM-chunk, (ρ/K)·u per E-chunk
  • Decompression cost: c per E-chunk
  • GPU reconstruction cost: p_{n(j)}

System resources consist of one I/O thread, L decompression threads, and one CUDA stream. The primary objective is to minimize the inference makespan M, the maximal per-token execution completion time.

4.2 Block and Task Partitioning

Reconstruction tasks are classified as:

  • Type-I: SM-chunk not cached—requires both SM and exponent fetch.
  • Type-II: SM-chunk cached—requires only exponent fetch.

Each type is ordered by non-increasing GPU cost p, forming sequences σ_I and σ_II. The block-construction algorithm iteratively groups tasks into blocks that saturate resource usage while keeping the schedule compute-bound, using a ranking on expected expert activations.
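A simplified greedy pipeline simulation conveys how a makespan arises from the serial I/O thread, L decompression workers, and single CUDA stream. This is a stand-in illustration, not the paper's block-construction algorithm; per-task costs are made up.

```python
import heapq

def makespan(tasks, L=3):
    """Schedule (io, decompress, gpu) cost triples through the pipeline:
    serial I/O thread -> earliest-free of L CPU workers -> one GPU stream,
    taking tasks in non-increasing GPU-cost order."""
    tasks = sorted(tasks, key=lambda t: -t[2])   # big GPU jobs first
    io_free, gpu_free, finish = 0.0, 0.0, 0.0
    workers = [0.0] * L                          # min-heap of worker free times
    heapq.heapify(workers)
    for io, dec, gpu in tasks:
        io_done = io_free + io                   # I/O reads are serialized
        io_free = io_done
        w = heapq.heappop(workers)               # earliest-free CPU worker
        dec_done = max(w, io_done) + dec         # decompress after chunk arrives
        heapq.heappush(workers, dec_done)
        gpu_done = max(gpu_free, dec_done) + gpu # single CUDA stream is serial
        gpu_free = gpu_done
        finish = max(finish, gpu_done)
    return finish

# Hypothetical (io, decompress, gpu) costs for four reconstruction tasks.
tasks = [(1.0, 2.0, 1.5), (0.5, 1.0, 3.0), (0.8, 2.5, 0.7), (0.2, 1.2, 2.2)]
M = makespan(tasks, L=3)
```

The makespan can never beat the aggregate time on any single serial resource, which is exactly the kind of lower bound the proof in Section 4.3 uses against OPT.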

4.3 Theoretical Approximation Ratio

ZipMoE ensures a provable makespan bound. Specifically,

ALG ≤ (3 − 1/L) · OPT

where ALG is the makespan of the schedule generated by ZipMoE and OPT is the theoretically minimal makespan. The guarantee is proven by lower-bounding OPT via the aggregate I/O, compute, and token-execution times, then bounding the extra ("charged") idle intervals introduced by scheduling decisions [(Yang et al., 29 Jan 2026), Sec A.2].

5. System Implementation

The ZipMoE system is implemented as follows:

  • Frontend: 2.6 K lines of Python, leveraging HuggingFace Transformers for integration and profiling.
  • Engine and Scheduler: 8 K lines C++/CUDA.
  • Compression/Decompression: Uses lz4 and zstd libraries on AArch64 CPUs for E-chunk compression; SM-chunks are managed as byte-packed arrays.
  • Zero-Copy I/O: SM-chunks are read directly into host-pinned memory for DMA by the GPU, streamlining memory handling.
  • Thread Organization: Unified thread pools for I/O and decompression operate at page cache granularity with OS-level affinity.
  • GPU Recovery Kernel: CUDA threads execute contiguous, vectorized data loads to reconstruct BF16 tensors with memory-coalesced stores.

6. Empirical Evaluation and Performance

6.1 Experimental Setup

  • Hardware: Evaluation utilizes NVIDIA Jetson AGX Orin SoCs (32 GB and 64 GB configurations) and a Samsung 970 EVO SSD (3.5 GB/s throughput).
  • Models: DeepSeekV2-Lite, Qwen1.5-MoE (decoder-only), and SwitchTransformers-Large-128 (encoder-decoder).
  • Workloads: ShareGPT prompts, batch sizes 1/4/16, up to 512 output tokens.
  • Baselines: MoE-Infinity, DeepSpeed (ZeRO-3 offload), and Accelerate, with aligned RAM budgets of 10 GB and 20 GB.

6.2 Results

  • Latency: ZipMoE reduces Time-Per-Output-Token (TPOT) by 62.7%–97.9% and Time-To-First-Token (TTFT) by 53.3%–87.9% (decoder-only); for encoder-decoder, TPOT decreases by 5.0%–81.2% and TTFT by up to 83.5%.
  • Throughput: Achieves 1.79–42.5× improvement over baselines for decoder-only models, and 1.31–5.82× for encoder-decoder at all batch sizes.
  • End-to-End Speedup: Attains 3.03×–42.49× for decoder-only and 1.11×–5.64× for encoder-decoder models across output lengths.
  • Ablation: Hierarchical cache planning and cache-affinity scheduling Pareto-dominate standard eviction policies (FIFO, LRU, Marking).
  • Peak Observed Gains: Up to 72.77% inference latency reduction; up to 6.76× throughput improvement over competitive systems.
| Metric                                   | Decoder-Only Gain | Encoder-Decoder Gain |
|------------------------------------------|-------------------|----------------------|
| TPOT reduction                           | 62.7% – 97.9%     | 5.0% – 81.2%         |
| TTFT reduction                           | 53.3% – 87.9%     | up to 83.5%          |
| Throughput improvement                   | 1.79× – 42.5×     | 1.31× – 5.82×        |
| End-to-end speedup                       | 3.03× – 42.49×    | 1.11× – 5.64×        |
| Peak latency reduction / throughput gain | 72.77% / 6.76×    |                      |

7. Limitations, Future Directions, and Applicability

The current ZipMoE methodology employs static offline compression, which does not adapt to dynamic changes in weight distributions. Its cache planning relies on a stationary, rank-based model of expert activation and does not yet include online refinement or lightweight prefetch predictors. Adaptive tuning of compressor choice and sharding granularity per expert remains future work. The approach generalizes to any sparse-activation model architecture, including Routed Mixture-of-Experts and SwitchModels, and is applicable to a broad class of UMA platforms beyond Jetson, such as mobile SoCs and heterogeneous CPU-NPU environments.

Security and robustness properties are preserved by the lossless design, maintaining behavioral fidelity with full-precision MoE inference, and laying groundwork for incorporating privacy-preserving, encrypted inference in future extensions (Yang et al., 29 Jan 2026).
