
Unified Attention-FFN MoE

Updated 20 February 2026
  • The paper introduces UMoE, a method that reformulates multi-head attention as FFN-like operations to enable shared expert modules across both sub-layers.
  • UMoE employs a top-$k$ gating mechanism with low-rank expert-specific query adaptations, achieving state-of-the-art perplexity and strong downstream performance.
  • Empirical results demonstrate that UMoE outperforms dense and FFN‑MoE baselines in language modeling and multilingual ASR, validating its unified design.

Unified Attention-FFN Mixture-of-Experts (UMoE) is an architectural strategy for scaling Transformer models by integrating Mixture-of-Experts (MoE) blocks into both the multi-head attention (MHA) and feed-forward network (FFN) sub-layers, with the unique property of parameter sharing between these traditionally distinct modules. UMoE introduces an algebraic reformulation of attention that reveals its underlying FFN-like structure, thereby enabling unified MoE design using a global bank of shared sparse experts. Empirical results indicate UMoE achieves state-of-the-art perplexity and downstream performance under tight compute and parameter budgets, and its principles have informed architectures in language modeling and multilingual speech recognition domains (Yang et al., 12 May 2025, Ma et al., 22 Jan 2025).

1. Motivation and Conceptual Insight

Sparse Mixture-of-Experts (MoE) layers scale Transformer capacity by routing each token through only a small subset of a large set of learned "experts," typically implemented as two-layer FFNs. Standard MoE adoption concentrated on the FFN sub-layer, but efforts to extend MoE to self-attention components (e.g., MoA, SwitchHead) saw reduced performance, owing to fundamental differences: attention layers consist of projection, softmax mixing, and weighted sum operations, whereas FFNs are two straightforward linear transformations sandwiching a nonlinearity.

UMoE bridges this divide by expressing multi-head attention algebraically as a token-mixing operation followed by a two-matrix FFN transform, thus making both attention and FFNs amenable to the same MoE methodologies. Critically, this decomposition permits shared expert modules (i.e., parameterized two-layer FFNs) to be used interchangeably in both attention and FFN roles within any Transformer layer (Yang et al., 12 May 2025).

2. Algebraic Reformulation of Attention as FFN-like Blocks

UMoE relies on a "pre-mixing" view of multi-head attention. Given input embeddings $X \in \mathbb{R}^{n \times d}$ and a query token $x \in \mathbb{R}^{d}$, for a single head:

  • $Q = x W_q$, $K = X W_k$, $V = X W_v$
  • $a = \operatorname{softmax}(Q K^\top / \sqrt{d_k})$, $o = a V$

In the conventional formulation, the output is $y = [o_1; \ldots; o_h] W_o$ for $h$ heads; with $W_o$ block-decomposed into per-head matrices $W_o^i$, this becomes:

$$y = \sum_{i=1}^{h} (a_i X)(W_v^i W_o^i)$$

The term $(a_i X)$ aggregates a soft mix of all input tokens (token mixing), while the matrix product $W_v^i W_o^i$ constitutes a two-matrix transformation matching the FFN pattern. Therefore, per-head computation is interpretable as applying an FFN expert to a mixed token representation. This structure underpins the unified expert architecture (Yang et al., 12 May 2025).
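This identity can be checked numerically. The following sketch (plain NumPy, with small illustrative dimensions) computes one query token's multi-head attention output both the conventional way and via the pre-mixing decomposition, and confirms the two agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 5, 8, 2          # tokens, model dim, heads
dk = d // h                # per-head dimension

X = rng.standard_normal((n, d))   # input sequence embeddings
x = rng.standard_normal(d)        # a single query token

Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))
Wo = rng.standard_normal((d, d))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Conventional multi-head attention: per-head o_i = a_i V_i, then concat @ Wo.
heads, attn = [], []
for i in range(h):
    sl = slice(i * dk, (i + 1) * dk)
    q, K, V = x @ Wq[:, sl], X @ Wk[:, sl], X @ Wv[:, sl]
    a = softmax(q @ K.T / np.sqrt(dk))
    attn.append(a)
    heads.append(a @ V)
y_conventional = np.concatenate(heads) @ Wo

# Pre-mixing view: y = sum_i (a_i X) (W_v^i W_o^i), where W_o^i is the
# i-th row-block of Wo and W_v^i the i-th column-block of Wv.
y_premix = sum(
    (attn[i] @ X) @ (Wv[:, i * dk:(i + 1) * dk] @ Wo[i * dk:(i + 1) * dk, :])
    for i in range(h)
)

assert np.allclose(y_conventional, y_premix)
```

The agreement is exact (up to floating point) because the rewrite only regroups matrix products; no approximation is involved.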

3. Routing and Expert Transform Mechanisms

Routing in UMoE applies the standard top-$k$ gating algorithm popularized in BASE layers: for input $h \in \mathbb{R}^{d}$, routing scores are $p = \operatorname{softmax}(W_r h) \in \mathbb{R}^{N}$, and the set of active experts $T = \{i_1, \ldots, i_k\}$ corresponds to the top $k$ scores. The gating mask is:

$$g_i(h) = \begin{cases} p_i & \text{if } i \in T \\ 0 & \text{otherwise} \end{cases}$$

Each expert $E_i$ is realized as a two-layer FFN with nonlinearity $\sigma$:

$$E_i(h) = W_{2,i}\,\sigma(W_{1,i} h)$$

For attention-MoE, routing operates on the soft-mixed token vectors; for FFN-MoE, on the sub-layer input or output.
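A minimal sketch of this routing-plus-expert computation in NumPy, assuming a ReLU nonlinearity and unnormalized top-$k$ gate weights (both illustrative choices; the paper's exact settings may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, k, d_ff = 8, 4, 2, 16   # model dim, expert count, active experts, expert hidden dim

Wr = rng.standard_normal((N, d))              # router matrix
W1 = rng.standard_normal((N, d_ff, d)) * 0.1  # expert first-layer weights
W2 = rng.standard_normal((N, d, d_ff)) * 0.1  # expert second-layer weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_forward(h):
    """Top-k gated mixture over a pool of two-layer FFN experts."""
    p = softmax(Wr @ h)          # routing probabilities over N experts
    top = np.argsort(p)[-k:]     # indices of the k largest scores (the set T)
    out = np.zeros_like(h)
    for i in top:
        expert_out = W2[i] @ np.maximum(W1[i] @ h, 0.0)  # E_i(h) with ReLU
        out += p[i] * expert_out                         # gate g_i(h) = p_i
    return out

y = moe_forward(rng.standard_normal(d))
```

Only $k$ of the $N$ experts contribute to any given token, which is what keeps the per-token compute roughly constant as the expert pool grows.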

The query projections in attention require per-expert adaptation, implemented in UMoE with a LoRA-style low-rank additive term of rank $r \ll d$:

$$q_i = x W_q + x W_a^i W_b^i$$

where $W_q$ is shared across experts and the factors $W_a^i W_b^i$ are expert-specific.
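A sketch of this low-rank query adaptation with illustrative dimensions; each expert adds only $2dr$ parameters on top of the shared $W_q$, versus $d^2$ for a full per-expert projection:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 8, 2        # model dim and low rank, r << d

Wq = rng.standard_normal((d, d))          # shared query projection
Wa = rng.standard_normal((d, r)) * 0.1    # expert-specific down-projection
Wb = rng.standard_normal((r, d)) * 0.1    # expert-specific up-projection

x = rng.standard_normal(d)

# q_i = x Wq (shared) + x Wa^i Wb^i (expert-specific low-rank correction)
q_i = x @ Wq + (x @ Wa) @ Wb
```

The additive form means the low-rank term can be folded into $W_q + W_a^i W_b^i$ at inference time if desired.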

4. Layer Structure and Parameter Sharing

A UMoE layer incorporates two major MoE sub-layers:

  1. Pre-mixing Attention-MoE:
    • Compute shared $X W_k$ and $X W_v$ for the sequence.
    • For each top-$k$ expert allocated to token $x$, form $q_i$ via the shared and low-rank expert-specific projections.
    • Compute $o_i = \operatorname{Attention}(q_i, K, V)$, then the expert output $e_i = E_i(o_i)$.
    • Aggregate into output via the router's mixture weights.
  2. FFN-MoE:
    • Route the updated token representation through the top-$k$ selected experts, then aggregate their outputs.

All modules draw from a single global pool of $N$ experts $\{W_{1,i}, W_{2,i}\}$; only the router matrices $W_r$ are distinct for the attention and FFN sub-blocks. Relative to duplicating the expert set for each sub-layer, this sharing roughly halves the expert parameter count with no reported loss in model capacity.

Layer normalization and residual connections are implemented per standard Transformer practice.
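Putting the pieces together, a simplified single-head UMoE layer can be sketched as follows. This is an illustrative NumPy reduction, not the paper's implementation: it omits per-expert query adaptation, layer normalization, and load-balancing losses, and assumes a ReLU expert nonlinearity. Both sub-layers draw on the same expert pool but use distinct routers:

```python
import numpy as np

rng = np.random.default_rng(3)
d, dk, N, k, d_ff, n = 8, 8, 4, 2, 16, 5

# Single global expert pool {W1_i, W2_i}, shared by both MoE sub-layers.
W1 = rng.standard_normal((N, d_ff, d)) * 0.1
W2 = rng.standard_normal((N, d, d_ff)) * 0.1
Wr_att = rng.standard_normal((N, d))   # attention router (distinct)
Wr_ffn = rng.standard_normal((N, d))   # FFN router (distinct)

Wq = rng.standard_normal((d, dk))
Wk = rng.standard_normal((d, dk))
Wv = rng.standard_normal((d, d))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def expert(i, h):
    return W2[i] @ np.maximum(W1[i] @ h, 0.0)   # two-layer FFN expert E_i

def moe(h, Wr):
    p = softmax(Wr @ h)
    top = np.argsort(p)[-k:]                    # top-k active experts
    return sum(p[i] * expert(i, h) for i in top)

X = rng.standard_normal((n, d))
K, V = X @ Wk, X @ Wv                           # shared key/value projections

out = np.zeros_like(X)
for t in range(n):
    x = X[t]
    a = softmax((x @ Wq) @ K.T / np.sqrt(dk))   # token-mixing weights
    mixed = a @ V                               # pre-mixed representation
    x = x + moe(mixed, Wr_att)                  # attention-MoE + residual
    x = x + moe(x, Wr_ffn)                      # FFN-MoE + residual, same expert pool
    out[t] = x
```

The key structural point the sketch preserves is that `expert` is called by both sub-layers with no attention-specific weights beyond the router.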

5. Empirical Results and Quantitative Performance

UMoE outperforms Dense, FFN-MoE, MoA, and SwitchHead baselines. For "Base"-scale models (~540M parameters; ~610G MACs), reported results are:

Model                  FineWeb PPL   Wiki103 PPL   Avg Zero-shot Acc (%)
Dense (134M)           25.79         30.41         36.14
FFN-MoE (535M)         21.19         27.94         39.55
MoA (525M)             22.28         27.57         38.49
SwitchHead (533M)      22.91         29.47         38.30
UMoE-Att only (547M)   20.81         27.45         39.94
UMoE full (540M)       20.44         26.67         40.06

UMoE full denotes the variant with a shared expert pool across the attention and FFN MoE sub-layers; it achieves the best figure in every column.

Ablation studies show that allocating all $k$ active experts in a layer to attention (rather than FFN) yields further perplexity reduction, while the inclusion of a nonlinearity in the expert FFNs is critical (removing it degrades perplexity by ~1.6 points). Slight gains are realized when using separate routers for the two sub-layer types, even with shared expert weights (Yang et al., 12 May 2025).

6. Application in Domain-Robust Multilingual ASR

UMoE principles have influenced architectures such as BLR-MoE for end-to-end multilingual automatic speech recognition (ASR) (Ma et al., 22 Jan 2025). In BLR-MoE:

  • Each Transformer layer in the Mixture-of-Language Experts (MLE) block replaces standard MHA and FFN sub-layers with attention-MoE and FFN-MoE counterparts.
  • The router, mediated by a LID (Language ID) signal and optionally augmented with a TDNN adapter, produces gating weights that are shared across attention and FFN MoE modules within a layer.
  • During inference, "expert pruning" is employed using known language constraints to further improve recognition performance under domain shift.

Performance on a 10,000-hour MASR dataset demonstrates substantial relative WER reductions: BLR-MoE outperforms LR-MoE (FFN-only) by 16% relative WER (15.84% vs. 18.89% overall WER). Out-of-domain WER shows 19% relative gain. Additional ablations confirm that both attention-MoE and router augmentation independently yield notable improvements (Ma et al., 22 Jan 2025).

7. Comparative Architectural Approaches and Implications

The UMoE paradigm is closely related to recent advances in expert decomposition and routing, such as Union-of-Experts (UoE) (Yang et al., 4 Mar 2025), which conducts expert decomposition on both MLP and attention blocks using matrix partitioning and supports hierarchical, patch-wise, or expert-wise routing. While UoE attains strong efficiency gains and performance improvements, UMoE's distinct contribution is the algebraically motivated, exact equivalence between attention and FFN MoE mechanisms and the consequent ability to use a single global expert pool.

A plausible implication is that, as model designs continue to bring routing flexibility and parameter sharing to both attention and FFN modules, future large-scale Transformer architectures will increasingly converge toward UMoE-style unified, capacity-scaled frameworks. The algebraic insights of UMoE support high parameter efficiency and enable scaling both key sub-layers synergistically, rather than in isolation.

