
Attention-FFN Disaggregation Systems

Updated 22 January 2026
  • Attention-FFN Disaggregation (AFD) is a modular framework that separates attention and feed-forward network streams in transformers to improve efficiency, interpretability, and bias mitigation.
  • It restructures transformer blocks to enable specialized compression in vision models and decentralized inference in language models through quantized activations and MoE techniques.
  • AFD facilitates targeted bias mitigation by independently analyzing and masking biased attention heads and FFN outputs, thereby enhancing predictive performance and system throughput.

Attention-FFN Disaggregation (AFD) refers to the explicit separation and targeted optimization or manipulation of the attention and feed-forward network (FFN) components within transformer architectures. Originally envisioned to address computational inefficiencies, redundancy, and undesirable behavioral artifacts in both vision and language transformers, AFD frameworks restructure the transformer block to allow distinct analysis, compression, serving, or interpretability interventions for attention and FFN streams. This modular disaggregation has been instantiated in vision models for FLOP/param compression (Xu et al., 2023), in distributed LLM inference for cost and throughput optimization (StepFun et al., 25 Jul 2025), and in bias mitigation for LLMs via independent head and FFN vector manipulation (Zhou et al., 2024).

1. Conceptual Foundation of AFD

Transformers consist of repeated blocks interleaving multi-head self-attention (MHSA) and feed-forward networks. Standard architectures combine the two streams within each residual update, without distinguishing them:

$x^{\ell+1} = x^\ell + \mathrm{Att}^{(\ell)}(x^\ell) + \mathrm{FFN}^{(\ell)}(x^\ell)$

AFD models treat these as separable entities, either to expose and exploit internal redundancy, to align subsystem hardware utilization, or to allow interpretable, component-wise interventions.
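The separability the update rule implies can be sketched in a few lines: the attention and FFN contributions are computed as independent terms and only summed into the residual stream at the end. The toy single-head `attn` and ReLU `ffn` below are illustrative placeholders, not any paper's actual operators.

```python
import numpy as np

def attn(x, W_qkv, W_o):
    """Toy single-head self-attention: softmax(QK^T / sqrt(d)) V, then output projection."""
    d = x.shape[-1]
    q, k, v = (x @ W_qkv).reshape(len(x), 3, d).transpose(1, 0, 2)
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return (w @ v) @ W_o

def ffn(x, W1, W2):
    """Toy two-layer FFN with ReLU."""
    return np.maximum(x @ W1, 0.0) @ W2

def block(x, p):
    # Disaggregated view: each stream's contribution is computed separately
    # (possibly on different hardware), then added to the residual stream.
    return x + attn(x, p["W_qkv"], p["W_o"]) + ffn(x, p["W1"], p["W2"])
```

Because the two terms never interact before the final addition, each can be compressed, quantized, or placed on different hardware independently.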

In vision transformers (ViT), attention modules have quadratic cost and can exhibit highly correlated heads, while FFNs comprise a sizable proportion of parameters and FLOPs yet are often statically overparameterized (Xu et al., 2023). In LLM serving, attention (KV-cache, streaming) is memory-bound with small parameter load, while FFN (dense/MoE) layers are compute-bound but stateless (StepFun et al., 25 Jul 2025). In interpretability for bias mitigation, the additive contributions of attention heads and FFN vectors to logit outputs are directly decomposed and filtered (Zhou et al., 2024).

2. Modular Design: Stream Disaggregation and Its Algorithms

AFD frameworks physically or virtually split the transformer block into attention and FFN partitions.

  • Vision Compression (ViT): AFD consists of hallucinated-MHSA (hMHSA) and compact-FFN (cFFN). In hMHSA, half of attention heads are computed via standard Q–K dot-product and half hallucinated from real heads using intra- and cross-head convolutional projection. cFFN factorizes the hidden-to-output FFN weight matrix into low-rank matrices and restores capacity via re-parameterized training and single-matrix inference (Xu et al., 2023).
  • Distributed Decoding (LLMs): AFD deploys two subsystems: (1) Attention subsystem (data-parallel, maintains KV-cache, streams quantized activations), and (2) FFN subsystem (tensor/expert parallel, applies MoE/dense FFN computation). Layer-wise data-flow is orchestrated, with communication over GPUDirect RDMA fully overlapped with forward pass compute for high utilization (StepFun et al., 25 Jul 2025).
  • Bias Mitigation (LLMs): Attention and FFN are disaggregated for interpretability. Each head and FFN vector’s contribution to label logits is projected and statistically analyzed; biased components (large, class-skewed, low variance contributions) are masked at inference for instant de-biasing without retraining (Zhou et al., 2024).
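The serving design in the second bullet implies a quantize-ship-dequantize handoff: the attention worker quantizes its per-layer output before sending it to the FFN worker, which dequantizes and applies the FFN. The symmetric per-tensor int8 scheme below is a minimal illustration, not the actual wire format used in the cited system.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: returns payload plus a scale."""
    scale = float(np.abs(x).max()) / 127.0
    scale = scale if scale > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def ffn_worker(q, scale, W1, W2):
    # FFN subsystem: stateless, compute-bound; reconstructs activations and
    # applies the (here, toy ReLU) feed-forward computation.
    x = dequantize(q, scale)
    return np.maximum(x @ W1, 0.0) @ W2
```

Shipping int8 payloads quarters the activation traffic relative to FP32 at a bounded round-trip error of half a quantization step.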

3. Mathematical Formulations and Complexity Analysis

Vision Transformers

Attention Compression (hMHSA):

  • Standard FLOPs per block: $\mathcal{O}(4NC^2 + 2N^2C)$.
  • hMHSA FLOPs: $\mathcal{O}(3NC^2 + \frac{3}{2}N^2C + \frac{h^2}{4}N^2 + \frac{9h}{2}N^2)$.
  • Parameter reduction: from $4C^2$ to $\approx 3C^2$, plus minor convolution parameters.
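Plugging representative numbers into the two FLOP expressions above makes the savings concrete; the token count, channel width, and head count below are illustrative ViT-S-like choices, not values reported in the paper.

```python
# Leading-term FLOP counts per the expressions above.
def standard_block_flops(N, C):
    """Standard MHSA block: O(4NC^2 + 2N^2 C)."""
    return 4 * N * C**2 + 2 * N**2 * C

def hmhsa_flops(N, C, h):
    """hMHSA: O(3NC^2 + (3/2)N^2 C + (h^2/4)N^2 + (9h/2)N^2)."""
    return 3 * N * C**2 + 1.5 * N**2 * C + (h**2 / 4) * N**2 + (9 * h / 2) * N**2

# Illustrative sizes: 196 patches + CLS token, width 384, 12 heads.
N, C, h = 197, 384, 12
std, hal = standard_block_flops(N, C), hmhsa_flops(N, C, h)
print(f"standard: {std/1e6:.1f} MFLOPs, hMHSA: {hal/1e6:.1f} MFLOPs, "
      f"saving: {100 * (1 - hal/std):.1f}%")
```

At these sizes the hallucinated heads cut attention-block FLOPs by roughly a fifth, with the small convolutional terms adding little back.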

FFN Compression (cFFN):

  • Standard FFN FLOPs: $2mNC^2$.
  • cFFN FLOPs: $NC^2\,(m + m\frac{r}{C} + \frac{r}{C})$; optimal rank $r = \frac{2mC}{3(m+1)}$.
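Transcribing the cFFN expressions above with exact rational arithmetic shows the resulting FLOP ratio in closed form; the expansion ratio and width below are illustrative.

```python
from fractions import Fraction

def standard_ffn_flops(N, C, m):
    # 2mNC^2: up-projection C -> mC plus down-projection mC -> C.
    return 2 * m * N * C * C

def cffn_flops(N, C, m, r):
    # NC^2 (m + m*r/C + r/C), with factorization rank r.
    return N * C * C * (m + m * r / C + r / C)

def cffn_rank(m, C):
    # The rank prescribed above: r = 2mC / (3(m+1)).
    return Fraction(2 * m * C, 3 * (m + 1))
```

Substituting $r = \frac{2mC}{3(m+1)}$ gives cFFN FLOPs of $\frac{5}{3}mNC^2$, i.e. exactly $5/6$ of the standard FFN regardless of the expansion ratio $m$.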

Distributed LLM Serving

Decoding Cost Model:

  • Attention cost: $C_{\mathrm{attn}} = \max(\mathrm{FLOP}_{attn}\, U_{flop}, \mathrm{Byte}_{KV}\, U_{byte}) + \mathrm{FLOP}_{Linear}\, U_{flop}$
  • FFN cost: $C_{\mathrm{ffn}} = \mathrm{FLOP}_{FFN}\, U_{flop}$

Batch Size for MFU:

  • Dense: $B_{\mathrm{dense}} \geq R/2$
  • MoE: $B_{\mathrm{MoE}} \geq R/(2S)$, where $S$ is MoE sparsity.
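These two bounds reduce to one helper. Here $R$ is taken to be the accelerator's FLOP-to-byte ridge ratio and $S$ the MoE sparsity as in the formula above; the numeric values in the usage note are illustrative assumptions.

```python
import math

def min_batch(R, S=1):
    """Smallest batch at which the FFN matmuls become compute-bound:
    B >= R/2 for dense layers, B >= R/(2S) for MoE with sparsity S."""
    return math.ceil(R / (2 * S))
```

For example, with an assumed ridge ratio of R = 512, a dense FFN needs a batch of 256, while a MoE layer with S = 0.125 needs 2048 tokens so that each active expert still sees enough work.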

Network–MoE Joint Constraints:

  • $S \geq \frac{H \cdot \mathrm{FLOP}_{FFN} \cdot L}{\mathrm{NetBandwidth} \cdot 11.1\,\mathrm{ms}}$
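The joint constraint is a direct ratio, so a transcription suffices for back-of-envelope checks. All argument units are as in the source formula; the 11.1 ms window and any concrete values supplied by a caller are assumptions, not reported numbers.

```python
def min_moe_sparsity(H, flop_ffn, L, net_bandwidth, window_s=11.1e-3):
    """Lower bound on MoE sparsity S from the network-MoE joint constraint:
    S >= H * FLOP_FFN * L / (NetBandwidth * 11.1 ms).
    Units must match across arguments, as in the source formula."""
    return H * flop_ffn * L / (net_bandwidth * window_s)
```

The bound says that as per-layer FFN work or layer count grows relative to interconnect bandwidth, only sparser MoE configurations keep the disaggregated pipeline from being network-gated.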

Bias Decomposition

For each FFN vector ii and attention head hh, project their output into logit space over label set and compute mean, class spread, and variance across samples. Mask if metrics pass grid-searched thresholds.
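The statistics-then-mask procedure can be sketched as follows: project each component's contribution into label-logit space over a calibration batch, compute the three statistics, and flag components whose statistics cross thresholds. The statistic definitions and thresholds here are simplified stand-ins for the grid-searched criteria in the cited work.

```python
import numpy as np

def component_stats(contribs):
    """contribs: (n_samples, n_labels) logit contributions of one head or FFN vector.
    Returns (mean magnitude, class skew, sample-to-sample variance)."""
    mean_per_label = contribs.mean(axis=0)
    magnitude = np.abs(mean_per_label).mean()
    skew = mean_per_label.max() - mean_per_label.min()  # class-preference spread
    variance = contribs.var(axis=0).mean()
    return magnitude, skew, variance

def biased_mask(all_contribs, mag_t, skew_t, var_t):
    """Flag components with large, class-skewed, low-variance contributions;
    flagged components are zeroed at inference."""
    mask = []
    for contribs in all_contribs:
        mag, skew, var = component_stats(contribs)
        mask.append(mag > mag_t and skew > skew_t and var < var_t)
    return np.array(mask)
```

A component that pushes the same label regardless of input scores high magnitude and skew but low variance, which is exactly the signature the mask targets.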

4. Implementation Workflows and Practical Considerations

  • Hallucination Ratio: Half-real, half-hallucinated heads; double head count with halved per-head channels for even division.
  • Compact FFN: Factorization rank $r = \frac{2mC}{3(m+1)}$; two parallel BN branches each for $U$/$V$ during training; fused at inference.
  • Training Protocol: AdamW, stochastic depth, repeated augmentation. Inference merges re-parameterized branches.
  • Attention Subsystem: Executes quantized MFA; runs on bandwidth-focused GPUs (A800, L20).
  • FFN Subsystem: MoE compute on high-FLOP clusters (H800/Hopper); TP/EP/hybrid parallelism.
  • Communication: Direct RDMA with zero GPU SM usage; three-stage pipeline per micro-batch for perfect overlap.
  • Disaggregation: A forward pass over a batch of examples records each attention head's and FFN vector's output, which is then projected into label-logit space.
  • Masking Protocol: Identify biased heads/vectors by three statistics; mask by zeroing outputs during standard decoding.

5. Empirical Results and Performance Metrics

Replacing MHSA + FFN with hMHSA + cFFN across DeiT-T/S, PVTv2-b1, NextViT-S yields:

Model       ΔParams   ΔFLOPs   ΔAcc
DeiT-T      –18.2%    –19.0%   +0.7%
DeiT-S      –18.8%    –19.3%   +0.3%
PVTv2-b1    –13.2%    –15.2%   +0.0%
NextViT-S   –13.9%    –11.5%   +0.0%
  • Tokens/sec (TGS, Hopper, FP8): Step-3: 4,039 vs DeepSeek-V3: 2,324
  • Decoding cost per 1M tokens (8K context): $0.055 (Step-3 w/ AFD), $0.068 (DSv3 w/ EP), $0.062 (Qwen3-MoE)
  • Attention Layer Latency (8K/32K context): MFA (Step-3) is 20–30% faster than MLA or GQA.
  • One-shot Llama-2 7B (mean ± std): UniBias consistently outperforms vanilla ICL and calibration baselines on 9/12 tasks; average accuracy 70.5 vs 68.5 (CC) and 67.1 (Vanilla ICL).
  • Prompt Brittleness: Reduces swings from ±13 pp to ±2 pp on SST-2.
  • Ablations: FFN-only, Att-only both improve accuracy; combined yields best results.

6. Trade-offs, Scaling Behavior, and Analysis

AFD increases communication (one message per layer/micro-batch, $\approx 3 \times H \times B$ bytes), requiring lossless, low-jitter networks. This overhead is countered by design: stage latency balancing enables perfect overlap of compute and communication, so throughput is not gated by networking. As model size and context scale up, attention cost grows linearly with context length (requiring scalable attention hardware), while FFN cost remains context-independent, enabling system specialization.
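To make the overhead concrete, the per-step activation traffic implied by the ≈3·H·B-bytes-per-layer figure can be tallied directly; the hidden size, batch, and layer count in the usage note are illustrative assumptions, not a specific model's configuration.

```python
def afd_bytes_per_step(H, B, L, bytes_per_elem=1):
    """Activation bytes shipped between the attention and FFN subsystems per
    decoding step: ~3 messages of H*B elements per layer across L layers
    (FP8 activations -> 1 byte per element)."""
    return 3 * H * B * L * bytes_per_elem
```

For an assumed H = 7168, B = 128, and L = 61, this is about 168 MB per decoding step, which is why the design insists on RDMA transfers fully overlapped with compute.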

In MoE scenarios, FFN batch size BB and expert sparsity SS must be jointly tuned against underlying network bandwidth to maintain hardware efficiency and avoid saturating communication channels.

A plausible implication is that the architectural and system-level separation enabled by AFD yields both theoretical and practical Pareto improvements—maximum throughput per cost at scale—without sacrificing representation capacity, as confirmed empirically in both vision and language applications.

7. Impact in Research and Future Directions

Attention-FFN Disaggregation has proven effective in three domains:

  • Transformer Compression (ViTs): Enables substantial reduction in floating-point operations and parameter count with negligible or positive effects on accuracy, demonstrating that attention and FFN redundancy are actionable targets (Xu et al., 2023).
  • Distributed Serving and Decoding (LLMs): Unlocks ideal hardware alignment and parallelization for each subsystem, sets new Pareto frontiers in decoding throughput and cost (StepFun et al., 25 Jul 2025).
  • Interpretability and Bias Mitigation: Supports fine-grained manipulation of internal mechanisms to reduce prompt brittleness and bias, outperforming calibration baselines without external supervision or retraining (Zhou et al., 2024).

This suggests AFD is a unifying framework for both practical engineering in large-scale model deployment and for theoretical analysis of transformer internal dynamics. Future directions may focus on more generalized subsystem decomposition across non-standard architectures, adaptive disaggregation ratios based on real-time hardware and data constraints, and expanded interpretability protocols leveraging the disaggregated streams for deep behavioral analysis.
