FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Large Attention-Based Model Inference on Tile-Based Accelerators

Published 2 Apr 2026 in cs.AR | (2604.02110v1)

Abstract: Attention accounts for an increasingly dominant fraction of total computation during inference for mixture-of-experts (MoE) models, making efficient acceleration critical. Emerging domain-specific accelerators for large model inference are shifting toward chip-scale and wafer-scale tile-based architectures. Tiles contain large matrix and vector engines and are connected through on-chip interconnects, which support tile-to-tile traffic to reduce the tile-to-main-memory traffic bottleneck. Hence, dataflow management is crucial to achieve high utilization. We propose FlatAttention, a dataflow for modern attention variants on tile-based accelerators. FlatAttention minimizes expensive high-bandwidth memory (HBM) accesses by exploiting collective primitives integrated into the on-chip network fabric, achieving up to 92.3% utilization, 4.1x speedup over FlashAttention-3, and 16x lower HBM traffic. On a 32x32 tile configuration with peak performance comparable to NVIDIA GH200, FlatAttention generalizes across multiple attention variants, achieving an average of 86% utilization for compute-bound attentions and 78% HBM bandwidth utilization for memory-bound ones, resulting in an average 1.9x speedup over attention implementations on GH200. Finally, we evaluate end-to-end DeepSeek-v3 FP8 decoding with FlatAttention on a wafer-scale multi-die system, achieving a 1.9x improvement in system throughput and a 1.4x reduction in per-user token output latency, despite operating with 1.5x lower peak system performance compared to the state-of-the-art solution.

Authors (4)

Summary

The paper introduces a dataflow that aggregates tiles into groups for large attention models, reducing HBM I/O complexity by up to 6.6×.
It leverages hardware collectives to reach up to 92.3% compute utilization and up to a 4.1× speedup over FlashAttention-3.
The approach unifies multiple attention variants with scalable tiling, guiding future architecture-algorithm co-design for wafer-scale systems.

Introduction and Motivation

FlatAttention is a dataflow for efficient inference of large attention-based models on tile-based accelerators. As mixture-of-experts (MoE) LLMs move toward wafer-scale deployments and adopt attention variants such as multi-head attention (MHA), grouped-query attention (GQA), and multi-head latent attention (MLA), attention kernels increasingly bottleneck system throughput because of their high bandwidth requirements, particularly for off-chip high-bandwidth memory (HBM) accesses. GPU-oriented dataflows such as FlashAttention-3 and FlashMLA reach only 26–64% of peak on NVIDIA GH200, even with advances like asynchronous execution (Figure 1).

Figure 1: (a) FLOP breakdown for LLMs shows attention dominating compute in recent architectures; (b) Roofline plot revealing FlashAttention-3 and FlashMLA on GH200 operate far from hardware limits.

Simultaneously, emerging accelerator architectures, particularly those based on dense tile meshes interconnected via high-radix on-chip networks (NoCs), enable direct, hardware-accelerated collective communication primitives (e.g., multicast, reduction), which are underutilized by standard GPU-style dataflows. Exploiting such collectives, FlatAttention redefines how attention blocks are mapped and parallelized, targeting extreme data reuse and minimization of global HBM traffic while maximizing matrix engine utilization.

Tile-Based Architectures and Modern Attention Variants

Tile-based many-PE architectures, as exemplified by recent wafer-scale systems, comprise a 2D mesh of compute tiles, each integrating matrix, vector, and scalar engines and a local scratchpad memory. Tiles communicate through a NoC fabric with hardware support for collective operations. Multiple dies can be interconnected, up to wafer scale, so that both memory and compute scale (Figure 2).

Figure 2: (a) Tile-based many-PE template; (b) Row-wise multicast with hardware collectives; (c) Wafer-scale multi-die mesh system.

Modern LLMs demand efficient support for numerous attention variants:

MHA: Classic, with per-head key/value projections.
GQA/MQA: Key/value projections are shared across groups or all heads, reducing cache pressure.
MLA: Employs compression/decompression of key/value tensors, further optimizing KV cache usage.

These variants share a core: large matrix-matrix or vector-matrix operations whose memory footprints quickly exhaust on-chip resources unless they are carefully mapped.
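To make the shared core concrete for decode, here is a minimal sketch in plain Python (no external libraries). It assumes a single GQA group whose query heads share one K matrix and one new token per head. The helper names `scores_per_head` and `scores_stacked` are illustrative and do not come from the paper. Stacking the per-head query vectors turns several vector-matrix products into one matrix-matrix product with the same result.

```python
def matmul(a, b):
    """Plain-Python product of a (r x k) and b (k x c)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def scores_per_head(q_heads, k):
    """One vector-matrix product per query head; each head walks K separately."""
    k_t = transpose(k)
    return [matmul([q], k_t)[0] for q in q_heads]


def scores_stacked(q_heads, k):
    """Stack the heads into a (heads x d) matrix and issue one matrix-matrix product."""
    q_mat = [list(q) for q in q_heads]
    return matmul(q_mat, transpose(k))


# Hypothetical toy sizes: 4 query heads in one group, head dim 2, 3 cached keys.
q_heads = [[1, 0], [0, 1], [1, 1], [2, -1]]
k = [[1, 2], [3, 4], [5, 6]]
assert scores_per_head(q_heads, k) == scores_stacked(q_heads, k)
```

The stacked form is the one that keeps a matrix engine busy, since the shared K slice is fetched once for the whole group of heads.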

FlatAttention Dataflow: Structure and Implementation

FlatAttention’s central idea is to form logical groups of tiles (“tile groups”) that jointly process attention blocks too large for any single tile’s memory. Each group pools the L1 memories of its tiles to hold larger data slices, which increases on-chip data reuse and lowers HBM I/O complexity.

The cost is intra-group collective communication, which the fabric supports efficiently in hardware (Figure 3).

Figure 3: (a) Parametric group definition for FlatAttention; (b) Detailed dataflow with block assignments; (c) Naive schedule; (d) Optimized asynchronous schedule.

Key principles include:

Block aggregation: Tile groups process $\left(N \cdot M, N \cdot M\right)$ blocks (for group size $N \times N$ and per-tile block size $M$), reducing off-chip I/O compared to the per-tile FlashAttention dataflow.
Row/column-wise collectives: Diagonal tiles load data from HBM, which is then distributed via efficient row/column multicasts and reductions.
Asynchronous pipeline: Softmax and data movement are overlapped with compute-intensive matrix multiplications through dual-head scheduling, closely matching the utilization envelope of matrix engines.
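
A minimal sketch of the row/column collective pattern, in plain Python with no external libraries. It simulates one $N \times N$ tile group in a single process, so multicasts and reductions appear as ordinary list operations. The function names, the choice to send Q along rows and K/V along columns, and the single-head setup are assumptions made for illustration, not details confirmed by the paper.

```python
import math


def matmul(a, b):
    """Plain-Python product of a (r x k) and b (k x c)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def reference_attention(q, k, v):
    """Single-head softmax(Q K^T / sqrt(d)) V computed in one place."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for row in matmul(q, transpose(k)):
        m = max(row)
        e = [math.exp((s - m) * scale) for s in row]
        z = sum(e)
        out.append([sum(p * v_row[c] for p, v_row in zip(e, v)) / z
                    for c in range(len(v[0]))])
    return out


def flat_attention_group(q, k, v, n):
    """Simulate an n x n tile group; tile (i, j) works on Q slice i and K/V slice j."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    q_rows, kv_rows = len(q) // n, len(k) // n
    # Diagonal tile (i, i) loads slice i from HBM once. The slices are then
    # multicast: Q_i along row i, K_j and V_j along column j (assumed direction).
    q_sl = [q[i * q_rows:(i + 1) * q_rows] for i in range(n)]
    k_sl = [k[j * kv_rows:(j + 1) * kv_rows] for j in range(n)]
    v_sl = [v[j * kv_rows:(j + 1) * kv_rows] for j in range(n)]
    out = []
    for i in range(n):
        # Every tile (i, j) computes its local score slice S_ij = Q_i K_j^T.
        s = [matmul(q_sl[i], transpose(k_sl[j])) for j in range(n)]
        for r in range(q_rows):
            # Row-wise max reduction across the tiles of row i.
            row_max = max(max(s[j][r]) for j in range(n))
            p = [[math.exp((x - row_max) * scale) for x in s[j][r]] for j in range(n)]
            # Row-wise sum reduction of the softmax denominators.
            row_sum = sum(sum(p_j) for p_j in p)
            # Each tile forms a partial output P_ij V_j; a row-wise sum reduction merges them.
            partial = [matmul([p[j]], v_sl[j])[0] for j in range(n)]
            out.append([sum(part[c] for part in partial) / row_sum for c in range(dv)])
    return out
```

Comparing `flat_attention_group(q, k, v, n)` with `reference_attention(q, k, v)` on small inputs is a quick check that grouping changes where the work happens, not the result. The sketch requires sequence lengths divisible by `n`.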

Compared to naive GPU FlashAttention mapping, FlatAttention’s I/O complexity decreases by up to $6.6\times$ for typical LLM configurations, with the bottleneck shifting from HBM to on-chip network efficiency.
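
The effect of block aggregation on HBM traffic can be sketched with a deliberately simplified I/O model in plain Python. The model is an assumption, not the paper's formula: Q is read once, O is written once, and K and V are re-read once per Q block. The function names and the example sizes are hypothetical, and the output is not expected to reproduce the paper's reported numbers.

```python
def hbm_elements(seq_len, head_dim, block_rows):
    """Elements moved between HBM and on-chip memory for one attention head.

    Simplified model (assumption): Q read once, O written once,
    K and V re-read once for every Q block.
    """
    q_blocks = -(-seq_len // block_rows)  # ceiling division
    q_and_o = 2 * seq_len * head_dim
    k_and_v = 2 * seq_len * head_dim * q_blocks
    return q_and_o + k_and_v


def io_reduction(seq_len, head_dim, tile_block, group_n):
    """Traffic ratio: per-tile blocks of size M versus group blocks of size N * M."""
    per_tile = hbm_elements(seq_len, head_dim, tile_block)
    grouped = hbm_elements(seq_len, head_dim, tile_block * group_n)
    return per_tile / grouped


# Hypothetical configuration: 4096 tokens, head dim 128, M = 128, 4 x 4 group.
ratio = io_reduction(seq_len=4096, head_dim=128, tile_block=128, group_n=4)
print(f"modeled HBM traffic reduction: {ratio:.2f}x")
```

In this model the K/V re-read term shrinks by the group scale $N$, while the Q and O terms stay fixed, so the benefit saturates once a single group block covers the whole sequence.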

Optimization and Tiling Strategies

Performance is a function of group scale and tiling configuration, leading to a key tradeoff: larger groups improve reuse but can “over-flatten” slices, underutilizing individual matrix engines especially for short sequences.

Experimental results demonstrate:

Utilization: The best per-tile slice size is $128 \times 128$, which maximizes RedMulE matrix-engine utilization ($>95\%$) without exceeding the L1 budget (Figure 4).
Collective primitive acceleration: Hardware collectives deliver a $5$–$67\times$ speedup over software collectives for intra-group communication (Figure 2b).
Scalability across attention variants: FlatAttention unifies MHA, GQA, and MLA handling through tunable tiling/grouping strategies.

Figure 4: General FlatAttention tiling/group-scaling strategy addressing computation/memory tradeoff for MHA, GQA, MLA.
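
The over-flattening tradeoff can be illustrated with a small helper in plain Python. It assumes that an $N \times N$ group splits the Q and K/V sequence dimensions evenly across its tiles and that candidate group scales are powers of two. Both are simplifying assumptions, the function names are hypothetical, and the L1 capacity limit is not modeled.

```python
def per_tile_slice(seq_q, seq_kv, group_n):
    """Rows of Q and of K/V handled by each tile when an N x N group splits the block."""
    return -(-seq_q // group_n), -(-seq_kv // group_n)  # ceiling division


def largest_group_scale(seq_q, seq_kv, max_group_n, target=128):
    """Largest power-of-two N whose per-tile slice stays at least target x target."""
    best, n = 1, 1
    while n <= max_group_n:
        rows, cols = per_tile_slice(seq_q, seq_kv, n)
        if rows >= target and cols >= target:
            best = n
        n *= 2
    return best


# Long sequences can use the whole 32-wide mesh; short ones should stop earlier.
print(largest_group_scale(4096, 4096, max_group_n=32))
print(largest_group_scale(512, 512, max_group_n=32))
```

With a 512-token sequence, any group scale above 4 would leave each tile with a slice smaller than $128 \times 128$, which is the over-flattened regime described above.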

Quantitative Results

On a $32\times32$ tile architecture (GH200-class peak), FlatAttention achieves up to 92.3% compute utilization and 4.1$\times$ speedup over FlashAttention-3, with up to 16$\times$ lower HBM traffic.
Generalized across attention variants, average utilization is 86% (compute-bound) and 78% HBM bandwidth (memory-bound).
Against NVIDIA GH200’s best attention kernels, FlatAttention delivers an average 1.9$\times$ speedup across prefill and decode.
In end-to-end DeepSeek-v3-671B decoder inference on a wafer-scale multi-die system, FlatAttention improves overall system throughput by 1.9$\times$ and per-user latency by 1.4$\times$, despite 1.5$\times$ lower system peak performance compared to 96×H800 GPU deployments (Figure 5).

Figure 5: FlatAttention on tile-based accelerator vs. GH200 GPU attention baselines across variants and regimes.
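
One way to read the end-to-end result is to normalize the throughput gain by peak performance. The short calculation below is a derived illustration that uses only the two ratios quoted above. The normalized value is not a figure reported in the source, and the function name is hypothetical.

```python
def throughput_per_unit_peak(throughput_gain, peak_times_lower):
    """Throughput gain per unit of peak performance.

    peak_times_lower=1.5 means the system has 1.5x lower peak than the baseline,
    i.e. its peak is 1 / 1.5 of the baseline's.
    """
    return throughput_gain / (1.0 / peak_times_lower)


normalized = throughput_per_unit_peak(throughput_gain=1.9, peak_times_lower=1.5)
print(f"throughput per unit of peak: {normalized:.2f}x the baseline")
```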

Comparative Analysis and Implications

Relative to prior art, FlatAttention is the first to unify efficient fused-attention dataflow and fabric-level collective primitive utilization. Unlike recent efforts confined to single-tile/single-GPC fusion (e.g., FlashFuser, ClusterFusion), FlatAttention scales cleanly to mesh- or wafer-scale systems. While works such as COMET and Zen-Attention exploit collectives, they do not approach the full co-optimization of tiling, scheduling, and hardware-supported communication exposed in FlatAttention.

Critically, FlatAttention enables architectural co-design: the same performance analysis guides both the attention dataflow and the tile-based accelerator design (matrix engine width, memory organization, collective support), informing future wafer-scale LLM inference engines.

Practical and Theoretical Implications

Practically, FlatAttention enables:

Sustained high utilization and throughput for attention-bound inference even at hundreds of billions of parameters, relieving the attention bottleneck present in MoE architectures.
Significant reduction in HBM energy and cost per token, crucial at data center scale.
Efficient support for future speculative decoding and advanced attention mechanisms.

Theoretically, it demonstrates:

The necessity of architecture-algorithm co-design for domain-specific AI workload acceleration.
The return of collective communication to the core of on-chip ML dataflow, an inversion of recent GPU-exclusive designs.
Scalability to system limits with the potential to generalize to broader classes of tensor algebra where on-chip data reuse dominates.

Future Directions

Future system design will benefit from refined co-design of NoC topologies for collective bandwidth, adaptive group scaling at runtime, and integration with more diverse inference workloads. Potential architectural extensions include tighter coupling with chiplet-based HBM and custom collective-aware instruction sets.

Conclusion

FlatAttention co-optimizes dataflow and hardware to recover attention inference efficiency at wafer scale. By exploiting hierarchical tile grouping and hardware collectives, it reaches up to 92.3% compute utilization and narrows the gap between practical LLM deployment and hardware-limited peak performance.

References to Figures

(Figure 1): Motivation: FLOP breakdown and Roofline gap
(Figure 2): Architecture overview, fabric collective demonstration, wafer-scale system
(Figure 3): FlatAttention group definition/dataflow/schedules
(Figure 4): FlatAttention general tiling strategy
(Figure 5): GH200 vs. FlatAttention benchmarking across attention variants

Explain it Like I'm 14

What is this paper about?

This paper looks at how to make the “attention” part of LLMs run much faster and more efficiently on new kinds of computer chips. These chips are built from many small “tiles” that work together. The authors introduce a new way to organize the work, called FlatAttention, so tiles share data smartly on-chip instead of constantly going out to slow, power-hungry main memory. The result: faster speed, lower memory traffic, and better use of the chip’s math units.

What questions are the authors trying to answer?

The paper focuses on simple but important goals:

How can we speed up attention, which has become the slowest part of many modern LLMs?
How can we cut down on expensive trips to big external memory (HBM), which waste time and energy?
Can we design a dataflow that works for many attention styles (like MHA, GQA, MLA) and for both prompt processing (“prefill”) and token-by-token generation (“decode”)?
If we make attention faster on one chip, will that really improve full, end-to-end performance in a big multi-chip or even wafer-scale system?

How did they approach the problem?

Think of the chip as a school building:

Each “tile” is a classroom with fast local storage and powerful calculators.
The hallways between rooms are an on-chip network where tiles can pass notes quickly.
The distant library is the big, slow memory (HBM). Walking there takes time and energy.

Popular methods like FlashAttention work well on GPUs but still spend a lot of time fetching and refetching data from the “library.” FlatAttention changes this by getting neighboring rooms (tiles) to act as one big team:

Group work: Instead of each tile working separately, FlatAttention has tiles form groups and tackle a larger chunk together. Because the group’s combined local storage is bigger, they can keep more data nearby and reuse it.
Built-in group messaging: The chip’s network supports special “collective” operations:
- Multicast: one tile sends a copy of its data to many teammates at once (like a teacher announcing to a whole row).
- Reduction: tiles combine partial results into one final result (like everyone tallying their counts into a single class total).
- These are done directly in the network hardware, so they’re very fast.
Overlapping tasks: While some tiles move or share data, others keep the big math engines busy. It’s like having some students photocopy notes while others solve problems—no one is idle.
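
For readers who like numbers, here is a tiny timing model of the overlap idea in plain Python. It treats the work as a two-stage pipeline: the matrix engine is one stage, and everything else (softmax plus data movement) is the other. The stage times and function names are made up for illustration, and the real schedule in the paper is more detailed than this.

```python
def serial_time(n_blocks, t_matmul, t_other):
    """No overlap: each block finishes its matmul, then its softmax and data movement."""
    return n_blocks * (t_matmul + t_other)


def overlapped_time(n_blocks, t_matmul, t_other):
    """Two-stage pipeline: while one block is in softmax/data movement,
    the matrix engine already works on the next block."""
    if n_blocks == 0:
        return 0
    return t_matmul + (n_blocks - 1) * max(t_matmul, t_other) + t_other


def matmul_utilization(n_blocks, t_matmul, total_time):
    """Fraction of the total time during which the matrix engine is busy."""
    return n_blocks * t_matmul / total_time


# Hypothetical numbers: 8 blocks, 10 time units of matmul, 6 of everything else.
n, t_mm, t_oth = 8, 10, 6
for name, total in (("serial", serial_time(n, t_mm, t_oth)),
                    ("overlapped", overlapped_time(n, t_mm, t_oth))):
    print(name, total, round(matmul_utilization(n, t_mm, total), 3))
```

In this toy case the matrix engine is busy about 63% of the time without overlap and about 93% with it, because the slower stage alone sets the pace once the pipeline is full.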

They also adapt FlatAttention to different situations:

Prefill vs. decode: In prefill, the model processes a whole prompt at once. In decode, it generates one or a few tokens at a time and uses a “KV cache” (a memory of past tokens). FlatAttention adjusts group sizes so it still reuses data well in both cases.
Attention variants:
- MHA (Multi-Head Attention): the standard style.
- GQA (Grouped-Query Attention): groups heads to reduce memory use; FlatAttention treats grouped heads as a longer “query” so tiles can still use big, efficient matrix operations.
- MLA (Multi-Head Latent Attention): compresses key/value data so the cache is smaller; FlatAttention incorporates a math trick (“weight absorption”) so MLA can be handled like MQA during decoding.
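
The “weight absorption” trick rests on a simple identity: a query dotted with a decompressed key equals an “absorbed” query dotted with the compressed key. The sketch below checks that identity in plain Python for one head. The matrix name `w_uk` (the key up-projection) and the function names are assumptions for illustration, and position encodings such as RoPE are ignored here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(w, x):
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [dot(row, x) for row in w]


def transpose(m):
    return [list(col) for col in zip(*m)]


def score_decompressed(q, w_uk, c):
    """Decompress the latent c into a full key first, then take q . k."""
    return dot(q, matvec(w_uk, c))


def score_absorbed(q, w_uk, c):
    """Fold w_uk into the query once, then score directly against the latent c."""
    return dot(matvec(transpose(w_uk), q), c)


# Hypothetical toy sizes: key dim 3, latent dim 2.
w_uk = [[1, 2], [3, 4], [5, 6]]
q = [1, -1, 2]
c = [2, 1]
assert score_decompressed(q, w_uk, c) == score_absorbed(q, w_uk, c)
```

Because every head then scores against the same compressed vector `c`, the cached data is shared across heads, which is why the decode step can be treated like MQA.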

To test all this, the authors built a performance model calibrated with detailed hardware simulations. They compared FlatAttention against top GPU implementations (FlashAttention-3 and FlashMLA) and evaluated not only single chips but also a wafer-scale system that connects many chips with fast die-to-die links.

What did they find, and why does it matter?

Here are the main results, and why they’re important:

Much higher utilization and speed on attention:
- Up to 92.3% utilization of the math engines (meaning the hardware is actually kept busy doing useful work).
- Up to 4.1× faster than FlashAttention-3 on the same kind of tile-based accelerator.
- 16× less traffic to big external memory (HBM), which saves time and energy.
Strong, general performance across many attention styles:
- On a 32×32 tile setup with peak performance similar to NVIDIA’s GH200 GPU, FlatAttention reaches on average 86% utilization when attention is compute-heavy, and uses on average 78% of HBM bandwidth when attention is memory-heavy.
- Overall, it delivers about 1.9× speedup versus optimized GH200 attention implementations (FlashAttention-3 and FlashMLA), across both prefill and decode, and for MHA, GQA, and MLA.
End-to-end gains at system scale:
- On a wafer-scale multi-die system running a large, modern model (DeepSeek-v3) in FP8 decode, FlatAttention delivers 1.9× higher overall throughput (more tokens per second for the whole system) and 1.4× lower per-user token latency (faster responses).
- These gains happen even though the system’s peak performance was 1.5× lower than a state-of-the-art GPU setup, showing that smarter data movement can beat raw peak numbers.

These results matter because real-world LLM serving is often limited by memory traffic and coordination overhead, not just raw compute. FlatAttention directly attacks that bottleneck.

What is the bigger impact?

Better efficiency and lower costs: By reducing trips to external memory and keeping math units busy, FlatAttention can serve more users using less power and potentially fewer or cheaper chips.
Flexible and future-proof: It works across multiple attention types and both prefill and decode, fitting how today’s and tomorrow’s LLMs are built (including MoE and efficient attention variants like GQA/MLA).
Hardware–software co-design: The paper shows that designing the algorithm (dataflow) together with the chip’s on-chip network features (collective operations) delivers big gains. This is a powerful lesson for future AI hardware and model engineering.
Scales to big systems: Improvements on a single chip translate into real, end-to-end benefits on multi-chip and wafer-scale setups, which is how the largest models are deployed.

In short, FlatAttention is a smarter way to organize attention so tiles collaborate efficiently, cut down on memory traffic, and make LLMs run faster and cheaper at scale.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, phrased to be concrete and actionable for future research.

Empirical PPA validation: No silicon, FPGA, or emulation results; quantify area overhead, power, and energy-per-token (compute, NoC, HBM) of FlatAttention and fabric collectives versus GPU baselines and non-collective tile designs.
NoC/collectives microarchitecture detail: Specify router microarchitecture (buffering, virtual channels, flow control, replication datapaths), link widths, pipeline depths, and timing; quantify area/power overheads of multicast/reduction primitives on the presented tile counts.
Sensitivity to on-chip interconnect: Provide a systematic sensitivity analysis across NoC bandwidth, bisection, hop latency, router frequency, and injection limits; identify break-even group sizes where inter-tile traffic outweighs HBM savings.
Congestion and QoS under contention: Characterize multicast/reduction interference among concurrently active groups; evaluate QoS/priority mechanisms, head-of-line blocking, and fairness; identify deadlock-avoidance requirements for collective-heavy traffic.
Group topology constraints: Analyze performance impact of group shape misalignment with mesh rows/cols, groups near HBM edges, boundary tiles, and non-rectangular groups due to placement constraints.
Fault tolerance on wafer-scale: Study mapping robustness under defective tiles/links, router or memory-die failures; develop group-formation and routing with “holes,” and quantify performance degradation.
Inter-die collective communication: Clarify whether collectives are supported across dies; quantify latency/bandwidth of D2D collectives, their microarchitectural realization, and performance impact versus intra-die collectives.
Runtime/compilation support: Describe the software stack to express group-level collectives and asynchronous schedules (IR, compiler passes, kernel fusion, double-buffer insertion, barrier placement); quantify runtime overheads and compilation/autotuning time.
Asynchronous scheduling limits: Provide conditions for full overlap (DMA/vector vs. matmul) across realistic tile resources; quantify performance loss with imperfect overlap, resource contention, or back-pressure; give buffer sizing guidelines for robust overlap.
L1/SPM capacity and banking: Detail per-tile SPM sizes, banking, and access conflicts needed to realize claimed block sizes; quantify the area/energy cost of larger SPMs vs. using larger group sizes; provide design rules for SPM vs. group-scale trade-offs.
HBM realism and validation: Validate the HBM model against measured traces (bank conflicts, refresh, schedulers, tFAW/row-buffer locality); quantify sensitivity to controller policies and multi-stack striping; report read/write ratios and burstiness under FlatAttention.
Numerical stability and precision: Analyze distributed streaming softmax stability with mixed precision (FP8 decode, FP16/TF32 accumulations); quantify overflow/underflow risks across long contexts and extreme logits; report accuracy metrics (perplexity, pass@k) vs. baseline attention.
MLA weight absorption correctness: Rigorously characterize when absorption of up-projections is valid in the presence of RoPE/ALiBi and other pre-/post-projection transforms; detail implementation steps and any residual compute/memory overhead; evaluate accuracy impact.
KV-cache management at scale: Specify KV layout across HBM stacks and tiles, placement policies to minimize hops, and prefetch/eviction strategies; evaluate compatibility with paged/segmented KV caches and fragmentation across sessions.
Serving with ragged workloads: Evaluate dynamic group resizing and load balancing with heterogeneous prompts, variable decode lengths, speculative branches (including self-speculative early stops); quantify throughput/latency under multi-tenant, time-varying arrivals.
Co-optimization with MoE communication: Analyze how attention collectives interact with MoE all-to-all token routing; propose scheduling/co-placement to avoid NoC hotspots across attention and MoE layers; quantify end-to-end gains beyond attention-only optimizations.
Applicability to sparse/long-context attention: Extend FlatAttention to sliding-window, block-sparse, or long-context mechanisms (e.g., ring/segment attention, local+global heads); specify collective patterns and performance impacts.
Portability without hardware collectives: Quantify performance degradation when only software collectives (tree/sequence) are available; propose portability strategies for commodity accelerators lacking fabric-level support.
Multi-die parallelism policies: Provide a comprehensive comparison of DP/TP/PP/SP/EP mappings with FlatAttention under varying model sizes, batch sizes, and inter-die topologies; include a queueing-based latency/throughput analysis with real serving traces.
End-to-end accuracy with FP8 decode: Report model-level accuracy metrics for DeepSeek-v3 FP8 decoding under FlatAttention versus state-of-the-art implementations; assess sensitivity to calibration/quantization schemes and KV quantization.
Fairness of cross-system comparisons: Normalize speedups across precision modes, peak-FLOP disparities, and memory bandwidth differences (e.g., FP8 vs. FP16, HBM configs), providing iso-FLOP/iso-bandwidth comparisons and efficiency-per-watt.
Compiler/runtime reproducibility: Release or describe the modeling/simulation framework, calibration against RTL, and benchmark harness (workload mixes, sequence distributions) to enable reproducible evaluation across architectures.
Robustness to causal masking and speculative verification: Detail how causal masks are applied within group collectives; specify handling of speculative decoding verification that shortens accepted sequences, and its impact on scheduling and overlap.
Security/isolation in shared fabrics: Analyze implications of multicast/reduction on tenant isolation (e.g., side-channel leakage via timing/contention); propose isolation domains or bandwidth partitioning for multi-tenant serving.
Training/backward pass generalization: Investigate whether the collective-based dataflow extends to attention backward (dQ/dK/dV, softmax gradients) and optimizer/state updates; quantify memory/compute/collective needs for training or on-device fine-tuning.
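
As background for the numerical-stability item above, the sketch below shows the standard running-maximum merge used by streaming softmax, in plain Python. It handles one query row with scalar values for brevity. This is the generic technique, not the paper's implementation, and the function names are hypothetical. The property to note is that no exponential is ever taken of a raw logit, only of a difference from the current maximum.

```python
import math


def partial(scores, values):
    """Local statistics for one query row: (max, sum of exp, unnormalized output)."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return m, sum(w), sum(wi * vi for wi, vi in zip(w, values))


def merge(a, b):
    """Combine two partials by rescaling both to their shared maximum."""
    m = max(a[0], b[0])
    fa, fb = math.exp(a[0] - m), math.exp(b[0] - m)
    return m, a[1] * fa + b[1] * fb, a[2] * fa + b[2] * fb


def finalize(p):
    """Normalize the accumulated output by the accumulated denominator."""
    return p[2] / p[1]


# Logits near 1000 would overflow math.exp if used directly; the merge never does that.
left = partial([1000.0, 999.0], [1.0, 2.0])
right = partial([998.0, 1001.0], [3.0, 4.0])
merged = finalize(merge(left, right))
direct = finalize(partial([1000.0, 999.0, 998.0, 1001.0], [1.0, 2.0, 3.0, 4.0]))
assert abs(merged - direct) < 1e-12
```

The open question raised above is how this behaves in reduced precision (for example FP8 inputs with wider accumulators), which this double-precision sketch does not address.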

Practical Applications

Immediate Applications

The following items translate the paper’s findings and methods into concrete, deployable use cases across sectors, with notes on tools/workflows and feasibility constraints.

FlatAttention kernel library for tile-based AI accelerators (Sector: semiconductor, software)
- Use case: Provide an optimized attention library implementing FlatAttention (including asynchronous scheduling and fabric collectives) for tile-based accelerators that expose multicast/reduce primitives on their on-chip networks.
- Tools/products/workflows: Vendor SDK plugin or runtime op (e.g., “flat_attention()” for MHA/GQA/MLA and speculative decode), autotuner for group shape and block sizes, profiling utilities to check results against the reported figures (up to 92.3% compute utilization, up to 16× lower HBM traffic).
- Assumptions/dependencies: Hardware support for NoC-level multicast/reduction; scratchpad-managed L1; DMA, vector, and matrix engines that can execute concurrently; compiler/runtime hooks for collectives.
High-throughput LLM serving nodes with lower cost-per-token (Sector: cloud/datacenter, software)
- Use case: Deploy FlatAttention on tile-based inference nodes to raise tokens/s throughput and reduce HBM bandwidth pressure for modern attention variants (MHA, GQA, MLA), with a reported average 1.9× speedup vs GH200-class GPU implementations and potentially lower energy per token.
- Tools/products/workflows: Integration into serving stacks (vLLM, TensorRT-LLM–like frameworks), KV-cache management tuned for group tiling, autoscaling policies informed by bandwidth utilization telemetry (e.g., the reported 78% average HBM bandwidth utilization for memory-bound cases).
- Assumptions/dependencies: Availability of accelerators comparable to the modeled 32×32 tile configuration; model weights in FP16/FP8; data-center ops tooling for new hardware.
Wafer-scale multi-die inference appliance for MoE LLMs (Sector: cloud/datacenter, HPC)
- Use case: Build wafer-scale, 2D-mesh, multi-die systems using D2D interconnects and deploy FlatAttention across dies; achieve higher end-to-end throughput (1.9×) and lower per-user latency (1.4×) for decoding, even under lower peak FLOPs.
- Tools/products/workflows: Runtime for cross-die collective orchestration, D2D topology-aware mapping (PP/TP/DP mixes), queueing strategies for decode/prefill balance, rack-level scheduling.
- Assumptions/dependencies: CoWoS or similar packaging, D2D mesh bandwidth comparable to on-die NoC assumptions, synchronized collectives across dies, robust thermal/power delivery.
Compiler and runtime passes for collective-aware attention (Sector: software tooling)
- Use case: Extend MLIR/TVM/XLA backends with passes that generate FlatAttention tilings, multicast/reduce schedules, and asynchronous overlap of DMA/vector/matrix operations.
- Tools/products/workflows: Scheduling pass for group formation (Gx×Gy), cost models for memory/coherence, autotuning (e.g., head dimension, KV length, speculative length) to maximize matrix engine occupancy.
- Assumptions/dependencies: Compiler access to hardware intrinsics for collectives and scratchpad DMA; stable ABI for collective ops.
On-prem AI inference for regulated sectors (Sector: healthcare, finance, public sector)
- Use case: On-prem LLM appliances with lower energy and latency per request, enabling HIPAA/GDPR-compliant deployments (e.g., clinical note summarization, RAG compliance assistants).
- Tools/products/workflows: Pre-validated FlatAttention kernels in vendors’ SDKs; monitoring dashboards tracking tokens/s/W and HBM traffic; RAG pipelines optimized for long-context decode.
- Assumptions/dependencies: Procurement of tile-based accelerators; FP8/FP16 quantized models; IT controls for on-prem operations.
Telco/edge gateways for language inference (Sector: telecommunications, edge computing)
- Use case: Use FlatAttention to reduce off-chip DRAM traffic on edge accelerators (LPDDR instead of HBM), enabling multi-tenant ASR/NLP with better latency (e.g., call-center analytics at the edge).
- Tools/products/workflows: Edge-serving stacks with KV-cache eviction/prefetch tuned to group tiling; lightweight speculative decoding to reintroduce GEMMs and better utilize compute.
- Assumptions/dependencies: Sufficient NoC bandwidth and collective support on edge silicon; thermal envelopes; robust quantization to FP8/INT formats.
LLM architecture optimization for deployment teams (Sector: software, AI/ML)
- Use case: Prefer GQA/MLA variants during model selection/fine-tuning because they map efficiently under FlatAttention (smaller KV cache, restored GEMMs with grouping), lowering serving costs.
- Tools/products/workflows: Weight-absorption tooling for MLA→MQA-style mapping at inference; model cards including “FlatAttention-readiness” metrics (head dims, compression ranks, KV lengths).
- Assumptions/dependencies: Acceptable accuracy deltas for GQA/MLA; compatibility with existing tokenizer/context window needs.
Accelerator NoC IP enhancements (Sector: semiconductor IP)
- Use case: Integrate fabric-level collectives (multicast/reduce trees) into NoC router IP with flit-level replication, validated against FlatAttention’s group traffic patterns.
- Tools/products/workflows: Synthesizable router RTL with collective datapaths, verification suites using attention-traffic traces, PPA analysis vs software collectives.
- Assumptions/dependencies: Area/power budgets for collective-enhanced routers; flow control QoS under mixed traffic; interoperability with memory controllers.
Education and research prototyping (Sector: academia)
- Use case: Use the paper’s modeling/simulation methods to teach dataflow/architecture co-design; student labs exploring group size, NoC contention, and roofline effects on attention.
- Tools/products/workflows: Open lab assignments replicating utilization and HBM traffic reductions; curriculum modules on scratchpad vs cache-based memory systems.
- Assumptions/dependencies: Availability of simplified simulators and reference kernels; institutional hardware access or accurate emulation.
Performance/TCO modeling for procurement (Sector: enterprise IT, policy within orgs)
- Use case: Incorporate FlatAttention’s bandwidth/compute scaling laws into capacity planning and RFPs to compare GPU-centric vs tile-based options on $/token, W/token, and latency.
- Tools/products/workflows: TCO calculators parameterized by KV length, head dim, model variant, expected utilization (e.g., the reported averages of 86% compute utilization when compute-bound and 78% HBM bandwidth utilization when memory-bound).
- Assumptions/dependencies: Realistic workload mixes; accurate device-level telemetry (utilization, memory traffic) during pilots.
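
As a starting point for the capacity-planning item above, here is a minimal roofline-style estimate in plain Python. It assumes decode is memory-bound, so throughput is achieved HBM bandwidth divided by bytes read per token. Every input value below is a placeholder rather than a measured or reported figure (only the 0.78 utilization echoes the average quoted earlier), and the function names are hypothetical.

```python
def decode_tokens_per_second(hbm_bandwidth_gbs, bandwidth_utilization, bytes_per_token):
    """Memory-bound decode bound: achieved bandwidth / bytes read per generated token."""
    achieved_bytes_per_s = hbm_bandwidth_gbs * 1e9 * bandwidth_utilization
    return achieved_bytes_per_s / bytes_per_token


def joules_per_token(power_watts, tokens_per_second):
    """Energy per token for a device drawing power_watts at the given throughput."""
    return power_watts / tokens_per_second


# Placeholder inputs for illustration only.
tps = decode_tokens_per_second(hbm_bandwidth_gbs=4000,
                               bandwidth_utilization=0.78,
                               bytes_per_token=2e9)
print(f"{tps:.0f} tokens/s, {joules_per_token(1000, tps):.3f} J/token")
```

A real calculator would replace the placeholders with pilot telemetry and would add a compute-bound branch for prefill.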

Long-Term Applications

These items require further research, scaling, standardization, or ecosystem maturity before broad deployment.

Cross-vendor standard for on-chip collectives APIs (Sector: semiconductor, software)
- Use case: Define portable collective primitives (multicast/reduce/broadcast) in accelerator runtime APIs, enabling FlatAttention-like kernels to run across vendors without per-device rewrites.
- Tools/products/workflows: Collective API spec, conformance test suites, compiler lowering standards.
- Assumptions/dependencies: Industry alignment; IP licensing; compatible NoC semantics across devices.
End-to-end training acceleration via collective-aware attention (Sector: AI/ML, semiconductor)
- Use case: Extend FlatAttention to backprop (attention backward, softmax gradients) with collective-aware tiling to reduce DRAM traffic in training regimes.
- Tools/products/workflows: Automatic differentiation support in compilers; memory checkpointing tuned to group tiling; mixed-precision training validation.
- Assumptions/dependencies: Numerical stability with FP8/FP16 gradients; optimizer state locality; larger scratchpads or recomputation strategies.
Wafer-scale, fault-tolerant collective fabrics (Sector: HPC/datacenter, semiconductor)
- Use case: Build resilient wafer-scale collective networks with dynamic rerouting to sustain tile/die failures, maintaining FlatAttention QoS under partial degradation.
- Tools/products/workflows: Fabric controllers with live reconfiguration; failure-aware group remapping; reliability monitors.
- Assumptions/dependencies: Yield-aware mapping; ECC and isolation barriers; thermal/power headroom.
Multi-tenant, adaptive runtimes for attention (Sector: cloud/datacenter)
- Use case: Runtime schedulers that adapt group sizes and collective trees in real time to balance throughput and latency across mixed workloads (prefill/decode/speculative, MoE routing bursts).
- Tools/products/workflows: Telemetry-driven autotuners; QoS-aware group partitioning; sandboxed collective contexts per tenant.
- Assumptions/dependencies: Hardware support for fast context switching; robust isolation for collectives; accurate contention modeling.
Edge/consumer tile NPUs for on-device LLMs (Sector: robotics, automotive, mobile, XR)
- Use case: Bring collective-enabled tile NPUs into cars, robots, and AR devices to deliver long-context assistants and multimodal understanding with lower DRAM bandwidth and latency.
- Tools/products/workflows: On-device FlatAttention kernels for streaming decode; low-rank KV caches; thermal-aware schedules.
- Assumptions/dependencies: Adoption of scratchpad-based NPUs in consumer SoCs; sustained memory bandwidth; tight integration with sensor stacks.
Sustainable AI policy and procurement frameworks (Sector: public policy, enterprise)
- Use case: Inform energy/cost standards for AI infrastructure procurement by referencing architectures that minimize off-chip memory traffic (e.g., FlatAttention-class dataflows).
- Tools/products/workflows: Benchmarks reporting W/token and GB moved/token; guidelines for greener inference appliances; incentives for memory-traffic-efficient designs.
- Assumptions/dependencies: Trusted, standardized benchmarking; verifiable reporting; regulator/industry buy-in.
KV-cache–aware distributed memory hierarchy (Sector: semiconductor, software)
- Use case: Design hierarchical KV-cache systems (SRAM tiles → HBM → NVMe) with collective prefetch/eviction policies co-optimized with FlatAttention’s group blocking.
- Tools/products/workflows: Cache orchestration runtimes; predictive prefetchers for decode; compression codecs matched to MLA-like latent spaces.
- Assumptions/dependencies: Fast, low-overhead collectives; predictable decode access patterns; acceptable accuracy with cache compression.
Secure and private multi-tenant collectives (Sector: cloud, security)
- Use case: Add side-channel-resistant collective mechanisms (time/power obfuscation, partitioned trees) to safely run FlatAttention in shared environments.
- Tools/products/workflows: Verification of isolation properties; hardware counters with privacy-preserving aggregation; security-hardened router microarchitecture.
- Assumptions/dependencies: Minimal performance overhead; formal security analyses; tenant SLAs.
Generalizing collective-aware dataflows beyond attention (Sector: AI/ML, semiconductor)
- Use case: Apply the co-design approach to other bandwidth-sensitive operators (e.g., cross-attention, mixture routing, normalization, sparse matmul) to reduce DRAM traffic system-wide.
- Tools/products/workflows: Pattern libraries in compilers; collective-aware fusion passes; roofline-guided tiling heuristics.
- Assumptions/dependencies: Operator-specific numerical and memory access characteristics; availability of vector/matrix engines suited to new kernels.
Curriculum and workforce development for HW/SW co-design (Sector: academia, workforce)
- Use case: Establish courses and certification programs on collective-accelerated dataflows, preparing engineers to design, program, and operate such systems at scale.
- Tools/products/workflows: Teaching kits with emulators/simulators; open reference implementations; lab-based capstones integrating NoC + kernels + serving.
- Assumptions/dependencies: Open tools and data; partnerships with hardware vendors and cloud providers.

Notes on feasibility across all applications:

- Results depend on hardware with NoC-level collective primitives (multicast/reduction), scratchpad-managed memory, and concurrent DMA/vector/matrix execution.
- Gains assume attention-heavy, inference-dominant workloads (MoE + MLA/GQA) and KV-cache lengths where DRAM/HBM bandwidth is a bottleneck.
- Compiler/runtime maturity is critical to achieve the reported utilization and overlap; portability across vendors requires standardized collective semantics.
- Numerical stability and accuracy must be validated under quantization (FP8/FP16) and MLA weight-absorption transformations.
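
One way to begin the validation named in the last note is to check that the weight-absorption rewrite leaves attention scores unchanged in high precision before any quantization is applied. The sketch below is a toy check under stated assumptions, not the paper's procedure: the names `W_uk`, `latents`, and `q`, the helper functions, and the tiny dimensions are all hypothetical. It relies only on the identity q · (W c) = (Wᵀ q) · c.

```python
import random


def matvec(m, v):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Transpose a row-major matrix."""
    return [list(col) for col in zip(*m)]


random.seed(0)
d_head, d_latent, kv_len = 8, 4, 5  # toy sizes, chosen only for illustration

# Hypothetical per-head key up-projection, shape (d_head, d_latent).
W_uk = [[random.gauss(0, 1) for _ in range(d_latent)] for _ in range(d_head)]
# Cached compressed KV latents, shape (kv_len, d_latent).
latents = [[random.gauss(0, 1) for _ in range(d_latent)] for _ in range(kv_len)]
# One query vector for one head, shape (d_head,).
q = [random.gauss(0, 1) for _ in range(d_head)]

# Naive path: up-project every cached latent to a key, then dot with the query.
scores_naive = [
    sum(qi * ki for qi, ki in zip(q, matvec(W_uk, c))) for c in latents
]

# Absorbed path: fold the up-projection into the query once, then score
# directly against the latents (keys are shared, MQA-style).
q_absorbed = matvec(transpose(W_uk), q)
scores_absorbed = matvec(latents, q_absorbed)

max_err = max(abs(a - b) for a, b in zip(scores_naive, scores_absorbed))
print(f"max abs difference: {max_err:.2e}")
```

In double precision the two paths agree up to rounding. A real validation would repeat the comparison with operands and accumulators rounded to the target FP8/FP16 formats, where the two evaluation orders can diverge.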


Glossary

Asynchronous execution: Overlapping computation and data movement by running engines and transfers concurrently to hide latencies. "we propose leveraging the asynchronous nature of \gls{dma}, vector and matrix engine invocations"
Autoregressive decoding: Inference process where tokens are generated one by one, each conditioned on previously generated tokens. "In autoregressive inference, an initial input sequence, or prompt, is fed to the network (prefill phase), after which tokens are generated sequentially and appended to the input sequence (decoding phase)."
Causal mask: A masking scheme that prevents a token from attending to future tokens in sequence models. "with causal mask applied."
Chip-on-Wafer-on-Substrate (CoWoS): An advanced packaging technology that integrates multiple dies on a silicon interposer at wafer scale. "using Chip-on-Wafer-on-Substrate (CoWoS) packaging technology\cite{cowos,xu2025wsc}"
Collective communication: Communication patterns (e.g., multicast, reduction) that involve groups of processing elements exchanging data efficiently. "collective communication---such as multicast and reduction---plays a crucial role"
Compute-bound: A regime where performance is limited by computational throughput rather than memory bandwidth. "achieving an average of 86\% utilization for compute-bound attentions"
Die-to-Die (D2D) links: High-speed electrical connections between separate semiconductor dies within a package or system. "interconnecting them with high-speed \gls{d2d} links"
Direct Memory Access (DMA) engine: A hardware unit that performs bulk data transfers between memory and on-chip buffers without burdening compute cores. "The \gls{dma} engine in each tile is responsible for bulk data movement in and out of the local L1 memory."
Fabric-level collective primitives: Hardware mechanisms in the on-chip network that accelerate collectives (e.g., multicast, reduce) directly in the interconnect. "dedicated hardware implementations of fabric-level collective primitives"
Fabric-supported hardware collectives: On-chip network features that directly implement collective operations for reduced latency and overhead. "(b) Row-wise multicast implementation with fabric-supported hardware collectives (HW)"
Flit: The smallest flow-control unit of data in on-chip networks. "perform fine-grained, flit-level data replication within \gls{noc} routers"
FlashAttention-3: A GPU-optimized attention implementation that exploits asynchronous execution and kernel fusion. "FlashAttention-3 during the prefill stage"
FlashMLA: A GPU-optimized dataflow for Multi-Head Latent Attention (MLA) that follows FlashAttention-style fusion. "FlashMLA during decoding"
FP8: An 8-bit floating-point numerical format used to accelerate inference with reduced precision. "DeepSeek-v3 FP8 decoding"
Gated MLP: A feed-forward network variant where activations are modulated by a learned gate for improved expressiveness/stability. "adopt a gated \gls{mlp} structure"
General Matrix Multiplication (GEMM): A fundamental linear-algebra operation computing matrix–matrix products, used in attention score and output calculations. "the \gls{gemm} for both the attention score and output calculations reduce to matrix-vector multiplications (GEMV)"
Grouped-Query Attention (GQA): An attention variant where queries are grouped to share keys/values, reducing KV cache while restoring GEMM structure in decoding. "\gls{gqa} \cite{ainslie2023gqa}, adopted by LLaMA~3"
High-Bandwidth Memory (HBM): Stacked DRAM with very high bandwidth used as main memory in accelerators. "high-bandwidth memory (HBM) accesses"
KV cache: Storage of keys and values from past tokens to enable efficient decoding without recomputing them. "a \gls{kv cache} populated during prefill and prior decoding iterations."
Large Language Model (LLM): A Transformer-based model with hundreds of millions to trillions of parameters, used for language tasks. "efficient and scalable \gls{LLM} inference"
Matrix engine: Specialized hardware unit optimized for matrix operations (e.g., GEMM) to deliver high compute throughput. "scalar cores, vector engines, and matrix engines."
Matrix-Vector Multiplication (GEMV): A linear-algebra operation multiplying a matrix by a vector; in decoding, attention GEMMs often reduce to GEMVs. "reduce to matrix-vector multiplications (GEMV)"
Memory-bound: A regime where performance is limited by memory bandwidth rather than compute throughput. "78\% HBM bandwidth utilization for memory-bound ones"
Mixture of Experts (MoE): An architecture where a gating network routes tokens to a small subset of expert FFNs, increasing capacity without proportional compute cost. "combined with \gls{moe} in DeepSeek-v3"
Multi-Head Attention (MHA): The standard attention mechanism with multiple attention heads that project queries, keys, and values separately. "stacked \gls{mha} and \gls{mlp} layers"
Multi-Head Latent Attention (MLA): An attention variant that compresses KV representations into a latent space with shared down-projections and per-head up-projections. "advanced mechanisms like \gls{mla}"
Multi-Query Attention (MQA): An attention variant where all heads share the same keys and values to reduce memory and bandwidth. "such as \gls{mqa} \cite{shazeer2019fast}"
Multicast: A collective communication operation that replicates and delivers the same data to multiple destinations. "multicast and reduction"
Network on Chip (NoC): An on-chip interconnect fabric (often mesh-based) that connects tiles/cores and supports data movement and collectives. "on-chip 2D-mesh \gls{noc}"
Processing Element (PE): A basic compute unit (e.g., vector/matrix engine or core) within a tile or accelerator. "integrate thousands of \glspl{pe}"
Reduction (operation): A collective operation combining data (e.g., sum, max) across multiple sources or lanes. "row-wise reduction within the group"
Redistribution layers (RDLs): Metallization layers in an interposer/package used to reroute signals between dies and IO. "redistribution layers (RDLs)"
Reticle size limit: The maximum area producible in a single lithography exposure, constraining single-die size. "constrained by reticle size limits"
RMSNorm: Root Mean Square Layer Normalization, a normalization variant used for computational efficiency. "RMSNorm for increased computational efficiency over LayerNorm."
Roofline model: A performance model plotting attainable FLOPs versus operational intensity bounded by compute and bandwidth ceilings. "on the GH200 roofline model."
Rotary Position Embedding (RoPE): A positional encoding method that injects relative positional information directly into attention. "\glspl{rope} to capture relative positional information"
Scratchpad memory: Software-managed on-chip memory used to buffer data and hide main-memory latency. "software-managed L1 scratchpad memory"
Silicon interposer: A passive silicon substrate providing high-density wiring to interconnect multiple dies at wafer scale. "wafer-scale silicon interposer"
Speculative decoding: An inference technique where draft tokens are proposed by a smaller model and then verified by the target model to reduce latency. "Speculative decoding has emerged as an effective technique for reducing inference latency"
Streaming Multiprocessor (SM): A GPU compute unit with its own shared memory and schedulers; used here by analogy to a compute tile. "shared memory of each \gls{sm}"
Thermal Design Power (TDP): The maximum heat a cooling system is designed to dissipate under typical workloads. "a total \gls{tdp} of 700 W."
Tile-based accelerator: An architecture composed of repeated tiles (each with compute engines, local memory, and DMA) connected via an on-chip network. "tile-based accelerators"
Through-silicon vias (TSVs): Vertical electrical connections passing through silicon, used to connect layers/dies in 3D packaging. "through-silicon vias (TSVs)"
Wafer-scale multi-die system: A system integrating many dies across a wafer-scale interposer for high performance and capacity. "a wafer-scale multi-die system"
Weight-absorption trick: A reparameterization in MLA that absorbs up-projection weights into queries to enable MQA-like decoding. "used together with a weight-absorption trick"
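
To make the "Reduction (operation)" entry concrete for attention, the sketch below shows how per-tile softmax statistics can be combined with one max reduction and one rescaled sum reduction. This is the standard online-softmax merge used by FlashAttention-style kernels, shown here as a generic illustration rather than as the paper's exact row-wise reduction. The function names `partial_stats` and `merge` and the score values are hypothetical.

```python
import math
from functools import reduce


def partial_stats(scores):
    """Per-tile statistics: local max and sum of exp(score - local max)."""
    m = max(scores)
    return m, sum(math.exp(s - m) for s in scores)


def merge(a, b):
    """Associative combine of two partial results.

    The new max is a plain max reduction; each partial sum is rescaled to the
    new max before being added, so the result matches a single-pass softmax
    denominator over the concatenated scores.
    """
    (ma, sa), (mb, sb) = a, b
    m = max(ma, mb)
    return m, sa * math.exp(ma - m) + sb * math.exp(mb - m)


# One row of attention scores, split across four hypothetical tiles.
scores = [0.5, -1.2, 3.0, 2.2, -0.7, 1.9, 4.1, 0.0]
tiles = [scores[i:i + 2] for i in range(0, len(scores), 2)]

# Each tile computes its own statistics; the group then reduces them.
m, s = reduce(merge, [partial_stats(t) for t in tiles])

# Reference: softmax denominator computed over the whole row at once.
ref_m = max(scores)
ref_s = sum(math.exp(x - ref_m) for x in scores)
print(f"merged=({m}, {s:.6f}) reference=({ref_m}, {ref_s:.6f})")
```

Because `merge` is associative, the combine can be evaluated in any tree order, which is the property a fabric-level reduction needs in order to perform it inside the network.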


Open Problems

Co-design of tile-based accelerator templates and dataflow for efficient LLM mapping
