Papers
Topics
Authors
Recent
Search
2000 character limit reached

Can Asymmetric Tile Buffering Be Beneficial?

Published 20 Nov 2025 in cs.DC, cs.AR, and cs.PF | (2511.16041v1)

Abstract: General matrix multiplication (GEMM) is the computational backbone of modern AI workloads, and its efficiency is critically dependent on effective tiling strategies. Conventional approaches employ symmetric tile buffering, where the buffered tile size of the input $A$ along the dimension $M$ matches the output tile size of $C$. In this paper, we introduce asymmetric tile buffering (ATB), a simple but powerful technique that decouples the buffered tile dimensions of the input and output operands. We show, for the first time, that ATB is both practical and highly beneficial. To explain this effect, we develop a performance model that incorporates both the benefits of ATB (higher arithmetic intensity) and its overheads (higher kernel switching costs), providing insight into how to select effective ATB tiling factors. As a case study, we apply ATB to AMD's latest XDNA2 AI Engine (AIE), achieving up to a 4.54x speedup, from 4.8 to 24.6 TFLOPS on mixed-precision BFP16--BF16 GEMM, establishing a new performance record for XDNA2 AIE.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.