CELLO: Co-designing Schedule and Hybrid Implicit/Explicit Buffer for Complex Tensor Reuse
Abstract: Tensor algebra accelerators have been gaining popularity for running high-performance computing (HPC) workloads. Identifying optimal schedules for individual tensor operations and designing hardware to run these schedules is an active area of research. Unfortunately, operators in HPC workloads such as Conjugate Gradient often have operators with skewed shapes, fundamentally limiting the reuse any schedule can leverage. Moreover, the operators form a complex DAG of dependencies, making it challenging to apply simple fusion/pipelining techniques to extract inter-operation reuse. To address these challenges, this work proposes an accelerator CELLO. CELLO uses a novel on-chip buffer mechanism called CHORD co-designed with a novel scheduler called SCORE, which together enables identifying and exploiting reuse over complex DAGs of tensor operations. CELLO provides 4x geomean speedup and 4x energy efficiency over state-of-the-art accelerators across HPC workloads.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.