- The paper introduces Hexagon-MLIR, an open-source stack that unifies high-level representations into MLIR Linalg for optimized Qualcomm NPU execution.
- By applying HVX vectorization together with operator fusion, tiling, multi-threading, and double buffering, it achieves significant speedups, including a 63.9× acceleration for a float16 GELU kernel.
- Its design reduces engineering overhead and streamlines the deployment of evolving operator DSLs, paving the way for robust AI compilation on mobile and embedded NPUs.
Hexagon-MLIR: A Comprehensive AI Compilation Stack for Qualcomm NPUs
Motivation and Architectural Context
Hexagon-MLIR is an open-source compilation stack built atop the MLIR infrastructure for targeting Qualcomm Hexagon NPUs. It provides automated compilation paths from high-level representations (hand-written Triton kernels and PyTorch computational graphs) to binaries optimized for the Hexagon NPU architecture. The stack addresses key limitations of library-based operator approaches: bandwidth bottlenecks caused by intermediate DRAM round trips, poor adaptability to rapidly evolving operator DSLs, and the considerable engineering cost of manual kernel deployment. By lowering both PyTorch models and Triton kernels through a unified MLIR-based pass pipeline, Hexagon-MLIR gains fine-grained control over the memory hierarchy, vectorization, multi-threaded hardware scheduling, and operator fusion, aligning generated code with the hardware's TCM, HVX, and DMA characteristics.
Figure 1: Hexagon-MLIR—AI Compilation Stack Overview, featuring translation of PyTorch and Triton into MLIR Linalg, followed by a structured lowering pipeline to NPU-specific binaries.
MLIR-Centric Compilation Workflow
At the foundation, Hexagon-MLIR leverages MLIR's extensible IR ecosystem. PyTorch models are imported via Torch-MLIR, and Triton kernels are converted to Linalg ops through triton-to-linalg. Both paths preserve semantic equivalence at the Linalg representation, up to approximation error when explicit quantization or hardware-specific numerical choices intervene. The pass pipeline, organized as a sequence of IR-to-IR transformations, covers canonicalization, operator fusion, vectorization, tiling, multi-threading, quantization, layout propagation, and explicit lowering to bufferized and target-specific IRs.
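For concreteness, the following is a minimal Triton vector-add kernel of the kind the triton-to-linalg path ingests; the kernel name, block size, and signature are illustrative choices, not taken from the paper.

```python
import triton
import triton.language as tl

# Minimal element-wise kernel; the load/compute/store pattern below is the
# kind of structure triton-to-linalg lowers into a linalg op over a 1-D space.
@triton.jit
def vec_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements          # guard the tail of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```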
With linalg.generic as its central abstraction, the stack compactly expresses affine memory accesses, structured iteration spaces, and computational payloads. Non-trivial operators such as softmax and TopK, which resist simple element-wise decomposition, are handled through richer loop and control-flow IRs. The uniform lowering of PyTorch and Triton subgraphs into MLIR Linalg, followed by structured transformation, is what makes full-stack optimization possible.
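The pure-Python sketch below mirrors the three ingredients a linalg.generic op carries (indexing maps, an iteration space, and a scalar payload); it is a conceptual analogy under those assumptions, not MLIR syntax, and all names are illustrative.

```python
import numpy as np

def generic(inputs, out_shape, indexing_maps, payload):
    """Apply `payload` at every point of the iteration space `out_shape`,
    gathering each operand through its affine-style indexing map."""
    out = np.empty(out_shape, dtype=np.float32)
    for idx in np.ndindex(*out_shape):
        operands = [t[m(idx)] for t, m in zip(inputs, indexing_maps)]
        out[idx] = payload(*operands)
    return out

# Element-wise ReLU over a 2-D tensor: identity indexing map, scalar payload.
x = np.random.randn(4, 8).astype(np.float32)
y = generic([x], x.shape, [lambda i: i], lambda v: max(v, 0.0))
```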
Core Technical Passes
Operator Fusion
Operator fusion is treated as a first-class transformation pass that fundamentally improves data locality and eliminates unnecessary tensor materialization. Aggressive fusion of linalg.generic ops keeps intermediate tensors out of DRAM; producer results are consumed in registers or fast local SRAM, maximizing reuse and enabling composed optimization across fused subgraphs. The pass can automatically construct "mega-kernels" from operator chains of arbitrary length, which is crucial for contemporary deep learning workloads with complex fusion patterns.
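A small numpy sketch of the principle (not the MLIR pass itself): the unfused chain materializes one intermediate array per op, while the fused loop applies the whole chain per element so producer results never leave registers or local memory. The operator chain here is an arbitrary example.

```python
import numpy as np

def unfused(x):
    t1 = x * 2.0                  # intermediate materialized in memory
    t2 = np.maximum(t1, 0.0)      # read back, second intermediate
    return t2 + 1.0               # third traversal of the data

def fused(x):
    out = np.empty_like(x)
    for i, v in enumerate(x.flat):    # single traversal, no intermediates
        v = v * 2.0
        v = v if v > 0.0 else 0.0
        out.flat[i] = v + 1.0
    return out

x = np.random.randn(1024).astype(np.float32)
assert np.allclose(unfused(x), fused(x))
```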
Tiling and Memory Hierarchy Exploitation
Tiling partitions the working sets of large tensors so that slices can be moved from DDR into TCM, enabling high-throughput compute with minimal transfer overhead. The transformation inserts the explicit buffers and load/store annotations consumed by MLIR's bufferization infrastructure. The tiling pass is aware of both vectorizable dimensions and data-movement costs, and applies loop interchange to expose innermost, TCM-resident computations.
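The sketch below illustrates the resulting structure on a flat tensor, assuming an arbitrary tile size; the slice copy stands in for a DDR-to-TCM transfer, and the computation (SiLU here) runs entirely on the TCM-resident tile.

```python
import numpy as np

TILE = 4096  # elements per TCM-resident slice (illustrative size)

def tiled_silu(x_ddr):
    out = np.empty_like(x_ddr)
    for start in range(0, x_ddr.size, TILE):
        stop = min(start + TILE, x_ddr.size)
        tile = np.array(x_ddr[start:stop])    # stand-in for a DDR -> TCM load
        tile = tile / (1.0 + np.exp(-tile))   # SiLU on the small working set
        out[start:stop] = tile                # stand-in for a TCM -> DDR store
    return out

y = tiled_silu(np.random.randn(1 << 20).astype(np.float32))
```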
Multi-threading and Asynchronous Scheduling
Hardware-level parallelism is extracted using MLIR’s Async dialect. The compilation stack first over-decomposes the iteration space into virtual threads and then lowers these to explicit fork-join IR, mapping tasks onto HVX vector contexts. The IR transformation preserves high-level parallel semantics while allowing for fine-grained scheduling, amortizing thread management and synchronization overhead across sufficiently large problem sizes.
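The fork-join pattern can be pictured with the following sketch: the iteration space is over-decomposed into many small tasks (the virtual threads) that a fixed pool of worker threads consumes. A Python thread pool stands in for the Async-dialect lowering; task and thread counts are illustrative.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_apply(x, fn, num_tasks=32, num_threads=4):
    out = np.empty_like(x)
    chunks = np.array_split(np.arange(x.size), num_tasks)  # virtual threads
    def work(idx):
        out[idx] = fn(x[idx])
    with ThreadPoolExecutor(max_workers=num_threads) as pool:  # fork
        list(pool.map(work, chunks))                           # drain = join
    return out

y = parallel_apply(np.random.randn(1 << 20).astype(np.float32), np.tanh)
```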
Double Buffering and DMA Latency Hiding
Double buffering is implemented as a two-stage transformation: structural alternation of buffer accesses, followed by explicit asynchronous DMA orchestration, enabling computation to overlap with data transfers. Ping-pong buffers are allocated in TCM, and all DMA start/wait points are injected into the IR, effectively hiding main-memory latency when kernels are compute/memory balanced or moderately memory-bound. This is essential for sustaining high effective throughput as AI kernel complexity grows.
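The control structure can be sketched as follows, with a single background worker standing in for the asynchronous DMA engine and alternating tile buffers; the tile size and names are illustrative, not the stack's actual IR.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def double_buffered(x, fn, tile=4096):
    out = np.empty_like(x)
    starts = list(range(0, x.size, tile))

    def fetch(s):                                    # stand-in for an async DMA read
        return np.array(x[s:s + tile])

    with ThreadPoolExecutor(max_workers=1) as dma:   # the "DMA engine"
        pending = dma.submit(fetch, starts[0])       # prime the first (ping) buffer
        for i, s in enumerate(starts):
            buf = pending.result()                   # DMA wait for the current tile
            if i + 1 < len(starts):
                pending = dma.submit(fetch, starts[i + 1])  # DMA start for the next (pong) tile
            out[s:s + buf.size] = fn(buf)            # compute overlaps the prefetch
    return out

y = double_buffered(np.random.randn(1 << 20).astype(np.float32), np.tanh)
```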
Specialized Math Library Integration
Transcendental and polynomial math ops are mapped to Qualcomm's QHL vectorized math libraries or, when those are unavailable, to MLIR's polynomial-approximation expansions. This ensures that the non-linearities and reductions critical to DL models execute at near-peak hardware efficiency regardless of data type (float16/float32) or vector length.
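As an example of what a polynomial fallback might compute, the sketch below approximates exp() with range reduction plus a low-degree polynomial; the coefficients and degree are a textbook choice, not the QHL or MLIR implementation.

```python
import numpy as np

LN2 = float(np.log(2.0))

def exp_poly(x):
    # Range reduction: x = k*ln2 + r with r in [-ln2/2, ln2/2].
    k = np.rint(x / LN2)
    r = x - k * LN2
    # Degree-4 polynomial for exp(r) on the reduced range (Horner form).
    p = 1.0 + r * (1.0 + r * (0.5 + r * (1.0 / 6.0 + r * (1.0 / 24.0))))
    # Reconstruct exp(x) = 2**k * exp(r).
    return np.ldexp(p, k.astype(np.int32))

x = np.linspace(-5.0, 5.0, 11)
assert np.allclose(exp_poly(x), np.exp(x), rtol=1e-3)
```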
Quantitative Evaluation
The empirical analysis demonstrates significant speedups attributable to distinct optimization passes. HVX vectorization alone yields up to 63.9× acceleration for GELU (float16), with substantial gains for RMS-norm, SiLU, and vector addition. Multi-threading exhibits regime-dependent speedups: sublinear scaling for small tensors due to thread-startup cost, but 2-4× speedups for larger working sets. Double buffering improves performance by overlapping DMA with compute; the magnitude of the benefit depends on the compute-to-memory-transfer ratio.
A detailed breakdown of runtime for representative kernels (GELU, Vec-Add, Exponent Series) shows strong interactions between passes; applying vectorization, multi-threading, and double buffering in sequence moves kernels from compute-bound to bandwidth-bound regimes. The idealized model of double buffering's effectiveness, visualized as the fraction of runtime spent on memory transfers, matches the experimental findings: maximum speedup is achieved when computation and memory-transfer times are moderately balanced.
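Under the standard idealized-overlap assumption (the transfer of tile i+1 fully hides behind the computation on tile i), per-tile time drops from T_compute + T_transfer to max(T_compute, T_transfer), so the predicted speedup is (T_compute + T_transfer) / max(T_compute, T_transfer): it peaks at 2× when the two terms are equal and decays toward 1× as either term dominates, consistent with the balanced-regime behavior reported above.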
Implications and Future Directions
On the practical front, Hexagon-MLIR lowers the barrier for bringing new PyTorch and Triton kernels to mobile/edge NPUs, reducing engineering turnaround for supporting novel ops and facilitating efficient deployment of rapidly evolving architectures, including LLMs and Mixture-of-Experts. The adoption of structured IRs and staged pipeline transformations provides a scalable foundation for integrating future AI DSLs, e.g., high-level PyTorch Inductor codegen, or new activation mechanisms such as PolyCom. Theoretically, the system affirms the relevance of explicit IRs and MLIR’s design for expressing both dataflow and hardware scheduling, extending to asynchronous execution and memory hierarchy mapping.
Potential developments include further integration with distributed compiler approaches that exploit native overlaps (as outlined in [TRITON_DISTRIBUTED]), automatic scheduling heuristics for deep fusion graphs, advanced mixed-precision and quantization schemes, and enhanced support for execution on heterogeneous SoC configurations. By providing open-source, fine-grained, and composable transformations, Hexagon-MLIR positions itself as a reference stack for AI compilation on embedded accelerators.
Conclusion
Hexagon-MLIR embodies a highly modular, MLIR-driven solution for compiling AI workloads targeting Qualcomm's Hexagon NPUs. Through a pass pipeline integrating fusion, tiling, vectorization, multi-threading, and buffer management, it delivers substantial measured performance improvements on representative kernels. The stack's extensibility and alignment with next-generation operator DSLs, combined with transparent optimizability using open IRs, establish a robust path forward for both research and industrial AI deployment targeting embedded and mobile NPUs.
Reference: "Hexagon-MLIR: An AI Compilation Stack For Qualcomm's Neural Processing Units (NPUs)" (2602.19762)