
AscendCraft: Accelerating NPU Kernel Generation

Updated 3 February 2026
  • AscendCraft is a framework for automatic Ascend NPU kernel generation using a lightweight DSL and LLM-guided transcompilation to overcome complex hardware challenges.
  • It abstracts scheduling and data movement semantics at the DSL level, enabling incremental lowering to optimized AscendC kernel code with improved correctness.
  • Empirical evaluation on MultiKernelBench demonstrates significant boosts in kernel correctness and runtime performance compared to direct LLM-to-AscendC approaches.

AscendCraft is a framework for automatic Ascend NPU kernel generation that leverages a lightweight domain-specific language (DSL) and a structured, multi-pass LLM-guided transcompilation pipeline. AscendCraft addresses the challenges inherent to programming for Ascend NPUs, including a fragmented programming model, complex memory hierarchies, strict alignment constraints, and the absence of rich, public domain-specific kernel exemplars. The system abstracts Ascend-specific scheduling and data movement semantics at the DSL level, then incrementally lowers DSL programs to fully realized AscendC kernel code using LLM-powered passes informed by code examples, mapping rules, and interactive compiler feedback. Evaluation on MultiKernelBench demonstrates that AscendCraft substantially improves kernel correctness and runtime performance relative to direct LLM-to-AscendC generation approaches (Wen et al., 30 Jan 2026).

1. Motivation and Problem Setting

High-performance kernel development is a fundamental requirement for deep learning workloads on specialized hardware. While GPU toolchains and LLM training data for CUDA or Triton have enabled LLMs to autonomously produce performant kernel code, NPU kernel generation remains an open challenge. The Ascend NPU platform exposes explicit multi-stage pipelines (CopyIn, Compute, CopyOut), several levels of on-chip memory (Global Memory, Unified Buffer, L1, L0), heterogeneous compute units (Scalar, Vector, Cube), and rigid hardware constraints such as memory alignment and buffer sizing.

Attempts to use LLMs (e.g., Sonnet-4, GPT) to directly generate AscendC code for MultiKernelBench tasks result in less than 5% correct output, attributed to the absence of domain knowledge, scarcity of training examples, and inability to reliably handle low-level memory management and scheduling. AscendCraft reframes kernel generation as a two-stage process: kernel logic and performance-critical scheduling are first expressed in a DSL tailored to the Ascend programming model, and the DSL is then transcompiled to AscendC code in a sequence of LLM-guided, constraint-driven lowering passes (Wen et al., 30 Jan 2026).

2. DSL Structure and Formal Semantics

AscendCraft’s DSL is Triton-inspired and provides concise abstractions for only those kernel aspects that critically influence performance on Ascend NPUs: tiling factors, on-chip buffer allocations, pipeline stage sequencing, and inter-core partitioning.

DSL Core Structure

  1. Host Function: Specifies tiling parameters and launches the kernel.
  2. Kernel Function: Declares on-chip buffer allocations and explicitly structures the pipeline using copyin, compute, and copyout blocks.

Grammar sketch:

  • host fn H(GlobalTensor…) { tiles = TilePlan(…); launch K(tiles); }
  • kernel fn K(params…) { alloc UB bufA[shape]; copyin { Copy(GlobalIn, bufA, tileIdx); } compute { OP(bufA, bufB, …); } copyout { Copy(bufB, GlobalOut, tileIdx); } }

Type System and Operator Rules

Every identifier is assigned a type in a typing context Γ, e.g.:

  • Γ ⊢ A : GlobalTensor(τ, ⟨N_1, ..., N_d⟩)
  • Γ ⊢ buf : UBBuffer(τ, ⟨t_1, ..., t_d⟩)

Operator primitives such as vadd (vector addition) use standard typing rules, e.g.:

(T-ADD)  If Γ ⊢ x : Tensor(τ, ⟨n_1, ..., n_d⟩) and Γ ⊢ y : Tensor(τ, ⟨n_1, ..., n_d⟩), then Γ ⊢ vadd(x, y) : Tensor(τ, ⟨n_1, ..., n_d⟩).
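A typing rule like T-ADD can be mechanized as a shape/dtype check. The sketch below is illustrative (the names `TensorType` and `type_vadd` are assumptions, not the paper's implementation): vadd is well-typed only when both operands agree on element type and shape, and the result carries that same type.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class TensorType:
    dtype: str               # element type tau, e.g. "f16"
    shape: Tuple[int, ...]   # dimensions <n_1, ..., n_d>

def type_vadd(x: TensorType, y: TensorType) -> TensorType:
    """Apply T-ADD: both operands must agree on dtype and shape."""
    if x.dtype != y.dtype or x.shape != y.shape:
        raise TypeError(f"vadd operands mismatch: {x} vs {y}")
    return TensorType(x.dtype, x.shape)  # result has the operands' type
```

Running such checks over a DSL program rejects ill-typed kernels before any AscendC code is generated.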

Operational Semantics

The operational semantics are defined by small-step transitions over a global state σ mapping buffers to contents:

  • CopyIn: ⟨copyin{Copy(G, B, π)}, σ⟩ ↦ σ[B ↦ slice(σ(G), π)]
  • Compute: ⟨compute{z = vadd(x, y)}, σ⟩ ↦ σ[z ↦ σ(x) + σ(y)]

This explicit staged semantics eliminates ambiguity and ensures memory and execution order are statically determined, in contrast to unstructured C++.
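The staged transitions above can be sketched as a tiny interpreter. This is a minimal illustration of the semantics (not the paper's tooling); σ is modeled as a dict from buffer names to Python lists, and a tile index π as a 1-D slice:

```python
# Small-step transitions over a state sigma: {buffer name -> contents}.
def step_copyin(sigma, G, B, pi):
    """copyin{Copy(G, B, pi)}: sigma[B -> slice(sigma(G), pi)]."""
    lo, hi = pi
    return {**sigma, B: sigma[G][lo:hi]}

def step_compute_vadd(sigma, z, x, y):
    """compute{z = vadd(x, y)}: sigma[z -> sigma(x) + sigma(y)] element-wise."""
    return {**sigma, z: [a + b for a, b in zip(sigma[x], sigma[y])]}

def step_copyout(sigma, B, G, pi):
    """copyout{Copy(B, G, pi)}: write the tile back into global memory."""
    lo, hi = pi
    out = list(sigma[G])
    out[lo:hi] = sigma[B]
    return {**sigma, G: out}
```

Chaining the three steps for one tile reproduces the CopyIn → Compute → CopyOut pipeline with a statically determined execution order.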

3. LLM-Guided Transcompilation Pipeline

DSL programs are transformed into AscendC code via four sequential, LLM-driven passes. Each lowering stage operates on code fragments, guided by mapping rules, code examples, AscendC API signatures, and in-prompt diagnostics.

Pass 1: Host-Side Translation

  • Converts the DSL host function to C++.
  • Computes tiling parameters, assigns workloads, and launches the kernel.
  • Enforces: ∏_i t_i × sizeof(τ) ≤ UB_size, and ∑_i t_i mod align = 0.
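The two Pass-1 constraints can be checked mechanically. The sketch below transcribes them directly (the function name and the concrete UB size and alignment values are illustrative assumptions, not hardware specifications):

```python
import math

def tile_plan_ok(tiles, elem_size, ub_size, align):
    """Check the Pass-1 constraints on a tile plan (t_1, ..., t_d):
    the tile must fit in the Unified Buffer, and the tile dimensions
    must satisfy the alignment rule as stated above."""
    fits_ub = math.prod(tiles) * elem_size <= ub_size
    aligned = sum(tiles) % align == 0
    return fits_ub and aligned

# e.g. a (64, 32) fp16 tile against a 192 KiB UB with 32-element alignment
assert tile_plan_ok((64, 32), elem_size=2, ub_size=192 * 1024, align=32)
```

A plan violating either constraint is rejected before code generation, so the LLM never has to recover from an over-sized or misaligned tile at compile time.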

Pass 2: Kernel Initialization

  • Completes AscendC inline functions for CopyIn, Compute, CopyOut.
  • Allocates TensorBufs (scratch) and TensorQueues (for pipelined data movement), ensuring queue capacity ≥ pipeline depth and 32-byte alignment for Unified Buffer allocations.

Pass 3: Kernel Computation Translation

  • Realizes Process() by invoking CopyInX(), ComputeX(), CopyOutX().
  • Emits synchronization calls (SyncAll) for cross-block dependencies.
  • Inserts required AscendC tensor buffer and queue operations (e.g., DataCopy, VADD).

Pass 4: Alignment and Padding Refinement

  • Invoked only on compiler-reported alignment errors.
  • Replaces DataCopy with DataCopyPad, adjusts strides, and computes necessary padding:

p = (A − (L mod A)) mod A

where L is the data length and A is the alignment granularity.
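The padding formula is a one-liner; p is the number of pad elements needed to round L up to the next multiple of A (zero if L is already aligned):

```python
def padding(L: int, A: int) -> int:
    """Pad elements needed to align length L to granularity A:
    p = (A - (L mod A)) mod A."""
    return (A - (L % A)) % A

assert padding(100, 32) == 28   # 100 + 28 = 128, a multiple of 32
assert padding(128, 32) == 0    # already aligned, no padding needed
```

The outer `mod A` is what makes the already-aligned case yield 0 rather than A.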

Feedback and Error Handling

If compilation errors are detected, the compiler stderr is appended to the LLM prompt with specific correction instructions. Typically, one feedback cycle suffices for convergence.
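The feedback cycle can be pictured as a bounded compile-and-repair loop. This is a hedged sketch of the control flow only; `compile_kernel` and `llm_fix` are hypothetical stand-ins for the AscendC compiler invocation and the LLM correction call:

```python
def repair_loop(code, compile_kernel, llm_fix, max_rounds=3):
    """Compile; on failure, feed stderr back to the LLM and retry.
    compile_kernel(code) -> (ok: bool, stderr: str)  [assumed interface]
    llm_fix(code, stderr) -> revised code             [assumed interface]"""
    for _ in range(max_rounds):
        ok, stderr = compile_kernel(code)
        if ok:
            return code
        # Append compiler diagnostics to the prompt for a targeted fix.
        code = llm_fix(code, stderr)
    return code  # best effort after max_rounds
```

Per the observation above, `max_rounds=1` would usually suffice; a small bound keeps cost predictable when it does not.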

4. LLM Prompting and Knowledge Injection

During generation:

  • The LLM receives the DSL specification (syntax, semantics, primitives), shape-specific operator exemplars (e.g., tiling strategies for a 1024×512 matrix in Softmax), and a natural language operator description.
  • Each lowering prompt includes relevant code fragments, transformation rules, and minimal illustrative examples.
  • The staged architecture restricts LLM output to locally verifiable code, mitigating propagation of hallucinations or broader semantic drift.

A plausible implication is that category-specific exemplars help the LLM internalize correct buffer usages and tiling patterns, overcoming the lack of public AscendC corpus.

5. Evaluation and Empirical Results

Experimental Setup

Testing is conducted on Ascend 910B2 (CANN 8.1), PyTorch 2.6, Ubuntu 22.04, against MultiKernelBench Level-1 tasks across seven operator categories: Activation, Loss, Math, Normalization, Optimizer, Reduce, and Pooling.

Correctness

  • Compilation success (Comp@1): 98.1%
  • Functional correctness (Pass@1): 90.4%
  • Direct LLM-to-AscendC baseline: <5% correct kernels

Performance

Performance is measured by Fast_x, the percentage of correct kernels whose speedup over the PyTorch eager baseline is at least x, i.e., T_ref / T_gen ≥ x, where T_gen is the generated kernel's runtime and T_ref the baseline's:

  • Fast_0.2: 82.7%
  • Fast_0.8: 57.7%
  • Fast_1.0: 46.2%
  • Several operator categories (e.g., Optimizer) obtain 100% Fast_1.0.
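Under this speedup-threshold reading, Fast_x decreases monotonically as x grows, consistent with the figures above. A minimal sketch of the metric (the function name and input format are assumptions):

```python
def fast_x(runtimes, x):
    """Fraction of correct kernels whose speedup T_ref / T_gen is >= x.
    runtimes: list of (T_gen, T_ref) pairs for correct kernels only."""
    hits = sum(1 for t_gen, t_ref in runtimes if t_ref / t_gen >= x)
    return hits / len(runtimes)

# Three correct kernels with speedups 2.0, 0.5, and 1.0:
timings = [(1.0, 2.0), (2.0, 1.0), (1.0, 1.0)]
```

For example, fast_x(timings, 1.0) counts only the kernels at least as fast as the baseline, while fast_x(timings, 0.2) also admits kernels up to 5× slower.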

mHC Architecture Case Study

AscendCraft generated two correct kernels (mHC_post, mHC_post_grad) on the first pass for the new mHC architecture, yielding 6.6× and 3.0× speedups over PyTorch eager execution. Subsequent LLM-assisted expert tuning further increased speedups to 15.9× and 7.2×, illustrating that DSL-generated code serves as a robust baseline for further optimization.

Summary of Empirical Results

| Metric                | AscendCraft   | Direct LLM-to-AscendC |
|-----------------------|---------------|-----------------------|
| Comp@1                | 98.1%         | <5%                   |
| Pass@1                | 90.4%         | <5%                   |
| Fast_1.0              | 46.2%         | <5%                   |
| mHC speedup (initial) | 6.6× / 3.0×   | Not reported          |
| mHC speedup (tuned)   | 15.9× / 7.2×  | Not reported          |

This suggests DSL-guided transcompilation is essential to bridge the gap between LLM-based kernel generation and the stringent requirements of Ascend NPUs.

6. Practical Considerations and Limitations

AscendCraft’s staged, DSL-guided approach is most effective for operators with regular tiling and streaming patterns, such as element-wise and simple reduction operations. The use of category-specific exemplars and a staged code generation pipeline confines LLM hallucinations to small, isolated code fragments, increasing verifiability and correctness.

Limitations include:

  • Incomplete support for complex kernels such as MatMul and Convolution, due to the complexity of AscendC’s Cube unit APIs. Extension via high-level interfaces like CATLASS is planned.
  • Lower Pass@1 and Fast_1.0 on pooling and certain reduction operators; further improvements may require more advanced tiling heuristics or shape-aware templates.
  • The need for joint optimization across both DSL and generated AscendC, possibly guided by roofline-style profiling, to maximize performance on memory-bound operators.

A plausible implication is that broader applicability will depend on extending the DSL and template libraries to incorporate more sophisticated operator classes and hardware APIs.

7. Implications and Future Directions

AscendCraft demonstrates the effectiveness of tailoring LLM code generation through a carefully designed intermediate DSL aligned with domain-specific semantics, followed by structured, constraint-guided transcompilation. This staged methodology closes the correctness and performance gap evident in direct LLM-based NPU kernel generation. Future work will focus on extending the DSL to more operator categories, improving tiling heuristics, and integrating optimization feedback loops—thereby further democratizing high-performance NPU software development in the context of LLM-driven code synthesis (Wen et al., 30 Jan 2026).
