MultiKernelBench: Benchmarking DL Kernels
- MultiKernelBench is a benchmarking framework that evaluates automatically generated deep learning kernels across diverse accelerator systems.
- It offers fine-grained categorization with 285 tasks over 14 operator classes to address previous limitations in platform coverage and task balance.
- The system uses a modular backend and category-aware one-shot prompting to enhance LLM evaluation and simplify extension to new hardware platforms.
MultiKernelBench is a comprehensive, multi-platform benchmarking framework designed to evaluate and advance the automatic generation of deep learning (DL) kernels, particularly those produced by LLMs, across heterogeneous accelerator ecosystems. It establishes a rigorous testbed that addresses prior benchmarks' limited platform coverage, insufficient kernel categorization, and lack of balanced evaluation across core operator classes. The benchmark suite comprises 285 tasks distributed over 14 functional categories, supports Nvidia GPUs (CUDA), Huawei NPUs (AscendC), and Google TPUs (Pallas/JAX), and features a modular backend abstraction layer to facilitate extensibility to new architectures. MultiKernelBench further introduces category-aware one-shot prompting, enabling more effective LLM evaluation and improvement, especially on platforms lacking extensive pre-training data exposure (Wen et al., 20 Jul 2025).
1. Motivation and Rationale
The manual development of high-performance DL kernels—platform-specific routines such as CUDA for Nvidia, AscendC for Huawei, or Pallas/JAX for TPUs—demands hardware expertise and substantial engineering investment. Existing methods for LLM-based DL kernel generation have demonstrated promise in automating this process; however, prior benchmarks (e.g., KernelBench, TritonBench) have been limited by a number of significant shortcomings:
- Platform Restriction: Most existing suites focus on Nvidia GPUs, underrepresenting Huawei NPUs and Google TPUs.
- Coarse Categorization: Previous difficulty-level groupings obscure fine-grained performance characteristics and category-specific LLM failure modes.
- Coverage Imbalance: Several operator classes (such as convolution) are overrepresented, while others (e.g., optimizer kernels) are absent, impeding systematic assessment.
- Lack of Extensibility: Integrating new hardware platforms typically requires invasive modification of evaluation pipelines.
MultiKernelBench directly addresses these deficits by offering fine-grained categorization, balanced coverage, plug-in extensibility, and prompt-engineering strategies tailored to DL kernel generation (Wen et al., 20 Jul 2025).
2. System Architecture and Workflow
MultiKernelBench implements a modular and extensible benchmarking pipeline designed to decouple platform-specific logic from core evaluation infrastructure:
- Task Definition: Each task is defined by a reference PyTorch module (e.g., a `forward` method using `torch.nn.Conv2d`) and a set of fixed input tensors.
- Prompt Construction:
Prompts are constructed with a unified template that injects:
- A designated system role (e.g., "You are an AscendC kernel expert."),
- A platform-specific one-shot example,
- The target PyTorch module code and its input specifications.
- LLM Generation
The constructed prompt is delivered to an LLM, which generates:
- The kernel implementation (CUDA C++, AscendC, Pallas-JAX),
- A PyTorch-compatible wrapper for invocation.
- Build and Compilation: Generated code is extracted and built using `nvcc` (for CUDA), the official AscendC toolchain (Huawei NPUs), or JAX/Pallas (TPUs).
- Correctness and Performance Evaluation: Both the reference and generated modules are executed on identical inputs. Outputs are verified for correctness via element-wise comparison within tolerance, and runtime is measured using platform-native instrumentation.
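The correctness step above compares reference and generated outputs element-wise within a tolerance. A minimal, dependency-free sketch of such a check (the helper name and default tolerances are illustrative, mirroring `torch.allclose` semantics rather than the framework's actual API):

```python
def allclose(reference, candidate, rtol=1e-4, atol=1e-5):
    """Element-wise tolerance check in the style of torch.allclose.

    Returns True iff |r - c| <= atol + rtol * |r| for every element pair.
    Tolerances here are illustrative defaults, not MultiKernelBench's.
    """
    if len(reference) != len(candidate):
        return False
    return all(abs(r - c) <= atol + rtol * abs(r)
               for r, c in zip(reference, candidate))
```

In practice the framework would apply such a check to the tensors produced by the reference PyTorch module and the generated kernel on identical inputs.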
The architectural backbone is the Backend Abstraction Layer, which defines a uniform interface for any hardware target. The required methods are initialize_device, compile_kernel, run_kernel, measure_time, and cleanup. To add a new hardware backend, one subclasses the Backend class, implements these methods, and registers them via @register_backend. No changes are required to the main evaluation logic (Wen et al., 20 Jul 2025).
3. Kernel Taxonomy and Task Set
MultiKernelBench introduces a functionally driven, fine-grained taxonomy comprising 14 mutually exclusive categories, partitioning the task set as T = T₁ ∪ … ∪ T₁₄, where Tᵢ ∩ Tⱼ = ∅ for i ≠ j and |T₁| + … + |T₁₄| = 285. This enables highly granular performance analysis. The category breakdown is summarized in the following table:
| Category | Example Operator(s) | Task Count |
|---|---|---|
| Activation | relu, gelu | 15 |
| Broadcast | bias_add | 10 |
| Convolution | conv2d | 34 |
| Full Architecture | resnet18 | 50 |
| Fusion | fused_matmul_bias | 100 |
| Loss | cross_entropy | 7 |
| Math | multiply | 6 |
| Matrix Multiply | sgemm, bmm | 17 |
| Normalization | batchnorm, layernorm | 8 |
| Optimizer | adam_update | 5 |
| Pooling | maxpool2d | 6 |
| Indexing | gather, scatter_update | 12 |
| Resize | bilinear_resize | 10 |
| Reduce | reduce_sum | 5 |
Balanced category representation enables the identification of per-category LLM weaknesses (e.g., convolution and full-architecture tasks are systematically more difficult than activation or reduction operations) (Wen et al., 20 Jul 2025).
4. Supported Hardware Platforms and Backend Design
MultiKernelBench encompasses three principal backends:
- Nvidia GPU (CUDA): Supports JIT compilation, managed via `torch.utils.cpp_extension`.
- Huawei NPU (AscendC): Utilizes the official AscendC compiler suite and glue scripts.
- Google TPU (Pallas/JAX): Leverages JAX's `pallas_call` for dynamic kernel execution.
Platform support is ensured by the backend abstraction interface, whose five methods guarantee a consistent workflow for device management, compilation, execution, timing, and resource cleanup. Extension to additional architectures (e.g., AMD/HIP, Triton, FPGA) requires only implementation and registration of a new backend—a process requiring fewer than 20 lines of code and no invasive changes to the main logic or evaluation routines (Wen et al., 20 Jul 2025).
5. Prompting Methodology
Prompt construction is standardized across platforms and incorporates:
- System Role Assignment: Explicitly defines the intended context for code generation (e.g., "You are a CUDA kernel expert.").
- One-shot Example: A canonical, platform-specific exemplar (e.g., vector-add in AscendC).
- Target Task Declaration: Includes PyTorch module code and fixed input tensor descriptions.
Category-aware prompting extends this approach by ensuring the one-shot example is functionally related to the evaluated task, improving both compilation and correctness rates on underrepresented hardware. Formally, for a target task t in category c(t), an exemplar e is selected such that c(e) = c(t) (possible selection strategies include operator- or shape-based affinity).
Empirical results demonstrate that substituting the generic "add" example with an in-category sample delivers up to 380% relative improvement in Compile@1 and 160% in Pass@1 on AscendC, with similar gains on Pallas (Wen et al., 20 Jul 2025).
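The selection rule above can be sketched as a simple category-to-exemplar lookup with the generic add example as fallback; the exemplar table and helper names below are illustrative, not the benchmark's actual data:

```python
# Illustrative category-aware exemplar selection; mapping is hypothetical.
EXEMPLARS = {
    "Activation": "relu_exemplar",
    "Reduce": "reduce_sum_exemplar",
    "Math": "add_exemplar",  # generic fallback used by naive one-shot prompting
}

def select_exemplar(task_category, fallback="Math"):
    """Prefer an in-category one-shot example; fall back to the generic add."""
    return EXEMPLARS.get(task_category, EXEMPLARS[fallback])
```

Swapping the fallback "add" exemplar for an in-category one is precisely the change the reported Compile@1 and Pass@1 gains measure.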
6. Evaluation Protocol and Metrics
The benchmark evaluates seven leading LLMs—including DeepSeek-V3 (685B), Qwen 3 235B, and Claude-Sonnet-4—without a train/test split. Each model generates implementations for all 285 tasks on all supported platforms.
Evaluation metrics include:
- Compile@k: Fraction of tasks for which at least one of k generated completions compiles.
- Pass@k: Fraction of tasks for which at least one of k completions is functionally correct.
- SpeedUp@k: Fraction of tasks for which at least one of k completions achieves at least a p-fold speedup over the PyTorch baseline.
Metrics are evaluated both per-platform and per-category, enabling detailed performance profiling. For example, Pass@5 for CUDA peaks at 31.2% (Claude-Sonnet-4), while both AscendC and Pallas observe ≤6.3%. Category-wise Pass@1 (CUDA) ranges from 88.9% (Activation) to 0.0% (Convolution, FullArchitecture), reflecting the variable complexity of DL operator classes (Wen et al., 20 Jul 2025).
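When n ≥ k completions are sampled per task, such @k metrics are commonly computed with the unbiased estimator from the code-generation literature (whether MultiKernelBench uses this exact estimator or simply draws k = n samples is not specified here):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one success among k draws).

    n: completions sampled per task, c: successes among them, k: budget.
    Applies equally to Compile@k, Pass@k, and SpeedUp@k style metrics.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The per-benchmark value is then the mean of this estimate over all 285 tasks.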
7. Key Insights, Findings, and Future Directions
MultiKernelBench reveals that:
- LLM-generated kernels are currently far from production-quality across most categories and platforms. Notably, pass rates for complex tasks such as convolution and end-to-end architectures remain near zero, indicating significant open challenges.
- Fine-grained, functional categorization exposes substantial performance variation and LLM-specific failure modes, which were masked by prior, coarser groupings.
- Category-aware one-shot prompting significantly improves compilation and correctness, especially on platforms such as AscendC and Pallas that feature rare or domain-specific kernel idioms in training corpora.
- Empirical evidence indicates that LLMs may occasionally realize operator fusion, sparsity-aware transformations, and low-overhead microkernel implementations, though these are not consistent or ubiquitous.
Recommended research directions include the use of MultiKernelBench to uncover per-category LLM strengths and weaknesses, extension to emergent hardware targets via the plugin backend, and investigation of advanced prompting methods such as shape-aware or retrieval-augmented prompting (Wen et al., 20 Jul 2025).
MultiKernelBench, through its open-sourced datasets, backend implementations, and evaluation scripts, provides a foundation and diagnostic suite for future advancements in automatic, LLM-driven kernel generation research and benchmarking.