Can LLMs mimic the GPU kernel engineering workflow?

Determine whether large language models can mimic the real-world engineering workflow used to develop GPU kernels, including effective use of compiler feedback, profiling metrics, hardware-specific specifications and instruction sets, and hardware-efficiency techniques such as tiling and operator fusion.

Background

The paper investigates the feasibility of using LLMs to automate the generation of efficient GPU kernels, an activity that typically requires expert knowledge of hardware, compilers, and performance optimization. The authors emphasize that AI engineers rely on diverse signals and tools—such as compiler feedback, profiling metrics, hardware specifications and instruction sets, and techniques like tiling and fusion—making the development workflow complex.
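To make the "tiling" technique mentioned above concrete, here is a minimal NumPy sketch of blocked (tiled) matrix multiplication. The function name and tile size are illustrative choices, not from the paper; on a GPU the same idea is realized by staging tiles in shared memory, whereas here the blocking simply keeps each working set cache-sized.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matrix multiply: process tile x tile sub-blocks so each
    working set fits in fast memory (shared memory on a GPU, cache on a CPU)."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # Accumulate one output tile of C from one tile of A and one of B.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C
```

The result is numerically identical to a plain `A @ B`; only the memory-access pattern changes, which is exactly the kind of hardware-aware restructuring the benchmark probes for.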

This unresolved question motivates the creation of KernelBench, a benchmark and framework that closely mirrors the practical environment in which GPU kernels are engineered. KernelBench assesses whether LLMs can not only produce syntactically correct code but also leverage the aforementioned signals and techniques to generate performant kernels across a broad set of realistic machine learning workloads.
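Operator fusion, the other technique named above, can be sketched in the same spirit. This hypothetical example (function names and the scale-bias-ReLU chain are illustrative, not from the paper) contrasts an unfused pipeline, which materializes a full intermediate array per operator, with a fused single-pass version:

```python
import numpy as np

def scale_bias_relu_unfused(x, s, b):
    # Three separate "kernels": each step reads and writes a full-size array.
    t1 = x * s
    t2 = t1 + b
    return np.maximum(t2, 0)

def scale_bias_relu_fused(x, s, b, chunk=1024):
    # Fused version: one pass over x, applying all three ops per chunk while
    # the data is hot, writing the result exactly once.
    out = np.empty_like(x)
    for i in range(0, x.size, chunk):
        c = x[i:i+chunk]
        out[i:i+chunk] = np.maximum(c * s + b, 0)
    return out
```

On a GPU, fusing the three elementwise operators into one kernel eliminates two round trips to global memory, which is why a model that merely emits correct code, without fusing where profitable, can still produce slow kernels.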

References

AI engineers use a rich set of information when developing kernels, and it is not clear whether LMs can mimic the workflow.

KernelBench: Can LLMs Write Efficient GPU Kernels? (2502.10517 - Ouyang et al., 14 Feb 2025) in Section 1 (Introduction)