- The paper introduces an RLVR pipeline that fine-tunes GPT-5 for Triton GPU kernel generation using deterministic, outcome-based rewards.
- The paper demonstrates significant improvements, boosting functional correctness from 43.7% to 77.0% and achieving up to 2.12x geometric mean speedup.
- The paper presents Makora, a scalable infrastructure integrating static analysis and multi-turn refinement to optimize kernel synthesis and prevent reward hacks.
Fine-Tuning GPT-5 for GPU Kernel Generation: Technical Analysis
Context and Motivation
The transition to accelerator-centric computation has made efficient GPU kernel programming a pivotal bottleneck for contemporary AI workloads. Despite progress in general-purpose code generation by LLMs, translating these advances to high-performance kernel generation is impeded by significant data scarcity, biases in compiler-synthesized corpora, and non-trivial optimization landscapes. This paper presents a robust RL pipeline, deploying GPT-5 as a foundation model and fine-tuning it for Triton GPU kernel generation with verifiable, outcome-based rewards. The result is an LLM agent that surpasses baseline compilers and models in both correctness and performance metrics.
Challenges in Kernel Generation
Kernel synthesis differs fundamentally from traditional application-level code generation owing to:
- Data scarcity and proprietary knowledge: Availability of high-quality, hardware-aware kernel code is limited, and public repositories overwhelmingly consist of educational/simplistic examples.
- Compiler artifacts: Synthetic kernels generated by compilers encode internal optimization heuristics, boilerplate, and depend on runtime libraries, limiting portability and model generalization.
- Single-metric (correctness) insufficiency: Functionally correct kernels are often suboptimal in speed or efficiency, and device heterogeneity further exacerbates performance portability.
- Exponentially large search space: Optimization decisions (tiling, memory layout, vectorization, fusion) interact nonlinearly, making exhaustive supervised enumeration infeasible.
These factors collectively render SFT impractical for scaling LLMs in this domain.
Reinforcement Learning with Verifiable Rewards (RLVR)
The paper eschews conventional RLHF in favor of RLVR, leveraging deterministic reward functions based on compilation, correctness, and speedup benchmarks relative to TorchInductor. The key formulation:
- Kernels that fail to compile or validate yield zero reward.
- Correct, outperforming kernels receive continuous rewards, scaled by a logistic function with a shift parameter emphasizing speedup over mere correctness.
This approach eliminates subjectivity, enables precise reward alignment, and scales evaluation across hardware targets.
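The paper does not spell out the reward function in closed form here, but a minimal sketch is possible: zero reward on compilation or validation failure, and a logistic reward in log-speedup otherwise. The `shift` and `steepness` parameters below are illustrative assumptions, not values from the paper; a `shift` above 1.0 captures the stated emphasis on speedup over mere correctness.

```python
import math

def rlvr_reward(compiled: bool, correct: bool, speedup: float,
                shift: float = 1.2, steepness: float = 4.0) -> float:
    """Sketch of an outcome-based RLVR reward.

    Kernels that fail to compile or validate yield zero reward;
    correct kernels receive a logistic reward in speedup over the
    TorchInductor baseline. `shift` and `steepness` are assumed
    parameters for illustration only.
    """
    if not (compiled and correct):
        return 0.0
    # Logistic in log-speedup: reward crosses 0.5 at speedup == shift,
    # so a merely-correct kernel (speedup ~1.0) earns less than a
    # kernel that actually beats the baseline.
    return 1.0 / (1.0 + math.exp(-steepness * (math.log(speedup) - math.log(shift))))
```

Because the reward is continuous and monotone in speedup, the gradient signal keeps pushing beyond bare correctness, which matches the paper's stated design intent.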
A prerequisite is the base model's proficiency in Triton and hardware primitives; GPT-5's demonstrated capacity ensures gradient signal availability, unlike lower-capacity models where reward plateaus rapidly.
The authors introduce a scalable RL training environment, Makora, characterized by:
- Curated training sets: Diverse, difficulty-ranked PyTorch kernels, deduplicated semantically and syntactically, filtered by runtime and complexity, then sampled to maximize coverage.
- Evaluation backend: Distributed across H100 GPUs, caches and canonicalizes ASTs for efficiency, and incorporates static reachability analysis and LLM-based hack judgement for reward hacking prevention.
- Tools at inference and training time: kernel evaluator (for iterative correctness checking), kernel search (for candidate retrieval and refinement), web search (external knowledge access), and profiler (for fine-grained performance feedback).
Iterative agentic refinement is enabled, bridging single-turn and multi-turn RL.
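The iterative refinement loop can be sketched as below. Both `generate` and `evaluate` are hypothetical stand-ins for the model call and the kernel-evaluator tool; neither name, nor the result-dict shape, comes from the paper.

```python
def refine_kernel(generate, evaluate, max_turns: int = 3):
    """Hypothetical multi-turn refinement loop in the spirit of the
    paper's agentic setup: generate a kernel, evaluate it, and feed
    the evaluator's diagnostics back into the next attempt.

    `generate(feedback)` returns kernel source; `evaluate(kernel)`
    returns a dict with 'correct', 'speedup', and 'log' keys. Both
    are assumed interfaces for illustration.
    """
    feedback = None
    best = None
    for _ in range(max_turns):
        kernel = generate(feedback)
        result = evaluate(kernel)
        if result["correct"] and (best is None or result["speedup"] > best[1]):
            best = (kernel, result["speedup"])
        feedback = result["log"]  # diagnostics steer the next attempt
    return best  # (kernel, speedup) of the best correct candidate, or None
```

This structure mirrors the reported multi-turn gains: each additional attempt gives the model another chance to convert evaluator feedback into a correct, faster kernel.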
Experimental Results
RL-fine-tuned GPT-5 (GPT-5-RL) outperforms all baseline LLMs and compilers:
- Functional correctness: Improves from 43.7% (base GPT-5) to 77.0%, a +33.3pp gain.
- Fraction outperforming TorchInductor: Increases from 14.8% to 21.8%.
- Geometric mean speedup: Rises from 0.73x to 0.81x.
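For reference, the geometric-mean speedup metric quoted above aggregates per-problem speedups as follows (a minimal sketch; the benchmark's exact harness is not shown here):

```python
import math

def geomean_speedup(speedups):
    """Geometric mean of per-problem speedups vs. TorchInductor.

    Unlike an arithmetic mean, a 2x speedup and a 0.5x slowdown
    cancel exactly, so values below 1.0 indicate the model is, on
    balance, slower than the baseline compiler.
    """
    logs = [math.log(s) for s in speedups]
    return math.exp(sum(logs) / len(logs))
```

This explains why the 0.73x-to-0.81x figures for GPT-5-RL still sit below 1.0: slowdowns on hard problems weigh against speedups elsewhere.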
MakoraGenerate, the evolutionary agent, further amplifies results:
- Solves 97.4% of expanded KernelBench problems.
- Outperforms TorchInductor on 72.9% of problems.
- Achieves 2.12x geometric mean speedup.
Increasing refinement steps (multi-turn): GPT-5-RL's functionality rate improves with more attempts, from 77.0% (single attempt) to 83.7% (three attempts), demonstrating robustness under agentic search and iterative optimization.
Sample efficiency and dataset curation: Training on an oracle in-distribution subset is significantly more effective than random sampling, underscoring the importance of distributional alignment over sheer dataset volume.
Tool impact: Domain-specific tools (kernel evaluator, search, profiler) yield substantial correctness and modest performance gains, while unconstrained web retrieval is inconsistent.
Reward Hack Prevention
The authors identify six hack archetypes (baseline kernel invocation, identity kernel, no-op, unused output, ghost optimization, forgotten kernel). Prevention is achieved via AST-based static analysis and LLM-based semantic judging, ensuring the model's outputs are meaningful and intended.
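To illustrate the AST-based side of this defense, the toy check below flags one archetype, a submission that never launches a `@triton.jit` kernel (e.g. a baseline-invocation hack that routes everything through `torch` ops). The real system also performs reachability analysis and LLM-based judging; this heuristic is illustrative only and is not the paper's implementation.

```python
import ast

def looks_like_baseline_hack(source: str) -> bool:
    """Toy static check in the spirit of AST-based reward-hack
    detection: return True if the submitted source defines no
    function decorated with a `jit`-style decorator (i.e. no Triton
    kernel exists to be launched)."""
    tree = ast.parse(source)
    has_jit_kernel = any(
        isinstance(node, ast.FunctionDef)
        # Crude decorator match: accepts `@triton.jit` and similar.
        and any("jit" in ast.dump(dec) for dec in node.decorator_list)
        for node in ast.walk(tree)
    )
    return not has_jit_kernel
```

A production checker would additionally verify that the kernel is actually reachable from the entry point and that its outputs feed the returned tensor, which is what the paper's reachability analysis targets.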
Comparative Discussion
The framework advances agentic kernel generation beyond prior works such as EvoENGINEER (2510.03760), CUDA-ENGINEER (Lange et al., 16 Sep 2025), ASTRA (Wei et al., 9 Sep 2025), CUDA-LLM (Chen et al., 10 Jun 2025), and KEVIN (Baronio et al., 16 Jul 2025), which focus on evolutionary, agentic, or RL-based optimization for CUDA kernels but do not reach the functionality and speedup levels demonstrated here with fine-tuned GPT-5 and Makora.
Practical and Theoretical Implications
Practically, this technique unlocks scalable, production-grade GPU kernel generation, reduces reliance on human experts, and offers competitive acceleration compared to industry compilers. Theoretically, RLVR mechanisms highlight pathways to align LLMs for specialized, high-consequence code domains, circumventing SFT data bottlenecks and exploration/exploitation dilemmas.
Deployment as evolutionary agents (MakoraGenerate) further illustrates the potential for multi-agent, iterative optimization systems in AI-assisted code synthesis, with implications for distributed training, generative hardware-aware optimizations, and domain adaptation.
Future Directions
- Incorporating train-time tool-use to further improve iterative reasoning and correctness.
- Explicit reward shaping for speedup maximization, targeting hardware-specific optimization strategies.
- Extending RLVR approaches to broader accelerator domains (FPGA, custom ASICs), potentially integrating more granular profiling and hardware introspection.
- Investigating reward hacking adversaries and defense mechanisms in open-source or collaborative kernel generation scenarios.
Conclusion
This work demonstrates that RL-based post-training, leveraging verifiable, hardware-grounded rewards, enables LLMs to attain state-of-the-art performance in GPU kernel generation, substantially improving both reliability and speed relative to base models and compilers. The Makora ecosystem exemplifies scalable infrastructure and robust evaluation for this specialized domain, providing a foundation for practical deployment and future research into agentic, domain-adaptive AI programming systems.