- The paper introduces an RLVR pipeline that fine-tunes GPT-5 for Triton GPU kernel generation using deterministic, outcome-based rewards.
- The paper demonstrates significant improvements, boosting functional correctness from 43.7% to 77.0% and achieving up to 2.12x geometric mean speedup.
- The paper presents Makora, a scalable infrastructure integrating static analysis and multi-turn refinement to optimize kernel synthesis and prevent reward hacks.
Fine-Tuning GPT-5 for GPU Kernel Generation: Technical Analysis
Context and Motivation
The transition to accelerator-centric computation has made efficient GPU kernel programming a pivotal bottleneck for contemporary AI workloads. Despite progress in general-purpose code generation by LLMs, translating these advances to high-performance kernel generation is impeded by significant data scarcity, biases in compiler-synthesized corpora, and non-trivial optimization landscapes. This paper presents a robust RL pipeline, deploying GPT-5 as a foundation model and fine-tuning it for Triton GPU kernel generation with verifiable, outcome-based rewards. The result is an LLM agent that surpasses baseline compilers and models in both correctness and performance metrics.
Challenges in Kernel Generation
Kernel synthesis differs fundamentally from traditional application-level code generation owing to:
- Data scarcity and proprietary knowledge: Availability of high-quality, hardware-aware kernel code is limited, and public repositories overwhelmingly consist of educational/simplistic examples.
- Compiler artifacts: Synthetic kernels generated by compilers encode internal optimization heuristics, boilerplate, and depend on runtime libraries, limiting portability and model generalization.
- Single-metric (correctness) insufficiency: Functionally correct kernels are often suboptimal in speed or efficiency, and device heterogeneity further exacerbates performance portability.
- Exponentially large search space: Optimization decisions (tiling, memory layout, vectorization, fusion) interact nonlinearly, making exhaustive supervised enumeration infeasible.
These factors collectively render SFT impractical for scaling LLMs in this domain.
Reinforcement Learning with Verifiable Rewards (RLVR)
The paper eschews conventional RLHF in favor of RLVR, leveraging deterministic reward functions based on compilation, correctness, and speedup benchmarks relative to TorchInductor. The key formulation:
- Kernels that fail to compile or validate yield zero reward.
- Correct, outperforming kernels receive continuous rewards, scaled by a logistic function with a shift parameter emphasizing speedup over mere correctness.
This approach eliminates subjectivity, enables precise reward alignment, and scales evaluation across hardware targets.
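The paper does not spell out the reward function in closed form here, but a minimal sketch is possible: zero reward on compilation or validation failure, and a logistic reward in log-speedup otherwise. The `shift` and `steepness` parameters below are illustrative assumptions, not values from the paper; a `shift` above 1.0 captures the stated emphasis on speedup over mere correctness.

```python
import math

def rlvr_reward(compiled: bool, correct: bool, speedup: float,
                shift: float = 1.2, steepness: float = 4.0) -> float:
    """Sketch of an outcome-based RLVR reward.

    Kernels that fail to compile or validate yield zero reward;
    correct kernels receive a logistic reward in speedup over the
    TorchInductor baseline. `shift` and `steepness` are assumed
    parameters for illustration only.
    """
    if not (compiled and correct):
        return 0.0
    # Logistic in log-speedup: reward crosses 0.5 at speedup == shift,
    # so a merely-correct kernel (speedup ~1.0) earns less than a
    # kernel that actually beats the baseline.
    return 1.0 / (1.0 + math.exp(-steepness * (math.log(speedup) - math.log(shift))))
```

Because the reward is continuous and monotone in speedup, the gradient signal keeps pushing beyond bare correctness, which matches the paper's stated design intent.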
A prerequisite is the base model's proficiency in Triton and hardware primitives; GPT-5's demonstrated capacity ensures gradient signal availability, unlike lower-capacity models where reward plateaus rapidly.
The authors introduce a scalable RL training environment, Makora, characterized by:
- Curated training sets: Diverse, difficulty-ranked PyTorch kernels, deduplicated semantically and syntactically, filtered by runtime and complexity, then sampled to maximize coverage.
- Evaluation backend: Distributed across H100 GPUs, caches and canonicalizes ASTs for efficiency, and incorporates static reachability analysis and LLM-based hack judgement for reward hacking prevention.
- Tools at inference and training time: kernel evaluator (for iterative correctness checking), kernel search (for candidate retrieval and refinement), web search (external knowledge access), and profiler (for fine-grained performance feedback).
Iterative agentic refinement is enabled, bridging single-turn and multi-turn RL.
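The iterative refinement loop can be sketched as below. Both `generate` and `evaluate` are hypothetical stand-ins for the model call and the kernel-evaluator tool; neither name, nor the result-dict shape, comes from the paper.

```python
def refine_kernel(generate, evaluate, max_turns: int = 3):
    """Hypothetical multi-turn refinement loop in the spirit of the
    paper's agentic setup: generate a kernel, evaluate it, and feed
    the evaluator's diagnostics back into the next attempt.

    `generate(feedback)` returns kernel source; `evaluate(kernel)`
    returns a dict with 'correct', 'speedup', and 'log' keys. Both
    are assumed interfaces for illustration.
    """
    feedback = None
    best = None
    for _ in range(max_turns):
        kernel = generate(feedback)
        result = evaluate(kernel)
        if result["correct"] and (best is None or result["speedup"] > best[1]):
            best = (kernel, result["speedup"])
        feedback = result["log"]  # diagnostics steer the next attempt
    return best  # (kernel, speedup) of the best correct candidate, or None
```

This structure mirrors the reported multi-turn gains: each additional attempt gives the model another chance to convert evaluator feedback into a correct, faster kernel.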
Experimental Results
RL-fine-tuned GPT-5 (GPT-5-RL) outperforms all baseline LLMs and compilers:
- Functional correctness: Improves from 43.7% (base GPT-5) to 77.0%, a +33.3pp gain.
- Fraction outperforming TorchInductor: Increases from 14.8% to 21.8%.
- Geometric mean speedup: Rises from 0.73x to 0.81x.
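For reference, the geometric-mean speedup metric quoted above aggregates per-problem speedups as follows (a minimal sketch; the benchmark's exact harness is not shown here):

```python
import math

def geomean_speedup(speedups):
    """Geometric mean of per-problem speedups vs. TorchInductor.

    Unlike an arithmetic mean, a 2x speedup and a 0.5x slowdown
    cancel exactly, so values below 1.0 indicate the model is, on
    balance, slower than the baseline compiler.
    """
    logs = [math.log(s) for s in speedups]
    return math.exp(sum(logs) / len(logs))
```

This explains why the 0.73x-to-0.81x figures for GPT-5-RL still sit below 1.0: slowdowns on hard problems weigh against speedups elsewhere.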
MakoraGenerate, the evolutionary agent, further amplifies results:
- Solves 97.4% of expanded KernelBench problems.
- Outperforms TorchInductor on 72.9% of problems.
- Achieves 2.12x geometric mean speedup.
Increasing refinement steps (multi-turn): GPT-5-RL's functionality rate improves with more attempts, from 77.0% (single attempt) to 83.7% (three attempts), demonstrating robustness under agentic search and iterative optimization.
Sample efficiency and dataset curation: Training on an oracle in-distribution subset is significantly more effective than random sampling, underscoring the importance of distributional alignment over sheer dataset volume.
Tool impact: Domain-specific tools (kernel evaluator, search, profiler) yield substantial correctness and modest performance gains, while unconstrained web retrieval is inconsistent.
Reward Hack Prevention
The authors identify six hack archetypes (baseline kernel invocation, identity kernel, no-op, unused output, ghost optimization, forgotten kernel). Prevention is achieved via AST-based static analysis and LLM-based semantic judging, ensuring the model's outputs are meaningful and intended.
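To illustrate the AST-based side of this defense, the toy check below flags one archetype, a submission that never launches a `@triton.jit` kernel (e.g. a baseline-invocation hack that routes everything through `torch` ops). The real system also performs reachability analysis and LLM-based judging; this heuristic is illustrative only and is not the paper's implementation.

```python
import ast

def looks_like_baseline_hack(source: str) -> bool:
    """Toy static check in the spirit of AST-based reward-hack
    detection: return True if the submitted source defines no
    function decorated with a `jit`-style decorator (i.e. no Triton
    kernel exists to be launched)."""
    tree = ast.parse(source)
    has_jit_kernel = any(
        isinstance(node, ast.FunctionDef)
        # Crude decorator match: accepts `@triton.jit` and similar.
        and any("jit" in ast.dump(dec) for dec in node.decorator_list)
        for node in ast.walk(tree)
    )
    return not has_jit_kernel
```

A production checker would additionally verify that the kernel is actually reachable from the entry point and that its outputs feed the returned tensor, which is what the paper's reachability analysis targets.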
Comparative Discussion
The framework advances agentic kernel generation beyond prior works such as EvoENGINEER (2510.03760), CUDA-ENGINEER (Lange et al., 16 Sep 2025), ASTRA (Wei et al., 9 Sep 2025), CUDA-LLM (Chen et al., 10 Jun 2025), and KEVIN (Baronio et al., 16 Jul 2025), which focus on evolutionary, agentic, or RL-based optimization for CUDA kernels but do not reach the functionality and speedup levels demonstrated here with fine-tuned GPT-5 and Makora.
Practical and Theoretical Implications
Practically, this technique unlocks scalable, production-grade GPU kernel generation, reduces reliance on human experts, and offers competitive acceleration compared to industry compilers. Theoretically, RLVR mechanisms highlight pathways to align LLMs for specialized, high-consequence code domains, circumventing SFT data bottlenecks and exploration/exploitation dilemmas.
Deployment as evolutionary agents (MakoraGenerate) further illustrates the potential for multi-agent, iterative optimization systems in AI-assisted code synthesis, with implications for distributed training, generative hardware-aware optimizations, and domain adaptation.
Future Directions
- Incorporating train-time tool-use to further improve iterative reasoning and correctness.
- Explicit reward shaping for speedup maximization, targeting hardware-specific optimization strategies.
- Extending RLVR approaches to broader accelerator domains (FPGA, custom ASICs), potentially integrating more granular profiling and hardware introspection.
- Investigating reward hacking adversaries and defense mechanisms in open-source or collaborative kernel generation scenarios.
Conclusion
This work demonstrates that RL-based post-training, leveraging verifiable, hardware-grounded rewards, enables LLMs to attain state-of-the-art performance in GPU kernel generation, substantially improving both reliability and speed relative to base models and compilers. The Makora ecosystem exemplifies scalable infrastructure and robust evaluation for this specialized domain, providing a foundation for practical deployment and future research into agentic, domain-adaptive AI programming systems.