- The paper presents an iterative agent-driven framework that automatically optimizes GPU kernels using a profile-prioritized search and Amdahl's law.
- It employs a robust single-file edit-benchmark loop with Triton and CUDA backends to validate improvements through a five-stage correctness harness.
- Empirical results on NVIDIA H100 demonstrate significant speedups in transformer-critical kernels, underscoring its efficiency for deep learning applications.
AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search
Problem Motivation and Context
Optimal GPU kernel development remains a major bottleneck in scaling the efficiency of modern deep learning systems, particularly those built on large transformer architectures. While vendor libraries such as cuBLAS and cuDNN set strong baselines for dense linear algebra and canonical tensor operations, the rapid evolution of model architectures consistently produces new kernels that lack optimized library support, including grouped-query attention, SwiGLU activations, rotary positional encodings, and RMS normalization. Manually optimizing these kernels requires deep knowledge of GPU microarchitecture, the memory hierarchy, and software toolchains. The non-convex search landscape, parameterized by tile sizes, memory layouts, and precision choices, raises the expertise barrier further and slows the prototyping of competitive kernels.
System Architecture and Pipeline
AutoKernel is an autonomous optimization framework that systematizes GPU kernel tuning, implemented as an open-source Python stack of more than 9,200 lines (with agent instructions spanning 909 lines) supporting both Triton and CUDA C++ backends. The system pipeline comprises the following stages:
- Profiling and Bottleneck Identification (Phase A): Given an arbitrary PyTorch model, AutoKernel profiles per-kernel GPU utilization using torch.profiler, classifying time spent across nine supported operation types. Each operation's impact on end-to-end latency is estimated via Amdahl's law, prioritizing effort toward kernels that bottleneck global runtime.
- Kernel Extraction and Plan Synthesis: Profiled kernels are extracted and mapped to standalone files with starter implementations, including model-relevant shapes and tolerance attributes. An explicit optimization plan with what-if projections is assembled, ranked by possible performance gain.
- Iterative Agent Optimization (Phase B): An LLM-powered code agent iteratively edits a single kernel implementation. Each modification is evaluated with a fixed five-stage correctness harness for functional and numerical soundness before its throughput is benchmarked. The agent keeps only modifications that yield validated improvements. The loop terminates on plateauing returns, convergence to near-peak hardware utilization, or exhaustion of the runtime/iteration budget.
- Post-Optimization Verification (Phase C): The best kernel variant, as chosen by the agent loop, enters full-model regression tests for correctness and speedup attribution.
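The Amdahl's-law prioritization of Phase A can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the `amdahl_bound` and `prioritize` helpers and the optimistic per-kernel speedup factor are assumptions.

```python
# Hypothetical sketch of Phase A prioritization: given per-op times from a
# profile, rank operation types by the end-to-end speedup Amdahl's law would
# allow if that op alone were accelerated by an optimistic factor.

def amdahl_bound(fraction, speedup):
    """Overall speedup if `fraction` of runtime is accelerated by `speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / speedup)

def prioritize(profile, optimistic_speedup=4.0):
    """Rank operation types (name -> time in ms) by achievable global gain."""
    total = sum(profile.values())
    return sorted(
        ((op, amdahl_bound(t / total, optimistic_speedup))
         for op, t in profile.items()),
        key=lambda item: item[1],
        reverse=True,
    )

# Example: a profile where matmul dominates gets matmul ranked first.
profile = {"matmul": 60.0, "softmax": 20.0, "rmsnorm": 15.0, "other": 5.0}
```

The key property this captures is that a kernel consuming 5% of runtime can never yield more than ~1.05× end-to-end, no matter how well it is optimized, so agent effort goes to the dominant ops first.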
Correctness is enforced via a robust five-stage verification harness: an initial smoke test, a shape sweep across 10+ configurations, adversarial stability probing, a determinism check, and edge-case size coverage. Only after passing all five stages is a kernel considered for performance benchmarking, preventing regressions and invalid optimizations.
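A minimal stand-in for such a staged gate is sketched below, using a trivial CPU function in place of a GPU kernel. The stage ordering mirrors the harness described above, but every concrete check and name here is a hypothetical simplification (real stages would compare GPU kernels against a PyTorch reference on tensors).

```python
# Illustrative five-stage correctness gate: a candidate must pass every
# stage, in order, before it is allowed into performance benchmarking.
import random

def run_stages(kernel, reference, shapes, edge_sizes, tol=1e-6):
    def close(a, b):
        return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))

    # Stage 1: smoke test on one nominal input.
    xs = [1.0, 2.0, 3.0]
    if not close(kernel(xs), reference(xs)):
        return False

    # Stage 2: shape sweep across configurations.
    for n in shapes:
        xs = [float(i) for i in range(n)]
        if not close(kernel(xs), reference(xs)):
            return False

    # Stage 3: adversarial stability probing (extreme magnitudes).
    for xs in ([1e30, -1e30, 0.0], [1e-30] * 4):
        if not close(kernel(xs), reference(xs)):
            return False

    # Stage 4: determinism (identical input must give identical output).
    xs = [random.random() for _ in range(64)]
    if kernel(xs) != kernel(xs):
        return False

    # Stage 5: edge-case sizes (empty, singleton, odd lengths).
    for n in edge_sizes:
        xs = [float(i) for i in range(n)]
        if not close(kernel(xs), reference(xs)):
            return False
    return True
```

The point of the staged order is cheap-to-expensive: a kernel that fails the smoke test is rejected before any sweep or stress probing runs.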
Algorithmic Approach and Design Rationales
AutoKernel employs a deliberate "single-file edit-benchmark-keep/revert" agent loop, inspired by Karpathy's autoresearch paradigm. The agent operates under a strict invariant: exactly one kernel file is touched per iteration, with git-based experiment tracking keeping search trajectories interpretable and reversible. A six-tier optimization playbook structures the agent's search space across:
- Block and tile size tuning
- Memory access optimizations (e.g., vectorization, prefetching, cache utilization)
- Compute pattern variants (e.g., accumulator precision, epilogue fusion)
- Advanced strategies (e.g., split-K, persistent kernels)
- Architecture-specific adjustments (e.g., Hopper TMA ops, Ampere cp.async)
- Operation-specialized transformations (e.g., online softmax, Welford's normalization)
This approach explicitly sidelines architectural complexity found in multi-agent systems, focusing instead on process transparency, single-agent determinism, and reproducibility.
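The keep/revert discipline can be sketched as follows. Here `propose_edit`, `passes_harness`, and `benchmark` are hypothetical stand-ins for the agent's LLM call, the five-stage gate, and the timing run; the git commands show how each accepted step becomes a reversible, auditable commit.

```python
# Hypothetical sketch of the single-file edit-benchmark-keep/revert loop.
import subprocess

def optimize(kernel_path, propose_edit, passes_harness, benchmark,
             max_iters=20, min_gain=0.01):
    best = benchmark(kernel_path)                  # baseline time (seconds)
    for i in range(max_iters):
        propose_edit(kernel_path)                  # touch exactly one file
        if not passes_harness(kernel_path):        # correctness gate first
            subprocess.run(["git", "checkout", "--", kernel_path], check=True)
            continue
        t = benchmark(kernel_path)
        if t < best * (1.0 - min_gain):            # keep only validated wins
            best = t
            subprocess.run(["git", "commit", "-am", f"iter {i}: {t:.4f}s"],
                           check=True)
        else:                                      # revert non-improvements
            subprocess.run(["git", "checkout", "--", kernel_path], check=True)
    return best
```

Because every kept state is a commit and every rejected edit is a clean checkout, the search trajectory is a linear git history that can be replayed or bisected, which is the transparency property the single-agent design is optimizing for.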
Backend Duality
AutoKernel's dual-backend support targets both fast prototyping and fine-grained architectural optimization. Triton enables sub-five-second JIT compilation and rapid iteration, well suited to high-level kernel abstractions. CUDA C++ unlocks tensor-core primitives, warp shuffles, and bank-conflict-free shared-memory patterns, offering full access to low-level hardware features when required. The system's backend-agnostic harness allows direct performance comparison and migration between abstraction levels within the same optimization run.
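A backend-agnostic comparison harness might look like the following sketch. All names are illustrative, and real GPU timing would additionally synchronize the device (e.g., torch.cuda.synchronize) before reading the clock; the point is only that both backends expose one callable signature and are timed identically.

```python
# Hypothetical backend-agnostic micro-benchmark: Triton and CUDA C++ variants
# with identical semantics are timed under the same protocol and compared.
import time

def bench(fn, args, warmup=3, iters=10):
    for _ in range(warmup):        # warm up JIT caches and clocks
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

def fastest(backends, args):
    """backends: mapping of backend name -> callable with identical semantics."""
    timings = {name: bench(fn, args) for name, fn in backends.items()}
    return min(timings, key=timings.get), timings
```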
Quantitative Results
Empirical evaluation on NVIDIA H100 hardware demonstrates AutoKernel's Triton starter kernels outperform PyTorch eager and torch.compile (max-autotune) baselines on the majority of relevant transformer-critical kernels:
- RMSNorm: 5.29× over PyTorch eager and 2.83× over torch.compile, attaining 2,788 GB/s (83% of H100 DRAM peak).
- Softmax: 2.82× over eager and 3.44× over torch.compile, exceeding 2,800 GB/s.
- Cross-Entropy: 2.21× over eager and 2.94× over torch.compile, reaching 2,070 GB/s.
All evaluated configurations (n=34) pass the entire five-stage correctness harness, underscoring the system's robustness in code generation and validation.
While memory-bound kernels exhibit the largest speedups, matmul performance highlights the limitations of source-level search: AutoKernel's Triton matmul reaches 278 TFLOPS (28% of H100 peak), surpassing torch.compile but not cuBLAS, underscoring the sustained advantage of hand-tuned vendor libraries for dense GEMM workloads. The agent loop is explicitly tasked with closing this gap iteratively, via an optimization trajectory that prioritizes block-size and tiling sweeps before exploring advanced strategies.
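As a back-of-envelope check, the fraction-of-peak figures follow directly from the achieved throughput divided by the hardware ceiling. The peak values below (roughly 3,350 GB/s HBM bandwidth and ~989 dense BF16 TFLOPS for H100 SXM) are assumed from public specifications, not taken from the paper.

```python
# Reproducing the fraction-of-peak arithmetic for the quoted results.
DRAM_PEAK_GBS = 3350.0     # assumed H100 SXM HBM3 bandwidth
BF16_PEAK_TFLOPS = 989.0   # assumed H100 dense tensor-core peak

def fraction_of_peak(achieved, peak):
    return achieved / peak

# RMSNorm at 2,788 GB/s lands near 83% of DRAM peak;
# the Triton matmul at 278 TFLOPS lands near 28% of compute peak.
```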
Community Impact and Competitive Deployment
AutoKernel's autonomous capabilities extend beyond synthetic evaluation. Community deployment yielded a first-place result on the vectorsum_v2 B200 leaderboard and, most notably, a single-prompt Triton FP4 matmul kernel that exceeded CUTLASS performance by 1.63–2.15× across diverse shapes, peaking at 2,898 TFLOPS. This underscores the efficacy of encoding expert heuristics in agent-accessible playbooks and positions AutoKernel as a practical alternative to both manual tuning and closed vendor toolchains.
Theoretical and Practical Implications
AutoKernel demonstrates that an iterative, agent-driven optimization process, constrained by rigorous correctness gates and guided by profile-driven prioritization (via Amdahl's law), suffices to automate substantive portions of the GPU kernel engineering cycle. This validates the translatability of the "autoresearch loop" to low-level code optimization and indicates that expert-level kernel performance is tractable without multi-agent coordination or human-in-the-loop feedback, provided sufficient agent instruction and explicit evaluation.
The architecture is modular for future extensions: potential directions include distributed, population-based search, RL-guided mutation strategies leveraging hardware performance counters, multi-kernel fusion, and automated PTX or SASS analysis integration for deeper hardware-coupled optimization.
Practically, the system unlocks scalable deployment of optimized kernels for models and workloads as soon as they are conceived, reducing ecosystem lag due to library absence. The explicit export functionality for HuggingFace Kernels further accelerates broad distribution and community adoption of agent-generated performant kernels.
Conclusion
AutoKernel establishes an agent-driven, correctness-anchored, profile-prioritized framework for autonomous kernel optimization encompassing both rapid prototyping (Triton) and hardware-level specialization (CUDA C++). The system achieves significant empirical speedups over state-of-the-art baselines in memory-bound regions and demonstrates agent competitiveness with human/expert-tuned libraries in community evaluations. Its design favors transparency, extensibility, and robustness, providing a compelling baseline and testbed for future research in autonomous code generation and iterative hardware-aware software synthesis (2603.21331).