Robust, comprehensive evaluation for AI-driven kernel generation

Develop evaluation protocols for AI-driven GPU kernel generation that jointly assess robustness and generalization across input shapes, operator types, and hardware ecosystems, overcoming the limitations of current benchmarks, which are largely confined to fixed input shapes and forward-pass primitives on NVIDIA hardware.

Background

The survey notes that prevailing benchmarks inadequately reflect real-world workloads because they are often restricted to fixed input shapes and forward-pass primitives within the NVIDIA ecosystem. A kernel that passes such a benchmark may still fail on unseen shapes, backward-pass operators, or other vendors' hardware, so these benchmarks cannot reliably measure the generalization and robustness that production-grade kernel generation requires.

To address these gaps, the authors call for evaluation protocols that jointly test across shapes, operators, and heterogeneous hardware ecosystems, providing a stronger foundation for measuring progress in kernel generation research.
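As a rough illustration of what a joint sweep over shapes, operators, and backends could look like, the sketch below runs candidate kernels over the full cross product and reports pass rates along each axis. It is not the survey's protocol. Every name in it (`REFERENCES`, `evaluate`, `pass_rate_by`, `fragile_add`, `backend_a`, `backend_b`) is hypothetical, plain Python lists stand in for GPU tensors, and the backends are labels rather than real hardware targets.

```python
import itertools
import math
import random

# Reference implementations, one per operator (hypothetical stand-ins).
def reference_add(a, b):
    return [x + y for x, y in zip(a, b)]

def reference_relu_backward(x, grad):
    # Backward pass of ReLU: the gradient flows only where the input was positive.
    return [g if xi > 0 else 0.0 for xi, g in zip(x, grad)]

REFERENCES = {"add": reference_add, "relu_backward": reference_relu_backward}

def evaluate(candidates, sizes, backends, tol=1e-6, seed=0):
    """Check every (operator, size, backend) combination against its reference.

    candidates maps (operator, backend) -> callable; a missing entry counts as a failure.
    """
    rng = random.Random(seed)
    results = []
    for op, n, backend in itertools.product(REFERENCES, sizes, backends):
        a = [rng.uniform(-1, 1) for _ in range(n)]
        b = [rng.uniform(-1, 1) for _ in range(n)]
        expected = REFERENCES[op](a, b)
        kernel = candidates.get((op, backend))
        try:
            got = kernel(a, b) if kernel is not None else None
            ok = (
                got is not None
                and len(got) == len(expected)
                and all(math.isclose(g, e, abs_tol=tol) for g, e in zip(got, expected))
            )
        except Exception:
            ok = False  # a crash is a robustness failure, not a harness error
        results.append({"op": op, "size": n, "backend": backend, "ok": ok})
    return results

def pass_rate_by(results, key):
    """Aggregate pass rates along one axis: 'op', 'size', or 'backend'."""
    totals = {}
    for r in results:
        passed, count = totals.get(r[key], (0, 0))
        totals[r[key]] = (passed + int(r["ok"]), count + 1)
    return {k: passed / count for k, (passed, count) in totals.items()}

def fragile_add(a, b):
    # Simulates a kernel hard-coded for a tile width of 4: it drops tail elements.
    n = len(a) - len(a) % 4
    return [a[i] + b[i] for i in range(n)]

candidates = {
    ("add", "backend_a"): fragile_add,
    ("add", "backend_b"): reference_add,
    ("relu_backward", "backend_a"): reference_relu_backward,
    # No relu_backward kernel for backend_b: the sweep records it as failing.
}
results = evaluate(candidates, sizes=[1, 7, 8, 1024], backends=["backend_a", "backend_b"])
for axis in ("op", "size", "backend"):
    print(axis, pass_rate_by(results, axis))
```

In this toy setup, a benchmark fixed to size 8 or 1024 would never expose the tail-dropping bug in `fragile_add`, which fails only at sizes 1 and 7. The missing backward-pass kernel on `backend_b` shows up only because operators and backends are swept together. A real harness would additionally need device execution, timing, and numerically appropriate tolerances per data type.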

References

A key open challenge in AI-driven kernel generation is the lack of robust and comprehensive evaluation.

Towards Automated Kernel Generation in the Era of LLMs (2601.15727 - Yu et al., 22 Jan 2026) in Section 7 (Challenges and Opportunities), paragraph "Evaluation Robustness and Generalization"