Qiskit HumanEval Benchmark
- Qiskit HumanEval is a specialized benchmark assessing LLM performance in producing executable quantum code using Qiskit.
- It combines human-curated and synthetic task sets with automated unit tests to verify functional correctness and simulation fidelity.
- Advanced metrics like pass@k and RL methods (GRPO, ORPO, DPO) drive improvements in quantum software engineering.
Qiskit HumanEval (QHE) is a specialized execution-based benchmark designed to evaluate the ability of LLMs to generate correct, executable code for quantum programming using Qiskit, IBM's open-source quantum SDK. Building on the methodology of the classical HumanEval benchmark, QHE offers curated task suites, automated unit-test harnesses, graded difficulty tiers, and comprehensive evaluation metrics, establishing itself as a rigorously structured yardstick for generative AI in quantum software engineering.
1. Benchmark Definition and Structure
Qiskit HumanEval originated as an effort to provide an analog of HumanEval in the quantum domain, extending coverage from classical Python to quantum circuit synthesis, manipulation, simulation, and algorithmic workflows (Vishwakarma et al., 2024, Dupuis et al., 2024). The dataset encompasses 101–151 hand-curated or synthetic Python programming tasks (depending on the version), each requiring generation of a functionally correct Qiskit solution. Task prompts pair natural-language instructions with a function signature; canonical solutions and reference unit-tests are provided to assess functional correctness.
Task taxonomy in the hand-curated benchmark version is as follows (Vishwakarma et al., 2024):
| Category | Task Count |
|---|---|
| Quantum Circuit Generation | 28 |
| Simulation and Execution | 19 |
| State Preparation and Analysis | 7 |
| Algorithm Implementation | 14 |
| Gate Operations and Manipulation | 17 |
| Visualization and Post-Processing | 6 |
| Advanced Circuit Manipulation | 8 |
| Quantum Circuit Serialization | 2 |
Difficulty labels ({basic, intermediate, difficult/advanced}) quantify the required circuit depth, entanglement structure, control flow, and specialized Qiskit API knowledge. Example tasks include Bell state preparation, GHZ circuit construction and drawing, state fidelity computation, quantum algorithm implementation (e.g., Grover, CHSH game), pulse schedules, or quantum cryptography tasks (e.g., BB84 key generation).
2. Dataset Construction and Dataset Variants
The QHE suite is derived via two principal methodologies:
- Human-Curated Tasks: 101 tasks, each supplied as a JSON-like entry with prompt, canonical Qiskit solution, test harness, and difficulty label. Unit tests exercise simulation fidelity, measurement statistics, type and structure correctness, with real-hardware tasks simulated using fake providers (Vishwakarma et al., 2024).
- Synthetic Generation Pipelines: Synthetic datasets comprising ∼522 validated tasks created from public Qiskit code repositories through function extraction, prompt synthesis, automated scoring for difficulty (circuit depth, entanglement), simulation-based validation, and AST-level deduplication. These are used both for model fine-tuning and preference-based alignment methods (Kheiri et al., 16 Jul 2025, Dupuis et al., 28 Aug 2025).
A challenging variant, "QHE-hard," increases difficulty by omitting import scaffolding, expanding the problem pool to 151 tasks, and including function signatures that draw on a wider range of the advanced Qiskit API (e.g., OpenQASM3 serialization, detailed transpilation workflows) (Dupuis et al., 28 Aug 2025).
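A human-curated entry of the kind described above can be sketched as a small record. The field names (`task_id`, `prompt`, `canonical_solution`, `test`, `difficulty`) are illustrative assumptions, not the benchmark's exact schema; the Qiskit code lives in source strings and is only ever executed inside the test harness:

```python
# Illustrative QHE-style task record (field names are assumed, not the
# benchmark's exact schema). Solution and test are stored as source
# strings; a harness would execute them in a sandboxed environment.
bell_state_task = {
    "task_id": "QHE/bell_state",  # hypothetical identifier
    "difficulty": "basic",
    "prompt": (
        "from qiskit import QuantumCircuit\n\n"
        "def bell_circuit() -> QuantumCircuit:\n"
        '    """Return a 2-qubit circuit preparing the Bell state '
        '(|00> + |11>)/sqrt(2)."""\n'
    ),
    "canonical_solution": (
        "    qc = QuantumCircuit(2)\n"
        "    qc.h(0)\n"
        "    qc.cx(0, 1)\n"
        "    return qc\n"
    ),
    "test": (
        "from qiskit.quantum_info import Statevector\n"
        "def check(candidate):\n"
        "    sv = Statevector.from_instruction(candidate())\n"
        "    assert abs(abs(sv[0]) ** 2 - 0.5) < 1e-9\n"
        "    assert abs(abs(sv[3]) ** 2 - 0.5) < 1e-9\n"
    ),
}

# The full program scored by the harness is prompt + completion body:
full_source = bell_state_task["prompt"] + bell_state_task["canonical_solution"]
print(full_source.splitlines()[0])  # "from qiskit import QuantumCircuit"
```

The simulation-based check (state amplitudes rather than string matching) mirrors how QHE-style unit tests verify functional correctness instead of surface form.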
3. Evaluation Methodology and Metrics
Evaluation of models using QHE is strictly execution-based. For each task, a candidate code completion is generated and subjected to the associated reference unit-tests in a controlled (Dockerized) Qiskit environment. Task "pass" is binary: a completion is correct only if all assertions in the test suite are satisfied under Qiskit v1.0.2 (or as specified).
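The execution-based protocol can be mimicked in miniature: run the candidate source in a fresh namespace, then run the reference assertions, and record a binary pass. This sketch uses a toy classical task rather than Qiskit so it stays self-contained; a real harness would additionally sandbox execution (e.g., in Docker) and pin the Qiskit version:

```python
def run_task(candidate_source: str, test_source: str) -> bool:
    """Execute a candidate completion, then its reference unit tests.

    Returns True only if every assertion passes (binary pass/fail,
    mirroring QHE's execution-based scoring). A production harness
    would run this inside an isolated, version-pinned container.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        exec(test_source, namespace)       # run reference assertions
    except Exception:                      # any failure counts as non-pass
        return False
    return True

# Toy stand-in for a QHE task (a real task would build Qiskit circuits).
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

print(run_task(good, tests))  # True
print(run_task(bad, tests))   # False
```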
The key metrics are:
- Pass@k: For $k$ sampled completions per prompt, pass@k is the fraction of tasks for which at least one completion passes. Formally, with $T$ tasks and an indicator $s_i = 1$ if any completion for task $i$ passes, else $s_i = 0$:

$$\text{pass@}k = \frac{1}{T} \sum_{i=1}^{T} s_i$$

- Pass@1: The special case $k = 1$, i.e., the fraction of first completions passing their reference tests.
Greedy decoding (a single deterministic completion, i.e., $k = 1$) remains the default in all reported studies, though pass@k curves at larger $k$ have been plotted for deeper sampling diagnostics (Dupuis et al., 28 Aug 2025).
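The indicator formulation above reduces to a short function. The sketch below also includes, for completeness, the combinatorial per-task estimator used by the classical HumanEval methodology when $n > k$ samples are drawn per task; the data are made up for illustration:

```python
from math import comb

def pass_at_k_empirical(results: list[list[bool]]) -> float:
    """Fraction of tasks with at least one passing completion, where
    results[i] lists pass/fail outcomes for the completions of task i."""
    return sum(any(task) for task in results) / len(results)

def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Unbiased per-task pass@k estimate from n samples with c passes
    (the combinatorial estimator of the classical HumanEval protocol)."""
    if n - c < k:
        return 1.0  # a passing sample is guaranteed in any size-k draw
    return 1.0 - comb(n - c, k) / comb(n, k)

# Three tasks, two completions each: tasks 1 and 3 have a passing sample.
outcomes = [[False, True], [False, False], [True, True]]
print(pass_at_k_empirical(outcomes))       # 2/3
print(pass_at_k_unbiased(n=10, c=3, k=1))  # 1 - 7/10 = 0.3
```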
4. Model Training Approaches for Qiskit Code Generation
Recent studies have leveraged QHE as a central benchmark for fine-tuning LLMs via reinforcement learning and preference optimization:
- Odds-Ratio Preference Optimization (ORPO): A pairwise ranking loss encourages the policy to assign higher odds to preferred (chosen) completions than to rejected ones, added to a standard supervised fine-tuning term that anchors the model to fluent code generation:

$$\mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} + \lambda \cdot \mathbb{E}\left[ -\log \sigma\!\left( \log \frac{\operatorname{odds}_\theta(y_w \mid x)}{\operatorname{odds}_\theta(y_l \mid x)} \right) \right], \qquad \operatorname{odds}_\theta(y \mid x) = \frac{p_\theta(y \mid x)}{1 - p_\theta(y \mid x)}$$

where $y_w$ and $y_l$ are the chosen and rejected completions for prompt $x$.
Chosen/rejected completion pairs are sourced from human annotation or synthetic fidelity metrics (Kheiri et al., 16 Jul 2025).
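The odds-ratio ranking term can be evaluated numerically from sequence-level probabilities; the probabilities below are made up for illustration, and in full ORPO this term is added, with a weight, to the supervised fine-tuning loss:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def odds(p: float) -> float:
    """Odds of a completion with (sequence-level) probability p."""
    return p / (1.0 - p)

def orpo_ranking_loss(p_chosen: float, p_rejected: float) -> float:
    """-log sigmoid of the log odds ratio between chosen and rejected
    completions (the pairwise ranking term of ORPO)."""
    log_odds_ratio = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
    return -math.log(sigmoid(log_odds_ratio))

# The loss shrinks as the policy favors the chosen completion more strongly.
print(orpo_ranking_loss(0.6, 0.3))  # moderate preference -> moderate loss
print(orpo_ranking_loss(0.9, 0.1))  # strong preference  -> small loss
```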
- Group Relative Policy Optimization (GRPO): A PPO-style policy-gradient method that operates over groups of $G$ candidate completions per prompt. The reward incorporates simulation fidelity and resource usage. Group mean and standard deviation allow normalized advantage computation:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}$$

The clipped policy objective, with importance ratio $\rho_i(\theta) = \pi_\theta(y_i \mid x) / \pi_{\theta_{\text{old}}}(y_i \mid x)$, further constrains updates:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\left( \rho_i(\theta)\, \hat{A}_i,\ \operatorname{clip}\!\left(\rho_i(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_i \right) \right]$$
(Kheiri et al., 16 Jul 2025, Dupuis et al., 28 Aug 2025).
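Group-normalized advantages and the clipped surrogate can be sketched in a few lines; the rewards here are toy unit-test outcomes, the probability ratios are made up, and `eps` is the usual PPO clip parameter:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize each completion's reward by its group's mean and
    standard deviation (GRPO's critic-free advantage estimate)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mu) / sigma for r in rewards]

def clipped_objective(ratios: list[float], advantages: list[float],
                      eps: float = 0.2) -> float:
    """PPO-style clipped surrogate averaged over the group; ratios are
    pi_theta / pi_theta_old for each sampled completion."""
    terms = [
        min(rho * a, max(min(rho, 1 + eps), 1 - eps) * a)
        for rho, a in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)

# Toy group of 4 completions: binary rewards from unit-test outcomes.
rewards = [1.0, 0.0, 1.0, 0.0]
adv = group_advantages(rewards)
print(adv)  # [1.0, -1.0, 1.0, -1.0]
print(clipped_objective([1.3, 0.7, 1.0, 1.0], adv))
```

Normalizing within the group removes the need for a learned value baseline, which is what makes GRPO attractive when each reward evaluation already requires an expensive quantum simulation.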
- Direct Preference Optimization (DPO): Constructed via synthetic preference pairs between high- and low-reward samples post unit-test evaluation, with loss as in Rafailov et al. (2023), typically serving as a strong baseline in concert with GRPO (Dupuis et al., 28 Aug 2025).
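The DPO loss on a single synthetic preference pair can likewise be sketched from sequence log-probabilities under the policy and a frozen reference model; the `beta` value and log-probabilities below are illustrative:

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Rafailov et al. (2023) loss: -log sigmoid of beta times the
    difference in implicit rewards of chosen (w) vs rejected (l)
    completions, measured relative to the reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already favors the chosen sample relative to the reference,
# so the margin is positive and the loss falls below log(2).
print(dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-7.0))
```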
In all cases, reward signals are tightly coupled to empirical "quantum-verifiable" unit-test pass rates; for GRPO and DPO, policies are aligned to maximize actual Qiskit code executability.
5. Comparative Experimental Results
Quantitative results consistently demonstrate that Qiskit-specific fine-tuning and preference-based RL substantially outperform generic code LLMs across both hand-curated and synthetic QHE variants.
| Model | HumanEval pass@1 | QHE pass@1 |
|---|---|---|
| CodeLlama-34B | 52.4% | 26.7% |
| DeepSeek-33B | 49.4% | 39.6% |
| StarCoder2-15B | 45.1% | 37.6% |
| Granite-8B-Base | 39.0% | 28.7% |
| Granite-8B-QK (QHE-tuned) | 38.4% | 46.5% |
| GRPO (Qwen2.5-32B) | 63.0% | 49.0% |
| ORPO (Qwen2.5-32B) | 65.9% | 56.3% |
| DPO+GRPO (QHE-hard) | N/A | 28.5% (hard) |
Performance by task difficulty reveals that all state-of-the-art models (including Qiskit-specialized ones) solve a significant fraction of basic and intermediate tasks but achieve 0% on advanced/difficult tasks (5 such tasks in the filtered QHE, 2 in the original, and more in QHE-hard) (Kheiri et al., 16 Jul 2025, Vishwakarma et al., 2024, Dupuis et al., 28 Aug 2025).
Fine-grained analysis confirms GRPO is stronger on basic tasks (structural correctness), while ORPO gains on intermediate tasks (readability, API adherence). Combined methods (DPO + GRPO) provide incremental improvement on QHE-hard (Dupuis et al., 28 Aug 2025).
6. Limitations, Failure Modes, and Open Challenges
Systematic failure modes in QHE include:
- Inability of LLMs (including those preference- or RL-fine-tuned) to solve advanced tasks involving multi-step quantum-classical control flow, dynamic circuit composition, or cryptography (e.g., full BB84 protocols).
- Hallucination of nonexistent Qiskit classes, incorrect import or API usage, and errors in backend or simulator selection, particularly on intermediate tasks (Vishwakarma et al., 2024, Kheiri et al., 16 Jul 2025).
- Strong model performance on basic initialization and simple state-preparation deteriorates steeply with increased algorithmic and architectural complexity.
Several technical and reproducibility challenges also persist, including version sensitivity, inconsistency of task definition across public releases, and the computational cost of quantum-verifiable reward function evaluation (especially on hardware) (Kheiri et al., 16 Jul 2025, Dupuis et al., 28 Aug 2025).
7. Future Directions and Standardization Initiatives
Proposed research and development avenues include:
- Expansion of QHE to encompass OpenQASM3 and multi-library translation tasks, further broadening coverage of the quantum programming stack (Vishwakarma et al., 2024).
- Integration of real-hardware execution into routine evaluation to measure code robustness in the presence of device noise.
- Unified or hybrid reward schemas combining strengths of GRPO and ORPO, refined curriculum-based RL protocols for advanced benchmarks, and full public releases of version-controlled QHE task sets and test harnesses (Kheiri et al., 16 Jul 2025).
- Creation of leaderboards and community contributions for continual dataset evolution; extension to agentic workflows, step-wise hints, automated code-repair, and reasoning around code-explainability and error-mitigation routines.
A plausible implication is that further advances in AI-assisted quantum programming will depend on tightly coupled reward learning, expanded real-hardware and simulation feedback, and richer benchmarks that mirror real-world quantum algorithm design workflows.
References
- "Qiskit HumanEval: An Evaluation Benchmark For Quantum Code Generative Models" (Vishwakarma et al., 2024)
- "Qiskit Code Assistant: Training LLMs for generating Quantum Computing Code" (Dupuis et al., 2024)
- "QSpark: Towards Reliable Qiskit Code Generation" (Kheiri et al., 16 Jul 2025)
- "Quantum Verifiable Rewards for Post-Training Qiskit Code Assistant" (Dupuis et al., 28 Aug 2025)