Quantum Software Testing (QST)

Updated 20 January 2026
  • Quantum Software Testing (QST) is the empirical assessment of quantum programs and circuits, ensuring runtime behavior conforms to correctness criteria amid non-deterministic outcomes.
  • QST employs diverse methodologies from naive random input generation to advanced mutation and metamorphic testing, evaluating fault detection, execution time, and computational cost.
  • QST is critical for advancing quantum application reliability by guiding test tool development, artifact release, and reproducible evaluation across simulators and real quantum processors.

Quantum Software Testing (QST) is the empirical assessment of whether quantum programs—spanning low-level circuits, high-level algorithms, quantum machine-learning models, and hybrid quantum–classical systems—exhibit runtime behavior conforming to specified correctness criteria. Unlike the deterministic, reproducible nature of classical program execution, QST must contend with non-deterministic measurement outcomes, irreversible quantum state collapse, noise-induced errors, and the distinctive fragility of NISQ-era hardware. As quantum software ecosystems mature, QST supplies the validation, tool guidance, and reliability assurances required to scale quantum applications to practical deployment (Li et al., 13 Jan 2026).

1. Taxonomy of Objects and Workflows in Quantum Software Testing

Empirical studies in QST have evaluated four principal classes of Programs Under Test (PUTs) and Circuits Under Test (CUTs):

  1. Quantum algorithms and subroutines. Prototypical examples include the Quantum Fourier Transform (QFT), Grover's Search, Quantum Phase Estimation, and Bernstein–Vazirani. These are frequently parametrized by qubit count $n$ and are implemented as explicit gate sequences in OpenQASM or high-level quantum SDKs. Complexity metrics include circuit width (number of qubits), size (number of gates), and depth (sequential layers).
  2. Quantum machine-learning models. This category encompasses parameterized quantum neural networks (QNNs), circuit-centric classifiers, and variational ansätze for supervised learning. Models are trained on classical datasets (e.g., MNIST, CIFAR-10), and statistical validation is required to determine model generalization and to detect faulty or misaligned parameterizations.
  3. Real-world benchmarks. Collections such as Bugs4Q, Qbugs, and QASMBench provide corpora of buggy, mutated, or versioned quantum programs drawn from industry and open-source repositories. These are used extensively for empirical studies of fault localization and program repair.
  4. Artificial or synthetic programs. Large random ensembles—sometimes exceeding $5 \times 10^4$ circuits—are commonly used to stress-test test-generation techniques, input-space coverage, and oracle behavior.

Formally, CUTs are modeled as parameterized quantum channels $\mathcal{E}_c$ acting on an initial quantum state $\rho$, followed by a measurement described by a set of operators $\{E_m\}$. This modeling supports rigorous reporting of circuit width, size, and depth.
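As an illustration of this channel-plus-measurement model, the sketch below (plain Python; `apply_unitary` and `measure` are hypothetical helper names, not from any cited tool) applies a unitary circuit to a pure state and samples computational-basis outcomes under the Born rule over a fixed shot budget:

```python
import random
from collections import Counter

def apply_unitary(U, state):
    """Apply a unitary (list of rows) to a state vector of complex amplitudes."""
    return [sum(U[i][j] * state[j] for j in range(len(state)))
            for i in range(len(U))]

def measure(state, shots, rng):
    """Sample computational-basis outcomes with Born-rule probabilities |a_i|^2."""
    probs = [abs(a) ** 2 for a in state]
    outcomes = rng.choices(range(len(state)), weights=probs, k=shots)
    return Counter(format(o, "b") for o in outcomes)

# Single-qubit Hadamard as the circuit under test, applied to |0>.
H = [[2 ** -0.5, 2 ** -0.5],
     [2 ** -0.5, -(2 ** -0.5)]]
rng = random.Random(0)
counts = measure(apply_unitary(H, [1 + 0j, 0j]), shots=1024, rng=rng)
```

Even this toy example exhibits the core QST difficulty: the observed counts fluctuate around the specified distribution, so any verdict about correctness is necessarily statistical.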

2. Baseline Strategies, Experimental Configuration, and Empirical Metrics

Baseline selection in QST encompasses:

  • Naive baselines: Random input generation, random state/circuit generators, and fully random search are employed to benchmark test-case effectiveness and efficiency.
  • Adapted baselines: Adaptive random testing, metamorphic testing, and spectrum-based localization are used to adapt classical testing paradigms to quantum contexts.
  • State-of-the-art baselines: QMutPy and Muskit provide mutation frameworks at the gate level; Quito, QuCAT, and QuraTest offer systematic or combinatorial test generation; ETO supplies a statistical oracle.
  • Ablative and composite baselines: These compare proposed techniques against variants with individual components removed or pipelines composed of best-in-class methods.
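To make the naive and adapted baselines concrete, here is a minimal sketch (function names are illustrative, not taken from any cited tool) of random input generation and a Hamming-distance flavor of adaptive random testing over computational-basis inputs:

```python
import random

def random_basis_inputs(n_qubits, n_tests, rng):
    """Naive baseline: sample computational-basis inputs uniformly at random."""
    return [tuple(rng.randint(0, 1) for _ in range(n_qubits))
            for _ in range(n_tests)]

def hamming(a, b):
    """Number of qubit positions at which two basis inputs differ."""
    return sum(x != y for x, y in zip(a, b))

def adaptive_random_inputs(n_qubits, n_tests, rng, n_candidates=10):
    """Adaptive random testing: from each candidate pool, keep the input
    farthest (by minimum Hamming distance) from everything selected so far."""
    suite = [tuple(rng.randint(0, 1) for _ in range(n_qubits))]
    while len(suite) < n_tests:
        pool = random_basis_inputs(n_qubits, n_candidates, rng)
        best = max(pool, key=lambda c: min(hamming(c, s) for s in suite))
        suite.append(best)
    return suite
```

The adaptive variant spreads test inputs across the input space, which is the usual rationale for preferring it over the purely random baseline when test budgets are small.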

Experimental parameters include:

  • Number of shots $s$ per circuit (fixed, exponentially varied, or adaptive),
  • Number of repetitions $r$ for statistical significance (commonly 10–50, sometimes 1,000),
  • Backends ranging from ideal state-vector simulators (unit testing, debugging), shot-based simulators, noisy simulators, to real QPUs.

Metrics:

  • Effectiveness
    • Fault detection rate: $\text{FDR} = \frac{\#\,\text{failed tests}}{\#\,\text{test cases}}$
    • Mutation score: $MS = \frac{|\mathrm{Killed\ Mutants}|}{|\mathrm{Total\ Mutants}|}$
    • Classification statistics: Accuracy, Precision, Recall, F1 (especially for QML models)
    • Localization/repair rates in fault-related studies
  • Cost
    • Execution time (overall, simulation, or on QPU)
    • Circuit complexity (qubits, gates, depth)
    • Shot/repetition counts
  • Cost-effectiveness
    • Jointly analyze fault-detection and computational overhead
    • Statistical significance reported using Mann–Whitney U, Fisher's exact, Kruskal–Wallis, and effect-size metrics (Vargha–Delaney $\hat{A}_{12}$, Cliff's $\delta$)
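The effectiveness metrics above are straightforward to compute; the sketch below (hypothetical helper names) implements the fault detection rate, mutation score, and the Vargha–Delaney $\hat{A}_{12}$ effect size:

```python
def fault_detection_rate(n_failed, n_tests):
    """FDR = (# failed tests) / (# test cases)."""
    return n_failed / n_tests

def mutation_score(n_killed, n_total):
    """MS = |killed mutants| / |total mutants|."""
    return n_killed / n_total

def vargha_delaney_a12(xs, ys):
    """A^_12: probability that a value drawn from xs exceeds one drawn
    from ys, with ties counted as one half. 0.5 means no effect."""
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))
```

$\hat{A}_{12}$ is non-parametric, which suits QST data: per-run fault counts are rarely normally distributed, so rank-based effect sizes are the safer default.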

3. Test Input Design and Generation Techniques

QST test inputs are defined by the triplet $(\rho, c, \{E_m\})$:

  • $\rho$ — Initial quantum state, selected from:
    • Computational basis states $|0\rangle, |1\rangle$
    • Fully separable states (single-qubit rotations)
    • Arbitrary superpositions, e.g., $\cos(\theta/2)|0\rangle + e^{i\phi}\sin(\theta/2)|1\rangle$ (Bloch parametrization)
    • Mixed states (density operators), entangled cat/Bell/graph states
    • Eigenvectors of unitary circuit blocks
  • $c$ — Classical arguments: qubit count $n$, numeric parameters, matrix oracles, or classical data (for QML)
  • $\{E_m\}$ — Measurement operators:
    • Default $Z$-basis
    • Rotated bases (e.g., $X$ via Hadamard)
    • Generalized POVMs, Helstrom measurements

Input generation spans random sampling, combinatorial covering arrays (e.g., via QuCAT), coverage-guided search, constraint solvers (for state vectors), diversity-driven concolic generation, and metamorphic transformations.
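As one concrete instance of input sampling, a Haar-uniform single-qubit state in the Bloch parametrization can be drawn as follows (a sketch; `random_bloch_state` is an illustrative name, not part of any cited tool):

```python
import cmath
import math
import random

def random_bloch_state(rng):
    """Sample a Haar-uniform single-qubit pure state
    cos(theta/2)|0> + e^{i phi} sin(theta/2)|1>."""
    # Uniform on the Bloch sphere: cos(theta) uniform in [-1, 1],
    # phi uniform in [0, 2*pi). Sampling theta uniformly instead
    # would over-weight the poles.
    theta = math.acos(1 - 2 * rng.random())
    phi = 2 * math.pi * rng.random()
    return (math.cos(theta / 2), cmath.exp(1j * phi) * math.sin(theta / 2))

rng = random.Random(42)
amp0, amp1 = random_bloch_state(rng)
```

The subtlety worth noting is the $\cos\theta$ trick in the comment: naive uniform sampling of $\theta$ biases inputs toward $|0\rangle$ and $|1\rangle$, which can silently skew a test suite's coverage of the input state space.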

4. Tools, Frameworks, and Artifact Practices

The empirical QST landscape is characterized by extensive tool and artifact support:

| Tool/Framework | Function | Notable Features |
| --- | --- | --- |
| Qiskit, Q#, Cirq, PennyLane | Program construction and simulation | Standard industry and research SDKs |
| QMutPy, Muskit | Gate-level mutation operators on circuits | Mutation input for empirical evaluation |
| Quito, QuCAT, QuraTest, QuanTest | Test generation (combinatorial, entanglement-guided, pattern-based) | Systematic input-suite design |
| ETO, swap/inverse circuits, ProQ | Oracle support | Statistical and assertion-based oracles |
| Bugs4Q, Qbugs, MQT Bench, VeriQBench | Benchmark suites | Curated real and mutant buggy programs |

Over half of QST studies release artifacts (test suites, reproduction scripts, datasets) on repositories such as GitHub, Zenodo, Figshare, and OSF to enhance reproducibility.

5. Methodological Limitations, Recurring Inconsistencies, and Open Challenges

Empirical QST exhibits several persistent issues:

  • Heterogeneous experimental setups impede cross-study comparability, with inconsistent reporting of test suite sizes, circuit metrics, and shot/repetition counts.
  • Oracle specification and misalignment problems result in false positives/negatives due to unclear mapping between specification (state, distribution, or operator-level criteria) and the chosen test oracles.
  • Limited structural testing, with emphasis on black-box over white- or grey-box (control/data-flow, circuit topology) analysis. Systematic structural frameworks remain underdeveloped.
  • Scarcity of real-world buggy benchmarks, with heavy dependence on artificial mutation techniques and a lack of version-level buggy quantum programs.
  • Insufficient statistical rigor, as only approximately 15% of surveyed studies perform statistical hypothesis tests or assess effect sizes.
  • Backend biases, with over-reliance on idealized simulators rather than noisy, realistic NISQ devices.

Open research challenges include (a) defining oracles capable of partial or approximate correctness assessment, (b) scaling test generation and oracle evaluation to $\geq 10$ qubits under tight computational constraints, (c) integrating fault-tolerant noise models and in situ hardware variability, (d) expanding benchmarks to cover hybrid quantum–classical workflows and domain-specific languages, and (e) formalizing systematic structural coverage metrics (e.g., quantum control/data-flow).

6. Recommendations for Rigorous and Reproducible QST Practice

  • Goal-driven study design: Explicitly distinguish code-level versus system-level testing objectives; factor in hardware constraints as appropriate.
  • Explicit taxonomization and reporting: Detail the characteristics of all PUTs (language, width, size, depth, subroutines) and input classes.
  • Specification discipline: Publicly define and document program specifications, mapping explicit input sets to expected final states, distributions, or operator action.
  • Oracle–specification alignment: Match the type of test oracle (e.g., Wrong Output Oracle (WOO), Output Probability Oracle (OPO), Probability-based Oracle (PBO), Distribution-based Output Oracle (DOO), Quantum State Oracle (QSO), Quantum Operator Oracle (QOO)) to the corresponding specification; rigorously describe oracle evaluation (e.g., via $\chi^2$ or swap/inverse tests).
  • Comprehensive reporting: Disclose test-suite sizes, shot/repetition budgets, and all simulator or backend details.
  • Cost-effectiveness and statistical analysis: Jointly analyze detection metrics with cost/complexity; substantiate claims via hypothesis testing and effect-size measures.
  • Artifact release: Deposit all code, data, and benchmarks on persistent, indexed platforms.
  • Benchmark and tool development: Build and disseminate QST-focused benchmark collections (especially scalable, buggy, or domain-specific programs).
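The oracle–specification alignment recommendation can be illustrated for a distribution-based oracle: compare observed counts against the specified outcome distribution using a Pearson $\chi^2$ statistic and a precomputed critical value. This is a sketch with hypothetical names; 3.841 is the $\chi^2$ critical value for 1 degree of freedom at $\alpha = 0.05$:

```python
def chi_square_statistic(observed, expected_probs, shots):
    """Pearson chi-square statistic between observed counts and the
    expected outcome distribution given by the specification."""
    stat = 0.0
    for outcome, p in expected_probs.items():
        exp = p * shots
        obs = observed.get(outcome, 0)
        stat += (obs - exp) ** 2 / exp
    return stat

def distribution_oracle(observed, expected_probs, shots, critical):
    """Pass iff the statistic stays below the critical value for the
    chosen significance level and degrees of freedom."""
    return chi_square_statistic(observed, expected_probs, shots) < critical

# A near-uniform sample over {"0", "1"} should pass against a 50/50 spec.
ok = distribution_oracle({"0": 498, "1": 526}, {"0": 0.5, "1": 0.5},
                         shots=1024, critical=3.841)
```

The oracle's verdict is only meaningful when the critical value matches the specification's degrees of freedom and the chosen significance level, which is exactly the alignment and reporting discipline the recommendation calls for.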

Systematic adherence to these methodological best practices underpins progress toward a mature, reproducible, and broadly comparable QST research discipline (Li et al., 13 Jan 2026).
