VeriEQL: Bounded Equivalence Verification for Complex SQL Queries with Integrity Constraints

Published 5 Mar 2024 in cs.PL and cs.DB | (2403.03193v2)

Abstract: The task of SQL query equivalence checking is important in various real-world applications (including query rewriting and automated grading) that involve complex queries with integrity constraints; yet, state-of-the-art techniques are very limited in their capability of reasoning about complex features (e.g., those that involve sorting, case statement, rich integrity constraints, etc.) in real-life queries. To the best of our knowledge, we propose the first SMT-based approach and its implementation, VeriEQL, capable of proving and disproving bounded equivalence of complex SQL queries. VeriEQL is based on a new logical encoding that models query semantics over symbolic tuples using the theory of integers with uninterpreted functions. It is simple yet highly practical -- our comprehensive evaluation on over 20,000 benchmarks shows that VeriEQL outperforms all state-of-the-art techniques by more than one order of magnitude in terms of the number of benchmarks that can be proved or disproved. VeriEQL can also generate counterexamples that facilitate many downstream tasks (such as finding serious bugs in systems like MySQL and Apache Calcite).

Abstract PDF Upgrade to Chat

Citations (4)

View on Semantic Scholar

Summary

The paper presents an SMT-based approach to verify bounded equivalence in complex SQL queries with rigorous integrity constraint encoding.
It translates SQL queries into logical SMT formulas, enabling efficient counterexample generation which outperforms existing techniques.
Comprehensive evaluations on benchmarks like LeetCode and Calcite illustrate significant improvements in performance and coverage.

VeriEQL: Bounded Equivalence Verification for Complex SQL Queries with Integrity Constraints

The paper "VeriEQL: Bounded Equivalence Verification for Complex SQL Queries with Integrity Constraints" presents an approach to verify the equivalence of SQL queries that involve complex features and integrity constraints. This process is crucial for tasks such as query rewriting and automated grading. However, existing methods have limitations in handling complex SQL features, which limits their applicability in real-world scenarios. The paper discusses a new SMT-based approach to overcome these limitations and effectively prove or disprove the bounded equivalence of complex SQL queries. This essay summarizes the methodology, evaluation, and its potential implications.

Introduction

Equivalence checking for SQL queries is an essential task for validating query rewritings and automated grading in database applications. The ability to formally verify that two complex SQL queries produce equivalent outputs across all input scenarios up to a certain size is a crucial task for ensuring accuracy and completeness in SQL-dependent systems. The paper introduces $VeriEQL$ , a tool that utilizes an SMT-based approach, thus enabling bounded verification using a new logical encoding model. VeriEQL is tailored to support complex SQL features such as sorting, CASE statements, and rich integrity constraints that go beyond currently available equivalence checkers. Through an extensive evaluation on over 20,000 benchmarks, $VeriEQL$ outperforms state-of-the-art techniques by more than one order of magnitude in terms of successfully verifying bounded equivalence or refuting it by generating counterexamples.

Figure 1: Schematic workflow of VeriEQL.

Implementation Strategy

Key Components of VeriEQL

SMT-Based Approach: VeriEQL employs a novel SMT-based logical encoding to model query semantics using the theory of integers with uninterpreted functions. This aspect enables it to handle complex SQL queries with sorting, advanced case statements, and rich integrity constraints effectively, without relying on indirect encodings such as those required by Rosette.

Algorithm

The $VeriEQL$ algorithm translates the SQL equivalence problem into an SMT problem for bounded verification:

Symbolic Database Construction:
- Construct a symbolic representation of the database, allowing for relations to contain up to a predefined bound of symbolic tuples (Figure 2). Each tuple status (deleted or not) is represented through uninterpreted predicates.
Integrity Constraint Encoding:
- Translates integrity constraints into SMT formulas over symbolic tuples. This includes primary keys, foreign keys, not-null, and check constraints, and auto-increment attributes. This approach generates counterexamples that satisfy these constraints.

Figure 2: Schema, integrity constraint, and queries.

SQL Semantics Encoding:
- Formalize SQL semantics using higher-order functions that provide a logical foundation for encoding in SMT.
- Sequence query operators are encoded for each query to capture their semantics in a symbolic manner, as demonstrated in Figures 21 and other related figures outlined in the supplementary materials.

Figure 3: The paired mechanism of EXCEPT.

Equivalence Verification and Counterexample Generation:
- Construct an SMT formula encoding equivalence of two query outputs under the given constraints for databases of the bounded size.
- Utilize an SMT solver to prove or disprove equivalence. If the formula is satisfiable, $VeriEQL$ generates counterexamples from the model obtained from the solver, aiding in the identification of query non-equivalence.

Evaluation

Comparison with State-of-the-Art Techniques

$VeriEQL$ was evaluated using three different workloads: $LeetCode$ , $Calcite$ , and $Literature$ , against various state-of-the-art bounded verification and testing tools. The results demonstrated $VeriEQL$ 's superior support for a larger set of complex SQL queries, and its ability to identify significantly more non-equivalent queries than existing techniques.

LeetCode Workload: $VeriEQL$ demonstrated a coverage of nearly 78% of the benchmarks and could successfully refute over 15%, identifying genuine counterexamples for faulty queries.
Calcite Workload: Despite hypothetical reasons like non-equivalent queries in the benchmark, $VeriEQL$ outperformed others, proving or disproving more examples within a 10-minute time frame.
Literature Workload: $VeriEQL$ supported almost all benchmarks in this category and found valid counterexamples for two orders of magnitude more benchmarks compared to current state-of-the-art techniques.

Practical and Theoretical Implications

The proposed approach provides an enhanced toolset for practitioners and researchers dealing with SQL queries. It allows robust equivalence verification that includes the consideration of integrity constraints, a feature that ensures the reduction of spurious counterexamples. As noted in evaluations against databases like MySQL and integration with practical platforms like LeetCode, $VeriEQL$ has been instrumental in identifying bugs in real-world applications and suggesting augmented test cases. Given its substantial performance gains and extended coverage, $VeriEQL$ promises significant contributions both to theoretical aspects of bounded verification and its practical applications, impacting areas such as database optimizations, education, and system debugging.

Conclusion

"VeriEQL: Bounded Equivalence Verification for Complex SQL Queries with Integrity Constraints" presents a practical approach that advances the formal verification and disproval of SQL query equivalence. Through its innovative SMT-based encoding and support for complex queries with rich integrity constraints, $VeriEQL$ extensively outperforms existing techniques, making it a powerful asset for both academic exploration and applied database tasks. Its ability to generate detailed counterexamples further strengthens its position as an invaluable tool for enhancing SQL query accuracy and robustness.