Learned Cardinalities: Estimating Correlated Joins with Deep Learning

Published 3 Sep 2018 in cs.DB | (1809.00677v2)

Abstract: We describe a new deep learning approach to cardinality estimation. MSCN is a multi-set convolutional network, tailored to representing relational query plans, that employs set semantics to capture query features and true cardinalities. MSCN builds on sampling-based estimation, addressing its weaknesses when no sampled tuples qualify a predicate, and in capturing join-crossing correlations. Our evaluation of MSCN using a real-world dataset shows that deep learning significantly enhances the quality of cardinality estimation, which is the core problem in query optimization.

Abstract PDF Upgrade to Chat

Citations (351)

View on Semantic Scholar

Summary

The paper introduces the Multi-Set Convolutional Network (MSCN) that significantly enhances cardinality estimation for correlated joins compared to traditional methods.
It employs deep learning with set semantics and multi-layer perceptrons to model relational query plans, effectively predicting sizes even in sparse sampling conditions.
Empirical results on the IMDb dataset show MSCN’s robust performance, particularly in 0-tuple scenarios, underscoring its potential in query optimization.

Analyzing "Learned Cardinalities: Estimating Correlated Joins with Deep Learning"

In the paper "Learned Cardinalities: Estimating Correlated Joins with Deep Learning," the authors explore a novel application of deep learning techniques to the problem of cardinality estimation within the context of query optimization in relational database systems. The paper introduces the Multi-Set Convolutional Network (MSCN), a tailored deep learning model designed to enhance predictions of intermediate result sizes during query execution. This work is premised on the argument that existing database systems often provide cardinality estimates that are inaccurate by orders of magnitude, particularly in scenarios involving join-crossing correlations—an aspect highlighted by their analysis of the IMDb dataset.

Overview and Methodology

The MSCN is architecturally distinct, modeling relational query plans using set semantics to capture the inherent permutation invariance. It leverages deep learning by encoding queries as sets and learning correlations through multi-layer perceptrons (MLPs). This approach contrasts with traditional sampling-based techniques, which face challenges, especially in scenarios with no qualifying samples or when indices are unavailable. The model effectively utilizes bitmaps representing qualifying sample positions, allowing the learning of patterns that improve cardinality predictions when samples are sparse.

Detailed Analysis of Results

The paper provides an empirical evaluation of MSCN using the IMDb dataset, demonstrating that it outperforms traditional techniques like PostgreSQL’s query optimizer and state-of-the-art methods like Index-Based Join Sampling (IBJS) when dealing with correlated joins. The results indicate significant improvement in estimation accuracy, with MSCN achieving a median q-error better than competitors and performing exceptionally well in 0-tuple situations. Notably, it maintains competitive performance in circumstances where sampling-based methods falter.

Strong Claims and Implications

One of the bold claims made by the authors is the robustness of MSCN in 0-tuple scenarios and its ability to generalize to queries with more joins than encountered during training. This robustness suggests that leveraging machine learning models could fill a notable gap in current estimation techniques. Furthermore, by incorporating bitmaps of qualifying samples, MSCN potentially leverages runtime sampling information more effectively, thus addressing key limitations of independent assumption-based methods.

Theoretical and Practical Implications

From a theoretical standpoint, the application of deep learning to cardinality estimation expands the domain of what machine learning can address within database systems beyond tasks like indexing and query optimization. Practically, incorporating such models could notably enhance the reliability of query optimization, leading to predictable and improved performance of database systems. However, the discussion also identifies areas for further exploration, such as handling updates and incorporating more complex predicates, where current models may see limitations.

Future Directions

The authors speculate on several future enhancements for MSCN. These include tackling complex SQL predicates and queries with disjunctions, uncertainty estimation in prediction, and adaptive training tailored to specific workload characteristics. Moreover, potential improvements regarding data updating and handling dynamic data distributions are highlighted as essential directions for making such models viable in real-time database systems.

Concluding Remarks

This paper presents a significant contribution to the ongoing discourse on machine learning in database systems, particularly in the context of query optimization. The proposed MSCN model offers a promising alternative to traditional cardinality estimation methods, with tangible benefits demonstrated through empirical analysis. The inquiry into incorporating machine learning for solving historically challenging problems like join-crossing correlations provides a framework for further research and development in this domain. As the field progresses, the intersection of deep learning and database management is likely to open new avenues for enhancing the efficiency and flexibility of data systems.