Overview of LMFAO Engine for Analytics Workloads
The paper introduces LMFAO (Layered Multiple Functional Aggregate Optimization), an in-memory engine designed to efficiently optimize and execute batches of aggregates over relational databases. The engine is motivated by the observation that analytics tasks across domains such as banking and retail can be reformulated as group-by aggregates over the join of the input database relations. LMFAO's ability to tackle such tasks is exemplified through applications like learning ridge linear regression models, classification and regression trees, and the structure of Bayesian networks via Chow-Liu trees, as well as building data cubes in data warehousing.
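To make the reformulation concrete, here is a minimal sketch (with made-up toy relations, not examples from the paper) of an analytics quantity expressed as a group-by aggregate over a join:

```python
# Toy relations (hypothetical): Sales(store, item, units) and Items(item, price).
# Per-store revenue is the group-by aggregate
#   SUM(units * price) GROUP BY store  over  Sales JOIN Items ON item.
from collections import defaultdict

sales = [("s1", "i1", 3), ("s1", "i2", 1), ("s2", "i1", 2)]
items = [("i1", 10.0), ("i2", 5.0)]

price = {item: p for item, p in items}   # index the smaller relation
revenue = defaultdict(float)
for store, item, units in sales:         # join and aggregate in one pass
    revenue[store] += units * price[item]

print(dict(revenue))  # {'s1': 35.0, 's2': 20.0}
```

Sufficient statistics for models such as ridge regression (sums of pairwise feature products) have the same shape, which is why a whole batch of them can be handed to one aggregate engine.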
LMFAO Architecture and Optimization Techniques
LMFAO employs a layered approach to handle multiple aggregates efficiently. The architecture consists of several logical and code optimization layers:
- Join Tree Construction: LMFAO constructs a single join tree over which all aggregates in the batch are computed, relying on hypertree decompositions to handle cyclic queries.
- Find Roots and Directional Views: LMFAO assigns each aggregate in the batch a root in the join tree, aiming to reduce the number of views and increase computation sharing. Views become directional: each is computed along a specific traversal direction of the tree.
- Aggregate Pushdown and Merge Views: Aggregates are decomposed into views pushed along the edges of the join tree. LMFAO then merges views that share group-by attributes, bodies, or aggregates, avoiding redundant computation.
- Group Views: Views that can be computed together without further dependencies are clustered into groups, each of which becomes one multi-output execution plan.
- Multi-Output Optimization: This novel step computes an entire group of views in a single scan over their common input relation, sharing computation across the views.
- Parallelization and Compilation: LMFAO executes view groups in parallel and compiles the plan into specialized C++ code, applying inlining and other low-level optimizations.
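The aggregate-pushdown idea can be illustrated with a small sketch (toy relations and names are assumptions, not taken from the paper): instead of materializing the join, a child relation is partially aggregated into a view keyed on the join attribute, which the parent then looks up.

```python
# Hypothetical join-tree edge Orders -> Lineitems. Target aggregate:
#   SUM(amount) GROUP BY month  over  Orders JOIN Lineitems ON oid.
from collections import defaultdict

orders = [(1, "2024-01"), (2, "2024-01"), (3, "2024-02")]  # (oid, month)
lineitems = [(1, 10.0), (1, 5.0), (2, 7.0), (3, 2.0)]      # (oid, amount)

# View pushed up from the Lineitems child: SUM(amount) GROUP BY oid.
view = defaultdict(float)
for oid, amount in lineitems:
    view[oid] += amount

# The parent merges the child's view, so each relation is scanned once
# and the join result is never materialized.
total = defaultdict(float)
for oid, month in orders:
    total[month] += view[oid]

print(dict(total))  # {'2024-01': 22.0, '2024-02': 2.0}
```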
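The multi-output idea can likewise be sketched in a few lines (toy data and names are assumptions): several group-by views over the same relation share one scan instead of scanning once per view.

```python
# Hypothetical relation Inventory(store, item, units) and three views
# over it, all filled during a single pass over the data.
from collections import defaultdict

inventory = [("s1", "i1", 3), ("s1", "i2", 1), ("s2", "i1", 2)]

count_by_store = defaultdict(int)  # View 1: COUNT(*) GROUP BY store
sum_by_item = defaultdict(int)     # View 2: SUM(units) GROUP BY item
total_units = 0                    # View 3: SUM(units), no grouping

for store, item, units in inventory:  # one scan feeds all three views
    count_by_store[store] += 1
    sum_by_item[item] += units
    total_units += units

print(dict(count_by_store), dict(sum_by_item), total_units)
```

LMFAO's compiled C++ plans apply the same pattern, with the per-tuple updates inlined into the scan loop.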
Numerical Results and Computational Advantages
Performance benchmarks show that LMFAO outperforms classical database systems such as PostgreSQL and MonetDB, as well as machine learning libraries such as TensorFlow and scikit-learn, often by orders of magnitude. The gains come from avoiding the materialization of large intermediate results and from a principled approach to sharing computation across the aggregate batch.
Implications and Future Developments
The design of LMFAO opens avenues for more efficient data-driven analytics in domains that rely heavily on relational data processing, such as finance, retail, and marketing. The implications are substantial, since the design suggests a shift away from current paradigms that separate data pre-processing from model learning. Future developments may extend LMFAO's capabilities to distributed systems and integrate advanced machine learning models that inherently require complex aggregations.
Overall, LMFAO represents an important contribution to analytics workload optimization, advocating for a comprehensive approach that unifies database and machine learning systems. It leverages long-standing principles from database research while addressing contemporary scalability challenges in large-scale data analytics.