
ArcaDB: A Container-based Disaggregated Query Engine for Heterogenous Computational Environments

Published 25 Nov 2023 in cs.DB | (2311.14933v1)

Abstract: Modern enterprises rely on data management systems to collect, store, and analyze vast amounts of data related to their operations. Clusters and hardware accelerators (e.g., GPUs, TPUs) have become a necessity for keeping pace with data processing demands in applications such as social media, bioinformatics, surveillance systems, remote sensing, and medical informatics. Given this new scenario, the architecture of data analytics engines must evolve to take advantage of these technological trends. In this paper, we present ArcaDB: a disaggregated query engine that leverages container technology to place operators at compute nodes that fit their performance profile. In ArcaDB, a query plan is dispatched to worker nodes that have different computing characteristics. Each operator is annotated with the preferred type of compute node for execution, and ArcaDB ensures that the operator gets picked up by the appropriate workers. We have implemented a prototype version of ArcaDB using Java, Python, and Docker containers. We have also completed a preliminary performance study of this prototype, using images and scientific data. This study shows that ArcaDB can speed up query performance by a factor of 3.5 in comparison with a shared-nothing, symmetric arrangement.
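
The operator-placement idea in the abstract can be illustrated with a minimal Python sketch. It is not ArcaDB's implementation: the `Operator` and `Dispatcher` classes, the node-type labels ("cpu", "gpu"), and the operator names are all hypothetical, and the sketch ignores dependencies between operators, containers, and networking. It only shows operators annotated with a preferred node type being picked up by workers of that type.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    preferred_node: str  # hypothetical label, e.g. "cpu" or "gpu"


class Dispatcher:
    """Keeps one FIFO queue of pending operators per node type (illustrative only)."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def submit(self, plan):
        # Route every operator to the queue that matches its annotation.
        for op in plan:
            self._queues[op.preferred_node].append(op)

    def poll(self, node_type):
        # A worker of the given type asks for work; None means nothing is waiting.
        queue = self._queues.get(node_type)
        return queue.popleft() if queue else None


# Hypothetical three-operator plan over image data.
plan = [
    Operator("scan_images", "cpu"),
    Operator("classify_images", "gpu"),
    Operator("aggregate", "cpu"),
]
dispatcher = Dispatcher()
dispatcher.submit(plan)
```

In this sketch a GPU worker calling `dispatcher.poll("gpu")` receives only the operator annotated for GPUs, while CPU workers drain the remaining operators in submission order.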

