ArcaDB: A Container-based Disaggregated Query Engine for Heterogenous Computational Environments
Abstract: Modern enterprises rely on data management systems to collect, store, and analyze vast amounts of data related to their operations. Clusters and hardware accelerators (e.g., GPUs, TPUs) have become a necessity for keeping up with the data processing demands of applications in social media, bioinformatics, surveillance systems, remote sensing, and medical informatics. Given this scenario, the architecture of data analytics engines must evolve to take advantage of these technological trends. In this paper, we present ArcaDB, a disaggregated query engine that leverages container technology to place operators at compute nodes that fit their performance profile. In ArcaDB, a query plan is dispatched to worker nodes that have different computing characteristics. Each operator is annotated with the preferred type of compute node for execution, and ArcaDB ensures that the operator gets picked up by the appropriate workers. We have implemented a prototype of ArcaDB using Java, Python, and Docker containers. We have also completed a preliminary performance study of this prototype, using images and scientific data. This study shows that ArcaDB can speed up query execution by a factor of 3.5 compared with a shared-nothing, symmetric arrangement.
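The placement idea in the abstract (operators annotated with a preferred node type, then picked up by matching workers) can be illustrated with a minimal sketch. This is not ArcaDB's implementation: the `Operator`, `Dispatcher`, and `drain` names, the per-type FIFO queues, and the "cpu"/"gpu" labels are all assumptions made for illustration, and the real system runs workers in Docker containers rather than in a single process.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """Hypothetical query operator carrying a node-type annotation."""
    name: str            # e.g. "scan_images"
    preferred_node: str  # assumed labels such as "cpu" or "gpu"


class Dispatcher:
    """Hypothetical dispatcher: one FIFO queue of pending operators per node type."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def submit(self, plan):
        # Route each operator to the queue that matches its annotation.
        for op in plan:
            self._queues[op.preferred_node].append(op)

    def next_for(self, node_type):
        # A worker only ever sees operators annotated for its own type.
        queue = self._queues[node_type]
        return queue.popleft() if queue else None


def drain(dispatcher, node_type):
    """Simulate a worker of the given type picking up all matching operators."""
    picked = []
    while (op := dispatcher.next_for(node_type)) is not None:
        picked.append(op.name)
    return picked


# A toy three-operator plan; operator names are made up for the example.
plan = [
    Operator("scan_images", "cpu"),
    Operator("extract_features", "gpu"),
    Operator("filter_rows", "cpu"),
]
dispatcher = Dispatcher()
dispatcher.submit(plan)

gpu_ops = drain(dispatcher, "gpu")
cpu_ops = drain(dispatcher, "cpu")
print(gpu_ops)  # ['extract_features']
print(cpu_ops)  # ['scan_images', 'filter_rows']
```

In this sketch the annotation alone decides where an operator runs. A real engine would also have to handle dependencies between operators and data movement between nodes, which the sketch leaves out.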