DeepMapping: Learned Data Mapping for Lossless Compression and Efficient Lookup

Published 12 Jul 2023 in cs.DB (arXiv:2307.05861v2)

Abstract: Storing tabular data so as to balance storage and query efficiency is a long-standing research question in the database community. In this work, we argue and show that a novel DeepMapping abstraction, which relies on the impressive memorization capabilities of deep neural networks, can provide better storage cost, better latency, and a better run-time memory footprint, all at the same time. Such unique properties may benefit a broad class of use cases on capacity-limited devices. Our proposed DeepMapping abstraction transforms a dataset into multiple key-value mappings and constructs a multi-tasking neural network model that outputs the corresponding values for a given input key. To deal with memorization errors, DeepMapping couples the learned neural network with a lightweight auxiliary data structure capable of correcting mistakes. The auxiliary structure design further enables DeepMapping to efficiently handle insertions, deletions, and updates even without retraining the mapping. We propose a multi-task search strategy for selecting the hybrid DeepMapping structures (including model architecture and auxiliary structure) with a desirable trade-off among memorization capacity, size, and efficiency. Extensive experiments with a real-world dataset as well as synthetic and benchmark datasets, including TPC-H and TPC-DS, demonstrate that the DeepMapping approach better balances retrieval speed and compression ratio against several cutting-edge competitors.
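The hybrid design described in the abstract — a learned model that memorizes most key-to-value mappings, paired with a lightweight auxiliary structure that records only the keys the model gets wrong — can be sketched as follows. This is a minimal illustration of the idea, not the paper's implementation; the class and function names are invented for this sketch, a plain `set` stands in for whatever existence filter the authors use, and a simple lambda stands in for the trained neural network.

```python
def build_auxiliary(model, data):
    """Record only the (key, value) pairs the model mis-memorizes,
    so the combined structure reproduces the data losslessly."""
    return {k: v for k, v in data.items() if model(k) != v}

class HybridMapping:
    """Illustrative learned-mapping-plus-auxiliary-structure lookup."""
    def __init__(self, model, data):
        self.model = model
        self.keys = set(data)     # existence check (exact here; a compact
                                  # filter could serve the same role)
        self.aux = build_auxiliary(model, data)

    def lookup(self, key):
        if key not in self.keys:  # reject keys absent from the dataset
            return None
        if key in self.aux:       # auxiliary structure overrides the model
            return self.aux[key]
        return self.model(key)    # otherwise trust the learned model

# Toy example: a "model" that memorized value = 2 * key for every key
# except 7, where the true value is 99 and the auxiliary table corrects it.
data = {k: 2 * k for k in range(10)}
data[7] = 99
dm = HybridMapping(lambda k: 2 * k, data)
```

Because the auxiliary table holds only the model's mistakes, its size shrinks as the model memorizes the data more accurately, which is what lets the hybrid structure compress well while staying lossless.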
