DeepMapping: Learned Data Mapping for Lossless Compression and Efficient Lookup
Abstract: Storing tabular data in a way that balances storage and query efficiency is a long-standing research question in the database community. In this work, we argue and show that a novel DeepMapping abstraction, which relies on the impressive memorization capabilities of deep neural networks, can provide lower storage cost, lower latency, and a smaller run-time memory footprint, all at the same time. These properties may benefit a broad class of use cases on capacity-limited devices. Our proposed DeepMapping abstraction transforms a dataset into multiple key-value mappings and constructs a multi-task neural network model that outputs the corresponding values for a given input key. To deal with memorization errors, DeepMapping couples the learned neural network with a lightweight auxiliary data structure capable of correcting mistakes. The design of the auxiliary structure further enables DeepMapping to handle insertions, deletions, and updates efficiently, even without retraining the mapping. We propose a multi-task search strategy for selecting the hybrid DeepMapping structures (including the model architecture and the auxiliary structure) with a desirable trade-off among memorization capacity, size, and efficiency. Extensive experiments with a real-world dataset, synthetic datasets, and benchmark datasets, including TPC-H and TPC-DS, demonstrate that the DeepMapping approach balances retrieval speed and compression ratio better than several cutting-edge competitors.
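The hybrid structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `HybridMapping`, its method names, the `live` key set, and the hand-written `toy_model` (a stand-in for a trained neural network) are all assumptions made for illustration. The sketch only shows how a table of corrections can make an imperfect predictor lossless, and how inserts, updates, and deletes can be absorbed without retraining.

```python
from typing import Callable, Dict, Optional, Set


class HybridMapping:
    """Toy lossless key-value store: a predictor plus a table of corrections."""

    def __init__(self, model: Callable[[int], str], data: Dict[int, str]) -> None:
        self.model = model              # stand-in for the learned mapping (assumption)
        self.live: Set[int] = set()     # keys currently present (assumption)
        self.aux: Dict[int, str] = {}   # corrections for keys the model gets wrong
        for key, value in data.items():
            self.upsert(key, value)

    def upsert(self, key: int, value: str) -> None:
        """Insert or update a key without touching the model."""
        self.live.add(key)
        if self.model(key) == value:
            self.aux.pop(key, None)     # model is right: no correction needed
        else:
            self.aux[key] = value       # model is wrong: store the true value

    def delete(self, key: int) -> None:
        """Remove a key; the model may still predict something for it."""
        self.live.discard(key)
        self.aux.pop(key, None)

    def lookup(self, key: int) -> Optional[str]:
        """Return the exact stored value, or None for absent keys."""
        if key not in self.live:
            return None
        if key in self.aux:
            return self.aux[key]
        return self.model(key)


def toy_model(key: int) -> str:
    # Hypothetical "memorized" rule; a real system would run network inference.
    return "shipped" if key % 2 == 0 else "pending"


orders = {0: "shipped", 1: "pending", 2: "shipped", 3: "returned", 4: "shipped"}
store = HybridMapping(toy_model, orders)
```

In this toy setup only key 3 needs a correction entry, so the auxiliary table stays small when the predictor is mostly right. The storage benefit claimed in the abstract depends on the model and the corrections together being smaller than the raw data, which this sketch does not measure.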