Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models
Abstract: To address the challenges of data processing at scale, we propose Dataverse, a unified open-source Extract-Transform-Load (ETL) pipeline for large language models (LLMs) with a user-friendly design at its core. Dataverse's block-based interface makes it easy to add custom processors, allowing users to readily and efficiently build their own ETL pipelines. We open-source the entire library to welcome community contributions and hope that Dataverse will serve as a vital tool for LLM development. Additionally, we provide a concise two-minute video demonstration of the system, illustrating its capabilities and implementation.
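To make the block-based design concrete, the sketch below shows how a custom processor might be registered and composed into a pipeline. The identifiers `register_etl`, `ETLPipeline`, the OmegaConf-style config, and the record-mapping signature are assumptions for illustration; the abstract only states that custom processors can be added through a block-based interface, so treat this as a minimal sketch rather than Dataverse's confirmed API.

```python
# Minimal sketch of a block-based ETL interface, under assumed names:
# `register_etl` (processor registry decorator) and `ETLPipeline` (runner).
# None of these identifiers are confirmed by the abstract; they illustrate
# the described "add a custom processor, then compose blocks" workflow.
from omegaconf import OmegaConf

from dataverse.etl import ETLPipeline, register_etl  # assumed import path


@register_etl  # assumed decorator: registers this function as a pipeline block
def cleaning___text___strip_whitespace(spark, data, *args, **kwargs):
    """Custom block: trim surrounding whitespace from each record's text."""
    return data.map(lambda row: {**row, "text": row["text"].strip()})


# Compose blocks declaratively; each entry names a registered processor.
config = OmegaConf.create({
    "spark": {"appname": "dataverse_demo", "driver": {"memory": "4g"}},
    "etl": [
        {"name": "cleaning___text___strip_whitespace"},
    ],
})

ETLPipeline().run(config)  # assumed entry point that executes the block list
```

Under this reading, a user extends the pipeline by writing one function and listing it in the config, rather than modifying the library itself.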