
Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models

Published 28 Mar 2024 in cs.CL and cs.AI (arXiv:2403.19340v2)

Abstract: To address the challenges of data processing at scale, we propose Dataverse, a unified open-source Extract-Transform-Load (ETL) pipeline for LLMs with a user-friendly design at its core. Dataverse's block-based interface makes it easy to add custom processors, so users can readily and efficiently build their own ETL pipelines. We hope that Dataverse will serve as a vital tool for LLM development, and we open-source the entire library to welcome community contributions. Additionally, we provide a concise, two-minute video demonstration of our system, illustrating its capabilities and implementation.
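The abstract's central design claim is a block-based interface: each processing step is a named, self-contained block, and a pipeline is just an ordered list of block names. A minimal sketch of that pattern in Python follows; the names (`register_etl`, `run_pipeline`, the two example blocks) are illustrative assumptions, not Dataverse's actual API.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical registry mapping block names to processor functions.
_REGISTRY: Dict[str, Callable[[List[dict]], List[dict]]] = {}

def register_etl(name: str):
    """Decorator that registers a processing block under a name,
    so pipelines can reference it by string in a config."""
    def wrap(fn: Callable[[List[dict]], List[dict]]):
        _REGISTRY[name] = fn
        return fn
    return wrap

@register_etl("deduplicate")
def deduplicate(rows: List[dict]) -> List[dict]:
    """Drop rows whose 'text' field was already seen (exact match)."""
    seen, out = set(), []
    for row in rows:
        if row["text"] not in seen:
            seen.add(row["text"])
            out.append(row)
    return out

@register_etl("min_length_filter")
def min_length_filter(rows: List[dict], min_chars: int = 10) -> List[dict]:
    """Keep only rows with at least `min_chars` characters of text."""
    return [r for r in rows if len(r["text"]) >= min_chars]

def run_pipeline(rows: List[dict], steps: List[str]) -> List[dict]:
    """Apply registered blocks in order, as a pipeline config would."""
    for step in steps:
        rows = _REGISTRY[step](rows)
    return rows
```

With this shape, adding a custom processor is a single decorated function, and a full pipeline is data (a list of step names) rather than code, which is the kind of extensibility the abstract describes. The paper's actual system additionally targets scale (e.g. distributed execution), which this single-process sketch omits.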

