Orion-14B: Open-source Multilingual Large Language Models

Published 20 Jan 2024 in cs.CL and cs.LG | (2401.12246v1)

Abstract: In this study, we introduce Orion-14B, a collection of multilingual LLMs with 14 billion parameters. We utilize a data scheduling approach to train a foundational model on a diverse corpus of 2.5 trillion tokens, sourced from texts in English, Chinese, Japanese, Korean, and other languages. Additionally, we fine-tuned a series of models tailored for conversational applications and other specific use cases. Our evaluation results demonstrate that Orion-14B achieves state-of-the-art performance across a broad spectrum of tasks. We make the Orion-14B model family and its associated code publicly accessible https://github.com/OrionStarAI/Orion, aiming to inspire future research and practical applications in the field.


Summary

  • The paper introduces Orion-14B, a 14-billion-parameter multilingual LLM pretrained on 2.5 trillion tokens, advancing open-source multilingual research.
  • It details the training pipeline, covering data filtering, deduplication, and a data schedule of stepped complexity that reduces validation loss.
  • It describes fine-tuned variants for conversational and other specialized tasks, reporting strong performance on benchmarks such as RACE and HellaSwag.

Introduction

Orion-14B is a notable addition to the landscape of multilingual LLMs, with 14 billion parameters. The model was trained on 2.5 trillion tokens drawn from texts in diverse languages, primarily English, Chinese, Japanese, and Korean. Beyond the foundation model, the Orion-14B family includes fine-tuned variants for conversational use and other specific use cases. The model family is released to the research community, which could serve as a catalyst for future research and applications in the field of LLMs.

Training and Data Preparation

Training LLMs like Orion-14B is a complex endeavor, heavily influenced by the quality and scale of data. The paper details extensive data preparation, including diverse data sourcing, quality filtering, and deduplication. Training follows a data schedule that gradually increases data complexity, loosely mirroring how humans learn. The tokenizer is SentencePiece with byte-pair encoding, achieving 99.99% character coverage across a wide array of languages.
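
To make the tokenizer setup concrete, the following is a minimal sketch of training a SentencePiece BPE model with high character coverage, as described above. The input file, vocabulary size, and byte-fallback flag are illustrative assumptions rather than the authors' exact settings.

```python
# Minimal sketch of training a SentencePiece BPE tokenizer with high character
# coverage. Hyperparameters below are illustrative, not the paper's exact values.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="multilingual_corpus_sample.txt",  # sampled EN/ZH/JA/KO text (hypothetical file)
    model_prefix="orion_tokenizer",          # writes orion_tokenizer.model / .vocab
    model_type="bpe",                        # byte-pair encoding, as stated in the summary
    vocab_size=84_608,                       # illustrative; the paper reports an expanded vocabulary
    character_coverage=0.9999,               # 99.99% character coverage across languages
    byte_fallback=True,                      # assumed: fall back to bytes for rare characters
)

sp = spm.SentencePieceProcessor(model_file="orion_tokenizer.model")
print(sp.encode("Orion-14B は多言語モデルです。", out_type=str))
```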

Model Architecture and Pretraining

Architecturally, Orion-14B builds on the LLaMA 2 framework, with modifications such as an expanded vocabulary and larger feed-forward network dimensions. It employs rotary position embedding (RoPE) for positional encoding, supporting context lengths of up to 4096 tokens. Optimized training infrastructure and a measured training schedule underpin efficient pretraining. Under the increasing-complexity data schedule, validation loss drops in step with shifts in the training data distribution.
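
As a reference point, here is a minimal sketch of rotary position embedding (RoPE) as it is commonly applied to query and key vectors. The pairing scheme, the base of 10000, and the tensor sizes follow the standard formulation and are assumptions, not details taken from the paper.

```python
# A minimal RoPE sketch: rotate query/key features by their position index.
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (seq_len, num_heads, head_dim) -> same shape with rotary positions applied."""
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # Per-pair rotation frequencies and per-position angles.
    inv_freq = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq   # (seq_len, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]             # broadcast over heads
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Example: queries for a 4096-token context; head count and head_dim are illustrative.
q = torch.randn(4096, 40, 128)
q_rot = apply_rope(q)
```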

Fine-tuning Methodologies and Evaluation

After pretraining, Orion-14B underwent successive rounds of fine-tuning on a compiled dataset of high-quality human-annotated pairs together with a large, filtered open-source instruction dataset. The fine-tuning process refined the model's response generation while guarding against overfitting. Evaluation across standard benchmarks such as RACE, HellaSwag, and PIQA shows strong performance. Intermediate checkpoints gain language-understanding ability in early training phases and reasoning and academic-task ability in later stages, consistent with the strategic data scheduling.
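
A common way to implement supervised fine-tuning on such prompt-response pairs is to compute next-token cross-entropy on the response tokens only. The sketch below illustrates this under the assumption of a Hugging Face-style model and tokenizer; it is not the authors' actual training code, and the -100 ignore index is a convention rather than a detail from the paper.

```python
# Hedged sketch of one supervised fine-tuning step: loss on the response only.
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt: str, response: str) -> torch.Tensor:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100           # mask prompt tokens out of the loss
    logits = model(full_ids).logits                   # (1, seq_len, vocab)
    # Shift so position t predicts token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```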

Implications and Extension Works

The extensions of Orion-14B address practical application needs, offering models specialized for extended contexts, reduced inference resource requirements, and specific application scenarios. The paper concludes by reflecting on the challenges encountered during Orion-14B's training and the directions it opens. The open release of Orion-14B aims to facilitate further research and development, marking an important stride in the evolution of open multilingual LLMs.
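
For readers who want to try the released checkpoints, a hedged loading sketch with the Hugging Face transformers library is shown below. The repository id and dtype choice are assumptions based on the project's GitHub organization; consult the linked repository for the actual model names and requirements.

```python
# Hedged sketch: load a released checkpoint for reduced-memory inference.
# The model id below is assumed; see https://github.com/OrionStarAI/Orion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OrionStarAI/Orion-14B-Chat"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half-precision weights to reduce memory use
    device_map="auto",            # spread layers across available GPUs/CPU
    trust_remote_code=True,
)

inputs = tokenizer("Hello, Orion-14B!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```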
