Orion-14B: Open-source Multilingual Large Language Models
Abstract: In this study, we introduce Orion-14B, a collection of multilingual LLMs with 14 billion parameters. We utilize a data scheduling approach to train a foundational model on a diverse corpus of 2.5 trillion tokens, sourced from texts in English, Chinese, Japanese, Korean, and other languages. Additionally, we fine-tune a series of models tailored for conversational applications and other specific use cases. Our evaluation results demonstrate that Orion-14B achieves state-of-the-art performance across a broad spectrum of tasks. We make the Orion-14B model family and its associated code publicly accessible at https://github.com/OrionStarAI/Orion, aiming to inspire future research and practical applications in the field.
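To make the "data scheduling" idea concrete, below is a minimal sketch of a stage-wise sampler that changes the per-language mixing weights as pretraining consumes more tokens. The stage boundaries, corpus names, and mixing ratios are illustrative assumptions only; they are not taken from the paper.

```python
import random

# Hypothetical stage-wise data-scheduling sampler (illustrative only).
# Each stage is defined by an upper bound on tokens consumed so far and a
# per-corpus sampling weight; these numbers are assumptions, not the paper's.
SCHEDULE = [
    # (tokens_consumed_upper_bound, {corpus: sampling_weight})
    (1.0e12, {"en": 0.55, "zh": 0.30, "ja": 0.05, "ko": 0.05, "other": 0.05}),
    (2.0e12, {"en": 0.45, "zh": 0.35, "ja": 0.08, "ko": 0.07, "other": 0.05}),
    (2.5e12, {"en": 0.40, "zh": 0.35, "ja": 0.10, "ko": 0.10, "other": 0.05}),
]

def weights_for(tokens_seen: float) -> dict:
    """Return the language mixing weights for the current training stage."""
    for upper_bound, weights in SCHEDULE:
        if tokens_seen < upper_bound:
            return weights
    return SCHEDULE[-1][1]

def sample_corpus(tokens_seen: float, rng: random.Random) -> str:
    """Pick which corpus the next batch is drawn from, given tokens seen so far."""
    weights = weights_for(tokens_seen)
    corpora, probs = zip(*weights.items())
    return rng.choices(corpora, weights=probs, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    for tokens_seen in (5e11, 1.5e12, 2.4e12):
        print(f"{tokens_seen:.1e} tokens seen -> next batch from:",
              sample_corpus(tokens_seen, rng))
```

The design choice illustrated here is simply that the corpus mixture is a function of training progress rather than fixed for the whole run; the paper's actual schedule and ratios are not reproduced in this section.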