Orion-14B: Open-source Multilingual Large Language Models

Published 20 Jan 2024 in cs.CL and cs.LG | (2401.12246v1)

Abstract: In this study, we introduce Orion-14B, a collection of multilingual LLMs with 14 billion parameters. We utilize a data scheduling approach to train a foundational model on a diverse corpus of 2.5 trillion tokens, sourced from texts in English, Chinese, Japanese, Korean, and other languages. Additionally, we fine-tuned a series of models tailored for conversational applications and other specific use cases. Our evaluation results demonstrate that Orion-14B achieves state-of-the-art performance across a broad spectrum of tasks. We make the Orion-14B model family and its associated code publicly accessible https://github.com/OrionStarAI/Orion, aiming to inspire future research and practical applications in the field.


Summary

  • The paper introduces Orion-14B, a 14-billion-parameter multilingual LLM pretrained on 2.5 trillion tokens, advancing open-source multilingual research.
  • It details the training pipeline, covering data filtering, deduplication, and a data schedule of stepped complexity that reduces validation loss.
  • It describes fine-tuned variants for conversational and other specialized tasks, reporting strong performance on benchmarks such as RACE and HellaSwag.

Introduction

Orion-14B is a notable addition to the landscape of multilingual LLMs, with 14 billion parameters. The model was trained on 2.5 trillion tokens drawn from texts in diverse languages, primarily English, Chinese, Japanese, and Korean. Beyond the foundation model, the Orion-14B family includes fine-tuned variants for conversational use and other specific use cases. The model family is released to the research community, which could serve as a catalyst for future research and applications in the field of LLMs.

Training and Data Preparation

Training LLMs like Orion-14B is a complex endeavor, heavily influenced by the quality and scale of data. The paper details extensive data preparation, including diverse data sourcing, quality filtering, and deduplication. Training follows a data schedule that gradually increases data complexity, loosely mirroring how humans learn. The tokenizer is SentencePiece with byte-pair encoding, achieving 99.99% character coverage across a wide array of languages.
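
To make the tokenizer setup concrete, the following is a minimal sketch of training a SentencePiece BPE model with high character coverage, as described above. The input file, vocabulary size, and byte-fallback flag are illustrative assumptions rather than the authors' exact settings.

```python
# Minimal sketch of training a SentencePiece BPE tokenizer with high character
# coverage. Hyperparameters below are illustrative, not the paper's exact values.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="multilingual_corpus_sample.txt",  # sampled EN/ZH/JA/KO text (hypothetical file)
    model_prefix="orion_tokenizer",          # writes orion_tokenizer.model / .vocab
    model_type="bpe",                        # byte-pair encoding, as stated in the summary
    vocab_size=84_608,                       # illustrative; the paper reports an expanded vocabulary
    character_coverage=0.9999,               # 99.99% character coverage across languages
    byte_fallback=True,                      # assumed: fall back to bytes for rare characters
)

sp = spm.SentencePieceProcessor(model_file="orion_tokenizer.model")
print(sp.encode("Orion-14B は多言語モデルです。", out_type=str))
```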

Model Architecture and Pretraining

Architecturally, Orion-14B builds on the LLaMA 2 framework, with modifications such as an expanded vocabulary and larger feed-forward network dimensions. It employs rotary position embedding (RoPE) for positional encoding, supporting context lengths of up to 4096 tokens. Optimized training infrastructure and a measured training schedule underpin efficient pretraining. Under the increasing-complexity data schedule, validation loss drops in step with shifts in the training data distribution.
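
As a reference point, here is a minimal sketch of rotary position embedding (RoPE) as it is commonly applied to query and key vectors. The pairing scheme, the base of 10000, and the tensor sizes follow the standard formulation and are assumptions, not details taken from the paper.

```python
# A minimal RoPE sketch: rotate query/key features by their position index.
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (seq_len, num_heads, head_dim) -> same shape with rotary positions applied."""
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # Per-pair rotation frequencies and per-position angles.
    inv_freq = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq   # (seq_len, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]             # broadcast over heads
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Example: queries for a 4096-token context; head count and head_dim are illustrative.
q = torch.randn(4096, 40, 128)
q_rot = apply_rope(q)
```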

Fine-tuning Methodologies and Evaluation

After pretraining, Orion-14B underwent successive rounds of fine-tuning on a compiled dataset of high-quality human-annotated pairs together with a large, filtered open-source instruction dataset. The fine-tuning process refined the model's response generation while guarding against overfitting. Evaluation across standard benchmarks such as RACE, HellaSwag, and PIQA shows strong performance. Intermediate checkpoints gain language-understanding ability in early training phases and reasoning and academic-task ability in later stages, consistent with the strategic data scheduling.
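
A common way to implement supervised fine-tuning on such prompt-response pairs is to compute next-token cross-entropy on the response tokens only. The sketch below illustrates this under the assumption of a Hugging Face-style model and tokenizer; it is not the authors' actual training code, and the -100 ignore index is a convention rather than a detail from the paper.

```python
# Hedged sketch of one supervised fine-tuning step: loss on the response only.
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt: str, response: str) -> torch.Tensor:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100           # mask prompt tokens out of the loss
    logits = model(full_ids).logits                   # (1, seq_len, vocab)
    # Shift so position t predicts token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```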

Implications and Extension Works

The extensions of Orion-14B address practical application needs, offering models specialized for extended contexts, reduced inference resource requirements, and specific application scenarios. The paper concludes by reflecting on the challenges encountered during Orion-14B's training and the directions it opens. The open release of Orion-14B aims to facilitate further research and development, marking an important stride in the evolution of open multilingual LLMs.
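
For readers who want to try the released checkpoints, a hedged loading sketch with the Hugging Face transformers library is shown below. The repository id and dtype choice are assumptions based on the project's GitHub organization; consult the linked repository for the actual model names and requirements.

```python
# Hedged sketch: load a released checkpoint for reduced-memory inference.
# The model id below is assumed; see https://github.com/OrionStarAI/Orion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OrionStarAI/Orion-14B-Chat"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half-precision weights to reduce memory use
    device_map="auto",            # spread layers across available GPUs/CPU
    trust_remote_code=True,
)

inputs = tokenizer("Hello, Orion-14B!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```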
