TigerBot: An Open Multilingual Multitask LLM
Abstract: We release and introduce the TigerBot family of large language models (LLMs), consisting of base and chat models at 7, 13, 70, and 180 billion parameters. We develop our models starting from Llama-2 and BLOOM, and push the boundary further in data, training algorithms, infrastructure, and application tools. Our models yield meaningful performance gains over state-of-the-art (SOTA) open-source models such as Llama-2: specifically, a 6% gain in English and a 20% gain in Chinese. The TigerBot model family also achieves leading performance on major academic and industrial benchmarks and leaderboards. We believe that TigerBot represents just a snapshot of the lightning-fast progress in the LLM open-source community. We are therefore thrilled to give back by publicly releasing our models and reporting the approach behind them, with particular emphasis on building SOTA LLMs in a democratized way and putting LLMs to use in real-world applications.
- J. Ainslie et al. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv:2305.13245 [cs.CL], 05 2023.
- Anthropic. Claude 2. https://www.anthropic.com/index/claude-2, 06 2023.
- J. Berant et al. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), 2013.
- T. Brown et al. Language models are few-shot learners. arXiv:2005.14165v4 [cs.CL], 05 2020.
- H. Chen et al. Walking down the memory maze: Beyond context limit through interactive reading. arXiv:2310.05029 [cs.CL], 10 2023.
- P. Clark et al. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457 [cs.AI], 03 2018.
- OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models. GitHub repository, 2023.
- Y. Cui et al. Efficient and effective text encoding for Chinese LLaMA and Alpaca. arXiv:2304.08177 [cs.CL], 04 2023.
- T. Dao et al. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. arXiv:2205.14135 [cs.LG], 05 2022.
- Google. SentencePiece. GitHub repository, 2023.
- E. J. Hu et al. LoRA: Low-rank adaptation of large language models. arXiv:2106.09685 [cs.CL], 06 2021.
- Hugging Face. Text Generation Inference. GitHub repository, 2023.
- Hugging Face. Transformers. GitHub repository, 2023.
- V. Karpukhin et al. Dense passage retrieval for open-domain question answering. In EMNLP 2020, 04 2020.
- C. Li et al. ChatHaruhi: Reviving anime character in reality via large language model. arXiv:2308.09597 [cs.CL], 2023.
- Microsoft. Megatron-DeepSpeed. GitHub repository, 2023.
- D. Narayanan et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. arXiv:2104.04473 [cs.CL], 04 2021.
- NVIDIA. TensorRT open source software. GitHub repository, 2023.
- L. Ouyang et al. Training language models to follow instructions with human feedback. arXiv:2203.02155v1 [cs.CL], 03 2022.
- O. Peckham. Meta completes research supercluster, announces next-gen datacenter. HPCwire: https://www.hpcwire.com/2023/05/18/meta-completes-research-supercluster-announces-next-gen-datacenter/, 05 2023.
- B. Peng et al. YaRN: Efficient context window extension of large language models. arXiv:2309.00071 [cs.CL], 09 2023.
- S. Pichai. An important next step on our ai journey. https://blog.google/technology/ai/bard-google-ai-search-updates/, 02 2023.
- O. Press et al. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv:2108.12409 [cs.CL], 08 2021.
- R. Rafailov et al. Direct preference optimization: Your language model is secretly a reward model. arXiv:2305.18290 [cs.LG], 05 2023.
- S. Rajbhandari et al. ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20); arXiv:1910.02054 [cs.LG], 10 2019.
- P. Rajpurkar et al. Know what you don't know: Unanswerable questions for SQuAD. arXiv:1806.03822 [cs.CL], 06 2018.
- T. Le Scao et al. BLOOM: A 176B-parameter open-access multilingual language model. arXiv:2211.05100 [cs.CL], 11 2022.
- N. Shazeer. GLU variants improve transformer. arXiv:2002.05202 [cs.LG], 02 2020.
- J. Su et al. RoFormer: Enhanced transformer with rotary position embedding. arXiv:2104.09864 [cs.CL], 04 2021.
- A. Talmor et al. CommonsenseQA: A question answering challenge targeting commonsense knowledge. arXiv:1811.00937 [cs.CL], 11 2018.
- R. Taori et al. Stanford Alpaca: An instruction-following LLaMA model. GitHub repository, 2023.
- H. Touvron et al. LLaMA: Open and efficient foundation language models. arXiv:2302.13971 [cs.CL], 02 2023.
- H. Touvron et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288 [cs.CL], 07 2023.
- turboderp. ExLlamaV2. GitHub repository, 2023.
- A. Vaswani et al. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 06 2017.
- J. Wu et al. Recursively summarizing books with human feedback. arXiv:2109.10862 [cs.CL], 09 2021.
- W. Xiong et al. Effective long-context scaling of foundation models. arXiv:2309.16039 [cs.CL], 09 2023.
- Z. Yao et al. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. arXiv:2206.01861 [cs.CL], 06 2022.