Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs

Published 30 Sep 2024 in cs.CL, cs.AI, and cs.LG (arXiv:2410.03730v2)

Abstract: We present two multilingual LLMs designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and utilizing a custom multilingual tokenizer, our models address the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. We detail the models' development principles, i.e., data composition, tokenizer optimization, and training methodologies. The models demonstrate competitive performance across multilingual benchmarks, as evidenced by their performance on European versions of ARC, HellaSwag, MMLU, and TruthfulQA.

Authors (39)
