Cedille: A large autoregressive French language model

Published 7 Feb 2022 in cs.CL (arXiv:2202.03371v1)

Abstract: Scaling up the size and training of autoregressive LLMs has enabled novel ways of solving Natural Language Processing tasks using zero-shot and few-shot learning. While extreme-scale LLMs such as GPT-3 offer multilingual capabilities, zero-shot learning for languages other than English remains largely unexplored. Here, we introduce Cedille, a large open-source autoregressive LLM trained specifically for the French language. Our results show that Cedille outperforms existing French LLMs and is competitive with GPT-3 on a range of French zero-shot benchmarks. Furthermore, we provide an in-depth comparison of the toxicity exhibited by these models, showing that Cedille marks an improvement in LLM safety thanks to dataset filtering.
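As a rough illustration of the zero-shot setup the abstract describes, the sketch below builds a French classification prompt for an autoregressive LM. The prompt template, task wording, and label names are illustrative assumptions, not the paper's exact benchmark configuration:

```python
# Minimal sketch of zero-shot classification via prompting an
# autoregressive LM. The template and labels are illustrative; the
# paper's benchmarks may use different formats.

def build_zero_shot_prompt(text, labels, task="Sentiment"):
    """Build a French zero-shot classification prompt.

    The model is expected to continue the prompt with one of the
    labels; no fine-tuning or in-context examples are used.
    """
    options = " ou ".join(labels)
    return f"Texte : {text}\n{task} ({options}) :"

prompt = build_zero_shot_prompt(
    "Ce film était magnifique.", ["positif", "négatif"]
)
print(prompt)

# In practice the prompt would be passed to the model, e.g. via the
# Hugging Face `transformers` library (checkpoint name assumed here):
#   from transformers import pipeline
#   generator = pipeline("text-generation", model="Cedille/fr-boris")
#   generator(prompt, max_new_tokens=2)
```

The completion the model generates after the colon is then mapped back to one of the candidate labels, which is what lets a single pretrained model handle many tasks without task-specific training.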

Citations (15)
