Spanish Pre-trained BERT Model and Evaluation Data

Published 6 Aug 2023 in cs.CL, cs.AI, and cs.LG | (2308.02976v1)

Abstract: The Spanish language is one of the top 5 spoken languages in the world. Nevertheless, finding resources to train or evaluate Spanish LLMs is not an easy task. In this paper we help bridge this gap by presenting a BERT-based LLM pre-trained exclusively on Spanish data. As a second contribution, we also compiled several tasks specifically for the Spanish language in a single repository much in the spirit of the GLUE benchmark. By fine-tuning our pre-trained Spanish model, we obtain better results compared to other BERT-based models pre-trained on multilingual corpora for most of the tasks, even achieving a new state-of-the-art on some of them. We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.

Citations (620)

Summary

  • The paper presents a monolingual Spanish BERT model with 110 million parameters, leveraging Dynamic and Whole-Word Masking techniques.
  • The paper introduces the GLUES benchmark, a suite of Spanish-specific NLP tasks designed to standardize model evaluation.
  • The paper demonstrates that the Spanish BERT model outperforms mBERT in part-of-speech tagging and document classification, marking a significant advance in Spanish NLP.

Overview of the Spanish Pre-Trained BERT Model and Evaluation Data

The paper presents the development and evaluation of a BERT-based LLM specifically pre-trained on Spanish data. This initiative addresses the challenge of finding resources for the Spanish language in NLP. By introducing a monolingual Spanish BERT model and a suite of evaluation tasks termed GLUES, the authors provide a comprehensive framework to enhance Spanish NLP capabilities.

Model Description

The Spanish BERT model follows the BERT-Base architecture: 12 transformer layers with 12 self-attention heads each, for a total of roughly 110 million parameters. Two variants were trained, one cased and one uncased, on data drawn from diverse collections including the Spanish Wikipedia and the OPUS Project. The resulting corpus of approximately 3 billion words was pre-trained with Dynamic Masking and Whole-Word Masking, strategies popularized by RoBERTa that aim to make pre-training more effective.
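The two masking strategies are simple to state: Whole-Word Masking masks every WordPiece of a word together rather than individual subwords, and Dynamic Masking re-samples the masked positions on each pass over the data instead of fixing them once at preprocessing time. A minimal sketch (the token examples and function name are illustrative, not from the paper):

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Mask whole words in a WordPiece sequence.

    A token starting with '##' continues the previous word, so it is
    grouped with that word before sampling; masking is then decided
    per word, not per subword. Varying `seed` per epoch gives the
    'dynamic' behaviour (a fresh mask each pass over the data).
    """
    rng = random.Random(seed)
    # Group subword token indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked

# "idioma" is split into "idio" + "##ma"; both pieces are masked together.
print(whole_word_mask(["el", "idio", "##ma", "espa", "##ñol"], mask_prob=1.0))
```

The invariant worth noting is that a `##` continuation piece is never masked without its word-initial piece, which forces the model to predict complete words from context rather than trivially completing a visible subword.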

GLUES Benchmark

The GLUES benchmark is a compilation of Spanish-specific NLP tasks inspired by the English GLUE benchmark, intended to standardize evaluation across various tasks. Notable tasks included are:

  • Natural Language Inference (XNLI): A cross-lingual task assessing entailment, contradiction, or neutrality between sentence pairs.
  • Paraphrasing (PAWS-X): Evaluates semantic equivalence between sentence pairs.
  • Named Entity Recognition (CoNLL): Focuses on identifying entity names in text.
  • Part-of-Speech Tagging (Universal Dependencies v1.4): Categorizes words into their grammatical types.
  • Document Classification (MLDoc): Involves classifying documents into predefined categories.
  • Dependency Parsing (Universal Dependencies v2.2): Constructs dependency trees representing grammatical structure.
  • Question Answering (MLQA, TAR, XQuAD): Involves identifying answer spans from given context and questions.

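The question-answering tasks above are extractive: the model marks an answer span in the context, and predictions are scored against gold answers with exact match and token-level F1 after light text normalization. A minimal sketch of that scoring, in the style of the standard SQuAD evaluation (the Spanish article list in `normalize` is an illustrative assumption, not taken from the paper):

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and (here, Spanish) articles,
    and collapse whitespace before comparing answer strings."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(el|la|los|las|un|una|unos|unas)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """1/0 score: normalized strings must be identical."""
    return normalize(pred) == normalize(gold)

def f1_score(pred, gold):
    """Token-overlap F1 between predicted and gold answer spans."""
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(f1_score("la ciudad de Madrid", "ciudad de Madrid"))  # articles ignored
```

F1 gives partial credit when the predicted span overlaps the gold span without matching it exactly, which is why both metrics are usually reported side by side.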
Evaluation and Results

Fine-tuning was performed for each task following the standard approach from previous BERT implementations. The results demonstrated that the Spanish BERT models largely surpass multilingual BERT (mBERT) across most tasks. One exception arises in certain question-answering settings, where multilingual models with more expansive training data retain an edge. The Spanish BERT model established new state-of-the-art results on part-of-speech tagging and document classification.

Implications and Future Directions

The release of a dedicated Spanish BERT model signifies substantial progress in language-specific NLP development, offering a valuable resource for enhancing Spanish computational linguistics. The paper speculates on further advancements by suggesting the exploration of more efficient models, such as ALBERT, which could provide reduced computational demands while maintaining performance. The implications span academic research and practical applications, encouraging a broader deployment of AI in Spanish-centric environments.

In conclusion, the paper offers significant contributions to the Spanish NLP field, fostering further exploration and application of pre-trained models in diverse contexts. The benchmarking framework established here can guide future efforts to maintain and improve upon the growing body of work surrounding monolingual NLP models.
