
Spanish–Guaraní Dataset for Code-Switching

Updated 10 December 2025
  • The Spanish–Guaraní Dataset is a multilingual corpus featuring code-switched texts from news, tweets, and conversations in Paraguay and neighboring regions.
  • It integrates multi-layered annotations such as token-level language ID, named entity recognition, and sentiment analysis using a blend of manual and LLM-assisted methods.
  • The resource serves as a benchmark for computational sociolinguistics and low-resource NLP, validating diglossic patterns through empirical quantitative analysis.

The Spanish–Guaraní dataset refers to a set of corpora and annotation resources designed to support the study of code-switching, sociolinguistic variation, and language technology for Spanish and Guaraní (Tupí-Guaraní family) in Paraguay and neighboring regions. These resources include curated collections of news, tweets, and conversational texts annotated for token-level language ID, named entities, span-level code-switching phenomena, sentiment, topic, and formality. The datasets described below are primary benchmarks for cross-linguistic NLP, diglossia research, and computational sociolinguistics in one of the world’s most prominent active bilingual speech communities (Tyagi et al., 3 Dec 2025, Chiruzzo et al., 2023, Agüero-Torales et al., 2021).

1. Corpus Origins and Data Sources

The Spanish–Guaraní code-switched datasets consolidate diverse text domains and annotation initiatives. Two corpora are most prominent:

Corpus | Source Domains | Scale | Principal Tasks
GUA-SPA (IberLEF 2023) | News articles, Twitter (Paraguay) | 1,500 texts / ~24,849 tokens | Language ID, NER, Spanish span usage
JOSA (Jopara Sentiment) | Guaraní-dominant Twitter (Paraguayan) | 3,491 tweets | Sentiment, code-switch classification

GUA-SPA: Collected for the IberLEF 2023 shared task (Chiruzzo et al., 2023), this corpus consists of news-site articles and tweets, selected for the occurrence of code-switching behavior. The dataset underwent tokenization preserving orthographic features critical for distinguishing Spanish and Guaraní morphosyntax.

JOSA: As reported by Agüero-Torales et al. (2021), JOSA was compiled from seven Guaraní-focused Twitter accounts using an “account-based” crawl, after a “keyword-based” approach yielded insufficient recall. The tweets were deduplicated and manually labeled, and the corpus focuses on utterances labeled as “Guaraní” or “Jopara” (mixed Spanish–Guaraní).

Downstream Enrichment: The GUA-SPA core has been extensively re-annotated for sociolinguistic analysis, including formality, genre, and topic assignments via LLM-assisted pipelines (Tyagi et al., 3 Dec 2025).

2. Annotation Methodologies

The Spanish–Guaraní dataset suite is notable for multi-layered annotations, combining rigorous manual protocols with LLM-based workflows.

GUA-SPA (IberLEF 2023):

  • Task 1: Token-Level Language ID
    • Labels: gn (Guaraní), es (Spanish), mix (hybrids), ne (named-entities), foreign, other (punctuation/emojis/URLs)
  • Task 2: Named-Entity Recognition (NER)
    • BIO tagging for Person, Organization, Location
  • Task 3: Spanish-Span Usage
    • BIO tags: es-cc (code-change), es-ul (unadapted loan)
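
The BIO scheme used for Tasks 2 and 3 can be decoded into labeled spans with a single pass over the tokens. The following is a minimal sketch, not the shared task's own tooling: the tag spelling (`B-es-cc`, `I-es-cc`, `O`) and the placeholder tokens `t1`…`t5` are assumptions made for illustration, and the released files may encode the tags differently.

```python
from typing import List, Tuple


def bio_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (label, text) spans from parallel token and BIO-tag lists.

    Assumes tags look like "B-<label>", "I-<label>" or "O" (hypothetical spelling).
    """
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []

    for token, tag in zip(tokens, tags):
        # "B-es-cc" -> prefix "B", label "es-cc"; "O" -> prefix "O", label ""
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            # A new span starts; close any span that is still open.
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = label, [token]
        elif prefix == "I" and current_label == label:
            # Continuation of the open span.
            current_tokens.append(token)
        else:
            # "O" or a stray I- tag: close the open span and skip the token.
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []

    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# Placeholder tokens, not corpus data.
tokens = ["t1", "t2", "t3", "t4", "t5"]
tags = ["O", "B-es-cc", "I-es-cc", "O", "B-es-ul"]
example_spans = bio_spans(tokens, tags)
print(example_spans)
```

The same function works for NER tags if they follow the same prefix convention.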

Annotation guidelines were agreed upon after a two-phase pilot process. Annotators were native or highly proficient bilinguals, and decision trees and exemplars were used to keep label application consistent. Inter-annotator agreement for language ID (Fleiss' κ = 0.836) and for NER (pairwise F₁ = 0.926) indicates high reliability.
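
For readers unfamiliar with the agreement statistic quoted above, Fleiss' κ can be computed from a table of per-item category counts. This is a generic sketch of the standard formula, not the organizers' evaluation script; the toy count matrices are made up for illustration and are unrelated to the corpus.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same
    number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Expected agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Toy data: three raters, two categories, full agreement on every item.
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
# Toy data: two raters who agree on half of the items.
chance_level = [[2, 0], [0, 2], [1, 1], [1, 1]]

kappa_perfect = fleiss_kappa(perfect)
kappa_chance = fleiss_kappa(chance_level)
print(kappa_perfect, kappa_chance)
```

Values near 1 indicate agreement well above chance, which is how the reported 0.836 for language ID should be read.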

JOSA:

  • Manual language and sentiment labeling (Guaraní, Jopara, Other; Positive, Neutral, Negative) by two annotators; conflicts resolved per SemEval-2017 Task 4 protocols.
  • Cohen’s κ for agreement is reported as “slight”, reflecting challenges inherent to low-resource, linguistically fluid social media data.

LLM-assisted enrichment (Tyagi et al., 3 Dec 2025):

  • GPT-4.1 (2025-04-14) model via OpenAI API; deterministic decoding.
  • Structured prompts specify input fields (sentence, file, domain), formality, genre (14 types), and ~30 granular topics.
  • One JSON per sentence; automated post-processing. Human spot-checks indicate 94.17% categorical labeling accuracy across all annotation axes.
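
The "one JSON per sentence" post-processing step might look roughly like the sketch below, which merges a model reply with the input fields named above. The function name `postprocess`, the key list `LABEL_KEYS`, and the example reply are assumptions for illustration; the actual prompts and post-processing code are not reproduced in this text, and the API call itself is omitted.

```python
import json

# Label keys expected in the model reply (assumed from the field list of the
# released JSONL files; the real prompt may request a different structure).
LABEL_KEYS = ("formality", "genre", "topic", "secondary_topic")


def postprocess(raw_reply: str, sentence: str, file: str, domain: str) -> dict:
    """Parse one model reply and merge it with the input fields."""
    reply = json.loads(raw_reply)  # raises json.JSONDecodeError on malformed output
    missing = [key for key in LABEL_KEYS if key not in reply]
    if missing:
        raise ValueError(f"model reply is missing keys: {missing}")
    record = {"sentence": sentence, "file": file, "domain": domain}
    record.update({key: reply[key] for key in LABEL_KEYS})
    return record


# Hypothetical reply string, standing in for what the model would return.
raw = '{"formality": "Informal", "genre": "Personal", "topic": "Humor_Rant", "secondary_topic": null}'
merged = postprocess(raw, sentence="<sentence text>", file="<file name>", domain="tweet")
print(merged)
```

Rejecting replies with missing keys at this stage keeps malformed records out of the output file, where they would otherwise surface only during analysis.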

3. Structural Properties and File Formats

The datasets employ standardized schemas for NLP interoperability.

GUA-SPA:

  • Format: CoNLL-style TSV, four columns per token (surface form, Task 1, Task 2, Task 3)
  • Splits: 76% train, 12% dev, 12% test (by document)
  • Availability: GitHub repository https://github.com/pln-fing-udelar/gua-spa-2023; data in .tsv, annotation manual included.

LLM-enriched subset (spa-gua):

  • Format: JSONL, one object per code-switched sentence
  • Fields: sent_id, filename, situation (“Formal”/“Informal”), lang_tag (“spa+gn”), sentence, formality, genre, topic, secondary_topic
  • Availability: https://github.com/N3mika/topicmodelling under “spa-gua/”
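
The four-column CoNLL-style TSV layout listed above can be read with a few lines of standard-library Python. This is a sketch under stated assumptions: blank lines are taken to separate texts, as is common in CoNLL-style files, and the tag values in the inline sample are placeholders rather than lines from the released data.

```python
import io
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    surface: str  # token as written
    lang: str     # Task 1: language ID label
    ner: str      # Task 2: named-entity tag
    span: str     # Task 3: Spanish-span tag


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token rows per blank-line-separated block."""
    block: List[Token] = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            # Blank line: close the current block (assumed text separator).
            if block:
                yield block
                block = []
            continue
        columns = line.split("\t")
        if len(columns) != 4:
            raise ValueError(f"line {line_no}: expected 4 columns, got {len(columns)}")
        block.append(Token(*columns))
    if block:
        yield block


# Placeholder rows; tag spellings are hypothetical.
sample = "t1\tgn\tO\tO\nt2\tes\tO\tB-es-cc\n\nt3\tne\tB-per\tO\n"
blocks = list(read_conll(io.StringIO(sample)))
print(len(blocks), blocks[0][1].lang)
```

For a real file, an open file handle can be passed in place of the `io.StringIO` object.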

Sample JSONL Entry:

{
  "sent_id": 2024,
  "filename": "tweet_3487592",
  "situation": "Informal",
  "lang_tag": "spa+gn",
  "sentence": "Che rehegua mba’eove kuaa oñemohenda YouTube-pe.",
  "formality": "Informal",
  "genre": "Personal",
  "topic": "UserMention_Request_Response",
  "secondary_topic": null
}
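
Records in this shape can be loaded and tallied with the standard `json` module. The sketch below assumes one JSON object per physical line, as the JSONL format implies; the file is simulated with an in-memory string built from the sample entry above, and `load_jsonl` is a hypothetical helper name.

```python
import io
import json
from collections import Counter
from typing import Iterable, List


def load_jsonl(lines: Iterable[str]) -> List[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


# The sample entry above, serialized onto a single line to mimic the file.
sample_entry = {
    "sent_id": 2024,
    "filename": "tweet_3487592",
    "situation": "Informal",
    "lang_tag": "spa+gn",
    "sentence": "Che rehegua mba’eove kuaa oñemohenda YouTube-pe.",
    "formality": "Informal",
    "genre": "Personal",
    "topic": "UserMention_Request_Response",
    "secondary_topic": None,
}
fake_file = io.StringIO(json.dumps(sample_entry, ensure_ascii=False) + "\n")

records = load_jsonl(fake_file)
situation_counts = Counter(record["situation"] for record in records)
print(situation_counts)
```

Note that JSON `null` in `secondary_topic` becomes Python `None` after loading.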

4. Quantitative Distributions and Diglossic Effects

Comprehensive counts reveal sociolinguistic and functional patterns characteristic of Paraguayan Spanish–Guaraní code-switching.

GUA-SPA token-level distribution:

  • gn: 10,132 (40.8%)
  • es: 6,685 (26.9%)
  • mix + ne + foreign + other: 8,032 (32.3%)
  • Approximately one text in three contains at least one Spanish span.

LLM-annotated code-switched subset:

  • 866 code-switched sentences (15,600 tokens, ≈18 tokens/sentence)
    • Spanish tokens: 42.6%
    • Guaraní tokens: 38.7%
    • Other: 16.6%
    • Ambiguous/mixed: 2.1%
  • Situation: Formal (56%), Informal (44%)
  • Dominant-language split: Guaraní (53%), Spanish (47%)
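
The percentages above follow directly from the raw counts. As a quick arithmetic check (variable names are illustrative only):

```python
# Token counts reported above for the GUA-SPA language-ID layer.
gua_spa_counts = {"gn": 10_132, "es": 6_685, "mix+ne+foreign+other": 8_032}

total_tokens = sum(gua_spa_counts.values())
shares = {
    label: round(100 * count / total_tokens, 1)
    for label, count in gua_spa_counts.items()
}

# Mean sentence length in the 866-sentence code-switched subset.
tokens_per_sentence = 15_600 / 866

print(total_tokens)
print(shares)
print(round(tokens_per_sentence, 1))
```

The three counts sum to the 24,849 tokens given for GUA-SPA in Section 1, and the rounded shares reproduce the listed 40.8%, 26.9%, and 32.3%.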

Formality–Genre Matrix:

Genre | Formal (%) | Informal (%)
News | 65.1 | 0.3
Personal | 0.0 | 72.1
Politics | 12.9 | 0.3
Opinion | 1.2 | 11.7
Culture | 5.7 | 2.9
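
A matrix of this kind can be derived from per-sentence records by counting genres within each formality class. The sketch below assumes the percentages are shares within each formality column (an interpretation, since only five of the 14 genres are listed and the columns do not sum to 100), and it runs on toy records rather than the actual data.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def genre_share_by_formality(records: Iterable[dict]) -> Dict[str, Dict[str, float]]:
    """Return, for each formality value, the percentage of records per genre."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for record in records:
        counts[record["formality"]][record["genre"]] += 1

    shares: Dict[str, Dict[str, float]] = {}
    for formality, genre_counts in counts.items():
        total = sum(genre_counts.values())
        shares[formality] = {
            genre: round(100 * n / total, 1) for genre, n in genre_counts.items()
        }
    return shares


# Toy records for illustration only.
toy_records = [
    {"formality": "Formal", "genre": "News"},
    {"formality": "Formal", "genre": "News"},
    {"formality": "Formal", "genre": "News"},
    {"formality": "Formal", "genre": "Politics"},
    {"formality": "Informal", "genre": "Personal"},
    {"formality": "Informal", "genre": "Personal"},
]
matrix = genre_share_by_formality(toy_records)
print(matrix)
```

Genres absent from a formality class are simply missing from its dictionary and correspond to 0.0 in the table.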

Topic–Formality Matrix: Government_Announcement, PublicAdministration_Changes, and Indigenous_CommunityAid dominate in formal, Guaraní-dominant contexts, whereas UserMention_Request_Response, Humor_Rant, and Personal_Emotional dominate in informal, Spanish-dominant texts.

This distribution quantitatively corroborates the classic model of Paraguayan diglossia: formal and institutional communication is privileged in Guaraní, while informal and social registers tend toward Spanish.

5. Licensing and Access Conditions

Dataset | Access URL | License/Use Restrictions
GUA-SPA | https://github.com/pln-fing-udelar/gua-spa-2023 | CC BY-SA 4.0 (data); MIT (code)
spa-gua/ (LLM) | https://github.com/N3mika/topicmodelling | CC-BY-NC (noncommercial, attribution)
JOSA | https://github.com/mmaguero/josa-corpus | Twitter Developer Policy applies

Researchers must comply with any source-data and community-specific requirements, including anonymization and respectful handling of Indigenous-language content. The LLM-augmented annotations may be used only for non-commercial research, per the CC-BY-NC terms and the rights inherited from IberLEF 2023.

6. Research Impact and Applications

These Spanish–Guaraní datasets underpin a rapidly expanding area in cross-linguistic NLP, computational sociolinguistics, and language documentation for under-resourced, diglossic language ecologies. Use cases include benchmarking code-switching detection methods, NER models in mixed-language contexts, sociolinguistic variation analysis, and sentiment or discourse function modeling in Jopara and other bilingual settings (Tyagi et al., 3 Dec 2025, Chiruzzo et al., 2023, Agüero-Torales et al., 2021).

The LLM-assisted annotation pipeline indicates that topic and functional attributes, previously accessible only through intensive manual analysis, can be extracted at scale with high spot-checked accuracy (94.17%). This supplies corpus-level evidence for classic interactional and diglossic theories. The resource suite also serves as a baseline for transfer learning and low-resource LLM development.

A plausible implication is that the combined manual and LLM-augmented approach illustrated here is a generalizable methodology for linguistic resource creation in other low-resource, code-switched, or diglossic environments.
