Miami Bilingual Corpus Overview

Updated 10 December 2025
  • Miami Bilingual Corpus is a richly annotated dataset capturing spontaneous Spanish–English code-switching with detailed transcriptions, token-level language tags, and demographic metadata.
  • It employs a hybrid annotation framework that integrates automatic tagging, manual in-lab validation, and decision-tree methods to ensure high-quality linguistic analysis.
  • The corpus supports in-depth investigations into sociolinguistic variation and computational annotation, providing actionable insights for bilingual discourse research.

The Miami Bilingual Corpus, commonly referred to as the Bangor Miami Corpus, is one of the most extensively annotated datasets for research on spontaneous Spanish–English code-switching. It serves as a canonical empirical resource for exploring sociolinguistic variation, conversational structure, and the development and evaluation of computational annotation methods in bilingual discourse.

1. Corpus Structure and Origin

The Miami Bilingual Corpus comprises audio-recorded informal conversations collected from 84 Spanish–English bilingual adults residing in Miami, Florida, during 2008–2011. The corpus originated from the BangorMiami2014 project within the BilingBank repository. Conversations were transcribed and pseudonymized, resulting in 56 audio files totaling approximately 35 hours of data. Each token in the transcripts is aligned with start/end times and carries a word-level language identification tag. In total, the corpus contains 242,475 word tokens, of which 63% are English, 34% Spanish, and 3% undetermined (Soto et al., 2017).
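Given token-level language tags, the corpus-wide language distribution is straightforward to recompute. A minimal sketch; the token records, field layout, and tag values ("eng", "spa", "und") are illustrative assumptions, not the corpus's actual file format:

```python
from collections import Counter

# Illustrative (surface form, language tag) pairs mirroring the word-level
# language identification described above; not the corpus's actual schema.
tokens = [
    ("el", "spa"), ("teacher", "eng"), ("me", "spa"), ("dijo", "spa"),
    ("que", "spa"), ("Juanito", "und"), ("is", "eng"), ("very", "eng"),
    ("good", "eng"), ("at", "eng"), ("math", "eng"),
]

def language_distribution(tokens):
    """Return each language tag's share of the token count, as percentages."""
    counts = Counter(lang for _, lang in tokens)
    total = sum(counts.values())
    return {lang: round(100 * n / total, 1) for lang, n in counts.items()}

dist = language_distribution(tokens)
```

Run over the full corpus, the same computation yields the 63/34/3 English/Spanish/undetermined split reported above.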

The corpus features both intra-sentential and inter-sentential code-switching. Intra-sentential examples include utterances such as "El teacher me dijo que Juanito is very good at math," while inter-sentential switches can be seen in sentences like "Sabes porque I plan to move in August but I need to find a really good job." The data demonstrate morphological, syntactic, and lexical mixing at clause and phrase boundaries.

2. Demographic and Metadata Schema

Each utterance in the Miami subset is coupled with detailed demographic metadata to support sociolinguistic analysis. The main fields are:

Field       Description
sent_id     Unique integer identifier for each sentence
filename    Source transcript name
speaker     Anonymized speaker ID (1–84)
age         Speaker's age, spanning approximately 18–80 years
gender      Categorical variable: {M, F}
situation   Conversational context (e.g., "home chat," "class project")
lang_tag    Dominant language at sentence level (majority token)

This rich metadata schema facilitates quantitative studies of correlations between code-switching behavior and demographic factors such as age, gender, and interactional context (Tyagi et al., 3 Dec 2025).
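The schema above maps directly onto a record type. A sketch; the field names follow the table, but the class itself and the sample values are illustrative, not part of the corpus distribution:

```python
from dataclasses import dataclass

@dataclass
class MiamiSentence:
    """One sentence record carrying the metadata fields listed above.
    Field names follow the schema; the class and sample values are illustrative."""
    sent_id: int     # unique integer identifier
    filename: str    # source transcript name
    speaker: int     # anonymized speaker ID, 1-84
    age: int         # approximately 18-80
    gender: str      # "M" or "F"
    situation: str   # e.g. "home chat", "class project"
    lang_tag: str    # dominant language at sentence level (majority token)

s = MiamiSentence(sent_id=1, filename="transcript01", speaker=12, age=34,
                  gender="F", situation="home chat", lang_tag="eng")
```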

3. Annotation Frameworks and Pipelines

Part-of-Speech Annotation

A key computational resource derived from the Miami Bilingual Corpus is its universal part-of-speech (POS) annotation. The annotation pipeline uses a hybrid crowdsourcing approach that balances quality and scalability:

  • Automatic Tagging: 56.6% of tokens are assigned tags based on curated unique token lists.
  • Manual In-Lab Tagging: 1.5% are manually annotated by linguists.
  • Token-Specific Questions (TSQ): 20.7% are resolved through targeted multiple-choice questions.
  • Decision Tree Tasks: 21.3% are annotated using language-specific question trees (15.3% English, 6.0% Spanish).

The process, formally described in pseudocode, combines automatic assignment, targeted manual intervention, and cascaded decision-tree annotation routed through unambiguous questions for non-expert annotators. Final tags are determined via simple majority vote, integrating existing Bangor tags. The annotation uses the 17-tag Universal Dependencies set (e.g., ADJ, NOUN, VERB, PROPN, INTJ, etc.) (Soto et al., 2017).
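The final vote-resolution step can be sketched as follows. This is an illustrative reading of the description above, not the paper's actual pseudocode; in particular, folding the Bangor tag in as one extra vote and escalating ties to an expert are assumptions:

```python
from collections import Counter

def majority_tag(votes, bangor_tag=None):
    """Resolve crowd POS votes by simple majority. The pre-existing Bangor
    tag, if available, is folded in as one additional vote (an assumption
    about how 'integrating existing Bangor tags' works). Ties are returned
    as None, standing in for expert arbitration."""
    counts = Counter(votes)
    if bangor_tag is not None:
        counts[bangor_tag] += 1
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:  # tie -> escalate to an expert
        return None
    return top
```

For example, two crowd votes split between NOUN and VERB would be a tie on their own, but a pre-existing Bangor VERB tag breaks it in favor of VERB.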

Sociolinguistic and Discourse Annotation

Recent advances apply LLMs, specifically GPT-4.1, to automatically classify sentences by topic and discourse-pragmatic function. The pipeline includes:

  1. Prompt Engineering: System and base prompts define required input/output formats, taxonomies, and provide few-shot examples.
  2. Model Inference: GPT-4.1 (temperature=0, max_tokens=200) predicts topic, primary, and secondary discourse functions for batches of 50–100 sentences.
  3. Output Normalization: JSON responses are parsed, and label strings are canonicalized.
  4. Human-in-the-Loop Validation: Expert bilingual linguists validate randomly sampled outputs, achieving annotation accuracy of 100% for primary topic and function, but 60% for secondary function (Tyagi et al., 3 Dec 2025).
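Step 3, output normalization, amounts to parsing the model's JSON response and canonicalizing label strings. A sketch; the response format, key names, and the loose-spelling map are assumptions, not the paper's actual implementation:

```python
import json

# Map from lowercased loose spellings to canonical taxonomy labels.
# The canonical labels come from the taxonomies below; the variants are illustrative.
CANONICAL = {
    "casual everyday talk": "Casual_EverydayTalk",
    "casual_everydaytalk": "Casual_EverydayTalk",
    "precision lexical gap": "PrecisionLexicalGap",
    "precisionlexicalgap": "PrecisionLexicalGap",
}

def normalize(raw_json: str) -> dict:
    """Parse one model response and canonicalize its label strings;
    labels with no canonical match pass through unchanged."""
    record = json.loads(raw_json)
    out = {}
    for key, value in record.items():
        norm = value.strip().lower().replace("-", " ")
        out[key] = CANONICAL.get(norm, CANONICAL.get(norm.replace(" ", "_"), value))
    return out

resp = '{"topic": "casual everyday talk", "primary_function": "Precision Lexical Gap"}'
labels = normalize(resp)
```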

4. Label Taxonomies

Topic and function categorization for code-switched utterances within the Miami subset relies on manually developed taxonomies.

Topic Categories

  • Workplace_Technical: Domain-specific terminology (e.g., engineering jargon)
  • Education_YouthOrganizations: References to school, youth groups, certificates
  • Architecture_Design: Architectural materials, styles
  • Office_Logistics: Scheduling tasks, file management
  • Narratives_Quotations: Recounting past events, reported speech
  • Casual_EverydayTalk: Greetings, small talk, jokes
  • Affect_Identity: Swearing, in-group identity markers
  • ProperNouns_NamedEntities: Utterances dominated by names of people, places, awards

Discourse-Pragmatic Functions

Primary and optionally secondary labels are assigned among fifteen functions, including TechnicalTermInsertion, ProperNounNamedEntity, PrecisionLexicalGap (using the other language for more precise expression), DiscourseMarker, TopicShift, Narrative, Quotation, TurnManagement, AddresseeShift, Directive, Repair, Agreement, StanceEmphasis, Humor, and SolidarityIdentity (Tyagi et al., 3 Dec 2025).
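Both taxonomies are small closed sets, so validating assigned label triples is straightforward. The set contents follow the lists above; the validator function itself is illustrative:

```python
# Topic and function inventories, transcribed from the taxonomies above.
TOPICS = {
    "Workplace_Technical", "Education_YouthOrganizations", "Architecture_Design",
    "Office_Logistics", "Narratives_Quotations", "Casual_EverydayTalk",
    "Affect_Identity", "ProperNouns_NamedEntities",
}

FUNCTIONS = {
    "TechnicalTermInsertion", "ProperNounNamedEntity", "PrecisionLexicalGap",
    "DiscourseMarker", "TopicShift", "Narrative", "Quotation", "TurnManagement",
    "AddresseeShift", "Directive", "Repair", "Agreement", "StanceEmphasis",
    "Humor", "SolidarityIdentity",
}

def validate(topic, primary, secondary=None):
    """Check a (topic, primary, optional secondary) label triple against the taxonomies."""
    ok = topic in TOPICS and primary in FUNCTIONS
    if secondary is not None:
        ok = ok and secondary in FUNCTIONS
    return ok
```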

5. Quantitative Distributions and Empirical Patterns

Corpus-scale analysis enables the quantification of topic and function prevalence, especially as a function of speaker gender and sentence-level language dominance. Key findings for 2,825 intra-sentential code-switched sentences (29,700 tokens) are:

Topic Distribution by Gender

Topic                          Men (%)   Women (%)   Total n
Casual_EverydayTalk              59.8       60.1       1694
Narratives_Quotations            20.5       18.5        536
Workplace_Technical               4.8        4.8        135
Office_Logistics                  1.6        5.7        130
ProperNouns_NamedEntities         7.0        3.7        130
Education_YouthOrganizations      2.5        3.4         90
Affect_Identity                   2.8        3.3         89
Architecture_Design               1.1        0.5         18

Function Distribution by Gender

Function                 Men (%)   Women (%)   Total n
PrecisionLexicalGap        24.3       28.1        765
Narrative                  19.6       19.8        556
DiscourseMarker            12.0       12.4        348
TechnicalTermInsertion     10.6       10.5        296
StanceEmphasis              6.9        6.1        178
ProperNounNamedEntity       8.5        4.5        156
Directive                   4.0        6.0        153
SolidarityIdentity          2.4        3.5         90
Repair                      3.4        2.3         73
Quotation                   3.3        2.2         70

Patterns include the predominance of Casual_EverydayTalk (≈60%), higher Office_Logistics and Education_YouthOrganizations in women, and a greater use of ProperNouns_NamedEntities by men. PrecisionLexicalGap and Narrative functions are most prevalent across both genders, with women showing marginally higher use of Directive and SolidarityIdentity functions. Spanish-dominant sentences are concentrated in casual and affective topics, while English-dominant sentences appear more frequently in workplace and lexical-gap contexts (Tyagi et al., 3 Dec 2025).
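The headline figure above can be verified directly from the Total n column of the topic table; computing each topic's overall share reproduces the ≈60% predominance of Casual_EverydayTalk:

```python
# Total-n column from the topic distribution table above.
topic_counts = {
    "Casual_EverydayTalk": 1694,
    "Narratives_Quotations": 536,
    "Workplace_Technical": 135,
    "Office_Logistics": 130,
    "ProperNouns_NamedEntities": 130,
    "Education_YouthOrganizations": 90,
    "Affect_Identity": 89,
    "Architecture_Design": 18,
}

total = sum(topic_counts.values())
shares = {t: round(100 * n / total, 1) for t, n in topic_counts.items()}
```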

6. Annotation Reliability and Evaluation Metrics

Part-of-speech annotation achieves overall agreement of 0.95–0.96 with expert gold standards, with average recall for tag types ranging from 0.87 to 0.99. For sociolinguistic annotation using LLMs, primary topic and function prediction reached 100% accuracy under random human validation, but secondary function reliability was notably lower at 60%. No formal significance testing was employed in gender–topic or function effect analysis; results are descriptive proportions (Tyagi et al., 3 Dec 2025, Soto et al., 2017).

Vote split statistics indicate 60–70% of tokens received unanimous crowd agreement in POS tasks, with a further 19–23% resolved against pre-existing Bangor tags and occasional ties arbitrated by experts. Per-tag recall is highest for verbs (VERB) and adjectives (ADJ), with context-specific failures: for example, workers using the English Question Tree frequently misassigned adverbials like "home" as NOUN.
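Per-tag recall against an expert gold standard can be computed as below. The token sequences are illustrative, built to mirror the "home"-as-NOUN failure mode noted above:

```python
def per_tag_recall(gold, predicted):
    """Recall per POS tag: of the tokens gold-labeled with tag t,
    the fraction that the pipeline also labeled t."""
    hits, totals = {}, {}
    for g, p in zip(gold, predicted):
        totals[g] = totals.get(g, 0) + 1
        hits[g] = hits.get(g, 0) + (g == p)
    return {t: hits[t] / totals[t] for t in totals}

# Illustrative: an adverbial "home" is gold ADV but predicted NOUN,
# the English Question Tree failure mode described above.
gold      = ["VERB", "ADV",  "NOUN", "VERB", "ADJ"]
predicted = ["VERB", "NOUN", "NOUN", "VERB", "ADJ"]
recall = per_tag_recall(gold, predicted)
```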

7. Limitations and Prospective Directions

Major limitations include the subjectivity of label taxonomies, especially for discourse functions, and reduced reliability for secondary function assignments using LLM-assisted annotation. Future directions proposed include:

  • Adopting dynamic topic generation to capture emergent or non-predefined themes
  • Integrating syntactic parsing to model switch-point constraints in code-switched utterances
  • Incorporating psycholinguistic features (e.g., cognate density) to relate lexical overlap to discourse functions
  • Extending demographic variables to include education level and socioeconomic status
  • Applying inferential statistics (such as chi-squared) for demographic effect assessment
  • Refining prompt engineering and expanding human adjudication to improve annotation reliability (Tyagi et al., 3 Dec 2025)
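As a sketch of the proposed inferential step, a chi-squared test of independence on a gender-by-function contingency table could look like the following. The per-gender counts are illustrative (only the Directive total of 153 comes from the table above), and in practice scipy.stats.chi2_contingency would supply the statistic and p-value directly:

```python
def chi_squared(table):
    """Pearson chi-squared statistic for an r x c contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand
            stat += (obs - exp) ** 2 / exp
    return stat

# Illustrative 2x2 table: Directive vs. all other functions, men vs. women.
table = [[50, 1200],    # men:   Directive, other
         [103, 1472]]   # women: Directive, other
stat = chi_squared(table)
```

With one degree of freedom, a statistic above the 3.84 critical value would indicate a significant gender effect at the 0.05 level; the descriptive proportions reported above do not by themselves license that conclusion.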

A plausible implication is that the Miami Bilingual Corpus will continue serving as a foundational test bed for computational sociolinguistics and cross-linguistic code-switching research, particularly as annotation protocols evolve to incorporate neural models and crowd-powered QA pipelines.
