Taxonomically Enriched Turkish Learner Corpus
- The paper introduces a novel faceted taxonomy that decomposes each error into six linguistically informative dimensions for Turkish learner corpora.
- It employs a semi-automated Annotation Extender that integrates UDPipe 2.0 for reliable morphosyntactic analysis and precise error enrichment.
- The corpus enables complex, multi-dimensional queries, offering actionable insights for SLA research, computational linguistics, and educational technology.
The Taxonomically Enriched Turkish Learner Corpus extends traditional error-annotated learner corpora for Turkish with a multi-dimensional, theoretically grounded annotation scheme. Instead of flat error tags, it uses a faceted taxonomy and a semi-automated annotation-extension framework to enrich each annotated error with interpretable, linguistically informative facets. This supports standardized, fine-grained analyses and sophisticated queries for empirical research in second language acquisition (SLA), computational linguistics, and educational technology (Sayar et al., 30 Jan 2026).
1. Faceted Taxonomy and Annotation Structure
Eryiğit et al. (2025) formalize a faceted taxonomy decomposing each error annotation into six core dimensions, operationalizable as distinct fields in the corpus schema:
- Identifier: The canonical (flat) error type label (e.g., PUN for punctuation, CASE for case-marking, SPE for spelling).
- MorphologicalFeature: Three Universal Dependencies–based subfacets of the corrected form:
  - Part-of-Speech (POS): UD UPOS tag (e.g., NOUN, VERB).
  - InflectionalFeature: UD morphological features (e.g., Case, Number, Person, Tense, Aspect).
  - LexicalFeature: Lexical subfeatures (e.g., PronType, Polarity, Degree).
- Unit: Linguistic granularity at which the error occurs (Affix, Lemma, Word, Phrase, Sentence).
- Phenomenon: Formal deviation type—Addition (A), Omission (O), Misuse/Misapplication (M), Misordering (MO), or Ambiguity (Amb).
- LinguisticLevel: Coarse linguistic domain—Orthography, Morphophonology, Grammar, Semantics, Pragmatics, Sociolinguistics.
- Metadata: Contextual attributes (Nationality, Gender, Topic, proficiency level, etc.).
This decomposition enables researchers to formulate complex, multi-faceted queries that are infeasible on corpora with only flat error tags. For example, it is possible to restrict analysis to accusative case-omission errors on noun affixes by L1 Chinese learners in education-topic essays. Each error annotation thus functions as an object with multidimensional descriptors enabling granular research and pedagogical insights (Sayar et al., 30 Jan 2026).
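To make the idea of an error annotation as a multi-faceted object concrete, the following minimal sketch models the six dimensions as a Python dataclass and expresses the example query above as a filter. The class name, field names, and sample values (including `"Chinese"` and `"Education"`) are illustrative assumptions, not the released corpus schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ErrorAnnotation:
    """One enriched error. Field names are illustrative, not the released schema."""
    identifier: str                                    # flat label, e.g. "CASE"
    pos: Optional[str]                                 # UD UPOS of the corrected form
    inflectional: dict = field(default_factory=dict)   # e.g. {"Case": "Acc"}
    lexical: dict = field(default_factory=dict)        # e.g. {"Polarity": "Neg"}
    unit: Optional[str] = None                         # Affix, Lemma, Word, Phrase, Sentence
    phenomenon: Optional[str] = None                   # A, O, M, MO, Amb
    linguistic_level: Optional[str] = None             # e.g. "Grammar"
    metadata: dict = field(default_factory=dict)       # Nationality, Gender, Topic, ...


def accusative_affix_omissions(errors, nationality, topic):
    """Select accusative case-omission errors on noun affixes for one learner group."""
    return [
        e for e in errors
        if e.pos == "NOUN"
        and e.inflectional.get("Case") == "Acc"
        and e.unit == "Affix"
        and e.phenomenon == "O"
        and e.metadata.get("Nationality") == nationality
        and e.metadata.get("Topic") == topic
    ]


# Hypothetical sample data, for illustration only.
errors = [
    ErrorAnnotation("CASE", "NOUN", {"Case": "Acc"}, {}, "Affix", "O", "Grammar",
                    {"Nationality": "Chinese", "Topic": "Education"}),
    ErrorAnnotation("SPE", "VERB", {}, {}, "Lemma", "M", "Orthography",
                    {"Nationality": "Chinese", "Topic": "Education"}),
]
hits = accusative_affix_omissions(errors, "Chinese", "Education")
```

A flat-tag corpus would expose only the `identifier` field, so none of the other conditions in the filter could be expressed.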
2. Annotation Extension Framework and Workflow
The semi-automated annotation-extension framework—implemented as the “Annotation Extender”—processes an error-annotated Turkish learner corpus to infer and populate the full set of taxonomy facets. The pipeline is as follows:
- Data Validation: Input from Label Studio in JSON format is verified for well-formed, non-overlapping spans and the presence of minimally required annotation fields (error span, flat label, correction).
- Task-Level Reconstruction: Corrections are reapplied, in span order, to reconstruct the “gold” essay text. An index-shift table allows reliable realignment of error spans to new token boundaries.
- Morphosyntactic Analysis: UDPipe 2.0, with the turkish-imst UD model, parses the reconstructed text to supply tokenization, POS, lemma, and UD morphological features in CoNLL-U format.
- Error-Level Enrichment: Each error annotation is enriched by:
  - Static mapping: Predefined and fixed facet values per error type or source metadata.
  - Context-aware mapping: For facets depending on local context (e.g., POS in SPE errors), features are extracted from aligned UD tokens and further partitioned by rules and heuristics.
  - For multi-valued facets (e.g., deciding whether a spelling error targets an affix or lemma), heuristics such as comparing corrected lemma length to the original span are applied.
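The validation and reconstruction steps above can be sketched as follows. This is a minimal illustration under stated assumptions: annotations are plain dictionaries with hypothetical keys `start`, `end`, `label`, and `correction` (not the actual Label Studio export format), and the index-shift table is reduced to a running offset.

```python
def validate_spans(annotations):
    """Return annotations sorted by start; raise ValueError on bad or overlapping spans."""
    ordered = sorted(annotations, key=lambda a: a["start"])
    previous_end = 0
    for ann in ordered:
        for key in ("label", "correction"):
            if key not in ann:
                raise ValueError(f"missing field {key!r}: {ann}")
        if not (0 <= ann["start"] <= ann["end"]):
            raise ValueError(f"malformed span: {ann}")
        if ann["start"] < previous_end:
            raise ValueError(f"overlapping span: {ann}")
        previous_end = ann["end"]
    return ordered


def reconstruct(text, annotations):
    """Apply corrections in span order; return the corrected text and shifted spans."""
    pieces, shifted, cursor, shift = [], [], 0, 0
    for ann in validate_spans(annotations):
        pieces.append(text[cursor:ann["start"]])   # untouched text before the error
        pieces.append(ann["correction"])           # corrected form replaces the span
        new_start = ann["start"] + shift
        shifted.append((new_start, new_start + len(ann["correction"])))
        # Every later span moves by the length difference introduced here.
        shift += len(ann["correction"]) - (ann["end"] - ann["start"])
        cursor = ann["end"]
    pieces.append(text[cursor:])
    return "".join(pieces), shifted


# Hypothetical learner sentence with two annotated errors.
learner_text = "Ben okul gidiyorm."
annotations = [
    {"start": 4, "end": 8, "label": "CASE", "correction": "okula"},
    {"start": 9, "end": 17, "label": "SPE", "correction": "gidiyorum"},
]
corrected, spans = reconstruct(learner_text, annotations)
```

The shifted spans index into the corrected text, which is what a parser sees, so each error can be aligned with the tokens of its corrected form.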
This process is guided by a schema that specifies, per error type, which facets to infer from UD output and which require only static mapping. The following pseudocode shows mapping logic for Unit selection in SPE errors:
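```python
def infer_Unit_for_SPE(error_span, corrected_form, udpipes):
    """Heuristically decide whether a spelling (SPE) error targets an affix or the lemma."""
    # lookup_tokens is an assumed helper (not shown) that returns the UD tokens
    # aligned with the error span.
    tokens = lookup_tokens(error_span, udpipes)
    lemma = tokens[0].LEMMA
    # A corrected form that begins with its lemma is attributed to the affix;
    # otherwise the error is attributed to the lemma itself.
    if corrected_form.startswith(lemma):
        return "Affix"
    return "Lemma"
```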
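As a complement to the Unit heuristic, the context-aware mapping of POS and morphological features can be sketched by reading one CoNLL-U token line and partitioning its features. This is a hedged sketch: the two feature sets contain only the examples named in the taxonomy above, not the full schema, and the sample line is hand-written for illustration rather than actual UDPipe output.

```python
# Example feature names taken from the taxonomy description; not exhaustive.
INFLECTIONAL = {"Case", "Number", "Person", "Tense", "Aspect"}
LEXICAL = {"PronType", "Polarity", "Degree"}


def parse_conllu_token(line):
    """Split one 10-column CoNLL-U token line into the fields used for enrichment."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 tab-separated CoNLL-U columns")
    feats = {}
    if cols[5] != "_":                      # "_" marks an empty FEATS column
        for pair in cols[5].split("|"):
            name, value = pair.split("=", 1)
            feats[name] = value
    return {"form": cols[1], "lemma": cols[2], "upos": cols[3], "feats": feats}


def split_features(feats):
    """Partition UD features into InflectionalFeature and LexicalFeature subfacets."""
    inflectional = {k: v for k, v in feats.items() if k in INFLECTIONAL}
    lexical = {k: v for k, v in feats.items() if k in LEXICAL}
    return inflectional, lexical


# Hand-written token line (columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC).
line = "2\tokula\tokul\tNOUN\tNoun\tCase=Dat|Number=Sing|Person=3\t3\tobl\t_\t_"
token = parse_conllu_token(line)
inflectional, lexical = split_features(token["feats"])
```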
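Facet-level annotation accuracy is formalized as the share of evaluated error instances whose inferred value for a facet matches the gold standard:

$$\mathrm{Acc}_f = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i^{(f)} = y_i^{(f)}\right]$$

where $N$ is the number of evaluated instances, $\hat{y}_i^{(f)}$ is the value inferred for facet $f$ on instance $i$, and $y_i^{(f)}$ is the corresponding gold value.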
3. Manual Tagset Development and Annotation Protocol
Recognizing Turkish as an agglutinative, morphologically rich language (MRL), an extensive preparatory annotation campaign preceded full-scale corpus construction:
- Tagset Revision: The original 58 flat error labels from Golynskaia (2022) were consolidated and expanded into 34 categories (V1.0), piloted, and iteratively refined to a final set of 37 error tags (V3.0), encompassing novel error types such as Unnecessary Affix (UA), Final-Initial Merge (FIM), Derivation (DE), Descriptive Compound Verb (DCV), Aspect (ASP), Allomorphy (ALL), Unclear Meaning (UM), STYLE, and DIG (Digitization).
- Annotation Phase: Five annotators (four final) with advanced Turkish as a Foreign Language training annotated the 525-essay main corpus (8,180 sentences; 104,701 tokens) in Label Studio, operating under a minimal-edit principle.
- Consensus and Guideline Finalization: Expert adjudication resolved conflicts, and comprehensive written annotation guidelines—bolstered by decision rules and real examples—ensured consistency.
- Quality Control: Cohen’s κ was maintained at ≥ .79 for error type categorization, with .91 on error detection in the pilot phase.
This structure underpins the corpus’s reliability and its extensibility to related research domains.
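For readers unfamiliar with the agreement statistic used in quality control, the sketch below computes Cohen's κ for two annotators who labeled the same items. It implements the standard definition (observed agreement corrected for chance agreement); the label sequences are invented for illustration and are not corpus data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:        # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented example: two annotators disagree on one of six error labels.
annotator_1 = ["SPE", "SPE", "CASE", "PUN", "CASE", "SPE"]
annotator_2 = ["SPE", "CASE", "CASE", "PUN", "CASE", "SPE"]
kappa = cohens_kappa(annotator_1, annotator_2)
```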
4. Evaluation Metrics and Error Analysis
Systematic evaluation utilized a stratified sample of 2,024 error instances drawn from a 672-essay pool (10,591 sentences; 133,752 tokens):
- All rare error types (≤ 30 occurrences) were fully included; high-frequency types were sampled at 5% (with a minimum of 30 instances).
- The gold standard was created via senior expert extension, with inter-annotator agreement (IAA) on the concatenated facets of κ = .80.
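The sampling rule above can be sketched as a small function. The thresholds come from the description; rounding the 5% share upward, the fixed random seed, and the example counts are assumptions made for this illustration.

```python
import math
import random


def sample_size(count, rare_threshold=30, rate=0.05, minimum=30):
    """Number of instances to draw for an error type with `count` occurrences."""
    if count <= rare_threshold:
        return count                       # rare types are included in full
    # Rounding before ceil guards against floating-point artifacts (assumed ceil).
    return max(minimum, math.ceil(round(rate * count, 9)))


def stratified_sample(instances_by_type, seed=0):
    """Draw a per-type random sample following the rule in sample_size."""
    rng = random.Random(seed)
    return {
        error_type: rng.sample(instances, sample_size(len(instances)))
        for error_type, instances in instances_by_type.items()
    }


# Hypothetical counts per error type, for illustration only.
pool = {
    "FIM": list(range(12)),      # rare type
    "CASE": list(range(400)),    # 5% would be 20, so the minimum of 30 applies
    "SPE": list(range(2000)),    # 5% gives 100
}
sample = stratified_sample(pool)
sizes = {error_type: len(items) for error_type, items in sample.items()}
```

Because rare types are taken in full while frequent ones are subsampled, rare types are over-represented in the evaluation sample relative to the corpus.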
Facet-level accuracy by dimension:
| Facet | Accuracy (%) |
|---|---|
| POS | 94.46 |
| InflectionalFeature | 98.91 |
| LexicalFeature | 99.85 |
| Unit | 96.29 |
| Phenomenon | 89.77 |
- Macro-average facet accuracy: 95.86%
- Annotation-level exact match accuracy: 81.18% (fraction of instances where all facets matched gold)
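- No explicit precision/recall/F1 is reported, though these could, in principle, be defined per facet $f$ as:

$$P_f = \frac{TP_f}{TP_f + FP_f}, \qquad R_f = \frac{TP_f}{TP_f + FN_f}, \qquad F1_f = \frac{2\,P_f\,R_f}{P_f + R_f}$$

where $TP_f$, $FP_f$, and $FN_f$ are the true positives, false positives, and false negatives counted for facet $f$.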
Error analysis identified most mismatches within the Phenomenon facet (ambiguity in A/O/MO classifications) and those cascading from UDPipe segmentation or tagging errors, especially for context-sensitive features (Sayar et al., 30 Jan 2026).
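The reported metrics can be reproduced in outline with the sketch below, which computes facet-level accuracy and annotation-level exact match on toy data and recomputes the macro-average from the five accuracies in the table. The record layout and the toy predictions are assumptions; only the `reported` values come from the table above.

```python
FACETS = ["POS", "InflectionalFeature", "LexicalFeature", "Unit", "Phenomenon"]


def facet_accuracy(predicted, gold, facet):
    """Share of instances whose predicted value for one facet equals the gold value."""
    matches = sum(p[facet] == g[facet] for p, g in zip(predicted, gold))
    return matches / len(gold)


def exact_match_accuracy(predicted, gold):
    """Share of instances where every evaluated facet equals the gold value."""
    matches = sum(all(p[f] == g[f] for f in FACETS) for p, g in zip(predicted, gold))
    return matches / len(gold)


# Toy gold and predicted records (hypothetical): one Phenomenon mismatch.
gold = [
    {"POS": "NOUN", "InflectionalFeature": "Case=Acc", "LexicalFeature": "",
     "Unit": "Affix", "Phenomenon": "O"},
    {"POS": "VERB", "InflectionalFeature": "Tense=Past", "LexicalFeature": "",
     "Unit": "Affix", "Phenomenon": "M"},
]
predicted = [dict(gold[0]), dict(gold[1], Phenomenon="A")]

phenomenon_acc = facet_accuracy(predicted, gold, "Phenomenon")
exact_match = exact_match_accuracy(predicted, gold)

# Facet accuracies as reported in the table (percent).
reported = {"POS": 94.46, "InflectionalFeature": 98.91, "LexicalFeature": 99.85,
            "Unit": 96.29, "Phenomenon": 89.77}
macro_average = sum(reported.values()) / len(reported)
```

The macro-average of the five reported values is 95.856, consistent with the 95.86% stated above. Exact match is necessarily lower than any single facet accuracy, since one mismatched facet is enough to fail an instance.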
5. Data Structure and Querying Functions
The released corpus encodes each annotated error as a JSON object comprising:
- span offsets and text
- Identifier (flat label)
- POS, InflectionalFeature, LexicalFeature
- Unit (Affix/Lemma/Word/Phrase/Sentence)
- Phenomenon (A/O/M/MO/Amb)
- LinguisticLevel
- Metadata: Nationality, Gender, Topic
This schema, alongside a CSV metadata file, follows FAIR data principles and is set for distribution via CLARIN and Sketch Engine.
In faceted-search interfaces (Sketch Engine, SPARQL/Elasticsearch APIs), queries can span any facet combination; examples include:
- Extracting all accusative affix omissions in past-tense contexts.
- Filtering for spelling errors affecting verb lemmas by L1 Chinese participants.
Compared with flat-tag corpora, this allows precise filtering (e.g., affix-omission errors for nouns) and fine-grained learner profiling.
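As an illustration of learner profiling over the JSON encoding, the sketch below loads two error records and counts facet-value combinations for a metadata-defined learner group. The key names mirror the facet list above but are assumptions about the serialization, and the records themselves are invented.

```python
import json
from collections import Counter

# Hypothetical records; key names and values are illustrative only.
RECORDS_JSON = """
[
  {"span": {"start": 4, "end": 8, "text": "okul"},
   "Identifier": "CASE", "POS": "NOUN",
   "InflectionalFeature": {"Case": "Dat"}, "LexicalFeature": {},
   "Unit": "Affix", "Phenomenon": "O", "LinguisticLevel": "Grammar",
   "Metadata": {"Nationality": "Chinese", "Gender": "F", "Topic": "Education"}},
  {"span": {"start": 9, "end": 17, "text": "gidiyorm"},
   "Identifier": "SPE", "POS": "VERB",
   "InflectionalFeature": {"Tense": "Pres"}, "LexicalFeature": {},
   "Unit": "Affix", "Phenomenon": "O", "LinguisticLevel": "Orthography",
   "Metadata": {"Nationality": "Chinese", "Gender": "F", "Topic": "Education"}}
]
"""


def facet_profile(records, facets, **metadata):
    """Count facet-value combinations among records matching the metadata filters."""
    counts = Counter()
    for rec in records:
        if all(rec["Metadata"].get(k) == v for k, v in metadata.items()):
            counts[tuple(rec[f] for f in facets)] += 1
    return counts


records = json.loads(RECORDS_JSON)
profile = facet_profile(records, ["Unit", "Phenomenon"], Nationality="Chinese")
```

Grouping by `Unit` and `Phenomenon` instead of the flat `Identifier` shows that both errors are affix omissions, a pattern that the two different flat labels would hide.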
6. Contributions, Impact, and Prospects
The Taxonomically Enriched Turkish Learner Corpus's principal advances are:
- Faceted taxonomy operationalized for Turkish—bridging theoretical error typology with practical, scalable annotation procedures.
- Hybrid annotation pipeline—combining expert manual annotation with semi-automated extension, supporting efficient fine-grained enrichment at corpus scale.
- Public release—including corpus, open-source Annotation Extender, and extensive annotation documentation.
The corpus enables cross-linguistic, multi-dimensional SLA studies, supports the development and evaluation of Turkish grammatical error correction (GEC) models at facet level, and facilitates data-driven pedagogy through precise diagnostics of learner difficulty.
Limitations:
- Reliance on UDPipe: segmentation and UD tag errors affect facet inference, particularly for contextually dependent attributes such as Aspect.
- Heuristic-based facet mapping: may underperform for complex or ambiguous error cases.
- Limited metadata: presently restricted to Nationality, Gender, and Topic.
- Evaluation bias: stratified sampling inflates the presence of rare error types.
Potential extensions:
- Integrate dependency parsing and semantic-role facets into the extender to cover errors in word order, subordination, and agreement.
- Replace heuristic mappings with probabilistic or neural inference, particularly to improve Phenomenon facet assignment.
- Replicate the methodology for other MRLs (e.g., Finnish, Hungarian) via language-specific schema redefinitions.
- Utilize the corpus for “facet-aware” GEC system evaluation and in intelligent tutoring systems generating feedback by error facet.
In summary, the Taxonomically Enriched Turkish Learner Corpus constitutes a benchmark resource, demonstrating the feasibility and benefits of facet-based annotation in low-resource, morphologically complex languages—impacting SLA research, language pedagogy, and NLP development (Sayar et al., 30 Jan 2026).