Turkish Skill Extraction Dataset
- Turkish Skill Extraction Dataset is a manually annotated corpus dedicated to identifying skill spans in Turkish job postings with clear ESCO taxonomy linking.
- It employs a rigorous three-round MAMA protocol and semantic embedding retrieval to ensure consistent and precise span annotation.
- The dataset offers 4,819 annotated skill spans and supports research in job matching, labor analytics, and low-resource NLP applications.
The Turkish Skill Extraction Dataset (“KARIYER”) constitutes the first manually annotated corpus dedicated to skill span identification and taxonomic linking in Turkish job postings. Developed to facilitate advanced research in automated skill extraction, which is essential for job matching, labor analysis, and recommendation systems, the dataset addresses the low-resource status and morphologically complex structure of Turkish by providing a rigorously constructed, openly available benchmark (İltüzer et al., 30 Jan 2026).
1. Data Sources, Sampling, and Annotation Protocol
Job postings were randomly sampled from Kariyer.net, Türkiye’s largest online recruitment platform. The final corpus comprises 327 distinct postings spanning 27 ISCO-based occupational areas, without over- or under-sampling of specific industries, thus reproducing a realistic, long-tail distribution of real-world occupations. Sales–Marketing and Finance collectively account for roughly 25% of samples.
Span annotation was carried out by three native Turkish annotators (two with computer science backgrounds, one in product/marketing) using a three-round MAMA (“Model-Annotate-Model-Annotate”; Editor's term) protocol that iteratively refined the guidelines and enforced labeling consistency. Annotators marked all contiguous spans expressing demonstrable “skills” (abilities, knowledge, or expertise required by the role), explicitly excluding references to education or experience. Multi-skill constructions (e.g., “etkili ve anlaşılır iletişim”, “effective and clear communication”) received a “MULTI” label and were subsequently split for ESCO linking.
Turkish morphological complexity was handled at annotation time: meaning-altering suffixes were included in spans, while compound nouns remained unsplit unless semantic boundaries justified a split. The European Skills, Competences, Qualifications and Occupations (ESCO) v1.2.0 taxonomy (13,895 English entries) served as the reference ontology: all ESCO skill labels were machine-translated into Turkish with ChatGPT and manually reviewed for coverage and consistency. ESCO mapping in the test set combined semantic embedding-based retrieval with manual adjudication among the top 50 ESCO candidates.
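The top-50 candidate retrieval step can be sketched as a cosine-similarity search over label embeddings. The helper below is illustrative only: it uses random stand-in vectors rather than the authors' encoder, and `top_k_candidates` is a hypothetical name, not part of the released code.

```python
import numpy as np

def top_k_candidates(span_vec: np.ndarray, esco_vecs: np.ndarray, k: int = 50):
    """Return indices of the k ESCO entries most similar to a span embedding.

    span_vec:  (d,) embedding of one annotated skill span
    esco_vecs: (n, d) embeddings of the Turkish ESCO labels
    """
    # Cosine similarity = dot product of L2-normalised vectors.
    span = span_vec / np.linalg.norm(span_vec)
    esco = esco_vecs / np.linalg.norm(esco_vecs, axis=1, keepdims=True)
    sims = esco @ span
    k = min(k, len(sims))
    # argpartition selects the k largest, then we sort just those.
    top = np.argpartition(-sims, k - 1)[:k]
    return top[np.argsort(-sims[top])]

# Toy demo with random stand-in embeddings; a real system would embed
# spans and ESCO labels with a multilingual sentence encoder.
rng = np.random.default_rng(0)
esco_vecs = rng.normal(size=(100, 16))
span_vec = esco_vecs[7] + 0.01 * rng.normal(size=16)  # close to entry 7
top5 = top_k_candidates(span_vec, esco_vecs, k=5)  # top5[0] == 7
```

In the dataset's pipeline, the resulting candidate list is then shown to annotators for manual adjudication rather than being trusted directly.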
2. Corpus Size, Class Distribution, and File Structure
The complete dataset includes 4,819 annotated skill spans: 4,344 single-skill and 475 multi-skill, extracted from 327 postings.
| Occupational Area (excerpt) | #Posts | #Skill Spans |
|---|---|---|
| Sales–Marketing | 49 | 676 |
| Finance | 37 | 625 |
| Information Technology | 14 | 372 |
| Other (Arts/Law/Chemistry/etc.) | 28 | 353 |
Splits were drawn per occupational category: categories with more than six postings were stratified 60/20/20 into train/dev/test, while smaller categories contributed exclusively to training. This yields approximately 905 span annotations in the test set (18.8%), around 964 in development (20%), and the remainder in training.
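The splitting rule can be sketched as follows; `assign_splits`, the seed, and the rounding behavior are assumptions for illustration, not the authors' exact procedure.

```python
import random

def assign_splits(postings_by_area, min_posts=7, seed=13):
    """Assign posting IDs to train/dev/test per the stated rule:
    areas with more than six postings are split 60/20/20, while
    smaller areas go entirely to the training set."""
    rng = random.Random(seed)
    split = {}
    for area, post_ids in postings_by_area.items():
        ids = list(post_ids)
        if len(ids) < min_posts:  # six or fewer postings: training only
            for pid in ids:
                split[pid] = "train"
            continue
        rng.shuffle(ids)
        n_test = round(0.2 * len(ids))
        n_dev = round(0.2 * len(ids))
        for pid in ids[:n_test]:
            split[pid] = "test"
        for pid in ids[n_test:n_test + n_dev]:
            split[pid] = "dev"
        for pid in ids[n_test + n_dev:]:
            split[pid] = "train"
    return split

# Demo with two areas: one large enough to stratify, one too small.
demo = {"Sales–Marketing": range(49), "Other": range(100, 105)}
splits = assign_splits(demo)
```

With 49 postings, the demo assigns 10 to test, 10 to dev, and 29 to train, while all five postings of the small area land in train.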
The dataset is distributed in line-based JSON files. Each record contains:
- `job_id`: integer ID
- `area`: occupational category
- `split`: data subset (“train”, “dev”, “test”)
- `text`: raw job posting
- `skills`: list of objects, each with `start`, `end`, `span` (skill text), and `label` (“SKILL” or “MULTI”)
- `linked_ESCO`: (test only) mapped ESCO ID, or null for unlabeled
Example:
```json
{
  "job_id": 123456,
  "area": "Sales–Marketing",
  "split": "train",
  "text": "Firmamızda çalışacak, <skill>CRM yönetimi</skill> ve <skill>e-posta pazarlama</skill> konularında deneyimli takım arkadaşı arıyoruz.",
  "skills": [
    {"start": 36, "end": 52, "label": "SKILL", "span": "CRM yönetimi"},
    {"start": 56, "end": 76, "label": "SKILL", "span": "e-posta pazarlama"}
  ]
}
```
Supporting files include full ESCO translations (skills_base.json), JSON schema, and annotation guidelines in PDF.
3. Annotation Quality and Inter-Annotator Agreement
Span annotation underwent a systematic three-round MAMA cycle, with cross-review and majority voting to adjudicate disagreements and edge cases. While explicit inter-annotator agreement (IAA) was not numerically reported for spans, consistent annotation was achieved through iterative process refinement.
For ESCO ID linking on the test set, a separate triad of annotators independently associated each span with candidates from the top 50 semantically retrieved Turkish ESCO entries. Krippendorff’s alpha (α) was computed for nominal agreement across the three annotator pairs, yielding pairwise values of 0.89, 0.80, and 0.64. 82% of test spans were successfully linked to an ESCO skill; remaining conflicts were resolved by majority vote during adjudication.
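For readers reproducing such agreement figures, a generic nominal-data Krippendorff's alpha can be computed via the standard coincidence-matrix formulation; the sketch below is not the authors' code, and the demo labels are invented.

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one list per annotated item, containing the labels assigned
    by each annotator; items with fewer than two labels are skipped.
    """
    o = Counter()  # coincidence matrix over ordered label pairs
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for i, c in enumerate(labels):
            for j, k in enumerate(labels):
                if i != j:
                    o[(c, k)] += 1.0 / (m - 1)
    n_c = Counter()  # marginal counts per label
    for (c, _k), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    observed = sum(v for (c, k), v in o.items() if c != k)
    expected = n * n - sum(v * v for v in n_c.values())
    if expected == 0:
        return 1.0  # no label variation at all
    # alpha = 1 - D_o / D_e, with the shared 1/n factors cancelled out.
    return 1.0 - (n - 1) * observed / expected

# Tiny demo: two agreeing pairs and one disagreeing pair.
alpha = krippendorff_alpha_nominal([["A", "A"], ["B", "B"], ["A", "B"]])  # ≈ 0.44
```

Perfect agreement yields α = 1.0, chance-level agreement yields α ≈ 0, and systematic disagreement drives α negative.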
4. Access, Licensing, and Distribution
The dataset is released under the Creative Commons Attribution 4.0 (CC BY 4.0) license, permitting broad academic and commercial reuse with attribution. Upon acceptance of the source publication, materials will be hosted at https://github.com/kariyer-net/turkish-skill-dataset.
Top-level directory organization:
- `/README.md`: overview, usage, and citation instructions
- `/skills_base.json`: Turkish ESCO skills and metadata
- `/train.jsonl`, `/dev.jsonl`, `/test.jsonl`: span-annotated job posting splits
- `/test_links.jsonl`: test set gold-standard ESCO mappings
- `/schema.json`: dataset schema
- `/annotation_guidelines.pdf`: full annotator guidelines
5. Applicability, Baseline Results, and Reproducibility
The KARIYER dataset directly supports skill extraction benchmarking for Turkish, enabling model comparison under realistic occupational distributions and rich morphological variation. Typical use cases include sequence labeling model development, advanced LLM prompting strategies, and cross-lingual experiments by merging with other annotated corpora.
Sample Python data loaders and IOB tag conversion methods are provided for typical Hugging Face workflows:
```python
import json
from pathlib import Path

def load_split(path):
    """Yield one job-posting record per line of a JSONL split file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def to_iob(example):
    """Convert character-offset skill spans to token-level IOB tags
    (whitespace tokenization; offsets index into example["text"])."""
    text = example["text"]
    tokens = text.split()
    tags = ["O"] * len(tokens)
    # Recover each token's character offsets in the raw text.
    offsets, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    for s in example["skills"]:
        # Tokens overlapping [s["start"], s["end"]) belong to the span.
        inside = [i for i, (ts, te) in enumerate(offsets)
                  if ts < s["end"] and te > s["start"]]
        if inside:
            tags[inside[0]] = "B-SKILL"
            for i in inside[1:]:
                tags[i] = "I-SKILL"
    return tokens, tags

train = list(load_split("train.jsonl"))
```
Baseline supervised models fine-tune bert-base-turkish-cased for six epochs (lr=2e-5, weight_decay=0.01), obtaining F₁ = 0.64 (precision = 0.62, recall = 0.67) and a MUC score of 0.84 on the test set.
LLM-based evaluations demonstrate the impact of prompting strategies: zero-shot GPT-4o achieves F₁ ≈ 0.15 (strict) and partial 0.79, whereas dynamic ten-shot prompting with Claude 3.7 reaches F₁ ≈ 0.57 (partial 0.78). The best full pipeline (Claude 3.7, embedding retrieval, GPT-4o causal rerank) yields end-to-end job-level accuracy of 0.56. This situates Turkish skill extraction benchmarks in proximity to those available for other underrepresented languages, suggesting LLMs offer measurable gains in low-resource contexts.
Immediate reproducibility is supported through the open license and released code, with sample CLI training scripts for reference:
```shell
transformers-cli train \
  --model_name_or_path bert-base-turkish-cased \
  --dataset_name ./train.jsonl \
  --do_train --do_eval \
  --max_seq_length 256 \
  --per_device_train_batch_size 16 \
  --learning_rate 2e-5 \
  --num_train_epochs 6 \
  --output_dir ./exp/bert_skill
```
A plausible implication is that future research leveraging parameter-efficient fine-tuning (PEFT) or multilingual skill-taxonomy integration can extend the dataset’s utility for broader NLP and labor market analytics (İltüzer et al., 30 Jan 2026).