Turkish Semantic Relations Corpus
- The paper introduces TSRC, a large-scale corpus providing 843,000 Turkish term pairs with precise annotations for synonymy, antonymy, and co-hyponymy.
- It employs a three-phase hybrid methodology combining FastText clustering, LLM-based classification, and human dictionary curation to ensure high precision and extensive coverage.
- The dataset significantly enhances Turkish semantic resources, enabling improved semantic search, lexicon development, and cross-lingual NLP applications.
The Turkish Semantic Relations Corpus (TSRC) is a large-scale, labeled resource of semantic relation pairs in Turkish, designed to address the previously unmet need for robust, supervised datasets suitable for semantic modeling and retrieval in morphologically rich, low-resource languages. The dataset comprises 843,000 unique Turkish term pairs, each annotated with one of three semantic relation types: synonym, antonym, or co-hyponym. Its construction leverages a hybrid, multi-phase protocol—integrating neural embeddings, LLM-based relation classification, and human-curated dictionary sources—to achieve both scale and precision. TSRC fills a major gap identified in prior surveys, where traditional resources such as Turkish WordNet, FrameNet, ConceptNet, and Turkish PropBank provided only limited coverage of general semantic relations, and no large, pairwise relation corpus was available (Çöltekin et al., 2022; Tosun et al., 19 Jan 2026).
1. Motivation and Background
Before TSRC, Turkish lexical semantic resources were restricted to compact, synset-based lexicons (e.g., TWn with ≈14,000 synsets; KeNet ≈8,000 synsets), multilingual commonsense graphs (e.g., ConceptNet: ~65,892 Turkish nodes), and highly structured, predicate-argument resources (e.g., TRopBank). These lacked dense supervision over arbitrary word pairs and did not systematically distinguish between synonymy, antonymy, and co-hyponymy across broad lexico-domain vocabularies. Explicit analysis in existing surveys noted the absence of any “large, freely accessible Turkish corpus annotated with general semantic relations between arbitrary word pairs,” impeding both semantic parsing and evaluation (Çöltekin et al., 2022). TSRC’s release represents a scale increase of over an order of magnitude compared to prior resources, delivering high-precision pairwise relation labels validated through downstream tasks (Tosun et al., 19 Jan 2026).
2. Data Construction Protocol
TSRC is created using a three-phase hybrid annotation protocol:
- Phase I – FastText Embedding and Clustering: The initial vocabulary is constructed from 77,000 curated legal and technical terms, expanded to 110,000 types via NER-driven corpus extraction. Each term is embedded with the pre-trained Turkish FastText model (cc_tr_300, 300-dimensional); multi-word phrases are embedded by averaging over constituent tokens. Agglomerative clustering with average (UPGMA) linkage is then applied on cosine distance, with the distance threshold tuned to group terms into ≈13,000 clusters (cluster sizes 2–50). The threshold is chosen so that synonyms (high similarity) are pooled together with antonyms and co-hyponyms (semantic vicinity) inside thematic clusters.
- Phase II – LLM-Based Relation Classification: Each term cluster is submitted as a batch to Gemini 2.5-Flash, instructed to classify all intra-cluster pairs into {synonym, antonym, co-hyponym} per an explicit taxonomy: synonyms are strictly mutually substitutable, antonyms are exact semantic opposites, and co-hyponyms share a hypernym but are not interchangeable. Labels such as “uncertain” and self-synonymy are forbidden. Outputs are returned as structured JSONL records, one set of relations per term. Phase II generates approximately 827,000 synthetic labeled pairs (total API cost ≈\$65, or roughly \$0.00008 per pair).
- Phase III – Human Dictionary Integration: To increase high-precision coverage, pairs from an external dictionary (Türkçe Eş Anlamlılar Sözlüğü, 20,000 entries) are filtered (≤2 synonyms/headword, ambiguous entries excluded) and added when not already found in LLM outputs, yielding an additional 16,000 dictionary-derived pairs.
3. Corpus Composition and Label Distribution
The final TSRC corpus contains:
- Total pairs: 842,946
- Class breakdown:
- Co-hyponyms: 606,612 (71.96%)
- Synonyms: 148,367 (17.60%)
- Antonyms: 87,967 (10.44%)
- Source split:
- Synthetic (LLM-batch): 826,946 (98.1%)
- Human dictionary: 16,000 (1.9%)
The dataset uses a JSON Lines (JSONL) format, with each line recording:
```json
{"sentence1": "term_A", "sentence2": "term_B", "label": "synonym"|"antonym"|"co_hyponym"}
```
This structure directly supports contrastive learning and classification-based modeling.
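A minimal reader for records in this schema can be sketched with the standard library alone; the field names follow the format shown above.

```python
# Parse TSRC-style JSONL lines into (term_a, term_b, label) tuples.
import json

def load_pairs(lines):
    """Each input line is one JSON object with sentence1/sentence2/label keys."""
    pairs = []
    for line in lines:
        rec = json.loads(line)
        pairs.append((rec["sentence1"], rec["sentence2"], rec["label"]))
    return pairs
```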
Textual statistics:
- Average token length per term: 11.04; maximum: 37
- Overall type–token ratio: ≈0.02 (high relation redundancy across anchor terms)
Example pairs:
- High-frequency: (“sözleşme”, “mukavele”) [synonym]; (“alıcı”, “satıcı”) [antonym]
- Technical/Low-frequency: (“fotovoltaik”, “güneş paneli”)
4. Semantic Relation Discriminator and Evaluation
A dedicated three-way classifier is trained on the full TSRC to enable robust separation of synonymy, antonymy, and co-hyponymy, which is non-trivial due to embedding-based models’ limitations in distinguishing opposites. The model employs a turkish-e5-large encoder (XLM-RoBERTa, 560M parameters):
- Input format: [CLS] <term₁> [SEP] <term₂> [SEP], max sequence length 64
- Objective: Weighted cross-entropy loss with inverse class-proportion weights, $w_c = \frac{N}{K \cdot n_c}$, where $n_c$ is the number of pairs in class $c$, $N$ is the total number of pairs, and $K = 3$ is the number of classes.
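Inverse class-proportion weighting can be sketched directly from the class counts in Section 3; the exact normalization constant used by the authors is an assumption here, but any constant factor leaves the relative weighting unchanged.

```python
# Inverse class-proportion weights for weighted cross-entropy:
# w_c = N / (K * n_c), so rarer classes receive proportionally larger weights.
def class_weights(counts):
    """counts: mapping from class name to number of training pairs."""
    total = sum(counts.values())
    k = len(counts)
    return {c: total / (k * n) for c, n in counts.items()}

# Class sizes from the TSRC label distribution (Section 3).
weights = class_weights({"co_hyponym": 606_612, "synonym": 148_367, "antonym": 87_967})
```

With these counts, antonyms (the smallest class) get the largest weight, counteracting the 7:1 imbalance against co-hyponyms.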
Performance (5 epochs, trained on an NVIDIA L40S):
| Class | Precision | Recall | F1 |
|---|---|---|---|
| Synonym | 0.76 | 0.90 | 0.83 |
| Antonym | 0.91 | 0.93 | 0.92 |
| Co-hyponym | 0.93 | 0.95 | 0.94 |
| Macro-avg | 0.88 | 0.92 | 0.90 |
Top-1 accuracy on the synonym retrieval task (Siamese e5-large, CMNRL loss) reaches 90% on a held-out test set of 5,000 queries.
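The retrieval evaluation reduces to a cosine nearest-neighbor lookup over encoded terms. The sketch below swaps the Siamese e5-large encoder for raw vectors to show only the scoring step; it is not the authors' evaluation code.

```python
# Top-1 retrieval by cosine similarity: for each query vector, return the
# index of its nearest neighbor in the candidate index.
import numpy as np

def top1(query_vecs, index_vecs):
    """Rows are L2-normalized so the dot product equals cosine similarity."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    return np.argmax(q @ d.T, axis=1)
```

Top-1 accuracy is then the fraction of queries whose returned index matches the gold synonym's position.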
5. Clustering and Graph Construction
Beyond pairwise labels, TSRC supports large-scale synonym graph induction:
- Total nodes: 15,000,000 (morphological variants, anchors)
- Verified “synonym” edges: 520,000,000
- Final hard clusters: 2,905,071 (median size: 3, mean: 4.58, max: 86)
A two-stage clustering procedure mitigates the classic “semantic drift” and antonym-intrusion problems:
- Stage 1: Soft clustering by intersection ratio: a term $t$ is assigned to cluster $C$ if $|N(t) \cap C| / |N(t)| \ge \tau$, where $N(t)$ is the set of verified synonym neighbors of $t$ and $\tau$ is a tuned threshold.
- Stage 2: Topological pruning using majority vote and specificity to resolve polysemy, so that each term ultimately belongs to exactly one cluster.
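The Stage 1 rule can be sketched as set arithmetic; here `neighbors` stands for the synonym-neighbor set $N(t)$ and `tau` is an assumed default, since the tuned threshold is not given above.

```python
# Stage 1 soft assignment: a term joins every cluster C whose overlap with
# its synonym-neighbor set N(t) satisfies |N(t) & C| / |N(t)| >= tau.
def soft_assign(neighbors, clusters, tau=0.5):
    """neighbors: set of terms; clusters: mapping cluster_id -> set of members."""
    if not neighbors:
        return []
    return [cid for cid, members in clusters.items()
            if len(neighbors & members) / len(neighbors) >= tau]
```

A term can land in several clusters at this stage; Stage 2's majority-vote pruning then collapses it to exactly one.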
Sample clusters:
- “Mücbir Sebep”: includes “Mücbir Sebe”, “Mücbir Sebep Halleri”, “Mucbir Sebepler”, “Mücbir Sebep”
- “Vergi Usul Kanunu”: “VUK”, “Vergi Usul K.”, “213 Sayılı Kanun”, “Vergi Usul Kanunu”, “Vergi Usul Yasası”
6. Comparison with Existing Semantic Resources
Previous resources provided only synset-based or category-constrained relations, lacking explicit pairwise labels at scale:
| Resource | Size | Relations Encoded | Format | Coverage |
|---|---|---|---|---|
| Turkish WordNet | ≈14,000 synsets | Synonymy, hypernymy, antonymy | XML/Prolog | Nouns, verbs, adjectives |
| KeNet | ≈8,000 synsets | Synonymy, hypernymy, antonymy | XML | Extended lexicon |
| ConceptNet | ≈66,000 concepts | IsA, PartOf, Synonym, Antonym, etc. | JSON triples | Multilingual concepts |
| TSRC | 843,000 pairs | Synonym, antonym, co-hyponym | JSON Lines | Domain & general terms |
The explicit absence of a broad, pair-annotated semantic corpus was noted as a critical barrier by resource surveys (Çöltekin et al., 2022); TSRC is the first to provide robust, broad-coverage, fine-grained supervision for Turkish.
7. Availability, Licensing, and Applications
The TSRC dataset is publicly released under CC BY 4.0, conforming to Global WordNet Association guidelines.
- Download: https://huggingface.co/datasets/tsrc/tr-semantic-relations
- Contents:
- tsrc_train.jsonl, tsrc_dev.jsonl, tsrc_test.jsonl
- term_list.txt (110,000 terms), cluster_map.csv (term↔cluster_id)
- License: CC BY 4.0
- Citation: Tosun et al., “A Hybrid Protocol for Large-Scale Semantic Dataset Generation…” (2025)
Applications:
- High-precision semantic search (query expansion, antonym filtering)
- Retrieval-augmented text generation
- Lexicon-building for domain-specific NLP tasks
- Training and evaluating embedding models on antonymy detection
- Cross-lingual adaptation to other low-resource languages (cost ≈\$50–100 per language for the LLM annotation phase, using FastText and any suitable multilingual LLM) (Tosun et al., 19 Jan 2026).
8. Future Perspectives
TSRC’s methodology and taxonomy establish a replicable protocol for other low-resource languages, given the wide availability of FastText embeddings and multilingual LLMs. The explicit identification of synonymy, antonymy, and co-hyponymy supports research into antonym intrusion and semantic drift—problems that remain challenging for neural embedding approaches (Tosun et al., 19 Jan 2026). The inclusion of both synthetic and curated pairs enables robust evaluation, while the soft-to-hard clustering pipeline provides a foundation for resolving polysemy and maintaining high semantic purity in induced lexicons. The TSRC fills a crucial gap in Turkish language technology infrastructure, directly enabling previously infeasible tasks in semantic parsing and evaluation.