Turkish RAG Datasets for LLM Benchmarking
- Turkish RAG datasets are standardized resources for benchmarking LLM pipelines in Turkish; Turkish-RAGTruth additionally provides token-level hallucination annotations for tasks like QA, data-to-text, and summarization.
- Turkish-RAGTruth uses a two-stage prompting and translation protocol with minimal manual edits, while RAGTurk integrates web and Wikipedia sources with adjacent-chunk augmentation.
- Both datasets ship pre-computed dense retrieval indices, and the accompanying pipelines employ cross-encoder reranking to address the retrieval and generation challenges posed by Turkish's agglutinative structure.
A Turkish Retrieval-Augmented Generation (RAG) dataset serves as a standardized resource for the development, benchmarking, and evaluation of LLM pipelines that integrate external retrieval with natural language generation in the Turkish language. The two primary public resources in this domain are the Turkish-RAGTruth dataset, released in conjunction with the Turk-LettuceDetect hallucination detection framework (Taş et al., 22 Sep 2025), and the RAGTurk corpus, designed for general-purpose RAG pipeline evaluation and best practices in Turkish information access (Köse et al., 3 Feb 2026). Both datasets directly address the unique challenges posed by Turkish—a morphologically rich, low-resource language whose agglutinative structure complicates both retrieval and generation tasks.
1. Dataset Construction and Source Materials
Turkish-RAGTruth
The Turkish-RAGTruth dataset is a machine-translated adaptation of the English RAGTruth benchmark (Niu et al. 2024), which systematically annotates hallucinated versus supported spans in LLM-generated outputs across three tasks:
- Question Answering (QA)
- Data-to-Text generation (Data2Text)
- Summarization
Translation utilized google/gemma-3-27b-it via vLLM on NVIDIA A100 hardware, processing approximately 30 examples in parallel and completing a full pass per language in roughly 12 hours. A two-stage prompting protocol preserved all <HAL> (hallucination) tag positions, employing distinct translation templates for core response segments and for JSON/instructional elements. Quality assurance comprised tag-integrity verification, random sampling–based human review (n≈500), and automated checks for character offset consistency. Minimal manual edits fixed format distortions or punctuation loss within hallucinated spans (affecting <1% of instances) (Taş et al., 22 Sep 2025).
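The tag-integrity verification step above can be illustrated with a minimal sketch; the function names and the exact `<HAL>...</HAL>` tag format are assumptions for illustration, not taken from the released code:

```python
import re

HAL_OPEN, HAL_CLOSE = "<HAL>", "</HAL>"

def hal_spans(text: str) -> list[str]:
    """Return the inner texts of all <HAL>...</HAL> spans."""
    return re.findall(r"<HAL>(.*?)</HAL>", text, flags=re.DOTALL)

def tags_preserved(source: str, translation: str) -> bool:
    """Integrity check: the translation must keep every tag properly
    paired and preserve the number of hallucination spans."""
    if translation.count(HAL_OPEN) != translation.count(HAL_CLOSE):
        return False
    return len(hal_spans(source)) == len(hal_spans(translation))
```

A check of this kind would flag the format distortions (dropped or broken tags) that the minimal manual edits then repair.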
RAGTurk
The RAGTurk corpus combines material from two sources:
- CulturaX: 6,305 Turkish web pages, curated via LLM-based filtering (URL triage, content-quality assessment)
- Turkish Wikipedia: 4,891 randomly sampled articles (each >300 characters), converted to Markdown
Documents undergo header-aware chunking at section boundaries, using a tokenizer-agnostic 1,000-character threshold to maintain context locality and limit topic drift—critical for Turkish agglutination. QA pairs are generated from each chunk using a generator LLM (gpt-oss:120B) and then validated by gemini-2.5-flash. For each chunk, at least one factual and one interpretive question is generated. The cross-model validation mitigates but does not eliminate hallucination risk (Köse et al., 3 Feb 2026).
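A minimal sketch of header-aware chunking as described above, splitting at Markdown section headers and packing sections under the 1,000-character threshold (the packing strategy is an illustrative assumption; the paper does not publish its exact algorithm):

```python
def chunk_markdown(doc: str, max_chars: int = 1000) -> list[str]:
    """Header-aware chunking: split at Markdown headers, then pack
    consecutive sections into chunks no longer than max_chars.
    A section that alone exceeds max_chars becomes its own chunk."""
    sections, current = [], []
    for line in doc.splitlines():
        if line.startswith("#") and current:   # section boundary
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks, buf = [], ""
    for sec in sections:
        if buf and len(buf) + len(sec) + 1 > max_chars:
            chunks.append(buf)
            buf = sec
        else:
            buf = f"{buf}\n{sec}" if buf else sec
    if buf:
        chunks.append(buf)
    return chunks
```

Splitting only at section boundaries keeps each chunk topically coherent, which matters for Turkish where agglutinative morphology makes cross-topic chunks harder to match at retrieval time.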
2. Content, Format, and Annotation Schema
Turkish-RAGTruth
The dataset's annotated instances are distributed nearly equally among QA, Data-to-Text, and Summarization. The data is provided in JSONL format, where each record includes:
"prompt": original question or instruction"context": concatenated passages retrieved for grounding"generated_answer": LLM output"labels": token-level hallucination annotation ({ "start", "end", "label": "HAL"/"SUP" })"task_type": "qa", "data2text", or "summary""split": "train" or "test""language": "tr""model_sources": LLM provenance"original_id": link to the source English instance
Token-level labeling tags each character span of the generated answer as "HAL" (hallucinated) or "SUP" (supported). Context passages are chunked up to a maximum token length. The released test split contains 2,700 held-out instances, stratified by task (Taş et al., 22 Sep 2025).
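Given the JSONL schema above, extracting the hallucinated substrings from a record is a straightforward offset lookup; this is a sketch assuming the field names listed earlier:

```python
def hallucinated_spans(record: dict) -> list[str]:
    """Return substrings of the generated answer labeled "HAL",
    using the character offsets from the record's "labels" field."""
    answer = record["generated_answer"]
    return [answer[lab["start"]:lab["end"]]
            for lab in record["labels"] if lab["label"] == "HAL"]
```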
RAGTurk
RAGTurk statistics [Table 3, (Köse et al., 3 Feb 2026)]:
| Source | Articles | Chunks | QA Pairs |
|---|---|---|---|
| CulturaX (Web) | 6,305 | 15,985 | 10,682 |
| Wikipedia | 4,891 | 42,304 | 9,777 |
| Total | 11,196 | 58,289 | 20,459 |
Chunk statistics: average length 562 chars (Web: 696, Wiki: 599); each article yields 1.83 questions on average. Question types: 11,718 factual, 8,741 interpretation. The core annotation is at the (question, answer, context) triple level.
3. Retrieval and Embedding Infrastructure
Both datasets are accompanied by pre-computed dense retrieval indices (Faiss) with 768-dimensional embeddings. Turkish-RAGTruth contexts are indexed via TurkEmbed4Retrieval, provided as a ~2GB zip file. Loading is accomplished using the HuggingFace Datasets and Faiss APIs:

```python
from datasets import load_dataset
import faiss

# Load the annotated train/test splits from the Hugging Face Hub
ds = load_dataset("newmindai/turk_lettucedetect", "tr-ragtruth")
train_ds = ds["train"]
test_ds = ds["test"]

# Load the pre-computed dense retrieval index shipped with the dataset
index = faiss.read_index("faiss_index_tr_ragtruth.idx")
```
RAGTurk relies on embeddinggemma as a dual-encoder dense retriever; no additional task-specific fine-tuning is performed for retrieval models. Similarity scoring uses cosine similarity between query and document embeddings, cos(q, d) = (q · d) / (‖q‖ ‖d‖).
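A minimal sketch of dense retrieval by cosine similarity over pre-computed embeddings (the function and its signature are illustrative, not from the RAGTurk code):

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5):
    """Rank documents by cosine similarity to the query.
    query_vec: (d,) query embedding; doc_matrix: (n, d) document
    embeddings. Returns the top-k indices and their scores."""
    q = query_vec / np.linalg.norm(query_vec)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = docs @ q                      # cosine similarity per document
    order = np.argsort(-scores)[:k]
    return order, scores[order]
```

With unit-normalized vectors, cosine similarity reduces to a dot product, which is also how Faiss inner-product indices are typically used for this purpose.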
Cross-encoder reranking is implemented with ms-marco-MiniLM-L-12-v2 for RAGTurk, providing a scalar relevance score for ranking candidate passages (Köse et al., 3 Feb 2026).
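The reranking step can be sketched generically; `score_fn` below is a placeholder for a cross-encoder such as ms-marco-MiniLM-L-12-v2 scoring a (query, passage) pair, stubbed out here for illustration:

```python
from typing import Callable, Sequence

def rerank(query: str,
           passages: Sequence[str],
           score_fn: Callable[[str, str], float],
           top_n: int = 5) -> list[str]:
    """Re-order retrieved passages by a scalar relevance score.
    score_fn(query, passage) stands in for a cross-encoder that
    jointly encodes the pair and outputs a relevance score."""
    return sorted(passages,
                  key=lambda p: score_fn(query, p),
                  reverse=True)[:top_n]
```

Unlike the dual-encoder retriever, a cross-encoder sees query and passage together, which is why it is run only on a small candidate set after dense retrieval.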
4. Pipeline Integration, Usage, and Best Practices
Turkish-RAGTruth
The standard application pipeline for Turkish RAG with hallucination detection is as follows:
- Encode the query and retrieve the top-k contexts via Faiss
- Concatenate query with contexts and generate answer using an LLM
- Tokenize the generated answer, apply the Turk-LettuceDetect hallucination model to obtain token-level predictions
- Post-process: wrap hallucinated spans with <HAL> tags or filter responses based on hallucination density
Representative integration pseudo-code is provided in the dataset documentation.
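The four steps above can be sketched end-to-end; `retrieve`, `generate`, and `detect` are placeholder callables, and the 0.3 density threshold is an illustrative assumption, not a value from the paper:

```python
def rag_with_hallucination_filter(query, retrieve, generate, detect,
                                  k=5, max_hal_density=0.3):
    """Steps 1-4: retrieve contexts, generate an answer, detect
    token-level hallucinations, then filter by hallucination density.
    detect(answer, contexts) returns one 'HAL'/'SUP' label per token."""
    contexts = retrieve(query, k)                        # step 1
    answer = generate(query, contexts)                   # step 2
    labels = detect(answer, contexts)                    # step 3
    density = labels.count("HAL") / max(len(labels), 1)  # step 4
    if density > max_hal_density:
        return None, density   # reject heavily hallucinated answers
    return answer, density
```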
RAGTurk
RAGTurk provides a schematic for best-practice experimental pipelines at each RAG stage, benchmarking pipeline variants from query refinement (HyDE, clarifications) through dense retrieval, cross-encoder reranking, and various answer generation/refinement modules:
- Default configuration: Cross-encoder reranking with adjacent-chunk augmentation and long-context ordering achieves 84.60% accuracy at roughly twice the token cost of the baseline
- Leaderboard configuration: HyDE augmentation + cross-encoder reranking + post-generation summarization achieves 85.00% accuracy but with a 3.6-fold token increase
- Latency-constrained configuration: Simple LLM query clarification + cross-encoder reranking offers 80.20% accuracy at 1.7× baseline tokens
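Using the figures above, the accuracy/cost trade-off can be framed as a selection under a token budget; the accuracy and relative-cost numbers come from the list, while the helper itself is purely illustrative:

```python
CONFIGS = {  # name: (accuracy %, token cost relative to baseline)
    "default":     (84.60, 2.0),
    "leaderboard": (85.00, 3.6),
    "latency":     (80.20, 1.7),
}

def best_under_budget(configs: dict, max_cost: float):
    """Pick the highest-accuracy configuration whose relative token
    cost does not exceed max_cost; None if nothing fits."""
    feasible = {n: (a, c) for n, (a, c) in configs.items() if c <= max_cost}
    if not feasible:
        return None
    return max(feasible, key=lambda n: feasible[n][0])
```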
The pipeline stages and their modal parameters are visualized in Figure 1 of (Köse et al., 3 Feb 2026), with accuracy/cost trade-offs detailed in Table 5.
5. Hallucination Detection and Evaluation Protocols
Turkish-RAGTruth is specifically annotated for hallucination detection, a critical component for Turkish RAG systems due to high hallucination rates in generative LLM outputs for morphologically complex languages. The Turk-LettuceDetect framework implements token-level classification with three encoder architectures (ModernBERT, TurkEmbed4STS, EuroBERT). The best-performing model (ModernBERT-based) achieves a full test set F1-score of 0.7266 and supports up to 8,192-token contexts in real-time (Taş et al., 22 Sep 2025).
Evaluation in RAGTurk does not use explicit hallucination labels but reports aggregate retrieval and generation performance over stratified QA subsets (n=100), using metrics such as Recall@5, mAP, nDCG@5, MRR, and a composite LLM-judge score (embedding similarity plus LLM evaluation). No exact-match or F1 score is reported for RAGTurk (Köse et al., 3 Feb 2026).
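Two of the listed retrieval metrics, Recall@k and MRR, can be computed directly from a ranked list and a set of binary relevance judgments; this sketch assumes document IDs as strings:

```python
def recall_at_k(ranked_ids: list, relevant_ids: list, k: int = 5) -> float:
    """Fraction of relevant documents appearing in the top-k ranking."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr(ranked_ids: list, relevant_ids: list) -> float:
    """Reciprocal rank of the first relevant document (0 if none)."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0
```

In practice these are averaged over all queries in the stratified subset; nDCG@5 and mAP additionally weight hits by rank position.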
6. Resources, Access, and Practical Considerations
| Resource | URL/Location | License |
|---|---|---|
| Turkish-RAGTruth dataset | https://huggingface.co/datasets/newmindai/turk_lettucedetect | CC BY 4.0 |
| Pre-trained hallucination detection models | https://huggingface.co/newmindai | Apache 2.0 |
| RAGTurk pipeline resources | [Not specified in abstract] | [Not specified] |
| Code & Notebooks (Turkish RAGTruth) | https://github.com/newmind-ai/Turk-LettuceDetect | [Not specified] |
Pre-trained checkpoint models compatible with Turkish-RAGTruth include modernbert-base-tr-uncased-stsb-HD (135M parameters), TurkEmbed4STS-HallucinationDetection (210M), and lettucedect-210m-eurobert-tr-v1 (305M), all available under Apache 2.0. Integration scripts are provided for rapid deployment and local index building (Taş et al., 22 Sep 2025).
7. Special Challenges and Considerations for Turkish RAG
Both corpora emphasize language-specific considerations:
- Morphological complexity in Turkish necessitates context-preserving chunk sizes and cautions against over-stacking generative modules, which can disrupt agglutinative structure (e.g., stripping or distorting case/tense markers).
- Retrieval and reranking strategies benefit from cross-encoder models over raw similarity thresholds.
- Accuracy/cost trade-offs are pipeline-dependent: the higher-accuracy configurations consume several times more tokens per query.
- Hallucination detection remains an open challenge, especially for entity-heavy, synthetic, or low-resource QA contexts.
- Standard best-practice recommendations include use of adjacent-chunk augmentation, limiting heavy LLM stages, short retrieval depths (K=5–10), and LLM-driven query clarification for ambiguous inputs (Taş et al., 22 Sep 2025, Köse et al., 3 Feb 2026).
These Turkish RAG datasets provide a foundation for reproducible, reliable, and domain-sensitive RAG pipeline research and system development in Turkish, with broad implications for other morphologically complex or low-resource languages.