Turkish RAG Datasets for LLM Benchmarking
- Turkish RAG datasets are standardized resources for benchmarking LLM pipelines in Turkish; Turkish-RAGTruth additionally provides token-level hallucination annotations for tasks like QA, data-to-text, and summarization.
- Turkish-RAGTruth uses a two-stage prompting and translation protocol with minimal manual edits, while RAGTurk integrates web and Wikipedia sources with adjacent-chunk augmentation.
- Both datasets ship pre-computed dense retrieval indices, and the accompanying pipelines employ cross-encoder reranking to address the retrieval and generation challenges posed by Turkish's agglutinative structure.
A Turkish Retrieval-Augmented Generation (RAG) dataset serves as a standardized resource for the development, benchmarking, and evaluation of LLM pipelines that integrate external retrieval with natural language generation in the Turkish language. The two primary public resources in this domain are the Turkish-RAGTruth dataset, released in conjunction with the Turk-LettuceDetect hallucination detection framework (Taş et al., 22 Sep 2025), and the RAGTurk corpus, designed for general-purpose RAG pipeline evaluation and best practices in Turkish information access (Köse et al., 3 Feb 2026). Both datasets directly address the unique challenges posed by Turkish—a morphologically rich, low-resource language whose agglutinative structure complicates both retrieval and generation tasks.
1. Dataset Construction and Source Materials
Turkish-RAGTruth
The Turkish-RAGTruth dataset is a machine-translated adaptation of the English RAGTruth benchmark (Niu et al. 2024), which systematically annotates hallucinated versus supported spans in LLM-generated outputs across three tasks:
- Question Answering (QA)
- Data-to-Text generation (Data2Text)
- Summarization
Translation utilized google/gemma-3-27b-it via vLLM on NVIDIA A100 hardware, processing approximately 30 examples in parallel and completing a full pass per language in roughly 12 hours. A two-stage prompting protocol preserved all <HAL> (hallucination) tag positions, employing distinct translation templates for core response segments and for JSON/instructional elements. Quality assurance comprised tag-integrity verification, random sampling–based human review (n≈500), and automated checks for character offset consistency. Minimal manual edits fixed format distortions or punctuation loss within hallucinated spans (affecting <1% of instances) (Taş et al., 22 Sep 2025).
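The tag-integrity verification step above can be illustrated with a minimal sketch; the function names and the exact `<HAL>...</HAL>` tag format are assumptions for illustration, not taken from the released code:

```python
import re

HAL_OPEN, HAL_CLOSE = "<HAL>", "</HAL>"

def hal_spans(text: str) -> list[str]:
    """Return the inner texts of all <HAL>...</HAL> spans."""
    return re.findall(r"<HAL>(.*?)</HAL>", text, flags=re.DOTALL)

def tags_preserved(source: str, translation: str) -> bool:
    """Integrity check: the translation must keep every tag properly
    paired and preserve the number of hallucination spans."""
    if translation.count(HAL_OPEN) != translation.count(HAL_CLOSE):
        return False
    return len(hal_spans(source)) == len(hal_spans(translation))
```

A check of this kind would flag the format distortions (dropped or broken tags) that the minimal manual edits then repair.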
RAGTurk
The RAGTurk corpus combines material from two sources:
- CulturaX: 6,305 Turkish web pages, curated via LLM-based filtering (URL triage, content-quality assessment)
- Turkish Wikipedia: 4,891 randomly sampled articles (each >300 characters), converted to Markdown
Documents undergo header-aware chunking at section boundaries, using a tokenizer-agnostic 1,000-character threshold to maintain context locality and limit topic drift—critical for Turkish agglutination. QA pairs are generated from each chunk using a generator LLM (gpt-oss:120B) and then validated by gemini-2.5-flash. For each chunk, at least one factual and one interpretive question is generated. The cross-model validation mitigates but does not eliminate hallucination risk (Köse et al., 3 Feb 2026).
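A minimal sketch of header-aware chunking as described above, splitting at Markdown section headers and packing sections under the 1,000-character threshold (the packing strategy is an illustrative assumption; the paper does not publish its exact algorithm):

```python
def chunk_markdown(doc: str, max_chars: int = 1000) -> list[str]:
    """Header-aware chunking: split at Markdown headers, then pack
    consecutive sections into chunks no longer than max_chars.
    A section that alone exceeds max_chars becomes its own chunk."""
    sections, current = [], []
    for line in doc.splitlines():
        if line.startswith("#") and current:   # section boundary
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks, buf = [], ""
    for sec in sections:
        if buf and len(buf) + len(sec) + 1 > max_chars:
            chunks.append(buf)
            buf = sec
        else:
            buf = f"{buf}\n{sec}" if buf else sec
    if buf:
        chunks.append(buf)
    return chunks
```

Splitting only at section boundaries keeps each chunk topically coherent, which matters for Turkish where agglutinative morphology makes cross-topic chunks harder to match at retrieval time.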
2. Content, Format, and Annotation Schema
Turkish-RAGTruth
The dataset's annotated instances are distributed nearly equally among QA, Data-to-Text, and Summarization. The data is provided in JSONL format, where each record includes:
"prompt": original question or instruction"context": concatenated passages retrieved for grounding"generated_answer": LLM output"labels": token-level hallucination annotation ({ "start", "end", "label": "HAL"/"SUP" })"task_type": "qa", "data2text", or "summary""split": "train" or "test""language": "tr""model_sources": LLM provenance"original_id": link to the source English instance
Token-level labeling tags each character span of the generated answer as "HAL" (hallucinated) or "SUP" (supported). Context passages are chunked up to a maximum token length. The released test split contains 2,700 held-out instances, stratified by task (Taş et al., 22 Sep 2025).
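Given the JSONL schema above, extracting the hallucinated substrings from a record is a straightforward offset lookup; this is a sketch assuming the field names listed earlier:

```python
def hallucinated_spans(record: dict) -> list[str]:
    """Return substrings of the generated answer labeled "HAL",
    using the character offsets from the record's "labels" field."""
    answer = record["generated_answer"]
    return [answer[lab["start"]:lab["end"]]
            for lab in record["labels"] if lab["label"] == "HAL"]
```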
RAGTurk
RAGTurk statistics [Table 3, (Köse et al., 3 Feb 2026)]:
| Source | Articles | Chunks | QA Pairs |
|---|---|---|---|
| CulturaX (Web) | 6,305 | 15,985 | 10,682 |
| Wikipedia | 4,891 | 42,304 | 9,777 |
| Total | 11,196 | 58,289 | 20,459 |
Chunk statistics: average length 562 chars (Web: 696, Wiki: 599); each article yields 1.83 questions on average. Question types: 11,718 factual, 8,741 interpretation. The core annotation is at the (question, answer, context) triple level.
3. Retrieval and Embedding Infrastructure
Both datasets are accompanied by pre-computed dense retrieval indices (Faiss) with 768-dimensional embeddings. Turkish-RAGTruth contexts are indexed via TurkEmbed4Retrieval, provided as a ~2GB zip file. Loading is accomplished using the HuggingFace Datasets and Faiss APIs:

```python
from datasets import load_dataset
import faiss

# Load the annotated train/test splits from the Hugging Face Hub
ds = load_dataset("newmindai/turk_lettucedetect", "tr-ragtruth")
train_ds = ds["train"]
test_ds = ds["test"]

# Load the pre-computed dense retrieval index shipped with the dataset
index = faiss.read_index("faiss_index_tr_ragtruth.idx")
```
RAGTurk relies on embeddinggemma as a dual-encoder dense retriever; no additional task-specific fine-tuning is performed for retrieval models. Similarity scoring uses cosine similarity between query and document embeddings, cos(q, d) = (q · d) / (‖q‖ ‖d‖).
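A minimal sketch of dense retrieval by cosine similarity over pre-computed embeddings (the function and its signature are illustrative, not from the RAGTurk code):

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5):
    """Rank documents by cosine similarity to the query.
    query_vec: (d,) query embedding; doc_matrix: (n, d) document
    embeddings. Returns the top-k indices and their scores."""
    q = query_vec / np.linalg.norm(query_vec)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = docs @ q                      # cosine similarity per document
    order = np.argsort(-scores)[:k]
    return order, scores[order]
```

With unit-normalized vectors, cosine similarity reduces to a dot product, which is also how Faiss inner-product indices are typically used for this purpose.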
Cross-encoder reranking is implemented with ms-marco-MiniLM-L-12-v2 for RAGTurk, providing a scalar relevance score for ranking candidate passages (Köse et al., 3 Feb 2026).
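The reranking step can be sketched generically; `score_fn` below is a placeholder for a cross-encoder such as ms-marco-MiniLM-L-12-v2 scoring a (query, passage) pair, stubbed out here for illustration:

```python
from typing import Callable, Sequence

def rerank(query: str,
           passages: Sequence[str],
           score_fn: Callable[[str, str], float],
           top_n: int = 5) -> list[str]:
    """Re-order retrieved passages by a scalar relevance score.
    score_fn(query, passage) stands in for a cross-encoder that
    jointly encodes the pair and outputs a relevance score."""
    return sorted(passages,
                  key=lambda p: score_fn(query, p),
                  reverse=True)[:top_n]
```

Unlike the dual-encoder retriever, a cross-encoder sees query and passage together, which is why it is run only on a small candidate set after dense retrieval.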
4. Pipeline Integration, Usage, and Best Practices
Turkish-RAGTruth
The standard application pipeline for Turkish RAG with hallucination detection is as follows:
- Encode the query and retrieve the top-k contexts via Faiss
- Concatenate query with contexts and generate answer using an LLM
- Tokenize the generated answer, apply the Turk-LettuceDetect hallucination model to obtain token-level predictions
- Post-process: wrap hallucinated spans with <HAL> tags or filter responses based on hallucination density
Representative integration pseudo-code is provided in the dataset documentation.
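The four steps above can be sketched end-to-end; `retrieve`, `generate`, and `detect` are placeholder callables, and the 0.3 density threshold is an illustrative assumption, not a value from the paper:

```python
def rag_with_hallucination_filter(query, retrieve, generate, detect,
                                  k=5, max_hal_density=0.3):
    """Steps 1-4: retrieve contexts, generate an answer, detect
    token-level hallucinations, then filter by hallucination density.
    detect(answer, contexts) returns one 'HAL'/'SUP' label per token."""
    contexts = retrieve(query, k)                        # step 1
    answer = generate(query, contexts)                   # step 2
    labels = detect(answer, contexts)                    # step 3
    density = labels.count("HAL") / max(len(labels), 1)  # step 4
    if density > max_hal_density:
        return None, density   # reject heavily hallucinated answers
    return answer, density
```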
RAGTurk
RAGTurk provides a schematic for best-practice experimental pipelines at each RAG stage, benchmarking pipeline variants from query refinement (HyDE, clarifications) through dense retrieval, cross-encoder reranking, and various answer generation/refinement modules:
- Default configuration: Cross-encoder reranking with adjacent-chunk augmentation and long-context ordering achieves 84.60% accuracy at roughly twice the token cost of the baseline
- Leaderboard configuration: HyDE augmentation + cross-encoder reranking + post-generation summarization achieves 85.00% accuracy but with a 3.6-fold token increase
- Latency-constrained configuration: Simple LLM query clarification + cross-encoder reranking offers 80.20% accuracy at 1.7× baseline tokens
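Using the figures above, the accuracy/cost trade-off can be framed as a selection under a token budget; the accuracy and relative-cost numbers come from the list, while the helper itself is purely illustrative:

```python
CONFIGS = {  # name: (accuracy %, token cost relative to baseline)
    "default":     (84.60, 2.0),
    "leaderboard": (85.00, 3.6),
    "latency":     (80.20, 1.7),
}

def best_under_budget(configs: dict, max_cost: float):
    """Pick the highest-accuracy configuration whose relative token
    cost does not exceed max_cost; None if nothing fits."""
    feasible = {n: (a, c) for n, (a, c) in configs.items() if c <= max_cost}
    if not feasible:
        return None
    return max(feasible, key=lambda n: feasible[n][0])
```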
The pipeline stages and their modal parameters are visualized in Figure 1 of (Köse et al., 3 Feb 2026), with accuracy/cost trade-offs detailed in Table 5.
5. Hallucination Detection and Evaluation Protocols
Turkish-RAGTruth is specifically annotated for hallucination detection, a critical component for Turkish RAG systems due to high hallucination rates in generative LLM outputs for morphologically complex languages. The Turk-LettuceDetect framework implements token-level classification with three encoder architectures (ModernBERT, TurkEmbed4STS, EuroBERT). The best-performing model (ModernBERT-based) achieves a full test set F1-score of 0.7266 and supports up to 8,192-token contexts in real-time (Taş et al., 22 Sep 2025).
Evaluation in RAGTurk does not use explicit hallucination labels but reports aggregate retrieval and generation performance over stratified QA subsets (n=100), using metrics such as Recall@5, mAP, nDCG@5, MRR, and a composite LLM-judge score (embedding similarity plus LLM evaluation). No exact-match or F1 score is reported for RAGTurk (Köse et al., 3 Feb 2026).
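Two of the listed retrieval metrics, Recall@k and MRR, can be computed directly from a ranked list and a set of binary relevance judgments; this sketch assumes document IDs as strings:

```python
def recall_at_k(ranked_ids: list, relevant_ids: list, k: int = 5) -> float:
    """Fraction of relevant documents appearing in the top-k ranking."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr(ranked_ids: list, relevant_ids: list) -> float:
    """Reciprocal rank of the first relevant document (0 if none)."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0
```

In practice these are averaged over all queries in the stratified subset; nDCG@5 and mAP additionally weight hits by rank position.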
6. Resources, Access, and Practical Considerations
| Resource | URL/Location | License |
|---|---|---|
| Turkish-RAGTruth dataset | https://huggingface.co/datasets/newmindai/turk_lettucedetect | CC BY 4.0 |
| Pre-trained hallucination detection models | https://huggingface.co/newmindai | Apache 2.0 |
| RAGTurk pipeline resources | [Not specified in abstract] | [Not specified] |
| Code & Notebooks (Turkish RAGTruth) | https://github.com/newmind-ai/Turk-LettuceDetect | [Not specified] |
Pre-trained checkpoint models compatible with Turkish-RAGTruth include modernbert-base-tr-uncased-stsb-HD (135M parameters), TurkEmbed4STS-HallucinationDetection (210M), and lettucedect-210m-eurobert-tr-v1 (305M), all available under Apache 2.0. Integration scripts are provided for rapid deployment and local index building (Taş et al., 22 Sep 2025).
7. Special Challenges and Considerations for Turkish RAG
Both corpora emphasize language-specific considerations:
- Morphological complexity in Turkish necessitates context-preserving chunk sizes and cautions against over-stacking generative modules, which can disrupt agglutinative structure (e.g., stripping or distorting case/tense markers).
- Retrieval and reranking strategies benefit from cross-encoder models over raw similarity thresholds.
- Accuracy/cost trade-offs are pipeline-dependent: the higher-accuracy configurations consume several times more tokens per query.
- Hallucination detection remains an open challenge, especially for entity-heavy, synthetic, or low-resource QA contexts.
- Standard best-practice recommendations include use of adjacent-chunk augmentation, limiting heavy LLM stages, short retrieval depths (K=5–10), and LLM-driven query clarification for ambiguous inputs (Taş et al., 22 Sep 2025, Köse et al., 3 Feb 2026).
These Turkish RAG datasets provide a foundation for reproducible, reliable, and domain-sensitive RAG pipeline research and system development in Turkish, with broad implications for other morphologically complex or low-resource languages.