Arabic Text-Grounded SFT Reasoning Pairs
- Arabic text-grounded SFT reasoning pairs are structured datasets that combine language-specific prompts, grounded source passages, and stepwise chain-of-thought reasoning to enhance LLM precision.
- Methodologies involve expert curation from religious texts, multimodal document analysis, and dialectal adaptation, ensuring cultural alignment and coherent evidence tracing.
- Rigorous evaluation protocols using inter-annotator agreement and performance metrics demonstrate improved accuracy in Islamic law, cultural knowledge, and NLI domains.
Arabic Text-Grounded SFT Reasoning Pairs are large-scale collections of supervised fine-tuning instances in which Arabic-language questions, instructions, or prompts are paired with one or more grounded source passages (such as text snippets, tables, or images), a structured chain-of-thought (CoT) style reasoning sequence, and a final answer or label. These corpora enable the fine-tuning and evaluation of LLMs and multimodal models on Arabic stepwise reasoning, evidence tracing, and culturally situated question answering. Recent benchmark datasets span Islamic law, multimodal visual/textual reasoning, and dialectal cultural knowledge, establishing rigorous frameworks for representation, annotation, and downstream performance analysis.
1. Corpus Construction and Data Sources
Text-grounded SFT reasoning pairs for Arabic originate from diverse, high-quality sources, each selected and segmented to support instructive, evidence-based question answering.
- Religious Texts and Legal Sources: The ISLAMICFAITHQA suite compiles 25,000 Arabic SFT pairs grounded in (i) Qur’ān verses (6,236 ayat, each indexed by surah/ayah), (ii) canonical hadīth collections (with isnād and text), and (iii) major tafsīr commentaries to support multi-step reasoning about Islamic law, ritual, and doctrine. Candidate passages are extracted with domain-relevant keyword filters and expert curation, prioritizing items tagged as “hard” or “very hard” to ensure a reasoning-intensive distribution. Three subject-area specialists review all passages for doctrinal validity (Bhatia et al., 12 Jan 2026).
- Multimodal and Domain-Varied Benchmarks: The ARB corpus collects 1,356 multimodal samples (documents, OCR scans, images, tables) spanning 11 domains—of which the text-only domains (Document Understanding, OCR) yield pure text-grounded reasoning chains. Textual inputs are drawn from translated/localized English benchmarks, Arabic QA tasks (CAMEL-Bench), synthetic LLM generations, and tool-assisted creation for tabular/diagrammatic input (Ghaboura et al., 22 May 2025).
- Cultural Question-Answering and Dialectal Variety: Open-ended Arabic cultural QA datasets begin from Modern Standard Arabic (MSA) MCQs (e.g., PalmX-GC, 2,000+ entries), producing parallel SFT pairs across five major Arabic varieties (MSA, Egyptian, Levantine, Gulf, Maghrebi) and English. Dialectal translations and open-ended conversions are performed by GPT-4.1 under controlled prompts with semantic equivalence, followed by QA chain-of-thought rationale generation and verification (Bhatti et al., 28 Oct 2025).
- Natural Language Inference: The Arabic NLI SFT sets compile premise–hypothesis–label triples from XNLI-Arabic, SNLI (translated), and machine-translated arNLI, yielding balanced splits for entailment, contradiction, and neutral relations (total: 14,758 examples) (Deen et al., 2023).
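The keyword-plus-difficulty filtering described for ISLAMICFAITHQA can be illustrated with a minimal Python sketch; the keyword list, field names, and difficulty labels below are illustrative assumptions, not the paper's actual filters:

```python
# Sketch of keyword-based candidate selection with a reasoning-intensive
# difficulty filter. DOMAIN_KEYWORDS and the record fields are hypothetical.

DOMAIN_KEYWORDS = {"صلاة", "زكاة", "صوم", "حج"}  # example worship/fiqh terms
HARD_LABELS = {"hard", "very hard"}

def select_candidates(passages):
    """Keep passages that match a domain keyword and are tagged as hard."""
    selected = []
    for p in passages:
        has_keyword = any(kw in p["text"] for kw in DOMAIN_KEYWORDS)
        is_hard = p.get("difficulty") in HARD_LABELS
        if has_keyword and is_hard:
            selected.append(p)
    return selected

passages = [
    {"text": "حكم الصلاة في السفر", "difficulty": "very hard"},
    {"text": "تاريخ الأندلس", "difficulty": "hard"},      # no domain keyword
    {"text": "شروط الزكاة", "difficulty": "easy"},        # not reasoning-intensive
]
print([p["text"] for p in select_candidates(passages)])  # keeps only the first
```

Expert review (the three-specialist doctrinal check) would then run over the surviving candidates, which this sketch does not model.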
2. Annotation, Chain-of-Thought Generation, and Quality Assurance
Annotation and verification protocols are engineered to maximize the faithfulness, cultural alignment, and logical completeness of each SFT pair.
- LLM-Assisted Generation and Human Adjudication: For Islamic jurisprudence and Creed SFT pairs, Arabic-fluent LLMs are directed by CoT-specific prompts to generate (a) a well-grounded question, (b) an explicit stepwise reasoning chain citing specific texts (e.g., Qur’ān, hadīth), and (c) a final canonical answer. Each example is cross-checked by three domain experts, with subsample validation assessments and Cohen’s κ measuring inter-annotator reliability (κ = 0.78 for reasoning, 0.72 for answers) (Bhatia et al., 12 Jan 2026).
- Multiphase Curation and Verification: In ARB, each multimodal sample receives two rounds of review by native Arabic annotators, jointly scoring grammatical fluency, logical coherence, faithfulness, “commonsense” reasoning, and cultural grounding. Items below a quality threshold are redrafted or removed. Krippendorff’s α = 0.8356 for human agreement, increasing to 0.8762 with GPT-4o as a judge (Ghaboura et al., 22 May 2025).
- Dialectal and Cultural Transfer: For “Beyond MCQ”, MCQ–OEQ conversion and dialectal adaptation pass through controlled LLM prompts and human review (Likert ≈ 4.4/5 for translation adequacy). Chain-of-thought rationales are then sampled, filtered, and scored for gold answer alignment (automated match Jaccard ≥ 0.75, confidence threshold σ ≥ 0.8), with only verified instances included in the SFT release (Bhatti et al., 28 Oct 2025).
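The automated answer-alignment filter reported for “Beyond MCQ” can be sketched as a token-level Jaccard check; whitespace tokenization is an assumption, only the ≥ 0.75 threshold comes from the paper:

```python
# Sketch of the Jaccard-overlap gate between a candidate CoT's final answer
# and the gold answer. Whitespace tokenization is an assumed simplification.

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def keep_instance(candidate_answer: str, gold_answer: str,
                  threshold: float = 0.75) -> bool:
    """Admit the SFT instance only if the answers overlap enough."""
    return jaccard(candidate_answer, gold_answer) >= threshold

# 3 shared tokens out of 4 distinct -> 0.75, exactly at the threshold
print(keep_instance("عاصمة مصر هي القاهرة", "عاصمة مصر القاهرة"))  # True
```

The paper's additional confidence gate (σ ≥ 0.8) would be applied on top of this check.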
3. Schema, Format, and Logical Form
Each SFT reasoning pair is presented in a structured JSON schema designed for ingestion by LLM SFT pipelines and to maximize traceability between input, reasoning, evidence, and output.
| Field | Description | Example |
|---|---|---|
| instruction | User prompt/question | "متى يبدأ وقت صلاة الفجر؟" |
| context | Grounded text (ayah/hadith/table/etc.) | "آية 2:187: ... وَأَقِمِ الصَّلَاةَ لِدُلُوكِ الشَّمْسِ ..." |
| reasoning | Ordered steps, often with citations | ["1) بداية الفجر الصادق.", "2) الآية 2:187 تدل على ذلك.", "3) العلماء اتفقوا على هذا التوقيت."] |
| answer | Final claim/label | "وقت بداية صلاة الفجر هو عند الفجر الصادق." |
- In ARB, an analogous format is used: “prompt”, ordered “steps” (each with a discrete “action” label such as “identify_text”, “arithmetic”, “interpret”), and “answer”. There is no fixed step count, but most reasoning chains span 2–6 steps (Ghaboura et al., 22 May 2025).
- For NLI, triples are recast as instruction–input–output SFT pairs. The template is:
- Instruction: “حدد العلاقة المنطقية بين الجملتين الآتيتين.” (“Determine the logical relationship between the following two sentences.”)
- Input: Premise: “…”, Hypothesis: “…”
- Output: “تضمن” | “محايد” | “تناقض” (entailment | neutral | contradiction) (Deen et al., 2023).
- Dialectal SFT pairs for cultural QA are defined as tuples (q, r, a), where q is the open-ended question, r the CoT rationale, and a the gold answer, with parallel realizations in MSA and four dialects (Bhatti et al., 28 Oct 2025).
- Complex logical forms often emerge, especially in religious domains: Premise(n) (text citation), Inference(n), then Conclusion, employing explicit connective structures such as “إذ” (“since”) and “لأنه” (“because”) (Bhatia et al., 12 Jan 2026).
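A minimal validator makes the tabulated ingestion contract concrete; the field names come from the schema table above, while the requirement that "reasoning" be a list of strings is an assumption consistent with the reported step chains:

```python
# Minimal schema check for one SFT reasoning pair. Field names follow the
# schema table; the list-of-strings constraint on "reasoning" is assumed.
import json

REQUIRED_FIELDS = {"instruction": str, "context": str, "reasoning": list, "answer": str}

def validate_pair(record: dict) -> list:
    """Return a list of schema violations (empty list means valid)."""
    errors = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"wrong type for {field}: expected {ftype.__name__}")
    steps = record.get("reasoning")
    if isinstance(steps, list) and not all(isinstance(s, str) for s in steps):
        errors.append("reasoning steps must be strings")
    return errors

record = json.loads('''{
  "instruction": "متى يبدأ وقت صلاة الفجر؟",
  "context": "آية 2:187: ...",
  "reasoning": ["1) بداية الفجر الصادق."],
  "answer": "عند الفجر الصادق."
}''')
print(validate_pair(record))  # [] -> valid
```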
4. Domain Coverage, Statistics, and Reasoning Complexity
Arabic text-grounded SFT datasets now encompass an array of domains, linguistic varieties, and reasoning complexities.
- Islamic QA: 78% of ISLAMICFAITHQA SFTs are on Worship & Fiqh, 12% Creed & Morality, 10% Qur’ān/Hadith Studies. Reasoning chains average 3.8 steps (45% 2–3 steps, 40% 4–5 steps, 15% 6–7 steps) (Bhatia et al., 12 Jan 2026).
- ARB Domains: 1,356 samples, 5,119 step-action pairs, 11 domains (text-only Document Understanding and OCR yield pure text chains). Mean reasoning depth ≈3.78 steps (per sample), with varied task-relevant actions (Ghaboura et al., 22 May 2025).
- Cultural QA and Dialectal Coverage: 2,000 SFT pairs per dialect (MSA, Egyptian, Levantine, Gulf, Maghrebi), plus English, for a total of 10,000. Average CoT rationale ≈45 tokens (200 chars), topics uniformly distributed across history, geography, arts, cuisine, customs, and notable figures (Bhatti et al., 28 Oct 2025).
- NLI: 14,758 triples; label splits near 1/3 for each of entailment, contradiction, and neutral. Used for both reasoning classification and as edge-case SFTs for fine-tuning (Deen et al., 2023).
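The ISLAMICFAITHQA step-depth buckets can be cross-checked arithmetically; using bucket midpoints is an assumption (the within-bucket distribution is not reported), and it lands close to the stated mean of 3.8 steps:

```python
# Sanity check: combine the reported bucket shares (45% at 2-3 steps,
# 40% at 4-5, 15% at 6-7) with bucket midpoints, an assumed simplification.

buckets = [(0.45, 2.5), (0.40, 4.5), (0.15, 6.5)]  # (share, midpoint of range)
mean_depth = sum(share * mid for share, mid in buckets)
print(round(mean_depth, 2))  # 3.9, close to the reported mean of 3.8
```

The small gap to 3.8 is consistent with a within-bucket skew toward the lower end of each range.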
5. Evaluation, Metrics, and Impact on Model Behavior
Text-grounded Arabic SFT pairs provide detailed frameworks for downstream evaluation, interpretability, and improvement of LLM reasoning.
- Quality Metrics: ARB scores each reasoning chain along five dimensions (Faithfulness, Informativeness, Coherence, Commonsense, Reasoning Alignment), with the final reasoning score given by the weighted sum S = Σ_i w_i s_i over the five dimension scores s_i, where w_i ≥ 0 and Σ_i w_i = 1. Inter-annotator agreement (Krippendorff’s α) reaches 0.8356 (human only), rising to 0.8762 with GPT-4o included as a judge (Ghaboura et al., 22 May 2025).
- Supervised Fine-Tuning (SFT) Effect: In ISLAMICFAITHQA, fine-tuning on the 25,000 SFT pairs increases accuracy on QuranicQA by ΔAcc = +1.05 for a 9B-parameter model (Fanar-1) and ΔAcc = +8.70 for Qwen3-4B (Bhatia et al., 12 Jan 2026). Downstream, agentic tool-driven retrieval (agentic RAG) further boosts correctness and sharply reduces both ungrounded answers and unwarranted answer attempts in cases where the model should abstain.
- NLI Evaluation: AraBERT and XLM-RoBERTa models fine-tuned on SFT NLI pairs reach test-set accuracy/F1 of 75.3/75.4% (AraBERT baseline) and 78.7/78.8% (XLM-R baseline); adding multi-task NER raises contradiction detection to 88.1%. Contradiction detection benefits particularly from explicit entity (person/place) recognition and handling of numerical clashes. CoT rationales are reserved for edge cases (negation, quantifier scope, temporal shifts) (Deen et al., 2023).
- Dialectal CoT Benchmarking: On “Beyond MCQ”, CoT fine-tuning slightly raises judged correctness scores (ΔJ ≈ +0.14), while token-level n-gram metrics may decrease (reflecting shorter yet more grounded rationales). Multi-dialectal QA highlights persistent LLM competence gaps on dialect and cultural knowledge (Bhatti et al., 28 Oct 2025).
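ARB's five-dimension reasoning score can be sketched as a weighted sum; uniform weights are an assumption here, with only the non-negativity and sum-to-one constraints on the weights taken as given:

```python
# Sketch of a weighted five-dimension reasoning score. The uniform default
# weights are an assumption; ARB's actual weights are not reproduced here.

DIMENSIONS = ["faithfulness", "informativeness", "coherence",
              "commonsense", "reasoning_alignment"]

def reasoning_score(scores, weights=None):
    """Weighted sum of the five per-dimension scores (each in [0, 1])."""
    if weights is None:
        weights = {d: 1 / len(DIMENSIONS) for d in DIMENSIONS}
    assert all(w >= 0 for w in weights.values())
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[d] * scores[d] for d in DIMENSIONS)

chain = {"faithfulness": 0.9, "informativeness": 0.8, "coherence": 1.0,
         "commonsense": 0.7, "reasoning_alignment": 0.85}
print(round(reasoning_score(chain), 3))  # 0.85 with uniform weights
```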
6. Representative Examples
The following examples illustrate typical SFT input–reasoning–output structures reported in the literature:
Islamic Law, Worship:
```json
{
  "instruction": "متى يبدأ وقت صلاة الفجر؟",
  "context": "آية 2:187: ...",
  "reasoning": [
    "1) بداية الفجر الصادق.",
    "2) الآية 2:187 تدل على ذلك.",
    "3) العلماء اتفقوا على هذا التوقيت."
  ],
  "answer": "وقت بداية صلاة الفجر هو عند الفجر الصادق."
}
```
Multimodal Table Reasoning (ARB):

```json
{
  "prompt": "في الجدول التالي، ما مجموع مبيعات يناير وفبراير؟",
  "steps": [
    {"step": "أنظر إلى عمود 'يناير' وأجد القيمة ٥٠٠.", "action": "read_cell"},
    {"step": "أنظر إلى عمود 'فبراير' وأجد القيمة ٤٠٠.", "action": "read_cell"},
    {"step": "أجمع ٥٠٠ + ٤٠٠ = ٩٠٠.", "action": "arithmetic"}
  ],
  "answer": "٩٠٠"
}
```
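For fine-tuning, a pair in either schema is typically flattened into a single training string; the template below (Arabic section headers and field order) is purely illustrative and not taken from any of the cited papers:

```python
# Sketch of serializing one SFT pair into a flat training string.
# The section headers ("السؤال" = question, "السياق" = context,
# "خطوات الاستدلال" = reasoning steps, "الإجابة" = answer) are assumed.

def to_training_text(pair: dict) -> str:
    steps = "\n".join(pair["reasoning"])
    return (
        f"السؤال: {pair['instruction']}\n"
        f"السياق: {pair['context']}\n"
        f"خطوات الاستدلال:\n{steps}\n"
        f"الإجابة: {pair['answer']}"
    )

pair = {
    "instruction": "متى يبدأ وقت صلاة الفجر؟",
    "context": "آية 2:187: ...",
    "reasoning": ["1) بداية الفجر الصادق.", "2) الآية 2:187 تدل على ذلك."],
    "answer": "وقت بداية صلاة الفجر هو عند الفجر الصادق.",
}
text = to_training_text(pair)
print(text.count("\n"))  # 5 line breaks: question, context, header, 2 steps, answer
```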
7. Distinctive Features and Research Significance
Arabic text-grounded SFT reasoning pairs enable high-fidelity, stepwise modeling in domains that historically lack robust supervised corpora. Their design emphasizes:
- Grounding and Transparency: Each answer is explicitly tied to verifiable source passages, structured for both model tracing and error analysis (notably for reducing hallucination and enforcing abstention when evidence is lacking) (Bhatia et al., 12 Jan 2026).
- Cultural and Dialectal Breadth: Parallel SFT datasets in multiple dialects and knowledge categories facilitate fine-grained cultural adaptation and expose persistent model weaknesses on Arabic linguistic phenomena (Bhatti et al., 28 Oct 2025).
- Evaluation Rigor: Metrics such as CoT coherence, faithfulness, and human and LLM-judge inter-annotator agreement set rigorous standards for claim verification, reasoning completeness, and model robustness (Ghaboura et al., 22 May 2025).
- Workflow Replicability: Documented sampling, prompting, annotation, and schema protocols allow reproducibility and modular extensibility to new domains or language varieties.
A plausible implication is that such SFT collections enable end-to-end Arabic LLMs with explainable, faithful, and culturally aligned behaviors, positioning them for deployment in high-stakes domains (law, religion, education, government) where ungrounded outputs are unacceptable.