Alif Urdu-Instruct Dataset
- Alif Urdu-Instruct is a curated instruction–response corpus designed specifically for fine-tuning Urdu models with culturally and ethically aligned prompts.
- It comprises 51,686 examples averaging 23 tokens each, including monolingual Urdu and bilingual Urdu–English content to cover diverse NLP tasks.
- A modified self-instruct pipeline with synthetic data generation and human refinement drives empirical gains, improving key task performance by up to +58.4 points.
The Alif Urdu-Instruct dataset is a systematically curated, instruction–response corpus designed for supervised fine-tuning of LLMs in Urdu, a critically underrepresented language in NLP with over 230 million speakers. Originating from a collaboration led by Traversaal AI, the dataset forms the backbone for fine-tuning state-of-the-art models like Alif-1.0-8B-Instruct and Qalb, driving substantial empirical gains on diverse Urdu-specific tasks. While the Qalb paper treats Alif Urdu-Instruct as a black-box component and does not reprint granular construction details, primary construction and methodology descriptions are provided in the Alif-1.0-8B-Instruct release (Shafique et al., 10 Oct 2025), enabling a technical reconstruction of its scale, architecture, and impact.
1. Dataset Composition and Linguistic Scope
Alif Urdu-Instruct consists of 51,686 instruction–response exemplars, summing to approximately 1.2 million tokens, with an average of about 23 tokens per prompt–response pair (Shafique et al., 10 Oct 2025). The dataset is encoded primarily in the Perso-Arabic Nastaliq script. Monolingual Urdu constitutes roughly 60% of the corpus, while bilingual content—comprising both translation pairs (Urdu↔English) and code-mixed prompts—accounts for the remaining 40%. The Qalb paper specifies that code-switching occurs only when naturally present in professional or user-generated prompts, but does not quantify this fraction (Hassan et al., 13 Jan 2026).
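The reported aggregates are internally consistent, which can be checked with simple arithmetic (counts from the text above; exact per-pair token lengths are not published, so this is only a consistency check, and the 60/40 split is approximate):

```python
# Sanity-check the reported corpus statistics (figures from the text above).
n_examples = 51_686
avg_tokens = 23  # approximate average tokens per prompt-response pair

total_tokens = n_examples * avg_tokens
print(f"Estimated total tokens: {total_tokens:,}")  # ~1.19M, consistent with the reported ~1.2M

# Approximate monolingual/bilingual split (60% / 40%, as reported)
mono = round(n_examples * 0.60)
bilingual = n_examples - mono
print(f"Monolingual Urdu: ~{mono:,}   Bilingual: ~{bilingual:,}")
```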
A strict domain breakdown (e.g., literary, journalistic, social, government) is not available in the Qalb paper, and no subdivision by train/validation/test is reported.
2. Task Typology and Distribution
Alif Urdu-Instruct is partitioned into seven major instruction-driven task categories (Shafique et al., 10 Oct 2025):
| Task Category | Example Count |
|---|---|
| Generation | 5,907 |
| Ethics & Safety | 9,002 |
| Question Answering | 8,177 |
| Reasoning (CoT) | 9,590 |
| Bilingual Translation | 10,001 |
| Classification | 4,662 |
| Sentiment Analysis | 4,347 |
Prompts and responses across categories encode a breadth of functional and cognitive abilities, notably including Urdu-native chain-of-thought (CoT) reasoning, ethical reasoning contextualized for South Asian norms, culturally anchored sentiment expressions, and rigorous translation tasks spanning both directions. While the downstream evaluation suite in Qalb covers these same axes, the precise crosswalk from this dataset's internal splits to benchmark coverage is not specified (Hassan et al., 13 Jan 2026). A plausible implication is that categorical balance was driven by downstream diversity requirements.
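The category counts in the table sum exactly to the reported corpus total; the following snippet verifies this and derives per-category shares (the counts are copied from the table, while the percentages are computed here and are not reported in the source papers):

```python
# Task-category counts from the table above (Shafique et al., 10 Oct 2025).
counts = {
    "Generation": 5_907,
    "Ethics & Safety": 9_002,
    "Question Answering": 8_177,
    "Reasoning (CoT)": 9_590,
    "Bilingual Translation": 10_001,
    "Classification": 4_662,
    "Sentiment Analysis": 4_347,
}

total = sum(counts.values())
assert total == 51_686  # matches the reported corpus size exactly

# Derived shares, largest first (not reported in the source papers)
for task, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{task:22s} {n:6,d}  ({n / total:5.1%})")
```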
3. Data Generation, Quality Control, and Preprocessing
Construction of Alif Urdu-Instruct employs a "modified self-instruct" protocol, adapted from English-centric self-instruct methodologies. This pipeline includes:
- Category-specific prompt templates: For each task, a unique Urdu (or Urdu–English) instruction schema was crafted.
- Seed pools: Human-selected and model-suggested seeds extend coverage across topic and style. Each batch samples six seeds (four human-written, two machine-generated).
- Synthetic generation: GPT-4o synthesizes twenty (instruction, response) pairs per query, which are filtered in real time.
- Global pool deduplication: New candidates are rejected if their prompt ROUGE-L similarity with existing entries exceeds 0.7, ensuring diversity and eliminating semantically redundant items across the global pool.
- Automated post-filtering: Prompts outside the [3, 150]-word range, those containing disallowed character classes, and those flagged by a keyword blacklist are discarded.
- Human refinement: 20 native Urdu annotators, tested on error correction and translation tasks and compensated at 1,000 PKR/hour, reviewed and standardized prompt–response pairs for grammar, factual consistency, and tone, following a published guideline set.
No additional cleaning, normalization, or script-specific tokenization tailored to Urdu Nastaliq is described in the Qalb paper; all post-processing filters pertain to the synthetic data generation stage (Shafique et al., 10 Oct 2025, Hassan et al., 13 Jan 2026).
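The filtering stages above can be sketched in code. This is a minimal reconstruction under stated assumptions: the papers do not publish a reference implementation, ROUGE-L is computed here as an LCS-based F-measure over whitespace tokens (the exact variant is not specified), and `BLACKLIST` is a hypothetical placeholder:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(cand, ref):
    """ROUGE-L F-measure over whitespace tokens (assumed metric variant)."""
    a, b = cand.split(), ref.split()
    if not a or not b:
        return 0.0
    lcs = lcs_len(a, b)
    p, r = lcs / len(a), lcs / len(b)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

BLACKLIST = {"image", "picture"}  # hypothetical placeholder keywords

def accept(prompt, pool, sim_threshold=0.7, min_words=3, max_words=150):
    """Apply the word-count, blacklist, and global-pool dedup filters."""
    words = prompt.split()
    if not (min_words <= len(words) <= max_words):
        return False
    if any(w.lower() in BLACKLIST for w in words):
        return False
    # Reject candidates too similar to any prompt already in the global pool
    return all(rouge_l(prompt, p) <= sim_threshold for p in pool)
```

Accepted prompts would then be appended to `pool`, so that every later candidate is deduplicated against the full global pool, as the pipeline describes.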
4. Formatting, Annotation Protocol, and Prompt Schema
The fine-tuning interface employs the Llama-3 chat format, using distinct control tokens to demarcate the System, User, and Assistant roles. The canonical schema for every example during supervised fine-tuning is:
```
<|start_header_id|>System: "You are a helpful Urdu assistant."
<|start_header_id|>User: "<instruction text>"
<|start_header_id|>Assistant: "<response text>"
```
Loss masking ensures that gradient updates apply solely to the Assistant segments. No further consistency templates, annotation guidelines, or error ontologies specific to Alif Urdu-Instruct are provided in the cited works. The published dataset does not include explicit breakdowns of label categories beyond these protocol definitions. The Qalb paper and its appendix do not reprint any sample entry or illustrative instruction–response pairs from Alif Urdu-Instruct (Hassan et al., 13 Jan 2026).
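The loss-masking step can be illustrated as follows. A minimal sketch, assuming the Hugging Face-style convention in which label `-100` excludes a position from the cross-entropy loss; the toy whitespace tokenizer and role-tag formatting stand in for the actual Llama-3 tokenizer and template:

```python
IGNORE_INDEX = -100  # conventional "no loss" label in HF-style trainers (assumption)

def build_labels(segments, tokenize=str.split):
    """Concatenate (role, text) segments; mask everything except Assistant tokens.

    Returns (tokens, labels): labels copy the token at Assistant positions and
    hold IGNORE_INDEX elsewhere, so gradients flow only through the response.
    """
    tokens, labels = [], []
    for role, text in segments:
        seg = tokenize(f"<|start_header_id|>{role}: {text}")
        tokens.extend(seg)
        if role == "Assistant":
            labels.extend(seg)  # supervised: the model is trained to predict these
        else:
            labels.extend([IGNORE_INDEX] * len(seg))  # masked out of the loss
    return tokens, labels

example = [
    ("System", "You are a helpful Urdu assistant."),
    ("User", "<instruction text>"),
    ("Assistant", "<response text>"),
]
toks, labs = build_labels(example)
```

Only the Assistant segment contributes supervised targets; the System and User segments are present in the input but carry `IGNORE_INDEX` labels throughout.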
5. Evaluation, Empirical Role, and Downstream Impact
No standalone human or automated quality assessment is published for Alif Urdu-Instruct alone. All quality attribution is inferred from model performance following supervised fine-tuning on this dataset. The Qalb model, after absorbing Alif Urdu-Instruct as its sole supervised corpus, achieves a weighted average of 90.34 on downstream Urdu-specific benchmarks, outperforming Alif-1.0-Instruct (87.1) by +3.24 points, and the base LLaMA-3.1-8B-Instruct by +44.64 points (Hassan et al., 13 Jan 2026). Category-wise improvements (measured for Alif-1.0-8B-Instruct relative to its LLaMA-3.1-8B-Instruct baseline) are reported in the original Alif paper (Shafique et al., 10 Oct 2025):
| Task | Δ (Base → Alif) |
|---|---|
| Generation | +47.4 |
| Ethics | +58.4 |
| QA | +43.3 |
| Reasoning | +37.9 |
| Translation | +30.4 |
| Classification | +32.5 |
| Sentiment | +40.0 |
A plausible implication is that the combination of robust synthetic generation (enhancing diversity and coverage), human refinement (increasing cultural/ethical alignment), and chat-style prompt formatting collectively drive these empirical advances.
6. Cultural Alignment, Safety, and Societal Sensitivity
All content within the Ethics and Safety category is templated to reflect sociocultural norms relevant to Urdu speakers, emphasizing deference and gender sensitivity. Automated and human filtering jointly suppress content in violation of regional taboos or safety standards—first through GPT-4o’s native content filters, then via a manual blacklist and explicit reviewer protocols. Annotators are tasked with rejecting or editing examples that may reinforce hate speech or bias. This multi-stage pipeline ensures the dataset’s suitability for societal deployment in South Asian contexts (Shafique et al., 10 Oct 2025).
7. Public Availability, Utilization, and Future Prospects
The Alif Urdu-Instruct dataset, along with corresponding models and code, is open-sourced by the Traversaal AI group at https://github.com/traversaal-ai/alif-urdu-LLM (Shafique et al., 10 Oct 2025). Its adoption as the default fine-tuning corpus for Urdu LLMs has demonstrably advanced model instruction-following, reasoning, and ethical alignment within the language. While granular compositional and procedural details are not reprinted in downstream works such as Qalb, the dataset's methodology exemplifies scalable, culturally attuned synthetic data curation for other low-resource and morphologically complex languages.
Future directions may include releasing refined splits, annotation ontologies, and more granular performance audits, though no explicit roadmap is articulated in the current publications.