Alif Urdu-Instruct Dataset

Updated 18 January 2026
  • The Alif Urdu-Instruct dataset is a large-scale, synthetic instruction–response corpus designed to advance Urdu language models with culturally aligned prompts.
  • It employs a modified Self-Instruct pipeline with unique prompt templates and human refinement to ensure high data quality and context sensitivity.
  • The dataset covers diverse tasks (QA, translation, reasoning, and others) with over 98% Urdu purity, and models tuned on it score markedly higher on Urdu evaluations.

The Alif Urdu-Instruct dataset (often shortened to "Urdu-Instruct") is a large-scale, high-quality synthetic instruction–response corpus curated for instruction-tuning and evaluating large language models (LLMs) in Urdu, particularly in low-resource settings. It is central to the development of Urdu-specific LLMs such as Alif-1.0-8B-Instruct and Qalb, addressing gaps that mainstream multilingual models leave in cultural alignment, reasoning complexity, safety, and script handling (Shafique et al., 10 Oct 2025, Hassan et al., 13 Jan 2026).

1. Dataset Construction and Task Coverage

Urdu-Instruct is designed to overcome the challenges of data scarcity, lack of chain-of-thought (CoT) reasoning resources, and insufficient coverage of culture-specific linguistic phenomena for Urdu. Its construction uses a modified Self-Instruct distillation pipeline, generating synthetic data in both Urdu and bilingual Urdu–English formats.

The dataset comprises up to 110,000 multitask instruction–response pairs in Nastaliq Perso-Arabic script with high Urdu purity (>98%; non-Urdu tokens appear only in a small subset of back-translations) (Hassan et al., 13 Jan 2026). The distribution of task types is shown below:

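| Task                  | # Examples | Percentage |
|-----------------------|------------|------------|
| Question Answering    | 32,000     | 29.1%      |
| Translation           | 25,000     | 22.7%      |
| Classification        | 18,000     | 16.4%      |
| Sentiment Analysis    | 15,000     | 13.6%      |
| Reasoning             | 8,000      | 7.3%       |
| Summarization         | 6,000      | 5.5%       |
| Open-ended Generation | 6,000      | 5.5%       |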
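
As a quick consistency check, the percentages can be recomputed from the per-task counts. The sketch below uses only the counts reported in the table; the variable names are illustrative.

```python
# Task counts as reported in the table above.
task_counts = {
    "Question Answering": 32_000,
    "Translation": 25_000,
    "Classification": 18_000,
    "Sentiment Analysis": 15_000,
    "Reasoning": 8_000,
    "Summarization": 6_000,
    "Open-ended Generation": 6_000,
}

total = sum(task_counts.values())

# Share of each task, rounded to one decimal place as in the table.
shares = {task: round(100 * n / total, 1) for task, n in task_counts.items()}

for task, pct in shares.items():
    print(f"{task:<22} {task_counts[task]:>7,} {pct:>5.1f}%")
print(f"{'Total':<22} {total:>7,}")
```

The counts sum to 110,000, which matches the upper size bound stated for the dataset.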

Synthetic data generation is guided by a unique prompt template and culturally contextual seed values for each task. The overall design aims to equip Urdu LLMs for robust, contextually nuanced, and safe text generation (Shafique et al., 10 Oct 2025).

2. Synthetic Data Generation Pipeline

The dataset is built with a modified Self-Instruct approach that combines seed values with automated and human-in-the-loop refinement. The pipeline's key attributes are:

  • Prompt Templates & Seed Values: Each task type is driven by a unique prompt template, instantiated with four human-annotated and two machine-generated seed values to ensure diversity and cultural relevance.
  • Sampling and Filtering: GPT-4o generates batches of 20 instruction–response pairs per prompt template and seed. A prompt is accepted into the global task pool only if its ROUGE-L similarity to prompts already in the pool is below 0.7 (to prevent redundancy), its length is within 3–150 words, it contains no unsafe keywords, it passes a valid-character check, and it conforms to the Urdu/English language requirement.
  • Human Refinement: Native speakers refine for grammaticality, factual correctness, and ethical/safety constraints.
  • Post-processing: Automated de-duplication, profanity filtering, spelling/diacritic normalization, and punctuation standardization are applied. Annotator guidelines enforce concise, context-sensitive, and culturally appropriate responses (Shafique et al., 10 Oct 2025, Hassan et al., 13 Jan 2026).

The following pseudocode, from Shafique et al. (10 Oct 2025), summarizes the generation workflow:

GlobalTaskPool ← {}
for each TaskType in [Generation, Ethics, QA, Reasoning, Translation, Classification, Sentiment]:
  PromptTemplate ← unique template for TaskType
  SeedValues ← 4 human-annotated + 2 machine-generated seeds
  for batch = 1 to N_batches(TaskType):
    # fill the template with a randomly drawn seed value
    batch_prompts ← augment(PromptTemplate, random(SeedValues))
    # request 20 instruction–response pairs per call
    pairs ← GPT4o.generate(batch_prompts, n=20)
    for each (prompt, response) in pairs:
      # length and safety filters
      if valid_length(prompt) and no_unsafe_keywords(prompt):
        # character and language filters
        if starts_with_valid_char(prompt) and is_Urdu_English(prompt):
          # ROUGE-L redundancy filter against the pool
          if not duplicate(prompt, GlobalTaskPool, threshold=0.7):
            GlobalTaskPool.add((prompt, response))
# after all task types are processed
post-process GlobalTaskPool via human refinement
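
The filtering step can be sketched in runnable form. This is a minimal illustration and not the authors' implementation: the ROUGE-L score is a simplified whitespace-token F1, the unsafe-keyword list is a hypothetical placeholder, and the valid-character check (first character in the Unicode Arabic block or basic Latin) stands in for both the character and language filters. The function and variable names are assumptions made for this example.

```python
import re

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (a simplification of the full metric)."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

UNSAFE_KEYWORDS = {"bomb"}  # hypothetical placeholder, not the real list
# First character must be in the Arabic Unicode block (used by Urdu) or basic Latin.
VALID_START = re.compile(r"[\u0600-\u06FFA-Za-z]")

def accept(prompt, pool, threshold=0.7):
    """Apply the length, safety, character, and redundancy filters in turn."""
    words = prompt.split()
    if not 3 <= len(words) <= 150:
        return False
    if any(w.lower() in UNSAFE_KEYWORDS for w in words):
        return False
    if not VALID_START.match(prompt):
        return False
    return all(rouge_l_f1(prompt, kept) < threshold for kept in pool)

candidates = [
    "پاکستان کا دارالحکومت کیا ہے؟",
    "پاکستان کا دارالحکومت کیا ہے؟ بتائیں",  # near-duplicate of the first prompt
    "Translate this sentence into Urdu",
    "سلام دنیا",  # too short (2 words)
]

pool = []
for prompt in candidates:
    if accept(prompt, pool):
        pool.append(prompt)

print(len(pool))
```

In this toy run the second candidate is rejected as a near-duplicate and the fourth for length, so two prompts remain in the pool. A production pipeline would more likely rely on an established ROUGE implementation and a proper language-identification step.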

3. Prompt, Response, and Instruction Formatting

Prompt–response pairs are encoded in UTF-8 Nastaliq Perso-Arabic script. The original release uses a JSONL schema following the Stanford Alpaca template ({instruction, input (optional), response}). For Qalb and related models, prompts instead follow the Meta Llama-3 chat format, which marks each role with special control tokens (e.g., <|start_header_id|>System: for context, <|start_header_id|>User:, <|end_header_id|>, <|start_header_id|>Assistant:).
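
To make the record layout concrete, the following sketch serializes one Alpaca-style record as a JSONL line. The field names follow the schema described above; the Urdu sentences are invented for illustration and are not taken from the dataset.

```python
import json

# One illustrative record; the text is made up for this example.
record = {
    "instruction": "درج ذیل جملے کا انگریزی میں ترجمہ کریں۔",
    "input": "آج موسم بہت خوشگوار ہے۔",
    "response": "The weather is very pleasant today.",
}

# ensure_ascii=False keeps the Urdu characters readable instead of \uXXXX escapes.
line = json.dumps(record, ensure_ascii=False)

# Each JSONL line parses back into the same record.
restored = json.loads(line)
print(line)
```

Writing such lines to a file opened with `encoding="utf-8"` keeps the right-to-left text intact without transliteration.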

Loss masking ensures only Assistant responses are considered in the fine-tuning objective. All right-to-left script properties are maintained without transliteration throughout the processing pipeline and inference. For translation tasks, four instruction–input–output configurations cover every direction of Urdu–English translation to maximize stylistic and contextual fidelity; templates prompt for cultural nuance (e.g., scenario-appropriate seed values and idioms) (Shafique et al., 10 Oct 2025, Hassan et al., 13 Jan 2026).
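
Loss masking of this kind is commonly implemented by replacing the labels of non-Assistant tokens with an ignore value. The sketch below assumes the conversation has already been tokenized into per-role segments; the token IDs, role names, and the `build_labels` helper are hypothetical. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(segments):
    """Concatenate token IDs and mask every token outside the assistant turns."""
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            # Assistant tokens contribute to the loss.
            labels.extend(token_ids)
        else:
            # System and user tokens are visible as context but not trained on.
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels

# Toy conversation with made-up token IDs.
segments = [
    ("system", [11, 12]),
    ("user", [21, 22, 23]),
    ("assistant", [31, 32]),
]
input_ids, labels = build_labels(segments)
print(input_ids)
print(labels)
```

Only the two assistant tokens keep their IDs as labels; the five context tokens are masked out of the objective.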

4. Chain-of-Thought Reasoning and Cultural Alignment

Urdu-Instruct employs explicit chain-of-thought scaffolding for reasoning tasks, using prompts that begin with phrasing such as “آئیے مرحلہ وار سوچتے ہیں کہ…” (“Let’s think step-by-step about…”). Outputs generated for reasoning tasks are structured as list-based, stepwise explanations, closely aligned with Urdu-native pedagogical style. Culturally situated translation and generation tasks leverage seed values and prompt templates that reflect local idioms, festivals, and naming conventions. All translation instructions require preservation of cultural context, frequently illustrated via manual expansion and idiomatic adaptation (e.g., mapping “breaking the fast” to “روزہ کھولنا”) (Shafique et al., 10 Oct 2025).

Human refiners further review content for grammaticality, factuality, and safety, removing redundant, unsafe, or non-native expressions. Sentiment annotation, faithfulness, and fluency checks are implemented via multi-annotator protocols and expert validation sets, with 90% approval required for inclusion in final releases (Hassan et al., 13 Jan 2026).

5. Statistical Properties and Quality Assurance

The Urdu-Instruct dataset’s key attributes include:

  • Size: 51,686 (core) to 110,000 (extended) instruction–response pairs.
  • Task diversity: At least seven distinct categories, spanning QA, classification, open-ended generation, and structured tasks (e.g., sentiment, reasoning, summarization).
  • Annotation schema: Each data record is furnished with task type, seed value identifiers, and generation timestamp.
  • Quality: >98% Urdu purity, ROUGE-L similarity between accepted prompts below 0.7, and human expert-reviewed subsets.
  • Evaluation: A dedicated held-out set of ~1,050 items (150 per category) is included, curated with human review and LLM-as-judge setups. Weighted average LLM evaluation scores (on Urdu eval sets) demonstrate marked improvements: Llama-3.1-8B-Instruct baseline at 45.7, Alif-1.0-8B-Instruct at 87.1 (Shafique et al., 10 Oct 2025), and Qalb’s further improvement to 90.34 (Hassan et al., 13 Jan 2026).
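
The source does not define how Urdu purity is measured. One plausible reading, shown here purely as an assumption, is the fraction of whitespace tokens written in Arabic script with no Latin letters. This is a script check, not true language identification, and the `urdu_purity` helper is hypothetical.

```python
import re

ARABIC_SCRIPT = re.compile(r"[\u0600-\u06FF]")  # Unicode Arabic block, used by Urdu
LATIN = re.compile(r"[A-Za-z]")

def urdu_purity(text):
    """Fraction of whitespace tokens in Arabic script with no Latin letters."""
    tokens = text.split()
    if not tokens:
        return 0.0
    urdu = sum(1 for t in tokens if ARABIC_SCRIPT.search(t) and not LATIN.search(t))
    return urdu / len(tokens)

print(urdu_purity("آج موسم بہت خوشگوار ہے۔"))
print(urdu_purity("یہ ایک test ہے"))
```

Under this definition the first sentence is fully Urdu-script, while the second, with one English word out of four tokens, scores 0.75.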

6. Applications, Benchmarking, and Model Impact

Alif Urdu-Instruct underpins instruction tuning for several Urdu LLMs. Fine-tuning Llama-3.1-8B on Urdu-Instruct yields Alif-1.0-8B-Instruct, which outperforms both its base model and leading open-source models on Urdu-specific evaluations. Qalb combines continued pre-training (on a 1.97B-token Urdu+English corpus) with instruction-tuning on Urdu-Instruct, reaching a state-of-the-art weighted average of 90.34 across diverse tasks (Hassan et al., 13 Jan 2026). Models trained on Urdu-Instruct outperform counterparts such as Cohere-Aya-Expanse-8B and Qwen-2.5-7B-Instruct on benchmarks including MGSM, AlpacaEval, and Dolly QA (Alif's overall score of 75.5 vs. Aya's 68.9) (Shafique et al., 10 Oct 2025).

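Model fine-tuning with Urdu-Instruct typically minimizes the standard SFT loss, the negative log-likelihood of each token x_i given the preceding tokens x_{<i} under model parameters Θ: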

$$L_\text{SFT}(\Theta) = \mathbb{E}_{x \sim D_\text{SFT}}\left[-\sum_i \log p(x_i \mid x_{<i}; \Theta)\right]$$
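
Because loss masking restricts the objective to Assistant tokens, the masked form of this loss can be sketched with a binary mask. The symbol $m_i$ is notation introduced here for illustration and does not appear in the source.

```latex
L_\text{SFT}^{\text{masked}}(\Theta)
  = \mathbb{E}_{x \sim D_\text{SFT}}\left[-\sum_i m_i \log p(x_i \mid x_{<i}; \Theta)\right],
\qquad
m_i =
\begin{cases}
  1 & \text{if } x_i \text{ is an Assistant token} \\
  0 & \text{otherwise}
\end{cases}
```

Setting $m_i = 1$ for every token recovers the unmasked loss.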

The dataset and generation scripts are openly licensed (CC BY-SA 4.0, MIT) and released at https://github.com/traversaal-ai/alif-urdu-LLM, facilitating reproducibility and community-driven improvements.

7. Broader Implications and Availability

Urdu-Instruct demonstrates scalable synthetic data generation for a low-resource language. Its combination of large scale, rigorous cultural and ethical filtering, Urdu-native reasoning chains, and bilingual design supports high-performance, cost-efficient Urdu LLM development. The release of complete data, models, and code provides a foundation for further work on culturally contextual NLP and instruction-tuning beyond high-resource languages (Shafique et al., 10 Oct 2025, Hassan et al., 13 Jan 2026).
