
TituLLMs BLUB Suite: Bangla LLM Benchmark

Updated 3 February 2026
  • TituLLMs BLUB Suite is a task-centric evaluation framework that benchmarks Bangla language models using standardized datasets across reasoning and QA tasks.
  • It introduces vocabulary adaptations by merging a base tokenizer with custom Bangla BPE vocabularies, reducing tokens-per-word by about 75% and speeding up processing.
  • The suite leverages diverse pretraining data—including synthetic translations, romanizations, and curated web crawls—to address low-resource challenges and improve model evaluation.

TituLLMs BLUB Suite refers to the task-centric evaluation suite introduced alongside TituLLMs, a family of Bangla-language LLMs derived from continual pretraining on Llama-3.2 model architectures. The BLUB Suite is designed to fill a significant benchmarking gap for Bangla LLMs by providing comprehensive standardized datasets targeting a range of reasoning and question answering tasks. TituLLMs and the BLUB Suite collectively establish a foundation for methodical evaluation and further adaptation of open multilingual models to low-resource languages (Nahin et al., 16 Feb 2025).

1. Model Families and Tokenizer Innovations

TituLLMs consist of two parameter-scale models: TituLLM-1B and TituLLM-3B, which are continual-pretraining adaptations of Llama-3.2-1B and Llama-3.2-3B, respectively. The architecture specifics such as hidden size, feed-forward dimensions, and attention head counts match the progenitor Llama-3.2 variants, maintaining 1 billion or 3 billion parameters. The primary architectural innovation centers on vocabulary extension. The base Llama-3.2 tokenizer (≈32,000 tokens) was merged with five custom Byte Pair Encoding (BPE) vocabularies, derived from 48 GB of sampled Bangla corpora. The resulting "Llama-3.2-plus-48K" tokenizer increases the total vocabulary to approximately 80,000–96,000 tokens and reduces the effective tokens-per-word (TPW) ratio from 7.8397 to 1.9029. Only the embedding and language-model head layers were modified to accommodate the expanded token set; self-attention, MLP, and layer norm parameters remained unchanged.
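A toy sketch of this merge-and-resize step (the vocabularies, hidden size, and initialization below are illustrative stand-ins, not the paper's actual artifacts):

```python
import numpy as np

# Toy stand-in for the vocabulary extension: merge a base vocab with
# custom Bangla BPE vocabularies, then grow only the embedding matrix.
base_vocab = {"<s>": 0, "the": 1, "and": 2}          # stands in for the ~32K base tokens
bangla_vocabs = [["বাংলা", "ভাষা"], ["মডেল", "ভাষা"]]   # stands in for the 5 custom BPE vocabs

merged = dict(base_vocab)
for vocab in bangla_vocabs:
    for tok in vocab:
        if tok not in merged:
            merged[tok] = len(merged)

hidden = 8                                            # illustrative hidden size
rng = np.random.default_rng(0)
old_emb = rng.normal(size=(len(base_vocab), hidden))

new_emb = rng.normal(size=(len(merged), hidden)) * 0.02  # fresh rows for new tokens
new_emb[: len(base_vocab)] = old_emb                     # pretrained rows copied verbatim
# Self-attention, MLP, and layer-norm weights would be left untouched.

print(len(merged), new_emb.shape)   # 6 tokens (3 base + 3 new), shape (6, 8)
```

The same pattern applies at full scale: only the embedding table and LM head change shape, while the transformer body keeps its pretrained parameters.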

The TPW reduction drives efficiency improvements: Bangla sentences are tokenized into roughly four times fewer tokens, which shortens input representations and speeds up both pretraining and inference. This adaptation is critical given the computational constraints under which the models were developed.
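The arithmetic behind this speedup, using the TPW figures quoted above:

```python
# Effective tokens-per-word before and after vocabulary extension (from the text).
tpw_base = 7.8397   # base Llama-3.2 tokenizer on Bangla
tpw_plus = 1.9029   # Llama-3.2-plus-48K tokenizer

reduction = 1 - tpw_plus / tpw_base    # fractional drop in tokens per word
shrink_factor = tpw_base / tpw_plus    # how much shorter sequences become

print(f"TPW reduction: {reduction:.1%}")                # ~75.7%
print(f"Sequence shrink factor: {shrink_factor:.2f}x")  # ~4.12x

# For a 500-word Bangla passage:
words = 500
print(round(words * tpw_base))  # ~3920 tokens: nearly fills a 4096-token context
print(round(words * tpw_plus))  # ~951 tokens: fits with ample headroom
```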

2. Pretraining Corpus Construction

The pretraining corpus for TituLLMs comprises approximately 37 billion tokens, sourced and filtered to maximize Bangla input diversity and coverage. Web crawl data contributed 9.8 billion tokens, processed via language identification, URL pattern filtering, Bangla-specific heuristics, and deduplication by MinHash. Books data (4 billion tokens) were extracted from open-source PDFs using Google OCR and Tesseract, with further curation by n-gram LM scores, word/sentence structure, and Bangla content coverage.
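MinHash deduplication of the kind mentioned above can be sketched with a minimal character-shingle implementation (the shingle length, permutation count, and example texts are illustrative; the paper's exact configuration is not specified):

```python
import hashlib

def minhash_signature(text, num_perm=64, shingle_len=5):
    """Character-shingle MinHash signature (illustrative, not the paper's exact setup)."""
    shingles = {text[i:i + shingle_len] for i in range(max(1, len(text) - shingle_len + 1))}
    sig = []
    for seed in range(num_perm):
        # One "permutation" per seed: min over salted 64-bit hashes of the shingles.
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # Fraction of agreeing signature slots estimates the Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("বাংলা ভাষার একটি বড় ওয়েব করপাস থেকে সংগৃহীত বাক্য।")
b = minhash_signature("বাংলা ভাষার একটি বড় ওয়েব করপাস থেকে সংগৃহীত বাক্য!")
c = minhash_signature("সম্পূর্ণ ভিন্ন বিষয়ের উপর লেখা অন্য একটি নথি।")

print(estimated_jaccard(a, b))  # near-duplicates: high estimate
print(estimated_jaccard(a, c))  # unrelated documents: low estimate
```

In a production pipeline the signatures would feed a locality-sensitive-hashing index so that only candidate pairs, not all pairs, are compared.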

Synthetic data (7.06 billion tokens) were created through:

  • Machine translation (1.47 billion tokens) using an EN–BN distilled NLLB (No Language Left Behind) model (600M parameters, BLEU=37.6).
  • Romanization (3.87 billion tokens) via a distilled NLLB BN–roman transliteration model, trained partly on Sangraha and GPT-4-generated pairs (BLEU=65.1).
  • Multi-turn cultural conversations (0.42 billion tokens), generated by two-agent LLM prompts.
  • Audio transcripts (1.3 billion tokens) from 56,000 hours of automatic speech recognition (ASR)-transcribed speech.

Sangraha, the high-quality Bangla subset of IndicLLMSuite, contributed 10.94 billion tokens (4.26 billion translated, 6.68 billion romanized). The validation set (22 GB, about 2% of the total) was proportionally sampled across all sources. Training utilized the LlamaFactory framework with context length set at 4096 tokens; optimizer and other hyperparameters are not specified. A single epoch required approximately 1,750 H100-GPU hours.
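Proportional sampling of the validation split can be sketched as follows (the document counts are toy stand-ins for the real per-source sizes):

```python
import random

random.seed(0)

# Toy documents tagged by source; counts stand in for the real per-source sizes.
corpus = (
    [("web_crawl", f"doc{i}") for i in range(980)]
    + [("books", f"doc{i}") for i in range(400)]
    + [("synthetic", f"doc{i}") for i in range(706)]
    + [("sangraha", f"doc{i}") for i in range(1094)]
)

holdout_frac = 0.02  # ~2% validation split, as in the text

by_source = {}
for src, doc in corpus:
    by_source.setdefault(src, []).append(doc)

validation = []
for src, docs in by_source.items():
    k = round(len(docs) * holdout_frac)  # proportional: each source keeps its share
    validation.extend((src, d) for d in random.sample(docs, k))

print(len(corpus), len(validation))
```

Because each source is sampled at the same rate, the validation set mirrors the mix of web, book, synthetic, and Sangraha data seen in training.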

3. BLUB Suite: Task Definitions and Dataset Construction

The BLUB Suite benchmarks Bangla LLMs on five supervised datasets. Each is evaluated using normalized accuracy (percentage correct) under 0-shot and 5-shot prompting via the lm-evaluation-harness toolkit.

Dataset          | Task type             | Construction methodology
-----------------|-----------------------|-------------------------------------------------------------------
Bangla MMLU      | MC comprehensive QA   | Manually curated from Bangladeshi CSE/job/admission exam questions
BoolQ Bangla     | Yes/no reading QA     | Passages from Wikipedia, Banglapedia, and news; Q&A via GPT-4, reviewed
CommonsenseQA BN | MC commonsense QA     | EST translation of CommonsenseQA
OpenBookQA BN    | MC science QA         | EST translation of OpenBookQA
PIQA Bangla      | MC physical reasoning | EST translation of PIQA

MC = multiple-choice; EST = Expressive Semantic Translation (iterative LLM-based translation and ranking). Only the Bangla MMLU validation and test sets received full manual annotation; the other datasets were built with combined LLM-generation and manual-curation pipelines.
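Normalized accuracy in lm-evaluation-harness selects the answer option with the highest log-likelihood after correcting for option length; a minimal sketch, with made-up log-likelihoods (byte-length normalization is one common variant, assumed here):

```python
# Sketch of length-normalized multiple-choice scoring in the style of
# lm-evaluation-harness "acc_norm". The log-likelihoods below are made up.
def pick_option(options, loglikelihoods):
    scores = [
        ll / len(opt.encode("utf-8"))   # normalize by byte length of the continuation
        for opt, ll in zip(options, loglikelihoods)
    ]
    return max(range(len(options)), key=lambda i: scores[i])

options = ["ঢাকা", "কলকাতা", "দিল্লি", "করাচি"]
fake_lls = [-6.0, -7.2, -9.0, -8.0]    # hypothetical model log-likelihoods

# Raw argmax would pick the short option 0; normalization flips the choice,
# since longer options accumulate more per-token loss.
print(pick_option(options, fake_lls))  # → 1
```

Length normalization matters for Bangla options especially, since multi-byte script inflates raw sequence-level losses for longer answers.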

Preprocessing involved language identification, punctuation and length filtering, and MinHash deduplication, varying according to dataset origin. The BLUB Suite explicitly addresses the absence of gold-standard evaluation resources for Bangla LLMs.
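A minimal, hypothetical version of the script-ratio and length filters described above (thresholds and function names are illustrative, not the paper's):

```python
# Heuristic filters of the kind described: keep examples that are mostly
# Bangla script and fall within a length window. Thresholds are illustrative.
def bangla_ratio(text):
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    bangla = sum(1 for ch in letters if "\u0980" <= ch <= "\u09FF")  # Bengali block
    return bangla / len(letters)

def keep(text, min_ratio=0.7, min_len=10, max_len=2000):
    return min_len <= len(text) <= max_len and bangla_ratio(text) >= min_ratio

print(keep("এটি একটি বাংলা বাক্য যা ফিল্টার পাস করা উচিত।"))  # True
print(keep("This sentence is entirely English."))             # False
```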

4. Quantitative Benchmark Results

TituLLM-1B and TituLLM-3B were compared against the base Llama-3.2 models and BongLLaMA-3.2-3B on 0-shot and 5-shot normalized accuracy across BLUB Suite tasks. Selected results:

Model            | Params | MMLU (0/5) | BoolQ (0/5) | CSQA (0/5) | OBQA (0/5) | PIQA (0/5)
-----------------|--------|------------|-------------|------------|------------|-----------
Llama-3.2-1B     | 1B     | 0.28/0.28  | 0.53/0.58   | 0.23/0.23  | 0.32/0.32  | 0.53/0.54
Llama-3.2-3B     | 3B     | 0.33/0.34  | 0.53/0.69   | 0.26/0.29  | 0.32/0.32  | 0.57/0.57
BongLLaMA-3.2-3B | 3B     | 0.30/0.33  | 0.53/0.54   | 0.21/0.20  | 0.27/0.29  | 0.51/0.50
TituLLM-1B-v2.0  | 1B     | 0.25/0.25  | 0.53/0.51   | 0.26/0.28  | 0.32/0.33  | 0.58/0.57
TituLLM-3B-v2.0  | 3B     | 0.25/0.25  | 0.53/0.54   | 0.28/0.33  | 0.32/0.35  | 0.58/0.60

TituLLM-3B shows relative 5-shot improvements over Llama-3.2-3B on CommonsenseQA (+13.8%), OpenBookQA (+9.4%), and PIQA (+5.3%). However, performance declines on Bangla MMLU (–0.09 absolute) and BoolQ (–0.15 absolute), with no 0-shot or 5-shot gain on MMLU. No statistical confidence intervals or p-values are reported; an approximate binomial CI for an accuracy $p$ over $N$ evaluation items is $p \pm z_{0.975}\sqrt{\tfrac{p(1-p)}{N}}$.
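The quoted interval can be computed directly; the evaluation-set size below is hypothetical, since per-task item counts are not reported:

```python
import math

def binomial_ci(p, n, z=1.959964):
    """Approximate 95% normal-approximation CI for an accuracy p over n items."""
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# E.g., TituLLM-3B's 5-shot CommonsenseQA accuracy of 0.33, assuming
# (hypothetically) n = 1000 evaluation items:
lo, hi = binomial_ci(0.33, 1000)
print(f"[{lo:.3f}, {hi:.3f}]")  # → [0.301, 0.359]
```

At that hypothetical size, the ±0.03 half-width suggests several of the tabulated gaps between models would sit near the edge of statistical resolution, underscoring the text's call for reported intervals.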

5. Analysis: Vocabulary Adaptation and Low-Resource Challenges

The steep reduction in tokens-per-word (≈75%) from the augmented tokenizer enables TituLLMs to process longer Bangla passages (notably advantageous for BoolQ, which features >1,000-token prompts) with reduced risk of context truncation.

Scaling up model parameters from 1B to 3B yields clear gains in commonsense and physical reasoning tasks when paired with the expanded vocabulary. However, such gains do not extend to world-knowledge assessments (as represented by MMLU), where TituLLMs underperform their English-pretrained multi-lingual parent models. This points to limitations in cross-lingual transfer and insufficient high-quality Bangla world-knowledge data.

Reliance on synthetic data remains a necessity due to the limited availability of native Bangla digital content. While synthetic translations and generated dialogues enhance reasoning capacities, they fall short in instilling comprehensive world knowledge as captured in MMLU. The models also exhibit weaker long-context handling abilities, indicating a need for further pretraining at greater context lengths (≥8,192 tokens). Furthermore, absence of a large Bangla instruction-tuning corpus restricts performance on open-ended generative tasks.

6. Future Directions and Research Implications

The groundwork established by TituLLMs and the BLUB Suite points to several parallel research threads:

  • Scaling TituLLMs beyond 3B parameters and extending context length to at least 32,768 tokens (leveraging lessons from the Llama-3.2-plus-48K tokenizer work).
  • Actively collecting additional native Bangla resources (especially news archives, digitized literature, and specialized corpora) to mitigate current over-reliance on synthetic data, and constructing instruction-tuning datasets to improve generative capability.
  • BLUB Suite expansion to encompass generative tasks (summarization, translation, dialogue), with evaluation using ROUGE, BLEU, or F1, alongside the existing normalized accuracy metrics.
  • Systematic incorporation of statistical significance methods (bootstrapping, permutation testing) in future benchmarks to strengthen evidence claims.
  • Broader applicability: the TituLLMs pipeline for vocabulary adaptation and benchmarking offers a generalizable template for other low-resource languages.

A plausible implication is that vocabulary extension and culturally informed tokenization are core levers for improving resource-sparse LLM adaptation. However, bridging the gap in knowledge tasks and achieving strong long-context understanding will require further scale in both model capacity and training data heterogeneity (Nahin et al., 16 Feb 2025).
