
TituLLMs BLUB Suite: Bangla NLP Benchmark

Updated 3 February 2026
  • TituLLMs BLUB Suite is a benchmarking framework for Bangla NLP that standardizes evaluation using five curated datasets focused on reasoning and comprehension.
  • It employs innovative tokenizer adaptations that reduce tokens-per-word by 75%, accelerating training and enabling longer context processing.
  • The suite enables direct model comparisons under 0-shot and 5-shot settings, revealing strengths in reasoning tasks and limitations in knowledge-intensive areas.

TituLLMs BLUB Suite denotes the primary benchmarking framework developed alongside TituLLMs—a family of Bangla LLMs—designed to enable precise, reproducible performance evaluation on Bangla understanding and reasoning tasks. The Suite addresses the lack of standardized, large-scale Bangla evaluation resources, comprising five distinct datasets spanning multiple-choice and reading comprehension tasks. This initiative underpins comparative assessments of Bangla-centric models and multilingual baselines, facilitating transparent reporting of progress and model-specific strengths and limitations in Bangla natural language processing (Nahin et al., 16 Feb 2025).

1. Benchmark Suite Motivation and Scope

The absence of comprehensive Bangla benchmarks motivated the conception of the BLUB Suite. Prior to its release, evaluation was hampered by the scarcity of accessible, well-curated Bangla datasets, particularly for systematic, multi-domain reasoning and question answering. The BLUB Suite fills this gap by standardizing five diverse tasks for Bangla: multiple-choice QA on academic/professional content, commonsense and scientific reasoning, yes/no reading comprehension, and physical intuition. All BLUB tasks are evaluated under both 0-shot and 5-shot settings using the lm-evaluation-harness framework to ensure robust, comparable normalized accuracy metrics.
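The difference between the 0-shot and 5-shot settings is simply whether solved exemplars precede the target item in the prompt. A minimal sketch of that prompt construction, in the style of lm-evaluation-harness (the template and exemplar texts here are illustrative, not the harness's actual Bangla templates):

```python
# Sketch of 0-shot vs 5-shot prompt construction for a multiple-choice task.
# The template and exemplars are illustrative placeholders, not the
# actual lm-evaluation-harness Bangla templates.

def build_prompt(question, choices, exemplars=()):
    """Concatenate k solved exemplars (k-shot) before the target item."""
    parts = []
    for ex_q, ex_choices, ex_answer in exemplars:
        opts = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(ex_choices))
        parts.append(f"Question: {ex_q}\n{opts}\nAnswer: {ex_answer}")
    opts = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    parts.append(f"Question: {question}\n{opts}\nAnswer:")
    return "\n\n".join(parts)

zero_shot = build_prompt("2 + 2 = ?", ["3", "4"])
five_shot = build_prompt("2 + 2 = ?", ["3", "4"],
                         exemplars=[("1 + 1 = ?", ["2", "5"], "A")] * 5)
```

The model's continuation after the final "Answer:" is then scored against the gold option, which is what the normalized-accuracy metric below counts.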

2. Composition and Construction of the BLUB Suite

Each component dataset of the BLUB Suite targets distinct dimensions of linguistic and cognitive ability, with construction methods reflecting both Bangla specificity and benchmarking rigor.

| Dataset | Task Type | Source or Construction Method |
|---|---|---|
| Bangla MMLU | Multiple-choice QA | Curated from Bangladeshi CSE, job, and admission exams |
| BoolQ Bangla | Yes/No reading QA | Passages from Bangla Wikipedia/Banglapedia/news; Q&A via GPT-4 |
| CommonsenseQA BN | Multiple-choice commonsense QA | EST* translation of CommonsenseQA |
| OpenBookQA BN | Multiple-choice science QA | EST* translation of OpenBookQA |
| PIQA Bangla | Multiple-choice physical reasoning | EST* translation of PIQA |

*EST: "Expressive Semantic Translation" involves iterative LLM-generated candidate translation, ranking, and selection.
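The EST loop can be summarized as generate-rank-select. A schematic sketch, where `generate_candidates` and `score` stand in for the paper's LLM calls and ranking criteria (both are hypothetical placeholders):

```python
# Schematic sketch of Expressive Semantic Translation (EST):
# generate several candidate translations, score each, keep the best,
# and iterate. `generate_candidates` and `score` are hypothetical
# placeholders for LLM-backed generation and ranking.

def expressive_semantic_translation(source, generate_candidates, score, rounds=2):
    best, best_score = None, float("-inf")
    for _ in range(rounds):                      # iterative refinement
        for cand in generate_candidates(source, seed=best):
            s = score(source, cand)              # adequacy/fluency ranking
            if s > best_score:
                best, best_score = cand, s
    return best

# Toy usage: pretend longer candidates are better translations.
cands = lambda src, seed=None: [src.upper(), src + "!", src]
picked = expressive_semantic_translation("hello", cands,
                                         score=lambda s, c: len(c))
# picked == "hello!"
```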

Preprocessing steps include language identification, punctuation and length filtering, MinHash-based deduplication, and, in BoolQ Bangla, human–LLM annotation review. The only dataset with human-annotated gold labels for all splits is MMLU; others may lack comprehensive human gold standards beyond validation/testing.
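The MinHash deduplication step estimates pairwise Jaccard similarity from compact signatures so near-duplicate passages can be dropped without all-pairs set comparison. A minimal stdlib-only sketch, not the paper's exact pipeline (which may use banded LSH and different shingle or permutation settings):

```python
# Minimal MinHash near-duplicate check using stdlib hashing only.
# Shingle size and permutation count are illustrative choices.
import hashlib

def shingles(text, k=3):
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(text, num_perm=64):
    sig = []
    for seed in range(num_perm):   # one seeded hash per "permutation"
        h = lambda s: int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
        sig.append(min(h(sh) for sh in shingles(text)))
    return sig

def jaccard_estimate(sig_a, sig_b):
    # Fraction of matching signature slots estimates Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the river edge"
sim = jaccard_estimate(minhash(a), minhash(b))  # high for near-duplicates
```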

3. Evaluation Metrics and Statistical Reporting

Tasks are scored by normalized accuracy (percentage correct) under standardized 0-shot and 5-shot prompt-based settings, computed as accuracy = (# correct answers) / N. BLEU, as defined by:

\mathrm{BLEU} = \exp\Bigl(\sum_{n=1}^{4} w_n \ln p_n\Bigr) \times \mathrm{BP}

is used to assess translation components in data construction but not for model benchmarking. ROUGE and perplexity are not reported for these tasks. The paper does not provide confidence intervals or p-values for the evaluated results. An approximate binomial confidence interval is described as p \pm z_{0.975}\sqrt{p(1-p)/N}, but is not empirically instantiated in the reported results.
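All three quantities are straightforward to compute. A short sketch with illustrative inputs (the numbers below are examples, not values from the paper):

```python
import math

# Normalized accuracy: fraction of items answered correctly.
def accuracy(num_correct, n):
    return num_correct / n

# BLEU from modified n-gram precisions p_1..p_4 with uniform weights
# w_n = 1/4 and brevity penalty BP.
def bleu(precisions, bp=1.0, weights=(0.25,) * 4):
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

# Normal-approximation 95% binomial confidence interval on accuracy p.
def binomial_ci(p, n, z=1.959964):
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

acc = accuracy(60, 100)             # 0.6
score = bleu([0.5, 0.5, 0.5, 0.5])  # exp(mean ln 0.5) = 0.5
lo, hi = binomial_ci(acc, 100)      # roughly (0.504, 0.696)
```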

4. Quantitative Results Across Models

The BLUB Suite enables direct comparison among Bangla-specific and multilingual LLMs under constrained and matched conditions. Representative 0-shot and 5-shot normalized accuracy values for selected models are shown below:

| Model | Params | MMLU (0/5) | BoolQ (0/5) | CSQA (0/5) | OBQA (0/5) | PIQA (0/5) |
|---|---|---|---|---|---|---|
| Llama-3.2-1B | 1B | 0.28/0.28 | 0.53/0.58 | 0.23/0.23 | 0.32/0.32 | 0.53/0.54 |
| Llama-3.2-3B | 3B | 0.33/0.34 | 0.53/0.69 | 0.26/0.29 | 0.32/0.32 | 0.57/0.57 |
| BongLLaMA-3.2-3B | 3B | 0.30/0.33 | 0.53/0.54 | 0.21/0.20 | 0.27/0.29 | 0.51/0.50 |
| TituLLM-1B-v2.0 | 1B | 0.25/0.25 | 0.53/0.51 | 0.26/0.28 | 0.32/0.33 | 0.58/0.57 |
| TituLLM-3B-v2.0 | 3B | 0.25/0.25 | 0.53/0.54 | 0.28/0.33 | 0.32/0.35 | 0.58/0.60 |

Relative improvements for TituLLM-3B-v2.0 over Llama-3.2-3B (5-shot) are: CSQA +0.04 (+13.8%), OBQA +0.03 (+9.4%), PIQA +0.03 (+5.3%). Notable underperformance is observed on MMLU (–0.09) and BoolQ (–0.15). These results indicate that TituLLMs, aided by their vocabulary and tokenizer adaptations, outperform their multilingual base models on reasoning and commonsense tasks, while lagging on knowledge-intensive and long-context reading comprehension.
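The quoted deltas follow directly from the 5-shot columns of the table; a quick recomputation:

```python
# Recompute TituLLM-3B-v2.0 vs Llama-3.2-3B deltas from the
# 5-shot accuracies in the table above.
llama_3b = {"MMLU": 0.34, "BoolQ": 0.69, "CSQA": 0.29, "OBQA": 0.32, "PIQA": 0.57}
titu_3b  = {"MMLU": 0.25, "BoolQ": 0.54, "CSQA": 0.33, "OBQA": 0.35, "PIQA": 0.60}

for task in llama_3b:
    delta = titu_3b[task] - llama_3b[task]
    rel = delta / llama_3b[task] * 100
    print(f"{task}: {delta:+.2f} ({rel:+.1f}%)")
# MMLU: -0.09 (-26.5%)
# BoolQ: -0.15 (-21.7%)
# CSQA: +0.04 (+13.8%)
# OBQA: +0.03 (+9.4%)
# PIQA: +0.03 (+5.3%)
```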

5. Methodological Innovations and Tokenizer Adaptation

A principal innovation underlying the BLUB Suite’s effectiveness is the adaptation of the tokenizer. The Llama-3.2 tokenizer (≈32K tokens) was extended with five custom BPE vocabularies to a combined “Llama-3.2-plus-48K” vocabulary (≈80K–96K), yielding a substantial reduction in tokens-per-word (TPW) from 7.84 to 1.90 in Bangla. This ≈75% reduction in sequence length accelerates model training and inference, permits handling of longer passages without context truncation, and is particularly impactful for long-form tasks such as BoolQ. Embedding and LM-head layers are scaled to the new vocabulary size, while all other network weights retain those of their Llama-3.2 parent.
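The reported sequence-length saving follows directly from the tokens-per-word (TPW) figures. A sketch of that arithmetic, plus how TPW would be measured on a corpus (`tokenize` is a hypothetical stand-in for the extended tokenizer, not the paper's implementation):

```python
# Sequence-length saving implied by the reported TPW figures.
tpw_before, tpw_after = 7.84, 1.90
reduction = 1 - tpw_after / tpw_before   # ~0.758, i.e. the ~75% reported

# Measuring TPW empirically; `tokenize` is a hypothetical stand-in
# for the extended Bangla tokenizer.
def tokens_per_word(corpus, tokenize):
    words = sum(len(line.split()) for line in corpus)
    tokens = sum(len(tokenize(line)) for line in corpus)
    return tokens / words
```

Because attention cost grows with sequence length, a ~4x drop in TPW directly lowers compute per document and lets more Bangla text fit in a fixed context window.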

6. Challenges and Implications for Low-Resource Language Evaluation

The composition of BLUB Suite datasets underscores several challenges inherent to low-resource NLP benchmarking:

  • Limited native digital corpus density necessitates extensive synthetic data generation (translation, romanization, dialogue, ASR transcripts), which augments reasoning task coverage but cannot fully replicate broad world-knowledge benchmarks.
  • Construction of generative and instruction-based evaluation benchmarks is constrained by the present lack of large, human-annotated Bangla instruction datasets.
  • Few-shot prompting is especially valuable for making the most of limited annotated resources.

A plausible implication is that future performance improvements will hinge on expanding high-quality, natively-authored Bangla data (news archives, digitized books) and extending the scope of instruction-tuning corpora.

7. Prospective Directions and Research Utility

The BLUB Suite establishes a foundational, reproducible testbed for Bangla LLM development and evaluation. Future extensions are outlined to include broader generative benchmarks (summarization, translation, dialogue) with ROUGE, BLEU, and F1 scoring. Scaling beyond 3B parameters and extending context length to ≥32K tokens, along with robust statistical significance testing (bootstrapping, permutation), are prioritized for upcoming research. The BLUB Suite’s framework and release protocol are positioned for adaptation to additional low-resource languages, supporting methodological advances in tokenizer design, data curation, and evaluation practices for the broader NLP research community (Nahin et al., 16 Feb 2025).
