Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance

Published 7 Apr 2025 in cs.CL and cs.AI | (2504.07989v2)

Abstract: Small LLMs (SLMs) offer efficient alternatives to LLMs for specific domains. The 2023 TinyStories study developed an English dataset that allows SLMs with 1 to 10 million parameters to produce coherent outputs. Our research expands this framework by translating the original dataset into Indian languages and creating synthetic data using LLMs. We focus on Hindi, Marathi, and Bengali, evaluating SLMs for regional language processing and understanding linguistic complexity. We show that SLMs efficiently process regional languages with significantly fewer parameters than LLMs, providing a complementary framework for ``inference based evaluation" of tokenization strategies and linguistic complexity. Our analysis shows that language-specific tokenizers outperform general-purpose ones for Indian languages. Empirical validations, supported by information-theoretic and morphological analyses, provides fundamental understanding behind the better performance of Hindi models over Marathi and Bengali. Additionally, we show that synthetic datasets outperform translated content for training SLMs. Correlation analyses reveal cross-linguistic patterns and language-specific relationships between creativity, grammatical precision, and narrative completeness. These findings advance both the practical application of SLMs to underserved languages and our theoretical understanding of neural language development.

Abstract PDF Upgrade to Chat

Summary

The paper demonstrates that small models, when paired with language-specific tokenizers, can generate coherent regional narratives.
It employs a two-phase approach combining translation and synthetic data generation to create diverse datasets for Hindi, Marathi, and Bengali.
Findings reveal that tailored tokenizers like Sarvam and SUTRA enhance performance, with Hindi models excelling in grammar and fluency metrics.

Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance

Introduction

The study "Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance" investigates the application of Small LLMs (SLMs) in regional languages, specifically focusing on Hindi, Marathi, and Bengali. The research builds on the TinyStories framework, which demonstrated that SLMs with fewer parameters could effectively generate coherent narratives. This work extends the framework to Indian languages with the goal of understanding linguistic complexity and evaluating tokenizer performance.

Methodology

The methodology involves a two-phase approach: first, translating the original English TinyStories dataset into Indian languages, and second, augmenting the dataset with synthetic stories generated using LLMs. The study employs specific Indian language tokenizers like Sarvam and SUTRA, contrasting them against general-purpose tokenizers.

Figure 1: Schematic of model pipeline. (A) Dataset prepared through machine translation as well as generation using LLM, (B) Indic tokenizers are used to preprocess the Indian language stories, (C) A decoder only transformer architecture is used to train the model in each language, (D) Inference is evaluated by LLM on certain attributes.

Data Generation and Model Training

Translated Data: Utilizes NLLB-200-3B and Google Translate to convert English TinyStories into Hindi and Bengali, ensuring semantic preservation and cultural appropriateness through verification by LLMs.
Synthetic Data: Generated using GPT-4o-mini, incorporating a sophisticated prompt-generation system to maintain narrative diversity and avoid repetition. The synthetic dataset includes 1.8 million translated and 2.2 million synthetic stories per language.

Tokenization and Inference

The study highlights the superiority of language-specific tokenizers over general-purpose ones. Sarvam and SUTRA tokenizers, tailored for Indian scripts, provide better tokenization, reflected in improved model performance metrics such as evaluation loss and contextual understanding.

Results and Insights

Model Performance

Evaluation reveals that SLMs with parameters between 5M and 50M deliver strong linguistic capabilities. Hindi models achieve the highest performance, followed by Bengali and Marathi, with Hindi benefiting more from increased architectural width.

Hindi Models: Achieve superior performance in grammar (8.91/10) and fluency (8.55/10).
Cross-Linguistic Complexity: Analyzed using Rènyi entropy, highlighting Marathi's complexity due to higher entropy values.

Tokenizer Comparison

Sarvam tokenizers excel in context and narrative quality, outperforming Tiktoken and other general-purpose alternatives. The evaluation suggests language-specific tokenizers better capture idiomatic and structural nuances.

Implementation Considerations

The study demonstrates practical implications for deploying SLMs in resource-constrained environments, offering efficient language modeling solutions without requiring extensive computational resources or massive datasets. The methodology for fast adaptation to regional languages via translation and synthesis is pivotal for democratizing AI for underrepresented languages.

Conclusion

The study underscores the potential of small models in regional language processing, illustrating that SLMs can match or surpass larger models when equipped with tailored tokenizers and datasets. By providing a framework to evaluate linguistic complexity and tokenizer efficiency, this research contributes to our understanding of neural language development and promotes advancements in low-resource language modeling.

The findings pave the way for future research in optimizing LLM architectures and tokenization for diverse linguistic landscapes, promoting equitable access to AI capabilities.

Markdown Report Issue