IndoCulture Benchmark Evaluation

Updated 26 January 2026
  • IndoCulture Benchmark is a comprehensive suite of regionally anchored datasets that assess language models’ understanding of localized cultural practices in India and Indonesia.
  • It employs rigorous multi-stage annotation, anthropological frameworks, and statistical metrics like intra-region agreement and KL-divergence to evaluate cultural bias.
  • The benchmark reveals significant regional bias and performance gaps in models, underscoring the need for localized, culturally nuanced evaluation methods.

The IndoCulture Benchmark encompasses a set of rigorous datasets and frameworks designed to systematically evaluate LLMs’ competence in regionally and contextually grounded commonsense and cultural reasoning at the sub-national level, primarily in India and Indonesia. These benchmarks address the methodological and empirical deficiencies of earlier evaluation tasks that treated national cultures as monolithic, instead highlighting deeply localized variation in practices, beliefs, and knowledge. The term “IndoCulture Benchmark” is used by multiple projects (sometimes as shorthand for specific datasets such as INDICA or IndoCulture-ID), all centered on regionality, comprehensive coverage of material and non-material culture, and precise quantification of model knowledge and bias.

1. Motivation and Conceptual Foundations

IndoCulture Benchmarks were developed to challenge the Anglocentric paradigm of cultural reasoning in LLM evaluation. Early commonsense QA benchmarks presupposed uniformity within nation-states, marginalizing the profound intra-national heterogeneity characteristic of large, culturally heterogeneous nations such as India and Indonesia (Madhusudan et al., 22 Jan 2026, Koto et al., 2024). The goal is to empirically test whether "cultural commonsense" is national or, in practice, regional or provincial, a question addressed via controlled, region-anchored data collection, anthropological domain modeling, and fine-grained statistical evaluation.

Benchmarks such as INDICA (India) (Madhusudan et al., 22 Jan 2026) and IndoCulture (Indonesia) (Koto et al., 2024) ground question formulation and annotation in established anthropological frameworks (e.g., Murdock’s Outline of Cultural Materials), emphasize high-quality local annotation, and design protocols to uncover geographic bias and coverage gaps in LLMs and vision-LLMs (VLMs).

2. Benchmark Structures and Dataset Construction

The IndoCulture family encompasses several major datasets whose construction methodologies share key features:

a. Regional Granularity:

  • Items are anchored to sub-national units rather than to the nation as a whole: Indian sub-regions and states (e.g., 36 sub-regions in DIWALI) and Indonesian provinces (76% of IndoCulture-ID items are province-specific).

b. Cultural Domains and Facets:

  • Driven by anthropological taxonomy (e.g., OCM or Newmark’s categories), domains include: Interpersonal Relations, Food, Attire, Rituals, Communication, Education, Finance, Festivals, Art, Architectural Styles, and Socioreligious Life (Madhusudan et al., 22 Jan 2026, Sahoo et al., 22 Sep 2025).
  • DIWALI (India) expands to 17 cultural facets with over 8,800 culture-specific items mapped to 36 sub-regions (Sahoo et al., 22 Sep 2025).

c. Annotation and Consensus Protocols:

  • Multi-stage quality control incorporates intra-region agreement (≥80%) and human–human Fleiss’ κ = 1.0 (Madhusudan et al., 22 Jan 2026).
  • Each region's answers require high internal annotator agreement before inclusion; a minimal sketch of this consensus filter follows below.
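
A minimal sketch of such a consensus filter, assuming a hypothetical answer matrix of shape (questions × annotators) holding categorical answer IDs; the ≥80% threshold mirrors the protocol above, and Fleiss’ κ uses the statsmodels implementation:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def intra_region_agreement(answers: np.ndarray) -> np.ndarray:
    """Per-question fraction of annotators who give the modal answer."""
    n_annotators = answers.shape[1]
    modal_counts = [np.unique(row, return_counts=True)[1].max() for row in answers]
    return np.array(modal_counts) / n_annotators

# Hypothetical toy data: 5 questions, 4 annotators from one region.
answers = np.array([
    [0, 0, 0, 0],   # unanimous
    [1, 1, 1, 0],   # 75% agreement -> filtered out
    [2, 2, 2, 2],
    [0, 1, 0, 0],   # 75% agreement -> filtered out
    [3, 3, 3, 3],
])

agreement = intra_region_agreement(answers)
keep = agreement >= 0.8                      # the >=80% inclusion threshold
table, _ = aggregate_raters(answers)         # (questions x categories) counts
kappa = fleiss_kappa(table)                  # chance-corrected agreement

print(f"Fleiss' kappa: {kappa:.3f}")
print(f"Questions retained: {int(keep.sum())} / {len(keep)}")
```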

d. Data Types and Formats:

  • Open-ended generative QA (region-anchored), MCQs with unlabeled region options, and task-specific formats (satirical/adversarial, VQA, machine translation, text adaptation) (Faraz et al., 6 Nov 2025, Koto et al., 2024).
  • IndoCulture-ID focuses on manually crafted context–ending MCQs, ~2,400 items, 76% province-specific (Koto et al., 2024); an illustrative record layout follows below.
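
A hypothetical record layout for a context–ending MCQ of the kind described above; the field names are illustrative assumptions, not the published dataset schema:

```python
from dataclasses import dataclass

@dataclass
class ContextEndingMCQ:
    province: str         # region anchor; withheld in region-agnostic runs
    context: str          # culturally grounded premise sentence(s)
    endings: list[str]    # candidate continuations, one correct
    gold_index: int       # index of the culturally appropriate ending

# Placeholder instance showing the shape of an item, not real dataset content.
item = ContextEndingMCQ(
    province="West Sumatra",
    context="<premise describing a local practice>",
    endings=["<ending A>", "<ending B>", "<ending C>"],
    gold_index=0,
)
```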

3. Task Definitions, Metrics, and Statistical Evaluation

a. Task Types:

  • Short-answer generation: models must generate the gold-standard answer for a specified region (Madhusudan et al., 22 Jan 2026).
  • Region-agnostic MCQ: models select among options without explicit regional labels; the distribution of choices across regions exposes bias (illustrative prompt framings for these tasks follow this list).
  • Cultural text adaptation: LLMs rewrite sample texts to reflect local context, scored against region-tagged culture-specific item (CSI) inventories (Sahoo et al., 22 Sep 2025).
  • Visual question answering, translation, and OCR in multilingual, culture-grounded settings (Faraz et al., 6 Nov 2025, Maji et al., 23 Sep 2025).
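
Illustrative prompt framings for the first two task types, assuming hypothetical templates (the benchmarks’ exact prompt wording is not reproduced here):

```python
def region_anchored_prompt(question: str, region: str) -> str:
    """Free-form QA: the target region is stated explicitly in the prompt."""
    return (
        f"Answer the following question as it applies to {region}.\n"
        f"Question: {question}\nAnswer:"
    )

def region_agnostic_mcq_prompt(question: str, options: list[str]) -> str:
    """MCQ: options drawn from different regions, shown without region labels."""
    lettered = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return f"Question: {question}\n{lettered}\nRespond with a single letter:"
```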

b. Metrics:

  • Agreement and Coverage:
    • Intra-region: ≥80% agreement for gold answer; Fleiss’ κ.
    • Cross-region and universal agreement: computed as overlap in practices (Madhusudan et al., 22 Jan 2026).
  • Bias Quantification:
    • Chi-square goodness-of-fit: $\chi^2 = \sum_r (O_r - E_r)^2 / E_r$, with $O_r$ and $E_r$ the observed and expected selections for region $r$.
    • Selection ratio: $SR_r = O_r / E_r$.
    • KL-divergence for region selection distributions (a computation sketch for these metrics follows this list).
  • Accuracy:
    • Free-form: $\text{Accuracy} = \frac{\#\text{correct region-specific answers}}{\#\text{total questions}}$
    • MCQ: proportion of exact matches.
  • Adaptation Scores (DIWALI):
    • Average adaptation score: $\mathrm{AdaptationScore}(x') = \frac{1}{N} \sum_{i=1}^{N} I(w_i)$
    • Sub-regional coverage: $\mathrm{AdaptCover}(r) = \frac{|R_r|}{\sum_{r'} |R_{r'}|}$
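
A computation sketch for the bias, accuracy, and adaptation metrics above, using hypothetical counts; scipy supplies the chi-square test and KL-divergence (relative entropy):

```python
import numpy as np
from scipy.stats import chisquare, entropy

regions = ["North", "South", "East", "West", "Central"]
observed = np.array([236, 180, 150, 152, 268])                   # hypothetical picks O_r
expected = np.full(len(regions), observed.sum() / len(regions))  # uniform null E_r

chi2, p_value = chisquare(observed, expected)   # chi^2 = sum_r (O_r - E_r)^2 / E_r
selection_ratio = observed / expected           # SR_r = O_r / E_r
kl = entropy(observed / observed.sum(), expected / expected.sum())  # KL(P || Q)

print(f"chi2 = {chi2:.1f} (p = {p_value:.3g}), KL = {kl:.4f}")
for region, sr in zip(regions, selection_ratio):
    print(f"  {region}: SR = {sr:.2f}")

# AdaptationScore: mean of indicators I(w_i) marking whether each substituted
# item w_i belongs to the target region's CSI inventory (inputs assumed given).
indicators = np.array([1, 1, 0, 1, 0, 1])
print(f"AdaptationScore = {indicators.mean():.3f}")
```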

4. Model Evaluation and Empirical Findings

Across IndoCulture benchmarks, state-of-the-art LLMs and VLMs are evaluated under controlled, context-rich, and adversarial conditions:

a. Indian Context—INDICA and Relatives:

  • Free-form region-anchored question accuracy: 13.4–20.9% fully correct (e.g., GPT-5.2, Llama 3.3 70B), with an overall accuracy (including partial credit) of ~50% (Madhusudan et al., 22 Jan 2026).
  • MCQ bias: systematic over-selection of Central (1.34× expected) and North (1.18×) Indian answers; East and West under-selected (0.75–0.76×).
  • Cross-region agreement: only 39.4% universal for "shared" questions; lowest sharing in food and ritual domains, highest in domains with national codification (traffic, education).

b. Indian Sub-Region Concept Inventory (DIWALI):

  • Quantitative adaptation scores on GSM8k adaptation: model AAS (average adaptation score) up to 0.855; CANDLE and DOSA baselines ≤0.10, establishing DIWALI’s discrimination power (Sahoo et al., 22 Sep 2025).
  • Bias and coverage: Large models adapt mostly for populous "default" states (UP, MH) with Northeast and minority states poorly represented.
  • Qualitative errors: shallow name-swaps, partial replacements, failure to alter scenario content for true cultural resonance.

c. Indonesian IndoCulture-ID and IndoSoSci Injection:

  • Human performance: 100%; best open-weight LLMs: 52–53%; GPT-4: up to 75.8% with explicit province context (Koto et al., 2024).
  • Context awareness: Province-level prompts improve performance by up to +7 percentage points in strong models.
  • RAG with IndoSoSci: state-of-the-art on IndoCulture set at 81.4% (Sailor2-L-20B-Chat + RAG+Wikipedia), +6–10 points over no-retrieval baseline (Kartiyasa et al., 19 Jan 2026).
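
A schematic of the retrieval-augmented setup reported above; the retriever and model interfaces are hypothetical placeholders rather than the paper’s actual code:

```python
def answer_with_rag(question: str, retriever, llm, k: int = 5) -> str:
    """Prepend top-k retrieved passages (e.g., from IndoSoSci or Wikipedia)
    to the prompt before generation; both interfaces below are assumed."""
    passages = retriever.search(question, top_k=k)    # assumed retriever API
    context = "\n\n".join(p.text for p in passages)
    prompt = (
        "Use the following background passages to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm.generate(prompt)                       # assumed LLM API
```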

d. Multimodal and Multilingual Tracks:

  • IndicVisionBench: VLMs perform well on factual QA and standard OCR but poorly on adversarial and culturally nuanced queries; marked drop-off in low-resource languages (e.g., Odia) (Faraz et al., 6 Nov 2025).
  • DRISHTIKON (overview only): Deep coverage of cultural themes, 15 languages, all states/UTs, >64,000 text-image pairs; uncovers deficits in multimodal cultural reasoning (Maji et al., 23 Sep 2025).

5. Bias, Coverage, and Revealed Model Limitations

IndoCulture benchmarks expose several consistent model limitations:

  • Geographic Bias: Models disproportionately default to majority or high-population cultural regions—Central/North (India), Bali/West Java (Indonesia)—especially in absence of explicit context (Madhusudan et al., 22 Jan 2026, Koto et al., 2024).
  • Surface Adaptation: LLMs typically favor shallow, word-level swaps over deep scenario adaptation, resulting in outputs lacking genuine cultural resonance or “aboutness” (Sahoo et al., 22 Sep 2025).
  • Representation Gaps: Attributes like cuisine, ritual, and arts from under-documented or minority regions are consistently misrepresented or omitted.
  • Adversarial Robustness: VLMs and LLMs underperform on tasks requiring rejection of subtly false cultural premises, revealing insufficiently robust cultural reasoning (Faraz et al., 6 Nov 2025).
  • Multilingual Weakness: Declining scores in low-resource scripts and languages; open-weight models underperform relative to closed-source counterparts.

6. Design Generalization and Future Directions

IndoCulture pipeline elements are transferable to new regions and nations:

  • Domain selection based on anthropological taxonomies (OCM, Newmark).
  • Regional ground-truthing and consensus protocols.
  • Integration of multimodal (image, OCR, translation, VQA) and multi-task settings (Faraz et al., 6 Nov 2025, Maji et al., 23 Sep 2025).
  • Introduction of metrics that distinguish knowledge (do models know regional facts?) vs. bias (do they default to high-population “standard” answers?).

Recommendations for future work across IndoCulture implementations include:

  1. Expansion of coverage (additional attributes, folklore, dialects, ecological knowledge).
  2. Increased participatory annotation, especially from underrepresented and minority regions.
  3. Development of adversarial/causal and scenario-based question types to probe narrative and “deep” cultural awareness (Maji et al., 18 Jun 2025, Sahoo et al., 22 Sep 2025).
  4. Open-sourcing comprehensive multilingual and vision-language resources with balanced regional representation.
  5. Adoption of retrieval-enhanced architectures, especially harnessing native academic corpora (e.g., IndoSoSci) for under-attested cultures (Kartiyasa et al., 19 Jan 2026).

7. Impact and Significance

IndoCulture benchmarks collectively recast the evaluation of language technology away from simplistic national categories toward rigorous, regionally anchored, inclusive paradigms. Empirical findings (marked regional variation in “cultural commonsense,” measurable LLM bias, and broad coverage gaps even in contemporary foundation models) underscore the need for fine-grained, context-sensitive cultural datasets and methodologies (Madhusudan et al., 22 Jan 2026, Koto et al., 2024, Sahoo et al., 22 Sep 2025, Maji et al., 18 Jun 2025). The IndoCulture approach is already shaping best practices for dataset construction, annotation, and evaluation in global AI, providing templates for any culturally heterogeneous nation (Madhusudan et al., 22 Jan 2026). The progressive enrichment of these resources with multimodal, multilingual, and retrieval-augmented approaches is expected to be a central pillar of future culturally aware language technology research and deployment.
