2026 Korean CSAT LLM Evaluation Leaderboard
- The 2026 Korean CSAT LLM Evaluation Leaderboard is a benchmark that rigorously evaluates large language models on Korea’s national college exam using the KoNET dataset.
- It employs formal test-taking protocols with chain-of-thought prompting and multidimensional scoring to assess text, image, and multimodal inputs.
- Results highlight superior performance by closed-source models and reveal tradeoffs between reasoning depth and efficiency in achieving high accuracy.
The 2026 Korean CSAT LLM Evaluation Leaderboard is a benchmark-driven comparative analysis of LLMs and multimodal LLMs (MLLMs) on the Korean College Scholastic Ability Test (KoCSAT), the national university entrance examination in Korea. The leaderboard is constructed using the KoNET benchmark, which is tailored to rigorously evaluate AI performance across modalities and cognitive domains using authentic Korean educational standards. The evaluation framework is grounded in formal test-taking protocols, zero-data-leakage methodologies, and multidimensional metrics to assess both general and mathematical reasoning capabilities of state-of-the-art models (Park et al., 21 Feb 2025, Pyeon et al., 23 Nov 2025).
1. KoCSAT Dataset and Benchmark Description
KoCSAT is an image-based dataset comprising 897 items sourced across 41 subjects, with each subject providing between 20 and 45 questions in accordance with yearly KICE distributions. The dataset is partitioned by modality:
- Knowledge-QA (K-QA): 57 items (6.4%)
- Text Comprehension (TC-QA): 388 items (43.3%)
- Multimodal Comprehension (MC-QA): 452 items (50.3%)
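The modality split above can be checked with a quick computation (the counts are from the paper; the dictionary layout is illustrative, not the dataset's actual schema):

```python
# Verifying the reported KoCSAT modality split from the raw item counts.
partition = {"K-QA": 57, "TC-QA": 388, "MC-QA": 452}

total = sum(partition.values())  # all items across the three modalities
shares = {k: round(100 * v / total, 1) for k, v in partition.items()}

print(total)   # 897
print(shares)  # percentages, summing to ~100
```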
Question difficulty is divided into five levels (Level 1–5); the 2023 distribution is reported as 18% (L1), 28% (L2), 30% (L3), 18% (L4), and 6% (L5). Human error rates are reported for a subset of 327 high-difficulty items, ranging from 10.6% up to 98.2% (Park et al., 21 Feb 2025).
The mathematics section is benchmarked using 46 digitized items (22 common, 24 elective) to ensure zero-contamination; digitization takes place within two hours of public exam release (Pyeon et al., 23 Nov 2025).
2. Evaluation Protocols and Metrics
KoCSAT evaluation employs both direct option extraction and chain-of-thought (CoT) prompting (see Appendix B in (Park et al., 21 Feb 2025)), with Korean OCR handling text conversion for pure LLMs and native image reading for MLLMs. Scoring is zero-one exact-match, with no partial credit for multiple-choice or constructed-response items; subjective questions are graded under the LLM-as-Judge paradigm (using GPT-4o).
The principal metric is exact-match accuracy, $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}[\hat{y}_i = y_i]$, where $y_i$ is the gold answer and $\hat{y}_i$ is the model prediction. The aggregate KoNET score is the unweighted mean over all four KoNET exams.
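The scoring rule reduces to a few lines; the per-exam accuracies below are toy numbers, not results from the leaderboard:

```python
# Zero-one exact match per item, then an unweighted mean over exams.
def exact_match_accuracy(preds, golds):
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Toy predictions for one exam section: 3 of 4 items match the gold answers.
acc = exact_match_accuracy([3, 1, 4, 2], [3, 1, 5, 2])
print(acc)  # 0.75

# Aggregate KoNET score: unweighted mean of the four per-exam accuracies
exam_accs = [0.661, 0.700, 0.850, 0.900]  # illustrative values only
konet_avg = sum(exam_accs) / len(exam_accs)
```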
Mathematics evaluation further employs normalized score, time (latency), cost, and token efficiency metrics. Efficiency is defined as:
- $\mathrm{Eff}_c = \mathrm{score}/\text{cost (\$)}$
- $\mathrm{Eff}_{t,c} = \mathrm{score} / (\text{latency}/152\,\text{min} + \text{cost}/1\,\$)$
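The two efficiency metrics can be sketched directly from their definitions; the 152-minute and $1 normalizers come from the formulas above, while the score, latency, and cost inputs below are made up for illustration:

```python
# Efficiency metrics as defined for the mathematics evaluation.
def eff_cost(score: float, cost_usd: float) -> float:
    # Score per dollar spent.
    return score / cost_usd

def eff_time_cost(score: float, latency_min: float, cost_usd: float) -> float:
    # Score per unit of combined normalized latency and cost.
    return score / (latency_min / 152.0 + cost_usd / 1.0)

score, latency_min, cost_usd = 95.7, 30.0, 0.25  # illustrative run
print(round(eff_cost(score, cost_usd), 1))  # 382.8
print(round(eff_time_cost(score, latency_min, cost_usd), 1))
```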
3. Leaderboard: Overall CSAT and Mathematics Section Results
Comprehensive KoCSAT Results
Models are organized by access modality and source, with accuracy metrics aggregated across the 897-item evaluation set (Park et al., 21 Feb 2025):
| Model | Type | KoCSAT Acc (%) | KoNET Avg (%) |
|---|---|---|---|
| Closed-Source MLLMs | | | |
| GPT-4o-MLLM | Closed API | 66.1 | 83.4 |
| Claude-3.5-sonnet-MLLM | Closed API | 62.8 | 80.6 |
| HyperCLOVA-X-MLLM | Closed API | 55.7 | 74.0 |
| Gemini-1.5-pro-MLLM | Closed API | 52.4 | 73.3 |
| Closed-Source LLMs | | | |
| Claude-3.5-sonnet | Closed API | 60.5 | 76.0 |
| GPT-4o | Closed API | 52.5 | 70.8 |
| HyperCLOVA-X | Closed API | 51.2 | 70.9 |
| Gemini-1.5-pro | Closed API | 44.0 | 66.4 |
| Open-Source LLMs | | | |
| Qwen2-72B-Instruct | OSS | 36.0 | 58.7 |
| gemma-2-27b-it | OSS | 33.9 | 55.9 |
| Meta-Llama-3.1-70B | OSS | 31.2 | 50.8 |
| EXAONE-3.0-7.8B | OSS | 24.2 | 45.5 |
| … | … | … | … |
| Open-Source VLMs | | | |
| Qwen2-VL-7B-Instruct | OSS VLM | 16.9 | 34.3 |
| InternVL2-40B | OSS VLM | 11.9 | 20.8 |
| llava-next-110B-hf | OSS VLM | 12.0 | 17.6 |
| Human Baseline | Examinees | 57.7 | --- |
In the mathematics section (text-only, Korean prompt), leading models yielded:
| Rank | Model | Score (/100) |
|---|---|---|
| 1 | GPT-5 Codex | 100 |
| 2 | Grok 4 | 97.8 |
| 3–6 | GPT-5, Grok 4 Fast, gpt-oss-20B, Deepseek R1 | 95.7 |
| 7 | GPT-5 nano | 89.1 |
| … | … | … |
| 19 | Llama 4 Maverick | 21.7 |
Notably, gpt-oss-20B, a relatively small open model, achieved 95.7 points at exceptionally low cost (Pyeon et al., 23 Nov 2025).
4. Modality, Prompting, and Subject-Specific Performance
Modalities and Input Types
- Text Only outperforms image-based input for nearly all models; the top model (GPT-5 Codex) scores perfectly in both settings.
- Image Only results in significant accuracy drops for most models except the largest closed-source models.
- Text+Figure achieves nearly identical top scores as text only for leading models, with minor degradation for smaller models.
Subject and Domain Breakdown
For GPT-4o-MLLM on KoCSAT items (Park et al., 21 Feb 2025):
| Subject Group | Avg Accuracy (%) | Std Dev (%) |
|---|---|---|
| Korean Language | 82.5 | 3.8 |
| Mathematics | 78.9 | 5.1 |
| English | 75.3 | 4.7 |
| Science | 72.4 | 6.2 |
| Social Studies | 69.8 | 7.0 |
| 2nd Languages | 64.2 | 9.3 |
In the mathematics section, item-level analysis shows:
- Geometry & Algebra: ≈95–100%
- Statistics/Probability/Combinatorics: 85–90%
- Weakest: Permutation/Combination ≈80%
- Difficulty: Scores drop from ∼93–97% on 2-point (easy) items to 68–71% on 4-point (hard) items, implying the most severe degradation at the high-difficulty tail (Pyeon et al., 23 Nov 2025).
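Since CSAT mathematics items carry 2-, 3-, or 4-point weights, the score (/100) is simply the point-weighted sum over correctly answered items. A minimal sketch, with illustrative weights and correctness flags:

```python
# Point-weighted scoring for a handful of toy items; real exams carry
# enough items for the weights to sum to 100.
items = [
    {"points": 2, "correct": True},
    {"points": 3, "correct": True},
    {"points": 4, "correct": False},  # hard 4-point items fail most often
    {"points": 4, "correct": True},
]
score = sum(it["points"] for it in items if it["correct"])
max_score = sum(it["points"] for it in items)
print(score, max_score)  # 9 13
```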
A weak positive correlation exists between human and model error rates; both humans and models are challenged by the most difficult items, though error patterns are only loosely aligned (Park et al., 21 Feb 2025).
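A human-vs-model error correlation of this kind can be computed as Pearson's r over per-item error rates. The vectors below are toy data, not the paper's:

```python
# Pearson correlation between two per-item error-rate vectors.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

human_err = [0.11, 0.35, 0.52, 0.70, 0.98]  # toy per-item error rates
model_err = [0.30, 0.10, 0.60, 0.20, 0.70]
print(round(pearson(human_err, model_err), 2))  # positive but well below 1
```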
5. Error Analysis and Confusion Patterns
The confusion matrix for GPT-4o-MLLM indicates that most misclassifications on 5-choice MCQ items occur between adjacent options: errors concentrate near the diagonal rather than scattering across distant options, suggesting the model typically confuses near-correct answers rather than guessing at random.
Average accuracy by modality across all subjects: K-QA ≈85%, TC-QA ≈74%, MC-QA ≈61%.
6. Model Adaptation, Prompting, and Efficiency Insights
- Chain-of-Thought (CoT) prompting provides a 2–5pp accuracy gain, with larger gains for closed-source models (e.g., GPT-4o +4.1pp vs. open-source +2.3pp).
- Korean-specialized OCR gives a 6–10pp advantage for pure LLMs, almost closing the gap with MLLMs.
- Domain-adaptive pretraining on Korean corpora (e.g., EXAONE-3.0) confers a 4–6pp boost over non-specialized bilingual models.
- Reasoning-enhancement experiments (GPT-5 series) show that increasing “Reasoning_Effort” from minimal to high raises accuracy on mathematics from 82.6% to 100% but quadruples token usage and causes a 75% drop in time-cost efficiency. For example, per-question token usage increases from ≈60 (minimal) to 240–268 (high effort), with corresponding drops in $\mathrm{Eff}_t$ (1.7 to 0.37), $\mathrm{Eff}_c$ (477 to 112), and $\mathrm{Eff}_{t,c}$ (1.4 to 0.36).
A plausible implication is that an optimal tradeoff exists between maximal accuracy through step-by-step reasoning and practical inference efficiency for large-scale deployments (Pyeon et al., 23 Nov 2025).
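The tradeoff arithmetic can be sketched with the time-cost efficiency formula from Section 2; the score, latency, and cost figures below are illustrative stand-ins for the minimal- and high-effort settings, not the paper's measurements:

```python
# How raising reasoning effort can lift the score yet crater efficiency.
def eff_time_cost(score, latency_min, cost_usd):
    return score / (latency_min / 152.0 + cost_usd / 1.0)

# Hypothetical minimal-effort run: lower score, cheap and fast.
minimal = eff_time_cost(score=82.6, latency_min=40.0, cost_usd=0.10)
# Hypothetical high-effort run: perfect score, ~4x the latency and cost.
high = eff_time_cost(score=100.0, latency_min=160.0, cost_usd=0.40)

print(round(high / minimal, 2))  # efficiency ratio well below 1
```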
7. Significance and Broader Implications
The 2026 Korean CSAT LLM Evaluation Leaderboard demonstrates that state-of-the-art closed-weight MLLMs (e.g., GPT-4o-MLLM, Claude-3.5-sonnet-MLLM) decisively outperform open-source alternatives in both aggregate and subject-specific accuracy. However, Korean-adapted open-source models (Qwen2, EXAONE-3.0) are competitive within their class, especially when leveraging CoT prompting and Korean OCR.
The methodology emphasizes strict data-leakage avoidance, robust multimodal evaluation, and culturally relevant benchmarking, collectively establishing a new standard for non-English education-based LLM assessment. This framework is instructive for benchmarking in other less-resourced languages and for modeling efficiency tradeoffs in high-stakes, high-difficulty academic settings (Park et al., 21 Feb 2025, Pyeon et al., 23 Nov 2025).