2026 Korean CSAT LLM Evaluation Leaderboard

Updated 30 November 2025
  • The 2026 Korean CSAT LLM Evaluation Leaderboard is a benchmark that rigorously evaluates large language models on Korea’s national college exam using the KoNET dataset.
  • It employs formal test-taking protocols with chain-of-thought prompting and multidimensional scoring to assess text, image, and multimodal inputs.
  • Results highlight superior performance by closed-source models and reveal tradeoffs between reasoning depth and efficiency in achieving high accuracy.

The 2026 Korean CSAT LLM Evaluation Leaderboard is a benchmark-driven comparative analysis of LLMs and multimodal LLMs (MLLMs) on the Korean College Scholastic Ability Test (KoCSAT), the national university entrance examination in Korea. The leaderboard is constructed using the KoNET benchmark, which is tailored to rigorously evaluate AI performance across modalities and cognitive domains using authentic Korean educational standards. The evaluation framework is grounded in formal test-taking protocols, zero-data-leakage methodologies, and multidimensional metrics to assess both general and mathematical reasoning capabilities of state-of-the-art models (Park et al., 21 Feb 2025, Pyeon et al., 23 Nov 2025).

1. KoCSAT Dataset and Benchmark Description

KoCSAT is an image-based dataset comprising 897 items sourced across 41 subjects, with each subject providing between 20 and 45 questions in accordance with yearly KICE distributions. The dataset is partitioned by modality:

  • Knowledge-QA (K-QA): 57 items (6.4%)
  • Text Comprehension (TC-QA): 388 items (43.3%)
  • Multimodal Comprehension (MC-QA): 452 items (50.3%)

Question difficulty is divided into five levels (Level 1–5), with the 2023 distribution reported as 18% (L1), 28% (L2), 30% (L3), 18% (L4), 6% (L5). Human error rates are reported for a subset of 327 high-difficulty items, ranging from 10.6% up to 98.2%, with mean $\mathrm{HumanErr}\approx 42.3\%$, corresponding to $\mathrm{HumanAcc}\approx 57.7\%$ (Park et al., 21 Feb 2025).

The mathematics section is benchmarked using 46 digitized items (22 common, 24 elective) to ensure zero-contamination; digitization takes place within two hours of public exam release (Pyeon et al., 23 Nov 2025).

2. Evaluation Protocols and Metrics

KoCSAT evaluation employs both direct option extraction and chain-of-thought (CoT) prompting (see Appendix B in (Park et al., 21 Feb 2025)), with Korean OCR handling text conversion for pure LLMs and native image reading for MLLMs. Scoring is zero-one exact match, with no partial credit for multiple-choice or constructed-response items; subjective questions are scored under the LLM-as-Judge paradigm (using GPT-4o).
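The protocol above can be sketched as follows. Both the prompt template and the `extract_choice` helper are illustrative assumptions, not the exact wording of Appendix B:

```python
import re

# Hypothetical CoT prompt template (the actual prompt is given in
# Appendix B of Park et al., 2025): ask for step-by-step reasoning,
# then a final line of the form "정답: <option number>".
COT_TEMPLATE = (
    "다음 문제를 단계별로 풀이한 뒤, 마지막 줄에 "
    "'정답: <번호>' 형식으로 답하세요.\n\n{question}"
)

def extract_choice(response: str):
    """Pull the final answer option (1-5) from a CoT response.

    Takes the LAST occurrence of the answer pattern so that option
    numbers mentioned mid-reasoning are ignored. Returns None when
    no parsable answer is found (scored as incorrect under zero-one
    exact match).
    """
    matches = re.findall(r"정답\s*[:：]\s*([1-5])", response)
    return int(matches[-1]) if matches else None
```

Taking the last match rather than the first is what makes CoT outputs safe to score: intermediate reasoning frequently cites several candidate options before committing.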

The principal metric is exact-match accuracy: $\mathrm{Acc}_{\mathrm{CSAT}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\hat{y}_i = y_i)$, where $N = 897$ and $\hat{y}_i$ is the model prediction for item $i$. The aggregate KoNET score is the unweighted mean over all four KoNET exams.
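In code, the exact-match metric and the KoNET aggregate reduce to the following (function names are illustrative):

```python
def exact_match_accuracy(preds, golds):
    """Zero-one exact match: Acc = (1/N) * sum 1(pred == gold).

    No partial credit; a missing or unparsable prediction simply
    fails the equality check.
    """
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(preds)

def konet_average(per_exam_acc):
    """Unweighted mean of per-exam accuracies over the four KoNET exams."""
    return sum(per_exam_acc.values()) / len(per_exam_acc)

# Toy usage: 3 of 4 predictions match -> 0.75
acc = exact_match_accuracy(["3", "1", "5", "2"], ["3", "1", "4", "2"])
```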

Mathematics evaluation further employs normalized score, time (latency), cost, and token efficiency metrics. Efficiency is defined as:

  • $\mathrm{Eff}_t = \mathrm{score}/\text{latency (ms)}$
  • $\mathrm{Eff}_c = \mathrm{score}/\text{cost (\$)}$
  • $\mathrm{Eff}_{t,c} = \mathrm{score} / (\text{latency}/152\,\text{min} + \text{cost}/\$1)$
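A minimal sketch of these three metrics, using the 152-minute and $1 normalizers from the definitions above (note the unit mismatch is deliberate: $\mathrm{Eff}_t$ is defined over milliseconds, the combined metric over minutes):

```python
def eff_t(score, latency_ms):
    """Time efficiency: score per millisecond of inference latency."""
    return score / latency_ms

def eff_c(score, cost_usd):
    """Cost efficiency: score per dollar of inference spend."""
    return score / cost_usd

def eff_tc(score, latency_min, cost_usd):
    """Combined efficiency: latency normalized by 152 minutes,
    cost normalized by $1, as in the definition above."""
    return score / (latency_min / 152.0 + cost_usd / 1.0)

# Toy usage: 95.7 points, 76 min total latency, $0.50 total cost
combined = eff_tc(95.7, 76, 0.50)
```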

3. Leaderboard: Overall CSAT and Mathematics Section Results

Comprehensive KoCSAT Results

Models are organized by access modality and source, with accuracy metrics aggregated across the 897-item evaluation set (Park et al., 21 Feb 2025):

| Model | Type | KoCSAT Acc (%) | KoNET Avg (%) |
|---|---|---|---|
| Closed-Source MLLMs | | | |
| GPT-4o-MLLM | Closed API | 66.1 | 83.4 |
| Claude-3.5-sonnet-MLLM | Closed API | 62.8 | 80.6 |
| HyperCLOVA-X-MLLM | Closed API | 55.7 | 74.0 |
| Gemini-1.5-pro-MLLM | Closed API | 52.4 | 73.3 |
| Closed-Source LLMs | | | |
| Claude-3.5-sonnet | Closed API | 60.5 | 76.0 |
| GPT-4o | Closed API | 52.5 | 70.8 |
| HyperCLOVA-X | Closed API | 51.2 | 70.9 |
| Gemini-1.5-pro | Closed API | 44.0 | 66.4 |
| Open-Source LLMs | | | |
| Qwen2-72B-Instruct | OSS | 36.0 | 58.7 |
| gemma-2-27b-it | OSS | 33.9 | 55.9 |
| Meta-Llama-3.1-70B | OSS | 31.2 | 50.8 |
| EXAONE-3.0-7.8B | OSS | 24.2 | 45.5 |
| Open-Source VLMs | | | |
| Qwen2-VL-7B-Instruct | OSS VLM | 16.9 | 34.3 |
| InternVL2-40B | OSS VLM | 11.9 | 20.8 |
| llava-next-110B-hf | OSS VLM | 12.0 | 17.6 |
| Human Baseline | Examinees | 57.7 | --- |

In the mathematics section (text-only, Korean prompt), leading models yielded:

| Rank | Model | Score (/100) |
|---|---|---|
| 1 | GPT-5 Codex | 100 |
| 2 | Grok 4 | 97.8 |
| 3–5 | GPT-5, Grok 4 Fast, gpt-oss-20B | 95.7 |
| 7 | GPT-5 nano | 89.1 |
| 8 | Deepseek R1 | 95.7 |
| 19 | Llama 4 Maverick | 21.7 |

Notably, gpt-oss-20B, a relatively small open model, achieved 95.7 points at exceptionally low cost (Pyeon et al., 23 Nov 2025).

4. Modality, Prompting, and Subject-Specific Performance

Modalities and Input Types

  • Text Only outperforms image-based input for all models; the top model (GPT-5 Codex) achieves a perfect score in both settings.
  • Image Only results in significant accuracy drops for most models except the largest closed-source models.
  • Text+Figure achieves nearly identical top scores as text only for leading models, with minor degradation for smaller models.

Subject and Domain Breakdown

For GPT-4o-MLLM on KoCSAT items (Park et al., 21 Feb 2025):

| Subject Group | Avg Accuracy (%) | Std Dev (%) |
|---|---|---|
| Korean Language | 82.5 | 3.8 |
| Mathematics | 78.9 | 5.1 |
| English | 75.3 | 4.7 |
| Science | 72.4 | 6.2 |
| Social Studies | 69.8 | 7.0 |
| 2nd Languages | 64.2 | 9.3 |

In the mathematics section, item-level analysis shows:

  • Geometry & Algebra: ≈95–100%
  • Statistics/Probability/Combinatorics: 85–90%
  • Weakest: Permutation/Combination ≈80%
  • Difficulty: Scores drop from ≈93–97% on 2-point (easy) items to 68–71% on 4-point (hard) items, implying the most severe degradation at the high-difficulty tail (Pyeon et al., 23 Nov 2025).
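The difficulty-stratified accuracies above come from a simple per-point-value aggregation of item-level results, which can be sketched as follows (the item list here is toy data chosen to mirror the reported trend, not the actual results):

```python
from collections import defaultdict

def accuracy_by_points(items):
    """Aggregate per-item correctness into accuracy per point value.

    `items` is an iterable of (points, correct) pairs, e.g. (4, False)
    for a missed 4-point item.
    """
    totals = defaultdict(lambda: [0, 0])  # points -> [n_correct, n_seen]
    for points, correct in items:
        totals[points][0] += int(correct)
        totals[points][1] += 1
    return {p: c / n for p, (c, n) in totals.items()}

# Toy data: 19/20 correct on 2-point items, 7/10 on 4-point items
items = [(2, True)] * 19 + [(2, False)] + [(4, True)] * 7 + [(4, False)] * 3
by_points = accuracy_by_points(items)  # {2: 0.95, 4: 0.7}
```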

A weak positive correlation ($\rho \approx 0.18$) exists between human and model errors; both humans and models are challenged by the most difficult items, though their error patterns are only loosely aligned (Park et al., 21 Feb 2025).
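A rank correlation of this kind can be computed with a dependency-free sketch like the one below (a simplified Spearman that omits tie correction, so it assumes distinct error rates; in practice one would use `scipy.stats.spearmanr`):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Simplified for illustration -- no tie correction, so inputs are
    assumed to contain no duplicate values.
    """
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Fed per-item human error rates and model error indicators, a value near 0.18 would indicate the weak alignment described above.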

5. Error Analysis and Confusion Patterns

The confusion matrix for GPT-4o-MLLM indicates that most misclassifications on 5-choice MCQ items occur between adjacent options: off-diagonal mass is concentrated next to the diagonal, suggesting the model tends to confuse near-correct answers rather than guess at random.

$$C^{\mathrm{GPT4o}} = \begin{pmatrix} .76 & .09 & .03 & .02 & .00 \\ .07 & .72 & .08 & .03 & .00 \\ .04 & .07 & .68 & .09 & .02 \\ .03 & .04 & .09 & .70 & .04 \\ .00 & .00 & .01 & .12 & .87 \end{pmatrix}$$
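The adjacency claim can be checked directly from the matrix entries: the sketch below computes the share of off-diagonal (error) mass that falls on options adjacent to the true one, which is roughly 75% for the values above.

```python
# Confusion matrix for GPT-4o-MLLM as reported above:
# rows = true option, columns = predicted option.
C = [
    [.76, .09, .03, .02, .00],
    [.07, .72, .08, .03, .00],
    [.04, .07, .68, .09, .02],
    [.03, .04, .09, .70, .04],
    [.00, .00, .01, .12, .87],
]

def adjacent_error_share(matrix):
    """Fraction of off-diagonal (error) mass on adjacent options (|i-j|==1)."""
    total_err = adjacent_err = 0.0
    for i, row in enumerate(matrix):
        for j, p in enumerate(row):
            if i != j:
                total_err += p
                if abs(i - j) == 1:
                    adjacent_err += p
    return adjacent_err / total_err

share = adjacent_error_share(C)  # ~0.75 for the matrix above
```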

Average accuracy by modality across all subjects: K-QA ≈85%, TC-QA ≈74%, MC-QA ≈61%.

6. Model Adaptation, Prompting, and Efficiency Insights

  • Chain-of-Thought (CoT) prompting provides a 2–5 pp accuracy gain, more substantial for closed-source models (e.g., GPT-4o +4.1 pp vs. open-source +2.3 pp).
  • Korean-specialized OCR gives a 6–10pp advantage for pure LLMs, almost closing the gap with MLLMs.
  • Domain-adaptive pretraining on Korean corpora (e.g., EXAONE-3.0) confers a 4–6pp boost over non-specialized bilingual models.
  • Reasoning-enhancement experiments (GPT-5 series) show that increasing “Reasoning_Effort” from minimal to high raises mathematics accuracy from 82.6% to 100% but quadruples token usage and causes a 75% drop in time-cost efficiency. For example, per-question token usage increases from ≈60 (minimal) to 240–268 (high effort), with corresponding drops in $\mathrm{Eff}_t$ (1.7 → 0.37), $\mathrm{Eff}_c$ (477 → 112), and $\mathrm{Eff}_{t,c}$ (1.4 → 0.36).

A plausible implication is that an optimal tradeoff exists between maximal accuracy through step-by-step reasoning and practical inference efficiency for large-scale deployments (Pyeon et al., 23 Nov 2025).
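The tradeoff can be made concrete with back-of-the-envelope arithmetic on the reported sweep. The 254-token figure below is an assumed midpoint of the reported 240–268 range, and the computed drop uses $\mathrm{Eff}_t$ (1.7 → 0.37), which comes out near the ~75% figure quoted above:

```python
# Reported figures for the GPT-5 reasoning-effort sweep (Pyeon et al., 2025).
# The high-effort token count is an assumed midpoint of the 240-268 range.
minimal = {"acc": 82.6, "tokens": 60, "eff_t": 1.7}
high    = {"acc": 100.0, "tokens": 254, "eff_t": 0.37}

token_blowup = high["tokens"] / minimal["tokens"]   # ~4.2x more tokens
eff_t_drop = 1 - high["eff_t"] / minimal["eff_t"]   # ~0.78, i.e. ~78% drop
acc_gain = high["acc"] - minimal["acc"]             # +17.4 pp accuracy
```

So roughly 4x the tokens buys 17.4 pp of accuracy at ~4/5 of the time efficiency lost, which is the tradeoff a deployment would have to price.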

7. Significance and Broader Implications

The 2026 Korean CSAT LLM Evaluation Leaderboard demonstrates that state-of-the-art closed-weight MLLMs (e.g., GPT-4o-MLLM, Claude-3.5-sonnet-MLLM) decisively outperform open-source alternatives, exceeding both aggregate and subject-specific accuracies. However, Korean-adapted open-source models (Qwen2, EXAONE-3.0) are competitive within their class, especially when leveraging CoT prompting and Korean OCR.

The methodology emphasizes strict data-leakage avoidance, robust multimodal evaluation, and culturally relevant benchmarking, collectively establishing a new standard for non-English education-based LLM assessment. This framework is instructive for benchmarking in other less-resourced languages and for modeling efficiency tradeoffs in high-stakes, high-difficulty academic settings (Park et al., 21 Feb 2025, Pyeon et al., 23 Nov 2025).
