2026 Korean CSAT LLM Evaluation Leaderboard
- The 2026 Korean CSAT LLM Evaluation Leaderboard is a benchmark that rigorously evaluates large language models on Korea’s national college exam using the KoNET dataset.
- It employs formal test-taking protocols with chain-of-thought prompting and multidimensional scoring to assess text, image, and multimodal inputs.
- Results highlight superior performance by closed-source models and reveal tradeoffs between reasoning depth and efficiency in achieving high accuracy.
The 2026 Korean CSAT LLM Evaluation Leaderboard is a benchmark-driven comparative analysis of LLMs and multimodal LLMs (MLLMs) on the Korean College Scholastic Ability Test (KoCSAT), the national university entrance examination in Korea. The leaderboard is constructed using the KoNET benchmark, which is tailored to rigorously evaluate AI performance across modalities and cognitive domains using authentic Korean educational standards. The evaluation framework is grounded in formal test-taking protocols, zero-data-leakage methodologies, and multidimensional metrics to assess both general and mathematical reasoning capabilities of state-of-the-art models (Park et al., 21 Feb 2025, Pyeon et al., 23 Nov 2025).
1. KoCSAT Dataset and Benchmark Description
KoCSAT is an image-based dataset comprising 897 items sourced across 41 subjects, with each subject providing between 20 and 45 questions in accordance with yearly KICE distributions. The dataset is partitioned by modality:
- Knowledge-QA (K-QA): 57 items (6.4%)
- Text Comprehension (TC-QA): 388 items (43.3%)
- Multimodal Comprehension (MC-QA): 452 items (50.3%)
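The modality split above can be checked with a quick computation (the counts are from the paper; the dictionary layout is illustrative, not the dataset's actual schema):

```python
# Verifying the reported KoCSAT modality split from the raw item counts.
partition = {"K-QA": 57, "TC-QA": 388, "MC-QA": 452}

total = sum(partition.values())  # all items across the three modalities
shares = {k: round(100 * v / total, 1) for k, v in partition.items()}

print(total)   # 897
print(shares)  # percentages, summing to ~100
```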
Question difficulty is divided into five levels (Level 1–5); the 2023 distribution is reported as 18% (L1), 28% (L2), 30% (L3), 18% (L4), and 6% (L5). Human error rates are reported for a subset of 327 high-difficulty items, ranging from 10.6% up to 98.2% (Park et al., 21 Feb 2025).
The mathematics section is benchmarked using 46 digitized items (22 common, 24 elective) to ensure zero-contamination; digitization takes place within two hours of public exam release (Pyeon et al., 23 Nov 2025).
2. Evaluation Protocols and Metrics
KoCSAT evaluation employs both direct option extraction and chain-of-thought (CoT) prompting (see Appendix B in (Park et al., 21 Feb 2025)), with Korean OCR handling text conversion for pure LLMs and native image reading for MLLMs. Scoring is zero-one exact-match, with no partial credit for multiple-choice or constructed-response items; subjective questions are graded under the LLM-as-Judge paradigm (using GPT-4o).
The principal metric is exact-match accuracy, $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}[\hat{y}_i = y_i]$, where $y_i$ is the gold answer and $\hat{y}_i$ is the model prediction. The aggregate KoNET score is the unweighted mean over all four KoNET exams.
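The scoring rule reduces to a few lines; the per-exam accuracies below are toy numbers, not results from the leaderboard:

```python
# Zero-one exact match per item, then an unweighted mean over exams.
def exact_match_accuracy(preds, golds):
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Toy predictions for one exam section: 3 of 4 items match the gold answers.
acc = exact_match_accuracy([3, 1, 4, 2], [3, 1, 5, 2])
print(acc)  # 0.75

# Aggregate KoNET score: unweighted mean of the four per-exam accuracies
exam_accs = [0.661, 0.700, 0.850, 0.900]  # illustrative values only
konet_avg = sum(exam_accs) / len(exam_accs)
```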
Mathematics evaluation further employs normalized score, time (latency), cost, and token efficiency metrics. Efficiency is defined as:
- $\mathrm{Eff}_c = \mathrm{score}/\text{cost (\$)}$
- $\mathrm{Eff}_{t,c} = \mathrm{score} / (\text{latency}/152\,\text{min} + \text{cost}/1\,\$)$
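The two efficiency metrics can be sketched directly from their definitions; the 152-minute and $1 normalizers come from the formulas above, while the score, latency, and cost inputs below are made up for illustration:

```python
# Efficiency metrics as defined for the mathematics evaluation.
def eff_cost(score: float, cost_usd: float) -> float:
    # Score per dollar spent.
    return score / cost_usd

def eff_time_cost(score: float, latency_min: float, cost_usd: float) -> float:
    # Score per unit of combined normalized latency and cost.
    return score / (latency_min / 152.0 + cost_usd / 1.0)

score, latency_min, cost_usd = 95.7, 30.0, 0.25  # illustrative run
print(round(eff_cost(score, cost_usd), 1))  # 382.8
print(round(eff_time_cost(score, latency_min, cost_usd), 1))
```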
3. Leaderboard: Overall CSAT and Mathematics Section Results
Comprehensive KoCSAT Results
Models are organized by access modality and source, with accuracy metrics aggregated across the 897-item evaluation set (Park et al., 21 Feb 2025):
| Model | Type | KoCSAT Acc (%) | KoNET Avg (%) |
|---|---|---|---|
| Closed-Source MLLMs | | | |
| GPT-4o-MLLM | Closed API | 66.1 | 83.4 |
| Claude-3.5-sonnet-MLLM | Closed API | 62.8 | 80.6 |
| HyperCLOVA-X-MLLM | Closed API | 55.7 | 74.0 |
| Gemini-1.5-pro-MLLM | Closed API | 52.4 | 73.3 |
| Closed-Source LLMs | | | |
| Claude-3.5-sonnet | Closed API | 60.5 | 76.0 |
| GPT-4o | Closed API | 52.5 | 70.8 |
| HyperCLOVA-X | Closed API | 51.2 | 70.9 |
| Gemini-1.5-pro | Closed API | 44.0 | 66.4 |
| Open-Source LLMs | | | |
| Qwen2-72B-Instruct | OSS | 36.0 | 58.7 |
| gemma-2-27b-it | OSS | 33.9 | 55.9 |
| Meta-Llama-3.1-70B | OSS | 31.2 | 50.8 |
| EXAONE-3.0-7.8B | OSS | 24.2 | 45.5 |
| … | … | … | … |
| Open-Source VLMs | | | |
| Qwen2-VL-7B-Instruct | OSS VLM | 16.9 | 34.3 |
| InternVL2-40B | OSS VLM | 11.9 | 20.8 |
| llava-next-110B-hf | OSS VLM | 12.0 | 17.6 |
| Human Baseline | Examinees | 57.7 | --- |
In the mathematics section (text-only, Korean prompt), leading models yielded:
| Rank | Model | Score (/100) |
|---|---|---|
| 1 | GPT-5 Codex | 100 |
| 2 | Grok 4 | 97.8 |
| 3–6 | GPT-5, Grok 4 Fast, gpt-oss-20B, Deepseek R1 | 95.7 |
| 7 | GPT-5 nano | 89.1 |
| … | … | … |
| 19 | Llama 4 Maverick | 21.7 |
Notably, gpt-oss-20B, a relatively small open model, achieved 95.7 points at exceptionally low cost (Pyeon et al., 23 Nov 2025).
4. Modality, Prompting, and Subject-Specific Performance
Modalities and Input Types
- Text Only outperforms image-based input for nearly all models; the top model (GPT-5 Codex) scores perfectly in both settings.
- Image Only results in significant accuracy drops for most models except the largest closed-source models.
- Text+Figure achieves nearly identical top scores as text only for leading models, with minor degradation for smaller models.
Subject and Domain Breakdown
For GPT-4o-MLLM on KoCSAT items (Park et al., 21 Feb 2025):
| Subject Group | Avg Accuracy (%) | Std Dev (%) |
|---|---|---|
| Korean Language | 82.5 | 3.8 |
| Mathematics | 78.9 | 5.1 |
| English | 75.3 | 4.7 |
| Science | 72.4 | 6.2 |
| Social Studies | 69.8 | 7.0 |
| 2nd Languages | 64.2 | 9.3 |
In the mathematics section, item-level analysis shows:
- Geometry & Algebra: ≈95–100%
- Statistics/Probability/Combinatorics: 85–90%
- Weakest: Permutation/Combination ≈80%
- Difficulty: Scores drop from ∼93–97% on 2-point (easy) items to 68–71% on 4-point (hard) items, implying the most severe degradation at the high-difficulty tail (Pyeon et al., 23 Nov 2025).
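Since CSAT mathematics items carry 2-, 3-, or 4-point weights, the score (/100) is simply the point-weighted sum over correctly answered items. A minimal sketch, with illustrative weights and correctness flags:

```python
# Point-weighted scoring for a handful of toy items; real exams carry
# enough items for the weights to sum to 100.
items = [
    {"points": 2, "correct": True},
    {"points": 3, "correct": True},
    {"points": 4, "correct": False},  # hard 4-point items fail most often
    {"points": 4, "correct": True},
]
score = sum(it["points"] for it in items if it["correct"])
max_score = sum(it["points"] for it in items)
print(score, max_score)  # 9 13
```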
A weak positive correlation exists between human and model error rates; both humans and models are challenged by the most difficult items, though error patterns are only loosely aligned (Park et al., 21 Feb 2025).
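A human-vs-model error correlation of this kind can be computed as Pearson's r over per-item error rates. The vectors below are toy data, not the paper's:

```python
# Pearson correlation between two per-item error-rate vectors.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

human_err = [0.11, 0.35, 0.52, 0.70, 0.98]  # toy per-item error rates
model_err = [0.30, 0.10, 0.60, 0.20, 0.70]
print(round(pearson(human_err, model_err), 2))  # positive but well below 1
```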
5. Error Analysis and Confusion Patterns
The confusion matrix for GPT-4o-MLLM indicates that most misclassifications on 5-choice MCQ items occur between adjacent options: errors concentrate near the diagonal rather than scattering across distant options, suggesting the model typically confuses near-correct answers rather than guessing at random.
Average accuracy by modality across all subjects: K-QA ≈85%, TC-QA ≈74%, MC-QA ≈61%.
6. Model Adaptation, Prompting, and Efficiency Insights
- Chain-of-Thought (CoT) prompting provides a 2–5pp accuracy gain, with larger gains for closed-source models (e.g., GPT-4o +4.1pp vs. open-source +2.3pp).
- Korean-specialized OCR gives a 6–10pp advantage for pure LLMs, almost closing the gap with MLLMs.
- Domain-adaptive pretraining on Korean corpora (e.g., EXAONE-3.0) confers a 4–6pp boost over non-specialized bilingual models.
- Reasoning-enhancement experiments (GPT-5 series) show that increasing “Reasoning_Effort” from minimal to high raises accuracy on mathematics from 82.6% to 100% but quadruples token usage and causes a 75% drop in time-cost efficiency. For example, per-question token usage increases from ≈60 (minimal) to 240–268 (high effort), with corresponding drops in $\mathrm{Eff}_t$ (1.7 to 0.37), $\mathrm{Eff}_c$ (477 to 112), and $\mathrm{Eff}_{t,c}$ (1.4 to 0.36).
A plausible implication is that an optimal tradeoff exists between maximal accuracy through step-by-step reasoning and practical inference efficiency for large-scale deployments (Pyeon et al., 23 Nov 2025).
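The tradeoff arithmetic can be sketched with the time-cost efficiency formula from Section 2; the score, latency, and cost figures below are illustrative stand-ins for the minimal- and high-effort settings, not the paper's measurements:

```python
# How raising reasoning effort can lift the score yet crater efficiency.
def eff_time_cost(score, latency_min, cost_usd):
    return score / (latency_min / 152.0 + cost_usd / 1.0)

# Hypothetical minimal-effort run: lower score, cheap and fast.
minimal = eff_time_cost(score=82.6, latency_min=40.0, cost_usd=0.10)
# Hypothetical high-effort run: perfect score, ~4x the latency and cost.
high = eff_time_cost(score=100.0, latency_min=160.0, cost_usd=0.40)

print(round(high / minimal, 2))  # efficiency ratio well below 1
```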
7. Significance and Broader Implications
The 2026 Korean CSAT LLM Evaluation Leaderboard demonstrates that state-of-the-art closed-weight MLLMs (e.g., GPT-4o-MLLM, Claude-3.5-sonnet-MLLM) decisively outperform open-source alternatives in both aggregate and subject-specific accuracy. However, Korean-adapted open-source models (Qwen2, EXAONE-3.0) are competitive within their class, especially when leveraging CoT prompting and Korean OCR.
The methodology emphasizes strict data-leakage avoidance, robust multimodal evaluation, and culturally relevant benchmarking, collectively establishing a new standard for non-English education-based LLM assessment. This framework is instructive for benchmarking in other less-resourced languages and for modeling efficiency tradeoffs in high-stakes, high-difficulty academic settings (Park et al., 21 Feb 2025, Pyeon et al., 23 Nov 2025).