Korean CSAT LLM Evaluation Leaderboard
- Korean CSAT LLM Evaluation Leaderboard is an integrated framework that assesses large language and multimodal models using authentic, high-stakes CSAT-style exam data.
- It employs curated benchmarks, diverse modalities, and strict zero-leakage protocols to ensure transparent, reproducible, and statistically rigorous evaluations.
- The leaderboard aids in deployment optimization and model improvement tracking by providing detailed performance breakdowns and standardized scoring metrics.
The Korean CSAT LLM Evaluation Leaderboard is an integrated framework for assessing LLMs and multimodal models (MLLMs) on complex, high-stakes tasks modeled after the College Scholastic Ability Test (CSAT) in Korea. It provides transparent, statistically rigorous rankings based on real test data, state-of-the-art evaluation protocols, and multi-dimensional metrics, enabling researchers to track absolute and relative progress toward Korean exam-level competence.
1. Foundational Benchmarks and Datasets
Multiple research groups have developed benchmarks designed for CSAT-style and Korean national exam evaluation.
KMMLU (Son et al., 2024) features 35,030 expert-level, four-option multiple-choice questions spanning 45 subjects including STEM, applied sciences, humanities, and professional domains. Data is curated directly from 533 authentic Korean exams (e.g., PSAT, national license tests), with careful manual filtering and copyright review. Final splits include 208,522 training items, 225 development (5-shot) exemplars, and 35,030 test items, with human baseline performance at 62.6%.
KoNET (Park et al., 21 Feb 2025) introduces a multimodal benchmark incorporating four levels of the Korean General Educational Development Test, culminating in an extensive CSAT subset (897 questions across 41 subjects). CSAT questions appear as single grayscale images; 98.8% are five-option multiple-choice, and 1.2% are short-text subjective responses. Each question is annotated for difficulty and, for a high-difficulty subset, with empirical error rates from ≈505,000 human examinees. Preprocessing includes PDF parsing, OCR extraction for pure-LLM usability, and open-source dataset builders.
2026 CSAT-Math Zero-Data-Leakage Benchmark (Pyeon et al., 23 Nov 2025) presents a formal, contamination-free assessment using all 46 mathematics questions from the 2026 CSAT, converted to machine-readable form within two hours of the exam's release so that no model could have been exposed to the questions during training. Domains are grouped into Public (수학 I + II), Probability & Statistics, Calculus, and Geometry for domain-wise analysis.
2. Evaluation Methodologies
Benchmark protocols emphasize both prompt engineering and modality control. Common methodologies include:
- Prompt modes: Direct (answer-only) and Chain-of-Thought (CoT) rationales (Son et al., 2024, Park et al., 21 Feb 2025). In CoT mode the model emits a rationale before its final answer, which is typically extracted with regular expressions that identify the answer token.
- Input modalities: For multimodal benchmarks (KoNET), models may ingest image data directly (MLLMs), or receive OCR-extracted text for pure LLMs (Park et al., 21 Feb 2025).
- Zero-shot/Zero-leakage: For the 2026 CSAT-Math evaluation, strict non-exposure protocols ensure no questions appear in pretraining or fine-tuning, guaranteeing trustworthy comparison (Pyeon et al., 23 Nov 2025).
- Scoring: Exact-match accuracy ($\mathrm{Acc} = \#\text{correct} / \#\text{total}$), error rate ($1 - \mathrm{Acc}$), and, in some cases, per-question human error rates (Park et al., 21 Feb 2025).
- Aggregation: Metrics are reported at the overall, subject-level, and category-level (e.g., STEM, HUMSS), with additional breakdowns by difficulty and domain (Son et al., 2024, Park et al., 21 Feb 2025, Pyeon et al., 23 Nov 2025).
3. Leaderboard Architecture and Data Governance
Leaderboards are designed for scalability, transparency, and reproducibility. Features and architecture include:
- Leaderboard schema: A canonical JSON structure capturing model name, ID, API or parameter details, evaluation date, multi-level scores (overall and per category), and prompt protocol (Son et al., 2024). Example:
  ```json
  {
    "model_name": "GPT-4",
    "model_id": "gpt-4-0314",
    "parameters": "--temperature 0 --max_tokens 0",
    "access": "api",
    "date": "2024-06-01",
    "scores": {
      "overall": 59.95,
      "stem": 59.95,
      "applied_science": 57.69,
      "humss": 63.69,
      "other": 58.65
    },
    "notes": "Direct 5-shot"
  }
  ```
- Submission protocol: Contributors fork the repository and submit result JSONs; CI pipelines verify schema and integrity, rerunning the harness evaluation for confirmation (Son et al., 2024).
- Task integration: For KMMLU, standardized integration into EleutherAI's LM-Eval-Harness, with custom Python tasks, config files, and grouped few-shot examples, supporting reproducible local and cloud evaluations (Son et al., 2024).
- Continuous integration: Automated leaderboard updates using evaluation harnesses; historical CSV logs facilitate trend analysis and time-series visualization (Son et al., 2024, Park et al., 2024).
- Online UI: Interactive leaderboard websites (e.g., https://isoft.cnu.ac.kr/csat2026/) provide filtering by model, domain, modality, prompt language, latency, cost, and reasoning intensity, supporting dynamic model selection for deployment (Pyeon et al., 23 Nov 2025).
- Private test sets: Critical for avoiding data contamination and ensuring leaderboard trustworthiness, particularly in high-stakes educational domains (Park et al., 2024, Pyeon et al., 23 Nov 2025).
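A CI schema check of the kind described above might look like this; the field names follow the example schema, but the validation rules themselves are an assumption, not the project's actual CI code:

```python
import json

# Field and score names follow the example leaderboard schema.
REQUIRED_FIELDS = {"model_name", "model_id", "access", "date", "scores"}
SCORE_KEYS = {"overall", "stem", "applied_science", "humss", "other"}

def validate_submission(raw: str) -> list[str]:
    """Return a list of schema errors for a submission JSON; empty list means valid."""
    try:
        sub = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(sub, dict):
        return ["top-level value must be an object"]
    errors = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - sub.keys())]
    scores = sub.get("scores", {})
    errors += [f"missing score: {k}" for k in sorted(SCORE_KEYS - scores.keys())]
    errors += [f"score out of range: {k}={v}" for k, v in scores.items()
               if not isinstance(v, (int, float)) or not 0 <= v <= 100]
    return errors
```

Returning a list of errors rather than a boolean lets the CI pipeline report every problem in one pass before rerunning the harness.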
4. Model Performance and Comparative Analysis
Leaderboard results reveal characteristic gaps, saturation regimes, and cost-performance trade-offs.
KMMLU Top-8 Leaderboard (Direct, Overall Accuracy):
| Model | STEM | Applied Sci | HUMSS | Other | Overall |
|---|---|---|---|---|---|
| GPT-4 (API) | 59.95% | 57.69% | 63.69% | 58.65% | 59.95% |
| HyperCLOVA X (API) | 50.82% | 48.71% | 59.71% | 54.39% | 53.40% |
| Gemini-Pro (API) | 51.30% | 49.06% | 49.87% | 50.61% | 50.18% |
| Qwen-72B (open) | 50.69% | 47.75% | 54.39% | 50.77% | 50.83% |
| Yi-34B (open) | 44.31% | 40.59% | 47.03% | 43.96% | 43.90% |
| Llama-2-70B (open) | 41.16% | 38.82% | 41.20% | 40.06% | 40.28% |
| Polyglot-Ko-12.8B | 29.27% | 30.08% | 27.08% | 30.55% | 29.26% |
| Random Baseline | 25.00% | 25.00% | 25.00% | 25.00% | 25.00% |
Even top proprietary LLMs trail the human baseline (62.6%) by ≈2.6 percentage points; open-source, Korean-tailored LLMs perform significantly worse. CoT prompting yields mixed results and is not uniformly beneficial across domains (Son et al., 2024).
KoCSAT (KoNET) Leaderboard (CoT+OCR):
| Model Category | Model Name | Params | KoCSAT Accuracy (%) |
|---|---|---|---|
| open LLM | Qwen2-72B-Instruct | 72B | 36.0 |
| open LLM | gemma-2-27B-it | 27B | 33.9 |
| closed LLM | claude-3-5-sonnet | — | 60.5 |
| closed LLM | gpt-4o-2024-05-13 | — | 52.5 |
| open MLLM | Qwen2-VL-7B-Instruct | 7B | 16.9 |
| closed MLLM | gpt-4o-2024-05-13 (vision) | — | 66.1 |
The best closed MLLMs approach ≈66% accuracy but still lag human performance by 10–20 points; open-source models fall short by large margins, reflecting a lack of curriculum-aligned pretraining and vision integration. MLLMs outperform pure LLMs particularly on easy and medium questions (Park et al., 21 Feb 2025).
2026 CSAT-Math Zero-leakage Leaderboard (Text-only):
| Rank | Model | Score (of 46) | Accuracy (%) |
|---|---|---|---|
| 1 | GPT-5 Codex (openai) | 46/46 | 100.0 |
| 2 | Qwen3 235B A22B | 45/46 | 97.8 |
| 3–5 | GPT-5, gpt-oss-20B... | 44/46 | 95.7 |
| ... | ... | ... | ... |
Domain analysis exposes Geometry as the weakest area (59.3% average), and text input outperforms the image modality, particularly for mathematical reasoning. A cost-effectiveness (Editor's term) metric $\mathrm{CE} = \frac{\text{total score}}{(\#\text{params}) \times (\text{avg latency, ms})}$ highlights that small models (e.g., gpt-oss-20B) offer competitive performance per unit of resource (Pyeon et al., 23 Nov 2025).
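The CE metric is straightforward to apply; in the sketch below, the scores come from the leaderboard table, while the parameter counts (in billions) and latency figures are illustrative placeholders, not measured values:

```python
def cost_effectiveness(total_score: float, n_params_b: float, avg_latency_ms: float) -> float:
    """CE = total score / (#params x avg latency in ms).

    Parameters are taken in billions purely for readable magnitudes;
    any consistent unit works, since CE is only used for ranking."""
    return total_score / (n_params_b * avg_latency_ms)

# Scores from the leaderboard; parameter counts and latencies are assumed.
candidates = {
    "gpt-oss-20B": cost_effectiveness(44, 20, 900.0),
    "Qwen3-235B-A22B": cost_effectiveness(45, 235, 1200.0),
}
most_efficient = max(candidates, key=candidates.get)
```

Under these placeholder numbers the 20B model ranks first despite scoring one point lower, which is exactly the trade-off the CE metric is designed to surface.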
5. Longitudinal Evaluation and Model Scaling Effects
The Open Ko-LLM Leaderboard (Park et al., 2024) pioneers time-resolved, multi-task evaluation across 1,769 models over eleven months. The Ko-H5 benchmark spans Ko-HellaSwag, Ko-ARC, Ko-MMLU, Ko-CommonGen V2, and Ko-TruthfulQA, each scored on a 0–100 scale via accuracy, F1, or human truthfulness metrics.
- Scaling effects: Larger models (7–14B) show strong positive correlations and cross thresholds rapidly (e.g., passing 50 points on Ko-ARC within six weeks; Ko-MMLU saturating around week 26). Small models (<3B) plateau at ~60 points after five months.
- Correlation dynamics: Over time, task-wise correlations (e.g., TruthfulQA vs. HellaSwag) increase from 0.01 over months 1–5 to 0.50 by month eleven, signifying convergence of general capabilities (Park et al., 2024).
- Model trends: Pretrained models lead, with instruction-tuned variants lagging by about one week on every performance jump.
- Ranking evolution: Aggregate performance curves reveal rapid early innovation on commonsense/knowledge tasks, with complex reasoning and generation tasks (Ko-MMLU, Ko-CommonGen V2) presenting persistent challenges.
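The correlation-convergence analysis above can be reproduced on raw per-model score series with a windowed Pearson correlation; this is a sketch, and the paper's exact windowing scheme may differ:

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length score series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def windowed_correlation(task_a: list[float], task_b: list[float], window: int) -> list[float]:
    """Correlation between two tasks' score series over successive non-overlapping
    windows, e.g. Ko-TruthfulQA vs. Ko-HellaSwag across submission periods."""
    return [pearson(task_a[i:i + window], task_b[i:i + window])
            for i in range(0, len(task_a) - window + 1, window)]
```

A rising sequence of window correlations would correspond to the 0.01 → 0.50 convergence pattern reported for the Open Ko-LLM Leaderboard.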
6. Challenges, Domain-specific Obstacles, and Future Directions
Persistent issues impacting leaderboard accuracy and practical utility include:
- Linguistic/cultural nuances: Honorifics, ambiguous particles (e.g., the topic markers 은/는 vs. the subject markers 이/가), and localized references (the Korean Constitution, civil service) challenge even high-capability models (Son et al., 2024).
- Modality and reasoning trade-offs: Text input consistently outperforms image; increased reasoning intensity (as in GPT-5 Reasoning_Effort experiments) improves scores but may reduce efficiency due to token bloat (≈4.5× more tokens per question for “high” reasoning) (Pyeon et al., 23 Nov 2025).
- Robustness: “Hallucinations” and unfaithful CoT explanations are prevalent in HUMSS, law, and history domains (Son et al., 2024, Park et al., 21 Feb 2025).
- Benchmark expansion: Future work proposes moving toward reading comprehension, listening, short essay, coding, and multimodal tasks (e.g., graph interpretation, spoken expression).
- Zero-leakage enforcement: Strict data handling as in the 2026 CSAT-Math benchmark is crucial for trustworthy leaderboards (Pyeon et al., 23 Nov 2025).
- Leaderboard sustainability: Private, regularly updated test sets; stratified analysis by subject/difficulty; and longitudinal tracking are essential for objective model comparisons (Park et al., 2024). Regular leaderboard updates and visualization tools promote healthy research ecosystems.
A plausible implication is that as model scale continues to rise and curriculum-aligned Korean data is incorporated, leaderboard saturation plateaus may shift upward and the open-source performance gap may narrow.
7. Practical Applications and Research Utility
These leaderboards form a nexus for rigorous, comparative evaluation of LLMs and MLLMs on Korean educational standards:
- Deployment optimization: Practitioners can select models by accuracy, cost, latency, and reasoning trade-off tailored to real exam conditions using live leaderboard filters.
- Model improvement detection: Longitudinal analyses identify periods of innovation, stagnation, and architectural breakthroughs.
- Task design: Benchmarks highlight persistent weaknesses (e.g., geometry, cultural context, complex reasoning), guiding future model and curriculum design.
- Transparency: Documented zero-data-leakage protocols and open scoring infrastructure provide trustworthy yardsticks for policy-makers, AI developers, and educational stakeholders.
By supporting rigorous, up-to-date measurement using authentic high-stakes Korean exam data, the Korean CSAT LLM Evaluation Leaderboard supplies a reference standard for both model development and practical deployment in Korean-language educational contexts.