KoNET Benchmark: AI Evaluation for Korean Exams
- KoNET Benchmark is a unified evaluation framework that assesses multimodal AI reasoning using Korea’s national educational tests, probing culturally and linguistically grounded understanding.
- It curates diverse assessment items from elementary to college levels, covering factual recall, text comprehension, and image-based questions.
- The framework integrates detailed accuracy metrics and open-source builder tools to calibrate model performance against human benchmarks.
The Korean National Educational Test Benchmark (KoNET) is a unified evaluation framework designed to systematically probe the multimodal reasoning capabilities of generative AI systems using Korea’s major national educational assessments. By curating items from the Korean Elementary, Middle, and High School General Educational Development Tests (KoEGED, KoMGED, KoHGED) and the College Scholastic Ability Test (KoCSAT), KoNET facilitates comprehensive, multi-level analysis of both open-source and commercial models in a culturally and linguistically rich, non-English setting. The design emphasizes rigor, diversity of question format and subject matter, and calibration against human performance, particularly for questions with official difficulty and error-rate statistics. KoNET is supported by fully open-source builder tools, allowing reproducibility and extension by the research community (Park et al., 21 Feb 2025).
1. Constituent Exams and Question Composition
KoNET is constructed from four constituent exams, each translated into a multimodal item format suitable for AI evaluation. The principal features of each are summarized below:
| Exam | Total Items | Subjects (Count) | K-QA/TC-QA/MC-QA Proportions | Avg. Words / Chars (max) |
|---|---|---|---|---|
| KoEGED | 400 | Core + arts/practice (10) | 15.5 % / 30.8 % / 53.8 % | 29.9 (106) / 113.0 (417) |
| KoMGED | 540 | KoEGED + Information Technology (11) | 12.0 % / 46.1 % / 41.9 % | 42.7 (362) / 167.2 (1408) |
| KoHGED | 540 | KoMGED + Korean History (11) | 11.5 % / 52.6 % / 35.9 % | 48.0 (410) / 193.6 (1678) |
| KoCSAT | 897 | 41 subject tracks | 6.4 % / 43.3 % / 50.3 % | 113.0 (786) / 475.9 (3300) |
Each item is formatted as a single gray-scale image embedding the stem, diagrams or passages, and multiple-choice answers (mostly four choices, five in KoCSAT). Three core question types are present:
- Knowledge QA (K-QA): Fact recall or straightforward knowledge application.
- Text Comprehension QA (TC-QA): Requires parsing of extended text; includes reading comprehension.
- Multimodal Comprehension QA (MC-QA): Necessitates integrating information from images/diagrams with text.
Typical items range from geometry problems requiring diagram interpretation (MC-QA) and culture-dependent literature analysis (K-QA) to complex reading passages with logical-ordering requirements (TC-QA).
2. Dataset Construction and Open-Source Builder
Dataset curation proceeds in three major stages, leveraging official PDF releases from the Korea Institute of Curriculum and Evaluation (KICE):
- PDF Acquisition: Download of public (but copyright-controlled) test PDFs for all exam levels and years.
- Automatic Item Extraction: The open-source “knet-builder” suite parses the PDFs, identifies and segments question blocks, crops and renders each as a standard PNG image, and extracts structured metadata (subject codes, IDs, answers, difficulty, and—where available—human error rates).
- Quality Control: Automated validation ensures each image represents exactly one question and the correct number of choices. Samples (n ≈ 100 per exam) are hand-checked to detect cropping artifacts, supplemented by automatic failure flagging.
The repository is organized into modular directories for builder scripts, per-exam JSON configuration files, and both Korean and English prompt templates (direct, chain-of-thought (CoT), judge). Distribution includes only builder code—users must obtain source PDFs independently according to KICE’s usage restrictions.
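As a rough illustration of how a direct-style prompt template might be structured, the sketch below uses a hypothetical template string; the actual templates shipped with knet-builder (direct, CoT, judge, in Korean and English) may differ in wording and format:

```python
# Illustrative direct-answer prompt template; NOT the repository's actual
# template text, which should be taken from the prompt-template directories.
DIRECT_TEMPLATE = (
    "You are given a Korean exam question as an image transcript.\n"
    "Question:\n{question}\n\n"
    "Answer with only the choice number."
)

def build_prompt(question_text: str) -> str:
    """Fill the template with one question's OCR transcript or text."""
    return DIRECT_TEMPLATE.format(question=question_text)
```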
Canonical dataset generation involves:
```shell
pip install -r builder/requirements.txt
python builder/render_images.py --config configs/kcsat.json --outdir data/KCSAT/
```
Output consists of per-exam image directories and corresponding JSONL manifests for downstream model evaluation.
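A downstream evaluator can then iterate the manifest and resolve each image path. The field names below (`image`, `answer`) are illustrative assumptions, not the builder's documented schema:

```python
import json
from pathlib import Path

def load_manifest(manifest_path):
    """Yield (image_path, metadata) pairs from a JSONL manifest.

    Assumes each line is a JSON object with an "image" field holding a
    path relative to the manifest's directory; other fields (answer,
    subject, difficulty) are passed through as metadata.
    """
    base = Path(manifest_path).parent
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            item = json.loads(line)
            yield base / item["image"], item
```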
3. Evaluation Metrics
KoNET applies standard and difficulty-sensitive metrics, facilitating nuanced comparison across subjects and with human baselines. The core evaluation equations are as follows:
- Overall Accuracy: $\text{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$
- Subject-wise Accuracy (for subject $s$ with item set $Q_s$): $\text{Acc}_s = \frac{1}{|Q_s|}\sum_{i \in Q_s} \mathbb{1}[\hat{y}_i = y_i]$
- Difficulty-Calibrated Score (using official item difficulty $d_i$): $S = \frac{\sum_i d_i\, \mathbb{1}[\hat{y}_i = y_i]}{\sum_i d_i}$
- Human–Model Gap (for KoCSAT items with error rates $e_i$): let $h_i = 1 - e_i$ (empirical human accuracy); then $\Delta = \frac{1}{N}\sum_{i=1}^{N}\left(\mathbb{1}[\hat{y}_i = y_i] - h_i\right)$
These formulations enable holistic quantification of model proficiency, error calibration, and head-to-head comparison with large-scale student response data.
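One way to implement the metrics named in this section; the exact weighting and aggregation conventions in the paper may differ, so this is a minimal sketch under the reconstructed definitions:

```python
from collections import defaultdict

def overall_accuracy(preds, golds):
    """Fraction of items where the predicted choice matches the gold answer."""
    return sum(int(p == g) for p, g in zip(preds, golds)) / len(golds)

def subject_accuracy(preds, golds, subjects):
    """Per-subject accuracy, keyed by subject label."""
    totals, correct = defaultdict(int), defaultdict(int)
    for p, g, s in zip(preds, golds, subjects):
        totals[s] += 1
        correct[s] += int(p == g)
    return {s: correct[s] / totals[s] for s in totals}

def difficulty_calibrated_score(preds, golds, difficulties):
    """Weight each correct item by its official difficulty d_i."""
    num = sum(d * int(p == g) for p, g, d in zip(preds, golds, difficulties))
    return num / sum(difficulties)

def human_model_gap(preds, golds, error_rates):
    """Mean of (model correctness - human accuracy), with h_i = 1 - e_i."""
    gaps = [int(p == g) - (1 - e) for p, g, e in zip(preds, golds, error_rates)]
    return sum(gaps) / len(gaps)
```

A positive gap means the model outperforms the student population on those items; a negative gap means it falls short.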
4. Model Performance Analysis
Evaluation of 18 open-source LLMs, 20 open-source VLMs (multimodal LLMs), and 8 closed-source APIs (textual and multimodal) under a unified CoT-with-OCR pipeline reveals several systematic trends:
Accuracy by Model Category and Exam
| Model | KoEGED | KoMGED | KoHGED | KoCSAT | Overall |
|---|---|---|---|---|---|
| Qwen2-72B (LLM+OCR) | 76.0 % | 74.1 % | 71.9 % | 36.0 % | 58.7 % |
| gemma-2-27B (LLM+OCR) | 74.5 % | 69.6 % | 68.5 % | 33.9 % | — |
| EXAONE-7.8B (bilingual LLM+OCR) | 64.5 % | 59.1 % | 56.9 % | 24.2 % | 45.5 % |
| Qwen2-VL-7B (VLM) | 49.5 % | 46.9 % | 42.0 % | 16.9 % | 34.3 % |
| InternVL2 (VLM) | — | — | — | < 20 % | ~10–25 % |
| gpt-4o (LLM+OCR) | 82.5 % | 82.0 % | 84.4 % | 52.5 % | 70.8 % |
| claude-3.5-sonnet (LLM+OCR) | 86.5 % | 86.3 % | 86.1 % | 60.5 % | 76.0 % |
| HyperCLOVA-X (LLM+OCR) | 82.0 % | 84.6 % | 85.1 % | 51.2 % | 70.9 % |
| gpt-4o multimodal (VLM API) | 95.0 % | 95.4 % | 94.4 % | 66.1 % | 83.4 % |
| claude-3.5 multimodal (VLM API) | 94.0 % | 93.3 % | 90.7 % | 62.8 % | 80.6 % |
Performance drops systematically as educational level increases, with sharp declines at the KoCSAT (college-level) tier (25–30 % for open-source models, ~20 % for closed-source models). K-QA and TC-QA items (fact recall and text comprehension, typically with shorter stems) exhibit higher accuracy, while MC-QA items, which demand multimodal reasoning over diagrams and images, remain a persistent challenge.
Notably, open-source VLMs consistently underperform even text-only LLMs with OCR, highlighting integration limitations with Korean-language OCR tools.
5. Linguistic, Cultural, and Multimodal Challenges
KoNET exposes substantial limitations in AI models, particularly in the context of Korean language and cultural content:
- OCR Failures: Off-the-shelf OCR tools frequently misinterpret Hangul or specialized fonts, especially in diagrams or formatted tables, undermining the performance of end-to-end VLMs not explicitly trained on Korean script.
- Cultural Context Deficiency: Items assessing deep cultural or historical knowledge (e.g., Joseon-era literature) pose significant obstacles to models lacking Korea-specific pretraining or fine-tuning.
- Educational-Level Progression: Each step up the educational ladder (elementary-to-middle, middle-to-high) correlates with a 5–10 % accuracy drop, followed by a pronounced decline at the transition to KoCSAT (25–30 % for open-source models, ~20 % for closed-source models).
A plausible implication is that curriculum alignment and exposure to culturally contextual material are critical factors in bridging these gaps.
6. Benchmark Differentiators and Cross-Language Comparisons
Relative to English-centric multimodal benchmarks (e.g., ScienceQA, MathVista), KoNET introduces two major differentiators:
- Difficulty/Calibration Data: Per-item official difficulty and human error rates, uniquely enabling fine-grained calibration of model performance against student populations.
- Language and Script Complexity: While state-of-the-art English MLLMs achieve ∼50–60% on comparable math/diagram tasks, their accuracy on KoNET’s Korean items is 20–30% lower in the absence of dedicated Korean-specific adaptation.
This suggests that cross-lingual model transfer remains fragile for multimodal reasoning, reinforcing the importance of diverse linguistic coverage in benchmark design (Park et al., 21 Feb 2025).
7. Recommendations and Prospects for Extension
Key technical recommendations and forward-looking extensions for KoNET include:
- Korean OCR Integration: Fine-tune vision encoders specifically on Hangul-rich documents to boost robustness of OCR-dependent processing pipelines.
- Curriculum-Aware Fine-Tuning: Pretrain or fine-tune generative models on question pools and passages sourced directly from KICE to improve both factual and contextual comprehension, particularly for historical and cultural items.
- Augmentation with Rationales: Expand the benchmark to include human-annotated rationales and short-answer (subjective) response formats, allowing in-depth evaluation of chain-of-thought (CoT) fidelity.
- Regular Updates: Refresh the benchmark annually via the open-source builder, capturing the latest test cohorts and minimizing distributional shift or contamination.
- Leaderboard Infrastructure: Develop a public leaderboard to track progress of open-source and commercial models on Korean multimodal tasks, incentivizing further research in this domain.
By providing a systematically curated, difficulty-calibrated, human-anchored, and reproducible framework, KoNET establishes a rigorous standard for benchmarking the multilingual, cultural, and multimodal reasoning capabilities of next-generation AI systems (Park et al., 21 Feb 2025).