SWE-rebench: Swedish QA & Software Benchmarks
- SWE-rebench is a family of benchmarks for evaluating Swedish factual recall and software engineering tasks using curated, contamination-controlled datasets.
- It integrates distinct streams, including diagnostic QA for Sweden-specific facts and multilingual code-editing challenges with rigorous annotation.
- Evaluation metrics like string-normalized EM and token-level F1 are employed to assess LLM performance across both contaminated and decontaminated data.
SWE-rebench refers to a family of datasets and methodologies for constructing, evaluating, and benchmarking knowledge and agentic capability in both factual question answering (particularly for Sweden-specific facts) and interactive software engineering tasks. Several distinct datasets share this name, anchored by two primary streams: (1) a diagnostic QA benchmark for Sweden-related factual knowledge (Kunz, 24 Oct 2025), and (2) large-scale, contamination-controlled datasets for issue-resolving tasks in software engineering (Badertdinov et al., 26 May 2025, Zan et al., 3 Apr 2025, Prathifkumar et al., 11 Dec 2025). This entry provides a comprehensive technical account of SWE-rebench’s design, statistics, protocols, and implications within these domains.
1. Sweden-Related Factual Knowledge Benchmark
SWE-rebench (“A Diagnostic Benchmark for Sweden-Related Factual Knowledge”) targets evaluation of LLM factual recall for Sweden-specific domains insufficiently covered by conventional (often US-centric) benchmarks (Kunz, 24 Oct 2025). The benchmark comprises 1,293 manually authored and verified fill-in-the-blank QA pairs, each with a Swedish original and a human-edited English translation.
Dataset Structure
- Subsets:
- Sommarpratare: 1,190 instances related to “Sommar i P1” radio hosts (2018–2024), covering a range of biographical and anecdotal facts.
- Sports Events: 102 instances spanning “En Svensk Klassiker” races and other Swedish sports.
- Schema: Each item records: unique id, subset, Swedish and English Q/A, and optional cutoff_date for time-sensitive items.
- Example:
- Niche (Figure 1):
- question_sv: “Vilket cykelmärke kör Stig Johansson på Vätternrundan?”
- answer_sv: “Husqvarna”
- question_en: “What is the brand of the bike that Stig Johansson rides at Vätternrundan?”
- answer_en: “Husqvarna”
- Mainstream (Figure 1):
- question_sv: “I vilken sångtävling representerade Toussaint ‘Tusse’ Chiza Sverige?”
- answer_sv: “Eurovision”
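Combining the schema fields, a single record can be sketched as follows; the id value and exact serialization are hypothetical, and the Q/A values are taken from the niche example above:

```python
# One SWE-rebench QA record, mirroring the schema described above
# (id, subset, Swedish/English Q/A, optional cutoff_date). The id and
# the exact serialized form are assumptions for illustration.
record = {
    "id": "sports-0001",  # hypothetical identifier
    "subset": "Sports Events",
    "question_sv": "Vilket cykelmärke kör Stig Johansson på Vätternrundan?",
    "answer_sv": "Husqvarna",
    "question_en": "What is the brand of the bike that Stig Johansson rides at Vätternrundan?",
    "answer_en": "Husqvarna",
    "cutoff_date": None,  # only set for time-sensitive items
}

required = {"id", "subset", "question_sv", "answer_sv", "question_en", "answer_en"}
assert required <= record.keys()
```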
Annotation and Validation
Two independent annotators constructed and peer-verified all Q/A pairs using high-quality sources (Wikipedia, press); only “minimal” answers—sufficient for unambiguous closed-book evaluation—are retained. Units on numerical items are explicit.
Limitations
The benchmark is biased toward personalities from media and endurance sports; it offers no multiple-choice format or distractor candidates, and it does not cover politics or business.
2. Evaluation Metrics and Benchmarks for QA
Model outputs are assessed using string-normalized Exact Match (EM), token-level F1, and token-level Recall—the latter chosen as the primary diagnostic given the dataset’s minimal-answer design.
Formal Definitions (LaTeX notation)
- Exact Match (string-normalized): $\mathrm{EM} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\left[\mathrm{norm}(\hat{y}_i) = \mathrm{norm}(y_i)\right]$
- Precision: $P_i = \frac{|T(\hat{y}_i) \cap T(y_i)|}{|T(\hat{y}_i)|}$
- Recall: $R_i = \frac{|T(\hat{y}_i) \cap T(y_i)|}{|T(y_i)|}$
- F1: $F_{1,i} = \frac{2\,P_i R_i}{P_i + R_i}$

Here $\hat{y}_i$ and $y_i$ denote the predicted and gold answers, and $T(\cdot)$ is the multiset of normalized tokens. Cross-lingual factual consistency is quantified as the overlap of the sets $C_{\mathrm{sv}}$ and $C_{\mathrm{en}}$ of items answered correctly in each language:
$\mathrm{Consistency} = \frac{|C_{\mathrm{sv}} \cap C_{\mathrm{en}}|}{|C_{\mathrm{sv}} \cup C_{\mathrm{en}}|}$
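These metrics can be sketched in Python; lowercasing plus whitespace tokenization stands in for the paper's normalizer, and intersection-over-union is an assumed formalization of cross-lingual consistency:

```python
from collections import Counter

def norm(text: str) -> list[str]:
    # Assumed normalization: lowercase + whitespace tokenization.
    return text.lower().split()

def exact_match(pred: str, gold: str) -> float:
    return float(norm(pred) == norm(gold))

def token_prf(pred: str, gold: str) -> tuple[float, float, float]:
    # Token-level precision, recall, and F1 over multiset token overlap.
    p_toks, g_toks = Counter(norm(pred)), Counter(norm(gold))
    overlap = sum((p_toks & g_toks).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(p_toks.values())
    recall = overlap / sum(g_toks.values())
    return precision, recall, 2 * precision * recall / (precision + recall)

def consistency(correct_sv: set, correct_en: set) -> float:
    # Cross-lingual consistency as intersection-over-union of the
    # per-language sets of correctly answered items (assumed formalization).
    union = correct_sv | correct_en
    return len(correct_sv & correct_en) / len(union) if union else 1.0
```

Recall is the natural diagnostic under the minimal-answer design: a verbose but correct prediction still attains full recall against a short gold answer, e.g. `token_prf("the brand is Husqvarna", "Husqvarna")` yields recall 1.0 but precision 0.25.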
Model Performance
Smaller models with substantial Swedish pretraining achieve recall scores similar to those of three-times larger multilingual models. For example, EuroLLM-9B matches Gemma-3-27B (recall_sv/en ≈ 23–25%). Continued Swedish pretraining improves Swedish factual recall (e.g., AI Sweden LLaMA-3-8B, recall_sv 23.1%) but induces partial catastrophic forgetting of previously known facts in English. Cross-lingual consistency ranges 62–78% for best models.
3. Multilingual Issue-Resolving Dataset for SWE (Multi-SWE-bench)
A distinct SWE-rebench branch, sometimes referred to as Multi-SWE-bench, focuses on evaluating agentic reasoning and code-editing capabilities across seven programming languages (Zan et al., 3 Apr 2025). It addresses the inadequacy of Python-only SWE-bench for multilingual and cross-ecosystem LLM benchmarking.
Dataset Characteristics
- Total instances: 1,632 manually verified issue-resolving tasks.
- Languages: Java, TypeScript, JavaScript, Go, Rust, C, C++.
- Proportions: Go (26.2%), JavaScript (21.8%), Rust (14.6%), TypeScript (13.7%), C/C++ (~7.85% each), Java (7.84%).
- Structure: Each JSON object logs repo metadata, commit hashes, issue description, patch diffs (test_patch, fix_patch), before/after file contents, test-case transitions (ANY→FAILED→PASSED), Docker recipe, and run commands.
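The instance structure described above can be sketched as a typed record; the field names paraphrase the text, and the dataset's actual JSON keys may differ:

```python
from typing import TypedDict

class MultiSWEInstance(TypedDict):
    # Sketch of one Multi-SWE-bench JSON object; key names are assumptions
    # that paraphrase the structure described in the text.
    org: str                # repository metadata
    repo: str
    base_commit: str        # commit hash the patches apply to
    issue_description: str
    test_patch: str         # diff adding/adjusting tests
    fix_patch: str          # ground-truth diff resolving the issue
    test_transitions: dict[str, str]   # e.g. {"test_foo": "FAILED->PASSED"}
    dockerfile: str         # containerized build recipe
    run_commands: list[str] # commands to build and run the test suite
```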
Annotation Pipeline
A five-stage process ensures semantic correctness and reproducibility:
- Repo selection—high popularity, CI support.
- PR crawling and metadata extraction.
- Containerized build environments.
- Semantic filtering via test transitions—retaining only PRs with test failures resolved in the patch.
- Dual manual annotation (>80% accuracy per language, 68 annotators).
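The semantic-filtering stage can be sketched as a predicate over per-test statuses, following the ANY→FAILED→PASSED transition described above; the status labels and report format are assumptions:

```python
def keep_pr(pre_fix: dict[str, str], post_fix: dict[str, str]) -> bool:
    """Semantic filter sketch: retain a PR only if applying its fix patch
    flips at least one test from FAILED to PASSED."""
    return any(
        pre_fix.get(test) == "FAILED" and status == "PASSED"
        for test, status in post_fix.items()
    )
```

For example, a PR whose `t_bug` test report reads FAILED before the fix and PASSED after is retained; a PR whose tests already passed pre-fix is discarded.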
RL Extension (Multi-SWE-RL)
An RL-ready dataset of 4,723 tasks omits manual verification but retains Docker environments and reward structures: Δpassed is the number of tests passing post-patch minus the number passing pre-patch, and the binary reward is +1 if at least one previously failing test is fixed.
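A minimal sketch of the reward structure, assuming per-rollout sets of passing tests:

```python
def delta_passed(pre_passing: set[str], post_passing: set[str]) -> int:
    # Δpassed = tests passing after the agent's patch minus tests passing before.
    return len(post_passing) - len(pre_passing)

def reward(pre_passing: set[str], post_passing: set[str]) -> int:
    # Binary scheme as described: +1 if at least one previously failing
    # test now passes, else 0. Δpassed admits denser shaping if desired.
    return 1 if post_passing - pre_passing else 0
```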
Evaluation Metrics
- Resolved Rate: Proportion of issues whose patch passes all tests.
- Pass@k: Fraction of issues for which at least one of k sampled patches resolves the issue.
- Exact-match Rate: Generated diffs identical to ground-truth fix_patch.
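Pass@k is commonly computed with the unbiased combinatorial estimator over n sampled patches, c of which resolve the issue; this is a standard formulation, and the benchmark may instead report the empirical fraction directly:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n attempts (c of which resolved the issue) is a success."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with all misses
    return 1.0 - comb(n - c, k) / comb(n, k)
```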
4. Automated Interactive SWE Task Pipeline (Python-Focused)
The original SWE-rebench pipeline (Badertdinov et al., 26 May 2025, Prathifkumar et al., 11 Dec 2025) automates large-scale, contamination-controlled task extraction for interactive Python SWE agent evaluation.
Pipeline Stages
- Task Collection: PR–issue matching from GitHub Archive (∼10M PRs), filtered for Python, permissive license, and substantive test modification.
- Install Configuration: LLM-driven (Qwen2.5-72B-Instruct) extraction of install/test recipes via README/Dockerfile parsing; error-based refinement.
- Execution Verification: Each task is containerized; a task is retained only if its gold patch transitions at least one test from fail→pass and introduces no regressions.
- Instance Annotation: Fine-tuned LLM predicts binary task labels—difficulty, clarity, test-correctness (F1 = 0.82/0.76/0.65 on validation).
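The execution-verification criterion (at least one fail→pass transition, no regressions) can be sketched as follows, assuming boolean per-test reports:

```python
def verify_task(before: dict[str, bool], after: dict[str, bool]) -> bool:
    """Execution-verification sketch: accept a task only if the gold patch
    turns at least one failing test green (fail -> pass) and every test
    that passed before still passes after (no regressions)."""
    fail_to_pass = any(not ok and after.get(t, False) for t, ok in before.items())
    no_regression = all(after.get(t, False) for t, ok in before.items() if ok)
    return fail_to_pass and no_regression
```

A patch that fixes a failing test while breaking a previously passing one is rejected, which keeps the Fail→Pass and Pass→Pass test counts in the statistics below well defined.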
Funnel Rates
| Stage | Input | Output | Rate |
|---|---|---|---|
| PR–Issue Matching | 10M PRs | 450K | 4.5% |
| Filtering (1–15 files/tests) | 450K | 153K | 34% |
| Build & Execution Verification | 153K | 21,336 | 14% |
| Automated Annotation | 21,336 | 21,336 | 100% |
The funnel yields 21,336 valid interactive Python tasks drawn from 3,468 repositories.
Dataset Statistics
| Metric | Mean | p50 | p75 | p95 |
|---|---|---|---|---|
| Issue Length (words) | 141.7 | 91 | 173 | 412.3 |
| Files Edited | 3.46 | 2 | 4 | 10 |
| Lines Edited | 142.2 | 37 | 112 | 500 |
| Fail→Pass Tests | 14.56 | 2 | 5 | 37 |
| Pass→Pass Tests | 85.81 | 22 | 64 | 351 |
| Total Tests | 105.4 | 31 | 82.3 | 428 |
| Difficulty Score | 1.13 | 1 | 2 | 2 |
| Issue Clarity Score | 1.04 | 1 | 2 | 3 |
| Test Correctness Score | 1.38 | 2 | 2 | 3 |
Lower scores indicate, respectively, an easier task, a clearer issue description, and more reliable tests.
5. Contamination Controls and Benchmark Comparisons
SWE-rebench benchmarks are constructed and refreshed explicitly to avoid overlap with LLM pretraining data (“contamination”). Every instance’s timestamp is tracked, and evaluation is split by temporal cut-offs matching major LLM releases; all pre-cut-off tasks are flagged for contamination (Badertdinov et al., 26 May 2025, Prathifkumar et al., 11 Dec 2025).
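A minimal sketch of the temporal split, assuming each task carries a `created_at` timestamp (the field name is an assumption):

```python
from datetime import date

def split_by_cutoff(tasks: list[dict], release_date: date) -> tuple[list, list]:
    """Temporal decontamination sketch: tasks created before a model's
    release cut-off are flagged as potentially contaminated; later tasks
    form the decontaminated evaluation set."""
    contaminated = [t for t in tasks if t["created_at"] < release_date]
    fresh = [t for t in tasks if t["created_at"] >= release_date]
    return contaminated, fresh
```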
- SWE-Bench-Verified comprises 500 issues drawn mostly before late 2023 and exhibits extensive contamination (Prathifkumar et al., 11 Dec 2025).
- SWE-rebench (as used by Prathifkumar et al.) provides more than 21,000 temporally fresh, predominantly Python issues, on which models score substantially lower (e.g., DeepSeek-V3-0324: 39.7% on Verified vs. 21.3% on Rebench). A plausible implication is that static benchmarks dramatically inflate evaluations when agents can simply recall memorized data.
6. Benchmark Evaluation Protocols and Model Comparison
All agents are evaluated using standardized minimal ReAct scaffolding, repeated seeds, and uniform context limits (≤128K tokens). Key outcome measures are the resolved rate and pass@5, where $\text{pass@}k = \mathbb{E}\left[1 - \binom{n-c}{k}\big/\binom{n}{k}\right]$ over $n$ sampled rollouts per issue with $c$ successes.
Performance drops between contaminated and decontaminated sets are consistent across both proprietary (GPT-4.1: 31.1%→26.7%) and open-source models (LLaMA-3.3-70B: 18.1%→11.2%).
7. Utility, Community, and Extensions
SWE-rebench datasets are openly available; code and pipelines reside on GitHub and HuggingFace. The platform supports progressive community extensions (e.g., new PR seeds added per quarter for Multi-SWE-bench) and RL-oriented benchmark design using curated or agentlessly generated tasks (Zan et al., 3 Apr 2025).
Potential future directions, beyond the current Swedish factual QA and Python-centric SWE streams, include broader domain and language coverage (business, politics, other languages), explicit external-coverage annotation (e.g., overlap with English Wikipedia), and expansion of the agentic SWE pipeline to JavaScript, Java, and C++ ecosystems using analogous methods.
SWE-rebench therefore constitutes a family of benchmarks and methodologies central to current research on LLM factuality (especially for underrepresented cultural domains) and robust, contamination-resilient evaluation of agentic reasoning in software engineering. Its multifaceted construction, detailed annotation, and strict contamination control protocols reflect contemporary best practice for QA and code-interacting model evaluation (Kunz, 24 Oct 2025, Badertdinov et al., 26 May 2025, Zan et al., 3 Apr 2025, Prathifkumar et al., 11 Dec 2025).