
SWE-Judge: Ensemble LLM Evaluation Framework

Updated 14 January 2026
  • SWE-Judge is an evaluation framework that employs an ensemble of LLM judges to automatically assess the correctness and quality of software artifacts.
  • It uses dynamic team selection and calibration with 20 annotated examples to optimize correlation between ensemble scores and human judgment.
  • The framework statistically aggregates diverse prompt-based evaluations, achieving improved reliability and scalability in software engineering tasks.

SWE-Judge is an evaluation framework for software engineering artifacts that leverages LLMs as an ensemble of “judges” to automatically assess the correctness and quality of generated code, program repairs, and code summaries. It combines multiple LLM-driven evaluation strategies, dynamic ensemble selection, and statistical ensembling to closely approximate the reliability of human evaluation while maintaining scalability and reducing resource overhead (Zhou et al., 27 May 2025). SWE-Judge blends prompt engineering, LLM performance ensembling, and empirical measurement of human alignment to advance the state of machine-assisted software engineering assessment.

1. Motivation and Conceptual Foundations

The assessment of software artifacts generated or modified by machine learning systems has traditionally relied on two poles: labor-intensive human evaluation, which is highly accurate but costly and limited in scale; and automatic metrics (e.g., BLEU, BERTScore, pass@k, ICE-Score, CodeJudge), which are efficient but typically fail to capture functional correctness and semantic fidelity, focusing instead on surface-level similarities. LLM-based automatic judges began to bridge this gap using prompt-based evaluation, but single-prompt LLM-as-judge solutions may underperform due to prompt sensitivity, LLM misalignment, and lack of calibration against human judgment (Zhou et al., 27 May 2025).

SWE-Judge systematically addresses these limitations by:

  • Defining a panel of five diverse LLM “judges” (distinct strategies/prompts).
  • Building ensembles of judges dynamically tailored to each dataset through limited human calibration (N=20 annotated examples).
  • Outputting a final correctness rating via statistical aggregation over selected judges, achieving strong alignment with human judgments.

This approach provides an empirical remedy to limitations in both single-prompt LLM judges and traditional metrics, offering a scalable and reliable alternative for evaluating code generation, automated program repair, and code summarization.

2. Formal Structure: Judges, Dynamic Team Formation, and Ensembling

SWE-Judge defines five core evaluation strategies, each instantiated as a distinct judge using a specialized LLM prompt template:

| Judge Strategy | Prompt/Mechanism | Output Range |
| --- | --- | --- |
| Direct Assess (S₁) | Score functional correctness directly from requirement and candidate | [0, 100] |
| Assess + Rethink (S₂) | Reflect on initial S₁ reasons for correctness, revise score if needed | [0, 100] |
| Equivalence Assess (S₃) | Assess semantic equivalence between candidate and reference | [0, 100] |
| Analyze Ref. then Assess (S₄) | Extract key properties from reference and check their preservation in candidate | [0, 100] |
| Generate Tests & Assess (S₅) | Generate tests from requirement/reference and estimate if candidate passes | [0, 100] |
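As a concrete illustration, a Direct Assess (S₁) judge can be driven by a prompt template of roughly the following shape. The wording below is a hypothetical sketch, not the paper's actual template, and `build_direct_assess_prompt` is an illustrative helper name.

```python
# Hypothetical prompt template for a Direct Assess (S1) judge.
# The paper's actual wording is not reproduced here; this only
# illustrates the shape of such a prompt.
DIRECT_ASSESS_PROMPT = """\
You are assessing a candidate software artifact.

Requirement:
{requirement}

Candidate:
{candidate}

Rate the functional correctness of the candidate with respect to the
requirement on a scale from 0 to 100. Respond with the number only.
"""

def build_direct_assess_prompt(requirement: str, candidate: str) -> str:
    """Fill the template with one (requirement, candidate) pair."""
    return DIRECT_ASSESS_PROMPT.format(requirement=requirement,
                                       candidate=candidate)
```

The other four strategies differ mainly in what context the prompt includes (the reference solution, prior reasoning, or generated tests) and what the model is asked to do with it.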

For a given triplet of user requirement $x$, candidate artifact $y$, and reference solution $r$, each judge outputs a standardized score $s_k = E_k(x, y, r)$.

Dynamic team selection then constructs all judge subsets $T \subseteq \{S_1, \ldots, S_5\}$ containing $S_1$ and at least one additional member. For each candidate team, aggregate ensemble scores $\hat{s}_j$ on a small set of human-annotated examples ($N = 20$) are compared to ground truth via a composite alignment metric: the average of Kendall's $\tau$ and Spearman's $r_s$.

$$T^* = \underset{T \subseteq S,\; S_1 \in T,\; |T| \ge 2}{\arg\max}\; \mathrm{Corr}\!\left(\left\{\frac{1}{|T|}\sum_{k \in T} s_{k,i}\right\}_{i=1}^{N},\ \{S_i^*\}_{i=1}^{N}\right)$$

For new samples, the ensemble prediction is the average of selected judges’ scores linearly mapped back to the human label range.

E(x,y,r)=L+(UL)×s^100\mathcal{E}(x, y, r) = L + (U - L)\times \frac{\hat{s}}{100}

where $L$ and $U$ are the minimum and maximum values of the label scale.
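The aggregation and label-mapping steps can be sketched in a few lines of Python. `ensemble_score` is an illustrative name, and the 1–5 scale in the usage example is an assumed adequacy rating scale, not one specified by the paper.

```python
def ensemble_score(judge_scores, label_min, label_max):
    """Aggregate a team's [0, 100] judge scores for one sample and map
    the mean linearly onto the human label scale [label_min, label_max]."""
    s_hat = sum(judge_scores) / len(judge_scores)  # mean over selected team T
    return label_min + (label_max - label_min) * s_hat / 100

# Three judges scoring one sample, mapped to an assumed 1-5 label scale:
print(ensemble_score([80, 70, 90], 1, 5))  # mean of 80 maps to about 4.2
```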

3. Empirical Performance and Human Alignment

SWE-Judge was benchmarked on four SE datasets:

  • Code generation: CoNaLa, Card2Code
  • Program repair: APR-Assess
  • Code summarization: Summary-Assess

The primary evaluation metric is the mean of Kendall's $\tau$ and Spearman's $r_s$ correlations with human ratings. For each dataset, SWE-Judge outperformed ICE-Score (the best prior metric), achieving improvements of +15.0% (CoNaLa), +20.7% (Card2Code), +75.2% (APR-Assess), and +12.7% (Summary-Assess), for a cross-domain average improvement of +29.6%.
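The composite alignment metric can be computed with `scipy.stats.kendalltau` and `spearmanr`, or directly from the definitions as in this dependency-free sketch. Both functions below assume no tied scores, which keeps the formulas simple; `alignment` is an illustrative name.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau (tau-a): concordant minus discordant pairs,
    normalized by the total number of pairs. Assumes no ties."""
    n = len(a)
    conc = disc = 0
    for i, j in combinations(range(n), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            conc += 1
        elif s < 0:
            disc += 1
    return (conc - disc) / (n * (n - 1) / 2)

def spearman_rho(a, b):
    """Spearman's r_s via the rank-difference formula. Assumes no ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

def alignment(ensemble_scores, human_scores):
    """SWE-Judge's composite metric: mean of the two rank correlations."""
    return 0.5 * (kendall_tau(ensemble_scores, human_scores)
                  + spearman_rho(ensemble_scores, human_scores))
```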

Furthermore, agreement assessed with Cohen's $\kappa$ shows that SWE-Judge's ensemble can approach, and for some tasks even surpass, average human–human inter-annotator agreement. On APR-Assess, SWE-Judge is virtually at parity with annotator consensus ($\kappa = 0.289$ vs. $0.312$) and outperforms random human pairs on Card2Code. For comment adequacy (Summary-Assess), human agreement outpaces SWE-Judge, indicating ongoing challenges in highly subjective evaluation domains (Zhou et al., 27 May 2025).
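Cohen's κ compares observed rater agreement against the agreement expected by chance. The helper below is a small self-contained implementation of the standard formula for two raters over nominal labels.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from marginal label
    frequencies.
    """
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)  # chance
    return (p_o - p_e) / (1 - p_e)
```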

4. Judge Specialization, Limitations, and Extensions

While the five core strategies complement one another by covering direct assessment, logical self-critique, semantic matching, property-driven verification, and synthetic test-based evaluation, several limitations remain:

  • Code Summarization: Judge ensembles lag behind human annotators, reflecting the nuanced subjectivity of high-level code explanation.
  • Computation Cost: Each sample may incur up to five LLM prompts, yielding non-trivial runtime and monetary expense (approximately US\$0.10 per 100 samples with GPT-4o-mini). However, this remains orders of magnitude cheaper than expert human grading.
  • Prompt and LLM Limitations: Zero-shot strategies can be susceptible to hallucination or spurious responses if prompts are under-specified or context is sparse.

Potential extensions include semi-supervised learning or fine-tuning to produce durable “judge” models once larger annotated datasets are available; adaptation to non-functional properties (readability, style, security); inclusion of debate-inspired protocols and weighted ensembling; and multi-turn or self-reflective assessment protocols (Zhou et al., 27 May 2025).

5. SWE-Judge within the Landscape of LLM Judging Methodologies

SWE-Judge (or “SE-Jury”) sits at the intersection of several trends in LLM-based evaluation. Compared to single-prompt LLM-as-judge protocols, it introduces prompt-based diversification and calibration to human-labeled examples within each dataset. This reduces variance, corrects for prompt- and LLM-specific effects, and delivers outputs in close alignment with human evaluators.

SWE-Judge can be positioned alongside systems such as JudgeLRM (Chen et al., 31 Mar 2025), which use reinforcement learning with outcome-driven rewards and chain-of-thought (CoT) structural enforcement to deepen model reasoning. While JudgeLRM specializes in direct reward-shaping to increase reasoning fidelity, SWE-Judge leverages prompt diversity, statistical ensembling, and dynamic selection to achieve robust, semantically faithful correctness assessment, with the potential for hybridization (e.g., by fine-tuning ensemble judges as in JudgeLRM’s RL regimen).

In terms of metric selection and judge evaluation, recent work underscores the importance of prevalence-independent, label-symmetric metrics such as balanced accuracy (BA) and Youden's $J$ statistic for selecting among candidate judges (Collot et al., 8 Dec 2025). While SWE-Judge currently uses simple correlation with human ratings for team selection, this suggests future work could further leverage prevalence-preserving metrics.
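Balanced accuracy and Youden's J are both simple functions of sensitivity and specificity, so they are insensitive to label prevalence. The helper below is an illustrative sketch computing both from binary confusion-matrix counts.

```python
def balanced_accuracy_and_youden_j(tp, fn, tn, fp):
    """Prevalence-independent metrics from binary confusion counts.

    BA is the mean of sensitivity and specificity; Youden's J is
    sensitivity + specificity - 1 (equivalently, 2*BA - 1).
    """
    tpr = tp / (tp + fn)       # sensitivity (true positive rate)
    tnr = tn / (tn + fp)       # specificity (true negative rate)
    ba = (tpr + tnr) / 2
    j = tpr + tnr - 1
    return ba, j
```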

6. Implementation and Integration

SWE-Judge is implemented as a two-stage pipeline:

  1. Team Selection: A pool of $N = 20$ human-labeled instances from a target dataset is used to evaluate all eligible judge teams (subsets of the five, always including S₁ and at least one more). The team with maximal average ranking correlation to human scores is selected.
  2. Ensemble Scoring: New samples are assessed by the selected team; scores are aggregated and mapped to the final human label scale.

A representative pseudocode sketch:

function TrainAndEvaluateSEJury(dataset D, labels S*, N=20):
    D_sel, D_test = randomSplit(D, sizes=(N, |D|-N))
    Teams = all subsets of {S1..S5} that include S1 and have size >= 2
    bestCorr = -infinity
    for T in Teams:
        aggScores, gtScores = [], []
        for (x, y, r, Sgt) in D_sel:
            for k in T:
                scores[k] = LLM(Prompt_k(x, y, r))
            aggScores.append(average(scores[k] for k in T))
            gtScores.append(Sgt)
        corr = 0.5 * (KendallTau(aggScores, gtScores) + SpearmanRho(aggScores, gtScores))
        if corr > bestCorr:
            bestCorr = corr; bestTeam = T
    results = []
    for (x, y, r) in D_test:
        for k in bestTeam:
            scores[k] = LLM(Prompt_k(x, y, r))
        sRaw = average(scores[k] for k in bestTeam)
        sFinal = mapLinear(sRaw, [0,100] -> [minLabel, maxLabel])
        results.append(sFinal)
    return results

Integration is system-agnostic and compatible with arbitrary LLM backends and task domains, conditional on prompt specification and human calibration set availability (Zhou et al., 27 May 2025).

7. Broader Applications and Future Outlook

SWE-Judge offers a scalable and empirically grounded approach to evaluating machine- and human-generated software artifacts. As LLMs advance in reasoning and reliability, SWE-Judge’s ensemble paradigm—balancing prompt diversity, empirical calibration, and statistical ensembling—provides a template for extending high-fidelity judgment to new domains. A plausible implication is that, as larger labeled datasets become available, LLM-judge models could be fine-tuned or RL-optimized using team-ensemble objectives, unifying the strengths of SWE-Judge and RL-based approaches such as JudgeLRM.

Current weaknesses in code summarization and interpretive tasks suggest continued need for human-in-the-loop calibration and possibly domain-adapted prompts or additional feedback signals. Future versions may evolve toward semi-supervised, debate-based, or multi-turn ensembles, expanding coverage to style, security, and higher-level design assessment. SWE-Judge thus represents a foundational system for automated, high-agreement evaluation in software engineering, with broad implications for research protocols and practical workflows (Zhou et al., 27 May 2025).
