Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning

Published 12 Jun 2025 in cs.CL | (2506.10903v1)

Abstract: Autoformalization plays a crucial role in formal mathematical reasoning by enabling the automatic translation of natural language statements into formal languages. While recent advances using LLMs have shown promising results, methods for automatically evaluating autoformalization remain underexplored. As one moves to more complex domains (e.g., advanced mathematics), human evaluation requires significant time and domain expertise, especially as the complexity of the underlying statements and background knowledge increases. LLM-as-a-judge presents a promising approach for automating such evaluation. However, existing methods typically employ coarse-grained and generic evaluation criteria, which limit their effectiveness for advanced formal mathematical reasoning, where quality hinges on nuanced, multi-granular dimensions. In this work, we take a step toward addressing this gap by introducing a systematic, automatic method to evaluate autoformalization tasks. The proposed method is based on an epistemically and formally grounded ensemble (EFG) of LLM judges, defined on criteria encompassing logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ), resulting in a transparent assessment that accounts for different contributing factors. We validate the proposed framework to serve as a proxy for autoformalization assessment within the domain of formal mathematics. Overall, our experiments demonstrate that the EFG ensemble of LLM judges is a suitable emerging proxy for evaluation, more strongly correlating with human assessments than a coarse-grained model, especially when assessing formal qualities. These findings suggest that LLM-as-judges, especially when guided by a well-defined set of atomic properties, could offer a scalable, interpretable, and reliable support for evaluating formal mathematical reasoning.


Summary

The paper introduces an EFG framework that evaluates autoformalization by dissecting logical preservation, mathematical consistency, formal validity, and formal quality.
It employs 12 interpretable atomic properties aggregated via a linear ensemble model, achieving a strong correlation with human judgment.
Empirical results show that LLM-based judges, particularly GPT-4.1 variants, effectively capture nuanced evaluation aspects beyond traditional metrics.

Epistemic Ensemble of LLM Judges for Fine-Grained Evaluation of Autoformalization

Motivation and Problem Formulation

The evaluation of autoformalization—the translation of informal mathematical statements into formal languages—has become critical for advancing formal mathematical reasoning. As LLMs achieve impressive results in formalization tasks, the precise assessment of their outputs remains a challenge, especially given the multidimensionality of correctness and faithfulness in formal representations. Existing evaluation protocols rely predominantly on syntactic validity or coarse-grained criteria, often resulting in binary or reference-biased assessments that fail to capture nuanced quality attributes. Human evaluation, although reliable, becomes impractical in advanced mathematical domains due to the requisite expertise and scalability constraints.

Taxonomy of Formalization Evaluation

The paper introduces an epistemically and formally grounded (EFG) framework for evaluating autoformalization, constructing a taxonomy of four core aspects:

Logical Preservation (LP): Assessing the fidelity with which the logical structure and inferential intent of the natural language statement are retained in the formalization.
Mathematical Consistency (MC): Evaluating the semantic accuracy of mathematical objects and operations.
Formal Validity (FV): Verifying syntactic and structural correctness within the target formal system (Isabelle/HOL, Lean4).
Formal Quality (FQ): Measuring clarity, conciseness, and non-redundancy.

This taxonomy is operationalized via 12 interpretable, computable Operable Atomic Properties (OAPs), such as quantification, formula preservation, referential completeness, type-matching, conciseness, and logical consistency, which enable systematic estimation of each aspect.
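As a rough illustration of how atomic property scores might roll up into aspect scores, the sketch below averages per-property judge scores within each aspect. The grouping of properties under aspects, the `[0, 1]` score range, and the use of a plain mean are all assumptions made for illustration; the summary names only some of the 12 OAPs and does not say which aspect each belongs to. FV is left out of the grouping because the summary describes it as determined by theorem provers.

```python
from statistics import mean

# Hypothetical grouping of OAPs under aspects (not taken from the paper).
ASPECT_OAPS = {
    "LP": ["quantification", "logical_consistency"],
    "MC": ["formula_preservation", "type_matching"],
    "FQ": ["conciseness", "referential_completeness"],
}


def aspect_scores(oap_scores: dict[str, float], prover_ok: bool) -> dict[str, float]:
    """Average the OAP scores (assumed to lie in [0, 1]) for each aspect.

    FV is taken from a theorem-prover check rather than from OAPs.
    """
    scores = {
        aspect: mean(oap_scores[name] for name in names)
        for aspect, names in ASPECT_OAPS.items()
    }
    scores["FV"] = 1.0 if prover_ok else 0.0
    return scores


# Illustrative judge outputs for one formalization (made-up values).
example = {
    "quantification": 1.0,
    "logical_consistency": 0.0,
    "formula_preservation": 1.0,
    "type_matching": 1.0,
    "conciseness": 0.5,
    "referential_completeness": 0.5,
}
print(aspect_scores(example, prover_ok=True))
```

Keeping the property-to-aspect mapping in a single table makes it easy to swap in the paper's actual assignment once it is known.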

Figure 1: Overview of the EFG-LLM judge ensemble framework, incorporating scores from LLM judges and theorem provers into a single assessment via weighted aggregation.

Human Evaluation and Linear Ensemble Model

The framework was validated through human annotation of formalizations from the miniF2F and ProofNet datasets, covering Isabelle/HOL and Lean4. Empirical analysis demonstrates that logical preservation and mathematical consistency are often insufficient even in ground-truth formalizations, and that LLMs like GPT-4.1 are competent at identifying problematic cases. Formal validity, as determined by theorem provers, shows the strongest correlation with overall quality, but quality is multifaceted: LP and MC exhibit low inter-correlation, confirming that these aspects are not redundant.

A linear ensemble model synthesizes aspect scores:

$S_\text{OA}(s, \phi) = w_\text{LP} S_\text{LP}(s, \phi) + w_\text{MC} S_\text{MC}(s, \phi) + w_\text{FV} S_\text{FV}(\phi) + w_\text{FQ} S_\text{FQ}(\phi)$

Optimal weighting ($w_\text{LP}=0.25$, $w_\text{MC}=0.19$, $w_\text{FV}=0.32$, $w_\text{FQ}=0.24$) is determined via constrained quadratic programming, yielding a strong correlation ($\text{Coef}=0.785$, $\text{NRMSE}=0.284$) with human assessment.
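The aggregation and one possible way to fit the weights can be sketched as follows. The `overall_score` function applies the reported weights directly. The fitting routine is only an assumption about what "constrained quadratic programming" could look like: it minimizes mean squared error subject to non-negative weights that sum to one (a constraint set consistent with the reported values, but not stated in the summary) using projected gradient descent. The data is synthetic and exists only to check that the routine recovers known weights.

```python
import random

ASPECTS = ("LP", "MC", "FV", "FQ")
# Weights as reported in the summary.
REPORTED_WEIGHTS = {"LP": 0.25, "MC": 0.19, "FV": 0.32, "FQ": 0.24}


def overall_score(aspects, weights=REPORTED_WEIGHTS):
    """S_OA as the weighted sum of the four aspect scores."""
    return sum(weights[k] * aspects[k] for k in ASPECTS)


def project_to_simplex(v):
    """Euclidean projection onto {w : w_i >= 0, sum(w) = 1}."""
    theta, cumulative = 0.0, 0.0
    for i, u in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u
        t = (cumulative - 1.0) / i
        if u - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def fit_weights(rows, targets, steps=2000, lr=0.1):
    """Least squares on the simplex via projected gradient descent.

    Assumed objective and constraints; the paper's exact setup may differ.
    """
    n, d = len(rows), len(rows[0])
    w = [1.0 / d] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in zip(rows, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(d):
                grad[j] += 2.0 * err * x[j] / n
        w = project_to_simplex([wi - lr * g for wi, g in zip(w, grad)])
    return w


# Synthetic sanity check: targets generated from the reported weights.
rng = random.Random(0)
rows = [[rng.random() for _ in ASPECTS] for _ in range(50)]
targets = [overall_score(dict(zip(ASPECTS, r))) for r in rows]
fitted = fit_weights(rows, targets)
print([round(w, 3) for w in fitted])
```

In practice an off-the-shelf QP solver would be the more natural choice; the hand-written loop is used here only to keep the sketch free of dependencies.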

LLM-as-Judge: Comparative Evaluation

LLM judges, tested with both direct aspect assessment and OAP-based aggregation, were benchmarked against human evaluations. The results indicate:

Direct GPT-4.1 judges perform best on alignment faithfulness, particularly logical preservation.
GPT-4.1-Mini, when guided via OAPs, achieves superior assessment of formal quality and better overall correlation with human judgment.
Weighted-average OAP synthesis yields overall score correlations of $0.662$ (Isabelle/HOL) and $0.479$ (Lean4) with human assessments, outperforming reference-based metrics (BLEU, ChrF, RUBY) and previous baselines.

Figure 2: Correlation coefficients comparing LLM-as-judge metrics to human evaluation and reference-based metrics, demonstrating superior alignment for EFG ensemble methods.
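Comparisons of this kind reduce to computing agreement statistics between judge scores and human scores. The sketch below shows a Pearson correlation and a normalized RMSE. Both choices are assumptions: the summary does not say which correlation coefficient is reported, and NRMSE can be normalized in several ways (range normalization is used here). The score lists are made-up values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def nrmse(predicted, target):
    """RMSE divided by the range of the target scores (assumed normalization)."""
    n = len(target)
    rmse = sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / n)
    return rmse / (max(target) - min(target))


# Illustrative scores only, not data from the paper.
human = [0.2, 0.5, 0.9, 0.4, 0.7]
judge = [0.3, 0.4, 0.8, 0.5, 0.6]
print(pearson(judge, human), nrmse(judge, human))
```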

Robustness and Reliability Analysis

Evaluation with multiple LLMs (GPT-4.1-Mini, Qwen2.5-Coder-7B) reveals that although absolute scores vary, the ranking of autoformalization performance remains consistent. No evidence was found for cross-family bias favoring same-series autoformalizations. Empirical variance analyses show robustness across temperature settings—overall assessment scores remain stable despite randomness, though component scores (MC, FQ) fluctuate more with higher sampling diversity.

Figure 3: Error bars for GPT-4.1-Mini OAP-WA scoring across temperatures and runs, illustrating stable OA score distributions.
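A minimal sketch of the two checks described in this section: whether two judges rank the same systems in the same order despite different absolute scores, and how much a score varies across repeated runs. The system names and all numbers are hypothetical placeholders, and comparing rank order directly is just one simple way to express "consistent ranking".

```python
from statistics import mean, stdev


def ranking(scores):
    """Return system names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def run_stability(run_scores):
    """Mean and sample standard deviation of a score across repeated runs."""
    return mean(run_scores), stdev(run_scores)


# Hypothetical overall scores for three autoformalization systems,
# as assigned by two different judges.
judge_a = {"system_1": 0.71, "system_2": 0.58, "system_3": 0.64}
judge_b = {"system_1": 0.55, "system_2": 0.40, "system_3": 0.47}

same_order = ranking(judge_a) == ranking(judge_b)
print(ranking(judge_a), same_order)

# Hypothetical overall scores from repeated runs at one temperature.
avg, spread = run_stability([0.60, 0.62, 0.61])
print(avg, spread)
```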

Agreement analysis between LLM-judges and theorem provers using Cohen's kappa shows LLMs are not reliable proxies for formal validity; agreement rarely exceeds $0.3$, especially in more restrictive languages like Lean4.

Figure 4: Cohen's kappa for agreement between LLM-judges and theorem provers for formal validity on miniF2F, highlighting the limits of LLM evaluation for syntactic correctness.
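For reference, Cohen's kappa corrects raw agreement for the agreement expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it for two label sequences. The pass/fail labels are made-up values chosen to illustrate a low-agreement case, not data from the paper.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if p_e == 1.0:
        # Both raters always use the same single label; kappa is undefined,
        # and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# 1 = formally valid, 0 = invalid (illustrative values only).
prover = [1, 1, 1, 0, 0, 0, 1, 0]
llm_judge = [1, 1, 0, 0, 1, 0, 1, 1]
print(cohens_kappa(prover, llm_judge))
```

Here the two raters agree on 5 of 8 items, but chance agreement is already 0.5, so kappa comes out at 0.25.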

Implications and Future Directions

The fine-grained, interpretable EFG ensemble offers a scalable and reproducible evaluation protocol for autoformalization tasks. Core numerical results highlight that OAP-guided ensembles not only outperform direct assessment from larger LLMs but also provide richer, aspect-specific feedback. This circumvents reference bias and enables systematic benchmarking across varying model architectures and datasets.

The practical implication is that lightweight LLMs, properly structured via atomic property prompts, are sufficient for reliable judgment, enabling automated evaluation without expensive human annotation. Theoretically, disentangling alignment faithfulness from formalization correctness paves the way for models to optimize distinct aspects of formalization simultaneously and supports more granular error analysis.

Open questions remain regarding the optimal design of OAPs, judge-ensemble training/fine-tuning strategies, and integration of LLM explanations for boosting autoformalization quality. Further directions include specializing smaller open-source LLMs as domain-calibrated judges and leveraging self-improving evaluators for continuous refinement.

Conclusion

This work establishes a comprehensive multidimensional evaluation paradigm for autoformalization in formal mathematics, leveraging epistemic ensembles of LLM judges structured by interpretable atomic properties. The approach consistently yields stronger alignment with human judgment than coarse metrics, and enables robust, scalable evaluation of formalization systems. Limitations include dependence on small annotated sets and subjectivity in certain aspects, but the methodology is generalizable. Future research should further automate LLM judge calibration and expand the taxonomy to cover broader formal reasoning domains.
