- The paper introduces a comprehensive benchmark that integrates text, executable code, and rendered diagrams to evaluate geometric generative reasoning.
- It enforces multi-step planning and constraint satisfaction, highlighting the superiority of code-driven LLM pipelines over end-to-end UMMs.
- The benchmark offers detailed categorization and difficulty stratification, establishing a novel protocol for verifiable, multimodal AI evaluation.
GGBench: Benchmarking Geometric Generative Reasoning in Unified Multimodal Models
Introduction and Motivation
The evolution of evaluation frameworks in artificial intelligence has paralleled advances from unimodal to highly integrated multimodal models, culminating in Unified Multimodal Models (UMMs) capable of both understanding and actively generating across diverse modalities. However, current benchmarks treat discriminative "understanding" and unconstrained "generation" in isolation, and lack systematic evaluation of generative reasoning, in which comprehension drives the construction of meaningful artifacts. To address this gap, "GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models" (2511.11134) establishes geometric construction as a rigorous foundation for quantifying multimodal generative reasoning, exploiting geometry's demand for a precise, verifiable fusion of language, logic, and visual generation.
Figure 1: GGBench introduces fully integrated understanding and generative reasoning as the evaluation paradigm, in contrast to traditional benchmarks assessing these capabilities in isolation.
Benchmark Design and Methodology
GGBench is distinguished by its tri-modal alignment: each instance consists of textual specification, executable GeoGebra code, and rendered geometric diagrams. This design operationalizes verifiable, stepwise geometric construction as the testbed for multimodal reasoning, going beyond answer selection to enforce multi-step planning and constraint satisfaction.
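As a rough illustration of what such a tri-modally aligned instance might look like, here is a minimal Python sketch. The field names and the GeoGebra snippet are hypothetical, chosen only to convey the structure; the paper's actual data format may differ.

```python
from dataclasses import dataclass

# Hypothetical schema for a tri-modally aligned benchmark instance.
# Field names are illustrative, not taken from the GGBench release.
@dataclass
class GeoInstance:
    problem_text: str           # natural-language construction task
    geogebra_steps: list[str]   # executable GeoGebra commands, one per step
    diagram_paths: list[str]    # rendered images, one per construction stage
    categories: list[str]       # cognitive reasoning category tags
    difficulty: str             # e.g. "easy" / "medium" / "hard"

inst = GeoInstance(
    problem_text="Construct the perpendicular bisector of segment AB.",
    geogebra_steps=[
        "A = (0, 0)", "B = (4, 0)",
        "c1 = Circle(A, Distance(A, B))",
        "c2 = Circle(B, Distance(A, B))",
        "P = Intersect(c1, c2, 1)", "Q = Intersect(c1, c2, 2)",
        "bisector = Line(P, Q)",
    ],
    diagram_paths=["step1.png", "step2.png", "step3.png"],
    categories=["Basic Constructions"],
    difficulty="easy",
)
print(len(inst.geogebra_steps))  # → 7 construction steps
```

Because every step is an executable command, a verifier can replay the construction and check each intermediate diagram, rather than only grading a final answer.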
Figure 2: The benchmark evaluation protocol combines textual reasoning, executable code for each construction step, and automated stepwise verification.
The GGBench pipeline synthesizes problems via web-scraping, LLM filtering, and expert selection, followed by prompt adaptation and code generation. Problems are authored and quality-controlled using state-of-the-art LLMs (e.g., GPT-5), producing ∼10,000 initial instances, which are reduced to 1,411 high-quality samples after multi-tiered filtering and expert validation. Each item is mapped onto eight cognitive reasoning categories and stratified by difficulty, supporting fine-grained analysis across foundational and advanced geometric skills.
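The multi-tiered filtering funnel described above (~10,000 candidates reduced to 1,411) can be sketched as a chain of predicates. The predicates and counts below are invented for demonstration; the paper's actual filtering criteria are not specified here.

```python
# Hypothetical sketch of a multi-tier filtering funnel:
# scraped candidates -> LLM quality filter -> code-executability check
# -> expert validation. All predicates and data are illustrative.
def llm_filter(item):       # e.g., an LLM judges the problem well-formed
    return item["well_formed"]

def code_executes(item):    # e.g., the GeoGebra code runs without error
    return item["code_ok"]

def expert_approves(item):  # human expert validation
    return item["approved"]

candidates = [
    {"id": i, "well_formed": i % 2 == 0, "code_ok": i % 3 == 0, "approved": i % 5 == 0}
    for i in range(100)
]

stage1 = [c for c in candidates if llm_filter(c)]
stage2 = [c for c in stage1 if code_executes(c)]
final  = [c for c in stage2 if expert_approves(c)]
print(len(candidates), len(stage1), len(stage2), len(final))  # → 100 50 17 4
```

Each tier is cheap to audit independently, which is what makes the resulting 1,411-item set defensible as "high quality" rather than merely small.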
Figure 3: Comprehensive overview of the GGBench data construction and quality assurance pipeline.
Dataset Characteristics
GGBench contains 1,411 problems with tight alignment between natural language, construction code, and images. Tasks span straightedge-and-compass, analytic, and transformation-based constructions, each requiring between three and seven diagrams for visual evidence and process inspection. The dataset is designed for dense reasoning coverage, with an average of 2.20 category tags per problem and multi-step procedural demands.
Figure 4: Inner ring shows difficulty stratification; outer ring details category composition, demonstrating reasoning complexity escalation across difficulty tiers.
Comparative Analysis with Existing Benchmarks
In direct comparison to MathVista, MathVerse, MM-MATH, GeoEval, PolyMath, and others, GGBench is unique in: (1) enforcing constructive generation (not just answer selection), (2) providing complete tri-modal alignment (text, code, image), and (3) enabling deterministic, process-level verification via code executability and visual correctness. All problem instances demand multi-step symbolic-to-visual reasoning, surpassing prior datasets that decouple planning, code, and diagram synthesis.
Evaluation Protocols and Metrics
GGBench proposes a four-stage evaluation process:
- Planning (VLM-T): Assesses stepwise logical reasoning in text.
- Intermediate Process (VLM-I-Mid): Scores chronological construction panels for step accuracy and process consistency.
- Final Result (VLM-I-Res): Evaluates geometric correctness of the final figure against ground truth, augmented by pixel-level metrics (PSNR, SSIM, LPIPS).
- Overall (VLM-I): Aggregates intermediate and final scores for comprehensive modal reasoning quality.
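Of the pixel-level metrics named in the final-result stage, PSNR is the simplest to compute directly (SSIM and LPIPS require dedicated implementations). A minimal NumPy sketch, with synthetic images standing in for rendered diagrams:

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two equal-size images, in dB."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Two tiny synthetic "diagrams": identical except for one pixel.
ref = np.zeros((8, 8), dtype=np.uint8)
gen = ref.copy()
gen[0, 0] = 16  # a single differing pixel value

print(round(psnr(ref, gen), 2))  # → 42.11
```

Pixel metrics alone cannot judge geometric validity (a slightly translated but correct construction scores poorly), which is presumably why GGBench pairs them with VLM-based correctness grading.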
Human evaluation correlates tightly (Pearson's r = 0.9295) with automated VLM ratings, substantiating the reliability of the metrics.
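The reported agreement is a standard Pearson correlation between automated and human scores. A self-contained sketch, using invented per-model numbers rather than the paper's data:

```python
import numpy as np

def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two score vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))

# Invented per-model scores for illustration (not the paper's results):
vlm_i = [57.1, 49.3, 44.0, 30.2, 22.5]   # automated VLM-I scores
human = [83.1, 75.4, 70.2, 51.0, 40.8]   # human expert ratings

print(round(pearson_r(vlm_i, human), 4))
```

A coefficient near 1 (as in the reported r = 0.9295) indicates the automated metric preserves the human ranking of models, justifying its use at scale.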
Figure 5: Strong Pearson correlation between automated VLM-I metric and human expert evaluation.
Experimental Results
State-of-the-art UMMs (e.g., Nano Banana, Bagel, Janus, Seedream) are compared to leading LLMs/LRMs (e.g., GPT-4o, GLM-4.5V, Claude Sonnet 4.5, GPT-5). The code-driven LLM pipeline consistently outperforms end-to-end UMM approaches in planning coherence, intermediate reasoning quality, and geometric validity.
- GPT-5 attains the highest overall scores (VLM-I: 57.08, Human: 83.06), with Claude Sonnet 4.5 and DeepSeek-V3.1 trailing.
- End-to-end UMMs (Nano Banana, Janus, Bagel) lag in geometric correctness and constraint enforcement, reflecting current limitations in direct generative synthesis of structured diagrams.
Category and Typology Analysis
Category-wise evaluation reveals that models excel on procedural tasks (Basic Constructions, Circle Properties), but falter on high-abstraction categories (Measurement Ratios, Theorem Applications) requiring symbolic-deductive alignment.
Figure 6: VLM-I performance matrix by model and geometric category, highlighting strengths in procedural reasoning and limitations in algebraic/theorem-driven tasks.
Task-type analysis shows that models perform best on straightedge-and-compass constructions, with reduced accuracy on analytic and transformation-based problems, underscoring their reliance on explicit structural constraints for robust code generation.
Figure 7: VLM-I scores across Analytic Construction (AC), Geometric Transformation Construction (GTC), and Straightedge-and-Compass Construction (SCC), confirming a sharp drop when abstract reasoning is required.
Difficulty stratification shows progressive performance degradation, effectively isolating model robustness to task complexity.
Figure 8: VLM-I scores segmented by difficulty level, confirming significant drop-off as reasoning depth increases.
Error Analysis
Failure cases concentrate in four domains: logical misapplication, contextual/containment errors (misinterpreted relationships, e.g., reversed inscribe/circumscribe), conflation of numerical and constructive goals, and code-level execution breakdowns (syntax errors, keyword misuses).
Figure 9: Example of a containment failure—model reverses the logic, placing a circle inside a square instead of the instructed rectangle inscribed in a circle.
Implications and Future Directions
GGBench substantiates that generative reasoning in geometry requires more than perceptual synthesis; it demands rigorous, verifiable alignment between language, logic, code, and visual abstraction. LLMs with explicit planning and code generation pipelines achieve higher correctness but are constrained by visual expressivity; UMMs, though strong in perceptual fluency, fail in geometric constraint satisfaction.
The theoretical implication is the necessity for tightly coupled, symbolically grounded multimodal architectures, where linguistic–spatial representations are co-optimized and verifiable. Practically, GGBench provides an extensible platform for both benchmarking and curriculum-driven training of next-generation reasoning agents, and its methodology can be generalized beyond geometry to other domains where constructive, verifiable generation is essential (e.g., physical design, scientific modeling).
Conclusion
GGBench delivers the first comprehensive, tri-modal evaluation platform for geometric generative reasoning in unified multimodal models. Its rigorous design, process-level supervision, and holistic metrics clearly expose the gap between discriminative understanding and generative capability, setting a new research agenda for multimodal intelligence. The benchmark's protocol, data, and findings will serve as a reference point for evaluating, and advancing, AI agents capable of constructing as well as understanding.
Figure 10: Example of an easy-level geometric construction task illustrating stepwise procedural reasoning.
Figure 11: Medium-level task exemplifying rigid motion and transformation reasoning.
Figure 12: Hard-level problem requiring multi-stage hierarchical construction, such as recursive polygon inscribing and symbolic reasoning.