BlenderGym: VLM Benchmark for 3D Editing

Updated 2 February 2026
  • BlenderGym is a comprehensive benchmark defining code-driven 3D scene editing with tasks across object placement, lighting, procedural material, blend-shape, and geometric edits.
  • It employs a closed-loop generator–verifier process with granular metrics like photometric loss, negative CLIP score, and Chamfer distance to assess VLM performance.
  • The benchmark reveals significant performance gaps between human experts and VLM systems, highlighting the potential of memory-augmented, iterative frameworks for improved graphics editing.

BlenderGym is a comprehensive benchmark for evaluating foundational vision-language model (VLM) systems on code-driven 3D graphics editing tasks within the Blender environment. It formally defines a programmatic pipeline in which agents transform an initial Blender scene, provided as Python scripts and rendered images, into a specified target scene, requiring multi-domain reasoning that encompasses object manipulation, lighting, procedural geometry, blend-shape, and material editing. BlenderGym provides granular metrics, transparent oracle evaluations, and a suite of scene-editing instances designed for rigorous system-level comparison and quantification of key VLM capabilities in graphics-centric perception and manipulation (Gu et al., 2 Apr 2025, Yin et al., 16 Jan 2026).

1. Formalization and Task Domains

BlenderGym conceptualizes code-based 3D scene editing as a mapping from a given start scene $S_0$ and a textual or visual instruction $I$ to an executable code patch $C$, which, when applied, produces a rendered scene $S'$ to be compared against the goal state $S^*$. Formally, the evaluation function is $E(S_0, C) = \begin{cases} \text{success} & \text{if } \operatorname{dist}(S', S^*) \leq \tau \\ \text{failure} & \text{otherwise} \end{cases}$, with $\operatorname{dist}(\cdot,\cdot)$ parameterized by photometric loss, negative-CLIP score, and mesh Chamfer distance. Distinct from many prior 3D editing datasets, BlenderGym spans five editing domains:

  1. Object Placement: relocation, addition, or deletion of rigid meshes.
  2. Lighting Adjustment: modifications of color, intensity, type, or orientation.
  3. Procedural Material Editing: shader graph and numeric parameter updates.
  4. Blend-Shape Manipulation: continuous shape key-based mesh deformations.
  5. Procedural Geometry Editing: changes in node-based geometry, such as topology mutations.
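The success criterion above amounts to a simple thresholded distance test. A minimal toy sketch follows; the scalar scene summaries and the `toy_dist` helper are illustrative stand-ins, not the benchmark's actual implementation:

```python
def evaluate(dist, s_prime, s_star, tau):
    """Success iff the edited scene S' is within threshold tau of the goal S*."""
    return "success" if dist(s_prime, s_star) <= tau else "failure"

# Toy distance over scalar scene summaries; the benchmark's dist(., .) instead
# combines photometric loss, negative CLIP score, and Chamfer distance.
toy_dist = lambda a, b: abs(a - b)

print(evaluate(toy_dist, 0.95, 1.0, tau=0.1))  # -> success
print(evaluate(toy_dist, 0.50, 1.0, tau=0.1))  # -> failure
```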

Each annotated instance includes Blender script files for $S_0$ and $S^*$, multi-view renders, and aligned language instructions. This design ensures per-instance edit complexity and cross-domain coverage (Gu et al., 2 Apr 2025).

2. Benchmark Pipeline and Interaction Modalities

The BlenderGym pipeline operationalizes the editing challenge as an iterative generator–verifier sequence, typically realized in three rounds with four candidates per round (depth $d=3$, breadth $b=4$). Scene generation proceeds as follows:

  • Generation: Brainstormer prompts analyze input/output renders and propose localized code changes. Code-editor agents produce corresponding “Before:”/“After:” code diffs.
  • Verification: Verifier agents review concatenated renders and select the candidate edit yielding the closest match to $S^*$.
  • Execution and Feedback: Each candidate code is rendered in Blender; selection prunes down candidates per round, with survivors forming the next round’s start state.
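The rounds above form a pruned search over candidate edits. A minimal sketch, with stand-in functions replacing the VLM brainstormer and verifier agents (the numeric "scene" and random proposals are illustrative only):

```python
import random

def closed_loop_edit(start, propose, score, depth=3, breadth=4):
    """Each round: propose `breadth` candidate edits of the current state,
    let the verifier keep the best-scoring one, and use it as the next
    round's start state (depth d=3, breadth b=4 as in the benchmark)."""
    state = start
    for _ in range(depth):
        candidates = [propose(state) for _ in range(breadth)]
        state = min(candidates, key=score)  # verifier keeps the closest match
    return state

random.seed(0)
target = 10.0
result = closed_loop_edit(
    0.0,
    propose=lambda s: s + random.uniform(-1.0, 3.0),  # localized "edits"
    score=lambda s: abs(s - target),
)
print(result)  # moves toward the target over successive rounds
```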

Agents access JSON-based APIs for scene graph queries (get_scene_info), viewpoint selection (set_camera, initialize_viewpoint), targeted scene probing (investigate), code execution (execute_code), and process termination (end_process). The closed-loop write→run→render→compare→revise cycle enables systematic, localized error correction and robust recovery from execution faults (Yin et al., 16 Jan 2026).
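A toy dispatcher for such JSON tool calls might look as follows; only the tool names come from the paper, and the stub bodies here stand in for real Blender (`bpy`) operations:

```python
import json

def get_scene_info(scene):
    """Stub scene-graph query: report each object's location."""
    return {name: obj["location"] for name, obj in scene["objects"].items()}

def execute_code(scene, code):
    """Stub code execution: run an agent-proposed edit against the scene."""
    exec(code, {"scene": scene})
    return {"status": "ok"}

TOOLS = {"get_scene_info": get_scene_info, "execute_code": execute_code}

def handle(scene, request_json):
    """Route a call like {"tool": "get_scene_info", "args": {...}}."""
    req = json.loads(request_json)
    return TOOLS[req["tool"]](scene, **req.get("args", {}))

scene = {"objects": {"Cube": {"location": [0, 0, 0]}}}
print(handle(scene, '{"tool": "get_scene_info"}'))  # -> {'Cube': [0, 0, 0]}
```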

3. Quantitative and Qualitative Evaluation Metrics

Performance in BlenderGym is assessed via multimodal and geometric metrics designed for granular scene comparison:

  • Photometric Loss (PL): $PL(I, I^*) = \frac{1}{HW} \sum_{x=1}^{H} \sum_{y=1}^{W} \lvert I(x,y) - I^*(x,y) \rvert$, where lower is better.
  • Negative CLIP Score (N-CLIP): $N\text{-}CLIP(I, I^*) = -\operatorname{SIM}(I, I^*)$, with similarity measured via pretrained CLIP-ViT encoders.
  • Chamfer Distance (CD): Applied to mesh geometry changes, measured in meters; lower values indicate closer geometric alignment.

Binary success is determined by thresholding these metrics per instance. Relative improvement (%) is computed for PL and N-CLIP, such as $\mathrm{Impr}_{\mathrm{PL}}(\%) = \frac{\mathrm{PL}_B - \mathrm{PL}_M}{\mathrm{PL}_B} \times 100\%$, where $B$ and $M$ denote baseline and method, respectively.
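Pure-Python sketches of these metric definitions, on toy inputs (real evaluation runs on rendered images and sampled mesh point clouds, and the symmetric Chamfer variant shown is one common choice):

```python
def photometric_loss(img_a, img_b):
    """Mean absolute per-pixel difference over an H x W image (lower is better)."""
    h, w = len(img_a), len(img_a[0])
    return sum(abs(img_a[y][x] - img_b[y][x])
               for y in range(h) for x in range(w)) / (h * w)

def chamfer_distance(pts_a, pts_b):
    """Symmetric average nearest-neighbour distance between two point sets."""
    def d(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5
    a2b = sum(min(d(p, q) for q in pts_b) for p in pts_a) / len(pts_a)
    b2a = sum(min(d(q, p) for p in pts_a) for q in pts_b) / len(pts_b)
    return a2b + b2a

def pl_improvement(pl_baseline, pl_method):
    """Relative PL improvement in percent, per the formula above."""
    return (pl_baseline - pl_method) / pl_baseline * 100.0

print(photometric_loss([[0, 1], [1, 0]], [[0, 0], [0, 0]]))  # -> 0.5
print(pl_improvement(2.0, 1.5))                              # -> 25.0
```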

Quantitative benchmarks show that humans achieve near-optimal scores with PL and N-CLIP an order of magnitude below leading VLM baselines. For example, human PL on blend-shape is $0.934$ ($\times 10^{-3}$ units), while GPT-4V attains $9.140$ and InternVL2-8B $12.69$; similar performance gaps are observed across geometry, lighting, and material editing (Gu et al., 2 Apr 2025).

4. Model Systems, Baselines, and Reported Results

BlenderGym has been used to assess 13 VLM systems, both closed-source (GPT-4V(o), GPT-4-Turbo, Claude 3.5 Sonnet/Haiku, Gemini 1.5 Flash) and open-source (Qwen2-VL-7B, InternVL2-8B, Phi-3.5-Vision, MiniCPM-V-2.6, and Llama-backboned derivatives). Prompts are standardized to ensure model comparability. Three principal methods are compared:

  • One-Shot: Direct forward pass of the VLM for code generation.
  • BlenderAlchemy: A memory-less iterative editing agent, lacking evolving context.
  • VIGA (Vision-as-Inverse-Graphics Agent): Implements a closed-loop write–run–render–compare–revise protocol with memory and a skill library.

Across best-of-1 and best-of-4 settings, VIGA achieves an average 35.32% reduction in PL over one-shot baselines. For smaller open-source models (e.g., Qwen3-VL-8B), relative PL improvement exceeds 289.8%, with memory ablation demonstrating that evolving context is the main driver of gains (BlenderAlchemy yields only 12–23% PL improvements). Domain-wise best-of-1 PL results with a GPT-4o backbone:

| Domain     | One-Shot PL | VIGA PL | PL Impr. (%) |
|------------|-------------|---------|--------------|
| BlendShape | 7.94        | 6.70    | 14.33        |
| Placement  | 11.86       | 9.89    | 16.55        |
| Geometry   | 18.12       | 11.82   | 34.75        |
| Lighting   | 2.06        | 1.43    | 30.58        |
| Material   | 8.78        | 6.11    | 30.45        |

Qualitative analysis reveals errors typical of current VLMs: misestimation of numeric parameters (lighting RGB triplets), improper value range clamping (blend-shape key out-of-bounds), and missed fine-grained material edits. VIGA is task-agnostic and model-agnostic, requiring neither auxiliary modules nor fine-tuning (Yin et al., 16 Jan 2026).

5. Inference Scaling and Verifier Optimization

BlenderGym provides a platform for systematic experimentation with inference scaling, investigating how performance varies with compute allocation across candidate generation and verification. Two scaling knobs are defined:

  • Breadth ($b$): Number of code-edit candidates in each generation iteration.
  • Verifier reselection ($k$): Number of times the verifier agent reselects candidates through shuffled pairwise comparison, with majority voting on survivors.

Algorithmically, as $k$ increases (holding $b \times d$ fixed), PL, N-CLIP, and CD all improve, demonstrating that “verifier scaling” directly enhances edit quality. Notably, for InternVL2-8B, large $k$ enables surpassing unscaled GPT-4V and Claude 3.5 Sonnet baselines.
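Verifier reselection can be illustrated with a toy tournament plus majority vote; the `noisy_better` comparator is a hypothetical stand-in for a VLM verifier that errs some fraction of the time:

```python
import random
from collections import Counter

def tournament(cands, noisy_better):
    """One shuffled pairwise-elimination pass; returns the survivor."""
    pool = cands[:]
    random.shuffle(pool)
    while len(pool) > 1:
        a, b = pool.pop(), pool.pop()
        pool.append(a if noisy_better(a, b) else b)
    return pool[0]

def reselect(cands, noisy_better, k):
    """Majority vote over k independent tournament winners."""
    winners = Counter(tournament(cands, noisy_better) for _ in range(k))
    return winners.most_common(1)[0][0]

random.seed(0)
# Candidates scored by distance to 0; the comparator is wrong 20% of the time.
noisy = lambda a, b: (abs(a) < abs(b)) ^ (random.random() < 0.2)
print(reselect([5, 3, 8, 1], noisy, k=9))  # larger k makes picking 1 more likely
```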

A compute allocation study further reveals regime-dependent results: with a fixed total query budget $Q$, the optimal “VeriRatio” $r = Q_{\text{verif}}/Q$ shifts from $r=0.33$ at low budgets to $r=0.73$ at high budgets, indicating that exploration (generation) is preferable under constraint, whereas exploitation (verification) yields superior results given ample compute (Gu et al., 2 Apr 2025).

6. Significance and Outlook

BlenderGym establishes a unified benchmarking formalism for code-driven 3D scene editing via VLM agents, highlighting the limitations of contemporary systems relative to human Blender users, and exposing key areas for methodological innovation—especially in iterative feedback, verification, and memory design. Its programmatic, transparent evaluation pipeline and diverse editing domains make it suitable for driving progress in vision-as-inverse-graphics, multimodal agent construction, and scalable graphics automation. The benchmark serves both as a challenge corpus and as a tool to analyze scaling strategies for future VLM system designs (Gu et al., 2 Apr 2025, Yin et al., 16 Jan 2026).

A plausible implication is that closed-loop, memory-augmented frameworks—rather than direct code generation—will be pivotal to closing the gap between automated and expert-level graphics editing, especially in tasks demanding fine physical and spatial grounding. This suggests ongoing relevance for multi-agent, tool-augmented architectures, and dynamic compute allocation strategies in the evolution of scene-editing AI systems.
