BlenderBench: 3D Multimodal Agent Benchmark
- BlenderBench is a benchmark designed to evaluate agents' interleaved multimodal reasoning and 3D scene manipulation within Blender.
- It stresses tasks requiring iterative code generation and scene verification, including camera adjustment, multi-step editing, and compositional modifications.
- Empirical results show advanced agents with persistent context memory, like VIGA, significantly outperform one-shot and stateless methods.
BlenderBench is a systematically designed benchmark that stress-tests agent capabilities in interleaved multimodal reasoning within a full-featured 3D graphics environment, specifically Blender. It targets the evaluation of agents’ abilities to coordinate code synthesis and scene verification in challenging, long-horizon 3D editing and reconstruction episodes, moving beyond the limitations of single-step, 2D-matching tasks. BlenderBench forms a central empirical component in the assessment of vision-as-inverse-graphics agents such as VIGA (Vision-as-Inverse-Graphics Agent) (Yin et al., 16 Jan 2026).
1. Motivation and Rationale
BlenderBench addresses critical shortcomings of existing vision-and-language benchmarks. Prior suites such as BlenderGym restrict agent complexity by limiting evaluation to one-shot program synthesis under a fixed camera pose, effectively reducing the 3D grounding challenge to 2D image matching. Furthermore, these benchmarks rarely incorporate tasks necessitating multi-step planning, viewpoint control, or compositional edits. BlenderBench addresses these deficits by exposing agents to a suite of tasks that require active scene inspection, code-based scene manipulation, and iterative reasoning, an integration necessary for robust, physically grounded transformations. The benchmark explicitly targets:
- The need for agents to interleave program generation (“write”) and graphical verification (“run–render–compare–revise”) inside a true 3D graphics context.
- The assessment of spatial and physical grounding across tasks such as camera adjustment, complex multi-round scene editing, and compositional modifications.
2. Task Design and Structure
BlenderBench is implemented on top of the Blender Python API, enabling programmatic access to core 3D primitives, imported assets, material properties, lighting, cameras, and animation controls. The benchmark consists of 30 episodes, partitioned into three principal task categories, each featuring 10 hand-constructed or procedurally generated instances:
- Camera Adjustment: Agents are initialized with a scene and provided a target image. The objective is to manipulate camera intrinsics and extrinsics to align the rendered viewpoint with the target.
- Multi-step Editing: Given an initial scene and a target (either image or natural language instruction), agents enact a sequence of code modifications—object creation/removal, transformation, altering materials, and lighting—over multiple iterations to match the specification.
- Compositional Editing: These tasks combine camera movement with multi-step scene edits and involve scenes populated with numerous objects, frequent occlusion, and require maintenance of global scene consistency amidst localized edits.
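The camera-adjustment category hinges on setting camera extrinsics so the rendered viewpoint matches a target. As a minimal pure-Python sketch of the underlying geometry (independent of the actual Blender API; the function name and angle convention are illustrative assumptions, not the benchmark's code):

```python
import math

def look_at_rotation(cam_pos, target):
    """Return (yaw, pitch) in radians that aim a camera at `target`.

    Convention (an assumption for this sketch): yaw rotates about the
    vertical z-axis, pitch tilts the view up/down.
    """
    dx = target[0] - cam_pos[0]
    dy = target[1] - cam_pos[1]
    dz = target[2] - cam_pos[2]
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)
    yaw = math.atan2(dy, dx)          # heading toward the target
    pitch = math.asin(dz / dist)      # elevation toward the target
    return yaw, pitch

# Aim a camera at the origin from (1, 1, 1): yaw = -135 deg, pitch ~ -35.3 deg
yaw, pitch = look_at_rotation((1.0, 1.0, 1.0), (0.0, 0.0, 0.0))
```

An agent solving the task iteratively would re-estimate such parameters after each render rather than computing them in closed form, since the target is given only as an image.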
Agents alternate between two high-level operational phases:
- Generation (Write → Run → Render): Synthesizing or patching Blender Python code, executing it, and rendering the resultant scene.
- Verification (Compare → Revise): Invoking scene-inspection tools such as `set_camera`, `initialize_viewpoint`, `investigate`, and `get_scene_info` to analyze discrepancies and inform iterative edits.
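The alternation between the two phases can be sketched as a closed loop. In this minimal pure-Python stand-in, the "scene" is a single scalar parameter and the stubbed `render`, `compare`, and `revise` functions take the place of executing Blender code; only the control flow is meant to mirror the benchmark's write–run–render–compare–revise cycle:

```python
def render(param):
    return param  # stand-in for executing code and rendering the scene

def compare(rendered, target):
    return abs(rendered - target)  # stand-in for a photometric loss

def revise(param, rendered, target, step=0.5):
    # Move the parameter toward the target; a real agent would patch code.
    return param + step * (target - rendered)

def episode(target, init_param=0.0, max_rounds=20, tol=1e-3):
    param, history = init_param, []
    for _ in range(max_rounds):          # interleaved write-run-render ...
        rendered = render(param)
        loss = compare(rendered, target) # ... then compare ...
        history.append(loss)             # persistent context (render history)
        if loss < tol:
            break
        param = revise(param, rendered, target)  # ... then revise
    return param, history

param, history = episode(target=3.0)
```

The retained `history` list is the analogue of the persistent context memory that distinguishes interleaved agents from memoryless iterative baselines.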
Notably, BlenderBench does not utilize task-specific auxiliary modules. All reasoning and memory management (including code diffs and render history) must emerge through interaction with the provided toolset and agent-internal context representations.
3. Dataset Composition and Scene Properties
Each of the 30 BlenderBench episodes presents a variable, procedurally generated or curated 3D scene. Key compositional features include:
- Object diversity: Scenes comprise 3–12 objects, combining basic primitives (boxes, spheres) and high-fidelity imported assets (from a small glTF/OBJ model library).
- Materials and Lighting: Material types span diffuse/plastic, metallic, and glass. Lighting conditions vary and may include HDR environment maps, localized point or fill lights.
- Procedural scene assembly: Placement, transformation, and lighting parameters are randomized within controlled ranges to ensure diverse possible spatial configurations.
- Absence of ground-truth programs: No explicit programmatic description of target scenes is exposed at test time. Agents reconstruct or edit independently, relying purely on scene inspection and tool-mediated information.
A plausible implication is that this absence of a direct ground-truth script enforces genuine inverse-graphics reasoning, requiring agents to synthesize scene program representations from rendered observations and updates.
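The procedural assembly described above can be sketched as seeded random sampling within controlled ranges. The dict layout, parameter ranges, and material names below are illustrative assumptions, not the benchmark's actual generator:

```python
import random

def assemble_scene(seed, n_min=3, n_max=12):
    """Sample a scene description with randomized placement and materials."""
    rng = random.Random(seed)  # seeding makes each episode reproducible
    materials = ["diffuse", "metallic", "glass"]
    objects = []
    for i in range(rng.randint(n_min, n_max)):
        objects.append({
            "name": f"obj_{i}",
            "location": tuple(round(rng.uniform(-5.0, 5.0), 2) for _ in range(3)),
            "scale": round(rng.uniform(0.5, 2.0), 2),
            "material": rng.choice(materials),
        })
    return {"seed": seed, "objects": objects}

scene = assemble_scene(seed=42)
```

Because no ground-truth program is exposed at test time, a description like `scene` exists only on the benchmark side; the agent must recover an equivalent program from renders alone.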
4. Evaluation Protocol and Metrics
Agents are assessed on a per-episode, best-of-N-trajectory basis across the following axes:
- Photometric Loss (PL): Pixelwise distance between the agent’s rendered image and the target image; lower values denote better alignment.
- Negative-CLIP Score (N-CLIP): The negative cosine similarity between CLIP embeddings of the agent’s rendering and the target rendering; lower (i.e., more negative) scores are preferred.
- VLM Score: Subjective rating (human or model-based, scale 0–5) measuring: (1) task completion, (2) visual fidelity (shapes/colors/materials), and (3) spatial accuracy (camera/viewpoint).
- Success Rate: For 2D SlideBench-style tasks, the percentage of episodes producing executable code trajectories.
- Relative Improvement: Percent improvement of a metric m over a baseline value m_base, i.e., (m_base − m) / m_base × 100% for lower-is-better metrics (with the sign flipped for higher-is-better ones).
The protocol sweeps through multiple random (sampled) code generation trajectories for each episode, reporting best-of-N performance to accommodate stochasticity in agent outputs.
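The image-based metrics admit straightforward implementations. The sketch below uses one common instantiation of each (mean absolute pixel difference for PL; negative cosine similarity over embedding vectors for N-CLIP); the paper's exact formulas may differ:

```python
import math

def photometric_loss(img_a, img_b):
    """Mean absolute pixel difference between two same-sized images."""
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    return sum(abs(a - b) for a, b in zip(flat_a, flat_b)) / len(flat_a)

def neg_cosine(u, v):
    """Negative cosine similarity between two embedding vectors (N-CLIP style)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return -dot / (norm_u * norm_v)

def relative_improvement(baseline, value):
    """Percent improvement over `baseline` for a lower-is-better metric."""
    return 100.0 * (baseline - value) / baseline

# Toy 2x2 grayscale images: PL = (0.5 + 0 + 0 + 0.5) / 4 = 0.25
target_img = [[0.0, 1.0], [1.0, 0.0]]
render_img = [[0.5, 1.0], [1.0, 0.5]]
pl = photometric_loss(render_img, target_img)

# e.g., VIGA vs. One-Shot on Task 1 PL from the results table below
gain = relative_improvement(48.16, 5.47)
```

Best-of-N reporting then simply takes the minimum (for PL and N-CLIP) or maximum (for VLM Score) over the N sampled trajectories per episode.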
5. Baselines and Empirical Results
BlenderBench facilitates direct comparative analysis across code generation paradigms. Three settings are benchmarked:
- One-Shot: Single-pass code generation, with no iteration or memory.
- BlenderAlchemy: Iterative (but memoryless) code generation; each round is independent of prior attempts.
- VIGA: Fully interleaved write–run–render–compare–revise, leveraging persistent context memory.
Empirical results for GPT-4o (best-of-4) reveal substantial gains for VIGA over baselines:
| Setting | Task 1 PL↓ | Task 1 N-CLIP↓ | Task 1 VLM↑ | Task 2 PL↓ | Task 2 N-CLIP↓ | Task 2 VLM↑ | Task 3 PL↓ | Task 3 N-CLIP↓ | Task 3 VLM↑ | Impr.% |
|---|---|---|---|---|---|---|---|---|---|---|
| One-Shot | 48.16 | 64.17 | 0.58 | 7.36 | 7.12 | 2.75 | 30.14 | 38.69 | 0.25 | — |
| BlenderAlchemy (b-of-4) | 14.50 | 19.57 | 1.75 | 1.95 | 2.47 | 3.53 | 20.62 | 25.79 | 0.56 | +77.48 |
| VIGA (b-of-4) | 5.47 | 6.10 | 3.25 | 2.94 | 3.50 | 3.83 | 12.62 | 22.84 | 1.61 | +159.19 |
VIGA achieves an average improvement of approximately 124.7% over the best baseline. On smaller models (Qwen3-VL-8B), VIGA yields up to +312% gain in VLM Score on Task 3. This suggests that interleaved, context-aware iterative reasoning provides a decisive advantage on BlenderBench over both one-shot and stateless iterative code generation.
6. Task Execution Workflows
BlenderBench task episodes illustrate critical features of interleaved reasoning:
- Camera Adjustment Example: The agent incrementally invokes scene inspection tools to reposition the camera through multiple rounds, iteratively reducing occlusion and photometric loss. The process involves code diffing and updating only relevant parameters.
- Multi-step Scene Editing Example: The agent sequentially edits object geometry, material, and lighting attributes, leveraging scene querying tools to verify the success of each operation before further revisions. Feedback from earlier verification steps directly informs code generation in subsequent rounds.
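The code-diffing step mentioned above can be illustrated with the standard library's `difflib`: diffing two successive edit scripts isolates exactly the parameters that changed between rounds. The Blender-style attribute names in the snippets are illustrative, not taken from the benchmark:

```python
import difflib

# Two successive versions of an (illustrative) edit script.
before = """\
camera.location = (4.0, -4.0, 3.0)
lamp.energy = 500
cube.scale = (1.0, 1.0, 1.0)
"""
after = """\
camera.location = (2.5, -3.0, 2.0)
lamp.energy = 500
cube.scale = (1.0, 1.0, 1.0)
"""

diff = list(difflib.unified_diff(
    before.splitlines(keepends=True),
    after.splitlines(keepends=True),
    fromfile="round_1.py", tofile="round_2.py",
))

# Keep only the changed lines, dropping the +++/--- file headers.
changed = [line for line in diff
           if line.startswith(("+", "-"))
           and not line.startswith(("+++", "---"))]
```

Here only the camera line appears in `changed`, so a revision can patch that single parameter while leaving the verified lamp and cube settings untouched.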
A plausible implication is that BlenderBench’s episodic setup incentivizes agents to develop internal representations that facilitate long-horizon, compositional reasoning across perception (rendered observations) and action (program synthesis) modalities.
7. Significance and Research Applications
BlenderBench constitutes a high-fidelity empirical testbed for 3D reasoning under multimodal constraints, with several notable contributions:
- Model-agnostic protocol: The absence of auxiliary, task-specific modules enables unified evaluation of heterogeneous VLMs and vision-language-code agents.
- Stress-testing spatial and physical grounding: The benchmark exposes the limitations of naive image-matching or single-pass code generation, favoring agents equipped for iterative, closed-loop reasoning.
- Support for emerging agent architectures: As demonstrated by VIGA’s performance, BlenderBench effectively reveals the advantage of agents with context memory and tightly coupled generation–verification cycles (Yin et al., 16 Jan 2026).
BlenderBench thus serves as a rigorously designed diagnostic tool for progress in inverse-graphics reasoning, synthetic scene understanding, and multimodal agent learning.