Controllable Synthetic Benchmark
- Controllable synthetic benchmarks are automated frameworks that generate configurable data and tasks using explicit, user-exposed parameters, ensuring systematic evaluation and reproducibility.
- They leverage programmable environments, template-driven code, and conditioned generative models to control task complexity, data composition, and ground-truth structure across multiple modalities.
- This approach facilitates precise measurement of model capabilities, diagnosis of failure modes, and scalable studies in domains such as computer vision, NLP, and graph learning.
A controllable synthetic benchmark is a rigorously designed suite or pipeline for generating data, tasks, or environments with explicit, user-exposed parameters ("knobs") enabling systematic control over complexity, composition, annotation, and ground-truth structure. Synthetic benchmarks of this kind allow researchers to precisely vary task difficulty, probe specific model properties, measure failure modes, and reproduce evaluations under variable but well-defined conditions. Controllability is foundational for interpretation, reproducibility, diagnostic analysis, and scientific progress in fields ranging from computer vision and natural language understanding to program synthesis and graph learning.
1. Core Principles and Definitions
A controllable synthetic benchmark provides an automated mechanism to generate arbitrarily many samples or tasks, where each instance is specified by a user-facing set of parameters dictating scenario structure, data content, and ground-truth labels. These parameters may include environmental factors, object properties, entity identities, composition or combination of primitives, and more. Benchmarks of this class provide:
- Full parameterization of instance generation, often expressed as a generator function $G(\theta)$ of a user-facing parameter vector $\theta$.
- Mechanisms for random seed and sampling control, enabling identical datasets across runs.
- Explicit definition of evaluation metrics and protocols derived from the generated ground-truth.
- Extensible APIs or scripts for batch instantiation, annotation propagation, and evaluation, supporting concurrent or distributed generation.
The rationale is twofold: (1) to facilitate isolation and measurement of specific algorithmic capabilities (e.g., spatial reasoning, robustness to confounding, fairness under demographic variation), and (2) to provide ground-truth that is unambiguously known (by construction), which is essential for scientific benchmarking.
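The principles above (full parameterization, seed control, ground truth known by construction) can be sketched minimally. Everything here is a hypothetical illustration, not any specific benchmark's API: the `SceneParams` knobs and the toy "left-most object" task are invented for the example.

```python
import random
from dataclasses import dataclass

# Hypothetical parameter vector ("knobs") for a toy 2-D scene generator.
@dataclass(frozen=True)
class SceneParams:
    n_objects: int = 3        # scenario complexity
    noise_std: float = 0.05   # observation noise
    seed: int = 0             # sampling control for reproducibility

def generate_instance(p: SceneParams):
    """Deterministically map a parameter vector to (data, ground_truth)."""
    rng = random.Random(p.seed)
    positions = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(p.n_objects)]
    # Ground truth is unambiguous by construction: here, the left-most object.
    gold = min(range(p.n_objects), key=lambda i: positions[i][0])
    observed = [(x + rng.gauss(0, p.noise_std), y + rng.gauss(0, p.noise_std))
                for x, y in positions]
    return observed, gold

# Identical parameter vectors (including the seed) yield identical instances.
a = generate_instance(SceneParams(seed=42))
b = generate_instance(SceneParams(seed=42))
assert a == b
```

Because the generator is a pure function of its parameters, any dataset split can be communicated as a list of parameter vectors rather than as raw data, which is what makes run-to-run reproducibility cheap.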
2. Realizations Across Modalities and Tasks
Controllable synthetic benchmarks have been deployed in diverse domains:
- Vision and Representation Learning: PUG leverages Unreal Engine to create photorealistic, fully parameterizable environments. Its API enables specification of world, object, pose, lighting, materials, camera, and spatial relations. PUG supports both factorial sweeping and targeted out-of-distribution (OOD) study by holding out combinations along one or several factor axes (Bordes et al., 2023).
- Spatial Intelligence in VFMs: SpaRRTa defines a benchmark for evaluating visual foundation models on spatial relation recognition, using a programmable UE5 stack to set environments, object classes, positions, orientations, lighting, and viewpoint/camera. Each instance’s spatial relation label is derived algorithmically from 3D world state, with no manual annotation required. The design allows for precise, scalable, and reproducible assessment of allocentric versus egocentric reasoning (Kargin et al., 16 Jan 2026).
- NLP and Compositional Reasoning: Dyna-bAbI generalizes the bAbI task suite by defining a probabilistic generator controlled by task length, entity set, event/construct priors, and question templates. This enables systematic creation of tasks requiring compositional generalization, with control over the distribution of supporting facts, linguistic phenomena, and question-answer structure (Tamari et al., 2021).
- Graph Benchmarks: Synthetic graph generation based on attributed degree-corrected stochastic block models (SBMs) allows variation in structure (degree, community count, out-in ratio), attribute signal-to-noise, and node/edge feature alignment, supporting controlled studies of graph neural network robustness, feature-label alignment, and detectability thresholds (Tsitsulin et al., 2022).
- 3D Perception Tasks: UAV-MM3D parametrizes environments (scene, weather, UAV type, sensor modality), sweeps through sensor setups, and annotates per-frame ground truth for multi-modal perception, enabling explicit control and coverage over complex real-world scenario dimensions (Zou et al., 27 Nov 2025).
- Scientific Reasoning: SymPyBench dynamically generates parametrized university-level physics problems, each with executable code for ground-truth computation. Researchers can instantiate unlimited variants by sampling input parameters within defined ranges, supporting large-scale robustness and consistency analysis (Imani et al., 5 Dec 2025).
- Fairness and Regulatory Auditing: RoentGen-v2 and TrustFormers frameworks control dataset composition across demographic axes as well as data utility, privacy, and fidelity, enabling explicit exploration of fairness and trust trade-offs (Moroianu et al., 22 Aug 2025, Belgodere et al., 2023).
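The factorial sweeping and OOD holdout described for PUG-style benchmarks can be sketched as follows. The factor axes and the held-out (object, background) pair are invented for illustration, not PUG's actual asset vocabulary:

```python
from itertools import product

# Hypothetical factor axes in the spirit of factorially controlled benchmarks.
factors = {
    "object":     ["car", "chair", "dog"],
    "background": ["grass", "indoor"],
    "pose":       [0, 90, 180, 270],
}

# Hold out specific factor combinations for out-of-distribution evaluation:
# (object, background) pairs that never appear in the training sweep.
held_out = {("dog", "indoor")}

train_configs, ood_configs = [], []
for obj, bg, pose in product(*factors.values()):
    cfg = {"object": obj, "background": bg, "pose": pose}
    (ood_configs if (obj, bg) in held_out else train_configs).append(cfg)

# 3 * 2 * 4 = 24 total configurations; the 4 poses of (dog, indoor) are held out.
assert len(train_configs) + len(ood_configs) == 24
```

Holding out along a *combination* of factor axes (rather than a single factor value) is what distinguishes a compositional-generalization split from an ordinary unseen-class split.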
3. Implementation Mechanisms
Canonical implementation mechanisms include:
- Scripted 3D Engines: Unreal Engine (PUG, SpaRRTa, UAV-MM3D, CCUP) and Blender (controllable shadow generation) are scripted via external APIs (Python, UnrealCV, JSON-over-WebRTC) to programmatically spawn, position, and manipulate entities, lighting, and cameras. This yields images/video annotated with ground-truth pose, segmentation, relation, and scene metadata (Bordes et al., 2023, Kargin et al., 16 Jan 2026, Zou et al., 27 Nov 2025, Tasar et al., 2024, Zhao et al., 2024).
- Template-Driven Text and Code: Problem generators use JSON- or YAML-based templates with variable substitution (e.g., SymPyBench physical quantities, Dyna-bAbI entity/event/constructs), from which concrete problem instances are sampled and specialized code is generated or executed for gold solution checking (Imani et al., 5 Dec 2025, Tamari et al., 2021).
- Feature-Space Steering: Code synthesis benchmarks such as BenchPress define explicit feature representations for target programs (e.g., static code metrics) and employ active learning to steer sample generation into underrepresented or diagnostically critical regions (Tsimpourlas et al., 2022).
- Conditioned Generative Models: Diffusion or LLMs that accept explicit conditioning variables (e.g., RoentGen-v2’s demographic+finding prompt string, 3DPain’s action unit and identity parameters) enable the direct synthesis of data with prescribed attributes, accompanied by automated validation and rejection mechanisms (Moroianu et al., 22 Aug 2025, Lin et al., 20 Sep 2025).
4. Evaluation Protocols and Metrics
Evaluation is tightly linked to the synthetic benchmark’s parameterization and the automated ground-truth provided:
- Task-Specific Accuracy: Standard definitions (e.g., direction-class accuracy in SpaRRTa; MC/Symbolic/Numerical accuracy in SymPyBench; top-K retrieval in CCUP; AUPRC/AUROC for diagnostic classifiers) (Kargin et al., 16 Jan 2026, Imani et al., 5 Dec 2025, Zhao et al., 2024, Moroianu et al., 22 Aug 2025).
- Novel Consistency and Robustness Metrics: Consistency scores (fraction of variants per group with correct answers), confusion rates, and failure rates measure a model’s response stability to input variants in SymPyBench (Imani et al., 5 Dec 2025).
- Fairness and OOD Shift: Explicit gap metrics (AUROC parity gap, underdiagnosis fairness gap) quantify subgroup and distributional performance variability in medical and regulatory datasets (Moroianu et al., 22 Aug 2025, Belgodere et al., 2023).
- Photorealism/Realism Metrics: FID, MMD, classification performance gap versus real data, and equivariance analysis to probe representation invariance (PUG) (Bordes et al., 2023).
- Controllability/Parameter Reconstruction: In control-heavy generation, evaluation may encompass parameter reconstruction error or control-signal fidelity (e.g., predicted vs ground-truth light source angles in shadow generation) (Tasar et al., 2024).
Systematic sweeps or ablations (varying a single axis at a time) are commonly used to probe failure modes and diagnostic boundaries.
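The consistency and failure metrics of the kind SymPyBench reports can be sketched over grouped variant results; the group names and outcomes below are fabricated toy data that only serve to illustrate the metric definitions:

```python
from collections import defaultdict

# Toy per-variant results: (base-problem id, answered correctly?) for several
# sampled variants of each base problem.
results = [
    ("kinematics-1", True),  ("kinematics-1", True),  ("kinematics-1", True),
    ("optics-7",     True),  ("optics-7",     False), ("optics-7",     True),
    ("circuits-3",   False), ("circuits-3",   False), ("circuits-3",   False),
]

by_group = defaultdict(list)
for gid, ok in results:
    by_group[gid].append(ok)

# Consistency score: fraction of groups whose variants are ALL answered correctly.
consistency = sum(all(v) for v in by_group.values()) / len(by_group)
# Failure rate: fraction of groups with NO correctly answered variant.
failure = sum(not any(v) for v in by_group.values()) / len(by_group)
```

Per-group metrics like these separate stable competence (all variants correct) from brittle pattern matching (mixed outcomes within a group), which plain aggregate accuracy cannot distinguish.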
5. Scalability, Extensibility, and Automation
Synthetic benchmarks of this class are typically horizontally scalable and extensible:
- Unbounded Scale: Automated pipelines permit synthetic corpora of arbitrary size, constrained only by computational resources. Parallelized agents or rendering nodes generate large datasets for data-hungry models or label-intensive evaluation (Bordes et al., 2023, Zou et al., 27 Nov 2025, Zhao et al., 2024).
- Extensible Control: Adding new dimensions (e.g., object classes, environments, demographics) requires only extending asset pools, template records, or registry schemas. Python dictionaries or JSON parameter files allow researchers to specify arbitrary new splits or combinations (Bordes et al., 2023, Zhao et al., 2024, Lin et al., 20 Sep 2025).
- Automated Annotation: Labeling is performed algorithmically, drawing directly from simulation or template state, ensuring high fidelity and zero manual correction (e.g., pixel-perfect masks in SpaRRTa, JSON scene and camera records in PUG, deterministic identity labels in CCUP) (Kargin et al., 16 Jan 2026, Bordes et al., 2023, Zhao et al., 2024).
- Modular Experiments: Reproducible protocols and config files are distributed to ensure end-to-end reproducibility and facilitate independent extension (Ferdous et al., 2 Jun 2025, Tamari et al., 2021).
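Algorithmic annotation from simulator state can be illustrated with a toy egocentric left/right labeler. The geometry convention (camera facing the reference object, up along +z) and the function itself are assumptions for this sketch, not SpaRRTa's actual implementation:

```python
# Derive a left/right spatial-relation label purely from 3-D world state,
# so no manual annotation is needed. Points are (x, y, z) tuples.
def relation_label(target, reference, camera):
    """Egocentric side of `target` w.r.t. `reference`, seen from `camera`."""
    # View direction from camera to reference, projected onto the ground plane.
    vx, vy = reference[0] - camera[0], reference[1] - camera[1]
    # Offset of the target from the reference, in the same plane.
    tx, ty = target[0] - reference[0], target[1] - reference[1]
    # The sign of the 2-D cross product decides which side of the view axis
    # the target lies on (positive = viewer's left under this convention).
    cross = vx * ty - vy * tx
    return "left" if cross > 0 else "right"

camera = (0.0, 0.0, 1.6)        # eye height 1.6 m at the origin
reference = (0.0, 5.0, 0.0)     # camera looks along +y toward the reference
label = relation_label((-2.0, 5.0, 0.0), reference, camera)
```

Because the label is a deterministic function of simulator state, every regenerated frame carries a perfectly consistent annotation, which is the property the "zero manual correction" claim rests on.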
6. Challenges, Limitations, and Impact
Controllable synthetic benchmarks have transformed benchmarking and model analysis but present challenges:
- Residual Domain Gap: Even photorealistic or clinically plausible synthetic data cannot entirely close the sim-to-real gap; discrepancies in material reflectance, sensor artifacts, and model-induced bias remain (Bordes et al., 2023).
- Potential Overfitting to Synthetic Artefacts: When evaluation or model training relies heavily on synthetic data (e.g., high-complexity compositions, instruction-based editing), output realism and generalization may degrade, a phenomenon termed the “curse of synthetic data” (Yang et al., 17 Apr 2025).
- Increased Engineering Cost: High-fidelity simulation stacks (e.g., real-time ray tracing, distributed 3D rendering) require significant engineering effort, although practical toolkits and APIs are increasingly available (Bordes et al., 2023, Tasar et al., 2024).
- Evaluation Alignment: Synthetic benchmarks may not fully capture all real-world error modes; empirical cross-validation with real-world benchmarks is recommended to ensure external validity (e.g., accuracy concurrence with SQuAD in Dyna-bAbI) (Tamari et al., 2021).
Nevertheless, the impact is substantial: synthetic controllable benchmarks enable precise failure diagnosis, robust generalization studies, trusted fairness evaluation, and richer, better-correlated performance measurements on real-world downstream tasks (Kargin et al., 16 Jan 2026, Belgodere et al., 2023, Moroianu et al., 22 Aug 2025).
References
- SpaRRTa: A Synthetic Benchmark for Evaluating Spatial Intelligence in Visual Foundation Models (Kargin et al., 16 Jan 2026)
- UAV-MM3D: A Large-Scale Synthetic Benchmark for 3D Perception of Unmanned Aerial Vehicles with Multi-Modal Data (Zou et al., 27 Nov 2025)
- Pain in 3D: Generating Controllable Synthetic Faces for Automated Pain Assessment (Lin et al., 20 Sep 2025)
- Improving Performance, Robustness, and Fairness of Radiographic AI Models with Finely-Controllable Synthetic Data (Moroianu et al., 22 Aug 2025)
- CCUP: A Controllable Synthetic Data Generation Pipeline for Pretraining Cloth-Changing Person Re-Identification Models (Zhao et al., 2024)
- TimeGraph: Synthetic Benchmark Datasets for Robust Time-Series Causal Discovery (Ferdous et al., 2 Jun 2025)
- Dyna-bAbI: unlocking bAbI's potential with dynamic synthetic benchmarking (Tamari et al., 2021)
- Synthetic Graph Generation to Benchmark Graph Learning (Tsitsulin et al., 2022)
- PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning (Bordes et al., 2023)
- BenchPress: A Deep Active Benchmark Generator (Tsimpourlas et al., 2022)
- SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code (Imani et al., 5 Dec 2025)
- Controllable Shadow Generation with Single-Step Diffusion Models from Synthetic Data (Tasar et al., 2024)
- Auditing and Generating Synthetic Data with Controllable Trust Trade-offs (Belgodere et al., 2023)
- CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark (Yang et al., 17 Apr 2025)