
ShopBench: Unified AI Benchmark Platform

Updated 5 February 2026
  • ShopBench is a name shared by several prominent AI benchmarks, most notably a unified platform for job shop scheduling optimization and a set of benchmarks for multimodal and LLM-based shopping tasks.
  • The scheduling platform offers a modular API with environments and instance suites for both learning-based and traditional approaches.
  • The shopping-oriented benchmarks provide metrics and protocols for assessing LLM agents, constraint satisfaction, and perceptual tasks in retail.

ShopBench is a name used for multiple high-impact benchmarks in AI research; its most widely cited usages reflect distinct developments in two areas: large-scale e-commerce and customer-assistant evaluation, and manufacturing job shop scheduling for combinatorial optimization. This multiplicity of meanings requires contextual precision. The following article synthesizes the foundational details and technical contributions of both primary usages, strictly according to the referenced literature.

1. ShopBench in Job Shop Scheduling Optimization

ShopBench is a unified benchmarking platform and API for learning-based and non-learning-based job shop scheduling methods, developed to address the longstanding absence of standardization in the evaluation of combinatorial scheduling solvers. It is introduced as the central contribution of "Job Shop Scheduling Benchmark: Environments and Instances for Learning and Non-learning Methods" (Reijnen et al., 2023).

Design Objectives and Problem Coverage

ShopBench is designed to:

  • Unify problem definitions for classical and contemporary job shop settings, including:
    • JSP (Job Shop Scheduling Problem)
    • FSP (Flow Shop Scheduling Problem)
    • FJSP (Flexible Job Shop Scheduling)
    • FAJSP (FJSP with Assembly Constraints)
    • SDST (Sequence-Dependent Setup Times)
    • Online variants (job arrivals via stochastic processes)

Each variant enables modeling of job-operation sequences, machine constraints, sequence-dependent setups, and assembly or parallelization constraints, providing a canonical test bed for both mathematical optimization and RL-based dispatching.

Environment and API Details

The code-base consists of modular Python classes: Job, Operation, Machine, JobShop, and SimulationEnv. ShopBench exposes routines for:

  • Resetting an environment to an initial state
  • Stepping through dispatching actions and collecting resulting state transitions
  • Parsing benchmark instances (e.g., FT10, LA20) and generating synthetic instances
  • Defining composite scheduling problems via configuration files (YAML/JSON)

Typical usage permits algorithmic comparison under controlled arrival, setup, and resource conditions, supporting both static and online scheduling.
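To make the reset/step/dispatch loop described above concrete, here is a minimal toy sketch of a gym-style job shop environment. The class names mirror those in the text (Operation, SimulationEnv), but the method signatures, state representation, and the FIFO dispatch helper are illustrative assumptions, not the actual ShopBench API.

```python
from dataclasses import dataclass

@dataclass
class Operation:
    job_id: int
    machine: int
    duration: int

class SimulationEnv:
    """Toy gym-style job shop environment; the real ShopBench API may differ."""

    def __init__(self, jobs):
        self.jobs = jobs                       # jobs[j] = ordered list of Operations
        self.reset()

    def reset(self):
        self.next_op = [0] * len(self.jobs)    # index of next unscheduled op per job
        self.job_ready = [0] * len(self.jobs)  # time each job becomes free
        self.machine_ready = {}                # time each machine becomes free
        self.makespan = 0
        return tuple(self.next_op)

    def done(self):
        return all(n == len(ops) for n, ops in zip(self.next_op, self.jobs))

    def step(self, job_id):
        """Dispatch the next operation of job `job_id`; return (state, makespan, done)."""
        op = self.jobs[job_id][self.next_op[job_id]]
        start = max(self.job_ready[job_id], self.machine_ready.get(op.machine, 0))
        end = start + op.duration
        self.job_ready[job_id] = end
        self.machine_ready[op.machine] = end
        self.next_op[job_id] += 1
        self.makespan = max(self.makespan, end)
        return tuple(self.next_op), self.makespan, self.done()

def fifo_dispatch(env):
    """Repeatedly dispatch the earliest-available unfinished job (FIFO-style)."""
    env.reset()
    done = env.done()
    makespan = 0
    while not done:
        candidates = [j for j in range(len(env.jobs))
                      if env.next_op[j] < len(env.jobs[j])]
        job = min(candidates, key=lambda j: env.job_ready[j])
        _, makespan, done = env.step(job)
    return makespan
```

With a two-job, two-machine instance such as `jobs = [[Operation(0, 0, 3), Operation(0, 1, 2)], [Operation(1, 1, 2), Operation(1, 0, 4)]]`, `fifo_dispatch(SimulationEnv(jobs))` runs the instance to completion and returns the resulting makespan.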

Instance Suite and Library

ShopBench aggregates:

  • Classic static benchmarks (e.g., FT, LA, OR-Library sets), encompassing jobs/machines from small (n=6) to large (n=100)
  • Synthetic random instances for stress-testing generalization
  • Online variants with Poisson/exponential job arrivals and variable inter-arrival times
  • SDST instances with parameterized setup time variation

The instance library comprises more than 500 distinct files spanning the full variant range.

Supported Solution Methods

ShopBench includes native implementations of:

  • State-wise dispatching rules (FIFO, MOR, MWR, LWR)
  • Load-balancing heuristics (global and local selection)
  • Metaheuristics (Genetic Algorithms with crossover, mutation, solution encoding)
  • Deep RL approaches (heterogeneous GNN embedding, actor-critic policies for dispatching)

Performance is reported on mean makespan ($C_{max}$) and wall-clock time, supporting assessment of both solution quality and inference speed (e.g., DRL: $C_{max} \approx 820$, $t \approx 50$ ms on FT10).
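The priority logic behind the listed dispatching rules (FIFO: first in, first out; MOR: most operations remaining; MWR: most work remaining; LWR: least work remaining) can be sketched as simple key functions, assuming each queued job exposes its arrival time, remaining operation count, and remaining processing time (these field names are illustrative, not ShopBench's):

```python
# Each rule maps a candidate job to a priority key; the dispatcher picks
# the job with the smallest key. Field names are illustrative assumptions.

def fifo(job):
    return job["arrival"]                # first in, first out

def mor(job):
    return -job["ops_remaining"]         # most operations remaining

def mwr(job):
    return -job["work_remaining"]        # most work remaining

def lwr(job):
    return job["work_remaining"]         # least work remaining

def dispatch(queue, rule):
    """Select the highest-priority job (smallest key) from the queue."""
    return min(queue, key=rule)
```

Because each rule is just a key function, swapping rules in an experiment reduces to passing a different callable to `dispatch`.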

Evaluation Protocols

ShopBench employs:

  • 80/20 train/test splits for synthetic benchmarks
  • Standard protocol for classical (test-only) sets
  • Per-instance repeated runs (e.g., 100 samples) for mean, median, P90, and worst-case analysis
  • Paired t-tests and Wilcoxon signed-rank tests for policy comparison at $\alpha = 0.05$
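The per-instance repeated-run statistics named above (mean, median, P90, worst case) can be computed as follows; the nearest-rank percentile method used here is an assumption, as the paper's exact percentile convention is not specified:

```python
import statistics

def run_stats(makespans):
    """Summarize repeated runs of one instance: mean, median, P90, worst case."""
    xs = sorted(makespans)
    # Nearest-rank P90 over the sorted sample (an assumed convention).
    p90_idx = min(len(xs) - 1, round(0.9 * (len(xs) - 1)))
    return {
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "p90": xs[p90_idx],
        "worst": xs[-1],
    }
```

For example, 100 repeated samples of a stochastic policy on one instance would be passed in as the `makespans` list, and the resulting summary compared across policies with the paired significance tests listed above.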

Extensibility

  • New problem variants: via subclassing and parser extension
  • New solvers: pluggable via "Policy" interface (operation/machine selection routines)
  • Simulation environment: supports custom reward and termination methods
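A pluggable solver interface of the kind described above can be sketched as an abstract base class; the method name `select_operation` and its signature are assumptions for illustration, not ShopBench's actual "Policy" interface:

```python
from abc import ABC, abstractmethod

class Policy(ABC):
    """Hypothetical pluggable solver interface (names are illustrative)."""

    @abstractmethod
    def select_operation(self, state, candidates):
        """Return the next candidate operation to dispatch."""

class ShortestFirst(Policy):
    """Example plug-in: always dispatch the shortest candidate operation."""

    def select_operation(self, state, candidates):
        return min(candidates, key=lambda op: op["duration"])
```

A new solver then only needs to subclass the interface; the simulation environment drives it without knowing whether the policy is a hand-written rule, a metaheuristic, or a learned model.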

This architecture enables ShopBench to function as a collaborative foundation for benchmarking and comparison of emerging scheduling algorithms in both research and practice (Reijnen et al., 2023).

2. ShopBench as a Benchmark for E-commerce Shopping Agents

In multimodal e-commerce and domain-specific language agent evaluation, "ShopBench" or similarly named resources are referenced as part of two distinct but partially overlapping developments:

2.1 ShoppingBench (ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents)

"ShoppingBench" is introduced for assessing LLM-based agents on real-world, intent-grounded multi-turn shopping tasks (Wang et al., 6 Aug 2025).

Key characteristics:

  • Task Hierarchy: Simulates four intent levels:

    1. Product Finder (attribute-based single-product retrieval)
    2. Knowledge (purchase contingent on domain reasoning)
    3. Multi-product seller (find product sets sold by a single store)
    4. Coupon (budget and voucher-constrained optimization)
  • Data and Environment:

    • Encompasses 2.7 million real products (from Lazada.com)
    • Six-tool shopping API (find, view, calculate_discount, recommend, web_search, terminate)
    • Automatic metrics: relevance ($r_{pro}$), constraint satisfaction ($r_{kw}$, $r_{shop}$, $r_{budget}$), cumulative average relevance (CAR), absolute success rate (ASR)
    • Synthetic user instruction generation controlled for intent and constraints
  • Agent Evaluation:
    • 17 baselines including GPT-4.1, Claude-4, Gemini-2.5-Flash, Qwen2.5, DeepSeek, Qwen3/Gemma variants
    • Best results: GPT-4.1 achieves ASR = 48.2%; SFT+RL distilled Qwen3-4B closes performance gap (ASR = 48.7%)
    • Failure attribution: attribute mismatch (~40%), constraint violations (25%), missing products (15%), metric errors (10%), knowledge errors (10%)
  • Research Contribution: Demonstrates sub-50% absolute success by SotA LLMs on realistic, end-to-end shopping tasks, identifies the necessity for improved attribute matching and constraint reasoning, and provides a supervised+reinforcement distillation path to scale smaller models' performance (Wang et al., 6 Aug 2025).
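The aggregate success metrics named above can be sketched as follows. These formulas are assumptions based on the metric names: ASR as the fraction of tasks where every constraint is satisfied, and CAR as a running mean of per-step relevance scores; the paper's exact definitions may differ.

```python
def absolute_success_rate(task_results):
    """task_results: list of booleans, True if a task fully satisfied its constraints."""
    return sum(task_results) / len(task_results)

def cumulative_average_relevance(relevances):
    """Running mean of per-step relevance scores (assumed CAR definition)."""
    out, total = [], 0.0
    for i, r in enumerate(relevances, 1):
        total += r
        out.append(total / i)
    return out
```

Under these definitions, an agent that fully solves 482 of 1,000 tasks would score ASR = 48.2%, matching the scale of the reported results.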

2.2 ShopBench in Multimodal Retail Store Perception (Ostrakon-VL)

Another "ShopBench" designates a unified benchmark for perception and reasoning in food-service and retail stores, designed for the evaluation of Multimodal LLMs (Shen et al., 29 Jan 2026).

Essential components:

  • Measures MLLM robustness on static (storefront, shop interior, kitchen), multi-image, and video inputs
  • Hierarchical 79-leaf taxonomy spanning perception and reasoning
  • 5,818 questions: majority perception, minority reasoning
  • Modalities: Open-ended, schema-constrained, MCQ
  • Metrics: accuracy, mAP, F1, plus domain-specific (Multimodal Gain, Multimodal Leakage, Visual Necessity Rate, Vision-Induced Failure)
  • Ostrakon-VL (Qwen3-VL-8B) achieves a mean score of 60.1, outperforming larger models despite its smaller parameter count

Significance: ShopBench here enables fine-grained, auditable benchmarking for FSRS scenarios under real-world visual noise and heterogeneous data (Shen et al., 29 Jan 2026).
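The paired-evaluation idea behind the domain-specific metrics above can be sketched as follows: score each question with and without the visual input and compare. These formulas are illustrative assumptions, not the paper's exact definitions of Multimodal Gain and Multimodal Leakage.

```python
def multimodal_gain(with_image_acc, text_only_acc):
    """Assumed MG: how much the visual input improves accuracy."""
    return with_image_acc - text_only_acc

def multimodal_leakage(text_only_acc, chance_acc):
    """Assumed ML: how much of the task is solvable from text alone, above chance."""
    return max(0.0, text_only_acc - chance_acc)
```

Under this reading, a high leakage score flags questions whose answers leak through the text prompt, while a low gain score flags models that ignore the image.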

2.3 ShopBench as Shopping MMLU / KDD Cup Benchmark

In the KDD Cup 2024 context, "ShopBench" is an alternate designation for the multi-task Shopping MMLU benchmark (Jin et al., 2024), widely adopted for LLM-based shop assistant assessment.

  • 57 tasks, ≈20,800 questions from real Amazon ecosystem data (product catalogs, queries, sessions, reviews)
  • Four "skills": Concept Understanding, Knowledge Reasoning, User Behavior Alignment, Multi-linguality
  • All tasks as text-to-text generation; evaluation by standard classification, NER (F1), ranking (NDCG), extraction (ROUGE-L), translation (BLEU)
  • Top LLMs (e.g., Claude 3 Sonnet, LLaMA3-70B, Qwen1.5-72B) reach ≈80% macro accuracy in "Concept," ≈70% in reasoning and user behavior
  • Used as the 2024 KDD Cup online shopping leaderboard; forms the basis for the LLaSA and EshopInstruct research pipelines (Jin et al., 2024, Zhang et al., 2024).
Model         Concept   Reasoning   Behavior   Multi-lingual
Claude 3        80.8      71.6        70.2        67.8
ChatGPT         75.6      65.0        59.8        60.8
LLaMA3-70B      75.2      69.3        67.7        62.0
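One of the ranking metrics listed above, NDCG, can be sketched directly from its standard definition (DCG with a log2 rank discount, normalized by the ideal ordering); the exact cutoff and gain convention used in the benchmark is not specified here, so this follows the common textbook form:

```python
import math

def dcg(rels):
    """Discounted cumulative gain for graded relevances in ranked order."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels):
    """DCG normalized by the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0
```

A perfectly ordered ranking scores 1.0, and pushing a relevant item down the list is penalized logarithmically with rank.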

3. Relationship to Other Benchmarks

ShopBench and its cognates (ShoppingBench, Shopping MMLU) are used in several adjacent but distinct settings:

  • WebMall: Focuses on comparison-shopping across multi-shop simulated web agents, with longer interaction trajectories and advanced product-requirement reasoning. Distinct from ShopBench as its core evaluation is on navigation and reasoning in heterogeneous e-commerce web environments (Peeters et al., 18 Aug 2025).
  • SariBench/Sari Sandbox: Sometimes synonymized with ShopBench in the context of 3D embodied retail environments for testing embodied agents and human shopping performance (Gajo et al., 1 Aug 2025). This usage is separate from the job shop scheduling and LLM-oriented intent benchmarks.

4. Metrics and Evaluation Protocols

All variants of ShopBench/ShoppingBench employ standardized metrics tailored to their domain context:

  • Scheduling: Mean/median/worst-case makespan ($C_{max}$), per-instance runtime, statistical significance for algorithmic comparisons (Reijnen et al., 2023).
  • LLM-based Shopping: Product relevance, constraint satisfaction, cumulative average relevance (CAR), absolute success rate (ASR) (Wang et al., 6 Aug 2025).
  • Multimodal Perception: Accuracy, mean Average Precision, F1, plus Multimodal Gain (MG), Multimodal Leakage (ML), Visual Necessity Rate (VNR), Vision-Induced Failure (VIF) (Shen et al., 29 Jan 2026).
  • Shopping MMLU: Macro-averaged task scores, skill-level accuracy, F1, NDCG, BLEU/ROUGE-L for translation and generation (Jin et al., 2024).

ShopBench implementations provide unified APIs, public evaluation splits, and leaderboard protocols, with closed-loop auditing and test-only releases for leaderboard fairness wherever feasible.

5. Challenges, Insights, and Future Directions

Across domains bearing the ShopBench name, shared research challenges include:

  • Attribute and constraint matching: LLM agents struggle with partial feature misses and combinatorial reasoning, especially with coupons, budgets, or cross-product dependencies (Wang et al., 6 Aug 2025).
  • Robust vision-language fusion: MLLMs evaluated on ShopBench must handle real-world noise (glare, blur, occlusion), multilingual signage, and diverse input types (Shen et al., 29 Jan 2026).
  • Data domain shift and continual learning: Static catalog and synthetic data limit generalization; integrating live inventory databases and human-in-the-loop calibration is open territory (Zhang et al., 2024, Jin et al., 2024).
  • Scalability and extensibility: For scheduling, generalization to composite variants (e.g., online+SDST+assembly), and for e-commerce, extending beyond text-only inputs (multimodal and embodied settings).

Benchmark protocols emphasize reproducibility, modular evaluation, and extensibility to novel solution paradigms and real-world deployment constraints.

6. Summary Table: ShopBench Designations

Context/Domain            Benchmark Name              Primary Focus                                       Reference
Scheduling/Optimization   ShopBench                   JSP/FJSP/FAJSP/SDST + dynamic; RL, metaheuristics   (Reijnen et al., 2023)
LLM E-commerce Agents     ShoppingBench               Intent-grounded multi-turn shopping tasks           (Wang et al., 6 Aug 2025)
Multimodal Perception     ShopBench                   FSRS (food-service/retail) images and video         (Shen et al., 29 Jan 2026)
Multi-task Shopping QA    ShopBench / Shopping MMLU   57-task, 4-skill, text-to-text QA                   (Jin et al., 2024; Zhang et al., 2024)
Embodied Retail Sim       ShopBench / SariBench       VR/3D agent and human retail store simulation       (Gajo et al., 1 Aug 2025)

Editor's term: "ShopBench family"—The set comprising all benchmarks bearing the ShopBench/ShoppingBench name, spanning optimization, language agent, and multimodal perception domains.

7. Conclusion

The name ShopBench thus denotes both a pivotal platform for job shop scheduling research and a family of benchmarks for language- and vision-based shopping agent evaluation.
