
ShopBench: Unified AI Benchmark Platform

Updated 5 February 2026
  • ShopBench is a name shared by several prominent AI benchmarks, most notably a unified platform for job shop scheduling optimization and a set of benchmarks for multimodal and LLM-based shopping tasks.
  • The scheduling platform offers a modular API with environments and instance suites for both learning-based and traditional approaches.
  • The shopping-oriented benchmarks provide metrics and protocols for assessing LLM agents, constraint satisfaction, and perceptual tasks in retail.

ShopBench is a name used for multiple high-impact benchmarks in AI research; its most widely cited usages reflect distinct developments in two areas: large-scale e-commerce and customer-assistant evaluation, and manufacturing job shop scheduling for combinatorial optimization. This multiplicity of meanings requires contextual precision. The following article synthesizes the foundational details and technical contributions of both primary usages, strictly according to the referenced literature.

1. ShopBench in Job Shop Scheduling Optimization

ShopBench is a unified benchmarking platform and API for learning-based and non-learning-based job shop scheduling methods, developed to address the longstanding absence of standardization in the evaluation of combinatorial scheduling solvers. It is introduced as the central contribution of "Job Shop Scheduling Benchmark: Environments and Instances for Learning and Non-learning Methods" (Reijnen et al., 2023).

Design Objectives and Problem Coverage

ShopBench is designed to:

  • Unify problem definitions for classical and contemporary job shop settings, including:
    • JSP (Job Shop Scheduling Problem)
    • FSP (Flow Shop Scheduling Problem)
    • FJSP (Flexible Job Shop Scheduling)
    • FAJSP (FJSP with Assembly Constraints)
    • SDST (Sequence-Dependent Setup Times)
    • Online variants (job arrivals via stochastic processes)

Each variant enables modeling of job-operation sequences, machine constraints, sequence-dependent setups, and assembly or parallelization constraints, providing a canonical test bed for both mathematical optimization and RL-based dispatching.

Environment and API Details

The code-base consists of modular Python classes: Job, Operation, Machine, JobShop, and SimulationEnv. ShopBench exposes routines for:

  • Resetting an environment to an initial state
  • Stepping through dispatching actions and collecting resulting state transitions
  • Parsing benchmark instances (e.g., FT10, LA20) and generating synthetic instances
  • Defining composite scheduling problems via configuration files (YAML/JSON)

Typical usage permits algorithmic comparison under controlled arrival, setup, and resource conditions, supporting both static and online scheduling.
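To make the reset/step/dispatch loop described above concrete, here is a minimal toy sketch of a gym-style job shop environment. The class names mirror those in the text (Operation, SimulationEnv), but the method signatures, state representation, and the FIFO dispatch helper are illustrative assumptions, not the actual ShopBench API.

```python
from dataclasses import dataclass

@dataclass
class Operation:
    job_id: int
    machine: int
    duration: int

class SimulationEnv:
    """Toy gym-style job shop environment; the real ShopBench API may differ."""

    def __init__(self, jobs):
        self.jobs = jobs                       # jobs[j] = ordered list of Operations
        self.reset()

    def reset(self):
        self.next_op = [0] * len(self.jobs)    # index of next unscheduled op per job
        self.job_ready = [0] * len(self.jobs)  # time each job becomes free
        self.machine_ready = {}                # time each machine becomes free
        self.makespan = 0
        return tuple(self.next_op)

    def done(self):
        return all(n == len(ops) for n, ops in zip(self.next_op, self.jobs))

    def step(self, job_id):
        """Dispatch the next operation of job `job_id`; return (state, makespan, done)."""
        op = self.jobs[job_id][self.next_op[job_id]]
        start = max(self.job_ready[job_id], self.machine_ready.get(op.machine, 0))
        end = start + op.duration
        self.job_ready[job_id] = end
        self.machine_ready[op.machine] = end
        self.next_op[job_id] += 1
        self.makespan = max(self.makespan, end)
        return tuple(self.next_op), self.makespan, self.done()

def fifo_dispatch(env):
    """Repeatedly dispatch the earliest-available unfinished job (FIFO-style)."""
    env.reset()
    done = env.done()
    makespan = 0
    while not done:
        candidates = [j for j in range(len(env.jobs))
                      if env.next_op[j] < len(env.jobs[j])]
        job = min(candidates, key=lambda j: env.job_ready[j])
        _, makespan, done = env.step(job)
    return makespan
```

With a two-job, two-machine instance such as `jobs = [[Operation(0, 0, 3), Operation(0, 1, 2)], [Operation(1, 1, 2), Operation(1, 0, 4)]]`, `fifo_dispatch(SimulationEnv(jobs))` runs the instance to completion and returns the resulting makespan.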

Instance Suite and Library

ShopBench aggregates:

  • Classic static benchmarks (e.g., FT, LA, OR-Library sets), encompassing jobs/machines from small (n=6) to large (n=100)
  • Synthetic random instances for stress-testing generalization
  • Online variants with Poisson/exponential job arrivals and variable inter-arrival times
  • SDST instances with parameterized setup time variation

The instance library comprises more than 500 distinct files spanning the full variant range.

Supported Solution Methods

ShopBench includes native implementations of:

  • State-wise dispatching rules (FIFO, MOR, MWR, LWR)
  • Load-balancing heuristics (global and local selection)
  • Metaheuristics (Genetic Algorithms with crossover, mutation, solution encoding)
  • Deep RL approaches (heterogeneous GNN embedding, actor-critic policies for dispatching)

Performance is reported on mean makespan ($C_{max}$) and wall-clock time, supporting assessment of both solution quality and inference speed (e.g., DRL: $C_{max} \approx 820$, $t \approx 50$ ms on FT10).
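The priority logic behind the listed dispatching rules (FIFO: first in, first out; MOR: most operations remaining; MWR: most work remaining; LWR: least work remaining) can be sketched as simple key functions, assuming each queued job exposes its arrival time, remaining operation count, and remaining processing time (these field names are illustrative, not ShopBench's):

```python
# Each rule maps a candidate job to a priority key; the dispatcher picks
# the job with the smallest key. Field names are illustrative assumptions.

def fifo(job):
    return job["arrival"]                # first in, first out

def mor(job):
    return -job["ops_remaining"]         # most operations remaining

def mwr(job):
    return -job["work_remaining"]        # most work remaining

def lwr(job):
    return job["work_remaining"]         # least work remaining

def dispatch(queue, rule):
    """Select the highest-priority job (smallest key) from the queue."""
    return min(queue, key=rule)
```

Because each rule is just a key function, swapping rules in an experiment reduces to passing a different callable to `dispatch`.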

Evaluation Protocols

ShopBench employs:

  • 80/20 train/test splits for synthetic benchmarks
  • Standard protocol for classical (test-only) sets
  • Per-instance repeated runs (e.g., 100 samples) for mean, median, P90, and worst-case analysis
  • Paired t-tests and Wilcoxon signed-rank tests for policy comparison at $\alpha = 0.05$
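The per-instance repeated-run statistics named above (mean, median, P90, worst case) can be computed as follows; the nearest-rank percentile method used here is an assumption, as the paper's exact percentile convention is not specified:

```python
import statistics

def run_stats(makespans):
    """Summarize repeated runs of one instance: mean, median, P90, worst case."""
    xs = sorted(makespans)
    # Nearest-rank P90 over the sorted sample (an assumed convention).
    p90_idx = min(len(xs) - 1, round(0.9 * (len(xs) - 1)))
    return {
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "p90": xs[p90_idx],
        "worst": xs[-1],
    }
```

For example, 100 repeated samples of a stochastic policy on one instance would be passed in as the `makespans` list, and the resulting summary compared across policies with the paired significance tests listed above.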

Extensibility

  • New problem variants: via subclassing and parser extension
  • New solvers: pluggable via "Policy" interface (operation/machine selection routines)
  • Simulation environment: supports custom reward and termination methods
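A pluggable solver interface of the kind described above can be sketched as an abstract base class; the method name `select_operation` and its signature are assumptions for illustration, not ShopBench's actual "Policy" interface:

```python
from abc import ABC, abstractmethod

class Policy(ABC):
    """Hypothetical pluggable solver interface (names are illustrative)."""

    @abstractmethod
    def select_operation(self, state, candidates):
        """Return the next candidate operation to dispatch."""

class ShortestFirst(Policy):
    """Example plug-in: always dispatch the shortest candidate operation."""

    def select_operation(self, state, candidates):
        return min(candidates, key=lambda op: op["duration"])
```

A new solver then only needs to subclass the interface; the simulation environment drives it without knowing whether the policy is a hand-written rule, a metaheuristic, or a learned model.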

This architecture enables ShopBench to function as a collaborative foundation for benchmarking and comparison of emerging scheduling algorithms in both research and practice (Reijnen et al., 2023).

2. ShopBench as a Benchmark for E-commerce Shopping Agents

In multimodal e-commerce and domain-specific language agent evaluation, "ShopBench" or similarly named resources are referenced as part of two distinct but partially overlapping developments:

2.1 ShoppingBench (ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents)

"ShoppingBench" is introduced for assessing LLM-based agents on real-world, intent-grounded multi-turn shopping tasks (Wang et al., 6 Aug 2025).

Key characteristics:

  • Task Hierarchy: Simulates four intent levels:

    1. Product Finder (attribute-based single-product retrieval)
    2. Knowledge (purchase contingent on domain reasoning)
    3. Multi-product seller (find product sets sold by a single store)
    4. Coupon (budget and voucher-constrained optimization)
  • Data and Environment:

    • Encompasses 2.7 million real products (from Lazada.com)
    • Six-tool shopping API (find, view, calculate_discount, recommend, web_search, terminate)
    • Automatic metrics: relevance ($r_{pro}$), constraint satisfaction ($r_{kw}$, $r_{shop}$, $r_{budget}$), cumulative average relevance (CAR), absolute success rate (ASR)
    • Synthetic user instruction generation controlled for intent and constraints
  • Agent Evaluation:
    • 17 baselines including GPT-4.1, Claude-4, Gemini-2.5-Flash, Qwen2.5, DeepSeek, Qwen3/Gemma variants
    • Best results: GPT-4.1 achieves ASR = 48.2%; SFT+RL distilled Qwen3-4B closes performance gap (ASR = 48.7%)
    • Failure attribution: attribute mismatch (~40%), constraint violations (25%), missing products (15%), metric errors (10%), knowledge errors (10%)
  • Research Contribution: Demonstrates sub-50% absolute success by SotA LLMs on realistic, end-to-end shopping tasks, identifies the necessity for improved attribute matching and constraint reasoning, and provides a supervised+reinforcement distillation path to scale smaller models' performance (Wang et al., 6 Aug 2025).
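The aggregate success metrics named above can be sketched as follows. These formulas are assumptions based on the metric names: ASR as the fraction of tasks where every constraint is satisfied, and CAR as a running mean of per-step relevance scores; the paper's exact definitions may differ.

```python
def absolute_success_rate(task_results):
    """task_results: list of booleans, True if a task fully satisfied its constraints."""
    return sum(task_results) / len(task_results)

def cumulative_average_relevance(relevances):
    """Running mean of per-step relevance scores (assumed CAR definition)."""
    out, total = [], 0.0
    for i, r in enumerate(relevances, 1):
        total += r
        out.append(total / i)
    return out
```

Under these definitions, an agent that fully solves 482 of 1,000 tasks would score ASR = 48.2%, matching the scale of the reported results.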

2.2 ShopBench in Multimodal Retail Store Perception (Ostrakon-VL)

Another "ShopBench" designates a unified benchmark for perception and reasoning in food-service and retail stores, designed for the evaluation of Multimodal LLMs (Shen et al., 29 Jan 2026).

Essential components:

  • Measures MLLM robustness on static (storefront, shop interior, kitchen), multi-image, and video inputs
  • Hierarchical 79-leaf taxonomy spanning perception and reasoning
  • 5,818 questions: majority perception, minority reasoning
  • Modalities: Open-ended, schema-constrained, MCQ
  • Metrics: accuracy, mAP, F1, plus domain-specific (Multimodal Gain, Multimodal Leakage, Visual Necessity Rate, Vision-Induced Failure)
  • Ostrakon-VL (Qwen3-VL-8B) achieves a mean score of 60.1, outperforming larger models despite its smaller parameter count

Significance: ShopBench here enables fine-grained, auditable benchmarking for FSRS scenarios under real-world visual noise and heterogeneous data (Shen et al., 29 Jan 2026).
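The paired-evaluation idea behind the domain-specific metrics above can be sketched as follows: score each question with and without the visual input and compare. These formulas are illustrative assumptions, not the paper's exact definitions of Multimodal Gain and Multimodal Leakage.

```python
def multimodal_gain(with_image_acc, text_only_acc):
    """Assumed MG: how much the visual input improves accuracy."""
    return with_image_acc - text_only_acc

def multimodal_leakage(text_only_acc, chance_acc):
    """Assumed ML: how much of the task is solvable from text alone, above chance."""
    return max(0.0, text_only_acc - chance_acc)
```

Under this reading, a high leakage score flags questions whose answers leak through the text prompt, while a low gain score flags models that ignore the image.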

2.3 ShopBench as Shopping MMLU / KDD Cup Benchmark

In the KDD Cup 2024 context, "ShopBench" is an alternate designation for the multi-task Shopping MMLU benchmark (Jin et al., 2024), widely adopted for LLM-based shop assistant assessment.

  • 57 tasks, ≈20,800 questions from real Amazon ecosystem data (product catalogs, queries, sessions, reviews)
  • Four "skills": Concept Understanding, Knowledge Reasoning, User Behavior Alignment, Multi-linguality
  • All tasks as text-to-text generation; evaluation by standard classification, NER (F1), ranking (NDCG), extraction (ROUGE-L), translation (BLEU)
  • Top LLMs (e.g., Claude 3 Sonnet, LLaMA3-70B, Qwen1.5-72B) reach ≈80% macro accuracy in "Concept," ≈70% in reasoning and user behavior
  • Used as the 2024 KDD Cup online shopping leaderboard; forms the basis for the LLaSA and EshopInstruct research pipelines (Jin et al., 2024, Zhang et al., 2024).
Model         Concept   Reasoning   Behavior   Multi-lingual
Claude 3        80.8      71.6        70.2        67.8
ChatGPT         75.6      65.0        59.8        60.8
LLaMA3-70B      75.2      69.3        67.7        62.0
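One of the ranking metrics listed above, NDCG, can be sketched directly from its standard definition (DCG with a log2 rank discount, normalized by the ideal ordering); the exact cutoff and gain convention used in the benchmark is not specified here, so this follows the common textbook form:

```python
import math

def dcg(rels):
    """Discounted cumulative gain for graded relevances in ranked order."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels):
    """DCG normalized by the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0
```

A perfectly ordered ranking scores 1.0, and pushing a relevant item down the list is penalized logarithmically with rank.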

3. Relationship to Other Benchmarks

ShopBench and its cognates (ShoppingBench, Shopping MMLU) are used in several adjacent but distinct settings:

  • WebMall: Focuses on comparison-shopping across multi-shop simulated web agents, with longer interaction trajectories and advanced product-requirement reasoning. Distinct from ShopBench as its core evaluation is on navigation and reasoning in heterogeneous e-commerce web environments (Peeters et al., 18 Aug 2025).
  • SariBench/Sari Sandbox: Sometimes synonymized with ShopBench in the context of 3D embodied retail environments for testing embodied agents and human shopping performance (Gajo et al., 1 Aug 2025). This usage is separate from the job shop scheduling and LLM-oriented intent benchmarks.

4. Metrics and Evaluation Protocols

All variants of ShopBench/ShoppingBench employ standardized metrics tailored to their domain context:

  • Scheduling: Mean/median/worst-case makespan ($C_{max}$), per-instance runtime, statistical significance for algorithmic comparisons (Reijnen et al., 2023).
  • LLM-based Shopping: Product relevance, constraint satisfaction, cumulative average relevance (CAR), absolute success rate (ASR) (Wang et al., 6 Aug 2025).
  • Multimodal Perception: Accuracy, mean Average Precision, F1, plus Multimodal Gain (MG), Multimodal Leakage (ML), Visual Necessity Rate (VNR), Vision-Induced Failure (VIF) (Shen et al., 29 Jan 2026).
  • Shopping MMLU: Macro-averaged task scores, skill-level accuracy, F1, NDCG, BLEU/ROUGE-L for translation and generation (Jin et al., 2024).

ShopBench implementations provide unified APIs, public evaluation splits, and leaderboard protocols, with closed-loop auditing and test-only releases for leaderboard fairness wherever feasible.

5. Challenges, Insights, and Future Directions

Across domains bearing the ShopBench name, shared research challenges include:

  • Attribute and constraint matching: LLM agents struggle with partial feature misses and combinatorial reasoning, especially with coupons, budgets, or cross-product dependencies (Wang et al., 6 Aug 2025).
  • Robust vision-language fusion: MLLMs evaluated on ShopBench must handle real-world noise (glare, blur, occlusion), multilingual signage, and diverse input types (Shen et al., 29 Jan 2026).
  • Data domain shift and continual learning: Static catalog and synthetic data limit generalization; integrating live inventory databases and human-in-the-loop calibration is open territory (Zhang et al., 2024, Jin et al., 2024).
  • Scalability and extensibility: For scheduling, generalization to composite variants (e.g., online+SDST+assembly), and for e-commerce, extending beyond text-only inputs (multimodal and embodied settings).

Benchmark protocols emphasize reproducibility, modular evaluation, and extensibility to novel solution paradigms and real-world deployment constraints.

6. Summary Table: ShopBench Designations

Context/Domain            Benchmark Name              Primary Focus                                       Reference
Scheduling/Optimization   ShopBench                   JSP/FJSP/FAJSP/SDST + dynamic; RL, metaheuristics   (Reijnen et al., 2023)
LLM E-commerce Agents     ShoppingBench               Intent-grounded multi-turn shopping tasks           (Wang et al., 6 Aug 2025)
Multimodal Perception     ShopBench                   FSRS (food-service/retail) images and video         (Shen et al., 29 Jan 2026)
Multi-task Shopping QA    ShopBench / Shopping MMLU   57-task, 4-skill, text-to-text QA                   (Jin et al., 2024; Zhang et al., 2024)
Embodied Retail Sim       ShopBench / SariBench       VR/3D agent and human retail store simulation       (Gajo et al., 1 Aug 2025)

Editor's term: "ShopBench family"—The set comprising all benchmarks bearing the ShopBench/ShoppingBench name, spanning optimization, language agent, and multimodal perception domains.

7. Conclusion

The name ShopBench thus denotes both a pivotal platform for job shop scheduling research and a family of benchmarks for language- and vision-based shopping agent evaluation.
