Chart Spatial Understanding Benchmark (CS-Bench)
- CS-Bench is a diagnostic suite that evaluates multimodal large language models by requiring spatial localization of chart elements through axis-aligned bounding boxes.
- It employs two complementary task families, grounding and QA-grounding, to assess spatial reasoning and data-driven question answering on programmatically generated, real-style scientific charts.
- The benchmark leverages a rigorously constructed dataset of real-style scientific figures, highlighting the role of spatial grounding supervision in enhancing model performance.
The Chart Spatial Understanding Benchmark (CS-Bench) is a diagnostic evaluation suite introduced to measure the capability of multimodal LLMs (MLLMs) to localize and reason about chart elements in scientific figures. CS-Bench emphasizes direct spatial understanding, requiring models to output axis-aligned bounding boxes for elements referenced in either simple grounding tasks or more complex question-answering (QA) scenarios. Developed in conjunction with the START framework, it uses a rigorously constructed dataset of real-style scientific charts generated through a data pipeline that leverages image-to-code translation and LLM-driven code evolution, facilitating comprehensive evaluation across varied chart types and spatial arrangements (Liu et al., 8 Dec 2025).
1. Task Structure and Formal Definition
CS-Bench tasks require an MLLM to process a chart image and a prompt that requests the localization of specific chart elements. Input images range from single-subplot to multi-subplot scientific figures; prompts reference chart components or pose data-driven questions. The output comprises axis-aligned bounding boxes, specified as [x1, y1, x2, y2], with pixel coordinates relative to the source image.
Two complementary task families are defined:
- Grounding Questions: The prompt identifies a single chart component (e.g., legend, title, axis label), and the model must return its bounding box, requiring spatial reasoning to map the textual reference to a contiguous visual region.
- QA-Grounding Questions: The prompt embeds a data-centric query (e.g., counting curves above a threshold in a subplot) and requires both the final answer (numeric or categorical) and the bounding box of the referenced element, combining textual comprehension with localization.
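As a concrete sketch, a model's bounding-box response can be parsed and validated as below. The response schema (a JSON list of objects with a `"bbox_2d"` field) is an assumption modeled on the examples later in this article, not the benchmark's official harness:

```python
import json


def parse_boxes(raw: str):
    """Parse a model's JSON response into axis-aligned boxes (x1, y1, x2, y2).

    Assumes a list of objects each carrying a "bbox_2d" field, as in the
    CS-Bench examples; the exact schema is an assumption here.
    """
    boxes = []
    for item in json.loads(raw):
        x1, y1, x2, y2 = item["bbox_2d"]
        if not (x1 <= x2 and y1 <= y2):
            raise ValueError("box must satisfy x1 <= x2 and y1 <= y2")
        boxes.append((x1, y1, x2, y2))
    return boxes
```

A grounding response such as `[{"bbox_2d": [445, 29, 586, 96]}]` then yields a single validated box in source-image pixel coordinates.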
2. Dataset Composition and Generation
CS-Bench utilizes 613 chart images generated from Python code derived from real arXiv figures, selected and evolved to record pixel-precise element locations. The benchmark comprises 350 grounding and 342 QA-grounding questions, totaling 692 evaluated tasks. Chart types are widely represented: line charts (~55%), scatter plots (~20%), bar charts (~7%), heatmaps (~7%), with others (~11%) including area, pie, histogram, and multi-axis formats. The majority of images contain multiple subplots (2–4 subplots: ~61.3%), enabling evaluation of inter-subplot spatial reasoning. Question and bounding box instances are manually verified for accuracy, ensuring high semantic fidelity.
3. Evaluation Metrics and Protocols
Zero-shot evaluation (i.e., no fine-tuning on CS-Bench charts or prompts) is conducted under two principal metrics:
- Recall@IoU (Bounding Box Localization): For each query, let B_p and B_g denote the predicted and ground-truth boxes. Intersection-over-union is IoU(B_p, B_g) = area(B_p ∩ B_g) / area(B_p ∪ B_g). [email protected] is the fraction of queries with IoU(B_p, B_g) ≥ 0.5, capturing the proportion of queries correctly localized within a modest overlap threshold.
- QA-Grounding Accuracy: For 342 questions, answer strings must exactly match the reference; accuracy is computed as the fraction of matches.
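The two metrics above can be sketched in a few lines, assuming boxes are (x1, y1, x2, y2) pixel tuples:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds, gts, thr=0.5):
    """Fraction of queries whose predicted box reaches IoU >= thr."""
    return sum(iou(p, g) >= thr for p, g in zip(preds, gts)) / len(gts)


def qa_accuracy(pred_answers, ref_answers):
    """Exact-string-match accuracy over QA-grounding answers."""
    return sum(p == r for p, r in zip(pred_answers, ref_answers)) / len(ref_answers)
```

With thr=0.5, `recall_at_iou` computes [email protected] directly; `qa_accuracy` mirrors the exact-match protocol for the 342 QA-grounding questions.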
No chart-type or language fine-tuning is performed on CS-Bench, preserving the challenge of real-world generalization.
4. Task Taxonomy and Illustrative Examples
CS-Bench tasks are stratified as follows:
A. Grounding Questions
- Example 1: Prompt: "Locate the title of subplot in row 2 and column 1." The expected output, `[{"bbox_2d": [97, 419, 122, 437]}]`, demands identification and precise localization of the rendered text object.
- Example 2: Prompt: "Locate the legend, output its bbox coordinates." The expected output, `[{"bbox_2d": [445, 29, 586, 96]}]`, requires recognition of the contiguous region containing all legend entries.
B. QA-Grounding Questions
- Example: Prompt: "In plot (a), what is the maximum value shown on the y-axis? Give the answer and the bbox of that axis tick." The output, `[{"answer": "1.2", "bbox_2d": [457, 398, 797, 737]}]`, involves extracting the numeric answer and locating the referenced tick label.
This dual format isolates visual grounding and joint spatial-textual reasoning, enabling analysis of MLLM capabilities beyond canonical image QA or simple object detection.
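Combining parsing and localization, a single QA-grounding response can be checked for both answer correctness and box overlap. The joint check below is illustrative only; the benchmark reports answer accuracy and [email protected] as separate metrics, and the response schema is assumed from the example above:

```python
import json


def score_qa_grounding(raw: str, ref_answer: str, ref_box, thr=0.5):
    """Check one QA-grounding response for exact answer match and IoU >= thr.

    Assumes responses shaped like [{"answer": "...", "bbox_2d": [x1, y1, x2, y2]}],
    modeled on the article's example; not the benchmark's official harness.
    """
    item = json.loads(raw)[0]
    answer_ok = item.get("answer") == ref_answer

    (x1, y1, x2, y2), (gx1, gy1, gx2, gy2) = item["bbox_2d"], ref_box
    inter = max(0, min(x2, gx2) - max(x1, gx1)) * max(0, min(y2, gy2) - max(y1, gy1))
    union = (x2 - x1) * (y2 - y1) + (gx2 - gx1) * (gy2 - gy1) - inter
    box_ok = union > 0 and inter / union >= thr
    return answer_ok, box_ok
```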
5. Comparative Analysis with Prior Work
Previous benchmarks such as RefChartQA are limited in scope, addressing only single-subplot charts and a restricted set of chart components (primarily bars and points) with human-annotated boxes exhibiting inconsistent semantics. CS-Bench expands coverage to multi-subplot figures, a broader chart-type distribution, and systematically ties bounding boxes to semantically significant elements, including axes, legends, and titles.
No human-baseline results are reported yet, but near-ceiling recall and accuracy would be expected from skilled annotators given the pixel-precise, code-derived ground-truth boxes. This suggests that CS-Bench primarily measures model limitations rather than inherent ambiguity in chart spatial structure.
6. Empirical Performance and Key Findings
Performance on CS-Bench reveals clear model stratification. The following table summarizes bounding-box recall at IoU ≥ 0.5 ([email protected]) and QA-grounding accuracy across MLLM classes:
| Model | [email protected] | QA Accuracy |
|---|---|---|
| Qwen2.5-VL (3B) | 45.0% | 16.2% |
| Qwen2.5-VL (7B) | 50.2% | 19.3% |
| Chart-R1 (7B) | 54.5% | 9.6% |
| START-SFT (3B) | 58.8% | 26.9% |
| START-RL (3B) | 60.5% | 41.3% |
| START-SFT (7B) | 57.6% | 31.0% |
| START-RL (7B) | 62.3% | 45.3% |
Key observations: the START-RL 7B model outperforms the prior state of the art, Chart-R1 7B, by +7.8 percentage points in [email protected] and +35.7 points in QA accuracy. Even the supervised START variants significantly surpass general-purpose MLLMs, implying that explicit spatial grounding supervision is essential for fine-grained chart reasoning.
7. Significance and Implications
The introduction of CS-Bench fills a critical gap by operationalizing spatial reasoning in chart understanding for MLLMs. It presents rigorous demands—multi-subplot, varied chart typology, pixel-level semantic grounding—foundational for deploying models in scientific analysis. Empirical results suggest that spatial-textual training strategies significantly improve MLLM performance over both generic and specialized baselines, marking a substantive advance in chart-structured multimodal machine learning (Liu et al., 8 Dec 2025).
A plausible implication is that broad adoption of the CS-Bench evaluation protocol may catalyze systematic improvements in MLLM architectures for scientific and technical document analysis, with spatial grounding as a key driver of accuracy and reliability.