Chart Spatial Understanding Benchmark (CS-Bench)
- CS-Bench is a diagnostic suite that evaluates multimodal large language models by requiring spatial localization of chart elements through axis-aligned bounding boxes.
- It employs two complementary task families, grounding and QA-grounding, to assess spatial reasoning and data-driven question answering on programmatically generated, real-style scientific charts.
- The benchmark leverages a rigorously constructed dataset of real-style scientific figures, highlighting the role of spatial grounding supervision in enhancing model performance.
The Chart Spatial Understanding Benchmark (CS-Bench) is a diagnostic evaluation suite introduced to measure the capability of multimodal LLMs (MLLMs) to localize and reason about chart elements in scientific figures. CS-Bench emphasizes direct spatial understanding, requiring models to output axis-aligned bounding boxes for elements referenced in either simple grounding tasks or more complex question-answering (QA) scenarios. Developed in conjunction with the START framework, it uses a rigorously constructed dataset of real-style scientific charts generated through a data pipeline that leverages image-to-code translation and LLM-driven code evolution, facilitating comprehensive evaluation across varied chart types and spatial arrangements (Liu et al., 8 Dec 2025).
1. Task Structure and Formal Definition
CS-Bench tasks require an MLLM to process a chart image and a prompt that requests the localization of specific chart elements. Input images range from single-subplot to multi-subplot scientific figures; prompts reference chart components or pose data-driven questions. The output comprises axis-aligned bounding boxes, specified as [x1, y1, x2, y2], with pixel coordinates relative to the source image.
Two complementary task families are defined:
- Grounding Questions: The prompt identifies a single chart component (e.g., legend, title, axis label), and the model must return its bounding box, requiring spatial reasoning to map the textual reference to a contiguous visual region.
- QA-Grounding Questions: The prompt embeds a data-centric query (e.g., counting curves above a threshold in a subplot) and requires both the final answer (numeric or categorical) and the bounding box of the referenced element, combining textual comprehension with localization.
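As a concrete sketch, a model's bounding-box response can be parsed and validated as below. The response schema (a JSON list of objects with a `"bbox_2d"` field) is an assumption modeled on the examples later in this article, not the benchmark's official harness:

```python
import json


def parse_boxes(raw: str):
    """Parse a model's JSON response into axis-aligned boxes (x1, y1, x2, y2).

    Assumes a list of objects each carrying a "bbox_2d" field, as in the
    CS-Bench examples; the exact schema is an assumption here.
    """
    boxes = []
    for item in json.loads(raw):
        x1, y1, x2, y2 = item["bbox_2d"]
        if not (x1 <= x2 and y1 <= y2):
            raise ValueError("box must satisfy x1 <= x2 and y1 <= y2")
        boxes.append((x1, y1, x2, y2))
    return boxes
```

A grounding response such as `[{"bbox_2d": [445, 29, 586, 96]}]` then yields a single validated box in source-image pixel coordinates.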
2. Dataset Composition and Generation
CS-Bench utilizes 613 chart images generated from Python code derived from real arXiv figures, selected and evolved to record pixel-precise element locations. The benchmark comprises 350 grounding and 342 QA-grounding questions, totaling 692 evaluated tasks. Chart types are widely represented: line charts (~55%), scatter plots (~20%), bar charts (~7%), heatmaps (~7%), with others (~11%) including area, pie, histogram, and multi-axis formats. The majority of images contain multiple subplots (2–4 subplots: ~61.3%), enabling evaluation of inter-subplot spatial reasoning. Question and bounding box instances are manually verified for accuracy, ensuring high semantic fidelity.
3. Evaluation Metrics and Protocols
Zero-shot evaluation (i.e., no fine-tuning on CS-Bench charts or prompts) is conducted under two principal metrics:
- Recall@IoU (Bounding Box Localization): For each query, let B_p and B_g denote the predicted and ground-truth boxes. Intersection-over-union is IoU(B_p, B_g) = area(B_p ∩ B_g) / area(B_p ∪ B_g). [email protected] is the fraction of queries with IoU(B_p, B_g) ≥ 0.5, capturing the proportion of queries correctly localized within a modest overlap threshold.
- QA-Grounding Accuracy: For 342 questions, answer strings must exactly match the reference; accuracy is computed as the fraction of matches.
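The two metrics above can be sketched in a few lines, assuming boxes are (x1, y1, x2, y2) pixel tuples:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds, gts, thr=0.5):
    """Fraction of queries whose predicted box reaches IoU >= thr."""
    return sum(iou(p, g) >= thr for p, g in zip(preds, gts)) / len(gts)


def qa_accuracy(pred_answers, ref_answers):
    """Exact-string-match accuracy over QA-grounding answers."""
    return sum(p == r for p, r in zip(pred_answers, ref_answers)) / len(ref_answers)
```

With thr=0.5, `recall_at_iou` computes [email protected] directly; `qa_accuracy` mirrors the exact-match protocol for the 342 QA-grounding questions.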
No chart-type or language fine-tuning is performed on CS-Bench, preserving the challenge of real-world generalization.
4. Task Taxonomy and Illustrative Examples
CS-Bench tasks are stratified as follows:
A. Grounding Questions
- Example 1: Prompt: "Locate the title of subplot in row 2 and column 1." The expected output, `[{"bbox_2d": [97, 419, 122, 437]}]`, demands identification and precise localization of the rendered text object.
- Example 2: Prompt: "Locate the legend, output its bbox coordinates." The expected output, `[{"bbox_2d": [445, 29, 586, 96]}]`, requires recognition of the contiguous region containing all legend entries.
B. QA-Grounding Questions
- Example: Prompt: "In plot (a), what is the maximum value shown on the y-axis? Give the answer and the bbox of that axis tick." The output, `[{"answer": "1.2", "bbox_2d": [457, 398, 797, 737]}]`, involves extracting the numeric answer and locating the referenced tick label.
This dual format isolates visual grounding and joint spatial-textual reasoning, enabling analysis of MLLM capabilities beyond canonical image QA or simple object detection.
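Combining parsing and localization, a single QA-grounding response can be checked for both answer correctness and box overlap. The joint check below is illustrative only; the benchmark reports answer accuracy and [email protected] as separate metrics, and the response schema is assumed from the example above:

```python
import json


def score_qa_grounding(raw: str, ref_answer: str, ref_box, thr=0.5):
    """Check one QA-grounding response for exact answer match and IoU >= thr.

    Assumes responses shaped like [{"answer": "...", "bbox_2d": [x1, y1, x2, y2]}],
    modeled on the article's example; not the benchmark's official harness.
    """
    item = json.loads(raw)[0]
    answer_ok = item.get("answer") == ref_answer

    (x1, y1, x2, y2), (gx1, gy1, gx2, gy2) = item["bbox_2d"], ref_box
    inter = max(0, min(x2, gx2) - max(x1, gx1)) * max(0, min(y2, gy2) - max(y1, gy1))
    union = (x2 - x1) * (y2 - y1) + (gx2 - gx1) * (gy2 - gy1) - inter
    box_ok = union > 0 and inter / union >= thr
    return answer_ok, box_ok
```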
5. Comparative Analysis with Prior Work
Previous benchmarks such as RefChartQA are limited in scope, addressing only single-subplot charts and a restricted set of chart components (primarily bars and points) with human-annotated boxes exhibiting inconsistent semantics. CS-Bench expands coverage to multi-subplot figures, a broader chart-type distribution, and systematically ties bounding boxes to semantically significant elements, including axes, legends, and titles.
No human-baseline results are reported yet, but near-ceiling recall and accuracy would be expected from skilled annotators given the pixel-precise, code-derived ground-truth boxes. This suggests that CS-Bench primarily measures model limitations rather than inherent ambiguity in chart spatial structure.
6. Empirical Performance and Key Findings
Performance on CS-Bench reveals clear model stratification. The following table summarizes bounding-box recall at IoU ≥ 0.5 ([email protected]) and QA-grounding accuracy across MLLM classes:
| Model | [email protected] | QA Accuracy |
|---|---|---|
| Qwen2.5-VL (3B) | 45.0% | 16.2% |
| Qwen2.5-VL (7B) | 50.2% | 19.3% |
| Chart-R1 (7B) | 54.5% | 9.6% |
| START-SFT (3B) | 58.8% | 26.9% |
| START-RL (3B) | 60.5% | 41.3% |
| START-SFT (7B) | 57.6% | 31.0% |
| START-RL (7B) | 62.3% | 45.3% |
Key observations: the START-RL 7B model outperforms the prior state of the art, Chart-R1 7B, by +7.8 percentage points in [email protected] and +35.7 points in QA accuracy. Even the supervised START variants significantly surpass general-purpose MLLMs, implying that explicit spatial grounding supervision is essential for fine-grained chart reasoning.
7. Significance and Implications
The introduction of CS-Bench fills a critical gap by operationalizing spatial reasoning in chart understanding for MLLMs. It presents rigorous demands—multi-subplot, varied chart typology, pixel-level semantic grounding—foundational for deploying models in scientific analysis. Empirical results suggest that spatial-textual training strategies significantly improve MLLM performance over both generic and specialized baselines, marking a substantive advance in chart-structured multimodal machine learning (Liu et al., 8 Dec 2025).
A plausible implication is that broad adoption of the CS-Bench evaluation protocol may catalyze systematic improvements in MLLM architectures for scientific and technical document analysis, with spatial grounding as a key driver of accuracy and reliability.