
Grounded Chain-of-Thought (GCoT) Dataset

Updated 5 January 2026
  • Grounded Chain-of-Thought (GCoT) is a procedure that augments reasoning traces with bounding-box alignment to ensure visual grounding in specialized tasks.
  • It employs a bootstrapping pipeline with self-verification via OCR or semantic matching to retain only correctly aligned reasoning steps.
  • Evaluated on benchmarks like ChartQA and TAT-QA, GCoT demonstrates significant accuracy gains in data-limited visual reasoning scenarios.

Grounded Chain-of-Thought (GCoT) Dataset designates a procedure, not a pre-packaged corpus, for generating high-fidelity multimodal training data in specialized vision tasks. The methodology was proposed to address failures of standard multimodal LLMs (MLLMs), which struggle in domains such as chart, table, receipt, and report interpretation due to a lack of appropriate reasoning traces and image grounding. GCoT augments chain-of-thought (CoT) reasoning data with bounding-box alignment—verifying each intermediate step by linking it to a precise image region—thereby enforcing faithfulness and improving downstream adaptation in data-limited regimes (Xia et al., 3 Jul 2025).

1. Conceptual Foundation and Rationale

Pre-trained MLLMs display robust performance in object-centric scenes but deteriorate on specialized visual tasks (e.g., chart analysis, tabular reasoning) because their pre-training corpora lack detailed reasoning traces and explicit grounding for non-object, semantically complex images. Standard CoT distillation produces reasoning steps that frequently contain factual errors and do not guarantee correspondence with image evidence. GCoT introduces a bootstrapping approach: each step in a reasoning trace is matched to a bounding box in the image, retaining only those pairs that undergo self-verification (OCR or semantic match) and thereby ensuring grounding fidelity (Xia et al., 3 Jul 2025).

2. Data Collection and Annotation Pipeline

The GCoT procedure leverages five widely used public benchmarks—ChartQA, TabMWP, SROIE, DVQA, and TAT-QA—encompassing QA over charts, tables, receipts, and reports. No new raw images are collected beyond these standard datasets.

Pipeline Steps

  • CoT Distillation: A large-capacity MLLM (examples: LLaMA 3.2, Claude 3.5 Sonnet, GPT-4o, Qwen2-VL, Gemini 1.5 Pro) is prompted to produce step-by-step explanations for each image–question pair.
  • Target Extraction: Nouns and numeric tokens are extracted from each step using NLTK.
  • Grounding: For each extracted target, a grounding-capable MLLM (VisCoT-7B) is prompted to produce bounding boxes, followed by cropping and semantic or OCR validation ("self-verification").
  • Iterative Bootstrapping: Validated boxes are recursively used to augment training and enhance recall over 3–5 bootstrap rounds.
  • Grounded CoT Synthesis: Final verified bounding box coordinates are appended after every mention of their corresponding target in the reasoning trace. Multiple candidate trace–box pairs are generated, with only fully verified augmentations retained.

This protocol yields GCoT data where every reasoning step is both interpretable and verifiably grounded in the image (Xia et al., 3 Jul 2025).
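The target-extraction and self-verification steps above can be sketched as follows. This is an illustrative stand-in, not the authors' code: the regex token pass replaces the paper's NLTK POS tagging, and the string match replaces the actual OCR/semantic comparison.

```python
import re

def extract_targets(step_text):
    # Simplified stand-in for the NLTK pass: pull numeric tokens and
    # word tokens from one reasoning step (the paper extracts nouns via
    # POS tagging; plain word tokens are used here for illustration).
    numbers = re.findall(r"\d+(?:\.\d+)?", step_text)
    words = re.findall(r"[A-Za-z]{2,}", step_text)
    return numbers + words

def self_verify(target, cropped_region_text):
    # Retain a trace-box pair only if the target string is recovered
    # from the cropped region. Here this is a case-insensitive substring
    # match; the paper uses OCR output or semantic matching instead.
    return target.lower() in cropped_region_text.lower()
```

A target that fails verification (e.g. a number the box crop does not contain) causes its trace–box pair to be dropped rather than repaired, which is what keeps only correctly aligned steps.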

3. Data Structure and Example Representations

The canonical file format is JSONL, with each line an independent example. Each entry incorporates fields for image identity, question, answer, and a sequence of grounded chain-of-thought steps, each step explicitly paired with bounding box information:

{
  "image_id": "chart_0456",
  "question": "How many fan letters total were received on Thursday and Monday?",
  "answer": "475",
  "grounded_cot": [
    {"step": "Thursday count = 204 (box [0.12,0.23,0.35,0.44]).", "boxes": [{"xmin": 0.12, "ymin": 0.23, "xmax": 0.35, "ymax": 0.44}]},
    {"step": "Monday count = 271 (box [0.15,0.67,0.42,0.79]).", "boxes": [{"xmin": 0.15, "ymin": 0.67, "xmax": 0.42, "ymax": 0.79}]},
    {"step": "Total = 204+271=475.", "boxes": []}
  ]
}

Typical GCoT entries for arithmetic, field extraction, and tabular reasoning specify 3–6 bounding boxes per example and 5–8 steps in the reasoning sequence. Bounding box coordinates are stored in normalized image units (Xia et al., 3 Jul 2025).
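A loader for this JSONL layout can double as a schema check. The sketch below (the helper name `load_gcot` is an assumption, not from the paper) validates the fields and the normalized-coordinate convention described above:

```python
import json

REQUIRED = {"image_id", "question", "answer", "grounded_cot"}

def load_gcot(path):
    """Load GCoT examples from a JSONL file (one example per line),
    checking required fields and normalized box coordinates."""
    examples = []
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            if not REQUIRED <= ex.keys():
                raise ValueError(f"missing fields: {REQUIRED - ex.keys()}")
            for step in ex["grounded_cot"]:
                for box in step.get("boxes", []):
                    # Coordinates are stored in normalized image units.
                    assert 0.0 <= box["xmin"] <= box["xmax"] <= 1.0
                    assert 0.0 <= box["ymin"] <= box["ymax"] <= 1.0
            examples.append(ex)
    return examples
```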

4. Dataset Scope, Task Coverage, and Quantitative Metrics

GCoT encompasses five source benchmarks, with the following approximate class breakdown:

Visual Domain | Percent of Samples | Benchmarks
Charts | ~35% | ChartQA, DVQA
Tables | ~50% | TabMWP, TAT-QA
Receipts | ~5% | SROIE
Reports | ~10% | TAT-QA (hybrid content)

Sampling for few-shot experiments uses image–question–answer triples, with N ∈ {8, 16, 32, 64, 128} examples drawn per training split (three random seeds per N). Evaluation is always performed on the full held-out test split of the respective benchmark; no domain-generalization or cross-validation protocol is employed (Xia et al., 3 Jul 2025).
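The few-shot sampling protocol reduces to seeded subset selection. The helper below is illustrative (not the authors' code), using Python's `random.Random` so that each (N, seed) pair yields a reproducible subset:

```python
import random

def sample_few_shot(triples, n, seed):
    """Draw an N-shot training subset of image-question-answer triples
    with a fixed seed; the protocol repeats this for three seeds per N."""
    rng = random.Random(seed)
    return rng.sample(triples, n)

# Subsets for every N in {8, 16, 32, 64, 128} and seeds 0..2 could be
# built as: {(n, s): sample_few_shot(train, n, s) for n in (8, 16, 32, 64, 128) for s in range(3)}
```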

Grounding verification uses intersection-over-union (IoU):

\text{IoU} = \frac{\text{area}(B_\text{pred} \cap B_\text{gt})}{\text{area}(B_\text{pred} \cup B_\text{gt})}

Typical evaluation metrics include answer accuracy (percentage of QA pairs answered correctly) and formal verification that predicted boxes contain the correct answer string via OCR or direct semantic comparison. No additional loss functions are specified beyond standard cross-entropy and self-verification filtering (Xia et al., 3 Jul 2025).

5. Baseline Comparisons and Empirical Effectiveness

Few-shot adaptation experiments establish GCoT's effect compared to several baselines:

Setting | 8-shot Avg Acc | 128-shot Avg Acc
Zero-shot | ~16.0% | –
Fine-tuning (QA only) | ~19.5% | ~31.1%
CoT Distillation | ~22.8% | ~31.6%
GCoT (full method) | ~24.1% | ~33.9%

GCoT consistently improves accuracy over standard fine-tuning and conventional CoT distillation, especially in data-limited scenarios. This indicates that enforcing explicit visual grounding during reasoning is advantageous for adapting models to specialized visual domains (Xia et al., 3 Jul 2025).

6. Best Practices for Reproducibility and Extension

The dataset is not released as a static corpus; rather, the authors provide a fully specified procedure for generating GCoT-augmented traces over any collection of image–QA pairs:

  1. Start from a public benchmark with appropriate train/test splits.
  2. Distill CoT traces from a high-capacity MLLM.
  3. Bootstrap bounding box annotations using a grounding-capable MLLM (VisCoT-7B).
  4. Apply self-verification, retaining only trace–box pairs passing OCR or semantic content checks.
  5. Format outputs in JSONL as shown in Section 3.

Researchers should implement this protocol to construct high-quality, grounded chain-of-thought data in their own domains, with final outputs suitable for supervised fine-tuning (Xia et al., 3 Jul 2025).
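The five-step protocol can be sketched as a single construction loop. The callables `distill_cot`, `ground_boxes`, and `verify` below are hypothetical stand-ins for the teacher MLLM, the VisCoT-7B grounding calls, and the OCR/semantic verification step; wiring in real models is left to the implementer:

```python
def build_gcot(examples, distill_cot, ground_boxes, verify):
    """Sketch of the reproduction protocol: distill CoT traces, ground
    each step, keep only verified trace-box pairs, and emit records in
    the JSONL-ready layout of Section 3."""
    out = []
    for ex in examples:
        # Step 2: distill a step-by-step trace from a high-capacity MLLM.
        steps = distill_cot(ex["image_id"], ex["question"])
        grounded = []
        for step in steps:
            # Steps 3-4: propose boxes, retain only those passing
            # self-verification against the cropped image region.
            boxes = [b for b in ground_boxes(ex["image_id"], step)
                     if verify(ex["image_id"], step, b)]
            grounded.append({"step": step, "boxes": boxes})
        # Step 5: attach the grounded trace to the original QA record.
        out.append({**ex, "grounded_cot": grounded})
    return out
```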

7. Relationship to Broader Multimodal Reasoning Benchmarks

GCoT stands alongside recent multimodal CoT datasets, such as 3D-CoT Benchmark (Chen et al., 8 Mar 2025), MM-GCoT (Wu et al., 17 Mar 2025), Geo-CoT380k (Liu et al., 26 Sep 2025), SCENECOT-185K (Linghu et al., 19 Oct 2025), and S-Chain (Le-Duc et al., 26 Oct 2025). However, it is unique in its bootstrapping-plus-verification pipeline and its focus on domain-specialized image types, particularly charts, tables, and receipts, as opposed to scene or object-centric spatial reasoning. A plausible implication is that similar grounding-augmented CoT generation could improve adaptation in other non-standard multimodal domains where conventional pre-training is deficient.

GCoT methodology forms a modular template for future benchmarking and adaptation studies demanding fine-grained alignment between visual cues and reasoning traces, with direct implications for interpretability, reliability, and trustworthiness in multimodal LLM deployment (Xia et al., 3 Jul 2025).
