Graph-Structured Reasoning Dataset
- Graph-Structured Reasoning Datasets are specialized corpora that represent multi-step and relational reasoning via explicit graph structures, enabling precise model evaluations.
- They integrate diverse construction paradigms such as automated scene graph extraction, combinatorial graph generation, and LLM-augmented annotations for fine-grained supervision.
- These datasets support varied tasks—from vision-language QA to coding challenges and commonsense inference—benchmarking models on accuracy, interpretability, and process-level reasoning.
A graph-structured reasoning dataset is a corpus designed to evaluate and advance the ability of computational models—especially neural networks and LLMs—to perform reasoning tasks explicitly grounded in graph representations. These datasets span multiple domains, from vision-language grounding and table/chart analysis to commonsense reasoning, multi-hop question answering, algorithmic problem-solving, and open-domain logical inference. They encode relational, topological, and often multi-step reasoning processes through explicit graph structures, supporting fine-grained supervision, benchmarking, and interpretability across a wide range of AI tasks.
1. Dataset Construction Paradigms
Graph-structured reasoning datasets are constructed through several principal methodologies, determined by their intended setting (vision, language, coding, etc.), granularity of supervision, and type of graph structure employed.
- Scene and Semantic Graphs in Vision: In Ref-Reasoning (Yang et al., 2020), per-image semantic scene graphs are derived from GQA/Visual Genome, normalized and enriched with “same-attribute” edges. Referring expressions are generated by selecting referent nodes, sampling subgraphs of controlled size and layout, and populating natural language templates that are accepted only if corresponding functional programs executed over the image scene graph uniquely identify the referent.
- Synthetic and Real-World Graphs for Coding and Numerical Reasoning: Datasets such as GraphEval36K (Wu et al., 2024) and GraphPile (Zhang et al., 23 Jul 2025) generate large samples of canonical graph problems (e.g., cycle detection, shortest paths, minimum spanning trees), using combinatorial generators (e.g., Erdős–Rényi), real-world corpora (e.g., citation graphs, Amazon product networks), and programmatically generated problem instances.
- Process-Supervision via Stepwise Generation: GraphSILO (Peng et al., 2 Mar 2025) leverages both algorithm-instrumented trajectories (e.g., logging subroutine-level steps of Dijkstra, Kruskal) and Monte Carlo Tree Search rollouts guided either by canonical programs or model policies. All steps are automatically annotated as correct or incorrect, yielding fine-grained process labels.
- Crowd-Sourced or LLM-Augmented Graph Extraction in Textual Reasoning: ExplaGraphs (Saha et al., 2021), GRS-QA (Pahilajani et al., 2024), and “From Chains to Graphs” (Chen et al., 7 Jan 2026) convert question-answer pairs or explanatory argumentation into directed acyclic explanation graphs via manual or automatic annotation. Nodes represent atomic facts, premises, or sub-answers, and edges denote logical entailment or support; construction relies on supporting-fact tagging, logical decomposition, or merging multiple model-generated chains of thought. Negative samples are created through structural perturbations (edge addition/removal, shuffling).
- Synthetic Chart/Table Reasoning: GRAFT (Verma et al., 21 Aug 2025) programmatically creates chart/table images and multi-step questions, matched with schema-constrained JSON or YAML answers. Visuals are generated with precise semantic and structural control (randomized labels, value distributions, visual style), and all answers are visually grounded.
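The combinatorial-generation paradigm described above can be sketched in a few lines: sample a random graph, compute a ground-truth answer with a canonical algorithm, and pair both with a templated question. This is a minimal illustration, not the pipeline of any specific dataset; the function names (`erdos_renyi`, `make_instance`) and the fixed source/target convention are assumptions.

```python
import random
from collections import deque

def erdos_renyi(n, p, seed=0):
    """Sample an undirected Erdős–Rényi G(n, p) graph as an adjacency list."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def shortest_path_length(adj, src, dst):
    """BFS hop distance from src to dst; None if unreachable (the ground truth)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return None

def make_instance(n=8, p=0.3, seed=0):
    """Pair a sampled graph with a templated question and its verified answer."""
    adj = erdos_renyi(n, p, seed)
    src, dst = 0, n - 1  # assumption: query the two extreme node ids
    return {
        "edges": sorted((u, v) for u in adj for v in adj[u] if u < v),
        "question": f"What is the length of the shortest path from node {src} to node {dst}?",
        "answer": shortest_path_length(adj, src, dst),
    }
```

Because the answer is computed by a canonical algorithm rather than annotated by hand, instance generation scales to arbitrarily many samples with guaranteed-correct labels, which is what makes this paradigm attractive for large benchmarks.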
The following table summarizes primary construction strategies:
| Dataset | Construction Modality | Graph Type |
|---|---|---|
| Ref-Reasoning | Automatic subgraph + template synthesis | Image scene/language graphs (semantic, DAGs) |
| GraphEval36K | Algorithmic instance generation | Synthetic graphs (various topologies) |
| GraphSILO | Task-oriented/MCTS trace+step labeling | Synthetic graphs, process-labeled traces |
| ExplaGraphs | Crowd-creation and verification | Explanation DAGs via commonsense relations |
| GRS-QA | QA dataset annotation + structure merge | Reasoning graphs (bridge, tree, comparison) |
| GRAFT | Programmatic chart/table gen | Tables, charts—implicit (annotated structure) |
2. Graph Structure and Formalism
All such datasets represent data, queries, and/or intermediate reasoning as explicit graphs.
- Vision–Language: Scenes are represented as semantic graphs G = (V, E) with object nodes V, relation edges E, and visual/spatial attributes on both nodes and edges. Language scene graphs mirror linguistic syntactic/semantic structure, mapping noun phrases to nodes and relations (prepositions/verbs) to labeled directed edges (Yang et al., 2020).
- Commonsense and Explanation: Nodes are concepts, facts, or argument components; edges are labeled by commonsense or logical relations (causes, desires, atLocation, etc.), and many graphs are acyclic, non-linear, and contain both internal (from the context) and external knowledge nodes (Saha et al., 2021).
- QA Reasoning: Reasoning graphs are directed graphs G = (V, E), with nodes as supporting facts and edges reflecting the minimal dependencies needed to answer the question (e.g., chain, tree, or forest structures) (Pahilajani et al., 2024, Chen et al., 7 Jan 2026).
- Algorithmic/Benchmark: Graphs are classical combinatorial constructs G = (V, E), possibly attributed (weights, labels), and process-trace graphs list ordered step sequences, each explicitly labeled (Peng et al., 2 Mar 2025, Wu et al., 2024, Zhang et al., 23 Jul 2025).
- Knowledge Graph QA: The TITAN dataset (Simoni et al., 16 Oct 2025) formalizes domain knowledge as a graph G = (V, E), where nodes carry entity types and edges are typed, bidirectional relations. Reasoning paths are sequences of relational operators traversing G.
Edges can be labeled (relation types, logical dependencies), directed or undirected, and sometimes bidirectional. Formal definitions and adjacency representations (e.g., adjacency matrices or edge lists) are typically provided.
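The shared formalism across these datasets, labeled directed graphs with typed nodes and an acyclicity requirement for explanation graphs, can be captured in a small data structure. This is an illustrative sketch; the class names (`Edge`, `ReasoningGraph`) and attribute layout are assumptions, not any dataset's released schema.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    label: str  # relation type, e.g. "causes", "atLocation", "supports"

@dataclass
class ReasoningGraph:
    nodes: dict = field(default_factory=dict)  # node id -> attribute dict
    edges: list = field(default_factory=list)

    def add_node(self, nid, **attrs):
        self.nodes[nid] = attrs

    def add_edge(self, src, dst, label):
        self.edges.append(Edge(src, dst, label))

    def successors(self, nid):
        return [e.dst for e in self.edges if e.src == nid]

    def is_acyclic(self):
        """Kahn-style topological check: explanation graphs must be DAGs."""
        indeg = {n: 0 for n in self.nodes}
        for e in self.edges:
            indeg[e.dst] += 1
        frontier = [n for n, d in indeg.items() if d == 0]
        seen = 0
        while frontier:
            n = frontier.pop()
            seen += 1
            for m in self.successors(n):
                indeg[m] -= 1
                if indeg[m] == 0:
                    frontier.append(m)
        return seen == len(self.nodes)  # all nodes removable => no cycle
```

A validity check like `is_acyclic` is exactly the kind of structural constraint that dataset builders enforce at construction time and that negative augmentation deliberately violates.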
3. Problem Types and Reasoning Tasks
Tasks in graph-structured reasoning datasets are designed to probe multiple aspects of relational, algorithmic, and multi-step inference:
- Grounded Vision–Language QA: Locate image objects from complex referring expressions, requiring multi-hop spatial/semantic reasoning, e.g., “the man in front of the red car” (Yang et al., 2020).
- Algorithmic and Coding Challenges: Solve pathfinding, cycle detection, clique enumeration, MST, and flow problems given explicit graph data; solutions may require code, natural language, or programmatic reasoning (Wu et al., 2024, Zhang et al., 23 Jul 2025, Peng et al., 2 Mar 2025).
- Commonsense and Multihop Reasoning: Predict stances (support/counter), explain logical connections as explicit DAGs, or answer questions requiring evidence fusion across bridge, compositional, or comparison structures (Saha et al., 2021, Pahilajani et al., 2024).
- Chart/Table Analysis: Answer complex, multi-step questions over chart/table images; questions draw on reasoning types such as comparison, trend detection, aggregation, ranking, and anomaly detection, with schema-constrained outputs (Verma et al., 21 Aug 2025).
- Process and Reward Modeling: Each step of a model’s solution is labeled as correct/incorrect, allowing training and benchmarking of step-level reward models. Tasks cover node, edge, and global graph properties, as well as multi-hop label prediction (Peng et al., 2 Mar 2025).
- Knowledge Graph Traversal and QA: Predict compositional chains of relations to traverse a cyber threat knowledge graph in response to free-form CTI questions; chains are stepwise interpretable and directly executable (Simoni et al., 16 Oct 2025).
Difficulty is typically parameterized: number of graph “hops,” graph size/connectivity, sub-expression minimality, or structure type (chain, tree, forest, compositional). Datasets often provide balanced splits to evaluate models at varying complexity levels.
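Hop-based difficulty binning, as mentioned above, amounts to measuring the longest dependency path in a reasoning DAG. A minimal sketch, assuming the graph arrives as a list of `(src, dst)` edge pairs and is already acyclic:

```python
from functools import lru_cache

def hop_count(edges):
    """Number of 'hops' = longest path length (in edges) of a reasoning DAG.

    edges: list of (src, dst) pairs; assumes the graph is acyclic,
    so the recursion below terminates.
    """
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)

    @lru_cache(maxsize=None)
    def depth(u):
        # Leaf nodes have depth 0; otherwise 1 + deepest successor.
        return 1 + max((depth(v) for v in succ.get(u, [])), default=-1)

    nodes = {u for e in edges for u in e}
    return max((depth(u) for u in nodes), default=0)
```

Binning instances by this value (2-hop, 3-hop, ...) is what lets benchmarks report balanced splits at controlled complexity levels.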
4. Annotation, Supervision, and Interpretability
Each dataset delivers specific supervision schemes:
- Fine-grained Process Labeling: GraphSILO uniquely supplies per-step correct/incorrect labels (394,165 labels across 118,189 traces), enabling training of process reward models (PRMs) (Peng et al., 2 Mar 2025).
- Ground-Truth Graph and Intermediate Steps: Ref-Reasoning, ExplaGraphs, and GRS-QA annotate each instance with full graph structures, ground-truth attention or reasoning traces (e.g., AttendNode/AttendRelation steps), and intermediate visualizations (Yang et al., 2020, Saha et al., 2021, Pahilajani et al., 2024).
- Negative Augmentation for Structure Sensitivity: GRS-QA generates negative reasoning graphs (with incorrect or corrupted structure) to decouple semantic vs. structural contribution to reasoning (Pahilajani et al., 2024).
- Human Verification and Rounds of Refinement: ExplaGraphs employs a create–verify–refine pipeline, achieving up to 90% graph correctness after iterative crowdsourced verification (Saha et al., 2021).
- Automatic Template Mining and LLM Paraphrasing: TITAN and GraphPile use large template banks, auto-instantiation, and LLM paraphrase for linguistic variety and controlled entity coverage (Simoni et al., 16 Oct 2025, Zhang et al., 23 Jul 2025).
Interpretability is a major design criterion: nearly all datasets include or enable inspection of the explicit reasoning chain (either via functional program traces, graph visualization, or step-annotated code), supporting fine-grained error analysis and module-level evaluation.
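Automatic step labeling of the kind used for process supervision can be approximated by comparing a model trajectory against a canonical algorithmic trace: steps are correct while they follow the reference, and incorrect from the first divergence onward. This is a deliberately simplified sketch (the function name `label_steps` and the prefix-matching rule are assumptions; real pipelines use richer per-step checks and MCTS rollouts):

```python
def label_steps(model_steps, canonical_steps):
    """Label each model step 1 (correct) while it matches the canonical
    trace prefix, and 0 (incorrect) from the first divergence onward."""
    labels, diverged = [], False
    for i, step in enumerate(model_steps):
        if not diverged and i < len(canonical_steps) and step == canonical_steps[i]:
            labels.append(1)
        else:
            diverged = True  # an error invalidates all subsequent steps
            labels.append(0)
    return labels
```

The resulting per-step 0/1 vector is precisely the supervision signal a process reward model is trained on, as opposed to a single final-answer label.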
5. Evaluation Protocols and Benchmarking
Benchmarks are systematic and multi-layered:
- Final Answer Accuracy: Most datasets use strict exact-match or accuracy on predicted answers as the primary metric (Yang et al., 2020, Simoni et al., 16 Oct 2025).
- Intermediate/Process Accuracy: For process-labeled datasets, step-level accuracy or trajectory reward (e.g., GraphPRM-guided search/DPO improvement) is tracked (Peng et al., 2 Mar 2025). For graph-executable QA (TITAN (Simoni et al., 16 Oct 2025)), path accuracy (EM), as well as reasoning-text overlap metrics (ROUGE, BLEU, BERTScore), are standard.
- Semantic and Structural Match: In explanation or reasoning-graph datasets, graph edit distance (GED), BERTScore over edge sets, and edge importance accuracy are used; ExplaGraphs evaluates both stance and graph correctness at multiple levels (Saha et al., 2021).
- Task-Specific Protocols: GRAFT uses 1–5 scale ratings along correctness, completeness, visual grounding, and schema fidelity, with automated evaluation by GPT-4o. GraphEval36K computes average passing rate (APR) and pass@1 over test case suites (Verma et al., 21 Aug 2025, Wu et al., 2024).
- Ablation and Sensitivity Tests: GRS-QA explicitly measures performance under positive/negative graph structures, unstructured vs. structured evidence, and varying “hop” counts (Pahilajani et al., 2024).
- Model Transfer: Cross-domain and cross-task evaluations are performed (e.g., GraphPRM trained on graph tasks improves LLMs on arithmetic/math domains) (Peng et al., 2 Mar 2025).
Most datasets adopt open splits (train/val/test or evaluation-only), and some (GraphSILO, GraphEval36K) are designed for both diagnostic and curriculum training regimes.
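Three of the metrics above (strict exact match, average passing rate, and pass@1 with one sample per problem) reduce to a few lines each. A minimal sketch, assuming each coding result is a `(tests_passed, tests_total)` pair and that EM applies light whitespace/case normalization:

```python
def exact_match(pred, gold):
    """Strict string EM after lowercasing and whitespace normalization."""
    norm = lambda s: " ".join(str(s).lower().split())
    return float(norm(pred) == norm(gold))

def average_passing_rate(results):
    """APR: mean per-problem fraction of unit tests passed.
    results: list of (tests_passed, tests_total) pairs."""
    rates = [p / t for p, t in results if t > 0]
    return sum(rates) / len(rates) if rates else 0.0

def pass_at_1(results):
    """pass@1 (single sample): fraction of problems passing every test."""
    return sum(1 for p, t in results if p == t) / len(results) if results else 0.0
```

APR rewards partial credit per problem while pass@1 is all-or-nothing, which is why the two can rank models differently on the same suite.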
6. Applications, Limitations, and Future Directions
Graph-structured reasoning datasets underpin research in several high-impact areas:
- Generalized Reasoning and Algorithmic Robustness: By grounding performance in schematic, controlled reasoning structures, these datasets facilitate work on model generalization, systematic error discovery, and cross-paradigm transfer (Zhang et al., 23 Jul 2025, Chen et al., 7 Jan 2026).
- Instruction-Following and Alignment: Structured prompts and schema-constrained answers (e.g., GRAFT) are critical for instruction-following evaluation at the intersection of vision, language, and formal reasoning (Verma et al., 21 Aug 2025).
- Explainable and Trustworthy AI: The explicit graph traces in ExplaGraphs, Ref-Reasoning, and TITAN support interpretable, inspectable reasoning chains, indispensable for safety and oversight in mission-critical applications (Saha et al., 2021, Simoni et al., 16 Oct 2025).
- Process Supervision and Reinforcement Learning: Fine-grained annotations allow for process reward models and DPO, providing a rigorous pipeline for training LLMs to value stepwise reasoning over mere answer matching (Peng et al., 2 Mar 2025).
- Limitations: Many datasets are synthetically generated or limited to moderate graph sizes (e.g., ≤40 nodes in GraphPile), which may not cover large-scale or power-law graphs. Data collection can be labor-intensive (crowdsourcing for ExplaGraphs), and current annotation schemes may underrepresent certain domain-specific or multimodal relation types. Negative or adversarial graph variants are not universally explored.
A plausible trend is the further integration of multimodal data (e.g., combining graph, image, table, and textual evidence), scaling to more complex real-world graphs, and the systematization of negative augmentation to disentangle reasoning errors due to structural versus semantic misalignment.
7. Representative Datasets: Summary Table
| Dataset | Domain | Core Construction | Scale / Notable Features | Reference |
|---|---|---|---|---|
| Ref-Reasoning | Vision-Language | Auto: scene graphs + templates | 791,956 expressions, 83,989 images | (Yang et al., 2020) |
| GRAFT | Vision, Charts/Tabs | Prog. chart/table gen | 3,151 instances, JSON/YAML schema, 6 QA types | (Verma et al., 21 Aug 2025) |
| TITAN | Cyber Threat Intel | KG traversal, LLM path traces | 88,209 QA-path-CoT triples, MITRE KG | (Simoni et al., 16 Oct 2025) |
| GraphEval36K | Coding/Alg. | LeetCode, auto test suites | 40 problems, 2,850 graphs, 8–11 graph types | (Wu et al., 2024) |
| GRS-QA | Multihop QA | Struct. annotation, negatives | 10,000+ QA w/ explicit reasoning graphs | (Pahilajani et al., 2024) |
| ExplaGraphs | Commonsense/Wiki | Create–Verify–Refine pipeline | 3,166 graphs, 53 topics, stance+graph output | (Saha et al., 2021) |
| GraphSILO | Algorithmic Reason. | Auto trace + MCTS, step label | 118,189 prob.–soln., 394,165 step labels | (Peng et al., 2 Mar 2025) |
| GraphPile | CPT, code, text | Mixed (CoT, PoT, ToE, real) | 2.68 M samples, 10.9 B tokens, 23 tasks | (Zhang et al., 23 Jul 2025) |
| SGR (“Chains→Graphs”) | General-domain QA | LLM-chains, alignment/merge | 9,869 merged reasoning graphs | (Chen et al., 7 Jan 2026) |
Collectively, these corpora drive research progress in structural reasoning, offering robust testbeds for new architectures, supervision schemes, and learning objectives in both domain-specific and general AI.