GraphPile Corpus: LLM Graph Reasoning
- GraphPile is a large-scale dataset for continued pretraining (CPT) of LLMs on graph reasoning, comprising 10.9B tokens across 23 diverse tasks.
- It features structured data from both synthetic and real-world graphs, with rich annotations to illustrate step-by-step algorithmic reasoning.
- Used for continued pretraining, GraphPile improves downstream benchmarks by up to 4.9% on mathematical reasoning and up to 21.2% on non-mathematical (logical and commonsense) reasoning tasks.
GraphPile is the first large-scale corpus purpose-built for continued pretraining (CPT) of LLMs on graph problem reasoning (GPR) data. Consisting of over 10.9 billion tokens distributed across 2,684,675 samples covering 23 distinct graph reasoning tasks, GraphPile is designed to foster generalized reasoning abilities in LLMs beyond the limits of domain-specific or narrowly mathematical datasets. Its composition, annotation methodologies, structural characteristics, and use in pretraining are detailed below (Zhang et al., 23 Jul 2025).
1. Scope and Composition of GraphPile
GraphPile provides broad GPR coverage by sampling a diverse set of 23 graph reasoning tasks, drawn from distinct high-level paradigms: logical reasoning, topological reasoning, numerical computation, enumeration, division/decomposition, and spatial reasoning. The dataset targets graph-structured inputs and promotes the development of sophisticated logical, relational, and multi-step reasoning.
Overview of Task Coverage and Data Segmentation
| Component | # Samples | # Tokens | Example Type |
|---|---|---|---|
| Chain-of-Thought (CoT) | 848,965 | 2,809,225,185 | Stepwise traces (cycles, SCC, etc.) |
| Real-World Graphs | 743,465 | 3,203,590,685 | Named-entity, KG tasks |
| Program-of-Thought (PoT) | 759,851 | 2,190,746,959 | Python graph algorithms |
| Trace-of-Execution (ToE) | 332,394 | 2,727,119,224 | Code execution traces |
Task Breakdown by Reasoning Paradigm
- Logical Reasoning: Cycle detection, bipartite check, connectivity, strongly connected components (SCC).
- Topological Reasoning: Topological sort, common neighbors, predecessor, PageRank, Jaccard coefficient, clustering coefficient.
- Numerical Computation: Shortest path, maximum flow, minimum spanning tree, maximum triangle sum.
- Enumeration: Hamilton path, maximum clique, Euler path, implicit enumeration (e.g., graph diameter).
- Division/Decomposition: Additional variants of connectivity, SCC, and graph traversal.
- Spatial Reasoning: Planarity testing.
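Several of the tasks listed above can be illustrated concretely. The sketch below computes one task from each of several paradigms with NetworkX on a toy graph; the library and the particular graph are our choices for illustration, not prescribed by GraphPile.

```python
# Toy graph: a triangle (0-1-2) plus a tail (2-3-4).
import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)])

has_cycle = len(nx.cycle_basis(G)) > 0     # logical: cycle detection
is_bip = nx.is_bipartite(G)                # logical: bipartite check
sp = nx.shortest_path_length(G, 0, 4)      # numerical: shortest path
cc = nx.clustering(G, 2)                   # topological: clustering coefficient
planar, _ = nx.check_planarity(G)          # spatial: planarity test

print(has_cycle, is_bip, sp, cc, planar)
```

In the corpus, each such task is posed as a natural-language question about a serialized graph, with the answer accompanied by a CoT, PoT, or ToE annotation.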
Per-task token totals are not reported; only aggregated per-component counts are provided.
2. Data Collection and Annotation Methodology
GraphPile consists of both synthetic and real-world graph instances, constructed to maximize task diversity and annotation richness.
Synthetic Problem Generation
Synthetic graphs are generated using Erdős–Rényi (ER) models—both directed and undirected—ranging between 6 and 40 nodes, and represented as adjacency lists, adjacency matrices, or edge lists. This ensures coverage across typical small-to-medium scale graph complexities relevant for algorithmic reasoning.
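A minimal sketch of this generation recipe, assuming NetworkX: an Erdős–Rényi graph with a node count drawn from $[6, 40]$, serialized in the three formats the paper names. The edge probability `p=0.2` is our assumption; the excerpt does not fix it.

```python
import random
import networkx as nx

random.seed(0)
n = random.randint(6, 40)  # node count uniformly sampled in [6, 40]
G = nx.gnp_random_graph(n, p=0.2, seed=0, directed=False)  # ER model

# The three serializations mentioned in the text:
edge_list = list(G.edges())
adj_list = {v: sorted(G.neighbors(v)) for v in G.nodes()}
adj_matrix = nx.to_numpy_array(G, dtype=int)

print(n, G.number_of_edges())
```

Directed instances follow the same recipe with `directed=True` (i.e., `nx.gnp_random_graph(n, p, directed=True)`).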
Real-World Graph Sourcing
Real-world graph data are curated from prominent datasets: DBLP (citation network), OpenFlights (air traffic), PubChemQC (chemical compounds), Social-Network Repository, and DBpedia (knowledge graph). For these, graph entities are assigned human-readable names, and problems are posed in naturalistic, domain-relevant language.
Annotation and Naturalization Pipeline
- CoT (Chain-of-Thought): Expert-authored standard algorithms (e.g., DFS for cycle detection) execute to produce intermediate states; traces are converted to step-by-step natural language using GPT-4o and subsequently validated.
- Real-World: Numeric node/edge IDs in CoT problems are mapped to domain names with GPT-4o; outputs are filtered for semantic consistency.
- PoT (Program-of-Thought): LLMs receive prompts based on NetworkX documentation or direct algorithmic queries. Code samples are validated for correctness and executability before being rewritten into code variants.
- ToE (Trace-of-Execution): Three independent code implementations are instrumented with textual trace statements. Each execution log is paired with questions regarding intermediate variable states, simulating real debugging scenarios.
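The ToE idea — instrumenting an implementation with textual trace statements whose log is then paired with questions about intermediate variable states — can be sketched as follows. The matching routine is a standard augmenting-path algorithm written for illustration, not the paper's actual code.

```python
def max_bipartite_matching(adj, left_nodes):
    """adj maps each left node to its list of right-node neighbors."""
    match_right = {}  # right node -> matched left node
    trace = []        # textual execution trace, ToE-style

    def dfs(u, visited):
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            trace.append(f"dfs({u}): visit neighbor {v}")
            # Match v to u if v is free, or if v's current partner can rematch.
            if v not in match_right or dfs(match_right[v], visited):
                match_right[v] = u
                trace.append(f"match_right[{v}]={u}")
                return True
        return False

    matching = 0
    for u in left_nodes:
        if dfs(u, set()):
            matching += 1
            trace.append(f"max_matching={matching}")
    return matching, match_right, trace

matching, pairs, trace = max_bipartite_matching({0: [1], 2: [1, 3]}, [0, 2])
print(matching)  # 2
```

A ToE sample would then ask, e.g., "What is `match_right` after processing left node 0?", answerable only by following the trace.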
No custom tokenization or normalization is specified beyond each base model's standard BPE or SentencePiece tokenization procedures.
3. Statistical and Structural Properties
GraphPile's statistical profile is characterized primarily at the sample and aggregate level, rather than in fine-grained structural detail.
- Sample Size: 2,684,675 samples, spanning 10,930,682,053 tokens in total.
- Node/Edge Sizes: Node count per graph is uniformly sampled in $[6, 40]$. No exact statistics on average edge count or mean degree are reported.
- Distributional Metrics: The paper does not report means, variances, or complexity distributions for sample characteristics (such as time complexity class).
A plausible implication is that while the dataset spans a wide range of interpretive and algorithmic graph problem types, detailed complexity stratification remains to be reported in future releases.
4. Representative Problem and Annotation Examples
GraphPile's annotation richness is demonstrated by its diverse examples:
- CoT Example (Cycle Detection):
- Run DFS from node $0$, mark visited.
- Explore neighbors in order: $1,4,7,9$.
  - From $9$, neighbor $0$ is already visited and is not the parent of $9$.
- Cycle detected.
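The CoT steps above can be mirrored in code: run DFS, and treat any visited neighbor that is not the current node's DFS parent as a back edge. The graph below is a toy stand-in, not the example's actual instance.

```python
def has_cycle(adj, start=0):
    """DFS-based cycle detection in an undirected graph (adjacency dict)."""
    visited = set()

    def dfs(u, parent):
        visited.add(u)
        for v in adj[u]:
            if v not in visited:
                if dfs(v, u):
                    return True
            elif v != parent:   # visited neighbor that is not our DFS parent
                return True     # -> back edge: cycle detected
        return False

    return dfs(start, None)

# A 4-cycle 0-1-2-3-0:
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(has_cycle(adj))  # True
```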
- PoT Example (Maximum Matching):
```python
import networkx as nx
G = nx.Graph({0: [6, 7, …], …})
node_list1 = [0, 1, 2, 3]
print(nx.bipartite.maximum_matching(G, top_nodes=node_list1))
```
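The PoT sample above elides its graph; a runnable variant on a concrete toy bipartite graph (our choice, not from the corpus) might look like:

```python
import networkx as nx

# Left nodes 0-3, right nodes 4-6.
G = nx.Graph({0: [4, 5], 1: [4], 2: [5, 6], 3: [6]})
node_list1 = [0, 1, 2, 3]
matching = nx.bipartite.maximum_matching(G, top_nodes=node_list1)
print(len(matching) // 2)  # number of matched pairs: 3
```

`maximum_matching` returns a dict containing both directions of each matched edge, hence the division by two.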
- ToE Example (Bipartite Matching Trace):

```
match_right = {}
visited = {1: False, 2: False, 3: False}
dfs(0): visit neighbor 1 → match_right[1] = 0 → return True → max_matching = 1
match_pairs = [(0, 1)]
```
These examples demonstrate the incorporation of algorithmic procedures, intermediate state representation, and deterministic reasoning traces.
5. Application to LLM Continued Pretraining
GraphPile is used to continue-pretrain base LLMs (Llama 3, Llama 3.1, Gemma 2) with the standard autoregressive next-token cross-entropy loss $\mathcal{L} = -\sum_{t} \log p_\theta(x_t \mid x_{<t})$; no auxiliary or contrastive losses are used.
All dataset components are concatenated and randomly mixed. Training uses a uniform sampling regime (no explicit curriculum), batch size 1024, maximum sequence length 8192, and three epochs; the learning rate is left unspecified here.
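The objective is the ordinary next-token cross-entropy. A minimal sketch, shown on toy per-token probabilities rather than a real model's softmax outputs:

```python
import math

def next_token_ce(token_probs):
    """Mean negative log-likelihood of the true tokens.

    token_probs[t] is the model's probability for the ground-truth token
    at position t, conditioned on the preceding tokens.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Toy sequence of three tokens with assumed model probabilities:
loss = next_token_ce([0.9, 0.5, 0.8])
print(loss)
```

In practice this is computed over the tokenized GraphPile samples by the training framework; the sketch only makes the loss itself explicit.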
6. Empirical Impact and Scientific Significance
Application of GraphPile for CPT leads to measurable gains: up to 4.9% improvement in mathematical reasoning tasks and up to 21.2% on non-mathematical (logical, commonsense) reasoning benchmarks, as validated with the GraphMind models built atop leading LLMs (Zhang et al., 23 Jul 2025). This demonstrates that GPR data can serve as an effective bridge between domain-specific pretraining (such as mathematical reasoning) and universal, pattern-diverse reasoning abilities. The corpus is the first to systematically operationalize graph problem reasoning as a generalization mechanism for LLMs.
GraphPile thus represents a pivotal advance in dataset engineering for robust, adaptable natural language reasoning on complex, structured tasks.