
GraphPile Corpus: LLM Graph Reasoning

Updated 7 December 2025
  • GraphPile is a large-scale dataset for continued pretraining (CPT) of LLMs on graph reasoning, combining 10.9B tokens across 23 diverse tasks.
  • It features structured data from both synthetic and real-world graphs, with rich annotations illustrating step-by-step algorithmic reasoning.
  • Used to boost logical and commonsense reasoning, GraphPile improves benchmark performance by up to 21.2% on non-mathematical tasks.

GraphPile is the first large-scale corpus purpose-built for continued pretraining (CPT) of LLMs on graph problem reasoning (GPR) data. Consisting of over 10.9 billion tokens distributed across 2,684,675 samples covering 23 distinct graph reasoning tasks, GraphPile is designed to foster generalized reasoning abilities in LLMs beyond the limits of domain-specific or narrowly mathematical datasets. Its composition, annotation methodologies, structural characteristics, and use in pretraining are detailed below (Zhang et al., 23 Jul 2025).

1. Scope and Composition of GraphPile

GraphPile provides broad GPR coverage by sampling a diverse set of 23 graph reasoning tasks, drawn from distinct high-level paradigms: logical reasoning, topological reasoning, numerical computation, enumeration, division/decomposition, and spatial reasoning. The dataset targets graph-structured inputs and promotes the development of sophisticated logical, relational, and multi-step reasoning.

Overview of Task Coverage and Data Segmentation

| Component | # Samples | # Tokens | Example Type |
|---|---|---|---|
| Chain-of-Thought (CoT) | 848,965 | 2,809,225,185 | Stepwise cycle/SCC traces, etc. |
| Real-World Graphs | 743,465 | 3,203,590,685 | Named-entity, KG tasks |
| Program-of-Thought (PoT) | 759,851 | 2,190,746,959 | Python graph algorithms |
| Trace-of-Execution (ToE) | 332,394 | 2,727,119,224 | Code execution traces |

Task Breakdown by Reasoning Paradigm

  • Logical Reasoning: Cycle detection, bipartite check, connectivity, strongly connected components (SCC).
  • Topological Reasoning: Topological sort, common neighbors, predecessor, PageRank, Jaccard coefficient, clustering coefficient.
  • Numerical Computation: Shortest path, maximum flow, minimum spanning tree, maximum triangle sum.
  • Enumeration: Hamilton path, maximum clique, Euler path, implicit enumeration (e.g., graph diameter).
  • Division/Decomposition: Additional variants of connectivity, SCC, and graph traversal.
  • Spatial Reasoning: Planarity testing.

Per-task token totals are not reported; only aggregated per-component counts are provided.
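To make the enumeration paradigm concrete, graph diameter can be computed by implicit enumeration: take the maximum BFS eccentricity over all nodes. This is a standard algorithm sketched here for illustration, not code from the GraphPile paper.

```python
from collections import deque

def bfs_eccentricity(adj, src):
    """Longest shortest-path distance from src (BFS; assumes a connected, unweighted graph)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(adj):
    """Implicit enumeration: the diameter is the maximum eccentricity over all nodes."""
    return max(bfs_eccentricity(adj, s) for s in adj)

# Path graph 0-1-2-3 has diameter 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(diameter(adj))  # 3
```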

2. Data Collection and Annotation Methodology

GraphPile consists of both synthetic and real-world graph instances, constructed to maximize task diversity and annotation richness.

Synthetic Problem Generation

Synthetic graphs are generated using Erdős–Rényi (ER) models—both directed and undirected—ranging between 6 and 40 nodes, and represented as adjacency lists, adjacency matrices, or edge lists. This ensures coverage across typical small-to-medium scale graph complexities relevant for algorithmic reasoning.
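A minimal sketch of this generation step, assuming a G(n, p) formulation (the paper does not specify the edge probability p; the value below is illustrative):

```python
import random

def erdos_renyi(n, p, directed=False, seed=None):
    """Sample an Erdős–Rényi G(n, p) graph as an edge list."""
    rng = random.Random(seed)
    edges = []
    for u in range(n):
        for v in (range(n) if directed else range(u + 1, n)):
            if u != v and rng.random() < p:
                edges.append((u, v))
    return edges

def to_adjacency_list(n, edges, directed=False):
    """One of the three representations used by GraphPile (list, matrix, edge list)."""
    adj = {u: [] for u in range(n)}
    for u, v in edges:
        adj[u].append(v)
        if not directed:
            adj[v].append(u)
    return adj

rng = random.Random(0)
n = rng.randint(6, 40)          # node counts drawn from [6, 40]
edges = erdos_renyi(n, p=0.3, seed=0)   # p=0.3 is an assumption
adj = to_adjacency_list(n, edges)
```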

Real-World Graph Sourcing

Real-world graph data are curated from prominent datasets: DBLP (citation network), OpenFlights (air traffic), PubChemQC (chemical compounds), Social-Network Repository, and DBpedia (knowledge graph). For these, graph entities are assigned human-readable names, and problems are posed in naturalistic, domain-relevant language.

Annotation and Naturalization Pipeline

  • CoT (Chain-of-Thought): Expert-authored standard algorithms (e.g., DFS for cycle detection) are executed to produce intermediate states; traces are converted to step-by-step natural language using GPT-4o and subsequently validated.
  • Real-World: Numeric node/edge IDs in CoT problems are mapped to domain names with GPT-4o; outputs are filtered for semantic consistency.
  • PoT (Program-of-Thought): LLMs receive prompts based on NetworkX documentation or direct algorithmic queries. Code samples are validated for correctness and executability before being rewritten into code variants.
  • ToE (Trace-of-Execution): Three independent code implementations are instrumented with textual trace statements. Each execution log is paired with questions regarding intermediate variable states, simulating real debugging scenarios.
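The ToE instrumentation idea can be sketched as follows. The trace format and helper names here are assumptions for illustration; the paper does not publish its exact instrumentation code.

```python
def dfs_with_trace(adj, start):
    """Iterative DFS instrumented with textual trace statements, in the
    spirit of the ToE component (format is illustrative, not the paper's)."""
    trace = []
    visited = set()
    stack = [start]
    while stack:
        u = stack.pop()
        if u in visited:
            continue
        visited.add(u)
        trace.append(f"visit {u}; visited={sorted(visited)}")
        for v in reversed(adj[u]):
            if v not in visited:
                stack.append(v)
    return visited, trace

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
visited, trace = dfs_with_trace(adj, 0)
# A ToE sample would pair `trace` with a question about an intermediate
# state, e.g. "What is `visited` after the third visit step?"
```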

No custom tokenization or normalization is specified beyond each base model's standard BPE or SentencePiece tokenization procedures.

3. Statistical and Structural Properties

GraphPile's statistical profile is characterized primarily at the sample and aggregate level, rather than in fine-grained structural detail.

  • Sample Size: 2,684,675 samples, spanning 10,930,682,053 tokens in total.
  • Node/Edge Sizes: Node count per graph is uniformly sampled with $n \in [6, 40]$. No exact statistics are reported for the average edge count $m$ or the mean degree $\bar{d} = 2m/n$.
  • Distributional Metrics: The paper does not report means, variances, or complexity distributions for sample characteristics (such as time complexity class).

A plausible implication is that while the dataset spans a wide range of interpretive and algorithmic graph problem types, detailed complexity stratification remains to be reported in future releases.
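For reference, the mean-degree formula above can be instantiated directly (an illustrative check, not a statistic from the paper):

```python
def mean_degree(n, edges):
    """Mean degree of an undirected graph: d-bar = 2m / n."""
    m = len(edges)
    return 2 * m / n

# 4-cycle: n = 4, m = 4, so mean degree = 2
print(mean_degree(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))  # 2.0
```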

4. Representative Problem and Annotation Examples

GraphPile's annotation richness is demonstrated by its diverse examples:

  • CoT Example (Cycle Detection):

Input: $V=\{0,\dots,9\}$, $E=\{(0,1),(0,4),(0,7),(0,9),(1,2),(1,7),(1,9),(2,9),(3,5),(3,8),(5,8),(7,9),(8,9)\}$

Solution: Let's think step by step:

  1. Run DFS from node $0$, mark visited.
  2. Explore neighbors in order: $1, 4, 7, 9$.
  3. Along $0 \to 1 \to 2 \to 9$, neighbor $0$ is already visited and is not the parent of $9$.
  4. Cycle $0{-}1{-}2{-}9{-}0$ detected.

$\boxed{\text{Answer: Yes}}$
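The reasoning trace above follows standard DFS cycle detection with parent tracking on an undirected graph; a minimal sketch on the example's edge set (a textbook algorithm, not the paper's exact code):

```python
def has_cycle(n, edges):
    """Undirected cycle detection: DFS that flags a back edge to a
    visited node other than the DFS parent."""
    adj = {u: [] for u in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    visited = set()

    def dfs(u, parent):
        visited.add(u)
        for v in adj[u]:
            if v not in visited:
                if dfs(v, u):
                    return True
            elif v != parent:
                return True  # back edge: visited and not the DFS parent
        return False

    return any(u not in visited and dfs(u, None) for u in range(n))

edges = [(0, 1), (0, 4), (0, 7), (0, 9), (1, 2), (1, 7), (1, 9),
         (2, 9), (3, 5), (3, 8), (5, 8), (7, 9), (8, 9)]
print(has_cycle(10, edges))  # True
```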

  • CoT Example (Shortest Path Weight):

Input: $V=\{0,\dots,21\}$, $E$ with weights $w$.

Task: $d_{\min}(13, 14) = \,?$

Solution: Candidate paths considered:

$P_1: 13 \to 17 \to 16 \to 7 \to 14$, $w = 23$
$P_2: 13 \to 6 \to 17 \to 16 \to 7 \to 14$, $w = 34$
$P_3: 13 \to 3 \to 17 \to 16 \to 7 \to 14$, $w = 28$

Minimum is $23$.

$\boxed{23}$
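The CoT trace enumerates candidate paths in natural language; the underlying computation is a standard shortest-path search. A Dijkstra sketch with hypothetical edge weights (the example above elides its weight list) chosen so that the $13 \to 17 \to 16 \to 7 \to 14$ route costs 23:

```python
import heapq

def dijkstra(adj, src, dst):
    """Shortest weighted path distance via Dijkstra's algorithm."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

# Hypothetical weights for illustration only
adj = {13: [(17, 5), (6, 2)], 6: [(17, 9)], 17: [(16, 8)],
       16: [(7, 6)], 7: [(14, 4)]}
print(dijkstra(adj, 13, 14))  # 23
```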

  • PoT Example (Maximum Matching):

```python
import networkx as nx

G = nx.Graph({0: [6, 7]})  # adjacency dict truncated in the source
node_list1 = [0, 1, 2, 3]
print(nx.bipartite.maximum_matching(G, top_nodes=node_list1))
```

  • ToE Example (Bipartite Matching Trace)

```text
match_right = {}
visited = {1: False, 2: False, 3: False}
dfs(0): visit neighbor 1 → match_right[1] = 0 → return True → max_matching = 1
match_pairs = [(0, 1)]
```

These examples demonstrate the incorporation of algorithmic procedures, intermediate state representation, and deterministic reasoning traces.

5. Application to LLM Continued Pretraining

GraphPile is used to continue-pretrain base LLMs (Llama 3, Llama 3.1, Gemma 2) using the standard autoregressive next-token cross-entropy loss

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$$

There are no auxiliary or contrastive losses.
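The loss can be instantiated numerically. A minimal pure-Python sketch (illustrative, not the training code): each row of `logits` scores the vocabulary at position $t$, and the loss sums the negative log-softmax at the target token.

```python
import math

def next_token_loss(logits, targets):
    """Autoregressive cross-entropy: -sum_t log P(x_t | x_<t)."""
    loss = 0.0
    for row, target in zip(logits, targets):
        z = max(row)  # subtract the max for a numerically stable log-softmax
        log_norm = z + math.log(sum(math.exp(s - z) for s in row))
        loss -= row[target] - log_norm
    return loss

# Uniform logits over a 4-token vocabulary give log(4) nats per position
logits = [[0.0, 0.0, 0.0, 0.0]] * 3
print(next_token_loss(logits, [1, 2, 0]))  # ≈ 3 * log(4)
```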

All dataset components are concatenated and randomly mixed. Training is performed using a uniform sampling regime (without explicit curriculum), batch size 1024, maximum sequence length 8192, three epochs, and a learning rate of $3 \times 10^{-5}$.

6. Empirical Impact and Scientific Significance

Application of GraphPile for CPT leads to measurable gains: up to 4.9% improvement in mathematical reasoning tasks and up to 21.2% on non-mathematical (logical, commonsense) reasoning benchmarks, as validated with the GraphMind models built atop leading LLMs (Zhang et al., 23 Jul 2025). This demonstrates that GPR data can serve as an effective bridge between domain-specific pretraining (such as mathematical reasoning) and universal, pattern-diverse reasoning abilities. The corpus is the first to systematically operationalize graph problem reasoning as a generalization mechanism for LLMs.

GraphPile thus represents a pivotal advance in dataset engineering for robust, adaptable natural language reasoning on complex, structured tasks.
