G1: Teaching LLMs to Reason on Graphs with Reinforcement Learning

Published 24 May 2025 in cs.LG, cs.AI, and stat.ML | (2505.18499v3)

Abstract: Although LLMs have demonstrated remarkable progress, their proficiency in graph-related tasks remains notably limited, hindering the development of truly general-purpose models. Previous attempts, including pretraining graph foundation models or employing supervised fine-tuning, often face challenges such as the scarcity of large-scale, universally represented graph data. We introduce G1, a simple yet effective approach demonstrating that Reinforcement Learning (RL) on synthetic graph-theoretic tasks can significantly scale LLMs' graph reasoning abilities. To enable RL training, we curate Erdős, the largest graph reasoning dataset to date comprising 50 diverse graph-theoretic tasks of varying difficulty levels, 100k training data and 5k test data, all derived from real-world graphs. With RL on Erdős, G1 obtains substantial improvements in graph reasoning, where our finetuned 3B model even outperforms Qwen2.5-72B-Instruct (24x size). RL-trained models also show strong zero-shot generalization to unseen tasks, domains, and graph encoding schemes, including other graph-theoretic benchmarks as well as real-world node classification and link prediction tasks, without compromising general reasoning abilities. Our findings offer an efficient, scalable path for building strong graph reasoners by finetuning LLMs with RL on graph-theoretic tasks, which combines the strengths of pretrained LLM capabilities with abundant, automatically generated synthetic data, suggesting that LLMs possess graph understanding abilities that RL can elicit successfully. Our implementation is open-sourced at https://github.com/PKU-ML/G1, with models and datasets hosted on Hugging Face collections https://huggingface.co/collections/PKU-ML/g1-683d659e992794fc99618cf2 for broader accessibility.

Authors (5)

Summary

- The paper introduces a reinforcement learning framework (G1) that uses synthetic graph tasks and a large curated dataset to boost LLMs' graph reasoning capabilities.
- It employs a two-phase training pipeline: an optional supervised fine-tuning warm-up followed by reinforcement learning with Group Relative Policy Optimization (GRPO).
- Empirical results show that smaller G1 models outperform larger baselines on various graph tasks, achieving strong generalization and transferability.

Motivation and Problem Statement

LLMs have demonstrated strong general reasoning capabilities, but their performance on graph-theoretic tasks remains suboptimal, with state-of-the-art models achieving only moderate accuracy on even basic graph connectivity problems. This limitation is critical, as graph reasoning underpins a wide range of applications in science, engineering, and knowledge representation. Prior approaches—such as instruction tuning, preference alignment, and graph foundation model pretraining—are constrained by the lack of large-scale, diverse, and universally represented graph datasets, and often fail to generalize across graph types and encoding schemes.

The G1 framework addresses these challenges by leveraging Reinforcement Learning (RL) on synthetic graph-theoretic tasks, demonstrating that RL can elicit latent graph reasoning abilities in pretrained LLMs without reliance on human-annotated data. The approach is enabled by the construction of the largest graph reasoning dataset to date, comprising 50 diverse tasks and 100k training samples derived from real-world graphs.

Dataset Construction and Task Diversity

The Erdős dataset is curated from real-world graphs in the Network Repository, with subgraphs sampled to fit LLM context windows (5–35 nodes). Tasks span a spectrum of complexity, from basic properties (node counting, edge existence) to NP-hard problems (maximal independent set, traveling salesman, isomorphic mapping). Each task is accompanied by ground-truth answers or algorithmic verification programs, enabling rule-based reward attribution for RL.
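The sampling step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the BFS-based sampler, the function names `sample_connected_subgraph` and `make_example`, and the toy ring graph standing in for a Network Repository graph are all assumptions made for the example.

```python
import random
from collections import deque


def sample_connected_subgraph(adj, min_nodes=5, max_nodes=35, rng=None):
    """Grow a connected node set by BFS from a random seed node.

    `adj` maps each node to an iterable of neighbours (undirected graph).
    Returns (nodes, edges) for the induced subgraph. If the seed's component
    is smaller than the drawn target size, fewer nodes are returned.
    """
    rng = rng or random.Random()
    target = rng.randint(min_nodes, max_nodes)  # size drawn from the 5-35 range
    seed = rng.choice(sorted(adj))
    seen, queue = {seed}, deque([seed])
    while queue and len(seen) < target:
        u = queue.popleft()
        neighbours = sorted(adj[u])
        rng.shuffle(neighbours)  # randomise which neighbours are kept
        for v in neighbours:
            if v not in seen and len(seen) < target:
                seen.add(v)
                queue.append(v)
    # Induced edges, each undirected edge listed once as (small, large).
    edges = sorted((u, v) for u in seen for v in adj[u] if v in seen and u < v)
    return sorted(seen), edges


def make_example(nodes, edges):
    """Pair a task with an automatically computed ground-truth answer."""
    return {"task": "edge_count", "nodes": nodes, "edges": edges, "answer": len(edges)}


# Toy stand-in for a real network: a 60-node ring with a few chords.
ring = {i: {(i - 1) % 60, (i + 1) % 60} for i in range(60)}
for a, b in [(0, 30), (10, 40), (20, 50)]:
    ring[a].add(b)
    ring[b].add(a)

nodes, edges = sample_connected_subgraph(ring, rng=random.Random(0))
example = make_example(nodes, edges)
```

Because the answer is computed by a program rather than written by a person, the same recipe can produce as many labelled examples as needed.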

Graph encoding is standardized to edge list format, facilitating consistent input representation. The dataset supports both training and benchmarking, and is open-sourced for reproducibility and further research.
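As an illustration of an edge list encoding, the sketch below turns a small graph into a text prompt. The exact wording of the prompt and the helper name `encode_edge_list` are hypothetical; the summary only states that an edge list format is used.

```python
def encode_edge_list(nodes, edges, question):
    """Serialise a graph as plain text: a node list, an edge list, then the question.

    The sentence templates here are illustrative assumptions, not the
    dataset's actual prompt format.
    """
    node_text = ", ".join(str(n) for n in nodes)
    edge_text = ", ".join(f"({u}, {v})" for u, v in edges)
    return (
        f"Given an undirected graph with nodes: {node_text}.\n"
        f"The edges are: {edge_text}.\n"
        f"Question: {question}"
    )


prompt = encode_edge_list([0, 1, 2], [(0, 1), (1, 2)], "How many edges are in the graph?")
print(prompt)
```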

RL Training Pipeline and Reward Design

G1 employs a two-phase training pipeline:

- Supervised Fine-Tuning (SFT): An optional warm-up phase using either direct question-answer pairs (Direct-SFT) or chain-of-thought trajectories (CoT-SFT) generated via rejection sampling from a stronger teacher model. This phase is critical for initializing the model on challenging tasks where base accuracy is low.
- Reinforcement Learning (RL): The core phase utilizes Group Relative Policy Optimization (GRPO), rewarding correct rollouts based on strict value matching, Jaccard index for set answers, and algorithmic verification for multi-solution tasks. The KL penalty to the reference policy prevents overfitting and catastrophic forgetting.
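The first two reward rules named above can be sketched as small scoring functions. This is a minimal sketch under assumptions: the function names are hypothetical, answers are assumed to be already parsed out of the model's response, and returning 1.0 for two empty sets is a convention chosen here rather than something the source specifies.

```python
def exact_match_reward(predicted, truth):
    """Strict value matching: full reward only for an identical answer."""
    return 1.0 if predicted == truth else 0.0


def jaccard_reward(predicted, truth):
    """Jaccard index |P & T| / |P | T| for set-valued answers (partial credit)."""
    p, t = set(predicted), set(truth)
    if not p and not t:
        return 1.0  # assumption: two empty sets count as a perfect match
    return len(p & t) / len(p | t)


# A set answer that shares 2 of 4 distinct elements with the truth.
partial = jaccard_reward([1, 2, 3], [2, 3, 4])
```

Tasks with many valid solutions need a different kind of check: a program that verifies the proposed solution against the graph instead of comparing it with a single stored answer.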

The RL phase is highly data-efficient, requiring only 300 steps with batch size 512 on 8×A800 GPUs for 3B/7B models. Hyperparameters are tuned for stability and exploration, with entropy regularization to encourage diverse solution strategies.
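To make the "group relative" part of GRPO concrete, the sketch below normalizes the rewards of several rollouts sampled for one prompt by the group's mean and standard deviation. It is a simplified illustration: the function name is hypothetical, the use of the population standard deviation and the `eps` value are assumptions, and the clipped policy-ratio objective and KL term of full GRPO are omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout relative to the other rollouts for the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt: two correct (reward 1), two wrong (reward 0).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every rollout fails, all advantages are zero and the prompt gives no signal.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The second case shows why a warm-up phase matters on tasks where base accuracy is low: when no rollout in a group earns a reward, the group contributes nothing to the policy update.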

Empirical Results and Scaling Behavior

G1 models improve substantially over baselines and prior graph-specialized models across all difficulty levels. Notably, G1-3B and G1-7B outperform Qwen2.5-72B-Instruct and Llama-3.1-70B-Instruct by wide margins, despite having roughly 10× to 24× fewer parameters. G1-7B achieves 66.16% average accuracy, surpassing GPT-4o-mini by 18.56 percentage points and matching OpenAI o3-mini.

Figure 1: An intuitive illustration of the differences in solution strategies employed by Qwen2.5-3B-Instruct, G1-Zero-3B, and G1-3B for a shortest path problem.

Scaling to larger models (G1-Zero-32B) and larger graphs (36–100 nodes) demonstrates robust zero-shot generalization, with G1-32B achieving 75.06% average accuracy and strong transfer to harder problem categories. The approach is bottlenecked only by LLM context window limits, not by reasoning capability.

Transferability and Generalization

G1 models generalize strongly to unseen graph tasks, domains, and encoding schemes, outperforming models specifically trained on GraphWiz and GraphArena benchmarks. Transfer to real-world node classification and link prediction tasks (Cora, PubMed) is also robust, with G1-7B achieving 87.29% average accuracy.

Importantly, RL training on graph-theoretic tasks does not compromise general reasoning ability on mathematics (GSM8K, MATH) and multi-task benchmarks (MMLU-Pro). In several cases, G1-7B surpasses the base model on non-STEM disciplines, indicating synergistic improvement in reasoning skills.

Training Analysis and Strategy Optimization

Analysis of training factors reveals that Direct-SFT is a strong baseline for pattern memorization, but RL (especially with CoT-SFT initialization) confers superior scaling and generalization. Reward signal imbalance across task difficulty can be mitigated by dynamic difficulty scheduling or reward weighting, with models trained exclusively on hard tasks generalizing back to easier ones.

Case studies on shortest path problems show that RL-trained models adapt their reasoning strategies, favoring BFS and intuitive search over computationally complex algorithms like Dijkstra, aligning solution methods with model capabilities.
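For reference, breadth-first search on an unweighted graph looks like the sketch below. It is a generic textbook implementation written for this summary, not code from the paper; the function name and the toy adjacency list are illustrative.

```python
from collections import deque


def bfs_shortest_path(adj, source, target):
    """Return one shortest path (fewest edges) from source to target, or None."""
    parent = {source: None}  # doubles as the visited set
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            # Walk parent pointers back to the source, then reverse.
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
path = bfs_shortest_path(adj, 0, 4)
```

BFS only needs a queue and a visited set, which is much less bookkeeping to carry through a written chain of reasoning than Dijkstra's priority queue and distance updates. When all edges have equal weight, both methods return a path of the same length.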

Limitations and Future Directions

G1 inherits sample inefficiency from GRPO, requiring extensive rollouts for NP-hard tasks. Generalization to highly domain-specific applications (e.g., molecular property prediction, tabular/time-series data) remains untested. Scaling to graphs with thousands of nodes is currently limited by context window constraints, but advances in long-context LLMs may alleviate this.

Future work should explore dynamic difficulty scheduling, integration of visual graph inputs, and adaptation to practical domains such as logistics and knowledge graph reasoning.

Conclusion

G1 establishes RL on synthetic graph-theoretic tasks as an efficient, scalable paradigm for eliciting graph reasoning abilities in LLMs. The approach combines the strengths of pretrained LLMs with abundant, automatically generated data, achieving strong performance, generalization, and transferability. These results suggest a shift away from reliance on heterogeneous real-world graph datasets, paving the way for versatile AI systems capable of sophisticated reasoning across structured modalities.

Explain it Like I'm 14

What is this paper about?

This paper shows a new way to teach LLMs to solve “graph” problems. A graph is just a set of dots (called nodes) connected by lines (called edges). Think of:

- people and their friendships,
- cities and the roads between them,
- web pages and the links that connect them.

Many real-world problems are graph problems, like finding the fastest route, spotting communities, or predicting new connections. Today’s LLMs are good at language but still struggle with graph reasoning. The authors propose G1, a simple training method that uses reinforcement learning (RL) to help LLMs get much better at graph puzzles—without needing lots of human-written answers.

What questions did the researchers ask?

They focused on a few clear questions:

- Can we make LLMs better at graph reasoning using reinforcement learning (practice with feedback), instead of lots of human-labeled data?
- If we train on many small, synthetic (computer-made) graph puzzles, will the model also get better at new, unseen tasks and on real-world graphs?
- Can we do all this without hurting the model’s general skills in math and other subjects?

How did they study it?

They built a large set of graph puzzles and trained LLMs with RL, giving scores when the model’s answers were correct. Here’s the approach in everyday language:

Building a big “practice set” of graph puzzles

- They collected 50 different kinds of graph problems (from easy to very hard), such as:
  - counting nodes or edges (easy),
  - finding shortest paths or cycles (medium/hard),
  - solving tougher puzzles like maximum flow or traveling salesperson (very hard).
- They created 100,000 practice questions and 5,000 test questions.
- Instead of using random toy graphs, they sampled small subgraphs from real networks (like citation networks or social graphs) to make practice more realistic.
- They used a standard graph software library (NetworkX) to compute correct answers automatically.

Training with reinforcement learning (RL)

- RL is like practicing a game: the model tries an answer; if it’s right, it gets a reward (points); if not, it doesn’t. Over time, it learns strategies that earn more points.
- Because the problems have clear right/wrong solutions, the computer can instantly check answers, with no humans needed.
- They designed simple reward rules:
  - Exact match for single-number answers,
  - Partial credit for “set” answers (using overlap, like the Jaccard index),
  - Program-based checks for problems with many possible correct solutions (like “Is this a valid Hamiltonian path?”).
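Here is what a program-based check could look like for the Hamiltonian path question. It is a small sketch written for this explanation, with a made-up function name and a made-up four-node graph; the authors' actual checker may differ.

```python
def is_valid_hamiltonian_path(nodes, edges, path):
    """True if `path` visits every node exactly once using only real edges."""
    edge_set = {frozenset(e) for e in edges}  # undirected: (a, b) equals (b, a)
    # Every node must appear, and none may repeat.
    if len(path) != len(nodes) or set(path) != set(nodes):
        return False
    # Each consecutive pair in the path must be joined by an edge.
    return all(frozenset((a, b)) in edge_set for a, b in zip(path, path[1:]))


nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]

ok_1 = is_valid_hamiltonian_path(nodes, edges, [0, 1, 2, 3])  # valid
ok_2 = is_valid_hamiltonian_path(nodes, edges, [3, 2, 0, 1])  # a different valid answer
bad = is_valid_hamiltonian_path(nodes, edges, [0, 2, 1, 3])   # no edge between 1 and 3
```

Two different answers both pass, which is why a checker works better here than comparing against one stored "right answer".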

A small warm-up step (optional)

Sometimes the model is so lost at the start that it never gets rewards. To fix this, the authors sometimes gave it a short “warm-up” using supervised fine-tuning (SFT):
- Direct-SFT: show question and final answer,
- CoT-SFT: show question, step-by-step reasoning (“chain of thought”), and final answer.
Then they switched to RL, which is what really improved graph reasoning.

What did they find?

Training with RL on these graph puzzles made a big difference. Here are the highlights:

- Strong gains on graph tasks:
  - Their 7B-parameter G1 model beat many larger or popular models on the 50-task test set.
  - Even their small 3B-parameter G1 model outperformed a much larger 72B model on these graph problems.
- Generalizes to new situations:
  - It worked well on other graph benchmarks (GraphWiz, GraphArena) that used different styles of writing graphs (like names instead of numbers) and included unseen tasks.
  - It handled larger graphs than it saw during training.
- Helps on real-world graph tasks:
  - It improved on node classification and link prediction in citation networks (Cora, PubMed), which mix text and graph structure.
- Keeps or slightly improves general reasoning:
  - After RL on graphs, the model’s scores on math (GSM8K, MATH) and broad knowledge tests (MMLU-Pro) stayed strong or improved, meaning the model didn’t “forget” other skills.
- Learns smarter strategies:
  - For shortest path problems, the RL-trained model moved away from clumsy, error-prone methods and leaned into cleaner, more reliable strategies (like BFS for unweighted graphs), much like a student discovering which tools fit best.

Why does this matter?

- It shows a scalable, low-cost way to teach LLMs graph reasoning using computer-generated practice and automatic checks, instead of expensive human labels.
- It suggests today’s LLMs already have some buried “graph sense” that RL can bring out.
- Better graph reasoning matters everywhere: social networks (community detection, recommendations), biology (protein interaction networks), maps and routing, cybersecurity, and more.
- The method didn’t hurt general intelligence, and sometimes even helped, so it’s a promising step toward more versatile, general-purpose AI.

In short: By letting an LLM “play” thousands of graph puzzles and rewarding correct solutions, G1 turns an LLM into a strong graph reasoner that generalizes to new tasks and real data, without needing humans to hand-craft lots of training answers.
