Generalization in LLM Problem Solving: The Case of the Shortest Path

Published 16 Apr 2026 in cs.AI and cs.LG | (2604.15306v1)

Abstract: Whether LLMs can systematically generalize remains actively debated. Yet empirical performance is jointly shaped by multiple factors such as training data, training paradigms, and inference-time strategies, making failures difficult to interpret. We introduce a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. The setup enables clean separation of these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems. We find that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability. We further analyze how distinct stages of the learning pipeline influence systematic problem-solving: for example, data coverage sets capability limits; reinforcement learning improves training stability but does not expand those limits; and inference-time scaling enhances performance but cannot rescue length-scaling failures.


Authors (4)

Summary

The paper demonstrates that LLMs achieve rule abstraction, with over 90% success on spatial transfer, but struggle with length scaling due to recursive instability.
It shows that broad training data coverage and structured question diversity are critical for effective spatial generalization via supervised fine-tuning.
The study finds that while reinforcement learning stabilizes training and inference-time strategies raise performance, neither addresses the inherent compositional limits in path-length generalization.

Systematic Generalization of LLMs in the Shortest Path Problem

Introduction

The study "Generalization in LLM Problem Solving: The Case of the Shortest Path" (2604.15306) presents a controlled examination of the systematic generalization capabilities of transformer-based LLMs on composable sequential optimization problems (SOPs), instantiated through the canonical shortest-path task. The central objective is to decouple and rigorously analyze the contributions of data coverage, training paradigms, and inference strategies in governing two fundamental axes of generalization: spatial transfer (structural extrapolation to unseen maps) and length scaling (compositional recursion to longer horizons).

Problem Formulation and Experimental Design

The paper proposes a synthetic navigational environment where the SOP is defined over grid maps with blocked edges, allowing for complete control over primitive exposure (nodes), combinatorial diversity (question sets), and trajectory optimality (shortest paths). The compositionality of the shortest-path task ensures that optimal sub-paths concatenate to global optima, yielding a tractable testbed for probing rule abstraction versus surface pattern learning.
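To make the setup concrete, the following is a minimal sketch of a shortest-path oracle on a grid with blocked edges, using breadth-first search. It is not the paper's code: the function name `shortest_path`, the `blocked` representation (a set of `frozenset` node pairs), and the coordinate convention (N increases y, E increases x) are all assumptions made for illustration.

```python
from collections import deque

# Move letters and their (dx, dy) offsets; the axis orientation is an assumption.
MOVES = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}

def shortest_path(width, height, blocked, start, goal):
    """Return one shortest move string from start to goal, or None if unreachable.

    blocked is a set of frozenset({a, b}) pairs marking impassable edges
    (a hypothetical encoding, not taken from the paper).
    """
    parent = {start: None}  # node -> (previous node, move taken to get here)
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        x, y = node
        for move, (dx, dy) in MOVES.items():
            nxt = (x + dx, y + dy)
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                continue  # off the grid
            if frozenset((node, nxt)) in blocked or nxt in parent:
                continue  # impassable edge or already visited
            parent[nxt] = (node, move)
            queue.append(nxt)
    if goal not in parent:
        return None
    # Walk back from the goal to the start, collecting moves in reverse.
    moves = []
    node = goal
    while parent[node] is not None:
        node, move = parent[node]
        moves.append(move)
    return "".join(reversed(moves))
```

Because BFS explores nodes in order of distance, every prefix of the returned path is itself a shortest path to the node it reaches, which is the compositional property the task relies on.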

Models are trained in two regimes: supervised fine-tuning (SFT) and reinforcement learning (RL, specifically Dr.GRPO). Prompts are direct-answer: the start and end nodes are specified, and the model must generate an optimal sequence of moves (E, W, N, S). The axes of generalization are defined as:

Spatial transfer: Evaluation on entirely disjoint node/edge sets from those observed during training.
Length scaling: Evaluation on path lengths strictly exceeding those seen during training.

Stringent controls ensure test distribution disjointness, countering the semantic overlap pitfalls common in natural language tasks.
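The two split criteria above can be checked mechanically. The sketch below assumes each example is a hypothetical `(start, moves)` pair and treats "disjoint" at the level of visited nodes; the paper's actual split procedure (including how edges are handled) is not described here, so this is an illustration only.

```python
# Move letters and their (dx, dy) offsets; the axis orientation is an assumption.
MOVES = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}

def visited_nodes(start, moves):
    """List every node touched when following a move string from start."""
    x, y = start
    nodes = [(x, y)]
    for m in moves:
        dx, dy = MOVES[m]
        x, y = x + dx, y + dy
        nodes.append((x, y))
    return nodes

def split_checks(train, test):
    """Report whether a test set satisfies each generalization criterion.

    train and test are lists of (start, moves) pairs (hypothetical format).
    """
    train_nodes = {n for s, mv in train for n in visited_nodes(s, mv)}
    test_nodes = {n for s, mv in test for n in visited_nodes(s, mv)}
    max_train_len = max(len(mv) for _, mv in train)
    return {
        # Spatial transfer: no test node was ever seen in training.
        "spatially_disjoint": train_nodes.isdisjoint(test_nodes),
        # Length scaling: every test path is strictly longer than any training path.
        "strictly_longer": all(len(mv) > max_train_len for _, mv in test),
    }
```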

Main Findings

Asymmetry Between Spatial and Length Generalization

Spatial transfer is robust: Models trained with sufficient primitive coverage and question diversity achieve near-perfect generalization (SR > 90%) to novel maps, indicating rule abstraction beyond memorization.

Length scaling fails categorically: All models, regardless of architecture or training paradigm, suffer from sharply reduced success rates as path lengths exceed the training regime. This degradation is dominated by recursive instability rather than simple error accumulation, as evidenced by a substantial drop in the conditional probability of generating a correct long path even when all subpaths are correctly composable.
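One way such a conditional success rate could be computed from evaluation logs is sketched below. The record format (a flag for the long path plus a list of flags for its sub-paths) and the function name `conditional_success` are assumptions; the paper's exact metric definition may differ.

```python
def conditional_success(records):
    """Estimate P(long path correct | every sub-path correct).

    records is a list of (long_ok, subpaths_ok) pairs, where long_ok is a bool
    and subpaths_ok is a list of bools (a hypothetical log format).
    Returns None when no record has all sub-paths correct.
    """
    # Keep only the queries whose sub-paths the model solved in isolation.
    eligible = [long_ok for long_ok, subs in records if all(subs)]
    if not eligible:
        return None
    return sum(eligible) / len(eligible)
```

A value well below 1 on this statistic means the model fails to compose segments it can solve individually, which is the pattern the summary attributes to recursive instability.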

Data Properties: The Dominant Role of Coverage

Experiments systematically vary the allocation of a fixed training budget between unique questions (covering more start-end pairs) and solution diversity (multiple trajectories per question). Under SFT, maximizing the number of unique questions dramatically improves spatial generalization; a single solution per question suffices when coverage is broad.

Coverage, quantified as the fraction of node primitives included in training questions, dictates an upper bound on spatial transfer. Modest diversity in pairings (distinct endpoints per start node) is needed for efficient rule abstraction, but excessive diversity offers diminishing returns, particularly when coverage is low. These results generalize qualitatively to real-world domains, as corroborated by MathQA experiments on Qwen2.5-7B-Instruct with curriculum-constrained budgets.
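The two data statistics discussed here can be sketched as follows. This assumes coverage counts only question endpoints (not intermediate path nodes) and that pairing diversity is the mean number of distinct endpoints per start node; both readings, and the name `coverage_and_diversity`, are assumptions rather than the paper's definitions.

```python
from collections import defaultdict

def coverage_and_diversity(questions, all_nodes):
    """Return (coverage, mean distinct endpoints per start node).

    questions is an iterable of (start, end) node pairs; all_nodes is the
    collection of nodes in the map. Both formats are hypothetical.
    """
    nodes = set(all_nodes)
    seen = set()
    ends_by_start = defaultdict(set)
    for start, end in questions:
        seen.update((start, end))
        ends_by_start[start].add(end)
    # Fraction of map nodes that appear as an endpoint of some training question.
    coverage = len(seen & nodes) / len(nodes)
    # Average number of distinct endpoints paired with each start node.
    diversity = sum(len(ends) for ends in ends_by_start.values()) / len(ends_by_start)
    return coverage, diversity
```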

Scaling Generalization: The Limitation of Exemplars

Length extrapolation is not rescued by increased primitive coverage or question diversity alone. Exposure to slightly longer paths near the target length provides critical adaptation; merely increasing the number of shorter or arbitrarily long (off-target) paths does not yield equivalent gains, and can even degrade performance. This result illustrates the necessity for curriculum-structured examples in facilitating length generalization beyond the training horizon.

Training Paradigms: Efficiency and Stability

RL (Dr.GRPO), either from scratch or warm-started from SFT, does not surpass the SFT upper bound on either spatial transfer or length scaling. Its primary utility is to stabilize training and prevent overfitting under prolonged optimization or noisy data, rather than to unlock latent reasoning capabilities. Detailed error analysis confirms that SFT and RL share identical error modes and distributions.

Inference-Time Strategies: Limited Efficacy on Extrapolation

Test-time search methods (e.g., self-consistency, best-of-N sampling, shortest-of-N selection) shift the empirical performance upward but do not fundamentally alter the scaling failure profile. RL models further contract the diversity of solution trajectories, constraining the benefits available via inference selection. This indicates that length scaling is an intrinsic limitation rather than a failure to surface existing solutions through search.
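Two of the selection rules named above can be sketched in a few lines. The validity check, the grid encoding, and the decision to filter invalid paths before picking the shortest are assumptions for illustration; the paper's exact selection procedures are not specified here.

```python
from collections import Counter

# Move letters and their (dx, dy) offsets; the axis orientation is an assumption.
MOVES = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}

def reaches_goal(moves, start, goal, blocked, width, height):
    """True if the move string stays on the grid, avoids blocked edges, and ends at goal."""
    x, y = start
    for m in moves:
        if m not in MOVES:
            return False
        dx, dy = MOVES[m]
        nx, ny = x + dx, y + dy
        if not (0 <= nx < width and 0 <= ny < height):
            return False
        if frozenset(((x, y), (nx, ny))) in blocked:
            return False
        x, y = nx, ny
    return (x, y) == goal

def self_consistency(samples):
    """Majority vote over sampled answers; ties go to the answer seen first."""
    return Counter(samples).most_common(1)[0][0]

def shortest_of_n(samples, is_valid):
    """Pick the shortest sample that passes is_valid, or None if none does."""
    valid = [s for s in samples if is_valid(s)]
    return min(valid, key=len) if valid else None
```

Both rules can only choose among answers the model actually samples, so if no sample is a correct long path, selection cannot recover one. This is consistent with the observation that these methods raise accuracy without changing the length-scaling failure profile.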

Theoretical and Practical Implications

The findings establish that spatial generalization in transformer LLMs emerges reliably given sufficient coverage of primitives and structurally diverse questions. This suggests the formation of flexible latent operators (attention as hypernetworks), as supported by probe analyses and recent theory. The scaling asymmetry highlights that recursive compositional stability, which is necessary for reasoning chains and symbolic algorithm induction, remains elusive even in synthetic SOP domains, and is not overcome by current SFT or RL methods, nor by stochastic decoding.

Pragmatically, for curriculum and data construction in reasoning-heavy domains, the paper recommends prioritizing operation/element set coverage and moderate compositional diversity under limited budgets, rather than extensive solution diversity. The benefits of RL are realized under noisy, heterogeneous, or shifting data distributions, but not when SFT training is exhaustive and principled. The results question whether reinforcement learning, as currently applied, advances the frontier of length generalization.

Future research should investigate architectures or training paradigms that explicitly regularize or facilitate recursive compositionality (e.g., recurrence, explicit algorithmic priors, hybrid neuro-symbolic approaches), as well as theoretical frameworks that precisely characterize the role of network depth, positional encoding, and optimizer choice in the emergence or failure of scaling generalization.

Conclusion

The study rigorously demonstrates that, in the context of compositional sequential optimization, LLMs can systematically generalize to novel structures but fundamentally fail to robustly extrapolate to longer compositional depths. The ultimate ceiling for generalization is set by primitive coverage and curriculum-aligned data, not by reinforcement fine-tuning or inference-time search, and length scaling remains a structurally unsolved challenge. These conclusions delimit the current capacities of LLMs for systematic algorithmic reasoning and inform both practical dataset curation and the design of next-generation architectures.
