Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation

Published 3 Jun 2025 in cs.AI | (2506.02648v2)

Abstract: Recent advances in LLMs have demonstrated impressive reasoning capacities that mirror human-like thinking. However, whether LLMs possess genuine fluid intelligence (i.e., the ability to reason abstractly and generalize rules in novel situations) remains an open question. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. To address these limitations, we propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework. DRE-Bench consists of 36 abstract reasoning tasks organized across four cognitive levels, with each task featuring multiple dynamic variants that test the same underlying latent rule. This design enables fine-grained, interpretable, and reliable assessments of fluid intelligence. We evaluate a range of state-of-the-art LLMs, including both general LLMs (GPT-4o, Claude 3.7) and reasoning LLMs (o1, DeepSeek-R1, QwQ, Skywork-OR1). Experimental results reveal that although most LLMs achieve competent and robust performance in low-level cognition, they struggle with high-level cognition and exhibit limited generalization as task complexity grows. Our findings highlight the gap between current LLMs and true human-like fluid intelligence and offer a new path for systematically tracking reasoning progress in LLMs.


Summary

The paper introduces DRE-Bench—a dynamic, cognition-aligned benchmark assessing LLM fluid intelligence through a hierarchical task structure.
It employs a code-driven generator and solver pipeline to create scalable, reproducible tasks that prevent memorization-based performance claims.
Empirical results reveal a decline in LLM performance with increased task complexity, underscoring gaps in abstract reasoning and fluid intelligence.

Rigorous Fluid Intelligence Evaluation in LLMs via Dynamic Reasoning: An Expert Overview of “Truly Assessing Fluid Intelligence of LLMs through Dynamic Reasoning Evaluation” (2506.02648)

Introduction and Problem Formulation

The rapid progress of LLMs has led to improved performance across traditional reasoning and knowledge-based benchmarks. However, the extent to which LLMs possess genuine fluid intelligence (the capacity for abstract reasoning and rule generalization in novel contexts) remains insufficiently analyzed. Current benchmarks often confound fluid reasoning with memorization, rely on static templates, or test domain-specific knowledge. The paper addresses these issues by introducing DRE-Bench, a dynamic, cognition-aligned benchmark that assesses fluid intelligence in LLMs through abstract reasoning tasks organized along a psychology-informed hierarchy.

Contributions and System Design

Cognition-Aligned Hierarchical Task Structure

The benchmark leverages a four-level cognitive framework inspired by established psychological models of intelligence. The task hierarchy is:

Level 1 (Attribute): Simple enumeration (size, count, shape).
Level 2 (Spatial): Spatial transformations (move, rotation, symmetry).
Level 3 (Sequential): Higher-order abstract reasoning (categorization, sorting, planning) requiring multi-step inference.
Level 4 (Conceptual): Physical concepts (gravity, reflection, expansion), demanding application of abstract conceptual knowledge.

Each level incorporates multiple latent rules, with distinct dynamic variables to modulate task complexity (Figure 1).

Figure 1: DRE-Bench’s four-level cognitive task taxonomy with dynamic variants visualized.

Dynamic, Scalable Data Generation

To overcome static benchmark limitations and reduce data contamination, DRE-Bench utilizes a code-driven generator–solver pipeline. Task-specific constraints and latent rules are identified by experts, with LLM-based agents implementing parametric generators and verifiable solvers (Figure 2). This approach produces large-scale, reproducible, and diverse task instances.

Figure 2: DRE-Bench’s human-agent collaborative data pipeline for scalable, verifiable case generation.
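The generator-solver idea can be illustrated with a minimal sketch. This is not the authors' code: the function names, the grid encoding (0 for empty, 1 to 9 for colors), and the choice of a "move right" rule with `steps` as the dynamic variable are all assumptions made for illustration. The point is only that a seeded parametric generator plus a deterministic solver yields reproducible input/ground-truth pairs at any chosen complexity.

```python
import random
from typing import List, Tuple

Grid = List[List[int]]  # assumed encoding: 0 = empty, 1..9 = colored cell


def generate_move_task(seed: int, size: int = 8, steps: int = 2) -> Grid:
    """Hypothetical parametric generator: place one colored cell on an empty grid.

    `steps` is the dynamic variable; the cell is placed so that moving it
    `steps` columns to the right stays inside the grid.
    """
    if not 0 < steps < size:
        raise ValueError("steps must satisfy 0 < steps < size")
    rng = random.Random(seed)  # seeded for reproducibility
    grid = [[0] * size for _ in range(size)]
    row = rng.randrange(size)
    col = rng.randrange(size - steps)  # leave room for the move
    grid[row][col] = rng.randint(1, 9)
    return grid


def solve_move_task(grid: Grid, steps: int) -> Grid:
    """Hypothetical solver: shift every non-empty cell `steps` columns right."""
    size = len(grid)
    out = [[0] * size for _ in range(size)]
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value:
                out[r][c + steps] = value
    return out


def make_instance(seed: int, size: int = 8, steps: int = 2) -> Tuple[Grid, Grid]:
    """Return an (input, ground-truth output) pair for one task variant."""
    inp = generate_move_task(seed, size, steps)
    return inp, solve_move_task(inp, steps)
```

Calling `make_instance(seed, steps=k)` for increasing `k` gives variants of the same latent rule at increasing difficulty, and the same seed always reproduces the same instance.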

Benchmark Properties

Three core attributes distinguish DRE-Bench:

Cognition-aware hierarchy: Tasks are mapped to explicit cognitive functions, enabling fine-grained, interpretable assessment.
Dynamic variant generation: Variable task complexity allows for robust measurement across a spectrum of abstraction and prevents memorization-based “gaming.”
Extensive coverage and scalability: The code-based pipeline ensures high correctness, vast diversity, and extensibility to new rule types.

Figure 3: (a) Example latent rule tasks; (b) DRE-Bench advances over existing benchmarks in hierarchy, scalability, and dynamism; (c) LLM leaderboard showing accuracy vs. stability.

Experimental Analysis

Model Selection and Evaluation Protocol

The evaluation spans 11 strong LLMs, including generalist models (e.g., GPT-4o, Claude 3.7) and reasoning-specialized systems (e.g., o1, DeepSeek-R1, QwQ, Skywork-OR1). The metric is grid-exact accuracy: an output grid counts as correct only if it matches the ground truth exactly. Results are averaged over multiple seeds and variants per task to reduce randomness.
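A minimal sketch of how such a metric could be computed follows. The data layout (a dictionary from variant name to a list of prediction/target pairs, one per seed) and the function names are assumptions for illustration, not the paper's evaluation harness.

```python
from statistics import mean
from typing import Dict, List, Tuple

Grid = List[List[int]]
Run = Tuple[Grid, Grid]  # (predicted grid, ground-truth grid)


def grid_exact(pred: Grid, target: Grid) -> bool:
    """A prediction is correct only if every cell and the shape match."""
    return pred == target


def per_variant_accuracy(runs: Dict[str, List[Run]]) -> Dict[str, float]:
    """Average exact-match accuracy over seeds, separately for each variant."""
    return {
        variant: mean(1.0 if grid_exact(p, t) else 0.0 for p, t in pairs)
        for variant, pairs in runs.items()
    }


def task_accuracy(runs: Dict[str, List[Run]]) -> float:
    """Task-level score: unweighted mean of the per-variant accuracies."""
    return mean(per_variant_accuracy(runs).values())
```

Averaging per variant first keeps a variant with many seeds from dominating the task score; whether the paper weights variants this way is not stated in this overview.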

Hierarchical Cognitive Performance

Empirical results show a monotonic performance decline with increasing task level and complexity. Reasoning-oriented models such as o1 and DeepSeek-R1 lead across all levels, especially in spatial and sequential reasoning, yet remain substantially sub-human, especially at Level 4. Other notable findings:

Level-1/2: Most LLMs perform robustly on attribute and some spatial tasks, but generalist models (e.g., GPT-4o) exhibit early instability.
Level-3/4: Only reasoning-enhanced LLMs maintain reasonable performance as task complexity increases. All models essentially fail on Level-4 conceptual (physical reasoning) tasks, confirming a gap to true fluid intelligence.

Figure 4: Accuracy curves for each model as task complexity increases by cognitive level: robust at Level-1, increasingly brittle at higher levels.

Human annotator studies confirm that the framework’s hierarchical progression reflects increasing cognitive demand, with humans outperforming all tested LLMs at each level.

Accuracy-Stability Trade-Offs

Scatter analysis of accuracy vs. variance underscores that only a handful of models are both accurate and stable across dynamic variants on Level-1/2. At higher levels, the majority of LLMs are neither accurate nor stable, further confirming their fragile abstraction abilities (Figure 5).

Figure 5: Model accuracy vs. output variance: the top-left region indicates high accuracy with high stability; most models cluster away from this region at high cognitive levels.
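One way to obtain the two coordinates of such a scatter plot is sketched below. It assumes that stability is measured as the population variance of a model's accuracy across the dynamic variants of a task; the overview does not specify the exact variance definition, so treat this as an illustrative choice.

```python
from statistics import mean, pvariance
from typing import Sequence, Tuple


def accuracy_and_stability(variant_accuracies: Sequence[float]) -> Tuple[float, float]:
    """Summarize one model on one task.

    Returns (mean accuracy, variance across variants). A lower variance
    means the model behaves more consistently as the dynamic variable changes.
    """
    return mean(variant_accuracies), pvariance(variant_accuracies)


# Illustrative numbers only, not results from the paper.
stable_model = accuracy_and_stability([0.9, 0.9, 0.9])   # high accuracy, zero variance
brittle_model = accuracy_and_stability([1.0, 0.5, 0.0])  # accuracy collapses with complexity
```

Under this definition, a model that is accurate on easy variants but fails on hard ones shows both a lower mean and a higher variance, which places it away from the accurate-and-stable region.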

Ablation: Context, Multimodality, and Inference Time

In-context sample size: Marginal accuracy benefits, primarily when models are near mastery or on inherently easy tasks. For higher levels, performance plateaus quickly (Figure 6).
Visual input (multimodality): Surprisingly, augmenting problems with visual representations yields inconsistent or negative gains, a finding in contrast with human cognition, highlighting a current architectural or training limitation (Figure 7).
Inference time scaling: Longer inference is effective for low-level tasks but does not bridge the gap at higher levels, suggesting that model limitations rather than computational budget are the primary bottleneck (Figure 8).

Figure 6: Effect of increasing in-context example count on DeepSeek-R1; diminishing returns beyond few-shot regimes.

Figure 8: o1’s accuracy and latency as a function of reasoning complexity, evidencing diminishing benefit from increased inference time.

Error Analysis and Cognitive Divergences

Qualitative inspection reveals subtle errors at Levels 1/2, and outright rule failures with disorganized outputs at Levels 3/4 (Figure 9). An orientation analysis shows unexpected anisotropies in model behavior (e.g., higher accuracy on vertical than on horizontal variants), exposing cognitive idiosyncrasies that diverge from human intuition.

Figure 9: Visualization of LLM error types across cognitive levels, reflecting increasing misalignment and rule confusion.

Implications and Future Directions

The study establishes that even the strongest LLMs (including those specifically engineered for reasoning) lack robust, transferable fluid intelligence. Key takeaways for the community:

Benchmarking: Static benchmarks are inadequate for measuring fluid intelligence; dynamic, rule-aligned benchmarks are essential for progress.
Model design: There is substantial headroom for architectural innovation or training algorithm redesign to foster true abstraction, multi-step planning, and conceptual mapping.
Evaluation: Incorporating task difficulty, output stability, and systematic error analysis should be standard in intelligence assessments.
Human cognition mapping: The divergence between model and human strategies, especially in spatial or conceptual domains, signals novel directions for hybrid cognitive architectures or inductive bias engineering.

Conclusion

DRE-Bench delivers a rigorous, interpretable, and contamination-resilient benchmark suite for evaluating fluid intelligence in LLMs (2506.02648). Empirical results demonstrate fundamental gaps between current models and human-like reasoning, especially as abstraction and dynamism increase. The framework offers a basis for both measuring and driving the next phase of LLM research, emphasizing the need for models that internalize, rather than merely memorize, generalizable rules and concepts.
