Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents

Published 13 Feb 2026 in cs.AI and cs.CL | (2602.12662v1)

Abstract: LLMs are increasingly deployed as autonomous agents for multi-turn decision-making tasks. However, current agents typically rely on fixed cognitive patterns: non-thinking models generate immediate responses, while thinking models engage in deep reasoning uniformly. This rigidity is inefficient for long-horizon tasks, where cognitive demands vary significantly from step to step, with some requiring strategic planning and others only routine execution. In this paper, we introduce CogRouter, a framework that trains agents to dynamically adapt cognitive depth at each step. Grounded in ACT-R theory, we design four hierarchical cognitive levels ranging from instinctive responses to strategic planning. Our two-stage training approach includes Cognition-aware Supervised Fine-tuning (CoSFT) to instill stable level-specific patterns, and Cognition-aware Policy Optimization (CoPO) for step-level credit assignment via confidence-aware advantage reweighting. The key insight is that appropriate cognitive depth should maximize the confidence of the resulting action. Experiments on ALFWorld and ScienceWorld demonstrate that CogRouter achieves state-of-the-art performance with superior efficiency. With Qwen2.5-7B, it reaches an 82.3% success rate, outperforming GPT-4o (+40.3%), OpenAI-o3 (+18.3%), and GRPO (+14.0%), while using 62% fewer tokens.

Summary

  • The paper introduces CogRouter, a dynamic cognitive depth adaptation framework that achieves 82.3% task success and reduces token use by up to 62%.
  • It employs a two-stage training process—CoSFT and CoPO—to allocate four ACT-R inspired cognitive levels in a confidence-aware, context-sensitive manner.
  • The approach overcomes cognitive rigidity by scaling reasoning with task complexity, preventing mode collapse typical in fixed-level baseline models.

Step-Level Cognitive Depth Adaptation in LLM Agents: A Technical Synthesis

Introduction

Recent advances in autonomous LLM agents have enabled proficient multi-turn, long-horizon decision-making in interactive, partially observable environments such as ALFWorld and ScienceWorld. But prevailing LLM agents typically exhibit cognitive rigidity—adhering to fixed, uniform patterns of reasoning at every step. This translates to either inefficient over-computation (uniform deep reasoning for all steps) or myopic response selection (reflexive surface-level decisions in all cases), despite significant heterogeneity in cognitive demands across trajectories. The paper “Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents” (2602.12662) introduces CogRouter, a hierarchical, context-aware cognitive adaptation framework composed of fine-grained cognitive levels and a step-wise assignment mechanism anchored in ACT-R theory. CogRouter deploys a two-stage procedure, comprising Cognition-aware Supervised Fine-tuning (CoSFT) and Cognition-aware Policy Optimization (CoPO), to achieve token-efficient, adaptive cognitive allocation, and demonstrates state-of-the-art task success with significantly reduced resource consumption.

Framework: Hierarchical Cognitive Depth and Dynamic Allocation

CogRouter formalizes agentic task-solving as a POMDP with action selection augmented by a hierarchical, four-level cognitive layer inspired by ACT-R: (1) Instinctive Response, (2) Situational Awareness, (3) Experience Integration, and (4) Strategic Planning. Each step in a trajectory involves: (a) context-sensitive selection of a cognitive level, (b) generation of an internal reasoning trace at the chosen level, and (c) action synthesis informed by the reasoning process. The agent’s output at each step is fully structured, encapsulating the level, the reasoning (“think”), and the final action, enabling fine-grained downstream analysis and optimization.
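The fully structured per-step output can be sketched as a small parser. The concrete tag names (`<level>`, `<think>`, `<action>`) are assumptions for illustration; the paper specifies only that each step encapsulates the level, the reasoning, and the final action.

```python
# Minimal sketch of the structured per-step agent output described above.
# Tag names are hypothetical; only the three-field structure is from the paper.
import re
from dataclasses import dataclass

@dataclass
class AgentStep:
    level: int   # cognitive level 1-4 (Instinctive ... Strategic)
    think: str   # internal reasoning trace at the chosen level
    action: str  # environment action derived from the reasoning

def parse_step(raw: str) -> AgentStep:
    """Parse one structured agent output into its three fields."""
    level = int(re.search(r"<level>(\d)</level>", raw).group(1))
    think = re.search(r"<think>(.*?)</think>", raw, re.S).group(1).strip()
    action = re.search(r"<action>(.*?)</action>", raw, re.S).group(1).strip()
    return AgentStep(level, think, action)

step = parse_step(
    "<level>1</level><think>Mug is in hand; sink is open.</think>"
    "<action>put mug in sink</action>"
)
print(step.level, step.action)  # 1 put mug in sink
```

Keeping the level as an explicit field is what enables the fine-grained downstream analysis and per-level optimization the paper relies on.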

Step-Level Adaptation: Training and Optimization

Cognition-aware Supervised Fine-tuning (CoSFT):

CoSFT instills a stable, level-specific reasoning format via supervised imitation on expert trajectories whose steps are augmented with random but balanced assignments of cognitive levels. This forced balance is essential to prevent bias induced by the teacher model or by data skew (see the ablation study).
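The balanced assignment can be sketched as follows; the exact sampling procedure is not given in the summary, so this is one simple way to guarantee near-equal level counts across the CoSFT data.

```python
# Sketch of balanced random level assignment for CoSFT data construction.
# The paper requires balance across the four levels; this exact scheme
# (tile the level set, then shuffle) is an illustrative assumption.
import random

LEVELS = [1, 2, 3, 4]  # Instinctive, Situational, Experience, Strategic

def assign_balanced_levels(num_steps: int, seed: int = 0) -> list[int]:
    """Return one cognitive level per step, with per-level counts
    differing by at most one across the whole set."""
    rng = random.Random(seed)
    pool = (LEVELS * (num_steps // len(LEVELS) + 1))[:num_steps]
    rng.shuffle(pool)
    return pool

levels = assign_balanced_levels(10)
print(sorted(levels))  # [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
```

A naive i.i.d. draw per step would only be balanced in expectation; tiling the level set makes the balance exact, which matters for small datasets.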

Cognition-aware Policy Optimization (CoPO):

CoPO, a group-based RL extension, addresses a critical failure mode in baseline online RL (GRPO, GiGPO): the collapse to uniform deep reasoning, especially L4 (Strategic Planning), regardless of task phase or complexity. CoPO implements step-level credit assignment via confidence-aware advantage reweighting: at every step, for each cognitive level, the agent reconstructs the action under the alternative reasoning modes and measures action-prediction confidence (using average token log-probabilities). The step-level advantage is then reallocated across levels in proportion to normalized confidence, increasing gradients for levels that yield high-confidence, contextually appropriate actions. This prevents the mode collapse observed under trajectory-level uniform credit assignment.
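The reweighting step can be sketched as below, under assumptions: each level's confidence is the mean token log-probability of the action it produces, and normalization is a softmax over those confidences (the summary says "proportion to normalized confidence" without fixing the normalizer).

```python
# Sketch of confidence-aware advantage reweighting (CoPO), not the authors'
# exact implementation. Assumption: softmax over per-level mean action
# log-probs distributes the step-level advantage across the four levels.
import math

def reweight_advantage(step_advantage: float,
                       avg_logprobs: dict[int, float],
                       temperature: float = 1.0) -> dict[int, float]:
    """Allocate one step's advantage across cognitive levels.

    avg_logprobs maps level -> mean token log-prob of the action generated
    when reasoning at that level (higher means more confident).
    """
    exps = {l: math.exp(lp / temperature) for l, lp in avg_logprobs.items()}
    z = sum(exps.values())
    return {l: step_advantage * e / z for l, e in exps.items()}

# A routine-execution step: the shallow level L1 is most confident,
# so it receives the largest share of the credit.
weights = reweight_advantage(1.0, {1: -0.2, 2: -1.5, 3: -2.0, 4: -3.0})
```

Because the shares always sum to the original advantage, total credit per step is conserved; only its allocation across levels shifts toward confident ones, which is what counteracts the drift toward uniform L4 reasoning.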

Figure 1: Cognitive-level distributions across trajectory progress for CoPO, GiGPO, and GRPO on ScienceWorld. Both GRPO and GiGPO collapse to predominantly L4 reasoning, while CoPO preserves diversity and trajectory-adaptive allocation.

Empirical Results

Superiority in Efficiency and Task Success

CogRouter achieves up to 82.3% average success rate (Qwen2.5-7B), outperforming all tested baseline and frontier models: GPT-4o (+40.3%), OpenAI-o3 (+18.3%), and GRPO (+14.0%). CoPO reduces the average tokens required per task by 62% compared to GRPO and by 57% compared to GiGPO without loss of accuracy, validating that adaptive cognitive allocation is computationally efficient. Notably, performance improvements are robust across base models—similar trends are observed for Llama3.1-8B.

Fine-Grained Cognitive Allocation

Analysis of post-training trajectories reveals that CoPO learns dynamic, stage-dependent cognitive allocation. Early steps (task initiation, high uncertainty) show heavy use of L4 (strategic planning), which decays as the agent acquires situational grounding, giving way to L2 and eventually primarily L1 (routine execution) once the environment structure and goal are clear. Experience integration (L3) appears episodically after deviations from expected outcomes or upon encountering obstacles. In contrast, GRPO and GiGPO degenerate into uniform high-level reasoning (see Figure 1 above), confirming the necessity of step-level adaptation and confidence-modulated advantage allocation.

Complexity-Adaptive Reasoning

Tasks classified by complexity (oracle trajectory length) reveal that CoPO scales cognitive effort with task demands: longer, more complex problems induce greater usage of L3 and L4, while shorter tasks favor L1. GRPO, in contrast, fails to adapt, using near-uniform level distributions regardless of complexity.

Ablation and Analysis

  • Confidence Metric: Average log-probability yields the strongest confidence signal for advantage assignment in CoPO; alternatives (min-prob, max-prob, entropy) degrade both accuracy and adaptive allocation.
  • Cold-Start Balanced CoSFT: Omitting balanced pretraining, or using the expert's level distribution, leads to severe collapse or bias in level utilization, as the RL phase cannot recover from initialization skew.
  • Reward Design: Restricting credit assignment to successful trajectories, combined with enforcement of the structured output format, reinforces proper reasoning patterns and discourages token waste.
  • Fixed-Level Baselines: Models trained and evaluated at a single fixed cognitive level are either underpowered on complex steps or grossly inefficient, validating the necessity of adaptive allocation.
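The candidate confidence signals compared in the ablation can all be computed from the per-token statistics of a generated action; a sketch under the assumption that we have each sampled token's log-probability and, for entropy, the full distribution at each position.

```python
# Sketch of the four candidate confidence metrics from the ablation.
# Exact definitions are assumptions inferred from the metric names.
import math

def confidence_metrics(token_logprobs: list[float],
                       token_dists: list[list[float]]) -> dict[str, float]:
    """token_logprobs: log-prob of each sampled action token.
    token_dists: full next-token probability distribution at each position."""
    probs = [math.exp(lp) for lp in token_logprobs]
    avg_logprob = sum(token_logprobs) / len(token_logprobs)  # paper's choice
    min_prob = min(probs)   # weakest-link token probability
    max_prob = max(probs)   # strongest token probability
    # Mean entropy of the per-position distributions (lower = more confident).
    entropies = [-sum(p * math.log(p) for p in d if p > 0)
                 for d in token_dists]
    mean_entropy = sum(entropies) / len(entropies)
    return {"avg_logprob": avg_logprob, "min_prob": min_prob,
            "max_prob": max_prob, "entropy": mean_entropy}
```

Average log-probability aggregates evidence over the whole action rather than keying on a single extreme token, which plausibly explains its robustness relative to min-prob and max-prob.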

Implications and Future Prospects

The CogRouter framework concretely demonstrates that discrete hierarchical cognitive levels, dynamically allocated at step level via confidence-aware RL, yield superior performance and operational efficiency in agentic LLM tasks. This supports the notion, grounded in cognitive theory, that optimal AI agents require meta-cognitive flexibility, not merely powerful base reasoning. Practically, CogRouter reduces resource utilization dramatically, mitigating bottlenecks in long-horizon agent applications. Theoretically, it suggests that finer granularity (more than four levels, or continuous control of depth), integration of self-evolving cognitive structures, or hybridization with explicit uncertainty quantification may yield additional gains.

Conclusion

The “Think Fast and Slow” framework provides a rigorous, scalable paradigm for cognitive depth adaptation in LLM agents. Step-level dynamic assignment, grounded in ACT-R theory and optimized via confidence-aware RL, solves the cognitive rigidity and mode-collapse issues endemic in baselines, producing agents that are both more effective and significantly more efficient. Broader future directions include adaptation to more open-ended, multi-agent, and real-world settings, and the exploration of continual, self-organizing cognitive hierarchies in artificial reasoning systems.
