
Unified Agentic Planning

Updated 22 January 2026
  • Unified Agentic Planning is a paradigm that integrates reasoning, tool interfacing, and action selection to address long-horizon decision-making challenges.
  • It leverages modular components—such as planners, executors, and critics—to abstract domain specifics via semantic tool interfaces for diverse applications.
  • In practice, it combines in-context search and reinforcement learning fine-tuning to optimize planning in complex, multi-agent, and multimodal environments.

Unified Agentic Planning is a paradigm in artificial intelligence wherein a single agent—typically instantiated as an LLM or multimodal foundation model—acts, reasons, and adapts within complex environments to solve planning problems over long horizons. These environments span domains such as tool-augmented web search, multi-agent interaction, image-based reasoning, and mission-critical decision-making subject to hard constraints. The unification arises from abstracting the planning process into common computational and architectural primitives, decoupling domain specifics via semantic tool interfaces, and supporting both in-context (search/planning at inference) and post-training (RL/fine-tuning) optimization. This article systematically reviews the theoretical formalism, representative architectures, optimization methods, empirical benchmarks, and open challenges of unified agentic planning.

1. Formal Foundations and Problem Definition

Unified agentic planning is formally characterized as decision-making under partial observability and long time horizons. A general formalization is given by a POMDP-style tuple $\mathcal{P} = \left\langle \mathcal{X}, \mathcal{O}, \mathcal{A}, \mathcal{Z}, \mathcal{M}, \mathcal{T}, \Omega, \mathcal{R}, \gamma \right\rangle$ where:

  • $\mathcal{X}$: latent environment state space,
  • $\mathcal{O}$: observable variables (inputs, tool-call returns),
  • $\mathcal{A}$: external action space (tool invocation, writing to external memory, scheduling, etc.),
  • $\mathcal{Z}$: internal reasoning space (thought tokens, sub-goal decompositions),
  • $\mathcal{M}$: agent memory or state embedding,
  • $\mathcal{T}$: environment transition function,
  • $\Omega$: observation emission distribution,
  • $\mathcal{R}$: reward function,
  • $\gamma$: discount factor (Wei et al., 18 Jan 2026).

At each timestep $t$, the agent observes $h_t = (o_{\leq t}, z_{<t}, a_{<t})$, maintains memory $m_t \in \mathcal{M}$, and selects an internal reasoning step $z_t$ and an external action $a_t$. The agent's policy factorizes as $\pi_\theta(z_t, a_t \mid h_t) = \pi_{\mathrm{reason}}(z_t \mid h_t) \times \pi_{\mathrm{exec}}(a_t \mid h_t, z_t)$, with the overall objective of maximizing expected return: $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t \geq 0} \gamma^t r_t \right]$
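As an illustrative sketch of this factorization (all names hypothetical; `pi_reason` and `pi_exec` stand in for the two policy heads, and the environment interface is an assumption), a rollout samples a thought $z_t$ and an action $a_t$ at each step and accumulates the discounted return:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return: sum_t gamma^t * r_t over one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def rollout(pi_reason, pi_exec, env, horizon=10):
    """Sample one trajectory under the factored policy
    pi(z_t, a_t | h_t) = pi_reason(z_t | h_t) * pi_exec(a_t | h_t, z_t)."""
    history, rewards = [], []
    obs = env.reset()
    for _ in range(horizon):
        history.append(("obs", obs))
        z = pi_reason(history)        # internal reasoning step z_t
        a = pi_exec(history, z)       # external action a_t, conditioned on z_t
        history.append(("thought", z))
        history.append(("action", a))
        obs, reward, done = env.step(a)
        rewards.append(reward)
        if done:
            break
    return rewards
```

The history passed to both policy heads mirrors $h_t = (o_{\leq t}, z_{<t}, a_{<t})$; a real system would encode it as a token context rather than a Python list.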

In multi-agent planning, this extends to stochastic (potentially zero-sum or general-sum) games described by $\langle N, S, A, T, R \rangle$ with agent-type beliefs, Bayesian updating, and context-specific policies. This supports a spectrum of planning approaches, from exact POMDP solvers to scalable, belief-heuristic policies (Zhu et al., 13 Feb 2025).
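A minimal sketch of the Bayesian update over opponent types mentioned above (the dictionary representation of the belief is an assumption for illustration):

```python
def update_type_belief(belief, likelihoods):
    """One Bayesian update over opponent types:
    b'(theta) is proportional to b(theta) * P(observed action | theta)."""
    posterior = {theta: belief[theta] * likelihoods[theta] for theta in belief}
    normalizer = sum(posterior.values())
    if normalizer == 0:
        return dict(belief)  # observation impossible under all types; keep prior
    return {theta: p / normalizer for theta, p in posterior.items()}
```

For example, a uniform prior over two types combined with action likelihoods of 0.8 and 0.2 yields a posterior of 0.8 and 0.2.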

2. Core Architectural and Algorithmic Primitives

Unified agentic planners employ modular architectures comprising specialized functional modules:

  • Planner: selects sub-goals or tool calls (via a learned or search policy).
  • Executor: invokes external APIs/tools, returns observation.
  • Verifier/Critic: evaluates whether sub-goals are achieved or additional steps are needed (often used to construct stop/continue signals).
  • Generator: synthesizes final outputs from accumulated memory.
  • Memory: maintains structured history, bounding context size and supporting explicit retrieval/rollback.
  • Coordinator: directs multi-agent role and communication when applicable (Li et al., 7 Oct 2025, Wei et al., 18 Jan 2026).
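The interaction of these modules can be sketched as a simple control loop (function signatures are illustrative assumptions, not the APIs of any cited system):

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Structured history with bounded retrieval."""
    events: list = field(default_factory=list)

    def write(self, item):
        self.events.append(item)

    def read(self, last_k=5):
        return self.events[-last_k:]

def agentic_loop(planner, executor, verifier, generator, goal, max_steps=8):
    """Planner proposes a sub-goal/tool call, executor runs it, the
    verifier emits a stop/continue signal, and the generator synthesizes
    the final output from accumulated memory."""
    memory = Memory()
    for _ in range(max_steps):
        step = planner(goal, memory.read())
        observation = executor(step)
        memory.write((step, observation))
        if verifier(goal, memory.read()):
            break
    return generator(goal, memory.events)
```

Bounding `read` to the last few events is one way the memory module keeps the context size in check while still supporting explicit retrieval.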

Optimization flows span:

  • In-context/online search over internal reasoning trees ($\mathcal{Z}$) – e.g., tree-of-thoughts, beam search, MCTS.
  • Post-training RL/fine-tuning over policy parameters $\theta$ (policy gradients, PPO, group-normalized RL, etc.).

AgentFlow exemplifies direct, in-the-flow RL optimization for the planner, stabilizing credit assignment in multi-turn settings by broadcasting trajectory-level rewards to all decisions and using group-normalized advantage (Li et al., 7 Oct 2025).
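A sketch of the group-normalized advantage described here (a simplification for illustration; the exact normalization in the cited work may differ): each trajectory in a group of rollouts for the same task receives a scalar advantage, which is then broadcast to all of its decisions.

```python
def group_normalized_advantages(group_returns):
    """Normalize trajectory-level returns within one task group; the
    resulting scalar advantage is broadcast to every decision in the
    corresponding trajectory."""
    n = len(group_returns)
    mean = sum(group_returns) / n
    std = (sum((r - mean) ** 2 for r in group_returns) / n) ** 0.5
    if std == 0:
        return [0.0] * n  # all rollouts tied; no learning signal
    return [(r - mean) / std for r in group_returns]
```

Because the advantage is computed within a group rather than against a learned value baseline, no separate critic network is required to assign credit.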

3. Unified Planning in Diverse Application Domains

Unified agentic planning has been instantiated in a variety of challenging settings:

  • Tool-Augmented Reasoning and Search: AgentFlow aligns planning with outcome-driven RL, decomposes tool use (search, code, web) into multi-step plans, and outperforms larger monolithic instruct models and frozen LLMs in scientific, mathematical, and information-seeking tasks (Li et al., 7 Oct 2025).
  • Multimodal Reasoning: Skywork-R1V4 achieves unified planning across text and images, interleaving image manipulation (crop, rotate, enhance) with deep search (web, KG), all orchestrated by a single model using stepwise supervised learning and plan-consistency filtering. Emergent plans regularly involve 10–20 tool calls without RL (Zhang et al., 2 Dec 2025).
  • Epidemic Response: EpiPlanAgent demonstrates end-to-end, multi-agent, LLM-grounded planning in public health, using a node-based DAG with model/tool/logic nodes, structured prompting, and feedback-driven refinement. Significant improvements in plan completeness and time-to-plan were measured over expert baselines (Mao et al., 11 Dec 2025).
  • Heterogeneous Space Planning: AstroReason-Bench unifies space mission planning (communication scheduling, imaging, resource allocation) for LLM agents using a common semantic tool interface and programmatic API. LLMs exhibit nontrivial competence under realistic physical constraints, but generally underperform specialist MILP and RL solvers in pure combinatorial optimization (Wang et al., 16 Jan 2026).
  • Opponent-Aware Multi-Agent Planning: The type-based unified planning framework supports single-agent planning under belief over opponent types, with a spectrum of approaches (exact POMDP, belief-MDP, QMDP, MCTS, myopic "safe-agents"), establishing contraction properties and empirical scalability to tens of agents (Zhu et al., 13 Feb 2025).

4. Optimization Techniques and Learning Regimes

Both in-context and post-training learning paradigms are central in unified agentic planning:

  • In-context planning: Realized as explicit search over internal reasoning paths $\mathcal{Z}$ (greedy, beam, MCTS), supports rapid adaptation but incurs runtime cost and possible brittleness in longer horizons (Wei et al., 18 Jan 2026).
  • Reinforcement Learning: Flow-GRPO and group-normalized RL achieve robust credit assignment by associating sparse, trajectory-level rewards with every planning decision; KL regularization to frozen reference models ensures stability (Li et al., 7 Oct 2025).
  • Supervised Fine-Tuning: Skywork-R1V4 demonstrates emergent long-horizon and multimodal planning using only SFT on trajectory-consistent datasets, leveraging plan-consistency filtering and explicit plan tokens to enforce global coherence (Zhang et al., 2 Dec 2025).
  • Reflection, Memory, and Self-Evolution: Techniques span self-generated tasks/feedback, memory control (learned write/read gating), and meta-updates incorporating failed plan critiques (Wei et al., 18 Jan 2026).
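The KL regularization mentioned above can be illustrated with a per-token surrogate that combines an advantage-weighted log-probability with a penalty toward a frozen reference model (a simplified sketch; production RL fine-tuning losses typically use clipped probability ratios and batched KL estimates):

```python
def kl_regularized_objective(logp_new, logp_ref, advantage, beta=0.1):
    """Surrogate objective for one sampled token: advantage-weighted
    log-prob under the current policy, minus beta times a Monte-Carlo
    KL estimate (log pi_new - log pi_ref) to the frozen reference."""
    kl_estimate = logp_new - logp_ref
    return advantage * logp_new - beta * kl_estimate
```

When the policy matches the reference, the penalty vanishes and the objective reduces to the plain policy-gradient term; as the policy drifts, the penalty pulls it back, which is the stabilizing effect described above.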

A key insight is that both in-context search and RL-fine-tuning serve as complementary approximations to the global agentic planning objective, shaping policy either through local search or parameter update.

5. Evaluation, Metrics, and Benchmarking

Unified agentic planning is evaluated via both general and domain-specific metrics. Core metrics include:

  • Success Rate: task/goal achievement proportion
  • Average Return: $\mathbb{E}\sum_t \gamma^t r_t$
  • Planning Depth: number of steps or tool calls per episode
  • Plan Completeness: coverage vs. expert plans ($C(\mathrm{plan}) = |A_p \cap A^*| / |A^*|$)
  • Consistency: statistical alignment between AI- and expert-generated content ($r = .92$ expert–AI section correlation for EpiPlanAgent (Mao et al., 11 Dec 2025))
  • Tool-call Accuracy: correct argument/formats for tool use
  • Collision/Resource Violations: for physical/constraint-based domains (Wang et al., 16 Jan 2026, Zhu et al., 13 Feb 2025)
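The plan-completeness metric has a direct set-based implementation (a sketch that treats plans as sets of atomic actions, which is an assumption about how $A_p$ and $A^*$ are represented):

```python
def plan_completeness(plan_actions, expert_actions):
    """C(plan) = |A_p intersect A*| / |A*|: the fraction of expert-plan
    actions covered by the agent's plan."""
    covered = set(plan_actions) & set(expert_actions)
    expert = set(expert_actions)
    return len(covered) / len(expert) if expert else 1.0
```

Note the metric only rewards coverage of the expert plan; extra actions in the agent's plan neither help nor hurt, so it is usually reported alongside precision-style metrics such as tool-call accuracy.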

Representative benchmarks for unified agentic planning include: AgentBench, AstroReason-Bench (space), SMAC and Pommerman (multi-agent RL), WebArena, PlanBench, ACPBench, MTU-Bench, ToolQA, and domain-specific healthcare/logistics scenarios (Wei et al., 18 Jan 2026, Wang et al., 16 Jan 2026, Mao et al., 11 Dec 2025, Li et al., 7 Oct 2025).

Empirical results highlight the strengths and limitations: modular agentic planners (AgentFlow) and supervised-fine-tuned multimodal models (Skywork-R1V4) surpass static/frozen LLMs of comparable scale; specialist solvers retain an edge in heavily combinatorial or physics-constrained domains, but LLM agents display competitive generalization and robustness on ill-structured, cross-task regimes.

6. Synthesis, Limitations, and Research Directions

Unified agentic planning distills the agentic workflow into a modular pipeline: $\texttt{AgenticPlanner}:\ \pi_{\mathrm{reason}}(z \mid m, o) \rightarrow \texttt{Executor}:\ \pi_{\mathrm{exec}}(a \mid m, o, z) \rightarrow \texttt{Memory}:\ \mathrm{update/read} \rightarrow \texttt{Critic}:\ \hat{v}(h, z) \text{ or reward model} \rightarrow \texttt{Coordinator} \text{ (for multi-agent)}$, with data flow formalized as: $m_t = \mathrm{ReadMemory}(h_t)$, $z_t \sim \pi_{\mathrm{reason}}(z \mid m_t, o_t)$, $a_t \sim \pi_{\mathrm{exec}}(a \mid m_t, o_t, z_t)$, $m_{t+1} = U_m(m_t, o_t, z_t, a_t, o_{t+1}, r_t)$.
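One timestep of this data flow can be transcribed directly (method names like `reason`, `act`, and `update_memory` are placeholders, not an actual framework API):

```python
def unified_step(agent, env, m_t, o_t):
    """One timestep of the unified data flow:
    z_t ~ pi_reason(. | m_t, o_t); a_t ~ pi_exec(. | m_t, o_t, z_t);
    m_{t+1} = U_m(m_t, o_t, z_t, a_t, o_{t+1}, r_t)."""
    z_t = agent.reason(m_t, o_t)      # internal reasoning step
    a_t = agent.act(m_t, o_t, z_t)    # external action
    o_next, r_t = env.step(a_t)       # environment transition and reward
    m_next = agent.update_memory(m_t, o_t, z_t, a_t, o_next, r_t)
    return m_next, o_next, r_t
```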

Notable limitations include:

  • Sub-optimality on hard combinatorial benchmarks relative to domain-specialized planners (Wang et al., 16 Jan 2026)
  • Reliance on predefined toolsets and interface scaffolding for unified APIs
  • Brittleness to long-horizon or out-of-distribution tasks absent strong memory/feedback mechanisms
  • Limitations in resource lifecycle, multi-hop reasoning, and collective strategy when zero-shot (Wang et al., 16 Jan 2026, Wei et al., 18 Jan 2026)

Ongoing and future directions for unified agentic planning include:

  • Hybridization with explicit programmatic search or combinatorial engines
  • Improved feedback-driven reflection, adaptive memory, and dynamic self-reward
  • Extension to scalable multi-agent decentralization and cross-agent communication
  • Broader deployment in high-stakes domains (space, health, supply chain) with formal guarantees and human-AI teaming (Mao et al., 11 Dec 2025, Wei et al., 18 Jan 2026).

Unified agentic planning thus frames a rapidly evolving discipline at the confluence of AI reasoning, decision-making, and interactive autonomy, opening new vistas for generalist agents capable of robust, scalable, and interpretable long-horizon planning.
