
Planner-Actor-Critic Architecture in RL

Updated 15 January 2026
  • Planner-Actor-Critic is a reinforcement learning paradigm that decomposes decision-making into Planner, Actor, and Critic modules for efficient control.
  • It employs an integrated learning loop where the Planner proposes actions, the Actor learns policies, and the Critic estimates values, merging model-based and model-free methods.
  • Recent implementations show marked improvements in sample efficiency and real-time performance across robotics, continuous control, and digital planning tasks.

The Planner-Actor-Critic (PAC) architecture is a reinforcement learning and control paradigm that explicitly instantiates three distinct modules—Planner, Actor, and Critic—within a tightly coupled decision and learning loop. The hallmark of this architecture is the explicit use of integrated planning mechanisms, policy learning, and value estimation to achieve high sample efficiency, adaptivity, and robust decision-making in both continuous and discrete domains. Recent instantiations span high-dimensional continuous control, hierarchical task execution in digital environments, human-in-the-loop settings, search-based robotics, model-based RL, and structural optimization for nonstationary problems.

1. Core Structural Principles and Architectural Loop

The canonical Planner-Actor-Critic framework decomposes sequential decision-making into three specialized components:

  • Planner: Utilizes a predictive environment model, explicit optimization (e.g., path integrals, model predictive control, graph search), or symbolic logic to propose expert actions or subgoals, leveraging trajectory rollouts or search under model uncertainty.
  • Actor: Learns a parametric policy, typically via neural networks, by imitating expert trajectories proposed by the planner or through policy gradients that may integrate both direct reward and additional planning-based guidance.
  • Critic: Estimates value functions (state-value or action-value), providing low-variance cost-to-go or return estimates, bootstrapping planner rollouts, scoring candidate trajectories, or enabling off-policy actor learning.

The standard loop proceeds as follows: the planner and/or actor proposes candidate actions; the critic evaluates these options, potentially using simulated rollouts or memory-augmented cognitive maps; the highest-valued action is executed in the environment; and all components are then updated with the new observations. Architectures such as Critic PI² (Fan et al., 2020), AC4MPC (Reiter et al., 2024), and SPAC (Luo et al., 2021) exemplify this tightly interleaved loop.
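
The loop described above can be sketched generically. In this illustrative Python snippet, `planner`, `actor`, `critic`, and `env_step` are placeholder callables standing in for the respective modules; none of the names come from a specific paper's API.

```python
def pac_step(state, planner, actor, critic, env_step):
    """One iteration of a generic Planner-Actor-Critic loop (sketch).

    `planner` and `actor` return lists of candidate actions for a state;
    `critic` scores a (state, action) pair; `env_step` advances the environment.
    """
    # 1. Planner and actor each propose candidate actions.
    candidates = planner(state) + actor(state)
    # 2. Critic evaluates every candidate; the highest-valued one is selected.
    best = max(candidates, key=lambda a: critic(state, a))
    # 3. Execute in the environment; the transition feeds subsequent updates.
    next_state, reward = env_step(state, best)
    return best, next_state, reward
```

In a full system each callable would be a learned module updated from the returned transition; here they are frozen functions to keep the control flow visible.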

2. Mathematical Formalisms

A broad spectrum of mathematical tools underpins PAC systems, unified through the Markov Decision Process (MDP) formalism. Let $\mathcal{S}$ denote the state space, $\mathcal{A}$ the action space, $f: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ the transition model, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ the reward, and $\gamma \in [0,1)$ the discount.

  • Planner: Solves finite-horizon optimal control or trajectory optimization, e.g.,

$$\min_{a_{0:H-1}} \; \sum_{t=0}^{H-1} \Big( q(s_t) + \tfrac{1}{2}\, a_t^\top R\, a_t \Big) + V(s_H)$$

subject to $s_{t+1} = f(s_t, a_t)$. Critic PI² (Fan et al., 2020) applies path integral weighting to candidate rollouts, using the critic as a tail-value estimator to improve sample efficiency.
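
The path-integral weighting step can be illustrated concretely. The sketch below follows the spirit of Critic PI²: each rollout's total cost (which would already include the critic's tail value $V(s_H)$ as terminal cost) is converted into an exponential weight, and the "expert" action is the weighted average of the rollouts' first actions. Variable names and the exact weighting scheme are our simplification, not the paper's implementation.

```python
import math

def path_integral_action(rollout_costs, rollout_first_actions, temperature=1.0):
    """Softmax-weight K candidate rollouts by negative cost (PI²-style sketch).

    rollout_costs[k]: total cost of rollout k, including the critic tail value.
    rollout_first_actions[k]: first action of rollout k (scalar for simplicity).
    Lower cost => exponentially higher weight.
    """
    min_c = min(rollout_costs)
    # Subtract the minimum cost before exponentiating for numerical stability.
    weights = [math.exp(-(c - min_c) / temperature) for c in rollout_costs]
    total = sum(weights)
    # The expert action is the weight-averaged first action across rollouts.
    return sum((w / total) * a for w, a in zip(weights, rollout_first_actions))
```

At low temperature the weighting approaches a hard argmin over rollouts; at high temperature it approaches a plain average.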

  • Actor: Learns a policy $\pi_\phi(a \mid s)$, generally by supervised regression toward planner outputs (imitation) or via mixed gradients that couple model-free and model-based critics (Ren, 2020).
  • Critic: Approximates value $V_\theta(s)$ or Q-value $Q_\theta(s, a)$, updated via temporal-difference (TD) learning, V-trace, or SARSA; also used directly in search heuristics (Yang et al., 29 Sep 2025) or as terminal costs in MPC (Reiter et al., 2024).
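
For reference, a single one-step TD update of a state-value critic looks as follows. This is generic tabular TD(0); the cited systems use neural function approximators and variants such as V-trace, but the target structure is the same.

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One-step TD(0) update of a tabular state-value critic V (a dict).

    Target: immediate reward plus discounted bootstrapped next-state value.
    Returns the TD error, which actor updates can also reuse as an advantage
    signal.
    """
    target = r if done else r + gamma * V.get(s_next, 0.0)
    delta = target - V.get(s, 0.0)          # TD error
    V[s] = V.get(s, 0.0) + alpha * delta    # move V(s) toward the target
    return delta
```
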

Some systems (e.g., SPAC (Luo et al., 2021)) introduce an intermediate low-dimensional “plan” vector, so that the policy factorizes as $\pi(a \mid s) = \int \pi(a \mid p)\, \kappa(p \mid s)\, dp$, reducing the effective complexity of both the critic and actor.

Combined Learning Objectives:

PAC training typically proceeds by alternately updating (1) the planner (environment model or explicit optimizer), (2) the actor (policy parameters), and (3) the critic (value estimator), using a replay buffer with real and possibly imagined transitions (Dargazany, 2020). Actor and critic gradients may be mixed with model-based and model-free components, as in potential-field-augmented actor-critic (Ren, 2020).
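
A minimal sketch of this alternation, assuming placeholder `model`, `actor`, and `critic` objects that each expose an `update` method (these names are illustrative, not taken from any cited codebase):

```python
import random

def pac_train_step(buffer, model, actor, critic, batch_size=32):
    """One alternating PAC update over a replay buffer (sketch).

    (1) fit the planner's dynamics model, (2) update the critic's value
    estimate, (3) update the actor's policy. A full system would also mix in
    imagined transitions generated by the model.
    """
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    model.update(batch)    # planner side: learn dynamics from real transitions
    critic.update(batch)   # value estimation (TD targets on the batch)
    actor.update(batch)    # policy improvement: imitation or mixed gradients
    return len(batch)
```

The update order is a design choice; some systems update the critic several times per actor update to stabilize the value targets.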

3. Algorithmic Variants and Workflow

Numerous instantiations adapt the PAC approach to specific domains:

  • Trajectory Optimization with Path Integrals: Critic PI² (Fan et al., 2020) executes short-horizon rollouts in a learned dynamics model; path integral weights convert cost trajectories into expert actions. The actor imitates these, while the critic bootstraps value estimates for improved efficiency.
  • Parallel Best-First Search with Learned Guidance: P-ACHS (Yang et al., 29 Sep 2025) replaces standard A* or RRT heuristics with a critic network learned from RL. Actor proposes candidate actions in parallel batches, and critic values act as learned search heuristics for edge expansion and prioritization.
  • Model Predictive Control with Actor-Critic Warm Starts: AC4MPC (Reiter et al., 2024) runs two concurrent MPC solves, using the actor's policy as one warm-start and the shifted previous solution as the other, both evaluated and terminated with the critic; the lowest-cost trajectory is selected.
  • Cognitive-Map-Guided LLM Agents: ATLAS (Cheng et al., 26 Oct 2025) employs a Planner (hierarchical LLM prompt), Actor (diverse LLM-generated action candidates), and Critic (LLM “value-head” using a structured cognitive map). Look-ahead simulation informs dynamic replanning.
  • Stochastic Low-Dimensional Planning for Dense Action Spaces: SPAC (Luo et al., 2021) introduces an explicit stochastic planner operating in a latent space, which the actor then decodes into high-dimensional actions for image registration.
  • Symbolic and Human-in-the-Loop Planning: Human-centered approaches such as PACMAN (Lyu et al., 2019) interleave logical planning (e.g., ASP-sampled plans), actor-critic RL, and human feedback interpreted as advantage estimates.
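
The trajectory-selection step shared by several of these variants (most explicitly AC4MPC's dual warm start) reduces to scoring each candidate by its accumulated stage cost plus the critic's terminal value, then keeping the cheapest. The sketch below uses illustrative placeholder names, not the paper's solver interface.

```python
def select_trajectory(candidates, stage_cost, critic_terminal):
    """Pick the lowest-cost candidate among warm-started solves (sketch).

    Each candidate is a pair (list of (state, action) pairs, terminal_state).
    Total cost = sum of stage costs + critic terminal value at the last state.
    """
    def total_cost(traj):
        states_actions, terminal_state = traj
        running = sum(stage_cost(s, a) for s, a in states_actions)
        return running + critic_terminal(terminal_state)
    return min(candidates, key=total_cost)
```
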

The following table summarizes selected key implementations:

| System/Paper | Planner Mechanism | Actor Mechanism | Critic Mechanism |
| --- | --- | --- | --- |
| Critic PI² (Fan et al., 2020) | Path-integral via learned model | MLP imitating expert actions | MLP state-value; V-trace/TD |
| P-ACHS (Yang et al., 29 Sep 2025) | Best-first search (parallel) | SAC actor: action sampling | Q-value heuristic (SAC critic) |
| AC4MPC (Reiter et al., 2024) | Dual MPC (shifted/actor-init) | Policy rollout for MPC and control | NN approximate value, terminal cost |
| SPAC (Luo et al., 2021) | Stochastic latent plan | Decoder from plan to displacement | Q(s, p), low-dim plan space |
| ATLAS (Cheng et al., 26 Oct 2025) | LLM hierarchical planning | LLM candidate action generator | LLM value-head with cognitive map |

4. Empirical Findings and Sample Efficiency

Across diverse benchmarks, PAC architectures demonstrate marked improvements in sample complexity, robustness, and real-time performance over single-policy or model-free actor-critic frameworks:

  • Real-World Continuous Control: Critic PI² achieves near-optimal return within a few hundred episodes in MuJoCo tasks, orders-of-magnitude fewer than DDPG; per-step planning latency is reduced to near model-free controller speeds (Fan et al., 2020).
  • Search-Based Robotic Planning: P-ACHS solves manipulation planning tasks with up to 5× fewer edge expansions than traditional heuristic search, maintaining optimality and planning within 1 s per instance (Yang et al., 29 Sep 2025).
  • 3D Modeling and Human-in-the-Loop Iteration: PAC workflows in creative domains yield up to 90% task completion rates with higher geometric and aesthetic scores vs. direct single-prompt approaches (Gao et al., 8 Jan 2026).
  • Sample Efficiency and Guidance: Model-based actor-critic with planner rollouts typically cuts environment steps by 2–5× compared to model-free DDPG, and planning critic gradients (e.g., potential fields) further accelerate policy improvement (Ren, 2020, Dargazany, 2020).
  • Web Task Completion and Generalization: ATLAS attains 63% mean success rate on WebArena-Lite without domain-specific fine-tuning, outperforming model-free or non-cognitive agents (Cheng et al., 26 Oct 2025).
  • Human-Centered Learning: Planner-informed PAC agents achieve rapid convergence and low-variance learning in symbolic domains even under inconsistent human feedback, outperforming reward shaping and conventional RL (Lyu et al., 2019).

5. Theoretical and Algorithmic Guarantees

PAC architectures benefit from both classical and novel convergence and suboptimality guarantees:

  • Bounded Suboptimality: In AC4MPC, for horizon $N$ and Bellman error $\delta$, the suboptimality relative to the actor policy decays as $\mathcal{O}(\gamma^N)$ (Reiter et al., 2024).
  • Variance Reduction and Jump-Start: Planner- or potential-field-guided critics reduce gradient variance and bias early exploration toward goal-directed actions, yielding robust performance under sparse or misleading rewards (Ren, 2020, Lyu et al., 2019).
  • Layered and Dual-Augmented Coordination: Trajectory planning and tracking can be coordinated through dual-network critics that converge linearly to the optimal Lagrange multiplier in LQR regimes, with extension to nonlinear systems (Yang et al., 2024).
  • Critic as Search Heuristic: If the critic underestimates (is admissible) the true cost-to-go, best-first search retains optimality guarantees; suboptimality can be traded for computational speed via heuristic weighting (Yang et al., 29 Sep 2025).
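
The critic-as-heuristic guarantee maps directly onto weighted best-first search: with an admissible heuristic and weight 1 the search is ordinary A* and returns optimal-cost paths, while a weight above 1 trades bounded suboptimality for speed. The sketch below is generic weighted A* with a critic-derived heuristic `critic_h`, not the paper's parallel variant.

```python
import heapq

def best_first_search(start, goal, neighbors, critic_h, weight=1.0):
    """Weighted best-first search with a critic value as heuristic (sketch).

    `neighbors(s)` yields (successor, edge_cost) pairs; `critic_h(s)` estimates
    cost-to-go (e.g., a negated learned value). If critic_h never overestimates
    (is admissible) and weight == 1, the returned cost is optimal.
    """
    frontier = [(weight * critic_h(start), 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        f, g, s, path = heapq.heappop(frontier)
        if s == goal:
            return g, path
        for s2, c in neighbors(s):
            g2 = g + c
            if g2 < best_g.get(s2, float("inf")):
                best_g[s2] = g2
                heapq.heappush(
                    frontier, (g2 + weight * critic_h(s2), g2, s2, path + [s2])
                )
    return float("inf"), []
```
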

6. Design Choices, Limitations, and Future Directions

Key practical and theoretical considerations in PAC design include:

  • Dynamics Model Accuracy: Planning quality and sample complexity hinge on the accuracy of the learned or engineered environment model; compounding model error can be mitigated by short-horizon rollouts and critic bootstrapping (Fan et al., 2020).
  • Critic Value Estimation: Critic under/overestimation impacts stability; V-trace and off-policy learning are standard techniques for variance reduction.
  • Human and Symbolic Integration: Human input (advisory, corrective) and symbolic planning can be directly incorporated to bootstrap or refine RL policies and overcome local optima (Lyu et al., 2019, Gao et al., 8 Jan 2026).
  • Scalability: GAN-based planners offer flexible environment modelling but may suffer from instability or mode collapse in high-dimensional, stochastic domains (Dargazany, 2020).
  • Imitation vs. Reinforcement: Imitation of planner-proposed actions (behavior cloning) is widely used for actor pretraining; policy improvement can combine model-based and model-free gradients.
  • Prompt Engineering and Modular LLM Agents: In digital and creative domains, staged PAC agents driven by LLMs and cognitive memories enable modular, interpretable, and extensible behavior, potentially at some cost in end-to-end data efficiency (Cheng et al., 26 Oct 2025, Gao et al., 8 Jan 2026).

Continued developments focus on robust learning under partial observability, fast online planning, multi-agent and hierarchical instantiations, and formal sample complexity analyses bridging planning and deep RL. The flexibility of PAC makes it an effective foundation for integrating advances from both model-based and model-free learning paradigms.
