CTDE with MAPPO in Multi-Agent RL
- CTDE with MAPPO is a framework in cooperative multi-agent RL that uses centralized training with global information and decentralized execution with local policies.
- It employs decentralized actors paired with a centralized critic to stabilize learning through the clipped surrogate PPO objective and effective advantage estimation.
- Empirical results confirm that CTDE with MAPPO scales across diverse domains—from traffic control to UAV coordination—by mitigating nonstationarity and enhancing performance.
Centralized Training with Decentralized Execution (CTDE) combined with Multi-Agent Proximal Policy Optimization (MAPPO) constitutes a canonical architecture in cooperative multi-agent reinforcement learning (MARL). CTDE enables the use of global or joint information (e.g., joint observations, actions, or true state) during the training process, while enforcing strictly decentralized control at execution: each agent only accesses its own local observations and policy. MAPPO operationalizes CTDE by employing decentralized actors—learned individual policies per agent—and a centralized, typically shared, critic used exclusively during training to estimate value functions and compute advantage signals. This approach stabilizes learning in highly nonstationary MARL environments, supports scalable execution, and achieves strong performance on cooperative tasks ranging from traffic control and server scheduling to public goods games and multi-UAV coordination (Amato, 2024, Chamoun et al., 23 Sep 2025, Yang et al., 19 Dec 2025, Bista et al., 3 Dec 2025).
1. Fundamental Principles of CTDE and MAPPO
CTDE leverages full joint information in the training phase—such as access to the true global state or the set of all agents’ observations—while ensuring that each agent’s policy conditions only on its own local observation at test time. This paradigm is especially relevant to fully cooperative settings where agents share a joint team reward and face severe nonstationarity arising from updating individual learners in parallel.
MAPPO implements CTDE as follows (Amato, 2024):
- Decentralized Actors: Each agent $i$ maintains a policy $\pi_{\theta_i}(a_i \mid o_i)$, parameterized (often with parameter sharing) to select local actions from local observations.
- Centralized Shared Critic: A value network $V_\phi(s)$ or $V_\phi(o_1, \dots, o_n)$, parameterized by $\phi$, is trained to estimate the state- (or joint-observation-) value using complete environment knowledge during centralized training.
- Training-Only Centralization: During offline or on-policy training, the critic accesses full joint data; at execution/online deployment, only the decentralized actors remain operative.
The CTDE setup using MAPPO directly mitigates the nonstationarity inherent to concurrent MARL policy updates, stabilizes advantage estimation, and avoids the computational intractability of globally coordinated execution.
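The training-versus-execution information split above can be sketched in a few lines of numpy. This is a minimal illustration with hypothetical shapes and linear "networks", not an implementation from the cited papers: each actor's input is only its own observation, while the (training-only) critic consumes the concatenated joint observation.

```python
import numpy as np

rng = np.random.default_rng(1)
N_AGENTS, OBS_DIM = 2, 3  # illustrative sizes

# Decentralized actors: one (linear) parameter set per agent,
# conditioned on the local observation only.
actor_W = [rng.normal(size=(OBS_DIM,)) for _ in range(N_AGENTS)]

# Centralized critic: conditioned on the concatenation of all
# agents' observations; used only during training.
critic_W = rng.normal(size=(N_AGENTS * OBS_DIM,))

def act(i, local_obs):
    # Execution-time path: agent i sees only its own observation.
    return float(actor_W[i] @ local_obs)

def value(joint_obs):
    # Training-only path: critic conditions on the joint observation.
    return float(critic_W @ np.concatenate(joint_obs))

joint = [np.ones(OBS_DIM), np.zeros(OBS_DIM)]
scores = [act(i, joint[i]) for i in range(N_AGENTS)]
v = value(joint)  # advantage targets would be derived from this
```

At deployment, only `actor_W` would be retained; `critic_W` exists solely to produce training signals.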
2. MAPPO Optimization Objectives and Learning Pipeline
At the core of MAPPO is the clipped surrogate PPO objective, adapted to the multi-agent case with a centralized critic. A trajectory of length $T$ is generated by assembling local agent actions:
- Probability Ratio:
$$\rho_t(\theta) = \frac{\pi_\theta(\mathbf{a}_t \mid \mathbf{o}_t)}{\pi_{\theta_{\text{old}}}(\mathbf{a}_t \mid \mathbf{o}_t)}$$
where $\mathbf{a}_t$ and $\mathbf{o}_t$ are the joint action and joint observation at time $t$, and $\pi_\theta$ factorizes over agents: $\pi_\theta(\mathbf{a}_t \mid \mathbf{o}_t) = \prod_{i=1}^{n} \pi_{\theta_i}(a_t^i \mid o_t^i)$.
- Clipped PPO Surrogate:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right]$$
$\epsilon$ is the PPO clipping hyperparameter (typically $0.1$–$0.2$).
- Centralized Critic Loss:
$$L(\phi) = \mathbb{E}_t\left[\big(V_\phi(s_t) - \hat{R}_t\big)^2\right]$$
where $\hat{R}_t$ is the empirical return (e.g., discounted reward sum or TD target).
- Advantage Estimation (GAE):
$$\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$
where $r_t$ is the team reward, $\gamma$ the discount factor, and $\lambda$ the GAE parameter.
Training alternates between collecting data via decentralized rollouts, computing advantages with the centralized critic, and updating both actor ($\theta$) and critic ($\phi$) parameters, as described in the algorithmic blueprint (Amato, 2024, Chamoun et al., 23 Sep 2025).
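The advantage computation and clipped surrogate above can be expressed compactly in numpy. This is a generic sketch of the standard GAE recursion and PPO clipping (function names are ours, not from the cited papers); `values` carries one extra bootstrap entry beyond the trajectory length.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a finite trajectory.

    rewards: shape (T,); values: shape (T+1,), with a bootstrap
    value appended for the final state.
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def clipped_surrogate(ratio, adv, eps=0.2):
    """Mean clipped PPO objective (to be maximized)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    # min() gives a pessimistic bound, disabling the incentive to
    # move the ratio far outside [1 - eps, 1 + eps].
    return np.minimum(unclipped, clipped).mean()
```

With `gamma = lam = 1`, two unit rewards, and zero values, the advantages are simply the rewards-to-go `[2, 1]`; a ratio of `1.5` with positive advantage is clipped down to `1.2`.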
3. Architectural Instantiations and Modifications
The baseline MAPPO architecture can be customized to fit various domains—traffic networks, server clusters, large spatial games—while preserving the CTDE principle. Selected instantiations include:
- Parameter Sharing: Actors are often parameter-shared, leveraging agent-index or context embeddings to break symmetry (Amato, 2024).
- Critic Conditioning: Critics may take the true state, joint observations, or concatenated local info, depending on environment access (Bista et al., 3 Dec 2025).
- Graph-Based Critics and Personalization: For tasks with non-IID agent contexts (e.g., intersections with diverse flows), MAPPO can be extended to multi-head critics and "hyper-action" blending over multiple value heads, with the actor producing a distribution over value heads, achieving efficient policy personalization (Zhou et al., 10 Mar 2025).
- Communication Mechanisms: CTDE+MAPPO can be combined with attention-based inter-agent communication modules in the actor, as in MCGOPPO, to further mitigate nonstationarity and accelerate convergence in settings where observation sharing is possible during training (Da, 2023).
| Architectural Variant | Actor Input | Critic Input |
|---|---|---|
| Standard MAPPO | $o_i$ | $s$ or $(o_1, \dots, o_n)$ |
| Multi-Head MAPPO | $o_i$, history, agent idx | graph($s$) |
| Communication-Augmented | $o_i$, received messages | $s$, messages |
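The parameter-sharing pattern mentioned above (shared actor weights with an agent-index embedding to break symmetry) can be sketched minimally in numpy. Shapes and names here are hypothetical illustrations, not taken from any cited system.

```python
import numpy as np

rng = np.random.default_rng(0)
N_AGENTS, OBS_DIM, N_ACTIONS = 3, 4, 2  # illustrative sizes

# One shared (linear) parameter set serves all actors; a one-hot
# agent index appended to the local observation breaks symmetry,
# letting identical observations map to agent-specific behavior.
W_shared = rng.normal(size=(OBS_DIM + N_AGENTS, N_ACTIONS))

def actor_logits(agent_idx, local_obs):
    one_hot = np.eye(N_AGENTS)[agent_idx]
    return np.concatenate([local_obs, one_hot]) @ W_shared

obs = np.ones(OBS_DIM)  # same observation for every agent
logits = [actor_logits(i, obs) for i in range(N_AGENTS)]
```

Even with identical inputs, the agent-index embedding yields distinct logits per agent, while the parameter count stays constant in the number of agents (up to the embedding width).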
4. CTDE in Practical Domains and Empirical Insights
MAPPO under CTDE has shown broad applicability and empirical advantages across diverse MARL domains:
- Edge Server Monitoring: Decentralized dispatchers learn joint server query and job scheduling policies, with partial observability handled via AoI-augmented observation. Centralized critics lead to robust throughput-cost tradeoffs and good scalability as agent count increases (Chamoun et al., 23 Sep 2025).
- Spatial Public Goods Games: MAPPO-LCR extends MAPPO to massive agent populations (e.g., 40,000 agents), leveraging a centralized critic over the global state. Local cooperation rewards drive stable emergence of cooperation and reduce variance versus independent PPO (Yang et al., 19 Dec 2025).
- UAV-Assisted 5G Slicing: MAPPO with a centralized critic achieves best QoS-energy tradeoff under severe interference, outperforming MADDPG and DQN baselines in both urban and rural topologies for latency, SINR, and throughput (Bista et al., 3 Dec 2025).
- Traffic Signal Control: Multi-head CTDE MAPPO constructs enable personalized policy representations for intersections with distinct statistics, outperforming naïve parameter scaling (Zhou et al., 10 Mar 2025).
- 360° Video Streaming: MAPPO-CTDE integrates with spatial-temporal transformers to maximize QoE under partial observability, yielding significant improvements over diverse DRL and MPC baselines (Wang et al., 2024).
- Multi-UAV LoRa Networks: GLo-MAPPO jointly optimizes UAV trajectories, resource allocation, and network associations for energy efficiency, achieving strong performance and stability via CTDE (Ahmed et al., 22 Sep 2025).
- Swarm Pursuit Avoidance: Imitation learning, policy distillation, and alternative training atop CTDE-MAPPO yield scalable and communication-efficient decentralized controllers (Li et al., 2023).
5. Decentralized Execution and Deployment Characteristics
After centralized training, all agents discard the centralized critic and execute solely using their local actor $\pi_{\theta_i}$. Each agent acts independently, basing its actions strictly on its own observation stream. No additional coordination, synchronization, or reliance on globally available information is allowed at test/deployment time (Amato, 2024, Chamoun et al., 23 Sep 2025). CTDE makes MAPPO uniquely attractive for domains with real-time, communication-constrained, or privacy-preserving requirements.
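A deployment-time step under this regime can be sketched as follows; the point of the sketch (with hypothetical shapes and a linear greedy policy) is that only per-agent actor parameters survive training, and each action depends on one agent's observation alone.

```python
import numpy as np

rng = np.random.default_rng(2)
N_AGENTS, OBS_DIM, N_ACTIONS = 2, 3, 2  # illustrative sizes

# Only the trained actor parameters are shipped to each agent;
# the centralized critic does not exist at deployment.
deployed_actors = [rng.normal(size=(OBS_DIM, N_ACTIONS))
                   for _ in range(N_AGENTS)]

def execute_step(local_obs_per_agent):
    """Each agent picks a greedy action from its own observation only;
    no shared state, messages, or critic are consulted."""
    return [int(np.argmax(local_obs_per_agent[i] @ deployed_actors[i]))
            for i in range(N_AGENTS)]

actions = execute_step([np.ones(OBS_DIM) for _ in range(N_AGENTS)])
```

Because each call touches only agent-local data, this loop parallelizes trivially across physically separated agents.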
6. Stability, Scalability, and Convergence Properties
The use of a global value estimator (centralized critic) and surrogate advantage estimation in CTDE suppresses the nonstationarity introduced by concurrent policy updates, facilitating stable learning and improved sample efficiency relative to purely decentralized or independent PPO approaches (Amato, 2024, Yang et al., 19 Dec 2025). Empirical results consistently show that MAPPO-CTDE stabilizes convergence, especially in environments with large population-level payoff coupling or dynamic state-action interactions.
Experiments in public goods games reveal that centralized critics enable sharp phase transitions and deterministic convergence, outperforming decentralized baselines both in mean behavior and in run-to-run variability (Yang et al., 19 Dec 2025). Scalability to large agent populations is preserved: since each agent's execution policy is compact and independent post-training, total execution compute and memory grow only linearly with agent count (Chamoun et al., 23 Sep 2025).
7. Misconceptions, Limitations, and Open Questions
A common misconception is that CTDE-based MAPPO requires centralized information at deployment; in fact, full joint information is used strictly during training, and execution is fully decentralized. MAPPO can appear similar to parameter-sharing PPO, but crucially differs in its use of a joint/global critic during learning (Amato, 2024). Open challenges include scaling critic architectures to larger and more heterogeneous agent sets, handling dynamically changing agent populations, accommodating more complex communication constraints, and extending CTDE-MAPPO to domains with sparse rewards or delayed credit assignment.
Key References:
- "An Initial Introduction to Cooperative Multi-Agent Reinforcement Learning" (Amato, 2024)
- "MAPPO for Edge Server Monitoring" (Chamoun et al., 23 Sep 2025)
- "MAPPO-LCR: Multi-Agent Policy Optimization with Local Cooperation Reward in Spatial Public Goods Games" (Yang et al., 19 Dec 2025)
- "Multi-Agent Deep Reinforcement Learning for UAV-Assisted 5G Network Slicing: A Comparative Study of MAPPO, MADDPG, and MADQN" (Bista et al., 3 Dec 2025)
- "Using a single actor to output personalized policy for different intersections" (Zhou et al., 10 Mar 2025)
- "Research on Multi-Agent Communication and Collaborative Decision-Making Based on Deep Reinforcement Learning" (Da, 2023)
- "MADRL-Based Rate Adaptation for 360° Video Streaming with Multi-Viewpoint Prediction" (Wang et al., 2024)
- "GLo-MAPPO: A Multi-Agent Proximal Policy Optimization for Energy Efficiency in UAV-Assisted LoRa Networks" (Ahmed et al., 22 Sep 2025)
- "Imitation Learning based Alternative Multi-Agent Proximal Policy Optimization for Well-Formed Swarm-Oriented Pursuit Avoidance" (Li et al., 2023)