Parametrized Sharing in Hybrid Multi-Agent RL
- The article demonstrates how PMHRL improves sample efficiency and scalability through tailored parameter sharing across agents.
- It details network architectures that decompose policies into shared trunks and agent-specific heads to balance coordination and specialization.
- Empirical results reveal significant improvements in energy efficiency and convergence speed compared to non-sharing baselines.
A Parametrized Sharing Scheme for Multi-Agent Hybrid Deep Reinforcement Learning (PMHRL) is a design pattern and architectural framework that enables sample-efficient, scalable, and robust multi-agent reinforcement learning in domains with agents exhibiting heterogeneous or hybrid discrete-continuous action spaces. PMHRL leverages parameter sharing at multiple architectural levels—across all agents, within functional subgroups, or between sub-modules specialized for different decision types—while preserving sufficient representational capacity for agent specialization, behavior diversity, and coordination requirements in hybrid-action scenarios. This article synthesizes the landscape of PMHRL as seen in recent deep RL research, focusing on network decomposition, sharing modalities, optimization strategies, and empirical findings.
1. Foundations of Parametrized Sharing in Multi-Agent Hybrid RL
At its core, PMHRL generalizes classical parameter sharing in multi-agent reinforcement learning—wherein all agents share a single policy or value network—to architectures that balance shared and agent-specific capacity, as dictated by behavioral heterogeneity and action-space structure. A canonical PMHRL system organizes the agent population and their associated networks along three axes:
- Functionality: Handling both discrete (e.g., mode selection, activation/deactivation) and continuous (e.g., beamforming, control) actions via dedicated sub-networks (e.g., DQN for discrete, PPO/DDPG for continuous variables).
- Parametric granularity: Allowing global, group-wise, or role-based sharing of backbone networks (encoders, state abstraction layers) while optionally retaining private "heads" (action selectors, critics) per agent.
- Sharing modality: Bridging sub-modules by concatenating, projecting, or otherwise communicating the outputs or latent representations of one module (e.g., the discrete head's choice) as context for the next (e.g., continuous PPO actor’s policy) (Kuo et al., 2 Jan 2026, Shen et al., 22 Jul 2025, Fu et al., 2019).
PMHRL thereby simultaneously addresses sample efficiency, communication overhead, agent diversity, and action-space expressivity in hybrid or highly parameterized environments.
2. Network Architecture and Parametrized Sharing Mechanisms
The network design for PMHRL decomposes each agent's architecture into:
- Shared Trunk/Encoder: A set of layers shared across agents or agent groups for common state abstraction, such as a variational autoencoder (VAE) for semantic state compression in the CHIMERA framework (Shen et al., 22 Jul 2025).
- Agent/Module-Specific Heads: Action-selection heads specialized for each agent (fully private), subgroup (partially shared), or function (e.g., discrete vs. continuous control).
- Hybrid Branching: Discrete sub-networks (e.g., DQN) select modes or high-level actions, whose outputs are concatenated or re-encoded and fed to continuous sub-networks (e.g., PPO/actor-critic) that produce fine-grained continuous actions. For example, in MF-RIS resource allocation, the PPO actor for an MF-RIS agent receives as input both the current physical state and the DQN-selected activation vector for each element (Kuo et al., 2 Jan 2026).
- Cross-Module Sharing: The decision or representation from one sub-module (e.g., previous discrete action) is injected as input into the other, establishing a parametric dependence between decision layers (see Fig. 1 in (Kuo et al., 2 Jan 2026) and Section 2 of (Shen et al., 22 Jul 2025)).
The scheme enables branching by action-type (discrete/continuous), by agent role (e.g., base station vs. RIS), or both, leading to flexible hybrid policies naturally aligned with the structure of mixed-action environments (Fu et al., 2019, Shen et al., 22 Jul 2025).
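A minimal sketch of this decomposition, assuming a small fully connected trunk and hypothetical dimensions (not the exact CHIMERA or MF-RIS networks): the trunk is shared, a discrete Q-head picks a mode, and the continuous head receives the one-hot discrete choice as cross-module context.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """He-style init for a dense layer (weights, bias)."""
    return rng.normal(0, np.sqrt(2 / n_in), (n_in, n_out)), np.zeros(n_out)

class HybridSharedPolicy:
    """Sketch of a PMHRL-style network: a trunk shared across agents,
    a discrete Q-head, and a continuous head conditioned on the one-hot
    discrete choice (cross-module parametrized sharing)."""

    def __init__(self, obs_dim, n_discrete, cont_dim, hidden=32):
        self.W1, self.b1 = init_layer(obs_dim, hidden)            # shared trunk
        self.Wq, self.bq = init_layer(hidden, n_discrete)         # discrete head
        self.Wc, self.bc = init_layer(hidden + n_discrete, cont_dim)

    def forward(self, obs):
        z = np.maximum(obs @ self.W1 + self.b1, 0.0)              # trunk features
        q = z @ self.Wq + self.bq                                 # Q-values over modes
        d = q.argmax(axis=-1)                                     # greedy discrete action
        onehot = np.eye(q.shape[-1])[d]                           # discrete context
        mu = np.tanh(np.concatenate([z, onehot], -1) @ self.Wc + self.bc)
        return d, mu, q

policy = HybridSharedPolicy(obs_dim=10, n_discrete=4, cont_dim=3)
d, mu, q = policy.forward(rng.normal(size=(5, 10)))
```

In a full implementation the trunk parameters would be optimized from all agents' gradients while the heads are trained per agent or per group.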
3. Mathematical Formulation and Learning Algorithms
The optimization objectives and algorithms in PMHRL reflect the architectural decomposition:
- Hybrid Action Spaces: Each agent $i$ selects a discrete-continuous hybrid action $a_i = (d_i, x_{d_i})$, e.g., a mode selection $d_i$ together with its associated continuous parameter $x_{d_i}$ (Fu et al., 2019).
- Parameter-Sharing Objective: The global objective in cooperative settings is to maximize the (possibly penalized) expected return subject to constraints such as energy efficiency or communication quality (Kuo et al., 2 Jan 2026, Shen et al., 22 Jul 2025):

$$\max_{\theta_s,\ \{\theta_i\}_{i=1}^{N}}\ \mathbb{E}\Big[\sum_{t=0}^{T} \gamma^{t} r_t\Big],$$

where $\theta_s$ denotes shared parameters (e.g., VAE weights) and $\theta_i$ are agent-specific policy/critic/DQN weights.
- Learning Algorithm:
  - For discrete modules (DQN): Each agent solves a Bellman TD objective
  $$L(\theta_Q) = \mathbb{E}\big[(y_t - Q(s_t, d_t; \theta_Q))^2\big],$$
  with target $y_t = r_t + \gamma \max_{d'} Q(s_{t+1}, d'; \theta_Q^{-})$ (Kuo et al., 2 Jan 2026).
  - For continuous modules (PPO/DDPG): Each agent optimizes a clipped surrogate loss
  $$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big],$$
  with $\rho_t(\theta) = \pi_\theta(x_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(x_t \mid s_t)$ being the policy probability ratio and $\hat{A}_t$ the generalized advantage estimator (Kuo et al., 2 Jan 2026).
  - Cross-module parameterized sharing: The key step is to concatenate the current discrete action as context for the continuous actor and vice versa (Shen et al., 22 Jul 2025, Kuo et al., 2 Jan 2026).
- Training Loop: An episode interleaves DQN-based discrete action selection, parameterized sharing to construct continuous module inputs, PPO/DDPG-based continuous action selection, environment stepping, and replay buffer updates. Updates to network parameters occur using respective submodule-specific objectives, with shared buffers for the trunk or VAE backbones (Shen et al., 22 Jul 2025, Kuo et al., 2 Jan 2026).
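The training loop described above can be sketched end-to-end; here the networks are stubbed as fixed random linear maps, the environment is a dummy, and all dimensions are assumptions, so only the interleaving of discrete selection, context construction, continuous selection, and replay storage is illustrated.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(1)

# Hypothetical toy setting: 2 agents, 8-dim observations, 3 discrete
# modes, 2 continuous parameters.
N_AGENTS, OBS, N_MODES, CONT = 2, 8, 3, 2
Wq = rng.normal(size=(OBS, N_MODES))            # stub discrete Q-network
Wpi = rng.normal(size=(OBS + N_MODES, CONT))    # stub continuous actor
replay = deque(maxlen=10_000)                   # shared replay buffer

def env_step(discrete, cont):
    """Dummy environment: random next observations and scalar reward."""
    return rng.normal(size=(N_AGENTS, OBS)), float(rng.random())

obs = rng.normal(size=(N_AGENTS, OBS))
for t in range(50):
    # 1) DQN-based discrete action selection (epsilon-greedy).
    q = obs @ Wq
    d = np.where(rng.random(N_AGENTS) < 0.1,
                 rng.integers(N_MODES, size=N_AGENTS),
                 q.argmax(axis=-1))
    # 2) Parameterized sharing: the discrete choice becomes input
    #    context for the continuous module.
    ctx = np.concatenate([obs, np.eye(N_MODES)[d]], axis=-1)
    # 3) Continuous action selection (tanh-squashed stub actor).
    x = np.tanh(ctx @ Wpi)
    # 4) Environment stepping and replay-buffer update.
    next_obs, r = env_step(d, x)
    replay.append((obs, d, x, r, next_obs))
    obs = next_obs
    # 5) Submodule-specific gradient updates (DQN TD loss, PPO clipped
    #    loss) would be applied here on minibatches from `replay`.
```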
4. Extensions: Partial, Adaptive, and Selective Parameter Sharing
PMHRL subsumes a broad spectrum of sharing schemes, including:
- Full parameter sharing: All agents share one set of policy parameters; successful for homogeneous agents and environments (Kaushik et al., 2018, Terry et al., 2020).
- Selective/Group-wise Sharing: Agents are partitioned into groups based on behavioral, physical, or goal similarity, with one policy per group, as in the Selective Parameter Sharing (SePS) method (Christianos et al., 2021). Partitioning is typically performed by encoding each agent's transition data (agent $i$'s observation, action, reward, next observation) with a VAE to obtain agent embeddings, then grouping the embeddings via k-means clustering (Christianos et al., 2021).
- Partial Parameter Sharing: Shared "backbones" (e.g., encoders or low-level layers) and agent-private "heads" (output/decision layers) to capture agent specialization while leveraging shared information, as in the FP3O pipeline (Feng et al., 2023).
- Adaptive Masking and Hypernetworks: Hybrid partial-sharing schemes such as Kaleidoscope employ learnable masks on a master parameter vector to dynamically control which weights are shared vs. private, with diversity regularization to promote agent–policy diversity (Li et al., 2024). HyperMARL utilizes hypernetworks conditioned on agent embeddings to generate per-agent weights, interpolating smoothly between full-sharing and independence (Tessera et al., 2024).
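The SePS-style grouping step can be sketched with a plain k-means over stand-in agent embeddings; a hypothetical two-cluster latent space replaces the learned VAE encoder here, and the deterministic spread-out initialization is a simplification of a proper k-means++ init.

```python
import numpy as np

rng = np.random.default_rng(2)

def kmeans(X, k, iters=50):
    """Plain k-means (the grouping step of SePS-style selective
    sharing); X holds one embedding row per agent."""
    # Deterministic spread-out init (k-means++ would be used in practice).
    centers = X[:: max(1, len(X) // k)][:k].copy()
    for _ in range(iters):
        # Assign each agent to its nearest center.
        labels = np.linalg.norm(X[:, None] - centers[None], axis=-1).argmin(-1)
        # Recompute centers from current assignments.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Hypothetical stand-in for VAE agent embeddings: two well-separated
# behavioral clusters in a 4-dim latent space.
emb = np.concatenate([rng.normal(0, 0.1, (5, 4)),
                      rng.normal(3, 0.1, (5, 4))])
groups = kmeans(emb, k=2)
# Agents with the same label would then share one policy network.
```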
These extensions allow PMHRL to interpolate continuously between the extremes of full sharing (maximum sample efficiency, minimum diversity) and full independence (maximum specialization, minimum shared knowledge), with partial schemes often yielding better performance and stability in heterogeneous, high-dimensional, or nonstationary domains (Shen et al., 22 Jul 2025, Li et al., 2024, Tessera et al., 2024, Christianos et al., 2021, Feng et al., 2023).
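A Kaleidoscope-style mask over a master parameter vector can also be sketched; in this illustration the masks are fixed rather than learned, the dimensions are hypothetical, and the diversity regularizer is omitted.

```python
import numpy as np

rng = np.random.default_rng(3)

# Partial sharing via masks: every agent owns a mask over a master
# parameter vector; masked-in entries come from the shared master,
# masked-out entries from private per-agent parameters.
DIM, N_AGENTS = 6, 3
master = rng.normal(size=DIM)                 # shared weights
private = rng.normal(size=(N_AGENTS, DIM))    # per-agent weights
mask = rng.random((N_AGENTS, DIM)) < 0.7      # True -> use shared entry

def effective_params(i):
    """Effective weight vector of agent i under its sharing mask."""
    return np.where(mask[i], master, private[i])

params = np.stack([effective_params(i) for i in range(N_AGENTS)])
# A diversity regularizer would penalize masks that make the agents'
# effective parameter vectors too similar.
```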
5. Practical Implementations and Key Empirical Results
Empirical studies consistently demonstrate the advantages of PMHRL over non-hybrid or non-sharing baselines:
- In hybrid wireless systems (e.g., MF-RIS-aided NOMA networks), PMHRL outperforms pure PPO, pure DQN, and ablations without parameterized sharing, achieving both faster convergence (∼30%–50% wall-clock speedup) and higher energy efficiency (Kuo et al., 2 Jan 2026, Shen et al., 22 Jul 2025).
- CHIMERA framework (SAGIN scenario): Parametrized sharing between discrete DQN and continuous DDPG modules, plus shared VAE compression, roughly doubled the energy efficiency relative to ablated variants (Shen et al., 22 Jul 2025).
- Multi-agent driving and traffic behaviors: PS-DDPG demonstrated that a single conditional actor-critic network can simultaneously learn and generalize over multiple behaviors (lane-keeping, overtaking) simply by conditioning the input on behavior ID (Kaushik et al., 2018).
- Benchmarks in high-dimensional tasks: Reward-scaled periodic sharing (RS-PPS) and partial personalized sharing (PP-PPS) delivered 10%–50% improvements in convergence and were able to solve tasks where fully independent or global sharing failed (Zhang et al., 2024).
- Role of cross-module context: Explicitly sharing discrete module outputs to condition continuous action selection (and vice versa) is critical. Without such information flow, hybrid architectures fail to coordinate decisions, resulting in suboptimal or unstable learning (Kuo et al., 2 Jan 2026, Shen et al., 22 Jul 2025).
A summary table of empirical improvements comparing PMHRL with baseline alternatives is given below.
| Domain / Scenario | Sharing Architecture | Performance Improvement |
|---|---|---|
| MF-RIS-aided NOMA (Kuo et al., 2 Jan 2026) | DQN+PPO, param. sharing | +30% EE vs. independent/homogeneous baselines |
| SAGIN MF-RIS (CHIMERA) (Shen et al., 22 Jul 2025) | PPO/DDPG+DQN, VAE trunk, param. sharing | +20–30% EE vs. PPO/DQN, +2× EE vs. no sharing |
| Multi-agent driving (Kaushik et al., 2018) | Shared actor-critic, conditioned on ID | Faster convergence, generalized multi-behavior |
| SMAC, LBF, RWARE (Zhang et al., 2024, Christianos et al., 2021) | Periodic/Selective/Partial sharing | 10–50% faster convergence/higher return; success on previously unsolved tasks |
*EE: Energy Efficiency. All results reported as stated in the respective sources.
6. Theoretical Guarantees and Convergence Properties
Several variants of PMHRL and its subcomponents are supported by theoretical analyses:
- Universal representational capacity: With agent indication (inputting agent ID or behavior ID), parameter-shared policies are provably able to represent the set of all optimal individual policies, even with heterogeneous agents and observation/action spaces, provided learned networks are sufficiently expressive and equipped with padding and masking (Terry et al., 2020).
- Monotonic improvement: The FP3O algorithm, which is compatible with all sharing granularities, admits a shared lower bound guaranteeing monotonic improvement on the joint return, regardless of the degree of parameter sharing (full, group-wise, or independent) (Feng et al., 2023).
- Hybrid action space expressivity: With hierarchical or groupwise parameter sharing, PMHRL achieves both centralization (during learning) and full decentralization (during execution), ensuring scalability and feasible agent specialization in practical settings (Fu et al., 2019, Feng et al., 2023).
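The agent-indication result (a single shared parameter set recovering distinct per-agent behaviors once the agent ID is part of the input) can be illustrated with a toy linear policy; the dimensions and random weights are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)

# One shared weight matrix serves all agents; the observation is
# concatenated with a one-hot agent ID, so the same parameters can
# map the same observation to different actions per agent.
N_AGENTS, OBS, ACT = 3, 4, 2
W = rng.normal(size=(OBS + N_AGENTS, ACT))

def act(obs, agent_id):
    ids = np.eye(N_AGENTS)[agent_id]          # agent indication
    return np.concatenate([obs, ids]) @ W

obs = np.ones(OBS)
outs = [act(obs, i) for i in range(N_AGENTS)]
# Same observation, same weights, different IDs -> different actions.
```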
7. Practical Recommendations and Future Directions
PMHRL provides a versatile design space for researchers and practitioners. Recommendations include:
- Hybridization of action types: Model discrete and continuous variables in separate, coordinated network modules, with explicit parameter sharing to enable context-dependent action selection (Kuo et al., 2 Jan 2026, Fu et al., 2019).
- Inclusion of agent/role indicators: Append behavior, agent, or role IDs to shared network inputs to recover full flexibility and representational power in heterogeneous environments (Kaushik et al., 2018, Terry et al., 2020).
- Partial parameter sharing as default: Where the agent population is more diverse, use clustering or mask-based schemes (e.g., Kaleidoscope, SePS) to learn an appropriate sharing structure (Li et al., 2024, Christianos et al., 2021).
- Cross-module context passing: For hybrid action decomposition, share the output of one network (DQN/PPO) as input context for the other to enable coordinated hybrid-action policies (Shen et al., 22 Jul 2025).
- Application-specific tuning: Adjust the proportion of shared vs. agent-specific layers, diversity regularization strengths, and optimizer settings to match the heterogeneity, sample complexity, and stability requirements of the target environment (Zhang et al., 2024, Feng et al., 2023, Tessera et al., 2024).
A plausible implication is that future PMHRL advances will integrate masking, agent embeddings, and explicit communication- and action-parameter compression into unified, context-adaptive schemes, enhancing coordination and learning efficiency in complex multi-agent, hybrid-action domains.